## Natural Language Processing

Natural language processing (NLP) covers techniques for analyzing bodies of text, often in order to predict a given feature of that text. For instance, if I had 10 different books and wanted to determine which of them were science fiction, I would run the contents of the books through an NLP pipeline to create a dataframe of numerical values associated with each word in each book. These numerical values are then used as features to predict the target.

The metric of interest in this context is TF-IDF: the term frequency multiplied by the inverse document frequency. In other words, TF-IDF is the frequency of a term in one document multiplied by a measure of how rare that term is across all documents.
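
To make the definition concrete, here is a minimal sketch that computes the score by hand for a made-up, pre-tokenized corpus. It assumes the simplest textbook variant: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). The `docs` list and the `tf_idf` helper are illustrative names, not part of any library, and libraries such as scikit-learn use a smoothed, normalized variant, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["the", "ship", "left", "the", "planet"],
    ["the", "detective", "left", "the", "office"],
    ["the", "ship", "reached", "the", "station"],
]


def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "the" scores 0 because it appears in every document, while "planet", which appears in only one document, gets the highest score in the first document.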

Generally speaking, we don't want to run this analysis on every word. Some words, such as "the" or "it", appear frequently in all documents and won't provide us with any predictive information. We get rid of these words using a list of stop words. We also don't want to run the analysis on every form of a word, but instead on its root; words such as "running" can be reduced to "run", a step known as stemming.

Let's take a look at this process in code.

In [11]:
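import re
import string
import urllib.request  # "import urllib" alone does not guarantee urllib.request is loaded

import nltk
import pandas as pd
import sklearn.feature_extraction.text  # makes sklearn.feature_extraction.text available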

url_a = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/A.txt"
url_b = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/D.txt"
article_a = urllib.request.urlopen(url_a).read()
article_a_st = article_a.decode("utf-8")

#### search for patterns in the data to pull out the words

In [2]:
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
arta_tokens_raw = nltk.regexp_tokenize(article_a_st, pattern)
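
To see what this pattern keeps and drops, here is a small standalone sketch using only the standard library's `re.findall` on a made-up sentence. As far as I can tell `nltk.regexp_tokenize` applies the pattern in the same way for this case, but treat that as an assumption; the `sample` string is purely illustrative.

```python
import re

# Same pattern as above: a run of letters, optionally followed by
# an apostrophe and more lowercase letters (to keep contractions whole).
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"

# Hypothetical sample text, not taken from the article.
sample = "Don't stop: it's 2 o'clock!"

tokens = re.findall(pattern, sample)
print(tokens)
```

The contractions "Don't", "it's" and "o'clock" survive as single tokens, while the digit and the punctuation are dropped.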

#### move all letters to lower case


In [3]:
arta_tokens = [i.lower() for i in arta_tokens_raw]

#### import stopwords

In [4]:
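# download the stop word list if it is not already available locally
nltk.download("stopwords", quiet=True)

# assign stop words to a variable (a set gives fast membership tests)
stop_words = set(nltk.corpus.stopwords.words("english"))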

#### filter out words in the stopwords list

In [5]:
arta_tokens_stopped = [w for w in arta_tokens if w not in stop_words]

#### initiate a stemmer and stem the words

In [6]:
# the stemmer strips suffixes to approximate the root of each word, e.g. "running" -> "run"

stemmer = nltk.SnowballStemmer("english")
arta_stemmed = [stemmer.stem(word) for word in arta_tokens_stopped]

#### join all the words in the list into a single string 

In [10]:
cleaned_a = ' '.join(arta_stemmed)

#### perform tfidf analysis

In [12]:
tfidf = sklearn.feature_extraction.text.TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a])

#### pass into DataFrame for inspection

In [13]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
nlpskl = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())
nlpskl

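| | abstain | achiev | action | adopt | affair | amazon | back | base | bring | busi | ... | two | union | us | use | vocal | vote | welcom | without | word | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.052129 | 0.052129 | 0.052129 | 0.052129 | 0.052129 | 0.052129 | 0.104257 | 0.104257 | 0.052129 | 0.052129 | ... | 0.104257 | 0.052129 | 0.156386 | 0.052129 | 0.052129 | 0.052129 | 0.052129 | 0.052129 | 0.052129 | 0.156386 |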
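
Notice that every value in this output is roughly 1, 2 or 3 times 0.052129. With a single document, the inverse document frequency is the same for every term, so, assuming the default `TfidfVectorizer` settings (which scale each row to unit length), each value is just the term's count divided by the length of the count vector. The sketch below reproduces that scaling on made-up counts; the `counts` dictionary and the `l2_normalize` helper are hypothetical and are not taken from the article.

```python
import math


def l2_normalize(counts):
    """Scale raw term counts so the resulting vector has length 1."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}


# Hypothetical counts for a single document.
counts = {"vote": 1, "back": 2, "would": 3}

weights = l2_normalize(counts)
print(weights)
```

A term that occurs three times ends up with exactly three times the weight of a term that occurs once, which is the same pattern seen in the table.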


## Next Steps

At this point, we have successfully transformed the text of the document into numerical values with which we can begin modeling. Because this example uses only one document, and because we don't have a target to predict, we can't perform any actual modeling. Don't worry though, the modeling process is relatively simple compared to the data wrangling performed in this walkthrough.