# Natural Language Processing

## Steps
* [tokenization](#Tokenization)
* [vectorization](#Vectorization)
* [TF-IDF](#TF-IDF)

# Tokenization
## start small

In [1]:
import nltk

token_test = "Here is a sentence. Or two, I don't think there will be more."
token_test_2 = "i thought this sentence was good."
token_test_3 = "Here's a sentence... maybe two. Depending on how you like to count!"

In [2]:
# let's tokenize a document... into sentences
def make_sentences(doc):
    # split the document into a list of sentence strings
    return nltk.sent_tokenize(doc)

make_sentences(token_test_3)

["Here's a sentence... maybe two.", 'Depending on how you like to count!']

In [3]:
# let's tokenize a document into words
# with these 3 test cases, what would you look out for?
def tokenize_it(doc):
    # split the document into a list of word and punctuation tokens
    return nltk.word_tokenize(doc)

tokenize_it(token_test)

['Here',
 'is',
 'a',
 'sentence',
 '.',
 'Or',
 'two',
 ',',
 'I',
 'do',
 "n't",
 'think',
 'there',
 'will',
 'be',
 'more',
 '.']

# Before running the next cell, let's look at the nltk tokenizers in action
* https://text-processing.com/demo/tokenize/

In [4]:
# using the natural language toolkit library for tokenizing sentences
from nltk import sent_tokenize

In [5]:
sent_tokenize(token_test)

['Here is a sentence.', "Or two, I don't think there will be more."]

In [6]:
# let's tokenize a document into words now
from nltk import word_tokenize

In [7]:
# how would I find out which tokenizer nltk is using?
print(word_tokenize(token_test))
print(word_tokenize(token_test_2))
print(word_tokenize(token_test_3))

['Here', 'is', 'a', 'sentence', '.', 'Or', 'two', ',', 'I', 'do', "n't", 'think', 'there', 'will', 'be', 'more', '.']
['i', 'thought', 'this', 'sentence', 'was', 'good', '.']
['Here', "'s", 'a', 'sentence', '...', 'maybe', 'two', '.', 'Depending', 'on', 'how', 'you', 'like', 'to', 'count', '!']
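
One way to answer the question in the cell above is to read the source of `word_tokenize` itself. The sketch below uses the standard-library `inspect` module and assumes NLTK is installed as ordinary Python source files, so the function body can be printed and the tokenizer classes it calls can be read off directly.

```python
import inspect

import nltk

# Print the Python source of word_tokenize to see which tokenizers it calls.
source = inspect.getsource(nltk.word_tokenize)
print(source)
```

The same trick works for `nltk.sent_tokenize`, or for any function written in pure Python.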


## Intuitively, how would we compare these 'documents'?
By counting how many times each word appears in each document!
<br>
This is known as a **bag of words**.
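
As a minimal sketch of the idea, a bag of words can be built with nothing more than `collections.Counter`. The two documents below are made up for illustration and are split on whitespace rather than tokenized with NLTK.

```python
from collections import Counter

# Hand-written, lowercased documents (illustrative only).
doc_a = "the data scientist trains the model".split()
doc_b = "the model surprised the data scientist".split()

# Each bag maps a word to the number of times it occurs in the document.
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

# Words the two bags share, keeping the smaller of the two counts for each.
shared = bag_a & bag_b

print(bag_a)
print(bag_b)
print(dict(shared))
```

Word order is thrown away: only the counts survive, which is exactly what makes the two documents easy to compare.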


## Problems with comparing two documents?
Yes, of course!

In [None]:
# write some potential problems here

### Stop words

In [11]:
from nltk.corpus import stopwords
print(stopwords.words('english')[:50])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']


In [12]:
# the right stop word list depends on the corpus/project; here we start from NLTK's English list
my_stopwords = set(stopwords.words('english'))

In [13]:
# take the stop words out of 'token_test'
[x for x in word_tokenize(token_test) if x not in my_stopwords]

['Here', 'sentence', '.', 'Or', 'two', ',', 'I', "n't", 'think', '.']
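
The output above still contains 'Here', 'Or' and 'I' because the stop word list is all lowercase and the membership test is case-sensitive; punctuation tokens pass through as well. A minimal sketch of one way to handle both is shown below. The token list is copied from the earlier output, and `small_stopwords` is a hypothetical hand-picked set standing in for the full NLTK list so the example runs without any downloads.

```python
import string

# Tokens as produced by word_tokenize(token_test) above.
tokens = ['Here', 'is', 'a', 'sentence', '.', 'Or', 'two', ',', 'I', 'do',
          "n't", 'think', 'there', 'will', 'be', 'more', '.']

# Hypothetical, hand-picked stop words standing in for the full NLTK list.
small_stopwords = {'here', 'is', 'a', 'or', 'i', 'do', 'there', 'will', 'be', 'more'}


def is_punctuation(token):
    # True when every character of the token is a punctuation mark.
    return all(ch in string.punctuation for ch in token)


# Lowercase each token before the stop word check and drop pure punctuation.
cleaned = [
    tok.lower()
    for tok in tokens
    if tok.lower() not in small_stopwords and not is_punctuation(tok)
]
print(cleaned)
```

Whether to lowercase at all is a project decision: it merges 'Here' with 'here', but it also merges proper nouns with common words.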

### Now that stop words are out of the way, check out stems and lemmas in action
* https://text-processing.com/demo/stem/

In [14]:
from nltk.stem import LancasterStemmer, SnowballStemmer, RegexpStemmer, WordNetLemmatizer 

In [15]:
stem_sentence = """when data scientists are performing natural language 
 processing analysis, they must take\
 different verb tenses and singular versus plural words into account."""

In [16]:
snowball = SnowballStemmer('english')

In [18]:
# function to get stems
def stem_words(document,stemmer):
    # split the document into word tokens
    toks = word_tokenize(document)
    wrd_list = []
    # stem each token with the stemmer that was passed in
    for word in toks:
        wrd_list.append(stemmer.stem(word))
    # join the stems back into a single string
    return " ".join(wrd_list)

In [19]:
stem_words(stem_sentence,snowball)

'when data scientist are perform natur languag process analysi , they must take differ verb tens and singular versus plural word into account .'

In [20]:
lancaster = LancasterStemmer()

In [21]:
stem_words(stem_sentence,lancaster)

'when dat sci ar perform nat langu process analys , they must tak diff verb tens and singul vers plur word into account .'

In [22]:
regex_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [23]:
stem_words(stem_sentence,regex_stemmer)

'when data scientist are perform natural languag process analysi , they must tak different verb tense and singular versu plural word into account .'
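
The regular-expression stemmer is simple enough to imitate by hand. The sketch below is an approximation inferred from the output above, not NLTK's own code: `regexp_stem` is a hypothetical helper that leaves words shorter than the minimum length alone and otherwise strips a matching suffix.

```python
import re

# Same suffix pattern that was passed to RegexpStemmer above.
SUFFIXES = re.compile(r"ing$|s$|e$|able$")


def regexp_stem(word, min_length=4):
    # Leave short words untouched, mirroring min=4 above.
    if len(word) < min_length:
        return word
    # Remove the suffix that matches at the end of the word, if any.
    return SUFFIXES.sub("", word)


for word in ["are", "take", "tenses", "processing", "data"]:
    print(word, "->", regexp_stem(word))
```

This makes the stemmer's blind spots easy to see: it has no notion of whether the result is a real word, so 'take' becomes 'tak' and 'versus' becomes 'versu'.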

In [24]:
lemma = WordNetLemmatizer()

In [25]:
# function for lemmas
# note: lemmatize() treats every word as a noun unless a pos argument is passed
def lem_words(document,lemmer):
    toks = word_tokenize(document)
    wrd_list = []
    for word in toks:
        wrd_list.append(lemmer.lemmatize(word))
    return " ".join(wrd_list)

In [26]:
# test it out
lemma.lemmatize('things')

'thing'

In [27]:
lem_words(stem_sentence,lemma)

'when data scientist are performing natural language processing analysis , they must take different verb tense and singular versus plural word into account .'

# Vectorization
## this step happens after we account for stopwords and lemmas; depending on the library...
* we make a **Count Vector**, which is the formal term for a **bag of words**
* we use vectors to pass text into machine learning models


In [28]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Let's check out the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

In [29]:
# fit a CountVectorizer on 'basic_example'
basic_example = ['The Data Scientist wants to train a machine to train machine learning models.']
cv = CountVectorizer()
cv.fit(basic_example)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
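
One piece of information a fitted `CountVectorizer` exposes is its `vocabulary_` attribute, a dictionary that maps each learned term to its column index. A minimal sketch, repeating the fit from the cell above so it runs on its own:

```python
from sklearn.feature_extraction.text import CountVectorizer

basic_example = ['The Data Scientist wants to train a machine to train machine learning models.']
cv = CountVectorizer()
cv.fit(basic_example)

# vocabulary_ maps each learned term to its column index in the count matrix.
for term, column in sorted(cv.vocabulary_.items(), key=lambda item: item[1]):
    print(column, term)

# The one-letter word 'a' is missing: the default token_pattern only keeps
# tokens made of two or more word characters.
print('a' in cv.vocabulary_)
```

The terms are lowercased and sorted alphabetically, which is why 'data' lands in column 0 even though 'The' comes first in the sentence.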

In [None]:
# what info can we get from cv?
# hint -- look at the docs again

## Vectorization allows us to compare two documents

In [30]:
# use pandas to help see what's happening
import pandas as pd

In [31]:
# we fit the CountVectorizer on the 'basic_example', now we transform 'basic_example'
example_vector_doc_1 = cv.transform(basic_example)

In [33]:
# what is the type?
print(type(example_vector_doc_1))

# what does it look like?
print(example_vector_doc_1)

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 0)	1
  (0, 1)	1
  (0, 2)	2
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	2
  (0, 7)	2
  (0, 8)	1


In [34]:
# let's visualize it
# get_feature_names_out() requires scikit-learn >= 1.0
example_vector_df = pd.DataFrame(example_vector_doc_1.toarray(), columns=cv.get_feature_names_out())
example_vector_df

|   | data | learning | machine | models | scientist | the | to | train | wants |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 1 |


In [37]:
# here we compare new text to the CountVectorizer fit on 'basic_example'
new_text = ['the data scientist plotted the residual error of her model']
new_data = cv.transform(new_text)
new_count = pd.DataFrame(new_data.toarray(), columns=cv.get_feature_names_out())
new_count

|   | data | learning | machine | models | scientist | the | to | train | wants |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 |
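
With both documents expressed over the same vocabulary, the comparison itself can be reduced to a single number. Cosine similarity is one common choice (an assumption here; the notebook does not name a metric). The sketch below copies the two count vectors from the outputs above and uses only the standard library.

```python
import math

# Count vectors copied from the two outputs above (same column order).
doc_1 = [1, 1, 2, 1, 1, 1, 2, 2, 1]
doc_2 = [1, 0, 0, 0, 1, 2, 0, 0, 0]


def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


similarity = cosine_similarity(doc_1, doc_2)
print(round(similarity, 3))
```

Note that 'plotted', 'residual', 'error', 'of', 'her' and 'model' contribute nothing, because the vectorizer was fit on the first document only and ignores words it has never seen. Even 'model' is dropped, since the vocabulary contains 'models'.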


## N-grams

In [38]:
# in this example, the list 'sentences' is our corpus
sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
'the data scientist plotted the residual error of her model in her analysis',
'Her analysis was so good, she won a Kaggle competition.',
'The machine gained sentiance']

In [39]:
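# go back to the docs for CountVectorizer: how would we use an n-gram?
# pro tip -- keep the stop words in when building n-grams
# note: with no arguments, ngram_range defaults to (1, 1), so this counts single words
bigrams = CountVectorizer()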

In [40]:
bigram_vector = bigrams.fit_transform(sentences)
bigram_vector

<4x26 sparse matrix of type '<class 'numpy.int64'>'
	with 33 stored elements in Compressed Sparse Row format>

In [41]:
print('There are '+str(len(bigrams.get_feature_names_out()))+ ' features for this corpus')
bigrams.get_feature_names_out()[:10].tolist()

There are 26 features for this corpus


['analysis',
 'competition',
 'data',
 'error',
 'gained',
 'good',
 'her',
 'in',
 'kaggle',
 'learning']

In [42]:
# let's visualize it
bigram_df = pd.DataFrame(bigram_vector.toarray(), columns=bigrams.get_feature_names_out())
bigram_df.head()

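|   | analysis | competition | data | error | gained | good | her | in | kaggle | learning | ... | scientist | sentiance | she | so | the | to | train | wants | was | won |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 2 | 2 | 1 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |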
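
The 26 features above are all single words, because `CountVectorizer()` was created with its default `ngram_range=(1, 1)`. A minimal sketch of what actual bigram counting could look like on the same corpus follows; the variable names `bigram_cv` and `bigram_counts` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
             'the data scientist plotted the residual error of her model in her analysis',
             'Her analysis was so good, she won a Kaggle competition.',
             'The machine gained sentiance']

# ngram_range=(2, 2) keeps only two-word sequences; (1, 2) would keep single words too.
bigram_cv = CountVectorizer(ngram_range=(2, 2))
bigram_counts = bigram_cv.fit_transform(sentences)

# Look up the column for one bigram and read its count in the first sentence.
column = bigram_cv.vocabulary_['train machine']
print(bigram_counts[0, column])

# A few of the learned bigrams, in alphabetical order.
print(sorted(bigram_cv.vocabulary_)[:5])
```

Because the default token pattern drops the one-letter word 'a' before n-grams are built, 'train a machine' and 'train machine' both end up counted as the bigram 'train machine'.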


# TF-IDF
## Term Frequency - Inverse Document Frequency

In [43]:
tf_idf_sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
'the data scientist plotted the residual error of her model in her analysis',
'Her analysis was so good, she won a Kaggle competition.',
'The machine gained sentiance']
# take out stop words
tfidf = TfidfVectorizer(stop_words='english')
# fit transform the sentences
tfidf_sentences = tfidf.fit_transform(tf_idf_sentences)

In [44]:
# visualize it
tfidf_df = pd.DataFrame(tfidf_sentences.toarray(), columns=tfidf.get_feature_names_out())

In [45]:
tfidf_df

|   | analysis | competition | data | error | gained | good | kaggle | learning | machine | model | models | plotted | residual | scientist | sentiance | train | wants | won |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.240692 | 0.0 | 0.0 | 0.0 | 0.0 | 0.305288 | 0.481384 | 0.0 | 0.305288 | 0.0 | 0.0 | 0.240692 | 0.0 | 0.610575 | 0.305288 | 0.0 |
| 1 | 0.325557 | 0.0 | 0.325557 | 0.412928 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.412928 | 0.0 | 0.412928 | 0.412928 | 0.325557 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.366739 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.465162 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.617614 | 0.0 | 0.0 | 0.0 | 0.486934 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.617614 | 0.0 | 0.0 | 0.0 |
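
To see where these numbers come from, the weights can be recomputed by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The token lists are prepared by hand to match the columns above, and `tfidf_row` is a hypothetical helper, not part of scikit-learn.

```python
import math
from collections import Counter

# Hand-prepared tokens: lowercased, with English stop words already removed.
docs = [
    ['data', 'scientist', 'wants', 'train', 'machine', 'train', 'machine', 'learning', 'models'],
    ['data', 'scientist', 'plotted', 'residual', 'error', 'model', 'analysis'],
    ['analysis', 'good', 'won', 'kaggle', 'competition'],
    ['machine', 'gained', 'sentiance'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf_row(doc):
    counts = Counter(doc)
    # Raw count times smoothed inverse document frequency.
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    # L2-normalize so the document vector has length 1.
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


last_row = tfidf_row(docs[3])
print({term: round(weight, 6) for term, weight in last_row.items()})
```

In the last document 'machine' scores lower than 'gained' and 'sentiance' even though each appears once, because 'machine' also occurs in the first document and so carries a smaller idf.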


In [46]:
# compare with the raw counts in bigram_df (single-word counts, since ngram_range was left at its default)
bigram_df

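|   | analysis | competition | data | error | gained | good | her | in | kaggle | learning | ... | scientist | sentiance | she | so | the | to | train | wants | was | won |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 2 | 2 | 1 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |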
