# Natural Language Processing

## Steps
* [tokenization](#Tokenization)
* [vectorization](#Vectorization)
* [TF-IDF](#TF-IDF)

# Tokenization
## start small

In [19]:
import nltk

In [1]:
token_test = "Here is a sentence. Or two, I don't think there will be more."
token_test_2 ="i thought this sentence was good."
token_test_3 = "Here's a sentence... maybe two. Depending on how you like to count!"

In [2]:
type(token_test_3)

str

In [6]:
# let's tokenize a document... into sentences
def make_sentences(doc):
    return doc.split('.')

make_sentences(token_test_3)

["Here's a sentence",
 '',
 '',
 ' maybe two',
 ' Depending on how you like to count!']
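Splitting on `'.'` leaves empty strings where the ellipsis was and throws away the sentence-ending punctuation. A minimal sketch of one alternative, assuming a sentence ends with `.`, `!` or `?` followed by whitespace. `split_sentences` is a hypothetical helper, not an NLTK function, and it will still mis-split abbreviations such as "Dr.".

```python
import re

def split_sentences(doc):
    # split on the whitespace that follows ., ! or ? so the punctuation stays attached
    parts = re.split(r'(?<=[.!?])\s+', doc.strip())
    # drop any empty pieces left over from leading/trailing whitespace
    return [p for p in parts if p]

token_test_3 = "Here's a sentence... maybe two. Depending on how you like to count!"
print(split_sentences(token_test_3))
```

The ellipsis now stays with the first sentence instead of producing empty strings.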

In [7]:
# let's tokenize a document into words
# with these 3 test cases what would you look out for?
def tokenize_it(doc):
    return doc.split(' ')

tokenize_it(token_test_3)

["Here's",
 'a',
 'sentence...',
 'maybe',
 'two.',
 'Depending',
 'on',
 'how',
 'you',
 'like',
 'to',
 'count!']
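Splitting on spaces leaves punctuation glued to the words (`'sentence...'`, `'two.'`, `'count!'`). A rough sketch of a regex tokenizer that separates punctuation from words is below. `simple_word_tokenize` and its pattern are assumptions made for illustration; this is not how NLTK tokenizes, and it keeps contractions such as `Here's` as a single token.

```python
import re

# a word (optionally with an apostrophe suffix), an ellipsis, or any single punctuation mark
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|\.{3}|[^\w\s]")

def simple_word_tokenize(doc):
    # findall returns the matches in order of appearance
    return TOKEN_PATTERN.findall(doc)

token_test_3 = "Here's a sentence... maybe two. Depending on how you like to count!"
print(simple_word_tokenize(token_test_3))
```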

# Before running the next cell, let's look at the nltk tokenizers in action
* https://text-processing.com/demo/tokenize/

In [8]:
# using the natural language toolkit library for tokenizing sentences
from nltk import sent_tokenize

In [14]:
sent_tokenize(token_test_3)

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/Users/paulyun/nltk_data'
    - '/Users/paulyun/anaconda3/nltk_data'
    - '/Users/paulyun/anaconda3/share/nltk_data'
    - '/Users/paulyun/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


In [38]:
# let's tokenize a document into words now
from nltk import word_tokenize

In [39]:
# how would I find out which tokenizer nltk is using?
print(word_tokenize(token_test))
print(word_tokenize(token_test_2))
print(word_tokenize(token_test_3))

['Here', 'is', 'a', 'sentence', '.', 'Or', 'two', ',', 'I', 'do', "n't", 'think', 'there', 'will', 'be', 'more', '.']
['i', 'thought', 'this', 'sentence', 'was', 'good', '.']
['Here', "'s", 'a', 'sentence', '...', 'maybe', 'two', '.', 'Depending', 'on', 'how', 'you', 'like', 'to', 'count', '!']


## Intuitively how would we compare these 'documents'? 
By counting how many times each word appears in each document!
<br>
This is known as a **bag of words**


In [40]:
# this will take one sentence as a string
def bag_o_words(bag):
    pass

bag_o_words(token_test)
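One possible way to fill in the stub, shown under a different name so it is not mistaken for the intended answer. `bag_o_words_sketch` is a hypothetical helper that assumes lowercasing and a simple regex are good enough for tokenizing; it uses `collections.Counter` from the standard library.

```python
import re
from collections import Counter

def bag_o_words_sketch(doc):
    # lowercase so 'The' and 'the' are counted as the same word
    tokens = re.findall(r"[a-z0-9']+", doc.lower())
    # Counter maps each distinct token to the number of times it appears
    return Counter(tokens)

bag = bag_o_words_sketch("The cat saw the other cat.")
print(bag)
```

Word order is lost: only the counts survive, which is what makes it a "bag".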

## problems with comparing two documents?
ummm yea, ofc!

In [41]:
# write some potential problems here

### Stop words

In [42]:
from nltk.corpus import stopwords
print(stopwords.words('english')[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [43]:
# the right stop word list depends on the corpus/project; NLTK's English list is a starting point
my_stopwords = set(stopwords.words('english'))

In [44]:
# take the stop words out of 'token_test'
[x for x in word_tokenize(token_test) if x not in my_stopwords]

['Here', 'sentence', '.', 'Or', 'two', ',', 'I', "n't", 'think', '.']

In [46]:
print(word_tokenize(token_test))

['Here', 'is', 'a', 'sentence', '.', 'Or', 'two', ',', 'I', 'do', "n't", 'think', 'there', 'will', 'be', 'more', '.']
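Notice that `'Here'`, `'Or'` and `'I'` survived the filter: the stop word list is all lowercase and the comparison is case-sensitive. A small sketch of a case-insensitive filter follows. `demo_stopwords` is a hand-picked stand-in for `stopwords.words('english')` so the example runs without NLTK data, and `remove_stopwords` is a hypothetical helper.

```python
# tokens as produced by word_tokenize(token_test) in the cell above
tokens = ['Here', 'is', 'a', 'sentence', '.', 'Or', 'two', ',', 'I', 'do', "n't",
          'think', 'there', 'will', 'be', 'more', '.']

# illustration only: a tiny stand-in for the real stop word list
demo_stopwords = {'here', 'is', 'a', 'or', 'i', 'do', 'there', 'will', 'be', 'more'}

def remove_stopwords(tokens, stop_set, drop_punct=False):
    # compare the lowercased token so capitalized stop words are caught too
    kept = [t for t in tokens if t.lower() not in stop_set]
    if drop_punct:
        # keep only tokens that contain at least one letter or digit
        kept = [t for t in kept if any(ch.isalnum() for ch in t)]
    return kept

print(remove_stopwords(tokens, demo_stopwords))
print(remove_stopwords(tokens, demo_stopwords, drop_punct=True))
```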


### Now that stop words are out of the way, check out Stems and Lemmas in action
* https://text-processing.com/demo/stem/

In [47]:
from nltk.stem import LancasterStemmer, SnowballStemmer, RegexpStemmer, WordNetLemmatizer 

In [48]:
stem_sentence = """when data scientists are performing natural language processing analysis, they must take\
 different verb tenses and singular versus plural words into account."""

In [49]:
snowball = SnowballStemmer('english')

In [50]:
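# function to get stems (works with any stemmer that has a .stem() method)
def stem_words(document,stemmer):
    # split the document into word tokens
    toks = word_tokenize(document)
    wrd_list = []
    # visit each token in order
    for word in toks:
        # reduce the token to its stem and collect it
        wrd_list.append(stemmer.stem(word))
    # join the stems back into a single space-separated string
    return " ".join(wrd_list)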

In [51]:
stem_words(stem_sentence,snowball)

'when data scientist are perform natur languag process analysi , they must take differ verb tens and singular versus plural word into account .'

In [52]:
lancaster = LancasterStemmer()

In [53]:
stem_words(stem_sentence,lancaster)

'when dat sci ar perform nat langu process analys , they must tak diff verb tens and singul vers plur word into account .'

In [54]:
regex_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [55]:
stem_words(stem_sentence,regex_stemmer)

'when data scientist are perform natural languag process analysi , they must tak different verb tense and singular versu plural word into account .'
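The regex stemmer is the easiest of the three to reason about: it deletes a matching suffix and leaves short words alone. The sketch below reproduces that idea with the standard library only. `regex_stem` and `min_len` are hypothetical names, and this illustrates the behavior seen in the output above rather than NLTK's actual code.

```python
import re

# same suffix pattern that was passed to RegexpStemmer above
SUFFIX = re.compile(r'ing$|s$|e$|able$')

def regex_stem(word, min_len=4):
    # words shorter than min_len are returned unchanged (e.g. 'are')
    if len(word) < min_len:
        return word
    # otherwise strip the first matching suffix from the end of the word
    return SUFFIX.sub('', word)

words = ['scientists', 'are', 'performing', 'language', 'take', 'versus']
print([regex_stem(w) for w in words])
```

Rules this blunt explain results such as `'tak'` and `'versu'`: the pattern has no notion of whether the remainder is a real word.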

In [56]:
lemma = WordNetLemmatizer()

In [57]:
# function for lemmas (lemmatize() treats every word as a noun unless a pos argument is passed)
def lem_words(document,lemmer):
    toks = word_tokenize(document)
    wrd_list = []
    for word in toks:
        wrd_list.append(lemmer.lemmatize(word))
    return " ".join(wrd_list)

In [58]:
# test it out
lemma.lemmatize('things')

'thing'

In [59]:
lem_words(stem_sentence,lemma)

'when data scientist are performing natural language processing analysis , they must take different verb tense and singular versus plural word into account .'

# Vectorization
## this step happens after we account for stopwords and lemmas; depending on the library...
* we make a **Count Vector**, which is the formal term for a **bag of words**
* we use vectors to pass text into machine learning models


In [60]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Let's check out the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

In [61]:
# try the CountVectorizer class on 'basic_example'
basic_example = ['The Data Scientist wants to train a machine to train machine learning models.']
cv = CountVectorizer(stop_words='english')
cv.fit(basic_example)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [32]:
# what info can we get from cv?
# hint -- look at the docs again
# note: scikit-learn 1.0 added get_feature_names_out(), and get_feature_names() was removed in 1.2
cv.get_feature_names()

['data', 'learning', 'machine', 'models', 'scientist', 'train', 'wants']

In [35]:
cv.stop_words_

set()

In [37]:
cv.vocabulary_

{'data': 0,
 'scientist': 4,
 'wants': 6,
 'train': 5,
 'machine': 2,
 'learning': 1,
 'models': 3}

## Vectorization allows us to compare two documents

In [62]:
# use pandas to help see what's happening
import pandas as pd

In [63]:
# we fit the CountVectorizer on the 'basic_example', now we transform 'basic_example'
example_vector_doc_1 = cv.transform(basic_example)

In [64]:
# what is the type 
print(type(example_vector_doc_1))

<class 'scipy.sparse.csr.csr_matrix'>


In [65]:
# what does it look like
print(example_vector_doc_1)

  (0, 0)	1
  (0, 1)	1
  (0, 2)	2
  (0, 3)	1
  (0, 4)	1
  (0, 5)	2
  (0, 6)	1


In [66]:
# let's visualize it
example_vector_df = pd.DataFrame(example_vector_doc_1.toarray(), columns=cv.get_feature_names())
example_vector_df

| | data | learning | machine | models | scientist | train | wants |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 1 | 1 | 2 | 1 |


In [67]:
# here we compare new text to the CountVectorizer fit on 'basic_example'
new_text = ['the data scientist plotted the residual error of her model']
new_data = cv.transform(new_text)
new_count = pd.DataFrame(new_data.toarray(),columns=cv.get_feature_names())
new_count

| | data | learning | machine | models | scientist | train | wants |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
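To make the fit/transform split concrete, here is a standard-library sketch of the same idea: "fit" builds a sorted vocabulary, "transform" counts only the words already in that vocabulary, and cosine similarity is one common way to compare the two resulting vectors. All names here (`fit_vocabulary`, `transform`, `cosine`, `DEMO_STOP_WORDS`) are hypothetical, the stop word set is a tiny stand-in for scikit-learn's English list, and none of this is scikit-learn's implementation.

```python
import math
import re

# illustration only: a tiny stand-in for the 'english' stop word list
DEMO_STOP_WORDS = {'the', 'to', 'of', 'her'}

def tokenize(doc):
    # lowercase, keep tokens of 2+ word characters, then drop stop words
    return [t for t in re.findall(r'\b\w\w+\b', doc.lower()) if t not in DEMO_STOP_WORDS]

def fit_vocabulary(docs):
    # map each distinct term to a column index, in alphabetical order
    terms = sorted({t for doc in docs for t in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(doc, vocab):
    # count occurrences of known terms; words not seen during fit are ignored
    vec = [0] * len(vocab)
    for t in tokenize(doc):
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

def cosine(u, v):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = fit_vocabulary(['The Data Scientist wants to train a machine to train machine learning models.'])
doc_1 = transform('The Data Scientist wants to train a machine to train machine learning models.', vocab)
doc_2 = transform('the data scientist plotted the residual error of her model', vocab)
print(doc_1)
print(doc_2)
print(cosine(doc_1, doc_2))
```

Words such as `plotted` and `residual` never reach the second vector because they were not in the vocabulary at fit time, which is why the new text only lights up `data` and `scientist`.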


## N-grams

In [68]:
# in this example, the object 'sentences' becomes the corpus
sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
'the data scientist plotted the residual error of her model in her analysis',
'Her analysis was so good, she won a Kaggle competition.',
'The machine gained sentiance']

In [74]:
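# go back to the docs for CountVectorizer: how would we ask for n-grams?
# pro tip -- pass stop_words too, so stop words are removed before the n-grams are built
# note: ngram_range=(1,2) keeps single words (unigrams) as well as bigrams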
bigrams = CountVectorizer(ngram_range=(1,2), stop_words='english')

In [75]:
bigram_vector = bigrams.fit_transform(sentences)
bigram_vector

<4x36 sparse matrix of type '<class 'numpy.int64'>'
	with 41 stored elements in Compressed Sparse Row format>

In [76]:
print('There are '+str(len(bigrams.get_feature_names()))+ ' features for this corpus')
bigrams.get_feature_names()[:10]

There are 36 features for this corpus


['analysis',
 'analysis good',
 'competition',
 'data',
 'data scientist',
 'error',
 'error model',
 'gained',
 'gained sentiance',
 'good']
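The feature list mixes single words and adjacent pairs because `ngram_range=(1,2)` asks for both. A minimal sketch of how such n-grams can be generated from a token list is below. `ngrams` is a hypothetical helper, and the input is assumed to be already lowercased with stop words removed.

```python
def ngrams(tokens, n_min=1, n_max=2):
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of width n across the tokens
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

# 'The machine gained sentiance' after lowercasing and stop word removal
print(ngrams(['machine', 'gained', 'sentiance']))
```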

In [77]:
# let's visualize it
bigram_df = pd.DataFrame(bigram_vector.toarray(), columns=bigrams.get_feature_names())
bigram_df.head()

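| | analysis | analysis good | competition | data | data scientist | error | error model | gained | gained sentiance | good | ... | scientist | scientist plotted | scientist wants | sentiance | train | train machine | wants | wants train | won | won kaggle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 2 | 2 | 1 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |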


# TF-IDF
## Term Frequency - Inverse Document Frequency

In [78]:
tf_idf_sentences = ['The Data Scientist wants to train a machine to train machine learning models.',
'the data scientist plotted the residual error of her model in her analysis',
'Her analysis was so good, she won a Kaggle competition.',
'The machine gained sentiance']
# take out stop words
tfidf = TfidfVectorizer(stop_words='english')
# fit transform the sentences
tfidf_sentences = tfidf.fit_transform(tf_idf_sentences)

In [79]:
# visualize it
tfidf_df = pd.DataFrame(tfidf_sentences.toarray(), columns=tfidf.get_feature_names())

In [80]:
tfidf_df

| | analysis | competition | data | error | gained | good | kaggle | learning | machine | model | models | plotted | residual | scientist | sentiance | train | wants | won |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.240692 | 0.0 | 0.0 | 0.0 | 0.0 | 0.305288 | 0.481384 | 0.0 | 0.305288 | 0.0 | 0.0 | 0.240692 | 0.0 | 0.610575 | 0.305288 | 0.0 |
| 1 | 0.325557 | 0.0 | 0.325557 | 0.412928 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.412928 | 0.0 | 0.412928 | 0.412928 | 0.325557 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.366739 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.465162 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.465162 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.617614 | 0.0 | 0.0 | 0.0 | 0.486934 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.617614 | 0.0 | 0.0 | 0.0 |
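The numbers in this table can be reproduced by hand. With scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), each weight is the term count times `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term, and each row is then scaled to unit length. The sketch below recomputes two rows with the standard library. `DEMO_STOP_WORDS`, `tokenize` and `tfidf_row` are hypothetical names, and the stop word set is a small stand-in that happens to cover these four sentences.

```python
import math
import re
from collections import Counter

# illustration only: stand-in for the 'english' stop word list
DEMO_STOP_WORDS = {'the', 'to', 'of', 'her', 'in', 'was', 'so', 'she'}

corpus = ['The Data Scientist wants to train a machine to train machine learning models.',
          'the data scientist plotted the residual error of her model in her analysis',
          'Her analysis was so good, she won a Kaggle competition.',
          'The machine gained sentiance']

def tokenize(doc):
    # lowercase, keep tokens of 2+ word characters, drop stop words
    return [t for t in re.findall(r'\b\w\w+\b', doc.lower()) if t not in DEMO_STOP_WORDS]

docs_tokens = [tokenize(d) for d in corpus]
n_docs = len(corpus)
# document frequency: in how many documents does each term appear?
df = Counter(t for toks in docs_tokens for t in set(toks))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # term count times smoothed inverse document frequency
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    # L2-normalize so every row has length 1
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row_0 = tfidf_row(docs_tokens[0])
row_3 = tfidf_row(docs_tokens[3])
print({t: round(w, 6) for t, w in row_3.items()})
```

In the last row, `machine` scores lower than `gained` and `sentiance` because it also appears in the first sentence, so its inverse document frequency is smaller.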


In [81]:
# compared to bigrams
bigram_df

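| | analysis | analysis good | competition | data | data scientist | error | error model | gained | gained sentiance | good | ... | scientist | scientist plotted | scientist wants | sentiance | train | train machine | wants | wants train | won | won kaggle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 2 | 2 | 1 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |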


In [90]:
# now let's test out our TfidfVectorizer
test_tdidf = tfidf.transform(['this is a test scientist','look at me I am a residual test scientist'])

In [91]:
# this is a sparse matrix with one row vector per test document
test_tdidf

<2x18 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [92]:
test_tfidf_df = pd.DataFrame(test_tdidf.toarray(), columns=tfidf.get_feature_names())
test_tfidf_df

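| | analysis | competition | data | error | gained | good | kaggle | learning | machine | model | models | plotted | residual | scientist | sentiance | train | wants | won |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.785288 | 0.61913 | 0.0 | 0.0 | 0.0 | 0.0 |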
