
# Basic NLP modeling


In this notebook we introduce the most common text representation models used for machine learning.
This notebook is the continuation of the ["01- Basic NLP preprocessing.ipynb"](http://localhost:8888/lab/tree/template/notebooks/02%20-%20Basic%20NLP%20modeling.ipynb) notebook. Run the previous notebook first to generate the necessary data in the correct folder.

In [1]:
# Add the 'utilities' folder to the path from which we can import modules
import sys
sys.path.append('../utilities')

In [2]:
import pandas as pd
from nlp import BagOfWords, WordEmbedding

pd.set_option('display.max_colwidth', 500)

In [4]:
# Dataset generated by the "01- Basic NLP preprocessing.ipynb" notebook. Run the previous notebook to generate the data.

dataset = pd.read_csv('sample_output/pre_processed_sentences.csv')
text_field = "ConsumerComplaintNarrative"

dataset

Unnamed: 0,Row,ConsumerComplaintNarrative
0,0,sever item credit report mine notic recent appli credit card wasnt approv prompt check credit report item list mine need remov usddeptof
1,1,call sever occa request student loan updat reflect correct payment equifax continu reflect inaccur report make incorrect payment histori reflect correct sure ontim payment remov drop payment histori current contact origin creditor told late payment reflect account havent late year howev equifax continu reflect inform correct destroy charact limit abil obtain credit
2,2,follow account open close mine ident compromi affect emot credit file well im demand immedi remov follow unauthor unknown account
3,3,person problem affect affect everi consum countri fight make sever phone call get free annual credit report even deni credit agenc direct total pay site scam direct total peopl low incom senior peopl high incom mean incom reli call credit compani also realiz complaint toss never address die
4,4,transunion report fraudul account name fraudul credit account fraudul credit account fraudul credit account fraudul credit account fraudul credit account fraudul credit account didnt benefit appli author account
...,...,...
9995,9995,predatori lend unauthor credit inquiri friday receiv unsolicit call banker said would like review account see refin offer avail author banker run credit inquiri call end receiv notic credit monitor servic three credit bureau receiv hard credit inquiri contact make sure knew author credit inquiri refu remov phone number also call list also contact unsolicit text without consent continu falsifi stori dismiss neg impact credit report unabl provid evid consent credit inquiri
9996,9996,experian report incorrectli collectionchargeoff amount well report past due balanc partial account number plea see page attach credit report account need report balanc past due remov credit report fal report collectionchargeoff seriou harm credit score contact bureau remov howev success
9997,9997,may concern write disput fraudul charg account amount victim ident theft make author charg request charg remov financ charg relat fraudul amount credit well receiv accur statement request made pursuant fair credit bill act amend truth lend act usc b cfr see also cfr b write request method verif disput initi subsequ respon receiv enclo letter accord fcra section request inform review complet accuraci appropri lieu send inform reopen disput ensur proper investig perform would appreci time resp...
9998,9998,follow inquiri fraudul im demand remov


## Regularized bag of words

**Bag of words** is a simplified model of a text that consists of counting how often each word occurs in each document of the corpus, disregarding grammar and word order [[1]](https://en.wikipedia.org/wiki/Bag-of-words_model).
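
As a minimal sketch of the idea, the snippet below builds a bag-of-words matrix by hand using only the standard library. The three toy documents and the helper `count_vector` are hypothetical and are not part of the `nlp` module used in this notebook.

```python
from collections import Counter

# Hypothetical toy corpus, already pre-processed and split on whitespace
docs = [
    "credit report error",
    "credit card credit limit",
    "report fraud",
]

# Vocabulary: every distinct token in the corpus, in sorted order
vocabulary = sorted({token for doc in docs for token in doc.split()})

def count_vector(doc, vocabulary):
    """Return the term counts of one document, aligned with `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

# One row per document, one column per vocabulary term
bow_matrix = [count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row has as many entries as there are distinct words, which is why the real matrices are stored in a sparse format: most entries are zero.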

Additionally, we can reweight this bag of words by each word's inverse document frequency (**TF-IDF**) [[2]](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting). This gives a more realistic weight to words that are very frequent in the language but don't carry much meaning (like articles or other words that were not removed as stop words).
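
The weighting can be sketched in plain Python as follows. This uses the smoothed formula documented by scikit-learn, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, with no row normalization; whether `BagOfWords.fit_tfidf_bow` uses exactly these settings is an assumption, and the toy corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already pre-processed
docs = [
    "credit report error",
    "credit card credit limit",
    "report fraud",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents in which each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: rarer terms get a larger weight
idf = {
    term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    for term in vocabulary
}

def tfidf_vector(tokens):
    """Raw term count times IDF, aligned with `vocabulary` (no normalization)."""
    counts = Counter(tokens)
    return [counts[term] * idf[term] for term in vocabulary]

tfidf_matrix = [tfidf_vector(tokens) for tokens in tokenized]

for term in vocabulary:
    print(f"{term}: df={doc_freq[term]}, idf={idf[term]:.3f}")
```

Terms that appear in many documents (here `credit` and `report`) receive a lower IDF than terms that appear in only one.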

### Regular word counts

In [4]:
bow_model, word_counts = BagOfWords.fit_regular_bow(dataset[text_field])
word_counts

<10000x8475 sparse matrix of type '<class 'numpy.int64'>'
	with 414886 stored elements in Compressed Sparse Row format>

### TF-IDF word counts

In [5]:
# Returns a (fitted TfidfTransformer, TF-IDF matrix) pair, as the output below shows
tfidf_word_counts = BagOfWords.fit_tfidf_bow(dataset[text_field])
tfidf_word_counts

(TfidfTransformer(norm=None),
 <10000x8475 sparse matrix of type '<class 'numpy.float64'>'
 	with 414886 stored elements in Compressed Sparse Row format>)

### Normalized word counts

In [6]:
norm_word_counts = BagOfWords.fit_normalized_bow(dataset[text_field])
norm_word_counts

<10000x8475 sparse matrix of type '<class 'numpy.float64'>'
	with 414886 stored elements in Compressed Sparse Column format>

In [7]:
len(dataset[text_field])

10000

The resulting matrices have dimensions \<number of rows in the dataset\> x \<number of distinct words in the corpus\>, in our case 10000x8475.

These matrices can be fed to machine learning algorithms to model properties of the text, such as its overall sentiment or whether it contains a complaint.
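
To illustrate how count vectors can feed a classifier, here is a minimal sketch that uses a hand-rolled nearest-centroid rule in place of a library model. The labelled toy documents, the labels themselves, and the helpers `count_vector`, `centroid` and `predict` are all hypothetical; in practice you would pass the sparse matrices above to an estimator from a library such as scikit-learn.

```python
from collections import Counter

# Hypothetical labelled toy data: 1 = complaint, 0 = not a complaint
train_docs = [
    ("fraud account remov credit report", 1),
    ("unauthor inquiri credit report disput", 1),
    ("thank help quick servic", 0),
    ("great servic friendli staff", 0),
]

vocabulary = sorted({t for doc, _ in train_docs for t in doc.split()})

def count_vector(doc):
    """Bag-of-words counts aligned with `vocabulary`; unknown words are ignored."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

def centroid(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# One centroid (mean count vector) per class
centroids = {}
for label in {label for _, label in train_docs}:
    vectors = [count_vector(doc) for doc, lab in train_docs if lab == label]
    centroids[label] = centroid(vectors)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(doc):
    """Assign the class whose centroid is closest to the document's count vector."""
    vec = count_vector(doc)
    return min(centroids, key=lambda label: squared_distance(vec, centroids[label]))

print(predict("disput fraud credit report"))
print(predict("quick friendli servic"))
```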



## Word embeddings

Another way to extract features from a text corpus is to vectorize it using **word embeddings** [[3]](https://machinelearningmastery.com/what-are-word-embeddings/). This is a more robust representation of text data because it captures the semantics of each word.

In this approach each word is transformed into an N-dimensional array. The dimensions of this array encode semantic information about the word, so words with similar meanings end up close to each other in this N-dimensional vector space.
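
The notion of "closeness" is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they do not come from the embedding model loaded in this notebook, which produces 300-dimensional vectors.

```python
import math

# Hypothetical toy vectors, hand-picked so that related words point the same way
toy_vectors = {
    "loan":     [0.9, 0.1, 0.0],
    "mortgage": [0.8, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(toy_vectors["loan"], toy_vectors["mortgage"]))
print(cosine_similarity(toy_vectors["loan"], toy_vectors["banana"]))
```

With these toy values, `loan` and `mortgage` score close to 1 while `loan` and `banana` score close to 0.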

In [1]:
# Add the 'utilities' folder to the path from which we can import modules
import sys
sys.path.append('../utilities')

import pandas as pd
from nlp import BagOfWords, WordEmbedding

pd.set_option('display.max_colwidth', 500)
dataset = pd.read_csv('sample_output/pre_processed_sentences.csv')
text_field = "ConsumerComplaintNarrative"

dataset



In [2]:
embedding = WordEmbedding()

Loading embedding model to memory...
Done!


In [3]:
result = embedding.transform(dataset[text_field])
result

No vector in "disput disput"
No vector in "fraudul inquiri"
No vector in "disput disput disput disput disput disput disput disput disput disput disput disput disput disput disput disput disput disput disput"
No vector in "unauthor inquiri transunion"
No vector in "inquir"
No vector in "unauthor inquiri"
No vector in "inquiri knowledg"


array([[-0.07753553,  0.02278539, -0.00518638, ...,  0.06176115,
        -0.00800042, -0.02298897],
       [-0.06949715,  0.02393772,  0.01911396, ..., -0.00575323,
         0.01801201, -0.0687839 ],
       [-0.02784819,  0.00189209,  0.05388686, ..., -0.03203269,
        -0.022662  , -0.03652842],
       ...,
       [-0.06088218,  0.04120695,  0.02956175, ..., -0.03692451,
        -0.02326574, -0.00577193],
       [-0.05966187,  0.0357666 ,  0.02453613, ..., -0.06591797,
         0.10949707, -0.02438354],
       [-0.00722395,  0.06133597,  0.00013188, ..., -0.01401084,
        -0.04476493, -0.04678345]], dtype=float32)

In [4]:
result.shape

(10000, 300)

The matrix returned by `embedding.transform` has one row per document and 300 columns, which is the correct format for machine learning algorithms from the scikit-learn library.
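
One common way to obtain a single fixed-size vector per sentence is to average the vectors of its words. The sketch below illustrates that idea with hypothetical 2-dimensional vectors. Whether `WordEmbedding.transform` works this way is an assumption, as is the choice to return a zero vector when no word in the sentence has an embedding (the situation the "No vector in ..." messages above appear to report).

```python
# Hypothetical toy embedding table; a real model maps each word to 300 values
toy_vectors = {
    "credit": [1.0, 0.0],
    "report": [0.0, 1.0],
}

def sentence_vector(sentence, vectors, dim):
    """Average the vectors of the known words; fall back to zeros if none is known."""
    found = [vectors[token] for token in sentence.split() if token in vectors]
    if not found:
        return [0.0] * dim  # assumption: sentences without any known word become zeros
    return [sum(col) / len(found) for col in zip(*found)]

print(sentence_vector("credit report unknownword", toy_vectors, dim=2))
print(sentence_vector("unknownword", toy_vectors, dim=2))
```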

## Conclusion

The methods shown above turn text into numeric features that can be fed to machine learning algorithms and applied to a diverse set of problems.