# NLP Modeling

How do we quantify a document?

- [Setup](#setup)
- [Data Representation](#data-representation)
    - [Bag of Words](#bag-of-words)
    - [TF-IDF](#tf-idf)
    - [Bag Of Ngrams](#bag-of-ngrams)
- [Modeling](#modeling)
    - [Modeling Results](#modeling-results)
- [Next Steps](#next-steps)

## Setup

In [2]:
from pprint import pprint
import pandas as pd
import nltk
import re

In [3]:

def clean(text: str) -> list:
    'A simple function to clean up text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    # drop non-ASCII characters and lowercase everything
    text = (text.encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    # strip punctuation, then split on whitespace (tokenization)
    words = re.sub(r'[^\w\s]', '', text).split()
    # lemmatize each remaining word, skipping stopwords
    return [wnl.lemmatize(word) for word in words if word not in stopwords]
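
To see what the normalization steps inside `clean` do before lemmatization and stopword removal, here is a minimal sketch that uses only the standard library. The sample string is made up for illustration, and the NLTK steps are left out so the snippet runs without the `stopwords` and `wordnet` corpora being downloaded.

```python
import re

# made-up sample text, not taken from the spam dataset
text = 'Café prices: $5!!'

# non-ASCII characters are dropped entirely (the "é" disappears), then lowercased
ascii_only = text.encode('ascii', 'ignore').decode('utf-8', 'ignore').lower()

# remove anything that is not a word character or whitespace, then split
words = re.sub(r'[^\w\s]', '', ascii_only).split()

print(words)
```

Note that accented characters are removed rather than transliterated, so `Café` becomes `caf`.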


## Data Representation

Simple data for demonstration.

In [4]:

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]


### Bag of Words

```python
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
bag_of_words = cv.fit_transform(data)
```

Here `bag_of_words` is a **sparse matrix**. Usually you should keep it as such,
but for demonstration we'll view the data within.

```python
pprint(data)
pd.DataFrame(bag_of_words.todense(), columns=cv.get_feature_names())
```

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize the vectorizer
cv = CountVectorizer()

# learn the vocabulary and transform the documents into a bag of words
bag_of_words = cv.fit_transform(data)

In [17]:
# A sparse matrix stores only its non-zero entries, which saves memory when most entries are 0
bag_of_words

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [18]:
bag_of_words.todense()

matrix([[1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]])

In [19]:
cv.get_feature_names()

['cool',
 'is',
 'language',
 'nice',
 'pretty',
 'programming',
 'python',
 'sql',
 'syntax',
 'think',
 'too',
 'with']

In [20]:
pprint(data)
pd.DataFrame(bag_of_words.todense(), columns=cv.get_feature_names())

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


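|   | cool | is | language | nice | pretty | programming | python | sql | syntax | think | too | with |
|---|------|----|----------|------|--------|-------------|--------|-----|--------|-------|-----|------|
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |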
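
The counts above can be reproduced by hand, which also explains why `a` and `I` never show up as columns. The sketch below assumes `CountVectorizer`'s default settings: lowercasing, and a token pattern that keeps only tokens of two or more word characters. It is an illustration of the idea, not a replacement for the vectorizer.

```python
import re
from collections import Counter

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

# same pattern as CountVectorizer's default token_pattern:
# single-character tokens such as "a" and "I" do not match
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

# one Counter of word counts per document
counts = [Counter(tokenize(doc)) for doc in data]

# the vocabulary is every distinct word, in alphabetical order
vocabulary = sorted(set().union(*counts))

# one row per document, one column per vocabulary word
rows = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in rows:
    print(row)
```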


### TF-IDF

- TF = "Term Frequency" 
- IDF = "Inverse Document Frequency"
- a measure of how important a word is to a document
- combines how often a word appears in a document (**tf**) with how unique the word
  is among documents (**idf**)
- used by search engines
- naturally down-weights stopwords, since words that appear in every document get
  the lowest idf
- tf is computed for a single document, idf for the whole corpus

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bag_of_words = tfidf.fit_transform(data)

pprint(data)
pd.DataFrame(bag_of_words.todense(), columns=tfidf.get_feature_names()).round(1)

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


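|   | cool | is | language | nice | pretty | programming | python | sql | syntax | think | too | with |
|---|------|----|----------|------|--------|-------------|--------|-----|--------|-------|-----|------|
| 0 | 0.5 | 0.4 | 0.0 | 0.0 | 0.6 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.2 | 0.3 | 0.7 | 0.0 | 0.3 | 0.3 | 0.0 | 0.3 | 0.0 | 0.0 | 0.3 |
| 2 | 0.4 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.0 |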
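
To make the table less of a black box, the first row can be recomputed by hand. This sketch assumes `TfidfVectorizer`'s defaults: raw counts for tf, a smoothed idf, and L2 normalization of each row so that every document vector has length 1. The helper name `tfidf_row` is made up for this example.

```python
import math
import re
from collections import Counter

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

# tokenize the way the vectorizer does by default (lowercase, 2+ characters)
tokenized = [re.findall(r'(?u)\b\w\w+\b', doc.lower()) for doc in data]
n_docs = len(tokenized)

# document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_row(tokens):
    tf = Counter(tokens)
    # tf * smoothed idf for every word in this document
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w in tf}
    # L2 normalization: divide by the Euclidean length of the row
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

first = tfidf_row(tokenized[0])
print({w: round(v, 1) for w, v in first.items()})
```

Because of the normalization step, scores are only comparable within a row: `nice` scores 0.7 in the second document partly because it appears twice there.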


In [22]:
# the fitted vectorizer's .idf_ attribute holds the idf score for each word in the vocabulary
tfidf.idf_

array([1.28768207, 1.        , 1.69314718, 1.69314718, 1.69314718,
       1.69314718, 1.28768207, 1.69314718, 1.69314718, 1.69314718,
       1.69314718, 1.69314718])

In [23]:
tfidf.get_feature_names()

['cool',
 'is',
 'language',
 'nice',
 'pretty',
 'programming',
 'python',
 'sql',
 'syntax',
 'think',
 'too',
 'with']

In [31]:
list(zip(tfidf.get_feature_names(), tfidf.idf_))

[('cool', 1.2876820724517808),
 ('is', 1.0),
 ('language', 1.6931471805599454),
 ('nice', 1.6931471805599454),
 ('pretty', 1.6931471805599454),
 ('programming', 1.6931471805599454),
 ('python', 1.2876820724517808),
 ('sql', 1.6931471805599454),
 ('syntax', 1.6931471805599454),
 ('think', 1.6931471805599454),
 ('too', 1.6931471805599454),
 ('with', 1.6931471805599454)]

In [30]:
pd.Series(dict(zip(tfidf.get_feature_names(), tfidf.idf_))).sort_values()

is             1.000000
cool           1.287682
python         1.287682
language       1.693147
nice           1.693147
pretty         1.693147
programming    1.693147
sql            1.693147
syntax         1.693147
think          1.693147
too            1.693147
with           1.693147
dtype: float64
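
Only three distinct idf values appear above because, with three documents, a word can occur in one, two, or all three of them. A small sketch of the smoothed idf formula that scikit-learn uses by default (`smooth_idf=True`) reproduces them; the function name `smooth_idf` here is just a label for this example.

```python
import math

n_documents = 3

def smooth_idf(doc_freq, n):
    # ln((1 + n) / (1 + df)) + 1: the "+ 1"s avoid division by zero
    # and keep words that occur in every document from scoring 0
    return math.log((1 + n) / (1 + doc_freq)) + 1

for doc_freq in (3, 2, 1):
    print(doc_freq, smooth_idf(doc_freq, n_documents))
```

A word found in all three documents (`is`) gets the minimum score of 1.0, and the score grows as the word becomes rarer.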

### Bag Of Ngrams

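For either `CountVectorizer` or `TfidfVectorizer`, you can set the `ngram_range`
parameter to count sequences of consecutive words instead of single words. For
example, `ngram_range=(2, 2)` keeps only bigrams.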

In [33]:
cv = CountVectorizer(ngram_range=(2, 2))
bag_of_words = cv.fit_transform(data)

In [34]:
pprint(data)
pd.DataFrame(bag_of_words.todense(), columns=cv.get_feature_names()).T

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


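|   | 0 | 1 | 2 |
|---|---|---|---|
| cool too | 0 | 0 | 1 |
| is cool | 0 | 0 | 1 |
| is nice | 0 | 1 | 0 |
| is pretty | 1 | 0 | 0 |
| language with | 0 | 1 | 0 |
| nice programming | 0 | 1 | 0 |
| nice syntax | 0 | 1 | 0 |
| pretty cool | 1 | 0 | 0 |
| programming language | 0 | 1 | 0 |
| python is | 1 | 1 | 0 |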
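
The bigram `is nice` may look odd, since the original sentence reads "is a nice". Bigrams are built from the token list, and the default tokenizer has already dropped the single-character `a` by that point. A minimal sketch, assuming the default token pattern and a hypothetical helper named `ngrams`:

```python
import re

def ngrams(doc, n=2):
    # tokenize first (lowercase, tokens of 2+ characters), then slide a window of size n
    tokens = re.findall(r'(?u)\b\w\w+\b', doc.lower())
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams('Python is a nice programming language with nice syntax')
print(bigrams)
```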


## Modeling

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('spam_clean.csv')

In [36]:
df.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [37]:
cv = CountVectorizer()
X = cv.fit_transform(df.text.apply(clean).apply(' '.join))
y = df.label

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=12)

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)

tree.score(X_train, y_train)

0.9306708548350908

In [41]:
pd.DataFrame(X_train[:5].todense(), columns=cv.get_feature_names())

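|   | 008704050406 | 0089my | 0121 | 01223585236 | 01223585334 | 0125698789 | 02 | 020603 | 0207 | 02070836089 | ... | zebra | zed | zero | zhong | zindgi | zoe | zogtorius | zoom | zouk | zyada |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |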


In [44]:
tree.predict(X_train)

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [45]:
tree.score(X_test, y_test)

0.9147982062780269
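
Accuracy on its own can be hard to interpret when one class dominates, because always predicting the majority class already scores well. The sketch below uses made-up labels (not the spam dataset) to show how accuracy, a majority-class baseline, precision, and recall can tell different stories about the same predictions.

```python
# hypothetical labels for illustration only; not taken from spam_clean.csv
actual    = ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam']
predicted = ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham']

pairs = list(zip(actual, predicted))

# treat "spam" as the positive class
true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
false_pos = sum(a == 'ham' and p == 'spam' for a, p in pairs)
false_neg = sum(a == 'spam' and p == 'ham' for a, p in pairs)

accuracy = sum(a == p for a, p in pairs) / len(pairs)
baseline = actual.count('ham') / len(actual)   # always predicting "ham"
precision = true_pos / (true_pos + false_pos)  # of predicted spam, how much is spam?
recall = true_pos / (true_pos + false_neg)     # of actual spam, how much was caught?

print(accuracy, baseline, precision, recall)
```

In this toy case the model is 90% accurate, only ten points above the baseline, and it misses half of the spam.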

### Modeling Results

A useful property of decision trees and linear models is that they do some
built-in feature selection, exposed through feature importances (trees) or
coefficients (linear models):

In [48]:
# the numbers in this array display how important each word is to the model
tree.feature_importances_

array([0., 0., 0., ..., 0., 0., 0.])

In [49]:
cv.get_feature_names()

['008704050406',
 '0089my',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02',
 '020603',
 '0207',
 '02070836089',
 '02072069400',
 '02073162414',
 '02085076972',
 '020903',
 '021',
 '050703',
 '0578',
 '06',
 '060505',
 '061104',
 '07008009200',
 '07046744435',
 '07090201529',
 '07090298926',
 '07099833605',
 '071104',
 '07123456789',
 '0721072',
 '07732584351',
 '07734396839',
 '07742676969',
 '07753741225',
 '0776xxxxxxx',
 '07786200117',
 '077xxx',
 '078',
 '07801543489',
 '07808',
 '07808247860',
 '07808726822',
 '07815296484',
 '07821230901',
 '0784987',
 '0789xxxxxxx',
 '0794674629107880867867',
 '0796xxxxxx',
 '07973788240',
 '07xxxxxxxxx',
 '0800',
 '08000407165',
 '08000776320',
 '08000839402',
 '08000930705',
 '08000938767',
 '08001950382',
 '08002888812',
 '08002986030',
 '08002986906',
 '08002988890',
 '08006344447',
 '0808',
 '08081263000',
 '08081560665',
 '0825',
 '0844',
 '08448350055',
 '08448714184',
 '0845',
 '08450542832',
 '08452810071',
 '08452810073'

In [51]:
list(zip(cv.get_feature_names(), tree.feature_importances_))

[('008704050406', 0.0),
 ('0089my', 0.0),
 ('0121', 0.0),
 ('01223585236', 0.0),
 ('01223585334', 0.0),
 ('0125698789', 0.0),
 ('02', 0.0),
 ('020603', 0.0),
 ('0207', 0.0),
 ('02070836089', 0.0),
 ('02072069400', 0.0),
 ('02073162414', 0.0),
 ('02085076972', 0.0),
 ('020903', 0.0),
 ('021', 0.0),
 ('050703', 0.0),
 ('0578', 0.0),
 ('06', 0.0),
 ('060505', 0.0),
 ('061104', 0.0),
 ('07008009200', 0.0),
 ('07046744435', 0.0),
 ('07090201529', 0.0),
 ('07090298926', 0.0),
 ('07099833605', 0.0),
 ('071104', 0.0),
 ('07123456789', 0.0),
 ('0721072', 0.0),
 ('07732584351', 0.0),
 ('07734396839', 0.0),
 ('07742676969', 0.0),
 ('07753741225', 0.0),
 ('0776xxxxxxx', 0.0),
 ('07786200117', 0.0),
 ('077xxx', 0.0),
 ('078', 0.0),
 ('07801543489', 0.0),
 ('07808', 0.0),
 ('07808247860', 0.0),
 ('07808726822', 0.0),
 ('07815296484', 0.0),
 ('07821230901', 0.0),
 ('0784987', 0.0),
 ('0789xxxxxxx', 0.0),
 ('0794674629107880867867', 0.0),
 ('0796xxxxxx', 0.0),
 ('07973788240', 0.0),
 ('07xxxxxxxxx', 0

In [46]:
pd.Series(dict(zip(cv.get_feature_names(), tree.feature_importances_))).sort_values().tail(12)

probably    0.006182
youre       0.010495
stop        0.011767
ill         0.017176
service     0.020439
mobile      0.026495
reply       0.042182
later       0.059484
claim       0.073024
text        0.086027
txt         0.280117
call        0.357470
dtype: float64

## Next Steps

- Try other model types

    [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
    ([`sklearn`
    docs](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html))
    is a very popular classifier for NLP tasks.

- Look at other metrics: is accuracy the best choice here?

- Try ngrams instead of single words

- Try a combination of ngrams and words (`ngram_range=(1, 2)` for words and
  bigrams)

- Try using tf-idf instead of bag of words

- Combine the top `n` performing words with the other features that you have
  engineered (the `CountVectorizer` and `TfidfVectorizer` have a `vocabulary`
  argument you can use to restrict the words used)

In [52]:

    best_words = (
        # or, e.g. lm.coef_
        pd.Series(dict(zip(cv.get_feature_names(), tree.feature_importances_)))
        .sort_values()
        .tail(5)
        .index
    )

    cv = CountVectorizer(vocabulary=best_words)
    X = cv.fit_transform(df.text.apply(clean).apply(' '.join))

    # for demonstration
    pd.DataFrame(X.todense(), columns=cv.get_feature_names())


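|   | later | claim | text | txt | call |
|---|-------|-------|------|-----|------|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 5567 | 0 | 1 | 0 | 0 | 1 |
| 5568 | 0 | 0 | 0 | 0 | 0 |
| 5569 | 0 | 0 | 0 | 0 | 0 |
| 5570 | 0 | 0 | 0 | 0 | 0 |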
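
As a starting point for the "try other model types" suggestion, here is a minimal sketch of a Naive Bayes classifier on bag-of-words features. The four training messages and their labels are invented for illustration; with the real data you would pass the cleaned text column and `df.label` instead. Wrapping the vectorizer and the model in a pipeline is one possible design, chosen here so that the vocabulary is learned only from the data passed to `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny made-up training set, not taken from spam_clean.csv
texts = [
    'free prize call now',
    'claim your free prize txt now',
    'see you later at lunch',
    'ok call me later',
]
labels = ['spam', 'spam', 'ham', 'ham']

# vectorize, then fit a multinomial Naive Bayes model on the word counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict(['free prize txt', 'ok see you later'])
print(predictions)
```

Swapping `CountVectorizer()` for `TfidfVectorizer()`, or passing `ngram_range=(1, 2)`, covers two of the other suggestions in the list above with a one-line change.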
