# Modeling

In [2]:
from pprint import pprint
import pandas as pd
import nltk
import re

def clean(text: str) -> list:
    'A simple function to clean up text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = (text.encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split() # tokenization
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

In [3]:
#corpus we are using for modeling
data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

## We can represent our data using:
- Bag of words
- n-grams
- TF-IDF

<hr style="border:2px solid black"> </hr>

## Bag of Words
- term frequency (raw word counts)
- how often does each word show up in a document

**Example**:

"Mary had a little lamb, little lamb, little lamb"

|a | had | lamb | little | Mary |
|--|-----|------|--------|------|
|1 | 1   | 3    | 3      | 1    |
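
The counts in the table can be reproduced with the standard library alone. This is a minimal sketch, assuming punctuation is stripped and case is kept (unlike `CountVectorizer`, which lowercases by default); the variable names are illustrative.

```python
import re
from collections import Counter

sentence = "Mary had a little lamb, little lamb, little lamb"

# Strip punctuation, then split on whitespace to get the tokens.
tokens = re.sub(r"[^\w\s]", "", sentence).split()

# Count how many times each token occurs in this one document.
counts = Counter(tokens)

# Print the words alphabetically, ignoring case, like the table header.
for word in sorted(counts, key=str.lower):
    print(word, counts[word])
```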

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
bag_of_words = cv.fit_transform(data)
bag_of_words

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

**Here bag_of_words is a sparse matrix. Usually you should keep it as such, but for demonstration we'll view the data within.**

count vectorizer
- turning documents into a vector of word counts

<br>

sparse matrix
- a matrix in which most entries are 0, so only the nonzero entries are stored

In [6]:
#visualize the actual matrix 
bag_of_words.todense()

matrix([[1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]])
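
To see why only 16 elements are stored, here is a rough sketch that keeps just the nonzero cells of the dense matrix above in a plain dictionary. This is a simplified coordinate-style layout for illustration, not the Compressed Sparse Row format that scipy actually uses.

```python
# The dense matrix printed above, copied as nested lists.
dense = [
    [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
]

# Keep only (row, column) -> value for the nonzero cells.
stored = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(len(stored), "stored of", total_cells, "cells")
```

With a real vocabulary of thousands of words, each document uses only a handful of them, so the share of stored cells becomes far smaller than in this toy corpus.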

In [7]:
#turn this word bag matrix into a DF
pprint(data)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(bag_of_words.todense(), columns=cv.get_feature_names_out())

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


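|   | cool | is | language | nice | pretty | programming | python | sql | syntax | think | too | with |
|---|------|----|----------|------|--------|-------------|--------|-----|--------|-------|-----|------|
| 0 | 1    | 1  | 0        | 0    | 1      | 0           | 1      | 0   | 0      | 0     | 0   | 0    |
| 1 | 0    | 1  | 1        | 2    | 0      | 1           | 1      | 0   | 1      | 0     | 0   | 1    |
| 2 | 1    | 1  | 0        | 0    | 0      | 0           | 0      | 1   | 0      | 1     | 1   | 0    |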


<hr style="border:2px solid black"> </hr>

## Inverse Document Frequency
- roughly 1 / (how many documents the word appears in); in practice the ratio is usually log-scaled
- multiplied by term frequency, it gives tf-idf: a measure that helps identify how important a word is in a document
- tf-idf is a combination of how often a word appears in a document (tf) and how unique the word is among documents (idf)
- used by search engines
- naturally helps filter out stopwords
- tf is for a single document, idf is for a corpus

## TF-IDF
**TF**:
- term frequency: how often a word appears in a single document

<br>

**IDF**:
- inverse document frequency: helps uniquely identify a document.
    - if a word is only in one document, it might be unique to it

**Examples**:
- doesn't show up frequently in a document = low tf, so low tf-idf
- shows up a lot in a single document and nowhere else = high tf AND high idf
- "is" from the bag of words above shows up in all 3 documents = low idf, so a low tf-idf compared with rarer words

<br>

- tf-idf(word, doc, D) = tf(word, doc) × idf(word, D)
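
The formula can be worked through by hand on the three-sentence corpus. This is a sketch under stated assumptions: it uses the smoothed, log-scaled idf `ln((1 + N) / (1 + df)) + 1` and L2 row normalization, which scikit-learn documents as the `TfidfVectorizer` defaults, and it mimics the default tokenizer by keeping only tokens of two or more characters. The helper names `idf` and `tfidf` are illustrative.

```python
import math
import re
from collections import Counter

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

# Lowercase and keep tokens of 2+ word characters (drops 'a' and 'I').
docs = [re.findall(r'\b\w\w+\b', text.lower()) for text in data]
n_docs = len(docs)

# Document frequency: number of documents that contain each word.
df = Counter(word for doc in docs for word in set(doc))

def idf(word):
    # Smoothed, log-scaled inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf(doc):
    # tf is the raw count of each word in this document.
    tf = Counter(doc)
    raw = {word: count * idf(word) for word, count in tf.items()}
    # Scale the document vector to unit Euclidean length.
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {word: v / norm for word, v in raw.items()}

weights = [tfidf(doc) for doc in docs]
print({word: round(v, 6) for word, v in sorted(weights[0].items())})
```

In the first document, "is" ends up with the lowest weight because it appears in every document, while "pretty" gets the highest because it appears nowhere else.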

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidfs = tfidf.fit_transform(data)

pprint(data)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(tfidfs.todense(), columns=tfidf.get_feature_names_out())

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


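|   | cool     | is       | language | nice     | pretty   | programming | python   | sql      | syntax   | think    | too      | with     |
|---|----------|----------|----------|----------|----------|-------------|----------|----------|----------|----------|----------|----------|
| 0 | 0.480458 | 0.373119 | 0.0      | 0.0      | 0.631745 | 0.0         | 0.480458 | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      |
| 1 | 0.0      | 0.197673 | 0.334689 | 0.669378 | 0.0      | 0.334689    | 0.25454  | 0.0      | 0.334689 | 0.0      | 0.0      | 0.334689 |
| 2 | 0.38377  | 0.298032 | 0.0      | 0.0      | 0.0      | 0.0         | 0.0      | 0.504611 | 0.0      | 0.504611 | 0.504611 | 0.0      |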


<hr style="border:2px solid black"> </hr>

## Bag of Ngrams
- sets/groups of words
- bigrams, trigrams, etc
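
As a rough illustration of what an n-gram is, the helper below slides a window of `n` words across a sentence. It is a simplified sketch that splits on whitespace only; `ngrams` is a hypothetical name, not a library function.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams('Python is pretty cool', 2))  # bigrams
print(ngrams('Python is pretty cool', 3))  # trigrams
```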

In [10]:
#show range of words between 2 and 2 (just bigrams)
cv = CountVectorizer(ngram_range=(2, 2))
bag_of_words = cv.fit_transform(data)

In [11]:
#create DF of bigrams per sentence
pprint(data)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(bag_of_words.todense(), columns=cv.get_feature_names_out())

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


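|   | cool too | is cool | is nice | is pretty | language with | nice programming | nice syntax | pretty cool | programming language | python is | sql is | think sql | with nice |
|---|----------|---------|---------|-----------|---------------|------------------|-------------|-------------|----------------------|-----------|--------|-----------|-----------|
| 0 | 0        | 0       | 0       | 1         | 0             | 0                | 0           | 1           | 0                    | 1         | 0      | 0         | 0         |
| 1 | 0        | 0       | 1       | 0         | 1             | 1                | 1           | 0           | 1                    | 1         | 0      | 0         | 1         |
| 2 | 1        | 1       | 0       | 0         | 0             | 0                | 0           | 0           | 0                    | 0         | 1      | 1         | 0         |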


<hr style="border:3px solid black"> </hr>

# Modeling

In [12]:
#imports
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

#read in spam/ham dataset
df = pd.read_csv('spam_clean.csv')

In [13]:
#term frequency
cv = CountVectorizer()
#apply cleaning function to text, then join everything back together
X = cv.fit_transform(df.text.apply(clean).apply(' '.join))
#label (ham or spam)
y = df.label

#train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=12)

#create model - decision tree classifier
tree = DecisionTreeClassifier(max_depth=5)
#fit
tree.fit(X_train, y_train)

#accuracy score on the training split <-- 93%
tree.score(X_train, y_train) 

0.9306708548350908

In [16]:
#another way to compute accuracy
#the fraction of predictions on X_train that are equal to y_train
(tree.predict(X_train) == y_train).mean()

0.9306708548350908

In [14]:
#accuracy score of test split <-- 91%
tree.score(X_test, y_test)

0.9147982062780269

## Modeling Results
A super-useful feature of decision trees and linear models is that they do some built-in feature selection through the coefficients or feature importances:

In [18]:
#see which features matter most
tree.feature_importances_

array([0., 0., 0., ..., 0., 0., 0.])

In [19]:
#number of unique words in the entire corpus
tree.feature_importances_.shape

(8791,)

In [20]:
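#pandas Series where index is unique words and value is the feature importance
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.Series(dict(zip(cv.get_feature_names_out(), tree.feature_importances_))).sort_values().tail(20)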

#seeing the word 'call' seems to be very important

ela            0.000000
embarassed     0.000000
elaborate      0.000000
elaborating    0.000000
embarassing    0.000000
didnt          0.003178
asa            0.003320
evening        0.005961
co             0.006182
youre          0.010495
stop           0.011767
ill            0.013857
service        0.020439
mobile         0.026495
reply          0.042182
later          0.059484
claim          0.073024
text           0.086027
txt            0.280117
call           0.357470
dtype: float64
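
A ranking like the one above can be fed back into the vectorizer through its `vocabulary` argument, so that only the chosen words become columns. This is a minimal sketch: the four words are taken from the top of the ranking, and the three messages are made up for illustration and are not from the spam dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Top-ranked words from the feature importances above.
top_words = ['call', 'txt', 'text', 'claim']

# Hypothetical messages, used only to show the resulting shape.
messages = [
    'Call now to claim your prize',
    'txt STOP to end',
    'see you later',
]

# With a fixed vocabulary, every other word is ignored.
cv_top = CountVectorizer(vocabulary=top_words)
X_top = cv_top.fit_transform(messages)

print(X_top.shape)
print(X_top.toarray())
```

The columns follow the order of `top_words`, which makes it straightforward to join them with other engineered features.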

## What's Next?
- try other model types
- use same model with different representations (ngrams vs TF-IDF)
- look at accuracy, recall, precision
- Look at other metrics, is accuracy the best choice here?
- Try ngrams instead of single words
- Try a combination of ngrams and words (ngram_range=(1, 2) for words and bigrams)
- Try using tf-idf instead of bag of words
- Combine the top n performing words with the other features that you have engineered (the CountVectorizer and TfidfVectorizer have a vocabulary argument you can use to restrict the words used)
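
To see why accuracy alone may not be the best metric for spam detection, consider a sketch with made-up labels where most messages are ham. The labels below are hypothetical and the "model" simply predicts ham every time.

```python
# Hypothetical labels, not taken from the spam dataset above.
y_true = ['ham'] * 8 + ['spam'] * 2
y_pred = ['ham'] * 10  # a "model" that never flags spam

pairs = list(zip(y_true, y_pred))

# Treat 'spam' as the positive class.
tp = sum(t == 'spam' and p == 'spam' for t, p in pairs)
fp = sum(t == 'ham' and p == 'spam' for t, p in pairs)
fn = sum(t == 'spam' and p == 'ham' for t, p in pairs)

accuracy = sum(t == p for t, p in pairs) / len(pairs)
# Guard against division by zero when nothing is predicted as spam.
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(accuracy, precision, recall)
```

The accuracy looks respectable even though no spam is ever caught, which is what recall exposes.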

### Note:
After going through modeling...
- go back and change lemmatizing to stemming
- does it change the results of model accuracy?
- if lemmatizing gives only a small increase in accuracy, go with stemming (for speed)
- you can do either (no wrong way to clean)

- can use KNN, Random Forest, Decision Tree
- recall, precision, accuracy... which is most important?