# NLP Modeling

The "big idea" of modeling is to determine what a document is about: which words are important and which are not. The main task is to determine the weight of each word relative to the document.


## What
- Introducing our _Dramatis Personae_, the characters in our play:
    - `Term frequency` is a direct way to measure what a document is about, but it over-emphasizes common terms. Consider term frequency the baseline, kinda like how mean, median, or mode can be baselines. It's at least somewhere to start, even if it's a blunt tool with some issues.
    - `TF` = the number of times a word occurs in a document divided by the total number of words in that document.
    - `Bag of words` is a representation of a document as a vector, where the values indicate word frequency.
        ```
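        import pandas as pd

        string = "Mary had a little lamb, little lamb, little lamb."
        # Strip punctuation so "lamb," and "lamb." are both counted as "lamb"
        string = string.replace(",", "").replace(".", "")
        words = string.split()
        bag_of_words = pd.Series(words).value_counts()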
        ```
    - Word clouds are a visual bag of words, with larger font sizes representing higher term frequency.
    - Inverse Document Frequency, `IDF`, tells us how much information a word provides. 
        - A higher IDF means that a word provides more information. That is, it is more relevant within a single document.
        - As the number of documents that a word appears in increases, the IDF value decreases.
        - Example: if "Codeup" appears frequently in every document in a list of documents, then the word doesn't add much new information on any given individual document.
        - Example: if "scholarship" shows up a whole bunch in one or two documents, but not frequently across the corpus of documents, then we can conclude that the word conveys more meaning.
        
    - `TF-IDF` is the product `tf * idf`.
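
The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the formula scikit-learn uses: the three toy documents are made up, and IDF is assumed here to be `log(N / df)`, one of several common variants (`TfidfVectorizer` applies a smoothed version and normalizes each row). The helper names `tf`, `idf`, and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Three made-up documents, already lowercased and free of punctuation
documents = [
    "codeup teaches data science",
    "codeup offers a scholarship and the scholarship covers tuition",
    "codeup teaches web development",
]
tokenized = [doc.split() for doc in documents]

def tf(word, words):
    # Term frequency: occurrences of the word divided by total words in the document
    return Counter(words)[word] / len(words)

def idf(word, corpus):
    # Inverse document frequency, assumed to be log(N / number of documents containing the word).
    # The word must appear in at least one document, otherwise this divides by zero.
    n_containing = sum(1 for words in corpus if word in words)
    return math.log(len(corpus) / n_containing)

def tf_idf(word, words, corpus):
    return tf(word, words) * idf(word, corpus)

# Score two words against the second document
for word in ["codeup", "scholarship"]:
    print(word, round(tf_idf(word, tokenized[1], tokenized), 3))
```

"codeup" scores 0 because it appears in every document, while "scholarship" gets a positive weight because it is frequent in one document and absent from the others.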


## So What?
- Determining what a document is about is both valuable and challenging.
- Term frequency is super sensitive to noise.
- TF-IDF is super common and has been used in the majority of text-based recommendation systems. See [tf-idf in Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
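
To see why raw term frequency is noisy, here is a small sketch that counts the words in an invented sentence. The text is an assumption made up for illustration only.

```python
from collections import Counter

# Invented sample text, used only to illustrate the point
text = "the captain of the ship gave the order to the crew of the ship"
counts = Counter(text.split())

# The most frequent term is a function word, not a word that carries the topic
print(counts.most_common(3))
```

The word "the" comes out on top even though it says nothing about the subject, which is the kind of noise that IDF weighting is meant to dampen.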

## Now What?


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import pandas as pd
import numpy as np
import re
import unicodedata
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [2]:
df = pd.read_csv("tng.csv")

top_15_characters = df.character.value_counts().index[0:15]

top_15 = df[df.character.isin(top_15_characters)]
top_15

Unnamed: 0,episode_name,line,character
0,Encounter at Farpoint,Difficult? Simply solve the mystery of Farpoi...,DATA
1,Encounter at Farpoint,As simple as that.,PICARD
2,Encounter at Farpoint,Farpoint Station. Even the name sounds myster...,TROI
3,Encounter at Farpoint,"It's hardly simple, Data, to negotiate a frie...",PICARD
4,Encounter at Farpoint,Inquiry. The word snoop?,DATA
...,...,...,...
51983,All Good Things,Of course. Have a seat.,RIKER
51984,All Good Things,"Would you care to deal, sir?",DATA
51985,All Good Things,"Oh, er, thank you, Mister Data. Actually, I u...",PICARD
51986,All Good Things,You were always welcome.,TROI


In [3]:
df.shape

(51988, 3)

In [4]:
ADDITIONAL_STOPWORDS = ['r', 'u', '2', 'ltgt']

def clean(text):
    'A simple function to clean up text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
             .encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return " ".join([wnl.lemmatize(word) for word in words if word not in stopwords])
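
The normalization steps inside `clean` can be seen in isolation with the standard library alone. The sketch below leaves out the NLTK stopword and lemmatization steps (they need downloaded corpora) and uses a made-up input string. `basic_clean` is a hypothetical helper name, not part of this notebook.

```python
import re
import unicodedata

def basic_clean(text):
    # Decompose accented characters, drop anything non-ASCII, then lowercase
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    # Remove everything that is not a word character or whitespace, then tokenize
    return re.sub(r'[^\w\s]', '', text).split()

print(basic_clean("Café? It's hardly simple, Data!"))
```

Note that removing the apostrophe turns "It's" into "its", so contractions lose their original form before stopword filtering runs.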

In [5]:
df.line = df.line.apply(clean)
df.head()

Unnamed: 0,episode_name,line,character
0,Encounter at Farpoint,difficult simply solve mystery farpoint station,DATA
1,Encounter at Farpoint,simple,PICARD
2,Encounter at Farpoint,farpoint station even name sound mysterious,TROI
3,Encounter at Farpoint,hardly simple data negotiate friendly agreemen...,PICARD
4,Encounter at Farpoint,inquiry word snoop,DATA


In [6]:
# We'll use this split function later to create in-sample and out-of-sample datasets for modeling
def split(df, stratify_by=None):
    """
    3 way split for train, validate, and test datasets
    To stratify, send in a column name
    """
    
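    # Only stratify when a column name is given; df[None] would raise a KeyError
    strata = df[stratify_by] if stratify_by else None
    train, test = train_test_split(df, test_size=.2, random_state=123, stratify=strata)

    strata = train[stratify_by] if stratify_by else None
    train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=strata)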
    
    return train, validate, test
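
The two chained calls in `split` leave roughly 56% of the rows for train (80% x 70%), 24% for validate (80% x 30%), and 20% for test. The sketch below checks those proportions on 1,000 made-up rows with four evenly sized toy classes. The data is an assumption for illustration and is not the `tng.csv` dataset.

```python
from sklearn.model_selection import train_test_split

# 1,000 toy rows with four evenly sized, made-up classes
rows = list(range(1000))
labels = [i % 4 for i in rows]

# First cut: hold out 20% for test, stratified by label
train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=.2, random_state=123, stratify=labels)

# Second cut: hold out 30% of what remains for validate
train_rows, validate_rows, train_labels, validate_labels = train_test_split(
    train_rows, train_labels, test_size=.3, random_state=123, stratify=train_labels)

print(len(train_rows), len(validate_rows), len(test_rows))
```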

In [7]:
train, validate, test = split(top_15, 'character')

In [8]:
# Set up our X variables
X_train = train.line
X_validate = validate.line
X_test = test.line

In [9]:
# Set up our y variables
y_train = train.character
y_validate = validate.character
y_test = test.character

In [10]:
# Create the tfidf vectorizer object
tfidf = TfidfVectorizer()

# Fit on the training data
tfidf.fit(X_train)

# Transform each split with the vectorizer that was fit on train
X_train_vectorized = tfidf.transform(X_train)
X_validate_vectorized = tfidf.transform(X_validate)
X_test_vectorized = tfidf.transform(X_test)
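
Fitting on train only means the vocabulary (and therefore the columns) comes from the training lines, and words that appear only in validate or test are ignored at transform time. Here is a minimal sketch of that behavior, using made-up lines rather than the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and validation lines
train_lines = ["engage warp drive", "shields up red alert"]
validate_lines = ["warp drive offline", "tea earl grey hot"]

tfidf = TfidfVectorizer()
tfidf.fit(train_lines)  # the vocabulary comes from the training lines only

vectorized = tfidf.transform(validate_lines)

print(sorted(tfidf.vocabulary_))      # words learned from train
print(vectorized.shape)               # one column per training-vocabulary word
print(vectorized.toarray().round(2))  # the second row is all zeros: no overlap with train
```

This is also why the prediction inputs must go through the same fitted `tfidf` object: the model expects exactly the columns it was trained on.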

In [11]:
# Now that the dataset is vectorized, fit the classification model on it.
lm = LogisticRegression().fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
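
The warning above means the solver stopped before converging. One option it suggests is raising `max_iter`. The sketch below shows where that parameter goes, using a handful of made-up lines so it runs on its own. The value 1000 is an arbitrary assumption and has not been tuned for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up lines and speakers, only to make the example runnable
lines = ["engage warp drive", "make it so", "intriguing sir", "i am an android"]
speakers = ["PICARD", "PICARD", "DATA", "DATA"]

X = TfidfVectorizer().fit_transform(lines)

# max_iter defaults to 100; 1000 is an arbitrary larger value chosen for illustration
lm = LogisticRegression(max_iter=1000).fit(X, speakers)

print(lm.n_iter_)  # iterations the solver actually used
```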


In [12]:
train = pd.DataFrame(dict(actual=y_train))
validate = pd.DataFrame(dict(actual=y_validate))
test = pd.DataFrame(dict(actual=y_test))

In [13]:
train['predicted'] = lm.predict(X_train_vectorized)
validate["predicted"] = lm.predict(X_validate_vectorized)
test['predicted'] = lm.predict(X_test_vectorized)

In [14]:
# Train Accuracy
(train.actual == train.predicted).mean()

0.5438073394495413

In [15]:
# Validate Accuracy
(validate.actual == validate.predicted).mean()

0.3987585616438356

In [16]:
# Validate accuracy is noticeably lower than train accuracy, which points to overfitting.
# Creating a feature for each episode might help bring the two closer together.

In [17]:
# Now that we have a trained model, let's use it to predict the character for any given line.
lines = pd.Series([
    "we have a responsibility", 
    "set phasers to stun", 
    "the warp drive is about to go critical", 
    "What does it mean to be human? I cannot calculate feelings", 
    "Romulan bird of prey decloaking off the port bow"
])

# apply clean 
lines = lines.apply(clean)

# We have to vectorize these inputs if we're going to use the classification model.
lines = tfidf.transform(lines)
lines

<5x13346 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [18]:
lm.predict(lines)

array(['RIKER', 'RIKER', 'LAFORGE', 'DATA', 'WORF'], dtype=object)
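
`predict` returns only the most likely label. To see how confident the model is, `predict_proba` returns one probability per class, in the order given by `classes_`. A minimal sketch on made-up lines (not the notebook's trained model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up lines and speakers, only to make the example runnable
lines = ["engage warp drive", "make it so", "intriguing sir", "i am an android"]
speakers = ["PICARD", "PICARD", "DATA", "DATA"]

tfidf = TfidfVectorizer().fit(lines)
lm = LogisticRegression().fit(tfidf.transform(lines), speakers)

# Vectorize a new line with the same fitted vectorizer, then ask for class probabilities
new_line = tfidf.transform(["warp drive is offline"])
probabilities = lm.predict_proba(new_line)[0]

for speaker, probability in zip(lm.classes_, probabilities):
    print(speaker, round(probability, 2))
```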

In [19]:
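# Note: the labels are upper case ("WESLEY"), so filtering on "Wesley" matches no rows
# and the accuracy below comes out as nan.
wesley = train[train.actual == "Wesley"]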

In [20]:
# Accuracy 
(wesley.actual == wesley.predicted).mean()

nan

In [21]:
from sklearn.metrics import classification_report
print(classification_report(train.actual, train.predicted))

  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

    COMPUTER       0.97      0.24      0.38       272
     CRUSHER       0.72      0.28      0.41      1521
        DATA       0.62      0.72      0.66      3008
      GUINAN       0.00      0.00      0.00       234
     LAFORGE       0.63      0.48      0.55      2048
     LWAXANA       1.00      0.00      0.01       201
      PICARD       0.49      0.88      0.63      5984
     PULASKI       1.00      0.00      0.01       256
           Q       1.00      0.01      0.01       274
       RIKER       0.51      0.48      0.50      3421
          RO       0.75      0.02      0.03       194
       TASHA       1.00      0.01      0.02       247
        TROI       0.56      0.27      0.36      1612
      WESLEY       0.73      0.06      0.11       693
        WORF       0.61      0.42      0.50      1835

    accuracy                           0.54     21800
   macro avg       0.71      0.26      0.28     21800
weighted avg       0.59   
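
The support column shows how imbalanced the classes are, which gives a useful baseline: a model that always predicts the most common character. The sketch below computes that baseline from the support values in the report above. The dictionary is simply those numbers typed in by hand.

```python
# Support (number of training lines) per character, copied from the report above
support = {
    "COMPUTER": 272, "CRUSHER": 1521, "DATA": 3008, "GUINAN": 234,
    "LAFORGE": 2048, "LWAXANA": 201, "PICARD": 5984, "PULASKI": 256,
    "Q": 274, "RIKER": 3421, "RO": 194, "TASHA": 247,
    "TROI": 1612, "WESLEY": 693, "WORF": 1835,
}

total = sum(support.values())
most_common = max(support, key=support.get)

# Accuracy of always guessing the most common character
baseline_accuracy = support[most_common] / total

print(most_common, total, round(baseline_accuracy, 2))
```

Always guessing PICARD would be right about 27% of the time on train, so the model's 0.54 train accuracy should be read against that number rather than against zero.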

In [22]:
characters = train.actual.value_counts().index.tolist()

In [23]:
for character in characters:
    character_lines = train[train.actual == character]
    accuracy = (character_lines.actual == character_lines.predicted).mean()
    print(f'Predicting {character} has {round(accuracy, 2)}')

Predicting PICARD has 0.88
Predicting RIKER has 0.48
Predicting DATA has 0.72
Predicting LAFORGE has 0.48
Predicting WORF has 0.42
Predicting TROI has 0.27
Predicting CRUSHER has 0.28
Predicting WESLEY has 0.06
Predicting Q has 0.01
Predicting COMPUTER has 0.24
Predicting PULASKI has 0.0
Predicting TASHA has 0.01
Predicting GUINAN has 0.0
Predicting LWAXANA has 0.0
Predicting RO has 0.02
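
The per-character numbers printed by the loop match the recall column of the classification report (for example, 0.88 for PICARD). That is expected: accuracy computed only over rows whose actual label is a given character is the same quantity as recall for that character. A small sketch with made-up labels shows the equivalence. `accuracy_within` and `recall` are hypothetical helper names.

```python
# Made-up actual and predicted labels
actual    = ["PICARD", "PICARD", "PICARD", "DATA", "DATA",   "WORF"]
predicted = ["PICARD", "PICARD", "DATA",   "DATA", "PICARD", "PICARD"]

def accuracy_within(character):
    # Accuracy restricted to rows whose actual label is this character
    rows = [(a, p) for a, p in zip(actual, predicted) if a == character]
    return sum(a == p for a, p in rows) / len(rows)

def recall(character):
    # Recall: true positives / (true positives + false negatives)
    pairs = list(zip(actual, predicted))
    true_positives = sum(a == character and p == character for a, p in pairs)
    false_negatives = sum(a == character and p != character for a, p in pairs)
    return true_positives / (true_positives + false_negatives)

for character in ["PICARD", "DATA", "WORF"]:
    print(character, round(accuracy_within(character), 2), round(recall(character), 2))
```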
