# Supervised learning with NLP

Use bag-of-words (BOW) counts or tf-idf weights as features  
Example: predict whether a movie is Sci-Fi or Action based on its plot summary  
Other possible features: number of words, named entities, language

* The dataset consists of movie plots and their genres  
* Goal: create bag-of-words vectors for the movie plots  
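
As a minimal sketch of what a bag-of-words vector is, the snippet below builds one by hand using only the standard library. The two plot strings and the helper name `bag_of_words` are made up for illustration; `CountVectorizer` additionally takes care of tokenization details, stop words, and sparse storage.

```python
from collections import Counter

# Hypothetical toy plots, not taken from the dataset
plots = [
    "aliens invade earth and aliens win",
    "a cop chases a thief across the city",
]

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary word: how often it occurs in this document
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(plots)
print(vocabulary)
print(vectors[0])
```

Every document gets a vector of the same length (the vocabulary size), which is what lets a classifier treat each word as a feature.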

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = ...
y = df['Sci-Fi']
X_train, X_test, y_train, y_test = train_test_split(
    df['plot'], y, test_size=0.33, random_state=53)
count_vectorizer = CountVectorizer(stop_words='english')

# Create bag-of-words vectors: learn the word-to-id mapping on the training plots,
# then count how many times each word appears in each plot
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)


## TfidfVectorizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

tfidf_train = tfidf_vectorizer.fit_transform(X_train.values)
tfidf_test = tfidf_vectorizer.transform(X_test.values)

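# get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn
print(tfidf_vectorizer.get_feature_names_out()[:10])
# .toarray() converts the sparse matrix to a dense array (the .A shorthand is deprecated)
print(tfidf_test.toarray()[:10])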

In [None]:
# Convert to DataFrames to get a better idea of what is in each vector
count_df = pd.DataFrame(count_train.toarray(),
                        columns=count_vectorizer.get_feature_names_out())
tfidf_df = pd.DataFrame(tfidf_train.toarray(),
                        columns=tfidf_vectorizer.get_feature_names_out())
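
To make the tf-idf weights less of a black box, here is a rough hand-rolled sketch. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as its default, and it skips the L2 normalization that `TfidfVectorizer` applies afterwards. The corpus and the helper name `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset
docs = [
    "robot fights robot",
    "robot saves city",
    "cop saves city",
]

def tfidf(docs):
    """Return one {word: weight} dict per document using tf * smoothed idf."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log((1 + n_docs) / (1 + df)) + 1
           for word, df in doc_freq.items()}
    return [
        {word: count * idf[word] for word, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

weights = tfidf(docs)
print(weights[0])
```

A word that appears in few documents ("fights") gets a larger weight per occurrence than one that appears in many ("robot"), which is the point of the idf term.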

## Naive Bayes

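Naive Bayes answers the question: given the data, how likely is each outcome?  
Each word from CountVectorizer acts as a feature  
It **doesn't** work as well with floating-point features such as tf-idf weights; for those, consider an SVM or a linear model.  
It is always a good idea to try Naive Bayes first as a baseline to see whether it already works well.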
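
The idea can be sketched from scratch in a few lines. This is an illustrative toy version of multinomial Naive Bayes with additive smoothing, not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the training sentences are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit per-class log priors and smoothed per-word log likelihoods."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the label with the highest score; words outside the vocabulary are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)

# Hypothetical toy data: 1 = Sci-Fi, 0 = Action
train_docs = ["alien ship lands", "robot alien war",
              "car chase explosion", "cop car shootout"]
train_labels = [1, 1, 0, 0]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "alien robot"))
print(predict_nb(model, "car chase"))
```

Each word contributes its own log likelihood independently of the others, which is the "naive" assumption and the reason word counts map so directly onto features.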

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [None]:
nb_classifier = MultinomialNB()
# Fit on the CountVectorizer features
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
metrics.accuracy_score(y_test, pred)

In [None]:
metrics.confusion_matrix(y_test, pred, labels=[0,1])
# Rows are true labels, columns are predicted labels, so with labels=[0,1]:
# [['tn', 'fp'],
#  ['fn', 'tp']]
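
To check which cell is which, it can help to tally a small example by hand. The sketch below follows scikit-learn's documented convention of true labels on rows and predicted labels on columns; the helper name `confusion_counts` and the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Build a matrix whose rows are true labels and columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix

# Hypothetical labels: 1 = Sci-Fi, 0 = not Sci-Fi
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# With labels ordered (0, 1), the first row holds the true negatives and false positives
(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)
```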

## Simple NLP, complex problems
* Translation has a long way to go  
    * Word vectors between languages don't always align, especially for more intricate differences
* Sentiment analysis
    * Sarcasm, "I liked it but it could have been better"  
* Prejudice and biases  

### Improving model
Tune the Naive Bayes smoothing parameter alpha on the tf-idf vectors  

In [None]:
# alpha=0 disables smoothing; scikit-learn may warn about it
alphas = np.arange(0, 1, 0.1)
def train_and_predict(alpha):
    # Fit a classifier with the given smoothing and return its test accuracy
    nb_classifier = MultinomialNB(alpha=alpha)
    nb_classifier.fit(tfidf_train, y_train)
    pred = nb_classifier.predict(tfidf_test)
    score = metrics.accuracy_score(y_test, pred)
    return score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
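
For intuition on what alpha changes: in `MultinomialNB` it is the additive smoothing parameter. A small sketch of the smoothed estimate follows; the function name `smoothed_probability` and all the numbers are hypothetical and chosen only to show the effect.

```python
def smoothed_probability(word_count, class_total, vocab_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical numbers: a word never seen in a class that has 100 tokens,
# with a vocabulary of 50 words
for alpha in (0.0, 0.1, 0.5, 1.0):
    p = smoothed_probability(word_count=0, class_total=100, vocab_size=50, alpha=alpha)
    print(f"alpha={alpha}: P(word | class) = {p:.4f}")
```

With alpha at 0 an unseen word gets probability zero, which wipes out the whole class score; larger values move the estimates toward a uniform distribution over the vocabulary.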

### Inspecting the model
Map important vector weights back to actual words using inspection techniques

In [None]:
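# Refit on the tf-idf features so the weights line up with tfidf_vectorizer's vocabulary
nb_classifier = MultinomialNB().fit(tfidf_train, y_train)
class_labels = nb_classifier.classes_
feature_names = tfidf_vectorizer.get_feature_names_out()
# coef_ is no longer available on MultinomialNB in recent scikit-learn versions;
# feature_log_prob_[1] holds the per-word log probabilities for the second class
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))

# Print the first class label and the 20 lowest-weighted features
print(class_labels[0], feat_with_weights[:20])
# Print the second class label and the 20 highest-weighted features
print(class_labels[1], feat_with_weights[-20:])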