# Training our own sentiment analysis

But we're going to do it differently this time! Not just a list of pre-programmed words.

## Read in our data

Lots of docs? `glob.glob`, same as always.

In [1]:
import glob

# Go into reviews, then go into txt_sentoken,
# then go into BOTH 'pos' and 'neg' directories
# then get every text file inside of there
filenames = glob.glob("reviews/txt_sentoken/*/*.txt")
content = [open(filename).read() for filename in filenames]

In [None]:
import pandas as pd

# Create a dataframe
df = pd.DataFrame({
    'filename': filenames,
    'content': content
})

# Let's see them
df.head(2)

## The sentiment is in the filename, let's extract it

In [None]:
# Extract out the sentiment from the filename
df['sentiment'] = df.filename.str.extract("txt_sentoken/(.*)/cv", expand=False)
df.head(2)

In [None]:
df.tail(2)
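
To see what that regular expression is doing, here's a small sketch on two made-up filenames. The exact file names are assumptions, only the `txt_sentoken/<sentiment>/cv...` shape matters. The parentheses capture whatever sits between `txt_sentoken/` and `/cv`.

```python
import pandas as pd

# Hypothetical filenames shaped like the ones glob gave us
sample = pd.DataFrame({
    'filename': [
        "reviews/txt_sentoken/neg/cv000_00000.txt",
        "reviews/txt_sentoken/pos/cv001_00001.txt",
    ]
})

# The (.*) group captures the folder name between txt_sentoken/ and /cv
sample['sentiment'] = sample.filename.str.extract("txt_sentoken/(.*)/cv", expand=False)
print(sample.sentiment.tolist())
```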

# What are we doing? An introduction to CLASSIFICATION.

We have a bunch of movie reviews in categories. If someone sends us a new review, which category does it belong in?

We're going to train a classifier to recognize positive and negative reviews, so that if someone sends us a new review, we'll know if it's something we want to see without having to actually read the review.

RULE IS: For classification algorithms, YOU MUST HAVE CATEGORIES ON YOUR ORIGINAL DATASET.

**For clustering**

1. You'll get a lot of documents
2. You feed it to an algorithm, tell it to create `x` number of categories
3. The machine gives you back categories whether they make sense or not

**For classification (which we are doing now)**

1. You'll get a lot of documents
2. You'll classify some of them into categories that you know and love
3. You'll ask the algorithm what categories a new bunch of unlabeled documents end up in

All mean the same thing: CATEGORY = CLASS = LABEL
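
Here's a rough sketch of the difference, using four made-up documents. `KMeans` is just one clustering algorithm picked for illustration, and `random_state` is only there to make the run repeatable. Notice that clustering never sees any labels, while the classifier can't be trained without them.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up documents, just for illustration
docs = [
    "great movie loved it",
    "loved the great acting",
    "terrible movie hated it",
    "hated the terrible acting",
]
X = CountVectorizer().fit_transform(docs)

# Clustering: no labels, we only say how many groups we want
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)

# Classification: we supply the labels ourselves (1 = positive, 0 = negative)
labels = [1, 1, 0, 0]
clf = MultinomialNB().fit(X, labels)
print(clf.predict(X))
```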

The reason why you use machine learning is to not do things manually. So if you can do things manually, do it. Otherwise just try different algorithms until one works well (but you might need to know some upsides or downsides of each to interpret that).

# Why do we want to classify anything?

Hmmmm, maybe to [identify fake news](http://www.fakenewschallenge.org/) without reading every story ever?

# How are we classifying? NAIVE BAYES.

## How does Naive Bayes work?

NAIVE BAYES WORKS WITH TEXT (kind of)

**Bayes Theorem (kind of)**

* If you see a word that is normally in a spam email, there's a higher chance it's spam
* If you see a word that is normally in a non-spam email, there's a higher chance it's not spam

**Naive:** it assumes every word/feature/etc is independent of every other word (not really true of language, which is why it's called naive)

FOR US: If you see words that are normally in positive reviews, it's probably a positive review.

Secret trick: you can't just use text, you have to convert into numbers (vectorization to the rescue)
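
To make that concrete, here's a rough sketch of the idea in plain Python on made-up reviews. It counts how often each word shows up in each category, then adds up the log-probabilities of the words in a new sentence (same as multiplying the probabilities, just without the tiny numbers). The add-one smoothing is an assumption added here so an unseen word doesn't zero everything out, and since both categories have the same number of reviews the sketch skips the "how common is each category" part.

```python
import math
from collections import Counter

# Made-up training reviews, grouped by category
training = {
    'pos': ["great movie loved it", "loved the great acting"],
    'neg': ["terrible movie hated it", "hated the terrible acting"],
}

# Count how often each word appears in each category
counts = {label: Counter(" ".join(texts).split()) for label, texts in training.items()}
vocab = set(word for c in counts.values() for word in c)

def score(text, label):
    """Log-probability of the words in text given the label (add-one smoothing)."""
    total = sum(counts[label].values())
    return sum(
        math.log((counts[label][word] + 1) / (total + len(vocab)))
        for word in text.split()
        if word in vocab  # skip words we never saw in training
    )

def classify(text):
    # "Naive": each word contributes independently, so we just add up the logs
    return max(counts, key=lambda label: score(text, label))

print(classify("great acting"))     # pos
print(classify("terrible movie"))   # neg
```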

## Types of Naive Bayes

Naive Bayes works on words, and SOMETIMES your text is long and SOMETIMES your text is short.

**Multinomial Naive Bayes - (multiple numbers):** You count the words. You care about whether a word appears once or twice or three times or ten times. *This is better for long passages*

**Bernoulli Naive Bayes - True/False Bayes:** You only care if the word shows up (`True`) or it doesn't show up (`False`) - *this is better for short passages*
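
The difference shows up in the features each one looks at. A small sketch on two made-up sentences, using `CountVectorizer`'s `binary=True` option to get the true/false version:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["good good good movie", "bad movie"]

# Multinomial-style features: how many times does each word appear?
counts = CountVectorizer().fit_transform(texts).toarray()
print(counts)

# Bernoulli-style features: does the word appear at all?
flags = CountVectorizer(binary=True).fit_transform(texts).toarray()
print(flags)
```

The columns are `bad`, `good`, `movie`. The first version records that `good` showed up three times, the second only that it showed up.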

What kind of Bayes should we use this time?

# Preparing our data (2 steps)

Remember, machine learning **only likes numbers**, so we need to jump through some hoops first.

## Prep 1: Convert our labels into numbers

Our labels are only `neg` and `pos`, so maybe we could just make positive 1 and negative 0?

This is a way to do it if you like being fancy and manually doing things...

In [None]:
def make_label(row):
    if row.sentiment == "pos":
        return 1
    else:
        return 0

df['sentiment_label'] = df.apply(make_label, axis=1)
df.head(3)

...but with 0/1 it's much easier to say "sentiment, are you positive?" and get `True` and `False` and then use `.astype(int)` to convert it to `0`/`1`.

In [None]:
df['sentiment_label'] = (df.sentiment == 'pos').astype(int)
df.head(3)

In [None]:
df.tail(3)

## Prep 2: We need to build our list of features

That's going to be our **list of words**. Same vectorizer thing we **always use**.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words='english')
matrix = vec.fit_transform(df.content)
words_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
words_df.head()

# Using our classifier

## Step 1: Import and create the classifier

What kind of Bayes classifier are we going to use?

In [None]:
from sklearn import naive_bayes

# Our reviews are long and we counted the words, so Multinomial it is
clf = naive_bayes.MultinomialNB()

## Step 2: Training the classifier

Teaching a classifier is called **training** or **fitting**. The classifier needs us to give it two things:
    
* **Our training features**: what the words are
* **Our training labels**: whether it's positive or negative

### What are our features?

Remember, numbers only!

### What are our labels?

Remember, numbers only!

### Now let's actually train/teach/fit our classifier

Remember, this is called **fitting**, so it's going to be `clf.fit`. We'll also refer to it as training or maybe even teaching.

* First parameter is the **features**
* Second parameter is the **labels**
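
A minimal sketch of the call, assuming a Multinomial classifier and a four-review made-up training set (with the real data you'd pass in the features and labels we built above instead). It uses a plain `CountVectorizer()` without stop words to keep the toy example predictable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set: 1 = positive, 0 = negative
train_texts = [
    "terrible movie, I hate it",
    "no good at all, a terrible waste",
    "great movie, I love it",
    "I love this director, great work",
]
train_labels = [0, 0, 1, 1]

vec = CountVectorizer()
features = vec.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(features, train_labels)   # features first, labels second

print(clf.classes_)
```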

So easy.

No errors = we did a good job.

## Step 3: Using the classifier

Now that we trained it, how do we use it? Well, the point of a classifier is to **process new content to see which category it belongs in**, so let's get some new content.

In [None]:
texts = [
    "I hate this movie it's terrible it's no good at all",
    "I love this director his movies are great"
]

### Step 3.1: Preparing our incoming data

Remember when we did `vec.fit_transform`? That had the vectorizer learn all of the words (fit) and then count all the words (transform). Since it's already memorized all of the words that went into the classifier, we don't want to teach it any new ones - we just want it to count some new sentences.
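
That means `vec.transform` instead of `vec.fit_transform`. A small sketch on made-up sentences, showing that the new matrix keeps the same columns and that unknown words are just dropped:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["great movie, I love it", "terrible movie, I hate it"]
vec = CountVectorizer()
train_matrix = vec.fit_transform(train_texts)   # learn the words AND count them

new_texts = ["I love this spectacular movie"]
new_matrix = vec.transform(new_texts)           # only count words it already knows

# Same number of columns as the training matrix; "this" and
# "spectacular" were never learned, so they are silently ignored
print(train_matrix.shape, new_matrix.shape)
print(new_matrix.toarray())
```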

### Step 3.2: Predicting our incoming data

Once we have another matrix, we can use `clf.predict` to predict the labels for our sentences. We have to give it the **matrix**, though, since it doesn't understand words.
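
Putting the pieces together, here's a sketch that should run on its own. The four training reviews are made up and stand in for the real dataset, and the classifier is assumed to be Multinomial.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set standing in for the real reviews
train_texts = [
    "terrible movie, I hate it",
    "no good at all, a terrible waste",
    "great movie, I love it",
    "I love this director, great work",
]
train_labels = [0, 0, 1, 1]   # 1 = positive, 0 = negative

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)

texts = [
    "I hate this movie it's terrible it's no good at all",
    "I love this director his movies are great",
]
predictions = clf.predict(vec.transform(texts))   # give it the matrix, not the raw text
print(predictions)
```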

That `[0,1]` matches up with our labels - it means the first one is negative and the second one is positive. What words does it use to decide? **Use this cut-and-pasted code!**

In [None]:
# Adapted from http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
import numpy as np

# Print the 10 words with the highest log-probability for each class.
# Rows of feature_log_prob_ follow clf.classes_, so 0 = neg and 1 = pos.
# (Naive Bayes classifiers no longer have coef_ in recent scikit-learn.)
class_labels = ['neg', 'pos']
feature_names = vec.get_feature_names_out()
for i, class_label in enumerate(class_labels):
    top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
    print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top10)))

## Step 4: Testing our classifier

The big question here is **but is our classifier actually any good?**

Even though we **built** a classifier, that doesn't actually **mean anything.** Maybe it's horrible. How can we test it?

It'd be cheating if we tested it against something it already knew, so we need to get a little fancier than that.

### Step 4.1: Preparing our split

I always call this test/train split and then type it wrong.

In [None]:
# train_test_split will split our data into two parts
from sklearn.model_selection import train_test_split

# Splitting into...
# X = are all our features
# y = are all our labels
# X_train are our features to train on (80%)
# y_train are our labels to train on (80%)
# X_test are our features to test on (20%)
# y_test are our labels to test on (20%)

X_train, X_test, y_train, y_test = train_test_split(
    words_df.values, 
    df.sentiment_label, 
    test_size=0.2) 

# the first parameter is our FEATURES. can't just do words_df, it won't work :(
# the second parameter is the LABEL as a number (so 0/1, not neg/pos)
# 80% training, 20% testing

### What do those variables look like?

What are `X_train` and `X_test`?

What are `y_train` and `y_test`?
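
One way to poke at them is to check their shapes. A sketch with made-up numbers in place of our word counts (`random_state` is added here only so the split is repeatable):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 "documents" with 3 word-count columns each
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, y_train.shape)   # 8 rows to train on
print(X_test.shape, y_test.shape)     # 2 rows held back for testing
```

The `X` pieces are still grids of numbers with one row per document, and the `y` pieces are the matching 0/1 labels, just cut into a big chunk and a small chunk.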

### Step 4.2: Fitting and scoring against our test data

Now we'll fit again with **only our training data**, then test it against the **testing data** with `clf.score`.

While we're at it, how does it do with things it's already seen?
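
A sketch of both scores on ten made-up reviews, assuming a Multinomial classifier. With a dataset this tiny the numbers themselves mean nothing, it's only here to show the calls.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up reviews: first five positive (1), last five negative (0)
texts = [
    "great movie, I love it",
    "I love this director, great work",
    "wonderful acting and a great story",
    "a lovely, funny, wonderful film",
    "the best film this year, loved it",
    "terrible movie, I hate it",
    "no good at all, a terrible waste",
    "awful acting and a boring story",
    "a dull, boring, awful film",
    "the worst film this year, hated it",
]
labels = [1] * 5 + [0] * 5

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)   # only the training data

test_score = clf.score(X_test, y_test)      # things it has never seen
train_score = clf.score(X_train, y_train)   # things it already knows
print("Test score: ", test_score)
print("Train score:", train_score)
```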

# What about other classifiers?

There are a handful of other classifiers, and some might be better! While there are general rules about what kind to use, usually you just switch around until you get the best performance. We'll go more in-depth next week, but let's play around with one for now.

## Decision Trees

This is a simple one to understand called a **decision tree**. They're usually pretty good.

### Step 1: Import and create the classifier

In [None]:
from sklearn import tree

# Let's add max_depth=3 before we draw!
clf = tree.DecisionTreeClassifier()

### Step 2: Train and score the new classifier

We have fewer steps this time because we've already **converted our data to numbers**, done our splits and all of that.

Okay, they might not be that good **this** time, but how does it do on data it's already seen?
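
Same fit-then-score routine as before, sketched on ten made-up reviews so it runs on its own. The test score on a set this small is meaningless, but look at what the tree does with the data it trained on.

```python
from sklearn import tree
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up reviews: first five positive (1), last five negative (0)
texts = [
    "great movie, I love it",
    "I love this director, great work",
    "wonderful acting and a great story",
    "a lovely, funny, wonderful film",
    "the best film this year, loved it",
    "terrible movie, I hate it",
    "no good at all, a terrible waste",
    "awful acting and a boring story",
    "a dull, boring, awful film",
    "the worst film this year, hated it",
]
labels = [1] * 5 + [0] * 5

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# No max_depth, so the tree keeps splitting until every training review is sorted
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

test_score = clf.score(X_test, y_test)
train_score = clf.score(X_train, y_train)
print("Test score: ", test_score)
print("Train score:", train_score)   # 1.0, it memorized the training data
```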

What a wreck! This is called **overfitting** and we'll, yes, talk about it more next week.

### Step 3: Understanding

As a reward for your hard work, let's draw a pretty picture. We'll need to **remake the classifier so it can be drawn** - go up and change the classifier to `clf = tree.DecisionTreeClassifier(max_depth=3)` and re-run the fitting and scoring.

In [None]:
%matplotlib inline

import pydotplus
from IPython.display import Image  

dot_data = tree.export_graphviz(clf, out_file=None, 
    feature_names=list(vec.get_feature_names_out()),
    class_names=clf.classes_.astype(str),  
    filled=True, rounded=True,  proportion=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())

# Epilogue

Let's read some [fake news challenge code](https://github.com/FakeNewsChallenge/fnc-1-baseline/blob/master/feature_engineering.py)!