# Reticulating Splines
We are using the following packages:
* re: short for regular expressions; we use this to match patterns in text.
* copy: provides the ```copy()``` function, which returns a shallow copy of an object.
* csv: we use this to parse CSV files.
* nltk: short for Natural Language Toolkit.  This provides tools for our classifier and for text preprocessing.

In [None]:
import re
from copy import copy
import csv

from nltk import NaiveBayesClassifier
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Opening Source
We begin by opening our source file, which is a CSV containing two columns.  The first column is the text of the tweet, the second column is the label.

The syntax here may not be immediately intuitive.  The return object from the ```open()``` function is not the contents of the file as you might expect.  Instead, it returns a file object, which you can iterate over to get one line at a time.  You may need this functionality if the file is too large to fit into memory.
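
As a small illustration of that behaviour, the sketch below writes a throwaway three-line file and then loops over the file object to read it one line at a time.  The file contents and the names ```sample_path``` and ```line_lengths``` are made up for this example; they are not part of the tweet dataset.

```python
import os
import tempfile

# Write a tiny throwaway file so the example does not depend on coded_tweets.csv.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

line_lengths = []
with open(sample_path, encoding="utf8") as sample_file:
    # sample_file is a file object, not a string; looping over it yields one line at a time.
    for line in sample_file:
        line_lengths.append(len(line.rstrip("\n")))

os.remove(sample_path)  # clean up the throwaway file
print(line_lengths)
```

Only one line is held in memory at a time, which is why the same loop also works for files much larger than this one.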

In our case, it is small, so we load the whole thing using a ```for``` loop.

In [None]:
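tweets = list()

# The with statement closes the file for us, even if an error occurs while reading.
# newline="" is what the csv module recommends when opening a file for csv.reader.
with open("coded_tweets.csv", encoding="utf8", newline="") as file_handler:
    file_reader = csv.reader(file_handler)
    next(file_reader)  # The first row is the header, which we do not want, so we skip it.
    for line in file_reader:
        tweets.append(line)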

In [None]:
train = tweets[:175]
test = tweets[175:]

# Functions for Wrangling
We will apply functions to clean up our data in multiple passes.  Firstly, we will clean the whole text before it is split by spaces.  This will allow us to use regular expressions that match whole phrases.  For this we will define a ```clean_whole_text()``` function, which will take a string as its first argument and return the same string minus anything we want to clean out.

Secondly, we define a function called ```is_valid_word()``` which takes a word as its first argument and returns ```True``` if we want to include the word in our results, or ```False``` if we want to filter it out.  This is the stage where we remove stop words; the common punctuation marks have already been stripped by ```clean_whole_text()```.  We can blacklist any other word we want in this function.

Finally, we define a function called ```normalize()```, which takes a word as its first argument and returns the same word normalized.  Here is where we will do lowercasing and lemmatizing.

In [None]:
def clean_whole_text(text):
    
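    # Raw strings (r"...") stop Python from treating \S as an invalid string escape.
    match_patterns = [
        r"^RT @\S+: ",               # retweet prefix, e.g. "RT @user: "
        r"@\S+",                     # any remaining @mentions
        r"#[aA]cademic[tT]witter",   # the hashtag itself
        r"https?://\S+",             # links
        r'[().,!?"]',                # common punctuation
        r"&\S+;"                     # HTML entities such as &amp;
    ]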
    for pattern in match_patterns:
        text = re.sub(pattern, "", text)
    
    return text
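
To see what these patterns do in practice, here is a minimal sketch that applies them to a single made-up tweet.  The patterns are copied from the cell above and written as raw strings; the helper name ```strip_patterns``` and the sample text are assumptions for illustration only.

```python
import re

# Same patterns as clean_whole_text(), written as raw strings.
patterns = [
    r"^RT @\S+: ",
    r"@\S+",
    r"#[aA]cademic[tT]witter",
    r"https?://\S+",
    r'[().,!?"]',
    r"&\S+;",
]

def strip_patterns(text):
    """Remove every match of every pattern, one pattern at a time."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text

# Made-up tweet for illustration only.
sample = "RT @alice: Loving #AcademicTwitter today! https://example.com/x &amp; more"
cleaned = strip_patterns(sample)
print(cleaned.split())
```

The order of the patterns matters here: the retweet prefix has to be removed before the general ```@\S+``` pattern, otherwise the mention would be stripped first and a stray "RT" would be left behind.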

In [None]:
# A set makes the membership test below fast.
stop_words = set(stopwords.words('english'))

def is_valid_word(word):
    # Keep the word only if it is not a stop word.
    return word not in stop_words

In [None]:
lemmatizer = WordNetLemmatizer()

def normalize(word):
    # Lowercase first: the WordNet lookup is case-sensitive and may miss capitalised words.
    word = word.lower()
    return lemmatizer.lemmatize(word)

# Extract Features
This function will take a tweet string as an input and return a dict of features where the words are the keys and the values are ```True``` if the word is in the tweet, or ```False``` if it is not.

We use the ```copy()``` function here because plain assignment would not create a new dict; it would only give ```features_template``` a second name.  Every ```True``` we set would then be written into ```features_template``` itself, so each call would start with all the ```True``` values left over from the previous calls.  A shallow copy is enough because the values are plain booleans.
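
The difference can be seen in a short sketch.  The dict ```template``` and the names ```alias``` and ```fresh``` are hypothetical stand-ins, not variables from this notebook.

```python
from copy import copy

template = {"tweet": False, "text": False}

# Plain assignment only creates a second name for the same dict.
alias = template
alias["tweet"] = True
print(template["tweet"])  # the template itself was changed

template["tweet"] = False  # reset for the second half of the demo

# copy() builds a new dict, so the template stays untouched.
fresh = copy(template)
fresh["text"] = True
print(template["text"])
```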

We take this opportunity to normalize each word, and we only set it to ```True``` in the features dict if it is considered valid according to our ```is_valid_word()``` function.

In [None]:
def extract_features(text):    
    words = text.split()
    features = copy(features_template)
    
    for word in words:
        pretty = normalize(word)
        if is_valid_word(pretty): features[pretty] = True
    
    return features

# Preparing All Features from Source
Here we prepare the ```features_template``` variable, which will be our starting point for each tweet.  We first need to go through the entire dataset (including test) to capture all the words.  Then, we add each word to the ```features_template``` variable as a key with a value of ```False```.

In [None]:
features_template = dict()
for dataset in (train, test):
    for tweet in dataset:
        cleaned_tweet = clean_whole_text(tweet[0])
        for word in cleaned_tweet.split():
            pretty = normalize(word)
            if is_valid_word(pretty): features_template[pretty] = False

# Extracting Features from Train and Test
The nltk classifier is not expecting a raw tweet string.  This classifier will not split or tokenize anything for us.  We need to prepare the features in advance.  The classifier ```train()``` method expects a list of pairs.  The first item (index 0) of each pair must be a dictionary of features, and the second item (index 1) must be a label.

We go through both the training set and the testing set.  For each tweet, we clean the text using our ```clean_whole_text()``` function.  Then, we extract features and append the features with the label to the new set.

This will transform our data from a form like this:

```[["This is some tweet's text", "Some label"],["A second tweet", "Another label"]]```

And turn it into this:

```[[{"some": True, "tweet": True, "text": True, "second": False}, "Some label"], [{"some": False, "tweet": True, "text": False, "second": True}, "Another label"]]```
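
The two passes can be sketched on toy data without NLTK.  Everything prefixed with ```toy_``` is a hypothetical stand-in: the stop word set is a tiny hand-picked one, and the tokenizer only lowercases and splits, so the result is a simplified picture of what the real cells produce.

```python
from copy import copy

# Hypothetical stand-ins for the NLTK stop word list and the labelled tweets.
toy_stop_words = {"this", "is", "a", "some"}
toy_tweets = [
    ["This is some tweet text", "Some label"],
    ["A second tweet", "Another label"],
]

def toy_tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in toy_stop_words]

# Pass 1: collect every word as a key with a default value of False.
toy_template = {}
for text, _label in toy_tweets:
    for word in toy_tokens(text):
        toy_template[word] = False

# Pass 2: copy the template for each tweet and switch on the words it contains.
toy_features = []
for text, label in toy_tweets:
    features = copy(toy_template)
    for word in toy_tokens(text):
        features[word] = True
    toy_features.append((features, label))

print(toy_features[1])
```

Every feature dict ends up with the same keys, and only the values differ from tweet to tweet.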

In [None]:
train_features = list()
for tweet in train:
    cleaned_tweet = clean_whole_text(tweet[0])
    features = extract_features(cleaned_tweet)
    label = tweet[1]
    train_features.append((features, label))
    
test_features = list()
for tweet in test:
    cleaned_tweet = clean_whole_text(tweet[0])
    features = extract_features(cleaned_tweet)
    label = tweet[1]
    test_features.append((features, label))

# Training the Classifier
This ```train()``` method will return a trained classifier object.  We can take that classifier and ask it to predict a label for any other features we give it.

In [None]:
classifier = NaiveBayesClassifier.train(train_features)

# Predicting Labels

In [None]:
for tweet in test_features:
    predicted_label = classifier.classify(tweet[0])
    print("Agree" if predicted_label == tweet[1] else "Disagree")
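
A long column of "Agree" and "Disagree" is hard to read at a glance, so one option is to tally the agreements into a single figure.  The sketch below uses made-up label lists (```predicted_labels``` and ```actual_labels``` are hypothetical, as are the label names) rather than the real classifier output.

```python
# Hypothetical labels; in the notebook these would come from classifier.classify()
# and from the second item of each entry in test_features.
predicted_labels = ["positive", "negative", "positive", "positive"]
actual_labels = ["positive", "positive", "positive", "negative"]

# Count the positions where the prediction matches the hand-coded label.
agreements = sum(
    1 for predicted, actual in zip(predicted_labels, actual_labels) if predicted == actual
)
accuracy = agreements / len(actual_labels)
print(f"{agreements} of {len(actual_labels)} agree ({accuracy:.0%})")
```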