# Training a Positive / Negative Text Classifiier

Before doing our subreddit analysis, we will train a model to classify whether a title is positive or negative. We will be using a dataset of 1,600,000 labelled Tweets ([download](http://kaggle.com/kazanova/sentiment140)).

## Setup
The dependancies are listed out in requirements.txt. They can be quickly installed with Pip by running the following command

`python -m pip -r requirements.txt`

## Summary
We succesfully preprocessed a corpus of Twitter Tweets and used them to train a Naive-Bayes Classifier with ~71% accuracy. The model has been pickled to `bin/classifer.o` for future usage.

In [117]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer as wnl
from nltk.corpus import stopwords, wordnet
from nltk.tag import pos_tag
from nltk import classify, NaiveBayesClassifier

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

import re, random, pickle

## Preprocessing Functions
Some things that need to be done before we can train our model.

1. Tokenization
   All our tweets need to be tokenized to be processed. Tokenization isn't as simple as something like `str.split('.')` though. For example, "Mr.John likes iced coffee". If we ran this code, then the sentence would be improperly tokenized since "Mr.John" should not be split into 2 tokens.

2. Lemmatization
   This is the process of mapping a word to its root. It is similar to stemming. Many words have the same meaning. For instance, "great", "greater", and "greatest" all have the same root.

3. Normalization - Stopword removal + Lowercasing

   Uppercase and lowercase words hold the same value. As well, we will remove stopwords. Stopwords are commonly used words in language that are not *significant* parts of the sentence. For example, "a", "are", "may".
4. Noise Removal

   Remove twitter handles, hashtags, phone numbers, and special characters which can interfere which hold no text value and may interfere with our training process.
   With our dataset, emoji's have already been removed.
   


In [37]:
"""
Returns a generator to get all tokens in a list of sentences
"""
def get_all_tokens(sentences):
    for sent in sentences:
        for token in sent:
            yield token

"""
Returns the corresponding wordnet tag for a parts of speach (pos) tag
"""
def get_wordnet_tag(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

"""
Return dictionary for existing dataset to for model
"""
def sentence_to_dict(dataset):
    for tokens in dataset:
        yield dict([token, True] for token in tokens)

## Loading Data

In [206]:
df = pd.read_csv('data/training.1600000.processed.noemoticon.csv', encoding='ISO-8859-1')
df.columns = ['sentiment', 'id', 'date', 'flag', 'user', 'text']
df.drop(['id', 'date', 'flag', 'user'], axis=1, inplace=True) # Dont need these cols

positives = df.loc[df['sentiment'] == 4]
negatives = df.loc[df['sentiment'] == 0]

df = pd.concat([positives.head(5000), negatives.head(5000)])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 799999 to 4999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  10000 non-null  int64 
 1   text       10000 non-null  object
dtypes: int64(1), object(1)
memory usage: 234.4+ KB


## Preprocessing

In [215]:
def preprocess_tweet(text, lang='english'):
    lemmatizer = wnl()
    stop_words = stopwords.words(lang)

    result = []
        
    for part, tag in pos_tag(text.split()):
        # Remove links
        part = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', part)

        # Remove handles
        part = re.sub('(@[A-Za-z0-9_]+)', '', part)

        part = part.lower().strip()

        part = part.replace(':)', 'smile')
        part = part.replace(':(', 'sad')
        part = part.replace(':/', 'frown')
        part = part.replace(';)', 'wink')
        part = part.replace(':D', 'big smile')
        part = part.replace(';D', 'big smile')
        part = part.replace(';d', 'big smile')

        # If it's a stopword we dont want to add to our token list
        if part in stop_words:
            continue

        wordnet_pos = get_wordnet_tag(tag)

        result.append(lemmatizer.lemmatize(part, wordnet_pos))
        
    return " ".join(result)

df['text'] = df['text'].apply(preprocess_tweet)

### Before Preprocessing

@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D

### After Preprocessing

In [216]:
print(df.iloc[0][1])

www.youtube.com/titomi15..watch videos,comment&amp;suscribe!! &amp; follow , cheer x god bless


## Preparing the dataset
We need to format the data as a labelled featureset. This is a requirement to train the model.
We will do a standard 80/20 train - test split

In [217]:
# Shuffle
df = shuffle(df)

# Partition to train and test
df_train, df_test = train_test_split(df, test_size=0.3)

In [218]:
# Create the dict
def create_dataset(df):
    dataset = []
    for i, row in df.iterrows():
        tweet = row['text']
        sent = row['sentiment']
        text_dict = dict([word, True] for word in word_tokenize(tweet))
        dataset.append((text_dict, sent))
    return dataset
        
d_train = create_dataset(df_train)
d_test = create_dataset(df_test)

In [219]:
d_train[2]

({'almost': True, 'bedtime': True}, 0)

## Training

Training our model with the Naive Bayes Classifier and using `pickle` to serialize our model so it can be reused later. It is 'naive' since it assumes that all our features are independent of each other. In our data, we only have 2 features (text, positive)

In [220]:
classifier = NaiveBayesClassifier.train(d_train)

print('Accuracy: {}'.format(classify.accuracy(classifier, d_test)))

# Save our model so we can reuse it later!
pickle.dump(classifier, open('bin/classifier.o', 'wb'))

Accuracy: 0.7046666666666667


## Future Improvements

This is a good model for now. But we can make some improvements in the future

- We can try out other model architectures such as LSTM neural network if we have good data
- Improve preprocessing such as adding text enrichment where needed
- Add another category, "Neutral"
- Checkout the ROC