# Text Classification with Traditional Models

The Twitter dataset (`tweets.csv`) contains tweets about US airlines that were scraped in February 2015 for sentiment analysis. Contributors first classified each tweet as positive, negative, or neutral, and then categorized the reason behind each negative tweet (such as "late flight" or "rude service"). The dataset is available [on Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

We want to train a supervised machine learning model that predicts the sentiment class (positive, negative, or neutral) of each new tweet. Choose a traditional classification model, such as Naive Bayes, and try different feature representation approaches to improve its performance.

## Importing Modules

In [12]:
import numpy
import pandas
import nltk
import nltk.corpus
import gensim.models
import sklearn.metrics
import sklearn.ensemble
import sklearn.naive_bayes
import sklearn.model_selection
import sklearn.feature_extraction.text

## Loading the Dataset

In [3]:
df = pandas.read_csv("../../datasets/tweets.csv")
df.head()

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing... | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


## Splitting the Data into Training and Test Sets

In [4]:
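# Raw tweet text is the input; the crowd-sourced sentiment label is the target
x = df["text"]
y = df["airline_sentiment"]
# Default split: 75% training, 25% test. No random_state is set, so results vary between runs
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)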
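
The split above is random and unstratified. As a minimal sketch of a repeatable alternative, the example below passes `stratify` and `random_state` to `train_test_split`. The toy `texts` and `labels` are hypothetical stand-ins for `df["text"]` and `df["airline_sentiment"]`, and the seed value is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df["text"] and df["airline_sentiment"].
texts = [f"tweet {i}" for i in range(20)]
labels = ["negative"] * 12 + ["neutral"] * 4 + ["positive"] * 4

# stratify keeps the class proportions the same in both splits;
# random_state makes the shuffle repeatable between runs.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```

Fixing the seed would change which tweets land in the test set, so an accuracy measured this way need not match the figures reported in this notebook.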

## Feature Engineering

In [5]:
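# min_df=5 drops terms that appear in fewer than 5 training tweets
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=5)
# Fit on the training tweets only so the test set does not influence the vocabulary
vectorizer.fit(x_train)

# Keep the matrices sparse; MultinomialNB accepts them directly and they use far less memory
new_x_train = vectorizer.transform(x_train)
new_x_test = vectorizer.transform(x_test)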

print("new_x_train:", new_x_train.shape)
print("new_x_test:", new_x_test.shape)

new_x_train: (10980, 2590)
new_x_test: (3660, 2590)
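
The task statement suggests trying several feature representations. One possible way to compare them, sketched here on a hypothetical toy corpus rather than the airline tweets, is to wrap each vectorizer and the classifier in a pipeline and cross-validate. The candidate names and the `cv=2` setting are illustrative assumptions, not choices made in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; in the notebook this would be x_train / y_train.
texts = [
    "late flight again", "rude service at the gate", "lost my bag",
    "worst airline ever", "great crew today", "love this airline",
    "thanks for the upgrade", "smooth flight and friendly staff",
]
labels = ["negative"] * 4 + ["positive"] * 4

candidates = {
    "counts": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "tfidf_bigrams": TfidfVectorizer(ngram_range=(1, 2)),
}

scores = {}
for name, candidate in candidates.items():
    # The pipeline refits the vectorizer inside every fold, so no
    # vocabulary or IDF statistics leak from the held-out fold.
    pipeline = make_pipeline(candidate, MultinomialNB())
    scores[name] = cross_val_score(pipeline, texts, labels, cv=2).mean()

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

On a corpus this small the scores are not meaningful; the point is the structure, which should carry over to the real training data with a larger `cv`.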


## Training a Model

In [6]:
model = sklearn.naive_bayes.MultinomialNB()
model.fit(new_x_train, y_train);

## Testing the Trained Model

In [7]:
y_predicted = model.predict(new_x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
print("Accuracy = {:.2f}".format(accuracy))

Accuracy = 0.73
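
Accuracy is a single number and can hide how the model behaves on each class, especially if one sentiment is much more frequent than the others. A hedged sketch of a per-class view follows, using hypothetical label lists in place of `y_test` and `y_predicted`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels; in the notebook these would be y_test and y_predicted.
y_true = ["negative", "negative", "negative", "negative",
          "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "negative", "negative",
          "negative", "neutral", "negative", "positive"]

class_names = ["negative", "neutral", "positive"]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=class_names)
print(matrix)

# Precision, recall and F1 per class, plus macro and weighted averages.
print(classification_report(y_true, y_pred, labels=class_names, zero_division=0))
```

In this toy case every mistake is a neutral or positive tweet predicted as negative, which an accuracy figure alone would not reveal.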


## Predicting Sentiments

In [8]:
tweets = ["This is a very bad airline!", "I love your good flights."]
encoded_tweets = vectorizer.transform(tweets).toarray()
predicted_class = model.predict(encoded_tweets)
predicted_class

array(['negative', 'positive'], dtype='<U8')

## An Alternative Solution

### Loading the word2vec Model

In [13]:
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format("/tmp/GoogleNews-vectors-negative300.bin.gz", binary=True)

### Calculating Average Word Embedding Vectors for Tweets

In [20]:
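x_word_embedding = []
# Requires the NLTK stop word corpus: nltk.download("stopwords")
# A set makes the membership test below constant-time
stop_words = set(nltk.corpus.stopwords.words("english"))
tokenizer = nltk.tokenize.TweetTokenizer()
for tweet in df["text"]:
    words = tokenizer.tokenize(tweet)
    feature_vector = numpy.zeros(300)  # the GoogleNews vectors are 300-dimensional
    ctr = 0  # number of tokens that contributed to the sum
    for word in words:
        # Skip stop words and tokens missing from the word2vec vocabulary
        if word in word2vec_model and word not in stop_words:
            feature_vector += word2vec_model[word]
            ctr += 1
    # Average the summed vectors; tweets with no usable token stay all-zero
    if ctr > 0:
        feature_vector /= ctr
    x_word_embedding.append(feature_vector)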
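
The averaging step in the loop above can be isolated into a small helper, which makes it easier to check on known inputs. The sketch below is one possible formulation: `average_embedding` and `toy_embeddings` are hypothetical names, and a plain dictionary of 3-dimensional vectors stands in for `word2vec_model`.

```python
import numpy as np


def average_embedding(tokens, embeddings, dim, stop_words=frozenset()):
    """Mean of the vectors of all known, non-stop-word tokens (zeros if none)."""
    vectors = [
        embeddings[token]
        for token in tokens
        if token in embeddings and token not in stop_words
    ]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Hypothetical 3-dimensional embeddings standing in for word2vec_model.
toy_embeddings = {
    "bad": np.array([1.0, 0.0, 0.0]),
    "flight": np.array([0.0, 1.0, 0.0]),
    "great": np.array([0.0, 0.0, 1.0]),
}

vec = average_embedding(["the", "bad", "flight"], toy_embeddings, dim=3, stop_words={"the"})
empty = average_embedding(["zzz"], toy_embeddings, dim=3)
print(vec)
print(empty)
```

The helper only relies on `in` and item lookup, the same two operations the loop above uses on `word2vec_model`.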

### Splitting into Training and Test Sets

In [21]:
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_word_embedding, y)
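
This split is drawn independently of the one used for the TF-IDF model, so the two models are evaluated on different test tweets. One way to evaluate both feature sets on identical tweets, sketched here with hypothetical toy arrays, is to split row indices once and reuse them. The names `features_a` and `features_b` are assumptions standing in for the TF-IDF matrix and the embedding matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 10
# Hypothetical stand-ins for two feature matrices built from the same tweets.
features_a = np.arange(n_samples).reshape(-1, 1)
features_b = np.arange(n_samples).reshape(-1, 1) * 10
labels = np.array([0, 1] * 5)

# Split the row indices once, then index every feature set with them.
train_idx, test_idx = train_test_split(
    np.arange(n_samples), test_size=3, random_state=0
)

a_train, a_test = features_a[train_idx], features_a[test_idx]
b_train, b_test = features_b[train_idx], features_b[test_idx]
y_tr, y_te = labels[train_idx], labels[test_idx]

print(sorted(test_idx.tolist()))
```

With shared indices, a difference in test accuracy reflects the features and the classifier rather than which tweets happened to be held out.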

### Training a Model

In [22]:
model = sklearn.ensemble.RandomForestClassifier()
model.fit(x_train, y_train);

### Testing the Trained Model

In [24]:
y_predicted = model.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
print("Accuracy = {:.2f}".format(accuracy))

Accuracy = 0.74
