# Sentiment Analysis

Is it possible to track the positivity and negativity of MPs' tweets? Let's attempt this in two stages:

1. Train a classifier
2. Run it on the tweets in tweets.csv

## Stage 1: Train a Classifier

Use NLTK and scikit-learn to create a classifier for the sentiment of tweets.

In [4]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report

In [5]:
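nltk.download()
# Opens the interactive NLTK downloader. If you haven't already, download the
# twitter_samples corpus, plus the stopwords list and the tokenizer models
# that word_tokenize needs (both are used below).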

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Training Data

Use the tweets in NLTK's twitter_samples corpus as training data.

This is a set of tweets labelled as either positive or negative in sentiment.

Hopefully these will generalise to the MP tweet dataset. Further work could involve investigating other training sets.

In [6]:
twitter_samples = nltk.corpus.twitter_samples

In [7]:
# What files have we got?
twitter_samples.fileids() 

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [8]:
# Load the labelled tweets into lists, ready to go into dataframes

pos_df = pd.DataFrame()
neg_df = pd.DataFrame()

pos_tweets = list(twitter_samples.strings('positive_tweets.json'))
neg_tweets = list(twitter_samples.strings('negative_tweets.json'))

In [9]:
pos_df['text'] = pos_tweets
neg_df['text'] = neg_tweets

In [10]:
pos_df['class'] = 1
neg_df['class'] = 0

In [11]:
# DataFrame.append was removed in pandas 2.0, so concatenate instead
tweet_df = pd.concat([pos_df, neg_df], ignore_index=True)

### Build and Train a Model

We will build a standard document classifier with scikit-learn. Each tweet is classified as either 1 = positive or 0 = negative.

The classifier has three stages:

1. CountVectorizer - build a sparse matrix of how often each word in the corpus occurs in each tweet.
2. tf-idf - reweight each count to account for tweet length and for how often the word occurs in other documents. See the Wikipedia article on tf-idf for more info.
3. Train the classifier. We will try Naïve Bayes and an SVM.
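
The first two stages can be seen in isolation on a toy corpus. This is only a sketch: the three documents below are made up for illustration and are not from the training data, and the default scikit-learn settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, made up for illustration (not from the training data)
docs = ["good day", "bad day", "good good day"]

# Stage 1: count how often each word occurs in each document
counter = CountVectorizer()
counts = counter.fit_transform(docs)  # sparse matrix, one row per document
vocab = sorted(counter.vocabulary_, key=counter.vocabulary_.get)  # words in column order

# Stage 2: reweight the counts by inverse document frequency,
# then normalise each row to unit length (the default L2 norm)
tfidf = TfidfTransformer().fit_transform(counts)

print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

Because "day" occurs in every document, it should end up with a lower weight than "good" within the first document, even though both words occur there exactly once.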


Some things to consider:

1. Do we get rid of URLs?
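
If we did decide to drop URLs, one possible approach is a small regex-based cleaning step. The helper below is hypothetical (`strip_urls` and `URL_PATTERN` are names introduced here, not part of the notebook), and the pattern is deliberately simple rather than a complete URL matcher.

```python
import re

# Hypothetical pattern: matches http(s) links and bare www. links
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text, placeholder=""):
    """Replace URLs in a tweet with a placeholder and collapse leftover whitespace."""
    without_urls = URL_PATTERN.sub(placeholder, text)
    return " ".join(without_urls.split())

example = "Great news today http://t.co/abc123 :)"
print(strip_urls(example))
```

A function like this could be applied to the text column before training, or passed to `CountVectorizer` through its `preprocessor` argument. Note that a custom preprocessor replaces the default lowercasing step, so it would need to lowercase the text itself.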


In [12]:
# split into data and target

X = tweet_df['text']
y = tweet_df['class']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [14]:
# Make a pipeline: word counts -> tf-idf weighting -> Naive Bayes classifier

stop_words = nltk.corpus.stopwords.words('english')

tweet_classifier = Pipeline([
    ('count_vectorise', CountVectorizer(tokenizer=word_tokenize, stop_words=stop_words)),
    ('tf-idf', TfidfTransformer()),
    ('classify', MultinomialNB()),
])

In [15]:
tweet_classifier.fit(X_train, y_train)

Pipeline(steps=[('count_vectorise', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=..._tf=False, use_idf=True)), ('classify', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [16]:
pred = tweet_classifier.predict(X_test)

In [17]:
print(classification_report(y_test, pred))

             precision    recall  f1-score   support

          0       0.93      0.94      0.93      1529
          1       0.94      0.92      0.93      1471

avg / total       0.93      0.93      0.93      3000



In [18]:
print(confusion_matrix(y_test, pred))

[[1439   90]
 [ 115 1356]]
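
As a sanity check, the figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below hard-codes the matrix printed above (rows are true classes, columns are predicted classes) and treats class 1 (positive) as the class of interest.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = np.array([[1439, 90],
               [115, 1356]])

# For binary labels, ravel() gives: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test tweets classified correctly
precision_pos = tp / (tp + fp)    # of tweets predicted positive, share truly positive
recall_pos = tp / (tp + fn)       # of truly positive tweets, share predicted positive

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision_pos:.2f}")
print(f"recall:    {recall_pos:.2f}")
```

The precision and recall for class 1 should round to the 0.94 and 0.92 shown in the report.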
