<a id='top'></a>

# Twitter Sentiment Analysis in Python: The Base Model

This notebook will eventually be populated with:
1. Theory and explanation of TF-IDF, logistic regression, log odds, etc.
2. Word vectorization
3. Comparison of the top-scoring TF-IDF words with the words selected by regularized models, e.g. LASSO

## Contents

1. [TF-IDF](#tfidf)
2. [Logistic regression](#log)

Base model: logistic regression and TF-IDF

You can read more about TF-IDF and simple regressions in this [paper](http://www.cs.ubc.ca/~nando/540-2013/projects/p9.pdf) and this dapper blog [post](https://www.ocf.berkeley.edu/~janastas/supervised-learning-with-text-1-03-01-sheet.html).

And, well, while we're at it, I've enjoyed the documentation for [gensim](https://radimrehurek.com/gensim/models/tfidfmodel.html).

<a id='tfidf'></a>

### TF-IDF

[back to top](#top)

In [1]:
import pandas as pd 
import numpy as np
data = pd.read_csv("../../core/data/tweet_global_warming.csv", encoding="latin") #load the corpus
print("Total tweets: {}".format(data.shape[0]))

Total tweets: 6090


In [2]:
import gensim

def read_data(texts):
    """Yield each tweet as a list of lowercase tokens."""
    for line in texts:
        yield gensim.utils.simple_preprocess(line)

dataset = list(read_data(data['tweet']))
print(dataset[0])  # dataset is a list of lists: one token list per tweet

['global', 'warming', 'report', 'urges', 'governments', 'to', 'act', 'brussels', 'belgium', 'ap', 'the', 'world', 'faces', 'increased', 'hunger', 'and', 'link']


In [3]:
import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
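
To sanity-check the helper functions above, here is a minimal sketch of the same formulas in plain Python, with no TextBlob dependency. The four-document `toy_docs` corpus and the `toy_*` names are invented for illustration and are not part of the tweet data. Note how the `1 +` in the IDF denominator pushes the score of a word that appears in three of the four documents down to zero.

```python
import math

def toy_tf(word, doc):
    # Term frequency: share of the document's tokens that are `word`
    return doc.count(word) / len(doc)

def toy_idf(word, docs):
    # Inverse document frequency with the same 1 + n smoothing as idf() above
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (1 + n_containing))

def toy_tfidf(word, doc, docs):
    return toy_tf(word, doc) * toy_idf(word, docs)

# Hypothetical mini-corpus, already tokenized
toy_docs = [
    ["global", "warming", "is", "real"],
    ["global", "warming", "hoax"],
    ["climate", "report", "urges", "action"],
    ["global", "hunger", "report"],
]

toy_scores = {w: toy_tfidf(w, toy_docs[0], toy_docs) for w in toy_docs[0]}
for word, score in sorted(toy_scores.items(), key=lambda x: x[1], reverse=True):
    print("{}\t{:.4f}".format(word, score))
```

Words that occur in only one document ("is", "real") score highest, while "global", which occurs in three of the four documents, contributes nothing.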

In [4]:
bloblist = list(map(tb, data.iloc[:,0]))
for i, blob in enumerate(bloblist):
    print("Top words in tweet {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
    if i == 10:
        break

Top words in tweet 1
	Word: act|BRUSSELS, TF-IDF: 0.45801
	Word: Belgium, TF-IDF: 0.45801
	Word: hunger, TF-IDF: 0.45801
Top words in tweet 2
	Word: poverty, TF-IDF: 0.73515
	Word: Fighting, TF-IDF: 0.66839
	Word: Africa, TF-IDF: 0.66415
Top words in tweet 3
	Word: Vatican, TF-IDF: 0.51245
	Word: failed, TF-IDF: 0.51245
	Word: offsets, TF-IDF: 0.48534
Top words in tweet 4
	Word: Vatican, TF-IDF: 0.51245
	Word: failed, TF-IDF: 0.51245
	Word: offsets, TF-IDF: 0.48534
Top words in tweet 5
	Word: URUGUAY, TF-IDF: 0.58289
	Word: Tools, TF-IDF: 0.58289
	Word: Needed, TF-IDF: 0.58289
Top words in tweet 6
	Word: JaymiHeimbuch, TF-IDF: 0.48854
	Word: sejorg, TF-IDF: 0.46151
	Word: Intensifying, TF-IDF: 0.42745
Top words in tweet 7
	Word: around, TF-IDF: 0.61988
	Word: us|A, TF-IDF: 0.36861
	Word: doubters, TF-IDF: 0.33369
Top words in tweet 8
	Word: Migratory, TF-IDF: 0.81423
	Word: Stay, TF-IDF: 0.76918
	Word: Strategy, TF-IDF: 0.73722
Top words in tweet 9
	Word: Competing, TF-IDF: 0.48854
	Wo

In [5]:
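from gensim.models import TfidfModel
from gensim.corpora import Dictionary

dct = Dictionary(dataset)                         # map each token to an integer id
corpus = [dct.doc2bow(line) for line in dataset]  # bag of words: (token_id, count) pairs
model = TfidfModel(corpus)                        # fit the IDF weights on the whole corpus
vector = model[corpus[0]]                         # TF-IDF weights for the first tweet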

In [6]:
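# model[...] returns (token_id, weight) pairs ordered by token id, not by word
# position, so look each word up in the dictionary instead of indexing dataset[0].
for token_id, weight in vector:
    print("{!s:.5}\t{}".format(weight, dct[token_id]))

0.270	act
0.090	and
0.243	ap
0.372	belgium
0.372	brussels
0.297	faces
0.034	global
0.318	governments
0.347	hunger
0.287	increased
0.089	link
0.174	report
0.055	the
0.062	to
0.331	urges
0.035	warming
0.172	world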


<a id='log'></a>

TF-IDF scores may come in handy when choosing which words should represent our feature space. Let's build a logistic regression!

### Logistic Regression

[back to top](#top)

In [7]:
from sklearn.model_selection import train_test_split
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer

data = pd.read_csv("../../core/data/tweet_global_warming.csv", encoding="latin")
print("Full dataset: {}".format(data.shape[0]))
# Replace missing labels with 'ambiguous' and expand the Y/N codes
data['existence'] = data['existence'].fillna('ambiguous').replace({'Y': 'Yes', 'N': 'No'})
#data.dropna(inplace=True)
tweets = data.iloc[:,0]
sentiment = data.iloc[:,1]
# np.unique over the tweet strings counts distinct tweets, not distinct words
print("Number of unique tweets: {}".format(len(np.unique(tweets))))

top_words = 20000
max_words = 30
test_split = 0.5

#convert X to ints
token = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                  lower=True, split=' ', char_level=False, oov_token=None)
token.fit_on_texts(texts=tweets)
X = token.texts_to_sequences(texts=tweets)

X_train, X_test, Y_train, Y_test = train_test_split(X,sentiment, test_size=test_split)

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
print(np.unique(sentiment,return_counts=True))

Full dataset: 6090
Number of unique tweets: 5541
(array(['No', 'Yes', 'ambiguous'], dtype=object), array([1114, 3111, 1865]))


In [8]:
from sklearn.linear_model import LogisticRegression
# liblinear does not use n_jobs, and max_iter should be an integer
model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.0001, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))
print(np.unique(model.predict(X_test),return_counts=True))

training score: 0.511
testing score: 0.502
(array(['No', 'Yes', 'ambiguous'], dtype=object), array([  76, 2864,  105]))


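The classification is heavily biased toward 'Yes': 2,864 of the 3,045 test predictions are 'Yes'. The testing score of 0.502 is no better than always guessing the majority class (3,111 of the 6,090 tweets, about 0.511, are labeled 'Yes'), so this baseline model is not performing well.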
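
A quick way to put these scores in context is to compare them with a majority-class baseline. The sketch below hard-codes the label counts and the test-set prediction counts printed by the cells above; the variable names are mine and nothing here touches the fitted model.

```python
# Label counts copied from the np.unique(sentiment, ...) output above
label_counts = {"No": 1114, "Yes": 3111, "ambiguous": 1865}
total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

# Test-set prediction counts copied from the model output above
predicted_counts = {"No": 76, "Yes": 2864, "ambiguous": 105}
share_predicted_yes = predicted_counts["Yes"] / sum(predicted_counts.values())

print("Always guessing '{}' scores {:.3f}".format(majority_label, baseline_accuracy))
print("Share of test tweets predicted 'Yes': {:.3f}".format(share_predicted_yes))
```

Any model worth keeping should clear the first number by a comfortable margin.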

In [590]:
X_train[:3]

array([[    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     4,     5,    51,   121,    32,  3963,
         1669,   351,   105,    89,  1483,    38,    59,  3964,     1,
            8,     7,  6137],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,    98,     6,  4408,   265,   375,     6,  4409,    12,
           11,     2,    39,   273,     6,   603,  4410,   100,     4,
            5,  4411,    17],
       [    0,     0,     0,     0,     0,     0,     0,    12,    59,
         5850,    69,   102,   102,   151, 11098,   132,   356,     2,
            3,    14, 11099,    52,  5593,    15,  3063,  5592,     1,
            8,     7, 11100]], dtype=int32)
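
Each row above is one tweet: the integers are word ranks from the tokenizer (1 is the most frequent word) and the leading zeros are padding. The sketch below is a simplified, pure-Python imitation of that encoding, not the Keras implementation. `build_word_index`, `to_padded_sequence`, and the toy texts are hypothetical names and data, and ties in frequency are broken alphabetically here just to keep the result deterministic.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by descending frequency, starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_padded_sequence(text, word_index, maxlen):
    """Encode a text as word ranks, keep the last `maxlen` ids, left-pad with zeros."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

# Hypothetical mini-corpus
toy_texts = ["climate change is real", "climate change hoax", "climate report"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_padded_sequence("climate change hoax", toy_index, maxlen=5))
```

Because the ids are frequency ranks, their numeric size says nothing about sentiment, which is worth keeping in mind when they are fed to a linear model as features.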

In [591]:
word_index = token.word_index
for word, index in word_index.items():
    if index in (2, 6, 17):
        print(word)

climate
the
link


Eventually we'll inspect the coefficients for these words in the logistic regression, which will tell us whether they indicate sentiment one way or the other!
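
One way to get a coefficient per word is to represent each tweet as a vector of word counts, so that every column corresponds to one vocabulary word. The sketch below assumes scikit-learn's `CountVectorizer` and uses a handful of invented tweets and labels, so the weights it prints mean nothing beyond the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical tweets and labels, for illustration only
toy_tweets = [
    "global warming is real and urgent",
    "climate change is real",
    "we must act on climate change now",
    "global warming is a hoax",
    "climate change is a scam",
    "warming hoax exposed again",
]
toy_labels = ["Yes", "Yes", "Yes", "No", "No", "No"]

vectorizer = CountVectorizer()
X_toy = vectorizer.fit_transform(toy_tweets)  # rows: tweets, columns: words
toy_model = LogisticRegression(solver="liblinear")
toy_model.fit(X_toy, toy_labels)

# Recover the word behind each column from the fitted vocabulary
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# With two classes, coef_ has one row; positive weights push toward classes_[1]
word_weights = sorted(zip(vocab, toy_model.coef_[0]), key=lambda pair: pair[1])
for word, weight in word_weights:
    print("{:>10s}  {:+.3f}".format(word, weight))
```

Words at the top of the printout lean toward `toy_model.classes_[0]` and words at the bottom toward `toy_model.classes_[1]`.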

In [9]:
sorted_d = sorted(token.word_docs.items(), key=lambda x: x[1], reverse=True)
sorted_d[:10]

[('http', 3898),
 ('climate', 3447),
 ('change', 3279),
 ('global', 3002),
 ('warming', 2915),
 ('ly', 2371),
 ('bit', 2173),
 ('the', 1978),
 ('to', 1682),
 ('of', 1375)]

Let's see what kind of a boost we get from removing stop words.

I'll add one preprocessing step to remove stop words using nltk:

In [13]:
from nltk import word_tokenize
from nltk.corpus import stopwords
import string

# Needs the NLTK tokenizer and stop-word data (fetch once with nltk.download)
stop = set(stopwords.words('english')) | set(string.punctuation)
tokenized_tweets = []
for tweet in tweets:
    tokens = [t for t in word_tokenize(tweet.lower()) if t not in stop]
    tokenized_tweets.append(tokens)

In [15]:
top_words = 20000
max_words = 25
test_split = 0.5

#convert X to ints
token = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                  lower=True, split=' ', char_level=False, oov_token=None)
token.fit_on_texts(texts=tokenized_tweets)
X = token.texts_to_sequences(texts=tokenized_tweets)

X_train, X_test, Y_train, Y_test = train_test_split(X,sentiment, test_size=test_split)

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

from sklearn.linear_model import LogisticRegression
# liblinear does not use n_jobs, and max_iter should be an integer
model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.0001, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))
print(np.unique(model.predict(X_test),return_counts=True))

training score: 0.513
testing score: 0.506
(array(['No', 'Yes', 'ambiguous'], dtype=object), array([  41, 2991,   13]))


In [17]:
sorted_d = sorted(token.word_docs.items(), key=lambda x: x[1], reverse=True)
sorted_d[:10]

[('http', 3892),
 ('climate', 3360),
 ('change', 3207),
 ('global', 2925),
 ('warming', 2783),
 ('...', 1331),
 ('link', 975),
 ('rt', 898),
 ("'s", 879),
 ('via', 522)]

We can also try stemming.

In [18]:
from nltk.stem.lancaster import LancasterStemmer
from nltk.tokenize import RegexpTokenizer

# Tokenize and stem
tkr = RegexpTokenizer('[a-zA-Z0-9@]+')
stemmer = LancasterStemmer()

tokenized_corpus = []

for tweet in tweets:
    tokens = [stemmer.stem(t) for t in tkr.tokenize(tweet) if not t.startswith('@')]
    tokenized_corpus.append(tokens)

In [19]:
top_words = 20000
max_words = 30
test_split = 0.5

#convert X to ints
token = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                  lower=True, split=' ', char_level=False, oov_token=None)
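# NB: this cell fits on tokenized_tweets (the stop-word version), not the stemmed
# tokenized_corpus built above, so the word counts below match the previous run.
# Swap in tokenized_corpus to actually test stemming.
token.fit_on_texts(texts=tokenized_tweets)
X = token.texts_to_sequences(texts=tokenized_tweets)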

X_train, X_test, Y_train, Y_test = train_test_split(X,sentiment, test_size=test_split)

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

# liblinear does not use n_jobs, and max_iter should be an integer
model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.0001, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))
print(np.unique(model.predict(X_test),return_counts=True))

training score: 0.514
testing score: 0.507
(array(['No', 'Yes', 'ambiguous'], dtype=object), array([  57, 2974,   14]))


In [20]:
sorted_d = sorted(token.word_docs.items(), key=lambda x: x[1], reverse=True)
sorted_d[:10]

[('http', 3892),
 ('climate', 3360),
 ('change', 3207),
 ('global', 2925),
 ('warming', 2783),
 ('...', 1331),
 ('link', 975),
 ('rt', 898),
 ("'s", 879),
 ('via', 522)]

I would eventually like to use TF-IDF to pick the 2-3 highest-scoring words in each tweet and see whether these short vectors can be used to classify sentiment (see the last cell).
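
As a rough sketch of that idea, the function below keeps only the `k` highest-scoring TF-IDF words of each document, using the same smoothed IDF as the `idf` helper earlier in the notebook. The name `top_tfidf_words` and the toy corpus are hypothetical, and how the resulting short vectors would be turned into model features is left open.

```python
import math
from collections import Counter

def top_tfidf_words(doc, docs, k=3):
    """Return the k words of `doc` with the highest TF-IDF score."""
    # Number of documents each word appears in
    doc_freq = Counter(word for d in docs for word in set(d))
    counts = Counter(doc)
    scores = {
        word: (count / len(doc)) * math.log(len(docs) / (1 + doc_freq[word]))
        for word, count in counts.items()
    }
    # Sort by descending score, then alphabetically so ties are deterministic
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:k]

# Hypothetical mini-corpus, already tokenized
toy_docs = [
    ["global", "warming", "is", "real"],
    ["global", "warming", "hoax"],
    ["climate", "report", "urges", "action"],
    ["global", "hunger", "report"],
]

short_vectors = [top_tfidf_words(doc, toy_docs, k=2) for doc in toy_docs]
for doc, words in zip(toy_docs, short_vectors):
    print(" ".join(doc), "->", words)
```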

In [21]:
top_words = 60
max_words = 1
test_split = 0.5

#convert X to ints
token = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~',
                  lower=True, split=' ', char_level=False, oov_token=None)
token.fit_on_texts(texts=tweets)
X = token.texts_to_sequences(texts=tweets)

X_train, X_test, Y_train, Y_test = train_test_split(X,sentiment, test_size=test_split)

X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

from sklearn.linear_model import LogisticRegression
# liblinear does not use n_jobs, and max_iter should be an integer
model = LogisticRegression(solver='liblinear', max_iter=10000,
                           C=.0001, penalty='l1')
model.fit(X_train, Y_train)
print("training score: {:.3}".format(model.score(X_train, Y_train)))
print("testing score: {:.3}".format(model.score(X_test, Y_test)))

training score: 0.514
testing score: 0.507


In [25]:
df = pd.DataFrame()
cols = ["word", "value", "prediction"]
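for val, key in enumerate(token.word_index):
    word = key
    word_ind = val + 1  # word_index ranks start at 1
    # predict expects a 2D array: one sample with a single feature (the word's index)
    word_pred = model.predict([[word_ind]])
    new = pd.DataFrame([[word, word_ind, word_pred]], columns=cols)
    df = pd.concat([df, new])
    if val == 59:
        break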
df

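| | word | value | prediction |
|---|---|---|---|
| 0 | http | 1 | [Yes] |
| 0 | climate | 2 | [Yes] |
| 0 | change | 3 | [Yes] |
| 0 | global | 4 | [Yes] |
| 0 | warming | 5 | [Yes] |
| 0 | the | 6 | [Yes] |
| 0 | ly | 7 | [Yes] |
| 0 | bit | 8 | [Yes] |
| 0 | to | 9 | [Yes] |
| 0 | of | 10 | [Yes] |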


In [550]:
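# Per-class log odds: coefficient * feature value + intercept.
# Rows of coef_ follow model.classes_: ['No', 'Yes', 'ambiguous'].
df["log odds no"] = df["value"] * model.coef_[0, 0] + model.intercept_[0]
df["log odds yes"] = df["value"] * model.coef_[1, 0] + model.intercept_[1]
df["log odds ambiguous"] = df["value"] * model.coef_[2, 0] + model.intercept_[2]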
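
For reading the log-odds columns, it may help to recall that a log odds of z corresponds to a probability of 1 / (1 + e^(-z)). The sketch below converts in both directions; the coefficient, intercept, and feature value are made-up numbers, not values from the fitted model.

```python
import math

def sigmoid(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Convert a probability (strictly between 0 and 1) to log odds."""
    return math.log(p / (1.0 - p))

# Hypothetical numbers, for illustration only
hypothetical_coef = 0.02
hypothetical_intercept = -0.5
feature_value = 10

z = hypothetical_coef * feature_value + hypothetical_intercept
p = sigmoid(z)
print("log odds: {:.3f}, probability: {:.3f}".format(z, p))
```

With the `liblinear` solver each class gets its own one-vs-rest binary model, so the three probabilities obtained this way need not sum to one.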