The goal of this Text Processing project is to implement sentiment analysis of tweets using Logistic Regression by building a model that will determine the tone (neutral, positive, negative) of the text.


## Steps:
### 1. Represent the tweets as vectors we can input to our algorithm: Bag of words
#### 1.1. Bag of words:
#### 1.2. Feature vector:
#### 1.3. Down-weighting high-frequency words: tf-idf

### 2. Data cleaning:
#### 2.1. Inspecting stop words:
#### 2.2. Removing stop words:
#### 2.3. Removing punctuation, HTML, and other characters:
#### 2.4. Reducing a word to its root: stemming

### 3. Training Logistic Regression

In [1]:
#Libraries:
import math
import numpy as np
import pandas as pd
import sklearn as skl

In [7]:
#load data: 
train = pd.read_csv('data/train.csv', encoding='latin-1')
test = pd.read_csv('data/test.csv', encoding='latin-1')

print(train.shape)
train.head(10)

(99989, 3)


Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...
5,6,0,or i just worry too much?
6,7,1,Juuuuuuuuuuuuuuuuussssst Chillin!!
7,8,0,Sunny Again Work Tomorrow :-| ...
8,9,1,handed in my uniform today . i miss you ...
9,10,1,hmmmm.... i wonder how she my number @-)


- The structure of a tweet varies a lot from one tweet to the next. Tweets have different lengths and contain letters, numbers, strange characters, etc.

- Many words are misspelled or cut short, for example "Juuuuuuuuuuuuuuuuussssst", or "frie" instead of "friend". This makes it hard to measure how positive or negative the words within the corpus of tweets are. If they were all correct dictionary words, we could use a lexicon to score them. However, because of the nature of social media language, we cannot do that.

- So we need a way of scoring the words such that words that appear in positive tweets get a greater score than those that appear in negative tweets.

## 1. Represent the tweets as vectors we can input to our algorithm: Bag of words
Convert each text input into a numeric vector that can be scored.
### 1.1. Bag of words:

One thing we could do to represent the tweets as equal-sized vectors of numbers is the following:
- Create a list (vocabulary) with all the unique words in the whole corpus of tweets, and the number of times they occur.
- Construct a feature vector from each tweet that contains the counts of how often each word occurs in that particular tweet.
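
As a minimal sketch of these two steps, the toy example below runs `CountVectorizer` on three made-up sentences. The sentences and the names `docs`, `toy_counter` and `toy_bag` are assumptions for illustration only and are not part of the tweet dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents (illustrative only, not from the tweet dataset)
docs = ['the sun is shining',
        'the weather is sweet',
        'the sun is shining and the weather is sweet']

toy_counter = CountVectorizer()
toy_bag = toy_counter.fit_transform(docs)

# Vocabulary: each unique word mapped to a column index (alphabetical order)
print(sorted(toy_counter.vocabulary_.items(), key=lambda kv: kv[1]))

# One count vector per document; columns follow the indices above
print(toy_bag.toarray())
# [[0 1 1 1 0 1 0]
#  [0 1 0 0 1 1 1]
#  [1 2 1 1 1 2 1]]
```

Every document becomes a vector of the same length (the vocabulary size), no matter how many words it contains. In the third document, "is" and "the" each occur twice, which is why columns 1 and 5 hold the value 2.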

In [11]:
# construct the bag of words:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the counter
counter = CountVectorizer()

# fit on the tweets and transform them into count vectors
bag = counter.fit_transform(train['SentimentText'])

# show the bag-of-words vocabulary:
# a Python dictionary that maps each unique word to an integer index
counter.vocabulary_

{'is': 71938,
 'so': 93674,
 'sad': 90542,
 'for': 65062,
 'my': 80817,
 'apl': 21665,
 'friend': 65503,
 'missed': 79478,
 'the': 97552,
 'new': 81633,
 'moon': 80032,
 'trailer': 99049,
 'omg': 83232,
 'its': 72089,
 'already': 14693,
 '30': 1718,
 'omgaga': 83233,
 'im': 70919,
 'sooo': 93951,
 'gunna': 67860,
 'cry': 57814,
 've': 101379,
 'been': 31808,
 'at': 25403,
 'this': 97908,
 'dentist': 59550,
 'since': 92847,
 '11': 279,
 'was': 102394,
 'suposed': 96026,
 'just': 73528,
 'get': 66440,
 'crown': 57664,
 'put': 87411,
 'on': 83282,
 '30mins': 1754,
 'think': 97880,
 'mi': 79029,
 'bf': 33599,
 'cheating': 48348,
 'me': 78448,
 't_t': 96554,
 'or': 83552,
 'worry': 103928,
 'too': 98750,
 'much': 80504,
 'juuuuuuuuuuuuuuuuussssst': 73586,
 'chillin': 49562,
 'sunny': 95932,
 'again': 9686,
 'work': 103863,
 'tomorrow': 98680,
 'tv': 99723,
 'tonight': 98722,
 'handed': 68325,
 'in': 71113,
 'uniform': 100724,
 'today': 98559,
 'miss': 79465,
 'you': 104998,
 'hmmmm': 69534,

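### 1.2. Feature vector:

In [63]:
# construct a feature vector from each tweet
# (toarray() turns the sparse matrix into a dense one, which can use a lot of memory on a large corpus)
bag.toarray()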

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### 1.3. Down-weighting high-frequency words: tf-idf

#### Term frequencies:
Each index position in the feature vectors corresponds to the integer value stored for a word in the CountVectorizer vocabulary.

The values in the feature vectors are called the raw term frequencies: tf(t, d), the number of times a term t occurs in a document d (here, the number of times a vocabulary term occurs in a tweet).

#### How relevant are words? Term frequency-inverse document frequency (tf-idf)
We could use these raw term frequencies to score the words in our algorithm. There is a problem though: if a word is very frequent in all documents, then it probably doesn't carry a lot of information. To tackle this problem we can use term frequency-inverse document frequency, which reduces a word's score the more frequent the word is across all tweets.

**Note:**  tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today.

The tf–idf is the product of two statistics, term frequency and inverse document frequency. 
- Term frequency (tf): the number of times a term occurs in a document.
- Inverse document frequency (idf): a measure of how much information the word provides. It diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
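
To make the product concrete, here is a small sketch that computes tf-idf by hand and compares it with scikit-learn. It assumes scikit-learn's smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, followed by L2 normalization of each row. The count matrix `counts` is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up raw term counts: 3 documents (rows) x 7 vocabulary terms (columns)
counts = np.array([[0, 1, 1, 1, 0, 1, 0],
                   [0, 1, 0, 0, 1, 1, 1],
                   [1, 2, 1, 1, 1, 2, 1]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # df(t): documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf (smooth_idf=True)

raw_tfidf = counts * idf                         # tf(t, d) * idf(t)
manual_tfidf = raw_tfidf / np.linalg.norm(raw_tfidf, axis=1, keepdims=True)  # L2 norm per row

sk_tfidf = TfidfTransformer(use_idf=True, norm='l2',
                            smooth_idf=True).fit_transform(counts).toarray()

print(np.allclose(manual_tfidf, sk_tfidf))
```

Terms that occur in every document (columns 1 and 5 here) get the smallest possible idf, 1.0, so their weight shrinks relative to rarer terms.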

In [12]:
# down-weight words that are frequent across all tweets:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
# norm: normalize the tf-idfs so that they're all in the same scale and thus work better with Logistic Regression.

np.set_printoptions(precision=2)

# Feed the tf-idf transformer with our previously created Bag of Words
tfidf.fit_transform(bag).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## 2. Data cleaning:
### 2.1. Inspecting stop words:



In [64]:
#Real vocabulary, the most common words:
from collections import Counter

#count word
vocab = Counter()
for twit in train.SentimentText:
    for word in twit.split(' '):
        vocab[word] += 1

print(pd.DataFrame(vocab.most_common(20)))
print('''The most common words are meaningless in terms of sentiment: I, to, the, and... they don't give any information on positiveness or negativeness. They're basically noise that can most probably be eliminated. Let's see the whole distribution to convince ourselves of this:''')
                                      
    

       0       1
0         123916
1      I   32879
2     to   28810
3    the   28087
4      a   21321
5    you   21180
6      i   15995
7    and   14565
8     it   12818
9     my   12385
10   for   12149
11    in   11199
12    is   11185
13    of   10326
14  that    9181
15    on    9020
16  have    8991
17    me    8255
18    so    7612
19   but    7220
The most common words are meaningless in terms of sentiment: I, to, the, and... they don't give any information on positiveness or negativeness. They're basically noise that can most probably be eliminated. Let's see the whole distribution to convince ourselves of this:


In [15]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [32]:
#visualizing over represented words: stopwords
def plot_distribution(vocabulary):
    hist, edges = np.histogram(list(map(lambda x:math.log(x[1]),vocabulary.most_common())), 
                               density=True, bins=500)

    p = figure(tools="pan,wheel_zoom,reset,save",
               toolbar_location="above",
               title="Word distribution across all tweets")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#B3DE69", )
    show(p)

plot_distribution(vocab)
print('''
It's clear now that a portion of the words are overly represented. These kinds of words are called stop words, and it is common practice to remove them when doing text analysis. Let's do it and see the distribution again:''')




It's clear now that a portion of the words are overly represented. These kinds of words are called stop words, and it is common practice to remove them when doing text analysis. Let's do it and see the distribution again:


### 2.2. Removing stop words:

In [41]:
# import natural language toolkit (NLTK)
# NLTK contains the stopwords list we should remove
import nltk

# download the stop word list locally
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/anwar/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [43]:
#importing stopwords
from nltk.corpus import stopwords

#loading English stop words
stop = stopwords.words('english')

# removing stop words
# (the NLTK list is lowercase, so capitalized tokens such as 'I' are kept):
vocab_reduced = Counter()
for w, c in vocab.items():
    if w not in stop:
        vocab_reduced[w] = c

vocab_reduced.most_common(20)

[('', 123916),
 ('I', 32879),
 ("I'm", 6416),
 ('like', 5086),
 ('-', 4922),
 ('get', 4864),
 ('u', 4194),
 ('good', 3953),
 ('love', 3494),
 ('know', 3472),
 ('go', 2990),
 ('see', 2868),
 ('one', 2787),
 ('got', 2774),
 ('think', 2613),
 ('&amp;', 2556),
 ('lol', 2419),
 ('going', 2396),
 ('really', 2287),
 ('im', 2200)]

In [48]:
#plotting after removing stopwords:
plot_distribution(vocab_reduced)
print('''After removing the stop words, we still see a very uneven distribution. If you look closer, you'll see that we're also taking into consideration punctuation signs ('-', ',', etc.) and HTML entities like &amp;.''')

After removing the stop words, we still see a very uneven distribution. If you look closer, you'll see that we're also taking into consideration punctuation signs ('-', ',', etc.) and HTML entities like &amp;.


### 2.3. Removing punctuation, HTML, and other characters:

In [65]:
# importing regular expressions:
import re

def preprocessor(text):
    """Return a cleaned version of text."""
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace every run of non-word characters with a space, convert to
    # lower case, and append the emoticons with the nose character ('-')
    # removed for standardization
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))

    return text
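
A quick way to see what the preprocessor does is to run it on a short string. The sketch below repeats the function so that it runs on its own; the input `sample` is made up for illustration.

```python
import re

def preprocessor(text):
    """Return a cleaned version of text."""
    text = re.sub(r'<[^>]*>', '', text)                              # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)    # remember emoticons
    # lower-case, collapse non-word characters, re-append emoticons without noses
    return (re.sub(r'[\W]+', ' ', text.lower()) + ' '
            + ' '.join(emoticons).replace('-', ''))

sample = '</a>This :) is :( a test :-)!'  # made-up input
cleaned = preprocessor(sample)
print(cleaned.split())
# ['this', 'is', 'a', 'test', ':)', ':(', ':)']
```

The tag and the punctuation are gone, the text is lower-cased, and the three emoticons survive at the end of the string, with `:-)` normalized to `:)`.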


### 2.4. Reducing a word to its root: stemming

We also need a tokenizer to break down our tweets into individual words. We will implement two tokenizers, a regular one and one that does stemming.

In [66]:
#importing PorterStemmer
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# tokenizer that splits text into individual words:
def tokenizer(text):
    return text.split()

# tokenizer that also reduces each word to its stem
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
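
The difference between the two tokenizers is easiest to see side by side. This sketch repeats both definitions so that it runs on its own; the sentence in `sample` is made up for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

sample = 'runners like running and thus they run'  # made-up sentence

plain_tokens = tokenizer(sample)
stemmed_tokens = tokenizer_porter(sample)

print(plain_tokens)
print(stemmed_tokens)
# ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
```

Stemming maps "running" and "run" to the same token, which shrinks the vocabulary, but it can also produce non-words such as "thu". This is one reason to check empirically whether it helps.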


## 3. Training Logistic Regression

We are finally ready to train our algorithm. We need to choose the best hyperparameters, such as the regularization type and strength. We would also like to know whether our algorithm performs better with or without stemming, with or without HTML removal, etc.

To make these decisions methodically, we can use a grid search: a method of training an algorithm with different combinations of parameters in order to later select the best one.
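
Because a grid search tries every combination, its cost grows quickly. As a rough sketch, `ParameterGrid` can count the candidates in a grid shaped like the one searched in this section. The string values below are hypothetical stand-ins for the real stop word list, tokenizers and preprocessor, and 5 cross-validation folds are assumed.

```python
from sklearn.model_selection import ParameterGrid

# Stand-in grid: strings replace the real objects, only the counts matter here
grid_shape = [
    {'vect__stop_words': ['stop-list', None],
     'vect__tokenizer': ['plain', 'porter'],
     'vect__preprocessor': [None, 'preprocessor'],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
    {'vect__stop_words': ['stop-list', None],
     'vect__tokenizer': ['plain', 'porter'],
     'vect__preprocessor': [None, 'preprocessor'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l1', 'l2'],
     'clf__C': [1.0, 10.0, 100.0]},
]

n_candidates = len(ParameterGrid(grid_shape))  # 48 + 48 combinations
n_folds = 5                                    # assumed number of CV folds
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
# 96 480
```

With several hundred model fits on tens of thousands of tweets, a long running time is to be expected.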

In [56]:
from sklearn.model_selection import train_test_split

# split the dataset into train and test
X = train['SentimentText']
y = train['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# In the line above, stratify creates train and test sets with the same class balance as the original set
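
The effect of `stratify` can be checked on a small made-up example. In this sketch, `X_toy` and `y_toy` are assumptions for illustration: 100 samples, of which 30% belong to class 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 70 samples of class 0 and 30 of class 1
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# Fraction of class 1 in the full set, the train split and the test split
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

The three fractions should all come out at (or very close to) 0.3. Without `stratify`, the train and test fractions would depend on the random shuffle.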

In [58]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__preprocessor': [None, preprocessor],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__preprocessor': [None, preprocessor],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

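# liblinear supports both the 'l1' and 'l2' penalties searched above
# (the default lbfgs solver in recent scikit-learn versions does not support 'l1')
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])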

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

# Note: This may take a long while to execute, like... 1 or 2 hours
# gs_lr_tfidf.fit(X_train, y_train)

In [60]:
# print('Best parameter set: ' + str(gs_lr_tfidf.best_params_))
# print('Best accuracy: %.3f' % gs_lr_tfidf.best_score_)

Interestingly, the set of parameters that gives us the best results is:

- A regularization parameter of C = 1.0 (in scikit-learn, C is the inverse of the regularization strength) using l2 regularization.
- Using our preprocessor (removing HTML, keeping emoticons, etc.) does improve the performance.
- Surprisingly, removing stop words does not improve accuracy.
- Word stemming doesn't seem to help either.

As you can see, sometimes intuition may lead to wrong decisions, and it's important to test all our assumptions.



In [62]:
# Let's see what our best accuracy on the held-out split is then:
# clf = gs_lr_tfidf.best_estimator_
# print('Accuracy in test: %.3f' % clf.score(X_test, y_test))



In [None]:
# predict on test set
# preds = clf.predict(test['SentimentText'])
