## Twitter Sentiment Prediction using NLP

### Importing the data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('Tweets.csv')
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


### Data Cleaning

**Tokenization** is the process breaking complex data like paragraphs into simple units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.
1. **Sentence tokenization** : split a paragraph into list of sentences using sent_tokenize() method
2. **Word tokenization** : split a sentence into list of words using word_tokenize() method

We will be using **Word tokenization** to convert all the text to words

Import all the libraries required to perform tokenization on input data.

In [3]:
from nltk.tokenize import word_tokenize

**Stop Words** refers to the most common words in a language (such as "the", "a", "an", "in") which helps in formation of sentence to make sense, but these words does not provide any significance in language processing so remove it .

In computing, stop words are words which are filtered out before or after processing of natural language data (text). 

You can check list of stopwords by running below code snippet

In [4]:
from nltk.corpus import stopwords
stops = stopwords.words('english')
stops

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

**Remove Punctuations**

To remove punctuations from the list of words, import all punctuations and add them in the stop word list.

In [5]:
import string

punctuations = list(string.punctuation)
stops += punctuations

### Stemming

**Stemming** is a normalization technique where list of tokenized words are converted into shorten root words to remove redundancy. Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form.

A computer program that stems word may be called a stemmer.

A stemmer reduce the words like fishing, fished, and fisher to the stem fish. The stem need not be a word, for example the Porter algorithm reduces, argue, argued, argues, arguing, and argus to the stem argu .

It removes suffices, like "ing", "ly", "s", etc. by a simple rule-based approach. It reduces the corpus of words but often the actual words get neglected.

**Various Stemming algorithms**
1. **Porter stemming algorithm**: This class knows several regular word forms and suffixes with the help of which it can transform the input word to a final stem.
2. **Lancaster stemming algorithm**: It was developed at Lancaster University and it is another very common stemming algorithms.
NLTK has LancasterStemmer class with the help of which we can easily implement Lancaster Stemmer algorithms for the word we want to stem.
3. **Regular Expression stemming algorithm** : With the help of this stemming algorithm, we can construct our own stemmer.
NLTK has RegexpStemmer class with the help of which we can easily implement Regular Expression Stemmer algorithms. It basically takes a single regular expression and removes any prefix or suffix that matches the expression.
4. **Snowball stemming algorithm**: NLTK has SnowballStemmer class with the help of which we can easily implement Snowball Stemmer algorithms. It supports 15 non-English languages. In order to use this steaming class, we need to create an instance with the name of the language we are using and then call the stem() method.

In [6]:
stem_words = ["play", "played", "playing", "player", "happier", "happiness", "universe", "universal"]
from nltk.stem import PorterStemmer #Here we have used the porter stemming algorithm
ps = PorterStemmer()
for w in stem_words:
    print (ps.stem(w))

play
play
play
player
happier
happi
univers
univers


### **Lemmatization**

Major drawback of stemming is it produces Intermediate representation of word. Stemmer may or may not return meaningful word.

To overcome this problem , Lemmatization comes into picture.
Stemming algorithm works by cutting suffix or prefix from the word.On the contrary Lemmatization consider morphological analysis of the words and returns meaningful word in proper form.

The output we will get after lemmatization is called ‘lemma’, which is a root word rather than root stem,

NLTK provides WordNetLemmatizer class which is a thin wrapper around the wordnet corpus.

### POS Tag

Parts of speech Tagging is responsible for reading the text in a language and assigning some specific token (Parts of Speech) to each word.=
POS tag tell us about grammatical information of words of the sentence by assigning specific token (Determiner, noun, adjective , adverb ,verb,Personal Pronoun etc.) as tag (DT,NN ,JJ,RB,VB,PRP etc) to each words.

Word can have more than one POS depending upon context where it is used. we can use POS tags as statistical NLP tasks it distinguishes sense of word which is very helpful in text realization and infer semantic information from gives text for sentiment analysis.

In [7]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

In [8]:
lem.lemmatize("good", pos = 'a')

'good'

In [9]:
lem.lemmatize("better", pos = 'a')

'good'

In [10]:
lem.lemmatize("painting", pos = 'n') 
# Here painting is a noun which means painting can't be converted into paint. For eg
# "This painting is beautiful". Here painting cannot be changed.

'painting'

In [11]:
lem.lemmatize("painting", pos = 'v')
# Here painting is a verb which means it can be converted into paint.
# "I love painting"

'paint'

In [12]:
# Lemmatize

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

 #creating simple tags to pass into the lemmatizer
def get_simple_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def clean_tweet_data(words):
    output_words = []
    i = 0
    for w in words:
        if w.lower() not in stops:
            pos = pos_tag([w])
            clean_word = lemmatizer.lemmatize(w, pos = get_simple_pos(pos[0][1]))
            output_words.append(clean_word.lower())
    return output_words

In [13]:
def clean_data(df):
    tweets_data = []
    for i in range(df.shape[0]):
        tokenised_words = word_tokenize(df['text'][i])
        tweets_data.append((clean_tweet_data(tokenised_words), df['airline_sentiment'][i]))
    return tweets_data

In [14]:
tweets_data_clean = clean_data(df)

In [15]:
len(tweets_data_clean)

14640

In [16]:
import random
random.seed(2)
random.shuffle(tweets_data_clean)  
## shuffling the training exapmles.

In [17]:
tweets_data_train = tweets_data_clean[0:12000]
tweets_data_test = tweets_data_clean[12000:]

### Building Feature Set

In [18]:
# Get all words from training data
all_words = []
for tweet_data in tweets_data_train:
    all_words += tweet_data[0]

In [19]:
import nltk

#will retrurn a freq distribution object
freq = nltk.FreqDist(all_words)
len(freq)

13739

In [20]:
 #choosing the top 6000 frequency words
common = freq.most_common(8000)
features = [i[0] for i in common]

In [21]:
#will return true/false if the word in present in the tweet text or not
def get_feature_dictionary(words):
    current_features = {}
    words_set = set(words)
    for w in features:
        current_features[w] = w in words_set
    return current_features

In [22]:
tweets_data_train = [(get_feature_dictionary(tweet_words), sentiment) for tweet_words, sentiment in tweets_data_train]

In [23]:
tweets_data_test = [(get_feature_dictionary(tweet_words), sentiment) for tweet_words, sentiment in tweets_data_test]

### Classification using NLTK Naive Bayes

In [24]:
from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(tweets_data_train)

In [25]:
tweets_data_test[0]

({'flight': True,
  'united': False,
  'usairways': False,
  'americanair': True,
  'southwestair': False,
  'jetblue': False,
  "n't": False,
  'get': False,
  "'s": False,
  'http': False,
  'hour': False,
  'cancelled': False,
  'thanks': False,
  'service': False,
  'time': False,
  'help': False,
  'customer': False,
  '...': False,
  'u': False,
  'call': False,
  'bag': False,
  'wait': False,
  'go': False,
  'plane': False,
  "'m": True,
  'hold': False,
  'need': False,
  'amp': False,
  'fly': False,
  'make': False,
  'would': False,
  'thank': False,
  '2': False,
  'still': False,
  'one': False,
  'day': False,
  'please': False,
  'delayed': False,
  'back': False,
  'ca': False,
  'gate': False,
  'try': True,
  'flightled': False,
  'virginamerica': False,
  'say': False,
  'airline': False,
  'take': False,
  'seat': False,
  "'ve": False,
  'phone': False,
  "''": False,
  '``': False,
  'change': False,
  'late': False,
  'like': False,
  'today': False,
  'delay':

In [26]:
nltk.classify.accuracy(classifier, tweets_data_test)

0.7727272727272727

In [27]:
classifier.show_most_informative_features(30)

Most Informative Features
                passbook = True           positi : negati =     40.4 : 1.0
                 amazing = True           positi : negati =     32.6 : 1.0
                favorite = True           positi : negati =     30.0 : 1.0
             outstanding = True           positi : negati =     27.4 : 1.0
                discount = True           neutra : negati =     26.9 : 1.0
                   kudos = True           positi : negati =     25.2 : 1.0
                    rude = True           negati : neutra =     24.4 : 1.0
                 helpful = True           positi : neutra =     23.1 : 1.0
                 awesome = True           positi : negati =     22.8 : 1.0
               beautiful = True           positi : negati =     22.2 : 1.0
                    rock = True           positi : negati =     22.2 : 1.0
                   smile = True           positi : negati =     22.2 : 1.0
               wonderful = True           positi : negati =     21.8 : 1.0

### **Sklearn Classifiers within NLTK**

There is a Sklearn classifier that gives uses of NLTK a way to call the underlying scikit-learn classifier through their code in Phyton.

To construct a scikit-learn estimator object, then use that to construct a SklearnClassifier. E.g., to wrap a linear SVM with default settings:

$$from \;sklearn.svm \;import LinearSVC$$

$$from\; nltk.classify.scikitlearn\; import\; SklearnClassifier$$

$$classifier = SklearnClassifier(LinearSVC())$$

### Classification using SVC

In [28]:
#Using Sklearn Classifier within Nltk
from sklearn.svm import SVC
from nltk.classify.scikitlearn import SklearnClassifier

In [29]:
svc = SVC()
classifier_sklearn = SklearnClassifier(svc)

In [30]:
classifier_sklearn.train(tweets_data_train)

<SklearnClassifier(SVC())>

In [31]:
nltk.classify.accuracy(classifier_sklearn, tweets_data_test)

0.7791666666666667

### Classification using Random Forests

In [32]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
classifier_sklearn2 = SklearnClassifier(rf)

In [33]:
classifier_sklearn2.train(tweets_data_train)

<SklearnClassifier(RandomForestClassifier())>

In [34]:
nltk.classify.accuracy(classifier_sklearn2, tweets_data_test)

0.765530303030303

### Count Vectorizer

Count Vectorizer is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

It is used to convert a collection of text documents to a vector of term/token counts. It also enables the ​pre-processing of text data prior to generating the vector representation.

CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the document is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample. 

In [35]:
sentiments_train = [sentiment for tweet, sentiment in tweets_data_train]

In [36]:
sentiments_test= [sentiment for tweet, sentiment in tweets_data_test]

In [37]:
tweets_data_train = [' '.join(tweet) for tweet, sentiment in tweets_data_train]

In [39]:
tweets_data_test = [' '.join(tweet) for tweet, sentiment in tweets_data_test]

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(max_features = 8000)
tweets_data_train_vec = count_vec.fit_transform(tweets_data_train)
tweets_data_test_vec = count_vec.transform(tweets_data_test)

In [41]:
# from sklearn.svm import SVC
# from sklearn.model_selection import GridSearchCV

# clf = SVC()
# grid = {'C': [1e2, 1e3, 5e3, 1e4, 5e4, 1e5], 'gamma': [1e-3, 5e-4, 1e-4, 5e-3]}
# abc = GridSearchCV(clf, grid)
# abc.fit(tweets_data_train, sentiments)
# abc.best_estimator_

In [42]:
svc = SVC()
svc.fit(tweets_data_train_vec, sentiments_train)
svc.score(tweets_data_test_vec, sentiments_test)

0.6178030303030303

### **N-Grams**

An N-gram is an N-token sequence of words: a 2-gram (called a bigram) is a two-word sequence of words like “really good”, “not good”, or “your homework”, and a 3-gram (trigram) is a three-word sequence of wor¯ds like “not at all”, or “turn off light”.

Set the parameter ngram_range=(a,b) where a is the minimum and b is the maximum size of ngrams you want to include in your features. The default ngram_range is (1,1).

Instead of using a single word as feature, we can use a pair of words or three words as one features for our model.

In [44]:
count_vec = CountVectorizer(max_features = 6000, ngram_range=(2,3))
tweets_data_train_vec = count_vec.fit_transform(tweets_data_train)
tweets_data_test_vec = count_vec.transform(tweets_data_test)

In [45]:
svc = SVC()
svc.fit(tweets_data_train_vec, sentiments_train)
svc.score(tweets_data_test_vec, sentiments_test)

0.6178030303030303

## Results

Classification using NLTK Naive Bayes: 0.77

Classification using Sklearn SVC: 0.78

Classification using Sklearn Random Forest: 0.76

Classification using Count Vectorizer and SVC: 0.62

Classification using Count Vectorizer with N-Grams and SVC: 0.62