## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of each tweet, we will classify whether the sentiment the user directs at a particular mobile device is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the NA's while reading the dataset

In [232]:
import pandas as pd
from bs4 import BeautifulSoup
import re #for regex
from nltk.corpus import stopwords #nltk - natural language toolkit
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

In [233]:
df = pd.read_csv("tweets.csv", encoding = 'Unicode_escape')

In [234]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [235]:
df.shape

(9093, 3)

In [236]:
df.isna().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [237]:
#remove NA
df_new = df.dropna(subset = ['tweet_text'])

In [238]:
df_new.shape

(9092, 3)

In [239]:
df_new.isna().sum()

tweet_text                                               0
emotion_in_tweet_is_directed_at                       5801
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [240]:
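def preprocess(text):
    # NOTE: in Python 3 the tweets are already `str`, which has no .decode(),
    # so this call raises AttributeError and the except branch returns ""
    # for every row (see the empty `text` column in the output below).
    try:
        return text.decode('ascii')
    except Exception:
        return ""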
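In Python 3, `str` has no `.decode()` method, so the function above falls into its `except` branch for every tweet. One possible fix, sketched under the assumption that the intent was to strip non-ASCII characters, is to encode to ASCII while ignoring unencodable characters and then decode back. The name `preprocess_ascii` and the sample tweet are hypothetical and not part of the original notebook.

```python
def preprocess_ascii(text):
    """Hypothetical replacement for preprocess(): drop non-ASCII characters."""
    # Anything that is not a string (e.g. NaN) is mapped to an empty string.
    if not isinstance(text, str):
        return ""
    # errors="ignore" silently drops characters that ASCII cannot represent.
    return text.encode("ascii", errors="ignore").decode("ascii")


sample = "I love my iPad \u2665 #SXSW"  # made-up tweet containing a heart symbol
cleaned = preprocess_ascii(sample)
print(repr(cleaned))
```

The heart symbol is removed while the surrounding spaces are kept, so the cleaned string contains a double space where the symbol used to be.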

In [241]:
df_new['text'] = [preprocess(text) for text in df_new.tweet_text]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [242]:
df_new.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,
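The warning printed when the `text` column was assigned arises because `df_new` was derived from `df`, and pandas cannot tell whether the assignment will reach the original frame. A common way to avoid it, sketched here on a hypothetical two-row frame rather than the real `tweets.csv`, is to take an explicit `.copy()` before adding columns.

```python
import pandas as pd

# Hypothetical miniature of the tweets data, used only for illustration.
raw = pd.DataFrame({"tweet_text": ["Love my iPad", None]})

# .copy() makes the result an independent DataFrame, so adding a column
# to it is unambiguous and does not trigger SettingWithCopyWarning.
clean = raw.dropna(subset=["tweet_text"]).copy()
clean["text"] = clean["tweet_text"].str.lower()

print(clean)
```

The same pattern would apply to any later frame that is built by filtering and then extended with new columns.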


In [243]:
df_new['text'].tolist()

['',
 '',
 '',
 ...]


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [244]:
df_new.is_there_an_emotion_directed_at_a_brand_or_product.unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

In [245]:
df_new2 = df_new[df_new.is_there_an_emotion_directed_at_a_brand_or_product.isin(['Negative emotion', 'Positive emotion'])]

In [246]:
df_new2.shape

(3548, 4)

In [249]:
df_new2.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,


### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [250]:
def text_to_words(raw_text):
    # Convert a raw tweet to a string of cleaned words.
    # The input is a single string (a raw tweet), and
    # the output is a single string (a preprocessed tweet).
    #
    # 1. Remove HTML
    text = BeautifulSoup(raw_text).get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by spaces,
    #    and return the result.
    return " ".join(meaningful_words)
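To make steps 2 to 5 concrete, here is a minimal sketch that applies the same regex, lower-casing, and stop-word filtering to one made-up tweet. It assumes a tiny hand-written stop list (`DEMO_STOPS`) in place of the NLTK list and skips the HTML step, so it only approximates what `text_to_words` would return.

```python
import re

# Hypothetical stand-in for stopwords.words("english"); the real list is much longer.
DEMO_STOPS = {"i", "a", "to", "the", "it", "my", "is", "for"}


def demo_clean(raw_text, stops=DEMO_STOPS):
    # Replace every character that is not a letter with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    # Lower-case and split on whitespace.
    words = letters_only.lower().split()
    # Keep only the words that are not in the stop list.
    return " ".join(w for w in words if w not in stops)


cleaned = demo_clean("Can't wait for the #iPad 2 at @sxsw!")
print(cleaned)  # can t wait ipad at sxsw
```

Note that the apostrophe is replaced by a space, so "Can't" becomes the two tokens "can" and "t", and that the digit "2" and the `#`/`@` markers are dropped entirely.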

In [251]:
# Get the number of tweets based on the dataframe column size
num_text = df_new2["tweet_text"].size



In [252]:
num_text

3548

In [253]:
df_new2.reset_index(inplace=True)

In [254]:
df_new2["tweet_text"][3190]

"I think #SXSW has taken it upon itself to make it clear to me that a Gen 1 iPhone ain't gonna cut it anymore."

In [255]:
# Initialize an empty list to hold the clean tweets
clean_tweet_text = []

# Loop over each tweet; the index i goes from 0 to the number of tweets
# (the index was reset above, so labels 0..num_text-1 all exist)
for i in range(0, num_text):
    # Clean each tweet and add the result to the list
    clean_tweet_text.append(text_to_words(df_new2["tweet_text"][i]))

In [256]:
clean_tweet_text[1000]

'headed sxsw hear talk integrated social network dynamics team synergy got ipad ready go'

In [257]:
print("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool. To cap the vocabulary size, max_features=5000
# could also be passed; here the vocabulary is left unrestricted.
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words=None)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms the data
# into feature vectors. The input to fit_transform should be a list of
# strings.
tweet_data_features = vect.fit_transform(clean_tweet_text)

# fit_transform() returns a sparse matrix; convert it to a dense NumPy
# array, which GaussianNB below requires
tweet_data_features = tweet_data_features.toarray()

Creating the bag of words...



In [258]:
print(tweet_data_features.shape)

(3548, 5647)
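The shape above reads as (number of tweets, number of distinct terms). A minimal sketch on a made-up three-document corpus shows how that matrix is laid out; the documents and the name `demo_vect` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical), standing in for clean_tweet_text.
docs = ["ipad great", "iphone battery bad", "great iphone app"]

demo_vect = CountVectorizer()
dtm = demo_vect.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

# vocabulary_ maps each learned term to its column index.
terms = sorted(demo_vect.vocabulary_)

print(dtm.shape)      # (3, 6)
print(terms)
print(dtm.toarray())  # dense view, one row of counts per document
```

Each row holds the term counts of one document, so most entries are zero; that is why `fit_transform` returns a sparse matrix by default.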


### 5. Find number of different words in vocabulary

In [259]:
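# Take a look at the words in the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use vect.get_feature_names_out() instead)
vocab = vect.get_feature_names()
print(vocab)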



#### Tip: To see all available attributes and methods of an object, use `dir`

In [260]:
len(vocab)

5647

5647 words exist in the vocabulary

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [261]:
df_new2['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

2978 positive and 570 negative emotions in the dataset
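Because the two classes are imbalanced, it helps to know the accuracy of a trivial model that always predicts the majority class. The following is a small worked calculation from the counts above, not something the original notebook computes.

```python
# Class counts taken from the value_counts() output above.
n_pos, n_neg = 2978, 570

total = n_pos + n_neg
# Accuracy obtained by always predicting "Positive emotion".
baseline = n_pos / total

print(total, round(baseline, 3))  # 3548 0.839
```

On the full dataset, any accuracy score close to 0.84 is therefore no better than always guessing the positive class.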

### 7. Change the labels for Positive and Negative emotions to 1 and 0 respectively and store them in a new column named `Labels` in the same dataframe

Hint: use map on that column and give labels

In [262]:
df_new2['Labels'] = df_new2.is_there_an_emotion_directed_at_a_brand_or_product.map({'Positive emotion':1,'Negative emotion':0 })

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [263]:
df_new2.head()

Unnamed: 0,index,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text,Labels
0,0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,,0
1,1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,,1
2,2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,,1
3,3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,,0
4,4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,,1


### 8. Define the feature set (independent variable, or X) to be the `text` column and `labels` as the target (dependent variable), and divide into train and test datasets

In [264]:
# df_ind = df_new2['tweet_text']
df_target = df_new2['Labels']

In [265]:
df_target.shape

(3548,)

In [266]:
# Split X and y into training and test sets in a 70:30 ratio (test_size=0.3)

X_train, X_test, y_train, y_test = train_test_split(tweet_data_features, df_target, test_size=0.3)
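The split above is random and unseeded, so the scores reported later will vary from run to run. A minimal sketch of a reproducible, class-balanced split follows, using the `random_state` and `stratify` parameters of `train_test_split`; the toy data and the seed value 42 are assumptions, and the notebook's recorded results were not produced this way.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, 16 positive and 4 negative.
X_demo = [[i] for i in range(20)]
y_demo = [1] * 16 + [0] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,     # same 70:30 ratio as above
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y_demo,   # keep the class proportions similar in both parts
)

print(len(X_tr), len(X_te))  # 14 6
```

Stratifying matters most for the minority class, which could otherwise be under-represented in the test set by chance.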

### 9. Predicting the sentiment


#### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text, and report their accuracy scores

In [267]:
# Invoking the NB Gaussian function to create the model
# fitting the model in the training data set
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [268]:
model.score(X_train , y_train)   

0.941602899718083

In [269]:
test_pred = model.predict(X_test)

print(metrics.classification_report(y_test, test_pred))
print(metrics.confusion_matrix(y_test, test_pred))

              precision    recall  f1-score   support

           0       0.34      0.51      0.41       166
           1       0.90      0.82      0.86       899

   micro avg       0.77      0.77      0.77      1065
   macro avg       0.62      0.66      0.63      1065
weighted avg       0.81      0.77      0.79      1065

[[ 85  81]
 [166 733]]
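The figures in the classification report can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below does this for class 0 (negative) and for overall accuracy; the variable names are illustrative only.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[85, 81],
      [166, 733]]

true0_pred0, true0_pred1 = cm[0]
true1_pred0, true1_pred1 = cm[1]

# Precision for class 0: of everything predicted 0, how much was really 0?
precision_0 = true0_pred0 / (true0_pred0 + true1_pred0)
# Recall for class 0: of everything really 0, how much was predicted 0?
recall_0 = true0_pred0 / (true0_pred0 + true0_pred1)
# Accuracy: correct predictions over all predictions.
accuracy = (true0_pred0 + true1_pred1) / sum(cm[0] + cm[1])

print(round(precision_0, 2), round(recall_0, 2), round(accuracy, 2))  # 0.34 0.51 0.77
```

The test accuracy of about 0.77 is well below the training score of 0.94 reported earlier, which suggests the Gaussian model overfits the training data.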


In [270]:
#Logistic regression

# Fit the model on the training set (70%) and score it on the held-out 30%
model2 = LogisticRegression()
model2.fit(X_train, y_train)
test_pred_LR = model2.predict(X_test)
model_score_LR = model2.score(X_test, y_test)
print(model_score_LR)



0.8798122065727699




In [271]:
print(metrics.classification_report(y_test, test_pred_LR))
print(metrics.confusion_matrix(y_test, test_pred_LR))

              precision    recall  f1-score   support

           0       0.76      0.33      0.46       166
           1       0.89      0.98      0.93       899

   micro avg       0.88      0.88      0.88      1065
   macro avg       0.83      0.66      0.70      1065
weighted avg       0.87      0.88      0.86      1065

[[ 55 111]
 [ 17 882]]


### 10. Create a function called `tokenize_predict` which takes a count vectorizer object as input and prints the accuracy for x (text) and y (labels)

In [272]:
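def tokenize_predict(vect, x_train, x_test, y_train, y_test):
    # Learn the vocabulary from the training text only, then build its document-term matrix
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    # Reuse the training vocabulary for the test text (no refitting)
    x_test_dtm = vect.transform(x_test)
    # Multinomial Naive Bayes works directly on the sparse count matrix
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))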

### 11. Create a count vectorizer with `ngram_range=(1, 2)` and pass it to the `tokenize_predict` function to print the accuracy score

In [273]:
vect = CountVectorizer(ngram_range=(1,2)) 
X = df_new2.tweet_text
labels = df_new2['Labels']
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3)



In [274]:
tokenize_predict(vect,X_train, X_test, y_train, y_test)

Features:  25883
Accuracy:  0.8760563380281691
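The jump in feature count comes from bigrams: with `ngram_range=(1, 2)` every pair of adjacent words becomes a feature alongside the single words. A minimal sketch on one made-up sentence (the sentence and the name `demo` are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(ngram_range=(1, 2))
demo.fit(["ipad battery life"])  # hypothetical one-sentence corpus

# Three unigrams plus the two adjacent-word bigrams.
features = sorted(demo.vocabulary_)
print(features)
# ['battery', 'battery life', 'ipad', 'ipad battery', 'life']
```

A sentence of n words contributes up to n unigrams and n - 1 bigrams, which is why the vocabulary grows so quickly on real tweets.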


### 12. Create a count vectorizer with `stop_words='english'` and pass it to the `tokenize_predict` function to print the accuracy score

In [275]:
vect = CountVectorizer(stop_words='english') 
tokenize_predict(vect,X_train, X_test, y_train, y_test)

Features:  4846
Accuracy:  0.8582159624413146


### 13. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_predict` function to print the accuracy score

In [276]:
vect = CountVectorizer(stop_words='english', max_features=300) 
tokenize_predict(vect,X_train, X_test, y_train, y_test)

Features:  300
Accuracy:  0.815962441314554


### 14. Create a count vectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to the `tokenize_predict` function to print the accuracy score

In [277]:
vect = CountVectorizer(ngram_range=(1,2), max_features=15000) 
tokenize_predict(vect,X_train, X_test, y_train, y_test)

Features:  15000
Accuracy:  0.8741784037558685


### 15. Create a count vectorizer with `ngram_range=(1, 2)` that keeps only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_predict` function to print the accuracy score

In [278]:
vect = CountVectorizer(ngram_range=(1,2), min_df=2) 

In [279]:
tokenize_predict(vect,X_train, X_test, y_train, y_test)

Features:  7896
Accuracy:  0.8647887323943662
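An integer `min_df` counts documents rather than total occurrences: a term is kept only if it appears in at least that many documents. A minimal sketch on a made-up three-document corpus (the documents are an assumption for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical).
docs = ["ipad great", "ipad bad", "iphone great"]

# Without a threshold every term is kept.
all_terms = sorted(CountVectorizer().fit(docs).vocabulary_)
# With min_df=2 only terms found in two or more documents survive.
frequent = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print(all_terms)  # ['bad', 'great', 'ipad', 'iphone']
print(frequent)   # ['great', 'ipad']
```

Dropping terms seen in a single document is what shrinks the bigram vocabulary from 25883 features to 7896 in the runs above.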
