## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices and brands.
Based on the text of each tweet, we classify whether the sentiment the user directs at a particular product is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the NAs while reading it

In [1]:
from IPython.core.interactiveshell import InteractiveShell

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB , MultinomialNB
from sklearn import metrics
import pandas as pd      
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
import nltk

InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings('ignore')

In [2]:

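# keep_default_na=False reads empty fields as empty strings instead of NaN,
# so the dropna() call below leaves the row count unchanged.
data = pd.read_csv("tweets.csv", encoding='unicode_escape', keep_default_na=False)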

In [3]:
data.shape

(9093, 3)

In [4]:
data = data.dropna()

In [5]:
data.shape

(9093, 3)
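
The two identical shapes above follow from `keep_default_na=False`: empty fields are kept as empty strings, so `dropna()` has nothing to remove. A minimal sketch of the difference on a made-up in-memory CSV (the column names and rows are invented for illustration and are not taken from `tweets.csv`):

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the second row has an empty `product` field.
csv_text = "tweet,product\ngreat phone,iPhone\nno target here,\nbad battery,iPad\n"

# Default parsing: the empty field becomes NaN, so dropna() removes that row.
default_df = pd.read_csv(io.StringIO(csv_text))
rows_after_default = len(default_df.dropna())

# keep_default_na=False: the empty field stays an empty string,
# so dropna() removes nothing.
raw_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
rows_after_raw = len(raw_df.dropna())

print(rows_after_default, rows_after_raw)
```

If the intent is to actually discard rows with missing values, reading the file with the default NA handling before calling `dropna()` is one option; note that this would change the row counts reported in the rest of this notebook.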

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9093 entries, 0 to 9092
Data columns (total 3 columns):
tweet_text                                            9093 non-null object
emotion_in_tweet_is_directed_at                       9093 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    9093 non-null object
dtypes: object(3)
memory usage: 284.2+ KB


In [7]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [8]:

def preprocess(text):
    """Normalise one tweet; return "" if `text` is not a string."""
    try:
        # Replace every non-word character (punctuation, symbols) with a space
        processed_tweet = re.sub(r'\W', ' ', text)

        # Remove single letters surrounded by whitespace
        processed_tweet = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_tweet)

        # Remove a single letter at the start of the string
        # (the caret must be unescaped to act as an anchor)
        processed_tweet = re.sub(r'^[a-zA-Z]\s+', ' ', processed_tweet)

        # Collapse runs of whitespace into a single space
        processed_tweet = re.sub(r'\s+', ' ', processed_tweet)

        # Remove a leading 'b' left over from stringified bytes
        processed_tweet = re.sub(r'^b\s+', '', processed_tweet)

        # Convert to lowercase
        processed_tweet = processed_tweet.lower()

        return str(processed_tweet)

    except TypeError:
        # re.sub raises TypeError for non-string input such as NaN
        return ""
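
To see what these regex steps do to a single tweet, here is a small self-contained sketch. The helper name `clean_tweet` and the sample string are illustrative assumptions; the helper applies the main substitutions from the cell above and additionally strips surrounding whitespace.

```python
import re

def clean_tweet(text):
    """Apply the main regex steps from the preprocessing cell to one string."""
    text = re.sub(r'\W', ' ', text)               # punctuation and symbols -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # drop single letters between spaces
    text = re.sub(r'\s+', ' ', text)              # collapse whitespace runs
    return text.lower().strip()

sample = "@sxsw I hope this year's festival isn't as crashy!"
print(clean_tweet(sample))
# sxsw hope this year festival isn as crashy
```

Note that the apostrophe handling turns "isn't" into "isn", so the negation is partly lost. For sentiment work that may matter, and keeping contractions intact could be worth trying.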

In [9]:
data['text'] = [preprocess(text) for text in data.tweet_text]

In [10]:
data.shape

(9093, 4)

In [11]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 have 3g iphone after 3 hrs tweeting ... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipad ipho... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin can not wait for ipad 2 also they ... |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | sxsw hope this year festival isn as crashy as... |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | sxtxstate great stuff on fri sxsw marissa may... |


In [12]:
data.columns

Index(['tweet_text', 'emotion_in_tweet_is_directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product', 'text'],
      dtype='object')

In [13]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [14]:
data["text"].value_counts()

rt mention google to launch major new social network called circles possibly today link sxsw                                                    11
rt mention marissa mayer google will connect the digital amp physical worlds through mobile link sxsw                                           10
win free ipad 2 from webdoc com sxsw rt                                                                                                          7
google to launch major new social network called circles possibly today link sxsw                                                                7
at sxsw apple schools the marketing experts link                                                                                                 4
marissa mayer google will connect the digital amp physical worlds through mobile link sxsw                                                       4
before it even begins apple wins sxsw link                                                                            

In [15]:
data["emotion_in_tweet_is_directed_at"].value_counts()

                                   5802
iPad                                946
Apple                               661
iPad or iPhone App                  470
Google                              430
iPhone                              297
Other Google product or service     293
Android App                          81
Android                              78
Other Apple product or service       35
Name: emotion_in_tweet_is_directed_at, dtype: int64

### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [16]:
#data_s3_b4 = data.copy()

In [17]:
data = data[data["is_there_an_emotion_directed_at_a_brand_or_product"].isin(["Positive emotion","Negative emotion"])]

In [18]:
data.shape

(3548, 4)

In [19]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [20]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 have 3g iphone after 3 hrs tweeting ... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipad ipho... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin can not wait for ipad 2 also they ... |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | sxsw hope this year festival isn as crashy as... |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | sxtxstate great stuff on fri sxsw marissa may... |


In [21]:
data.isnull().sum().sum()

0

In [22]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].isnull().sum().sum()

0

In [23]:
data["emotion_in_tweet_is_directed_at"].isnull().sum().sum()

0

In [24]:
data["tweet_text"].isnull().sum().sum()

0

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [25]:
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(data["text"])
# summarize
print(vectorizer.vocabulary_)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)



In [26]:
# encode document
vector = vectorizer.transform(data["text"])
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())


(3548, 6022)
<class 'scipy.sparse.csr.csr_matrix'>
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [27]:
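# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
dtm_df = pd.DataFrame(vector.toarray(), columns=vectorizer.get_feature_names())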
print(dtm_df)

      000  02  03  0310apple  08  10  100  100s  100tc  101  ...    ûïmute  \
0       0   0   0          0   0   0    0     0      0    0  ...         0   
1       0   0   0          0   0   0    0     0      0    0  ...         0   
2       0   0   0          0   0   0    0     0      0    0  ...         0   
3       0   0   0          0   0   0    0     0      0    0  ...         0   
4       0   0   0          0   0   0    0     0      0    0  ...         0   
5       0   0   0          0   0   0    0     0      0    0  ...         0   
6       0   0   0          0   0   0    0     0      0    0  ...         0   
7       0   0   0          0   0   0    0     0      0    0  ...         0   
8       0   0   0          0   0   0    0     0      0    0  ...         0   
9       0   0   0          0   0   0    0     0      0    0  ...         0   
10      0   0   0          0   0   0    0     0      0    0  ...         0   
11      0   0   0          0   0   0    0     0      0    0  ...
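
To make the document-term matrix less opaque, the following is a simplified pure-Python stand-in for what `CountVectorizer` computes with its default settings: lowercasing, tokens of two or more word characters (the `token_pattern` shown in the repr above), and a sorted vocabulary. The two sample documents are made up, and this is only a sketch of the idea, not scikit-learn's implementation.

```python
import re
from collections import Counter

docs = ["iPad 2 is great", "great great app for iPhone"]  # made-up documents

# Default token pattern: runs of two or more word characters, after lowercasing.
token_re = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_re.findall(doc.lower()) for doc in docs]

# Vocabulary: sorted unique tokens, each mapped to a column index.
unique_terms = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {term: idx for idx, term in enumerate(unique_terms)}

# One row per document, one column per term; entries are raw counts.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts.get(term, 0) for term in unique_terms])

print(vocabulary)
print(matrix)
```

The single-character token "2" is dropped by the pattern, which is also why the single letters removed during preprocessing would not have reached the vocabulary anyway.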

### 5. Find number of different words in vocabulary

In [28]:
dir(vectorizer)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_limit_features',
 '_sort_features',
 '_validate_vocabulary',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixed_vocabulary_',
 'get_feature_names',
 'get_params',
 'get_stop_words',
 'input',
 'inverse_transform',
 'lowercase',
 'max_df',
 'max_features',
 'min_df',
 'ngram_range',
 'preprocessor',
 'set_params',
 'stop_words',
 'stop_word

In [29]:

vectorizer.vocabulary_

{'wesley83': 5775,
 'have': 2461,
 '3g': 91,
 'iphone': 2831,
 'after': 252,
 'hrs': 2609,
 'tweeting': 5484,
 'at': 463,
 'rise_austin': 4433,
 'it': 2855,
 'was': 5728,
 'dead': 1391,
 'need': 3547,
 'to': 5351,
 'upgrade': 5574,
 'plugin': 3972,
 'stations': 4939,
 'sxsw': 5096,
 'jessedee': 2887,
 'know': 2989,
 'about': 173,
 'fludapp': 2056,
 'awesome': 524,
 'ipad': 2821,
 'app': 389,
 'that': 5247,
 'you': 5953,
 'll': 3141,
 'likely': 3109,
 'appreciate': 410,
 'for': 2090,
 'its': 2858,
 'design': 1455,
 'also': 312,
 'they': 5272,
 're': 4237,
 'giving': 2268,
 'free': 2124,
 'ts': 5452,
 'swonderlin': 5087,
 'can': 876,
 'not': 3616,
 'wait': 5698,
 'should': 4682,
 'sale': 4489,
 'them': 5258,
 'down': 1613,
 'hope': 2580,
 'this': 5286,
 'year': 5935,
 'festival': 1982,
 'isn': 2850,
 'as': 452,
 'crashy': 1276,
 'sxtxstate': 5124,
 'great': 2346,
 'stuff': 5007,
 'on': 3689,
 'fri': 2130,
 'marissa': 3285,
 'mayer': 3319,
 'google': 2308,
 'tim': 5322,
 'reilly': 4316,
 

#### Tip: To see all available attributes and methods of an object, use `dir`
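
The task asks for a number, which the cells above do not print directly. Since `vocabulary_` maps each term to a column index, its length is the vocabulary size and should match the number of columns in the document-term matrix. A minimal sketch on a made-up corpus (the names `toy_docs`, `toy_vect`, and `toy_dtm` are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["apple wins sxsw", "google wins sxsw", "apple ipad launch"]  # made-up corpus

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index,
# so its length is the number of distinct terms.
n_terms = len(toy_vect.vocabulary_)
print(n_terms, toy_dtm.shape[1])
```

Applied to the fitted `vectorizer` from this notebook, `len(vectorizer.vocabulary_)` should equal the 6022 columns reported by `vector.shape`.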

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [30]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 7. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [31]:
data['Label']=data["is_there_an_emotion_directed_at_a_brand_or_product"].map({"Negative emotion":0, "Positive emotion":1})

In [32]:
data['Label'].value_counts()

1    2978
0     570
Name: Label, dtype: int64
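
One detail worth knowing about `Series.map` with a dictionary: values that are missing from the dictionary become `NaN`. The mapping above is clean only because the other two categories were filtered out in step 3. A small sketch with made-up values:

```python
import pandas as pd

labels = pd.Series(["Positive emotion", "Negative emotion", "I can't tell"])

# Categories absent from the dictionary are mapped to NaN.
mapped = labels.map({"Negative emotion": 0, "Positive emotion": 1})

n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), n_unmapped)
```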

In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3548 entries, 0 to 9088
Data columns (total 5 columns):
tweet_text                                            3548 non-null object
emotion_in_tweet_is_directed_at                       3548 non-null object
is_there_an_emotion_directed_at_a_brand_or_product    3548 non-null object
text                                                  3548 non-null object
Label                                                 3548 non-null int64
dtypes: int64(1), object(4)
memory usage: 166.3+ KB


### 8. Define the feature set (independent variable, X) as the `text` column and `Label` as the target (dependent variable), then split into train and test datasets

In [34]:

print(vector.shape,data['Label'].shape)
X_train_cv, X_test_cv, y_train, y_test = train_test_split(vector, data['Label'].values, test_size=0.25)
print(X_train_cv.shape,y_train.shape,X_test_cv.shape,y_test.shape)
X_train_cv=X_train_cv.toarray()
X_test_cv=X_test_cv.toarray()
print(X_train_cv.shape,y_train.shape,X_test_cv.shape,y_test.shape)

(3548, 6022) (3548,)
(2661, 6022) (2661,) (887, 6022) (887,)
(2661, 6022) (2661,) (887, 6022) (887,)
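
The split above sets neither `random_state` nor `stratify`, so the exact numbers reported below will vary from run to run, and the share of negative tweets in the test set is left to chance. With classes as imbalanced as these (2978 positive versus 570 negative), a stratified, seeded split may be preferable. A sketch on made-up labels, where the seed value 42 is an arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 positives, 20 negatives.
y = np.array([1] * 80 + [0] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(y_tr.mean(), y_te.mean())
```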


### 9. Predicting the sentiment

#### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text, and compare their accuracy scores

In [35]:
model_nb = GaussianNB()
model_nb.fit(X_train_cv, y_train)
model_nb.score(X_train_cv, y_train)
model_nb.score(X_test_cv, y_test)
test_pred_nb = model_nb.predict(X_test_cv)
print(metrics.classification_report(y_test, test_pred_nb))
print(metrics.confusion_matrix(y_test, test_pred_nb))
accuracy = (test_pred_nb == y_test).mean()
accuracy

GaussianNB(priors=None)

0.9297256670424653

0.7835400225479143

             precision    recall  f1-score   support

          0       0.35      0.54      0.43       132
          1       0.91      0.83      0.87       755

avg / total       0.83      0.78      0.80       887

[[ 71  61]
 [131 624]]


0.7835400225479143

In [36]:
#model = LogisticRegression(C=0.2, dual=True)
#model.fit(X_train_tfidf, y_train)
#preds = model.predict(X_test_tfidf)
#accuracy = (preds == y_test).mean()

model_logr = LogisticRegression()
model_logr.fit(X_train_cv, y_train)
model_logr.score(X_train_cv, y_train)
model_logr.score(X_test_cv, y_test)
test_pred_logr = model_logr.predict(X_test_cv)
print(metrics.classification_report(y_test, test_pred_logr))
print(metrics.confusion_matrix(y_test, test_pred_logr))
accuracy = (test_pred_logr == y_test).mean()
accuracy

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

0.9804584742577979

0.8793686583990981

             precision    recall  f1-score   support

          0       0.67      0.38      0.48       132
          1       0.90      0.97      0.93       755

avg / total       0.86      0.88      0.86       887

[[ 50  82]
 [ 25 730]]


0.8793686583990981
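
Accuracy alone can flatter a model on imbalanced data, so it helps to recompute a few numbers from the confusion matrices printed above. The sketch below copies those matrices by hand and derives accuracy, recall on the negative class, and the accuracy of a trivial baseline that always predicts class 1. The `summary` dictionary is just an illustrative container.

```python
# Confusion matrices copied from the outputs above:
# rows = true class (0, 1), columns = predicted class (0, 1).
confusions = {
    "GaussianNB": [[71, 61], [131, 624]],
    "LogisticRegression": [[50, 82], [25, 730]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in confusions.items():
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    negative_recall = tn / (tn + fp)   # share of true negatives that were found
    baseline = (fn + tp) / total       # accuracy of always predicting class 1
    summary[name] = (accuracy, negative_recall, baseline)
    print(f"{name}: accuracy={accuracy:.3f}, "
          f"negative recall={negative_recall:.2f}, baseline={baseline:.3f}")
```

On this test split the always-positive baseline already reaches about 0.851, so Logistic Regression's 0.879 is a modest gain, and it finds only 38% of the negative tweets. GaussianNB scores below the baseline on accuracy but recovers more negatives (54%).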

### 10. Create a function called `tokenize_predict` (implemented below as `tokenize_test`) that takes a count vectorizer object as input and prints the accuracy for x (text) and y (labels)

In [37]:
X_train, X_test, Y_train, Y_test = train_test_split(data["text"], data["Label"], test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(2483,) (1065,) (2483,) (1065,)


In [38]:
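def tokenize_test(vect):
    """Fit `vect` on X_train, train MultinomialNB, and print the test accuracy."""
    # Learn the vocabulary on the training text only and build its document-term matrix
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    # Reuse the fitted vocabulary for the test text
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, Y_train)
    Y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(Y_test, Y_pred_class))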

In [39]:
vectorizer1 = CountVectorizer()
tokenize_test(vectorizer1)

Features:  4994
Accuracy:  0.8676056338028169


### 11. Create a count vectorizer with n_grams = 1,2 and pass it to the `tokenize_predict` function to print the accuracy score

In [40]:
vectorizer2 = CountVectorizer(ngram_range=(1,2))
tokenize_test(vectorizer2)

Features:  25332
Accuracy:  0.8751173708920188
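
The jump from 4994 to 25332 features comes from adding every pair of adjacent words on top of the single words. A pure-Python sketch of that expansion, where `word_ngrams` is an illustrative helper and not scikit-learn's internal function:

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["can", "not", "wait", "for", "ipad"]
print(word_ngrams(tokens))
# ['can', 'not', 'wait', 'for', 'ipad', 'can not', 'not wait', 'wait for', 'for ipad']
```

Bigrams such as "can not" keep some word order that single tokens lose, which is one plausible reason for the small accuracy gain here.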


### 12. Create a count vectorizer with stopwords = 'english' and pass it to the `tokenize_predict` function to print the accuracy score

In [41]:
vectorizer3 = CountVectorizer( stop_words='english')
tokenize_test(vectorizer3)

Features:  4758
Accuracy:  0.8666666666666667


### 13. Create a count vectorizer with stopwords = 'english' and max_features = 300 and pass it to the `tokenize_predict` function to print the accuracy score

In [42]:
# Note: the task also asks for stop_words='english'; the output below was produced without it.
vectorizer4 = CountVectorizer(max_features=300)
tokenize_test(vectorizer4)

Features:  300
Accuracy:  0.8112676056338028


### 14. Create a count vectorizer with n_grams = 1,2 and max_features = 15000 and pass it to the `tokenize_predict` function to print the accuracy score

In [43]:
vectorizer5 = CountVectorizer(ngram_range=(1,2),max_features =15000)
tokenize_test(vectorizer5)

Features:  15000
Accuracy:  0.8788732394366198


### 15. Create a count vectorizer with n_grams = 1,2 that keeps only terms appearing in at least 2 documents (min_df = 2), and pass it to the `tokenize_predict` function to print the accuracy score

In [44]:
vectorizer6 = CountVectorizer(ngram_range=(1,2), min_df = 2)
tokenize_test(vectorizer6)

Features:  8112
Accuracy:  0.8685446009389671
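
Note that `min_df` counts documents, not total occurrences: a term repeated many times inside a single tweet is still dropped when `min_df=2`. A minimal sketch on a made-up corpus (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# "apple" occurs three times but in one document only; "google" occurs in two documents.
toy_docs = ["apple apple apple", "google rocks", "google wins"]

unfiltered = CountVectorizer().fit(toy_docs)
filtered = CountVectorizer(min_df=2).fit(toy_docs)

kept_all = sorted(unfiltered.vocabulary_)
kept_min_df = sorted(filtered.vocabulary_)
print(kept_all)
print(kept_min_df)
```

Only "google" survives the filter, which illustrates how `min_df=2` reduced the n-gram feature count from 25332 to 8112 above.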
