# Business Case
As an independent third-party company, we have been asked by companies such as Apple and Google to determine how they can improve customer satisfaction by fixing the features that customers are most unhappy with. Specifically, they want us to use a dataset of tweets, each manually labeled with a sentiment, to determine which products and services customers feel negatively about. They also want the code for a model that predicts the sentiment of a tweet, so that incoming tweets can be sorted into the right category automatically. In short, our job is to build a model that predicts the sentiment of future tweets and to report which products and services customers feel negatively about.

# The Data

The dataset used in this project is available on data.world <a href="https://data.world/crowdflower/brands-and-product-emotions">here</a>. It is called Brands and Product Emotions and contains a little over 9,000 rows of tweets. Each tweet is paired with a manually encoded emotion: positive, negative, no emotion, or "I can't tell". Rows labeled "I can't tell" are dropped during inspection.

# Project Outline
* [Import Modules and Dataset](#import)
* [Basic Data Inspection](#inspect)
* [Preprocess data](#preprocess)
 * [Tokenization](#tokenization)
 * [Lemmatization](#lemitization)
 * [Filtering stop words](#stop_words)
 * [Vectorization](#vectorize)
     * Count Vectorizer
     * TFIDF Vectorizer
     * Word Embeddings
* [EDA](#eda)
* [Modeling](#model)
* [Conclusion](#conclusion)
* [Business Recommendations](#business_rec)

# <a id='import'>Import</a>

In [120]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns

%matplotlib inline

In [121]:
df = pd.read_csv('tweet_sentiment.csv',encoding='latin1')

# <a id='inspect'>Inspect</a>

In [122]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


In [123]:
len(df)

9093

In [124]:
print(df['tweet_text'][0])

.@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!  I need to upgrade. Plugin stations at #SXSW.


In [125]:
df.isnull().sum()

tweet_text                                               1
emotion_in_tweet_is_directed_at                       5802
is_there_an_emotion_directed_at_a_brand_or_product       0
dtype: int64

In [126]:
df['emotion_in_tweet_is_directed_at'].value_counts(dropna=False)

NaN                                5802
iPad                                946
Apple                               661
iPad or iPhone App                  470
Google                              430
iPhone                              297
Other Google product or service     293
Android App                          81
Android                              78
Other Apple product or service       35
Name: emotion_in_tweet_is_directed_at, dtype: int64

In [127]:
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts(dropna=False)

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
I can't tell                           156
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [128]:
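# Drop the ambiguous "I can't tell" rows; .copy() avoids SettingWithCopyWarning on later in-place edits
df = df[df['is_there_an_emotion_directed_at_a_brand_or_product'] != "I can't tell"].copy()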
df['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts(dropna=False)

No emotion toward brand or product    5389
Positive emotion                      2978
Negative emotion                       570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

In [132]:
df['tweet_text'][6]

nan

In [137]:
df.dropna(subset=['tweet_text'], inplace=True)

In [138]:
df.head(10)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
5,@teachntech00 New iPad Apps For #SpeechTherapy...,,No emotion toward brand or product
7,"#SXSW is just starting, #CTIA is around the co...",Android,Positive emotion
8,Beautifully smart and simple idea RT @madebyma...,iPad or iPhone App,Positive emotion
9,Counting down the days to #sxsw plus strong Ca...,Apple,Positive emotion
10,Excited to meet the @samsungmobileus at #sxsw ...,Android,Positive emotion


In [150]:
df.rename(columns={"emotion_in_tweet_is_directed_at": "emotion_directed_at", "is_there_an_emotion_directed_at_a_brand_or_product": "emotion"}, inplace=True)

In [151]:
df.head()

Unnamed: 0,tweet_text,emotion_directed_at,emotion
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


# <a id='preprocess'>Preprocess</a>

In [139]:
# For now, just going to use the tweet text and emotion columns
from sklearn.model_selection import train_test_split
X = df['tweet_text']
# The label column was renamed to 'emotion' in the Inspect section
y = df['emotion']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [140]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [141]:
text_clf = Pipeline([('Tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])

In [142]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('Tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [143]:
predictions = text_clf.predict(X_test)

In [144]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


In [145]:
print(confusion_matrix(y_test, predictions))

[[  50  104   35]
 [  20 1319  273]
 [   9  413  458]]


In [147]:
print(classification_report(y_test,predictions))

                                    precision    recall  f1-score   support

                  Negative emotion       0.63      0.26      0.37       189
No emotion toward brand or product       0.72      0.82      0.77      1612
                  Positive emotion       0.60      0.52      0.56       880

                         micro avg       0.68      0.68      0.68      2681
                         macro avg       0.65      0.53      0.56      2681
                      weighted avg       0.67      0.68      0.67      2681



In [148]:
print(accuracy_score(y_test, predictions))

0.6814621409921671


# Preprocess 2

In [169]:
new_df = df[df.emotion != 'No emotion toward brand or product']
new_df['emotion'].value_counts()

Positive emotion    2978
Negative emotion     570
Name: emotion, dtype: int64

In [170]:
X = new_df['tweet_text']
y = new_df['emotion']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [171]:
text_clf = Pipeline([('Tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])

In [172]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('Tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [173]:
predictions = text_clf.predict(X_test)

In [174]:
print(classification_report(y_test,predictions))

                  precision    recall  f1-score   support

Negative emotion       0.77      0.42      0.54       173
Positive emotion       0.90      0.98      0.93       892

       micro avg       0.89      0.89      0.89      1065
       macro avg       0.83      0.70      0.74      1065
    weighted avg       0.88      0.89      0.87      1065



In [175]:
print(confusion_matrix(y_test, predictions))

[[ 73 100]
 [ 22 870]]


In [177]:
print(accuracy_score(y_test, predictions))

0.8854460093896713


## <a id='tokenization'>Tokenization</a>
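
This section is still a placeholder. As a minimal sketch of what tokenization could look like for these tweets, the snippet below lowercases the text and splits it with a regular expression. The pattern and the `tokenize` helper are assumptions for illustration, not the notebook's final method; the pattern keeps @mentions and #hashtags as single tokens.

```python
import re

# Assumed pattern: an optional leading @ or #, then word characters,
# with an optional apostrophe part so contractions stay together.
TOKEN_PATTERN = re.compile(r"[@#]?\w+(?:'\w+)?")

def tokenize(text):
    """Lowercase a tweet and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())

# Start of the first tweet in the dataset
sample = ".@wesley83 I have a 3G iPhone. After 3 hrs tweeting at #RISE_Austin, it was dead!"
tokens = tokenize(sample)
print(tokens)
```

Punctuation is discarded, so sentiment cues such as "!" are lost with this pattern; whether that matters would need to be checked against model performance.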

## <a id='lemitization'>Lemmatization</a>

## <a id='stop_words'>Filtering stop words/ adding in own stop words</a>
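
A possible starting point for this step, sketched under assumptions: scikit-learn's built-in English stop word list is combined with a small custom set. The custom words (`sxsw`, `rt`) and the `remove_stop_words` helper are hypothetical choices; they are assumed to carry little sentiment in this dataset.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical custom additions, assumed to be uninformative for sentiment
custom_stop_words = {"sxsw", "rt"}
stop_words = ENGLISH_STOP_WORDS.union(custom_stop_words)

def remove_stop_words(tokens):
    """Drop stop words, ignoring a leading # or @ when checking membership."""
    return [t for t in tokens if t.lstrip("#@") not in stop_words]

tokens = ["i", "need", "to", "upgrade", "#sxsw", "rt", "iphone"]
filtered = remove_stop_words(tokens)
print(filtered)
```

The same combined list could also be passed to a vectorizer through its `stop_words` parameter (as a list) instead of filtering by hand.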

## <a id='vectorize'>Vectorizing - words to numbers</a>


### Count Vectorizer
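
As a rough illustration of what a count vectorizer produces, the sketch below fits scikit-learn's `CountVectorizer` on three made-up sentences (not rows from the dataset) and prints the vocabulary and the document-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example sentences, not taken from the dataset
docs = [
    "my iphone battery is dead",
    "love the new ipad",
    "the ipad app is great, love it",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_  # maps each word to its column index
print(sorted(vocab))
print(counts.toarray())
```

Each column holds the raw number of times a word occurs in each document, so frequent but uninformative words get the same weight as rare ones.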

### TFIDF Vectorizer
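
The baseline pipelines above already use `TfidfVectorizer` with default settings. The sketch below, on made-up sentences, is meant to show the main difference from raw counts: a word that appears in every document ("ipad") receives a lower weight than a word unique to one document ("battery").

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example sentences, not taken from the dataset
docs = [
    "ipad battery dead",
    "ipad screen great",
    "ipad app crashed",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_

common = weights[0, vocab["ipad"]]    # appears in all three documents
rare = weights[0, vocab["battery"]]   # appears only in the first document
print(common, rare)

# With the default norm='l2', each row has unit length
print(np.linalg.norm(weights[0].toarray()))
```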

### Word Embeddings

# <a id='eda'>EDA</a>
* Word cloud (for presentation) - neg, pos, neutral
* Most common words in pos, neg, neutral
* Highest TF-IDF scores (pos, neg, neutral)
* Any other curious distributions in data
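
One way the "most common words" item could be approached is sketched below. The `toy` frame is a made-up stand-in for `df` (same column names after the rename), and `top_words` is a hypothetical helper that splits on whitespace only; a real version would likely reuse the tokenization and stop word steps.

```python
from collections import Counter

import pandas as pd

# Made-up stand-in for the notebook's df
toy = pd.DataFrame({
    "tweet_text": [
        "love my new ipad",
        "ipad battery died again",
        "love the google party",
        "iphone battery is dead",
    ],
    "emotion": [
        "Positive emotion",
        "Negative emotion",
        "Positive emotion",
        "Negative emotion",
    ],
})

def top_words(frame, label, n=3):
    """Return the n most frequent whitespace-separated words for one label."""
    texts = frame.loc[frame["emotion"] == label, "tweet_text"]
    counter = Counter(word for text in texts for word in text.lower().split())
    return counter.most_common(n)

top_pos = top_words(toy, "Positive emotion")
top_neg = top_words(toy, "Negative emotion")
print(top_pos)
print(top_neg)
```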

# <a id='model'>Model</a>

* Naive Bayes: good for a quick first test
* Tree-based ensemble (e.g., random forest) to get feature importances
* Multi-class project - include neutral
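
As a sketch of the Naive Bayes idea from the list above, the pipeline below swaps `LinearSVC` for `MultinomialNB` while keeping the same `Pipeline` structure used earlier. The training sentences are made up for illustration; in the notebook the pipeline would be fit on `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data, standing in for X_train / y_train
train_texts = [
    "love the ipad",
    "great app love it",
    "battery dead again",
    "app crashed hate it",
]
train_labels = [
    "Positive emotion",
    "Positive emotion",
    "Negative emotion",
    "Negative emotion",
]

nb_clf = Pipeline([("tfidf", TfidfVectorizer()),
                   ("clf", MultinomialNB())])
nb_clf.fit(train_texts, train_labels)

new_preds = nb_clf.predict(["love this great ipad", "hate the dead battery"])
print(new_preds)
```

Because only the final step changes, the same `classification_report` and `confusion_matrix` calls used for the SVM baseline would apply unchanged, which makes the two models easy to compare.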

## LDA - second NLP notebook (level up)

# <a id='conclusion'>Conclusion</a>

# <a id='business_rec'>Business Recommendations</a>