# Exercise 12

## Analyze how travelers expressed their feelings on Twitter

This dataset comes from a sentiment analysis job about the problems of each major U.S. airline.
Tweets from February 2015 were scraped, and contributors were asked first to
classify each tweet as positive, negative, or neutral, and then to categorize
the reason behind each negative tweet (such as "late flight" or "rude service").

In [2]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

# read the data and use the first column (tweet_id) as the index
tweets = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Tweets.zip', index_col=0)

tweets.head()

tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [3]:
tweets.shape

### Proportion of tweets with each sentiment

In [5]:
tweets['airline_sentiment'].value_counts()
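
`value_counts()` returns raw counts. To express them as proportions, one option is the `normalize=True` argument. The sketch below uses a small hypothetical Series in place of `tweets['airline_sentiment']`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels standing in for tweets['airline_sentiment']
sentiment = pd.Series(['negative', 'negative', 'neutral', 'positive', 'negative'])

# normalize=True returns each class's share of the total instead of its count
proportions = sentiment.value_counts(normalize=True)
print(proportions)
```

On the real data the equivalent call would be `tweets['airline_sentiment'].value_counts(normalize=True)`.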

### Proportion of tweets per airline

In [7]:
tweets['airline'].value_counts()

In [8]:
tweets["airline"].value_counts().plot(kind="bar", figsize=(8, 6), rot=0)

In [9]:
pd.crosstab(index=tweets["airline"], columns=tweets["airline_sentiment"]).plot(
    kind='bar', figsize=(10, 6), alpha=0.5, rot=0, stacked=True, title="Sentiment by airline")
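
Because the airlines receive very different numbers of tweets, stacked raw counts can be hard to compare. A possible variation is to normalize each row of the crosstab so that every airline's bar sums to 1. The sketch uses a hypothetical toy frame with made-up airline names `A` and `B`, not the real data.

```python
import pandas as pd

# Hypothetical toy frame mirroring the two columns used in the crosstab
toy = pd.DataFrame({
    'airline': ['A', 'A', 'A', 'B', 'B'],
    'airline_sentiment': ['negative', 'negative', 'positive', 'neutral', 'negative'],
})

# normalize='index' divides each row by its total, giving per-airline shares
shares = pd.crosstab(index=toy['airline'],
                     columns=toy['airline_sentiment'],
                     normalize='index')
print(shares)
```

Chaining `.plot(kind='bar', stacked=True)` onto `shares` would then draw bars of equal height, which makes the sentiment mix of each airline directly comparable.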

# Exercise 12.1 

Predict the sentiment from CountVectorizer features.

Use a Random Forest classifier.

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# define a function that accepts a vectorizer, builds the document-term matrix
# from the global X, and summarizes 10-fold cross-validated accuracy against the global y
def tokenize_test(vect):
    X_dtm = vect.fit_transform(X)
    print('Features: ', X_dtm.shape[1])
    rf = RandomForestClassifier()
    print(pd.Series(cross_val_score(rf, X_dtm, y, cv=10)).describe())

In [12]:
%pip install nltk

In [13]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

In [14]:
X = tweets['text']
y = tweets['airline_sentiment'].map({'negative':-1,'neutral':0,'positive':1})

In [15]:
X.head()

In [16]:
y.head()

In [17]:
vect = CountVectorizer()
tokenize_test(vect)
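
To make the "Features" number concrete, here is a minimal sketch of what `CountVectorizer` produces on two hypothetical sentences (not taken from the dataset): one column per distinct token and one row per document, with each cell holding a token count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
docs = ['the flight was late', 'great flight great crew']

vect = CountVectorizer()
dtm = vect.fit_transform(docs)      # sparse document-term matrix

# Columns are ordered alphabetically by token
vocab = sorted(vect.vocabulary_)
print(vocab)
print(dtm.toarray())
```

The number of columns in `dtm` is what `tokenize_test` reports as `Features`.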

# Exercise 12.2 

Remove stop words, then predict the sentiment using CountVectorizer.

Use a Random Forest classifier.

In [19]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
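
As a rough illustration of what `stop_words='english'` changes, the sketch below fits two vectorizers on the same hypothetical sentences and lists the tokens that only the default one keeps. The exact tokens removed depend on scikit-learn's built-in English stop-word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
docs = ['the flight was late', 'great flight great crew']

keep_stop = CountVectorizer().fit(docs)
drop_stop = CountVectorizer(stop_words='english').fit(docs)

# Tokens present in the default vocabulary but filtered out as stop words
removed = sorted(set(keep_stop.vocabulary_) - set(drop_stop.vocabulary_))
print('Removed tokens:', removed)
print('Features:', len(keep_stop.vocabulary_), '->', len(drop_stop.vocabulary_))
```

Fewer features do not necessarily mean better accuracy: words on a stop list can still carry sentiment in short texts such as tweets, so it is worth comparing both cross-validation summaries.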

# Exercise 12.3

Increase the n-gram size (with and without stop-word removal), then predict the sentiment using CountVectorizer.

Use a Random Forest classifier.

### Stop words removed, ngram_range=(1, 4)

In [22]:
vect2 = CountVectorizer(stop_words = 'english', ngram_range=(1, 4))
tokenize_test(vect2)

### Stop words kept, ngram_range=(1, 4)

In [24]:
vect2 = CountVectorizer(ngram_range=(1, 4))
tokenize_test(vect2)

### Stop words removed, ngram_range=(1, 6)

In [26]:
vect2 = CountVectorizer(stop_words = 'english', ngram_range=(1, 6))
tokenize_test(vect2)

### Stop words kept, ngram_range=(1, 6)

In [28]:
vect2 = CountVectorizer(ngram_range=(1, 6))
tokenize_test(vect2)
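
Note that `ngram_range=(1, 4)` is inclusive on both ends: it extracts 1-grams, 2-grams, 3-grams, and 4-grams. The following sketch, on a single hypothetical five-word sentence, shows how quickly the feature count grows as the upper bound increases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy document with five tokens
docs = ['my flight was delayed again']

counts = {}
for upper in (1, 2, 4):
    vect = CountVectorizer(ngram_range=(1, upper))
    vect.fit(docs)
    counts[upper] = len(vect.vocabulary_)   # number of distinct n-grams

print(counts)
```

With five tokens there are 5 unigrams, 4 bigrams, 3 trigrams, and 2 four-grams, so the vocabulary grows from 5 to 9 to 14 features. On thousands of tweets the growth is far larger, and most long n-grams occur only once.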

# Exercise 12.4

Predict the sentiment using TfidfVectorizer.

Use a Random Forest classifier.

In [30]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words = 'english')
dtm = vect.fit_transform(X)

print('Features: ', dtm.shape[1])
rf = RandomForestClassifier()
print(pd.Series(cross_val_score(rf, dtm, y, cv=10)).describe())

In [31]:
# create a TF-IDF document-term matrix limited to the 100 most frequent 1- to 4-grams
vect = TfidfVectorizer(stop_words = 'english', ngram_range=(1, 4), max_features = 100)
dtm = vect.fit_transform(X)

print('Features: ', dtm.shape[1])
rf = RandomForestClassifier()
print(pd.Series(cross_val_score(rf, dtm, y, cv=10)).describe())
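
One design point worth considering: calling `fit_transform` on all of `X` before cross-validation lets the vocabulary and the IDF weights see the held-out folds. A possible alternative, sketched here under the assumption of a small hypothetical corpus and a fixed `random_state`, wraps the vectorizer and the classifier in a scikit-learn pipeline so the vectorizer is refit on each training fold only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; labels follow the notebook's encoding (-1, 0, 1)
texts = [
    'flight delayed for three hours', 'lost my luggage again',
    'rude crew and terrible service', 'cancelled flight with no refund',
    'what time does boarding start', 'is there wifi on this plane',
    'which gate for the denver flight', 'do you fly to boston',
    'great crew and smooth flight', 'thank you for the upgrade',
    'loved the friendly service', 'best airline experience ever',
]
labels = [-1] * 4 + [0] * 4 + [1] * 4

# The vectorizer is fit inside each training fold, never on the test fold
pipe = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(pipe, texts, labels, cv=3)
print(scores)
```

On the real data the same pattern would be `cross_val_score(pipe, X, y, cv=10)`. Scores from a corpus this small are not meaningful; the point is only the structure of the evaluation.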