# Classification Predict Student Solution


### Predict Overview: Twitter Sentiment Classification

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received. Your company has been awarded the contract to:

- 1. Create a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.
- 2. Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.
- 5. Evaluate the accuracy of the best machine learning model.
- 6. Determine which features were most important in the model’s prediction decision.
- 7. Explain the inner workings of the model to a non-technical audience.

The formal problem statement, given to the data science team by your manager via email, reads as follows:

> The collection of this data was funded by a Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo. The dataset aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43,943 tweets were collected. Each tweet is labelled as one of 4 classes, which are described below:

> - 2 News: the tweet links to factual news about climate change

> - 1 Pro: the tweet supports the belief of man-made climate change

> - 0 Neutral: the tweet neither supports nor refutes the belief of man-made climate change

> - -1 Anti: the tweet does not believe in man-made climate change
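
For readable reports and plots, the integer codes above can be mapped to their class names. This is a minimal sketch; the `SENTIMENT_LABELS` dictionary and the `label_name` helper are illustrative names and are not part of the original notebook.

```python
# Mapping from the integer sentiment codes to the class names in the brief.
SENTIMENT_LABELS = {
    2: "News",
    1: "Pro",
    0: "Neutral",
    -1: "Anti",
}


def label_name(code):
    """Return the class name for a sentiment code (hypothetical helper)."""
    return SENTIMENT_LABELS[int(code)]


print([label_name(c) for c in (-1, 0, 1, 2)])
```

With pandas, the same dictionary could be passed to `Series.map`, for example on the `sentiment` column once the data is loaded.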

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Feature Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [24]:
# Libraries for data loading, data manipulation and data visualisation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
sns.set()

# Libraries for data preparation and model building
import string
import preprocessor as p
from nltk.tokenize import word_tokenize
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Model importation
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import pickle

# Metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


In [25]:
df_train = pd.read_csv('resources/train.csv')

In [26]:
# Display the training dataframe
df_train

|  | sentiment | message | tweetid |
| ---: | ---: | :--- | ---: |
| 0 | -1 | RT @darreljorstad: Funny as hell! Canada deman... | 897853122080407553 |
| 1 | -1 | All the biggest lies about climate change and ... | 925046776553529344 |
| 2 | -1 | The Coming Revelation Of The $q$Global Warming... | 696354236850786305 |
| 3 | -1 | RT @DineshDSouza: Let's see if the world ends ... | 846806509732483072 |
| 4 | -1 | RT @SteveSGoddard: Obama has no control over t... | 628085266293653504 |
| ... | ... | ... | ... |
| 30754 | 2 | RT @TIME: The Pentagon warned that climate cha... | 958155326259367937 |
| 30755 | 2 | Study finds that global warming exacerbates re... | 956048238615900163 |
| 30756 | 2 | RT @MikeySlezak: The global green movement pre... | 800258621485391872 |
| 30757 | 2 | RT @ProfEdwardsNZ: NYC Mayor says NY will go f... | 871365767895404545 |


<a id="four"></a>
## 4. Feature Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In this section, the observations from the EDA are addressed and the new features required by the model are created.

To deal with the data quality issues, the tweet-preprocessor library will be used.
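
The cells that follow do not show the cleaning step itself. As a rough, standard-library-only stand-in for the kind of cleaning a tweet preprocessor performs, the sketch below strips retweet markers, URLs and @mentions with regular expressions. The `clean_tweet` function, its patterns and the sample tweet are assumptions made for illustration; they are not the rules that tweet-preprocessor applies.

```python
import re

# Rough patterns for common tweet noise (illustrative approximations only).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"^RT\s+")


def clean_tweet(text):
    """Strip retweet markers, URLs and @mentions, then tidy whitespace."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    # Drop a colon left dangling after a removed mention ("RT @user: ...").
    text = re.sub(r"^\s*:\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample tweet, not taken from the dataset.
sample = "RT @example_user: Climate change is real https://example.com/story"
print(clean_tweet(sample))
```

The tweet-preprocessor package, imported above as `p`, offers a `p.clean` function for this purpose; its documentation lists the options it supports.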

In [27]:
features = df_train['message']
target = df_train['sentiment']

In [28]:
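# Instantiate a TfidfVectorizer object.
# Note: preprocessor=list and tokenizer=list split each tweet into single
# characters, so the (1, 2) n-grams here are built from characters, not words.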
vect = TfidfVectorizer(preprocessor=list, tokenizer=list, ngram_range=(1,2), min_df=2, strip_accents='ascii', smooth_idf=False)

train_vect = vect.fit(features)



In [29]:
model_save_path = "resources/tfidfvect_team1.pkl"

with open(model_save_path,'wb') as file:
    pickle.dump(train_vect,file)

In [30]:
X_train = train_vect.transform(features)

In [31]:
features = X_train.toarray()
features.shape

(30759, 7343)
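
Calling `.toarray()` materialises every cell of the matrix, including the zeros. The arithmetic below estimates the memory this takes for the shape reported above, assuming the vectorizer's default `float64` output (8 bytes per value).

```python
# Shape reported by the notebook for the dense feature matrix.
n_rows, n_cols = 30759, 7343

# Assumption: float64 values, i.e. 8 bytes per cell.
bytes_per_value = 8

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"{dense_bytes:,} bytes = {dense_bytes / 1024**3:.2f} GiB")
```

If memory becomes a constraint, one option is to skip the dense conversion and pass the sparse `X_train` matrix straight to the estimators, since the scikit-learn models used in this notebook accept sparse input.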

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


#### Create a vectorizer object

The vectorizer transforms each text into a numeric vector based on how often each term occurs in that text, weighted by how rare the term is across the whole corpus.

For this project, TfidfVectorizer is used (see the Feature Engineering section).
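
To make the weighting concrete, the sketch below computes TF-IDF by hand on a made-up toy corpus. It assumes whitespace tokenisation and the unsmoothed formula `idf = ln(n / df) + 1` followed by L2 normalisation, which is what scikit-learn documents for `smooth_idf=False`, the setting passed to the `TfidfVectorizer` in the feature engineering cell. It illustrates the idea rather than reproducing the notebook's exact features.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "climate change is real",
    "climate change is a hoax",
    "news about climate",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """TF-IDF with idf = ln(n / df) + 1, followed by L2 normalisation."""
    tf = Counter(tokens)
    weights = {t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s}{weight:.3f}")
```

In the first document, "real" occurs in only one document and so receives a larger weight than "climate", which occurs in all three.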

## Logistic Regression

In [32]:
# Instantiate a LogisticRegression object
logreg = LogisticRegression(penalty='l2', C=0.1, solver='liblinear', multi_class='ovr')

In [33]:
# Train the model
logistic_reg = logreg.fit(features, target)

In [34]:
model_save_path = "resources/logistic_reg_team1.pkl"

with open(model_save_path,'wb') as file:
    pickle.dump(logistic_reg,file)
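
The `multi_class='ovr'` setting above fits one binary classifier per sentiment class and predicts the class whose classifier is most confident. The sketch below illustrates that decision rule in plain Python. The weights, biases and the three-feature input are hypothetical values chosen for the example; in practice they are learned by `fit`.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_ovr(x, weights, biases):
    """Score x with one binary model per class and return the best class."""
    scores = {}
    for label, w in weights.items():
        z = sum(wi * xi for wi, xi in zip(w, x)) + biases[label]
        scores[label] = sigmoid(z)
    return max(scores, key=scores.get), scores


# Hypothetical per-class weights for a 3-feature input.
weights = {
    -1: [1.5, -0.5, 0.0],
    0: [0.0, 0.2, 0.1],
    1: [-1.0, 1.2, 0.3],
    2: [-0.5, 0.1, 1.4],
}
biases = {-1: -0.2, 0: -0.5, 1: 0.1, 2: -0.3}

label, scores = predict_ovr([0.1, 0.9, 0.2], weights, biases)
print(label, {k: round(v, 3) for k, v in scores.items()})
```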

## Naive Bayes

In [35]:
# Create a MultinomialNB object
clf_multiNB = MultinomialNB()

In [36]:
# Train the model
nb_model = clf_multiNB.fit(features, target)

In [37]:
# Save the NB model
model_save_path = "resources/NB_team1.pkl"

with open(model_save_path,'wb') as file:
    pickle.dump(nb_model,file)

## Random Forest

In [38]:
# Create a Random Forest classifier object
clf_RF = RandomForestClassifier(n_estimators=10)

In [39]:
# Train the model
RF_model = clf_RF.fit(features, target)

In [40]:
# Save the RF model
model_save_path = "resources/RF_team1.pkl"

with open(model_save_path,'wb') as file:
    pickle.dump(RF_model,file)
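
The saved files can later be restored with `pickle.load`. The sketch below shows the save and load round trip using a temporary directory and a plain dictionary as a stand-in for a fitted model; the file name and the stand-in object are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be a fitted model or vectorizer.
fitted_object = {"name": "demo_model", "classes": [-1, 0, 1, 2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    # Save, mirroring the pickle.dump pattern used in the notebook.
    with open(path, "wb") as file:
        pickle.dump(fitted_object, file)

    # Load the object back in binary read mode.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == fitted_object)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is generally expected to be loaded with the same library version that saved it.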

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---
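
The models above are fitted on all of the training rows, so a holdout split (for example with the imported `train_test_split`) would be needed before any comparison. As one possible comparison metric, the sketch below computes the macro-averaged F1 score by hand. The choice of macro F1 and the labels and predictions shown are assumptions made for illustration, not results from this notebook.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up holdout labels and predictions.
y_true = [-1, 0, 1, 1, 2, 2, 1, 0]
y_pred = [-1, 1, 1, 1, 2, 0, 1, 0]
print(round(macro_f1(y_true, y_pred), 3))
```

The imported `f1_score` offers the same calculation through `average='macro'`. Because every class counts equally in the average, this metric is less dominated by the largest class than plain accuracy.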

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# Discuss the logic of the chosen method

## Resources

[Tweet-Preprocessor](https://pypi.org/project/tweet-preprocessor/#description)