# Classification Predict Student Solution


### Predict Overview: Twitter Sentiment Classification

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received. Your company has been awarded the contract to:

- Create a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.
- Provide an accurate and robust solution to this task, giving companies access to a broad base of consumer sentiment spanning multiple demographic and geographic categories, thus increasing their insights and informing future marketing strategies.
- Evaluate the accuracy of the best machine learning model.
- Determine which features were most important in the model’s prediction decision.
- Explain the inner workings of the model to a non-technical audience.

The formal problem statement, given to the data science team by your manager via email, reads as follows:

> The collection of this data was funded by a Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo. The dataset aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43,943 tweets were collected. Each tweet is labelled as one of 4 classes, which are described below:

> - 2 News: the tweet links to factual news about climate change

> - 1 Pro: the tweet supports the belief of man-made climate change

> - 0 Neutral: the tweet neither supports nor refutes the belief of man-made climate change

> - -1 Anti: the tweet does not believe in man-made climate change
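The four class codes described above can be captured in a small lookup, which is handy when labelling plots or reports. This is only a sketch: the names `SENTIMENT_LABELS` and `label_name` are hypothetical helpers and do not appear elsewhere in the notebook.

```python
# Hypothetical helper: maps the numeric sentiment codes to the class names given above
SENTIMENT_LABELS = {2: "News", 1: "Pro", 0: "Neutral", -1: "Anti"}


def label_name(code: int) -> str:
    """Return the class name for a sentiment code."""
    return SENTIMENT_LABELS[code]


print([label_name(code) for code in (-1, 0, 1, 2)])
```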

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:
# Libraries for data loading, data manipulation and data visualisation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
sns.set()

# Libraries for data preparation and model building
import string
import preprocessor as p
from nltk.tokenize import word_tokenize
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Model importation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import pickle

# Metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


In [2]:
df_train = pd.read_csv('resources/train.csv')

In [3]:
# Display all columns of the dataframe
df_train

|  | sentiment | message | tweetid |
| ---: | ---: | :--- | ---: |
| 0 | -1 | RT @darreljorstad: Funny as hell! Canada deman... | 897853122080407553 |
| 1 | -1 | All the biggest lies about climate change and ... | 925046776553529344 |
| 2 | -1 | The Coming Revelation Of The $q$Global Warming... | 696354236850786305 |
| 3 | -1 | RT @DineshDSouza: Let's see if the world ends ... | 846806509732483072 |
| 4 | -1 | RT @SteveSGoddard: Obama has no control over t... | 628085266293653504 |
| ... | ... | ... | ... |
| 30754 | 2 | RT @TIME: The Pentagon warned that climate cha... | 958155326259367937 |
| 30755 | 2 | Study finds that global warming exacerbates re... | 956048238615900163 |
| 30756 | 2 | RT @MikeySlezak: The global green movement pre... | 800258621485391872 |
| 30757 | 2 | RT @ProfEdwardsNZ: NYC Mayor says NY will go f... | 871365767895404545 |


<a id="four"></a>
## 4. Feature Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In this section, observations from the EDA will be addressed; furthermore, new features that are necessary for the model will be created.

To deal with the data quality issues, the tweet-preprocessor library will be used.

In [4]:
#p.set_options(p.OPT.EMOJI, p.OPT.MENTION, p.OPT.RESERVED, p.OPT.ESCAPE_CHAR,p.OPT.SMILEY, p.OPT.HASHTAG,p.OPT.NUMBER,p.OPT.URL)

#### Create a Function for Data Cleaning

The function created will automate the following process:
- Replace all URLs with 'url-web'
- Clean each row by removing mentions, hashtags, emojis and other irrelevant text components
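The URL-replacement step listed above can be tried in isolation with the standard `re` module. The sketch below reuses the URL pattern and the 'url-web' placeholder from the notebook's `clean_data` function; the helper name `replace_urls` and the sample text are made up for illustration, and the tweet-preprocessor step is not reproduced here.

```python
import re

# URL pattern and placeholder taken from the notebook's clean_data function
PATTERN_URL = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
SUBS_URL = 'url-web'


def replace_urls(text: str) -> str:
    """Replace every URL in `text` with the placeholder token."""
    return re.sub(PATTERN_URL, SUBS_URL, text)


sample = "Read this https://t.co/abc123 now"  # made-up tweet text
print(replace_urls(sample))
```

Replacing URLs with a single token, rather than deleting them, keeps the information that a link was present, which may help to separate news tweets from the other classes.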

In [5]:
def clean_data(df):
    """Replace URLs with a placeholder, clean each tweet and drop the tweetid column."""
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Regex that matches URLs in the message text
    pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
    subs_url = r'url-web'

    # Replace every URL with subs_url
    df['message'] = df['message'].replace(to_replace=pattern_url, value=subs_url, regex=True)

    # Use tweet-preprocessor to clean each row; the p.set_options call above is
    # commented out, so the library's default cleaning options apply
    df['message'] = df['message'].apply(p.clean)

    # tweetid is an identifier and is not used as a feature
    df = df.drop('tweetid', axis=1)

    return df

In [6]:
df_train_clean = clean_data(df_train)

#### Create a Pipeline to Preprocess for Feature Engineering

The pipeline created will automate the following phases:
- Remove punctuation
- Remove stopwords
- Lemmatise each word and convert it to lower case
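As a rough, dependency-free illustration of the punctuation and stopword phases listed above, the sketch below uses only the standard library. It is a simplified stand-in and not the notebook's spaCy pipeline: `TOY_STOPWORDS` is a tiny hypothetical stopword list, `simple_preprocess` is a made-up name, and no lemmatisation is performed.

```python
import string

# Tiny hypothetical stopword list, for illustration only
TOY_STOPWORDS = {"the", "is", "a", "and", "of", "to"}


def simple_preprocess(message: str) -> str:
    """Strip punctuation, lower-case the text and drop the toy stopwords."""
    # Remove every ASCII punctuation character
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    # Lower-case, split on whitespace and filter out stopwords
    tokens = [tok for tok in no_punct.lower().split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)


print(simple_preprocess("The climate is changing, and fast!"))
```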

Load the spaCy small English model (en_core_web_sm).

In [7]:
nlp = spacy.load("en_core_web_sm")

In [8]:
# Get the list of stopwords from spacy framework
stopwords = list(STOP_WORDS)

In [9]:
def pipeline(message):
    """Tokenise a message with spaCy and return its cleaned, lemmatised form."""
    doc = nlp(message)

    # Remove punctuation tokens
    tokens = [token for token in doc if not token.is_punct]

    # Remove stopwords
    tokens = [token for token in tokens if not token.is_stop]

    # Lemmatise each word and convert it to lower case
    return " ".join(token.lemma_.lower() for token in tokens)

In [10]:
features = df_train_clean['message']
target = df_train_clean['sentiment']

In [11]:
# Instantiate a TfidfVectorizer object and fit it on the cleaned messages
vect = TfidfVectorizer(stop_words='english', max_features=100, ngram_range=(1,2))
trained_vect = vect.fit(features)

In [12]:
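# get_feature_names() was removed in scikit-learn 1.2; older versions (< 1.0) need get_feature_names()
len(trained_vect.get_feature_names_out())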



100

In [15]:
X_train = trained_vect.transform(features)

In [16]:
features = X_train.toarray()
features.shape

(30759, 100)

In [17]:
X_train.shape

(30759, 100)
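The 100 columns in the shape above follow from `max_features=100`, which keeps only the highest-frequency terms. A small sketch on a made-up corpus shows the same effect with a cap of 5; the corpus and the names `toy_messages`, `toy_vect` and `toy_matrix` are illustrative assumptions, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus standing in for the cleaned tweets
toy_messages = [
    "climate change is real",
    "climate change is a hoax",
    "global warming is real",
]

# Same kind of settings as the notebook, but with a much smaller feature cap
toy_vect = TfidfVectorizer(stop_words='english', max_features=5, ngram_range=(1, 2))
toy_matrix = toy_vect.fit_transform(toy_messages)

print(toy_matrix.shape)               # (number of messages, number of kept features)
print(sorted(toy_vect.vocabulary_))   # the unigrams and bigrams that were kept
```

Because `ngram_range=(1, 2)` produces both single words and word pairs, the candidate vocabulary grows quickly, so the cap decides which terms the model can see at all.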

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


#### Split the data into train and test data

Because we rarely have the whole population to work with, and working with it would be computationally expensive in any case, we need a way to test our models on UNSEEN data, i.e. data that was not part of the training set, so that we can measure the predictive accuracy of our model.

The aim is to ensure that our model does not fit the training data so closely that it fails to generalise to the population, a problem known as OVERFITTING (a situation where we have low error during training but very high error in the testing phase).

To achieve this, we split the available data set into training and testing data: the training data is used to train the model, while the testing data is used to test its accuracy.
Where the data set is large enough, it can be divided into three sets by adding a validation set to the training and testing sets.

To split our data set, sklearn provides a convenient function, the train_test_split function. It takes one or more arrays or DataFrames, splits them at random (the split can be made reproducible by fixing the random state) and returns 2n outputs, where n is the number of arrays/DataFrames passed in.
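The "2n outputs" behaviour can be checked on toy arrays. The data below is made up purely for illustration; only `train_test_split` and its `test_size` and `random_state` parameters come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays: 10 samples with 2 features each, plus 10 labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two input arrays (n = 2) give four outputs (2n = 4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```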

In [202]:
# Split train and test data with a test-size of 0.2 and random state of 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

#### Create a Vectorizer Object

The vectorizer is used to transform a given text into a vector on the basis of the frequency of each word that occurs in the entire text.

For this project, CountVectorizer will be used.

In [203]:
# Instantiate CountVectorizer Object
vect = CountVectorizer(tokenizer=pipeline, min_df=10, ngram_range=(1,3))
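To make the idea of frequency-based vectors concrete, here is a minimal sketch on two made-up messages. It uses `CountVectorizer` with its default tokenizer rather than the spaCy-based `pipeline` function, so it runs without a language model; the names `toy_messages`, `toy_count` and `counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages
toy_messages = ["climate change climate", "change is real"]

toy_count = CountVectorizer()  # default tokenizer, unigrams only
counts = toy_count.fit_transform(toy_messages).toarray()

# vocabulary_ maps each word to its column index in the count matrix
print(toy_count.vocabulary_)
print(counts)
```

Each row is one message and each column one vocabulary word, so the entry for "climate" in the first row is 2 because the word occurs twice in that message.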

### Logistic Regression

Logistic regression is a classification algorithm that predicts the probability of an event occurring using a logistic function. This implies that the outcome is categorical rather than numerical.

To explain further with an example: if linear regression predicts how much a customer is willing to pay when they buy a company's product, then logistic regression predicts whether or not the customer will buy the product at all.
As this example shows, logistic regression can be used to make very fundamental yes/no forecasts.

$$P(X) = \displaystyle \frac{e^{\beta_0 + \beta_1 X}}{1+e^{\beta_0 + \beta_1 X}}$$

where $P(X)$ is the probability of X belonging to class 1, and $\beta_0$ and $\beta_1$ are the intercept and regression coefficient respectively, just like in a linear regression model.

Logistic regression can be transformed into its logit form, where the log of the odds is equal to a linear model.

\begin{align}
1 - P(X) &= \displaystyle \frac{1}{1+e^{\beta_0 + \beta_1 X}} \\
\therefore \log \left( \frac{P(X)}{1-P(X)} \right) &= {\beta_0 + \beta_1 X}
\end{align}

where

\begin{align}
\text{Odds} = \frac{P(X)}{1-P(X)}
\end{align}

Hence, the logit function can be rewritten as

\begin{align}
\log \left( \text{Odds} \right) &= {\beta_0 + \beta_1 X}
\end{align}
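The relationship between the logistic function and the log of the odds can be verified numerically. In the sketch below the coefficient values are arbitrary illustrative numbers, not fitted to any data, and the function names `logistic` and `log_odds` are made up for this example.

```python
import math


def logistic(x: float, beta_0: float, beta_1: float) -> float:
    """P(X) from the logistic function, for a single predictor x."""
    z = beta_0 + beta_1 * x
    return math.exp(z) / (1 + math.exp(z))


def log_odds(p: float) -> float:
    """Logit: the log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


# Arbitrary illustrative coefficients and input
beta_0, beta_1, x = -1.0, 0.5, 3.0
p = logistic(x, beta_0, beta_1)

# The log-odds of p should recover the linear term beta_0 + beta_1 * x
print(p, log_odds(p), beta_0 + beta_1 * x)
```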

Since logistic regression predicts the probability of an event occurring, its prediction lies between 0 and 1. Usually, a threshold is picked: observations with a predicted probability above it are assigned class 1 and those below it class 0. A threshold of 0.5 is normally chosen.
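The thresholding rule can be sketched in a few lines. How to treat a probability exactly equal to the threshold is not stated in the text, so assigning it to class 1 is an assumption here, and `apply_threshold` is a hypothetical helper name.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    # Assumption: a probability exactly equal to the threshold goes to class 1
    return [1 if p >= threshold else 0 for p in probabilities]


print(apply_threshold([0.1, 0.5, 0.9]))
```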

Although usually used for binary classification, logistic regression can also be used for multi-class classification. This can be achieved in sklearn by setting the LogisticRegression parameter "multi_class" to "ovr", which means One-vs-Rest.
The intuition behind ovr is the same as for binary classification: one binary classifier is fitted per class, and for each class the possible outcomes are:
- The probability of the event occurring (p), and
- The probability of the event not occurring (1 - p)
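The One-vs-Rest idea can be made visible with sklearn's `OneVsRestClassifier` wrapper, which fits one binary logistic regression per class. This is a swapped-in way of showing the mechanism and not the notebook's own configuration; the toy data below is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: three well-separated clusters, one per class (values are made up)
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.1, 5.2]])
y = np.array([-1, -1, 0, 0, 1, 1])

ovr = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
ovr.fit(X, y)

# One binary "this class vs the rest" classifier is stored per class
print(len(ovr.estimators_), ovr.classes_)

# The predicted class is the one whose binary classifier is most confident
print(ovr.predict([[5.0, 5.1]]))
```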


In [18]:
# Instantiate a LogisticRegression object
logreg = LogisticRegression(penalty='l2', C=0.1, solver='liblinear', multi_class='ovr')

In [19]:
# Train the model
logistic_reg = logreg.fit(features, target)

In [20]:
model_save_path = "resources/logistic_reg.pkl"

with open(model_save_path,'wb') as file:
    pickle.dump(logistic_reg,file)

In [21]:
model_save_path = "resources/countvect.pkl"

with open(model_save_path,'wb') as file:
    pickle.dump(trained_vect,file)
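Objects saved this way can be read back with `pickle.load`. The sketch below shows the round trip with a stand-in dictionary and a temporary file, so it runs without the notebook's `resources` folder; the object contents and the file name are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model or vectorizer
artefact = {"name": "toy-model", "coef": [0.1, -0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")  # hypothetical file name

    # Save the object in binary mode
    with open(path, 'wb') as file:
        pickle.dump(artefact, file)

    # Load it back, again in binary mode
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == artefact)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.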

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice
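As a starting point for the comparison, the metric functions imported in Section 1 can be applied to holdout labels and predictions. The labels below are made up purely to show the calls and are NOT results from the notebook's model; using a macro-averaged F1 score is likewise an assumption.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Made-up labels for illustration only; not output from any trained model
y_true = [-1, 0, 1, 2, 1, 1, 0, 2]
y_pred = [-1, 0, 1, 2, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average='macro')

print(accuracy)
print(macro_f1)
print(classification_report(y_true, y_pred))
```

Macro averaging weights every class equally, which can be informative when the classes are imbalanced; accuracy alone may hide poor performance on the smaller classes.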

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic

## Resources

[Tweet-Preprocessor](https://pypi.org/project/tweet-preprocessor/#description)