<a href="https://colab.research.google.com/github/muiruric/Athena_Python/blob/master/2301ACDS_TeamBM3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**2301ACDS_TeamBM3**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: EA - Twitter Sentiment Classification Challenge

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.



With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.



Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

<a id="cont"></a>

## Table of Contents
<a href=#one>1. Introduction</a>

<a href=#two>2. Importing Packages</a>

<a href=#three>3. Loading the Data</a>


<br><br><br>
<a href=#x>X. Model</a>

 <a id="one"></a>
## 1. Introduction
<a href=#cont>Back to Table of Contents</a>

### Goal

To predict an individual’s belief in climate change based on their tweets!


### Dataset Description

**Where is this data from?**
The collection of this data was funded by a Canada Foundation for Innovation JELF Grant to Chris Bauch, University of Waterloo. The dataset aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43,943 tweets were collected. Each tweet is labelled as one of 4 classes, which are described below.

**Class Description**

- 2 News: the tweet links to factual news about climate change

- 1 Pro: the tweet supports the belief of man-made climate change

- 0 Neutral: the tweet neither supports nor refutes the belief of man-made climate change

- -1 Anti: the tweet does not believe in man-made climate change

**Features**

**sentiment:** Which class a tweet belongs in (refer to Class Description above)

**message:** Tweet body

**tweetid:** Twitter unique id



**The files provided**

**train.csv** - You will use this data to train your model.

**test.csv** - You will use this data to test your model.

**SampleSubmission.csv** - an example of what your submission file should look like. The order of the rows does not matter, but the tweetid values must be correct.



 <a id="two"></a>
## 2. Importing Packages
<a href=#cont>Back to Table of Contents</a>


In [None]:

# Libraries for data loading, data manipulation and data visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


import nltk
from nltk.stem import WordNetLemmatizer
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
import string
import urllib
import re

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

<a id="three"></a>
## 3. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [None]:
df_train = pd.read_csv('/train.csv')
df_test = pd.read_csv('/test.csv')

In [None]:
df_train.head()

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [None]:
df_train.shape

(15819, 3)

In [None]:
df_test.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \r\nPu...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [None]:
df_test.shape

(10546, 2)

In [None]:
df_test.isnull().sum()

message    0
tweetid    0
dtype: int64

In [None]:
df_train.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

In [None]:
df_train['sentiment'].unique()

array([ 1,  2,  0, -1], dtype=int64)

In [None]:
df = pd.concat([df_train, df_test])

## Exploratory Data Analysis

### Text Cleaning

The following are the data cleaning techniques used to preprocess the raw data before conducting analysis.

- Removal of retweets and duplicate tweets

- Handling of hyperlinks

- Removal of punctuation and noise

- Conversion of text to lowercase

- Handling of contractions

- Handling of emojis and emoticons








In [None]:
# Drop rows that are exact duplicates (all columns identical); retweet prefixes are stripped in the next cell
df.drop_duplicates(inplace = True)

In [None]:
# Remove the "RT @user" retweet prefix and any other @mentions
def removing_retweet(text):
    retweet_pattern = r'RT @\w+|@\w+'
    cleaned_text = re.sub(retweet_pattern, '', text)
    return cleaned_text

df['message'] = df['message'].apply(removing_retweet)
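
As a quick check of the pattern above, the sketch below applies the same regular expression to a made-up tweet (the sample string is an assumption for illustration, not a row from the dataset). It shows two side effects worth knowing about: every `@mention` is stripped, not only the `RT @user` prefix, and the colon after the retweeted handle is left behind for the later punctuation step.

```python
import re

# Same pattern as in the cell above: "RT @user" prefixes and any other @mention.
retweet_pattern = r'RT @\w+|@\w+'

# Hypothetical sample tweet, for illustration only.
sample = "RT @RawStory: Researchers say @NASA data shows warming"

cleaned = re.sub(retweet_pattern, '', sample)

# repr() makes the leftover colon and the double space visible.
print(repr(cleaned))
```

The leftover double space is harmless here because tokenisation later splits on whitespace.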

In [None]:
# Replace hyperlinks with the placeholder token 'url-web' (the links are substituted, not deleted)
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'
df['message'] = df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
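
To illustrate what this substitution does, here is a minimal sketch on a made-up tweet (the sample text and the short link are assumptions). It also shows how the placeholder interacts with the punctuation step that follows: the hyphen in `url-web` is itself punctuation, so the token ends up as `urlweb`.

```python
import re
import string

# Same URL pattern and placeholder as in the cell above.
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'

# Hypothetical sample tweet, for illustration only.
sample = "Read this https://t.co/abc123 before it is too late"

# Step 1: the link is replaced by the placeholder token.
replaced = re.sub(pattern_url, subs_url, sample)

# Step 2: the later punctuation removal also strips the hyphen from the placeholder.
no_punct = replaced.translate(str.maketrans('', '', string.punctuation))

print(replaced)
print(no_punct)
```

If a stable placeholder matters downstream, a token without punctuation would survive the cleaning unchanged.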

In [None]:
#removing the punctuation
def remove_punctuation(text):
    punctuation_removed = text.translate(str.maketrans('', '', string.punctuation))
    return punctuation_removed
df['message'] = df['message'].apply(remove_punctuation)

In [None]:
#conversion of all the text to lowercase
def to_lowercase(text):
    lowercase = text.lower()
    return lowercase

df['message'] = df['message'].apply(to_lowercase)

In [None]:
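# Convert emojis to their text names (emoji.demojize replaces them, it does not delete them)
import emoji

def remove_emojis(text):
    """Replace each emoji with its ':name:' text alias, e.g. a thumbs-up becomes ':thumbs_up:'."""
    cleaned_text = emoji.demojize(text)
    return cleaned_text
df['message'] = df['message'].apply(remove_emojis)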

We opt to remove digits from the data, such as years, so that numerical values are not treated as separate tokens. This may also help to reduce the vocabulary size and minimise noise.

In [None]:
#removal of digits such as years
def remove_digits(text):
    cleaned = re.sub(r'\d+','', text)
    return cleaned
df['message']= df['message'].apply(remove_digits)
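
The effect on vocabulary size can be sketched on a few made-up sentences (the samples are assumptions, not rows from the dataset). Distinct years collapse into nothing, so the vocabulary shrinks, but tokens that mix digits with other characters leave fragments behind.

```python
import re

def remove_digits(text):
    # Same substitution as in the cell above: drop every run of digits.
    return re.sub(r'\d+', '', text)

# Hypothetical sample sentences, for illustration only.
samples = [
    "2016 was a pivotal year",
    "2017 was a pivotal year",
    "limit warming to 1.5C by 2050",
]

# Whitespace-split vocabulary before and after digit removal.
vocab_before = {tok for s in samples for tok in s.split()}
vocab_after = {tok for s in samples for tok in remove_digits(s).split()}

print(len(vocab_before), len(vocab_after))
print(sorted(vocab_after))
```

Here `1.5C` becomes the fragment `.C`; in the notebook the punctuation has already been stripped at this point, so the leftover would be a single-letter token instead.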

In [None]:
df.head()

Unnamed: 0,sentiment,message,tweetid
0,1.0,polyscimajor epa chief doesnt think carbon dio...,625221
1,1.0,its not like we lack evidence of anthropogenic...,126103
2,2.0,researchers say we have three years to act on...,698562
3,1.0,todayinmaker wired was a pivotal year in the...,573736
4,1.0,its and a racist sexist climate change denyi...,466954


## Preparing Text Data for Exploratory Data Analysis (EDA)

- Tokenisation

- Stemming and Lemmatisation

- Removal of stop words


**Tokenisation**

In [None]:
tokeniser = TreebankWordTokenizer()
df['tokens'] = df['message'].apply(tokeniser.tokenize)

**Stemming and Lemmatisation**


In [None]:
# Stemming is currently disabled; uncomment the lines below to add a 'stem' column.
# stemmer = SnowballStemmer('english')
#
# def mbti_stemmer(words, stemmer):
#     return [stemmer.stem(word) for word in words]
#
# df['stem'] = df['tokens'].apply(mbti_stemmer, args=(stemmer, ))

In [None]:
df['tokens'].head()

0    [polyscimajor, epa, chief, doesnt, think, carb...
1    [its, not, like, we, lack, evidence, of, anthr...
2    [researchers, say, we, have, three, years, to,...
3    [todayinmaker, wired, was, a, pivotal, year, i...
4    [its, and, a, racist, sexist, climate, change,...
Name: tokens, dtype: object

In [None]:
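#lemmatization
# WordNetLemmatizer needs the WordNet corpus: run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()

def lemmatize_text(tokens):
    """Lemmatise a list of tokens (treated as nouns by default) and join them into one string."""
    return ' '.join(lemmatizer.lemmatize(word) for word in tokens)

df['lemmatized'] = df['tokens'].apply(lemmatize_text)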

In [None]:
df['lemmatized'].head()

0    polyscimajor epa chief doesnt think carbon dio...
1    it not like we lack evidence of anthropogenic ...
2    researcher say we have three year to act on cl...
3    todayinmaker wired wa a pivotal year in the wa...
4    it and a racist sexist climate change denying ...
Name: lemmatized, dtype: object

In [None]:
nltk.download('words')

english_words = set(nltk.corpus.words.words())

def remove_non_english_words(text):
    words = text.split()
    english_words_filtered = [word for word in words if word.lower() in english_words and len(word)>1]
    cleaned_text = ' '.join(english_words_filtered)
    return cleaned_text
df['cleaned'] = df['lemmatized'].apply(remove_non_english_words)

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\colette\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [None]:
df['cleaned'].head()

0    chief doesnt think carbon dioxide is main caus...
1    it not like we lack evidence of anthropogenic ...
2    researcher say we have three year to act on cl...
3    wired wa pivotal year in the war on climate ch...
4    it and racist climate change bigot is leading ...
Name: cleaned, dtype: object

**Remove Stop words**


In [None]:
##insert stopwords code here
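
The stop-word step is still a placeholder. A minimal sketch of what it might look like follows. The small stop-word set, the sample tokens and the helper name `remove_stop_words` are assumptions for illustration; in the notebook one would more likely use `stopwords.words('english')` from `nltk.corpus`, which is already imported above, after running `nltk.download('stopwords')`.

```python
# Small illustrative stop-word set (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "on", "to", "we", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]

# Hypothetical token list, modelled on the lemmatised output shown earlier.
tokens = "it not like we lack evidence of climate change".split()

filtered = remove_stop_words(tokens)
print(filtered)
```

Negations such as "not" are deliberately left out of the illustrative set, since they can carry sentiment; whether to keep them is a modelling choice.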

In [None]:
# separate test data and train data
train = df.dropna(subset=['sentiment'])
# Build the test frame in one step; calling drop(..., inplace=True) on a slice triggers SettingWithCopyWarning
test = df[df['sentiment'].isnull()].drop(columns=['sentiment'])


### Text feature extraction

**Bag of words**

In [None]:
def bag_of_words_count(words, word_dict=None):
    """ this function takes in a list of words and returns a dictionary
        with each word as a key, and the value represents the number of
        times that word appeared"""
    # Use None instead of a mutable default so counts do not leak between calls
    if word_dict is None:
        word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict
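
The same counting can also be expressed with `collections.Counter` from the standard library. The sketch below is an alternative formulation rather than the notebook's implementation, and the token lists are made up for illustration.

```python
from collections import Counter

# Hypothetical tokenised tweets, for illustration only.
rows = [
    ["climate", "change", "is", "real"],
    ["climate", "action", "now"],
]

# Counter.update adds the counts of each new token list to the running total,
# which mirrors passing the same dictionary back into bag_of_words_count.
counts = Counter()
for tokens in rows:
    counts.update(tokens)

print(counts.most_common(1))
print(sum(counts.values()))
```

A `Counter` is a `dict` subclass, so it can be used wherever the plain dictionary is expected.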

In [None]:
sentiment_labels = list(train['sentiment'].unique())

In [None]:
sentiment = {}
# Group once, and use a new name so the combined `df` frame is not overwritten
grouped = train.groupby('sentiment')
for pp in sentiment_labels:
    sentiment[pp] = {}
    for row in grouped.get_group(pp)['tokens']:
        sentiment[pp] = bag_of_words_count(row, sentiment[pp])

In [None]:
#print(sentiment)

In [None]:
all_words = set()
for pp in sentiment_labels:
    for word in sentiment[pp]:
        all_words.add(word)

In [None]:
sentiment['all'] = {}
for pp in sentiment_labels:
    for word in all_words:
        if word in sentiment[pp].keys():
            if word in sentiment['all']:
                sentiment['all'][word] += sentiment[pp][word]
            else:
                sentiment['all'][word] = sentiment[pp][word]

In [None]:
total_words = sum([v for v in sentiment['all'].values()])
total_words

In [None]:
_ = plt.hist([v for v in sentiment['all'].values() if v < 10],bins=10)
plt.ylabel("# of words")
plt.xlabel("word frequency")

In [None]:
print(type(sentiment))

In [None]:
max_count = 10
remaining_word_index = [k for k, v in sentiment['all'].items() if v > max_count]
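
Despite its name, `max_count` acts as a minimum-frequency threshold: only words that occur strictly more than `max_count` times are kept. A small sketch with made-up counts (the dictionary below is an assumption, not taken from the dataset) shows the boundary case and how much of the token mass survives the filter.

```python
# Hypothetical word counts, for illustration only.
word_counts = {"climate": 25, "change": 22, "hoax": 11, "polyscimajor": 1, "wired": 10}

max_count = 10

# Same filter as in the cell above: keep words seen strictly more than max_count times.
remaining = [word for word, count in word_counts.items() if count > max_count]

# Share of all token occurrences that the kept words account for.
kept_share = sum(word_counts[w] for w in remaining) / sum(word_counts.values())

print(remaining)
print(round(kept_share, 3))
```

A word seen exactly `max_count` times ("wired" here) is dropped, so the comparison would need to be `>=` if such words should be retained.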

In [None]:
print(type(remaining_word_index))

In [None]:
hm = []
# A set makes the membership test below O(1) instead of scanning the list each time
remaining_words = set(remaining_word_index)
for p, p_bow in sentiment.items():
    df_bow = pd.DataFrame([(k, v) for k, v in p_bow.items() if k in remaining_words], columns=['Word', p])
    df_bow.set_index('Word', inplace=True)
    hm.append(df_bow)

# create one big dataframe
df_bow = pd.concat(hm, axis=1)
df_bow.fillna(0, inplace=True)

In [None]:
df_bow.sort_values(by='all', ascending=False).head(10)

In [None]:
df_bow.head()

In [None]:
train.head()

In [None]:
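# The 'stem' and 'lemma' columns are never created above (stemming is commented out and
# the lemmatised text is stored in 'lemmatized'), so work from the columns that exist.
train_processed = train[['tweetid', 'tokens', 'lemmatized', 'sentiment']].copy()
train_processed.head()

In [None]:
remaining_word_index

In [None]:
# Set lookup is much faster than scanning the list for every token
remaining_words = set(remaining_word_index)

def remove_unnecessary_words(words):
    """Keep only the words that occur more than max_count times in the training data."""
    return [x for x in words if x in remaining_words]

train_processed['tokens'] = train_processed['tokens'].apply(remove_unnecessary_words)
train_processed['tokens'].head(10)

In [None]:
# 'lemmatized' holds space-joined strings, so split into words before filtering
train_processed['lemma'] = train_processed['lemmatized'].str.split().apply(remove_unnecessary_words)
train_processed['lemma'].head(10)

In [None]:

sd = train_processed[['tokens','lemma']].head(2)
sd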

 <a id="x"></a>
## X. Model
<a href=#cont>Back to Table of Contents</a>

 <a id="#"></a>
## X. Model Evaluation

<a href=#cont>Back to Table of Contents</a>

- Classification Accuracy

- Logarithmic Loss

- Confusion Matrix

- Area under Curve

- F1 Score

- Mean Absolute Error

- Mean Squared Error


https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234

In [None]:
y

In [None]:
# Saving each metric to add to a dictionary for logging

f1 = f1_score(y_test, y_pred, average='micro')
precision = precision_score(y_test, y_pred,  average='micro')
recall = recall_score(y_test, y_pred,  average='micro')
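
With `average='micro'` on a single-label multiclass problem, precision, recall and F1 all equal plain accuracy, so the three values computed above will be identical. Macro averaging instead gives each class equal weight, which makes weak performance on a rare class visible. The sketch below computes both by hand on made-up labels; `y_true`, `y_hat` and the helper names are hypothetical and only illustrate the arithmetic, they are not the notebook's model output.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels using the four sentiment classes (-1, 0, 1, 2).
y_true = [1, 1, 1, 1, 2, 2, 0, -1]
y_hat = [1, 1, 1, 2, 2, 1, 0, 1]

labels = sorted(set(y_true))
counts = [per_class_counts(y_true, y_hat, lab) for lab in labels]

# Micro: pool the counts over all classes, then compute one F1.
micro_f1 = f1_from_counts(*(sum(c[i] for c in counts) for i in range(3)))

# Macro: compute F1 per class, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts) / len(labels)

accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)

print(micro_f1, accuracy, round(macro_f1, 4))
```

In this toy example the single Anti tweet is misclassified, which pulls the macro score below the micro score; passing `average='macro'` to `f1_score` would report the per-class view.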

In [None]:
print(f1)

In [None]:
# Create dictionaries for the data we want to log

params = {"random_state": 7,
          "model_type": "logreg",
          "scaler": "standard scaler",
          "param_grid": str(param_grid),
          "stratify": True
          }
metrics = {"f1": f1,
           "recall": recall,
           "precision": precision
           }

In [None]:
# Log our parameters and results
experiment.log_parameters(params)
experiment.log_metrics(metrics)

In [None]:
experiment.end()

In [None]:
experiment.display()

In [None]:
sample = pd.read_csv('datasets/sample_submission.csv')
sample