# Team 6 CT Classification

## Climate Change Sentiment analysis - 2020

### Project Description


>Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

>With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

>Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

https://www.kaggle.com/c/climate-change-belief-analysis/overview

![alt text](https://images.unsplash.com/photo-1578825141469-690ba22eede0?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1051&q=80)
<span>Photo by <a href="https://unsplash.com/@markusspiske?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Markus Spiske</a> on <a href="/s/photos/business-dead-planet?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

### Notebook Explanation

First, we tried 3 base models with no preprocessing of the text to get a benchmark score. We then extracted information from the tweets and added additional columns such as character and word counts.

The next step was cleaning the tweets. For example, we removed the URLs, the <i>'RT'</i> retweet tags and the mentions. We applied both stemming and lemmatisation as normalisation methods to test which performs better. Following that, we vectorised the text using 3 methods: Count Vectoriser, TF-IDF and Word2Vec. We added columns for each combination of methods, ran each combination through the base models again and compared the performance. We iterated on this process until we found the best-performing cleaning method.
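To make the difference between the count and TF-IDF representations concrete, here is a minimal pure-Python sketch on three made-up token lists. The `docs` data and the helper names are illustrative only and do not come from the notebook. The IDF formula is the smoothed variant that scikit-learn's `TfidfVectorizer` applies by default, shown here without the final L2 normalisation.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned tweets (not taken from the dataset)
docs = [
    ["climate", "change", "real"],
    ["climate", "change", "hoax"],
    ["climate", "action", "now"],
]

# Vocabulary in a fixed (alphabetical) order
vocab = sorted({tok for doc in docs for tok in doc})

def count_vector(doc):
    """Bag-of-words counts in vocabulary order."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + len(docs)) / (1 + df)) + 1

def tfidf_vector(doc):
    """Raw counts reweighted by IDF (no normalisation applied)."""
    return [tf * idf(term) for tf, term in zip(count_vector(doc), vocab)]

count_vec = count_vector(docs[1])
tfidf_vec = tfidf_vector(docs[1])
```

In the count vector every word of the second tweet has weight 1, whereas TF-IDF gives 'hoax' (which appears in one tweet) more weight than 'climate' (which appears in all of them).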

After deciding on the best cleaning method, we did an exploratory data analysis (EDA) on the raw tweets to retrieve data about mentions and hashtags. We also did EDA on the categorised processed tweets to extract insights from the different sentiment labels.

Next, we compared base models with processed tweets. We selected the top 3 performing models and used GridSearchCV to find the best parameters to train these models with. We compared the F1 score and selected the best performing model. Below is a model process flow diagram visualising the approach we took to solve this problem.

<img src="resources/Model Process Flow.png" />
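For reference, the F1 comparison used to rank the models can be illustrated with a small hand-rolled calculation. This is only a sketch: the labels and predictions are invented, and macro averaging is an assumption, since the averaging scheme is not stated in this overview.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, lab) for lab in labels) / len(labels)

# Invented labels using the dataset's four sentiment codes (-1, 0, 1, 2)
y_true = [1, 1, 1, 1, 2, 2, 0, -1]
y_pred = [1, 1, 1, 2, 2, 2, 1, -1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Here the accuracy is 0.75 but the macro F1 is lower, because the single neutral tweet is misclassified and that class contributes an F1 of 0 to the average. This is why F1 can be more informative than accuracy on unbalanced data.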

<a id = "top"></a>

# Table of contents

1. [Importing packages](#packages) <br><br>

2. [Loading and viewing data](#data) <br><br>

3. [Data description](#description) <br><br>

4. [Sentiment description](#sentiment)<br><br>

5. [Data extraction](#extraction)<br><br>

6. [Text cleaning](#cleaning) <br><br>

7. [Exploratory data analysis](#eda)<br><br>

8. [Balancing dataset](#balance) <br><br>

9. [Base model](#base) <br><br>

10. [Model evaluation](#evaluation) <br><br>

11. [Model optimisation](#optim) <br><br>

12. [Conclusion](#conclusion) <br><br>

# 1. Importing packages <a name="packages"></a>
[Return to top](#top) <br><br>

The following packages need to be installed.

- spaCy - pip install spacy==2.2.4
- NLTK - pip install nltk==3.5

In [None]:
# Processing
import numpy as np
import pandas as pd
import re


# Visualisation
import seaborn as sns
import matplotlib.pyplot as plt

# Spacy packages
import spacy
from spacy import displacy
from spacy.lang.en.stop_words import STOP_WORDS

# NLTK packages
import nltk
from nltk.tokenize import word_tokenize,TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
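nltk.download('punkt', quiet=True)
# Resources needed later by nltk.pos_tag and WordNetLemmatizer
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)

# Sklearn packages
# ENGLISH_STOP_WORDS is exposed by feature_extraction.text; the old
# feature_extraction.stop_words module was removed in later scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS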
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import resample
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
from sklearn_pandas import DataFrameMapper

# Utils
from collections import Counter
import itertools, string, operator, re, unicodedata, nltk
from wordcloud import WordCloud

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


# Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 2. Loading and viewing data <a name="data"></a>
[Return to top](#top) <br><br>


- Data loads from Github repository
- Training dataset has 15819 entries
- Testing dataset has 10546 entries
- Viewing the first 5 rows of datasets


In [None]:
# Import data
train_df = pd.read_csv('https://raw.githubusercontent.com/jonnybegreat/test-repo/master/twitter_train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/jonnybegreat/test-repo/master/twitter_test.csv')

In [None]:
# Make copy of train_df assigning to variable df and view the first 5 rows
df = train_df.copy()
df.head()

In [None]:
# View the first 5 rows of test dataset
test_df.head()

# 3. Data description <a name="description"></a>
[Return to top](#top) <br><br>

<b>Training data columns (15819 entries):</b>
- Sentiment - Labelled sentiment classification
- Message - Tweet to be analysed
- Tweet ID - ID of unique tweet

<b>Test data columns (10546 entries):</b>
- Message - Tweet to be analysed
- Tweet ID - ID of unique tweet


<i>Training data info and data types:</i>

In [None]:
df.info()

<i>Testing data info and data types:</i>

In [None]:
test_df.info()

# 4. Sentiment description <a name="sentiment"></a>
[Return to top](#top) <br><br>

The table displays the description of each sentiment category:

![alt text](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2205222%2F8e4d65f2029797e0462b52022451829c%2Fdata.PNG?generation=1590752860255531&alt=media)

<i>How many tweets are there in each category?</i>

In [None]:
#Bar graph of sentiment count
sns.catplot(x = 'sentiment', kind = 'count', edgecolor = '.6',
            palette = 'pastel', data = df);

The majority of the tweets are positive (1) towards climate change, and the fewest are negative (-1). This shows that the data is unbalanced, which can affect our prediction results. Later in this notebook, we will explore resampling methods to balance the data.
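As a rough illustration of what balancing by resampling means, the sketch below upsamples every class to the size of the largest one using only the standard library. The rows are invented, and upsampling is just one possible strategy; which method the notebook settles on is not shown in this section.

```python
import random
from collections import Counter

random.seed(42)  # make the illustration reproducible

# Hypothetical (label, tweet) pairs, skewed towards the positive class
rows = [(1, "pos")] * 8 + [(2, "news")] * 4 + [(0, "neutral")] * 3 + [(-1, "neg")] * 1

# Group the rows by sentiment label
by_label = {}
for label, text in rows:
    by_label.setdefault(label, []).append((label, text))

# Size of the largest class
target = max(len(group) for group in by_label.values())

balanced = []
for label, group in by_label.items():
    # Sample with replacement until the class has `target` rows
    extra = random.choices(group, k=target - len(group))
    balanced.extend(group + extra)

class_counts = Counter(label for label, _ in balanced)
```

After this step every class has the same number of rows. The trade-off is that the minority classes now contain repeated tweets, which can encourage overfitting.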

# 5. Data extraction <a name="extraction"></a>
[Return to top](#top) <br><br>


Create new columns of the following features of the tweets:
- Tokenise tweet
- Categorise as retweet or not
- Hashtag extraction and count
- Mention extraction and count
- Word and character count
- Average word length
- Stop word count per tweet

All of the above methods are applied to the test data as well.
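The features listed above can be sketched for a single tweet in plain Python before applying them column-wise with pandas. The example tweet, the `tweet_features` helper and the tiny `DEMO_STOP_WORDS` set are all hypothetical; the notebook itself uses spaCy's `STOP_WORDS`.

```python
import re

# Small illustrative stop word set (an assumption, not spaCy's full list)
DEMO_STOP_WORDS = {"the", "is", "a", "in", "to", "and"}

def tweet_features(tweet):
    """Compute per-tweet metadata similar to the columns created below."""
    words = tweet.split()
    total_chars = sum(len(w) for w in words)
    return {
        "retweet": "yes" if tweet.startswith("RT") else "no",
        "hashtags": re.findall(r"#(\w+)", tweet),
        "mentions": re.findall(r"@(\w+)", tweet),
        "char_count": len(tweet),
        "word_count": len(words),
        # Guard against empty tweets to avoid dividing by zero
        "avg_word_length": round(total_chars / len(words), 2) if words else 0.0,
        "stopword_count": sum(w in DEMO_STOP_WORDS for w in words),
    }

features = tweet_features("RT @nasa: the climate is changing #climate #ActOnClimate")
```

Note that the word-based counts split on whitespace, so a token such as `@nasa:` keeps its trailing colon, while the regular expressions return only the word characters after `#` or `@`.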

In [None]:
# Create copies of train and test dataframes
df_with_metadata = df.copy()
test_df_with_metadata = test_df.copy()

In [None]:
def hashtag_column(x):
    '''
    This function extracts hashtags from tweets and
    adds them to a new hashtags column.
    '''
    hashtags = []
    new_tag_list = []
        
    #Find all the items that start with a '#'
    for i in x:
        ht = re.findall(r"#(\w+)", i)
        hashtags.append(ht)         
    
    #Replace empty tag lists with NaN
    for tag in hashtags:
        if tag == []:
            tag = np.nan
        new_tag_list.append(tag)
        
    return new_tag_list

In [None]:
# Mentions function - re.findall already captures every mention in a tweet
def mention_column(x):
    '''
    This function extracts mentions from tweets and
    adds them to a new mentions column.
    '''
    mentions = []
    new_mention_list = []

    #Find all the items that start with a '@'
    for i in x:
        ht = re.findall(r"@(\w+)", i)
        mentions.append(ht)         

    #Replace empty mention lists with NaN        
    for tag in mentions:
        if tag == []:
            tag = np.nan
        new_mention_list.append(tag)
        
    return new_mention_list

In [None]:
# Tokenized tweet - apply word_tokenize from NLTK
df_with_metadata['message_token'] = df_with_metadata['message'].apply(lambda x: word_tokenize(x))


In [None]:
# Categorise as retweet or not

#New 'retweet' column to return yes if the first 2 characters are 'RT', else return no
df_with_metadata['retweet'] = ['yes' if df_with_metadata['message'][i][:2] == 'RT' 
                               else 'no' for i in range(len(df_with_metadata))]


In [None]:
# Create new column 'hashtags' and extract hashtags with hashtag_column function.
df_with_metadata['hashtags'] = hashtag_column(df_with_metadata['message'])

test_df_with_metadata['hashtags'] = hashtag_column(test_df_with_metadata['message'])

#Count how many times a word starts with #
df_with_metadata['hashtag_count'] = df_with_metadata['message'].apply(
    lambda tweet: len([word for word in tweet.split() if word.startswith('#')]))

test_df_with_metadata['hashtag_count'] = test_df_with_metadata['message'].apply(
    lambda tweet: len([word for word in tweet.split() if word.startswith('#')]))


In [None]:
# Create new column 'mentions' and extract mentions with mention_column function.
df_with_metadata['mentions'] = mention_column(df_with_metadata['message'])

test_df_with_metadata['mentions'] = mention_column(test_df_with_metadata['message'])

#Count how many times a word starts with @
df_with_metadata['mention_count'] = df_with_metadata['message'].apply(
    lambda tweet: len([word for word in tweet.split() if word.startswith('@')]))

test_df_with_metadata['mention_count'] = test_df_with_metadata['message'].apply(
    lambda tweet: len([word for word in tweet.split() if word.startswith('@')]))


In [None]:
# Character count
df_with_metadata['char_count'] = df_with_metadata['message'].str.len()
test_df_with_metadata['char_count'] = test_df_with_metadata['message'].str.len()

# Word count
df_with_metadata['word_count'] = df_with_metadata['message'].str.split().str.len()
test_df_with_metadata['word_count'] = test_df_with_metadata['message'].str.split().str.len()


In [None]:
# Average word length
df_with_metadata['avg_word_length'] = df_with_metadata['message'].apply(
    lambda tweet: round(sum([len(word) for word in tweet.split()]) / len(tweet.split()),2))

test_df_with_metadata['avg_word_length'] = test_df_with_metadata['message'].apply(
    lambda tweet: round(sum([len(word) for word in tweet.split()]) / len(tweet.split()),2))

In [None]:
# Stop word count
df_with_metadata['stopword_count'] = df_with_metadata['message'].apply(
    lambda tweet: len([word for word in tweet.split() if word in STOP_WORDS]))

test_df_with_metadata['stopword_count'] = test_df_with_metadata['message'].apply(
    lambda tweet: len([word for word in tweet.split() if word in STOP_WORDS]))

# 6. Text cleaning <a name="cleaning"></a>
[Return to top](#top) <br><br>


The following text cleaning processes were applied to the tweets:
- Remove the recurring special character combination 'Ã¢â‚¬Â¦' (found upon inspection)
- Tokenise text
- Make the text lowercase
- Expand contracted words
- Part of speech tagging
- Lemmatising text - replace with base words
- Remove numbers
- Remove punctuation
- Remove stop words 

Stop words are commonly used words such as 'the', 'a' and 'in'.
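One detail of the contraction-expansion step is worth illustrating: when a regular expression is built from dictionary keys, the order of the alternatives matters, because a short key such as "can't" is a prefix of a longer key such as "can't've". The sketch below uses a three-entry dictionary that is purely illustrative and sorts the keys longest-first; it is one possible way to avoid partial matches, not necessarily what the notebook does.

```python
import re

# A tiny subset of a contraction dictionary, for illustration only
demo_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "it's": "it is",
}

# Longest keys first, so "can't've" is tried before its prefix "can't"
ordered_keys = sorted(demo_contractions, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in ordered_keys))

def expand(text):
    """Replace every contraction found in text with its expanded form."""
    return pattern.sub(lambda m: demo_contractions[m.group(0)], text)

expanded = [expand(tok) for tok in ["it's", "can't've", "climate"]]
```

With the keys in dictionary order instead, "can't've" would be matched as "can't" and come out as "cannot've".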

In [None]:
#Contraction dictionary
c_dict = {
  "ain't": "am not",
  "aren't": "are not",
  "can't": "cannot",
  "can't've": "cannot have",
  "'cause": "because",
  "could've": "could have",
  "couldn't": "could not",
  "couldn't've": "could not have",
  "didn't": "did not",
  "doesn't": "does not",
  "don't": "do not",
  "hadn't": "had not",
  "hadn't've": "had not have",
  "hasn't": "has not",
  "haven't": "have not",
  "he'd": "he would",
  "he'd've": "he would have",
  "he'll": "he will",
  "he'll've": "he will have",
  "he's": "he is",
  "how'd": "how did",
  "how'd'y": "how do you",
  "how'll": "how will",
  "how's": "how is",
  "i'd": "I would",
  "i'd've": "I would have",
  "i'll": "I will",
  "i'll've": "I will have",
  "i'm": "I am",
  "i've": "I have",
  "isn't": "is not",
  "it'd": "it had",
  "it'd've": "it would have",
  "it'll": "it will",
  "it'll've": "it will have",
  "it's": "it is",
  "let's": "let us",
  "ma'am": "madam",
  "mayn't": "may not",
  "might've": "might have",
  "mightn't": "might not",
  "mightn't've": "might not have",
  "must've": "must have",
  "mustn't": "must not",
  "mustn't've": "must not have",
  "needn't": "need not",
  "needn't've": "need not have",
  "o'clock": "of the clock",
  "oughtn't": "ought not",
  "oughtn't've": "ought not have",
  "shan't": "shall not",
  "sha'n't": "shall not",
  "shan't've": "shall not have",
  "she'd": "she would",
  "she'd've": "she would have",
  "she'll": "she will",
  "she'll've": "she will have",
  "she's": "she is",
  "should've": "should have",
  "shouldn't": "should not",
  "shouldn't've": "should not have",
  "so've": "so have",
  "so's": "so is",
  "that'd": "that would",
  "that'd've": "that would have",
  "that's": "that is",
  "there'd": "there had",
  "there'd've": "there would have",
  "there's": "there is",
  "they'd": "they would",
  "they'd've": "they would have",
  "they'll": "they will",
  "they'll've": "they will have",
  "they're": "they are",
  "they've": "they have",
  "to've": "to have",
  "wasn't": "was not",
  "we'd": "we had",
  "we'd've": "we would have",
  "we'll": "we will",
  "we'll've": "we will have",
  "we're": "we are",
  "we've": "we have",
  "weren't": "were not",
  "what'll": "what will",
  "what'll've": "what will have",
  "what're": "what are",
  "what's": "what is",
  "what've": "what have",
  "when's": "when is",
  "when've": "when have",
  "where'd": "where did",
  "where's": "where is",
  "where've": "where have",
  "who'll": "who will",
  "who'll've": "who will have",
  "who's": "who is",
  "who've": "who have",
  "why's": "why is",
  "why've": "why have",
  "will've": "will have",
  "won't": "will not",
  "won't've": "will not have",
  "would've": "would have",
  "wouldn't": "would not",
  "wouldn't've": "would not have",
  "y'all": "you all",
  "y'alls": "you alls",
  "y'all'd": "you all would",
  "y'all'd've": "you all would have",
  "y'all're": "you all are",
  "y'all've": "you all have",
  "you'd": "you had",
  "you'd've": "you would have",
  "you'll": "you you will",
  "you'll've": "you you will have",
  "you're": "you are",
  "you've": "you have"
}
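# Try longer contractions first so that, e.g., "can't've" is not matched as "can't"
c_re = re.compile('(%s)' % '|'.join(
    re.escape(k) for k in sorted(c_dict, key = len, reverse = True)))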


In [None]:
# Creating library objects
tokenizer = TweetTokenizer()
lemmatizer = WordNetLemmatizer()
punc = list(set(string.punctuation))

# Adding additional stop words
additional_stopwords = ['', ' ', 'say', 's', 'u', 'ap', 'afp', '...',
                        'n', '\\', 'â', '’', '¢', '…', 'it is', "do not"]

stop_words = ENGLISH_STOP_WORDS.union(additional_stopwords)

# Function to expand contracted words
def expandContractions(text, c_re = c_re):
    '''
    The function replaces contracted words with
    their expanded form from c_dict.
    '''
    def replace(match):
        return c_dict[match.group(0)]
    return c_re.sub(replace, text)

# Function to extract parts of speech

def get_word_net_pos(treebank_tag):
    '''
    Function to map a Penn Treebank POS tag to the matching
    WordNet POS constant (returns None if there is no match).
    '''
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

    
# Function to lemmatise words

def lemma_wordnet(tagged_text):
    '''
    This function returns the lemmatized base of each
    word by using wordnet lemmatizer.
    ''' 
    final = []
    for word, tag in tagged_text:
        wordnet_tag = get_word_net_pos(tag)
        if wordnet_tag is None:
            final.append(lemmatizer.lemmatize(word))
        else:
            final.append(lemmatizer.lemmatize(word, pos = wordnet_tag))
    return final


# Function to process text

def process_text(text):
    '''
    Function to tokenise text, make all the text lowercase, 
    expanding contracted words, extracting parts of speech tags,
    lemmatising text, and removing numbers, punctuation and stop words.
    '''
    # Tokenize text using tokenizer
    tokenized = tokenizer.tokenize(text)
    
    # Make each word lowercase
    lower = [item.lower() for item in tokenized] 
    
    #Expand contracted words
    decontract = [expandContractions(item, c_re = c_re) for item in lower]
    
    # POS tagging
    tagged = nltk.pos_tag(decontract) 
    
    # Lemmatize words using lemma_wordnet function
    lemma = lemma_wordnet(tagged) 
    
    # Remove numbers from text
    no_num = [re.sub('[0-9]+', '', each) for each in lemma] 
    
    # Remove tokens that are punctuation marks (i.e. found in the punc list)
    no_punc = [w for w in no_num if w not in punc] 
    
    # Remove tokens that are in the stop_words list
    no_stop = [w for w in no_punc if w not in stop_words] 
    
    return no_stop # Return processed text

# Clean the tweets on training and testing data
df_with_metadata['cleaned_text'] = df_with_metadata['message'].apply(process_text)
test_df_with_metadata['cleaned_text'] = test_df_with_metadata['message'].apply(process_text)


Let's look at the sentiment and cleaned tweet tokens in comparison with the original tweets.

In [None]:
# Display cleaned tweets
df_with_metadata[['message','cleaned_text','sentiment']]

# 7. Exploratory data analysis <a name="eda"></a>
[Return to top](#top) <br><br>


We will be exploring our data and drawing information from the original tweets, the cleaned tweets and the data we have extracted.

Let's take another look at the columns that were created.

In [None]:
df_with_metadata.head()

### Retweets
Let's start by looking at the retweets. 

In [None]:
#Bar graph of number of retweets
sns.catplot(x = "retweet", kind = "count", edgecolor = ".6",palette = "pastel",
            data = df_with_metadata);

valuecounts = df_with_metadata['retweet'].value_counts()

# Print percentage of each value, looked up by label rather than by position
print('Yes: ', round(valuecounts['yes']/
                     len(df_with_metadata['retweet'])*100,2),'%')
print('No: ', round(valuecounts['no']/
                    len(df_with_metadata['retweet'])*100,2),'%')

Just over 60% of these tweets are retweets! There might be some duplicate tweets. We explore this by looking at the top 10 most retweeted tweets and how many times each was retweeted.

In [None]:
#View the top 10 retweeted tweets
df_rt_counts = pd.DataFrame(df_with_metadata['message']
                            .astype(str).value_counts())
df_rt_counts.head(10)

We have a lot of occurrences of the same tweet, so we would assume that the sentiment is the same for each of them. Let's look at the top duplicate tweet to confirm this.

In [None]:
#View sentiment of tweet with the highest retweet count
df_most_duplicated_tweet = pd.DataFrame(df_with_metadata[df_with_metadata['message'] == "RT @StephenSchlegel: she's thinking about how she's going to die because your husband doesn't believe in climate change https://t.co/SjoFoNÃ¢â‚¬Â¦"])

df_most_duplicated_tweet[['message','sentiment']]


There are quite a few duplicate tweets with the same sentiment. This might be a problem, as it could give too much weight to certain categories.
 
There is also a recurrence of the following special character combination: 'Ã¢â‚¬Â¦'. Let's remove duplicates and this special character set before we continue.
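When removing duplicates it helps to be clear about which copies are discarded. pandas' `drop_duplicates` accepts `keep='first'`, `keep='last'` or `keep=False`, and the last option removes every copy of a duplicated row. The plain-Python sketch below mimics two of these behaviours on an invented list of messages, as an illustration only.

```python
from collections import Counter

# Hypothetical messages, one of which appears twice
messages = ["RT @a: same tweet", "unique tweet", "RT @a: same tweet", "another one"]
counts = Counter(messages)

# Equivalent of keep=False: discard every message that occurs more than once
keep_none = [m for m in messages if counts[m] == 1]

# Equivalent of keep='first': retain one copy of each message, in order
keep_first = list(dict.fromkeys(messages))
```

With `keep=False` the retweeted message disappears from the data entirely, whereas `keep='first'` leaves a single copy of it. Which is preferable depends on whether heavily retweeted content should still be represented once.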


In [None]:
# View shape of df_with_metadata
df_with_metadata.shape


In [None]:
# Removing special character combinations
df_with_metadata['message'] = [re.sub('Ã¢â‚¬Â¦', '', i)
                               for i in df_with_metadata['message']]

print(df_with_metadata['message'][10])


In [None]:
# Drop duplicated tweets
# Note: keep = False removes every copy of a duplicated tweet, including the first occurrence
df_with_metadata.drop_duplicates(subset = "message", 
                     keep = False, inplace = True) 

# View shape of df_with_metadata with duplicates removed
print(df_with_metadata.shape)


In [None]:
# View df with metadata again
df_with_metadata.head()


### Hashtags and Mentions

We can tell a lot from the sentiment of tweets by looking at the hashtags which are used. Which hashtags appear the most in these tweets?

In [None]:
# Drop null values from hashtags column
df_hashtags = df_with_metadata['hashtags'].dropna()

#Join all the text in the list and remove apostrophes
all_words_hastags = ' '.join([text for text in df_hashtags.astype(str)])
all_words_hastags = all_words_hastags.replace("'", "")

# Create word cloud
wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
                      max_font_size = 110).generate(all_words_hastags)

# Plot word cloud
print('Word cloud of top hashtags')
plt.figure(figsize = (10, 7))
plt.imshow(wordcloud, interpolation = "bilinear")
plt.axis('off')
plt.show()


Looking at popular hashtags across all categories, it seems as though the prominent hashtags include: 'Climatechange', 'climate', 'ActOnClimate'.
These are to be expected, as this dataset consists of tweets related to climate change.

Other prominent hashtags include: 'Paris Agreement', 'Trump', 'MAGA' etc. *This makes it seem as though these tweets are **American** and are **politically related**.*

The appearance of the hashtags 'Election night' and 'I'm Voting Because' also makes it seem as though these tweets were **sampled from Twitter during the 2016 American presidential election.**

Let's have a look at mentions to confirm our hypothesis:

In [None]:
# Drop null values from mentions column
df_mentions = df_with_metadata['mentions'].dropna()

# Join all the text in the list and remove apostrophes
all_words_mentions = ' '.join([text for text in df_mentions.astype(str)])
all_words_mentions = all_words_mentions.replace("'", "")

# Create word cloud
wordcloud = WordCloud(width = 800, height = 500, random_state = 21,
                    max_font_size = 110).generate(all_words_mentions)

# Plot word cloud
print('Word cloud of top mentions')
plt.figure(figsize = (10, 7))
plt.imshow(wordcloud, interpolation = "bilinear")
plt.axis('off')
plt.show()


There is a much wider spread of mentions, with Donald Trump topping the list. The prominent mentions include **'realDonaldTrump', 'POTUS', 'SenSanders', 'CNN'** etc., which also makes it seem as though these tweets were taken during the election period.
One has to be cautious when analyzing mentions, as there are two main types of mentions on Twitter:

1) **Where a Twitter profile is referred to**

2) **Where a Twitter profile's tweet is retweeted**

Since a mention occurs every time a tweet is retweeted, it would be worth looking into the mentions of only the retweets, given more time.
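The two kinds of mention described above could be separated with a small helper like the one below. This is a hedged sketch rather than part of the analysis: `split_mentions` and the example tweets are hypothetical, and it assumes that retweets always begin with the `RT @user` prefix.

```python
import re

def split_mentions(tweet):
    """Return (retweeted_author, other_mentions) for one tweet."""
    # A retweet starts with 'RT @user', naming the original author
    match = re.match(r"RT @(\w+)", tweet)
    retweeted = match.group(1) if match else None
    # Look for ordinary mentions only in the text after the retweet prefix
    body = tweet[match.end():] if match else tweet
    return retweeted, re.findall(r"@(\w+)", body)

retweet_example = split_mentions("RT @CNN: @realDonaldTrump comments on climate change")
plain_example = split_mentions("Hey @POTUS, climate change is real")
```

Counting the first element separately from the second would show how much of a profile's prominence comes from being retweeted rather than from being talked about.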

Let's have a look at the character count distribution for these tweets:


#### Character and word count

In [None]:
# Display distribution of total characters
sns.distplot(df_with_metadata['char_count'])


We can see from this plot that most of the tweets use the full number of characters allowed (140). We will see how this changes in each category.

### EDA for categories

Let's dig a bit deeper into the tweets in each category to see if we can find out the reason they were classified in this manner as well as seeing if we can identify any similarities.

We start by separating the dataframes to analyze each category and looking at the most common mentions and hashtags.

In [None]:
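# Create dataframes for each category
# (the mask is built from df_with_metadata itself, since rows were dropped from it above)
df_negative_tweets = df_with_metadata[df_with_metadata['sentiment'] == -1]
df_neutral_tweets = df_with_metadata[df_with_metadata['sentiment'] == 0]
df_positive_tweets = df_with_metadata[df_with_metadata['sentiment'] == 1]
df_news_tweets = df_with_metadata[df_with_metadata['sentiment'] == 2]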
df_negative_tweets.head()


#### Most common mentions
We look at the most common mentions per category.

In [None]:
# Drop null values from mentions column in negative tweets
df_neg_mentions = df_negative_tweets['mentions'].dropna()

# Join all the text in the list and remove apostrophes
all_words_mentions_neg = ' '.join([text for text in df_neg_mentions.astype(str)])
all_words_mentions_neg = all_words_mentions_neg.replace("'", "")

# Create word cloud
wordcloud1 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_mentions_neg)




# Drop null values from mentions column in neutral tweets
df_neut_mentions = df_neutral_tweets['mentions'].dropna()

# Join all the text in the list and remove apostrophes
all_words_mentions_neut = ' '.join([text for text in df_neut_mentions.astype(str)])
all_words_mentions_neut = all_words_mentions_neut.replace("'", "")

# Create word cloud
wordcloud2 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_mentions_neut)



# Drop null values from mentions column in positive tweets
df_pos_mentions = df_positive_tweets['mentions'].dropna()

# Join all the text in the list and remove apostrophes
all_words_mentions_pos = ' '.join([text for text in df_pos_mentions.astype(str)])
all_words_mentions_pos = all_words_mentions_pos.replace("'","")

# Create word cloud
wordcloud3 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_mentions_pos)



# Drop null values from mentions column in news tweets
df_news_mentions = df_news_tweets['mentions'].dropna()

# Join all the text in the list and remove apostrophes
all_words_mentions_news = ' '.join([text for text in df_news_mentions.astype(str)])
all_words_mentions_news = all_words_mentions_news.replace("'","")

# Create word cloud
wordcloud4 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_mentions_news)



# Plot all 4 word clouds to compare
fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (15, 11))
fig.suptitle('Mentions')

# Negative mention word cloud
axes[0,0].imshow(wordcloud1, interpolation = "bilinear")
axes[0,0].axis('off')
axes[0,0].set_title('NEGATIVE')

# Neutral mention word cloud
axes[1,0].imshow(wordcloud2, interpolation = "bilinear")
axes[1,0].axis('off')
axes[1,0].set_title('NEUTRAL')

# Positive mention word cloud
axes[0,1].imshow(wordcloud3, interpolation = "bilinear")
axes[0,1].axis('off')
axes[0,1].set_title('POSITIVE')

# News mention word cloud
axes[1,1].imshow(wordcloud4, interpolation = "bilinear")
axes[1,1].axis('off')
axes[1,1].set_title('NEWS')
fig.tight_layout()

#### Most common hashtags
Now we look at the most common hashtags per category.

In [None]:
# Drop null values from hashtags column in negative tweets
df_neg_hashtags = df_negative_tweets['hashtags'].dropna()

# Join all the text in the list and remove apostrophes
all_words_hashtags_neg = ' '.join([text for text in df_neg_hashtags.astype(str)])
all_words_hashtags_neg = all_words_hashtags_neg.replace("'", "")

#Create word cloud
wordcloud1 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_hashtags_neg)



# Drop null values from hashtags column in neutral tweets
df_neut_hashtags = df_neutral_tweets['hashtags'].dropna()

# Join all the text in the list and remove apostrophes
all_words_hashtags_neut = ' '.join([text for text in df_neut_hashtags.astype(str)])
all_words_hashtags_neut = all_words_hashtags_neut.replace("'", "")

# Create word cloud
wordcloud2 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_hashtags_neut)



# Drop null values from hashtags column in positive tweets
df_pos_hashtags = df_positive_tweets['hashtags'].dropna()

# Join all the text in the list and remove apostrophes
all_words_hashtags_pos = ' '.join([text for text in df_pos_hashtags.astype(str)])
all_words_hashtags_pos = all_words_hashtags_pos.replace("'", "")

#Create word cloud
wordcloud3 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_hashtags_pos)



# Drop null values from hashtags column in news tweets
df_news_hashtags = df_news_tweets['hashtags'].dropna()

# Join all the text in the list and remove apostrophes
all_words_hashtags_news = ' '.join([text for text in df_news_hashtags.astype(str)])
all_words_hashtags_news = all_words_hashtags_news.replace("'", "")

# Create word cloud
wordcloud4 = WordCloud(width = 800, height = 500, random_state = 21,
                       max_font_size = 110).generate(all_words_hashtags_news)



# Plot all 4 word clouds to compare
fig, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (15, 11))
fig.suptitle('Hashtags')

# Negative hashtag word cloud
axes[0,0].imshow(wordcloud1, interpolation = "bilinear")
axes[0,0].axis('off')
axes[0,0].set_title('NEGATIVE')

# Neutral hashtag word cloud
axes[1,0].imshow(wordcloud2, interpolation = "bilinear")
axes[1,0].axis('off')
axes[1,0].set_title('NEUTRAL')

# Positive hashtag word cloud
axes[0,1].imshow(wordcloud3, interpolation = "bilinear")
axes[0,1].axis('off')
axes[0,1].set_title('POSITIVE')

# News hashtag word cloud
axes[1,1].imshow(wordcloud4, interpolation = "bilinear")
axes[1,1].axis('off')
axes[1,1].set_title('NEWS')
fig.tight_layout()


#### Most Important Words

Let's put together a dataframe of the top words, hashtags and mentions so that we can see what words are influencing each category.

In [None]:
# Word frequency function
def word_freq(clean_text_list, top_n):
    '''
    This function returns a dataframe with the most common words
    in the text and the count of their frequency. It takes in a
    list of clean text, and top n frequency.
    '''
    flat = [item for sublist in clean_text_list for item in sublist]
    with_counts = Counter(flat)
    top = with_counts.most_common(top_n)
    word = [each[0] for each in top]
    num = [each[1] for each in top]
    return pd.DataFrame([word, num]).T


In [None]:
# # add borders to dataframes
# %%HTML
# <style type="text/css">
# table.dataframe td, table.dataframe th {
#     border: 1px  black solid !important;
#   color: black !important;
# }
# </style>

In [None]:
# Use word_freq function to retrieve the most common words

# Top 20 most frequent words
topn = 20

# Word frequency of all sentiments
all_words_list = df_with_metadata['cleaned_text'].tolist()
all_top = word_freq(all_words_list, topn)

# Word frequency of negative sentiments
neg_words_list = df_negative_tweets['cleaned_text'].tolist()
neg_top = word_freq(neg_words_list, topn)

# Word frequency of neutral sentiments
neut_words_list = df_neutral_tweets['cleaned_text'].tolist()
neut_top = word_freq(neut_words_list, topn)

# Word frequency of positive sentiments
pos_words_list = df_positive_tweets['cleaned_text'].tolist()
pos_top = word_freq(pos_words_list, topn)

# Word frequency of news sentiments
news_words_list = df_news_tweets['cleaned_text'].tolist()
news_top = word_freq(news_words_list, topn)

# Create new dataframe from the top words
df_top = pd.concat([all_top,neg_top, neut_top, pos_top, news_top], axis = 1)
cols = ['All','Count','Negative', 'Count', 'Neutral',
        'Count', 'Positive', 'Count', 'News', 'Count']
df_top.columns = cols

# Return dataframe with top words
df_top


As we can see:

- for all words: **'climate', 'change', 'rt', 'global', and 'warming'** are at the top of the word counts. These are the top occurrences across all categories.

- for negative words: **'science', 'cause', 'real', and 'scam'** stand out as top words that are distinct to the negative class.

- for news words: **'fight', 'epa', 'pruit', 'scientist', and 'new'** stand out as top words that are distinct to the news class.

How we handle these words in our model is important, as they could carry different weight in different categories and influence predictions.

The positive sentiment has higher counts for its top words because positive tweets make up the largest share of the dataset.
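
Because raw counts grow with class size, one way to compare the categories on equal footing is to look at each word's share of all tokens in its class. The sketch below is illustrative only: `relative_word_freq` is a hypothetical helper that is not part of this notebook, and the toy lists stand in for the tokenised `cleaned_text` column.

```python
from collections import Counter

import pandas as pd


def relative_word_freq(token_lists, top_n):
    """Return the top_n words with their count and share of all tokens
    (hypothetical helper, not part of the original notebook)."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    total = sum(counts.values())
    rows = [(word, count, count / total)
            for word, count in counts.most_common(top_n)]
    return pd.DataFrame(rows, columns=['Word', 'Count', 'Share'])


# Toy data standing in for the tokenised tweets of two classes
toy_positive = [['climate', 'change', 'real'], ['climate', 'action']]
toy_negative = [['climate', 'scam']]

pos_freq = relative_word_freq(toy_positive, 2)
neg_freq = relative_word_freq(toy_negative, 2)
print(pos_freq)
print(neg_freq)
```

With shares instead of counts, a word that dominates a small class is no longer hidden by the larger volume of the positive class.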


#### Most Important Hashtags

In [None]:
# Use word_freq function to retrieve the most common hashtags

# Top 20 most common hashtags
topn = 20

# Hashtag frequency of all sentiments
all_words_list_hashtags = df_with_metadata['hashtags'].dropna().tolist()
all_top_hashtags = word_freq(all_words_list_hashtags, topn)

# Hashtag frequency of negative sentiments
neg_words_list_hashtags = df_negative_tweets['hashtags'].dropna().tolist()
neg_top_hashtags = word_freq(neg_words_list_hashtags, topn)

# Hashtag frequency of neutral sentiments
neut_words_list_hashtags = df_neutral_tweets['hashtags'].dropna().tolist()
neut_top_hashtags = word_freq(neut_words_list_hashtags, topn)

# Hashtag frequency of positive sentiments
pos_words_list_hashtags = df_positive_tweets['hashtags'].dropna().tolist()
pos_top_hashtags = word_freq(pos_words_list_hashtags, topn)

# Hashtag frequency of news sentiments
news_words_list_hashtags = df_news_tweets['hashtags'].dropna().tolist()
news_top_hashtags = word_freq(news_words_list_hashtags, topn)

# Create new dataframe from the top hashtags
df_top = pd.concat([all_top_hashtags,neg_top_hashtags,neut_top_hashtags,
                    pos_top_hashtags, news_top_hashtags], axis = 1)
cols = ['All','Count','Negative', 'Count', 'Neutral',
        'Count', 'Positive', 'Count', 'News', 'Count']
df_top.columns = cols

# Return dataframe with top hashtags
df_top

It looks like few hashtags are repeated in the Negative and Neutral tweets, probably because these classes have the lowest number of samples in the dataset.

News and Positive, on the other hand, have quite a few repeated hashtags. These top hashtags could be used as more heavily weighted predictors, specifically for news.
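
One possible way to expose the top hashtags to a model is to add a 0/1 indicator column per hashtag. This is a sketch under assumptions: `add_hashtag_flags` is a hypothetical helper, the hashtag names are made up, and the `hashtags` column is assumed to hold a list of strings per tweet, with a missing value where a tweet has none.

```python
import pandas as pd


def add_hashtag_flags(df, top_hashtags, column='hashtags'):
    """Add one 0/1 column per hashtag in top_hashtags
    (hypothetical helper, not part of the original notebook)."""
    out = df.copy()
    for tag in top_hashtags:
        # 1 if the tweet has a hashtag list containing this tag, else 0
        out['has_' + tag] = out[column].apply(
            lambda tags: int(isinstance(tags, list) and tag in tags)
        )
    return out


# Toy data: the second tweet has no hashtags
toy = pd.DataFrame({'hashtags': [['climate', 'news'], None, ['trump']]})
flagged = add_hashtag_flags(toy, ['climate', 'trump'])
print(flagged)
```

The same pattern would apply to the `mentions` column if it is stored in the same way.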

#### Most Important Mentions

In [None]:
# Use word_freq function to retrieve the most common mentions

# Top 20 most common mentions
topn = 20

# Mention frequency of all sentiments
all_words_list_mentions = df_with_metadata['mentions'].dropna().tolist()
all_top_mentions = word_freq(all_words_list_mentions, topn)

# Mention frequency of negative sentiments
neg_words_list_mentions = df_negative_tweets['mentions'].dropna().tolist()
neg_top_mentions = word_freq(neg_words_list_mentions, topn)

# Mention frequency of neutral sentiments
neut_words_list_mentions = df_neutral_tweets['mentions'].dropna().tolist()
neut_top_mentions = word_freq(neut_words_list_mentions, topn)

# Mention frequency of positive sentiments
pos_words_list_mentions = df_positive_tweets['mentions'].dropna().tolist()
pos_top_mentions = word_freq(pos_words_list_mentions, topn)

# Mention frequency of news sentiments
news_words_list_mentions = df_news_tweets['mentions'].dropna().tolist()
news_top_mentions = word_freq(news_words_list_mentions, topn)

# Create new dataframe from the top mentions
df_top = pd.concat([all_top_mentions,neg_top_mentions, neut_top_mentions,
                    pos_top_mentions, news_top_mentions], axis = 1)
cols = ['All','Count','Negative', 'Count', 'Neutral',
        'Count', 'Positive', 'Count', 'News', 'Count']
df_top.columns = cols

# Return dataframe with top mentions
df_top

From our research, most of these mentions occur in retweets. As with the hashtags, the negative and neutral tweets have low mention occurrences. It would be a good idea to give weight to these mentions, as certain mentions occur frequently in particular categories. Notably:

- @realDonaldTrump is at the top of the positive list, probably because a lot of tweets are directed at him
- @thehill and @CNN are at the top of the News list, which is to be expected and could be used for predictions



### EDA Summary 
Let's summarise a few key points of what we have found so far:

1) About 60% of these tweets are positive towards climate change - This indicates an imbalanced training dataset.

2) This data seems to be taken from Americans around the time of the 2016 US presidential elections.

3) @realDonaldTrump is the top mentioned account 

4) 'Climatechange', 'climate', and 'Trump' are the three most used hashtags

5) The full character limit of 140 characters was used in the majority of these tweets.

6) Retweets create duplicate data points in the dataset

7) Top organisations include renowned climate change activists which could be useful from a business networking perspective
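
The duplicate data points created by retweets (point 6) can be counted, and optionally dropped, with pandas. The snippet below is a minimal sketch on a toy frame. The column names mirror `message` and `sentiment` used in this notebook, but the rows are invented for illustration.

```python
import pandas as pd

# Toy frame: the first two rows are the same retweet
toy = pd.DataFrame({
    'message': ['RT @user: Climate change is real',
                'RT @user: Climate change is real',
                'New EPA report released'],
    'sentiment': [1, 1, 2],
})

# Flag every row whose message also appears elsewhere in the data
is_duplicate = toy.duplicated(subset='message', keep=False)
print('Rows sharing a message:', is_duplicate.sum())

# Keep only the first copy of each message
deduplicated = toy.drop_duplicates(subset='message', keep='first')
print(deduplicated)
```

Whether to drop the duplicates is a modelling decision: repeated retweets may carry real signal about how widely a view is shared.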

# 8. Balancing dataset <a name="balance"></a>

[Return to top](#top) <br><br>

We balance the classes so that each sentiment has the same number of samples. We tried two methods to achieve this:

1) Downsample majority classes<br>
2) Upsample minority classes
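
Both methods can be expressed with a single helper that resamples every class to a chosen size. This is a sketch only: `balance_classes` is a hypothetical function that is not part of this notebook, and the toy frame stands in for `df_with_metadata`.

```python
import pandas as pd
from sklearn.utils import resample


def balance_classes(df, label_col, n_samples, random_state=27):
    """Resample every class in label_col to n_samples rows
    (hypothetical helper, not part of the original notebook)."""
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(resample(
            group,
            # Sample with replacement only when the class must grow
            replace=len(group) < n_samples,
            n_samples=n_samples,
            random_state=random_state,
        ))
    return pd.concat(parts)


# Toy data with class sizes 6, 3 and 2
toy = pd.DataFrame({
    'sentiment': [1] * 6 + [0] * 3 + [-1] * 2,
    'message': ['tweet {}'.format(i) for i in range(11)],
})
balanced = balance_classes(toy, 'sentiment', n_samples=3)
print(balanced['sentiment'].value_counts())
```

Note that resampling before the train/test split can place copies of the same tweet on both sides of the split, so it may be safer to resample the training portion only.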


#### Downsample

In [None]:
# Upsampling - class size is 50% of the majority pro (1) class
class_size_up = int(len(df_with_metadata[df_with_metadata['sentiment'] == 1])/2)
print('Upsampling class size:', class_size_up)

# Downsampling - class size is the size of the minority anti (-1) class
class_size_down = int(len(df_with_metadata[df_with_metadata['sentiment' ] == -1]))
print('Downsampling class size:', class_size_down)


In [None]:
pro = df_with_metadata[df_with_metadata['sentiment'] == 1]
news = df_with_metadata[df_with_metadata['sentiment'] == 2]
neutral = df_with_metadata[df_with_metadata['sentiment'] == 0]
anti = df_with_metadata[df_with_metadata['sentiment'] == -1]

In [None]:
# Downsampling all classes to meet the minority class count - 1296

#Downsample pro - sentiment = 1
pro_downsampled = resample(pro,
                          replace = False, # Sample without replacement
                          n_samples = class_size_down, # Match minority class
                          random_state = 27) # Reproducible results

#Downsample news - sentiment = 2
news_downsampled = resample(news,
                          replace = False, # Sample without replacement
                          n_samples = class_size_down, # Match minority class
                          random_state = 27) # Reproducible results

#Downsample neutral - sentiment = 0
neutral_downsampled = resample(neutral,
                          replace = False, # Sample without replacement
                          n_samples = class_size_down, # Match minority class
                          random_state = 27) # Reproducible results

# Combine downsampled majority class with minority class
downsampled = pd.concat([pro_downsampled, news_downsampled,
                         neutral_downsampled, anti])

# Check new class counts
print(downsampled['sentiment'].value_counts())

In [None]:
#Bar graph of balanced sentiment count
sns.catplot(x = "sentiment", kind = "count", edgecolor = ".6",
            palette = "pastel",data = downsampled);

#### Upsampling

In [None]:
# Resampling all classes to meet 50% of the majority class count - 4265

# Downsample pro - sentiment = 1
pro_downsampled = resample(pro,
                          replace = False, # sample without replacement
                          n_samples = class_size_up, # 50% of majority class
                          random_state = 27) # reproducible results

# Upsample anti - sentiment = -1
anti_upsampled = resample(anti,
                          replace = True, # sample with replacement
                          n_samples = class_size_up, # 50% of majority class
                          random_state = 27) # reproducible results

# Upsample news - sentiment = 2
news_upsampled = resample(news,
                          replace = True, # sample with replacement
                          n_samples = class_size_up, # 50% of majority class
                          random_state = 27) # reproducible results

# Upsample neutral - sentiment = 0
neutral_upsampled = resample(neutral,
                          replace = True, # sample with replacement
                          n_samples = class_size_up, # 50% of majority class
                          random_state = 27) # reproducible results

# Combine downsampled majority class with upsampled minority classes
upsampled = pd.concat([pro_downsampled, anti_upsampled,
                       news_upsampled, neutral_upsampled])

# Check new class counts
print(upsampled['sentiment'].value_counts())

In [None]:
#Bar graph of balanced sentiment count
sns.catplot(x="sentiment", kind="count", edgecolor=".6",
            palette="pastel",data=upsampled);

# 9. Base model <a name="base"></a>

[Return to top](#top) <br><br>

We run a selection of basic models on our raw data to get a benchmark F1 score that we will attempt to improve. The base models include:

- Logistic regression
- Linear SVC
- Gaussian Naive Bayes
- Decision tree
- Random forest

Let's have another look at the information we have extracted from the data.

In [None]:
# Review metadata df
df_with_metadata.head()

In [None]:
# Create feature_df by filtering columns from metadata
feature_df = df_with_metadata.filter([
                    'sentiment','message', 'retweet',
                    'hashtag_count', 'mention_count', 'char_count',
                    'word_count', 'avg_word_length', 'stopword_count'
                                     ], axis = 1)

# Replace yes and no with 1 and 0 in retweet columns
feature_df['retweet'] = feature_df['retweet'].map(dict(yes = 1, no = 0))

# View feature_df
feature_df.head()


In [None]:
# Vectorize text and keep additional features with DataFrameMapper
data = feature_df

# Use TF-IDF vectorizer on original text
mapper = DataFrameMapper([
     ('message', TfidfVectorizer()),
     ('retweet', None),
     ('hashtag_count', None),
     ('mention_count', None),
     ('char_count', None),
     ('word_count', None),
     ('avg_word_length', None),
 ])

# X features
X = mapper.fit_transform(data)

# y label
y = feature_df['sentiment']

In [None]:
# Split dataset with 20% test size
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.2,  
                                                    random_state = 27)

In [None]:
# Use MaxAbsScaler() to normalise values
max_abs_scaler = preprocessing.MaxAbsScaler()

# Fit to and transform X_train
X_train = max_abs_scaler.fit_transform(X_train)  

# Transform X_test
X_test = max_abs_scaler.transform(X_test)

In [None]:
# Names of base models
names = ['Logistic Regression',
         'Linear SVC', 'Gaussian Naive Bayes',          
         'Decision Tree', 'Random Forest']

In [None]:
# Classifiers models and some hyperparameters
classifiers = [
    LogisticRegression(max_iter = 500, multi_class = 'ovr', solver = 'lbfgs'), 
    LinearSVC(),
    GaussianNB(),
    DecisionTreeClassifier(max_depth = 5),
    RandomForestClassifier(max_depth = 5, n_estimators = 10,
                           max_features = 1)   
]

In [None]:
# Create empty lists and dictionaries
results = []

models = {}
confusion = {}
class_report = {}

# Iterate through names and classifiers list and fit models
for name, clf in zip(names, classifiers):    
    print ('Fitting {:s} model...'.format(name))
    run_time = %timeit -q -o clf.fit(X_train, y_train)
    
    # Predictions
    print ('... predicting')
    y_pred = clf.predict(X_train)   
    y_pred_test = clf.predict(X_test)
    
    # Model scores (accuracy, precision and recall are on the training set)
    print ('... scoring')
    accuracy  = metrics.accuracy_score(y_train, y_pred)
    precision = metrics.precision_score(y_train, y_pred, average = 'weighted')
    recall    = metrics.recall_score(y_train, y_pred, average = 'weighted')
    
    f1        = metrics.f1_score(y_train, y_pred, average = 'weighted')    
    f1_test   = metrics.f1_score(y_test, y_pred_test, average = 'weighted')    
    
    # Save the results to dictionaries
    models[name] = clf    
    confusion[name] = metrics.confusion_matrix(y_train, y_pred)
    class_report[name] = metrics.classification_report(y_train, y_pred)
    
    results.append([name, accuracy, precision, recall,
                    f1, f1_test, run_time.best])

# Create dataframe of results
results = pd.DataFrame(results, columns = ['Classifier', 'Accuracy',
                                           'Precision', 'Recall', 'F1 Train',
                                           'F1 Test', 'Train Time'])

# Set index as 'Classifier'
results.set_index('Classifier', inplace = True)
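
The confusion matrices and classification reports saved in the loop above are computed on the training data. A held-out view can be obtained by passing test labels and test predictions to the same `metrics` functions. The snippet below is a sketch with made-up labels and predictions; in the notebook these would correspond to `y_test` and `y_pred_test`.

```python
from sklearn import metrics

# Made-up test labels and predictions for the four sentiment classes
y_test_toy = [1, 1, 0, -1, 2, 2]
y_pred_toy = [1, 0, 0, -1, 2, 1]

labels = [-1, 0, 1, 2]

# Rows are true classes, columns are predicted classes, in 'labels' order
confusion_test = metrics.confusion_matrix(y_test_toy, y_pred_toy,
                                          labels=labels)

# One F1 score per class instead of a single weighted average
f1_per_class = metrics.f1_score(y_test_toy, y_pred_toy,
                                labels=labels, average=None)

print(confusion_test)
print(dict(zip(labels, f1_per_class.round(2))))
```

Per-class scores make it easier to see whether the smaller classes are being predicted at all, which a weighted average can hide.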


In [None]:
# Return dataframe of result sorted by f1 values
results.sort_values('F1 Train', ascending = False)

We plot the test F1 score and the training time of each model as bar graphs.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
results.sort_values('F1 Train', ascending=False, inplace=True)
results.plot(y=['F1 Test'], kind='bar', ax=ax[0], xlim=[0,1.1], ylim=[0.30,0.92])
results.plot(y='Train Time', kind='bar', ax=ax[1])

#### Cross Validation

We apply cross-validation to the base models to check how well their scores hold up across different splits of the data, which helps us spot models that overfit.

In [None]:
cv = []
for name, model in models.items():
    print ()
    print(name)
    scores = cross_val_score(model, X=X[:n].toarray(), y=y[:n], cv=10)
    print("Accuracy: {:0.2f} (+/- {:0.4f})".format(scores.mean(), scores.std()))
    cv.append([name, scores.mean(), scores.std() ])
    
cv = pd.DataFrame(cv, columns=['Model', 'CV_Mean', 'CV_Std_Dev'])
cv.set_index('Model', inplace=True)
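
An alternative worth considering is to put the vectoriser and the classifier in one scikit-learn pipeline, so that the TF-IDF vocabulary is refit on the training part of every fold, and to score with weighted F1 so the numbers are comparable with the benchmark table. This is a sketch on an invented toy corpus, not the notebook's data, and the choice of `LinearSVC` is only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the 'message' column and 'sentiment' labels
texts = ['climate change is real', 'global warming is a hoax',
         'we must act on climate change', 'warming is a scam',
         'climate action now', 'the hoax continues',
         'protect the planet from warming', 'climate scam exposed']
labels = [1, -1, 1, -1, 1, -1, 1, -1]

# The vectoriser sits inside the pipeline, so it is refit on each fold
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())

# Stratified folds keep the class proportions similar in every split
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=27)
scores = cross_val_score(pipeline, texts, labels,
                         cv=folds, scoring='f1_weighted')

print('Weighted F1: {:0.2f} (+/- {:0.4f})'.format(scores.mean(),
                                                  scores.std()))
```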

# 10. Model evaluation<a name="evaluation"></a>

[Return to top](#top) <br><br>

From the cross validation results of the base models, we select the top performing models and run the models on our processed text and additional features. The best performing models include:
- Model 1
- Model 2
- Model 3

We run each of these models on a variety of combinations of text preprocessing and text cleaning methods and have concluded that we get the best results with the following cleaning methods:
- Lemmatize text
- Do not remove stop words
- Keep hashtags, mentions and retweets
- Normalise text using MaxAbsScaler()
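
To illustrate the last point, `MaxAbsScaler` divides each feature by its largest absolute value and keeps sparse input sparse, which suits TF-IDF output combined with count features. The matrix below is a made-up example, not data from this notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Toy feature matrix: two TF-IDF-like columns plus a raw character count
X_toy = sparse.csr_matrix(np.array([[0.0, 0.5, 140.0],
                                    [0.2, 0.0, 70.0],
                                    [0.4, 0.0, 35.0]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_toy)

# The output is still sparse and every column now has a maximum of 1
print(sparse.issparse(X_scaled))
print(X_scaled.toarray())
```

After scaling, the character count no longer dwarfs the TF-IDF weights, and zero entries stay zero.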

# 11. Model optimisation<a name="optim"></a>

[Return to top](#top) <br><br>

From the cross validation results of the base models, we select the top performing models and run the models on our processed text and additional features. The best performing models include:
- Model 1
- Model 2
- Model 3

We run a grid search to determine the best parameters for the models.

In [None]:
# Grid search
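
As a placeholder for this step, a grid search over a pipeline might look like the sketch below. The toy corpus, the choice of `LogisticRegression` and the parameter values are all assumptions made for illustration. They are not the models or settings selected in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the 'message' column and 'sentiment' labels
texts = ['climate change is real', 'global warming is a hoax',
         'we must act on climate change', 'warming is a scam',
         'climate action now', 'the hoax continues',
         'protect the planet from warming', 'climate scam exposed']
labels = [1, -1, 1, -1, 1, -1, 1, -1]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=500)),
])

# Illustrative grid only; these values are not tuned for the tweet data
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1, 10],
}

# Every parameter combination is scored with cross-validated weighted F1
search = GridSearchCV(pipeline, param_grid, scoring='f1_weighted', cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Parameter names follow the `step__parameter` convention, so vectoriser and classifier settings can be tuned together in one search.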

# 12. Conclusion<a name="optim"></a>

[Return to top](#top) <br><br>


Here we discuss:
- Model description
- Model performance
- What else we can try
- Business case value

# Preprocessing

In [None]:
# Remove links

In [None]:
# Tokenize (experiment with different tokenizers)

In [None]:
# Convert to lowercase

In [None]:
# POS tagging (experiment)

In [None]:
# Perform NER and add your own values (experiment)

In [None]:
# Stemming (experiment)

In [None]:
# Lemmatization (experiment)

In [None]:
# Remove punctuation

In [None]:
# Remove numbers

In [None]:
# Create custom stopwords list (add 'RT')

In [None]:
# Remove stopwords

# EDA

In [None]:
# Most common words for all tweets

In [None]:
# Most common words for tweets per category

In [None]:
# Most common bigrams per category

In [None]:
# Average word count for each tweet per category

In [None]:
# Average character count for each tweet per category

In [None]:
# Investigate hashtags

In [None]:
# Investigate mentions

In [None]:
# Investigate Retweets

In [None]:
# Investigate Emoticons

In [None]:
# Investigate Clustering

In [None]:
# Revise stopwords list

# Vectorize

In [None]:
# Import vectorizers

In [None]:
# Tfidf vectorizer for base model

In [None]:
# Perform gridsearch on vectorizer parameters

In [None]:
# Create function to check different vectorizers performance

# Topic Modelling

In [None]:
# LDA clustering - have to use countvectorizer

In [None]:
# NNMF - use Tfidf Vectorizer - do this in conjunction with 1vR model

# Balance Data

In [None]:
# Use upsampling for base model

In [None]:
# Test different balancing techniques

In [None]:
# Understand which data needs to be resampled and what it does to the model

# Models

In [None]:
# Import easy models

In [None]:
# Fit Models

In [None]:
# Check best performing for baseline model (use cross validation)

In [None]:
# Perform grid search for best parameters for best model

In [None]:
# Investigate more complex modelling techniques and implement separately

# Model Evaluation

In [None]:
# Look at model performance (confusion matrix,accuracy,f1score)

In [None]:
# Figure out what is causing the false positives and false negatives and update model to fix these

# 11. Model Optimisation<a name="hyperparameter"></a>

[Return to top](#top) <br><br>

From the cross validation results of the base models, we select the top performing models and run the models on our processed text and additional features. The best performing models include:
1) Model 1
2) Model 2
3) Model 3