#  Climate Change Belief Analysis 2022 

## By Datafluent Inc. (Team JM_3)


<div align="center" style="width: 500px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://www.itu.int/en/mediacentre/backgrounders/PublishingImages/climate-change-backgrounder.jpg"
     alt="Dummy image 1"
     style="float: center; padding-bottom=0.5em"
     width=500px/>
    </div>

## Board Members
1. **Prince Okon** *(CEO)*
2. **Marvic Cocouvi** *(Director Marketing and Promotions)*
3. **Abiemwense Omokaro** *(Director IT/Technical Support)*
4. **Buhari Shehu** *(Wrangler General)*
5. **Nqosa Lehloenya** *(Director Business Management)*
6. **Kefa Kiprono**

## Outline
- Introduction
- Exploratory Data Analysis
- Model Building
- Conclusion and Recommendations

# 1.0 Introduction

## 1.1 Project Overview
Industrialization is the enabler of modern economic growth and development. However, it comes at the cost of greenhouse gas emissions that drive global warming and, ultimately, climate change. Governments, eco-conscious organizations, and civil societies around the world are constantly exploring ways to reduce their carbon footprints. Despite many indicators pointing towards climate change, many people believe that it is a hoax. In this project, we use tweets from individuals to build a Machine Learning (ML) model that identifies their beliefs about climate change. The model's output will help companies predict how their eco-friendly products will be received by prospective customers and thus make strategic business decisions.

## 1.2 Installing Dependencies and Importing Packages
To build the models, a few dependencies must first be installed with pip: *ipython-autotime, comet-ml, imblearn, nltk, and wordcloud.*

In [3]:
%%capture
# installations (the %%capture cell magic must be the first line of the cell)
!pip install ipython-autotime
!pip install comet-ml
!pip install imblearn --user
!pip install --user -U nltk
!pip install wordcloud
%load_ext autotime

time: 33.8 s (started: 2022-06-21 11:39:12 +01:00)


In [None]:
conda install -c anaconda -c conda-forge -c comet_ml comet_ml

In [10]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np # for linear algebra
import pandas as pd # for importing, creating and manipulating dataframes

# Visualization Packages
import matplotlib.pyplot as plt
import seaborn as sns
# Warnings
import warnings 
warnings.simplefilter(action="ignore", category=FutureWarning)

# Import comet_ml 
from comet_ml import Experiment

# Packages for text manipulation and Natural language processing
import re
from string import punctuation
import nltk
nltk.download(['stopwords','punkt'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


# Train-test split package
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.utils import resample


# Libraries for data preparation and model building and evaluations
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

from sklearn import metrics

time: 0 ns (started: 2022-06-21 11:56:10 +01:00)


[nltk_data] Downloading package stopwords to C:\Users\Buhari
[nltk_data]     Shehu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\Buhari
[nltk_data]     Shehu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Create an experiment with your api key
experiment = Experiment(
    api_key="KQ1UTh7hBvPLWlz3034oIgusG",
    project_name="global-warming-climate-change-sentiment-analysis-zm3",
    workspace="okonp07@-gmail-com",)

## Problem statement

Before a company begins to develop products for a segment of the population that believes in climate change and its effects on the environment, it must first establish that such a demographic exists. Even when it is clear that a section of the population does believe there is a threat of environmental change, we must assess factors such as how much of a threat they believe climate change is, how far they are willing to go to protect the environment, and how passionate they are about supporting the efforts of others to fight climate change.

In this experiment, we explore machine learning as a way to identify, from a person's tweets, whether or not they believe in climate change and could therefore be converted into a new customer. We will develop ML models capable of classifying tweets as leaning positively towards the belief that climate change is a real problem (and hence in need of a solution) or leaning negatively towards that belief (treating climate change as no threat). To produce these models, we must first select an appropriate set of annotated documents for training. We then pre-process and clean these documents so that we can generate a feature set and a target set from them. Next, we select relevant ML models, train them on these datasets, and evaluate their performance. Model evaluation is done both by comparing model predictions against a human panel at block level and by measuring model performance, using cross-validation, on annotated data that was not used for training. Once a satisfactory performance has been achieved, we interpret the patterns learned and apply them to further decision-making in the context of the experiment.



### Import libraries
The libraries that are necessary for completing the project are imported in Section 1.2 above.

### Loading the Data
To load the data, first ensure that the raw data and the notebook file are in the same folder on your local machine. The code below loads both the train and test data sets into the notebook. If the files are not in the same folder, point to the directory on your machine or the cloud location where they are stored. After loading, it is good practice to display the data to verify that it loaded as expected.

In [None]:
#Load the train and test data sets from their respective CSV files
train = pd.read_csv('train.csv')
test = pd.read_csv('test_with_no_labels.csv')
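
As a quick check that a file loaded as expected, something like the following could be run. This is only a minimal sketch: `summarise_frame` is a hypothetical helper, and the small in-memory CSV merely stands in for `train.csv` (the `sentiment`, `message` and `tweetid` column names are taken from later cells in this notebook; the rows are made up).

```python
import io

import pandas as pd


def summarise_frame(df):
    """Return basic facts worth checking right after loading a CSV."""
    return {
        "rows": df.shape[0],
        "columns": list(df.columns),
        "missing_values": int(df.isnull().sum().sum()),
    }


# Stand-in for train.csv so the sketch runs on its own (illustrative rows only)
sample_csv = io.StringIO(
    "sentiment,message,tweetid\n"
    "1,Climate change is real,101\n"
    "-1,It is all a hoax,102\n"
)
sample = pd.read_csv(sample_csv)

print(sample.head())            # eyeball the first rows
print(summarise_frame(sample))  # row count, column names, missing values
```

In the notebook itself, the same helper could be called on `train` and `test` right after the `pd.read_csv` calls.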

<a id="EDA"></a>
### Exploratory Data Analysis


# ALL EDA / DATA PRE-PROCESSING HERE

Do not combine the train and test data during processing. Carry out analysis that surfaces insights which may help explain the sentiments shown in user tweets. The data pre-processing should be split into univariate and multivariate analysis. Your EDA must tell the story of the data. Some useful questions:
* What is the sample size?
* What keywords are useful to establish sentiments?
* What are the sources of the data (news with verifiable sources, informal tweets, etc.)?
* Establish the sentiments and their percentages in the data (Pro, Anti, Neutral, etc.)
* Check for the words most commonly featured in the dataset
* Frequent hashtags (Pro and Anti)
* Consider other conditions that will generate insightful visuals for your data and use them.

Remember that EDA is all about visual presentation. Use visuals to tell the story of your data.
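
Two of the questions above (class percentages and frequent hashtags) could be answered along the following lines. This is a sketch under stated assumptions: the rows are made up, `hashtag_counts` is a hypothetical helper, and the label coding (1 for pro, -1 for anti, 0 for neutral) is assumed rather than confirmed by the data description.

```python
import re
from collections import Counter

import pandas as pd

# Illustrative rows only; the notebook would use the loaded `train` frame
tweets = pd.DataFrame({
    "sentiment": [1, 1, -1, 0],
    "message": [
        "We must act now #ClimateChange #ActOnClimate",
        "Rising seas are a threat #climatechange",
        "Global warming is a scam #hoax",
        "Reading about the weather today",
    ],
})

# Share of each sentiment class, as a percentage of all tweets
class_share = tweets["sentiment"].value_counts(normalize=True) * 100


def hashtag_counts(messages):
    """Count lower-cased hashtags across an iterable of tweet texts."""
    counts = Counter()
    for text in messages:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", text))
    return counts


pro_tags = hashtag_counts(tweets.loc[tweets["sentiment"] == 1, "message"])
anti_tags = hashtag_counts(tweets.loc[tweets["sentiment"] == -1, "message"])

print(class_share.round(1))
print(pro_tags.most_common(3))
print(anti_tags.most_common(3))
```

The resulting counts can be passed straight to a bar plot (for example with seaborn) to turn them into the visuals the EDA calls for.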

### Data Wrangling Function
The following function cleans and preprocesses any tweet passed to it.

In [None]:
def tweet_preprocessing(tweet):
    
    '''
    This function cleans tweets of line breaks, mentions, URLs, numbers, etc.
    '''
    
    tweet = tweet.lower() # to lower case
    tweet = tweet.replace('\n', ' ') # remove line breaks
    tweet = re.sub(r'@\w*', '', tweet) # remove mentions (str.replace does not interpret regex patterns)
    tweet = re.sub(r"\bhttps://t\.co/\w+", '', tweet) # remove t.co short URLs
    tweet = re.sub(r'\w*\d\w*', '', tweet) # remove tokens that contain digits
    tweet = re.sub(r'#', '', tweet) # remove the '#' symbol. To remove the full hashtag: r'#\w*'
    tweet = re.sub(r' +', ' ', tweet) # collapse repeated spaces

    return tweet

### Train-test Split
After creating the preprocessing function, we separate the data into features and labels (X and y) so that we can run the models on our data sets.

In [None]:
# Splitting the labels and features
train['processed'] = train['message'].apply(tweet_preprocessing)
X = train['processed'].values
y = train['sentiment'].values

In [None]:
# preprocess testing data by applying our function
test['processed'] = test['message'].apply(tweet_preprocessing)

<a id="feature"></a>
# Feature Selection

### Naive Bayes Classifier 

In [None]:
# Splitting the labels and features into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10,random_state=42,stratify=y)

In [None]:
mnb = Pipeline([('Count',CountVectorizer()),('classify',MultinomialNB())])
#fitting the model
mnb.fit(X_train, y_train)

#apply model on test data
y_pred_mnb = mnb.predict(X_test)

In [None]:
# Classification report
print(classification_report(y_test, y_pred_mnb))

As we can see, the '-1' and '0' classes are poorly predicted when using unbalanced data. Once we implement resampling, the F1-scores for these classes increase, but only slightly, while the overall accuracy is slightly reduced.
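
The resampling step itself is not shown in this notebook. One possible way to upsample the minority classes, using the `resample` utility imported earlier, is sketched below. The helper name `upsample_minorities` and the toy rows are assumptions for illustration, not the team's actual code.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minorities(frame, label_col, random_state=42):
    """Upsample every class (with replacement) to the size of the largest class."""
    target_size = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        if len(group) < target_size:
            group = resample(group, replace=True, n_samples=target_size,
                             random_state=random_state)
        parts.append(group)
    # Shuffle so that the classes are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Illustrative training split only; held-out data should not be resampled
train_part = pd.DataFrame({
    "processed": ["a", "b", "c", "d", "e", "f"],
    "sentiment": [1, 1, 1, 1, -1, 0],
})
balanced = upsample_minorities(train_part, "sentiment")
print(balanced["sentiment"].value_counts())
```

Because upsampling duplicates existing rows, it is usually applied to the training split only, so that the evaluation data keeps its original class distribution.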

<a id="modelling"></a>
# MODELLING

### SVC and LinearSVC

SVC fits a maximum-margin decision boundary to categorize our data; with a kernel (RBF by default in scikit-learn) this boundary can be nonlinear. LinearSVC fits a linear decision boundary.

In [None]:
#SVC
svc = Pipeline([('Count',CountVectorizer()),('classify',SVC(max_iter=300,C=1))])

In [None]:
#linearSVC
linsvc = Pipeline([('Count',CountVectorizer()),('classify',LinearSVC(max_iter=300,C=1))])

### Logistic Regression

Logistic regression models the probability of each class as a logistic function of a linear combination of the features, and assigns each tweet to the class with the highest predicted probability.

In [None]:
#Logistic Regression
lr = Pipeline([('Count',CountVectorizer()),('classify',LogisticRegression(max_iter=300))])
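
To make the class-probability view of logistic regression concrete, the sketch below fits a model on made-up one-dimensional data and shows that the predicted class is simply the one with the highest probability. The data and the `*_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: small values belong to class 0, large values to class 1
X_demo = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

lr_demo = LogisticRegression(max_iter=300).fit(X_demo, y_demo)

# One row per sample, one column per class; each row sums to 1
proba = lr_demo.predict_proba(np.array([[1.0], [9.0]]))

# The predicted class is the column with the highest probability
pred = lr_demo.classes_[proba.argmax(axis=1)]

print(proba.round(3))
print(pred)
```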

### KNN
The KNN classifier assumes that similar data points tend to cluster close together, so it assigns a point to the class most common among its nearest neighbours. K is the number of neighbours: K=3 means that each prediction is based on the 3 training points closest to the data point being assessed.

In [None]:
# Invoke the KNN classifier
knn = Pipeline([('Count',CountVectorizer()),('classify',KNeighborsClassifier(n_neighbors=3))])
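
The neighbour vote for K=3 can be inspected directly. The following sketch uses made-up two-dimensional points rather than tweet vectors, purely to show which neighbours take part in each prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two small, well-separated groups of points (made-up data)
points = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
labels = np.array([0, 0, 0, 1, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(points, labels)

query = np.array([[0.5, 0.5], [5.5, 5.5]])
# Distances to, and indices of, the 3 nearest training points per query
distances, neighbour_idx = knn_demo.kneighbors(query)

print(labels[neighbour_idx])    # labels of the neighbours that vote
print(knn_demo.predict(query))  # majority vote among those neighbours
```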

### Decision Tree

A decision tree uses a tree-like model of decisions and their possible consequences. Starting from the root node, each internal node tests a feature and each branch represents a possible outcome of that test. The tree is followed downwards until a leaf is reached, where only one possible outcome (the predicted class) is left.
<div align="center" style="width: 500px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://www.aihr.com/wp-content/uploads/decision-trees-in-analytics.png"
     alt="Dummy image 1"
     style="float: center; padding-bottom=0.5em"
     width=500px/>
    </div>

In [None]:
#Decision Tree
dt = Pipeline([('Count',CountVectorizer()),('classify',DecisionTreeClassifier())])

### Random Forest
A random forest uses decision trees as base estimators. Each tree is trained on a different bootstrap sample of the same size as the training set. At each node, a random subset of the features is drawn without replacement and considered for the split, which increases randomization. Nodes are split to maximise information gain.

<div align="center" style="width: 500px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2020/05/rfc_vs_dt11.png"
     alt="Dummy image 1"
     style="float: center; padding-bottom=0.5em"
     width=500px/>
    </div>

In [None]:
# Instantiate the Random Forest classifier
rf = Pipeline([('Count',CountVectorizer()),('classify',RandomForestClassifier())])

### Evaluate Model Performance

In [None]:
num=3
# SVC
scores = cross_val_score(
        svc, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' SVC models is ' + str(sum(scores)/len(scores)))

In [None]:
#linearSVC
scores = cross_val_score(
        linsvc, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+ ' LinearSVC models is ' + str(sum(scores)/len(scores)))

In [None]:
#Logistic Regression
scores = cross_val_score(
        lr, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' Logistic Regression models is ' + str(sum(scores)/len(scores)))

In [None]:
#KNN
scores = cross_val_score(
        knn, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' KNN models is ' + str(sum(scores)/len(scores)))

In [None]:
#Decision Tree
scores = cross_val_score(
        dt, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' Decision Tree models is ' + str(sum(scores)/len(scores)))

In [None]:
#Random Forest
scores = cross_val_score(
        rf, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' Random Forest models is ' + str(sum(scores)/len(scores)))
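
The per-model cells above repeat the same pattern, so the comparison could also be written as a single loop. The sketch below is one way to do that. The tiny corpus is made up, only two candidate models are shown, and `sklearn.pipeline.Pipeline` is used here in place of the `imblearn` pipeline as an assumption to keep the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up corpus so that the sketch runs on its own
texts = np.array(["climate change is real", "global warming is a hoax",
                  "we must cut emissions", "warming is a scam",
                  "protect the planet now", "climate hoax exposed"] * 3)
labels = np.array([1, -1, 1, -1, 1, -1] * 3)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Multinomial NB": MultinomialNB(),
}

results = {}
for name, estimator in candidates.items():
    pipe = Pipeline([("Count", CountVectorizer()), ("classify", estimator)])
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1_weighted")
    results[name] = scores.mean()

# Print the models from best to worst mean weighted F1
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean weighted F1 = {score:.3f}")
```

Keeping the scores in one dictionary also avoids copy-paste slips in the printed model names.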

The Logistic Regression and LinearSVC models perform best. Every model performs best when no resampling is done. This could be because upsampling the minority classes to the size of the majority class leads to overfitting on the duplicated samples.

#### Tuning parameters

We now check whether we can improve our two best models: LinearSVC and Logistic Regression.

In [None]:
from sklearn.model_selection import GridSearchCV
# Candidate values for the regularisation parameter C
Cs = [0.001, 0.01, 0.1, 1, 10]
param_grid = {
    'C'     : Cs
    }
# Grid search for Logistic Regression
grid_LR = GridSearchCV(LogisticRegression(), param_grid, scoring='f1_weighted', cv=3)
grid_LR.fit(CountVectorizer().fit_transform(X), y)
grid_LR.best_params_

In [None]:
param_grid = {'C'     : Cs }
grid_SVM = GridSearchCV(LinearSVC(), param_grid, scoring='f1_weighted', cv=3)
grid_SVM.fit(CountVectorizer().fit_transform(X), y)
grid_SVM.best_params_
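
An alternative, sketched below, is to run the grid search over the whole pipeline so that the vectorizer is refitted inside each cross-validation fold instead of once on all of the data. The corpus here is made up, and `sklearn.pipeline.Pipeline` is an assumption chosen to keep the example self-contained. The step names match those used in the pipelines above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up corpus so that the sketch runs on its own
texts = np.array(["climate change is real", "global warming is a hoax",
                  "we must cut emissions", "warming is a scam",
                  "protect the planet now", "climate hoax exposed"] * 3)
labels = np.array([1, -1, 1, -1, 1, -1] * 3)

pipe = Pipeline([("Count", CountVectorizer()), ("classify", LinearSVC())])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"classify__C": [0.001, 0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same grid could be extended with vectorizer settings, for example `"Count__ngram_range"`, to tune the feature extraction together with the classifier.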

<a id="conclusion"></a>
# Conclusion

#### Model performance
We attempted several strategies to improve model performance: data processing techniques to clean the tweets, data balancing strategies, cross-validation, and a grid search for the best values of the model hyperparameters.



### What else we can try
Language models and neural networks are two other strategies that we may implement to see how they compare with ensemble tree-based models. This would also give us an opportunity to see how neural network models perform on natural language tasks.

### Business case value

The analysis shows that, in general, the negative tweets come from a class of people who believe that climate change and global warming are a hoax, or at best overrated. Most of these people defend their position with strong resolve, so it would be counterproductive to market environmentally friendly products to them as an effort towards sustainability or as a way to combat climate change. This class is more likely to become customers if other aspects of the product are promoted to them. They are more likely to purchase a product because of its quality, fair price, use case, etc. than because it is sustainably produced or good for the environment.

Conversely, people at the other end of the spectrum, who display positive sentiments towards climate change, believe that climate change is an issue. They show some willingness to "do something" and play their part in combating it. What is not yet clear is whether their sentiments translate into any meaningful influence on their purchasing habits. They are nevertheless a better group to target with promotions that highlight the sustainability and environmental friendliness of the products. To be safe, though, this message should be embedded among the other qualities of the product, so that environmental friendliness is the icing on the cake: a product that is already good and is also sustainably produced.


Some organisations are mentioned in the tweets. Many of them share the same values and ideals when it comes to protecting the environment, and have a substantial membership and social media following of individuals who share those values. It would be beneficial for companies to band together with these organisations: forming partnerships with them could lead to brand exposure among individuals who make conscious decisions in their daily lives about the products and services they purchase.

We recommend the latter strategy of pursuing partnerships with like-minded organisations. We expect it to yield the best results in terms of finding a group of potential customers who share the same values and ideals and would be likely to purchase your products and services.

<a id="save"></a>
# SUBMISSION

For our final model, we build a stacking classifier that combines Random Forest, LinearSVC, Multinomial Naive Bayes and Logistic Regression.

In [None]:
estimators = [
    ('rf', Pipeline([('Count', CountVectorizer(ngram_range=(1,2))), ('classify', RandomForestClassifier())])),
    ('lnsvc', Pipeline([('Count', CountVectorizer(ngram_range=(1,2))), ('classify', LinearSVC(C=0.1))])),
    ('MNB', Pipeline([('Count', CountVectorizer()), ('classify', MultinomialNB())])),
    ('lr', Pipeline([('Count', CountVectorizer(ngram_range=(1,2))), ('classify', LogisticRegression(C=1))])),
]

In [None]:
clf = StackingClassifier(
        estimators=estimators
    )

#fitting the model
clf.fit(X, y)
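
For readers unfamiliar with stacking: the base pipelines are fitted first, and a final estimator is then trained on their cross-validated predictions. The sketch below spells out the final estimator and the number of folds explicitly on a made-up corpus. Only two base models are used, and `sklearn.pipeline.Pipeline` and `cv=3` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up corpus so that the sketch runs on its own
texts = np.array(["climate change is real", "global warming is a hoax",
                  "we must cut emissions", "warming is a scam",
                  "protect the planet now", "climate hoax exposed"] * 3)
labels = np.array([1, -1, 1, -1, 1, -1] * 3)

base_models = [
    ("mnb", Pipeline([("Count", CountVectorizer()), ("classify", MultinomialNB())])),
    ("lnsvc", Pipeline([("Count", CountVectorizer()), ("classify", LinearSVC(C=0.1))])),
]

# The final estimator learns from the base models' out-of-fold predictions
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(texts, labels)

print(stack.predict(np.array(["climate change is a hoax"])))
```

When `final_estimator` is omitted, as in the cell above, scikit-learn falls back to a `LogisticRegression` final estimator.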

In [None]:
# End experiment
experiment.end()

In [None]:
# Display results on comet page
experiment.display()

In [None]:
# Creating the unseen set, so that we can post to Kaggle and receive a score based on the performance
x_unseen = test['processed']

submission = pd.DataFrame(
    {'tweetid': test['tweetid'],
     'sentiment': clf.predict(x_unseen)
    })

# save DataFrame to csv file for submission
submission.to_csv("Submission_final.csv", index=False)