<a href="https://colab.research.google.com/github/kwanda2426/classification-predict-streamlit-template/blob/master/Untitled1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2021/22 Climate Change Belief Analysis Predict

## Team 10
- Casper Kruger
- Gudani Mbedzi
- Kwanda Mazibuko
- Lucy Zandile Lushaba

## Introduction

The Glasgow Climate Pact, adopted by almost 200 countries, has raised the alarm that human activities have caused around 1.1°C of global warming to date. The impact is already being felt around the globe. These developments have been shared with the world through various media outlets. In recent years, social media, and Twitter in particular, has become a preferred source of information. This has created a massive source of unstructured data, which covers a variety of topics and can be analysed to find the sentiment behind it.

### Problem Statement

Developments around global warming and the increased use of social media have pushed companies to build products and services that lessen their impact on the environment and that respond to the sentiments their target market expresses on social media platforms.
It is therefore important to explore methods such as sentiment analysis, which can give companies insights for future marketing strategies and, in turn, help increase profit margins.

### Objectives of the Research
The key objectives of this project are as follows:

- Data analysis to get an overview of the data
- Data cleaning to remove errors and unnecessary data
- Exploratory data analysis to gain an in-depth view of the data
- Build models that can determine the sentiment of a tweet
- Predict the sentiment of a tweet
- Evaluate the accuracy of the best-performing machine learning model

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Data Preprocessing</a>

<a href=#four>4. Exploratory Data Analysis (EDA)</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## Importing packages

In [2]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np
import pandas as pd
import string
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns
from statsmodels.graphics.correlation import plot_corr
from scipy.stats import skew
from scipy.stats import kurtosis
import statistics

# datetime
import datetime

# Libraries for data preparation and model building
from sklearn.pipeline import Pipeline
import statsmodels.formula.api as sm
from statsmodels.formula.api import ols
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import RFE
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import boxcox, zscore
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, PolynomialFeatures

# saving my model
import pickle

#ignoring warnings
import warnings
warnings.filterwarnings('ignore')




In [3]:
#making sure that we can see all rows and cols
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

<a id="two"></a>
## Loading the Data

In [4]:
# importing datasets
df_train = pd.read_csv('/content/sample_data/train.csv')
df_test = pd.read_csv('/content/sample_data/test_with_no_labels.csv')


### Overview of the datasets

Each tweet is labelled as one of the following classes:

- [2]  News: the tweet links to factual news about climate change
- [1]  Pro: the tweet supports the belief of man-made climate change
- [0]  Neutral: the tweet neither supports nor refutes the belief of man-made climate change
- [-1] Anti: the tweet does not believe in man-made climate change
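For readability in later plots or reports, the numeric class codes listed above can be mapped to their names. The following is a minimal sketch; the names `SENTIMENT_LABELS` and `label_name` are illustrative and not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): map class codes to names.
SENTIMENT_LABELS = {
    2: "News",
    1: "Pro",
    0: "Neutral",
    -1: "Anti",
}


def label_name(code):
    """Return the human-readable class name for a numeric sentiment code."""
    return SENTIMENT_LABELS[code]


for code in (2, 1, 0, -1):
    print(code, label_name(code))
```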

In [None]:
#Checking df_train dataset head
display(df_train.head())

#Checking df_train dataset information
df_train.info()

- The training dataset has 15819 entries and 3 variables, none of which have null values. There are two integer variables and one object variable.

In [None]:
#Checking df_test dataset head
display(df_test.head())

#Checking df_test dataset information
df_test.info()

- The test dataset has 10546 entries and 2 variables, none of which have null values. There is one integer variable and one object variable.

<a id="three"></a>
## Data Preprocessing

In [None]:
# Regular expression that matches http/https URLs
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
# Each matched URL is replaced with an empty string
subs_url = ''
df_train['message'] = df_train['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)

In [None]:
df_test['message'] = df_test['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
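To see what the URL pattern does to a single string, the sketch below applies it with `re.sub` from the standard library. The sample tweet and its URL are made up for illustration.

```python
import re

# Same pattern as in the notebook cell above.
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'

# Made-up example tweet; the URL is illustrative only.
sample = "climate report out now https://t.co/abc123 via @example"

# Replace every URL match with an empty string.
cleaned = re.sub(pattern_url, '', sample)
print(repr(cleaned))
```

Removing a URL from the middle of a tweet leaves two adjacent spaces behind, so a later whitespace-normalisation step may be worth considering.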

In [None]:
def lower_case(df, column_name):
    """Lowercase every string in df[column_name] in place and return df."""
    df[column_name] = df[column_name].str.lower()
    return df

In [None]:
lower_case(df_train, 'message').head()

Unnamed: 0,sentiment,message,tweetid
0,1,"polyscimajor epa chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? via @mashable",625221
1,1,it's not like we lack evidence of anthropogenic global warming,126103
2,2,rt @rawstory: researchers say we have three years to act on climate change before it’s too late …,698562
3,1,#todayinmaker# wired : 2016 was a pivotal year in the war on climate change,573736
4,1,"rt @soynoviodetodas: it's 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #electionnight",466954


In [None]:
lower_case(df_test,'message').head()

Unnamed: 0,message,tweetid
0,europe will now be looking to china to make sure that it is not alone in fighting climate change…,169760
1,combine this with the polling of staffers re climate change and womens' rights and you have a fascist state.,35326
2,"the scary, unimpeachable evidence that climate change is already here: #itstimetochange #climatechange @zeroco2_;..",224985
3,@karoli @morgfair @osborneink @dailykos \nputin got to you too jill ! \ntrump doesn't believe in climate change at all \nthinks it's s hoax,476263
4,rt @fakewillmoore: 'female orgasms cause global warming!'\n-sarcastic republican,872928


In [None]:
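import re

def lookup_dict(text, dictionary):
    """Replace each whitespace-delimited token found in `dictionary` with its expansion."""
    # Substitute token by token so that a key such as 'rt' is never
    # replaced inside a longer word; the original whitespace is preserved.
    return re.sub(r'\S+', lambda m: dictionary.get(m.group(0), m.group(0)), text)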

In [None]:
contractions = {"doesn't" : 'does not',"wouldn't": "would not","it's" : 'it is',
               "i'm": "i am", "we're" : 'we are',"i've":'i have',"let's" : 'let us',
               "couldn't" : 'could not',"don't" : 'do not', 'lol' : 'laugh out loud',
               'ftl': 'for the loss', 'fwiw': 'for what it is worth', 'imo' : 'in my opinion',
               'diaf': 'die in a fire','dm': 'direct message', 'afaik':'as far as i know',
               'imho': 'in my humble opinion', 'tbh': 'to be honest','icymi': 'in case you missed it',
               'idk': 'i do not know', 'mt': 'modified tweet', 'smh':'shaking my head',
               'smdh':'shaking my damn head','nts':'note to self','ifykyk':'if you know, you know',
               'ijs':'i am just saying', 'tbqh':'to be quite honest','fyi':'for your information',
               'idc':'i do not care','hth':'here to help','hifw':'how i feel when',
               "we've":'we have',"i'd":'i would', "i'll":"i will",'rt':''}

In [None]:
df_train['clean_message'] = df_train['message'].apply(lambda x: lookup_dict(x, contractions))

In [None]:
df_test['clean_message'] = df_test['message'].apply(lambda x: lookup_dict(x, contractions))
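One detail worth checking in dictionary-based expansion is the difference between substring replacement and token-wise replacement, since short keys such as `'rt'` also occur inside ordinary words. The sketch below compares the two on a made-up tweet; `mini_dict` is an illustrative subset of the contractions dictionary, not the full one.

```python
import re

# Small illustrative subset of the contractions dictionary above.
mini_dict = {"rt": "", "it's": "it is"}

# Made-up tweet in which 'rt' also occurs inside the word 'support'.
tweet = "rt @example: it's time to support climate action"

# Substring replacement: every 'rt' is removed, including the one in 'support'.
by_substring = tweet.replace("rt", "")

# Token-wise replacement: only whole whitespace-delimited tokens are looked up.
by_token = re.sub(r"\S+", lambda m: mini_dict.get(m.group(0), m.group(0)), tweet)

print(by_substring)
print(by_token)
```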

In [None]:
df_train.head()

Unnamed: 0,sentiment,message,tweetid,clean_message
0,1,"polyscimajor epa chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? via @mashable",625221,"polyscimajor epa chief does not think carbon dioxide is main cause of global warming and.. wait, what!? via @mashable"
1,1,it's not like we lack evidence of anthropogenic global warming,126103,it is not like we lack evidence of anthropogenic global warming
2,2,rt @rawstory: researchers say we have three years to act on climate change before it’s too late …,698562,@rawstory: researchers say we have three years to act on climate change before it’s too late …
3,1,#todayinmaker# wired : 2016 was a pivotal year in the war on climate change,573736,#todayinmaker# wired : 2016 was a pivotal year in the war on climate change
4,1,"rt @soynoviodetodas: it's 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #electionnight",466954,"@soynoviodetodas: it is 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #electionnight"


In [None]:
df_test.head()

Unnamed: 0,message,tweetid,clean_message
0,europe will now be looking to china to make sure that it is not alone in fighting climate change…,169760,europe will now be looking to china to make sure that it is not alone in fighting climate change…
1,combine this with the polling of staffers re climate change and womens' rights and you have a fascist state.,35326,combine this with the polling of staffers re climate change and womens' rights and you have a fascist state.
2,"the scary, unimpeachable evidence that climate change is already here: #itstimetochange #climatechange @zeroco2_;..",224985,"the scary, unimpeachable evidence that climate change is already here: #itstimetochange #climatechange @zeroco2_;.."
3,@karoli @morgfair @osborneink @dailykos \nputin got to you too jill ! \ntrump doesn't believe in climate change at all \nthinks it's s hoax,476263,@karoli @morgfair @osborneink @dailykos \nputin got to you too jill ! \ntrump does not believe in climate change at all \nthinks it is s hoax
4,rt @fakewillmoore: 'female orgasms cause global warming!'\n-sarcastic republican,872928,@fakewillmoore: 'female orgasms cause global warming!'\n-sarcastic republican


In [None]:
def remove_punctuation(message):
    return ''.join([l for l in message if l not in string.punctuation])

In [None]:
df_train['clean_message'] = df_train['clean_message'].apply(remove_punctuation)

In [None]:
df_test['clean_message'] = df_test['clean_message'].apply(remove_punctuation)
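`string.punctuation` contains ASCII punctuation only, so characters such as `…` and `’` pass through the filter. The sketch below shows this on a made-up string and adds a hypothetical variant, `remove_all_punctuation`, that also drops characters in the Unicode punctuation categories; it is an illustration, not the notebook's method.

```python
import string
import unicodedata


def remove_ascii_punctuation(message):
    """Same logic as remove_punctuation above: drops ASCII punctuation only."""
    return ''.join(ch for ch in message if ch not in string.punctuation)


def remove_all_punctuation(message):
    """Hypothetical variant: also drops characters in a Unicode punctuation category."""
    return ''.join(
        ch for ch in message
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith('P')
    )


# Made-up sample containing a curly apostrophe and a Unicode ellipsis.
sample = "it’s too late … #now!"
print(remove_ascii_punctuation(sample))
print(remove_all_punctuation(sample))
```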

In [None]:
df_test.head()

Unnamed: 0,message,tweetid,clean_message
0,europe will now be looking to china to make sure that it is not alone in fighting climate change…,169760,europe will now be looking to china to make sure that it is not alone in fighting climate change…
1,combine this with the polling of staffers re climate change and womens' rights and you have a fascist state.,35326,combine this with the polling of staffers re climate change and womens rights and you have a fascist state
2,"the scary, unimpeachable evidence that climate change is already here: #itstimetochange #climatechange @zeroco2_;..",224985,the scary unimpeachable evidence that climate change is already here itstimetochange climatechange zeroco2
3,@karoli @morgfair @osborneink @dailykos \nputin got to you too jill ! \ntrump doesn't believe in climate change at all \nthinks it's s hoax,476263,karoli morgfair osborneink dailykos \nputin got to you too jill \ntrump does not believe in climate change at all \nthinks it is s hoax
4,rt @fakewillmoore: 'female orgasms cause global warming!'\n-sarcastic republican,872928,fakewillmoore female orgasms cause global warming\nsarcastic republican


In [None]:
#sent_list = []

#message = []

#for count in range(0,len(df_train)):
    
#   if df_train['sentiment'][count] == 1:
        
#        sent_list.append(1)
#        message.append(df_train['clean_message'][count])
        
#    elif df_train['sentiment'].iloc[count] == -1:
                   
#        sent_list.append(0)
#        message.append(df_train['clean_message'][count])

In [None]:
#dict_ = {'sentiment': sent_list,
#        'message' : message}

In [None]:
#data = pd.DataFrame(dict_)

<a id="six"></a>
## Modelling

### Preparing for modelling

In [None]:
tweet_id = df_test.tweetid.values 

In [None]:
df = df_train.clean_message.values

y = df_train.sentiment.values

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df,
                                                    y, 
                                                   random_state = 1,
                                                   test_size = 0.2,
                                                   shuffle = True)

In [None]:
test_data = df_test['clean_message']

In [None]:
vectorizer = CountVectorizer(stop_words = 'english')

vectorizer.fit(list(x_train) + list(x_test) + list(test_data))

CountVectorizer(stop_words='english')

In [None]:
X_train = vectorizer.transform(x_train)
X_test = vectorizer.transform(x_test)
X_test_test = vectorizer.transform(test_data)
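The vectorizer above learns its vocabulary from the training split, the validation split, and the unlabelled test tweets together. An alternative, sketched here on made-up toy data, is to wrap the vectorizer and classifier in a `Pipeline` so that the vocabulary is learned from the training texts only. The toy texts, the labels, and the choice of `LogisticRegression(max_iter=1000)` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training examples (1 = Pro, -1 = Anti).
train_texts = [
    "climate change is real and urgent",
    "we must act on global warming now",
    "global warming is a hoax",
    "climate change is a scam",
]
train_labels = [1, 1, -1, -1]

# fit() learns the vocabulary and the classifier from the training texts only.
pipe = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_texts, train_labels)

# Words that never appeared in training are simply ignored at prediction time.
unseen = ["scientists warn about warming"]
predictions = pipe.predict(unseen)
vocabulary = pipe.named_steps["vectorizer"].vocabulary_
print(predictions, "scientists" in vocabulary)
```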

In [None]:
svm = SVC(kernel = 'linear', probability = True)

prob = svm.fit(X_train,y_train).predict_proba(X_test)

y_pred = svm.predict(X_test)

In [None]:
clf = LogisticRegression()
clf.fit(X_train,y_train)

LogisticRegression()

In [None]:
rfc = RandomForestClassifier(max_depth = 2, random_state = 0)
rfc.fit(X_train, y_train)

RandomForestClassifier(max_depth=2, random_state=0)

In [None]:
rfc_f1 = f1_score(y_test, rfc.predict(X_test), average='macro')

In [None]:
rfc_f1

0.17865122892545196

In [None]:
clf_f1 = f1_score(y_test, clf.predict(X_test), average='macro')

In [None]:
clf_f1

0.6380173639457858

In [None]:
svc_f1 = f1_score(y_test, y_pred, average='macro')

In [None]:
svc_f1

0.6240751446725404
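Because `average='macro'` takes the unweighted mean of the per-class F1 scores, a model that predicts mostly one class scores poorly even when that class is common. The pure-Python sketch below works through the arithmetic on made-up labels; whether this explains the low score of the shallow random forest above has not been checked here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: the "model" always predicts class 1.
y_true = [1, 1, 1, 0, 2, -1]
y_pred = [1, 1, 1, 1, 1, 1]

score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Here class 1 gets an F1 of 2/3 and the other three classes get 0, so the macro average is 1/6 even though half of the predictions are correct.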

In [None]:
value_lr = clf.predict(X_test_test) # 72%
value_svc = svm.predict(X_test_test) #70%
value_rfc = rfc.predict(X_test_test)
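When several models have been scored on the validation split, the predictions to submit can be chosen by comparing those scores. The sketch below uses the macro-F1 values reported in the cells above; the dictionary keys are illustrative names, and attributing the 0.624 value to the SVM is an assumption based on the order of the cells.

```python
# Validation macro-F1 scores as reported in the cells above.
# Keys are hypothetical names; 0.624... is assumed to be the SVM score.
validation_f1 = {
    "logistic_regression": 0.6380173639457858,
    "linear_svc": 0.6240751446725404,
    "random_forest": 0.17865122892545196,
}

# Pick the model name with the highest validation score.
best_model = max(validation_f1, key=validation_f1.get)
print(best_model)
```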

In [None]:
# Use the logistic regression predictions: it has the highest validation macro F1 above
my_dict = {'tweetid' : tweet_id,'sentiment' : value_lr}

In [None]:
results = pd.DataFrame(my_dict)

In [None]:
results.to_csv('Team_100.csv', index = False)