# Classification Predict - Climate Change Belief Analysis Challenge
© Explore Data Science Academy

---
### Honour Code

I {**#Team_ND2**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

<a id="cont"></a>

## Table of Contents

#### Section 1: Data Pre-processing

<a href=#one>1.1 Importing Packages</a>

<a href=#two>1.2 Loading Data</a>

<a href=#three>1.3 Exploratory Data Analysis (EDA)</a>

<a href=#four>1.4 Data Engineering</a>

#### Section 2: Model Development and Analysis

<a href=#five>2.1 Modeling</a>

<a href=#six>2.2 Model Performance</a>

#### Section 3: Model Explanation and Conclusions

<a href=#seven>3.1 Model Explanations</a>

<a href=#seven>3.2 Conclusions</a>

# Introduction
Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies. This Notebook has been so adapted and developed by **TeamND2** - a group of six students from the July 2022 cohort of the Explore Ai Academy **Data Science** course. We are:

 > David Mugambi <br>
 > Gavriel Leibovitz <br>
 > Josiah Aramide <br>
 > Aniedi Oboho-Etuk <br>
 > Joy Obukohwo <br>
 > Marvellous Eromosele <br>
 

### Problem Statement

The scenario involves

### Objectives

TeamND2 seeks to achieve the following objectives for the project brief:

- 1. analyse the supplied data;
- 2. identify xxx;
- 3. de

# Section 1: Data Pre-processing

This section describes steps for importing packages, loading the two datasets - train and test datasets, conducting the exploratory data analysis (EDA) and implementing data engineering.

 <a id="one"></a>
## 1.1 Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| Below are the libraries imported for use in this project. The libraries include  |

---

In [1]:
# Libraries for data loading, data manipulation and data visulisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

# Libraries for data preparation and model building
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import mutual_info_regression #determine mutual info
from sklearn.preprocessing import StandardScaler # for standardization
from sklearn.model_selection import train_test_split
from sklearn import metrics
import math
import time
import datetime as dt
from sklearn.metrics import r2_score



Green Energy!!! Is it just a buzz? Is there a thing as Green-House Gas or Global Warming? Today we find out what Twitter users think!

<a id="two"></a>
## 1.2 Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section, we load the data from the . |

---

In [4]:
from google.colab import files
uploaded = files.upload()



Saving test_with_no_labels.csv to test_with_no_labels (1).csv
Saving train.csv to train (1).csv


In [5]:
import io
df_train = pd.read_csv(io.BytesIO(uploaded['train.csv']))
df_train = pd.read_csv(io.BytesIO(uploaded['test_with_no_labels.csv']))
# Dataset is now stored in a Pandas Dataframe

In [6]:
# View top of dataset

df_train.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [7]:
# view bottom of dataset

df_train.tail()

Unnamed: 0,message,tweetid
10541,"RT @BrittanyBohrer: Brb, writing a poem about ...",895714
10542,2016: the year climate change came home: Durin...,875167
10543,RT @loop_vanuatu: Pacific countries positive a...,78329
10544,"RT @xanria_00018: You’re so hot, you must be t...",867455
10545,RT @chloebalaoing: climate change is a global ...,470892


In [8]:
# View rows
df_train.index # we have 15,819 rows of data

RangeIndex(start=0, stop=10546, step=1)

In [9]:
# view columns

df_train.columns

Index(['message', 'tweetid'], dtype='object')

<a id="three"></a>
## 1.3 Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, we perform an in-depth analysis of all the features |

---

In [10]:
# Dataset Matrix
df_train.shape

(10546, 2)

In [11]:
# Data Statistics
df_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tweetid,10546.0,496899.936943,288115.677148,231.0,246162.5,495923.0,742250.0,999983.0


In [12]:
# Data Types and Non-null count 
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10546 entries, 0 to 10545
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  10546 non-null  object
 1   tweetid  10546 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 164.9+ KB


### No Null Rows in Columns

In [13]:
# Check for null values 
def null_cols(df):
    features_with_nulls = []
    for col in df.columns:
        if df[col].isnull().sum() > 0:
            features_with_nulls.append((col, df[col].isnull().sum()))
            
        
    return features_with_nulls

# Call the function
null_cols(df_train)

[]

### Function method of extracting usernames using list comprehension lambda function

In [14]:
# # Extract Username. 
#Function method of extracting usernames using list comprehension lambda function
def extract_username(df):
    import re
    copy_df = df.copy()
    #copy_df['message'] = copy_df['message'].values.astype(str)
    copy_df['Username'] = list(map(lambda x: re.findall('(@[a-zA-Z]+\w+)', x), copy_df['message']))
    return copy_df

In [15]:
extract_username(df_train)

Unnamed: 0,message,tweetid,Username
0,Europe will now be looking to China to make su...,169760,[]
1,Combine this with the polling of staffers re c...,35326,[]
2,"The scary, unimpeachable evidence that climate...",224985,[@ZEROCO2_]
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,"[@Karoli, @morgfair, @OsborneInk, @dailykos]"
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,[@FakeWillMoore]
...,...,...,...
10541,"RT @BrittanyBohrer: Brb, writing a poem about ...",895714,[@BrittanyBohrer]
10542,2016: the year climate change came home: Durin...,875167,[]
10543,RT @loop_vanuatu: Pacific countries positive a...,78329,[@loop_vanuatu]
10544,"RT @xanria_00018: You’re so hot, you must be t...",867455,"[@xanria_00018, @jophie30, @asn585]"


In [16]:
# Extract Username. 
# Using regex methods
import re

#df_train['message'] = df_train['message'].values.astype(str)
df_train['Username'] = df_train['message'].str.extract('(\@[a-zA-Z]+\w+)')
df_train

Unnamed: 0,message,tweetid,Username
0,Europe will now be looking to China to make su...,169760,
1,Combine this with the polling of staffers re c...,35326,
2,"The scary, unimpeachable evidence that climate...",224985,@ZEROCO2_
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,@Karoli
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,@FakeWillMoore
...,...,...,...
10541,"RT @BrittanyBohrer: Brb, writing a poem about ...",895714,@BrittanyBohrer
10542,2016: the year climate change came home: Durin...,875167,
10543,RT @loop_vanuatu: Pacific countries positive a...,78329,@loop_vanuatu
10544,"RT @xanria_00018: You’re so hot, you must be t...",867455,@xanria_00018


In [17]:
copy_df = df_train.copy()

copy_df['Username'] = list(map(lambda x: re.findall('(@[a-zA-Z]+\w+)', x), copy_df['message'].astype(str)))

copy_df

Unnamed: 0,message,tweetid,Username
0,Europe will now be looking to China to make su...,169760,[]
1,Combine this with the polling of staffers re c...,35326,[]
2,"The scary, unimpeachable evidence that climate...",224985,[@ZEROCO2_]
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,"[@Karoli, @morgfair, @OsborneInk, @dailykos]"
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,[@FakeWillMoore]
...,...,...,...
10541,"RT @BrittanyBohrer: Brb, writing a poem about ...",895714,[@BrittanyBohrer]
10542,2016: the year climate change came home: Durin...,875167,[]
10543,RT @loop_vanuatu: Pacific countries positive a...,78329,[@loop_vanuatu]
10544,"RT @xanria_00018: You’re so hot, you must be t...",867455,"[@xanria_00018, @jophie30, @asn585]"


<a id="four"></a>
## 1.4 Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section we conduct our feature engineering to: clean the dataset, and create new features - as identified in the EDA phase. Later, we initiate some ... |

---

# Section 2: Model Development and Analysis

This section describes

# <a id="five"></a>
## 2.1 Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, the team developed some ... Our choice of models include:

- M
- L
- Support Vector Machines
- DecisionTrees
- RandomForest
---
We continue to explore some ... 

Also, in this stage...

<a id="six"></a>
## 2.2 Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section we will compare the relative performance of the various ... |

---
We will use the following

# Section 3: Model Explanations and Conclusions

This section describes

<a id="seven"></a>
## 3.1 Model Explanation
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, we discuss how the best performing model works in a simple way ...|

---

<a id="seven"></a>
## 3.2 Conclusions
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, we discuss how the best performing model works in a simple way ...|

---