# EDSA - Climate Change Belief Analysis 2021
Predict an individual’s belief in climate change based on historical tweet data

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.

## 1. Data Collection
### Importing libraries
First, we'll import the libraries we will need, followed by the data.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import string
import urllib

### Reading in the data

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test_with_no_labels.csv')

We copy the data sets to avoid confusion about the data, since we are all working in the same script.

In [None]:
train_df = train_df.copy()
test_df = test_df.copy()

In [3]:
train_df.head()

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [4]:
train_df.shape

(15819, 3)

The `shape` attribute shows that our train data has 15819 rows and 3 columns.

In [5]:
test_df.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [6]:
test_df.shape

(10546, 2)

## 2. Data Cleaning
Before we clean our data, we need to know what type of data we're working with and whether it contains any missing values.

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15819 entries, 0 to 15818
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  15819 non-null  int64 
 1   message    15819 non-null  object
 2   tweetid    15819 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 370.9+ KB


In [8]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10546 entries, 0 to 10545
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  10546 non-null  object
 1   tweetid  10546 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 164.9+ KB


We should now check if our data contains any missing values.

In [9]:
train_df.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

In [10]:
test_df.isnull().sum()

message    0
tweetid    0
dtype: int64

### Removing Noise
The `message` variable contains web addresses. We replace each one with the placeholder `url-web`.

In [11]:
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'

train_df['message'] = train_df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
test_df['message'] = test_df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
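
To see what this substitution does to a single tweet, here is a minimal sketch that applies the same pattern with Python's built-in `re` module. The sample strings are made up for illustration and are not taken from the data set. Note that the character class `[$-_@.&+]` is a range running from `$` to `_`, so it also matches characters such as `.`, `,`, `/` and `:`. As a result, punctuation that directly follows a link is swallowed together with it.

```python
import re

# Same pattern and placeholder as in the cell above
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'

# Hypothetical tweets, made up for this example
samples = [
    "RT @user: climate change is real https://t.co/AbC123 #climate",
    "Read this: https://t.co/xyz.",
]

cleaned = [re.sub(pattern_url, subs_url, s) for s in samples]
for text in cleaned:
    print(text)
```

The second sample loses its final full stop because the pattern treats it as part of the link.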

### Convert to Lowercase and Remove Punctuation
Our data needs to be consistent, i.e. `Explore` and `explore` should be treated as the same word.

In [12]:
def remove_punctuation(words):
    # Lowercase the text, then drop every ASCII punctuation character
    message = words.lower()
    return ''.join([x for x in message if x not in string.punctuation])

In [13]:
train_df['message'] = train_df['message'].apply(remove_punctuation)
test_df['message'] = test_df['message'].apply(remove_punctuation)
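
As a quick check of this step, here is a minimal sketch on a made-up string. It also includes an equivalent formulation based on `str.translate`, which is an alternative and not what the notebook uses. The name `remove_punctuation_translate` is hypothetical. Because `string.punctuation` only covers ASCII symbols, non-ASCII characters such as `…` stay in place. The hyphen in the `url-web` placeholder is removed as well, so the placeholder becomes `urlweb`.

```python
import string

def remove_punctuation(words):
    # Same logic as the notebook cell: lowercase, then drop ASCII punctuation
    message = words.lower()
    return ''.join([x for x in message if x not in string.punctuation])

def remove_punctuation_translate(words):
    # Alternative sketch: a translation table deletes the same characters in one pass
    table = str.maketrans('', '', string.punctuation)
    return words.lower().translate(table)

sample = "She's thinking about Climate-Change… url-web"  # made-up example
result = remove_punctuation(sample)
print(result)
print(remove_punctuation_translate(sample) == result)
```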

In [24]:
train_df['message'][122]

['rt',
 'stephenschlegel',
 'shes',
 'thinking',
 'about',
 'how',
 'shes',
 'going',
 'to',
 'die',
 'because',
 'your',
 'husband',
 'doesnt',
 'believe',
 'in',
 'climate',
 'change',
 'urlwebã¢â‚¬â¦']

### Tokenisation

In [15]:
tokeniser = TreebankWordTokenizer()

train_df['message'] = train_df['message'].apply(tokeniser.tokenize)
test_df['message'] = test_df['message'].apply(tokeniser.tokenize)

In [23]:
train_df['message'][122]

['rt',
 'stephenschlegel',
 'shes',
 'thinking',
 'about',
 'how',
 'shes',
 'going',
 'to',
 'die',
 'because',
 'your',
 'husband',
 'doesnt',
 'believe',
 'in',
 'climate',
 'change',
 'urlwebã¢â‚¬â¦']

### Lemmatization
Let's lemmatize all of the words in both the train and test dataframes.

In [19]:
lemmatizer = WordNetLemmatizer()

In [20]:
def lemma_(words, lemmatizer):
    return [lemmatizer.lemmatize(word) for word in words]

In [None]:
train_df['message'] = train_df['message'].apply(lemma_, args=(lemmatizer, ))
test_df['message'] = test_df['message'].apply(lemma_, args=(lemmatizer, ))

In [21]:
train_df['message'][122]

['rt',
 'stephenschlegel',
 'shes',
 'thinking',
 'about',
 'how',
 'shes',
 'going',
 'to',
 'die',
 'because',
 'your',
 'husband',
 'doesnt',
 'believe',
 'in',
 'climate',
 'change',
 'urlwebã¢â‚¬â¦']

### Stop Words

In [25]:
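# Build the stop-word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    return [t for t in tokens if t not in stop_words]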

In [26]:
train_df['message'] = train_df['message'].apply(remove_stop_words)
test_df['message'] = test_df['message'].apply(remove_stop_words)

In [27]:
train_df['message'][122]

['rt',
 'stephenschlegel',
 'shes',
 'thinking',
 'shes',
 'going',
 'die',
 'husband',
 'doesnt',
 'believe',
 'climate',
 'change',
 'urlwebã¢â‚¬â¦']
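
The output above still contains `shes` and `doesnt`. A likely reason is the order of the steps: apostrophes were stripped before stop-word removal, so these tokens no longer match apostrophe-bearing entries such as `she's` in a stop-word list. The sketch below illustrates the idea with a small hand-written stop-word set, which is an assumption for illustration and not the NLTK list. The names `strip_punctuation`, `STOP_WORDS` and `NORMALISED_STOP_WORDS` are hypothetical.

```python
import string

# Illustrative subset only; the notebook uses stopwords.words('english') from NLTK
STOP_WORDS = {"about", "how", "to", "because", "your", "in", "she's", "doesn't"}

def strip_punctuation(token):
    # Drop ASCII punctuation, mirroring the earlier cleaning step
    return "".join(ch for ch in token if ch not in string.punctuation)

# Add punctuation-free variants so the stop words match the cleaned tokens
NORMALISED_STOP_WORDS = STOP_WORDS | {strip_punctuation(w) for w in STOP_WORDS}

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

tokens = ["shes", "thinking", "about", "how", "shes", "going", "to", "die"]
plain = remove_stop_words(tokens, STOP_WORDS)
normalised = remove_stop_words(tokens, NORMALISED_STOP_WORDS)
print(plain)
print(normalised)
```

Another option, under the same assumption, would be to remove stop words before stripping punctuation.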