# Project: Disaster Tweets 

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But it’s not always clear whether a person’s words are actually announcing a disaster.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified.


# Libraries

In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import re
import string
import tensorflow 
import nltk

from nltk.corpus import stopwords
from nltk import sent_tokenize
from nltk import word_tokenize

from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Input, Dense 
from tensorflow.keras.utils import to_categorical 
from tensorflow.python.framework.random_seed import set_random_seed

from keras.utils.vis_utils import plot_model
from keras.callbacks import EarlyStopping 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import f1_score
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MaxAbsScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay


In [2]:
import os 
os.getcwd()

'C:\\Users\\elisa\\OneDrive\\Desktop\\MACHINE LEARNING\\PROJECT_ML'

## Loading the data and EDA

In [3]:
# datasets 

url = 'https://www.math.unipd.it/~dasan/disaster/'
train_df = pd.read_csv(url + 'train.csv', sep=",") 
test_df = pd.read_csv(url + 'test.csv', sep=",") 


In [4]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
train_df.shape  # The training set has 7613 rows and 5 columns

(7613, 5)

In [6]:
print("The real disaster tweets are {}".format(len(train_df[train_df['target'] == 1])))
print("The non-disaster tweets are {}".format(len(train_df[train_df['target'] == 0])))
print()
print("Hence, the two classes are roughly balanced (about 43% vs. 57%), although we have more non-disaster than real disaster tweets")

The real disaster tweets are 3271
The non-disaster tweets are 4342

Hence, the two classes are roughly balanced (about 43% vs. 57%), although we have more non-disaster than real disaster tweets
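
The class split can also be expressed as proportions rather than raw counts. The snippet below is a minimal sketch: it rebuilds a stand-in `target` column from the two counts printed above (the name `class_share` is introduced here for illustration) and uses `value_counts(normalize=True)`.

```python
import pandas as pd

# Stand-in target column rebuilt from the counts printed above
# (3271 tweets with target 1, 4342 tweets with target 0).
target = pd.Series([1] * 3271 + [0] * 4342, name="target")

# normalize=True returns the share of each class instead of raw counts
class_share = target.value_counts(normalize=True).round(3)
print(class_share)
```

On the real data, the same call would be `train_df["target"].value_counts(normalize=True)`.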


In [7]:
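# Note: Series.unique() includes NaN, so the unique counts below count the missing value as one entry
print("The missing values for the keyword column are: {}".format(train_df["keyword"].isna().sum()))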
print("There are {} unique keywords in the dataframe".format(len(train_df["keyword"].unique())))
print()
print("The missing values for the location column are: {}".format(train_df["location"].isna().sum()))
print("There are {} unique locations in the dataframe".format(len(train_df["location"].unique())))
print()
print("""One can already assume that keywords are going to be more relevant than locations for classification, 
      as more than 30% of location values are missing""")

The missing values for the keyword column are: 61
There are 222 unique keywords in the dataframe

The missing values for the location column are: 2533
There are 3342 unique locations in the dataframe

One can already assume that keywords are going to be more relevant than locations for classification, 
      as more than 30% of location values are missing
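
The missing-value counts above can be turned into fractions directly. The following sketch uses a hypothetical four-row frame (`toy_df`, invented here) with the same column names. It also shows the difference between `nunique()`, which ignores `NaN`, and `len(series.unique())`, which counts `NaN` as a value.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same column names as train_df
toy_df = pd.DataFrame({
    "keyword": ["ablaze", np.nan, "arson", "wildfire"],
    "location": ["USA", np.nan, np.nan, "London"],
})

# isna().mean() gives the fraction of missing values per column
missing_share = toy_df[["keyword", "location"]].isna().mean()

# nunique() ignores NaN, whereas len(series.unique()) counts it as a value
n_keywords = toy_df["keyword"].nunique()
n_keywords_with_nan = len(toy_df["keyword"].unique())

print(missing_share)
print(n_keywords, n_keywords_with_nan)
```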


In [8]:
train_df[train_df["location"] == 'USA']  # location is 'USA' in 104 tweets

Unnamed: 0,id,keyword,location,text,target
55,79,ablaze,USA,#Kurds trampling on Turkmen flag later set it ...,1
203,287,ambulance,USA,Twelve feared killed in Pakistani air ambulanc...,1
223,316,annihilated,USA,One thing for sure-God has promised Israel wil...,0
316,461,armageddon,USA,YOUR PHONE IS SPYING ON YOU! Hidden Back Door ...,0
382,551,arson,USA,Thousands attend a rally organized by Peace No...,1
...,...,...,...,...,...
7341,10511,wildfire,USA,The Latest: Washington #Wildfire misses town; ...,1
7356,10533,wildfire,USA,The Latest: #Wildfire destroys more homes but ...,1
7413,10606,wounded,USA,One man fatally shot another wounded on Vermon...,1
7420,10613,wounded,USA,Police Officer Wounded Suspect Dead After Exch...,1


# Natural Language Processing 

## Preprocessing phase 

First, we look at the text of the tweets.

In [9]:
text_df = train_df["text"]

text_df.head()

0    Our Deeds are the Reason of this #earthquake M...
1               Forest fire near La Ronge Sask. Canada
2    All residents asked to 'shelter in place' are ...
3    13,000 people receive #wildfires evacuation or...
4    Just got sent this photo from Ruby #Alaska as ...
Name: text, dtype: object

In [10]:
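n_unique = text_df.nunique()          # number of distinct tweet texts
n_repeated = len(text_df) - n_unique  # rows beyond the first occurrence of each text

print("There are {} tweets which are repeated in the dataframe and {} unique tweets".format(n_repeated, n_unique))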

There are 110 tweets which are repeated in the dataframe and 7503 unique tweets
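
Repeated tweets are worth a closer look, because the same text could in principle appear with different labels. Whether that happens in this dataset is not checked in the notebook. The sketch below shows one way to check it, on a hypothetical toy frame (`toy_df` and its rows are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data: one text is repeated with conflicting labels
toy_df = pd.DataFrame({
    "text": ["fire downtown", "fire downtown", "my mixtape is fire",
             "my mixtape is fire", "flood warning"],
    "target": [1, 1, 0, 1, 1],
})

# Rows whose text already appeared earlier in the frame
n_repeated = int(toy_df["text"].duplicated().sum())

# Texts that occur with more than one distinct label
labels_per_text = toy_df.groupby("text")["target"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(n_repeated, conflicting)
```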


Next, we build the set of stopwords: common words that are useless for classification because they give no indication of whether the tweet describes a real disaster.

In [11]:
nltk.download('stopwords')

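# Refer to the corpus through nltk.corpus so that re-running this cell still works
# after the name `stopwords` has been rebound to the set below
stopwords = set(nltk.corpus.stopwords.words('english'))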

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\elisa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Cleaning the text 

In [12]:
# Utility functions for cleaning the text of tweets

def stopword_remover(text):
    # Drop every whitespace-separated word found in the stopword set
    return " ".join([word for word in str(text).split() if word not in stopwords])

def url_remover(text):
    # Remove http:// and https:// URLs. Note that `.*` also removes any text
    # that follows the URL on the same line.
    text1 = re.sub(r'https?:\/\/.*[\r\n]*', "", text)
    # Drop @mentions, then lowercase
    text2 = " ".join(word for word in text1.split() if not word.startswith('@'))
    return text2.casefold().strip()

def special_chars_remover(text):
    # Keep only ASCII letters and whitespace (this also removes '#' and digits)
    return re.sub(r"[^a-zA-Z\s]", "", text).strip()
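
The `url_remover` function above matches `.*` after the URL scheme, so it deletes the URL together with any words that follow it on the same line. If only the URL itself should go, a variant along these lines could be used. This is a sketch, not the function used for the results in this notebook; `url_only_remover` and `URL_PATTERN` are hypothetical names.

```python
import re

# Match the scheme and everything up to the next whitespace character
URL_PATTERN = re.compile(r"https?://\S+")

def url_only_remover(text):
    """Hypothetical variant: strip URLs and @mentions, keep the remaining words."""
    no_urls = URL_PATTERN.sub("", text)
    words = [word for word in no_urls.split() if not word.startswith("@")]
    return " ".join(words).casefold().strip()

example = "Flood in town http://t.co/abc123 stay safe @user"
cleaned = url_only_remover(example)
print(cleaned)
```

Switching to this variant would change the vocabulary, so the scores reported further down would need to be recomputed.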

In [13]:
# Applying the cleaning functions, in order, to the training and test sets
# (each function returns a string, so no dropna() is needed)

for cleaner in (url_remover, stopword_remover, special_chars_remover):
    train_df["text"] = train_df["text"].apply(cleaner)
    test_df["text"] = test_df["text"].apply(cleaner)

In [14]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,deeds reason earthquake may allah forgive us,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,residents asked shelter place notified officer...,1
3,6,,,people receive wildfires evacuation orders cal...,1
4,7,,,got sent photo ruby alaska smoke wildfires pou...,1


In [15]:
print(train_df['text'][1:10])

1                forest fire near la ronge sask canada
2    residents asked shelter place notified officer...
3    people receive wildfires evacuation orders cal...
4    got sent photo ruby alaska smoke wildfires pou...
5    rockyfire update  california hwy  closed direc...
6    flood disaster heavy rain causes flash floodin...
7                           im top hill see fire woods
8    theres emergency evacuation happening building...
9                        im afraid tornado coming area
Name: text, dtype: object


## Tokenization

In [16]:
# Not needed when using CountVectorizer, which tokenizes the text itself

#from nltk.tokenize import wordpunct_tokenize

#text = train_df['text']

#text = text.apply(nltk.wordpunct_tokenize)

#text.head()

#train_df['text'] = text
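
To see what tokenization `CountVectorizer` performs on its own, its analyzer can be called directly. The sketch below uses the default settings and one of the cleaned tweets shown above as input. With the default token pattern, tokens shorter than two characters are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that CountVectorizer uses internally
# for preprocessing and tokenization (default settings here)
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("m utckm volcano hawaii")
print(tokens)  # the single-character token "m" is not kept
```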

In [17]:
train_df.tail()

Unnamed: 0,id,keyword,location,text,target
7608,10869,,,two giant cranes holding bridge collapse nearb...,1
7609,10870,,,control wild fires california even northern pa...,1
7610,10871,,,m utckm volcano hawaii,1
7611,10872,,,police investigating ebike collided car little...,1
7612,10873,,,latest homes razed northern california wildfir...,1


## Linear modeling 

Best fold F1 score (3-fold CV, TF-IDF transformation on single-word counts): 0.63; the mean across the three folds is about 0.59.

### IDEAS IF PERFORMANCE IS BAD: 

* Try stemming 
* Try lemmatization 
* Try TF or TF-IDF representation 
* Try one-hot encoding 
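
As an illustration of the last idea, and assuming "one-hot encoding" here means recording only the presence or absence of each word, `CountVectorizer` supports this through its `binary` parameter. The two documents in `docs` are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus; "fire" occurs twice in the first one
docs = ["fire fire near forest", "forest flood"]

# Columns follow the sorted vocabulary: fire, flood, forest, near
counts = CountVectorizer().fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts)  # raw word counts
print(binary)  # every non-zero count is set to 1
```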

In [None]:
# First we need to convert our text into a machine-readable format
# We use a bag-of-words (BoW) approach

In [31]:
from sklearn import feature_extraction 


count_vectorizer = feature_extraction.text.CountVectorizer()
tfidf_vectorizer = feature_extraction.text.TfidfTransformer()

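# Fit the vocabulary and the IDF weights on the training text only
train_vectors = count_vectorizer.fit_transform(train_df["text"])
train_vectors = tfidf_vectorizer.fit_transform(train_vectors)

# Reuse the fitted vocabulary and IDF weights for the test set
test_vectors = count_vectorizer.transform(test_df["text"])
test_vectors = tfidf_vectorizer.transform(test_vectors)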

In [32]:
train_vectors[0].todense().shape

(1, 14244)

In [33]:
from sklearn import linear_model, model_selection

clf = linear_model.RidgeClassifier()

In [34]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.60056338, 0.54910243, 0.63347023])
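
The scores above come from a vocabulary and IDF weights that were fitted on the full training set before cross-validation, so each validation fold has already influenced the features. One way to avoid this is to wrap the vectorizer and the classifier in a pipeline, which is then refitted inside every fold. The sketch below does so on a hypothetical toy corpus (`texts` and `labels` are invented); `TfidfVectorizer` is equivalent to `CountVectorizer` followed by `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; with the notebook's data one would pass
# train_df["text"] and train_df["target"] instead
texts = [
    "forest fire near town", "flood destroys homes", "earthquake hits city",
    "wildfire evacuation ordered", "tornado damages school", "storm floods streets",
    "my mixtape is fire", "this party is a blast", "traffic was a disaster lol",
    "i am flooded with emails", "that movie bombed", "crushed my exam today",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is fitted on the training part of each fold only
model = make_pipeline(TfidfVectorizer(), RidgeClassifier())
scores = cross_val_score(model, texts, labels, cv=3, scoring="f1")
print(scores, scores.mean())
```

The toy scores carry no meaning; the point is only the structure of the call.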