Link: https://www.kaggle.com/competitions/nlp-getting-started/overview/evaluation

## Blueprint

1. Check dataset
2. Cleaning
3. Preprocessing

## Evaluation Metric

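$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ (1 is the best, 0 is the worst), where:

$\text{precision} = \frac{TP}{TP+FP}$, $\text{recall} = \frac{TP}{TP+FN}$

Here $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives.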

In [2]:
# sklearn.metrics.f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', 
# sample_weight=None, zero_division='warn')
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
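
As a quick sanity check of the formula above, here is a minimal sketch that computes $F_1$ by hand from confusion counts. The helper name `f1_from_counts` and the toy labels are hypothetical and not part of the competition data; returning 0 when there are no true positives is an assumed convention for the undefined case.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        # Assumed convention: with no true positives, report F1 as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = disaster tweet, 0 = not a disaster tweet).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted 1, actually 1
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted 1, actually 0
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted 0, actually 1

score = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, score)
```

On these toy labels precision and recall are both 3/4, so $F_1$ is also 3/4. `sklearn.metrics.f1_score(y_true, y_pred)` with its default `average='binary'` should give the same value.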

## 1. Data Investigation

In [3]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [16]:
# Load dataset
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)
train.head()

(7613, 5) (3263, 4)


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [26]:
print(train[train["target"] == 0]["text"].values[10])  # not a disaster tweet
print(train[train["target"] == 1]["text"].values[1])   # disaster tweet

No way...I can't eat that shit
Forest fire near La Ronge Sask. Canada


## 2. Data Cleaning

In [30]:
train.shape[0]

7613

In [38]:
# Check the fraction of missing values in each column (0.33 means 33% missing)
for split, df in (("TRAIN", train), ("TEST", test)):
    for col in ("keyword", "location"):
        print(f"[{split}]{col}: ", round(df[col].isnull().mean(), 2))

[TRAIN]keyword:  0.01
[TRAIN]location:  0.33
[TEST]keyword:  0.01
[TEST]location:  0.34


In [50]:
# Training set with rows missing 'keyword' or 'location' dropped
k_notnull_tr = train[train["keyword"].notnull()]
notnull_tr = k_notnull_tr[k_notnull_tr["location"].notnull()]
# Format the ratio directly as a percentage to avoid float artifacts from round(...)*100
print(f"{notnull_tr.shape[0] / train.shape[0]:.2%} survived")
notnull_tr.head()

66.73% survived


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 31 | 48 | ablaze | Birmingham | @bbcmtd Wholesale Markets ablaze http://t.co/l... | 1 |
| 32 | 49 | ablaze | Est. September 2012 - Bristol | We always try to bring the heavy. #metal #RT h... | 0 |
| 33 | 50 | ablaze | AFRICA | #AFRICANBAZE: Breaking news:Nigeria flag set a... | 1 |
| 34 | 52 | ablaze | Philadelphia, PA | Crying out for more! Set me ablaze | 0 |
| 35 | 53 | ablaze | London, UK | On plus side LOOK AT THE SKY LAST NIGHT IT WAS... | 0 |


In [51]:
# Test set with rows missing 'keyword' or 'location' dropped
k_notnull_te = test[test["keyword"].notnull()]
notnull_te = k_notnull_te[k_notnull_te["location"].notnull()]
# Format the ratio directly as a percentage to avoid float artifacts from round(...)*100
print(f"{notnull_te.shape[0] / test.shape[0]:.2%} survived")
notnull_te.head()

66.14% survived


| | id | keyword | location | text |
|---|---|---|---|---|
| 15 | 46 | ablaze | London | Birmingham Wholesale Market is ablaze BBC News... |
| 16 | 47 | ablaze | Niall's place \| SAF 12 SQUAD \| | @sunkxssedharry will you wear shorts for race ... |
| 17 | 51 | ablaze | NIGERIA | #PreviouslyOnDoyinTv: Toke MakinwaÛªs marriag... |
| 18 | 58 | ablaze | Live On Webcam | Check these out: http://t.co/rOI2NSmEJJ http:/... |
| 19 | 60 | ablaze | Los Angeles, Califnordia | PSA: IÛªm splitting my personalities.\n\n?? t... |


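=> Dropping every row with a missing value would discard about a third of each dataset, and the test dataset also has missing values in 'keyword' and 'location' (about 1% and 34% of rows, respectively). We will therefore drop these two columns and use only the 'text' column.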
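
The column drop described above could look like the following minimal sketch. It uses a hypothetical two-row stand-in called `toy` (built from rows shown earlier) instead of reading `train.csv`, and the name `text_only` is also an assumption; the same `drop` call would apply to `train` and `test`.

```python
import pandas as pd

# Hypothetical two-row stand-in for train.csv, built from rows shown above.
toy = pd.DataFrame({
    "id": [4, 52],
    "keyword": [None, "ablaze"],
    "location": [None, "Philadelphia, PA"],
    "text": [
        "Forest fire near La Ronge Sask. Canada",
        "Crying out for more! Set me ablaze",
    ],
    "target": [1, 0],
})

# Drop the two sparsely populated columns; every row is kept.
text_only = toy.drop(columns=["keyword", "location"])
print(text_only.columns.tolist())
```

Unlike dropping rows with nulls, this keeps every tweet, including those in the test set that must still receive a prediction.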

## 3. Preprocessing

## 4. 