# Natural language processing introductory challenge
This challenge was built to introduce someone to applying machine learning to natural language processing problems. In particular, it aims at detecting natural disasters from tweets. I will use Python's `re` and `string` modules to strip punctuation, `@` handles, and other symbols from the dataset.

In [99]:
import numpy as np
import pandas as pd
import tensorflow as tf
import re
import string
from sklearn import feature_extraction, linear_model, model_selection

In [100]:
train_df = pd.read_csv("../Datasets/nlp-getting-started/train.csv")
test_df = pd.read_csv("../Datasets/nlp-getting-started/test.csv")

## Analyzing the dataset
First I want to see whether some keywords are obviously unrelated to natural catastrophes, by grouping the tweets on the `keyword` column.

In [97]:
train_df.groupby('keyword').describe()

keyword,id count,id mean,id std,id min,id 25%,id 50%,id 75%,id max,target count,target mean,target std,target min,target 25%,target 50%,target 75%,target max
ablaze,36.0,70.388889,14.035216,48.0,58.50,69.5,81.25,95.0,36.0,0.361111,0.487136,0.0,0.0,0.0,1.0,1.0
accident,35.0,121.800000,15.118746,96.0,109.50,121.0,134.50,145.0,35.0,0.685714,0.471008,0.0,0.0,1.0,1.0,1.0
aftershock,34.0,171.323529,13.975564,146.0,160.25,171.5,182.75,195.0,34.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
airplane%20accident,35.0,220.142857,15.406536,196.0,208.50,219.0,233.50,245.0,35.0,0.857143,0.355036,0.0,1.0,1.0,1.0,1.0
ambulance,38.0,269.052632,14.101845,246.0,258.50,268.5,279.75,294.0,38.0,0.526316,0.506009,0.0,0.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wounded,37.0,10609.135135,14.491688,10585.0,10598.00,10609.0,10622.00,10632.0,37.0,0.702703,0.463373,0.0,0.0,1.0,1.0,1.0
wounds,33.0,10662.393939,14.225724,10636.0,10651.00,10663.0,10675.00,10684.0,33.0,0.303030,0.466694,0.0,0.0,0.0,1.0,1.0
wreck,37.0,10708.513514,15.230856,10685.0,10695.00,10708.0,10722.00,10733.0,37.0,0.189189,0.397061,0.0,0.0,0.0,0.0,1.0
wreckage,39.0,10759.717949,14.730828,10735.0,10747.50,10760.0,10771.50,10784.0,39.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0


We can already see that some keywords, such as `aftershock`, are never attached to a natural disaster in this training set (mean `target` of 0), while `wreckage` always is (mean `target` of 1). Both have more than 30 occurrences. Is there a way to encode this in my data?
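
One possible answer, as a minimal sketch: turn the per-keyword disaster rate (the `target mean` column above) into a numeric feature. The toy frame and the column name `keyword_rate` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for train_df; the real data comes from train.csv.
toy_df = pd.DataFrame({
    "keyword": ["aftershock", "aftershock", "wreckage", "wreckage", "ablaze", "ablaze"],
    "target": [0, 0, 1, 1, 0, 1],
})

# Fraction of disaster tweets for each keyword.
keyword_rate = toy_df.groupby("keyword")["target"].mean()

# Attach that rate to every row as a numeric feature (hypothetical column name).
toy_df["keyword_rate"] = toy_df["keyword"].map(keyword_rate)
print(toy_df)
```

Rows with a missing keyword get `NaN` here and would need a fallback value, for example the overall mean of `target`. The rates should also be computed from training rows only, otherwise the feature leaks the label into validation scores.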

# Preparing the dataset

## Standardization
Here we standardize the data to remove punctuation and other confusing elements.

In [101]:
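def custom_standardization(input_data):
    # Lowercase, then drop @handles and the '#' sign (the hashtag word itself is kept).
    lowercase = input_data.lower()
    stripped_handle = re.sub(r'@\w+', '', lowercase)
    remove_hashtag = re.sub('#', '', stripped_handle)
    remove_punctuation = re.sub('[%s]' % re.escape(string.punctuation), '', remove_hashtag)
    # With punctuation gone, a URL is a single token starting with "http"; remove it.
    return re.sub(r'(https|http)\w+', '', remove_punctuation)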

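This function first lowercases the text. The regex substitutions then remove all handles (any word that starts with `@`), drop the `#` sign while keeping the hashtag word, strip punctuation, and finally remove web addresses. Because punctuation is stripped first, a URL has already collapsed into a single token beginning with `http`, which the last substitution deletes. _E.g._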

In [102]:
print(train_df["text"][100])

print(custom_standardization(train_df["text"][100]))

.@NorwayMFA #Bahrain police had previously died in a road accident they were not killed by explosion https://t.co/gFJfgTodad
 bahrain police had previously died in a road accident they were not killed by explosion 
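
To make the order of the substitutions easier to follow, here is a small trace of the same five steps on a made-up tweet. The tweet text and the `step*` names are hypothetical.

```python
import re
import string

tweet = "@user Flood in #Texas! Details: https://t.co/abc123"  # hypothetical tweet

step1 = tweet.lower()                                               # lowercase
step2 = re.sub(r"@\w+", "", step1)                                  # drop @handles
step3 = re.sub("#", "", step2)                                      # drop '#', keep the word
step4 = re.sub("[%s]" % re.escape(string.punctuation), "", step3)   # strip punctuation
step5 = re.sub(r"(https|http)\w+", "", step4)                       # drop the collapsed URL

for name, value in [("lowercase", step1), ("handles", step2), ("hashtags", step3),
                    ("punctuation", step4), ("urls", step5)]:
    print(f"{name:>12}: {value!r}")
```

After the punctuation step the URL has become the single token `httpstcoabc123`, which is why the last pattern only needs to match `http` followed by word characters. Leading and trailing spaces are left in place, as in the output above.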


Let's apply this to the whole dataset:

In [103]:
train_df["clean text"] = train_df["text"].map(custom_standardization)
test_df["clean text"] = test_df["text"].map(custom_standardization)

## Vectorization
Next, we turn the cleaned text into token-count vectors with `CountVectorizer`, as in the tutorial.

In [104]:
count_vectorizer = feature_extraction.text.CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train_df["clean text"])
# use transform (not fit_transform) on the test set so that it shares the vocabulary learned from the training set
test_vectors = count_vectorizer.transform(test_df["clean text"])

In [105]:
test_vectors[0].todense().shape

(1, 15588)

Cleaning reduced the vocabulary to 15,588 tokens, about 6,000 fewer than without cleaning, which isn't half bad (a reduction of almost 30%).
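
The effect can be reproduced on a toy corpus. The sketch below compares the vocabulary size with and without cleaning and checks that `transform` gives the test set the same number of columns as the training set. The texts and the helper name `clean` are assumptions; `clean` repeats the steps of `custom_standardization` so that the block runs on its own.

```python
import re
import string
from sklearn.feature_extraction.text import CountVectorizer

def clean(text):
    """Same steps as custom_standardization, repeated so this block is self-contained."""
    text = re.sub(r"@\w+", "", text.lower())
    text = re.sub("#", "", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return re.sub(r"(https|http)\w+", "", text)

# Hypothetical tweets standing in for the real train/test files.
train_texts = [
    "Forest fire near #LaRonge, see https://t.co/aaa111",
    "@bob Forest fire spreading! https://t.co/bbb222",
]
test_texts = ["Fire near the lake http://t.co/ccc333"]

# Vocabulary learned from the raw text: URL fragments and handles become tokens.
raw_vec = CountVectorizer().fit(train_texts)

# Vocabulary learned from the cleaned text.
clean_vec = CountVectorizer()
train_vectors = clean_vec.fit_transform([clean(t) for t in train_texts])
test_vectors = clean_vec.transform([clean(t) for t in test_texts])

print("raw vocabulary size:  ", len(raw_vec.vocabulary_))
print("clean vocabulary size:", len(clean_vec.vocabulary_))
print("train/test shapes:    ", train_vectors.shape, test_vectors.shape)
```

Most of the removed tokens in this toy case are URL fragments and handles, which are nearly unique per tweet.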

## Model 1: ridge regression
We can already check whether the ridge classifier works better on the cleaned data.

In [108]:
clf = linear_model.RidgeClassifier()
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=5, scoring="f1")
scores

array([0.61209964, 0.49491525, 0.55414013, 0.55190311, 0.65375678])

With a mean F1 of about 0.57 across the five folds, the model does not seem particularly better than on the uncleaned data! That's quite interesting.
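
For a like-for-like comparison, one option is to score the raw and the cleaned text with the same folds and the same metric. The sketch below does this on a tiny made-up corpus, wrapping the vectorizer and the classifier in a pipeline so that the vocabulary is learned again inside each fold. The texts, labels, and the helper name `clean` are assumptions, and scores on such a small corpus mean nothing in themselves.

```python
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def clean(text):
    """Same steps as custom_standardization, repeated so this block is self-contained."""
    text = re.sub(r"@\w+", "", text.lower())
    text = re.sub("#", "", text)
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return re.sub(r"(https|http)\w+", "", text)

# Hypothetical tweets: 10 about disasters, 10 about something else.
disaster = [
    "Forest fire spreading near the town, evacuation ordered",
    "Earthquake damages buildings in the city centre",
    "Flood waters rising, residents evacuated overnight",
    "Wildfire destroys homes, thousands evacuated",
    "Major earthquake hits the coast, tsunami warning issued",
    "Flood emergency declared after heavy rain",
    "Fire crews battle wildfire near the highway",
    "Buildings collapsed after the earthquake this morning",
    "Storm causes flood damage across the county",
    "Evacuation under way as fire approaches the village",
]
other = [
    "This new mixtape is fire #music",
    "@anna I love this song so much",
    "Great pizza tonight with friends https://t.co/xyz789",
    "My phone battery died again, so annoying",
    "Watching a movie with my sister tonight",
    "That concert was a blast, love the band",
    "I love my new shoes so much",
    "Coffee with friends this morning #happy",
    "This game is so much fun",
    "Long day at work, pizza tonight",
]
texts = disaster + other
labels = [1] * len(disaster) + [0] * len(other)

# Same classifier, with and without the cleaning step as the vectorizer's preprocessor.
raw_model = make_pipeline(CountVectorizer(), RidgeClassifier())
clean_model = make_pipeline(CountVectorizer(preprocessor=clean), RidgeClassifier())

raw_scores = cross_val_score(raw_model, texts, labels, cv=5, scoring="f1")
clean_scores = cross_val_score(clean_model, texts, labels, cv=5, scoring="f1")
print("raw:  ", raw_scores.mean())
print("clean:", clean_scores.mean())
```

Passing a callable as `preprocessor` replaces the vectorizer's built-in lowercasing, which is fine here because `clean` lowercases the text itself.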

In [109]:
clf.fit(train_vectors, train_df["target"])

In [86]:
sample_submission = pd.read_csv("../Datasets/nlp-getting-started/sample_submission.csv")

sample_submission["target"] = clf.predict(test_vectors)
sample_submission.to_csv("ridge_submission.csv", index=False)

The submission scores worse than the one made without cleaning the data! That's surprising. I wonder whether hashtags carry some information that the cleaning throws away. Maybe we should add an indicator of whether a word was written as a hashtag or not?
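
One simple reading of this idea, as a sketch only, is a per-tweet flag that records whether the original text contained a `#`, appended as an extra column next to the word counts. The example tweets and the names `has_hashtag` and `features` are assumptions; a per-word variant (keeping hashtag words as separate tokens) would be another option.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets before and after standardization.
raw_texts = ["Wildfire near #Redding tonight", "my mixtape is fire"]
clean_texts = ["wildfire near redding tonight", "my mixtape is fire"]

# 1 if the original tweet contained at least one '#', else 0.
has_hashtag = np.array([[int("#" in text)] for text in raw_texts])

# Word counts from the cleaned text, as in the vectorization step.
counts = CountVectorizer().fit_transform(clean_texts)

# Append the indicator as one extra sparse column.
features = hstack([counts, csr_matrix(has_hashtag)]).tocsr()
print(features.shape)
```

The same flag has to be computed for the test set from its raw text, and stacked in the same column position, before calling `predict`.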