# Natural language processing introductory challenge
This challenge is meant as an introduction to applying machine learning to natural language processing problems. Specifically, the goal is to detect whether a tweet is about a natural disaster. I will use Keras' tools to strip punctuation, @-mentions, and other symbols from the dataset.
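To make the cleaning step concrete, here is a minimal sketch of what stripping URLs, @-mentions, and punctuation could look like. It is an assumption on my part, not the Keras-based cleaning mentioned above: it relies only on Python's standard `re` module, and the helper name `clean_tweet` and the example tweet are made up for illustration.

```python
import re

# Hypothetical patterns for illustration; tune them to the actual data.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and drop URLs, @-mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # remove links first, before punctuation
    text = MENTION_RE.sub(" ", text)    # remove @user handles
    text = NON_ALNUM_RE.sub(" ", text)  # remove '#', ',', '!' and other symbols
    return " ".join(text.split())       # collapse repeated whitespace


example = "@user Forest fire near La Ronge, Sask.! http://t.co/abc #wildfire"
print(clean_tweet(example))
```

Note that this keeps the word of a hashtag (`#wildfire` becomes `wildfire`), which may or may not be what you want. Applied to the training data, it could be used as `train_df['text'].map(clean_tweet)` before vectorizing.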

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import feature_extraction, linear_model, model_selection
from sklearn.metrics import f1_score, accuracy_score
from xgboost import XGBClassifier

In [2]:
train_df = pd.read_csv("../Datasets/nlp-getting-started/train.csv")
test_df = pd.read_csv("../Datasets/nlp-getting-started/test.csv")
test_id = test_df.id

tweets = train_df['text']
targets = train_df['target']

## Analyzing the dataset
First I want to see whether some keywords are obviously connected, or obviously not connected, to natural catastrophes, using the `keyword` column. Taking the mean of `target` over the data grouped by `keyword` shows right away which words are always attached to a disaster and which never are:

In [3]:
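# select the numeric columns explicitly: recent pandas versions raise an error
# when asked for the mean of text columns such as `text` or `location`
train_df.groupby('keyword')[['id', 'target']].mean()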

| keyword | id | target |
|---|---|---|
| ablaze | 70.388889 | 0.361111 |
| accident | 121.800000 | 0.685714 |
| aftershock | 171.323529 | 0.000000 |
| airplane%20accident | 220.142857 | 0.857143 |
| ambulance | 269.052632 | 0.526316 |
| ... | ... | ... |
| wounded | 10609.135135 | 0.702703 |
| wounds | 10662.393939 | 0.303030 |
| wreck | 10708.513514 | 0.189189 |
| wreckage | 10759.717949 | 1.000000 |


We can already see that some keywords, such as `aftershock`, are never related to a natural disaster, while `wreckage` always is. Both have more than 30 occurrences. Is there a way to encode this in my data?
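One possible answer is to turn the per-keyword disaster rate into a numeric feature. The sketch below is only an illustration on a made-up toy frame, not something the notebook does: the helper `keyword_rate` and the `prior_weight` smoothing parameter are hypothetical names. It shrinks rare keywords toward the overall mean, so that a keyword seen once does not get a rate of exactly 0 or 1.

```python
import pandas as pd

# Toy stand-in for train_df; the real data has many more rows and columns.
toy = pd.DataFrame({
    "keyword": ["wreckage", "wreckage", "aftershock", "aftershock", "ablaze", None],
    "target": [1, 1, 0, 0, 1, 0],
})


def keyword_rate(df: pd.DataFrame, prior_weight: float = 5.0) -> pd.Series:
    """Smoothed share of disaster tweets per keyword (hypothetical helper)."""
    global_mean = df["target"].mean()
    stats = df.groupby("keyword")["target"].agg(["sum", "count"])
    # Blend each keyword's rate with the global mean; the fewer tweets a
    # keyword has, the more it is pulled toward the global mean.
    smoothed = (stats["sum"] + prior_weight * global_mean) / (stats["count"] + prior_weight)
    # Rows with a missing keyword fall back to the global mean.
    return df["keyword"].map(smoothed).fillna(global_mean)


toy["keyword_rate"] = keyword_rate(toy)
print(toy)
```

To avoid leaking the labels into the feature, the rates would have to be computed on the training folds only and then mapped onto the validation and test rows. The resulting column could be appended to the bag-of-words matrix, for instance with `scipy.sparse.hstack`.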

## Preparing the dataset

### Vectorization
Time to build the feature matrix as in the tutorial: each tweet becomes a bag-of-words count vector, with one column per word in the training vocabulary.

In [4]:
count_vectorizer = feature_extraction.text.CountVectorizer()
train_vectors = count_vectorizer.fit_transform(tweets)
# to have the same vectors in the test_vectors, we use transform instead of fit_transform
test_vectors = count_vectorizer.transform(test_df["text"])

In [5]:
test_vectors[0].todense().shape

(1, 21637)
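The shape above means one row (the first test tweet) and one column per word learned from the training tweets. A small sketch on a made-up corpus shows why `transform` keeps the test matrix aligned with the training one; the sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["forest fire near the town", "the fire is out"]
test_texts = ["flood near the town"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training texts and counts words
train_matrix = vectorizer.fit_transform(train_texts)
# transform reuses that vocabulary; words never seen in training are ignored
test_matrix = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray())
```

Here "flood" does not appear in the training texts, so it gets no column and simply vanishes from the test vector. Calling `fit_transform` on the test set instead would build a different vocabulary, and the columns would no longer mean the same thing to the model.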

## Model: XGBoost
We can now check whether an XGBoost classifier works better than the ridge regression from the tutorial, using 5-fold cross-validation scored by F1.

In [6]:
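# eval_metric is only used when an evaluation set is passed to fit();
# "error" is XGBoost's built-in binary classification error (1 - accuracy)
model = XGBClassifier(eval_metric="error", objective="binary:logistic",
                      learning_rate=0.5, max_depth=5, subsample=1, reg_lambda=0.8)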

scores = model_selection.cross_val_score(model, train_vectors, train_df["target"], cv=5, scoring="f1")
scores

array([0.59386973, 0.50821089, 0.57      , 0.58647937, 0.67081712])
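Each number above is an F1 score on one cross-validation fold, which is what `scoring="f1"` requests. As a reminder of what that metric measures, here is a small from-scratch sketch on made-up labels; the helper name `f1_from_labels` is hypothetical, and in practice `sklearn.metrics.f1_score` does the same job.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disasters correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed disasters
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_from_labels(y_true, y_pred))
```

In this toy case there are two true positives, one false alarm, and one miss, so precision and recall are both 2/3 and so is F1. Unlike accuracy, the metric ignores true negatives, which is why it can look low even when most non-disaster tweets are classified correctly.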

In [7]:
model.fit(train_vectors, train_df['target'])

In [8]:
sample_submission = pd.read_csv("../Datasets/nlp-getting-started/sample_submission.csv")

sample_submission["target"] = model.predict(test_vectors)
sample_submission.to_csv("xgb_submission.csv", index=False)

XGBoost performs slightly better! It's an improvement, all right, but still far from the success XGBoost has had in other challenges.

## Similar to ridge regression: elastic net regularization
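Because regularized linear models were so successful from the get-go, I wonder whether elastic net regularization could catch some redundancies and correlations in the data. As a baseline, the next notebook cell fits a plain `LogisticRegression`, which applies an L2 (ridge-like) penalty by default rather than an elastic net.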
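For reference, a minimal sketch of what an actual elastic net penalty could look like in scikit-learn. It runs on a made-up toy corpus rather than the competition data, and it assumes a scikit-learn version that accepts `penalty="elasticnet"` together with `solver="saga"`; the values of `l1_ratio` and `C` are arbitrary placeholders, not tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented examples: 1 = real disaster, 0 = not a disaster.
texts = [
    "forest fire spreading fast",
    "earthquake hits the city",
    "flood destroys homes",
    "wildfire evacuation ordered",
    "my mixtape is fire",
    "this party is a disaster lol",
    "I love this song",
    "traffic was a wreck today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

# l1_ratio mixes the two penalties: 0 is pure L2 (ridge), 1 is pure L1 (lasso).
enet = LogisticRegression(
    penalty="elasticnet",
    solver="saga",      # the solver that supports the elastic net penalty
    l1_ratio=0.5,
    C=1.0,
    max_iter=5000,
)
enet.fit(X, labels)

print("non-zero coefficients:", int((enet.coef_ != 0).sum()), "of", enet.coef_.size)
```

The L1 part of the penalty can push some coefficients to exactly zero, which is how redundant words would get dropped. On the real data, `l1_ratio` and `C` would need to be chosen by cross-validation.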

In [22]:
clf = linear_model.LogisticRegression()
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([0.6387547 , 0.61347869, 0.68350669])
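The warning above says the default solver stopped at its iteration limit before converging. A minimal sketch of the usual remedy, raising `max_iter`, is shown below on made-up toy data; the value 1000 is an arbitrary assumption, and the scores on the real data may change slightly once the solver is allowed to converge.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented examples: 1 = real disaster, 0 = not a disaster.
texts = [
    "forest fire spreading fast",
    "earthquake hits the city",
    "flood destroys homes",
    "my mixtape is fire",
    "I love this song",
    "traffic was a wreck today",
]
labels = [1, 1, 1, 0, 0, 0]
X = CountVectorizer().fit_transform(texts)

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot go unnoticed.
    warnings.simplefilter("error", category=ConvergenceWarning)
    clf_long = LogisticRegression(max_iter=1000).fit(X, labels)

# n_iter_ reports how many iterations the solver actually used.
print(clf_long.n_iter_)
```

If the solver still hits the limit on the full dataset, the scikit-learn message suggests scaling the features or trying a different solver.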

The plain logistic regression gives the best cross-validation scores so far. I'm going to submit it and walk away for now.

In [24]:
clf.fit(train_vectors, train_df["target"])

sample_submission = pd.read_csv("../Datasets/nlp-getting-started/sample_submission.csv")

sample_submission["target"] = clf.predict(test_vectors)
sample_submission.to_csv("logisticRegression_submission.csv", index=False)

This was the best submission so far, scoring 0.79987 (rank 306 of 739, i.e. 306/739 = 41.4%). But we are still far from a good result. I saw someone reaching good results using DNNs, which is what I think I will try next, maybe using keras.tuner again. Maybe this time it will give the best result...