# Kaggle's tutorial on natural language processing
In this notebook I will follow the tutorial found at https://www.kaggle.com/code/philculliton/nlp-getting-started-tutorial.

In [1]:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

## Data processing

In [3]:
train_df = pd.read_csv("../Datasets/nlp-getting-started/train.csv")
test_df = pd.read_csv("../Datasets/nlp-getting-started/test.csv")

What does our data look like?

In [37]:
print(train_df.columns)
train_df.loc[100:120, :]

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')


| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 100 | 144 | accident | UK | .@NorwayMFA #Bahrain police had previously die... | 1 |
| 101 | 145 | accident | Nairobi, Kenya | I still have not heard Church Leaders of Kenya... | 0 |
| 102 | 146 | aftershock | Instagram - @heyimginog | @afterShock_DeLo scuf ps live and the game... cya | 0 |
| 103 | 149 | aftershock | 304 | 'The man who can drive himself further once th... | 0 |
| 104 | 151 | aftershock | Switzerland | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/yN... | 0 |
| 105 | 153 | aftershock | 304 | 'There is no victory at bargain basement price... | 0 |
| 106 | 156 | aftershock | US | 320 [IR] ICEMOON [AFTERSHOCK] \| http://t.co/vA... | 0 |
| 107 | 157 | aftershock | 304 | 'Nobody remembers who came in second.' Charles... | 0 |
| 108 | 158 | aftershock | Instagram - @heyimginog | @afterShock_DeLo im speaking from someone that... | 0 |
| 109 | 159 | aftershock | 304 | 'The harder the conflict the more glorious the... | 0 |


So, there are four explanatory variables: `id` is a unique identifier that will clearly be dropped; `keyword` is a particular keyword from the tweet; `location` is sometimes a place, sometimes a number, sometimes another reference (_e.g._ Instagram); finally, `text` is the actual text of the tweet. The response variable is `target`, which is either 1 (for a disaster tweet) or 0 (for an unrelated tweet).

### Building vectors
We use `CountVectorizer` to count the words in each tweet and turn them into processable data (_i.e._ vectors).

In [38]:
count_vectorizer = feature_extraction.text.CountVectorizer()

example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [43]:
example_train_vectors[0].todense()

matrix([[0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0]], dtype=int64)

`example_train_vectors` has one row per tweet and one column per token (unique word) found in the first 5 tweets, so the length of each row (54 here) is the size of the vocabulary. Each tweet is rewritten as a vector of counts: 0 when the word isn't in the tweet, and otherwise the number of times the word appears (every word present in this first tweet appears exactly once). Let's process all data like this.
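A minimal sketch on two made-up sentences (not taken from the competition data) shows what the vectorizer stores. The names `toy_texts`, `toy_vectorizer`, `toy_counts` and `vocab` are hypothetical; the point is that a repeated word gets a count above 1 and that the columns follow the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented sentences, used only for illustration.
toy_texts = ["fire fire in the forest", "the forest is calm"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
# ['calm', 'fire', 'forest', 'in', 'is', 'the']

print(toy_counts.toarray())
# [[0 2 1 1 0 1]
#  [1 0 1 0 1 1]]
```

"fire" occurs twice in the first sentence, so its column holds 2 rather than 1.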

In [44]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])
# use transform (not fit_transform) on the test set so that it is mapped onto the vocabulary learned from the training tweets
test_vectors = count_vectorizer.transform(test_df["text"])


In [45]:
test_vectors[0].todense().shape

(1, 21637)
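The difference between `fit_transform` and `transform` can be sketched on invented sentences (the names `toy_train`, `toy_test` and `vec` are assumptions for illustration). `transform` reuses the vocabulary learned during fitting, so the test matrix has the same number of columns and words never seen in training are silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences, used only for illustration.
toy_train = ["flood in the city", "city streets are quiet"]
toy_test = ["earthquake in the city"]

vec = CountVectorizer()
train_matrix = vec.fit_transform(toy_train)  # learns the vocabulary
test_matrix = vec.transform(toy_test)        # reuses that vocabulary

print(train_matrix.shape, test_matrix.shape)  # (2, 7) (1, 7)
print("earthquake" in vec.vocabulary_)        # False: unseen word is dropped
print(int(test_matrix.sum()))                 # 3: only "in", "the", "city" are counted
```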

## Model 
We assume there is a linear connection between the words in a tweet and its target. Since our vectors are quite big (21637 tokens detected), we build a ridge model, whose L2 penalty shrinks the weights toward zero (without setting them exactly to zero):

In [46]:
clf = linear_model.RidgeClassifier()
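To get a feel for the effect of the ridge penalty, here is a small sketch on synthetic data (an assumption for illustration, not the tweet vectors). It refits `RidgeClassifier` with increasing values of its `alpha` parameter and records the norm of the weight vector and the number of non-zero weights; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Synthetic data: 200 samples, 20 features, only the first two matter.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 20))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

alphas = [0.1, 1.0, 100.0, 10000.0]
norms = []
nonzero_counts = []
for alpha in alphas:
    model = RidgeClassifier(alpha=alpha).fit(X_toy, y_toy)
    norms.append(float(np.linalg.norm(model.coef_)))
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={alpha:>8}: weight norm={norms[-1]:.4f}, non-zero weights={nonzero_counts[-1]}")
```

The weight norm decreases as `alpha` grows, while all 20 weights stay non-zero: the penalty shrinks weights rather than removing features.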

We use cross-validation to evaluate the model: it is trained on part of the training data and validated on the rest, rotating through the folds. This competition uses $F_1$ as its metric.
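As a rough sketch of what cross-validation does, the loop below splits a synthetic dataset into three stratified folds, trains on two and scores on the third, and compares the result with scikit-learn's `cross_val_score` helper. The data from `make_classification` and all variable names are assumptions for illustration; the scoring function is the $F_1$ score described in the next section.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
model = RidgeClassifier()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=3).split(X_toy, y_toy):
    # Fit a fresh copy on the training folds, score on the held-out fold.
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    predictions = fold_model.predict(X_toy[val_idx])
    manual_scores.append(f1_score(y_toy[val_idx], predictions))

# For a classifier and an integer cv, cross_val_score uses stratified folds too.
auto_scores = cross_val_score(model, X_toy, y_toy, cv=3, scoring="f1")
print(np.round(manual_scores, 4), np.round(auto_scores, 4))
```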

### The $F_1$ metric
The $F_1$ metric is constructed from precision and recall. In pattern recognition, "precision" is the fraction of relevant instances among the retrieved instances. In this case it's the number of tweets correctly predicted as disasters divided by the total number of tweets predicted as disasters. "Recall" is the fraction of relevant instances that were retrieved, that is, the number of tweets correctly predicted as disasters divided by the actual number of disaster tweets. Note that both quantities need the true labels, so they cannot be computed on the unlabeled test set; cross-validation on the labeled training data lets us measure them. $F_1$ is then the harmonic mean of these two quantities, that is

$$
F_1 = \frac{2}{\text{recall}^{-1} + \text{precision}^{-1}}
$$
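A small worked example, with invented labels, makes the formula concrete. The lists `y_true` and `y_pred` are hypothetical; the hand-computed value is compared with `sklearn.metrics.f1_score` as a sanity check.

```python
from sklearn.metrics import f1_score

# Invented labels: 4 real disaster tweets out of 10; the model flags 3 tweets.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives: 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives: 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives: 2

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 1/2
f1 = 2 / (1 / recall + 1 / precision)  # harmonic mean: 4/7

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.6667 0.5 0.5714
print(round(f1_score(y_true, y_pred), 4))                   # 0.5714
```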

Let's do cross validation for $F_1$ on our ridge classifier.

In [47]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores

array([0.59421842, 0.56498283, 0.64149093])

In the challenge I will attempt to improve on this value (the three folds average about 0.60). Time to fit the model:

In [48]:
clf.fit(train_vectors, train_df["target"])

In [50]:
sample_submission = pd.read_csv("../Datasets/nlp-getting-started/sample_submission.csv")

sample_submission["target"] = clf.predict(test_vectors)
sample_submission.to_csv("tutorial_submission.csv", index=False)

This resulted in a score of about 0.78! Nice. That puts me at 548/730 on the leaderboard (bottom 25%). Time to try and improve this!