# Kaggle Competition: Natural Language Processing with Disaster Tweets

Here we take on the beginner challenge of using NLP and ML to predict whether a tweet is about a real disaster or not.

[Link to competition](https://www.kaggle.com/c/nlp-getting-started/overview)

## Loading the data

We get two datasets: a training set with labels, and a test set on which the final submission is scored.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [3]:
train_df.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [4]:
test_df.head()

|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0 |  |  | Just happened a terrible car crash |
| 1 | 2 |  |  | Heard about #earthquake is different cities, s... |
| 2 | 3 |  |  | there is a forest fire at spot pond, geese are... |
| 3 | 9 |  |  | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |  |  | Typhoon Soudelor kills 28 in China and Taiwan |


## First approximations

### Feature extraction

First, we will just tokenize the texts and use a count vectorizer to get a matrix that we can train a model on.

In [5]:
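import nltk
from nltk.corpus import stopwords

# Requires nltk.download('stopwords') and nltk.download('punkt') to have been run once.
stops = set(stopwords.words('english'))  # a set makes the membership test in tokenize() fast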

In [6]:
def tokenize(sample):
    """Lowercase the text, drop stop words and non-alphabetic tokens, and rejoin with spaces."""
    tokens = [w.lower() for w in nltk.word_tokenize(sample)]
    return ' '.join(w for w in tokens if w.isalpha() and w not in stops)

In [7]:
train_df['tokens'] = train_df['text'].apply(tokenize)
test_df['tokens'] = test_df['text'].apply(tokenize)

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 1))

X = vect.fit_transform(train_df['tokens'])
y = train_df['target'].to_numpy()
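
To make the count matrix concrete, here is a minimal sketch on a made-up three-document corpus (the `toy_*` names and the example sentences are hypothetical, not taken from the competition data). Each column is one vocabulary word and each row holds the word counts for one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-tokenized texts (not from train.csv)
toy_docs = ["forest fire near town", "fire fire everywhere", "sunny day near beach"]

toy_vect = CountVectorizer(ngram_range=(1, 1))
toy_X = toy_vect.fit_transform(toy_docs)

# Vocabulary words ordered by their column index in the matrix
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
dense = toy_X.toarray()

print(vocab)
print(dense)  # one row per document, one column per word
```

The real matrix `X` has the same layout, just with one row per tweet and one column per distinct token in the training set, which is why it is kept sparse.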

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state = 28)

In [10]:
from sklearn.model_selection import cross_val_score

In [11]:
metric_data = []

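def test_models(models, run):
    """Cross-validate each model on the current X_train/y_train and log per-fold F1 under the label `run`."""
    for model in models:
        print(model)
        # 3-fold cross-validation on the training split, scored by F1
        scores = cross_val_score(model, X_train, y_train, cv=3, scoring="f1")
        print(f'Mean f1-score: {np.mean(scores):.3f}')
        print("---")

        # Keep every fold's score so the runs can be compared later
        for score in scores:
            metric_data.append({'model': str(model),
                                'type': run,
                                'f1-score': score})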


### Testing initial models

We will use the following models:

* Logistic Regression
* Multinomial Naive Bayes
* Support Vector Classifier
* Random Forest Classifier

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

models = [LogisticRegression(random_state=28),
         MultinomialNB(),
          SVC(random_state=28),
          RandomForestClassifier(random_state=28)
]

test_models(models, 'initial')

LogisticRegression(random_state=28)
Mean f1-score: 0.738
---
MultinomialNB()
Mean f1-score: 0.746
---
SVC(random_state=28)
Mean f1-score: 0.719
---
RandomForestClassifier(random_state=28)
Mean f1-score: 0.703
---


So far, Multinomial Naive Bayes has the best mean F1-score (0.746), with Logistic Regression close behind (0.738). The scores are still quite low, though, so let's see what we can do to improve them.

# Improving feature extraction

## Considering n-grams

We can consider using bigrams as part of our vectorization process to add some information on which words appear next to one another in a tweet.

In [13]:
vect = CountVectorizer(ngram_range=(1, 2))

X = vect.fit_transform(train_df['tokens'])
y = train_df['target'].to_numpy()

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state = 28)
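
As a small illustration of what `ngram_range=(1, 2)` adds, the sketch below compares the features extracted from two made-up sentences with and without bigrams (the sentences and variable names are hypothetical).

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["forest fire near town", "fire drill near school"]  # hypothetical examples

unigram_vect = CountVectorizer(ngram_range=(1, 1)).fit(toy_docs)
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

unigram_features = sorted(unigram_vect.vocabulary_)
bigram_features = sorted(bigram_vect.vocabulary_)

# Features that only exist once bigrams are enabled (they contain a space)
added = [f for f in bigram_features if " " in f]

print(len(unigram_features), "unigram features")
print(len(bigram_features), "features with bigrams")
print(added)
```

Bigrams let the model tell "forest fire" apart from "fire drill", but they also enlarge the feature space considerably, and most of the new columns are seen in only a handful of tweets.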

### Testing the models

In [14]:
models = [LogisticRegression(random_state=28),
         MultinomialNB(),
          SVC(random_state=28),
          RandomForestClassifier(random_state=28)
]

test_models(models, 'bigrams')

LogisticRegression(random_state=28)
Mean f1-score: 0.735
---
MultinomialNB()
Mean f1-score: 0.744
---
SVC(random_state=28)
Mean f1-score: 0.698
---
RandomForestClassifier(random_state=28)
Mean f1-score: 0.681
---


Bigrams don't help: every model scores slightly lower than with unigrams alone. Let's try something else.

## Considering a tf-idf vectorizer

We can also try a tf-idf vectorizer, which down-weights words that appear in many tweets and so gives relatively more weight to rarer, more distinctive words.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(ngram_range=(1, 2))

X = vect.fit_transform(train_df['tokens'])
y = train_df['target'].to_numpy()

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state = 28)
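
The weighting can be inspected directly. This is a minimal sketch on a hypothetical three-document corpus, using scikit-learn's default smoothed idf: a word that appears in every document gets the lowest weight, and words that appear in only one document get the highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["fire near town", "fire near school", "fire earthquake"]  # hypothetical

toy_tfidf = TfidfVectorizer().fit(toy_docs)

# Map each vocabulary word to its inverse document frequency
idf = {term: toy_tfidf.idf_[idx] for term, idx in toy_tfidf.vocabulary_.items()}

# Print words from most common (low idf) to rarest (high idf)
for term in sorted(idf, key=idf.get):
    print(f"{term:12s} idf={idf[term]:.3f}")
```

Here "fire" occurs in all three documents, "near" in two, and the remaining words in one each, so their idf values increase in that order.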

### Testing the models

In [16]:
models = [LogisticRegression(random_state=28),
         MultinomialNB(),
          SVC(random_state=28),
          RandomForestClassifier(random_state=28)
]

test_models(models, 'tfidf')

LogisticRegression(random_state=28)
Mean f1-score: 0.654
---
MultinomialNB()
Mean f1-score: 0.703
---
SVC(random_state=28)
Mean f1-score: 0.613
---
RandomForestClassifier(random_state=28)
Mean f1-score: 0.660
---


This was even worse: every model dropped, with Logistic Regression falling to 0.654. Let's keep looking.

## Considering stemming

Let's consider stemming, which maps inflected forms of a word to a common stem and so shrinks the vocabulary (and the feature vectors) we are working with.

In [17]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

train_df['stems'] = train_df['tokens'].apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split(' ')]))

vect = CountVectorizer(ngram_range=(1, 1))

X = vect.fit_transform(train_df['stems'])
y = train_df['target'].to_numpy()

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state = 28)
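
To see how stemming shrinks the vocabulary, here is a minimal sketch on a hypothetical word list (the `toy_*` names are illustrative only). Several surface forms collapse onto the same stem, so they end up sharing one column in the count matrix.

```python
from nltk.stem.porter import PorterStemmer

toy_stemmer = PorterStemmer()
toy_words = ["fire", "fires", "flood", "floods", "flooding", "flooded"]  # hypothetical

toy_stems = [toy_stemmer.stem(w) for w in toy_words]

print(dict(zip(toy_words, toy_stems)))
print(len(set(toy_words)), "distinct words ->", len(set(toy_stems)), "distinct stems")
```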

### Testing the models

In [18]:
models = [LogisticRegression(random_state=28),
         MultinomialNB(),
          SVC(random_state=28),
          RandomForestClassifier(random_state=28)
]

test_models(models, 'stems')

LogisticRegression(random_state=28)
Mean f1-score: 0.736
---
MultinomialNB()
Mean f1-score: 0.749
---
SVC(random_state=28)
Mean f1-score: 0.727
---
RandomForestClassifier(random_state=28)
Mean f1-score: 0.714
---


## Analyzing results so far

In [19]:
results = pd.DataFrame(metric_data)

In [29]:
results.groupby(['type', 'model']).mean()

| type | model | f1-score |
|------|-------|----------|
| bigrams | LogisticRegression(random_state=28) | 0.735488 |
| bigrams | MultinomialNB() | 0.744229 |
| bigrams | RandomForestClassifier(random_state=28) | 0.680578 |
| bigrams | SVC(random_state=28) | 0.698216 |
| initial | LogisticRegression(random_state=28) | 0.737598 |
| initial | MultinomialNB() | 0.746489 |
| initial | RandomForestClassifier(random_state=28) | 0.703335 |
| initial | SVC(random_state=28) | 0.71916 |
| stems | LogisticRegression(random_state=28) | 0.736392 |
| stems | MultinomialNB() | 0.748611 |


In [39]:
results[results['f1-score'] == results['f1-score'].max()]

|   | model | type | f1-score |
|---|-------|------|----------|
| 2 | LogisticRegression(random_state=28) | initial | 0.761538 |


## Grid Search

In [41]:
from sklearn.model_selection import GridSearchCV

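Tf-idf weighting clearly hurt performance, so the search goes back to the plain count vectorizer. Note that the cell below works on the stemmed text and keeps bigrams (`ngram_range=(1, 2)`), even though bigrams gave no boost on their own. Two caveats when reading the scores that follow: only the Naive Bayes search passes `scoring='f1'`, so the other searches report accuracy (the default for classifiers), and every search is fitted on the `X_test` split rather than `X_train`.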

In [53]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 2), min_df=1) # Bigrams!!

X = vect.fit_transform(train_df['stems']) # ...and stems!!
y = train_df['target'].to_numpy()

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state = 28)

### Model 1:  Simple Logistic Regression

In [54]:
clf = GridSearchCV(estimator = LogisticRegression(random_state = 28), param_grid = 
                  {
                      'C': [0.001, 0.01, 0.1, 1]
                  })

clf.fit(X_test, y_test)

print(f"Best parameters: {clf.best_params_}")
print(f"Best score: {clf.best_score_:.3f}")

Best parameters: {'C': 1}
Best score: 0.772


### Model 2: Multinomial Naive Bayes

In [55]:
clf = GridSearchCV(estimator = MultinomialNB(), scoring='f1', param_grid = 
                  {
                      'alpha': [float(x)/10 for x in range(1, 10)],
                      'fit_prior': [True, False]
                  })

clf.fit(X_test, y_test)

print(f"Best parameters: {clf.best_params_}")
print(f"Best score: {clf.best_score_:.3f}")

Best parameters: {'alpha': 0.8, 'fit_prior': True}
Best score: 0.716


### Model 3: Support Vector Machine

In [56]:
clf = GridSearchCV(estimator = SVC(), param_grid = 
                  {
                      'C': [0.001, 0.01, 0.1, 1]
                  })

clf.fit(X_test, y_test)

print(f"Best parameters: {clf.best_params_}")
print(f"Best score: {clf.best_score_:.3f}")

Best parameters: {'C': 1}
Best score: 0.752


### Model 4: Random Forest

In [57]:
clf = GridSearchCV(estimator = RandomForestClassifier(), param_grid = 
                  {
                      'n_estimators': [10, 100, 500],
                      'min_samples_split': [2, 3, 5, 10],  
                  })

clf.fit(X_test, y_test)

print(f"Best parameters: {clf.best_params_}")
print(f"Best score: {clf.best_score_:.3f}")

KeyboardInterrupt: 
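
The searches above are hard to compare with each other: they are fitted on the `X_test` split, and only the Naive Bayes search sets `scoring='f1'`. One possible way to make them comparable is sketched below. It is not what produced the numbers above. It assumes a `Pipeline` that wraps the vectorizer and the classifier, tunes on the training split with F1 as the metric, and only then scores the held-out split. The tiny corpus and all variable names are hypothetical stand-ins for `train_df['stems']` and `train_df['target']`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical stand-in corpus (in the notebook: train_df['stems'], train_df['target'])
disaster = ["forest fire near town", "earthquake hits city",
            "flood destroys homes", "wildfire evacuation ordered"]
other = ["love this sunny day", "great coffee this morning",
         "new song is fire", "watching a movie tonight"]
texts = (disaster + other) * 5
labels = ([1] * len(disaster) + [0] * len(other)) * 5

text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=28)

# Vectorizer inside the pipeline: its vocabulary is learned from the training folds only
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(random_state=28)),
])

search = GridSearchCV(
    pipe,
    param_grid={"vect__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.01, 0.1, 1]},
    scoring="f1",  # same metric as the cross-validation runs earlier
    cv=3,
)
search.fit(text_train, y_tr)  # tune on the training split only

# The held-out split is touched once, after tuning
held_out_f1 = f1_score(y_te, search.predict(text_test))
print(search.best_params_)
print(f"CV F1: {search.best_score_:.3f}, held-out F1: {held_out_f1:.3f}")
```

Putting the vectorizer's `ngram_range` into the grid also lets the search decide whether bigrams are worth keeping, instead of fixing that choice beforehand.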

## Submitting predictions

In [77]:
vect = CountVectorizer(ngram_range=(1, 2), min_df=1) # Bigrams!!

X = vect.fit_transform(train_df['stems']) # ...and stems!!
y = train_df['target'].to_numpy()

X_train, X_test, y_train, y_test= train_test_split(X, y, random_state = 28)

winner = LogisticRegression(random_state = 28)

winner.fit(X_train, y_train)

test_df['stems'] = test_df['tokens'].apply(lambda x: ' '.join([stemmer.stem(w) for w in x.split(' ')]))

test_df['target'] = winner.predict(vect.transform(test_df['stems']))

In [81]:
test_df[['id', 'target']].to_csv('submission.csv', index = False)