# Data Science Fellowship program exam

**Machine Learning - Assignment 2: Natural disasters dataset**

By: Jules Kuehn

Due: 2020-12-03, 6pm Eastern

## Task 3: Pre-trained word embeddings + linear classifier model

Not implemented, but would be nice:
* Common functions moved to python module, to be imported into multiple notebooks.
* (Some code is duplicated here from the previous notebook.)

### Setup

In [21]:
%pip install -r ../requirements.txt -q

Note: you may need to restart the kernel to use updated packages.


In [None]:
import sys
!{sys.executable} -m spacy download en_core_web_sm -q

In [5]:
# Suppress warnings for cleaner output
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

import re
from itertools import product

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import (ConfusionMatrixDisplay, classification_report,
                             confusion_matrix, f1_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from wordcloud import WordCloud

### Import data
Use Pandas to import the CSV to a Dataframe. For a larger dataset, I would use Spark for pre-processing steps.

In [6]:
train_df = pd.read_csv('../data/raw/train.csv')
disaster_tweets = train_df[train_df['target'] == 1]['text'].tolist()
non_disaster_tweets = train_df[train_df['target'] == 0]['text'].tolist()

### Embed tweets into W-dimensional vector

This is trivially easy with spacy.

However, we may want to try a few different methods.

Since our pre-processing (lowercasing, punctuation stripping) is no longer handled by CountVectorizer, that should be added back in.

In [10]:
def preprocess_texts(
    texts,
    replace_numbers=True,
    replace_mentions=True,
    replace_hashtags=True,
):
    """
    Preprocess texts for NLP.
    Takes a list or Series of texts and returns a list of preprocessed texts.
    """
    if isinstance(texts, pd.Series):
        texts = texts.tolist()
    
    for i, text in enumerate(texts):
        if replace_numbers:
            # Replace any substring of digits with ' number ' using regex
            text = re.sub(r'\d+', ' number ', text)
        if replace_mentions:
            # For pre-trained embeddings, we want to use common words
            text = text.replace('@', ' at ')
        if replace_hashtags:
            text = text.replace('#', ' hashtag ')
        # Remove URLs
        text = re.sub(r'http\S+', '', text)
        # Remove punctuation
        text = re.sub(r'[^\w\s]', '', text)
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
        # Remove leading and trailing whitespace
        text = text.strip()
        texts[i] = text
    return texts

train_df['processed_text'] = preprocess_texts(train_df['text'])

#### Spacy document embedding

The advantage of getting an embedding for a single document is ease of use, but it is not as interpretable or flexible as per-word embeddings. (This is the mean vector of the text).

In [15]:
def embed_text_spacy(texts, model):
    """
    Embed a list of text using a spacy model.
    """
    return np.array([model(text).vector for text in texts])

# Load the spacy model
# Note: better performance with larger models, but slower to load
nlp = spacy.load('en_core_web_sm')

# Embed the tweets
tweets_embedded_spacy = embed_text_spacy(train_df['text'], nlp)

tweets_embedded_spacy.shape

(7613, 96)

In [16]:
# Does the preprocessing make a difference?
clean_tweets_embedded_spacy = embed_text_spacy(train_df['processed_text'], nlp)

clean_tweets_embedded_spacy - tweets_embedded_spacy

array([[ 0.10663193, -0.06177586,  0.01677806, ...,  0.15290211,
        -0.06339556,  0.06228748],
       [ 0.35435182, -0.06206617, -0.13854524, ..., -0.01496623,
        -0.01122114, -0.4016894 ],
       [ 0.15765142,  0.03518435,  0.06658074, ...,  0.01935729,
        -0.15030086,  0.05764329],
       ...,
       [-0.09609085, -0.42296267, -0.10893895, ..., -0.37544906,
         0.06502324, -0.6771796 ],
       [ 0.09619114, -0.01231611,  0.1896926 , ...,  0.13207202,
        -0.2183739 , -0.2528774 ],
       [ 0.08932912, -0.16393521,  0.17910671, ...,  0.364201  ,
         0.06129225,  0.19030449]], dtype=float32)

#### Spacy word embeddings (concatenated)

We can also obtain a single W-dimensional vector from simply concatenating the word embeddings (and padding or trimming the vector to W). This doesn't seem like a good idea (due to loss of word boundaries, vs. a multidimensional model where each word is a vector), but we can try it.

In [29]:
def embed_text_spacy_per_word(texts, model):
    """
    Embed a list of text using a spacy model.
    """
    embedded_texts = []
    for text in texts:
        word_embeddings = np.array([token.vector for token in model(text)])
        # Concatenate the word embeddings
        concat_embedding = np.concatenate(word_embeddings, axis=0)
        # Pad or trim the embedding to a fixed length
        padded_embedding = np.pad(concat_embedding, 960)
        truncated_embedding = padded_embedding[:960]
        embedded_texts.append(truncated_embedding)
    
    return np.array(embedded_texts)

# Embed the tweets
tweets_embedded_spacy_per_word = embed_text_spacy_per_word(
    train_df['text'], nlp
)

# Embed the cleaned tweets
clean_tweets_embedded_spacy_per_word = embed_text_spacy_per_word(
    train_df['processed_text'], nlp
)

tweets_embedded_spacy_per_word.shape

(7613, 960)

### Model training and evaluation

Note that I am not using the test data at this time. I am only using the training data while testing pre-processing hyperparameters.

(We will use the test data in the last task of this exam, to compare all models.)

This code is largely the same as in notebook 1_BoW, but with the TF-IDF transformer added and manual pre-processing steps removed.

In [82]:
def evaluate_preprocessing(
    train_df,
    strip_accents='ascii',
    lowercase=True,
    initial_vocab='all',
    remove_n_common_words=5,
    min_df=5,
    max_features=None,
    verbose=False,
    return_artifacts=False,
    ngram_range=(1, 1),
    use_idf=True,
    norm='l2',
    smooth_idf=True,
    sublinear_tf=False,
    clf=LogisticRegression(),
    tfidf=True,
):

    non_disaster_tweets = train_df[train_df['target'] == 0]['text'].tolist()
    disaster_tweets = train_df[train_df['target'] == 1]['text'].tolist()

    # Create vectorizer with limited vocabulary
    vectorizer = create_vectorizer(
        non_disaster_tweets,
        disaster_tweets,
        initial_vocab=initial_vocab,
        remove_n_common_words=remove_n_common_words,
        min_df=min_df,
        max_features=max_features,
        strip_accents=strip_accents,
        lowercase=lowercase,
        ngram_range=ngram_range,
    )
    
    # Fit TF-IDF transformer
    all_tweets = train_df['text']
    tfidf_transformer = TfidfTransformer(
        use_idf=use_idf,
        norm=norm,
        smooth_idf=smooth_idf,
        sublinear_tf=sublinear_tf,
    ).fit(vectorizer.transform(all_tweets))

    # Create a pipeline to vectorize and apply TF-IDF
    # Classifier is passed as an argument for comparison
    pipeline = Pipeline([
        ('vect', vectorizer),
        ('tfidf', tfidf_transformer if tfidf else None),
        ("clf", clf),
    ])


    # Split the training data into training and validation sets
    X_train = all_tweets
    y_train = train_df['target']

    X_train, X_val, y_train, y_val = train_test_split(
        X_train,
        y_train,
        test_size=0.2,
        random_state=42,
    )

    pipeline.fit(X_train, y_train)

    # Make predictions on the validation set
    preds_val = pipeline.predict(X_val)

    if verbose:
        # Display results on validation set
        print(classification_report(y_val, preds_val))
        ConfusionMatrixDisplay.from_estimator(pipeline, X_val, y_val, cmap='Blues', normalize='true')
    
    if return_artifacts:
        return pipeline, X_train, X_val, y_train, y_val

    # Return f1 macro average score
    return f1_score(y_val, preds_val, average='macro')


### Selecting a model
TF-IDF will not necessarily improve performance. Let's evaluate its effects with a variety of SkLearn models.


In [83]:
def grid_parameters(parameters):
    for params in product(*parameters.values()):
        yield dict(zip(parameters.keys(), params))

parameters = {
    'clf': [
        LogisticRegression(max_iter=2000, random_state=42),
        RandomForestClassifier(max_depth=250, random_state=42), # Informed guess at max_depth
        SGDClassifier(max_iter=2000, random_state=42),
        SVC(max_iter=2000, random_state=42),
        LinearSVC(max_iter=2000, random_state=42),
        MultinomialNB(),
    ],
    'tfidf': [True, False],
    'ngram_range': [(1, 1), (1, 2), (1, 3)]
}

results = []

for settings in grid_parameters(parameters):
    f1_macro = evaluate_preprocessing(train_df, **settings, verbose=False)
    results.append((settings, f1_macro))

best_result = max(results, key=lambda x: x[1])
print("Best result:\n", best_result)

evaluate_preprocessing(train_df, **best_result[0], verbose=True)

In [None]:
print('Top 3 models:')
sorted(results, key=lambda x: x[1], reverse=True)[:3]

Top 3 models:


[({'clf': SVC(max_iter=2000, random_state=42),
   'tfidf': False,
   'ngram_range': (1, 1)},
  0.789233228475146),
 ({'clf': SVC(max_iter=2000, random_state=42),
   'tfidf': True,
   'ngram_range': (1, 1)},
  0.7844887644132201),
 ({'clf': MultinomialNB(), 'tfidf': True, 'ngram_range': (1, 3)},
  0.7812398369577)]

### Summary of results

We evaluated the effect of additional feature engineering:
* TF-IDF weighting
* Bigrams
* Trigrams

And several different ML models:
* Logistic Regression
* Random Forest
* SGD
* SVC
* Linear SVC
* Multinomial Naive Bayes

The best results came from the model we already obtained in the previous notebook:

**A simple bag of words model with Logistic Regression.**

### Explaining the model

*(Copied from previous notebook with modifications to use the Pipeline)*

An advantage of the bag of words + logistic regression model is its simplicity.

We can simply look up the model coefficients to determine feature importance:
* Which words contribute to a classification of "disaster" (1) or "non-disaster" (0)?

There is no need to scale the coefficients here since all the features are from the BoW.

In [63]:
pipeline, X_train, X_val, y_train, y_val = evaluate_preprocessing(
    train_df, **best_result[0], return_artifacts=True
)

model_coefficients = pd.DataFrame(
   pipeline['clf'].coef_.T,
   columns=['Coefficients'], index=pipeline['vect'].get_feature_names_out()
)

sorted_words = model_coefficients.sort_values('Coefficients', ascending=False)
print('The following words have the highest positive coefficients (disaster):')
print(sorted_words.head(10).to_string())

print('\nThe following words have the most negative coefficients (non-disaster):')
print(sorted_words.tail(10)[::-1].to_string())

The following words have the highest positive coefficients (disaster):
            Coefficients
hiroshima       2.445416
wildfire        2.359061
earthquake      2.220706
derailment      2.186881
fires           2.132240
tornado         1.992465
riots           1.907593
suicide         1.897605
massacre        1.870264
floods          1.869637

The following words have the most negative coefficients (non-disaster):
        Coefficients
full       -1.564929
better     -1.506215
blight     -1.411568
ebay       -1.404008
bags       -1.243934
cake       -1.237400
upon       -1.234887
show       -1.211611
art        -1.197938
likely     -1.184574


#### Explaining a single prediction

In [68]:
def predict_and_explain(tweet, pipeline, settings):
    """Predict the class of a tweet and explain the prediction."""
    bow = pipeline['vect'].transform([tweet])
    prediction = pipeline['clf'].predict(bow)[0]
    word_importance = []
    
    # Get words from bow
    words = pipeline['vect'].get_feature_names_out()

    # Get model coefficients
    coefficients = pipeline['clf'].coef_[0]

    # Get coefficients for words in the tweet
    for i, word in enumerate(words):
        if bow[0, i] > 0:
            word_importance.append((word, coefficients[i]))

    word_importance = sorted(word_importance, key=lambda x: x[1], reverse=True)

    print(f'Prediction: {"disaster" if prediction == 1 else "non-disaster"}\n')
    for word, coefficient in word_importance:
        print(f'{word}: {coefficient:.2f}')
    
    return prediction, word_importance


In [69]:
tweet = train_df['text'].sample(1).values[0]
print(tweet, '\n')

prediction, word_importance = predict_and_explain(tweet, pipeline, best_result[0])


So apparently there were bush fires near where I live over the weekend that I was totally oblivious to... 

Prediction: disaster

fires: 2.13
near: 0.95
were: 0.81
over: 0.71
bush: 0.49
was: 0.35
apparently: 0.28
that: 0.07
totally: -0.01
there: -0.01
live: -0.02
where: -0.37
weekend: -0.41
so: -0.49
