## Predicting whether tweets refer to a disaster or not

### Imports

In [1]:
%%capture --no-display

import os
import string

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import RFE, RFECV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, Ridge, RidgeCV
from sklearn.metrics import make_scorer
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    ShuffleSplit,
    cross_val_score,
    cross_validate,
    train_test_split
)
from scipy.stats import loguniform, randint
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    average_precision_score
)
from sklearn.svm import SVC, SVR

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
nltk.download("punkt")
nltk.download('averaged_perceptron_tagger')

sid = SentimentIntensityAnalyzer()

%matplotlib inline

In [2]:
def mean_std_cross_val_scores(model, X_train, y_train, **kwargs):
    """
    Returns mean and std of cross validation

    Parameters
    ----------
    model :
        scikit-learn model
    X_train : numpy array or pandas DataFrame
        X in the training data
    y_train :
        y in the training data

    Returns
    ----------
        pandas Series with mean scores from cross_validation
    """

    scores = cross_validate(model, X_train, y_train, **kwargs)

    mean_scores = pd.DataFrame(scores).mean()
    std_scores = pd.DataFrame(scores).std()
    out_col = []

    for i in range(len(mean_scores)):
        out_col.append((f"%0.3f (+/- %0.3f)" % (mean_scores[i], std_scores[i])))

    return pd.Series(data=out_col, index=mean_scores.index)

<br>

## 1. Aim

The aim of this problem is to predict whether a tweet refers to a disaster event or not. This is a classification problem.


## 2. Data

The data used in this project can be found publicly on [Kaggle Disaster Tweets](https://www.kaggle.com/vstepanenko/disaster-tweets). The data set contains the tweet as text data, keywords found in the tweet, the location where the tweet was made, and the target describing whether the tweet refers to a disaster (`target=1`) or not (`target=0`).

In the next steps, the code assumes that the data file is stored as `tweets.csv` in the project root.

<br>

## 3. EDA

Before building any models, let us have a look at the data set more closely.

In [3]:
df = pd.read_csv(
    "tweets.csv", usecols=["keyword", "text", "target", "location"]
)
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=2
)
train_df.head()

Unnamed: 0,keyword,location,text,target
3289,debris,,"Unfortunately, both plans fail as the 3 are im...",0
2672,crash,SLC,I hope this causes Bernie to crash and bern. S...,0
2436,collide,,—pushes himself up from the chair beneath to r...,0
9622,suicide%20bomb,,Widow of CIA agent killed in 2009 Afghanistan ...,1
8999,screaming,Azania,As soon as God say yes they'll be screaming we...,0


In [4]:
X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

In [5]:
train_df["target"].value_counts(normalize=True)

0    0.812995
1    0.187005
Name: target, dtype: float64

<br>
Rudimentary EDA reveals that there is **class imbalance** since only around 19% of the examples in the training set belong to the "True" class, which is our interest. If we do not deal with class imbalance, results can be misleading, for instance, if 99% of examples have the label False, a DummyClassifier will have an accuracy of 99%. This does not mean that the dummy model is good, because it will likely have poor recall and precision.

To deal with class imbalance, metrics that are relevant to classification problems, namely **accuracy, precision, recall, f1-score, ROC AUC, and average prediction** are used. 

In this problem, a false positive corresponds to a tweet being classified as referring to a disaster, when it actually isn't. A false negative corresponds to a tweet being classified as not referring to a disaster, when it actually is. A false negative is more harmful in this case and thus, we want to minimise the number of false negatives, which is equivalent to increasing the recall. For this reason, among all the scoring metrics, the accuracy and precision will be taken less into consideration.

In [6]:
scoring_metrics = ["accuracy", "precision", "recall",
                   "f1", "roc_auc", "average_precision"]

<br>

Further EDA reveals that the `location` column contains a large proportion of **missing values** which will make it challenging for imputation. Furthermore, some of the values are **not even real locations**. Thus, in this project, the feature `location` will be dropped.

In [7]:
print(f"Proportion of NA values in location feature: "
      f"{round(X_train['location'].isna().sum() / len(X_train.index), 4)}")

Proportion of NA values in location feature: 0.2997


In [8]:
X_train[["location"]].sample(n=12, random_state=2)

Unnamed: 0,location
5998,Rwanda
11099,"Norwich, UK"
2581,"Dublin, Ireland"
9022,bi 18 she/her
11237,대한민국 서울
11053,
10483,Australia
6820,"Louisiana, USA"
10763,
6766,"Birmingham, England"


<br>

## 4. Feature transformation

In [9]:
drop_features = ["location"]
text_features = "keyword"
text_features2 = "text"

preprocessor = make_column_transformer(
    ("drop", drop_features),
    (CountVectorizer(stop_words="english"), text_features),
    (CountVectorizer(stop_words="english"), text_features2)
)

The `location` feature is dropped because there are a lot of missing values. The `text` and `keyword` features are transformed using the **bag of words representation** since the features are free text, and there are no fixed number of categories for both features.

<br>

## 5. Baseline model

In [10]:
results = {}

In [11]:
%%capture --no-display

results["dummy"] = mean_std_cross_val_scores(
    DummyClassifier(), X_train, y_train, scoring=scoring_metrics
)

pd.DataFrame(results)

Unnamed: 0,dummy
fit_time,0.001 (+/- 0.000)
score_time,0.004 (+/- 0.001)
test_accuracy,0.813 (+/- 0.000)
test_precision,0.000 (+/- 0.000)
test_recall,0.000 (+/- 0.000)
test_f1,0.000 (+/- 0.000)
test_roc_auc,0.500 (+/- 0.000)
test_average_precision,0.187 (+/- 0.000)


The accuracy of the dummy classifier is 0.813 since this corresponds to the proportion of the most common label. There are warnings when trying to compute the precision because the dummy classifier will never predict the class True (it always predicts False since that is the most common class), and thus, when computing precision, the number of false positives is zero, and there will be division by zero.

<br><br>

## 6. Logistic regression

In [12]:
lr_default_pipe = make_pipeline(
    preprocessor, LogisticRegression(random_state=123)
)

results["lr_default"] = mean_std_cross_val_scores(
    lr_default_pipe, X_train, y_train, scoring=scoring_metrics
)

pd.DataFrame(results)

Unnamed: 0,dummy,lr_default
fit_time,0.001 (+/- 0.000),0.291 (+/- 0.019)
score_time,0.004 (+/- 0.001),0.056 (+/- 0.001)
test_accuracy,0.813 (+/- 0.000),0.887 (+/- 0.005)
test_precision,0.000 (+/- 0.000),0.811 (+/- 0.012)
test_recall,0.000 (+/- 0.000),0.513 (+/- 0.036)
test_f1,0.000 (+/- 0.000),0.628 (+/- 0.026)
test_roc_auc,0.500 (+/- 0.000),0.898 (+/- 0.011)
test_average_precision,0.187 (+/- 0.000),0.747 (+/- 0.018)


The logistic regression classifier **performed better** than the dummy classifier on all scoring metrics, which is a good sign. However, the **recall seems to be a bit low**.

<br><br>

## 7. Hyperparameter optimization 

In [13]:
lr_pipe = make_pipeline(
    preprocessor,
    LogisticRegression(max_iter=2000, random_state=123)
)

param_grid = {
    "logisticregression__C": loguniform(1e-2, 1e4),
    "logisticregression__class_weight": ["balanced", None],
    "columntransformer__countvectorizer-1__max_features": randint(low=50,
                                                                  high=250),
    "columntransformer__countvectorizer-2__max_features": randint(low=5_000,
                                                                  high=24_000)
}

random_search_lr = RandomizedSearchCV(
    lr_pipe,
    param_distributions=param_grid,
    scoring=scoring_metrics,
    refit="roc_auc",
    n_jobs=-1,
    n_iter=200,
    cv=5,
    random_state=123,
    return_train_score=True
)

random_search_lr.fit(X_train, y_train)

results_rs = pd.DataFrame(random_search_lr.cv_results_)

In [14]:
columns = ["mean_test_roc_auc", "mean_test_recall",
           "mean_test_f1", "mean_test_average_precision",
           "mean_test_precision", "mean_test_accuracy",
           "param_columntransformer__countvectorizer-1__max_features",
           "param_columntransformer__countvectorizer-2__max_features",
           "param_logisticregression__C",
           "param_logisticregression__class_weight",
           "mean_fit_time", "mean_score_time"]

ranked_results = (results_rs.set_index("rank_test_roc_auc")
                  .sort_index()[columns])

results["lr_hyp_opt"] = [
    ranked_results.iloc[1]["mean_fit_time"],
    ranked_results.iloc[1]["mean_score_time"],
    ranked_results.iloc[1]["mean_test_accuracy"],
    ranked_results.iloc[1]["mean_test_precision"],
    ranked_results.iloc[1]["mean_test_recall"],
    ranked_results.iloc[1]["mean_test_f1"],
    ranked_results.iloc[1]["mean_test_roc_auc"],
    ranked_results.iloc[1]["mean_test_average_precision"]
]

ranked_results[:5]

Unnamed: 0_level_0,mean_test_roc_auc,mean_test_recall,mean_test_f1,mean_test_average_precision,mean_test_precision,mean_test_accuracy,param_columntransformer__countvectorizer-1__max_features,param_columntransformer__countvectorizer-2__max_features,param_logisticregression__C,param_logisticregression__class_weight,mean_fit_time,mean_score_time
rank_test_roc_auc,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,0.897887,0.676655,0.671703,0.743387,0.667566,0.876649,223,20893,0.439014,balanced,0.386503,0.093603
2,0.897748,0.665491,0.66987,0.742367,0.674929,0.877639,139,20132,0.574004,balanced,0.479325,0.104191
3,0.897748,0.48206,0.606468,0.743722,0.819892,0.883245,157,17264,0.583232,,0.450844,0.107342
4,0.897667,0.674309,0.670205,0.742849,0.666732,0.876209,198,18805,0.488096,balanced,0.45297,0.101518
5,0.897637,0.470307,0.598939,0.74544,0.826116,0.882366,246,21103,0.465738,,0.537057,0.134701


In [15]:
print(random_search_lr.best_params_)
print(random_search_lr.best_score_)

{'columntransformer__countvectorizer-1__max_features': 223, 'columntransformer__countvectorizer-2__max_features': 20893, 'logisticregression__C': 0.43901433023426095, 'logisticregression__class_weight': 'balanced'}
0.8978865079428952


<br>

The best hyperparameter values are `max_features = 223` for the keyword feature, `max_features = 20893` for the text feature, `C = 0.439` and `class_weight="balanced"` for the logistic regression. The best cross-validation ROC AUC score found with these hyperparameter values was 0.898.

<br><br>

## 8. Feature engineering

Now, let us explore whether we can engineer new features which our model find useful in classifying the tweets.

Several basic length-related and sentiment features are engineered such as:
- Relative character length.
- Number of words.
- Sentiment of the tweet.
- Number of nouns.
- Number of proper nouns.
- Whether the tweet contains a number.

The metric used to measure the sentiment is the compound score in which a score of -1 corresponds to extremely negative and a score of +1 corresponds to extremely positive. This score is extracted using [Vader lexicon](https://github.com/cjhutto/vaderSentiment).

The number of nouns and proper nouns in the text which could be useful because a tweet that refers to a disaster is likely to include many nouns describing the location or time (e.g. Canada, park, Friday, etc.), whereas a tweet that does not refer to a disaster might not have as many nouns.

On the other hand, whether the tweet contains a number could be useful because tweets referring to an actual disaster might include numbers such as the year, or number of casualties etc.

In [16]:
def get_relative_length(text, TWITTER_ALLOWED_CHARS=280.0):
    """
    Returns the relative length of text.

    Parameters:
    ------
    text: (str)
    the input text

    Keyword arguments:
    ------
    TWITTER_ALLOWED_CHARS: (float)
    the denominator for finding relative length

    Returns:
    -------
    relative length of text: (float)

    """
    return len(text) / TWITTER_ALLOWED_CHARS


def get_length_in_words(text):
    """
    Returns the length of the text in words.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    length of tokenized text: (int)

    """
    return len(nltk.word_tokenize(text))


def get_sentiment(text):
    """
    Returns the compound score representing the sentiment of the given text: -1 (most extreme negative) and +1 (most extreme positive)
    The compound score is a normalized score calculated by summing the valence scores of each word in the lexicon.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    sentiment of the text: (str)
    """
    scores = sid.polarity_scores(text)
    return scores["compound"]

def get_number_of_nouns(text):
    """
    Returns the number of nouns in the text.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    number of nouns: (int)

    """
    tags = nltk.pos_tag(nltk.word_tokenize(text))
    return sum([1 if _[1] in ["NN", "NNP", "NNS", "NNPS"]
                else 0 for _ in tags])


def get_number_of_proper_nouns(text):
    """
    Returns the number of proper nouns in the text.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    number of proper nouns: (int)

    """
    tags = nltk.pos_tag(nltk.word_tokenize(text))
    return sum([1 if _[1] in ["NNP", "NNPS"]
                else 0 for _ in tags])


def has_numbers(text):
    """
    Returns whether the text has numbers or not.

    Parameters:
    ------
    text: (str)
    the input text

    Returns:
    -------
    whether text has numbers: (bool)

    """
    return 1 if any(char.isdigit() for char in text) else 0

In [17]:
train_df = train_df.assign(
    n_words=train_df["text"].apply(get_length_in_words))
train_df = train_df.assign(
    vader_sentiment=train_df["text"].apply(get_sentiment))
train_df = train_df.assign(
    rel_char_len=train_df["text"].apply(get_relative_length))
train_df = train_df.assign(
    n_nouns=train_df["text"].apply(get_number_of_nouns))
train_df = train_df.assign(
    n_proper_nouns=train_df["text"].apply(get_number_of_proper_nouns))
train_df = train_df.assign(
    has_number=train_df["text"].apply(has_numbers))

test_df = test_df.assign(
    n_words=test_df["text"].apply(get_length_in_words))
test_df = test_df.assign(
    vader_sentiment=test_df["text"].apply(get_sentiment))
test_df = test_df.assign(
    rel_char_len=test_df["text"].apply(get_relative_length))
test_df = test_df.assign(
    n_nouns=test_df["text"].apply(get_number_of_nouns))
test_df = test_df.assign(
    n_proper_nouns=test_df["text"].apply(get_number_of_proper_nouns))
test_df = test_df.assign(
    has_number=test_df["text"].apply(has_numbers))

<br><br>

## 9. Pipeline with engineered features

In [18]:
X_train_eng, y_train_eng = train_df.drop(columns=["target"]), train_df["target"]
X_test_eng, y_test_eng = test_df.drop(columns=["target"]), test_df["target"]

In [19]:
numeric_features = ["n_words", "vader_sentiment", "rel_char_len",
                    "n_nouns", "n_proper_nouns"]

passthrough_features = ["has_number"]

text_features = "keyword"
text_features2 = "text"

drop_features = "location"

opt_max_features1 = (
    random_search_lr
    .best_params_[
        "columntransformer__countvectorizer-1__max_features"])
opt_max_features2 = (
    random_search_lr
    .best_params_[
        "columntransformer__countvectorizer-2__max_features"])
opt_C = random_search_lr.best_params_["logisticregression__C"]
opt_class_weight = (
    random_search_lr
    .best_params_["logisticregression__class_weight"])

preprocessor_feat_eng = make_column_transformer(
    ("drop", drop_features),
    (StandardScaler(), numeric_features),
    ("passthrough", passthrough_features),
    (CountVectorizer(stop_words="english",
                     max_features=opt_max_features1),
     text_features),
    (CountVectorizer(stop_words="english",
                     max_features=opt_max_features2),
     text_features2)
)

pipe_lr_feat_eng = make_pipeline(
    preprocessor_feat_eng,
    LogisticRegression(
        max_iter=2000,
        C=opt_C,
        class_weight=opt_class_weight,
        random_state=123)
)

results["lr_feat_eng"] = mean_std_cross_val_scores(
    pipe_lr_feat_eng, X_train_eng, y_train_eng,
    scoring=scoring_metrics
)

pipe_lr_feat_eng.fit(X_train_eng, y_train_eng)

pd.DataFrame(results)

Unnamed: 0,dummy,lr_default,lr_hyp_opt,lr_feat_eng
fit_time,0.001 (+/- 0.000),0.291 (+/- 0.019),0.479325,0.418 (+/- 0.057)
score_time,0.004 (+/- 0.001),0.056 (+/- 0.001),0.104191,0.062 (+/- 0.001)
test_accuracy,0.813 (+/- 0.000),0.887 (+/- 0.005),0.877639,0.876 (+/- 0.002)
test_precision,0.000 (+/- 0.000),0.811 (+/- 0.012),0.674929,0.660 (+/- 0.002)
test_recall,0.000 (+/- 0.000),0.513 (+/- 0.036),0.665491,0.700 (+/- 0.021)
test_f1,0.000 (+/- 0.000),0.628 (+/- 0.026),0.66987,0.679 (+/- 0.009)
test_roc_auc,0.500 (+/- 0.000),0.898 (+/- 0.011),0.897748,0.898 (+/- 0.011)
test_average_precision,0.187 (+/- 0.000),0.747 (+/- 0.018),0.742367,0.743 (+/- 0.017)


<br>

The cross-validation scores are similar for the accuracy, ROC AUC and average precision. However, after feature engineering, the **recall and f-1 score increased** while the precision decreased. This is a good sign because false negatives are more harmful in this problem, and a larger recall indicates a lower number of false negatives. However, no large improvement was observed after feature engineering.

<br><br>

## 10. Model interpretation

In [20]:
column_names = (numeric_features + passthrough_features +
                (pipe_lr_feat_eng.named_steps["columntransformer"]
                 .named_transformers_["countvectorizer-1"]
                 .get_feature_names_out().tolist()) +
                (pipe_lr_feat_eng.named_steps["columntransformer"]
                 .named_transformers_["countvectorizer-2"]
                 .get_feature_names_out().tolist()))

coefs = pd.DataFrame(
    np.squeeze(pipe_lr_feat_eng.named_steps["logisticregression"].coef_),
    index=column_names,
    columns=["Coefficients"]
)

coefs["abs_coef"] = np.abs(coefs["Coefficients"])
coefs = coefs.sort_values(by="abs_coef", ascending=False)

pd.DataFrame(coefs[:10]["Coefficients"])

Unnamed: 0,Coefficients
thunderstorm,1.672021
survived,1.550044
windstorm,1.540446
died,1.514138
rescued,1.484401
collision,1.388583
road,1.337435
ukrainian,1.315033
carried,1.263005
heavy,1.212298


<br>

Some of the coefficients are expected because they are closely related to disasters, for instance, the words survived, died, rescued are all very grave and are likely to be found in tweets that refer to an actual disaster. This applies to the other features with large coefficients as well. However, there are some words like ukrainian that have a large coefficient which is a bit strange, but it is possible that in the data set, tweets containing the word ukrainian are strongly associated with actual disasters.

<br><br>

## 11. Test set

In [21]:
preds = pipe_lr_feat_eng.predict(X_test_eng)
soft_preds = pipe_lr_feat_eng.predict_proba(X_test_eng)

test_results = pd.DataFrame({
    "accuracy": [pipe_lr_feat_eng.score(X_test_eng, y_test_eng)],
    "precision": [precision_score(y_test_eng, preds)],
    "recall": [recall_score(y_test_eng, preds)],
    "f1_score": [f1_score(y_test_eng, preds)],
    "roc_auc": [roc_auc_score(y_test_eng, soft_preds[:, 1])],
    "average_precision": [average_precision_score(y_test_eng, soft_preds[:, 1])]
})

test_results

Unnamed: 0,accuracy,precision,recall,f1_score,roc_auc,average_precision
0,0.890941,0.677419,0.762712,0.71754,0.924916,0.784924


<br>

The test scores are similar to the scores obtained from cross-validation which suggests that our model has **good generalisation** and it is **not overfitting** to the training data.