# Hands-On Lab 1 - Disaster Tweets

#### Welcome to the first hands-on lab of this NLP Workshop. In this task, you will have the opportunity to apply some of the concepts you have learned during Class 1 regarding Text Processing and Feature Extraction from Text. For this, we have prepared a text classification task where you will try to distinguish between tweets that talk about real disasters, and those that do not.

Import Packages

In [1]:
import pandas as pd

Load Data

In [3]:
train_set = pd.read_csv("../data/lab1_train.csv")
test_set = pd.read_csv("../data/lab1_test.csv")

What is a disaster tweet? Here are a few examples:

In [None]:
print(f"Disaster Tweet #1: {train_set[train_set['target'] == 1]['text'].values[1]}")
print(f"Disaster Tweet #2: {train_set[train_set['target'] == 1]['text'].values[3]}")
print(f"Disaster Tweet #3: {train_set[train_set['target'] == 1]['text'].values[9]}")

And here are a few examples of what is NOT a disaster tweet:

In [None]:
print(f"Non-Disaster Tweet #1: {train_set[train_set['target'] == 0]['text'].values[1]}")
print(f"Non-Disaster Tweet #2: {train_set[train_set['target'] == 0]['text'].values[3]}")
print(f"Non-Disaster Tweet #3: {train_set[train_set['target'] == 0]['text'].values[9]}")

So how can we distinguish them without knowing their label? Using NLP, of course! Let's start by applying some of the processing techniques you have learned.

## Text Preprocessing

Complete the function "preprocess_text" so that it returns a list of clean tokens when given a chunk of text (a sentence or a tweet). Some preprocessing steps that are specific to tweets are already implemented.

###### Note: As you can see, the regular expressions package (re) is very useful for implementing custom preprocessing.

In [299]:
import re

def preprocess_text(text):
    """Takes a text and returns the processed version of it.

    Args:
        text (str): raw text

    Returns:
        list: clean tokens representing the content of text
    """
    # remove tweet username
    text = re.sub(r'\@[^\s\n\r]+', '', text)
    # remove stock market tickers like $GE (disabled; uncomment the next line to enable)
    #text = re.sub(r'\$\w*', '', text)
    # remove retweet text "RT"
    text = re.sub(r'^RT[\s]+', '', text)
    # remove hyperlinks
    text = re.sub(r'https?://[^\s\n\r]+', '', text)
    # remove hashtags (only the hash # sign)
    processed_text = re.sub(r'#', '', text)

    # Add more preprocessing steps below:
    
    # Remember the preprocessing techniques you learned:
    # - Tokenization
    # - Stopwords Removal
    # - Stemming/Lemmatization

    # You can also implement other preprocessing you may find useful

    return processed_text
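
One possible way to fill in the remaining steps is sketched below, using only the standard library. It is not the reference solution: the name preprocess_text_sketch, the tiny STOPWORDS set, and the token pattern are assumptions made for illustration, and stemming/lemmatization (for example with NLTK) is deliberately left out.

```python
import re

# Tiny illustrative stopword list (an assumption; use a complete list in practice).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "at",
             "of", "to", "and", "or", "for", "it"}


def preprocess_text_sketch(text):
    """Hypothetical completion: tweet cleanup, lowercasing, tokenization, stopword removal."""
    # Tweet-specific cleanup, mirroring the steps already provided in the lab.
    text = re.sub(r'@[^\s]+', '', text)          # usernames
    text = re.sub(r'^RT\s+', '', text)           # retweet marker
    text = re.sub(r'https?://[^\s]+', '', text)  # hyperlinks
    text = re.sub(r'#', '', text)                # hash sign only, keep the word

    # Tokenization: lowercase, then keep runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())

    # Stopword removal.
    return [token for token in tokens if token not in STOPWORDS]


example = "RT @user: There is a forest fire in the #park https://t.co/abc"
print(preprocess_text_sketch(example))  # ['there', 'forest', 'fire', 'park']
```

Returning a list of tokens (rather than a string) matters later, because the vectorizers in this lab receive the preprocessing function as their tokenizer.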

Test the function on a randomly sampled tweet from the training dataset (run the cell a few times to really see the impact).

In [None]:
from random import randint

# randint is inclusive on both ends, so subtract 1 to stay within bounds
random_tweet = train_set['text'].values[randint(0, train_set.shape[0] - 1)]
print(f"Original Tweet: {random_tweet}")
print(f"Processed Tweet: {preprocess_text(random_tweet)}")

Now that you have a function that cleans each entry in the dataset, you can proceed to extract features. Which of the methods discussed in class do you think is best suited to distinguishing disaster tweets from regular tweets? Can you think of any custom feature that might help in this specific context? In the following cells, you will try several ways of vectorizing the data.

## Feature Extraction

### Bag-of-Words

To implement a bag-of-words vectorization, we will use the CountVectorizer class from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

###### Note: This vectorizer expects the preprocessing function to return a list of tokens. If you have not implemented tokenization yet, you must rethink your preprocessing methodology.

In [199]:
from sklearn import feature_extraction

def get_bow_representations(train_samples, test_samples, tokenizer):
    """Returns a bag-of-words based representation of both the train and test samples.

    Args:
        train_samples (list): List of training samples.
        test_samples (list): List of test samples.
        tokenizer (callable): A preprocessing function that takes a string and returns a list of tokens.

    Returns:
        train_vectors, test_vectors: vectorized representations of the train and test sets, according to the BOW method.
    """

    count_vectorizer = feature_extraction.text.CountVectorizer(tokenizer=tokenizer)

    train_vectors = count_vectorizer.fit_transform(train_samples)

    test_vectors = count_vectorizer.transform(test_samples)

    return train_vectors, test_vectors
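
To see what the function above does, here is a minimal sketch on a made-up corpus. The documents and the use of str.split as the tokenizer are assumptions for illustration only. It shows that the vocabulary is learned from the training samples alone, so every vector has one column per unique training token and unseen test words are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration).
train_docs = ["fire in the forest", "the forest is calm"]
test_docs = ["flood in the city"]

# str.split stands in for a real preprocessing function that returns tokens.
vectorizer = CountVectorizer(tokenizer=str.split)

# fit_transform learns the vocabulary from the training data only.
train_vectors = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; "flood" and "city" have no column.
test_vectors = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))  # the unique training tokens
print(train_vectors.shape)             # (number of documents, vocabulary size)
print(test_vectors.toarray())          # only "in" and "the" are counted
```

Fitting on the training set and only transforming the test set keeps the two representations aligned and avoids leaking test vocabulary into the model.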

Test this function and check the dimension of the resulting vectors:

In [None]:
train_vectors, test_vectors = get_bow_representations(train_set['text'], test_set['text'], preprocess_text)
print(train_vectors[0].shape)

#### Considering the result of the previous cell, what is the number of unique words in the entire preprocessed train dataset?

Answer here: 

### TF-IDF

The TF-IDF implementation is similar to the Bag-of-Words one, except that we use the TfidfVectorizer class (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [242]:
def get_tfidf_representations(train_samples, test_samples, tokenizer):
    """Returns a tf-idf based representation of both the train and test samples.

    Args:
        train_samples (list): List of training samples.
        test_samples (list): List of test samples.
        tokenizer (callable): A preprocessing function that takes a string and returns a list of tokens.

    Returns:
        train_vectors, test_vectors: vectorized representations of the train and test sets, according to the TF-IDF method.
    """

    tfidf_vectorizer = feature_extraction.text.TfidfVectorizer(tokenizer=tokenizer)

    train_vectors = tfidf_vectorizer.fit_transform(train_samples)

    test_vectors = tfidf_vectorizer.transform(test_samples)

    return train_vectors, test_vectors
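
The difference from plain counts can be illustrated with a small sketch on a made-up corpus (the documents and the str.split tokenizer are assumptions). With the default TfidfVectorizer settings, a word that appears in every document receives a lower weight than a word that appears in only one, and each row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration): "the" and "forest" occur in both documents.
docs = ["fire in the forest", "the forest is calm"]

vectorizer = TfidfVectorizer(tokenizer=str.split)
vectors = vectorizer.fit_transform(docs)

# Map each term to its weight in the first document.
first_row = vectors[0].toarray().ravel()
weights = {term: first_row[index] for term, index in vectorizer.vocabulary_.items()}

print(weights["fire"], weights["the"])  # the rarer word gets the larger weight
print(np.linalg.norm(first_row))        # rows have unit length by default
```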


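Test this function and check the dimension of the resulting vectors:

In [None]:
train_vectors, test_vectors = get_tfidf_representations(train_set['text'], test_set['text'], preprocess_text)
print(train_vectors[0].shape)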

#### Considering the result of the previous cell, what is the number of unique words in the entire preprocessed train dataset?

Answer here: 

## Models

This class focused on preprocessing and feature extraction. However, you still need a model to perform the main task: text classification of disaster tweets. We therefore provide prediction functions for two baseline models: Naive Bayes and Logistic Regression.

#### Naive Bayes Implementation

In [271]:
from sklearn.naive_bayes import MultinomialNB

def get_nb_predictions(train_samples, train_labels, test_samples):
    """Simple implementation of a Naive Bayes classifier.

    Args:
        train_samples (sparse matrix): Vectorized training tweets.
        train_labels (array-like): Training labels.
        test_samples (sparse matrix): Vectorized test tweets.

    Returns:
        preds: Predictions against the test set.
    """

    nb_model = MultinomialNB()

    nb_model.fit(train_samples, train_labels)

    return nb_model.predict(test_samples)

#### Logistic Regression Implementation

In [289]:
from sklearn.linear_model import LogisticRegression

def get_lr_predictions(train_samples, train_labels, test_samples):
    """Simple implementation of a Logistic Regression classifier.

    Args:
        train_samples (sparse matrix): Vectorized training tweets.
        train_labels (array-like): Training labels.
        test_samples (sparse matrix): Vectorized test tweets.

    Returns:
        preds: Predictions against the test set.
    """

    lr_model = LogisticRegression()

    lr_model.fit(train_samples, train_labels)

    return lr_model.predict(test_samples)
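
As a quick sanity check of how these two baselines behave, the sketch below fits both on a tiny made-up corpus. The tweets and labels are invented for illustration, and the default CountVectorizer tokenization is used instead of the lab's preprocessing function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy data (made up): 1 = disaster, 0 = not a disaster.
train_texts = ["fire flood earthquake", "flood evacuation fire",
               "pizza party tonight", "love this pizza"]
train_labels = [1, 1, 0, 0]
test_texts = ["earthquake and flood", "pizza tonight"]

# Both models need numerical input, so vectorize first.
vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_texts)
x_test = vectorizer.transform(test_texts)

# Fit each baseline on the same vectors and predict the test samples.
nb_preds = MultinomialNB().fit(x_train, train_labels).predict(x_test)
lr_preds = LogisticRegression().fit(x_train, train_labels).predict(x_test)

print(nb_preds)  # [1 0]
print(lr_preds)  # [1 0]
```

On real tweets the two models will not always agree, which is one reason to try both before settling on a pipeline.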

## Build your Pipeline!

Time to bring it all together! You have your preprocessing function, your feature extractors, and your models. Now you can combine them following the structure of the traditional NLP pipeline you learned about in this first class.

1. Start by getting your data in the correct format, i.e. a set of training tweets, a set of training labels, a set of test tweets, and a set of test labels.

In [285]:
# Get your train and test tweets and corresponding labels
x_train, y_train, x_test, y_test = # COMPLETE THIS

2. Preprocess and get a numerical representation of the tweets. You should perform these two steps at once, since the vectorizers accept a preprocessing function as input.

In [None]:
# Vectorize your text
# Notice that you don't have to apply preprocessing first, because the vectorizers apply it themselves.
# You must pass your preprocessing function to the feature extraction function, though.
x_train_vectorized, x_test_vectorized = # COMPLETE THIS

3. Get predictions against the test set, using one of the pre-implemented models.

In [None]:
# Train your model and get predictions against the test samples
preds = # COMPLETE THIS

4. Check the performance you achieve through the selected pipeline.

In [None]:
# Evaluate your model in terms of balanced accuracy and F1-score
from sklearn.metrics import balanced_accuracy_score, f1_score
print(f"Balanced accuracy: {round(balanced_accuracy_score(y_test, preds), 5)}")
print(f"F1-score: {round(f1_score(y_test, preds), 5)}")
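
Note that balanced_accuracy_score is not the same as plain accuracy: it averages the recall obtained on each class, so it is less flattering when the classes are imbalanced. A small sketch with made-up labels and predictions shows the difference between the three metrics.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Made-up, imbalanced example: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# The classifier finds only one of the two positives.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)                    # 9 of 10 correct
balanced_accuracy = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
f1 = f1_score(y_true, y_pred)                                # harmonic mean of precision and recall

print(f"Accuracy: {accuracy:.3f}")                    # 0.900
print(f"Balanced accuracy: {balanced_accuracy:.3f}")  # 0.750
print(f"F1-score: {f1:.3f}")                          # 0.667
```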

If you reached this cell without any issues, you probably saw that this simple pipeline achieves quite a reasonable F1-score. Congratulations, you just built a decent text classifier!

Challenge: Can you improve your results? Try adding steps to the preprocessing function, or extracting new, custom features in this pipeline, and see what performance you can reach.
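
As a starting point for the challenge, the sketch below shows one way to append hand-crafted features to a bag-of-words matrix. The handcrafted_features function and the four features it computes are hypothetical choices for illustration, not part of the lab, and the tweets are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer


def handcrafted_features(tweets):
    """Hypothetical per-tweet features: length, '!' count, all-caps words, has link."""
    rows = []
    for tweet in tweets:
        words = tweet.split()
        rows.append([
            len(tweet),                                           # length in characters
            tweet.count("!"),                                     # exclamation marks
            sum(1 for w in words if len(w) > 1 and w.isupper()),  # all-caps words
            int("http" in tweet),                                 # contains a link
        ])
    return np.array(rows, dtype=float)


# Toy tweets (made up for illustration).
tweets = ["HUGE fire downtown!!! http://t.co/x", "what a lovely day"]

bow = CountVectorizer().fit_transform(tweets)
extra = csr_matrix(handcrafted_features(tweets))

# Stack the extra columns to the right of the bag-of-words columns.
combined = hstack([bow, extra]).tocsr()
print(bow.shape, combined.shape)
```

These features must be computed on the raw tweets, before the preprocessing step removes links and punctuation. Raw lengths are on a different scale than word counts, so scaling them may help some models.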