[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W4E_NonTextual_Information.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install spacy transformers sentencepiece datasets scikit-learn pandas
!python -m spacy download en_core_web_sm

# Combining Textual and Non-textual Features in NLP Models

*Based on the Text Classification tutorial by [Debora Nozza](https://dnozza.github.io/)*

In many real-world applications, text is just one of several sources of information that can be used to predict desired quantities. In this exercise, you will reproduce a standard machine learning pipeline integrating text and non-textual information. Importantly, while in the previous tutorials and exercises we focused on the usage of advanced tooling such as the Transformers library, here we will start from the basics to establish some baseline results using the popular [Scikit-learn](https://scikit-learn.org/stable/index.html) library.

**Exercise 1**, which is mandatory and will be part of your graded midterm portfolio, will include the following steps:

1. Preprocess the text to extract lemmatized content words.

2. Convert the text to a vector representation using simple count-based approaches.

3. Convert categorical features into one-hot vectors.

4. Fit a simple model to predict the desired target.

5. Establish a simple baseline performance for the prediction task.

6. Evaluate the model performance on a held-out set.

7. Perform feature selection and re-evaluate the model.

8. Obtain insights about salient words and features for the prediction task.

Every operation to be completed is marked with a `TODO` comment in the code section. While these represent a small but nonetheless comprehensive set of steps to train models on textual and non-textual information, nowadays NLP practitioners operate mainly with pre-trained word embeddings for representing textual information. 

In **Exercise 2** (optional), you will be asked to replace our text representation with a modern transformer-based model and evaluate the difference with respect to previous results. As always, we recommend that you complete this optional exercise, especially if you plan to deal with non-textual data (e.g. quality estimation scores, translators' behavioral data) during your final project.

# Exercise 1: A Simple Wine Scoring Pipeline

In this exercise, we will use a filtered version of the [Winemag dataset](https://www.kaggle.com/zynicide/wine-reviews) containing a collection of wine reviews (`description`), accompanied by some metadata: `country` and `province` of provenance, `variety` of wine, `price` per bottle and the WineEnthusiast rating (`points`) describing the wine quality.

*Your final goal is to build and evaluate a simple linear regression model that predicts the `points` assigned to a wine given its `description` and its other features.* This is commonly known as a **regression problem**, since you are trying to predict a continuous quantity, as opposed to a discrete one (e.g. a class label).

Most importantly, the general procedure and methods you will use can be applied to any kind of data with adequate preprocessing, and can be extended to other tasks such as binary and multiclass classification.

You can have a look at the data, which has been conveniently packed into a Hugging Face `Dataset` object:

In [1]:
from datasets import load_dataset

data = load_dataset("GroNLP/ik-nlp-22_winemag")
print(data)
data["train"].to_pandas().head()


DatasetDict({
    train: Dataset({
        features: ['index', 'country', 'description', 'points', 'price', 'province', 'variety'],
        num_rows: 70458
    })
    test: Dataset({
        features: ['index', 'country', 'description', 'points', 'price', 'province', 'variety'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['index', 'country', 'description', 'points', 'price', 'province', 'variety'],
        num_rows: 5000
    })
})





| | index | country | description | points | price | province | variety |
|---|---|---|---|---|---|---|---|
| 0 | 129857 | US | Dusty tannins make for a soft texture in this ... | 90 | 44.0 | California | Merlot |
| 1 | 112217 | US | Sweet-tart Maraschino cherry and bitter brambl... | 85 | 14.0 | New York | Pinot Noir |
| 2 | 114216 | France | A lightly orange-colored rosé that is made by ... | 92 | 90.0 | Champagne | Champagne Blend |
| 3 | 37808 | France | A ripe wine that is almost off dry, this has a... | 85 | 17.0 | Bordeaux | Bordeaux-style Red Blend |
| 4 | 31157 | US | Crisp and very floral, this is a beautiful sho... | 92 | 20.0 | California | Pinot Gris |


### Text Preprocessing

Text is messy. The goal of preprocessing is to reduce the amount of noise (= unnecessary variation), while maintaining the signal. There is no one-size-fits-all solution, but a good approximation can be, for example, to preserve only content words and reduce the size of the vocabulary by means of a lemmatizer.

You learned how to extract lemmas and POS tags using spaCy, and how to use `.map` to apply a function to a `Dataset`, so you should have all the tools to succeed in this. Fill in the missing code:

In [None]:
import spacy

# Disable unused components
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

def clean_text(text):
    '''Reduce text to lower-case lemmatized content words.'''
    content_op = ['NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN']
    # TODO: Extract the lemmas for content words only
    lemmas = None
    return ' '.join(lemmas)

clean_text('This is a test sentence. And here comes another one... Go me!')
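To make the goal of the filter concrete without relying on a spaCy model, the following sketch applies the same idea to a hand-tagged sentence. The `(text, lemma, pos)` triples and the `keep_content_lemmas` helper are hypothetical stand-ins for what a tagger would produce; they are not part of spaCy's API.

```python
# Hand-written (text, lemma, POS) triples standing in for tagger output.
# These annotations are illustrative assumptions, not produced by spaCy.
tagged = [
    ("The", "the", "DET"),
    ("wines", "wine", "NOUN"),
    ("were", "be", "AUX"),
    ("aging", "age", "VERB"),
    ("gracefully", "gracefully", "ADV"),
    (".", ".", "PUNCT"),
]

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

def keep_content_lemmas(tokens):
    """Return lower-cased lemmas of tokens whose POS tag is a content class."""
    return " ".join(lemma.lower() for _, lemma, pos in tokens if pos in CONTENT_POS)

cleaned = keep_content_lemmas(tagged)
print(cleaned)
```

Function words, auxiliaries and punctuation are dropped, and inflected forms collapse onto their lemma, which is what shrinks the vocabulary.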

Let's now apply this cleaning function to the `description` column of the dataset:

In [None]:
for split in data.keys():
    # TODO: Use .map to apply the clean_text function to the
    # description column, mapping the output to a new clean_text column
    # and removing the original description column.
    data[split] = None

### Representing Text

Now that you have a more compact representation of the text, the next step is converting it into a vector representation. For this purpose, you will use the [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class provided by Scikit-learn. This class converts a collection of text documents to a matrix of [TF-IDF scores](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) reflecting the importance of a word in a document, and in relation to the full corpus. We are going to set some parameters to ensure a limited size of the vocabulary, but you can experiment with other values.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1,2), # Use 1-grams and 2-grams.
    min_df=0.001,      # Ignore terms that appear in fewer than 0.1% of the documents.
    max_df=0.75,       # Ignore terms that appear in more than 75% of the documents.
    max_features=1000, # Keep only the 1000 most frequent terms (1-grams and 2-grams).
    stop_words='english'
)

# TODO: Apply the vectorizer to the clean_text column of the train split
# by converting the Dataset object to pandas and using .fit_transform.
# For the other splits, reuse the fitted vectorizer with .transform only.
# Remember a Dataset object can be converted to Pandas at any
# time by calling .to_pandas(). The output of the vectorizer is a sparse matrix
# in Compressed Sparse Row format (CSR), so you will need to apply the .toarray()
# method to convert it to a regular NumPy array before building the DataFrame.
text_vectors = None

# Converting the text vectors to a pandas dataframe
# Every column is a word, e.g. w_wine, w_glass, etc.
text_vectors = pd.DataFrame(
    text_vectors,
    columns=["w_" + w for w in vectorizer.get_feature_names_out()]
)

### Categorical -> One-Hot Conversion

Many of the available features in the dataset are categorical, and you will need to convert them to [one-hot vectors](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) in order to use them in a regression model. Luckily, the [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) class is readily available in Scikit-learn for this purpose. An even more convenient approach is to use the [`pandas.get_dummies`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, which returns a pandas DataFrame with labeled one-hot encoded columns.

In [None]:
import pandas as pd

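# TODO: Encode the fields `country`, `province` and `variety`
# of each data split separately into one-hot vectors
# using pd.get_dummies(). Note that the resulting columns may differ
# across splits and must match the train columns before prediction.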
country_vectors = None
province_vectors = None
variety_vectors = None
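One practical wrinkle with `pd.get_dummies` is that each split only produces columns for the categories it actually contains. The sketch below, using made-up country values, shows one possible way to align a test split to the training columns with `DataFrame.reindex`; this is an illustrative choice, not necessarily the approach intended by the exercise.

```python
import pandas as pd

train = pd.DataFrame({"country": ["US", "France", "Italy", "US"]})
# "Spain" is unseen in training, while "US" and "Italy" are absent here.
test = pd.DataFrame({"country": ["France", "Spain"]})

train_onehot = pd.get_dummies(train["country"], prefix="country", dtype=int)
test_onehot = pd.get_dummies(test["country"], prefix="country", dtype=int)

# Align the test columns to the training columns: categories missing from
# the test split become all-zero columns, unseen categories are dropped.
test_onehot = test_onehot.reindex(columns=train_onehot.columns, fill_value=0)

print(train_onehot.columns.tolist())
print(test_onehot)
```

After alignment, a row with an unseen category ("Spain") is encoded as all zeros, so the model treats it as having no known country.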

### Putting it all Together and Fitting a Model

Now that all the data is ready to be processed, create two pandas dataframes to train the model: `features` should be the concatenation of the `price` field with all the vectorized features (`text_vectors`, `country_vectors`, `province_vectors`, `variety_vectors`), while `target` should contain only the `points` column.

Finally, you will train a simple linear model using the `fit` method of an instance of the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model, which fits a regular least-squares linear regression to the features. A regression model is simply a function that takes a set of numeric values, called **features**, as input, and returns an output score. Fitting a model is the process of finding the right parameters, called **weights**, to map the input features to the output targets.

In [None]:
# TODO: Create the features and target dataframes for the train split
features = None
target = None

from sklearn.linear_model import LinearRegression

regressor = LinearRegression(n_jobs=-1)
regressor.fit(features, target)
print(regressor)
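The following sketch illustrates the concatenate-then-fit pattern on a handful of made-up rows. The column names and values are hypothetical, and the target is built from known weights so that the learned coefficients can be checked by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy blocks of features sharing the same row index (values are made up).
price = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]})
text_block = pd.DataFrame({"w_ripe": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]})
country_block = pd.DataFrame({"country_US": [1, 1, 0, 0, 1, 0]})

# Column-wise concatenation: one row per wine, one column per feature.
toy_features = pd.concat([price, text_block, country_block], axis=1)

# Target generated as 80 + 0.1 * price + 2 * w_ripe (country has no effect).
toy_target = pd.DataFrame(
    {"points": 80 + 0.1 * price["price"] + 2.0 * text_block["w_ripe"]}
)

toy_regressor = LinearRegression().fit(toy_features, toy_target)
print(toy_features.shape)
print(np.round(toy_regressor.coef_, 3), np.round(toy_regressor.intercept_, 3))
```

Because the target is a single-column dataframe, `coef_` has shape `(1, n_features)`, which is why the weights are read from its first row.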

### A Simple Baseline

Before evaluating the performance of your fitted model, you might want to establish a reasonable **baseline**, representing a null-hypothesis choice. For regression, a common statistical baseline predicts the mean of the training targets, which minimizes the squared prediction error in the absence of any information other than the distribution of target values. The Scikit-learn library implements a [`DummyRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html) that can be used to fit various regression baselines, including the mean.

In [None]:
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy="mean")

# TODO: Fit the baseline to the same data used with the regressor
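To see what the mean baseline does, here is a small sketch on made-up scores. The arrays `X_toy` and `y_toy` are assumptions for illustration; the point is that the baseline ignores its input features entirely.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

X_toy = np.zeros((4, 2))                    # features are ignored by the baseline
y_toy = np.array([85.0, 90.0, 92.0, 89.0])  # made-up training scores

toy_baseline = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

# Whatever the input looks like, the prediction is the training mean.
preds = toy_baseline.predict(np.ones((3, 2)))
print(preds)
```

Any model that cannot beat this constant prediction has not learned anything useful from the features.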

### Evaluation

Having a model is great, but how well does it do? Can it predict what it has not seen? We need a way to estimate how well the model will work on new data. We will use two metrics: the [**mean absolute error**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) (MAE), the average magnitude of the prediction error across all tested instances, and the [**mean squared error**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) (MSE), where the prediction error is made positive by squaring it rather than taking its absolute value. The first is expressed in the same units as the target, which makes it easier to interpret, and is more robust to outliers; the second upweights large errors through the squaring operation.

Applying a fitted model to new (held-out) data is called prediction. We reuse the weights we have learned before on a new data matrix to predict the new outcomes.

**Important**: the new data needs to have the same number of features! This means using the same vectorizers you fitted on the training split, using only the `.transform` method on the test split.

If you didn't apply the vectorization procedure described above to all the splits in `data`, do it now so as to obtain a `features` and `target` dataframe for each split. In the following, we will use `test_features` and `test_target` to evaluate the model using the `.predict` method.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# TODO: Repeat the same steps used above for the test split
# ...
test_features = None
test_target = None

# TODO: Use the regressor and the baseline to predict the test target
# using the predict method.
regressor_predictions = None
baseline_predictions = None

# Print the scores
for metric in [mean_absolute_error, mean_squared_error]:
    print("Linear regressor", metric.__name__, metric(test_target, regressor_predictions))
    print("Mean baseline", metric.__name__, metric(test_target, baseline_predictions))
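The difference between the two metrics is easiest to see on a tiny worked example. In the sketch below (all numbers are made up), two sets of predictions have the same total absolute error, but one concentrates it in a single large mistake.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [90, 90, 90, 90]
small_errors = [89, 91, 89, 91]  # every prediction is off by 1 point
one_outlier = [90, 90, 90, 94]   # three exact predictions, one off by 4 points

for name, y_pred in [("small errors", small_errors), ("one outlier", one_outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name}: MAE={mae:.2f} MSE={mse:.2f}")
```

Both cases have an MAE of 1 point, while the MSE rises from 1 to 4 when the error is concentrated in one instance, since (0 + 0 + 0 + 16) / 4 = 4.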

### Better features = Better model

We now have a lot of features! Some are simply TF-IDF scores for words that are likely unrelated to the wine quality, so we might want to discard most of them. Let's select the top 500 based on how well they predict the outcome of the training data.

For this purpose, you will use two tools from sklearn: the [`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) class (the selection algorithm) and the [`chi2`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) function (the selection criterion). Using them in combination will allow you to remove features that are most likely to be independent of the target, and thus not useful for the model.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

selector = SelectKBest(chi2, k=500).fit(features, target)
filtered_features = selector.transform(features)
print(filtered_features.shape)

# TODO: Fit another linear regression model to
# the filtered features, and compare the performance with the
# previous system using the full set of features.
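The sketch below runs the same selector on four made-up binary features to show how the scores behave. Note that `chi2` expects non-negative feature values and treats the target as a set of discrete classes, which works here because the scores are integers. All names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

toy_X = pd.DataFrame({
    "w_elegant": [1, 1, 1, 0, 0, 0],  # present only in high-scoring wines
    "w_wine":    [1, 1, 1, 1, 1, 1],  # present everywhere, carries no signal
    "w_thin":    [0, 0, 0, 1, 1, 1],  # present only in low-scoring wines
    "w_red":     [1, 0, 1, 0, 1, 0],  # only loosely related to the target
})
toy_y = np.array([92, 92, 92, 84, 84, 84])  # integer points, treated as classes

toy_selector = SelectKBest(chi2, k=2).fit(toy_X, toy_y)

# get_support() returns a boolean mask over the original columns.
kept = toy_X.columns[toy_selector.get_support()].tolist()
print(dict(zip(toy_X.columns, np.round(toy_selector.scores_, 2))))
print(kept)
```

The feature that appears in every document gets a score of zero, since its distribution is identical across target values, while the two features that perfectly separate the groups are kept.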

### Obtaining Insights about Salient Features

In this exercise we used a simple linear regression model to predict the `points` of a wine given its `description` and its other features. A strength of linear models is that they are highly interpretable: the coefficient assigned to each feature expresses how much, and in which direction, that feature contributes to the predicted target. For example, a vectorized word with a large positive coefficient in the trained regression model pushes the predicted quality score of the wine upward.

In [None]:
# Get the names of the top 500 features kept by the selector
labels = list(features.columns[selector.get_support()])

# TODO: Build and print a dataframe containing the top 500 features
# and their respective coefficients, sorted from highest to lowest.
# Use the regressor fitted on the filtered features: its coefficients
# can be accessed as .coef_[0] when the target is a single-column dataframe.
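As a rough sketch of how coefficients can be read off a fitted linear model, the example below pairs feature names with weights on made-up data. The target is generated from known weights, so the ranking can be verified; `toy_model` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

toy_X = pd.DataFrame({
    "w_elegant": [1, 0, 1, 0, 1, 0],
    "w_thin":    [0, 1, 0, 1, 0, 0],
    "price":     [30.0, 10.0, 50.0, 15.0, 20.0, 25.0],
})
# Made-up target: +3 for "elegant", -2 for "thin", +0.05 per unit of price.
toy_y = pd.DataFrame(
    {"points": 85 + 3 * toy_X["w_elegant"] - 2 * toy_X["w_thin"] + 0.05 * toy_X["price"]}
)

toy_model = LinearRegression().fit(toy_X, toy_y)

# Pair each feature name with its learned weight and sort from highest to lowest.
coefs = pd.DataFrame({"feature": toy_X.columns, "coefficient": toy_model.coef_[0]})
coefs = coefs.sort_values("coefficient", ascending=False).reset_index(drop=True)
print(coefs.round(3))
```

Keep in mind that coefficients are only directly comparable between features on similar scales: the small weight on `price` here reflects its larger numeric range, not necessarily a smaller influence.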

To conclude the exercise, comment on the results from the previous operation and the usefulness of textual features in predicting the `points` of a wine.

# (Optional) Exercise 2: Better Text Features for Wine Scoring

In this exercise, you will repeat the same procedure as in the previous exercise, but you will use a pre-trained transformer model via the Hugging Face Transformers library instead of the `TfidfVectorizer`.

Remember that extracting embeddings from pretrained models is easily done with `pipeline("feature-extraction")`, but this can be a very time-consuming process even when using GPU accelerators. For this reason, consider using small models (e.g. DistilBERT) and possibly selecting a subset of the training instances to make the process faster.

**Important**: Since you are now using a model trained on naturally-occurring text, skip the cleaning step from Exercise 1 and feed the original `description` text to the model.

In [None]:
# TODO: Reproduce the same pipeline presented above using
# features extracted from a transformer model of your choice.
# You will still use a linear regressor; only the text features
# (previously the TF-IDF vectors) will change.

# TODO: Compare the performance of the new model with the original
# models and baselines. Comment whether it is still possible to
# understand the importance of different terms in this new setting.
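Transformer models produce one vector per token, whereas the regressor needs one fixed-size vector per review. A common way to bridge the two, shown here as a sketch, is to average the token vectors (mean pooling). No model is loaded below: random arrays stand in for model output, and `mean_pool`, `token_embeddings` and `pooled_vectors` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_SIZE = 768  # hidden size of DistilBERT-base; adjust for other models

# Stand-ins for per-token embeddings of three reviews of different lengths.
# A real model would produce these; random values are used here on purpose.
token_embeddings = [
    rng.normal(size=(n_tokens, HIDDEN_SIZE)) for n_tokens in (12, 7, 20)
]

def mean_pool(tokens):
    """Average token vectors into a single fixed-size vector."""
    return tokens.mean(axis=0)

# One row per review, one column per hidden dimension.
pooled_vectors = np.stack([mean_pool(t) for t in token_embeddings])
print(pooled_vectors.shape)
```

The resulting matrix can take the place of the TF-IDF dataframe in the pipeline, although its columns no longer correspond to individual words.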