<a href="https://colab.research.google.com/github/kalue23/Exercises-Uni/blob/main/06_natural_language_processing_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

## Learning Objectives

* Being able to transform text so that we can apply machine learning methods to it.
* Being able to classify emotions in text.
* Being able to evaluate the performance of a classifier that classifies more than two classes.

## Text as Data

We are using a dataset of English sentences, each assigned one of six possible emotions (sadness, joy, love, anger, fear, and surprise). The dataset is a subset of the [Emotions](https://www.kaggle.com/datasets/bhavikjikadara/emotions-dataset?resource=download) dataset, in which each emotion is equally represented (with 1,000 examples each). We want to use this dataset to train a classifier that can recognize emotions in text.

*The dataset may be used, modified, and published under a [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).*

In [None]:
import pandas as pd
url = "https://drive.google.com/uc?id=1vfgHvGBMOAyozxlbwTGElPcyofeqjP9w"
emotions = pd.read_csv(url)
emotions.head(10)

In [None]:
# 0: sadness
# 1: joy
# 2: love
# 3: anger
# 4: fear
# 5: surprise
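
The numeric labels listed above can be turned back into readable names with a small lookup table. The following is a minimal sketch: the names `label_names` and `example_labels` are chosen here for illustration and are not part of the dataset.

```python
import pandas as pd

# mapping from numeric label to emotion name (taken from the list above)
label_names = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# hypothetical example labels, e.g. the predictions of a classifier
example_labels = pd.Series([0, 3, 5, 1])

# replace each number with its emotion name
example_names = example_labels.map(label_names)
print(example_names.tolist())
```

The same lookup could later be applied to the predictions of a classifier to make them easier to read.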

In [None]:
emotions["text_label"].value_counts()

In [None]:
# we assign our observations (sentences) and target values (emotions)
# to variables:
x = emotions["text"]
y = emotions["label"]

As discussed in the [lecture](https://janalasser.at/lectures/MD_KI/VO4_1_text_as_data/), we first need to transform the texts into numbers — that is, create an [embedding](https://janalasser.at/lectures/MD_KI/VO4_1_text_as_data/#/1/0/0) of the texts. One way to do this is by applying [one-hot encoding](https://janalasser.at/lectures/MD_KI/VO4_1_text_as_data/#/1) and treating each word as its own "category."

<div>
<img src="https://drive.google.com/uc?id=1Bi6EPtynMNCraEnN780taAJkCQE8bjAF" width="600"/>
</div>

In [None]:
# we can use scikit-learn's CountVectorizer to achieve this
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love apples",
    "I received a cat",
    "tomorrow is Tuesday"]

count_vect = CountVectorizer()
count_vect.fit(sentences)

In [None]:
# for the example with only three sentences, the number of words (features)
# is still manageable. Note that the CountVectorizer converts all words
# to lowercase and ignores words that have only one letter.
count_vect.get_feature_names_out()

How many dimensions (features) would the DataFrame have that we generate from the `emotions` dataset in this way? To find out, we can count the number of different words contained in the sentences of the dataset that have a length greater than 1:

In [None]:
# set for saving the words (a set stores each word only once,
# and checking whether a word is already contained is fast)
all_words = set()

# iterate over all sentences in the data
for sentence in emotions["text"]:
    # split the sentence into individual words based on spaces
    words_in_sentence = sentence.split(" ")
    # iterate over all words of the current sentence
    for word in words_in_sentence:
        # add the word if it has more than one letter
        # (duplicates are dropped automatically by the set)
        if len(word) > 1:
            all_words.add(word)

# print the size of the resulting set
print(f"The dataset contains {len(all_words)} different words with a length greater than 1.")

In [None]:
# cross-check: the CountVectorizer also returns 8951 features
count_vect = CountVectorizer()
count_vect.fit(emotions['text'])
print(f"Number of features in CountVectorizer: {len(count_vect.get_feature_names_out())}")

In [None]:
# we can refine the approach by ignoring words that occur rarely.
# Here, we ignore all words that appear in fewer than 3 sentences
# (min_df counts the sentences a word appears in, not its total occurrences):
count_vect = CountVectorizer(min_df=3)
count_vect.fit(emotions['text'])
print(f"Number of features in CountVectorizer: {len(count_vect.get_feature_names_out())}")

In [None]:
# That looks much more manageable! So we transform our dataset
# with the CountVectorizer trained in this way:
x_counts = count_vect.transform(x)

In [None]:
# this dataset has 6,000 rows (observations) and 2,753 columns
# (features, dimensions)
x_counts.shape

In [None]:
# what does a sentence transformed in this way look like?
x_counts[0]

In [None]:
# such large matrices are automatically stored as "sparse matrices"
# to save memory. To display their content, we first need to
# transform them into a regular matrix
x_counts[0].todense()

In [None]:
# How many words were counted for the transformed first sentence?
# (the sum adds up all counts, so a word that occurs twice contributes 2)
x_counts[0].todense().sum()

In [None]:
# what does the original sentence look like?
# excluding words with length <2, there are 11 words in the sentence.
# that means one word was dropped because it appears in fewer than
# 3 sentences of the entire dataset.
emotions["text"].iloc[0]
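
To inspect a transformed sentence more closely, it can help to look at the number of non-zero entries, the sum of all counts, and the words that were actually kept. The following sketch uses made-up sentences and illustrative names (`docs_demo`, `vect_demo`, `row`). The same calls should work for `x_counts[0]` and `count_vect` from above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up sentences; "so" occurs twice in the first one
docs_demo = ["i feel so so happy today", "i feel sad", "so happy"]

vect_demo = CountVectorizer()
vect_demo.fit(docs_demo)

# transform only the first sentence (the result is a sparse matrix with one row)
row = vect_demo.transform(docs_demo[:1])

# number of non-zero entries = number of distinct words that were counted
n_distinct = row.nnz
# sum of all entries = total number of counted words, including repetitions
n_total = row.sum()
# the words behind the non-zero entries
kept_words = vect_demo.inverse_transform(row)[0]

print(n_distinct, n_total, sorted(kept_words))
```

The two numbers differ whenever a word occurs more than once in a sentence.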

## A Classifier for Emotion Recognition
We now want to use the dataset to train a classifier that can recognize emotions in an English sentence. For this, we follow the well-known train-test approach from [Session 4](https://colab.research.google.com/drive/1tkYspN1O9ehvcjKTyerLAISbpUG3XJB2?usp=sharing).

In [None]:
# we split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# we transform the observations in the training and test sets
# with the previously trained CountVectorizer. Note: it is important
# that we trained the CountVectorizer on the entire dataset beforehand,
# because words that appear only in the training set or only in the
# test set after splitting would otherwise be missing from the
# vocabulary and would be ignored during the transformation
x_train_counts = count_vect.transform(x_train)
x_test_counts = count_vect.transform(x_test)

<font color='blue'>**Exercise 1**</font>  
<ul class="outside">
<li><font color='blue'>Train a <a href="https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestClassifier.html">random forest classifier</a> using the training data. Follow the procedure from <a href="https://colab.research.google.com/drive/1tkYspN1O9ehvcjKTyerLAISbpUG3XJB2?usp=sharing">Session 4</a> or the <a href="https://colab.research.google.com/drive/1hEHtlMQh794Lv8Th6J0VFksVBDFbbwm8?usp=sharing">Session 4 homework</a>, Exercise 1 (6).</font></li>
<li><font color='blue'>Predict the emotions for the test data.</font></li>
<li><font color='blue'>Measure the accuracy of the predictions.</font></li>
<li><font color='blue'>Predict the emotions for the four new sentences given in the code cell below. <b>Note</b>: don’t forget to transform the sentences first with the trained <tt>CountVectorizer</tt>!</font></li>
<li><font color='blue'>Come up with four more sentences and predict their emotions. Do the predictions match your expectations?</font></li>
</ul>

In [None]:
# Your code here

## Multiclass Classification

So far, in supervised learning for classification, we have only dealt with binary classification (survive vs. not survive, breast cancer vs. no breast cancer). In the case of emotions, however, we have multiple classes (emotions). This is referred to as "multiclass classification."

It can also happen that we want to assign more than one class to an observation (not covered in this course). This is then referred to as "multilabel classification."

<div>
<img src="https://drive.google.com/uc?id=1Sdftq2ZeBmJjzyHjrfJIMj4AnrDKIsrH" width="800"/>
</div>

Measuring the accuracy of a multiclass classifier is straightforward. We can simply count the cases where the classifier is correct and divide by the total number of predictions.

For other [performance metrics](https://janalasser.at/lectures/MD_KI/VO2_2_performance/#/2/0/2) like precision and recall, the procedure is not quite as simple — each class has its own precision and recall.
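
The difference between accuracy and the per-class metrics can be illustrated with a small, made-up example with three classes. The label arrays and all variable names below are chosen for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# made-up true and predicted labels for three classes (0, 1, 2)
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])

# accuracy: share of predictions that are correct (here 4 out of 6)
accuracy_manual = (y_true_demo == y_pred_demo).mean()
accuracy_sklearn = accuracy_score(y_true_demo, y_pred_demo)

# average=None returns one value per class instead of a single number
precision_per_class = precision_score(y_true_demo, y_pred_demo, average=None)
recall_per_class = recall_score(y_true_demo, y_pred_demo, average=None)

print(accuracy_manual, accuracy_sklearn)
print(precision_per_class)
print(recall_per_class)
```

Accuracy is a single number for the whole classifier, while precision and recall yield one value for each of the three classes.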

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(verbose=True, random_state=0)
classifier.fit(x_train_counts, y_train)
y_pred = classifier.predict(x_test_counts)

In [None]:
# For which of the six emotions does the prediction work best?
# 0: sadness
# 1: joy
# 2: love
# 3: anger
# 4: fear
# 5: surprise

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
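
The report is easier to read if the emotion names are shown instead of the numbers. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) and passes the names via the `target_names` parameter, in the order of the numeric labels 0 to 5.

```python
from sklearn.metrics import classification_report

# emotion names in the order of the numeric labels 0 to 5
emotion_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# made-up true and predicted labels that contain every class at least once
y_true_demo = [0, 1, 2, 3, 4, 5, 0, 1]
y_pred_demo = [0, 1, 1, 3, 4, 5, 0, 2]

# the rows of the report are now labeled with the emotion names
print(classification_report(y_true_demo, y_pred_demo, target_names=emotion_names))

# output_dict=True returns the same values as a dictionary
report = classification_report(
    y_true_demo, y_pred_demo, target_names=emotion_names, output_dict=True)
print(report["joy"]["recall"])
```

For the classifier trained above, the corresponding call would be `classification_report(y_test, y_pred, target_names=emotion_names)`.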

<font color='blue'>**Exercise 2**</font>  
<ul class="outside">
<li><font color='blue'>Each person chooses an emotion and writes 10 sentences that are meant to express that emotion in this <a href="https://docs.google.com/document/d/1-U_YxRfU0q10_rLh_gFrwUqaRhBqv9znYUwQfqGdoo4/edit?usp=sharing">Google Doc</a>.</font></li>
<li><font color='blue'>Copy all sentences into the list <tt>x_test_new</tt> in the code cell below. Pay attention to the order: first all sentences for "sadness", then all for "joy", etc.</font></li>
<li><font color='blue'>Predict the emotion for all the new sentences. <b>Note</b>: don’t forget to transform the sentences first with the trained <tt>CountVectorizer</tt>!</font></li>
<li><font color='blue'>Measure the classifier's performance on the new sentences by creating a <tt>classification_report</tt>. How well does the classifier perform on this completely new dataset, i.e., "out of sample"?</font></li>
</ul>

In [None]:
x_test_new = [
    # Copy the sentences here
]

y_test_new = [
    # Copy the targets here
]

In [None]:
# Your code here

## Additional Materials
* **Embeddings with Pretrained Models**: [NLTK tutorial](https://www.nltk.org/howto/gensim.html)
* **Micro- and Macro F1 Score Explained**: [Blog post](https://iamirmasoud.com/2022/06/19/understanding-micro-macro-and-weighted-averages-for-scikit-learn-metrics-in-multi-class-classification-with-example/)
* **How LLMs Work**: [Video](https://www.youtube.com/watch?v=wjZofJX0v4M) by 3Blue1Brown (also includes a brief section on embeddings)
* **Emotion Recognition and the AI Act**: [Comment](https://www.technologyslegaledge.com/2025/04/eu-ai-act-spotlight-on-emotional-recognition-systems-in-the-workplace/) on emotion recognition systems in the AI Act

## Source and License

This notebook was created by Jana Lasser for Course "B1 - Technical Aspects" of the Microcredential "AI and Society" at the University of Graz.

The notebook may be used, modified, and redistributed under the terms of the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0) license.

This notebook was translated from German using GPT-5 and cross-checked by Alina Herderich.