# NLP applied to classify Spanish jokes

In this notebook we will be applying different Natural Language Processing techniques to a corpus of jokes in Spanish.
The **objective** is to **train a Machine Learning model to classify jokes into categories**.

To run the code smoothly, you should have installed the requirements using `pipenv` or `pip` (refer to the README.md for details).

Specifically, we will use **pandas** for a first exploration of the dataset, **spacy (with the Spanish package installed)** to extract Natural Language information from the jokes, and **sklearn** to vectorize the jokes and train a Machine Learning model to classify them.

## Importing basic libraries

In [None]:
# jupyter config
import warnings
warnings.filterwarnings('ignore')

# data science stack
import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import matplotlib.pyplot as plt
%matplotlib inline

## Reading the dataset

In [None]:
df = pd.read_csv('data/chistes.csv', index_col='id', dtype={'text':str})
df.head()

We will focus on each joke's text and category. Let's see how many jokes we have and which categories they belong to.

In [None]:
df.shape

So we have ~2400 jokes, with 4 columns describing each joke. Let's explore how many categories are in this dataset.

In [None]:
df['category'].value_counts()

In [None]:
df['category'].value_counts()/df.shape[0]

So we have 7 categories; the most common one is "otros", with 31.8% of the jokes.
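As a side note, pandas can compute these shares directly. The sketch below uses a handful of made-up labels (the `categories` series is a stand-in for `df['category']`, not the real data) and relies on `value_counts(normalize=True)`.

```python
import pandas as pd

# Toy labels standing in for df['category'] (hypothetical values)
categories = pd.Series(['otros', 'otros', 'animales', 'familia', 'otros', 'familia'])

# normalize=True returns the share of rows per category, sorted from most to least common
shares = categories.value_counts(normalize=True)
print(shares)

# The largest share is the score of a "classifier" that always answers the most common category
majority_label = shares.idxmax()
majority_share = shares.max()
print(majority_label, majority_share)
```

On the real dataset, the same two lines would give the "otros" share that is used later as the base score.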

Let's explore the size of the jokes too.

In [None]:
df['len'] = df['text'].apply(lambda t: len(t))

df['len'].describe()

In [None]:
df['len'].hist(bins=200, figsize=(16, 4))

plt.xticks(range(0,2000,50))
plt.xlim((0,1000))
plt.axvline(df['len'].median(), color='r')  # Median in red
plt.axvline(df['len'].mean(), color='g')  # Mean in green


Half of the jokes are shorter than 160 characters, which makes them really short documents for NLP.


## Using spacy info

We will load the Spanish model in spacy and try to get the Part Of Speech (POS) of each word, along with other information.

In [None]:
import spacy
nlp = spacy.load("es")
nlp

Let's pick an arbitrary joke and process it with spacy.

In [None]:
random_joke = df.iloc[10]['text']

print(random_joke)

In [None]:
processed_joke = nlp(random_joke)

for token in processed_joke:
    print(token.text,'\t lemma:', token.lemma_, ', pos:', token.pos_, ', tag:', token.tag_, ', stopword:', token.is_stop)

Spacy provides us with some useful information. In this case:
* lemma: the *dictionary form* of the word
* pos: part of speech, for example: noun, verb, adjective...
* tag: part of speech with extended info, like gender, number, etc
* stopword: if the word is considered meaningless (for NLP tasks) or not

As you see, spacy sometimes gives us wrong info:
* the lemma of "pelo" should be "pelo"(noun) and not "pelar"(verb)
* "durmiente" should've been tagged as noun, but it's tagged as adverb

Let's stress spacy with a *classic* complex sentence:

In [None]:
complex_sentence = 'Bajo con un tipo bajo a tocar el bajo bajo la escalera.'
processed = nlp(complex_sentence)
for token in processed:
    print(token.text, token.tag_)

The first "bajo" should be a verb, but it was tagged as a preposition. However, the other 3 occurrences of "bajo" are correctly labeled.

## Using scikit-learn for vectorization

An easy way to convert a document into numbers (so that algorithms can be applied to it) is to count the words it contains. Usually it's better to consider only words that carry strong meaning, and in our case (classifying documents) it's important to find words that are common enough to be turned into "features", but not so common that they stop helping us tell the documents apart.

In sklearn there are several methods to count words in documents. In the next example, we will look for the 20 most common words among those that appear in no more than 200 jokes.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

max_jokes_with_that_word = 200

vectorizer20 = CountVectorizer(max_df=max_jokes_with_that_word, max_features=20)
bag_of_words = vectorizer20.fit_transform(df.text)

vectorizer20.get_feature_names()

The most relevant words are a mix of good ones (like "doctor") and not-so-good ones (like "soy"). The result is a *bag of words*, that is, a matrix with the count of each relevant word in each joke.

The bag of words here is a sparse matrix, but we can convert it to a pandas dataframe.

In [None]:
bag_of_words

In [None]:
counted = pd.DataFrame(data=bag_of_words.toarray(), index=df.index, columns=vectorizer20.get_feature_names())
counted.head(20)

So, for example, the word "mamá" appears 2 times in the joke with id=7.

## Filtering words

Let's try to keep only words with real meaning. A classic way to do so is to remove all the stopwords, but here we can leverage the extra information from spacy and keep only nouns, verbs, adjectives and adverbs.

Traditionally, words were normalized with a stemmer (which strips endings such as a final "s", among other tricks), but we are going to normalize the words using their lemma instead.

In [None]:
def keep_only_content_words(s):
    processed = nlp(s)
    result = [token.lemma_ for token in processed if token.pos_ in ('NOUN', 'VERB', 'ADJ', 'ADV')]
    return ' '.join(result)

print(random_joke)
print()
print(keep_only_content_words(random_joke))

In [None]:
df['filtered_text'] = df['text'].apply(keep_only_content_words)
df.head()

Now let's try again to find the most common 20 words, only using the filtered words.

In [None]:
vectorizer20 = CountVectorizer(max_df=max_jokes_with_that_word, max_features=20)
bag_of_words = vectorizer20.fit_transform(df.filtered_text)

vectorizer20.get_feature_names()

In [None]:
# Let's see it in a table
counted = pd.DataFrame(data=bag_of_words.toarray(), index=df.index, columns=vectorizer20.get_feature_names())
counted.head(20)

### tf-idf

A better way to turn words into numerical values is tf-idf (term frequency, inverse document frequency), which gives less weight to words that appear in many documents. Let's apply it to our bag_of_words.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_trans = TfidfTransformer()

normalized_bag = tfidf_trans.fit_transform(bag_of_words)

In [None]:
# Display word importance
pd.DataFrame(data=normalized_bag.toarray(), index=df.index, columns=vectorizer20.get_feature_names()).head(25)
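To see what the transformer does to the counts, the sketch below recomputes tf-idf by hand on a tiny invented count matrix and compares it with `TfidfTransformer`. It assumes the transformer's default settings, where idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each row is then scaled to unit L2 norm; the matrix `toy_word_counts` is made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Invented counts: 3 documents (rows) x 3 words (columns)
toy_word_counts = np.array([[2, 0, 1],
                            [0, 1, 1],
                            [1, 0, 1]])

n_docs = toy_word_counts.shape[0]
doc_freq = (toy_word_counts > 0).sum(axis=0)      # in how many documents each word appears
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = toy_word_counts * idf                  # term frequency times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sklearn_tfidf = TfidfTransformer().fit_transform(sparse.csr_matrix(toy_word_counts)).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

The third word appears in every document, so its idf is the lowest possible (1.0) and it contributes the least to telling documents apart.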

## Real vectorizer with tf-idf

Finally, we are going to find not just 20 but the 500 most relevant words, and use them later for ML training.

In [None]:
# TfidfVectorizer = CountVectorizer + TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=max_jokes_with_that_word, max_features=500)
bag_of_words = tfidf_vectorizer.fit_transform(df.filtered_text)

important_words = tfidf_vectorizer.get_feature_names()

print(', '.join(important_words))

## Train a ML algorithm

Now we have 500 features per document (joke). We are going to train a ML algorithm to learn the 7 categories provided.

Usually the collection of samples (documents, jokes) with their features is called "X", and the target is called "y" (in our case, the categories).

First we will convert the categories to numbers.

In [None]:
# y : let's make category a number
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(df.category)


In [None]:
df['y'] = y

df.head()
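It can help to know which number corresponds to which category. The sketch below uses a few illustrative labels (not the full dataset) to show that `LabelEncoder` numbers the classes in sorted order, and that `classes_` and `inverse_transform` map the numbers back to names.

```python
from sklearn.preprocessing import LabelEncoder

# A few illustrative category labels
toy_categories = ['otros', 'animales', 'familia', 'otros']

toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_categories)

# classes_ holds the labels in sorted order; the position of a label is its numeric code
toy_classes = list(toy_encoder.classes_)
print(toy_classes)
print(list(toy_y))

# inverse_transform maps numeric codes back to the original labels
decoded = list(toy_encoder.inverse_transform([0, 2]))
print(decoded)
```

With the notebook's encoder, `enc.classes_` would list the real categories in the order of their codes.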

Now we will split X into a train set and a test set. We will train the ML algorithm ONLY with the train set, and later see how well it performs on the test set.

In [None]:
# Split jokes into train (75%) and test (25%)
from sklearn.model_selection import train_test_split

X = bag_of_words

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

So we have a training set with 1814 jokes and 500 features per joke, and a test set of 605 jokes.
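These sizes follow from the default `test_size=0.25` of `train_test_split`, which rounds the test set up. A minimal sketch with placeholder arrays, assuming a dataset of 2419 rows (1814 + 605):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_jokes = 2419  # assumed total, the sum of the two set sizes reported above

# Placeholder data: only the number of rows matters here
X_dummy = np.zeros((n_jokes, 1))
y_dummy = np.zeros(n_jokes)

# Same call as in the notebook: no test_size given, so 25% goes to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_dummy, y_dummy, random_state=1)
print(X_tr.shape[0], X_te.shape[0])
```

Fixing `random_state` makes the split reproducible, so every run trains and tests on the same jokes.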

### Train a Random Forest

We are going to train a RandomForestClassifier with 200 trees, and see if we can beat the base score (that is, the score we would get by assigning every joke to the "otros" category).

In [None]:
from sklearn.ensemble import RandomForestClassifier

number_of_trees = 200
clf = RandomForestClassifier(n_estimators=number_of_trees, random_state=1)
clf.fit(X_train, y_train)

base_score = 0.318
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)

print(f'Base score (all to "otros"): {base_score}')
print(f'Train set score: {train_score}')
print(f'Test set score: {test_score}' )

Not bad! We have improved the score by 16 points over the baseline. However, the result is far from perfect, probably because the jokes are so short.
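The base score above is hard-coded from the share of "otros" in the whole dataset. One possible alternative, sketched here on invented labels, is to let sklearn's `DummyClassifier` compute the majority-class baseline on the same test set as the real model; the toy arrays below are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: class 0 is the most frequent one in the training set
y_toy_train = np.array([0, 0, 0, 1, 1, 2])
y_toy_test = np.array([0, 1, 0, 2])

# The features are ignored by this strategy, so placeholders are enough
X_toy_train = np.zeros((len(y_toy_train), 1))
X_toy_test = np.zeros((len(y_toy_test), 1))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy_train, y_toy_train)

baseline_score = baseline.score(X_toy_test, y_toy_test)
print(baseline_score)
```

In the notebook, fitting it with `X_train, y_train` and scoring with `X_test, y_test` would give a baseline that is directly comparable to the test set score.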

Let's check which features (words) were the most relevant for deciding the category of a joke.

In [None]:
# Feature importance
importances = pd.DataFrame(data=clf.feature_importances_, index=tfidf_vectorizer.get_feature_names(), columns=['importance'])
importances.sort_values(['importance'], ascending=False).head(10)

In [None]:
# Let's remember categories
df['category'].value_counts()/df.shape[0]

Some of the most important words make sense, like perro->animales. Others, like "maridar"?! or "llamar", are not that clear.




**EXERCISE** : Try to improve the result of the ML algorithm.

You can check 3 alternatives:
- In the section "Filtering words", choose using the lemma or not, and different POS.
- In the section "Real vectorizer with tf-idf", consider more than 500 words.
- In the section "Train a Random Forest", explore RandomForestClassifier options (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)).

Notice: you can NOT change the train/test splitting code.
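As a starting point for the third alternative, here is a hedged sketch of a small hyperparameter search. It runs on synthetic data from `make_classification` (not on the jokes), and the grid values are arbitrary examples. The search uses cross-validation inside the training set only, so the test set stays untouched until the final score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the joke features and categories
X_syn, y_syn = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=1)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(X_syn, y_syn, random_state=1)

# Arbitrary example grid: every combination is evaluated with 3-fold cross-validation
param_grid = {'n_estimators': [20, 50], 'max_depth': [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(X_syn_train, y_syn_train)

print(search.best_params_)

# The best model is refit on the whole training set and scored once on the test set
syn_test_score = search.score(X_syn_test, y_syn_test)
print(syn_test_score)
```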

# Topic Modeling

Having a second look at the categories provided, it looks like they are not really good (especially if we look at "otros").

There are several unsupervised techniques to, given a collection of documents, find groups of topics. 

Here we will try LatentDirichletAllocation, or LDA, which is a classic technique but usually hard to work with, as it needs a lot of fine-tuning. For instance, we have to choose the number of topics.

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

number_of_topics = 10
lda = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)

lda.fit(X)
topics = lda.transform(X)

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
            

print("\nTopics in LDA model:")
print_top_words(lda, tfidf_vectorizer.get_feature_names(), 8)


Some of the topics make sense, like topic #5 (mother and school) or topic #8 (father), but others show no clear understandable topic.

Let's try to visualize the weights for each topic.

In [None]:
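# index=df.index keeps each row of topic weights aligned with its joke
topics_df = pd.DataFrame(topics, index=df.index, columns=[f'topic_{x}' for x in range(number_of_topics)])
jokes_with_topics_weights = pd.concat([df, topics_df], axis=1)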
jokes_with_topics_weights.head()

Let's find some strong examples of topic #1.

In [None]:
topic1 = jokes_with_topics_weights[jokes_with_topics_weights['topic_1'] > 0.80]
for text in topic1['text']:
    print(text + '\n-----\n')

Apparently there is not a clear subject in common.
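To read the topic weights, it helps to know that each row returned by `transform` is a distribution over topics (the weights of a document add up to 1). The sketch below shows this on four invented sentences and picks the dominant topic of each one with `argmax`. Unlike the notebook, which fits LDA on tf-idf values, this toy example uses raw word counts.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus, just to have something to fit
toy_docs = [
    'mamá colegio mamá profesor',
    'doctor hospital doctor enfermo',
    'colegio profesor examen mamá',
    'hospital enfermo doctor pastilla',
]

toy_doc_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_doc_counts)  # shape: (documents, topics)

# Each row is a probability distribution over the topics
print(toy_doc_topics.sum(axis=1))

# Dominant topic of each document: the column with the largest weight
dominant_topic = toy_doc_topics.argmax(axis=1)
print(dominant_topic)
```

A threshold such as the 0.80 used above simply keeps the documents whose dominant topic takes most of that total weight.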

**EXERCISE** : Try to explore other topics, change the number of topics, or alter [LDA hyperparams](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

# Word embeddings

Spacy includes (in each language pack) word vectors trained with word2vec. We can access them easily through the .vector property of a token, and also of a complete text, in which case it returns the average of the vectors of its words. You can also use .similarity(other_words) to check the cosine similarity between two words.

Let's try some examples.

In [None]:
velocidad = nlp('velocidad')
velocidad.vector

In [None]:
aceleracion = nlp('aceleración')
tocino = nlp('tocino')

print(velocidad.similarity(aceleracion))
print(velocidad.similarity(tocino))
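For reference, this is what cosine similarity and vector averaging amount to, written with plain numpy. The 3-dimensional vectors in `word_vectors` are invented for illustration; real word vectors have many more dimensions and different values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical tiny "embeddings", chosen by hand
word_vectors = {
    'velocidad': np.array([1.0, 0.0, 0.0]),
    'aceleración': np.array([1.0, 1.0, 0.0]),
    'tocino': np.array([0.0, 0.0, 1.0]),
}

sim_related = cosine_similarity(word_vectors['velocidad'], word_vectors['aceleración'])
sim_unrelated = cosine_similarity(word_vectors['velocidad'], word_vectors['tocino'])
print(sim_related, sim_unrelated)

# The vector of a whole text is the average of the vectors of its words
text_vector = np.mean([word_vectors['velocidad'], word_vectors['aceleración']], axis=0)
print(text_vector)
```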

## Process all jokes to get their vectors

Let's calculate the vectors for all jokes.

In [None]:
jokes = df['text']
vectors = []

# Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
for index, joke in jokes.items():
    vectors.append(nlp(joke).vector)

## Use the vectors to estimate categories using ML

We are going to split the data into train and test sets again, and use RandomForestClassifier.

In [None]:
Xv = vectors

Xv_train, Xv_test, y_train, y_test = train_test_split(Xv, y, random_state=1)    

In [None]:
number_of_trees = 200
clf = RandomForestClassifier(n_estimators=number_of_trees, random_state=1)
clf.fit(Xv_train, y_train)

base_score = 0.318
train_score = clf.score(Xv_train, y_train)
test_score = clf.score(Xv_test, y_test)

print(f'Base score (all to "otros"): {base_score}')
print(f'Train set score: {train_score}')
print(f'Test set score: {test_score}')

The result is not really surprising, perhaps because the original categories were not good enough.

**EXERCISE** : For each joke, try to use the average vector of only its filtered words.

# Visualization

Finally, just to see how the jokes are distributed, we can reduce the dimensionality and plot them in 2D. Since we know the categories, we can use Linear Discriminant Analysis (when they are not known, PCA is more common).

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=2)

X_reduced = lda.fit_transform(X.toarray(), y)

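# Use a list for the column names (a set has no order) and reuse df's index so colors align by joke
X_reduced = pd.DataFrame(X_reduced, index=df.index, columns=['x', 'y'])
X_reduced['color'] = df['y'].apply(lambda cat: 'rgbcmyk'[cat])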
X_reduced.head()

In [None]:
plt.figure(figsize=(16,8))
plt.scatter(X_reduced['x'], X_reduced['y'], color=X_reduced['color'])

We can observe that only red is out of the mass, which is category 0: animales.
Green (category 1: familia) and yellow (cat 5: sexo) are similar.
And the rest is a mess.
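For comparison, this is roughly what the PCA alternative mentioned above looks like. It is a minimal sketch on random placeholder data (`X_toy` is not the joke matrix): PCA ignores the labels and simply keeps the directions with the most variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random placeholder data: 50 samples with 10 features each
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 10))

# Unsupervised reduction to 2 dimensions (no y needed, unlike Linear Discriminant Analysis)
pca = PCA(n_components=2)
X_toy_2d = pca.fit_transform(X_toy)
print(X_toy_2d.shape)

# Share of the total variance kept by each of the 2 components
print(pca.explained_variance_ratio_)
```

PCA needs a dense array, so a sparse matrix like the notebook's `X` would need `X.toarray()` first, as in the Linear Discriminant Analysis cell.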

Thank you for joining this workshop!