<a href="https://colab.research.google.com/github/michalis0/DataScience_and_MachineLearning/blob/master/06-text-analytics/Week_06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install wikipedia
!pip install -q transformers
%pip install ipywidgets

In [3]:
# Import standard libraries
import pandas as pd
import numpy as np
import math
import bs4 as bs
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets
from ipywidgets import interact, interact_manual

# Import for text analytics
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string
import wikipedia
import gensim
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess
from gensim import corpora
import multiprocessing

# Import libraries for logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

# Import libraries for Hugging Face
from transformers import pipeline
import gensim.downloader

# Text Analytics

<img src='https://images.unsplash.com/photo-1605429201125-37e867327609?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1176&q=80' width="450">

Credit: [Piotr Łaskawski](https://unsplash.com/@tot87)

## Content

The goal of this walkthrough is to provide you with insights on text analytics. [Text Analytics](https://en.wikipedia.org/wiki/Text_mining) (or text mining) is "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." ([Marti Hearst](https://people.ischool.berkeley.edu/~hearst/text-mining.html)). Written resources may include websites, books, emails, reviews, and articles.

There are many applications of text analytics, for example:
- Search for relevant websites or articles using a search engine;
- Sentiment Analysis (e.g., classify tweets or film reviews as positive, neutral or negative);
- Summarize, anonymize, or translate documents;
- Chatbots (e.g., ChatGPT, Siri, Alexa);
- etc.

In this notebook, we will see how to prepare and represent texts and explore various text-analytics techniques, before doing an application on text classification:
- [Text Preparation](#Text-Preparation)
    - [Tokenization](#Tokenization)
    - [Remove Stopwords](#Remove-Stopwords)
    - [Lemmatization](#Lemmatization)
    - [Your turn!](#Your-turn-Preparation)
- [Text Representation](#Text-Representation)
    - [Bag of Words (BOW)](#Bag-of-Words-(BOW))
    - [TF-IDF Representation](#TF-IDF-Representation)
    - [Your turn!](#Your-turn-Representation)
- [Introduction to Gensim and Word Embedding](#Introduction-to-Gensim-and-Word-Embedding)
    - [Background](#Background)
    - [Implementing Word2vec with Gensim](#Implementing-Word2vec-with-Gensim)
    - [Using pretrained models](#Using-pretrained-models)
    - [Your turn](#Your-turn)
- [Application: Text Classification with TF-IDF](#Application:-Text-Classification-with-TF-IDF)
    - [Load and clean data](#Load-and-clean-data)
    - [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    - [Classification using TF-IDF and Logistic Regression](#Classification-using-TF-IDF-and-Logistic-Regression)
- [Introduction to Hugging Face and sentiment analysis](#Introduction-to-Hugginface-and-sentiment-analysis)
    - [Implementation of Hugging Face](#Implementation-of-Hugginface)
    - [Your turn !](#Your-turn-!)

## Text Preparation

In this section, we explain how to prepare a text for analysis. This includes tokenizing the text, removing stopwords, etc.

We will use the [spaCy](https://spacy.io/) library, an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently.

You can directly [install the library](https://spacy.io/usage) in your Anaconda environment, or, if you opened this notebook in Colab, with the following line of code:
```python
!pip install -U spacy
```

We also install the English-language model: in your Anaconda environment, install "spacy-model-en_core_web_sm"; in Colab, run the following line of code:
```python
!python -m spacy download en_core_web_sm
```

Note: If you obtain the error `Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed"`, try the following:
```python
!pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0.tar.gz
```

In [None]:
#!pip install -U spacy
#!python -m spacy download en_core_web_sm
!pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0.tar.gz

Once everything is installed and imported (at the beginning of this notebook), we can load the English language model using `spacy.load('en_core_web_sm')`:

In [5]:

# Load English language model
sp = spacy.load('en_core_web_sm')

### Tokenization

**Tokenization** is the process of breaking a text into pieces called tokens. A **token** simply refers to an individual part of a sentence having some semantic value. In other words, tokens are the elementary building blocks (words, numbers, characters) in a document.

spaCy's tokenizer takes input in the form of Unicode text and outputs a sequence of token objects. In addition, spaCy automatically breaks your document into tokens when a document is created using the language model.

There are a couple of different ways we can approach this. The first is called **word tokenization**, which means breaking up the text into individual words. This is a critical step for many language processing applications, as they often require inputs in the form of individual words rather than longer strings of text.

Let’s take a look at a simple example. Imagine we have the following text, and we would like to tokenize it:

> When learning data science, you shouldn't get discouraged!

> Challenges and setbacks aren't failures, they're just part of the journey. You've got this!

We create a spaCy object, which contains linguistic annotations and various language properties:

In [None]:
# Declare text
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""

# spaCy object is used to create a document
my_doc = sp(text)

my_doc

In [None]:
# This is a spaCy document
type(my_doc)

Let's now create a list of tokens:

In [None]:
# Create list of tokens
token_list = [token.text for token in my_doc]
token_list

As we can see, spaCy produces a list that contains each token as a separate item. Notice that it has recognized that contractions such as _shouldn’t_ actually represent two distinct words, and has thus broken them down into two distinct tokens.
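
To see why a dedicated tokenizer helps, here is a minimal sketch comparing two naive approaches in plain Python. The variable names and the regular expression are illustrative choices of ours, not something spaCy uses internally:

```python
import re

sample = "When learning data science, you shouldn't get discouraged!"

# Whitespace splitting keeps punctuation glued to the words
whitespace_tokens = sample.split()

# A simple regex separates words from punctuation,
# but still treats "shouldn't" as one single token
regex_tokens = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", sample)

print(whitespace_tokens)
print(regex_tokens)
```

Neither approach splits the contraction into its two underlying words, which is one of the cases spaCy's tokenizer rules handle for us.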

We can also see the parts-of-speech (POS) of each of these tokens using the `.pos_` attribute, as shown below.

In [None]:
# POS
for word in my_doc:
    print(word.text, '->', word.pos_)

POS tagging can be really useful, particularly if you have words or tokens that can have multiple POS tags. For instance, the word "fish" can be used as both a noun and verb, depending upon the context:

In [None]:
# Another example
doc1 = sp("I like to fish") # verb
doc2 = sp("I eat a fish") # noun

for word in doc1:
    print(word.text, '->', word.pos_)

print("-----------------")

for word in doc2:
    print(word.text, '->', word.pos_)

If we want, we can also break the text into sentences rather than words. This is called **sentence tokenization**. Simple sentence tokenizers look for specific characters that normally fall between sentences, like periods, exclamation points, and newline characters; spaCy's `en_core_web_sm` pipeline goes further and infers sentence boundaries from its dependency parse.

In [None]:
# create list of sentence tokens
sents_list = [sent.text for sent in my_doc.sents]

sents_list

### Remove Stopwords

Most text data that we work with is going to contain a lot of words that are not actually useful to our analysis (e.g., "is", "and", "you"). These words, called **stopwords**, are useful in human speech, but they contribute little to the meaning of a sentence. Removing stopwords eliminates noise from our text data and speeds up the analysis, since there are fewer words to process.


Let’s take a look at the stopwords spaCy includes by default.

In [None]:
# Import stopwords from English language
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

# Print total number of stopwords
print('Number of stopwords: %d' % len(spacy_stopwords))

# Print 20 stopwords
print('20 stopwords: %s' % list(spacy_stopwords)[:20])

Now that we’ve got our list of stopwords, let’s use it to remove the stopwords from the text string we were working on in the previous section.

In [None]:
# Which words will be removed?
my_doc

In [None]:
# Filter stopwords
filtered_sent = [word.text for word in my_doc if not word.is_stop]

print('The filtered sentence contains the words:', filtered_sent)

We can also remove the punctuation:

In [None]:
# Filter stopwords, punctuation and spaces
filtered_sent2 = []
removed_tokens = []

for word in my_doc:
    if word.is_stop or word.is_punct or word.is_space:
        removed_tokens.append(word.text)
    else:
        filtered_sent2.append(word.text)

print('We remove the following tokens:', removed_tokens)
print('The filtered sentence contains the words:', filtered_sent2)

### Lemmatization

**Lemmatization** is a way of dealing with the fact that while words like connect, connection, connecting, connected, etc. are not exactly the same, they all have the same essential meaning: connect. The differences in spelling have grammatical functions in spoken language, but for machine processing, those differences can be confusing, so we need a way to change all the words that are forms of the word connect into the word connect itself.

One method for doing this is called **stemming**. Stemming involves simply lopping off easily identified prefixes and suffixes to produce what is often the simplest version of a word, the root. Connection, for example, would have the -ion suffix removed and be reduced to connect. This kind of simple stemming is often all that is needed, but lemmatization, which looks up words and their dictionary forms (called lemmas), is more precise (e.g., feet -> foot).

Let's look at this simple example.

In [None]:
# Lemmatization
lem = sp("run runs ran running runner runners")

# Find lemma for each word
for word in lem:
    print(word.text, '->', word.lemma_)
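
For comparison, here is a toy suffix-stripping stemmer. It is only a sketch of the idea: the function `naive_stem` and its short suffix list are our own assumptions, far simpler than real stemmers such as Porter's.

```python
def naive_stem(word, suffixes=("ing", "ion", "ed", "s")):
    """Strip the first matching suffix; a toy rule set, not a real stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so that short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connection", "connecting", "connected", "connects", "feet"]:
    print(w, "->", naive_stem(w))
```

The four forms of "connect" are reduced to the same root, but "feet" is left untouched: mapping it to "foot" requires dictionary knowledge, which is what lemmatization adds.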

### Your turn! <a id = "Your-turn-Preparation"></a>

The text below is taken from [the presentation](https://www.unil.ch/formations/en/home/menuinst/masters/systemes-dinformation.html) of the Master of Science (MSc) in Information Systems and Digital Innovation.

In [17]:
text = """The Master of Science in Information Systems and Digital Innovation allows you to acquire advanced skills in New Information and Communications Technologies (NICT) for use within organisations.
Subjects are studied with a balanced multidisciplinary approach and cover both information technology and management techniques.
The Master’s degree thus trains high-level specialists with the skills needed to design, manage, evaluate and implement IT services and applications.
This course also allows to undertake doctoral studies.
"""

- Create two lists:
    - the first one containing the punctuation and the stopwords,
    - the second one containing the words (tokens).

In [None]:
# YOUR CODE HERE


- For each token, print its lemma

*Note:* You can convert a list of strings into a string using for instance the `join()` method

In [None]:
# YOUR CODE HERE


## Text Representation

The goal is to transform text into numerical features such that it can be used by ML algorithms. There are different techniques:
- **Bag of Words (BOW)** simply treats every document as an unordered collection of words and counts how often each one occurs. It works in many cases, but word order is not preserved. As a partial solution, we can use **n-grams**, i.e., we also count token pairs, triplets, etc.
- **TF-IDF** emphasizes important words, i.e., words that appear frequently in a document (informing about the topic of the document) but rarely in the rest of the corpus (setting the document apart from other similar ones).
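
To make the point about word order concrete, here is a small pure-Python sketch. The helper `ngrams` and the two example sentences are hypothetical and only serve as an illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc_a = "the dog bites the man".split()
doc_b = "the man bites the dog".split()

# Unigram counts are identical: word order is lost
print(Counter(doc_a) == Counter(doc_b))

# Bigram counts differ: some order information is kept
print(Counter(ngrams(doc_a, 2)) == Counter(ngrams(doc_b, 2)))
```

The two sentences mean very different things, yet a unigram bag of words cannot tell them apart; adding bigrams recovers part of the ordering.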

To transform our text, we are going to use our old friend, the scikit-learn library. As input, it requires strings (and not spaCy objects). Here, our corpus of documents will consist of four sentences on [symbiosis](https://en.wikipedia.org/wiki/Symbiosis).

In [None]:
# Sentences (as strings, not spaCy objects)
s1 = "Symbiosis is any type of a close and long-term biological interaction between two biological organisms of different species."
s2 = "Mutualism describes the ecological interaction between two or more species where each species has a net benefit."
s3 = "Commensalism is a long-term biological interaction (symbiosis) in which members of one species gain benefits while those of the other species neither benefit nor are harmed."
s4 = "Parasitism is a close relationship between species, where one organism, the parasite, lives on or inside another organism, the host, causing it some harm, and is adapted structurally to this way of life."

# List of sentences
texts = [s1, s2, s3, s4]
texts

### Bag of Words (BOW)

We use the `CountVectorizer` class of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)), using as parameters:
- `ngram_range=(1,2)`, i.e., we consider tokens (1-grams) and pairs of tokens (2-grams);
- `stop_words="english"`, a built-in stop word list for English.

In [None]:
# Using default tokenizer
count = CountVectorizer(ngram_range=(1,2), stop_words="english")

# Learn the vocabulary dictionary and return document-term matrix
bow = count.fit_transform(texts)

# Show feature matrix
print(bow.toarray())

Let's check the n-grams (tokens and pairs of tokens) created:

In [None]:
# Get feature names
feature_names = count.get_feature_names_out()

# View feature names
print('Our n-grams are:', ', '.join(feature_names))

We can better visualize the result in a dataframe:

In [None]:
# Show as a dataframe
pd.set_option("display.max_columns", None)
pd.DataFrame(
    bow.todense(),              # Feature matrix
    columns=feature_names,      # n-grams
    index= ['s1', 's2', 's3', 's4']
    )

### TF-IDF Representation

**TF-IDF** emphasizes important words. It is the product of term frequency (TF) and inverse document frequency (IDF):
- **Term Frequency** identifies tokens that appear frequently in a document: TF(token, document) = number of times token appears in document / total number of tokens in document
- **Inverse Document Frequency** identifies words that appear rarely in the corpus: IDF(token, corpus) = log( total number of documents in corpus / number of documents containing token )

Note that the IDF value of a token is the same for every document, because it depends only on corpus-level counts (the total number of documents and the number of documents containing the token). On the other hand, the TF values of a token differ from document to document.

The goal of using TF-IDF instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus since those are less informative than tokens that occur in a small fraction of the corpus.
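
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up toy corpus; the function names `tf`, `idf`, and `tf_idf` are ours, and library implementations may use variants of these formulas:

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical example data)
corpus = [
    ["species", "interaction", "symbiosis"],
    ["species", "benefit", "species", "mutualism"],
    ["species", "parasite", "host"],
]

def tf(token, document):
    """Share of the document's tokens that are `token`."""
    return document.count(token) / len(document)

def idf(token, corpus):
    """Log of (number of documents / number of documents containing `token`).

    Assumes the token occurs in at least one document.
    """
    n_containing = sum(1 for document in corpus if token in document)
    return math.log(len(corpus) / n_containing)

def tf_idf(token, document, corpus):
    return tf(token, document) * idf(token, corpus)

# "species" appears in every document, so its IDF (and TF-IDF) is zero
print(tf_idf("species", corpus[1], corpus))

# "mutualism" appears in a single document, so it gets a positive weight
print(tf_idf("mutualism", corpus[1], corpus))
```

Even though "species" is the most frequent token in the second document, it receives no weight because it does not help to distinguish that document from the others.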

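Ok, let's implement TF-IDF. We are using the `TfidfVectorizer` class of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)). Note that, with its default settings, this class uses a smoothed IDF and normalizes each row to unit length, so the values differ somewhat from the textbook formulas above: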

In [None]:
# Sentences (as strings, not spaCy objects)
s1 = "Symbiosis is any type of a close and long-term biological interaction between two biological organisms of different species."
s2 = "Mutualism describes the ecological interaction between two or more species where each species has a net benefit."
s3 = "Commensalism is a long-term biological interaction (symbiosis) in which members of one species gain benefits while those of the other species neither benefit nor are harmed."
s4 = "Parasitism is a close relationship between species, where one organism, the parasite, lives on or inside another organism, the host, causing it some harm, and is adapted structurally to this way of life."

texts = [s1, s2, s3, s4]

# Using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")

# Learn the vocabulary dictionary and return document-term matrix
features = tfidf.fit_transform(texts)

# Visualize result in dataframe
tfidf_df = pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out(),
    index = ['s1', 's2', 's3', 's4']
)

display(tfidf_df)   

It's now possible to calculate the summed TF-IDF value for each term. This gives us an indication of the importance or relevance of each word across the different sentences.

The higher the summed TF-IDF value, the more weight the term carries across the documents. This helps identify key terms that stand out and may hold significant meaning in the sentences.

In [None]:
tfidf_score_per_word = tfidf_df.sum(axis=0).sort_values(ascending=False)
print(tfidf_score_per_word)

Another interesting metric to calculate is the sum of all the token TF-IDF values across a single sentence.

This gives a rough indication of how many distinctive terms the sentence contains relative to the rest of the corpus. Because `TfidfVectorizer` normalizes each row to unit length by default, the row sum mostly reflects how the weight is spread across the sentence's terms, so treat it as a heuristic rather than a precise measure of importance.



In [None]:
tfidf_score_per_sentence = tfidf_df.sum(axis=1).sort_values(ascending=False)
print(tfidf_score_per_sentence)

### Your turn! <a id = "Your-turn-Representation"></a>

- Create a TF-IDF Representation of the three sentences below using bigrams

In [25]:
# Sentences
s1 = "Information systems are structured arrangements of people, data, processes, and technology that work together to collect, process, store, and disseminate information within an organization."
s2 = "Information systems encompass various components, such as hardware, software, databases, networks, and human resources, all working in synergy to manage and distribute information effectively."
s3 = "Information systems come in different types, including transaction processing systems, management information systems, decision support systems, and executive information systems, tailored to specific organizational needs."

In [None]:
# YOUR CODE HERE


## Introduction to Gensim and Word Embedding

With BOW and TF-IDF, similar sentences/words have a completely different representation. Thus, sentences with different words but same meaning/semantics will be very distant.
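
Here is a small sketch of the problem, using plain word counts and cosine similarity computed by hand. The example sentences and the helper `cosine` are hypothetical:

```python
import math
from collections import Counter

def cosine(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b)

a = Counter("puppy sleeps sofa".split())
b = Counter("dog naps couch".split())
c = Counter("puppy sleeps couch".split())

print(cosine(a, b))  # same meaning, but no shared tokens
print(cosine(a, c))  # two of three tokens shared
```

Sentences `a` and `b` say almost the same thing, yet their similarity is zero because they have no token in common. Word embeddings aim to close this gap.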

In the following, we illustrate how we can find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.

We will use the [Gensim](https://pypi.org/project/gensim/) library. Gensim stands for "Generate Similar". It is a popular open-source natural language processing (NLP) library used for unsupervised topic modeling. A complete tutorial can be found [here](https://www.tutorialspoint.com/gensim/gensim_introduction.htm).

### Background

Word embedding approaches use neural network-based techniques to convert words into vectors so that semantically similar words are close to each other in an N-dimensional space, where N is the dimension of the vectors. The underlying assumption is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model.
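
To give an idea of what "context" means here, the sketch below lists the (center word, context word) pairs found within a small window, which is the kind of input that skip-gram style training is based on. It is a simplified illustration: `context_pairs` is our own helper, and real implementations add further steps on top of this. Gensim's `Word2Vec` exposes a `window` parameter with a similar role.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for tokens at most `window` positions apart."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

tokens = "the cute puppy barks".split()
pairs = list(context_pairs(tokens, window=1))
print(pairs)
```

Words that keep appearing with the same context words (here, "cute" next to "puppy") end up with similar vectors.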

Two word embedding methods:
- [Word2vec](https://en.wikipedia.org/wiki/Word2vec), by Google
- [GloVe](https://en.wikipedia.org/wiki/GloVe) (Global vectors for Word Representation), by Stanford

Word2vec gives astonishing results. Its ability to capture semantic relationships is reflected in a classic example: if you take the vector for the word "King", subtract the vector for "Man", and add the vector for "Woman", you get a vector that is close to the vector for "Queen":
- King - Man + Woman = Queen
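
The arithmetic can be illustrated with hand-made vectors. The two-dimensional values below are invented for the example (one axis loosely standing for "royalty", the other for "gender"); real embeddings are learned and have many more dimensions:

```python
import math

# Hypothetical 2-D vectors: first axis ~ "royalty", second axis ~ "gender"
vectors = {
    "king":  (0.9,  0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1,  0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.0),
}

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman
target = tuple(
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
)

# Rank the remaining words by cosine similarity to the target vector
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(candidates[w], target))
print(best)
```

With a trained Gensim model, the same query is usually written with the `positive` and `negative` arguments of `most_similar`.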

Second example: "dog", "puppy" and "pup" are often used in similar situations, with similar surrounding words like "good", "fluffy" or "cute", and according to Word2vec they will therefore share a similar vector representation.

In real applications, Word2vec models are trained on very large corpora. For example, [Google's Word2Vec model](https://code.google.com/archive/p/word2vec/) was trained on roughly 100 billion words of Google News text and contains vectors for 3 million words and phrases.

GloVe is a related method that learns word vectors from global word co-occurrence statistics. More information [here](https://nlp.stanford.edu/projects/glove/).

Recently, more advanced models have been developed, such as [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) (Bidirectional Encoder Representations from Transformers) and [GPT-3](https://en.wikipedia.org/wiki/GPT-3) (Generative Pre-trained Transformer 3). While Word2vec models represent each token (word) with a single vector, BERT generates different output vectors for the same word when it is used in different contexts. You can find further readings on the topic at the end of this notebook.

### Implementing Word2vec with Gensim

We will implement Word2vec using the Gensim library. We are going to use a corpus of text extracted from Wikipedia by web scraping. We first define a function to retrieve texts from a Wikipedia URL:

In [27]:
# Get texts from Wikipedia
def get_text(url):
    # Retrieve data
    scrapped_data = urllib.request.urlopen(url)
    article = scrapped_data.read()
    # Parse data: the text is contained in the HTML tag 'p'
    parsed_article = bs.BeautifulSoup(article,'lxml')
    paragraphs = parsed_article.find_all('p')
    # Create a string with all the paragraphs
    article_text = ""
    for p in paragraphs:
        article_text += p.text
    return article_text

Let's get the Wikipedia articles on [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) and on [Artificial Intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence). This will be our corpus of documents.

In [None]:
# Get articles
machine_learning = get_text("https://en.wikipedia.org/wiki/Machine_learning")
ai = get_text("https://en.wikipedia.org/wiki/Artificial_intelligence")

print(machine_learning[:705])
print(ai[:741])

# Group texts in list
texts = [machine_learning, ai]

Next, we preprocess our texts. We create a tokenizer function to lemmatize each token and remove stopwords.

In [29]:
# Create tokenizer function for preprocessing
def spacy_tokenizer(text):

    # Define stopwords, punctuation, and numbers
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    punctuations = string.punctuation +'–' + '—'
    numbers = "0123456789"

    # Create spacy object
    mytokens = sp(text)

    # Lemmatize each token and convert each token into lowercase
    mytokens = ([ word.lemma_.lower().strip() for word in mytokens ])

    # Remove stop words and punctuation
    mytokens = ([ word for word in mytokens
                 if word not in stop_words and word not in punctuations ])

    # Remove suffix like ".[1" in "experience.[1"
    mytokens_2 = []
    for word in mytokens:
        for char in word:
            if (char in punctuations) or (char in numbers):
                word = word.replace(char, "")
        if word != "":
            mytokens_2.append(word)

    # Return preprocessed list of tokens
    return mytokens_2

Let's apply our function to tokenize our corpus of documents:

In [None]:
# Tokenize texts
processed_texts = [spacy_tokenizer(text) for text in texts]

for processed_text in processed_texts:
    print(processed_text[:20])

Now that our text is preprocessed, we can train a Word2vec model. We use the `Word2Vec` module of Gensim ([Documentation](https://radimrehurek.com/gensim/models/word2vec.html)). As input, we provide the processed texts, i.e., a list of lists of tokens. In addition, we use as parameters:
- `min_count`: minimum number of occurrences of a word in the corpus for it to be taken into account
- `vector_size`: dimension of the vectors representing the tokens

Once the model is trained, we can access the mapping between words and embeddings through the attribute `.wv`.

In [None]:
# Word embedding
word2vec = Word2Vec(processed_texts, min_count=2, vector_size=100)

# Vocabulary
vocab = word2vec.wv.key_to_index
print(vocab)

Each token (word) is represented by a vector (array) of size 100:

In [None]:
# Vector
v1 = word2vec.wv['intelligence']
v1

In this space, we can explore the similarities between tokens. For instance, let's find the most similar words to "intelligence":

In [None]:
# Similar vectors/words
sim_words = word2vec.wv.most_similar('intelligence')
sim_words

Or the similarity between two words:

In [None]:
# Similarity between two words
print('The similarity between "computer" and "agriculture" is: ', word2vec.wv.similarity('computer', 'agriculture'))
print('The similarity between "computer" and "machine" is: ', word2vec.wv.similarity('computer', 'machine'))

Remarks:
- There are other models than Word2Vec in Gensim. For instance, `Doc2Vec` is used to create a vectorised representation of a group of words (i.e., a document) taken collectively as a single unit (illustrated in the next section).
- Gensim has many applications besides word embedding, see e.g., [topic modelling](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/). Feel free to explore the library!

### Using pretrained models
Gensim comes with pretrained models. This means that, in some cases, you don't have to create your model from scratch. You can see how to use these pretrained models [here](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models).

In our example, we want to see the 10 most similar words to 'twitter':

In [None]:
# Download the "glove-twitter-25" embeddings
glove_vectors = gensim.downloader.load('glove-twitter-25')
# Check the most similar terms to the word 'twitter'
glove_vectors.most_similar('twitter')

### Your turn

- Using the functions defined above, create a corpus of documents with the following Wikipedia articles: [Photovoltaics](https://en.wikipedia.org/wiki/Photovoltaics), [Wind turbine](https://en.wikipedia.org/wiki/Wind_turbine), [Hydropower](https://en.wikipedia.org/wiki/Hydropower), and [Nuclear power plant](https://en.wikipedia.org/wiki/Nuclear_power_plant). Do you know the share of each technology in the Swiss electricity mix? Check the [Electricity sector in Switzerland](https://en.wikipedia.org/wiki/Electricity_sector_in_Switzerland) for the answer...

In [36]:
# YOUR CODE HERE


- Preprocessing: Tokenize your corpus of documents

In [37]:
# YOUR CODE HERE


- What is the number of occurrences of the word "energy"?

In [None]:
# YOUR CODE HERE


- Create a Word2Vec representation of the article with a min_count of 1 and a vector size of 50

In [39]:
# YOUR CODE HERE


- What are the 10 most similar words to "electricity"?

In [None]:
# YOUR CODE HERE


## Application: Text Classification with TF-IDF

In this section, we do an application on text classification to illustrate how the embedding can influence the accuracy of a classifier.

Our goal is to classify consumer finance complaints into 12 pre-defined categories using:
- TF-IDF and logistic regression

We use the same tokenizer function, train-test split, classification algorithm, etc. The only difference is the mathematical representation (i.e., the vectorization from the tokens) of the complaints.

This application was inspired by the articles published by Susan Li on Towards Data Science:
- [Multi-Class Text Classification with Scikit-Learn](https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f)

### Load and clean data

We work with a sample of a large dataset from Data.gov that can be found [here](https://catalog.data.gov/dataset/consumer-complaint-database).

In [None]:
# Load data from GitHub
path = "https://raw.githubusercontent.com/michalis0/DataScience_and_MachineLearning/refs/heads/master/Labs/05-text-analytics/data/complaints_sample.csv"
df = pd.read_csv(path, index_col=0)
df.head()

In [None]:
df.info()

The data set includes 18 columns and 9101 rows describing consumer complaints about financial products. In this case, we want to predict the `Product` category based on the text of the complaint (i.e., `Consumer complaint narrative`).

In [43]:
# Select columns of interest
data = df[["Product", "Consumer complaint narrative"]]

Around 2/3 of the complaints are null values. They are not useful for the prediction so we drop them.

In [None]:
# Drop NaN
print(data.isnull().sum())
data = data.dropna().reset_index(drop=True)
data.head()

In [None]:
data.info()

We end up with 3137 complaints for which we would like to predict the product concerned.

### Exploratory Data Analysis

As always, we start with an EDA to better understand our data and inform our analysis. First note that we are dealing with a dataset containing a large number of words:

In [None]:
# Total number of words - over 600,000
words_number = data['Consumer complaint narrative'].apply(lambda x: len(x.split(' '))).sum()
print(f'The complaints contain {words_number} words.')
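
A side note on this count: `split(' ')` cuts only at single space characters, so repeated spaces produce empty strings that are counted as words. The short sketch below, on a made-up string, shows the difference with `split()`, which cuts at any run of whitespace:

```python
sample = "Payment  was late"  # note the double space

# Splitting on a single space yields an empty string between the two spaces
print(sample.split(' '))

# Splitting on any whitespace run does not
print(sample.split())
```

For a rough order of magnitude, as here, either approach is fine.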

Let's extract a sample to see what the complaints look like:

In [None]:
# Sample
data['Consumer complaint narrative'].sample().values[0]

The data has been anonymized (i.e., names, dates, IDs, etc. have been replaced by XXXX).

Next, note that the classes (products) are imbalanced:

In [None]:
# Imbalanced dataset
data.Product.value_counts()

There are 17 categories. We group some of them together (e.g. "Credit card", "Prepaid card", and "Credit or prepaid card") because they are sub-categories of each other. We end up with 12 categories.

In [None]:
# Merge categories
dic_replace = {'Credit reporting':'Credit reporting, credit repair services, or other personal consumer reports',
               'Credit card':'Credit card or prepaid card',
               'Payday loan':'Payday loan, title loan, or personal loan',
               'Money transfers':'Money transfer, virtual currency, or money service',
               'Prepaid card':'Credit card or prepaid card',
               'Virtual currency':'Money transfer, virtual currency, or money service'}
data.replace(dic_replace, inplace=True)
data.Product.value_counts()

Let's visualize the number of observations per product using a bar plot.

In [None]:
# Plot number of complaints per category
cnt_pro = data['Product'].value_counts()
plt.figure(figsize=(12,4))
sns.countplot(x=data['Product'], order = cnt_pro.index)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.xticks(rotation=90)
plt.show()

Finally, let's compute the base rate, i.e., the accuracy obtained using a naive classifier that predicts that all observations are from the largest class ("Credit reporting, credit repair services, or other personal consumer reports").

In [None]:
# Base rate
base_rate = round(len(data[data.Product == "Credit reporting, credit repair services, or other personal consumer reports"]) / len (data), 4)
print(f'The base rate is: {base_rate*100:0.2f}%')
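The base rate can also be obtained by actually fitting such a naive classifier. The sketch below is not part of the original walkthrough: it uses sklearn's `DummyClassifier` on made-up toy labels (`X_toy` and `y_toy` are illustrative names) to show that its accuracy equals the share of the largest class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (made up for illustration): class "a" is the majority class
y_toy = np.array(["a"] * 6 + ["b"] * 3 + ["c"] * 1)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)
toy_base_rate = accuracy_score(y_toy, dummy.predict(X_toy))

# Share of the largest class, computed directly
majority_share = (y_toy == "a").mean()
print(toy_base_rate, majority_share)
```

Any model we train should beat this value to be considered useful.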

### Classification using TF-IDF and Logistic Regression

We first define our training and test sets, using the `train_test_split` function of sklearn.

In [52]:
# Select features
X = data['Consumer complaint narrative'] # Features we want to analyze
ylabels = data['Product']                # Labels we test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=1234)
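Since the classes are imbalanced, one optional variation (not used in this walkthrough) is to pass the labels to the `stratify` parameter of `train_test_split`, so that both subsets keep the same class proportions. A minimal sketch on made-up toy labels:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made up for illustration): 80% "a", 20% "b"
toy_y = ["a"] * 80 + ["b"] * 20
toy_X = list(range(len(toy_y)))

# stratify=toy_y keeps the class proportions the same in both subsets
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1234, stratify=toy_y
)
print(Counter(toy_y_train), Counter(toy_y_test))
```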

Next, we use the `TfidfVectorizer` class of sklearn to turn each complaint into a vector of TF-IDF weights. Since we are dealing with very specific data (e.g., the anonymization process generated non-standard sequences of characters), we define our own tokenizer function, which we pass to `TfidfVectorizer` in place of the default one.

In [53]:
# Define tokenizer function
def spacy_tokenizer(sentence):

    punctuations = string.punctuation
    stop_words = spacy.lang.en.stop_words.STOP_WORDS

    # Parse the sentence into a spaCy Doc, which carries the linguistic annotations (lemmas, etc.)
    mytokens = sp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Remove anonymous dates and people
    mytokens = [ word.replace('xx/', '').replace('xxxx/', '').replace('xx', '') for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in ["xxxx", "xx", ""] ]

    # Return preprocessed list of tokens
    return mytokens

For the other parameters of `TfidfVectorizer`, we use single tokens and pairs of consecutive tokens (`ngram_range = (1,2)`) and we ignore terms that have a document frequency strictly lower than 5 (`min_df = 5`).
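To see what these two parameters do, here is a small sketch on a made-up toy corpus. It uses the default tokenizer rather than our `spacy_tokenizer`, and `min_df=2` instead of 5 because the corpus has only three documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration)
toy_docs = [
    "late fee on credit card",
    "credit card late fee",
    "student loan payment",
]

# Unigrams and bigrams; keep only terms appearing in at least 2 documents
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Terms that survive the min_df filter
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
```

Only terms shared by the first two documents are kept, including the bigrams "credit card" and "late fee". The third document shares no term with the others, so its row in the matrix is all zeros.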

Note that we also rely on the `Pipeline` class of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)) to chain the steps: first the vectorizer, then the classifier. We also time the training (it might take a few minutes).

In [None]:
%%time
# Define vectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), tokenizer=spacy_tokenizer)

# Define classifier
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

Finally, we predict the test set values and evaluate the performance of our model:

In [55]:
# Predictions
y_pred = pipe.predict(X_test)

In [None]:
# Evaluate model

## Accuracy
accuracy_tfidf = round(accuracy_score(y_test, y_pred), 4)
print(f'The accuracy using TF-IDF is: {accuracy_tfidf*100:0.2f}%')

## Confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8,7))
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
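With imbalanced classes, overall accuracy can hide poor performance on the small classes. As a sketch (on made-up toy labels, not the complaint data), the recall of each class can be read from the confusion matrix by dividing its diagonal by the row totals:

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration)
toy_true = ["card", "card", "card", "loan", "loan", "mortgage"]
toy_pred = ["card", "card", "loan", "loan", "card", "mortgage"]
toy_labels = ["card", "loan", "mortgage"]

# Rows are actual classes, columns are predicted classes
toy_conf = confusion_matrix(toy_true, toy_pred, labels=toy_labels)

# Recall per class: correct predictions divided by the row total
toy_recall = toy_conf.diagonal() / toy_conf.sum(axis=1)
for label, recall in zip(toy_labels, toy_recall):
    print(f"{label}: {recall:.2f}")
```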

## Introduction to Hugging Face and sentiment analysis
Another powerful tool is the Hugging Face `transformers` library. You can access the documentation [here](https://huggingface.co/docs).

In our example, we will run a sentiment analysis on a text. Sentiment analysis classifies a text by its polarity, typically as positive, neutral, or negative; some models, including the default one loaded below, distinguish only positive and negative.

### Implementation of Hugging Face
First, we have to load the model that we want to use. There are many different models, which you can find [here](https://huggingface.co/docs/transformers/main_classes/model).

In [None]:
%pip install --upgrade transformers huggingface_hub tensorflow
%pip install --upgrade accelerate
%pip install --upgrade tf-keras

In [None]:
from transformers import pipeline

# Load the sentiment-analysis pipeline with its default model (no model specified)
sentiment_pipeline = pipeline("sentiment-analysis")


Then, we will try to classify these two sentences:

In [None]:
# Use a new variable name so the complaints DataFrame `data` is not overwritten
sentences = ["I love you", "I hate you"]
sentiment_pipeline(sentences)

That's it! The first sentence is classified as *positive* and the second one as *negative*, as expected.
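The pipeline returns a list with one dictionary per input sentence, each holding a `label` and a `score`. The sketch below shows one way to work with results of that shape; the scores and the 0.985 threshold are invented placeholders, not real model output.

```python
# Placeholder results in the shape returned by a text-classification pipeline
# (the scores are invented for illustration, not real model output)
example_results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
]
example_sentences = ["I love you", "I hate you"]

# Pair each sentence with its predicted label and confidence
for sentence, result in zip(example_sentences, example_results):
    print(f"{sentence!r}: {result['label']} ({result['score']:.2f})")

# Keep only the predictions above a hypothetical confidence threshold
confident = [r["label"] for r in example_results if r["score"] >= 0.985]
```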

### Your turn!
Try to use this [pipeline](https://huggingface.co/SamLowe/roberta-base-go_emotions) to classify the emotions of this sentence: "I am not having a great day". Use the following parameters:
- task: 'text-classification'
- model: 'SamLowe/roberta-base-go_emotions'
- top_k: None

In [None]:
# YOUR CODE HERE


In [None]:
# print the outputs



What are the top three emotions related to this sentence: "I don't like clowns, I am afraid of them"?

In [None]:
# YOUR CODE HERE
