In [None]:
# Import standard libraries
import pandas as pd
import numpy as np
import math
import bs4 as bs
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns

# Import for text analytics
import spacy
from spacy import displacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import string
import gensim
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess
from gensim import corpora
import multiprocessing

# Import libraries for logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score

# Load English language model of spacy
sp = spacy.load('en_core_web_sm')

# Text Analytics 2: Word Embedding

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/0*QXJDYJTGexmpeQ43.jpg' width="450">

## Content

In this walkthrough, we continue our exploration of [Text Analytics](https://en.wikipedia.org/wiki/Text_mining), diving into [Word Embedding](https://en.wikipedia.org/wiki/Word_embedding) and applying it to text classification.

- [Recap on text representation](#Recap-on-text-representation)
    - [Some definitions](#Some-definitions)
    - [Bag of Words (BOW)](#Bag-of-Words-(BOW))
    - [TF-IDF](#TF-IDF)
- [Introduction to Gensim and Word Embedding](#Introduction-to-Gensim-and-Word-Embedding)
    - [Background](#Background)
    - [Implementing Word2vec with Gensim](#Implementing-Word2vec-with-Gensim)
    - [Your turn](#Your-turn)
- [Application: Text Classification with TF-IDF vs Doc2Vec](#Application:-Text-Classification-with-TF-IDF-vs-Doc2Vec)
    - [Load and clean data](#Load-and-clean-data)
    - [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    - [Classification using TF-IDF and Logistic Regression](#Classification-using-TF-IDF-and-Logistic-Regression)
    - [Classification using Doc2Vec and Logistic Regression](#Classification-using-Doc2Vec-and-Logistic-Regression)
    - [How to improve the accuracy of a text classifier?](#How-to-improve-the-accuracy-of-a-text-classifier?)
- [Further reading](#Further-reading)

## Recap on text representation

In order to be able to use texts as inputs for classification, we have to transform them into numbers (i.e., vectors). There are several ways of doing this. We recap in this section the concepts and techniques seen last week, namely Bag of Words and TF-IDF.

### Some definitions

- Document = some text, i.e., a string (e.g., a sentence, a tweet, paragraph of text, book, news article, etc.).
- Corpus = collection of documents.
- Dictionary = list of unique tokens in (preprocessed) corpus.
- Vector = mathematical representation of a document (e.g., Bag of Words).
- Model = algorithm used for transforming vectors from one representation to another (e.g., TF-IDF).

Let's illustrate these concepts in Python, using text from the article *[The decarbonisation of Europe powered by lifestyle changes](https://iopscience.iop.org/article/10.1088/1748-9326/abe890/meta)*.

Reference: Costa, L., Moreau, V., et al. (2021). The decarbonisation of Europe powered by lifestyle changes. *Environmental Research Letters*, 16(4), 044057.

Recall that you can define a string using single quotes, double quotes, or triple quotes:

In [None]:
# A document
doc = 'Changes in behaviour may contribute more than 20% of the GHG emission reductions required for net-zero.' 
doc = "Changes in behaviour may contribute more than 20% of the GHG emission reductions required for net-zero." 
doc = """Changes in behaviour may contribute more than 20% of the GHG emission reductions required for net-zero.""" 

Here is a corpus, containing a collection of sentences:

In [None]:
# A corpus
d1 = "The impacts of behavioural change vary across sectors."
d2 = "Changes in travel behaviour limit the rising demand for electricity."
d3 = "Adopting a healthy diet reduces emissions substantially."
d4 = "Without behavioural change, the dependency of Europe on carbon removal technologies increases."
d5 = "Changes in lifestyles are crucial, contributing to achieving climate targets sooner."
corpus = [d1, d2, d3, d4, d5]
corpus

Let's create a dictionary for our corpus. First, we apply a simple preprocessing technique to convert each sentence into a list of tokens (words). We use the [Gensim](https://pypi.org/project/gensim/) library.

In [None]:
# Preprocessing
processed_corpus = [simple_preprocess(d) for d in corpus]
print(processed_corpus)

We now create our dictionary, obtaining a mapping between each word in our corpus and an associated integer identification:

In [None]:
# A dictionary
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

In [None]:
print(dictionary.token2id)

### Bag of Words (BOW)

**Bag of Words (BOW)** is the simplest approach to achieve the transformation of documents into vectors. It is divided into two basic steps:
- Create a dictionary of unique words from the corpus.
- Analyse the documents, i.e., for each document, count how many times each word of the dictionary occurs in it (0 if the word is absent).

Let's implement BOW from scratch using the [spaCy](https://spacy.io/) library. We first define two functions: the first one to get the words of a document and the second to get the unique words of a corpus of documents.

In [None]:
# Tokens in document
def get_tokens(document):
    doc_tokens = ([token.lower_ for token in sp(document) 
                   if (token.is_punct == False) and (token.is_space == False)])
    return doc_tokens

In [None]:
get_tokens(d1)

In [None]:
# List of unique words in corpus (dictionary)
def vocabulary(corpus):
    # Declare output
    word_list = []
    # Loop documents - lower each word and add it to the output
    for document in corpus:
        spacy_doc = sp(document)
        for token in spacy_doc:
            if token.lower_ not in word_list and (token.is_punct == False) and (token.is_space == False):
                word_list.append(token.lower_)
    # Return output
    return word_list

In [None]:
print(vocabulary(corpus))

We use our two functions to create the Bag of Words. 

In [None]:
# Bag of Words
def bow(document, corpus):
    # Get tokens
    doc_tokens = get_tokens(document)
    corpus_tokens = vocabulary(corpus)
    # Initialization
    bag = {}
    for token in corpus_tokens:
        bag[token] = 0
    # Add 1 for each occurrence of a token in the document
    for token in doc_tokens:
        bag[token] += 1
    # Return
    return bag

Here is what we obtain for the first sentence of our corpus:

In [None]:
print(bow(d1, corpus))

Let's do the same for all our documents and visualize the result in a dataframe:

In [None]:
# BOW for all documents in corpus
bag_of_words = [bow(d, corpus) for d in corpus]

# Visualize in dataframe
pd.set_option("display.max_columns", None)
pd.DataFrame(bag_of_words,
    index= ['d1', 'd2', 'd3', 'd4', 'd5']
    )

Note that this is not perfect. We could (should) remove stopwords, use lemmatization, and potentially consider n-grams.

Instead of implementing the technique from scratch, we can rely on the `CountVectorizer` class of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)):

In [None]:
# Initialize model
vectorizer = CountVectorizer()

# Learn the vocabulary dictionary and return document-term matrix
bag_of_words = vectorizer.fit_transform(corpus).todense()

# DataFrame
bag_of_words = pd.DataFrame(bag_of_words, 
                            columns=vectorizer.get_feature_names_out(),
                            index = ['d1', 'd2', 'd3', 'd4', 'd5'])
bag_of_words

Advantages of BOW:
- No need for a huge corpus to get good results in practice.
- Easy to understand (i.e., not mathematically complex).

Disadvantages of BOW:
- A lot of zeros (imagine a corpus of 1000 articles) --> consume memory and space.
- Does not maintain any context information ("I eat a fish" vs. "A fish eats me").
- Partial remedies: n-grams, specifying `min_df` and `max_df` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)).

### TF-IDF

**TF-IDF** is a bag-of-words variant where, instead of filling the vector with raw counts (or zeros and ones), we use floating-point weights that carry more information. The idea is to emphasize words that appear in few documents of the corpus. A word that appears many times but in only one document gets a high weight compared to words that appear many times in many documents. Such a word is very useful for identifying the document.

TF-IDF is the product of term frequency (TF) and inverse document frequency (IDF):
- **Term Frequency** identifies tokens that appear frequently in a document: 
    - TF(token, document) = number of times token appears in document / total number of tokens in document
    - greater if word appears many times in document
- **Inverse Document Frequency** identifies words that appear rarely in the corpus: 
    - IDF(token, corpus) = log( total number of documents in corpus / number of documents containing token )
    - greater if the word appears in fewer documents
    
Ok, let's try to implement TF-IDF from scratch! First, we define a function to compute the term frequency:

In [None]:
# Term frequency (TF)
def tf(document):
    # Get tokens
    tokens = get_tokens(document)
    # Initialization 
    term_freq = {token: 0 for token in tokens}  # Notice the use of comprehension!
    # Increment
    for token in tokens:
        term_freq[token] += 1/len(tokens)
    # Return
    return term_freq

Let's check our function:

In [None]:
tf(d1)

Each word appears once in our sentence, and the sentence contains 8 words. Hence, the frequency of each token is 1/8 = 0.125.

Let's proceed, defining a function to compute the inverse document frequency:

In [None]:
# Inverse document frequency
def idf(corpus):
    # Get list of unique words in corpus
    voc = vocabulary(corpus)
    # Tokenize each document once (as sets, for fast membership tests)
    docs_tokens = [set(get_tokens(document)) for document in corpus]
    # Initialization
    inv_doc_freq = {word: 0 for word in voc}
    # Number of documents containing each word
    for word in voc:
        for doc_tokens in docs_tokens:
            if word in doc_tokens:
                inv_doc_freq[word] += 1
    # IDF
    inv_doc_freq = {k: math.log(len(corpus) / inv_doc_freq[k]) for k in inv_doc_freq.keys()}
    # Return
    return inv_doc_freq

Let's test our function:

In [None]:
idf(corpus)

Ok, finally we compute TF-IDF!

In [None]:
# TF-IDF
def tfidf(document, corpus):
    # TF
    tf_bag = tf(document)
    # IDF
    idf_bag = idf(corpus)
    # TF*IDF
    tfidf_bag = {k: tf_bag[k]*idf_bag[k] for k in tf_bag.keys()}
    return tfidf_bag

Let's see the result for the first sentence of our corpus:

In [None]:
tfidf(d1, corpus)

Compared to what we obtained with TF, the values are now scaled by the IDF. For instance, since "the" is a common word, its TF-IDF is lower than that of words like "sectors", which appears in only one document of our corpus.
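
This scaling can be checked by hand. The sketch below recomputes the two values with the formulas given earlier, using only the standard library. It repeats the five sentences so that it runs on its own, and it assumes a simplified regex tokenizer (lowercase alphabetic strings) in place of spaCy; the helper name `tf_idf` is illustrative.

```python
import math
import re

# Same five sentences as the corpus above, repeated so the block runs on its own
sentences = [
    "The impacts of behavioural change vary across sectors.",
    "Changes in travel behaviour limit the rising demand for electricity.",
    "Adopting a healthy diet reduces emissions substantially.",
    "Without behavioural change, the dependency of Europe on carbon removal technologies increases.",
    "Changes in lifestyles are crucial, contributing to achieving climate targets sooner.",
]

# Simplified tokenizer (assumption): keep lowercase alphabetic strings only
tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

def tf_idf(token, doc_tokens, all_docs):
    """TF-IDF with TF = count / document length and IDF = ln(N / document frequency)."""
    tf = doc_tokens.count(token) / len(doc_tokens)
    df = sum(token in d for d in all_docs)
    return tf * math.log(len(all_docs) / df)

score_the = tf_idf("the", tokenized[0], tokenized)
score_sectors = tf_idf("sectors", tokenized[0], tokenized)
print(f"the: {score_the:.4f}, sectors: {score_sectors:.4f}")
```

Both tokens have the same TF (1/8), but "the" occurs in 3 of the 5 documents while "sectors" occurs in only 1, so the scores are about 0.064 and 0.201 respectively.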

Let's compute TF-IDF for all tokens and documents and visualize the result in a dataframe.

In [None]:
# TF-IDF for all documents
bag_of_words_tfidf = [tfidf(doc, corpus) for doc in corpus]

# Visualize in Dataframe
pd.DataFrame(bag_of_words_tfidf,
    index= ['d1', 'd2', 'd3', 'd4', 'd5']
    ).fillna(0)

As before with BOW, the result is not perfect: we could remove stopwords and use n-grams and lemmas.

Instead of implementing the technique from scratch, we can use the `TfidfVectorizer` class of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)). Note that the results differ from ours because `TfidfVectorizer` uses a slightly different (smoothed) formula to compute IDF and, by default, normalizes each document vector.

In [None]:
# Initialize model
vectorizer = TfidfVectorizer()

# Learn the vocabulary dictionary and return document-term matrix
bag_of_words_tfidf = vectorizer.fit_transform(corpus).todense()

# DataFrame
bag_of_words_tfidf = pd.DataFrame(bag_of_words_tfidf, 
                                  columns=vectorizer.get_feature_names_out(),
                                  index = ['d1', 'd2', 'd3', 'd4', 'd5'])
bag_of_words_tfidf
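
The difference between the two IDF formulas can be made explicit. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes IDF as ln((1 + N) / (1 + df)) + 1 and then scales each document vector to unit length. The sketch below checks this on a small illustrative corpus (the three sentences and the variable names are assumptions made for this example, not part of the data above).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Small illustrative corpus
docs = [
    "behavioural change reduces emissions",
    "healthy diet reduces emissions",
    "travel behaviour limits demand",
]

vec = TfidfVectorizer()               # defaults: smooth_idf=True, norm='l2'
X = vec.fit_transform(docs).toarray()

n_docs = len(docs)
doc_freq = (X > 0).sum(axis=0)        # number of documents containing each term

# Smoothed IDF, as used by scikit-learn's default settings
idf_smoothed = np.log((1 + n_docs) / (1 + doc_freq)) + 1
# Textbook IDF, as used in our from-scratch function
idf_textbook = np.log(n_docs / doc_freq)

print(np.allclose(vec.idf_, idf_smoothed))   # fitted IDF matches the smoothed formula
print(np.linalg.norm(X, axis=1))             # each row has unit L2 norm
```

The smoothed version never reaches zero, so a term that appears in every document still gets a small positive weight instead of being erased.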

Advantage of TF-IDF:
- Smart way of representing documents in corpus. More information is provided.

Disadvantages of TF-IDF (same as for BOW):
- A lot of zeros (imagine a corpus of 1000 articles) --> consume memory and space
- Does not maintain any context information ("I eat a fish" vs. "A fish eats me")
- Partial remedies: n-grams, specifying `min_df` and `max_df` (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)).

## Introduction to Gensim and Word Embedding

With BOW and TF-IDF, each word is its own dimension, so words with similar meanings have completely unrelated representations. Thus, sentences with different words but the same meaning/semantics will be very distant.
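
As a minimal sketch of this limitation, the example below compares three made-up sentences using word counts and cosine similarity. It uses whitespace tokenization for simplicity, and the helper names `bow_vector` and `cosine` are illustrative.

```python
import math
from collections import Counter

def bow_vector(sentence):
    """Count word occurrences (whitespace tokenization, a deliberate simplification)."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

s1 = "the film was great"
s2 = "that movie is excellent"   # same meaning as s1, no shared words
s3 = "the film was terrible"     # opposite meaning, three shared words

sim_same_meaning = cosine(bow_vector(s1), bow_vector(s2))
sim_opposite_meaning = cosine(bow_vector(s1), bow_vector(s3))
print(sim_same_meaning, sim_opposite_meaning)
```

The two sentences with the same meaning get a similarity of 0, while the two with opposite meanings get 0.75. Word embeddings aim to fix exactly this.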

In the following, we illustrate how we can find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.

We will use the [Gensim](https://pypi.org/project/gensim/) library. Gensim stands for "Generate Similar". It is a popular open-source natural language processing (NLP) library used for unsupervised topic modeling. A complete tutorial can be found [here](https://www.tutorialspoint.com/gensim/gensim_introduction.htm). 

### Background

Word embedding approaches use neural network-based techniques to convert words into vectors such that semantically similar words are close to each other in an N-dimensional space, where N is the dimension of the vectors. The underlying assumption is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model.

Two word embedding methods:
- [Word2vec](https://en.wikipedia.org/wiki/Word2vec), by Google
- [GloVe](https://en.wikipedia.org/wiki/GloVe) (Global Vectors for Word Representation), by Stanford

Word2vec gives astonishing results. Its ability to capture semantic relationships is illustrated by a classic example: take the vector for the word "King", subtract the vector for "Man", and add the vector for "Woman"; the result is close to the vector for "Queen":
- King - Man + Woman = Queen
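
The arithmetic behind this analogy can be sketched with hand-made vectors. The 2-dimensional values below are invented for illustration (one axis loosely standing for "royalty", the other for "gender"); real embeddings have many more dimensions and are learned from data.

```python
import numpy as np

# Hand-made 2-D "embeddings" (illustrative values, not learned from any corpus)
toy_vectors = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# King - Man + Woman
target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]

# Rank the remaining words by cosine similarity to the target vector
candidates = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
best_match = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best_match)
```

With a large pretrained model loaded in Gensim, the equivalent query is `most_similar(positive=["king", "woman"], negative=["man"])` on the model's keyed vectors. A model trained on a small corpus is unlikely to reproduce the analogy.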

Second example: "dog", "puppy" and "pup" are often used in similar situations, with similar surrounding words like "good", "fluffy" or "cute", and according to Word2vec they will therefore share a similar vector representation.

In real applications, Word2vec models are trained on very large corpora. For example, [Google's pretrained Word2Vec model](https://code.google.com/archive/p/word2vec/) was trained on about 100 billion words from Google News and contains vectors for 3 million words and phrases.

GloVe is an alternative to Word2vec that learns word vectors from global word co-occurrence statistics over the whole corpus. More information [here](https://nlp.stanford.edu/projects/glove/).

Recently, more advanced models have been developed, such as [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) (Bidirectional Encoder Representations from Transformers) and [GPT-3](https://en.wikipedia.org/wiki/GPT-3) (Generative Pre-trained Transformer 3). While Word2vec models represent each token (word) with a single vector, BERT generates different output vectors for the same word when it is used in different contexts. You can find further reading on the topic at the end of this notebook.

### Implementing Word2vec with Gensim

We will implement Word2vec using the Gensim library. We are going to use a corpus of text extracted from Wikipedia by web scraping. We first define a function to retrieve the text from a Wikipedia URL:

In [None]:
# Get texts from Wikipedia
def get_text(url):
    # Retrieve data
    scrapped_data = urllib.request.urlopen(url)
    article = scrapped_data.read()
    # Parse data: the text is contained in the HTML tag 'p'
    parsed_article = bs.BeautifulSoup(article,'lxml')
    paragraphs = parsed_article.find_all('p')  
    # Create a string with all the paragraphs
    article_text = ""
    for p in paragraphs:
        article_text += p.text
    return article_text

Let's get the Wikipedia articles on [Machine Learning](https://en.wikipedia.org/wiki/Machine_learning) and on [Artificial Intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence). This will be our corpus of documents.

In [None]:
# Get articles
machine_learning = get_text("https://en.wikipedia.org/wiki/Machine_learning")
ai = get_text("https://en.wikipedia.org/wiki/Artificial_intelligence")

print(machine_learning[:705])
print(ai[:741])

# Group texts in list
texts = [machine_learning, ai]

Next, we preprocess our texts. We create a tokenizer function to lemmatize each token and remove stopwords.

In [None]:
# Create tokenizer function for preprocessing
def spacy_tokenizer(text):

    # Define stopwords, punctuation, and numbers
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    punctuations = string.punctuation +'–' + '—'
    numbers = "0123456789"

    # Create spacy object
    mytokens = sp(text)

    # Lemmatize each token and convert each token into lowercase
    mytokens = ([ word.lemma_.lower().strip() for word in mytokens ])

    # Remove stop words and punctuation
    mytokens = ([ word for word in mytokens 
                 if word not in stop_words and word not in punctuations ])

    # Remove suffixes like ".[1" in "experience.[1"
    mytokens_2 = []
    for word in mytokens:
        for char in word:
            if (char in punctuations) or (char in numbers):
                word = word.replace(char, "")
        if word != "":
            mytokens_2.append(word)

    # Return preprocessed list of tokens
    return mytokens_2

Let's apply our function to tokenize our corpus of documents:

In [None]:
# Tokenize texts
processed_texts = [spacy_tokenizer(text) for text in texts]

for processed_text in processed_texts:
    print(processed_text[:20])

Now that our text is preprocessed, we can train a Word2vec model. We use the `Word2Vec` module of Gensim ([Documentation](https://radimrehurek.com/gensim/models/word2vec.html)). As input, we provide the processed texts, i.e., a list of lists of tokens. In addition, we use as parameters:
- `min_count`: minimum number of occurrences of a word in the corpus for it to be taken into account
- `vector_size`: dimension of the vectors representing the tokens

Once the model is trained, we can access the mapping between words and embeddings through the attribute `.wv`.

In [None]:
# Word embedding 
word2vec = Word2Vec(processed_texts, min_count=2, vector_size=100)

# Vocabulary
vocab = word2vec.wv.key_to_index
print(vocab)

Each token (word) is represented by a vector (array) of size 100:

In [None]:
# Vector
v1 = word2vec.wv['intelligence'] 
v1

In this space, we can explore the similarities between tokens. For instance, let's find the most similar words to "intelligence":

In [None]:
# Similar vectors/words
sim_words = word2vec.wv.most_similar('intelligence')
sim_words

Or the similarity between two words:

In [None]:
# Similarity between two words
print('The similarity between "computer" and "animal" is: ', word2vec.wv.similarity('computer', 'animal'))
print('The similarity between "computer" and "machine" is: ', word2vec.wv.similarity('computer', 'machine'))

Remarks:
- There are other models than Word2Vec in Gensim. For instance, `Doc2Vec` is used to create a vectorised representation of a group of words (i.e., a document) taken collectively as a single unit (illustrated in the next section).
- Gensim has many applications besides word embedding, see e.g., [topic modelling](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/). Feel free to explore the library!

### Your turn

- Using the functions defined above, create a corpus of documents with the following Wikipedia articles: [Photovoltaics](https://en.wikipedia.org/wiki/Photovoltaics), [Wind turbine](https://en.wikipedia.org/wiki/Wind_turbine), [Hydropower](https://en.wikipedia.org/wiki/Hydropower), and [Nuclear power plant](https://en.wikipedia.org/wiki/Nuclear_power_plant). Do you know the share of each technology in the Swiss electricity mix? Check the [Electricity sector in Switzerland](https://en.wikipedia.org/wiki/Electricity_sector_in_Switzerland) for the answer...

In [None]:
# YOUR CODE HERE


- Preprocessing: Tokenize your corpus of documents

In [None]:
# YOUR CODE HERE


- What is the number of occurrences of the word "energy"?

In [None]:
# YOUR CODE HERE


- Create a Word2Vec representation of the articles with a `min_count` of 1 and a vector size of 50

In [None]:
# YOUR CODE HERE


- What are the 10 most similar words to "electricity"?

In [None]:
# YOUR CODE HERE


## Application: Text Classification with TF-IDF vs Doc2Vec

In this section, we apply text classification to illustrate how the choice of embedding can influence the accuracy of a classifier.

Our goal is to classify consumer finance complaints into 12 pre-defined categories using:
- TF-IDF and logistic regression
- Doc2Vec and logistic regression

We use the same tokenizer function, train-test split, classification algorithm, etc. The only difference is the mathematical representation (i.e., the vectorization from the tokens) of the complaints.

This application was inspired by the articles published by Susan Li on Towards Data Science:
- [Multi-Class Text Classification with Scikit-Learn](https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f)
- [Multi-Class Text Classification with Doc2Vec & Logistic Regression](https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4)

### Load and clean data

We work with a sample of a large dataset from Data.gov that can be found [here](https://catalog.data.gov/dataset/consumer-complaint-database).

In [None]:
# Load data from GitHub
path = "https://raw.githubusercontent.com/michalis0/MGT-502-Data-Science-and-Machine-Learning/main/data/complaints_sample.csv"
df = pd.read_csv(path, index_col=0)
df.head()

In [None]:
df.info()

The data set includes 18 columns and 9101 rows describing consumer complaints about financial products. In this case, we want to predict the `Product` category based on the text of the complaint (i.e., `Consumer complaint narrative`).

In [None]:
# Select columns of interest
data = df[["Product", "Consumer complaint narrative"]]

Around 2/3 of the complaints are null values. They are not useful for the prediction so we drop them.

In [None]:
# Drop NaN
print(data.isnull().sum())
data = data.dropna().reset_index(drop=True)
data.head()

In [None]:
data.info()

We end up with 3137 complaints for which we would like to predict the product concerned.

### Exploratory Data Analysis

As always, we start with an EDA to better understand our data and inform our analysis. First note that we are dealing with a dataset containing a large number of words:

In [None]:
# Total number of words - over 600,000
words_number = data['Consumer complaint narrative'].apply(lambda x: len(x.split(' '))).sum()
print(f'The complaints contain {words_number} words.')

Let's extract a sample to see what the complaints look like:

In [None]:
# Sample
data['Consumer complaint narrative'].sample().values[0]

The data has been anonymized (i.e., names, dates, IDs, etc. have been replaced by XXXX).

Next, note that the classes (products) are imbalanced:

In [None]:
# Imbalanced dataset
data.Product.value_counts()

There are 17 categories. We group some of them together (e.g., "Credit card", "Prepaid card", and "Credit card or prepaid card") because they overlap or are sub-categories of one another. We end up with 12 categories.

In [None]:
# Merge categories
dic_replace = {'Credit reporting':'Credit reporting, credit repair services, or other personal consumer reports', 
               'Credit card':'Credit card or prepaid card', 
               'Payday loan':'Payday loan, title loan, or personal loan', 
               'Money transfers':'Money transfer, virtual currency, or money service',
               'Prepaid card':'Credit card or prepaid card',
               'Virtual currency':'Money transfer, virtual currency, or money service'}
data.replace(dic_replace, inplace=True)
data.Product.value_counts()

Let's visualize the number of observations per product using a bar plot.

In [None]:
# Plot number of complaints per category
cnt_pro = data['Product'].value_counts()
plt.figure(figsize=(12,4))
sns.countplot(x=data['Product'], order = cnt_pro.index)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.xticks(rotation=90)
plt.show()

Finally, let's compute the base rate, i.e., the accuracy obtained using a naive classifier that predicts that all observations are from the largest class ("Credit reporting, credit repair services, or other personal consumer reports").

In [None]:
# Base rate
base_rate = round(len(data[data.Product == "Credit reporting, credit repair services, or other personal consumer reports"]) / len (data), 4)
print(f'The base rate is: {base_rate*100:0.2f}%')

### Classification using TF-IDF and Logistic Regression

We first define our training and test sets, using the `train_test_split` function of sklearn.

In [None]:
# Select features
X = data['Consumer complaint narrative'] # Features we want to analyze
ylabels = data['Product']                # Labels we test against

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.2, random_state=1234)

Next, we use the `TfidfVectorizer` class of sklearn for the vectorization. Since we are dealing with very specific data (e.g., the anonymization process generated non-standard sequences of characters), we define our own tokenizer function, which we pass to `TfidfVectorizer` in place of the default one.

In [None]:
# Define tokenizer function
def spacy_tokenizer(sentence):

    punctuations = string.punctuation
    stop_words = spacy.lang.en.stop_words.STOP_WORDS

    # Create token object, which is used to create documents with linguistic annotations.
    mytokens = sp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # Remove anonymous dates and people
    mytokens = [ word.replace('xx/', '').replace('xxxx/', '').replace('xx', '') for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in ["xxxx", "xx", ""] ]

    # Return preprocessed list of tokens
    return mytokens

As other parameters of `TfidfVectorizer`, we use single tokens and pairs of tokens (`ngram_range = (1,2)`) and we ignore terms that have a document frequency strictly lower than 5 (`min_df = 5`).

Note that we also rely on the `Pipeline` class of sklearn ([Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)) to chain the steps, first the vectorizer, then the classifier. We also time our training (it might take a few minutes).

In [None]:
%%time
# Define vectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), tokenizer=spacy_tokenizer)

# Define classifier
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)

# Create pipeline
pipe = Pipeline([('vectorizer', tfidf),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

Finally, we predict the test set values and evaluate the performance of our model:

In [None]:
# Predictions
y_pred = pipe.predict(X_test)

In [None]:
# Evaluate model

## Accuracy
accuracy_tfidf = round(accuracy_score(y_test, y_pred), 4)
print(f'The accuracy using TF-IDF is: {accuracy_tfidf*100:0.2f}%')

## Confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8,7))
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

### Classification using Doc2Vec and Logistic Regression

We now repeat the same exercise, but using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html). While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document (here, a complaint) in the corpus.

We first tokenize our data, using the same tokenizer as before. We use `TaggedDocument` from Gensim to obtain the appropriate input document format for `Doc2Vec`. `TaggedDocument` returns, for each observation, a document containing *words* (i.e., a list of tokens) and the associated *tags* (our label).

In [None]:
%%time

# Tokenize data - same tokenizer function as before
sample_tagged = data.apply(lambda r: TaggedDocument(words=spacy_tokenizer(r['Consumer complaint narrative']), tags=[r.Product]), axis=1)
print(sample_tagged.head(20))

In [None]:
sample_tagged.values[10]

Next, we split our dataset into training and test, using the same split as before.

In [None]:
# Train test split - same split as before
train_tagged, test_tagged = train_test_split(sample_tagged, test_size=0.2, random_state=1234)

To speed up the training process, we use the `multiprocessing` library ([Documentation](https://docs.python.org/3/library/multiprocessing.html)) to count the available CPU cores. Passing this number to the `workers` parameter of `Doc2Vec` lets Gensim train the model with several worker threads.

In [None]:
# Number of CPUs in the system
cores = multiprocessing.cpu_count()

Next we define the `Doc2Vec` model, using as parameters:
- `dm`: training algorithm, if 1 distributed memory is used (PV-DM), else distributed bag of words (PV-DBOW)
- `vector_size`: dimension of feature vector
- `hs`: if 1, hierarchical softmax will be used for model training; if set to 0, and negative is non-zero, negative sampling will be used
- `negative`: specifies how many "noise words" should be drawn for negative sampling
- `min_count`: ignore words with frequency lower than this
- `epochs`: number of iterations (epochs) over the corpus

In addition, we build our vocabulary.

In [None]:
# Define Doc2Vec and build vocabulary
model_dbow = Doc2Vec(dm=0, vector_size=30, negative=6, hs=0, min_count=1, sample=0, workers=cores, epochs=300)
model_dbow.build_vocab([x for x in train_tagged.values])

We now train the distributed bag of words (PV-DBOW) model. In short, it trains a neural network whose learned weights are the coordinates of the document vectors. As a result, similar documents end up close to each other in the N-dimensional space (N being the size of the vectors). More information on this [here](https://thinkinfi.com/simple-doc2vec-explained/).

In [None]:
# Train distributed Bag of Word model
model_dbow.train(train_tagged, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs)

Next, we convert our tagged documents into vectors using the model we trained. We define a function that takes our model and the tagged documents as input and returns the targets (i.e., labels) and the regressors, i.e., the vector representations of the complaints. We then apply this function to prepare the training and test sets for the classification.

In [None]:
# Embedding
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, epochs=100)) for doc in sents])
    return targets, regressors

y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)

Note that each document (i.e., complaint) is now a vector in a 30-dimensional space. Similar complaints should have similar vector representations.

In [None]:
X_train[:3]

Ok, we can finally implement our logistic regression. We proceed as before, training the model, predicting on the test set, and evaluating the performance of our classifier.

In [None]:
# Fit model on training set - same algorithm as before
logreg = LogisticRegression(max_iter=1000, solver='lbfgs')
logreg.fit(X_train, y_train)

# Predictions
y_pred = logreg.predict(X_test)

In [None]:
# Evaluate model

## Accuracy
accuracy_doc2vec = round(accuracy_score(y_test, y_pred), 4)
print(f'The accuracy using Doc2Vec is: {accuracy_doc2vec*100:0.2f}%')

## Confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(8,7))
sns.heatmap(conf_mat, annot=True, fmt='d')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

What do you observe?

### How to improve the accuracy of a text classifier?

In order to improve the prediction, we can try to:
- Resample our data (i.e., create balanced dataset)
- Tune the hyperparameters of the model
- Improve text preparation
- Use another classifier, e.g., k-NN, decision trees, random forests, etc.
- All of the above!
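
As a starting point for the first two ideas, here is a minimal sketch on hypothetical toy data. It uses `class_weight='balanced'`, which reweights the classes rather than resampling them (a swapped-in alternative to resampling), and `GridSearchCV` to try a few hyperparameter values. The texts, labels, and grid values are assumptions chosen for illustration, not tuned settings for the complaints dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical, deliberately imbalanced toy data (6 vs. 2 examples)
toy_texts = [
    "error on my credit report",
    "credit report shows wrong account",
    "dispute credit report entry",
    "credit report not updated",
    "incorrect information on credit report",
    "credit report inquiry I did not make",
    "mortgage payment was lost",
    "mortgage escrow payment problem",
]
toy_labels = ["credit"] * 6 + ["mortgage"] * 2

toy_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    # class_weight='balanced' gives more weight to the rare class
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Candidate hyperparameters, addressed as <step name>__<parameter>
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1, 10],
}

# 2-fold cross-validated search over all combinations
search = GridSearchCV(toy_pipe, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.predict(["problem with my mortgage payment"]))
```

On the real data, the same pattern applies to the `pipe` object defined earlier, although the search will take much longer because the spaCy tokenizer runs for every fold and parameter combination.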

## Further reading

Text Analytics is a rich field. We have seen above one application of text classification. Another common application of text classification is [Sentiment Analysis](https://en.wikipedia.org/wiki/Sentiment_analysis), which is the process of tagging data according to their sentiment, such as positive, negative and neutral. You can find a guide to sentiment analysis [here](https://huggingface.co/blog/sentiment-analysis-python). You can also find many already trained models on [Hugging Face](https://huggingface.co/models), including ready-made classifiers for sentiment analysis.

Finally, here are some resources on Word2Vec, GloVe, BERT to deepen your understanding of the topic:
- Rong, X. (2014). word2vec parameter learning explained. *arXiv preprint arXiv:*[1411.2738](https://arxiv.org/abs/1411.2738)
- [Introduction to Word Embedding and Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
- [Word Embeddings for NLP](https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4)
- [NLP — Word Embedding & GloVe](https://jonathan-hui.medium.com/nlp-word-embedding-glove-5e7f523999f6)
- [Intuitive Guide to Understanding GloVe Embeddings](https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010)
- [Word2vec vs BERT](https://medium.com/@ankiit/word2vec-vs-bert-d04ab3ade4c9)
- [Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT](https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794)