# Text Data: an introduction
## What is text data?
- Text data is information represented in the form of text: a collection of words, sentences, and paragraphs that is human-readable and made up of letters, digits, and punctuation. Text data is everywhere, being one of the most common forms of data generated by humans, in the form of blogs, tweets, comments, and so on.

- But, text data is **unstructured data**, and it is not easy to extract information from it.  Text data is **also known as natural language data**.

- NLP (Natural Language Processing) is a field of computer science that deals with the interaction between computers and humans using the natural language.

## Applications of text data
Text data is used in many applications, such as:
- **SENTIMENT ANALYSIS**: Sentiment analysis is the process of analyzing the sentiment of a piece of text. It is used to determine whether the sentiment of a piece of text is positive, negative, or neutral. It is used in many applications, such as social media monitoring, brand monitoring, and customer service.
  
- **TEXT CLASSIFICATION**: Text classification is the process of classifying text into different categories. It is used in many applications, such as spam detection, sentiment analysis, and topic classification.
  
- **TEXT SUMMARIZATION**: Text summarization is the process of summarizing text into a shorter version. It is used in many applications, such as news summarization, document summarization, and email summarization.
  
- **MACHINE TRANSLATION**: Machine translation is the process of translating text from one language to another. It is used in many applications, such as language translation, document translation, and website translation.
  
- **QUESTION ANSWERING**: Question answering is the process of automatically answering questions posed in natural language. It is used in many applications, such as search engines, virtual assistants, and customer support chatbots.
  
- **INFORMATION RETRIEVAL**: Information retrieval is the process of retrieving information from a collection of documents.
  
- etc.

### Basic definitions

So, a text document is a collection of words, sentences, and paragraphs. All in all, it is no more than a string. Let us define some basic concepts:

#### Corpus
 The **corpus** is (a large and) structured collection of texts or written materials that are used for linguistic analysis, research, or language modeling purposes. 
 
E.g., the collection of all the documents in a library, the collection of all the documents in a database, the collection of all the documents in a website, etc.

#### Lexicon
The **lexicon** of a language is the set of all words in that language. It is also called the **vocabulary** of the language.

**Words** are also called **terms** and, in some contexts, **tokens**.

E.g., the lexicon of the English language is the set of all English words.


#### $n$-grams

$n$-grams are contiguous sequences of $n$ items from a given sample of text or speech. In the context of natural language processing, an $n$-gram typically refers to a sequence of $n$ words or characters.

For example, let's consider the sentence: "I love to code." Here are some examples of $n$-grams with different values of $n$:

- Unigrams ($n = 1$): ["I", "love", "to", "code"]
- Bigrams ($n = 2$): ["I love", "love to", "to code"]
- Trigrams ($n = 3$): ["I love to", "love to code"]
- 4-grams ($n = 4$): ["I love to code"]

The longer the $n$-gram (the higher the value of $n$), the more context each item carries, which can help capture the structure and sentiment of a text. However, longer $n$-grams are also rarer, so the feature space grows and becomes sparser. The optimal value of $n$ depends on the application and the dataset. For example, word unigrams are often a strong baseline for spam detection, while character $n$-grams (e.g., 4-grams) are commonly used for authorship attribution.
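The $n$-grams listed above can be generated with a few lines of plain Python. The following is a minimal sketch, assuming whitespace tokenization and a sentence without the final period; the helper names `word_ngrams` and `char_ngrams` are illustrative and do not come from any library.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams of `text` (whitespace tokenization)."""
    tokens = text.split()
    # slide a window of n tokens over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the list of character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "I love to code"
for n in range(1, 5):
    print(n, word_ngrams(sentence, n))

print(char_ngrams("code", 2))
```

When $n$ is larger than the number of tokens, the window does not fit and both helpers return an empty list.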

#### Summary example
For example, given the list of documents:
- D1: "I like to play football"
- D2: "I hate football"
- D3: "I like to play tennis"

As a result:

- The lexicon is: {I, like, to, play, football, hate, tennis}.
- The corpus is: {D1, D2, D3}.
- The set of 2-grams / bigrams is: {I like, like to, to play, play football, I hate, hate football, play tennis}.
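As a quick check of this summary example, the lexicon and the set of distinct bigrams can be computed directly. This is a minimal sketch that assumes whitespace tokenization and keeps the original casing; the variable names are illustrative.

```python
corpus = {
    "D1": "I like to play football",
    "D2": "I hate football",
    "D3": "I like to play tennis",
}

lexicon = set()
bigrams = set()
for text in corpus.values():
    tokens = text.split()
    lexicon.update(tokens)                   # every distinct word
    bigrams.update(zip(tokens, tokens[1:]))  # every distinct pair of adjacent words

print("lexicon:", sorted(lexicon))
print("bigrams:", sorted(bigrams))
```

Because sets are used, bigrams that occur in more than one document (such as "like to") are stored only once.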



## Basic feature extraction techniques
### Dataset

In this notebook we'll use the **Large Movie Review Dataset v1.0**, retrieved from http://ai.stanford.edu/~amaas/data/sentiment/.

This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification.


The core dataset contains 50,000 reviews split evenly into 25K train and 25K test sets. The overall distribution of labels is balanced (25K are positive and 25K are negative). It also includes an additional 50,000 unlabeled documents for unsupervised learning, which we will not use in this notebook.

In the entire collection, 
- no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. 

- the train and test sets contain disjoint sets of movies, so no significant performance gain can be obtained by memorizing movie-unique terms and their association with the observed labels.  

In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.
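The labeling rule described above can be written as a small function. This is only an illustrative sketch: the function name `label_from_score` and the label strings are assumptions, not part of the dataset files.

```python
def label_from_score(score):
    """Map a 1-10 rating to a binary sentiment label.

    Returns None for ratings that are excluded from the labeled sets.
    """
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None  # ratings 5 and 6 are considered too neutral


for score in (1, 4, 5, 6, 7, 10):
    print(score, "->", label_from_score(score))
```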

Please read the README (in the aclImdb.zip) for more details. We will use the file imdb_data.csv.zip which is the compiled version of the dataset.

In [None]:
from IPython.display import HTML
import pandas as pd

df = pd.read_csv('./data/IMDB/imdb_data_train.zip')

print('shape = ', df.shape)
df.head()

For example, a review is something like:

In [None]:
print('--> review:', df.loc[42, 'review'])
print('--> classification:', df.loc[42, 'classification'])
print('--> sentiment:', df.loc[42, 'sentiment'])

So, this set of reviews makes up our corpus of text data. A number of operations can now be performed on this data. Let's start with some basic feature extraction techniques.

### Number of words

Counting the words in a document is a basic feature extraction technique.

We can use the `split()` method to count the number of words in a document. It splits a string into a list of substrings. Called without arguments, it splits on any run of whitespace; called with `" "`, it splits on every single space, so consecutive spaces produce empty strings that would be counted as words. We therefore call it without arguments.

In [None]:
df['#words'] = df['review'].apply(lambda x: len(str(x).split()))
df.head()

In [None]:
def classification_vs_feature__analysis(df, col):
    """ function to analyze correlation between a column and classification """
    print(df[['classification', col]].corr())
    df[['classification', col]].groupby(['classification']).mean().sort_values(by='classification', ascending=True).plot.bar()

We can observe that there is no direct correlation between the number of words in a document and the sentiment of the document. 

Nevertheless, we can see that reviews with extreme classifications tend to have fewer words than the others...

In [None]:
classification_vs_feature__analysis(df, '#words')

### Number of characters
Counting the number of characters in a document is also a basic feature extraction technique.

In [None]:
df['#chars'] = df['review'].str.len()
df.head()

Draw your own conclusions from the following...

In [None]:
classification_vs_feature__analysis(df, '#chars')

### Average word length
Average word length is a feature extraction technique that is used to find the average length of all the words in a document.

In [None]:
def avg_word(sentence):
    """ return the average word length of a sentence (0.0 if it has no words) """
    words_lens = [len(word) for word in sentence.split()]
    return sum(words_lens) / len(words_lens) if words_lens else 0.0

df['avg_word_len'] = df['review'].apply(avg_word)

df[['review', 'avg_word_len']].head()

In [None]:
classification_vs_feature__analysis(df, 'avg_word_len')

### Number of Stopwords
Stopwords are the words that are most commonly used in a language, such as "the", "a", "an", "in", and "on". **These words carry little meaning on their own**. 

Stopwords are **removed to reduce the dimensionality of the data** and to **remove noise from the data** (e.g., see https://en.wikipedia.org/wiki/Stop_word).

Obviously, stop words are language dependent.
- In English, stopwords are, for example: "the", "a", "an", "in", "on", etc.
- In Portuguese, stopwords are, for example: "o", "a", "os", "as", "em", "sobre", etc.
- In Spanish, stopwords are, for example: "el", "la", "los", "las", "en", "sobre", etc.
- In Japanese, stopwords are, for example: "の", "に", "は", "を", "た", "が", etc. (!)

We can use the `nltk` library to count the number of stopwords in a document. `nltk` (the Natural Language Toolkit) is a collection of natural language processing tools. It is used to perform various natural language processing tasks, such as tokenization, stemming, lemmatization, and so on. We'll use some of these functionalities later.

In [None]:
import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')
stop_words = stopwords.words('english')

print(stop_words)

In [None]:
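stop_set = set(stop_words)  # set membership tests are much faster than list lookups
# the nltk stop words are lowercase, so each word is lowercased before the lookup
df['#stopwords'] = df['review'].apply(lambda doc: sum(word.lower() in stop_set for word in doc.split()))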
df.head()

In [None]:
classification_vs_feature__analysis(df, '#stopwords')

#### Personalized stop words list

**Using a preexisting collection of stop words may seem convenient, but it often proves inadequate for specific applications**. Take clinical texts, for instance, where words like "mcg," "dr.," and "patient" appear frequently in almost every document. In the context of clinical text mining and retrieval, these terms can be considered as potential stop words. Likewise, when dealing with tweets, terms like "#," "RT," and "@username" may qualify as potential stop words. Unfortunately, the standard list of language-specific stop words fails to encompass these domain-specific terms.

To **set our own stop words**, as a rule of thumb, we can use the following strategies:
1. set the $n$-most frequent terms in the corpus as stop words
2. set the $n$-least frequent terms in the corpus as stop words
3. set the $n$-least IDF score terms as stop words (see below)
4. ...
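The first two strategies can be sketched with a simple frequency count. This is a minimal illustration, assuming lowercased whitespace tokens; the function name `frequency_stop_words`, its parameters, and the toy clinical sentences are hypothetical.

```python
from collections import Counter


def frequency_stop_words(documents, n_most=0, n_least=0):
    """Return the n_most most frequent and the n_least least frequent terms."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    ranked = [term for term, _ in counts.most_common()]  # most frequent first
    stop = set(ranked[:n_most])
    if n_least:
        stop.update(ranked[-n_least:])
    return stop


docs = [
    "the patient received 5 mcg",
    "the patient was seen by the dr",
    "the dr discharged the patient",
]
print(frequency_stop_words(docs, n_most=2))
```

Many terms usually share the lowest count, so the choice of the $n$ least frequent terms is arbitrary among ties; a minimum-count threshold is often easier to reason about.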

### Number of special characters
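Special characters include `!`, `@`, `#`, `$`, `%`, etc. We can use the pandas `str.count()` method to count the occurrences of a special character in each document. Note that `str.count()` interprets its argument as a regular expression, so characters such as `$` must be escaped (e.g., `r'\$'`). 

For instance, the number of exclamation marks can be used as a proxy for the level of excitement, surprise, anger, etc. in a document. 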

In [None]:
df['#exclamations'] = df['review'].str.count('!')

df[['review', '#exclamations']].sort_values(by='#exclamations', ascending=False).head()

In [None]:
df.loc[15598, 'review']

Here, we can see that the number of exclamation points is higher in extreme reviews!

In [None]:
classification_vs_feature__analysis(df, '#exclamations')

As another example, on Twitter the `#` character is used to tag topics (hashtags), so counting such tokens can be useful in topic detection.

In [None]:
df['#topics'] = df['review'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))

df[['review', '#topics']].sort_values(by='#topics', ascending=False).head()

### Number of numerics
We can use the `isdigit()` method to count the number of numeric tokens in a document (only tokens made up entirely of digits are counted, so "10/10" or "3.5" are not).
This can be useful in detecting spam, which often contains a lot of numbers.


In [None]:
df['#numerics'] = df['review'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
df[['review', '#numerics']].head()

### Number of uppercase words
Uppercase words can be used to express anger or excitement. We can use the `isupper()` function to count the number of uppercase words in a document.
In our case, it may not be so useful, as we can see in the list of uppercase words below.

In [None]:
df['upper'] = df['review'].apply(lambda x: [x for x in x.split() if x.isupper()])
df['#upper'] = df['upper'].apply(lambda x: len(x))

df[['review', 'upper', '#upper']].sort_values(by='#upper', ascending=False).head(10)

In [None]:
HTML(df.loc[9542, 'review'])

## Basic pre-processing of text data
So far, we have seen how to extract basic features from text data. Now, we will see how to pre-process text data before extracting features from it. **Pre-processing** refers to the transformations applied to our data before feeding it to some algorithm. In the context of text data, it is also known as **text cleaning and pre-processing**.

Text normalization, also known as text standardization, is a process that transforms text into a consistent or canonical form. Its purpose is to ensure uniformity and facilitate text processing and analysis. The normalization process is not a one-size-fits-all approach and can involve various techniques.

For example, one common step in normalization is converting all text to lowercase. This straightforward and widely applicable method is effective for text pre-processing. Additionally, dealing with misspelled words, acronyms, short forms, and out-of-vocabulary terms is another approach. For instance, terms like "super," "superb," and "superrrr" can be normalized to "super." By applying text normalization techniques, the noise and disruptions in the text data are handled, resulting in cleaner, more reliable data.
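Two of these normalization steps, lowercasing and handling elongated words such as "superrrr", can be sketched with the standard library. This is only an illustrative heuristic under stated assumptions (the function name `normalize_token` is hypothetical): it collapses any character repeated three or more times, and it does not map distinct words such as "superb" to "super".

```python
import re


def normalize_token(token):
    """Lowercase a token and collapse characters repeated 3+ times."""
    token = token.lower()
    # (.)\1{2,} matches a character followed by at least two copies of itself
    return re.sub(r"(.)\1{2,}", r"\1", token)


for raw in ["Super", "superrrr", "SUPERRR", "good", "coool"]:
    print(f"{raw:10} -> {normalize_token(raw)}")
```

The heuristic is lossy: "good" is left untouched because the letter is only doubled, but "coool" becomes "col" rather than "cool", which is why dictionary-based spelling correction is often applied as well.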

Stemming and lemmatization are also employed as part of text normalization. Stemming reduces words to their base or root form, while lemmatization aims to bring words to their canonical or dictionary form. These techniques further contribute to word normalization in text processing.

All in all, the texts are transformed from a sequential list of words into a multidimensional vector of numbers, as we will see below.

### $n$-grams
Remember, an $n$-gram is a contiguous sequence of $n$ items from a given sample of text or speech.

The `nltk` library provides a function to split text into $n$-grams. We can use the `ngrams()` function from the `nltk.util` module to generate $n$-grams from a sequence of tokens. The `ngrams()` function takes in two arguments: the sequence of tokens and the value of $n$.

In [None]:
from nltk.util import ngrams

# build the n-grams for the reviews
df['review_ngrams'] = df['review'].apply(lambda x: list(ngrams(x.split(), 2)))

df[['review', 'review_ngrams']]

### Bag-of-Words (BoW)
A **bag-of-words (BoW)** is a representation of text that describes the occurrence of words within a document. It keeps track of word counts and disregards the grammatical details and the word order.

A possible assumption is that the higher the count of a word in a document, the more important it is and vice versa.

We will use the `CountVectorizer()` function from the `sklearn.feature_extraction.text` module to perform CountVectorization (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). The `CountVectorizer()` function takes in a list of strings and converts it to a matrix of integers. Each row in the matrix represents a document and each column represents a word and the corresponding cell represents the count of that word in that document.

Further, for this part, we'll restrict the corpus to the first 10 documents in the IMDB dataset.

In [None]:
# the corpus is a list of strings (documents) to analyze
corpus = df['review'].head(10)
corpus

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# initialize the CountVectorizer
cv = CountVectorizer()

# fit_transform() creates the vocabulary and returns a term-document matrix
X = cv.fit_transform(corpus)

# build a dataframe with the term-document matrix
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

Some observations can be made at once:
- the matrix is sparse, i.e., most of the cells are zero
- the matrix is not normalized, i.e., the number of words in each document is not taken into account
- the matrix contains a lot of words that might not be useful for some analyses, e.g., "about", "above", "after", etc. These words are called **stop words** and they are usually removed from the corpus before performing CountVectorization.
- depending on the context, some words have the same "meaning" but are written differently, e.g., "actor" and "actress" (should these words be merged?)

#### CountVectorizer with stop words removal
We can remove stopwords before performing CountVectorization, or ask `CountVectorizer()` to do it by passing either a list of stopwords or the string `'english'` to its `stop_words` parameter.

In [None]:
cv = CountVectorizer(stop_words='english')

# fit_transform() creates the vocabulary and returns a term-document matrix
X = cv.fit_transform(corpus)

# build a dataframe with the term-document matrix. Note the toarray() function is used to convert the sparse matrix to a dense matrix
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

#### CountVectorizer with stop words removal and $n$-grams
We can extract $n$-grams by passing a `(min_n, max_n)` tuple to the `ngram_range` parameter of the `CountVectorizer()` function. For example, `ngram_range=(1, 2)` extracts both unigrams and bigrams.

In [None]:
cv = CountVectorizer(stop_words='english', 
                     ngram_range=(1, 2))

# fit_transform() creates the vocabulary and returns a term-document matrix
X = cv.fit_transform(corpus)

# build a dataframe with the term-document matrix
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

#### Meaningfulness of the results
Bag-of-words does not bring in any information on the meaning of the text. For example, if we consider these two sentences
- “Text processing is easy but tedious.” and
- “Text processing is tedious but easy.”

a bag-of-words model would create the same vectors for both of them, even though they have different meanings.

In [None]:
small_corpus = [
    'Text processing is easy but tedious.',
    'Text processing is tedious but easy.'
]

cv = CountVectorizer(stop_words='english')

# fit_transform() creates the vocabulary and returns a term-document matrix
X = cv.fit_transform(small_corpus)

# build a dataframe with the term-document matrix
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

### HashingVectorizer
The `HashingVectorizer()` function from the `sklearn.feature_extraction.text` module is another technique that is used to extract features from text (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). It is also known as `Hashing Trick`.

In this technique, we simply **apply a hash function to the terms to convert them into numeric values**, which are used directly as column indexes. Ideally the hash function assigns a distinct index to each term, so we do not need to store the vocabulary explicitly (collisions can occur, as discussed below). This helps us save memory.

Further, it turns a collection of text documents into a `scipy.sparse` matrix holding token occurrence counts (or binary occurrence information), possibly normalized. The available norms are 'l1' and 'l2', computed for each document (row) $i$ as:
- 'l1': $\frac{x_{ij}}{\sum_k |x_{ik}|}$
- 'l2': $\frac{x_{ij}}{\sqrt{\sum_k x_{ik}^2}}$ 

This strategy has several advantages:
- it is very **low-memory** and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory.
- it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.
- it can be used in a streaming (partial fit) or parallel pipeline as there is **no state computed during fit**.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
- there is **no way to compute the inverse transform** (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
- there can be **collisions**: distinct tokens can be mapped to the same feature index. However, in practice this is **rarely** an issue if `n_features` is large enough (e.g. $2^{18}$ for text classification problems).
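To make the idea concrete, here is a minimal pure-Python sketch of the hashing trick. It is not scikit-learn's implementation: it assumes MD5 as a stable hash function, whitespace tokenization, and a dense output vector, and the names `hash_index` and `hashing_vectorize` are illustrative.

```python
import hashlib
import math


def hash_index(token, n_features):
    """Map a token to a column index using a stable hash (MD5 here)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def hashing_vectorize(doc, n_features=16, norm=None):
    """Return a dense count vector for `doc`, optionally l1- or l2-normalized."""
    vec = [0.0] * n_features
    for token in doc.lower().split():
        # no vocabulary is stored: the index is recomputed from the token itself
        vec[hash_index(token, n_features)] += 1.0
    if norm == "l1":
        total = sum(abs(v) for v in vec)
    elif norm == "l2":
        total = math.sqrt(sum(v * v for v in vec))
    else:
        total = 1.0
    return [v / total for v in vec] if total else vec


print(hashing_vectorize("to be or not to be"))
print(hash_index("shakespeare", 500))
```

Since nothing is learned from the data, the same token always lands in the same column, and two different tokens may share a column when `n_features` is small.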

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(stop_words='english',
                       n_features=500) # note: this is a very small number of features (hash buckets)... in a real case, we would use a much larger value

# fit_transform() does not learn a vocabulary (the vectorizer is stateless): it just hashes the terms and returns a term-document matrix
X = hv.fit_transform(corpus)

print(X.shape)
# build a dataframe with the term-document matrix
hv_df = pd.DataFrame(X.toarray())
hv_df

For example, 'Shakespeare' is hashed to index 401:

In [None]:
print(hv.transform(['Shakespeare']))

So, the documents containing words hashed to 401 can be found in the following column:

In [None]:
hv_df.loc[:, 401]

In row 0, we can find the word "Shakespeare", as we can see below.

In [None]:
HTML(corpus.loc[0])

However, in row 4 we can't find the word "Shakespeare", as we can see below. Some other word is hashed to 401 (a collision).

In [None]:
HTML(corpus.loc[4])

### Lower casing
Another preprocessing technique is to convert the text to lower case. This avoids having multiple copies of the same word just because it was capitalized differently.

For instance, converting to lower case is the default behavior of the `CountVectorizer()` function from the `sklearn.feature_extraction.text` module (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

But it can also be done manually, using the `lower()` method of Python's `str` type (https://docs.python.org/3/library/stdtypes.html#str.lower).

In [None]:
df['review_lower'] = df['review'].apply(lambda x: x.lower())

df[['review', 'review_lower']]

### Punctuation removal
Punctuation is often not useful for (basic) text analysis. We can remove it using the `sub()` function from the `re` module (https://docs.python.org/3/library/re.html).

The regular expression `[^\w\s]` matches every character that is not (`^`) a word character (a letter, a digit, or an underscore) or whitespace. Then the `re.sub()` function replaces all the matches with the empty string (see https://regexr.com/ for more information about regular expressions and to test them).
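A small standalone example may help to see what this pattern does and does not remove. It is a sketch using only the standard library; the helper name `strip_punctuation` and its `drop_underscore` parameter are illustrative.

```python
import re


def strip_punctuation(text, drop_underscore=False):
    """Remove punctuation; optionally remove underscores too."""
    # \w also matches "_", so it survives unless it is listed explicitly
    pattern = r"[^\w\s]|_" if drop_underscore else r"[^\w\s]"
    return re.sub(pattern, "", text)


text = "Wow!!! It's a 10/10 movie_title... isn't it?"
print(strip_punctuation(text))
print(strip_punctuation(text, drop_underscore=True))
```

Note that apostrophes are removed as well, so "isn't" becomes "isnt" and will no longer match a stop word list that contains "isn't".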

In [None]:
import re

df['review_no_punctuation'] = df['review_lower'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

df[['review', 'review_no_punctuation']]

### Stopwords removal
Removing stopwords is a common preprocessing step in text analysis. In this example, the stopwords are removed using the `stopwords` corpus from the `nltk` module (https://www.nltk.org/).

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords_list = stopwords.words('english')

print(stopwords_list)

In [None]:
df['review_no_stopwords'] = df['review_no_punctuation']\
    .apply(lambda x: ' '.join([word for word in x.split() if word not in (stopwords_list)]))

df[['review', 'review_no_stopwords']]

### Frequent words removal
**Sometimes**, we can also remove the most frequent words from the text data. These words **might** not be useful for text analysis as they are very common and do not provide any information about the text.

In [None]:
# create a list of all the words in the text
words_frequence = df['review_no_stopwords']\
    .str\
    .split(expand=True)\
    .unstack()\
    .value_counts()\
    .sort_values(ascending=False)

words_frequence

In [None]:
# remove the 10 most frequent words
words_to_remove = words_frequence[:10].index.tolist()

print('words to remove:', words_to_remove)

df['review_no_frequent_words'] = df['review_no_stopwords'].apply(lambda x: ' '.join([word for word in x.split() if word not in words_to_remove]))

df[['review_no_stopwords', 'review_no_frequent_words']]

### Rare words removal
Similarly, we can remove the rare words from the text data. These words might not be useful for text analysis because they are so rare that the association between them and other words is dominated by noise.

In [None]:
# remove the 10 rarest words
words_to_remove = words_frequence[-10:].index.tolist()
words_to_remove

In [None]:
df['review_no_rare_words'] = df['review_no_stopwords'].apply(lambda x: ' '.join([word for word in x.split() if word not in (words_to_remove)]))

df[['review_no_stopwords', 'review_no_rare_words']]

### Spelling correction

Text reviews (posts in general) often contain spelling mistakes. We can use the `TextBlob()` class from the `textblob` module (https://textblob.readthedocs.io/en/dev/) to correct the spelling of the words. Spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate...!

In [None]:
# !pip install textblob
from textblob import TextBlob

# this is a time-consuming process so we'll just do it for the first 10 reviews
# (note: .loc slicing is label-based and includes the end label, hence :9)
df.loc[:9, 'review_corrected'] = df.loc[:9, 'review'].apply(lambda x: str(TextBlob(x).correct()))

df[['review', 'review_corrected']]

### Tokenization
Tokenization is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph. For instance, the sentence "The cat is brown" can be tokenized into the list of tokens ['The', 'cat', 'is', 'brown'].
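To illustrate the idea without any external resources, here is a minimal regex-based tokenizer. It is a simplified stand-in and not the algorithm used by `nltk`; the function name `simple_tokenize` is illustrative.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("The cat is brown"))
print(simple_tokenize("Don't stop, it's great!"))

# plain whitespace splitting keeps the punctuation attached to the words
print("Don't stop, it's great!".split())
```

Unlike `str.split()`, this separates punctuation from words, but it also breaks contractions into pieces, which dedicated tokenizers handle more carefully.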

In [None]:
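from nltk.tokenize import word_tokenize
# nltk.download('punkt') # download the tokenizer models, if needed ('punkt_tab' in recent nltk versions)

# tokenize the original reviews ('review_corrected' is only filled for the first few rows)
df['review_tokenized'] = df['review'].apply(word_tokenize)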
df[['review', 'review_tokenized']]


### Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (from https://en.wikipedia.org/wiki/Stemming). It typically works by chopping off suffixes (and sometimes prefixes) using heuristic rules. 

For instance, the words 'fishing', 'fished', and 'fishes' all reduce to the stem 'fish'. Stemming is useful in text analysis as it reduces the number of words to analyze.

In this example, stemming is done using the `SnowballStemmer()` class from the `nltk` module (https://www.nltk.org/). Note that the result depends on the stemmer: different stemmers may produce different stems for the same word.

In [None]:
from nltk.stem import SnowballStemmer

# create an instance of the SnowballStemmer class
stemmer = SnowballStemmer('english')

# define a function that applies the stemming to each word of a document
def stem_doc(doc):
    return ' '.join([stemmer.stem(word) for word in doc.split()])

sentence = 'actor actors actress actresses fish fishing fishes fished fisher am are is was were best well better good'

for w1, w2 in zip(sentence.split(), stem_doc(sentence).split()):
    print(f'{w1:10} -> {w2}')

Apply stemming to the reviews in the dataset.

In [None]:
df['review_stemmed'] = df['review'].apply(lambda x: stem_doc(x))

df[['review', 'review_stemmed']]

### Lemmatization
Lemmatization is the process of grouping together the inflected forms of a word, so they can be analysed as a single item, identified by the word's lemma, or dictionary form (from https://en.wikipedia.org/wiki/Lemmatisation).

It is closely related to stemming. The main difference is that lemmatization **considers the context of the word** (e.g., its part of speech) while the normalization is performed.

Lemmatization is useful in text analysis as it reduces the number of words to analyze.

**Lemmatization vs. Stemming:** 
- Stemming removes the last few characters from a word, often leading to stems that are misspelled or have an incorrect meaning. 
- Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. 

E.g., see [https://www.turing.com/kb/stemming-vs-lemmatization-in-python](https://www.turing.com/kb/stemming-vs-lemmatization-in-python)

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# nltk.download('wordnet') # download the WordNet corpus
# nltk.download('omw-1.4') # download the Open Multilingual WordNet
# nltk.download('treebank') # download the Treebank corpus

lemmatizer = WordNetLemmatizer()

# define a function that lemmatizes each word of a document, trying each part of speech in turn
def lemmatize_doc(doc):
    #PoS stands for "part of speech", the syntactic type of words, such as nouns, pronouns, adjectives, verbs, adverbs, and prepositions.
    for pos in ['a', 's', 'r', 'n', 'v']: # a: adjective, s: adjective satellite, r: adverb, n: noun, v: verb
        doc = ' '.join([lemmatizer.lemmatize(word, pos) for word in nltk.word_tokenize(doc)])
    return doc


for w1, w2 in zip(sentence.split(), lemmatize_doc(sentence).split()):
    print(f'{w1:10} -> {w2}')

In [None]:
df['review_lemmatized'] = df['review'].apply(lambda x: lemmatize_doc(x))

df[['review', 'review_lemmatized']]

Alternatives can be found to the `WordNetLemmatizer()` class from the `nltk` module (https://www.nltk.org/). For instance,
- the `spacy` module (https://spacy.io/)
- the `textblob` module (https://textblob.readthedocs.io/en/dev/)
- the `stanza` module (https://stanfordnlp.github.io/stanza/)

## Advanced text processing

More advanced text processing includes term frequency, inverse document frequency, etc. These are covered in the following sections.

So, first let us re-read the data (considering only the first 100 reviews) and do some basic text processing.

In [None]:
import nltk
import pandas as pd

def load_and_preprocess_IMDB(dataset, nrows=None):
    """ load the IMDB data and preprocess it:
            - remove html tags
            - remove punctuation
            - convert to lower case
            - remove stop words
            - remove numbers
            - remove extra spaces
            - replace words with their root form (stem)
            - replace words with their lemma
        :param dataset: 'train' or 'test'
        :param nrows: number of rows to read
        :return: df
    """

    # read the data
    df = pd.read_csv(f'./data/IMDB/imdb_data_{dataset}.zip', nrows=nrows)

    # keep a copy of the original review
    df['original_review'] = df['review']

    # remove the html line-break tags
    df['review'] = df['review'].str.replace('<br />', ' ', regex=False)

    # remove the punctuation and '_' characters (raw string, so that \w and \s reach the regex engine unchanged)
    df['review'] = df['review'].str.replace(r'[^\w\s]', ' ', regex=True)
    df['review'] = df['review'].str.replace('_', ' ', regex=False)

    # convert to lower case
    df['review'] = df['review'].str.lower()

    # remove the stop words
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    df['review'] = df['review'].apply(lambda doc: ' '.join([word for word in nltk.word_tokenize(doc) if word not in (stop_words)]))

    # remove the numbers
    df['review'] = df['review'].str.replace(r'\d+', '', regex=True)

    # remove the extra spaces
    df['review'] = df['review'].str.replace(' +', ' ', regex=True)

    # replace the words with their root form
    from nltk.stem import SnowballStemmer
    stemmer = SnowballStemmer('english')
    df['review'] = df['review'].apply(lambda doc: ' '.join([stemmer.stem(word) for word in nltk.word_tokenize(doc)]))

    # replace the words with their lemma
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    df['review'] = df['review'].apply(lambda doc: ' '.join([lemmatizer.lemmatize(word) for word in nltk.word_tokenize(doc)]))

    return df


df = load_and_preprocess_IMDB('train', nrows=100)
df.info()

### Term Frequency
The term frequency (TF) of a word is the frequency of the word (i.e. the number of times it appears) in a document. This raw count is often divided by the sentence/document length (i.e. the total number of words in the sentence/document) as a way of normalization.

So the frequency of term $t$ in document $d$ is given by:
$$
\mbox{term frequency}_{t,d}
    = \frac{\mbox{number of times the t-word  appears in the d-document}}{\mbox{total number of words in the d-document }}
    = \frac{n_{t,d}}{\sum_w n_{w,d}}
$$
where $n_{t,d}$ is the number of times the $t$-word appears in the $d$-document.

For instance, the term frequency of the word 'fish' in the sentence "The fish is brown" is 1/4 = 0.25.

The term frequency ranges from 0 to 1. The higher the term frequency, the more "important" the word is to that document.

Alternative TF weighting schemes include:
- binary: 0 or 1 (whether or not the term occurs in the document)
- raw count (absolute term frequency): $n_{t,d}$
- log normalization (to reduce the impact of very frequent words): $\log(1+n_{t,d})$
- double normalization (to avoid bias towards longer documents): $0.5 + 0.5\frac{n_{t,d}}{\max_{k \in d} n_{k,d}}$
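The weighting schemes listed above can be compared side by side with a few lines of plain Python. This is a minimal sketch, not the notebook's own implementation: the helper name `tf_variants` and the naive whitespace tokenizer are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_variants(doc):
    """Return {scheme name: {term: weight}} for one document."""
    tokens = doc.lower().split()      # naive whitespace tokenizer (an assumption)
    counts = Counter(tokens)          # raw count n_{t,d}
    total = sum(counts.values())      # document length
    max_count = max(counts.values())  # count of the most frequent term
    return {
        # terms that do not occur in the document implicitly get 0
        "binary": {t: 1 for t in counts},
        "raw_count": dict(counts),
        "term_frequency": {t: n / total for t, n in counts.items()},
        "log_normalization": {t: math.log(1 + n) for t, n in counts.items()},
        "double_normalization": {t: 0.5 + 0.5 * n / max_count for t, n in counts.items()},
    }


variants = tf_variants("The fish is brown")
print(variants["term_frequency"]["fish"])  # 0.25, as in the example above
```

Every word occurs once in this sentence, so the double normalization weight is 1.0 for all of them; the schemes only start to differ once some words repeat.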

In [None]:
def compute_term_frequency(doc):
    """ Compute the term frequency of a word in a document
    :param doc: the document
    :return: the term frequencies as a one-row pandas dataframe (one column per word)
    """

    # compute the term absolute frequency by doc (i.e. the number of times the word appears in the document)
    count_vect = CountVectorizer()
    X = count_vect.fit_transform([doc])

    # convert the term absolute frequency to a pandas dataframe
    bow = pd.DataFrame(X.toarray(), columns=count_vect.get_feature_names_out())

    # compute the term frequency by doc (i.e. the word count divided by the total number of words in the document)
    tf = bow.div(bow.sum(axis=1), axis=0)

    return tf


# compute the term frequency for each review
df['review_tf'] = df['review'].apply(lambda doc: compute_term_frequency(doc).to_dict())

df[['review', 'review_tf']]

### Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) is a weight indicating **how commonly a word is used**.

The IDF of a word is the measure of how significant that term is in the whole corpus (i.e. the list of all reviews).

The inverse document frequency is computed by dividing the total number of documents in the corpus by the number of documents containing the word, and then taking the logarithm of that quotient, i.e.,
$$ IDF_t
 = \log\left(\frac{\mbox{total number of documents in the corpus}}{\mbox{number of documents containing the word}}\right)
 = \log\left(\frac{N}{|\{d\in D: t\in d\}|}\right)
 $$
where $N$ is the total number of documents in the corpus $D$, and $|\{d\in D: t\in d\}|$ is the number of documents containing the $t$-word.

The IDF value ranges from 0 (the word appears in every document) to $\log N$ (the word appears in exactly one document). **The closer the value is to 0, the more common the word is.**

In [None]:
import numpy as np
def compute_idf(corpus):
    """
    Compute the inverse document frequency for each word
    :param corpus: the list (or pandas series) of documents
    :return: idf as a pandas series indexed by word
    """
    # fit and transform the vectorizer to the corpus
    vectorizer = CountVectorizer(binary=True) # use binary=True to indicate the presence or absence of a word
    X = vectorizer.fit_transform(corpus)

    # convert the sparse matrix to a pandas dataframe
    bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

    # compute the inverse document frequency for each word
    return np.log(len(corpus) / bow.sum(axis=0))

compute_idf(df['review']).sort_values()

### Term Frequency-Inverse Document Frequency (TF-IDF)
The term frequency-inverse document frequency (TF-IDF) is the product of the term frequency and the inverse document frequency. The TF-IDF is used to measure how important a word is to a document in a collection of documents (i.e. a corpus). **The higher the TF-IDF, the more important the word is to that specific document**, i.e., relevant words in the document are expected to:
- have a high term frequency (i.e. the word appears many times in the document)
- have a high inverse-document frequency (i.e. the word appears in a small number of documents in the corpus)


The **TF-IDF of a word in a document** is computed as follows:
$$
\mbox{TF-IDF}_{t,d}
    = \mbox{term frequency}_{t,d} \times \mbox{inverse document frequency}_{t}
    = \frac{n_{t,d}}{\sum_w n_{w,d}} \times \log\left(\frac{N}{|\{d\in D: t\in d\}|}\right).
$$

For example, if the corpus has two sentences:
- "The fish is brown"
- "The fish is green"

then the TF-IDF of the word 'fish' in the first sentence is: 0.25 * log(2/2) = 0.25 * 0 = 0 since
- the term frequency of the word 'fish' in the first sentence is 0.25
- the inverse document frequency of the word 'fish' is 0 since the word 'fish' appears in all the documents in the corpus. Since it appears in all the documents, we can't use it to differentiate between the documents...

On the other hand, the TF-IDF of the word 'brown' in the first sentence is: 0.25 * log(2/1) = 0.25 * 0.693147 = 0.1733 since
- the term frequency of the word 'brown' in the first sentence is 0.25
- the inverse document frequency of the word 'brown' is 0.69 since the word 'brown' appears in only one document in the corpus.

The TF-IDF ranges from 0 to $\log N$, since the term frequency is at most 1 and the IDF is at most $\log N$. The higher the TF-IDF, the more important the word is to that document.
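The two worked values above can be checked with a short, dependency-free sketch. The helper names (`tf`, `idf`, `tfidf`) and the whitespace tokenizer are assumptions made for this illustration only.

```python
import math

toy_corpus = ["The fish is brown", "The fish is green"]
tokenized = [doc.lower().split() for doc in toy_corpus]  # naive tokenizer (an assumption)


def tf(term, tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """Inverse document frequency; the term must occur in at least one document."""
    n_containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / n_containing)


def tfidf(term, tokens, docs):
    return tf(term, tokens) * idf(term, docs)


tfidf_fish = tfidf("fish", tokenized[0], tokenized)
tfidf_brown = tfidf("brown", tokenized[0], tokenized)
print(tfidf_fish)             # 0.0
print(round(tfidf_brown, 4))  # 0.1733
```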


In [None]:
# compute the TF-IDF for some review
def compute_tfidf(review, idf):
    """ Compute the TF-IDF for a review
    :param review: the review
    :param idf: the inverse document frequency
    :return: the TF-IDF as a one-row pandas dataframe (NaN for corpus words that are absent from the review)
    """
    # compute the term frequency
    tf = compute_term_frequency(review)

    # compute the TF-IDF
    return tf.mul(idf, axis=1)

In [None]:
corpus = [
    'The fish is brown',
    'The fish is green',
]

# compute the TF-IDF for the first sentence
compute_tfidf(corpus[0], compute_idf(corpus))

In [None]:
compute_tfidf(corpus[1], compute_idf(corpus))

Remember:

$$
\mbox{TF-IDF}_{t,d}
    = \mbox{term frequency}_{t,d} \times \mbox{inverse document frequency}_{t}
    = \frac{n_{t,d}}{\sum_w n_{w,d}} \times \log\left(\frac{N}{|\{d\in D: t\in d\}|}\right).
$$

So, a **high weight in TF-IDF** is reached by:
- a high term frequency (in the given document) and
- a low document frequency of the term in the whole collection of documents.

The **weights hence tend to filter out common terms**.

Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.

From Wikipedia, TF-IDF:
- measures how important a word is to a document in a collection or corpus.
- It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
- The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.
- tf–idf has been one of the most popular term-weighting schemes. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries use tf–idf.

### Recognizing entities
Recognizing entities is the process of identifying and classifying named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Let us use spaCy to recognize entities.

In [None]:
#!pip install spacy
#!python -m spacy download en_core_web_lg

In [None]:
import spacy

# load the model
nlp = spacy.load("en_core_web_lg")

# create a doc object
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# print the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
for doc in nlp.pipe(pd.read_csv('./data/IMDB/imdb_data_train.zip')['review'].head(10)):
    print()
    print(doc)
    print([(ent.text, ent.label_) for ent in doc.ents])

### Word Embeddings
Word embeddings are a type of word representation that allows **words with similar meaning to have a similar representation**. They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network, and hence the technique is often lumped into the field of deep learning.

For example, consider the following two sentences:
- The cat sat on the mat.
- The dog sat on the mat.

In this example we have a vocabulary of 6 words (cat, dog, mat, on, sat, the). Each sentence is 6 words long. We can represent each word using a one-hot encoding, i.e.:

    cat = [1, 0, 0, 0, 0, 0]
    dog = [0, 1, 0, 0, 0, 0]
    mat = [0, 0, 1, 0, 0, 0]
    on  = [0, 0, 0, 1, 0, 0]
    sat = [0, 0, 0, 0, 1, 0]
    the = [0, 0, 0, 0, 0, 1]

We can then represent each sentence as a sequence of vectors, one per word:

    The cat sat on the mat = [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0]]
    The dog sat on the mat = [[0, 0, 0, 0, 0, 1], [0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0]]
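A minimal sketch of the one-hot encoding, building the vocabulary directly from the two sentences. The whitespace tokenizer and the alphabetical ordering of the vocabulary are assumptions made for this illustration.

```python
import numpy as np

sentences = ["The cat sat on the mat", "The dog sat on the mat"]
tokenized = [s.lower().split() for s in sentences]  # naive tokenizer (an assumption)

# vocabulary: every distinct word, sorted so that each word gets a stable index
vocabulary = sorted({word for tokens in tokenized for word in tokens})
index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocabulary), dtype=int)
    vec[index[word]] = 1
    return vec


# each sentence becomes a (number of words) x (vocabulary size) matrix
encoded = [np.array([one_hot(w) for w in tokens]) for tokens in tokenized]

print(vocabulary)                        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(encoded[0].shape)                  # (6, 6)
print(one_hot("cat") @ one_hot("dog"))   # 0
```

The zero dot product on the last line shows the limitation of this representation: any two distinct words are orthogonal, so the encoding says nothing about how related 'cat' and 'dog' are.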

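One-hot vectors are sparse and mutually orthogonal, so they carry no notion of similarity: 'cat' is exactly as far from 'dog' as from 'mat'. Word embeddings replace them with dense, lower-dimensional vectors in which similar words end up close together.

The following code shows how to load pre-trained Word2Vec vectors with the `KeyedVectors` class from the `gensim` module (https://radimrehurek.com/gensim/models/word2vec.html). Uncomment the code to run it; note that loading the model is memory intensive.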

In [None]:
# !pip install gensim

# from gensim.models import KeyedVectors

# Load the Word2Vec model pre-trained on the Google News dataset
# Note: This model is over 1.5GB, and loading it requires significant memory

# model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Get the word vector for a specific word
# word_vector = model['computer']
# print(word_vector)  # This prints the dense vector associated with 'computer'

# Find the top 10 most similar words to 'computer'
# similar_words = model.most_similar('computer', topn=10)
# for word, similarity in similar_words:
#    print(f"{word}: {similarity}")

# Solve the analogy: man:king :: woman:?
# result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
# print(result)
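The analogy query above can be illustrated without downloading the pre-trained model. The sketch below uses hand-made, hypothetical 2-dimensional vectors (not real Word2Vec embeddings) and a simplified vector-arithmetic search in the spirit of `most_similar`; it does not reproduce gensim's exact scoring.

```python
import numpy as np

# Hypothetical toy "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
toy_vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.7, 0.1]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# man:king :: woman:?
print(analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors))  # queen
```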



### Word cloud
A word cloud is a visualization technique for text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text. Tags are usually **single words, and the importance of each tag is shown with font size or color**. 

This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence. When used as website navigation aids, the terms are hyperlinked to items associated with the tag.

Let us use the wordcloud library to create a word cloud.

In [None]:
# !pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# create a word cloud
wordcloud = WordCloud(background_color="white",
                      #stopwords=STOPWORDS,
                      max_words=100,
                      random_state=42
                      ).generate(' '.join(df['review'].to_list()))

# display the word cloud
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

## Sentiment Analysis
https://monkeylearn.com/sentiment-analysis/

### What is it?
- **Sentiment analysis** is a technique that uses natural language processing to **analyze the emotions in a piece of text**. It is also known as **opinion mining**, deriving the opinion or attitude of a speaker.

- It can be used to analyze social media comments, product reviews, survey responses, and so much more!


### Types of sentiment analysis
There are two main types of sentiment analysis:
- **Polarity detection**: Polarity detection is the most common type of sentiment analysis. It involves classifying a statement as either **positive, negative, or neutral**.
- **Emotion detection**: Emotion detection is a more advanced type of sentiment analysis that detects emotions in a text. It involves detecting a whole range of emotions, such as **joy, anger, disgust, sadness, fear, surprise, or anticipation**.

### Sentiment analysis using text classification
Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in natural language processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

The sentiment analysis process using text classification consists of the following steps:
- **Data collection**: The first step is to collect the data. This data can be in the form of text, audio, or video. For example, if you want to analyze the sentiment of tweets, you’ll need to collect tweets that you want to analyze.
- **Data labeling**: The next step is to label the data. This means that you need to manually assign a sentiment label to each piece of text. For example, if you want to analyze the sentiment of tweets, you’ll need to label each tweet as positive, negative, or neutral (or with an emotion, for emotion detection).
- **Preprocessing the data**: The next step is to preprocess the data. This means that you need to clean the data and transform it into a format that can be used by a machine learning algorithm. For example, you can remove punctuation and convert all letters to lowercase.
- **Training a text classification model**: The next step is to train a text classification model. This means that you need to feed the labeled data into a machine learning algorithm so that it can learn how to classify text. For example, you can train a text classification model to classify tweets as positive, negative, or neutral.
- **Evaluating the model**: The next step is to evaluate the model. This means that you need to test the model on a set of data that it hasn’t seen before. For example, you can test the model on a set of tweets that it hasn’t seen before to see how well it can classify them.
- **Deploying the model**: The final step is to deploy the model. This means that you need to make the model available for use. For example, you can deploy the model as a web service so that it can be used to analyze the sentiment of tweets.

![text classification](./images/text_processing_model.png)
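The steps above (except deployment) can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions: the six training sentences and their labels are hypothetical toy data, and the choice of `MultinomialNB` is arbitrary; any text classifier could take its place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Data collection and labeling (hypothetical toy examples)
train_texts = [
    "I loved this movie, it was great",
    "great acting and a wonderful story",
    "what a wonderful, great film",
    "I hated this movie, it was terrible",
    "terrible acting and a boring story",
    "what a boring, awful film",
]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

# Preprocessing (lowercasing, tokenization, counting) and training in one pipeline
sentiment_model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
sentiment_model.fit(train_texts, train_labels)

# Evaluation on texts the model has not seen (far too few to be meaningful)
test_texts = ["a great and wonderful movie", "boring and terrible"]
test_labels = ["positive", "negative"]
predictions = sentiment_model.predict(test_texts)
accuracy = sentiment_model.score(test_texts, test_labels)

print(list(predictions), accuracy)
```

Wrapping the vectorizer and the classifier in a single pipeline keeps the preprocessing applied at prediction time identical to the one used during training.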

### Sentiment analysis using BoW

Let us start by recalling our dataset: the IMDB review dataset.

In [None]:
df = load_and_preprocess_IMDB('train')
df.head()

The distribution of sentiment in this dataset is the following:

In [None]:
df.groupby(by='sentiment').count()

Let us generate the BoW matrix:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(lowercase=True, # it should already be in lower case...
                                   stop_words='english', # stop words should already have been removed but ...
                                   ngram_range = (1, 1))

count_vectors_train = count_vectorizer.fit_transform(df['review'])
count_vectors_train

Build a dataframe with the BoW (for easier visualization):

In [None]:
bow_train = pd.DataFrame(count_vectors_train.toarray(), columns=count_vectorizer.get_feature_names_out())
bow_train

Now, we don't need to split the data into train and test, because that split is already provided by the dataset.
Let us just try a random forest classifier:

In [None]:
from sklearn.ensemble import RandomForestClassifier

# create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# train the model (send bow_train or the count_vectors_train)
rf.fit(bow_train, df['sentiment'])
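As the comment in the cell above notes, the classifier can be trained on either the dense dataframe or the sparse count matrix. A minimal sketch of the sparse route follows, using a hypothetical toy corpus (the names `toy_reviews`, `toy_labels`, `toy_vectorizer` and `toy_rf` are made up for this example). Skipping `.toarray()` avoids materializing a dense documents-by-vocabulary matrix, which can be very large for a full review corpus.

```python
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the preprocessed reviews
toy_reviews = ["great film", "wonderful story", "terrible film", "boring story"]
toy_labels = ["positive", "positive", "negative", "negative"]

toy_vectorizer = CountVectorizer()
X_sparse = toy_vectorizer.fit_transform(toy_reviews)  # SciPy sparse matrix of counts

# scikit-learn's random forest accepts sparse input directly
toy_rf = RandomForestClassifier(n_estimators=10, random_state=42)
toy_rf.fit(X_sparse, toy_labels)

# new text must go through the same fitted vectorizer
X_new = toy_vectorizer.transform(["a great film"])
prediction = toy_rf.predict(X_new)
print(sparse.issparse(X_sparse), prediction.shape)
```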

Now, load the test data and pass it through the BoW

In [None]:
df_test = load_and_preprocess_IMDB('test')
count_vectors_test = count_vectorizer.transform(df_test['review'])

Build the BoW matrix

In [None]:
bow_test = pd.DataFrame(count_vectors_test.toarray(), columns=count_vectorizer.get_feature_names_out())
bow_test

And predict the sentiment and the corresponding score:

In [None]:
rf.score(bow_test, df_test['sentiment'])

Let us see some examples

In [None]:
def info(idx):
    review = df_test.loc[idx, 'original_review']
    target = df_test.loc[idx, 'sentiment']

    classification = df_test.loc[idx, 'classification']
    # select the row as a one-row dataframe so that the feature names are preserved
    row = bow_test.loc[[idx]]
    pred = rf.predict(row)[0]
    pred_proba = rf.predict_proba(row)[0]

    print(f'{"OK" if target == pred else "!OK"} / real sentiment: {target} (classification: {classification}) / predicted sentiment: {pred} / pred_proba: {pred_proba}')
    print(f'[{idx}]', review)

import random

for idx in random.sample(range(len(df_test)), 5):
    info(idx)
    print('-------------------')

Save the model

In [None]:
import pickle

# save the model to disk (the context manager closes the file)
with open('rf_model.sav', 'wb') as f:
    pickle.dump(rf, f)

# save the vectorizer to disk
with open('count_vectorizer.sav', 'wb') as f:
    pickle.dump(count_vectorizer, f)

## References
1. Aggarwal, C. (2015). Data Mining: The Textbook. Springer.
2. Navlani, A., Fandango, A., & Idris, I. (2021). Python Data Analysis: Perform data collection, data processing, wrangling, visualization, and model building using Python. Packt Publishing Ltd.
3. Zong, C., Xia, R., & Zhang, J. (2021). Text data mining (Vol. 711, p. 712). Singapore: Springer.