# EM Lyon - Python text mining - Session 1

1. **Text mining with Python**
2. Dedicated libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# display options
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 30)

#### Dataset

100,000 IMDB movie reviews for Sentiment Analysis:
- positive opinion, label `pos`
- negative opinion, label `neg`
- untagged opinion, label `unsup`

Source: kaggle.com

# 1. Text mining with Python

In [None]:
# load dataset
df = pd.read_csv('imdb_master.csv',
                 encoding='latin-1',
                 usecols=['review', 'label'])
df.shape

## 1.1 First insights on the dataset

In [None]:
# sample
np.random.seed(1)
df.sample(10)

In [None]:
# review sample
print(df.sample(1).iloc[0, 0])

<div class="alert alert-success">
<b>Exercise 1</b>
<ul>
    <li>Get the value counts of labels</li>
    <li>Apply the `describe()` method to the length of reviews</li>
    <li>Perform a seaborn `distplot()` with length of reviews</li>
    <li>Get the shortest and the longest review</li>
</ul>
</div>

In [None]:
# %load session8/ex1.py

## 1.2 Breaking the reviews into words with regular expressions

The `re` module from the Python Standard Library provides regular expression matching operations.

Regular expressions (called REs, or regex, or regex patterns) are essentially a tiny (but very powerful), highly specialized programming language embedded inside Python and made available through the `re` module.

Using this little language, one can specify the rules for the set of possible strings that one wants to match.

- `.`: match any character
- `*`: match 0 or more repetitions of the preceding pattern
- `+`: match 1 or more repetitions of the preceding pattern
- `?`: match 0 or 1 repetition of the preceding pattern
- `^`: match the start of the string
- `$`: match the end of the string
- `\`: in order to use the special characters above as standard ones, one should prefix them with a `\`

It is also possible to list within square brackets the characters that are to match:
- `[aeiouy]`: match any vowel
- `[^aeiouy]`: match any character that is not a vowel (here the `^` is interpreted as a NOT)
- `[A-Z]`: match any character between `A` and `Z` (several ranges can be combined: e.g., for uppercase, lowercase, accented letters and digits: `[A-Za-zÀ-ÿ0-9]`)

See the ISO 8859-1 (or latin-1) table for the character ranges: https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout

The `findall()` function extracts all the substrings that match a given pattern.

The language contains many more powerful functionalities:
- shortcuts (e.g., `\w`, `\W`)
- repetition qualifiers: `*`, `+`, `?`, `{m,n}`
- capturing patterns: `(...)`
- non-capturing patterns: `(?:...)`
- backreference to a named group: `(?P=name)`
- lookahead assertions: `(?=...)`
- negative lookahead assertions: `(?!...)`
- ...
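
To make a few of these constructs concrete, here is a minimal sketch on a made-up sentence. The sentence and the variable names are illustrative assumptions, not part of the dataset.

```python
import re

# made-up sentence, used only for illustration
text = "I paid the the price: 1200 euros, then 35 dollars, in 2019."

# shortcut \w+: runs of word characters
words = re.findall(r"\w+", text)

# repetition {m,n}: numbers made of exactly 4 digits
four_digits = re.findall(r"\b[0-9]{4}\b", text)

# capturing group: findall() returns only what the parentheses capture;
# the currency alternative sits in a non-capturing group
amounts = re.findall(r"([0-9]+) (?:euros|dollars)", text)

# named group and backreference: the same word twice in a row
repeated = re.search(r"\b(?P<word>\w+) (?P=word)\b", text)

# lookahead: numbers followed by " euros", without consuming " euros"
euros = re.findall(r"[0-9]+(?= euros)", text)

# negative lookahead: whole numbers NOT followed by " euros"
not_euros = re.findall(r"\b[0-9]+\b(?! euros)", text)

print(four_digits, amounts, repeated.group("word"), euros, not_euros)
```

Raw strings (`r"..."`) keep Python from interpreting backslashes such as `\b` before the `re` module sees them.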

<div class="alert alert-warning">
<b>Further reading</b>
<ul>
    <li>https://docs.python.org/3/library/re.html</li>
</ul>
</div>

In [None]:
# import re
import re
big_review = df.loc[df['review'].str.len().idxmax(), 'review']
big_review

In [None]:
# example 1: pattern = 1 uppercase letter
pattern = '[A-Z]'
re.findall(pattern, big_review)

In [None]:
# example 2: several uppercase letters
pattern = '[A-Z]+'
re.findall(pattern, big_review)

In [None]:
# example 3: one or more consecutive digits
pattern = '[0-9]+'
re.findall(pattern, big_review)

<div class="alert alert-success">
<b>Exercise 2</b>
<ul>
    <li>Build a pattern to find words starting with an uppercase letter and other letters in lowercase.</li>
</ul>
</div>

In [None]:
# %load session8/ex2.py

Here we find all words composed of uppercase, lowercase and accented letters and digits, possibly followed by an apostrophe and `s` or `t`.

In [None]:
# get all words with an apostrophe followed by `s` or `t`.
pattern = "[A-Za-zÀ-ÿ0-9]+'[st]"
re.findall(pattern, big_review)

In [None]:
# get all words
pattern = "[A-Za-zÀ-ÿ0-9]+(?:'[st])?"
re.findall(pattern, big_review)
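
As a quick check of how the optional `(?:'[st])?` group behaves, here is a small sketch on a made-up sentence (the sentence is an assumption chosen for illustration):

```python
import re

pattern = "[A-Za-zÀ-ÿ0-9]+(?:'[st])?"

# made-up sentence mixing accented letters and several kinds of apostrophes
sample = "It's Amélie's café, isn't it? They've left."
tokens = re.findall(pattern, sample)
print(tokens)
```

`It's`, `Amélie's` and `isn't` stay in one piece, whereas `They've` is split into `They` and `ve`, because `'v` is not covered by `'[st]`.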

## 1.3 Bag-of-words model

Computing word frequency in a document could be a hassle:
- create a dictionary
- split a review into words
- for each word:
    - if it is not in the dictionary, set the value to 1
    - if it is in the dictionary, increase the value by 1
    
The `Counter` class from the Python standard library `collections` performs the job.

The `update()` method takes an iterable of keys (here, a list of words) and adds the number of occurrences of each key to the running counts.
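
The manual steps listed above and the `Counter` shortcut can be compared on a toy list of words. This is a minimal sketch; the word list is made up for illustration.

```python
from collections import Counter

# made-up list of words standing in for a tokenized review
words = ["a", "good", "movie", "a", "very", "good", "movie"]

# manual version: dictionary updated word by word
counts = {}
for word in words:
    if word not in counts:
        counts[word] = 1
    else:
        counts[word] += 1

# Counter version: one call does the same bookkeeping
c = Counter()
c.update(words)

print(counts == dict(c))
print(c.most_common(2))
```

Both approaches give the same counts; `most_common(n)` additionally returns the `n` most frequent keys.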

<div class="alert alert-warning">
<b>Further reading</b>
<ul>
    <li>https://docs.python.org/3/library/collections.html</li>
</ul>
</div>

In [None]:
# import
from collections import Counter

### 1.3.1 Bag of Words model of a single document

In [None]:
# 1 document
c = Counter()
c.update(re.findall(pattern, big_review))
c

We can put the result in a Series in order to obtain a vector-like representation.

In [None]:
# putting the result in a Series object
bag_of_words = pd.Series(list(c.values()), index=c.keys())
bag_of_words = bag_of_words.sort_values(ascending=False)
bag_of_words

### 1.3.2 Words frequency in all documents

Now we want to compute the word frequency over all documents, i.e., for each word, the number of documents in which it appears.

We can apply the `update()` method of the instantiated counter to the full review column.

In this scheme, we are not interested in the result of the `apply()` method, but in its **side effect** on the counter.

In [None]:
%%time
c = Counter()
df['review'].apply(lambda x: c.update(re.findall(pattern, x)))
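
Since `Counter.update()` returns `None`, the value returned by `apply()` carries no information; only the counter matters. An equivalent explicit loop can be sketched as follows, on a made-up list of reviews standing in for `df['review']`:

```python
import re
from collections import Counter

# made-up reviews standing in for the `review` column
reviews = ["A good movie", "Not a good movie, a bad movie"]
pattern = "[A-Za-zÀ-ÿ0-9]+(?:'[st])?"

c = Counter()
for review in reviews:
    # each call adds the words of one review to the running counts
    c.update(re.findall(pattern, review))

print(c)
```

Note that `A` and `a` are counted separately here, because the reviews are not lowercased before matching.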

In [None]:
# view the dictionary
c

In [None]:
# length of the vocabulary
len(c)

### 1.3.3 Putting the results in a Series object

In [None]:
# putting the result in a Series object
vocab = pd.Series(list(c.values()), index=c.keys())
vocab = vocab.sort_values(ascending=False)
vocab

In fact, we did not compute the word frequency but the word count. The word `the` appears more than 100K times.

<div class="alert alert-success">
<b>Exercise 3</b>
<ul>
    <li>Modify the code above to get the word frequency: i.e., for each word, the number of documents in which it appears.</li>
</ul>
</div>

In [None]:
# %load session8/ex3.py

## 1.4 Finding the context of any word

We can notice a strange word: `br`. Let us find out what this word is.

<div class="alert alert-success">
<b>Exercise 4</b>
<ul>
    <li>Select the reviews which contain "br".</li>
    <li>Print out the reviews with 25 characters before the "br" and 25 characters after</li>
    <li>Process the reviews to replace the "br" with spaces.</li>
    <li>Generalize and create a function which finds out all reviews containing a word and print out the results 25 characters before and after.</li>
    <li>Then perform few requests:</li>
        <ul>
            <li>Find a word, e.g. ghost</li>
            <li>Find a proper name, e.g. Hitchcock</li>
            <li>Find good/bad movie, e.g. good movie, bad movie, not a good movie, not a bad movie</li>
            <li>Find Ã</li>
        </ul>
</ul>
</div>

In [None]:
# %load session8/ex4.py

### 1.4.1 Cleaning the dataset

We can replace the wrongly decoded character sequences with the characters they stand for.

In [None]:
# replace character encoding mistakes
df['review'] = df['review'].apply(lambda x: x.replace('Ã¡', 'á'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã ', 'à'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã ', 'à'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¢', 'â'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã\xa0', 'à'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¥', 'å'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã£', 'ã'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã»', 'û'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã§', 'ç'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã©', 'é'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¨', 'è'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã«', 'ë'))
df['review'] = df['review'].apply(lambda x: x.replace('Ãª', 'ê'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¯', 'ï'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã®', 'î'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¬', 'ì'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã\xad', 'í'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã±', 'ñ'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã³', 'ó'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã²', 'ò'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¶', 'ö'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã´', 'ô'))
df['review'] = df['review'].apply(lambda x: x.replace('Ãµ', 'õ'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã°', 'ð'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¸', 'ø'))
df['review'] = df['review'].apply(lambda x: x.replace('Ãº', 'ú'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¹', 'ù'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¼', 'ü'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã½', 'ý'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¿', 'ÿ'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã?', 'Æ'))
df['review'] = df['review'].apply(lambda x: x.replace('Ã¦', 'æ'))
df['review'] = df['review'].apply(lambda x: x.replace('Â', ''))
# br
df['review'] = df['review'].apply(lambda x: x.replace('<br />', ' '))
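
Sequences such as `Ã©` typically appear when UTF-8 bytes are decoded as latin-1, which is the encoding used to load the file above. The sketch below illustrates the mechanism on a single character; `repair()` is a hypothetical helper introduced here, not part of the notebook.

```python
# 'é' is encoded in UTF-8 as two bytes: 0xC3 0xA9
utf8_bytes = "é".encode("utf-8")

# decoding those bytes as latin-1 maps each byte to one character
garbled = utf8_bytes.decode("latin-1")

# reversing the two steps recovers the original character
repaired = garbled.encode("latin-1").decode("utf-8")
print(garbled, "->", repaired)


def repair(text):
    """Hypothetical helper: undo a latin-1 decoding of UTF-8 text when possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # the text is not a clean latin-1 view of UTF-8 bytes: leave it as is
        return text


print(repair("cafÃ©"), repair("café"))
```

The round trip only works when the whole string is a clean latin-1 view of UTF-8 bytes. A review that mixes garbled and correct characters is returned unchanged, which is one reason to prefer targeted replacements like the ones above.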

### 1.4.2 Find the top vocabulary in a given context

We can use the `find()` function that has been defined above to collect automatically the vocabulary around a word or an expression and then find the top vocabulary which is used.

We use an enhanced version of the `find()` function, which pads each review with spaces before and after, to deal with words located at the beginning or at the end of a review.

In [None]:
# find word
def find(word):
    selection = df.loc[df['review'].str.contains(word, regex=False)]
    result = selection['review'].apply(lambda x: (' ' * 25 + x + ' ' * 25)[x.find(word):x.find(word) + 50 + len(word)])
    return result

pattern = '[A-Za-zÀ-ÿ0-9]+(?:\'[st])?'

# top vocabulary
def top_voc(s, n=20):
    c = Counter()
    s.apply(lambda x: c.update(re.findall(pattern, x)))
    voc = pd.DataFrame(list(c.items()), columns=['word', 'count'])
    voc = voc.nlargest(n, 'count')
    return voc['word'].unique()
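
The padding and slicing logic used in `find()` can be looked at in isolation. The sketch below applies it to a single string; `context()` and its `width` parameter are hypothetical names introduced for illustration.

```python
def context(text, word, width=25):
    """Return `word` with `width` characters of context on each side."""
    pos = text.find(word)
    if pos == -1:
        return None
    padded = " " * width + text + " " * width
    # in `padded` the word starts at pos + width,
    # so a window starting at `pos` begins `width` characters before it
    return padded[pos:pos + 2 * width + len(word)]


snippet = context("Costner plays the lead.", "Costner", width=5)
print(repr(snippet))
```

Thanks to the padding, the window always has the same length, even when the word is the first or the last one of the text.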

In [None]:
# example
s = find('Costner')
top_voc(s)

In [None]:
# find Duck
s = find('Duck')
top_voc(s)

We can notice that some small words are printed out: e.g., a, and, in, is, of, the.

### 1.4.3 Collect the stop words

In text mining, stop words are words which are filtered out before or after processing of natural language texts.

The NLTK module provides a list of such stop words.

Of course, stop words depend on the language in which the texts are written.

In [None]:
# stop words from nltk
if False:
    from nltk.corpus import stopwords
    stopwords_en = set(stopwords.words("english"))
else:
    import json
    with open('stopwords_en.json') as f:
        stopwords_en = set(json.load(f))
        
stopwords_en

In [None]:
len(stopwords_en)
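
To see the effect of stop words on a word count, here is a minimal sketch. It uses a tiny hand-made set (`stop_sample`, an assumption made here) instead of the full `stopwords_en` set, and a made-up sentence.

```python
import re
from collections import Counter

# tiny hand-made set; the real list contains many more words
stop_sample = {"a", "and", "in", "is", "of", "the"}

text = "The plot of the movie is thin and the acting is wooden"
tokens = re.findall("[a-z]+", text.lower())

# keep only the tokens that are not stop words
kept = [t for t in tokens if t not in stop_sample]

print(Counter(tokens).most_common(2))
print(kept)
```

Before filtering, the most frequent tokens are `the` and `is`; after filtering, only content words remain.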

<div class="alert alert-success">
<b>Exercise 5</b>
<ul>
    <li>Modify the `top_voc()` function so that:</li>
        <ul>
            <li>Words of 2 characters or fewer are discarded</li>
            <li>Stop words are discarded</li>
        </ul>
</ul>
</div>

In [None]:
# %load session8/ex5.py

We also change the pattern so that it includes `'d`, `'ll`, `'re` and `'ve`.

In [None]:
# extended pattern
pattern = '[A-Za-zÀ-ÿ0-9]+(?:\'(?:d|ll|re|s|t|ve))?'

In [None]:
def top_voc(s, n=20):
    c = Counter()
    s.apply(lambda x: c.update(re.findall(pattern, x.lower())))
    voc = pd.DataFrame(list(c.items()), columns=['word', 'count'])
    voc = voc.loc[(voc['word'].str.len() > 2) & ~voc['word'].isin(stopwords_en)]
    voc = voc.nlargest(n, 'count')
    return voc['word'].unique()

In [None]:
# example
s = find('Costner')
top_voc(s)

In [None]:
# find Duck
s = find('Duck')
top_voc(s)

In [None]:
# find Daffy Duck
s = find('Daffy Duck')
top_voc(s)

In [None]:
# find Donald Duck
s = find('Donald Duck')
top_voc(s)

In [None]:
# find 007
s = find(' 007 ')
top_voc(s)

In [None]:
# find Pierce Brosnan
s = find('Pierce Brosnan')
top_voc(s)

In [None]:
# find Sean Connery
s = find('Sean Connery')
top_voc(s)

In [None]:
# find Simpson
s = find('Simpson')
top_voc(s)

## 1.5 Naive Bayes Classification

The dataset contains 50K labelled documents, 25K positive and 25K negative.

We are going to implement a Naive Bayes Classification based on the Bag of Words model.

Taking a document `d` we want to compute the probability that the document is in the class `C` (here `pos` or `neg`), i.e., probability of `C` class given a document `d`.

Without going into detail, this calculation requires some maths and assumptions:

- **Bayes' theorem**: it lets us move from the probability of class `C` given a document `d` to the probability of document `d` given class `C`

- **Bag of Words** assumption: a document is represented by its words in any order, which lets us move from the probability of document `d` given class `C` to the probability of all the words of the document given class `C`

- **Conditional independence** assumption: the probabilities of the words are independent given a class, which lets us move from the probability of all the words of the document given class `C` to the product of the probabilities of each word of the document given class `C`

- To estimate the probability of a word given a class, we can use:
    - the **document frequency**: count of documents in the class containing the word, divided by the number of documents in the class
    - the **maximum likelihood**: count of the word in the class, divided by the total number of words in the class
    - the **Laplace smoothing**, which adds 1 in order to avoid factors equal to $0$: count of the word in the class plus one, divided by the total number of words in the class plus the number of distinct words in the vocabulary

- Then we use the **log** function to transform products into sums, turning probabilities into scores

- In our case, for each review, we compute the log of the ratio between the probability of being positive and the probability of being negative: a score above 0 points to `pos`, a score below 0 to `neg`
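
Written out, the steps above chain as follows. The notation is introduced here for illustration: $w_1, \dots, w_n$ denote the words of document $d$.

$$
P(C \mid d) = \frac{P(d \mid C)\,P(C)}{P(d)} \;\propto\; P(C) \prod_{i=1}^{n} P(w_i \mid C)
$$

Taking the log of the ratio between the two classes removes the common factor $P(d)$ and turns the product into a sum:

$$
\log \frac{P(\mathrm{pos} \mid d)}{P(\mathrm{neg} \mid d)}
= \log \frac{P(\mathrm{pos})}{P(\mathrm{neg})}
+ \sum_{i=1}^{n} \log \frac{P(w_i \mid \mathrm{pos})}{P(w_i \mid \mathrm{neg})}
$$

When both classes are equally frequent, the first term is $0$ and the score reduces to the sum of the per-word log ratios.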

Further readings:
- Naive Bayes classifier: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- Text Classification in NLP -  Naive Bayes: https://medium.com/@theflyingmantis/text-classification-in-nlp-naive-bayes-a606bf419f8c

#### Step 1: separate the data into a train and a test set

In [None]:
# value counts of "label"
df['label'].value_counts()

We use the scikit-learn `train_test_split()` function to split the dataset into a train set and a test set.

In [None]:
# import and apply the train_test_split function
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df.loc[df['label']!='unsup'])

The train dataset

In [None]:
# train dataset
df_train.shape

In [None]:
# train value counts
df_train['label'].value_counts()

The test dataset

In [None]:
# test dataset
df_test.shape

In [None]:
# test value counts
df_test['label'].value_counts()

#### Step 2: collect the vocabulary from positive and negative labels

In [None]:
# compute document frequency
pattern = '[A-Za-zÀ-ÿ0-9]+(?:\'(?:d|ll|re|s|t|ve))?'

def get_document_frequency(df):
    c = Counter()
    df['review'].apply(lambda x: c.update(set(re.findall(pattern, x.lower()))))
    vocab = pd.Series(list(c.values()), index=c.keys())
    vocab = vocab.sort_values(ascending=False)
    return vocab

Positive labels

In [None]:
# document frequency for documents labelled "pos"
vocab_pos = get_document_frequency(df_train.loc[df_train['label']=='pos'])
vocab_pos

Negative labels

In [None]:
# document frequency for documents labelled "neg"
vocab_neg = get_document_frequency(df_train.loc[df_train['label']=='neg'])
vocab_neg

#### Step 3: compute a frequency ratio between positive and negative labels

In [None]:
# frequency ratio pos / neg transformed by log
var = vocab_pos.div(vocab_neg, fill_value=1)
var = np.log(var)
var = var.to_dict()
var
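
The cell above divides document counts and uses `fill_value=1` for words missing from one class. For comparison, a Laplace-smoothed log ratio based on word counts could be sketched as follows. The counts are made up, and `smoothed_prob()` is a hypothetical helper, not the method used in this notebook.

```python
import math

# made-up word counts per class
count_pos = {"amazing": 8, "movie": 20, "plot": 5}
count_neg = {"horrible": 9, "movie": 22, "plot": 6}

# vocabulary shared by both classes
vocabulary = set(count_pos) | set(count_neg)
total_pos = sum(count_pos.values())
total_neg = sum(count_neg.values())


def smoothed_prob(word, counts, total):
    # (count + 1) / (total + vocabulary size): never equal to zero
    return (counts.get(word, 0) + 1) / (total + len(vocabulary))


# log of the ratio P(word | pos) / P(word | neg) for every word
log_ratio = {
    w: math.log(smoothed_prob(w, count_pos, total_pos)
                / smoothed_prob(w, count_neg, total_neg))
    for w in vocabulary
}

for w in sorted(log_ratio, key=log_ratio.get):
    print(w, round(log_ratio[w], 3))
```

A word seen only in positive reviews gets a positive log ratio and a word seen only in negative reviews a negative one, while no probability is ever zero.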

#### Step 4: compute the score of any text by adding log of ratios for all words

In [None]:
# score = sum of log of ratio (log has been already computed)
def compute_score(review):
    words = re.findall(pattern, review.lower())
    score = 0
    for word in words:
        score += var.get(word, 0)
    return score

In [None]:
# score of a single word
compute_score('amazing')

In [None]:
# score of a single word
compute_score('horrible')

In [None]:
# score of a review
compute_score(big_review)

#### Step 5: score on the train dataset
- compute the score of all reviews from train dataset
- plot them for positive and negative labels
- compute the contingency table in absolute value and in %
- compute the accuracy

In [None]:
%%time
df_train['score'] = df_train['review'].apply(compute_score)

In [None]:
sns.distplot(df_train.loc[df_train['label']=='pos', 'score'])
sns.distplot(df_train.loc[df_train['label']=='neg', 'score']);

In [None]:
# contingency table in absolute value
tab = pd.crosstab(df_train['score'] > 0, df_train['label'])
tab

In [None]:
# contingency table in %
pd.crosstab(df_train['score'] > 0, df_train['label'], normalize='index')

In [None]:
# accuracy
acc = (tab.iloc[0,0] + tab.iloc[1,1])/tab.sum().sum()
print('accuracy: {:.3f}'.format(acc))
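
The link between the contingency table and the accuracy can be checked by hand on a few made-up values. This sketch uses only the standard library; the scores and labels are invented for illustration.

```python
from collections import Counter

# made-up scores and true labels for six reviews
scores = [2.3, -1.1, 0.4, -0.2, 1.7, -3.0]
labels = ["pos", "neg", "neg", "pos", "pos", "neg"]

# contingency table: (score > 0, label) -> number of reviews
table = Counter((score > 0, label) for score, label in zip(scores, labels))

# correct predictions: positive score with "pos", non-positive score with "neg"
correct = table[(True, "pos")] + table[(False, "neg")]
accuracy = correct / len(labels)

print(table)
print("accuracy: {:.3f}".format(accuracy))
```

The two correct cells correspond to `tab.iloc[1, 1]` and `tab.iloc[0, 0]` in the cell above.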

#### Step 6: score on the test dataset

<div class="alert alert-success">
<b>Exercise 6</b>
<ul>
    <li>compute the score of all reviews from test dataset</li>
    <li>plot them for positive and negative labels</li>
    <li>compute the contingency table in absolute value and in %</li>
    <li>compute the accuracy</li>
</ul>
</div>

In [None]:
# %load session8/ex6.py

<div class="alert alert-success">
<b>Exercise 7</b>
<ul>
    <li>Modify the `get_document_frequency()` function so that stop words are discarded</li>
    <li>Modify the `compute_score()` function so that the computation is performed only once per word</li>
    <li>Re-run the whole process and compare scores</li>
</ul>
</div>

In [None]:
# %load session8/ex7.py