<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@acca-logo.jpg" alt="ACCA logo" style="width: 400px;"/>

# Machine learning with Python
## Part 2 - Natural language processing

* **Course:** __Machine learning with Python for finance professionals__ by ACCA
* **Instructor:** [Coefficient](https://coefficient.ai) / [@CoefficientData](https://twitter.com/CoefficientData)

---

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Text processing in scikit-learn
</h2><br>
</div>

### Goal: Predict Category from item description

In [None]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [None]:
orders = pd.read_excel("Grocery Database.xlsx", sheet_name="Grosto DB")
orders.shape

In [None]:
orders.head()

Let's _just_ look at `Category` and `Items`.

In [None]:
df = orders[['Items', 'Category']].copy()
print(df.shape)
df.head()

The original dataset is purchase data. Let's reduce it from 50k purchases to the 603 unique items.

In [None]:
# drop=True stops pandas from adding the old index as an extra column
df = df.drop_duplicates().reset_index(drop=True)
df.shape

### Any missing values?

In [None]:
df.isnull().sum()

### What are the available categories?

In [None]:
df.Category.value_counts().sort_values(ascending=True).plot(kind='pie', figsize=(10,8));

In [None]:
# Examples from each category
df.groupby('Category').tail(1)

In [None]:
category_map = {
    # Bakery & Breakfast has 174 records, leave this alone
    'Bakery & Breakfast': 'Bakery & Breakfast',
    
    # Fresh Food
    'Fruit & Vegetable': 'Fresh Food',
    'Dairy, Chilled & Eggs': 'Fresh Food',
    'Meat & Seafood': 'Fresh Food',
    
    # Drinks
    'Wines, Beers & Spirits': 'Drinks',
    'Beverages': 'Drinks',
    
    # Cupboard
    'Rice & Cooking Essentials': 'Cupboard',
    'Choco, Snacks, Sweets': 'Cupboard',
    'Health': 'Cupboard',
    'Frozen': 'Cupboard',
    
    # Other
    'Household': 'Other',
    'Mother & Baby': 'Other',
    'Beauty': 'Other',
    'Pet Care': 'Other',
    'Party Supplies': 'Other',
    'Kitchen & Dining': 'Other',
}

In [None]:
df['Target'] = df.Category.map(category_map)

In [None]:
df.head()

In [None]:
# How many of each of the new target categories?
df.Target.value_counts().sort_values(ascending=True).plot(kind='pie', figsize=(5,5));

### Text vectorization

It's ready for vectorization! Let's apply scikit-learn's CountVectorizer to the `Items` column.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
CountVectorizer?

We need to create a new vectorizer and specify how we want it to work. Think of this like constructing a machine to your specification, ready to feed all your text data into. 🤖

In [None]:
vectorizer = CountVectorizer(max_features=1000,     # max number of terms to keep (the N most frequent)
                             ngram_range=(1, 2),    # e.g. (1,1) for single words only, (1,2) for single words and bigrams
                             stop_words='english',  # remove English language stop words, e.g. 'to', 'the', 'it'
                             binary=True)           # use 1/0 instead of word count

The `vectorizer` "machine" hasn't yet seen any of our data. Let's change that by feeding our item descriptions into the `.fit()` method. When this runs, it will "learn a vocabulary", i.e. what words appear in this data, and how often?

In [None]:
# Use `fit` to learn the vocabulary of the item descriptions
vectorizer.fit(df.Items)

Learning some vocabulary is only half the job. It's time for our `vectorizer` to apply what it learned and construct a "document-term matrix" containing one row for each sample and one column for each term (remember, a "term" may be 1 or even 2 consecutive words, as we specified a couple cells above).

In [None]:
# Use `transform` to generate the X "word matrix" - one column per feature (word or n-grams)
vectorizer.transform(df.Items)
# Sparse matrix! Only the non-zero entries are recorded...
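
To make the `fit` / `transform` split concrete, here is a minimal pure-Python sketch of a binary bag-of-words with single words and bigrams. It is an illustration only, not how scikit-learn implements `CountVectorizer`: the helper names (`ngrams`, `fit`, `transform`) and the toy descriptions are made up, words are split on whitespace, and stop-word removal and `max_features` are left out.

```python
def ngrams(text, n_min=1, n_max=2):
    """Split lower-cased text into word n-grams, e.g. 'oat milk' -> ['oat', 'milk', 'oat milk']."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]


def fit(documents):
    """'Learn a vocabulary': map every term seen in the documents to a column index."""
    terms = sorted({term for doc in documents for term in ngrams(doc)})
    return {term: column for column, term in enumerate(terms)}


def transform(documents, vocabulary):
    """Build a binary document-term matrix: one row per document, one column per term."""
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for term in ngrams(doc):
            if term in vocabulary:  # terms not seen during fit are ignored
                row[vocabulary[term]] = 1
        matrix.append(row)
    return matrix


# Toy item descriptions (made up for illustration)
docs = ["white bread", "oat milk", "oat bread"]
vocabulary = fit(docs)
matrix = transform(docs, vocabulary)

print(sorted(vocabulary))
for doc, row in zip(docs, matrix):
    print(doc, row)
```

Each row has a 1 only in the columns for terms that appear in that description, which is why most entries are zero once the vocabulary grows.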

This sparse matrix has the same number of rows as our data (603) and up to 1,000 columns (because we specified `max_features=1000` earlier when creating our `vectorizer`). However, it's stored in a compressed format that is machine-friendly but not human-friendly. Let's fix that, first by converting it to a NumPy array.

In [None]:
# Call .toarray() to transform this into a full matrix (less space optimised)
vectorizer.transform(df.Items).toarray()

What's going on here? Most entries are actually zero (this is what "sparse matrix" means, the matrix is mostly empty). This is because most rows don't contain that many words, so for the top 1k words most of them won't be in a short description like "Oreo mini oreo sharepack".

The data is all here, but it's still not friendly as we're missing our column names (i.e. the terms themselves). The vectorizer provides these via `vectorizer.get_feature_names_out()` (scikit-learn 1.0+; older versions use `get_feature_names()`, which was removed in 1.2):

In [None]:
vectorizer.get_feature_names_out()[:5]

Let's transform our text with the vectorizer ➡ turn the result into a NumPy array ➡ add in the feature names ➡ store all this in a pandas dataframe.

In [None]:
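# Call .toarray() to convert the sparse matrix into a full array, then wrap it in a dataframe
# (get_feature_names_out needs scikit-learn 1.0+; older versions use get_feature_names)
X = pd.DataFrame(vectorizer.transform(df.Items).toarray(),
                 columns=vectorizer.get_feature_names_out())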
y = df.Target

Time to take a look.

In [None]:
# Not all columns are shown - you can disable this by calling pd.set_option('display.max_columns', None)
X.head()

In [None]:
# Let's take a look at the non-zero entries in the first row, and compare to the original text.

# Original text
print(df.Items.iloc[0])

# Add some blank lines
print('\n')

# Non-zero entries in the first row
first_row = X.iloc[0]
print(first_row[first_row > 0])

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Build a random-forest text classifier
</h2><br>
</div>

This code should be very familiar from the previous notebook.

In [None]:
from sklearn import ensemble, model_selection

In [None]:
model = ensemble.RandomForestClassifier(n_estimators=20)
model_selection.cross_val_score(model, X, y, scoring='accuracy', cv=5).mean()

In [None]:
# What features are most important?
model.fit(X, y)
feature_importances = pd.DataFrame({'Features' : X.columns, 'Importance Score': model.feature_importances_})
feature_importances = feature_importances.sort_values('Importance Score', ascending=False)
feature_importances.head(10)

In [None]:
# Let's also calculate which way these keywords influence the categorisation decision

def is_word_in_text(x, word):
    # Lower-case the text first, because the vectorizer lower-cases everything it sees
    return word in x.lower()

def percentify(x):
    """Turns 0.1234 into 12%"""
    return f"{100*x:.0f}%"

In [None]:
for word in feature_importances.Features.head(10):
    print('\n\n-------------------\n\n')
    df[word] = df.Items.apply(is_word_in_text, word=word)
    print("Word:", word)

    # We want to calculate which categories contain this term,
    # and also what % of the items in the category contains the term
    percent_that_contains_word = df.groupby('Target')[word].mean()

    # Most words are only in 1-2 categories, so let's ignore
    # categories which don't contain the word.
    # The backticks let .query() handle two-word terms (bigrams).
    percent_that_contains_word = (
        percent_that_contains_word
        .sort_values(ascending=False)
        .reset_index()
        .query(f"`{word}` > 0")
    )

    # Convert 0.212644 into 21%
    percent_that_contains_word[word] = percent_that_contains_word[word].apply(percentify)

    print(percent_that_contains_word)

# How to read this?

#     Word: bread
#                    Target   bread
#     0  Bakery & Breakfast     21%

# "bread" is the first item in this list, because "bread" is the feature
# with the highest importance score according to the RandomForestClassifier.

# The above suggests that this is true because "bread" appears in 21% of Bakery & Breakfast rows,
# and in no other categories. This is a fairly strong signal that rows containing "bread" belong
# to the "Bakery & Breakfast" category!
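
The percentage calculation relies on a small pandas trick: the mean of a True/False column is the fraction of rows that are True. The sketch below shows this on a toy dataframe. The five item names are made up for illustration and are not taken from the grocery dataset.

```python
import pandas as pd

# Toy data (made up for illustration, not from the grocery dataset)
toy = pd.DataFrame({
    "Items": ["White bread", "Brown bread", "Rolled oats", "Oat milk", "Red wine"],
    "Target": ["Bakery & Breakfast"] * 3 + ["Drinks"] * 2,
})

# True/False flag: does the lower-cased description contain the word?
toy["bread"] = toy["Items"].str.lower().str.contains("bread", regex=False)

# The mean of a boolean column is the fraction of True values,
# so this is the share of each category's items that mention "bread"
share = toy.groupby("Target")["bread"].mean()
print(share)
```

Here two of the three `Bakery & Breakfast` items contain "bread" and neither `Drinks` item does, so the shares are roughly 0.67 and 0.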

<div class="alert alert-block alert-warning">
<b><i class="fa fa-check-square" aria-hidden="true"></i>&nbsp; Check</b><br>

There's a lot going on in the last few cells. Run through them line by line, and make sure you follow along. If there are any pandas functions you're unclear on, we **strongly** encourage you to take a moment to check the [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) or the pandas documentation, or to review the previous course materials.
    
You should also check that you fully understand what's happening in this last cell. We explained the bread example. Here's another example for "oats" (this may be the final example above, although random forests _are_ random so it may not be!):

```
Word: oats
               Target oats
0  Bakery & Breakfast   7%
1              Drinks   1%
```
    
The above suggests that "oats" is an important feature because:
    - it appears in 7% of `Bakery & Breakfast`
    - it appears in 1% of `Drinks`
</div>

In [None]:
# Examples containing "oats"...could be breakfast oats or oat milk
df[X['oats'] == 1].head(10)

---

> ### 🚩 Exercise
> Copy your hyperparameter tuning code from the previous notebook, and identify the best `max_depth` and `n_estimators` for a `RandomForestClassifier` for this problem.
> 
> **Tips:**
> - You don't need to change much at all! This is why we use generic variables like `df` and `X` and `y` and `model`...it means code you write to solve one problem is abstract enough that it can be copied verbatim to solve another problem. This is a **huge** productivity win, as you'll find out now.
> - You are welcome to try adjusting the other hyperparameters for a **[RandomForestClassifier()](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**, it's good practice! We encourage you to read scikit-learn's **[advice on tuning random forest parameters](https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters)** as it's a great example of _why_ this is such a high quality library. This advice is the result of the library's authors distilling countless research papers into best practices!
> - _However_, if you do try experimenting with other hyperparameters, don't spend too long. `max_depth` and `n_estimators` are the big ones, and you'll get a lot more "bang for your buck" by focusing on adding some more features first and coming back to model tuning at a later stage.

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Adding more features
</h2><br>
</div>

---

> ### 🚩 Exercise
> Take your best settings for `max_depth` and `n_estimators` and enter them into the cell below.

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE

best_max_depth = 1
best_n_estimators = 1

In [None]:
model = ensemble.RandomForestClassifier(n_estimators=best_n_estimators, max_depth=best_max_depth)
model_selection.cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()

---

We can add more features from our original dataset `orders` to our input feature matrix `X` (which contains only vectorizer-generated words/terms at the moment) as follows. Let's review the three dataframes in this notebook.

In [None]:
# orders - this is our original dataframe with 50k rows and 32 cols
print(orders.shape)
orders.head()

In [None]:
# df - contains one row for each product + item descriptions + categories + columns added in
#      our "what % of category X contains word Y" feature interpretation step earlier
print(df.shape)
df.head()

In [None]:
# X - this is our feature matrix that we input into the ML model
X.head()

Orders has some useful info that we can add to X. Let's add Price.

In [None]:
# This creates a lookup from Items to Price
items_to_price = (
    orders[['Items', 'Price']]
    .drop_duplicates(subset='Items')
    .set_index('Items')['Price']
)
items_to_price

In [None]:
# Use this lookup to add it into df
df['Price'] = df.Items.map(items_to_price)

In [None]:
# df maps 1:1 to X (they have the same number of rows) so we can just copy it across
X['Price'] = df['Price']
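
The lookup-then-`.map()` pattern used above can be tried out on a toy example. The data below is made up for illustration. Note that `drop_duplicates(subset=...)` keeps the first row it finds for each item, so if an item had several different prices only the first would be used.

```python
import pandas as pd

# Toy purchase log (made up for illustration): the same item appears more than once
toy_orders = pd.DataFrame({
    "Items": ["Oat milk", "White bread", "Oat milk"],
    "Price": [1.50, 0.90, 1.50],
})

# One row per item, indexed by item name: a Series that works as a lookup table
toy_lookup = toy_orders.drop_duplicates(subset="Items").set_index("Items")["Price"]

# .map() looks up each value; items missing from the lookup become NaN
toy_items = pd.Series(["White bread", "Oat milk", "Cat food"])
toy_prices = toy_items.map(toy_lookup)
print(toy_prices)
```

A quick `df.Price.isnull().sum()` after mapping is a cheap way to confirm that every item found a price.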

In [None]:
# Intuitively, should adding Price help?
plt.figure(figsize=(25, 6))
sns.boxplot(x='Target', y='Price', data=df);

In [None]:
# Does including Price help?
model = ensemble.RandomForestClassifier(n_estimators=best_n_estimators, max_depth=best_max_depth)

print(
    'Without Price:',
    model_selection.cross_val_score(model, X.drop(columns='Price'), y, cv=5, scoring='accuracy').mean()
)

print(
    'With Price:',
    model_selection.cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
)

<div class="alert alert-block alert-warning">
<b><i class="fa fa-check-square" aria-hidden="true"></i>&nbsp; Check</b><br>

Did adding Price improve this model noticeably? Try re-running the cell above a few times to get a sense of how much is "random variation" and how much is a real difference, if any. The variation here comes from the random forest itself: with `cv=5`, scikit-learn uses the same unshuffled stratified folds on every run.
    
Were you expecting this to be a useful feature?
</div>
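
One way to put a number on "random variation" is to repeat the cross-validation with several different `random_state` seeds and look at the spread of the scores. The sketch below does this on synthetic data from `make_classification`, which is a stand-in chosen for illustration and not the grocery dataset. The choice of 5 seeds and 20 trees is also arbitrary.

```python
import numpy as np
from sklearn import ensemble, model_selection
from sklearn.datasets import make_classification

# Synthetic stand-in data (an assumption for illustration, not the grocery dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

scores = []
for seed in range(5):
    # A different seed gives a different (but reproducible) random forest
    toy_model = ensemble.RandomForestClassifier(n_estimators=20, random_state=seed)
    scores.append(
        model_selection.cross_val_score(toy_model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    )

print(f"mean accuracy: {np.mean(scores):.3f}, std across seeds: {np.std(scores):.3f}")
```

If the gap between "With Price" and "Without Price" is no bigger than the spread across seeds, it is hard to argue that the feature made a real difference.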

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Model architecture selection
</h2><br>
</div>

In the previous notebook we included the graphic below, taken from [this page in the scikit-learn documentation](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

If we follow the prompts, it suggests we try a Naive Bayes approach for this type of problem. Naive Bayes methods are a great technique to consider when building text classifiers, and they were a mainstay of early email spam filters. We won't go into detail here on what Naive Bayes methods are or how they work, but do feel free to read scikit-learn's [excellent user guide on Naive Bayes techniques](https://scikit-learn.org/stable/modules/naive_bayes.html) or listen to this friendly [short Data Skeptic podcast episode on Naive Bayes classifiers for spam detection](https://dataskeptic.com/blog/episodes/2018/spam-filtering-with-naive-bayes).

<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@ml_map.png" alt="ML map" style="width: 1000px;"/>

In [None]:
from sklearn import naive_bayes

---

> ### 🚩 Exercise
> We've given you a Naive Bayes model in the cell below. Find & copy in your line of code from earlier that calculates the five-fold cross-validated accuracy, given `X`, `y` and a model. It _should_ work directly with the `BernoulliNB` below without any issues. It should also provide an accuracy improvement _far_ better than anything else we've done so far!
> 
> This consistency is one of the best features of scikit-learn's design: whether you're working with linear methods, decision trees, random forests, support vector machines, neural networks...it doesn't matter, everything is just "plug-and-play". Want to try out an entirely different model architecture? Just swap out the one you have for a new one, easy!

In [None]:
X = X.drop(columns='Price')

In [None]:
# How did we know to set alpha=0.1? We may have done some hyperparameter tuning in advance,
# feel free to replicate and confirm our findings.
model = naive_bayes.BernoulliNB(alpha=0.1)

In [None]:
# ✏️ ENTER YOUR SOLUTION HERE




---

In [None]:
model = naive_bayes.BernoulliNB(alpha=0.1).fit(X, y)
df['Predicted'] = model.predict(X)
df['Correct'] = df.Predicted == y
df.query("Correct == False")

In [None]:
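from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; this is its replacement (scikit-learn 1.0+)
ConfusionMatrixDisplay.from_estimator(model, X, y);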

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Predicting for new text input
</h2><br>
</div>

We're going to return to the simple text-only classifier and train a BernoulliNB model to learn to predict the target categories. Everything we've done so far is simply model evaluation to find the best "recipe".

Now we know the best recipe (`BernoulliNB(alpha=0.1)`) it's time to use that recipe to prepare our competition-winning cake. In other words, let's train the model on the full dataset and take it for a spin.

In [None]:
# Re-create X to work with just text data, and then train a model on the full dataset
X = pd.DataFrame(vectorizer.transform(df.Items).toarray(),
                 columns=vectorizer.get_feature_names_out())  # scikit-learn 1.0+
y = df.Target

model = naive_bayes.BernoulliNB(alpha=0.1).fit(X, y)

In [None]:
# Let's construct some new data, I've made these up!
new_text = pd.Series([
    # Bakery & Breakfast
    'honey & maple syrup porridge',
    'bran flakes',
    
    # Cupboard
    'arborio risotto rice',
    'baking powder',

    # Drinks
    'cabernet sauvignon red wine',
    'sparkling water',
    
    # Fresh Food
    'wheel of cheese',
    'mixed grapes',
    
    # Other
    'cat food',
    'washing up liquid',
    
    # Trickier edge cases
    'banana bread',  # is it bananas or bread?
    'grape juice',
    'chocolate orange',  # one of your 5-a-day?
    
    # ADD YOUR OWN EXAMPLES UNDER HERE
    
])

In [None]:
# We need to vectorize the data first, but using the EXACT SAME vectorizer
# (remember the vectorizer was fitted to the training data)
X_predict = pd.DataFrame(vectorizer.transform(new_text).toarray(),
                         columns=vectorizer.get_feature_names_out())  # scikit-learn 1.0+
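
Reusing the fitted vectorizer matters because `transform` only knows the vocabulary learned during `fit`: any word in the new text that was not in the training data is silently dropped. A minimal sketch with a toy vectorizer (fitted on two made-up descriptions, purely for illustration) shows the effect.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vectorizer (made up for illustration): fitted on two descriptions only
toy_vectorizer = CountVectorizer(binary=True).fit(["white bread", "red wine"])
print(sorted(toy_vectorizer.vocabulary_))

# "banana" was never seen during fit, so only the "bread" column is set
toy_row = toy_vectorizer.transform(["banana bread"]).toarray()
print(toy_row)
```

This is worth keeping in mind when reviewing the predictions: a new description made up entirely of unseen words becomes a row of zeros, so the model has nothing to go on.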

In [None]:
X_predict

In [None]:
# This is what the classifier model actually sees
pd.set_option('display.max_columns', None)

In [None]:
# Filter to only words in the feature matrix that are in our new_text list
non_zero_cols = (X_predict.sum() > 0)

In [None]:
non_zero_cols

In [None]:
# Pick out the columns that match this filter
non_zero_col_names = X_predict.columns[non_zero_cols]
non_zero_col_names

In [None]:
# Display the filtered dataframe and add in new_text as the index so it's easy to review
new_text_feature_matrix = X_predict[non_zero_col_names].set_index(new_text)
new_text_feature_matrix

In [None]:
# This is a great way to visualise this matrix
sns.heatmap(new_text_feature_matrix)

In [None]:
# Generate the model's predictions
model.predict(X_predict)

In [None]:
# Construct a dataframe showing the predicted class + probabilities for other classes
# This gives us insight into the model's confidence for each prediction

predicted_classes = pd.DataFrame(model.predict(X_predict), columns=['Prediction'])

predicted_probabilities = pd.DataFrame(model.predict_proba(X_predict), columns=model.classes_)
predicted_probabilities = (predicted_probabilities * 100).round()  # formatted as %

# Horizontally concatenate & add in new_text
predictions = pd.concat(
    [predicted_classes,
     predicted_probabilities], axis=1).set_index(new_text)
predictions

In [None]:
# Display stacked bar chart showing the predicted class probabilities for each item
predictions.iloc[::-1].plot.barh(stacked=True, figsize=(10,10));

---