# Wine Ratings Explanations w/ Supervised ML

This notebook is a follow-up to https://www.kaggle.com/olivierg13/wine-ratings-analysis-w-supervised-ml
We try to explain our Random Forest Classifier (RFC) model, which classifies wine ratings based on their descriptions, and to improve it along the way.
For this purpose, we use Lime (https://github.com/marcotcr/lime).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Plotting and graphs
import seaborn as sns
import matplotlib.pyplot as plt

# SK learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Explanations libraries
from lime import lime_text
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Preparing the data

In [None]:
init_data = pd.read_csv("../input/winemag-data_first150k.csv")
print("Length of dataframe before duplicates are removed:", len(init_data))
init_data.head()

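# Keep one row per unique description (the boolean mask alone would keep only the duplicates)
parsed_data = init_data.drop_duplicates(subset='description')
print("Length of dataframe after duplicates are removed:", len(parsed_data))

# dropna returns a new DataFrame, so the result must be assigned back
parsed_data = parsed_data.dropna(subset=['description', 'points'])
print("Length of dataframe after NaNs are removed:", len(parsed_data))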

parsed_data.head()

dp = parsed_data[['description','points']]
dp.info()
dp.head()

# Map a raw score (points) to one of five rating classes, 1 (lowest) to 5 (highest)
def transform_points_simplified(points):
    # Each branch is only reached when the previous ones failed,
    # so checking the upper bound is enough.
    if points < 84:
        return 1
    elif points < 88:
        return 2
    elif points < 92:
        return 3
    elif points < 96:
        return 4
    else:
        return 5

# Apply the transform and store the result in a new column "points_simplified"
dp = dp.assign(points_simplified = dp['points'].apply(transform_points_simplified))
dp.head()
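
The same binning can also be written as a lookup against the sorted thresholds. The sketch below is an equivalent formulation using only the standard library; `BIN_EDGES` and `points_to_class` are names introduced here for illustration and are not part of the original notebook.

```python
from bisect import bisect_right

# Lower bounds of classes 2..5 (illustrative constant; same cut-offs as above)
BIN_EDGES = [84, 88, 92, 96]

def points_to_class(points):
    """Return the 1-based rating class for a raw score."""
    # bisect_right counts how many edges are <= points
    return bisect_right(BIN_EDGES, points) + 1

for score in (80, 84, 91, 95, 100):
    print(score, "->", points_to_class(score))
```

Keeping the thresholds in one list makes it easier to experiment with different class boundaries without rewriting the branches.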

# Description Vectorization and Model Training

In [None]:
X = dp['description']
y = dp['points_simplified']

# Vectorizing the descriptions
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)

# Training model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Testing model
predictions = rfc.predict(X_test)
print(classification_report(y_test, predictions))

# Explanation of Random Forest Classifier

Using Lime (https://github.com/marcotcr/lime), we are going to explain this model and find which words are most representative of each class.
This notebook follows Lime's tutorial for text explanation: https://marcotcr.github.io/lime/tutorials/Lime%20-%20basic%20usage%2C%20two%20class%20case.html

In [None]:
# Pipeline from text to prediction
pipeline = make_pipeline(vectorizer, rfc)

# LimeTextExplainer
class_names = ['Not Super Good', 'Average', 'Good', 'Great', 'Exceptional']
explainer = LimeTextExplainer(class_names = class_names)

exp = explainer.explain_instance(dp['description'].iloc[0], pipeline.predict_proba, num_features=20, top_labels=1)
print('Probability =', pipeline.predict_proba([dp['description'].iloc[0]]))
exp.show_in_notebook(text=True)
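
To build intuition for what the explainer reports, recall that Lime perturbs the text by removing words and watches how the predicted probabilities move. The sketch below is a much simplified stand-in, not Lime's actual algorithm (Lime samples many random perturbations and fits a locally weighted linear model): it removes one word at a time and records the change in a score. `toy_score` is a hypothetical black box standing in for `pipeline.predict_proba`, and its weights are made up.

```python
def toy_score(text):
    """Hypothetical black box: a made-up probability of the 'Great' class."""
    weights = {"elegant": 0.3, "complex": 0.2, "thin": -0.25}
    score = 0.5 + sum(weights.get(w, 0.0) for w in text.lower().split())
    return min(1.0, max(0.0, score))

def leave_one_word_out(text, score_fn):
    """Return (word, contribution) pairs, largest absolute effect first."""
    words = text.split()
    base = score_fn(text)
    effects = []
    for i, word in enumerate(words):
        # Rebuild the text without the i-th word and measure the score change
        perturbed = " ".join(words[:i] + words[i + 1:])
        effects.append((word, round(base - score_fn(perturbed), 4)))
    return sorted(effects, key=lambda pair: abs(pair[1]), reverse=True)

effects = leave_one_word_out("This elegant wine is complex but thin", toy_score)
for word, effect in effects:
    print(f"{word:>8}: {effect:+.2f}")
```

A positive value means the word pushes the score up, a negative value means it pulls the score down, and words the black box ignores get zero.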

# Text cleanup

Thanks to the explanation, we can see that TfidfVectorizer does not lower the weights of some common words enough: irrelevant words like "at", "is", "this", "the" and "of" (the exact words change between training runs) show up as explanations of a "Great" wine.
Let's remove those words from our data using NLTK's stopwords collection.

In [None]:
# Stop words import for English
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))

def removeStopwords(x):
    # Remove all stopwords; compare in lowercase because the NLTK list is lowercase
    filtered_words = [word for word in x.split() if word.lower() not in stops]
    return " ".join(filtered_words)

# Apply to the column
print(dp['description'].iloc[0])
dp['description'] = dp['description'].map(removeStopwords)
print(dp['description'].iloc[0])

Cool, no more common words. Let's try to re-train our RFC.

In [None]:
X = dp['description']
y = dp['points_simplified']

# Vectorizing the descriptions
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)

# Training model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

# Testing model
predictions = rfc.predict(X_test)
print(classification_report(y_test, predictions))
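
As an alternative to filtering the text by hand, scikit-learn's `TfidfVectorizer` accepts a `stop_words` parameter. The sketch below uses its built-in `"english"` list on two made-up descriptions. This list differs from NLTK's, so the resulting vocabulary would not necessarily match the one used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up descriptions, for illustration only
docs = [
    "This wine is ripe and fruity",
    "The finish of this wine is tannic",
]

# stop_words="english" drops scikit-learn's built-in English stop words
vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
```

Filtering inside the vectorizer keeps the original text intact, which can be convenient when the raw descriptions are later passed to an explainer.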

Let's run the explanation again:

In [None]:
# Pipeline from text to prediction
pipeline = make_pipeline(vectorizer, rfc)

# LimeTextExplainer
class_names = ['Not Super Good', 'Average', 'Good', 'Great', 'Exceptional']
explainer = LimeTextExplainer(class_names = class_names)

exp = explainer.explain_instance(dp['description'].iloc[0], pipeline.predict_proba, num_features=20, top_labels=1)
print('Probability =', pipeline.predict_proba([dp['description'].iloc[0]]))
exp.show_in_notebook(text=True)

# Conclusion

Explaining Machine Learning models is really useful!

In our case, we found out that TfidfVectorizer was not doing a great job of lowering the weights of common words.
Removing them did not really improve our model's overall precision, but we can see that the model is now more precise on an individual class.
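
The per-class comparison mentioned above can be made concrete by computing precision class by class, which is what `classification_report` prints in its `precision` column. The sketch below does this with the standard library on made-up labels; `per_class_precision` and the sample data are illustrative, not results from this notebook.

```python
from collections import Counter

def per_class_precision(y_true, y_pred):
    """Precision per class: correct predictions / all predictions of that class."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / predicted[label] for label in sorted(predicted)}

# Made-up labels, for illustration only
y_true = [2, 2, 3, 3, 3, 4]
y_pred = [2, 3, 3, 3, 4, 4]

precisions = per_class_precision(y_true, y_pred)
for label, precision in precisions.items():
    print(f"class {label}: precision = {precision:.2f}")
```

Running such a helper on the predictions from before and after the stop word removal would show which class changed, rather than relying on the averaged figure alone.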