# Utilizing Machine Learning Classification Algorithms to Rank Cheese Quality 

#### *An Exercise in Exploratory Data Analysis by Karnjiv Gill*

***

## Introduction

Cheesemaking is a practice that predates recorded history and has been a central part of many cultures throughout the world. A large variety of cheeses is available to us, from cheap, highly processed grocery store cheese to rich, meticulously refined artisanal cheese. I will attempt to build a predictive model that scores well on unseen data and can classify the fat level of various cheeses.


## Dataset Description 

For this analysis I will be using the `cheese.csv` dataset, which was retrieved through Kaggle and is provided under an Open License by the Government of Canada. More specifically, I will use the `FatLevel` column as the target and the `ManufacturerProvCode`, `ManufacturingTypeEn`, `MoisturePercent`, `Organic`, `CategoryTypeEn`, `MilkTypeEn` and `MilkTreatmentTypeEn` columns as features.

## Exploratory Data Analysis

Let us begin by loading in the necessary libraries and the following table: `cheese.csv` 


In [2]:
import altair as alt
import graphviz
import numpy as np
import pandas as pd
import string
from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.svm import SVC, SVR

from scipy.stats import lognorm, loguniform, randint

#alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

Below I will display the `cheese.csv` dataframe so that we can take a closer look at the columns needed for this analysis.

In [3]:
#Loading in the dataframe
cheese_df = pd.read_csv('data/cheese.csv')
cheese_df.head()

Unnamed: 0,CheeseId,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,FlavourEn,CharacteristicsEn,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn,CheeseName,FatLevel
0,228,NB,Farmstead,47.0,"Sharp, lactic",Uncooked,0,Firm Cheese,Ewe,Raw Milk,Washed Rind,Sieur de Duplessis (Le),lower fat
1,242,NB,Farmstead,47.9,"Sharp, lactic, lightly caramelized",Uncooked,0,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Tomme Le Champ Doré,lower fat
2,301,ON,Industrial,54.0,"Mild, tangy, and fruity","Pressed and cooked cheese, pasta filata, inter...",0,Firm Cheese,Cow,Pasteurized,,Provolone Sette Fette (Tre-Stelle),lower fat
3,303,NB,Farmstead,47.0,Sharp with fruity notes and a hint of wild honey,,0,Veined Cheeses,Cow,Raw Milk,,Geai Bleu (Le),lower fat
4,319,NB,Farmstead,49.4,Softer taste,,1,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Gamin (Le),lower fat


Next, I will perform some data cleaning by removing the columns that I will not be using in my analysis. 

In [4]:
cheese_clean = cheese_df.drop(columns=['CheeseId','CharacteristicsEn','RindTypeEn','FlavourEn','CheeseName'])
cheese_clean

Unnamed: 0,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,FatLevel
0,NB,Farmstead,47.0,0,Firm Cheese,Ewe,Raw Milk,lower fat
1,NB,Farmstead,47.9,0,Semi-soft Cheese,Cow,Raw Milk,lower fat
2,ON,Industrial,54.0,0,Firm Cheese,Cow,Pasteurized,lower fat
3,NB,Farmstead,47.0,0,Veined Cheeses,Cow,Raw Milk,lower fat
4,NB,Farmstead,49.4,1,Semi-soft Cheese,Cow,Raw Milk,lower fat
...,...,...,...,...,...,...,...,...
1037,NS,Farmstead,37.0,1,Hard Cheese,Cow,Pasteurized,higher fat
1038,AB,Industrial,46.0,0,Fresh Cheese,Cow,Pasteurized,lower fat
1039,NS,Artisan,40.0,0,Veined Cheeses,Ewe,Thermised,higher fat
1040,NS,Artisan,34.0,0,Semi-soft Cheese,Ewe,Thermised,higher fat


Below I have visualized the number of higher fat and lower fat cheeses by the animal whose milk was used. It is interesting to note that the majority of the cheeses in this dataset, at both fat levels, are made with cow's milk, followed by goat and ewe milk.

In [10]:
fat_count = cheese_clean.groupby(by=['MilkTypeEn','FatLevel']).size()
fat_count = pd.DataFrame(fat_count).reset_index()
fat_count = fat_count.sort_values(by=0,ascending=False)
fat_count = fat_count.rename(columns={0:'Count'})

alt.Chart(fat_count, width=500, height=300).mark_bar().encode(  
    x=alt.X('MilkTypeEn:O'),
    y=alt.Y('sum(Count):Q'),
    color='MilkTypeEn:N',     
    row='FatLevel:N' )
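
As a rough illustration of the counting step above, the same group-and-count table can be built in a single chain by naming the count column in `reset_index`. The rows below are made-up toy data (not values from `cheese.csv`), and `toy_df` and `toy_counts` are hypothetical names used only for this sketch.

```python
import pandas as pd

# Made-up toy rows; NOT values from cheese.csv
toy_df = pd.DataFrame({
    "MilkTypeEn": ["Cow", "Cow", "Cow", "Goat", "Goat", "Ewe"],
    "FatLevel": ["lower fat", "lower fat", "higher fat",
                 "lower fat", "lower fat", "higher fat"],
})

# Count cheeses per (milk type, fat level) pair and name the count column directly
toy_counts = (
    toy_df.groupby(["MilkTypeEn", "FatLevel"])
    .size()
    .reset_index(name="Count")
    .sort_values("Count", ascending=False)
)
print(toy_counts)
```

The resulting frame has one row per milk type and fat level, which is the shape the bar chart expects.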

Before beginning any data manipulation, it is important to split the data so that the golden rule is not broken: the test data must not influence how the model is trained. I will be using an 80/20 split for my train and test data, where the target column is `FatLevel` and the features are `ManufacturerProvCode`, `ManufacturingTypeEn`, `MoisturePercent`, `Organic`, `CategoryTypeEn`, `MilkTypeEn` and `MilkTreatmentTypeEn`.

The features can be subset into the following types: 


The `MilkTypeEn`, `ManufacturerProvCode`, `CategoryTypeEn`, `MilkTreatmentTypeEn` and `ManufacturingTypeEn` columns are categorical features

The `Organic` column is a binary feature

The `MoisturePercent` column is a numeric feature
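
The grouping above was done by hand, but a quick heuristic check of column dtypes can help confirm it. The sketch below uses a made-up toy frame with the same kinds of columns; the rule "a number column with exactly two distinct values is binary" is my own assumption and would misfire on a numeric column that happens to contain only two values.

```python
import pandas as pd

# Made-up toy frame with one column of each kind; NOT values from cheese.csv
toy_X = pd.DataFrame({
    "MilkTypeEn": ["Cow", "Goat", "Ewe", "Cow"],
    "Organic": [0, 1, 0, 0],
    "MoisturePercent": [47.0, 52.0, 40.0, 49.4],
})

# Columns stored as numbers are either binary flags or true numeric features
number_cols = toy_X.select_dtypes(include="number").columns
toy_binary = [c for c in number_cols if toy_X[c].dropna().nunique() == 2]
toy_numeric = [c for c in number_cols if c not in toy_binary]

# Everything else is treated as categorical
toy_categorical = list(toy_X.select_dtypes(exclude="number").columns)

print(toy_categorical, toy_binary, toy_numeric)
```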

Now I will create my X and y variables by separating the features and the target

In [11]:
X = cheese_clean.drop(columns='FatLevel')
y = cheese_clean['FatLevel']

X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=123)
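
To make the 80/20 split concrete, here is a minimal sketch on ten made-up rows. The frame `toy` and the `_toy` names are hypothetical; only the `train_test_split` call mirrors the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data: 10 rows, NOT values from cheese.csv
toy = pd.DataFrame({
    "MoisturePercent": [47.0, 47.9, 54.0, 47.0, 49.4, 52.0, 37.0, 40.0, 34.0, 36.0],
    "FatLevel": ["lower fat"] * 6 + ["higher fat"] * 4,
})
X_toy = toy.drop(columns="FatLevel")
y_toy = toy["FatLevel"]

# Same arguments as the notebook: 20% of rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)
print(len(X_tr), len(X_te))  # 8 2
```

No row appears in both parts, which is what keeps the test set unseen during training.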

Because there are some null values present, I will use `SimpleImputer` with `strategy='most_frequent'` to fill them in.

In [12]:
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(X_train)
X_train_imp = imputer.transform(X_train)
X_train_imp

X_train_new = pd.DataFrame(X_train_imp)
X_train_new.isnull().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64
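
The output above loses the column names because `transform` returns a plain array. A minimal sketch of the same idea on made-up rows, keeping the column names and reusing the imputer fitted on the training rows for the test rows, could look like this (`toy_train`, `toy_test` and the `_imp` frames are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up toy rows standing in for X_train and X_test
toy_train = pd.DataFrame({
    "MilkTypeEn": ["Cow", "Cow", "Goat", np.nan],
    "MoisturePercent": [47.0, np.nan, 47.0, 52.0],
})
toy_test = pd.DataFrame({
    "MilkTypeEn": [np.nan, "Ewe"],
    "MoisturePercent": [np.nan, 40.0],
})

# Learn the most frequent value per column from the training rows only
toy_imputer = SimpleImputer(strategy="most_frequent")
toy_imputer.fit(toy_train)

# Apply the learned values to both parts and restore the column names
toy_train_imp = pd.DataFrame(
    toy_imputer.transform(toy_train), columns=toy_train.columns, index=toy_train.index
)
toy_test_imp = pd.DataFrame(
    toy_imputer.transform(toy_test), columns=toy_test.columns, index=toy_test.index
)
print(toy_test_imp)
```

The missing test values are filled with `Cow` and `47.0`, the most frequent training values, so nothing is learned from the test rows.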

Now that the data has been split and null values have been dealt with through imputation, I will create a `DummyClassifier` to establish a baseline score against which other models can be compared.

In [13]:
model = DummyClassifier()
model = model.fit(X_train, y_train)  # fit on the training split only so the test set stays unseen
model_scores = cross_validate(model, X_train,y_train, cv=5, return_train_score = True)
model_scores = pd.DataFrame(model_scores)
model_scores

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.001478,0.001137,0.658683,0.657658
1,0.001432,0.000617,0.658683,0.657658
2,0.00127,0.000643,0.658683,0.657658
3,0.001266,0.000612,0.656627,0.658171
4,0.001266,0.000583,0.656627,0.658171
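
A baseline that always predicts the most common class scores exactly the share of that class, which is a useful way to read the table above. The sketch below shows this on made-up labels; the 66/34 split is an assumption chosen to resemble the scores, not a count taken from the dataset, and `strategy="most_frequent"` is set explicitly for clarity.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 66 "lower fat" and 34 "higher fat" (an assumed split)
y_toy = np.array(["lower fat"] * 66 + ["higher fat"] * 34)
X_toy = np.zeros((100, 1))  # the features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

majority_share = (y_toy == "lower fat").mean()
baseline_accuracy = baseline.score(X_toy, y_toy)
print(majority_share, baseline_accuracy)
```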


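As we can see, the baseline train and validation scores are about 0.66 in every fold, which is what a model that ignores the features achieves when roughly two thirds of the cheeses belong to one class. I will use another model to see if the results are any better. First I need to define the categorical, binary and numeric features.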

In [14]:
categorical_features = ['MilkTypeEn','ManufacturerProvCode','CategoryTypeEn','MilkTreatmentTypeEn','ManufacturingTypeEn']
binary_features = ['Organic']
numeric_features = ['MoisturePercent']

Now to build the transformers to make the necessary adjustments to the features.

In [15]:
categorical_transformer = make_pipeline(
                    SimpleImputer(strategy='most_frequent'),
                    OneHotEncoder(handle_unknown='ignore'))

binary_transformer =  make_pipeline(
                    SimpleImputer(strategy='most_frequent'),
                    OneHotEncoder(handle_unknown='ignore'))
            
numeric_transformer = make_pipeline(
                    SimpleImputer(strategy='median'),
                    StandardScaler())

Next, I will build the preprocessor to input into the pipeline. 

In [16]:
preprocessor = make_column_transformer((categorical_transformer,categorical_features),
                                      (binary_transformer,binary_features),
                                      (numeric_transformer, numeric_features))
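
To see what the preprocessor produces, here is a small sketch that applies the same three transformers to a made-up frame with one column of each kind. The frames `toy_train` and `toy_new`, the helper `to_dense` and the milk type `Buffalo` are all hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up toy rows; NOT values from cheese.csv
toy_train = pd.DataFrame({
    "MilkTypeEn": ["Cow", "Goat", "Ewe", "Cow"],
    "Organic": [0, 1, 0, 0],
    "MoisturePercent": [47.0, 52.0, np.nan, 40.0],
})
# A new row with a milk type that was never seen during fitting
toy_new = pd.DataFrame({
    "MilkTypeEn": ["Buffalo"], "Organic": [1], "MoisturePercent": [47.0],
})

toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), ["MilkTypeEn"]),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), ["Organic"]),
    (make_pipeline(SimpleImputer(strategy="median"),
                   StandardScaler()), ["MoisturePercent"]),
)

def to_dense(matrix):
    """Return a dense array whether the transformer output is sparse or not."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

train_encoded = to_dense(toy_preprocessor.fit_transform(toy_train))
new_encoded = to_dense(toy_preprocessor.transform(toy_new))
print(train_encoded.shape)  # 3 milk columns + 2 organic columns + 1 numeric column
print(new_encoded)
```

Because of `handle_unknown="ignore"`, the unseen milk type is encoded as all zeros in the three milk columns instead of raising an error.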

Building the first pipeline with DecisionTreeClassifier

In [17]:
main_pipe = make_pipeline(preprocessor, DecisionTreeClassifier())

In [18]:
score = cross_validate(main_pipe, X_train, y_train, cv=5, return_train_score=True)
score = pd.DataFrame(score)
score

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.025761,0.010648,0.814371,0.951952
1,0.024024,0.010454,0.772455,0.951952
2,0.023726,0.010515,0.808383,0.947447
3,0.02384,0.010507,0.831325,0.943028
4,0.023709,0.010352,0.783133,0.946027
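
One way to read the table above is to average the folds and look at the gap between the training and validation scores. The sketch below copies the five `test_score` and `train_score` values from the table; the names `tree_scores`, `tree_summary` and `train_validation_gap` are my own.

```python
import pandas as pd

# Scores copied from the DecisionTreeClassifier cross-validation table above
tree_scores = pd.DataFrame({
    "test_score": [0.814371, 0.772455, 0.808383, 0.831325, 0.783133],
    "train_score": [0.951952, 0.951952, 0.947447, 0.943028, 0.946027],
})

# Average over the five folds, then compare train with validation
tree_summary = tree_scores.mean()
train_validation_gap = tree_summary["train_score"] - tree_summary["test_score"]

print(tree_summary.round(3))
print(round(train_validation_gap, 3))
```

A gap of this size (roughly 0.15) usually indicates that the tree fits the training folds more closely than it generalizes.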


The mean training score (about 0.95) is noticeably higher than the mean validation score (about 0.80), which suggests that the decision tree is overfitting somewhat. I feel that I may be able to achieve a better validation score, so I will try another model.

In [19]:
forest_pipe = make_pipeline(preprocessor, RandomForestClassifier())
forest_pipe.fit(X_train,y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['MilkTypeEn',
                                                   'ManufacturerProvCode',
                                                   'CategoryTypeEn',
                                                   'MilkTreatmentTypeEn',
                                                   'ManufacturingTypeEn']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
    

In [20]:
score = cross_validate(forest_pipe, X_train, y_train, cv=5, return_train_score=True)
score = pd.DataFrame(score)
score

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.26804,0.024795,0.832335,0.951952
1,0.259005,0.024719,0.820359,0.951952
2,0.257563,0.024711,0.820359,0.947447
3,0.261819,0.024418,0.855422,0.943028
4,0.260899,0.024581,0.837349,0.946027


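Here we can see that `RandomForestClassifier` has achieved a better mean validation score (about 0.83) than `DecisionTreeClassifier` (about 0.80). Next, I will set up a randomized search over the random forest's hyperparameters.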
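
To put the three models side by side, the validation scores from the three tables above can be collected into one frame. The numbers are copied from those tables; the names `validation_scores` and `comparison` are hypothetical.

```python
import pandas as pd

# Validation (test_score) columns copied from the three cross-validation tables
validation_scores = pd.DataFrame({
    "DummyClassifier": [0.658683, 0.658683, 0.658683, 0.656627, 0.656627],
    "DecisionTreeClassifier": [0.814371, 0.772455, 0.808383, 0.831325, 0.783133],
    "RandomForestClassifier": [0.832335, 0.820359, 0.820359, 0.855422, 0.837349],
})

# Mean and spread per model, best model first
comparison = (
    validation_scores.agg(["mean", "std"]).T.sort_values("mean", ascending=False)
)
print(comparison.round(3))
```

The standard deviation column gives a rough sense of whether the difference between two models is larger than the fold-to-fold variation.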

In [24]:
pipe = make_pipeline(
    preprocessor, RandomForestClassifier(class_weight='balanced', random_state=123)
)

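# Keys must follow the `<step name>__<parameter>` convention used by make_pipeline.
# The two hyperparameters chosen here are an assumption; the original keys named only the step.
param_grid = {
    "randomforestclassifier__max_depth": range(1,15),
    "randomforestclassifier__min_samples_leaf": range(1,15)
}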

randomizer = RandomizedSearchCV(
    pipe,
    param_grid,
    n_iter=50,
    cv=3,
    verbose=1,
    n_jobs=-1,
    scoring="f1_macro",  # plain "f1" expects a positive label of 1, but the labels here are strings
    random_state=123,
)
randomizer

RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(transformers=[('pipeline-1',
                                                                               Pipeline(steps=[('simpleimputer',
                                                                                                SimpleImputer(strategy='most_frequent')),
                                                                                               ('onehotencoder',
                                                                                                OneHotEncoder(handle_unknown='ignore'))]),
                                                                               ['MilkTypeEn',
                                                                                'ManufacturerProvCode',
                                                                                'CategoryTypeEn',
                

In [22]:
scoring = RandomizedSearchCV(forest_pipe, param_grid, cv=5, n_iter=5, return_train_score=True, random_state=123, verbose=2 )
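
The search objects above are created but never fit. As a minimal sketch of how such a search could be run and read, the example below fits a small randomized search on synthetic data from `make_classification`. The data, the `toy_` names, the two tuned hyperparameters and the reduced `n_estimators=20` are all assumptions made to keep the example small, so the printed values say nothing about the cheese data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the cheese dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=123)

toy_pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, class_weight="balanced", random_state=123),
)

# Parameter names are prefixed with the pipeline step name and two underscores
toy_param_dist = {
    "randomforestclassifier__max_depth": list(range(1, 15)),
    "randomforestclassifier__min_samples_leaf": list(range(1, 15)),
}

toy_search = RandomizedSearchCV(
    toy_pipe,
    toy_param_dist,
    n_iter=5,
    cv=3,
    scoring="f1_macro",
    random_state=123,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

After fitting, `best_params_` holds the sampled combination with the highest mean cross-validation score, and `best_estimator_` is the pipeline refit on all of the training data with those settings.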


## Discussion

The `RandomForestClassifier` outperformed the `DecisionTreeClassifier`, with a mean cross-validation accuracy of about 0.83 compared with about 0.80, and both performed much better than the `DummyClassifier` baseline (about 0.66). Another interesting question, which I was unable to answer with my current skill set, was whether it is possible to predict a cheese's manufacturer type from its fat level, moisture percentage and milk type. This would have led to some interesting analysis of whether a certain type of manufacturer favours a certain animal's milk, or prefers a cheese with a higher or lower fat content.

## References

All the work contained in this notebook is original and my own. The provided data folder was used to load the dataset.