# Classification of marbles - Part II
Let's go on with our marble examples. So far we performed the first three general steps of ML.
* Data Import and Preparation
* Data Exploration
* Feature Selection and Engineering
* [Model Definition](#Model-Definition-and-Training)
* [Training](#Model-Definition-and-Training)
* [Basics: Validation and Performance](#Basics:-Validation-and-Performance) 

In [None]:
import os
import numpy as np
import pandas as pd
import zipfile
import matplotlib.pyplot as plt

from IPython.display import display, clear_output, Markdown

## Data import and preparation

In [None]:
def parse_lines(lines):
    """ Parse strings of marble data"""
    lines = lines[2:-2]
    rows = [d.split(', ') for d in lines.split('), (')]
    data = [[int(v.replace(')][(', '')) for v in r] for r in rows]
    return pd.DataFrame(data)[[0, 1, 2]]

files = [
    'blue-white-glass.data',
    'cyan-glass.data',
    'glass-blue.data',
    'glass-green.data',
    'glass-red.data',
    'glass-yellow.data',
    'planet-black-blue.data',
    'planet-green.data',
    'planet-ocean.data',
]

dfs = []
for i, fname in enumerate(files):
    print(f'Load data {i}: {fname}')

    with zipfile.ZipFile(f'../.assets/data/marbles/{fname}.zip', 'r') as zipf:
        with zipf.open(f'{fname}', 'r') as infile:
            content = infile.readlines()[0].decode()
            dfs.append(parse_lines(content).assign(color=f'{fname}'.replace('.data', '')))

df = pd.concat(dfs)
df.columns=['R', 'G', 'B', 'color']

## Feature Engineering

In [None]:
def generate_xy_values(df):
    df['X'] = 0.5 * np.sqrt(3) * df['G'] - 0.5 * np.sqrt(3) * df['B']
    df['Y'] = df['R'] - (1 / 3 * df['G']) - (1 / 3 * df['B'])
    
def generate_intensity_values(df):
    df['I'] = np.square(df['X']) + np.square(df['Y'])

def generate_angles(df):
    df['Phi'] = np.arctan2(df['Y'], df['X'])

In [None]:
generate_xy_values(df)
generate_intensity_values(df)
generate_angles(df)

In [None]:
# Quick check
df.head()

In [None]:
# Quick check
df['color'].value_counts()

In [None]:
# Add target ID
ids = {'blue-white-glass': 0,
       'cyan-glass': 1,
       'glass-blue': 2,
       'glass-green': 3,
       'glass-red': 4,
       'glass-yellow': 5,
       'planet-black-blue': 6,
       'planet-green': 7,
       'planet-ocean': 8,}

df['cat'] = df['color'].map(ids)

In [None]:
df.sample(5)

## Feature Selection

If we stick to some conventions, it is quite easy to reuse the same workflow, switch between models, and compare the results.

As features we use our R/G/B values, our engineered features, or both. Since the data set is already prepared, we only have to set `training_features`.

In [None]:
# Mix data set
df = df.sample(frac=1)

In [None]:
df.head()

In [None]:
# Define features for training
training_features = ['R', 'G', 'B']

# Set target
target = ['cat']

# Define input and target; use only part of the data set to save computing time
data = df[training_features + target].dropna().head(300000)
X = data[training_features]
y = data[target]  # kept as a one-column DataFrame; later cells rely on y['cat']

To have an easy start in this example we are going to use the whole data set for training. In a later step we will go through the proper workflow and split the data set into parts for validation and performance checks.

In [None]:
X_train = X
y_train = y

X_test = X
y_test = y
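
For reference, the proper split mentioned above could look roughly like the following. This is a minimal sketch, not the workflow used later in the course: it runs on a synthetic stand-in for the marble data, and the names `demo`, `X_tr`, `X_te`, `y_tr`, `y_te` as well as the 25% test share are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the marble data (hypothetical values, not the real data set)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 256, size=(1000, 3)), columns=['R', 'G', 'B'])
demo['cat'] = rng.integers(0, 9, size=1000)

X_demo = demo[['R', 'G', 'B']]
y_demo = demo[['cat']]

# Hold out 25% of the samples for testing;
# stratify keeps the class proportions similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo['cat'], random_state=0)

print(X_tr.shape, X_te.shape)
```

With such a split the model only sees `X_tr` during training, so the performance measured on `X_te` is not flattered by memorized samples.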

## Model Definition and Training

Let's start with the actual ML, that is, the fitting of a model to the training data using a machine learning algorithm. In `sklearn` we can import different algorithms for ML, from simple [decision trees](http://scikit-learn.org/stable/modules/tree.html) to more advanced models like [random forests](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) or [multi-layer perceptrons (MLP)](http://scikit-learn.org/stable/modules/neural_networks_supervised.html). `sklearn` provides these algorithms via a consistent interface, so it is easy to switch between them.

In [None]:
# Import of ML models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

In this step an instance of the algorithm is created with certain settings that influence the learning process (and thereby the model performance). These are called **hyper-parameters**. Hyper-parameters are parameters of the learning algorithm, rather than parameters of the trained model.

In [None]:
# Defining the model with key parameters
model = RandomForestClassifier(n_estimators=10, max_depth=10)
#model = MLPClassifier(hidden_layer_sizes=(20,), activation='relu')

Now the actual training step:

In [None]:
# Fit the model
model.fit(X_train, y_train.values[:,0])
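
The consistent `sklearn` interface and the difference between hyper-parameters and learned parameters can be illustrated with a small sketch. It uses a synthetic data set from `make_classification` instead of the marble data, and the `candidates` dictionary and the chosen settings are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-feature, three-class problem (stand-in for the marble data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

candidates = {
    'decision tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    'random forest': RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0),
    'MLP': MLPClassifier(hidden_layer_sizes=(20,), activation='relu',
                         max_iter=500, random_state=0),
}

# Every classifier is trained and scored with the same two calls
scores = {}
for name, clf in candidates.items():
    clf.fit(X_demo, y_demo)
    scores[name] = clf.score(X_demo, y_demo)
    print(f'{name}: training accuracy {scores[name]:.3f}')

forest = candidates['random forest']
# Hyper-parameter: set by us before training
print('n_estimators:', forest.get_params()['n_estimators'])
# Learned during training: the fitted trees themselves
print('fitted trees:', len(forest.estimators_))
```

Note that the accuracies printed here are measured on the training data, so they only show that the calls work, not how well the models generalize.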

### Tuning the Model

To get the best performing model, we would need to tune the algorithm's **hyper-parameters**, that is, try out different combinations experimentally while measuring model performance. For now, this is an advanced concept that we will revisit later.
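
As a preview, trying out hyper-parameter combinations could be sketched with `GridSearchCV` from `sklearn.model_selection`. This is only one possible approach, shown on synthetic data; the grid values in `param_grid` and the choice of three folds are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the marble data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1)

# Candidate values for two hyper-parameters (2 x 2 = 4 combinations)
param_grid = {'n_estimators': [5, 10], 'max_depth': [3, 10]}

# Each combination is evaluated with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

print('Best combination:', search.best_params_)
print(f'Mean cross-validated accuracy: {search.best_score_:.3f}')
```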

## Basics: Validation and Performance

The actual machine learning training is done. Let's have a look at our results and measure how well our model performs.

### Hypothesis Test
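An easy way to check the results is visualization. Each chart shows, for one marble type, the distribution of the predicted probability that a sample belongs to this type. The colors indicate the true type of the samples. For a good classifier the distributions separate clearly: samples that truly belong to the type pile up near probability 1, all others near 0.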

In [None]:
# Getting the probability for a given data set
y_proba_test = model.predict_proba(X_test)

In [None]:
y_proba_test.shape

In [None]:
#cat = [0,1,2,3,4,5,6,7,8]
cat = [0, 4, 8]

for i in cat:
    y_proba_test_i = y_proba_test[:,i]
    plt.figure(figsize=(8, 4))
    
    for j in range(9):
        plt.hist(y_proba_test_i[y_test['cat'] == j], 
                 bins=np.linspace(0,1,100), 
                 alpha=0.5, 
                 density=False, 
                 label=f'Type {j}')        
    
    plt.title(f'Hypothesis: Marble belongs to type {i}')
    plt.xlabel('Probability')   
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.tight_layout()
    plt.yscale('log', nonpositive='clip')  # 'nonposy' was removed in newer matplotlib versions
    plt.show()

### ROC Curves

The [**Receiver Operating Characteristic (ROC)**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) curve is a slightly more condensed way to validate a model. For each class, a ROC curve shows the **true positive rate** (TPR, $\frac{TP}{P} = \frac{TP}{TP+FN}$) as a function of the **false positive rate** (FPR, $\frac{FP}{N} = \frac{FP}{FP+TN}$). The curve is traced by varying the probability threshold above which a sample is assigned to the class. Given a certain hypothesis and an acceptable false positive rate, we can read off which fraction of the samples that truly fit the hypothesis we can select. Typically the ROC curve rises quickly and flattens towards (1,1). The diagonal corresponds to a *random guess*. Keep in mind that both axes show rates and that the absolute sample sizes behind them can differ significantly. In addition, ROC curves can be used to compare within one condensed plot
* the performance on different data sets (e.g. training and test data set),
* different sets of hyper-parameters of one model,
* different models.

Here, we show the results for the train and test data sets in comparison to detect deviations. Significant deviations could be an indication of overfitting to the training data!
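
To see where the points of a ROC curve come from, the rates can be computed by hand for a tiny example and compared with `roc_curve`. The labels and scores below are made up for illustration, and `rates_at` is a hypothetical helper, not part of `sklearn`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Four samples: true membership and predicted probability for one class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def rates_at(threshold):
    """Return (FPR, TPR) if all samples with score >= threshold are selected."""
    selected = y_score >= threshold
    tp = np.sum(selected & (y_true == 1))
    fp = np.sum(selected & (y_true == 0))
    return fp / np.sum(y_true == 0), tp / np.sum(y_true == 1)

# One point of the curve per threshold
manual = [rates_at(t) for t in (0.8, 0.4, 0.35, 0.1)]
print(manual)

# sklearn sweeps the thresholds for us
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)
print('AUC:', roc_auc_score(y_true, y_score))
```

Lowering the threshold selects more samples, so both rates can only grow, which is why the curve runs from (0,0) to (1,1).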

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
#cat = [0,1,2,3,4,5,6,7,8]
cat = [0, 4, 8]

y_proba_train = model.predict_proba(X_train)
y_proba_test = model.predict_proba(X_test)

for i in cat:
    plt.figure(figsize=(5, 5))
    # One-vs-rest ROC curve for class i on training and test data
    plt.plot(*roc_curve(y_train['cat'] == i, y_proba_train[:, i])[:2], label='train data')
    plt.plot(*roc_curve(y_test['cat'] == i, y_proba_test[:, i])[:2], linestyle='--', label='test data')
    plt.plot([0, 1], [0, 1], color='black', linestyle=':')
    plt.title(f'ROC curve type {i}')
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.legend(loc='best')
    plt.show()

### Confusion Matrix

The [**Confusion Matrix**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) (table of confusion) shows for each class how many samples are classified correctly (principal diagonal) and how many are misclassified. In addition, it shows to which wrong class the samples were assigned. In our case we get a 9x9 matrix. The sum of a row is the number of true members of a marble type, and the sum of a column is the number of samples predicted to belong to a class. A _perfect_ classifier would have entries only on the principal diagonal. Keep in mind that each sample is assigned to the class with the highest probability, regardless of how high that probability is (worst case: 100%/9 = ~11%).

In [None]:
from sklearn.metrics import confusion_matrix

y_pred_test = model.predict(X_test)
truth = y_test 
cm = confusion_matrix(truth,y_pred_test)

pd.DataFrame(data=cm)

The code below outputs a more readable visualization of the confusion matrix:

In [None]:
import itertools

plt.figure(figsize=(8, 8))
plt.imshow(cm, interpolation='nearest', cmap='viridis',vmin=0, vmax=cm[4, 4])
plt.colorbar()
for i, j in itertools.product(range(9), range(9)):
    plt.text(j, i, f'{cm[i, j]:.0f}', horizontalalignment='center',
             color='black' if i == j else 'white')
plt.title('Confusion Matrix')
plt.ylabel('True class')
plt.xlabel('Predicted class');        
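
How rows and columns of a confusion matrix are read can be checked on a toy example with three classes. The labels below are invented for illustration and have nothing to do with the marble data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Six samples, two per class; two of them are misclassified
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Row sums: true members per class
print('true members:     ', cm_demo.sum(axis=1))
# Column sums: samples predicted per class
print('predicted members:', cm_demo.sum(axis=0))
# Principal diagonal: correctly classified samples
print('correct:          ', np.trace(cm_demo))
```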

### Deduced performance indicators

There are several performance indicators that each reflect a single rate. For example, the **True Positive Rate** (TPR, Sensitivity, Hit Rate, Recall) is the ratio of True Positives to all Positives. Its counterpart is the **True Negative Rate** (TNR, Specificity).

* True Positive Rate (TPR, Sensitivity): $\frac{TP}{P}$
* True Negative Rate (TNR, Specificity): $\frac{TN}{N}$

We should always take both rates into account, because neither describes the classifier on its own. The **Accuracy** (ACC) helps here, as it covers both Positives and Negatives:

* Accuracy (ACC): $\frac{TP + TN}{P + N}$

The area under the ROC Curve **AUC** (**A**rea **U**nder **C**urve) can be used as well.
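
In a multi-class setting these rates are usually computed per class in a one-vs-rest fashion. The following sketch derives TPR, TNR and ACC from a small, made-up 3x3 confusion matrix (rows are true classes, columns are predicted classes); the matrix `cm_toy` is an assumption for illustration only.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm_toy = np.array([[1, 1, 0],
                   [0, 2, 0],
                   [1, 0, 1]])

total = cm_toy.sum()
tp = np.diag(cm_toy)              # correctly assigned to the class
fn = cm_toy.sum(axis=1) - tp      # members of the class assigned elsewhere
fp = cm_toy.sum(axis=0) - tp      # non-members assigned to the class
tn = total - tp - fn - fp         # non-members assigned elsewhere

tpr_toy = tp / (tp + fn)          # TP / P
tnr_toy = tn / (tn + fp)          # TN / N
acc_toy = (tp + tn) / total       # (TP + TN) / (P + N)

print('TPR:', tpr_toy)
print('TNR:', tnr_toy)
print('ACC:', acc_toy)
```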

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

In [None]:
auc = []
tpr = []

for i in range(9):
    mask = y_test['cat'] == i  # samples that truly belong to class i
    auc.append(roc_auc_score(mask, y_proba_test[:, i]))
    # Accuracy on the true members of class i equals the TPR of that class
    tpr.append(model.score(X_test[mask], y_test[mask]))

# Displaying
pd.DataFrame({'AUC': np.array(auc), 'TPR': np.array(tpr)})

In [None]:
print(f'Mean Accuracy: {model.score(X_test, y_test):.3f}')

With this simple model we obtain a quite reasonable result. However, training and validation have to be an iterative process in machine learning. Getting the last 10-20 percent can be the challenge in the end. Another interesting fact is that the marble types show different performances.

For example, a `RandomForestClassifier` (`n_estimators=10, max_depth=10`) gives us an accuracy of ~85%, just by training the model on the RGB values without any feature engineering. Some of the marble types perform even better, and we already get around 99% (type 4).

## Feature Importance

Several machine learning models return a score for the importance of each feature within the classifier. This can be used to plan further training steps that improve the model, to reduce computing time, or to feed back into the initial data acquisition. If we detect that one feature is very important for the classifier, it may be a good idea to improve the quality of this feature or to engineer equivalent features. In addition, this step can highlight features that were not expected to be important and can lead to a rethinking of strategies.

In [None]:
# Tree-based models expose feature_importances_; other models (e.g. the MLP) do not
if hasattr(model, 'feature_importances_'):
    plt.figure(figsize=(5, 5))
    plt.barh(range(len(X.columns)), model.feature_importances_)
    plt.yticks(range(len(X.columns)), X.columns)
    plt.show()
else:
    print("This model (e.g. the MLP) does not provide feature importances.")
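
For models without built-in importances, such as the MLP, a model-agnostic alternative is the permutation importance from `sklearn.inspection`. This is a different technique from the tree-based `feature_importances_` used above: it measures how much the score drops when one feature column is shuffled. The sketch below runs on synthetic data, and the feature names `f0`, `f1`, `f2` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the marble data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
mlp.fit(X_demo, y_demo)

# Shuffle each feature 5 times and record the drop in accuracy
result = permutation_importance(mlp, X_demo, y_demo, n_repeats=5, random_state=0)

for name, mean, std in zip(['f0', 'f1', 'f2'],
                           result.importances_mean, result.importances_std):
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```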

### More functionality
There are other tools in [**sklearn.metrics**](http://scikit-learn.org/stable/modules/model_evaluation.html) for general performance and validation analyses. One of them is the [**classification_report**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report), which summarizes precision, recall and F1 score per class. Another frequently applied strategy is [cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html), in which the train-test split is performed several times on a data set. This yields averaged performance indicators and gives some hints about how stable the system is.
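
Both tools can be tried out in a few lines. The following sketch uses synthetic data and five folds, which are assumptions for illustration; `cross_val_predict` is used here only to obtain out-of-fold predictions for the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in for the marble data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)

clf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)

# Accuracy on each of the 5 held-out folds
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print(f'Accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')

# Each sample is predicted by a model that did not see it during training
y_cv_pred = cross_val_predict(clf, X_demo, y_demo, cv=5)
report = classification_report(y_demo, y_cv_pred)
print(report)
```

A large spread between the fold scores hints at an unstable model or at too little data.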



### Attention
All performance indicators are rates, so it can be much more informative to also check the absolute values, especially when the classes do not have the same number of members.
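
A deliberately extreme, made-up example shows why: with 990 samples of one class and 10 of the other, a "classifier" that always predicts the majority class reaches 99% accuracy while finding none of the rare samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Imbalanced toy data: 990 negatives, 10 positives
y_true_imb = np.array([0] * 990 + [1] * 10)
# Trivial prediction: everything is negative
y_pred_imb = np.zeros_like(y_true_imb)

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
tpr_imb = recall_score(y_true_imb, y_pred_imb)
cm_imb = confusion_matrix(y_true_imb, y_pred_imb)

print(f'Accuracy: {acc_imb:.2f}')
print(f'TPR of the rare class: {tpr_imb:.2f}')
print(cm_imb)  # absolute numbers reveal the 10 missed positives
```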




## Import & Export trained model
It is possible to save a trained model and load it again later. This allows the user to distribute models, or to compare models of different states or types in another instance. More details are [available online](http://scikit-learn.org/stable/modules/model_persistence.html).

In [None]:
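import pickle

# Context managers make sure the file handles are closed again
with open('model.pkl', 'wb') as outfile:
    pickle.dump(model, outfile)
with open('model.pkl', 'rb') as infile:
    load_model = pickle.load(infile)
load_model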

In [None]:
import joblib
joblib.dump(model, 'model.pkl') 
load_model = joblib.load('model.pkl')
load_model

## Task:
Let's try out different models and tune some of the standard hyperparameters.

In [None]:
### It's your turn!








---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_