### Glass type classification  
This is a simple classification model built on a Kaggle dataset: https://www.kaggle.com/uciml/glass  
The example uses a support vector machine to classify 6 glass types based on their mineral content.


In [1]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases

In [2]:
# set seed for reproducibility
np.random.seed(1234)

In [3]:
# read in original dataset
glassv1 = pd.read_csv('H:\\My Documents\\Work\\datasets\\glass.csv')

In [4]:
# display the distribution of glass types
glassv1['Type'].value_counts()

2    76
1    70
7    29
3    17
5    13
6     9
Name: Type, dtype: int64

#### Note

Based on the counts above, the original dataset is very unbalanced and  
doesn't provide enough examples of the smaller classes to train a good model.

In an attempt to correct this problem, I used the mean and standard deviation  
from the original data to generate additional examples and appended them to it.  
In the new dataset, every type has 70 examples, with the exception of __Type 2__, which keeps its original 76.

As shown below, the new dataset now has 426 examples versus the original 214.
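
The notebook does not show how the extra rows were generated. One plausible way to do it, sketched below, is to draw each feature independently from a normal distribution fitted to that class. The function name `augment_to_target`, the independence assumption, and the toy data are all illustrative and not taken from the original workflow.

```python
import numpy as np
import pandas as pd


def augment_to_target(df, label_col, target, rng):
    """Top up each class to `target` rows using per-feature normal draws.

    Hypothetical helper: features are sampled independently, which ignores
    correlations between minerals and can produce negative values.
    """
    feature_cols = [c for c in df.columns if c != label_col]
    parts = [df]
    for label, group in df.groupby(label_col):
        n_missing = target - len(group)
        if n_missing <= 0:
            continue  # class already has at least `target` rows
        mean = group[feature_cols].mean().to_numpy()
        # a single-row class has an undefined std; fall back to 0
        std = group[feature_cols].std(ddof=1).fillna(0.0).to_numpy()
        draws = rng.normal(loc=mean, scale=std, size=(n_missing, len(feature_cols)))
        extra = pd.DataFrame(draws, columns=feature_cols)
        extra[label_col] = label
        parts.append(extra)
    return pd.concat(parts, ignore_index=True)


# Toy stand-in for the glass data (values are made up)
rng = np.random.default_rng(1234)
toy = pd.DataFrame({
    "Na": [13.6, 13.9, 13.5, 14.2, 14.8],
    "Ba": [0.0, 0.0, 0.0, 1.1, 0.9],
    "Type": [1, 1, 1, 7, 7],
})
augmented = augment_to_target(toy, "Type", target=4, rng=rng)
print(augmented["Type"].value_counts().sort_index())
```

With `target=70`, a class that already has more than 70 rows is left untouched, which would be consistent with Type 2 keeping its 76 examples.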

In [5]:
# read the new dataset
glass = pd.read_csv('H:\\My Documents\\Work\\datasets\\newglass.csv')

In [6]:
# display the shape of the dataset along with first 5 rows
print(glass.shape)
glass.head()

(426, 10)


Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [7]:
# display the distribution of glass types from the new dataset
glass['Type'].value_counts()

2    76
7    70
6    70
5    70
3    70
1    70
Name: Type, dtype: int64

In [8]:
# split dataset into training and test data
train_data, test_data = train_test_split(glass, test_size=0.2)

In [9]:
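# create numpy array of data
# note: per the head() output above, columns 0-7 are 'Unnamed: 0' through 'Ca',
# so this slice includes the saved index column and leaves out Ba and Fe
train_datawk = train_data.iloc[:, 0:8].to_numpy()
test_datawk = test_data.iloc[:, 0:8].to_numpy()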

In [10]:
# training and test datasets scaled
train_scaled = preprocessing.scale(train_datawk)
test_scaled = preprocessing.scale(test_datawk)

In [11]:
# create numpy array of labels
train_labelwk = train_data.loc[:, "Type"].to_numpy()
test_labelwk = test_data.loc[:, "Type"].to_numpy()

#### Note

The dataset was divided into training and test data with an 80/20 split, and matrices  
were created for the model input as required. The training and test data were  
each scaled separately to zero mean and unit standard deviation.
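
Two details of the preparation steps are worth a second look. Calling `preprocessing.scale` on the training and test arrays separately gives each its own mean and standard deviation, and the positional slice `0:8` picks up the `Unnamed: 0` column while dropping `Ba` and `Fe`. The sketch below shows one possible alternative: select features by name, stratify the split, and fit a `StandardScaler` on the training data only. Whether all nine measured columns were meant to be features is an assumption, and the random `demo` frame is a stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed feature set: every measured column, excluding the saved index
FEATURES = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]

# Random stand-in with the same column layout as the glass data
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(60, len(FEATURES))), columns=FEATURES)
demo["Type"] = np.repeat([1, 2, 3], 20)

X = demo[FEATURES].to_numpy()
y = demo["Type"].to_numpy()

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1234
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))
```

Reusing the fitted `scaler` means the test data is transformed with the training statistics, so both sets share one feature scale.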

In [12]:
# parameters are defined to use in grid search
param_tune = [{'kernel':['rbf'], 'gamma': [0.1, 0.25, 0.50], 'C': [0.5, 1, 1.5, 2.0, 2.5]},
             {'kernel': ['poly'], 'degree': [2, 3], 'C': [0.5, 1, 1.5, 2.0, 2.5]}]

In [13]:
# train and fit the model by a grid search on the parameter values
clf_model = GridSearchCV(svm.SVC(C=0.5), param_tune, cv=10, scoring='accuracy')
clf_model.fit(train_scaled, train_labelwk)
print(clf_model.best_params_)

{'C': 2.0, 'kernel': 'rbf', 'gamma': 0.25}


In [14]:
# take the model and score the test data
print(round(clf_model.score(test_scaled, test_labelwk), 3))

0.802


#### Note
For the SVM, I decided to try an RBF and a polynomial kernel along with variations in  
the gamma, degree and C parameters. The GridSearchCV function was used to test the  
kernels and their parameters and find the best combination based on 10-fold cross-validated accuracy.

An RBF kernel was selected with gamma=0.25 and C=2.0. This combination should give  
support vectors that provide good coverage and granularity over the data. Results against  
the test data showed 80% accuracy. This model is saved for future use to predict the glass  
type from a newly generated dataset.
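
As a variation on the grid search described above, the scaler can be placed inside a scikit-learn `Pipeline` so that it is refit within each cross-validation fold and travels with the saved model. This is a minimal sketch on synthetic blobs, not the notebook's actual run. The step names `scale` and `svc` are arbitrary, and `cv=5` is used only to keep the toy example quick.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: three blobs with four features each
rng = np.random.default_rng(1234)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 4)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([1, 2, 3], 30)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Same grid as the notebook, with keys prefixed by the pipeline step name
param_grid = [
    {"svc__kernel": ["rbf"], "svc__gamma": [0.1, 0.25, 0.50],
     "svc__C": [0.5, 1, 1.5, 2.0, 2.5]},
    {"svc__kernel": ["poly"], "svc__degree": [2, 3],
     "svc__C": [0.5, 1, 1.5, 2.0, 2.5]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the fitted `search` object contains the scaler, passing it to `joblib.dump` would persist the scaling together with the classifier, and new data could be given to `predict` unscaled.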

In [15]:
# save the model to a file for future use
job_model = 'H:\\My Documents\\python_files\\models\\glass_jobmodelSVMscaled.sav'
joblib.dump(clf_model, job_model)

['H:\\My Documents\\python_files\\models\\glass_jobmodelSVMscaled.sav']

In [16]:
# read in dataset to predict --- example dataset created from mean/stdev of original data
glass_pred = pd.read_csv('H:\\My Documents\\Work\\datasets\\glass\\fakeglass.csv')

In [17]:
# display the shape of the prediction dataset along with first 5 rows
print(glass_pred.shape)
glass_pred.head()

(600, 10)


Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.521671,13.431947,3.622851,0.85476,73.713339,0.502031,9.103466,-0.132064,0.047244,1
1,1.522557,13.239578,3.348771,1.3241,72.54275,0.269667,8.92793,-0.006342,0.017555,1
2,1.518733,13.499,3.873861,1.142749,72.78919,0.125805,8.31239,-0.072759,0.044885,1
3,1.518428,13.857746,3.652998,0.822807,72.644126,0.748717,8.577476,0.058371,0.02982,1
4,1.516934,12.494064,3.480528,1.065279,72.613119,0.240472,8.820267,-0.047982,0.11687,1


#### Note  
The next step was to create some fake data so the model could be used to predict glass type.  
To create the dataset, I again used the mean and standard deviation of each mineral  
within each glass type to generate 100 examples of each glass type.

I didn't spend any time correcting this prediction dataset, and some problems  
are evident. In the Ba column of the table above, there are negative values, which couldn't  
occur in real glass samples.

In the steps below, the prediction data is prepared in the same way as the training/test data,  
including scaling. Predictions are then made and evaluated with a confusion matrix,  
precision, recall and f1-score.
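
One simple way to handle the negative values noted above is to clip the affected columns at zero before scaling. The sketch below uses the `Ba` and `Fe` values from the five rows of the prediction table shown earlier. Clipping is only one option and was not applied in this notebook, so the results that follow still reflect the uncorrected data.

```python
import pandas as pd

# Ba and Fe values copied from the first five rows of the prediction table
sample = pd.DataFrame({
    "Ba": [-0.132064, -0.006342, -0.072759, 0.058371, -0.047982],
    "Fe": [0.047244, 0.017555, 0.044885, 0.02982, 0.11687],
})

mineral_cols = ["Ba", "Fe"]

# Count negative entries per column before changing anything
n_negative = (sample[mineral_cols] < 0).sum()

# Replace negative amounts with zero; non-negative values are unchanged
cleaned = sample.copy()
cleaned[mineral_cols] = cleaned[mineral_cols].clip(lower=0.0)

print(n_negative)
print(cleaned)
```

Clipping piles probability mass at exactly zero, so for minerals that are often absent it may be a reasonable fit, while for others a truncated distribution could be a better choice.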

In [18]:
# create numpy arrays of data and labels (same column slice as the training data)
glass_pred_new = glass_pred.iloc[:, 0:8].to_numpy()
glass_pred_label = glass_pred.loc[:, "Type"].to_numpy()

In [19]:
# scale data from prediction dataset
glass_pred_newsc = preprocessing.scale(glass_pred_new)

In [20]:
# load the saved model
scores_job = joblib.load('H:\\My Documents\\python_files\\models\\glass_jobmodelSVMscaled.sav')

In [21]:
# generate glass type predictions from the prediction dataset
pred_results = scores_job.predict(glass_pred_newsc)

In [22]:
# generate confusion matrix from prediction results
# left side is actual --- top is predicted
print(confusion_matrix(glass_pred_label, pred_results))

[[17 12 70  0  1  0]
 [10 21 20 28 20  1]
 [ 6  6 87  1  0  0]
 [ 0  1  0 91  4  4]
 [ 0  1  1  9 89  0]
 [ 0  0  0  8 16 76]]


In [23]:
# generate standard metrics from prediction results
print(classification_report(glass_pred_label, pred_results))

             precision    recall  f1-score   support

          1       0.52      0.17      0.26       100
          2       0.51      0.21      0.30       100
          3       0.49      0.87      0.63       100
          5       0.66      0.91      0.77       100
          6       0.68      0.89      0.77       100
          7       0.94      0.76      0.84       100

avg / total       0.63      0.64      0.59       600



In [24]:
# generate accuracy measure from prediction results
print(round(accuracy_score(glass_pred_label, pred_results), 3))

0.635


#### Summary
The overall accuracy of the model on the prediction dataset was 64%, which is  
not great, but some obvious changes could improve the model.

The model performed well on glass types 3 through 7 but struggled with types 1 and 2.  
These two types had many __false negatives__, which lowered their recall scores  
to 0.17 and 0.21. Type 3 glass also had a high number of __false positives__, which  
lowered its precision to 0.49.
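
The false negative and false positive counts behind these observations can be read directly off the confusion matrix printed earlier. The short sketch below copies that matrix and derives the per-class counts. Rows are actual types and columns are predicted types, in the label order 1, 2, 3, 5, 6, 7.

```python
import numpy as np

labels = [1, 2, 3, 5, 6, 7]

# Confusion matrix copied from the notebook output (rows = actual, cols = predicted)
cm = np.array([
    [17, 12, 70,  0,  1,  0],
    [10, 21, 20, 28, 20,  1],
    [ 6,  6, 87,  1,  0,  0],
    [ 0,  1,  0, 91,  4,  4],
    [ 0,  1,  1,  9, 89,  0],
    [ 0,  0,  0,  8, 16, 76],
])

tp = np.diag(cm)            # correctly classified samples per type
fn = cm.sum(axis=1) - tp    # actual type missed (row total minus diagonal)
fp = cm.sum(axis=0) - tp    # wrongly assigned to the type (column total minus diagonal)

recall = tp / (tp + fn)
precision = tp / (tp + fp)

for label, n_fn, n_fp, r, p in zip(labels, fn, fp, recall, precision):
    print(f"type {label}: FN={n_fn:3d} FP={n_fp:3d} recall={r:.2f} precision={p:.2f}")

print("accuracy:", round(tp.sum() / cm.sum(), 3))
```

The first row shows where the Type 1 errors go: 70 of the 100 Type 1 samples were predicted as Type 3, which accounts for most of Type 3's 91 false positives.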

Creating extra training data was a good and necessary approach, but that extra data  
needs corrections, specifically fixing any negative values and adjusting any extreme  
values. The mineral values within each glass type also need to be considered and  
adjusted as needed.

The same can be said for the prediction dataset. It was entirely synthetic, and creating  
realistic data from the mean/standard deviation combination is a challenge. Even with  
those issues, the model worked well for 4 of the 6 glass groups.