# Model Performance Measures, ML Pipeline and Hyperparameter Tuning

## Can you correctly identify glass type?

## Context:
    
This is the Glass Identification Data Set from UCI. It contains 10 attributes, including an id. The response is the glass type (7 discrete values).

# Content

Attribute Information:

Id number: 1 to 214

RI: refractive index

Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)

Mg: Magnesium

Al: Aluminum

Si: Silicon

K: Potassium

Ca: Calcium

Ba: Barium

Fe: Iron

Type of glass: (class attribute) 
    -- 1 building_windows_float_processed 
    -- 2 building_windows_non_float_processed 
    -- 3 vehicle_windows_float_processed 
    -- 4 vehicle_windows_non_float_processed (none in this database) 
    -- 5 containers 
    -- 6 tableware 
    -- 7 headlamps

## Source:
https://archive.ics.uci.edu/ml/datasets/Glass+Identification

# 1.  Import necessary libraries and load the data

In [1]:
import warnings
import pandas as pd 
df = pd.read_csv('glass.csv')
df.head()

FileNotFoundError: [Errno 2] File glass.csv does not exist: 'glass.csv'

# 2. Split the data into dependent and independent variables. Also see what the data looks like

Hint: you can make use of any method (iloc or drop)

In [None]:
X = df.iloc[:, 1:10].values 
y = df.iloc[:, 10].values 
print(X.shape)
print(y.shape)
print(df.info())
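The hint also mentions `drop`. A minimal sketch of that alternative follows, assuming the columns are named `Id` and `Type` as in the attribute list (the actual header names in `glass.csv` may differ). The three rows are made up purely for illustration and are not taken from the data set.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed from the attribute list.
columns = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]
demo_df = pd.DataFrame(
    [
        [1, 1.52, 13.6, 4.5, 1.1, 71.8, 0.06, 8.8, 0.0, 0.0, "containers"],
        [2, 1.51, 13.9, 3.6, 1.4, 72.7, 0.48, 7.8, 0.0, 0.0, "headlamps"],
        [3, 1.52, 13.5, 3.5, 1.5, 73.0, 0.39, 7.8, 0.0, 0.0, "tableware"],
    ],
    columns=columns,
)

# Drop the identifier and the target by name instead of by position.
X_demo = demo_df.drop(columns=["Id", "Type"]).values
y_demo = demo_df["Type"].values

print(X_demo.shape)
print(y_demo.shape)
```

Selecting by name keeps working if the column order changes, whereas `iloc` silently picks up whatever sits at the given positions.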

# 3. Convert Target variable into numerical

In [None]:
from sklearn.preprocessing import LabelEncoder 

le = LabelEncoder() 
y = le.fit_transform(y)

le.transform(['building_windows_float_processed','building_windows_non_float_processed','containers','headlamps',
              'tableware','vehicle_windows_float_processed'])
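`LabelEncoder` assigns integer codes in sorted label order, not in the order of the original type numbers. The sketch below fits a separate encoder (the hypothetical `demo_le`) on the class names from the attribute list to show the mapping and how to reverse it.

```python
from sklearn.preprocessing import LabelEncoder

# Class names from the attribute list, deliberately not in sorted order.
glass_types = [
    "building_windows_float_processed",
    "building_windows_non_float_processed",
    "vehicle_windows_float_processed",
    "containers",
    "tableware",
    "headlamps",
]

demo_le = LabelEncoder()
demo_le.fit(glass_types)

# classes_ holds the labels in sorted order; the index is the encoded value.
for code, name in enumerate(demo_le.classes_):
    print(code, name)

# inverse_transform maps integer codes back to the original labels.
decoded = demo_le.inverse_transform([0, 5])
print(decoded)
```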

# 4. Split the dataset into train, validation and test sets
It is always good practice to split the dataset into 3 sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
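Because the second split takes 25% of the remaining 70%, the final proportions are roughly 52.5% train, 17.5% validation and 30% test. A small sketch to check this, using placeholder arrays with 214 rows (the row count from the attribute list) rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the glass data (214).
X_demo = np.arange(214 * 9).reshape(214, 9)
y_demo = np.arange(214)

# Same two-stage split as above: hold out the test set, then carve out validation.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.25, random_state=1)

n = len(X_demo)
for name, part in [("train", X_tr), ("validation", X_va), ("test", X_te)]:
    print(f"{name}: {len(part)} rows ({len(part) / n:.1%})")
```

With classes as unevenly sized as the glass types, passing `stratify=y` to each call may also be worth considering so that every split keeps similar class proportions.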

# 5. Build the pipeline
Steps:

Instantiate the pipeline: first apply the standard scaler, then run PCA on the scaled data, and finally feed the result to logistic regression (or any other algorithm)

Hint:

Import standard scaler to standardize the data

You can take an algorithm of choice and build a pipeline

In [None]:
#PCA - to reduce dimensions
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA 
from sklearn.linear_model import LogisticRegression 
from sklearn.pipeline import Pipeline 

pipe_lr = Pipeline([('scl', StandardScaler()), ('pca', PCA(n_components=3)), ('clf', LogisticRegression(random_state=1))]) 
pipe_lr.fit(X_train, y_train) 
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
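Since a validation set was created in step 4, one option is to score candidate pipelines on it and keep the test set for the final check. The sketch below illustrates this on synthetic data from `make_classification`, which is only a stand-in for the glass data. It also reads the fitted PCA step through `named_steps` to see how much variance the three components keep.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the glass data: 214 rows, 9 features, 3 classes.
X_syn, y_syn = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

demo_pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("clf", LogisticRegression(random_state=1)),
])
demo_pipe.fit(X_fit, y_fit)

# Score on the held-out validation rows rather than the test set.
val_accuracy = demo_pipe.score(X_hold, y_hold)
print("Validation accuracy: %.3f" % val_accuracy)

# Fraction of variance kept by each of the three principal components.
explained = demo_pipe.named_steps["pca"].explained_variance_ratio_
print(explained, explained.sum())
```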

# 6. Follow the above steps, check whether you can tweak the logistic regression parameters above, and make use of grid search (you can use any algorithm)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC 

# Pipeline: scale, reduce dimensions with PCA, then classify with an SVM
pipe_svc = Pipeline([('scl', StandardScaler()), ('pca', PCA()), ('svc', SVC())])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'pca__n_components': [4, 5],
    'svc__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'svc__gamma': [0.001, 0.01, 0.1, 1, 10],
    'svc__kernel': ['rbf', 'poly'],
}

# 5-fold cross-validated search over every combination in the grid
grid = GridSearchCV(pipe_svc, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
print("Test set accuracy: {:.2f}".format(grid.score(X_test, y_test)))

In [None]:
grid.predict(X_test)
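Accuracy alone can hide poor performance on the rarer classes. Predictions can be compared with the true labels using a confusion matrix and a per-class report. A minimal sketch on synthetic data follows; the fixed `C`, `gamma` and `kernel` values are arbitrary choices for illustration, not the result of the search above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the glass data: 214 rows, 9 features, 3 classes.
X_syn, y_syn = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_syn, y_syn, test_size=0.30, stratify=y_syn, random_state=0
)

# Hyperparameter values here are arbitrary, chosen only for the example.
demo_svc = Pipeline([
    ("scl", StandardScaler()),
    ("svc", SVC(C=1, gamma=0.1, kernel="rbf")),
])
demo_svc.fit(X_fit, y_fit)
y_pred = demo_svc.predict(X_eval)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_eval, y_pred)
print(cm)

# Precision, recall and F1 per class; zero_division=0 avoids warnings
# when a class is never predicted.
print(classification_report(y_eval, y_pred, zero_division=0))
```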

# 7. Optimize the model parameters(can make use of any algorithm)
Make use of grid search for hyperparameter tuning.

Steps:

Split the dataset into train and test sets

Pick any algorithm and build a parameter grid from its list of hyperparameters

Once the hyperparameter grid is defined, import GridSearchCV and fit it on X_train and y_train

Find the best params and the mean test score


In [None]:
#split the dataset into train and test set
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y,random_state = 7)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# KNN classifier with default settings; n_neighbors is tuned by the grid search
knn_clf = KNeighborsClassifier()

In [None]:
knn_clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
param_grid = {'n_neighbors': list(range(1,9)),
             'algorithm': ('auto', 'ball_tree', 'kd_tree' , 'brute') }

In [None]:
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(knn_clf,param_grid,cv=10)

In [None]:
gs.fit(X_train, y_train)

In [None]:
gs.best_params_

In [None]:
gs.cv_results_['mean_test_score']
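The raw `mean_test_score` array is hard to read without the parameters that produced each score. One way to line them up is to load `cv_results_` into a DataFrame and sort by rank, then score the refitted best estimator on the held-out data. The sketch below uses synthetic data and a smaller grid than the one above, both chosen only to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the glass data: 214 rows, 9 features, 3 classes.
X_syn, y_syn = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=7
)

demo_gs = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": list(range(1, 9))}, cv=5
)
demo_gs.fit(X_fit, y_fit)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(demo_gs.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())

# best_estimator_ has been refit on all of X_fit with the best parameters.
test_accuracy = accuracy_score(y_eval, demo_gs.best_estimator_.predict(X_eval))
print("Held-out accuracy: %.3f" % test_accuracy)
```

Note that "test" in `mean_test_score` refers to the cross-validation folds inside the training data, not to the held-out test set.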