- Explore, prepare, and train a model to classify glass types, using the dataset from the link above.
- Explore the data and perform whatever cleaning you judge pertinent.
- Then choose a suitable estimator and run a hyperparameter search with cross-validation.
- Do not forget to apply any preprocessing you consider important.

In [55]:
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, recall_score, make_scorer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate



In [3]:
vidros = pd.read_csv('glass.csv')
vidros.head()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


## Column dictionary:
RI: refractive index

Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)

Mg: Magnesium

Al: Aluminum

Si: Silicon

K: Potassium

Ca: Calcium

Ba: Barium

Fe: Iron

Type of glass: (class attribute)

-- 1 building_windows_float_processed

-- 2 building_windows_non_float_processed

-- 3 vehicle_windows_float_processed

-- 4 vehicle_windows_non_float_processed (none in this database)

-- 5 containers

-- 6 tableware

-- 7 headlamps


In [4]:
vidros.isna().sum()

RI      0
Na      0
Mg      0
Al      0
Si      0
K       0
Ca      0
Ba      0
Fe      0
Type    0
dtype: int64

In [5]:
vidros.describe()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
count,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0,214.0
mean,1.518365,13.40785,2.684533,1.444907,72.650935,0.497056,8.956963,0.175047,0.057009,2.780374
std,0.003037,0.816604,1.442408,0.49927,0.774546,0.652192,1.423153,0.497219,0.097439,2.103739
min,1.51115,10.73,0.0,0.29,69.81,0.0,5.43,0.0,0.0,1.0
25%,1.516522,12.9075,2.115,1.19,72.28,0.1225,8.24,0.0,0.0,1.0
50%,1.51768,13.3,3.48,1.36,72.79,0.555,8.6,0.0,0.0,2.0
75%,1.519157,13.825,3.6,1.63,73.0875,0.61,9.1725,0.0,0.1,3.0
max,1.53393,17.38,4.49,3.5,75.41,6.21,16.19,3.15,0.51,7.0


In [43]:
# Look at the features most strongly correlated with the target
vidros.corr()['Type'].abs().sort_values(ascending=False)[1:]

Mg    0.744993
Al    0.598829
Ba    0.575161
Na    0.502898
Fe    0.188278
RI    0.164237
Si    0.151565
K     0.010054
Ca    0.000952
Name: Type, dtype: float64

In [45]:
vidros['Type'].value_counts()

2    76
1    70
7    29
3    17
5    13
6     9
Name: Type, dtype: int64

The columns most strongly correlated with glass type are Mg, Al, Ba and Na.

There are also no type 4 glasses (non-float-processed vehicle windows) in the data, as the dataset dictionary had already stated.
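
`Type` is a nominal label (its codes 1 to 7 carry no order), so the Pearson correlation above is only a rough screen. One possible cross-check, sketched here on synthetic data rather than on the glass table, is the one-way ANOVA F-statistic from `sklearn.feature_selection.f_classif`, which treats the classes as unordered groups. The arrays `informative` and `noise` are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Hypothetical stand-in for the glass data: three unordered class codes
y = np.repeat([1, 2, 7], 50)

# One feature whose mean depends on the class, one that is pure noise
informative = rng.normal(loc=y.astype(float), scale=0.5)
noise = rng.normal(size=y.size)
X = np.column_stack([informative, noise])

# f_classif returns one F-statistic and one p-value per feature
f_stat, p_val = f_classif(X, y)
for name, f, p in zip(["informative", "noise"], f_stat, p_val):
    print(f"{name}: F={f:.1f}, p={p:.3g}")
```

A feature with a large F-statistic separates the class means well; on the real data the same call could be applied to `vidros.drop(columns='Type')` and `vidros['Type']`.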


In [47]:
vidros.groupby('Type').mean()

Type,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe
1,1.518718,13.242286,3.552429,1.163857,72.619143,0.447429,8.797286,0.012714,0.057
2,1.518619,13.111711,3.002105,1.408158,72.598026,0.521053,9.073684,0.050263,0.079737
3,1.517964,13.437059,3.543529,1.201176,72.404706,0.406471,8.782941,0.008824,0.057059
5,1.518928,12.827692,0.773846,2.033846,72.366154,1.47,10.123846,0.187692,0.060769
6,1.517456,14.646667,1.305556,1.366667,73.206667,0.0,9.356667,0.0,0.0
7,1.517116,14.442069,0.538276,2.122759,72.965862,0.325172,8.491379,1.04,0.013448


In [51]:
X = vidros[['Mg', 'Al', 'Ba', 'Na']]
y = vidros['Type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
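
The class counts shown earlier are quite uneven (only 9 samples of type 6), so a plain random split can leave a rare class under-represented in the test set. A minimal sketch of a stratified split follows, assuming the same 80/20 proportion. The labels reproduce the printed `value_counts()` output, while the feature matrix is a random placeholder and not the real glass data.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output; features are placeholders
counts = {2: 76, 1: 70, 7: 29, 3: 17, 5: 13, 6: 9}
y_demo = np.concatenate([np.full(n, label) for label, n in counts.items()])
X_demo = np.random.default_rng(42).normal(size=(y_demo.size, 4))

# stratify=y_demo keeps the class proportions roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(sorted(Counter(y_te.tolist()).items()))
```

On the notebook's own variables this would amount to passing `stratify=y` to the existing `train_test_split` call, which would change the splits and therefore the metrics reported further down.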

In [59]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [60]:
knn = KNeighborsClassifier()
knn.fit(X_train_scaled, y_train)

In [62]:
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance']
}

grid_search = GridSearchCV(knn, param_grid, cv=5)
grid_search.fit(X_train_scaled, y_train)

best_params = grid_search.best_params_
best_params

{'n_neighbors': 5, 'weights': 'uniform'}
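
The search above runs on data that was scaled once with statistics from the whole training set, so each validation fold has already influenced the scaler. One way to avoid that, sketched below on synthetic data, is to put the scaler and the classifier in a `Pipeline` and hand the pipeline to `GridSearchCV`, so the scaler is refit inside every fold. The dataset from `make_classification` and the names `pipe`, `search`, `X_demo` and `y_demo` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (4 features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=171, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Scaling happens inside the pipeline, so it is learned per training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With this layout `search.best_estimator_` is a fitted pipeline, so it can be applied directly to unscaled test features.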

In [63]:
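from sklearn.model_selection import cross_val_score

# Cross-validate the tuned estimator; here the best parameters equal the KNN defaults
cv_scores = cross_val_score(grid_search.best_estimator_, X_train_scaled, y_train, cv=5)

pd.DataFrame(cv_scores)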

Unnamed: 0,0
0,0.685714
1,0.558824
2,0.617647
3,0.705882
4,0.588235
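
The five fold scores are easier to compare with other models when reduced to a mean and a spread. A small sketch using only the standard library, with the values copied from the output above:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
cv_scores = [0.685714, 0.558824, 0.617647, 0.705882, 0.588235]

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)  # sample standard deviation across the folds

print(f"accuracy: {cv_mean:.3f} +/- {cv_std:.3f}")
```

The mean is about 0.631. The spread between the lowest and highest fold (roughly 0.56 to 0.71) is a reminder that, with about 34 samples per fold, a single fold score is a noisy estimate.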


In [64]:
from sklearn.metrics import classification_report, confusion_matrix

# Evaluate the estimator that GridSearchCV refit with the best parameters
best_knn = grid_search.best_estimator_

# Predict on training set
y_train_pred = best_knn.predict(X_train_scaled)
train_confusion_matrix = confusion_matrix(y_train, y_train_pred)

# Predict on test set
y_test_pred = best_knn.predict(X_test_scaled)
test_confusion_matrix = confusion_matrix(y_test, y_test_pred)

print("Training Set Metrics:")
print(classification_report(y_train, y_train_pred))
print("Training Set Confusion Matrix:")
print(train_confusion_matrix)

print("Test Set Metrics:")
print(classification_report(y_test, y_test_pred))
print("Test Set Confusion Matrix:")
print(test_confusion_matrix)


Training Set Metrics:
              precision    recall  f1-score   support

           1       0.70      0.89      0.79        56
           2       0.77      0.82      0.79        61
           3       0.00      0.00      0.00        14
           5       0.86      0.60      0.71        10
           6       0.83      0.71      0.77         7
           7       0.86      0.83      0.84        23

    accuracy                           0.76       171
   macro avg       0.67      0.64      0.65       171
weighted avg       0.71      0.76      0.73       171

Training Set Confusion Matrix:
[[50  6  0  0  0  0]
 [ 9 50  0  1  0  1]
 [10  4  0  0  0  0]
 [ 0  2  0  6  0  2]
 [ 1  1  0  0  5  0]
 [ 1  2  0  0  1 19]]
Test Set Metrics:
              precision    recall  f1-score   support

           1       0.80      0.86      0.83        14
           2       0.79      0.73      0.76        15
           3       1.00      0.33      0.50         3
           5       0.67      0.67      0.6

  _warn_prf(average, modifier, msg_start, len(result))
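
In the training confusion matrix above, the row for type 3 is `[10 4 0 0 0 0]`: all 14 type 3 samples are predicted as type 1 or type 2, and no sample is ever predicted as type 3. That is why its precision is reported as 0.00 and why scikit-learn prints the `_warn_prf` warning lines. The sketch below reproduces the situation on toy labels (not the glass data) and uses the `zero_division` parameter of `classification_report` to set the undefined precision to 0 without a warning.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels: class 3 occurs in y_true but is never predicted
y_true = [1, 1, 1, 2, 2, 2, 3, 3]
y_pred = [1, 1, 2, 2, 2, 1, 1, 2]
labels = [1, 2, 3]

# Rows are true classes, columns are predicted classes, in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)

# zero_division=0 reports 0.0 for the undefined precision instead of warning
report = classification_report(
    y_true, y_pred, labels=labels, zero_division=0, output_dict=True
)

print(cm)
print(report["3"])
```

The last column of `cm` is all zeros, which is the same pattern as the type 3 column in the training matrix.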
