## Understanding the dataset
The dataset has 11 columns: the first is the row ID and the last is the target variable.

The goal is to predict the type of glass from its refractive index and the amount of each chemical element (measured as oxide weight percent) in its composition.

1. **ID** number: 1 to 214
2. **RI**: refractive index
3. **NA2O**: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. **MGO**: Magnesium
5. **AL2O3**: Aluminum
6. **SIO2**: Silicon
7. **K2O**: Potassium
8. **CAO**: Calcium
9. **BAO**: Barium
10. **FE2O3**: Iron
11. **TYPE** of glass: (class attribute)
    - 1: building_windows_float_processed
    - 2: building_windows_non_float_processed
    - 3: vehicle_windows_float_processed
    - 4: vehicle_windows_non_float_processed (none in this database)
    - 5: containers
    - 6: tableware
    - 7: headlamps
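
For readable reports, the numeric class codes can be mapped back to the names listed above. This is a minimal sketch; the `TYPE_LABELS` dictionary and the `label_for` helper are illustrative names that do not appear in the original notebook.

```python
# Class codes and names, copied from the attribute list above.
TYPE_LABELS = {
    1: "building_windows_float_processed",
    2: "building_windows_non_float_processed",
    3: "vehicle_windows_float_processed",
    4: "vehicle_windows_non_float_processed",  # no samples in this database
    5: "containers",
    6: "tableware",
    7: "headlamps",
}


def label_for(code: int) -> str:
    """Return the glass type name for a class code (hypothetical helper)."""
    return TYPE_LABELS.get(code, f"unknown ({code})")


print(label_for(2))
```

Once the data is loaded into a DataFrame, the same dictionary can be passed to `df['TYPE'].map(TYPE_LABELS)` to get a column of names.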

## Importing and preparing the data

In [1]:
import pandas as pd
colunas = ['ID', 'RI', 'NA2O', 'MGO', 'AL2O3', 'SIO2', 'K2O', 'CAO', 'BAO', 'FE2O3', 'TYPE']
df = pd.read_csv('glass.data.csv', header=None, names=colunas, index_col=0)

In [2]:
df.head()

ID,RI,NA2O,MGO,AL2O3,SIO2,K2O,CAO,BAO,FE2O3,TYPE
1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
4,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
5,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [3]:
df.isna().sum()

RI       0
NA2O     0
MGO      0
AL2O3    0
SIO2     0
K2O      0
CAO      0
BAO      0
FE2O3    0
TYPE     0
dtype: int64

In [4]:
df['TYPE'].value_counts()

2    76
1    70
7    29
3    17
5    13
6     9
Name: TYPE, dtype: int64

In [5]:
df.dtypes

RI       float64
NA2O     float64
MGO      float64
AL2O3    float64
SIO2     float64
K2O      float64
CAO      float64
BAO      float64
FE2O3    float64
TYPE       int64
dtype: object

## Splitting and normalizing the data

In [6]:
x = df[colunas[1:-1]]
x.head()

ID,RI,NA2O,MGO,AL2O3,SIO2,K2O,CAO,BAO,FE2O3
1,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0
2,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0
3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0
4,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0
5,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0


In [7]:
y = df[colunas[-1:]]
y.head()

ID,TYPE
1,1
2,1
3,1
4,1
5,1


In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x_norm = StandardScaler().fit(x).transform(x.astype(float))
x_treino, x_teste, y_treino, y_teste = train_test_split( x_norm, y, test_size=0.3, random_state=2 )
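
The cell above fits `StandardScaler` on the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A possible alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the glass data (an assumption made only so that it runs on its own), and the names `x_demo`, `y_demo`, `pipe` and `acc_demo` are illustrative. The `stratify` argument is an optional addition that keeps class proportions similar in both subsets. Results on the real data would differ from the accuracies reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the glass data: 214 rows, 9 numeric features.
x_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_classes=3, random_state=0
)

# Split first, so the test rows never influence the scaler.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=2, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows only and reuses
# those statistics when it transforms the test rows.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(x_tr, y_tr)

acc_demo = pipe.score(x_te, y_te)
print(f"Demo accuracy: {acc_demo:.2%}")
```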

In [9]:
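import warnings
# Silence all warnings, such as the DataConversionWarning that KNeighborsClassifier
# raises when y is passed as a one-column DataFrame instead of a 1-D array.
warnings.filterwarnings("ignore")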

## Models

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

### Decision Tree model

In [11]:
modelo = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
modelo.fit(x_treino, y_treino)
y_predict = modelo.predict(x_teste)
print('Accuracy -> {0:0.2f}%'.format(accuracy_score(y_teste, y_predict)*100))

Accuracy -> 75.38%
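
With `criterion="entropy"`, the tree picks the splits that most reduce the Shannon entropy of the class labels. As a rough illustration of the quantity involved, the sketch below computes the entropy of the full label distribution from the class counts shown earlier. The `entropy` helper is an illustrative name, and since the tree is fit on the training subset only, the entropy at its root may differ slightly.

```python
from math import log2


def entropy(counts):
    """Shannon entropy, in bits, of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Class counts reported by df['TYPE'].value_counts() above.
class_counts = [76, 70, 29, 17, 13, 9]
root_entropy = entropy(class_counts)
print(f"Entropy of the full label set: {root_entropy:.3f} bits")
```

A node containing a single class has entropy 0, and six equally frequent classes would give the maximum of log2(6) bits, so the value here falls between the two.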


### K-Nearest Neighbors model

In [12]:
modelo = KNeighborsClassifier(n_neighbors = 4)
modelo.fit(x_treino, y_treino)
y_predict = modelo.predict(x_teste)

In [13]:
print('Accuracy -> {0:0.2f}%'.format(accuracy_score(y_teste, y_predict)*100))

Accuracy -> 64.62%


In [14]:
for n in range(1,10):
    modelo = KNeighborsClassifier(n_neighbors = n)
    modelo.fit(x_treino, y_treino)
    y_predict = modelo.predict(x_teste)
    acuracia = accuracy_score(y_teste, y_predict)
    print('With k = {0}, accuracy was {1:0.2f}%'.format(n, acuracia * 100))

With k = 1, accuracy was 69.23%
With k = 2, accuracy was 64.62%
With k = 3, accuracy was 72.31%
With k = 4, accuracy was 64.62%
With k = 5, accuracy was 66.15%
With k = 6, accuracy was 66.15%
With k = 7, accuracy was 66.15%
With k = 8, accuracy was 66.15%
With k = 9, accuracy was 70.77%
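
The loop above picks k by looking at test-set accuracy, which tends to make the best score look better than it would on unseen data. One common alternative is to choose k by cross-validation on the training rows and to score the test set only once at the end. The sketch below uses synthetic data from `make_classification` as a stand-in for the glass data (an assumption made only so that it runs on its own). The names `x_demo`, `y_demo`, `search` and `best_k` are illustrative, and `cv=5` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the glass data: 214 rows, 9 numeric features.
x_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_classes=3, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=2
)

# Scaling lives inside the pipeline, so each fold fits its own scaler.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Try k = 1..9, as in the loop above, using 5-fold cross-validation
# on the training rows only.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 10))},
    cv=5,
    scoring="accuracy",
)
search.fit(x_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(x_te, y_te)  # single evaluation on held-out rows
print(f"Best k: {best_k}, test accuracy: {test_acc:.2%}")
```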


## Conclusion
After testing, KNN at its best setting (k = 3) reached 72.31% accuracy on the test set, while the Decision Tree reached 75.38%. On this train/test split, the Decision Tree is therefore the better model for this problem.
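
To put the gap between the two models in concrete terms, the reported accuracies can be converted back into counts of correctly classified test samples. The sketch below relies on scikit-learn rounding a fractional `test_size` up to the next whole sample; the variable names are illustrative.

```python
from math import ceil

n_samples = 214                 # rows in the dataset (IDs 1 to 214)
n_test = ceil(0.3 * n_samples)  # test_size=0.3, rounded up to whole samples

# Convert the reported accuracies into numbers of correct predictions.
tree_correct = round(0.7538 * n_test)
knn_correct = round(0.7231 * n_test)

print(n_test, tree_correct, knn_correct, tree_correct - knn_correct)
# prints: 65 49 47 2
```

On this split, the Decision Tree classified two more test samples correctly than the best KNN.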