# Classification: semi-supervised learning

A dairy company needs to ensure the quality of its milk before using it in its products.

In this project, we analyze the characteristics of milk samples to determine their quality and check whether they can be used to produce food products.

Machine learning will be used to **classify** each milk sample into one of three quality levels: **low (`baixa`), medium (`média`), or high (`alta`)**.

## Loading the data

In [2]:
import pandas as pd

path = "dados/qualidade_leite.csv"
df_dados = pd.read_csv(path)
df_dados

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor,Qualidade
0,6.6,35,1,0,1,0,254,alta
1,6.6,36,0,1,0,1,253,alta
2,8.5,70,1,1,1,1,246,
3,9.5,34,1,1,0,1,255,
4,6.6,37,0,0,0,0,255,média
...,...,...,...,...,...,...,...,...
1054,6.7,45,1,1,0,0,247,
1055,6.7,38,1,0,1,0,255,
1056,3.0,40,1,1,1,1,255,
1057,6.8,43,1,0,1,0,250,


| Column      | Data type | Description                                                                                                                                                             |
| :---------- | :-------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| pH          | float64   | Acidity or alkalinity level of the milk.                                                                                                                                |
| Temperatura | int64     | Temperature of the milk in degrees Celsius.                                                                                                                             |
| Sabor       | int64     | Binary indicator of taste quality (1 for high, 0 for low).                                                                                                              |
| Odor        | int64     | Binary indicator of odor quality (1 for high, 0 for low).                                                                                                               |
| Gordura     | int64     | Binary indicator of fat quality (1 for high, 0 for low).                                                                                                                |
| Turbidez    | int64     | Binary indicator of turbidity quality (1 for high, 0 for low).                                                                                                          |
| Cor         | int64     | Numeric value representing the color tone of the milk.                                                                                                                  |
| Qualidade   | object    | Milk quality class ("alta", "média", "baixa", i.e. high, medium, low). Null values (NaN) mark samples that have not been classified. This is the target variable of our model. |
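Because `Qualidade` is null for unclassified samples, the dataset naturally splits into a labeled part and an unlabeled part. The sketch below illustrates that split on a toy frame built from three columns of the first five rows shown above. The names `df_exemplo`, `tem_rotulo`, `df_rotulados`, and `df_nao_rotulados` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_dados: three columns from the first five rows of the output above.
df_exemplo = pd.DataFrame(
    {
        "pH": [6.6, 6.6, 8.5, 9.5, 6.6],
        "Temperatura": [35, 36, 70, 34, 37],
        "Qualidade": ["alta", "alta", np.nan, np.nan, "média"],
    }
)

# Boolean mask: True where a quality label is present.
tem_rotulo = df_exemplo["Qualidade"].notna()

df_rotulados = df_exemplo[tem_rotulo]        # rows usable for supervised training
df_nao_rotulados = df_exemplo[~tem_rotulo]   # rows with no label yet

print(len(df_rotulados), "labeled rows;", len(df_nao_rotulados), "unlabeled rows")
```

Keeping the unlabeled rows in their own frame, rather than discarding them, leaves them available for a semi-supervised step later on.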

## Exploring the data

In [3]:
df_dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pH           1059 non-null   float64
 1   Temperatura  1059 non-null   int64  
 2   Sabor        1059 non-null   int64  
 3   Odor         1059 non-null   int64  
 4   Gordura      1059 non-null   int64  
 5   Turbidez     1059 non-null   int64  
 6   Cor          1059 non-null   int64  
 7   Qualidade    424 non-null    object 
dtypes: float64(1), int64(6), object(1)
memory usage: 66.3+ KB


In [4]:
df_dados["Qualidade"].value_counts(dropna=False)

Qualidade
NaN      635
baixa    184
média    149
alta      91
Name: count, dtype: int64
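The counts above can be turned into proportions to see how much of the dataset is labeled and how balanced the classes are. This is a small arithmetic sketch that reuses the numbers printed by `value_counts`. The variable names are illustrative only.

```python
# Counts copied from the value_counts output above.
contagens = {"baixa": 184, "média": 149, "alta": 91}
nao_rotuladas = 635

total_rotuladas = sum(contagens.values())
total = total_rotuladas + nao_rotuladas

# Share of samples that carry a label.
fracao_rotulada = total_rotuladas / total

# Share of each class among the labeled samples only.
proporcao_classes = {classe: n / total_rotuladas for classe, n in contagens.items()}

print(f"labeled: {total_rotuladas} of {total} ({fracao_rotulada:.1%})")
for classe, proporcao in proporcao_classes.items():
    print(f"{classe}: {proporcao:.1%}")
```

Roughly 40% of the samples are labeled, and `alta` is the least frequent class among them, which is worth keeping in mind when evaluating a classifier.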

In [5]:
df_dados.describe()

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor
count,1059.0,1059.0,1059.0,1059.0,1059.0,1059.0,1059.0
mean,6.630123,44.226629,0.546742,0.432483,0.671388,0.491029,251.840415
std,1.399679,10.098364,0.498046,0.495655,0.46993,0.500156,4.307424
min,3.0,34.0,0.0,0.0,0.0,0.0,240.0
25%,6.5,38.0,0.0,0.0,0.0,0.0,250.0
50%,6.7,41.0,1.0,0.0,1.0,0.0,255.0
75%,6.8,45.0,1.0,1.0,1.0,1.0,255.0
max,9.5,90.0,1.0,1.0,1.0,1.0,255.0


## Supervised approach

In [6]:
# Keep only the labeled rows ("Qualidade" is the only column with nulls).
df_dados_rotulados = df_dados.dropna(subset=["Qualidade"])
df_dados_rotulados

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor,Qualidade
0,6.6,35,1,0,1,0,254,alta
1,6.6,36,0,1,0,1,253,alta
4,6.6,37,0,0,0,0,255,média
6,5.5,45,1,0,1,1,250,baixa
7,4.5,60,0,1,1,1,250,baixa
...,...,...,...,...,...,...,...,...
1047,6.8,45,1,1,1,0,245,alta
1048,9.5,34,1,1,0,1,255,baixa
1049,6.5,37,0,0,0,0,255,média
1052,6.5,40,1,0,0,0,250,média


### Splitting into explanatory variables (x) and target variable (y)

In [7]:
x = df_dados_rotulados.drop(columns=["Qualidade"])
y = df_dados_rotulados["Qualidade"]

### Preparing the data

In [10]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)
y

array([0, 0, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 0, 1, 0, 0, 2, 2, 1, 1, 0,
       2, 2, 0, 1, 1, 2, 2, 1, 2, 0, 1, 1, 2, 1, 2, 1, 2, 2, 0, 2, 1, 1,
       1, 0, 2, 2, 1, 2, 2, 0, 1, 0, 0, 1, 0, 1, 2, 2, 2, 2, 0, 1, 0, 1,
       2, 1, 1, 2, 1, 1, 2, 0, 0, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1,
       1, 2, 1, 0, 0, 2, 2, 0, 2, 2, 0, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2,
       1, 2, 0, 2, 1, 2, 2, 0, 1, 1, 1, 2, 1, 0, 2, 1, 1, 1, 2, 0, 1, 1,
       0, 2, 1, 1, 2, 0, 0, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 1, 2, 1, 0, 2,
       0, 2, 0, 0, 2, 1, 1, 2, 1, 1, 2, 0, 1, 2, 2, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 2, 0, 1, 0, 2, 2, 0, 1, 2, 1, 1, 2, 0, 2, 1, 2, 0, 1,
       1, 1, 1, 2, 1, 2, 0, 2, 1, 1, 1, 2, 1, 2, 0, 1, 0, 2, 1, 2, 1, 2,
       0, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 0, 1, 2, 0, 0, 1, 2, 0,
       1, 2, 1, 1, 0, 0, 1, 0, 1, 2, 1, 2, 0, 1, 0, 2, 1, 2, 0, 2, 1, 2,
       0, 1, 0, 2, 0, 2, 0, 1, 2, 0, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 1, 2,
       0, 2, 1, 0, 1, 2, 0, 2, 0, 2, 1, 0, 2, 2, 0,

In [11]:
encoder.inverse_transform([0, 1, 2])

array(['alta', 'baixa', 'média'], dtype=object)
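The output above shows that `LabelEncoder` numbers the classes in sorted (alphabetical) order, so the codes do not follow the natural low < medium < high ordering. The sketch below reproduces this on the first five labels of the labeled subset and, purely as an illustration, shows a hand-written ordinal mapping. `ordem_qualidade` is a hypothetical name, not something the notebook defines.

```python
from sklearn.preprocessing import LabelEncoder

# First five labels of the labeled subset shown earlier.
rotulos = ["alta", "alta", "média", "baixa", "baixa"]

encoder_exemplo = LabelEncoder()
codigos = encoder_exemplo.fit_transform(rotulos)

# classes_ holds the labels in sorted order; the position of each label is its integer code.
mapa = {str(classe): codigo for codigo, classe in enumerate(encoder_exemplo.classes_)}
print(mapa)

# Hypothetical alternative: an explicit mapping that preserves low < medium < high.
ordem_qualidade = {"baixa": 0, "média": 1, "alta": 2}
codigos_ordinais = [ordem_qualidade[r] for r in rotulos]
print(codigos_ordinais)
```

For a classification target the numeric order of the codes usually does not matter, so the alphabetical encoding is fine here. The explicit mapping would only be relevant if the codes were later treated as ordered values.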

In [12]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Rescale every feature to the [0, 1] range; the resulting frame gets a fresh 0..n-1 index.
x_normalizado = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)
x_normalizado

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor
0,0.553846,0.017857,1.0,0.0,1.0,0.0,0.933333
1,0.553846,0.035714,0.0,1.0,0.0,1.0,0.866667
2,0.553846,0.053571,0.0,0.0,0.0,0.0,1.000000
3,0.384615,0.196429,1.0,0.0,1.0,1.0,0.666667
4,0.230769,0.464286,0.0,1.0,1.0,1.0,0.666667
...,...,...,...,...,...,...,...
419,0.584615,0.196429,1.0,1.0,1.0,0.0,0.333333
420,1.000000,0.000000,1.0,1.0,0.0,1.0,1.000000
421,0.538462,0.053571,0.0,0.0,0.0,0.0,1.000000
422,0.538462,0.107143,1.0,0.0,0.0,0.0,0.666667
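`MinMaxScaler` rescales each column with the formula x' = (x - min) / (max - min), using that column's own minimum and maximum. As a sanity check, the sketch below applies the formula by hand to the first labeled sample (pH 6.6, Temperatura 35, Cor 254). It assumes the labeled subset has the same minimum and maximum as the full dataset reported by `describe()`, an assumption that is consistent with the first row of the output above. The helper `min_max` is hypothetical.

```python
def min_max(valor, minimo, maximo):
    """Rescale a single value to [0, 1] given the column's min and max."""
    return (valor - minimo) / (maximo - minimo)

# Min and max taken from the describe() output (assumed to hold for the labeled rows too).
ph = min_max(6.6, 3.0, 9.5)
temperatura = min_max(35, 34, 90)
cor = min_max(254, 240, 255)

print(round(ph, 6), round(temperatura, 6), round(cor, 6))
```

The binary columns (`Sabor`, `Odor`, `Gordura`, `Turbidez`) already lie in {0, 1}, so the scaler leaves them unchanged.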
