# Classification: semi-supervised learning

A dairy company needs to ensure the quality of the milk it uses to manufacture its products.

In this project, we will analyze the characteristics of milk samples to determine their quality and check whether they can be used in food production.

Machine learning will be used to **classify** each sample into one of three quality levels: **low (`baixa`), medium (`média`), or high (`alta`)**.

## Loading the data

In [20]:
import pandas as pd

path = "dados/qualidade_leite.csv"
df_dados = pd.read_csv(path)
df_dados

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor,Qualidade
0,6.6,35,1,0,1,0,254,alta
1,6.6,36,0,1,0,1,253,alta
2,8.5,70,1,1,1,1,246,
3,9.5,34,1,1,0,1,255,
4,6.6,37,0,0,0,0,255,média
...,...,...,...,...,...,...,...,...
1054,6.7,45,1,1,0,0,247,
1055,6.7,38,1,0,1,0,255,
1056,3.0,40,1,1,1,1,255,
1057,6.8,43,1,0,1,0,250,


| Column      | Data type | Description |
| :---------- | :-------- | :---------- |
| pH          | float64   | Acidity or alkalinity level of the milk. |
| Temperatura | int64     | Temperature of the milk in degrees Celsius. |
| Sabor       | int64     | Binary indicator of taste quality (1 for high, 0 for low). |
| Odor        | int64     | Binary indicator of odor quality (1 for high, 0 for low). |
| Gordura     | int64     | Binary fat indicator (1 for high, 0 for low). |
| Turbidez    | int64     | Binary turbidity indicator (1 for high, 0 for low). |
| Cor         | int64     | Numeric value representing the color tone of the milk. |
| Qualidade   | object    | Milk quality class (`alta` = high, `média` = medium, `baixa` = low); null values (NaN) mark samples that have not been classified. This is the target variable of our model. |

## Exploring the data

In [21]:
df_dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1059 entries, 0 to 1058
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pH           1059 non-null   float64
 1   Temperatura  1059 non-null   int64  
 2   Sabor        1059 non-null   int64  
 3   Odor         1059 non-null   int64  
 4   Gordura      1059 non-null   int64  
 5   Turbidez     1059 non-null   int64  
 6   Cor          1059 non-null   int64  
 7   Qualidade    424 non-null    object 
dtypes: float64(1), int64(6), object(1)
memory usage: 66.3+ KB


In [22]:
df_dados["Qualidade"].value_counts(dropna=False)

Qualidade
NaN      635
baixa    184
média    149
alta      91
Name: count, dtype: int64
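
The counts above imply that fewer than half of the samples carry a label. As a quick check, the sketch below recomputes the labeled share and the class balance among the labeled samples, using the numbers copied from the output (the dictionary keys are illustrative names, not columns of the dataset).

```python
# Counts copied from the value_counts output above.
counts = {"unlabeled": 635, "baixa": 184, "média": 149, "alta": 91}

total = sum(counts.values())
labeled = total - counts["unlabeled"]

print(f"labeled samples: {labeled} of {total} ({labeled / total:.1%})")

# Share of each class among the labeled samples only.
for name in ("baixa", "média", "alta"):
    print(f"{name}: {counts[name] / labeled:.1%} of the labeled samples")
```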

In [23]:
df_dados.describe()

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor
count,1059.0,1059.0,1059.0,1059.0,1059.0,1059.0,1059.0
mean,6.630123,44.226629,0.546742,0.432483,0.671388,0.491029,251.840415
std,1.399679,10.098364,0.498046,0.495655,0.46993,0.500156,4.307424
min,3.0,34.0,0.0,0.0,0.0,0.0,240.0
25%,6.5,38.0,0.0,0.0,0.0,0.0,250.0
50%,6.7,41.0,1.0,0.0,1.0,0.0,255.0
75%,6.8,45.0,1.0,1.0,1.0,1.0,255.0
max,9.5,90.0,1.0,1.0,1.0,1.0,255.0


## Supervised approach

In [24]:
# Keep only the rows that have a quality label
df_dados_rotulados = df_dados.dropna(subset=["Qualidade"])
df_dados_rotulados

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor,Qualidade
0,6.6,35,1,0,1,0,254,alta
1,6.6,36,0,1,0,1,253,alta
4,6.6,37,0,0,0,0,255,média
6,5.5,45,1,0,1,1,250,baixa
7,4.5,60,0,1,1,1,250,baixa
...,...,...,...,...,...,...,...,...
1047,6.8,45,1,1,1,0,245,alta
1048,9.5,34,1,1,0,1,255,baixa
1049,6.5,37,0,0,0,0,255,média
1052,6.5,40,1,0,0,0,250,média


### Splitting into explanatory variables (x) and target variable (y)

In [25]:
x = df_dados_rotulados.drop(columns=["Qualidade"])
y = df_dados_rotulados["Qualidade"]

### Preparing the data

In [26]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)
y

array([0, 0, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 0, 1, 0, 0, 2, 2, 1, 1, 0,
       2, 2, 0, 1, 1, 2, 2, 1, 2, 0, 1, 1, 2, 1, 2, 1, 2, 2, 0, 2, 1, 1,
       1, 0, 2, 2, 1, 2, 2, 0, 1, 0, 0, 1, 0, 1, 2, 2, 2, 2, 0, 1, 0, 1,
       2, 1, 1, 2, 1, 1, 2, 0, 0, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1,
       1, 2, 1, 0, 0, 2, 2, 0, 2, 2, 0, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2,
       1, 2, 0, 2, 1, 2, 2, 0, 1, 1, 1, 2, 1, 0, 2, 1, 1, 1, 2, 0, 1, 1,
       0, 2, 1, 1, 2, 0, 0, 2, 2, 2, 0, 1, 2, 1, 2, 1, 1, 1, 2, 1, 0, 2,
       0, 2, 0, 0, 2, 1, 1, 2, 1, 1, 2, 0, 1, 2, 2, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 2, 0, 1, 0, 2, 2, 0, 1, 2, 1, 1, 2, 0, 2, 1, 2, 0, 1,
       1, 1, 1, 2, 1, 2, 0, 2, 1, 1, 1, 2, 1, 2, 0, 1, 0, 2, 1, 2, 1, 2,
       0, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 0, 1, 2, 0, 0, 1, 2, 0,
       1, 2, 1, 1, 0, 0, 1, 0, 1, 2, 1, 2, 0, 1, 0, 2, 1, 2, 0, 2, 1, 2,
       0, 1, 0, 2, 0, 2, 0, 1, 2, 0, 1, 2, 1, 0, 1, 1, 2, 0, 1, 1, 1, 2,
       0, 2, 1, 0, 1, 2, 0, 2, 0, 2, 1, 0, 2, 2, 0,

In [27]:
encoder.inverse_transform([0, 1, 2])

array(['alta', 'baixa', 'média'], dtype=object)
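
`LabelEncoder` assigns integer codes following the sorted order of the class names, so the codes here are alphabetical rather than ordered by quality. The pure-Python sketch below imitates that behavior (it is an illustration, not scikit-learn's implementation) to make explicit which code corresponds to which class when reading the reports.

```python
# Illustrative re-creation of the mapping shown above; not scikit-learn code.
labels = ["alta", "alta", "média", "baixa", "baixa"]  # first five labeled rows

classes = sorted(set(labels))  # alphabetical order of the class names
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
print(to_code)
print(encoded)
```

With this mapping, class 0 is high quality (`alta`), class 1 is low quality (`baixa`), and class 2 is medium quality (`média`).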

In [28]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

x_normalizado = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)
x_normalizado

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor
0,0.553846,0.017857,1.0,0.0,1.0,0.0,0.933333
1,0.553846,0.035714,0.0,1.0,0.0,1.0,0.866667
2,0.553846,0.053571,0.0,0.0,0.0,0.0,1.000000
3,0.384615,0.196429,1.0,0.0,1.0,1.0,0.666667
4,0.230769,0.464286,0.0,1.0,1.0,1.0,0.666667
...,...,...,...,...,...,...,...
419,0.584615,0.196429,1.0,1.0,1.0,0.0,0.333333
420,1.000000,0.000000,1.0,1.0,0.0,1.0,1.000000
421,0.538462,0.053571,0.0,0.0,0.0,0.0,1.000000
422,0.538462,0.107143,1.0,0.0,0.0,0.0,0.666667
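
`MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. The sketch below reproduces the first scaled row by hand. It assumes the labeled subset has the same column minima and maxima as the `describe()` output for the full dataset, which is consistent with the values displayed but is not stated in the notebook.

```python
def min_max(value, lo, hi):
    """Rescale value to [0, 1] given the column minimum and maximum."""
    return (value - lo) / (hi - lo)

# (min, max) per column, taken from describe(); assumed to hold for the labeled rows too.
limits = {"pH": (3.0, 9.5), "Temperatura": (34, 90), "Cor": (240, 255)}

# First labeled row: pH 6.6, Temperatura 35, Cor 254.
row = {"pH": 6.6, "Temperatura": 35, "Cor": 254}

scaled = {col: round(min_max(row[col], *limits[col]), 6) for col in row}
print(scaled)
```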


### Building a supervised model

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

SEED = 10

x_treino, x_teste, y_treino, y_teste = train_test_split(x_normalizado, y, stratify=y, random_state=SEED)

svm = SVC(kernel="linear", random_state=SEED)
svm.fit(x_treino, y_treino)

y_pred = svm.predict(x_teste)

resultados_svm = classification_report(y_teste, y_pred)
print(resultados_svm)

              precision    recall  f1-score   support

           0       0.75      0.39      0.51        23
           1       0.67      0.91      0.77        46
           2       0.90      0.76      0.82        37

    accuracy                           0.75       106
   macro avg       0.77      0.69      0.70       106
weighted avg       0.77      0.75      0.73       106
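
Each F1 score in the report is the harmonic mean of precision and recall, `F1 = 2PR / (P + R)`, and recall times support gives the number of samples of a class that were correctly identified. The sketch below checks this against the printed values; because the inputs are already rounded to two decimals, the results are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per encoded class, copied from the report above.
report = {0: (0.75, 0.39, 23), 1: (0.67, 0.91, 46), 2: (0.90, 0.76, 37)}

for label, (p, r, support) in report.items():
    hits = round(r * support)  # approximate count of correctly recalled samples
    print(f"class {label}: F1 = {f1(p, r):.2f}, about {hits} of {support} recalled")

# The macro average weights every class equally, regardless of support.
macro_f1 = sum(f1(p, r) for p, r, _ in report.values()) / len(report)
print(f"macro F1 = {macro_f1:.2f}")
```

Read this way, the weakest point of the model is class 0 (`alta`, per the encoder mapping), where fewer than half of the test samples are recovered.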



## Semi-supervised approach

### Getting the unlabeled data

In [30]:
dados_nao_rotulados = df_dados[df_dados["Qualidade"].isna()].drop(columns=["Qualidade"])
dados_nao_rotulados

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor
2,8.5,70,1,1,1,1,246
3,9.5,34,1,1,0,1,255
5,6.6,37,1,1,1,1,255
8,8.1,66,1,0,1,1,255
10,6.7,45,1,1,1,0,245
...,...,...,...,...,...,...,...
1053,8.1,66,1,0,1,1,255
1054,6.7,45,1,1,0,0,247
1055,6.7,38,1,0,1,0,255
1056,3.0,40,1,1,1,1,255


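### Preparing the unlabeled data

In [31]:
# Reuse the scaler already fitted on the labeled data (transform, not fit_transform)
# so that both subsets are scaled with the same minima and maxima.
dados_nao_rotulados_normalizado = pd.DataFrame(scaler.transform(dados_nao_rotulados), columns=x.columns)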
dados_nao_rotulados_normalizado

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor
0,0.846154,0.642857,1.0,1.0,1.0,1.0,0.400000
1,1.000000,0.000000,1.0,1.0,0.0,1.0,1.000000
2,0.553846,0.053571,1.0,1.0,1.0,1.0,1.000000
3,0.784615,0.571429,1.0,0.0,1.0,1.0,1.000000
4,0.569231,0.196429,1.0,1.0,1.0,0.0,0.333333
...,...,...,...,...,...,...,...
630,0.784615,0.571429,1.0,0.0,1.0,1.0,1.000000
631,0.569231,0.196429,1.0,1.0,0.0,0.0,0.466667
632,0.569231,0.071429,1.0,0.0,1.0,0.0,1.000000
633,0.000000,0.107143,1.0,1.0,1.0,1.0,1.000000
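
When a second dataset is scaled, the minimum and maximum used matter: a scaler refit on the new data learns new limits, so the same raw value can land on a different scaled value than it did in the training data. The sketch below illustrates this with made-up temperatures (the numbers and helper names are hypothetical, not taken from the dataset).

```python
def fit_limits(values):
    """Learn the (min, max) pair a min-max scaler would store."""
    return min(values), max(values)

def transform(values, limits):
    """Apply previously learned limits; results may fall outside [0, 1]."""
    lo, hi = limits
    return [(v - lo) / (hi - lo) for v in values]

train_temps = [34, 45, 90]  # hypothetical training column
new_temps = [40, 45, 60]    # hypothetical new column with a narrower range

reused = transform(new_temps, fit_limits(train_temps))  # same scale as training
refit = transform(new_temps, fit_limits(new_temps))     # scale relearned on new data

print(reused)
print(refit)
```

In the outputs shown here, the scaled values of the unlabeled rows are consistent with the limits of the labeled rows (for example, pH 6.6 maps to 0.553846 in both tables), so the effect would likely not be visible in this particular dataset.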


In [32]:
y_previsto = svm.predict(dados_nao_rotulados_normalizado)

### Pseudo-labeling

In [33]:
novo_x_treino = pd.concat([x_treino, dados_nao_rotulados_normalizado])
novo_x_treino

Unnamed: 0,pH,Temperatura,Sabor,Odor,Gordura,Turbidez,Cor
9,0.584615,0.017857,1.0,0.0,1.0,0.0,0.400000
191,0.538462,0.035714,0.0,0.0,0.0,0.0,0.466667
103,0.538462,0.017857,1.0,0.0,1.0,0.0,0.400000
347,0.569231,0.071429,1.0,0.0,1.0,0.0,1.000000
137,0.553846,0.196429,0.0,1.0,1.0,1.0,0.666667
...,...,...,...,...,...,...,...
630,0.784615,0.571429,1.0,0.0,1.0,1.0,1.000000
631,0.569231,0.196429,1.0,1.0,0.0,0.0,0.466667
632,0.569231,0.071429,1.0,0.0,1.0,0.0,1.000000
633,0.000000,0.107143,1.0,1.0,1.0,1.0,1.000000


In [34]:
novo_y_treino = pd.concat([pd.Series(y_treino), pd.Series(y_previsto)])
novo_y_treino

0      2
1      2
2      2
3      0
4      0
      ..
630    1
631    1
632    0
633    1
634    2
Length: 953, dtype: int64

In [35]:
svm_pseudo_labeling = SVC(kernel="linear", random_state=SEED)
svm_pseudo_labeling.fit(novo_x_treino, novo_y_treino)

novo_y_previsto = svm_pseudo_labeling.predict(x_teste)

resultados_pseudo_labeling = classification_report(y_teste, novo_y_previsto)
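
The pseudo-labeling steps above (train on the labeled data, predict the unlabeled data, retrain on the union) can be sketched independently of scikit-learn. The example below swaps the SVM for a tiny nearest-centroid classifier on one-dimensional, made-up data, purely to make the data flow visible; the function names and values are hypothetical.

```python
from statistics import mean

def fit_centroids(xs, ys):
    """Return the mean feature value of each class."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}

def predict(centroids, xs):
    """Assign each value to the class with the closest centroid."""
    return [min(centroids, key=lambda c: abs(x - centroids[c])) for x in xs]

# Made-up one-dimensional data.
x_labeled, y_labeled = [0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]
x_unlabeled = [0.0, 0.3, 0.6, 1.0]

# 1. Train on the labeled data only.
base = fit_centroids(x_labeled, y_labeled)

# 2. Predict pseudo-labels for the unlabeled data.
pseudo = predict(base, x_unlabeled)

# 3. Retrain on labeled + pseudo-labeled data.
retrained = fit_centroids(x_labeled + x_unlabeled, y_labeled + pseudo)

print(pseudo)
print(base, retrained)
```

Because the pseudo-labels come from the first model, retraining can reinforce that model's existing decision boundary, including its mistakes.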

In [36]:
print("Supervised SVM")
print(resultados_svm)
print("-" * 50)
print("Pseudo-labeling SVM")
print(resultados_pseudo_labeling)

Supervised SVM
              precision    recall  f1-score   support

           0       0.75      0.39      0.51        23
           1       0.67      0.91      0.77        46
           2       0.90      0.76      0.82        37

    accuracy                           0.75       106
   macro avg       0.77      0.69      0.70       106
weighted avg       0.77      0.75      0.73       106

--------------------------------------------------
Pseudo-labeling SVM
              precision    recall  f1-score   support

           0       0.69      0.39      0.50        23
           1       0.66      0.87      0.75        46
           2       0.88      0.76      0.81        37

    accuracy                           0.73       106
   macro avg       0.74      0.67      0.69       106
weighted avg       0.74      0.73      0.72       106
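
Since the test set has only 106 samples, the gap between 0.75 and 0.73 accuracy corresponds to a handful of predictions. The sketch below lists which counts of correct predictions are consistent with each rounded accuracy; it relies only on the numbers printed in the two reports.

```python
TEST_SIZE = 106  # total support from the reports above

def consistent_counts(rounded_accuracy, n=TEST_SIZE):
    """Counts of correct predictions whose accuracy rounds to the reported value."""
    return [k for k in range(n + 1) if round(k / n, 2) == rounded_accuracy]

print("supervised:", consistent_counts(0.75))
print("pseudo-labeling:", consistent_counts(0.73))
```

Under this reading, pseudo-labeling classified two or three fewer test samples correctly than the supervised baseline, a difference small enough that it may not be meaningful on a test set of this size.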

