# Portuguese wine quality
## About the project
Build a cloud application that uses Amazon SageMaker for the machine learning pipeline.
## The idea
Decide whether a wine is good or not, using a classification algorithm and the "Wine Quality" dataset made available by the UCI Machine Learning Repository and produced by the Comissão de Viticultura da Região dos Vinhos Verdes (CVRVV), Porto, Portugal.
## The dataset
The two datasets relate to the red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistics issues, only physicochemical (input) and sensory (output) variables are available.
These datasets can be treated as classification or regression tasks. The classes are ordered and imbalanced.
## Features
Input variables:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

Output variable (based on sensory data):
- quality (score between 0 and 10)
## Dataset citation
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.


# Lab setup
The quality score, an integer from 0 to 10, is converted into a binary label (0 or 1) so that the task becomes a classification problem:
- 1: considered a good wine
- 0: considered a bad wine

Quality scores of 6 or above count as good wines.
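
The labeling rule above can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code; the function name `to_binary_label` and the constant `GOOD_THRESHOLD` are made up for the example.

```python
GOOD_THRESHOLD = 6  # scores of 6 or above count as good wines

def to_binary_label(quality: int) -> int:
    """Return 1 for a good wine (quality >= 6), 0 otherwise."""
    if not 0 <= quality <= 10:
        raise ValueError(f"quality must be between 0 and 10, got {quality}")
    return 1 if quality >= GOOD_THRESHOLD else 0

labels = [to_binary_label(q) for q in (3, 5, 6, 9)]
print(labels)  # [0, 0, 1, 1]
```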

## Pipeline on AWS SageMaker
Once the problem has been defined and formulated and the data collected, the next steps are:
- analyze the data
- decide whether the features fit the business rule
- select the model
- train the model
- deploy the model
- evaluate and test

## Importing and analyzing the data

In [2]:
# first of all, import the libraries needed to download and load the data
import warnings, requests, zipfile, io
warnings.simplefilter('ignore')
import pandas as pd
from scipy.io import arff
import boto3

In [3]:
f_zip = 'http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip'
r = requests.get(f_zip, stream=True)
wines_zip = zipfile.ZipFile(io.BytesIO(r.content))
wines_zip.extractall()

In [4]:
# note: header=None keeps the CSV header line as data row 0, so every column is read as text
data = pd.read_csv('winequality/winequality-white.csv', sep=';', header=None)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
2,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
3,8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
4,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6


In [5]:
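# turn the dataset into a binary classification problem: 0-5 -> 0 (bad), 6-9 -> 1 (good)
# the keys are strings because the quality column was read as text (the header row is still in the data)
mapper = {'0':0,'1':0,'2':0,'3':0,'4':0,'5':0,'6':1,'7':1,'8':1,'9':1}
data[11] = data[11].replace(mapper)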
data.shape

(4899, 12)
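
The shape above has 4899 rows because `header=None` keeps the CSV header line as a data row. A possible alternative, sketched here on a two-row inline sample instead of the real file, is to let pandas use that line for the column names. The `sample` string and the variable names are assumptions made for the illustration, and the rest of this notebook still relies on the integer column labels that `header=None` produces.

```python
import io
import pandas as pd

# Two data rows copied from the output above, preceded by the header line.
sample = (
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6\n"
    "6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6\n"
)

# header=None: the header line becomes row 0 and every value is read as text.
raw = pd.read_csv(io.StringIO(sample), sep=";", header=None)

# header=0: the first line supplies the column names and values are parsed as numbers.
clean = pd.read_csv(io.StringIO(sample), sep=";", header=0)

# With numeric data the binary label is a plain comparison.
labels = (clean["quality"] >= 6).astype(int)

print(len(raw), len(clean))
print(labels.tolist())
```

Reading the file this way would also keep the stray header row out of the training split.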

In [6]:
data.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

In [7]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,1
2,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,1
3,8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,1
4,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,1


In [8]:
# After analyzing the data, we still need to adapt it to the business rule
# the dataset will be fed to the XGBoost classification algorithm,
# which expects the label column to be the first column of the dataset
cols = data.columns.tolist()
cols = cols[-1:] + cols[:-1]
data = data[cols]
data.columns

Int64Index([11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')

In [9]:
data.head()

Unnamed: 0,11,0,1,2,3,4,5,6,7,8,9,10
0,quality,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
1,1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8
2,1,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5
3,1,8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1
4,1,7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9


## Splitting the data - 80%-10%-10%

In [10]:
from sklearn.model_selection import train_test_split
train, test_and_validate = train_test_split(data, test_size=0.2, random_state=42)
test, validate = train_test_split(test_and_validate, test_size=0.5, random_state=42)

In [11]:
print(train.shape)
print(test.shape)
print(validate.shape)

(3919, 12)
(490, 12)
(490, 12)
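
The sizes printed above follow from the two `test_size` values. The sketch below reproduces the arithmetic with the standard library only, assuming scikit-learn rounds the test part up and gives the remainder to the training part when only `test_size` is passed as a fraction. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_holdout = split_sizes(4899, 0.2)        # first split: 80% / 20%
n_test, n_validate = split_sizes(n_holdout, 0.5)   # second split halves the 20%
print(n_train, n_test, n_validate)  # 3919 490 490
```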


In [12]:
print(train[11].value_counts())
print(test[11].value_counts())
print(validate[11].value_counts())

1          2583
0          1335
quality       1
Name: 11, dtype: int64
1    349
0    141
Name: 11, dtype: int64
1    326
0    164
Name: 11, dtype: int64
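
The counts above show the class imbalance mentioned earlier. As a quick check, the share of good wines in each split can be computed from those counts; the stray `quality` entry, which is the CSV header row, is left out. The dictionary below simply restates the printed numbers.

```python
# (good, bad) counts copied from the value_counts() output above
counts = {
    "train": (2583, 1335),
    "test": (349, 141),
    "validate": (326, 164),
}

shares = {name: good / (good + bad) for name, (good, bad) in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} good wines")
```

Because the share differs between splits, passing `stratify=` to `train_test_split` is one option for keeping it constant. That is a suggestion, not something the original notebook does.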


## Uploading the data to AWS S3

In [13]:
bucket='paularaujoufrpepisi4'

prefix='wines'

train_file='wines_train.csv'
test_file='wines_test.csv'
validate_file='wines_validate.csv'

import os

s3_resource = boto3.Session().resource('s3')
def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())
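
`upload_s3_csv` builds the object key with `os.path.join`, which uses the local operating system's path separator. S3 keys conventionally use forward slashes, so a more portable option, sketched here as an assumption and not as the notebook's method, is `posixpath.join`. The helper name `s3_key` is hypothetical.

```python
import posixpath

def s3_key(prefix: str, folder: str, filename: str) -> str:
    """Join key parts with '/' regardless of the local operating system."""
    return posixpath.join(prefix, folder, filename)

key = s3_key("wines", "train", "wines_train.csv")
print(key)  # wines/train/wines_train.csv
```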

In [14]:
upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)

## Training the model

In [15]:
import boto3
from sagemaker.image_uris import retrieve
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

hyperparams={"num_round":"50",
             "eval_metric": "auc",
             "objective": "binary:logistic"}

In [16]:
import sagemaker
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)
xgb_model=sagemaker.estimator.Estimator(container,
                                       sagemaker.get_execution_role(),
                                       instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       output_path=s3_output_location,
                                        hyperparameters=hyperparams,
                                        sagemaker_session=sagemaker.Session())
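
With the `binary:logistic` objective set in the hyperparameters above, XGBoost outputs a probability instead of a class label. Conceptually, the model's raw score is passed through the logistic (sigmoid) function. The sketch below illustrates that mapping with made-up raw scores; it is not SageMaker or XGBoost code.

```python
import math

def sigmoid(score: float) -> float:
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

for raw_score in (-2.0, 0.0, 2.0):  # illustrative values only
    print(raw_score, round(sigmoid(raw_score), 4))
```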

In [17]:
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}

In [18]:
xgb_model.fit(inputs=data_channels, logs=False)
print('ready for hosting!')


2021-12-02 02:21:04 Starting - Starting the training job.
2021-12-02 02:21:10 Starting - Launching requested ML instances............
2021-12-02 02:22:19 Starting - Preparing the instances for training..........................
2021-12-02 02:24:32 Downloading - Downloading input data..
2021-12-02 02:24:47 Training - Downloading the training image........
2021-12-02 02:25:33 Training - Training image download completed. Training in progress.
2021-12-02 02:25:36 Uploading - Uploading generated training model
2021-12-02 02:25:45 Completed - Training job completed
ready for hosting!


In [19]:
## Hosting the model
xgb_predictor = xgb_model.deploy(initial_instance_count=1,
                serializer = sagemaker.serializers.CSVSerializer(),
                instance_type='ml.m4.xlarge')

---------!

In [20]:
# making predictions
test.shape
test.head(5)

Unnamed: 0,11,0,1,2,3,4,5,6,7,8,9,10
3783,1,7.6,0.27,0.3,9.2,0.018,23.0,96,0.9938,3.08,0.29,11.0
4352,1,6.4,0.31,0.28,2.5,0.039,34.0,137,0.98946,3.22,0.38,12.7
4113,1,6.0,0.2,0.25,2.0,0.041,30.0,95,0.99078,3.27,0.56,11.1
4567,0,8.6,0.36,0.26,11.1,0.03,43.5,171,0.9948,3.03,0.49,12.0
1912,1,7.1,0.18,0.26,1.3,0.041,20.0,71,0.9926,3.04,0.74,9.9


In [21]:
row = test.iloc[0:1,1:] 
row.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
3783,7.6,0.27,0.3,9.2,0.018,23,96,0.9938,3.08,0.29,11


In [22]:
batch_X_csv_buffer = io.StringIO()
row.to_csv(batch_X_csv_buffer, header=False, index=False)
test_row = batch_X_csv_buffer.getvalue()
print(test_row)

7.6,0.27,0.3,9.2,0.018,23,96,0.9938,3.08,0.29,11



In [23]:
# compare the predicted probability with the actual class
xgb_predictor.predict(test_row)

b'0.7266927361488342'
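
The endpoint returns the probability as raw bytes. A small sketch of turning that response into a class label follows, using the bytes shown above and the 0.65 threshold that this notebook applies later. `parse_prediction` is a hypothetical helper.

```python
def parse_prediction(response: bytes, threshold: float = 0.65) -> tuple[float, int]:
    """Decode a single-value CSV response and apply a decision threshold."""
    probability = float(response.decode("utf-8").strip())
    return probability, int(probability > threshold)

probability, label = parse_prediction(b"0.7266927361488342")
print(probability, label)
```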

In [24]:
test.head(5)

Unnamed: 0,11,0,1,2,3,4,5,6,7,8,9,10
3783,1,7.6,0.27,0.3,9.2,0.018,23.0,96,0.9938,3.08,0.29,11.0
4352,1,6.4,0.31,0.28,2.5,0.039,34.0,137,0.98946,3.22,0.38,12.7
4113,1,6.0,0.2,0.25,2.0,0.041,30.0,95,0.99078,3.27,0.56,11.1
4567,0,8.6,0.36,0.26,11.1,0.03,43.5,171,0.9948,3.03,0.49,12.0
1912,1,7.1,0.18,0.26,1.3,0.041,20.0,71,0.9926,3.04,0.74,9.9


In [25]:
# delete the endpoint when done (a running endpoint keeps incurring charges)
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

## Running a batch transform

In [26]:
batch_X = test.iloc[:, 1:]
batch_X.head()

batch_X_file='batch-in.csv'
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

In [27]:
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

xgb_transformer = xgb_model.transformer(instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgb_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
xgb_transformer.wait()

..................................
[2021-12-02:02:35:51:INFO] No GPUs detected (normal if no gpus installed)
[2021-12-02:02:35:51:INFO] nginx config:

## Downloading the results, comparing, and testing

In [28]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()), sep=',', names=['class'])
target_predicted.head(5)

Unnamed: 0,class
0,0.726693
1,0.996102
2,0.990326
3,0.939032
4,0.571022


In [29]:
def binary_convert(x):
    threshold = 0.65
    if x > threshold:
        return 1
    else:
        return 0

target_predicted['binary'] = target_predicted['class'].apply(binary_convert)

print(target_predicted.head(10))
test.head(10)

      class  binary
0  0.726693       1
1  0.996102       1
2  0.990326       1
3  0.939032       1
4  0.571022       0
5  0.360810       0
6  0.847414       1
7  0.048097       0
8  0.691831       1
9  0.919990       1


Unnamed: 0,11,0,1,2,3,4,5,6,7,8,9,10
3783,1,7.6,0.27,0.3,9.2,0.018,23.0,96,0.9938,3.08,0.29,11.0
4352,1,6.4,0.31,0.28,2.5,0.039,34.0,137,0.98946,3.22,0.38,12.7
4113,1,6.0,0.2,0.25,2.0,0.041,30.0,95,0.99078,3.27,0.56,11.1
4567,0,8.6,0.36,0.26,11.1,0.03,43.5,171,0.9948,3.03,0.49,12.0
1912,1,7.1,0.18,0.26,1.3,0.041,20.0,71,0.9926,3.04,0.74,9.9
1261,1,6.4,0.23,0.3,7.1,0.037,63.0,236,0.9952,3.06,0.34,9.2
4596,1,7.1,0.39,0.3,9.9,0.037,29.0,124,0.99414,3.07,0.42,10.9
4681,0,6.8,0.63,0.04,1.3,0.058,25.0,133,0.99271,3.17,0.39,10.2
1090,1,7.0,0.17,0.33,4.0,0.034,17.0,127,0.9934,3.19,0.39,10.6
1507,1,8.1,0.2,0.49,8.1,0.051,51.0,205,0.9954,3.1,0.52,11.0


In [30]:
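from sklearn.metrics import accuracy_score
y_pred = target_predicted['binary']
# the label column is the integer 11, not the string '11': test['11'] raises KeyError: '11'
# the column holds Python ints with object dtype, so cast it before scoring
y_true = test[11].astype(int)
accuracy_score(y_true, y_pred)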