# Piedmont Wine Dataset

Data refer to chemical properties of 178 specimens of three types of wine produced in the Piedmont region of Italy.

<https://search.r-project.org/CRAN/refmans/sn/html/wines.html>

The data represent 27 chemical measurements on each of 178 wine specimens belonging to three types of wine produced in the Piedmont region of Italy. The data were presented and examined by Forina et al. (1986) and were freely accessible from the PARVUS web-site while it was active. These data or, more often, a subset of them are now available from various places, including some R packages. The present dataset includes all variables available on the PARVUS repository, which are the variables listed by Forina et al. (1986) with the exception of ‘Sulphate’. Moreover, it reveals the undocumented fact that the original dataset also appears to include the vintage year; see the final portion of the ‘Examples’ section of the documentation page linked above.

The class label is the wine name (categorical; levels: Barbera, Barolo, Grignolino).

In [20]:
import pandas as pd

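# Load the dataset (the file has no header row)
wine_ds = pd.read_csv("wine.csv", header=None)

# Columns 1 onward hold the chemical measurements
samples = wine_ds.iloc[:, 1:]
# Column 0 holds the class label, coded 1-3; map the codes to variety names
varieties = wine_ds.iloc[:, 0]
varieties = varieties.replace({1: "Barolo", 2: "Grignolino", 3: "Barbera"})
print(wine_ds.describe())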

# There are 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera

               0           1           2           3           4           5   \
count  178.000000  178.000000  178.000000  178.000000  178.000000  178.000000   
mean     1.938202   13.000618    2.336348    2.366517   19.494944   99.741573   
std      0.775035    0.811827    1.117146    0.274344    3.339564   14.282484   
min      1.000000   11.030000    0.740000    1.360000   10.600000   70.000000   
25%      1.000000   12.362500    1.602500    2.210000   17.200000   88.000000   
50%      2.000000   13.050000    1.865000    2.360000   19.500000   98.000000   
75%      3.000000   13.677500    3.082500    2.557500   21.500000  107.000000   
max      3.000000   14.830000    5.800000    3.230000   30.000000  162.000000   

               6           7           8           9           10          11  \
count  178.000000  178.000000  178.000000  178.000000  178.000000  178.000000   
mean     2.295112    2.029270    0.361854    1.590899    5.058090    0.957449   
std      0.625851    0.9988

In [18]:
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
labels = model.fit_predict(samples)



In [19]:
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)

varieties  Barbera  Barolo  Grignolino
labels                                
0               19       0          50
1                0      46           1
2               29      13          20


Note that this time the three clusters found by KMeans do not correspond well to the wine varieties. That is because the features of the wine dataset have very different variances, and KMeans, which minimizes squared Euclidean distances, is dominated by the features with the largest variance.
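The difference in scale is already visible in the `describe()` output above. As a rough sketch, the snippet below takes the standard deviations of three of the printed columns (the dictionary keys are just the positional column numbers from that output) and shows how unevenly the variance is spread across them. It covers only these three columns, not the whole dataset.

```python
# Standard deviations copied from the describe() output above
# (keys are the positional column indices of wine.csv).
stds = {1: 0.811827, 3: 0.274344, 5: 14.282484}

# Variance is the square of the standard deviation.
variances = {col: s ** 2 for col, s in stds.items()}

# Share of the summed variance carried by each of these three columns.
total = sum(variances.values())
shares = {col: v / total for col, v in variances.items()}

for col in stds:
    print(f"column {col}: variance={variances[col]:.3f}, share={shares[col]:.1%}")
```

Among these three columns, column 5 accounts for more than 99% of the summed variance, so distances computed on the raw values mostly reflect that one feature.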

In [21]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)  # learn each feature's mean and standard deviation
samples_scaled = scaler.transform(samples)  # rescale to zero mean, unit variance
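For intuition, here is a minimal pure-Python sketch of the transformation `StandardScaler` applies to each column: subtract the column mean and divide by the population standard deviation. The helper `standardize` and the two feature lists are hypothetical, made up for illustration; they are not rows from `wine.csv`.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: subtract the mean, divide by the population std."""
    mu = fmean(column)
    sigma = pstdev(column)  # population std (ddof=0), as StandardScaler uses
    return [(x - mu) / sigma for x in column]

# Hypothetical values for two features on very different scales
# (illustrative only, not taken from wine.csv).
feature_a = [11.0, 12.4, 13.1, 13.7, 14.8]
feature_b = [70.0, 88.0, 98.0, 107.0, 162.0]

scaled_a = standardize(feature_a)
scaled_b = standardize(feature_b)

# After scaling, both features have mean 0 and unit variance,
# so neither dominates a Euclidean distance.
for name, col in [("a", scaled_a), ("b", scaled_b)]:
    print(name, round(fmean(col), 6), round(pstdev(col), 6))
```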

In [22]:
from sklearn.pipeline import make_pipeline
# The pipeline standardizes the features itself, so it is fitted on the raw samples
pipeline = make_pipeline(scaler, model)
pipeline.fit(samples)
labels = pipeline.predict(samples)



In [23]:
df = pd.DataFrame({'labels': labels, 'varieties': varieties})
ct = pd.crosstab(df['labels'], df['varieties'])
print(ct)

varieties  Barbera  Barolo  Grignolino
labels                                
0                0      59           4
1               48       0           3
2                0       0          64


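Standardizing the features before clustering makes the clusters correspond much more closely to the wine varieties: each cluster is now dominated by a single variety, and only 7 of the 178 samples (all Grignolino) fall into a cluster dominated by another variety.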
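One way to put a number on the improvement is cluster purity: the fraction of samples that belong to the majority variety of their cluster. This metric is not part of the original analysis; the sketch below simply applies it to the counts from the two cross-tabulations printed above, with a hypothetical helper named `purity`.

```python
# Cross-tabulation counts copied from the two tables above.
# Rows are cluster labels 0-2; columns are Barbera, Barolo, Grignolino.
raw_counts = [[19, 0, 50], [0, 46, 1], [29, 13, 20]]
scaled_counts = [[0, 59, 4], [48, 0, 3], [0, 0, 64]]

def purity(table):
    """Fraction of samples that belong to their cluster's majority variety."""
    total = sum(sum(row) for row in table)
    majority = sum(max(row) for row in table)
    return majority / total

print(f"purity without scaling: {purity(raw_counts):.3f}")
print(f"purity with scaling:    {purity(scaled_counts):.3f}")
```

With these counts, purity rises from 125/178 (about 0.70) without scaling to 171/178 (about 0.96) with scaling.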