# Wine Analysis

## Introduction

We have a dataset that covers the chemical composition of wines from three different cultivars grown in the same region of Italy.

The measured attributes are:

- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline

We will see whether we can find any interesting insights in it by applying exploratory analysis and clustering techniques.

## Pre-Analysis

Let's say that we are neither wine nor chemistry experts. In that case, it is useful to gather some background on the attributes we are working with, so that we have an idea of what to expect.

### Alcohol
- **Description:** Represents the percentage of alcohol in the wine.
- **Typical Range:** Varies from around 8% to 15%.


### Malic Acid
- **Description:** Organic acid present in grapes. Affects acidity and can influence flavor.
- **Typical Range:** In the range of 0.1 to 5 g/L.



### Ash
- **Description:** Describes the total amount of minerals present in the wine after burning. Can be indicative of wine quality.
- **Typical Range:** Can vary, but typically found in the range of 1.5 to 3 g/L.



### Alcalinity of Ash
- **Description:** Measures the amount of alkali in terms of carbonate equivalent. Related to wine acidity.
- **Typical Range:** Common values are in the range of 10 to 30 mEq/L.



### Magnesium
- **Description:** Concentration of magnesium in the wine.
- **Typical Range:** Can vary, but typical values range between 70 and 162 mg/L.


### Total Phenols
- **Description:** Represents the total concentration of phenolic compounds in the wine, including antioxidants.
- **Typical Range:** Concentrations can vary, but red wines, in particular, may have values in the range of 100 to 300 mg/L.



### Flavanoids
- **Description:** Antioxidant compounds contributing to wine structure, flavor, and color.
- **Typical Range:** Concentrations can vary, but typical values are between 0.5 and 5 mg/L.



### Nonflavanoid Phenols
- **Description:** Another group of phenolic compounds excluding flavonoids.
- **Typical Range:** Concentrations can vary, but typical values are between 0.1 and 1.5 mg/L.



### Proanthocyanins
- **Description:** Antioxidant compounds contributing to astringency and flavor.
- **Typical Range:** Concentrations can vary, but typical values are between 0.5 and 3 mg/L.



### Color Intensity
- **Description:** Measures the intensity of the wine color.
- **Typical Range:** Red wines often have higher values, in the range of 1 to 15.



### Hue
- **Description:** Refers to the color tone of the wine.
- **Typical Range:** Typical values can be in the range of 0.5 to 1.5.



### OD280/OD315 of Diluted Wines
- **Description:** The ratio of optical density at 280 nm to 315 nm. Provides information about wine color concentration and clarity.
- **Typical Range:** Values can vary, but typically fall in the range of 1 to 4.



### Proline
- **Description:** Concentration of proline, an amino acid, in the wine.
- **Typical Range:** Concentrations can vary, but typical values are between 300 and 1680 mg/L.


### General Observations

Based on this information, we could expect some attributes to be closely correlated:

- The ones that explicitly refer to or affect the color: Hue, OD280/OD315, Color Intensity
- The ones that relate to the acidity of each wine: Ash, Alcalinity of Ash, Malic Acid
- And the ones that could tell us something about the antioxidants in the wine: Total Phenols, Flavanoids, Nonflavanoid Phenols

Now, let's get started.
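
Once the data is available, this expectation can be checked by computing a correlation matrix for each group. The snippet below is only a sketch: `group_correlations` and the `groups` dictionary are hypothetical names introduced here, the column names follow the CSV header, and the five sample rows are the ones printed by `df.head()` in the loading section, so the resulting numbers say nothing reliable about the full dataset.

```python
import pandas as pd

# First five rows of the dataset, as shown by df.head() in this notebook.
# With the real data, pass the full DataFrame instead of this sample.
sample = pd.DataFrame({
    "Malic_Acid": [1.71, 1.78, 2.36, 1.95, 2.59],
    "Ash": [2.43, 2.14, 2.67, 2.50, 2.87],
    "Ash_Alcanity": [15.6, 11.2, 18.6, 16.8, 21.0],
    "Total_Phenols": [2.80, 2.65, 2.80, 3.85, 2.80],
    "Flavanoids": [3.06, 2.76, 3.24, 3.49, 2.69],
    "Nonflavanoid_Phenols": [0.28, 0.26, 0.30, 0.24, 0.39],
    "Color_Intensity": [5.64, 4.38, 5.68, 7.80, 4.32],
    "Hue": [1.04, 1.05, 1.03, 0.86, 1.04],
    "OD280": [3.92, 3.40, 3.17, 3.45, 2.93],
})

# Hypothetical grouping that mirrors the three bullet points above.
groups = {
    "color": ["Hue", "OD280", "Color_Intensity"],
    "acidity": ["Ash", "Ash_Alcanity", "Malic_Acid"],
    "antioxidants": ["Total_Phenols", "Flavanoids", "Nonflavanoid_Phenols"],
}


def group_correlations(df, groups):
    """Return the Pearson correlation matrix of each group of columns."""
    return {name: df[cols].corr() for name, cols in groups.items()}


correlations = group_correlations(sample, groups)
for name, matrix in correlations.items():
    print(name)
    print(matrix.round(2))
```

Coefficients close to 1 or -1 inside a group would support the expectation, while values near 0 would suggest the attributes carry independent information.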

## Analysis

### Import Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN

### Loading the data

In [2]:
path = "data/"
dataset = "wine-clustering.csv"
df = pd.read_csv(path+dataset, sep=",")

In [4]:
df.head() #check that it was correctly loaded

| | Alcohol | Malic_Acid | Ash | Ash_Alcanity | Magnesium | Total_Phenols | Flavanoids | Nonflavanoid_Phenols | Proanthocyanins | Color_Intensity | Hue | OD280 | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


### Data Processing

In [5]:
df.shape #(178, 13)

(178, 13)

In [6]:
df.columns

Index(['Alcohol', 'Malic_Acid', 'Ash', 'Ash_Alcanity', 'Magnesium',
       'Total_Phenols', 'Flavanoids', 'Nonflavanoid_Phenols',
       'Proanthocyanins', 'Color_Intensity', 'Hue', 'OD280', 'Proline'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Alcohol               178 non-null    float64
 1   Malic_Acid            178 non-null    float64
 2   Ash                   178 non-null    float64
 3   Ash_Alcanity          178 non-null    float64
 4   Magnesium             178 non-null    int64  
 5   Total_Phenols         178 non-null    float64
 6   Flavanoids            178 non-null    float64
 7   Nonflavanoid_Phenols  178 non-null    float64
 8   Proanthocyanins       178 non-null    float64
 9   Color_Intensity       178 non-null    float64
 10  Hue                   178 non-null    float64
 11  OD280                 178 non-null    float64
 12  Proline               178 non-null    int64  
dtypes: float64(11), int64(2)
memory usage: 18.2 KB


The dataset appears to have no null values, and all the attributes have numeric data types.

Since there are no categorical or string variables, there is no need to check for misspelled words or encoding problems.
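
Instead of reading the `df.info()` output by eye, the same two facts can be checked programmatically. The following is a minimal sketch: `summarize_quality` is a hypothetical helper, and the small `sample` frame (three columns from the first rows of the dataset) only stands in for the real `df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Stand-in for the real DataFrame; values come from the df.head() output.
sample = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16],
    "Magnesium": [127, 100, 101],
    "Proline": [1065, 1050, 1185],
})


def summarize_quality(df):
    """Count missing values and list the columns that are not numeric."""
    missing = int(df.isna().sum().sum())
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    return missing, non_numeric


missing, non_numeric = summarize_quality(sample)
print(f"missing values: {missing}, non-numeric columns: {non_numeric}")
```

Calling `summarize_quality(df)` on the full dataset should report zero missing values and an empty list if the reading of `df.info()` above is right.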


The column names are representative of their contents. The row values, at least judging by the `.head()` output in the previous section, seem to fall within the expected ranges (except for `Total_Phenols`, which we can assume is expressed in a different unit of measurement).
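
The range comparison can also be done in code rather than by inspection. This is a sketch under a few assumptions: `expected_ranges` copies a subset of the typical ranges from the Pre-Analysis section, `columns_outside_ranges` is a hypothetical helper, and `sample` holds only the five rows printed by `.head()`.

```python
import pandas as pd

# Typical ranges copied from the Pre-Analysis section (a subset, for brevity).
expected_ranges = {
    "Alcohol": (8, 15),
    "Malic_Acid": (0.1, 5),
    "Total_Phenols": (100, 300),
    "Proline": (300, 1680),
}

# The same columns for the five rows shown by df.head().
sample = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16, 14.37, 13.24],
    "Malic_Acid": [1.71, 1.78, 2.36, 1.95, 2.59],
    "Total_Phenols": [2.80, 2.65, 2.80, 3.85, 2.80],
    "Proline": [1065, 1050, 1185, 1480, 735],
})


def columns_outside_ranges(df, ranges):
    """Return the columns with at least one value outside [low, high]."""
    flagged = []
    for col, (low, high) in ranges.items():
        # Series.between is inclusive on both ends by default.
        if not df[col].between(low, high).all():
            flagged.append(col)
    return flagged


flagged = columns_outside_ranges(sample, expected_ranges)
print(flagged)
```

On these five rows only `Total_Phenols` is flagged, which agrees with the exception noted above. Running the helper on the whole `df` would show whether that also holds beyond the first rows.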

However, the attributes are on very different scales, so it would be convenient to normalize the values before we reach the clustering section.
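
One common option is z-score standardization, which rescales every column to zero mean and unit variance so that large-valued attributes such as `Proline` do not dominate distance-based clustering. The sketch below uses plain pandas; `standardize` is a hypothetical helper, the sample values come from the `df.head()` output, and other choices (for example min-max scaling) would be just as legitimate.

```python
import pandas as pd

# Three columns with very different scales, taken from the df.head() output.
sample = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16, 14.37, 13.24],
    "Magnesium": [127, 100, 101, 113, 118],
    "Proline": [1065, 1050, 1185, 1480, 735],
})


def standardize(df):
    """Rescale every column to zero mean and unit (population) variance."""
    # ddof=0 uses the population standard deviation.
    return (df - df.mean()) / df.std(ddof=0)


scaled = standardize(sample)
print(scaled.round(2))
```

With `ddof=0` this should give the same values as scikit-learn's `StandardScaler`, which may be the more convenient choice later on since `sklearn` is already imported for the clustering step.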

### Exploratory Analysis

### Clustering

## Conclusions