# **Exploratory data analysis**

This project entails a thorough analysis of the selected data to identify relevant patterns, trends, and insights.

**Proposal:** Choose a dataset of interest and explore its characteristics. Clean and prepare the data, then utilize techniques such as visualization, descriptive statistics, and interactive exploration to identify valuable insights. Highlight notable patterns, trends, or anomalies present in the data.

# **About the dataset**

The chosen dataset contains the results of a chemical analysis of wines grown in the same region of Italy but made from three different grape varieties. The analysis measured the quantities of 13 constituents found in each of the three types of wine.

**Link to the dataset:** https://archive.ics.uci.edu/dataset/109/wine
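
The description above implies that each record holds a class label (the variety) plus 13 measured constituents, 14 fields in total. As a minimal sketch, assuming the raw file from the repository has no header row, the labels can be assigned explicitly when reading it. The `Class` label and the `COLUMN_NAMES` list are names chosen here for illustration, and the two records are rows shown in this notebook's output, inlined so the snippet runs on its own.

```python
import io
import pandas as pd

# Assumed layout: class label first, then the 13 constituents.
COLUMN_NAMES = [
    "Class", "Alcohol", "Malic acid", "Ash", "Alcalinity of ash", "Magnesium",
    "Total phenols", "Flavanoids", "Nonflavanoid phenols", "Proanthocyanins",
    "Color intensity", "Hue", "OD280/OD315 of diluted wines", "Proline",
]

# Two records copied from the notebook output, without a header row.
RAW = """1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840
"""

# header=None tells pandas the first line is data, not column labels.
sample = pd.read_csv(io.StringIO(RAW), header=None, names=COLUMN_NAMES)
print(sample.shape)
print(sample["Class"].tolist())
```

Naming the columns explicitly makes it easier to spot a file whose header does not match its values.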

# **Data cleaning and preparation**

I will begin the analysis by performing data cleaning and preparation. This way, I will be working with a cleaner and more organized dataset.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('wine.csv')

df

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid,phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
2,1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,3,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
174,3,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
175,3,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
176,3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840
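
Before going further, a few routine checks help confirm that the table is ready for analysis: missing values, duplicated rows, and whether each column's values look plausible for its label. The sketch below is an illustration under stated assumptions, not part of the original run: `quick_checks` is a hypothetical helper, the 8 to 17 range for alcohol content is an assumed plausibility bound, and the sample reuses the header and three rows exactly as printed above. In that output the column labelled 'Alcohol' holds the values 1 and 3, which look more like variety labels than alcohol percentages, so the labels of the leading columns may be shifted by one position.

```python
import io
import pandas as pd

# Header and rows copied from the output above; the first field is the index.
PRINTED = """,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid,phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
176,3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840
"""


def quick_checks(frame, column="Alcohol", low=8.0, high=17.0):
    """Hypothetical helper: count missing cells and duplicate rows, and list
    the index labels whose `column` value falls outside [low, high]."""
    out_of_range = frame.index[~frame[column].between(low, high)].tolist()
    return {
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "out_of_range": out_of_range,
    }


sample = pd.read_csv(io.StringIO(PRINTED), index_col=0)
report = quick_checks(sample)
print(report)
```

If the out-of-range list is not empty, comparing the file's header against the dataset documentation would be a reasonable next step.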


A specific column caught my attention after a brief analysis of the dataset.

The column 'Color intensity' provides an index of the color intensity of the wines.

My initial objective is to determine whether any specific component of the wine's composition is associated with higher color intensity.
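
One way to approach this objective, sketched here as an assumption rather than the method used in this notebook, is to compute the Pearson correlation of every numeric column with 'Color intensity' and rank the results by absolute value. `rank_by_correlation` is a hypothetical helper, and the small frame reuses four columns from nine rows printed in this notebook so that the snippet is self-contained. With so few rows the numbers are only illustrative.

```python
import pandas as pd


def rank_by_correlation(frame, target):
    """Hypothetical helper: Pearson correlation of each numeric column with
    `target`, ordered from strongest to weakest by absolute value."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Nine rows copied from the notebook output (last four columns only).
sample = pd.DataFrame({
    "Color intensity": [1.28, 1.74, 1.90, 1.95, 1.95, 10.52, 10.68, 10.80, 11.75],
    "Hue": [0.93, 1.07, 1.71, 0.95, 1.05, 0.56, 0.85, 0.48, 0.57],
    "OD280/OD315 of diluted wines": [3.05, 3.21, 2.87, 3.33, 1.82, 1.51, 1.56, 1.47, 1.78],
    "Proline": [564, 625, 407, 495, 520, 675, 695, 480, 620],
})

ranked = rank_by_correlation(sample, "Color intensity")
print(ranked)
```

On the full table the same call would be `rank_by_correlation(df, 'Color intensity')`.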

In [3]:
df_asc = df.sort_values(by='Color intensity', ascending=True)

df_asc

Unnamed: 0,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid,phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
119,2,12.00,3.43,2.00,19.0,87,2.00,1.64,0.37,1.87,1.28,0.93,3.05,564
89,2,12.08,1.33,2.30,23.6,70,2.20,1.59,0.42,1.38,1.74,1.07,3.21,625
115,2,11.03,1.51,2.20,21.5,85,2.46,2.17,0.52,2.01,1.90,1.71,2.87,407
116,2,11.82,1.47,1.99,20.8,86,1.98,1.60,0.30,1.53,1.95,0.95,3.33,495
59,2,12.37,0.94,1.36,10.6,88,1.98,0.57,0.28,0.42,1.95,1.05,1.82,520
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
153,3,13.23,3.30,2.28,18.5,98,1.80,0.83,0.61,1.87,10.52,0.56,1.51,675
166,3,13.45,3.70,2.60,23.0,111,1.70,0.92,0.43,1.46,10.68,0.85,1.56,695
151,3,12.79,2.67,2.48,22.0,112,1.48,1.36,0.24,1.26,10.80,0.48,1.47,480
159,3,13.48,1.67,2.64,22.5,89,2.60,1.10,0.52,2.29,11.75,0.57,1.78,620


After sorting the data in ascending order, I conducted a brief visual analysis to search for any correlation between the quantities of the wine's components and the different color intensities.

I noticed that wines with higher values in the 'OD280/OD315 of diluted wines' column tend to have a less intense color.
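
This visual impression can be quantified with a correlation coefficient. As a hedged sketch, the snippet below computes Pearson's r and a rank-based (Spearman) coefficient, obtained here as Pearson's r on the ranks, for the nine rows visible in the sorted output above. The variable names are illustrative, and nine rows taken from the two extremes of the table are not representative of the full dataset, so the complete `df` columns should be used in practice.

```python
import pandas as pd

# 'Color intensity' and 'OD280/OD315 of diluted wines' for the nine rows shown.
color = pd.Series([1.28, 1.74, 1.90, 1.95, 1.95, 10.52, 10.68, 10.80, 11.75])
od_ratio = pd.Series([3.05, 3.21, 2.87, 3.33, 1.82, 1.51, 1.56, 1.47, 1.78])

# Pearson measures linear association between the raw values.
pearson = color.corr(od_ratio)

# Spearman measures monotonic association: Pearson applied to the ranks
# (ties receive their average rank).
spearman = color.rank().corr(od_ratio.rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

A negative coefficient would be consistent with the observation that higher ratios accompany weaker color.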

I conducted a brief search and found a document titled **Identification of red wine categories based on physicochemical properties**, which provides additional information about wine compounds as well as a more comprehensive description of the meaning of 'OD280/OD315 of diluted wines'.

This document refers to the three types of Italian wines in the dataset.

Link to the document: https://webofproceedings.org/proceedings_series/ESSP/ETMHS%202019/ETMHS19309.pdf

During my reading of the document, I came across the following sentence: **The higher absorbance ratio of OD280/OD315 indicates high protein purity**.

Based on this sentence, I assume that lower values in the 'OD280/OD315 of diluted wines' column correspond to wines with lower protein purity. Since the lowest ratios in the sorted table appear alongside the highest color intensities, this would suggest that the most intensely colored wines are also the ones with lower protein purity.

In [52]:
columns = df[['Color intensity','OD280/OD315 of diluted wines']]

filtered = columns.sort_values(by='Color intensity', ascending=True)

filtered

Unnamed: 0,Color intensity,OD280/OD315 of diluted wines
119,1.28,3.05
89,1.74,3.21
115,1.90,2.87
116,1.95,3.33
59,1.95,1.82
...,...,...
153,10.52,1.51
166,10.68,1.56
151,10.80,1.47
159,11.75,1.78
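
Scanning two sorted columns by eye gets harder as the table grows. A possible complement, offered as a sketch and not as part of the original analysis, is to split the wines into a lower and an upper half by color intensity and compare the mean OD280/OD315 ratio of each half. The band labels and the `band` column are names introduced here, and the data are the nine rows displayed above.

```python
import pandas as pd

# The nine rows displayed in the two-column output above.
sample = pd.DataFrame({
    "Color intensity": [1.28, 1.74, 1.90, 1.95, 1.95, 10.52, 10.68, 10.80, 11.75],
    "OD280/OD315 of diluted wines": [3.05, 3.21, 2.87, 3.33, 1.82, 1.51, 1.56, 1.47, 1.78],
})

# Split at the median of color intensity into two labelled bands.
sample["band"] = pd.qcut(
    sample["Color intensity"], q=2, labels=["lower half", "upper half"]
)

# Mean OD280/OD315 ratio within each band.
band_means = sample.groupby("band", observed=True)[
    "OD280/OD315 of diluted wines"
].mean()
print(band_means)
```

Applied to the full data, replacing `sample` with `filtered` and raising `q` would give a finer-grained comparison.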
