# Exploratory Data Analysis of Red Wine

We are analyzing the **QualityReds** data set, developed by Paulo Cortez. It describes the chemical components and quality of red wine. I chose this data set because I am interested in food science, specifically the chemical composition of food and drink and what makes these products "good." Most of the variables come from objective chemical tests. Quality is a more subjective measure: it is based on sensory data and recorded as the median of three experts' opinions.

The original data set is located at https://www.kaggle.com/piyushgoyal443/red-wine-dataset where you can also find a further description.

Citation:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:

* [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
* [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
* [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib


## Data Loading and Initial Processing

First we will load the needed packages and data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (6,6)

In [3]:
wine = pd.read_csv("data/wineQualityReds.csv")

In [4]:
wine.shape

(1599, 13)

In [5]:
wine.head()

,Unnamed: 0,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
0,1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,2,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,3,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,4,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,5,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [6]:
wine.dtypes

Unnamed: 0                int64
fixed.acidity           float64
volatile.acidity        float64
citric.acid             float64
residual.sugar          float64
chlorides               float64
free.sulfur.dioxide     float64
total.sulfur.dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

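Here I rename some of the columns to simplify their names, replacing spaces and "." characters with underscores so the names are easier to work with later in the analysis. I also rename the unnamed first column to "id" and change the spelling of "sulphates" to "sulfates".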

In [7]:
wine = wine.rename(columns={
    "Unnamed: 0" : "id",
    "fixed.acidity" : "fixed_acidity",
    "volatile.acidity" : "volatile_acidity",
    "citric.acid" : "citric_acid",
    "residual.sugar" : "residual_sugar",
    "free.sulfur.dioxide" : "free_sulfur_dioxide",
    "total.sulfur.dioxide" : "total_sulfur_dioxide",
    "sulphates" : "sulfates",
})

In [8]:
wine.head()

,id,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulfates,alcohol,quality
0,1,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,2,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,3,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,4,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,5,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
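
The dotted names can also be converted programmatically rather than one by one. The following is a minimal sketch, not the approach used in this notebook: it assumes the raw column names from the `dtypes` listing above, and the variable `demo` is an illustrative empty frame.

```python
import pandas as pd

# Column names as they appear in the raw file (from the dtypes listing above).
raw_columns = [
    "Unnamed: 0", "fixed.acidity", "volatile.acidity", "citric.acid",
    "residual.sugar", "chlorides", "free.sulfur.dioxide",
    "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol", "quality",
]
demo = pd.DataFrame(columns=raw_columns)  # empty frame, names only

# Replace every literal "." with "_".
# regex=False stops "." from being treated as "any character".
demo.columns = demo.columns.str.replace(".", "_", regex=False)

# The remaining two changes follow no pattern, so they stay explicit.
demo = demo.rename(columns={"Unnamed: 0": "id", "sulphates": "sulfates"})

print(list(demo.columns))
```

The explicit dictionary used in cell 7 is easier to audit. The pattern-based version mainly pays off when there are many more columns.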


### What is the goal of our analysis?

* To understand how the levels of various chemicals relate to the alcohol content and pH of wine
* To understand how alcohol content and pH influence the quality of wine
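
One possible first look at both questions is a table of pairwise correlations. The sketch below is only an illustration under stated assumptions: it uses the five rows shown by `wine.head()` and a subset of the columns, so the numbers it produces say nothing about the full data set. The names `sample`, `chemicals`, and `targets` are illustrative.

```python
import pandas as pd

# Five rows copied from the wine.head() output above (subset of columns).
sample = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "volatile_acidity": [0.7, 0.88, 0.76, 0.28, 0.7],
    "citric_acid": [0.0, 0.0, 0.04, 0.56, 0.0],
    "pH": [3.51, 3.2, 3.26, 3.16, 3.51],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

chemicals = ["fixed_acidity", "volatile_acidity", "citric_acid"]
targets = ["alcohol", "pH"]

# Goal 1: Pearson correlation of each chemical measure with alcohol and pH.
chem_vs_targets = sample[chemicals + targets].corr().loc[chemicals, targets]

# Goal 2: Pearson correlation of alcohol and pH with quality.
targets_vs_quality = sample[targets + ["quality"]].corr()["quality"].drop("quality")

print(chem_vs_targets)
print(targets_vs_quality)
```

On the full table, the same two expressions would be applied to `wine` instead of `sample`.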

### Data Dictionary

It is important to write down the description and datatypes of the variables.

* id (int64): row identifier, renamed from the unnamed first column of the source file
* fixed_acidity (float64): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
* volatile_acidity (float64): the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste
* citric_acid (float64): found in small quantities, citric acid can add 'freshness' and flavor to wines
* residual_sugar (float64): the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
* chlorides (float64): the amount of salt in the wine
* free_sulfur_dioxide (float64): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
* total_sulfur_dioxide (float64): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
* density (float64): the density of wine is close to that of water, depending on the percent alcohol and sugar content
* pH (float64): describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
* sulfates (float64): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
* alcohol (float64): the percent alcohol content of the wine
* quality (int64): score between 0 and 10
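
A data dictionary can also serve as a set of checks on the table. The sketch below, which is an assumption about how one might do this rather than part of the original analysis, compares stated dtypes with actual ones and tests the two ranges given above (pH from 0 to 14, quality from 0 to 10). It uses a few columns from the `wine.head()` rows; `sample` and `expected_dtypes` are illustrative names.

```python
import pandas as pd

# Five rows copied from the wine.head() output above (subset of columns).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "fixed_acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "pH": [3.51, 3.2, 3.26, 3.16, 3.51],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

# Dtypes as stated in the data dictionary.
expected_dtypes = {
    "id": "int64",
    "fixed_acidity": "float64",
    "pH": "float64",
    "alcohol": "float64",
    "quality": "int64",
}

# Collect every column whose actual dtype differs from the stated one.
mismatched = {
    col: str(dtype)
    for col, dtype in sample.dtypes.items()
    if str(dtype) != expected_dtypes[col]
}

# Range checks taken from the descriptions of pH and quality.
ph_in_range = sample["pH"].between(0, 14).all()
quality_in_range = sample["quality"].between(0, 10).all()

print(mismatched, ph_in_range, quality_in_range)
```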

### Entity Description

Here we describe the possible entities that we can break our dataset into. This will help us think of different ways to slice and group the dataset in later steps.

- id

There is only one entity listed because most of our data is numerical. We will group our data categorically later in the analysis.
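
As a rough sketch of what grouping numerical data categorically can look like, a numeric column can be binned with `pd.cut` and the bins used as group keys. The bin edges and the labels "low", "medium", and "high" below are hypothetical choices for illustration, not the ones used later in the analysis. The values come from the `wine.head()` rows.

```python
import pandas as pd

# quality and alcohol values copied from the wine.head() output above.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

# Hypothetical bands: (-1, 4] -> low, (4, 6] -> medium, (6, 10] -> high.
sample["quality_band"] = pd.cut(
    sample["quality"],
    bins=[-1, 4, 6, 10],
    labels=["low", "medium", "high"],
)

# Mean alcohol per band; observed=True skips bands with no rows.
mean_alcohol = sample.groupby("quality_band", observed=True)["alcohol"].mean()

print(mean_alcohol)
```

With only these five rows, every wine falls into the "medium" band, so the grouped result has a single entry.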

In [9]:
wine.to_csv("data/wine.1.initial_process.csv", index=False)
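
The `index=False` argument matters for the next notebook that reads this file. The sketch below shows the difference on a tiny illustrative frame, written to in-memory buffers instead of disk. It also suggests a likely origin of the "Unnamed: 0" column in the raw file, although that is an assumption about how the source CSV was produced.

```python
import io

import pandas as pd

frame = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'alcohol', 'quality']
print(list(reloaded_without_index.columns))  # ['alcohol', 'quality']
```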