- Going through the entire data analysis process to analyze a dataset on the chemical properties of wine and their associated quality readings
- Learning how to work with data using Pandas, Numpy and Matplotlib
- Data always requires a significant amount of work, to make it suitable for analysis like cleaning, feature engineering and visualizing. These packages make work faster and more efficient

- We will be analysing two datasets, one on red wine samples and the other on white wine samples from the North of Portugal
- Each wine sample comes with quality rating from one to ten and results from several physicl chemical tests

In [36]:
import pandas as pd
red_df = pd.read_csv('dataset/winequality-red.csv')
white_df = pd.read_csv('dataset/winequality-white.csv',)

In [37]:
red_df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free sulfur dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [38]:
white_df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


1. If you are the owner of a vineyard or a wine store and you are given this data, how might you use it
- Relevant Questions;
    * What chemical properties are most important on predicting the quality of wine?
    * Is a certain type of wine (red or white) associated with higher quality?
    * Do wines with higher alcoholic content receive better ratings?
    * Do sweeter wines (more residual settings) receive better ratings?
    * What level of acidity is associated with the highest quality?
2. How would you go about answering this questions? Which part of the dataset might you use?
- __Things to think about__
- Are there ways you could modify or represent the data differently to help you answer this question

### Assessing Data

In [32]:
print(red_df.shape)
red_df.info()

(1599, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [24]:
white_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [34]:
# checking for null values
red_df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [29]:
# number of duplicate rows in the white dataset
# sum(white_df.duplicated())
white_df.duplicated().sum()

240

In [30]:
# unique values in the red wine dataset
red_df.nunique()

fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
quality                   6
dtype: int64

In [31]:
white_df.nunique()

fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
quality                   6
dtype: int64

In [35]:
# mean density in the red wine dataset
red_df.density.mean()

0.9967466791744831

#### Appending and Markdown
- To analyse our data more efficiently lets combine the red and white datasets into one data frame
- Add a feature to each data frame indicating whether the wine is red or white to preserve each characteristic for each sample when the data frames are combined
- One way we can do this is by creating an array using numpy and add that as a column into each data frame

__Numpy__
- Short for Nnumerical python and is designed for efficient scientific computation. It's built on top of C

__Pandas__
- Built on top of numpy

In [40]:
import numpy as np
import pandas as pd

red_df = pd.read_csv('dataset/winequality-red.csv')
white_df = pd.read_csv('dataset/winequality-white.csv')
print(red_df.shape[0])


1599


In [None]:
# renaming a column
red_df.rename(columns={'free sulphur dioxide': 'free_sulphur_dioxide'}, inplace=True)

__create color columns__
- create two arrays as long as the number of rows in the red and white dataframes that repeat the value red or white

In [42]:
# create colour array for red data frame
color_red = np.repeat("red", red_df.shape[0])

# colour array for white dataframe
color_white = np.repeat("white", white_df.shape[0])

In [43]:
# Adding the array to the red data frame by creating a column called color
red_df['color'] = color_red
red_df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free sulfur dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


In [44]:
# adding color array to the white data frame
white_df['color'] = color_white
white_df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,white
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,white
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,white
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,white
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,white


In [49]:
# combining the two data frames
# wine_df = pd.concat([red_df, white_df])
wine_df = red_df.append(white_df)
wine_df.head()

  wine_df = red_df.append(white_df)


Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free sulfur dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color,free_sulfur_dioxide
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,


In [47]:
# saving the combined dataset
wine_df.to_csv('dataset/wine_data.csv', index=False)