# Investigating factors related to quality in Portuguese Wines

This dataset contains the results of 11 common physicochemical tests on a selection of 6,497 wines from Northwest Portugal. It was assembled with the goal of helping understand what factors impact quality in wine, something I'll also be looking into in this notebook.

URL: https://archive.ics.uci.edu/dataset/186/wine+quality

In [5]:
import pandas as pd

from ucimlrepo import fetch_ucirepo

First, let's retrieve the data and take a look at it.

In [6]:
wine_quality_dataset = fetch_ucirepo(id=186) # This line may hang for a bit, it's a ~80mb data set

wine_quality_dataset.data.keys()


dict_keys(['ids', 'features', 'targets', 'original', 'headers'])

So we see that there are 5 parts to this data set.
- `ids` is `None` in this instance
- `features` is the physicochemical test results without the ratings
- `targets` is just the ratings on their own
- `original` is all of the data together
- `headers` is all of the header values for the `original` data set

Now, if we were training a model to predict quality, keeping our data separated into `features` and `targets` would be useful. But since we're just exploring, we're going to focus on the whole dataset, as stored in `original`.
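To make the relationship between these frames concrete, here is a minimal sketch using a small hand-built stand-in rather than the real download. The values and the names `features`, `targets`, and `rebuilt` are illustrative assumptions; with the real data you would start from `wine_quality_dataset.data.features` and `wine_quality_dataset.data.targets`.

```python
import pandas as pd

# Hand-built stand-in data (assumption): two feature columns and the rating
features = pd.DataFrame({"fixed_acidity": [7.4, 7.8, 7.8], "alcohol": [9.4, 9.8, 9.8]})
targets = pd.DataFrame({"quality": [5, 5, 5]})

# Joining column-wise on the shared row index gives one combined frame,
# which is the shape of data we want for exploration
rebuilt = pd.concat([features, targets], axis=1)

print(rebuilt.shape)
print(list(rebuilt.columns))
```

For modelling you would keep the two frames apart; for exploration a single combined frame is easier to summarise and plot.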

In [7]:
wine_quality = wine_quality_dataset.data.original

wine_quality.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | color |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |


As we see, it contains a row index, 11 physicochemical test results, the rating, and the colour, for 14 columns in total as displayed (13 data columns plus the index).

Let's take a look at some of those numbers now:

In [8]:
wine_quality.describe()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 |


The first thing I notice is the counts: every column has the same count, and it's the number we expect from the description of the dataset. This means there are no missing values. Secondly, I notice that the `color` column is gone: as a categorical variable, it's not included in `describe` by default.
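Both observations can also be checked directly rather than read off the summary table. The sketch below uses a tiny stand-in frame (the name `sample` and its values are assumptions for illustration); the same calls should apply to `wine_quality` unchanged.

```python
import pandas as pd

# Stand-in frame (assumption): two numeric columns and one text column
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [5, 5, 6],
    "color": ["red", "red", "white"],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()

# The columns describe() skips by default are the non-numeric ones
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

# include="all" asks describe() to summarise every column, numeric or not
full_summary = sample.describe(include="all")

print(missing_per_column)
print(non_numeric)
```

Counting `isna()` per column is a more explicit check than comparing counts by eye, and `include="all"` brings the categorical column back into the summary.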

Let's check how many of each color we have.

In [None]:
colors = wine_quality.color

colors.value_counts()

color
white    4898
red      1599
Name: count, dtype: int64

So we see that there are roughly three times as many whites as reds in this dataset, and that the sum is what we expect.
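As a quick sanity check on that arithmetic, here is a small sketch that rebuilds the counts by hand (the `counts` Series is typed in from the output above, not read from the dataset) and computes the total, the white-to-red ratio, and each colour's share.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above
counts = pd.Series({"white": 4898, "red": 1599}, name="count")

total = counts.sum()                      # should match the 6,497 rows
ratio = counts["white"] / counts["red"]   # whites per red
shares = counts / total                   # fraction of the dataset per colour

print(total, round(ratio, 2))
print(shares.round(3))
```

On the real column, `colors.value_counts(normalize=True)` returns the shares directly.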

There are some other things that jump out from that description, but they're hard to really understand from summary statistics alone. So let's try some visualisations.