# Selected Topics in Statistics ICA 2 - Patrick Leask

## Task A1

*1) Can we use the sommelier/ wine data to create an AI with super-human performance in wine tasting?*


*2) Which components of wine make a wine a good wine?*
- There may be interactions between components of wine that make it impossible to establish how variations in a single component affect the score without knowledge of the dependencies between the components.
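
To illustrate the point about interactions, here is a minimal sketch on synthetic data (not the wine data). The two "components" `x1` and `x2`, the noise level and the helper `r_squared` are all assumptions made for the illustration. The score depends only on the product of the two components, so an additive model that looks at each component separately explains almost none of the variation, while a model that includes the interaction term explains nearly all of it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two hypothetical, centred "components" (synthetic, not columns of the wine data).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Assumed score: driven purely by the interaction, plus a little noise.
score = x1 * x2 + rng.normal(scale=0.1, size=n)


def r_squared(design, target):
    """Fit ordinary least squares and return the proportion of variance explained."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1 - residual.var() / target.var()


ones = np.ones(n)
additive = np.column_stack([ones, x1, x2])
with_interaction = np.column_stack([ones, x1, x2, x1 * x2])

r2_additive = r_squared(additive, score)
r2_interaction = r_squared(with_interaction, score)
print(r2_additive, r2_interaction)
```

In this constructed case the marginal effect of `x1` on the score is positive when `x2` is positive and negative when `x2` is negative, so no single "effect of `x1`" exists without reference to `x2`.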

*3) Can the AI use the data to create the perfect wine, i.e. wine whose quality exceeds all that we have seen?*
- As in the second question, I expect there to be complex interactions between the components of wine that do not allow extrapolation to regions that the AI does not have data for.
- The factors in this data set are unlikely to be the only determinants of wine quality. If we take water and add chemicals to it until we match the levels found in Chateau Lafite Rothschild, we will still not have created a wine. Even when starting with wine, rebalancing the properties measured in the data will not necessarily create a better wine.
- The question asks whether, given the data, the AI can create the perfect wine. This is poorly worded, as even an entirely random wine-generating process *can* create the perfect wine. A more precise question is whether the AI would know the ranges of values that would result in a rating of 10. We cannot answer this with the data provided. Even with infinite data, the rating must still be treated as a random variable, so we cannot say with certainty that the wine would receive a higher rating (see the next question).
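
The last point can be sketched with a small simulation. Everything here is an assumption made for illustration: the latent qualities (6 and 7), the normally distributed taster noise with standard deviation 0.9, the 0 to 10 integer scale, and the helper `simulate_ratings`. None of it is estimated from the wine data. The sketch only shows that when ratings are noisy, a wine with higher underlying quality is not guaranteed to receive the higher rating in any given tasting.

```python
import numpy as np

rng = np.random.default_rng(1)


def simulate_ratings(latent_quality, noise_sd, n_tastings, rng):
    """Draw integer ratings on a 0-10 scale around a latent quality (assumed noise model)."""
    noisy = latent_quality + rng.normal(scale=noise_sd, size=n_tastings)
    return np.clip(np.rint(noisy), 0, 10).astype(int)


n_tastings = 10_000
ratings_a = simulate_ratings(6.0, 0.9, n_tastings, rng)  # hypothetical existing wine
ratings_b = simulate_ratings(7.0, 0.9, n_tastings, rng)  # hypothetical "improved" wine

# Proportion of paired tastings in which the better wine gets the strictly higher rating.
p_b_higher = np.mean(ratings_b > ratings_a)
print(p_b_higher)
```

Under these assumed values the better wine wins only a moderate majority of head-to-head tastings, with the remainder being ties or reversals.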

*4) Is human perception of wine entirely subjective? If so, what would it be that AIs could learn from humans?*
- Human tastes are highly subjective.

In [87]:
%matplotlib inline
import pandas as pd
import numpy as np

pd.set_option('display.precision', 3)

wine_types = ['red', 'white']
# Load the red and white data sets and tag each row with its colour
all_data = pd.concat([
    pd.read_csv("./winequality/winequality-{0}.csv".format(wine_type), sep=';').assign(colour=wine_type)
    for wine_type in wine_types
])
all_data.head()

numeric_titles = list(all_data)
numeric_titles.remove('colour')
numeric_titles.remove('quality')

## Task A2

In [88]:
data_description = all_data.describe()

display(data_description)

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215 | 0.34 | 0.319 | 5.443 | 0.056 | 30.525 | 115.745 | 0.995 | 3.219 | 0.531 | 10.492 | 5.818 |
| std | 1.296 | 0.165 | 0.145 | 4.758 | 0.035 | 17.749 | 56.522 | 0.003 | 0.161 | 0.149 | 1.193 | 0.873 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.987 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.992 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.995 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.997 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.039 | 4.01 | 2.0 | 14.9 | 9.0 |


In [94]:
import plotly
from plotly import tools
import plotly.graph_objs as go

plotly.offline.init_notebook_mode(connected=True)

def col_hist(col_name):
    """
    Plots overlaid histograms of the column for red and white wines,
    normalised so that bar heights are proportions within each colour.
    """
    col_max = all_data[col_name].max()
    col_min = all_data[col_name].min()

    step = (col_max - col_min) / 15

    trace1 = go.Histogram(
        x = all_data[all_data['colour']=='red'][col_name], 
        name = 'red'.title(),
        opacity = 0.75,
        xbins={
            'start': col_min,
            'end': col_max,
            'size': step
        },
        histnorm='probability', 
        marker={
            'color':'#900020'
        }
    )

    trace2 = go.Histogram(
        x = all_data[all_data['colour']=='white'][col_name],
        name = 'white'.title(),
        opacity = 0.75,
        xbins={
            'start': col_min,
            'end': col_max,
            'size': step
        },
        histnorm='probability', 
        marker={
            'color':'#D1B78F'
        }
    )

    histogram_data = [trace1, trace2]
    layout = go.Layout(
        xaxis={
            'title': col_name.title()
        },
        yaxis={
            'title':'Proportion'
        },
        bargap=0.2,
        bargroupgap=0.1
    )
    this_fig = go.Figure(data=histogram_data, layout=layout)
    plotly.offline.iplot(this_fig)

#hist_plots = [col_hist(col_name) for col_name in list(data_description)]

Some of the histograms indicate that it may be useful to transform the data. Following the law of mass action, we should transform all chemical balance ratios with the logarithm. This should improve performance for additive models: such models respond to absolute changes in a feature, and taking logs turns proportional changes into absolute ones. The exceptions are listed below, as they already appear to be approximately normally distributed, or at least not strongly right-skewed.
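
As a quick check of what the log transform does to a right-skewed variable, here is a minimal sketch on synthetic data. The lognormal sample `concentration` is a stand-in I have made up for a strictly positive chemical measurement; it is not a column of the wine data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic, strictly positive, right-skewed values (assumed stand-in for a concentration).
concentration = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=5000))

# Sample skewness before and after taking logs.
skew_raw = concentration.skew()
skew_log = np.log(concentration).skew()
print(skew_raw, skew_log)
```

The raw sample has a long right tail, while its logarithm is roughly symmetric. Note that `np.log` needs strictly positive inputs; citric acid has a minimum of 0.0 in the summary table, so it could not be log-transformed directly, and `np.log1p` would be one option for such a column.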

In [90]:
log_exceptions = [
    'volatile acidity',
    'total sulfur dioxide',
    'density', 
    'pH',
    'alcohol', 
    'citric acid'
]

remaining = [title for title in numeric_titles if title not in log_exceptions]

# Log-transform the remaining columns (all strictly positive in the summary table)
all_data[remaining] = all_data[remaining].apply(np.log)

#hist_plots = [col_hist(col_name) for col_name in list(data_description)]

In [92]:
def col_scatter(col_name):
    """
    Plots a scatter plot of quality against the column, coloured by wine type.
    """
    
    trace1 = go.Scattergl(
        x = all_data[all_data['colour']=='red'][col_name],
        y = all_data[all_data['colour']=='red']['quality'],
        name = 'red'.title(),
        mode = 'markers',
        marker={
            'size': 10,
            'color':'#900020',
            'opacity': 0.1
        }
    )

    trace2 = go.Scattergl(
        x = all_data[all_data['colour']=='white'][col_name],
        y = all_data[all_data['colour']=='white']['quality'],
        name = 'white'.title(),
        mode = 'markers',
        marker={
            'size': 10,
            'color':'#D1B78F', 
            'opacity': 0.1
        }
    )

    scatter_data = [trace1, trace2]
    layout = go.Layout(
        xaxis={
            'title': col_name.title()
        },
        yaxis={
            'title':'Quality'
        }
    )
    this_fig = go.Figure(data=scatter_data, layout=layout)
    plotly.offline.iplot(this_fig)
    
#scatter_plots = [col_scatter(col_name) for col_name in list(data_description)]

In [None]:
import matplotlib.pyplot as plt

#axes = pd.plotting.scatter_matrix(all_data, alpha=0.2)
#plt.tight_layout()
#plt.savefig('scatter_matrix.png')
