<h1> Red wine data analysis </h1>

<h2>Business Understanding</h2>

The business has asked us to examine the following questions:

<ol>
<li>To what extent can the score of a Portuguese Red be predicted from its chemical characteristics?</li>
<li>To be determined ourselves: consider other types of wine or other countries.</li>
<li>To be determined ourselves: consider, for example, clustering on the chemical characteristics to determine the grape variety or the region. Or can logical clusters be found of top wines, easy-drinking wines, and plonk?</li>
</ol>

At this point, we do not know exactly who 'the business' is or what they want from a data science perspective. This requires further investigation.

<h2> Data Understanding </h2>

The business tells us the most important variables in the dataset are:

<ol>
<li>Origin of the wine and grape variety.</li>
<li>Review of the wine, including the name of the sommelier and the score on a scale of 1 to 100.</li>
<li>The results of chemical tests on 11 values (including sugar content, pH, alcohol content, et cetera).</li>
</ol>

We import some libraries and the dataset to examine the data through code.

In [None]:
import numpy as np
import pandas as pd

In [None]:
dataset = pd.read_csv("redwine.csv", sep=";")

A first look at the dataset:

In [None]:
dataset.head(5)

At first glance, each row seems to correspond to an individual wine, though this needs to be examined more thoroughly. The columns are different attributes of those wines.
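
Whether each row really is one wine can be checked by looking for repeated rows or repeated identifiers. A minimal sketch, assuming the `id` column (used later in this notebook) is meant to be unique per wine; the helper name `summarize_duplicates` and the toy data are made up for illustration:

```python
import pandas as pd


def summarize_duplicates(df: pd.DataFrame, key: str) -> dict:
    """Count fully duplicated rows and repeated values in a key column."""
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_keys": int(df[key].duplicated().sum()),
    }


# Toy stand-in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "title": ["Wine A", "Wine B", "Wine B", "Wine C"],
})
print(summarize_duplicates(toy, key="id"))
```

On the real data this would be called as `summarize_duplicates(dataset, key="id")`. Non-zero counts would suggest that some wines appear more than once.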

<h2>Data Preparation</h2>

The data needs some cleaning up. An overview of datatypes:

In [None]:
dataset.dtypes

pandas (not Jupyter) stores Python str values in columns with the generic 'object' dtype by default. Let's convert the columns to the most specific dtype available, which gives the text columns a dedicated string dtype and allows for more method flexibility:

In [None]:
dataset = dataset.convert_dtypes()
dataset.dtypes
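
To illustrate what `convert_dtypes()` does, here is a small sketch on a toy frame. The column names and values are invented, and the `title` column is built with an explicit `object` dtype to mimic how text columns often appear after `read_csv` (the default may differ between pandas versions):

```python
import pandas as pd

# Toy frame (hypothetical values) with an explicitly object-typed text column.
toy = pd.DataFrame({
    "title": pd.Series(["Quinta A", "Quinta B"], dtype=object),
    "points": [90, 88],
})
print(toy.dtypes)

# convert_dtypes() picks the most specific nullable dtype per column.
converted = toy.convert_dtypes()
print(converted.dtypes)

# String methods are available through the .str accessor.
upper_titles = converted["title"].str.upper().tolist()
```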

Some dtypes in the DataFrame don't correspond to what you'd expect, given what the columns represent. Citric acid, density and alcohol are stored as strings, even though you'd expect a numeric dtype. Let's take care of each one individually:

<h4>Alcohol</h4>

Most alcohol percentages are formatted as valid floats and can be converted automatically. This also allows us to analyse the actual numbers:

In [None]:
dataset["alcohol"] = pd.to_numeric(dataset["alcohol"], errors='coerce')  # Top-level function (no Series method); unparseable strings become missing values.
print(f"""The alcohol-percentage column in the dataset has a minimum of {np.min(dataset['alcohol'])}%.
A maximum of {np.max(dataset['alcohol'])}%.
And {dataset['alcohol'].isnull().sum()} nan entries.""")

The alcohol percentages we were able to extract from the strings seem to fall within the range of what could reasonably be considered 'wine'. We are still left with 10 NaN entries where pandas wasn't able to convert the string to a number. Let's have a look at them.

In [None]:
dataset.loc[dataset["alcohol"].isnull(), ["id", "title", "designation", "alcohol"]]

Considering only 10 wines are missing data, we can probably look these up by hand. The percentages are based on these sources:

<a href="https://www.internationalwinechallenge.com/canopy/beverage_details?wid=44074">1344. </a><a href="https://www.winemag.com/buying-guide/companhia-das-quintas-2009-quinta-da-fronteira-seleccao-do-enologo-red-douro/">1345. </a><a href="https://www.winemag.com/buying-guide/quinta-das-bandeiras-2010-passagem-reserva-red-douro/">1620. </a><a href="https://www.nataliemaclean.com/wine-reviews/quinta-da-rede-reserva-2015/317025">1670. </a><a href="https://www.winemag.com/buying-guide/terra-silvestre-2014-grande-reserva-red-tejo-portuguese-red/">1764. </a><a href="https://www.vivino.com/NL/en/adega-cooperativa-de-borba-alentejo-reserva-tinto/w/1235048?year=2013">1765. </a><a href="https://www.winemag.com/buying-guide/adega-cooperativa-cartaxo-2012-bridao-classico-red-tejo/">1766. </a><a href="https://www.winemag.com/buying-guide/quinta-da-lagoalva-de-cima-2015-lagoalva-barrel-selection-red-tejo/">1794. </a><a href="https://www.winemag.com/buying-guide/quinta-do-casal-branco-2014-lobo-e-falcao-reserva-red-tejo/">1795. </a><a href="https://www.wine-searcher.com/find/ramos+pinto+duas+quinta+rsrv+douro+portugal/2006/netherlands#t2">2028. </a>


In [None]:
# These assignments address rows by index label, not by the 'id' column. They may break if the index is later changed (for example, set to 'id').
dataset.loc[1344, "alcohol"] = 14.0
dataset.loc[1345, "alcohol"] = 14.5
dataset.loc[1620, "alcohol"] = 14.0
dataset.loc[1670, "alcohol"] = 13.5
dataset.loc[1764, "alcohol"] = 13.5
dataset.loc[1765, "alcohol"] = 14.0
dataset.loc[1766, "alcohol"] = 14.0
dataset.loc[1794, "alcohol"] = 14.0
dataset.loc[1795, "alcohol"] = 14.0
dataset.loc[2028, "alcohol"] = 14.0
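
The same corrections can also be written as a single mapping from row label to value, which keeps the looked-up numbers in one place. A sketch on a toy frame; `fill_by_label` is a hypothetical helper and, like the cell above, it addresses rows by index label rather than by the `id` column:

```python
import pandas as pd


def fill_by_label(df: pd.DataFrame, column: str, values: dict) -> pd.DataFrame:
    """Return a copy of df with df.loc[label, column] set for each label."""
    # Guard against unknown labels: .loc assignment with a new label
    # would silently append a row instead of raising an error.
    missing = [label for label in values if label not in df.index]
    if missing:
        raise KeyError(f"Row labels not found in index: {missing}")
    result = df.copy()
    for label, value in values.items():
        result.loc[label, column] = value
    return result


# Toy data (hypothetical values) with two missing percentages.
toy = pd.DataFrame({"alcohol": [12.5, None, 13.0, None]})
filled = fill_by_label(toy, "alcohol", {1: 14.0, 3: 13.5})
print(filled["alcohol"].isnull().sum())
```

The explicit label check is a design choice: it turns a mistyped row label into an error rather than an extra row.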

Let's check how many NaN entries are left in the alcohol column:

In [None]:
dataset["alcohol"].isnull().sum()

(Some other things to look at for later:)

In [None]:
dataset["variety"].value_counts()

In [None]:
dataset["country"].value_counts()
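
For the remaining string columns mentioned under Data Preparation (citric acid and density), it may help to first list which raw values fail numeric conversion before coercing them. A sketch on toy data; `unparseable_values` is a hypothetical helper and the example strings are invented, not taken from the dataset:

```python
import pandas as pd


def unparseable_values(raw: pd.Series) -> pd.Series:
    """Return the non-missing entries of raw that pd.to_numeric cannot parse."""
    parsed = pd.to_numeric(raw, errors="coerce")
    # Failed conversions are NaN after coercion but were not missing before.
    failed = parsed.isna() & raw.notna()
    return raw[failed]


# Toy stand-in for a raw string column (hypothetical values).
toy_density = pd.Series(["0.9978", "0.9968", " - ", "1.001", None, "unknown"])
print(unparseable_values(toy_density))
```

Running this on a raw column before overwriting it shows whether the failures are placeholders, typos, or values that need a manual lookup.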