<h1>Research validity</h1>

In our main Jupyter Notebook we found evidence suggesting that the provided dataset contains serious errors and is not reliable enough to draw conclusions from. Combined with the fact that we have not been able to find a clear source for the dataset, this led us to create this notebook to investigate and elaborate on our concerns.

Let's import the necessary Python libraries, then load the dataset and take a preliminary look at it:


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
dataset = pd.read_csv("datasets/redwine.csv", sep=";")
dataset.head(10)

In [None]:
dataset.info()

The dataset consists of the following 22 columns, sorted into categories and not in their original order:

id: A column that duplicates the automatically generated pandas index. Anyone planning to use the dataset seriously should probably drop it or use it as the index.

Columns that describe a wine's origin:
-Country (notably always Portugal)
-Province
-Winery

Columns that describe a wine name-wise:
-Designation
-Title

Columns that describe what type of wine a wine is:
-Variety (notably always Portuguese red)

Columns related to a review of a wine:
-Taster name
-Points
-Price (could perhaps be listed under another category, but I suspect the listed price is the price that the reviewer paid).

A wine's chemical properties:
-Fixed acidity
-Volatile acidity
-Citric acid
-Residual sugar
-Chlorides
-Free sulfur dioxide
-Total sulfur dioxide
-Density
-pH
-Sulphates
-Alcohol
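Two of the observations in the list above (the redundant `id` column and the constant country and variety columns) are easy to check in code. The following is a minimal sketch on a made-up three-row sample, not on the real file. The lowercase column names `country` and `variety` are assumptions based on the lowercase `title` and `alcohol` names used elsewhere in this notebook; only the `;` separator is taken from the actual loading code.

```python
import io
import pandas as pd

# Hypothetical sample rows; the wine names and the lowercase
# "country"/"variety" column names are assumptions for illustration.
sample_csv = (
    "id;country;variety;title;alcohol\n"
    "0;Portugal;Portuguese Red;Wine A;9.7\n"
    "1;Portugal;Portuguese Red;Wine B;9.6\n"
    "2;Portugal;Portuguese Red;Wine C;12.3\n"
)

# Use "id" as the index so it no longer duplicates the automatic pandas index.
sample = pd.read_csv(io.StringIO(sample_csv), sep=";", index_col="id")

# Columns with a single distinct value carry no information for an analysis.
distinct_counts = sample.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)
```

On the real data the same two steps would apply to `datasets/redwine.csv` directly, provided the column names match.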

We were given several questions to look at, including the (possible) relationship between the chemical properties of a wine and its review score.

Our suspicions began when we were tasked with combining our dataset with a second one in a meaningful way that provides more insight. Searching the internet pointed us towards many datasets that resembled ours, but none were quite the same. Some had everything except the chemical property columns; others had only the chemical properties, in anonymized form (no wine names), along with some attributes that our dataset doesn't contain.

After searching for the original source of the dataset, we came to the conclusion that it was constructed from the following two sources: <a href="https://www.kaggle.com/zynicide/wine-reviews">Dataset wine reviews</a> and <a href="https://archive.ics.uci.edu/ml/datasets/wine+quality">Dataset chemical values</a>.
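One way to test this conclusion would be to check how many rows of our dataset have chemical values that also occur, row for row, in the UCI file. The sketch below illustrates the idea with two small hypothetical data frames (`ours` and `uci` are stand-ins, and their values are made up). On the real data the comparison would use all eleven chemical columns, assuming they are named identically in both files.

```python
import pandas as pd

# Hypothetical stand-ins for our dataset and the UCI red wine file.
ours = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "pH": [3.51, 3.20, 3.26],
    "alcohol": [9.4, 9.8, 10.1],
})
uci = pd.DataFrame({
    "pH": [3.51, 3.20, 3.16],
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [5, 5, 6],
})

chemical_columns = ["pH", "alcohol"]

# Left-join on the chemical columns; "_merge" records whether a match was found.
# Duplicates are dropped on the UCI side so each of our rows appears exactly once.
matched = ours.merge(
    uci[chemical_columns].drop_duplicates(),
    on=chemical_columns,
    how="left",
    indicator=True,
)
share_found = (matched["_merge"] == "both").mean()
print(f"{share_found:.0%} of our rows also occur in the UCI stand-in")
```

A share close to 100% on the real files would support the idea that the chemical columns were copied from the UCI data. Matching on floating-point values is only reliable here if both files store the numbers with the same precision.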

The Sjoerd part


Correlation between numerical attributes:
![](images/correlation_heatmap.png)
In the correlation matrix above we can see a couple of noteworthy correlations.
On closer inspection, however, every significant correlation is between two attributes that both come from the wine review dataset or both come from the chemical values dataset; none of them links the two groups.
This raises some questions, since we expected to find a correlation between the chemical properties of a wine and its review score.

Merge conflicts are currently a big problem, so here's a temporary line to separate the parts we can each work on.

<h2>------------------------------------------The line-------------------------------------</h2>

The Maarten part

To be absolutely sure that these datasets were merged without any real-world overlap, we decided to do some manual checking. Alcohol percentage allegedly comes from the 'chemical' dataset, but we can look it up online for the wine names in the 'left' (review) dataset to see whether the values correspond to each other. Let's have a look at the dataset:

In [None]:
dataset[["title", "alcohol"]].head(10)

And let's look up some of the alcohol percentages:

Quinta dos Avidagos 2011 Avidagos Red (Douro)
According to the dataset: 9.7%
According to <a href="https://www.wine-searcher.com/find/quinta+dos+avidagos+tinto+douro+portugal/2011/netherlands#t2">the internet</a>: between 13% and 14%

Casa Santa Vitória 2013 Grande Reserva Tinto Red (Alentejano)
According to the dataset: 9.6%
According to <a href="https://www.wine-searcher.com/find/casa+de+santa+vittoria+grand+rsrv+regional+alentejano+alentejo+portugal/2013#t2">the internet</a>: 14%

Luis Duarte 2013 Monte de Carrapatelo Colheita Seleccionada Tinto Red (Alentejano) (index 5):
According to the dataset: 12.3%
According to <a href="https://www.specialtycellars.com/wp-content/uploads/2020/08/Monte-De-Carrapatelo-Tinto-2013.pdf">the internet</a>: 14.5%
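
The three manual look-ups above can be put side by side to make the gap easier to see. This is a small sketch that only restates the numbers already listed. The column names `alcohol_dataset` and `alcohol_external` are our own, and for the first wine, where the external source gives a range of 13% to 14%, the lower bound is used so the difference is not overstated.

```python
import pandas as pd

# Values copied from the three manual look-ups above. For the first wine the
# external source gives a 13-14% range, so the lower bound is used here.
checks = pd.DataFrame({
    "title": [
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
        "Casa Santa Vitória 2013 Grande Reserva Tinto Red (Alentejano)",
        "Luis Duarte 2013 Monte de Carrapatelo Colheita Seleccionada Tinto Red (Alentejano)",
    ],
    "alcohol_dataset": [9.7, 9.6, 12.3],
    "alcohol_external": [13.0, 14.0, 14.5],
})

# Difference in percentage points between the external value and the dataset value.
checks["difference"] = (checks["alcohol_external"] - checks["alcohol_dataset"]).round(1)
print(checks[["alcohol_dataset", "alcohol_external", "difference"]])
```

In all three cases the dataset value is more than two percentage points below the externally listed one.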