# The wine dataset

[The wine dataset](https://archive.ics.uci.edu/dataset/109/wine) is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. It can be downloaded online, or using the following code:

```Python
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)
```

You don't need to, though, as the CSV is already stored in the files folder.

## Data loading

Load the CSV into a NumPy array. Show the first 5 rows of the array and print the shape. 
By default, the values are shown in scientific notation.
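
As a hint for the loading step, here is a minimal sketch of reading comma-separated numbers with np.loadtxt. It reads from an in-memory string so that it runs anywhere; the three rows and the names csv_text and demo are made up for illustration. For the real file you would pass the path to the CSV in the files folder instead, and add skiprows=1 if the file turns out to start with a header line.

```python
import io
import numpy as np

# Hypothetical stand-in for a CSV file: 3 rows, 4 comma-separated columns
csv_text = """1.0,2.5,300.0,0
2.0,3.5,450.0,1
3.0,1.5,900.0,2
"""

# np.loadtxt accepts a file path or a file-like object
demo = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(demo[:2])     # first two rows
print(demo.shape)   # (number of rows, number of columns)
```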

In [None]:
#Up to you



The shape is 178 rows by 14 columns. This means we have 178 wine samples in our dataset (not 178 types of wine: the samples come from just three cultivars).


Explore the data by showing the average value of every column.

If you see this, the means are shown in scientific notation:

![](files/2025-09-17-20-01-43.png)

It's technically correct but rather difficult to interpret. To show these values as plain decimal numbers, you can use the NumPy function np.set_printoptions().
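
A minimal sketch of how this could look, using a small made-up array rather than the wine data. The suppress=True option asks NumPy to print fixed-point numbers instead of scientific notation, and precision limits the number of decimals; the value 2 is just a choice for this example.

```python
import numpy as np

# Made-up data: 3 rows, 3 columns with very different magnitudes
demo = np.array([
    [0.2, 13.0,  700.0],
    [0.4, 12.0,  800.0],
    [0.3, 14.0, 1200.0],
])

# Column-wise averages: axis=0 collapses the rows
means = demo.mean(axis=0)

# Switch off scientific notation and show at most 2 decimals
np.set_printoptions(suppress=True, precision=2)
print(means)
```

The print options stay active for the rest of the session, so one call near the top of the notebook is enough.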



In [None]:
#Up to you



## Normalization

You will notice that the columns have very different value ranges. The second-to-last column has an average of 746, while other columns have an average of less than 1. In later courses, you will learn that some machine learning algorithms (for example k-means clustering or logistic regression) really struggle with this and force you to normalize the data. 

The solution is to rescale numerical data to a standard range so that all features in a dataset have the same scale. That way, a feature with a large range (e.g., salary) doesn't unfairly dominate one with a smaller range (e.g., years of experience). This process of rescaling is called normalization.

When normalizing, we recalculate all values so that they fall between, for example, 0 and 1. In machine learning you'd normally use a scaler, like sklearn.preprocessing.MinMaxScaler or sklearn.preprocessing.StandardScaler, but all such a scaler does is apply some NumPy magic to the columns.   
Let's start with an example:

In [None]:
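import numpy as np  # can be skipped if NumPy was already imported above

# Small example matrix: 5 rows, 3 columns with very different ranges
X = np.array([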
    [ 1, 200,  3],
    [ 2, 180,  6],
    [ 3, 160,  9],
    [ 4, 220, 12],
    [ 5, 240, 15]
], dtype=float)

**Try Min–Max Scaling (to [0,1])**

Rescale each column so its minimum = 0 and maximum = 1.

Tasks:
* Compute the column-wise minimum and maximum with np.min / np.max.
* Apply the formula:

![](files/2025-09-17-20-22-45.png)

* Verify that every column is now between 0 and 1.
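
Min-max scaling computes x' = (x - min) / (max - min) for each column. As a hedged illustration on a different, made-up array (so the exercise itself stays open), the sketch below relies on broadcasting: subtracting a row of column minimums from a 2-D array subtracts it from every row. The names demo, col_min, col_max and scaled are chosen for this example only.

```python
import numpy as np

# Made-up data: 3 rows, 2 columns
demo = np.array([
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
])

col_min = demo.min(axis=0)   # per-column minimum
col_max = demo.max(axis=0)   # per-column maximum

# Broadcasting applies the column-wise values to every row
scaled = (demo - col_min) / (col_max - col_min)
print(scaled)
```

Note that a column in which every value is identical would give max - min = 0 and therefore a division by zero; this sketch does not handle that case.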

In [None]:
#Up to you



**Try Z-Score Standardization (mean = 0, std = 1)**

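Instead of min-max scaling, you could apply Z-score standardization: make each column have mean 0 and standard deviation 1. The resulting numbers are not confined to a fixed range. They are centered on 0, with most values within a few standard deviations of it, and they can be smaller than -1 or larger than +1.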

Tasks:
* Compute the column-wise mean and standard deviation (np.mean, np.std).
* Apply the formula:

![](files/2025-09-17-20-27-40.png)

* Check results:
	* Column means should be very close to 0.
	* Column standard deviation should be 1.
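
Z-score standardization computes z = (x - mean) / std for each column. The sketch below applies it to a made-up array (not the exercise data) and assumes the population standard deviation, which is what np.std computes by default (ddof=0). It also prints the smallest and largest result to show that standardized values are not limited to the range -1 to +1.

```python
import numpy as np

# Made-up data: 3 rows, 2 columns
demo = np.array([
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
])

col_mean = demo.mean(axis=0)   # per-column mean
col_std = demo.std(axis=0)     # per-column population std (ddof=0)

z = (demo - col_mean) / col_std

print(z.mean(axis=0))     # very close to 0 for every column
print(z.std(axis=0))      # 1 for every column, up to rounding
print(z.min(), z.max())   # values beyond -1 and +1 are possible
```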

In [None]:
#Up to you



Which of the two methods above is more sensitive to outliers? (This is a theoretical question, so in the next code block you should only enter text.)

In [None]:
#Up to you
#Answer to theoretical question below:


**Apply Min–Max Scaling (to [0,1]) to the wine dataset**   

Normalize the wine dataset as well!

In [None]:
#Up to you



You may feel a nagging thought in the back of your mind now: we've drastically changed the numbers. How can we make predictions with this? Suppose we have another wine with the following values:

```
13.71,1.86,2.36,16.6,101.0,2.61,2.88,0.27,1.69,3.8,1.11,4.0,1035.0,0
```

Could we transform it into a normalized wine?

(Remember that we still have the row of minimum values and the row of maximum values.)
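
The idea can be sketched on made-up data: a new row is scaled with the minimum and maximum of the original data, not with its own values. The arrays train and new_sample below are hypothetical stand-ins for the wine data and the new wine.

```python
import numpy as np

# Made-up "training" data: 3 rows, 2 columns
train = np.array([
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
])

# Minimum and maximum come from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# A new, unseen row is scaled with those stored values
new_sample = np.array([2.5, 40.0])
scaled_sample = (new_sample - col_min) / (col_max - col_min)
print(scaled_sample)
```

In this sketch the second value of the new row (40) is larger than the training maximum (30), so its scaled value ends up above 1. That is expected: the range 0 to 1 is only guaranteed for the data the minimum and maximum were computed from.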

In [None]:
#Up to you
#New wine row
new_row = np.array([13.71, 1.86, 2.36, 16.6, 101.0, 2.61, 2.88, 0.27, 1.69, 3.8, 1.11, 4.0, 1035.0, 0])



Going back is also important. Can you convert the scaled row back into its original values?
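
Solving x' = (x - min) / (max - min) for x gives x = x' * (max - min) + min. A minimal sketch with made-up numbers; col_min, col_max and scaled_sample are hypothetical values standing in for the ones you computed on the wine data.

```python
import numpy as np

# Hypothetical stored column minimums and maximums
col_min = np.array([1.0, 10.0])
col_max = np.array([3.0, 30.0])

# A row that was min-max scaled with those values
scaled_sample = np.array([0.75, 1.5])

# Invert the scaling: multiply by the range, then add the minimum
restored = scaled_sample * (col_max - col_min) + col_min
print(restored)
```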

In [None]:
#Up to you



## Covariance and correlation

Covariance and correlation are measures of how the data in distinct columns move in a similar or opposite way. 

Covariance is a statistic that measures the relationship between a pair of random variables: it assesses how much the two variables deviate from their mean values together. Note that it describes how the variables vary together, not whether a change in one causes a change in the other.

Using NumPy, you can display a covariance matrix to show the covariance between the columns.

In [None]:
# Calculate the covariance matrix
cov_matrix = np.cov(wine_data, rowvar=False)
print(cov_matrix)

What you get is a 14 by 14 matrix of values. The first row looks like this:

![](files/2025-09-17-21-16-33.png)

These are the covariances between the first column and each of the 14 columns. You notice two main things:
- Some numbers are positive and some numbers are negative
- Some numbers are big, others really small

If a number is negative, it means that when one column goes up, the other tends to go down. If it is positive, both tend to move in the same direction. That is useful information.

The size of the number, on the other hand, tells you very little by itself. We are comparing data on different chemicals, measured in different units, which is as good as comparing apples and oranges. Another analogy would be saying somebody got a "7" on a test. Is that good or bad? There is no way of knowing if you don't know the maximum score for that test. Covariance helps us understand the direction of the relationship but not how strong it is, because the calculated number depends on the units used. That is why covariance is rarely interpreted directly: the covariance matrix just helps you to see in which direction two columns are connected.
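
To make the dependence on units concrete, the sketch below computes the covariance of two made-up columns twice: once as given, and once after multiplying the first column by 1000 (think of switching from grams to milligrams). The data and the unit interpretation are assumptions for illustration only.

```python
import numpy as np

# Made-up measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. a quantity in grams
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# np.cov returns a 2x2 matrix; element [0, 1] is the covariance of x and y
cov_g = np.cov(x, y)[0, 1]

# Same data, but x expressed in a unit that is 1000 times smaller
cov_mg = np.cov(x * 1000, y)[0, 1]

print(cov_g, cov_mg)
```

The relationship between the two columns is identical in both cases, yet the second covariance is 1000 times larger. Only the sign carries over unchanged.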

If you need a measure of both the strength and the direction of the linear relationship between two columns, you calculate the correlation matrix. Correlation is derived from covariance, but standardized so that it always ranges between -1 and 1. Unlike covariance, whose magnitude depends on the units used, correlation gives you a standardized measure for comparing the strength of the relationship between two variables.
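
The standardization divides each covariance by the product of the two columns' standard deviations: corr(i, j) = cov(i, j) / (std_i * std_j). The sketch below checks this on a small made-up array; demo, corr_manual and corr_numpy are names chosen for this example.

```python
import numpy as np

# Made-up data: 5 rows, 3 columns
demo = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 7.0],
    [3.0, 5.0, 8.0],
    [4.0, 4.0, 3.0],
    [5.0, 5.0, 1.0],
])

cov = np.cov(demo, rowvar=False)

# The diagonal of the covariance matrix holds the column variances
std = np.sqrt(np.diag(cov))

# Divide every covariance by the product of the two standard deviations
corr_manual = cov / np.outer(std, std)

corr_numpy = np.corrcoef(demo, rowvar=False)
print(np.allclose(corr_manual, corr_numpy))
```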

In [None]:
# Calculate the correlation matrix
corr_matrix = np.corrcoef(wine_data, rowvar=False)
print(corr_matrix)

The same row now looks like this:

![](files/2025-09-17-21-20-21.png)

Again, we see different values:
- The first value is 1. This is when comparing the column to itself, which is a perfect correlation.
- Negative values: one goes up, the other goes down
- Positive values: the values move in the same direction

What is mainly interesting, though, is the size of the numbers. Explain what correlation values of -1, +1, 0, and everything in between mean. (Enter your answer as text in the code block.)

In [None]:
#Up to you
#Answer to theoretical question below:

