# In-Class Basics


In [1]:
import numpy as np
import matplotlib.pyplot as plt

**Linear Regression**

The goal of this week's exercise is to explore a simple linear regression problem based on Portuguese white wine.

The dataset is based on 
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. **Modeling wine preferences by data mining from physicochemical properties**. Published in Decision Support Systems, Elsevier, 47(4):547-553, 2009.



**Before we start**

Download the [file] and save it as `winequality-white.csv` in the same directory as the Jupyter notebooks.

The downloaded file contains data on 4898 wines. For each wine, 11 features are recorded (columns 0 to 10). The final column (column 11) contains the quality of the wine. This is what we want to predict.

List of columns/features: 
0. fixed acidity
1. volatile acidity
2. citric acid
3. residual sugar
4. chlorides
5. free sulfur dioxide
6. total sulfur dioxide
7. density
8. pH
9. sulphates
10. alcohol
11. quality



[file]: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

In [2]:
# load all examples from the file
data = np.genfromtxt('winequality-white.csv',delimiter=";",skip_header=1)

print("data:", data.shape)

# Prepare for proper training
np.random.shuffle(data) # randomly sort examples

# take the first 3000 examples for training
X_train = data[:3000,:11] # all features except last column
y_train = data[:3000,11]  # quality column

# and the remaining examples for testing
X_test = data[3000:,:11] # all features except last column
y_test = data[3000:,11] # quality column

print("First example:")
print("Features:", X_train[0])
print("Quality:", y_train[0])


data: (4898, 12)
First example:
Features: [7.700e+00 2.800e-01 6.300e-01 1.110e+01 3.900e-02 5.800e+01 1.790e+02
 9.979e-01 3.080e+00 4.400e-01 8.800e+00]
Quality: 4.0


For more information on how the `data[:3000,:11]` commands work, you can read up on [slicing].

[slicing]: https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.indexing.html
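As a quick illustration of the slicing used above, here is a minimal sketch on a small toy array. The array `toy` and the names `X_head`, `y_head` and `X_tail` are made up for this example and are not part of the exercise; the pattern `array[rows, columns]` is the same one used for `data[:3000,:11]`.

```python
import numpy as np

# Toy array standing in for the wine data: 5 rows (examples), 4 columns.
# The last column plays the role of the target (like "quality").
toy = np.arange(20).reshape(5, 4)

X_head = toy[:3, :3]  # first 3 rows, first 3 columns (the "features")
y_head = toy[:3, 3]   # first 3 rows, last column only (a 1-D array)
X_tail = toy[3:, :3]  # remaining rows, same feature columns

print(X_head.shape, y_head.shape, X_tail.shape)
```

Note that indexing with a single integer column (`toy[:3, 3]`) drops that axis, which is why `y_train` and `y_test` above are one-dimensional.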

# Homework

1. First, we want to understand the dataset better. Plot (`plt.hist`) the distribution of each feature for the training data, as well as the 2D distribution (either `plt.scatter` or `plt.hist2d`) of each feature versus quality. Also calculate the correlation coefficient (`np.corrcoef`) of each feature with quality. Which feature is most predictive of the quality?

2. Calculate the linear regression weights as derived in the lecture. Numpy provides functions for matrix multiplication (`np.matmul`), matrix transposition (`.T`) and inverse (`np.linalg.inv`).

3. Use the weights to predict the quality for the test dataset. How does your predicted quality compare with the true quality of the test data? Calculate the correlation coefficient between predicted and true quality and draw the scatter plot.
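To show how the NumPy functions named above fit together, here is a minimal sketch on synthetic data (not the wine dataset). It assumes the lecture's result is the usual closed-form least-squares solution $w = (X^T X)^{-1} X^T y$; the appended column of ones for an intercept, the random data, and all variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 examples, 3 features, known weights plus noise.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = np.matmul(X, true_w) + 4.0 + 0.1 * rng.normal(size=200)

# Append a column of ones so the last weight acts as an intercept
# (an assumption; the lecture's derivation may handle the offset differently).
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form least squares: w = (X^T X)^{-1} X^T y
w = np.matmul(np.linalg.inv(np.matmul(X_b.T, X_b)), np.matmul(X_b.T, y))

# Predict and compare with the true targets.
y_pred = np.matmul(X_b, w)
r = np.corrcoef(y_pred, y)[0, 1]  # off-diagonal entry of the 2x2 matrix

print("weights:", w)
print("correlation:", r)
```

`np.corrcoef` returns the full correlation matrix, so the coefficient between the two inputs is the `[0, 1]` entry. The same steps should carry over to `X_train`, `y_train`, `X_test` and `y_test`, with the weights fitted on the training data only.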