# Vinho Verde - Exercise

We have just performed linear regression with two variables. Almost all real-world problems you will encounter involve more than two variables, which calls for multiple linear regression.

In this exercise we will use a dataset with variants of the Portuguese *Vinho Verde* wine. We will take into account various input features like fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. Based on these features we will try to predict the quality of the wine.

On Saturday I am having friends over for dinner, and I would like to offer my guests a glass of wine. I am currently shopping and I've got my eye on the following bottle of wine from Portugal. *(It is unbelievable what information can be found on a wine label these days.)*

<table>
    <tr>
        <td>
            volatile acidity: 0.650<br />
            citric acid: 0.00<br />
            residual sugar: 1.2<br />
            chlorides: 0.089<br />
            free sulfur dioxide: 21.0<br />
            density: 1.3946<br />
            pH: 3.39<br />
            sulphates: 0.53<br />
            alcohol: 9.6<br />
        </td>
        <td>
            <img src="./resources/calamares.jpg"  style="height: 250px"/>
        </td>
    </tr>
</table>

Would this wine be a good choice? Can you help me?

## 1. Import and read the data

Import all the required libraries:

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

Import the file *winequality.csv* using Pandas.

In [2]:
wines = pd.read_csv('./files/winequality.csv')


## 2. Explore the data

Can you check the number of rows and columns in our dataset?

In [None]:
print(wines.shape)

What features of the wines are included in the data?

In [None]:
# statistical
# print(wines.describe())

# example values
# print(wines.head())

# only column names
print(wines.columns)

Can you print the quality of the first 25 wines? What values are being used?

In [1]:
print(*wines.quality[:25], sep=", ")



Can you print all different values and count them?

In [None]:
print(wines.quality.value_counts().sort_index())

Can you check on a 2-D graph if there's any relationship between the fixed acidity and the pH of the wine?

In [None]:
import matplotlib.pyplot as plt

# scatter plot: one dot per wine
wines.plot(x='fixed acidity', y='pH', style='o')
plt.title('Fixed acidity vs pH')
plt.xlabel('Fixed acidity')
plt.ylabel('pH')
plt.show()

What can you see in the graph? The higher the fixed acidity, the **lower** the pH. You shouldn't be surprised, since pH is a scale that specifies how acidic a fluid is: the lower the pH, the more acidic the fluid.
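
The visual impression can be backed by a number. The sketch below computes the Pearson correlation coefficient with NumPy. The five value pairs are made up for illustration and are not taken from *winequality.csv*; on the real data, `wines['fixed acidity'].corr(wines['pH'])` should give the equivalent figure.

```python
import numpy as np

# Made-up illustrative values, NOT taken from winequality.csv
fixed_acidity = np.array([5.0, 6.5, 7.4, 8.9, 11.2])
ph = np.array([3.60, 3.45, 3.38, 3.25, 3.05])

# Pearson correlation: -1 is a perfect negative linear relationship,
# +1 a perfect positive one, 0 means no linear relationship
r = np.corrcoef(fixed_acidity, ph)[0, 1]
print("Pearson r =", round(r, 2))
```

A value close to -1 would confirm what the scatter plot suggests: as one variable goes up, the other goes down.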

## 3. Histograms

Let’s check the quality of the wines. Create a histogram. Which quality scores occur most often?

In [None]:
# wines.quality.plot(kind='hist', title='Quality Histogram', bins=int(8/2), figsize=(10, 7));


wines.quality.value_counts().sort_index().plot(kind='bar', title='Quality Histogram');

## 4. Data splitting

Our next step is to divide the data into independent variables (the features) and the dependent variable (the value to be predicted). To make the predictions we use only the following independent variables:

- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- density
- pH
- sulphates
- alcohol

to predict the quality. Create the two datasets, then assign 80% of the data to the training set and 20% to the test set.

In [None]:
X = wines[[
    'volatile acidity',
    'citric acid',
    'residual sugar',
    'chlorides',
    'free sulfur dioxide',
    'density',
    'pH',
    'sulphates',
    'alcohol']]
y = wines['quality'].values

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Print the independent and dependent training set.

In [None]:
print(X_train.head())
print(y_train[:5])
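
To see what `train_test_split` does with its `test_size` and `random_state` arguments, here is a minimal sketch on toy data. The names `X_toy`, `y_toy` and the other variables are made up for this illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 reserves 20% of the rows for testing;
# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)

# Calling it again with the same seed yields the same split
_, _, _, y_te_again = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(np.array_equal(y_te, y_te_again))
```

The rows of the features and the targets are shuffled together, so every sample keeps its own label after the split.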


## 5. Train the model

Now train the model.

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In the case of multiple linear regression, the regression model has to find the optimal coefficients for all the attributes. To see what coefficients our regression model has chosen, execute the following script:

In [None]:
# one row per feature, labelled with the column names used for training
coeff = pd.DataFrame(regressor.coef_, index=X.columns, columns=['Coefficient'])

print(coeff)

This means that, with all other features held constant, a unit increase in *density* increases the predicted quality by 5.17 units. Similarly, a unit decrease in *chlorides* increases the predicted quality by 1.85 units.
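
The interpretation of a coefficient can be checked directly. The sketch below is an illustration on synthetic data with a known relationship, not on the wine data: it fits a model, raises one feature by exactly one unit while keeping the other fixed, and compares the change in the prediction with the fitted coefficient. All names (`X_demo`, `y_demo`, `model`, `base`, `bumped`) are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 3*x1 - 2*x2 + 1 (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 1

model = LinearRegression().fit(X_demo, y_demo)

# Two samples that differ by exactly one unit in the first feature only
base = np.array([[4.0, 5.0]])
bumped = np.array([[5.0, 5.0]])
change = model.predict(bumped)[0] - model.predict(base)[0]

print("coefficients:", model.coef_.round(2))
print("change in prediction:", round(change, 2))
```

Keep in mind that a coefficient is expressed per unit of its own feature. Features measured on very different scales therefore have coefficients whose sizes cannot be compared directly.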

## 6. Predictions

Now that we have trained our model, it’s time to make some predictions. Run the prediction on the test data.

In [None]:
y_pred = regressor.predict(X_test)

Print the actual and predicted values for the first 25 wines from the test set.

In [None]:
compare_df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
print(compare_df.head(25))

Visualize the comparison result as a bar graph. Take only the first 25 results.

In [None]:
compare25_df = compare_df.head(25)
compare25_df.plot(kind='bar', figsize=(16,10))
plt.grid(linestyle=':', linewidth=0.25, color='black')
plt.show()

The final step is to evaluate the performance of the algorithm. Since R² = 1 corresponds to the perfect fit, what can you conclude?

In [None]:
from sklearn import metrics

# compute performance metrics
print("Mean absolute error =", round(metrics.mean_absolute_error(y_test, y_pred), 2))
print("Mean squared error =", round(metrics.mean_squared_error(y_test, y_pred), 2))
print("Root Mean Squared Error =", round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)), 2))
print("R2 score =", round(metrics.r2_score(y_test, y_pred), 2))
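
To make clear what these four numbers measure, the sketch below recomputes them by hand with NumPy from their standard definitions. The actual and predicted scores are made up for illustration; they are not results from the wine model.

```python
import numpy as np

# Made-up actual and predicted quality scores (illustrative only)
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_hat = np.array([5.4, 5.7, 5.2, 6.3, 6.1])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the units of y

# R² = 1 - SS_res / SS_tot: the share of the variance in y_true explained by the model
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Baseline: always predicting the mean of y_true gives R² = 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(mae, mse, rmse, r2, r2_baseline)
```

An R² near 0 therefore means the model does little better than always predicting the average quality, while a value near 1 means it explains most of the variation.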

In [None]:
# answer: value of R²?
# ok?

Finally, can you predict the quality of my wine (details above)? Is it a good wine according to our model?

In [None]:
my_wine = {
            'volatile acidity': [0.650],
            'citric acid': [0.00],
            'residual sugar': [1.2],
            'chlorides': [0.089],
            'free sulfur dioxide': [21.0],
            'density': [1.3946],
            'pH': [3.39],
            'sulphates': [0.53],
            'alcohol': [9.6]
}

my_wine_X = pd.DataFrame.from_dict(my_wine)

print(my_wine_X)

# use a separate name so the test-set predictions in y_pred are not overwritten
my_wine_pred = regressor.predict(my_wine_X)


In [None]:
print(my_wine_pred)