## Midterm Assignment

**1) Understanding and explaining the data set.**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
wine_quality = pd.read_csv('winequality-red.csv')

In [3]:
wine_quality.head()

Unnamed: 0,"fixed acidity;""volatile acidity"";""citric acid"";""residual sugar"";""chlorides"";""free sulfur dioxide"";""total sulfur dioxide"";""density"";""pH"";""sulphates"";""alcohol"";""quality"""
0,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
1,7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
2,7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...
3,11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...
4,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5


In [4]:
wine_quality.shape

(1599, 1)

The data set describes red wines through chemical measurements such as acidity and chloride concentration, together with an overall quality score. It contains 1599 instances. As loaded above, every row is read into a single column, so the file must be re-read with the correct delimiter to form a proper dataframe. Our model should predict quality from the 11 attributes that are given.

**2) Processing data, cleaning up.**

In [5]:
# the csv is being read as one column, need to convert to 12
# use ';' as delimiter when loading

wine_quality = pd.read_csv('winequality-red.csv', sep=';')

In [6]:
# now shape should be 1599x12

wine_quality.shape

(1599, 12)

In [7]:
# check for missing values

wine_quality.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [8]:
# check data types

wine_quality.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

In [9]:
wine_quality.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


There is no missing data to deal with in this data set, and all attributes are numerical. The dataframe now has all 12 columns, as stated in the README for the data set.
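
The delimiter issue seen in sections 1 and 2 can be reproduced on a tiny in-memory sample. This is only a sketch: the three-column `sample` string is made up for illustration and is a simplified stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited sample (made-up subset of columns)
sample = "fixed acidity;alcohol;quality\n7.4;9.4;5\n7.8;9.8;5\n"

# Default comma separator: each row is read as one column
wrong = pd.read_csv(io.StringIO(sample))

# Semicolon separator: one column per attribute
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
```

Comparing the two shapes is a quick way to confirm that the separator matches the file.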

**3) Dividing your data into a training and test set.**

In [10]:
# define a function to split the data set into train/test sets
# assumes target variable is the last column in dataframe

def train_test_split(df, split, shuffle=True):
    # array of row positions, one per row of the dataframe
    indexes = np.arange(len(df))
    
    # shuffle the row positions in place
    if shuffle:
        np.random.shuffle(indexes)
    
    # 'split' is the fraction of rows that goes to the train set,
    # so n is the number of train rows
    n = round(len(df) * split)
    
    # first n (shuffled) rows form the train set, the rest the test set
    train = df.iloc[indexes[:n], :]
    test = df.iloc[indexes[n:], :]
    
    # then divide the x and y portions of train and test sets
    x_train = train.drop(train.columns[-1], axis=1)
    y_train = train[train.columns[-1]]
    x_test = test.drop(test.columns[-1], axis=1)
    y_test = test[test.columns[-1]]
    
    return x_train, y_train, x_test, y_test

In [11]:
# now set split to 75% train, 25% test and divide data

x_train, y_train, x_test, y_test = train_test_split(wine_quality, 0.75)

In [12]:
# double check train_test_split divided correctly

x_train.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
72,7.7,0.69,0.22,1.9,0.084,18.0,94.0,0.9961,3.31,0.48,9.5
133,6.6,0.5,0.01,1.5,0.06,17.0,26.0,0.9952,3.4,0.58,9.8
1243,8.3,0.56,0.22,2.4,0.082,10.0,86.0,0.9983,3.37,0.62,9.5
1430,7.4,0.41,0.24,1.8,0.066,18.0,47.0,0.9956,3.37,0.62,10.4
1391,8.0,0.64,0.22,2.4,0.094,5.0,33.0,0.99612,3.37,0.58,11.0


In [13]:
x_train.shape

(1199, 11)

In [14]:
y_train.head()

72      5
133     6
1243    5
1430    5
1391    5
Name: quality, dtype: int64

In [15]:
y_train.shape

(1199,)

In [16]:
x_test.shape

(400, 11)

In [17]:
y_test.shape

(400,)

The data were split into 75% train (1199 instances) and 25% test (400 instances). The "quality" column, our dependent variable, was separated from the 11 attributes so it can be used as the target.
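
Because the split above shuffles with NumPy's global random state, each run of the notebook produces a different partition. A minimal sketch of a reproducible variant follows. The name `seeded_split`, its `seed` parameter, and the `toy` dataframe are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def seeded_split(df, split, seed=0):
    """Shuffle row positions with a seeded generator, then split."""
    rng = np.random.default_rng(seed)
    indexes = rng.permutation(len(df))
    n = round(len(df) * split)
    train, test = df.iloc[indexes[:n]], df.iloc[indexes[n:]]
    target = df.columns[-1]  # assumes the target is the last column
    return (train.drop(columns=target), train[target],
            test.drop(columns=target), test[target])

# Toy frame standing in for wine_quality (values are made up)
toy = pd.DataFrame({"a": np.arange(8.0),
                    "b": np.arange(8.0) * 2,
                    "quality": np.arange(8)})

x_tr, y_tr, x_te, y_te = seeded_split(toy, 0.75, seed=1)
x_tr2, _, _, _ = seeded_split(toy, 0.75, seed=1)

print(x_tr.shape, x_te.shape)
print(x_tr.index.equals(x_tr2.index))  # same seed, same rows
```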

**4) Choosing the relevant algorithm.**

Because our target variable is a numerical score, which I treat as a continuous quantity rather than a binary or multi-class label, this task calls for a regression algorithm. With only 11 attributes and 1599 instances, computational complexity is not a concern. I will use linear regression to perform the learning.
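
For reference, ordinary least-squares linear regression predicts with a weighted sum of the attributes and has a closed-form solution for the weights. Writing $X_b$ for the attribute matrix with a leading column of ones, $y$ for the vector of quality scores, and $\omega$ for the weights (notation chosen here to match the code in the next section):

$$\hat{y} = X_b\,\omega, \qquad \omega = \left(X_b^{\top} X_b\right)^{-1} X_b^{\top} y$$

The inverse exists only when the columns of $X_b$ are linearly independent.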

**5) Writing a python code to perform learning.**

In [18]:
# make a "fit" function to return the weights for regression line

def linreg_fit(x_train, y_train, x_copy=True):
    
    # work on a copy of x_train unless the caller opts out
    # (previously x was undefined when x_copy=False)
    x = x_train.copy() if x_copy else x_train
    
    # convert x to numpy array; to_numpy() returns a new array,
    # so the result has to be assigned
    x = x.to_numpy()
    
    # calculate length of x for dummy variable creation
    length = len(x)
    
    # add dummy variable x0=1 to all instances
    x_b = np.c_[np.ones((length, 1)), x]
    
    # calculate weights with the normal equation
    weights = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y_train)
    
    return weights

In [19]:
ω = linreg_fit(x_train, y_train)
print(ω)

[ 5.47297914e+00 -5.60198525e-03 -1.12184086e+00 -1.48092810e-01
  8.25041806e-03 -2.45051697e+00  6.25420092e-03 -4.48348967e-03
 -3.11402050e-01 -5.94309412e-01  9.12853335e-01  2.71828320e-01]
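
The explicit matrix inverse used in `linreg_fit` can be cross-checked against `np.linalg.lstsq`, which solves the same least-squares problem without forming the inverse. The sketch below uses synthetic data (the values in `true_w` are made up) rather than the wine data, so it only illustrates that the two approaches agree on a well-conditioned problem.

```python
import numpy as np

# Synthetic regression problem (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, 0.5, -1.0, 3.0])  # intercept first
X_b = np.c_[np.ones((50, 1)), X]          # add x0 = 1 column
y = X_b @ true_w + rng.normal(scale=0.01, size=50)

# Normal equation, as in linreg_fit
w_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver; returns (solution, residuals, rank, singular values)
w_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print(np.allclose(w_inv, w_lstsq))
```

If the attributes were strongly correlated, the solver would generally be the more numerically stable choice.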


In [20]:
# now make a "predict" function to return the predicted y values

def linreg_predict(x_test, weights, x_copy=True):
    
    # work on a copy of x_test unless the caller opts out
    # (previously x was undefined when x_copy=False)
    x = x_test.copy() if x_copy else x_test
    
    # add x0=1 to all instances
    length = len(x)
    x_b = np.c_[np.ones((length, 1)), x]
    
    # predict y for each instance
    y_pred = x_b.dot(weights)
    
    return y_pred

In [21]:
y_pred = linreg_predict(x_test, ω)

**6) Evaluating your learning performance. \
7) Making sure your results do not depend on your choice of parameters.**

For linear regression tasks, R-squared ($R^2$) and mean squared error (MSE) are suitable metrics for model evaluation. Both are computed as follows:

In [22]:
# create a function to calculate R2 and MSE in one go

def model_eval(y_test, y_pred):
    
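    # calculate R2 = 1 - u / v, where
    # u is the residual sum of squares and
    # v is the total sum of squares around the mean of y_test
    u = ((y_test - y_pred) ** 2).sum()
    v = ((y_test - y_test.mean()) ** 2).sum()
    r_squared = 1 - u / v
    
    # calculate MSE as the residual sum of squares over the sample count
    z = len(y_test)
    mse = u / z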
    
    return r_squared, mse
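
A small hand-checkable case may help confirm what `model_eval` returns. The sketch below restates the function so it runs on its own, and the three-point arrays are made up. The residual sum of squares is 1 and the total sum of squares is 2, so $R^2 = 1 - 1/2$ and the MSE is $1/3$.

```python
import numpy as np

def model_eval(y_true, y_pred):
    u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - u / v, u / len(y_true)

# Made-up targets and predictions; only the last prediction is off by 1
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 4.0])

r2_toy, mse_toy = model_eval(y_true, y_hat)
print(r2_toy, mse_toy)
```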

In [23]:
r2, mse = model_eval(y_test, y_pred)
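
One way to check that the scores do not hinge on a single random split is to repeat the split with several seeds and look at the spread of $R^2$ and MSE. The sketch below is an assumption about how such a check could look, not part of the original analysis. It uses synthetic data, and the helper names `fit`, `predict`, and `r2_mse` are hypothetical.

```python
import numpy as np

def fit(X, y):
    # least-squares weights with an intercept column
    X_b = np.c_[np.ones((len(X), 1)), X]
    w, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return w

def predict(X, w):
    return np.c_[np.ones((len(X), 1)), X] @ w

def r2_mse(y, y_hat):
    u = ((y - y_hat) ** 2).sum()
    v = ((y - y.mean()) ** 2).sum()
    return 1 - u / v, u / len(y)

# Synthetic stand-in data (made up; not the wine data)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.5, size=400)

scores = []
for seed in range(10):
    # a fresh 75/25 split for every seed
    idx = np.random.default_rng(seed).permutation(len(X))
    n = round(len(X) * 0.75)
    tr, te = idx[:n], idx[n:]
    w = fit(X[tr], y[tr])
    scores.append(r2_mse(y[te], predict(X[te], w)))

scores = np.array(scores)
print("R^2 mean/std:", scores[:, 0].mean(), scores[:, 0].std())
print("MSE mean/std:", scores[:, 1].mean(), scores[:, 1].std())
```

A small standard deviation relative to the mean suggests the evaluation is stable across splits.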