<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

## Lab 4.2.1: Feature Selection

This lab introduces feature selection. We start with a correlation analysis, examining the relationship between each feature and the target variable to identify the features most relevant to our regression model. We then look at cross validation and how it relates to feature selection: by scoring the model across multiple validation sets, cross validation estimates how well the model generalizes to unseen data, which in turn shows whether a chosen set of features genuinely helps.

### 1. Load & Explore Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

#### 1.1 Load Data

In [2]:
# Read CSV
wine_csv = '../DATA/winequality_merged.csv'

wine = pd.read_csv(wine_csv)

#### 1.2 Explore Data (Exploratory Data Analysis)

In [3]:
# ANSWER
wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,red_wine
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


### 2. Set Target Variable

Create a target variable for wine quality.

In [5]:
# Target Variable
y = wine.quality

wine.quality.unique()

array([5, 6, 7, 4, 8, 3, 9], dtype=int64)

### 3. Set Predictor Variables

Create a predictor matrix with variables of your choice. State your reasoning for the choices you make.

> `alcohol` has the highest correlation with quality

In [6]:
# ANSWER
wine.corr()['quality'].sort_values()

density                -0.305858
volatile acidity       -0.265699
chlorides              -0.200666
red_wine               -0.119323
fixed acidity          -0.076743
total sulfur dioxide   -0.041385
residual sugar         -0.036980
pH                      0.019506
sulphates               0.038485
free sulfur dioxide     0.055463
citric acid             0.085532
alcohol                 0.444319
quality                 1.000000
Name: quality, dtype: float64
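Sorting by the signed value puts the strongest negative correlations at the opposite end of the list from `alcohol`, so they are easy to overlook. The sketch below re-ranks the values printed above by magnitude; the dictionary is copied by hand from that output, and the name `corr_with_quality` is only illustrative. On the real DataFrame, the equivalent would be something like `wine.corr()['quality'].abs().sort_values(ascending=False)`.

```python
# Correlations with `quality`, copied from the output above.
corr_with_quality = {
    "density": -0.305858,
    "volatile acidity": -0.265699,
    "chlorides": -0.200666,
    "red_wine": -0.119323,
    "fixed acidity": -0.076743,
    "total sulfur dioxide": -0.041385,
    "residual sugar": -0.036980,
    "pH": 0.019506,
    "sulphates": 0.038485,
    "free sulfur dioxide": 0.055463,
    "citric acid": 0.085532,
    "alcohol": 0.444319,
}

# Rank by absolute value so strong negative correlations are not overlooked.
ranked = sorted(corr_with_quality.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Show the five features most strongly correlated with quality.
for feature, r in ranked[:5]:
    print(f"{feature:<20} {r:+.3f}")
```

By magnitude, `density` and `volatile acidity` follow `alcohol`, which makes them reasonable candidates when adding a second or third predictor.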

### 4. Using Linear Regression Create a Model and Test Score

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [9]:
# Train-Test Split
X = wine.alcohol
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)

for data in [X_train, X_test, y_train, y_test]:
    print(type(data))


<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [25]:
# Create a model for Linear Regression
wine_model = LinearRegression()

# Fit the model with the Training data
wine_model.fit(X_train, y_train)

# Calculate the score (R^2 for Regression) for Training Data
r2_train = wine_model.score(X_train, y_train)

# Calculate the score (R^2 for Regression) for Testing Data
r2_test = wine_model.score(X_test, y_test)

print(f'training data R^2: {r2_train}')
print(f'testing data: R^2: {r2_test}')

training data R^2: 0.201095495779963
testing data: R^2: 0.18158453862144686
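For reference, the R^2 reported by `.score()` is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it by hand on a few toy values that are invented for illustration (they are not taken from the wine data); `r_squared` is a hypothetical helper, not a scikit-learn function.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: how far predictions are from the true values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the true values are from their mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, invented purely for illustration.
y_true = [5, 6, 7, 5, 6]
baseline = [sum(y_true) / len(y_true)] * len(y_true)  # always predict the mean

print(r_squared(y_true, y_true))    # perfect predictions
print(r_squared(y_true, baseline))  # mean-only baseline
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so an R^2 of about 0.2 means `alcohol` alone explains roughly a fifth of the variance in quality.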


## BONUS: Cross validation

In [32]:
# Cross validation
from sklearn.model_selection import KFold
from sklearn.metrics import root_mean_squared_error  # mean_squared_error

⬆️ The original import was `mean_squared_error`, but the code below reports RMSE, and the `squared` argument of `mean_squared_error` is deprecated, so `root_mean_squared_error` is used instead.
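As a reminder of what the metric measures, RMSE is the square root of the mean squared error, so it is expressed in the same units as the target (quality points here). The hand-rolled `rmse` below is a hypothetical helper for illustration only, and the input values are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: perfect predictions give an error of zero.
print(rmse([5, 6, 7], [5, 6, 7]))

# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
print(rmse([5, 6], [8, 2]))
```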

`k_fold = KFold(5, shuffle=True)`

A `KFold` object has a method `.split()` that takes a dataset and returns access to *k* pairs of arrays.
- Note: It doesn't return the pairs directly.  It returns a 'generator' which you can use in a `for` loop *as if* it were a list of pairs.

Each pair is a partition of the row indices of the data set.
- The first array in the pair is the row indices of the training set.
- The second array in the pair is the row indices of the test set.
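To make the idea of index pairs concrete, here is a simplified pure-Python sketch of the same behaviour. It is an illustration under stated assumptions, not scikit-learn's implementation: `k_fold_indices` and its `seed` parameter are hypothetical names, and it yields plain lists rather than NumPy arrays.

```python
import random

def k_fold_indices(n_samples, n_splits=5, shuffle=True, seed=None):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]                    # rows held out for this fold
        train = indices[:start] + indices[stop:]      # every other row
        yield train, test
        start = stop

# Because this is a generator, the pairs are produced one at a time in the loop.
for k, (train_idx, test_idx) in enumerate(k_fold_indices(10, n_splits=5, seed=0)):
    print(k, sorted(test_idx), len(train_idx))
```

Every row index lands in exactly one test set, so each observation is used for evaluation once and for training in the remaining folds.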

In [35]:
y = wine['quality'] 
X = wine.drop(columns=['quality'])

# Set up 5-fold cross validation
k_fold = KFold(5, shuffle=True)
train_scores = []
train_rmse = []
test_scores = []
test_rmse = []


for k, (train_indices, test_indices) in enumerate(k_fold.split(X)):
  
    # Get training and test sets for X and y
    X_train = X.iloc[train_indices]
    X_test = X.iloc[test_indices]
    y_train = y.iloc[train_indices]
    y_test = y.iloc[test_indices]

    # Fit model with training set
    wine_model.fit(X_train, y_train)
    
    # Make predictions with training and test set
    training_predictions = wine_model.predict(X_train)
    test_predictions = wine_model.predict(X_test)

    # Score R2 and RMSE on training and test sets and store in list
    train_scores.append(wine_model.score(X_train, y_train))
    train_rmse.append(root_mean_squared_error(y_train, training_predictions))
    test_scores.append(wine_model.score(X_test, y_test))
    test_rmse.append(root_mean_squared_error(y_test, test_predictions))

# Create a metrics dataframe to display R^2 and RMSE scores
metrics = pd.DataFrame({'train_scores':train_scores, 'test_scores':test_scores, 'train_rmse':train_rmse, 'test_rmse':test_rmse})

# Column labels must follow the column order above: train, test, train, test
col_index = pd.MultiIndex.from_arrays(arrays=[['R^2', 'R^2', 'RMSE', 'RMSE'], ['train', 'test', 'train', 'test']])
metrics.columns = col_index

print(metrics)

        R^2                RMSE          
      train      test     train      test
0  0.292625  0.310721  0.735644  0.719899
1  0.294401  0.298055  0.730588  0.742603
2  0.299043  0.284113  0.726410  0.757086
3  0.298344  0.285390  0.733801  0.728356
4  0.301098  0.272336  0.733761  0.727960


In [36]:
# Describe the metrics
metrics.describe()

Unnamed: 0_level_0,R^2,R^2,RMSE,RMSE
Unnamed: 0_level_1,train,test,train,test
count,5.0,5.0,5.0,5.0
mean,0.297102,0.290123,0.732041,0.735181
std,0.003486,0.01468,0.003635,0.014724
min,0.292625,0.272336,0.72641,0.719899
25%,0.294401,0.284113,0.730588,0.72796
50%,0.298344,0.28539,0.733761,0.728356
75%,0.299043,0.298055,0.733801,0.742603
max,0.301098,0.310721,0.735644,0.757086


### 5. Feature Selection

What's your score (R^2 for Regression) for Testing Data?

> training data R^2 = 0.201
>
> testing data R^2 = 0.182

How many features have you selected? Can you improve your score by selecting different features?

**Please continue with Lab 4.2.2 with the same dataset.**

In [18]:
X = wine[['alcohol', 'citric acid']]
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X, y, test_size=0.20, random_state=42)

In [20]:
# Create a model for Linear Regression
wine_model_2 = LinearRegression()

# Fit the model with the two-feature Training data from the split above
wine_model_2.fit(X_train_2, y_train_2)

# Calculate the score (R^2 for Regression) for Training Data
r2_train = wine_model_2.score(X_train_2, y_train_2)

# Calculate the score (R^2 for Regression) for Testing Data
r2_test = wine_model_2.score(X_test_2, y_test_2)

print(f'training data R^2: {r2_train}')
print(f'testing data: R^2: {r2_test}')

training data R^2: 0.20902689039142175
testing data: R^2: 0.19055557165795323




---

© 2024 Institute of Data