In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

from scipy.stats import spearmanr
from scipy.stats import pearsonr
from scipy.stats import kendalltau

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Ref:
1. https://github.com/Ayushijain09/Automobile-Dataset-Multiple-Linear-Regression-with-Feature-Selection/blob/master/_Automobile_Multiple_LinearRegression_Feature_Selection.ipynb
2. https://towardsdatascience.com/learn-how-to-do-feature-selection-the-right-way-61bca8557bef

### 3 families of feature selection methods:
1. Filter methods (6)
    - missing values ratio
    - variance threshold
    - correlation coefficient
    - chi-square test of independence (for categorical data)
    - mutual info (for both regression & classification)
    - analysis of variance (ANOVA)
2. Wrapper methods
    - sequential feature selection
    - recursive feature elimination (RFE)
3. Embedded methods
    - L1 (LASSO) regularization
    - Tree model (for regression & classification)

#### With filter methods, 

we apply a statistical measure that suits our data to assign each feature column a score. That score decides whether the feature is kept in or removed from our predictive model.

Pros and cons:
- Pros: computationally inexpensive + well suited to eliminating redundant and irrelevant features
- Cons: univariate filters score each feature independently, so they don't take relationships between features into consideration

2 types of filter methods:
1. univariate filter: ranks features one at a time
2. multivariate filter: evaluates the entire feature space

#### In wrapper methods,

we choose a subset of features and train a model on them using an ML algorithm. Based on the performance of this model, a search strategy looks through the space of possible feature subsets and decides which feature to add or remove before the next model is trained.

This loop continues until the desired number of features (k_features) is reached or the model performance no longer improves.

Pros and cons:
- Pros: takes interactions between features into account, aiming for the subset of features that gives your model the lowest error (greedy searches do not guarantee the true optimum).
- Cons: computationally expensive, and the cost grows quickly as the number of features increases
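
The wrapper loop described above can be sketched with scikit-learn's `SequentialFeatureSelector`. This is a minimal illustration on synthetic data, not the notebook's car dataset; `n_features_to_select` plays the role of `k_features` in the text, and the value 3 is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# synthetic data (assumption): 8 features, only 3 of which drive the target
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# forward selection: start from no features and add one per round,
# keeping the feature that gives the best cross-validated score
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
sfs.fit(X, y)

selected = np.flatnonzero(sfs.get_support())  # indices of the kept columns
print(selected)
```

Each round refits the model once per candidate feature and per fold, which is where the computational cost comes from.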

#### In embedded methods,

the functionalities of both filter and wrapper methods are combined.

Embedded methods perform feature selection during the process of training (which is why they're called embedded).

**Upside:** the computational cost is close to that of filter methods, typically with better accuracy
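
As a rough sketch of an embedded method, an L1-regularized (LASSO) regression shrinks some coefficients exactly to zero while it trains, and the surviving non-zero coefficients mark the selected features. The data below is synthetic and `alpha=1.0` is an arbitrary illustrative value, not a tuned one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# synthetic data (assumption): 10 features, 3 informative
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

lasso = Lasso(alpha=1.0)  # alpha controls how aggressively coefficients are zeroed
lasso.fit(X, y)

kept = np.flatnonzero(lasso.coef_ != 0)  # features with a non-zero coefficient
print(kept)
```

Selection and fitting happen in the same training run, so no separate search loop is needed.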

## Data Pre-processing

Pre-processing (Step 0) has 2 parts:

1. Data Wrangling to clean the data

2. Data Scaling to normalize the data

## Step 0 - Data Wrangling
**Step 0.1 - Data Overview**

In [2]:
# column names taken from: https://github.com/justdinhnq/python_practice/blob/main/feature_selection/cars_desc.names
names = [
    "symboling", "normalized_losses", "make", "fuel_type", 
    "aspiration", "num_of_doors", "body_style", "drive_wheels",
    "engine_location", "wheel_base", "length", "width", "height",
    "curb_weight", "engine_type", "num_of_cylinders", "engine_size",
    "fuel_system", "bore", "stroke", "compression_ratio", "horse_power",
    "peak_rpm", "city_mpg", "highway_mpg", "price"
]

cars = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
    names = names
)
cars.info()
cars.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized_losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel_type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num_of_doors       205 non-null    object 
 6   body_style         205 non-null    object 
 7   drive_wheels       205 non-null    object 
 8   engine_location    205 non-null    object 
 9   wheel_base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb_weight        205 non-null    int64  
 14  engine_type        205 non-null    object 
 15  num_of_cylinders   205 non-null    object 
 16  engine_size        205 non

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


**Step 0.2 - change data types and drop columns**

1. reduce categories 
    - 'make' = 'company name' + 'version': only keep 'company name' instead
2. change 'object' type to 'category' type
    - 'company_name'
3. change to 'numeric' type for columns whose values are spelled out as words
    - 'num_of_doors', 'num_of_cylinders'
4. change 'object' (i.e. categorical) columns to dummy variables

1. reduce categories

In [3]:
cars['company_name'] = cars['make'].apply(lambda x: x.split(" ")[0])

# done; now drop the original 'make' column to keep the frame clean
cars = cars.drop('make', axis = 1)

2. change 'object' type to 'category' type

In [4]:

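# note: astype() returns a new Series and the result is not assigned here,
# so the column itself stays 'object' (the dummy-variable step later selects 'object' columns)
cars['company_name'].astype('category')
cars['company_name'].value_counts()

# if some names are misspelled, use this example to correct them
# cars.loc[cars['company_name'] == 'Nissan', 'company_name'] = 'nissan'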



toyota           32
nissan           18
mazda            17
mitsubishi       13
honda            13
volkswagen       12
subaru           12
peugot           11
volvo            11
dodge             9
mercedes-benz     8
bmw               8
audi              7
plymouth          7
saab              6
porsche           5
isuzu             4
jaguar            3
chevrolet         3
alfa-romero       3
renault           2
mercury           1
Name: company_name, dtype: int64

3. change to 'numeric' type for columns whose values are spelled out as words

In [5]:
# map 'words' to 'numeric' type
cars['num_of_doors'] = cars['num_of_doors'].map(
    {
        'four':4,'two':2
    }
)

cars['num_of_cylinders'] = cars['num_of_cylinders'].map(
    {
        'four':4,'two':2,'six':6,'five':5,'eight':8,'three':3,'twelve':12
    }
)

In [6]:
# convert '?' to 0.0 before changing the 'object' type to 'float64'
cars['price'] = cars['price'].replace('?', 0.0)
cars['price'] = cars['price'].astype(float)
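
Replacing '?' with 0.0 makes those cars look as if they cost nothing, which may distort a regression on 'price'. One possible alternative, sketched here on a small hypothetical frame `toy` rather than on `cars`, is to coerce '?' to NaN and drop the rows whose target is missing.

```python
import pandas as pd

# hypothetical stand-in for the 'price' column, with '?' marking a missing value
toy = pd.DataFrame(
    {"price": ["13495", "?", "16500"], "engine_size": [130, 152, 109]}
)

# non-numeric strings such as '?' become NaN instead of 0.0
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")

# rows without a target value cannot be used for training, so drop them
toy = toy.dropna(subset=["price"])
print(toy)
```

Whether dropping or imputing is the better choice depends on how many rows are affected.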

4. change 'object' (i.e. categorical) columns to dummy variables

In [7]:
# most models can't learn directly from categorical variables, 
# so we convert them into numeric dummy variables

# extra step: drop the 'normalized_losses' column ('object' type), which is not needed
cars = cars.drop(['normalized_losses'], axis = 1)


# get categorical vars
cars_category = cars.select_dtypes(include=['object'])

# make dummy vars
# drop_first = True => the first level of each variable (e.g. 'fuel_type_diesel') gets no column;
# it is implied when all the other dummies of that variable are 0
cars_dummies = pd.get_dummies(cars_category, drop_first=True)
cars_dummies.head()

Unnamed: 0,fuel_type_gas,aspiration_turbo,body_style_hardtop,body_style_hatchback,body_style_sedan,body_style_wagon,drive_wheels_fwd,drive_wheels_rwd,engine_location_rear,engine_type_dohcv,...,company_name_nissan,company_name_peugot,company_name_plymouth,company_name_porsche,company_name_renault,company_name_saab,company_name_subaru,company_name_toyota,company_name_volkswagen,company_name_volvo
0,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
# drop categorical columns 
list_cats = list(cars_category)
cars = cars.drop(list_cats, axis = 1)

# and add new dummies
cars = pd.concat([cars, cars_dummies], axis = 1)

## Step 0 - Data Scaling

Now all features have numeric data types.

The dummy (categorical) features don't need to be scaled.

Let's standardize the numerical features so that they share a comparable scale (zero mean, unit variance).

The scaler must be fit on the training set only; the test set is later transformed with that same fitted scaler and is never used to fit it.

In [9]:
# 1. split the dataset
df_train, df_test = train_test_split(
    cars, train_size = 0.7, test_size = 0.3, random_state = 100
)

# 2. scaling 
scale = StandardScaler()
scale_features = cars.select_dtypes(include = ['float64', 'int64']).columns

df_train[scale_features] = scale.fit_transform(df_train[scale_features])

# review 
df_train.head()

Unnamed: 0,symboling,num_of_doors,wheel_base,length,width,height,curb_weight,num_of_cylinders,engine_size,compression_ratio,...,company_name_nissan,company_name_peugot,company_name_plymouth,company_name_porsche,company_name_renault,company_name_saab,company_name_subaru,company_name_toyota,company_name_volkswagen,company_name_volvo
122,0.170159,0.885895,-0.811836,-0.487238,-0.9245,-1.134628,-0.642128,-0.351431,-0.660242,-0.172569,...,0,0,1,0,0,0,0,0,0,0
125,1.848278,-1.128802,-0.677177,-0.359789,1.114978,-1.382026,0.439415,-0.351431,0.637806,-0.146125,...,0,0,0,1,0,0,0,0,0,0
166,0.170159,-1.128802,-0.677177,-0.37572,-0.833856,-0.392434,-0.441296,-0.351431,-0.660242,-0.172569,...,0,0,0,0,0,0,0,1,0,0
1,1.848278,-1.128802,-1.670284,-0.367754,-0.788535,-1.959288,0.015642,-0.351431,0.123485,-0.278345,...,0,0,0,0,0,0,0,0,0,0
199,-1.50796,0.885895,0.97239,1.225364,0.616439,1.627983,1.13772,-0.351431,0.123485,-0.675002,...,0,0,0,0,0,0,0,0,0,1
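
The cell above only scales the training set. A minimal sketch of the full pattern, on a small synthetic frame `toy` with hypothetical columns, fits the scaler on the training split and reuses the same fitted scaler for the test split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption), two numeric columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {"length": rng.normal(170, 12, 50), "width": rng.normal(65, 2, 50)}
)

toy_train, toy_test = train_test_split(toy, train_size=0.7, random_state=100)
# work on copies so the assignments below don't write into views of `toy`
toy_train, toy_test = toy_train.copy(), toy_test.copy()

cols = ["length", "width"]
scaler = StandardScaler()
toy_train[cols] = scaler.fit_transform(toy_train[cols])  # learn mean/std on train only
toy_test[cols] = scaler.transform(toy_test[cols])        # reuse the train statistics
```

The test columns will not have exactly zero mean after this step, which is expected: they are expressed in units of the training distribution.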


## Step 1 - Feature Selection methods

**1.1. Filter methods - Missing Values Ratio**

As a rule of thumb, a feature becomes a candidate for removal when more than about 25-30% of its values are missing.

With domain knowledge, it's always better to make an educated guess about whether the feature is crucial to the model. If it is, try imputing the missing values using the various techniques described [here](https://towardsdatascience.com/all-about-missing-data-handling-b94b8b5d2184)

In [10]:
sum_na = df_train.isnull().sum()
df_len = len(df_train)

ratio = (sum_na / df_len) * 100

# 5 largest values
print(ratio.nlargest())

num_of_doors    1.398601
symboling       0.000000
wheel_base      0.000000
length          0.000000
width           0.000000
dtype: float64
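
Acting on the ratio could look like the sketch below, which drops every column whose missing percentage exceeds a cutoff. The frame `toy` and the 30% `threshold` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# hypothetical frame with different amounts of missing data per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],     # 25% missing
    "b": [np.nan, np.nan, 3.0, 4.0],  # 50% missing
    "c": [1.0, 2.0, 3.0, 4.0],        # 0% missing
})

threshold = 30.0  # percent; an assumed cutoff within the 25-30% range

ratio = toy.isnull().mean() * 100            # missing percentage per column
to_drop = ratio[ratio > threshold].index.tolist()
toy_reduced = toy.drop(columns=to_drop)
print(to_drop)
```

On the car training set above, no column comes close to this cutoff, so nothing would be dropped.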


**1.2. Filter methods - Variance threshold**

**Zero variance**: features that take the same value in every sample. **Quasi-constant** features are those in which one value occupies the large majority of the samples.

Such features carry little information about the target variable and can usually be dropped.

Threshold value:
- default = 0: remove features that have the same value in all samples
- 0.01: for a binary feature, drop the column when roughly 99% of the values are identical (variance p(1 - p) = 0.99 * 0.01 ≈ 0.01)

Use '0.01' to also catch quasi-constant features, which have the same value for a very large subset of the samples.

In [11]:
# before applying
print(df_train.shape)

# drop features with Zero variance 
var_filter = VarianceThreshold(threshold = 0.0)
train = var_filter.fit_transform(df_train)

# after applying: get the count of features that aren't constant
print(train.shape)

indexes = var_filter.get_support()
print(len(df_train.columns[indexes]))

(143, 212)
(143, 197)
197
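
The cell above uses the default threshold of 0. A small sketch of the quasi-constant case with `threshold = 0.01`, on a hypothetical frame `toy` with binary columns, might look like this.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

n = 1000
toy = pd.DataFrame({
    "constant": np.ones(n),                                 # variance 0
    "quasi_constant": np.r_[np.ones(5), np.zeros(n - 5)],   # 99.5% zeros, variance ~0.005
    "balanced": np.tile([0.0, 1.0], n // 2),                # 50/50 split, variance 0.25
})

# keep only the features whose variance is above the threshold
vt = VarianceThreshold(threshold=0.01)
vt.fit(toy)

kept = toy.columns[vt.get_support()].tolist()
print(kept)
```

Variance depends on the scale of a feature, so a fixed threshold such as 0.01 is easiest to interpret for 0/1 dummy columns.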


**1.3. Correlation coefficient**

2 independent features (X) are highly correlated if they have a strong linear relationship with each other, moving together in the same or in opposite directions. In that case, we don't need to feed 2 similar features to the model if one suffices.

If a pair of columns crosses a chosen correlation threshold, the one that shows the higher correlation with the target variable (y) is kept and the other one is dropped.

- Pearson correlation (for continuous data):
    + assumption: the relationship is linear; the significance test also assumes approximately normally distributed data
    + -1 <= r <= 1
    + Ho: there is no correlation between the variables (r = 0)
    + p-value < alpha: the sample contains sufficient evidence to reject Ho and conclude that the correlation coefficient does not equal zero.
- Spearman rank (for continuous + ordinal data)
    + Spearman is Pearson computed on the ranks of the data; use it when Pearson's assumptions fail but the relationship is still monotonic
- Kendall (for discrete/ordinal data)
    + compares the number of concordant and discordant pairs of data

In [12]:
# use this one!
#coef, p = pearsonr(x, y)
#x.corr(y)

# Spearman and Kendall are not used here:
# they are rank-based and suit ordinal variables best, while
# about 60% of our variables are continuous
#coef, p = spearmanr(x, y)
#x.corr(y, method = 'spearman')

#coef, p = kendalltau(x, y)
#x.corr(y, method = 'kendall')
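
The pairwise rule described above (when two features are too correlated, keep the one closer to the target) can be sketched as follows. The data is synthetic, and the 0.9 `threshold` and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # near-duplicate of x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2
y = 3 * x1 + x3 + rng.normal(scale=0.1, size=n)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

threshold = 0.9  # assumed cutoff for "highly correlated"
corr = X.corr().abs()                                        # feature-feature |r|
target_corr = X.apply(lambda col: abs(pearsonr(col, y)[0]))  # feature-target |r|

to_drop = set()
cols = list(X.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a in to_drop or b in to_drop:
            continue
        if corr.loc[a, b] > threshold:
            # drop whichever of the pair is less correlated with the target
            to_drop.add(a if target_corr[a] < target_corr[b] else b)

X_reduced = X.drop(columns=sorted(to_drop))
print(sorted(to_drop))
```

Only one of `x1` and `x2` survives; `x3` is kept because it is not strongly correlated with either of them.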

**1.4. Chi-Square tests (for categorical data)**

Ho: the 2 variables are independent

The test comes in 2 variations:
- the goodness-of-fit test
- the test of independence

With this dataset, chi-square does not apply: it needs categorical variables and non-negative values, whereas our target ('price') is continuous and the scaled features contain negative values.

In [13]:
# X_train = X_train.astype(int)
# chi2_features = SelectKBest(chi2, k = 12)
# X_kbest_features = chi2_features.fit_transform(X_train, y_train)
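
For reference, here is a minimal sketch of how the commented-out selector behaves on data that does satisfy the requirements: non-negative categorical features and a categorical target. The data is synthetic and `k = 1` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)  # binary (categorical) target

# 'related' agrees with the target about 90% of the time; 'unrelated' is pure noise
related = np.where(rng.random(n) < 0.9, target, 1 - target)
unrelated = rng.integers(0, 2, size=n)
X = pd.DataFrame({"related": related, "unrelated": unrelated})

# score each feature against the target with chi-square and keep the best k
selector = SelectKBest(chi2, k=1)
selector.fit(X, target)

chosen = X.columns[selector.get_support()].tolist()
print(chosen)
```

The larger the chi-square score, the stronger the evidence against independence between that feature and the target.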