In [17]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

In [18]:
import warnings
warnings.filterwarnings('ignore')

### 1. Loading the train and test data

In [19]:
df_train = pd.read_csv("data/train.csv")
df_test = pd.read_csv("data/test.csv")


train = df_train.drop(['TARGET'], axis=1)

##### The dataset has a large number of features (370 once `TARGET` is dropped), so we use feature selection to keep only the ones that carry useful information.

### 2. Feature Selection

With so many features, we might run into problems such as:

<ol>
<li><b>Constant features:</b></li>
<ul>
    <li>Features that take a single value for every observation in the dataset.</li>
    <li>They provide no information that allows a machine learning model to predict the target.</li>
    <li>To identify constant features, we use VarianceThreshold from sklearn and then remove them.</li>
</ul>
    
<h6></h6>
    
<li><b>Quasi-Constant features:</b></li>
<ul>
    <li>Quasi-constant features have the same value for the vast majority of observations in a dataset.</li>
    <li>These features typically provide little information for a machine learning model to predict or classify a target.</li>
    <li>To identify quasi-constant features, we again use VarianceThreshold from sklearn, this time with a small non-zero threshold, and then remove them.</li>
</ul>
</ol>
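To make the two cases above concrete, here is a minimal sketch on made-up data (the `toy` frame and its column names are hypothetical, not part of this dataset). It assumes a threshold of 0.05 and shows that both a constant column and a column that is 99% zeros fall below it, while a balanced 0/1 column does not.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data (not the notebook's dataset): 100 rows, 3 features
toy = pd.DataFrame({
    "const": np.ones(100),               # same value everywhere, variance 0
    "quasi": np.r_[np.zeros(99), 1.0],   # 99% zeros, variance 0.01 * 0.99
    "varied": np.arange(100) % 2,        # alternating 0/1, variance 0.25
})

# Fit the selector; get_support() is True for the columns that are kept
selector = VarianceThreshold(threshold=0.05).fit(toy)
mask = selector.get_support()

kept = list(toy.columns[mask])
dropped = list(toy.columns[~mask])
print("kept:", kept)
print("dropped:", dropped)
print("variances:", selector.variances_)
```

For a 0/1 feature the variance is p(1 - p), so a threshold of 0.05 roughly corresponds to one value covering about 95% of the rows. For features on other scales the same threshold means something different, which is worth keeping in mind when picking the value.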

In [21]:
# Fitting the VarianceThreshold; a threshold of 0.05 flags both constant and quasi-constant features

vthres = VarianceThreshold(threshold = 0.05)
vthres.fit(train)

In [22]:
# Number of features before and after the variance filter

print("Original Features: ",train.shape[1])
print("Non-Constant Features: ", sum(vthres.get_support()))

Original Features:  370
Non-Constant Features:  236


In [23]:
# Collect the features rejected by the variance filter and print how many there are

columns_drop = list(train.columns[~vthres.get_support()])

print(len(columns_drop))

134


> We can see that 134 columns are constant or quasi-constant, so we are going to get rid of them, as they contribute little or nothing.

In [24]:
# Dropping the constant and quasi-constant columns from both train and test

df_train = df_train.drop(columns_drop, axis=1)
df_test = df_test.drop(columns_drop, axis = 1)

In [25]:
# Checking the updated shape of train and test data

df_train.shape, df_test.shape

((76020, 237), (75818, 236))
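The shapes above differ by one column because only the training data carries `TARGET`. A quick way to confirm that the two frames still line up feature by feature is sketched below on hypothetical miniature frames (`toy_train`, `toy_test`, and the column names are illustrative assumptions, not the real data).

```python
import pandas as pd

# Hypothetical miniature frames standing in for df_train / df_test
toy_train = pd.DataFrame({"ID": [1, 2, 3], "a": [0, 0, 0], "b": [1, 2, 3], "TARGET": [0, 1, 0]})
toy_test = pd.DataFrame({"ID": [4, 5], "a": [0, 0], "b": [4, 5]})
to_drop = ["a"]  # stands in for the list of low-variance columns

# Drop the same columns from both frames
toy_train = toy_train.drop(columns=to_drop)
toy_test = toy_test.drop(columns=to_drop)

# Apart from TARGET, both frames should expose the same columns in the same order
same_features = list(toy_train.drop(columns="TARGET").columns) == list(toy_test.columns)
print(toy_train.shape, toy_test.shape, same_features)
```

The same comparison can be run on `df_train` and `df_test` after the drop step if the column order matters for later modelling.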

New Info

<span style = 'font-size:16px; font-family:TimesNewRoman'> By removing constant and quasi-constant features, we reduced the feature space from 370 to 236. In total, 134 features were removed from the dataset.</span>

### PCA

PCA stands for Principal Component Analysis. It is a statistical technique used to reduce the dimensionality of a dataset while retaining as much of the variation in the data as possible.

PCA works by transforming a dataset consisting of many variables into a smaller set of variables called principal components. The principal components are linear combinations of the original variables that capture the maximum amount of variation in the data.
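The idea of principal components as linear combinations can be sketched directly with NumPy. The example below is an illustration on synthetic data, not the method used in this notebook: it assumes a small hypothetical matrix `X`, centers it, and uses the eigenvectors of the covariance matrix as the weights of each linear combination.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 strongly correlated variables
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

# Center the data and compute the covariance matrix of the variables
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending variance
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each column of `scores` is a linear combination of the original variables,
# with the weights given by the corresponding eigenvector
scores = X_centered @ eigvecs

print("variance per component:", eigvals)
print("share of total variance:", eigvals / eigvals.sum())
```

The variance of each column of `scores` equals the matching eigenvalue, which is why sorting the eigenvalues orders the components by how much variance they capture.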

The first principal component captures the most variance in the data, the second captures the most of the remaining variance while being orthogonal to the first, and so on. By retaining only the top principal components, PCA can reduce the number of variables in the dataset while still preserving most of the information.
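One common way to decide how many components to retain is to look at the cumulative share of explained variance. The sketch below uses sklearn's `PCA` on synthetic data; the data, the 95% cutoff, and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: 500 samples, 10 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit PCA with all components to inspect how the variance is distributed
pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_   # sorted from largest to smallest
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the assumed 95% cutoff
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

# Refit with only the retained components and project the data
X_reduced = PCA(n_components=n_keep).fit_transform(X)

print("explained variance ratio:", np.round(ratios, 4))
print("components kept:", n_keep, "reduced shape:", X_reduced.shape)
```

Because the toy data is generated from two latent factors, almost all of the variance lands in the first two components. On real data the curve is usually flatter, and the cutoff is a judgment call.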

PCA is often used in data preprocessing, data visualization, and machine learning. It can be used to identify patterns and relationships in the data, to reduce noise, and to create more parsimonious models.