-------
# DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING
-------

### OBJECTIVE :

- Build practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models

### DATA EXPLORATION AND PREPROCESSING :

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [6]:
from sklearn.preprocessing import LabelEncoder

In [7]:
from sklearn.ensemble import IsolationForest

In [2]:
import ppscore as pps
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df=pd.read_csv('adult_with_headers.csv')

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
num_col=['age','fnlwgt','education_num','capital_gain','capital_loss','hours_per_week']

In [None]:
ss=StandardScaler()
mms=MinMaxScaler()

In [None]:
df_ss=df.copy()
df_ss[num_col]=ss.fit_transform(df[num_col])

In [None]:
df_mm=df.copy()
df_mm[num_col]=mms.fit_transform(df[num_col])

In [None]:
df_ss

In [None]:
df_mm

#### TAKEAWAYS FROM THE SCALING TECHNIQUES :

- Standard Scaling:
  * It transforms each feature so that it has a mean of 0 and a standard deviation of 1
  * Works well when a feature is roughly normally distributed, and is less distorted by outliers than Min-Max Scaling
  * A common choice for algorithms that are sensitive to feature scale (e.g., Logistic Regression, SVM, K-Means); these do not require normally distributed inputs, but they benefit from features on comparable scales

- Min-Max Scaling:
  * It rescales each feature to the range 0 to 1
  * Useful for algorithms that are sensitive to the scale of features or expect bounded inputs (e.g., k-NN, Neural Networks)
  * Preserves the shape of the original distribution, but a single extreme value compresses all other values into a narrow band


### ENCODING TECHNIQUES :

In [None]:
cat_col=df.select_dtypes(include=['object']).columns
cat_col

In [None]:
cat_cnt = {col: df[col].nunique() for col in cat_col}
cat_cnt

- From the analysis of unique values :-
  - Low Cardinality (e.g., Gender: Male, Female): One-Hot Encoding creates only 2 columns, "Male" and "Female", which is manageable.
  - High Cardinality (e.g., native_country): One-Hot Encoding would create one new column per country, which quickly becomes inefficient. Label Encoding assigns an integer (0, 1, 2, ...) to each country, keeping the feature in a single column.

In [None]:
one_hot_col = [col for col in cat_col if cat_cnt[col] < 5]
label_col = [col for col in cat_col if cat_cnt[col] >= 5]
label_col,one_hot_col

In [None]:
df_hot = pd.get_dummies(df, columns=one_hot_col)
df_hot

In [None]:
df_label=df_hot.copy()
encoders={}  # one fitted encoder per column, so codes can be mapped back later
for col in label_col:
    encoders[col]=LabelEncoder()
    df_label[col]=encoders[col].fit_transform(df_label[col])

In [None]:
df_label

- Pros and Cons of One-Hot Encoding :
  - Pros :
    - It avoids introducing any ordinal relationship between categories
    - Safe for algorithms that would otherwise read integer codes as ordered magnitudes (e.g., linear models, k-NN)
    - Makes it easier to interpret the data, as each category gets its own column
  - Cons :
    - Can lead to a large number of columns, especially for features with many categories
    - Creates sparse matrices (many zeros), which can increase memory use and computation time

- Pros and Cons of Label Encoding :
  - Pros :
    - Uses a single column, leading to less memory consumption
    - Fast to compute, since each category is simply mapped to an integer
  - Cons :
    - It implies an ordinal relationship between categories, which may not always be appropriate
    - Can negatively affect models that are sensitive to the numerical magnitude of features (e.g., linear models)
#### NEW BENEFICIAL FEATURE :

In [None]:
# Capital gain per weekly working hour; the +1 guards against division by zero
df['capital_gain_per_hour']=df['capital_gain']/(df['hours_per_week']+1)

In [None]:
df

In [None]:
df['capital_net_gain']=df['capital_gain']-df['capital_loss']

In [None]:
df

In [None]:
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 45, 65, 90], labels=['Young', 'Middle-aged', 'Senior', 'Old'])

In [None]:
df

- Transformation :- Log Transformation
  - The capital_gain feature is highly skewed, with most values concentrated at zero and a few very high values
  - A log transformation reduces the skewness by compressing the range of the high values
  - This helps the model differentiate income levels while preventing extreme values from distorting predictions
  - np.log1p computes log(1 + x), so the zero values stay valid and map to 0
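
The effect on skewness can be checked with a small sketch. The `gains` series below is synthetic (mostly zeros with a few large values) and only imitates the shape described above; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, zero-heavy values with a long right tail (an assumption for illustration).
gains = pd.Series([0] * 90 + [2000, 5000, 7000, 15000, 99999] * 2)

logged = np.log1p(gains)  # log(1 + x): zeros stay at 0, large values are compressed

print(f"skew before: {gains.skew():.2f}")
print(f"skew after:  {logged.skew():.2f}")
```

The transformed values are still skewed because the spike at zero remains, but the long tail no longer dominates.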

In [None]:
df['log_capital_gain'] = np.log1p(df['capital_gain'])

In [None]:
df

### FEATURE SELECTION :

#### ISOLATION FOREST TO DETECT AND REMOVE OUTLIERS :

In [None]:
iso_f=IsolationForest(contamination=0.05,random_state=42)
outliers=iso_f.fit_predict(df.select_dtypes(include=['float64','int64']))
df['outlier']=outliers

In [None]:
df_out=df[df['outlier']!=-1].drop(columns='outlier')

In [None]:
df_out.shape,df.shape
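
As a rough illustration of what the cells above do: `fit_predict` labels each row 1 (inlier) or -1 (outlier), and `contamination=0.05` asks the model to flag roughly 5% of the training rows. The sketch below uses synthetic 2-D points, with five hypothetical extreme rows added by hand.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(195, 2))
# Five hypothetical extreme rows placed far from the main cloud.
extreme_points = np.array([[8, 8], [-8, 8], [8, -8], [-8, -8], [0, 10]], dtype=float)
X = np.vstack([normal_points, extreme_points])

iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = outlier

X_clean = X[labels != -1]  # same filter as df[df['outlier'] != -1]
n_flagged = int((labels == -1).sum())
print(f"flagged {n_flagged} of {len(X)} rows; kept {len(X_clean)}")
```

Because the share of flagged rows is fixed by `contamination`, some ordinary rows are removed as well whenever the data contains fewer genuine outliers than that share.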

#### PPS (PREDICTIVE POWER SCORE) AND CORRELATION MATRIX :

In [None]:
pps_matrix=pps.matrix(df_out)

In [None]:
plt.figure(figsize=(12,8))
# DataFrame.pivot requires keyword arguments in pandas 2.x
sns.heatmap(pps_matrix.pivot(index='x',columns='y',values='ppscore'),annot=True,cmap='coolwarm')
plt.title('PPS Matrix - PPS Heatmap')
plt.show()

In [None]:
plt.figure(figsize=(12, 8))
# numeric_only=True skips the non-numeric columns, which corr() cannot handle
sns.heatmap(df_out.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

- PPS (Predictive Power Score) :-
    - Detects both linear and non-linear relationships between features
    - It is asymmetric (the score of x predicting y can differ from y predicting x) and also works with categorical columns, which makes it more flexible than correlation
- Correlation Matrix :-
    - Focuses on linear relationships between numeric features
    - Values range from -1 to 1; the sign gives the direction of the relationship and the magnitude its strength
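
The difference shows up clearly on a relationship that is strong but not linear. The sketch below is a simplified PPS-style score written for illustration, not the `ppscore` implementation: it measures how much a small decision tree improves on always predicting the median. The data, the tree depth, and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = x ** 2 + rng.normal(scale=0.1, size=500)  # strong, but non-linear, dependence

# Pearson correlation only measures linear association.
pearson = np.corrcoef(x, y)[0, 1]

# PPS-style score: 1 - (model error / naive error), floored at 0.
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
pred = cross_val_predict(tree, x.reshape(-1, 1), y, cv=4)
mae_model = np.mean(np.abs(y - pred))
mae_naive = np.mean(np.abs(y - np.median(y)))
pps_like = max(0.0, 1 - mae_model / mae_naive)

print(f"Pearson r:       {pearson:.2f}")
print(f"PPS-style score: {pps_like:.2f}")
```

Here the correlation stays close to zero even though `x` predicts `y` well, which is the kind of relationship a correlation heatmap would miss.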