## DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING

### Tasks:

#### 1. Data Exploration and Preprocessing:

• Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).

• Handle missing values as per the best practices (imputation, removal, etc.).

• Apply scaling techniques to numerical features:

• Standard Scaling

• Min-Max Scaling

• Discuss the scenarios where each scaling technique is preferred and why.

#### 2. Encoding Techniques:


• Apply One-Hot Encoding to categorical variables with fewer than 5 categories.

• Use Label Encoding for categorical variables with more than 5 categories.

• Discuss the pros and cons of One-Hot Encoding and Label Encoding.

#### 3. Feature Engineering:


• Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.

• Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.

#### 4. Feature Selection:


• Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.

• Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.
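
The Isolation Forest step from task 4 is not shown in the cells that follow. As a minimal sketch of how it could look, assuming scikit-learn's `IsolationForest` and synthetic data (the column names `x1` and `x2` and the `contamination` value are illustrative assumptions, not values from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points around the origin plus 5 obvious outliers (illustrative only)
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[10, 10], [-10, 10], [10, -10], [-10, -10], [12, 0]])
toy = pd.DataFrame(np.vstack([inliers, outliers]), columns=["x1", "x2"])

# contamination is the assumed share of outliers; 0.05 is a guess, not a tuned value
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(toy)  # 1 = inlier, -1 = outlier

# Keep only the rows flagged as inliers
toy_clean = toy[labels == 1].reset_index(drop=True)
print(f"Removed {(labels == -1).sum()} of {len(toy)} rows")
```

Extreme rows like these can dominate scaling statistics, distance computations, and fitted coefficients, which is why they are usually inspected or removed before modelling.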

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler


# Load the dataset
df = pd.read_csv('adult_with_headers.csv')
# Basic data exploration
print(df.info())
print(df.describe())

# Handling missing values
df.dropna(inplace=True)  # Remove rows with missing values


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None
                age        fnlwgt  education_num  capital_gain  capital_loss  \
count  3
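
`df.info()` reports 32561 non-null values in every column, so `dropna()` removes nothing here. In the UCI Adult data, missing entries are usually written as a `?` string rather than as NaN. Assuming this file follows that convention (an assumption, since the export does not show it), a sketch of detecting and imputing them on toy rows could look like this:

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the assumed '?' placeholder, including the leading space
# that this file's category labels appear to carry
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "age": [39, 50, 38, 53],
})

# Strip whitespace, then turn the placeholder into a real missing value
toy["workclass"] = toy["workclass"].str.strip().replace("?", np.nan)
print(toy.isna().sum())

# Impute the categorical column with its most frequent value (mode)
toy["workclass"] = toy["workclass"].fillna(toy["workclass"].mode()[0])
print(toy)
```

Mode imputation keeps every row; dropping the affected rows instead is reasonable when they are a small share of the data.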

In [2]:
# Encoding categorical variables
df = pd.get_dummies(df, columns=['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country', 'income'], drop_first=True)

# Scaling numerical features
scaler_standard = StandardScaler()
scaler_minmax = MinMaxScaler()

numerical_features = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
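# Reuse df's index so the scaled frames stay row-aligned with df in pd.concat,
# even if dropna() has removed rows
df_scaled_standard = pd.DataFrame(scaler_standard.fit_transform(df[numerical_features]), columns=numerical_features, index=df.index)
df_scaled_minmax = pd.DataFrame(scaler_minmax.fit_transform(df[numerical_features]), columns=numerical_features, index=df.index)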

# Combine scaled numerical features with encoded categorical features
df_final_standard = pd.concat([df_scaled_standard, df.drop(numerical_features, axis=1)], axis=1)
df_final_minmax = pd.concat([df_scaled_minmax, df.drop(numerical_features, axis=1)], axis=1)


###### Discussion on Scaling Techniques

Standard Scaling (Z-score normalization): Subtracts each feature's mean and divides by its standard deviation, so the result has mean 0 and standard deviation 1. It does not require normally distributed data, although the scaled values are easiest to interpret when the feature is roughly bell-shaped. It is a common default for algorithms that are sensitive to feature scale, such as regularized linear and logistic regression, SVMs, k-nearest neighbours, and PCA.


Min-Max Scaling (Normalization): Rescales each feature linearly to a fixed range (0 to 1 by default), preserving the shape of the original distribution. It is preferred when an algorithm expects bounded inputs or when the feature has known, meaningful limits. Because the range is set by the observed minimum and maximum, a single extreme value can compress all other values into a narrow band, so it is more sensitive to outliers than Standard Scaling.
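
To make the difference concrete, here is a small sketch that applies both scalers to one illustrative feature containing a single extreme value (the numbers are made up for the example):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with a single extreme value (illustrative numbers)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(values)
minmax = MinMaxScaler().fit_transform(values)

# Standard scaling: mean 0, standard deviation 1
print(standard.ravel().round(3))
# Min-Max scaling: bounded to [0, 1]; the four small values are squeezed close to 0
print(minmax.ravel().round(3))
```

In a modelling workflow the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics.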


In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Load the dataset
df = pd.read_csv('adult_with_headers.csv')

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

In [4]:
# Apply One-Hot Encoding to categorical variables with less than 5 categories
for col in categorical_cols:
    if df[col].nunique() < 5:
        df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
        df.drop(col, axis=1, inplace=True)
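
The task also calls for Label Encoding of the higher-cardinality columns, and the output that follows shows integer-coded columns, but the cell that produced them is not part of this export. A minimal sketch of such a step on toy data, assuming scikit-learn's `LabelEncoder` and a threshold of 5 (the task does not say how to treat a column with exactly 5 categories; label-encoding it is an assumption here):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one low-cardinality and one higher-cardinality column
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "occupation": ["Sales", "Tech-support", "Adm-clerical", "Craft-repair",
                   "Exec-managerial", "Prof-specialty"],
})

categorical_cols = ["sex", "occupation"]
encoders = {}  # keep fitted encoders so codes can be mapped back to labels
for col in categorical_cols:
    if toy[col].nunique() < 5:
        # Low cardinality: one indicator column per category
        toy = pd.get_dummies(toy, columns=[col], prefix=col)
    else:
        # Higher cardinality: a single integer-coded column
        encoders[col] = LabelEncoder()
        toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
```

Keeping the fitted encoders allows `encoders["occupation"].inverse_transform(...)` to recover the original labels later.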

In [6]:
# Print the first few rows of the modified dataframe
print(df.head())

   age  workclass  fnlwgt  education  education_num  marital_status  \
0   39          7   77516          9             13               4   
1   50          6   83311          9             13               2   
2   38          4  215646         11              9               0   
3   53          4  234721          1              7               2   
4   28          4  338409          9             13               2   

   occupation  relationship  race  capital_gain  capital_loss  hours_per_week  \
0           1             1     4          2174             0              40   
1           4             0     4             0             0              13   
2           6             1     4             0             0              40   
3           6             0     2             0             0              40   
4          10             5     2             0             0              40   

   native_country  sex_ Female  sex_ Male  income_ <=50K  income_ >50K  
0   United-St

###### Discussion on Encoding Techniques

One-Hot Encoding:


Pros: It does not impose an order on the categories, so a model cannot read a spurious ranking or distance into them. This matters most for linear and distance-based algorithms, which treat numeric codes as magnitudes.

Cons: It adds one column per category, so a variable with many unique categories can greatly increase dimensionality. The result is a sparse feature matrix, higher memory and computation costs, and a greater risk of the curse of dimensionality.


Label Encoding:


Pros: It encodes each categorical variable as a single integer column, so dimensionality does not grow. It suits genuinely ordinal variables when the codes follow the natural order of the categories.

Cons: For nominal variables it introduces an order where none exists, which is problematic for algorithms that interpret the integers as continuous magnitudes, such as linear and distance-based models.
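
The trade-off can be seen on a few values of a nominal column. This sketch uses `marital_status` values from the sample rows and assumes scikit-learn's `LabelEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Nominal categories with no natural order (values taken from the sample rows)
marital = pd.Series(
    ["Never-married", "Married-civ-spouse", "Divorced", "Married-civ-spouse"],
    name="marital_status",
)

one_hot = pd.get_dummies(marital, prefix="marital_status")
codes = LabelEncoder().fit_transform(marital)

# One-Hot: one column per distinct category, no implied order
print(one_hot.shape)
# Label: a single column, but the codes follow the sorted label order,
# so Divorced < Married-civ-spouse < Never-married is purely alphabetical
print(codes)
```

The integer codes carry no meaning beyond identity, yet a linear model would still treat code 2 as "twice" code 1.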

In [7]:
import pandas as pd

# Load the dataset
df = pd.read_csv('adult_with_headers.csv')

# Display the first few rows of the dataset
df.head()



Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [12]:

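# Convert 'income' column to numerical format
# Note: 'income' holds string labels ('<=50K' / '>50K'), so errors='coerce'
# turns every value into NaN (visible in the output below)
df['income'] = pd.to_numeric(df['income'], errors='coerce')

# New Feature 2: Capital Gain to Income Ratio
# Note: with 'income' all NaN, this ratio is NaN for every row; 'income' is also
# the prediction target, so a feature derived from it would leak the label
df['capital_gain_income_ratio'] = df['capital_gain'] / df['income']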

# Display the updated dataset with new features
df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,work_hours_per_week,capital_gain_income_ratio
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,39.031587,United-States,,39.031587,
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,44.421881,United-States,,44.421881,
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40.267096,United-States,,40.267096,
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40.267096,United-States,,40.267096,
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40.267096,Cuba,,40.267096,
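
Since the ratio above ends up entirely NaN, here is a hedged sketch of two alternative features and a log transformation that avoid the target column. The feature names `net_capital` and `is_full_time` and the 40-hour threshold are illustrative assumptions rather than the notebook's choices; the rows reuse values from the first sample output:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "capital_gain": [2174, 0, 0, 0, 0],
    "capital_loss": [0, 0, 0, 0, 0],
    "hours_per_week": [40, 13, 40, 40, 40],
})

# New feature 1: net capital movement in a single column
toy["net_capital"] = toy["capital_gain"] - toy["capital_loss"]
# New feature 2: flag for working at least 40 hours per week (assumed threshold)
toy["is_full_time"] = (toy["hours_per_week"] >= 40).astype(int)

# capital_gain is mostly zero with an occasional large value in these rows (right-skewed);
# log1p computes log(1 + x), so zeros stay at zero while large values are compressed
toy["capital_gain_log"] = np.log1p(toy["capital_gain"])
print(toy)
```

`log1p` is used instead of a plain logarithm because the column contains zeros, for which `log` is undefined.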


In [16]:
# Install ppscore library
!pip install ppscore



In [17]:
import ppscore as pps

# Calculate the PPS matrix
pps_matrix = pps.matrix(df)

# Display the PPS matrix
print(pps_matrix)

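# Compare with the correlation matrix
# Pearson correlation is only defined for numeric columns, so select them explicitly
correlation_matrix = df.select_dtypes(include='number').corr()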
print("\nCorrelation Matrix:")
print(correlation_matrix)






                             x                          y   ppscore  \
0                          age                        age  1.000000   
1                          age                  workclass  0.011232   
2                          age                     fnlwgt  0.000000   
3                          age                  education  0.052315   
4                          age              education_num  0.000000   
..                         ...                        ...       ...   
284  capital_gain_income_ratio             hours_per_week  0.000000   
285  capital_gain_income_ratio             native_country  0.000000   
286  capital_gain_income_ratio                     income  0.000000   
287  capital_gain_income_ratio        work_hours_per_week  0.000000   
288  capital_gain_income_ratio  capital_gain_income_ratio  1.000000   

                                  case  is_valid_score               metric  \
0                       predict_itself            True              
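
To illustrate why PPS and a correlation matrix can disagree, the sketch below compares Pearson correlation with a hand-rolled PPS-style score on synthetic data where `y = x**2`. The score is an approximation built from a cross-validated decision tree and a median baseline. It is not the `ppscore` implementation, and the helper name `pps_like` is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Symmetric, purely nonlinear relationship (synthetic)
x = np.linspace(-3, 3, 601)
toy = pd.DataFrame({"x": x, "y": x ** 2})

def pps_like(frame, feature, target):
    """PPS-style score: 1 - MAE(model) / MAE(median baseline), floored at 0."""
    cv = KFold(n_splits=4, shuffle=True, random_state=0)
    pred = cross_val_predict(DecisionTreeRegressor(random_state=0),
                             frame[[feature]], frame[target], cv=cv)
    mae_model = mean_absolute_error(frame[target], pred)
    mae_naive = mean_absolute_error(frame[target],
                                    np.full(len(frame), frame[target].median()))
    return max(0.0, 1 - mae_model / mae_naive)

pearson = toy["x"].corr(toy["y"])
score = pps_like(toy, "x", "y")
print(f"Pearson: {pearson:.3f}, PPS-style: {score:.3f}")
```

A Pearson value near 0 alongside a high PPS-style score points to a nonlinear relationship that a correlation matrix alone would miss. Unlike correlation, such a score is also asymmetric: predicting `x` from `y` need not score the same as predicting `y` from `x`.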

