# Data Cleaning  
**Objective:**
- Address data quality issues identified during EDA.
- Handle missing values, outliers, and incorrect data types.
- Ensure data integrity for modeling.

### Import Libraries and Load Data

In [29]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/raw/cs-training.csv', index_col=0)

### Handle Missing Values

**NumberOfDependents:**

In [30]:
# Impute missing NumberOfDependents with 0.
# Assign the result back: fillna(..., inplace=True) on a column selection
# is deprecated and will not modify df in pandas 3.0.
df['NumberOfDependents'] = df['NumberOfDependents'].fillna(0)


**MonthlyIncome:**

In [31]:
# Include Flag for missing MonthlyIncome
df['MonthlyIncomeMissing'] = df['MonthlyIncome'].isna().astype(int)

# Impute missing MonthlyIncome with 0.
# Assign the result back: fillna(..., inplace=True) on a column selection
# is deprecated and will not modify df in pandas 3.0.
df['MonthlyIncome'] = df['MonthlyIncome'].fillna(0)
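
The flag-then-impute pattern used for MonthlyIncome can be sketched on a small toy frame. The values below are made up for illustration and are not taken from the dataset. The point is the ordering: the missingness flag has to be computed before the fill, because the information is lost once the NaNs are replaced.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({"MonthlyIncome": [5000.0, np.nan, 3200.0, np.nan]})

# Record which rows were missing before imputing.
toy["MonthlyIncomeMissing"] = toy["MonthlyIncome"].isna().astype(int)

# Assign the filled column back instead of using inplace=True on a selection.
toy["MonthlyIncome"] = toy["MonthlyIncome"].fillna(0)

print(toy)
```

Keeping the flag lets a downstream model distinguish a true zero income from an imputed one.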


### Correcting Data Types

**NumberOfDependents:**

In [32]:
df['NumberOfDependents'] = df['NumberOfDependents'].astype('int64')
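
Casting a float column to `int64` silently truncates any fractional values, so it can be worth checking the column first. The helper below, `to_int64_checked`, is a hypothetical name introduced here for illustration and is not part of the original notebook. It is a minimal sketch of such a guard.

```python
import pandas as pd


def to_int64_checked(series: pd.Series) -> pd.Series:
    """Cast to int64, raising if a value is missing or has a fractional part."""
    if series.isna().any():
        raise ValueError("series still contains missing values")
    if not (series % 1 == 0).all():
        raise ValueError("series contains non-integer values")
    return series.astype("int64")


# Toy values standing in for an already imputed NumberOfDependents column.
toy = pd.Series([0.0, 2.0, 1.0], name="NumberOfDependents")
converted = to_int64_checked(toy)
print(converted.dtype)  # int64
```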

### Address Data Inconsistencies

**Age:**

In [33]:
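# Remove rows with non-positive ages (age <= 0).
# .copy() gives an independent frame so later column assignments do not act on a slice of the original.
df = df[df['age'] > 0].copy()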

**RevolvingUtilizationOfUnsecuredLines:**

In [34]:
# Cap values at 10 to limit the influence of extreme outliers
df['RevolvingUtilizationOfUnsecuredLines'] = df['RevolvingUtilizationOfUnsecuredLines'].clip(upper=10)

**DebtRatio:**

In [35]:
# Calculate the 99th percentile
debt_ratio_99perc = df['DebtRatio'].quantile(0.99)
print(f'99th percentile of DebtRatio: {debt_ratio_99perc}')

# Cap values at the 99th percentile
df['DebtRatio'] = df['DebtRatio'].clip(upper=debt_ratio_99perc)

99th percentile of DebtRatio: 4979.079999999958
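
The percentile-capping step can be wrapped in a small reusable function. `cap_at_quantile` is a hypothetical helper introduced here, not something the notebook defines, and the toy values are illustrative only. It is a sketch of the same `quantile` plus `clip` combination used above.

```python
import pandas as pd


def cap_at_quantile(series: pd.Series, q: float = 0.99) -> pd.Series:
    """Clip values above the q-th quantile of the series down to that quantile."""
    upper = series.quantile(q)
    return series.clip(upper=upper)


# Toy series with one extreme value.
toy = pd.Series([0.1, 0.2, 0.3, 0.4, 1000.0])
capped = cap_at_quantile(toy, q=0.75)

print(f"max before: {toy.max()}, max after: {capped.max()}")
```

Values at or below the threshold are left unchanged; only the upper tail is pulled in.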


### Remove Duplicates

In [36]:
df.drop_duplicates(inplace=True)
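
Counting duplicates before dropping them shows how many rows the step removes. The sketch below uses a toy frame with made-up values. Note that `duplicated()` and `drop_duplicates()` compare column values only and ignore the index, so rows with different index labels but identical features count as duplicates.

```python
import pandas as pd

# Toy frame: the first two rows are identical across all columns.
toy = pd.DataFrame({"age": [45, 45, 30], "DebtRatio": [0.8, 0.8, 0.1]})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_dupes} duplicate row(s); {len(deduped)} remain")
```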

### Save Cleaned Data

In [37]:
# index=False omits the row index (the ID column loaded via index_col=0) from the output file
df.to_csv('../data/processed/cleaned_data.csv', index=False)
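
Before relying on the saved file, the cleaning rules above can be re-checked with a few assertions. `check_cleaned` is a hypothetical helper introduced for illustration, and the toy frame holds made-up values. The column names are the ones used in this notebook.

```python
import pandas as pd


def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if any cleaning rule from this notebook is violated."""
    assert frame["NumberOfDependents"].notna().all(), "missing NumberOfDependents"
    assert frame["MonthlyIncome"].notna().all(), "missing MonthlyIncome"
    assert (frame["age"] > 0).all(), "non-positive age"
    assert (frame["RevolvingUtilizationOfUnsecuredLines"] <= 10).all(), "utilization above cap"
    assert not frame.duplicated().any(), "duplicate rows"


# Toy frame that satisfies every rule.
toy = pd.DataFrame(
    {
        "NumberOfDependents": [0, 2],
        "MonthlyIncome": [0.0, 4100.0],
        "age": [37, 52],
        "RevolvingUtilizationOfUnsecuredLines": [0.35, 10.0],
    }
)
check_cleaned(toy)
print("all checks passed")
```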