# Exploratory Data Analysis (EDA)

### Datathon challenge task
Predict how long it takes for a patient to receive a metastatic cancer diagnosis.

### EDA Techniques
- Analysis
    - Load data
    - Understand the data:
        - Categorize the types of features
        - Data shape, types, missing (null) values, unique values
        - Categorical vs numeric data, and which of the categorical data is ordinal (has an order)
        - Statistical analysis
        - Visualizations
            - Univariate
            - Bivariate
            - Multivariate
    - Feature correlations
- Cleaning
    - Removing features
    - Handling missing data
    - Encoding categorical variables
- Feature engineering


In [None]:
#pip install seaborn

In [None]:
# Import libraries

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, RobustScaler
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="whitegrid")

# Analysis

## Load data

In [None]:
# Load .csv data as dataframes and make patient_id the index

df_train = pd.read_csv('data/train.csv', index_col='patient_id')
df_test = pd.read_csv('data/test.csv', index_col='patient_id')

# Look at the first few rows of data
df_train.head()

## Understand the data

### Categorize the types of features

Read the [data descriptions](https://www.kaggle.com/competitions/widsdatathon2024-challenge2/data) on Kaggle.

**Patient characteristics:** patient_race, payer_type, patient_age, patient_gender, bmi

**Patient location:** patient_state, patient_zip3, Region, Division

**Breast cancer diagnosis information:** breast_cancer_diagnosis_code, breast_cancer_diagnosis_desc, metastatic_cancer_diagnosis_code, 
    metastatic_first_novel_treatment, metastatic_first_novel_treatment_type

**Geo (zip-code level) demographic data:** Many! (population, income, education, rent, race, poverty etc)

**Climate data:** 72 columns, each giving the monthly average temperature for the patient's zip3 in the referenced month (Jan-13 through Dec-18)

**Target variable:** metastatic_diagnosis_period

In [None]:
# Define a function that shows the data dimensions (# rows and columns), the total # of NA values,
# and, for each column (feature), its data type, number of distinct values, and number of NA (null) values

def initial_eda(df):
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%38s %10s   %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NaN Values"))
        col_name = df.columns
        dtyp = df.dtypes
        uniq = df.nunique()
        na_val = df.isna().sum()
        for i in range(len(df.columns)):
            # Use .iloc for positional access; plain [] with an integer on a label-indexed Series is deprecated
            print("%38s %10s   %10s %10s" % (col_name[i], dtyp.iloc[i], uniq.iloc[i], na_val.iloc[i]))
        
    else:
        print("Expect a DataFrame but got a %15s" % (type(df)))

In [None]:
# Call the function on our training data

initial_eda(df_train)

In [None]:
# Call the function on our testing data

initial_eda(df_test)

#### Observations

**Data types**

- 11 object types in train and test - these are categorical features
    - None of these seem to be ordinal

- 3 integer types in train, 2 in test - zip, age, and the target variable (metastatic_diagnosis_period)

- The rest of the features are floats

**# distinct values**

- Only 1 distinct value for patient_gender (female) in train and test data, so we can drop this feature

- Only 1 distinct value for metastatic_first_novel_treatment_type in train and test data, and only 11 rows contain a value in train and 7 in test, so we can drop this feature

- Only 2 distinct values for metastatic_first_novel_treatment in train and test data, and only 11 rows contain a value in train and 7 in test, so we can probably drop this feature, but look for any strong correlation with the target feature

- patient_zip3 and population have the same number of unique values in train (751) and test (669), so do we need them both? 
Population is a size, so it might be important. Is zip random? How strong is the correlation with the target variable?
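
One way to answer the "do we need them both?" question is to check whether zip3 and population map one-to-one. The sketch below uses a small made-up frame in place of `df_train`; the values are illustrative only. Rows with a missing population would need to be dropped first, because `nunique()` ignores NaN.

```python
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "patient_zip3": [100, 100, 200, 300, 300],
    "population": [5000.0, 5000.0, 7200.0, 6100.0, 6100.0],
})

# Number of distinct populations seen for each zip3, and vice versa.
pops_per_zip = toy.groupby("patient_zip3")["population"].nunique()
zips_per_pop = toy.groupby("population")["patient_zip3"].nunique()

# True only if every zip3 maps to exactly one population and back.
is_one_to_one = bool((pops_per_zip == 1).all() and (zips_per_pop == 1).all())
print("zip3 <-> population is one-to-one:", is_one_to_one)
```

If the mapping turns out to be one-to-one, the two columns carry the same grouping information, and the choice between them comes down to whether the numeric size is meaningful to the model.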

**NaN values - need to start thinking about how to handle these**

- patient_race: ~half are missing in train and test

- payer_type: 13-14% are missing in train and test

- bmi: 69-70% are missing in train and test

- metastatic_cancer_diagnosis_code and metastatic_first_novel_treatment_type: almost all are missing

- Several features (23) are missing 5 values in the train data but none of those are missing in the test data. Are the 5 missing values for
each feature in the same row? If so, can we just drop those rows? (Yes, all are in the same rows and they are in the same zip, and those
are the only 5 records for that zip, but that zip is not in the test data, so we can probably drop them.)

- Avg temps: Missing 0-180 in train and 0-95 in test. The 180/95 are in Apr-14 and are all in NY (6 zips) or ID (1 zip) in both train
and test
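
The percentages and the "same rows?" question above can be checked with a few lines of pandas. This is a minimal sketch on a made-up frame; the column subset in `geo_cols` is an assumption chosen for illustration, not the full list of 23 features.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "bmi": [27.1, np.nan, np.nan, np.nan, 31.4],
    "income_household_median": [52000.0, np.nan, 61000.0, 48000.0, 75000.0],
    "home_value": [210000.0, np.nan, 330000.0, 180000.0, 400000.0],
    "patient_zip3": [100, 772, 200, 300, 300],
})

# Share of missing values per column, largest first.
missing_pct = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)

# Do the geo-level gaps fall on the same rows? Compare the two index sets.
geo_cols = ["income_household_median", "home_value"]
rows_missing_all = toy.index[toy[geo_cols].isna().all(axis=1)]
rows_missing_any = toy.index[toy[geo_cols].isna().any(axis=1)]
same_rows = rows_missing_all.equals(rows_missing_any)
print("Gaps coincide:", same_rows)
print("Zips involved:", toy.loc[rows_missing_any, "patient_zip3"].unique())
```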


### Drop features and observations that logically add no value

In [None]:
# Drop features

# Only one value for patient_gender
# Only one value for metastatic_first_novel_treatment_type and two values for metastatic_first_novel_treatment, 
# and there are only 11 non-null rows in the train data, and 7 non-null in the test data

drop_features = [
                    'metastatic_first_novel_treatment',         # rarely filled in
                    'metastatic_first_novel_treatment_type',    # rarely filled in
                    'patient_gender',                           # always the same value
                   ]

# Always perform the same actions on train and test data
df_train = df_train.drop(drop_features, axis=1)
df_test = df_test.drop(drop_features, axis=1)


In [None]:
# Drop rows

# Drop the 5 rows that have 23 missing features all from the same zip; that zip is not present in the test set
mask = (df_train['patient_zip3'] == 772)
df_train = df_train[~mask]  # keep every row that is not in zip 772

# Check the shape (originally we had 13173 rows)
df_train.shape

### Statistical summary

In [None]:
# Do not truncate

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
print(df_train.describe())

In [None]:
print(df_test.describe())

#### Observations

**patient_zip3** max in test differs from the max in train, so test contains values that are not in train (and probably vice versa)

**patient_age** ranges from 18 to 91

**bmi** max is 97 in train but only 43.7 in test! Looking closer, the train data has one value of 90 and one of 97; these look like anomalies,
so perhaps we should remove those rows

**target variable** min=0, max=365, mean=96.5, std=109, count of 0=3126, count >= 350=224
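
To make the bmi anomaly check less dependent on eyeballing, one option is Tukey's 1.5 × IQR rule. The sketch below applies it to made-up BMI values; the rule and its threshold are an assumption for illustration, not something the notebook prescribes. On the real column, `quantile` skips NaN by default.

```python
import pandas as pd

# Toy BMI values (made up); 90 and 97 mimic the extreme readings noted above.
bmi = pd.Series([22.5, 24.0, 26.1, 27.8, 29.3, 31.0, 33.4, 35.2, 90.0, 97.0], name="bmi")

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = bmi.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = bmi[(bmi < lower) | (bmi > upper)]
print(f"Bounds: [{lower:.1f}, {upper:.1f}]")
print(outliers)
```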

### Categorical vs numerical features

In [None]:
# Create variables for the categorical and numeric features of df_train (df_test will be the same except the target col)
# Also create a variable for the target column

cat_cols = df_train.select_dtypes(include=['object']).columns
num_cols = df_train.select_dtypes(include=np.number).columns.tolist()
target_col = 'metastatic_diagnosis_period'
print("Categorical Variables:")
print(cat_cols)
print("Numerical Variables:")
print(num_cols)

### Compare unique values in df_train and df_test

Which values appear in the training data but not in the test data, and vice versa?

In [None]:
# Iterate over the columns of df_train
for col in df_train.columns:
    # Check if the column exists in df_test
    if col in df_test.columns:
        # Get the unique values of col in df_train
        unique_values_train = df_train[col].unique()
        # Get the unique values of col in df_test
        unique_values_test = df_test[col].unique()
        # Find the differences between the two sets
        differences = set(unique_values_train) - set(unique_values_test)
        diff_count = len(differences)
        if len(differences) > 0:
            print(f"{col} has the following unique values that are not present in the test dataset: {differences}")
            print(f"{col} has the following number of differences: {diff_count}")

In [None]:
# Iterate over the columns of df_test
for col in df_test.columns:
    # Check if the column exists in df_train
    if col in df_train.columns:
        # Get the unique values of col in df_test
        unique_values_test = df_test[col].unique()
        # Get the unique values of col in df_train
        unique_values_train = df_train[col].unique()
        # Find the differences between the two sets
        differences = set(unique_values_test) - set(unique_values_train)
        diff_count = len(differences)
        if len(differences) > 0:
            print(f"{col} has the following unique values that are not present in the train dataset: {differences}")
            print(f"{col} has the following number of differences: {diff_count}")

#### Observations

**patient_race, payer_type, patient_state, Region, Division, patient_age** Same values in train and test

**patient_zip3 and population** Train has 93 zips/populations not in test, test has 12 zips/populations not in train

**breast_cancer_diagnosis_code** Train has 7 values not in test (but just 1 or a few occurrences of each), test has 1 value not in train (but only 1 occurrence of it)

**metastatic_cancer_diagnosis_code** Train has 7 values not in test (but just 1 or a few occurrences of each), test has 2 values not in train (but only 1 occurrence of each)
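
The two comparison loops can be folded into one helper that is called in both directions. This is a sketch only: `unseen_values` is a hypothetical name, and the frames are made-up stand-ins for `df_train` and `df_test`. It drops NaN before comparing, since NaN never equals itself and can otherwise show up as a spurious difference in set arithmetic.

```python
import pandas as pd

def unseen_values(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """For each shared column, return the values in `left` that never occur in `right`."""
    result = {}
    for col in left.columns.intersection(right.columns):
        # dropna() keeps NaN out of the comparison.
        diff = set(left[col].dropna().unique()) - set(right[col].dropna().unique())
        if diff:
            result[col] = diff
    return result

# Toy frames (made-up values) standing in for df_train and df_test.
train = pd.DataFrame({"patient_zip3": [100, 200, 300], "payer_type": ["A", "B", "A"]})
test = pd.DataFrame({"patient_zip3": [100, 200, 400], "payer_type": ["A", "B", "B"]})

only_in_train = unseen_values(train, test)
only_in_test = unseen_values(test, train)
print(only_in_train)
print(only_in_test)
```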

### Visualizations

Categorical variables can be visualized with count plots, bar charts, pie charts, etc.

Numerical variables can be visualized with histograms, box plots, density plots, etc.

#### Univariate analysis

Looks at each feature individually

In [None]:
# Histogram and box plot of numerical data for df_train

for col in num_cols:
    print(col)
    print('Skew :', round(df_train[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    df_train[col].hist(grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df_train[col])
    plt.show()

In [None]:
# Histogram and box plot of numerical data for df_test

for col in num_cols:
    print(col)
    print('Skew :', round(df_test[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    df_test[col].hist(grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df_test[col])
    plt.show()

#### Observations of data

**bmi** 3 major outliers for train data, perhaps we drop those rows

**density** is quite skewed and has many outliers, but same for both data sets

**metastatic_diagnosis_period** Median is less than 50 days, and the distribution is right-skewed (positive skew)

**age_over_80** Out of proportion between the data sets based on histograms (box plots are very similar)


In [None]:
print(cat_cols)

In [None]:
# Count plot of categorical data for df_train

fig, axes = plt.subplots(4, 2, figsize = (18, 18))
fig.suptitle('Bar plot for all categorical variables in the dataset')
sns.countplot(ax = axes[0, 0], x = 'patient_race', data = df_train, color = 'blue', 
              order = df_train['patient_race'].value_counts().index);
sns.countplot(ax = axes[0, 1], x = 'payer_type', data = df_train, color = 'blue', 
              order = df_train['payer_type'].value_counts().index);
sns.countplot(ax = axes[1, 0], x = 'patient_state', data = df_train, color = 'blue', 
              order = df_train['patient_state'].value_counts().index);
sns.countplot(ax = axes[1, 1], x = 'Region', data = df_train, color = 'blue', 
              order = df_train['Region'].value_counts().index);
sns.countplot(ax = axes[2, 0], x = 'Division', data = df_train, color = 'blue', 
              order = df_train['Division'].value_counts().index);
sns.countplot(ax = axes[2, 1], x = 'breast_cancer_diagnosis_code', data = df_train, color = 'blue', 
              order = df_train['breast_cancer_diagnosis_code'].value_counts().index);
sns.countplot(ax = axes[3, 0], x = 'breast_cancer_diagnosis_desc', data = df_train, color = 'blue', 
              order = df_train['breast_cancer_diagnosis_desc'].value_counts().index);
sns.countplot(ax = axes[3, 1], x = 'metastatic_cancer_diagnosis_code', data = df_train, color = 'blue', 
              order = df_train['metastatic_cancer_diagnosis_code'].value_counts().index);

axes[1][0].tick_params(labelrotation=90);
axes[2][0].tick_params(labelrotation=45);
axes[2][1].tick_params(labelrotation=90);
axes[3][0].tick_params(labelrotation=90);
axes[3][1].tick_params(labelrotation=90);

In [None]:
# Count plot of categorical data for df_test

fig, axes = plt.subplots(4, 2, figsize = (18, 18))
fig.suptitle('Bar plot for all categorical variables in the dataset')
sns.countplot(ax = axes[0, 0], x = 'patient_race', data = df_test, color = 'blue', 
              order = df_test['patient_race'].value_counts().index);
sns.countplot(ax = axes[0, 1], x = 'payer_type', data = df_test, color = 'blue', 
              order = df_test['payer_type'].value_counts().index);
sns.countplot(ax = axes[1, 0], x = 'patient_state', data = df_test, color = 'blue', 
              order = df_test['patient_state'].value_counts().index);
sns.countplot(ax = axes[1, 1], x = 'Region', data = df_test, color = 'blue', 
              order = df_test['Region'].value_counts().index);
sns.countplot(ax = axes[2, 0], x = 'Division', data = df_test, color = 'blue', 
              order = df_test['Division'].value_counts().index);
sns.countplot(ax = axes[2, 1], x = 'breast_cancer_diagnosis_code', data = df_test, color = 'blue', 
              order = df_test['breast_cancer_diagnosis_code'].value_counts().index);
sns.countplot(ax = axes[3, 0], x = 'breast_cancer_diagnosis_desc', data = df_test, color = 'blue', 
              order = df_test['breast_cancer_diagnosis_desc'].value_counts().index);
sns.countplot(ax = axes[3, 1], x = 'metastatic_cancer_diagnosis_code', data = df_test, color = 'blue', 
              order = df_test['metastatic_cancer_diagnosis_code'].value_counts().index);

axes[1][0].tick_params(labelrotation=90);
axes[2][0].tick_params(labelrotation=45);
axes[2][1].tick_params(labelrotation=90);
axes[3][0].tick_params(labelrotation=90);
axes[3][1].tick_params(labelrotation=90);


In [None]:
# Group the data by breast_cancer_diagnosis_code and calculate the mean metastatic_diagnosis_period

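# Select the target before aggregating; calling .mean() on every column fails on non-numeric columns in recent pandas
grouped_data = df_train.groupby("breast_cancer_diagnosis_code")["metastatic_diagnosis_period"].mean()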
grouped_data.head()

In [None]:
# Plot a bar chart of the mean metastatic_diagnosis_period for each breast_cancer_diagnosis_code

plt.figure(figsize=(15, 8))
plt.bar(range(len(grouped_data)), grouped_data)
plt.xticks(range(len(grouped_data)), grouped_data.index, rotation=90)
plt.xlabel("Diagnosis Code")
plt.ylabel("Mean Metastatic Diagnosis Period")
plt.title("Mean Metastatic Diagnosis Period for Each Breast Cancer Diagnosis Code")
plt.show()

In [None]:
# Group the data by metastatic_cancer_diagnosis_code and calculate the mean metastatic_diagnosis_period

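# Select the target before aggregating; calling .mean() on every column fails on non-numeric columns in recent pandas
grouped_data = df_train.groupby("metastatic_cancer_diagnosis_code")["metastatic_diagnosis_period"].mean()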
grouped_data.head()

In [None]:
# Plot a bar chart of the mean metastatic_diagnosis_period for each metastatic_cancer_diagnosis_code

plt.figure(figsize=(15, 8))
plt.bar(range(len(grouped_data)), grouped_data)
plt.xticks(range(len(grouped_data)), grouped_data.index, rotation=90)
plt.xlabel("Diagnosis Code")
plt.ylabel("Mean Metastatic Diagnosis Period")
plt.title("Mean Metastatic Diagnosis Period for Each Metastatic Cancer Diagnosis Code")
plt.show()

#### Bivariate analysis

In [None]:
print(cat_cols)

In [None]:
print(df_train['breast_cancer_diagnosis_code'].unique())

In [None]:
# Use bar plots to show the relationship between each categorical variable and the continuous target variable

fig, axarr = plt.subplots(4, 2, figsize=(18, 24))
df_train.groupby('patient_race')[target_col].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][0], fontsize=12)
axarr[0][0].set_title("patient_race Vs target", fontsize=18)
df_train.groupby('payer_type')[target_col].mean().sort_values(ascending=False).plot.bar(ax=axarr[0][1], fontsize=12)
axarr[0][1].set_title("payer_type Vs target", fontsize=18)
df_train.groupby('patient_state')[target_col].mean().sort_values(ascending=False).head(20).plot.bar(ax=axarr[1][0], fontsize=12)
axarr[1][0].set_title("patient_state top 20 Vs target", fontsize=18)
df_train.groupby('Region')[target_col].mean().sort_values(ascending=False).plot.bar(ax=axarr[1][1], fontsize=12)
axarr[1][1].set_title("Region Vs target", fontsize=18)
df_train.groupby('Division')[target_col].mean().sort_values(ascending=False).plot.bar(ax=axarr[2][0], fontsize=12)
axarr[2][0].set_title("Division Vs target", fontsize=18)
df_train.groupby('breast_cancer_diagnosis_code')[target_col].mean().sort_values(ascending=False).plot.bar(ax=axarr[2][1], fontsize=12)
axarr[2][1].set_title("breast_cancer_diagnosis_code Vs target", fontsize=18)
df_train.groupby('breast_cancer_diagnosis_desc')[target_col].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][0], fontsize=12)
axarr[3][0].set_title("breast_cancer_diagnosis_desc Vs target", fontsize=18)
df_train.groupby('metastatic_cancer_diagnosis_code')[target_col].mean().sort_values(ascending=False).plot.bar(ax=axarr[3][1], fontsize=12)
axarr[3][1].set_title("metastatic_cancer_diagnosis_code Vs target", fontsize=18)
plt.subplots_adjust(hspace=1.0)
plt.subplots_adjust(wspace=.5)
sns.despine()

In [None]:


# Bar plot of the mean target for every patient_state (the grid above shows only the first 20)
plt.figure(figsize=(15, 8))
df_train.groupby('patient_state')[target_col].mean().sort_values(ascending=False).plot.bar(fontsize=12)
plt.show()

## Clean data

In [None]:
# Data prep - replace categorical column nulls with 'unknown'

for column in cat_cols:
    df_train[column] = df_train[column].mask(df_train[column].isnull(), 'unknown')
    df_test[column] = df_test[column].mask(df_test[column].isnull(), 'unknown')

In [None]:
# Drop target from num_cols

num_cols.remove(target_col)

In [None]:
# Data prep - remaining nulls are in numeric columns - convert them to mean values from df_train
# (since df_train has more data and will likely have a more representative mean value than df_test)

for column in num_cols:
    mean_value = df_train[column].mean()
    df_train[column] = df_train[column].mask(df_train[column].isnull(), mean_value)
    df_test[column] = df_test[column].mask(df_test[column].isnull(), mean_value) 

df_train.head()

# Alternatively, assign large numbers to nulls in numeric columns
'''
for column in num_cols:
    replacement_value = 1000000
    df_train[column] = df_train[column].mask(df_train[column].isnull(), replacement_value)
    df_test[column] = df_test[column].mask(df_test[column].isnull(), replacement_value) 

df_train.head()
'''
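
When a column is mostly missing (as bmi is here), a filled-in value alone hides the fact that it was missing. A common alternative, sketched below on made-up data, is to add an indicator column before imputing and to use the train median, which is less sensitive to outliers than the mean. The `bmi_missing` column name and the choice of median are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames (made-up values) standing in for df_train and df_test.
train = pd.DataFrame({"bmi": [22.0, np.nan, 30.0, np.nan, 95.0]})
test = pd.DataFrame({"bmi": [np.nan, 28.0]})

for df in (train, test):
    # Record where the value was missing before it is filled in.
    df["bmi_missing"] = df["bmi"].isna().astype(int)

# Use a statistic computed on train only and apply it to both frames.
fill_value = train["bmi"].median()
train["bmi"] = train["bmi"].fillna(fill_value)
test["bmi"] = test["bmi"].fillna(fill_value)

print(train)
print(test)
```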

In [None]:
# Convert categorical columns to numbers using LabelEncoder()

# Fit each encoder on train and test together so both share one mapping
df_combined = pd.concat([df_train, df_test], axis=0)

for column in cat_cols:
    le = LabelEncoder()
    le.fit(df_combined[column])
    df_train[column] = le.transform(df_train[column])
    df_test[column] = le.transform(df_test[column])

df_train.head()
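
Integer label codes imply an order that the categorical features here do not seem to have (none were judged ordinal earlier). For models that are sensitive to that, one-hot encoding is a possible alternative. The sketch below uses `pd.get_dummies` on made-up values; the category names are placeholders, and encoding the combined frame is an assumption made so that train and test end up with identical columns.

```python
import pandas as pd

# Toy frames (made-up values) standing in for df_train and df_test.
train = pd.DataFrame({"payer_type": ["plan_a", "plan_b", "unknown"]})
test = pd.DataFrame({"payer_type": ["plan_b", "plan_c"]})

# Encode both frames together so they share the same indicator columns.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined, columns=["payer_type"], dtype=int)

# Split back into the two original frames using the outer index level.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
print(train_enc)
print(test_enc)
```

High-cardinality columns such as patient_state would add many indicator columns this way, so integer codes may still be the more practical choice for tree-based models.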

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

In [None]:
df_test.head()

## Correlations

### Correlation EDA by feature category

Since there are so many features, a heatmap of all features is not comprehensible. So let us break down the features in a way that makes sense for the analysis.
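
Alongside per-group heatmaps, a compact numeric view can help: the correlation of each feature in a group with the target, sorted by strength. The helper below is a sketch; `target_correlations` and the `groups` dictionary are hypothetical names, and the data is made up.

```python
import pandas as pd

def target_correlations(df: pd.DataFrame, features: list, target: str) -> pd.Series:
    """Pearson correlation of each feature in `features` with `target`, strongest first."""
    cols = [c for c in features if c != target and c in df.columns]
    corr = df[cols].corrwith(df[target])
    # Sort by absolute strength but keep the sign in the result.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy data (made-up values); `groups` mirrors the idea of named feature lists.
toy = pd.DataFrame({
    "poverty": [10.0, 12.0, 15.0, 20.0, 25.0],
    "rent_burden": [30.0, 28.0, 33.0, 29.0, 31.0],
    "metastatic_diagnosis_period": [20, 35, 50, 90, 120],
})
groups = {"socioeconomic_factors": ["poverty", "rent_burden", "metastatic_diagnosis_period"]}

for name, features in groups.items():
    print(name)
    print(target_correlations(toy, features, "metastatic_diagnosis_period"))
```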

In [None]:
patient_demographics = [
    "patient_race",
    "patient_state",
    "patient_zip3",
    "patient_age",
    "bmi",
    "metastatic_diagnosis_period"
]

diagnosis_codes = [
    "breast_cancer_diagnosis_code",
    "metastatic_cancer_diagnosis_code"
]

# Not used below: both columns were dropped from df_train and df_test earlier
treatment = [
    "metastatic_first_novel_treatment",
    "metastatic_first_novel_treatment_type"
]

demographics_by_zip_code = [
    "population",
    "density",
    "age_median",
    "male",
    "female",
    "married",
    "family_size",
    "income_household_median",
    "income_household_six_figure",
    "home_ownership",
    "housing_units",
    "home_value",
    "rent_median",
    "education_college_or_above",
    "labor_force_participation",
    "unemployment_rate",
    "metastatic_diagnosis_period"
]

race_and_ethnicity = [
    "race_white",
    "race_black",
    "race_asian",
    "race_native",
    "race_pacific",
    "race_other",
    "race_multiple",
    "hispanic",
    "metastatic_diagnosis_period"
]

age_groups = [
    "age_under_10",
    "age_10_to_19",
    "age_20s",
    "age_30s",
    "age_40s",
    "age_50s",
    "age_60s",
    "age_70s",
    "age_over_80", 
    "metastatic_diagnosis_period"
]

marital_status = [
    "divorced",
    "never_married",
    "widowed",
    "metastatic_diagnosis_period"
]

income = [
    "family_dual_income",  # Create this feature based on family_dual_income
    "income_household_under_5",
    "income_household_5_to_10",
    "income_household_10_to_15", 
    "income_household_15_to_20",
    "income_household_20_to_25",
    "income_household_25_to_35",
    "income_household_35_to_50",
    "income_household_50_to_75",
    "income_household_75_to_100",
    "income_household_100_to_150",
    "income_household_150_over", 
    "income_individual_median",
    "metastatic_diagnosis_period"
]

socioeconomic_factors = [
    "poverty",
    "rent_burden",
    "metastatic_diagnosis_period"
]

education = [
    "education_less_highschool",
    "education_highschool",
    "education_some_college",
    "education_bachelors",
    "education_graduate",
    "education_stem_degree",
    "metastatic_diagnosis_period"
]

employment = [
    "self_employed",
    "farmer",
    "metastatic_diagnosis_period"
]

other = [
    "disabled",
    "limited_english",
    "commute_time",
    "health_uninsured",
    "veteran",
    "metastatic_diagnosis_period"
]

# Monthly average temperature columns, 'Average of Jan-13' through 'Average of Dec-18' (72 columns)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
time_based_averages = [f'Average of {month}-{year}' for year in range(13, 19) for month in months]
time_based_averages.append('metastatic_diagnosis_period')

#### Patient Demographics

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[patient_demographics].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()


#### Demographics by Zip Code

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[demographics_by_zip_code].corr()

# Create a heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()

#### Race and Ethnicity

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[race_and_ethnicity].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()

#### Age group correlation 

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[age_groups].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()

#### Marital Status

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[marital_status].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()


#### Income Correlation

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[income].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()

#### Socio-economic factors

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[socioeconomic_factors].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()


#### Education

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[education].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()

#### Employment

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[employment].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()


#### Other

In [None]:
# Calculate correlation coefficients
correlation_matrix = df_train[other].corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')  # Adjust 'cmap' for color scheme
plt.title('Correlation Matrix (Features vs. Target)')
plt.show()

### Correlation between each feature and the target using .corr() with the Spearman and Pearson methods

In [None]:
df_train.corr(method='spearman')[target_col].abs().sort_values()  # Spearman correlation for non-normal data

In [None]:
df_train.corr()[target_col].abs().sort_values() # Pearson is the default
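
To see why the two methods can rank features differently, it helps to compare them on a relationship that is monotonic but not linear. The sketch below uses made-up data; Spearman is computed as Pearson on ranks, which is its definition.

```python
import pandas as pd

# Made-up data with a monotonic but strongly nonlinear relationship.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Pearson measures linear association.
pearson = toy["x"].corr(toy["y"])
# Spearman is Pearson applied to the ranks of the values.
spearman = toy["x"].rank().corr(toy["y"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because `y` always increases with `x`, the rank correlation is perfect while the linear correlation is lower. A feature that looks weak under Pearson may still carry a monotonic signal.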

### Feature engineering

Feature engineering derives new features from the existing columns. The aim is to expose relationships and patterns that the original features alone do not make visible to a model, which can improve model performance.
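
As a rough illustration of the idea, the sketch below derives three features from columns that follow this dataset's naming pattern. The feature names (`temp_mean`, `temp_range`, `age_band`) and the bin edges are hypothetical, the values are made up, and none of these is known to help with this particular target.

```python
import pandas as pd

# Toy rows (made-up values) using column names that follow the dataset's pattern.
toy = pd.DataFrame({
    "patient_age": [34, 52, 67, 81],
    "Average of Jan-13": [30.1, 45.0, 28.4, 60.2],
    "Average of Jul-13": [75.3, 88.1, 70.0, 91.5],
})

temp_cols = [c for c in toy.columns if c.startswith("Average of ")]

# Hypothetical feature 1: mean temperature across all monthly columns.
toy["temp_mean"] = toy[temp_cols].mean(axis=1)
# Hypothetical feature 2: spread between the warmest and coldest month.
toy["temp_range"] = toy[temp_cols].max(axis=1) - toy[temp_cols].min(axis=1)
# Hypothetical feature 3: coarse age bands (bin edges are arbitrary).
toy["age_band"] = pd.cut(toy["patient_age"], bins=[0, 40, 60, 80, 120],
                         labels=["<=40", "41-60", "61-80", ">80"])

print(toy[["temp_mean", "temp_range", "age_band"]])
```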

# Regression modeling