# <p style="font-family: helvetica; letter-spacing: 3px; font-size: 30px; font-weight: bold; color:#1B2631; align:left;padding: 0px"> Feature Imputation with a Heat Flux Dataset<span class="emoji">📖🤓📖</span>
</p>

![image.jpg](attachment:images.jpg)

<p style="font-family: helvetica; letter-spacing: 1px; font-size: 20px; font-weight: bold; color:#1B2631; align:left;padding: 0px;border-bottom: 1px solid #003300"><span class="emoji">🗨️</span>Context
</p>

<div class="alert alert-block alert-info"> <b>NOTES TO THE READERS</b><br> This is a 2023 edition of Kaggle's Playground Series, in which the Kaggle community hosts a variety of fairly lightweight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science.</div>


## <a id="1"></a>
<p style="font-family: helvetica; letter-spacing: 1px; font-size: 20px; font-weight: bold; color:#1B2631; align:left;padding: 0px;border-bottom: 1px solid #003300"><span class="emoji">🔀</span>Install and Import
</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from IPython.display import display, HTML
import seaborn as sns
import random
import matplotlib.pyplot as plt
import plotly.express as px
import pandas_profiling as pp
from scipy import stats
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
from sklearn.impute import SimpleImputer

<a id="1"></a>
<div style="font-family: helvetica; letter-spacing: 1px; font-size: 20px; font-weight: bold; color:#1B2631; align:left;padding: 0px;border-bottom: 1px solid #003300"><span class="emoji">📈🔭</span>Data Overview
</div>

#### As per the competition, this is a fairly lightweight dataset that is synthetically generated from real-world data, and it provides an opportunity to quickly iterate through various models and feature engineering ideas.
#### Also, as stated in the dataset description, this dataset (used for both train and test) was generated from a deep learning model trained on the Predicting Critical Heat Flux dataset (link available below). Feature distributions are close to, but not exactly the same as, the original.

In [None]:
data              = pd.read_csv('/kaggle/input/playground-series-s3e15/data.csv')
sample_submission = pd.read_csv('/kaggle/input/playground-series-s3e15/sample_submission.csv')

<div class="alert alert-block alert-info"> <b>NOTES TO THE READER</b><br> Use the link https://www.kaggle.com/datasets/saurabhshahane/predicting-heat-flux to take a look at the Predicting Critical Heat Flux dataset from which our data has been derived.</div>

<a id="2"></a>
<div style="font-family: helvetica; letter-spacing: 1px; font-size: 20px; font-weight: bold; color:#1B2631; align:left;padding: 0px;border-bottom: 1px solid #003300"><span class="emoji">🔍📊</span>Exploratory Data Analysis
</div>

In [None]:
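def eda(df):
    """Print a quick EDA report for `df`: shape, head, per-column summary
    (dtype stats, duplicates, missing values, unique counts), correlation
    heatmaps for float columns, and distributions of numeric and
    categorical columns. Expects the target column 'x_e_out [-]' to exist."""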
    print("==================================================================")
    print("1. Dataframe Shape: ",df.shape)
    print("==================================================================")
    print("2. Explore the Data: ")
    display(HTML(df.head(5).to_html()))
    print("==================================================================")
    print("3. Information on the Data: ")
    data_info_df                      = pd.DataFrame(df.dtypes, columns=['data type'])
    data_info_df['Duplicated_Values'] = df.duplicated().sum()
    data_info_df['Missing_Values']    = df.isnull().sum().values 
    data_info_df['%Missing']          = df.isnull().sum().values / len(df)* 100
    data_info_df['Unique_Values']     = df.nunique().values
    df_desc                           = df.describe(include='all').transpose()
    data_info_df['Count']             = df_desc['count'].values
    data_info_df['Mean']              = df_desc['mean'].values
    data_info_df['STD']               = df_desc['std'].values
    data_info_df['Min']               = df_desc['min'].values
    data_info_df['Max']               = df_desc['max'].values
    data_info_df                      = data_info_df[['Count','Mean','STD', 'Min', 'Max','Duplicated_Values','Missing_Values',
                                                     '%Missing','Unique_Values']]   
    display(HTML(data_info_df.to_html()))
    print("==================================================================")
    print("4. Correlation Matrix Heatmap - For Numeric Variables:")
    num_cols = df.select_dtypes(include = ['float64']).columns.tolist()
    correlation_matrix = df[num_cols].corr()
    plt.figure(figsize=(12, 8))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5, fmt='.2f')
    plt.show()
    print("==================================================================")
    print("5. Correlation with Target Variable :")
    target_corr = correlation_matrix['x_e_out [-]'].drop('x_e_out [-]')
    target_corr_sorted = target_corr.sort_values(ascending=False)
    sns.set(font_scale=0.8)
    sns.set_style("white")
    sns.set_palette("PuBuGn_d")
    sns.heatmap(target_corr_sorted.to_frame(), cmap="coolwarm", annot=True, fmt='.2f')
    plt.show()
    print("==================================================================")
    print("6. Distribution of Numerical Variables")
    for col in num_cols:
        sns.histplot(df[col], kde=True)
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.title('Distribution of {}'.format(col))
        plt.show()
    print("==================================================================")
    print("7. Distribution of Categorical Variables")
    cat_cols = df.select_dtypes(include = ['object']).columns.tolist()
    for col in cat_cols:
        value_counts = df[col].value_counts(normalize=True) * 100
        fig, ax = plt.subplots(figsize=(8, 3))
        #top_n = min(17, len(value_counts))
        #ax.barh(value_counts.index[:top_n], value_counts.values[:top_n])
        ax.barh(value_counts.index, value_counts.values)
        ax.set_xlabel('Percentage Distribution')
        ax.set_ylabel(f'{col}')
        plt.tight_layout()
        plt.show()
    print("==================================================================")

In [None]:
eda(data)

<a id="3"></a>
<p style="font-family: helvetica; letter-spacing: 1px; font-size: 20px; font-weight: bold; color:#1B2631; align:left;padding: 0px;border-bottom: 1px solid #003300"><span class="emoji">📝📚</span>Analysis Summary
</p>

1. There are 10 variables - 8 features, 1 target variable ('x_e_out [-]') & 1 primary key ('id')

2. 8 variables are numeric and 2 - 'author' & 'geometry' are categorical in nature

3. No duplicates in the data. 
   However, the columns in this dataset contain missing values, which need to be imputed. Only 'id' and 
   'chf_exp[MW/m2]' do not have missing values.

4. Positive correlation observed between the 'D_e' & 'D_h' variables. 
   Negative correlation observed between 'pressure' & 'D_e'/'D_h' due to formulation reasons.

##### Other Facts:
5. The distribution of our target variable is approximately normal, with a few outliers that can be treated.

6. The distribution of the categorical variables shows that 'Thompson' is the most frequent value among authors and 'tube' the most frequent among geometries. This information can be used when imputing the categorical variables.
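
The checks behind points 3 and 6 can be reproduced in a few lines. The sketch below is only illustrative: it runs on a small made-up frame (`toy`), not on the competition data, and `dominant_category` is a hypothetical helper introduced here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition data; the values are made up for illustration.
toy = pd.DataFrame({
    "author": ["Thompson", "Thompson", None, "Janssen", "Thompson"],
    "geometry": ["tube", None, "tube", "annulus", "tube"],
    "pressure [MPa]": [7.0, np.nan, 13.8, 6.9, np.nan],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

def dominant_category(series):
    """Return the most frequent value and its share among non-missing rows."""
    shares = series.value_counts(normalize=True)  # NaN is excluded by default
    return shares.index[0], shares.iloc[0]

top_author, top_author_share = dominant_category(toy["author"])
top_geometry, top_geometry_share = dominant_category(toy["geometry"])

print(missing_pct.round(1))
print(top_author, round(top_author_share, 2))
print(top_geometry, round(top_geometry_share, 2))
```

A high share for the top category is what makes 'most frequent' imputation a reasonable first choice for that column.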

<div class="alert alert-block alert-info"> <b>NOTES TO THE READER</b><br> Refer to this notebook for further iterations !!!</div>


<a id="4"></a>
<p style="font-family: helvetica; letter-spacing: 1px; font-size: 20px; font-weight: bold; color:#1B2631; align:left;padding: 0px;border-bottom: 1px solid #003300"><span class="emoji">🧩</span>Basic Imputation Techniques
</p>

In [None]:
df = data.copy()
df = df.drop(["x_e_out [-]"], axis=1) #Since this is our target variable
num_cols = df.select_dtypes(include = ['float64']).columns.tolist()
cat_cols = df.select_dtypes(include = ['object']).columns.tolist()

In [None]:
#Imputation for Numeric Variables (median of each column)
impute_median = SimpleImputer(strategy = 'median')
impute_median.fit(df[num_cols])
df[num_cols] = impute_median.transform(df[num_cols])

#Imputation for Categorical Variables
impute_mode = SimpleImputer(strategy = 'most_frequent')
impute_mode.fit(df[cat_cols])
df[cat_cols] = impute_mode.transform(df[cat_cols])

In [None]:
df.head(5)

<div class="alert alert-block alert-info"> <b>NOTE TO THE READER</b><br>As we saw above, a single value has a very high number of occurrences in each of the two categorical columns, so we can fill their missing values using the 'most frequent' imputation method.<br>Alternative:<br>However, if you think that this might bias the data towards a single value, you can instead perform KNN imputation, for which we need to normalize the input data and one-hot encode the categorical variables. Please note that one-hot encoding will increase the dimensionality of our training dataset.</div>
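
A minimal sketch of the KNN alternative, assuming scikit-learn's `KNNImputer` and `MinMaxScaler` and a small made-up frame (`toy`) in place of the competition data. Two choices here are my own assumptions rather than part of the notebook: rows with a missing category are set to NaN across their one-hot columns so that the imputer fills them, and the imputed category is recovered by taking the one-hot column with the largest value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the competition data; the values are made up for illustration.
toy = pd.DataFrame({
    "geometry": ["tube", "tube", None, "annulus", "tube", "annulus"],
    "pressure [MPa]": [7.0, 6.9, 7.1, 13.8, np.nan, 13.7],
    "length [mm]": [432.0, 440.0, 435.0, 1778.0, 430.0, np.nan],
})
toy_num_cols = ["pressure [MPa]", "length [mm]"]
toy_cat_cols = ["geometry"]

# One-hot encode the categorical columns. get_dummies encodes a missing
# category as an all-zero row, so mark those rows as NaN explicitly.
dummies = pd.get_dummies(toy[toy_cat_cols], dtype=float)
for col in toy_cat_cols:
    own = [c for c in dummies.columns if c.startswith(col + "_")]
    dummies.loc[toy[col].isna(), own] = np.nan

# Scale numeric columns to [0, 1]; MinMaxScaler ignores NaN when fitting
# and leaves NaN in place when transforming.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy[toy_num_cols]),
    columns=toy_num_cols,
    index=toy.index,
)

# Impute every missing entry from the 2 nearest rows.
features = pd.concat([scaled, dummies], axis=1)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# Map back: undo the scaling and pick the strongest one-hot column.
result = toy.copy()
result[toy_num_cols] = scaler.inverse_transform(imputed[toy_num_cols])
for col in toy_cat_cols:
    own = [c for c in dummies.columns if c.startswith(col + "_")]
    result[col] = imputed[own].idxmax(axis=1).str[len(col) + 1:]

print(result)
```

On the real data, `n_neighbors` would need tuning, and the one-hot columns for 'author' and 'geometry' add one feature per distinct category.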

<a id="4"></a>
<p style="font-family: helvetica; letter-spacing: 1px; font-size: 20px; font-weight: bold; color:#1B2631; align:left;padding: 0px;border-bottom: 1px solid #003300"><span class="emoji">🔜🚀</span>Suggested Next Steps
</p>

A few additional methods of data imputation:
1. Mean/Median Imputation: Replace missing values with the mean or median of the available data in the respective feature/column. This method works best when the values are missing completely at random, and it reduces the variance of the imputed feature.

2. Mode Imputation: For categorical variables, replace missing values with the mode (the most frequent value) of the feature.

3. Forward Fill/Backward Fill: Propagate the last known value forward or the next known value backward to fill missing values. This method is suitable for time series or sequential data.

4. Interpolation: Use interpolation methods to estimate missing values based on the values of neighboring data points. Common interpolation methods include linear interpolation, polynomial interpolation, or spline interpolation.

5. Model-based Imputation: Train a machine learning model on the available data and use it to predict missing values. This method can provide more accurate imputations but requires a separate training phase.
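
Methods 3 to 5 above can be sketched as follows. This is an illustration on made-up data, not part of the notebook's pipeline. The fill and interpolation calls are standard pandas `Series` methods. For the model-based case I assume scikit-learn's `IterativeImputer`, which is still marked experimental and needs the explicit `enable_iterative_imputer` import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Forward fill, backward fill and linear interpolation on an ordered series.
series = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
ffilled = series.ffill()                      # carry the last known value forward
bfilled = series.bfill()                      # carry the next known value backward
linear = series.interpolate(method="linear")  # midpoint of the two neighbors

# Model-based imputation: y depends (almost) linearly on x, so a regression
# model fitted on the observed rows can predict the missing y values.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=50)
frame = pd.DataFrame({"x": x, "y": y})
frame.loc[[3, 17], "y"] = np.nan

model_imputer = IterativeImputer(random_state=0)
filled = pd.DataFrame(model_imputer.fit_transform(frame), columns=frame.columns)

print(ffilled.tolist(), bfilled.tolist(), linear.tolist())
print(filled.loc[[3, 17]])
```

Forward/backward fill and interpolation only make sense when row order carries meaning, which is not obviously the case for this dataset, so the model-based route is likely the more relevant one here.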

<div class="alert alert-block alert-info"> <span>🔄</span>This is currently a WIP (work in progress) version, check out the same notebook for my further approaches!<span>👀</span></div>


<p style="font-family:helvetica;font-size: 18px;letter-spacing:1px;color:#82E0AA; align:center;padding: 0px">
<span class="emoji">😊</span>Thank you so much for taking the time to check out my notebook on Kaggle! <br> <span class="emoji">🚀</span>Your support means a lot to me. If you found it helpful, please consider giving it a thumbs up or leaving a comment. <br><span class="emoji">💌</span>Your encouragement and feedback are greatly appreciated. Thank you again!
</p>