# Aerogel Bonding AI & ML Project 2024
## Importing Libraries & Reading Data

In [27]:
# Importing Libraries

import numpy as np 
import pandas as pd
import sklearn
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Reading Data Set
path = './aerogel_bonding.csv'
df = pd.read_csv(path)

## EDA & Data Pre-Processing
Here we visualize the data, note the values that matter for modelling, and clean irrelevant or invalid data out of the dataset.

### Visualizing and Understanding Data Structure & Values

In [31]:
pd.set_option('display.max_columns', None) # View all columns
df.head()

Unnamed: 0,HolidaysTaken,PercentageOfCompletedTasks,CurrentJobDuration,RecentHolidaysTaken,RequestedProcessAmount,JobStatus,BondingRiskRating,TotalMaterialProcessed,ByproductRation,working_skills,CivilStatus,dependability,MistakesLastYear,HighestEducationAttained,BondingSuccessful,ChurnRisk,ProcessedKilograms,SkillRating,ProcessingTimestamp,WorkExperience,HistoricalBehavior,TotalMaterialToProcess,WorkHistoryDuration,ApplicantAge,PriorExecutionDefaults,DifferentTasksCompleted,TotalChurnRisk,OtherCompaniesMaterialProcessed,BondingPeriod,trustability,MonthlyExecutions
0,4.0,,6.0,1.0,51172.0,Employed,46.0,300388.0,0.273137,1.24412,Married,2.798099,0.0,Master,0.0,0.223172,44305.0,606.0,2055-07-29,48.0,23.0,14193.0,3.0,67.0,0.0,,,,24.0,3.144274,440.0
1,1.0,0.323046,4.0,0.0,11246.0,Employed,54.0,299914.0,0.450387,2.228183,Married,2.58644,0.0,Bachelor,,0.215746,,561.0,2072-03-10,31.0,17.0,85355.0,28.0,52.0,,0.0,0.232582,214559.0,36.0,3.704809,
2,5.0,0.491574,3.0,1.0,14075.0,Employed,42.4,74687.0,0.325027,2.699264,Married,1.949641,0.0,Bachelor,1.0,0.256075,67954.0,,2032-01-24,20.0,25.0,14006.0,,45.0,0.0,0.0,0.240812,60681.0,60.0,2.427195,171.0
3,4.0,0.108916,3.0,1.0,18957.0,Employed,40.8,47866.0,,0.445854,Married,1.569581,0.0,Bachelor,1.0,0.240457,98184.0,607.0,2029-11-12,19.0,32.0,13240.0,16.0,42.0,0.0,4.0,0.23152,34626.0,84.0,1.156431,212.0
4,,0.174628,1.0,2.0,17902.0,Employed,51.0,18181.0,0.388317,1.940075,Single,2.149917,,Associate,,0.206902,48981.0,612.0,2031-08-22,28.0,14.0,44217.0,28.0,50.0,0.0,0.0,0.214425,4812.0,48.0,3.185402,323.0


### Evaluation of dataset features

- **HolidaysTaken:** Integer number of holiday days taken by the worker. NaN can mean the worker took no days off, so it is treated the same as 0 and does not affect any calculations. Only rows with negative values, if any are found, should be excluded.
- **PercentageOfCompletedTasks:** Float between 0 and 1 representing the share of tasks completed by the worker (the percentage divided by 100). It can be NaN if the worker has not completed any task or has not been assigned any work.
- **CurrentJobDuration:** Integer number of months the worker has been in the current job. Rows where it is NaN, negative, or 0 should be excluded.
- **RecentHolidaysTaken:** Integer number of days taken for the most recent holiday. It should be less than or equal to HolidaysTaken; otherwise the row should be excluded.
- **RequestedProcessAmount:** Integer amount of material requested for the bonding process. Rows with NaN or a value below 0 should be excluded.
- **JobStatus:** Categorical status of the worker. We will represent it with One-Hot Encoding because we do not want to impose any ordering between categories. The extra columns should not cause memory problems, since the dataset is not large and there are only three categories, which makes this encoding a good fit. Rows with NaN should be excluded.
- **BondingRiskRating:** Float between 0 and 100 giving the percentage risk associated with bonding using certain materials or processes. NaN can be treated as zero; anything outside the 0-100 range should be considered an error and excluded.
- **TotalMaterialProcessed:** Integer cumulative amount of material used by a worker or team of workers across bonding processes. Rows with zero or negative values should be excluded.
- **ByproductRation:** Float between 0 and 1 representing the amount of byproduct generated during bonding relative to the total amount of material used (the percentage divided by 100). Rows that are out of range or NaN can be excluded.
- **working_skills:** Float between 0 and 5 representing a rating of the worker's skills.
- **CivilStatus:** 
- **dependability:**
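
The rules listed above can be sketched as a single filtering step. This is only a minimal illustration, not the cleaning code used later in the project. The helper name `apply_feature_rules` and the toy rows are hypothetical, and `"OtherStatus"` is a placeholder category rather than a value taken from the dataset. Two choices are assumptions, because the notes do not cover them: rows with a missing `RecentHolidaysTaken` or `TotalMaterialProcessed` are kept, and `RecentHolidaysTaken` is compared against `HolidaysTaken` after its NaNs have been replaced by 0.

```python
import numpy as np
import pandas as pd


def apply_feature_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Filter and encode rows following the notes above (hypothetical helper)."""
    out = df.copy()

    # NaN is read as zero for these two columns.
    out["HolidaysTaken"] = out["HolidaysTaken"].fillna(0)
    out["BondingRiskRating"] = out["BondingRiskRating"].fillna(0)

    # Comparisons with NaN evaluate to False, so a missing value in any
    # column checked directly below removes the row.
    keep = (
        (out["HolidaysTaken"] >= 0)
        & (out["CurrentJobDuration"] > 0)
        & (out["RequestedProcessAmount"] >= 0)
        & out["BondingRiskRating"].between(0, 100)
        & out["ByproductRation"].between(0, 1)
        & out["JobStatus"].notna()
        # Assumption: only explicit violations are dropped for these two
        # columns; rows where they are missing are kept.
        & ~(out["RecentHolidaysTaken"] > out["HolidaysTaken"])
        & ~(out["TotalMaterialProcessed"] <= 0)
    )
    out = out[keep]

    # One-hot encode JobStatus so no ordering is implied between categories.
    return pd.get_dummies(out, columns=["JobStatus"], dtype=int)


# Toy rows for illustration only; "OtherStatus" is a placeholder category.
toy = pd.DataFrame({
    "HolidaysTaken": [4.0, np.nan, 5.0, 2.0],
    "CurrentJobDuration": [6.0, 1.0, 0.0, 3.0],
    "RecentHolidaysTaken": [1.0, 0.0, 1.0, 3.0],
    "RequestedProcessAmount": [51172.0, 17902.0, 14075.0, 18957.0],
    "JobStatus": ["Employed", "OtherStatus", "Employed", "Employed"],
    "BondingRiskRating": [46.0, np.nan, 42.4, 40.8],
    "TotalMaterialProcessed": [300388.0, 18181.0, 74687.0, 47866.0],
    "ByproductRation": [0.273137, 0.388317, 0.325027, 0.25],
})
cleaned = apply_feature_rules(toy)
print(cleaned)
```

In the toy frame, the third row is dropped because `CurrentJobDuration` is 0, and the fourth because `RecentHolidaysTaken` exceeds `HolidaysTaken`. Since each column in the real dataset has missing values, applying all of the NaN exclusions together may remove a large share of rows, so it is worth checking the row count before and after.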

In [29]:
df.dtypes

HolidaysTaken                      float64
PercentageOfCompletedTasks         float64
CurrentJobDuration                 float64
RecentHolidaysTaken                float64
RequestedProcessAmount             float64
JobStatus                           object
BondingRiskRating                  float64
TotalMaterialProcessed             float64
ByproductRation                    float64
working_skills                     float64
CivilStatus                         object
dependability                      float64
MistakesLastYear                   float64
HighestEducationAttained            object
BondingSuccessful                  float64
ChurnRisk                          float64
ProcessedKilograms                 float64
SkillRating                        float64
ProcessingTimestamp                 object
WorkExperience                     float64
HistoricalBehavior                 float64
TotalMaterialToProcess             float64
WorkHistoryDuration                float64
ApplicantAg

### Checking Data Integrity

In [None]:
# Calculate missing values
missing_values = df.isnull().sum()

# Calculate total values (missing + non-missing)
total_values = df.count() + missing_values

# Calculate percentage of missing values and format with percentage sign
percentage_missing = (missing_values / total_values * 100).apply(lambda x: f'{x:.2f}%')

# Create a summary DataFrame
summary_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Total Values': total_values,
    'Percentage Missing': percentage_missing
})

# Print the summary
print("Summary:")
print(summary_df)

Summary:
                                 Missing Values  Total Values  \
HolidaysTaken                              1905         20000   
PercentageOfCompletedTasks                 2045         20000   
CurrentJobDuration                         2048         20000   
RecentHolidaysTaken                        1969         20000   
RequestedProcessAmount                     2075         20000   
JobStatus                                  2018         20000   
BondingRiskRating                          2050         20000   
TotalMaterialProcessed                     2043         20000   
ByproductRation                            1972         20000   
working_skills                             1966         20000   
CivilStatus                                1954         20000   
dependability                              1949         20000   
MistakesLastYear                           1987         20000   
HighestEducationAttained                   2063         20000   
BondingSuccessfu