# Target creation notebook  
## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('../../data/preprocessed_data.csv')

## Analysis on interesting features for the target creation 
### Columns selection 
The columns below hold the results of mechanical tests that are representative of the welding quality.


In [30]:
col_test = ['Yield strength (MPa)','Ultimate tensile strength (MPa)', 'Elongation (%)',
            'Reduction of Area (%)','Charpy impact toughness (J)']

The dataset covers various steels, and the Charpy temperature differs hugely depending on the type, so we drop that column.

In [31]:
df = df.drop('Charpy temperature (deg C)',axis=1)

In [32]:
#Keep the list of feature columns for later use
features = [c for c in df.columns if c not in col_test]

### Information printing  
This part will be useful to decide how the target will be created from the columns holding the welding quality test results (called "test columns" below).  

We start by creating Boolean columns that tell, for each row, whether each test was done.

In [33]:
#Init
dfd = df.copy()
Lt=['nb']
dfd['nb'] = 0

#For each test column, we create a flag column with the suffix "_test"
#telling whether the test was done (1) or the cell is blank (0)
for c in col_test:
    n=c+'_test'
    dfd[n] = np.where(dfd[c].isna(),0,1)
    Lt.append(n)
    
    #We also keep a running count of how many tests
    #each observation has
    dfd['nb'] = dfd['nb'] + dfd[n]
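The same flags can be built without an explicit loop. This is a minimal sketch on a made-up three-row frame; `toy` and `flags` are illustrative names, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two of the test columns (values are made up)
toy = pd.DataFrame({
    'Yield strength (MPa)': [0.3, np.nan, np.nan],
    'Elongation (%)': [1.2, -0.4, np.nan],
})

# 1 where a measurement exists, 0 where the cell is blank
flags = toy.notna().astype(int).add_suffix('_test')

# Number of tests actually performed on each row
# (the sum is evaluated before 'nb' is added, so only the flags are counted)
flags['nb'] = flags.sum(axis=1)
print(flags)
```

The first row has both tests, the second only one, and the third none, which is the information the `nb` column carries in the real data.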

In [34]:
# Display of test completion rate
for k in Lt[1:]:
    # Share of each flag value in %, looked up by label (0 = not done, 1 = done)
    # so the printed labels do not depend on the ordering of value_counts
    pct = dfd[k].value_counts(normalize=True).reindex([0, 1], fill_value=0) * 100
    print(k,'\nTest not done',int(pct[0]), '%\nTest done',int(pct[1]),'%\n')

Yield strength (MPa)_test 
Test not done 52 %
Test done 47 %

Ultimate tensile strength (MPa)_test 
Test not done 50 %
Test done 49 %

Elongation (%)_test 
Test not done 53 %
Test done 46 %

Reduction of Area (%)_test 
Test not done 52 %
Test done 47 %

Charpy impact toughness (J)_test 
Test not done 50 %
Test done 49 %



We see that each test was performed on only about half of the observations.

In [35]:
#Display the number of tests performed on each observation
dfd['nb'].value_counts()

nb
1    650
4    510
5    134
0     82
3     52
2     35
Name: count, dtype: int64

According to the table above, the largest group of observations (650 out of 1,463) has exactly one test performed. What's more, only 82 observations have no test at all.  

Since the data is normalized, we can impute missing values with a nearest-neighbor (KNN) algorithm.  
However, it is important to note that we will be generating almost as many values as we start with, which puts the quality of the results at risk. We keep this in mind when selecting the values used for target creation, in order to limit bias and overfitting (due to a lack of diversity in the values). The generated values will therefore only be used on a limited population (the 82 observations without tests).  

## Preparation of the columns that are used to create the target 
### KNN Algorithm

In [36]:
#We apply the KNN imputation to the float columns only
#('float64' is the NumPy dtype name of the columns read from the CSV)
quanti = df.select_dtypes(include=['float64']).columns

#DataFrame on which the KNN imputation is performed
dfknn = df[quanti]  #(quanti includes the test columns)

#KNN
imputer = KNNImputer(n_neighbors=5)
dfknn = imputer.fit_transform(dfknn)

#Formatting the resulting dataframe
dfknn = pd.DataFrame(dfknn)
dfknn.columns = quanti
dfknn = dfknn[col_test] #We only want the filled test columns at the end
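To make the imputation step more concrete, here is a minimal sketch of what `KNNImputer` does on a tiny made-up matrix. The array `X` and the choice of 2 neighbours are assumptions for illustration only; the notebook itself uses 5 neighbours on the real columns.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: the last row is missing its second value
X = np.array([
    [0.0, 1.0],
    [0.1, 3.0],
    [5.0, 9.0],
    [0.05, np.nan],
])

# With 2 neighbours, the gap is filled with the mean of the second column
# over the 2 rows that are closest on the observed first column
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[3])
```

The two closest rows are the first two, so the missing cell receives the mean of 1.0 and 3.0. Distances are computed on the observed columns only, which is why having all columns on the same normalized scale matters here.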

### Init of the final df that will contain the target  
We need to know whether each test has been performed, its value (real and generated values are mixed in the same test column) and the number of tests performed on each observation.

In [37]:
dfcible = dfd[Lt].reset_index().merge(dfknn.reset_index(),on='index',how='left')

In [38]:
dfcible.head()

| | index | nb | Yield strength (MPa)_test | Ultimate tensile strength (MPa)_test | Elongation (%)_test | Reduction of Area (%)_test | Charpy impact toughness (J)_test | Yield strength (MPa) | Ultimate tensile strength (MPa) | Elongation (%) | Reduction of Area (%) | Charpy impact toughness (J) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 1 | 1 | 1 | 1 | 0 | -1.278195 | -1.469907 | 1.166875 | 1.002657 | -0.567962 |
| 1 | 1 | 5 | 1 | 1 | 1 | 1 | 1 | -1.515708 | -1.582746 | 1.840257 | 1.002657 | 0.336142 |
| 2 | 2 | 4 | 1 | 1 | 1 | 1 | 0 | -1.051477 | -1.108825 | 1.024036 | 1.002657 | -0.266594 |
| 3 | 3 | 5 | 1 | 1 | 1 | 1 | 1 | -1.170234 | -1.199096 | 0.983225 | 1.002657 | 0.336142 |
| 4 | 4 | 4 | 1 | 1 | 1 | 1 | 0 | -0.457694 | -0.510783 | 0.656736 | 0.7904 | -0.266594 |


==> each 'x_test' column tells whether the value of x is real (1) or generated (0)  
### Threshold creation  
Because we have Z-normalized values, I need the mean and the standard deviation of the test columns in the raw data, so that the thresholds can be expressed on the same scale.

In [39]:
data = pd.read_table("../data/welddb.data",sep = " ", header=None)
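#'N' marks a missing value; np.nan is used because replace(..., None)
#forward-fills instead of blanking the cell on some pandas versions
data = data.replace('N', np.nan)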
data.columns = ["C concentration (weight%)","Si concentration (weight%)", "Mn concentration (weight%)","S concentration (weight%)", "P concentration (weight%)", "Ni concentration (weight%)", "Cr concentration (weight%)", "Mo concentration (weight%)", "V concentration (weight%)", "Cu concentration (weight%)", "Co concentration (weight%)", "W concentration (weight%)", "O concentration (ppm/weight)", "Ti concentration (ppm/weight)", "N concentration (ppm/weight)", "Al concentration (ppm/weight)", "B concentration (ppm/weight)", "Nb concentration (ppm/weight)", "Sn concentration (ppm/weight)", "As concentration (ppm/weight)", "Sb concentration (ppm/weight)", "Current (A)", "Voltage (V)", "AC or DC", "Electrode positive or negative", "Heat input (kJ/mm)", "Interpass temperature (deg C)", "Type of weld", "Post weld heat treatment temperature (deg C)", "Post weld heat treatment time (hours)", "Yield strength (MPa)", "Ultimate tensile strength (MPa)", "Elongation (%)", "Reduction of Area (%)", "Charpy temperature (deg C)", "Charpy impact toughness (J)", "Hardness (kg/mm2)", "50 FATT", "Primary ferrite in microstructure (%)", "Ferrite with second phase (%)", "Acicular ferrite (%)", "Martensite(%)", "Ferrite with carbide aggreagate (%)", "Weld ID"]
data = data.replace("<","",regex=True)
data['N concentration (ppm/weight)'] = data['N concentration (ppm/weight)'].str.split("tot").str[0]
data['Hardness (kg/mm2)'] = data['Hardness (kg/mm2)'].str.split("(").str[0]
data['Hardness (kg/mm2)'] = data['Hardness (kg/mm2)'].str.split("H").str[0]
data['Interpass temperature (deg C)'] = data['Interpass temperature (deg C)'].replace('150-200','175')
for i, column in enumerate(data.columns):
    if i not in [23,24,27,43]:
        data[column] = data[column].astype(float)

data.columns = data.columns.map(str)
data = data[col_test]
means = data.mean()
stds = data.std()

Since I'm not studying mechanical engineering, I asked ChatGPT for a general threshold for each test. It gave me values for "normal/soft" steel.

In [40]:
#Dictionary init to hold both the real thresholds (keys) and the normalized thresholds (values)
Seuils = {
    'Yield strength (MPa)' : {350:[]},
    'Ultimate tensile strength (MPa)' : {500:[]},
    'Elongation (%)': {20:[]},
    'Reduction of Area (%)':{20:[]},
    'Charpy impact toughness (J)':{50:[]}
}

#Normalization
for c in col_test:
    val = list(Seuils[c].keys())[0]
    Seuils[c][val] = (val - means[c]) / stds[c]
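The normalization applied to each threshold is the usual z-score, z = (x - mean) / std. The sketch below works it through on made-up numbers (the `raw` series is not weld data). It also shows that the result depends on which standard deviation is used: pandas defaults to the sample estimate (`ddof=1`), while scikit-learn's `StandardScaler` uses the population estimate (`ddof=0`). The preprocessing step is not shown in this notebook, so which of the two matches the normalized columns is an assumption worth checking.

```python
import pandas as pd

# Made-up raw measurements for one test column (not the weld data)
raw = pd.Series([300.0, 400.0, 500.0], name='Yield strength (MPa)')

mean = raw.mean()
std_sample = raw.std()            # pandas default, ddof=1
std_population = raw.std(ddof=0)  # estimate used by StandardScaler

threshold_mpa = 350
z_sample = (threshold_mpa - mean) / std_sample
z_population = (threshold_mpa - mean) / std_population
print(z_sample, z_population)
```

On large samples the two values are nearly identical, so the choice mostly matters for small datasets.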

### Boolean columns to know if a real/generated test is successful or not + how many real/generated tests are successful

In [41]:
#Real tests successful :
dfcible['vt'] = 0
for c in col_test:
    t = c+'_test'
    tr = c+'_vrai_reussi'
    seuil = list(Seuils[c].keys())[0]
    #Successful = is a real test and >= threshold
    dfcible[tr] = np.where( dfcible[t]==0,0,
                           np.where( dfcible[c]>= Seuils[c][seuil] , 1, 0) )
    dfcible['vt'] += dfcible[tr]

#Generated test successful :
dfcible['ft'] = 0
for c in col_test:
    t = c+'_test'
    tr = c+'_faux_reussi'
    seuil = list(Seuils[c].keys())[0]
    #Successful = is generated and >= threshold
    dfcible[tr] = np.where( dfcible[t]==1,0,
                           np.where( dfcible[c]>= Seuils[c][seuil] , 1, 0) )    
    dfcible['ft'] += dfcible[tr]
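The two loops above can be condensed into a few vectorized operations. This is a minimal sketch on toy data: the column names `A` and `B`, the zero thresholds and the frames `values` and `measured` are all made up for illustration.

```python
import pandas as pd

# Toy data: two test columns, already normalized (made-up values)
values = pd.DataFrame({'A': [0.5, -1.0, 0.2], 'B': [1.0, 0.3, -0.7]})
# 1 = real measurement, 0 = value generated by the imputer
measured = pd.DataFrame({'A': [1, 1, 0], 'B': [1, 0, 0]})
# Normalized thresholds, one per test column
thresholds = pd.Series({'A': 0.0, 'B': 0.0})

# Compare every column with its own threshold
passed = values.ge(thresholds, axis=1)

is_real = measured.astype(bool)
vt = (passed & is_real).sum(axis=1)    # real tests that are successful
ft = (passed & ~is_real).sum(axis=1)   # generated tests that are successful
print(vt.tolist(), ft.tolist())
```

Each value is counted in exactly one of the two totals, depending on whether it was measured or generated.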

## Binary target definition  
### An observation is a "great welding" (=1) when:
### Condition 1 = All the real tests are successful
### Condition 2 = At least half of the generated tests are successful

In [42]:
dfcible = dfcible[['nb','ft','vt']].copy()  #copy so that adding 'y' does not write on a slice
dfcible['y'] = np.where( dfcible['vt']!=dfcible['nb'],0,
                        np.where( dfcible['ft']>= (len(col_test)-dfcible['nb'])/2,1,0 ) )
dfcible['y'].value_counts()

y
1    1054
0     409
Name: count, dtype: int64
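To see how the two conditions interact, here is a minimal sketch applying the same rule to five made-up rows (`toy` and `n_tests` are illustrative names).

```python
import pandas as pd

n_tests = 5  # number of test columns

# Made-up rows: nb = real tests done, vt = real tests passed,
# ft = generated tests passed
toy = pd.DataFrame({
    'nb': [5, 4, 1, 1, 0],
    'vt': [5, 3, 1, 1, 0],
    'ft': [0, 1, 2, 1, 3],
})

all_real_passed = toy['vt'] == toy['nb']
enough_generated = toy['ft'] >= (n_tests - toy['nb']) / 2
toy['y'] = (all_real_passed & enough_generated).astype(int)
print(toy)
```

Two edge cases follow directly from the rule: a row with all five real tests has no generated value, so condition 2 holds trivially, and a row with no real test passes condition 1 trivially, so its label rests entirely on generated values.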

It seems normal to have more than twice as many great weldings as the rest (1,054 against 409). Moreover, the classes are not too unbalanced, so we can use "normal" classification models rather than outlier detection models.  
### CSV preparation  
For the modelling, we only need the features and the target (not the test columns or the columns derived from them).

In [43]:
dfcible = dfcible.reset_index()
dff = df[features].reset_index()
dff = dff.merge(dfcible[['index','y']],on='index',how='left')
dff = dff.drop(columns='index')
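The merge on `index` works because both frames carry the same default row index. A minimal sketch of an equivalent join with two sanity checks, on made-up frames (`feats`, `target` and `out` are illustrative names):

```python
import pandas as pd

# Toy stand-ins for the feature frame and the target frame
feats = pd.DataFrame({'f1': [0.1, 0.2, 0.3]})
target = pd.DataFrame({'y': [1, 0, 1]})

# Both frames share the same default index, so a join on it is enough
out = feats.join(target['y'])

# Sanity checks: no row lost or duplicated, no target left blank
assert len(out) == len(feats)
assert out['y'].notna().all()
print(out)
```

Checks like these are cheap and would catch a silent misalignment if one of the frames were ever filtered or re-indexed before this step.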


In [44]:
dff.head()

| | C concentration (weight%) | Si concentration (weight%) | Mn concentration (weight%) | S concentration (weight%) | P concentration (weight%) | V concentration (weight%) | O concentration (ppm/weight) | Ti concentration (ppm/weight) | N concentration (ppm/weight) | Al concentration (ppm/weight) | ... | Type of weld_GMAA | Type of weld_GTAA | Type of weld_MMA | Type of weld_NGGMA | Type of weld_NGSAW | Type of weld_SA | Type of weld_SAA | Type of weld_ShMA | Type of weld_TSA | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.794624 | -0.241903 | -1.47409 | -0.150353 | -0.062002 | 1.5429810000000002e-17 | -1.143271e-16 | 0.0 | 0.0 | 8.694071000000001e-17 | ... | False | False | True | False | False | False | False | False | False | 0 |
| 1 | -1.794624 | -0.241903 | -1.47409 | -0.150353 | -0.062002 | 1.5429810000000002e-17 | -1.143271e-16 | 0.0 | 0.0 | 8.694071000000001e-17 | ... | False | False | True | False | False | False | False | False | False | 0 |
| 2 | -1.794624 | -0.157085 | -0.467487 | -0.23453 | 0.034196 | 1.5429810000000002e-17 | -1.143271e-16 | 0.0 | 0.0 | 8.694071000000001e-17 | ... | False | False | True | False | False | False | False | False | False | 0 |
| 3 | -1.794624 | -0.157085 | -0.467487 | -0.23453 | 0.034196 | 1.5429810000000002e-17 | -1.143271e-16 | 0.0 | 0.0 | 8.694071000000001e-17 | ... | False | False | True | False | False | False | False | False | False | 0 |
| 4 | -1.474696 | 0.182188 | 0.592095 | -0.23453 | 0.034196 | 1.5429810000000002e-17 | -1.143271e-16 | 0.0 | 0.0 | 8.694071000000001e-17 | ... | False | False | True | False | False | False | False | False | False | 1 |


In [20]:
dff.to_csv('../../CD1_target_created.csv',index=False)