## Data Description

This data represents the results of a large product testing study. For each `product_code` you are given a number of product `attributes` (fixed for the code) as well as a number of `measurement` values for each individual product, representing various lab testing methods. Each product is used in a simulated real-world environment experiment, and and absorbs a certain amount of fluid (`loading`) to see whether or not it fails. &nbsp;

Your task is to use the data to predict individual product failures of new codes with their individual lab test results.

## Evaluation

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

## Understanding

- Same product_code means they are the same products
- Same product consists of same attributes

## Blueprint

1. Numerize 'attribute_0' and 'attribute_1'
2. Drop the product code(A, B, C, D, E)
3. Apply PCA
4. Split into training and validating data
5. Apply ML models

### Import Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn import metrics     # error function : metrics.roc_auc_score()

# Show all the columns and rows
pd.set_option("display.max_columns", None)  # columns
# pd.set_option("display.max_rows", None)   # rows

## 1. Data Loading

In [41]:
# Load dataset
data = pd.read_csv('train.csv') # training
te = pd.read_csv('test.csv')    # testing

print(data.shape)
data.head()

(26570, 26)


Unnamed: 0,id,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


In [42]:
# Store 'id'
id = data.id

# Drop 'id'
data = data.drop(columns=['id'])

print(data.shape)
data.head()

(26570, 25)


Unnamed: 0,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


## 2. Data Exploration

In [43]:
# Check data types and missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26570 entries, 0 to 26569
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   product_code    26570 non-null  object 
 1   loading         26320 non-null  float64
 2   attribute_0     26570 non-null  object 
 3   attribute_1     26570 non-null  object 
 4   attribute_2     26570 non-null  int64  
 5   attribute_3     26570 non-null  int64  
 6   measurement_0   26570 non-null  int64  
 7   measurement_1   26570 non-null  int64  
 8   measurement_2   26570 non-null  int64  
 9   measurement_3   26189 non-null  float64
 10  measurement_4   26032 non-null  float64
 11  measurement_5   25894 non-null  float64
 12  measurement_6   25774 non-null  float64
 13  measurement_7   25633 non-null  float64
 14  measurement_8   25522 non-null  float64
 15  measurement_9   25343 non-null  float64
 16  measurement_10  25270 non-null  float64
 17  measurement_11  25102 non-null 

In [44]:
# Check the distribution
data["failure"].value_counts()  # target
data["product_code"].value_counts() # product code

C    5765
E    5343
B    5250
D    5112
A    5100
Name: product_code, dtype: int64

### - Treat Missing Values

In [45]:
# # Opt A. Drop missing values
# data = data.dropna()

# Opt B. Replace missing values with 0
data = data.fillna(0)

# # Opt C. Replace missing values with the feature's mean
# data = data.fillna(data.mean())

In [46]:
# Check the distribution again
data["failure"].value_counts()     # target
data["product_code"].value_counts()    # product code

C    5765
E    5343
B    5250
D    5112
A    5100
Name: product_code, dtype: int64

### - Objective Values

In [47]:
# # Unique values in each column
# data['product_code'].unique()   # array(['A', 'B', 'C', 'D', 'E'], dtype=object)
# data['attribute_0'].unique()    # array(['material_7', 'material_5'], dtype=object)
# data['attribute_1'].unique()    # array(['material_8', 'material_5', 'material_6'], dtype=object)

In [48]:
# Check attribute combinations for each product
def combinations(df):
    products = df['product_code'].unique()      # product codes
    attr = []       # list of the combination of attributes

    for product in products:
        attr = []
        subset = df.loc[df['product_code']==product, :]     # get subsets for each 'product code'

        attr.append(subset['attribute_0'].unique())
        attr.append(subset['attribute_1'].unique())
        attr.append(subset['attribute_2'].unique())
        attr.append(subset['attribute_3'].unique())

        print("Product",product, "consists of", attr)

In [49]:
# combinations(data)

In [50]:
# combinations(te)

### - Int/Float Values

In [51]:
data.describe()

Unnamed: 0,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
count,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0,26570.0
mean,126.623502,6.754046,7.240459,7.415883,8.232518,6.256568,17.536406,11.494434,16.692034,16.986161,11.303433,18.274322,10.902855,15.329114,18.112822,10.997322,14.607806,14.916537,13.861717,15.153533,640.986841,0.212608
std,40.759151,1.471852,1.456493,4.11669,4.199401,3.309109,2.337115,1.924252,2.870842,3.142212,2.374236,3.832824,2.589918,3.737513,4.622983,3.136565,4.063576,4.353575,4.235109,4.743193,229.212732,0.40916
min,0.0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,99.285,6.0,6.0,4.0,5.0,4.0,17.08325,11.002,16.379,16.768,10.955,18.237,10.63725,15.041,17.941,10.473,14.69725,14.79,13.65,14.907,588.50875,0.0
50%,121.97,6.0,8.0,7.0,8.0,6.0,17.768,11.707,17.097,17.479,11.668,18.966,11.369,16.0425,19.106,11.595,15.53,15.895,14.806,16.239,686.71,0.0
75%,148.8375,8.0,8.0,10.0,11.0,8.0,18.456,12.396,17.785,18.15575,12.36375,19.676,12.065,16.975,20.144,12.643,16.306,16.997,15.91775,17.514,774.63375,0.0
max,385.86,9.0,9.0,29.0,29.0,24.0,21.499,16.484,21.425,21.543,15.419,23.807,15.412,22.479,25.64,17.663,22.713,22.303,21.626,24.094,1312.794,1.0


## 3. Preprocessing

### (1) Numerize 'attribute_0' and 'attribute_1'

In [52]:
# Apply function element-wise
numerized_attr01 = data.loc[:, ['attribute_0','attribute_1']].applymap(lambda x: int(x[9]))

In [53]:
# Drop 'attribute_0', 'attribute_1' and concat
data = data.drop(columns=['attribute_0', 'attribute_1'])
data = pd.concat([numerized_attr01, data], axis=1)

In [54]:
print(data.shape)
data.head()     # 12183(A), 26570(B)

(26570, 25)


Unnamed: 0,attribute_0,attribute_1,product_code,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,7,8,A,80.1,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,0.0,13.034,14.684,764.1,0
1,7,8,A,84.89,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,7,8,A,82.43,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,0.0,13.798,16.711,18.631,14.094,17.946,663.376,0
3,7,8,A,101.07,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,7,8,A,188.06,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


In [55]:
# Double check the combination
combinations(data)

Product A consists of [array([7]), array([8]), array([9]), array([5])]
Product B consists of [array([5]), array([5]), array([8]), array([8])]
Product C consists of [array([7]), array([8]), array([5]), array([8])]
Product D consists of [array([7]), array([5]), array([6]), array([6])]
Product E consists of [array([7]), array([6]), array([6]), array([9])]


### (2) Drop 'product_code'(A, B, C, D, E)

In [56]:
# Drop 'product_code'
data = data.drop(columns=['product_code'])

print(data.shape)
data.head()

(26570, 24)


Unnamed: 0,attribute_0,attribute_1,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,7,8,80.1,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,0.0,13.034,14.684,764.1,0
1,7,8,84.89,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,7,8,82.43,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,0.0,13.798,16.711,18.631,14.094,17.946,663.376,0
3,7,8,101.07,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,7,8,188.06,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


### (3). Apply PCA

In [57]:
X = data.iloc[:, 0:23]  # all features
y = data.loc[:, 'failure']  # target

In [58]:
# Standardize
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = pd.DataFrame(scaler.transform(X), columns=list(X.columns))

In [59]:
# # Mean Centering
# X_centered = X - X.mean()
# X_centered.head()

In [60]:
# Check Variance
# print(X_centered.var())
print(X_scaled.var())

attribute_0       1.000038
attribute_1       1.000038
loading           1.000038
attribute_2       1.000038
attribute_3       1.000038
measurement_0     1.000038
measurement_1     1.000038
measurement_2     1.000038
measurement_3     1.000038
measurement_4     1.000038
measurement_5     1.000038
measurement_6     1.000038
measurement_7     1.000038
measurement_8     1.000038
measurement_9     1.000038
measurement_10    1.000038
measurement_11    1.000038
measurement_12    1.000038
measurement_13    1.000038
measurement_14    1.000038
measurement_15    1.000038
measurement_16    1.000038
measurement_17    1.000038
dtype: float64


In [61]:
# Covariance matrix
# cov = np.dot(X_centered.T, X_centered)/(len(X_centered)-1)
cov = np.dot(X_scaled.T, X_scaled)/(len(X_scaled)-1)
# df_cov = pd.DataFrame(cov)

In [62]:
# Eigenvalues & Eigenvectors
eig = np.linalg.eig(cov)
# print('<eigenvalues>\n', eig[0],'\n')
# print('<eigenvectors>\n', eig[1])

In [63]:
# Contribution in the Data
def percent_variation(n):
    '''Percent Variation of top 'n' factors'''
    trace = sum(eig[0])    # sum of all eigenvalues
    contribution = [np.round(i/trace,5) for i in eig[0]]
    return contribution[:n]    # return the first n number of contributions

In [64]:
n = 22
print(percent_variation(n))
print('Total contribution of 22 components:', sum(percent_variation(n))*100, '%\n')

[0.09783, 0.07748, 0.00608, 0.02047, 0.02999, 0.03362, 0.04985, 0.03746, 0.0386, 0.04528, 0.04483, 0.04458, 0.04444, 0.04195, 0.04212, 0.04405, 0.04386, 0.04364, 0.0424, 0.04254, 0.04287, 0.04309]
Total contribution of 22 components: 95.703 %



In [65]:
# Fit to PCA (getting a projection matrix)
pca = PCA(n_components=22)
pca.fit(X_scaled)

PCA(n_components=22)

In [66]:
# How much each of the features influences the PC
influence = pd.DataFrame(pca.components_, columns=list(X_scaled.columns))
influence

Unnamed: 0,attribute_0,attribute_1,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17
0,0.271502,0.419609,-0.003592,0.318464,-0.511587,0.417178,-0.422139,-0.164859,0.010795,-0.002893,-0.011196,-0.01543,0.005836,0.010719,0.005706,0.012007,0.006939,0.055613,-0.013599,0.022337,-0.053295,0.010059,0.00797
1,-0.586443,-0.37409,-0.008142,0.569483,-0.182768,-0.002329,-0.198578,0.312567,-0.008259,0.003958,0.016993,0.009726,0.000984,0.002172,-0.006852,0.006511,0.0121,0.121217,0.02801,0.036771,-0.051889,0.031491,-6.1e-05
2,-0.0041,0.00496,-0.022334,0.009691,0.004675,-0.002239,-0.00439,-0.009845,-0.121887,-0.171025,-0.39297,-0.248533,-0.348179,-0.340876,-0.065741,-0.106742,-0.009911,0.070533,0.050091,-0.033317,0.019749,-0.023075,-0.690049
3,-0.042106,0.019055,0.16369,0.027623,0.009073,0.015058,-0.079439,0.081481,0.296495,-0.392509,-0.227577,0.144087,0.176255,0.074594,0.163887,-0.139483,-0.169628,-0.386529,0.003502,0.323016,0.219502,-0.477293,-0.032543
4,-0.029753,0.011006,-0.03896,0.031775,0.023487,-0.011978,-0.052016,-0.004323,-0.114664,-0.149069,0.271529,-0.395042,-0.185167,0.240591,0.116752,-0.347586,0.431407,-0.11449,-0.44434,0.050227,0.317658,0.027894,0.020209
5,0.051719,-0.064148,-0.340957,-0.040692,-0.067619,0.0567,0.028627,0.169685,-0.117021,-0.273663,0.039687,-0.014574,-0.242839,0.307778,0.150096,0.332054,-0.290782,-0.193908,-0.305274,-0.315519,-0.340238,-0.168597,-0.044248
6,0.041159,-0.022823,0.112398,-0.033493,-0.038457,-0.01161,0.066337,0.015485,-0.395521,-0.282912,-0.018323,0.041699,0.302739,-0.068188,-0.259276,0.459977,0.116993,0.170787,-0.337103,0.448612,0.014801,0.041151,-0.066813
7,0.020401,-0.0154,0.451406,0.000542,-0.021591,0.027286,-0.009471,0.045386,0.039986,-0.265091,0.275426,0.317381,-0.292931,-0.099986,-0.478234,-0.074141,0.250287,-0.176446,0.022925,-0.246492,-0.215453,-0.10966,0.015484
8,-0.021141,0.032882,-0.34187,-0.002656,0.022362,0.0108,-0.047026,0.000498,0.086077,-0.101072,0.418749,-0.339709,-0.004625,-0.020119,-0.531548,0.053973,-0.328381,-0.041638,0.289715,0.117472,0.246186,-0.128172,-0.012758
9,0.014676,-0.003667,-0.162039,0.015595,0.016095,-0.083055,0.06405,-0.094514,0.305654,0.224237,0.009278,-0.181368,-0.000776,-0.084348,-0.08646,-0.028189,0.276653,0.275674,-0.165734,0.197367,-0.458783,-0.575454,0.020557


In [67]:
# Projection (transforming the original data via projection matrix)
# data_pca = pca.transform(X)
# data_pca = pd.DataFrame(data_pca, columns = ['PC1', 'PC2']) # rename
data_pca = pca.transform(X_scaled)
data_pca = pd.DataFrame(data_pca)

print(data_pca.shape)
data_pca.head()

(26570, 22)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21
0,2.001438,0.282609,-0.524231,-1.757483,-0.870656,1.087183,-1.627562,0.061246,-0.430954,0.069473,-0.815777,-0.596132,-0.869941,1.311743,0.702258,-0.543884,2.063567,-0.208595,-0.156561,-0.020101,-1.404432,0.63501
1,3.272759,0.408291,-0.60556,-0.033182,-0.271715,0.195149,0.08571,-0.852117,0.250826,-0.009468,-0.006299,-0.817683,-0.420923,0.32845,-0.078706,-0.552138,-0.207985,-0.558651,-0.340092,0.3106,0.179787,0.121165
2,3.194248,0.776747,-0.345485,0.596613,-2.058003,0.661558,-0.393992,-2.241933,1.400374,-1.155395,0.808199,0.397873,0.00607,1.396422,-0.665657,1.542618,-0.4304,-0.093196,-0.043754,-0.311635,-0.083251,-0.056863
3,3.064628,0.670544,-1.198433,0.223466,0.263086,0.074306,-0.036434,-0.456165,0.235656,-0.857355,0.243503,-0.491978,-0.394242,0.019065,-0.038112,-0.609893,-0.310991,0.072056,-0.055988,-0.091056,-0.010544,0.167993
4,2.624642,0.940839,0.030422,-0.366066,-0.326252,-0.772384,-0.388612,0.714925,-0.307804,0.060759,-0.48594,-0.28625,1.482997,-0.512493,0.229338,0.180207,0.39215,0.376254,-0.362246,-0.852884,-0.474602,0.251062


## 4. Apply ML models

In [68]:
# Split into training and validating
PCA_train, PCA_val, y_train, y_val = train_test_split(data_pca, y, test_size=0.33, 
                                                    random_state=0, stratify=y)

### (1) Linear SVC

In [69]:
from sklearn.svm import LinearSVC

LinSVC = LinearSVC(random_state=0)
LinSVC.fit(PCA_train, y_train)
y_LinSVC = LinSVC.predict(PCA_val)



In [70]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_LinSVC)  # 0.452(A), 0.539(B), 0.480(C)

0.5

### (2) SGD Classifier

In [71]:
from sklearn.linear_model import SGDClassifier

SGD = SGDClassifier(random_state=0)
SGD.fit(PCA_train, y_train)
y_SGD = SGD.predict(PCA_val)

In [72]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_SGD)     # 0.491(A), 0.530(B), 0.538(C)

0.5

### (3) KNN Classifier

In [73]:
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(n_neighbors=2)
KNN.fit(PCA_train, y_train)
y_KNN = KNN.predict(PCA_val)

In [74]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_KNN)     # 0.497(A), 0.503(B), 0.502(C)

0.5046415873923542

### (4) Kernel Approximation

In [75]:
# check

### (5) SVC

In [76]:
from sklearn.svm import SVC

SVC = SVC(random_state=0)
SVC.fit(PCA_train, y_train)
y_SVC = SVC.predict(PCA_val)

In [None]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_SVC)     # 0.5(A), 0.5(B), 0.5(C)

0.5

### (6) Ensemble Classifiers

In [None]:
from sklearn.ensemble import RandomForestRegressor

Ensb = RandomForestRegressor(max_depth=5, random_state=0)
Ensb.fit(PCA_train, y_train)
y_Ensb = SVC.predict(PCA_val)

In [None]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_Ensb)    # 0.5(A), 0.5(B), 0.5(C)

0.5

## 5. Apply The Best-fit Model to The Testing Data