## Data Description

This data represents the results of a large product testing study. For each `product_code` you are given a number of product `attributes` (fixed for the code) as well as a number of `measurement` values for each individual product, representing various lab testing methods. Each product is used in a simulated real-world environment experiment, and absorbs a certain amount of fluid (`loading`) to see whether or not it fails.

Your task is to use the data to predict individual product failures for new product codes from their individual lab test results.
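
Because the goal is to predict failures for product codes that do not appear in the training data, a validation scheme that holds out whole codes may mirror the task more closely than a random split. The following is a minimal sketch of that idea using scikit-learn's `GroupKFold`; the `toy` frame and its values are synthetic stand-ins, not the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical toy frame standing in for train.csv (all values are synthetic)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "product_code": np.repeat(list("ABCDE"), 20),
    "loading": rng.normal(128, 39, size=100),
    "failure": rng.integers(0, 2, size=100),
})

# Group the folds by product code so validation codes are never seen in training
gkf = GroupKFold(n_splits=5)
folds = []
for train_idx, val_idx in gkf.split(toy, toy["failure"], groups=toy["product_code"]):
    train_codes = set(toy.iloc[train_idx]["product_code"])
    val_codes = set(toy.iloc[val_idx]["product_code"])
    folds.append((train_codes, val_codes))
    print(sorted(train_codes), "->", sorted(val_codes))
```

Each fold then scores a model on a product code it has never been fitted on, which is closer to the situation in `test.csv`.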

## Evaluation

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
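
Since the metric ranks predicted probabilities, scoring hard 0/1 labels can discard ranking information. A small illustration with made-up labels and scores (the arrays below are assumptions for demonstration only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.45])  # predicted probability of failure
y_label = (y_prob >= 0.5).astype(int)                 # hard predictions at a 0.5 threshold

auc_prob = roc_auc_score(y_true, y_prob)     # uses the full ranking of the scores
auc_label = roc_auc_score(y_true, y_label)   # ranking collapsed to two levels
print(auc_prob, auc_label)
```

In this toy case the probability-based score is higher, because thresholding turns most pairwise comparisons into ties.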

## Understanding

- Rows with the same `product_code` are the same product
- The same product always has the same attributes

## Blueprint

1. Numerize 'attribute_0' and 'attribute_1'
2. Drop the product code (A, B, C, D, E)
3. Apply PCA
4. Split into training and validation data
5. Apply ML models

### Import Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics     # evaluation metric: metrics.roc_auc_score()
from sklearn.decomposition import PCA

# Show all the columns and rows
pd.set_option("display.max_columns", None)  # columns
# pd.set_option("display.max_rows", None)   # rows

## 1. Data Loading

In [25]:
# Load dataset
data = pd.read_csv('train.csv') # training
te = pd.read_csv('test.csv')    # testing

print(data.shape)
data.head()

(26570, 26)


Unnamed: 0,id,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


In [26]:
# Store 'id' (named ids to avoid shadowing the built-in id())
ids = data['id']

# Drop 'id'
data = data.drop(columns=['id'])

print(data.shape)
data.head()

(26570, 25)


Unnamed: 0,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


## 2. Data Exploration

In [27]:
# Check data types and missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26570 entries, 0 to 26569
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   product_code    26570 non-null  object 
 1   loading         26320 non-null  float64
 2   attribute_0     26570 non-null  object 
 3   attribute_1     26570 non-null  object 
 4   attribute_2     26570 non-null  int64  
 5   attribute_3     26570 non-null  int64  
 6   measurement_0   26570 non-null  int64  
 7   measurement_1   26570 non-null  int64  
 8   measurement_2   26570 non-null  int64  
 9   measurement_3   26189 non-null  float64
 10  measurement_4   26032 non-null  float64
 11  measurement_5   25894 non-null  float64
 12  measurement_6   25774 non-null  float64
 13  measurement_7   25633 non-null  float64
 14  measurement_8   25522 non-null  float64
 15  measurement_9   25343 non-null  float64
 16  measurement_10  25270 non-null  float64
 17  measurement_11  25102 non-null 

In [28]:
# Check the distribution
data["failure"].value_counts()  # target
data["product_code"].value_counts() # product code

C    5765
E    5343
B    5250
D    5112
A    5100
Name: product_code, dtype: int64

### - Treat Missing Values

In [29]:
# Opt A. Drop missing values
data = data.dropna()

# # Opt B. Replace missing values with 0
# data = data.fillna(0)

# # Opt C. Replace missing values with the feature's mean
# data = data.fillna(data.mean(numeric_only=True))   # numeric_only skips the text columns
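
Option A keeps only complete rows, which is why the row count falls from 26570 to 12183 in the outputs that follow. As a sketch of Option C restricted to numeric columns, assuming a synthetic `toy` frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Synthetic frame with one text column and gaps in the numeric ones
toy = pd.DataFrame({
    "attribute_0": ["material_7", "material_5", "material_7", "material_5"],
    "loading": [80.1, np.nan, 101.07, 188.06],
    "measurement_3": [18.04, 18.213, np.nan, 17.295],
})

# Fill gaps with each numeric column's mean; text columns are left untouched
numeric_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(filled)
```

Unlike dropping rows, this keeps every product, which matters for the test set, where rows cannot be discarded.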

In [30]:
# Check the distribution again
data["failure"].value_counts()     # target
data["product_code"].value_counts()    # product code

C    2666
B    2420
E    2392
A    2381
D    2324
Name: product_code, dtype: int64

### - Object Values

In [31]:
# # Unique values in each column
# data['product_code'].unique()   # array(['A', 'B', 'C', 'D', 'E'], dtype=object)
# data['attribute_0'].unique()    # array(['material_7', 'material_5'], dtype=object)
# data['attribute_1'].unique()    # array(['material_8', 'material_5', 'material_6'], dtype=object)

In [32]:
# Check attribute combinations for each product
def combinations(df):
    products = df['product_code'].unique()      # product codes

    for product in products:
        attr = []       # unique attribute values for this product
        subset = df.loc[df['product_code']==product, :]     # get subsets for each 'product code'

        attr.append(subset['attribute_0'].unique())
        attr.append(subset['attribute_1'].unique())
        attr.append(subset['attribute_2'].unique())
        attr.append(subset['attribute_3'].unique())

        print("Product",product, "consists of", attr)

In [33]:
# combinations(data)

In [34]:
# combinations(te)

### - Int/Float Values

In [35]:
data.describe()

Unnamed: 0,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
count,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0,12183.0
mean,127.802374,6.764754,7.228515,7.419109,8.236395,6.215546,17.782234,11.734573,17.128798,17.498797,11.715155,19.02271,11.43393,16.137332,19.183826,11.708558,15.638239,16.046096,15.004323,16.464259,700.912242,0.209801
std,39.023844,1.479066,1.457148,4.14273,4.224814,3.303928,1.002298,0.994764,1.002633,1.000499,0.998095,1.002018,0.998223,1.396516,1.523551,1.49689,1.16933,1.488628,1.558413,1.710481,123.456724,0.407183
min,40.81,5.0,5.0,0.0,0.0,0.0,14.1,8.097,12.073,12.715,7.973,15.268,7.537,9.676,12.461,5.167,11.035,10.318,9.158,10.064,196.787,0.0
25%,100.07,6.0,6.0,4.0,5.0,4.0,17.113,11.061,16.433,16.818,11.05,18.3385,10.7575,15.227,18.1635,10.698,14.866,15.055,13.9505,15.271,618.1365,0.0
50%,122.18,6.0,8.0,7.0,8.0,6.0,17.776,11.733,17.128,17.503,11.712,19.014,11.434,16.158,19.233,11.727,15.624,16.032,14.985,16.443,700.159,0.0
75%,149.19,8.0,8.0,10.0,11.0,8.0,18.469,12.413,17.812,18.169,12.395,19.706,12.1015,17.0335,20.23,12.732,16.3755,17.0845,16.038,17.61,782.908,0.0
max,385.86,9.0,9.0,26.0,27.0,24.0,21.248,16.484,21.425,21.076,15.243,23.328,15.045,21.459,25.429,17.318,22.388,22.303,20.644,23.164,1181.998,1.0


## 3. Preprocessing

### (1) Numerize 'attribute_0' and 'attribute_1'

In [36]:
# Element-wise: take the digit at index 9 of 'material_N' and convert it to int
numerized_attr01 = data.loc[:, ['attribute_0','attribute_1']].applymap(lambda x: int(x[9]))
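
Indexing a fixed character position works here because every value has the form `material_N` with a single digit. A sketch of an alternative that splits on the underscore instead, shown on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical sample of the two text attributes
toy = pd.DataFrame({
    "attribute_0": ["material_7", "material_5"],
    "attribute_1": ["material_8", "material_6"],
})

# Take everything after the last underscore instead of a fixed character position
numerized = toy.apply(lambda col: col.str.rsplit("_", n=1).str[-1].astype(int))
print(numerized)
```

This variant would also handle multi-digit material numbers, should any appear.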

In [37]:
# Drop 'attribute_0', 'attribute_1' and concat
data = data.drop(columns=['attribute_0', 'attribute_1'])
data = pd.concat([numerized_attr01, data], axis=1)

In [38]:
print(data.shape)
data.head()     # rows per missing-value option: 12183 (A), 26570 (B), 26570 (C)

(12183, 25)


Unnamed: 0,attribute_0,attribute_1,product_code,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
1,7,8,A,84.89,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
3,7,8,A,101.07,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,7,8,A,188.06,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0
7,7,8,A,177.92,9,5,4,8,8,17.062,13.634,17.879,15.894,11.029,18.643,10.254,16.449,20.478,12.207,15.624,16.968,15.176,17.231,684.0,1
11,7,8,A,175.38,9,5,7,3,2,17.029,11.507,18.377,16.338,10.019,20.242,11.309,16.31,18.959,11.52,14.659,15.355,15.175,15.829,792.591,1


In [39]:
# Double check the combination
combinations(data)

Product A consists of [array([7]), array([8]), array([9]), array([5])]
Product B consists of [array([5]), array([5]), array([8]), array([8])]
Product C consists of [array([7]), array([8]), array([5]), array([8])]
Product D consists of [array([7]), array([5]), array([6]), array([6])]
Product E consists of [array([7]), array([6]), array([6]), array([9])]


### (2) Drop 'product_code'(A, B, C, D, E)

In [40]:
# Drop 'product_code'
data = data.drop(columns=['product_code'])

print(data.shape)
data.head()

(12183, 24)


Unnamed: 0,attribute_0,attribute_1,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
1,7,8,84.89,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
3,7,8,101.07,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,7,8,188.06,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0
7,7,8,177.92,9,5,4,8,8,17.062,13.634,17.879,15.894,11.029,18.643,10.254,16.449,20.478,12.207,15.624,16.968,15.176,17.231,684.0,1
11,7,8,175.38,9,5,7,3,2,17.029,11.507,18.377,16.338,10.019,20.242,11.309,16.31,18.959,11.52,14.659,15.355,15.175,15.829,792.591,1


### (3) Apply PCA

In [41]:
X = data.drop(columns=['failure'])  # all features (every column except the target)
y = data.loc[:, 'failure']  # target

In [42]:
# Mean Centering
X_centered = X - X.mean()
X_centered.head()

Unnamed: 0,attribute_0,attribute_1,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17
1,0.397275,1.560863,-42.912374,2.235246,-2.228515,6.580891,-5.236395,-3.215546,0.430766,-0.194573,0.588202,0.394203,1.032845,-1.13371,1.01407,1.809668,-1.268826,0.046442,-0.906239,-0.621096,-0.609323,-0.833259,-18.855242
3,0.397275,1.560863,-26.732374,2.235246,-2.228515,5.580891,-6.236395,-0.215546,-0.487234,-0.546573,1.447202,0.840203,0.867845,0.03729,1.03707,0.208668,-0.806826,-1.688558,-0.388239,-0.484096,1.149677,0.707741,125.369758
4,0.397275,1.560863,60.257626,2.235246,-2.228515,1.580891,-6.236395,1.784454,1.563766,1.215427,-0.138798,-1.752797,-0.409155,-0.92971,-1.09693,0.944668,0.748174,0.719442,0.543761,-3.286096,-1.851323,-0.052259,-121.027242
7,0.397275,1.560863,50.117626,2.235246,-2.228515,-3.419109,-0.236395,1.784454,-0.720234,1.899427,0.750202,-1.604797,-0.686155,-0.37971,-1.17993,0.311668,1.294174,0.498442,-0.014239,0.921904,0.171677,0.766741,-16.912242
11,0.397275,1.560863,47.577626,2.235246,-2.228515,-0.419109,-5.236395,-4.215546,-0.753234,-0.227573,1.248202,-1.160797,-1.696155,1.21929,-0.12493,0.172668,-0.224826,-0.188558,-0.979239,-0.691096,0.170677,-0.635259,91.678758


In [43]:
# Check Variance
print(X_centered.var())

attribute_0           0.636775
attribute_1           1.853769
loading            1522.860381
attribute_2           2.187636
attribute_3           2.123281
measurement_0        17.162209
measurement_1        17.849055
measurement_2        10.915940
measurement_3         1.004601
measurement_4         0.989555
measurement_5         1.005274
measurement_6         1.000997
measurement_7         0.996194
measurement_8         1.004040
measurement_9         0.996449
measurement_10        1.950258
measurement_11        2.321209
measurement_12        2.240680
measurement_13        1.367332
measurement_14        2.216012
measurement_15        2.428652
measurement_16        2.925746
measurement_17    15241.562673
dtype: float64
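
The variances above differ by several orders of magnitude, so a PCA on the raw features will tend to be dominated by `measurement_17` and `loading`. One common remedy is to standardize each feature first. The sketch below illustrates the effect on synthetic data (the `X_toy` array is an assumption, not the study data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features with very different scales (illustrative only)
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.normal(0, 1, 500),      # unit-scale feature
    rng.normal(0, 1, 500),      # unit-scale feature
    rng.normal(0, 120, 500),    # large-scale feature, comparable to measurement_17
])

# Share of variance captured by the first component, without scaling
raw_ratio = PCA(n_components=1).fit(X_toy).explained_variance_ratio_[0]

# Same quantity after scaling every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_ratio = PCA(n_components=1).fit(X_scaled).explained_variance_ratio_[0]

print(round(raw_ratio, 3), round(scaled_ratio, 3))
```

Without scaling, the first component is almost entirely the large-scale feature; after scaling, no single feature dominates.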


In [44]:
# Covariance matrix
cov = np.dot(X_centered.T, X_centered)/(len(X_centered)-1)
# df_cov = pd.DataFrame(cov)

In [45]:
# Eigenvalues & Eigenvectors
eig = np.linalg.eig(cov)
# print('<eigenvalues>\n', eig[0],'\n')
# print('<eigenvectors>\n', eig[1])

In [46]:
# Contribution in the Data
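def percent_variation(n):
    '''Fraction of total variance explained by each of the top n components'''
    eigenvalues = np.sort(eig[0])[::-1]    # eig() does not guarantee any ordering
    trace = eigenvalues.sum()              # sum of all eigenvalues = total variance
    contribution = [np.round(i/trace, 5) for i in eigenvalues]
    return contribution[:n]    # return the first n contributions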

In [47]:
n = 2
print(percent_variation(n))
print('Total contribution of', n, 'components:', sum(percent_variation(n))*100, '%\n')

[0.90517, 0.09041]
Total contribution of 2 components: 99.558 %
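
The manual eigenvalue calculation can be cross-checked against scikit-learn's `explained_variance_ratio_`. A minimal sketch on synthetic data, using `np.linalg.eigvalsh`, which is intended for symmetric matrices and returns eigenvalues in ascending order:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with four features of decreasing spread
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4)) * np.array([10.0, 3.0, 1.0, 0.5])

# Sample covariance matrix (np.cov centers the data and divides by n - 1)
cov_toy = np.cov(X_toy, rowvar=False)
eigvals = np.linalg.eigvalsh(cov_toy)[::-1]      # ascending -> descending
manual_ratio = eigvals / eigvals.sum()

pca_toy = PCA(n_components=2).fit(X_toy)
print(manual_ratio[:2], pca_toy.explained_variance_ratio_)
```

The two results should agree up to floating-point error, since both divide each eigenvalue by the total variance.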



In [48]:
# Fit to PCA (getting a projection matrix)
pca = PCA(n_components=2)
pca.fit(X)

PCA(n_components=2)

In [49]:
# How much each of the features influences the PC
influence = pd.DataFrame(pca.components_, columns=list(X_centered.columns))
influence

Unnamed: 0,attribute_0,attribute_1,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17
0,-1.7e-05,-3.3e-05,-0.004916,3.9e-05,1.5e-05,-0.00016,0.000301,2.1e-05,-0.000697,-0.001493,-0.003637,-0.002634,-0.002706,-0.003965,-0.001151,-0.000147,0.000219,0.00011,0.000101,-0.000143,-4.8e-05,-2e-06,-0.999964
1,0.00012,7.1e-05,0.999986,-0.000159,-0.000239,-0.00068,-0.000516,-0.000887,-0.00023,-0.000638,0.000207,8.8e-05,-0.000333,-0.000121,-0.000238,0.000122,0.000134,-0.000112,8.3e-05,0.000418,-0.000329,-0.000679,-0.004915


In [50]:
# Projection (transforming the original data via projection matrix)
data_pca = pca.transform(X)
data_pca = pd.DataFrame(data_pca, columns = ['PC1', 'PC2']) # rename

print(data_pca.shape)
data_pca.head()

(12183, 2)


Unnamed: 0,PC1,PC2
0,19.059668,-42.817493
1,-125.247055,-27.348888
2,120.733567,60.851142
3,16.669795,50.200744
4,-91.911789,47.13442


## 4. Apply ML models

In [51]:
# Split into training and validation sets
PCA_train, PCA_val, y_train, y_val = train_test_split(data_pca, y, test_size=0.33, 
                                                    random_state=0, stratify=y)

### (1) Linear SVC

In [52]:
from sklearn.svm import LinearSVC

LinSVC = LinearSVC(random_state=0)
LinSVC.fit(PCA_train, y_train)
y_LinSVC = LinSVC.predict(PCA_val)



In [53]:
# Compute the area under the ROC curve from the hard 0/1 predictions
# (trailing comment: scores under missing-value options A, B, C)
metrics.roc_auc_score(y_val, y_LinSVC)  # 0.452(A), 0.539(B), 0.480(C)

0.45222492977517614
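
`roc_auc_score` accepts continuous scores as well as labels. For `LinearSVC`, `decision_function` provides such a score. A sketch on synthetic data produced by `make_classification` (a stand-in for the PCA features, not the study data):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic imbalanced data standing in for the PCA features
X_toy, y_toy = make_classification(n_samples=1000, n_features=5,
                                   weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.33,
                                          random_state=0, stratify=y_toy)

clf = LinearSVC(random_state=0).fit(X_tr, y_tr)

auc_labels = roc_auc_score(y_va, clf.predict(X_va))            # hard 0/1 labels
auc_scores = roc_auc_score(y_va, clf.decision_function(X_va))  # signed distance to the hyperplane
print(auc_labels, auc_scores)
```

The same pattern applies to the other models: `predict_proba(...)[:, 1]` for classifiers that expose it, `decision_function` otherwise.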

### (2) SGD Classifier

In [54]:
from sklearn.linear_model import SGDClassifier

SGD = SGDClassifier(random_state=0)
SGD.fit(PCA_train, y_train)
y_SGD = SGD.predict(PCA_val)

In [55]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_SGD)     # 0.491(A), 0.530(B), 0.538(C)

0.49114693584069147

### (3) KNN Classifier

In [56]:
from sklearn.neighbors import KNeighborsClassifier

KNN = KNeighborsClassifier(n_neighbors=2)
KNN.fit(PCA_train, y_train)
y_KNN = KNN.predict(PCA_val)

In [57]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_KNN)     # 0.497(A), 0.503(B), 0.502(C)

0.496913725279594

### (4) Kernel Approximation

In [58]:
# check

### (5) SVC

In [59]:
from sklearn.svm import SVC

# Use a distinct name so the SVC class is not overwritten by its instance
svc = SVC(random_state=0)
svc.fit(PCA_train, y_train)
y_SVC = svc.predict(PCA_val)

In [60]:
# Compute the area under the ROC curve
metrics.roc_auc_score(y_val, y_SVC)     # 0.5(A), 0.5(B), 0.5(C)

0.5

### (6) Ensemble Classifiers

In [61]:
from sklearn.ensemble import RandomForestRegressor

Ensb = RandomForestRegressor(max_depth=5, random_state=0)
Ensb.fit(PCA_train, y_train)
y_Ensb = Ensb.predict(PCA_val)    # predict with the forest itself, not the SVC model

In [62]:
# Compute the area under the ROC curve
# The scores recorded for this cell, 0.5(A), 0.5(B), 0.5(C), came from SVC predictions;
# re-run to obtain the random forest's score
metrics.roc_auc_score(y_val, y_Ensb)

## 5. Apply The Best-fit Model to The Testing Data