## Data Description

This data represents the results of a large product testing study. For each `product_code` you are given a number of product `attributes` (fixed for the code) as well as a number of `measurement` values for each individual product, representing various lab testing methods. Each product is used in a simulated real-world environment experiment and absorbs a certain amount of fluid (`loading`) to see whether or not it fails.

Your task is to use the data to predict individual product failures of new codes with their individual lab test results.

## Evaluation

Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target. Higher is better.
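
To make the metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed with `sklearn.metrics.roc_auc_score`. The four labels and scores are toy values chosen for illustration and are not taken from the competition data.

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (failure, non-failure) pairs ranked in the right order:
# 3 of the 4 pairs here are ordered correctly
auc = roc_auc_score(y_true, y_prob)
print(auc)
```

Because the score depends only on how the predictions rank the samples, continuous probabilities or scores carry more information for this metric than hard 0/1 labels.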

## Understanding

- Rows with the same `product_code` are the same product
- The same product always has the same attributes

## Blueprint

1. Convert 'attribute_0' and 'attribute_1' to numeric values
2. Drop the product code (A, B, C, D, E)
3. Apply PCA
4. Split into training and validation data
5. Apply ML models

### Import Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Show all the columns and rows
pd.set_option("display.max_columns", None)  # columns
# pd.set_option("display.max_rows", None)   # rows

# Ignore all future warnings
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

## 1. Data Loading

In [2]:
# Load dataset
data = pd.read_csv('train.csv') # training
te = pd.read_csv('test.csv')    # testing

print(data.shape)
data.head()

(26570, 26)


Unnamed: 0,id,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


In [3]:
# Store 'id' (named id_tr so the built-in id() is not shadowed)
id_tr = data.id
id_te = te.id

# Drop 'id'
data = data.drop(columns=['id'])
te = te.drop(columns=['id'])

print(data.shape)
data.head()

(26570, 25)


Unnamed: 0,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


## 2. Data Exploration

In [4]:
# Check data types and missing values
data.info()     # missing: loading, measurement_3 ~ measurement_17

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26570 entries, 0 to 26569
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   product_code    26570 non-null  object 
 1   loading         26320 non-null  float64
 2   attribute_0     26570 non-null  object 
 3   attribute_1     26570 non-null  object 
 4   attribute_2     26570 non-null  int64  
 5   attribute_3     26570 non-null  int64  
 6   measurement_0   26570 non-null  int64  
 7   measurement_1   26570 non-null  int64  
 8   measurement_2   26570 non-null  int64  
 9   measurement_3   26189 non-null  float64
 10  measurement_4   26032 non-null  float64
 11  measurement_5   25894 non-null  float64
 12  measurement_6   25774 non-null  float64
 13  measurement_7   25633 non-null  float64
 14  measurement_8   25522 non-null  float64
 15  measurement_9   25343 non-null  float64
 16  measurement_10  25270 non-null  float64
 17  measurement_11  25102 non-null 

In [5]:
# Check the distributions
print(data["failure"].value_counts())   # target
print(data["product_code"].value_counts())  # product

0    20921
1     5649
Name: failure, dtype: int64
C    5765
E    5343
B    5250
D    5112
A    5100
Name: product_code, dtype: int64


### - Object Values

In [6]:
# Unique values in each column
print(data['product_code'].unique())
print(data['attribute_0'].unique())
print(data['attribute_1'].unique())

['A' 'B' 'C' 'D' 'E']
['material_7' 'material_5']
['material_8' 'material_5' 'material_6']


### - Int/Float Values

In [7]:
data.describe()

Unnamed: 0,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
count,26320.0,26570.0,26570.0,26570.0,26570.0,26570.0,26189.0,26032.0,25894.0,25774.0,25633.0,25522.0,25343.0,25270.0,25102.0,24969.0,24796.0,24696.0,24561.0,24460.0,24286.0,26570.0
mean,127.826233,6.754046,7.240459,7.415883,8.232518,6.256568,17.791528,11.731988,17.127804,17.510759,11.716624,19.024714,11.430725,16.117711,19.172085,11.702464,15.652904,16.048444,14.995554,16.460727,701.269059,0.212608
std,39.03002,1.471852,1.456493,4.11669,4.199401,3.309109,1.0012,0.996085,0.996414,0.99598,1.000836,1.008591,0.999137,1.405978,1.520785,1.488838,1.155247,1.491923,1.549226,1.708935,123.304161,0.40916
min,33.16,5.0,5.0,0.0,0.0,0.0,13.968,8.008,12.073,12.715,7.968,15.217,7.537,9.323,12.461,5.167,10.89,9.14,9.104,9.701,196.787,0.0
25%,99.9875,6.0,6.0,4.0,5.0,4.0,17.117,11.051,16.443,16.839,11.045,18.34025,10.757,15.209,18.17,10.703,14.89,15.057,13.957,15.268,618.9615,0.0
50%,122.39,6.0,8.0,7.0,8.0,6.0,17.787,11.733,17.132,17.516,11.712,19.021,11.43,16.127,19.2115,11.717,15.6285,16.04,14.969,16.436,701.0245,0.0
75%,149.1525,8.0,8.0,10.0,11.0,8.0,18.469,12.41,17.805,18.178,12.391,19.708,12.102,17.025,20.207,12.709,16.374,17.082,16.018,17.628,784.09025,0.0
max,385.86,9.0,9.0,29.0,29.0,24.0,21.499,16.484,21.425,21.543,15.419,23.807,15.412,22.479,25.64,17.663,22.713,22.303,21.626,24.094,1312.794,1.0


### - Compare details based on the failure

In [8]:
# Details of failure=0
data.loc[data["failure"]==0].describe()     # higher max in measurement_0 (29>25)

Unnamed: 0,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
count,20715.0,20921.0,20921.0,20921.0,20921.0,20921.0,20601.0,20511.0,20417.0,20296.0,20181.0,20091.0,19977.0,19898.0,19764.0,19676.0,19520.0,19460.0,19342.0,19247.0,19136.0,20921.0
mean,125.205495,6.749199,7.255007,7.395249,8.256106,6.229387,17.789662,11.737408,17.118474,17.503106,11.707892,19.015737,11.432582,16.118818,19.17588,11.699068,15.654003,16.043638,14.998406,16.458737,699.100303,0.0
std,37.763502,1.467673,1.451816,4.104495,4.195542,3.284838,1.003835,0.996028,0.994795,1.000764,1.002154,1.010856,1.000807,1.408742,1.524026,1.485416,1.156601,1.488948,1.553272,1.715717,123.471655,0.0
min,33.16,5.0,5.0,0.0,0.0,0.0,13.968,8.008,13.395,12.715,7.973,15.217,7.537,9.323,12.461,5.167,10.89,9.593,9.104,9.701,206.571,0.0
25%,98.1,6.0,6.0,4.0,5.0,4.0,17.114,11.06,16.434,16.832,11.034,18.3325,10.759,15.205,18.171,10.7,14.888,15.05775,13.958,15.27,616.983,0.0
50%,119.9,6.0,8.0,7.0,8.0,6.0,17.787,11.735,17.121,17.5085,11.704,19.012,11.434,16.1305,19.2145,11.711,15.632,16.0385,14.97,16.433,698.7035,0.0
75%,145.885,8.0,8.0,10.0,11.0,8.0,18.472,12.413,17.8,18.172,12.384,19.698,12.103,17.031,20.213,12.708,16.38,17.07425,16.02875,17.621,781.0695,0.0
max,374.33,9.0,9.0,29.0,28.0,24.0,21.499,16.484,20.791,21.543,15.419,23.807,15.412,21.761,25.64,17.663,22.388,22.303,21.626,23.164,1312.794,0.0


In [9]:
# Details of failure=1
data.loc[data["failure"]==1].describe()     # higher loading in general

Unnamed: 0,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
count,5605.0,5649.0,5649.0,5649.0,5649.0,5649.0,5588.0,5521.0,5477.0,5478.0,5452.0,5431.0,5366.0,5372.0,5338.0,5293.0,5276.0,5236.0,5219.0,5213.0,5150.0,5649.0
mean,137.511973,6.771995,7.186582,7.4923,8.145158,6.357231,17.798403,11.711852,17.162583,17.539114,11.748947,19.057923,11.423809,16.11361,19.158035,11.715089,15.648834,16.066307,14.984985,16.468072,709.327565,1.0
std,41.998788,1.487217,1.472562,4.161019,4.212885,3.39589,0.991485,0.996126,1.001751,0.977618,0.995366,0.999559,0.992956,1.395817,1.508785,1.501566,1.150325,1.502939,1.534241,1.683804,122.355345,0.0
min,49.64,5.0,5.0,0.0,0.0,0.0,14.166,8.196,12.073,14.093,7.968,15.823,8.103,10.635,13.797,5.867,11.496,9.14,9.425,10.735,196.787,1.0
25%,107.3,6.0,6.0,5.0,5.0,4.0,17.12475,11.024,16.474,16.859,11.093,18.3695,10.75,15.22,18.164,10.71,14.911,15.0555,13.9465,15.264,628.13175,1.0
50%,131.84,6.0,8.0,7.0,8.0,6.0,17.7845,11.725,17.171,17.543,11.754,19.053,11.415,16.112,19.2,11.741,15.613,16.048,14.967,16.452,708.9985,1.0
75%,161.15,8.0,8.0,10.0,11.0,8.0,18.459,12.401,17.82,18.202,12.41125,19.7475,12.097,16.9995,20.175,12.716,16.354,17.11325,15.978,17.647,792.56,1.0
max,385.86,9.0,9.0,25.0,29.0,24.0,21.267,15.164,21.425,20.621,15.269,22.525,15.154,22.479,25.429,17.594,22.713,21.847,20.784,24.094,1181.998,1.0


### - Combinations for Product Codes

In [10]:
# Check attribute combinations for each product
def combinations(df):
    '''Print the unique attribute values found for each product code.'''
    products = df['product_code'].unique()      # product codes

    for product in products:
        attr = []       # unique values of each attribute for this product
        subset = df.loc[df['product_code']==product, :]     # get subsets for each 'product code'

        attr.append(subset['attribute_0'].unique())
        attr.append(subset['attribute_1'].unique())
        attr.append(subset['attribute_2'].unique())
        attr.append(subset['attribute_3'].unique())

        print("Product",product, "consists of", attr)

In [11]:
# Check the combination of attributes for each product
combinations(data)  # training set
print("\n")
combinations(te)    # testing set

Product A consists of [array(['material_7'], dtype=object), array(['material_8'], dtype=object), array([9]), array([5])]
Product B consists of [array(['material_5'], dtype=object), array(['material_5'], dtype=object), array([8]), array([8])]
Product C consists of [array(['material_7'], dtype=object), array(['material_8'], dtype=object), array([5]), array([8])]
Product D consists of [array(['material_7'], dtype=object), array(['material_5'], dtype=object), array([6]), array([6])]
Product E consists of [array(['material_7'], dtype=object), array(['material_6'], dtype=object), array([6]), array([9])]


Product F consists of [array(['material_5'], dtype=object), array(['material_6'], dtype=object), array([6]), array([4])]
Product G consists of [array(['material_5'], dtype=object), array(['material_6'], dtype=object), array([9]), array([7])]
Product H consists of [array(['material_7'], dtype=object), array(['material_7'], dtype=object), array([7]), array([9])]
Product I consists of [array([
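
The same check can be written more compactly with a `groupby`. The sketch below uses a small hand-made frame (`toy`, a hypothetical stand-in for the training set) to confirm that every attribute takes exactly one value per product code.

```python
import pandas as pd

# Hypothetical miniature of the training frame, for illustration only
toy = pd.DataFrame({
    "product_code": ["A", "A", "B", "B"],
    "attribute_0": ["material_7", "material_7", "material_5", "material_5"],
    "attribute_1": ["material_8", "material_8", "material_5", "material_5"],
    "attribute_2": [9, 9, 8, 8],
    "attribute_3": [5, 5, 8, 8],
})
attr_cols = ["attribute_0", "attribute_1", "attribute_2", "attribute_3"]

# Number of distinct values of each attribute within each product code
n_unique = toy.groupby("product_code")[attr_cols].nunique()

# True when every attribute is constant within every product code
attrs_fixed = bool((n_unique == 1).all().all())

# One row per product code with its attribute combination
per_code = toy.groupby("product_code")[attr_cols].first()
print(attrs_fixed)
print(per_code)
```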

## 3. Preprocessing

### (1) Drop 'product_code'

In [12]:
# Drop 'product_code'
data = data.drop(columns=['product_code'])
te = te.drop(columns=['product_code'])

print(data.shape)
data.head()

(26570, 24)


Unnamed: 0,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


### (2) Treat categorical values ('attributes')

In [13]:
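# Attribute columns sit at positions 1-4 now that 'id' and 'product_code' are dropped
cat_tr = data.iloc[:, [1, 2, 3, 4]]
non_cat_tr = data.drop(columns = cat_tr.columns.to_list())

# Same positions in the testing set
cat_te = te.iloc[:, [1, 2, 3, 4]]
non_cat_te = te.drop(columns = cat_te.columns.to_list())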

# One Hot Encoding
ohe_cat_tr = pd.get_dummies(cat_tr, 
                columns = ['attribute_0', 'attribute_1', 'attribute_2', 'attribute_3'])

ohe_cat_te = pd.get_dummies(cat_te, 
                columns = ['attribute_0', 'attribute_1', 'attribute_2', 'attribute_3'])

In [14]:
data = pd.concat([ohe_cat_tr, non_cat_tr], axis=1)
te = pd.concat([ohe_cat_te, non_cat_te], axis=1)

print(data.shape)
data.head()

(26570, 33)


Unnamed: 0,attribute_0_material_5,attribute_0_material_7,attribute_1_material_5,attribute_1_material_6,attribute_1_material_8,attribute_2_5,attribute_2_6,attribute_2_8,attribute_2_9,attribute_3_5,attribute_3_6,attribute_3_8,attribute_3_9,loading,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,0,1,0,0,1,0,0,0,1,1,0,0,0,80.1,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,0,1,0,0,1,0,0,0,1,1,0,0,0,84.89,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,0,1,0,0,1,0,0,0,1,1,0,0,0,82.43,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,0,1,0,0,1,0,0,0,1,1,0,0,0,101.07,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,0,1,0,0,1,0,0,0,1,1,0,0,0,188.06,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0
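
One caveat: the testing set contains attribute values that never appear in training (the product listing above shows, for example, `attribute_3` equal to 4 and 7), so calling `pd.get_dummies` separately on each set yields different columns. A minimal sketch of one way to align them, assuming the training columns define the feature space, is shown below with hypothetical toy frames.

```python
import pandas as pd

# Toy frames with partly different category levels (illustrative only)
train_cat = pd.DataFrame({"attribute_0": ["material_7", "material_5"],
                          "attribute_2": [9, 8]})
test_cat = pd.DataFrame({"attribute_0": ["material_5", "material_7"],
                         "attribute_2": [6, 7]})
cols = ["attribute_0", "attribute_2"]

ohe_train = pd.get_dummies(train_cat, columns=cols)
ohe_test = pd.get_dummies(test_cat, columns=cols)

# Keep exactly the training columns: levels seen only in the test set are
# dropped, and training levels missing from the test set are filled with 0
ohe_test_aligned = ohe_test.reindex(columns=ohe_train.columns, fill_value=0)
print(ohe_test_aligned.columns.tolist())
```

With this choice, a test row whose level is unseen in training gets all zeros for that attribute. Whether that is acceptable depends on the model, so treat it as one option rather than the only fix.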


### (3) Check correlations

In [15]:
corr = round(data.corr(), 3)
corr[((corr > 0.3) | (corr < -0.3)) & (corr != 1) & (corr != -1)]

Unnamed: 0,attribute_0_material_5,attribute_0_material_7,attribute_1_material_5,attribute_1_material_6,attribute_1_material_8,attribute_2_5,attribute_2_6,attribute_2_8,attribute_2_9,attribute_3_5,attribute_3_6,attribute_3_8,attribute_3_9,loading,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
attribute_0_material_5,,,0.621,,-0.413,,-0.4,,,,,0.59,,,,,,,,,,,,,,,,,,,,,
attribute_0_material_7,,,-0.621,,0.413,,0.4,,,,,-0.59,,,,,,,,,,,,,,,,,,,,,
attribute_1_material_5,0.621,-0.621,,-0.401,-0.665,-0.421,,0.621,-0.39,-0.39,0.61,,-0.401,,,,0.303,,,,,,,,,,,,,,,,
attribute_1_material_6,,,-0.401,,-0.417,,0.623,,,,,-0.422,,,,,,,,,,,,,,,,,,,,,
attribute_1_material_8,-0.413,0.413,-0.665,-0.417,,0.633,-0.67,-0.413,0.586,0.586,-0.406,,-0.417,,,-0.315,,,,,,,,,,,,,,,,,
attribute_2_5,,,-0.421,,0.633,,-0.424,,,,,0.626,,,,,,,,,,,,,,,,,,,,,
attribute_2_6,-0.4,0.4,,0.623,-0.67,-0.424,,-0.4,-0.393,-0.393,0.606,-0.678,0.623,,,,,,,,,,,,,,,,,,,,
attribute_2_8,,,0.621,,-0.413,,-0.4,,,,,0.59,,,,,,,,,,,,,,,,,,,,,
attribute_2_9,,,-0.39,,0.586,,-0.393,,,,,-0.41,,,0.345,-0.465,,,,,,,,,,,,,,,,,
attribute_3_5,,,-0.39,,0.586,,-0.393,,,,,-0.41,,,0.345,-0.465,,,,,,,,,,,,,,,,,


### (4) Treat Missing Values

In [16]:
# # Opt A. Drop missing values
# data = data.dropna()

# Opt B. Replace missing values with 0
data = data.fillna(0)

# # Opt C. Replace missing values with the feature's mean
# data = data.fillna(data.mean())
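
As a further option (not one of A, B, C above), the fill values can be learned from the training set and reused on the testing set, so both are treated identically. The sketch below uses `sklearn.impute.SimpleImputer` with the median. The toy frames and the choice of median are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frames with gaps (illustrative values only)
train_num = pd.DataFrame({"loading": [80.1, np.nan, 101.07],
                          "measurement_17": [764.1, 682.057, np.nan]})
test_num = pd.DataFrame({"loading": [np.nan, 120.0],
                         "measurement_17": [700.0, np.nan]})

# Learn the per-column medians on the training data only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train_num),
                            columns=train_num.columns)

# Reuse the training medians for the testing data
test_filled = pd.DataFrame(imputer.transform(test_num),
                           columns=test_num.columns)
print(test_filled)
```

Unlike filling with 0, this keeps imputed values inside the observed range of each feature, which matters once the features are standardized.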

In [17]:
# Check the distributions again
print(data["failure"].value_counts())   # target
print(data.shape)

0    20921
1     5649
Name: failure, dtype: int64
(26570, 33)


### (5) PCA

In [18]:
X = data.iloc[:, 0:32]  # all features
y = data.loc[:, 'failure']  # target

In [19]:
# Standardize
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X)
X_scaled = pd.DataFrame(scaler.transform(X), columns=list(X.columns))

In [20]:
cov = np.dot(X_scaled.T, X_scaled)/(len(X_scaled)-1)    # covariance
eig = np.linalg.eig(cov)        # eigenvalues & eigenvectors

In [21]:
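# Contribution in the Data
def percent_variation(n):
    '''Fraction of total variance for the first 'n' eigenvalues returned by np.linalg.eig.

    np.linalg.eig does not sort its eigenvalues, so these are not guaranteed to be the 'n' largest.
    '''
    trace = sum(eig[0])    # sum of all eigenvalues
    contribution = [np.round(i/trace,5) for i in eig[0]]
    return contribution[:n]    # return the first n contributions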

In [22]:
n = 20
print(percent_variation(n))
print('Total contribution of', n, 'components:', sum(percent_variation(n))*100, '%\n')

[(0.15242+0j), (0.13283+0j), (0.07738+0j), (0.06543+0j), (0.02022+0j), (0.02474+0j), (0.03582+0j), (0.02691+0j), (0.02801+0j), (0.03251+0j), (0.03221+0j), (0.02987+0j), (0.03198+0j), (0.03175+0j), (0.03163+0j), (0.03145+0j), (0.03024+0j), (0.03036+0j), (0.0305+0j), (0.0312+0j)]
Total contribution of 20 components: (90.746+0j) %
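
The output above is complex-valued and not in descending order because `np.linalg.eig` makes no assumption about symmetry and does not sort its results. A minimal sketch of an alternative, assuming synthetic data in place of `X_scaled`, uses `np.linalg.eigh` (intended for symmetric matrices, returning real eigenvalues in ascending order) and checks the result against `PCA.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data standing in for the notebook's features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Covariance of the standardized (hence centered) data
cov_demo = X_demo_scaled.T @ X_demo_scaled / (len(X_demo_scaled) - 1)

# eigh returns real eigenvalues in ascending order; reverse for descending
eigvals_desc = np.linalg.eigh(cov_demo)[0][::-1]
ratios = eigvals_desc / eigvals_desc.sum()

# PCA reports the same ratios, already sorted from largest to smallest
pca_demo = PCA(n_components=5).fit(X_demo_scaled)
print(np.round(ratios, 5))
print(np.round(pca_demo.explained_variance_ratio_, 5))
```

With sorted ratios, `np.cumsum(ratios)` gives the variance retained by the leading components directly.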



In [23]:
# Fit to PCA (getting a projection matrix)
pca = PCA(n_components=20)
pca.fit(X_scaled)

PCA(n_components=20)

In [24]:
# How much each of the features influences the PC
influence = pd.DataFrame(pca.components_, columns=list(X_scaled.columns))
# influence

In [25]:
# Projection (transforming the original data via projection matrix)
data_pca = pca.transform(X_scaled)
data_pca = pd.DataFrame(data_pca)

print(data_pca.shape)
data_pca.head()

(26570, 20)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-2.349394,-1.964698,-1.303966,-0.98768,-0.554604,-1.617399,-0.738669,2.12423,0.315575,0.183741,-0.802564,-0.083918,-0.914289,-0.685622,-0.023878,1.248657,-0.863789,0.264279,1.926349,-0.270734
1,-2.71464,-2.531594,-1.822463,-0.914283,-0.625165,-0.045687,-0.223977,0.159129,0.714605,-0.559149,0.223976,-0.23283,-0.120726,-0.993069,0.021892,0.026256,-0.050733,-0.533418,-0.073019,-0.655557
2,-2.558816,-2.615596,-1.880682,-0.896443,-0.34752,0.37264,-1.88825,1.019234,1.692147,-1.248787,1.024169,-1.782918,1.249223,-0.185768,-0.708649,0.244224,1.824514,0.891206,-0.277401,0.132153
3,-2.559979,-2.504797,-1.84789,-0.882382,-1.210079,0.159983,0.285513,0.088075,0.139307,-0.324613,0.032903,-0.891223,0.30696,-0.759363,0.167718,-0.142139,-0.160595,-0.551755,-0.169424,0.215321
4,-2.358885,-2.338624,-1.815631,-0.968781,0.017734,-0.417007,-0.483347,-0.210421,-1.184965,0.206227,0.02806,0.346867,-0.629823,0.28971,-1.046042,-0.246809,-0.659602,0.600728,-0.315509,0.953749


## 4. Apply ML models

In [26]:
# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(data_pca, y, test_size=0.3, random_state=0, stratify=y)
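
Note that the scaler and the PCA projection above were fitted on all rows before this split, so the validation rows influence the transformation. A minimal sketch of one way to avoid that is to wrap the steps in a `Pipeline` that is fitted on the training split only. The synthetic data and the `LogisticRegression` stand-in model are assumptions for illustration, not the notebook's choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the notebook's features
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

# Scaling and PCA are learned from the training rows only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in model (assumption)
])
pipe.fit(X_tr, y_tr)

# Validation rows pass through the already-fitted transformations
val_auc = roc_auc_score(y_va, pipe.predict_proba(X_va)[:, 1])
print(val_auc)
```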

### (1) Linear SVC

In [27]:
# from sklearn.svm import LinearSVC

# LinSVC = LinearSVC(random_state=0)
# LinSVC.fit(X_train, y_train)
# y_LinSVC = LinSVC.predict(X_val)

In [28]:
# # Compute AUC score
# metrics.roc_auc_score(y_val, y_LinSVC)  # 0.5(A), 0.5(B), 0.5(C)

In [29]:
# metrics.plot_roc_curve(LinSVC, X_val, y_val) 
# plt.show()
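
A likely reason for the 0.5 scores recorded above is that `predict` returns hard 0/1 labels, which discard the ranking that ROC AUC measures. The sketch below, on synthetic data (an assumption, not the competition data), passes the continuous output of `decision_function` to `roc_auc_score` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic, imbalanced data standing in for the PCA features
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

clf = LinearSVC(random_state=0, max_iter=5000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_va)        # only 0 or 1
scores = clf.decision_function(X_va)   # signed distance to the hyperplane

auc_labels = roc_auc_score(y_va, hard_labels)
auc_scores = roc_auc_score(y_va, scores)
print(auc_labels, auc_scores)
```

The same idea applies to the other classifiers in this section: use `decision_function` or `predict_proba(...)[:, 1]` where available. Also note that `metrics.plot_roc_curve` has been removed from recent scikit-learn releases; `metrics.RocCurveDisplay.from_estimator(estimator, X_val, y_val)` is its replacement.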

### (2) SGD Classifier

In [30]:
# from sklearn.linear_model import SGDClassifier

# SGD = SGDClassifier(random_state=0)
# SGD.fit(X_train, y_train)
# y_SGD = SGD.predict(X_val)

In [31]:
# # Compute AUC score
# metrics.roc_auc_score(y_val, y_SGD)     # 0.5(A), 0.5(B), 0.5(C)

In [32]:
# metrics.plot_roc_curve(SGD, X_val, y_val) 
# plt.show()

### (3) KNN Classifier

In [33]:
# from sklearn.neighbors import KNeighborsClassifier

# KNN = KNeighborsClassifier(n_neighbors=2)
# KNN.fit(X_train, y_train)
# y_KNN = KNN.predict(X_val)

In [34]:
# # Compute AUC score
# metrics.roc_auc_score(y_val, y_KNN)     # 0.507(A), 0.502(B), 0.502(C)

In [35]:
# metrics.plot_roc_curve(KNN, X_val, y_val) 
# plt.show()

### (4) Kernel Approximation

In [36]:
# check

### (5) SVC

In [37]:
# from sklearn.svm import SVC

# svc_model = SVC(random_state=0)    # separate name so the SVC class is not overwritten
# svc_model.fit(X_train, y_train)
# y_SVC = svc_model.predict(X_val)

In [38]:
# # Compute AUC score
# metrics.roc_auc_score(y_val, y_SVC)     # 0.5(A), 0.5(B), 0.5(C)

In [39]:
# metrics.plot_roc_curve(svc_model, X_val, y_val) 
# plt.show()

### (6) Ensemble Classifiers _ Random Forest Regressor

In [48]:
from sklearn.ensemble import RandomForestRegressor

EnsbRnd = RandomForestRegressor(max_depth=5, random_state=0)
EnsbRnd.fit(X_train, y_train)
y_EnsbRnd = EnsbRnd.predict(X_val)

In [49]:
# Compute AUC score
metrics.roc_auc_score(y_val, y_EnsbRnd)    

# max_depth=4 : 0.576(B)
# max_depth=5 : 0.551(A), 0.578(B), 0.577(C)
# max_depth=6 : 0.577(B), 0.578(C)
# max_depth=7 : 0.577(B), 0.578(C)
# max_depth=8 : 0.577(B), 0.576(C)

0.5760326833881377
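
The regressor works with ROC AUC here because its output is continuous. A classifier can be used the same way by scoring its predicted probability for the failure class. The sketch below shows this with `RandomForestClassifier` on synthetic data. It is an alternative formulation under stated assumptions, not the configuration that produced the score above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the PCA features
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn)

rf_clf = RandomForestClassifier(max_depth=5, random_state=0)
rf_clf.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of class 1 (failure)
proba_fail = rf_clf.predict_proba(X_va)[:, 1]
auc_rf = roc_auc_score(y_va, proba_fail)
print(auc_rf)
```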

### (7) Ensemble Classifiers _ Bagging

In [42]:
# from sklearn.svm import SVC
# from sklearn.ensemble import BaggingClassifier

# EnsbBag = BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0)
# EnsbBag.fit(X_train, y_train)
# y_EnsbBag = EnsbBag.predict(X_val)

In [43]:
# # Compute AUC score
# metrics.roc_auc_score(y_val, y_EnsbBag)     # 0.5(A), 0.5(B), 0.5(C)

## 5. Apply The Best-fit Model to The Testing Data
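
A minimal sketch of what this step could look like is given below, assuming the testing features have been brought to the same columns as the training features. The transformations fitted on the training data are reused on the testing data, and the predictions are paired with the stored ids. All `*_demo` names and the two-column `id`/`failure` layout are assumptions for illustration, and the arrays are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the preprocessed training and testing features
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(300, 8))
y_train_demo = (X_train_demo[:, 0] + rng.normal(size=300) > 1).astype(int)
X_test_demo = rng.normal(size=(50, 8))
id_te_demo = pd.Series(np.arange(1000, 1050), name="id")  # stand-in for id_te

# Fit every step on the training data only
scaler_demo = StandardScaler().fit(X_train_demo)
train_scaled = scaler_demo.transform(X_train_demo)
pca_demo = PCA(n_components=5).fit(train_scaled)
model_demo = RandomForestRegressor(max_depth=5, random_state=0)
model_demo.fit(pca_demo.transform(train_scaled), y_train_demo)

# Reuse the fitted scaler and projection on the testing data
test_pca = pca_demo.transform(scaler_demo.transform(X_test_demo))
submission = pd.DataFrame({"id": id_te_demo,
                           "failure": model_demo.predict(test_pca)})
print(submission.head())
```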