## Data Description

This data represents the results of a large product testing study. For each `product_code` you are given a number of product `attributes` (fixed for the code) as well as a number of `measurement` values for each individual product, representing various lab testing methods. Each product is used in a simulated real-world environment experiment and absorbs a certain amount of fluid (`loading`) to see whether or not it fails.

Your task is to use the data to predict individual product failures of new codes with their individual lab test results.

## Evaluation

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
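
As a small illustration of the metric, the sketch below computes the area under the ROC curve with `sklearn.metrics.roc_auc_score` on four made-up labels and probabilities (the values are assumptions, not competition data). It also shows that only the ranking of the predictions matters: rescaling the probabilities leaves the score unchanged.

```python
from sklearn import metrics

# Toy labels and predicted failure probabilities (made up for illustration)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = metrics.roc_auc_score(y_true, y_prob)
print(auc)  # 0.75: three of the four (positive, negative) pairs are ranked correctly

# Halving every probability keeps the ordering, so the AUC stays the same
auc_scaled = metrics.roc_auc_score(y_true, [p / 2 for p in y_prob])
print(auc_scaled)
```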

## Understanding

- Rows with the same 'product_code' belong to the same product.
- Within one product code, 'attribute_0' and 'attribute_1' take a single value each.

## Blueprint

1. Drop 'attribute_0' and 'attribute_1'
2. Get subsets for each product code (A, B, C, D, E)
3. Split into
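
The first two steps can be sketched as follows. This is a minimal example on a made-up frame that only borrows the column names from `train.csv`; the names `toy`, `reduced`, and `subsets` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with the same column names as train.csv (values are made up)
toy = pd.DataFrame({
    "product_code": ["A", "A", "B", "C"],
    "attribute_0": ["material_7", "material_7", "material_5", "material_7"],
    "attribute_1": ["material_8", "material_8", "material_5", "material_8"],
    "loading": [80.1, 84.89, 120.0, 95.5],
    "failure": [0, 0, 1, 0],
})

# Step 1: drop the two material attributes
reduced = toy.drop(columns=["attribute_0", "attribute_1"])

# Step 2: build one subset per product code
subsets = {
    code: group.reset_index(drop=True)
    for code, group in reduced.groupby("product_code")
}

print(sorted(subsets))
print(subsets["A"].shape)
```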

### Import Packages

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.decomposition import PCA

# Show all the columns and rows
pd.set_option("display.max_columns", None)  # columns
# pd.set_option("display.max_rows", None)   # rows

## 1. Data Loading

In [9]:
# Load dataset
data = pd.read_csv('train.csv') # training
te = pd.read_csv('test.csv')    # testing

print(data.shape)
data.head()

(26570, 26)


Unnamed: 0,id,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


In [10]:
# Store 'id' (named 'ids' to avoid shadowing the built-in id())
ids = data['id']

# Drop 'id'
data = data.drop(columns=['id'])

print(data.shape)
data.head()

(26570, 25)


Unnamed: 0,product_code,loading,attribute_0,attribute_1,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
0,A,80.1,material_7,material_8,9,5,7,8,4,18.04,12.518,15.748,19.292,11.739,20.155,10.672,15.859,17.594,15.193,15.029,,13.034,14.684,764.1,0
1,A,84.89,material_7,material_8,9,5,14,3,3,18.213,11.54,17.717,17.893,12.748,17.889,12.448,17.947,17.915,11.755,14.732,15.425,14.395,15.631,682.057,0
2,A,82.43,material_7,material_8,9,5,12,1,5,18.057,11.652,16.738,18.24,12.718,18.288,12.715,15.607,,13.798,16.711,18.631,14.094,17.946,663.376,0
3,A,101.07,material_7,material_8,9,5,13,2,6,17.295,11.188,18.576,18.339,12.583,19.06,12.471,16.346,18.377,10.02,15.25,15.562,16.154,17.172,826.282,0
4,A,188.06,material_7,material_8,9,5,9,2,8,19.346,12.95,16.99,15.746,11.306,18.093,10.337,17.082,19.932,12.428,16.182,12.76,13.153,16.412,579.885,0


## 2. Data Exploration

In [11]:
# Check data types
data.dtypes

product_code       object
loading           float64
attribute_0        object
attribute_1        object
attribute_2         int64
attribute_3         int64
measurement_0       int64
measurement_1       int64
measurement_2       int64
measurement_3     float64
measurement_4     float64
measurement_5     float64
measurement_6     float64
measurement_7     float64
measurement_8     float64
measurement_9     float64
measurement_10    float64
measurement_11    float64
measurement_12    float64
measurement_13    float64
measurement_14    float64
measurement_15    float64
measurement_16    float64
measurement_17    float64
failure             int64
dtype: object

### - Object Values

In [20]:
# Unique values in each column
data['product_code'].unique()   # array(['A', 'B', 'C', 'D', 'E'], dtype=object)
data['attribute_0'].unique()    # array(['material_7', 'material_5'], dtype=object)
data['attribute_1'].unique()    # array(['material_8', 'material_5', 'material_6'], dtype=object)

array(['material_8', 'material_5', 'material_6'], dtype=object)

In [40]:
# Check attribute combinations for each product
def combinations(df):
    """Print the unique 'attribute_0' and 'attribute_1' values of each product code."""
    products = df['product_code'].unique()      # product codes

    for product in products:
        subset = df.loc[df['product_code'] == product, :]     # rows of this product code

        # unique values of the two material attributes within this product code
        attr = [subset['attribute_0'].unique(), subset['attribute_1'].unique()]

        print("Product", product, "consists of", attr)

In [41]:
combinations(data)

Product A consists of [array(['material_7'], dtype=object), array(['material_8'], dtype=object)]
Product B consists of [array(['material_5'], dtype=object), array(['material_5'], dtype=object)]
Product C consists of [array(['material_7'], dtype=object), array(['material_8'], dtype=object)]
Product D consists of [array(['material_7'], dtype=object), array(['material_5'], dtype=object)]
Product E consists of [array(['material_7'], dtype=object), array(['material_6'], dtype=object)]
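
The output shows one material per attribute for every code, and that codes A and C share the same pair (`material_7`, `material_8`). The same check can be written more compactly with `groupby`; the sketch below runs on a made-up frame (the name `toy` and its values are assumptions) rather than on the real data.

```python
import pandas as pd

# Toy frame with the same column names as train.csv (values are made up)
toy = pd.DataFrame({
    "product_code": ["A", "A", "B", "C"],
    "attribute_0": ["material_7", "material_7", "material_5", "material_7"],
    "attribute_1": ["material_8", "material_8", "material_5", "material_8"],
})

# Number of distinct values of each attribute within each product code;
# a value of 1 everywhere means the attributes are fixed per code
n_unique = toy.groupby("product_code")[["attribute_0", "attribute_1"]].nunique()
print(n_unique)

# One row per distinct (code, attribute_0, attribute_1) combination
combos = toy[["product_code", "attribute_0", "attribute_1"]].drop_duplicates()
print(combos)
```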


In [43]:
combinations(te)

Product F consists of [array(['material_5'], dtype=object), array(['material_6'], dtype=object)]
Product G consists of [array(['material_5'], dtype=object), array(['material_6'], dtype=object)]
Product H consists of [array(['material_7'], dtype=object), array(['material_7'], dtype=object)]
Product I consists of [array(['material_7'], dtype=object), array(['material_5'], dtype=object)]
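
The test set contains codes F to I, none of which appear in the training set (A to E). One possible way to mimic this during validation, not used elsewhere in this notebook, is to hold out whole product codes with scikit-learn's `GroupKFold`. The sketch below uses made-up arrays (`codes`, `X`, `y` are assumptions) purely to show that no code ends up on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 10 rows spread over the five training codes (values are made up)
codes = np.array(["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"])
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

gkf = GroupKFold(n_splits=5)
folds = []
for train_idx, valid_idx in gkf.split(X, y, groups=codes):
    train_codes = set(codes[train_idx].tolist())
    valid_codes = set(codes[valid_idx].tolist())
    folds.append((train_codes, valid_codes))
    print("held out:", sorted(valid_codes), "| trained on:", sorted(train_codes))
```

With five codes and five splits, each fold validates on exactly one unseen code, which is closer to the test situation than a random row-level split.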


### - Int/Float Values

In [17]:
data.describe()

Unnamed: 0,loading,attribute_2,attribute_3,measurement_0,measurement_1,measurement_2,measurement_3,measurement_4,measurement_5,measurement_6,measurement_7,measurement_8,measurement_9,measurement_10,measurement_11,measurement_12,measurement_13,measurement_14,measurement_15,measurement_16,measurement_17,failure
count,26320.0,26570.0,26570.0,26570.0,26570.0,26570.0,26189.0,26032.0,25894.0,25774.0,25633.0,25522.0,25343.0,25270.0,25102.0,24969.0,24796.0,24696.0,24561.0,24460.0,24286.0,26570.0
mean,127.826233,6.754046,7.240459,7.415883,8.232518,6.256568,17.791528,11.731988,17.127804,17.510759,11.716624,19.024714,11.430725,16.117711,19.172085,11.702464,15.652904,16.048444,14.995554,16.460727,701.269059,0.212608
std,39.03002,1.471852,1.456493,4.11669,4.199401,3.309109,1.0012,0.996085,0.996414,0.99598,1.000836,1.008591,0.999137,1.405978,1.520785,1.488838,1.155247,1.491923,1.549226,1.708935,123.304161,0.40916
min,33.16,5.0,5.0,0.0,0.0,0.0,13.968,8.008,12.073,12.715,7.968,15.217,7.537,9.323,12.461,5.167,10.89,9.14,9.104,9.701,196.787,0.0
25%,99.9875,6.0,6.0,4.0,5.0,4.0,17.117,11.051,16.443,16.839,11.045,18.34025,10.757,15.209,18.17,10.703,14.89,15.057,13.957,15.268,618.9615,0.0
50%,122.39,6.0,8.0,7.0,8.0,6.0,17.787,11.733,17.132,17.516,11.712,19.021,11.43,16.127,19.2115,11.717,15.6285,16.04,14.969,16.436,701.0245,0.0
75%,149.1525,8.0,8.0,10.0,11.0,8.0,18.469,12.41,17.805,18.178,12.391,19.708,12.102,17.025,20.207,12.709,16.374,17.082,16.018,17.628,784.09025,0.0
max,385.86,9.0,9.0,29.0,29.0,24.0,21.499,16.484,21.425,21.543,15.419,23.807,15.412,22.479,25.64,17.663,22.713,22.303,21.626,24.094,1312.794,1.0
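
The `count` row is below 26570 for `loading` and for `measurement_3` to `measurement_17`, so these columns contain missing values (for example, 26570 - 26320 = 250 missing `loading` entries). A minimal sketch for listing missing values per column is shown below, on a made-up frame (`toy` and its values are assumptions).

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the real column names (values are made up)
toy = pd.DataFrame({
    "loading": [80.1, np.nan, 82.43, 101.07],
    "measurement_0": [7, 14, 12, 13],
    "measurement_3": [18.04, 18.213, np.nan, np.nan],
})

missing = toy.isna().sum()             # number of missing values per column
missing_pct = toy.isna().mean() * 100  # share of missing values in percent

print(pd.DataFrame({"missing": missing, "percent": missing_pct}))
```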


### - Target Values

In [16]:
# Check the distribution
data["failure"].value_counts()

0    20921
1     5649
Name: failure, dtype: int64
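
The counts above can be turned into a failure rate with a line of arithmetic. The sketch below only reuses the two counts printed by `value_counts()`; the variable names are hypothetical.

```python
# Counts taken from the value_counts() output above
n_ok, n_fail = 20921, 5649

total = n_ok + n_fail
failure_rate = n_fail / total

print(total)                   # 26570, the number of training rows
print(round(failure_rate, 4))  # 0.2126, matching the mean of 'failure' in describe()
```

Roughly one product in five fails, so the classes are imbalanced, which is one reason a ranking metric such as ROC AUC is more informative here than plain accuracy.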

## 3. Preprocessing
