In [1]:
import pandas as pd 
import numpy as np # linear algebra
import seaborn as sns # data visualization library  
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport as PR
%matplotlib inline



# Work package 1

In this work package, the purpose is to preprocess the data and select the relevant features for developing a classification model. There are three tasks in this work package:
* Task 1: Cleaning. In this task, the aim is to clean the data by removing or imputing the missing values and outliers.
* Task 2: Preprocessing. The aim of this task is to encode the categorical features and combine them with the numerical features. The numerical features also need to be scaled.
* Task 3: Feature selection. The aim of this task is to decide which features are relevant and need to be used in the classification model to be developed.

In this section, you need to present the following (with code and text to explain):
1. Data preprocessing: How you clean and preprocess the data.
2. Feature engineering: Which features you selected and how you selected them.

## Reading the data.

The original data are stored in an Excel file. In the supporting script utility.py, we provide you with a supporting function read_data, which reads the original data and transforms it into a DataFrame for easy use.

In [2]:
from utility import read_data

file_name = 'Excel - Jeu de données 30min.xlsx'
df_data_org = read_data(file_name)

## Creating a function for the preprocessing


In [3]:
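def handle_outlier(x):
    # Replace values outside mean +/- 3 standard deviations with the mean.
    x = x.copy()  # work on a copy so the caller's data is not modified
    mean, std = x.mean(), x.std()
    ul = mean + 3*std
    ll = mean - 3*std
    x[(x > ul) | (x < ll)] = mean

    return x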


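def preprocess(data, scaler):
    # `scaler` must already be fitted on data with the same columns.
    if data.isnull().values.any():
        print('null values detected')
    X = data.copy(deep=True)
    # Handle outliers separately within each speed group.
    # The mask is built from `data` itself, not from the global df_data_org.
    for speed in [100, 500, 1000]:
        mask = data['Speed'] == speed
        for col in data.columns:
            X.loc[mask, col] = handle_outlier(X.loc[mask, col])
    # One-hot encode the speed; drop_first drops the lowest speed level present.
    Speed = pd.get_dummies(data['Speed'], drop_first=True)
    # Scale the features, keeping the original index so that concat aligns rows.
    X = pd.DataFrame(data=scaler.transform(X), columns=X.columns, index=X.index)
    X = pd.concat([X, Speed], axis=1)
    X = X.rename(columns={500:'500', 1000:'1000'})
    # Keep only the selected features (by column position).
    id_to_keep = [0,3,4,5,7,9,10,13,14]
    X = X.iloc[:, id_to_keep]
    return X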
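
The per-speed outlier handling used in the function above can be illustrated on toy data. The following is a minimal sketch, not the notebook's actual pipeline: the column name `vibration`, the helper `replace_outliers_with_mean`, and the synthetic values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def replace_outliers_with_mean(x: pd.Series, n_sigma: float = 3.0) -> pd.Series:
    """Return a copy of x with values outside mean +/- n_sigma*std set to the mean."""
    mean, std = x.mean(), x.std()
    is_outlier = (x > mean + n_sigma * std) | (x < mean - n_sigma * std)
    return x.mask(is_outlier, mean)


# Hypothetical toy data: 20 readings at each of two speeds.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Speed": [100] * 20 + [500] * 20,
    "vibration": np.concatenate([
        rng.normal(1.0, 0.1, 20),   # readings at speed 100
        rng.normal(5.0, 0.1, 20),   # readings at speed 500
    ]),
})
toy.loc[3, "vibration"] = 50.0  # inject one obvious outlier at speed 100

# Apply the 3-sigma rule within each speed group, leaving `toy` untouched.
cleaned = toy.copy()
cleaned["vibration"] = (
    toy.groupby("Speed")["vibration"].transform(replace_outliers_with_mean)
)

print(toy.loc[3, "vibration"], "->", cleaned.loc[3, "vibration"])
```

Note that the mean used as the replacement value is computed with the outlier still included, as in handle_outlier. Grouping by speed keeps a reading that is normal at one speed from being flagged merely because it is far from the readings at another speed.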

# Work package 2

In WP1, you have already cleaned the data and selected the most relevant features. In this work package, the purpose is to develop a machine learning model that detects the failed bearing with the best performance.
* Task 1: Design an experiment scheme which evaluates the performance of the developed model based on cross validation.
* Task 2: Develop a model to achieve the best failure detection performance on an individual measurement.
* Task 3: Develop a model to achieve the best failure detection performance on time-series data under varying working conditions.
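
As an illustration of the experiment scheme in Task 1, here is a minimal sketch of stratified k-fold cross validation with scikit-learn. It runs on synthetic data from `make_classification`, not on the bearing dataset, and the choice of logistic regression, 5 folds, and the F1 score are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features and failure labels.
X, y = make_classification(
    n_samples=200, n_features=9, n_informative=4, random_state=0
)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training part of every fold, so no information leaks from the held-out part.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print("F1 per fold:", np.round(scores, 3))
print(f"Mean F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Other candidate models can be compared by swapping the last step of the pipeline while keeping `cv` fixed, so that every model is evaluated on the same splits.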

In this section, you need to present the following (with code and text to explain):
1. Model selection: Report all the models you have tried on the training dataset and evaluate their performance using cross validation.
2. Data challenge: Apply the best model to the testing dataset:
    * Test bearing 1&2:
        * Explain how you apply your model to deal with the time-series problem.
        * Apply the model to predict the labels at each time instant.
    * Test bearing 3&4:
        * Explain how you apply your model to solve the problem of the missing speed measurement.
        * Apply the model to predict the labels at each time instant.
    * Save the results of the predictions in the given Excel file.
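
One possible way to exploit the time ordering when predicting labels at each time instant is to smooth the per-measurement predictions with a rolling majority vote. The sketch below is only one option under stated assumptions: binary 0/1 labels, a hypothetical helper `smooth_predictions`, and an arbitrary window of 5 time instants. It is not a prescribed solution to the data challenge.

```python
import pandas as pd


def smooth_predictions(labels, window: int = 5) -> pd.Series:
    """Rolling majority vote over binary (0/1) labels, centred on each instant."""
    s = pd.Series(labels, dtype=float)
    # The rolling mean of 0/1 labels is the fraction of 1s in the window.
    # min_periods=1 lets the shorter windows at both ends still produce a value.
    frac_failed = s.rolling(window=window, center=True, min_periods=1).mean()
    return (frac_failed > 0.5).astype(int)


# Hypothetical raw per-measurement predictions with isolated flips.
raw = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]
smoothed = smooth_predictions(raw, window=5)

print("raw:     ", raw)
print("smoothed:", smoothed.tolist())
```

A larger window suppresses more isolated misclassifications but delays the detection of a real transition from healthy to failed, so the window length is a trade-off to be tuned, for example with cross validation.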