# Tagup data science challenge

ExampleCo, Inc. has a problem: maintenance on their widgets is expensive. They have contracted with Tagup to help them implement predictive maintenance. They want us to provide a model that will allow them to prioritize maintenance for those units most likely to fail, and in particular to gain some warning (even just a few hours!) before a unit does fail.

They collect two kinds of data for each unit. First, they have a remote monitoring system for the motors in each unit, which records motor measurements (rotation speed, voltage, current) as well as readings from two temperature probes (one on the motor and one at the inlet). Unfortunately, this system is antiquated and prone to communication errors, which manifest as nonsense measurements. Second, they have a rule-based alarming system, which can emit either warnings or errors; the system is known to be noisy, but it's the best they've got.

They have given us just over 100MB of historical remote monitoring data from twenty of their units that failed in the field. The shortest-lived units failed after a few days; the longest-lived units failed after several years. Typical lifetimes are on the order of a year. This data is available in .csv files under `data/train` in this repository. In addition, they have provided us with operating data from their thirty active units for the past month; this data is available under `data/test` in this repository.

You have two main objectives. First, **tell us as much as you can about the process that generated the data**. Does it show meaningful clustering? Do the observations appear independent? How accurately can we forecast future observations, and how long a window do we need to make an accurate forecast? Feel free to propose multiple models, but be sure to discuss the ways each is useful and the ways each is not useful. Second, **predict which of the thirty active units are most likely to fail**. The data from these units are in `data/test`. Be sure to quantify these predictions, and especially your certainty.
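
One way to start on the independence question is to look at the sample autocorrelation of a single signal. The sketch below is illustrative only: it uses synthetic series rather than the challenge data, and `lag_autocorr` is a hypothetical helper built on pandas' `Series.autocorr`. Values near zero at every lag are consistent with independent observations; values that stay large suggest that past observations carry information about future ones.

```python
import numpy as np
import pandas as pd

def lag_autocorr(series: pd.Series, max_lag: int = 5) -> pd.Series:
    """Sample autocorrelation of `series` at lags 1..max_lag (hypothetical helper)."""
    return pd.Series(
        {lag: series.autocorr(lag=lag) for lag in range(1, max_lag + 1)},
        name="autocorr",
    )

rng = np.random.default_rng(0)

# Synthetic stand-in for an independent signal: white noise.
noise = pd.Series(rng.normal(size=2000))

# Synthetic stand-in for a dependent signal: AR(1), x[t] = 0.9 * x[t-1] + e[t].
shocks = rng.normal(size=2000)
values = np.zeros(2000)
for t in range(1, 2000):
    values[t] = 0.9 * values[t - 1] + shocks[t]
ar1 = pd.Series(values)

noise_acf = lag_autocorr(noise)
ar1_acf = lag_autocorr(ar1)
```

Applied to a real column such as `rpm` for one unit, the same helper gives a quick first read before fitting a forecasting model like ARMA.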

A few notes to help:
1. A good place to start is by addressing the noise due to comm errors. 
2. There is a signal in the data that you can identify and exploit to predict failure.
3. If you can't find the signal in the noise, don't despair! We're much more interested in what you try and how you try it than in how successful you are at helping a fictional company with their fictional problems.
4. Feel free to use any libraries you like, or even other programming languages. Your final results should be presented in this notebook, however.
5. There are no constraints on the models or algorithms you can bring to bear. Some ideas include: unsupervised clustering algorithms such as k-means; hidden Markov models; forecasting models like ARMA; neural networks; survival models built using features extracted from the data; etc.
6. Don't feel compelled to use all the data if you're not sure how. Feel free to focus on data from a single unit if that makes it easier to get started.
7. Be sure to clearly articulate what you did, why you did it, and how the results should be interpreted. In particular you should be aware of the limitations of whatever approach or approaches you take.
8. Don't hesitate to reach out with any questions.
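
For the survival-model idea in note 5, a time-to-failure label is a common starting point. The sketch below rests on two assumptions that the prompt does not confirm: that the last record of each training unit marks its failure, and that timestamps have been parsed into datetimes. `add_time_to_failure` and the `hours_to_failure` column are hypothetical names, and the demo rows are made up.

```python
import pandas as pd

def add_time_to_failure(df, time_col="timestamp", unit_col="unit_id"):
    """Add hours remaining until each unit's last record (assumed failure time)."""
    out = df.copy()
    # Assumption: the final observation of a failed unit is its failure time.
    last_seen = out.groupby(unit_col)[time_col].transform("max")
    out["hours_to_failure"] = (last_seen - out[time_col]) / pd.Timedelta(hours=1)
    return out

# Made-up rows for two units, for illustration only.
demo = pd.DataFrame({
    "unit_id": [0, 0, 0, 1, 1],
    "timestamp": pd.to_datetime([
        "2003-12-14 03:00", "2003-12-14 06:00", "2003-12-14 09:00",
        "2004-01-01 00:00", "2004-01-01 03:00",
    ]),
})
labeled = add_time_to_failure(demo)
```

This label only makes sense for the failed units in `data/train`; the active units in `data/test` have no known failure time.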

In [4]:
import datetime as dt
import decimal
import pandas as pd
import os
import numpy as np

alarm_frames = []
for root, dirs, files in os.walk(os.path.join(os.getcwd(),'data','train')):
    for file in files:
        if file.endswith("alarms.csv") and os.stat(os.path.join(root, file)).st_size > 0:
            alarms_temp = pd.read_csv(os.path.join(root, file),header=None)
            alarms_temp['unit_id'] = file[4:8]
            alarm_frames.append(alarms_temp)
# DataFrame.append was removed in pandas 2.0; concatenate the per-file frames once instead
alarms_data = pd.concat(alarm_frames, ignore_index=True)
alarms_data.columns = ['timestamp', 'status','unit_id']
alarms_data['unit_id'] = alarms_data['unit_id'].astype('int64')
# Collapse each timestamp to an hourly key of the form YYYYMMDD.HH, e.g. 20031215.19
alarms_data['timestamp'] = alarms_data['timestamp'].str.split('.').str.get(0)
alarms_data['timestamp'] = alarms_data['timestamp'].str.split(':').str.get(0)
alarms_data['timestamp']=alarms_data['timestamp'].str.replace('-','')
alarms_data['timestamp']=alarms_data['timestamp'].str.replace(' ','.')
# np.dtype(decimal.Decimal) resolves to the generic object dtype, so the keys remain strings
alarms_data['timestamp'] = alarms_data['timestamp'].astype(np.dtype(decimal.Decimal))

In [5]:
alarms_data.head()

Unnamed: 0,timestamp,status,unit_id
0,20031215.19,warning,0
1,20031215.23,warning,0
2,20031217.16,warning,0
3,20031218.02,warning,0
4,20040104.17,warning,0
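
The hourly key shown above is built through several string edits. As a possible alternative, the same `YYYYMMDD.HH` key can be produced by parsing the timestamps and formatting them once. This is a sketch, not the notebook's method: `hour_key` is a hypothetical helper, and the raw timestamp strings below are made up (only their dates and hours mirror the output above).

```python
import pandas as pd

def hour_key(timestamps: pd.Series) -> pd.Series:
    """Format timestamps as 'YYYYMMDD.HH' string keys (hypothetical helper)."""
    parsed = pd.to_datetime(timestamps)
    return parsed.dt.strftime("%Y%m%d.%H")

# Made-up raw values; assumes every row shares one timestamp format.
raw = pd.Series(["2003-12-15 19:44:12.123456", "2004-01-04 17:05:00.000001"])
keys = hour_key(raw)
```

Keeping the parsed datetimes around (instead of only the string key) also makes later steps such as resampling or flooring to a coarser bin simpler.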


In [6]:
# Remove outliers with a percentile cut-off. This is one of several methods that can be used to remove outliers.

from pandas.api.types import is_numeric_dtype
def clean_rms(df):
    low = .05
    high = .95
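    # numeric_only skips the string timestamp column when computing the quantiles
    quant_df = df.quantile([low, high], numeric_only=True)
    for name in list(df.columns):
        if is_numeric_dtype(df[name]):
            df = df[(df[name] > quant_df.loc[low, name]) & (df[name] < quant_df.loc[high, name])]
    df = df.copy()  # copy the filtered rows so the assignment below does not modify a view
    df['timestamp'] = df['timestamp'].str.split('.').str.get(0)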
    return df
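
Because `clean_rms` filters one column after another, every numeric column discards roughly the outer 10% of the rows still remaining, whether or not those rows are comm errors. A possible alternative, sketched here under the assumption that comm errors show up as extreme spikes, is to drop only values that lie far from the column median, measured in units of the median absolute deviation (MAD). `mad_filter` and its `threshold` are hypothetical, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def mad_filter(df: pd.DataFrame, columns, threshold: float = 5.0) -> pd.DataFrame:
    """Keep rows whose robust z-score is within `threshold` for every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        median = df[col].median()
        mad = (df[col] - median).abs().median()
        if mad == 0:
            continue  # constant column: no scale to compare against
        # 0.6745 rescales the MAD so the score is comparable to a normal z-score.
        robust_z = 0.6745 * (df[col] - median) / mad
        # Rows with missing values compare as False and are dropped as well.
        keep &= robust_z.abs() <= threshold
    return df[keep].copy()

# Synthetic signal with two simulated comm-error spikes.
rng = np.random.default_rng(1)
demo = pd.DataFrame({"rpm": rng.normal(1000, 20, size=500)})
demo.loc[[10, 200], "rpm"] = [1e6, -1e6]
cleaned = mad_filter(demo, ["rpm"])
```

Unlike a fixed percentile cut, this removes few or no rows from a clean signal, which matters if the failure signal itself lives in the tails.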

In [7]:
import datetime
import time
rms_frames = []
for root, dirs, files in os.walk(os.path.join(os.getcwd(),'data','train')):
    for file in files:
        if file.endswith("_rms.csv"):
            rms_temp = pd.read_csv(os.path.join(root, file))
            rms_temp = clean_rms(rms_temp)
            rms_temp.index = pd.to_datetime(rms_temp['timestamp'])
            # drop() returns a new frame, so assign it back; the index keeps the timestamps
            rms_temp = rms_temp.drop(columns=['timestamp'])
            rms_temp = rms_temp.resample('3H').mean()
            rms_temp['unit_id'] = file[4:8]
            rms_frames.append(rms_temp)
# DataFrame.append was removed in pandas 2.0; concatenate the per-file frames once instead
rms_data = pd.concat(rms_frames)

In [8]:
import numpy as np

rms_data.reset_index(inplace=True)
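# Use ~ to negate a boolean mask (unary minus on booleans is not supported); drops empty 3-hour bins
rms_data = rms_data.loc[~rms_data['rpm'].isnull()].copy()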
rms_data['unit_id'] = rms_data['unit_id'].astype('int64')
rms_data['timestamp'] = rms_data['timestamp'].astype('str')
rms_data['timestamp']= rms_data['timestamp'].str.split(':').str.get(0)
rms_data['timestamp']=rms_data['timestamp'].str.replace('-','')
rms_data['timestamp']=rms_data['timestamp'].str.replace(' ','.')
rms_data['timestamp'] = rms_data['timestamp'].astype(np.dtype(decimal.Decimal))

rms_data.head()

Unnamed: 0,timestamp,rpm,motor_voltage,motor_current,motor_temp,inlet_temp,unit_id
0,20031214.03,1096.193585,220.192553,30.117273,90.369066,51.702992,0
1,20031214.06,1082.36226,219.513428,29.997431,97.721842,50.609085,0
2,20031214.09,1050.289937,225.737659,30.080688,98.659876,53.011027,0
3,20031214.12,1036.566833,223.248597,30.419686,104.263228,58.730093,0
4,20031214.15,1055.993925,219.674913,30.789283,109.021939,63.503735,0


In [351]:
final_feature = rms_data.merge(alarms_data, on=['timestamp', 'unit_id'], how='left')
#labeled_features = labeled_features.fillna(method='bfill', limit=7) # fill backward up to 24h
final_feature['status'] = final_feature['status'].fillna('none')  # label only the alarm column; leave sensor columns untouched
final_feature.head()

Unnamed: 0,timestamp,rpm,motor_voltage,motor_current,motor_temp,inlet_temp,unit_id,status
0,20031214.03,1096.193585,220.192553,30.117273,90.369066,51.702992,0,warning
1,20031214.06,1082.36226,219.513428,29.997431,97.721842,50.609085,0,warning
2,20031214.09,1050.289937,225.737659,30.080688,98.659876,53.011027,0,warning
3,20031214.12,1036.566833,223.248597,30.419686,104.263228,58.730093,0,warning
4,20031214.15,1055.993925,219.674913,30.789283,109.021939,63.503735,0,warning
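
One caveat with the merge above: the sensor keys fall on 3-hour boundaries (03, 06, 09, ...), while the alarm keys are hourly (19, 23, 16, ... in the earlier output), so an exact-match join would miss any alarm that does not land on a boundary hour. A possible remedy, sketched here on made-up rows, is to floor alarm times to the same 3-hour bins and keep the most severe status per bin before merging. This assumes timestamps are parsed as datetimes and that the status strings are `warning` and `error`; `bin_alarms` and `SEVERITY` are hypothetical names.

```python
import pandas as pd

# Assumed status labels; anything else is treated as lowest severity.
SEVERITY = {"warning": 1, "error": 2}

def bin_alarms(alarms: pd.DataFrame, freq: str = "3h") -> pd.DataFrame:
    """Floor alarm times to `freq` bins and keep the worst status per unit and bin."""
    out = alarms.reset_index(drop=True).copy()
    out["timestamp"] = out["timestamp"].dt.floor(freq)
    out["severity"] = out["status"].map(SEVERITY).fillna(0)
    worst_rows = out.groupby(["unit_id", "timestamp"])["severity"].idxmax()
    return out.loc[worst_rows, ["unit_id", "timestamp", "status"]].reset_index(drop=True)

# Made-up alarms and 3-hourly sensor rows for one unit.
alarms = pd.DataFrame({
    "unit_id": [0, 0, 0],
    "timestamp": pd.to_datetime(
        ["2003-12-15 19:10", "2003-12-15 20:30", "2003-12-15 23:00"]
    ),
    "status": ["warning", "error", "warning"],
})
rms = pd.DataFrame({
    "unit_id": [0, 0, 0],
    "timestamp": pd.to_datetime(
        ["2003-12-15 15:00", "2003-12-15 18:00", "2003-12-15 21:00"]
    ),
    "rpm": [1050.0, 1040.0, 1030.0],
})

merged = rms.merge(bin_alarms(alarms), on=["unit_id", "timestamp"], how="left")
merged["status"] = merged["status"].fillna("none")
```

Aggregating to one row per bin also prevents the left merge from duplicating sensor rows when several alarms fall into the same window.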


In [352]:
final_feature['status'].unique()

