### Files Needed: 
TODO

### Goal:
Find the relative rank ordering of the 'intensity' of maintenance events.

### Rationale: 
Running any kind of analysis with categorical variables is difficult because each category needs its own column. (Creating one column per category is called 'one-hot encoding.') Values such as maintenance events or action tags are not continuous numbers, so they cannot be used directly in a regression (or most other kinds of data analysis); each one has to be turned into a new column first.
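
As a minimal sketch of what one-hot encoding looks like in pandas, the toy table below (the codes and manhours are made up for illustration, not taken from the real data) is expanded so that each distinct malfunction code gets its own 0/1 column:

```python
import pandas as pd

# Hypothetical toy data: these codes and values are invented for illustration.
events = pd.DataFrame({
    "Malfunction Code": ["070", "127", "070", "374"],
    "Manhours": [1.5, 0.3, 2.0, 4.1],
})

# One-hot encode: every distinct code becomes its own 0/1 column.
encoded = pd.get_dummies(events, columns=["Malfunction Code"], dtype=int)
print(encoded)
```

Three distinct codes produce three new columns, which is why a variable with hundreds of codes quickly becomes unwieldy.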

The data we deal with has a lot of categorical variables (e.g. TMR mission tags, malfunction codes, action taken codes), so we cut them down by ranking the intensity of these variables and including only the most intense ones in our analysis.

### Definitions:
intensity: how many days after a maintenance event was completed (per the 'Comp Date' column) the plane degraded.

degrade/degradation event: the day on which a plane went from FMC -> PMC, PMC -> NMC, or FMC -> NMC (e.g. if a plane was FMC on 8/18 and is reported as PMC on 8/19, the degradation event is said to be on 8/19). The code spells this 'degredation' in its variable and column names.
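
The definition above can be sketched on a toy status log. The numeric encoding used here (FMC=2, PMC=1, NMC=0) and the Buno values are assumptions for illustration; the only requirement is that a worse status has a smaller number, so a negative day-to-day difference marks a degradation.

```python
import pandas as pd

# Hypothetical daily status log. The encoding FMC=2, PMC=1, NMC=0 is an assumption.
status = pd.DataFrame({
    "Buno": ["1001"] * 3 + ["1002"] * 3,
    "Date": pd.to_datetime(["2019-08-18", "2019-08-19", "2019-08-20"] * 2),
    "MC": [2, 1, 1, 1, 2, 0],
})

status = status.sort_values(["Buno", "Date"])

# Difference within each aircraft, so the first day of one Buno is never
# compared with the last day of the previous Buno.
status["MC_diff"] = status.groupby("Buno")["MC"].diff()

# A negative difference means the status dropped on that day.
degradations = status[status["MC_diff"] < 0]
print(degradations[["Buno", "Date", "MC_diff"]])
```

Grouping by Buno before differencing is a design choice in this sketch; the notebook instead differences the whole frame and tracks Buno changes in a separate column.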

maintenance event: a single row of the decplate file. The two variables we are ranking are:
- malfunction code

# Add Dependencies

In [None]:
from sklearn import linear_model
from datetime import datetime
import numpy as np
import pandas as pd  # used throughout as pd (read_csv, concat, to_datetime, get_dummies)
import operator

# Load in Files

In [None]:
mc_status = pd.read_csv('/Users/jordancoursey/Desktop/Navy/Models/DailyBunoMC.csv')
#maintDf = 
# This cell should define maintDf and mc_status, the two files that get merged below; keep these names because the later cells refer to them

# Clean Data

In [None]:
# High level: merge the status of a plane on a given day with all of the maintenance that was logged for it on that day.
# The goal is to associate the maintenance done on a plane with the plane's MC status.

maintDf = maintDf.rename(columns={'Bu/SerNo': 'Buno', 'Rcvd Date': 'Date' })
#converting to string types so they can be merged
maintDf['Date'] = maintDf['Date'].astype(str)
mc_status['Buno'] = mc_status['Buno'].astype(str)
#TODO
mc_merged = mc_status.merge(maintDf,how='left',on=['Buno','Date'])
features = [ 'Date', 'Buno', 'MC_x','Maint Level', 
        'Type Maint Code', 'Trans Code', 'Malfunction Code',
       'Action Taken', 'Position Code', 'Manhours',
        'In Work Date', 'Comp Date',]
mc_merged = mc_merged[features]
# Convert 'Date' from year-month-day to month/day/year so it can be merged against 'Comp Date' below
mc_merged['Date'] = mc_merged['Date'].apply(lambda x : datetime.strptime(x, '%Y-%m-%d') )
mc_merged['Date'] = mc_merged['Date'].apply(lambda x :'{0}/{1}/{2}'.format(x.month, x.day, x.year ) )
#TODO
completed_actions_df = mc_merged[['Comp Date', 'Action Taken','Buno']]
completed_actions_df = completed_actions_df.dropna()
completed_actions_df = completed_actions_df.rename(columns={'Comp Date': 'Date', 'Action Taken':'Action Completed'})
mc_merged = mc_merged.merge(completed_actions_df, how='left',on=['Date','Buno'])


In [None]:
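# Copy the slice so the cast below does not modify (or warn about) a view of mc_merged
mc_down = mc_merged[['MC_x','Buno']].copy()
mc_down['Buno'] = mc_down['Buno'].astype(int)
# Row-to-row differences: a negative MC_x difference is a drop in status, a nonzero Buno difference marks the start of a new aircraft
mc_down = mc_down.diff()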
# todo
mc_down = mc_down.rename(columns = {'MC_x':'MC_diff', 'Buno':'Buno_diff'})
mc_df = pd.concat( [mc_merged, mc_down],axis=1)
mc_df = mc_df.rename(columns={'Action Completed': 'Action_Completed'})

# Keep rows with a completed action, a drop in MC status, or the start of a new Buno
mc_df = mc_df[(~mc_df.Action_Completed.isnull()) | (mc_df.MC_diff < 0.0) | (mc_df.Buno_diff != 0.0)]
# Rows with a negative MC_diff mark degradation events; their dates are used below to compute the days until degradation

mc_df['Date']= pd.to_datetime(mc_df['Date'],format='%m/%d/%Y') 
mc_df['Comp Date'] = pd.to_datetime(mc_df['Comp Date'],format='%m/%d/%Y')
mc_df = mc_df.sort_values(by=['Buno', 'Comp Date','MC_diff'])

# Feature Creation (days until degradation)

In [None]:
# Description: collect the row indices where a degradation (breakage) event occurred.
# Purpose: these indices are used below to look up the date and Buno of each degradation.
mc_df = mc_df.dropna(subset=['Comp Date'])
mc_df = mc_df.reset_index(drop=True)
degredation_indices = []
for index, row in mc_df.iterrows():
    if row['MC_diff'] < 0:
        degredation_indices.append(index)

In [None]:
# Description: build a dictionary mapping each Buno to the index of its first row in the dataframe.
# Purpose: used together with the degradation indices to jump to the first row of a Buno, so that
# while iterating through the dataframe you don't count days until breakage from a different Buno.
# Note: the first row per Buno is read from the 'Buno' column itself. Buno_diff was computed before
# the sort, and zipping indices with an unordered set() can pair a Buno with the wrong index.
first_rows = mc_df.drop_duplicates(subset='Buno')
buno_indices = list(first_rows.index)
bunos = list(first_rows['Buno'])
buno_dict = dict(zip(bunos, buno_indices))

Consider stopping the cell below if it has been running for about 30 minutes; it is not strictly necessary to run through the entire maintenance history if that takes too long.

In [None]:
# Description: adds the date on which a plane degrades after a particular maintenance event.
# Purpose: used later to calculate the total days before degradation.
# Notes: this code takes a while to run. Because these tags are somewhat optional, it's OK to stop it in the middle.
#as a general rule of thumb the longer 
mc_df['Days_until_degredation'] = 0
mc_df['Degredation_date'] = 0

deg_i = 0
deg_buno = mc_df.iloc[degredation_indices[deg_i]]['Buno']
deg_date = mc_df.iloc[degredation_indices[deg_i]]['Comp Date']

for index, row in mc_df.iterrows():
    if deg_i == len(degredation_indices)-1:
        break
    
    if mc_df.iloc[index]['Comp Date'] > deg_date or index > degredation_indices[deg_i]:
        index = degredation_indices[deg_i]+1
        deg_i +=1
        deg_date = mc_df.iloc[degredation_indices[deg_i]]['Comp Date']
        deg_buno = mc_df.iloc[degredation_indices[deg_i]]['Buno']
        
        
        if mc_df.iloc[index]['Buno'] != deg_buno:
            
            index = buno_dict[deg_buno]    
        continue
    
    mc_df.at[index, 'Degredation_date'] = deg_date
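
If the loop above is too slow, one possible alternative is `pandas.merge_asof`, which can match each maintenance row to the next degradation of the same aircraft in a single pass. This is only a sketch on toy data, not the notebook's method: the frames `maint` and `degradations` and their values are hypothetical, and it assumes degradation dates have already been extracted into their own table. Note the sign: here the completion date is subtracted from the degradation date, so the result is non-negative, which is the opposite of the subtraction used later in this notebook.

```python
import pandas as pd

# Hypothetical toy frames; names and values are invented for illustration.
maint = pd.DataFrame({
    "Buno": ["1001", "1001", "1002"],
    "Comp Date": pd.to_datetime(["2019-08-10", "2019-08-21", "2019-08-12"]),
    "Malfunction Code": ["070", "127", "374"],
})
degradations = pd.DataFrame({
    "Buno": ["1001", "1002", "1001"],
    "Degredation_date": pd.to_datetime(["2019-08-19", "2019-08-15", "2019-09-02"]),
})

# merge_asof requires both frames to be sorted by their date key.
maint = maint.sort_values("Comp Date")
degradations = degradations.sort_values("Degredation_date")

# For each maintenance row, take the first degradation of the same Buno
# that falls on or after the completion date.
matched = pd.merge_asof(
    maint,
    degradations,
    left_on="Comp Date",
    right_on="Degredation_date",
    by="Buno",
    direction="forward",
)

matched["Days_until_degredation"] = (
    matched["Degredation_date"] - matched["Comp Date"]
).dt.days
print(matched)
```

Rows with no later degradation for their Buno would get a missing `Degredation_date`, so they can be dropped with `dropna` much as the unfinished rows are dropped in the next cell.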

In [2]:
# If you stopped the cell above partway through, this drops all the rows that were never filled in
mc_df_temporary = mc_df[mc_df.Degredation_date != 0].copy()


In [None]:
#optional checkpoint
#optional to save file in the event that it crashes or you want to reuse the code. 
#mc_df_temporary.to_csv(path_or_buf='/Users/jordancoursey/Desktop/Navy/Data/Decplate/maintenance_tags.csv')
#mc_df_temporary = pd.read_csv('/Users/jordancoursey/Desktop/Navy/Data/Decplate/maintenance_tags.csv')

In [None]:
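mc_df_temporary['Comp Date'] = pd.to_datetime(mc_df_temporary['Comp Date'])
# Keep only the 'YYYY-MM-DD' part; astype(str) makes this work for Timestamps as well as strings reloaded from the checkpoint CSV
mc_df_temporary['Degredation_date'] = mc_df_temporary['Degredation_date'].astype(str).str[:10]
mc_df_temporary['Degredation_date'] = pd.to_datetime(mc_df_temporary['Degredation_date'])
# Comp Date minus degradation date, in whole days: zero or negative when the event was completed before the degradation, and closer to zero means the plane degraded sooner
mc_df_temporary['Days_until_degredation'] = (mc_df_temporary['Comp Date'] - mc_df_temporary['Degredation_date']).dt.days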

# Regression

In [None]:
regression_features= ['Malfunction Code', 'Days_until_degredation']
mc_regression = mc_df_temporary[regression_features].copy(deep=True)
# mc_regression.dropna(inplace=True)

mc_regression = pd.get_dummies(mc_regression, columns=['Malfunction Code'])
X = mc_regression.iloc[:, mc_regression.columns != 'Days_until_degredation'].copy(deep=True)
y = mc_regression.iloc[:, mc_regression.columns == 'Days_until_degredation'].copy(deep=True)

In [None]:
# Fit a linear regression on the one-hot columns and map each column name to its coefficient
code_model = linear_model.LinearRegression()
code_model.fit(X, y)
coefficients = code_model.coef_
coefficient_names = list(X.columns)
coefficient_dict = dict(zip( coefficient_names, coefficients[0]))

In [None]:
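# Optional: list the tags with the largest coefficients (the slice below keeps the last 11); change the slice to show a different number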
sorted_d = sorted(coefficient_dict.items(), key=operator.itemgetter(1))
sorted_d[-11:]
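
As a sanity check on the ranking step, here is a small sketch on made-up data (the codes `A`, `B`, `C` and the day counts are hypothetical). When a regression is fit on the one-hot columns of a single categorical variable, the ordering of the coefficients matches the ordering of the per-code mean of the target, so a plain `groupby` mean should produce the same ranking. The day counts follow this notebook's sign convention (Comp Date minus degradation date), so they are negative and the entries at the end of the sorted list are the codes followed most quickly by a degradation.

```python
from operator import itemgetter

import pandas as pd
from sklearn import linear_model

# Hypothetical toy data: codes and day counts are invented for illustration.
toy = pd.DataFrame({
    "Malfunction Code": ["A", "A", "B", "B", "C", "C"],
    "Days_until_degredation": [-2, -4, -10, -20, -1, -1],
})

X = pd.get_dummies(toy[["Malfunction Code"]], columns=["Malfunction Code"], dtype=float)
y = toy["Days_until_degredation"]

model = linear_model.LinearRegression()
model.fit(X, y)

# Sort column names by coefficient, smallest first (as in the cell above).
ranked = sorted(zip(X.columns, model.coef_), key=itemgetter(1))
ranked_names = [name for name, _ in ranked]

# The same ranking from per-code means, without any regression.
mean_order = toy.groupby("Malfunction Code")["Days_until_degredation"].mean().sort_values()
mean_names = ["Malfunction Code_" + code for code in mean_order.index]

print(ranked_names)
print(mean_names)
```

This equivalence holds only for a single one-hot-encoded variable; once several variables are included together, the regression coefficients and the simple means can rank codes differently.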