![title](CBMpy.png)

**INSTITUTO NACIONAL DE PESQUISAS ESPACIAIS** 

Course: Introduction to Data Science

Instructors: Rafael Santos and Gilberto Queiroz

Student: Marcelly Homem Coelho

Contact: marcellyhc@gmail.com

**Title:** Applying Data Science Techniques to the Development of a Condition-Based Aircraft Maintenance System

**Description:** This notebook analyzes the fault messages and the component removals recorded for aircraft systems.

In [1]:
# Import the libraries

import numpy as np
import pandas as pd
import seaborn as sns

import random

import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)

import matplotlib.pyplot as plt
%matplotlib inline

# 1. Initial Investigation of the Structure and Content of the Failure File

In [2]:
# Create a dataframe from the failure data (.csv file)

df_dataFailure = pd.read_csv('CBMpy_dataFailureCode.csv')   

In [3]:
# Display the first rows of the dataframe

df_dataFailure.head()

Unnamed: 0,Aircraft,Flight Phase,Date,Fault Text,Maintenance Message
0,2640,,2006-05-14 16:19:00,FDE_Outhers02,
1,2640,Enroute Cruise,2006-07-01 15:17:00,FDE_B_System3,MMSG_A_System3
2,2640,Enroute Cruise,2006-07-01 15:17:00,FDE_C_System3,MMSG_A_System3
3,2640,Enroute Cruise,2006-07-02 04:48:00,FDE_B_System3,MMSG_A_System3
4,2640,Enroute Cruise,2006-07-02 04:48:00,FDE_C_System3,MMSG_A_System3


In [4]:
# Check the dimensions of the dataframe (number of rows, number of columns)

df_dataFailure.shape

(7238, 5)

In [5]:
# Check the data type of each column of the dataframe

df_dataFailure.dtypes

Aircraft                int64
Flight Phase           object
Date                   object
Fault Text             object
Maintenance Message    object
dtype: object

In [6]:
# Convert the 'Date' column to datetime (the format matches values such as '2006-05-14 16:19:00')

df_dataFailure['Date'] = pd.to_datetime(df_dataFailure['Date'], format='%Y-%m-%d %H:%M:%S')

In [7]:
# Check the data type of each column of the dataframe

df_dataFailure.dtypes

Aircraft                        int64
Flight Phase                   object
Date                   datetime64[ns]
Fault Text                     object
Maintenance Message            object
dtype: object
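
If some timestamps in the file were malformed, the conversion above would raise an error. One way to inspect such rows is to pass `errors='coerce'`, which turns unparseable values into `NaT` so they can be counted. The sketch below is illustrative only: it uses a small hypothetical sample (the `sample` dataframe and its values are assumptions), not the real CSV file.

```python
import pandas as pd

# Hypothetical sample mimicking the 'Date' column (not the real data)
sample = pd.DataFrame({'Date': ['2006-05-14 16:19:00',
                                '2006-07-01 15:17:00',
                                'not a date']})

# errors='coerce' converts unparseable entries to NaT instead of raising
parsed = pd.to_datetime(sample['Date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Number of rows that could not be parsed
n_invalid = int(parsed.isna().sum())
print(parsed.dtype, n_invalid)
```

A non-zero `n_invalid` would indicate rows that need cleaning before any time-based grouping.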

In [8]:
# Determine how many distinct 'Aircraft' there are in the dataframe

len(df_dataFailure['Aircraft'].unique())

15

In [9]:
# Check which 'Aircraft' recorded the most fault messages

df_dataFailure['Aircraft'].value_counts()

2766    1059
1950     728
1990     668
1151     626
2640     560
326      475
131      465
2436     421
1419     419
791      417
2838     369
2209     324
312      268
1710     243
2982     196
Name: Aircraft, dtype: int64

$\color{red}{\text{NOTE:}}$ Aircraft 2766 has the largest number of Fault Text records. It also ranks second in the number of component removals.

## 1.1 Analysis of the Fault Messages of a Specific Aircraft
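
The comparison behind this note can be made explicit by placing the per-aircraft fault and removal counts side by side and ranking each column. The sketch below is a hedged illustration on hypothetical toy data; the `faults`, `removals` and `summary` names and all values are assumptions, not the contents of the real files.

```python
import pandas as pd

# Hypothetical toy data (not the real CSV files)
faults = pd.DataFrame({'Aircraft': [2766, 2766, 2766, 2640, 2640, 1950]})
removals = pd.DataFrame({'Aircraft': [2640, 2640, 2640, 2766, 2766, 1950]})

# Count records per aircraft in each table and align them by aircraft id
summary = pd.concat(
    [faults['Aircraft'].value_counts().rename('n_faults'),
     removals['Aircraft'].value_counts().rename('n_removals')],
    axis=1,
).fillna(0).astype(int)

# Rank 1 = largest count; ties share the lowest rank
summary['rank_faults'] = summary['n_faults'].rank(ascending=False, method='min').astype(int)
summary['rank_removals'] = summary['n_removals'].rank(ascending=False, method='min').astype(int)
print(summary)
```

An aircraft that appears in only one of the two tables would receive a count of 0 in the other column.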

In [10]:
# Define a variable that selects a specific aircraft

var_aircraftSelected = 2766

In [11]:
# Create a dataframe for the selected aircraft

df_dataFailure_airSelec = df_dataFailure[df_dataFailure['Aircraft'] == var_aircraftSelected]

In [12]:
# Display the first rows of the dataframe

df_dataFailure_airSelec.head()

Unnamed: 0,Aircraft,Flight Phase,Date,Fault Text,Maintenance Message
1492,2766,Power On,2006-04-19 13:29:00,FDE_Outhers00,MMSG_Others04
1493,2766,Power On,2006-04-19 13:29:00,FDE_Outhers00,MMSG_Others04
1494,2766,Power On,2006-04-21 16:12:00,FDE_Outhers00,MMSG_Others04
1495,2766,Power On,2006-04-21 16:12:00,FDE_Outhers00,MMSG_Others04
1496,2766,Initial Climb,2006-04-22 19:59:00,FDE_E_System1,MMSG_F_System1


In [13]:
# Check the dimensions of the dataframe (number of rows, number of columns)

df_dataFailure_airSelec.shape

(1059, 5)

In [14]:
# Count the Fault Text records per year for the selected aircraft

df_dataFailure_airSelec.groupby(df_dataFailure_airSelec['Date'].dt.year)['Fault Text'].count()

Date
2006    140
2007     75
2008    118
2009     17
Name: Fault Text, dtype: int64
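
When reading grouped counts like these, it helps to remember that `count()` considers only non-null values, whereas `size()` counts every row in each group. The sketch below shows the difference on hypothetical toy data; the `toy` dataframe and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real CSV): one row has a missing Fault Text
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2006-04-19', '2006-10-30', '2007-01-05', '2007-03-02']),
    'Fault Text': ['FDE_A', np.nan, 'FDE_B', 'FDE_A'],
})

by_year = toy.groupby(toy['Date'].dt.year)

non_null_per_year = by_year['Fault Text'].count()  # ignores rows where Fault Text is NaN
rows_per_year = by_year.size()                     # counts every row in the group
print(non_null_per_year.to_dict(), rows_per_year.to_dict())
```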

# 2. Initial Investigation of the Structure and Content of the Removal File

In [15]:
# Create a dataframe from the removal data (.csv file)

df_dataRemoval = pd.read_csv('CBMpy_dataRemovalCode.csv')  

In [16]:
# Display the first rows of the dataframe

df_dataRemoval.head()

Unnamed: 0,Aircraft,Component,System,Date,Reason,Time Hours,Time Cycles
0,1140,REM_Component_A,System1,2006-05-29,3,118123,15961
1,1140,REM_Component_A,System1,2006-05-29,3,118123,15961
2,1140,REM_Component_B,System1,2006-05-29,3,1092,139
3,1140,REM_Component_B,System3,2006-06-24,3,312,37
4,1140,REM_Component_B,System3,2006-07-10,3,118698,16028


In [17]:
# Check the dimensions of the dataframe (number of rows, number of columns)

df_dataRemoval.shape

(1282, 7)

In [18]:
# Check the data type of each column of the dataframe

df_dataRemoval.dtypes

Aircraft         int64
Component       object
System          object
Date            object
Reason           int64
Time Hours       int64
Time Cycles      int64
dtype: object

In [19]:
# Convert the 'Date' column to datetime (the format matches values such as '2006-05-29')

df_dataRemoval['Date'] = pd.to_datetime(df_dataRemoval['Date'], format='%Y-%m-%d')

In [20]:
# Check the data type of each column of the dataframe

df_dataRemoval.dtypes

Aircraft                 int64
Component               object
System                  object
Date            datetime64[ns]
Reason                   int64
Time Hours               int64
Time Cycles              int64
dtype: object

In [21]:
# Determine how many distinct 'Component' values there are in the dataframe

len(df_dataRemoval['Component'].unique())

17

In [22]:
# Check which 'Component' values were replaced most often

df_dataRemoval['Component'].value_counts()

REM_Component_B    268
REM_Component_A    210
REM_Component_D    177
REM_Component_F    113
REM_Component_J     87
REM_Component_G     86
REM_Component_H     77
REM_Component_I     64
REM_Component_N     45
REM_Component_E     38
REM_Component_L     37
REM_Component_K     25
REM_Component_O     22
REM_Component_M     19
REM_Component_C      6
REM_Component_P      5
REM_Component_Q      3
Name: Component, dtype: int64

In [23]:
# Check which 'Aircraft' had the most component replacements

df_dataRemoval['Aircraft'].value_counts()

2640    99
2766    94
2361    92
2326    91
2567    86
1950    78
2982    74
1399    62
2838    60
2436    59
131     55
1151    54
312     53
1990    50
1419    49
736     46
1710    41
2209    38
791     37
326     30
1140    25
165      9
Name: Aircraft, dtype: int64

## 2.1 Analysis of the Removals of a Specific Aircraft

In [24]:
# Create a dataframe for the selected aircraft

df_dataRemoval_airSelec = df_dataRemoval[df_dataRemoval['Aircraft'] == var_aircraftSelected]

In [25]:
# Display the first rows of the dataframe

df_dataRemoval_airSelec.head()

Unnamed: 0,Aircraft,Component,System,Date,Reason,Time Hours,Time Cycles
896,2766,REM_Component_D,,2006-03-19,3,3782,399
897,2766,REM_Component_B,System1,2006-03-23,3,95539,10871
898,2766,REM_Component_B,System1,2006-03-23,3,530,54
899,2766,REM_Component_A,System2,2006-04-19,3,21438,2265
900,2766,REM_Component_B,System1,2006-04-23,3,430,44


In [26]:
# Check the dimensions of the dataframe (number of rows, number of columns)

df_dataRemoval_airSelec.shape

(94, 7)

In [27]:
# Count the removals per year for the selected aircraft

df_dataRemoval_airSelec.groupby(df_dataRemoval_airSelec['Date'].dt.year)['Component'].count()

Date
2006    33
2007    28
2008    22
2009    11
Name: Component, dtype: int64

# 3. Grouping the Data Set

## 3.1 Grouping the Failure Data

In [28]:
# Identify all 'Fault Text' (FDE) values that occur for the selected aircraft

array_FDE_airSelec = np.array(df_dataFailure_airSelec['Fault Text'].unique())

In [29]:
# Display the values of the array

array_FDE_airSelec

array(['FDE_Outhers00', 'FDE_E_System1', 'FDE_A_System1', 'FDE_D_System1',
       'FDE_C_System1', 'FDE_B_System1', 'FDE_I_System1', 'FDE_M_System1',
       'FDE_N_System1', 'FDE_A_System2', 'FDE_E_System2', 'FDE_F_System3',
       'FDE_A_System3', 'FDE_D_System3', 'FDE_H_System3', 'FDE_E_System3',
       'FDE_B_System3', 'FDE_C_System3', 'FDE_F_System2', 'FDE_Outhers12',
       'FDE_Outhers02', 'FDE_J_System3', 'FDE_G_System2', 'FDE_G_System3',
       'FDE_E_System4', 'FDE_A_System4', 'FDE_C_System2', 'FDE_B_System2',
       'FDE_G_System4', 'FDE_M_System3', 'FDE_C_System4', 'FDE_B_System4',
       nan, 'FDE_F_System4', 'FDE_Outhers01', 'FDE_M_System2'],
      dtype=object)

In [30]:
# Remove the NaN items from the array

array_FDE_airSelec = array_FDE_airSelec[~pd.isnull(array_FDE_airSelec)]  # 1D array with NaNs removed

In [32]:
# Merge the daily counts of every Fault Text into a single dataframe (grouped by date)

df_dataFailure_airSelec_result = pd.DataFrame(columns= ['year', 'month', 'day'])


# aux is the current Fault Text
for aux in array_FDE_airSelec:

    # Create a dataframe with the rows of the current Fault Text
    dfMsg = pd.DataFrame(df_dataFailure_airSelec[df_dataFailure_airSelec['Fault Text'] == aux])

    # Count the occurrences of the current Fault Text per day for the selected aircraft
    arrayY = dfMsg.groupby([dfMsg['Date'].dt.year.rename('year'),
                            dfMsg['Date'].dt.month.rename('month'),
                            dfMsg['Date'].dt.day.rename('day')]).count()['Fault Text']

    # Convert the groupby result to a dataframe so that it can be merged
    arrayY = arrayY.to_frame().reset_index()

    arrayY.columns = ['year', 'month', 'day', aux]

    # Use an outer merge: it adds the new column while keeping every (year, month, day) key
    df_dataFailure_airSelec_result = pd.merge(df_dataFailure_airSelec_result, arrayY, how='outer', on=['year','month','day'])

In [33]:
# Display the first rows of the dataframe (result of merging the daily counts of all FDEs).

df_dataFailure_airSelec_result.head()

Unnamed: 0,year,month,day,FDE_Outhers00,FDE_E_System1,FDE_A_System1,FDE_D_System1,FDE_C_System1,FDE_B_System1,FDE_I_System1,...,FDE_A_System4,FDE_C_System2,FDE_B_System2,FDE_G_System4,FDE_M_System3,FDE_C_System4,FDE_B_System4,FDE_F_System4,FDE_Outhers01,FDE_M_System2
0,2006,4,19,2.0,,,,,,,...,,,,,,,,,,
1,2006,4,21,2.0,,,,,,,...,,,,,,,,,,
2,2006,10,30,2.0,,,,,,,...,,,,,,,,,,
3,2006,4,22,,1.0,1.0,,,,,...,,,,,,,,,,
4,2006,4,23,,1.0,1.0,,,,,...,,,,,,,,,,
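
The loop above builds a wide table with one row per day and one column per Fault Text. As a hedged alternative, and not the approach used in this notebook, `pd.crosstab` can produce a similar table in one call. The sketch below uses hypothetical toy data; the `toy` dataframe and the `FDE_X`, `FDE_Y`, `FDE_Z` codes are assumptions.

```python
import pandas as pd

# Hypothetical toy data (not the real CSV)
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2006-04-19 13:29', '2006-04-19 13:29',
                            '2006-04-22 19:59', '2006-04-22 19:59']),
    'Fault Text': ['FDE_X', 'FDE_X', 'FDE_Y', 'FDE_Z'],
})

# One row per calendar day, one column per Fault Text, cells = daily counts.
# Days on which a code did not occur hold 0 rather than NaN.
daily_counts = pd.crosstab(toy['Date'].dt.normalize(), toy['Fault Text'])
print(daily_counts)
```

With this layout the result is already sorted by date and zero-filled, so the later `fillna` and `sort_values` steps would not be needed for it.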


### 3.1.1 Manipulating the Grouped Failure Data

In [34]:
# Replace NaN elements with zeros

df_dataFailure_airSelec_result = df_dataFailure_airSelec_result.fillna(0) 

In [35]:
# Sort the dataframe by: year -> month -> day

df_dataFailure_airSelec_result = df_dataFailure_airSelec_result.sort_values(['year', 'month', 'day'])

In [36]:
# Add a Date column to the dataframe (built from the year, month and day fields)

df_dataFailure_airSelec_result['Date'] = pd.to_datetime(df_dataFailure_airSelec_result.year*10000 + df_dataFailure_airSelec_result.month*100 + df_dataFailure_airSelec_result.day, format='%Y%m%d') 

In [37]:
# Display the first rows of the dataframe

df_dataFailure_airSelec_result.head()

Unnamed: 0,year,month,day,FDE_Outhers00,FDE_E_System1,FDE_A_System1,FDE_D_System1,FDE_C_System1,FDE_B_System1,FDE_I_System1,...,FDE_C_System2,FDE_B_System2,FDE_G_System4,FDE_M_System3,FDE_C_System4,FDE_B_System4,FDE_F_System4,FDE_Outhers01,FDE_M_System2,Date
0,2006,4,19,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-04-19
1,2006,4,21,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-04-21
3,2006,4,22,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-04-22
4,2006,4,23,0.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-04-23
16,2006,4,25,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-04-25
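
The `Date` column above is assembled arithmetically as `year*10000 + month*100 + day`. A possibly more readable alternative is to pass the three columns directly to `pd.to_datetime`, which accepts a dataframe containing `year`, `month` and `day` columns. The sketch below runs on a hypothetical toy dataframe (`toy` and its values are assumptions).

```python
import pandas as pd

# Hypothetical toy data (not the real dataframe)
toy = pd.DataFrame({'year': [2006, 2006], 'month': [4, 10], 'day': [19, 30]})

# pd.to_datetime assembles a datetime from 'year', 'month' and 'day' columns
toy['Date'] = pd.to_datetime(toy[['year', 'month', 'day']])
print(toy)
```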


### 3.1.2 Time Series Plot of the FDEs

In [38]:
# The generate_color function generates random colors

def generate_color():
    color = '#{:02x}{:02x}{:02x}'.format(*map(lambda x: random.randint(0, 255), range(3)))
    return color

In [39]:
array_data = []

for aux in array_FDE_airSelec:

    trace = go.Bar(x = df_dataFailure_airSelec_result['Date'],
                   y = df_dataFailure_airSelec_result[aux],
                   name = aux,
                   marker = {'color': generate_color().upper()})

    # Append the trace to array_data
    array_data.append(trace)

# Build the layout and the figure once, after all traces have been collected
layout = go.Layout(title='Fault Text Graphic',
                   xaxis=dict(tickfont=dict(size=14, color='rgb(107, 107, 107)')),
                   yaxis=dict(title='Quantity', titlefont=dict(size=16, color='rgb(107, 107, 107)'),
                              tickfont=dict(size=14, color='rgb(107, 107, 107)')),
                   legend=dict(x=-0.5, y=-1.0, bgcolor='rgba(255, 255, 255, 0)',
                               bordercolor='rgba(255, 255, 255, 0)'),
                   barmode='group',
                   bargap=0.15,
                   bargroupgap=0.1)

fig = dict(data=array_data, layout=layout)

py.iplot(fig, filename='style-bar')

Image of the interactive plot:
![title](plot_FDE.png)

## 3.2 Grouping the Removal Data

In [41]:
# Identify all 'Component' values removed from the selected aircraft

array_Removal_airSelec = np.array(df_dataRemoval_airSelec['Component'].unique())

In [42]:
# Display the values of the array

array_Removal_airSelec

array(['REM_Component_D', 'REM_Component_B', 'REM_Component_A',
       'REM_Component_G', 'REM_Component_K', 'REM_Component_H',
       'REM_Component_J', 'REM_Component_N', 'REM_Component_I',
       'REM_Component_E', 'REM_Component_F'], dtype=object)

In [43]:
# Remove the NaN items from the array

array_Removal_airSelec = array_Removal_airSelec[~pd.isnull(array_Removal_airSelec)]  # 1D array with NaNs removed

In [45]:
# Merge the daily counts of every Component into a single dataframe (grouped by date)

df_dataRemoval_airSelec_result = pd.DataFrame(columns= ['year', 'month', 'day'])


# aux is the current Component
for aux in array_Removal_airSelec:

    # Create a dataframe with the rows of the current Component
    dfMsg = pd.DataFrame(df_dataRemoval_airSelec[df_dataRemoval_airSelec['Component'] == aux])

    # Count the removals of the current Component per day for the selected aircraft
    arrayY = dfMsg.groupby([dfMsg['Date'].dt.year.rename('year'),
                            dfMsg['Date'].dt.month.rename('month'),
                            dfMsg['Date'].dt.day.rename('day')]).count()['Component']

    # Convert the groupby result to a dataframe so that it can be merged
    arrayY = arrayY.to_frame().reset_index()

    arrayY.columns = ['year', 'month', 'day', aux]

    # Use an outer merge: it adds the new column while keeping every (year, month, day) key
    df_dataRemoval_airSelec_result = pd.merge(df_dataRemoval_airSelec_result, arrayY, how='outer', on=['year','month','day'])

In [46]:
# Display the first rows of the dataframe (result of merging the daily counts of all components).

df_dataRemoval_airSelec_result.head()

Unnamed: 0,year,month,day,REM_Component_D,REM_Component_B,REM_Component_A,REM_Component_G,REM_Component_K,REM_Component_H,REM_Component_J,REM_Component_N,REM_Component_I,REM_Component_E,REM_Component_F
0,2006,3,19,1.0,,,,,,,,,,
1,2006,12,17,1.0,,,,,,,,,,
2,2007,3,5,1.0,,,,,,,,,,
3,2007,3,11,1.0,,,,,,,,,,
4,2007,3,14,1.0,,,,,,,,,,


### 3.2.1 Manipulating the Grouped Removal Data

In [48]:
# Replace NaN elements with zeros

df_dataRemoval_airSelec_result = df_dataRemoval_airSelec_result.fillna(0) 

In [49]:
# Sort the dataframe by: year -> month -> day

df_dataRemoval_airSelec_result = df_dataRemoval_airSelec_result.sort_values(['year', 'month', 'day'])

In [50]:
# Add a Date column to the dataframe (built from the year, month and day fields)

df_dataRemoval_airSelec_result['Date'] = pd.to_datetime(df_dataRemoval_airSelec_result.year*10000 + df_dataRemoval_airSelec_result.month*100 + df_dataRemoval_airSelec_result.day, format='%Y%m%d') 

In [52]:
# Display the first rows of the dataframe

df_dataRemoval_airSelec_result.head()

Unnamed: 0,year,month,day,REM_Component_D,REM_Component_B,REM_Component_A,REM_Component_G,REM_Component_K,REM_Component_H,REM_Component_J,REM_Component_N,REM_Component_I,REM_Component_E,REM_Component_F,Date
0,2006,3,19,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-03-19
17,2006,3,23,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-03-23
43,2006,4,19,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-04-19
18,2006,4,23,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-04-23
19,2006,5,25,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2006-05-25


### 3.2.2 Time Series Plot of the Removals

In [53]:
array_data = []

for aux in array_Removal_airSelec:

    trace = go.Bar(x = df_dataRemoval_airSelec_result['Date'],
                   y = df_dataRemoval_airSelec_result[aux],
                   name = aux,
                   marker = {'color': generate_color().upper()})

    # Append the trace to array_data
    array_data.append(trace)

# Build the layout and the figure once, after all traces have been collected
layout = go.Layout(title='Removal Graphic',
                   xaxis=dict(tickfont=dict(size=14, color='rgb(107, 107, 107)')),
                   yaxis=dict(title='Quantity', titlefont=dict(size=16, color='rgb(107, 107, 107)'),
                              tickfont=dict(size=14, color='rgb(107, 107, 107)')),
                   legend=dict(x=-0.5, y=-1.0, bgcolor='rgba(255, 255, 255, 0)',
                               bordercolor='rgba(255, 255, 255, 0)'),
                   barmode='group',
                   bargap=0.15,
                   bargroupgap=0.1)

fig = dict(data=array_data, layout=layout)

py.iplot(fig, filename='style-bar')

Image of the interactive plot:
![title](plot_removal.png)

# 4. Failure Detection

A failure is characterized by a frequent concentration of FDEs, that is, fault messages clustered in time.
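
One possible way to turn this idea into a check is to sum the FDEs over a trailing time window and flag the dates on which that sum reaches a threshold. The sketch below is only an illustration under stated assumptions: the 7-day window, the threshold of 5 and the `daily_total` series are hypothetical choices, not values taken from this analysis.

```python
import pandas as pd

# Hypothetical daily FDE totals (not the real data), indexed by date
daily_total = pd.Series(
    [1, 0, 2, 3, 0, 0, 0, 0, 0, 1],
    index=pd.date_range('2006-04-19', periods=10, freq='D'),
)

WINDOW = '7D'    # assumed look-back window
THRESHOLD = 5    # assumed minimum number of FDEs within the window

# Sum of FDEs over the trailing 7 days, evaluated at each date
rolling_sum = daily_total.rolling(WINDOW).sum()

# Dates on which the trailing sum reaches the threshold
flagged = rolling_sum[rolling_sum >= THRESHOLD].index
print(list(flagged.strftime('%Y-%m-%d')))
```

On the real data, `daily_total` could be obtained by summing the FDE columns of `df_dataFailure_airSelec_result` per row and indexing by `Date`; suitable window and threshold values would have to be chosen by comparing the flagged dates against the removal dates.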