# Motivation
- Did you know that many accidents happen every day in manufacturing plants?
- Did you also know that they **kill thousands of people worldwide**?

# Objective
   - Analyse real labor accident data **to help manufacturing plants save lives**.
   - This notebook does not use any machine learning technique; it aims to help people **draw valuable insights from a few lines of data**.
   
## A few simple concepts to learn
   - We will use two basic concepts from data exploration: **Highlight** and **Insight**
   - **Highlight**: a summary of what the data shows (you will see examples throughout the notebook)
   - **Insight**: ideas and questions that arise from the Highlights
    
## Further details
- Programming Language for data analysis: **Python**
- Data handling tools: **pandas** helping with data-frame operations
- Data Visualization tools: **Seaborn** and **Plotly**

The following data source is used: [here](https://kaggle.com/ihmstefanini/industrial-safety-and-health-analytics-database)

In [1]:
import os

import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import plotly.graph_objs as go
from plotly.graph_objs import *  # Bar, Scatter and Layout are used unqualified in the Pareto cell
from plotly.offline import init_notebook_mode, iplot

# Render plotly figures inline in the notebook
init_notebook_mode(connected=True)

# List the files available in the input directory
print(os.listdir("../input"))

# 1.0 Load Dataset

In [2]:
data = pd.read_csv("../input/IHMStefanini_industrial_safety_and_health_database.csv", delimiter=',', header=0, parse_dates = ["Data"], index_col ="Data")

### 1.1 First look at the data shape

In [3]:
data.shape

### **Outcome**:
- a small data set, but with relevant information (let's see it)

### 1.2 First data visualization

In [4]:
data.head()

### **Outcome**:
- Some categories look generic (although they represent real values) because they were modified to anonymize the data and protect the company's identity

### 1.3 Checking the data index format

In [5]:
data.index

### **Outcome**:
- the index is as expected (datetime)

### 1.4 Identifying the data type of each column

In [6]:
datadict = pd.DataFrame(data.dtypes)
datadict

### 1.5 Checking for null values

In [7]:
datadict['MissingVal'] = data.isnull().sum()
datadict

### **Outcome**:
- good, no need to worry about missing data!

### 1.6 Identify the number of unique values in each column

In [8]:
datadict['NUnique']=data.nunique()
datadict
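The data dictionary built step by step in sections 1.4 to 1.6 can also be assembled in one expression. The following is a minimal sketch on a small made-up frame; the values in `toy` and the column label `DataType` are assumptions chosen for illustration, not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for the accident data (values invented for illustration)
toy = pd.DataFrame({
    "Countries": ["Country_01", "Country_01", "Country_02"],
    "Local": ["Local_01", None, "Local_03"],
})

# One row per column of `toy`: its dtype, number of nulls, and number of distinct values
toy_dict = pd.DataFrame({
    "DataType": toy.dtypes,
    "MissingVal": toy.isnull().sum(),
    "NUnique": toy.nunique(),  # nunique() ignores nulls by default
})
print(toy_dict)
```

Building the three columns together keeps the summary aligned on the column names and avoids mutating the dictionary across several cells.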

# 2.0 Basic Exploratory Data Analysis

   - In this part we will see how to get valuable Insights from relevant data, which can help you make the case on an important issue: shop-floor accidents kill thousands of people worldwide

### 2.1 Basic descriptive statistics, with a lot of value in them!

In [9]:
data.describe(include=['object'])

Pandas *describe* command reference: [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)

### Highlights:

1. Country_01 is the country where most of the accidents happen (more than 50%)
2. Local_03 (which also belongs to Country_01) is the location where most of the accidents happen
3. Mining is the industry sector that contributes most to accidents
4. Male (95%) and Third Party (43%) are the groups of people that suffer the most accidents

### Insights:

1. What makes Country_01 the most prominent contributor? Why is the number of accidents higher there than in the other places?
2. What makes Local_03 the most dangerous place to work?
3. Why are Male and Third Party people suffering more accidents? Are they being trained correctly to prevent accidents?
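The `describe` output reports the most frequent value and its count, not a percentage, so shares like the ones quoted in the Highlights have to be derived. A minimal sketch of one way to do that with `value_counts(normalize=True)`, using a small made-up frame (the rows in `toy` are invented for illustration; only the column names `Countries` and `Industry Sector` come from the notebook):

```python
import pandas as pd

# Toy stand-in for the accident data (values invented for illustration)
toy = pd.DataFrame({
    "Countries": ["Country_01", "Country_01", "Country_01", "Country_02", "Country_03"],
    "Industry Sector": ["Mining", "Mining", "Metals", "Mining", "Others"],
})

# Share of each country, as a fraction of all rows
country_share = toy["Countries"].value_counts(normalize=True)

# Most frequent category and its share, for every column at once
top_share = {
    col: (toy[col].value_counts(normalize=True).idxmax(),
          toy[col].value_counts(normalize=True).max())
    for col in toy.columns
}
print(country_share)
print(top_share)
```

Applied to the real `data` frame, the same two calls would give the percentage behind each Highlight.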


### 2.2 Some analysis in the style of an *Excel pivot table*

In [11]:
# Day of the week as an integer: 0 = Monday ... 6 = Sunday
data['Day of the Week'] = data.index.dayofweek
grouped_data = pd.DataFrame(data.groupby(['Countries','Day of the Week']).count())
grouped_data

### Highlights:

1. Thursday is the day when most of the accidents happen in Country_01
2. Wednesday is the day when most of the accidents happen in Country_02

### Insights:

1. What happens on Thursdays, or what kind of routine, leads to most of the accidents in Country_01?
2. Same question for Wednesdays in Country_02

### By changing how the data is grouped, it's possible to see another pattern:

In [12]:
grouped_data = pd.DataFrame(data.groupby(['Industry Sector','Day of the Week']).count())
grouped_data

### Insights:

1. Why do most of the accidents in Mining happen on Saturday (45)?

### Hint: 
 - Change the way you group, and you will have different views/insights about the same data
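The grouped counts above are easier to compare when reshaped into a wide table, one row per group and one column per day. A minimal sketch using `groupby(...).size().unstack()` with readable day names, on a small made-up frame (the dates and rows in `toy` are invented for illustration; the column name `Day` is an assumption):

```python
import pandas as pd

# Toy stand-in indexed by accident date (values invented for illustration)
toy = pd.DataFrame(
    {"Countries": ["Country_01", "Country_01", "Country_02", "Country_01"]},
    index=pd.to_datetime(["2016-01-07", "2016-01-14", "2016-01-13", "2016-01-09"]),
)

# day_name() gives labels such as "Thursday" instead of the 0-6 codes
toy["Day"] = toy.index.day_name()

# Rows: countries, columns: days, cells: number of accidents
table = toy.groupby(["Countries", "Day"]).size().unstack(fill_value=0)

# Day with the highest count in each country
busiest = table.idxmax(axis=1)
print(table)
print(busiest)
```

Swapping `"Countries"` for another column, such as `"Industry Sector"`, gives a different view of the same data without changing the rest of the code.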

### 2.3 Basic trend plotting using plotly offline

In [13]:
# Resample the data into daily bins and count the accidents in each bin
df = data.Countries.resample('D').count()

# Plot the daily series
trace_daily = go.Scatter(
                x=df.index,
                y=df,
                name = "Accidents per day",
                line = dict(color = '#17BECF'),
                opacity = 0.8)

dados = [trace_daily]

layout = dict(
    title = "Number of Accidents/Day (all countries)",
)

fig = dict(data=dados, layout=layout)
iplot(fig, filename = "accidents_per_day")

### Highlight
 - There are two peaks of accidents, in February 2016 and February 2017.
 - A moving average filter might help reveal trends

### Insights:

1. Why are there peaks of accidents at the beginning of every year? Is it because people are more relaxed, coming back from vacation?

### Using a moving average filter to try to see some trends

In [14]:
# Resample the data into daily bins, then take a 30-day moving average
df2 = data.Countries.resample('D').count()
b = df2.rolling(window=30).mean()

# Plot the smoothed series
trace_avg = go.Scatter(
                x=b.index,
                y=b,
                name = "30-day moving average",
                line = dict(color = '#17BECF'),
                opacity = 0.8)

dados = [trace_avg]

layout = dict(
    title = "Moving Average of 30 Days of the number of accidents/Day (all countries)",
)

fig = dict(data=dados, layout=layout)
iplot(fig, filename = "accidents_moving_average")

### Highlight
 - It is clear that the average number of accidents increases at the beginning of every year.
 - It is also possible to see that in 2017 the mean is lower than in 2016 (which is good news).
 
### Insights:

1. Why does this pattern persist (the number of accidents increases at the beginning of every year)?
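One way to check the start-of-year pattern numerically, rather than by eye on the plot, is to count accidents per calendar month and compare the years side by side. A minimal sketch on made-up dates (the rows in `toy` are invented for illustration and do not reproduce the real counts):

```python
import pandas as pd

# Toy stand-in indexed by accident date (values invented for illustration)
dates = pd.to_datetime([
    "2016-01-05", "2016-02-10", "2016-02-20", "2016-07-01",
    "2017-02-03", "2017-02-15", "2017-09-09",
])
toy = pd.DataFrame({"Countries": ["Country_01"] * len(dates)}, index=dates)

# Rows: years, columns: calendar months, cells: number of accidents
monthly = (
    toy.groupby([toy.index.year, toy.index.month])
       .size()
       .unstack(fill_value=0)
       .rename_axis(index="year", columns="month")
)

# Month with the highest count in each year
peak_month = monthly.idxmax(axis=1)
print(monthly)
print(peak_month)
```

If the same month comes out on top in every year of the real data, that supports treating the peak as seasonal rather than a one-off.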

### 2.4 Basic bar plot using seaborn

### Just a simple bar graph to show that there are also lots of options for representing the same data

In [15]:
# catplot replaces the deprecated factorplot; `height` replaces the old `size` argument
g = sns.catplot(data=data, kind="count", x="Countries", hue="Local", height=8, aspect=1)

### 2.5 A final Pareto chart to analyze Critical Risks using Plotly

In [16]:
# Count accidents per critical risk, sorted from most to least frequent
paretodf = data["Risco Critico"].value_counts().to_frame(name='total')

# Running total and running percentage of all accidents
paretodf['cumulative_sum'] = paretodf['total'].cumsum()
paretodf['cumulative_perc'] = 100 * paretodf['cumulative_sum'] / paretodf['total'].sum()

# Constant 80% reference line for the Pareto chart
paretodf['demarcation'] = 80

In [17]:
trace1 = Bar(
    x=paretodf.index[0:7],
    y=paretodf.total[0:7],
    name='Count',
    marker=dict(
        color='rgb(34,163,192)'
               )
)
trace2 = Scatter(
    x=paretodf.index[0:7],
    y=paretodf.cumulative_perc[0:7],
    name='Cumulative Percentage',
    yaxis='y2',
    line=dict(
        color='rgb(243,158,115)',
        width=2.4
       )
)
trace3 = Scatter(
    x=paretodf.index[0:7],
    y=paretodf.demarcation[0:7],
    name='80%',
    yaxis='y2',
    line=dict(
        color='rgba(128,128,128,.45)',
        dash = 'dash',
        width=1.5
       )
)
dataplot = [trace1, trace2,trace3]
layout = Layout(
    title='Critical Risks Pareto',
    font=dict(
        color='rgb(128,128,128)',
        family='Balto, sans-serif',
        size=12
    ),
    width=623,
    height=623,
    paper_bgcolor='rgb(240, 240, 240)',
    plot_bgcolor='rgb(240, 240, 240)',
    hovermode='x',
    margin=dict(b=250,l=60,r=60,t=65),
    showlegend=True,
       legend=dict(
          x=.83,
          y=1.3,
          font=dict(
            family='Balto, sans-serif',
            size=12,
            color='rgba(128,128,128,.75)'
        ),
    ),
    annotations=[ dict(
                  text="Cumulative Percentage",
                  showarrow=False,
                  xref="paper", yref="paper",
                  textangle=90,
                  x=1.100, y=.75,
                  font=dict(
                  family='Balto, sans-serif',
                  size=14,
                  color='rgba(243,158,115,.9)'
            ),)],
    xaxis=dict(
      tickangle=-90
    ),
    yaxis=dict(
        title=dict(
            text='Count',
            font=dict(
                family='Balto, sans-serif',
                size=14,
                color='rgba(34,163,192,.75)')
        ),
        range=[0,300],
        tickfont=dict(
            color='rgba(34,163,192,.75)'
        ),
        # Tick values chosen to match the 0-300 axis range
        tickvals = [0,50,100,150,200,250,300]
    ),
    yaxis2=dict(
        range=[0,101],
        tickfont=dict(
            color='rgba(243,158,115,.9)'
        ),
        tickvals = [0,20,40,60,80,100],
        overlaying='y',
        side='right'
    )
)

fig = dict(data=dataplot, layout=layout)
iplot(fig)

### Insight:
 - It's clear that this database needs attention, as the category "Others" accounts for most of the Critical Risks
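The 80% line on a Pareto chart marks the "vital few" categories, and these can also be read off the cumulative percentages directly. A minimal sketch on made-up counts (only the label "Others" comes from the notebook; the other risk names and all counts are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the "Risco Critico" column (values invented for illustration)
risks = pd.Series(
    ["Others"] * 6 + ["Pressed"] * 2 + ["Fall"] * 1 + ["Cut"] * 1,
    name="Risco Critico",
)

# Counts sorted from most to least frequent, then the cumulative percentage
counts = risks.value_counts()
cum_perc = 100 * counts.cumsum() / counts.sum()

# Keep every category up to and including the first one that reaches 80%
n_needed = int((cum_perc < 80).sum()) + 1
vital_few = cum_perc.index[:n_needed].tolist()
print(vital_few)
```

When a catch-all label such as "Others" dominates this list, the classification itself is the first thing worth improving.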
 

# Final outcomes:

- You might think all of this could be done in Excel, and yes, it could! But how elegant would it be in Excel? And how much data could you process in Excel compared with a Python notebook?
- Imagine the possibilities for presenting this report to your boss or to the person responsible for managing accidents in your plant.
- You can now also explore **NLP** with this data...

## Thank you!