<h1>Industrial Accidents Predictions</h1>


<p> Today we are going to build a Machine Learning model to predict accidents in the industrial sector.

For this project we are going to use: <br />
    -The Python language <br />
    -The Pandas, NumPy, scikit-learn, Matplotlib and Seaborn libraries <br />
    -Jupyter notebooks <br />
</p>


<p> The first step is to set up the necessary imports. </p>

In [8]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import StratifiedKFold
from sklearn.multioutput import MultiOutputClassifier
import matplotlib.pyplot as plt
import seaborn as sns

<h4> Disclaimer </h4>
<p> Since I know that I will run into different problems during the process, I don't like writing every import up front; I just <br />
write the ones that I am 100% sure I will need. I think a lot of time is wasted trying to work out at the very start which <br /> tools you are going to use. I personally prefer to prepare the general tools that I always use and then grab the rest as they become necessary. </p>


<p> Now let's look at the content of our dataset and check whether any information is missing. </p>

In [9]:
class DataExtractor():
    def datapreparer(self, route):
        # Load the CSV file and list its column names.
        self.dataset = pd.read_csv(route)
        print(self.dataset.columns)
    
    def data_visualization(self):
        # Preview the first rows, then report whether any column has null values.
        print(self.dataset.head())
        missing_cols = [col for col in self.dataset.columns if self.dataset[col].isnull().sum() > 0]
        if len(missing_cols) > 0:
            print("There is missing information")
        else:
            print("There is no missing data")

    def model_selection(self):
        # One-hot encode the features and label-encode the target.
        self.oneh = OneHotEncoder(sparse_output = False)
        self.label = LabelEncoder()

        self.selected_model = RandomForestClassifier(max_depth = 10, n_estimators = 100)
        features = self.dataset.drop('Industry Sector', axis = 'columns')
        encoded_features = self.oneh.fit_transform(features)
        features_dataframe = pd.DataFrame(encoded_features)
        # No column names are passed, so these are the integer positions 0..N-1.
        self.encoded_columns = features_dataframe.columns

        labels = self.dataset['Industry Sector']
        encoded_labels = self.label.fit_transform(labels)

        x_train, x_test, y_train, y_test = train_test_split(encoded_features, encoded_labels, test_size = 0.2, random_state = 0)

        self.selected_model.fit(x_train, y_train)
        predictions = self.selected_model.predict(x_test)
        # Both values below are computed on the held-out test split.
        # MSE is taken over the integer-encoded classes, so it depends on how the labels are numbered.
        error = mean_squared_error(y_test, predictions)
        print(f"Training time output {predictions.shape}")
        print(f"Error in training time {error * 100:.2f}%")

    def new_predictions(self, new_data):
        new_data_encoded = pd.get_dummies(new_data)
        predictions = self.selected_model.predict(new_data_encoded)
        print(f"Normal prediction shape {predictions.shape}")
        return predictions

    def get_encoding_info(self, prediction):
        output_decoded = self.label.inverse_transform(prediction)
        print(f"Categories: \n {output_decoded}")

        

In [10]:
route = "./archive/IHMStefanini_industrial_safety_and_health_database.csv"
new_data_extractor = DataExtractor()
new_data_extractor.datapreparer(route)


Index(['Data', 'Countries', 'Local', 'Industry Sector', 'Accident Level',
       'Potential Accident Level', 'Genre', 'Employee ou Terceiro',
       'Risco Critico'],
      dtype='object')


<h2> Value information</h2>
<p> Column names sometimes tell us what they contain, but often the name alone is not enough, <br /> because it is not descriptive or for some other reason, so let's print a few of their values rather than all of them. </p>
<p> We also iterate over every column of the dataset looking for null values, because the model won't perform correctly if values are missing. If there were any, we could fill them with a constant such as 0, with the column's mean for numeric columns, or with the most frequent value for categorical ones. </p>

In [11]:
new_data_extractor.data_visualization()

                  Data   Countries     Local Industry Sector Accident Level  \
0  2016-01-01 00:00:00  Country_01  Local_01          Mining              I   
1  2016-01-02 00:00:00  Country_02  Local_02          Mining              I   
2  2016-01-06 00:00:00  Country_01  Local_03          Mining              I   
3  2016-01-08 00:00:00  Country_01  Local_04          Mining              I   
4  2016-01-10 00:00:00  Country_01  Local_04          Mining             IV   

  Potential Accident Level Genre  Employee ou Terceiro        Risco Critico  
0                       IV  Male           Third Party              Pressed  
1                       IV  Male              Employee  Pressurized Systems  
2                      III  Male  Third Party (Remote)         Manual Tools  
3                        I  Male           Third Party               Others  
4                       IV  Male           Third Party               Others  
There is no missing data
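
The dataset used here has no gaps, but the check and the fill strategies mentioned above can be sketched on a small toy frame. This is only an illustration: the frame and its "Days Lost" column are hypothetical and are not part of the accident dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each column (not the accident dataset).
toy = pd.DataFrame({
    "Countries": ["Country_01", None, "Country_01", "Country_02"],
    "Days Lost": [3.0, 1.0, np.nan, 2.0],
})

# Columns that contain at least one missing value.
missing_cols = [col for col in toy.columns if toy[col].isnull().any()]

filled = toy.copy()
# Categorical column: fill with the most frequent value.
filled["Countries"] = filled["Countries"].fillna(filled["Countries"].mode()[0])
# Numeric column: fill with the column mean.
filled["Days Lost"] = filled["Days Lost"].fillna(filled["Days Lost"].mean())

print(missing_cols)
print(filled)
```

Which strategy is appropriate depends on the column; a mean only makes sense for numeric data.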


<h2> Dataset general overview </h2>
<p> Looking at the values gives us a general overview of the dataset. <br />
It contains information about accidents in the industrial sector, with the following details: <br />
&nbsp; -Date when it happened. <br />
&nbsp; -Country where the accident took place. <br />
&nbsp; -Local: the number of the site where it occurred. <br />
&nbsp; -Sector in which it happened. <br />
&nbsp; -Accident Level (we can guess this is the severity of the consequences). <br />
&nbsp; -Potential Accident Level (we can guess this is how severe the consequences could have been). <br />
&nbsp; -Genre: the gender of the person who suffered the accident. <br />
&nbsp; -Whether that person was an employee or a third party ("Employee ou Terceiro"). <br />
&nbsp; -The critical risk involved ("Risco Critico"), which we read here as the department or type of job the accident belonged to. <br />
</p>

<h4> Disclaimer </h4>
<p> When some information is not clear enough, because the column names are not descriptive <br />
and the values don't help explain what they mean, we should ask our clients, to be 100% sure about <br />
what the data is saying about the accidents in this case. <br /> <br />
Take the "Local" column: the name is not very descriptive. We could guess it refers to the location of the industrial site, <br />
but it can be literally whatever the client decided it to be, and the values only indicate that there is more than one. <br />
So this is a good case to raise with our clients, and it shows our interest in doing really good work. </p>

<h2> What are we capable of doing with Machine Learning algorithms? </h2>

<p> One of the things that I personally love the most about Machine Learning is that we are able to predict what is <strong>most likely to</strong> happen in the future <br/>
thanks to the information we have in the dataset. <br /> <br />
Imagine that you are the Head Officer of this industry and you want to know what is <strong>most likely</strong> to happen on a certain date, under certain conditions, in the Mining industry, so that you can take safety measures to avoid accidents. This is the moment to apply Machine Learning in your business. </p>

So the first thing we are going to do is <strong>predict</strong>, under certain conditions, which Industry is <strong>most likely to</strong> have an accident.

<h3> This is a screenshot; if you want to copy the code, the whole class is at the start of the notebook </h3>

![image.png](attachment:b2c8eb75-7ff3-4395-a47d-dce984f467a4.png)



<h2> What are we doing in this function? </h2>
<p> This dataset contains categorical information: we are not dealing with numbers, just places, dates and industries, so we need something <br />
to turn this data into numbers that our algorithm can work with to make predictions. This is where encoders come in: they transform categorical data into numbers that the algorithm can handle. <br /> </p>

<h3> OneHotEncoder vs LabelEncoder </h3>
<p> OneHotEncoder builds a binary matrix (one-hot vectors) from our data: each column is replaced by one binary column per possible value. But we all know that "a picture is worth a thousand words", so I will let this magnificent image speak for me. </p>

![image.png](attachment:d5adc74c-d0c4-43a8-b6a9-df94a17b7459.png)

The column city has 4 possible values, so OneHot builds a matrix with one column per value; in each row, the value that applies is marked with 1 and the others with 0.
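
The same idea can be reproduced in a few lines. This is a minimal sketch on a hypothetical city column (the city names are made up for illustration); it relies on the encoder's default sparse output and converts it with `toarray()` so that the matrix can be printed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical city column with four possible values.
cities = pd.DataFrame({"City": ["Madrid", "Paris", "Rome", "Berlin", "Paris"]})

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it easy to inspect.
one_hot = encoder.fit_transform(cities).toarray()

# Categories are sorted alphabetically; each one becomes a column.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to that row's city.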




On the other hand, LabelEncoder simply turns each non-numeric (categorical) value into an integer, <strong> but this can be problematic. Why? </strong>

<p>Imagine the column Genre, where the value Male is turned into 0 and the value Female is turned into 1. With label encoding, the algorithm may understand that 1 is greater (better) than 0 and assign a higher value to Female, an order that does not exist in the data. <strong> That's a big problem when making predictions. </strong></p>
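
A short sketch makes the integer codes visible. The list of values is hypothetical. Note that scikit-learn's LabelEncoder numbers the classes in sorted order, so which integer each value receives depends on the category names themselves.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a Genre-like column.
genres = ["Male", "Female", "Male", "Male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)

# Classes are sorted, so "Female" becomes 0 and "Male" becomes 1 here.
print(list(encoder.classes_))
print(codes.tolist())
```

Whatever the numbering, the codes are plain integers, so a model that compares magnitudes can read an order into them.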

The algorithm selected is a RandomForestClassifier. If we wanted to validate which classification algorithm to use, we could run a GridSearch to compare different algorithms and their parameters, then keep the one with the highest accuracy. <br />
We will use the <strong>RandomForestClassifier</strong> implementation from <strong>sklearn.ensemble</strong>.
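
As a rough sketch of what such a search could look like, the block below tunes two RandomForestClassifier parameters with GridSearchCV. The data is synthetic and the grid values are assumptions chosen only to keep the example fast; they are not tuned for the accident dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the grid values below are illustrative only.
X, y = make_classification(n_samples=120, n_features=8, n_informative=4, random_state=0)

param_grid = {"n_estimators": [20, 50], "max_depth": [3, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

Comparing several algorithms works the same way: run one search per estimator and keep the best cross-validated score.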

<p> The label we are trying to predict is the Industry Sector where the accident would take place, so we have to: <br />
    &nbsp; -Drop that column from our dataset at training time. <br/>
    &nbsp; -After removing that column, encode the rest of the features using OneHot. <br />
    &nbsp; -Save the column configuration created after passing the dataset through OneHot; we will need it later, at inference time. </p>

![image.png](attachment:850bdb99-f8ad-47b5-b882-302fe49fccc6.png)


After that we take the label column that we dropped before and encode it using the LabelEncoder.

<p>
The next steps are: <br />
&nbsp; -Divide our dataset into training and testing data. <br/>
&nbsp; -Fit the model to our training features and labels. <br/>
&nbsp; -Make predictions with our testing features.<br/>
&nbsp; -Compare the predictions with the testing labels.<br/>
</p>
<h4>Disclaimer</h4>
This dataset is pretty small, which will probably hurt the model's performance: it is likely to overfit such a small amount of training data. The idea here is to show the process; in a real case we would need more training data and better hyperparameter tuning to get the best performance.
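
With little data, a single train/test split can give a noisy estimate. One common option, sketched here on synthetic data, is stratified k-fold cross-validation, which scores the model on several splits. The data and the number of folds are assumptions for illustration; only the model parameters mirror the ones used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a small dataset.
X, y = make_classification(n_samples=100, n_features=6, n_informative=3, random_state=0)

# Five stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# One accuracy value per fold, then their average.
print(scores.round(2))
print(round(scores.mean(), 3))
```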

![image.png](attachment:033ef796-57ce-4592-825b-f6fb267e37a6.png)

In [12]:
new_data_extractor.model_selection()


Training time output (88,)
Error in training time 4.55%
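
The figure above is a mean squared error computed on integer-encoded class labels, so its value depends on how the classes happen to be numbered. A small sketch with hypothetical labels shows the difference from accuracy, the more usual metric for a classifier: two predictions with the same number of mistakes get the same accuracy but different MSE.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical encoded labels for three classes (0, 1, 2).
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred_a = np.array([0, 1, 2, 2, 1, 0, 1, 1])  # one miss, off by 1
y_pred_b = np.array([0, 1, 2, 2, 1, 0, 1, 0])  # one miss, off by 2

for name, y_pred in [("a", y_pred_a), ("b", y_pred_b)]:
    acc = accuracy_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    print(f"{name}: accuracy={acc:.3f} mse={mse:.3f}")
```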


<h2> Now what? </h2>
Our model is already fitted. The next step is to prepare for inference time, so we can make predictions for the future using the trained model and the following function.

![image.png](attachment:104665e5-ab97-413d-bc6f-333c2c6ae852.png)

<h3> What is this function doing?</h3>
<p>In this function we receive new data, for example: <br/>
    &nbsp; -A date. <br/>
    &nbsp; -One of the countries. <br/>
    &nbsp; -A specific local. <br/>
    &nbsp; -Accident level. <br/>
    &nbsp; -Potential accident level. <br/>
    &nbsp; -A genre. <br/>
    &nbsp; -One of the types of worker. <br/>
    &nbsp; -A "department". <br/>

We create a new dataframe for this data and reindex it with the encoded columns. Do you remember when we saved the column configuration earlier? Now we need it to put the new data into the same format, so that our algorithm can work with it.<br />
After that we call self.selected_model.predict() on the new data to make the prediction.<br /><br />
Now the predictions variable holds the result: the industry sector in which the accident is most likely to take place. However, it is represented by a number, specifically an encoded industry sector, so we need to decode it back to its original name.
</p>


In [13]:
new_data = pd.DataFrame([[
    '2016-01-01 00:00:00', 'Country_01', 'Local_01', 'I', 'IV', 'Male', 'Third Party', 'Pressed'
]], columns=['Data', 'Countries', 'Local',  'Accident Level',
             'Potential Accident Level', 'Genre', 'Employee ou Terceiro', 'Risco Critico'])


new_data = new_data.reindex(columns = new_data_extractor.encoded_columns, fill_value = 0)
prediction = new_data_extractor.new_predictions(new_data)

Normal prediction shape (1,)


<h2> Decoding the prediction </h2>

![image.png](attachment:c49df1fa-cbd6-48eb-8160-a5383b9a72fc.png)

<p> This code receives a label-encoded prediction and decodes it back to the original sector name using the LabelEncoder's inverse_transform function. </p>

In [14]:
new_data_extractor.get_encoding_info(prediction)


Categories: 
 ['Mining']
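
The round trip between names and integers can be sketched in isolation. The sector names below are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sector names used only for illustration.
sectors = ["Mining", "Metals", "Others", "Mining"]

label = LabelEncoder()
encoded = label.fit_transform(sectors)

# A model would output integers like these; inverse_transform maps them back.
decoded = label.inverse_transform(np.array([1, 0]))

print(encoded.tolist())
print(decoded.tolist())
```

The decoder must be the same fitted LabelEncoder that encoded the training labels; a freshly created one would not know the mapping.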
