#***Bix Tecnologia Challenge***

##Evaluated candidate
* Helder da Silva Galbier

##Proposed Situation

*  Reduced maintenance costs for air systems in a truck fleet.
* List factors that indicate failure in the maintenance system currently used.



In [8]:
# preparing drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##[Exploratory data analysis and Dataprep]

**1. What steps would you take to solve this problem? Please describe as completely and clearly as possible all the steps that you see as essential for solving the problem.**

Some important points to consider:
* The company wants to reduce maintenance costs only for trucks with air system problems. So trucks with other types of problems cannot be included in the ML model.
* The dataset columns were encoded. This may compromise the accuracy of the analysis of the applied Machine Learning model. Even so, a study will be carried out.
* I assumed that each dataset column corresponds to a maintenance occurrence for each vehicle.

Therefore, I chose to implement data preparation as follows: consider only the total number of trucks with air system problems. First, let's count how many trucks are undergoing maintenance this year and the last few years.



In [30]:
# importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# importing files

present_file_path = '/content/drive/MyDrive/bix_tech/air_system_present_year.csv'
previous_file_path = '/content/drive/MyDrive/bix_tech/air_system_previous_years.csv'

df_present = pd.read_csv('/content/drive/MyDrive/bix_tech/air_system_present_year.csv')
df_previous = pd.read_csv('/content/drive/MyDrive/bix_tech/air_system_previous_years.csv')

# counting trucks with air system maintenance this year

count_pres = df_present['class'].value_counts().get('pos', 0)
print(f"There are {count_pres} trucks in maintenance this year")

# counting trucks with another issues this year

count_pres_other = df_present['class'].value_counts().get('neg', 0)
total_pres = count_pres + count_pres_other
print(f"There are {count_pres_other} trucks in maintenance this year for another issues")
print(f"There are {total_pres} trucks in this dataset")
print("----------------------------------------------------------------------------------------------------------------------")

# counting trucks with air system maintenance in recent years

count_prev = df_previous['class'].value_counts().get('pos', 0)
print(f"There are {count_prev} trucks in maintenance in recent years")

# counting trucks with another issues in recent years

count_prev_other = df_previous['class'].value_counts().get('neg', 0)
total_prev = count_prev + count_prev_other
print(f"There are {count_prev_other} trucks in maintenance in recent years for another issues")
print(f"There are {total_prev} trucks in this dataset")
print("----------------------------------------------------------------------------------------------------------------------")

# [dataprep] Removing rows that contain "na" (current year) and filtering the rows thar contains "pos"

df_present.replace('na', pd.NA, inplace=True)
df_cleaned = df_present.dropna()
df_cleaned_pos_pres = df_cleaned[df_cleaned['class'] == 'pos']
print(df_cleaned_pos_pres.head())
print("----------------------------------------------------------------------------------------------------------------------")

# [dataprep] Removing rows that contain "na" (past years) and filtering the rows thar contains "pos"

df_previous.replace('na', pd.NA, inplace=True)
df_cleaned = df_previous.dropna()
df_cleaned_pos_prev = df_cleaned[df_cleaned['class'] == 'pos']
print(df_cleaned_pos_prev.head())

There are 375 trucks in maintenance this year
There are 15625 trucks in maintenance this year for another issues
There are 16000 trucks in this dataset
----------------------------------------------------------------------------------------------------------------------
There are 1000 trucks in maintenance in recent years
There are 59000 trucks in maintenance in recent years for another issues
There are 60000 trucks in this dataset
----------------------------------------------------------------------------------------------------------------------
     class   aa_000 ab_000 ac_000 ad_000 ae_000 af_000 ag_000  ag_001  \
486    pos  1172556      0    246    326      0      0      0       0   
583    pos   354598      0   4944   4648      0      0      0       0   
728    pos   468666      2   2596   1794      0      0      0  194100   
1348   pos   251818      0   3082   1238      0      0     30    3668   
1687   pos   195268      0    472    380      0      0      0       0   

      

##[Model Choice]

**2. Which technical data science metric would you use to solve this challenge? Ex: absolute error, rmse, etc.**

After separating the data between the absolute number of trucks with air system problems and the absolute number of trucks with other maintenance problems, both in the dataset with information from previous years and in the dataset from the most current year, it became more suggestive to implement a model of Logistic Regression for each dataset, considering the absolute number of trucks with air system problems over the total number of trucks. This approach predicts the unknown data value using another related and known data value.
For this model, I used the accuracy metric that defines the precision of the regression model used.
I considered 20% as a parameter for the proportion of data for the test.


In [88]:
# loading regression model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# splitting data between "test" and "training" for df_cleaned_pos_pres

x_train, x_test, y_train, y_test = train_test_split(df_cleaned_pos_pres['ab_000'],df_cleaned_pos_pres['ee_009'],test_size=0.2)

# splitting data between "test" and "training" for df_cleaned_pos_pres

x_train, x_test, y_train, y_test = train_test_split(df_cleaned_pos_prev['ab_000'],df_cleaned_pos_prev['ee_009'],test_size=0.2)

# creating Logistics Regression model

model_df_pres = LogisticRegression()
model_df_prev = LogisticRegression()

Having created the Regression models, now let's train them and input some prediction values

In [90]:
# training model

model_df_pres.fit(np.array([x_train]).reshape(-1, 1),y_train)
model_df_prev.fit(np.array([x_train]).reshape(-1, 1),y_train)

# inputing prediction values

model_df_pres.predict(np.array([800]).reshape(-1,1))
model_df_prev.predict(np.array([90]).reshape(-1,1))

# checking the accuracy of the implemented model

accuracy_pres = model_df_pres.score(np.array([x_test]).reshape(-1, 1),y_test)
accuracy_prev = model_df_prev.score(np.array([x_test]).reshape(-1, 1),y_test)
print(f"The accuracy of this model for current data is {accuracy_pres}")
print(f"The accuracy of this model for data from previous years is {accuracy_prev}")

The accuracy of this model for current data is 0.875
The accuracy of this model for data from previous years is 0.875


**3. Which business metric  would you use to solve the challenge?**


As business metrics to support decisions in solving this problem, I consider:
- Average time between failures: indicator that shows the average time a vehicle is used before a failure occurs. The longer the indicated time, the greater reliability in the truck fleet;
- Failure rate: Indicator that indicates the percentage of trucks that require corrective maintenance in relation to the absolute number of vehicles;
- Preventive versus Corrective maintenance cost: I consider it a good business metric as we can compare costs with corrective and preventive maintenance. Increased corrective costs may mean a greater need for preventive maintenance (the most important);
- Average repair time: indicator that measures the time needed to complete maintenance on a vehicle. It is a metric that also checks maintenance efficiency;

**4. How do technical metrics relate to the business metrics?**

Conceptually, logistic regression deals with a statistical model used to predict the probability of some event occurring. In other words, it has two possible results, such as "yes/no", "success/failure", "pos/neg", etc.
The model presented demonstrated an accuracy of 87.5%, which is considered relatively high. The applicability of this model to the business metrics discussed previously are:
- Mean Time Between Failures: the logistic regression model can be used to predict the probability of a truck failing within a certain period of time. By exploring the factors that induce failures, we can identify key predictors that increase or decrease this metric with higher precision/accuracy. Based on this forecast, the company can study preventive maintenance strategies.
- Failure rate: Logistic regression model can predict the probability of failure for each truck. The sum of these probabilities indicates the failure rate for the entire fleet. This way, the company can study a better allocation of resources and develop better preventive maintenance planning, aiming to reduce the failure rate.
- Cost of preventive versus corrective maintenance: with the logistic regression model, we can determine which trucks are at greater risk of needing corrective maintenance. This allows you to plan preventative maintenance more efficiently, reducing total maintenance costs. The cost and benefit analysis can be improved, showing that a greater investment in preventive maintenance can significantly reduce corrective costs.
-Average repair time: the regression model helps predict the probability of different types of failures, considering repair time for each event. Thus, the company can optimize the operational resources used in the vehicle maintenance logistics chain.


**5. What types of analyzes would you like to perform on the customer database**

The statement says that "for bureaucratic reasons related to the company's contracts, all columns had to be coded". This data encoding can distort the results of the implemented Machine Learning model. For effective analysis, it is essential to work in a database with as accurate information as possible. More analyzes could be implemented, such as:
- exploratory data analysis;
- draw measures of central tendency;
- draw dispersion measurements;
- trace offenders (locations, routes, type of vehicles, incidence per driver, etc.)
All this, of course, respecting the Corporate Governance guidelines, current data protection laws, among other legal and administrative restrictions.


**6. What techniques would you use to reduce the dimensionality of the problem**

An important data science technique that I would use is Clustering, such as the K-Means method, widely used to check patterns in data sets and group them by similar characteristics.
Another technique I would use is Feature Selection. In addition to the already proposed method (Model-Based), statistical methods such as analysis of variance and correlation testing would help to identify which characteristics of this data set are most closely related to maintenance costs.

**7. What techniques would you use to select variables for your predictive model?**

In addition to data input, which fills NA values ​​with some reasonable value (usually some measure of central tendency), I would also use:
- Variable scaling: to transform data into a common scale in order to reduce the dominance of large magnitudes of data over smaller magnitudes, together with Min-Max normalization;
- Coding of categorical variables: with the techniques of one-hot coding (one column for each category) and ordinal coding (mapping of categories with values ​​in numbers);


**8. What predictive models would you use or test for this problem? Please indicate at least 3.**

In addition to the Logistic Regression model used, I would also use:
- Linear Regression (supervised method), which predicts continuous values ​​and are exactly those used in the dataset.
- K-Means (unsupervised): is a clustering method that divides data into similar sets. It helps identify patterns in the dataset.
- K-nearest neighbors (supervised): classifies each sample in the dataset considering the distance in relation to its neighbors. It would identify, for example, which routes cause the most damage to trucks.

**9. How would you rate which of the trained models is the best?**


I think that the K-means clustering model would be the most appropriate for a first stage of analysis, as it would help to identify similar patterns in the data set. This would facilitate the identification of:
- which routes have the highest incidence of damaged vehicles;
- which drivers have the most maintenance records;
- how much each route or driver contributes to the maintenance cost;
- frequency of vehicles requiring maintenance, etc.;

**10. How would you explain the result of your model? Is it possible to know which variables are most important?**

Initially, the Logistic Regression model showed an accuracy of 87.5%. It is considered a high accuracy rate for a Machine Learning model applied to business problems. It is worth presenting graphs here to better visualize this accuracy and the impacts this model will have on maintenance costs. The model coefficients, in a regression model, are the major influencers on the response variables, depending on their distance from the mean.


**11. How would you assess the financial impact of the proposed model?**

With 87.5% accuracy, the model is effective in predicting potential failures. This means that, on average, the model makes correct predictions 87.5% of the time.
Regarding failure reduction: it is possible to estimate the percentage of failures that can be avoided with preventive interventions based on model predictions. This way, there is an estimated savings in maintenance costs, as the company will be able to have an overview of where it allocates resources to reduce its maintenance costs.

**12. What techniques would you use to perform the hyperparameter optimization of the chosen model?**

Optimizing hyperparameters encompasses the search for the best values ​​of parameters that are not directly learned by the model, but that influence its performance. Some techniques I would apply to this model:
- Grid search: ensuring the evaluation of all possible combinations in a defined space;
- Bayesian Optimization: within the same analysis spectrum, it will tend to find the best hyperparameters with fewer evaluations, compared to grid or random search;

**13. What risks or precautions would you present to the customer before putting this model into production?**

In any application of a Machine Learning model, the first precaution is about data quality (missing data, outliers, variable scaling). This is an important point to reinforce with the client because a Machine Learning model fed and trained on a biased dataset will bring inefficient or inconclusive results.

**14. If your predictive model is approved, how would you put it into production?**

After processing and preparing the data, training and validating the model, we need to follow some steps to put it into production:
- Serialization of the model, which consists of saving and loading the model to be used in production. In Python there are specific libraries for this like "pickle" or "joblib"
- Version the model in a versioning system such as Git.
- Create an API for the model, configuring an environment where the model will be made available;

**15. If the model is in production, how would you monitor it?**

To monitor any system we need to observe some essential details, such as:
- performance monitoring, observing forecast latency, error rate and other metrics. Some tools can help with this task, such as Grafana.
- Implement logging to track predictions and errors.


**16. If the model is in production, how would you know when to retrain it?**

- Ensure that the model is updated, periodically training it with new data. In the specific case of the transport company, monthly is a reasonable frequency.