## Shape: challenge for DS position
--- 

Applicant name: **Mauricio Branbilla Junior**

--- 

### Challenge description

To enable the operations of an FPSO, we use sensors to make sure the equipment does not fail. These sensors measure different parameters of the equipment in different setup configurations (preset 1 and preset 2) over time. We want you to investigate one piece of equipment in different time cycles to understand what characteristics and parameters of the sensors might indicate that the equipment is on the verge of failing. To solve this problem, we expect you to answer a few questions regarding the attached dataset:

1. Calculate how many times the equipment has failed.
2. Categorize equipment failures by setup configuration (preset 1 and preset 2).
3. Categorize equipment failures by their nature/root cause according to parameter readings (temperature, pressure, and others).
4. Create a model using the technique you think is most appropriate and measure its performance.
5. Analyze variable importance.



**A few tips:**

- Please write down any insights and conclusions throughout your code when you think it is necessary, keeping them as clear and complete as possible. Think of this exercise as your first technical report for Shape!
- At Shape, we generally work with Python, and you are encouraged to use this language. However, if you can't use Python, you can use R or Julia for this assessment. We value clean, concise, and production-ready code.
- Once you’re done, please send us your analyses and answers in an HTML notebook, along with any other conclusions and insights that you think are relevant to the project. We value creativity!
- Feel free to discuss how you would put the models into production.


**What we expect you to do:**

- Storytelling around the data and the analyses performed.
- Problem comprehension.
- Data exploration.
- Logical and concise model definition.
- Rationale explanation.
- Results evaluation.


**What you DON'T need to do:**

- Super complex models with no logic or rationale behind them.

---

### Some notes about my solution


--- 
### <a id='index'>Table of Contents:</a>

- [1. Environment setup](#sec_1)
- [2. Data overview](#sec_2)

### <a id='sec_1'>1. Environment setup</a>

In this section:

- Set variables with paths relative to the current working directory
- Install and import the required Python libraries
- Set constants that will be used throughout the solution


[(back to Table of Contents)](#index)


In [9]:
# Set working directory and paths

import os

MAIN_PATH = os.getcwd()
# Build the path with os.path.join so the separator matches the host OS
DATA_FILE_PATH = os.path.join(MAIN_PATH, 'data', 'O_G_Equipment_Data.xlsx')

print(f"Current Path: {MAIN_PATH}")


Current Path: /Users/mbranbilla/Projects/shape_challenge


In [5]:
%%capture
# Requirements

import subprocess
import sys

requirements = """#Python 3.11.4
pandas==2.0
numpy>=1.21.0, <1.27.0, !=1.24.0
scipy==1.10
scikit-learn==1.3
matplotlib==3.7
seaborn==0.12.2
shap==0.42.1
tqdm
openpyxl  # engine pandas uses to read .xlsx files with read_excel
"""
with open('requirements.txt', 'w') as f:
    f.write(requirements)

subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "pip"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])

# Clean up names that were only needed for the installation step
del requirements, subprocess, sys, f



In [8]:
# Import modules
import pandas as pd

**Constants: a brief description of each value set below:**

Constants are identified by `UPPER_CASE` variable names.

- `RND_SEED` (int): seed for random number generation, needed for reproducibility in many methods used in the solution (this value is passed to every method that has an optional argument called `random_state`)
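
As a minimal sketch of why a fixed seed matters, the snippet below passes the same `random_state` to `DataFrame.sample` twice and gets the same rows both times. The `demo` frame is hypothetical and exists only for this illustration; it is not part of the challenge data.

```python
import pandas as pd

RND_SEED = 42  # same value as the notebook constant

# Hypothetical frame, used only to demonstrate reproducible sampling.
demo = pd.DataFrame({"x": range(10)})

# Two independent calls with the same seed draw the same rows.
first = demo.sample(n=3, random_state=RND_SEED)
second = demo.sample(n=3, random_state=RND_SEED)

print(first.equals(second))
```

The same idea applies to any method that accepts `random_state`, such as data splitters or estimators: passing `RND_SEED` everywhere keeps runs comparable.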




In [7]:
# Constants
RND_SEED = 42

### <a id='sec_2'>2. Data overview</a>

In this section:

- Read the dataset from the provided XLSX file
- Show basic statistics and information about the data (shape, presence of missing values, distributions)
- Present solutions to the following challenge objectives:

    - Calculate how many times the equipment has failed
    - Categorize equipment failures by setup configuration (preset 1 and preset 2)
    - Categorize equipment failures by their nature/root cause according to parameter readings (temperature, pressure, and others)
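
One possible way to approach the first two objectives is sketched below on made-up toy data (not the challenge dataset). It assumes the `Cycle`, `Preset_1`, `Preset_2` and `Fail` column names from the dataset, and it makes one interpretive assumption: consecutive failing cycles are treated as a single failure episode. Whether that interpretation is the right one for the real data is a judgment call.

```python
import pandas as pd

# Made-up toy data (NOT the challenge dataset): one row per cycle.
toy = pd.DataFrame({
    "Cycle":    [1, 2, 3, 4, 5, 6, 7, 8],
    "Preset_1": [1, 1, 2, 2, 3, 3, 1, 2],
    "Preset_2": [4, 4, 1, 1, 6, 6, 4, 1],
    "Fail":     [False, True, True, False, False, True, False, True],
})

# Number of cycles flagged as failing.
failing_cycles = int(toy["Fail"].sum())

# Number of distinct failure episodes: an episode starts whenever Fail
# switches from False to True (assumption: consecutive failing cycles
# belong to the same failure).
previous = toy["Fail"].shift(1, fill_value=False).astype(bool)
failure_episodes = int((toy["Fail"] & ~previous).sum())

# Failing cycles per setup configuration (Preset_1, Preset_2).
by_preset = (
    toy[toy["Fail"]]
    .groupby(["Preset_1", "Preset_2"])
    .size()
    .rename("failing_cycles")
)

print(failing_cycles, failure_episodes)
print(by_preset)
```

Reporting both the number of failing cycles and the number of episodes makes the counting convention explicit, since the two can differ whenever a failure spans several cycles.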

[(back to Table of Contents)](#index)

In [10]:
# Load data
df = pd.read_excel(DATA_FILE_PATH)

df.head()

|   | Cycle | Preset_1 | Preset_2 | Temperature | Pressure | VibrationX | VibrationY | VibrationZ | Frequency | Fail |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | 6 | 44.235186 | 47.657254 | 46.441769 | 64.820327 | 66.45452 | 44.48325 | False |
| 1 | 2 | 2 | 4 | 60.807234 | 63.172076 | 62.005951 | 80.714431 | 81.246405 | 60.228715 | False |
| 2 | 3 | 2 | 1 | 79.027536 | 83.03219 | 82.64211 | 98.254386 | 98.785196 | 80.993479 | False |
| 3 | 4 | 2 | 3 | 79.716242 | 100.508634 | 122.362321 | 121.363429 | 118.652538 | 80.315567 | False |
| 4 | 5 | 2 | 5 | 39.989054 | 51.764833 | 42.514302 | 61.03791 | 50.716469 | 64.245166 | False |


In [11]:
df.describe()

|   | Cycle | Preset_1 | Preset_2 | Temperature | Pressure | VibrationX | VibrationY | VibrationZ | Frequency |
|---|---|---|---|---|---|---|---|---|---|
| count | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 | 800.0 |
| mean | 400.5 | 1.98875 | 4.55125 | 69.263494 | 78.997945 | 73.860275 | 72.786878 | 71.866211 | 68.223449 |
| std | 231.0844 | 0.805875 | 2.293239 | 25.536252 | 32.501834 | 31.229631 | 32.739745 | 27.844616 | 29.138702 |
| min | 1.0 | 1.0 | 1.0 | 2.089354 | 3.480279 | 3.846343 | 10.057744 | 18.784169 | 4.380101 |
| 25% | 200.75 | 1.0 | 3.0 | 51.040134 | 55.508564 | 50.752461 | 48.523982 | 50.787638 | 45.861762 |
| 50% | 400.5 | 2.0 | 5.0 | 65.906716 | 75.014848 | 69.394953 | 65.50477 | 69.319237 | 65.664252 |
| 75% | 600.25 | 3.0 | 7.0 | 80.52722 | 99.30253 | 90.195059 | 94.075572 | 88.891205 | 90.097457 |
| max | 800.0 | 3.0 | 8.0 | 255.607829 | 189.995681 | 230.861142 | 193.569947 | 230.951134 | 178.090303 |
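
The remaining overview checks named at the start of this section (shape and presence of missing values) could be sketched as follows. To stay self-contained, the snippet rebuilds a tiny frame from the first three rows of the `df.head()` output instead of reading the XLSX file; the name `sample` is an illustrative assumption, and on the real data the same calls would be applied to `df`.

```python
import pandas as pd

columns = [
    "Cycle", "Preset_1", "Preset_2", "Temperature", "Pressure",
    "VibrationX", "VibrationY", "VibrationZ", "Frequency", "Fail",
]

# First three rows copied from the df.head() output.
rows = [
    [1, 3, 6, 44.235186, 47.657254, 46.441769, 64.820327, 66.45452, 44.48325, False],
    [2, 2, 4, 60.807234, 63.172076, 62.005951, 80.714431, 81.246405, 60.228715, False],
    [3, 2, 1, 79.027536, 83.03219, 82.64211, 98.254386, 98.785196, 80.993479, False],
]
sample = pd.DataFrame(rows, columns=columns)

# Shape: number of rows (cycles) and columns.
n_rows, n_cols = sample.shape

# Missing values: count per column, plus a single yes/no flag.
missing_per_column = sample.isna().sum()
has_missing = bool(missing_per_column.any())

print(f"rows={n_rows}, columns={n_cols}, any missing={has_missing}")
print(sample.dtypes)
```

In the `describe()` output, `count` is 800 for every numeric column, which is consistent with no missing values in those columns, but only if the dataset has exactly 800 rows, which `df.shape` would confirm.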
