# DATA PROCESSES ASSIGNMENT

## Understand the business
Clearly define your business and write down the problem you are solving. Carry out a basic analysis of whether a data solution will add value for your customer and your company.

TODO

In this regard, you must consider univariate and bivariate analysis, survival
curves (e.g., Kaplan–Meier), and any other analysis that may help to understand
the survival of a patient. You must also train and test different models to
predict survival, although the most important part of the technical execution
is the preceding analysis.
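
As a rough illustration of the survival-curve analysis mentioned above, the sketch below computes a Kaplan–Meier (product-limit) estimate by hand with pandas. The function name `kaplan_meier` and the toy values are made up for illustration. Using `days_hospital` as the duration and `exitus` as the event indicator is only an assumption about how the dataset might be used.

```python
import pandas as pd

def kaplan_meier(durations, events):
    """Return survival probabilities indexed by the times at which events occur.

    durations: time until the event or until censoring
    events: truthy if the event was observed, falsy if the record is censored
    """
    data = pd.DataFrame({
        "duration": list(durations),
        "event": [bool(e) for e in events],
    })
    survival = 1.0
    curve = {}
    # Walk through the distinct times at which at least one event occurred
    for t in sorted(data.loc[data["event"], "duration"].unique()):
        at_risk = (data["duration"] >= t).sum()
        observed = ((data["duration"] == t) & data["event"]).sum()
        survival *= 1 - observed / at_risk
        curve[t] = survival
    return pd.Series(curve, name="survival")

# Toy data (made up): days in hospital and whether the event was observed
toy_days = [2, 3, 3, 5, 8]
toy_event = [True, False, True, True, False]
km_curve = kaplan_meier(toy_days, toy_event)
print(km_curve)
```

Censored records (patients for whom the event was not observed) still count towards the at-risk group until their own duration has passed, which is what distinguishes this estimate from a plain proportion of survivors.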



Steps in Data Exploration and Preprocessing:\
Identification of variables and data types\
Analyzing the basic metrics\
Non-Graphical Univariate Analysis\
Graphical Univariate Analysis\
Bivariate Analysis\
Variable transformations\
Missing value treatment\
Outlier treatment\
Correlation Analysis\
Dimensionality Reduction

TODO
https://towardsai.net/p/data-analysis/exploratory-data-analysis-in-python-ebdf643a33f6

https://mode.com/blog/python-data-visualization-libraries/

## Data exploration
Do you have enough data? Ask this question first. If you don't have data, don't waste time and money building and evaluating models; stop. Many data science projects are not profitable. Collect data first and build a long-term pipeline for structured data collection. Go for digital transformation, add digital data collection, and wait. It is a long-term investment.
Collect your data and guard it with all your might. Data you collect yourself is a gold mine that only you have.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

In [74]:
filepath = "COVID19_data.csv"
col_names = ["id", "age", "sex", "days_hospital", "days_icu", "exitus", "destination", "temp", "heart_rate", "glucose", "sat_o2", "blood_pres_sys", "blood_pres_dias"]
df = pd.read_csv(filepath, header=0, names=col_names).drop("id", axis=1)

#### Identification of variables and data types

In [75]:
print(f"Number of patients: {df.shape[0]}")
print(f"Number of variables to study: {df.shape[1]}")
df.head().style

Number of patients: 2054
Number of variables to study: 12


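| | age | sex | days_hospital | days_icu | exitus | destination | temp | heart_rate | glucose | sat_o2 | blood_pres_sys | blood_pres_dias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | FEMALE | 4 | 0 | NO | | 37.0 | 0 | 0 | 92 | 0 | 0 |
| 1 | 18.0 | FEMALE | 4 | 0 | NO | ADMISSION | 37.3 | 105 | 0 | 97 | 0 | 0 |
| 2 | 21.0 | MALE | 7 | 0 | NO | | 38.5 | 112 | 0 | 95 | 85 | 47 |
| 3 | 21.0 | MALE | 10 | 0 | NO | ADMISSION | 39.2 | 113 | 0 | 97 | 0 | 0 |
| 4 | 22.0 | MALE | 4 | 0 | NO | | 36.3 | 80 | 0 | 92 | 111 | 70 |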


In [73]:
print(f"Data types in Covid-19 dataset: \n\n{df.dtypes}")

Data types in Covid-19 dataset: 

age                float64
sex                 object
days_hospital        int64
days_icu             int64
exitus              object
destination         object
temp               float64
heart_rate           int64
glucose              int64
sat_o2               int64
blood_pres_sys       int64
blood_pres_dias      int64
dtype: object


#### Analyzing basic metrics

In [76]:
df.describe()

| | age | days_hospital | days_icu | temp | heart_rate | glucose | sat_o2 | blood_pres_sys | blood_pres_dias |
|---|---|---|---|---|---|---|---|---|---|
| count | 2050.0 | 2054.0 | 2054.0 | 2054.0 | 2054.0 | 2054.0 | 2054.0 | 2054.0 | 2054.0 |
| mean | 70.856585 | 8.118793 | 0.355404 | 28.386319 | 70.787731 | 1.776047 | 73.39776 | 83.571568 | 48.32814 |
| std | 20.456931 | 6.177872 | 2.173721 | 15.419158 | 41.802038 | 20.434622 | 37.863716 | 67.450853 | 44.225438 |
| min | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 57.0 | 4.0 | 0.0 | 35.4 | 63.0 | 0.0 | 81.0 | 0.0 | 0.0 |
| 50% | 68.0 | 7.0 | 0.0 | 36.4 | 84.0 | 0.0 | 93.0 | 115.0 | 64.0 |
| 75% | 98.0 | 10.0 | 0.0 | 36.9 | 98.0 | 0.0 | 96.0 | 137.0 | 79.0 |
| max | 189.0 | 98.0 | 36.0 | 40.1 | 593.0 | 448.0 | 99.0 | 772.0 | 845.0 |
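
The summary above reports a minimum of 0 for `temp`, `heart_rate`, `sat_o2` and both blood pressure columns. One possible reading, which is an assumption and not something the dataset documents, is that 0 stands for "not recorded". Under that assumption, a minimal sketch of the missing-value treatment step could look like this. The toy frame uses made-up values and only borrows the column names.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) reusing column names from the dataset
toy = pd.DataFrame({
    "temp": [37.0, 0.0, 38.5],
    "heart_rate": [0, 105, 112],
    "sat_o2": [92, 97, 0],
})

# Assumption: a 0 in these vital-sign columns means "not recorded"
vital_cols = ["temp", "heart_rate", "sat_o2"]
cleaned = toy.copy()
cleaned[vital_cols] = toy[vital_cols].replace(0, np.nan)

# Count the missing values per column after the replacement
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Turning placeholders into `NaN` keeps them out of statistics such as the mean, since pandas skips missing values by default in `describe()`.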


#### Non-graphical univariate analysis
##### Unique values

In [82]:
df["age"].value_counts()

98.0     574
77.0     140
74.0     105
72.0      94
57.0      77
        ... 
15.0       1
106.0      1
105.0      1
18.0       1
102.0      1
Name: age, Length: 62, dtype: int64
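
The same non-graphical check can be applied to categorical columns such as `sex` or `exitus`. The sketch below uses a made-up toy column, not the real data, and shows the number of distinct values together with the relative frequency of each.

```python
import pandas as pd

# Toy column (made-up values) standing in for a categorical variable such as "sex"
toy_sex = pd.Series(["FEMALE", "MALE", "MALE", "FEMALE", "MALE"], name="sex")

# Number of distinct categories
n_unique = toy_sex.nunique()

# Relative frequency of each category, largest first
shares = toy_sex.value_counts(normalize=True)

print(f"Distinct values: {n_unique}")
print(shares)
```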

## Data preparation
Now that you have your data, prepare it: clean it, store it properly, keep it up to date, and arrange it in a meaningful way.

## Data problem research
Define your data problem and check whether service providers such as Google Cloud, AWS, or Azure, or any other API service, already offer a ready-made solution. Try not to reinvent the wheel by building generic models such as recommender systems or OCR. These are now available in plug-and-play form and are very cheap, cheaper than the time, energy, money, and computation power you would spend developing a half-accurate model.

## Predictive model and data transformation
Now quickly come up with simple models. Not every solution requires machine learning or deep learning. Rule-based models can work just as well, sometimes even better. Transform your data to suit the model you have selected. In the first iteration, try all the simple models and settle on a benchmark.
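
A rule-based baseline can be as small as the sketch below. The function name `rule_based_prediction` and both thresholds are hypothetical placeholders chosen for illustration. They are not clinically validated and are not derived from the dataset.

```python
def rule_based_prediction(age, sat_o2, age_threshold=80, sat_threshold=90):
    """Flag a patient as high risk using two hand-written rules.

    The thresholds are illustrative placeholders, not validated values.
    """
    return age >= age_threshold and sat_o2 < sat_threshold

# Toy patients (made up): (age, oxygen saturation)
toy_patients = [(85, 88), (40, 97), (90, 95)]
predictions = [rule_based_prediction(age, sat) for age, sat in toy_patients]
print(predictions)
```

A baseline like this gives later models something concrete to beat before any training takes place.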

## Testing validation
Define an evaluation metric based on your business problem and test your simple model against it. The resulting score will be your benchmark going forward.
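
Which metric fits depends on the business problem. As one possible example, the sketch below computes precision and recall for binary predictions in plain Python. The helper name `precision_recall` and the toy labels are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (True = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # true positives
    fp = sum(1 for t, p in pairs if not t and p)  # false positives
    fn = sum(1 for t, p in pairs if t and not p)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels (made up): actual outcomes and a model's predictions
toy_true = [True, True, False, False, True]
toy_pred = [True, False, False, True, True]
precision, recall = precision_recall(toy_true, toy_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

When missing a positive case is costlier than raising a false alarm, recall is usually the number to watch; when false alarms are expensive, precision matters more.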

## Model evaluation
Based on your validation, rework your model from step 6 and recalculate the evaluation metric to see whether you can beat the benchmark. A simple model may turn out to solve your problem, but this is rare.
One more insight you will get from steps 5 and 6 is whether your data is sufficient. If it is not, you may have to spend more time on steps 2 and 3.

## Solution deployment
Now that you are happy with the evaluation and the model solves your problem, deploy it according to your use case. The best way to consume your model is to build an API around it and integrate that API into the solution.
If possible, deploy your model using Docker, which makes deployment easier. Wrapping the model in an API also helps you upgrade it without downtime.
How you deploy depends primarily on how you want to consume the model.
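
To make the idea of an API around the model more concrete, here is a minimal sketch of a request handler that accepts a JSON body and returns a JSON response. It deliberately leaves out any web framework. `predict_risk` and `handle_request` are hypothetical names, and the rule inside `predict_risk` is only a stand-in for a trained model.

```python
import json

def predict_risk(features):
    """Stand-in for a trained model; the rule here is purely illustrative."""
    return features["age"] >= 80 and features["sat_o2"] < 90

def handle_request(body):
    """Turn a JSON request body into a JSON response string."""
    try:
        features = json.loads(body)
        result = {"high_risk": bool(predict_risk(features))}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON or missing fields produce an error payload
        result = {"error": f"invalid request: {exc}"}
    return json.dumps(result)

response = handle_request('{"age": 85, "sat_o2": 88}')
print(response)
```

Because callers only see the JSON contract, the model behind `predict_risk` can be replaced without changing the code that consumes the API.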

## Optimization
Optimization is the last step and perhaps the most crucial one. As and when you collect more data, you should upgrade your model and check whether it still serves its purpose. Did it solve the business problem of step 1? If not, start over.
