# Insiders - All In One Place

## Cardio Catch Diseases (CCD) company

Cardio Catch Diseases is a company that specializes in detecting early-stage heart disease. Its business model is service-based: the company offers early diagnosis of cardiovascular disease for a fee.

Currently, the diagnosis of cardiovascular disease is made manually by a team of specialists. The accuracy of the diagnosis varies between 55% and 65%, due to the complexity of the diagnosis and the fatigue of the staff, who work in shifts to minimize the risks. The cost of each diagnosis, including equipment and analysts' payroll, is around `R$1,000.00`.

The price of the diagnosis, paid by the client, varies according to the accuracy achieved by the team of specialists: the client pays `R$500.00` for each 5% of accuracy above 50%. For example, for an accuracy of 55% the diagnosis costs the client `R$500.00`, for an accuracy of 60% it costs `R$1,000.00`, and so on. If the diagnostic accuracy is 50%, the client does not pay for it.

Note that the variation in accuracy across the team of specialists means that the company sometimes operates at a profit (revenue greater than cost) and sometimes at a loss (revenue less than cost). This diagnostic instability makes the company's cash flow unpredictable.
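The pricing rule and the fixed cost can be combined into a small sketch. The function names `diagnosis_price` and `diagnosis_profit` are hypothetical, and the code assumes that only completed 5-point steps above 50% are billed (the text gives examples only for 55% and 60%).

```python
import math

COST_PER_DIAGNOSIS = 1000.0  # R$, equipment plus analysts' payroll (from the text)
PRICE_PER_STEP = 500.0       # R$ charged per 5 points of accuracy above 50%


def diagnosis_price(accuracy_pct: float) -> float:
    """Price paid by the client for a diagnosis with the given accuracy (in %)."""
    # Assumption: partial steps are not billed, so 57% is priced like 55%.
    steps = math.floor((accuracy_pct - 50.0) / 5.0)
    return PRICE_PER_STEP * max(steps, 0)


def diagnosis_profit(accuracy_pct: float) -> float:
    """Revenue minus cost for a single diagnosis."""
    return diagnosis_price(accuracy_pct) - COST_PER_DIAGNOSIS


for acc in (50, 55, 60, 65):
    print(acc, diagnosis_price(acc), diagnosis_profit(acc))
```

Under these assumptions, the current 55% to 65% range corresponds to a result per diagnosis anywhere between a loss of R$500.00 and a profit of R$500.00.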

---

## Project Objectives

Your goal as the Data Scientist hired by Cardio Catch Diseases is to create a tool that increases diagnostic accuracy and keeps that accuracy stable across all diagnoses.

So your job as Data Scientist is to create a patient classification tool with a stable accuracy. Along with the tool, you need to submit a report to the CEO of Cardio Catch Diseases presenting the results and answering the following questions, which he will likely ask on the day of your presentation:

1. What are the accuracy and precision of the tool?
2. How much profit will Cardio Catch Diseases have with the new tool?
3. How reliable is the result given by the new tool?
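Since the first question asks for both accuracy and precision, the sketch below shows how the two metrics differ for a binary classifier. The helper `accuracy_and_precision` and the labels are made up for illustration; they are not results from this project.

```python
def accuracy_and_precision(y_true, y_pred):
    """Return (accuracy, precision) for binary labels where 1 means disease."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    accuracy = (tp + tn) / len(pairs)
    # Precision is undefined when nothing is predicted positive; 0.0 is used here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
acc, prec = accuracy_and_precision(y_true, y_pred)
print(acc, prec)
```

Accuracy is the share of all diagnoses that are correct, while precision is the share of positive diagnoses that are correct.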

---

## Data

The dataset that will be used to create a solution for cardiovascular disease is available on the [Kaggle platform](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset).

This dataset contains 70,000 patient diagnoses. You will use this data to build your solution.

There are 3 types of input features:

* Objective: factual information;
* Examination: results of medical examination;
* Subjective: information given by the patient.


Features:

* Age | Objective Feature | age | int (days) |
* Height | Objective Feature | height | int (cm) |
* Weight | Objective Feature | weight | float (kg) |
* Gender | Objective Feature | gender | categorical code |
* Systolic blood pressure | Examination Feature | ap_hi | int |
* Diastolic blood pressure | Examination Feature | ap_lo | int |
* Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
* Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
* Smoking | Subjective Feature | smoke | binary |
* Alcohol intake | Subjective Feature | alco | binary |
* Physical activity | Subjective Feature | active | binary |
* Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

All of the dataset values were collected at the moment of medical examination.

# Summary
* [1. Dataframe](#1.)
    * [1.1 Descriptive Statistics](#1.1)
    * [1.2 New Features](#1.2)
    * [1.3 Data Cleaning](#1.3)
    * [1.4 Data Analysis](#1.4)
* [2. Customers Dataframe](#2.)
    * [2.1 Dataframe](#2.1)
    * [2.2 New Features](#2.2)
    * [2.3 Data Analysis](#2.3)
    * [2.4 Data Preprocessing](#2.4)
* [3. Model](#3.)
    * [3.1 K-Means](#3.1)
    * [3.2 Agglomerative Clustering](#3.2)
    * [3.3 DBSCAN](#3.3)
    * [3.4 Save Model](#3.4)
    * [3.5 Prediction](#3.5)
* [4. Conclusion](#4.)
    * [4.1 Who are the people eligible to participate in the Insiders program?](#4.1)
    * [4.2 How many customers will be part of the group?](#4.2)
    * [4.3 What are the main characteristics of these customers?](#4.3)
    * [4.4 What is the percentage of revenue contribution, coming from Insiders?](#4.4)
    * [4.5 What is this group's revenue expectation for the coming months?](#4.5)
    * [4.6 What are the conditions for a person to be eligible for Insiders?](#4.6)
    * [4.7 What are the conditions for a person to be removed from Insiders?](#4.7)
    * [4.8 What is the guarantee that the Insiders program is better than the rest of the base?](#4.8)
    * [4.9 What actions can the marketing team take to increase revenue?](#4.9)

# References
* [Cardiovascular disease](https://www.nhs.uk/conditions/cardiovascular-disease/)
* [9 doenças cardiovasculares comuns: sintomas, causas e tratamento](https://www.tuasaude.com/doencas-cardiovasculares/)
* [Understanding Your BMI Result](https://www.truthaboutweight.global/global/en/support/whats-your-body-mass-index-bmi.html?unit=metric&height=152&weight=34&bmi=14.72)

# Import libraries

In [55]:
# data analysis
import pandas as pd
import numpy as np

# 1. Dataframe <a class='anchor' id='1.'></a>

In [56]:
# removes the limit of max columns to be displayed in the notebook
pd.options.display.max_columns = None

In [57]:
df = pd.read_csv('csv/cardio_train.csv', delimiter=';')
df

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69995,99993,19240,2,168,76.0,120,80,1,1,1,0,1,0
69996,99995,22601,1,158,126.0,140,90,2,2,0,0,1,1
69997,99996,19066,2,183,105.0,180,90,3,1,0,1,0,1
69998,99998,22431,1,163,72.0,135,80,1,2,0,0,0,1


In [58]:
# let's take a look at the general info of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


* There are no missing values in this dataframe
* All features are numeric
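Both observations can also be checked programmatically instead of reading them off the `df.info()` output. The sketch below uses a small made-up frame named `sample` (a hypothetical stand-in with a few of the dataset's column names) so that it runs on its own.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names; the values are made up.
sample = pd.DataFrame({
    "age": [18393, 20228, 18857],
    "weight": [62.0, 85.0, 64.0],
    "cardio": [0, 1, 1],
})

# Number of missing values per column
missing_per_column = sample.isna().sum()

# True only if every column has a numeric dtype
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)

print(missing_per_column)
print(all_numeric)
```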

# 1.1 Descriptive Statistics <a class='anchor' id='1.1'></a>

In [59]:
df.describe()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0
mean,49972.4199,19468.865814,1.349571,164.359229,74.20569,128.817286,96.630414,1.366871,1.226457,0.088129,0.053771,0.803729,0.4997
std,28851.302323,2467.251667,0.476838,8.210126,14.395757,154.011419,188.47253,0.68025,0.57227,0.283484,0.225568,0.397179,0.500003
min,0.0,10798.0,1.0,55.0,10.0,-150.0,-70.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,25006.75,17664.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,50001.5,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
75%,74889.25,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0
max,99999.0,23713.0,2.0,250.0,200.0,16020.0,11000.0,3.0,3.0,1.0,1.0,1.0,1.0


### Observations:
* The age column is in days. We can create a new column with the age of the patient in years (int value)
* Create a new feature with the Body Mass Index (bmi) of the patient (float value)
* Create a new feature that classifies the bmi of the patient as | underweight, normal, overweight, obesity |
* Create a new feature that uses ap_hi and ap_lo to define whether the patient has hypertension | binary |
* There are negative values in the ap_hi and ap_lo columns, and blood pressure cannot be negative --> These cases need to be analysed
* There are outliers in the ap_hi and ap_lo columns (maximums of 16020 and 11000) --> These cases need to be analysed
* The minimum weight is 10 kg, which is implausible for an adult --> Check this (these) row(s)

# 1.2 New Features <a class='anchor' id='1.2'></a>
* age_year: The age of the patient in years | int | years
* bmi: The Body Mass Index of the patient | float | kg/m²
* bmi_class: The classification of the patient based on the bmi value | underweight, normal, overweight, obesity |
* hypertension: The classification of the patient pressure level | binary | 0: normal, 1: hypertensive |

## 1.2.1 age_year

We will assume that every year has 365 days (leap years are ignored), so:

<p style='text-align:center'>$age\_year = \left\lfloor \frac{age}{365} \right\rfloor$</p>

Where:
* age is in days
* age_year keeps only the integer part of the division

In [60]:
df['age_year'] = (df['age']/365).astype(int)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_year
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47
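As a cross-check on the conversion, integer (floor) division gives the same result as dividing and then casting to `int`, because ages are never negative. This is a minimal sketch on the five ages shown in the output; the variable names are hypothetical.

```python
import pandas as pd

# Ages in days taken from the first five rows shown in the output
ages_in_days = pd.Series([18393, 20228, 18857, 17623, 17474])

via_cast = (ages_in_days / 365).astype(int)  # divide, then truncate
via_floor_div = ages_in_days // 365          # floor division in one step

print(via_floor_div.tolist())
```

Floor division keeps the integer dtype throughout and avoids the intermediate float column.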


## 1.2.2 bmi

The Body Mass Index (BMI) is a weight-for-height index that classifies underweight, normal, overweight, and obesity in adults.

<p style='text-align:center'>$bmi = \frac{weight}{height^2}$</p>

Where:
* weight is in kg
* height is in m

Reference: [Understanding Your BMI Result](https://www.truthaboutweight.global/global/en/support/whats-your-body-mass-index-bmi.html?unit=metric&height=152&weight=34&bmi=14.72)

In [61]:
df['bmi'] = round(df['weight']/((df['height']/100)**2), 2)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_year,bmi
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50,21.97
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55,34.93
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51,23.51
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48,28.71
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47,23.01
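As a worked check of the formula, take the first row of the output (weight 62.0 kg, height 168 cm, that is, 1.68 m):

<p style='text-align:center'>$bmi = \frac{62.0}{1.68^2} = \frac{62.0}{2.8224} \approx 21.97$</p>

This matches the value in the bmi column for that row.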


## 1.2.3 bmi_class

The bmi_class feature assigns each patient to one of four categories (underweight, normal, overweight, obesity) according to the bmi thresholds below.

<img src="https://latex.codecogs.com/svg.image?bmi\_class&space;=&space;\begin{Bmatrix}underweight,&space;\,&space;if&space;\,&space;bmi&space;<&space;18.5&space;\\&space;normal,&space;\,&space;if&space;\,&space;18.5&space;\leq&space;bmi&space;<&space;25&space;\\&space;overweight,&space;\,&space;if&space;\,&space;25&space;\leq&space;bmi&space;<&space;30&space;\\&space;obesity,&space;\,&space;otherwise\end{Bmatrix}" title="bmi\_class = \begin{Bmatrix}underweight, \, if \, bmi < 18.5 \\ normal, \, if \, 18.5 \leq bmi < 25 \\ overweight, \, if \, 25 \leq bmi < 30 \\ obesity, \, otherwise\end{Bmatrix}" />

Reference: 
* [Understanding Your BMI Result](https://www.truthaboutweight.global/global/en/support/whats-your-body-mass-index-bmi.html?unit=metric&height=152&weight=34&bmi=14.72)
* [Equation Editor for online mathematics](https://latex.codecogs.com/)

In [62]:
df['bmi_class'] = ['underweight' if x < 18.5 else 'normal' if x < 25 else 'overweight' if x < 30 else 'obesity' for x in df['bmi']]
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_year,bmi,bmi_class
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50,21.97,normal
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55,34.93,obesity
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51,23.51,normal
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48,28.71,overweight
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47,23.01,normal
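The same thresholds can also be expressed with `pd.cut`, which keeps the bin edges in one place. This is an alternative sketch, not the method used in the notebook; the sample values are chosen to include the boundary cases 18.5, 25 and 30.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([17.0, 18.5, 21.97, 25.0, 28.71, 30.0, 34.93])

bmi_class = pd.cut(
    bmi,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["underweight", "normal", "overweight", "obesity"],
    right=False,  # intervals are closed on the left: [18.5, 25), [25, 30), ...
)

print(bmi_class.tolist())
```

Two differences from the list comprehension are worth keeping in mind: the result has a categorical dtype rather than plain strings, and a missing bmi stays missing instead of falling through to 'obesity'.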


## 1.2.4 hypertension

A person is classified as hypertensive if their systolic blood pressure is equal to or higher than 140 mmHg, or their diastolic blood pressure is equal to or higher than 90 mmHg.

<img src="https://latex.codecogs.com/svg.image?hypertension&space;=&space;\left\{\begin{matrix}1,\,&space;if\,ap\_hi&space;\geq&space;&space;140\,or\,ap\_lo&space;\geqslant&space;90&space;\\&space;0,\,&space;otherwise\end{matrix}\right." title="hypertension = \left\{\begin{matrix}1,\, if\,ap\_hi \geq 140\,or\,ap\_lo \geqslant 90 \\ 0,\, otherwise\end{matrix}\right." />


Reference: 
* [Brazilian Guidelines of Hypertension 2020 (Diretriz-HAS-2020)](http://departamentos.cardiol.br/sbc-dha/profissional/pdf/Diretriz-HAS-2020.pdf) - Page 13 (528)
* [Equation Editor for online mathematics](https://latex.codecogs.com/)

In [63]:
df['hypertension'] = [1 if (x >= 140 or y >= 90) else 0 for x,y in zip(df['ap_hi'], df['ap_lo'])]
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_year,bmi,bmi_class,hypertension
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50,21.97,normal,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55,34.93,obesity,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51,23.51,normal,0
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48,28.71,overweight,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47,23.01,normal,0
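The rule can also be written with vectorized boolean operations instead of a Python loop over `zip`. A minimal sketch on made-up readings follows (the frame `pressures` is hypothetical); the last row shows that a high diastolic value alone is enough to set the flag.

```python
import pandas as pd

pressures = pd.DataFrame({
    "ap_hi": [110, 140, 130, 150, 100, 130],
    "ap_lo": [80, 90, 70, 100, 60, 90],
})

# 1 if systolic >= 140 or diastolic >= 90, else 0
pressures["hypertension"] = (
    (pressures["ap_hi"] >= 140) | (pressures["ap_lo"] >= 90)
).astype(int)

print(pressures)
```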


# 1.3 Data Cleaning <a class='anchor' id='1.3'></a>

## 1.3.1 Body Mass Index (bmi)

* According to a BBC News report, a bmi of 13.6 was the lowest the doctor had ever seen. So I will exclude all values lower than that, assuming a typo in the height or weight measurement
* According to the Wikipedia list of heaviest people, the highest bmi listed is 251.1. So I will exclude all values higher than that, assuming a typo in the height or weight measurement

Reference: 
* [Jordan Burling death: Teen had 'lowest BMI doctor had seen'](https://www.bbc.com/news/uk-england-leeds-44488822)
* [List of heaviest people](https://en.wikipedia.org/wiki/List_of_heaviest_people)

In [64]:
low_bmi = df[df['bmi'] < 13.6].index
len(df[df['bmi'] < 13.6])

12

In [65]:
high_bmi = df[df['bmi'] > 251.1].index
len(df[df['bmi'] > 251.1])

3

In [66]:
df.drop(low_bmi,inplace=True)
df.drop(high_bmi,inplace=True)
df.shape

(69985, 17)

15 entries were removed from the dataset
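The two drops can be expressed as a single range filter with `Series.between`, which is inclusive on both ends by default and therefore keeps exactly the rows the cells above keep. This is a sketch on a made-up frame named `toy`, not the notebook's own code.

```python
import pandas as pd

BMI_MIN, BMI_MAX = 13.6, 251.1  # plausibility bounds from the references above

toy = pd.DataFrame({"bmi": [12.0, 13.6, 21.97, 251.1, 300.0]})

keep = toy["bmi"].between(BMI_MIN, BMI_MAX)  # True for rows inside the bounds
cleaned = toy[keep]
n_removed = int((~keep).sum())

print(cleaned["bmi"].tolist(), n_removed)
```

Filtering with a mask returns a new frame instead of mutating `df` in place, which makes the step easier to rerun.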

## 1.3.2 Systolic blood pressure (ap_hi)

* According to a Quora post, 60 mmHg is a very low value for systolic blood pressure. So I will exclude all values lower than that, assuming a typo (this also handles all negative values in this column)
* According to a Science ABC article, the highest systolic blood pressure ever recorded was 370 mmHg. So I will exclude all values higher than that, assuming a typo

Reference: 
* [What is the lowest blood pressure recorded on a living person?](https://www.quora.com/What-is-the-lowest-blood-pressure-recorded-on-a-living-person)
* [How High Can Blood Pressure Go?](https://www.scienceabc.com/eyeopeners/how-high-can-a-blood-pressure-go.html)

In [67]:
high_ap_hi = df[df['ap_hi'] > 370].index
len(df[df['ap_hi'] > 370])

39

In [68]:
low_ap_hi = df[df['ap_hi'] < 60].index
len(df[df['ap_hi'] < 60])

188

In [69]:
df.drop(high_ap_hi,inplace=True)
df.drop(low_ap_hi,inplace=True)
df.shape

(69758, 17)

227 entries were removed from the dataset

## 1.3.3 Diastolic blood pressure (ap_lo)

* We will use the lowest value reported in the Scientific Reports (nature.com) article, 47 mmHg, as the minimum acceptable value in the dataset
* According to a Science ABC article, the highest diastolic blood pressure ever recorded was 360 mmHg. So I will exclude all values higher than that, assuming a typo

Reference: 
* [Low on-treatment diastolic blood pressure and cardiovascular outcome: A post-hoc analysis using NHLBI SPRINT Research Materials](https://www.nature.com/articles/s41598-019-49557-4)
* [How High Can Blood Pressure Go?](https://www.scienceabc.com/eyeopeners/how-high-can-a-blood-pressure-go.html)

In [70]:
high_ap_lo = df[df['ap_lo'] > 360].index
len(df[df['ap_lo'] > 360])

949

In [71]:
low_ap_lo = df[df['ap_lo'] < 47].index
len(df[df['ap_lo'] < 47])

66

In [72]:
df.drop(high_ap_lo,inplace=True)
df.drop(low_ap_lo,inplace=True)
df.shape

(68743, 17)

1015 entries were removed from the dataset
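The counts reported in the three cleaning steps can be tallied to confirm that they are consistent with the final shape. The sketch below only does arithmetic on the numbers shown in the outputs above; the variable names are hypothetical.

```python
initial_rows = 70_000

# Rows removed per step, as reported in the cell outputs (low + high)
removed = {
    "bmi": 12 + 3,
    "ap_hi": 188 + 39,
    "ap_lo": 66 + 949,
}

total_removed = sum(removed.values())
remaining_rows = initial_rows - total_removed
removed_share = total_removed / initial_rows

print(total_removed, remaining_rows, f"{removed_share:.2%}")
```

In total, 1,257 rows are removed, about 1.8% of the original 70,000, which leaves the 68,743 rows reported by `df.shape`.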