# Planning 

## Reference
[Link](https://www.kaggle.com/bhadaneeraj/cardio-vascular-disease-detection) to the Kaggle project.

## The Problem Statement:
Build an application that classifies patients as healthy or suffering from cardiovascular disease, based on the attributes listed below.

## Features:

| Feature | Feature type | Column | Data type / values |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

# Import libraries

In [1]:
import pandas as pd
import numpy as np

# Loading dataset

In [2]:
data_raw = pd.read_csv('dataset/cardio_train.csv', delimiter=';')

# Descriptive analysis

## Dimensions

In [3]:
data_raw.head()

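| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 988 | 22469 | 1 | 155 | 69.0 | 130 | 80 | 2 | 2 | 0 | 0 | 1 | 0 |
| 1 | 989 | 14648 | 1 | 163 | 71.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 1 |
| 2 | 990 | 21901 | 1 | 165 | 70.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | 991 | 14549 | 2 | 165 | 85.0 | 120 | 80 | 1 | 1 | 1 | 1 | 1 | 0 |
| 4 | 992 | 23393 | 1 | 155 | 62.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |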


In [4]:
data_raw.shape

(69301, 13)

## Renaming columns

In [5]:
data_raw.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')

In [6]:
data_raw.columns = ['id', 'age', 'gender', 'height', 'weight', 'sys_press', 'dia_press',
                    'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']

In [7]:
data_raw.columns

Index(['id', 'age', 'gender', 'height', 'weight', 'sys_press', 'dia_press',
       'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio'],
      dtype='object')
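
Assigning a full list to `columns` only works as intended if the list matches the existing column order exactly. One possible alternative, sketched here on a one-row frame built from the `head()` output above (the name `sample` is only for illustration), is `DataFrame.rename` with a mapping, which touches just the two blood-pressure columns:

```python
import pandas as pd

# Column names as printed by `data_raw.columns`; the single row is copied
# from the `head()` output. `sample` is an illustrative stand-in for `data_raw`.
sample = pd.DataFrame(
    [[988, 22469, 1, 155, 69.0, 130, 80, 2, 2, 0, 0, 1, 0]],
    columns=["id", "age", "gender", "height", "weight", "ap_hi", "ap_lo",
             "cholesterol", "gluc", "smoke", "alco", "active", "cardio"],
)

# Rename only the two blood-pressure columns; all other names stay as they are.
sample = sample.rename(columns={"ap_hi": "sys_press", "ap_lo": "dia_press"})

print(list(sample.columns))
```

With a mapping, a typo in an old name leaves the columns unchanged instead of silently mislabelling them.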

## Feature engineering

### Changing `age` from days to years

In [8]:
# convert age from days to years (vectorized division; leap years are ignored)
data_raw['age_year'] = data_raw['age'] / 365

In [9]:
data_raw.sample(3)

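| | id | age | gender | height | weight | sys_press | dia_press | cholesterol | gluc | smoke | alco | active | cardio | age_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51448 | 74381 | 20542 | 2 | 166 | 50.0 | 140 | 90 | 1 | 1 | 0 | 0 | 1 | 1 | 56.279452 |
| 64475 | 93031 | 19652 | 2 | 167 | 66.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 0 | 53.841096 |
| 56483 | 81624 | 18281 | 2 | 181 | 91.0 | 119 | 74 | 2 | 1 | 0 | 0 | 1 | 0 | 50.084932 |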


In [10]:
data_raw.drop('age', axis=1, inplace=True)

In [11]:
data_raw.sample(3)

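| | id | gender | height | weight | sys_press | dia_press | cholesterol | gluc | smoke | alco | active | cardio | age_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34265 | 49941 | 1 | 160 | 62.0 | 120 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 44.068493 |
| 46103 | 66827 | 1 | 160 | 63.0 | 90 | 60 | 1 | 1 | 0 | 0 | 1 | 0 | 51.676712 |
| 53717 | 77611 | 1 | 156 | 52.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 0 | 47.641096 |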


In [12]:
data_raw.rename(columns={'age_year': 'age'}, inplace=True)

In [13]:
data_raw.sample(3)

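| | id | gender | height | weight | sys_press | dia_press | cholesterol | gluc | smoke | alco | active | cardio | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17616 | 26153 | 2 | 168 | 83.0 | 120 | 80 | 2 | 2 | 1 | 1 | 1 | 1 | 41.561644 |
| 30649 | 44747 | 1 | 154 | 85.0 | 120 | 70 | 1 | 1 | 0 | 0 | 0 | 0 | 47.558904 |
| 51406 | 74319 | 1 | 161 | 68.0 | 110 | 70 | 1 | 1 | 0 | 0 | 1 | 0 | 50.665753 |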


In [14]:
# round `age` values to 1 decimal
data_raw['age'] = data_raw['age'].round(1)

In [15]:
data_raw.sample(3)

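| | id | gender | height | weight | sys_press | dia_press | cholesterol | gluc | smoke | alco | active | cardio | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37577 | 54632 | 1 | 169 | 68.0 | 90 | 60 | 1 | 1 | 0 | 0 | 1 | 0 | 62.5 |
| 26351 | 38674 | 1 | 154 | 67.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 59.9 |
| 54535 | 78793 | 1 | 175 | 65.0 | 120 | 70 | 2 | 1 | 0 | 0 | 1 | 0 | 54.8 |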
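
The add, drop, rename, and round steps of this section could also be collapsed into one vectorized expression that overwrites `age` in place. A minimal sketch, using three `age` values (in days) taken from the first `head()` output; `sample` is an illustrative name, not part of the notebook:

```python
import pandas as pd

# Three `age` values in days, copied from the first `head()` output.
sample = pd.DataFrame({"age": [22469, 14648, 21901]})

# Divide by 365 and round to one decimal in a single step.
# Like the notebook, this treats every year as 365 days (no leap years).
sample["age"] = (sample["age"] / 365).round(1)

print(sample)
```

Overwriting the column directly avoids the temporary `age_year` column, at the cost of losing the original value in days.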


## Checking NA 

In [16]:
data_raw.isna().sum()

id             0
gender         0
height         0
weight         0
sys_press      0
dia_press      0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
age            0
dtype: int64

## Descriptive statistics

### Numerical attributes

In [17]:
data_raw.dtypes 

id               int64
gender           int64
height           int64
weight         float64
sys_press        int64
dia_press        int64
cholesterol      int64
gluc             int64
smoke            int64
alco             int64
active           int64
cardio           int64
age            float64
dtype: object

In [18]:
# central tendency: mean and median of each column
ct1 = pd.DataFrame( data_raw.apply( np.mean ) ).T
ct2 = pd.DataFrame( data_raw.apply( np.median ) ).T

# dispersion and shape: std, min, max, range, skew, kurtosis
d1 = pd.DataFrame( data_raw.apply( np.std ) ).T
d2 = pd.DataFrame( data_raw.apply( min ) ).T
d3 = pd.DataFrame( data_raw.apply( max ) ).T
d4 = pd.DataFrame( data_raw.apply( lambda x: x.max() - x.min() ) ).T
d5 = pd.DataFrame( data_raw.apply( lambda x: x.skew() ) ).T
d6 = pd.DataFrame( data_raw.apply( lambda x: x.kurtosis() ) ).T

# stack the statistics and transpose so that each row is one attribute
m = pd.concat([d2,d3,d4,ct1,ct2,d1,d5,d6]).T.reset_index()

# rename columns
m.columns = ["attributes","min","max","range","mean","median","std","skew","kurtosis"]
m

| | attributes | min | max | range | mean | median | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | id | 988.0 | 99999.0 | 99011.0 | 50471.480397 | 50494.0 | 28562.894266 | -0.001317 | -1.198215 |
| 1 | gender | 1.0 | 2.0 | 1.0 | 1.349519 | 1.0 | 0.476818 | 0.631203 | -1.601629 |
| 2 | height | 55.0 | 250.0 | 195.0 | 164.362217 | 165.0 | 8.205278 | -0.63404 | 7.860684 |
| 3 | weight | 10.0 | 200.0 | 190.0 | 74.203027 | 72.0 | 14.383365 | 1.00512 | 2.514805 |
| 4 | sys_press | -150.0 | 16020.0 | 16170.0 | 128.829584 | 120.0 | 154.774688 | 84.886144 | 7506.346872 |
| 5 | dia_press | -70.0 | 11000.0 | 11070.0 | 96.650092 | 80.0 | 189.094876 | 32.101546 | 1421.287364 |
| 6 | cholesterol | 1.0 | 3.0 | 2.0 | 1.366806 | 1.0 | 0.680265 | 1.58748 | 0.994715 |
| 7 | gluc | 1.0 | 3.0 | 2.0 | 1.226447 | 1.0 | 0.572242 | 2.39752 | 4.294805 |
| 8 | smoke | 0.0 | 1.0 | 1.0 | 0.088051 | 0.0 | 0.283369 | 2.907579 | 6.4542 |
| 9 | alco | 0.0 | 1.0 | 1.0 | 0.053881 | 0.0 | 0.225783 | 3.951845 | 13.617472 |
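
A similar summary can probably be built with a single `DataFrame.agg` call instead of eight separate `apply` calls. The sketch below runs on five rows copied from the `head()` output (the names `sample` and `summary` are illustrative). One caveat: pandas' `std` defaults to the sample estimate (`ddof=1`), whereas `np.std` defaults to `ddof=0`, so the `std` column may differ slightly from the table above.

```python
import pandas as pd

# Five rows of three columns, copied from the `head()` output.
sample = pd.DataFrame({
    "height": [155, 163, 165, 165, 155],
    "weight": [69.0, 71.0, 70.0, 85.0, 62.0],
    "ap_hi": [130, 110, 120, 120, 120],
})

# One `agg` call computes every named statistic for every column;
# transposing gives one row per attribute.
summary = sample.agg(["min", "max", "mean", "median", "std", "skew", "kurt"]).T

# The range is derived from the min and max columns.
summary["range"] = summary["max"] - summary["min"]

# Match the column naming of the table above and expose the attribute names.
summary = summary.rename(columns={"kurt": "kurtosis"})
summary = summary.rename_axis("attributes").reset_index()

print(summary)
```

Restricting the summary to genuinely numerical columns (for example, leaving out `id`) would likely make the table easier to read.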


## To Do
* Understand the difference between the `object` and `category` dtypes
* Continue exploring the data
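
As a starting point for the first to-do item, here is a minimal sketch of how the two dtypes differ, using the `cholesterol` codes from the `head()` output and the labels from the feature table. The variable names are illustrative, and the explicit `astype(object)` is only there so the example behaves the same across pandas versions.

```python
import pandas as pd

# Cholesterol codes from the `head()` output and their labels from the feature table.
labels = {1: "normal", 2: "above normal", 3: "well above normal"}
codes = pd.Series([2, 1, 1, 1, 1])

# `object`: each element is an independent Python string, with no notion
# of which values are allowed or how they are ordered.
as_object = codes.map(labels).astype(object)

# `category`: values are stored as small integer codes that point into a fixed
# list of categories, which can be ordered and can include unobserved levels.
cholesterol_dtype = pd.CategoricalDtype(categories=list(labels.values()), ordered=True)
as_category = as_object.astype(cholesterol_dtype)

print(as_object.dtype)                 # object
print(as_category.dtype)               # category
print(list(as_category.cat.categories))
print(as_category.cat.codes.tolist())  # integer position of each value

# Ordered categories support comparisons that follow the declared order.
above_normal = as_category > "normal"
print(above_normal.tolist())
```

Note that "well above normal" remains a known category even though it does not occur in these five rows, which a plain `object` column cannot express.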