# `Pandas` Student Exercises

**Please do not share this material on any platform or by any other means.**

`Pandas` - Analyzing data

Use the `Pandas` library to answer the questions about the cardiovascular disease (CVD) dataset, which is used to predict the presence or absence of CVD from patient examination results. The data is provided as a separate file.

- group by
- mean/median
- subsetting data with single/multiple criteria

---
**Data dictionary:**

There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

| Feature | Variable Type | Variable      | Value Type |
|---------|--------------|---------------|------------|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
---
# Preliminary data analysis

In [1]:
# Import all required modules
import pandas as pd

### Read cardiovascular_data.csv into a dataframe `df`, find the shape of the dataframe, and review the first 4 records

In [3]:
# Read the semicolon-delimited file; the delimiter is passed by keyword (sep)
df = pd.read_csv('cardiovascular_data.csv', sep=';')
df.head(4)

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
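The task also asks for the shape of the dataframe. The following is a minimal sketch of reading semicolon-delimited data and inspecting its shape. It uses an in-memory string (`sample`, a stand-in introduced here for illustration) holding two records copied from the preview above, so it runs without the data file.

```python
import io
import pandas as pd

# Two records copied from the preview above; stands in for cardiovascular_data.csv
sample = (
    "id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio\n"
    "0;18393;2;168;62.0;110;80;1;1;0;0;1;0\n"
    "1;20228;1;156;85.0;140;90;3;1;0;0;1;1\n"
)

demo = pd.read_csv(io.StringIO(sample), sep=";")  # semicolon-delimited input
n_rows, n_cols = demo.shape                        # shape is (rows, columns)
print(n_rows, n_cols)
print(demo.head(4))                                # head(n) returns at most n rows
```

On the full file, `df.shape` should agree with the entry count and column count reported by `df.info()`.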


### Summarize the data (how many records, what sort of variables, numerical or categorical)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


There are no missing values. Most columns are `int64`; only the `weight` feature is `float64`. Next, we check whether these data types make sense by counting the unique values in each column.

In [5]:
for c in df.columns:
    n = df[c].nunique()
    print(c)
    if n <= 3:
        print(n, sorted(df[c].value_counts().to_dict().items()))
    else:
        print(n)
    print(10 * '-')

id
70000
----------
age
8076
----------
gender
2 [(1, 45530), (2, 24470)]
----------
height
109
----------
weight
287
----------
ap_hi
153
----------
ap_lo
157
----------
cholesterol
3 [(1, 52385), (2, 9549), (3, 8066)]
----------
gluc
3 [(1, 59479), (2, 5190), (3, 5331)]
----------
smoke
2 [(0, 63831), (1, 6169)]
----------
alco
2 [(0, 66236), (1, 3764)]
----------
active
2 [(0, 13739), (1, 56261)]
----------
cardio
2 [(0, 35021), (1, 34979)]
----------


**Interpretation**: 
Some variables, e.g. `gender`, `cholesterol`, and `smoke`, are categorical but stored as `int64`. They can be converted into categorical variables.
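One possible way to do the conversion is sketched below on a small illustrative frame (`demo` and its values are made up for the example, following the coding in the data dictionary). Treating `cholesterol` as an ordered category is an assumption based on its levels 1 < 2 < 3.

```python
import pandas as pd

# Small illustrative frame; values follow the coding in the data dictionary
demo = pd.DataFrame({
    "gender": [2, 1, 1, 2],
    "cholesterol": [1, 3, 3, 1],
    "smoke": [0, 0, 1, 0],
    "weight": [62.0, 85.0, 64.0, 82.0],
})

# Number of unique values per column in a single call
print(demo.nunique())

# Nominal codes: plain category dtype
for col in ["gender", "smoke"]:
    demo[col] = demo[col].astype("category")

# Ordered levels (1 < 2 < 3), so sorting and comparisons keep their meaning
chol_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
demo["cholesterol"] = demo["cholesterol"].astype(chol_type)

print(demo.dtypes)
```

Columns that are averaged later, such as `cardio`, are easier to keep as integers, because the mean of a categorical column is not defined.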

---
# Part 2 

### Question 6: What is the CVD risk of men (smokers, aged 60 to 64 inclusive) who have a cholesterol level of 3 and a systolic blood pressure in the range [160, 180)?

**Compare the result with the result of Question 5.**

In [6]:
# Convert age from days to years, rounded to the nearest whole year
df['age_years'] = (df['age'] / 365).round()
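Rounding to the nearest year and counting completed years can place a patient in different age groups, which matters for a filter such as "aged 60 to 64". A short sketch of the difference, where the first two ages come from the preview and the third is an illustrative value chosen to show the gap:

```python
import pandas as pd

# Ages in days: first two from the preview, the third is illustrative
age_days = pd.Series([18393, 20228, 21810])

rounded = (age_days / 365).round()   # nearest year, as in the cell above
completed = age_days // 365          # completed years (floor division)

print(rounded.tolist())
print(completed.tolist())
```

Both conversions ignore leap days; which one fits depends on how "aged 60 to 64" is meant to be read.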

In [9]:
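# Male smokers aged 60 to 64 inclusive (gender code 2 is taken to mean men)
osm = df[(df['gender'] == 2) & (df['age_years'] >= 60) & (df['age_years'] <= 64) & (df['smoke'] == 1)]

In [12]:
# CVD risk: mean of the binary target for cholesterol level 3 and ap_hi in [160, 180)
osm[(osm['cholesterol'] == 3) & (osm['ap_hi'] >= 160) & (osm['ap_hi'] < 180)]['cardio'].mean()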

0.8636363636363636
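The same multi-criteria subset can be written as a single boolean mask. The sketch below uses `Series.between`, whose `inclusive="left"` option (pandas 1.3 or later) expresses the half-open interval [160, 180). The records in `demo` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Illustrative records only, not taken from the dataset
demo = pd.DataFrame({
    "gender":      [2, 2, 2, 1, 2],
    "age_years":   [60, 64, 65, 62, 61],
    "smoke":       [1, 1, 1, 1, 0],
    "cholesterol": [3, 3, 3, 3, 3],
    "ap_hi":       [160, 180, 170, 165, 170],
    "cardio":      [1, 1, 1, 0, 0],
})

mask = (
    (demo["gender"] == 2)                                # men, as assumed in the cells above
    & (demo["smoke"] == 1)
    & demo["age_years"].between(60, 64)                  # both ends inclusive by default
    & (demo["cholesterol"] == 3)
    & demo["ap_hi"].between(160, 180, inclusive="left")  # 160 <= ap_hi < 180
)

# The mean of a 0/1 column is the share of selected patients with CVD
risk = demo.loc[mask, "cardio"].mean()
print(mask.sum(), risk)
```

Only the first record satisfies every condition here: the second fails on `ap_hi` (180 is excluded), the third on age, the fourth on gender, and the fifth on smoking.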

### Question 7A: Using your Python skills, find out whether this statement is True or False. Explain.

Median BMI in the sample is within the range of normal BMI values.


In [32]:
# Convert height from centimetres to metres
df['height_m'] = df['height']/100
df['height_m']

0        1.68
1        1.56
2        1.65
3        1.69
4        1.56
         ... 
69995    1.68
69996    1.58
69997    1.83
69998    1.63
69999    1.70
Name: height_m, Length: 70000, dtype: float64

In [33]:
max_weight = df['weight'].max()
min_weight = df['weight'].min()

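# Note: this evaluates as max_weight + (min_weight / 2);
# the midpoint of the range would be (max_weight + min_weight) / 2
mid_weight = max_weight + min_weight / 2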
mid_weight

205.0

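In [34]:
max_height = df['height_m'].max()
min_height = df['height_m'].min()

# Note: this evaluates as max_height + (min_height / 2);
# the midpoint of the range would be (max_height + min_height) / 2
mid_height = max_height + min_height / 2
mid_height

2.775

In [49]:
# BMI computed from the largest weight and the largest height
max_weight/max_height**2

32.0

In [50]:
# BMI computed from the smallest weight and the smallest height
min_weight/min_height**2

33.05785123966942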

In [48]:
mid_weight / mid_height**2

26.621215810405

In [67]:
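# BMI = weight (kg) / height (m) squared; take the median over the sample
(df['weight'] / df['height_m'] ** 2).median()

26.374068120774975

**Interpretation**: 
False. The median BMI is about 26.4, which is above the upper bound of the normal range (18.5 to 25).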
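The range check itself can be made explicit in code. This is a minimal sketch using the first three records from the preview; the names `NORMAL_BMI` and `in_range` are introduced here for illustration.

```python
import pandas as pd

NORMAL_BMI = (18.5, 25.0)  # normal range quoted in Question 8

# First three records from the preview (height in cm, weight in kg)
demo = pd.DataFrame({"height": [168, 156, 165], "weight": [62.0, 85.0, 64.0]})

bmi = demo["weight"] / (demo["height"] / 100) ** 2
median_bmi = bmi.median()

# True only if the median lies inside the normal range
in_range = bool(NORMAL_BMI[0] <= median_bmi <= NORMAL_BMI[1])
print(round(median_bmi, 2), in_range)
```

For these three records the median falls inside the range. Applied to the full-sample median of about 26.37, the same comparison gives `False`.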

### Question 7B: Using your Python skills, find out whether this statement is True or False. Explain.

The BMI for women is on average higher than for men.

In [70]:
# BMI = weight (kg) / height (m) squared; height is stored in cm
df['bmi'] = df['weight'] / (df['height']/100)**2

In [71]:
df.groupby('gender')['bmi'].mean()

gender
1    27.987583
2    26.754442
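Name: bmi, dtype: float64

**Interpretation**: 
True, assuming gender code 1 denotes women and code 2 denotes men (as in Question 6): the mean BMI is about 27.99 for code 1 and about 26.75 for code 2.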
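A group comparison like this can report several statistics at once, which helps to see whether the mean and the median tell the same story. A small sketch using the first four records from the preview:

```python
import pandas as pd

# First four records from the preview
demo = pd.DataFrame({
    "gender": [1, 1, 2, 2],
    "height": [156, 165, 168, 169],
    "weight": [85.0, 64.0, 62.0, 82.0],
})
demo["bmi"] = demo["weight"] / (demo["height"] / 100) ** 2

# One row per gender code, one column per statistic
summary = demo.groupby("gender")["bmi"].agg(["mean", "median", "count"])
print(summary)
```

The `count` column is a useful sanity check, since the two gender groups in the full dataset differ considerably in size.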

### Question 7C: Using your Python skills, find out whether this statement is True or False. Explain.

For healthy (`cardio` = 0), non-drinking men, BMI is closer to the norm than for healthy, non-drinking women.

In [57]:
df[(df['alco']==0) & (df['cardio']==0)]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,height_m,bmi
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50.0,1.68,0.369048
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,48.0,1.56,0.358974
5,8,21914,1,151,67.0,120,80,2,2,0,0,0,0,60.0,1.51,0.443709
6,9,22113,1,157,93.0,130,80,3,1,0,0,1,0,61.0,1.57,0.592357
8,13,17668,1,158,71.0,110,70,1,1,0,0,1,0,48.0,1.58,0.449367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69987,99979,18852,1,151,49.0,120,80,1,1,0,0,1,0,52.0,1.51,0.324503
69988,99981,21978,1,160,59.0,110,70,1,1,0,0,1,0,60.0,1.60,0.368750
69991,99988,20609,1,159,72.0,130,90,2,2,0,0,1,0,56.0,1.59,0.452830
69995,99993,19240,2,168,76.0,120,80,1,1,1,0,1,0,53.0,1.68,0.452381


In [72]:
df.groupby(by=['gender','alco','cardio'])['bmi'].mean()

gender  alco  cardio
1       0     0         26.845407
              1         29.052771
        1     0         28.671457
              1         30.812347
2       0     0         25.872638
              1         27.522450
        1     0         26.097220
              1         28.226569
Name: bmi, dtype: float64

In [74]:
# Healthy (cardio == 0) and non-drinking (alco == 0) patients only
mask = (df['cardio'] == 0) & (df['alco'] == 0)
df[mask].groupby('gender')['bmi'].mean()

gender
1    26.845407
2    25.872638
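Name: bmi, dtype: float64

**Interpretation**: 
True, assuming gender code 2 denotes men (as in Question 6): the mean BMI of healthy, non-drinking men is about 25.87, against about 26.85 for women. Both are above the normal upper bound of 25, and the men's value is closer to it.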
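"Closer to the norm" can be quantified as the distance to the nearest bound of the normal range. That reading is an assumption, since the exercise does not define it. The sketch below applies it to the two group means printed above.

```python
import pandas as pd

NORMAL_LOW, NORMAL_HIGH = 18.5, 25.0  # normal BMI range quoted in Question 8

# Group means copied from the output above (healthy, non-drinking patients)
mean_bmi = pd.Series({1: 26.845407, 2: 25.872638}, name="bmi")
mean_bmi.index.name = "gender"

# Distance to the nearest bound of the normal range; 0 if the value is inside it
distance = (mean_bmi - mean_bmi.clip(lower=NORMAL_LOW, upper=NORMAL_HIGH)).abs()
closest = distance.idxmin()  # gender code with the smallest distance

print(distance)
print(closest)
```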

### Question 8: Create a new feature – BMI ([Body Mass Index](https://en.wikipedia.org/wiki/Body_mass_index))
To do this, divide weight in kilograms by the square of the height in meters. Normal BMI values are said to be from 18.5 to 25. 

In [66]:
# Height is stored in cm, so divide by 100 to get metres before squaring
df['bmi'] = df['weight'] / (df['height']/100)**2
df['bmi']

0        21.967120
1        34.927679
2        23.507805
3        28.710479
4        23.011177
           ...    
69995    26.927438
69996    50.472681
69997    31.353579
69998    27.099251
69999    24.913495
Name: bmi, Length: 70000, dtype: float64
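As a worked check of the formula, the first record (62.0 kg, 168 cm) can be computed by hand with a small helper. The function `bmi` is a hypothetical name used only for this sketch.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by the squared height in metres."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# First record from the preview: 62.0 kg and 168 cm
first = bmi(62.0, 168)
print(round(first, 6))
```

The result agrees with the first value of `df['bmi']` shown above (21.967120).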

**Student Exercises Part 2 Done**