# `Pandas` Student Exercises

**Please do not share this material on any platform or by any other means.**

    `Pandas` - Analyzing data 
    Use the `pandas` library to answer questions about a cardiovascular disease (CVD) dataset, whose goal is to predict the presence or absence of CVD from patient examination results. The data is provided as a separate file.

    - group by 
    - mean/median
    - subsetting data with single/multiple criteria

---
**Data dictionary:**

There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

| Feature | Variable Type | Variable      | Value Type |
|---------|--------------|---------------|------------|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
---

---
#  Preliminary data analysis

In [1]:
# Import all required modules
import pandas as pd
import numpy as np

# Import plotting modules
import seaborn as sns
sns.set()
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker
%matplotlib inline

To use the `seaborn` library for visual analysis, apply the settings below:

In [3]:
# Tune the visual settings for figures in `seaborn`
sns.set_context(
    "notebook", 
    font_scale=1.5,       
    rc={ "figure.figsize": (11, 8), "axes.titlesize": 18 })

from matplotlib import rcParams
rcParams['figure.figsize'] = 11, 8

### Read `cardiovascular_data.csv` into a DataFrame `df`, find the shape of the DataFrame, and review the first 4 records

In [7]:
# Load the semicolon-delimited CSV file and display the DataFrame
df = pd.read_csv('cardiovascular_data.csv', delimiter=';')
df

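|       | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|-------|----|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 99993 | 19240 | 2 | 168 | 76.0 | 120 | 80 | 1 | 1 | 1 | 0 | 1 | 0 |
| 69996 | 99995 | 22601 | 1 | 158 | 126.0 | 140 | 90 | 2 | 2 | 0 | 0 | 1 | 1 |
| 69997 | 99996 | 19066 | 2 | 183 | 105.0 | 180 | 90 | 3 | 1 | 0 | 1 | 0 | 1 |
| 69998 | 99998 | 22431 | 1 | 163 | 72.0 | 135 | 80 | 1 | 2 | 0 | 0 | 0 | 1 |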
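
The exercise also asks for the shape and the first 4 records, which the cell above does not show explicitly. A minimal sketch of those two steps follows. It reads a small in-memory sample instead of the real file; the sample repeats the first five rows displayed above, and the header row and the name `sample` are assumptions made for illustration.

```python
import io

import pandas as pd

# Assumed file layout: a header row followed by semicolon-delimited records.
# The five records are the first rows displayed in the notebook output.
sample_csv = """id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
0;18393;2;168;62.0;110;80;1;1;0;0;1;0
1;20228;1;156;85.0;140;90;3;1;0;0;1;1
2;18857;1;165;64.0;130;70;3;1;0;0;0;1
3;17623;2;169;82.0;150;100;1;1;0;0;1;1
4;17474;1;156;56.0;100;60;1;1;0;0;0;0
"""

sample = pd.read_csv(io.StringIO(sample_csv), sep=";")

# shape is a (rows, columns) tuple
print(sample.shape)

# head(4) returns the first 4 records
first_four = sample.head(4)
print(first_four)
```

On the real data the same calls would be `df.shape` and `df.head(4)`.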


### Summarize the data (how many records, what sort of variables, numerical or categorical)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


There are no missing values. All columns are stored as int64 except `weight`, which is float64. We still need to check whether these data types make sense, so the code below counts the unique values in each column.

In [8]:
for c in df.columns:
    n = df[c].nunique()
    print(c)
    if n <= 3:
        print(n, sorted(df[c].value_counts().to_dict().items()))
    else:
        print(n)
    print(10 * '-')

id
70000
----------
age
8076
----------
gender
2 [(1, 45530), (2, 24470)]
----------
height
109
----------
weight
287
----------
ap_hi
153
----------
ap_lo
157
----------
cholesterol
3 [(1, 52385), (2, 9549), (3, 8066)]
----------
gluc
3 [(1, 59479), (2, 5190), (3, 5331)]
----------
smoke
2 [(0, 63831), (1, 6169)]
----------
alco
2 [(0, 66236), (1, 3764)]
----------
active
2 [(0, 13739), (1, 56261)]
----------
cardio
2 [(0, 35021), (1, 34979)]
----------


**Interpretation**: 
Several variables, such as `gender`, `cholesterol`, and `smoke`, take only two or three distinct values but are stored as int64. They are categorical and can be converted to a categorical data type.
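
One possible way to do this conversion is sketched below on a small sample taken from the first five rows shown earlier. Treating `cholesterol` as an ordered category (1 < 2 < 3) is an assumption based on the data dictionary, and the names `sample`, `level_dtype`, and `converted` are hypothetical.

```python
import pandas as pd

# Small sample built from the first five rows of the dataset shown earlier
sample = pd.DataFrame({
    "gender": [2, 1, 1, 2, 1],
    "cholesterol": [1, 3, 3, 1, 1],
    "smoke": [0, 0, 0, 0, 0],
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
})

# Assumption: cholesterol levels are ordered (normal < above normal < well above normal)
level_dtype = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)

# astype with a column-to-dtype mapping returns a new DataFrame
# and leaves the columns that are not listed unchanged
converted = sample.astype({
    "gender": "category",
    "smoke": "category",
    "cholesterol": level_dtype,
})

print(converted.dtypes)
```

Numeric columns such as `weight` keep their original type, so means and medians still work on them.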

---
# Part 1 Basic observations

### Question 1: How many men and women are present in this dataset? Values of the `gender` feature were not given (whether "1" stands for women or for men) – if we assume men are taller on average, can you identify which value corresponds to which gender?
1. 45530 women and 24470 men
2. 45530 men and 24470 women
3. 45470 women and 24530 men
4. 45470 men and 24530 women

In [18]:
# Compare mean height by gender code; the taller group is assumed to be men
df.groupby('gender', as_index=False)['height'].mean()

|   | gender | height |
|---|--------|--------|
| 0 | 1 | 161.355612 |
| 1 | 2 | 169.947895 |


In [19]:
df[['gender', 'height']].groupby('gender').mean()

| gender | height |
|--------|--------|
| 1 | 161.355612 |
| 2 | 169.947895 |


In [28]:
df.gender.value_counts()

1    45530
2    24470
Name: gender, dtype: int64
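
The reasoning for Question 1 (the code with the larger mean height is taken to be men, then the codes are counted) can be sketched in a single pass. The sample below reuses the `gender` and `height` values of the first five rows shown earlier; the names `sample`, `male_code`, and `labels` are hypothetical.

```python
import pandas as pd

# gender and height of the first five rows of the dataset shown earlier
sample = pd.DataFrame({
    "gender": [2, 1, 1, 2, 1],
    "height": [168, 156, 165, 169, 156],
})

# Mean height per gender code
mean_height = sample.groupby("gender")["height"].mean()

# Assumption from the question: men are taller on average,
# so the code with the largest mean height is labelled "men"
male_code = int(mean_height.idxmax())
labels = {int(code): ("men" if int(code) == male_code else "women")
          for code in mean_height.index}

# Count records per label
counts = sample["gender"].map(labels).value_counts()
print(male_code)
print(counts)
```

With the full-data means shown above (161.36 cm for code 1, 169.95 cm for code 2), the same logic labels code 2 as men.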

### Question 2: Which gender more often reports consuming alcohol - men or women?
1. women
2. men

In [24]:
# alco is binary, so its mean per gender is the share of each gender reporting alcohol intake
df[['gender', 'alco']].groupby('gender').mean()

| gender | alco |
|--------|------|
| 1 | 0.0255 |
| 2 | 0.106375 |
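
An alternative view of the same comparison is a normalized cross-tabulation, which shows the share of both answers per gender. This is only a sketch on made-up values (the `sample` frame below is hypothetical and not taken from the dataset).

```python
import pandas as pd

# Hypothetical sample: four records per gender code
sample = pd.DataFrame({
    "gender": [1, 1, 1, 1, 2, 2, 2, 2],
    "alco":   [0, 0, 0, 1, 1, 1, 0, 0],
})

# normalize="index" divides each row by its total,
# so every row holds the shares of alco = 0 and alco = 1 for that gender
table = pd.crosstab(sample["gender"], sample["alco"], normalize="index")
print(table)
```

The column for `alco` = 1 matches what `groupby('gender').mean()` returns for a binary feature.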


### Question 3: What is the difference between the percentages of smokers among men and women (rounded)?
1. 4
2. 16
3. 20
4. 24

In [27]:
# smoke is binary, so its mean per gender is the share of smokers in each gender
df[['gender', 'smoke']].groupby('gender').mean()

| gender | smoke |
|--------|-------|
| 1 | 0.017856 |
| 2 | 0.21888 |


In [29]:
men_smokers = df[df.gender == 2]['smoke'].mean()
men_smokers

0.21888026154474868

In [30]:
women_smokers = df[df.gender == 1]['smoke'].mean()
women_smokers

0.017856358444981332

In [32]:
# The difference is a fraction; 0.2 corresponds to 20 percentage points
round(men_smokers - women_smokers, 2)

0.2
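
Since the answer options are given in percent, it may help to compute the difference directly in percentage points. The sketch below does this with one `groupby` on made-up values; the `sample` frame is hypothetical, and gender code 2 is taken to be men as in Question 1.

```python
import pandas as pd

# Hypothetical sample: four records per gender code
sample = pd.DataFrame({
    "gender": [1, 1, 1, 1, 2, 2, 2, 2],
    "smoke":  [0, 0, 0, 1, 1, 1, 0, 0],
})

# Share of smokers per gender, expressed in percent
smoke_rate = sample.groupby("gender")["smoke"].mean() * 100

# Difference between men (code 2) and women (code 1) in percentage points
diff_points = round(float(smoke_rate.loc[2] - smoke_rate.loc[1]))
print(diff_points)
```

Applied to the full-data rates shown above (about 21.9% and 1.8%), the difference is about 20 percentage points.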

### Question 4: Calculate ``age_years`` feature – round age to the nearest number of years. (this is a prep question for the following questions)

In [27]:
df['age_years'] = (df['age'] / 365).round()
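
As a quick check of the conversion, the sketch below applies it to the `age` values of the first five rows shown earlier. It uses 365 days per year, as in the cell above, and ignores leap years; casting to `int` is an optional extra step, and the name `age_days` is hypothetical.

```python
import pandas as pd

# age (in days) of the first five rows of the dataset shown earlier
age_days = pd.Series([18393, 20228, 18857, 17623, 17474], name="age")

# Divide by 365 days, round to the nearest year, and store as integers
age_years = (age_days / 365).round().astype(int)
print(age_years.tolist())  # [50, 55, 52, 48, 48]
```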

### Question 5: Find the percentage of men aged 60-64 (inclusive) who have CVD (`cardio` = 1), among men who smoke, have systolic blood pressure < 120, and have a cholesterol level of 1. This is the CVD risk of this group.

1. 50
2. 26
3. 29
4. 42

In [32]:
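# Keep smokers aged 60-64 with ap_hi < 120 and cholesterol level 1, then compare the CVD rate by gender
mask = (df['age_years'] >= 60) & (df['age_years'] <= 64) & (df['smoke'] == 1) & (df['ap_hi'] < 120) & (df['cholesterol'] == 1)
df[mask].groupby('gender')['cardio'].mean()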


gender
1    0.428571
2    0.256410
Name: cardio, dtype: float64
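
The same multi-criteria subset can be written with one named condition per criterion, which makes each filter easier to check. This is a sketch on made-up records: the `sample` frame is hypothetical, and gender code 2 is taken to be men as in Question 1.

```python
import pandas as pd

# Hypothetical records; only the first four satisfy every criterion
sample = pd.DataFrame({
    "gender":      [2, 2, 2, 2, 2, 2, 2, 2, 1],
    "age_years":   [60, 64, 62, 62, 65, 61, 61, 61, 61],
    "smoke":       [1, 1, 1, 1, 1, 0, 1, 1, 1],
    "ap_hi":       [110, 115, 119, 119, 110, 110, 120, 110, 110],
    "cholesterol": [1, 1, 1, 1, 1, 1, 1, 2, 1],
    "cardio":      [1, 0, 0, 0, 1, 1, 1, 1, 1],
})

is_man = sample["gender"] == 2
in_age_range = sample["age_years"].between(60, 64)  # inclusive on both ends
is_smoker = sample["smoke"] == 1
low_pressure = sample["ap_hi"] < 120
normal_chol = sample["cholesterol"] == 1

# Combine the criteria with element-wise AND
mask = is_man & in_age_range & is_smoker & low_pressure & normal_chol

# cardio is binary, so its mean times 100 is the percentage with CVD
risk_pct = sample.loc[mask, "cardio"].mean() * 100
print(mask.sum(), risk_pct)
```

On the full data, the gender 2 value of 0.256410 shown above corresponds to about 26%.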

In [35]:
old_smoking_men = df[(df['gender'] == 2) & (df['age_years'] >= 60) & (df['age_years'] <= 64) & (df['smoke'] == 1)]

**Part 1 Done**