# `Pandas` Student Exercises

**Please do not share this material on any platform or by any other means.**

    `Pandas` - Analyzing data 
    Use the `Pandas` library to answer questions about a cardiovascular disease (CVD) dataset, whose goal is to predict the presence or absence of CVD from patient examination results. The data is provided as a separate file. 

    - group by 
    - mean/median
    - subsetting data with single/multiple criteria

---
**Data dictionary:**

There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

| Feature | Variable Type | Variable      | Value Type |
|---------|--------------|---------------|------------|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
---
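
Because `age` is recorded in days, it can be easier to read in years. The following is a minimal sketch, assuming a 365.25-day year. The `age_years` column name is hypothetical and not part of the dataset. The two sample ages are taken from the first records shown in this notebook.

```python
import pandas as pd

# Two sample ages (in days) taken from the first records of the dataset
sample = pd.DataFrame({"age": [18393, 20228]})

# Assumption: 365.25 days per year; 'age_years' is a hypothetical helper column
sample["age_years"] = (sample["age"] / 365.25).astype(int)  # truncates to whole years
print(sample)
```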


In [1]:
# Import all required modules
import pandas as pd

### Read cardiovascular_data.csv as a dataframe 'df', find the shape of the dataframe, and review the first 4 records

In [2]:
# add your explanation and code here
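# the file is semicolon-separated; pass the separator by keyword (pandas 2.x no longer accepts it positionally)
df = pd.read_csv('cardiovascular_data.csv', sep=';')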
df.head(4)

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
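
The heading also asks for the shape, which `df.head` alone does not report. Below is a small sketch of reading semicolon-separated data and checking its shape. It uses an in-memory sample (the first two records above, with only three of the columns) instead of the real file, so the name `demo` and the reduced column set are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the real semicolon-separated file
csv_text = "id;age;gender\n0;18393;2\n1;20228;1\n"

demo = pd.read_csv(io.StringIO(csv_text), sep=";")

print(demo.shape)    # tuple of (number of rows, number of columns)
print(demo.head(4))  # at most the first 4 records
```

On the real dataframe, the same `.shape` attribute would be read from `df`.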


### Summarize the data (how many records, what sort of variables, numerical or categorical)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


There are no missing values. Most columns are `int64`; only the `weight` feature is `float64`. We still need to check whether these data types make sense for each variable. See the code below. 

In [4]:
for c in df.columns:
    n = df[c].nunique()
    print(c)
    if n <= 3:
        print(n, sorted(df[c].value_counts().to_dict().items()))
    else:
        print(n)
    print(10 * '-')

id
70000
----------
age
8076
----------
gender
2 [(1, 45530), (2, 24470)]
----------
height
109
----------
weight
287
----------
ap_hi
153
----------
ap_lo
157
----------
cholesterol
3 [(1, 52385), (2, 9549), (3, 8066)]
----------
gluc
3 [(1, 59479), (2, 5190), (3, 5331)]
----------
smoke
2 [(0, 63831), (1, 6169)]
----------
alco
2 [(0, 66236), (1, 3764)]
----------
active
2 [(0, 13739), (1, 56261)]
----------
cardio
2 [(0, 35021), (1, 34979)]
----------


**Interpretation**: 
Several variables are categorical (e.g. `gender`, `cholesterol`, `smoke`) but are stored as `int64`. We can convert them to categorical variables. 
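
One possible way to do this conversion is `DataFrame.astype` with the `"category"` dtype. This is a sketch on a toy frame built from the first four records shown earlier. The names `demo` and `cat_cols` are illustrative, and the choice of columns to convert is an assumption.

```python
import pandas as pd

# Toy frame with values from the first four records of the dataset
demo = pd.DataFrame({
    "gender": [2, 1, 1, 2],
    "cholesterol": [1, 3, 3, 1],
    "smoke": [0, 0, 0, 0],
    "weight": [62.0, 85.0, 64.0, 82.0],
})

# Columns treated as categorical here (an assumption for illustration)
cat_cols = ["gender", "cholesterol", "smoke"]

# Convert only the selected columns; 'weight' keeps its float64 dtype
demo = demo.astype({col: "category" for col in cat_cols})
print(demo.dtypes)
```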

## PART 3

### Question 9: Filter out the following patient segments 

(we consider these as erroneous data)

- diastolic pressure is higher than systolic 
- height is strictly less than 2.5 percentile (Use `pd.Series.quantile` to compute this value. If you are not familiar with the function, please read the docs.)
- height is strictly more than 97.5 percentile
- weight is strictly less than 2.5 percentile
- weight is strictly more than 97.5 percentile
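
The five criteria above can be combined into a single boolean mask. The sketch below illustrates the idea on a small invented frame (`toy`). The values are made up so that each criterion removes at least one row, and the variable names are assumptions, not part of the exercise.

```python
import pandas as pd

# Invented toy data, not taken from the real dataset
toy = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180, 200],
    "weight": [50.0, 60.0, 65.0, 70.0, 75.0, 80.0, 120.0],
    "ap_hi":  [120, 110, 130, 80, 125, 140, 135],
    "ap_lo":  [80, 70, 85, 120, 80, 90, 85],
})

# 2.5th and 97.5th percentiles, kept as plain scalars
h_lo, h_hi = toy["height"].quantile([0.025, 0.975])
w_lo, w_hi = toy["weight"].quantile([0.025, 0.975])

# Keep rows that violate none of the criteria; `between` is inclusive on both
# ends, so only values strictly outside the percentile bounds are dropped
keep = (
    (toy["ap_lo"] <= toy["ap_hi"])
    & toy["height"].between(h_lo, h_hi)
    & toy["weight"].between(w_lo, w_hi)
)
kept = toy[keep]
print(kept)
```

Keeping the percentile bounds as scalars avoids adding helper columns to the dataframe.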

In [7]:
# add your explanation and code here
# assuming the resulting data is called "good_data"

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


In [12]:
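# True for valid rows: only rows where diastolic is strictly higher than systolic count as erroneous
df['ap'] = df['ap_hi'] >= df['ap_lo']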

In [19]:
df['height_min'] = df['height'].quantile(0.025)

In [20]:
df['height_max'] = df['height'].quantile(0.975)

In [21]:
df['weight_min'] = df['weight'].quantile(0.025)

In [22]:
df['weight_max'] = df['weight'].quantile(0.975)

In [39]:
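# Note: this mask covers only the height and weight criteria; the blood-pressure
# flag df['ap'] is not applied, so the counts below reflect height/weight filtering only
mask = (
    (df['height'] >= df['height_min'])
    & (df['height'] <= df['height_max'])
    & (df['weight'] >= df['weight_min'])
    & (df['weight'] <= df['weight_max'])
)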

In [40]:
good_data = df[mask]

In [41]:
good_data.shape

(64352, 19)

In [43]:
perc = good_data.shape[0] / df.shape[0]
perc

0.9193142857142858

### Question 10: What percent of the original data (rounded) did we throw away?

In [44]:
# fraction of the original rows that were removed; multiply by 100 and round to get the percentage (about 8%)
1 - perc

0.08068571428571425
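
As a cross-check, the same figure can be derived directly from the row counts reported in this notebook (70000 original rows, 64352 rows in `good_data`). The variable names below are illustrative.

```python
# Row counts as reported by df.info() and good_data.shape in this notebook
n_total = 70_000
n_kept = 64_352

n_removed = n_total - n_kept             # rows discarded by the filter
pct_removed = 100 * n_removed / n_total  # share of the original data, in percent

print(n_removed)           # 5648
print(round(pct_removed))  # 8
```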

In [34]:
# report the share of rows kept in the filtered data set "good_data"
print(f'remaining data set size % is {perc:.2%}')

remaining data set size % is 91.93%

In [None]:
print(f'removed data set size % is {1 - perc:.2%}')

**Student Exercises Part 3 Done**