# Programming for Data Analysis Project

Notebook for the Programming for Data Analysis module project, GMIT, 2020

Author: Maciej Izydorek (G00387873@gmit.ie) GitHub: [mizydorek](https://github.com/mizydorek/Machine-Learning-Tasks-2020)

***

#### — Problem statement
*For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the **numpy.random** package for this purpose.*

*Specifically, in this project you should:*

• *Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.*

• *Investigate the types of variables involved, their likely distributions, and their relationships with each other.*

• *Synthesise/simulate a data set as closely matching their properties as possible.*

• *Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.*

*Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.*

#### — Introduction

Cardiovascular diseases

![Systolic and diastolic heart failure](https://www.clearlake-specialties.com/wp-content/uploads/SystolicDiastolic_Heartfailure.5518685646fab-e1553788922847.png)

[description]


#### — Content


Columns:

1. Age: Age of the patient, Measurement: Years [40 - 95]    

2. Anemia: Decrease of red blood cells or hemoglobin, Measurement: Boolean [0, 1]

3. High blood pressure: If a patient has hypertension, Measurement: Boolean [0, 1]

4. Creatinine phosphokinase (CPK): Level of the CPK enzyme in the blood, Measurement: mcg/L [23 - 7861]

5. Diabetes: If the patient has diabetes, Measurement: Boolean [0, 1]

6. Ejection fraction: Percentage of blood leaving the heart at each contraction, Measurement: Percentage [14 - 80]

7. Sex: Woman or man, Measurement: Binary [0, 1]

8. Platelets: Platelets in the blood, Measurement: kiloplatelets/mL [25.10 - 850.00]

9. Serum creatinine: Level of creatinine in the blood, Measurement: mg/dL [0.50 - 9.40]

10. Serum sodium: Level of sodium in the blood, Measurement: mEq/L [113 - 148]

11. Smoking: If the patient smokes, Measurement: Boolean [0, 1]

12. Time: Follow-up period, Measurement: Days [4 - 285]

13. Death event: If the patient died during the follow-up period, Measurement: Boolean [0, 1]
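
The documented ranges in the column list can double as a sanity check on any data set, real or simulated. The following is a minimal sketch only: the names `EXPECTED_RANGES`, `BINARY_COLUMNS` and `out_of_range_counts` are hypothetical helpers introduced here, only a few of the columns are covered, and the sample rows are made up for illustration rather than taken from `heart.csv`.

```python
import pandas as pd

# Ranges copied from the column list; only a subset is shown (hypothetical helper).
EXPECTED_RANGES = {
    "age": (40, 95),
    "ejection_fraction": (14, 80),
    "serum_creatinine": (0.5, 9.4),
    "time": (4, 285),
}
BINARY_COLUMNS = ["anaemia", "diabetes", "high_blood_pressure",
                  "sex", "smoking", "DEATH_EVENT"]


def out_of_range_counts(frame: pd.DataFrame) -> pd.Series:
    """Count, per column, the values that fall outside the documented range."""
    counts = {}
    for column, (low, high) in EXPECTED_RANGES.items():
        if column in frame:
            # between() is inclusive on both ends by default
            counts[column] = int((~frame[column].between(low, high)).sum())
    for column in BINARY_COLUMNS:
        if column in frame:
            counts[column] = int((~frame[column].isin([0, 1])).sum())
    return pd.Series(counts, dtype="int64")


# Toy rows for illustration only; not real patient data.
sample = pd.DataFrame({
    "age": [45, 39, 96],
    "anaemia": [0, 1, 2],
    "time": [4, 100, 285],
})
report = out_of_range_counts(sample)
print(report)
```

Columns missing from the frame are skipped, so the same check could be run on a partially built synthetic data set.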



#### — Packages

In [5]:
# import libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

# plot settings
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = [12,7]

rng = np.random.default_rng(5)
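
Seeding the generator is what makes a simulated data set repeatable between notebook runs. A short sketch of that behaviour, assuming nothing beyond NumPy's `default_rng`; the variable names and the choice of `integers` over the age range are illustrative only.

```python
import numpy as np

# Two generators built from the same seed produce the same stream of draws.
rng_a = np.random.default_rng(5)
rng_b = np.random.default_rng(5)

# integers() excludes the upper bound, so 96 allows values up to 95.
draws_a = rng_a.integers(40, 96, size=5)
draws_b = rng_b.integers(40, 96, size=5)

same_stream = np.array_equal(draws_a, draws_b)
print(same_stream)
```

Note that re-running a single cell that draws from `rng` advances the generator, so results only repeat exactly when the notebook is run from the top.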

#### — Dataset

In [12]:
# Load dataset and preview the first five rows (transposed so all columns fit).
df = pd.read_csv("heart.csv")
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| age | 75.0 | 55.0 | 65.0 | 50.0 | 65.0 |
| anaemia | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| creatinine_phosphokinase | 582.0 | 7861.0 | 146.0 | 111.0 | 160.0 |
| diabetes | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ejection_fraction | 20.0 | 38.0 | 20.0 | 20.0 | 20.0 |
| high_blood_pressure | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| platelets | 265000.0 | 263358.03 | 162000.0 | 210000.0 | 327000.0 |
| serum_creatinine | 1.9 | 1.1 | 1.3 | 1.9 | 2.7 |
| serum_sodium | 130.0 | 136.0 | 129.0 | 137.0 | 116.0 |
| sex | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |


In [10]:
# Print information about a DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
age                         299 non-null float64
anaemia                     299 non-null int64
creatinine_phosphokinase    299 non-null int64
diabetes                    299 non-null int64
ejection_fraction           299 non-null int64
high_blood_pressure         299 non-null int64
platelets                   299 non-null float64
serum_creatinine            299 non-null float64
serum_sodium                299 non-null int64
sex                         299 non-null int64
smoking                     299 non-null int64
time                        299 non-null int64
DEATH_EVENT                 299 non-null int64
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [13]:
# Shape of dataset.
df.shape

(299, 13)

In [15]:
# Have a look at some basic statistical details.
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 299.0 | 60.833893 | 11.894809 | 40.0 | 51.0 | 60.0 | 70.0 | 95.0 |
| anaemia | 299.0 | 0.431438 | 0.496107 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| creatinine_phosphokinase | 299.0 | 581.839465 | 970.287881 | 23.0 | 116.5 | 250.0 | 582.0 | 7861.0 |
| diabetes | 299.0 | 0.41806 | 0.494067 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ejection_fraction | 299.0 | 38.083612 | 11.834841 | 14.0 | 30.0 | 38.0 | 45.0 | 80.0 |
| high_blood_pressure | 299.0 | 0.351171 | 0.478136 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| platelets | 299.0 | 263358.029264 | 97804.236869 | 25100.0 | 212500.0 | 262000.0 | 303500.0 | 850000.0 |
| serum_creatinine | 299.0 | 1.39388 | 1.03451 | 0.5 | 0.9 | 1.1 | 1.4 | 9.4 |
| serum_sodium | 299.0 | 136.625418 | 4.412477 | 113.0 | 134.0 | 137.0 | 140.0 | 148.0 |
| sex | 299.0 | 0.648829 | 0.478136 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
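
Summary statistics like these are the natural inputs for the simulation the problem statement asks for. The sketch below shows one possible way to turn two of them into synthetic columns. The distribution choices are assumptions made here for illustration (a Bernoulli draw for `anaemia`, a clipped normal for `age`); the notebook has not yet established which distributions fit the data.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 299  # number of rows in the original data set

# Binary column: Bernoulli draws with p set to the observed mean of anaemia.
anaemia = rng.binomial(1, 0.431438, size=n)

# ASSUMPTION: age is treated as roughly normal with the observed mean and std,
# then clipped to the observed range [40, 95] and rounded to whole years.
age = np.clip(rng.normal(60.833893, 11.894809, size=n), 40, 95).round()

print(anaemia.mean(), age.mean())
```

Clipping piles a little extra mass on the boundary values, so a truncated distribution may be a better choice if the tails matter.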


#### — Standard Missing values

In [16]:
# checks if dataset contains any missing values
# https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
df.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

#### — Non-Standard Missing values

In [19]:
# Re-read the file, treating these placeholder strings as missing values,
# then count the missing cells across the whole DataFrame.
# https://stackoverflow.com/questions/43424199/display-rows-with-one-or-more-nan-values-in-pandas-dataframe
missing_values = ['n/a', 'na', '--', ' ']
df = pd.read_csv('heart.csv', na_values=missing_values)
df.isna().sum().sum()

0
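
To see what the `na_values` argument changes, it helps to try it on a file that actually contains placeholders. This is a small sketch using an in-memory CSV; the rows are made up for illustration and are not taken from `heart.csv`.

```python
import io

import pandas as pd

# Toy CSV with non-standard placeholders; values are invented for illustration.
csv_text = (
    "age,platelets,smoking\n"
    "75,265000,0\n"
    "n/a,--,1\n"
    "65,162000,na\n"
)
missing_values = ["n/a", "na", "--", " "]

# Default parsing: pandas already recognises "n/a", but not "--" or "na".
plain = pd.read_csv(io.StringIO(csv_text))
# With the custom list, all three placeholders become NaN.
flagged = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)

plain_missing = int(plain.isna().sum().sum())
flagged_missing = int(flagged.isna().sum().sum())
print(plain_missing, flagged_missing)
```

A side effect worth noting is that columns containing unrecognised placeholders are read as text, so catching them also restores numeric dtypes.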

#### — Negative values

In [37]:
# Check for negative values: keep only entries below zero (others become NaN)
# and sum them per column. A sum of 0.0 everywhere means no negatives were found.
df[df < 0].sum()

age                         0.0
anaemia                     0.0
creatinine_phosphokinase    0.0
diabetes                    0.0
ejection_fraction           0.0
high_blood_pressure         0.0
platelets                   0.0
serum_creatinine            0.0
serum_sodium                0.0
sex                         0.0
smoking                     0.0
time                        0.0
DEATH_EVENT                 0.0
dtype: float64
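
Summing the masked values confirms that no negatives exist, but it does not say how many there would be if some did. An alternative sketch that counts them directly, shown on a toy frame invented for illustration:

```python
import pandas as pd

# Toy frame for illustration only; not real patient data.
toy = pd.DataFrame({
    "age": [75, -1, 65],
    "time": [4, 6, -7],
    "sex": [1, 0, 1],
})

# Boolean mask summed per column gives the number of negative entries.
negative_counts = (toy < 0).sum()
# Single yes/no answer for the whole frame.
has_negatives = bool((toy < 0).any().any())

print(negative_counts)
print(has_negatives)
```

On the heart failure data the same expression, `(df < 0).sum()`, would be expected to return zero for every column, matching the output above.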