

Complete attribute documentation:

1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
"Typical (classic) angina chest pain consists of (1) Substernal chest pain or discomfort that is (2) Provoked by exertion or emotional stress and (3) relieved by rest or nitroglycerine (or both). Atypical (probable) angina chest pain applies when 2 out of 3 criteria of classic angina are present."
-- Value 1: typical angina
-- Value 2: atypical angina
-- Value 3: non-anginal pain
-- Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholestoral in mg/dl
6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
-- Value 0: normal
-- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
-- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak = ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
-- Value 1: upsloping
-- Value 2: flat
-- Value 3: downsloping
12. ca: number of major vessels (0-3) colored by flourosopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
14. num: diagnosis of heart disease (angiographic disease status)
-- Value 0: < 50% diameter narrowing 
-- Value 1: > 50% diameter narrowing
        (in any major blood vessel)

More information on this you can find here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

The authors of the databases have requested:

      ...that any publications resulting from the use of the data include the 
      names of the principal investigator responsible for the data collection
      at each institution.  They would be:

       1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
       2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
       3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
       4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
	  Robert Detrano, M.D., Ph.D.


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn')

import warnings
warnings.filterwarnings('ignore')

from statsmodels.formula.api import ols
import statsmodels.api as sm
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import mean_squared_error, make_scorer

In [2]:
data=pd.read_csv('heart.csv')
data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


Our target column indicatets if trhe person is sick or not.

-- Value 0: < 50% diameter narrowing - It means that there is less than 50% narrowing of the artery, therefore no heart disease.
-- Value 1: > 50% diameter narrowing - There is a more than 50% narrowing of a major blood vessel, therefore heart disease is pressent.

0 - no heart disease
1 - heart disease

In [4]:
#Check for values that are not numbers.
data.isna().any()

age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
slope       False
ca          False
thal        False
target      False
dtype: bool

In [5]:
#Check Dataframe for duplicate rows
duplicateRowsData = data[data.duplicated()]
print(duplicateRowsData)

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
164   38    1   2       138   175    0        1      173      0      0.0   

     slope  ca  thal  target  
164      2   4     2       1  


In [6]:
#Seems that the row with age = 38 is duplicated. Let's investigate further.
data_38=data[data['age'] == 38]
data_38

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
163,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1
164,38,1,2,138,175,0,1,173,0,0.0,2,4,2,1
259,38,1,3,120,231,0,1,182,1,3.8,1,0,3,0


Rows 163 and 164 seem to be duplicates. We're going to drop 163.

In [None]:
#drop row with index 163 from Dataframe.
data=data.drop([df.index[163]])

In the details on the database the authors specify that there are values missing and they replaced them with -9.0.
Let's see if the non-values are still in the set.

In [8]:
#Check if there is any value equal to -9.0
data.where(data == -9.0).sum()

age         0.0
sex         0.0
cp          0.0
trestbps    0.0
chol        0.0
fbs         0.0
restecg     0.0
thalach     0.0
exang       0.0
oldpeak     0.0
slope       0.0
ca          0.0
thal        0.0
target      0.0
dtype: float64

Therefore no non-values present.

In [9]:
# Function that will find the number of unique values and tell us if it's high or low.
def find_number_unique_values(df):
    #Creates two empty lists for categorical and continuous data
    cont_val=[]
    cat_val=[]
    #Adds column to categorical data list if it has less than 5 unique values or adds it to the continuous data list it it has more than 5 unique values.
    for i in df.columns:
        if df[i].nunique()<10:
            cat_val.append(i) 
        else:
            cont_val.append(i)
    print(f" These are continuous values: {cont_val}.") , print(f" These are categorical values: {cat_val}.")
        
        #print(f" There are {df[i].nunique()} {i} uniques values.")

In [10]:
find_number_unique_values(data)

 These are continuous values: ['age', 'trestbps', 'chol', 'thalach', 'oldpeak'].
 These are categorical values: ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target'].


# Remove outliers.
Remove outliers in all the features using a function I found here: https://github.com/nadinezab/kc-house-prices-prediction/blob/master/kc-house-prices.ipynb

In [11]:
# Define function to remove outliers
def remove_outliers(df):
    '''removes entries with z-score above specific columns'''
    variables = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target','age', 'trestbps', 'chol', 'thalach', 'oldpeak']
    
    for variable in variables:
        df = df[np.abs(df[variable]-df[variable].mean()) <= (3*df[variable].std())]
        
    return df

In [12]:
data1 = remove_outliers(data)
len(data1)

287

In [13]:
loss=(len(data1)*100)/len(data)
print(f"We lost {loss}% of our data.")

We lost 94.71947194719472% of our data.


We observe that the length of our dataframe stays the same, therfore there were no outliers to remove.

In [None]:
df['age'].describe()

In [None]:
sns.jointplot(x='age', y='target', data=df, 
              kind='reg', label='age', joint_kws={'line_kws':{'color':'red'}})
plt.title('age vs heart disease')
plt.xlabel('Age')
plt.show()