Reading the data from the CSV file and storing it in a dataframe named heart.

In [1]:
import pandas as pd
import numpy as np
heart = pd.read_csv("../input/heart-disease-uci/heart.csv")

Checking the columns present in the CSV file.

In [2]:
heart.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

Checking the contents present in the CSV file.

In [3]:
heart.head()

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   

     slope  ca  thal  target  
0        0   0     1      

In [4]:
heart.target.value_counts()

1    165
0    138
Name: target, dtype: int64

In [5]:
y=heart.target
heart=heart.drop('target',axis=1)
a=heart.columns

Scaling is needed here. Describing the data shows that every column except trestbps, chol, thalach and age lies in a similar range, so these four columns need to be scaled.
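One way to check the ranges is to look at the minimum and maximum of each column. The sketch below uses only the five rows displayed earlier for a handful of columns; the name sample is made up for this illustration, and the full dataframe could presumably be inspected the same way with heart.describe().

```python
import pandas as pd

# First five rows of a few columns, copied from the output shown earlier.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "trestbps": [145, 130, 130, 120, 120],
    "chol": [233, 250, 204, 236, 354],
    "thalach": [150, 187, 172, 178, 163],
    "cp": [3, 2, 1, 1, 0],
    "oldpeak": [2.3, 3.5, 1.4, 0.8, 0.6],
})

# The min and max rows of describe() make the difference in magnitude easy to see.
ranges = sample.describe().loc[["min", "max"]]
print(ranges)
```

Even on this small sample, age, trestbps, chol and thalach are in the tens or hundreds, while cp and oldpeak stay in single digits.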

Two commonly used scalers are: 1. Standard Scaler 2. Robust Scaler

Standard Scaler: The StandardScaler rescales each feature so that its distribution is centred around 0 with a standard deviation of 1. It works best when the data within each feature is roughly normally distributed.
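As a rough illustration of standard scaling, the sketch below standardizes a toy column by hand with NumPy. The values are the trestbps entries of the five rows shown earlier; the variable names are made up for this example.

```python
import numpy as np

# Toy values: the trestbps entries of the five rows shown earlier.
x = np.array([145.0, 130.0, 130.0, 120.0, 120.0])

# Standardization: subtract the mean, divide by the standard deviation.
# np.std defaults to the population standard deviation (ddof=0), which is
# also what scikit-learn's StandardScaler uses.
z = (x - x.mean()) / x.std()

# After scaling, the mean is (numerically) 0 and the standard deviation is 1.
print(z.mean(), z.std())
```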

Robust Scaler: The RobustScaler works much like the Min-Max scaler, but it uses the median and the interquartile range rather than the minimum and maximum, so it is robust to outliers. For each feature it follows the formula:

(x_i - median(x)) / (Q3(x) - Q1(x))

Of course, this means it uses less of the data for scaling, so it is more suitable when there are outliers in the data.
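To make the robust scaling step concrete, here is a small sketch that compares a manual computation with scikit-learn's RobustScaler. It assumes the default settings, which centre each feature on its median and divide by the interquartile range; the toy values, including the outlier, are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one obvious outlier (values are made up).
x = np.array([110.0, 120.0, 125.0, 130.0, 140.0, 145.0, 400.0]).reshape(-1, 1)

# Manual computation: subtract the median, divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
manual = (x - median) / (q3 - q1)

# RobustScaler with default settings should give the same values.
scaled = RobustScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```

Because the median and the quartiles barely move when the value 400 is added, the other values keep a sensible scale, which is the point of choosing this scaler.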

In [6]:
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler()

# Scale each wide-range column on its own; fit_transform refits the scaler on every call.
heart['scaled_trestbps'] = rob_scaler.fit_transform(heart['trestbps'].values.reshape(-1,1))
heart['scaled_chol'] = rob_scaler.fit_transform(heart['chol'].values.reshape(-1,1))
heart['scaled_thalach'] = rob_scaler.fit_transform(heart['thalach'].values.reshape(-1,1))
heart['scaled_age'] = rob_scaler.fit_transform(heart['age'].values.reshape(-1,1))
# Drop the unscaled originals.
heart.drop(['trestbps','chol','thalach', 'age'], axis=1, inplace=True)

In [7]:
# Move the scaled columns to the front of the dataframe.
scaled_trestbps = heart['scaled_trestbps']
scaled_chol = heart['scaled_chol']
scaled_thalach = heart['scaled_thalach']
scaled_age = heart['scaled_age']
heart.drop(['scaled_trestbps', 'scaled_chol', 'scaled_thalach', 'scaled_age'], axis=1, inplace=True)
heart.insert(0, 'scaled_trestbps', scaled_trestbps)
heart.insert(1, 'scaled_chol', scaled_chol)
heart.insert(2, 'scaled_thalach', scaled_thalach)
heart.insert(3, 'scaled_age', scaled_age)
heart.head()

   scaled_trestbps  scaled_chol  scaled_thalach  scaled_age  sex  cp  fbs  restecg  exang  oldpeak  slope  ca  thal
0             0.75    -0.110236       -0.092308    0.592593    1   3    1        0      0      2.3      0   0     1
1              0.0      0.15748        1.046154   -1.333333    1   2    0        1      0      3.5      0   0     2
2              0.0    -0.566929        0.584615   -1.037037    0   1    0        0      0      1.4      2   0     2
3             -0.5    -0.062992        0.769231    0.074074    1   1    0        1      0      0.8      2   0     2
4             -0.5     1.795276        0.307692    0.148148    0   0    0        1      1      0.6      2   0     2


In [8]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(heart, y, test_size=0.2, random_state=42)

Logistic regression is named for the function used at the core of the method, the logistic function.

The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

1 / (1 + e^-value)

Here e is the base of the natural logarithms (Euler’s number, or the EXP() function in your spreadsheet) and value is the actual numerical value that you want to transform. For example, the numbers between -5 and 5 are transformed into the range between 0 and 1 by the logistic function.
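The transformation can be tried out directly. The following is a minimal sketch using only the standard library; the function name logistic is chosen for this example and is not part of scikit-learn.

```python
import math

def logistic(value):
    """Logistic (sigmoid) function: 1 / (1 + e^-value)."""
    return 1.0 / (1.0 + math.exp(-value))

# Transform the integers from -5 to 5 into the open interval (0, 1).
for v in range(-5, 6):
    print(f"{v:>3} -> {logistic(v):.4f}")
```

The printed values rise monotonically, pass through 0.5 at value 0, and approach 0 and 1 at the two ends without reaching them.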

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
model=LogisticRegression()
model.fit(x_train,y_train)
pred=model.predict(x_test)
target_names=['class 0','class 1']
print(classification_report(y_test,pred,target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.86      0.86      0.86        29
     class 1       0.88      0.88      0.88        32

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61



XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

In [10]:
import xgboost as xgb
D_train = xgb.DMatrix(x_train, label=y_train)
D_test = xgb.DMatrix(x_test, label=y_test)
param = {
    'eta': 0.3, 
    'max_depth': 3,  
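    'objective': 'multi:softprob',  # output one probability per class
    'num_class': 3}  # the target only takes the values 0 and 1, so 2 classes would be enough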

steps = 20

model = xgb.train(param, D_train, steps)

preds2 = model.predict(D_test)
best_preds = np.asarray([np.argmax(line) for line in preds2])

target_names=['class 0','class 1']
print(classification_report(y_test,best_preds,target_names=target_names))

              precision    recall  f1-score   support

     class 0       0.83      0.86      0.85        29
     class 1       0.87      0.84      0.86        32

    accuracy                           0.85        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.85      0.85      0.85        61



In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.

Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or from a random subset of size max_features.

The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yields decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant, hence yielding an overall better model.
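The two sources of randomness can be sketched with NumPy alone. This is only an illustration of the idea, not scikit-learn's internal code; the shape 242 x 13 is assumed from an 80% split of the 303 rows, and max_features = 3 is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_samples, n_features = 242, 13   # assumed training-set shape (80% of 303 rows)
max_features = 3                  # assumed subset size, roughly sqrt(13)

# Source of randomness 1: a bootstrap sample, i.e. row indices drawn
# with replacement, so some rows repeat and others are left out.
bootstrap_rows = rng.integers(0, n_samples, size=n_samples)
left_out = n_samples - np.unique(bootstrap_rows).size

# Source of randomness 2: a random subset of features considered at a split.
candidate_features = rng.choice(n_features, size=max_features, replace=False)

print(left_out, candidate_features)
```

Each tree would see a different bootstrap_rows and different candidate_features, which is what decouples the trees' errors before their predictions are averaged.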

In [11]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86        29
           1       0.88      0.88      0.88        32

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61



A “Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. However, it is mostly used for classification. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best separates the two classes.
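Once a hyperplane is known, classifying a point comes down to checking which side of it the point falls on. The sketch below uses a hypothetical hyperplane w . x + b = 0 in two dimensions; the weights, bias and points are invented for illustration, whereas a fitted SVM would learn w and b from the training data.

```python
import numpy as np

# Hypothetical hyperplane in 2-D: w . x + b = 0 (values chosen for illustration).
w = np.array([1.0, -1.0])
b = 0.0

# Four toy points, one per row.
points = np.array([[2.0, 1.0], [3.0, 0.5], [1.0, 2.0], [0.5, 3.0]])

# The sign of w . x + b tells which side of the hyperplane a point lies on.
scores = points @ w + b
labels = (scores > 0).astype(int)

print(labels)
```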

In [12]:
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(x_train, y_train)
y_pred = svclassifier.predict(x_test)
print(classification_report(y_test,y_pred))
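# Keep the original row labels of the test set as the index of the predictions.
preds = pd.DataFrame({'target': y_pred}, index=x_test.index)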
preds.to_csv('submission.csv')

              precision    recall  f1-score   support

           0       0.86      0.86      0.86        29
           1       0.88      0.88      0.88        32

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61

