<a href="https://colab.research.google.com/github/pandharkardeep/Compute_tasks_Deep_P/blob/main/Deep_PCA_heart_disease_T7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Task: Perform dimensionality reduction by applying Principal Component Analysis on the given dataset and eventually fit a logistic regression model on the reduced data**

## Importing libraries and data

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

In [3]:
df = pd.read_csv("/content/heart.csv")
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [4]:
df.shape

(918, 12)

In [5]:
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


**No null values are present**

# Pre-processing

In [7]:
X = df.drop("HeartDisease",axis=1)
y = df.HeartDisease

## Preprocessing task
* Apply label encoding on `X`
* Re-scale the data to get the data of same magnitude
* Apply train test split

### Label encoding

In [10]:
le = LabelEncoder()
for col in X.columns:

    if X[col].dtype == 'object':
        X[col] = le.fit_transform(X[col])

### Standard scaler

In [12]:
sc = StandardScaler()

X= sc.fit_transform(X)

### Split data

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.25)

## Apply Logistic Regression before applying PCA

In [16]:
model = LogisticRegression()
model.fit(X,y)
model.score(X_test, y_test)

0.8347826086956521

## PCA

* Apply PCA with number of components = 5
* fit and transform pca on `X` and store new data in `X2`


In [17]:
pca = PCA(n_components=5)
X2 = pca.fit_transform(X)

* Find number of components of pca

In [18]:
print("Shape of X2:", X2.shape)

Shape of X2: (918, 5)


**Explained variance ratio** is the amount of variance explained by each feature (component) of PCA

In [19]:

for i, explained_variance in enumerate(pca.explained_variance_ratio_):
    print(f"Component {i+1}: {explained_variance:.2f}")

Component 1: 0.25
Component 2: 0.13
Component 3: 0.11
Component 4: 0.09
Component 5: 0.08


In [20]:
total_variance_explained = sum(pca.explained_variance_ratio_)
print("Total Variance Explained:", total_variance_explained)


Total Variance Explained: 0.6596718447968661


### Run the cells below to apply LogisticRegression on reduced data

In [21]:
# Train test split of X2
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.2, random_state=47)

In [22]:
log = LogisticRegression(max_iter=1000)
log = log.fit(X_train, y_train)
log.score(X_test, y_test)

0.8206521739130435