# Search Tasks

### What are the types of Decision Trees?

1. Classification Trees
    * Used for categorical data
    * Example: Email (spam / not spam)

2. Regression Trees
    * Used for continuous data
    * Example: Predicting house price
3. Hybrid Trees
    * Used for both categorical and continuous data
    * Not commonly used
    * Related example: gradient boosting methods like XGBoost, whose trees output continuous scores even for classification tasks
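
The first two tree types can be sketched with scikit-learn as follows. This is a minimal illustration, not part of the original notes: the toy features and values are made up, and `DecisionTreeClassifier` / `DecisionTreeRegressor` are simply scikit-learn's implementations of classification and regression trees.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: categorical target (spam / not spam).
# Hypothetical features per email: [number of links, number of exclamation marks]
X_emails = [[0, 0], [1, 0], [8, 5], [9, 7]]
y_spam = ["not spam", "not spam", "spam", "spam"]

clf = DecisionTreeClassifier(random_state=0).fit(X_emails, y_spam)
spam_prediction = clf.predict([[7, 6]])[0]

# Regression tree: continuous target (house price).
# Hypothetical feature per house: [area in square metres]
X_houses = [[50], [60], [120], [130]]
y_price = [100.0, 110.0, 240.0, 250.0]

reg = DecisionTreeRegressor(random_state=0).fit(X_houses, y_price)
price_prediction = reg.predict([[128]])[0]

print(spam_prediction, price_prediction)
```

The classifier returns one of the class labels, while the regressor returns the mean target value of the leaf the sample falls into.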

### What are Voting and Stacking algorithms?

1. Voting Ensemble
    * Combines predictions from multiple models and decides by voting or averaging
    * Hard Voting: takes the majority class prediction from all models
    * Soft Voting: averages the predicted probabilities and selects the class with the highest average probability
    * Commonly used with classification problems
    * Examples:
        * Hard voting: (Yes, No, Yes) -> Yes (majority)
        * Soft voting: (0.8, 0.3, 0.7) -> mean 0.6 -> Yes
2. Stacking Ensemble
    * Use multiple base models and a meta-model to learn how to best combine the output
    * Commonly used for classification and regression problems
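
The two voting examples above can be reproduced in plain Python, and stacking can be sketched with scikit-learn's `StackingClassifier`. This is an illustrative sketch under assumptions: the helper names `hard_vote` and `soft_vote` are hypothetical, the 0.5 threshold assumes a binary problem, and the choice of base models and synthetic data is arbitrary.

```python
from collections import Counter
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def hard_vote(labels):
    """Return the label predicted by the most models (hypothetical helper)."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(positive_probabilities, threshold=0.5):
    """Average each model's probability for 'Yes', then apply the threshold."""
    average = mean(positive_probabilities)
    return average, ("Yes" if average >= threshold else "No")


hard_result = hard_vote(["Yes", "No", "Yes"])
avg_probability, soft_result = soft_vote([0.8, 0.3, 0.7])
print(hard_result, round(avg_probability, 2), soft_result)

# Stacking: base models produce predictions, and a meta-model
# (here logistic regression) learns how to combine them.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
stack_predictions = stack.predict(X[:5])
```

For a binary problem, averaging the "Yes" probabilities and comparing against 0.5 is equivalent to picking the class with the highest average probability.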

### Search about PyCaret

* PyCaret is an open-source machine learning library in Python that automates machine learning workflows. It serves as an end-to-end tool for model management, significantly accelerating the experimentation cycle and enhancing productivity.
* It allows users to replace extensive code with just a few lines, making experiments faster and more efficient.
* Supports various ML tasks: regression, classification, clustering, NLP, and time series
* Can be integrated with several ML libraries such as sklearn

### What are the different types of Encoding preprocessing?

1. Label Encoding
    * Assign a unique integer for each class
    * Simple and memory efficient
    * Used for ordinal data
2. One-Hot Encoding
    * Create binary columns for each class
    * Prevent ordinal misinterpretation
    * Increases dimensionality
3. Binary Encoding
    * Convert categories into binary representation
    * Reduces dimensionality compared to One-Hot Encoding
4. Frequency Encoding
    * Replace each class with the frequency of its occurrence
    * May not work well with unseen classes
5. Target Encoding
    * Replaces classes with the mean of the target variable for each class
6. Dummy Encoding
    * Similar to One-Hot Encoding but drops one category to avoid redundancy
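
The encodings listed above can be sketched with pandas on a toy column. The data and column names below are made up for illustration, and the binary encoding is a hand-rolled approximation (bits of the label index) rather than a specific library's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "green"],
    "target": [1, 0, 1, 0, 1, 1],
})

# Label encoding: one integer per category (alphabetical order here).
categories = sorted(df["color"].unique())
label_map = {cat: i for i, cat in enumerate(categories)}
df["color_label"] = df["color"].map(label_map)

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Dummy encoding: like one-hot, but the first category is dropped.
dummy = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Binary encoding: write the label index in base 2, one column per bit.
n_bits = max(1, (len(categories) - 1).bit_length())
binary = pd.DataFrame({
    f"color_bit{b}": (df["color_label"] // 2**b) % 2 for b in range(n_bits)
})

# Frequency encoding: replace each category with its relative frequency.
freq_map = df["color"].value_counts(normalize=True)
df["color_freq"] = df["color"].map(freq_map)

# Target encoding: replace each category with the mean target of that category.
target_map = df.groupby("color")["target"].mean()
df["color_target"] = df["color"].map(target_map)

print(df)
```

In practice the frequency and target mappings should be computed on the training split only and then applied to the test split, otherwise information from the test data leaks into the features.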

### What is the `stratify` parameter in train_test_split?

* It ensures that the distribution of target classes remains the same in both the training and testing sets.
* It is useful for imbalanced datasets, where one class has significantly more samples than another.
* Reduces sampling bias: both splits reflect the class proportions of the full dataset
* Avoids underrepresented classes in training or testing, which is especially useful in classification problems where some classes are rare.
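
The effect of `stratify` can be checked on a small imbalanced dataset. This is a minimal sketch with synthetic data (90 samples of class 0 and 10 of class 1), not the heart dataset used later.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Imbalanced toy data: 90% class 0, 10% class 1.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# stratify=y keeps the 90/10 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```

Without `stratify`, a random 20-sample test set could by chance contain very few or even no samples of the rare class.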

# Apply EDA on Heart Disease data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('heart.csv')
df.head(10)

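| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
| 5 | 39 | M | NAP | 120 | 339 | 0 | Normal | 170 | N | 0.0 | Up | 0 |
| 6 | 45 | F | ATA | 130 | 237 | 0 | Normal | 170 | N | 0.0 | Up | 0 |
| 7 | 54 | M | ATA | 110 | 208 | 0 | Normal | 142 | N | 0.0 | Up | 0 |
| 8 | 37 | M | ASY | 140 | 207 | 0 | Normal | 130 | Y | 1.5 | Flat | 1 |
| 9 | 48 | F | ATA | 120 | 284 | 0 | Normal | 120 | N | 0.0 | Up | 0 |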


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [4]:
df.describe()

| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |


In [5]:
df.isnull().sum()

| Column | Null count |
|---|---|
| Age | 0 |
| Sex | 0 |
| ChestPainType | 0 |
| RestingBP | 0 |
| Cholesterol | 0 |
| FastingBS | 0 |
| RestingECG | 0 |
| MaxHR | 0 |
| ExerciseAngina | 0 |
| Oldpeak | 0 |


In [6]:
df.duplicated().sum()

0

In [7]:
X = df.drop(columns=['HeartDisease'])
y = df['HeartDisease']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [9]:
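# Work on copies so the column assignments below do not trigger SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

categorical_columns = X.select_dtypes(include=['object']).columns

# Fit one encoder per column on the training data only, then reuse it on the test data
encoders = {}
for col in categorical_columns:
    encoders[col] = LabelEncoder()
    X_train[col] = encoders[col].fit_transform(X_train[col])
    X_test[col] = encoders[col].transform(X_test[col])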
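
One caveat of the loop above is that `LabelEncoder.transform` raises an error when the test split contains a category that was not seen during fitting. A possible alternative, sketched here as an assumption rather than as the notebook's method, is scikit-learn's `OrdinalEncoder`, which encodes several feature columns at once and can map unseen categories to a sentinel value. The small frames below are made-up stand-ins that reuse two column names from the heart dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in data; "Down" appears only in the test split on purpose.
X_train = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up"],
    "Age": [40, 49, 37],
})
X_test = pd.DataFrame({
    "Sex": ["F", "M"],
    "ST_Slope": ["Down", "Flat"],
    "Age": [48, 54],
})

categorical_columns = ["Sex", "ST_Slope"]

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(X_train[categorical_columns])
test_codes = encoder.transform(X_test[categorical_columns])


def with_codes(frame, codes):
    """Return a copy of frame with the categorical columns replaced by their codes."""
    return frame.assign(
        **{col: codes[:, i] for i, col in enumerate(categorical_columns)}
    )


train_encoded = with_codes(X_train, train_codes)
test_encoded = with_codes(X_test, test_codes)
print(test_encoded)
```

The encoder is fitted on the training split only, so the test split is transformed with the mapping learned from the training data.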