#  **Detailed Explanation of Requirements Used in Project**

Your project performs **data cleaning**, **EDA**, **visualization**, **machine learning modeling**, and **model evaluation**.
To complete these tasks, your code uses several Python libraries.
Below is a detailed explanation of each one and why it is required.

---

#  **1. Data Handling Libraries**

## **- pandas**

**Why used:**
Pandas is the main library for handling datasets. It helps you load, clean, and analyze data.

**What it does in your project:**

* Loads CSV files
* Shows dataset summary (`df.info()`, `df.describe()`)
* Removes duplicates
* Handles missing values
* Selects and modifies columns

Pandas is essential for almost all steps in data cleaning and EDA.
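
The steps listed above can be sketched roughly as follows. This is only an illustration: the CSV text is an in-memory stand-in for your real file, and the column names `age` and `chol` are assumptions, not taken from your dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content; a real project would pass its own file path to pd.read_csv.
csv_text = """age,chol
63,233
63,233
45,
52,199
"""

df = pd.read_csv(io.StringIO(csv_text))

df.info()                  # prints column types and non-null counts
summary = df.describe()    # basic statistics for the numeric columns

df = df.drop_duplicates()  # removes the repeated (63, 233) row
# Fill the missing cholesterol value with the column median
df["chol"] = df["chol"].fillna(df["chol"].median())
```

Filling with the median is just one option; dropping rows or using another statistic may suit your data better.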

---

## **- numpy**

**Why used:**
NumPy helps with numerical operations. Machine learning models require data in numeric form, and NumPy provides the array type that holds it.

**What it does in your project:**

* Creates arrays
* Performs mathematical operations
* Supports sklearn models internally

You are indirectly using NumPy whenever you work with ML models and data transformations.
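
As a small sketch of the array operations mentioned above, the snippet below standardizes a made-up feature matrix by hand. The values are hypothetical; the point is only to show array creation and element-wise math.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

col_means = X.mean(axis=0)  # per-feature mean
col_stds = X.std(axis=0)    # per-feature standard deviation

# Broadcasting applies the per-column mean and std to every row
X_scaled = (X - col_means) / col_stds
```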

---

# **2. Visualization Libraries**

## **- matplotlib**

**Why used:**
Matplotlib is used for creating basic graphs.

**What it does in your project:**

* Plots simple graphs
* Draws ROC curve
* Draws trend lines and custom visualizations

Often used alongside seaborn for more detailed plots.
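
A minimal sketch of an ROC plot with Matplotlib is shown below. The `fpr` and `tpr` values are invented for illustration; in your notebook they would normally come from `sklearn.metrics.roc_curve`. The `Agg` backend is chosen here only so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

# Hypothetical ROC points, not real model output
fpr = [0.0, 0.1, 0.4, 1.0]
tpr = [0.0, 0.6, 0.9, 1.0]

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label="model")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")  # diagonal reference line
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve")
ax.legend()
```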

---

## **- seaborn**

**Why used:**
Seaborn builds on Matplotlib to create attractive statistical graphs with less code.

**What it does in your project:**

* Creates heatmaps
* Creates pairplots
* Creates boxplots, histograms, and countplots
* Helps visualize patterns and correlations

Seaborn simplifies and beautifies EDA visualizations.
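
For example, a correlation heatmap might be drawn roughly like this. The DataFrame and its column names (`age`, `chol`, `max_hr`) are hypothetical placeholders for your own numeric columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric data standing in for the real dataset
df = pd.DataFrame({
    "age": [63, 45, 52, 58, 41],
    "chol": [233, 250, 199, 240, 180],
    "max_hr": [150, 172, 162, 155, 178],
})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Feature correlations")
```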

---

#  **3. Machine Learning – Scikit-Learn**

`scikit-learn` (sklearn) is the **backbone of your ML modeling**.
Almost every ML step in your notebook uses a module from sklearn.

Below is a detailed breakdown:

---

## **(A) Model Selection**

### **from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score**

**Why used:**

* `train_test_split` → splits data into training and testing sets
* `StratifiedKFold` → splits data into K folds that each keep the original class proportions
* `cross_val_score` → checks model performance using multiple folds

Together these give a **more reliable performance estimate** than a single split, and stratification keeps the class balance consistent across folds.
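
A rough sketch of how the three utilities fit together is shown below. The data comes from `make_classification` as a synthetic stand-in for your dataset, and the split size, fold count, and `LogisticRegression` estimator are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (not the project's real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 20% for testing, keeping class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold stratified cross-validation on the training part only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print(scores.mean())
```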

---

## **(B) Data Preprocessing**

### **from sklearn.preprocessing import StandardScaler, OneHotEncoder**

**Why used:**

* `StandardScaler` → standardizes numeric features to zero mean and unit variance
* `OneHotEncoder` → converts categorical features into numerical format

These transformations often improve performance and stability, especially for scale-sensitive models such as `LogisticRegression` and `SVC`.
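
The two transformers can be tried in isolation as in the sketch below. The `ages` and `chest_pain` arrays are made-up examples, not columns confirmed to exist in your data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical numeric column
ages = np.array([[40.0], [50.0], [60.0]])
scaled = StandardScaler().fit_transform(ages)  # zero mean, unit variance

# Hypothetical categorical column
chest_pain = np.array([["typical"], ["atypical"], ["typical"]])
encoder = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; convert it for easy viewing
encoded = encoder.fit_transform(chest_pain).toarray()
```

Categories are sorted alphabetically, so the first output column corresponds to `atypical` and the second to `typical`.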

---

## **(C) Handling Missing Data**

### **from sklearn.impute import SimpleImputer**

**Why used:**

* Fills missing values using:

  * mean
  * median
  * most frequent value

Your project uses this in its cleaning and modeling pipelines.
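
The three strategies can be compared on a tiny made-up array, as sketched here. The values are hypothetical and only meant to show how each strategy fills the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 20.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
```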

---

## **(D) Column Transformations**

### **from sklearn.compose import ColumnTransformer**

**Why used:**

* Applies different preprocessing steps to different columns
  Example:

  * Scale numeric columns
  * Encode categorical columns

This modular approach makes preprocessing more organized.
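
A minimal sketch of this idea, assuming a hypothetical DataFrame with one numeric column (`age`) and one categorical column (`sex`):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data; the column names are assumptions for illustration
df = pd.DataFrame({
    "age": [40, 50, 60],
    "sex": ["M", "F", "M"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),                        # scale numeric column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),  # encode categorical column
])

X = preprocess.fit_transform(df)  # 1 scaled column + 2 one-hot columns
```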

---

## **(E) Machine Learning Pipeline**

### **from sklearn.pipeline import Pipeline**

**Why used:**

* Combines preprocessing + model into one object
* Makes the entire ML workflow clean and reproducible

Example:

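```
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

Pipeline([
    ('preprocess', ColumnTransformer(...)),
    ('model', RandomForestClassifier())
])
```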
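
The `...` in the example above stands for the list of transformers. A fuller sketch of what such a pipeline could look like is given below; the tiny DataFrame, its column names, and the choice of median imputation are all assumptions for illustration, not your actual configuration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data standing in for the real dataset
df = pd.DataFrame({
    "age": [40, 50, 60, 45, 55, 65],
    "sex": ["M", "F", "M", "F", "M", "F"],
    "target": [0, 0, 1, 0, 1, 1],
})
X, y = df[["age", "sex"]], df["target"]

# Numeric columns: fill gaps, then scale
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])

# Preprocessing and model combined into a single estimator
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
clf.fit(X, y)
preds = clf.predict(X)
```

Because preprocessing lives inside the pipeline, it is fitted only on the data passed to `fit`, which helps avoid leaking test data into the scaler or encoder.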

---

## **(F) Models Used in Your Project**

### **- LogisticRegression**

Used as a simple, interpretable baseline model for classification.

### **- RandomForestClassifier**

An ensemble of decision trees whose votes are combined; it often gives strong accuracy with little tuning.

### **- GradientBoostingClassifier**

Builds shallow trees (weak learners) one after another, each correcting the errors of the previous ones.

### **- SVC (Support Vector Machine)**

Used for maximum-margin separation in classification tasks.

These models help you compare and choose the best performer.
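
One way to run such a comparison is sketched below. It uses synthetic data from `make_classification` and default hyperparameters, both of which are assumptions; your notebook may compare the models differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (not the project's real dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "SVC": SVC(),
}

# Mean 5-fold cross-validated accuracy for each model
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

best = max(results, key=results.get)  # name of the highest-scoring model
print(best, results[best])
```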

---

#  **4. ML Evaluation Metrics (from sklearn)**

Your project evaluates model performance using the following:

* **accuracy_score** → Fraction of all predictions that are correct
* **precision_score** → Fraction of predicted positives that are actually positive
* **recall_score** → Fraction of actual positives that the model found
* **f1_score** → Harmonic mean of precision and recall
* **roc_auc_score** → Area under the ROC curve; measures how well the model ranks positives above negatives
* **roc_curve** → Returns the false positive rates, true positive rates, and thresholds used to draw the ROC curve
* **confusion_matrix** → Shows counts of TP, FP, FN, TN

All these help analyze which model performs best and why.
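
The sketch below computes each metric on a small hand-made example. The labels and scores are invented for illustration; in practice `y_pred` and `y_score` would come from a fitted model's `predict` and `predict_proba` (or `decision_function`) outputs.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)

# Hypothetical true labels, predicted labels, and predicted scores
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)          # needs scores, not hard labels
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```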

---

#  **5. Model Saving Library**

## **- joblib**

**Why used:**

* Saves trained models into a `.pkl` file
* Helps reload the model later without training again

Example:

```
import joblib
joblib.dump(model, "heart_disease_model.pkl")
```

This is important for deployment and future use.
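
A save-and-reload round trip might look like the sketch below. The model, the synthetic data, and the temporary directory are assumptions for illustration; only the file name `heart_disease_model.pkl` comes from the example above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and a simple model (not the project's real ones)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save to a temporary directory so the sketch does not clutter the project folder
path = os.path.join(tempfile.mkdtemp(), "heart_disease_model.pkl")
joblib.dump(model, path)

# Reload later without retraining
loaded = joblib.load(path)
same = bool((loaded.predict(X) == model.predict(X)).all())
```

Only load `.pkl` files from sources you trust, and try to reload them with the same scikit-learn version that saved them.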

---

#  **Final Requirements List**
```
pandas
numpy
matplotlib
seaborn
scikit-learn
joblib
```

---

