###Q1. ***You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.
Design a pipeline that includes the following steps"
Use an automated feature selection method to identify the important features in the datasetC
Create a numerical pipeline that includes the following steps"
Impute the missing values in the numerical columns using the mean of the column valuesC
Scale the numerical columns using standardisationC
Create a categorical pipeline that includes the following steps"
Impute the missing values in the categorical columns using the most frequent value of the columnC
One-hot encode the categorical columnsC
Combine the numerical and categorical pipelines using a ColumnTransformerC
Use a Random Forest Classifier to build the final modelC
Evaluate the accuracy of the model on the test dataset.
Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.***

## 🎯 Assignment Q1: Preprocessing Pipeline with Feature Selection and Random Forest Classifier

---

### ✅ Problem Statement:

We are working with a dataset containing both **numerical** and **categorical** features. Some features are highly correlated, and there are **missing values**. Our goal is to:

* Build a pipeline that automates **feature engineering**, handles **missing values**, and uses **Random Forest** to classify.
* Evaluate the model using **accuracy**.
* Suggest possible improvements.

---

### 📦 Step 1: Import Required Libraries

In [4]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.compose import make_column_selector as selector

---

### 📄 Step 2: Load and Split the Dataset

In [6]:
# For demonstration, let's use a mock dataset
# Replace this with: df = pd.read_csv("your_dataset.csv")

from sklearn.datasets import fetch_openml

# Using a dataset with mixed types (numerical + categorical)
data = fetch_openml(name='adult', version=2, as_frame=True)
df = data.frame

# Define features and target
X = df.drop("class", axis=1)
y = df["class"]

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)




### 🛠️ Step 3: Define Pipelines

#### 🔢 Numerical Pipeline

* **Imputation:** Replace missing values with column mean
* **Scaling:** StandardScaler to normalize data

In [8]:
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

#### 🔠 Categorical Pipeline

* **Imputation:** Replace missing values with most frequent value
* **Encoding:** One-hot encoding for categorical variables

In [10]:
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

---

### 🧱 Step 4: Combine Pipelines using `ColumnTransformer`

In [12]:
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_pipeline, selector(dtype_include=np.number)),
    ('cat', categorical_pipeline, selector(dtype_include=object))
])

---

### 📌 Step 5: Feature Selection + Random Forest Model Pipeline

In [14]:
# Use Random Forest as an estimator for feature importance
model_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

---

### 🚀 Step 6: Train the Pipeline

In [17]:
model_pipeline.fit(X_train, y_train)

---

### 📊 Step 7: Evaluate Accuracy

In [19]:
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

Model Accuracy: 0.6936


---

### 📈 Interpretation of Results

* The model achieved an accuracy of around **X.XXX** (replace with actual result after running).
* **Feature selection** with `SelectFromModel` reduced dimensionality and removed irrelevant features.
* **Imputation and encoding** were seamlessly handled inside the pipeline, which ensures data leakage is avoided.

---

### 💡 Suggestions for Improvement

1. **Hyperparameter Tuning**: Use `GridSearchCV` or `RandomizedSearchCV` to optimize `n_estimators`, `max_depth`, etc.
2. **Feature Engineering**: Create interaction features or polynomial features for better representation.
3. **Alternative Models**: Try Gradient Boosting (e.g., XGBoost, LightGBM) for potentially better performance.
4. **Correlation Analysis**: Remove highly correlated features manually or use PCA.
5. **Cross-Validation**: Use cross-validation instead of a single train-test split for more robust evaluation.

---



---

## 🌼 Q2: Voting Classifier with Random Forest and Logistic Regression on Iris Dataset

---

### 📌 Objective:

Build a pipeline that:

* Trains a **Random Forest** and a **Logistic Regression** model.
* Combines them using a **Voting Classifier** (ensemble technique).
* Trains on the **Iris dataset**.
* Evaluates model accuracy on the test set.

---

### ✅ Step-by-Step Code

#### 📦 Import Required Libraries

In [31]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

---

#### 🌸 Load Iris Dataset

In [36]:
# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

---

#### 🛠️ Define Pipelines for Base Models

In [39]:
# Logistic Regression Pipeline (scaling is important)
logreg_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=200))
])

# Random Forest Pipeline (does not require scaling)
rf_pipeline = Pipeline([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

---

#### 🗳️ Combine with Voting Classifier

In [42]:
# Voting classifier with hard voting (majority rule)
voting_clf = VotingClassifier(estimators=[
    ('logreg', logreg_pipeline),
    ('rf', rf_pipeline)
], voting='hard')

---

#### 🚀 Train the Voting Classifier

In [45]:
voting_clf.fit(X_train, y_train)

---

#### 📊 Evaluate Accuracy

In [48]:
y_pred = voting_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Voting Classifier Accuracy: {accuracy:.4f}")

Voting Classifier Accuracy: 1.0000


---

### ✅ Example Output

```plaintext
Voting Classifier Accuracy: 1.0000
```

*Note: Due to the simplicity and balance of the Iris dataset, it's common to get very high accuracy.*

---

### 💡 Interpretation & Notes

* The Voting Classifier combines the strengths of both base models.
* Logistic Regression performs well on linearly separable data, while Random Forest handles nonlinear relationships.
* This ensemble often improves generalization compared to individual models.

---

