<a href="https://colab.research.google.com/github/golu628/assignment/blob/main/ensemble.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

✅ Q1. Full ML Pipeline with Feature Engineering and Random Forest
🔧 1. Imports & Setup
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml
🧹 2. Load Dataset (Example: Titanic from OpenML)
# Using a mixed-type dataset for demonstration
titanic = fetch_openml(name='titanic', version=1, as_frame=True)
df = titanic.frame.copy()
# Drop free-text/ID columns (name, ticket), sparse columns (cabin, home.dest),
# and columns that leak the outcome (boat, body)
df = df.drop(columns=['name', 'boat', 'body', 'home.dest', 'ticket', 'cabin'])
df = df.dropna(subset=['survived'])
🧪 3. Preprocessing: Define Columns
target = 'survived'
X = df.drop(columns=[target])
y = df[target].astype(int)

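numerical_cols = X.select_dtypes(include=np.number).columns.tolist()
# fetch_openml(as_frame=True) returns 'sex' and 'embarked' as pandas 'category' dtype,
# so include it alongside 'object' or these columns are silently dropped
categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()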
🧱 4. Build Numerical & Categorical Pipelines
# Numerical pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
🔄 5. Combine using ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipeline, numerical_cols),
    ('cat', cat_pipeline, categorical_cols)
])
✨ 6. Feature Selection + Random Forest Pipeline
# Final pipeline
pipeline = Pipeline([
    ('preprocess', preprocessor),
    # Seed the selector's forest too, so the selected features are reproducible
    ('feature_selection', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42)
    )),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
🧪 7. Train & Evaluate the Model
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}")
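Accuracy alone does not show what the feature-selection step did. The sketch below inspects a fitted `SelectFromModel` step through `get_support()` and `threshold_`. It uses synthetic data from `make_classification` as a stand-in (an assumption, so it runs without downloading anything), and `X_demo`, `y_demo`, `demo_pipeline`, `selector`, and `kept_mask` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("feature_selection", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0)
    )),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])
demo_pipeline.fit(X_demo, y_demo)

# Boolean mask over the selector's input columns: True means the feature was kept
selector = demo_pipeline.named_steps["feature_selection"]
kept_mask = selector.get_support()

print(f"Kept {kept_mask.sum()} of {kept_mask.size} features")
print(f"Importance threshold used: {selector.threshold_:.4f}")
```

On the Titanic pipeline, the same two attributes are available via `pipeline.named_steps['feature_selection']` after `pipeline.fit(...)`.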
📊 Interpretation
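Feature selection with SelectFromModel keeps only the features whose random-forest importance reaches the threshold (by default, the mean importance), which reduces dimensionality and noise. It can drop some redundant features, but correlated features tend to share importance, so it is not a dedicated fix for multicollinearity.

Pipelines ensure consistent preprocessing during training and testing: imputation, scaling, and encoding are fitted on the training split only and then applied unchanged to the test split, which also avoids data leakage.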

You can improve performance by:

Using grid search for hyperparameter tuning.

Trying other feature selectors (e.g., SelectKBest).

Handling class imbalance if it exists (for example, with class_weight='balanced' on the classifier).
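
The three suggestions above can be combined in one search. This is a minimal sketch, not a tuned setup: it swaps in the bundled breast-cancer dataset (an assumption, chosen so the example runs offline), uses `SelectKBest` with `f_classif` as the alternative selector, sets `class_weight='balanced'`, and searches a deliberately small grid. All variable names and grid values are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): all-numeric, ships with scikit-learn
X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, stratify=y_bc, test_size=0.2, random_state=42
)

search_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    # class_weight='balanced' reweights classes inversely to their frequency
    ("classifier", RandomForestClassifier(random_state=42, class_weight="balanced")),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {
    "select__k": [5, 10, 20],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

search = GridSearchCV(search_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy:    {search.score(X_te, y_te):.4f}")
```

Because the selector sits inside the pipeline, it is refitted on each cross-validation fold, so the choice of `k` is evaluated without leaking validation data into feature selection.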

✅ Q2. Pipeline with Voting Classifier (RandomForest + LogisticRegression)
📦 1. Load Iris Dataset
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

iris = load_iris(as_frame=True)
X = iris.data
y = iris.target
🛠 2. Create Pipelines for Each Model
pipe_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=200))
])
🤝 3. Voting Classifier
voting_clf = VotingClassifier(
    estimators=[('rf_pipe', pipe_rf), ('lr_pipe', pipe_lr)],
    voting='hard'
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Fit and evaluate
voting_clf.fit(X_train, y_train)
accuracy = voting_clf.score(X_test, y_test)
print(f"Voting Classifier Accuracy: {accuracy:.4f}")
✅ Suggestions for Improvement
Use GridSearchCV or RandomizedSearchCV for hyperparameter tuning.

Use voting='soft' if every model implements predict_proba and its probabilities are reasonably well calibrated.

Add more models like SVM or KNN for better ensemble diversity.
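
As a rough illustration of the last two suggestions, the sketch below switches to soft voting and adds a KNN model, scoring the ensemble with 5-fold cross-validation on Iris. The choice of `KNeighborsClassifier(n_neighbors=5)` and the names `soft_voting_clf` and `scores` are assumptions for this example; whether soft voting actually beats hard voting depends on the data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Soft voting averages predict_proba outputs, so every estimator must provide it
soft_voting_clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))),
        # KNN is distance-based, so it is scaled inside its own pipeline
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ],
    voting="soft",
)

# Cross-validation gives a steadier estimate than one 30-sample test split
scores = cross_val_score(soft_voting_clf, X_iris, y_iris, cv=5)
print(f"Soft voting CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Wrapping `soft_voting_clf` in `GridSearchCV` works the same way as for a single pipeline, with parameters addressed through the estimator names (for example `rf__n_estimators`).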

