### Q1. Designing a Machine Learning Pipeline

The pipeline below automates preprocessing (imputing missing values, scaling, and encoding), then chains feature selection and model building with a Random Forest Classifier:

#### Step-by-Step Solution

#### 1. Import Necessary Libraries:

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel


#### 2. Load the Dataset:

In [None]:
# Assuming 'data.csv' is your dataset file
df = pd.read_csv('data.csv')

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']


#### 3. Train-Test Split:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#### 4. Automated Feature Selection:

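Use SelectFromModel wrapped around a RandomForestClassifier for feature selection. Features whose importance falls below the threshold (the mean importance by default for this estimator) are dropped.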

In [None]:
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
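
To make the selection step concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the names `X_demo`, `y_demo`, and `selector` are assumptions for illustration only and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Fit the forest, then keep features whose importance is at least the mean importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
X_selected = selector.fit_transform(X_demo, y_demo)

# Boolean mask marking which of the original columns were kept
support_mask = selector.get_support()
print(X_demo.shape, '->', X_selected.shape)
print('Kept feature indices:', support_mask.nonzero()[0])
```

Passing a `threshold` argument (for example `'median'`) changes how aggressively features are dropped.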


#### 5. Preprocessing Pipelines:

- Numerical Pipeline:

In [None]:
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])


- Categorical Pipeline:

In [None]:
categorical_features = X.select_dtypes(include=['object']).columns

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


- Combine Pipelines:

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ]
)


#### 6. Build the Final Pipeline:

In [None]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])


#### 7. Train the Model:

In [None]:
pipeline.fit(X_train, y_train)
# Evaluate on the held-out test set
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


#### Interpretation and Possible Improvements

- **Interpretation:** This pipeline automates preprocessing (handling missing values and encoding), feature selection, and model training. The accuracy score gives an initial measure of model performance.
- **Possible Improvements:**
  - **Hyperparameter Tuning:** Use GridSearchCV to tune hyperparameters.
  - **More Sophisticated Imputation:** Explore other imputation methods like K-Nearest Neighbors.
  - **Feature Engineering:** Create interaction features or polynomial features.
  - **Model Ensemble:** Combine multiple models (e.g., gradient boosting) for better performance.
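
The first two improvements can be sketched together: a grid search that tunes the classifier and also swaps the numerical imputer between mean imputation and K-Nearest Neighbors. This is a minimal sketch, not a tuned setup. The toy DataFrame, its column names (`age`, `income`, `city`), and the grid values are all hypothetical stand-ins for `data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for 'data.csv'
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    'age': rng.normal(40, 10, n),
    'income': rng.normal(50_000, 15_000, n),
    'city': pd.Series(rng.choice(['A', 'B', 'C'], n), dtype=object),
})
toy_y = (toy['age'] + rng.normal(0, 5, n) > 40).astype(int)
toy.loc[::10, 'age'] = np.nan   # inject missing numerical values
toy.loc[::15, 'city'] = np.nan  # inject missing categorical values

preprocess = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler())]), ['age', 'income']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['city']),
])
model = Pipeline([
    ('preprocessor', preprocess),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Double underscores address nested steps; a whole step can also be replaced
param_grid = {
    'preprocessor__num__imputer': [SimpleImputer(strategy='mean'), KNNImputer(n_neighbors=5)],
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [3, None],
}
search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
search.fit(toy, toy_y)

print(search.best_params_)
print(f'Best cross-validated accuracy: {search.best_score_:.2f}')
```

Because the imputer sits inside the pipeline, it is refit on each training fold, so the cross-validated score does not leak information from the validation fold.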


### Q2. Building a Voting Classifier Pipeline

Here's how to create a pipeline that includes a Random Forest Classifier and a Logistic Regression Classifier and then use a Voting Classifier to combine their predictions:

#### Step-by-Step Solution

**1. Import Necessary Libraries:**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier


**2. Define the Pipeline with Both Classifiers:**

In [None]:
# Reuse the preprocessor from Q1
pipeline_rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Combine the pipelines into a Voting Classifier
voting_pipeline = VotingClassifier(
    estimators=[('rf', pipeline_rf), ('lr', pipeline_lr)],
    voting='soft'
)
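
With `voting='soft'`, the ensemble averages the class probabilities predicted by each estimator and picks the class with the highest average. The sketch below checks this on synthetic data. The dataset and the names `voter`, `manual_proba`, and `manual_pred` are assumptions for illustration, and the preprocessing step is omitted to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (purely numerical, no missing values)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

voter = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',
).fit(X_demo, y_demo)

# Reproduce soft voting by hand: average the fitted estimators' probabilities
manual_proba = np.mean([est.predict_proba(X_demo) for est in voter.estimators_], axis=0)
manual_pred = voter.classes_[np.argmax(manual_proba, axis=1)]

print(np.allclose(manual_proba, voter.predict_proba(X_demo)))
print((manual_pred == voter.predict(X_demo)).all())
```

Soft voting requires every estimator to implement `predict_proba`; with `voting='hard'` the ensemble takes a majority vote over predicted labels instead.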

**3. Train the Voting Classifier:**


In [None]:
voting_pipeline.fit(X_train, y_train)

**4. Evaluate the Model:**

In [None]:
y_pred_voting = voting_pipeline.predict(X_test)
accuracy_voting = accuracy_score(y_test, y_pred_voting)
print(f'Voting Classifier Accuracy: {accuracy_voting:.2f}')


#### Interpretation

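- **Accuracy Improvement:** Combining classifiers with different strengths can improve accuracy when their errors are not strongly correlated. The gain is not guaranteed, so compare the ensemble against each individual model.
- **Robustness:** Averaging predictions across models tends to reduce variance, which can make the Voting Classifier less prone to overfitting than a single model.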
- **Further Improvements:**
  - **Add More Classifiers:** Include additional models such as SVM or Gradient Boosting.
  - **Stacking:** Instead of simple voting, use stacking for more complex blending of model predictions.
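
The stacking suggestion can be sketched with scikit-learn's `StackingClassifier`, which trains a final estimator on cross-validated predictions of the base models. This is a minimal sketch on synthetic data. The dataset, the variable names, and the choice of Logistic Regression as the final estimator are assumptions, not part of the original solution.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (purely numerical, no missing values)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(),  # meta-model trained on base-model outputs
    cv=5,  # base-model predictions for the meta-model come from 5-fold CV
)
stack.fit(Xtr, ytr)

stack_accuracy = accuracy_score(yte, stack.predict(Xte))
print(f'Stacking Classifier Accuracy: {stack_accuracy:.2f}')
```

With real data, the base estimators can be full pipelines such as `pipeline_rf` and `pipeline_lr`, so each one applies its own preprocessing.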

Together, the two solutions cover imputation, scaling, encoding, feature selection, and ensembling within a single scikit-learn workflow, so the same preprocessing is applied consistently at training and prediction time.