# <div style="padding: 10px; background-color: #64CCC5; margin: 10px; color: #000000; font-family: 'New Times Roman', serif; font-size: 60%; text-align: center; border-radius: 10px; overflow: hidden; font-weight: bold;"> Question 1:</div>
**You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.**

**Design a pipeline that includes the following steps:**
-    Use an automated feature selection method to identify the important features in the dataset
-    Create a numerical pipeline that includes the following steps:
     -    Impute the missing values in the numerical columns using the mean of the column values
     -    Scale the numerical columns using standardisation
-    Create a categorical pipeline that includes the following steps:
     -    Impute the missing values in the categorical columns using the most frequent value of the column
     -    One-hot encode the categorical columns
-    Combine the numerical and categorical pipelines using a ColumnTransformer
-    Use a Random Forest Classifier to build the final model
-    Evaluate the accuracy of the model on the test dataset

**Note!** Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.

The following Python code uses scikit-learn to build a machine learning pipeline that addresses the requirements above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assuming 'X' is a pandas DataFrame of features and 'y' is the target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: Feature Selection
# The selector is placed after the preprocessor in the final pipeline (Step 5):
# its Random Forest needs imputed, fully numeric input, and the ColumnTransformer
# needs the original DataFrame columns to select by name.
feature_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))

# Step 2: Numerical Pipeline
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Step 3: Categorical Pipeline
categorical_features = X.select_dtypes(include=['object']).columns
categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Step 4: Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

# Step 5: Final Pipeline with Random Forest Classifier
final_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Step 6: Fit the model and evaluate on the test dataset
final_pipeline.fit(X_train, y_train)
y_pred = final_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")
```

Explanation:

1. **Feature Selection:** Uses `SelectFromModel` with a Random Forest Classifier to automatically select important features.

2. **Numerical Pipeline:** Imputes missing values with the mean and standardizes numerical features.

3. **Categorical Pipeline:** Imputes missing values with the most frequent value and performs one-hot encoding for categorical features.

4. **Column Transformer:** Combines the numerical and categorical pipelines.

5. **Final Pipeline:** Integrates the feature selection, preprocessing, and a Random Forest Classifier into a single pipeline.

6. **Model Evaluation:** Fits the model on the training set and evaluates its accuracy on the test set.

Possible Improvements:

- **Hyperparameter Tuning:** Perform hyperparameter tuning for both the feature selection method and the Random Forest Classifier.
  
- **Cross-Validation:** Implement cross-validation to get a more robust estimate of the model's performance.

- **Additional Feature Engineering:** Experiment with additional feature engineering techniques, such as creating interaction terms or polynomial features.

- **Handle Imbalanced Data:** If your classes are imbalanced, consider techniques such as oversampling, undersampling, or using different evaluation metrics.

- **Model Interpretability:** Consider using model-agnostic interpretability techniques to understand the model's predictions better.
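The first two suggestions can be sketched together. The example below is an illustration under stated assumptions rather than a tuned setup: it uses synthetic numeric data from `make_classification`, a simplified numeric-only pipeline, and a small, arbitrarily chosen parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 10 features, 4 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("feature_selection", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0))),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Cross-validation: five accuracy estimates instead of a single train/test split.
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")

# Hyperparameter tuning: step parameters are addressed as '<step name>__<parameter>'.
param_grid = {
    "feature_selection__threshold": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

Because the selector and the classifier live in one `Pipeline`, each cross-validation fold refits the whole chain on its own training portion, so the feature selection does not see the held-out data.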

# <div style="padding: 10px; background-color: #64CCC5; margin: 10px; color: #000000; font-family: 'New Times Roman', serif; font-size: 60%; text-align: center; border-radius: 10px; overflow: hidden; font-weight: bold;"> Question 2:</div>
**Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.**

In [10]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [11]:
iris = load_iris()

In [5]:
X = iris.data
y = iris.target

In [7]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize the features
    ('ensemble', VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
            ('lr', LogisticRegression(random_state=42))
        ],
        voting='soft'  # Use 'soft' for probability-based voting
    ))
])

In [9]:
# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test data
y_pred = pipeline.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')

Accuracy: 1.0000
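
The score above comes from a single split with only 30 test samples, so it is a coarse estimate. As a hedged sketch of a more robust check, the code below compares the two base classifiers and the soft-voting ensemble using 5-fold cross-validation. The fold count and `max_iter=1000` are illustrative choices that are not part of the original solution.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
lr = LogisticRegression(random_state=42, max_iter=1000)  # max_iter is an added assumption
voting = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="soft")

cv_results = {}
for name, clf in [("rf", rf), ("lr", lr), ("voting", voting)]:
    # Scaling lives inside the pipeline, so it is refitted on each training fold.
    model = Pipeline([("scaler", StandardScaler()), ("clf", clf)])
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

If the ensemble does not clearly beat its individual members, that suggests the two models make similar errors on this dataset and the voting step adds little.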


# <div style="padding: 15px; background-color: #D2E0FB; margin: 15px; color: #000000; font-family: 'New Times Roman', serif; font-size: 110%; text-align: center; border-radius: 10px; overflow: hidden; font-weight: bold;"> ***...Complete...***</div>