Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
   - Impute the missing values in the numerical columns using the mean of the column values.
   - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
   - Impute the missing values in the categorical columns using the most frequent value of the column.
   - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.


To design a machine learning pipeline that handles missing values, automates feature engineering, and builds a Random Forest classifier, we can use the following steps:

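1. **Automated Feature Selection**: Identify the important features using a feature selection method. Because the selector needs complete, numeric input, it has to be fitted on the output of the preprocessing in steps 2-4 rather than on the raw columns.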
2. **Numerical Pipeline**:
   - Impute missing values using the mean of the column values.
   - Scale the numerical columns using standardization.
3. **Categorical Pipeline**:
   - Impute missing values using the most frequent value of the column.
   - One-hot encode the categorical columns.
4. **Combine Pipelines**: Use `ColumnTransformer` to combine the numerical and categorical pipelines.
5. **Build and Evaluate the Model**: Use a Random Forest Classifier to build the model and evaluate its accuracy.

Here's the code implementation for each step:

### Step 1: Automated Feature Selection
We will use `SelectKBest` with `f_classif` for feature selection. `f_classif` cannot score columns that contain missing values or strings, so the selector is only defined here; it has to be fitted on data that has already been imputed and encoded, not on the raw training split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

# Assuming df is your DataFrame and 'target' is your target column
X = df.drop(columns='target')
y = df['target']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the selector without fitting it: calling fit on raw X_train would fail
# on NaNs and string columns. k='all' keeps every feature (a placeholder);
# set k to an integer to actually drop the lowest-scoring features.
selector = SelectKBest(score_func=f_classif, k='all')
```
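
To make the selector's output concrete, here is a minimal sketch on synthetic, already-numeric data. The data from `make_classification`, the sizes, and `k=3` are all assumptions chosen for illustration; they stand in for features that have already been imputed and encoded.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for preprocessed data: 8 numeric features, 3 informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

demo_selector = SelectKBest(score_func=f_classif, k=3)
X_demo_selected = demo_selector.fit_transform(X_demo, y_demo)

# One ANOVA F-score per input feature; higher means stronger class separation
print(np.round(demo_selector.scores_, 2))
# Boolean mask marking the k features that were kept
print(demo_selector.get_support())
print(X_demo_selected.shape)
```

Inspecting `scores_` before settling on `k` is a quick way to see whether a few features dominate or the scores are spread evenly.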

### Step 2: Numerical Pipeline
Impute missing values and scale the numerical columns.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
```
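
As a quick check of what this pipeline does, the sketch below applies the same two steps to a single toy column with one missing value (the array is made up for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_numerical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# One numeric column with a missing value; the mean of the observed values is 2.0
X_num = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_num_out = demo_numerical_pipeline.fit_transform(X_num)

# The imputed row sits exactly at the column mean, so it scales to 0
print(X_num_out.ravel())
```

The imputed entry lands on the column mean and therefore becomes 0 after standardisation; the output is approximately `[-1.414, 0, 0, 1.414]`.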

### Step 3: Categorical Pipeline
Impute missing values and one-hot encode the categorical columns.

```python
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
```
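
The following sketch runs the same two steps on a toy categorical column (the colour values are made up) to show the imputed value, the encoded output, and what `handle_unknown='ignore'` does with a category that was not present during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# One categorical column with a missing value; 'red' is the most frequent level
X_cat = np.array([['red'], ['blue'], [np.nan], ['red']], dtype=object)
X_cat_out = demo_categorical_pipeline.fit_transform(X_cat)

# OneHotEncoder returns a sparse matrix by default; densify it for display
print(demo_categorical_pipeline.named_steps['onehot'].categories_)
print(X_cat_out.toarray())

# A level never seen during fit becomes an all-zero row instead of raising an error
X_unseen = np.array([['green']], dtype=object)
unseen_out = demo_categorical_pipeline.transform(X_unseen).toarray()
print(unseen_out)
```

The missing entry is filled with `'red'`, so the third row encodes the same way as the first and last rows.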

### Step 4: Combine Pipelines using ColumnTransformer
Combine the numerical and categorical pipelines.

```python
from sklearn.compose import ColumnTransformer

# Identify numerical and categorical columns by dtype
# ('number' covers every integer and float width, not just int64/float64)
numerical_features = X.select_dtypes(include='number').columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])
```

### Step 5: Build and Evaluate the Model
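Build the Random Forest classifier and evaluate the model. The selector from Step 1 sits between the preprocessor and the classifier, so it is fitted on imputed, encoded training data only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Chain preprocessing, feature selection, and the classifier in one pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', selector),  # SelectKBest from Step 1, fitted here on preprocessed data
    ('classifier', RandomForestClassifier(random_state=42))
])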

# Train the model
model_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = model_pipeline.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```
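
Since `df` is only assumed above, the following sketch runs the five steps end to end on a synthetic frame so the whole flow can be tried without a real dataset. The column names, the rule that generates the target, the 10% missing rate, and `k=3` are all hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data: two numeric columns and one categorical column
rng = np.random.default_rng(0)
n = 300
df_demo = pd.DataFrame({
    'num_a': rng.normal(size=n),
    'num_b': rng.normal(size=n),
    'cat_a': pd.Series(rng.choice(['x', 'y', 'z'], size=n), dtype=object),
})
# Made-up target rule, computed before any values are removed
df_demo['target'] = (df_demo['num_a'] + (df_demo['cat_a'] == 'x') > 0.5).astype(int)

# Knock out roughly 10% of the values in each feature column
for col in ['num_a', 'num_b', 'cat_a']:
    df_demo.loc[rng.random(n) < 0.1, col] = np.nan

X_demo = df_demo.drop(columns='target')
y_demo = df_demo['target']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

demo_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                          ('scaler', StandardScaler())]), ['num_a', 'num_b']),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                          ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['cat_a']),
    ])),
    # Selection runs on imputed, encoded features; k=3 is an arbitrary choice
    ('selector', SelectKBest(score_func=f_classif, k=3)),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42)),
])

demo_pipeline.fit(X_tr, y_tr)
demo_accuracy = accuracy_score(y_te, demo_pipeline.predict(X_te))
print(f'Accuracy on synthetic test split: {demo_accuracy:.2f}')
```

Because every transformer lives inside the pipeline, calling `fit` on the training split and `predict` on the test split is enough; nothing is fitted on test data.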

### Interpretation of Results and Possible Improvements
- **Interpretation**: The accuracy score gives an initial indication of the model's performance. Further evaluation metrics such as precision, recall, and F1-score should be calculated to understand the model's effectiveness, especially if the dataset is imbalanced.
- **Possible Improvements**:
  - **Feature Engineering**: Additional feature engineering steps such as creating interaction terms or polynomial features.
  - **Hyperparameter Tuning**: Perform grid search or random search to find the optimal hyperparameters for the Random Forest classifier.
  - **Model Selection**: Experiment with different models like Gradient Boosting, XGBoost, or neural networks.
  - **Cross-Validation**: Use cross-validation to get a more reliable estimate of the model's performance.
  - **Handling Imbalance**: If the dataset is imbalanced, consider techniques like SMOTE or using class weights in the classifier.
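
Two of the improvements above, cross-validation and hyperparameter tuning, can be sketched as follows. Synthetic numeric data from `make_classification` stands in for the real dataset, and the grid values are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data as a stand-in for the real dataset
X_cv, y_cv = make_classification(n_samples=300, n_features=10, random_state=0)

cv_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# 5-fold cross-validation: one accuracy score per fold
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')

# Grid search over pipeline parameters, addressed as '<step name>__<parameter>'
param_grid = {
    'classifier__max_depth': [None, 5],
    'classifier__min_samples_leaf': [1, 5],
}
search = GridSearchCV(cv_pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X_cv, y_cv)
print(search.best_params_, f'{search.best_score_:.2f}')
```

Passing the whole pipeline to `cross_val_score` and `GridSearchCV` means the preprocessing is refitted inside every fold, which keeps the validation folds out of the fitted statistics. For imbalanced targets, `class_weight='balanced'` on the classifier can be added to the grid in the same way.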

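This pipeline provides a structured approach to handling missing values, feature selection, and building a Random Forest model. Keeping imputation, scaling, and encoding inside a single `Pipeline` means they are fitted on the training split only, so no test-set statistics leak into the model.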

Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.


To build a pipeline that includes both a Random Forest classifier and a Logistic Regression classifier, and then use a Voting Classifier to combine their predictions, we can follow these steps:

1. **Preprocess the data**: For the Iris dataset, we don't need to handle missing values or categorical features, but we'll scale the numerical features.
2. **Build individual classifiers**: Random Forest and Logistic Regression.
3. **Combine classifiers using a Voting Classifier**.
4. **Train the pipeline** on the Iris dataset.
5. **Evaluate the accuracy** of the combined model.

Here's the implementation in Python using scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline for scaling; it is fitted inside model_pipeline below,
# so X_train and X_test do not need to be transformed by hand
preprocessing_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Define the individual classifiers
random_forest = RandomForestClassifier(random_state=42)
logistic_regression = LogisticRegression(random_state=42, max_iter=200)

# Combine classifiers using a Voting Classifier
voting_classifier = VotingClassifier(
    estimators=[
        ('rf', random_forest),
        ('lr', logistic_regression)
    ],
    voting='soft'  # 'soft' voting uses predicted probabilities
)

# Define the final pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessing_pipeline),
    ('classifier', voting_classifier)
])

# Train the pipeline
model_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = model_pipeline.predict(X_test)

# Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
```

### Explanation of Each Step

1. **Load the Iris dataset**: We use `load_iris()` from scikit-learn to load the dataset.
2. **Split the data**: The dataset is split into training and test sets using `train_test_split`.
3. **Preprocessing pipeline**: A pipeline is created to scale the numerical features using `StandardScaler`.
4. **Individual classifiers**: We define a Random Forest classifier and a Logistic Regression classifier.
5. **Voting Classifier**: We use a `VotingClassifier` to combine the predictions of the Random Forest and Logistic Regression classifiers. The `voting='soft'` parameter uses the predicted probabilities to make the final prediction.
6. **Final pipeline**: The preprocessing pipeline and the Voting Classifier are combined into a single pipeline.
7. **Train the pipeline**: The pipeline is trained on the training data.
8. **Make predictions**: The trained pipeline is used to make predictions on the test set.
9. **Evaluate accuracy**: The accuracy of the combined model is evaluated using `accuracy_score`.
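
To illustrate what soft voting computes, the sketch below refits an equivalent pipeline and reproduces its prediction by hand: it averages the class probabilities of the two fitted members and picks the class with the highest mean. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

soft_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', VotingClassifier(
        estimators=[('rf', RandomForestClassifier(random_state=42)),
                    ('lr', LogisticRegression(random_state=42, max_iter=200))],
        voting='soft')),
])
soft_pipeline.fit(X_tr, y_tr)

# Reproduce soft voting by hand: scale, then average the fitted members' probabilities
X_te_scaled = soft_pipeline.named_steps['scaler'].transform(X_te)
voter = soft_pipeline.named_steps['classifier']
member_probas = [est.predict_proba(X_te_scaled) for est in voter.estimators_]
manual_proba = np.mean(member_probas, axis=0)
manual_pred = voter.classes_[np.argmax(manual_proba, axis=1)]

probas_match = np.allclose(manual_proba, soft_pipeline.predict_proba(X_te))
preds_match = np.array_equal(manual_pred, soft_pipeline.predict(X_te))
print(probas_match, preds_match)
```

Both checks should print `True`. With `voting='hard'` the ensemble would instead take a majority vote over the members' predicted labels, and `predict_proba` would not be available.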

### Interpretation of Results
- **Accuracy**: The printed accuracy is the fraction of the 30 held-out test samples (20% of the 150 Iris rows) that the combined model classifies correctly.
- **Further Evaluation**: Per-class precision, recall, and F1-score give a finer picture than a single accuracy figure. Iris has 50 samples per class, so imbalance is not a concern here, but these metrics show which species the model confuses with each other.
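
A minimal sketch of that further evaluation, refitting an equivalent voting pipeline and printing a confusion matrix and a per-class report (the variable names are chosen for this example only):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=42)

report_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', VotingClassifier(
        estimators=[('rf', RandomForestClassifier(random_state=42)),
                    ('lr', LogisticRegression(random_state=42, max_iter=200))],
        voting='soft')),
])
report_pipeline.fit(X_tr, y_tr)
y_te_pred = report_pipeline.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_te_pred)
print(cm)

# Per-class precision, recall, and F1-score
print(classification_report(y_te, y_te_pred, target_names=iris_data.target_names))
```

Off-diagonal entries in the confusion matrix show which pairs of species are mistaken for each other.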

This pipeline demonstrates how to combine multiple classifiers using a Voting Classifier to potentially improve the overall performance.