# Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.

A1

Creating a machine learning pipeline for feature engineering, preprocessing, and modeling is a common practice in data science. Below, I'll outline the steps you described and provide Python code snippets for each part of the pipeline. I'll assume you're using scikit-learn for this purpose.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Separate the target variable (y) from the features (X)
X = data.drop(columns=['target_column'])
y = data['target_column']

# Step 1: Automated Feature Selection
# SelectFromModel keeps the features whose Random Forest importance is at
# least the mean importance (its default threshold for this estimator).
# The selector cannot be fit on raw string columns, so it is only defined
# here and runs inside the final pipeline, after preprocessing (Step 4).
feature_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42)
)

# Step 2: Preprocessing Pipelines for Numerical and Categorical Data
# Create a numerical pipeline: mean imputation, then standardisation
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Create a categorical pipeline: most-frequent imputation, then one-hot encoding
categorical_features = X.select_dtypes(include=['object']).columns
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Step 3: Combine Numerical and Categorical Pipelines
# Use ColumnTransformer to apply the pipelines to the respective feature sets
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

# Step 4: Build the Final Model
# Chain preprocessing, feature selection, and the classifier so that every
# step is fit on the training data only
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', feature_selector),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Step 5: Split Data and Train the Model
# Split the raw features; the pipeline handles all transformations itself
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)

# Step 6: Evaluate the Model
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on test data: {accuracy:.2f}')
```
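Because `your_dataset.csv` and `target_column` are placeholders, the code above cannot be run as printed. The following is a minimal sketch on a small synthetic DataFrame. The column names `age`, `income`, `city`, and `target`, and the data itself, are made up for illustration. Feature selection is placed after preprocessing so that the selector only sees imputed, numeric inputs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numerical columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# The target depends on age only, so the other columns are noise.
toy["target"] = (toy["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Blank out roughly 10% of the values in each feature column.
for col in ["age", "income", "city"]:
    toy.loc[rng.random(n) < 0.1, col] = np.nan

X = toy.drop(columns=["target"])
y = toy["target"]

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, ["age", "income"]),
    ("cat", categorical_pipeline, ["city"]),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("selector", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0))),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
pipeline.fit(X_train, y_train)

# Count columns after preprocessing and after feature selection.
n_before = pipeline.named_steps["preprocessor"].transform(X_test).shape[1]
n_after = pipeline[:-1].transform(X_test).shape[1]
test_accuracy = pipeline.score(X_test, y_test)

print(f"Features after preprocessing: {n_before}, after selection: {n_after}")
print(f"Accuracy on test data: {test_accuracy:.2f}")
```

Note that the selector here works on the preprocessed columns, so it picks among one-hot indicator columns rather than among the original categorical features.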

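**Interpretation:**

- The pipeline handles feature selection, missing value imputation, scaling for numerical features, and one-hot encoding for categorical features.

- The Random Forest Classifier is used as the final model.

- The pipeline's performance is evaluated using accuracy on the test dataset, i.e. the fraction of held-out samples that are classified correctly. Because the dataset here is a placeholder, no concrete score is reported; whatever value you obtain should be compared against a simple baseline such as always predicting the most frequent class.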
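An accuracy figure is easier to interpret next to a trivial baseline and per-class metrics. The sketch below assumes synthetic, imbalanced data from `make_classification` as a stand-in for the real dataset, which is not available here. It compares a Random Forest with a `DummyClassifier` that always predicts the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly a 90/10 class split.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

y_pred = model.predict(X_test)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Model accuracy:    {model_acc:.2f}")
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_test, y_pred))
```

On imbalanced data the baseline alone already scores highly, which is why the per-class precision and recall in the report are more informative than accuracy by itself.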

**Possible Improvements:**

1. **Hyperparameter Tuning:** Optimize hyperparameters for the Random Forest Classifier and other components of the pipeline using techniques like grid search or random search.

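2. **Feature Selection:** The pipeline already selects features by Random Forest importance through `SelectFromModel`. Experiment with alternatives such as recursive feature elimination (`RFE`) or mutual-information scoring (`SelectKBest` with `mutual_info_classif`), and address the highly correlated features directly, for example by dropping one feature from each strongly correlated pair.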

3. **Handling Class Imbalance:** If the dataset is imbalanced, consider techniques like resampling (oversampling or undersampling) or using different evaluation metrics (e.g., F1-score) to handle class imbalance.

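4. **Advanced Imputation:** Explore more advanced imputation techniques for the numerical columns, such as K-nearest neighbors imputation (`KNNImputer`) or the experimental `IterativeImputer`. Both operate on numerical data, so categorical columns need to be encoded first or imputed with a separate strategy.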

5. **Ensemble Methods:** A Random Forest is already a bagging ensemble; consider other ensemble methods like stacking or gradient boosting to improve model performance further.

6. **Feature Engineering:** Depending on your domain knowledge, you may explore additional feature engineering steps to create new meaningful features.

Remember that the choice of preprocessing steps and model should be guided by the specific characteristics of your dataset and problem domain. The pipeline can be further refined and customized based on your domain knowledge and experimentation.
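As an illustration of the hyperparameter tuning suggested above, here is a minimal sketch that runs `GridSearchCV` over a whole pipeline. The synthetic data from `make_classification` and the particular grid values are assumptions chosen to keep the example small, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own features and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
    "classifier__max_depth": [None, 5],
}

# 5-fold cross-validation on the training set for every grid combination.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
print(f"Test accuracy of refit best model: {search.score(X_test, y_test):.2f}")
```

Searching over the pipeline rather than the bare classifier means the imputer and scaler are refit inside each cross-validation fold, so no information from the validation fold leaks into preprocessing.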

# Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

A2

To build a pipeline that includes a Random Forest Classifier and a Logistic Regression Classifier, and then use a Voting Classifier to combine their predictions on the Iris dataset, you can follow these steps using scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the individual classifiers
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# max_iter is raised from its default of 100 so that the solver converges
# on the unscaled features instead of stopping at the iteration limit
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)

# Create the Voting Classifier
voting_classifier = VotingClassifier(
    estimators=[
        ('rf', rf_classifier),
        ('lr', lr_classifier)
    ],
    voting='hard'  # 'hard' for majority voting, 'soft' to average predicted class probabilities
)

# Train the Voting Classifier on the training data
voting_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = voting_classifier.predict(X_test)

# Evaluate the accuracy of the Voting Classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of Voting Classifier: {accuracy:.2f}')
```

```
Accuracy of Voting Classifier: 1.00
```

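The Voting Classifier combines the predictions from both classifiers by majority vote (`voting='hard'`) and reaches an accuracy of 1.00 on the test set. That test set contains only 30 samples from a small, easily separable dataset, so the score should not be read as evidence of a perfect model in general. With only two estimators, any disagreement is a tie, which scikit-learn resolves in favour of the class label that comes first in sorted order; adding a third estimator or switching to `voting='soft'` avoids relying on that tie-break. You can adjust the hyperparameters of the individual classifiers and the Voting Classifier as needed for your specific use case.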
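A single 30-sample test split gives a coarse estimate, and logistic regression generally converges more reliably on standardised inputs. The sketch below is one possible variant, not the original solution: it wraps the logistic regression in its own scaling pipeline, switches to soft voting, and scores the ensemble with 5-fold cross-validation. The choices of `cv=5` and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling is applied only to the logistic regression branch;
# the random forest does not need scaled inputs.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Soft voting averages the predicted class probabilities of both estimators.
soft_voter = VotingClassifier(
    estimators=[("rf", rf), ("lr", scaled_lr)],
    voting="soft",
)

# Stratified 5-fold cross-validation over the full dataset.
scores = cross_val_score(soft_voter, X, y, cv=5)
print("Fold accuracies:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f}")
```

Because the scaler lives inside the estimator passed to `cross_val_score`, it is refit on the training portion of each fold.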