# Ensemble Techniques And Its Types-5

### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values:

-- Design a pipeline that includes the following steps:

-- Use an automated feature selection method to identify the important features in the dataset

-- Create a numerical pipeline that includes the following steps:

-- Impute the missing values in the numerical columns using the mean of the column values

-- Scale the numerical columns using standardisation

-- Create a categorical pipeline that includes the following steps:

-- Impute the missing values in the categorical columns using the most frequent value of the column

-- One-hot encode the categorical columns

-- Combine the numerical and categorical pipelines using a ColumnTransformer:

-- Use a Random Forest Classifier to build the final model

-- Evaluate the accuracy of the model on the test dataset

Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.

In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Split the dataset into features (X) and target (y)
X = data.drop('target_column', axis=1)
y = data['target_column']

# Step 1: Identify the numerical and categorical columns
numerical_features = X.select_dtypes(include=[np.number]).columns
# np.object was removed in NumPy 1.24, so the dtypes are selected by name instead
categorical_features = X.select_dtypes(include=['object', 'category']).columns

# Step 2: Numerical Pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Step 3: Categorical Pipeline
# handle_unknown='ignore' keeps prediction from failing on categories not seen during training
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Step 4: Combine Numerical and Categorical Pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

# Step 5: Final Model Pipeline
# Automated feature selection (SelectFromModel with a Random Forest) runs after preprocessing,
# because it needs numeric input without missing values. Keeping it inside the pipeline also
# means it is fitted on the training data only, so the test set does not leak into the selection.
model = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selector', SelectFromModel(RandomForestClassifier(random_state=42))),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Step 6: Train and Evaluate the Model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy on the test dataset: {accuracy:.2f}')

# Interpretation of Results:
# We created a pipeline that selects important features, imputes missing values, scales numerical features,
# and one-hot encodes categorical features before feeding them into a Random Forest Classifier.

# Possible Improvements:
# 1. Experiment with different feature selection methods to find the best set of features.
# 2. Tune hyperparameters of the Random Forest Classifier for better performance.
# 3. Consider other imputation methods or more advanced encoding techniques for categorical features.
# 4. Use cross-validation to better estimate model performance.
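
Improvements 2 and 4 above (hyperparameter tuning and cross-validation) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy DataFrame `demo`, its column names (`num_a`, `num_b`, `cat_a`), the target `demo_y`, and the small parameter grid are all made up for the example and stand in for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numerical columns and one categorical column
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'num_a': rng.normal(size=n),
    'num_b': rng.normal(size=n),
    'cat_a': rng.choice(['red', 'green', 'blue'], size=n),
})
demo['cat_a'] = demo['cat_a'].astype(object)

# Made-up binary target that depends on one numerical and one categorical column
demo_y = (demo['num_a'] + (demo['cat_a'] == 'red') > 0.5).astype(int)

# Knock out some values so that the imputers have work to do
demo.loc[rng.choice(n, size=20, replace=False), 'num_a'] = np.nan
demo.loc[rng.choice(n, size=20, replace=False), 'cat_a'] = np.nan

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, ['num_a', 'num_b']),
    ('cat', categorical_pipeline, ['cat_a']),
])
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Improvement 4: 5-fold cross-validation instead of a single train/test split
cv_scores = cross_val_score(model, demo, demo_y, cv=5, scoring='accuracy')
print(f'CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}')

# Improvement 2: a small grid search over Random Forest hyperparameters
param_grid = {
    'classifier__n_estimators': [25, 50],
    'classifier__max_depth': [None, 5],
}
search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
search.fit(demo, demo_y)
print('Best parameters:', search.best_params_)
print(f'Best CV accuracy: {search.best_score_:.2f}')
```

Because the preprocessing lives inside the pipeline, each cross-validation fold fits the imputers, scaler, and encoder on its own training portion only.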


### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [123]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score

In [84]:
from sklearn.datasets import load_iris

In [124]:
iris_data= load_iris()

In [125]:
df = pd.DataFrame(data = iris_data['data'], columns = iris_data['feature_names'])

In [126]:
df['species'] = iris_data['target']

In [127]:
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


In [129]:
X = df.drop('species', axis=1)
y = df['species']

In [103]:
# Pipeline for the numerical features
numerical_features = X.select_dtypes(include=[np.number]).columns
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

#### Since the dataset contains only numerical features, no categorical pipeline is needed

In [130]:
# Applying the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features)
        ])

In [131]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=42)


In [132]:
X_train.shape,X_test.shape

((120, 4), (30, 4))

In [133]:
X_train=preprocessor.fit_transform(X_train)
X_test=preprocessor.transform(X_test)

In [108]:
# Evaluating Models Separately

models={
    'Random Forest':RandomForestClassifier(),
    'Logistic Classifier':LogisticRegression()}

In [109]:
def evaluate_model(X_train, y_train, X_test, y_test, models):
    # Fit each model on the training data and return its test-set accuracy, keyed by model name
    report = {}
    for name, model in models.items():
        # Train the model
        model.fit(X_train, y_train)

        # Predict on the test data
        y_test_pred = model.predict(X_test)

        # Record the accuracy of the test predictions
        report[name] = accuracy_score(y_test, y_test_pred)

    return report

In [110]:
evaluate_model(X_train,y_train,X_test,y_test,models)

{'Random Forest': 1.0, 'Logistic Classifier': 1.0}
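
A score of 1.0 on a single split of 30 test samples is a coarse estimate and may not hold for other splits. As a rough check, the sketch below runs 5-fold stratified cross-validation on the full iris data for both models. The names `candidates`, `cv`, and `cv_report` are illustrative, and `max_iter=1000` and the fixed `random_state` values are assumptions added for stability and reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Shuffled, stratified folds keep the three species balanced in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Classifier': LogisticRegression(max_iter=1000),
}

cv_report = {}
for name, estimator in candidates.items():
    # Scaling is part of the pipeline, so it is refitted on each training fold
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_iris, y_iris, cv=cv, scoring='accuracy')
    cv_report[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

The spread across folds gives a sense of how much the accuracy depends on which samples end up in the test set.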

In [121]:
# Predict Using Voting Classifier

from sklearn.ensemble import VotingClassifier
from sklearn.metrics import classification_report

clf1 = RandomForestClassifier()
clf2 = LogisticRegression()

# Checking Voting Classifier with soft voting
v_clf1 = VotingClassifier(estimators=[('Random Forest', clf1), ('Logistic Classifier', clf2)], voting='soft')
v_clf1.fit(X_train, y_train)
y_pred_vc = v_clf1.predict(X_test)
print(classification_report(y_test, y_pred_vc))
print('Test Accuracy for Voting Classifier is',accuracy_score(y_test,y_pred_vc))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Test Accuracy for Voting Classifier is 1.0
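
Soft voting, used above, averages the class probabilities predicted by each estimator and picks the class with the highest average. Hard voting, used in the next cell, takes a majority vote over the predicted labels. The two can disagree, as the small sketch below shows. The probabilities are made-up numbers for three hypothetical classifiers on a single sample, not outputs of the models in this notebook.

```python
import numpy as np

# Made-up class probabilities for one sample: one row per hypothetical classifier,
# one column per class (class 0, class 1)
probabilities = np.array([
    [0.90, 0.10],   # classifier A is very confident in class 0
    [0.40, 0.60],   # classifier B leans towards class 1
    [0.45, 0.55],   # classifier C leans towards class 1
])

# Hard voting: each classifier votes for its most probable class, majority wins
votes = probabilities.argmax(axis=1)
hard_prediction = np.bincount(votes, minlength=probabilities.shape[1]).argmax()

# Soft voting: average the probabilities per class, highest average wins
mean_probabilities = probabilities.mean(axis=0)
soft_prediction = mean_probabilities.argmax()

print('Votes per classifier:', votes)                        # [0 1 1]
print('Hard voting predicts class', hard_prediction)         # 1
print('Mean probabilities:', mean_probabilities.round(3))    # [0.583 0.417]
print('Soft voting predicts class', soft_prediction)         # 0
```

Here the single confident classifier outweighs the two lukewarm ones under soft voting, while hard voting only counts labels.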


In [122]:
# Checking Voting Classifier with hard voting

v_clf2 = VotingClassifier(estimators=[('Random Forest', clf1), ('Logistic Classifier', clf2)], voting='hard')
v_clf2.fit(X_train, y_train)
y_pred_vc2 = v_clf2.predict(X_test)
print(classification_report(y_test, y_pred_vc2))
print('Test Accuracy for Voting Classifier is',accuracy_score(y_test,y_pred_vc2))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Test Accuracy for Voting Classifier is 1.0
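
The question asks for a pipeline, whereas the cells above preprocess the data separately and then fit the voting classifier on the transformed arrays. One possible way to wrap everything in a single `Pipeline` is sketched below. The step names (`imputer`, `scaler`, `voter`), the estimator names (`rf`, `lr`), `max_iter=1000`, and the fixed `random_state` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.20, random_state=42)

# Preprocessing and the voting ensemble live in one estimator,
# so the raw features can be passed straight to fit() and predict()
voting_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('voter', VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(random_state=42)),
            ('lr', LogisticRegression(max_iter=1000)),
        ],
        voting='soft',
    )),
])

voting_pipeline.fit(X_tr, y_tr)
pipeline_accuracy = accuracy_score(y_te, voting_pipeline.predict(X_te))
print(f'Test accuracy for the voting pipeline: {pipeline_accuracy:.2f}')
```

To try hard voting with the same pipeline, call `voting_pipeline.set_params(voter__voting='hard')` and fit again.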
