Q1. Ans-

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

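# Define the preprocessing for numerical and categorical features.
# numerical_features and categorical_features are assumed to be lists of
# column names in X_train, defined before this cell.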
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Use a feature selection method to identify important features
selector = SelectFromModel(RandomForestClassifier())

# Define the final pipeline with preprocessor, feature selector, and classifier
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('selector', selector),
    ('classifier', RandomForestClassifier())
])

# Train the model on the training data
rf_pipeline.fit(X_train, y_train)

# Evaluate the accuracy of the model on the test data
accuracy = rf_pipeline.score(X_test, y_test)
print("Accuracy:", accuracy)

In this pipeline, we first define a ColumnTransformer to handle the numerical and categorical features separately. The numerical features are imputed using the mean of the column values and then scaled using standardization. The categorical features are imputed using the most frequent value of the column and then one-hot encoded.

Next, we use SelectFromModel with a RandomForestClassifier to identify important features. It fits the forest and keeps only the features whose importance is at or above a threshold (the mean importance by default).
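
The selection step can be inspected on its own. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the original dataset) and prints which features the selector keeps.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (illustrative assumption): 10 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

demo_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0)
)
demo_selector.fit(X_demo, y_demo)

# Boolean mask of the features whose importance meets the threshold.
mask = demo_selector.get_support()
print("Kept feature indices:", mask.nonzero()[0])
print("Importance threshold:", demo_selector.threshold_)

# transform() drops the columns that were not selected.
X_reduced = demo_selector.transform(X_demo)
print(X_reduced.shape)
```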

Finally, we define the final pipeline with the preprocessor, feature selector, and a RandomForestClassifier as the final classifier.

We train the model on the training data and evaluate its accuracy on the test data.
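
The code above assumes that X_train, X_test, y_train, and y_test already exist. One common way to create them is sketched below, with synthetic data standing in for the real dataset; the 80/20 split and the stratification are assumptions, not something the original answer specifies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 20% of the rows for testing, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```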

To improve the pipeline, we could try different feature selection methods or hyperparameters for the classifier. We could also try different imputation strategies for missing values or different scaling methods for the numerical features.
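
One way to try these alternatives systematically is a grid search over the pipeline's parameters. The following is a rough sketch on synthetic, all-numerical data; the step names and the grid values are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numerical data (illustrative assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

search_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("selector", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0))),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "selector__threshold": ["mean", "median"],
    "classifier__n_estimators": [50, 100],
}

search = GridSearchCV(search_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Because the whole pipeline is refit inside each fold, the imputation, scaling, and feature selection only ever see that fold's training portion.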

Q2. Ans-

from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Create pipelines for the individual classifiers.
# Each one starts from an unfitted copy of the `preprocessor` ColumnTransformer
# defined in Q1, so numerical columns are mean-imputed and scaled while
# categorical columns are mode-imputed and one-hot encoded. Applying mean
# imputation or one-hot encoding to every column would fail or misbehave on
# mixed-type data.
rf_pipeline = Pipeline([
    ('preprocessor', clone(preprocessor)),
    ('rf', RandomForestClassifier())
])

lr_pipeline = Pipeline([
    ('preprocessor', clone(preprocessor)),
    ('lr', LogisticRegression())
])

# Create the voting classifier
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_pipeline), ('lr', lr_pipeline)],
    voting='soft'
)

# Train the pipeline on the data
voting_classifier.fit(X_train, y_train)

# Evaluate the accuracy on the test set
accuracy = voting_classifier.score(X_test, y_test)
print("Accuracy:", accuracy)

In this example, we create separate pipelines for the random forest and logistic regression classifiers, each including the preprocessing it needs (imputation, scaling, and one-hot encoding). We then combine them in a VotingClassifier with soft voting, which averages the classifiers' predicted class probabilities and predicts the class with the highest average.
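
The averaging behind soft voting can be checked by hand. This is a small sketch on synthetic data (an assumption for illustration) with equal weights and no preprocessing, comparing a manual average of the members' probabilities with the ensemble's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic numerical data (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
demo_voter.fit(X_demo, y_demo)

# Average the fitted members' class probabilities by hand.
member_probas = [est.predict_proba(X_demo) for est in demo_voter.estimators_]
manual_average = np.mean(member_probas, axis=0)

# With equal weights, the manual average matches the ensemble's probabilities.
probas_match = np.allclose(manual_average, demo_voter.predict_proba(X_demo))
print("Probabilities match:", probas_match)

# The predicted label is the class with the highest averaged probability.
manual_labels = demo_voter.classes_[np.argmax(manual_average, axis=1)]
labels_match = np.array_equal(manual_labels, demo_voter.predict(X_demo))
print("Labels match:", labels_match)
```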

We train the ensemble on the training data and evaluate its accuracy on the test set. The resulting accuracy score measures how well the combined classifier performs on data it has not seen during training.