### Q1. You are working on a machine learning project where you have a dataset containing numerical and categorical features. You have identified that some of the features are highly correlated and there are missing values in some of the columns. You want to build a pipeline that automates the feature engineering process and handles the missing values.

### Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset.
- Create a numerical pipeline that includes the following steps:
  - Impute the missing values in the numerical columns using the mean of the column values.
  - Scale the numerical columns using standardisation.
- Create a categorical pipeline that includes the following steps:
  - Impute the missing values in the categorical columns using the most frequent value of the column.
  - One-hot encode the categorical columns.
- Combine the numerical and categorical pipelines using a ColumnTransformer.
- Use a Random Forest Classifier to build the final model.
- Evaluate the accuracy of the model on the test dataset.

#### Note: Your solution should include code snippets for each step of the pipeline, and a brief explanation of each step. You should also provide an interpretation of the results and suggest possible improvements for the pipeline.



#### Step 1: Automated Feature Selection

For automated feature selection, we can use techniques like Recursive Feature Elimination (RFE) or feature importances from tree-based models such as Random Forest. Here we use Random Forest feature importances and keep the features whose importance exceeds 0.01.

In [None]:
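import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('your_dataset.csv')  # placeholder path
X = data.drop(columns=['target'])
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A random forest cannot be fit on raw string columns, so build a rough
# numeric copy of the training data that is used only to rank features.
X_rank = X_train.copy()
for col in X_rank.select_dtypes(include=['object', 'category']).columns:
    X_rank[col] = X_rank[col].astype('category').cat.codes  # missing values become -1
X_rank = X_rank.fillna(X_rank.mean())  # temporary mean fill for numeric gaps

rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_rank, y_train)
importance_df = pd.DataFrame({'Feature': X_rank.columns, 'Importance': rf_classifier.feature_importances_})

# Keep features above the importance threshold; the original (unencoded) columns are carried forward.
important_features = importance_df[importance_df['Importance'] > 0.01]['Feature'].tolist()
X_train_selected = X_train[important_features]
X_test_selected = X_test[important_features]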
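
Recursive Feature Elimination, mentioned above as an alternative, can be sketched as follows. This is only an illustration on synthetic data: `make_classification` stands in for a fully numeric dataset, and the names `X_demo`, `y_demo`, and the choice of keeping 5 features are assumptions, not part of the original task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for a fully numeric dataset (10 features, 4 informative).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, n_redundant=2, random_state=42
)

# RFE refits the estimator repeatedly and drops the weakest feature each
# round (step=1) until only n_features_to_select remain.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=5,
    step=1,
)
rfe.fit(X_demo, y_demo)

selected_mask = rfe.support_             # boolean mask, True for kept features
X_demo_selected = rfe.transform(X_demo)  # data restricted to the kept features
print("Kept feature indices:", [i for i, keep in enumerate(selected_mask) if keep])
```

Compared with a single importance threshold, RFE re-ranks the features after every removal, which costs more fits but is less sensitive to importance being split between correlated features.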


#### Step 2: Numerical Pipeline

In this step, we handle missing values and scale the numerical columns.


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')), 
    ('scaler', StandardScaler())  
])
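
To see what these two steps do, here is a small worked example on a single toy column. The pipeline is rebuilt under the hypothetical name `demo_numerical_pipeline` so the snippet runs on its own; the values are made up for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

demo_numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# One column with a missing value; the mean of the observed values is 2.0.
toy_column = np.array([[1.0], [np.nan], [3.0]])
toy_scaled = demo_numerical_pipeline.fit_transform(toy_column)

# After imputation the column is [1, 2, 3]. Standardisation then subtracts the
# mean (2) and divides by the population standard deviation (about 0.8165).
print(toy_scaled.ravel())
```

The result has mean 0 and unit variance, with the imputed entry sitting exactly at 0 because it was filled with the column mean.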


#### Step 3: Categorical Pipeline

In this step, we handle missing values and one-hot encode the categorical columns.

In [None]:
from sklearn.preprocessing import OneHotEncoder

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  
    ('encoder', OneHotEncoder(handle_unknown='ignore')) 
])
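
A small worked example may help here as well. The toy `colour` values and the name `demo_categorical_pipeline` are assumptions made for illustration; the pipeline itself mirrors the one above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

demo_categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

# One categorical column with a missing entry; 'red' is the most frequent value.
toy_colours = np.array([['red'], [np.nan], ['blue'], ['red']], dtype=object)

# The encoder returns a sparse matrix, so convert it to a dense array to inspect it.
toy_encoded = demo_categorical_pipeline.fit_transform(toy_colours).toarray()
print(toy_encoded)  # columns follow the sorted categories: blue, red

# A category never seen during fit is encoded as all zeros instead of raising.
unseen_encoded = demo_categorical_pipeline.transform(
    np.array([['green']], dtype=object)
).toarray()
print(unseen_encoded)
```

The missing entry is filled with `red` before encoding, and `handle_unknown='ignore'` is what keeps the pipeline from failing when the test set contains a category that was absent from the training set.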


#### Step 4: Combine Numerical and Categorical Pipelines

Here, we use ColumnTransformer to apply the numerical pipeline to the numerical columns and the categorical pipeline to the categorical columns.

In [None]:
from sklearn.compose import ColumnTransformer

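# Select by dtype family so that, for example, int32/float32 and pandas 'category'
# columns are not left out (unlisted columns are dropped by ColumnTransformer).
numerical_columns = X_train_selected.select_dtypes(include='number').columns
categorical_columns = X_train_selected.select_dtypes(include=['object', 'category']).columns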

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_columns),
        ('cat', categorical_pipeline, categorical_columns)
    ]
)

X_train_preprocessed = preprocessor.fit_transform(X_train_selected)
X_test_preprocessed = preprocessor.transform(X_test_selected)
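
The following self-contained sketch shows what the combined transformer produces on a tiny, made-up DataFrame. The column names (`age`, `income`, `city`) and values are hypothetical, and `sparse_threshold=0` is set only so that the result is always a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: two numerical columns and one categorical column, each with a gap.
toy = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'income': [30000.0, 52000.0, np.nan, 41000.0],
    'city': pd.Series(['Paris', 'Rome', np.nan, 'Paris'], dtype=object),
})

toy_numerical = toy.select_dtypes(include='number').columns.tolist()
toy_categorical = toy.select_dtypes(exclude='number').columns.tolist()

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                          ('scaler', StandardScaler())]), toy_numerical),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                          ('encoder', OneHotEncoder(handle_unknown='ignore'))]), toy_categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_transformed = toy_preprocessor.fit_transform(toy)

# 2 scaled numerical columns + 2 one-hot columns (Paris, Rome) = 4 columns.
print(toy_transformed.shape)
```

The outputs of the two branches are concatenated side by side in the order the transformers are listed, so the numerical columns come first.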


#### Step 5: Build the Final Model and Evaluate

Now we train the Random Forest Classifier on the preprocessed data and evaluate its accuracy on the test dataset.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf_classifier_final = RandomForestClassifier(random_state=42)
rf_classifier_final.fit(X_train_preprocessed, y_train)

y_pred = rf_classifier_final.predict(X_test_preprocessed)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on the test dataset:", accuracy)
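
As one possible improvement, the preprocessing and the classifier can be wrapped in a single Pipeline and scored with cross-validation, so that imputation and scaling statistics are re-estimated inside each fold rather than once on the whole training set. The sketch below uses synthetic data; the column names (`x1`, `x2`, `group`), the target rule, and the 5-fold setting are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: two numerical features and one categorical feature.
rng = np.random.default_rng(42)
n = 200
synthetic = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'group': pd.Series(rng.choice(['a', 'b', 'c'], size=n), dtype=object),
})
# Hypothetical target rule, computed before any values are removed.
synthetic_target = ((synthetic['x1'] + (synthetic['group'] == 'a')) > 0.5).astype(int)

# Knock out 10% of the values in one numerical and one categorical column.
synthetic.loc[rng.choice(n, size=20, replace=False), 'x1'] = np.nan
synthetic.loc[rng.choice(n, size=20, replace=False), 'group'] = np.nan

full_pipeline = Pipeline([
    ('preprocess', ColumnTransformer([
        ('num', Pipeline([('imputer', SimpleImputer(strategy='mean')),
                          ('scaler', StandardScaler())]), ['x1', 'x2']),
        ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                          ('encoder', OneHotEncoder(handle_unknown='ignore'))]), ['group']),
    ])),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Each fold fits the imputers, scaler, encoder, and forest on its own training part.
cv_scores = cross_val_score(full_pipeline, synthetic, synthetic_target, cv=5, scoring='accuracy')
print("Mean CV accuracy:", cv_scores.mean())
```

A cross-validated mean is usually a more stable estimate than a single train/test split, and reporting metrics beyond accuracy (for example precision and recall) is worth considering when the classes are imbalanced.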


### Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its accuracy.

In [3]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

random_forest_clf = RandomForestClassifier()
logistic_regression_clf = LogisticRegression()

voting_clf = VotingClassifier(estimators=[('rf', random_forest_clf), ('lr', logistic_regression_clf)], voting='hard')

pipeline = Pipeline([
    ('voting', voting_clf)
])

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the Voting Classifier on the test dataset:", accuracy)


Accuracy of the Voting Classifier on the test dataset: 1.0
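
With `voting='hard'`, each classifier casts one vote for a class label and the majority label wins. A common variation is soft voting, which averages the predicted class probabilities instead. The sketch below is one way this could look; the `StandardScaler` step, `max_iter=1000`, and the fixed `random_state` are additions for illustration, not part of the solution above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

soft_voting_pipeline = Pipeline([
    # Scaling mainly helps logistic regression converge; the forest is unaffected by it.
    ('scaler', StandardScaler()),
    ('voting', VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(random_state=42)),
            ('lr', LogisticRegression(max_iter=1000)),
        ],
        voting='soft',  # average predict_proba outputs instead of counting votes
    )),
])

soft_voting_pipeline.fit(Xi_train, yi_train)
soft_accuracy = accuracy_score(yi_test, soft_voting_pipeline.predict(Xi_test))
print("Accuracy of the soft Voting Classifier on the test dataset:", soft_accuracy)
```

Soft voting requires every estimator to implement `predict_proba`, which both classifiers here do. With only two estimators, hard voting can end in ties, so soft voting or an odd number of estimators is often the more informative setup.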
