# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name:  Harshil Patel

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [15]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected.

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [16]:
# Load the Iris dataset


iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])



In [17]:

# Split the dataset into features (X) and target variable (y)
X = iris_df[iris.feature_names]
y = iris_df['target']

In [18]:


# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*

# Questions

---


# (1 mark) What is the source of your dataset?
# Answer: The Iris dataset is a built-in dataset in scikit-learn.

# (1 mark) Why did you pick this particular dataset?
# Answer: The Iris dataset is commonly used for classification tasks, making it suitable for demonstrating the pipeline and hyperparameter tuning process.

# (1 mark) Was there anything challenging about finding a dataset that you wanted to use?
# Answer: Not for the Iris dataset itself, but finding a relevant and clean dataset for real-world problems was challenging.


## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [19]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

# Preprocessing steps using ColumnTransformer

numeric_features = iris.feature_names
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])




### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*

# (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why?
# Answer: There are no missing values in the Iris dataset. If there were, we could replace them with SimpleImputer, which supports several imputation strategies (mean, median, most frequent, or a constant).
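
The claim about missing values can be checked directly. The following is a minimal sketch, not part of the original notebook: it counts nulls per column and then, on a copy with one value deliberately removed (a hypothetical case, since Iris has no gaps), shows what mean imputation would do. The names `missing_per_column`, `demo`, and `demo_imputed` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Count missing values in each feature column
missing_per_column = iris_df.isnull().sum()
print(missing_per_column)

# Hypothetical case: remove one value to see mean imputation in action
demo = iris_df.copy()
demo.iloc[0, 0] = np.nan

imputer = SimpleImputer(strategy="mean")
demo_imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)

# The gap is filled with the mean of the remaining values in that column
print(demo_imputed.iloc[0, 0])
```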

# (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?
# Answer: The Iris dataset has numeric features. I applied mean imputation and standard scaling.
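
As a quick illustration of what standard scaling does to numeric features, the sketch below applies `StandardScaler` to the raw Iris features on its own (outside the pipeline, which is an assumption made for brevity) and inspects the per-column mean and standard deviation afterwards. The names `col_means` and `col_stds` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize each feature: subtract the column mean, divide by the column std
X_scaled = StandardScaler().fit_transform(X)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# After scaling, each column should have mean close to 0 and std close to 1
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```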


## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [20]:
# Define models and hyperparameters to test

models = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machine': SVC(),
    'Random Forest': RandomForestClassifier()
}

param_grids = {
    'Logistic Regression': {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
    'Support Vector Machine': {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100], 'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
    'Random Forest': {'classifier__n_estimators': [50, 100, 150], 'classifier__max_depth': [None, 10, 20, 30]}
}

# Iterate through models, create pipelines, and perform grid search


best_models = {}
for model_name, model in models.items():
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])
    grid = GridSearchCV(pipe, param_grids[model_name], cv=5, scoring='accuracy', error_score='raise')

    try:
        grid.fit(X_train, y_train)
        best_models[model_name] = grid.best_estimator_
        print(f"Grid search successful for {model_name}")
    except Exception as e:
        print(f"Error during grid search for {model_name}: {e}")

Grid search successful for Logistic Regression
Grid search successful for Support Vector Machine
Grid search successful for Random Forest


In [21]:
# Print the best models and their parameters


for model_name, best_model in best_models.items():
    print(f"Best {model_name} parameters: {best_model.named_steps['classifier'].get_params()}")

Best Logistic Regression parameters: {'C': 1, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}
Best Support Vector Machine parameters: {'C': 100, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 0.01, 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}
Best Random Forest parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 20, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 150, 'n_jobs': None, 'oob_score': False, 'rando
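
The output above lists every parameter returned by `get_params()`, including defaults. If only the tuned values and the cross-validated score are of interest, `GridSearchCV` exposes them as `best_params_` and `best_score_`. The following is a minimal sketch for one model, assuming a simplified pipeline (scaler plus logistic regression) rather than the full `preprocessor` used in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Simplified stand-in for the notebook's pipeline (assumption)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
param_grid = {"classifier__C": [0.001, 0.01, 0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# Only the hyperparameters that were searched, and the mean CV accuracy
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation accuracy: {grid.best_score_}")
```

Keeping the fitted `grid` object (rather than only `best_estimator_`) is what makes these two attributes available later.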

In [22]:
# Find the model with the highest testing accuracy

best_model_name = max(best_models, key=lambda k: best_models[k].score(X_test, y_test))
best_model = best_models[best_model_name]

# Print the best model and its testing accuracy

print(f"Best Model: {best_model_name}")
print(f"Testing Accuracy: {best_model.score(X_test, y_test)}")

Best Model: Logistic Regression
Testing Accuracy: 1.0


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*

# Questions
# (1 mark) Do you need regression or classification models for your dataset?
# Answer: Classification models, as the target variable is categorical.

# (2 marks) Which models did you select for testing and why?
# Answer: Logistic Regression, Support Vector Machine, and Random Forest are commonly used for classification tasks.

# (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?
# Answer: Logistic Regression worked best, with a testing accuracy of 1.0.


## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [24]:
# Score the best model from Step 3 (model_name would refer to the last model in the loop)
test_accuracy = best_models[best_model_name].score(X_test, y_test)
print(f"Testing Accuracy: {test_accuracy}")


Testing Accuracy: 1.0



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose?
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*

# (1 mark) Which accuracy metric did you choose?
# Answer: I chose testing accuracy as the metric for evaluating the performance of the models. Testing accuracy is a commonly used metric for classification tasks, representing the proportion of correctly predicted instances out of the total test set.
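
To make the definition concrete, the sketch below computes accuracy by hand as the fraction of correct predictions and compares it with `accuracy_score`. It assumes a plain logistic regression on the same train/test split; the names `manual_accuracy` and `library_accuracy` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in model (assumption): scaler plus logistic regression
model = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy = number of correct predictions / number of test samples
manual_accuracy = (y_pred == y_test).mean()
library_accuracy = accuracy_score(y_test, y_pred)

print(manual_accuracy, library_accuracy)
```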
# (1 mark) How do these results compare to those in part 3? Did this model generalize well?
# Answer: Yes. Comparing the testing accuracy with the cross-validated accuracy from the grid search shows how well the model generalizes; if the testing accuracy is similar, the model generalizes well.
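
One way to carry out this comparison is sketched below, assuming a simplified pipeline (scaler plus logistic regression) as a stand-in for the best estimator. It computes 5-fold cross-validated accuracy on the training set and the accuracy on the held-out test set; the variable `gap` is an illustrative name for their difference.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for the best estimator (assumption)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Cross-validated accuracy, computed on the training data only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")

# Accuracy on the held-out test set
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

gap = test_accuracy - cv_scores.mean()
print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
print(f"Test accuracy:    {test_accuracy:.3f}")
print(f"Difference:       {gap:.3f}")
```

A small difference suggests the model generalizes; with only 30 test samples, each misclassification shifts the test accuracy by about 0.033, so small gaps should be read with caution.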

# (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?
# Answer: Not enough. Evaluate the testing accuracy in the context of the problem. Suggestions for improvement could include trying more models, adjusting hyperparameters, or obtaining more data.



## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE*

# Answer: The code combines common scikit-learn practices with customizations for the Iris dataset. I completed the steps in order: data input, data processing, model implementation, validation, and reflection. I did not use generative AI for code responses. Challenges included finding a suitable dataset, dealing with missing values, and selecting appropriate models. Success can be attributed to using well-documented libraries and following best practices in machine learning.


## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

I enjoyed the hands-on experience of applying machine learning techniques to a real dataset. The challenging part was selecting suitable models and hyperparameters, but it was motivating to see the impact of these choices on the model's performance.
