# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Jubayer Ahmed

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [444]:
import pandas as pd
import os
import requests
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [445]:
# Import dataset (1 mark)

file_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data'
file_name = file_url.split('/')[-1]

if not os.path.isfile(file_name):
    print('Downloading from {}'.format(file_url))
    r = requests.get(file_url)
    with open(file_name,'wb') as output_file:
        output_file.write(r.content)
    
data = pd.read_csv(file_name,                 
                   na_values='?', 
                   names=[ 'animal', 'hair', 'feathers', 'eggs', 'milk', 'airborne',
                            'aquatic', 'predator', 'toothed', 'backbone', 'breathes',
                            'venemous', 'fins', 'legs', 'tail', 'domestic', 'catsize', 'class'])

print(data.shape, type(data))
print(data)

(101, 18) <class 'pandas.core.frame.DataFrame'>
       animal  hair  feathers  eggs  milk  airborne  aquatic  predator  \
0    aardvark     1         0     0     1         0        0         1   
1    antelope     1         0     0     1         0        0         0   
2        bass     0         0     1     0         0        1         1   
3        bear     1         0     0     1         0        0         1   
4        boar     1         0     0     1         0        0         1   
..        ...   ...       ...   ...   ...       ...      ...       ...   
96    wallaby     1         0     0     1         0        0         0   
97       wasp     1         0     1     0         1        0         0   
98       wolf     1         0     0     1         0        0         1   
99       worm     0         0     1     0         0        0         0   
100      wren     0         1     1     0         1        0         0   

     toothed  backbone  breathes  venemous  fins  legs  tail  d

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*
1. I got it from the UC Irvine Machine Learning Repository. Link: https://archive.ics.uci.edu/dataset/111/zoo
2. I picked this dataset because I like animals.
3. No, it was easy. I went to the same website where we got the wine data for the last assignment and picked a different dataset.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [446]:
# Clean data (if needed)
# Check the whole dataset for nulls to see whether imputation is needed.
print(data.isnull().sum().sum())

0
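A null count is only meaningful when it covers every row; a count taken on `data.head()` would inspect just the first five. The sketch below illustrates the difference on a small made-up frame (the `toy` frame and its values are hypothetical, not the zoo data), and also shows a per-column count, which tells you where any gaps are.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the zoo data (values are made up).
toy = pd.DataFrame({
    "animal": ["a", "b", "c", "d", "e", "f"],
    "hair": [1, 0, 1, 0, 1, np.nan],  # the only missing value is in the sixth row
    "legs": [4, 2, 4, 0, 4, 6],
})

# head() keeps only the first five rows, so it misses the gap in row six.
nulls_in_head = int(toy.head().isnull().sum().sum())

# Counting over the full frame finds it; the per-column view shows where it is.
nulls_total = int(toy.isnull().sum().sum())
nulls_per_column = toy.isnull().sum()

print(nulls_in_head, nulls_total)
print(nulls_per_column)
```

For the zoo data, the `data.info()` output confirms the same conclusion for the full frame: every column reports 101 non-null entries.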


In [447]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   animal    101 non-null    object
 1   hair      101 non-null    int64 
 2   feathers  101 non-null    int64 
 3   eggs      101 non-null    int64 
 4   milk      101 non-null    int64 
 5   airborne  101 non-null    int64 
 6   aquatic   101 non-null    int64 
 7   predator  101 non-null    int64 
 8   toothed   101 non-null    int64 
 9   backbone  101 non-null    int64 
 10  breathes  101 non-null    int64 
 11  venemous  101 non-null    int64 
 12  fins      101 non-null    int64 
 13  legs      101 non-null    int64 
 14  tail      101 non-null    int64 
 15  domestic  101 non-null    int64 
 16  catsize   101 non-null    int64 
 17  class     101 non-null    int64 
dtypes: int64(17), object(1)
memory usage: 14.3+ KB


In [448]:
# Make the animal and legs features categorical since they take distinct values
categorical_columns = ['animal', 'legs']
data = data.astype({col: 'category' for col in categorical_columns})
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 18 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   animal    101 non-null    category
 1   hair      101 non-null    int64   
 2   feathers  101 non-null    int64   
 3   eggs      101 non-null    int64   
 4   milk      101 non-null    int64   
 5   airborne  101 non-null    int64   
 6   aquatic   101 non-null    int64   
 7   predator  101 non-null    int64   
 8   toothed   101 non-null    int64   
 9   backbone  101 non-null    int64   
 10  breathes  101 non-null    int64   
 11  venemous  101 non-null    int64   
 12  fins      101 non-null    int64   
 13  legs      101 non-null    category
 14  tail      101 non-null    int64   
 15  domestic  101 non-null    int64   
 16  catsize   101 non-null    int64   
 17  class     101 non-null    int64   
dtypes: category(2), int64(16)
memory usage: 18.0 KB


In [449]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed
preprocessor = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ],
    remainder='passthrough' 
)

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*
1. There were no missing values. If there were, I would delete the affected rows because I would have no way to recover the missing information. There is only one entry per animal, so filling with the most frequent value would likely be incorrect. I could not fill with zero either, since every feature applies to every animal and a zero would be read as a real value. Alternatively, I could consider dropping a column if the missing values were all in that column and its values were nearly the same for all animals, since that would indicate the feature is not very important for classification.
2. I have a mixture of numerical and categorical values. The animal column is text without order, so I used one-hot encoding. The legs column has specific values that I treated as unordered, so I also used one-hot encoding. The target already has numerical values, where each value represents a group of animals without an ordinal relationship, so it does not need encoding. The rest of the columns are 0s and 1s representing boolean values and are already suited for machine learning. They don't require scaling as they are all in the same range.

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [450]:
# Split the data into training and testing sets
y = data['class']
X = data.drop(columns=['class'])
print(X.shape, type(X))
print(y.shape, type(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=222)

(101, 17) <class 'pandas.core.frame.DataFrame'>
(101,) <class 'pandas.core.series.Series'>
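As a rough check on what this split produces, the sketch below repeats the same `train_test_split` call on 101 made-up labels. The class counts in `counts` are hypothetical (chosen only to be imbalanced and to sum to 101), and the single feature column is a placeholder. It shows the resulting set sizes and that `stratify` keeps class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical class counts for seven classes; they sum to 101 rows.
counts = [40, 20, 15, 10, 8, 5, 3]
labels = np.repeat(np.arange(1, 8), counts)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=222
)

# With 101 rows and test_size=0.2, the test set is rounded up to 21 rows.
print(len(y_tr), len(y_te))

# Share of the largest class in each part; stratification keeps these close.
train_share = float(np.mean(y_tr == 1))
test_share = float(np.mean(y_te == 1))
print(round(train_share, 2), round(test_share, 2))
```

With only 21 test rows, the rarest classes contribute very few test examples, which is worth keeping in mind when reading the test score later.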


In [451]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter=5000, random_state=222))])

param_grid = {
     'classifier__C': [0.1, 1, 10, 100],
     'classifier__fit_intercept': [True, False]}

grid_linear = GridSearchCV(pipe, param_grid=param_grid, cv=3, return_train_score=True)
grid_linear.fit(X_train, y_train)

print("Best parameters:\n{}".format(grid_linear.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid_linear.cv_results_['mean_train_score'][grid_linear.best_index_]))
print("Best cross-validation test score: {:.2f}".format(grid_linear.best_score_))
print("\nTraining Accuracy Score: ", grid_linear.cv_results_['mean_train_score'])
print("Validation Accuracy Score: ", grid_linear.cv_results_['mean_test_score'])



Best parameters:
{'classifier__C': 1, 'classifier__fit_intercept': False}
Best cross-validation train score: 1.00
Best cross-validation test score: 0.95

Training Accuracy Score:  [0.91253203 0.91253203 1.         1.         1.         1.
 1.         1.        ]
Validation Accuracy Score:  [0.86277303 0.85042735 0.9382716  0.95061728 0.9382716  0.95061728
 0.9382716  0.95061728]
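The raw score arrays are hard to match to parameter combinations by eye. One option, sketched below, is to load `cv_results_` into a DataFrame and sort by rank. This is an illustration on scikit-learn's bundled iris data, not the zoo pipeline; `demo_pipe`, `demo_grid`, and `summary` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data so the sketch runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)

demo_pipe = Pipeline(steps=[("classifier", LogisticRegression(max_iter=5000))])
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={
        "classifier__C": [0.1, 1, 10],
        "classifier__fit_intercept": [True, False],
    },
    cv=3,
    return_train_score=True,
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked combination first.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[
    [
        "param_classifier__C",
        "param_classifier__fit_intercept",
        "mean_train_score",
        "mean_test_score",
        "rank_test_score",
    ]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

Applied to `grid_linear`, the same table would place each training and validation score next to the `C` and `fit_intercept` values that produced it.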


In [452]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state=222))])

param_grid = {
        'classifier__n_estimators': [3, 5, 7],
        'classifier__max_depth': [3, 5, 7],
        'classifier__max_features': [13, 15, 17]}

grid_RF = GridSearchCV(pipe, param_grid, cv=3, return_train_score=True)
grid_RF.fit(X_train, y_train)

print("Best parameters:\n{}\n".format(grid_RF.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid_RF.cv_results_['mean_train_score'][grid_RF.best_index_]))
print("Best cross-validation test score: {:.2f}".format(grid_RF.best_score_))
print("\nTraining Accuracy Score: ", grid_RF.cv_results_['mean_train_score'])
print("Validation Accuracy Score: ", grid_RF.cv_results_['mean_test_score'])


Best parameters:
{'classifier__max_depth': 5, 'classifier__max_features': 15, 'classifier__n_estimators': 5}

Best cross-validation train score: 0.99
Best cross-validation test score: 0.96

Training Accuracy Score:  [0.89401351 0.92499418 0.94374563 0.89972048 0.91241556 0.93768926
 0.90635919 0.93768926 0.95003494 0.98124854 0.96866993 0.97495924
 0.98124854 0.99371069 0.99371069 0.97495924 0.98765432 0.99371069
 0.99382716 0.96866993 0.98113208 0.99382716 0.98765432 0.99382716
 0.98753785 0.99371069 1.        ]
Validation Accuracy Score:  [0.8005698  0.82526116 0.86324786 0.87511871 0.88841406 0.9002849
 0.86277303 0.88793922 0.88793922 0.88888889 0.83855651 0.87559354
 0.93732194 0.96296296 0.96296296 0.90075973 0.92545109 0.91310541
 0.92592593 0.90123457 0.90075973 0.91215575 0.92545109 0.93779677
 0.90123457 0.9382716  0.9382716 ]


In [453]:
pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', SVC(kernel='rbf', random_state=222))])

param_grid = { 'classifier__C': [1, 10, 20, 30],
    'classifier__gamma': [0.001, 0.01, 0.1, 1]}

gridSVC = GridSearchCV(pipe, param_grid, cv=3, return_train_score=True)
gridSVC.fit(X_train, y_train)

print("Best parameters:\n{}\n".format(gridSVC.best_params_))
print("Best cross-validation train score: {:.2f}".format(gridSVC.cv_results_['mean_train_score'][gridSVC.best_index_]))
print("Best cross-validation test score: {:.2f}".format(gridSVC.best_score_))
print("\nTraining Accuracy Score: ", gridSVC.cv_results_['mean_train_score'])
print("Validation Accuracy Score: ", gridSVC.cv_results_['mean_test_score'])

Best parameters:
{'classifier__C': 10, 'classifier__gamma': 0.1}

Best cross-validation train score: 1.00
Best cross-validation test score: 0.95

Training Accuracy Score:  [0.41253203 0.61250874 0.97495924 1.         0.65024458 1.
 1.         1.         0.8312369  1.         1.         1.
 0.8624505  1.         1.         1.        ]
Validation Accuracy Score:  [0.41263058 0.61253561 0.87559354 0.71225071 0.61253561 0.92545109
 0.95061728 0.74976258 0.73884141 0.9382716  0.95061728 0.74976258
 0.83760684 0.9382716  0.95061728 0.74976258]
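The sixteen validation scores printed above can be read more easily as a C-by-gamma table. The sketch below reshapes them, assuming scikit-learn's usual grid ordering in which the alphabetically last parameter (`gamma` here) varies fastest. The score values are copied from the output above; `table` and the other names are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

c_values = [1, 10, 20, 30]
gamma_values = [0.001, 0.01, 0.1, 1]

# Mean validation scores copied from the grid search output above.
mean_test_score = np.array([
    0.41263058, 0.61253561, 0.87559354, 0.71225071,
    0.61253561, 0.92545109, 0.95061728, 0.74976258,
    0.73884141, 0.9382716, 0.95061728, 0.74976258,
    0.83760684, 0.9382716, 0.95061728, 0.74976258,
])

# Rows are C values, columns are gamma values (assumed ordering, see lead-in).
table = pd.DataFrame(
    mean_test_score.reshape(len(c_values), len(gamma_values)),
    index=pd.Index(c_values, name="C"),
    columns=pd.Index(gamma_values, name="gamma"),
)
print(table.round(2))

# The first maximum in the flat array gives the reported best combination.
best_flat = int(np.argmax(mean_test_score))
best_c = c_values[best_flat // len(gamma_values)]
best_gamma = gamma_values[best_flat % len(gamma_values)]
print(best_c, best_gamma)
```

Under that ordering, the table reproduces the reported best parameters (C = 10, gamma = 0.1). It also shows that C = 20 and C = 30 tie with it at gamma = 0.1, and that the gamma = 1 column stays at or below 0.75 on validation even though the matching training scores are 1.0.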


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*
1. I need classification because I am dealing with distinct values. Each entry will belong to groups 1,2,3,4,5,6, or 7.
2. I chose to go with the Random Forest model since it gave me the best validation score at 0.96, which is close to its training score of 0.99. This was both the best validation score and the lowest variance. The SVC and logistic regression models show signs of overfitting (high variance, low bias): their training score is 1.00, while their validation score of 0.95 is slightly worse than Random Forest's. I ensured that the selected hyperparameters are in the middle of the tested range, which suggests that I selected a good range for the parameter grid.
3. As discussed above, the Random Forest worked best. Given that my dataset has mostly boolean values, Random Forest might capture non-linear relationships effectively, while Logistic Regression assumes a linear relationship between the features and the target. Also, my dataset is quite small, so it is possible that outliers carried a lot of weight. The SVC and Logistic Regression would be more affected by this than Random Forest, as discussed in class.
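The three models were compared here by running three separate grid searches. An alternative, sketched below under stated assumptions, is to let a single `GridSearchCV` swap the classifier step itself by passing a list of parameter grids. The example runs on scikit-learn's bundled iris data rather than the zoo pipeline, and the grids are shortened, illustrative ones; `demo_pipe` and `demo_grid` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data so the sketch runs on its own.
X_demo, y_demo = load_iris(return_X_y=True)

# The initial classifier is a placeholder; each grid below replaces it.
demo_pipe = Pipeline(steps=[("classifier", LogisticRegression())])

param_grid = [
    {
        "classifier": [LogisticRegression(max_iter=5000, random_state=222)],
        "classifier__C": [0.1, 1, 10],
    },
    {
        "classifier": [RandomForestClassifier(random_state=222)],
        "classifier__n_estimators": [5, 25],
        "classifier__max_depth": [3, 5],
    },
    {
        "classifier": [SVC(kernel="rbf", random_state=222)],
        "classifier__C": [1, 10],
        "classifier__gamma": [0.01, 0.1],
    },
]

demo_grid = GridSearchCV(demo_pipe, param_grid, cv=3)
demo_grid.fit(X_demo, y_demo)

# All candidates share the same folds, so their scores are directly comparable.
n_candidates = len(demo_grid.cv_results_["params"])
best_name = type(demo_grid.best_params_["classifier"]).__name__
print(n_candidates, best_name, round(demo_grid.best_score_, 3))
```

The design trade-off is convenience versus detail: one search returns a single winner across model families, while separate searches keep each model's results apart for discussion.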


## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [454]:
# Calculate testing accuracy (1 mark)
print("\nRandomForest\nTest-set score: {:.2f}".format(grid_RF.score(X_test, y_test)))


RandomForest
Test-set score: 0.90
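A single accuracy number does not show which classes were confused. As a minimal sketch, the block below computes accuracy and a confusion matrix with `sklearn.metrics`. The label lists are made up for illustration and are not predictions from `grid_RF`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted class labels (not from the zoo model).
y_true_demo = [1, 1, 1, 2, 2, 4, 4, 6, 7, 7]
y_pred_demo = [1, 1, 1, 2, 2, 4, 4, 7, 7, 7]  # one class-6 example predicted as class 7

class_labels = [1, 2, 4, 6, 7]
acc = accuracy_score(y_true_demo, y_pred_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_labels)

# Rows are true classes and columns are predicted classes, in class_labels order.
print(acc)
print(cm)
```

For the real model, the same calls could be made with `y_test` and `grid_RF.predict(X_test)` to see which animal classes account for the errors.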



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*
1. I used the default score method which computes the accuracy for classification.
2. Yes, the model generalized fairly well: it got a test score of 0.90, which is a bit lower than the validation score of 0.96 in Step 3.
3. Although the model did well with a test score of 0.90, I would like to have more data to train and test on before using it in the real world. I would try to get the test score closer to the validation score. This dataset has only 101 entries, with a single entry for each animal. We know that animals have lots of variation, so we should use more data to train the model on those variations. If we only look at the score of 0.90, we would be tempted to start using it in the real world, but we should do more rigorous training and testing. I suspect that with more data that includes this variation, we would get lower training and validation scores.
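To put the 0.90 test score in context, the arithmetic below works out how many test rows the split produces and how much one misclassification moves the accuracy. It assumes that scikit-learn rounds the test set size up when `test_size` is a fraction; the variable names are introduced only for this sketch.

```python
import math

n_rows = 101
test_fraction = 0.2

# Assumption: a fractional test_size is rounded up to a whole number of rows.
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test
print(n_train, n_test)

# Accuracy for 0 to 3 misclassified test rows.
accuracy_by_errors = {
    n_wrong: round((n_test - n_wrong) / n_test, 2) for n_wrong in range(4)
}
print(accuracy_by_errors)
```

Under that assumption, the test set has 21 rows, so a reported 0.90 is consistent with 2 misclassified animals, and each additional error shifts the score by nearly 5 percentage points. This is one reason the gap between 0.96 and 0.90 should be read cautiously.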

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?



*ANSWER HERE*
1. I wrote it myself after reviewing Imputation Examples 1 and 2 from class. I copied the print statements from these examples to print out the best parameters.
2. I completed the steps in order, one at a time.
3. I only used ChatGPT to learn how to specify which columns to encode and which to leave unchanged. I used this query: "how to specify on pipeline which columns to encode and which column to keep unchanged". It told me to use `remainder='passthrough'`.
4. I did not encounter challenges, as the assignment is pretty straightforward. Reading through the examples mentioned above helped me understand what I needed to do.

*DESCRIBE YOUR PROCESS HERE*

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*
I liked the assignment as it brings everything we have learned thus far together. I found it motivating to see how everything ties together. I also see that I have much more to learn, which seems challenging right now. It feels like it will take a lot more practice and studying to be able to work professionally in this field. I am happy with the progress thus far.