# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Jubayer Ahmed

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [191]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [192]:
# Import dataset (1 mark)
import os
import requests

file_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data'
file_name = file_url.split('/')[-1]

if not os.path.isfile(file_name):
    print('Downloading from {}'.format(file_url))
    r = requests.get(file_url)
    with open(file_name,'wb') as output_file:
        output_file.write(r.content)
    
data = pd.read_csv(file_name,                 
                   na_values='?', 
                   names=[ 'animal', 'hair', 'feathers', 'eggs', 'milk', 'airborne',
                            'aquatic', 'predator', 'toothed', 'backbone', 'breathes',
                            'venemous', 'fins', 'legs', 'tail', 'domestic', 'catsize', 'class'])


y = data['class']
#dropping the target
X = data.drop(columns=['class'])
#dropping animal since it is just an ID, provides no value.
X = X.drop(columns=['animal'])
print(X.shape, type(X))
print(y.shape, type(y))

(101, 16) <class 'pandas.core.frame.DataFrame'>
(101,) <class 'pandas.core.series.Series'>


In [193]:
print(X)

     hair  feathers  eggs  milk  airborne  aquatic  predator  toothed  \
0       1         0     0     1         0        0         1        1   
1       1         0     0     1         0        0         0        1   
2       0         0     1     0         0        1         1        1   
3       1         0     0     1         0        0         1        1   
4       1         0     0     1         0        0         1        1   
..    ...       ...   ...   ...       ...      ...       ...      ...   
96      1         0     0     1         0        0         0        1   
97      1         0     1     0         1        0         0        0   
98      1         0     0     1         0        0         1        1   
99      0         0     1     0         0        0         0        0   
100     0         1     1     0         1        0         0        0   

     backbone  breathes  venemous  fins  legs  tail  domestic  catsize  
0           1         1         0     0     4     

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

*ANSWER HERE*
1. https://archive.ics.uci.edu/dataset/111/zoo
2. I picked this dataset because I like animals.
3. No, it was easy. I went to the same website we got the wine data from in the last assignment and picked a different dataset.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [194]:
# Clean data (if needed)
# Check the whole dataset (not just the first rows) for nulls to see whether imputation is needed.
print(data.isnull().sum().sum())

0


In [195]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   hair      101 non-null    int64
 1   feathers  101 non-null    int64
 2   eggs      101 non-null    int64
 3   milk      101 non-null    int64
 4   airborne  101 non-null    int64
 5   aquatic   101 non-null    int64
 6   predator  101 non-null    int64
 7   toothed   101 non-null    int64
 8   backbone  101 non-null    int64
 9   breathes  101 non-null    int64
 10  venemous  101 non-null    int64
 11  fins      101 non-null    int64
 12  legs      101 non-null    int64
 13  tail      101 non-null    int64
 14  domestic  101 non-null    int64
 15  catsize   101 non-null    int64
dtypes: int64(16)
memory usage: 12.8 KB


In [196]:
#making all features categorical as they are distinct and have meaning
X = X.astype({col: 'category' for col in ['hair', 'feathers', 'eggs', 'milk', 'airborne',
                            'aquatic', 'predator', 'toothed', 'backbone', 'breathes',
                            'venemous', 'fins', 'legs', 'tail', 'domestic', 'catsize']})
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   hair      101 non-null    category
 1   feathers  101 non-null    category
 2   eggs      101 non-null    category
 3   milk      101 non-null    category
 4   airborne  101 non-null    category
 5   aquatic   101 non-null    category
 6   predator  101 non-null    category
 7   toothed   101 non-null    category
 8   backbone  101 non-null    category
 9   breathes  101 non-null    category
 10  venemous  101 non-null    category
 11  fins      101 non-null    category
 12  legs      101 non-null    category
 13  tail      101 non-null    category
 14  domestic  101 non-null    category
 15  catsize   101 non-null    category
dtypes: category(16)
memory usage: 3.7 KB


In [197]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')



### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

*ANSWER HERE*
1. There were no missing values. If there had been, I would have deleted the affected rows, because I would have no way to recover the missing information. There is only one entry per animal, so filling with the most frequent value would likely be incorrect. Filling with zero would not work either, since zero is a meaningful value for every feature. Alternatively, if the missing values were all in one column and that column had nearly the same value for every animal, I could drop the column, since such a feature would contribute little to classification.
2. All of my features are categorical: each takes one of a small set of discrete values (most are 0/1 flags, and `legs` takes a few discrete counts). I used one-hot encoding because I treated the categories as having no order or hierarchy.

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [198]:
# Implement pipeline and grid search here. Can add more code blocks if necessary
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

pipe = Pipeline(steps=[('preprocessor', encoder),
                      ('classifier', LogisticRegression(max_iter=5000, random_state=0))])


param_grid = [
    {'classifier': [LogisticRegression(max_iter=5000, random_state=0)], 
     'preprocessor': [encoder],
     'classifier__C': [0.1, 1.0, 2.0, 4.0],
     'classifier__fit_intercept': [True, False]}]

grid_linear = GridSearchCV(pipe, param_grid, cv=3, return_train_score=True)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                            test_size=0.3, stratify=y,random_state=0)

grid_linear.fit(X_train, y_train)

print("Best params:\n{}".format(grid_linear.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid_linear.cv_results_['mean_train_score'][grid_linear.best_index_]))
print("Best cross-validation test score: {:.2f}".format(grid_linear.best_score_))

Best params:
{'classifier': LogisticRegression(max_iter=5000, random_state=0), 'classifier__C': 1.0, 'classifier__fit_intercept': True, 'preprocessor': OneHotEncoder(handle_unknown='ignore')}
Best cross-validation train score: 1.00
Best cross-validation test score: 0.94


In [199]:
pipe = Pipeline(steps=[('preprocessor', encoder),
                      ('classifier', RandomForestClassifier(random_state=0))])


param_grid = [
    {'classifier': [RandomForestClassifier(random_state=0)],
     'preprocessor': [encoder], 
        'classifier__n_estimators': [10, 20, 30],
        'classifier__max_depth': [5, 7, 9],
        'classifier__max_features': [1, 10, 16]}]

gridRF = GridSearchCV(pipe, param_grid, cv=3, return_train_score=True)
gridRF.fit(X_train, y_train)

print("Best params:\n{}\n".format(gridRF.best_params_))
print("Best cross-validation train score: {:.2f}".format(gridRF.cv_results_['mean_train_score'][gridRF.best_index_]))
print("Best cross-validation test score: {:.2f}".format(gridRF.best_score_))


Best params:
{'classifier': RandomForestClassifier(max_depth=7, max_features=10, n_estimators=20,
                       random_state=0), 'classifier__max_depth': 7, 'classifier__max_features': 10, 'classifier__n_estimators': 20, 'preprocessor': OneHotEncoder(handle_unknown='ignore')}

Best cross-validation train score: 1.00
Best cross-validation test score: 0.97


In [200]:
pipe = Pipeline(steps=[('preprocessor', encoder),
                      ('classifier', SVC(kernel='rbf', random_state=0))])

param_grid = { 'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

gridSVC = GridSearchCV(pipe, param_grid, cv=3, return_train_score=True)
gridSVC.fit(X_train, y_train)

print("Best params:\n{}\n".format(gridSVC.best_params_))
print("Best cross-validation train score: {:.2f}".format(gridSVC.cv_results_['mean_train_score'][gridSVC.best_index_]))
print("Best cross-validation test score: {:.2f}".format(gridSVC.best_score_))

Best params:
{'classifier__C': 1, 'classifier__gamma': 0.1}

Best cross-validation train score: 0.99
Best cross-validation test score: 0.94


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

*ANSWER HERE*
1. I need classification because the target is a discrete label: each animal belongs to one of seven classes (1 through 7).
2. I tested Logistic Regression (linear), Random Forest, and an SVC with an RBF kernel (both non-linear). Of these, I chose Random Forest because it gave the best cross-validation score, 0.97.
3. All three worked well, with cross-validation scores above 0.9, but Random Forest worked slightly better (0.97 versus 0.94). Since my features are mostly boolean, Random Forest may capture non-linear relationships between them effectively, whereas Logistic Regression assumes a linear relationship between the features and the target. My dataset is also quite small, so it is possible that outliers carried a lot of weight. As discussed in class, both Logistic Regression and SVC would be more affected by this than Random Forest.
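
The three searches in this step were run separately and compared by hand. As a hedged alternative, the sketch below puts all three model families into a single `GridSearchCV` by passing a list of parameter dictionaries, so the search itself picks the best model. The data (`X_toy`, `y_toy`) is synthetic and the grids are deliberately tiny; none of the numbers relate to the zoo results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Synthetic binary features (assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 2, size=(90, 4)), columns=["a", "b", "c", "d"])
# XOR target: a non-linear relationship between features and label.
y_toy = (X_toy["a"] ^ X_toy["b"]).to_numpy()

# The 'classifier' step is a placeholder that each grid entry replaces.
pipe_demo = Pipeline(steps=[
    ("preprocessor", OneHotEncoder(handle_unknown="ignore")),
    ("classifier", LogisticRegression()),
])

# One dictionary per model family, each with its own hyperparameters.
param_grid_demo = [
    {"classifier": [LogisticRegression(max_iter=5000, random_state=0)],
     "classifier__C": [0.1, 1.0]},
    {"classifier": [RandomForestClassifier(random_state=0)],
     "classifier__n_estimators": [10, 20],
     "classifier__max_depth": [3, 5]},
    {"classifier": [SVC(kernel="rbf", random_state=0)],
     "classifier__C": [1, 10],
     "classifier__gamma": [0.1, 1]},
]

search = GridSearchCV(pipe_demo, param_grid_demo, cv=3)
search.fit(X_toy, y_toy)

best_model_name = type(search.best_params_["classifier"]).__name__
print("Best model family:", best_model_name)
print("Best cross-validation score: {:.2f}".format(search.best_score_))
```

A single search keeps the cross-validation folds identical across models, so the reported scores are directly comparable.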


## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [201]:
# Calculate testing accuracy (1 mark)
print("\nRandomForest\nTest-set score: {:.2f}".format(gridRF.score(X_test, y_test)))


RandomForest
Test-set score: 0.97
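
As a sanity check on what `score` reports, the sketch below compares the default score of a fitted `GridSearchCV` on a classifier with `accuracy_score` computed from its predictions. It uses synthetic data (`X_toy`, `y_toy`) and a small Logistic Regression grid, both chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data (assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(80, 3))
y_toy = X_toy[:, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

# With no 'scoring' argument, the search falls back to the estimator's own
# score method, which is mean accuracy for classifiers.
search_demo = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}, cv=3)
search_demo.fit(X_tr, y_tr)

default_score = search_demo.score(X_te, y_te)
manual_accuracy = accuracy_score(y_te, search_demo.predict(X_te))
print(default_score, manual_accuracy)
```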



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

*ANSWER HERE*
1. I used the default score method which computes the accuracy for classification.
2. Yes, the model generalized well: its test score of 0.97 matches the validation score from Step 3.
3. Although the model did very well, with a test score of 0.97, I would want more data for training and testing before using it in the real world. The dataset has only 101 entries, one per animal, so the 30% test split contains only about 30 animals. Animals vary a lot, so the model should be trained on data that covers that variation. Looking only at the score of 0.97, we would be tempted to start using the model in the real world, but it needs more rigorous training and testing first. I suspect that with more data that includes this variation, we would not get a training score of 1 and a test score of nearly 1.
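
One possible way to make the evaluation more rigorous on a small dataset, offered here only as a suggestion and not something done in this notebook, is repeated stratified cross-validation. It reports a mean and a spread instead of a single number from one split. The sketch assumes synthetic data (`X_toy`, `y_toy`) and arbitrary Random Forest settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic three-class data (assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 6))
y_toy = X_toy[:, 0] + X_toy[:, 1]  # classes 0, 1, 2

# 3 folds repeated 5 times with different shuffles gives 15 scores.
cv_demo = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)
model_demo = RandomForestClassifier(n_estimators=20, random_state=0)
scores = cross_val_score(model_demo, X_toy, y_toy, cv=cv_demo)

print("Mean accuracy: {:.2f}".format(scores.mean()))
print("Std of accuracy: {:.2f}".format(scores.std()))
```

A large standard deviation across repeats would suggest that a single high test score owes a lot to the particular split.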

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?



*ANSWER HERE*
1. I wrote the code myself after reviewing Imputation Examples 1 and 2 from class. I copied the print statements from the notes to print out the best parameters and results.
2. I completed the steps in order, one at a time.
3. I did not use AI.
4. I did not encounter any challenges, as the assignment was fairly straightforward. Reading through the examples mentioned above helped me understand what I needed to do.


## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*
I liked the assignment because it brings together everything we have learned so far, and I found it motivating to see how it all ties together. I also see that I have much more to learn, which seems challenging right now. It feels like it will take a lot more practice and studying to be able to work professionally in this field, but I am happy with my progress so far.