## Homework 5: Regularization

Goals:

* Practice using the `sklearn` library
* Experiment with model complexity using regularization
* Experiment with model complexity using polynomial features


For the purposes of this exercise you will need a classification dataset. 


Deliverables:

* The source code
* A separate file containing a short report that includes Tables 1, 2, 3, and 4 and Figures 1 and 2, as described below.
* Your dataset

Please upload these separately, not as an archive. Your code should be well organized and well commented.

**You are free to use AI without restriction for this assignment**



### Stage 1: Data preparation

*Consult an AI such as ChatGPT as necessary to assist with the following tasks. While some of the steps described below may be unclear to you, they are standard ML operations and ChatGPT will be able to not only write the code, but also explain the purpose of the step to you.  Here is an optional "header" you can put in your AI conversation:*

**Hello, I am a student in a machine learning class. My professor has given me a series of steps to execute in a machine learning project. In the dialog below I will share some of them with you. When you respond to me, please construct your answer so as to inform me about the purpose of the indicated action and provide some well-documented Python code to assist me in implementing the project myself. In your Python comments, please assume that I am not very familiar with the Python programming language or the standard ML libraries. I would like the comments in your code to be long and fairly substantive. You may assume that this is the first time I have done an exercise of this nature. Naturally I may have follow up questions to your responses, and in those cases I would appreciate your guidance, with your responses being tailored to my background and level of experience.**  

Once you load your data you may want to provide ChatGPT with a sample so that it knows the kind of thing you are doing. For example you might do:

```python
print(df.sample(2))
```

and then paste this into the ChatGPT conversation, informing it that you are providing a sample of your dataset. 

*Make sure you are using a classification dataset and not a regression dataset*.

You have an opportunity here to shape your own education a bit. Do you want to really know how to do ML coding without the AI helping you? If so, you should use AI minimally. You should also probably retype the AI's output yourself rather than copying and pasting it. This level of expertise may be helpful in a job interview some day. The level of mastery you attain in this course regarding practical ML is basically a function of how hard you want to try.


#### Stage 1 A: Fixing `X` and `y`.


Load your chosen dataset into a pandas dataframe. The `sklearn.datasets` module includes functions that load standard benchmark datasets. You can also find standard datasets on the UCI ML Repo or on Kaggle. Please do not use synthetic data (e.g., `make_blobs`). Use a real dataset, preferably one of interest to you.

Convert all nominal columns to numerical columns and drop irrelevant features if applicable (e.g., an ID column).
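As a rough illustration of the conversion step, the sketch below uses a tiny made-up frame purely to demonstrate the operations (your actual data should still be a real dataset). The column names `id`, `color`, and `size` are hypothetical, and one-hot encoding with `pd.get_dummies` is only one of several reasonable ways to handle nominal columns.

```python
import pandas as pd

# A tiny hypothetical frame, used only to illustrate the two operations.
toy = pd.DataFrame({
    "id": [101, 102, 103],            # identifier: carries no predictive signal
    "color": ["red", "blue", "red"],  # nominal (categorical) column
    "size": [1.0, 2.5, 3.0],          # already numeric
})

# Drop the irrelevant ID column.
toy = toy.drop(columns=["id"])

# One-hot encode the nominal column: each category becomes its own 0/1 column.
toy = pd.get_dummies(toy, columns=["color"], dtype=int)

print(toy)
```

After this, every remaining column is numeric, which is what the `sklearn` models expect.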

Slice out the independent variable columns and put them into a variable `X`. This can be a numpy array or a pandas dataframe. 

Slice out the dependent variable column and put it into a variable `y`. 

(You do not need to add a bias column to `X` or convert `y` to be $\pm 1$ because the library will do that for us. It is also fine if `y` contains more than 2 values, provided it contains only a small discrete set of distinct values.)
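A minimal sketch of Stage 1 A, assuming the wine dataset bundled with `sklearn` as a stand-in for your own data. The choice of `load_wine` and the column name `target` are specific to this example; your dataset will have its own loader and label column.

```python
from sklearn.datasets import load_wine

# Load a small real dataset that ships with sklearn (a stand-in for your own).
wine = load_wine(as_frame=True)
df = wine.frame  # pandas DataFrame: 13 feature columns plus a "target" column

# Independent variables: every column except the label.
X = df.drop(columns=["target"])

# Dependent variable: the label column (three wine classes: 0, 1, 2).
y = df["target"]

print(X.shape)
print(sorted(y.unique()))
```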

#### Stage 1 B: Splitting and Scaling

Use `train_test_split` to split `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`.

Choose a scaling algorithm and library (e.g. `StandardScaler`). Fit and transform the training data using the scaler. Transform the test data using the same scaler. Call the resulting matrices `X_train_scl` and `X_test_scl`. 
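The split-then-scale order described above might look like the following sketch. The wine dataset, the 25% test size, `random_state=0`, and stratification are all assumptions made for illustration, not requirements of the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; replace with your own X and y.
X, y = load_wine(return_X_y=True, as_frame=True)

# Hold out 25% of the rows for testing. stratify=y keeps class proportions
# similar in both parts; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training data only, so that no information
# about the test set leaks into the preprocessing.
scaler = StandardScaler()
X_train_scl = scaler.fit_transform(X_train)

# Reuse the training means and standard deviations on the test data.
X_test_scl = scaler.transform(X_test)

print(X_train_scl.shape, X_test_scl.shape)
```

Note that the test columns will generally not have exactly zero mean and unit variance, because they are scaled with the training statistics.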

#### Stage 1 C: Polynomial Features

Using the original `X`, apply a `PolynomialFeatures` transformation to add all polynomial features up to degree 3. Call the result `X_p3`.

Do a train/test split on `X_p3` and `y` as before, using appropriate names, e.g. `X_train_p3`.

Scale the resulting data using appropriate names, e.g. `X_train_p3_scl`. 
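One possible sketch of Stage 1 C, again assuming the wine dataset as a stand-in. Setting `include_bias=False` is a choice made here (the classifiers add their own intercept), and reusing the same `random_state` as in the earlier split is an assumption that keeps the same rows in the train and test sets for both versions of the data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)

# Add every product of the original features up to degree 3
# (x1, x1*x2, x1**2 * x3, ...). The constant column is left out.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_p3 = poly.fit_transform(X)

# Same split settings as for the plain data, so the rows line up.
X_train_p3, X_test_p3, y_train, y_test = train_test_split(
    X_p3, y, test_size=0.25, random_state=0, stratify=y
)

# Scale after the expansion: products of features can have very large ranges.
scaler_p3 = StandardScaler()
X_train_p3_scl = scaler_p3.fit_transform(X_train_p3)
X_test_p3_scl = scaler_p3.transform(X_test_p3)

print(X.shape[1], "original features ->", X_p3.shape[1], "polynomial features")
```

The feature count grows quickly with degree: 13 original columns become 559 here, so expect this step to be expensive on wide datasets.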

### Stage 2: Defining our models

Import the logistic regression library from `sklearn`:

```python
from sklearn.linear_model import LogisticRegression
```

Use Python introspection to print the documentation. Read it and ask ChatGPT questions about anything that is unclear. There will be a lot of things you do not understand; that is normal. 

```python
## IPython/Jupyter introspection. Put this in a cell and run the cell.
## (Outside a notebook, help(LogisticRegression) prints the same documentation.)
LogisticRegression?
```

Follow similar steps to import and learn about the decision tree classifier, `DecisionTreeClassifier`, and the support vector machine classifier, `SVC`.
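For reference, the two classes live in `sklearn.tree` and `sklearn.svm`. The sketch below also uses `get_params()`, which is simply one compact way to see the tunable parameters and their defaults; it is not a substitute for reading the documentation.

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# In a notebook you could run `DecisionTreeClassifier?` or `SVC?` in a cell.
# help(...) is the plain-Python equivalent and works anywhere, for example:
# help(SVC)

# get_params() lists every constructor parameter with its current value.
# On a freshly constructed model these are the defaults.
tree_defaults = DecisionTreeClassifier().get_params()
svc_defaults = SVC().get_params()

print("tree max_depth default:", tree_defaults["max_depth"])
print("SVC C default:", svc_defaults["C"])
print("SVC kernel default:", svc_defaults["kernel"])
```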

If your data has more than a few thousand rows, the `SVC` class may be too slow (its fit time scales at least quadratically in the number of samples). If this is the case, you can shuffle your data (be sure to shuffle `X` and `y` the same way) and truncate your dataset to the first few thousand samples.
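A minimal sketch of shuffling and truncating, assuming `sklearn.utils.shuffle` as the shuffling tool. The name `max_rows` and its value are hypothetical; the wine dataset is small enough that truncation is not actually needed and is shown only to illustrate the mechanics.

```python
from sklearn.datasets import load_wine
from sklearn.utils import shuffle

X, y = load_wine(return_X_y=True, as_frame=True)

max_rows = 100  # hypothetical cap; for a large dataset use a few thousand

# shuffle() applies the same random permutation to X and y,
# so each row of X stays paired with its own label.
X_shuf, y_shuf = shuffle(X, y, random_state=0)

# Keep only the first max_rows rows of the shuffled data.
X_small = X_shuf.iloc[:max_rows]
y_small = y_shuf.iloc[:max_rows]

print(X_small.shape, y_small.shape)
```

Shuffling first matters when the file is sorted by class, since taking the first rows of a sorted file could drop entire classes.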

### Stage 3: Warm up

There is nothing to turn in for this section -- it is just about building your intuition. For each of your models, try fitting the training data. Then look at the score of your model on the test set. You can use the default parameters for the model. Try both the plain `X_train_scl` data as well as the enriched `X_train_p3_scl` data. You might also try tweaking one of the model parameters. For example, try adjusting `max_depth` on your decision tree, or `C` on your linear model. 
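The warm-up loop could be sketched roughly as follows. The helper `split_and_scale`, the wine dataset, and `max_iter=5000` for logistic regression (raised so the solver has room to converge on the larger feature set) are all assumptions made for this example.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_p3 = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

def split_and_scale(features, labels):
    """Split into train/test, then scale using training statistics only."""
    f_train, f_test, l_train, l_test = train_test_split(
        features, labels, test_size=0.25, random_state=0, stratify=labels
    )
    scaler = StandardScaler().fit(f_train)
    return scaler.transform(f_train), scaler.transform(f_test), l_train, l_test

datasets = {"plain": split_and_scale(X, y), "degree-3": split_and_scale(X_p3, y)}
models = {
    "LR": LogisticRegression(max_iter=5000),
    "Tree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
}

scores = {}
for data_name, (X_tr, X_te, y_tr, y_te) in datasets.items():
    for model_name, model in models.items():
        model.fit(X_tr, y_tr)  # calling fit again discards the previous fit
        train_acc = model.score(X_tr, y_tr)  # accuracy on data the model saw
        test_acc = model.score(X_te, y_te)   # accuracy on held-out data
        scores[(data_name, model_name)] = (train_acc, test_acc)
        print(f"{data_name:9s} {model_name:5s} train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the training and test scores is the usual sign of overfitting, which is what the later tables are designed to expose.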

### Stage 4: Experiments and table building

The main "deliverables" for your work on this assignment will be a series of tables. These tables should be appropriately named and described and included in a separate document. 

### Tables 1 and 2: The C parameter

Make a table with two rows, one called LogisticRegression (or just LR) and one called SVC. There should be one column for each value in `n = [-3,-2,-1,0,1,2,3]` (seven columns). The entry in the ith row and jth column should be a pair of numbers for the ith classifier using the regularization parameter `C = 10**n[j]`. The pair should be the training score and the test score of the model, respectively, computed on the regular data `X_train_scl` and `X_test_scl` (no polynomial features).

Which choice of model and regularization has the best test score? Where does there seem to be overfitting of the training data? Where does there seem to be underfitting?

Now make essentially the same table, but using `X_train_p3_scl` and `X_test_p3_scl`. What is the overall best model out of your two tables?
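The table layout described in this section can be sketched with a small helper. The function name `c_table`, the `"train / test"` cell format, the wine dataset, and `max_iter=5000` are assumptions of this example rather than requirements.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler()
X_train_scl = scaler.fit_transform(X_train)
X_test_scl = scaler.transform(X_test)

n = [-3, -2, -1, 0, 1, 2, 3]

def c_table(X_tr, X_te, y_tr, y_te):
    """Return a 2 x len(n) table of 'train / test' accuracy strings."""
    makers = {
        "LR": lambda C: LogisticRegression(C=C, max_iter=5000),
        "SVC": lambda C: SVC(C=C),
    }
    rows = {}
    for name, make in makers.items():
        cells = []
        for exponent in n:
            # Small C means strong regularization; large C means weak.
            model = make(10.0 ** exponent).fit(X_tr, y_tr)
            train_acc = model.score(X_tr, y_tr)
            test_acc = model.score(X_te, y_te)
            cells.append(f"{train_acc:.3f} / {test_acc:.3f}")
        rows[name] = cells
    columns = [f"n={exponent}" for exponent in n]
    return pd.DataFrame.from_dict(rows, orient="index", columns=columns)

table1 = c_table(X_train_scl, X_test_scl, y_train, y_test)
print(table1.to_string())
```

Calling the same helper with `X_train_p3_scl` and `X_test_p3_scl` would produce the second table.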

### Tables 3 and 4: Regularizing the decision tree

Make a table with one row for each value of `mss = [0.01,0.1,0.2]` and one column for each choice of `md = [2,4,8,16,None]`. The entry in the ith row and jth column should be based on a decision tree classifier using `min_samples_split=mss[i]` and `max_depth=md[j]`, trained on the regular data `X_train_scl`. The entry should be a pair of numbers corresponding to the score of the model on the training set and test set, respectively.

Which choice of model and regularization has the best test score? Where does there seem to be overfitting of the training data? Where does there seem to be underfitting?

Now make essentially the same table, but using `X_train_p3_scl` and `X_test_p3_scl`. What is the overall best model out of your two tables?
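A sketch of the decision tree grid, under the same assumptions as before (wine dataset as a stand-in, 25% test split). Fixing `random_state=0` on the tree is an added choice that makes the table reproducible from run to run.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler()
X_train_scl = scaler.fit_transform(X_train)
X_test_scl = scaler.transform(X_test)

mss = [0.01, 0.1, 0.2]     # floats: fraction of training rows needed to split a node
md = [2, 4, 8, 16, None]   # None means the depth is not limited

rows = {}
for split_frac in mss:
    cells = []
    for depth in md:
        tree = DecisionTreeClassifier(
            min_samples_split=split_frac, max_depth=depth, random_state=0
        )
        tree.fit(X_train_scl, y_train)
        train_acc = tree.score(X_train_scl, y_train)
        test_acc = tree.score(X_test_scl, y_test)
        cells.append(f"{train_acc:.3f} / {test_acc:.3f}")
    rows[f"mss={split_frac}"] = cells

table3 = pd.DataFrame.from_dict(
    rows, orient="index", columns=[f"md={depth}" for depth in md]
)
print(table3.to_string())
```

Larger `min_samples_split` and smaller `max_depth` both restrict the tree, so the most heavily regularized cell is the bottom-left one.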


### Figures 1 and 2

Make a visualization of your best decision tree overall (with respect to test performance). This visualization should show the tree structure, the Gini values, and use informative column names. 

Make a visualization of your best decision tree of `max_depth=4` (with respect to test performance). This visualization should show the tree structure, the Gini values, and use informative column names. 
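One way to draw such a figure is `sklearn.tree.plot_tree`, sketched below. The specific hyperparameters (`max_depth=4`, `min_samples_split=0.01`) are placeholders; in your own work you would substitute whichever settings scored best in your tables. The wine dataset and figure size are likewise assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree

wine = load_wine(as_frame=True)
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler()
X_train_scl = scaler.fit_transform(X_train)

# Placeholder hyperparameters; use your best-scoring settings instead.
clf = DecisionTreeClassifier(max_depth=4, min_samples_split=0.01, random_state=0)
clf.fit(X_train_scl, y_train)

fig, ax = plt.subplots(figsize=(16, 8))
annotations = plot_tree(
    clf,
    feature_names=list(X.columns),  # informative names instead of x[0], x[1], ...
    class_names=[str(name) for name in wine.target_names],
    impurity=True,                  # show the Gini value in each node
    filled=True,                    # color nodes by majority class
    ax=ax,
)
ax.set_title("Decision tree with max_depth=4 (sketch)")
```

In a notebook the figure is displayed when the cell finishes; `fig.savefig(...)` can write it to a file for the report. Because the tree was trained on scaled data, the split thresholds shown are in standardized units rather than the original units.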
