# Problem Set 10: Machine learning for predicting recidivism

As always...

1. Make your own copy of this notebook by going to `File->Save a copy in Drive`. This will create your own copy of this notebook that you can run on Colab.

2. Click on the title of the notebook, up above, and change it to `YourLastName_YourFirstName_DSP_PS10.ipynb`.

3. Go to `Share` in the upper right corner. Where it says "Add people, groups, and calendar events", enter the following email addresses (the TAs and me): parkut@bc.edu, bisc@bc.edu, prudhome@bc.edu.



## *Enter your honor pledge here*

"This code is my own work. I did not share my code or look at the code of another student. I did not consult ChatGPT, CoPilot, or another large language model."

## Overview
In this problem set, you'll be working with a dataset used to try to predict recidivism: whether someone commits a crime after they are paroled from prison. This is a much larger dataset than the ones we've used so far, in terms of both features and samples, so there is a lot more room for experimentation.

You will be building models to try to beat the majority class baseline. I've provided code to get you started, and you will use this notebook to build your models. You can do anything you like to the features, and you can use any classifier you like with any parameterization you like.

In the hackathon you carved out a separate dev set from your training data. This time you will be doing k-fold cross validation on your models. **I have provided three examples of how to do this**, but this is something that was demonstrated earlier. Review the prior sample code and the lecture notes for more details about k-fold cross validation.

## What to turn in
When you have three models whose average accuracy is better than the majority baseline, you will rebuild those models on the *whole* dataset and use them to predict the data in `test.csv`. When you submit this notebook, you will also submit these three prediction files, which you will call `submission1.csv`, `submission2.csv`, and `submission3.csv`. When the TAs grade your homework, they will report back your accuracy on this test dataset.

I have provided code that shows you how to create these files and how to move the files to a location in your Google Drive. You will download and submit these files with your GitHub repo, and you will also share the folder you create with the TAs, just as you share your Colab notebook.

**This notebook only needs to contain the code for creating the three final models -- first within the k-fold cross-validation setting and second in the full dataset setting. You must comment your code so that we know you know what you are doing.**

**This problem set is due Thursday, May 2, at 11:59pm.** You have a 24-hour grace period. (I don't want to make any work actually due after the end of classes, but you may submit the problem set on Friday if you prefer.)

## Step 1: Get the data

In [None]:
!rm -f data.csv
!rm -f test.csv
!wget https://raw.githubusercontent.com/CSCI1090-S24/ps10/main/data.csv
!wget https://raw.githubusercontent.com/CSCI1090-S24/ps10/main/test.csv

--2024-04-30 19:34:41--  https://raw.githubusercontent.com/CSCI1090-S24/ps10/main/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1995686 (1.9M) [text/plain]
Saving to: ‘data.csv’


2024-04-30 19:34:41 (31.6 MB/s) - ‘data.csv’ saved [1995686/1995686]

--2024-04-30 19:34:41--  https://raw.githubusercontent.com/CSCI1090-S24/ps10/main/test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220728 (216K) [text/plain]
Saving to: ‘test.csv’


2024-04-30 19:34:42 (8.24 MB/s) - ‘test.csv’ saved [220728/220728]



In [None]:
# Read in the data
import pandas as pd
import numpy as np

recidivism = pd.read_csv("data.csv")

test_data = pd.read_csv("test.csv")

In [None]:
# describe the dataset
print(recidivism.describe())

# print out the number of rows (samples)
rows, columns = recidivism.shape
print(rows)

# print out the number of features (columns)
print(columns)

# print out first 10 rows
# (a notebook only displays the last expression in a cell, so print explicitly)
print(recidivism.head(10))

# print out all column labels
print(recidivism.columns)

                 ID  Residence_PUMA  Supervision_Risk_Score_First  \
count   8190.000000     8190.000000                   8076.000000   
mean   13304.850549       12.172894                      5.868995   
std     7820.321672        7.151819                      2.381232   
min        1.000000        1.000000                      1.000000   
25%     6407.750000        6.000000                      4.000000   
50%    13278.000000       12.000000                      6.000000   
75%    20177.250000       18.000000                      8.000000   
max    26758.000000       25.000000                     10.000000   

        Dependents  Prior_Arrest_Episodes_Felony  Prior_Arrest_Episodes_Misd  \
count  8190.000000                   8190.000000                 8190.000000   
mean      1.516239                      5.469109                    3.148474   
std       1.220742                      3.187848                    2.278922   
min       0.000000                      0.000000          

Index(['ID', 'Gender', 'Race', 'Age_at_Release', 'Residence_PUMA',
       'Gang_Affiliated', 'Supervision_Risk_Score_First',
       'Supervision_Level_First', 'Education_Level', 'Dependents',
       'Prison_Offense', 'Prison_Years', 'Prior_Arrest_Episodes_Felony',
       'Prior_Arrest_Episodes_Misd', 'Prior_Arrest_Episodes_Violent',
       'Prior_Arrest_Episodes_Property', 'Prior_Arrest_Episodes_Drug',
       'Prior_Arrest_Episodes_PPViolationCharges',
       'Prior_Arrest_Episodes_DVCharges', 'Prior_Arrest_Episodes_GunCharges',
       'Prior_Conviction_Episodes_Felony', 'Prior_Conviction_Episodes_Misd',
       'Prior_Conviction_Episodes_Viol', 'Prior_Conviction_Episodes_Prop',
       'Prior_Conviction_Episodes_Drug',
       'Prior_Conviction_Episodes_PPViolationCharges',
       'Prior_Conviction_Episodes_DomesticViolenceCharges',
       'Prior_Conviction_Episodes_GunCharges', 'Prior_Revocations_Parole',
       'Prior_Revocations_Probation', 'Condition_MH_SA', 'Condition_Cog_Ed',
     

This dataset has lots of features, many of which are categorical (strings rather than numbers). In addition, many rows have one or more missing values (`NaN`). We need to address these issues, but since we have so many features, we can't look at each feature individually and determine the best approach like we did in prior problem sets. Instead, we're going to manage this automatically. It might not be as good as thinking carefully about each feature, but it will get our data into a state where we can do some prediction.

First, let's fill in the missing values. The code below will fill in each missing value with the mode (the most common value) of its column.

As the StatQuest guy would say "Terminology Alert!" Filling in missing values with the mode, mean, or median is called *imputing*.
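To make mode imputation concrete, here is a minimal sketch on a tiny made-up DataFrame (the column names and values are assumptions for illustration, not taken from `data.csv`). One detail worth noticing: `Series.mode()` returns a Series, because several values can tie for most common, so the sketch takes its first entry to get a single fill value.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the recidivism dataset).
toy = pd.DataFrame({
    "color": ["a", "b", None, "a"],
    "score": [1.0, np.nan, 3.0, 1.0],
})

# Loop over only the columns that contain at least one missing value.
for col in toy.columns[toy.isnull().any()]:
    # mode() returns a Series; [0] picks one fill value from it.
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

After this runs, the toy frame has no missing values left: the missing `color` becomes `"a"` and the missing `score` becomes `1.0`.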

Run the code below.

In [None]:
# Identify all the columns in the dataset that have NaN (missing) values
columns_with_missing = recidivism.columns[recidivism.isnull().any()].tolist()

# Replace all NaN with the mode for that column.
# mode() returns a Series (values can tie), so take its first entry;
# passing the whole Series to fillna would only fill rows whose index matches.
for col in columns_with_missing:
  recidivism[col] = recidivism[col].fillna(recidivism[col].mode()[0])


## Now do the same for the test data
columns_with_missing = test_data.columns[test_data.isnull().any()].tolist()
for col in columns_with_missing:
  test_data[col] = test_data[col].fillna(test_data[col].mode()[0])


Next let's replace all categorical values (like ethnicity or gender) with integer labels (0, 1, 2, ...). Run the code below.

In [None]:
# Replace categorical values in the dataset with integer labels.

# First create a list of columns that can't be interpreted as numbers.
string_columns = []
for col in recidivism.columns:

    # If all values can be converted to numeric without errors,
    # consider it as a numeric column.
    if recidivism[col].apply(pd.to_numeric, errors='coerce').notna().all():
        continue

    # Otherwise it's a string (categorical) column,
    string_columns.append(col)

# Next, factorize all the string (categorical) columns so
# that the categories are represented with integers and not strings.
for col in string_columns:
    recidivism[col] = pd.factorize(recidivism[col])[0]


# Now do the same for the test data
string_columns = []
for col in test_data.columns:
    if test_data[col].apply(pd.to_numeric, errors='coerce').notna().all():
        continue
    string_columns.append(col)

for col in string_columns:
    test_data[col] = pd.factorize(test_data[col])[0]
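
A short sketch of what `pd.factorize` does may help here. It numbers categories in order of first appearance, so factorizing two DataFrames separately can give the same category different integers. The toy values below are made up for illustration, and the `pd.Categorical` step is one possible way to reuse the training categories, not something the cell above does.

```python
import pandas as pd

# Toy columns, made up for illustration.
train = pd.DataFrame({"Gender": ["M", "F", "M"]})
test = pd.DataFrame({"Gender": ["F", "M", "F"]})

# factorize returns (integer codes, the unique values in order of appearance).
train_codes, train_uniques = pd.factorize(train["Gender"])
test_codes_separate, _ = pd.factorize(test["Gender"])

# Reusing the training categories keeps the mapping identical in both sets.
# A category never seen in training would get the code -1.
test_codes_shared = pd.Categorical(test["Gender"], categories=train_uniques).codes

print(list(train_codes))          # "M" is 0 and "F" is 1 in the training data
print(list(test_codes_separate))  # here "F" is 0, because it appears first
print(list(test_codes_shared))    # here "F" is 1, matching the training data
```

If your models behave oddly on the test data, checking whether the two encodings agree is a reasonable first thing to look at.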


Now we'll actually create the `X` and `y`. Run this code cell, but read the comments carefully.

In [None]:
# Create the training dataset from the dataframe

# This pulls out all the columns that are features for each sample
# (excludes the first, which is the ID, and the last, which is the y variable).
X = recidivism.iloc[:, 1:-1]

# This pulls out the final column which is what we are trying to predict:
# whether or not they were arrested for a new crime upon release.
# to_numpy() gives a 1-D array (Series.ravel() is deprecated in recent pandas).
y = recidivism["Recidivism_Arrest"].to_numpy()


# Create the test set from the test_data dataframe
# Get everything but the first column, which is the ID.
# There is no "Recidivism_Arrest" column because only the
# TAs and I know the answers to the test data.
X_test = test_data.iloc[:, 1:]
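
As a small illustration of the slicing used above, the sketch below applies the same `iloc` patterns to a made-up DataFrame and checks that the training and test features end up with the same columns in the same order. The feature names `feature_a` and `feature_b` and all values are hypothetical; only `ID` and `Recidivism_Arrest` come from the real data.

```python
import pandas as pd

# Toy frames, made up for illustration.
train_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "feature_a": [30, 41, 25],
    "feature_b": [0, 2, 1],
    "Recidivism_Arrest": [True, False, True],
})
test_df = pd.DataFrame({
    "ID": [4, 5],
    "feature_a": [52, 33],
    "feature_b": [1, 0],
})

X_toy = train_df.iloc[:, 1:-1]                    # drop ID (first) and target (last)
y_toy = train_df["Recidivism_Arrest"].to_numpy()  # target as a 1-D array
X_test_toy = test_df.iloc[:, 1:]                  # drop ID only

# A model trained on X_toy expects X_test_toy to have matching columns.
same_columns = list(X_toy.columns) == list(X_test_toy.columns)
print(X_toy.shape, y_toy.shape, X_test_toy.shape, same_columns)
```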

## Step 2: Develop your models

In this section you will run lots of experiments to try to improve upon the strongest of three baselines, the majority class baseline. As in the hackathon, you can change the parameters, features, and the models themselves. In the code spaces below you will run your experiments using k-fold cross validation. When you have three models that outperform the majority class baseline, you will proceed to step 3 and retrain those models on all the data and test on the test data. Instructions for that part of the assignment are found in step 3.
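One way to organize such experiments is to loop over a few parameter settings and record the mean cross-validated accuracy for each. The sketch below does this on synthetic data from `make_classification`, which stands in for `X` and `y` so the example runs on its own; the particular depths, the small forest size, and `cv=5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y (not the recidivism data).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Try several tree depths and keep the mean accuracy for each one.
results = {}
for depth in [2, 5, 10]:
    clf = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=1)
    scores = cross_val_score(clf, X_toy, y_toy, cv=5)
    results[depth] = float(np.mean(scores))

for depth, acc in results.items():
    print(f"max_depth={depth}: mean accuracy {acc:.4f}")
```

Keeping the results in a dictionary makes it easy to compare settings side by side before committing to one.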

**Please note: you are permitted to come up with one hand-designed method for predicting recidivism that does not use machine learning. Remember that we did this in one of the very first problem sets in the class! This will be fun, and you might find that your non-machine learning method outperforms other methods.**
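For a sense of what a hand-designed method could look like, here is a purely hypothetical sketch: a single threshold rule on `Prior_Arrest_Episodes_Felony`, scored for accuracy on a made-up toy DataFrame. The threshold of 5, the toy values, and the boolean form of `Recidivism_Arrest` are all assumptions for illustration; this is not a recommended rule, just the shape of the code.

```python
import numpy as np
import pandas as pd

def rule_based_predict(df, threshold=5):
    """Hypothetical rule: predict True when prior felony arrest
    episodes are at or above the threshold."""
    return (df["Prior_Arrest_Episodes_Felony"] >= threshold).to_numpy()

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "Prior_Arrest_Episodes_Felony": [0, 2, 7, 9, 5, 1],
    "Recidivism_Arrest": [False, False, True, True, False, True],
})

preds = rule_based_predict(toy)
# Accuracy is the fraction of predictions that match the true labels.
accuracy = np.mean(preds == toy["Recidivism_Arrest"].to_numpy())
print(f"Accuracy: {accuracy:.4f}")
```

On this toy frame the rule gets 4 of 6 rows right. A rule like this has no parameters to fit, so you can score it directly on the training data and compare against the majority class baseline.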

### Demonstration of k-fold cross validation

Instead of partitioning the data into a train and dev set, we are going to use 10-fold cross-validation. The code below demonstrates how to do this with three classifiers: (1) random baseline; (2) majority baseline; (3) `DecisionTreeClassifier` with default parameters.

You'll see that the majority class baseline (i.e., predict that everyone does not commit a crime when they are released) outperforms the decision tree.
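To show what happens inside k-fold cross-validation, the sketch below hand-rolls it for the majority class baseline using only NumPy. It is a simplification under stated assumptions: it uses plain contiguous folds, whereas scikit-learn's `cross_val_score` with an integer `cv` and a classifier uses stratified folds, and the toy labels are made up.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Split the row indices 0..n_samples-1 into k contiguous folds."""
    return np.array_split(np.arange(n_samples), k)

def majority_baseline_cv(y, k):
    """Mean accuracy of always predicting the most common training label."""
    y = np.asarray(y)
    scores = []
    for held_out in k_fold_indices(len(y), k):
        # Train on everything except the held-out fold.
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[held_out] = False
        labels, counts = np.unique(y[train_mask], return_counts=True)
        majority = labels[np.argmax(counts)]
        # Score the constant prediction on the held-out fold only.
        scores.append(np.mean(y[held_out] == majority))
    return float(np.mean(scores))

# Toy labels, made up for illustration: 6 of 10 are False.
y_toy = np.array([False, True, False, False, True,
                  False, True, False, True, False])
acc = majority_baseline_cv(y_toy, k=5)
print(f"Mean Accuracy: {acc:.4f}")
```

Each row is held out exactly once, and the reported number is the average of the k per-fold accuracies, which is the same quantity the demonstration cell prints.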

Run the code below, then in the following code cells run your own experiments using cross validation by selecting new algorithms, modifying the parameters, and using more or fewer features.

In [None]:
## DO NOT DELETE OR MODIFY THIS CODE CELL
## USE THIS AS A REFERENCE FOR THE CODE YOU CREATE
## IN THE FOLLOWING CODE CELLS

## DEMONSTRATION OF HOW TO USE K-FOLD CROSS-VALIDATION
## WITH OUR THREE BASELINES

# import cross validation function and classifiers
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier

# set the value of k
k = 10

### Random baseline
print("Random baseline")
random_clf = DummyClassifier(strategy='uniform')

# Do the cross validation
scores = cross_val_score(random_clf, X, y, cv=k)

# Then print the average of all k of those scores.
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### Majority class baseline
print("Majority class baseline")
majority_clf = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(majority_clf, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### Decision tree classifier
print("Decision tree classifier")
dectree = DecisionTreeClassifier()
scores = cross_val_score(dectree, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()



Random baseline
Mean Accuracy: 0.5011

Majority class baseline
Mean Accuracy: 0.6023

Decision tree classifier
Mean Accuracy: 0.5736



In [None]:
# USE THESE CODE CELLS TO EXPERIMENT
# Increasing k two-fold (from 10 to 20) makes the model slightly more accurate

from sklearn.ensemble import RandomForestClassifier

### Random Forest Classifier
print("Random Forest classifier")
model1 = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=1)
scores = cross_val_score(model1, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

Random Forest classifier
Mean Accuracy: 0.6527



In [None]:
# USE THESE CODE CELLS TO EXPERIMENT
from sklearn.linear_model import RidgeClassifier

k = 20

### Ridge Classifier
print("Ridge Classifier")
model2 = RidgeClassifier()
scores = cross_val_score(model2, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

Ridge Classifier
Mean Accuracy: 0.6520



In [None]:
# USE THESE CODE CELLS TO EXPERIMENT
from sklearn.naive_bayes import GaussianNB

### GaussianNB
print("GaussianNB")
model3 = GaussianNB()
scores = cross_val_score(model3, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

GaussianNB
Mean Accuracy: 0.6137



In [None]:
# THE CODE FOR YOUR FIRST BEST MODEL GOES HERE
# Make sure to show the accuracy of your model with k-fold
# cross-validation, as I demonstrated above.

# Increasing k two-fold (from 10 to 20) makes the model slightly more accurate

from sklearn.ensemble import RandomForestClassifier

k = 20

### Random baseline
print("Random baseline")
random_clf = DummyClassifier(strategy='uniform')

# Do the cross validation
scores = cross_val_score(random_clf, X, y, cv=k)

# Then print the average of all k of those scores.
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### Majority class baseline
print("Majority class baseline")
majority_clf = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(majority_clf, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### Random Forest Classifier
print("Random Forest classifier")
model1 = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=1)
scores = cross_val_score(model1, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

Random baseline
Mean Accuracy: 0.4945

Majority class baseline
Mean Accuracy: 0.6023

Random Forest classifier
Mean Accuracy: 0.6527



In [None]:
# THE CODE FOR YOUR SECOND BEST MODEL GOES HERE
# Make sure to show the accuracy of your model with k-fold
# cross-validation, as I demonstrated above.

from sklearn.linear_model import RidgeClassifier

k = 20

### Random baseline
print("Random baseline")
random_clf = DummyClassifier(strategy='uniform')

# Do the cross validation
scores = cross_val_score(random_clf, X, y, cv=k)

# Then print the average of all k of those scores.
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### Majority class baseline
print("Majority class baseline")
majority_clf = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(majority_clf, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### Ridge Classifier
print("Ridge Classifier")
model2 = RidgeClassifier()
scores = cross_val_score(model2, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

Random baseline
Mean Accuracy: 0.5021

Majority class baseline
Mean Accuracy: 0.6023

Ridge Classifier
Mean Accuracy: 0.6520



In [None]:
# THE CODE FOR YOUR THIRD BEST MODEL GOES HERE
# Make sure to show the accuracy of your model with k-fold
# cross-validation, as I demonstrated above.

from sklearn.naive_bayes import GaussianNB

k = 20

### Random baseline
print("Random baseline")
random_clf = DummyClassifier(strategy='uniform')

# Do the cross validation
scores = cross_val_score(random_clf, X, y, cv=k)

# Then print the average of all k of those scores.
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### Majority class baseline
print("Majority class baseline")
majority_clf = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(majority_clf, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

### GaussianNB
print("GaussianNB Classifier")
model3 = GaussianNB()
scores = cross_val_score(model3, X, y, cv=k)
print(f"Mean Accuracy: {np.mean(scores):.4f}")
print()

Random baseline
Mean Accuracy: 0.4982

Majority class baseline
Mean Accuracy: 0.6023

GaussianNB Classifier
Mean Accuracy: 0.6137



## Step 3: Create your submissions

Above you ran experiments and found three models that outperform the majority class baseline under 10-fold cross-validation. Now you need to train those models on *all* the data and test them on the official test data from `test.csv`.

Above I read in `test.csv` and stored its features in a variable called `X_test`. If you have not modified the features in any way, you can use `X`, `y`, and `X_test` as is, as I show below.


### If you changed the features...
**If you modified the features** in your experiments above (e.g., you are using fewer features, you altered the feature values somehow), you will need to make those modifications to `X` and `X_test` in the code cell below.

If you did not change the features, you can skip this code cell.



In [None]:
## Here is a code cell where you can adjust the features for X and X_test
## in case that was something you did in your experiments above that resulted
## in a better model.

## If you did not change the features, you do not need to write anything in
## this cell.
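
As a hypothetical illustration of keeping the two feature sets in sync, the sketch below selects the same subset of columns from a toy training frame and a toy test frame. The column names `a`, `b`, `c` and the `keep` list are made up; with the real data you would use your own chosen column names.

```python
import pandas as pd

# Toy stand-ins for X and X_test, made up for illustration.
X_toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
X_test_toy = pd.DataFrame({"a": [7], "b": [8], "c": [9]})

# Define the kept columns once and apply the same list to both frames,
# so the model sees identical features at training and prediction time.
keep = ["a", "c"]
X_small = X_toy[keep]
X_test_small = X_test_toy[keep]

print(list(X_small.columns), list(X_test_small.columns))
```

Defining the column list in one place avoids the situation where a feature is dropped from `X` but accidentally left in `X_test`.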





### Demonstration of creating your submissions

Above we did cross validation to evaluate three naive models. The code below creates one submission for each of these models trained on the *full* dataset and tested on the `test.csv` data.

You will edit these code cells to create your submissions from your strongest models. You will just have to replace the lines initializing the models with the line you used in your cross-validation experiment to initialize your strong-performing model.

**AGAIN:** In the code below you will replace the models with the ones you created. You don't need to make any other changes. If you changed the features, please include that code in the above code cell, as clearly indicated.

In [None]:
import csv

## My submission #1: random forest classifier

## Change the line below to create your model.
## If you changed features and created new variables, replace X,
## y, and X_test with your new variables.

model1 = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=1)

model1.fit(X, y)
y_pred_random = model1.predict(X_test)
with open("submission1.csv", 'w', newline='') as f:
    for e in y_pred_random:
      f.write(str(e) + "\n")

In [None]:
## My submission #2: ridge classifier

model2 = RidgeClassifier()
model2.fit(X, y)
y_pred_ridge = model2.predict(X_test)
with open("submission2.csv", 'w', newline='') as f:
    for e in y_pred_ridge:
      f.write(str(e) + "\n")

In [None]:
## My submission #3: Gaussian NB classifier

model3 = GaussianNB()
model3.fit(X, y)
y_pred_gauss = model3.predict(X_test)
with open("submission3.csv", 'w', newline='') as f:
    for e in y_pred_gauss:
      f.write(str(e) + "\n")
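
Before moving the files, it can help to sanity-check that a submission file holds exactly one prediction per test row. The sketch below writes a made-up list of predictions in the same one-per-line format and counts the lines back; the helper names, the temporary file, and the toy predictions are assumptions for illustration.

```python
import os
import tempfile

def write_predictions(path, predictions):
    """Write one prediction per line, matching the cells above."""
    with open(path, "w", newline="") as f:
        for p in predictions:
            f.write(str(p) + "\n")

def count_prediction_lines(path):
    """Count the non-empty lines in a prediction file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())

# Toy predictions, made up for illustration.
toy_predictions = [True, False, False, True]
path = os.path.join(tempfile.mkdtemp(), "submission_check.csv")

write_predictions(path, toy_predictions)
n_lines = count_prediction_lines(path)
print(n_lines == len(toy_predictions))
```

With the real files, the count you would compare against is `len(X_test)`.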

### Moving your submissions to your Google Drive

Now you need to mount your Google Drive, create a folder there, and move your submissions to that folder. When you log into drive.google.com, you will see that folder. You will share that folder with the TAs just as you always do with Colab notebooks, and you will also download the files to your computer and commit them to your GitHub repo.

The first step is to mount your Google Drive. This will open a dialog window. Just keep clicking yes. You are just giving Google permission to look at your Drive from Colab, which Google also owns.

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


The code block below creates a folder in your Drive called `problemset10`, then moves your submission CSVs to that folder. The final line should list your three submission CSVs.

You must run this code every time you create a new submission CSV file.

After you run this code, you can go to [drive.google.com](https://drive.google.com/drive/u/0/my-drive) and you'll see your `problemset10` folder.

In [None]:
!mkdir -p /content/drive/MyDrive/problemset10
!mv submission* /content/drive/MyDrive/problemset10
!ls /content/drive/MyDrive/problemset10

submission1.csv  submission2.csv  submission3.csv


---

## How to submit

1. Share this Colab notebook with the TAs and me.
2. Share your `problemset10` folder with the TAs and me.
3. Download the Colab notebook and push it to your GitHub repo.
4. Download your submissions and push them to your GitHub repo.

**This problem set is due Thursday, May 2, at 11:59pm.**

You have a 24-hour grace period. I don't want to make any work actually due after the end of classes, but you may submit the problem set on Friday if you prefer.