# CS 5110: Data Privacy Final - Nicholas Kent
## Machine Learning Model Using DP-Mini-Batching Gradient Descent

In [29]:
# Load the data and libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import gen_batches
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Modified Gaussian mechanism that works on a 2D NumPy array
def gaussian_mech_vec(vec, sensitivity, epsilon, delta):
    # Standard deviation of the noise for the Gaussian mechanism
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    # Draw independent noise for every entry of the array
    noise = np.random.normal(loc=0, scale=sigma, size=vec.shape)
    return vec + noise


# Download the processed csv files
# Define vars to store:
# The original dataset for reference
athlete = pd.read_csv('https://raw.githubusercontent.com/nichkent/Data-Privacy-Final/main/athlete_events.csv')

# The x or sample columns
athlete_x = pd.read_csv('https://raw.githubusercontent.com/nichkent/Data-Privacy-Final/main/athlete_events_processed_x.csv')

# The y or target columns
athlete_y = pd.read_csv('https://raw.githubusercontent.com/nichkent/Data-Privacy-Final/main/athlete_events_processed_y.csv') 

## Part 1: Prepare athlete_x and athlete_y For the Model

In this part, the data from the two csv files loaded 
into `athlete_x` and `athlete_y` is prepared and processed so that it
can be passed to the machine learning model later.

- This required removing columns that would not be useful for predicting the target. In the same step, the **Name** and **Event** columns were saved separately before being removed, so they can be used later to identify the athlete who is most likely to win a gold medal.

- After saving the **Name** and **Event** columns to separate numpy files, the only categorical column left was **Sex**. This column was one-hot encoded using Sklearn's `OneHotEncoder`.

- Since the data contains NA values, they had to be handled in some way. I used Sklearn's `SimpleImputer` to replace the NA values in each column with that column's mean. This avoids an error that would otherwise occur when fitting the **Logistic Regression** model later, which cannot handle missing values.

Finally, the manipulated datasets are stored in their respective numpy arrays to be fed to the model during the learning process.
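
The imputation and encoding steps above can be sketched on a toy array. This is a minimal illustration rather than the notebook's exact pipeline: the height and weight values and the `sex` array are made up, and `.toarray()` is used so the sketch does not depend on how the encoder's dense-output argument is named in a given scikit-learn version.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two numeric columns (height, weight) with missing values
numeric = np.array([[180.0, 80.0],
                    [np.nan, 60.0],
                    [170.0, np.nan]])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
numeric_imputed = imputer.fit_transform(numeric)

# Hypothetical categorical column, shaped (n_samples, 1)
sex = np.array([['M'], ['F'], ['M']], dtype=object)

# One-hot encode; the default output is a sparse matrix, so convert it to dense
encoder = OneHotEncoder()
sex_encoded = encoder.fit_transform(sex).toarray()

# Combine the numeric and encoded columns into one feature matrix
features = np.hstack([numeric_imputed, sex_encoded])
print(encoder.categories_[0])  # categories are stored in sorted order
print(features)
```

Each missing value is filled with the mean of the observed values in its own column, and the single categorical column becomes one indicator column per category.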

In [30]:
# Step 1: Convert all feature names to strings
athlete_x.columns = [str(col) for col in athlete_x.columns]

# Save the 'Name' and 'Event' columns separately before dropping them
# This is so that we can reassign the final values for who got the medal at the end
# Save to separate numpy files
names_array = athlete_x['Name'].values
events_array = athlete_x['Event'].values
np.save('names.npy', names_array)
np.save('events.npy', events_array)



# Step 2: Drop all irrelevant columns

# Drop irrelevant columns including 'Name' and 'Event' now that they have been saved
columns_to_drop = ['ID', 'Games', 'Team', 'NOC', 'Year', 'Season', 'City', 'Sport', 'Name', 'Event']

# Drop the listed columns, ignoring any that are not present
athlete_x = athlete_x.drop(columns = columns_to_drop, errors = 'ignore')

        
# Step 3: One Hot Encode the categorical variables

# Handle categorical data using one-hot encoding
# Instantiate the OneHotEncoder with dense output
# Note: scikit-learn >= 1.2 names this argument `sparse_output`; older versions use `sparse`
encoder = OneHotEncoder(sparse_output = False)

# 'Sex' is the only remaining categorical column in athlete_x
categorical_columns = ['Sex'] 
categorical_data = encoder.fit_transform(athlete_x[categorical_columns])

# Create meaningful column names for the one-hot encoded column
# Note: `get_feature_names_out` replaces the older `get_feature_names`
columns = encoder.get_feature_names_out(categorical_columns)  
categorical_df = pd.DataFrame(categorical_data, columns = columns)

# Reset the index on the original DataFrame to ensure alignment
athlete_x.reset_index(drop = True, inplace = True)

# Drop the original categorical columns in athlete_x
athlete_x = athlete_x.drop(categorical_columns, axis = 1)


# Step 4: Change the NA values to the mean value of that column in athlete_x

numeric_columns = athlete_x.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

# Perform the calculation and convert back to DataFrame to maintain column names
athlete_x_numeric = pd.DataFrame(imputer.fit_transform(athlete_x[numeric_columns]), columns = numeric_columns)

# Concatenate the numeric DataFrame and the one-hot encoded categorical DataFrame
athlete_x = pd.concat([athlete_x_numeric, categorical_df], axis = 1)

# Convert the DataFrame to a Numpy array
data_array = athlete_x.values



# Instantiate and apply a scaler so every feature has zero mean and unit variance,
# which helps gradient descent converge more reliably
scaler = StandardScaler()
data_array_scaled = scaler.fit_transform(data_array)

# Save the arrays to a .npy file
# Create the sample array
np.save('athlete_events_processed_x.npy', data_array_scaled)

# Create the target array
medals_array = athlete_y.values
np.save('athlete_events_processed_y.npy', medals_array)

## Part 2: Load Numpy Arrays For Learning

Now that the data has been preprocessed, it is split into training and test sets for the model. The first 80% of the rows form the training set (`119999` samples) and the remaining rows form the test set (`30000` samples), which together make up the `149999` samples in the dataset.
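
The split sizes follow directly from truncating 80% of the row count. The sketch below reproduces that arithmetic on a stand-in array; `sequential_split` is a hypothetical helper name, and only the number of rows is taken from the notebook.

```python
import numpy as np

def sequential_split(data, train_fraction=0.8):
    """Split rows in order: the first `train_fraction` go to train, the rest to test."""
    training_size = int(data.shape[0] * train_fraction)
    return data[:training_size], data[training_size:]

# Stand-in for the feature matrix: only the number of rows matters here
rows = np.arange(149999)
train_rows, test_rows = sequential_split(rows)
print(len(train_rows), len(test_rows))  # 119999 30000
```

Because the rows are taken in file order rather than shuffled, this kind of split implicitly assumes that the order of the file is not related to the target.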

In [31]:
# Step 5: Define the vars for the learning model

# Load the sample and target arrays from their respective files
X = np.load('athlete_events_processed_x.npy')
y = np.load('athlete_events_processed_y.npy')

# Split data into training and test sets
training_size = int(X.shape[0] * 0.8)

# Create train and test sets
X_train = X[:training_size]
X_test = X[training_size:]

y_train = y[:training_size]
y_test = y[training_size:]

# Print the size of the train and test sets
print('X Train and test set sizes:', len(X_train), len(X_test))
print('Y Train and test set sizes:', len(y_train), len(y_test))

X Train and test set sizes: 119999 30000
Y Train and test set sizes: 119999 30000


## Part 3: Create The Necessary Tools For Learning

To build the model, class weights are used to counteract the class imbalance in the target, since the positive class is far less common than the negative class. Sklearn's `compute_class_weight` function computes balanced weights from the `y_train` array.
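
For reference, the `'balanced'` setting of `compute_class_weight` is documented as `n_samples / (n_classes * np.bincount(y))`. The sketch below re-implements that formula in plain NumPy on made-up labels; `balanced_class_weights` is a hypothetical helper, not part of Sklearn.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {label: weight} with weight = n_samples / (n_classes * count(label))."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical imbalanced labels: eight negatives and two positives
y_toy = np.array([0] * 8 + [1] * 2)
toy_weights = balanced_class_weights(y_toy)
print(toy_weights)  # {0: 0.625, 1: 2.5}
```

The rarer class receives the larger weight, so mistakes on it contribute more to the loss during training.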

Then the Gaussian mechanism is run on the `X_train` dataset. Since this is the data passed to the learning model, noise is added to it before training. The vectorized version of the function, `gaussian_mech_vec`, is used here. This single application uses ε = 1.0 and δ = 1e-5 with a sensitivity of 1, for a total privacy cost of ε = 1.
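
To get a feel for how much noise these parameters imply, the sketch below computes the noise standard deviation from the same formula used in `gaussian_mech_vec` and applies it to a made-up array of zeros. The helper names `gaussian_sigma` and `add_gaussian_noise` are hypothetical, and the seeded generator is only there to make the sketch repeatable.

```python
import numpy as np

def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise standard deviation: sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon

def add_gaussian_noise(matrix, sensitivity, epsilon, delta, rng):
    """Add independent Gaussian noise to every entry of a 2D array."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return matrix + rng.normal(loc=0.0, scale=sigma, size=matrix.shape)

# Parameters taken from the notebook: sensitivity 1, epsilon 1.0, delta 1e-5
sigma = gaussian_sigma(sensitivity=1, epsilon=1.0, delta=1e-5)

# Hypothetical input: an all-zero matrix, so the output is pure noise
rng = np.random.default_rng(0)
toy = np.zeros((1000, 3))
noisy = add_gaussian_noise(toy, sensitivity=1, epsilon=1.0, delta=1e-5, rng=rng)

print(round(float(sigma), 2))  # about 4.84
print(noisy.shape)
```

Since the features were standardized to unit variance, a noise standard deviation of about 4.84 is several times larger than the spread of the features themselves, which is one plausible reason for the model's modest scores.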

The `create_model` function is called in Part 4. It uses **Stochastic Gradient Descent** with a **Logistic Regression** loss to train the model on the noised `X_train` data and the `y_train` labels. `max_iter` is set to 10 due to computational limitations. This value can be adjusted based on available resources and required model performance.

In [33]:
# Step 6: Define the learning model with Stochastic Gradient Descent

# Runs for a small number of iterations (max_iter) due to computational limitations, could theoretically run for more
# Uses logistic regression

# For the class weight later, imported here to not interfere with previous code
from sklearn.utils.class_weight import compute_class_weight

# Reshape y_train to be a 1D array
y_train = y_train.ravel()

# Print the reshaped y_train's type and shape
print("Reshaped y_train type:", type(y_train))
print("Reshaped y_train shape:", y_train.shape)

# Now compute class weights to counteract class imbalance when training
try:
    class_weights = compute_class_weight(
        class_weight = 'balanced', 
        classes = np.unique(y_train), 
        y = y_train
    )
    
    print("Class weights computed successfully.")
    
except Exception as e:
    print("Error computing class weights:", e)

# Ensure that class labels are mapped correctly to their weights
class_labels = np.unique(y_train)
class_weight_dict = {class_labels[i]: class_weights[i] for i in range(len(class_labels))}


# Make sure X_train is a 2D numpy array
X_train_array = np.array(X_train)  

# Apply differential privacy noise to the training data before passing the information to the model
epsilon = 1.0  # Epsilon (privacy budget) for the Gaussian mechanism
delta = 1e-5   # Delta parameter for Gaussian mechanism
X_train_noisy = gaussian_mech_vec(X_train_array, sensitivity = 1, epsilon = epsilon, delta = delta)


def create_model():
    # Uses logistic regression ('log_loss' in scikit-learn >= 1.1; older versions call this loss 'log')
    # max_iter is set to 10 due to computational limits; a larger value is likely to work better
    # Uses Stochastic Gradient Descent because the mini-batch gradient descent defined below relies on partial_fit
    model = SGDClassifier(loss='log_loss', max_iter = 10, tol = 1e-3, class_weight = class_weight_dict)
    
    return model

Reshaped y_train type: <class 'numpy.ndarray'>
Reshaped y_train shape: (119999,)
Class weights computed successfully.


## Part 4: Using The Mini-Batching Technique

The implementation of **Mini-Batching Gradient Descent** uses a batch size of 32. This was found to be the best trade-off between computational efficiency and the model's ability to generalize given the training data.

1. The `create_model` function is called to instantiate a **Logistic Regression** model using **Stochastic Gradient Descent**.
2. The model is trained over a number of iterations (10, taken from `model.max_iter`). In each iteration the entire training set is passed through the model in mini-batches.
    - At the beginning of each loop the indices of the training data are shuffled. This shuffling is integral to mini-batch training because it fills each batch with random samples from the dataset, so the model does not see the data in the same order every iteration and cannot simply learn that order.
    - The training data is then divided into mini-batches. For each batch, a subset of the data is selected based on the shuffled indices. This subset includes both the features in `X_batch` and the labels in `y_batch`.
    - Each mini-batch is then used to partially fit the model with the `partial_fit` method. This reduces the computational cost of each update.
3. The model is then evaluated using Sklearn's `classification_report` function, which prints the precision, recall, and f1-score for each class. Finally, `predict_proba` is used to find the athlete in the test set with the highest predicted probability of winning a gold medal, and the saved names array is used to identify that athlete.
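
The shuffling and slicing in step 2 can be sketched on index arrays alone. `minibatch_indices` is a hypothetical helper written for illustration; only the training set size (119999) and the batch size (32) come from the notebook.

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield index arrays that cover a fresh shuffle of range(n_samples) once."""
    shuffled = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        # The final slice may be shorter than batch_size
        yield shuffled[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_indices(n_samples=119999, batch_size=32, rng=rng))

# Number of partial_fit calls per pass over the data, and the size of the last batch
print(len(batches), len(batches[-1]))  # 3750 31
```

Every training sample appears in exactly one batch per pass, so each pass over the data corresponds to 3750 calls to `partial_fit`.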

In [34]:
# Step 7: Run the mini-batching

from sklearn.metrics import classification_report

# Define the batch size for mini-batch training
batch_size = 32  

# Create the logistic regression model using SGD with mini-batch learning
model = create_model()

# Perform mini-batch training with noised data
for epoch in range(model.max_iter):
    # Create shuffled indices
    shuffled_indices = np.random.permutation(len(X_train_noisy))
    
    # Run through the noisy dataset
    for i in range(0, len(X_train_noisy), batch_size):
        # Find the batch indices
        batch_indices = shuffled_indices[i:i + batch_size]
        
        # Create the batches for X and y respectively using the batch indices found above
        X_batch = X_train_noisy[batch_indices]
        y_batch = y_train[batch_indices]
        
        # Fit the new batches to the model
        model.partial_fit(X_batch, y_batch, classes = np.array([0, 1]))

        
# Step 8: Evaluate the model's performance

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Predict the probability of each athlete in the test set winning a gold medal
probabilities = model.predict_proba(X_test)[:, 1]
max_prob_index = np.argmax(probabilities)
prob = probabilities[max_prob_index]
prob_str = f"{prob:.4f}" if prob < 1 else "1.0000 (very high confidence)"
print(f"Athlete most likely to win gold: {names_array[training_size:][max_prob_index]} with probability {prob_str}")

              precision    recall  f1-score   support

           0       0.96      0.78      0.86     28502
           1       0.07      0.33      0.12      1498

    accuracy                           0.76     30000
   macro avg       0.51      0.55      0.49     30000
weighted avg       0.91      0.76      0.82     30000

Athlete most likely to win gold: Max Liebermann with probability 0.9419


## Conclusion

The model, given its current parameters, identifies the athlete it considers most likely to win gold based on the given columns, but its predictive quality is modest: overall test accuracy is 76%, while the macro-averaged precision and recall are only about 0.51 and 0.55. A limitation of this implementation is that it is very computationally intensive. Reaching a result of 80% or higher, which would be ideal, would likely require considerably more computational power to increase the number of iterations. Applying the Gaussian mechanism to the training features just before training is intended to provide differential privacy for the model's output: the model only ever sees noised training data, which limits how much it can memorize about any individual record.