# CSE 144 Spring 2023 Assignment 1

This first assignment of the course has 10 questions worth 6 points each, for a total of 60 points.

In this exciting assignment, you'll get the chance to train a logistic regression model and your very own neural network on a real-world dataset! But that's not all – you'll also practice your skills in interacting with Kaggle, the leading platform for machine learning competitions. Get ready to dive in and unleash your data science skills!

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format="retina"
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import random
import torch
from torch import nn, optim
import math
from IPython import display


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data preparation

**Dataset: Customer Churn Prediction 2020**

To run this notebook, first download the Customer Churn Prediction 2020 dataset from the [competition page](https://www.kaggle.com/competitions/customer-churn-prediction-2020/overview). Then upload the train.csv and test.csv files to Colab so the notebook can access them.

**Load train and test datasets from CSV files**

In [None]:
train_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/train.csv")
test_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/test.csv")

In [None]:
train_df.head()

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
0,OH,107,area_code_415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,no
1,NJ,137,area_code_415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,no
2,OH,84,area_code_408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,no
3,OK,75,area_code_415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,no
4,MA,121,area_code_510,no,yes,24,218.2,88,37.09,348.5,108,29.62,212.6,118,9.57,7.5,7,2.03,3,no


In [None]:
train_df.describe()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls
count,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0,4250.0
mean,100.236235,7.631765,180.2596,99.907294,30.644682,200.173906,100.176471,17.015012,200.527882,99.839529,9.023892,10.256071,4.426353,2.769654,1.559059
std,39.698401,13.439882,54.012373,19.850817,9.182096,50.249518,19.908591,4.271212,50.353548,20.09322,2.265922,2.760102,2.463069,0.745204,1.311434
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,73.0,0.0,143.325,87.0,24.365,165.925,87.0,14.1025,167.225,86.0,7.5225,8.5,3.0,2.3,1.0
50%,100.0,0.0,180.45,100.0,30.68,200.7,100.0,17.06,200.45,100.0,9.02,10.3,4.0,2.78,1.0
75%,127.0,16.0,216.2,113.0,36.75,233.775,114.0,19.8675,234.7,113.0,10.56,12.0,6.0,3.24,2.0
max,243.0,52.0,351.5,165.0,59.76,359.3,170.0,30.54,395.0,175.0,17.77,20.0,20.0,5.4,9.0


**Setting a Seed for Random Number Generators in Python, NumPy, and PyTorch**

In [None]:
seed = 1
random.seed(seed)
np.random.seed(seed)  # the shuffle below uses np.random.permutation
torch.manual_seed(seed)

<torch._C.Generator at 0x7fbbbdf65670>

**Shuffle the training data**

In [None]:
# Return the first five elements of any iterable as a list
def show_first_five_elements(items):
  return list(items)[:5]

train_df_index = train_df.index
print(f"Initial index of train_df: {show_first_five_elements(train_df_index)}")

shuffled_index = np.random.permutation(train_df.index)
print(f"Shuffled index: {show_first_five_elements(shuffled_index)}")

train_df = train_df.reindex(shuffled_index)
print(f"Shuffled index of train_df: {show_first_five_elements(train_df.index)}")

print("\nThe examples in train_df are shuffled:")
train_df.head()

Initial index of train_df: [0, 1, 2, 3, 4]
Shuffled index: [1844, 1920, 3727, 3964, 3496]
Shuffled index of train_df: [1844, 1920, 3727, 3964, 3496]

The examples in train_df are shuffled:


Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
1844,CA,120,area_code_510,no,no,0,134.8,94,22.92,204.1,106,17.35,238.4,109,10.73,6.7,8,1.81,1,no
1920,WY,94,area_code_408,no,no,0,207.0,109,35.19,167.4,80,14.23,238.2,117,10.72,2.6,6,0.7,1,no
3727,VA,100,area_code_415,no,no,0,200.7,151,34.12,290.8,127,24.72,186.7,105,8.4,10.2,9,2.75,1,no
3964,WA,64,area_code_510,no,no,0,50.6,91,8.6,308.9,78,26.26,255.4,114,11.49,13.7,5,3.7,2,no
3496,ME,120,area_code_510,no,no,0,163.6,109,27.81,237.3,95,20.17,186.2,141,8.38,12.3,5,3.32,0,no


**Selecting features for training and testing**

In [None]:
X = train_df[['state', 'account_length', 'area_code', 'international_plan',
       'voice_mail_plan', 'number_vmail_messages', 'total_day_minutes',
       'total_day_calls', 'total_day_charge', 'total_eve_minutes',
       'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
       'total_night_calls', 'total_night_charge', 'total_intl_minutes',
       'total_intl_calls', 'total_intl_charge',
       'number_customer_service_calls']]
Y = train_df['churn']

X_test = test_df[['state', 'account_length', 'area_code', 'international_plan',
       'voice_mail_plan', 'number_vmail_messages', 'total_day_minutes',
       'total_day_calls', 'total_day_charge', 'total_eve_minutes',
       'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
       'total_night_calls', 'total_night_charge', 'total_intl_minutes',
       'total_intl_calls', 'total_intl_charge',
       'number_customer_service_calls']]

test_id = test_df['id']

**Check missing data in the train and test datasets**

In [None]:
print(f"Count of missing values in the train_df: {train_df.isnull().sum().sum()}")
print(f"Count of missing values in the test_df: {test_df.isnull().sum().sum()}")

Count of missing values in the train_df: 0
Count of missing values in the test_df: 0


**Replace 'no' with 0 and 'yes' with 1 in the Y DataFrame**

Question 1: In order to prepare Y for future training purposes, consider converting the current string values of 'no' and 'yes' into binary integers, 0 and 1, respectively. Hint: This can be accomplished by using the replace function.

In [None]:
###### Your codes start here.######
Y = Y.replace({'no': 0, 'yes': 1})
###### Your codes end here.######

# Convert Y (a pandas Series) to integer type
Y = Y.astype(int)




Question 2: Split X and Y into a training set and a validation set. Hint: try the train_test_split function.

In [None]:
###### Your codes start here.######
X_train, X_val, Y_train, Y_val = train_test_split(X, Y)
###### Your codes end here.######
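
As a hedged aside, train_test_split also accepts optional arguments that make the split reproducible and keep the class balance similar in both parts. The sketch below uses toy stand-ins (`X_toy`, `Y_toy` are hypothetical, not the churn data); the 0.25 validation fraction matches the function's default, and `random_state=1` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and Y (hypothetical data, not the churn dataset).
X_toy = pd.DataFrame({"feature": np.arange(100)})
Y_toy = pd.Series([1] * 20 + [0] * 80)

# random_state fixes the shuffle; stratify keeps the label ratio in both splits.
X_tr, X_va, Y_tr, Y_va = train_test_split(
    X_toy, Y_toy, test_size=0.25, random_state=1, stratify=Y_toy
)

print(len(X_tr), len(X_va))      # sizes of the training and validation splits
print(Y_tr.mean(), Y_va.mean())  # positive-class rate in each split
```

Whether stratification matters for this assignment depends on how imbalanced the churn labels are; it is shown here only as an option.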

In [None]:
X_train.head()

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls
1082,UT,144,area_code_415,no,no,0,139.6,96,23.73,124.2,93,10.56,95.6,75,4.3,15.0,4,4.05,2
2853,MD,68,area_code_408,no,no,0,150.4,106,25.57,199.0,131,16.92,109.7,103,4.94,12.3,3,3.32,3
3184,MS,173,area_code_415,no,no,0,82.1,75,13.96,201.2,95,17.1,206.7,146,9.3,9.6,2,2.59,1
3500,ID,130,area_code_510,no,no,0,140.9,68,23.95,217.5,109,18.49,123.6,96,5.56,13.6,4,3.67,3
1784,WI,111,area_code_415,no,no,0,246.5,108,41.91,216.3,89,18.39,179.6,99,8.08,12.7,3,3.43,2


**Convert categorical features in a DataFrame to one-hot encoding**

In [None]:
def convert_features_to_one_hot(df, feature_name_list):
  for feature_name in feature_name_list:
    df = pd.get_dummies(df, columns=[feature_name])

  return df

Question 3: Create a list that contains all categorical features in the dataset, then use the convert_features_to_one_hot function to transform the categorical features into one-hot encoded representations.

In [None]:
# List of categorical features to be one-hot encoded
###### Your codes start here.######
features = ['state', 'area_code', 'international_plan', 'voice_mail_plan']
X_train = convert_features_to_one_hot(X_train, features)
X_test = convert_features_to_one_hot(X_test, features)
X_val = convert_features_to_one_hot(X_val, features)
###### Your codes end here.######
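
One caveat with encoding each split separately: if a category value is missing from one split, that split ends up with fewer columns than the training set, and the model would then receive inputs of the wrong width. A minimal sketch of one way to guard against this, using small hypothetical frames (`train_toy`, `val_toy`) rather than the assignment data:

```python
import pandas as pd

# Toy frames (hypothetical): the validation split lacks the state "WY".
train_toy = pd.DataFrame({"state": ["CA", "WY", "OH"], "calls": [1, 2, 3]})
val_toy = pd.DataFrame({"state": ["OH", "CA"], "calls": [4, 5]})

train_enc = pd.get_dummies(train_toy, columns=["state"])
val_enc = pd.get_dummies(val_toy, columns=["state"])

# Align to the training columns; categories absent from val become all-zero columns.
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(val_enc.columns))
```

In the notebook, the same `reindex` call could be applied to X_val and X_test with `X_train.columns`. Checking that the three shapes agree in their second dimension is a quick way to see whether it is needed.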

In [None]:
X_train.head()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,...,state_WI,state_WV,state_WY,area_code_area_code_408,area_code_area_code_415,area_code_area_code_510,international_plan_no,international_plan_yes,voice_mail_plan_no,voice_mail_plan_yes
1082,144,0,139.6,96,23.73,124.2,93,10.56,95.6,75,...,0,0,0,0,1,0,1,0,1,0
2853,68,0,150.4,106,25.57,199.0,131,16.92,109.7,103,...,0,0,0,1,0,0,1,0,1,0
3184,173,0,82.1,75,13.96,201.2,95,17.1,206.7,146,...,0,0,0,0,1,0,1,0,1,0
3500,130,0,140.9,68,23.95,217.5,109,18.49,123.6,96,...,0,0,0,0,0,1,1,0,1,0
1784,111,0,246.5,108,41.91,216.3,89,18.39,179.6,99,...,1,0,0,0,1,0,1,0,1,0


**Save Predicted Churn Values to CSV File**

In [None]:
def save_prediction_to_csv_file(y_test_pred, file_name):
  # Convert predicted churn values to 'yes' and 'no'
  churn_list = ['no' if pred == 0 else 'yes' for pred in y_test_pred]

  # Create a DataFrame containing the IDs and predicted churn values
  submission_df = pd.DataFrame({'id': test_df['id'], 'churn': churn_list})

  # Save the DataFrame to a CSV file
  submission_df.to_csv(file_name, index=False)

# Logistic regression

**Fit logistic regression model on training data**

Question 4: Train a logistic regression model using X_train and Y_train with an L2 penalty. Hint: Check LogisticRegression from sklearn.

In [None]:
###### Your codes start here.######
LR_model = LogisticRegression(penalty='l2', max_iter=5000)
LR_model.fit(X_train, Y_train)
###### Your codes end here.######

**Make predictions on validation data and print validation accuracy**

Question 5: Use the predict function to generate predictions on the validation dataset, then compute the accuracy of those predictions against the true labels.

In [None]:
###### Your codes start here.######
from sklearn.metrics import accuracy_score

y_val_pred = LR_model.predict(X_val)
val_accuracy = accuracy_score(Y_val, y_val_pred)

print("Validation Accuracy:", val_accuracy)
###### Your codes end here.######

Validation Accuracy: 0.8645343367826905
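
When reading an accuracy number, it can help to compare it with the accuracy of always predicting the most common class. The sketch below computes that baseline on hypothetical labels (`y_true` is made up for illustration); in the notebook the same expression could be applied to Y_val.

```python
import numpy as np

# Hypothetical labels: 1 = churn, 0 = no churn.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Accuracy of a model that always predicts the majority class.
positive_rate = y_true.mean()
baseline_accuracy = max(positive_rate, 1 - positive_rate)

print(baseline_accuracy)
```

A model is only informative to the extent that its validation accuracy exceeds this baseline.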


**Make predictions on test data and save to CSV file**

In [None]:
y_test_pred = LR_model.predict(X_test)
save_prediction_to_csv_file(y_test_pred, "submission_lr.csv")

After obtaining the submission file, submission_lr.csv, upload it to the competition's [Kaggle submission page](https://www.kaggle.com/competitions/customer-churn-prediction-2020/submissions) to see how your model performs on the competition's test set.

# Neural Network

**Set device to GPU if available, otherwise use CPU**

Question 6: Prior to executing the code, verify if a GPU is available and if so, assign it to a variable named device. Otherwise, utilize the CPU for computation. This can be accomplished by using the torch.device function.

In [None]:
###### Your codes start here.######
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
###### Your codes end here.######

**Convert data to PyTorch tensors and move to device**

In [None]:
# Convert training data to PyTorch tensors and move to device
X_train = torch.tensor(X_train.values, dtype=torch.float32).to(device)
Y_train = torch.tensor(Y_train.values, dtype=torch.long).to(device)

# Convert validation data to PyTorch tensors and move to device
X_val = torch.tensor(X_val.values, dtype=torch.float32).to(device)
Y_val = torch.tensor(Y_val.values, dtype=torch.long).to(device)

# Convert test data to PyTorch tensors and move to device
X_test = torch.tensor(X_test.values, dtype=torch.float32).to(device)

**Define network architecture and hyperparameters**

In [None]:
D = X_train.shape[1]  # feature dimensions
C = 2  # num_classes
H = 6  # num_hidden_units

# Set learning rate and regularization strength
learning_rate = 1e-3
lambda_l2 = 1e-3

# Set number of training epochs
epochs = 27000

**Train a Neural Network with PyTorch**

Question 7:

Each of the three parts below is worth 2 points.

1. Construct your initial neural network architecture, consisting of three fully-connected layers with ReLU activation functions inserted between each pair of fully-connected layers. The network should consist of D input feature dimensions, H hidden units in each layer, and C output classes. Hint: Use nn.Sequential function.

2. Implement the cross-entropy loss function to calculate the loss between predicted and ground-truth labels. Hint: use the CrossEntropyLoss function provided by PyTorch.

3. Use the Adam optimizer to update the model's parameters during training. Set the weight decay parameter to lambda_l2.

Notice: Since the CrossEntropyLoss function already applies a softmax (internally, a log-softmax) to the model outputs, there's no need to add a softmax or sigmoid layer at the end of your model. CrossEntropyLoss combines the log-softmax activation and the negative log-likelihood loss into a single function, so it expects raw, unnormalized scores (logits).
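
As a small check of what CrossEntropyLoss computes, the sketch below compares it with an explicit log-softmax followed by the negative log-likelihood loss. The logits and targets are random toy values, not outputs of the assignment model.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)            # raw model outputs: 4 examples, 2 classes
targets = torch.tensor([0, 1, 1, 0])  # integer class labels

# Built-in loss applied directly to the raw logits.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity assembled by hand: log-softmax, then negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_ce, loss_manual))
```

Because the normalization happens inside the loss, the last layer of the network can stay a plain nn.Linear(H, C).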

In [None]:
###### 1. Your codes start here.######
# Define the neural network architecture
class LinearModel(nn.Module):
    def __init__(self, D, H, C):
        super(LinearModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(D, H),
            nn.ReLU(),
            nn.Linear(H, H),
            nn.ReLU(),
            nn.Linear(H, C)
        )

    def forward(self, x):
        out = self.layers(x)
        return out

model = LinearModel(D, H, C)
###### Your codes end here.######
model.to(device)

# nn package has different loss functions.
# we use cross entropy loss for our classification task
###### 2. Your codes start here.######
loss_fn = nn.CrossEntropyLoss()
###### Your codes end here.######

# we use the optim package to apply
# ADAM for our parameter updates
###### 3. Your codes start here.######
optimizer = optim.Adam(model.parameters(), lr=learning_rate, weight_decay=lambda_l2)
###### Your codes end here.######

Question 8: Train your neural network model using a for loop. For guidance on how to write the training code, refer to [this link](https://github.com/yizuc/CSE144/blob/main/05-regression.ipynb) for an example implementation.

In [None]:
# Training
for t in range(epochs):
    ###### Your codes start here.######
    model.train()
    optimizer.zero_grad()

    # Forward pass
    y_train_pred = model(X_train)
    loss = loss_fn(y_train_pred, Y_train)

    # Backward pass
    loss.backward()

    # Update weights
    optimizer.step()

# Compute and print training accuracy
y_train_pred = model(X_train)
_, predicted_train = torch.max(y_train_pred, 1)
train_acc = (Y_train == predicted_train).sum().float() / len(Y_train)
print(f"Neural network model training accuracy: {train_acc}")
###### Your codes end here.######

Neural network model training accuracy: 0.9444618821144104


**Evaluate neural network model on validation data**

Question 9: Explain the purpose of using the with torch.no_grad() context manager in the following code implementation.

In [None]:
with torch.no_grad():
  y_val_pred = model(X_val)
  _, predicted_val = torch.max(y_val_pred, 1)
  val_acc = (Y_val == predicted_val).sum().float() / len(Y_val)

  # Print validation accuracy
  print(f"Neural network model validation accuracy: {val_acc}")

Neural network model validation accuracy: 0.9096895456314087


The 'with torch.no_grad()' statement disables gradient tracking: inside the block, PyTorch does not build the computation graph or keep the intermediate values that a backward pass would need. This reduces memory usage and speeds up computation. In the provided code snippet, the validation set is used only to evaluate the trained model, so no gradients are required and no parameters are updated. Inside the 'with' statement, the predicted values are computed, and the accuracy of the predictions is calculated and printed.
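
The effect can be observed directly on a tiny example. The layer and input below are hypothetical stand-ins, not the assignment model.

```python
import torch
from torch import nn

layer = nn.Linear(3, 2)  # hypothetical tiny layer, not the assignment model
x = torch.randn(5, 3)

# Normal forward pass: autograd records the operations.
out_tracked = layer(x)

# Forward pass under no_grad: same values, but nothing is recorded.
with torch.no_grad():
    out_untracked = layer(x)

print(out_tracked.requires_grad)    # True
print(out_untracked.requires_grad)  # False
print(torch.allclose(out_tracked, out_untracked))
```

The outputs agree numerically; only the bookkeeping for backpropagation differs.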






**Evaluate neural network model on test data**

In [None]:
with torch.no_grad():
  y_test_pred = model(X_test)
  _, predicted_test = torch.max(y_test_pred, 1)

**Save predicted test values to CSV file**

In [None]:
save_prediction_to_csv_file(predicted_test, "submission_nn.csv")

After generating the submission_nn.csv file, upload it to the competition's [Kaggle submission page](https://www.kaggle.com/competitions/customer-churn-prediction-2020/submissions) to assess the performance of the neural network model. You can then compare its score with that of the logistic regression model and determine which performs better.

Question 10: Adjust the value of H in your neural network architecture from 6 to 40, and retrain the model. Then, answer the following questions, each of which carries a score of 2:

1. How does the training accuracy change after increasing H?
2. How does the validation accuracy change after increasing H?
3. What is the likely explanation for the observed changes in validation accuracy after increasing H?

1. The training accuracy increases noticeably after increasing H from 6 to 40, from ~0.93 to ~0.98. With more hidden units, the model has more capacity and can fit the training data more closely.
2. The validation accuracy decreases slightly after increasing H from 6 to 40, from ~0.9 to ~0.88. This suggests that the larger model with more hidden units is overfitting the training data and is not generalizing as well to the validation data.
3. The likely explanation for the observed decrease in validation accuracy is overfitting, where the larger model with more hidden units is fitting the noise in the training data, rather than the underlying patterns. This causes the model to perform well on the training data but not on the validation data.
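
To give a rough sense of how much capacity the change adds, the sketch below counts the trainable parameters of the three-layer architecture for both values of H. The input width `D_example = 73` is an assumption for illustration only; the actual value in the notebook is X_train.shape[1].

```python
def count_parameters(d, h, c):
    """Weights plus biases for Linear(d, h) -> Linear(h, h) -> Linear(h, c)."""
    return (d * h + h) + (h * h + h) + (h * c + c)

D_example = 73  # assumed input width for illustration; use X_train.shape[1] in the notebook

for h in (6, 40):
    print(h, count_parameters(D_example, h, 2))
```

Under that assumption the count grows from 500 to 4682 parameters, roughly a ninefold increase, while the training set has 4250 rows before the validation split.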