
# Challenge: K-Fold Cross Validation
Using this code, revise the Logistic Regression training procedure to apply 10-fold cross validation instead of a single train/test split. You will be using the same dataset as the previous assignment, and an abbreviated preprocessing pipeline has been provided.

For this assignment, our goal is to train a model to predict whether students are likely to complete their next math assignment.

## Dataset

This assignment uses student data collected from the ASSISTments digital learning platform. The dataset contains data from 18,345 unique students across 37,268 assignments. For each assignment, our goal is to predict whether the student will complete it using their past performance and behavior in the system. Several student-level features have been aggregated for each student up to the moment they were given the target assignment, and we want to examine whether this information can be used to predict their future performance (as measured by completion of the next assignment).

**The dataset can be downloaded from Canvas or using this direct link:
[ASSISTments Assignments Dataset](https://drive.google.com/file/d/1oEEKSkDPn8RBM6oa9eNmXYIFBmxqFQBm/view?usp=sharing)**

The dataset used in this assignment has been sampled from a larger publicly-available dataset published in the International Conference on Educational Data Mining:

[Prihar, E., Syed, M., Ostrow, K., Shaw, S., Sales, A., & Heffernan, N. (2022, July). Exploring common trends in online educational experiments. *In Proceedings of the 15th International Conference on Educational Data Mining.*](https://educationaldatamining.org/edm2022/proceedings/2022.EDM-long-papers.3/index.html)

**A description of each column and further context for the data can be found in that paper.**

# Data Loading and Preprocessing
Download the **assistments_control_assignments_with_priors.csv** file from Canvas or the link above. Run the first code cell below to upload the dataset. The second code cell uses the pandas library to read the file into a DataFrame, drops rows with missing values, and displays the number of rows and columns along with a sample of the loaded data.



In [None]:
from google.colab import files
dataset = files.upload()
filename = list(dataset.keys())[0]
print(f"{filename} has been uploaded")

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv(filename)

# Drop rows with missing values
df = df.dropna()

# Print the shape of the dataset
print("\nShape of the dataset (rows, columns):", df.shape)

# Display a sample of the dataset
df


## Descriptive Statistics

As in the previous assignment, the code below lists all the features that were considered there, along with a subset of features with low correlation. Feel free to replace these features with the set that you identified in the previous assignment.

In [None]:
# The list of all potential features we can utilize
# Note: class- and teacher-level variables are omitted for this assignment,
#       but these may be useful in other contexts
all_features = ['0_student_prior_skill_builders',
                      '0_student_prior_problem_sets',
                      '0_student_prior_attempted_problems',
                      '0_student_prior_completed_problems',
                      'student_prior_started_skill_builder_count',
                      'student_prior_skill_builder_percent_completed',
                      'student_prior_started_problem_set_count',
                      'student_prior_problem_set_percent_completed',
                      'student_prior_completed_problem_count',
                      'student_prior_median_first_response_time',
                      'student_prior_median_time_on_task',
                      'student_prior_average_attempt_count',
                      'student_prior_average_correctness',
                      'opportunity_zone']

# The following represents one possible subset of low-correlation features
low_corr_features = ['0_student_prior_completed_problems',
                      'student_prior_started_skill_builder_count',
                      'student_prior_skill_builder_percent_completed',
                      'student_prior_problem_set_percent_completed',
                      'student_prior_completed_problem_count',
                      'student_prior_median_first_response_time',
                      'student_prior_average_attempt_count',
                      'student_prior_average_correctness',
                      'opportunity_zone']

dependent_variable = "assignment_completed"
df_features = df[all_features]

# Display the means and standard deviations
stats = {}
stats['Mean'] = df_features.mean(numeric_only=True)
stats['SD'] = df_features.std(numeric_only=True)

feature_statistics = pd.DataFrame(stats)

print(f"The dataset contains a total of {len(df)} rows and {len(df.columns)} columns")
print(f"{len(all_features)} potential feature columns have been identified:")
print(feature_statistics.round(2))
print("==========")
print(f"Dependent Measure: {dependent_variable}")
print(f"{round(df[dependent_variable].mean()*100,2)}% of the {len(df)} total assignments were completed")
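
The low-correlation subset above is given as a fixed list. As a rough illustration of how such a subset might be derived, the sketch below drops any feature whose absolute pairwise correlation with an earlier feature exceeds a threshold. The data, the column names (`feat_a`, `feat_b`, `feat_c`), the helper `drop_highly_correlated`, and the 0.8 threshold are all hypothetical and are not taken from the assignment; in the notebook you would pass `df[all_features]` instead of the synthetic frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): feat_b is nearly a copy of feat_a
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "feat_a": base,
    "feat_b": base * 2 + rng.normal(scale=0.01, size=200),
    "feat_c": rng.normal(size=200),
})

def drop_highly_correlated(frame, threshold=0.8):
    """Return the columns left after removing one of each highly correlated pair."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # A column is dropped if it correlates strongly with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in frame.columns if col not in to_drop]

kept = drop_highly_correlated(demo)
print(kept)
```

Which member of a correlated pair gets dropped depends on column order here, so the result is one possible subset rather than a unique answer.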

# Logistic Regression

With our set of selected features, your goal is to modify the code below to implement K-fold cross validation instead of a single train/test split. Refer to the ASSISTments assignment for hints on how to get started if you are unsure.

In [None]:
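from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split


# Features used by the model; defaults to the low-correlation list above.
# Replace with all_features or the set you identified in the previous assignment.
selected_features = low_corr_features

# Create variables for our dependent variable and the set of selected features
X = df[selected_features]
y = df[dependent_variable]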

# Split our data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and train the logistic regression model with sklearn
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# Predictions using sklearn model
y_pred_sklearn = lr_model.predict(X_test)
y_pred_proba_sklearn = lr_model.predict_proba(X_test)[:, 1]

# Model performance metrics using sklearn predictions
accuracy = accuracy_score(y_test, y_pred_sklearn)
kappa = cohen_kappa_score(y_test, y_pred_sklearn)
auc = roc_auc_score(y_test, y_pred_proba_sklearn)

# Create a DataFrame to store performance metrics
performance = pd.DataFrame({'Metric': ['Accuracy', 'Kappa', 'AUC'], 'Value': [accuracy, kappa, auc]})

# Display performance metrics
print("Performance Metrics using scikit-learn:")
print(performance)
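
The cell above evaluates the model on a single 80/20 split. As a starting point for the challenge, here is a minimal sketch of 10-fold cross validation with scikit-learn's `KFold`, assuming synthetic data from `make_classification` in place of the ASSISTments file so that it runs on its own. The names `X_demo`, `y_demo`, `fold_scores`, and `cv_summary` are illustrative, and `max_iter=1000` is an assumption added to reduce convergence warnings. It is one way to approach the task, not the expected solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption); in the notebook you would use
# df[selected_features].to_numpy() and df[dependent_variable].to_numpy()
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=42)

# Shuffle once, then partition the rows into 10 folds
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = {"Accuracy": [], "Kappa": [], "AUC": []}

for train_idx, test_idx in kfold.split(X_demo):
    # Train a fresh model on 9 folds and evaluate on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])

    pred = model.predict(X_demo[test_idx])
    proba = model.predict_proba(X_demo[test_idx])[:, 1]

    fold_scores["Accuracy"].append(accuracy_score(y_demo[test_idx], pred))
    fold_scores["Kappa"].append(cohen_kappa_score(y_demo[test_idx], pred))
    fold_scores["AUC"].append(roc_auc_score(y_demo[test_idx], proba))

# Summarize each metric by its mean and standard deviation across folds
cv_summary = {
    metric: (float(np.mean(values)), float(np.std(values)))
    for metric, values in fold_scores.items()
}
for metric, (mean, sd) in cv_summary.items():
    print(f"{metric}: mean={mean:.3f}, SD={sd:.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how much the estimate depends on which rows land in the test set. Because the same student can appear in more than one row of this dataset, a grouped splitter such as `GroupKFold` keyed on a student identifier may also be worth considering.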
