<a href="https://colab.research.google.com/github/kanyijohn/Deep-Learning-ML-models/blob/main/MAP_Charting_Student_Math_Misunderstandings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install kaggle



In [2]:
from google.colab import files
import json

# Upload kaggle.json (the API token downloaded from your Kaggle account settings)
uploaded = files.upload()

# Confirm the upload without echoing the secret key into the notebook output
for filename, content in uploaded.items():
    creds = json.loads(content.decode('utf-8'))
    print(f"Uploaded: {filename}")
    print(f"Fields present: {sorted(creds.keys())}")

Saving kaggle.json to kaggle.json
Uploaded: kaggle.json
Fields present: ['key', 'username']


In [3]:
!mkdir -p ~/.kaggle  # -p prevents error if dir exists
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json  # Restrict permissions

In [4]:
!kaggle competitions download -c map-charting-student-math-misunderstandings

Downloading map-charting-student-math-misunderstandings.zip to /content
  0% 0.00/815k [00:00<?, ?B/s]
100% 815k/815k [00:00<00:00, 866MB/s]


In [5]:
!unzip map-charting-student-math-misunderstandings.zip

Archive:  map-charting-student-math-misunderstandings.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


# Task
Create an NLP model for the "MAP - Charting Student Math Misunderstandings" competition using the data in "/content/MAP - Charting Student Math Misunderstandings.txt". Provide a step-by-step guide covering data loading, preprocessing, model selection, training, evaluation, prediction, and submission.

## Load and explore data

### Subtask:
Load the training and testing data into pandas DataFrames and perform initial exploration to understand the structure, content, and potential issues.


**Reasoning**:
Load the train and test CSV files into pandas DataFrames, display their head and info, and check for missing values as requested in the instructions.



In [6]:
import pandas as pd

# Load the datasets
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Display the first few rows of the training data
print("Train Data Head:")
display(train_df.head())

# Display basic info about the training data
print("\nTrain Info:")
train_df.info()

# Display the first few rows of the testing data
print("\nTest Data Head:")
display(test_df.head())

# Display basic info about the testing data
print("\nTest Info:")
test_df.info()

# Check for missing values
print("\nMissing values in Train Data:")
display(train_df.isnull().sum())

print("\nMissing values in Test Data:")
display(test_df.isnull().sum())

Train Data Head:


index,row_id,QuestionId,QuestionText,MC_Answer,StudentExplanation,Category,Misconception
0,0,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),0ne third is equal to tree nineth,True_Correct,
1,1,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),1 / 3 because 6 over 9 is 2 thirds and 1 third...,True_Correct,
2,2,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),"1 3rd is half of 3 6th, so it is simplee to un...",True_Neither,
3,3,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),1 goes into everything and 3 goes into nine,True_Neither,
4,4,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),1 out of every 3 isn't coloured,True_Correct,



Train Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36696 entries, 0 to 36695
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   row_id              36696 non-null  int64 
 1   QuestionId          36696 non-null  int64 
 2   QuestionText        36696 non-null  object
 3   MC_Answer           36696 non-null  object
 4   StudentExplanation  36696 non-null  object
 5   Category            36696 non-null  object
 6   Misconception       9860 non-null   object
dtypes: int64(2), object(5)
memory usage: 2.0+ MB

Test Data Head:


index,row_id,QuestionId,QuestionText,MC_Answer,StudentExplanation
0,36696,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),"I think that 1/3 is the answer, as it's the si..."
1,36697,31772,What fraction of the shape is not shaded? Give...,\( \frac{3}{6} \),i think this answer is because 3 triangles are...
2,36698,32835,Which number is the greatest?,\( 6.2 \),because the 2 makes it higher than the others.



Test Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   row_id              3 non-null      int64 
 1   QuestionId          3 non-null      int64 
 2   QuestionText        3 non-null      object
 3   MC_Answer           3 non-null      object
 4   StudentExplanation  3 non-null      object
dtypes: int64(2), object(3)
memory usage: 252.0+ bytes

Missing values in Train Data:


column,missing_count
row_id,0
QuestionId,0
QuestionText,0
MC_Answer,0
StudentExplanation,0
Category,0
Misconception,26836



Missing values in Test Data:


column,missing_count
row_id,0
QuestionId,0
QuestionText,0
MC_Answer,0
StudentExplanation,0
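
The `info()` output above shows that `Misconception` is populated in only 9,860 of 36,696 training rows, so it is worth checking how the labels are distributed before modelling. The following is a minimal sketch on a tiny made-up frame; the rows, the `Hypothetical_Category` value, and the `label_summary` helper are illustrative assumptions, and on the real data you would pass `train_df` instead.

```python
import pandas as pd

# Toy stand-in for train_df; these rows are made up for illustration only.
toy_df = pd.DataFrame({
    "Category": ["True_Correct", "True_Neither", "Hypothetical_Category", "True_Correct"],
    "Misconception": [None, None, "Adding_across", None],
})

def label_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per Category: total rows and rows that carry a Misconception label."""
    return (
        df.groupby("Category")
          .agg(rows=("Misconception", "size"),                 # size counts NaN rows too
               with_misconception=("Misconception", "count"))  # count skips NaN
          .sort_values("rows", ascending=False)
    )

summary = label_summary(toy_df)
print(summary)
```

A heavily skewed summary would suggest stratified splitting and a metric that is not dominated by the majority class.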


## Preprocessing and feature engineering

### Subtask:
Clean and preprocess the text data, and create relevant features for the NLP model. This may involve tokenization, stemming/lemmatization, and creating numerical representations of text.


**Reasoning**:
Fill missing values in the 'Misconception' column with a placeholder, combine the text columns, clean the combined text, and then tokenize, stem, and create TF-IDF features. An earlier run of this step failed because the NLTK resource 'punkt_tab' was not found, so the code block below downloads it with `nltk.download('punkt_tab')` before running the preprocessing.



In [7]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download necessary NLTK data including the missing resource
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# 1. Handle missing values in 'Misconception'
train_df['Misconception'] = train_df['Misconception'].fillna('No Misconception')

# 2. Combine relevant text columns
train_df['combined_text'] = train_df['QuestionText'] + ' ' + train_df['MC_Answer'] + ' ' + train_df['StudentExplanation']
test_df['combined_text'] = test_df['QuestionText'] + ' ' + test_df['MC_Answer'] + ' ' + test_df['StudentExplanation']

# 3. Clean the combined text data
def clean_text(text):
    text = text.lower()
    # Keep only lowercase letters, digits, and whitespace.
    # Note: this also strips decimal points and LaTeX markup, so "6.2" becomes "62".
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

train_df['cleaned_text'] = train_df['combined_text'].apply(clean_text)
test_df['cleaned_text'] = test_df['combined_text'].apply(clean_text)

# 4. Apply tokenization and 5. Perform stemming
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    stemmed_tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(stemmed_tokens)

train_df['processed_text'] = train_df['cleaned_text'].apply(tokenize_and_stem)
test_df['processed_text'] = test_df['cleaned_text'].apply(tokenize_and_stem)

# 6. Create numerical representations (TF-IDF)
tfidf_vectorizer = TfidfVectorizer(max_features=5000) # Limit features to reduce dimensionality

X_train = tfidf_vectorizer.fit_transform(train_df['processed_text'])
X_test = tfidf_vectorizer.transform(test_df['processed_text'])

# Store TF-IDF features (optional, for inspection)
# train_tfidf_df = pd.DataFrame(X_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
# test_tfidf_df = pd.DataFrame(X_test.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# 7. Create additional features (text length and word count)
train_df['text_length'] = train_df['cleaned_text'].apply(len)
test_df['text_length'] = test_df['cleaned_text'].apply(len)

train_df['word_count'] = train_df['cleaned_text'].apply(lambda x: len(x.split()))
test_df['word_count'] = test_df['cleaned_text'].apply(lambda x: len(x.split()))

# Display the first few rows of the updated dataframes with new columns
print("\nTrain Data with new features:")
display(train_df.head())

print("\nTest Data with new features:")
display(test_df.head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.



Train Data with new features:


index,row_id,QuestionId,QuestionText,MC_Answer,StudentExplanation,Category,Misconception,combined_text,cleaned_text,processed_text,text_length,word_count
0,0,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),0ne third is equal to tree nineth,True_Correct,No Misconception,What fraction of the shape is not shaded? Give...,what fraction of the shape is not shaded give ...,fraction shape shade give answer simplest form...,196,37
1,1,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),1 / 3 because 6 over 9 is 2 thirds and 1 third...,True_Correct,No Misconception,What fraction of the shape is not shaded? Give...,what fraction of the shape is not shaded give ...,fraction shape shade give answer simplest form...,222,45
2,2,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),"1 3rd is half of 3 6th, so it is simplee to un...",True_Neither,No Misconception,What fraction of the shape is not shaded? Give...,what fraction of the shape is not shaded give ...,fraction shape shade give answer simplest form...,216,43
3,3,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),1 goes into everything and 3 goes into nine,True_Neither,No Misconception,What fraction of the shape is not shaded? Give...,what fraction of the shape is not shaded give ...,fraction shape shade give answer simplest form...,206,39
4,4,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),1 out of every 3 isn't coloured,True_Correct,No Misconception,What fraction of the shape is not shaded? Give...,what fraction of the shape is not shaded give ...,fraction shape shade give answer simplest form...,193,37



Test Data with new features:


index,row_id,QuestionId,QuestionText,MC_Answer,StudentExplanation,combined_text,cleaned_text,processed_text,text_length,word_count
0,36696,31772,What fraction of the shape is not shaded? Give...,\( \frac{1}{3} \),"I think that 1/3 is the answer, as it's the si...",What fraction of the shape is not shaded? Give...,what fraction of the shape is not shaded give ...,fraction shape shade give answer simplest form...,223,44
1,36697,31772,What fraction of the shape is not shaded? Give...,\( \frac{3}{6} \),i think this answer is because 3 triangles are...,What fraction of the shape is not shaded? Give...,what fraction of the shape is not shaded give ...,fraction shape shade give answer simplest form...,239,45
2,36698,32835,Which number is the greatest?,\( 6.2 \),because the 2 makes it higher than the others.,Which number is the greatest? \( 6.2 \) becaus...,which number is the greatest 62 because the ...,number greatest 62 2 make higher other,79,15
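
The last test row above shows a side effect of the cleaning step: `6.2` became `62`, because the regex removes every character outside `a-z0-9` and whitespace. The sketch below is one possible alternative that keeps `.` and `/` only when they sit between two digits. `clean_text_keep_numbers` is a hypothetical helper, not part of the notebook, and adopting it would change the downstream features and results.

```python
import re

def clean_text_keep_numbers(text: str) -> str:
    """Lowercase and strip punctuation, keeping '.' and '/' between digits."""
    text = text.lower()
    # First pass: drop everything except letters, digits, whitespace, '.' and '/'.
    text = re.sub(r'[^a-z0-9\s./]', '', text)
    # Second pass: drop any '.' or '/' that lacks a digit on either side.
    text = re.sub(r'(?<!\d)[./]|[./](?!\d)', '', text)
    return text

sample = "Which number is the greatest? \\( 6.2 \\) because the 2 makes it higher than the others."
print(clean_text_keep_numbers(sample))
print(clean_text_keep_numbers("1/3 of it."))
```

LaTeX such as `\frac{1}{3}` is still flattened by this version, so fractions written in markup would need separate handling.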


## Model selection

### Subtask:
Choose an appropriate NLP model for the task. Given the nature of the competition, this could involve deep learning models like recurrent neural networks (RNNs), transformers, or simpler models like Naive Bayes or SVMs with TF-IDF features.
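
The notebook goes on to fine-tune RoBERTa, but a TF-IDF baseline with a linear classifier is a cheap reference point for judging whether the transformer is worth its cost. The sketch below assumes scikit-learn is available; the texts are made up to resemble stemmed `processed_text` values, and the choice of `LogisticRegression` is an illustration rather than the notebook's method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up examples standing in for processed_text and Misconception labels.
texts = [
    "add numer add denomin",
    "add top add bottom",
    "bigger number alway larger",
    "whole number larger decim",
    "one third shade correct",
    "simplifi fraction correct",
]
labels = [
    "Adding_across", "Adding_across",
    "Whole_numbers_larger", "Whole_numbers_larger",
    "No Misconception", "No Misconception",
]

# TF-IDF features feed a multinomial logistic regression.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)

# Class probabilities for a new explanation, in the order of baseline.classes_.
probs = baseline.predict_proba(["add top and bottom"])
print(dict(zip(baseline.classes_, probs[0].round(3))))
```

On the real data the same pipeline could be fitted on `train_df['processed_text']` and scored on a held-out split before committing to a longer fine-tuning run.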


## Model Training

### Subtask:
Train the selected RoBERTa model using the preprocessed training data.

**Reasoning**:
Reload and preprocess the data, build the `train_dataset` and `val_dataset` TensorFlow datasets, then initialize the `TFRobertaForSequenceClassification` model and fine-tune it on them.

In [8]:
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer  # Not used for RoBERTa input; kept from the earlier preprocessing step

# Download necessary NLTK data if not already present.
# nltk.data.find raises LookupError when a resource is missing.
for resource_path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('tokenizers/punkt_tab', 'punkt_tab'),
    ('corpora/stopwords', 'stopwords'),
]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)


# --- Data Loading ---
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    print("Data loaded successfully.")
except FileNotFoundError as err:
    # Stop the cell with a clear message rather than calling exit() inside a notebook
    raise FileNotFoundError(
        "train.csv or test.csv not found. Please ensure the data is unzipped in the working directory."
    ) from err


# --- Data Preprocessing (from previous steps) ---

# 1. Handle missing values in 'Misconception'
train_df['Misconception'] = train_df['Misconception'].fillna('No Misconception')

# 2. Combine relevant text columns
train_df['combined_text'] = train_df['QuestionText'] + ' ' + train_df['MC_Answer'] + ' ' + train_df['StudentExplanation']
test_df['combined_text'] = test_df['QuestionText'] + ' ' + test_df['MC_Answer'] + ' ' + test_df['StudentExplanation']

# 3. Clean the combined text data
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text) # Remove special characters and punctuation
    return text

train_df['cleaned_text'] = train_df['combined_text'].apply(clean_text)
test_df['cleaned_text'] = test_df['combined_text'].apply(clean_text)

# 4. Apply tokenization and 5. Perform stemming
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    stemmed_tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return ' '.join(stemmed_tokens)

train_df['processed_text'] = train_df['cleaned_text'].apply(tokenize_and_stem)
test_df['processed_text'] = test_df['cleaned_text'].apply(tokenize_and_stem)

# --- End of Data Preprocessing ---


# --- Data Preparation for RoBERTa (from previous steps) ---

# Prepare the labels
# The 'Misconception' column is the target variable.
# We need to encode the categorical labels into numerical ones.
label_encoder = LabelEncoder()
train_df['Misconception_Encoded'] = label_encoder.fit_transform(train_df['Misconception'])

# Split the training data for validation
X_train_text, X_val_text, y_train_encoded, y_val_encoded = train_test_split(
    train_df['processed_text'],
    train_df['Misconception_Encoded'],
    test_size=0.2,  # Using 20% of the data for validation
    random_state=42,
    stratify=train_df['Misconception_Encoded'] # Stratify to maintain class distribution
)

# Initialize RoBERTa tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Tokenize the text data for RoBERTa
max_length = 128 # You might need to adjust this based on your data
train_encodings = tokenizer(X_train_text.tolist(), truncation=True, padding=True, max_length=max_length, return_tensors='tf')
val_encodings = tokenizer(X_val_text.tolist(), truncation=True, padding=True, max_length=max_length, return_tensors='tf')
test_encodings = tokenizer(test_df['processed_text'].tolist(), truncation=True, padding=True, max_length=max_length, return_tensors='tf') # Assuming test_df is available

# Convert labels to numpy arrays for creating TensorFlow datasets
train_labels_array = y_train_encoded.values
val_labels_array = y_val_encoded.values

# Create TensorFlow datasets for the tokenized text and labels
train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), train_labels_array))
val_dataset = tf.data.Dataset.from_tensor_slices((dict(val_encodings), val_labels_array))
test_dataset = tf.data.Dataset.from_tensor_slices(dict(test_encodings)) # Assuming test_encodings is available


# Define batch size and prefetch for performance
batch_size = 16 # You can adjust the batch size
train_dataset = train_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
val_dataset = val_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE) # Assuming test_dataset is needed later


# --- End of Data Preparation for RoBERTa ---


# Initialize the RoBERTa model for sequence classification
# The number of output labels is the number of unique misconceptions
num_labels = len(label_encoder.classes_)
model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=num_labels)

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Train the model
# You can adjust the number of epochs and add callbacks as needed
epochs = 3 # Example: train for 3 epochs
history = model.fit(
    train_dataset,
    epochs=epochs,
    validation_data=val_dataset
)

# Print training history (optional)
print("\nTraining History:")
print(history.history)

Data loaded successfully.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


TensorFlow and JAX classes are deprecated and will be removed in Transformers v5. We recommend migrating to PyTorch classes or pinning your version of Transformers.

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifie

Epoch 1/3
Epoch 2/3
Epoch 3/3

Training History:
{'loss': [0.597223162651062, 0.36036619544029236, 0.3101639151573181], 'accuracy': [0.8221147060394287, 0.8704864382743835, 0.8864627480506897], 'val_loss': [0.43015921115875244, 0.3584235906600952, 0.4643762707710266], 'val_accuracy': [0.8500000238418579, 0.8663488030433655, 0.8512261509895325]}
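
In the history above, validation loss drops from epoch 1 to epoch 2 and then rises in epoch 3 while training loss keeps falling, which often signals the onset of overfitting. The sketch below picks the best epoch from a Keras-style history dict; the values are copied (rounded) from the output above, and `best_epoch` is a hypothetical helper.

```python
# Rounded values from the training history printed above.
history_dict = {
    "loss": [0.5972, 0.3604, 0.3102],
    "val_loss": [0.4302, 0.3584, 0.4644],
    "val_accuracy": [0.8500, 0.8663, 0.8512],
}

def best_epoch(history, monitor="val_loss", mode="min"):
    """Return (1-based epoch, value) where the monitored metric is best."""
    values = history[monitor]
    pick = min if mode == "min" else max
    idx = pick(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

epoch, value = best_epoch(history_dict)
print(f"Best epoch by val_loss: {epoch} (val_loss={value})")
```

During training, a similar effect can be obtained by passing `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)` in the `callbacks` argument of `model.fit`.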


## Calculate Mean Average Precision (MAP)

### Subtask:
Calculate the Mean Average Precision (MAP) for the trained model on the validation set.

**Reasoning**:
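Predict class probabilities for the validation set, compute the one-vs-rest average precision for each class that appears in the validation labels, and average those scores. Note that this is a macro-averaged average precision over classes, which is a different quantity from a ranking-based MAP@k computed per sample.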

In [9]:
from sklearn.metrics import average_precision_score
import numpy as np

# Get predictions for the validation set
# model.predict returns a TFRobertaSequenceClassifierOutput object
val_predictions = model.predict(val_dataset)

# The logits are the raw, unnormalized scores for each class
val_logits = val_predictions.logits

# Apply softmax to get probabilities
val_probabilities = tf.nn.softmax(val_logits, axis=-1).numpy()

# Convert the true labels to one-hot encoding format required by average_precision_score
# We need to know the number of unique classes to create the one-hot encoded matrix
num_classes = len(label_encoder.classes_)
y_val_one_hot = np.zeros((len(y_val_encoded), num_classes))
y_val_one_hot[np.arange(len(y_val_encoded)), y_val_encoded] = 1


# Calculate Average Precision for each class
average_precisions = []
for i in range(num_classes):
    # Handle cases where a class might not be present in the validation set
    if np.sum(y_val_one_hot[:, i]) > 0:
        ap = average_precision_score(y_val_one_hot[:, i], val_probabilities[:, i])
        average_precisions.append(ap)
    else:
        print(f"Warning: Class {i} not present in validation set.") # Optional: print a warning

# Calculate Mean Average Precision (MAP)
mean_average_precision = np.mean(average_precisions)

print(f"\nMean Average Precision (MAP) on the validation set: {mean_average_precision}")


Mean Average Precision (MAP) on the validation set: 0.6717294503734172
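
The score above averages one-vs-rest average precision across classes. If the competition instead ranks predictions per row and scores MAP@3 (an assumption that should be checked against the competition's evaluation page), a per-sample version is needed. The following is a minimal NumPy sketch for single-label targets; `map_at_k` and the toy arrays are illustrative, not part of the notebook.

```python
import numpy as np

def map_at_k(y_true, probabilities, k=3):
    """Mean average precision at k for single-label targets.

    A row scores 1/rank if its true class is at position `rank` (1-based)
    among the k highest-probability classes, otherwise 0.
    """
    y_true = np.asarray(y_true)
    probabilities = np.asarray(probabilities)
    # Indices of the k most probable classes per row, best first.
    top_k = np.argsort(-probabilities, axis=1)[:, :k]
    hits = top_k == y_true[:, None]
    ranks = np.arange(1, top_k.shape[1] + 1)
    return float((hits / ranks).sum(axis=1).mean())

toy_probs = np.array([
    [0.7, 0.2, 0.1, 0.0],  # true class 0 ranked 1st -> 1
    [0.1, 0.5, 0.3, 0.1],  # true class 2 ranked 2nd -> 1/2
    [0.4, 0.3, 0.2, 0.1],  # true class 3 outside the top 3 -> 0
])
toy_true = [0, 2, 3]
print(map_at_k(toy_true, toy_probs))  # (1 + 0.5 + 0) / 3 = 0.5
```

With the notebook's variables, the call would be `map_at_k(y_val_encoded.values, val_probabilities)`.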


# Task
Predict the probabilities of each misconception for the test dataset using the trained model, generate a submission file in the required format, and present the findings in a readable format.

## Predict on test data

### Subtask:
Use the trained model to predict the probabilities of each misconception for the test dataset.


**Reasoning**:
Use the trained model to predict probabilities for the test dataset and extract the probabilities.



In [10]:
import tensorflow as tf

# Get predictions for the test set
test_predictions = model.predict(test_dataset)

# The logits are the raw, unnormalized scores for each class
test_logits = test_predictions.logits

# Apply softmax to get probabilities
test_probabilities = tf.nn.softmax(test_logits, axis=-1).numpy()

print("\nTest Predictions Probabilities (first 5 rows):")
display(test_probabilities[:5])


Test Predictions Probabilities (first 5 rows):


array([[2.66230218e-05, 8.87456918e-05, 3.03512224e-05, 2.07816975e-05,
        6.59666766e-05, 4.25551552e-05, 8.04955998e-05, 1.85229419e-05,
        5.87085815e-05, 2.24850792e-05, 4.15853829e-05, 1.02374397e-05,
        1.71665736e-02, 1.40885211e-04, 1.00142615e-05, 2.33282408e-05,
        2.34601073e-04, 4.43227269e-04, 3.58932775e-05, 1.01093785e-04,
        1.92921543e-05, 9.79554415e-01, 1.35528040e-04, 1.06094718e-04,
        1.05410727e-04, 1.19450211e-04, 4.45275509e-05, 7.64435390e-05,
        5.55375409e-05, 4.75977795e-05, 7.20737444e-04, 4.54334149e-05,
        1.32260058e-04, 1.48588670e-05, 1.20110475e-04, 3.96573669e-05],
       [2.39328379e-04, 2.92570505e-04, 6.17283746e-04, 8.15467865e-05,
        4.25428268e-04, 3.26648325e-04, 2.99210282e-04, 9.72608832e-05,
        5.23028466e-05, 1.13927724e-03, 2.12646075e-04, 3.95550269e-05,
        4.11906764e-02, 1.96777837e-04, 8.89247603e-05, 1.84293982e-04,
        2.34904815e-04, 1.68034190e-03, 2.98893399e-04, 3.34146

## Generate submission file

### Subtask:
Create a submission file in the format required by the competition, which typically involves mapping the predicted probabilities back to the original misconception labels and formatting the output.


**Reasoning**:
Create the submission file by mapping the predicted probabilities to the original labels and saving the DataFrame to a CSV file.



In [11]:
import pandas as pd
import numpy as np # Make sure numpy is imported if needed

# 1. Get the original misconception labels from the label_encoder
original_labels = label_encoder.classes_

# 2. Create a DataFrame for the submission file
# Use 'row_id' from the test_df and columns for each misconception label
submission_df = pd.DataFrame(index=test_df['row_id'])

# 3. Populate the submission DataFrame with the predicted probabilities
# Ensure the columns correspond to the correct misconception labels
for i, label in enumerate(original_labels):
    submission_df[label] = test_probabilities[:, i]

# 4. Save the submission DataFrame to a CSV file
submission_df.to_csv('submission.csv', index=True, index_label='row_id')

# 5. Display the head of the generated submission file
print("\nSubmission File Head:")
display(submission_df.head())


Submission File Head:


row_id,Adding_across,Adding_terms,Additive,Base_rate,Certainty,Definition,Denominator-only_change,Division,Duplication,Firstterm,...,Subtraction,SwapDividend,Tacking,Unknowable,WNB,Whole_numbers_larger,Wrong_Fraction,Wrong_Operation,Wrong_fraction,Wrong_term
36696,2.7e-05,8.9e-05,3e-05,2.1e-05,6.6e-05,4.3e-05,8e-05,1.9e-05,5.9e-05,2.2e-05,...,4.5e-05,7.6e-05,5.6e-05,4.8e-05,0.000721,4.5e-05,0.000132,1.5e-05,0.00012,4e-05
36697,0.000239,0.000293,0.000617,8.2e-05,0.000425,0.000327,0.000299,9.7e-05,5.2e-05,0.001139,...,0.000194,0.000291,0.000137,0.000331,0.692021,0.000322,0.000912,4.2e-05,0.000167,0.000377
36698,6.8e-05,0.000148,3.6e-05,0.000596,0.000172,0.000338,0.000357,6.2e-05,0.000238,3.1e-05,...,5.5e-05,0.000303,6.2e-05,8.8e-05,9.4e-05,0.000153,0.000157,4.3e-05,2.8e-05,4e-05
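
The file above holds one probability column per label. If the competition instead expects a single column with up to three space-separated labels per row (an assumption; `sample_submission.csv` from the downloaded archive shows the exact column name and label format), the probabilities can be converted as sketched below. The `top_k_labels` helper, the `prediction` column name, and the toy values are placeholders.

```python
import numpy as np
import pandas as pd

def top_k_labels(probabilities, class_names, k=3):
    """Join the k most probable class names per row, best first."""
    probabilities = np.asarray(probabilities)
    class_names = np.asarray(class_names)
    top_idx = np.argsort(-probabilities, axis=1)[:, :k]
    return [" ".join(class_names[row]) for row in top_idx]

# Toy probabilities over four label names taken from the table above.
toy_classes = ["Adding_across", "Additive", "Division", "WNB"]
toy_probs = np.array([
    [0.05, 0.10, 0.15, 0.70],
    [0.60, 0.25, 0.10, 0.05],
])

toy_submission = pd.DataFrame({
    "row_id": [36696, 36697],
    "prediction": top_k_labels(toy_probs, toy_classes),  # placeholder column name
})
print(toy_submission)
```

Labels that contain spaces, such as the 'No Misconception' placeholder introduced during preprocessing, would need to be mapped to a space-free token before joining, since spaces act as the separator here.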


## Summary:

### Data Analysis Key Findings

* The trained model was used to predict the probabilities of each misconception for the test dataset, resulting in a NumPy array `test_probabilities`.
* A submission DataFrame was created using the 'row\_id' from the test data and columns corresponding to the original misconception labels.
* The predicted probabilities were populated into the submission DataFrame, with each column representing the probabilities for a specific misconception.
* The submission DataFrame was successfully saved as a CSV file named 'submission.csv' with 'row\_id' as the index.

### Insights or Next Steps

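* Before uploading, the generated 'submission.csv' file should be compared against `sample_submission.csv`, since the competition may expect a different layout than one probability column per misconception.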
* The process can be extended to analyze the distribution of predicted probabilities for each misconception across the test set.
