# Exploratory Data Analysis (EDA)

In this notebook, we explore the training dataset to understand how misconception categories, multiple-choice answers, and student explanations are distributed. The plots below are meant to inform the design of our NLP model.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (set_theme is the current name for sns.set)
sns.set_theme(style='whitegrid')

In [None]:
# Load the training data
train_data = pd.read_csv('../data/train.csv')

# Display the first few rows of the dataset
train_data.head()

In [None]:
# Check for missing values in the dataset
missing_values = train_data.isnull().sum()
missing_values[missing_values > 0]
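
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both. It runs on a small made-up frame (`toy`, with placeholder values) rather than on `train_data`, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy stand-in for train_data; the values are made up for illustration only.
toy = pd.DataFrame({
    "Category": ["A", "B", None, "A"],
    "StudentExplanation": ["because 1/2 is bigger", None, "i added them", None],
})

# Count and share of missing values per column.
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "share": toy.isnull().mean(),
})

# Keep only the columns that actually have gaps, worst first.
missing_summary = missing_summary[missing_summary["missing"] > 0]
missing_summary = missing_summary.sort_values("share", ascending=False)
print(missing_summary)
```

Swapping `toy` for `train_data` should give the same table for the real columns.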

In [None]:
# Visualize the distribution of categories
plt.figure(figsize=(10, 6))
sns.countplot(data=train_data, x='Category', order=train_data['Category'].value_counts().index)
plt.title('Distribution of Misconception Categories')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
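
A count plot shows class imbalance visually. If we also want numbers for it, a minimal sketch is to compute per-category shares and the ratio between the largest and smallest class. The frame `toy` and its labels `A`, `B`, `C` are placeholders, not the real category names.

```python
import pandas as pd

# Toy stand-in for train_data['Category']; labels are placeholders.
toy = pd.DataFrame({"Category": ["A", "A", "A", "B", "B", "C"]})

# Absolute counts and relative shares, both sorted from most to least frequent.
counts = toy["Category"].value_counts()
shares = toy["Category"].value_counts(normalize=True)

# Ratio of the most frequent to the least frequent class.
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(f"Imbalance ratio (largest / smallest class): {imbalance_ratio:.1f}")
```

A large ratio would suggest considering stratified splits or class weights later on.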

In [None]:
# Visualize how multiple-choice answers are distributed across categories
plt.figure(figsize=(10, 6))
sns.countplot(data=train_data, x='MC_Answer', hue='Category')
plt.title('Distribution of Answers by Category')
plt.xlabel('Multiple Choice Answer')
plt.ylabel('Count')
plt.legend(title='Category')
plt.show()
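
With many categories, a grouped count plot can become hard to read. As a complementary view, we could tabulate the same relationship with `pd.crosstab`. The sketch below uses a made-up frame (`toy`) with placeholder answers and labels, and normalizes each row so that it sums to 1.

```python
import pandas as pd

# Toy stand-in for train_data; answers and labels are placeholders.
toy = pd.DataFrame({
    "MC_Answer": ["1/2", "1/2", "1/3", "1/3"],
    "Category": ["A", "B", "A", "A"],
})

# Share of each category within every multiple-choice answer (rows sum to 1).
answer_by_category = pd.crosstab(
    toy["MC_Answer"], toy["Category"], normalize="index"
)
print(answer_by_category)
```

Passing this table to `sns.heatmap` is one option if a visual summary is still wanted.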

In [None]:
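# Analyze the length (in characters) of student explanations
# .str.len() returns NaN for missing explanations, whereas apply(len) would raise a TypeError
train_data['Explanation_Length'] = train_data['StudentExplanation'].str.len()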

plt.figure(figsize=(10, 6))
sns.histplot(train_data['Explanation_Length'], bins=30, kde=True)
plt.title('Distribution of Explanation Lengths')
plt.xlabel('Length of Explanation')
plt.ylabel('Frequency')
plt.show()
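
Character counts are only one measure of length. Word counts, summarized per category, may be more informative for an NLP model. The sketch below assumes whitespace tokenization and uses a made-up frame (`toy`) with placeholder values. The names `char_len` and `word_count` are introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for train_data; the values are made up for illustration only.
toy = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "StudentExplanation": [
        "i added the tops",
        None,
        "half is bigger than a third",
    ],
})

# Both accessors propagate missing values as NaN instead of raising.
toy["char_len"] = toy["StudentExplanation"].str.len()
toy["word_count"] = toy["StudentExplanation"].str.split().str.len()

# Median length per category; missing values are skipped by default.
length_by_category = toy.groupby("Category")[["char_len", "word_count"]].median()
print(length_by_category)
```

The median is used rather than the mean because length distributions are often right-skewed.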

## Conclusion

In this exploratory analysis, we visualized the distribution of misconception categories, the breakdown of multiple-choice answers by category, and the length of student explanations. These observations will guide feature selection and model training for predicting student misconceptions.