### Project 02: Post-Training Knowledge Assessment
#### Description
This project evaluates the knowledge and skills that GED students acquire by completing the training program. By analyzing both pre-training (entrance) scores and post-training assessment data, we can gauge how effectively the program improves students' academic abilities and their readiness for the GED exam. Comparing pre- and post-training data also provides insight into individual student growth and the program's overall impact.

#### Data Points
- Pre-training (entrance) scores
- Post-training assessment scores
- Subject areas covered (Math, Science, RLA, Social Studies)
- Student feedback on perceived knowledge gained
- Attendance and participation rates
#### Methodology
- Administer post-training assessments to measure knowledge acquisition after completing the program.
- Collect quantitative and qualitative feedback from students regarding their perceived improvements in knowledge and skills.
- Compare pre- and post-training scores to assess individual and group progress across different subject areas.
- Identify patterns in knowledge growth, and correlate those with attendance, participation rates, and student feedback.
- Use visualizations to present the findings, highlighting any correlations between attendance, participation, and assessment scores.
#### Expected Outcomes
- Comprehensive understanding of knowledge gained by participants, measured by improvements from pre- to post-training.
- Insights into subject areas where students showed significant improvement or encountered challenges.
- Identification of correlations between attendance/participation and academic performance.
- Recommendations for program enhancements based on assessment results and student feedback to improve future training sessions.
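
The comparison step in the methodology can be sketched as a per-student gain (post minus pre), summarized per subject. This is a minimal sketch under assumptions: the `pre_<subject>` and `post_<subject>` column names, the toy values, and the `score_gains` helper are hypothetical, and it presumes both assessments are scored on a comparable scale.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only,
# they are not taken from the project files.
scores = pd.DataFrame(
    {
        "pre_math": [40, 55, 60],
        "post_math": [65, 70, 58],
        "pre_rla": [50, 45, 70],
        "post_rla": [60, 75, 80],
    },
    index=["s1", "s2", "s3"],
)


def score_gains(df, subjects):
    """Return the per-student gain (post minus pre) for each subject."""
    return pd.DataFrame(
        {sub: df[f"post_{sub}"] - df[f"pre_{sub}"] for sub in subjects}
    )


gains = score_gains(scores, ["math", "rla"])       # individual progress
group_summary = gains.agg(["mean", "median"])      # group progress per subject
print(gains)
print(group_summary)
```

A negative gain (student `s3` in math here) flags a student whose score dropped, which is worth reviewing individually rather than averaging away.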

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('./data/pre_post_data_02.csv', index_col = 0)
df_copy = df.drop([col for col in df.columns if 'avg' in col], axis = 1)
df_copy = df_copy.rename(columns = {'n_math_test1':'math_test1'}).sort_index(axis=1)
df_copy.head()

In [None]:
sub_list = ['math', 'ss', 'rla', 'sci']
for sub in sub_list:
    # Select every test column for this subject (matched by substring)
    data_cols = [col for col in df_copy.columns if sub in col]
    data = df_copy[data_cols].copy()
    medians = data.median()
    plt.figure(figsize=(12, 6))
    ax = sns.boxplot(data=data)
    # Print each test's median just above its median line
    for xtick, median in zip(ax.get_xticks(), medians):
        ax.text(xtick, median + 0.1, f'{median:.2f}', horizontalalignment='center', color='black', weight='bold')
    plt.show()

**Insights** The students performed consistently well on their module-end tests. However, we cannot conclusively measure improvement from pre- to post-training, because the module-end tests cover module-specific content. In addition, the pre-test does not accurately reflect students' knowledge of the GED material; it only measures whether they are ready to enter the GED program.

In [None]:
avg_cols = [col for col in df.columns if 'avg' in col]
df_avg = df[avg_cols + ['Pre_Math', 'Pre_Eng']].copy()
# Rows: entrance scores; columns: module averages (selected by label, not position)
cor_matrix = df_avg.corr().loc[['Pre_Math', 'Pre_Eng'], avg_cols]
cor_matrix

In [None]:
plt.figure(figsize = (8,6))
sns.heatmap(cor_matrix, cmap='viridis', annot=True, fmt='.2f')
plt.show()

**Insights**
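- `Pre_Eng` has a moderate positive correlation with the **RLA and Social Studies** averages, which suggests that stronger English proficiency at entry is associated with better performance in these subjects during the training. Correlation alone does not establish that English proficiency causes the better performance.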
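
With a small cohort, a moderate correlation coefficient can arise by chance, so it may help to check how unusual the observed value is. The sketch below uses a simple permutation test for that purpose. The `permutation_pvalue` helper and the toy scores are hypothetical and are not taken from the project data.

```python
import numpy as np


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Return (Pearson r, two-sided permutation p-value) for paired samples."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    # Shuffle y to break any pairing with x, then recompute r each time
    shuffled = np.array(
        [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_perm)]
    )
    # Share of shuffles at least as extreme as the observed value (+1 correction)
    p_value = (np.sum(np.abs(shuffled) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical entrance English scores and RLA module averages
pre_eng = [50, 55, 60, 65, 70, 75, 80, 85]
avg_rla = [52, 50, 63, 61, 72, 70, 84, 80]

r, p = permutation_pvalue(pre_eng, avg_rla)
print(f"r = {r:.2f}, permutation p = {p:.4f}")
```

A small p-value only indicates that the association is unlikely to be random noise; it still says nothing about causation.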


In [None]:
df_att = pd.read_csv('./data/attendance.csv', index_col = 0)
# Treat zeros as missing values, then impute each column with its own mean
df_att = df_att.mask(df_att == 0)
df_att = df_att.fillna(df_att.mean())

df_att.describe()
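
The cleaning step above treats a zero as a missing value and fills it with the column mean. A minimal sketch of that idea on toy data follows; the values are hypothetical, and whether a zero really means "not recorded" rather than "never attended" is an assumption that should be checked against how the attendance file was collected.

```python
import pandas as pd

# Hypothetical attendance percentages; 0.0 stands in for a missing record
toy_att = pd.DataFrame(
    {"att_math": [80.0, 0.0, 100.0], "att_sci": [0.0, 90.0, 70.0]}
)

was_zero = toy_att == 0                   # remember which cells get imputed
cleaned = toy_att.mask(was_zero)          # zeros become NaN
imputed = cleaned.fillna(cleaned.mean())  # NaN filled with the column mean of observed values

print(imputed)
print("imputed cells per column:")
print(was_zero.sum())
```

Keeping the `was_zero` mask makes it possible to report how many values were imputed, since mean imputation shrinks the variance of a column and can distort correlations.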

In [None]:
plt.figure(figsize=(12, 8))
label_list = ['Math', 'Social Studies', 'RLA', 'Science']

for idx, sub in enumerate(sub_list):
    plt.subplot(2, 2, idx+1)
    sns.regplot(x='att_'+sub, y='avg_'+sub, data=df_att, ci = 95, label = label_list[idx])
    plt.xlabel(label_list[idx]+ '-attendance %')
    plt.ylabel(label_list[idx])
    plt.legend(loc = 'lower left')
plt.tight_layout()
plt.show()


**Insights** The regression plots show a roughly linear relationship between attendance and module performance, with the strongest relationship in Social Studies and Science.
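
To put numbers on how strong each relationship is, the Pearson correlation can be computed per subject. The sketch below assumes the `att_<subject>` / `avg_<subject>` naming used in the plotting cell; the toy values and the `attendance_correlations` helper are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame mirroring the att_<subject> / avg_<subject> naming
toy = pd.DataFrame(
    {
        "att_math": [60, 70, 80, 90, 100],
        "avg_math": [55, 60, 62, 70, 75],
        "att_sci": [60, 70, 80, 90, 100],
        "avg_sci": [50, 60, 70, 80, 90],
    }
)


def attendance_correlations(df, subjects):
    """Pearson r between attendance and module average for each subject."""
    return pd.Series(
        {sub: df[f"att_{sub}"].corr(df[f"avg_{sub}"]) for sub in subjects},
        name="pearson_r",
    )


r_by_subject = attendance_correlations(toy, ["math", "sci"])
print(r_by_subject.round(2))
```

Reporting these coefficients next to the plots makes a statement such as "stronger in Social Studies and Science" directly checkable.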

In [None]:
df_con = pd.read_csv('./data/confidence.csv', index_col = 0)
df_con = df_con.dropna()
df_melted = pd.melt(df_con, var_name='subject_confidence', value_name='confidence_score')
df_melted['subject'] = df_melted['subject_confidence'].apply(lambda x: x.split('_')[2])  # Math, RLA, etc.
df_melted['before_after'] = df_melted['subject_confidence'].apply(lambda x: 'Before' if 'bef' in x else 'After')
df_con.head()

In [None]:
plt.figure(figsize=(10, 6))
# hue_order pins 'Before' to the left box and 'After' to the right box of each pair
ax = sns.boxplot(data=df_melted, x='subject', y='confidence_score', hue='before_after',
                 hue_order=['Before', 'After'], palette="Set2")
# Annotate each box with its median (the line the boxplot draws)
for i, subject in enumerate(df_melted['subject'].unique()):
    subject_scores = df_melted[df_melted['subject'] == subject]
    median_before = subject_scores.loc[subject_scores['before_after'] == 'Before', 'confidence_score'].median()
    median_after = subject_scores.loc[subject_scores['before_after'] == 'After', 'confidence_score'].median()
    ax.text(i - 0.2, median_before + 0.1, f'{median_before:.2f}', color='black', ha="center", va="center")
    ax.text(i + 0.2, median_after + 0.1, f'{median_after:.2f}', color='black', ha="center", va="center")

plt.title('Confidence Before and After Training for Each Subject')
plt.ylabel('Confidence Score')
plt.xlabel('Subject')
# Assumes the subjects appear in this order in df_melted; adjust if the column order differs
plt.xticks([0, 1, 2, 3], labels=['Math', 'RLA', 'Science', 'Social Studies'])
plt.legend(loc='upper right', bbox_to_anchor=(1.15, 0.99))
plt.show()
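
The boxplots compare the before and after distributions, but they do not show how each individual student changed. A per-student difference computed from the wide table can complement them. This is a minimal sketch: the `conf_bef_<subject>` / `conf_aft_<subject>` column names, the toy scores, and the `confidence_change` helper are hypothetical and may not match the actual columns in `confidence.csv`.

```python
import pandas as pd

# Hypothetical wide-format confidence scores, one row per student
toy_con = pd.DataFrame(
    {
        "conf_bef_math": [2, 3, 1],
        "conf_aft_math": [4, 4, 3],
        "conf_bef_rla": [3, 3, 4],
        "conf_aft_rla": [4, 3, 5],
    },
    index=["s1", "s2", "s3"],
)


def confidence_change(df, subjects, prefix="conf"):
    """Per-student change in confidence (after minus before) per subject."""
    return pd.DataFrame(
        {sub: df[f"{prefix}_aft_{sub}"] - df[f"{prefix}_bef_{sub}"] for sub in subjects}
    )


change = confidence_change(toy_con, ["math", "rla"])
summary = pd.DataFrame(
    {
        "median_change": change.median(),
        "share_improved": (change > 0).mean(),  # fraction of students who gained
    }
)
print(summary)
```

The share of students who improved is less sensitive to a few large jumps than the median or mean change, so it is a useful second view of the same data.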