[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1nmRz-pOLEc0zjeG6MIaqtMXgdql4wpi0?usp=sharing)

# 1.1 - Data Exploration (Merge Content)

This notebook provides an in-depth analysis of student performance in exams at public schools.

**Info_Content.csv**

Content on Junyi Academy includes exercises, videos, and exams.
All content in this dataset is of the exercise type.

An exercise is a basic unit for students to learn a certain concept.
There are multiple problems in a single exercise that all relate to a certain concept.

This table records the metadata and hierarchy structure of each exercise in Junyi Academy.
Each piece of content has one of three difficulty settings, which indicate how hard the concept is to learn (the `difficulty` column can also take the value `Unset`).
Learning is divided into three stages: Elementary, Junior, and Senior.

The exercises in Junyi Academy are organized in a tree-like structure.
The current dataset release has four levels in the hierarchy.

| Variable Name | Description |
|:-|:-|
| ucid | The hashed unique ID of the content. |
| content_pretty_name | The Chinese display name of this content. |
| content_kind | The kind of this content. The current dataset release only includes `Exercise`. |
| difficulty | The difficulty of this content. There are four possible values: `Easy`, `Normal`, `Hard` and `Unset`. Unset means |
| subject | The subject of this content. The current dataset release only includes `math`. |
| learning_stage | The learning stage of this content. There are three possible values: `Elementary`, `Junior` and `Senior`. |
| level1_id | The hashed level 1 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |
| level2_id | The hashed level 2 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |
| level3_id | The hashed level 3 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |
| level4_id | The hashed level 4 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |
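
The four `level*_id` columns can be checked for the tree-like structure described above. The following is a minimal sketch on a toy table with made-up IDs, not the real `Info_Content.csv`. It counts the distinct nodes at each level and verifies that every node has a single parent.

```python
import pandas as pd

# Toy stand-in for Info_Content.csv; all IDs below are made up for illustration.
content = pd.DataFrame(
    {
        "ucid": ["c1", "c2", "c3", "c4"],
        "level1_id": ["root", "root", "root", "root"],
        "level2_id": ["a", "a", "b", "b"],
        "level3_id": ["a1", "a1", "b1", "b2"],
        "level4_id": ["a1x", "a1y", "b1x", "b2x"],
    }
).set_index("ucid")

level_cols = ["level1_id", "level2_id", "level3_id", "level4_id"]

# Number of distinct nodes at each level of the hierarchy.
nodes_per_level = content[level_cols].nunique()

# In a tree, each node has exactly one parent: for every adjacent pair of
# levels, each child ID should map to a single parent ID.
is_tree = all(
    content.groupby(child)[parent].nunique().le(1).all()
    for parent, child in zip(level_cols, level_cols[1:])
)

print(nodes_per_level)
print("single parent per node:", is_tree)
```

The same two checks should work on the real table once it is loaded, assuming the level columns contain no missing values.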

## Importing Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import matplotlib.pyplot as plt
import seaborn as sns

## Loading Data

In [2]:
%%time
info_content_raw_df = pd.read_csv('../data/raw/Info_Content.csv', index_col='ucid')
info_userdata_raw_df = pd.read_csv('../data/raw/Info_UserData.csv', index_col='uuid')
log_problem_raw_df = pd.read_csv('../data/raw/Log_Problem.csv', index_col='upid')

Wall time: 37.2 s


## Merging Data

This is one of the most important steps in our solution. The users, contents and problems are stored in separate datasets. For modelling, they have to be merged carefully so that every problem attempt keeps its row and gains the matching user and content attributes.

1. One user can attempt multiple problems.
2. One content can have multiple problems.
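
Both relationships are many-to-one from the problem log's point of view, and pandas can enforce that during a merge. The sketch below uses toy frames with made-up values, not the raw files. The `validate` and `indicator` arguments are an optional safeguard added here for illustration and are not part of the original cells.

```python
import pandas as pd

# Toy stand-ins for the three raw tables; every value here is made up.
users = pd.DataFrame({"uuid": ["u1", "u2"], "has_class_cnt": [1, 0]})
contents = pd.DataFrame({"ucid": ["c1", "c2"], "difficulty": ["easy", "hard"]})
log = pd.DataFrame(
    {
        "upid": ["p1", "p2", "p3", "p4"],
        "uuid": ["u1", "u1", "u2", "u3"],  # "u3" has no user record
        "ucid": ["c1", "c2", "c1", "c2"],
    }
)

# validate="many_to_one" raises MergeError if a key repeats on the right side,
# which would otherwise silently duplicate problem rows.
merged = log.merge(
    users, how="left", on="uuid", validate="many_to_one", indicator="user_match"
).merge(contents, how="left", on="ucid", validate="many_to_one")

# A left join keeps every problem row; unmatched keys show up as "left_only".
unmatched_users = int((merged["user_match"] == "left_only").sum())
print(len(merged) == len(log), unmatched_users)
```

With a left join the merged frame should have exactly as many rows as the problem log, so comparing the two lengths is a cheap sanity check after each merge.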

In [4]:
merge_df = log_problem_raw_df.merge(info_userdata_raw_df, how='left', on='uuid')

In [5]:
merge_df = merge_df.merge(info_content_raw_df, how='left', on='ucid')

In [6]:
merge_df.head()

| | timestamp_TW | uuid | ucid | problem_number | exercise_problem_repeat_session | is_correct | total_sec_taken | total_attempt_cnt | used_hint_cnt | is_hint_used | ... | has_class_cnt | content_pretty_name | content_kind | difficulty | subject | learning_stage | level1_id | level2_id | level3_id | level4_id |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| 0 | 2019-05-26 21:00:00 UTC | FLy+lviglNR5Y1l0Xiijnl6QHySBcpKHJLCtQ6ogm2Q= | KDOmuTrY/IJzDP4kIgIYCBiGyTymsJ8Iy4cDB35WGYg= | 18 | 2 | True | 33 | 1 | 0 | False | ... | 1 | 【一般】含乘方的四則運算 | Exercise | normal | math | junior | aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA= | xYDz4OEv0xsri1IpmXlrgMLJ848rgySf+39xWpq4DBI= | /yqeM1FRP1rB9WuQWBkStMqrBQgjEexaeyWIhBC7ov4= | 3jxSic/zhR8AsGYosBmwxHpD3CCxpEZRMKGWPQ0pmG8= |
| 1 | 2019-05-17 16:30:00 UTC | +Gqj2nalc6M9fusyVECTC0AN7UQdDQTXESIuElkDltU= | COZ39Wo+uIUO2s7c2VGEHjJf6Vx0xifxVAiaeHtaTdk= | 4 | 1 | True | 8 | 1 | 0 | False | ... | 0 | 【基礎】用符號代表數 | Exercise | easy | math | elementary | aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA= | ICgke8JJv5eapCPwyj1aco8PEtoBkUbTZYIqxmYtqBk= | 4cISKCt3nXWe4r6Q8bzjOiL2EYYsZyT6Z0mNNJckqEc= | J7h/J2DAyPA0bQF0yKIT3xSLnNWYeu5zj9dAlyH7Wi0= |
| 2 | 2019-05-15 19:15:00 UTC | 6D5QN8j8ng/VR74ES3A0zqAj0bIFFyaKjKEj8ZyXjQ8= | TwyqyV1uJYlDAX8wX/PtTCVZEBo/APIVfTzzleGkNCQ= | 9 | 1 | True | 17 | 1 | 0 | False | ... | 0 | 【基礎】6 個和第 6 個 | Exercise | easy | math | elementary | aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA= | 7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs= | scsWmkZsfmdmD2IzB24sQ1Au1BOXYgQEx9zO3+4glq8= | 0kg46I/6iMMbn3w0M3CGBQ0jcjNGl+29E3S2vihaIo0= |
| 3 | 2019-05-05 14:45:00 UTC | GgTZuCqZXObthtK6GAwqvlHrTMm5pKHWeezQxL/pcKc= | tBo6ECyT8IlKAM8UhQHWkqv92PRLcSiwuerfC7vNX+w= | 2 | 1 | True | 10 | 1 | 0 | False | ... | 0 | 【一般】判斷直線通過的象限 | Exercise | normal | math | junior | aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA= | R81Sqc8LAYj8amTPwFRvoPgbGpdaZoQLNX0hTg0DMB4= | t28teuC5ahPcOEtWJFv3k5eJZfLOFiyFCDL10Ktqu7Q= | TlimT3AnGjbr2ABY8Ji1ArdO35DUs6xZoaGE/0v2xW4= |
| 4 | 2019-05-14 16:45:00 UTC | JMNKWoU0CkMSzgQ8bCnmCYlD8jEzAVge3lHMYLXKM2g= | vVpSKAMQbTMvtdERR0ksOeRmmaFt0R210t4Z//0RpPA= | 6 | 1 | True | 98 | 1 | 0 | False | ... | 0 | 【基礎】加減混合併式 | Exercise | easy | math | elementary | aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA= | 7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs= | scsWmkZsfmdmD2IzB24sQ1Au1BOXYgQEx9zO3+4glq8= | QveDXHYtspe3tNggjIr6cNoA1ghw4K5kmQT115hFzYI= |


In [8]:
merge_df['level'].value_counts()

0    11809119
1     2352668
2      996819
3      751424
4      307281
Name: level, dtype: int64
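
The raw counts are easier to compare as proportions. As a small sketch, the snippet below rebuilds the counts printed above as a `Series` and converts them to shares of all rows. On the full frame, `merge_df['level'].value_counts(normalize=True)` should give the same result directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
level_counts = pd.Series(
    [11809119, 2352668, 996819, 751424, 307281],
    index=[0, 1, 2, 3, 4],
    name="level",
)

# Share of all merged rows that fall into each level.
level_share = level_counts / level_counts.sum()
print(level_share.round(3))
```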