[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1nmRz-pOLEc0zjeG6MIaqtMXgdql4wpi0?usp=sharing)

# 1.1 - Data Exploration (Merge Content)

This notebook provides an in-depth analysis of student performance in exams at public schools.

**Info_Content.csv**

The content in Junyi Academy contains exercises, videos, and exams.
All the content in this dataset is of the exercise type.

An exercise is a basic unit for students to learn a certain concept.
There are multiple problems in a single exercise that all relate to a certain concept.

This table records the metadata and hierarchy structure of each exercise in Junyi Academy.
Each piece of content has one of three difficulty settings (or `Unset`), which indicate how hard the concept is to learn.
The learning stage is separated into three stages: Elementary, Junior, and Senior.

The exercises in Junyi Academy are organized in a tree-like structure.
The current dataset release has four levels in the hierarchy.
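
The tree-like structure described above can be sanity-checked in code. The following is a minimal sketch on a toy frame, not the notebook's own method: the IDs in `toy_content_df` are invented, and only the `level1_id` to `level4_id` column names come from the dataset. It counts distinct nodes per level and checks that every child node has exactly one parent.

```python
import pandas as pd

# Toy stand-in for Info_Content.csv; all IDs below are made up for illustration.
toy_content_df = pd.DataFrame(
    {
        "level1_id": ["root", "root", "root", "root"],
        "level2_id": ["A", "A", "B", "B"],
        "level3_id": ["A1", "A1", "B1", "B2"],
        "level4_id": ["A1a", "A1b", "B1a", "B2a"],
    },
    index=pd.Index(["u1", "u2", "u3", "u4"], name="ucid"),
)

level_cols = ["level1_id", "level2_id", "level3_id", "level4_id"]

# Number of distinct nodes at each level of the hierarchy.
nodes_per_level = toy_content_df[level_cols].nunique()

# In a strict tree, every child node maps to exactly one parent node.
single_parent = {
    child: bool((toy_content_df.groupby(child)[parent].nunique() == 1).all())
    for parent, child in zip(level_cols[:-1], level_cols[1:])
}

print(nodes_per_level)
print(single_parent)
```

Running the same two checks on the real table would show whether the hashed level IDs really nest as a tree or whether some nodes are shared between branches.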

| Variable Name | Description |
|:-|:-|
| ucid | The hashed unique ID of the content. |
| content_pretty_name | The Chinese display name of this content. |
| content_kind | The kind of this content. The current dataset release only includes `Exercise` |
| difficulty | The difficulty of this content. There are four possible values: `Easy`, `Normal`, `Hard` and `Unset`. Unset means |
| subject | The subject of this content. The current dataset release only includes `math`. |
| learning_stage | The learning stage of this content. There are three possible values: `Elementary`, `Junior` and `Senior`. |
| level1_id | The hashed level 1 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |
| level2_id | The hashed level 2 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |
| level3_id | The hashed level 3 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |
| level4_id | The hashed level 4 layer ID of this content. The levels form the tree-like hierarchy structure of contents in Junyi |

## Importing Libraries

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import matplotlib.pyplot as plt

## Loading Data

In [3]:
%%time
info_content_raw_df = pd.read_csv('../data/raw/Info_Content.csv', index_col='ucid')
info_userdata_raw_df = pd.read_csv('../data/raw/Info_UserData.csv', index_col='uuid')
log_problem_raw_df = pd.read_csv('../data/raw/Log_Problem.csv', index_col='upid')

Wall time: 20.1 ms


## Merging Data

This is one of the most important steps in our solution. The users, contents and problems are stored in separate datasets. For modelling, the datasets have to be merged carefully so that each problem attempt carries the matching user and content attributes.

1. One user can attempt multiple problems.
2. One content can have multiple problems.

In [None]:
# Attach user attributes to every problem attempt (many attempts per user).
merge_df = log_problem_raw_df.merge(info_userdata_raw_df, how='left', on='uuid')

In [None]:
merge_df = merge_df.merge(info_content_raw_df, how='left', on='ucid')
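
Because one user and one content item can each map to many problem rows, both joins are many-to-one. The sketch below shows one way to have pandas enforce that with the `validate` argument of `DataFrame.merge`. It runs on small invented frames: the `uuid`, `ucid` and `upid` keys follow the notebook, while `is_correct`, `user_grade` and all values are illustrative assumptions.

```python
import pandas as pd

# Toy frames; key names follow the notebook, other columns and values are invented.
toy_log_df = pd.DataFrame(
    {
        "uuid": ["s1", "s1", "s2"],
        "ucid": ["c1", "c2", "c1"],
        "is_correct": [True, False, True],
    },
    index=pd.Index(["p1", "p2", "p3"], name="upid"),
)
toy_user_df = pd.DataFrame(
    {"user_grade": [5, 6]}, index=pd.Index(["s1", "s2"], name="uuid")
)
toy_content_df = pd.DataFrame(
    {"difficulty": ["easy", "hard"]}, index=pd.Index(["c1", "c2"], name="ucid")
)

# reset_index turns the index keys into ordinary columns so they survive the merges.
# validate="many_to_one" raises MergeError if a right-hand key is duplicated.
merged = (
    toy_log_df.reset_index()
    .merge(toy_user_df.reset_index(), how="left", on="uuid", validate="many_to_one")
    .merge(toy_content_df.reset_index(), how="left", on="ucid", validate="many_to_one")
)

# A left join against unique right-hand keys keeps the row count of the log.
assert len(merged) == len(toy_log_df)
print(merged)
```

With `validate` in place, a duplicated `uuid` or `ucid` in a lookup table fails loudly instead of silently multiplying log rows.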

In [4]:
info_content_raw_df.head()

ucid,content_pretty_name,content_kind,difficulty,subject,learning_stage,level1_id,level2_id,level3_id,level4_id
odIwFdIiecFwVUAEEV40K3MSuCSlIZkbq92Zp9tkZq8=,【基礎】怎樣解題：數量關係,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,ICgke8JJv5eapCPwyj1aco8PEtoBkUbTZYIqxmYtqBk=,bo3jsx1beVLEZ+2sckxdZNYnlLpVS7hb5lWU2baQ66k=,KPJMQebU0O24+NzlQ4udb2BXLlKV1Hte61+hV5Xb+oU=
dfeeBaa8zDhWS6nu7zeXKwLyi4zqEajI3tJM9/fSBPM=,【基礎】和差問題 1,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,ICgke8JJv5eapCPwyj1aco8PEtoBkUbTZYIqxmYtqBk=,bo3jsx1beVLEZ+2sckxdZNYnlLpVS7hb5lWU2baQ66k=,KPJMQebU0O24+NzlQ4udb2BXLlKV1Hte61+hV5Xb+oU=
C2AT0OBTUn+PRxEVd39enhW/DJtka1Tk90DUAR6yVdA=,【基礎】雞兔問題 1,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,ICgke8JJv5eapCPwyj1aco8PEtoBkUbTZYIqxmYtqBk=,bo3jsx1beVLEZ+2sckxdZNYnlLpVS7hb5lWU2baQ66k=,KPJMQebU0O24+NzlQ4udb2BXLlKV1Hte61+hV5Xb+oU=
jZvYpEa6VB/WrlKKmQHnfbv/xJ4OypBzq0epVcn500Q=,【基礎】年齡問題 1,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,ICgke8JJv5eapCPwyj1aco8PEtoBkUbTZYIqxmYtqBk=,bo3jsx1beVLEZ+2sckxdZNYnlLpVS7hb5lWU2baQ66k=,KPJMQebU0O24+NzlQ4udb2BXLlKV1Hte61+hV5Xb+oU=
M+UxJPgRIW57a0YS3eik8A9YDj+AwaMpTa5yWYn/kAw=,【基礎】追趕問題,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,ICgke8JJv5eapCPwyj1aco8PEtoBkUbTZYIqxmYtqBk=,bo3jsx1beVLEZ+2sckxdZNYnlLpVS7hb5lWU2baQ66k=,KPJMQebU0O24+NzlQ4udb2BXLlKV1Hte61+hV5Xb+oU=


Let’s have a look at data dimensionality, feature names, and feature types.

In [5]:
info_content_raw_df.shape

(1330, 9)

From the output, we can see that the table contains 1330 rows and 9 columns.

Now let's print the column names using the `columns` attribute:

In [6]:
info_content_raw_df.columns

Index(['content_pretty_name', 'content_kind', 'difficulty', 'subject',
       'learning_stage', 'level1_id', 'level2_id', 'level3_id', 'level4_id'],
      dtype='object')

In [7]:
info_content_raw_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1330 entries, odIwFdIiecFwVUAEEV40K3MSuCSlIZkbq92Zp9tkZq8= to gvez7GFXUbuQl27U5+p/4QwFZZyXP2QFYQdoor8ZkeE=
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   content_pretty_name  1330 non-null   object
 1   content_kind         1330 non-null   object
 2   difficulty           1330 non-null   object
 3   subject              1330 non-null   object
 4   learning_stage       1330 non-null   object
 5   level1_id            1330 non-null   object
 6   level2_id            1330 non-null   object
 7   level3_id            1330 non-null   object
 8   level4_id            1330 non-null   object
dtypes: object(9)
memory usage: 103.9+ KB


`object` is the data type of all our features. The same `info` output also shows whether there are missing values. Here there are none, because each column contains 1330 non-null observations, the same number of rows we saw before with `shape`.
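
Missing values can also be counted directly rather than read off the `info` output. A minimal sketch, using an invented two-column frame (`toy_df`) in place of the real table:

```python
import pandas as pd

# Toy frame with one deliberately missing value; the data is invented.
toy_df = pd.DataFrame(
    {
        "difficulty": ["easy", None, "hard"],
        "subject": ["math", "math", "math"],
    }
)

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy_df.isna().sum()

# True if any column has at least one missing value.
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print(has_missing)
```

Note that this only detects true nulls. A placeholder string such as `unset` in `difficulty` counts as a present value and needs separate handling.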

In [8]:
info_content_raw_df.describe()

,content_pretty_name,content_kind,difficulty,subject,learning_stage,level1_id,level2_id,level3_id,level4_id
count,1330,1330,1330,1330,1330,1330,1330,1330,1330
unique,1320,1,4,1,3,1,10,42,171
top,【基礎】因數與倍數,Exercise,easy,math,elementary,aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=,7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs=,scsWmkZsfmdmD2IzB24sQ1Au1BOXYgQEx9zO3+4glq8=,364ml6jwsO0pO5l86JBpC+KFYvYr7mn7S9gVuhoBnUE=
freq,2,1330,835,1330,784,1330,553,146,18


For categorical (type `object`) features we can use the `value_counts` method. Let's have a look at the distribution of `content_pretty_name`:

In [9]:
info_content_raw_df['content_pretty_name'].value_counts()

【基礎】因數與倍數          2
【一般】函數關係式          2
【基礎】線對稱圖形          2
【進階】因數與倍數          2
【基礎】質數與合數          2
                  ..
【基礎】認識函數的對應關係      1
【進階】運用小數乘法解題       1
【基礎】變換符號，做十字交乘法    1
【進階】公里綜合習題         1
【基礎】被乘數和積的關係       1
Name: content_pretty_name, Length: 1320, dtype: int64

There are 1320 unique names across 1330 rows and the maximum frequency is 2, so ten values of `content_pretty_name` appear twice.
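
To see which rows share a display name, `Series.duplicated(keep=False)` flags every occurrence of a repeated value. The sketch below uses invented names and IDs; only the `content_pretty_name` and `ucid` labels are taken from the dataset.

```python
import pandas as pd

# Toy frame; names and IDs are placeholders, not real Junyi content.
toy_df = pd.DataFrame(
    {
        "content_pretty_name": ["Factors", "Primes", "Factors", "Symmetry"],
        "difficulty": ["easy", "easy", "hard", "normal"],
    },
    index=pd.Index(["u1", "u2", "u3", "u4"], name="ucid"),
)

# keep=False marks all rows of a repeated name, not only the later ones.
dup_mask = toy_df["content_pretty_name"].duplicated(keep=False)
dup_rows = toy_df[dup_mask].sort_values("content_pretty_name")

# Number of distinct names that occur more than once.
n_dup_names = toy_df.loc[dup_mask, "content_pretty_name"].nunique()

print(dup_rows)
print(n_dup_names)
```

Inspecting the flagged rows side by side shows whether the repeated names are genuine duplicates or distinct exercises (for example, differing in difficulty or position in the hierarchy) that share a title.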

In [10]:
info_content_raw_df['content_kind'].value_counts()

Exercise    1330
Name: content_kind, dtype: int64

All 1330 records are of kind `Exercise`, the only kind included in the current dataset release. The `content_kind` column can be dropped since every row has the same value.
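
Constant columns can be found programmatically instead of one at a time. This is a small sketch on an invented frame, assuming we want to drop every column with a single distinct value. In the `describe` output above, `content_kind`, `subject` and `level1_id` all report one unique value.

```python
import pandas as pd

# Toy frame; column names follow the dataset, values are invented.
toy_df = pd.DataFrame(
    {
        "content_kind": ["Exercise", "Exercise", "Exercise"],
        "subject": ["math", "math", "math"],
        "difficulty": ["easy", "normal", "hard"],
    }
)

# A column is constant if it holds exactly one distinct value (NaN counted as a value).
constant_cols = [
    col for col in toy_df.columns if toy_df[col].nunique(dropna=False) == 1
]

# Drop the constant columns; they carry no information for a model.
reduced_df = toy_df.drop(columns=constant_cols)

print(constant_cols)
print(reduced_df.columns.tolist())
```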

In [11]:
info_content_raw_df['difficulty'].value_counts()

easy      835
normal    305
hard      149
unset      41
Name: difficulty, dtype: int64

835 exercises are easy, 305 are normal, 149 are hard, and 41 have their difficulty unset.
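
Since the difficulty labels have a natural order, one option is to store them as an ordered categorical. The sketch below is an illustration only: treating `unset` as missing is an assumption of this example, not something the dataset documentation prescribes, and the sample values are invented.

```python
import pandas as pd

# Toy difficulty column using the four labels seen in the output above.
difficulty = pd.Series(["easy", "hard", "unset", "normal", "easy"])

# Assumption: "unset" carries no ordering information, so it is left out of
# the categories and becomes NaN after the conversion.
difficulty_dtype = pd.CategoricalDtype(
    categories=["easy", "normal", "hard"], ordered=True
)
difficulty_cat = difficulty.astype(difficulty_dtype)

# Ordered categoricals support comparisons and min/max.
harder_than_easy = difficulty_cat > "easy"

print(difficulty_cat)
print(harder_than_easy.tolist())
```

An ordered dtype lets later steps sort or compare by difficulty without a separate lookup table.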

In [12]:
info_content_raw_df['subject'].value_counts()

math    1330
Name: subject, dtype: int64

The subject of all 1330 content records is `math`, the only subject included in the current dataset release.

In [13]:
info_content_raw_df['learning_stage'].value_counts()

elementary    784
junior        543
senior          3
Name: learning_stage, dtype: int64

784 exercises belong to the elementary learning stage, 543 to the junior stage, and only 3 to the senior stage.
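
The imbalance is easier to judge as proportions. The sketch below rebuilds the counts reported above as a Series (rather than reading the CSV) and converts them to shares. On the real column, `value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts output above.
stage_counts = pd.Series(
    {"elementary": 784, "junior": 543, "senior": 3}, name="learning_stage"
)

# Share of each learning stage among all content records.
stage_share = (stage_counts / stage_counts.sum()).round(4)

print(stage_share)
```

The senior stage accounts for well under 1% of the content, so any per-stage statistic for it rests on just three exercises.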

In [14]:
info_content_raw_df['level1_id'].value_counts()

aH0Dz0KdH9gio7rrcGRHvrmd9vcd/0WJbeEFB7qeUKA=    1330
Name: level1_id, dtype: int64

In [15]:
info_content_raw_df['level2_id'].value_counts()

7f73q332BKPBXaixasa4EkUb+pF6VAsLxNIg4506JJs=    553
R81Sqc8LAYj8amTPwFRvoPgbGpdaZoQLNX0hTg0DMB4=    224
xYDz4OEv0xsri1IpmXlrgMLJ848rgySf+39xWpq4DBI=    162
MfUX4BrIuFzJjm97tCQVisXbonyvtYtwCUJo6JpmoyU=    136
1EzKLzTq9Ax8/wlR9cJNrtthvk9lBi/SFdx/4L1PIaE=    115
ICgke8JJv5eapCPwyj1aco8PEtoBkUbTZYIqxmYtqBk=    100
2YwsqJH0U7Zguyun1OaStQsIHbUoYvgJNK0QCGC5BQI=     21
jXSXg7CfDboPEXlnqJTGuQOb0VIgOXCpaU/Sl+/m3n0=     16
5Np4fxxPeBgmNpeEOcXqarZIVsOEzZ1fSssL8cytQAc=      2
rzRcsBurW8jbUhivGAdZozPksRAZ5xM898ohJEBg93g=      1
Name: level2_id, dtype: int64

In [16]:
info_content_raw_df['level3_id'].value_counts()

scsWmkZsfmdmD2IzB24sQ1Au1BOXYgQEx9zO3+4glq8=    146
zM75Dhur9om41RTSUIivWvZ07gckl2Hi0cd3/Kx4sN4=    132
YM8uggtD5HZBGCCzNJLU90fC9C+B8/bZP3x483rb7PI=     97
4cISKCt3nXWe4r6Q8bzjOiL2EYYsZyT6Z0mNNJckqEc=     84
nLqsiSA2CPPgbpIk8GE3OSF94E4F7ogLig3ETQnOw4g=     78
CFq991L5i+mxDSbH+06jz+rWPf+FmW8hT4uxQjzwxpM=     49
/yqeM1FRP1rB9WuQWBkStMqrBQgjEexaeyWIhBC7ov4=     43
Dnl0P09LJllOG2eS2nB9pQax0ZixH7aCXPTKjB6O/dg=     37
xJYDWrjKoWhvx5w88UZja7WZgWHm2o9jmUQR8V911qc=     37
adBGt1t/h6kVwnNoQ/bmRj5KcQLWM8NKNR9HeqJy9ZA=     37
ItasYR+er/FlZlRvL66/NB3wY0AvmlrZKoqe4gmPyD0=     34
mS7VsK5w8e8Fw12vi9+tqcFfvw+I2J9+ot6NAMLgGno=     34
t9kdlRkqrUJUBY1Ill0Lt1UDYGQG08slJLdM3z+3to4=     34
APMXggE9pI1fpPhZYSwqSQGw/eQmPpLFBY7s9oBPWIk=     31
X5gAB0OLUqA8Rp3qGLuJyf6kyY16/nSreFkh68gpFzc=     31
a2w4FdOiBsy7t1fsstano3rtWKyay8qPhPd4U6Er7YQ=     27
hnFTwzVTapuqt9dsYhFFSgSgU0i7cBjbx0itZHp3gXw=     26
bnTxSNli6FmYFjFpnuPi5RDuC2L3MsoN4XIz7sdfdvI=     26
sl7IabghdxfoS9yuKX/dLgRCIZfVrkT6fO1pbXV30cI=     24
1B4NV31TSyP4

In [17]:
info_content_raw_df['level4_id'].value_counts()

364ml6jwsO0pO5l86JBpC+KFYvYr7mn7S9gVuhoBnUE=    18
FpQ/ONFxc3KG9UipcD2MZd0AB2A4C4QihnPIu1ilshs=    16
B7VcZ+GXaXxwo0f9JJv1LLU2KkDNcrDt2cL6+KwnfrA=    16
hq6uCe9NmtCc+0wlbGGIsxegP2cqYAdFebGd+v4/o8Q=    15
SoFA3oYx02KwW6utJm+Op0W9ZI/wGVzDyVDAAZRiRng=    15
                                                ..
UgkV1PV2Qm1SUXkTKRK7ojztDA68vDv33Vg2mv+oWa0=     2
18hZrFpMhRCzG1ntJLj9spv2bCK65XhZBR1+fdEQMaQ=     1
k9h8WnipCeqQzCoBIlopbBZspaDYdQKbtmsv18qlUZM=     1
zyVjBuMRkEs/hVTbayt34VrOAU1KNtk5Tt0EvU+/xhk=     1
du5oJdoBN5kRI9HOeBz42j8tba4SuHf0PmdkaF97Nlg=     1
Name: level4_id, Length: 171, dtype: int64