### Capstone - Parsing Data (Part 1)

In this demonstration, I will parse the relevant data for the application that the Capstone team has built.

The aim of the Capstone project is to create a Chinese stroke-based keyboard for patients to use. The keyboard needs to serve two functions.

1. Predict Chinese characters as the relevant strokes are typed.
2. Predict Chinese phrases (two or more Chinese characters) whenever a Chinese character is selected.

This section demonstrates the first function.

In [1]:
import pandas as pd
import json

Load the character frequency data.

In [2]:
df = pd.read_csv('cleaned_data.csv')
df

Unnamed: 0,Index,Absolute Count,Frequency/Million words
0,的,4650143,45845.3460
1,我,2841511,28014.2040
2,了,2725964,26875.0365
3,不,1752436,17277.1106
4,是,1648662,16254.0119
5,你,1622660,15997.6605
6,一,1356666,13375.2493
7,在,992131,9781.3312
8,有,906700,8939.0746
9,个,856510,8444.2559


#### Step 1: Getting one-grams (single characters) from the dataframe

In [3]:
onegram_df = df[df['Index'].map(len) == 1]
onegram_df

Unnamed: 0,Index,Absolute Count,Frequency/Million words
0,的,4650143,45845.3460
1,我,2841511,28014.2040
2,了,2725964,26875.0365
3,不,1752436,17277.1106
4,是,1648662,16254.0119
5,你,1622660,15997.6605
6,一,1356666,13375.2493
7,在,992131,9781.3312
8,有,906700,8939.0746
9,个,856510,8444.2559


#### Step 2: Make a character stroke dataframe

In [4]:
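# The stroke file is tab-separated and has no header row, so column names are supplied explicitly.
stroke = pd.read_csv(
    'character_strokes.txt',
    sep='\t',
    header=None,
    names=['index', 'character', 'num_stroke', 'stroke_order', 'code'],
)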

In [5]:
stroke.head(5)

Unnamed: 0,index,character,num_stroke,stroke_order,code
0,1,一,1,1,300
1,11,二,2,11,300
2,55,三,3,111,300
3,153,亖,4,1111,300
4,688,弎,6,111154,300


In [6]:
stroke.drop(columns=['index','code'],inplace=True)

#### Step 3: Combine stroke dataframe and character frequency dataframe, only keeping characters existing in both dataframes

In [7]:
onegram_df = onegram_df.merge(stroke,how='inner',left_on='Index',right_on='character')

In [8]:
onegram_df.drop(columns=['character'],inplace=True)

In [9]:
onegram_df

Unnamed: 0,Index,Absolute Count,Frequency/Million words,num_stroke,stroke_order
0,的,4650143,45845.3460,8,32511354
1,我,2841511,28014.2040,7,3121534
2,了,2725964,26875.0365,2,52
3,不,1752436,17277.1106,4,1324
4,是,1648662,16254.0119,9,251112134
5,你,1622660,15997.6605,7,3235234
6,一,1356666,13375.2493,1,1
7,在,992131,9781.3312,6,132121
8,有,906700,8939.0746,6,132511
9,个,856510,8444.2559,3,342


The merged dataframe has the same number of rows (1943) as `onegram_df` had before the merge, which means the one-grams in the character frequency dataset are a subset of the characters in the stroke order dataset.
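The row-count comparison can also be made explicit. The sketch below is one possible check, not part of the original notebook: it uses small toy frames (`freq` and `strokes`, hypothetical names, with values copied from the tables above) and a left merge with `indicator=True` to list any character that has no stroke entry.

```python
import pandas as pd

# Toy stand-ins for onegram_df and stroke (names are hypothetical;
# the values are copied from the tables shown earlier).
freq = pd.DataFrame({
    'Index': ['的', '我', '了'],
    'Absolute Count': [4650143, 2841511, 2725964],
})
strokes = pd.DataFrame({
    'character': ['一', '的', '我', '了'],
    'stroke_order': ['1', '32511354', '3121534', '52'],
})

# Duplicate characters in the stroke table would inflate the merged row
# count, so uniqueness is checked before comparing row counts.
strokes_unique = strokes['character'].is_unique

# A left merge keeps every frequency row; indicator=True adds a `_merge`
# column whose value is 'left_only' for rows with no match in `strokes`.
checked = freq.merge(
    strokes, how='left', left_on='Index', right_on='character', indicator=True
)
missing = checked.loc[checked['_merge'] == 'left_only', 'Index'].tolist()

print(strokes_unique, missing)
```

An empty `missing` list together with a unique `character` column is what justifies reading equal row counts as a subset relationship.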

Great. Now we have a dataframe of single-character information. When the user types strokes, the application can match them against the 'stroke_order' column. Then, using a ranking algorithm based on metrics such as character frequency, it can suggest the most commonly used characters that share the typed stroke order.
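To make the lookup idea concrete, here is a minimal sketch in plain Python. It rests on assumptions the notebook does not state: that candidates are found by prefix match on the stroke sequence and that they are ranked by 'Absolute Count' alone. The function name `predict` and the `limit` parameter are hypothetical; the sample values come from the table above.

```python
# Sample records keyed by character, with values taken from the merged table.
onegram = {
    '的': {'Absolute Count': 4650143, 'stroke_order': 32511354},
    '我': {'Absolute Count': 2841511, 'stroke_order': 3121534},
    '不': {'Absolute Count': 1752436, 'stroke_order': 1324},
    '你': {'Absolute Count': 1622660, 'stroke_order': 3235234},
    '在': {'Absolute Count': 992131, 'stroke_order': 132121},
    '有': {'Absolute Count': 906700, 'stroke_order': 132511},
    '个': {'Absolute Count': 856510, 'stroke_order': 342},
}


def predict(typed_strokes, records, limit=5):
    """Return up to `limit` characters whose stroke order starts with
    `typed_strokes`, most frequent first (assumed ranking rule)."""
    matches = [
        (char, info['Absolute Count'])
        for char, info in records.items()
        # stroke_order may be stored as a number, so compare as text.
        if str(info['stroke_order']).startswith(typed_strokes)
    ]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    return [char for char, _ in matches[:limit]]


print(predict('3', onegram))    # ['的', '我', '你', '个']
print(predict('132', onegram))  # ['不', '在', '有']
```

A linear scan like this is fine for a couple of thousand characters; a larger table might call for a prefix tree, which is outside the scope of this section.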

#### Step 4: Parse the data into a JSON file (hash map) called onegram, keyed by character

In [10]:
onegram_df = onegram_df.set_index('Index')

In [11]:
onegram_df.head(5)

Index,Absolute Count,Frequency/Million words,num_stroke,stroke_order
的,4650143,45845.346,8,32511354
我,2841511,28014.204,7,3121534
了,2725964,26875.0365,2,52
不,1752436,17277.1106,4,1324
是,1648662,16254.0119,9,251112134


In [12]:
onegram_df.to_json('onegram.json',orient='index')
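With `orient='index'`, each index value becomes a top-level key that maps to an object of column values. The sketch below shows that shape using the standard `json` module. The `sample` string is a hand-written excerpt based on the first two rows above, not the actual file contents, and it assumes `stroke_order` was parsed as an integer column.

```python
import json

# Hand-written excerpt in the shape produced by orient='index'.
# In the real workflow this would come from:
#     with open('onegram.json', encoding='utf-8') as f:
#         onegram = json.load(f)
sample = '''
{
  "的": {"Absolute Count": 4650143, "Frequency/Million words": 45845.346,
         "num_stroke": 8, "stroke_order": 32511354},
  "我": {"Absolute Count": 2841511, "Frequency/Million words": 28014.204,
         "num_stroke": 7, "stroke_order": 3121534}
}
'''
onegram = json.loads(sample)

# Each character is a key, so a lookup is a plain dictionary access.
record = onegram['的']
print(record['num_stroke'], record['stroke_order'])

# Numeric stroke orders need converting to text before prefix comparisons.
stroke_text = str(record['stroke_order'])
print(stroke_text.startswith('325'))
```

If the application needs stroke sequences as text, passing `dtype={'stroke_order': str}` to `pd.read_csv` when loading the stroke file would store them as strings in the JSON instead.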