### Capstone - Parsing Data (Part 2)

The aim of the Capstone project is to create a Chinese stroke-based keyboard for patients to use. The keyboard needs to serve two functions:

1. Predict Chinese characters as the relevant strokes are typed.
2. Predict Chinese phrases (two or more characters) whenever a Chinese character is selected.

This section demonstrates the second function by building the phrase lookup data.

In [1]:
import pandas as pd

Load the character and phrase frequency data.

In [2]:
df = pd.read_csv('cleaned_data.csv')
df

Unnamed: 0,Index,Absolute Count,Frequency/Million words
0,的,4650143,45845.3460
1,我,2841511,28014.2040
2,了,2725964,26875.0365
3,不,1752436,17277.1106
4,是,1648662,16254.0119
5,你,1622660,15997.6605
6,一,1356666,13375.2493
7,在,992131,9781.3312
8,有,906700,8939.0746
9,个,856510,8444.2559


#### Step 1: Split dataframe into single characters (onegram) and multiple characters (multigram)

In [3]:
onegram_df = df[df['Index'].map(len) == 1]
onegram_df

Unnamed: 0,Index,Absolute Count,Frequency/Million words
0,的,4650143,45845.3460
1,我,2841511,28014.2040
2,了,2725964,26875.0365
3,不,1752436,17277.1106
4,是,1648662,16254.0119
5,你,1622660,15997.6605
6,一,1356666,13375.2493
7,在,992131,9781.3312
8,有,906700,8939.0746
9,个,856510,8444.2559
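The length-based split can be tried on a small toy frame first. The sketch below is illustrative only: `toy`, `is_single`, `onegram` and `multigram` are names chosen for this example, and the four rows are copied from the tables in this section. Calling `.copy()` on each filtered frame is one common way to make later column assignments safe from pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Toy frame with four rows taken from the frequency data shown in this section.
toy = pd.DataFrame({
    "Index": ["的", "我", "自己", "分享"],
    "Absolute Count": [4650143, 2841511, 290637, 222292],
})

# Boolean mask: True where the entry is a single character.
is_single = toy["Index"].str.len() == 1

# .copy() gives each subset its own data, so edits do not touch `toy`.
onegram = toy[is_single].copy()
multigram = toy[~is_single].copy()

# Adding a column to the copy leaves the original frame unchanged.
multigram["first_char"] = multigram["Index"].str[0]

print(onegram["Index"].tolist())
print(multigram["first_char"].tolist())
```

With the copies in place, the original `toy` frame keeps only its two columns.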


We don't need the frequency columns in the one-gram dataframe, so we drop them.

In [4]:
# Reassign instead of using inplace=True on a filtered frame; this avoids SettingWithCopyWarning.
onegram_df = onegram_df.drop(columns=['Absolute Count', 'Frequency/Million words'])


In [5]:
onegram_df.head(5)

Unnamed: 0,Index
0,的
1,我
2,了
3,不
4,是


In [6]:
multigram_df = df[df['Index'].map(len) > 1]
multigram_df

Unnamed: 0,Index,Absolute Count,Frequency/Million words
34,自己,290637,2865.3643
49,分享,222292,2191.5570
51,我们,213466,2104.5423
52,今天,206840,2039.2172
53,一下,202261,1994.0732
54,什么,202206,1993.5310
59,获得,193248,1905.2148
64,没有,179864,1773.2632
66,可以,175352,1728.7798
73,超过,157159,1549.4166


#### Step 2: Create the required columns and drop the unneeded ones

In [7]:
# Build the new columns with assign() and reassign the result,
# which avoids SettingWithCopyWarning on the filtered frame.
multigram_df = multigram_df.assign(
    first_char=multigram_df['Index'].str[0],        # first character of the phrase
    subsequent_char=multigram_df['Index'].str[1:],  # remaining characters
).drop(columns=['Index', 'Absolute Count'])


In [8]:
# Encode each phrase remainder with its frequency as "<chars>:<frequency>".
freq_str = multigram_df['Frequency/Million words'].round(4).astype(str)
multigram_df = multigram_df.assign(phrase_freq=multigram_df['subsequent_char'] + ':' + freq_str)
multigram_df = multigram_df.drop(columns=['Frequency/Million words', 'subsequent_char'])


In [9]:
multigram_df.head(5)

Unnamed: 0,first_char,phrase_freq
34,自,己:2865.3643
49,分,享:2191.557
51,我,们:2104.5423
52,今,天:2039.2172
53,一,下:1994.0732


In [10]:
multigram_df['first_char'].value_counts()

大    219
一    130
小    121
不    109
无    108
天     94
上     89
心     88
发     83
开     83
好     83
人     82
打     81
出     81
老     79
高     70
有     69
自     68
中     67
相     60
新     57
下     57
回     54
微     54
海     52
年     52
金     50
看     48
长     48
公     48
    ... 
拯      1
箭      1
蒲      1
姊      1
茄      1
俱      1
勉      1
禽      1
嘱      1
梓      1
芯      1
噱      1
撤      1
潇      1
辫      1
苹      1
鲸      1
哑      1
拦      1
琉      1
迹      1
扇      1
贡      1
囍      1
蟑      1
汹      1
赣      1
椰      1
扒      1
荤      1
Name: first_char, Length: 2823, dtype: int64

#### Step 3: Group by first_char and combine all phrase_freq entries

In [11]:
firstchar_group = multigram_df.groupby(['first_char']).agg({'phrase_freq':lambda x: ','.join(x)})

In [12]:
firstchar_group.head(20)

first_char,phrase_freq
一,"下:1994.0732,起:1484.121,样:702.8023,直:548.4808,定..."
丁,丁:4.3281
七,"十二:15.7743,点:12.5208,点半:6.566,十七:4.89,彩:3.5788..."
万,"事如意:66.5476,物:14.6405,一:12.5603,事:10.9138,历:8...."
丈,"夫:32.1302,母娘:5.452"
三,"十:60.3858,星:53.1888,尺:24.4501,点:24.0557,亚:20.4..."
上,"班:402.1548,海:212.8835,网:111.5043,传:80.1037,帝:7..."
下,"载:934.6248,午:283.7296,来:178.6533,去:176.6717,雨:..."
不,"过:433.2301,管:415.4939,错:341.5915,断:155.6131,同:..."
与,"众不同:7.8871,否:5.8661,其:3.8647,时俱进:2.1098,世隔绝:1.745"


Great. Now we have the information we need: for each first character, we can see the phrases it starts and their frequencies.

For example, 一下 has a frequency of 1994.0732 per million words and 一起 has a frequency of 1484.121.

Now we just have to transform the phrase_freq column into a dictionary (hash map) and save the result as JSON.

In [13]:
def extract_phrase(row):
    """Turn 'a:1.0,b:2.0' into {'a': '1.0', 'b': '2.0'}; frequencies stay strings."""
    multigram_json = {}
    for phrase_freq in row['phrase_freq'].split(','):
        # The frequency follows the last colon, so split once from the right.
        phrase, freq = phrase_freq.rsplit(':', 1)
        multigram_json[phrase] = freq
    return multigram_json

firstchar_group['dict'] = firstchar_group.apply(extract_phrase, axis=1)
firstchar_group = firstchar_group.drop(columns=['phrase_freq']).reset_index()

In [17]:
firstchar_group.head(10)

Unnamed: 0,first_char,dict
0,一,"{'下': '1994.0732', '起': '1484.121', '样': '702...."
1,丁,{'丁': '4.3281'}
2,七,"{'十二': '15.7743', '点': '12.5208', '点半': '6.566..."
3,万,"{'事如意': '66.5476', '物': '14.6405', '一': '12.56..."
4,丈,"{'夫': '32.1302', '母娘': '5.452'}"
5,三,"{'十': '60.3858', '星': '53.1888', '尺': '24.4501..."
6,上,"{'班': '402.1548', '海': '212.8835', '网': '111.5..."
7,下,"{'载': '934.6248', '午': '283.7296', '来': '178.6..."
8,不,"{'过': '433.2301', '管': '415.4939', '错': '341.5..."
9,与,"{'众不同': '7.8871', '否': '5.8661', '其': '3.8647'..."
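Encoding each phrase as a `chars:frequency` string and parsing it back works, but it would break if a phrase ever contained `:` or `,`, and it leaves the frequencies as strings. A possible alternative, sketched here on a toy frame and not the method used in this notebook, builds the nested dictionary directly from the grouped rows and keeps the frequencies as floats. The names `toy`, `rest` and `nested` are assumptions made for this example; the four rows are copied from the outputs above.

```python
import pandas as pd

# Toy frame with four phrases and frequencies taken from the outputs above.
toy = pd.DataFrame({
    "Index": ["一下", "一起", "丈夫", "丈母娘"],
    "Frequency/Million words": [1994.0732, 1484.121, 32.1302, 5.452],
})

# Split each phrase into its first character and the remaining characters.
toy = toy.assign(first_char=toy["Index"].str[0], rest=toy["Index"].str[1:])

# One inner dict per first character: {remaining chars: frequency as float}.
nested = {
    first_char: dict(zip(group["rest"], group["Frequency/Million words"].tolist()))
    for first_char, group in toy.groupby("first_char", sort=False)
}

print(nested)
```

Because `.tolist()` returns plain Python floats, a dictionary built this way can be passed to `json.dump` without further conversion.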


In [18]:
# Map each first character to its {phrase remainder: frequency} dictionary.
json_data = dict(zip(firstchar_group['first_char'], firstchar_group['dict']))

In [19]:
json_data

{'一': {'下': '1994.0732',
  '起': '1484.121',
  '样': '702.8023',
  '直': '548.4808',
  '定': '520.7773',
  '切': '436.2174',
  '点': '312.9317',
  '些': '238.3096',
  '半': '130.4728',
  '举': '126.0067',
  '般': '116.0985',
  '路': '113.2296',
  '大早': '87.9809',
  '唿百应': '62.2097',
  '边': '61.3619',
  '会儿': '52.5677',
  '早': '48.5551',
  '会': '48.5354',
  '旦': '40.9046',
  '点点': '36.2414',
  '百': '33.7471',
  '一': '32.9583',
  '生': '29.7246',
  '手': '25.7416',
  '汽': '24.3121',
  '时': '23.6318',
  '下子': '22.4882',
  '辈子': '19.639',
  '盘': '19.5995',
  '同': '18.5643',
  '如既往': '17.1742',
  '月': '16.7503',
  '万': '16.2869',
  '笔': '15.3799',
  '致': '15.0447',
  '再': '14.8278',
  '千': '14.3743',
  '鸣惊人': '13.9011',
  '共': '13.4771',
  '齐': '13.1222',
  '味': '12.8264',
  '心': '12.2941',
  '个': '11.949',
  '流': '11.7518',
  '体': '10.7462',
  '塌煳涂': '10.1843',
  '口气': '9.5631',
  '石二鸟': '8.7843',
  '向': '8.5871',
  '点儿': '8.5871',
  '言九鼎': '8.4195',
  '无所有': '7.5519',
  '度': '7.404',
  '个人': '7.3942',

The results are as expected. Now save the data as multigram.json.

In [20]:
import json

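# ensure_ascii=False keeps the Chinese characters readable in the file instead of \uXXXX escapes.
with open('multigram.json', 'w', encoding='utf-8') as outfile:
    json.dump(json_data, outfile, ensure_ascii=False)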
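
To illustrate how the saved mapping could serve the keyboard's second function, here is a minimal sketch of a lookup that returns the most frequent phrases for a selected character. The function `suggest_phrases` and the in-memory `sample` dictionary are hypothetical and not part of the project code; `sample` mirrors the structure of `json_data` with a few entries from the output above. Because the frequencies are stored as strings, they are converted to floats before sorting; a plain string sort would rank '702.8023' above '1994.0732'.

```python
def suggest_phrases(table, first_char, k=3):
    """Return up to k phrases starting with first_char, most frequent first."""
    candidates = table.get(first_char, {})
    # Frequencies are stored as strings, so compare them numerically.
    ranked = sorted(candidates.items(), key=lambda item: float(item[1]), reverse=True)
    return [first_char + rest for rest, _ in ranked[:k]]


# Small sample in the same shape as json_data (values copied from the output above).
sample = {
    "一": {"样": "702.8023", "下": "1994.0732", "起": "1484.121"},
    "丈": {"夫": "32.1302", "母娘": "5.452"},
}

suggestions = suggest_phrases(sample, "一", k=2)
print(suggestions)  # ['一下', '一起']
```

A character with no entry in the mapping simply yields an empty list, so the keyboard can fall back to showing no phrase suggestions.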