# ETL of Data, part 2: 

In the last notebook, data was extracted from various sources. 

Now, these sources, need to be transformed into a suitable schema
for the application to run off it


## Goals and Objectives

Goal: Transform the data so that it can fit into a schema that the app will run off

* Transform the Characters into a json format that contains their
name, pinyin, 
Goal, create a schema for the dataset by transforming the initial
data that was extracted from various sources


Objectives: 

### Radicals

* _ID
* Radical
* 
* meaning
* hema codes

### Characters: 

* _ID
* char 
* meaning as list
* radicals ID
* words containing char (ids)
* phrases containing char
* other radical forms
* stroke number
* frequency
* hsk level 

Link to char decomp table


### Character Decomposition

* _ID
* Char Decomp Tree


Most Common words with radical (Embedded)



#### Setup Code

In [1]:
#import statements

from pathlib import Path

import pandas as pd
import numpy as np
import re
# Mongodb Client

import pymongo 
import json
from bson.objectid import ObjectId

# Global Variables

DATADIR = Path().cwd()/'..'/'data'/'extracted'

In [2]:
# Connecting to database

client = pymongo.MongoClient("mongodb://localhost:27017")

db = client['CCRS']

## Part I. Radicals

DATADIR

In [3]:
RadicalsCollection = db['Radicals']

In [4]:
radicalsDF = pd.read_csv(DATADIR/'Radicals.csv',index_col=0)
radicalsDF.shape 

(214, 10)

In [5]:
radicalsDF.head()

Unnamed: 0,number,radical,variants,simplifiedradical,pinyin,english,strokecount,char,ucn,kDefinition
0,1,一,,,yi1,one,1,一,U+4E00,"one; a, an; alone"
1,2,丨,,,gun3,line,1,丨,U+4E28,number one; line; Kangxi radical 2
2,4,丿,"乀 (fu2), 乁(yi2)",,pie3,slash,1,丿,U+4E3F,line; Kangxi radical 4
3,5,乙,"乚 (yin3), 乛",,yi4,second,1,乙,U+4E59,second; 2nd heavenly stem
4,6,亅,,,jue2,hook,1,亅,U+4E85,hook; Kangxi radical 6


#### Objective 1 BSON ID

In [6]:
# Assign the BSON ID To the dataframe
radicalsDF['objectids'] = radicalsDF['radical'].map(lambda x: ObjectId())



#### Objective 2: Meaning and definitions

In [7]:
# English Category Clean up 
# Remove whitespace formatting
radicalsDF['english']  = radicalsDF.english.str.replace('\xad','')

# Cleaning up the list of definitions 

# Needs to deal with the nested list, expanding it out into a table
radicalsDF['Meaning'] = radicalsDF['kDefinition'].str.split(';')
meaningDF = radicalsDF['Meaning'].apply(pd.Series).copy()
meaningDF.head()

Unnamed: 0,0,1,2,3,4
0,one,"a, an",alone,,
1,number one,line,Kangxi radical 2,,
2,line,Kangxi radical 4,,,
3,second,2nd heavenly stem,,,
4,hook,Kangxi radical 6,,,


In [8]:
meaningDF['english'] = radicalsDF['english']

In [9]:
# Removing the extra information about Kangxi Radicals
for i in range(5):
    meaningDF[i] = np.where(meaningDF[i].str.contains('Kangxi'),np.NaN,meaningDF[i])
    meaningDF[i] = meaningDF[i].str.strip()

In [10]:
meaningDF

Unnamed: 0,0,1,2,3,4,english
0,one,"a, an",alone,,,one
1,number one,line,,,,line
2,line,,,,,slash
3,second,2nd heavenly stem,,,,second
4,hook,,,,,hook
...,...,...,...,...,...,...
209,"even, uniform, of equal length",,,,,even
210,teeth,"gears, cogs",age,,,tooth
211,dragon,,,,,dragon
212,turtle or tortoise,cuckold,,,,turtle


In [11]:
# This line is to unpack the definitions even further, with the goal of
# unpacking the nested lists inside of the nested lists

# Populating an empty array
meaningDF['idx'] = np.NaN

# Recurses through each column, adding where it iis found  
for i in range(5):
    meaningDF['idx'] = np.where(meaningDF['english'] == meaningDF[i],i,meaningDF['idx'])


# Checking for redundant definitions
secondaryCheckIdx = meaningDF['idx'].isnull()



In [12]:
secondaryCheckIdx

0      False
1      False
2       True
3      False
4      False
       ...  
209     True
210     True
211    False
212     True
213    False
Name: idx, Length: 214, dtype: bool

In [13]:
# Unpacking Level 2 nested list of definitions, checking for matches
#meaningDF[meaningDF[4].str.contains(',') == True]


commaMeanings0 = meaningDF[secondaryCheckIdx][0].str.split(', | or ').apply(pd.Series)
#print(commaMeanings0.shape[1])
commaMeanings1 = meaningDF[secondaryCheckIdx][1].str.split(', | or ').apply(pd.Series)
#print(commaMeanings1.shape[0])
# Merging two nested lists together in order to check for matching words that indicate redudant information 
commaMeanings = pd.merge(commaMeanings0,commaMeanings1,how='outer',on=commaMeanings0.index).drop('key_0',axis=1)

# Makes possible to iterate through each
commaMeanings.columns = range(commaMeanings.shape[1])

commaMeanings['single_word_def_is_redundant'] = np.NaN
commaMeanings['english'] = meaningDF[secondaryCheckIdx].english.reset_index(drop=True)

for i in range(commaMeanings.shape[1] -2 ): # -2 for index column and english column
    commaMeanings['single_word_def_is_redundant'] = np.where(commaMeanings['english'] == commaMeanings[i], i, commaMeanings['single_word_def_is_redundant'])

commaMeanings['merge_idx'] =  meaningDF[secondaryCheckIdx].index

In [14]:
meaningDF = pd.merge(meaningDF,commaMeanings[['merge_idx','single_word_def_is_redundant']],how='left',left_on=meaningDF.index,right_on='merge_idx').drop('merge_idx',axis=1)
#meaningDF.shape

In [15]:
#meaningDF.shape

In [16]:
meaningDF['english'] = np.where(meaningDF['single_word_def_is_redundant'].isnull() & meaningDF['idx'].isnull(),meaningDF['english'],np.NaN)

In [17]:
meaningDF = meaningDF[['english',0,1,2,3,4]]

In [18]:
meaningDF

Unnamed: 0,english,0,1,2,3,4
0,,one,"a, an",alone,,
1,,number one,line,,,
2,slash,line,,,,
3,,second,2nd heavenly stem,,,
4,,hook,,,,
...,...,...,...,...,...,...
209,,"even, uniform, of equal length",,,,
210,tooth,teeth,"gears, cogs",age,,
211,,dragon,,,,
212,,turtle or tortoise,cuckold,,,


In [19]:
radicalsDF

Unnamed: 0,number,radical,variants,simplifiedradical,pinyin,english,strokecount,char,ucn,kDefinition,objectids,Meaning
0,1,一,,,yi1,one,1,一,U+4E00,"one; a, an; alone",6381c2049be015fd65351ea3,"[one, a, an, alone]"
1,2,丨,,,gun3,line,1,丨,U+4E28,number one; line; Kangxi radical 2,6381c2049be015fd65351ea4,"[number one, line, Kangxi radical 2]"
2,4,丿,"乀 (fu2), 乁(yi2)",,pie3,slash,1,丿,U+4E3F,line; Kangxi radical 4,6381c2049be015fd65351ea5,"[line, Kangxi radical 4]"
3,5,乙,"乚 (yin3), 乛",,yi4,second,1,乙,U+4E59,second; 2nd heavenly stem,6381c2049be015fd65351ea6,"[second, 2nd heavenly stem]"
4,6,亅,,,jue2,hook,1,亅,U+4E85,hook; Kangxi radical 6,6381c2049be015fd65351ea7,"[hook, Kangxi radical 6]"
...,...,...,...,...,...,...,...,...,...,...,...,...
209,210,齊,,齐,qi2,even,14,齊,U+9F4A,"even, uniform, of equal length; Kangxi radical...",6381c2049be015fd65351f74,"[even, uniform, of equal length, Kangxi radic..."
210,211,齒,,齿,chi3,tooth,15,齒,U+9F52,"teeth; gears, cogs; age; Kangxi radical 211",6381c2049be015fd65351f75,"[teeth, gears, cogs, age, Kangxi radical 211]"
211,212,龍,,龙,long2,dragon,16,龍,U+9F8D,dragon; Kangxi radical 212,6381c2049be015fd65351f76,"[dragon, Kangxi radical 212]"
212,213,龜,,龟,gui1,turtle,16,龜,U+9F9C,turtle or tortoise; cuckold; Kangxi radical 213,6381c2049be015fd65351f77,"[turtle or tortoise, cuckold, Kangxi radical..."


In [20]:
radicalsDF['Meaning'] = meaningDF.apply(lambda x: ', '.join(x.dropna()), axis=1)
radicalsDF['Meaning'] = '[' + radicalsDF['Meaning'] + ']'

In [21]:
radicalsDF.drop(['kDefinition','english'],axis=1)

Unnamed: 0,number,radical,variants,simplifiedradical,pinyin,strokecount,char,ucn,objectids,Meaning
0,1,一,,,yi1,1,一,U+4E00,6381c2049be015fd65351ea3,"[one, a, an, alone]"
1,2,丨,,,gun3,1,丨,U+4E28,6381c2049be015fd65351ea4,"[number one, line]"
2,4,丿,"乀 (fu2), 乁(yi2)",,pie3,1,丿,U+4E3F,6381c2049be015fd65351ea5,"[slash, line]"
3,5,乙,"乚 (yin3), 乛",,yi4,1,乙,U+4E59,6381c2049be015fd65351ea6,"[second, 2nd heavenly stem]"
4,6,亅,,,jue2,1,亅,U+4E85,6381c2049be015fd65351ea7,[hook]
...,...,...,...,...,...,...,...,...,...,...
209,210,齊,,齐,qi2,14,齊,U+9F4A,6381c2049be015fd65351f74,"[even, uniform, of equal length]"
210,211,齒,,齿,chi3,15,齒,U+9F52,6381c2049be015fd65351f75,"[tooth, teeth, gears, cogs, age]"
211,212,龍,,龙,long2,16,龍,U+9F8D,6381c2049be015fd65351f76,[dragon]
212,213,龜,,龟,gui1,16,龜,U+9F9C,6381c2049be015fd65351f77,"[turtle or tortoise, cuckold]"


#### merging traditional and simplified radicals

In [22]:
radicalsDF['simplifiedradical'].fillna(radicalsDF['radical'],inplace=True)

In [23]:
# Collecting instances where there is a traditional radical

radicalsDF['traditional'] = np.where(radicalsDF['simplifiedradical'] != radicalsDF['radical'],radicalsDF['radical'],np.NaN)

#### Objective 4, assign hema codes

These codes are useful in factorizing the chinese language with a system that is far less complex than wubi, 
but allows for one to reduce the cardinality of the radicals tenfold, down from 214, into 25 categories.


In [24]:
#hemaDF = pd.read_csv(DATADIR/'HemaCodes.csv',index_col=0)
#hemaDF.shape

In [25]:
# Loading a file which was created using a combination of work from 
# The previous notebook, and manual inspection using vim to fill the empty 
# spaces

hemaRadicals = pd.read_csv(DATADIR/'hemaRadicalsCodes.csv',index_col=0)
hemaRadicals.shape

# 

(214, 2)

In [26]:
# Data integrity Check
radicalsDF['simplifiedradical'].isin(hemaRadicals['simplifiedradical']).value_counts()

True    214
Name: simplifiedradical, dtype: int64

In [27]:
radicalsDF = pd.merge(radicalsDF,hemaRadicals,how='left',left_on='simplifiedradical',right_on='simplifiedradical',)

In [28]:
# Checking integrity
radicalsDF.head()

Unnamed: 0,number,radical,variants,simplifiedradical,pinyin,english,strokecount,char,ucn,kDefinition,objectids,Meaning,traditional,code
0,1,一,,一,yi1,one,1,一,U+4E00,"one; a, an; alone",6381c2049be015fd65351ea3,"[one, a, an, alone]",,11
1,2,丨,,丨,gun3,line,1,丨,U+4E28,number one; line; Kangxi radical 2,6381c2049be015fd65351ea4,"[number one, line]",,21
2,4,丿,"乀 (fu2), 乁(yi2)",丿,pie3,slash,1,丿,U+4E3F,line; Kangxi radical 4,6381c2049be015fd65351ea5,"[slash, line]",,41
3,5,乙,"乚 (yin3), 乛",乙,yi4,second,1,乙,U+4E59,second; 2nd heavenly stem,6381c2049be015fd65351ea6,"[second, 2nd heavenly stem]",,11
4,6,亅,,亅,jue2,hook,1,亅,U+4E85,hook; Kangxi radical 6,6381c2049be015fd65351ea7,[hook],,21


### Objective 5: create JSON Schema

Take the data that I need, and put it in json

In [29]:
# Changing Variants col to list type 
# for json schema

radicalsDF['variants'] = np.where(~radicalsDF['variants'].isna(),'[' + radicalsDF['variants'] + ']',False)

In [30]:
radicalsDF.sample(5)

Unnamed: 0,number,radical,variants,simplifiedradical,pinyin,english,strokecount,char,ucn,kDefinition,objectids,Meaning,traditional,code
200,201,黃,False,黃,huang2,yellow,12,黃,U+9EC3,yellow; surname; Kangxi radical 201,6381c2049be015fd65351f6b,"[yellow, surname]",,33 25 43
15,16,几,False,几,ji1,table,2,几,U+51E0,small table,6381c2049be015fd65351eb2,"[table, small table]",,44
117,133,至,False,至,zhi4,arrive,6,至,U+81F3,"reach, arrive; extremely, very; Kangxi radical...",6381c2049be015fd65351f18,"[reach, arrive, extremely, very]",,11 43 32
139,141,虍,False,虍,hu1,tiger,6,虍,U+864D,tiger; Kangxi radical 141,6381c2049be015fd65351f2e,[tiger],,45 44
48,50,巾,False,巾,jin1,turban,3,巾,U+5DFE,kerchief; towel; turban; Kangxi radical 50,6381c2049be015fd65351ed3,"[kerchief, towel, turban]",,23 21


In [31]:
radicalsExportDF = radicalsDF[['objectids','simplifiedradical','number','Meaning','pinyin','variants','traditional','code']].copy()
radicalsExportDF.shape

(214, 8)

In [32]:
radicalsExportDF.columns = np.array(['_id','radical','number','meaning','pinyin','variants','traditional','hemaCode'])


In [33]:
radicalsExportDF.head()

Unnamed: 0,_id,radical,number,meaning,pinyin,variants,traditional,hemaCode
0,6381c2049be015fd65351ea3,一,1,"[one, a, an, alone]",yi1,False,,11
1,6381c2049be015fd65351ea4,丨,2,"[number one, line]",gun3,False,,21
2,6381c2049be015fd65351ea5,丿,4,"[slash, line]",pie3,"[乀 (fu2), 乁(yi2)]",,41
3,6381c2049be015fd65351ea6,乙,5,"[second, 2nd heavenly stem]",yi4,"[乚 (yin3), 乛]",,11
4,6381c2049be015fd65351ea7,亅,6,[hook],jue2,False,,21


In [34]:
#debugging 

#not variants
# is a problem with the ID not being able to encode

radicalsExportDF.drop('_id',axis=1, inplace=True)

# Trying a new ID
# radicalsExportDF['_id'] = radicalsDF['radical'].map(lambda x: ObjectId())


In [35]:
# Exporting  to JSON Format, dropping NA Values

radicalJSON = radicalsExportDF.to_json(orient='records')

In [36]:


def remove_empty_elements(d):
    """recursively remove empty lists, empty dicts, or None elements from a dictionary"""

    def empty(x):
        return x is None or x == {} or x == []

    if not isinstance(d, (dict, list)):
        return d
    elif isinstance(d, list):
        return [v for v in (remove_empty_elements(v) for v in d) if not empty(v)]
    else:
        return {k: v for k, v in ((k, remove_empty_elements(v)) for k, v in d.items()) if not empty(v)}

In [37]:
radicalJSON = json.loads(radicalJSON)

In [38]:
radicalJSON = remove_empty_elements(radicalJSON)

In [39]:
radicalsExportDF['variants']

0                  False
1                  False
2      [乀 (fu2), 乁(yi2)]
3          [乚 (yin3), 乛]
4                  False
             ...        
209                False
210                False
211                False
212                False
213                False
Name: variants, Length: 214, dtype: object

In [40]:
radicalJSON[118]

{'radical': '竹',
 'number': 118,
 'meaning': '[bamboo, flute]',
 'pinyin': 'zhu2',
 'variants': '[⺮]',
 'hemaCode': '41 13 41'}

In [41]:
radicalJSON

[{'radical': '一',
  'number': 1,
  'meaning': '[one, a, an, alone]',
  'pinyin': 'yi1',
  'variants': False,
  'hemaCode': '11'},
 {'radical': '丨',
  'number': 2,
  'meaning': '[number one, line]',
  'pinyin': 'gun3',
  'variants': False,
  'hemaCode': '21'},
 {'radical': '丿',
  'number': 4,
  'meaning': '[slash, line]',
  'pinyin': 'pie3',
  'variants': '[乀 (fu2), 乁(yi2)]',
  'hemaCode': '41'},
 {'radical': '乙',
  'number': 5,
  'meaning': '[second, 2nd heavenly stem]',
  'pinyin': 'yi4',
  'variants': '[乚 (yin3), 乛]',
  'hemaCode': '11'},
 {'radical': '亅',
  'number': 6,
  'meaning': '[hook]',
  'pinyin': 'jue2',
  'variants': False,
  'hemaCode': '21'},
 {'radical': '丶',
  'number': 3,
  'meaning': '[dot]',
  'pinyin': 'zhu3',
  'variants': False,
  'hemaCode': '51'},
 {'radical': '二',
  'number': 7,
  'meaning': '[two, twice]',
  'pinyin': 'er4',
  'variants': False,
  'hemaCode': '11 11'},
 {'radical': '亠',
  'number': 8,
  'meaning': '[lid, head]',
  'pinyin': 'tou2',
  'variants

In [42]:
with open(DATADIR/'radicals.json','w') as fp:
    for document in radicalJSON:
        fp.write(f'\n{document}')
fp.close()


# Characters 

* _ID
* char 
* meaning as list
* radicals ID
* words containing char (ids)
* phrases containing char
* other radical forms
* stroke number
* frequency
* hsk level 

In [43]:
CharDF = pd.read_csv(DATADIR/'uniqueCharacters.csv',index_col=0)

In [44]:
CharDF.head()

Unnamed: 0,char,IndvRawFrequency,cumulativeRawFrequency,Pinyin,English,kDefinition,kHanyuPinyin,kMandarin,kTotalStrokes,kSimplifiedVariant,ucn
0,的,7922684,4.094325,de/di2/di4,"(possessive particle)/of, really and truly, ai...","possessive, adjectival suffix","42644.160:dì,dí,de",de,8,,U+7684
1,一,3050722,5.670893,yi1,one/1/single/a(n),"one; a, an; alone",10001.010:yī,yī,1,,U+4E00
2,是,2615490,7.022539,shi4,is/are/am/yes/to be,"indeed, yes, right; to be; demonstrative prono...","21497.050:shì,tí",shì,9,,U+662F
3,不,2237915,8.179061,bu4/bu2,(negative prefix)/not/no,"no, not; un-; negative prefix","10011.060:bù,fǒu,fōu,fū",bù,4,,U+4E0D
4,了,2128528,9.279052,le/liao3/liao4,(modal particle intensifying preceding clause)...,to finish; particle of completed action,"10048.060:liǎo,le,liào",le,2,U+4E86,U+4E86


### Processing meaning

transform into a list


In [45]:
# Proce

In [46]:
CharDF.English = CharDF.English.str.split('/')

### Find words containing this character

CharDF

# Character Decomposition

charDecompTable = pd.rea

In [47]:
charDecomp = pd.read_csv(DATADIR/'FlattenedDecompositionTable.csv',index_col=0)

In [48]:
charDecomp

Unnamed: 0,Component,Strokes,CompositionType,LeftComponent,LeftStrokes,RightComponent,RightStrokes,Signature,Notes,Section
0,一,1,一,一,1,*,0,M,/,*
1,丁,2,吕,一,1,亅,1,MN,/,一
3,七,2,一,七,2,*,0,JU,/,一
7,万,3,一,万,3,*,0,MS,/,一
8,丈,3,一,丈,3,*,0,JK,/,一
...,...,...,...,...,...,...,...,...,...,...
20894,龝,21,吅,禾,5,龜,16,HDHBS,/,龜
20896,龟,7,一,⺈,2,电,5,NWU,/,龜
20897,龠,17,一,龠,17,*,0,OMRB,/,*
20899,龢,22,吅,龠,17,禾,5,OBHD,/,龠


### Data Exploration

#### Understanding the decomposition table

The goal is to find the base level where relationships can be made

1 relationship is to the radical,
The second relationship is to the hema radical

ideally it is decomposed to the radical, and possibly another hema code.

This will explore the possbilities

In [49]:
charDecomp.CompositionType.value_counts()

吅    6649
吕    2518
回     255
一     222
+     103
冖      69
咒      35
弼      31
品      18
叕       1
*       1
Name: CompositionType, dtype: int64

In [50]:
charDecomp.Component[charDecomp.CompositionType == '一'].isin(radicalsDF.char).value_counts()

True     117
False    105
Name: Component, dtype: int64

From this, it doesn't have a direct match, let's find the radicals in the other table

In [51]:
uniqueRadicals = radicalsExportDF.radical

charDecomp.Component.isin(uniqueRadicals).value_counts()

False    9724
True      178
Name: Component, dtype: int64

Testing the idea that the other radicals are in the left or right components

In [52]:
missingRadicals = radicalsExportDF[~radicalsExportDF['radical'].isin(charDecomp['Component'])]

In [53]:
 charDecomp.LeftComponent.isin(missingRadicals.radical).value_counts()

False    9166
True      736
Name: LeftComponent, dtype: int64

In [54]:
radicalsInLeftComponent =  missingRadicals.radical.isin(charDecomp.LeftComponent)
radicalsInRightComponent = missingRadicals.radical.isin(charDecomp.RightComponent)

radicalsInSubcomponents = radicalsInLeftComponent | radicalsInRightComponent
radicalsInSubcomponents.value_counts()

True     33
False     4
Name: radical, dtype: int64

In [55]:
missingRadicals = missingRadicals[radicalsInSubcomponents == False]

In [56]:
missingRadicals.head()

Unnamed: 0,radical,number,meaning,pinyin,variants,traditional,hemaCode
22,匸,23,"[hiding enclosure, box]",xi3,False,,13
33,夊,35,[go slowly],sui1 (bot­tom),False,,44
156,⻊,157,"[foot, attain, satisfy, enough]",zu2,False,足,24 21 22
207,鼡,208,"[rat, mouse]",shu3,False,鼠,53 44 15


### Zeroing in

Here, the goal is to find where these last for radicals are. Although 98 percent is not bad, Im concerned that
three of these are relatively common character components

Radical 23 was merged with radical 22, meaning a box
In this case, tthe component should be merged

Radical 35 is an orphan that was merged with radical 44
This should be removed from the database

Radical 157 needs some transfoormation due to the other varients of it
It means foot, and is very common. So, some transformation is needed


Radical 208 is  essentially a word for rat, and is very obscure. 
however, it appears that the traditional character is used more,
so it can be swapped

In [57]:
radicalsExportDF.drop([22,33],axis=0,inplace=True)

In [58]:
radicalsExportDF.radical[207] = '鼠'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  radicalsExportDF.radical[207] = '鼠'


In [59]:
radicalsExportDF.radical[156] = '足'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  radicalsExportDF.radical[156] = '足'


In [60]:
radicalsExportDF.variants[156] = '[⻊]'

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  radicalsExportDF.variants[156] = '[⻊]'


In [61]:
radicalsExportDF.loc[156]

radical                                      足
number                                     157
meaning        [foot, attain, satisfy, enough]
pinyin                                     zu2
variants                                   [⻊]
traditional                                  足
hemaCode                              24 21 22
Name: 156, dtype: object

In [62]:
idsCharDecomp =  pd.read_csv(DATADIR/'idsDecomposition.csv',index_col=0)

In [63]:
idsCharDecomp[idsCharDecomp.char == '亡']

Unnamed: 0,ucn,char,1,2
89,U+4EA1,亡,⿱亠𠃊,


In [64]:
charDecomp.CompositionType.unique()

array(['一', '吕', '回', '咒', '+', '吅', '冖', '品', '弼', '叕', '*'],
      dtype=object)

In [65]:
charDecomp[charDecomp.CompositionType == '一']

Unnamed: 0,Component,Strokes,CompositionType,LeftComponent,LeftStrokes,RightComponent,RightStrokes,Signature,Notes,Section
0,一,1,一,一,1,*,0,M,/,*
3,七,2,一,七,2,*,0,JU,/,一
7,万,3,一,万,3,*,0,MS,/,一
8,丈,3,一,丈,3,*,0,JK,/,一
10,上,3,一,上,3,*,0,YM,/,一
...,...,...,...,...,...,...,...,...,...,...
20878,龍,16,一,龍,16,*,0,YBYSP,/,*
20890,龙,5,一,龙,5,*,0,IKP,/,龍
20893,龜,16,一,龜,16,*,0,HBSS,/,*
20896,龟,7,一,⺈,2,电,5,NWU,/,龜


### Character Decomp Table

In [66]:
bottomFloor = radicalsExportDF.radical


charDecompJSON = []
for i, char in enumerate(uniqueChars.char):
    counterForComponents = 0
    numberOfRadicals = []

    jsonCell = {}
    topLevelComponent = charDecomp[charDecomp.Component == char]


    jsonCell['char'] = char
    
    # Level one decomposition
    isBaseRadical = bottomFloor[bottomFloor.str.contains(char)]

    # Level 1 looks for the basic decomposition

    # returns a dictionary with the shape of this
    if isBaseRadical.shape[0] == 1:
        jsonCell['nodeInfo'] = {'':''} ##placeholder
        jsonCell['decomp'] = {topLevelComponent.CompositionType.values[0]:isBaseRadical.values[0]}
        print(jsonCell)
        continue
        # Ends the loop because it is already a base radical

    # Next step checks for the breakdown

    if topLevelComponent.shape[0] == 1:
        jsonCell['nodeDecomp'] = topLevelComponent.CompositionType 
    # Data Integrity Check
    elif topLevelComponent.shape[0] > 1:
        print(topLevelComponent.index)
        print(topLevelComponent.Component)
    
    # Get left component because that is normally how it is written
        



        # Get Right component

    
    
    
    
    
    
    charDecompJSON.append(jsonCell)




NameError: name 'uniqueChars' is not defined

In [None]:
charDecompJSON

[{'char': '的'},
 {'char': '是'},
 {'char': '不'},
 {'char': '了'},
 {'char': '在'},
 {'char': '有'},
 {'char': '我'},
 {'char': '他'},
 {'char': '这'},
 {'char': '个'},
 {'char': '们'},
 {'char': '中'},
 {'char': '来'},
 {'char': '上'},
 {'char': '为'},
 {'char': '和'},
 {'char': '国'},
 {'char': '地'},
 {'char': '到'},
 {'char': '以'},
 {'char': '说'},
 {'char': '时'},
 {'char': '要'},
 {'char': '就'},
 {'char': '出'},
 {'char': '会'},
 {'char': '可'},
 {'char': '也'},
 {'char': '你'},
 {'char': '对'},
 {'char': '能'},
 {'char': '那'},
 {'char': '得'},
 {'char': '于'},
 {'char': '着'},
 {'char': '下'},
 {'char': '之'},
 {'char': '年'},
 {'char': '过'},
 {'char': '发'},
 {'char': '后'},
 {'char': '作'},
 {'char': '道'},
 {'char': '所'},
 {'char': '然'},
 {'char': '家'},
 {'char': '种'},
 {'char': '事'},
 {'char': '成'},
 {'char': '多'},
 {'char': '经'},
 {'char': '么'},
 {'char': '去'},
 {'char': '法'},
 {'char': '学'},
 {'char': '如'},
 {'char': '都'},
 {'char': '同'},
 {'char': '现'},
 {'char': '当'},
 {'char': '没'},
 {'char': '动'},
 {'char'

In [None]:
charDecompJSON

[{'char': '的'},
 {'char': '一', 0: '一'},
 {'char': '是'},
 {'char': '不'},
 {'char': '了'},
 {'char': '在'},
 {'char': '人', 0: '人'},
 {'char': '有'},
 {'char': '我'},
 {'char': '他'},
 {'char': '这'},
 {'char': '个'},
 {'char': '们'},
 {'char': '中'},
 {'char': '来'},
 {'char': '上'},
 {'char': '大', 0: '大'},
 {'char': '为'},
 {'char': '和'},
 {'char': '国'},
 {'char': '地'},
 {'char': '到'},
 {'char': '以'},
 {'char': '说'},
 {'char': '时'},
 {'char': '要'},
 {'char': '就'},
 {'char': '出'},
 {'char': '会'},
 {'char': '可'},
 {'char': '也'},
 {'char': '你'},
 {'char': '对'},
 {'char': '生', 0: '生'},
 {'char': '能'},
 {'char': '而', 0: '而'},
 {'char': '子', 0: '子'},
 {'char': '那'},
 {'char': '得'},
 {'char': '于'},
 {'char': '着'},
 {'char': '下'},
 {'char': '自', 0: '自'},
 {'char': '之'},
 {'char': '年'},
 {'char': '过'},
 {'char': '发'},
 {'char': '后'},
 {'char': '作'},
 {'char': '里', 0: '里'},
 {'char': '用', 0: '用'},
 {'char': '道'},
 {'char': '行', 0: '行'},
 {'char': '所'},
 {'char': '然'},
 {'char': '家'},
 {'char': '种'},
 {'char'

In [None]:
isBaseRadical = bottomFloor[bottomFloor.str.contains(char)]

In [None]:
isBaseRadical.values[0]

'足'

In [None]:
charDecomp.Component 

0        一
1        丁
3        七
7        万
8        丈
        ..
20894    龝
20896    龟
20897    龠
20899    龢
20901    龤
Name: Component, Length: 9902, dtype: object

In [None]:
charDecomp

Unnamed: 0,Component,Strokes,CompositionType,LeftComponent,LeftStrokes,RightComponent,RightStrokes,Signature,Notes,Section
0,一,1,一,一,1,*,0,M,/,*
1,丁,2,吕,一,1,亅,1,MN,/,一
3,七,2,一,七,2,*,0,JU,/,一
7,万,3,一,万,3,*,0,MS,/,一
8,丈,3,一,丈,3,*,0,JK,/,一
...,...,...,...,...,...,...,...,...,...,...
20894,龝,21,吅,禾,5,龜,16,HDHBS,/,龜
20896,龟,7,一,⺈,2,电,5,NWU,/,龜
20897,龠,17,一,龠,17,*,0,OMRB,/,*
20899,龢,22,吅,龠,17,禾,5,OBHD,/,龠


In [None]:
charDecomp.Component

0        一
1        丁
3        七
7        万
8        丈
        ..
20894    龝
20896    龟
20897    龠
20899    龢
20901    龤
Name: Component, Length: 9902, dtype: object

In [None]:
for char in CharDF.char:
    print(char)
    

的
一
是
不
了
在
人
有
我
他
这
个
们
中
来
上
大
为
和
国
地
到
以
说
时
要
就
出
会
可
也
你
对
生
能
而
子
那
得
于
着
下
自
之
年
过
发
后
作
里
用
道
行
所
然
家
种
事
成
方
多
经
么
去
法
学
如
都
同
现
当
没
动
面
起
看
定
天
分
还
进
好
小
部
其
些
主
样
理
心
她
本
前
开
但
因
只
从
想
实
日
军
者
意
无
力
它
与
长
把
机
十
民
第
公
此
已
工
使
情
明
性
知
全
三
又
关
点
正
业
外
将
两
高
间
由
问
很
最
重
并
物
手
应
战
向
头
文
体
政
美
相
见
被
利
什
二
等
产
或
新
己
制
身
果
加
西
斯
月
话
合
回
特
代
内
信
表
化
老
给
世
位
次
度
门
任
常
先
海
通
教
儿
原
东
声
提
立
及
比
员
解
水
名
真
论
处
走
义
各
入
几
口
认
条
平
系
气
题
活
尔
更
别
打
女
变
四
神
总
何
电
数
安
少
报
才
结
反
受
目
太
量
再
感
建
务
做
接
必
场
件
计
管
期
市
直
德
资
命
山
金
指
克
许
统
区
保
至
队
形
社
便
空
决
治
展
马
科
司
五
基
眼
书
非
则
听
白
却
界
达
光
放
强
即
像
难
且
权
思
王
象
完
设
式
色
路
记
南
品
住
告
类
求
据
程
北
边
死
张
该
交
规
万
取
拉
格
望
觉
术
领
共
确
传
师
观
清
今
切
院
让
识
候
带
导
争
运
笑
飞
风
步
改
收
根
干
造
言
联
持
组
每
济
车
亲
极
林
服
快
办
议
往
元
英
士
证
近
失
转
夫
令
准
布
始
怎
呢
存
未
远
叫
台
单
影
具
罗
字
爱
击
流
备
兵
连
调
深
商
算
质
团
集
百
需
价
花
党
华
城
石
级
整
府
离
况
亚
请
技
际
约
示
复
病
息
究
线
似
官
火
断
精
满
支
视
消
越
器
容
照
须
九
增
研
写
称
企
八
功
吗
包
片
史
委
乎
查
轻
易
早
曾
除
农
找
装
广
显
吧
阿
李
标
谈
吃
图
念
六
引
历
首
医
局
突
专
费
号
尽
另
周
较
注
语
仅
考
落
青
随
选
列
