Here we load and parse an HTML file, converted from a PDF, that lists the most common Chinese characters. The source is available at:

http://www.zein.se/patrick/3000char.html

In [2]:
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
# Suppress only SettingWithCopyWarning
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

Loop through the tables in the HTML. With the exception of the first one, they follow a predictable pattern: a definition followed by words formed from multiple characters.

We create two initial datasets: one with single characters only (both simplified and traditional), and another with multi-character words and their pronunciations. The character category (cat) is the index of the character, used to rank relative difficulty.

Alternative characters (mainly numerals) are shown only for the single-character entry.

Special care is taken so that merged traditional forms can be separated based on the word and meaning, such as 幹 and 乾, which both simplify to 干.
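As a rough illustration of the two datasets described above, the sketch below separates single-character rows from multi-character rows. The row layout, the sample values, and the names `rows`, `chars`, and `words` are assumptions made for this example, not the notebook's actual data structures.

```python
# Hypothetical rows in the order: cat, word, pronunciation, definition.
# The values are illustrative samples, not the parsed dataset.
rows = [
    [1, "的", "de", "grammatical particle"],
    [1, "我的", "wǒde", "my"],
    [2, "一", "yī", "one, a little"],
    [2, "第一", "dì-yī", "first, primary"],
]

# len() counts code points, so a single Chinese character has length 1.
chars = [r for r in rows if len(r[1]) == 1]   # character-only dataset
words = [r for r in rows if len(r[1]) > 1]    # multi-character words

print("characters:", [r[1] for r in chars])
print("words:", [r[1] for r in words])
```

Both lists keep the `cat` value of the character they were listed under, so the ranking survives the split.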

In [57]:
# Key helper function: splits definitions and pronunciations apart, given the raw main character form and its description.
# Returns a list of lists in the order ["cat", "word/character", "pronunciation", "definition", "code", "alt"].
# "code" marks whether the word uses the simplified, traditional, or alternative form (s, t, a); "alt" holds extra information in the few cases where there is an alternative character.
# Note: the pronunciation is the pronunciation of the character within the word.
# Since many words in this list may contain simplified characters, we will later need to convert again for the words that
# are meant to be written only in traditional.
# Therefore, each row will be written twice: once for traditional and once for simplified.
# For single characters, we use the first pronunciation listed.

def getbetween(text, start, end):
    # Helper: return the stripped text between the first `start` marker and the next `end` marker.
    startind = text.find(start)
    # If the start marker is not found, return an empty string.
    if startind == -1:
        return ""
    startind += len(start)  # skip the whole marker, not just its first character
    endind = text.find(end, startind)  # search for the end marker only after the start marker
    # If the end marker is not found, return everything after the start marker.
    if endind == -1:
        return text[startind:].strip()
    return text[startind:endind].strip()

def get_wordList(main, des):
    ret = [] #list to return

    simp = main.split("(")[0] #get the simplified form (everything before the first "(")
    # Alternative and traditional forms, if any. getbetween returns "" when the marker is absent.
    # Both are read from `main`, because `simp` no longer contains the parenthesised part.
    alt = getbetween(main, "(A", ")")
    trad = getbetween(main, "(F", ")")

    frags = des.split("[") #split on "[", which opens each pronunciation
    #iterate through the fragments, each holding a pronunciation and its definitions
    for frag in frags:
        #the pronunciation is the part before the closing "]"
        pro = frag.split("]")[0]
        #skip empty fragments
        if len(pro) > 0:
            #remove the pronunciation, leaving only the definitions
            defis = frag.replace(f"{pro}]", "")
            #print("pronunciation:", pro, ",definitions:", defis)
            defis = defis.split(";") #get the separate definitions
            print(defis)

    return main #rows are not built yet, so the raw main form is returned for now
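
The marker-based extraction used for the traditional and alternative forms can be tried in isolation. The sketch below is a minimal, self-contained version under two assumptions: that a cell looks like `干(F 幹乾)`, and that `(F` and `(A` introduce the traditional and alternative forms as in the code above. The name `extract_marked` is hypothetical.

```python
def extract_marked(text, marker, closer=")"):
    """Return the stripped text between `marker` and the next `closer`, or "" if absent."""
    start = text.find(marker)
    if start == -1:
        return ""
    start += len(marker)             # skip past the whole marker
    end = text.find(closer, start)   # only look for the closer after the marker
    if end == -1:
        return text[start:].strip()
    return text[start:end].strip()

main = "干(F 幹乾)"                  # assumed cell layout, for illustration only
simp = main.split("(")[0].strip()    # simplified form
trad = extract_marked(main, "(F")    # merged traditional forms
alt = extract_marked(main, "(A")     # no alternative form in this sample

print(simp, list(trad), repr(alt))
```

Searching for the closer only after the marker avoids picking up an earlier `)` elsewhere in the cell.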

In [60]:
link = "./main_char_set-converted.html"

with open(link, 'r') as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')
tables = soup.find_all("table")  
grand_list = [] #list used to create the dataframe in the end
#count the number of tables
index = 0
for table in tables:
    rows = table.find_all("tr") #extract all rows in this table
    for row in rows: 
        # Extract the data cells (td only, so header cells are excluded)
        cols = row.find_all("td")
        #valid character rows have 3 columns
        if len(cols) == 3:
            #get the number column 
            col1 = cols[0].get_text()
            #get the char column 
            col2 = cols[1].get_text()
            #get the description column; this is the complicated part, where a helper function splits the characters from the words
            col3 = cols[2].get_text()
            #call our key helper function, which returns a list to be appended to grand_list
            print("new char")
            res = get_wordList(col2, col3)
            grand_list.append(res)

    index += 1

new char
['Prononciation et explications']
new char
[' <grammatical particle marking genitive as well as simple and composed adjectives>', ' 我的 wǒde my', ' 高的 gāode high, tall', "是的 shìde that's it, that's right", ' 是...的 shì...de one who...', ' 他是说汉语的. Tā shì shuō Hànyǔde. He is one who speaks Chinese. ']
[' 目的 mùdì goal']
[' true, real', ' 的确 díquè certainly']
new char
[' one, a little', ' 第一 dì-yī first, primary', ' 看一看 kànyīkàn have a (quick) look at']
[' (used before tones #4 and #5)', ' 一个人 yíge rén one person', ' 一定 yídìng certain', ' 一样 yíyàng same', ' 一月 yíyuè January ']
[' (used before tones #1, #2 and #3)', ' 一点儿 yìdiǎnr a little', ' 一些 yìxiē some{Compare with 幺(T 么) yāo, which also means "one"}']
new char
[' to be, 是不是? shìbushì? is (it) or is (it) not?', ' 是否 shìfǒu whether or not, is (it) or is (it) not?']
new char
[' not']
[' (used before tone #4)', " 不是 bú shì isn't"]
new char
[' <verb particle marking a new situation or a completed action>', " 你来了! Nǐ láile! You have c
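
Since `grand_list` is meant to feed a dataframe or CSV file in the end, the sketch below shows one way rows in the documented column order could be serialized with the standard `csv` module. The sample rows, the `rows_to_csv` name, and the in-memory buffer are assumptions for illustration; the notebook's helper does not yet return rows in this shape.

```python
import csv
import io

# Column order taken from the helper's description; the rows are made-up samples.
COLUMNS = ["cat", "word", "pronunciation", "definition", "code", "alt"]
sample_rows = [
    ["1", "的", "de", "grammatical particle", "s", ""],
    ["1", "我的", "wǒde", "my", "s", ""],
]

def rows_to_csv(rows, columns=COLUMNS):
    """Serialize rows to CSV text, with a header line first."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Writing to a real file would only require replacing the buffer with `open(path, "w", newline="", encoding="utf-8")`.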

In [35]:

test = "[bù] not[bú]"
print(get_wordList("不", test))

pronunciation: 
pronunciation: bù
pronunciation: bú
不
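
The test string above uses the `[pronunciation] definitions` layout. As an alternative to splitting on `[` and `]` by hand, a regular expression can pair each pronunciation with its definitions directly. This is a sketch, not the notebook's method; `split_pronunciations` is a hypothetical name and the sample is modeled on the output for 不.

```python
import re

def split_pronunciations(description):
    """Return a list of (pronunciation, [definitions]) pairs."""
    pairs = []
    # Each segment is "[pinyin]" followed by text running up to the next "[".
    for match in re.finditer(r"\[([^\]]*)\]([^\[]*)", description):
        pron = match.group(1).strip()
        parts = [p.strip() for p in match.group(2).split(";")]
        pairs.append((pron, [p for p in parts if p]))  # drop empty definitions
    return pairs

sample = "[bù] not[bú] (used before tone #4); 不是 bú shì isn't"
for pron, definitions in split_pronunciations(sample):
    print(pron, definitions)
```

Text that appears before the first `[` is ignored by this pattern, so a header cell without brackets yields an empty list.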
