DramaCharCount (DCC) is a set of Python scripts that counts the occurrences of each character and word in Japanese drama subtitle files and generates statistics on their use. In particular, it generates a tongue-in-cheek "Japanese Drama Proficiency Test" (JDPT) list that groups the characters or words in a way similar to the jōyō kanji or unofficial JLPT lists. In addition, the data is uploaded to a SQL database and used on jdramastuff.xyz to allow browsing the statistics for a particular drama. The source for the website is also contained in this repository but is not required to run the scripts (note, however, that a SQL server is required; see the Prerequisite chapter).
The output of DCC is available in data_kanji_raw.csv. The columns are:
- kanji
The kanji itself :o)
- count
The number of occurrences across all subtitle files.
- freq
The occurrence frequency, i.e. the count for this kanji divided by the total kanji count (not the total character count, so hiragana, for example, are not included).
- cumul_freq
The cumulative frequency. For example, a value of 50% means that the occurrences of all kanji up to the current kanji (sorted by count) represent 50% of all kanji occurrences across all dramas.
- drama_freq
The frequency of appearance of this kanji across dramas. For example a value of 30% means that this kanji appears in 30% of all dramas (and is never used in the other 70% of dramas).
- episode_freq
The frequency of appearance of this kanji across episodes. For example a value of 30% means that this kanji appears in 30% of all episodes (and is never used in the other 70% of episodes).
- jdpt
The JDPT level for this kanji. The JDPT is defined using the cumulative frequency as follows:
Level | Cumulative frequency | Number of kanji in this level |
---|---|---|
JDPT 5: | 50% | 143 kanji |
JDPT 4: | 75% | 281 kanji |
JDPT 3: | 90% | 435 kanji |
JDPT 2: | 95% | 323 kanji |
JDPT 1: | 98% | 391 kanji |
JDPT 0: | 99% | 266 kanji |
Number of JDPT kanji: | | 1839 kanji |
Not in JDPT: | 100% | 2173 kanji |
- jdpt_pos
The position as JDPT kanji, i.e. the position when sorted by count.
- jouyou
The jouyou level.
- jouyou_pos
The jouyou position, i.e. the position when sorted by jouyou level, then by count within each level.
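To illustrate how the freq, cumul_freq and jdpt columns relate to each other, here is a minimal Python sketch. The function name `kanji_stats` and the sample input are hypothetical, and the rule that a kanji gets the first level whose threshold its cumulative frequency does not exceed is an assumption; the way DCC handles a kanji that sits exactly on a boundary may differ.

```python
from itertools import accumulate

# Cumulative-frequency thresholds from the JDPT table, as (level, upper bound).
JDPT_THRESHOLDS = [(5, 0.50), (4, 0.75), (3, 0.90), (2, 0.95), (1, 0.98), (0, 0.99)]


def kanji_stats(counts):
    """Return rows of (kanji, count, freq, cumul_freq, jdpt), sorted by count.

    `counts` maps each kanji to its number of occurrences. A jdpt of None
    means "not in JDPT" (cumulative frequency above 99%).
    """
    total = sum(counts.values())
    # Most frequent kanji first; ties keep their input order.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    running = accumulate(count for _, count in ordered)
    rows = []
    for (kanji, count), cumul in zip(ordered, running):
        cumul_freq = cumul / total
        # Assumption: a kanji gets the first level whose threshold covers it.
        jdpt = next(
            (level for level, limit in JDPT_THRESHOLDS if cumul_freq <= limit),
            None,
        )
        rows.append((kanji, count, count / total, cumul_freq, jdpt))
    return rows
```

Calling `kanji_stats({"人": 50, "日": 25, "本": 25})` returns one row per kanji, most frequent first. Only kanji are passed in, which matches the definition of freq as a share of the total kanji count rather than of all characters.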
Development is done in sprints. Please refer to https://github.com/gramm/DramaCharCount/projects/1 to see what is planned. Feel free to mail me at info@jdramastuff.xyz if you have any questions.
- DCC requires a MySQL server to work. I recommend using XAMPP for local MySQL servers.
- DCC requires a MySQL database to work. The name of the database does not matter.
DCC can be configured in two ways:
- Either by editing settings.py, then calling the scripts without arguments, for example
connection_info = dict(
    host="localhost",
    database="db_charcount",
    user="admin",
    password="adminpw"
)
subtitles_path = "C:/path/to/subtitles/"
- Or by passing all settings as command-line arguments to the script, for example
DramaCharCount.py --host=localhost --user=admin --password=adminpw --database=my_sql_database --path=C:/path/to/subtitles/
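The two configuration styles can be combined by letting command-line arguments override the values from settings.py. The following is only a sketch of that idea, not DCC's actual parser: the `DEFAULTS` dict stands in for settings.py, `parse_config` is a hypothetical name, and only the five flags shown above are taken from the README. Whether DCC accepts a partial set of arguments is not stated; in this sketch any omitted argument falls back to its default.

```python
import argparse

# Stand-in for the values defined in settings.py (copied from the example above).
DEFAULTS = {
    "host": "localhost",
    "database": "db_charcount",
    "user": "admin",
    "password": "adminpw",
    "path": "C:/path/to/subtitles/",
}


def parse_config(argv=None):
    """Return (connection_info, subtitles_path), with arguments overriding defaults."""
    parser = argparse.ArgumentParser(description="Hypothetical DCC argument parser")
    for name, default in DEFAULTS.items():
        parser.add_argument(f"--{name}", default=default)
    args = parser.parse_args(argv)
    connection_info = {
        key: getattr(args, key) for key in ("host", "database", "user", "password")
    }
    return connection_info, args.path
```

With no arguments, `parse_config([])` returns the settings.py values unchanged; `parse_config(["--database=my_sql_database"])` replaces only the database name.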
Other configuration options are all contained in settings.py and should be relatively straightforward:
# Absolute path to the directory containing all subtitle files
subtitles_path = "C:/Users/Max/Documents/_tmp/all_drama/"
# Enable to print SQL requests in the console
print_sql = False
# Enable to profile execution time of scripts
enable_profiler = False
# Path to CSV dump (used in JdcCsvGen.py)
csv_path_kanji = "../web/public_html/data_kanji_raw.csv"
DCC assumes that each subfolder under the subtitles_path folder represents one drama, i.e. the name of the subfolder is used as the drama name. All nested files in a given subfolder are treated as subtitle files for this drama, with one file assumed to correspond to one episode.
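The folder convention can be sketched in a few lines of Python. The function `list_dramas` is hypothetical and only mirrors the rule described here; in particular, skipping files that sit directly under subtitles_path is an assumption.

```python
from pathlib import Path


def list_dramas(subtitles_path):
    """Map each drama name (subfolder name) to its sorted list of subtitle files."""
    dramas = {}
    for folder in sorted(Path(subtitles_path).iterdir()):
        if not folder.is_dir():
            continue  # assumption: files directly under the root belong to no drama
        # Every nested file counts as one episode, whatever its depth.
        dramas[folder.name] = sorted(p for p in folder.rglob("*") if p.is_file())
    return dramas
```

For a layout such as `all_drama/DramaA/ep1.srt` and `all_drama/DramaA/season1/ep2.srt`, this yields one drama named `DramaA` with two episodes.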
DCC contains a script "CleanSubtitles.py" which can be used to remove all lines that do not contain any "readable" characters i.e. all non-text lines in a standard SRT file. This is done on all files in all nested subfolders. Be aware that no backup of the files is created.
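As an illustration of what such a cleanup might look like, the sketch below filters the lines of one SRT file. It is not the code of CleanSubtitles.py: the name `keep_text_lines` is hypothetical, and "non-text" is assumed to mean blank lines, cue counters and timing lines.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def keep_text_lines(lines):
    """Return only the lines that look like subtitle text."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.isdigit() or TIMING.match(stripped):
            continue  # blank line, cue counter, or timing line
        kept.append(line)
    return kept
```

Note that this simple rule also drops a dialogue line consisting only of digits. Since the real script rewrites the files without creating a backup, running it on a copy of the subtitle folder is the safer option.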
The steps are as follows:
- Have a local SQL server running (XAMPP started, verify MySQL running via phpMyAdmin)
- Configure settings.py
- Create the drama table by executing JdsDramaHandler
- Load all subtitle lines in the line table by executing JdsLineHandler
- Count all characters by executing JdsCharHandler. Note that this script will mark a drama as "processed" (drama[kanji_ok]=1) and will skip processed dramas on execution. It is therefore safe to interrupt the script to continue executing it later, or to execute it again if for example the MySQL connection failed. Clearing the "processed" state can be done by dropping all character info via jds_char_handler.reset()
- Optional: build a reference from each character to all lines containing this character by executing JdsLineHandler. This is used to display example usages of characters in a given drama when browsing jdramastuff.xyz. The approach is identical to JdsCharHandler, i.e. drama[kanji_line_ref_ok] indicates which dramas to skip.
- Generate statistics by executing JdsInfoHandler
- Optional: Generate CSV output by executing JdsCsvGen
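The "processed" flag used in the character-counting step is what makes that step safe to interrupt and rerun. The sketch below shows the pattern under several assumptions: it uses Python's built-in sqlite3 in place of MySQL so that it runs without a server, only the table name drama and the column kanji_ok come from the description above, and the line and char_count tables, their columns and both function names are hypothetical. Kanji detection is also simplified to the main CJK Unified Ideographs block.

```python
import sqlite3
from collections import Counter

# Hypothetical schema; only drama.kanji_ok is named in the steps above.
SCHEMA = """
CREATE TABLE drama (drama_id INTEGER PRIMARY KEY, name TEXT, kanji_ok INTEGER DEFAULT 0);
CREATE TABLE line (drama_id INTEGER, text TEXT);
CREATE TABLE char_count (drama_id INTEGER, kanji TEXT, count INTEGER);
"""


def is_kanji(char):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= char <= "\u9fff"


def count_pending_dramas(conn):
    """Count kanji for each unprocessed drama; return the number of dramas handled."""
    pending = conn.execute("SELECT drama_id FROM drama WHERE kanji_ok = 0").fetchall()
    for (drama_id,) in pending:
        counts = Counter()
        lines = conn.execute("SELECT text FROM line WHERE drama_id = ?", (drama_id,))
        for (text,) in lines.fetchall():
            counts.update(c for c in text if is_kanji(c))
        conn.executemany(
            "INSERT INTO char_count (drama_id, kanji, count) VALUES (?, ?, ?)",
            [(drama_id, kanji, n) for kanji, n in counts.items()],
        )
        # Flag the drama in the same transaction, so a later run skips it.
        conn.execute("UPDATE drama SET kanji_ok = 1 WHERE drama_id = ?", (drama_id,))
        conn.commit()
    return len(pending)
```

Because the counts and the flag are committed together, one drama at a time, an interruption loses at most the drama being processed, and a second call simply finds nothing left to do.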
- Compute frequency across dramas
- Compute frequency across episodes
- Generate JDPT based on cumulative frequency
- Improve runtime of char count
- Separate char count and line to char ref
- Add CSV generator for raw kanji info
- Add computation of frequency and cumulative frequency
- Fix all PyCharm warnings
- Greatly improve help
- Verify generation from scratch
- Support settings.py to configure connection info, subtitle path
- Counting of kanji occurrences
- Generation of JDPT levels based on number of kanji in JLPT levels
- Computing of relative position within jōyō, JLPT, JDPT
- Computing distance between JLPT and JDPT