/
README.txt
30 lines (25 loc) · 1.15 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
Jack Wines & Tegan Wilson
Natural Language Processing
Blake Howald
Python3 libraries used:
networkx
plotly
tabulate
beautifulsoup4
To run the program, ensure that algorithm.py, kanji_hiragana.txt,
katakana_dict.txt, and edict2 are in the same directory. Type:
python3 algorithm.py
into the command line to run.
Files:
algorithm.py - our main file: takes in a set of training data and a set of text
data, calculates the pronunciation of each sentence, finds the minimum edit
distance between the calculated reading and the actual reading, and outputs
the average and median edit distance and the top and bottom 5 sentences
scrape_news.py - takes in a set of news urls and pulls test data pairs: sentence
with kanji, and sentence without kanji, and records them in kanji_hiragana.txt
urls.txt - the set of news urls that we pulled test data from
kanji_hiragana.txt - the test data we pulled from news urls
katakana_dict.txt - a katakana to hiragana map we created to map each katakana
character to its associated hiragana character (kept the output all in
hiragana for consistency)
edict2 - the Japanese to English dictionary that we used as our training data