Error due to lack of resources #11

Closed
new-village opened this issue May 18, 2024 · 6 comments

@new-village
Owner

Running the enrich_kana function raises the following error, and processing never completes.

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 131, in worker
    put((job, i, result))
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

@new-village
Owner Author

Switched the processing from multiprocessing to a single process.

    for func in selected_functions:
        df = func(df)

Benchmarking enrich_kana with this change, no error occurred and the run completed with the following times.

read_csv: 73.20 sec
enrich_kana: 4004.94 sec
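
For reference, per-step timings in the `name: N.NN sec` format above could be collected with a small helper like the following. This is only a sketch: `timed` and `run_pipeline` are hypothetical names, not functions from this repository, and the actual benchmark code may differ.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block as 'label: N.NN sec'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the raw value for later comparison
        print(f"{label}: {elapsed:.2f} sec")


def run_pipeline(df, selected_functions, results=None):
    """Single-process loop: each step receives the previous step's output."""
    for func in selected_functions:
        with timed(func.__name__, results):
            df = func(df)
    return df
```

Passing a dict as `results` keeps the measured values around, so runs before and after a change can be compared without parsing printed output.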

@new-village
Owner Author

new-village commented May 19, 2024

Added engine="pyarrow" to the arguments of 'pd.read_csv'. Also, running with the conversion step of the enrich_kana function commented out improved performance dramatically.

    df['std_furigana'] = df['furigana'].where(df['furigana'].notna(), df['name'])
    # df['std_furigana'] = df['std_furigana'].apply(_normalize_and_convert_kana)

read_csv: 19.92 sec
enrich_kana: 0.44 sec
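
To make clear what still runs in this measurement: the remaining `where` line is a vectorized fallback that keeps `furigana` where it is present and uses `name` otherwise. A minimal sketch with made-up sample rows (the data is illustrative only; the column names are the ones used above):

```python
import pandas as pd

# Illustrative rows only; the real data comes from the CSV.
df = pd.DataFrame({
    "name": ["トウキョウ商事", "大阪工業"],
    "furigana": ["トウキョウショウジ", None],
})

# Keep furigana where it exists, otherwise fall back to name.
df["std_furigana"] = df["furigana"].where(df["furigana"].notna(), df["name"])
print(df["std_furigana"].tolist())
```

Since this step takes 0.44 sec on its own, almost all of the earlier enrich_kana time is attributable to the per-row `apply(_normalize_and_convert_kana)` call. Note that `engine="pyarrow"` requires the pyarrow package to be installed.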

@new-village
Owner Author

Moved the pykakasi instantiation into a global variable so that a new instance is no longer created on every function call.

read_csv: 19.83 sec
enrich_kana: 1130.26 sec
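
The pattern can be sketched as follows. `ExpensiveConverter` is a hypothetical stand-in for the pykakasi converter so that the example runs without the library; its `convert` method is a placeholder and does not mimic pykakasi's return value.

```python
class ExpensiveConverter:
    """Hypothetical stand-in for a converter that is costly to construct."""

    instances_created = 0

    def __init__(self):
        # Imagine dictionary loading or other slow setup happening here.
        ExpensiveConverter.instances_created += 1

    def convert(self, text):
        return text.upper()  # placeholder transformation


def convert_per_call(text):
    # Before: a new converter is built for every single value.
    return ExpensiveConverter().convert(text)


# After: built once at import time and shared by every call.
_CONVERTER = ExpensiveConverter()


def convert_shared(text):
    return _CONVERTER.convert(text)
```

When the function is used with `Series.apply`, the per-call version pays the construction cost once per row, which adds up quickly on a large CSV.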

@new-village new-village self-assigned this May 19, 2024
@new-village
Owner Author

For values that are already in furigana (katakana) form, skip the pykakasi processing to speed things up.

    if re.fullmatch(r"[ァ-ヴー]+", cleaned_text):
        return cleaned_text
    else:
        return "".join(item['kana'] for item in kks.convert(cleaned_text))

This cut the processing time to roughly 50%.

read_csv: 37.41 sec
enrich_kana: 588.81 sec
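
A self-contained sketch of the same short-circuit, useful for checking which inputs take the fast path. `to_kana` and its `convert` parameter are hypothetical names; `convert` stands in for the pykakasi-based conversion shown above.

```python
import re

# Same character class as above: ァ (U+30A1) through ヴ (U+30F4), plus ー.
KATAKANA_ONLY = re.compile(r"[ァ-ヴー]+")


def to_kana(cleaned_text, convert):
    """Return cleaned_text unchanged if it is all katakana, else call convert."""
    if KATAKANA_ONLY.fullmatch(cleaned_text):
        return cleaned_text
    return convert(cleaned_text)


def fake_convert(text):
    # Stub used only to show when the slow path is taken.
    return f"<converted:{text}>"


print(to_kana("トウキョウ", fake_convert))  # fast path, converter not called
print(to_kana("東京", fake_convert))        # slow path, converter called
```

Strings containing characters outside that range, such as the middle dot ・ (U+30FB) or the small ヵ and ヶ, do not match and still go through the converter.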

@new-village
Owner Author

Parallelized the apply call across multiple processes using pandarallel.

    df['std_furigana'] = df['std_furigana'].parallel_apply(_normalize_and_convert_kana)

On an AMD Ryzen 7 7735U (8 cores), execution time dropped to about 1/5.

read_csv: 31.25 sec
enrich_kana: 115.97 sec

@new-village
Owner Author

Summary

image
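
The enrich_kana timings reported in the comments above can also be compared directly. The snippet below only does arithmetic on those numbers; the labels are descriptive names chosen here, and the 0.44 sec run is left out because the conversion step was commented out in that measurement.

```python
# enrich_kana timings (seconds) as reported in the comments above.
timings = {
    "single process": 4004.94,
    "global pykakasi instance": 1130.26,
    "skip katakana-only values": 588.81,
    "pandarallel parallel_apply": 115.97,
}

baseline = timings["single process"]
speedups = {label: baseline / sec for label, sec in timings.items()}

for label, sec in timings.items():
    print(f"{label}: {sec:.2f} sec ({speedups[label]:.1f}x vs. single process)")
```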

new-village added a commit that referenced this issue May 19, 2024