# Split Extraction of Japanese Data from CommonCrawl

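This notebook extracts Japanese data from CommonCrawl in split batches.

Team Hatakeyama (tentative name) of the GENIAC project is building a corpus for training Japanese LLMs by carefully processing CommonCrawl data. One obstacle in building that corpus is the sheer volume of CommonCrawl data (roughly 100 TB).

Even the step of pulling only the Japanese data out of this multilingual collection is too large for the team members to handle alone. We therefore aim to have people inside and outside the team extract the Japanese data from CommonCrawl in separate pieces and merge the results at the end.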


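## Procedure (Google Colab)

This effort processes the 90,000 archive files (WARC) in CommonCrawl in pieces.

The 90,000 files are divided into batches of 10. Each batch is processed and its result is stored in Google Drive, with the goal of eventually covering all 90,000 files.

To take part, please do the following:

1) Run the cells below this one in order from the top.

2) Change batch_number in the last cell, then run that cell.
https://colab.research.google.com/drive/1Gq8HQ0iyASH5iOAkosclJEQTwYJvYRmy#scrollTo=UawI0uZgAjz6&line=3&uniqifier=1

3) Download the {batch_number}.zip file saved under /submit/, which appears in the Files panel.

4) Report on the GENIAC Slack (or a similar channel) that the batch has finished, and wait for instructions on which Drive location to upload to.

One batch takes about 60 minutes to run.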
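
The batch arithmetic described above can be sketched as follows. This is only an illustration: `batch_range` is a hypothetical helper, and `total_files=90000` is the approximate figure quoted in this notebook, not an exact count.

```python
def batch_range(batch_number, batch_size=10, total_files=90000):
    """Return the [start, end) indices of the WARC paths covered by one batch.

    total_files is the approximate number quoted in the notebook (an assumption).
    """
    n_batches = -(-total_files // batch_size)  # ceiling division
    if not 0 <= batch_number < n_batches:
        raise ValueError(f"batch_number must be in [0, {n_batches})")
    start = batch_number * batch_size
    end = min(start + batch_size, total_files)
    return start, end


print(batch_range(503))  # (5030, 5040)
print(90000 // 10)       # 9000
```

With 10 files per batch, 90,000 files correspond to 9,000 batches, so each contributor's batch_number selects one slice of 10 paths from the path list.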


### Procedure (local Linux environment)

1) Download this Google Colab notebook.

2) Run it in a Python environment where Jupyter is available.
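
Outside Colab, the `google.colab` module imported in the last cell is normally not installed, so the automatic download step needs a fallback. The following is a minimal sketch under that assumption; `load_colab_files` and `deliver` are hypothetical helper names and are not part of the repository.

```python
import importlib


def load_colab_files():
    """Return the google.colab.files module when running in Colab, else None."""
    try:
        return importlib.import_module("google.colab.files")
    except ImportError:
        return None


def deliver(zip_path, colab_files=None):
    """Trigger a browser download in Colab; otherwise leave the file on disk."""
    if colab_files is not None:
        colab_files.download(zip_path)
        return "downloaded"
    return f"kept locally at {zip_path}"


print(deliver("submit/503.zip", load_colab_files()))
```

The cells in this notebook also use paths under /content, which is the Colab working directory, so those paths may need adjusting on a local machine.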

## Environment

Install the Python libraries.

Note: run this only the first time.

In [None]:
!git clone https://github.com/hatakeyama-llm-team/Dataset_CommonCrawl/
import os
path = '/content/Dataset_CommonCrawl'
os.chdir(path)
!ls
!pip install -r requirements.txt

Collecting Jupyter-Beeper
  Downloading Jupyter_Beeper-1.0.3-py3-none-any.whl (3.8 kB)
Collecting jedi>=0.16 (from ipython->Jupyter-Beeper)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, Jupyter-Beeper
Successfully installed Jupyter-Beeper-1.0.3 jedi-0.19.1


In [None]:
import os
path = '/content/Dataset_CommonCrawl/extract_data'
os.chdir(path)
!ls

### Function - File Utils

Import the functions needed for this process (file handling).

Note: run this only the first time.

In [None]:
import os
from src.file_utils import download_file, decompress_gz
from src.downloader import cc_path_to_urls,download_warc_file,download_warc_file_with_s3
base_url = "https://data.commoncrawl.org/"
os.makedirs("data/gz", exist_ok=True)
os.makedirs("data/warc", exist_ok=True)

### Function - Process Utils

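Load the functions needed for this process (WARC file processing). The cell below defines a helper that beeps when processing finishes.

Note: run this only the first time.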

In [None]:
# Beep when finished
import jupyter_beeper

def beep(frequency=2500):
    b = jupyter_beeper.Beeper()
    b.beep(frequency, secs=1, blocking=True)


## Data Preprocess

### Step1 Download CommonCrawl Paths

The paths of the CommonCrawl archives (WARC) are stored as strings in a compressed file (gz).

This step downloads that file, decompresses it, and saves it under data/path_list.


In [None]:
import os

from extract_data.warc.download_path_list import download_path_list

"""
Download the path list from CommonCrawl.
"""
# Parameter
# URL of the compressed path list for the WARC files processed this time
# Create the folder that the path list is downloaded into
os.makedirs("data", exist_ok=True)
os.makedirs("data/path_list", exist_ok=True)
download_path_list()

ファイルが正常にダウンロードされました: data/path_list/CC-MAIN-2023-50.gz
data/path_list/CC-MAIN-2023-50.gzが解凍され、data/path_list/CC-MAIN-2023-50に保存されました。
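
As a rough illustration of what this step involves, the sketch below downloads a gzip file, decompresses it, and reads one path per line using only the standard library. It is not the repository's `download_path_list` implementation. `PATH_LIST_URL` is an assumption based on CommonCrawl's published layout, and `fetch`, `decompress_gz_sketch`, and `read_paths` are hypothetical names.

```python
import gzip
import os
import shutil
import urllib.request

# Assumed location of the path list; download_path_list() may use a different URL.
PATH_LIST_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz"


def fetch(url, dst):
    """Download url to dst, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dst)
    return dst


def decompress_gz_sketch(src, dst):
    """Stream-decompress a .gz file to dst without loading it all into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def read_paths(path):
    """Return the non-empty lines of the decompressed path list."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

A typical call sequence would be `fetch(PATH_LIST_URL, "data/path_list/CC-MAIN-2023-50.gz")`, then `decompress_gz_sketch` on that file, then `read_paths` on the result.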


### Step2 Check Step1 Result

Check that the WARC path strings have been saved.

Note: run this only if something went wrong in Step1.

In [1]:
from extract_data.warc.src.downloader import get_cc_path_list

# Process
# Get the list of saved WARC file paths
path_list = get_cc_path_list(path_dir="data/path_list/*")
# Display it
path_list

data/path_list/CC-MAIN-2023-50


['crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00000.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00001.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00002.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00003.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00004.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00005.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00006.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231128113443-00007.warc.gz',
 'crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/CC-MAIN-20231128083443-20231

### Step3 Download WARC and Extract Japanese Site Data

Using the (partial) URI of each warc.gz file, download the warc.gz file and decompress it into a WARC file.
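
The download log in this notebook suggests how a path-list entry maps to a URL and to local file names: the URL is the base URL plus the path, and the local name replaces `/` with `_`. The sketch below reproduces that pattern. `warc_locations` is a hypothetical helper, and the repository's own functions may differ.

```python
BASE_URL = "https://data.commoncrawl.org/"


def warc_locations(cc_path, gz_dir="data/gz", warc_dir="data/warc"):
    """Map a path-list entry to its download URL and local file names.

    The naming scheme is inferred from the log output, not from the repository code.
    """
    flat = cc_path.replace("/", "_")
    warc_name = flat[:-3] if flat.endswith(".gz") else flat
    return {
        "url": BASE_URL + cc_path,
        "gz": f"{gz_dir}/{flat}",
        "warc": f"{warc_dir}/{warc_name}",
    }


example = (
    "crawl-data/CC-MAIN-2023-50/segments/1700679099281.67/warc/"
    "CC-MAIN-20231128083443-20231128113443-00000.warc.gz"
)
for key, value in warc_locations(example).items():
    print(key, value)
```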

Read the WARC file and extract only the data of the Japanese sites stored in it.

Save the extracted result compressed in json.gz format.
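
To make the filtering idea concrete, here is a minimal sketch that keeps pages whose text contains enough hiragana or katakana and writes them to a gzip-compressed JSON Lines file. This is an assumed heuristic for illustration only. The notebook does not show how `download_and_parse` detects Japanese or lays out its json.gz files, and `japanese_ratio`, `is_japanese`, `save_japanese_pages`, the `text` key, and the 0.1 threshold are all hypothetical.

```python
import gzip
import json


def japanese_ratio(text):
    """Fraction of characters in the hiragana/katakana blocks (U+3040 to U+30FF)."""
    if not text:
        return 0.0
    kana = sum(1 for ch in text if "\u3040" <= ch <= "\u30ff")
    return kana / len(text)


def is_japanese(text, threshold=0.1):
    """Heuristic: kana appear in Japanese text but not in Chinese or English."""
    return japanese_ratio(text) >= threshold


def save_japanese_pages(pages, out_path):
    """Write the pages judged Japanese as gzip-compressed JSON Lines; return the count."""
    kept = [page for page in pages if is_japanese(page["text"])]
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for page in kept:
            f.write(json.dumps(page, ensure_ascii=False) + "\n")
    return len(kept)
```

A kana-based check is cheap and separates Japanese from Chinese text that shares kanji, but it will miss pages written almost entirely in kanji or Latin characters.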

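There are about 90,000 WARC files, so each run can process only a portion of them. The file paths to process are therefore divided into batches.

Specify a batch number (batch_number) to extract the Japanese pages in that batch.

A zip file bundling the extracted results is saved to submit/{batch_number}.zip.

Place the saved file in the designated Google Drive location.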
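
The packaging step can be sketched with the standard library as follows. `shutil.make_archive` is also what the cell below uses. `archive_batch` and `list_archive` are hypothetical helpers added here so that the archive contents can be checked before uploading.

```python
import os
import shutil
import zipfile


def archive_batch(batch_number, process_root="process", submit_dir="submit"):
    """Zip process_root/batch{N} into submit_dir/{N}.zip and return the zip path."""
    batch_dir = os.path.join(process_root, f"batch{batch_number}")
    os.makedirs(submit_dir, exist_ok=True)
    base_name = os.path.join(submit_dir, str(batch_number))
    # make_archive appends ".zip" to base_name and returns the archive file name
    return shutil.make_archive(base_name, format="zip", root_dir=batch_dir)


def list_archive(zip_path):
    """Return the sorted member names of a zip file."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())
```

Listing the members before uploading is a quick way to confirm that the batch produced one result file per processed WARC.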

---

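#### What to do

- Set is_debug to True and confirm that the cell runs (first time only)

- Change batch_number (specify the batches you are taking on)

- Set is_debug to False and start the data extraction and processing

#### Update 2024/03/04

The result zip file is now downloaded automatically.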

In [None]:
import os
import shutil
import time

from google.colab import files
from tqdm import tqdm

from extract_data.warc.download_and_parse import download_and_parse
from extract_data.warc.src.downloader import get_cc_path_list

# Changed so that the processing results are downloaded automatically

#@markdown - Change the numbers here to the batch numbers assigned to you
#@markdown - Enter batch_number values separated by commas
batch_number = 503,504, #@param
is_debug = False

# Directory where the results are saved
submit_dir = "submit"
# If Google Drive can be mounted, comment out the submit_dir above and uncomment the line below.
# submit_dir = "/content/drive/MyDrive/CommonCrawl/"
def curation(batch_number, submit_dir="/content/submit", is_debug=False):
    cc_path_list = get_cc_path_list()
    # Debug mode processes 3 files per batch instead of 10
    if is_debug:
        n_batch = 3
    else:
        n_batch = 10
    start_idx, end_idx = batch_number * n_batch, (batch_number+1) * n_batch
    target_path_list = cc_path_list[start_idx:end_idx]
    for cc_path in tqdm(target_path_list):
        download_and_parse(cc_path, f"process/batch{batch_number}")
    shutil.make_archive(f'{submit_dir}/{batch_number}',
                        format='zip', root_dir=f"process/batch{batch_number}")

    shutil.rmtree("process/")

for num in batch_number:
    # Process the data for this batch number
    curation(num, submit_dir=submit_dir, is_debug=is_debug)

    # Download the result file from the directory it was written to
    try:
        files.download(os.path.join(submit_dir, f"{num}.zip"))
    except Exception as e:
        print(f"ERROR: File Download Unsuccessful. ({e})")
    time.sleep(10)
    beep()
    # Clear the downloaded and decompressed WARC files before the next batch
    shutil.rmtree("/content/data/gz")
    shutil.rmtree("/content/data/warc")
    os.mkdir("/content/data/gz")
    os.mkdir("/content/data/warc")

beep(5000)


data/path_list/CC-MAIN-2023-50


  0%|          | 0/10 [00:00<?, ?it/s]

downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00530.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00530.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00530.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00530.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00530.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00531.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00531.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00531.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00531.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00531.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00532.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00532.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00532.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00532.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00532.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00533.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00533.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00533.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00533.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00533.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00534.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00534.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00534.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00534.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00534.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00535.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00535.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00535.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00535.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00535.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00536.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00536.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00536.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00536.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00536.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00537.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00537.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00537.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00537.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00537.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00538.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00538.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00538.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00538.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00538.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00539.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00539.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00539.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00539.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00539.warcに保存されました。




data/path_list/CC-MAIN-2023-50


  0%|          | 0/10 [00:00<?, ?it/s]

downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00540.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00540.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00540.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00540.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00540.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00541.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00541.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00541.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00541.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00541.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00542.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00542.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00542.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00542.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00542.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00543.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00543.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00543.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00543.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00543.warcに保存されました。




downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00544.warc.gz
ファイルが正常にダウンロードされました: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00544.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00544.warc.gz
data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00544.warc.gzが解凍され、data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00544.warcに保存されました。





downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00545.warc.gz
File downloaded successfully: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00545.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00545.warc.gz
Decompressed data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00545.warc.gz and saved it to data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00545.warc



3041it [00:12, 571.57it/s]

downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00546.warc.gz
File downloaded successfully: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00546.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00546.warc.gz
Decompressed data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00546.warc.gz and saved it to data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00546.warc



3023it [00:06, 533.36it/s]

downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00547.warc.gz
File downloaded successfully: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00547.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00547.warc.gz
Decompressed data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00547.warc.gz and saved it to data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00547.warc



2348it [00:09, 256.69it/s]

downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00548.warc.gz
File downloaded successfully: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00548.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00548.warc.gz
Decompressed data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00548.warc.gz and saved it to data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00548.warc



2897it [00:10, 400.60it/s]

downloading https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/segments/1700679100047.66/warc/CC-MAIN-20231129010302-20231129040302-00549.warc.gz
File downloaded successfully: data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00549.warc.gz
decompressing data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00549.warc.gz
Decompressed data/gz/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00549.warc.gz and saved it to data/warc/crawl-data_CC-MAIN-2023-50_segments_1700679100047.66_warc_CC-MAIN-20231129010302-20231129040302-00549.warc



2466it [00:09, 251.10it/s]

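The log above suggests a simple naming scheme: each CommonCrawl path is flattened by replacing `/` with `_`, stored under `data/gz`, and then decompressed to `data/warc` with the `.gz` suffix dropped. The following is a minimal sketch of that mapping, inferred only from the log lines. The helper names `local_paths` and `decompress` are hypothetical and are not the functions in `src.file_utils`, whose implementation is not shown here.

```python
import gzip
import shutil
from pathlib import Path


def local_paths(cc_path: str) -> tuple[Path, Path]:
    """Hypothetical helper: map a CommonCrawl path to local gz and warc paths.

    Assumption (inferred from the log output): "/" becomes "_", archives go
    under data/gz, and decompressed files go under data/warc without ".gz".
    """
    flat_name = cc_path.replace("/", "_")
    gz_path = Path("data/gz") / flat_name
    warc_path = Path("data/warc") / flat_name.removesuffix(".gz")
    return gz_path, warc_path


def decompress(gz_path: Path, warc_path: Path) -> None:
    """Hypothetical helper: stream-decompress a .gz file to warc_path."""
    warc_path.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so a large archive is never fully loaded into memory.
    with gzip.open(gz_path, "rb") as src, open(warc_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Calling `local_paths` with the path portion of a download URL (everything after `https://data.commoncrawl.org/`) should give the two file names that appear in the corresponding log lines. Requires Python 3.9 or later because of `str.removesuffix`.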