# Download pretrained Global Vectors (GloVe) for word representation

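The dataset we use contains 1.6 million training tweets and 350 test tweets from 2009, with algorithmically assigned binary positive and negative sentiment labels that are split fairly evenly.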

## Imports and setup

In [1]:
from pathlib import Path
import requests
from io import BytesIO
from zipfile import ZipFile
from tqdm import tqdm

## Download and extract

You can learn more about the vectors and download them manually [here](https://nlp.stanford.edu/projects/glove/).

In [2]:
path = Path('glove')
path.mkdir(exist_ok=True)

In [3]:
URLs = ['http://nlp.stanford.edu/data/glove.6B.zip',
        'http://nlp.stanford.edu/data/glove.twitter.27B.zip',
        'http://nlp.stanford.edu/data/glove.840B.300d.zip']

In [4]:
all_targets = [('glove.6B.100d.txt', 'glove.6B.300d.txt'),
               ('glove.twitter.27B.200d.txt',),
               ('glove.840B.300d.txt',)]
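
The download loop pairs each URL with its tuple of targets by position, so the two lists must stay in the same order. The following is a minimal sketch of a sanity check, assuming that every target file name begins with the stem of its archive name (a convention inferred from the names above, not something the GloVe project guarantees). The helper names `archive_stem` and `mismatched_targets` are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

URLs = ['http://nlp.stanford.edu/data/glove.6B.zip',
        'http://nlp.stanford.edu/data/glove.twitter.27B.zip',
        'http://nlp.stanford.edu/data/glove.840B.300d.zip']

all_targets = [('glove.6B.100d.txt', 'glove.6B.300d.txt'),
               ('glove.twitter.27B.200d.txt',),
               ('glove.840B.300d.txt',)]


def archive_stem(url):
    """Return the archive file name without its final '.zip' suffix."""
    return PurePosixPath(urlparse(url).path).stem


def mismatched_targets(urls, targets_per_url):
    """List (url, target) pairs whose target name does not start with the archive stem."""
    return [(url, target)
            for url, targets in zip(urls, targets_per_url)
            for target in targets
            if not target.startswith(archive_stem(url) + '.')]


# An empty list means every target is paired with a plausibly matching archive.
print(mismatched_targets(URLs, all_targets))
```

If the two lists were reordered independently, the check would report the pairs that no longer line up.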

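Downloading each target can take 10-20 minutes or longer, depending on your connection. You can paste one of the URLs into a browser to check the download speed and the estimated time.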
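
As a rough cross-check of these estimates, the idealized download time can be computed from the archive size and the connection speed. The sketch below assumes approximate archive sizes as listed on the GloVe project page at the time of writing (verify them before relying on the numbers), and the function name `estimate_minutes` is hypothetical. Real downloads are usually slower than this lower bound.

```python
# Approximate archive sizes in bytes. These are assumed values, not measured ones.
ASSUMED_SIZES = {
    'glove.6B.zip': 822e6,
    'glove.twitter.27B.zip': 1.42e9,
    'glove.840B.300d.zip': 2.03e9,
}


def estimate_minutes(size_bytes, mbit_per_s):
    """Return the idealized download time in minutes at a sustained rate."""
    bytes_per_s = mbit_per_s * 1e6 / 8  # convert megabits per second to bytes per second
    return size_bytes / bytes_per_s / 60


for name, size in ASSUMED_SIZES.items():
    print(f'{name}: ~{estimate_minutes(size, 20):.1f} min at 20 Mbit/s')
```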

In [5]:
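for url, targets in zip(URLs, all_targets):
    print(f'downloading {targets}...')
    # Note: the whole archive is held in memory before extraction
    response = requests.get(url)
    response.raise_for_status()  # stop on HTTP errors instead of unzipping an error page
    print('done')
    with ZipFile(BytesIO(response.content)) as zip_file:
        for file in tqdm(zip_file.namelist()):
            if file in targets:
                local_file = path / file
                if not local_file.exists():
                    # Stream the member line by line rather than reading it all at once
                    with zip_file.open(file) as source, local_file.open('wb') as output:
                        for line in source:
                            output.write(line)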

downloading ('glove.6B.100d.txt', 'glove.6B.300d.txt')...
  0%|          | 0/4 [00:00<?, ?it/s]
done
100%|██████████| 4/4 [00:11<00:00,  2.85s/it]

downloading ('glove.twitter.27B.200d.txt',)...
  0%|          | 0/4 [00:00<?, ?it/s]
done
100%|██████████| 4/4 [00:16<00:00,  4.02s/it]

downloading ('glove.840B.300d.txt',)...
  0%|          | 0/1 [00:00<?, ?it/s]
done
100%|██████████| 1/1 [00:58<00:00, 58.75s/it]
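
The loop above keeps each archive entirely in memory, which can be a problem for the larger files. One possible alternative, sketched below, streams the archive to disk first and then copies only the wanted members out of it. This is not the notebook's method: it swaps in the standard-library `urllib.request` for `requests` to avoid third-party dependencies, and the helper names `download_to_file` and `extract_targets` are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from zipfile import ZipFile


def download_to_file(url, dest, chunk_size=1 << 20):
    """Stream `url` to `dest` in chunks; skip the download if `dest` already exists."""
    dest = Path(dest)
    if dest.exists():
        return dest
    with urllib.request.urlopen(url) as response, dest.open('wb') as output:
        shutil.copyfileobj(response, output, chunk_size)
    return dest


def extract_targets(zip_path, targets, out_dir):
    """Copy the named archive members into `out_dir`, streaming each one to disk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with ZipFile(zip_path) as zip_file:
        for name in zip_file.namelist():
            local_file = out_dir / name
            # Only extract requested members that are not already on disk
            if name in targets and not local_file.exists():
                with zip_file.open(name) as source, local_file.open('wb') as output:
                    shutil.copyfileobj(source, output)
                written.append(local_file)
    return written
```

A call such as `extract_targets(download_to_file(url, 'glove/archive.zip'), targets, 'glove')` would then take the place of the body of the loop, where `glove/archive.zip` is a placeholder path. One caveat: an interrupted download leaves a partial file behind that the existence check treats as complete, so it would need to be deleted by hand before retrying.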
