# Downloading all Wikipedia Articles 

This notebook downloads all of the English Wikipedia articles from the Wikimedia dumps. I kept the actual download out of the main notebook because of the lengthy output.

## Find Files to Download

In [17]:
import requests
from bs4 import BeautifulSoup
from timeit import default_timer as timer
import os

base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')
soup_index

<html>
<head><title>Index of /enwiki/</title></head>
<body bgcolor="white">
<h1>Index of /enwiki/</h1><hr/><pre><a href="../">../</a>
<a href="20190301/">20190301/</a>                                          21-Apr-2019 01:33                   -
<a href="20190320/">20190320/</a>                                          02-May-2019 01:28                   -
<a href="20190401/">20190401/</a>                                          21-May-2019 01:34                   -
<a href="20190420/">20190420/</a>                                          02-Jun-2019 01:27                   -
<a href="20190501/">20190501/</a>                                          10-May-2019 09:15                   -
<a href="20190520/">20190520/</a>                                          24-May-2019 02:26                   -
<a href="20190601/">20190601/</a>                                          04-Jun-2019 20:30                   -
<a href="latest/">latest/</a>                                            04
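
The index lists dated dump directories alongside `latest/`. As a minimal sketch, the dated entries can be picked out by pattern instead of by one fixed name. The helper `dated_dumps` is hypothetical (it is not part of the original notebook) and assumes every dump directory is named with eight digits followed by a slash, as in the listing above. The newest directory may belong to a dump that is still in progress, so the last entry is not necessarily complete.

```python
import re

def dated_dumps(hrefs):
    """Return the hrefs that look like dated dump directories, oldest first."""
    # Assumption: dump directories are named YYYYMMDD/ (eight digits plus a slash)
    return sorted(h for h in hrefs if re.fullmatch(r'\d{8}/', h))

# Example with a few hrefs copied from the index listing
hrefs = ['../', '20190301/', '20190520/', '20190601/', 'latest/']
print(dated_dumps(hrefs))
```

In the notebook, `hrefs` could be built with `[a['href'] for a in soup_index.find_all('a')]`.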

In [18]:
# Select the link for the 20190601 dump, the most recent one in the index
dumps = [a['href'] for a in soup_index.find_all('a') if 
         a.text == '20190601/']

dumps_url = base_url + dumps[0]

# Retrieve the html
dump_html = requests.get(dumps_url).text

# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

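# Collect (file name, remaining metadata) for every pages-articles file
files = []
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))

# Keep only the partitioned XML files (names containing '.xml-p')
files_to_download = [file[0] for file in files if '.xml-p' in file[0]]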
print(f'There are {len(files_to_download)} files to download.')

There are 114 files to download.


In [19]:
files_to_download

['enwiki-20190601-pages-articles-multistream1.xml-p10p30302.bz2',
 'enwiki-20190601-pages-articles-multistream2.xml-p30304p88444.bz2',
 'enwiki-20190601-pages-articles-multistream3.xml-p88445p200507.bz2',
 'enwiki-20190601-pages-articles-multistream4.xml-p200511p352689.bz2',
 'enwiki-20190601-pages-articles-multistream5.xml-p352690p565312.bz2',
 'enwiki-20190601-pages-articles-multistream6.xml-p565314p892912.bz2',
 'enwiki-20190601-pages-articles-multistream7.xml-p892914p1268691.bz2',
 'enwiki-20190601-pages-articles-multistream8.xml-p1268693p1791079.bz2',
 'enwiki-20190601-pages-articles-multistream9.xml-p1791081p2336422.bz2',
 'enwiki-20190601-pages-articles-multistream10.xml-p2336425p3046511.bz2',
 'enwiki-20190601-pages-articles-multistream11.xml-p3046517p3926861.bz2',
 'enwiki-20190601-pages-articles-multistream12.xml-p3926864p5040435.bz2',
 'enwiki-20190601-pages-articles-multistream13.xml-p5040438p6197593.bz2',
 'enwiki-20190601-pages-articles-multistream14.xml-p6197599p7697599.
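
Each file name ends in a suffix such as `p10p30302`, which appears to give the first and last page ID covered by that partition. As a rough sketch under that assumption, the range can be pulled out with a regular expression. `page_range` and `RANGE_PATTERN` are hypothetical names, not something provided by the dump tooling.

```python
import re

# Assumption: partition names end in .xml-p<first>p<last>.bz2
RANGE_PATTERN = re.compile(r'\.xml-p(\d+)p(\d+)\.bz2$')

def page_range(filename):
    """Return (first, last) as integers, or None if the name has no range suffix."""
    match = RANGE_PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(page_range('enwiki-20190601-pages-articles-multistream1.xml-p10p30302.bz2'))
# (10, 30302)
```

Sorting `files_to_download` by `page_range` would be one way to process the partitions in page order.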

## Download Files Using Keras

By default, `get_file` saves the files in `~/.keras/datasets`.
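
Before starting a long download, it can help to see which files are already on disk. The sketch below assumes the default Keras cache location under the home directory (`~/.keras/datasets`); if the cache has been moved, pass the real directory as `cache_dir`. Both `cached_path` and `split_cached` are hypothetical helpers written for this illustration.

```python
import os

def cached_path(filename, cache_dir=None):
    """Return where a downloaded file is expected to live."""
    if cache_dir is None:
        # Assumption: the default Keras cache directory is in use
        cache_dir = os.path.join(os.path.expanduser('~'), '.keras', 'datasets')
    return os.path.join(cache_dir, filename)

def split_cached(filenames, cache_dir=None):
    """Split file names into (already on disk, still missing)."""
    present, missing = [], []
    for name in filenames:
        if os.path.exists(cached_path(name, cache_dir)):
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Calling `split_cached(files_to_download)` returns the two lists, so the length of the second one shows how much is left to fetch.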

In [None]:
from tensorflow.keras.utils import get_file

data_paths = []

start = timer()
for file in files_to_download:
    data_paths.append(get_file(file, dumps_url + file))
    
end = timer()
print(f'{round(end - start)} total seconds elapsed.')

Downloading data from https://dumps.wikimedia.org/enwiki/20190601/enwiki-20190601-pages-articles-multistream22.xml-p25427984p26823658.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20190601/enwiki-20190601-pages-articles-multistream23.xml-p26823661p28323661.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20190601/enwiki-20190601-pages-articles-multistream23.xml-p28323661p29823661.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20190601/enwiki-20190601-pages-articles-multistream23.xml-p29823661p30503448.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20190601/enwiki-20190601-pages-articles-multistream24.xml-p30503454p32003454.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20190601/enwiki-20190601-pages-articles-multistream24.xml-p32003454p33503454.bz2
Downloading data from https://dumps.wikimedia.org/enwiki/20190601/enwiki-20190601-pages-articles-multistream24.xml-p33503454p33952815.bz2
Downloading data from https://dump

The total download time was just over 2 hours. That's not bad for all of Wikipedia (at least the English articles).

This process could also be run in parallel using multithreading or multiprocessing. However, I have run into issues when downloading files in parallel because the code was making too many requests to the server.
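
One way to keep the request rate down is to cap the number of simultaneous downloads. Here is a minimal sketch using a small thread pool. `download_all` and its `download` argument are hypothetical names, and `max_workers=2` is a conservative guess rather than a documented server limit.

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(filenames, download, max_workers=2):
    """Call download(name) for every name with at most max_workers running at once.

    Results are returned in the same order as filenames.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises any download error
        return list(pool.map(download, filenames))
```

In this notebook the call might look like `data_paths = download_all(files_to_download, lambda f: get_file(f, dumps_url + f))`. Whether this is actually faster than the sequential loop depends on the server and the connection.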