Top Wikipedia pages, [according to Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Popular_pages#Top-100_list).

Rank | Page | Views in millions
-----|------|------------------
1 | United States | 237
2 | Donald Trump | 233
3 | Barack Obama | 155
3 | Elizabeth II | 155
5 | India | 151
6 | World War II | 136
7 | Michael Jackson | 133
7 | United Kingdom | 133
9 | Cristiano Ronaldo | 131
10 | Lady Gaga | 124
11 | Sex | 123
12 | Adolf Hitler | 121
13 | Eminem | 120
14 | Game of Thrones | 116
15 | World War I | 113
16 | The Beatles | 112
17 | Justin Bieber | 109
18 | Elon Musk | 108
19 | Canada | 106
19 | Freddie Mercury | 106
21 | Lionel Messi | 105
22 | Kim Kardashian | 104
23 | Steve Jobs | 103
24 | Michael Jordan | 100
24 | Dwayne Johnson | 100
24 | The Big Bang Theory | 100
24 | List of Presidents of the United States | 100
24 | Australia | 100
29 | Stephen Hawking | 97
30 | Taylor Swift | 94
31 | List of highest-grossing films | 93
31 | China | 93
33 | Darth Vader | 91
34 | Star Wars | 90
34 | Miley Cyrus | 90
34 | Abraham Lincoln | 90
37 | September 11 attacks | 89
38 | Lil Wayne | 88
38 | Academy Awards | 88
38 | Japan | 88
38 | Johnny Depp | 88
38 | Germany | 88
38 | LeBron James | 88
38 | New York City | 88
45 | Harry Potter | 86
45 | Kobe Bryant | 86
45 | Selena Gomez | 86
45 | Leonardo DiCaprio | 86
45 | Rihanna | 86
45 | Albert Einstein | 86
45 | Russia | 86
52 | The Walking Dead (TV series) | 85
53 | How I Met Your Mother | 83
53 | Kanye West | 83
53 | Tupac Shakur | 83
53 | Angelina Jolie | 83
53 | John F. Kennedy | 83
53 | COVID-19 pandemic | 83
53 | Scarlett Johansson | 83
53 | List of Marvel Cinematic Universe films | 83
61 | Joe Biden | 82
62 | Chernobyl disaster | 81
63 | France | 80
63 | Tom Cruise | 80
63 | Ariana Grande | 80
66 | Jennifer Aniston | 79
66 | Breaking Bad | 79
66 | Arnold Schwarzenegger | 79
66 | Pablo Escobar | 79
70 | Keanu Reeves | 78
71 | Mila Kunis | 77
71 | Vietnam War | 77
71 | Meghan, Duchess of Sussex | 77
71 | Queen Victoria | 77
71 | Mark Zuckerberg | 77
71 | William Shakespeare | 77
71 | Jay-Z | 77
78 | Earth | 76
78 | Bill Gates | 76
78 | Muhammad Ali | 76
78 | Ted Bundy | 76
82 | Nicki Minaj | 75
82 | Will Smith | 75
84 | Singapore | 74
84 | Israel | 74
84 | John Cena | 74
84 | Bruce Lee | 74
84 | Elvis Presley | 74
89 | Diana, Princess of Wales | 73
89 | Charles Manson | 73
89 | Manchester United F.C. | 73
92 | Marilyn Monroe | 72
93 | Sexual intercourse | 71
93 | Katy Perry | 71
93 | Winston Churchill | 71
93 | Tom Brady | 71
93 | Periodic Table | 71
93 | Glee (TV series) | 71
93 | Brad Pitt | 71
93 | Madonna | 71


In [160]:
from bs4 import BeautifulSoup
from requests import get

import pandas as pd
import numpy as np

import time
import re

Some constants...

In [161]:
DATA_DIR = 'data'

# we filter out these elements
CLEAN = [
  'a[id="top"]',
  'a[class="mw-selflink selflink"]',
  'a[class="image"]',
  'a[class="internal"]',
  "sup",
  "div.reflist" # remove citations, those don't count
]

# we filter out these links
REM_LINKS = [
  r"(\/wiki\/File:\w+)",
  r"(\/wiki\/Special:\w+)",
  r"(\/wiki\/Template:\w+)",
  r"(\/wiki\/Category:\w+)",
  r"(\/wiki\/Portal:\w+)",
  r"(\/wiki\/Template_talk:\w+)",
  r"(\/wiki\/Help:\w+)",
  r"(\/wiki\/Wikipedia:\w+)",
  r"(^#\w+)",
]

# main page content selector
CONT_SEL = "div#content"
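
To get a feel for what these patterns keep and drop, here is a small standalone sketch. It repeats `REM_LINKS` so that it runs on its own, and it mirrors the check that `cleanup_page` applies further down. The helper name `keep_href` and the sample hrefs are made up for illustration.

```python
import re

# Same patterns as REM_LINKS above, repeated so the snippet runs on its own.
REM_LINKS = [
  r"(\/wiki\/File:\w+)",
  r"(\/wiki\/Special:\w+)",
  r"(\/wiki\/Template:\w+)",
  r"(\/wiki\/Category:\w+)",
  r"(\/wiki\/Portal:\w+)",
  r"(\/wiki\/Template_talk:\w+)",
  r"(\/wiki\/Help:\w+)",
  r"(\/wiki\/Wikipedia:\w+)",
  r"(^#\w+)",
]

def keep_href(href):
  """Hypothetical helper: True if a link with this href would be kept."""
  # drop anything that matches one of the REM_LINKS patterns
  if any(re.match(regex, href) for regex in REM_LINKS):
    return False
  # keep only links that point at /wiki/<something>
  return re.match(r"^\/wiki\/\w+", href) is not None

# made-up sample hrefs
samples = [
  "/wiki/Iran",
  "/wiki/File:Flag.svg",
  "/wiki/Category:Countries",
  "#History",
  "https://example.org",
]

for href in samples:
  print(f"{href} -> keep={keep_href(href)}")
```

Note that `re.match` only matches at the start of the string, so these patterns act as prefix checks on the href.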

Here, we define a function that strips unwanted links and elements from a page. Wikipedia pages are fairly clean and easy to work with programmatically, but some elements still need to be filtered out: self links (links that point back to the page itself), image links, internal links, and links to file or template pages, among others.

In [162]:
def cleanup_page(html):
  # clean up unwanted links from pages
  for c in CLEAN:
    els = html.select(f"{CONT_SEL} {c}")
    for el in els:
      el.decompose()

  # format remaining links
  links = html.select(f"{CONT_SEL} a")
  for link in links:
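    # extract href from link (anchors without an href get an empty string,
    # which fails the /wiki/ check below and so gets removed)
    href = link.get('href', '')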

    # extract text of links and remove punctuation
    text = re.sub(r"[\,\.\:\!\?]", "", link.text)

    # at this stage, we want to further remove certain types of links
    # that is: any of the links in REM_LINKS, OR any link that doesn't start with /wiki/
    if any([re.match(regex, href) for regex in REM_LINKS]) or not re.match(r"^\/wiki\/\w+", href):
      link.decompose()
    else:
      # remove leading /wiki/ from href as it is redundant
      href = re.sub(r"^\/wiki\/", "", href)

      # Here is the 1000 IQ play. We want to preserve the URL of the links but
      # also work with them from a cleaner text file. We CAN extract the text
      # from the entire page but that would mean losing the hrefs. To solve
      # this, we replace the text content of the link with its text AND the
      # associated href. THEN we can simply extract the text content of the file
      # without losing the href!!1
      link.replace_with(f'{{{text}|{href}}}')
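
Once the page text contains `{text|href}` tokens, they can be pulled back out with a regular expression. The following is only a sketch of how that might look: the names `TOKEN`, `extract_links` and `strip_links` are hypothetical, the sample sentence is made up, and the pattern assumes that neither the link text nor the href contains `{`, `|` or `}`.

```python
import re

# Matches the {text|href} tokens written by cleanup_page.
# Assumption: link text and href never contain '{', '|' or '}'.
TOKEN = re.compile(r"\{([^{}|]*)\|([^{}|]*)\}")

def extract_links(text):
  """Return a list of (link text, href) pairs found in the page text."""
  return TOKEN.findall(text)

def strip_links(text):
  """Replace each token with its link text only, dropping the href."""
  return TOKEN.sub(r"\1", text)

# made-up sample of what a parsed page might contain
sample = "{Iran|Iran} borders {Iraq|Iraq} and the {Caspian Sea|Caspian_Sea}."

print(extract_links(sample))
print(strip_links(sample))
```

This keeps the saved text file as the single source for both the readable prose and the link targets.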

In [163]:
# load most popular wikipedia pages csv
df = pd.read_csv(f"{DATA_DIR}/top1000.csv")

# get a list of pages as an array of strings
pages = df['article'].to_numpy().astype(str)

# filter the pages to only articles without ':' in the title
# I know this may not cover everything, but I'm just testing here
pages = pages[np.char.find(pages, ':') == -1]
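
As the comment admits, the colon check is a blunt tool: it also drops real articles whose titles happen to contain a colon. One possible refinement, sketched here in plain Python instead of NumPy, is to drop a title only when the part before the colon is a known namespace. The `NAMESPACES` set is a partial, assumed list, and `is_article` and the sample titles are made up for illustration.

```python
# Partial, assumed list of namespace prefixes; extend as needed.
NAMESPACES = {
  "Special", "Wikipedia", "File", "Template", "Category",
  "Portal", "Help", "Talk", "User",
}

def is_article(title):
  """True unless the title starts with one of the namespace prefixes."""
  prefix, sep, _ = title.partition(":")
  # sep is empty when the title has no colon at all
  return not (sep and prefix in NAMESPACES)

# made-up sample titles
titles = [
  "Iran",
  "Special:Search",
  "Star_Wars:_The_Force_Awakens",
  "Wikipedia:Popular_pages",
]

articles = [t for t in titles if is_article(t)]
print(articles)
```

The trade-off is that the namespace list has to be maintained by hand, whereas the colon check needs no configuration.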

In [165]:
# in case you wanna skip ahead
START_INDEX = 713
SLEEP_TIME_S = 2

n = pages.size

# compile this for efficiency
newline_regex = re.compile(r"\n{3,}")

for i, page in enumerate(pages):
  if i < START_INDEX:
    continue

  print(f"{i} of {n} ({round((i / n) * 100, 2)}%) - Now Scraping {page}")

  # load the page as html with BeautifulSoup
  res = get(f'https://en.wikipedia.org/wiki/{page}')

  # check if we got banned :c
  if res.status_code != 200:
    print("We got got")
    break

  html = BeautifulSoup(res.text, 'html.parser')

  # clean up html on the page
  cleanup_page(html)

  # create one parsed page and one clean html page
  parsed_page = newline_regex.sub("\n\n", html.get_text()) # collapse runs of three or more newlines into two
  html_page = str(html.prettify())

  # replace bad characters in titles with underscores
  title = re.sub(r"\/", "_", page)

  # save files (explicit UTF-8 so non-ASCII text is written the same way on every platform)
  with open(f"{DATA_DIR}/pages/text/{i + 1}-{title}.txt", "w", encoding="utf-8") as parsed_file:
    parsed_file.write(parsed_page)

  with open(f"{DATA_DIR}/pages/html/{i + 1}-{title}.html", "w", encoding="utf-8") as html_file:
    html_file.write(html_page)

  # let's not overload wikipedia with requests here
  time.sleep(SLEEP_TIME_S)

713 of 970 (73.51%) - Now Scraping 2022_FIFA_World_Cup_qualification_(OFC)
714 of 970 (73.61%) - Now Scraping Pravin_Tambe
715 of 970 (73.71%) - Now Scraping Roman_Polanski
716 of 970 (73.81%) - Now Scraping Jack_McKinney_(basketball)
717 of 970 (73.92%) - Now Scraping Nazi_Germany
718 of 970 (74.02%) - Now Scraping Sandra_Oh
719 of 970 (74.12%) - Now Scraping Kherson
720 of 970 (74.23%) - Now Scraping T-14_Armata
721 of 970 (74.33%) - Now Scraping Virginia_Thomas
722 of 970 (74.43%) - Now Scraping Iran
723 of 970 (74.54%) - Now Scraping Sunny_Balwani
724 of 970 (74.64%) - Now Scraping Ragnar_Lodbrok
725 of 970 (74.74%) - Now Scraping South_Africa
726 of 970 (74.85%) - Now Scraping Turkish_Radio_and_Television_Corporation
727 of 970 (74.95%) - Now Scraping Cuban_Missile_Crisis
728 of 970 (75.05%) - Now Scraping Paul_Thomas_Anderson
729 of 970 (75.15%) - Now Scraping Muammar_Gaddafi
730 of 970 (75.26%) - Now Scraping Ian_Gibbons_(biochemist)
731 of 970 (75.36%) - Now Scraping Lockheed_M