# scraper

This notebook downloads random Wikipedia pages and saves them as plain-text files in the `wikitext/` folder.

You do NOT need to run this. We already downloaded 10000 pages for you.

In [8]:
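import os
import signal
import time

import requests


class TimeoutException(Exception):
    """Raised when a page download exceeds its time limit."""


def timeout_handler(signum, frame):
    # Signal handler for SIGALRM: abort whatever is running at that moment
    raise TimeoutException("Function timed out.")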
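
An exception-raising handler like the one above is usually paired with an alarm signal, and the alarm has to be cancelled again on every exit path. A minimal sketch of one way to package that, assuming a Unix system and code running in the main thread; the `time_limit` name is hypothetical and not part of this notebook:

```python
import signal
import time
from contextlib import contextmanager


class TimeoutException(Exception):
    """Raised when the guarded block runs longer than allowed."""


@contextmanager
def time_limit(seconds):
    # Hypothetical helper: raise TimeoutException if the block takes too long.
    def handler(signum, frame):
        raise TimeoutException("Function timed out.")

    previous = signal.signal(signal.SIGALRM, handler)
    # setitimer accepts fractional seconds, unlike signal.alarm
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel the timer and restore the old handler on every exit path
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


# Usage: the sleep stands in for a slow network call
try:
    with time_limit(0.1):
        time.sleep(1)
except TimeoutException:
    print("[timeout]")
```

Because the cleanup sits in a `finally` clause, a failure inside the block cannot leave a stale alarm pending.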


In [9]:

def get_data(page):
    # URL of the Wikipedia API endpoint
    url = "https://en.wikipedia.org/w/api.php"

    # Ask for the plain-text extract of the requested page
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": "1",
        "titles": page,
    }

    # Make the GET request
    response = requests.get(url, params=params)

    # Results are keyed by page ID, so take the first (and only) entry.
    # Despite the variable name, this is the plain-text extract, not raw wikitext.
    data = response.json()
    page_id = list(data["query"]["pages"].keys())[0]
    wikitext = data["query"]["pages"][page_id]["extract"]

    return wikitext


# Print the extracted text of a sample page
print(get_data("Valle_dei_Templi"))


The Valle dei Templi (Italian: [ˈvalle dei ˈtɛmpli]; Sicilian: Vaddi di li Tempri), or Valley of the Temples, is an archaeological site in Agrigento (ancient Greek Akragas), Sicily. It is one of the most outstanding examples of ancient Greek art and architecture, and is one of the main attractions of Sicily. 
The term "valley" is a misnomer, the site being located on a ridge outside the town of Agrigento.


== Overview ==

The Valley includes remains of seven temples, all in Doric style. The ascription of the names, apart from that of the Olympeion, are a mere tradition established in Renaissance times. The temples are:

Temple of Concordia, whose name comes from a Latin inscription found nearby, and which was built in the 5th century BC. Turned into a church in the 6th century AD, it is now one of the best preserved in the Valley.
Temple of Juno, also built in the 5th century BC. It was burnt in 406 BC by the Carthaginians.
Temple of Heracles, who was one of the most venerated deities
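
`get_data` indexes straight into the response, so a title that does not exist raises a `KeyError` on `"extract"`. A minimal sketch of more defensive parsing, assuming the API marks unknown titles with a `"missing"` key in the page entry; `extract_text` is a hypothetical helper, not part of this notebook:

```python
def extract_text(data):
    # Hypothetical helper: return the first non-empty extract, or None.
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        # Assumption: unknown titles carry a "missing" key and no "extract"
        if "missing" in page:
            continue
        text = page.get("extract")
        if text:
            return text
    return None


# Hand-written sample responses, shaped like the ones get_data parses
found = {"query": {"pages": {"123": {"title": "Pet door", "extract": "A pet door is..."}}}}
absent = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_text(found))
print(extract_text(absent))
```

Returning `None` lets the caller skip a page explicitly instead of relying on a broad `except`.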

In [10]:
# Fetch batches of random article titles forever; interrupt the kernel to stop
while True:
    # URL of the Wikipedia API endpoint
    url = "https://en.wikipedia.org/w/api.php"

    # Set the parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnnamespace": 0,  # Limit to main namespace pages
        "rnlimit": 5000,  # Number of random pages requested (the API may cap this)
    }

    # Make the GET request
    response = requests.get(url, params=params)

    # Get the random pages from the response
    data = response.json()
    random_pages = data["query"]["random"]

    print("[new cycle]")
    for page in random_pages:
        # Print the title of each random page
        print(page["title"])
        try:
            # A "/" in a title would be read as a directory separator
            file_path = "wikitext/" + page["title"].replace("/", "_") + ".txt"
            if os.path.exists(file_path):
                print("[skipped]")
                continue
            time.sleep(0.01)
            max_execution_time = 1  # seconds

            try:
                # Set the alarm signal and handler
                signal.signal(signal.SIGALRM, timeout_handler)
                signal.alarm(max_execution_time)

                # Get the text content for each random page
                wikitext = get_data(page["title"])
            except TimeoutException:
                print("[timeout]")
                continue
            finally:
                # Always cancel the pending alarm, even if get_data failed
                signal.alarm(0)

            # Write the text to a file
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(wikitext)
                print("[done]")

        except Exception as e:
            print("[error]", e)

[new cycle]
Wàn Guó Gōng Bào
[done]
Brancaccio
[timeout]
Torrens Island, South Australia
[done]
Bishop of Maidstone
[done]
Ossetian Muslims
[done]
Temple Rodef Shalom (Falls Church, Virginia)
[done]
Smethwick by-election
[done]
Etsushi Takahashi
[done]
Mayisha Akbar
[done]
Kaki Singer
[done]
Benedetto Bartolo
[done]
Atherington
[done]
Ted Harris (ice hockey)
[done]
Sajzy
[done]
Eunidia bigriseovittata
[done]
Pseudebulea lungtanensis
[timeout]
2019–20 NCAA Division I women's ice hockey season
[done]
Kevin Thomas (cricketer)
[done]
Terry Stringer
[done]
Yörük Ali Efe
[done]
Sean Day
[done]
Meddler (short story)
[done]
W Morung Makunga
[done]
IEBC
[done]
Jan Gerard Wessels Boer
[done]
David Smith (sport shooter)
[done]
List of people from New Orleans
[done]
Heteroschizomus
[done]
Kevin Schuler
[done]
Mecidiyeköy
[done]
Firooz Bahram High School
[done]
Magic (Tom Browne album)
[done]
Johan Andreas Altenburg
[done]
Miles Müller
[done]
Orient Express (TV series)
[done]
Ettienne Richardson
[d

Traceback (most recent call last):
  File "/home/luca-fabbian/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3505, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_12294/1462914030.py", line 43, in <module>
    wikitext = get_data(page["title"])
  File "/tmp/ipykernel_12294/1159211186.py", line 16, in get_data
    response = requests.get(url, params=params)
  File "/home/luca-fabbian/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/home/luca-fabbian/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/luca-fabbian/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/luca-fabbian/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
   

In [11]:
import os

folder_path = "wikitext"  # Replace with the actual folder path

# Rename every .html file in the folder so that it carries a .txt extension
for filename in os.listdir(folder_path):
    if filename.endswith(".html"):
        new_filename = os.path.splitext(filename)[0] + ".txt"
        old_file_path = os.path.join(folder_path, filename)
        new_file_path = os.path.join(folder_path, new_filename)
        os.rename(old_file_path, new_file_path)
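
After scraping or renaming, a quick sanity check can confirm how many text files the folder actually holds. A minimal sketch; `summarize_folder` is a hypothetical helper, and the `wikitext` path is the one assumed throughout this notebook:

```python
import os


def summarize_folder(folder_path, extension=".txt"):
    # Hypothetical helper: count matching files and add up their sizes in bytes.
    count = 0
    total_bytes = 0
    for filename in os.listdir(folder_path):
        if filename.endswith(extension):
            count += 1
            total_bytes += os.path.getsize(os.path.join(folder_path, filename))
    return count, total_bytes


# Only report if the folder exists, so the cell is safe to run anywhere
if os.path.isdir("wikitext"):
    n_files, n_bytes = summarize_folder("wikitext")
    print(n_files, "files,", n_bytes, "bytes")
```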