# Week 4
## Part 1: Download the Wikipedia pages of rock performers

Now, it's time to go and get the names of all the wiki pages you'll need for your analysis. Those will serve as the nodes in our network.

**Exercise**

- Go to the page https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers and extract all of the artist-links using your regular expressions from above.
  - Hint: To make this easier, you can simply hit the edit button on Wikipedia, copy the entire content of the page into a plain text file on your computer, and manually delete all of the markup that's not related to the artists' names. (Otherwise you will pick up some wiki-links that you don't want.)


In [None]:
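import re

def extract_wiki_links(text):
    # Match [[Title]] and [[Title|label]] and capture only the page title.
    # The character class admits word characters, whitespace, '.', '/', and parentheses,
    # so titles containing other punctuation (e.g. hyphens or apostrophes) are not matched.
    pattern = r'\[\[([\w\s./\(\)]+)(?:\|[\w\s./\(\)]+)?\]\]'
    matches = re.findall(pattern, text)
    return matches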

def insert_underscores(list_of_titles):
    return [title.replace(' ', '_') for title in list_of_titles]

def create_wiki_link(list_of_titles):
    wiki_prefix = "https://en.wikipedia.org/wiki/"
    links = [wiki_prefix + title for title in list_of_titles]
    return links
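
The character class in the pattern above only admits word characters, whitespace, `.`, `/`, and parentheses, so a link whose title contains a hyphen or an apostrophe is skipped. As a hedged sketch (not part of the original solution), a broader pattern could accept anything up to the first `|`, `#`, or bracket. The names `WIKI_LINK` and `extract_wiki_links_broad` are hypothetical, and the sample text is made up for illustration.

```python
import re

# Hypothetical broader pattern: capture everything up to the first '|', '#', or bracket.
WIKI_LINK = re.compile(r'\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]')

def extract_wiki_links_broad(text):
    """Return the target page title of every [[...]] link in `text`."""
    return [title.strip() for title in WIKI_LINK.findall(text)]

# Made-up sample in the style of the list page's wikitext.
sample = "* [[AC/DC]]\n* [[Guns N' Roses]]\n* [[Blink-182]]\n* [[Accept (band)|Accept]]"
print(extract_wiki_links_broad(sample))
```

Note that a pattern this permissive also matches `[[File:...]]` and `[[Category:...]]` links, so the manual cleanup suggested in the hint still matters.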

Extract the names of all the wiki pages.

In [19]:
# Read all text from the .txt file (the context manager closes the file automatically)
with open('List_of_mainstream_rock_performers.txt', 'r', encoding='utf-8') as wiki_txt:
    text = wiki_txt.read()

# extract all wiki links from the text
list_of_performers = extract_wiki_links(text)
bands = insert_underscores(list_of_performers)
print(list_of_performers[:10])
print(bands[:10])


['10cc', '10 Years (band)', '3 Doors Down', '311 (band)', '38 Special (band)', 'ABBA', 'Accept (band)', 'AC/DC', 'Bryan Adams', 'Aerosmith']
['10cc', '10_Years_(band)', '3_Doors_Down', '311_(band)', '38_Special_(band)', 'ABBA', 'Accept_(band)', 'AC/DC', 'Bryan_Adams', 'Aerosmith']


### API

**Exercise**
Use your knowledge of APIs and the list of all the wiki-pages to download all the text on the pages of the rock performers.
- Hint 0: Make sure you read the Wiki API pages to ensure that you download the cleanest possible version of the page (the wikitext). This link may be helpful.
- Hint 1: You may want to save the individual band/artist pages on your computer. You can use your skills from the first lectures to write them as plain-text files (that's what I would do - one file per band/artist, named according to its wiki-link). (But you can also use pickle files or start a database if you like that better.)
- Hint 2: If you now have a directory with all those files, you can use os.listdir() to list all the files in that directory within Python and iterate over the files if you need to.
- Hint 3: Don't forget to add underscores to the performer names when you construct the URLs.
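
Hint 0 asks for the wikitext of each page. One way to request it, sketched here under the assumption that the MediaWiki Action API is queried with `prop=revisions`, `rvprop=content`, `rvslots=main`, and `formatversion=2`, is shown below. The helper names `build_wikitext_query` and `wikitext_from_response` are hypothetical, and the example only builds the URL and parses a hand-written stand-in response, so it makes no network call.

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"

def build_wikitext_query(page_name):
    """Build an API URL requesting the raw wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": page_name,
        "format": "json",
        "formatversion": "2",
    }
    # urlencode percent-encodes reserved characters such as '/' in the title.
    return API_URL + "?" + urllib.parse.urlencode(params)

def wikitext_from_response(data):
    """Pull the wikitext out of a decoded JSON response (formatversion=2 layout)."""
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

print(build_wikitext_query("AC/DC"))

# Hand-written stand-in for a real response, for illustration only.
fake = {"query": {"pages": [{"title": "AC/DC", "revisions": [
    {"slots": {"main": {"content": "'''AC/DC''' are a rock band..."}}}]}]}}
print(wikitext_from_response(fake))
```

Building the query from a dictionary keeps the parameters readable and leaves the escaping of titles to `urllib.parse.urlencode`.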


In [None]:
# From a Wikipedia page name, extract the plain-text page content using the Wikipedia API
def get_wiki_page_text(page_name):
    import json
    import urllib.parse
    import urllib.request

    baseurl = "https://en.wikipedia.org/w/api.php?"
    action = "action=query"
    # Percent-encode the title so characters such as '/' or '&' cannot break the query string
    title = f"titles={urllib.parse.quote(page_name, safe='')}"
    # explaintext is a parameter of prop=extracts (not an rvprop value);
    # it requests plain text instead of the default HTML extract
    content = "prop=extracts&explaintext=1"
    dataformat = "format=json"

    query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)

    headers = {'User-Agent': 'Mozilla/5.0 (compatible; ExerciseBot/1.0)'} # Set a user-agent to avoid potential blocking by Wikipedia
    req_query = urllib.request.Request(query, headers=headers)
    with urllib.request.urlopen(req_query) as wikiresponse:
        wikidata = wikiresponse.read()
    wikitext = wikidata.decode('utf-8')

    json_wikitext = json.loads(wikitext)

    # 'pages' is keyed by page ID; a single requested title yields a single entry
    page_id = next(iter(json_wikitext['query']['pages']))
    page_text = json_wikitext['query']['pages'][page_id]['extract']

    return page_text

# Test with one performer from the list (bands[1] is '10_Years_(band)')
# print(get_wiki_page_text(bands[1]))

# Save all pages as text files in a directory
# (AC/DC is missing: the '/' in its name is treated as a path separator when the file is opened)
import os
output_dir = 'performer_pages'
os.makedirs(output_dir, exist_ok=True)
for performer in bands:
    try:
        page_text = get_wiki_page_text(performer)
        with open(os.path.join(output_dir, f"{performer}.txt"), 'w', encoding='utf-8') as f:
            f.write(page_text)
    except Exception as e:
        print(f"Failed to retrieve page for {performer}: {e}")
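
The loop above cannot save AC/DC because `os.path.join(output_dir, "AC/DC.txt")` points into a subdirectory `AC` that does not exist. A minimal sketch of one workaround is to replace the slash before building the file name. The helpers `title_to_filename` and `filename_to_title` are hypothetical names, and the choice of `%2F` as the replacement is an assumption; any string that cannot occur in a title would do.

```python
import os
import tempfile

def title_to_filename(title):
    """Map a page title to a file name that contains no path separator."""
    return title.replace("/", "%2F") + ".txt"

def filename_to_title(filename):
    """Invert title_to_filename."""
    return filename[:-len(".txt")].replace("%2F", "/")

# Demonstrate in a temporary directory so nothing is written to performer_pages.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, title_to_filename("AC/DC"))
    with open(path, "w", encoding="utf-8") as f:
        f.write("placeholder text")
    saved_files = os.listdir(tmp_dir)

print(saved_files)
```

In the saving loop, `f"{performer}.txt"` would then become `title_to_filename(performer)`, and `filename_to_title` recovers the original title when the files are read back.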



In [23]:
print(os.listdir(path=output_dir)[:10])  # List first 10 files in the directory

['Funkadelic.txt', 'Slayer.txt', 'Ted_Nugent.txt', 'Great_White.txt', 'Days_of_the_New.txt', 'The_Dave_Clark_Five.txt', 'Keane_(band).txt', 'Jimmy_Eat_World.txt', 'Flogging_Molly.txt', 'Simple_Plan.txt']
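
Following Hint 2, the saved files can be read back by iterating over `os.listdir()`. The sketch below uses a hypothetical helper `load_pages` and a temporary directory with made-up contents, so it runs without the downloaded pages. Because `os.listdir()` returns names in arbitrary order (as the listing above shows), the helper sorts them first.

```python
import os
import tempfile

def load_pages(directory):
    """Return a dict mapping each .txt file's base name to its contents."""
    pages = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue  # skip anything that is not a saved page
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            pages[filename[:-len(".txt")]] = f.read()
    return pages

# Self-contained demo on a temporary directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("Slayer", "Funkadelic"):
        with open(os.path.join(tmp_dir, name + ".txt"), "w", encoding="utf-8") as f:
            f.write(f"text about {name}")
    demo_pages = load_pages(tmp_dir)

print(list(demo_pages))
```

With the real data, `load_pages(output_dir)` would give one entry per performer, keyed by the same underscore-separated name used in the wiki link.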
