# Formalia:

Please read the [assignment overview page](https://github.com/suneman/socialgraphs2025/wiki/Assignments) carefully before proceeding. This page contains information about formatting, group sizes, and many other aspects of handing in the assignment.

_If you fail to follow these simple instructions, it will negatively impact your grade!_

**Due date and time**: The assignment is due on Tuesday, September 30th, 2025 at 23:55. Hand in your IPython notebook file (with extension `.ipynb`) via DTU Learn.

# Assignment 1.1: Exploring WS and BA models

This first part draws on the Watts-Strogatz and Barabási-Albert models from Week 3. You should provide solutions to the exercises with the following titles from **Part 1**

* *Did you really read the text? Answer the following questions (no calculations needed) in your IPython notebook*

* *WS edition*

And from **Part 2**

* *BA Edition*.
  * **Note**: The second part of this exercise (after the questions to the text) first has you build a BA network step-by-step, but doesn't ask any questions. For that part, I would simply like you to write well-documented code that shows how you build the network. 


# Assignment 1.2: Stats and visualization of the Rock Music Network

This second part requires you to have built the network of Rock Musicians as described in the exercises for Week 4. You should complete the following exercise from **Part 2**.

* *Explain your process in words*

* *Simple network statistics and analysis*.

  * **Note related to this and the following exercise**. It is nice to have the dataset underlying the statistics and visualization available when we grade. Therefore, I recommend that you create a small *network dataset*, which is simply your graph stored in some format that you like (since it's only a few hundred nodes and a few thousand edges, it won't take up a lot of space). You can then place that network on one of your group members' GitHub accounts (or some other server that's available online) and have your Jupyter Notebook fetch that dataset when it runs. (It's OK to use an LLM for help with setting this up if it seems difficult.)

And the following exercise from **Part 3**

* *Let's build a simple visualization of the network*

And that's it! You're all set.

# Assignment 1.2
## Extracting the Links from the 'List of mainstream rock performers' Wikipedia Page
### Part 1: Defining some functions for easier link extraction using regular expressions
We define three functions in this part:
1. get_links: This function extracts the links from a page.
2. sanitize_string: This function sanitizes a single link so that it can be used as a file name. Some band names contain characters that are not allowed in file names (for example, the `/` in 'AC/DC'), so slashes and spaces are replaced with underscores.
3. sanitize_list: This function applies sanitize_string to every string in a list.

In [1]:
import re

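def get_links(page: str) -> list:
    # Capture the text between [[ and ]], excluding nested brackets
    return re.findall(r"\[\[([^\[\]]*)\]\]", page)

def sanitize_string(name: str) -> str:
    # Replace '/' and ' ' so that the name is safe to use as a file name
    return name.replace('/', '_').replace(' ', '_')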

def sanitize_list(names: list[str]) -> list[str]:
    return [sanitize_string(x) for x in names]
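
Wikitext links can also take the form `[[Target|display text]]` or `[[Target#Section]]`, and the regular expression above captures the whole inner string. This is why names such as `10_Years_(band)|10_Years` appear in the outputs further down. A minimal sketch of a variant that keeps only the target page title is given here; `get_link_targets` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import re

def get_link_targets(page: str) -> list[str]:
    """Return the target title of every [[...]] link in a wikitext string."""
    targets = []
    for inner in re.findall(r"\[\[([^\[\]]*)\]\]", page):
        # Keep the part before the first '|' (the target), then drop any '#section' anchor
        target = inner.split("|", 1)[0].split("#", 1)[0].strip()
        if target:
            targets.append(target)
    return targets

sample = "[[AC/DC]] toured with [[10 Years (band)|10 Years]] and [[Queen (band)#History|Queen]]."
print(get_link_targets(sample))
```

Applying `sanitize_list` to the result would then give file names without the `|` separator.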

### Part 2: Calling Wikipedia API and Extracting the Wikitext of the Main Page
We define another function for extracting the wikitext of a Wikipedia page.

In [2]:
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "MyWikipediaClient/1.0 (example@example.com)"}


# Based on https://stackoverflow.com/a/62225015. The indexing expression in the return statement,
# data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"], was partly written with ChatGPT.
def get_wikitext(title: str) -> str:
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvslots": "main",
        "rvprop": "content",
        "format": "json",
        "formatversion": "2",
    }

    response = requests.get(API, params=params, headers=HEADERS, timeout=20)
    response.raise_for_status()

    data = response.json()

    return data["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]
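
The final indexing expression assumes that the requested page exists and has at least one revision. If the API reports a page as missing, the page entry has no `revisions` key and the lookup raises a `KeyError`. The sketch below shows one way to guard against that; `extract_content` is a hypothetical helper, and the two dictionaries are hand-written stand-ins that mimic the response structure used above, not real API output.

```python
from typing import Optional

def extract_content(data: dict) -> Optional[str]:
    """Return the wikitext of the first page in an API response, or None if unavailable."""
    pages = data.get("query", {}).get("pages", [])
    if not pages:
        return None
    revisions = pages[0].get("revisions")
    if not revisions:
        # Missing or invalid pages carry no "revisions" entry
        return None
    return revisions[0]["slots"]["main"]["content"]

# Hand-written example responses (assumed structure, for illustration only)
found = {"query": {"pages": [{"title": "ABBA", "revisions": [{"slots": {"main": {"content": "'''ABBA''' ..."}}}]}]}}
missing = {"query": {"pages": [{"title": "No such page", "missing": True}]}}

print(extract_content(found))
print(extract_content(missing))
```

A caller could then skip titles for which `None` is returned instead of stopping the whole download.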

In [4]:
main_page = get_wikitext('List of mainstream rock performers')
print(main_page[0:100] + "\n")

rock_pages = get_links(main_page)

rock_pages = sanitize_list(rock_pages)
print(rock_pages[0:10])

{{short description|None}}

This is an alphabetical '''list of mainstream rock performers''' spannin

['rock_music', '10cc', '10_Years_(band)|10_Years', '3_Doors_Down', '311_(band)|311', '38_Special_(band)|.38_Special', 'ABBA', 'Accept_(band)|Accept', 'AC_DC', 'Bryan_Adams']


### Part 3: Downloading the Pages and Extracting the Links from those Pages
Download pages:

**Remark:** This part can take a while to execute (for reference, it took 3-5 minutes to run locally).

In [7]:
import os

# Create the output folder once, before the download loop
os.makedirs("./Misc/RockPages", exist_ok=True)

for musician in rock_pages:
    musician_page = get_wikitext(musician)

    with open(f'./Misc/RockPages/{musician}.txt', 'w') as output:
        output.write(musician_page)

print(os.listdir('./Misc/RockPages'))

['Funkadelic.txt', 'Slayer.txt', 'Kaleo_(band)|Kaleo.txt', 'Flyleaf_(band)|Flyleaf.txt', 'Ted_Nugent.txt', 'Great_White.txt', 'Royal_Blood_(band)|Royal_Blood.txt', 'Days_of_the_New.txt', 'The_Dave_Clark_Five.txt', 'Slipknot_(band)|Slipknot.txt', 'Jimmy_Eat_World.txt', 'Poison_(American_band)|Poison.txt', 'Flogging_Molly.txt', 'Simple_Plan.txt', 'AC_DC.txt', 'Joe_Walsh.txt', 'Hoobastank.txt', 'Eddie_Money.txt', "Guns_N'_Roses.txt", 'Roxy_Music.txt', 'Limp_Bizkit.txt', 'Anthrax_(American_band)|Anthrax.txt', 'Midnight_Oil.txt', 'Awolnation.txt', 'Jethro_Tull_(band)|Jethro_Tull.txt', 'Marilyn_Manson.txt', 'Thirty_Seconds_to_Mars.txt', 'Black_Stone_Cherry.txt', "Herman's_Hermits.txt", 'Dr._Hook_&_the_Medicine_Show.txt', "Manfred_Mann's_Earth_Band.txt", 'The_Pretenders.txt', 'Oasis_(band)|Oasis.txt', '10cc.txt', 'Eric_Clapton.txt', 'Pearl_Jam.txt', 'Billy_Talent.txt', 'Starset.txt', 'Jack_White.txt', 'Queensrÿche.txt', 'Bachman–Turner_Overdrive.txt', 'Pixies_(band)|Pixies.txt', 'Alanis_Moris
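
Since the download takes a few minutes, it can help to skip pages that are already on disk when the cell is re-run. The following is a minimal sketch of that idea, not the code that produced the listing above. `download_missing` and `fake_fetch` are hypothetical names; in this notebook the `fetch` argument would be `get_wikitext` and the folder `./Misc/RockPages`.

```python
import os
import tempfile
from typing import Callable

def download_missing(titles: list[str], fetch: Callable[[str], str], folder: str) -> list[str]:
    """Fetch and save only the pages that are not on disk yet; return the newly fetched titles."""
    os.makedirs(folder, exist_ok=True)
    fetched = []
    for title in titles:
        path = os.path.join(folder, f"{title}.txt")
        if os.path.exists(path):
            continue  # already downloaded in an earlier run
        with open(path, "w", encoding="utf-8") as output:
            output.write(fetch(title))
        fetched.append(title)
    return fetched

# Demonstration with a stand-in fetcher, so no network access is needed
def fake_fetch(title: str) -> str:
    return f"wikitext of {title}"

demo_folder = tempfile.mkdtemp()
first_run = download_missing(["ABBA", "AC_DC"], fake_fetch, demo_folder)
second_run = download_missing(["ABBA", "AC_DC", "Slayer"], fake_fetch, demo_folder)
print(first_run, second_run)
```

On the second call only the new title is fetched, because the other two files already exist.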

Extracting links:


In [8]:
import networkx as nx

G = nx.DiGraph()
musicians = os.listdir('./Misc/RockPages')

for musician in musicians:
    with open(f'./Misc/RockPages/{musician}') as f:
        text = f.read()

    # Remove .txt from the end
    p_name = os.path.splitext(musician)[0]

    # Add the node explicitly, since some musicians have no links to other musicians
    # and would otherwise never enter the graph. We also store the page length
    # (word count) as a node attribute here.
    G.add_node(p_name, attr=len(text.split()))

    links = get_links(text)
    links = sanitize_list(links)

    # Only save edges where the link goes to a musician
    filtered_links = [link for link in links if link + ".txt" in musicians]
    print(filtered_links)

    for link in filtered_links:
        G.add_edge(p_name,link)

['Parliament-Funkadelic', 'AllMusic', 'Jimi_Hendrix', 'Parliament-Funkadelic', 'Sly_and_the_Family_Stone', 'Parliament-Funkadelic', 'Cream_(band)|Cream']
['Metallica', 'Megadeth', 'Anthrax_(American_band)|Anthrax', 'Iron_Maiden', 'Black_Sabbath', 'Judas_Priest', 'AllMusic', 'Scorpions_(band)|Scorpions', 'Iron_Maiden', 'Megadeth', 'AllMusic', 'Judas_Priest', 'Motörhead', 'Megadeth', 'Anthrax_(American_band)|Anthrax', 'Alice_in_Chains', 'Metallica', 'Ozzy_Osbourne', 'Sepultura', 'Black_Sabbath', 'Pantera', 'Tool_(band)|Tool', 'Slipknot_(band)|Slipknot', 'Metallica', 'Lamb_of_God_(band)|Lamb_of_God', 'Trivium_(band)|Trivium', 'Megadeth', 'Anthrax_(American_band)|Anthrax', 'Anthrax_(American_band)|Anthrax', 'Lamb_of_God_(band)|Lamb_of_God', 'Lamb_of_God_(band)|Lamb_of_God', 'Anthrax_(American_band)|Anthrax', 'Primus_(band)|Primus', 'Anthrax_(American_band)|Anthrax', 'Volbeat', 'Metallica', 'Megadeth', 'Black_Sabbath', 'AllMusic', 'AllMusic', 'Black_Sabbath', 'Motörhead', 'Judas_Priest', 'I

Remove isolated nodes (nodes with neither incoming nor outgoing links):

In [15]:
nodes = list(G.nodes())

for node in nodes:
    if G.in_degree(node) == 0 and G.out_degree(node) == 0:
        G.remove_node(node)
        print(node)

Dr._Hook_&_the_Medicine_Show
Shakin'_Stevens
Category:Lists_of_rock_musicians
Van_Zant_(band)
Stevie_Ray_Vaughan|Stevie_Ray_Vaughan_and_Double_Trouble
Bread_(band)|Bread
Jet_(Australian_band)|Jet
Category:Lists_of_rock_musicians_by_subgenre
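
The same step can be written with NetworkX's built-in `nx.isolates`, which yields the nodes of degree zero (for a directed graph, no incoming and no outgoing edges). The sketch below uses a small made-up graph rather than `G`.

```python
import networkx as nx

demo = nx.DiGraph()
demo.add_edge("Metallica", "Megadeth")
demo.add_node("Bread_(band)|Bread")  # no incoming or outgoing links

# Materialize the generator before mutating the graph
isolated = list(nx.isolates(demo))
demo.remove_nodes_from(isolated)

print(isolated, list(demo.nodes()))
```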


Saving the network:

In [16]:
import pickle

# Use a context manager so the file is closed after writing
with open('Misc/MusicianGraph.pickle', 'wb') as f:
    pickle.dump(G, f)
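
Pickle files are Python-specific, and unpickling a file downloaded from the internet can execute arbitrary code. If the network is to be hosted online, as the assignment suggests, a plain-text format may be a reasonable alternative. The sketch below round-trips a small made-up graph through GraphML; the file name and the `length` attribute are assumptions for illustration, and this is not how the graph above was saved.

```python
import os
import tempfile
import networkx as nx

demo = nx.DiGraph()
demo.add_node("Metallica", length=3)
demo.add_node("Megadeth", length=5)
demo.add_edge("Metallica", "Megadeth")

# Write to a temporary location and read the graph back
path = os.path.join(tempfile.mkdtemp(), "demo_graph.graphml")
nx.write_graphml(demo, path)
restored = nx.read_graphml(path)

print(restored.is_directed(), restored.nodes["Metallica"]["length"], list(restored.edges()))
```

The direction of the edges and the integer node attribute both survive the round trip.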

Extracting the largest connected component:

In [None]:
LCG = max(nx.weakly_connected_components(G), key=len)
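
Note that `nx.weakly_connected_components` yields sets of nodes, so the result of `max(...)` is a node set and not a graph. A minimal sketch of turning such a node set into a graph object, shown on a small made-up graph:

```python
import networkx as nx

demo = nx.DiGraph()
demo.add_edges_from([("A", "B"), ("C", "B"), ("D", "E")])

# Largest set of nodes that are connected when edge direction is ignored
largest_nodes = max(nx.weakly_connected_components(demo), key=len)

# subgraph() returns a view; copy() makes an independent, editable graph
largest = demo.subgraph(largest_nodes).copy()

print(sorted(largest.nodes()), largest.number_of_edges())
```

For the network above, the analogous call would be `G.subgraph(LCG).copy()`.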