In [1]:
import requests
import json

## #1

Identify a person who has been famous for at least a few years and in whom you have some personal interest. Use the Wikimedia API to collect page view data from the English Wikipedia article on that person. Now use that data to generate a time-series visualization and include a link to it in your notebook.

In [2]:
def get_wikipedia_pageviews(page_title, language):
    # /metrics/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/" + 
           f"{language}/all-access/user/{page_title}/daily/20010101/20230424")

    headers = {
        'User-Agent': 'python data collection bot by makohill@uw.edu'
    }

    response = requests.get(url, headers=headers)

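    # stop here on a failed request instead of trying to parse an error body
    if response.status_code != 200:
        raise RuntimeError(f"request failed with status {response.status_code}: {url}")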
    
    data = response.json()
    return data
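
Article titles that contain spaces or other special characters are safer to percent-encode before they go into the URL path. The following is a minimal sketch, not part of the original solution: `build_pageviews_url` is a hypothetical helper name, and the default dates simply repeat the ones used in the cell above. It relies only on `urllib.parse.quote` from the standard library.

```python
from urllib.parse import quote

def build_pageviews_url(page_title, language, start="20010101", end="20230424"):
    # Hypothetical helper: Wikipedia titles use underscores in place of spaces,
    # and safe="" also encodes "/" so a title such as "AC/DC" stays one path segment.
    encoded_title = quote(page_title.replace(" ", "_"), safe="")
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
            f"{language}/all-access/user/{encoded_title}/daily/{start}/{end}")

print(build_pageviews_url("Elinor Ostrom", "en.wikipedia.org"))
```

The returned string could be passed to `requests.get` in place of the hand-built `url`.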

In [3]:
def clean_up_timestamp(day):
    new_time_stamp = day[0:4] + "-" + day[4:6] + "-" + day[6:8]
    return new_time_stamp
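
The string slicing above assumes the timestamp always starts with a valid `YYYYMMDD` date. As an optional alternative, here is a sketch of a stricter version that lets `datetime.strptime` validate the date; the name `clean_up_timestamp_strict` is made up for illustration.

```python
from datetime import datetime

def clean_up_timestamp_strict(day):
    # Hypothetical variant: parse the first 8 characters (YYYYMMDD) and
    # raise ValueError if they do not form a real calendar date.
    return datetime.strptime(day[0:8], "%Y%m%d").date().isoformat()

print(clean_up_timestamp_strict("2015070100"))  # prints 2015-07-01
```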

In [4]:
data = get_wikipedia_pageviews("Elinor Ostrom", "en.wikipedia.org")

In [5]:
views_by_day = {}
for day_dict in data['items']:
    day = clean_up_timestamp(day_dict['timestamp'])
    views_by_day[day] = day_dict['views']

In [6]:
with open('view_by_day_ostrom.tsv', 'w') as output_file:
    print("day\tviews", file=output_file)
    for day in views_by_day:
        print(f"{day}\t{views_by_day[day]}", file=output_file)

https://docs.google.com/spreadsheets/d/1pnuM9lkUq0Kh3D6RnuWIpDk9gI3TMt1kcfP6N5Y9Guk/edit?usp=sharing

## #2
Identify 2 other language editions of Wikipedia that have articles on that person. Collect page view data on the article in those other languages and create a single visualization that shows how the dynamics are similar and/or different. (Note: My approach involved creating a TSV file with multiple columns.)

In [7]:
views_by_day_lang = {}
for language in ["en.wikipedia.org", "de.wikipedia.org", "fr.wikipedia.org"]:
    data = get_wikipedia_pageviews("Elinor Ostrom", language)
    
    for day_dict in data['items']:
        day = clean_up_timestamp(day_dict['timestamp'])
        
        if day in views_by_day_lang.keys():
            views_by_day_lang[day][language] = day_dict['views']
        else:
            views_by_day_lang[day] = {language : day_dict['views']}

In [8]:
with open('ostrom_view_by_day_combo.tsv', 'w') as output_file:
    # one column per language edition, in the same order as the values below
    print("day\ten.wikipedia.org\tde.wikipedia.org\tfr.wikipedia.org", file=output_file)
    for day in views_by_day_lang:
        print(f'{day}\t{views_by_day_lang[day]["en.wikipedia.org"]}\t{views_by_day_lang[day]["de.wikipedia.org"]}\t{views_by_day_lang[day]["fr.wikipedia.org"]}', file=output_file)

https://docs.google.com/spreadsheets/d/1pnuM9lkUq0Kh3D6RnuWIpDk9gI3TMt1kcfP6N5Y9Guk/edit#gid=1207896766
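
The direct dictionary lookups in the cell above raise a `KeyError` if one language edition has no data for a given day. One way around that, sketched here with made-up sample numbers and an in-memory buffer standing in for the output file, is to use `dict.get` with an empty-string default and let the standard `csv` module handle the tab-separated layout.

```python
import csv
import io

languages = ["en.wikipedia.org", "de.wikipedia.org", "fr.wikipedia.org"]

# Made-up sample in the shape of views_by_day_lang; the second day has no French data.
sample = {
    "2015-07-01": {"en.wikipedia.org": 10, "de.wikipedia.org": 2, "fr.wikipedia.org": 1},
    "2015-07-02": {"en.wikipedia.org": 12, "de.wikipedia.org": 3},
}

buffer = io.StringIO()  # stands in for an open output file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["day"] + languages)
for day in sorted(sample):
    # .get() leaves the cell empty when a language has no entry for that day
    writer.writerow([day] + [sample[day].get(lang, "") for lang in languages])

print(buffer.getvalue())
```

Building the header from the same `languages` list that drives the rows keeps the column names and values in step.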

## Collect page view data on the articles about Marvel Comics and DC Comics in English Wikipedia. (If you'd rather replace these examples with some other comparison of popular rivals, that's just as good!)

In [9]:
views_by_day_comics = {}
for page_title in ["Marvel_Comics", "DC_Comics"]:
    data = get_wikipedia_pageviews(page_title, "en.wikipedia.org")
    
    for day_dict in data['items']:
        day = clean_up_timestamp(day_dict['timestamp'])
        
        if day in views_by_day_comics.keys():
            views_by_day_comics[day][page_title] = day_dict['views']
        else:
            views_by_day_comics[day] = { page_title : day_dict['views']}

### Which has more total page views in 2022?

In [10]:
counter = {}
for day in views_by_day_comics.keys():
    # skip if it's not 2022
    if day[0:4] != "2022":
        continue
        
    for page in views_by_day_comics[day].keys():
        if page in counter.keys():
            counter[page] += views_by_day_comics[day][page]
        else:
            counter[page] = views_by_day_comics[day][page]
counter

{'Marvel_Comics': 1804195, 'DC_Comics': 1793582}
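
The if/else bookkeeping in the loop above can also be expressed with `collections.Counter`. This is only an alternative sketch; the sample numbers are made up and are not real page view data.

```python
from collections import Counter

# Made-up sample in the same shape as views_by_day_comics (not real page view data).
sample_views = {
    "2021-12-31": {"Marvel_Comics": 5, "DC_Comics": 7},
    "2022-01-01": {"Marvel_Comics": 10, "DC_Comics": 8},
    "2022-01-02": {"Marvel_Comics": 3, "DC_Comics": 4},
}

totals_2022 = Counter()
for day, views in sample_views.items():
    if day.startswith("2022"):
        # Counter.update adds the counts from a mapping instead of replacing them.
        totals_2022.update(views)

print(dict(totals_2022))  # prints {'Marvel_Comics': 13, 'DC_Comics': 12}
```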

### Can you draw a visualization in a spreadsheet that shows this? (Again, provide a link.)

In [11]:
with open('comics_view_by_day.tsv', 'w') as output_file:
    # tab-separated header: day, Marvel views, DC views
    print("day\tmarvel\tdc", file=output_file)
    for day in views_by_day_comics:
        print(f'{day}\t{views_by_day_comics[day]["Marvel_Comics"]}\t{views_by_day_comics[day]["DC_Comics"]}', file=output_file)

https://docs.google.com/spreadsheets/d/1pnuM9lkUq0Kh3D6RnuWIpDk9gI3TMt1kcfP6N5Y9Guk/edit#gid=359752784

### Were there any years when 2022's more popular page was instead the less popular of the two? How many and which ones?

In [12]:
counter_by_year = {}

for day in views_by_day_comics.keys():
    year = day[0:4]
    
    if year not in counter_by_year.keys():
        counter_by_year[year] = { "Marvel_Comics" : 0,
                                  "DC_Comics" : 0}
        
    for page in views_by_day_comics[day].keys():
        counter_by_year[year][page] += views_by_day_comics[day][page]
        
counter_by_year

{'2015': {'Marvel_Comics': 901007, 'DC_Comics': 824561},
 '2016': {'Marvel_Comics': 1982087, 'DC_Comics': 2003131},
 '2017': {'Marvel_Comics': 1670161, 'DC_Comics': 1623985},
 '2018': {'Marvel_Comics': 2707650, 'DC_Comics': 1810590},
 '2019': {'Marvel_Comics': 2099570, 'DC_Comics': 1696735},
 '2020': {'Marvel_Comics': 1227661, 'DC_Comics': 1299000},
 '2021': {'Marvel_Comics': 1878513, 'DC_Comics': 1528781},
 '2022': {'Marvel_Comics': 1804195, 'DC_Comics': 1793582},
 '2023': {'Marvel_Comics': 376331, 'DC_Comics': 486125}}

In [13]:
for year in counter_by_year.keys():
    if counter_by_year[year]["Marvel_Comics"] > counter_by_year[year]["DC_Comics"]:
        print(f"{year}: Marvel was more")
    elif counter_by_year[year]["Marvel_Comics"] < counter_by_year[year]["DC_Comics"]:
        print(f"{year}: DC was more")
    else:
        print(f"{year}: they were equal!")

2015: Marvel was more
2016: DC was more
2017: Marvel was more
2018: Marvel was more
2019: Marvel was more
2020: DC was more
2021: Marvel was more
2022: Marvel was more
2023: DC was more
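
To answer the "how many and which ones" part directly, the reversal years can be picked out in code. The sketch below copies the yearly totals from the output of the earlier cell; the variable names are illustrative. Note that the 2015 and 2023 figures cover only part of each year.

```python
# Yearly totals copied from the counter_by_year output shown earlier.
yearly_totals = {
    '2015': {'Marvel_Comics': 901007, 'DC_Comics': 824561},
    '2016': {'Marvel_Comics': 1982087, 'DC_Comics': 2003131},
    '2017': {'Marvel_Comics': 1670161, 'DC_Comics': 1623985},
    '2018': {'Marvel_Comics': 2707650, 'DC_Comics': 1810590},
    '2019': {'Marvel_Comics': 2099570, 'DC_Comics': 1696735},
    '2020': {'Marvel_Comics': 1227661, 'DC_Comics': 1299000},
    '2021': {'Marvel_Comics': 1878513, 'DC_Comics': 1528781},
    '2022': {'Marvel_Comics': 1804195, 'DC_Comics': 1793582},
    '2023': {'Marvel_Comics': 376331, 'DC_Comics': 486125},
}

# The page with more views in 2022, and the other page.
leader_2022 = max(yearly_totals["2022"], key=yearly_totals["2022"].get)
other = "DC_Comics" if leader_2022 == "Marvel_Comics" else "Marvel_Comics"

# Years in which the 2022 leader had strictly fewer views than the other page.
reversal_years = [year for year, totals in yearly_totals.items()
                  if totals[other] > totals[leader_2022]]

print(leader_2022, len(reversal_years), reversal_years)
# prints Marvel_Comics 3 ['2016', '2020', '2023']
```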


### Were there any months when this reversal of relative popularity occurred? How many and which ones?

In [14]:
# this code is identical to the code above except for two things:
# (a) I changed every instance of year to month
# (b) I changed day[0:4] to day[0:7] so that it selects the month instead of the year!
counter_by_month = {}

for day in views_by_day_comics.keys():
    month = day[0:7]
    
    if month not in counter_by_month.keys():
        counter_by_month[month] = { "Marvel_Comics" : 0,
                                    "DC_Comics" : 0}
        
    for page in views_by_day_comics[day].keys():
        counter_by_month[month][page] += views_by_day_comics[day][page]
        
for month in counter_by_month.keys():
    if counter_by_month[month]["Marvel_Comics"] > counter_by_month[month]["DC_Comics"]:
        print(f"{month}: Marvel was more")
    elif counter_by_month[month]["Marvel_Comics"] < counter_by_month[month]["DC_Comics"]:
        print(f"{month}: DC was more")
    else:
        print(f"{month}: they were equal!")

2015-07: Marvel was more
2015-08: Marvel was more
2015-09: Marvel was more
2015-10: DC was more
2015-11: Marvel was more
2015-12: Marvel was more
2016-01: Marvel was more
2016-02: Marvel was more
2016-03: DC was more
2016-04: DC was more
2016-05: Marvel was more
2016-06: Marvel was more
2016-07: DC was more
2016-08: DC was more
2016-09: DC was more
2016-10: Marvel was more
2016-11: Marvel was more
2016-12: DC was more
2017-01: Marvel was more
2017-02: Marvel was more
2017-03: Marvel was more
2017-04: Marvel was more
2017-05: Marvel was more
2017-06: DC was more
2017-07: Marvel was more
2017-08: Marvel was more
2017-09: DC was more
2017-10: DC was more
2017-11: DC was more
2017-12: Marvel was more
2018-01: Marvel was more
2018-02: Marvel was more
2018-03: Marvel was more
2018-04: Marvel was more
2018-05: Marvel was more
2018-06: Marvel was more
2018-07: Marvel was more
2018-08: Marvel was more
2018-09: Marvel was more
2018-10: Marvel was more
2018-11: Marvel was more
2018-12: Marvel was

### How about any days? How many?

In [15]:
counter = {"Marvel_Comics" : 0, "DC_Comics" : 0, "TIE" : 0}

for day in views_by_day_comics.keys():

    if views_by_day_comics[day]["Marvel_Comics"] > views_by_day_comics[day]["DC_Comics"]:
        counter["Marvel_Comics"] += 1
    elif views_by_day_comics[day]["Marvel_Comics"] < views_by_day_comics[day]["DC_Comics"]:
        counter["DC_Comics"] += 1
    else:
        counter["TIE"] += 1

counter

{'Marvel_Comics': 1756, 'DC_Comics': 1096, 'TIE': 3}

## #3

I've made this file available, which includes a list of more than 100 Wikipedia articles about alternative rock bands from Washington state that I built from this category in Wikipedia.[*] It's a .jsonl file. Download the file (click "raw" and then save the file onto your drive). Now read it in, and request monthly page view data from all of them. If you need some help with loading it in, I've included some sample code at the bottom of this page.
### Once you've done this, sum up all of the page views from all of the pages and print out a TSV file with these total numbers.

In [16]:
list_of_page_titles = []
with open("list_of_washington_alternative_rocks_bands_wikipedia-2023-04-25.jsonl") as input_file:
    for line in input_file.readlines():
        data = json.loads(line)
        list_of_page_titles.append(data["page_title"])

In [17]:
total_views_by_day = {}
for page_title in list_of_page_titles:
    data = get_wikipedia_pageviews(page_title, "en.wikipedia.org")
    
    for day_dict in data['items']:
        day = clean_up_timestamp(day_dict['timestamp'])
        if day in total_views_by_day.keys():
            total_views_by_day[day] += day_dict['views']
        else:
            total_views_by_day[day] = day_dict['views']
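
The cell above collects daily totals, while the task asks for monthly numbers. One possible way to bridge the two, shown as a sketch with made-up sample values, is to roll the daily totals up by their `YYYY-MM` prefix, the same slicing trick used for the comics question.

```python
from collections import defaultdict

# Made-up daily totals in the same shape as total_views_by_day.
sample_daily_totals = {"2023-01-30": 100, "2023-01-31": 120, "2023-02-01": 90}

monthly_totals = defaultdict(int)
for day, views in sample_daily_totals.items():
    monthly_totals[day[0:7]] += views  # the "YYYY-MM" prefix identifies the month

print(dict(monthly_totals))  # prints {'2023-01': 220, '2023-02': 90}
```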

### You know the routine by now! Now, make a time series graph of these numbers and include a link in your notebook.

In [18]:
with open('bands_view_by_day.tsv', 'w') as output_file:
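    # "\t" followed by "total_views" needs two t's: one for the tab, one for the word
    print("day\ttotal_views", file=output_file)
    # iterate over the band totals themselves, not the comics dictionary
    for day in total_views_by_day:
        print(f'{day}\t{total_views_by_day[day]}', file=output_file)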

https://docs.google.com/spreadsheets/d/1pnuM9lkUq0Kh3D6RnuWIpDk9gI3TMt1kcfP6N5Y9Guk/edit#gid=37465393