## #1 Wikipedia Page View API

1. Identify a famous person who has been famous for at least a few years and that you have some personal interest in. Use the Wikimedia API to collect page view data from the English Wikipedia article on that person. Now use that data to generate a time-series visualization and include a link to it in your notebook.
2. Identify 2 other language editions of Wikipedia that have articles on that person. Collect page view data on the article in those languages and create a single visualization that shows how the dynamics are similar and/or different. (Note: My approach involved creating a TSV file with multiple columns.)
3. Collect page view data on the articles about [Marvel Comics](https://en.wikipedia.org/wiki/Marvel_Comics) and [DC Comics](https://en.wikipedia.org/wiki/DC_Comics) in English Wikipedia. (If you'd rather replace these examples with some other comparison of popular rivals, that's just as good!)
    1. Which has more total page views in 2022?
    2. Can you draw a visualization in a spreadsheet that shows this? (Again, provide a link.)
    3. Were there years since 2015 when the less viewed page was viewed more? How many and which ones?
    4. Were there any months when this was true? How many and which ones?
    5. How about any days? How many?
4. I've made [this file available](https://github.com/kayleachampion/spr23_CDSW/blob/main/curriculum/week5/list_of_washington_alternative_rocks_bands_wikipedia-2023-04-25.jsonl), which includes a list of more than 100 Wikipedia articles about alternative rock bands from Washington state that I built from [this category in Wikipedia](https://en.wikipedia.org/wiki/Category:Alternative_rock_groups_from_Washington_(state)).\[\*\] It's a `.jsonl` file. Download the file (click "raw" and then save the file onto your drive). Now read it in, and request monthly page view data from all of them. If you need some help with loading it in, I've included some sample code at the bottom of this page.
    1. Once you've done this, sum up all of the page views from all of the pages and print out a TSV file with these total numbers.
    2. You know the routine by now! Now, make a time series graph of these numbers and include a link in your notebook.


In [1]:
import requests
import json
import time

#From https://github.com/kayleachampion/spr23_CDSW/blob/main/curriculum/week5/week_5_lecture_part_1-data_collection.ipynb
def get_wikipedia_pageviews(page_title, region_code):
    # /metrics/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/" + 
           f"{region_code}.wikipedia.org/all-access/user/{page_title}/daily/20010101/20230424")

    # The Wikimedia API documentation asks that the User-Agent header be unique.
    headers = {
        'User-Agent': 'python data collection bot by makohill@uw.edu'
    }

    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print("ERROR, request not OK")
        
    data = response.json()

    return data

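# Modified from the previous example.

def get_wikipedia_item(page_title):
    # This URL returns the article's HTML page rather than JSON, so return
    # the raw text; calling response.json() here would raise an error.
    url = f"https://en.wikipedia.org/wiki/{page_title}"
    response = requests.get(url)

    if response.status_code != 200:
        print("ERROR, request not OK")

    return response.text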


def create_json_file(file_name_json, jsondata):
    with open(file_name_json, "w") as my_file:
        json_string = json.dumps(jsondata)  # convert the data into a JSON string
        print(json_string, file=my_file)  # saved as one long string, not as a spreadsheet as we did in the past
    
    print(f"{file_name_json} created")
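
The endpoint template in the comment inside `get_wikipedia_pageviews` has slots for granularity and dates that the function hard-codes. A minimal sketch of a URL builder that exposes them might look like the following. The name `build_pageviews_url` and its default arguments are assumptions for illustration, not part of the lecture code. It also percent-encodes the title, which matters for titles containing characters such as `/` or `?`.

```python
from urllib.parse import quote

# Hypothetical helper (not from the lecture notebook): fills in the
# per-article endpoint template quoted in get_wikipedia_pageviews.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def build_pageviews_url(page_title, region_code, granularity="daily",
                        start="20010101", end="20230424"):
    # Percent-encode the title so characters like "/" do not break the path.
    encoded_title = quote(page_title, safe="")
    return (f"{BASE}/{region_code}.wikipedia.org/all-access/user/"
            f"{encoded_title}/{granularity}/{start}/{end}")

# Monthly data for the same article, as question 4 asks for.
print(build_pageviews_url("Taylor_Swift", "en", granularity="monthly"))
```

Passing the result to `requests.get()` with the same `User-Agent` header would then request monthly rather than daily counts.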


In [2]:
#call to wikipedia
#get page view data
#get time-series visualization and include a multiple columns?

#1.Identify a famous person who has been famous for at least a few years and that you have some personal interest in. Use the Wikimedia API 
#to collect page view data from the English Wikipedia article on that person. Now use that data to generate a time-series visualization 
#and include a link to it in your notebook.

famous_person = "Taylor_Swift"
jsondata = get_wikipedia_pageviews(famous_person,"en")
file_name = f"{famous_person}_wikipedia_pageviews"
file_name_json = file_name+".json"
#get_wikipedia_item(famous_person)
print ("jsondata:"+str(type(jsondata)))

create_json_file(file_name_json,jsondata)




jsondata:<class 'dict'>
Taylor_Swift_wikipedia_pageviews.json created


In [3]:
# Way 1 to read the file.
# Reads the whole file and parses it into a dictionary.

def open_json_to_dict(file_name):
    with open(file_name, 'r') as input_file:
        input_data = input_file.read()
        new_data = json.loads(input_data) #load into new_data so we won't have to work as big string.
        return new_data


new_data= open_json_to_dict(file_name_json)

print(new_data['items'][0]) #check dictionary item 

print(len(new_data['items']))

#Way 2 to read file
#Reads each line and saves it on dictionary
# day_dicts = []
# with open(file_name_json, 'r') as f:
#     for line in f.readlines():
#         json_data2 = json.loads(line)
#         day_dicts.extend(json_data2["items"])

# print(type(day_dict))
# len(day_dicts)
# for key in day_dict:
#     print(key)
#     #break;

# print(day_dict)



{'project': 'en.wikipedia', 'article': 'Taylor_Swift', 'granularity': 'daily', 'timestamp': '2015070100', 'access': 'all-access', 'agent': 'user', 'views': 35553}
2855


In [4]:
def clean_up_timestamp(day):
    new_time_stamp = day[0:4] + "-" + day[4:6] + "-" + day[6:8]
    return new_time_stamp
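
String slicing works because the timestamps have a fixed `YYYYMMDDHH` layout. As an alternative sketch, the standard `datetime` module can parse the same string and will raise an error if a timestamp is malformed. The function name `parse_timestamp` is made up for this example.

```python
from datetime import datetime

def parse_timestamp(raw):
    # Timestamps look like "2015070100": year, month, day, hour.
    return datetime.strptime(raw, "%Y%m%d%H")

stamp = parse_timestamp("2015070100")
print(stamp.date().isoformat())  # same text that clean_up_timestamp builds
print(stamp.strftime("%Y-%m"))   # a month key, handy for grouping by month
```

Working with real date objects also makes it easier to group by week or weekday later, if a project calls for it.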

In [5]:
# Clean up the day format of each of the previous JSON items and save the result in views_by_day

def get_views_by_day(json_data):
    # convert the JSON data into a dictionary {timestamp: views}
    views_by_day = {}
    for day_dict in json_data['items']:  # use the argument, not the global new_data
        day = clean_up_timestamp(day_dict['timestamp'])
        views_by_day[day] = day_dict['views']
    return views_by_day

views_by_day = get_views_by_day(new_data)

# total_views_by_day = {} 
# for day_dict in day_dicts:
#     day = clean_up_timestamp(day_dict["timestamp"])
#     if day in total_views_by_day:
#         total_views_by_day[day] = total_views_by_day[day] + day_dict['views']
#     else:
#         total_views_by_day[day] = day_dict["views"]


In [6]:
#turn into a tsv file
def create_tsv_file(file_name, dictionary_data):
    with open(file_name, "w") as f:
        print("day\ttotal_views", file=f)
        
        for day in dictionary_data.keys():
            views = dictionary_data[day]
            print(day, views, sep="\t", file=f)  # sep avoids stray spaces around the tab

    print(f"{file_name} created")

file_name = f"{famous_person}_views_by_day_combo.tsv"
create_tsv_file(file_name, views_by_day)

# with open(f"{famous_person}_views_by_day_combo2.tsv", "w") as f:
#     print("day\ttotal_views", file=f)
    
#     for day in total_views_by_day.keys():
#         views = total_views_by_day[day]
#         print(day, "\t", views, file=f)

# print(f"{famous_person}_views_by_day_combo2.tsv created")

Taylor_Swift_views_by_day_combo.tsv created
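
Writing tab-separated rows with `print()` works, but it is easy to end up with extra spaces around the tabs. A small sketch using Python's built-in `csv` module is shown below as one alternative. The function name `write_views_tsv` and the second sample value are made up for illustration.

```python
import csv
import io

def write_views_tsv(out_file, views_by_day):
    # lineterminator="\n" keeps plain Unix line endings.
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(["day", "total_views"])
    for day, views in views_by_day.items():
        writer.writerow([day, views])

# Demo with an in-memory file; the second value is invented for the example.
buffer = io.StringIO()
write_views_tsv(buffer, {"2015-07-01": 35553, "2015-07-02": 100})
print(buffer.getvalue())
```

With a real file, something like `open(file_name, "w", newline="")` would take the place of the `StringIO` buffer.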


In [8]:
#Identify 2 other language editions of Wikipedia that have articles on that person. 
#Collect page view data on the article in those languages and create a single visualization that shows how the dynamics are similar and/or different. 
#(Note: My approach involved creating a TSV file with multiple columns.)

region_codes = ["en","es", "ja"]
multiple_code_file =f"multiple_{famous_person}_wikipedia_pageviews"

view_data = []
for code in region_codes:
    code_jsondata = get_wikipedia_pageviews(famous_person,code)
    code_json_data = code_jsondata["items"]
    #print(code_json_data)
    view_data = view_data + code_json_data

#print(view_data)

create_json_file(multiple_code_file+".json", view_data)

new_data = open_json_to_dict(multiple_code_file+".json")

# turn into a TSV file; the header order matches the columns written below
with open(f"{multiple_code_file}.tsv", "w") as f:
    print("day\tproject\ttotal_views", file=f)

    for day_dict in new_data:
        day = clean_up_timestamp(day_dict['timestamp'])
        views = day_dict['views']
        project = day_dict['project']
        print(day, project, views, sep="\t", file=f)


print(f"{multiple_code_file}.tsv created")

multiple_Taylor_Swift_wikipedia_pageviews.json created
multiple_Taylor_Swift_wikipedia_pageviews.tsv created
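
The assignment note mentions a TSV file with multiple columns, while the file written in this cell has one row per day and project. A rough sketch of turning those rows into one column per language edition follows, which is usually easier to chart in a spreadsheet. The function names and the sample numbers are assumptions made up for this example.

```python
def pivot_by_project(items):
    # Returns (projects, rows), where rows maps day -> {project: views}.
    rows = {}
    projects = []
    for item in items:
        stamp = item["timestamp"]
        day = stamp[0:4] + "-" + stamp[4:6] + "-" + stamp[6:8]
        project = item["project"]
        if project not in projects:
            projects.append(project)
        rows.setdefault(day, {})[project] = item["views"]
    return projects, rows

def wide_tsv_lines(items):
    projects, rows = pivot_by_project(items)
    lines = ["\t".join(["day"] + projects)]
    for day in sorted(rows):
        # Leave the cell empty when a project has no data for that day.
        cells = [str(rows[day].get(p, "")) for p in projects]
        lines.append("\t".join([day] + cells))
    return lines

# Made-up sample values, only to show the shape of the output.
sample = [
    {"project": "en.wikipedia", "timestamp": "2015070100", "views": 10},
    {"project": "es.wikipedia", "timestamp": "2015070100", "views": 3},
    {"project": "en.wikipedia", "timestamp": "2015070200", "views": 12},
]
print("\n".join(wide_tsv_lines(sample)))
```

Each line could then be printed to a file in the same way as the other TSV files in this notebook.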


In [10]:
#3. Collect page view data on the articles about Marvel Comics and DC Comics in English Wikipedia. 
#(If you'd rather replace these examples with some other comparison of popular rivals, that's just as good!)

marvel_jsondata = get_wikipedia_pageviews("Marvel_Comics","en")
dc_jsondata= get_wikipedia_pageviews("DC_Comics","en")

def aggregate_views(items):
    # Build one dictionary holding daily ("YYYY-MM-DD"), monthly ("YYYY-MM"),
    # and yearly ("YYYY") view totals for a single article.
    totals = {}
    for _day in items:
        cleaned_day = clean_up_timestamp(_day["timestamp"])
        current_year = _day["timestamp"][0:4]
        current_month = current_year + "-" + _day["timestamp"][4:6]

        totals[cleaned_day] = _day["views"]
        totals[current_month] = totals.get(current_month, 0) + _day["views"]
        totals[current_year] = totals.get(current_year, 0) + _day["views"]
    return totals

marvel_by_year = aggregate_views(marvel_jsondata['items'])
dc_by_year = aggregate_views(dc_jsondata['items'])

    

#print("Marvel Comics:")
#print(marvel_by_year)
#print("DC Comics:")
#print(dc_by_year)

# Write the comics data (not new_data, which still holds the previous cell's items).
with open("views_by_day_comics.tsv", "w") as f:
    print("date\tarticle\ttotal_views", file=f)

    for day_dict in marvel_jsondata['items'] + dc_jsondata['items']:
        day = clean_up_timestamp(day_dict['timestamp'])
        views = day_dict['views']
        article = day_dict['article']
        print(day, article, views, sep="\t", file=f)

years = ["2015","2016","2017","2018","2019","2020","2021","2022"]
for year in years:
    if (marvel_by_year[year] > dc_by_year[year]):
        print(f"{year}: Marvel")
    else:
        print(f"{year}: DC")

    for month in range(1, 13):
        # month keys are zero-padded ("2015-07"), so pad the lookup key the same way
        month_key = f"{year}-{month:02d}"
        if month_key not in marvel_by_year or month_key not in dc_by_year:
            continue  # no data for this month (the data starts in July 2015)

        if (marvel_by_year[month_key] > dc_by_year[month_key]):
            print(f"{month_key}: Marvel")
        else:
            print(f"{month_key}: DC")

#Can you draw a visualization in a spreadsheet that shows this? (Again, provide a link.)
#Were there years since 2015 when the less viewed page was viewed more? How many and which ones?
    # 2015: Marvel
    # 2016: DC
    # 2017: Marvel
    # 2018: Marvel
    # 2019: Marvel
    # 2020: DC
    # 2021: Marvel
    # 2022: Marvel
#Were there any months when this was true? How many and which ones?
#How about any days? How many?
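
Questions 3.3 to 3.5 ask the same thing at three levels of detail: years, months, and days. One way to sketch this is to total the views by a prefix of the timestamp and then list the periods where one article beats the other. The function names and the tiny data set below are made up purely to exercise the idea; they are not real page view numbers.

```python
def totals_by_period(items, key_length):
    # key_length 4 -> year, 6 -> month, 8 -> day (timestamps are YYYYMMDDHH).
    totals = {}
    for item in items:
        key = item["timestamp"][:key_length]
        totals[key] = totals.get(key, 0) + item["views"]
    return totals

def periods_where_first_leads(first_items, second_items, key_length):
    first = totals_by_period(first_items, key_length)
    second = totals_by_period(second_items, key_length)
    # Compare only periods that appear in both series.
    shared = sorted(set(first) & set(second))
    return [key for key in shared if first[key] > second[key]]

# Made-up numbers, only to exercise the functions.
a = [{"timestamp": "2022010100", "views": 5},
     {"timestamp": "2022010200", "views": 1},
     {"timestamp": "2022020100", "views": 2}]
b = [{"timestamp": "2022010100", "views": 3},
     {"timestamp": "2022010200", "views": 4},
     {"timestamp": "2022020100", "views": 1}]

for name, length in [("years", 4), ("months", 6), ("days", 8)]:
    winners = periods_where_first_leads(a, b, length)
    print(name, len(winners), winners)
```

With the real data, `a` and `b` would be the `items` lists for the two articles, and swapping the arguments gives the periods where the other article leads.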


In [11]:
#4. I've made this file available, which includes a list of more than 100 Wikipedia articles about alternative rock bands from Washington state that 
#I built from this category in Wikipedia.[*] It's a .jsonl file. 
#Download the file (click "raw" and then save the file onto your drive). 
#Now read it in, and request monthly page view data from all of them. If you need some help with loading it in, 
#I've included some sample code at the bottom of this page.
#Once you've done this, sum up all of the page views from all of the pages and print out a TSV file with these total numbers.
#You know the routine by now! Now, make a time series graph of these numbers and include a link in your notebook.

#page_dicts = raw_data["*"][0]['a']['*']

#with open("list_of_washington_alternative_rocks_bands_wikipedia-2023-04-25.jsonl", 'w') as band_list:

band_list = []
with open("list_of_washington_alternative_rocks_bands_wikipedia-2023-04-25.jsonl", 'r') as input_file:
    for line in input_file.readlines():
        json_data2 = json.loads(line)
        #print(json_data2)
        band_list.append(json_data2["page_title"])

# #print(band_list)

band_items = []
for band in band_list:
    print(".", end="", flush=True)  # progress marker: one dot per band, all on one line
    band_views = get_wikipedia_pageviews(band,"en")
    # an error response may have no "items" key, so fall back to an empty list
    band_items = band_items + band_views.get("items", [])



#create_json_file("alternative_bands.json",band_items)


    
    # for page in page_dicts:
    #     print(page['title']) # also print it out so we can see it
    #     output_dict = {'website' : 'en.wikipedia.org',
    #                    'page_title' : page['title'] }
    #     output_string = json.dumps(output_dict)
    #     print(output_string, file=band_list) # print to the file

...........................................................................................................................................................
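
When looping over more than a hundred pages, it can help to pause briefly between requests and to keep track of titles whose response did not contain any data. The sketch below assumes a hypothetical `collect_items` helper that takes the request function as an argument, so it can be tried offline with a stand-in. The `fake_fetch` function and its return values are invented for the example.

```python
import time

def collect_items(titles, fetch, pause_seconds=0.1):
    """Call fetch(title) for each title and gather the "items" lists."""
    all_items, failed = [], []
    for title in titles:
        data = fetch(title)
        if "items" in data:
            all_items.extend(data["items"])
        else:
            failed.append(title)  # the response had no "items" key
        time.sleep(pause_seconds)  # brief pause between requests
    return all_items, failed

# Stand-in for a real request function, so the sketch runs offline.
def fake_fetch(title):
    if title == "Missing_Page":
        return {"detail": "not found"}
    return {"items": [{"article": title, "timestamp": "2023010100", "views": 1}]}

items, failed = collect_items(["Nirvana_(band)", "Missing_Page"],
                              fake_fetch, pause_seconds=0)
print(len(items), failed)
```

In the notebook, the call might look like `collect_items(band_list, lambda t: get_wikipedia_pageviews(t, "en"))`.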


In [20]:
alternative_bands_file_name = "alternative_bands"
create_json_file(alternative_bands_file_name+".json",band_items)
bands_dict = open_json_to_dict(alternative_bands_file_name+".json")
#print(bands_dict)

views_by_day = {}
with open(f"{alternative_bands_file_name}.tsv", "w") as f:
    print("band\tday\ttotal_views", file=f)
    for day_dict in bands_dict:
        day = clean_up_timestamp(day_dict['timestamp'])
        band_name = day_dict["article"]
        # add to the running total for this day instead of overwriting it
        views_by_day[day] = views_by_day.get(day, 0) + day_dict['views']
        print(band_name, day, day_dict['views'], sep="\t", file=f)

print(f"{alternative_bands_file_name}.tsv created")

#get_views_by_day(bands_dict)

alternative_bands.json created
alternative_bands.tsv created
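
Question 4.1 asks for the page views summed across all of the pages. A minimal sketch of that step, grouping by month and producing TSV text, could look like this. The function names and the sample items are assumptions made up for illustration.

```python
from collections import Counter

def total_views_by_month(items):
    # Add up views across every article, keyed by "YYYY-MM".
    totals = Counter()
    for item in items:
        month = item["timestamp"][0:4] + "-" + item["timestamp"][4:6]
        totals[month] += item["views"]
    return totals

def totals_to_tsv(totals):
    lines = ["month\ttotal_views"]
    for month in sorted(totals):
        lines.append(f"{month}\t{totals[month]}")
    return "\n".join(lines) + "\n"

sample_items = [  # made-up values
    {"article": "Band_A", "timestamp": "2023010100", "views": 4},
    {"article": "Band_B", "timestamp": "2023010100", "views": 6},
    {"article": "Band_A", "timestamp": "2023020100", "views": 1},
]
print(totals_to_tsv(total_views_by_month(sample_items)), end="")
```

Writing the returned text to a `.tsv` file gives one row per month, ready for a time series graph in a spreadsheet.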


## #2 Starting on your projects

|  |  |
| :-: | --- |
| [![Cmbox notice.png](https://upload.wikimedia.org/wikipedia/commons/7/76/Cmbox_notice.png)](https://wiki.communitydata.science/File:Cmbox_notice.png) | If you are planning on collecting data from Reddit, please look into using the [Pushshift API](https://pushshift.io/) instead of the default Reddit API. The Pushshift API is not as up-to-date, but it is targeted toward data scientists, not app-makers, and is likely much better suited to our needs in the class. That said, take a look at both! |

In this section, you will take your first steps towards working with your project API. Many of these questions will not involve code, so just mark down your answers in cells in your notebook.

One very useful trick is to convert cells into "markdown" mode. You can do this in the menu with _Cell→Cell Type→Markdown_, or you can just type `m` when the cell is selected but not being edited (press `Esc` to switch out of edit mode if you are editing). Pressing `y` turns it back into code. Markdown is just normal text, but if you want to do fancier stuff like links or formatting you can look at this [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/).

Feel free to document any findings you think might be useful as you continue to work on your project; you might thank yourself later!

1. Identify an API you will (or might!) want to use for your project.
2. Find documentation for that API and include links in your notebook.
3. What are the API endpoints you plan to use? What are the parameters you will need to use at that endpoint?
4. Is there a Python module that helps make contact with the API? (See if you can find example code on how to use it.)
    1. If so, download it, install it, and import it into your notebook.
5. Does the API require authentication? Does it need to be approved?
    1. If so, sign up for a developer account and get your keys. (Do this early because it often takes time for these accounts to be approved.)
6. Does the API list rate limits? Does it make any requests about how you should use it?
7. Make a single API call, either directly using requests or using the Python module you have used. It doesn't matter for what. The goal is that you can get _something_.
8. IMPORTANT: If you have included any API keys in your notebook, _make a copy of your notebook and delete the cell that contains the keys before you upload the copy._ We'll show you some tricks for hiding this information going forward.
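
One common way to keep a key out of a notebook is to read it from an environment variable rather than typing it into a cell. The sketch below is only an illustration of that idea. The variable name `MY_PROJECT_API_KEY` and the function `load_api_key` are hypothetical, and how the key is then sent depends on the API you choose.

```python
import os

def load_api_key(variable_name="MY_PROJECT_API_KEY"):
    # variable_name is a hypothetical name; pick whatever suits your project.
    key = os.environ.get(variable_name)
    if not key:
        raise RuntimeError(f"Set the {variable_name} environment variable first")
    return key

# Demo only: a real key would be set outside the notebook, never typed here.
os.environ["MY_PROJECT_API_KEY"] = "not-a-real-key"
api_key = load_api_key()
print("key loaded, length", len(api_key))
```

Because the key never appears in a cell, the notebook can be uploaded without removing anything first.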

## Notes

\[\*\] You will probably not be shocked to hear that I collected this data from an API! I've included a Jupyter Notebook with the code to grab that data from [the PetScan API](https://petscan.wmflabs.org/) [in the form of this GitHub notebook](https://github.com/kayleachampion/spr23_CDSW/blob/main/curriculum/week5/get_washington_alternative_rock_bands_list-20230425.ipynb).

If you just want to read in the file, remember it's just a JSONL file, so you can modify the code from the lecture and it should work (e.g., something with `open()` and the `.readlines()` function associated with file variables).
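
A JSONL file holds one JSON object per line, so each line can be parsed on its own. Here is a short sketch of that, using an in-memory stand-in for the downloaded file. The function name `read_page_titles` and the two example titles are assumptions, while the `website` and `page_title` keys follow the notebook code above.

```python
import io
import json

def read_page_titles(lines):
    # Each non-empty line is its own JSON object with a "page_title" key.
    titles = []
    for line in lines:
        line = line.strip()
        if line:
            titles.append(json.loads(line)["page_title"])
    return titles

# In-memory stand-in for the downloaded file; the titles are only examples.
fake_file = io.StringIO(
    '{"website": "en.wikipedia.org", "page_title": "Nirvana_(band)"}\n'
    '{"website": "en.wikipedia.org", "page_title": "Pearl_Jam"}\n'
)
print(read_page_titles(fake_file.readlines()))
```

With the real file, `open(...)` on the downloaded `.jsonl` path would replace the `StringIO` object.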

In [None]:
# 1. I'm looking forward to using the IMDb API and/or the Rotten Tomatoes API.
# 2. https://developer.imdb.com/documentation/api-documentation, https://developer.fandango.com/rotten_tomatoes
# 3. Not sure yet; the documentation is not completely clear about this, and I don't have access to these APIs yet.
# 4. requests
# 5. Yes, authentication seems to be based on an API key.
# 6. I don't know yet, as I don't have access.
# 7. NA
