In [1]:
import requests
import json

Identify a movie, television, video game, or other media property that has both (a) 5 or more related articles on Wikipedia and (b) 5 or more other articles on the same topic on a Fandom.com website. Any large entertainment franchise will definitely work but feel free to get creative! For example, you might choose 5 Wikipedia articles about the anime Naruto and 5 articles (pages) from the naruto.fandom.com site. You may notice that fandom.com has a top layer with staff-produced video content, but once you dig down into a particular fandom's wiki, you'll start to see a more familiar wiki style page. For example, compare the fandom.com page about the SpongeBob pilot episode 'Help Wanted' and the Wikipedia page about the same pilot episode.

In [2]:
# these are each main characters in The Wire
articles = ["Jimmy McNulty",
            "Rhonda Pearlman",
            "Stringer Bell",
            "Omar Little",
            "Tommy Carcetti"]

In [3]:
wikipedia_url = "https://en.wikipedia.org/w/api.php"

In [4]:
def get_article_revision_json(endpoint, title):
    api_answers = []
    
    parameters = {'action' : 'query',
                  'titles' : title,
                  'prop' : 'revisions',
                  'rvprop' : 'flags|timestamp|user|size|ids',
                  'rvlimit' : 500,
                  'format' : 'json',
                   }

    # we'll repeat this forever (i.e., we'll only stop when we reach
    # the "break" statement)
    while True:
        # to be polite to the server, uncomment the next line (and add
        # "import time" at the top) to wait one second between requests
        # time.sleep(1)
        
        # requests builds the full URL and also handles unicode titles
        call = requests.get(endpoint, params=parameters)
        api_answer = call.json()
        
        # now we'll add this to whatever we are tracking
        api_answers.append(api_answer)
        
        # 'continue' tells us there are more revisions to fetch
        if 'continue' in api_answer:
            # merge the continuation tokens from the api_answer
            # dictionary into the parameters for the next request
            parameters.update(api_answer['continue'])
        else:
            break
        
    return api_answers
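
The continuation logic above can be tried out without touching the network. The sketch below is only an illustration: `fake_api` and `collect_batches` are hypothetical names, and the canned responses are made up, although the `continue` and `rvcontinue` keys mirror the shape the MediaWiki API uses.

```python
def fake_api(parameters):
    """Hypothetical stand-in for requests.get(...).json(): serves two canned batches."""
    if "rvcontinue" not in parameters:
        # first batch: more revisions remain, so a 'continue' block is included
        return {
            "continue": {"rvcontinue": "batch2", "continue": "||"},
            "query": {"pages": {"1": {"title": "Example",
                                      "revisions": [{"revid": 3}, {"revid": 2}]}}},
        }
    # last batch: no 'continue' key, which is the signal to stop
    return {
        "query": {"pages": {"1": {"title": "Example",
                                  "revisions": [{"revid": 1}]}}},
    }


def collect_batches(fetch, parameters):
    """Same loop shape as get_article_revision_json, with the fetch step passed in."""
    parameters = dict(parameters)  # copy so the caller's dictionary is not modified
    api_answers = []
    while True:
        api_answer = fetch(parameters)
        api_answers.append(api_answer)
        if "continue" not in api_answer:
            break
        # merge the continuation tokens into the next request's parameters
        parameters.update(api_answer["continue"])
    return api_answers


batches = collect_batches(fake_api, {"action": "query", "prop": "revisions"})
print(len(batches))
```

Passing the fetch step in as an argument is just a convenience for experimenting; the notebook's own function calls `requests.get` directly.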

In [5]:
wp_api_answers = []
with open('thewire_characters_wikipedia-20230508.jsonl', 'w') as output_file:
    for page_title in articles:
        
        print(f"now working on: {page_title}")
        page_api_answers = get_article_revision_json(wikipedia_url, page_title)
        for api_answer in page_api_answers:
            print(json.dumps(api_answer), file=output_file)
            wp_api_answers.append(api_answer)

now working on: Jimmy McNulty
now working on: Rhonda Pearlman
now working on: Stringer Bell
now working on: Omar Little
now working on: Tommy Carcetti


First, modify the code from the first set of notebooks I used in the Community Data Science Course (Spring 2023)/Week 6 lecture to download data (and metadata) about revisions to the 5 articles you chose from Wikipedia.

In [6]:
def api_data_into_revisions(api_answers):
    revisions = []

    for api_answer in api_answers:

        # get the list of pages from the json object
        pages = api_answer["query"]["pages"]

        # for every page (there should always be only one), get its revisions:
        for page_id in pages.keys():
            # a missing or mistyped title has no "revisions" key, so default to an empty list
            query_revisions = pages[page_id].get("revisions", [])
            title = pages[page_id]['title']

            # for every revision, first we do some cleaning up
            for rev in query_revisions:
                #print(rev)
                # let's continue/skip this revision if the user is hidden
                if "userhidden" in rev.keys():
                    continue

                # 1: add a title field for the article because we're going to mix them together
                rev["title"] = title

                # 2: let's "recode" anon so it's true or false instead of present/missing
                if "anon" in rev.keys():
                    rev["anon"] = True
                else:
                    rev["anon"] = False

                # 3: let's recode "minor" in the same way
                if "minor" in rev.keys():
                    rev["minor"] = True
                else:
                    rev["minor"] = False

                # we're going to change the timestamp to make it work a little better in excel/spreadsheets
                rev["timestamp"] = rev["timestamp"].replace("T", " ")
                rev["timestamp"] = rev["timestamp"].replace("Z", "")

                # finally, save the revisions we've seen to a variable
                revisions.append(rev)
                
    return revisions
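
To see what the cleaning steps do to a single revision, here is a small sketch. `recode_revision` is a hypothetical helper (not part of the notebook) and the sample revision is invented; it assumes the default JSON format, where a flag such as `anon` or `minor` is signaled by the key being present rather than by its value.

```python
def recode_revision(rev, title):
    """Return a cleaned copy of one revision dictionary (hypothetical helper)."""
    cleaned = dict(rev)  # copy so the original dictionary is left alone
    cleaned["title"] = title
    # a flag counts as set when its key is present, whatever the value
    cleaned["anon"] = "anon" in rev
    cleaned["minor"] = "minor" in rev
    # "2023-05-08T12:34:56Z" becomes "2023-05-08 12:34:56"
    cleaned["timestamp"] = rev["timestamp"].replace("T", " ").replace("Z", "")
    return cleaned


# made-up revision for illustration only
sample = {"revid": 1, "user": "127.0.0.1", "anon": "",
          "timestamp": "2023-05-08T12:34:56Z", "size": 100}
cleaned = recode_revision(sample, "Omar Little")
print(cleaned)
```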

In [7]:
wp_revisions = api_data_into_revisions(wp_api_answers)

Be ready to share:

(i) what proportion of those edits were made by users without accounts ("anon"),

In [8]:
def prop_flag(revisions, flag):

    num_edits = len(revisions)

    # count the number of edits where the flag is set
    num_flagged = 0

    for rev in revisions:
        if rev[flag]:
            num_flagged = num_flagged + 1

    prop_flagged = num_flagged / num_edits

    print(f"total edits: {num_edits}")
    print(f"flaged edits: {num_flagged}")
    print(f"proportion flagged: {prop_flagged}")
    return prop_flagged

In [9]:
prop_flag(wp_revisions, "anon")

total edits: 2275
flaged edits: 1091
proportion flagged: 0.47956043956043953


0.47956043956043953
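
The same proportion can be computed more compactly. This is a sketch rather than a replacement for the function above: `proportion_flagged` is a hypothetical name, the toy data is invented, and returning 0.0 for an empty list is an assumption made here to avoid dividing by zero.

```python
def proportion_flagged(revisions, flag):
    """Share of revisions whose recoded flag is True (hypothetical helper)."""
    if not revisions:
        # assumption: treat "no revisions" as a proportion of 0.0
        return 0.0
    num_flagged = sum(1 for rev in revisions if rev[flag])
    return num_flagged / len(revisions)


# toy data: two of the four edits are anonymous
toy_revisions = [{"anon": True}, {"anon": False}, {"anon": False}, {"anon": True}]
print(proportion_flagged(toy_revisions, "anon"))
```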

(ii) what proportion of those edits were marked as "minor", and

In [10]:
prop_flag(wp_revisions, "minor")

total edits: 2275
flaged edits: 499
proportion flagged: 0.21934065934065933


0.21934065934065933

(iii) make and share a visualization of the total number of edits across those 5 articles over time (I didn't do this in class, but I made the TSV file that would allow this).

In [11]:
def write_edits_by_day(revisions, output_filename):
    # let's count the number of edits by day
    edits_by_day = {}
    for rev in revisions:
        day_string = rev['timestamp'][0:10]

        if day_string in edits_by_day.keys():
            edits_by_day[day_string] = edits_by_day[day_string] + 1
        else:
            edits_by_day[day_string] = 1
    
    # write out a TSV file we could analyze in google docs
    with open(output_filename, "w", encoding='utf-8') as output_file:
        # write a header
        print("date\tedits", file=output_file)

        # iterate through every day and print out data into the file
        for day_string in edits_by_day.keys():
            print("\t".join([day_string, str(edits_by_day[day_string])]), file=output_file)

In [12]:
write_edits_by_day(wp_revisions, "thewire_characters_edits_by_day-enwp.tsv")

https://docs.google.com/spreadsheets/d/17slFCkD8EqiP6VKvojkb-BwpoQQvN6sZ9xGKh3erWX4/edit#gid=85977895
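
Daily counts can be sparse when charted, so one option is to group edits by month before plotting. The sketch below is an assumption about how that might be done, not what the spreadsheet above does; `edits_by_month` is a hypothetical helper and the timestamps are made up. It relies on the cleaned "YYYY-MM-DD HH:MM:SS" timestamp format produced earlier.

```python
from collections import Counter


def edits_by_month(revisions):
    """Count revisions per calendar month using the 'YYYY-MM' timestamp prefix."""
    counts = Counter(rev["timestamp"][0:7] for rev in revisions)
    # 'YYYY-MM' strings sort chronologically when sorted as plain text
    return sorted(counts.items())


# made-up timestamps in the cleaned format
toy_revisions = [
    {"timestamp": "2023-05-08 10:00:00"},
    {"timestamp": "2023-05-01 09:00:00"},
    {"timestamp": "2023-04-30 23:59:59"},
]

for month, edits in edits_by_month(toy_revisions):
    print(f"{month}\t{edits}")
```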

Now grab revision/edit data for the 5 articles you chose from the Fandom.com wiki you identified. (Hint: Your Wikipedia work will give you lots of clues here: for example, the Fandom API endpoint for The Wire is https://thewire.fandom.com/api.php and the Fandom API, as I said in class, is the same as the Wikipedia API.) Produce answers to the same three questions (i, ii, and iii) above but using this dataset.
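
Because both sites expose the same API, the only thing that changes between them is the endpoint. The sketch below builds the query URL for each site without sending a request; `build_revision_query_url` is a hypothetical helper, and it uses the standard library's `urlencode` instead of `requests` purely so the resulting URL can be inspected.

```python
from urllib.parse import urlencode


def build_revision_query_url(endpoint, title):
    """Build (but do not send) the revisions query used in this notebook."""
    parameters = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "flags|timestamp|user|size|ids",
        "rvlimit": 500,
        "format": "json",
    }
    return endpoint + "?" + urlencode(parameters)


endpoints = ["https://en.wikipedia.org/w/api.php",
             "https://thewire.fandom.com/api.php"]

# identical parameters, different endpoint
for endpoint in endpoints:
    print(build_revision_query_url(endpoint, "Omar Little"))
```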


In [13]:
fandom_url = "https://thewire.fandom.com/api.php"

In [14]:
fd_api_answers = []
with open('thewire_characters_fandom-20230508.jsonl', 'w') as output_file:
    for page_title in articles:
        
        print(f"now working on: {page_title}")
        page_api_answers = get_article_revision_json(fandom_url, page_title)
        for api_answer in page_api_answers:
            print(json.dumps(api_answer), file=output_file)
            fd_api_answers.append(api_answer)

now working on: Jimmy McNulty
now working on: Rhonda Pearlman
now working on: Stringer Bell
now working on: Omar Little
now working on: Tommy Carcetti


In [15]:
fd_revisions = api_data_into_revisions(fd_api_answers)

In [16]:
prop_flag(fd_revisions, "anon")

total edits: 158
flaged edits: 28
proportion flagged: 0.17721518987341772


0.17721518987341772

(ii) what proportion of those edits were marked as "minor", and

In [17]:
prop_flag(fd_revisions, "minor")

total edits: 158
flaged edits: 19
proportion flagged: 0.12025316455696203


0.12025316455696203

In [18]:
write_edits_by_day(fd_revisions, "thewire_characters_edits_by_day-fandom.tsv")

https://docs.google.com/spreadsheets/d/17slFCkD8EqiP6VKvojkb-BwpoQQvN6sZ9xGKh3erWX4/edit#gid=964347220

In [19]:
# I will just write out 5 files (one per article) and put them together in Google Sheets
for title in articles:
    with open(f"article_size_per_day-{title}.tsv", 'w') as output_file:
        print("timestamp\tsize", file=output_file)
        for rev in wp_revisions:
            # skip if it's not the right article
            if rev['title'] != title:
                continue
            print(f'{rev["timestamp"]}\t{rev["size"]}', file=output_file)
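
The loop above scans every revision once per article. An alternative, sketched here under the assumption that the revisions carry the `title`, `timestamp`, and `size` fields created earlier, is to group them in a single pass. `sizes_by_title` is a hypothetical helper and the toy data is invented.

```python
from collections import defaultdict


def sizes_by_title(revisions):
    """Group (timestamp, size) pairs by article title in one pass."""
    grouped = defaultdict(list)
    for rev in revisions:
        grouped[rev["title"]].append((rev["timestamp"], rev["size"]))
    # put each article's rows in chronological order; the cleaned
    # 'YYYY-MM-DD HH:MM:SS' strings sort correctly as plain text
    for rows in grouped.values():
        rows.sort()
    return dict(grouped)


# made-up revisions for illustration only
toy_revisions = [
    {"title": "Omar Little", "timestamp": "2023-05-08 10:00:00", "size": 120},
    {"title": "Stringer Bell", "timestamp": "2023-05-07 09:00:00", "size": 90},
    {"title": "Omar Little", "timestamp": "2023-05-01 08:00:00", "size": 100},
]

for title, rows in sizes_by_title(toy_revisions).items():
    print(title, rows)
```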

Jimmy McNulty: https://docs.google.com/spreadsheets/d/17slFCkD8EqiP6VKvojkb-BwpoQQvN6sZ9xGKh3erWX4/edit#gid=1589826795

Rhonda Pearlman: https://docs.google.com/spreadsheets/d/17slFCkD8EqiP6VKvojkb-BwpoQQvN6sZ9xGKh3erWX4/edit#gid=924927859

Stringer Bell: https://docs.google.com/spreadsheets/d/17slFCkD8EqiP6VKvojkb-BwpoQQvN6sZ9xGKh3erWX4/edit#gid=972117478

Omar Little: https://docs.google.com/spreadsheets/d/17slFCkD8EqiP6VKvojkb-BwpoQQvN6sZ9xGKh3erWX4/edit#gid=28443467

Tommy Carcetti: https://docs.google.com/spreadsheets/d/17slFCkD8EqiP6VKvojkb-BwpoQQvN6sZ9xGKh3erWX4/edit#gid=2014380976