# Data Acquisition

This Notebook is created to acquire articles data using Wikimedia REST API to request data for each article and get the monthly counts of page views. I have used the example code for data extraction along with some changes based on the homework assignment to perform this task

# Article Page Views API Code

In [6]:
import pandas as pd
import json, time, urllib.parse
import requests

Setting constants

In [7]:
#########
#
#    CONSTANTS
#

# The REST API 'pageviews' URL - this is the common URL/endpoint for all 'pageviews' API requests
API_REQUEST_PAGEVIEWS_ENDPOINT = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/'

# This is a parameterized string that specifies what kind of pageviews request we are going to make
# In this case it will be a 'per-article' based request. The string is a format string so that we can
# replace each parameter with an appropriate value before making the request
API_REQUEST_PER_ARTICLE_PARAMS = 'per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}'

# The Pageviews API asks that we not exceed 100 requests per second, we add a small delay to each request
API_LATENCY_ASSUMED = 0.002       # Assuming roughly 2ms latency on the API and network
API_THROTTLE_WAIT = (1.0/100.0)-API_LATENCY_ASSUMED

# When making a request to the Wikimedia API they ask that you include your email address which will allow them
# to contact you if something happens - such as - your code exceeding rate limits - or some other error 
REQUEST_HEADERS = {
    'User-Agent': '<uwnetid@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2023',
}

# This is just a list of English Wikipedia article titles that we can use for example requests
ARTICLE_TITLES = [ 'Bison', 'Northern flicker', 'Red squirrel', 'Chinook salmon', 'Horseshoe bat' ]

# This template is used to map parameter values into the API_REQUST_PER_ARTICLE_PARAMS portion of an API request. The dictionary has a
# field/key for each of the required parameters. In the example, below, we only vary the article name, so the majority of the fields
# can stay constant for each request. Of course, these values *could* be changed if necessary.
ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023040100"    # this is likely the wrong end date
}


The API request will be made using one procedure. The idea is to make this reusable. The procedure is parameterized, but relies on the constants above for the important parameters. The underlying assumption is that this will be used to request data for a set of article pages. Therefore the parameter most likely to change is the article_title.

In [8]:
#########
#
#    PROCEDURES/FUNCTIONS
#

def request_pageviews_per_article(article_title = None, 
                                  endpoint_url = API_REQUEST_PAGEVIEWS_ENDPOINT, 
                                  endpoint_params = API_REQUEST_PER_ARTICLE_PARAMS, 
                                  request_template = ARTICLE_PAGEVIEWS_PARAMS_TEMPLATE,
                                  headers = REQUEST_HEADERS):

    # article title can be as a parameter to the call or in the request_template
    if article_title:
        request_template['article'] = article_title

    if not request_template['article']:
        raise Exception("Must supply an article title to make a pageviews request.")

    # Titles are supposed to have spaces replaced with "_" and be URL encoded
    article_title_encoded = urllib.parse.quote(request_template['article'].replace(' ','_'))
    request_template['article'] = article_title_encoded
    
    # now, create a request URL by combining the endpoint_url with the parameters for the request
    request_url = endpoint_url+endpoint_params.format(**request_template)
    
    # make the request
    try:
        # we'll wait first, to make sure we don't exceed the limit in the situation where an exception
        # occurs during the request processing - throttling is always a good practice with a free
        # data source like Wikipedia - or other community sources
        if API_THROTTLE_WAIT > 0.0:
            time.sleep(API_THROTTLE_WAIT)
        response = requests.get(request_url, headers=headers)
        json_response = response.json()
    except Exception as e:
        print(e)
        json_response = None
    return json_response


## Getting the subset of articles from thank_the_academy.AUG.2023.csv

In [9]:
articles = pd.read_csv('thank_the_academy.AUG.2023.csv.csv')
articles

Unnamed: 0,name,url
0,Everything Everywhere All at Once,https://en.wikipedia.org/wiki/Everything_Every...
1,All Quiet on the Western Front (2022 film),https://en.wikipedia.org/wiki/All_Quiet_on_the...
2,The Whale (2022 film),https://en.wikipedia.org/wiki/The_Whale_(2022_...
3,Top Gun: Maverick,https://en.wikipedia.org/wiki/Top_Gun:_Maverick
4,Black Panther: Wakanda Forever,https://en.wikipedia.org/wiki/Black_Panther:_W...
...,...,...
1354,The Yankee Doodle Mouse,https://en.wikipedia.org/wiki/The_Yankee_Doodl...
1355,The Yearling (1946 film),https://en.wikipedia.org/wiki/The_Yearling_(19...
1356,"Yesterday, Today and Tomorrow","https://en.wikipedia.org/wiki/Yesterday,_Today..."
1357,You Can't Take It with You (film),https://en.wikipedia.org/wiki/You_Can't_Take_I...


In [10]:
article_list = list(articles['name'])
article_list

['Everything Everywhere All at Once',
 'All Quiet on the Western Front (2022 film)',
 'The Whale (2022 film)',
 'Top Gun: Maverick',
 'Black Panther: Wakanda Forever',
 'Avatar: The Way of Water',
 'Women Talking (film)',
 "Guillermo del Toro's Pinocchio",
 'Navalny (film)',
 'The Elephant Whisperers',
 'An Irish Goodbye',
 'The Boy, the Mole, the Fox and the Horse (film)',
 'RRR (film)',
 'CODA (2021 film)',
 'Dune (2021 film)',
 'The Eyes of Tammy Faye (2021 film)',
 'No Time to Die',
 'The Windshield Wiper',
 'The Long Goodbye (Riz Ahmed album)',
 'The Queen of Basketball',
 'Summer of Soul',
 'Drive My Car (film)',
 'Encanto',
 'West Side Story (2021 film)',
 'Belfast (film)',
 'The Power of the Dog (film)',
 'King Richard (film)',
 'Cruella (film)',
 'Nomadland (film)',
 'The Father (2020 film)',
 'Judas and the Black Messiah',
 'Minari (film)',
 'Mank',
 'Sound of Metal',
 "Ma Rainey's Black Bottom (film)",
 'Promising Young Woman',
 'Tenet (film)',
 'Soul (2020 film)',
 'Another

## REST API for Mobile, Desktop and Cummulative data pull

Callling the REST API for mobile data and desktop data pull and then manipulating the dataframes to get final page view counts for Mobile, Desktop and Cummulative accesses. The function "request_pageviews_per_article" is used to get monthly time series data for each article in the subset and changeing the request_template to update accesses. Also storing the dataframes as json files for each of the accesses.

In [11]:
all_mobile_df = pd.DataFrame()
all_desktop_df = pd.DataFrame()
all_cumulative_df = pd.DataFrame()

for i in range(0,len(article_list)-1):
    subset_views_mobile_web = request_pageviews_per_article(article_list[i], request_template = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-web",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023093000"    # this is likely the wrong end date
})
    
    subset_views_mobile_app = request_pageviews_per_article(article_list[i], request_template = {
    "project":     "en.wikipedia.org",
    "access":      "mobile-app",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023093000"    # this is likely the wrong end date
})
    
    subset_views_desktop = request_pageviews_per_article(article_list[i], request_template = {
    "project":     "en.wikipedia.org",
    "access":      "desktop",      # this should be changed for the different access types
    "agent":       "user",
    "article":     "",             # this value will be set/changed before each request
    "granularity": "monthly",
    "start":       "2015010100",   # start and end dates need to be set
    "end":         "2023093000"    # this is likely the wrong end date
})
    
#     print(subset_views)
    try:
        mobile_web_df = pd.DataFrame(subset_views_mobile_web['items'])
        mobile_app_df = pd.DataFrame(subset_views_mobile_app['items'])
        mobile_df = pd.concat([mobile_web_df,mobile_app_df])
        mobile_df = mobile_df.groupby(['project','article','granularity','timestamp','agent'])['views'].sum().reset_index()
        mobile_df['access'] = "mobile"
        all_mobile_df = pd.concat([all_mobile_df,mobile_df])
        
        desktop_df = pd.DataFrame(subset_views_desktop['items'])
        all_desktop_df = pd.concat([all_desktop_df,desktop_df])
        
        cumulative_df = pd.concat([mobile_df,desktop_df])
        cumulative_df = cumulative_df.groupby(['project','article','granularity','timestamp','agent'])['views'].sum().reset_index()
        all_cumulative_df = pd.concat([all_cumulative_df,cumulative_df])
        
        if i%100 == 0:
            print(i)
    except:
        print(article_list[i])
    
    
print(all_mobile_df)
all_mobile_df.to_json('academy_monthly_mobile_201501-202309.json', orient = 'split', compression = 'infer', index = 'true')
print(all_desktop_df)
all_desktop_df.to_json('academy_monthly_desktop_201501-202309.json', orient = 'split', compression = 'infer', index = 'true')
print(all_cumulative_df)
all_cumulative_df.to_json('academy_monthly_cumulative_201501-202309.json', orient = 'split', compression = 'infer', index = 'true')

0
100
200
300
400
500
Victor/Victoria
600
700
800
900
1000
1100
1200
1300
         project                            article granularity   timestamp  \
0   en.wikipedia  Everything_Everywhere_All_at_Once     monthly  2020010100   
1   en.wikipedia  Everything_Everywhere_All_at_Once     monthly  2020020100   
2   en.wikipedia  Everything_Everywhere_All_at_Once     monthly  2020030100   
3   en.wikipedia  Everything_Everywhere_All_at_Once     monthly  2020040100   
4   en.wikipedia  Everything_Everywhere_All_at_Once     monthly  2020050100   
..           ...                                ...         ...         ...   
94  en.wikipedia  You_Can't_Take_It_with_You_(film)     monthly  2023050100   
95  en.wikipedia  You_Can't_Take_It_with_You_(film)     monthly  2023060100   
96  en.wikipedia  You_Can't_Take_It_with_You_(film)     monthly  2023070100   
97  en.wikipedia  You_Can't_Take_It_with_You_(film)     monthly  2023080100   
98  en.wikipedia  You_Can't_Take_It_with_You_(film)     m