# A1: Data Curation
This project collects and analyzes Wikipedia page view data from 2008-2021.
It was created by Emily Linebarger (elineb@uw.edu) in October 2021 and is maintained in the GitHub repository https://github.com/kathrynline/data-512-a1.

## Data Acquisition
Part one of the project queries two Wikipedia APIs for page view information.
The Wikipedia legacy Pagecounts API contains desktop and mobile traffic data from December 2007 through July 2016. Its endpoint is 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'.
The Wikipedia Pageviews API contains desktop, mobile web, and mobile app traffic data from July 2015 through the previous month. Its endpoint is 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'.

Documentation for both APIs can be found at https://wikimedia.org/api/rest_v1/#/. 
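
The placeholders in braces are path segments, so a request URL is built by substituting one value for each of them. A minimal sketch of that substitution, using the legacy endpoint and the parameter values that appear in the extraction cells of this notebook; no request is sent here:

```python
# Fill the legacy Pagecounts endpoint template with concrete path segments.
endpoint_legacy = (
    'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/'
    '{project}/{access-site}/{granularity}/{start}/{end}'
)

params = {
    "project": "en.wikipedia.org",
    "access-site": "desktop-site",  # hyphenated keys work when unpacked with **
    "granularity": "monthly",
    "start": "2007120100",  # YYYYMMDDHH
    "end": "2016080100",
}

url = endpoint_legacy.format(**params)
print(url)
```

Because `access-site` is not a valid Python identifier, it can only be supplied through dictionary unpacking, not as a keyword argument written out by hand.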

## Notes:
For the Pageviews API, you can filter to organic user traffic by setting agent=user in the request parameters. The legacy Pagecounts API has no equivalent filter.
The two APIs overlap by about one year (July 2015 through July 2016).
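
The size of the overlap can be checked by enumerating the months each API covers. This is a small sketch under the coverage windows stated above, taking September 2021 as the last Pageviews month because that is the last month requested in this notebook; `month_range` is a hypothetical helper.

```python
from datetime import date

def month_range(start, end):
    """Return the first day of every month from start through end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months

# Coverage windows as described in this notebook (first and last month with data).
legacy_months = set(month_range(date(2007, 12, 1), date(2016, 7, 1)))
pageview_months = set(month_range(date(2015, 7, 1), date(2021, 9, 1)))

overlap = sorted(legacy_months & pageview_months)
print(len(overlap), overlap[0], overlap[-1])
```

Counting both endpoints, the shared window is 13 monthly records.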

In [39]:
import json
import requests
import pandas as pd

ModuleNotFoundError: No module named 'pandas'

In [24]:
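def api_call(endpoint, parameters):
    # Fill the endpoint template with the parameters and return the parsed JSON response.
    # Relies on the global `headers` dict defined in the next cell.
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()

    return response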
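
`api_call` returns whatever JSON the server sends, which includes an error body if the request fails. A minimal sketch of a stricter variant follows. It swaps `requests` for the standard library's `urllib.request` so that it runs without third-party packages; the name `fetch_json` and the `opener` parameter are hypothetical and not part of this notebook.

```python
import json
import urllib.request

def fetch_json(endpoint, parameters, headers, opener=urllib.request.urlopen, timeout=30):
    """Fill the endpoint template, send a GET request, and return parsed JSON.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so an
    error page is never parsed and saved as if it were data.
    """
    url = endpoint.format(**parameters)
    request = urllib.request.Request(url, headers=headers)
    with opener(request, timeout=timeout) as response:
        return json.load(response)
```

Usage would mirror `api_call`, for example `fetch_json(endpoint_legacy, desktop_params_legacy, headers)`. The `opener` argument exists only so the function can be exercised without network access.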

In [25]:
headers = {
    'User-Agent': 'https://github.com/kathrynline',
    'From': 'elineb@uw.edu'
}

### Wikipedia legacy Pagecounts API extraction

In [26]:
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'

In [27]:
# Get desktop site data from December 1 2007 - July 31 2016. 
# To get through July 31, specify the end date as August 1 2016. 
desktop_params_legacy = {"project" : "en.wikipedia.org",
                 "access-site" : "desktop-site",
                 "granularity" : "monthly",
                 "start" : "2007120100",
                 "end" : "2016080100"
                    }

# Get mobile site data from December 1 2007 - July 31 2016. 
# To get through July 31, specify the end date as August 1 2016. 
mobile_params_legacy = {"project" : "en.wikipedia.org",
                 "access-site" : "mobile-site",
                 "granularity" : "monthly",
                 "start" : "2007120100",
                 "end" : "2016080100"
                    }
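
As the comments above note, the `end` value is the first day of the month after the last month wanted. A small sketch of deriving these strings from a year and month instead of typing them by hand; `month_start_timestamp` and `next_month` are hypothetical helpers, and the `YYYYMMDDHH` layout is taken from the values above.

```python
from datetime import date

def month_start_timestamp(year, month):
    """Format the first day of a month as a YYYYMMDDHH string (hour 00)."""
    return date(year, month, 1).strftime('%Y%m%d') + '00'

def next_month(year, month):
    """Return the (year, month) pair that follows the given month."""
    return (year + 1, 1) if month == 12 else (year, month + 1)

# Request December 2007 through July 2016: the end is the month after July 2016.
start = month_start_timestamp(2007, 12)
end = month_start_timestamp(*next_month(2016, 7))
print(start, end)
```

Generating the strings this way avoids malformed timestamps with a missing or extra digit.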

In [28]:
legacy_desktop = api_call(endpoint_legacy, desktop_params_legacy)

with open('../0_data_raw/pagecounts_desktop-site_200712_202108.json', 'w') as outfile:
    json.dump(legacy_desktop, outfile)

In [29]:
legacy_mobile = api_call(endpoint_legacy, mobile_params_legacy)

with open('../0_data_raw/pagecounts_mobile-site_200712_202108.json', 'w') as outfile:
    json.dump(legacy_mobile, outfile)

### Wikipedia Pageviews API extraction

In [30]:
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

In [31]:
# Get desktop site data from July 2015 - September 2021.
# To get through September 30 2021, specify the end date as October 1 2021.
desktop_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "desktop",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "20150701",
                    "end" : "2021100100"
                        }

# Get mobile website data from July 2015 - September 2021.
# To get through September 30 2021, specify the end date as October 1 2021.
mobile_site_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "mobile-web",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "20150701",
                    "end" : "2021100100"
                        }

# Get mobile app data from July 2015 - September 2021.
# To get through September 30 2021, specify the end date as October 1 2021.
mobile_app_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "mobile-app",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "20150701",
                    "end" : "2021100100"
                        }

In [32]:
pageviews_desktop = api_call(endpoint_pageviews, desktop_params_pageviews)

with open('../0_data_raw/pageviews_desktop_201507_202109.json', 'w') as outfile:
    json.dump(pageviews_desktop, outfile)

In [33]:
pageviews_mobile_site = api_call(endpoint_pageviews, mobile_site_params_pageviews)

with open('../0_data_raw/pageviews_mobile_site_201507_202109.json', 'w') as outfile:
    json.dump(pageviews_mobile_site, outfile)

In [34]:
pageviews_mobile_app = api_call(endpoint_pageviews, mobile_app_params_pageviews)

with open('../0_data_raw/pageviews_mobile_app_201507_202109.json', 'w') as outfile:
    json.dump(pageviews_mobile_app, outfile)

## Data Processing

In [35]:
# Read in raw data outputs, closing each file once it has been parsed
def read_json(path):
    with open(path) as infile:
        return json.load(infile)

pagecounts_desktop = read_json('../0_data_raw/pagecounts_desktop-site_200712_202108.json')
pagecounts_mobile = read_json('../0_data_raw/pagecounts_mobile-site_200712_202108.json')
pageviews_desktop = read_json('../0_data_raw/pageviews_desktop_201507_202109.json')
pageviews_mobile_site = read_json('../0_data_raw/pageviews_mobile_site_201507_202109.json')
pageviews_mobile_app = read_json('../0_data_raw/pageviews_mobile_app_201507_202109.json')

In [None]:
# Combine all data into one CSV
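
One way this step could look, sketched with the standard `csv` module. The record layout (`timestamp` plus `count` or `views`) matches the raw outputs printed in this notebook; the column names, the `index_by_timestamp` helper, and the tiny sample records are illustrative assumptions, not the project's final schema or real traffic figures.

```python
import csv
import io

def index_by_timestamp(items, value_key):
    """Map each record's YYYYMMDDHH timestamp to its traffic value."""
    return {item['timestamp']: item[value_key] for item in items}

# Placeholder records with made-up values, shaped like the API responses.
pagecounts_desktop_items = [{'timestamp': '2015070100', 'count': 5}]
pageviews_desktop_items = [{'timestamp': '2015070100', 'views': 3},
                           {'timestamp': '2015080100', 'views': 4}]

series = {
    'pagecount_desktop_views': index_by_timestamp(pagecounts_desktop_items, 'count'),
    'pageview_desktop_views': index_by_timestamp(pageviews_desktop_items, 'views'),
}

# One row per month seen in any series; months missing from a series get 0.
timestamps = sorted(set().union(*(s.keys() for s in series.values())))
buffer = io.StringIO()  # stands in for an output file
writer = csv.DictWriter(buffer, fieldnames=['year', 'month', *series], lineterminator='\n')
writer.writeheader()
for ts in timestamps:
    row = {'year': ts[:4], 'month': ts[4:6]}
    row.update({name: values.get(ts, 0) for name, values in series.items()})
    writer.writerow(row)

print(buffer.getvalue())
```

Filling gaps with 0 is a design choice for this sketch; leaving them blank would be equally defensible for months an API does not cover.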

In [38]:
# Inspect the legacy mobile pagecounts records. Next step: from the pageviews data, combine the mobile app and mobile site data to create a total mobile traffic count per month.
print(pagecounts_mobile['items'])

[{'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granularity': 'monthly', 'timestamp': '2014100100', 'count': 3091546685}, {'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granularity': 'monthly', 'timestamp': '2014110100', 'count': 3027489668}, {'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granularity': 'monthly', 'timestamp': '2014120100', 'count': 3278950021}, {'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granularity': 'monthly', 'timestamp': '2015010100', 'count': 3485302091}, {'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granularity': 'monthly', 'timestamp': '2015020100', 'count': 3091534479}, {'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granularity': 'monthly', 'timestamp': '2015030100', 'count': 3330832588}, {'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granularity': 'monthly', 'timestamp': '2015040100', 'count': 3222089917}, {'project': 'en.wikipedia', 'access-site': 'mobile-site', 'granulari

In [37]:
print(pageviews_mobile_app)

{'items': [{'project': 'en.wikipedia', 'access': 'mobile-app', 'agent': 'user', 'granularity': 'monthly', 'timestamp': '2015070100', 'views': 109624146}, {'project': 'en.wikipedia', 'access': 'mobile-app', 'agent': 'user', 'granularity': 'monthly', 'timestamp': '2015080100', 'views': 109669149}, {'project': 'en.wikipedia', 'access': 'mobile-app', 'agent': 'user', 'granularity': 'monthly', 'timestamp': '2015090100', 'views': 96221684}, {'project': 'en.wikipedia', 'access': 'mobile-app', 'agent': 'user', 'granularity': 'monthly', 'timestamp': '2015100100', 'views': 94523777}, {'project': 'en.wikipedia', 'access': 'mobile-app', 'agent': 'user', 'granularity': 'monthly', 'timestamp': '2015110100', 'views': 94353925}, {'project': 'en.wikipedia', 'access': 'mobile-app', 'agent': 'user', 'granularity': 'monthly', 'timestamp': '2015120100', 'views': 99438956}, {'project': 'en.wikipedia', 'access': 'mobile-app', 'agent': 'user', 'granularity': 'monthly', 'timestamp': '2016010100', 'views': 1064
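
The comment in cell 38 describes summing mobile app and mobile web views per month. A minimal sketch of that step, assuming both responses share the `items`/`timestamp`/`views` layout printed above. `total_mobile_views` is a hypothetical helper, the app values are the first two records shown above, and the mobile web values are placeholders rather than real data.

```python
from collections import Counter

def total_mobile_views(app_items, web_items):
    """Sum mobile app and mobile web views for each monthly timestamp."""
    totals = Counter()
    for item in list(app_items) + list(web_items):
        totals[item['timestamp']] += item['views']
    return dict(sorted(totals.items()))

app_items = [
    {'timestamp': '2015070100', 'views': 109624146},
    {'timestamp': '2015080100', 'views': 109669149},
]
# Placeholder mobile web values, not real traffic counts.
web_items = [
    {'timestamp': '2015070100', 'views': 1000},
    {'timestamp': '2015080100', 'views': 2000},
]

mobile_totals = total_mobile_views(app_items, web_items)
print(mobile_totals)
```

With the loaded responses, the call would be `total_mobile_views(pageviews_mobile_app['items'], pageviews_mobile_site['items'])`.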