# A1: Data Curation
This project collects and analyzes English Wikipedia page view data from December 2007 through 2021. 
It was created by Emily Linebarger (elineb@uw.edu) in October 2021, and is maintained in the GitHub repository https://github.com/kathrynline/data-512-a1. 

## Data Acquisition
Part one of the project queries two Wikipedia APIs for page view data. 
The Wikipedia legacy Pagecounts API contains desktop and mobile traffic data from December 2007 through July 2016. Its endpoint is 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'. 
The Wikipedia Pageviews API contains desktop, mobile web, and mobile app traffic data from July 2015 through the previous month. Its endpoint is 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'. 

Documentation for both APIs can be found at https://wikimedia.org/api/rest_v1/#/. 
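
As a minimal sketch of how an endpoint template becomes a request URL, the example below fills the Pageviews template with `str.format`. The parameter values, including the one-month date range, are illustrative assumptions chosen for this example, and no request is sent.

```python
# Endpoint template copied from the description above.
ENDPOINT_PAGEVIEWS = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
    "{project}/{access}/{agent}/{granularity}/{start}/{end}"
)

# Illustrative values only; the date range is an assumption for this example.
example_params = {
    "project": "en.wikipedia.org",
    "access": "desktop",
    "agent": "user",
    "granularity": "monthly",
    "start": "2015070100",
    "end": "2015080100",
}

# Each {placeholder} in the template is replaced by the matching dictionary value.
example_url = ENDPOINT_PAGEVIEWS.format(**example_params)
print(example_url)
```

Keeping the template and the parameters separate means the same template can be reused for every access type.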

## Notes:
For the Pageviews API, you can filter to organic user traffic by specifying agent=user in the request parameters. The legacy Pagecounts API doesn't have this feature. 
The two APIs overlap by about one year (July 2015 through July 2016). 
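
The overlap can be checked with a little arithmetic. This sketch assumes the date ranges stated above (Pageviews starting July 2015, legacy Pagecounts ending July 2016) and counts both end months; `month_number` is a hypothetical helper introduced only for this example.

```python
def month_number(year: int, month: int) -> int:
    """Map a (year, month) pair to a running month count."""
    return year * 12 + (month - 1)

# Date ranges as stated in the text above.
pageviews_first_month = (2015, 7)   # Pageviews API starts July 2015
pagecounts_last_month = (2016, 7)   # legacy Pagecounts API ends July 2016

# Add 1 so that both the first and the last month are counted.
overlap_months = (
    month_number(*pagecounts_last_month) - month_number(*pageviews_first_month) + 1
)
print(overlap_months)
```

Under these assumptions the two sources share 13 monthly data points.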

In [69]:
import json
import requests
import pandas as pd

In [70]:
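def api_call(endpoint, parameters):
    """Fill the endpoint template with `parameters` and return the parsed JSON response."""
    # `headers` is the global dict defined in the next cell; it must exist before the first call.
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()

    return response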
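
The processing code later in this notebook reads the `items` key of each response. A small guard can make a failed request visible at the point of the call. This is a sketch only: `require_items` is a hypothetical helper, and the shape of an error response (a dictionary without `items`) is an assumption.

```python
def require_items(response: dict) -> list:
    """Return response['items'], or raise if the key is missing."""
    if "items" not in response:
        # Assumed failure shape: an error body that carries no 'items' key.
        raise ValueError(f"Unexpected API response, keys: {sorted(response)}")
    return response["items"]

# Toy response used only to show the call; the values are made up.
ok = {"items": [{"timestamp": "2015070100", "views": 1}]}
print(len(require_items(ok)))
```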

In [71]:
headers = {
    'User-Agent': 'https://github.com/kathrynline',
    'From': 'elineb@uw.edu'
}

### Wikipedia legacy Pagecounts API extraction

In [72]:
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'

In [73]:
# Get desktop site data from December 1 2007 - July 31 2016. 
# To get through July 31, specify the end date as August 1 2016. 
desktop_params_legacy = {"project" : "en.wikipedia.org",
                 "access-site" : "desktop-site",
                 "granularity" : "monthly",
                 "start" : "2007120100",
                 "end" : "2016080100"
                    }

# Get mobile site data from December 1 2007 - July 31 2016. 
# To get through July 31, specify the end date as August 1 2016. 
mobile_params_legacy = {"project" : "en.wikipedia.org",
                 "access-site" : "mobile-site",
                 "granularity" : "monthly",
                 "start" : "2007120100",
                 "end" : "2016080100"
                    }
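
The comments above describe the end date as the first day of the month after the last month wanted. That rule can be sketched as a small function; `month_after` is a hypothetical helper, not part of the original notebook.

```python
def month_after(year: int, month: int) -> str:
    """Return the first hour of the month following (year, month) as YYYYMMDDHH."""
    if month == 12:
        # December rolls over into January of the next year.
        year, month = year + 1, 1
    else:
        month += 1
    return f"{year:04d}{month:02d}0100"

# Last wanted month is July 2016, so the end timestamp is August 1 2016.
print(month_after(2016, 7))
```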

In [74]:
legacy_desktop = api_call(endpoint_legacy, desktop_params_legacy)

with open('../0_data_raw/pagecounts_desktop-site_200712_202108.json', 'w') as outfile:
    json.dump(legacy_desktop, outfile)

In [75]:
legacy_mobile = api_call(endpoint_legacy, mobile_params_legacy)

with open('../0_data_raw/pagecounts_mobile-site_200712_202108.json', 'w') as outfile:
    json.dump(legacy_mobile, outfile)
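
The raw file names encode the API, the access type, and the first and last month of data. One way to keep the month range in the name in sync with the data is to derive it from the returned items. A sketch, assuming each item carries a `timestamp` string in YYYYMMDDHH form; `raw_file_name` and the toy items are hypothetical.

```python
def raw_file_name(api_name: str, access: str, items: list) -> str:
    """Build '<api>_<access>_<firstmonth>_<lastmonth>.json' from the returned items."""
    # The first six characters of each timestamp are the year and month (YYYYMM).
    months = sorted(item["timestamp"][:6] for item in items)
    return f"{api_name}_{access}_{months[0]}_{months[-1]}.json"

# Toy items for illustration; a real response contains one item per month.
toy_items = [{"timestamp": "2007120100"}, {"timestamp": "2016070100"}]
print(raw_file_name("pagecounts", "desktop-site", toy_items))
```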

### Wikipedia Pageviews API extraction

In [76]:
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

In [77]:
# Get desktop site data from July 2015 - September 2021.
# To get through September 30 2021, specify the end date as October 1 2021.
# Timestamps use the YYYYMMDDHH format.
desktop_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "desktop",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2015070100",
                    "end" : "2021100100"
                        }

# Get mobile website data from July 2015 - September 2021.
# To get through September 30 2021, specify the end date as October 1 2021.
# Timestamps use the YYYYMMDDHH format.
mobile_site_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "mobile-web",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2015070100",
                    "end" : "2021100100"
                        }

# Get mobile app data from July 2015 - September 2021.
# To get through September 30 2021, specify the end date as October 1 2021.
# Timestamps use the YYYYMMDDHH format.
mobile_app_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "mobile-app",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2015070100",
                    "end" : "2021100100"
                        }
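
The three parameter dictionaries differ only in their `access` value, so they could also be generated from one shared base. This is an alternative sketch, not the notebook's own approach; the names `common` and `pageviews_params` are hypothetical, and the 10-digit YYYYMMDDHH timestamps are an assumption.

```python
# Values shared by every Pageviews request in this project.
common = {
    "project": "en.wikipedia.org",
    "agent": "user",
    "granularity": "monthly",
    "start": "2015070100",  # assumed YYYYMMDDHH form
    "end": "2021100100",    # assumed YYYYMMDDHH form
}

# One parameter dictionary per access type, keyed by that access type.
pageviews_params = {
    access: {**common, "access": access}
    for access in ("desktop", "mobile-web", "mobile-app")
}
print(pageviews_params["mobile-web"]["access"])
```

A change to the date range then only has to be made in one place.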

In [78]:
pageviews_desktop = api_call(endpoint_pageviews, desktop_params_pageviews)

with open('../0_data_raw/pageviews_desktop_201507_202109.json', 'w') as outfile:
    json.dump(pageviews_desktop, outfile)

In [79]:
pageviews_mobile_site = api_call(endpoint_pageviews, mobile_site_params_pageviews)

with open('../0_data_raw/pageviews_mobile_site_201507_202109.json', 'w') as outfile:
    json.dump(pageviews_mobile_site, outfile)

In [80]:
pageviews_mobile_app = api_call(endpoint_pageviews, mobile_app_params_pageviews)

with open('../0_data_raw/pageviews_mobile_app_201507_202109.json', 'w') as outfile:
    json.dump(pageviews_mobile_app, outfile)

## Data Processing

In [81]:
# Read in raw data outputs, closing each file once it has been parsed
def load_json(path):
    with open(path) as infile:
        return json.load(infile)

pagecounts_desktop = load_json('../0_data_raw/pagecounts_desktop-site_200712_202108.json')
pagecounts_mobile = load_json('../0_data_raw/pagecounts_mobile-site_200712_202108.json')
pageviews_desktop = load_json('../0_data_raw/pageviews_desktop_201507_202109.json')
pageviews_mobile_site = load_json('../0_data_raw/pageviews_mobile_site_201507_202109.json')
pageviews_mobile_app = load_json('../0_data_raw/pageviews_mobile_app_201507_202109.json')

In [82]:
# Reduce each raw API result to its 'timestamp' and count ('count' or 'views') columns so the results can be combined into a single CSV. 
def process_data(raw: dict, count_label: str) -> pd.DataFrame:
    """Takes a dictionary of raw Wikipedia API results and returns a pandas DataFrame."""
    data = pd.DataFrame(raw['items'])
    if 'count' in data.columns: # Pagecounts API 
        data = data[['timestamp', 'count']]
    else: # Pageviews API
        data = data[['timestamp', 'views']]
    data.columns = ['timestamp', count_label]
    
    return data
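
To illustrate how the two response shapes end up with the same layout, the sketch below runs a compact restatement of `process_data` on two toy responses. The toy dictionaries are made-up stand-ins for real API output, and the restated function uses `rename` rather than assigning `columns` directly.

```python
import pandas as pd

def process_data(raw: dict, count_label: str) -> pd.DataFrame:
    """Keep the timestamp and the count column, renaming the count to count_label."""
    data = pd.DataFrame(raw["items"])
    # The legacy Pagecounts API reports 'count'; the Pageviews API reports 'views'.
    value_column = "count" if "count" in data.columns else "views"
    return data[["timestamp", value_column]].rename(columns={value_column: count_label})

# Toy responses for illustration only.
toy_pagecounts = {"items": [{"project": "en.wikipedia", "timestamp": "2007120100", "count": 100}]}
toy_pageviews = {"items": [{"project": "en.wikipedia", "timestamp": "2015070100", "views": 200}]}

legacy_frame = process_data(toy_pagecounts, "pagecount_desktop_views")
current_frame = process_data(toy_pageviews, "pageview_desktop_views")
print(list(legacy_frame.columns), list(current_frame.columns))
```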

In [83]:
# Process each dataframe
pagecounts_desktop = process_data(pagecounts_desktop, 'pagecount_desktop_views')
pagecounts_mobile = process_data(pagecounts_mobile, 'pagecount_mobile_views')
pageviews_desktop = process_data(pageviews_desktop, 'pageview_desktop_views')
pageviews_mobile_site = process_data(pageviews_mobile_site, 'pageviews_mobile_site')
pageviews_mobile_app = process_data(pageviews_mobile_app, 'pageviews_mobile_app')

In [84]:
intermediate_data = pagecounts_desktop.merge(pagecounts_mobile, how='outer', on='timestamp')
intermediate_data = intermediate_data.merge(pageviews_desktop, how = 'outer', on = 'timestamp')
intermediate_data = intermediate_data.merge(pageviews_mobile_site, how = 'outer', on = 'timestamp')
intermediate_data = intermediate_data.merge(pageviews_mobile_app, how = 'outer', on = 'timestamp')
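
The chain of outer merges can also be written as a fold over a list of frames. The sketch below shows this on three toy frames whose column names and values are invented for the example; months missing from a frame come out as NaN.

```python
from functools import reduce

import pandas as pd

# Toy frames with partly overlapping timestamps.
frames = [
    pd.DataFrame({"timestamp": ["2015070100", "2015080100"], "a": [1, 2]}),
    pd.DataFrame({"timestamp": ["2015080100", "2015090100"], "b": [3, 4]}),
    pd.DataFrame({"timestamp": ["2015090100"], "c": [5]}),
]

# Outer-merge the frames pairwise on 'timestamp', keeping every month seen in any frame.
merged = reduce(
    lambda left, right: left.merge(right, how="outer", on="timestamp"),
    frames,
)
print(merged.shape)
```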

In [85]:
intermediate_data

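| | timestamp | pagecount_desktop_views | pagecount_mobile_views | pageview_desktop_views | pageviews_mobile_site | pageviews_mobile_app |
|---|---|---|---|---|---|---|
| 0 | 2007120100 | 2.998332e+09 | NaN | NaN | NaN | NaN |
| 1 | 2008010100 | 4.930903e+09 | NaN | NaN | NaN | NaN |
| 2 | 2008020100 | 4.818394e+09 | NaN | NaN | NaN | NaN |
| 3 | 2008030100 | 4.955406e+09 | NaN | NaN | NaN | NaN |
| 4 | 2008040100 | 5.159162e+09 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 161 | 2021050100 | NaN | NaN | 2.824416e+09 | 4.810094e+09 | 166485079.0 |
| 162 | 2021060100 | NaN | NaN | 2.505971e+09 | 4.433806e+09 | 150704624.0 |
| 163 | 2021070100 | NaN | NaN | 2.765584e+09 | 4.617448e+09 | 161461155.0 |
| 164 | 2021080100 | NaN | NaN | 2.763414e+09 | 4.570813e+09 | 161381193.0 |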


In [86]:
# Fill any NAs with zero; a zero indicates that the collection method was not available for that date. 
intermediate_data = intermediate_data.fillna(0)

In [87]:
# Combine pageviews mobile site and mobile app traffic into a single indicator for monthly mobile traffic. 
intermediate_data['pageview_mobile_views'] = intermediate_data['pageviews_mobile_site'] + intermediate_data['pageviews_mobile_app']
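
For comparison, pandas can also treat a missing value as zero at the moment of addition, via `Series.add` with `fill_value=0`. This is an alternative sketch on toy data, not the approach used in this notebook, which fills the whole frame with zeros beforehand.

```python
import pandas as pd

# Toy monthly counts; None marks a month where one source has no data.
site = pd.Series([10.0, None, 30.0])
app = pd.Series([1.0, 2.0, None])

# A value missing on one side is treated as 0 for that row.
mobile_total = site.add(app, fill_value=0)
print(mobile_total.tolist())
```

With this form, a row that is missing in both series stays NaN instead of becoming 0.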

In [88]:
# Split 'timestamp' column into 'month' and 'year' columns. 
intermediate_data['year'] = intermediate_data['timestamp'].str[:4]
intermediate_data['month'] = intermediate_data['timestamp'].str[4:6]
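
The slicing above relies on the timestamps being strings in YYYYMMDDHH form. The sketch below shows the same split on two toy timestamps and, as an alternative that yields integers, parsing with `pd.to_datetime`. The variable names are hypothetical.

```python
import pandas as pd

# Two toy timestamps in the YYYYMMDDHH form returned by the APIs.
timestamps = pd.Series(["2007120100", "2008010100"])

# String slicing keeps year and month as zero-padded strings.
years = timestamps.str[:4]
months = timestamps.str[4:6]

# Parsing to datetimes gives integer year and month components instead.
parsed = pd.to_datetime(timestamps, format="%Y%m%d%H")
print(years.tolist(), months.tolist(), parsed.dt.month.tolist())
```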

In [89]:
# Create aggregate views by type of API
# NAs were already filled with 0 above, so each sum is defined even when one access method is not available. 
intermediate_data['pagecount_all_views'] = intermediate_data['pagecount_desktop_views'] + intermediate_data['pagecount_mobile_views']
intermediate_data['pageview_all_views'] = intermediate_data['pageview_desktop_views'] + intermediate_data['pageview_mobile_views']

In [90]:
# Subset to final columns
intermediate_data = intermediate_data[['year', 'month', 'pagecount_all_views', 
                                       'pagecount_desktop_views', 'pagecount_mobile_views', 
                                      'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views']]

In [91]:
intermediate_data.head()

| | year | month | pagecount_all_views | pagecount_desktop_views | pagecount_mobile_views | pageview_all_views | pageview_desktop_views | pageview_mobile_views |
|---|---|---|---|---|---|---|---|---|
| 0 | 2007 | 12 | 2998332000.0 | 2998332000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2008 | 1 | 4930903000.0 | 4930903000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2008 | 2 | 4818394000.0 | 4818394000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2008 | 3 | 4955406000.0 | 4955406000.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2008 | 4 | 5159162000.0 | 5159162000.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [92]:
# Save these outputs without the row index, so the CSV contains only the columns selected above
intermediate_data.to_csv('../1_data_clean/en-wikipedia_traffic_200712-202108.csv', index=False)
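
By default `to_csv` also writes the row index as an unnamed first column. The sketch below compares the header row with and without the index on a one-row toy frame, writing to in-memory buffers so no file is created; the frame contents are made up.

```python
import io

import pandas as pd

# One-row toy frame with a few of the final column names.
frame = pd.DataFrame({"year": ["2007"], "month": ["12"], "pagecount_all_views": [100.0]})

with_index = io.StringIO()
frame.to_csv(with_index)  # default: the row index is written as an unnamed first column

without_index = io.StringIO()
frame.to_csv(without_index, index=False)  # only the named columns are written

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)
```

When a file written with the index is read back with `pd.read_csv`, the extra column is named `Unnamed: 0`.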

## Analysis 