# Goal
Construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 2008 through August 2021.

# Data Acquisition

There are two sources of Wikipedia traffic data:
- Pagecounts (the legacy API)
- Pageviews

The main difference between the two is that the Pageviews data lets us filter out automated (bot) traffic, whereas the Pagecounts data does not.

We will collect traffic data from 2008 through 2021 and write it to five files in `data_raw/`:
- pagecounts_desktop-site_200801-201608.json
- pagecounts_mobile-site_200801-201608.json
- pageviews_desktop_201507-202108.json
- pageviews_mobile-app_201507-202108.json
- pageviews_mobile-web_201507-202108.json

In [41]:
import json
import requests
import pandas as pd

In [33]:
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

In [85]:

# API call parameters

# parameters for getting aggregated legacy pagecount data
# see: https://wikimedia.org/api/rest_v1/#!/Legacy_data/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end
pagecounts_desktop_site = {
    "project": "en.wikipedia.org",
    "access-site": "desktop-site",
    "granularity": "monthly",
    "start": "2008010100",
    # for end, use the 1st day of the month following the final month of data
    "end": "2016080100",
}

pagecounts_mobile_site = {
    "project": "en.wikipedia.org",
    "access-site": "mobile-site",
    "granularity": "monthly",
    "start": "2008010100",
    # for end, use the 1st day of the month following the final month of data
    "end": "2016080100",
}

# parameters for getting aggregated current standard pageview data
# see: https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end
pageviews_desktop = {
    "project": "en.wikipedia.org",
    "access": "desktop",
    "agent": "user",
    "granularity": "monthly",
    "start": "2015070100",
    # for end, use the 1st day of the month following the final month of data
    "end": "2021090100",
}

pageviews_mobile_app = {
    "project": "en.wikipedia.org",
    "access": "mobile-app",
    "agent": "user",
    "granularity": "monthly",
    "start": "2015070100",
    # for end, use the 1st day of the month following the final month of data
    "end": "2021090100",
}

pageviews_mobile_web = {
    "project": "en.wikipedia.org",
    "access": "mobile-web",
    "agent": "user",
    "granularity": "monthly",
    "start": "2015070100",
    # for end, use the 1st day of the month following the final month of data
    "end": "2021090100",
}
# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/marcm97',
    'From': 'marcm5@uw.edu'
}

In [86]:
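def api_call(endpoint, parameters):
    """Fill the endpoint template with the parameters and return the parsed JSON response."""
    call = requests.get(endpoint.format(**parameters), headers=headers)
    call.raise_for_status()  # fail loudly on HTTP errors instead of saving an error payload
    response = call.json()

    return response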
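
To see what URL `api_call` requests, the template can be expanded without touching the network. The sketch below repeats the legacy endpoint string and uses a hypothetical dictionary name, `example_params`, with the same keys as the parameter dictionaries defined earlier. Note that `str.format` looks up the hyphenated key `access-site` as-is when the values are passed with `**`.

```python
# Sketch: expand the legacy endpoint template without making a request.
endpoint_legacy = (
    'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/'
    '{project}/{access-site}/{granularity}/{start}/{end}'
)

# Hypothetical name; the keys mirror the pagecounts parameter dictionaries.
example_params = {
    "project": "en.wikipedia.org",
    "access-site": "desktop-site",
    "granularity": "monthly",
    "start": "2008010100",
    "end": "2016080100",
}

# Each {placeholder} is replaced by the value stored under the same key.
example_url = endpoint_legacy.format(**example_params)
print(example_url)
```

Printing the expanded URL before running the real calls is a cheap way to catch a misspelled key or a malformed timestamp.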

In [87]:
#pagecounts API calls
pagecounts_desktop_site_json = api_call(endpoint_legacy, pagecounts_desktop_site)
pagecounts_mobile_site_json = api_call(endpoint_legacy, pagecounts_mobile_site)

#pageview API calls
pageviews_desktop_json = api_call(endpoint_pageviews, pageviews_desktop)
pageviews_mobile_app_json = api_call(endpoint_pageviews, pageviews_mobile_app)
pageviews_mobile_web_json = api_call(endpoint_pageviews, pageviews_mobile_web)

In [88]:
# writing json data to data_raw/
with open('data_raw/pagecounts_desktop-site_200801-201608.json', 'w', encoding='utf-8') as f:
    json.dump(pagecounts_desktop_site_json, f, ensure_ascii=False, indent=4)
    
with open('data_raw/pagecounts_mobile-site_200801-201608.json', 'w', encoding='utf-8') as f:
    json.dump(pagecounts_mobile_site_json, f, ensure_ascii=False, indent=4)
    
with open('data_raw/pageviews_desktop_201507-202108.json', 'w', encoding='utf-8') as f:
    json.dump(pageviews_desktop_json, f, ensure_ascii=False, indent=4)

with open('data_raw/pageviews_mobile-app_201507-202108.json', 'w', encoding='utf-8') as f:
    json.dump(pageviews_mobile_app_json, f, ensure_ascii=False, indent=4)
    
with open('data_raw/pageviews_mobile-web_201507-202108.json', 'w', encoding='utf-8') as f:
    json.dump(pageviews_mobile_web_json, f, ensure_ascii=False, indent=4)
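
The five `with` blocks above repeat the same steps, so they could be folded into one small helper. The following is a minimal sketch, not the notebook's actual code: `save_json` is a hypothetical helper, the payload holds made-up values, and a temporary directory stands in for `data_raw/` so that the round trip can be checked safely.

```python
import json
import os
import tempfile


def save_json(payload, path):
    """Hypothetical helper: write one API response as indented UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(payload, f, ensure_ascii=False, indent=4)


# Made-up payload standing in for a real API response.
sample_payload = {"items": [{"timestamp": "2008010100", "count": 123}]}

# Write to a temporary directory, then read the file back to verify it.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    save_json(sample_payload, sample_path)
    with open(sample_path, encoding='utf-8') as f:
        round_tripped = json.load(f)

print(round_tripped == sample_payload)
```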

# Data Processing

In [103]:
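def create_df(path):
    """Load one raw JSON file and return a DataFrame with Year, Month, and count columns."""
    with open(path, encoding='utf-8') as f:
        json_data = json.load(f)
    data = pd.json_normalize(json_data["items"])
    # timestamps have the form YYYYMMDDHH
    data["Year"] = data["timestamp"].str[:4]    # characters 1-4 are the year
    data["Month"] = data["timestamp"].str[4:6]  # characters 5-6 are the month
    # the Pageviews API reports "views"; the legacy Pagecounts API reports "count"
    if "views" in data.columns:
        data["count"] = data["views"]
    data["count"] = pd.to_numeric(data["count"])
    return data[["Year", "Month", "count"]]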
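
To illustrate what `create_df` produces, the sketch below applies the same steps to two tiny in-memory payloads instead of files. `normalize_items` is a hypothetical stand-in for `create_df`, and the numbers are made up. Only the `items`, `timestamp`, `count`, and `views` field names are taken from the code above.

```python
import pandas as pd


def normalize_items(json_data):
    """Hypothetical in-memory variant of create_df: same steps, no file read."""
    data = pd.json_normalize(json_data["items"])
    data["Year"] = data["timestamp"].str[:4]
    data["Month"] = data["timestamp"].str[4:6]
    if "views" in data.columns:
        data["count"] = data["views"]
    data["count"] = pd.to_numeric(data["count"])
    return data[["Year", "Month", "count"]]


# Made-up values: one legacy-style item and one pageviews-style item.
legacy_like = {"items": [{"timestamp": "2008010100", "count": 100}]}
pageviews_like = {"items": [{"timestamp": "2015070100", "views": 200}]}

legacy_df = normalize_items(legacy_like)
pageviews_df = normalize_items(pageviews_like)
print(legacy_df)
print(pageviews_df)
```

Both results share the same three columns, which is what lets the later merges join them on `Year` and `Month`.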
    

In [104]:
pagecounts_desktop = create_df("data_raw/pagecounts_desktop-site_200801-201608.json")
pagecounts_mobile = create_df("data_raw/pagecounts_mobile-site_200801-201608.json")
pageviews_desktop = create_df("data_raw/pageviews_desktop_201507-202108.json")
pageviews_mobile_app = create_df("data_raw/pageviews_mobile-app_201507-202108.json")
pageviews_mobile_web = create_df("data_raw/pageviews_mobile-web_201507-202108.json")

- For data collected from the Pageviews API, combine the monthly values for mobile-app and mobile-web to create a total mobile traffic count for each month.


In [122]:
pageviews_mobile = pageviews_mobile_app.merge(pageviews_mobile_web,
                          how = "outer",
                           on =["Year","Month"],
                           suffixes =["_mobile_app","_mobile_web"]
                          )
pageviews_mobile["count"] = pageviews_mobile["count_mobile_app"]+pageviews_mobile["count_mobile_web"]
pageviews_mobile = pageviews_mobile[["Year","Month","count"]]
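
One caveat when adding columns after an outer merge: a month present in only one input becomes `NaN` on the other side, and `NaN + x` is `NaN`. The toy example below uses made-up numbers to show this, alongside a NaN-tolerant alternative using `DataFrame.sum(axis=1)`. Whether a one-sided month should count as a partial total or as missing is a judgment call that this sketch does not settle.

```python
import pandas as pd

# Made-up data: the app series is missing August.
app = pd.DataFrame({"Year": ["2015"], "Month": ["07"], "count": [10]})
web = pd.DataFrame({"Year": ["2015", "2015"], "Month": ["07", "08"], "count": [100, 200]})

merged = app.merge(
    web,
    how="outer",
    on=["Year", "Month"],
    suffixes=("_mobile_app", "_mobile_web"),
)

# Plain addition propagates NaN for the month that only one side has.
plain_sum = merged["count_mobile_app"] + merged["count_mobile_web"]

# sum(axis=1) skips NaN by default, so the one-sided month keeps its value.
nan_safe_sum = merged[["count_mobile_app", "count_mobile_web"]].sum(axis=1)

print(merged.assign(plain_sum=plain_sum, nan_safe_sum=nan_safe_sum))
```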

In [126]:
# build one unified table by outer-joining all sources on Year and Month
pagecounts = pd.merge(pagecounts_desktop,
         pagecounts_mobile,
         how = "outer",
         on =["Year","Month"],
         suffixes =["_pagecounts_desktop","_pagecounts_mobile_web"]
        )

pageviews = pd.merge(pageviews_desktop,
         pageviews_mobile,
         how = "outer",
         on =["Year","Month"],
         suffixes =["_pageview_desktop","_pageview_mobile_web"]
        )

data = pd.merge(pagecounts,
                pageviews,
                how = "outer",
                on =["Year","Month"]
               )

In [133]:
# months outside a source's date range have no data; replace those NaN values with 0
data = data.fillna(0)
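
The zeros come from the two APIs covering different periods: the pagecounts requests end in 2016 and the pageviews requests start in 2015, so an outer join leaves gaps outside the overlap. The sketch below reproduces this with made-up numbers and hypothetical column names (`count_pagecounts`, `count_pageviews`).

```python
import pandas as pd

# Made-up data: the two sources overlap only in 2016-07.
counts = pd.DataFrame({
    "Year": ["2015", "2016"],
    "Month": ["06", "07"],
    "count_pagecounts": [5, 6],
})
views = pd.DataFrame({
    "Year": ["2016", "2016"],
    "Month": ["07", "08"],
    "count_pageviews": [7, 8],
})

# The outer join keeps every month from either source; fillna(0) fills the gaps.
combined = pd.merge(counts, views, how="outer", on=["Year", "Month"]).fillna(0)
print(combined)
```

Because the gaps pass through `NaN`, the filled columns end up as floats. If integer counts are wanted in the output, casting the count columns with `.astype("int64")` after `fillna` is one option.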

In [135]:
data["pagecount_all_views"] = data["count_pagecounts_desktop"] + data["count_pagecounts_mobile_web"]
data["pageview_all_views"] = data["count_pageview_desktop"] + data["count_pageview_mobile_web"]

In [139]:
data = data.rename(columns = {
    "count_pagecounts_desktop": "pagecount_desktop_views",
    "count_pagecounts_mobile_web": "pagecount_mobile_views",
    "count_pageview_desktop": "pageview_desktop_views",
    "count_pageview_mobile_web":"pageview_mobile_views"
})
data = data[["Year",
      "Month",
      "pagecount_all_views",
      "pagecount_desktop_views",
      "pagecount_mobile_views",
      "pageview_all_views",
      "pageview_desktop_views",
      "pageview_mobile_views"
     ]]

In [140]:
# write out to csv; index=False omits the unnamed row-index column
data.to_csv("data_clean/en-wikipedia_traffic_200712-202108.csv", index=False)
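
Two details are worth knowing when the CSV is read back. `DataFrame.to_csv` writes the row index as an unnamed first column unless `index=False` is passed, and `pd.read_csv` will parse a month such as `01` as the integer `1` unless the column is read as a string. The toy frame below (made-up values, in-memory only) illustrates both.

```python
import io

import pandas as pd

# Made-up one-row frame with the same kind of columns as the final table.
toy = pd.DataFrame({"Year": ["2008"], "Month": ["01"], "pagecount_all_views": [100]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading Year and Month as strings preserves the leading zero in "01".
reread = pd.read_csv(io.StringIO(without_index), dtype={"Year": str, "Month": str})
print(reread)
```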