Course Human-Centered Data Science ([HCDS](https://www.mi.fu-berlin.de/en/inf/groups/hcc/teaching/winter_term_2020_21/course_human_centered_data_science.html)) - Winter Term 2020/21 - [HCC](https://www.mi.fu-berlin.de/en/inf/groups/hcc/index.html) | [Freie Universität Berlin](https://www.fu-berlin.de/)
***
# A2 - Reproducibility Workflow


Your assignment is to create a graph that looks a lot like the one below, starting from scratch and following best practices for reproducible research.

![wikipedia_pageViews_2008-2020.png](img/wikipedia_pageViews_2008-2020.png)

## Before you start
1. Read all instructions carefully before you begin.
1. Read all API documentation carefully before you begin.
1. Experiment with queries in the sandbox of the technical documentation for each API to familiarize yourself with the schema and the data.
1. Ask questions if you are unsure about anything!
1. When documenting your project, please keep the following questions in your mind:
   * _If I found this GitHub repository, and wanted to fully reproduce the analysis, what information would I want?_
   * _What information would I need?_

## Step 1️⃣: Data acquisition
In order to measure Wikipedia traffic from January 2008 until October 2020, you will need to collect data from two different APIs:

1. The **Legacy Pagecounts API** ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pagecounts_data_(legacy)/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end)) provides access to desktop and mobile traffic data from December 2007 through July 2016.
1. The **Pageviews API** ([documentation](https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews), [endpoint](https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end)) provides access to desktop, mobile web, and mobile app traffic data from July 2015 through last month.

For each API, you need to collect data for all months where data is available and then save the raw results into five (3+2) separate `JSON` files (one file per API query type) before continuing to step 2.

To get you started, you can use the following **sample code for API calls**:

In [1]:
# Source: https://public.paws.wmcloud.org/User:Jtmorgan/data512_a1_example.ipynb?format=raw

import json
import requests

endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

# SAMPLE parameters for getting aggregated legacy view data 
# see: https://wikimedia.org/api/rest_v1/#!/Legacy_data/get_metrics_legacy_pagecounts_aggregate_project_access_site_granularity_start_end
example_params_legacy = {"project" : "en.wikipedia.org",
                 "access-site" : "desktop-site",
                 "granularity" : "monthly",
                 "start" : "2001010100",
                # for end use 1st day of month following final month of data
                 "end" : "2018100100"
                    }

# SAMPLE parameters for getting aggregated current standard pageview data
# see: https://wikimedia.org/api/rest_v1/#!/Pageviews_data/get_metrics_pageviews_aggregate_project_access_agent_granularity_start_end
example_params_pageviews = {"project" : "en.wikipedia.org",
                    "access" : "desktop",
                    "agent" : "user",
                    "granularity" : "monthly",
                    "start" : "2001010100",
                    # for end use 1st day of month following final month of data
                    "end" : '2018101000'
                        }

# Customize these with your own information
headers = {
    'User-Agent': 'https://github.com/yourusername',
    'From': 'youremail@fu-berlin.de'
}

def api_call(endpoint,parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

In [2]:
example_monthly_pageviews = api_call(endpoint_pageviews, example_params_pageviews)
example_monthly_pageviews

{'items': [{'project': 'en.wikipedia',
   'access': 'desktop',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015070100',
   'views': 4376666686},
  {'project': 'en.wikipedia',
   'access': 'desktop',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015080100',
   'views': 4332482183},
  {'project': 'en.wikipedia',
   'access': 'desktop',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015090100',
   'views': 4485491704},
  {'project': 'en.wikipedia',
   'access': 'desktop',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015100100',
   'views': 4477532755},
  {'project': 'en.wikipedia',
   'access': 'desktop',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015110100',
   'views': 4287720220},
  {'project': 'en.wikipedia',
   'access': 'desktop',
   'agent': 'user',
   'granularity': 'monthly',
   'timestamp': '2015120100',
   'views': 4100012037},
  {'project': 'en.wikipedia',
   'access': 

In [3]:
example_monthly_legacy = api_call(endpoint_legacy, example_params_legacy)
example_monthly_legacy

{'items': [{'project': 'en.wikipedia',
   'access-site': 'desktop-site',
   'granularity': 'monthly',
   'timestamp': '2007120100',
   'count': 2998331524},
  {'project': 'en.wikipedia',
   'access-site': 'desktop-site',
   'granularity': 'monthly',
   'timestamp': '2008010100',
   'count': 4930902570},
  {'project': 'en.wikipedia',
   'access-site': 'desktop-site',
   'granularity': 'monthly',
   'timestamp': '2008020100',
   'count': 4818393763},
  {'project': 'en.wikipedia',
   'access-site': 'desktop-site',
   'granularity': 'monthly',
   'timestamp': '2008030100',
   'count': 4955405809},
  {'project': 'en.wikipedia',
   'access-site': 'desktop-site',
   'granularity': 'monthly',
   'timestamp': '2008040100',
   'count': 5159162183},
  {'project': 'en.wikipedia',
   'access-site': 'desktop-site',
   'granularity': 'monthly',
   'timestamp': '2008050100',
   'count': 5584691092},
  {'project': 'en.wikipedia',
   'access-site': 'desktop-site',
   'granularity': 'monthly',
   'timest

Your `JSON`-formatted source data file must contain the complete and un-edited output of your API queries. The naming convention for the source data files is: `apiname_accesstype_firstmonth-lastmonth.json`. For example, your filename for monthly page views on desktop should be: `pagecounts_desktop-site_200712-202010.json`
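
As a minimal sketch of the naming convention above, the first and last month can be read from the response itself instead of being typed by hand. `build_filename` is a hypothetical helper (not part of the assignment), and the sample items are a hand-written excerpt in the shape of the output shown earlier.

```python
def build_filename(api_name, access_type, items):
    """Derive `apiname_accesstype_firstmonth-lastmonth.json` from API items.

    `items` is the list stored under the 'items' key of an API response;
    each entry carries a 'timestamp' string in YYYYMMDDHH format.
    """
    # The first six characters of a timestamp are YYYYMM.
    months = sorted(item["timestamp"][:6] for item in items)
    return f"{api_name}_{access_type}_{months[0]}-{months[-1]}.json"


# Tiny hand-written sample mirroring the response shape shown above.
sample_items = [
    {"timestamp": "2015070100", "views": 4376666686},
    {"timestamp": "2015080100", "views": 4332482183},
]
print(build_filename("pageviews", "desktop", sample_items))
# prints: pageviews_desktop_201507-201508.json
```

Deriving the months from the data keeps the filename honest if the API returns fewer months than requested.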

### Important notes❗
1. As much as possible, we're interested in *organic* (user) traffic, as opposed to traffic by web crawlers or spiders. The Pageview API (but not the Pagecount API) allows you to filter by `agent=user`. You should do that.
1. There is about one year of overlapping traffic data between the two APIs. You need to gather, and later graph, data from both APIs for this period of time.
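
To make the overlap concrete, the following sketch lists the months covered by both date ranges stated above (December 2007 through July 2016, and July 2015 through October 2020). `month_range` is a hypothetical helper, and the ranges are the nominal ones from the text; the months actually returned by each API may differ.

```python
def month_range(start, end):
    """Return all (year, month) tuples from `start` through `end`, inclusive."""
    months = []
    year, month = start
    while (year, month) <= end:
        months.append((year, month))
        # Advance by one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


legacy_months = month_range((2007, 12), (2016, 7))
pageview_months = month_range((2015, 7), (2020, 10))

# Months that appear in both ranges.
overlap = sorted(set(legacy_months) & set(pageview_months))
print(overlap[0], overlap[-1], len(overlap))
# prints: (2015, 7) (2016, 7) 13
```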

### Implementation

First, we set the endpoint URLs and the request headers (customize the headers with your own information if you want to query the API yourself).

In [4]:
endpoint_legacy = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'
endpoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {
    'User-Agent': 'https://github.com/marisanest',
    'From': 'marisa.f.nest@fu-berlin.de'
}

Next, we define two helper functions: `api_call`, which performs the API call and returns the response parsed from JSON, and `save_json`, which writes the result to a persistent file with the given filename.

In [5]:
def api_call(endpoint, parameters):
    call = requests.get(endpoint.format(**parameters), headers=headers)
    response = call.json()
    
    return response

def save_json(data, filename):
    with open(f"../data_raw/{filename}", 'w') as f:
        json.dump(data, f)
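
`save_json` assumes that the `../data_raw/` folder already exists. As a hedged sketch of one way to make the step more reproducible on a fresh clone, the variant below creates the target folder first and then checks that the file round-trips unchanged. `save_json_safe` and its `directory` parameter are illustrative names, not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path


def save_json_safe(data, filename, directory):
    """Write `data` as JSON to `directory/filename`, creating the directory."""
    target_dir = Path(directory)
    # Create the folder (and any parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    with open(target, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return target


# Round-trip check in a temporary folder so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"items": [{"timestamp": "2015070100", "views": 4376666686}]}
    path = save_json_safe(sample, "sample.json", Path(tmp) / "data_raw")
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)

print(restored == sample)
# prints: True
```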

To get all relevant data, we need to make five different API calls:
1. Legacy Pagecounts API: monthly desktop traffic data from December 2007 through July 2016
2. Legacy Pagecounts API: monthly mobile traffic data from December 2007 through July 2016
3. Pageviews API: monthly desktop traffic data from July 2015 through last month
4. Pageviews API: monthly mobile-web user traffic data from July 2015 through last month
5. Pageviews API: monthly mobile-app user traffic data from July 2015 through last month
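
The five calls differ only in a few parameters, so they can also be described as data. The sketch below is one possible arrangement, not the approach used in this notebook: `QUERIES` and `build_params` are hypothetical names, and the timestamps are copied from the individual calls in this notebook.

```python
# Hypothetical table of the five queries: (API name, access type, start, end).
# Timestamps are copied from the calls in this notebook.
QUERIES = [
    ("pagecounts", "desktop-site", "2007120100", "2016080100"),
    ("pagecounts", "mobile-site", "2007120100", "2016080100"),
    ("pageviews", "desktop", "2015070100", "2020111000"),
    ("pageviews", "mobile-web", "2015070100", "2020111000"),
    ("pageviews", "mobile-app", "2015070100", "2020111000"),
]


def build_params(api_name, access, start, end):
    """Return the URL parameters for one query as a dict."""
    params = {
        "project": "en.wikipedia.org",
        "granularity": "monthly",
        "start": start,
        "end": end,
    }
    if api_name == "pagecounts":
        # The legacy endpoint names its access parameter 'access-site'.
        params["access-site"] = access
    else:
        # Only the Pageviews API supports filtering by agent.
        params["access"] = access
        params["agent"] = "user"
    return params


all_params = [build_params(*query) for query in QUERIES]
for params in all_params:
    print(params)
```

Each resulting dict has the shape that `api_call` expects as its `parameters` argument.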

**Legacy Pagecounts API: monthly desktop traffic data from December 2007 through July 2016**

We fetch the data with `api_call()`, using the API URL `endpoint_legacy` and the parameters below, and save the parsed JSON response to the corresponding file with `save_json()`.

In [6]:
save_json(
    api_call(
        endpoint_legacy, 
        {
            "project" : "en.wikipedia.org",
            "access-site" : "desktop-site",
            "granularity" : "monthly",
            "start" : "2007120100",
            "end" : "2016080100"
        }
    ), 
    'pagecounts_desktop-site_200712-201607.json'
)

**Legacy Pagecounts API: monthly mobile traffic data from December 2007 through July 2016**

We fetch the data with `api_call()`, using the API URL `endpoint_legacy` and the parameters below, and save the parsed JSON response to the corresponding file with `save_json()`.

In [7]:
save_json(
    api_call(
        endpoint_legacy, 
        {
            "project" : "en.wikipedia.org",
            "access-site" : "mobile-site",
            "granularity" : "monthly",
            "start" : "2007120100",
            "end" : "2016080100"
        }
    ), 
    'pagecounts_mobile-site_200712-201607.json'
)

**Pageviews API: monthly desktop traffic data from July 2015 through last month**

We fetch the data with `api_call()`, using the API URL `endpoint_pageviews` and the parameters below, and save the parsed JSON response to the corresponding file with `save_json()`.

In [8]:
save_json(
    api_call(
        endpoint_pageviews, 
        {
            "project" : "en.wikipedia.org",
            "access" : "desktop",
            "agent" : "user",
            "granularity" : "monthly",
            "start" : "2015070100",
            "end" : '2020111000'
        }
    ), 
    'pageviews_desktop_201507-202010.json'
)

**Pageviews API: monthly mobile-web user traffic data from July 2015 through last month**

We fetch the data with `api_call()`, using the API URL `endpoint_pageviews` and the parameters below, and save the parsed JSON response to the corresponding file with `save_json()`.

In [9]:
save_json(
    api_call(
        endpoint_pageviews, 
        {
            "project" : "en.wikipedia.org",
            "access" : "mobile-web",
            "agent" : "user",
            "granularity" : "monthly",
            "start" : "2015070100",
            "end" : '2020111000'
        }
    ), 
    'pageviews_mobile-web_201507-202010.json'
)

**Pageviews API: monthly mobile-app user traffic data from July 2015 through last month**

We fetch the data with `api_call()`, using the API URL `endpoint_pageviews` and the parameters below, and save the parsed JSON response to the corresponding file with `save_json()`.

In [10]:
save_json(
    api_call(
        endpoint_pageviews, 
        {
            "project" : "en.wikipedia.org",
            "access" : "mobile-app",
            "agent" : "user",
            "granularity" : "monthly",
            "start" : "2015070100",
            "end" : '2020111000'
        }
    ), 
    'pageviews_mobile-app_201507-202010.json'
)

## Step 2: Data processing

You will need to perform a series of processing steps on these data files in order to prepare them for analysis. These steps must be followed exactly in order to prepare the data for analysis. At the end of this step, you will have a single `CSV`-formatted data file `en-wikipedia_traffic_200712-202010.csv` that can be used in your analysis (step 3) with no significant additional processing.

* For data collected from the Pageviews API, combine the monthly values for `mobile-app` and `mobile-web` to create a total mobile traffic count for each month.
* For all data, separate the value of `timestamp` into four-digit year (`YYYY`) and two-digit month (`MM`) and discard values for day and hour (`DDHH`).
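
The timestamp rule above amounts to plain string slicing. A minimal sketch, assuming every timestamp is a ten-digit `YYYYMMDDHH` string as in the sample output; `split_timestamp` is a hypothetical helper name.

```python
def split_timestamp(timestamp):
    """Split a YYYYMMDDHH string into (YYYY, MM), discarding DDHH."""
    if len(timestamp) != 10 or not timestamp.isdigit():
        raise ValueError(f"unexpected timestamp format: {timestamp!r}")
    # Characters 0-3 are the year, characters 4-5 the month.
    return timestamp[:4], timestamp[4:6]


print(split_timestamp("2015070100"))
# prints: ('2015', '07')
```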

Combine all data into a single CSV file with the following headers:

| year | month |pagecount_all_views|pagecount_desktop_views|pagecount_mobile_views|pageview_all_views|pageview_desktop_views|pageview_mobile_views|
|------| ------|-------------------|-----------------------|----------------------|------------------|----------------------|---------------------|
| YYYY | MM    |num_views          |num_views              |num_views             |num_views         |num_views             |num_views            | 

Import required libraries for data processing.

In [11]:
import pandas as pd

Again, before we start to process the data, we define two helper functions: `load_json`, which loads the data from a specific file into a data frame, and `process_columns`, which processes the columns of the data frames created from the JSON files.

In [12]:
def load_json(filename):
    with open(f"../data_raw/{filename}", 'r') as f:
        return pd.json_normalize(json.load(f), ['items'])

def process_columns(df, views_column_name):
    # remember the original column names of the data frame
    columns = df.columns.values.tolist()
    
    # add two new columns holding the year (YYYY) and month (MM) of each data point
    df['year'] = df['timestamp'].str[0:4].astype(int)
    df['month'] = df['timestamp'].str[4:6].astype(int)
    
    # rename the last original column, 'count' (legacy) or 'views' (pageviews), to the given name
    df.rename(columns={columns.pop(): views_column_name}, inplace=True)
    
    # drop the remaining original columns and return the resulting data frame
    return df.drop(columns=columns)

Now we can load all json files with `load_json()` and process the resulting data frames with `process_columns`. 

In [13]:
desktop_legacy_df = process_columns(
    load_json('pagecounts_desktop-site_200712-201607.json'), 
    'pagecount_desktop_views'
)

mobile_legacy_df = process_columns(
    load_json('pagecounts_mobile-site_200712-201607.json'), 
    'pagecount_mobile_views'
)

desktop_pageviews_df = process_columns(
    load_json('pageviews_desktop_201507-202010.json'), 
    'pageview_desktop_views'
)

mobile_web_pageviews_df = process_columns(
    load_json('pageviews_mobile-web_201507-202010.json'), 
    'pageview_mobile_views'
)

mobile_app_pageviews_df = process_columns(
    load_json('pageviews_mobile-app_201507-202010.json'), 
    'pageview_mobile_views'
)

Afterwards we need to merge the mobile-web and mobile-app data frames. This can be done as follows:

In [14]:
# merge the two data frames on year and month
mobile_pageviews_df = pd.merge(mobile_web_pageviews_df, mobile_app_pageviews_df, on=['year', 'month'], how='outer')
# add both view counts and store the result in a new column
mobile_pageviews_df['pageview_mobile_views'] = mobile_pageviews_df['pageview_mobile_views_x'] + mobile_pageviews_df['pageview_mobile_views_y']
# drop the columns that are no longer needed
mobile_pageviews_df = mobile_pageviews_df.drop(columns=['pageview_mobile_views_x', 'pageview_mobile_views_y'])
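
One caveat with an outer merge followed by `+`: if a month is present in only one of the two frames, the sum for that month is NaN rather than the value that exists. The sketch below illustrates this with invented toy numbers (not real traffic data) and shows `Series.add` with `fill_value=0` as one possible alternative. Whether this matters depends on whether both downloads cover exactly the same months.

```python
import pandas as pd

# Toy frames with invented values; the app frame lacks August 2015.
web = pd.DataFrame(
    {"year": [2015, 2015], "month": [7, 8], "pageview_mobile_views": [100, 200]}
)
app = pd.DataFrame({"year": [2015], "month": [7], "pageview_mobile_views": [10]})

merged = pd.merge(web, app, on=["year", "month"], how="outer")

# Plain addition: August has no app value, so the sum is NaN.
naive = merged["pageview_mobile_views_x"] + merged["pageview_mobile_views_y"]

# Series.add with fill_value=0 treats the missing side as zero instead.
safe = merged["pageview_mobile_views_x"].add(
    merged["pageview_mobile_views_y"], fill_value=0
)

print(naive.isna().tolist())
print(safe.tolist())
```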

Now we can merge all data frames into a single dataset with the following code:

In [15]:
# merging both legacy data frames
legacy_df = pd.merge(desktop_legacy_df, mobile_legacy_df, on=['year', 'month'], how='outer')
# merging both pageviews data frames
pageviews_df = pd.merge(desktop_pageviews_df, mobile_pageviews_df, on=['year', 'month'], how='outer')
# merging legacy and pageviews data frames
final_df = pd.merge(legacy_df, pageviews_df, on=['year', 'month'], how='outer')

To finish step 2, we replace all NaN values in the final data frame with zero and convert the view count columns to int (optional). We then sum the pagecount views and the pageview views and store the totals in two separate columns, `pagecount_all_views` and `pageview_all_views`.

In [16]:
final_df = final_df.fillna(0)
final_df['pagecount_desktop_views'] = final_df['pagecount_desktop_views'].astype(int)
final_df['pagecount_mobile_views'] = final_df['pagecount_mobile_views'].astype(int)
final_df['pageview_desktop_views'] = final_df['pageview_desktop_views'].astype(int)
final_df['pageview_mobile_views'] = final_df['pageview_mobile_views'].astype(int)
final_df['pagecount_all_views'] = final_df['pagecount_desktop_views'] + final_df['pagecount_mobile_views']
final_df['pageview_all_views'] = final_df['pageview_desktop_views'] + final_df['pageview_mobile_views']

Finally we can save the resulting data frame as csv under the corresponding file path.

In [17]:
final_df.to_csv('../data_clean/en-wikipedia_traffic_200712-202010.csv', index=False)
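
The required header lists `year` and `month` first, followed by the `all`, `desktop` and `mobile` columns for each API, while a chain of merges may leave the columns in a different order. A hedged sketch of one way to enforce the order before writing the CSV is to select the columns by an explicit list. `EXPECTED_COLUMNS` and `toy_df` are illustrative names, and the values are invented.

```python
import pandas as pd

# Column order taken from the header table in step 2.
EXPECTED_COLUMNS = [
    "year",
    "month",
    "pagecount_all_views",
    "pagecount_desktop_views",
    "pagecount_mobile_views",
    "pageview_all_views",
    "pageview_desktop_views",
    "pageview_mobile_views",
]

# Toy one-row frame with the columns deliberately out of order (invented values).
toy_df = pd.DataFrame(
    {
        "pagecount_desktop_views": [5],
        "year": [2008],
        "month": [1],
        "pagecount_mobile_views": [0],
        "pageview_desktop_views": [0],
        "pageview_mobile_views": [0],
        "pagecount_all_views": [5],
        "pageview_all_views": [0],
    }
)

# Selecting with a list returns the columns in exactly that order.
ordered_df = toy_df[EXPECTED_COLUMNS]
print(ordered_df.columns.tolist() == EXPECTED_COLUMNS)
# prints: True
```

Selecting by list also raises a `KeyError` if an expected column is missing, which doubles as a basic completeness check.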

## Step 3: Analysis

For this assignment, the "analysis" will be fairly straightforward: you will visualize the dataset you have created as a **time series graph**. Your visualization will track three traffic metrics: mobile traffic, desktop traffic, and all traffic (mobile + desktop). In order to complete the analysis correctly and receive full credit, your graph will need to be the right scale to view the data; all units, axes, and values should be clearly labeled; and the graph should possess a legend and a title. You must also generate a .png or .jpeg formatted image of your final graph.
Please graph the data in your notebook, rather than using an external application!

First of all we need to load the data into a data frame again.

In [18]:
df = pd.read_csv('../data_clean/en-wikipedia_traffic_200712-202010.csv')

Now we import and use plotly to nicely visualize the data.

In [24]:
import plotly
import plotly.graph_objects as go

In [25]:
fig = go.Figure()

x = pd.to_datetime(df[['year','month']].assign(day=1))

fig.add_trace(
    go.Scatter(
        x=x, 
        y=df['pagecount_desktop_views'], 
        name='Desktop traffic (pagecount)',
        line=dict(color='blue', width=1, dash='dash')
    )
)

fig.add_trace(
    go.Scatter(
        x=x, 
        y=df['pagecount_mobile_views'], 
        name='Mobile traffic (pagecount)',
        line=dict(color='firebrick', width=1, dash='dash')
    )
)

fig.add_trace(
    go.Scatter(
        x=x, 
        y=df['pagecount_all_views'], 
        name='All traffic (pagecount)',
        line=dict(color='black', width=1, dash='dash')
    )
)

fig.add_trace(
    go.Scatter(
        x=x, 
        y=df['pageview_desktop_views'], 
        name='Desktop traffic (pageview)',
        line=dict(color='blue', width=1)
    )
)

fig.add_trace(
    go.Scatter(
        x=x, 
        y=df['pageview_mobile_views'], 
        name='Mobile traffic (pageview)',
        line=dict(color='firebrick', width=1)
    )
)

fig.add_trace(
    go.Scatter(
        x=x, 
        y=df['pageview_all_views'], 
        name='All traffic (pageview)',
        line=dict(color='black', width=1)
    )
)

fig.update_layout(
    title = 'Page Views on English Wikipedia',
    xaxis_title='Year',
    yaxis_title='Views (in Billions)'
)

fig.show()

Finally, we save the plotted figure, not directly as a PNG, but as an interactive HTML file from which the PNG can be downloaded if needed.

In [28]:
plotly.offline.plot(fig, filename='img/wikipedia_pageViews_2008-2020')


Your filename `img/wikipedia_pageViews_2008-2020` didn't end with .html. Adding .html to the end of your file.



'img/wikipedia_pageViews_2008-2020.html'

***

#### Credits

This exercise is slightly adapted from the course [Human Centered Data Science (Fall 2019)](https://wiki.communitydata.science/Human_Centered_Data_Science_(Fall_2019)) of the [University of Washington](https://www.washington.edu/datasciencemasters/) by [Jonathan T. Morgan](https://wiki.communitydata.science/User:Jtmorgan).

Like the original authors, we release the notebooks under the [Creative Commons Attribution license (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).