## API requests for pageviews

The data is obtained through an API provided by the Wikimedia Foundation, which exposes traffic data separately for mobile-web, mobile-app, and desktop access.

### Step 1: Data acquisition
We collect monthly web traffic data for English Wikipedia from two Wikimedia REST API endpoints, the Pageviews API and the Legacy Pagecounts API, and save the data in JSON files. We make five API calls in total (three for pageviews, two for pagecounts) and save each JSON result to its own data file. The requests are made in the Python code below; each complete URL is the endpoint template filled in with that request's parameters.
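
As a rough sketch of what those five requests look like, the complete URLs can be built by filling the two endpoint templates with each request's parameters. The start and end timestamps below are assumptions based on the output file names and may need adjusting, and `build_urls` is a hypothetical helper, not part of the API.

```python
PAGEVIEWS = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/'
             '{project}/{access}/{agent}/{granularity}/{start}/{end}')
PAGECOUNTS = ('https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/'
              '{project}/{access}/{granularity}/{start}/{end}')


def build_urls(project='en.wikipedia.org'):
    """Return the five request URLs keyed by a short label (hypothetical helper)."""
    urls = {}
    # Pageviews API: one request per access type, user traffic only.
    for access in ('desktop', 'mobile-web', 'mobile-app'):
        urls['pageviews_' + access] = PAGEVIEWS.format(
            project=project, access=access, agent='user',
            granularity='monthly',
            start='2015070100', end='2017100100')  # assumed date range
    # Legacy Pagecounts API: the path has no agent segment.
    for access in ('desktop-site', 'mobile-site'):
        urls['pagecounts_' + access] = PAGECOUNTS.format(
            project=project, access=access, granularity='monthly',
            start='2008010100', end='2016060100')  # assumed date range
    return urls


for label, url in build_urls().items():
    print(label, url)
```

Printing the URLs before sending any request makes it easy to paste one into a browser and confirm that the parameters are what was intended.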


In [295]:
import requests
import json
import csv
import numpy as np
import pandas as pd
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-web',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100' # use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)  # send the identifying headers with the request
response_pageviews_mobile_web = api_call.json()


with open('pageviews_mobile_web_201507_201709.json', 'w') as outfile:
    json.dump(response_pageviews_mobile_web, outfile)
    


In [296]:
import requests
import json
import csv
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-app',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100' # use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)  # send the identifying headers with the request
response_pageviews_mobile_app = api_call.json()


with open('pageviews_mobile_app_201507_201709.json', 'w') as outfile:
    json.dump(response_pageviews_mobile_app, outfile)
    


In [297]:
import requests
import json
import csv
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

params = {'project' : 'en.wikipedia.org',
            'access' : 'desktop',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100' # use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)  # send the identifying headers with the request
response_pageviews_desktop = api_call.json()


with open('pageviews_desktop_201507_201709.json', 'w') as outfile:
    json.dump(response_pageviews_desktop, outfile)
    


## API requests for pagecounts

In [298]:
import requests
import json
import csv
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

# English Wikipedia, starting in 2008-01 to match the file names used in the processing step
params = {'project' : 'en.wikipedia.org',
            'access' : 'desktop-site',
            'granularity' : 'monthly',
            'start' : '2008010100',
            'end' : '2016060100'#use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)  # send the identifying headers with the request
response_pagecounts_desktop = api_call.json()


with open('pagecounts_desktop_200801_201606.json', 'w') as outfile:
    json.dump(response_pagecounts_desktop, outfile)
    


In [299]:
import requests
import json
import csv
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-site',
            'granularity' : 'monthly',
            'start' : '2008010100',
            'end' : '2016060100'#use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)  # send the identifying headers with the request
response_pagecounts_mobile = api_call.json()


with open('pagecounts_mobile_200801_201606.json', 'w') as outfile:
    json.dump(response_pagecounts_mobile, outfile)
    


### Step 2: Data processing
In this step, we perform a series of processing steps on the saved JSON data files generated by the API requests. This processing is being done to prepare the data for analysis and combine the relevant features into a single CSV-formatted data file.
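
Before converting, it can help to confirm that each saved response has the expected shape. The sketch below assumes the response is a dictionary with an `items` list (the key the conversion cells read) and that each item carries a `timestamp` plus a `views` or `count` field. `check_items` and the sample record are illustrative only.

```python
import json


def check_items(response, value_key):
    """Return the list of monthly records, raising if the shape is unexpected."""
    items = response.get('items')
    if not items:
        raise ValueError('response has no items; the request may have failed')
    for item in items:
        missing = {'timestamp', value_key} - set(item)
        if missing:
            raise ValueError('record is missing fields: %s' % sorted(missing))
    return items


# Illustrative response with a single record, serialized and parsed again
# the same way a saved JSON file would be.
sample = json.loads(json.dumps({'items': [{
    'project': 'en.wikipedia', 'access': 'mobile-web', 'agent': 'user',
    'granularity': 'monthly', 'timestamp': '2015070100', 'views': 3179131148}]}))

records = check_items(sample, value_key='views')
print(len(records), records[0]['timestamp'])
```

For the pagecounts files the same check would be called with `value_key='count'`, on the assumption that the legacy endpoint reports its totals under that name.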


The following cells convert the JSON responses to CSV files.

In [300]:
(pd.DataFrame.from_dict(data=response_pageviews_mobile_web['items'], orient='columns').
to_csv('pageviews_mobile_web_201507_201709.csv', header=True))

In [301]:
(pd.DataFrame.from_dict(data=response_pageviews_desktop['items'], orient='columns').
to_csv('pageviews_desktop_201507_201709.csv', header=True))

In [302]:
(pd.DataFrame.from_dict(data=response_pageviews_mobile_app['items'], orient='columns').
to_csv('pageviews_mobile_app_201507_201709.csv', header=True))

In [303]:
(pd.DataFrame.from_dict(data=response_pagecounts_desktop['items'], orient='columns').
to_csv('pagecounts_desktop_200801_201606.csv', header=True))

In [304]:
(pd.DataFrame.from_dict(data=response_pagecounts_mobile['items'], orient='columns').
to_csv('pagecounts_mobile_200801_201606.csv', header=True))
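
The conversion cells write the dataframe index to the CSV, which is why an `Unnamed: 0` column shows up when the files are read back. A small sketch, using an in-memory buffer and made-up sample values, of how passing `index=False` to `to_csv` avoids that extra column:

```python
import io

import pandas as pd

# Sample records shaped like the API items (values are illustrative).
items = [{'timestamp': '2015070100', 'views': 100},
         {'timestamp': '2015080100', 'views': 120}]
df = pd.DataFrame(items)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

With `index=False` at write time, the later `del ...['Unnamed: 0']` cells would no longer be needed.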

### Naming the dataframes by data type
Naming convention:    
pvmw = pageview mobile-web  
pvma = pageview mobile-app  
pvd = pageview desktop  
pcd = pagecount desktop  
pcm = pagecount mobile

In [305]:
pvmw = pd.read_csv('pageviews_mobile_web_201507_201709.csv')

In [306]:
del pvmw['Unnamed: 0']

In [307]:
pvma = pd.read_csv('pageviews_mobile_app_201507_201709.csv')

In [308]:
del pvma['Unnamed: 0']

In [309]:
pcd = pd.read_csv('pagecounts_desktop_200801_201606.csv')

In [310]:
del pcd['Unnamed: 0']

In [311]:
pcm = pd.read_csv('pagecounts_mobile_200801_201606.csv')

In [312]:
del pcm['Unnamed: 0']

In [313]:
pvd = pd.read_csv('pageviews_desktop_201507_201709.csv')

In [314]:
del pvd['Unnamed: 0']

### Filling the dataframe with values of page views and counts

The function below splits the timestamp in each dataframe into separate year and month columns.

In [315]:
def change_csv(df):
    # Timestamps have the form YYYYMMDDHH; work on their string form.
    timestamps = df['timestamp'].astype(str)
    # Characters 0-3 are the year, characters 4-5 the zero-padded month.
    df['year'] = timestamps.str[0:4]
    df['month'] = timestamps.str[4:6]
    # The original timestamp column is no longer needed.
    del df['timestamp']
    return df
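
For a single value, the same split can be checked in plain Python. This sketch assumes timestamps use the `YYYYMMDDHH` layout seen in the request parameters (for example `2015070100`). `split_timestamp` is a hypothetical helper, and parsing with `datetime.strptime` additionally rejects malformed values instead of slicing them blindly.

```python
from datetime import datetime


def split_timestamp(timestamp):
    """Split a YYYYMMDDHH timestamp into zero-padded (year, month) strings."""
    parsed = datetime.strptime(str(timestamp), '%Y%m%d%H')
    return parsed.strftime('%Y'), parsed.strftime('%m')


print(split_timestamp(2015070100))
```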

Apply the function to every dataframe so that year and month are separate columns.

In [316]:
pcm = change_csv(pcm)

In [317]:
pcd = change_csv(pcd)

In [318]:
pvmw = change_csv(pvmw)

In [319]:
pvma = change_csv(pvma)

In [320]:
pvd = change_csv(pvd)

### Making final dataframe

In [338]:
# Merge pagecount mobile and pagecount desktop, replacing NaN values with 0.
# After the merge, count_x holds the mobile counts and count_y the desktop counts.
df_pagecount = pcm.merge(pcd, on=['year','month'],how='outer')
df_pagecount = df_pagecount.fillna(0)


In [339]:
# Total pagecount views = desktop (count_y) + mobile (count_x)
df_pagecount["pagecount_all_views"] = df_pagecount['count_y'] + df_pagecount['count_x']
# Merge pageview mobile-web and pageview mobile-app into a single mobile total
df_mobile = pvmw.merge(pvma,on=['year','month'],how='outer')
df_mobile["pageview_mobile_views"] = df_mobile["views_x"] + df_mobile["views_y"]
df_mobile

Unnamed: 0,access_x,agent_x,granularity_x,project_x,views_x,year,month,access_y,agent_y,granularity_y,project_y,views_y,pageview_mobile_views
0,mobile-web,user,monthly,en.wikipedia,3179131148,2015,7,mobile-app,user,monthly,en.wikipedia,109624146,3288755294
1,mobile-web,user,monthly,en.wikipedia,3192663889,2015,8,mobile-app,user,monthly,en.wikipedia,109669149,3302333038
2,mobile-web,user,monthly,en.wikipedia,3073981649,2015,9,mobile-app,user,monthly,en.wikipedia,96221684,3170203333
3,mobile-web,user,monthly,en.wikipedia,3173975355,2015,10,mobile-app,user,monthly,en.wikipedia,94523777,3268499132
4,mobile-web,user,monthly,en.wikipedia,3142247145,2015,11,mobile-app,user,monthly,en.wikipedia,94353925,3236601070
5,mobile-web,user,monthly,en.wikipedia,3276836351,2015,12,mobile-app,user,monthly,en.wikipedia,99438956,3376275307
6,mobile-web,user,monthly,en.wikipedia,3611404079,2016,1,mobile-app,user,monthly,en.wikipedia,106432767,3717836846
7,mobile-web,user,monthly,en.wikipedia,3242448142,2016,2,mobile-app,user,monthly,en.wikipedia,92414130,3334862272
8,mobile-web,user,monthly,en.wikipedia,3288785117,2016,3,mobile-app,user,monthly,en.wikipedia,97899074,3386684191
9,mobile-web,user,monthly,en.wikipedia,3177044999,2016,4,mobile-app,user,monthly,en.wikipedia,81719003,3258764002


In [340]:
df_mobile.drop(["views_x", "views_y"],axis=1,inplace=True)
df_pageview = df_mobile.merge(pvd, how='outer',on=["year", "month"])
df_pageview = df_pageview.fillna(0)
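
The `_x` and `_y` suffixes that pandas adds to overlapping column names make it easy to pick the wrong column later. One option, sketched here on made-up numbers, is to pass explicit `suffixes` to `merge` so that each column states its source. The frame and column names in this example are illustrative rather than taken from the notebook.

```python
import pandas as pd

mobile = pd.DataFrame({'year': ['2014', '2014'],
                       'month': ['10', '11'],
                       'count': [5, 7]})
desktop = pd.DataFrame({'year': ['2014', '2014', '2014'],
                        'month': ['09', '10', '11'],
                        'count': [40, 50, 60]})

# Outer merge keeps months present in only one source; suffixes label the origin.
merged = mobile.merge(desktop, on=['year', 'month'], how='outer',
                      suffixes=('_mobile', '_desktop'))
# Missing months become 0, and rows are put in chronological order.
merged = merged.fillna(0).sort_values(['year', 'month']).reset_index(drop=True)
merged['count_all'] = merged['count_mobile'] + merged['count_desktop']
print(merged)
```

Sorting on the zero-padded string keys works here because they sort in the same order as the dates they represent.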


In [342]:
df_pageview

Unnamed: 0,access_x,agent_x,granularity_x,project_x,year,month,access_y,agent_y,granularity_y,project_y,pageview_mobile_views,access,agent,granularity,project,views
0,mobile-web,user,monthly,en.wikipedia,2015,7,mobile-app,user,monthly,en.wikipedia,3288755294,desktop,user,monthly,en.wikipedia,4376666686
1,mobile-web,user,monthly,en.wikipedia,2015,8,mobile-app,user,monthly,en.wikipedia,3302333038,desktop,user,monthly,en.wikipedia,4332482183
2,mobile-web,user,monthly,en.wikipedia,2015,9,mobile-app,user,monthly,en.wikipedia,3170203333,desktop,user,monthly,en.wikipedia,4485491704
3,mobile-web,user,monthly,en.wikipedia,2015,10,mobile-app,user,monthly,en.wikipedia,3268499132,desktop,user,monthly,en.wikipedia,4477532755
4,mobile-web,user,monthly,en.wikipedia,2015,11,mobile-app,user,monthly,en.wikipedia,3236601070,desktop,user,monthly,en.wikipedia,4287720220
5,mobile-web,user,monthly,en.wikipedia,2015,12,mobile-app,user,monthly,en.wikipedia,3376275307,desktop,user,monthly,en.wikipedia,4100012037
6,mobile-web,user,monthly,en.wikipedia,2016,1,mobile-app,user,monthly,en.wikipedia,3717836846,desktop,user,monthly,en.wikipedia,4436179457
7,mobile-web,user,monthly,en.wikipedia,2016,2,mobile-app,user,monthly,en.wikipedia,3334862272,desktop,user,monthly,en.wikipedia,4250997185
8,mobile-web,user,monthly,en.wikipedia,2016,3,mobile-app,user,monthly,en.wikipedia,3386684191,desktop,user,monthly,en.wikipedia,4286590426
9,mobile-web,user,monthly,en.wikipedia,2016,4,mobile-app,user,monthly,en.wikipedia,3258764002,desktop,user,monthly,en.wikipedia,4149383857


In [341]:
# Total pageviews = mobile (web + app) + desktop; the desktop column is named "views"
df_pageview["pageview_all_views"] =  df_pageview["pageview_mobile_views"] + df_pageview["views"]
# Combine the pagecount and pageview dataframes to make the final dataframe
df_final = df_pagecount.merge(df_pageview, how='outer',on=["year", "month"])
df_final = df_final[["year", "month", "pagecount_all_views","count_y", "count_x","pageview_all_views", "views",
"pageview_mobile_views"]]
df_final = df_final.fillna(0)
# Sort chronologically so the rows line up with the date range used for plotting
df_final = df_final.sort_values(["year", "month"]).reset_index(drop=True)
# Write the final dataframe to a CSV file
df_final.to_csv('en_wikipedia_traffic_200801_201709.csv', index=False)

## Analysis

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt

In [None]:
#Creating a date range for the x-axis

daterange = pd.date_range('2008-01', '2017-10',freq='M')

In [None]:
x = daterange
len(x)
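
A fixed date range only matches the data if the dataframe has exactly one row per month, in order. As an alternative sketch, the x-axis dates can be derived from the `year` and `month` columns themselves with `pd.to_datetime`, which can assemble dates from `year`, `month`, and `day` columns. The sample frame is made up, and setting the day to 1 is an assumption made for plotting.

```python
import pandas as pd

sample = pd.DataFrame({'year': ['2015', '2015', '2016'],
                       'month': ['11', '12', '01'],
                       'views': [3, 4, 5]})

# Cast the string columns to integers, add a day, and assemble real dates.
parts = sample[['year', 'month']].astype(int).assign(day=1)
sample['date'] = pd.to_datetime(parts)
print(sample['date'].tolist())
```

Because each date comes from its own row, a missing or reordered month cannot silently shift the plotted values.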

In [None]:
dates = pd.date_range('2008-01', '2017-10', freq='M')

# Values to be plotted.
x = dates
# Values are divided by 1,000,000 to make y-axis readable.
y1 = df_final["pagecount_all_views"]/1000000
y2 = df_final["count_y"]/1000000
y3 = df_final["count_x"]/1000000
y4 = df_final["pageview_all_views"]/1000000
y5 = df_final["views"]/1000000
y6 = df_final["pageview_mobile_views"]/1000000

# Create a new figure with appropriate dimensions.
fig = plt.figure(figsize=(11, 8))

# Show grid
plt.grid(True)

# Plot each data series from DataFrame.
# Solid lines are legacy pagecounts; dashed lines are pageviews.
plt.plot(x, y1, label = "Pagecounts: Total", color = "seagreen",
         alpha = 0.8, linewidth = 3)
plt.plot(x, y2, label = "Pagecounts: Main Site", color = "dodgerblue",
         alpha = 0.8, linewidth = 2)
plt.plot(x, y3, label = "Pagecounts: Mobile", color = "sienna")
plt.plot(x, y4, linestyle = "--", label = "Pageviews: Total",
         color = "seagreen", alpha = 0.8, linewidth = 3)
plt.plot(x, y5, "--", label = "Pageviews: Main Site",
         color = "dodgerblue", alpha = 0.8, linewidth = 2)
plt.plot(x, y6, "--", label = "Pageviews: Mobile", color = "sienna")

plt.legend(loc=2)
plt.xlabel("Time")
plt.ylabel("Wikipedia Page Views (million)")
plt.title("Web Traffic on English Wikipedia Pages from 2008-2017")

# Save plot to file first; on some backends the figure is empty once plt.show() returns
fig.savefig("WikipediaDataPlot.png")

# Display plot
plt.show()