# English Wikipedia Page Views by Topics in November and October 2019


https://phabricator.wikimedia.org/T234839


In this analysis, we use Adam's topic dataset, which gives each article its "best" (most salient) topic prediction, for pages accessed in November 2019 (see this [example](https://dr0ptp4kt.github.io/topics-8.html) for an HTML view of the first 10K non-randomized rows).

The outcome topics are from the "predicted" field, which is the post-enrichment best guess for the articles.


In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code"></form>
''')

In [1]:
import requests
import pandas as pd
import json
import matplotlib.pyplot as plt
import gzip
from wmfdata import hive
import numpy as np

You can find the source for `wmfdata` at https://github.com/neilpquinn/wmfdata


In [2]:
# Read the topic prediction file (gzip-compressed TSV with a header row)
topic = pd.read_csv('topic_predictions_201912_mediawiki_page_dump_enriched_20191201_through_20191219.tsv.gz', sep='\t',compression='gzip', header=0)
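As a rough illustration of the read step above, the sketch below writes a tiny gzip-compressed TSV to a temporary directory and reads it back with the same `sep`, `compression`, and `header` arguments. The file name `toy_topics.tsv.gz`, its two rows, and the `usecols`/`dtype` arguments are assumptions made for the example; the real prediction file may contain more columns.

```python
import gzip
import os
import tempfile

import pandas as pd

# Toy file contents (hypothetical); only page_id and predicted are used later in the notebook.
rows = "page_id\tpredicted\n1\tEntertainment\n2\tSports\n"
path = os.path.join(tempfile.mkdtemp(), "toy_topics.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write(rows)

# Same reader arguments as the cell above, plus optional column and dtype restrictions.
toy_topic = pd.read_csv(
    path,
    sep="\t",
    compression="gzip",
    header=0,
    usecols=["page_id", "predicted"],
    dtype={"page_id": "int64"},
)
print(toy_topic)
```

Restricting the columns with `usecols` is optional; it may reduce memory use when only `page_id` and `predicted` are needed for the merge.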

In [3]:
pageview_query = '''
    SELECT 
        CONCAT(year,"-",month,"-01") AS date,
        page_id, 
        SUM(view_count) AS pageviews
    FROM 
        wmf.pageview_hourly
    WHERE year = "{year}"
        AND month = "{month}" 
        AND project = "{wiki}"
        AND namespace_id = 0
        AND agent_type = "user"
        AND NOT (
            country_code IN ("PK", "IR", "AF") 
            AND user_agent_map["browser_family"] = "IE" AND user_agent_map["browser_major"] = 7
        )
    GROUP BY CONCAT(year,"-",month,"-01"), page_id
'''
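The query above is a plain Python string with `{year}`, `{month}`, and `{wiki}` placeholders that are filled by `str.format` before the query is sent to Hive. A minimal sketch of that step, using a shortened, hypothetical template (`query_template`) and a hypothetical helper (`build_query`) rather than the full query:

```python
# Shortened stand-in for the notebook's pageview_query (hypothetical, for illustration only).
query_template = '''
SELECT page_id, SUM(view_count) AS pageviews
FROM wmf.pageview_hourly
WHERE year = "{year}" AND month = "{month}" AND project = "{wiki}"
GROUP BY page_id
'''


def build_query(year, month, wiki="en.wikipedia"):
    """Fill the placeholders for one month of one wiki."""
    return query_template.format(year=year, month=month, wiki=wiki)


november_query = build_query(2019, 11)
october_query = build_query(2019, 10)
print(november_query)
```

Keeping the month as a parameter lets the same template serve both the November and the October runs.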

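## November 2019 Topic Analysis

Variable names containing `sept` and `aug` are carried over from the earlier September 2019 version of this notebook; in the cells below they hold November and October data, respectively.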

In [5]:
enwiki_pv_sept_all = hive.run([
    "SET mapreduce.map.memory.mb=4096",    
     pageview_query.format(
        year = 2019,
        month = 11,
        wiki = "en.wikipedia")
])

In [6]:
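enwiki_pv_sept_all['proportion']= enwiki_pv_sept_all['pageviews']/enwiki_pv_sept_all['pageviews'].sum()
# Assign the sorted frame back to the same name. The original run assigned it to a
# misspelled name (enwiki_pv_sep_all), so the positional "Top 1M pages" figure recorded
# further down likely reflects an unsorted frame.
enwiki_pv_sept_all = enwiki_pv_sept_all.sort_values(by='pageviews', ascending=False)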


In [8]:
enwiki_pv_sept_all[enwiki_pv_sept_all.page_id.isnull()]
# Note from AJB: In the previous notebook version this said enwiki_pv_sept_all[enwiki_pv_sept.page_id.isnull()]
# But there was no enwiki_pv_sept to be found.
# See https://nbviewer.jupyter.org/github/conniecc1/topics-modeling/blob/master/TopicPageviewsSept2019.ipynb.
# There the result was
#  	date 	page_id 	pageviews 	proportion
# 4156974 	2019-9-01 	NaN 	48039 	0.000007


| | date | page_id | pageviews | proportion |
|---|---|---|---|---|
| 917340 | 2019-11-01 | NaN | 35497 | 5e-06 |


In [9]:
print('Total page views in latest month: ' + str(enwiki_pv_sept_all.pageviews.sum()))

Total page views in latest month: 7373875687


In [11]:
print('Number of unique pages in latest month: ' + str(enwiki_pv_sept_all.shape[0]))

Number of unique pages in latest month: 7121236


In [20]:
print('Top 1M pages account for ' + str(round(enwiki_pv_sept_all.proportion[:1000000].sum() * 100,2)) + '% of total page views in November.')

Top 1M pages account for 13.11% of total page views in November.
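A positional slice such as `proportion[:1000000]` only yields the share of the most-viewed pages if the frame is sorted by page views first. The sketch below, on toy numbers rather than real page view counts, contrasts a positional slice of an unsorted frame with an explicit top-N selection; `top_n_share` is a hypothetical helper.

```python
import pandas as pd

# Toy data (hypothetical), deliberately stored in unsorted order.
views = pd.DataFrame({
    "page_id": [1, 2, 3, 4, 5],
    "pageviews": [10, 500, 40, 300, 150],
})
views["proportion"] = views["pageviews"] / views["pageviews"].sum()


def top_n_share(df, n):
    """Share of all views captured by the n most-viewed pages."""
    return df.nlargest(n, "pageviews")["proportion"].sum()


unsorted_share = views["proportion"][:2].sum()  # first two rows as stored
sorted_share = top_n_share(views, 2)            # two most-viewed pages
print(round(unsorted_share, 3), round(sorted_share, 3))
```

Using `nlargest` avoids depending on whether an earlier cell sorted the frame.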


In [13]:
pageview_title_query = '''
WITH v AS (
    SELECT page_id, SUM(view_count) AS pageviews
    FROM wmf.pageview_hourly
    WHERE year = "{year}"
        AND month = "{month}" 
        AND project = "{wiki}"
        AND namespace_id = 0
        AND agent_type = "user"
        AND NOT (
            country_code IN ("PK", "IR", "AF") AND user_agent_map["browser_family"] = "IE" AND user_agent_map["browser_major"] = 7
        )
    GROUP BY page_id
    LIMIT 10000000
), p AS (
    SELECT page_id, page_title, page_latest
    FROM wmf_raw.mediawiki_page
    WHERE wiki_db = "enwiki"
    AND snapshot = "{snapshot}"
    AND page_id IS NOT NULL
    AND page_namespace = 0
    AND NOT page_is_redirect
)

SELECT v.page_id, p.page_title, v.pageviews
FROM v LEFT JOIN p ON v.page_id=p.page_id
'''

In [15]:
enwiki_pv_sept = hive.run([
    "SET mapreduce.map.memory.mb=4096",    
     pageview_title_query.format(
        year = 2019,
        month = 11,
        wiki = "en.wikipedia",
        snapshot = "2019-11")
])

In [16]:
enwiki_pv_sept['proportion']= enwiki_pv_sept['pageviews']/enwiki_pv_sept['pageviews'].sum()
enwiki_pv_sept = enwiki_pv_sept.sort_values(by='pageviews', ascending=False)

In [17]:
enwiki_pv_topic_sept = enwiki_pv_sept.merge(topic, how = 'left', on = 'page_id')

In [18]:
enwiki_pv_topic_sept['predicted'] = enwiki_pv_topic_sept['predicted'].fillna(value='Unknown')
enwiki_pv_topic_sept['proportion']= enwiki_pv_topic_sept['pageviews']/enwiki_pv_topic_sept['pageviews'].sum()
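The merge and fill steps above can be sketched on toy data as follows. The frames and values are made up for illustration, and the `validate="many_to_one"` argument is an optional addition not used in the notebook: it asks pandas to raise an error if a `page_id` appears more than once in the topic table, which would otherwise duplicate page view rows.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the page view table and the topic table.
pageviews = pd.DataFrame({
    "page_id": [1, 2, 3],
    "page_title": ["A", "B", "C"],
    "pageviews": [600, 300, 100],
})
topics = pd.DataFrame({
    "page_id": [1, 2],
    "predicted": ["Entertainment", "Sports"],
})

# Left join keeps every viewed page, even those without a topic prediction.
merged = pageviews.merge(topics, how="left", on="page_id", validate="many_to_one")
merged["predicted"] = merged["predicted"].fillna("Unknown")
merged["proportion"] = merged["pageviews"] / merged["pageviews"].sum()

# Share of views that fall on pages with no predicted topic.
unknown_share = merged.loc[merged["predicted"] == "Unknown", "proportion"].sum()
print(merged)
```

Tracking `unknown_share` gives a quick sense of how much traffic the topic dataset does not cover.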

### Top 50 articles read in November 2019 on English Wikipedia

The table below shows the top 50 articles viewed on English Wikipedia in November 2019, with each article's proportion of total page views and its best predicted topic.

In [19]:
enwiki_page_sept_summary = enwiki_pv_topic_sept[['page_title','pageviews','proportion','predicted']].sort_values(by='pageviews', ascending=False).reset_index(drop=True).head(50)


In [21]:
print('Top 50 articles account for ' +  str(round(enwiki_page_sept_summary.proportion.sum() * 100,2))+ '% of total page views in November.')

Top 50 articles account for 6.43% of total page views in November.


In [22]:
enwiki_page_sept_summary

| | page_title | pageviews | proportion | predicted |
|---|---|---|---|---|
| 0 | Main_Page | 359441360 | 0.048745 | Internet culture |
| 1 | Simple_Mail_Transfer_Protocol | 13313220 | 0.001805 | Technology |
| 2 | The_Mandalorian | 4866318 | 0.00066 | Entertainment |
| 3 | List_of_awards_and_nominations_received_by_Mer... | 3823347 | 0.000518 | Entertainment |
| 4 | Henry_V_of_England | 3746844 | 0.000508 | Military and warfare |
| 5 | Joker_(2019_film) | 3495017 | 0.000474 | Visual arts |
| 6 | The_Irishman_(2019_film) | 3270961 | 0.000444 | Entertainment |
| 7 | Elizabeth_II | 3175514 | 0.000431 | Language and literature |
| 8 | Deaths_in_2019 | 3111145 | 0.000422 | Time |
| 9 | Princess_Margaret,_Countess_of_Snowdon | 3073530 | 0.000417 | Language and literature |


### November Top 50 Topics Viewed


The table below shows page views for the top 50 topics in November 2019 on English Wikipedia. The Main Page is excluded from this table.

In [23]:
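# Select the columns with a list ([[...]]); the bare tuple form is not accepted by recent pandas versions.
enwiki_topic_sept_summary = (enwiki_pv_topic_sept[enwiki_pv_topic_sept.page_title != 'Main_Page']
          .groupby('predicted', as_index = False)[['pageviews', 'proportion']]
          .sum()
          .sort_values(by='pageviews', ascending=False))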

In [24]:
print('Top 10 topics account for ' +  str(round(enwiki_topic_sept_summary.proportion[:10].sum() * 100,2))+ '% of total page views in November.')
print('Top 50 topics account for ' +  str(round(enwiki_topic_sept_summary.proportion[:50].sum() * 100,2))+ '% of total page views in November.')

Top 10 topics account for 64.01% of total page views in November.
Top 50 topics account for 93.76% of total page views in November.


In [25]:
enwiki_topic_sept_summary.head(50)

| | predicted | pageviews | proportion |
|---|---|---|---|
| 148 | Entertainment | 1040547962 | 0.141112 |
| 425 | Sports | 616906761 | 0.083661 |
| 340 | Performing arts | 582297034 | 0.078967 |
| 201 | History and society | 523091549 | 0.070938 |
| 74 | Broadcasting | 499102749 | 0.067685 |
| 348 | Politics and government | 441673158 | 0.059897 |
| 246 | Language and literature | 267947415 | 0.036337 |
| 446 | Technology | 261528323 | 0.035467 |
| 82 | Business and economics | 245842123 | 0.033339 |
| 343 | Philosophy and religion | 241042965 | 0.032689 |
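The per-topic aggregation can be sketched on toy data as below. All values are made up. Because `proportion` is computed over all pages before the Main Page is filtered out, as in the notebook, the topic proportions in this sketch do not sum to 1.

```python
import pandas as pd

# Toy page-level table (hypothetical values).
pages = pd.DataFrame({
    "page_title": ["Main_Page", "Film_A", "Film_B", "Team_C"],
    "predicted": ["Internet culture", "Entertainment", "Entertainment", "Sports"],
    "pageviews": [500, 200, 100, 200],
})
# Proportion is taken over all pages, including the Main Page.
pages["proportion"] = pages["pageviews"] / pages["pageviews"].sum()

topic_summary = (
    pages[pages["page_title"] != "Main_Page"]
    .groupby("predicted", as_index=False)[["pageviews", "proportion"]]
    .sum()
    .sort_values(by="pageviews", ascending=False)
)
print(topic_summary)
```

If shares among non-Main-Page views are wanted instead, the proportion would need to be recomputed after the filter.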


## November & October 2019 Topic Data Comparison

In [26]:
enwiki_pv_aug_all = hive.run([
    "SET mapreduce.map.memory.mb=4096",    
     pageview_query.format(
        year = 2019,
        month = 10,
        wiki = "en.wikipedia")
])

In [27]:
enwiki_pv_aug_all['proportion']= enwiki_pv_aug_all['pageviews']/enwiki_pv_aug_all['pageviews'].sum()
enwiki_pv_aug_all = enwiki_pv_aug_all.sort_values(by='pageviews', ascending=False)

In [28]:
print('Total page views in October: ' + str(enwiki_pv_aug_all.pageviews.sum()))

Total page views in October: 7742167216


In [29]:
print('Number of unique pages in October: ' + str(enwiki_pv_aug_all.shape[0]))

Number of unique pages in October: 7011076


In [30]:
enwiki_pv_aug = hive.run([
    "SET mapreduce.map.memory.mb=4096",    
     pageview_title_query.format(
        year = 2019,
        month = 10,
        wiki = "en.wikipedia",
        snapshot = "2019-10")
])

In [31]:
enwiki_pv_topic_aug = enwiki_pv_aug.merge(topic, how = 'left', on = 'page_id')

In [32]:
enwiki_pv_topic_aug['predicted'] = enwiki_pv_topic_aug['predicted'].fillna(value='Unknown')
enwiki_pv_topic_aug['proportion']= enwiki_pv_topic_aug['pageviews']/enwiki_pv_topic_aug['pageviews'].sum()

### October Top 50 Topics Viewed

In [33]:
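# Select the columns with a list ([[...]]); the bare tuple form is not accepted by recent pandas versions.
enwiki_topic_aug_summary = (enwiki_pv_topic_aug[enwiki_pv_topic_aug.page_title != 'Main_Page']
          .groupby('predicted', as_index = False)[['pageviews', 'proportion']]
          .sum()
          .sort_values(by='pageviews', ascending=False))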

In [34]:
print('Top 10 topics account for ' +  str(round(enwiki_topic_aug_summary.proportion[:10].sum() * 100,2))+ '% of total page views in October')
print('Top 50 topics account for ' +  str(round(enwiki_topic_aug_summary.proportion[:50].sum() * 100,2))+ '% of total page views in October')

Top 10 topics account for 62.13% of total page views in October
Top 50 topics account for 92.35% of total page views in October


In [35]:
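enwiki_topic_sept_summary.head(50)
# Note: this cell displays the November summary (enwiki_topic_sept_summary) again;
# the October summary is held in enwiki_topic_aug_summary.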

| | predicted | pageviews | proportion |
|---|---|---|---|
| 148 | Entertainment | 1040547962 | 0.141112 |
| 425 | Sports | 616906761 | 0.083661 |
| 340 | Performing arts | 582297034 | 0.078967 |
| 201 | History and society | 523091549 | 0.070938 |
| 74 | Broadcasting | 499102749 | 0.067685 |
| 348 | Politics and government | 441673158 | 0.059897 |
| 246 | Language and literature | 267947415 | 0.036337 |
| 446 | Technology | 261528323 | 0.035467 |
| 82 | Business and economics | 245842123 | 0.033339 |
| 343 | Philosophy and religion | 241042965 | 0.032689 |


### Top Topics Rank Comparison November vs. October

In [40]:
# Rank 1 is the topic with the largest share of page views.
enwiki_topic_sept_summary["nov_rank"] = enwiki_topic_sept_summary["proportion"].rank(ascending=False)
enwiki_topic_aug_summary["oct_rank"] = enwiki_topic_aug_summary["proportion"].rank(ascending=False)

In [41]:
topic_rank = enwiki_topic_sept_summary.merge(enwiki_topic_aug_summary, how = 'left', on = 'predicted')
topic_rank = topic_rank.rename(columns={'predicted': 'topic', 'proportion_x': 'proportion_nov','proportion_y': 'proportion_oct','pageviews_x':'pageviews_nov','pageviews_y':'pageviews_oct'})

In [42]:
topic_rank[['topic','proportion_nov','proportion_oct','nov_rank','oct_rank']].head(50)


| | topic | proportion_nov | proportion_oct | nov_rank | oct_rank |
|---|---|---|---|---|---|
| 0 | Entertainment | 0.141112 | 0.130386 | 1.0 | 1.0 |
| 1 | Sports | 0.083661 | 0.084508 | 2.0 | 2.0 |
| 2 | Performing arts | 0.078967 | 0.076771 | 3.0 | 3.0 |
| 3 | History and society | 0.070938 | 0.069219 | 4.0 | 4.0 |
| 4 | Broadcasting | 0.067685 | 0.067243 | 5.0 | 5.0 |
| 5 | Politics and government | 0.059897 | 0.057299 | 6.0 | 6.0 |
| 6 | Language and literature | 0.036337 | 0.034036 | 7.0 | 9.0 |
| 7 | Technology | 0.035467 | 0.035118 | 8.0 | 7.0 |
| 8 | Business and economics | 0.033339 | 0.034189 | 9.0 | 8.0 |
| 9 | Philosophy and religion | 0.032689 | 0.032506 | 10.0 | 10.0 |
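The ranking and merge steps behind the comparison table can be sketched on toy data as below. The topic names `A`, `B`, `C` and their proportions are hypothetical. Passing `suffixes=("_nov", "_oct")` to `merge` is an alternative to renaming the default `_x`/`_y` columns afterwards, which is what the notebook does.

```python
import pandas as pd

# Toy monthly topic summaries (hypothetical values).
nov = pd.DataFrame({"predicted": ["A", "B", "C"], "proportion": [0.5, 0.3, 0.2]})
oct_ = pd.DataFrame({"predicted": ["A", "B", "C"], "proportion": [0.5, 0.2, 0.3]})

# Rank 1 goes to the topic with the largest proportion in each month.
nov["nov_rank"] = nov["proportion"].rank(ascending=False)
oct_["oct_rank"] = oct_["proportion"].rank(ascending=False)

# Only the overlapping column (proportion) receives a suffix.
ranks = nov.merge(oct_, how="left", on="predicted", suffixes=("_nov", "_oct"))
ranks = ranks.rename(columns={"predicted": "topic"})
print(ranks)
```

With `how="left"`, a topic that appears only in November keeps its row and gets `NaN` for the October columns.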


In [43]:
topic_rank['rank_diff_abs'] = abs(topic_rank['nov_rank'] - topic_rank['oct_rank'])


In [44]:
topic_rank[['topic','proportion_nov','proportion_oct','nov_rank','oct_rank','rank_diff_abs']].head(100).sort_values(by='rank_diff_abs', ascending=False).head(10)


| | topic | proportion_nov | proportion_oct | nov_rank | oct_rank | rank_diff_abs |
|---|---|---|---|---|---|---|
| 97 | Venezuela | 7.3e-05 | 4.7e-05 | 98.0 | 127.0 | 29.0 |
| 85 | Syria | 0.000105 | 0.000172 | 86.0 | 70.0 | 16.0 |
| 98 | Hong_Kong | 7.2e-05 | 9.3e-05 | 99.0 | 90.0 | 9.0 |
| 80 | Nepal | 0.000113 | 9.6e-05 | 81.0 | 89.0 | 8.0 |
| 96 | Morocco | 7.4e-05 | 7.5e-05 | 97.0 | 103.0 | 6.0 |
| 71 | Thailand | 0.000147 | 0.000135 | 72.0 | 78.0 | 6.0 |
| 86 | Belgium | 0.000102 | 0.000115 | 87.0 | 82.0 | 5.0 |
| 68 | New Zealand | 0.000179 | 0.000201 | 69.0 | 65.0 | 4.0 |
| 70 | Israel | 0.00015 | 0.000139 | 71.0 | 75.0 | 4.0 |
| 66 | Nigeria | 0.000187 | 0.000168 | 67.0 | 71.0 | 4.0 |
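The absolute rank difference shows how far a topic moved but not in which direction. As a small sketch, the code below reuses three rank pairs from the tables above (Venezuela, Syria, Philosophy and religion) and adds a signed column; the column name `rank_change` is an assumption for this example and is not part of the notebook.

```python
import pandas as pd

# Rank pairs taken from the comparison tables above.
ranks = pd.DataFrame({
    "topic": ["Venezuela", "Syria", "Philosophy and religion"],
    "nov_rank": [98.0, 86.0, 10.0],
    "oct_rank": [127.0, 70.0, 10.0],
})

# Positive values mean the topic moved up the ranking from October to November.
ranks["rank_change"] = ranks["oct_rank"] - ranks["nov_rank"]
ranks["rank_diff_abs"] = ranks["rank_change"].abs()

movers = ranks.sort_values(by="rank_diff_abs", ascending=False)
print(movers)
```

Here Venezuela climbs 29 places while Syria drops 16, a distinction that the absolute difference alone does not show.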
