# How to identify groups of articles relevant to a specific news event

This demo applies the temporal community detection procedure to an example event. The procedure collects groups of articles that are well connected to some specified seed articles and show similar page view patterns to them.

Note that this demo version does not collect data from scratch; it loads pre-collected data instead. This document will be updated to cover that step.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from sklearn.preprocessing import RobustScaler
import WikiNewsNetwork as wnn

## 1. Data collection

N.B. This section is incomplete at present and does not collect data from scratch.

Take a sample event with hyperlinked Wikipedia articles and an event date, for example:

_2018/11/30 ___[2018 Anchorage earthquake](https://en.wikipedia.org/wiki/2018_Anchorage_earthquake)___: A ___[magnitude](https://en.wikipedia.org/wiki/Moment_magnitude_scale)___ 7.0 earthquake hits Alaska, with the epicenter in ___[Anchorage](https://en.wikipedia.org/wiki/Anchorage,_Alaska)___. Severe damage is reported._

In [None]:
core = ['2018_Anchorage_earthquake', 'Moment_magnitude_scale', 'Anchorage,_Alaska']
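
The core list above can also be derived from the event text rather than typed by hand. The following is a minimal sketch, assuming the event description is available as Markdown with English Wikipedia links; the helper name `extract_core_titles` is hypothetical and not part of `WikiNewsNetwork`.

```python
import re

# Event text in the same Markdown form as the example above (shortened)
event_text = (
    "[2018 Anchorage earthquake](https://en.wikipedia.org/wiki/2018_Anchorage_earthquake): "
    "A [magnitude](https://en.wikipedia.org/wiki/Moment_magnitude_scale) 7.0 earthquake hits Alaska, "
    "with the epicenter in [Anchorage](https://en.wikipedia.org/wiki/Anchorage,_Alaska)."
)


def extract_core_titles(text):
    """Return the titles of all English Wikipedia links in `text`, in order of appearance."""
    pattern = r"https://en\.wikipedia\.org/wiki/([^)\s]+)"
    return re.findall(pattern, text)


core_from_text = extract_core_titles(event_text)
print(core_from_text)
```

The result uses the underscore form of the titles, which is the form the `core` list in this demo uses.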

Collect clickstream network data for the period (download it first if necessary).

In [None]:
# Pre-collected data, will be updated to collect from scratch
el = pd.read_hdf('demo/edgelist.h5')
display(el.head())
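
If you collect the edge list yourself, one possible starting point is a monthly Wikipedia clickstream dump. The sketch below assumes the dump is a headerless tab-separated file with the columns `prev`, `curr`, `type` and `n`, and that only rows of type `link` are wanted; both the helper `read_clickstream` and the sample rows (with invented counts) are illustrative. It reads from an in-memory string so that it runs without a download.

```python
import io
import pandas as pd

# Three made-up rows in the assumed dump layout: prev, curr, type, n
sample_tsv = (
    "Anchorage,_Alaska\t2018_Anchorage_earthquake\tlink\t500\n"
    "other-search\t2018_Anchorage_earthquake\texternal\t9000\n"
    "Moment_magnitude_scale\t2018_Anchorage_earthquake\tlink\t120\n"
)


def read_clickstream(source):
    """Read a clickstream file and keep only article-to-article link traffic."""
    df = pd.read_csv(
        source,
        sep="\t",
        names=["prev", "curr", "type", "n"],
        quoting=3,              # 3 == csv.QUOTE_NONE, since titles may contain quotes
        keep_default_na=False,  # do not turn titles such as "NaN" into missing values
    )
    return df[df["type"] == "link"].drop(columns="type").reset_index(drop=True)


el_sample = read_clickstream(io.StringIO(sample_tsv))
print(el_sample)
```

Passing a file path instead of the `StringIO` object should work the same way, since `pd.read_csv` accepts both.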

Collect page view data for the relevant articles in the period.

In [None]:
# Pre-collected data, will be updated to collect from scratch
ts = pd.read_hdf('demo/timeseries.h5')
display(ts.head())
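
For collecting page views from scratch, one option is the Wikimedia REST API's per-article page view endpoint. The sketch below only builds the request URL and parses a response; it makes no network call. The helper names are hypothetical, and the payload is hand-written with invented view counts to mimic the `items` / `timestamp` / `views` layout of a response.

```python
from urllib.parse import quote
import pandas as pd

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia"):
    """Build the URL for daily user page views; `start` and `end` are YYYYMMDD strings."""
    return (f"{API}/{project}/all-access/user/"
            f"{quote(article, safe='')}/daily/{start}/{end}")


def parse_pageviews(payload, article):
    """Turn one decoded JSON response (a dict) into a date-indexed Series of views."""
    items = payload["items"]
    dates = pd.to_datetime([it["timestamp"] for it in items], format="%Y%m%d%H")
    return pd.Series([it["views"] for it in items], index=dates, name=article)


# Hand-written payload with invented counts, for illustration only
fake_payload = {"items": [{"timestamp": "2018112900", "views": 10},
                          {"timestamp": "2018113000", "views": 250}]}

url = pageviews_url("Anchorage,_Alaska", "20181031", "20181230")
series = parse_pageviews(fake_payload, "Anchorage,_Alaska")
print(url)
print(series)
```

Concatenating one such Series per article with `pd.concat(..., axis=1)` would give a table with dates as rows and articles as columns, which is the layout the rest of this demo expects from `ts`.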

Collect relevant redirects.

In [None]:
# Not needed for the pre-collected demo data; recommended if you collect your own data via clickstream, dumps or the API
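
One reason to collect redirects is that traffic for a redirect title may be recorded separately from its target article. The sketch below shows one way to fold such columns together, assuming you already hold a redirect-to-target mapping; the mapping, the redirect title and the view counts here are all invented for illustration.

```python
import pandas as pd

# Hypothetical redirect map (redirect title -> target title)
redirects = {"Anchorage_earthquake_2018": "2018_Anchorage_earthquake"}

# Toy page view table: dates as rows, article titles as columns
views = pd.DataFrame(
    {"2018_Anchorage_earthquake": [100, 2000],
     "Anchorage_earthquake_2018": [5, 300],
     "Anchorage,_Alaska": [800, 4000]},
    index=pd.to_datetime(["2018-11-29", "2018-11-30"]),
)

# Rename redirect columns to their targets, then sum columns that share a title
merged = views.rename(columns=redirects).T.groupby(level=0).sum().T
print(merged)
```

The same renaming idea could be applied to the `prev` and `curr` columns of an edge list, followed by a `groupby` that sums `n` over duplicate edges.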

## 2. Processing

Pre-process network and page view data

In [None]:
# Scale page view data
scaler = RobustScaler()
timeseries = scaler.fit_transform(ts)

# Filter edgelist 
el = el[el['n']>100]
el = el[(el['prev'].isin(ts.columns)) & (el['curr'].isin(ts.columns))]

# Convert to unweighted, undirected adjacency matrix
articles = sorted(set(el['prev']) | set(el['curr']))
network = (~el.pivot(index='prev', columns='curr', values='n').isna()
               ).reindex(columns=articles, index=articles, fill_value=False)
network = (network | network.T).astype(int)
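
To make the two processing steps concrete, here is a toy version with invented numbers. The first part reproduces by hand what `RobustScaler` computes with its default settings (centre each column on its median, divide by its interquartile range); the second part runs the same filter and pivot logic as the cell above on a four-row edge list.

```python
import numpy as np
import pandas as pd

# Robust scaling by hand: subtract the median, divide by the interquartile range
toy_ts = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 100.0],
                       "B": [10.0, 20.0, 30.0, 40.0, 50.0]})
iqr = toy_ts.quantile(0.75) - toy_ts.quantile(0.25)
toy_scaled = (toy_ts - toy_ts.median()) / iqr
print(toy_scaled)  # the outlier in A stays large but does not distort the other rows

# Adjacency matrix from a toy directed edge list
toy_el = pd.DataFrame({"prev": ["A", "B", "C", "D"],
                       "curr": ["B", "A", "B", "A"],
                       "n": [150, 200, 120, 50]})
toy_el = toy_el[toy_el["n"] > 100]  # drops the weak D -> A edge, and with it node D
nodes = sorted(set(toy_el["prev"]) | set(toy_el["curr"]))
adj = (~toy_el.pivot(index="prev", columns="curr", values="n").isna()
       ).reindex(index=nodes, columns=nodes, fill_value=False)
adj = (adj | adj.T).astype(int)  # symmetrise: keep a link if either direction exists
print(adj)
```

Note that the reciprocal edges A to B and B to A collapse into a single undirected link, and that the `reindex` call is what adds the missing column for C, which never appears as a target.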

Save network and page view data

In [None]:
np.save('demo/scaled_timeseries.npy', timeseries)
network.to_hdf('demo/adj.h5', key='df')

## 3. Community detection

Supply the network and page view data to the temporal community detection algorithm. The algorithm identifies groups of articles that are both well connected and exhibit similar attention dynamics around the time of the event.

In [None]:
# Load data if necessary
# timeseries = np.load('demo/scaled_timeseries.npy')
# network = pd.read_hdf('demo/adj.h5')

cd_output, nodename_dict = wnn.cd.cd_demo(timeseries, network, res=0.25, tau=1)
# see docs for many more arguments to this function
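
To build intuition for "well connected and similar attention dynamics", the sketch below uses a static and much simpler stand-in: it keeps a link only when the two articles' page view series are strongly correlated, then takes connected components. This is not the algorithm implemented in `wnn.cd.cd_demo`, and it ignores time entirely; the function name, threshold and data are all assumptions made for illustration.

```python
import pandas as pd

# Toy page views (invented): A and B spike together, C and D share a different pattern
views = pd.DataFrame({
    "A": [1, 1, 9, 5, 2],
    "B": [2, 2, 8, 6, 3],
    "C": [5, 4, 5, 4, 5],
    "D": [6, 5, 6, 5, 6],
})
# Toy undirected links forming a chain A - B - C - D
links = [("A", "B"), ("B", "C"), ("C", "D")]


def correlated_components(views, links, min_corr=0.5):
    """Group articles joined by links whose page view series correlate above `min_corr`."""
    corr = views.corr()  # Pearson correlation between every pair of columns
    neighbours = {name: set() for name in views.columns}
    for u, v in links:
        if corr.loc[u, v] > min_corr:  # keep a link only if the dynamics are similar too
            neighbours[u].add(v)
            neighbours[v].add(u)
    groups, seen = [], set()
    for start in views.columns:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over the kept links
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


groups = correlated_components(views, links)
print(groups)
```

Although B and C are linked, their series barely correlate, so the chain splits into two groups. The temporal algorithm used in this demo additionally lets membership change from one timestep to the next, which this stand-in does not attempt.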

Process community detection output:

In [None]:
membership_df = pd.concat([pd.Series(x, index=nodename_dict[n])
                            for n, x in enumerate(cd_output[0])],
                           axis=1, sort=True)
display(membership_df)
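
The `pd.concat` call above aligns the per-timestep results on article name. The toy example below shows the effect, assuming (as the cell above suggests) that `cd_output[0]` holds one sequence of community labels per timestep and `nodename_dict` holds the matching node names; the stand-in values are invented.

```python
import pandas as pd

# Hypothetical stand-ins for cd_output[0] and nodename_dict
toy_labels = [[0, 0, 1], [0, 1]]                  # community labels at t=0 and t=1
toy_names = {0: ["A", "B", "C"], 1: ["A", "C"]}   # nodes present at each timestep

toy_membership = pd.concat(
    [pd.Series(x, index=toy_names[n]) for n, x in enumerate(toy_labels)],
    axis=1, sort=True)
print(toy_membership)
```

Article B is absent at the second timestep, so its entry in column 1 is NaN. Rows are articles, columns are timesteps, and each cell is a community label.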

Extract 'Event Reactions': communities that overlap with the event date (column 27) and contain at least one of the previously specified core articles. These are the groups of articles related to the event.

In [None]:
ev_reactions = wnn.cd.extract_event_reactions(membership_df, core, list(membership_df.index))
for k, v in ev_reactions.items():
    print(k)
    display(v) # NaN simply indicates not part of the community for this timestep
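
The selection rule described above can be sketched on a toy membership table. This is a hypothetical re-implementation for illustration, not the code behind `wnn.cd.extract_event_reactions`; it assumes that a community keeps the same label across timesteps, and the function name `event_communities` is made up.

```python
import numpy as np
import pandas as pd

# Toy membership table: rows are articles, columns are timesteps, cells are labels
membership = pd.DataFrame(
    {0: [1, 2, 2, 2], 1: [1, 1, 2, 2], 2: [1, 1, np.nan, 2]},
    index=["A", "B", "C", "D"],
)


def event_communities(membership, core, event_col):
    """Return {label: rows} for communities holding a core article at `event_col`."""
    present_core = [a for a in core if a in membership.index]
    labels = membership.loc[present_core, event_col].dropna().unique()
    out = {}
    for label in labels:
        in_comm = membership.where(membership == label)  # NaN where not a member
        out[label] = in_comm.dropna(how="all")           # drop articles never in it
    return out


reactions = event_communities(membership, core=["A"], event_col=1)
for label, rows in reactions.items():
    print(label)
    print(rows)
```

Here the core article A sits in community 1 at the event timestep, so only that community is returned. B joins it from timestep 1 onwards, which is why B has NaN in column 0.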

## 4. Visualisation

A quick look at the page view time series for each community

In [None]:
# set style
sns.set_style('darkgrid')  # the 'seaborn-darkgrid' style name is no longer available in recent matplotlib releases
palette = sns.color_palette('colorblind', len(ev_reactions))

# alter x values
tsp = ts.copy()
tsp.index = (tsp.index - tsp.index[len(tsp)//2]).days

# plot page views for each community
fig, ax = plt.subplots(figsize=(7,5))
for n, (k, v) in enumerate(ev_reactions.items()):
    ax.plot(tsp[v.index], lw=1, alpha=0.25, c=palette[n])
    ax.plot(tsp[v.index].mean(axis=1), lw=2, c=palette[n])

# add legends
legend_elements1 = [Line2D([0], [0], color='k', lw=1, alpha=0.25,
                           label='Individual Article'),
                    Line2D([0], [0], color='k', lw=2,
                           label='Community Mean')]
legend_elements2 = [Line2D([0], [0], color=palette[n], lw=1,
                           label=k)
                    for n, k in enumerate(ev_reactions.keys())] 
l1 = ax.legend(handles=legend_elements1, loc=2)
l2 = ax.legend(handles=legend_elements2, title='Community', loc=4)
ax.add_artist(l1)

# tune other elements
ax.set_ylabel('Page views (daily)')
ax.set_xlabel('Days from event')
ax.set_xlim(-30, 30)
ax.set_yscale('log')
ax.set_title('Page views towards articles related to the Anchorage earthquake in 2018')
plt.show()
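
The "alter x values" step in the plotting cell converts dates into whole days relative to the middle row of the table, so it assumes the event date sits in the middle of the collected period. A toy example with invented view counts:

```python
import pandas as pd

dates = pd.date_range("2018-11-28", periods=5, freq="D")  # centred on 2018-11-30
toy = pd.DataFrame({"views": [10, 12, 300, 80, 40]}, index=dates)

centre = toy.index[len(toy) // 2]        # middle row, here the event date
toy.index = (toy.index - centre).days    # whole days before (-) or after (+) the event
print(toy.index.tolist())
```

If your own data is not centred on the event, subtracting the event date directly (for example `pd.Timestamp("2018-11-30")`) in place of `centre` gives the same kind of axis.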