# US Alexa Ranks of Online News Media

## Instructions

1. To update Alexa US rank data and overwrite `ranks.csv`, simply run all cells below.
2. To add website(s) to the rank data collection, fill in the values in `add_sites` below and run all cells. Make sure to enter the URL that Alexa uses for ranking.
3. To remove website(s) from all datasets and from future data collection, add site IDs to `remove_sites` below and run all cells.

In [1]:
add_sites = [] # list of site (name, url) tuples of strings

In [2]:
remove_sites = [] # list of site ids (integers)
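As an illustration of the expected shapes, the two lists might be filled in as shown below. The site name, URL, and ID are hypothetical placeholders rather than entries from `sites.csv`, and `check_inputs` is an illustrative helper that is not part of this notebook.

```python
# Hypothetical values, shown only to illustrate the expected format.
example_add_sites = [('Example News', 'example.com')]  # (name, url) tuples of strings
example_remove_sites = [42]                            # integer site ids

def check_inputs(add_sites, remove_sites):
    """Return True if both lists have the shapes the cells below expect."""
    adds_ok = all(
        isinstance(pair, tuple) and len(pair) == 2 and all(isinstance(s, str) for s in pair)
        for pair in add_sites
    )
    removes_ok = all(isinstance(site_id, int) for site_id in remove_sites)
    return adds_ok and removes_ok

print(check_inputs(example_add_sites, example_remove_sites))
```

The URL should be the bare domain that Alexa uses for ranking, in the same form as the `url` column of `sites.csv`.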

## Package Imports

In [3]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)

## Sites

In [5]:
sites = pd.read_csv('sites.csv')

Add requested sites.

In [6]:
for (name, url) in add_sites:
    # `in` on a Series tests the index labels, so compare against the values instead
    if url not in sites['url'].values:
        sites = sites.append({'id':sites.id.max()+1, 'name':name, 'url':url}, ignore_index=True)
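One pandas detail worth keeping in mind for the duplicate check: the `in` operator applied to a Series tests its index labels, not its stored values. A minimal sketch with made-up data:

```python
import pandas as pd

urls = pd.Series(['politico.com', 'cnbc.com'])

in_index = 'cnbc.com' in urls            # tests the index labels 0 and 1
in_values = 'cnbc.com' in urls.values    # tests the stored strings
in_isin = urls.isin(['cnbc.com']).any()  # same answer, staying within pandas

print(in_index, in_values, in_isin)
```

Only the last two forms detect a URL that is already present in the column.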

Remove requested sites.

In [7]:
sites = sites.loc[~sites['id'].isin(remove_sites)]

View full list of sites.

In [8]:
sites.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79 entries, 0 to 78
Data columns (total 3 columns):
id      79 non-null int64
name    79 non-null object
url     79 non-null object
dtypes: int64(1), object(2)
memory usage: 2.5+ KB


In [9]:
sites

Unnamed: 0,id,name,url
0,0,FiveThirtyEight,fivethirtyeight.com
1,1,Politico,politico.com
2,2,MSNBC,msnbc.com
3,3,Washington Post,washingtonpost.com
4,4,Business Insider,businessinsider.com
5,5,Washington Times,washingtontimes.com
6,6,The Daily Stormer,dailystormer.name
7,7,CNBC,cnbc.com
8,8,The Hill,thehill.com
9,9,The Intercept,theintercept.com


Save `sites`.

In [10]:
sites.to_csv('sites.csv', index=False)

## Ranks

In [11]:
ranks = pd.read_csv('ranks.csv')

Remove requested sites.

In [12]:
ranks = ranks.loc[~ranks['id'].isin(remove_sites)]

Current ranks info:

In [13]:
ranks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 769 entries, 0 to 768
Data columns (total 3 columns):
datetime    769 non-null object
id          769 non-null int64
rank        769 non-null int64
dtypes: int64(2), object(1)
memory usage: 24.0+ KB


In [14]:
ranks.sample(3)

Unnamed: 0,datetime,id,rank
463,2018-03-07 07:35:15.605233,6,7795
611,2018-03-07 22:17:18.107721,78,633
487,2018-03-07 07:35:15.605233,30,107


## Scrape Current Site Ranks

In [15]:
def find_rank(url):
    'Scrape the Alexa Rank in the United States of the given URL'
    try:
        # read_html returns every table on the page; look for the country-rank table
        for table in pd.read_html('http://www.alexa.com/siteinfo/%s' % url):
            if {'Country', 'Rank in Country'} <= set(table.columns):
                country_ranks = table.set_index('Country')
                return country_ranks.loc['United States', 'Rank in Country']
    except Exception:
        pass  # network error, no tables found, or no United States row
    print('\tWARNING: Lookup failed on %s' % url)
    return np.nan
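The table search inside `find_rank` can be exercised without a network connection. The sketch below assumes the page yields one table with `Country` and `Rank in Country` columns, as the scraper expects. `us_rank_from_tables` and the sample tables are hypothetical stand-ins for the list that `pd.read_html` returns.

```python
import numpy as np
import pandas as pd

def us_rank_from_tables(tables, country='United States'):
    """Return the rank for `country` from the first table that has the expected columns."""
    for table in tables:
        if {'Country', 'Rank in Country'} <= set(table.columns):
            match = table.loc[table['Country'] == country, 'Rank in Country']
            if not match.empty:
                return match.iloc[0]
    return np.nan  # no suitable table or no row for the country

# Made-up tables standing in for the output of pd.read_html
tables = [
    pd.DataFrame({'Keyword': ['news'], 'Percent': ['1.2%']}),
    pd.DataFrame({'Country': ['United States', 'Canada'], 'Rank in Country': [24, 31]}),
]
print(us_rank_from_tables(tables))
```

Separating the parsing from the download keeps the lookup logic testable even when the remote page is unavailable or changes layout.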

Find the current ranks of all sites in `sites` and append the results to `ranks`.

In [16]:
dtime = datetime.datetime.utcnow() # Current UTC datetime

print('Scraping current site ranks...')
for _, site in sites.iterrows():  # index labels can have gaps after sites are removed
    siterank = find_rank(site['url'])
    if pd.notnull(siterank):
        ranks = ranks.append({'datetime':dtime, 'id':site['id'], 'rank':siterank}, ignore_index=True)
    
print('Done')

Scraping current site ranks...
Done
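An alternative way to structure the collection step is to gather the new rows in a list and build a single frame at the end, which avoids growing `ranks` one row at a time. This is a sketch under assumptions: `collect_ranks`, `demo_sites`, and `fake_lookup` are hypothetical names, and the lookup is a stub rather than the real scraper.

```python
import datetime
import pandas as pd

def collect_ranks(sites, lookup, when):
    """Build a frame of (datetime, id, rank) rows, skipping failed lookups."""
    rows = []
    for _, site in sites.iterrows():  # iterrows does not depend on the index labels
        rank = lookup(site['url'])
        if pd.notnull(rank):
            rows.append({'datetime': when, 'id': site['id'], 'rank': rank})
    return pd.DataFrame(rows, columns=['datetime', 'id', 'rank'])

# Made-up data; fake_lookup stands in for the scraper and fails on one site
demo_sites = pd.DataFrame({'id': [0, 2], 'url': ['a.example', 'b.example']}, index=[0, 5])
fake_lookup = {'a.example': 10, 'b.example': float('nan')}.get
new_rows = collect_ranks(demo_sites, fake_lookup, datetime.datetime(2018, 3, 11))
print(new_rows)
```

The result could then be combined with the existing data in one step, for example with `pd.concat([ranks, new_rows], ignore_index=True)`.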


In [17]:
ranks.tail()

Unnamed: 0,datetime,id,rank
843,2018-03-11 00:23:01.467944,75,16989
844,2018-03-11 00:23:01.467944,76,1153
845,2018-03-11 00:23:01.467944,78,639
846,2018-03-11 00:23:01.467944,79,1294
847,2018-03-11 00:23:01.467944,80,2361


Save `ranks.csv`.

In [18]:
ranks.to_csv('ranks.csv', index=False)

## Data Merging and Pivoting

Convert datetime strings to datetime objects.

In [19]:
# Assign the column directly so that it takes the datetime64 dtype
ranks['datetime'] = pd.to_datetime(ranks['datetime'])

Create dataframe with datetime, name, and rank data merged from `sites` and `ranks`.

In [20]:
site_ranks = ranks.merge(sites, on='id')
site_ranks = site_ranks.loc[:, ['datetime', 'name', 'rank']]

In [21]:
site_ranks.sample(3)

Unnamed: 0,datetime,name,rank
710,2018-03-07 07:35:15.605233,Daily Wire,829
4,2018-03-06 01:00:31.524655,ABC News,243
343,2018-03-04 03:34:24.925354,The Independent,391


Pivot `site_ranks`.

In [22]:
site_ranks = site_ranks.pivot(index='datetime', columns='name', values='rank')
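To make the reshaping concrete, here is a small sketch of the same merge-then-pivot sequence on made-up data. The site names and ranks are invented for illustration.

```python
import pandas as pd

demo_sites = pd.DataFrame({'id': [0, 1], 'name': ['Site A', 'Site B']})
demo_ranks = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-03-10', '2018-03-10', '2018-03-11', '2018-03-11']),
    'id': [0, 1, 0, 1],
    'rank': [50, 7, 48, 9],
})

# Long form: one row per (datetime, site) observation
long_form = demo_ranks.merge(demo_sites, on='id')[['datetime', 'name', 'rank']]
# Wide form: one row per datetime, one column per site
wide_form = long_form.pivot(index='datetime', columns='name', values='rank')
print(wide_form)
```

Sites with no observation at a given datetime appear as `NaN` in the wide form, which is why the rank columns in the pivoted frame are floats.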

## Most Recent Ranks

### Sorted by Site Name

In [23]:
pd.DataFrame({'US Alexa Rank, %s' % site_ranks.index[-1].date():site_ranks.iloc[-1]})

name,"US Alexa Rank, 2018-03-11"
ABC News,240.0
AP News,1077.0
Alternet,2964.0
Axios,1012.0
BBC,80.0
Bloomberg,161.0
Breitbart,59.0
Business Insider,105.0
BuzzFeed,65.0
CBS News,205.0


### Sorted by Rank

In [24]:
site_ranks.sort_values(by=site_ranks.index[-1], axis=1, inplace=True)
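Sorting with `axis=1` reorders the columns rather than the rows, using the values in the row named by `by`. A minimal sketch on made-up data:

```python
import pandas as pd

wide = pd.DataFrame(
    {'Site A': [50, 48], 'Site B': [7, 9], 'Site C': [300, 20]},
    index=pd.to_datetime(['2018-03-10', '2018-03-11']),
)
# Order the columns by their value in the most recent row
by_latest = wide.sort_values(by=wide.index[-1], axis=1)
print(list(by_latest.columns))
```

Here the most recent row holds 48, 9, and 20, so the columns come out in the order Site B, Site C, Site A.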

In [25]:
pd.DataFrame({'US Alexa Rank, %s' % site_ranks.index[-1].date():site_ranks.iloc[-1]})

name,"US Alexa Rank, 2018-03-11"
CNN,24.0
New York Times,31.0
Fox News,56.0
Breitbart,59.0
BuzzFeed,65.0
Washington Post,66.0
HuffPost,76.0
BBC,80.0
Vice,81.0
USA Today,99.0


## Visualizations

### Most Recent Site Ranks

In [26]:
import plotly
import plotly.plotly as py
import plotly.graph_objs as go

In [27]:
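import os
# Read the API key from the environment instead of hard-coding it (the variable name PLOTLY_API_KEY is an assumption)
plotly.tools.set_credentials_file(username='jgcorliss', api_key=os.environ['PLOTLY_API_KEY'])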

In [28]:
# Horizontal bar chart of the latest rank per site, on a log axis because ranks span orders of magnitude
data = [go.Bar(y=site_ranks.columns, x=site_ranks.iloc[-1], orientation='h')]
layout = go.Layout(
    title='Most Recent Site Ranks - %s UTC' % site_ranks.index[-1],
    yaxis={},
    xaxis={'title':'US Alexa Rank', 'type':'log'},
    showlegend=False,
    height=1400,
    margin={'l':200},
    hovermode='closest',
    hoverlabel={'bgcolor':'white', 'namelength':-1}
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='alexa-ranks-current')

Link to this plot: https://plot.ly/~jgcorliss/12/

### Site Ranks History

In [29]:
data = [go.Scatter(x=site_ranks.index, y=site_ranks[col], mode="markers+lines", name=col) for col in site_ranks.columns]
layout = go.Layout(
    title='Site Ranks History',
    xaxis={'title':'Datetime (UTC)'},
    yaxis={'title':'US Alexa Rank', 'type':'log'},
    showlegend=False,
    height=2400,
    hovermode='closest',
    hoverlabel={'namelength':-1}
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='alexa-ranks-history')

Link to this plot: https://plot.ly/~jgcorliss/14/