# US Alexa Ranks of Online News Media

## Instructions

1. To update Alexa US rank data and overwrite `ranks.csv`, simply run all cells below.
2. To add website(s) to the rank data collection, fill in the values in `add_sites` below and run all cells. Make sure to enter the URL that Alexa uses for ranking.
3. To remove website(s), add site names to `remove_sites` below and run all cells.

In [8]:
add_sites = [] # list of site (name, url) tuples

In [9]:
remove_sites = [] # list of site names
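
For illustration, the two lists might be filled in and sanity-checked as sketched below. The site names and URLs are hypothetical placeholders, and `check_site_entries` is a helper invented for this example. The bare-domain check is an assumption based on the `url` values in `sites.csv` (for example `politico.com`), not a documented Alexa requirement.

```python
def check_site_entries(add_sites, remove_sites):
    """Return a list of problems found in the two input lists (empty if none)."""
    problems = []
    for entry in add_sites:
        if not (isinstance(entry, tuple) and len(entry) == 2):
            problems.append('not a (name, url) tuple: %r' % (entry,))
            continue
        name, url = entry
        # Assumption: urls are stored as bare domains, with no scheme or path
        if '://' in url or '/' in url:
            problems.append('url should be a bare domain: %r' % url)
    for name in remove_sites:
        if not isinstance(name, str):
            problems.append('site name should be a string: %r' % (name,))
    return problems

# Hypothetical values, for illustration only
add_sites = [('Example News', 'example.com')]
remove_sites = ['Example Gazette']
print(check_site_entries(add_sites, remove_sites))
```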

## Package Imports

In [1]:
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)

## Read in Data

### Sites

In [None]:
sites = pd.read_csv('sites.csv')

In [4]:
sites.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79 entries, 0 to 78
Data columns (total 3 columns):
id      79 non-null int64
name    79 non-null object
url     79 non-null object
dtypes: int64(1), object(2)
memory usage: 1.9+ KB


In [6]:
sites

Unnamed: 0,id,name,url
0,0,FiveThirtyEight,fivethirtyeight.com
1,1,Politico,politico.com
2,2,MSNBC,msnbc.com
3,3,Washington Post,washingtonpost.com
4,4,Business Insider,businessinsider.com
5,5,Washington Times,washingtontimes.com
6,6,The Daily Stormer,dailystormer.name
7,7,CNBC,cnbc.com
8,8,The Hill,thehill.com
9,9,The Intercept,theintercept.com


### Ranks

In [3]:
ranks = pd.read_csv('ranks.csv')

In [16]:
ranks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 151 entries, 0 to 150
Data columns (total 3 columns):
datetime    151 non-null object
id          151 non-null int64
rank        151 non-null int64
dtypes: int64(2), object(1)
memory usage: 3.6+ KB


In [25]:
ranks.tail()

Unnamed: 0,datetime,id,rank
146,2018-03-02 05:04:22.252590,29,162
147,2018-03-02 05:04:22.252590,3,61
148,2018-03-02 05:04:22.252590,24,9515
149,2018-03-02 05:04:22.252590,50,1642
150,2018-03-02 05:04:22.252590,16,172
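
Since `ranks` stores site ids rather than names, a join against `sites` makes the output easier to read. A minimal sketch, using only a few rows copied from the outputs above; id 16 is left out of the toy `sites` table here, so its name comes back as NaN and shows how the left join treats unmatched ids.

```python
import pandas as pd

# A small excerpt of each table, not the full data
sites = pd.DataFrame({'id': [3, 7],
                      'name': ['Washington Post', 'CNBC'],
                      'url': ['washingtonpost.com', 'cnbc.com']})
ranks = pd.DataFrame({'datetime': ['2018-03-02 05:04:22.252590'] * 2,
                      'id': [3, 16],
                      'rank': [61, 172]})

# how='left' keeps every rank row, even when its id has no match in `sites`
named = ranks.merge(sites[['id', 'name']], on='id', how='left')
print(named)
```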


## Add/Remove Sites

### Sites

Add requested sites.

In [11]:
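for (name, url) in add_sites:
    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    new_row = pd.DataFrame([{'id': sites['id'].max() + 1, 'name': name, 'url': url}])
    sites = pd.concat([sites, new_row], ignore_index=True)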

Remove requested sites.

In [13]:
ids_to_remove = sites.loc[sites['name'].isin(remove_sites), 'id']  # look up ids before the rows are dropped
for name in remove_sites:
    sites = sites.loc[sites['name'] != name]

Save `sites`.

In [14]:
sites.to_csv('sites.csv', index=False)

### Ranks

Remove requested sites.

In [26]:
# `ids_to_remove` was collected above, before the matching rows left `sites`
ranks = ranks.loc[~ranks['id'].isin(ids_to_remove)]
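
Because `ranks` stores ids rather than names, the names in `remove_sites` have to be resolved to ids while the matching rows are still present in `sites`. A minimal sketch of that ordering on made-up data:

```python
import pandas as pd

# Made-up tables, for illustration only
sites = pd.DataFrame({'id': [0, 1, 2], 'name': ['A', 'B', 'C']})
ranks = pd.DataFrame({'id': [0, 1, 2, 1], 'rank': [10, 20, 30, 25]})
remove_sites = ['B']

# Resolve names to ids first, while 'B' is still listed in `sites`
ids_to_remove = sites.loc[sites['name'].isin(remove_sites), 'id']

# Then drop the site and every rank row that belongs to it
sites = sites.loc[~sites['name'].isin(remove_sites)]
ranks = ranks.loc[~ranks['id'].isin(ids_to_remove)]

print(sites['id'].tolist())
print(ranks['id'].tolist())
```

If the lookup ran after `sites` had been filtered, `ids_to_remove` would be empty and no rank rows would be dropped.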

## Scrape Current Site Ranks

Helper function that scrapes alexa.com for a site's current "Rank in United States".

In [None]:
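def find_US_rank(url):
    'Find the Alexa Rank in the United States of the given URL'
    try:
        dfs = pd.read_html('http://www.alexa.com/siteinfo/%s' % url)
        # Scan the page's tables for the one that lists ranks by country
        for table in dfs:
            try:
                return table.set_index('Country').loc['United States', 'Rank in Country']
            except KeyError:
                continue  # not the country table, or no United States row
        raise LookupError('no United States rank found')
    except Exception:
        # Any failure (network, parsing, missing table) is reported and recorded as NaN
        print('\tWARNING: Lookup failed on %s' % url)
        return np.nan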
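
The table-selection step can be tried offline. The sketch below applies the same idea to a hand-built list of DataFrames standing in for what `pd.read_html` returns; `pick_us_rank` and the table contents are made up for illustration and say nothing about Alexa's actual page layout.

```python
import numpy as np
import pandas as pd

def pick_us_rank(tables):
    """Return the 'United States' rank from the first table that has one, else NaN."""
    for table in tables:
        try:
            return table.set_index('Country').loc['United States', 'Rank in Country']
        except KeyError:
            continue  # wrong table, or no United States row
    return np.nan

# Made-up tables standing in for the list that pd.read_html returns
other = pd.DataFrame({'Keyword': ['news'], 'Share': ['1%']})
by_country = pd.DataFrame({'Country': ['United States', 'Canada'],
                           'Rank in Country': [61, 140]})

print(pick_us_rank([other, by_country]))
print(pick_us_rank([other]))
```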

Find the current ranks of all sites in `sites` and store the results in the DataFrame `current_ranks`, one row per site, with the same columns as `ranks`.

In [None]:
scrape_time = datetime.datetime.utcnow()  # one UTC timestamp shared by every row of this scrape
rows = []

print('Scraping alexa.com...')
for (idx, row) in sites.iterrows():
    rows.append({'datetime': scrape_time, 'id': row['id'], 'rank': find_US_rank(row['url'])})
current_ranks = pd.DataFrame(rows, columns=['datetime', 'id', 'rank'])

print('Done')

## Update Ranks Data

Convert the stored timestamps in `ranks` from strings to datetimes.

In [None]:
ranks['datetime'] = pd.to_datetime(ranks['datetime'])

Append the current ranks to `ranks`.

In [None]:
ranks = pd.concat([ranks, current_ranks], ignore_index=True)

Drop rows for any site that is no longer listed in `sites`.

In [None]:
ranks = ranks.loc[ranks['id'].isin(sites['id'])]

In [None]:
ranks.tail()

Save `ranks.csv`.
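
A minimal sketch of the append-and-save step, assuming `ranks` keeps the `datetime`, `id`, `rank` columns reported by `ranks.info()` and that `index=False` is wanted, as in the `sites.to_csv` call. The data are made up, and the file is written to a temporary directory rather than over the real `ranks.csv`.

```python
import os
import tempfile
import pandas as pd

# Made-up history and one made-up scrape, both in (datetime, id, rank) form
ranks = pd.DataFrame({'datetime': pd.to_datetime(['2018-03-01 05:00:00'] * 2),
                      'id': [3, 16], 'rank': [60, 170]})
current_ranks = pd.DataFrame({'datetime': pd.to_datetime(['2018-03-02 05:00:00'] * 2),
                              'id': [3, 16], 'rank': [61, 172]})

# Stack the new scrape under the existing history
ranks = pd.concat([ranks, current_ranks], ignore_index=True)

# Temporary path for this sketch; the notebook itself would target 'ranks.csv'
path = os.path.join(tempfile.mkdtemp(), 'ranks.csv')
ranks.to_csv(path, index=False)  # index=False keeps the row index out of the file

# Read the file back to confirm the round trip
reloaded = pd.read_csv(path, parse_dates=['datetime'])
print(reloaded.shape)
```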

## Plots

In [None]:
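# Reshape the long-format ranks to one column per site, labeled by site name
names = sites.set_index('id')['name'].to_dict()
wide = ranks.pivot_table(index='datetime', columns='id', values='rank').rename(columns=names)
wide.index = pd.to_datetime(wide.index)
f, ax = plt.subplots(figsize=(12, 36))
ax.set(yscale="log")
for col in wide.columns:
    ax.plot(wide.index, wide[col], '.-')
ax.legend(wide.columns, bbox_to_anchor=(1,1))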
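
Plotting one line per site needs one column per site. A sketch of that reshape on made-up long-format data, assuming the `datetime`, `id`, `rank` columns shown by `ranks.info()`; the `names` lookup is hypothetical.

```python
import pandas as pd

# Made-up long-format ranks: one row per (scrape time, site)
ranks = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-03-01', '2018-03-01', '2018-03-02', '2018-03-02']),
    'id': [3, 16, 3, 16],
    'rank': [60, 170, 61, 172],
})
names = {3: 'Site A', 16: 'Site B'}  # hypothetical id -> name lookup

# One row per scrape time, one column per site
wide = ranks.pivot_table(index='datetime', columns='id', values='rank').rename(columns=names)
print(wide)
```

Each column of `wide` can then be drawn as its own line against `wide.index`.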