## Overview

This notebook analyzes the following statistics for the [IBM Watson Data Lab publication on medium.com](https://medium.com/ibm-watson-data-lab):

   Publication statistics:
       - Views 
       - Reads 
       - Recommendations 
       - Recommendations/reads ratio 


   Story statistics:
       - Views 
       - Reads 
       - Reads/views ratio
       - Recommendations 
       - Recommendations/reads ratio 
        
   Author statistics:     
       - Total views
       - Total reads
       - Total recommendations
       - Average reads/views ratio
       - Average recommendations/reads ratio

   Tag statistics:     
       - Total views
       - Total reads
       - Total recommendations
       - Average reads/views ratio
       - Average recommendations/reads ratio

  The measures are defined as follows:
     - Views: Number of visitors who landed on the story page
     - Reads: Number of visitors who read the complete story (this is an estimate)
     - Recommendations: Number of visitors who recommended the story
     - Reads/views ratio (derived): Fraction of visitors who read the entire story (and not just part of it)
     - Recommendations/reads ratio (derived): Fraction of readers who recommended a story after reading it
     

 > A higher value is better for all measures.
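As a quick illustration of the two derived measures, here is a minimal sketch with made-up numbers for a single story. The helper `ratio` and its zero-denominator guard are assumptions added for the example; the notebook cells further down divide the columns directly.

```python
# Hypothetical numbers for a single story (not taken from the publication data).
views = 1000           # visitors who landed on the story page
reads = 400            # visitors estimated to have read the whole story
recommendations = 40   # visitors who recommended the story

def ratio(numerator, denominator):
    """Return numerator/denominator as a float, or 0.0 when the denominator is 0."""
    return float(numerator) / denominator if denominator else 0.0

rv_ratio = ratio(reads, views)             # share of viewers who read the story
rr_ratio = ratio(recommendations, reads)   # share of readers who recommended it

print('reads/views: {:.2f}'.format(rv_ratio))
print('recommendations/reads: {:.2f}'.format(rr_ratio))
```

With these numbers, 4 out of 10 viewers read the story and 1 out of 10 readers recommended it.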

In [None]:
import IPython
import pandas as pd
import pixiedust
import re
import requests
from datetime import datetime

Configure the input data sets and load the data

In [None]:
# TODO: specify the input data file names
#
stats_csv_url = 'https://raw.githubusercontent.com/ibm-watson-data-lab/medium-publication-stats/master/assets/stats.csv'
meta_csv_url = 'https://raw.githubusercontent.com/ibm-watson-data-lab/medium-publication-stats/master/assets/metadata.csv'

In [None]:
def httpLoad(filename):
    # verify that the file is reachable before handing the URL to pandas
    r = requests.get(filename)
    if r.status_code == 200:
        df = pd.read_csv(filename, encoding='utf-8')
        print('Loaded {} into DataFrame. Row count: {}'.format(filename, len(df.index)))
        return df
    else:
        # report the HTTP status code (not the response body)
        print('Error loading {}. HTTP status: {}'.format(filename, r.status_code))
        return None

# load post stats     
stats_df = httpLoad(stats_csv_url)
# load post metadata
meta_df = httpLoad(meta_csv_url) 

### Prepare for stats analysis

In [None]:
# merge both DataFrames
input_df = stats_df.merge(meta_df, how = 'outer', on = 'title', indicator = True)

# identify and report missing data
def validation(row):
    '''
    input: data row
    output: True if the row is valid, False otherwise
    '''
    if row['_merge'] == 'left_only':
        # issue warning
        print(u'Warning. Metadata are missing for row with title "{}"'.format(row['title']))
        return False
    elif row['_merge'] == 'right_only':
        # issue warning
        print(u'Warning. Stats are missing for row with title "{}"'.format(row['title']))
        return False
    return True

# create analysis DataFrame, keeping only rows that contain stats and metadata
# (copy the filtered rows so that new columns can be added without pandas warnings)
analysis_df = input_df[input_df.apply(validation, axis = 1)].copy()
del analysis_df['_merge']

# enrich DataFrame
today = datetime.now().date()

# helper
def calcElapsed(col):
    '''
        input: col - date expressed as 'YYYY-MM'
        output: number of months since 'YYYY-MM', including the current month (value is always > 0)
    '''
    d = datetime.strptime(col, '%Y-%m').date()
    return ((today.year - d.year) * 12 + today.month - d.month) + 1

# calculate for how many months a story has been published
analysis_df['duration'] = analysis_df['full_month'].apply(calcElapsed)

# helper
def calcAvgInt(col1, col2):
    '''
        input: col1 - numeric column
        input: col2 - numeric column, > 0
        output: int(col1/col2)
    '''
    return int(col1/col2)

# calculate average views per published month
analysis_df['avg_views'] = analysis_df.apply(lambda row: calcAvgInt(row['views'],row['duration']), axis=1)
# calculate average reads per published month
analysis_df['avg_reads'] = analysis_df.apply(lambda row: calcAvgInt(row['reads'],row['duration']), axis=1)
# calculate average number of fans per published month
analysis_df['avg_fans'] = analysis_df.apply(lambda row: calcAvgInt(row['fans'],row['duration']), axis=1)

# calculate fans/views ratio
analysis_df['fv_ratio'] = analysis_df['fans'] / analysis_df['views']

# calculate reads/views ratio
analysis_df['rv_ratio'] = analysis_df['reads'] / analysis_df['views']

print('Post DataFrame dimensions: {} (rows, columns)'.format(analysis_df.shape))
IPython.display.display(analysis_df.head(5))
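The outer merge with `indicator = True` used in the preparation cell can be illustrated on toy data. This is a minimal sketch: the titles, values and the `author` column are made up for the example and are not taken from the input files.

```python
import pandas as pd

# Toy stand-ins for the stats and metadata files (titles and values are made up).
stats = pd.DataFrame({'title': ['A', 'B'], 'views': [100, 50]})
meta = pd.DataFrame({'title': ['B', 'C'], 'author': ['Ann', 'Bob']})  # 'author' is hypothetical

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = stats.merge(meta, how='outer', on='title', indicator=True)

# Keep only rows present in both inputs, then drop the helper column.
complete = merged[merged['_merge'] == 'both'].drop(columns='_merge')

print(merged[['title', '_merge']])
print(complete)
```

Story `A` has stats but no metadata and story `C` has metadata but no stats, so only story `B` survives the filter.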

### Prepare for tag analysis

In [None]:
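# create tag dataframe
# itertuples() yields positional tuples: row[0] is the index, row[1] the first column, and so on
tag_stats = {}
tag_associations = []
for row in analysis_df.itertuples():
    # skip stories without tags (pandas loads missing values as NaN, not None)
    if pd.notnull(row[11]):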
        for tag in row[11].split(','):
            tag = tag.strip()
            if tag in tag_stats:
                #print 'Updating ' + tag + ' ' + str(tag_stats[tag])
                tag_stats[tag]['count'] += 1
                tag_stats[tag]['views'] += row[5]
                tag_stats[tag]['reads'] += row[6]
                tag_stats[tag]['fans'] += row[8]
                #print 'Updated  ' + tag + ' ' + str(tag_stats[tag])
            else:
                tag_stats[tag] = {
                                    'count': 1,
                                    'views': row[5],
                                    'reads': row[6],
                                    'fans': row[8],                         
                }
                #print tag + ' ' + str(tag_stats[tag])
                
            tag_associations.append((tag, row[1], row[3], row[10]))     

tag_associations_df = pd.DataFrame(tag_associations, columns=['tag','title','full_month', 'url'])
            
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html                
tag_analysis_df = pd.DataFrame.from_dict(tag_stats, orient='index').reset_index().rename(columns={"index": "tag"})

def calcAvgInt(col1, col2):
    return int(col1/col2)

tag_analysis_df['avg_views'] = tag_analysis_df.apply(lambda row: calcAvgInt(row['views'],row['count']), axis=1)
tag_analysis_df['avg_reads'] = tag_analysis_df.apply(lambda row: calcAvgInt(row['reads'],row['count']), axis=1)
tag_analysis_df['avg_fans'] = tag_analysis_df.apply(lambda row: calcAvgInt(row['fans'],row['count']), axis=1)

# calculate fans/views ratio for each tag
tag_analysis_df['fv_ratio'] = tag_analysis_df['fans'] / tag_analysis_df['views']

# calculate reads/views ratio
tag_analysis_df['rv_ratio'] = tag_analysis_df['reads'] / tag_analysis_df['views']

# medium.com tag URL
tag_analysis_df['url'] = 'https://medium.com/search?q=' + tag_analysis_df['tag'].str.replace(' ','+')

print('Tag DataFrame dimensions: {} (rows, columns)'.format(tag_analysis_df.shape))
IPython.display.display(tag_analysis_df.head(5))

***
## Plot basic publication statistics

 - Stories per month
 - Reads/views correlation
 - Fans/views correlation
 - Total number of views/reads/fans grouped by story publication month

In [None]:
analysis_df['count'] = 1

In [None]:
# number of posts per month
display(analysis_df)

In [None]:
# views/reads/fans per month
display(analysis_df)
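The two `display(analysis_df)` cells above depend on chart options chosen in the PixieDust UI. As a rough pandas-only counterpart, the following sketch aggregates stories and totals by publication month. The rows are made up; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Made-up rows that mimic the columns used in this notebook.
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'full_month': ['2017-01', '2017-01', '2017-02'],
    'views': [100, 50, 30],
    'reads': [40, 20, 15],
    'fans': [4, 1, 3],
})
df['count'] = 1  # one row per story, so summing 'count' counts stories

# Stories per month and total views/reads/fans per publication month.
per_month = df.groupby('full_month')[['count', 'views', 'reads', 'fans']].sum()
print(per_month)
```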

Reads/views (RV) ratio: How many people who open a story page actually read the entire story? Refer to the RV story stats chart further down for details.

In [None]:
# correlation between reads and views
display(analysis_df)

Fans/views (FV) ratio: How many people who viewed a story clapped? Refer to the FV story stats chart further down for details.

In [None]:
# correlation between fans and views
display(analysis_df)
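To put a number on the two correlations without the chart, one option is the Pearson coefficient from pandas. This is a sketch on made-up values, not the notebook's own method.

```python
import pandas as pd

# Made-up per-story numbers; column names follow the notebook.
df = pd.DataFrame({
    'views': [100, 200, 300, 400],
    'reads': [40, 90, 110, 170],
    'fans': [2, 9, 6, 12],
})

# Pearson correlation coefficients, each between -1 and 1.
reads_views_corr = df['reads'].corr(df['views'])
fans_views_corr = df['fans'].corr(df['views'])

print('reads/views correlation: {:.2f}'.format(reads_views_corr))
print('fans/views correlation: {:.2f}'.format(fans_views_corr))
```

A value close to 1 suggests that stories with more views also tend to get proportionally more reads (or fans).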

***

## Post statistics

### Views, reads and fans

Identify how many people have viewed/read/liked a story:
 * `Options` > `Values` : `views`
 * `Options` > `Values` : `reads` (default)
 * `Options` > `Values` : `fans`

In [None]:
# views/reads/fans per story
display(analysis_df)

### Conversion ratios

A higher value is better, with 0 indicating that nobody read/liked a story (boo!) and 1 that everybody read/liked a story (yay!).

Example: 1 out of 10 readers recommend story "The secret to maximizing story recommendations."

Recap: 
 * Views represent the number of visitors that accessed a story page. 
 * Reads represent the approximate number of visitors that read the story.
 * Recommends represent the number of visitors that liked a story.
 
To display the ratios select 
 * `Options` > `Values` : `rv_ratio` or
 * `Options` > `Values` : `fv_ratio` (default)

In [None]:
# fans/views or reads/views ratios
display(analysis_df)

***

## Author statistics

### Views/reads/fans for each author

To display the stats select one of these measures
 * `Options` > `Values` : `views`
 * `Options` > `Values` : `reads` (default)
 * `Options` > `Values` : `fans`
 
and the desired aggregation 

 * `Options` > `Aggregation` : `SUM` (default)
 * `Options` > `Aggregation` : `AVG` 

In [None]:
# higher values are better 
display(analysis_df)
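The `SUM` and `AVG` aggregations correspond roughly to a pandas `groupby`. In this sketch the data is made up and the column name `author` is an assumption; the actual name depends on the metadata file.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ['Ann', 'Ann', 'Bob'],   # hypothetical column name
    'views': [100, 300, 50],
    'reads': [40, 60, 25],
    'fans': [4, 6, 5],
})

by_author = df.groupby('author')[['views', 'reads', 'fans']]
totals = by_author.sum()     # counterpart of Aggregation: SUM
averages = by_author.mean()  # counterpart of Aggregation: AVG

print(totals)
print(averages)
```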

### Fans/views and reads/views ratios for each author

To display the ratios for each author select 
 * `Options` > `Values` : `rv_ratio` 
 * `Options` > `Values` : `fv_ratio` (default) 

In [None]:
# recommendation ratio author (higher is better)
display(analysis_df)

***
## Tag statistics

Each story is associated with zero or more tags. For example, a story about PixieDust might be tagged with `Data Science`.


### Tag frequencies


To display how frequently stories with a particular tag were viewed/read/recommended choose
 * `Options` > `Values` : `views` 
 * `Options` > `Values` : `avg_views` 
 * `Options` > `Values` : `reads` (default) 
 * `Options` > `Values` : `avg_reads` 
 * `Options` > `Values` : `fans` 
 * `Options` > `Values` : `avg_fans`
 
Example: Stories tagged with "ice cream" were read 5,000 times. Assuming that 10 stories were associated with the tag, the average number of reads per story is 500.
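The "ice cream" example can be reproduced with a small pure-Python sketch. The stories and read counts are invented, and tags are assumed to be stored as a comma-separated string, as in the tag preparation cell.

```python
from collections import defaultdict

# Ten made-up stories whose reads add up to 5,000.
reads_per_story = [900, 100, 700, 300, 500, 500, 600, 400, 800, 200]
stories = [
    {'reads': reads, 'tags': 'ice cream, dessert' if i % 2 else 'ice cream'}
    for i, reads in enumerate(reads_per_story)
]

# Accumulate story count and total reads for each tag.
totals = defaultdict(lambda: {'count': 0, 'reads': 0})
for story in stories:
    for tag in story['tags'].split(','):
        tag = tag.strip()
        totals[tag]['count'] += 1
        totals[tag]['reads'] += story['reads']

# Average reads per story for each tag (integer division, like calcAvgInt).
avg_reads = {tag: s['reads'] // s['count'] for tag, s in totals.items()}
print(avg_reads)
```

All ten stories carry the "ice cream" tag, so its total is 5,000 reads and its average is 500 reads per story.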

In [None]:
display(tag_analysis_df)

### Fans/views and reads/views ratios for each tag

To display the ratios for each tag select 
 * `Options` > `Values` : `rv_ratio` 
 * `Options` > `Values` : `fv_ratio` (default) 

In [None]:
display(tag_analysis_df)

### Tag/story associations

Identify stories that are associated with a particular tag:

In [None]:
tag = 'Data Science'
display(tag_associations_df[tag_associations_df['tag'] == tag])

### Related stories on medium.com

Explore other stories covering these topics:

In [None]:
# print the medium.com search URL for each tag
for row in tag_analysis_df.itertuples():
    print('Stories tagged "' + row.tag + '": ' + row.url)