## Overview

This notebook analyzes the following statistics for the [IBM Watson Data Lab publication on medium.com](https://medium.com/ibm-watson-data-lab):

Publication statistics:

 - Views
 - Reads
 - Recommendations
 - Recommendations/reads ratio

Story statistics:

 - Views
 - Reads
 - Reads/views ratio
 - Recommendations
 - Recommendations/reads ratio

Author statistics:

 - Total views
 - Total reads
 - Total recommendations
 - Average reads/views ratio
 - Average recommendations/reads ratio

Tag statistics:

 - Total views
 - Total reads
 - Total recommendations
 - Average reads/views ratio
 - Average recommendations/reads ratio

The measures are defined as follows:

 - Views: number of visitors who landed on the story page
 - Reads: number of visitors who read the complete story (this is an estimate)
 - Recommendations: number of visitors who recommended the story
 - Reads/views ratio (derived): fraction of visitors who read the entire story (and not just part of it)
 - Recommendations/reads ratio (derived): fraction of readers who recommended the story after reading it
     

 > A higher value is better for all measures.
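
As a small worked example of the two derived measures, the sketch below uses made-up numbers for a single story. The values are hypothetical and are not taken from the publication data.

```python
# Hypothetical counts for one story (illustration only).
views = 200
reads = 80
recommends = 20

# Reads/views ratio: share of visitors who read the full story.
rv_ratio = reads / float(views)

# Recommendations/reads ratio: share of readers who recommended the story.
rr_ratio = recommends / float(reads)
```

With these numbers, 40% of the visitors read the story and 25% of the readers recommended it.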

In [None]:
import pixiedust
import re

In [None]:

# @hidden_cell
credentials = {
  "auth_url": "https://identity.open.softlayer.com",
  "projectId": "**projectId**",
  "region": "**region**",
  "userId": "**userId**",
  "username": "**username**",
  "domainId": "**domainId**",
  "password": "**password**"
}

container = '**container**'

In [None]:
# load scraped stats from Object Storage

from io import StringIO
import requests
import json
import pandas as pd

# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def get_object_storage_file_with_credentials(credentials, container, filename):
    """This function returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    # build the token endpoint from the configured auth URL instead of a hard-coded copy
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],'domain': {'id': credentials['domainId']},
            'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == 'dallas':
                    url2 = ''.join([e2['url'], '/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO(resp2.text)

# Your data file was loaded into a StringIO object and you can process the data.
# Please read the documentation of pandas to learn more about your possibilities to load your data.
# pandas documentation: http://pandas.pydata.org/pandas-docs/stable/io.html
stats_df = pd.read_csv(get_object_storage_file_with_credentials(credentials, container, 'data.csv'), encoding='utf-8')
meta_df = pd.read_csv(get_object_storage_file_with_credentials(credentials, container, 'metadata.csv'), encoding='utf-8', dtype={'title':str, 'author':str, 'url':str, 'tags':str})

In [None]:
print('Medium stats dataframe dimensions: {} (rows, columns)'.format(stats_df.shape))
print('Metadata stats dataframe dimensions: {} (rows, columns)'.format(meta_df.shape))

In [None]:
analysis_df = pd.merge(stats_df, meta_df, how='left', on='title')

# calculate recommendations/reads ratio
analysis_df['rr ratio'] = analysis_df['recommends'] / analysis_df['reads']

# calculate reads/views ratio
analysis_df['rv ratio'] = analysis_df['reads'] / analysis_df['views']

analysis_df['count'] = 1

print('Analysis dataframe dimensions: {} (rows, columns)'.format(analysis_df.shape))
analysis_df.head(3)

In [None]:
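# create tag dataframe
# note: itertuples() puts the index at position 0, so the positional
# lookups below depend on the column order of analysis_df
tag_stats = {}
for row in analysis_df.itertuples():
    # missing tags are loaded as NaN (not None), so test with pd.notnull
    if pd.notnull(row[11]):
        for tag in row[11].split(','):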
            tag = tag.strip()
            if tag in tag_stats:
                #print 'Updating ' + tag + ' ' + str(tag_stats[tag])
                tag_stats[tag]['count'] += 1
                tag_stats[tag]['views'] += row[5]
                tag_stats[tag]['reads'] += row[6]
                tag_stats[tag]['recommends'] += row[8]
                #print 'Updated  ' + tag + ' ' + str(tag_stats[tag])
            else:
                tag_stats[tag] = {
                                    'count': 1,
                                    'views': row[5],
                                    'reads': row[6],
                                    'recommends': row[8],                         
                }
                #print tag + ' ' + str(tag_stats[tag])
                
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html                
tag_analysis_df = pd.DataFrame.from_dict(tag_stats, orient='index').reset_index().rename(columns={"index": "tag"})

# calculate recommendations/reads ratio for each tag
tag_analysis_df['rr ratio'] = tag_analysis_df['recommends'] / tag_analysis_df['reads']

# calculate reads/views ratio
tag_analysis_df['rv ratio'] = tag_analysis_df['reads'] / tag_analysis_df['views']

tag_analysis_df.head(3)

***
## Publication statistics

 - Stories per month
 - Total number of views/reads/recommendations grouped by story publication month
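
The charts in this section are configured through the PixieDust `display` options. For readers who want the same numbers as plain dataframes, the following is a minimal sketch using toy data. The `published` column name and all values are assumptions made for illustration; the actual date column in `analysis_df` may be named differently.

```python
import pandas as pd

# Toy data; the 'published' column name is hypothetical.
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'published': ['2017-01-05', '2017-01-20', '2017-02-11'],
    'views': [100, 50, 30],
    'reads': [40, 25, 10],
    'recommends': [4, 5, 1],
})

# Derive a year-month key from the publication date.
toy_df['month'] = pd.to_datetime(toy_df['published']).dt.strftime('%Y-%m')

# Stories per month.
stories_per_month = toy_df.groupby('month').size()

# Total views/reads/recommendations grouped by publication month.
totals_per_month = toy_df.groupby('month')[['views', 'reads', 'recommends']].sum()
```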

In [None]:
# number of posts per month
display(analysis_df)

In [None]:
# views/reads/recommendations per month
display(analysis_df)

In [None]:
# correlation between recommendations and reads
display(analysis_df)

***

## Story statistics

### Views, reads and recommendations

Identify how many people have viewed/read/recommended a story:
 - `Options` > `Values` : `views`
 - `Options` > `Values` : `reads` (default)
 - `Options` > `Values` : `recommends`

In [None]:
# views/reads/recommends per story
display(analysis_df)

### Conversion ratios

Calculate reads/views and recommendations/reads ratios for each story. A higher value is better, with 0 indicating that nobody read/liked a story (boo!) and 1 that everybody read/liked a story (yay!).

Recap: 
 * Views represent the number of visitors that accessed a story page. 
 * Reads represent the approximate number of visitors that read the story.
 * Recommends represent the number of visitors that liked a story.
 
To display the ratios select 
 * `Options` > `Values` : `rv ratio` or
 * `Options` > `Values` : `rr ratio` (default)
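
The 0-to-1 range described above assumes nonzero denominators. A story without views or reads has an undefined ratio, which pandas reports as `NaN` or `inf`. The sketch below, using made-up numbers, shows one possible way to keep such rows as missing values; it is an illustration and not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy data: the second story has no reads, the third has no views at all.
toy_df = pd.DataFrame({
    'views': [100, 10, 0],
    'reads': [50, 0, 0],
    'recommends': [5, 0, 0],
})

# 0/0 yields NaN; a nonzero count divided by 0 would yield inf,
# which is mapped to NaN here so both cases count as "undefined".
rv = (toy_df['reads'] / toy_df['views']).replace([np.inf, -np.inf], np.nan)
rr = (toy_df['recommends'] / toy_df['reads']).replace([np.inf, -np.inf], np.nan)
```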
 

In [None]:
# recommendation/reads or reads/views ratios
display(analysis_df)

***

## Author statistics

### Views/reads/recommends for each author

To display the stats select 
 * `Options` > `Values` : `views`
 * `Options` > `Values` : `reads` (default)
 * `Options` > `Values` : `recommends`
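
The per-author totals and average ratios listed in the overview can also be computed directly with a `groupby`. The following is a minimal sketch on toy data, assuming the `author`, `views`, `reads`, `recommends`, `rv ratio` and `rr ratio` columns used elsewhere in this notebook; the author names and counts are made up.

```python
import pandas as pd

# Toy data with hypothetical authors and counts.
toy_df = pd.DataFrame({
    'author': ['ann', 'ann', 'bob'],
    'views': [100, 300, 50],
    'reads': [50, 60, 25],
    'recommends': [5, 6, 5],
})
toy_df['rv ratio'] = toy_df['reads'] / toy_df['views']
toy_df['rr ratio'] = toy_df['recommends'] / toy_df['reads']

# Total views/reads/recommends per author.
author_totals = toy_df.groupby('author')[['views', 'reads', 'recommends']].sum()

# Average of the per-story ratios for each author.
author_avg_ratios = toy_df.groupby('author')[['rv ratio', 'rr ratio']].mean()
```

Note that the average of per-story ratios generally differs from the ratio of an author's totals, so it is worth deciding which of the two a chart should show.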

In [None]:
# higher values are better 
display(analysis_df)

### Recommends/reads and reads/views ratios for each author

To display the ratios for each author select 
 * `Options` > `Values` : `rr ratio`
 * `Options` > `Values` : `reads` (default)

In [None]:
# story recommendations by author (higher is better)
display(analysis_df)

### Recommends/reads and reads/views ratios for each author

To display the ratios for each author select 
 * `Options` > `Values` : `rv ratio`
 * `Options` > `Values` : `rr ratio` (default) 

In [None]:
# recommendation ratio author (higher is better)
display(analysis_df)

***
## Tag statistics

Each story is associated with zero or more tags. For example, a story about PixieDust might be tagged with `Data Science`.


### Tag frequencies

Measures:
 - Total views/reads/recommends by tag
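
The per-tag totals can also be computed without an explicit loop. The sketch below is an alternative to the dictionary-based cell earlier in this notebook, not a replacement for it. It assumes a comma-separated `tags` column and a pandas version that provides `DataFrame.explode` (0.25 or later); the toy data is made up.

```python
import pandas as pd

# Toy data; the third story has no tags.
toy_df = pd.DataFrame({
    'title': ['s1', 's2', 's3'],
    'tags': ['Data Science, Python', 'Python', None],
    'views': [100, 40, 10],
    'reads': [50, 10, 5],
    'recommends': [5, 2, 1],
})

# Drop untagged stories, then create one row per (story, tag) pair.
tagged = toy_df.dropna(subset=['tags']).copy()
tagged['tag'] = tagged['tags'].str.split(',')
tagged = tagged.explode('tag').reset_index(drop=True)
tagged['tag'] = tagged['tag'].str.strip()

# Sum the metrics per tag and count the stories carrying each tag.
tag_totals = tagged.groupby('tag')[['views', 'reads', 'recommends']].sum()
tag_totals['count'] = tagged.groupby('tag').size()

# Ratios of the per-tag totals, as in the loop-based cell.
tag_totals['rv ratio'] = tag_totals['reads'] / tag_totals['views']
tag_totals['rr ratio'] = tag_totals['recommends'] / tag_totals['reads']
```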

In [None]:
display(tag_analysis_df)

### Recommends/reads and reads/views ratios for each tag

To display the ratios for each tag select 
 * `Options` > `Values` : `rv ratio` 
 * `Options` > `Values` : `rr ratio` (default) 

In [None]:
display(tag_analysis_df)