# Introduction: Article Investigation

This notebook looks at the articles I published over the past year. It is primarily for enjoyment and to support the article "What I learned by writing one data science article per week". It is also a fun opportunity to use plotly.

You can also run this notebook on mybinder (coming soon).

In [3]:
# Standard Data Science Helpers
import numpy as np
import pandas as pd
import scipy

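# plotly.plotly moved to the separate chart_studio package in plotly 4.
# The online `py` module is not used in this notebook, so the import is disabled.
# import plotly.plotly as py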
import plotly.graph_objs as go
from plotly.offline import iplot

import cufflinks as cf
cf.set_config_file(world_readable=True, theme="pearl")
cf.go_offline(connected=True)

# Extra options
pd.options.display.max_rows = 10
pd.options.display.max_columns = 25
# Show all code cells outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

## Data Loading

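We'll load the data from a `parquet` file, a format that stores the column data types alongside the values. Dates and numbers therefore come back with the correct types when the file is read, which makes reading and writing data easier (at least in pandas). Reading `parquet` in pandas requires either `pyarrow` or `fastparquet` to be installed.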

In [4]:
# pd.read_parquet needs a parquet engine installed (pyarrow or fastparquet)
df = pd.read_parquet('https://github.com/WillKoehrsen/Data-Analysis/blob/master/medium/data/medium_data_2019_01_28?raw=true')
df.tail()

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
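
To illustrate why the stored data types matter, here is a minimal sketch on a hypothetical two-row frame (not the article data). A CSV round trip returns the date column as plain text, while a parquet round trip should keep it as a datetime. The parquet part assumes `pyarrow` is installed and is skipped otherwise; the names `df_demo`, `from_csv` and `from_parquet` are made up for this example.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the article data
df_demo = pd.DataFrame({
    "published_date": pd.to_datetime(["2018-01-08", "2018-01-15"]),
    "word_count": [1200, 2400],
})

# CSV stores only text, so the date column is not read back as a datetime
csv_buffer = io.StringIO()
df_demo.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
from_csv = pd.read_csv(csv_buffer)
print(from_csv.dtypes)

# Parquet stores the schema with the data (assumes pyarrow is installed)
from_parquet = None
try:
    parquet_buffer = io.BytesIO()
    df_demo.to_parquet(parquet_buffer, engine="pyarrow")
    parquet_buffer.seek(0)
    from_parquet = pd.read_parquet(parquet_buffer, engine="pyarrow")
    print(from_parquet.dtypes)
except ImportError:
    print("pyarrow is not installed; skipping the parquet round trip")
```

With CSV, the dates would need `parse_dates` on every read; with parquet no extra arguments are required.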

In [None]:
import ipywidgets as widgets
from ipywidgets import interact

In [None]:
df_orig = df.copy()
df = df.set_index('published_date')

# Articles in 2018

Let's quickly see how many articles I published in 2018. This uses an interactive widget to select dates.

In [2]:
def print_articles_published(start_date, end_date):
    start_date = pd.Timestamp(start_date)
    end_date = pd.Timestamp(end_date)
    stat_df = df.loc[(df.index >= start_date) & (df.index <= end_date)].copy()
    total_words = stat_df['word_count'].sum()
    total_read_time = stat_df['read_time'].sum()
    num_articles = len(stat_df)
    print(f'You published {num_articles} articles between {start_date.date()} and {end_date.date()}.')
    print(f'These articles totalled {total_words:,} words and {total_read_time/60:.2f} hours to read.')
    
_ = interact(print_articles_published,
             start_date=widgets.DatePicker(value=pd.to_datetime('2018-01-01')),
             end_date=widgets.DatePicker(value=pd.to_datetime('2018-12-31')))

NameError: name 'interact' is not defined
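
The date filter inside `print_articles_published` can be checked without the widget. The sketch below applies the same inclusive date mask to a hypothetical three-row frame; `toy` and `summarize_period` are names invented for this example, and the function returns the numbers instead of printing them.

```python
import pandas as pd

# Hypothetical toy data indexed by publication date
toy = pd.DataFrame(
    {"word_count": [1000, 2500, 1800], "read_time": [5, 12, 9]},
    index=pd.to_datetime(["2017-12-20", "2018-03-01", "2018-11-15"]),
)


def summarize_period(frame, start_date, end_date):
    """Return (article count, total words, total reading hours) for an inclusive date range."""
    start_date = pd.Timestamp(start_date)
    end_date = pd.Timestamp(end_date)
    # Boolean mask keeps rows whose index falls between the two dates, endpoints included
    subset = frame.loc[(frame.index >= start_date) & (frame.index <= end_date)]
    total_hours = subset["read_time"].sum() / 60  # read_time is in minutes
    return len(subset), int(subset["word_count"].sum()), total_hours


num_articles, total_words, total_hours = summarize_period(toy, "2018-01-01", "2018-12-31")
print(num_articles, total_words, round(total_hours, 2))
```

Only the two 2018 rows pass the mask, so the December 2017 article is excluded from the totals.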

## Article Summary

We'll use a basic `describe` to get the stats for my 2018 articles.

In [6]:
start_date = pd.to_datetime('2018-01-01')
end_date = pd.to_datetime('2018-12-31')

stat_df = df.loc[(df.index >= start_date) & (df.index <= end_date)].copy()
stat_df.describe()

statistic,claps,days_since_publication,fans,num_responses,read_ratio,read_time,reads,title_word_count,views,word_count,claps_per_word,editing_days,<tag>Education,<tag>Data Science,<tag>Towards Data Science,<tag>Machine Learning,<tag>Python
count,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0,98.0
mean,2216.438776,208.830039,434.061224,8.438776,30.350408,11.418367,7469.22449,7.520408,27213.27551,2703.0,1.19247,26.428571,0.867347,0.734694,0.540816,0.44898,0.326531
std,2586.615868,127.978761,502.508405,9.413816,12.300063,8.174544,9006.639102,3.325099,33877.48657,2170.707728,2.036551,85.305515,0.340943,0.443766,0.500893,0.499947,0.471355
min,0.0,32.168843,0.0,0.0,8.11,1.0,1.0,2.0,3.0,163.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,576.75,84.224527,102.25,1.25,21.955,7.0,1258.5,5.0,3217.25,1558.0,0.205499,0.0,1.0,0.0,0.0,0.0,0.0
50%,1100.0,202.73539,221.0,6.0,28.265,10.0,4264.5,7.0,14946.5,2340.5,0.636654,1.0,1.0,1.0,1.0,0.0,0.0
75%,2975.0,343.152215,607.5,12.75,35.7275,13.0,9674.0,10.0,35470.5,2993.0,1.417725,6.5,1.0,1.0,1.0,1.0,1.0
max,13600.0,394.079582,2600.0,59.0,74.24,54.0,42301.0,16.0,135743.0,15063.0,17.891817,349.0,1.0,1.0,1.0,1.0,1.0


# Days from Start to Publish

In [7]:
stat_df.reset_index(inplace=True)
stat_df['days_to_publish'] = ((stat_df['published_date'] - stat_df['started_date']) / pd.Timedelta(days=1)).astype(int)
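
As a small illustration of the `days_to_publish` calculation, the sketch below uses two hypothetical start and publish timestamps. Dividing a timedelta column by `pd.Timedelta(days=1)` gives fractional days, and `astype(int)` then truncates toward zero, so an article published within the same 24 hours counts as 0 days.

```python
import pandas as pd

# Hypothetical timestamps, not taken from the article data
toy = pd.DataFrame({
    "started_date": pd.to_datetime(["2018-01-01 09:00", "2018-02-10 18:00"]),
    "published_date": pd.to_datetime(["2018-01-03 21:00", "2018-02-11 06:00"]),
})

# Subtracting two datetime columns gives a timedelta column
elapsed = toy["published_date"] - toy["started_date"]

# Dividing by a one-day timedelta converts it to a float number of days
fractional_days = elapsed / pd.Timedelta(days=1)

# astype(int) drops the fractional part (truncation, not rounding)
whole_days = fractional_days.astype(int)

print(fractional_days.tolist(), whole_days.tolist())
```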

In [8]:
# One trace per article: a horizontal line from its start date to its
# publish date, drawn at the height of the article's word count
data = [go.Scatter(x=[x['started_date'], x['published_date']],
                   y=[x['word_count'], x['word_count']],
                   text=x['title'][:10] + ':' + str(int(x['days_to_publish'])),
                   mode='markers+lines',
                   name=x['title'][:10])
        for _, x in stat_df.query("days_to_publish < 100").iterrows()]

In [9]:
figure = go.Figure(data=data, layout=go.Layout(title='Started and Published Date with Word Count', 
                                               yaxis=dict(title='Word Count'),
                                               xaxis=dict(title='Started and Published Date')))
iplot(figure)

# Cumulative Word Count

This shows the total words I wrote over the year. We use a cumulative sum plotted against the published date.

In [10]:
stat_df.set_index('published_date', inplace=True)
stat_df['word_count'].cumsum().iplot(kind='scatter', mode='markers+lines', size=6, xTitle='Published Date', yTitle='Word Count',
                                colorscale='plotly', theme='white', title='Total Words over 2018')
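
A minimal sketch of the cumulative sum on hypothetical word counts is shown here. The running total only reads as "words written so far" when the rows are in date order, so this example sorts the index first; the cell above does not sort explicitly, and the name `toy_words` is made up.

```python
import pandas as pd

# Hypothetical word counts, deliberately out of date order
toy_words = pd.Series(
    [2500, 1000, 1800],
    index=pd.to_datetime(["2018-03-01", "2018-01-10", "2018-11-15"]),
    name="word_count",
)

# Sort by publication date, then accumulate
running_total = toy_words.sort_index().cumsum()
print(running_total.tolist())
```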

In [11]:
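# Drop outlier articles that took 100 days or more to publish
# (this filtered frame is also used by the cells that follow)
stat_df = stat_df.query("days_to_publish < 100")

# Sum the word counts in consecutive 7-day windows starting on 2018-01-01
dr = pd.date_range('2018-01-01', '2018-12-31', freq='7D')
words = []
for i in range(len(dr)-1):
    subset = stat_df[(stat_df.index.date >= dr[i].date()) & (stat_df.index.date < dr[i+1].date())]
    words.append(subset['word_count'].sum())

weekly_df = pd.DataFrame({'word_count': words}, index=dr[:-1])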
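
The weekly loop can also be expressed with `resample`. The sketch below compares both approaches on hypothetical data; it assumes pandas 1.1 or later for the `origin` argument, which pins the first bin edge to 2018-01-01, and it reindexes the result so that empty weeks show up as 0 just as in the loop.

```python
import pandas as pd

# Hypothetical word counts indexed by publication date
toy_words = pd.Series(
    [100, 200, 300],
    index=pd.to_datetime(["2018-01-02", "2018-01-05", "2018-01-20"]),
    name="word_count",
)
week_starts = pd.date_range("2018-01-01", "2018-01-29", freq="7D")

# Explicit loop over half-open intervals [week_start, next_week_start)
loop_totals = [
    int(toy_words[(toy_words.index >= week_starts[i])
                  & (toy_words.index < week_starts[i + 1])].sum())
    for i in range(len(week_starts) - 1)
]

# Same bins with resample; reindex adds the weeks that have no articles
resampled = (
    toy_words.resample("7D", origin=pd.Timestamp("2018-01-01"))
    .sum()
    .reindex(week_starts[:-1], fill_value=0)
)

print(loop_totals, resampled.tolist())
```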

In [12]:
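# Two-week rolling mean; the first week has no complete window, so back-fill it.
# .bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas.
rolling_words = weekly_df['word_count'].rolling(2, center=True).mean().bfill()
rolling_words.iplot(xTitle='Date', yTitle='Word Count',
                    mode='markers+lines', theme='white', size=8,
                    title="Rolling Average of Words per Week Published")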
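
To make the smoothing step concrete, here is a minimal sketch on four hypothetical weekly totals. A window of 2 averages each week with the one before it, which leaves the first position empty; back-filling copies the next available value into that gap.

```python
import pandas as pd

# Hypothetical weekly word counts
weekly = pd.Series([10.0, 20.0, 30.0, 40.0])

# Window of two weeks; the first position lacks a full window and becomes NaN
rolled = weekly.rolling(2, center=True).mean()

# Back-fill copies the next valid value into the leading gap
filled = rolled.bfill()

print(rolled.tolist())
print(filled.tolist())
```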

# Cumulative Fans

In [13]:
stat_df['fans'].cumsum().iplot(kind='scatter', mode='markers+lines', size=6, xTitle='Published Date', yTitle='Fans',
                                colorscale='plotly', theme='white', title='Total Fans over 2018')

# Spread Plot of Views and Reads

In [14]:
import cufflinks as cf

In [15]:
@interact
def plot_views_reads(theme=list(cf.themes.THEMES.keys()),
                     colorscale=list(cf.colors._scales_names.keys())):
    stat_df[['views', 'reads']].cumsum().iplot(kind='spread', mode='markers+lines', theme=theme,
                                               size=6, xTitle='Published Date',
                                               colorscale=colorscale, title='Total Views and Reads over 2018')


# Average Reading Percent by Month

In [16]:
# Select the column before averaging so that text columns are never aggregated
stat_df['read_ratio'].resample('M').mean().iplot(kind='bar', title='Reading Percent by Month')
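
The monthly average can also be computed by grouping on the calendar month of the index. The sketch below does this for hypothetical read ratios; unlike `resample`, grouping leaves out months that have no articles instead of returning NaN for them.

```python
import pandas as pd

# Hypothetical read ratios indexed by publication date
toy_ratio = pd.Series(
    [20.0, 40.0, 30.0],
    index=pd.to_datetime(["2018-01-05", "2018-01-25", "2018-02-10"]),
    name="read_ratio",
)

# Convert each timestamp to its month period and average within each month
monthly_mean = toy_ratio.groupby(toy_ratio.index.to_period("M")).mean()

print(monthly_mean)
```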

# Average Reading Percent by Reading Time

In [17]:
# Bin read time into 5-minute intervals: (0, 5], (5, 10], ..., (95, 100]
stat_df['binned_readtime'] = pd.cut(stat_df['read_time'], bins=range(0, 101, 5))

# Convert the intervals to strings and zero-pad '(5, 10]' so that an
# alphabetical sort places it before '(10, 15]'
stat_df['binned_readtime'] = stat_df['binned_readtime'].astype(str)
stat_df['binned_readtime'] = stat_df['binned_readtime'].replace({'(5, 10]': '(05, 10]'})

averages = stat_df.groupby('binned_readtime')['read_ratio'].mean()

averages.sort_index(inplace=True)
averages.iplot(kind='bar', xTitle='Reading Time (mins)',
               yTitle='Reading Percent',
               title='Reading Percent vs Reading Time')
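
The zero-padding in the cell above is needed because interval labels sort alphabetically once they are strings. The sketch below, on hypothetical data, shows that ordering problem and one possible alternative: grouping on the categorical bins directly, which keeps them in numeric order. This is an illustration only, not a change to the notebook's approach.

```python
import pandas as pd

# Hypothetical read times (minutes) and read ratios (percent)
toy = pd.DataFrame({
    "read_time": [3, 7, 12, 8],
    "read_ratio": [50.0, 40.0, 20.0, 30.0],
})

# Same 5-minute bins as above, over a shorter range
bins = pd.cut(toy["read_time"], bins=range(0, 21, 5))

# As strings, '(10, 15]' sorts before '(5, 10]'
string_order = sorted(bins.astype(str).unique())

# Grouping on the categorical keeps the bins in numeric order;
# observed=True drops bins that contain no rows
averages = toy.groupby(bins, observed=True)["read_ratio"].mean()
ordered_labels = [str(interval) for interval in averages.index]

print(string_order)
print(ordered_labels, averages.tolist())
```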

# Correlation Heatmap

In [18]:
import plotly.figure_factory as ff

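# Keep only the numeric, non-tag columns: pandas 2.0 and later raise an error
# if corr() is given text columns such as the title
corrs = stat_df[[c for c in stat_df if '<tag>' not in c]].select_dtypes(include=['number', 'bool']).corr()

figure = ff.create_annotated_heatmap(z=corrs.round(2).values,
                                     x=list(corrs.columns),
                                     y=list(corrs.index),
                                     colorscale='Viridis',
                                     annotation_text=corrs.round(2).values)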
iplot(figure)
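
The correlation matrix behind the heatmap can be inspected without plotting. The sketch below builds it for a hypothetical four-row frame, dropping the `<tag>` indicator column and the text column first; all values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with numeric, tag and text columns
toy = pd.DataFrame({
    "claps": [10, 20, 30, 40],
    "reads": [100, 220, 290, 410],
    "<tag>Python": [1, 0, 1, 0],
    "title": ["a", "b", "c", "d"],
})

# Drop tag indicator columns, then keep only numeric (and boolean) columns
non_tag = toy[[c for c in toy.columns if "<tag>" not in c]]
numeric = non_tag.select_dtypes(include=["number", "bool"])

# Pairwise Pearson correlations, rounded for display
corrs = numeric.corr().round(2)
print(corrs)
```

The matrix is symmetric with ones on the diagonal, which is why the heatmap mirrors itself across the diagonal.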

In [19]:
stat_df.describe()

statistic,claps,days_since_publication,fans,num_responses,read_ratio,read_time,reads,title_word_count,views,word_count,claps_per_word,editing_days,<tag>Education,<tag>Data Science,<tag>Towards Data Science,<tag>Machine Learning,<tag>Python,days_to_publish
count,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0,91.0
mean,2386.208791,222.158956,467.285714,9.087912,31.229341,9.813187,8042.703297,7.659341,29299.747253,2293.087912,1.284121,3.252747,0.857143,0.725275,0.582418,0.417582,0.351648,3.252747
std,2608.248448,123.034773,506.486334,9.462967,11.998429,4.366316,9097.987866,3.406594,34282.994752,1131.733092,2.085958,5.326232,0.351866,0.448849,0.495893,0.495893,0.48013,5.326232
min,11.0,32.168843,3.0,0.0,14.42,1.0,58.0,2.0,109.0,163.0,0.010252,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,712.5,107.674535,116.5,3.0,22.26,6.5,1332.5,5.0,4561.5,1445.0,0.339899,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,1300.0,222.227241,272.0,6.0,28.35,9.0,5188.0,7.0,17223.0,2185.0,0.707339,1.0,1.0,1.0,1.0,0.0,0.0,1.0
75%,3200.0,351.51913,622.5,13.0,36.86,12.0,10837.5,10.0,36336.5,2794.5,1.47024,4.0,1.0,1.0,1.0,1.0,1.0,4.0
max,13600.0,394.079582,2600.0,59.0,74.24,27.0,42301.0,16.0,135743.0,7125.0,17.891817,26.0,1.0,1.0,1.0,1.0,1.0,26.0


# Scatterplot Matrix

In [20]:
figure = ff.create_scatterplotmatrix(stat_df[['read_time', 'reads', 'read_ratio', 'claps', 'type']],
                                     index = 'type', colormap='Cividis', colormap_type='cat',
                                     title='Scatterplot Matrix',
                                     diag='histogram', width=800, height=800)
iplot(figure)

# Conclusions

Well, this was more of a fun exercise than serious work. Above all, I think it shows that data science can be enjoyable.