<h1 align="center">Medium Author Leaderboard</h1>

<br>
In this notebook I build a leaderboard of the most-clapped data-science writers on Medium over the past year.

## Where the Data Came From

I pulled this data from Medium's data-science archive pages using a Selenium scraper and BeautifulSoup. The data-science archive page lists every story published with a data-science tag, organized by date. For the purposes of this notebook, <b>I scraped data from every data-science tagged story published between October 1st, 2017 and October 1st, 2018.</b>
<br>
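Because the archive is organized by date, a scraper can visit one page per day. The sketch below only builds the list of daily page URLs; it does not scrape anything. The URL pattern (`/archive/YYYY/MM/DD`) and the helper name `daily_archive_urls` are assumptions for illustration, as is treating both end dates as inclusive.

```python
from datetime import date, timedelta

# Assumed URL pattern for one day of the archive (not confirmed by this notebook).
ARCHIVE_URL = "https://medium.com/tag/data-science/archive/{:%Y/%m/%d}"

def daily_archive_urls(start, end):
    """Return one archive URL per day from start to end, both inclusive."""
    n_days = (end - start).days + 1
    return [ARCHIVE_URL.format(start + timedelta(days=i)) for i in range(n_days)]

urls = daily_archive_urls(date(2017, 10, 1), date(2018, 10, 1))
print(len(urls))  # 366 pages, one per day
print(urls[0])    # https://medium.com/tag/data-science/archive/2017/10/01
```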

<h3 align="center"> Image of the "<a href="https://medium.com/tag/data-science/archive">data-science</a>" Archive</h3>

## Data Collected
Each timeline card was scraped for a few key pieces of information: Author, Publication, Date, ReadTime, Title, Subtitle, and Claps-Received.
<br>
<h3 align="center"> Example of Data Scraped from a Timeline Card</h3>

# The Data

In [1]:
import os 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

medium = pd.read_csv("DS_Clean.csv")
medium = medium.drop("Unnamed: 0", axis=1)
ds = medium

In [52]:
#Sorry I had to
ds.head(3)

,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Reading_Time,Claps,url,Author_url,Tag_data-science
0,Jupyter Data Science Stack + Docker in under 1...,Motivation,0,Tanbal,Towards Data Science,2017,10,1,4,68.0,https://towardsdatascience.com/jupyter-data-sc...,https://towardsdatascience.com/@ruhayel?source...,1
1,Discovering similarities across my Spotify mus...,,1,Juan De Dios Santos,Towards Data Science,2017,10,1,16,631.0,https://towardsdatascience.com/discovering-sim...,https://towardsdatascience.com/@jdiossantos?so...,1
2,How I got ~98% prediction accuracy with Kaggle...,,1,Tom Cusack,Towards Data Science,2017,10,1,3,189.0,https://towardsdatascience.com/how-i-got-98-pr...,https://towardsdatascience.com/@__tomcusack?so...,1


In [51]:
print("Number of Stories: ", ds.shape[0])
print("Number of Authors: ", ds.Author.value_counts().shape[0])
print("Number of Publications: ", ds.Publication.value_counts().shape[0])

Number of Stories:  17161
Number of Authors:  8512
Number of Publications:  1920
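The same counts can also be read off with `nunique()`, which, like `value_counts()`, ignores missing values by default. Here is a minimal sketch on a toy frame; the `toy` data is made up for illustration and stands in for `ds`.

```python
import pandas as pd

# Toy stand-in for the scraped stories; the notebook itself uses ds from DS_Clean.csv.
toy = pd.DataFrame({
    "Author": ["Ann", "Ann", "Bob", "Cy"],
    "Publication": ["Pub A", "Pub A", "Pub B", None],  # None: story outside any publication
})

n_stories = len(toy)
n_authors = toy["Author"].nunique()            # distinct non-missing authors
n_publications = toy["Publication"].nunique()  # missing values are not counted

print(n_stories, n_authors, n_publications)  # 4 3 2
```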


In [62]:
print("Top 5 Most Publishing Authors")
print()
print("Authors                          # Articles")
print(ds.Author.value_counts().head())

Top 5 Most Publishing Authors

Authors                          # Articles
#ODSC The Data Science Community    93
NYU Center for Data Science         85
Corsair's Publishing                81
Kan Nishida                         78
elvis                               57
Name: Author, dtype: int64


In [66]:
print("Top 5 Most Publishing Publications")
print()
print("Publication              # Articles")
print(ds.Publication.value_counts().head())

Top 5 Most Publishing Publications

Publication              # Articles
Towards Data Science       2085
Hacker Noon                 221
Data Driven Investor        156
freeCodeCamp.org             97
Center for Data Science      82
Name: Publication, dtype: int64


# Top 100 Most Clapped Data-Science Authors
For data-science tagged articles posted between 10/2017 and 10/2018 

In [2]:
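def print_list(data, metric):
    # data: Series of one metric indexed by author name, already sorted for display
    dash = '-' * 100
    for i in range(len(data)):
        if i == 0:
            # Print the table header once, before the first row
            print(dash)
            print('{:<8s}{:<7s}{:^25s}{:^12s}'.format("Rank",metric,"Author","Url"))
            print(dash)
        auth = data.index[i]
        # Use the author's first profile link and drop its tracking query string
        link = medium[medium.Author==auth].Author_url.values[0]
        link = link.split("?")[0]
        # .iloc gives positional access; data[i] is ambiguous on a name-indexed Series
        print('{:<8d}{:<10d}{:<25s}{:<20s}'.format(i+1,int(data.iloc[i]),auth, link))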
        
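def print_multi_list(data, metric1, metric2):
    # data: DataFrame indexed by author name whose first two columns are the metrics
    dash = '-' * 100
    for i in range(len(data)):
        if i == 0:
            # Print the table header once, before the first row
            print(dash)
            print('{:<8s}{:<12s}{:<7s}{:^25s}{:^12s}'.format("Rank",metric1,metric2,"Author","Url"))
            print(dash)
        auth = data.index[i]
        # Use the author's first profile link and drop its tracking query string
        link = medium[medium.Author==auth].Author_url.values[0]
        link = link.split("?")[0]
        # .iloc[i, j] selects row i, column j by position
        print('{:<8d}{:<12d}{:<10d}{:<25s}{:<20s}'.format(i+1,int(data.iloc[i, 0]),int(data.iloc[i, 1]),auth, link))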

In [39]:
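temp = ds.copy()
#creates a column of ones that we will sum to get total articles by each author
temp["count"] = 1
#sum only the numeric columns we need; summing text columns is not meaningful
auths_arts = temp.groupby("Author")[["Claps", "count"]].sum()
total_claps = auths_arts["Claps"]
#keep the 100 largest totals, highest first
total_claps = total_claps.sort_values(ascending=False).head(100)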

print("Most Clapped Data-Science Authors")
print()
print_list(total_claps, "# Claps")

Most Clapped Data-Science Authors

----------------------------------------------------------------------------------------------------
Rank    # Claps         Author              Url     
----------------------------------------------------------------------------------------------------
1       136244    William Koehrsen         https://towardsdatascience.com/@williamkoehrsen
2       62150     Corsair's Publishing     https://comprehension360.corsairs.network/@Corsairs
3       53884     Mybridge                 https://medium.mybridge.co/@Mybridge
4       47423     Devin Soni               https://towardsdatascience.com/@devins
5       45647     George Seif              https://towardsdatascience.com/@george.seif94
6       45000     Michael Jordan           https://medium.com/@mijordan3
7       41769     Jonny Brooks-Bartlett    https://towardsdatascience.com/@jonnybrooks04
8       32000     YK Sugi                  https://medium.freecodecamp.org/@ykdojo
9       30900     Robert Cha

<br>
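To make the ranking step concrete, here is a small worked example on made-up data. It is a sketch rather than the notebook's exact code: it uses `Series.nlargest` to take the top totals, which returns them in descending order.

```python
import pandas as pd

# Made-up stories; "Author" and "Claps" mirror the column names used in this notebook.
toy = pd.DataFrame({
    "Author": ["Ann", "Bob", "Ann", "Cy", "Bob"],
    "Claps": [100.0, 40.0, 250.0, 900.0, 10.0],
})

# Sum claps per author, then keep the two largest totals, highest first.
top = toy.groupby("Author")["Claps"].sum().nlargest(2)
print(top)
```

Ann's two stories add up to 350 claps and Bob's to 50, so the top two are Cy (900) followed by Ann (350).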

# Highest Claps per Article Among Data-Science Authors

In [5]:
auths_arts["CA_Ratio"] = auths_arts['Claps']/auths_arts["count"]
CA = auths_arts.sort_values("CA_Ratio", ascending=False)[["CA_Ratio", "count"]]
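The claps-per-article ratio is just an author's total claps divided by their number of stories. As an alternative sketch (not the notebook's own code), pandas named aggregation can compute both pieces in one step without a helper column of ones. The toy data and the column names `total_claps`, `articles`, and `claps_per_article` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Author": ["Ann", "Ann", "Ann", "Bob"],
    "Claps": [300.0, 150.0, 150.0, 1000.0],
})

# One row per author: total claps and number of stories.
per_author = toy.groupby("Author").agg(
    total_claps=("Claps", "sum"),
    articles=("Claps", "size"),
)
per_author["claps_per_article"] = per_author["total_claps"] / per_author["articles"]
print(per_author.sort_values("claps_per_article", ascending=False))
```

Bob's single 1000-clap story gives him a ratio of 1000, ahead of Ann's 200 over three stories. This is why an unrestricted ranking tends to be led by authors with one very popular article.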

## No Restrictions

In [71]:
def print_top20(df):
    temp = df.copy()
    #creates a column of ones that we will sum to get total articles by each author
    temp["count"] = 1
    #sum only the numeric columns we need
    auths_arts = temp.groupby("Author")[["Claps", "count"]].sum()
    
    #average claps per article for each author
    auths_arts["CA_Ratio"] = auths_arts['Claps']/auths_arts["count"]
    CA = auths_arts.sort_values("CA_Ratio", ascending=False)[["CA_Ratio", "count"]]

    print("Top 20 Consistent Authors with Highest Claps per Article")
    print()
    print_multi_list(CA[:20],"Claps/Art", "# Articles")
    return None

print_top20(ds)

Top 20 Consistent Authors with Highest Claps per Article

----------------------------------------------------------------------------------------------------
Rank    Claps/Art   # Articles         Author              Url     
----------------------------------------------------------------------------------------------------
1       45000       1         Michael Jordan           https://medium.com/@mijordan3
2       32000       1         YK Sugi                  https://medium.freecodecamp.org/@ykdojo
3       21000       1         James Loy                https://towardsdatascience.com/@jamesloyys
4       15100       1         Xoel López Barata        https://medium.com/@xoelop
5       13400       2         Konark Modi              https://medium.freecodecamp.org/@konarkmodi
6       13000       1         Nick Abouzeid            https://blog.producthunt.com/@nickabouzeid
7       12300       1         Ethan Arsht              https://towardsdatascience.com/@ethanarsht
8       11500    

## Restricting On Number of Articles Published (Total Articles > 10)

In [77]:
authors = ds.Author.value_counts()
veterans = authors[authors>10]

ds_veterans = ds[ds.Author.isin(veterans.index)]
print("All Authors: ",authors.shape[0])
print("Authors with more than 10 stories: ",veterans.shape[0])

All Authors:  8512
Authors with more than 10 stories:  148
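The same kind of filter can be written without building a separate list of author names, by attaching each author's story count to every row. This is a sketch on made-up data; `min_stories` is a hypothetical threshold set to 2 so the toy example stays small, whereas the notebook uses 10.

```python
import pandas as pd

toy = pd.DataFrame({
    "Author": ["Ann"] * 3 + ["Bob"] * 1 + ["Cy"] * 2,
    "Claps": [10.0, 20.0, 30.0, 500.0, 5.0, 15.0],
})

min_stories = 2  # hypothetical threshold for this toy example

# Number of stories by the same author, repeated on each of that author's rows.
stories_by_author = toy.groupby("Author")["Claps"].transform("size")

# Strictly more than the threshold, matching the notebook's "authors > 10" rule.
toy_veterans = toy[stories_by_author > min_stories]
print(toy_veterans["Author"].unique().tolist())  # ['Ann']
```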


In [78]:
print_top20(ds_veterans)

Top 20 Consistent Authors with Highest Claps per Article

----------------------------------------------------------------------------------------------------
Rank    Claps/Art   # Articles         Author              Url     
----------------------------------------------------------------------------------------------------
1       3647        13        Devin Soni               https://towardsdatascience.com/@devins
2       3480        12        Jonny Brooks-Bartlett    https://towardsdatascience.com/@jonnybrooks04
3       3260        14        George Seif              https://towardsdatascience.com/@george.seif94
4       2724        50        William Koehrsen         https://towardsdatascience.com/@williamkoehrsen
5       2694        20        Mybridge                 https://medium.mybridge.co/@Mybridge
6       1410        20        Tirthajyoti Sarkar       https://towardsdatascience.com/@tirthajyoti
7       1286        21        Cassie Kozyrkov          https://towardsdatascience.

## Restricting on Consistency (One Story a Month, Last Three Months)
We want authors who posted at least one story in each of the last three months of the scraped window (July, August, and September 2018).

In [47]:
def consistent(df):
    #Checks if the author wrote at least one story in each of July, Aug, and Sep.
    months = [7,8,9]
    return all((df.Month==m).any() for m in months)
            
result = ds.groupby("Author").apply(consistent)

consistent_authors = result[result==True].index

ds_consistent = ds[ds.Author.isin(consistent_authors)]
print("All Authors: ", ds.Author.value_counts().shape[0])
print("Number of Consistent Authors: ", consistent_authors.shape[0])

All Authors:  8512
Number of Consistent Authors:  84
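Checking the month alone works here because the scraped window covers each of July, August, and September only once. If the data spanned more than a year, the check would also need the year. Below is a sketch of a year-aware version on made-up data; `is_consistent` and `required` are hypothetical names and are not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Author": ["Ann", "Ann", "Ann", "Bob", "Bob", "Cy", "Cy", "Cy"],
    "Year":   [2018, 2018, 2018, 2018, 2018, 2017, 2018, 2018],
    "Month":  [7, 8, 9, 7, 9, 7, 8, 9],
})

# Every (year, month) pair in which an author must have published.
required = {(2018, 7), (2018, 8), (2018, 9)}

def is_consistent(group):
    """True if the author has a story in every required (year, month)."""
    published = set(zip(group["Year"].tolist(), group["Month"].tolist()))
    return required <= published  # subset test

flags = {author: is_consistent(group) for author, group in toy.groupby("Author")}
print(flags)  # {'Ann': True, 'Bob': False, 'Cy': False}
```

Bob skipped August, and Cy's July story is from 2017, so only Ann passes. A month-only check would have accepted Cy as well.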


In [72]:
print_top20(ds_consistent)

Top 20 Consistent Authors with Highest Claps per Article

----------------------------------------------------------------------------------------------------
Rank    Claps/Art   # Articles         Author              Url     
----------------------------------------------------------------------------------------------------
1       3260        14        George Seif              https://towardsdatascience.com/@george.seif94
2       2724        50        William Koehrsen         https://towardsdatascience.com/@williamkoehrsen
3       1410        20        Tirthajyoti Sarkar       https://towardsdatascience.com/@tirthajyoti
4       1286        21        Cassie Kozyrkov          https://towardsdatascience.com/@kozyrkov
5       960         6         Supervise.ly             https://hackernoon.com/@deepsystems
6       910         30        Susan Li                 https://towardsdatascience.com/@actsusanli
7       767         81        Corsair's Publishing     https://comprehension360.cors