# Medium Archive Data-Analysis

* In this notebook we clean the data extracted from Medium's archive pages.
* Each archive page is associated with a story tag and is a collection of Medium timeline cards organized by date.
* The data was scraped from Medium's archive pages using Selenium and BeautifulSoup.
* The data was pulled from popular Medium story-tag archives. Each archive was scraped for each day between Jan 1, 2019 and Dec 12, 2019.
* The tags scraped: data-science, python, ai, machine-learning, deep-learning, big-data, computer-vision, nlp.
* The analysis focuses specifically on the field of data science.

# Purpose of the Data
1. To create a performance metric for Medium's authors, so they can compare their work to the rest of Medium.
2. To compare the performance of authors and publications on Medium.
3. To create a leaderboard of the top-performing authors and publications in each tag.
4. To find the differences that distinguish well-received articles.

# Structure of the data
* Title
* Subtitle
* Image (yes/no)
* Author
* Publication
* Year - Month - Day
* Tag
* Reading Time
* Claps
* Comment (yes/no)
* Story Url
* Author URL

# Import the Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import glob

# Load the Data

In [2]:
tech_files = glob.glob("TAG_SCRAPES/*.csv")

frames =[]
for file in tech_files:
    #read each of the separate scrapes from the different tags
    df = pd.read_csv(file)
    frames.append(df)
medium = pd.concat(frames)
medium.head(3)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Tag,Reading_Time,Claps,Comment,url,Author_url
0,Getting Started With Google Colab,A Simple Tutorial for the Frustrated and Confused,1,Anne Bonner,Towards Data Science,2019,1,1,ai,8,3.2K,0,https://towardsdatascience.com/getting-started...,https://towardsdatascience.com/@annebonner?sou...
1,Deep learning (),2018 Artificial intelligence,1,HD COE,,2019,1,1,ai,3,55,0,https://medium.com/@hadee2531earvesdrop/deep-l...,https://medium.com/@hadee2531earvesdrop?source...
2,5 Trends for AI in 2019,"Ethics, driverless cars, cashierless stores, c...",1,David Vandegrift,Predict,2019,1,1,ai,5,278,0,https://medium.com/predict/5-trends-for-ai-in-...,https://medium.com/@DavidVandegrift?source=tag...


In [3]:
print("Number of articles scraped (before cleaning): ",medium.shape[0])

Number of articles scraped (before cleaning):  117178


In [4]:
medium.dtypes

Title           object
Subtitle        object
Image            int64
Author          object
Publication     object
Year             int64
Month            int64
Day              int64
Tag             object
Reading_Time     int64
Claps           object
Comment          int64
url             object
Author_url      object
dtype: object

# Converting Strings to Floats
* Before we can work with the data, we need to convert the "Claps" column from strings to floats. The column currently has the object dtype, which is non-numeric. In addition, clap counts above 999 are written in the form "3.2K" rather than "3200".

# Reformatting Clap Information to Floats


In [5]:
#Claps entries higher than 999 are written "3.2K"
# here we remove the "K", convert the string to float, then multiply by 1000.
numeric_claps = []
for x in medium.Claps:
    if "K" in x:
        numeric_claps.append(float(x[:-1])*1000)
    else:
        numeric_claps.append(x)
medium["Claps"] = numeric_claps
medium["Claps"] = pd.to_numeric(medium["Claps"])
print("Clap dtype: ", medium.dtypes["Claps"])

Clap dtype:  float64
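The same conversion can also be written without an explicit loop. The following is a minimal sketch on invented sample values, not the notebook's own cell; the helper name `parse_claps` is hypothetical, and it assumes "K" is the only suffix that appears in the column.

```python
import pandas as pd

def parse_claps(claps: pd.Series) -> pd.Series:
    """Convert clap strings such as "3.2K" or "55" to floats (hypothetical helper)."""
    text = claps.astype(str).str.strip()
    # Remember which entries are abbreviated in thousands.
    has_k = text.str.endswith("K")
    # Drop a trailing "K" (if any) and parse what is left as a number.
    numbers = pd.to_numeric(text.str.rstrip("K"), errors="coerce")
    # Keep plain entries as they are; scale the "K" entries by 1000.
    return numbers.where(~has_k, numbers * 1000)

# Invented sample values in the two formats seen in the Claps column.
sample = pd.Series(["3.2K", "55", "278", "1K"])
parsed = parse_claps(sample)
print(parsed)
```

With `errors="coerce"`, any entry that still cannot be parsed becomes NaN instead of raising, so unexpected formats can be inspected afterwards with `parsed.isna()`.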


# Removing Comment Entries
* Comment entries are flagged in the data by the Comment column. Since these entries are not articles, we remove them in the following script.

In [6]:
no_comm = medium[medium.Comment==0]
no_comm = no_comm.drop(["Comment"], axis=1)
print("Number of Entries to be removed: ", medium.shape[0]-no_comm.shape[0])
print("Percentage of data removed: " ,round(((medium.shape[0]-no_comm.shape[0])/medium.shape[0])*100,2), "%")
medium = no_comm

Number of Entries to be removed:  3901
Percentage of data removed:  3.33 %


# Cleaning up URLs

In [7]:
#before
for i in range(3):
    print(medium.Author_url.values[i])

https://towardsdatascience.com/@annebonner?source=tag_archive---------0-----------------------
https://medium.com/@hadee2531earvesdrop?source=tag_archive---------1-----------------------
https://medium.com/@DavidVandegrift?source=tag_archive---------2-----------------------


In [8]:
#keep only the part of each URL before the "?" (drops the tracking query string)
medium["url"] = medium["url"].str.split("?").str[0]
medium["Author_url"] = medium["Author_url"].str.split("?").str[0]

In [9]:
#after
for i in range(3):
    print(medium.Author_url.values[i])

https://towardsdatascience.com/@annebonner
https://medium.com/@hadee2531earvesdrop
https://medium.com/@DavidVandegrift
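As an alternative to splitting on "?", the query string can be removed with the standard library's URL parser. This is a sketch on two sample URLs; the helper name `strip_query` is hypothetical and is not part of the original notebook.

```python
from urllib.parse import urlsplit

import pandas as pd

def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(url)
    return parts._replace(query="", fragment="").geturl()

urls = pd.Series([
    "https://medium.com/@DavidVandegrift?source=tag_archive---------2-----------------------",
    "https://medium.com/@hadee2531earvesdrop",  # already clean, should be unchanged
])
clean = urls.map(strip_query)
print(clean.tolist())
```

If the column contains NaNs, `urls.map(strip_query, na_action="ignore")` leaves them untouched instead of passing them to the parser.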


# Checking for Missing Entries in the Data
## All NaNs in Each Column
* Missing values appear in the Title, Subtitle, Author, Publication, and Author_url columns. The Publication column contains NaNs because not all articles are published under a publication.

In [10]:
print("Number of NaNs")
for x in range(13):
    print("%-15s %10d" % (medium.columns.values[x], medium.iloc[:,x].isna().sum()))
print()
print("Total Entries:  ", medium.shape[0])

Number of NaNs
Title                 3522
Subtitle             45760
Image                    0
Author                 171
Publication          59249
Year                     0
Month                    0
Day                      0
Tag                      0
Reading_Time             0
Claps                    0
url                      0
Author_url             171

Total Entries:   113277
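The per-column count can also be obtained without hard-coding the number of columns. A minimal sketch on a toy frame (the data below is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scraped data.
toy = pd.DataFrame({
    "Title": ["A", np.nan, "C"],
    "Subtitle": [np.nan, np.nan, "c"],
    "Claps": [1.0, 2.0, 3.0],
})

# isna() marks missing cells; sum() counts them per column.
nan_counts = toy.isna().sum()
print("Number of NaNs")
print(nan_counts.to_string())
print()
print("Total Entries:  ", len(toy))
```

Because the count is taken over whatever columns exist, the same lines keep working after columns are added or dropped.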


# Remove NaN Authors
* Some cards on the archive timeline are missing an author. Since only a couple hundred entries lack an author, I choose to remove them from the data.

In [11]:
medium = medium[medium.Author.notnull()]

# NaN Title and Subtitle Entries

* Sometimes when scraping the archive page, titles are in unusual formats. As a result, some article titles are scraped as subtitles.

* Here is a breakdown of the missing entries in the Title/Subtitle columns. I choose to keep these in the data.

In [12]:
#Total entries with no Title
print("Total NaN Title Entries: ", medium[medium.Title.isnull()].shape[0])

#Entries with no title but with a subtitle
print("Entries with NaN Title but existing SubTitle: ",medium[(medium.Title.isnull() & medium.Subtitle.notnull())].shape[0])

#Entries with neither a title nor a subtitle
print("Entries with neither title nor subtitle: ", medium[(medium.Title.isnull() & medium.Subtitle.isnull())].shape[0])

Total NaN Title Entries:  3519
Entries with NaN Title but existing SubTitle:  1647
Entries with neither title nor subtitle:  1872


# Total NaNs

In [13]:
print("Number of NaNs")
for x in range(13):
    print("%-15s %10d" % (medium.columns.values[x], medium.iloc[:,x].isna().sum()))
print()
print("Total Entries:  ", medium.shape[0])

Number of NaNs
Title                 3519
Subtitle             45715
Image                    0
Author                   0
Publication          59241
Year                     0
Month                    0
Day                      0
Tag                      0
Reading_Time             0
Claps                    0
url                      0
Author_url               0

Total Entries:   113106


# Removing Duplicate Articles (Multi-tagged)
* Medium allows an author to add up to 5 tags to each story.

* We scraped the archive page of each tag separately. As a result, a story that carries several of our tags appears multiple times in the data (once per tag).

In [14]:
#one hot encode the tags 
medium = pd.get_dummies(medium, columns = ["Tag"])

#multi_tags is all entries in the dataset that have duplicates (includes all duplicates)
multi_tags = medium[medium.duplicated(subset=["url", "Year", "Month","Day"], keep=False)]
print("There are: ", multi_tags.shape[0], "Duplicated entries.")
print("Unique posts with multiple tags: ", multi_tags.shape[0]- medium[medium.duplicated(subset=["url", "Year", "Month","Day"],
      keep="last")].shape[0])

There are:  58849 Duplicated entries.
Unique posts with multiple tags:  25806
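The two counts above rely on different `keep` settings of `DataFrame.duplicated`. The sketch below shows the difference on a toy frame (invented data, keyed on `url` only for brevity):

```python
import pandas as pd

# Toy frame: story "a" was scraped under three tags, "c" under two, "b" under one.
toy = pd.DataFrame({
    "url": ["a", "a", "a", "b", "c", "c"],
    "Tag": ["ai", "nlp", "python", "ai", "ai", "nlp"],
})

# keep=False flags every row whose url occurs more than once.
all_copies = toy.duplicated(subset=["url"], keep=False)
# keep="last" flags every such row except the last occurrence of each url.
extra_copies = toy.duplicated(subset=["url"], keep="last")

n_duplicated_rows = int(all_copies.sum())
n_unique_posts = n_duplicated_rows - int(extra_copies.sum())
print("Duplicated entries:", n_duplicated_rows)
print("Unique posts with multiple tags:", n_unique_posts)
```

Subtracting the `keep="last"` count leaves exactly one row per repeated url, which is why the difference equals the number of unique multi-tagged posts.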


# Combining each multitagged article into ONE row
* Combine the one-hot encoded tags of each multi-tagged article into one entry.

In [15]:

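#group by url (plus date) since a unique story has a unique url, and sum the tag columns
#now all tag vectors for a story will be on one row
#select the tag columns by name so that text columns are never summed
tag_cols = [c for c in multi_tags.columns if c.startswith("Tag_")]
gb = multi_tags.groupby(["url","Year","Month","Day"])[tag_cols].sum().reset_index()
tags = gb[tag_cols].copy()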
tags.head(2)

Unnamed: 0,Tag_ai,Tag_big-data,Tag_computer-vision,Tag_data-science,Tag_deep-learning,Tag_machine-learning,Tag_nlp,Tag_python
0,0,0,0,1,0,1,0,0
1,1,0,0,0,0,1,0,0


* Remove all but one of each duplicate entry, then sort, so rows match up with the groupby dataframe.

In [16]:
#keep only one entry of each duplicated article
sort = multi_tags[~multi_tags.duplicated(subset=["url","Year", "Month","Day"], keep="first")]

#sort the entry to put it in the exact same order as the groupby above
sort = sort.sort_values(["url","Year","Month","Day"]).reset_index().drop("index",axis=1)

#keep only the article columns (the first 12); the combined tags are joined on later
sort = sort.iloc[:,:12].copy()
sort.head(2)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Reading_Time,Claps,url,Author_url
0,Digging for value in the text mines,The oil and gas industrys new toolset is trans...,1,Jesse Lord,Bransjebloggenmin,2019,2,27,4,0.0,https://3min.io/digging-for-value-in-the-text-...,https://3min.io/@jesse_81306
1,Who killed JFK?,Knowledge Mining can help us solve mysteries o...,1,Izabela Hawry ko,Bransjebloggenmin,2019,6,6,5,2.0,https://3min.io/who-killed-jfk-9aa4514442b2,https://3min.io/@izabela.hawrylko


* Check that the two frames are aligned

In [17]:
# double check the two dataframes match up
(sort.url==gb.url).all()

True

* Combine the two dataframes horizontally

In [18]:
#join the article columns and the combined tag columns side by side
combined = pd.concat([sort, tags], axis=1, sort=False)
combined.head(2)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Reading_Time,Claps,url,Author_url,Tag_ai,Tag_big-data,Tag_computer-vision,Tag_data-science,Tag_deep-learning,Tag_machine-learning,Tag_nlp,Tag_python
0,Digging for value in the text mines,The oil and gas industrys new toolset is trans...,1,Jesse Lord,Bransjebloggenmin,2019,2,27,4,0.0,https://3min.io/digging-for-value-in-the-text-...,https://3min.io/@jesse_81306,0,0,0,1,0,1,0,0
1,Who killed JFK?,Knowledge Mining can help us solve mysteries o...,1,Izabela Hawry ko,Bransjebloggenmin,2019,6,6,5,2.0,https://3min.io/who-killed-jfk-9aa4514442b2,https://3min.io/@izabela.hawrylko,1,0,0,0,0,1,0,0


* Remove all duplicates from original dataframe, append combined entries to the bottom of the dataset

In [19]:
before = medium.shape[0]

#Remove every copy of the duplicated articles (same url and date)
medium = medium[~medium.duplicated(subset=["url", "Year", "Month","Day"], keep=False)]
#Add the combined data that we made in the previous cells to the end of the dataframe
dframes = [medium, combined]
#concatenate the two dataframes
medium = pd.concat(dframes)

after = medium.shape[0]
print("# of duplicate rows deleted: ", before-after)

# of duplicate rows deleted:  33043
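The split, sort, and concatenate steps above could also be expressed as a single group-by. The following is a sketch on a toy frame and not the notebook's method: it groups on `url` only, takes the first value of each article column, and uses `max` rather than `sum` for the tag columns so the flags stay 0/1.

```python
import pandas as pd

# Toy frame: "Post A" was scraped under two tags, "Post B" under one.
toy = pd.DataFrame({
    "Title": ["Post A", "Post A", "Post B"],
    "url": ["u/a", "u/a", "u/b"],
    "Claps": [10.0, 10.0, 3.0],
    "Tag": ["ai", "nlp", "python"],
})

# One-hot encode the tags as integers.
encoded = pd.get_dummies(toy, columns=["Tag"], dtype=int)
tag_cols = [c for c in encoded.columns if c.startswith("Tag_")]
meta_cols = [c for c in encoded.columns if c not in tag_cols and c != "url"]

# Article columns: first value per story. Tag columns: 1 if any copy had the tag.
agg = {**{c: "first" for c in meta_cols}, **{c: "max" for c in tag_cols}}
collapsed = encoded.groupby("url", as_index=False).agg(agg)
print(collapsed)
```

Stories with a single tag pass through unchanged, so this form does not require separating duplicated rows from the rest or checking that two frames are aligned. Note that `"first"` takes the first non-null value in each group, which may differ from the first row when article columns contain NaNs.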


# Clean Data

In [20]:

print("Number of articles after cleaning: ", medium.shape[0])

Number of articles after cleaning:  80063


In [21]:

medium.to_csv("Medium_Clean.csv")