# Medium Archive Analysis (Data Cleaning Phase)
In this notebook I clean the data pulled from Medium's archive pages with the scrape_master.py script. I focus on removing duplicate entries and checking for potential data-consistency problems.

## Where the data came from

I pulled this data from Medium's archive pages. Each archive page is associated with a story-tag and collects Medium timeline cards organized by date.

#### Image of the "data-science" Archive

<img src="img/archive.jpg" width=500>




### How the data was scraped
The data was pulled from 95 popular Medium story-tag archives. Each archive was <b>scraped for each day between Aug 1, 2017 and Aug 1, 2018.</b>

These specific dates were chosen because:
1. Medium's clap metric was introduced in August 2017, so older posts may not be comparable.
2. Medium's popularity may have grown since then, so older posts may not generalize to the performance of posts today.
3. The end date excludes newer posts (September), which have not yet had time to mature and accumulate claps.
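
The day-by-day scrape described above can be sketched as follows. This is a minimal illustration, not the contents of scrape_master.py: the `ARCHIVE_URL` pattern and the `archive_urls` helper are assumptions introduced here, and only the date range comes from this notebook.

```python
import pandas as pd

# Assumed URL pattern for a tag archive page (hypothetical, for illustration only).
ARCHIVE_URL = "https://medium.com/tag/{tag}/archive/{year}/{month:02d}/{day:02d}"

def archive_urls(tag, start="2017-08-01", end="2018-08-01"):
    """Yield one archive URL per day in the inclusive date range."""
    for date in pd.date_range(start, end, freq="D"):
        yield ARCHIVE_URL.format(
            tag=tag, year=date.year, month=date.month, day=date.day
        )

urls = list(archive_urls("data-science"))
print(len(urls), "pages for one tag")
print(urls[0])
```

Multiplying the number of pages per tag by the number of tags gives a rough idea of how many requests a full scrape needs.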

#### The 95 Tags Scraped
['android', 'apple', 'architecture', 'art', 'bitcoin', 'blacklivesmatter', 'blockchain', 'blog', 'blogging', 'books', 'branding', 'business', 'college', 'creativity', 'cryptocurrency', 'culture', 'deep-learning', 'design', 'dogs', 'donald-trump', 'economics', 'education', 'energy', 'entrepreneurship', 'environment', 'ethereum', 'feminism', 'fiction', 'food', 'football', 'gadgets', 'google', 'government', 'happiness', 'health', 'history', 'humor', 'inspiration', 'interior-design', 'investing', 'ios', 'javascript', 'jobs', 'journalism', 'leadership', 'life', 'life-lessons', 'love', 'machine-learning', 'marketing', 'medium', 'mobile', 'motivation', 'movies', 'music', 'nba', 'news', 'nutrition', 'parenting', 'personal-development', 'photography', 'poem', 'poetry', 'politics', 'product-design', 'productivity', 'programming', 'psychology', 'python', 'python', 'racism', 'react', 'relationships', 'science', 'self-improvement', 'social-media', 'software-engineering', 'sports', 'startup', 'tech', 'technology', 'travel', 'trump', 'ux', 'venture-capital', 'visual-design', 'web-design', 'web-development', 'women', 'wordpress', 'work', 'writing']

## Purpose of the Data
 1. To <b>create a performance metric for Medium's authors</b>, so they can compare their work to the rest of Medium.
 2. To <b>compare the performance of authors and publications</b> on Medium.
 3. To <b>create a leaderboard</b> of the top-performing authors and publications in each tag.
 4. To <b>find the differences that distinguish well-received articles.</b>
 
 


## Structure of the data
- Title
- Subtitle 
- Image (yes/no)
- Author
- Publication
- Year - Month - Day
- Tag
- Reading Time
- Claps
- Comment (yes/no)
- Story URL
- Author URL

<img src="img/card.png" width=500>

<hr>
# Load the Data

In [28]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import glob

tech_files = glob.glob("TAG_SCRAPES/*.csv")

frames = []
for file in tech_files:
    # each file holds the separate scrape of one tag
    df = pd.read_csv(file)
    frames.append(df)
medium = pd.concat(frames)
medium.head(2)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Tag,Reading_Time,Claps,Comment,url,Author_url
0,The AI Hierarchy of Needs,As is usually the case with fast-advancing tec...,0,Monica Rogati,Hacker Noon,2017,8,1,ai,6,5.4K,0,https://hackernoon.com/the-ai-hierarchy-of-nee...,https://hackernoon.com/@mrogati?source=tag_arc...
1,Why will declarative programming rule chatbots...,Creating smart applications,1,Hristo Borisov,Progress NativeChat,2017,8,1,ai,7,149,0,https://medium.com/nativechat/why-will-declara...,https://medium.com/@hristoborisov?source=tag_a...
2,Online Animation: Mixamo vs Norah AI,Online animations tools provide game designers...,1,Emma Laurent,,2017,8,1,ai,5,12,0,https://medium.com/@laurentemma/online-animati...,https://medium.com/@laurentemma?source=tag_arc...


In [29]:
print("Number of articles scraped (before cleaning): ",medium.shape[0])

Number of articles scraped (before cleaning):  2113020


<hr>
# Converting Strings to Floats

Before we can work with the data, we need to <b>convert the "Claps" column from strings to floats</b>. pandas currently stores it with the non-numeric object dtype, because <b>clap counts above 999 are abbreviated, e.g. "5.5K" rather than "5500".</b>

### Preview of DataTypes

In [30]:
medium.dtypes

Title           object
Subtitle        object
Image            int64
Author          object
Publication     object
Year             int64
Month            int64
Day              int64
Tag             object
Reading_Time     int64
Claps           object
Comment          int64
url             object
Author_url      object
dtype: object

### Reformatting Clap Information to Floats

In [31]:
# Clap counts higher than 999 are written like "5.5K".
# Here we remove the "K", convert the string to float, then multiply by 1000.
numeric_claps = []
for x in medium.Claps:
    if "K" in x:
        numeric_claps.append(float(x[:-1])*1000)
    else:
        numeric_claps.append(x)
medium["Claps"] = numeric_claps
medium["Claps"] = pd.to_numeric(medium["Claps"])
print("Clap dtype: ", medium.dtypes["Claps"])

Clap dtype:  float64
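
The same conversion can also be written without an explicit Python loop. The sketch below is one possible vectorized variant, assuming every value is either a plain number or a number followed by "K"; the helper name `parse_claps` and the example values are made up for illustration.

```python
import pandas as pd

def parse_claps(claps: pd.Series) -> pd.Series:
    """Convert clap strings such as "149" or "5.4K" to floats."""
    text = claps.astype(str).str.strip()
    has_k = text.str.endswith("K")
    # Drop the trailing "K" (if any) and parse what is left as a number.
    numbers = pd.to_numeric(text.str.rstrip("K"), errors="coerce")
    # Scale only the values that carried the "K" suffix.
    return numbers.where(~has_k, numbers * 1000)

example = pd.Series(["5.4K", "149", "12", "1K"])
converted = parse_claps(example)
print(converted)
```

With `errors="coerce"`, anything that does not parse becomes NaN instead of raising, which makes unexpected formats easy to count afterwards.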


<hr>
# Removing Comment Entries
Comment entries are flagged in the data by the Comment column. Since these entries are not articles, I remove them in the following cell.

In [32]:
no_comm = medium[medium.Comment==0]
no_comm = no_comm.drop(["Comment"], axis=1)
print("Number of Entries to be removed: ", medium.shape[0]-no_comm.shape[0])
print("Percentage of data removed: " ,round(((medium.shape[0]-no_comm.shape[0])/medium.shape[0])*100,2), "%")
medium = no_comm

Number of Entries to be removed:  118963
Percentage of data removed:  5.63 %


# Cleaning Up URLs


In [33]:
#before
for i in range(3):
    print(medium.Author_url.values[i])

https://hackernoon.com/@mrogati?source=tag_archive---------0---------------------
https://medium.com/@hristoborisov?source=tag_archive---------1---------------------
https://medium.com/@laurentemma?source=tag_archive---------2---------------------


In [34]:
# keep only the part of each URL before the "?" query string
medium["url"] = medium["url"].str.split("?").str[0]
medium["Author_url"] = medium["Author_url"].str.split("?").str[0]

In [35]:
#after
for i in range(3):
    print(medium.Author_url.values[i])

https://hackernoon.com/@mrogati
https://medium.com/@hristoborisov
https://medium.com/@laurentemma
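
Splitting on "?" works for the URLs seen here. As a hedged alternative, the standard library can parse a URL and rebuild it without its query string and fragment. The helper name `strip_query` is introduced only for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

raw = "https://medium.com/@hristoborisov?source=tag_archive---------1---------------------"
print(strip_query(raw))
```

On a DataFrame column this could be applied with `medium["Author_url"].map(strip_query, na_action="ignore")`, so that missing URLs are left as NaN.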



# Checking for Missing Entries in the Data


### All NaNs in Each Column
Missing values occur in the Title, Subtitle, Author, Publication, and Author_url columns. <b>The Publication column has NaNs because not every article is published in a publication.</b>

In [36]:
print("Number of NaNs")
for x in range(13):
    print("%-15s %10d" % (medium.columns.values[x], medium.iloc[:,x].isna().sum()))
print()
print("Total Entries:  ", medium.shape[0])

Number of NaNs
Title                67651
Subtitle            647096
Image                    0
Author                6981
Publication        1339095
Year                     0
Month                    0
Day                      0
Tag                      0
Reading_Time             0
Claps                    0
url                      0
Author_url            6975

Total Entries:   1994057
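
The per-column count above can also be produced without hard-coding the number of columns. The sketch below uses a small made-up frame (`toy`) in place of the real data and adds a percentage column, which is an extra not computed in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped data (values are invented).
toy = pd.DataFrame({
    "Title": ["A", np.nan, "C", np.nan],
    "Author": ["x", "y", np.nan, "z"],
    "Claps": [1.0, 2.0, 3.0, 4.0],
})

missing = toy.isna().sum()  # NaN count for every column at once
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(2),
})
print(summary)
```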


## Remove NaN Authors
Medium appears to add existing articles from external sites such as pcmag.com to the archive. Their cards on the archive timeline have neither an author nor a publication. Since only 6,981 of roughly 2 million entries lack an author, I choose to remove them from the data.

In [37]:
medium = medium[medium.Author.notnull()]

## NaN Title and Subtitle Entries
When scraping the archive page, some titles appear in unusual formats. As a result, <b>some article titles are scraped as subtitles</b>.

Here is a breakdown of the missing entries in the Title and Subtitle columns. I choose to keep these rows in the data.

In [38]:
#Total entries with no Title
print("Total NaN Title Entries: ", medium[medium.Title.isnull()].shape[0])

#Entries with no title but with a subtitle
print("Entries with NaN Title but existing SubTitle: ",medium[(medium.Title.isnull() & medium.Subtitle.notnull())].shape[0])

#Entries with neither a title nor a subtitle
print("Entries with neither title nor subtitle: ", medium[(medium.Title.isnull() & medium.Subtitle.isnull())].shape[0])

Total NaN Title Entries:  67518
Entries with NaN Title but existing SubTitle:  36242
Entries with neither title nor subtitle:  31276


## Final NaNs

In [39]:
print("Number of NaNs")
for x in range(13):
    print("%-15s %10d" % (medium.columns.values[x], medium.iloc[:,x].isna().sum()))
print()
print("Total Entries:  ", medium.shape[0])

Number of NaNs
Title                67518
Subtitle            644600
Image                    0
Author                   0
Publication        1338404
Year                     0
Month                    0
Day                      0
Tag                      0
Reading_Time             0
Claps                    0
url                      0
Author_url               0

Total Entries:   1987076


<hr>
# Removing Duplicate Articles (Multi-tagged)
Medium allows an author to attach up to 5 tags to each story.

Because each tag archive was scraped separately, <b>a story appears multiple times in our data (once for each scraped tag it carries).</b>



In [40]:
#one hot encode the tags 
medium = pd.get_dummies(medium, columns = ["Tag"])

#multi_tags is all entries in the dataset that have duplicates (includes all duplicates)
multi_tags = medium[medium.duplicated(subset=["url", "Year", "Month","Day"], keep=False)]
print("There are: ", multi_tags.shape[0], "Duplicated entries.")
print("Unique posts with multiple tags: ", multi_tags.shape[0]- medium[medium.duplicated(subset=["url", "Year", "Month","Day"], keep="last")].shape[0])

### Combining each multitagged article into ONE row

#### 1. Combine the onehot encoded tags of each multiposted article into one entry

In [71]:
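#group by url and date, since a unique story has a unique url, and sum the tag columns
#now all tag indicators of a story are on one line
#(tag columns are selected by name, so text columns are never summed)
tag_cols = [c for c in multi_tags.columns if c.startswith("Tag_")]
gb = multi_tags.groupby(["url","Year","Month","Day"])[tag_cols].sum().reset_index()
tags = gb[tag_cols].copy()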
tags.head(2)

Unnamed: 0,Tag_ai,Tag_android,Tag_apple,Tag_architecture,Tag_art,Tag_artificial-intelligence,Tag_big-data,Tag_bitcoin,Tag_blacklivesmatter,Tag_blockchain,...,Tag_travel,Tag_trump,Tag_ux,Tag_venture-capital,Tag_web-design,Tag_web-development,Tag_women,Tag_wordpress,Tag_work,Tag_writing
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### 2. Remove all but one of each duplicate entry, then sort so the rows match up with the groupby dataframe

In [66]:
#keep only one entry of each duplicated article
sort = multi_tags[~multi_tags.duplicated(subset=["url","Year", "Month","Day"], keep="first")]

#sort the entries to put them in the exact same order as the groupby above
sort = sort.sort_values(["url","Year","Month","Day"]).reset_index(drop=True)

#keep only the 12 article columns (no tag columns) for the merge later
sort = sort.iloc[:,:12].copy()
sort.head(2)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Reading_Time,Claps,url,Author_url
0,Lentrepreneuriat social : un moyen de sortir d...,"Lengagement de Claudia, avec son organisation ...",0,ATD Quart Monde Int,1001 Histoires,2017,8,23,4,0.0,https://1001histoires.atd-quartmonde.org/lentr...,https://1001histoires.atd-quartmonde.org/@ATDQ...
1,Networking and Mentoring Done Right,How Cortado from Ten Thousand Coffees gives ev...,1,Ten Thousand Coffees,Ten Thousand Coffees,2018,5,25,3,72.0,https://10kcblog.com/cortadoworkplace-tenthous...,https://10kcblog.com/@10kcoffees


#### 3. Check that the two frames are aligned

In [68]:
# double check the two dataframes match up
(sort.url==gb.url).all()

True

#### 4. Combine the two dataframes horizontally

In [70]:
#join the article columns and the combined tag columns side by side
combined = pd.concat([sort, tags], axis=1, sort=False)
combined.head(2)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Reading_Time,Claps,...,Tag_travel,Tag_trump,Tag_ux,Tag_venture-capital,Tag_web-design,Tag_web-development,Tag_women,Tag_wordpress,Tag_work,Tag_writing
0,Lentrepreneuriat social : un moyen de sortir d...,"Lengagement de Claudia, avec son organisation ...",0,ATD Quart Monde Int,1001 Histoires,2017,8,23,4,0.0,...,0,0,0,0,0,0,0,0,0,0
1,Networking and Mentoring Done Right,How Cortado from Ten Thousand Coffees gives ev...,1,Ten Thousand Coffees,Ten Thousand Coffees,2018,5,25,3,72.0,...,0,0,0,0,0,0,0,0,0,0


#### 5. Remove all duplicates from the original dataframe, then append the combined entries to the bottom of the dataset

In [73]:
before = medium.shape[0]

#Remove every row of a duplicated article (same url and date)
medium = medium[~medium.duplicated(subset=["url", "Year", "Month","Day"], keep=False)]
#Add the combined rows built in the previous cells to the end of the dataframe
dframes = [medium, combined]
#concatenate the two dataframes
medium = pd.concat(dframes)

after = medium.shape[0]
print("# of duplicate rows deleted: ", before-after)

# of duplicate rows deleted:  596043
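
The five steps above can also be expressed as a single groupby. The sketch below is an alternative under stated assumptions, not the procedure used in this notebook: it runs on a small made-up frame (`toy`), keeps the first value of every article column, and takes the maximum of each one-hot tag column, so a tag is 1 if any copy of the story carried it.

```python
import pandas as pd

# Toy stand-in: "Post A" was scraped from two tag archives (values are invented).
toy = pd.DataFrame({
    "Title": ["Post A", "Post A", "Post B"],
    "url": ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
    "Year": [2018, 2018, 2018],
    "Month": [1, 1, 2],
    "Day": [5, 5, 7],
    "Claps": [10.0, 10.0, 3.0],
    "Tag": ["python", "programming", "art"],
})

keys = ["url", "Year", "Month", "Day"]
encoded = pd.get_dummies(toy, columns=["Tag"], dtype=int)
tag_cols = [c for c in encoded.columns if c.startswith("Tag_")]
meta_cols = [c for c in encoded.columns if c not in tag_cols and c not in keys]

# Article columns: first value per story. Tag columns: 1 if any copy had the tag.
agg_spec = {col: "first" for col in meta_cols}
agg_spec.update({col: "max" for col in tag_cols})
collapsed = encoded.groupby(keys, as_index=False).agg(agg_spec)
print(collapsed)
```

Note that `"first"` returns the first non-null value in each group, so a story whose title is missing in one copy but present in another keeps the title. Because articles and tags are aggregated in the same call, no separate alignment check is needed.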


## Conclusion
How much data do we have after cleaning?

In [74]:
print("Number of articles after cleaning: ", medium.shape[0])

Number of articles after cleaning:  1391033


In [75]:
medium.to_csv("Medium_Clean.csv")