# Medium Archive Analysis (Data Cleaning Phase)
In this notebook I clean the data pulled from Medium's archive pages with the scrape_master.py file. I focus on removing duplicate entries and on checking for potential data-consistency problems.

## Where the data came from

I pulled this data from Medium's archive pages. Each archive page is associated with a story-tag and is a collection of Medium timeline cards organized by date.

#### Image of the "data-science" Archive

<img src="img/archive.jpg" width=500>




### How the data was scraped
The data was pulled from 36 popular Medium story-tag archives. Each archive was <b>scraped for each day between Aug 1, 2017 and Aug 1, 2018.</b>

These specific dates were chosen because:
1. Medium's clap metric was introduced in August 2017, and older posts might not be relevant.
2. The popularity of Medium may have grown, so older posts may not generalize to the performance of posts today.
3. The end date was chosen so that newer posts (September) were not included, as they have not had time to mature and accumulate claps.
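
To get a feel for the size of the scrape, here is a minimal sketch that enumerates the days in the window and builds one archive URL per day for a tag. It assumes both end dates are included, and the URL template and the `archive_urls` helper are my own illustration, not taken from scrape_master.py.

```python
import pandas as pd

# Assumption: both endpoints are scraped (inclusive range).
days = pd.date_range(start="2017-08-01", end="2018-08-01", freq="D")

# Hypothetical URL template; the one actually used lives in scrape_master.py.
ARCHIVE_URL = "https://medium.com/tag/{tag}/archive/{year}/{month:02d}/{day:02d}"


def archive_urls(tag, dates):
    """Build one archive-page URL per day for a single tag."""
    return [
        ARCHIVE_URL.format(tag=tag, year=d.year, month=d.month, day=d.day)
        for d in dates
    ]


urls = archive_urls("data-science", days)
print(len(days), "days per tag")
print(urls[0])
```

Under these assumptions, each of the 36 tags contributes one archive page per day in the window.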

#### 36 Tags Scraped
['ai', 'artificial-intelligence',
 'blogging', 'business',
 'data-science', 'design',
 'education', 'entrepreneurship',
 'health', 'humor',
 'inspiration', 'javascript',
 'leadership', 'life',
 'life-lessons', 'love',
 'machine-learning', 'marketing',
 'motivation', 'personal-development',
 'poetry', 'politics',
 'productivity', 'programming',
 'python', 'racism',
 'science', 'self-improvement',
 'software-engineering', 'startup',
 'tech', 'technology',
 'travel', 'web-design',
 'web-development', 'writing']
 
## Purpose of the Data
 1. To <b>create a performance metric for Medium's authors</b>, so they can compare their work to the rest of Medium.
 2. To <b>compare the performance of authors and publications</b> on Medium.
 3. To <b>create a leaderboard</b> of the top performing authors and publications in each tag.
 4. To <b>find the differences that distinguish well-received articles.</b>
 
 


## Structure of the data
- Title
- Subtitle 
- Image (yes/no)
- Author
- Publication
- Year - Month - Day
- Tag
- Reading Time
- Claps
- Comment (yes/no)
- Story URL
- Author URL

<img src="img/card.png" width=500>

<hr>
# Load the Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

frames =[]
tags = ['ai', 'artificial-intelligence',
 'blogging', 'business',
 'data-science', 'design',
 'education', 'entrepreneurship',
 'health', 'humor',
 'inspiration', 'javascript',
 'leadership', 'life',
 'life-lessons', 'love',
 'machine-learning', 'marketing',
 'motivation', 'personal-development',
 'poetry', 'politics',
 'productivity', 'programming',
 'python', 'racism',
 'science', 'self-improvement',
 'software-engineering', 'startup',
 'tech', 'technology',
 'travel', 'web-design',
 'web-development', 'writing']

for tag in tags:
    #all of the separate scrapes, one CSV file per tag
    df = pd.read_csv("TAG_SCRAPES/medium_"+tag+".csv")
    frames.append(df)
medium = pd.concat(frames)
medium.head(3)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Tag,Reading_Time,Claps,Comment,url,Author_url
0,The AI Hierarchy of Needs,As is usually the case with fast-advancing tec...,0,Monica Rogati,Hacker Noon,2017,8,1,ai,6,5.2K,0,https://hackernoon.com/the-ai-hierarchy-of-nee...,https://hackernoon.com/@mrogati?source=tag_arc...
1,Why will declarative programming rule chatbots...,Creating smart applications,1,Hristo Borisov,Progress NativeChat,2017,8,1,ai,7,150,0,https://medium.com/nativechat/why-will-declara...,https://medium.com/@hristoborisov?source=tag_a...
2,Online Animation: Mixamo vs Norah AI,Online animations tools provide game designers...,1,Emma Laurent,,2017,8,1,ai,5,12,0,https://medium.com/@laurentemma/online-animati...,https://medium.com/@laurentemma?source=tag_arc...


In [2]:
print("Number of articles scraped: ",medium.shape[0])

Number of articles scraped:  993318


<hr>
# Converting Strings to Floats

Before we can work with the data we need to <b>convert the "Claps" column from string to float values</b>. In the dtype preview, Claps shows up as the object datatype, which is non-numeric. In addition, <b>clap counts above 999 are abbreviated, for example "5.5K" rather than "5500".</b>

### Preview of DataTypes

In [3]:
medium.dtypes

Title           object
Subtitle        object
Image            int64
Author          object
Publication     object
Year             int64
Month            int64
Day              int64
Tag             object
Reading_Time     int64
Claps           object
Comment          int64
url             object
Author_url      object
dtype: object

### Reformatting Clap Information to Floats

In [4]:
#Claps entries higher than 999 are written "5.5K"
# here we remove the "K", convert the string to float, then multiply by 1000.
numeric_claps = []
for x in medium.Claps:
    if "K" in x:
        numeric_claps.append(float(x[:-1])*1000)
    else:
        numeric_claps.append(x)
medium["Claps"] = numeric_claps
medium["Claps"] = pd.to_numeric(medium["Claps"])
print("Clap dtype: ", medium.dtypes["Claps"])

Clap dtype:  float64
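
The same conversion can also be written without an explicit Python loop. The sketch below is one possible vectorized version on a small sample; the `parse_claps` helper is a name I made up for illustration, and it assumes "K" is the only suffix that appears in the column.

```python
import pandas as pd


def parse_claps(claps: pd.Series) -> pd.Series:
    """Convert clap strings such as "5.2K" or "150" to floats."""
    text = claps.astype(str).str.strip()
    has_k = text.str.endswith("K")
    # Drop the "K", then parse; anything unparseable becomes NaN.
    numbers = pd.to_numeric(text.str.replace("K", "", regex=False), errors="coerce")
    # Scale only the rows that carried the "K" suffix.
    return numbers.where(~has_k, numbers * 1000)


sample = pd.Series(["5.2K", "150", "12", "96K"])
print(parse_claps(sample).tolist())
```

Using `errors="coerce"` means a malformed entry shows up as NaN rather than raising, which makes it easy to count problem rows afterwards.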


<hr>
# Removing Comment Entries
Comment entries have been encoded into the data with the Comment column. Since these entries are not articles, I remove them in the following script.

In [5]:
no_comm = medium[medium.Comment==0]
no_comm = no_comm.drop(["Comment"], axis=1)
print("Number of Entries to be removed: ", medium.shape[0]-no_comm.shape[0])
print("Percentage of data removed: " ,round(((medium.shape[0]-no_comm.shape[0])/medium.shape[0])*100,2), "%")
medium = no_comm

Number of Entries to be removed:  62934
Percentage of data removed:  6.34 %



# Checking for Missing Entries in the Data


### All NaNs in Each Column
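Nearly all missing values are in Title, Subtitle, or Publication; Author is missing in only 3 rows. <b>The NaNs in the Publication column are expected, because not every article belongs to a publication.</b> Note that the loop below only checks the first 10 columns, so Claps, url, and Author_url are not covered.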

In [6]:
print("Number of NaNs")
for x in range(10):
    print("%-15s %10d" % (medium.columns.values[x], medium.iloc[:,x].isna().sum()))
print()
print("Total Entries:  ", medium.shape[0])

Number of NaNs
Title                28418
Subtitle            296936
Image                    0
Author                   3
Publication         628957
Year                     0
Month                    0
Day                      0
Tag                      0
Reading_Time             0

Total Entries:   930384
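
A shorter way to get the same counts for every column at once is `isna().sum()`. This is a minimal sketch on a toy frame; the `nan_report` helper and its column names are illustrative only.

```python
import numpy as np
import pandas as pd


def nan_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values in every column, with their share of all rows."""
    counts = df.isna().sum()
    share = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"n_missing": counts, "pct_missing": share})


toy = pd.DataFrame({
    "Title": ["A", np.nan, "C", "D"],
    "Publication": [np.nan, np.nan, "Hacker Noon", np.nan],
    "Claps": [10.0, 5.0, 0.0, 2.0],
})
print(nan_report(toy))
```

Calling `nan_report(medium)` would cover all columns, including the ones that fall outside a fixed `range(10)`.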


## NaN Title and Subtitle Entries
Sometimes when scraping the archive page, titles are in unusual formats. As a result, <b>some article titles are scraped as subtitles</b>. This is not a big deal, as we don't really "need" the Title data; it just bothers me.

Here is a breakdown of the missing entries in the Title and Subtitle columns.

In [7]:
#Total entries with no Title
print("Total NaN Title Entries: ", medium[medium.Title.isnull()].shape[0])

#Entries with no title but with a subtitle
print("Entries with NaN Title but existing SubTitle: ",medium[(medium.Title.isnull() & medium.Subtitle.notnull())].shape[0])

#Neither
print("Entries with neither title nor subtitle: ", medium[(medium.Title.isnull() & medium.Subtitle.isnull())].shape[0])

Total NaN Title Entries:  28418
Entries with NaN Title but existing SubTitle:  16032
Entries with neither title nor subtitle:  12386
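
If a readable label were ever needed for the rows whose title landed in the Subtitle column, one option is to fall back to the subtitle. This is only a sketch of that idea and is not applied anywhere in this notebook; `Display_Title` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Title":    ["Real title", np.nan,            np.nan],
    "Subtitle": ["A subtitle", "Misplaced title", np.nan],
})

# Hypothetical helper column: use Subtitle wherever Title is missing.
toy["Display_Title"] = toy["Title"].fillna(toy["Subtitle"])

print(toy["Display_Title"].tolist())
```

Rows with neither a title nor a subtitle stay missing, so they would still need separate handling.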


<hr>
# Removing Duplicate Articles (Same Tag)
Duplicate articles have the same title, author, and tag. <b>An author will sometimes repost their article, or spam-post it 3 times.</b>

The important thing to note is that we are looking for entries with identical Tags. We will deal with duplicates with different tags next.

<strong>Strategy:</strong> 
1. Use the ["Title", "Author", "Tag"] columns as the duplicate key.
2. Create a mask with df.duplicated(keep="first"), which marks every row that repeats those three columns after its first occurrence.
3. Remove all rows marked True from the Medium DataFrame (use the ~ operator on the mask to flip the boolean values).

<strong>Result:</strong> The Medium DataFrame will have all rows that repeat an earlier row's title, author, and tag removed. We will see how many rows this removes.

#### TL;DR We are removing all but one of each spam-posted article.

In [8]:
dup = medium[medium.duplicated(subset=["Title", "Author","Tag"], keep="first")] 
no_dup = medium[~medium.duplicated(subset=["Title", "Author","Tag"], keep="first")]
print("Number of Rows removed: ", medium.shape[0]-no_dup.shape[0])
print("Percentage of data removed: ", round(((medium.shape[0]-no_dup.shape[0])/medium.shape[0])*100,2),"%")
print("Average Claps of spam-posted population: ",dup.Claps.mean())

Number of Rows removed:  21061
Percentage of data removed:  2.26 %
Average Claps of spam-posted population:  148.6183467071839
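
To make the mask logic concrete, here is a small toy example of `duplicated(keep="first")` and the `~` operator. The rows are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title":  ["Post A", "Post A", "Post A", "Post B"],
    "Author": ["Ann",    "Ann",    "Ann",    "Bob"],
    "Tag":    ["python", "python", "ai",     "python"],
    "Claps":  [10.0,     3.0,      10.0,     7.0],
})

# True for every repeat of a (Title, Author, Tag) combination after the first.
mask = toy.duplicated(subset=["Title", "Author", "Tag"], keep="first")

reposts = toy[mask]    # rows that would be dropped
deduped = toy[~mask]   # ~ flips the mask, keeping first occurrences

print(mask.tolist())
print(len(deduped), "rows kept")
```

Only the second "Post A" row under the python tag is flagged; the copy under the ai tag survives because its tag differs, which is why multi-tagged duplicates need their own step.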


### Top 3 Multi-posted Articles

In [9]:
dup.sort_values("Claps", ascending=False).head(3)

Unnamed: 0,Title,Subtitle,Image,Author,Publication,Year,Month,Day,Tag,Reading_Time,Claps,url,Author_url
14841,Want To Become A Multi-Millionaire? Do These 1...,,1,Benjamin P. Hardy,Thrive Global,2018,2,5,startup,21,96000.0,https://medium.com/thrive-global/want-to-becom...,https://medium.com/@benjaminhardy?source=tag_a...
10627,30 Behaviors That Will Make You Unstoppable,A lot of people are good at what they do. Some...,1,Benjamin P. Hardy,,2017,11,22,productivity,13,55000.0,https://medium.com/@benjaminhardy/30-behaviors...,https://medium.com/@benjaminhardy?source=tag_a...
11655,30 Behaviors That Will Make You Unstoppable,A lot of people are good at what they do. Some...,1,Benjamin P. Hardy,,2017,11,22,self-improvement,13,55000.0,https://medium.com/@benjaminhardy/30-behaviors...,https://medium.com/@benjaminhardy?source=tag_a...


In [10]:
medium = no_dup

<hr>
# Removing Duplicate Articles (Multi-tagged)
Medium allows an author to include up to 5 tags for each story.

When we scraped the archive pages, we scraped each tag individually. <b>As a result, stories will appear multiple times in our data (with different tags).</b>

<strong>Strategy:</strong>
1. One-hot encode the Tag column.
2. Search all entries for duplicate rows in ["Title", "Author", "Date"], where the date is the Year, Month, and Day columns.
3. Add the duplicate entries' one-hot encodings together.
4. Delete the duplicates.

<strong>Result:</strong> A one-hot encoded record of each article's tags, with duplicates removed.

In [11]:
medium = pd.get_dummies(medium, columns =["Tag"])

multi_tags = medium[medium.duplicated(subset=["Title", "Author", "Year", "Month","Day"], keep=False)]
print("There are: ", multi_tags.shape[0], "Duplicated entries.")
print("Unique posts with multiple tags: ", multi_tags.shape[0]- medium[medium.duplicated(subset=["Title", "Author", "Year", "Month","Day"], keep="last")].shape[0])

There are:  339394 Duplicated entries.
Unique posts with multiple tags:  150604


### Combining each multitagged article into ONE row

1. Combine the one-hot encoded tags of each multi-tagged article into one entry

In [12]:
sort = multi_tags.sort_values(["Title", "Author", "Year", "Month","Day"]).reset_index()

#Iterate over each row in the data frame
for index, row in sort.iterrows():
    if (index+1) == sort.shape[0]:
        break
    #For each row if the title and author and date are the same as the next row
    if ((row["Title"]==sort.iloc[index+1]["Title"]) and (row["Author"]==sort.iloc[index+1]["Author"]) and (row["Year"]==sort.iloc[index+1]["Year"]) and (row["Month"]==sort.iloc[index+1]["Month"]) and (row["Day"]==sort.iloc[index+1]["Day"])):
        #Pass forward your tag encoding to the next row
        sort.iloc[index+1, 13:]+=sort.iloc[index, 13:]
        assert (sort.iloc[index+1,13:]<2).all(), "Error at row %r" % index
    else:
        continue

2. Delete all multi-tagged entries from the original data

3. Append the combined data from above to the end of the Medium DataFrame

In [13]:
#save only merged OHE entries
merged = sort[~sort.duplicated(subset=["Title", "Author","Year", "Month","Day"], keep="last")]
merged = merged.drop(["index"], axis=1)
print("Total entries deleted: ", sort.shape[0] - merged.shape[0])
print("Percentage of data removed: " ,round(((sort.shape[0]-merged.shape[0])/medium.shape[0])*100,2), "%")

Total entries deleted:  188790
Percentage of data removed:  20.76 %


In [14]:
#Remove all duplicates
medium = medium[~medium.duplicated(subset=["Title", "Author", "Year", "Month","Day"], keep=False)]
#Append the merged duplicate frame
dframes = [medium, merged]
#merge
medium = pd.concat(dframes)
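
The row-by-row loop tends to be slow on a frame this size. A possible alternative, sketched here on a toy frame, is to group on the same key columns and take the maximum of each tag column. The `merge_tags` helper is my own illustration, and it assumes the tag columns are the ones prefixed with "Tag_" by get_dummies. One difference to be aware of: the loop compares titles with `==`, under which NaN never equals NaN, whereas this sketch (like `duplicated()`) treats missing titles as equal.

```python
import pandas as pd

KEYS = ["Title", "Author", "Year", "Month", "Day"]


def merge_tags(df: pd.DataFrame, keys=KEYS) -> pd.DataFrame:
    """Collapse rows that share `keys` into one row with the union of their tags."""
    tag_cols = [c for c in df.columns if c.startswith("Tag_")]
    other_cols = [c for c in df.columns if c not in tag_cols and c not in keys]
    # "max" acts as a logical OR on 0/1 tag columns;
    # "last" takes the last non-missing value of every other column.
    how = {**{c: "last" for c in other_cols}, **{c: "max" for c in tag_cols}}
    # dropna=False keeps groups whose key contains NaN (needs pandas >= 1.1).
    return df.groupby(keys, dropna=False, sort=False, as_index=False).agg(how)


toy = pd.DataFrame({
    "Title": ["Post A", "Post A", "Post B"],
    "Author": ["Ann", "Ann", "Bob"],
    "Year": [2018, 2018, 2018], "Month": [2, 2, 3], "Day": [5, 5, 1],
    "Claps": [10.0, 10.0, 7.0],
    "Tag_ai": [1, 0, 0],
    "Tag_python": [0, 1, 1],
})
merged_toy = merge_tags(toy)
print(merged_toy[["Title", "Tag_ai", "Tag_python"]])
```

Because the grouping handles unique and duplicated rows alike, this version would not need the separate delete-and-append step.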

## Conclusion
How much data do we have after cleaning?

In [15]:
print("Number of articles after cleaning: ", medium.shape[0])

Number of articles after cleaning:  720533


In [16]:
medium.to_csv("Medium_Clean.csv")