In [2]:
import numpy as np

from bs4 import BeautifulSoup
import requests

import os

from tqdm.notebook import tqdm
import pandas as pd
import re

import sys
# insert at 1, 0 is the script path (or '' in REPL)
sys.path.insert(1, './script')
import medium

# Initialization

The __Medium__ class works in a two-phase process.
The two phases are called __First Extraction__ and __Second Extraction__.

### First Extraction

First, the class opens the page for the tag. The following picture shows the example for the tag `okr`.

![./figures/fig1.png](./figures/fig1.png)

This page shows the _archive_, namely the dates on which at least one article was published.

![fig2.png](./figures/fig2.png)

Then the method visits every day on which at least one article is found.

![fig3.png](./figures/fig3.png)
![fig4.png](./figures/fig4.png)
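
As a rough illustration of the archive navigation described above, the sketch below builds archive URLs for a tag. The day-level pattern matches the URLs in the `date` column of the extracted data (for example `https://medium.com/tag/okr/archive/2018/10/02`); the year-level and month-level forms are assumed to follow the same scheme. The helper name `archive_url` is hypothetical and is not part of the `medium` script.

```python
def archive_url(tag, year=None, month=None, day=None):
    """Build a Medium tag-archive URL, optionally narrowed to a year, month or day.

    Hypothetical helper: only the day-level form is confirmed by the data
    shown in this notebook; the shorter forms are an assumption.
    """
    parts = ["https://medium.com/tag", tag, "archive"]
    if year is not None:
        parts.append(f"{year:04d}")
        if month is not None:
            parts.append(f"{month:02d}")
            if day is not None:
                parts.append(f"{day:02d}")
    return "/".join(parts)


print(archive_url("okr"))
print(archive_url("okr", 2018))
print(archive_url("okr", 2018, 10, 2))
```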

When initializing the class, you need to specify the first year from which to start scraping articles. For `okr` we choose 2014.

In [29]:
okr = medium('okr', 2014)

Checking archive...
Checking year 2012...
Checking year 2013...
Checking year 2014...
Checking year 2015...
Checking year 2016...
Checking year 2017...
Checking year 2018...
Checking year 2019...
Checking year 2020...


In this first phase, a folder is created at `./data/buffer/first_extraction`.
The files for each day are stored in this folder.

![fig5.png](./figures/fig5.png)

This is an example of such a file.
![fig6.png](./figures/fig6.png)
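
To check what the first phase produced, one option is simply to list the buffer folder. This is a minimal sketch that assumes nothing about the file format; the helper name `list_day_files` is hypothetical, and only the folder path comes from the text above.

```python
from pathlib import Path


def list_day_files(folder="./data/buffer/first_extraction"):
    """Return the sorted names of the regular files found in the buffer folder."""
    path = Path(folder)
    if not path.is_dir():
        return []  # the first extraction has not been run yet
    return sorted(p.name for p in path.iterdir() if p.is_file())


files = list_day_files()
print(f"{len(files)} file(s) in the first-extraction buffer")
```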

# Second Extraction

### Articles dump

For each file stored in `./data/buffer/first_extraction`, the full-page HTML is downloaded and stored in the folder `./data/buffer/second_extraction`.
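
The dump step could look roughly like the sketch below. It is not the code in the `medium` script: the function names, the `(name, url)` input shape and the `<name>.html` file naming are all assumptions, and the download uses the standard library's `urllib` as a stand-in for the `requests` import at the top of the notebook, so that the sketch has no third-party dependency.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its HTML as text (stdlib stand-in for requests)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def dump_articles(articles, out_dir="./data/buffer/second_extraction", fetch=fetch_html):
    """Save the HTML of each (name, url) pair as out_dir/<name>.html.

    Files that already exist are skipped, so an interrupted run can be resumed.
    Returns the list of paths written during this call.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, url in articles:
        target = out_path / f"{name}.html"
        if target.exists():
            continue  # already dumped in a previous run
        target.write_text(fetch(url), encoding="utf-8")
        written.append(target)
    return written
```

Passing the download function as the `fetch` argument keeps the file handling testable without network access.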

### Read html

For each file stored in `./data/buffer/second_extraction`, the HTML is scraped to extract the title, hyperlinks, tags and full content.
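
The sketch below illustrates the kind of parsing this step involves. The notebook imports `BeautifulSoup`, but this example uses the standard library's `html.parser` so that it runs without extra packages. The markup assumptions (title in the first `<h1>`, body text in `<p>` elements) are guesses about a generic article page, not Medium's actual structure, and tag extraction is left out because it depends on that structure. The names `ArticleParser` and `read_html` are hypothetical.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the first <h1> text, every <a href> target and the text of each <p>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.paragraphs = []
        self._open = None   # "h1" or "p" while inside one of those elements
        self._buffer = []   # text fragments of the element currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in ("h1", "p"):
            self._open = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        text = "".join(self._buffer).strip()
        if tag == "h1" and not self.title:
            self.title = text
        elif tag == "p" and text:
            self.paragraphs.append(text)
        self._open = None


def read_html(html):
    """Return title, content and '; '-joined links, mirroring the output columns."""
    parser = ArticleParser()
    parser.feed(html)
    return {
        "title": parser.title,
        "content": "\n\n".join(parser.paragraphs),
        "links": "; ".join(parser.links),
    }
```

The links are joined with `"; "` because that is the separator visible in the `links` column of the extracted data.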

##### Both steps are run by a single method

In [520]:
okr.extract_single_articles()

Dumping articles...




Reading html...






#### View the extracted data

In [6]:
okr.data()

Data extracted


| | index | date | title | url | pubDate | pubId | content | tags | links |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | https://medium.com/tag/okr/archive/2018/10/02 | OKRs — A Simple Template | https://medium.com/@taratan/okrs-a-simple-temp... | 2018-10-02 | Y2018M10D2N0001 | \n\nDespite its uncatchy moniker, OKRs (which ... | Open in app; Okr; Startups; Productivity; Goals | http://tiny.cc/okr-template; /startup-tools/ok... |
| 1 | 0 | https://medium.com/tag/okr/archive/2019/01/02 | The Yearly Reflection Getaway Holiday | https://medium.com/@runetheill/the-yearly-refl... | 2019-01-02 | Y2019M1D2N0001 | \n\nSince I decided to step up and take the le... | Open in app; Okr; Reflections; Startup; CEO; M... | https://www.rockstart.com/; https://www.entrep... |
| 2 | 1 | https://medium.com/tag/okr/archive/2019/01/02 | OKRs (Objective and Key Results) As A Primal B... | https://medium.com/@primalbranding/okrs-object... | 2019-01-02 | Y2019M1D2N0002 | \n\nA friend handed me a copy of John Doerr’s ... | Open in app; Objective And Key Results; Primal... | |
| 3 | 2 | https://medium.com/tag/okr/archive/2019/01/02 | Confused about the difference between a priori... | https://medium.com/@beckfeldt/confused-about-t... | 2019-01-02 | Y2019M1D2N0003 | \n\nView original article on www.eckfeldt.com.... | Open in app; Okr; Priorities; Goals; Goal Sett... | http://www.eckfeldt.com/blog-posts/confused-ab... |
| 4 | 0 | https://medium.com/tag/okr/archive/2018/11/26 | ¡Hablemos en Números! | https://medium.com/@AndresERojasI/hablemos-en-... | 2018-11-26 | Y2018M11D26N0001 | \n\n\begin_title\nCómo medir el éxito de tu Pr... | Open in app; Startup; Kpi; Okr; Technology; Su... | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1719 | 0 | https://medium.com/tag/okr/archive/2018/12/26 | New alternative to OKR’s, KRA’s & KPI’s for St... | https://medium.com/@siddharthram/new-alternati... | 2018-12-26 | Y2018M12D26N0001 | \n\n\begin_title\nGoals, Methods, Habits for y... | Open in app; Kpi; Startup; Entrepreneur; Produ... | https://medium.com/u/67f5049293c7?source=post_... |
| 1720 | 1 | https://medium.com/tag/okr/archive/2018/12/26 | The Common Mistakes in Writing OKR | https://medium.com/product-narrative/the-commo... | 2018-12-26 | Y2018M12D26N0002 | \n\nWelcome to Shared Narrative #11!\n\nOKR is... | OKR; Our Newsletter; (in) Bahasa Indonesia; Ok... | /product-narrative/the-trouble-two-challenges-... |
| 1721 | 2 | https://medium.com/tag/okr/archive/2018/12/26 | Personal Goal Setting & Living Deliberately | https://medium.com/@amamujee/personal-goal-set... | 2018-12-26 | Y2018M12D26N0003 | \n\nDecember 2018\n\nI wanted to share my pers... | Open in app; Goals; Planning; Okr | https://en.wikipedia.org/wiki/OKR; https://doc... |
| 1722 | 3 | https://medium.com/tag/okr/archive/2018/12/26 | Come 2019 (Part 2) | https://medium.com/the-learning-machine-projec... | 2018-12-26 | Y2018M12D26N0004 | \n\n\begin_title\nOKRs and a Successful 2019\n... | Okr; Personal Development; Goals; Personal Gro... | /the-learning-machine-project/come-2019-part-1... |
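
Two conventions are visible in the rows above: the `pubId` values look like `Y<year>M<month>D<day>N<4-digit counter>`, with the counter apparently equal to `index + 1`, and the `tags` and `links` cells are `"; "`-separated lists. The sketch below decodes both. These formats are inferred from the displayed rows only, the helper names are hypothetical, and treating `Open in app` as a button label rather than a real tag is an assumption.

```python
import re
from datetime import date

# Pattern inferred from values such as "Y2018M10D2N0001" (assumption).
PUB_ID = re.compile(r"Y(\d{4})M(\d{1,2})D(\d{1,2})N(\d{4})")


def parse_pub_id(pub_id):
    """Split a pubId into its publication date and per-day article counter."""
    match = PUB_ID.fullmatch(pub_id)
    if match is None:
        raise ValueError(f"unexpected pubId format: {pub_id!r}")
    year, month, day, number = map(int, match.groups())
    return date(year, month, day), number


def split_field(value, drop=("Open in app",)):
    """Split a '; '-separated cell into a list, dropping the entries named in `drop`."""
    if not value:
        return []
    return [item for item in value.split("; ") if item not in drop]


print(parse_pub_id("Y2019M1D2N0002"))
print(split_field("Open in app; Goals; Planning; Okr"))
```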
