In [22]:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import urllib3
import certifi

#### Obtain BeautifulSoup Format HTML (Requests Approach)

In [23]:
# assign url of article to variable article
article = 'https://thebolditalic.com/in-defense-of-papyrus-your-guide-for-when-to-use-despised-fonts-73fd6edcd2a2'

# send a GET request for the article's webpage (the returned value is the response object)
request = requests.get(article)

# read the response body as decoded text, i.e. the page's HTML
data = request.text

# convert the html to a BeautifulSoup object (nested data structure)
soup = BeautifulSoup(data, 'html.parser')

#### Obtain BeautifulSoup Format HTML (Urllib3 Approach)

In [43]:
# assign url of article to variable article
article = 'https://thebolditalic.com/in-defense-of-papyrus-your-guide-for-when-to-use-despised-fonts-73fd6edcd2a2'

# initialise a PoolManager object that verifies certificates against the certifi CA bundle
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())

# obtain the article's webpage through a GET request made by the PoolManager (http)
request = http.request('GET', article)

# obtain the HTML of the article's webpage as raw bytes
data = request.data

# convert the html to a BeautifulSoup object (nested data structure)
soup = BeautifulSoup(data, 'html.parser')
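
The two approaches hand BeautifulSoup different types: `request.text` from requests is a decoded string, while `request.data` from urllib3 is raw bytes. BeautifulSoup accepts both. A minimal sketch of that equivalence, using a made-up inline page (an assumption, so no download is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical page, used here instead of a live download
html_text = ('<html><head><meta charset="utf-8">'
             '<title>Typefaces – Demo</title></head><body></body></html>')

# Bytes, as urllib3's .data would hold them
html_bytes = html_text.encode('utf-8')

# Parse the same page once from a string and once from bytes
soup_from_text = BeautifulSoup(html_text, 'html.parser')
soup_from_bytes = BeautifulSoup(html_bytes, 'html.parser')

# Both parses should yield the same title text
print(soup_from_text.title.string == soup_from_bytes.title.string)
```

When bytes are passed, BeautifulSoup works out the encoding itself, here from the declared `charset`.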

View HTML Code

In [33]:
# View HTML code of article webpage in BeautifulSoup format
print(soup.prettify())

<!DOCTYPE html>
<html xmlns:cc="http://creativecommons.org/ns#">
 <head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# medium-com: http://ogp.me/ns/fb/medium-com#">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   The Anatomy of a Thousand Typefaces – Florian Schulz – Medium
  </title>
  <link href="https://medium.com/@getflourish/the-anatomy-of-a-thousand-typefaces-f7b9088eed1" rel="canonical"/>
  <meta content="The Anatomy of a Thousand Typefaces – Florian Schulz – Medium" name="title"/>
  <meta content="unsafe-url" name="referrer"/>
  <meta content="Even years after Avatar’s release, there’s one thing Ryan Gosling just can’t get over: the choice of the movie’s logo font “Papyrus”. In the parody produced by Saturday night life, the designer of…" name="description"/>
  <meta content="#000000" name="theme-color"/>
  <meta content="The Anatomy of a Thousand Typefaces – Flor

Create pandas dataframe and format it with desired attributes as columns

In [26]:
cols = ['title', 'author', 'published_date', 'published_time', 'content']
article_data = pd.DataFrame(columns = cols)

### Scrape title, author, published date and published time from BS object

In [34]:
# Obtaining the title of the article
title = soup.title.string.split(" – ",1)[0]

# Author
author = soup.find(property='author')['content']

# Published datetime: look up the timestamp once, then slice out the date (first 10 characters) and the time (characters 11 to 18)
published = soup.find(property='article:published_time')['content']
published_date = published[:10]
published_time = published[11:19]

# To check whether extracted article information is accurate
print('Title: ' + title + '\nAuthor: ' + author + '\nPublished Datetime: ' + published_date + ' ' + published_time)

Title: The Anatomy of a Thousand Typefaces
Author: Florian Schulz
Published Datetime: 2017-10-22 12:37:29
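
The fixed-offset slices above only work while the timestamp keeps the same layout. A minimal sketch of a stricter alternative follows. The raw meta value is not shown in this notebook, so the sample string `2017-10-22T12:37:29.000Z` and the helper name `split_published` are assumptions:

```python
from datetime import datetime

def split_published(timestamp):
    """Return (date, time) strings from an ISO 8601-style timestamp.

    Hypothetical helper: strptime raises ValueError on an unexpected
    layout instead of silently returning wrong slices.
    """
    # Only the first 19 characters (YYYY-MM-DDTHH:MM:SS) are parsed;
    # fractional seconds and any timezone suffix are ignored.
    parsed = datetime.strptime(timestamp[:19], "%Y-%m-%dT%H:%M:%S")
    return parsed.strftime("%Y-%m-%d"), parsed.strftime("%H:%M:%S")

# Assumed sample value, chosen to match the output printed above
published_date, published_time = split_published("2017-10-22T12:37:29.000Z")
print(published_date, published_time)
```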


Parse article body and format it to become one fully connected string

In [39]:
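# collect paragraph and heading tags in document order
body = soup.find_all(["p", "h2", "h3", "h4", "h5", "h6"])
content = ''

# the last 6 matched tags are skipped (presumably text that follows the article on this page)
for tag in body[:max(len(body) - 6, 0)]:
    # .string is None when a tag has more than one child node (e.g. text mixed with a link), so such tags are skipped
    if isinstance(tag.string, str):
        if tag.name in ['h2', 'h3', 'h4', 'h5', 'h6']:
            # start each heading on its own line, separated from earlier text
            if content == '':
                content += tag.string + '\n'
            else:
                content += '\n \n' + tag.string + '\n'
        elif tag.name == 'p':
            content += tag.string + ' '

print(content)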

An attempt to building a font database with opentype.js
On one hand, a limitation to system fonts, as seen in the video, can lead to a bad choice because there simply isn’t something better installed. On the other hand, web font libraries with hundreds or thousands of fonts can be quite overwhelming and lead to a paradox of choice. 
 
Dinner for none: The font menu’s bitter taste
The average font menu presents a list of available fonts, sorted by name, but completely unrelated otherwise: A typeface designed for bold headlines is followed by one designed for small user interfaces and then a fancy script typeface made for wedding invitations shows up. Now you either get trapped in a time consuming process of scrolling through the whole list from start to end or you simply decide to pick the first best match from the upper part of the list and call it a day. This is obviously not an interface made for systematic exploration — but infinite surprises. While I like to be surprised, I also li
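
The loop above keeps a tag only when its `.string` is a string. For a paragraph that mixes text with nested tags (a link, italics), `.string` is `None`, so the paragraph is dropped from `content`. A minimal sketch of a variant based on `get_text()`, which joins the text of all descendants. The HTML snippet and the helper name `join_body` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second paragraph contains a nested <em> tag
html = ("<h3>Heading</h3><p>Plain text.</p>"
        "<p>Text with <em>emphasis</em> inside.</p>")
soup_demo = BeautifulSoup(html, "html.parser")

def join_body(tags):
    """Join headings and paragraphs into one string using get_text()."""
    parts = []
    for tag in tags:
        text = tag.get_text().strip()
        if not text:
            continue  # skip tags with no text at all
        if tag.name == "p":
            parts.append(text + " ")
        else:
            # headings start a new block, separated from earlier text
            prefix = "\n \n" if parts else ""
            parts.append(prefix + text + "\n")
    return "".join(parts)

tags = soup_demo.find_all(["p", "h2", "h3", "h4", "h5", "h6"])
print(tags[2].string)   # None: this paragraph has more than one child node
print(join_body(tags))  # the nested paragraph is kept
```

Whether nested paragraphs should be kept depends on the page. On some sites they are mostly captions or embeds, in which case the stricter `.string` check may be the better filter.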

### Place scraped data into dataframe

In [41]:
# Putting data into a list
data = [title, author, published_date, published_time, content]

# Inserting data into article_data DataFrame
article_data.loc[0] = data 

In [42]:
article_data

Unnamed: 0,title,author,published_date,published_time,content
0,The Anatomy of a Thousand Typefaces,Florian Schulz,2017-10-22,12:37:29,An attempt to building a font database with op...
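
Assigning to `.loc[0]` works well for a single row. When several articles are scraped, it is often simpler to collect one dict per article and build the DataFrame once at the end. A minimal sketch, where the rows are placeholders rather than scraped data:

```python
import pandas as pd

cols = ['title', 'author', 'published_date', 'published_time', 'content']

# Placeholder rows standing in for scraped articles (not real data)
rows = [
    {'title': 'First title', 'author': 'Author A',
     'published_date': '2020-01-01', 'published_time': '08:00:00',
     'content': 'First body text'},
    {'title': 'Second title', 'author': 'Author B',
     'published_date': '2020-01-02', 'published_time': '09:30:00',
     'content': 'Second body text'},
]

# Passing columns=cols fixes the column order regardless of dict key order
articles = pd.DataFrame(rows, columns=cols)
print(articles.shape)
```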


Save to csv file

In [46]:
dest = "Output/"
filename = 'single_article_1.csv'
article_data.to_csv(dest + filename, encoding='utf-8')
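
The cell above assumes the `Output/` folder already exists, and `to_csv` writes the row index as an unnamed first column by default. A minimal sketch of a helper that handles both. The name `save_csv`, the stand-in DataFrame and the temporary folder are assumptions for the demo:

```python
import os
import tempfile
import pandas as pd

def save_csv(df, dest, filename):
    """Write df to dest/filename, creating dest first if it is missing."""
    os.makedirs(dest, exist_ok=True)
    path = os.path.join(dest, filename)
    # index=False leaves out the row index, which would otherwise be
    # read back as an extra "Unnamed: 0" column
    df.to_csv(path, index=False, encoding='utf-8')
    return path

# Stand-in data and a temporary folder, so the demo leaves no files behind
demo = pd.DataFrame({'title': ['Example title'], 'author': ['Example author']})
with tempfile.TemporaryDirectory() as tmp:
    path = save_csv(demo, os.path.join(tmp, 'Output'), 'single_article_1.csv')
    print(pd.read_csv(path).columns.tolist())
```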

Save to excel file

In [47]:
dest = "Output/"
filename = 'single_articles_1.xlsx'
# .xlsx files store text as Unicode, so no encoding argument is needed (pandas 2.0+ rejects one)
article_data.to_excel(dest + filename)