In [2]:
from bs4 import BeautifulSoup
import requests
import re
import os
import csv
import pandas as pd
import numpy as np

# Retrieve article urls

## Using a generator

The function below is a generator function, which yields article URLs for a specific topic (an Engadget tag) one at a time.

To use it, first create a generator object, then pass that object to `next` each time you want a new URL:

    gen = article_url_generator("some topic")
    article_url = next(gen)
    
Alternatively, you can loop over all article urls like this:

    for article_url in article_url_generator("some topic"):
        ...
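
When only the first few URLs are needed, `itertools.islice` from the standard library can cap a generator without a manual counter. The sketch below uses a hypothetical stand-in, `fake_url_generator`, with made-up `example.com` URLs so that it runs without network access; it is not part of the original notebook.

```python
from itertools import islice


def fake_url_generator(tag):
    """Hypothetical stand-in for article_url_generator: yields made-up URLs forever."""
    number = 1
    while True:
        yield "https://example.com/tag/{}/article-{}/".format(tag, number)
        number += 1


# Take only the first three URLs; the generator is never advanced further.
first_three = list(islice(fake_url_generator("ai"), 3))
for url in first_three:
    print(url)
```

The same call should work with the real generator in place of the stand-in, since `islice` accepts any iterator.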


In [15]:
def article_url_generator(tag, max_count=None):
    base_url = "https://www.engadget.com"
    page = 1
    count = 0  # number of URLs yielded so far
    # Stop once max_count URLs have been yielded (no limit if max_count is None)
    while max_count is None or count < max_count:
        url = base_url + "/tag/{}/page/{}/".format(tag, page)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        
        # This is the top article - treat it differently
        article = soup.find("article")
        if article:
            article_url_tag = article.find("a")
            if article_url_tag is not None:
                yield base_url + article_url_tag['href']
                count += 1
        else:
            # No article on this page: we have run out of pages
            break
        
        # All other article URLs are found using this approach
        for article_url_tag in soup.find_all("a", class_="o-hit__link"):
            if max_count is not None and count >= max_count:
                return
            yield base_url + article_url_tag['href']
            count += 1
            
        # Move to the next page
        page += 1

In [16]:
gen = article_url_generator("ai")
for i in range(2):
    print(next(gen))

https://www.engadget.com/2017/03/30/toyota-research-ai-battery-material-hunt/
https://www.engadget.com/2017/03/29/samsungs-bixby-ai-assistant-can-see-as-well-as-talk/


## Using a normal function

You can also use a regular function, but it retrieves all the article URLs up front, before you need them.
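
To make the difference concrete, here is a minimal sketch that counts simulated page downloads for both approaches. `fake_fetch_page`, `eager_urls` and `lazy_urls` are hypothetical names introduced only for this illustration, and no real requests are made.

```python
fetched_pages = []  # records every simulated page download


def fake_fetch_page(page):
    """Hypothetical stand-in for downloading one tag page; returns two URLs."""
    fetched_pages.append(page)
    return ["page{}-a".format(page), "page{}-b".format(page)]


def eager_urls(num_pages):
    """Download every page up front and return all URLs as a list."""
    urls = []
    for page in range(1, num_pages + 1):
        urls.extend(fake_fetch_page(page))
    return urls


def lazy_urls(num_pages):
    """Download a page only when its URLs are actually requested."""
    for page in range(1, num_pages + 1):
        yield from fake_fetch_page(page)


eager_first = eager_urls(5)[0]
eager_downloads = len(fetched_pages)  # every page was downloaded

fetched_pages.clear()
lazy_first = next(lazy_urls(5))
lazy_downloads = len(fetched_pages)   # only the first page was downloaded

print(eager_downloads, lazy_downloads)
```

Both approaches return the same first URL, but the list-based version pays for every page before returning anything.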

In [13]:
def get_article_urls(tag, num_pages):
    urls = []
    base_url = "https://www.engadget.com"
    for page in range(1, num_pages+1):
        url = base_url + "/tag/{}/page/{}/".format(tag, page)
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        
        # This is the top article - treat it differently
        article = soup.find("article")
        if article:
            article_url_tag = article.find("a")
            if article_url_tag is not None:
                urls.append(base_url + article_url_tag['href'])
        else:
            break
        
        # All other article URLs are found using this approach
        for article_url_tag in soup.find_all("a", class_="o-hit__link"):
            urls.append(base_url + article_url_tag['href'])
            
    return urls

In [14]:
article_urls = get_article_urls("ai", 2)
for i in range(2):
    print(article_urls[i])

https://www.engadget.com/2017/03/30/toyota-research-ai-battery-material-hunt/
https://www.engadget.com/2017/03/29/samsungs-bixby-ai-assistant-can-see-as-well-as-talk/


# Extracting the article body

The following function extracts the article body from the BeautifulSoup object (`soup`). A paragraph can contain nested tags such as links or emphasis, so its text is split across several strings; this is why we iterate over all the `strings` of each paragraph.
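
The joining and cleanup steps can be tried in isolation. The sketch below assumes a hypothetical list of fragments, roughly what `paragraph.strings` might yield for a paragraph containing a link, and applies the same three string operations used in the function that follows.

```python
import re

# Hypothetical fragments, as `paragraph.strings` might yield them for
# <p>Toyota has turned to <a>artificial intelligence</a>.</p>
fragments = ["Toyota has turned to ", "artificial intelligence", "."]

body = ""
for s in fragments:
    body += " " + s                    # join fragments with a space

body = re.sub(" (?=[.!?])", "", body)  # drop the space before . ! ?
body = " ".join(body.split())          # collapse runs of whitespace
print(body)
```

Without the `re.sub` step, the sentence would end in `intelligence .` because every fragment is prefixed with a space.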

In [17]:
def extract_article_body(soup):
    body = ""
    for article_text in soup.find_all("div", class_=re.compile("article-text")):
        for paragraph in article_text.find_all("p"):
            for s in paragraph.strings:
                body += " " + s
    body = re.sub(" (?=[.!?])", "", body)
    body = " ".join(body.split())
    return body.strip()

In [18]:
urls = get_article_urls("ai", 1)
soup = BeautifulSoup(requests.get(urls[0]).text, "html.parser")
extract_article_body(soup)

'Toyota has turned to artificial intelligence for help in the hunt for new advanced battery materials and fuel cell catalysts. The Toyota Research Institute (TRI) is investing $35 million into the project and is teaming up with various institutions and companies, including MIT and Stanford University. According to the automaker\'s research devision, materials development usually spans decades. By using artificial intelligence techniques, such as machine learning, the researchers can reduce the time it takes to conjure up new materials it wants to use for future zero-emission and carbon-neutral vehicles. TRI Chief Science Officer Eric Krotkov said: "Toyota recognizes that artificial intelligence is a vital basic technology that can be leveraged across a range of industries, and we are proud to use it to expand the boundaries of materials science. Accelerating the pace of materials discovery will help lay the groundwork for the future of clean energy and bring us even closer to achieving

# Retrieving an article

When retrieving an article, we want the following:
- The **title** of the article
- The **preamble** (introduction) of the article
- The **body** (main part) of the article
- The **author** of the article
- The **time** the article was published

In [19]:
def get_article(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    article = {}
    
    # Get the title of the article
    article["title"] = soup.title.get_text()
    
    # Get the preamble of the article
    try:
        article["preamble"] = soup.find("div", class_=re.compile("t-d7@m-")).get_text().strip()
    except AttributeError:
        article["preamble"] = ""
    
    # Get the body of the article
    article["body"] = extract_article_body(soup)
    
    # Get the author of the article
    article["author"] = soup.find("meta", {"name": "blogger_name"}).get("content")
    
    # Publish time
    article["time"] = soup.find("meta", {"name": "published_at"}).get("content")
        
    return article

In [20]:
urls = get_article_urls("ai", 1)
get_article(urls[0])

{'author': 'Mariella Moon',
 'body': 'Toyota has turned to artificial intelligence for help in the hunt for new advanced battery materials and fuel cell catalysts. The Toyota Research Institute (TRI) is investing $35 million into the project and is teaming up with various institutions and companies, including MIT and Stanford University. According to the automaker\'s research devision, materials development usually spans decades. By using artificial intelligence techniques, such as machine learning, the researchers can reduce the time it takes to conjure up new materials it wants to use for future zero-emission and carbon-neutral vehicles. TRI Chief Science Officer Eric Krotkov said: "Toyota recognizes that artificial intelligence is a vital basic technology that can be leveraged across a range of industries, and we are proud to use it to expand the boundaries of materials science. Accelerating the pace of materials discovery will help lay the groundwork for the future of clean energy 

# Retrieve articles for some topics

In [29]:
num_per_topic = 100
articles = []
topics = ["ai", "gaming", "vr"]
for topic in topics:
    url_gen = article_url_generator(topic)
    print("Retrieving {} articles for topic '{}'".format(num_per_topic, topic))
    for i in range(num_per_topic):
        try:
            article = get_article(next(url_gen))
        except StopIteration:
            break
        article["topic"] = topic
        articles.append(article)

Retrieving 100 articles for topic 'ai'
Retrieving 100 articles for topic 'gaming'
Retrieving 100 articles for topic 'vr'
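
The loop above makes several hundred requests, and a single network error or a page with missing metadata would abort the whole run. One possible way to guard against that is a small retry wrapper. This is only a sketch: `fetch_with_retry` and `flaky_fetch` are hypothetical names, and `flaky_fetch` merely simulates a function like `get_article` failing twice before it succeeds.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Call fetch(url), making up to `retries` attempts.

    Returns the result, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: this is only a sketch
            print("Attempt {} for {} failed: {}".format(attempt, url, error))
            time.sleep(delay)       # pause before the next attempt
    return None


calls = {"count": 0}


def flaky_fetch(url):
    """Hypothetical stand-in for get_article: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return {"title": "stub article", "url": url}


article = fetch_with_retry(flaky_fetch, "https://example.com/a", retries=3)
print(article)
```

In the retrieval loop, an article that comes back as `None` could simply be skipped instead of appended.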


# Save the article data

In [32]:
os.makedirs("data", exist_ok=True)

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(os.path.join("data", "engadget_articles.csv"), 'w', encoding='utf-8', newline='') as csvfile:
    columns = ["author", "preamble", "time", "title", "body", "topic"]
    csv_writer = csv.DictWriter(csvfile, fieldnames=columns)
    csv_writer.writeheader()
    for row in articles:
        csv_writer.writerow(row)
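
Article text routinely contains commas, quotation marks and line breaks, which is why the `csv` module is used rather than joining fields by hand. The following sketch writes one made-up row to an in-memory buffer and reads it back; the row contents are invented for the illustration.

```python
import csv
import io

# Made-up row with a comma, embedded quotes and a line break
rows = [{"title": "Hello, world",
         "body": 'She said "hi".\nSecond line.',
         "topic": "ai"}]

buffer = io.StringIO(newline="")  # in-memory stand-in for the CSV file
writer = csv.DictWriter(buffer, fieldnames=["title", "body", "topic"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["body"])
```

The writer quotes the awkward fields automatically, so the values survive the round trip unchanged.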

# Load the data to a pandas dataframe

In [44]:
df = pd.read_csv("data/engadget_articles.csv")

## Example - average number of words per topic

In [45]:
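# Note: len() on a string counts characters, so despite its name this column
# holds character counts; df["body"].str.split().str.len() would count words.
df["num_words"] = df["body"].apply(len)
pd.pivot_table(data=df, values=["num_words"], index=["topic"], aggfunc="mean")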

| topic | num_words |
|---|---|
| ai | 2064.14 |
| gaming | 2334.31 |
| vr | 1994.01 |
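
Since `len` applied to a string returns its length in characters, the averages above are best read as characters per article. A minimal illustration of the difference on a single sentence:

```python
body = "Toyota has turned to artificial intelligence."

num_chars = len(body)          # what `len` measures on a string
num_words = len(body.split())  # whitespace-separated tokens

print(num_chars, num_words)
```

On the dataframe, the word-based equivalent would be `df["body"].str.split().str.len()`.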


## Example - which topics have the authors written about?

In [49]:
pd.crosstab(df["author"], df["topic"], margins=True).head()

| author | ai | gaming | vr | All |
|---|---|---|---|---|
| Aaron Souppouris | 2 | 0 | 1 | 3 |
| Amber Bouman | 1 | 0 | 0 | 1 |
| Andrew Dalton | 2 | 1 | 2 | 5 |
| Andrew Tarantola | 3 | 2 | 2 | 7 |
| Autoblog | 1 | 0 | 0 | 1 |


## Example - sort articles by time

In [48]:
df.sort_values(["time"], ascending=True).head()

| | author | preamble | time | title | body | topic | num_words |
|---|---|---|---|---|---|---|---|
| 99 | Jon Fingas | It should be less racist this time around. | 2016-12-05T15:55:00-05:00 | Microsoft's second try at social chat bots arr... | Microsoft's first foray into social chat bots ... | ai | 1449 |
| 98 | Autoblog | This could be really cool, or quite unsettling. | 2016-12-05T21:37:00-05:00 | Honda's NeuV concept fires up its 'emotion eng... | Enthusiasts frequently talk about cars as thou... | ai | 1790 |
| 97 | Jon Fingas | The frequently secretive company is opening up... | 2016-12-06T15:09:00-05:00 | Apple will publish its AI research | Apple isn't exactly known for sharing its rese... | ai | 1212 |
| 96 | Jon Fingas | The question is: could it and should it have i... | 2016-12-07T17:00:00-05:00 | Facebook patent hints at an automated solution... | Facebook may have said that it's stepping up i... | ai | 2109 |
| 95 | Matt Brian | The physical home button may also be replaced ... | 2016-12-08T06:25:00-05:00 | Samsung's Galaxy S8 might have a true edge-to-... | With the Galaxy Note 7 debacle weighing heavy ... | ai | 1724 |
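
The `time` column holds ISO 8601 strings, so the sort above compares text. That gives the chronological order as long as every timestamp carries the same UTC offset, but it can misorder rows when offsets differ. The sketch below uses two hypothetical timestamps to show the effect and parses them with `datetime.fromisoformat` (Python 3.7+).

```python
from datetime import datetime

# Hypothetical timestamps: the second one is earlier in absolute time,
# although it sorts later as plain text.
stamps = ["2016-12-05T15:55:00-05:00", "2016-12-05T16:30:00+01:00"]

as_text = sorted(stamps)                              # compares characters
as_time = sorted(stamps, key=datetime.fromisoformat)  # compares instants

print(as_text)
print(as_time)
```

In pandas, `pd.to_datetime(df["time"], utc=True)` performs a comparable conversion before sorting.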


More cool pandas functions can be found [here](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/)