# 2. Low Level Crawling

In the following we will crawl [TechCrunch.com](https://techcrunch.com) as an example. Please note that this is only an example and should not be used in a commercial or similar context, so as not to violate TechCrunch's terms and conditions.
<br>
<br>
To implement our low-level crawler we will only use basic Python packages, such as [requests](http://docs.python-requests.org/en/master/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), and [Pandas](https://pandas.pydata.org/pandas-docs/stable/). First, we will implement the individual components of the web crawler as functions. Second, we will combine everything into a functional web crawler.

### Preparation
The preparation consists of importing all needed packages and defining the basic variables.

In [8]:
import requests
import bs4
import json
import time
import pandas as pd

In [9]:
base_url = 'https://techcrunch.com'
number_of_pages = 3

### Get the webpage for one url
First we will implement a simple function that takes a URL and returns a BeautifulSoup object. To do so, we send a GET request and parse the response text into a BeautifulSoup object.

In [65]:
def get_page(url):
    # Send an HTTP GET request for the given URL
    response = requests.get(url)

    # Abort if the server did not answer with 200 OK
    if response.status_code != 200:
        raise Exception("Error with request: status code {}".format(response.status_code))

    # Parse the HTML text into a BeautifulSoup object (requires the lxml package)
    page = bs4.BeautifulSoup(response.text, 'lxml')

    return page
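
Without a timeout, `requests.get` can wait indefinitely for a server that never answers. The following is a minimal sketch of a more defensive variant. The function name `get_page_with_timeout`, the 10 second default and the User-Agent string are illustrative assumptions, not part of the crawler above. It uses Python's built-in `html.parser`, so it does not depend on lxml.

```python
import requests
import bs4

# Hypothetical identification string; replace it with your own details.
USER_AGENT = 'tutorial-crawler/0.1 (educational example)'


def get_page_with_timeout(url, timeout=10):
    # Give up if the server does not respond within `timeout` seconds
    response = requests.get(url, headers={'User-Agent': USER_AGENT}, timeout=timeout)

    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()

    # html.parser ships with Python, so no extra parser package is needed
    return bs4.BeautifulSoup(response.text, 'html.parser')
```

It can be used as a drop-in replacement for `get_page`, for example `page = get_page_with_timeout(base_url)`.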

### Get Article URLs
Now we need to find the URLs of all articles listed on the page we previously retrieved. We can find the URLs in the **Read More** buttons.
![Example Article](img/article.png)
Each link has a structure like this:
```HTML
<a href="https://techcrunch.com/2018/03/02/some-random-article/" 
   class="read-more" 
   data-omni-sm="gbl_river_readmore,2">
        Read More
</a>
```
To identify all relevant links we can use BeautifulSoup's `find_all` function, which also lets us filter for specific classes. In the markup above the class is called **read-more**; the code below filters on the class `post-block__title__link` instead.


In [66]:
# Assumption: soup is the listing page at base_url
soup = get_page(base_url)
a_s = soup.find_all('a', {'class': 'post-block__title__link'})
for a in a_s:
    print(a.attrs['href'])

https://techcrunch.com/2018/03/22/tide-ceo-change/
https://techcrunch.com/2018/03/22/revolut-launches-disposable-virtual-cards/
https://techcrunch.com/story/facebook-responds-to-data-misuse/
https://techcrunch.com/2018/03/21/first-impressions-of-the-199-oculus-go-vr-headset/
https://techcrunch.com/2018/03/21/video-the-driver-of-the-autonomous-uber-was-distracted-before-fatal-crash/
https://techcrunch.com/2018/03/21/get-the-latest-tc-stories-read-to-you-over-the-phone-with-braillevoice/
https://techcrunch.com/2018/03/21/ai-game-trainer-gosu-ai-raises-1-9m-to-give-gamers-a-virtual-assistant/
https://techcrunch.com/2018/03/21/now-would-be-a-good-time-for-mark-zuckerberg-to-resign/
https://techcrunch.com/2018/03/21/twitters-chief-information-security-officer-is-quitting/
https://techcrunch.com/gallery/yc-demo-day-top-startups-2018/
https://techcrunch.com/2018/03/21/burrow-series-a/
https://techcrunch.com/2018/03/21/bitcoin-jack-dorsey-quote-single-currency/
https://techcrunch.com/2018/03/2

In [37]:
def get_article_urls(page):
    # Get a list of the hrefs of all links with the class post-block__title__link
    
    refs = []
    a_s = page.find_all('a', {'class': 'post-block__title__link'})
    for a in a_s:
        refs.append(a.attrs['href'])
    return refs

Test your code:

In [38]:
url = 'https://techcrunch.com/page/2/'
page = get_page(url)
article_urls = get_article_urls(page)
article_urls[:3]





['https://techcrunch.com/2018/03/21/omega-takes-us-to-the-dark-side-with-their-new-moonwatch/',
 'https://techcrunch.com/2018/03/21/elon-musks-boring-co-flamethrower-ships-in-time-for-summer-bbqs/',
 'https://techcrunch.com/2018/03/21/netflix-launches-bug-bounty-program-to-pay-researchers-to-track-down-bugs/']
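
To check the class filter without sending a request, the same `find_all` call can be run against a small inline HTML snippet. This is only a sketch: the markup in `SAMPLE_HTML`, the `example.com` URLs and the helper name `extract_links` are made up for illustration.

```python
import bs4

# Made-up listing markup that mimics the structure the crawler looks for
SAMPLE_HTML = """
<div>
  <a class="post-block__title__link" href="https://example.com/2018/03/21/first/">First</a>
  <a class="other-link" href="https://example.com/about/">About</a>
  <a class="post-block__title__link" href="https://example.com/2018/03/21/second/">Second</a>
</div>
"""


def extract_links(page, css_class):
    # find_all returns every <a> tag whose class attribute contains css_class
    return [a.attrs['href'] for a in page.find_all('a', {'class': css_class})]


sample_page = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_links = extract_links(sample_page, 'post-block__title__link')
print(sample_links)
```

Only the two links with the matching class are returned; the link with the class `other-link` is ignored.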

### Get Article Info
In this step we will implement a function that extracts all desired information for one article URL. You will need to:
1. Get the page (Hint: You can use already implemented functions)
2. Extract all desired information (Title, Authors, Date, Tags, Text)
3. Combine all in a dictionary

The extraction works much like the previous code: you only need `page.find(...)` and `page.find_all(...)`.

In [60]:
def get_article_info(url, delay=1):
    import datetime

    # Pause for `delay` seconds before each request to be polite to the server
    time.sleep(delay)

    page = get_page(url)

    title = page.find('h1').text
    author = page.find('div', {"class": "article__byline"}).find('a').text.strip()
    text = page.find('div', {"class": "article-content"}).text.strip()

    date_crawling = str(datetime.date.today())

    article = {
        'title': title,
        'author': author,
        'text': text,
        # The publication date is taken from the URL path (/YYYY/MM/DD/...)
        'date': '/'.join(url.split('/')[3:6]),
        'crawled_at': date_crawling
    }

    return article

Test your code:

In [61]:
url = 'https://techcrunch.com/2018/03/22/revolut-launches-disposable-virtual-cards/'
article = get_article_info(url, delay=0)
# Shorten text for better readability. Not needed in the real crawler.
#article['text'] = article['text'][:300] + '...'
article





{'author': 'Romain Dillet',
 'crawled_at': '2018-03-22',
 'date': '2018/03/22',
 'text': 'Fintech  startup Revolut is launching a new type of virtual cards — disposable cards for online purchases. While you could already generate additional virtual cards for a fee, this is a different kind of virtual card as it gets destroyed after each transaction.\nIf you usually shop on Amazon or if you have a Spotify subscription, those services first asked you to enter your card number and they keep charging the same card.\nBut what if you end up on a dodgy-looking site but you really want to buy that funny pair of socks? Chances are you won’t ever purchase anything againt on this website. And you don’t want to give them your actual card information.\nNow, you can generate a virtual card in Revolut  and enter it on that weird site. After the transaction, Revolut will disable this card forever. If the website wants to charge you again, the transaction will fail.\nAnd if you’re on a shopping spree, 

'2018/03/22'
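
Reading the date with `url.split('/')[3:6]` only works for URLs of the form `/YYYY/MM/DD/slug/`. The listing also contains URLs such as `https://techcrunch.com/story/facebook-responds-to-data-misuse/`, for which the slice yields path segments instead of a date. A minimal sketch of a stricter helper follows; the name `date_from_url` is hypothetical, and it returns `None` when the URL carries no date.

```python
import datetime
import re

# Matches /YYYY/MM/DD/ anywhere in the URL
DATE_PATTERN = re.compile(r'/(\d{4})/(\d{2})/(\d{2})/')


def date_from_url(url):
    match = DATE_PATTERN.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # Validate that the numbers form a real calendar date
        return datetime.date(year, month, day).strftime('%Y/%m/%d')
    except ValueError:
        return None


print(date_from_url('https://techcrunch.com/2018/03/22/revolut-launches-disposable-virtual-cards/'))
print(date_from_url('https://techcrunch.com/story/facebook-responds-to-data-misuse/'))
```

The first call prints `2018/03/22`, the second prints `None`.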

### Get the next URL
Finally we need to extract the URL of the next page listing articles. We will use the same procedure as before to find the link to the next page and read its href. In the code below this is the link with the class `load-more`.<br>
![Next Button](img/next_button.png)

In [62]:
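def get_next_url(page, base_url):
    # Find the link that leads to the next listing page
    link = page.find('a', {'class': 'load-more'})

    # Read the target URL from its href attribute
    # (base_url is not needed as long as the href is an absolute URL)
    url = link.attrs['href']

    return url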

In [63]:
get_next_url(page, base_url)

'https://techcrunch.com/page/3/'
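
The `base_url` argument becomes useful when a site emits relative links such as `/page/3/`. A small sketch, using the standard library's `urllib.parse.urljoin` and a hypothetical helper name `absolute_url`:

```python
from urllib.parse import urljoin


def absolute_url(base_url, href):
    # Absolute hrefs are returned unchanged, relative ones are resolved against base_url
    return urljoin(base_url, href)


print(absolute_url('https://techcrunch.com', '/page/3/'))
print(absolute_url('https://techcrunch.com', 'https://techcrunch.com/page/3/'))
```

Both calls print `https://techcrunch.com/page/3/`, so the helper is safe to apply to either kind of link.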

### Put it all together
Now that we have implemented all the important parts, the last challenge is to put them together and run the crawler. To do so, we will run through **number_of_pages** pages, which list recent articles on TechCrunch, and for each page:
1. Get the page for the current URL.
2. Extract the article URLs from the page.
3. Get the information for every article.
4. Add the article information to the list `articles`, which contains the information for all articles.
5. Find the reference to the next page listing articles.

In [68]:
current_url = base_url
number_of_pages = 2

articles = []
for n in range(number_of_pages):

    print('Crawling: {}'.format(current_url))

    # 1. Get the page for the current URL
    current_page = get_page(current_url)

    # 2. Extract the article URLs from the page
    article_urls = get_article_urls(current_page)

    for article_url in article_urls:
        print(article_url)
        try:
            # 3. Get the information for the article
            info = get_article_info(article_url)
            # 4. Add it to the list of all articles
            articles.append(info)
        except Exception as e:
            # Skip articles that cannot be fetched or parsed, but report them
            print('Skipped {}: {}'.format(article_url, e))

    # 5. Find the reference to the next page listing articles
    current_url = get_next_url(current_page, base_url)

print('Finished crawling. Found {} Articles'.format(len(articles)))

Crawling: https://techcrunch.com
https://techcrunch.com/story/facebook-responds-to-data-misuse/
https://techcrunch.com/2018/03/22/tide-ceo-change/
https://techcrunch.com/2018/03/22/revolut-launches-disposable-virtual-cards/
https://techcrunch.com/2018/03/21/first-impressions-of-the-199-oculus-go-vr-headset/
https://techcrunch.com/2018/03/21/video-the-driver-of-the-autonomous-uber-was-distracted-before-fatal-crash/
https://techcrunch.com/2018/03/21/get-the-latest-tc-stories-read-to-you-over-the-phone-with-braillevoice/
https://techcrunch.com/2018/03/21/ai-game-trainer-gosu-ai-raises-1-9m-to-give-gamers-a-virtual-assistant/
https://techcrunch.com/2018/03/21/now-would-be-a-good-time-for-mark-zuckerberg-to-resign/
https://techcrunch.com/2018/03/21/twitters-chief-information-security-officer-is-quitting/
https://techcrunch.com/gallery/yc-demo-day-top-startups-2018/
https://techcrunch.com/2018/03/21/burrow-series-a/
https://techcrunch.com/2018/03/21/bitcoin-jack-dorsey-quote-single-currency/
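
The loop above can also be wrapped in a function, which makes it possible to try out the control flow without network access. The following is a sketch under that assumption: `crawl` and its parameters are hypothetical names, and the four callables stand in for `get_page`, `get_article_urls`, `get_article_info` and `get_next_url`. It additionally skips article URLs it has already seen and stops when there is no next page.

```python
def crawl(start_url, number_of_pages, fetch_page, list_articles, fetch_article, next_url):
    articles = []
    seen = set()
    current_url = start_url

    for _ in range(number_of_pages):
        if current_url is None:
            break  # no further listing page
        page = fetch_page(current_url)

        for article_url in list_articles(page):
            if article_url in seen:
                continue  # do not fetch the same article twice
            seen.add(article_url)
            try:
                articles.append(fetch_article(article_url))
            except Exception as error:
                print('Skipped {}: {}'.format(article_url, error))

        current_url = next_url(page)

    return articles


# Tiny in-memory "website" used instead of real HTTP requests
FAKE_SITE = {
    'page1': {'articles': ['a', 'b'], 'next': 'page2'},
    'page2': {'articles': ['b', 'c'], 'next': None},
}

result = crawl(
    start_url='page1',
    number_of_pages=3,
    fetch_page=lambda url: FAKE_SITE[url],
    list_articles=lambda page: page['articles'],
    fetch_article=lambda url: {'title': url.upper()},
    next_url=lambda page: page['next'],
)
print(result)
```

With the real functions, the call could look like `crawl(base_url, 2, get_page, get_article_urls, get_article_info, lambda page: get_next_url(page, base_url))`.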

### Storing results to csv
To store our articles we use the library [Pandas](https://pandas.pydata.org/pandas-docs/stable/), which is the de facto standard Python library for working with dataframes.

[{'author': "Steve O'Hear",
  'crawled_at': '2018-03-22',
  'date': '2018/03/22',
  'text': 'Changes are afoot at Tide, the U.K. fintech startup that offers banking services for small businesses. TechCrunch has learned that founder George Bevis is planning to step down as CEO, and that the nearly three-year old company is actively headhunting for his replacement.\nIt comes at a time when Tide  — which counts 30,000 small business sign ups — is said to be entering ‘scale-up’ mode, with a headcount approaching 100 employees, and ambitions to expand internationally. Earlier this week the service saw a rebrand, including a new ‘vertical’ design for the Tide card and the slogan “Do Less Banking,” a reference to the startup’s mission to make the lives of small business owners easier.\nThe company also announced that it had got a regulatory upgrade and is now authorised as an electronic money institution by U.K. regulator the FCA. This gives Tide more direct access to banking infrastructure a

In [70]:
df = pd.DataFrame(articles)
df.head()

| | author | crawled_at | date | text | title |
|---|---|---|---|---|---|
| 0 | Steve O'Hear | 2018-03-22 | 2018/03/22 | Changes are afoot at Tide, the U.K. fintech st... | The founder of business banking startup Tide p... |
| 1 | Romain Dillet | 2018-03-22 | 2018/03/22 | Fintech startup Revolut is launching a new ty... | Revolut launches disposable virtual cards |
| 2 | Lucas Matney | 2018-03-22 | 2018/03/21 | Virtual reality seems to have become a very ti... | First impressions of the $199 Oculus Go VR hea... |
| 3 | Matt Burns | 2018-03-22 | 2018/03/21 | The Tempe, Arizona police department have rele... | Video: The driver of the autonomous Uber was d... |
| 4 | Devin Coldewey | 2018-03-22 | 2018/03/21 | For the visually impaired, there are lots of a... | Get the latest TC stories read to you over the... |


In [71]:
df.to_csv('filex.csv', sep=';')
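
With the default settings, `to_csv` also stores the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back without further options. The following is a short sketch of a round trip that avoids this. It assumes a small made-up `sample` list in place of the crawled articles and an in-memory buffer instead of `filex.csv`.

```python
import io

import pandas as pd

# Made-up records standing in for the crawled articles
sample = [
    {'title': 'First; with semicolon', 'author': 'A. Author', 'date': '2018/03/22'},
    {'title': 'Second', 'author': 'B. Author', 'date': '2018/03/21'},
]
sample_df = pd.DataFrame(sample, columns=['title', 'author', 'date'])

# index=False leaves out the row numbers, so no extra column appears on reload
buffer = io.StringIO()
sample_df.to_csv(buffer, sep=';', index=False)

# Read the data back with the same separator
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=';')
print(list(reloaded.columns))
```

Fields that contain the separator, such as the first title, are quoted on writing and restored unchanged on reading.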

### Summary
In this section we learned how to write a basic web crawler that starts from a given URL and explores the articles published on TechCrunch by following a fixed pattern. The web crawler we developed is a really simple one. You could enhance it, for example, by:
- Storing all HTML files [Reference](https://www.digitalocean.com/community/tutorials/how-to-handle-plain-text-files-in-python-3)
- Adding logging to your code [Reference](https://docs.python.org/3/howto/logging-cookbook.html)
- Filtering the pages, e.g. to only collect articles with the tag _Artificial Intelligence_
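
As a starting point for the first suggestion, the following sketch stores the raw HTML of a page on disk. The helper name `save_html`, the `html_pages` folder and the file-naming scheme are illustrative assumptions.

```python
import pathlib
import re


def save_html(url, html, folder='html_pages'):
    # Turn the URL into a file-system friendly name
    name = re.sub(r'^https?://', '', url).strip('/')
    name = re.sub(r'[^A-Za-z0-9._-]+', '_', name) + '.html'

    # Create the target folder if it does not exist yet
    target_dir = pathlib.Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / name
    target.write_text(html, encoding='utf-8')
    return target
```

Inside `get_page` it could be called as `save_html(url, response.text)` right after the status check.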

### Questions?
