# Webcrawling Tutorial (Solution)

## 1. Introduction to web crawling

<br><br>
## 2. Low Level Crawling
In the following, we will crawl [TechCrunch.com](https://techcrunch.com) as an example. Please note that this is only an example and should not be used in a commercial or similar context, so as not to violate TechCrunch's terms and conditions.
<br>
<br>
To implement our low-level crawler we will only use basic Python packages, such as [requests](http://docs.python-requests.org/en/master/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), and [Pandas](https://pandas.pydata.org/pandas-docs/stable/). First, we will implement the individual components of the web crawler as functions. Then we will combine them into a functional web crawler.

### Preparation
The preparation consists of importing all needed packages and defining the basic variables.

In [73]:
import requests
import bs4
import json
import time
import pandas as pd

In [74]:
base_url = 'https://techcrunch.com'
number_of_pages = 3

### Get the webpage for one url
First we will implement a simple function that takes a URL and returns a BeautifulSoup object. To do so, we send a GET request and parse the response text into a BeautifulSoup object.

In [96]:
def get_page(url):
    response = requests.get(url)
    
    # Simple error handling: stop execution if no proper response was received
    if response.status_code != 200:
        raise RuntimeError('Error getting page {}'.format(url))
        
    # Converting raw response text to usable BeautifulSoup
    page = bs4.BeautifulSoup(response.text, "lxml")
    
    return page

### Get Article URLs
Now we need to find the URLs of all articles listed on the page we previously retrieved. We can find the URLs in the **Read More** buttons.
![Example Article](img/article.png)
Each link has a structure like this:
```HTML
<a href="https://techcrunch.com/2018/03/02/some-random-article/" 
   class="read-more" 
   data-omni-sm="gbl_river_readmore,2">
        Read More
</a>
```
To identify all relevant links we can use BeautifulSoup's `find_all` function, which also lets us filter for specific classes. In our case the class is called **read-more**.


In [101]:
def get_article_urls(page):
    a_s = page.find_all('a', {'class': 'read-more'})
    
    hrefs = []
    for a in a_s:
        hrefs.append(a.attrs['href'])
        
    return hrefs

Test your code:

In [104]:
url = 'https://techcrunch.com/page/2/'
page = get_page(url)
article_urls = get_article_urls(page)
article_urls[:3]

['https://techcrunch.com/2018/03/02/2018-party-and-sxsw-panels/',
 'https://techcrunch.com/2018/03/02/air-lets-you-record-high-quality-home-movies-without-running-out-of-space/',
 'https://techcrunch.com/2018/03/02/blockchain-will-work-in-trucking-but-only-if-these-three-things-happen/']
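
The class filter can also be checked offline, without sending a request. The following is a minimal sketch under stated assumptions: the HTML snippet and its `example.com` URLs are made up for illustration, and Python's built-in `html.parser` is used instead of `lxml` so that no extra parser package is needed.

```python
import bs4

# Hand-written HTML for illustration only; not real TechCrunch markup.
sample_html = """
<div>
  <a href="https://example.com/article-1/" class="read-more">Read More</a>
  <a href="https://example.com/about/" class="nav-link">About</a>
  <a href="https://example.com/article-2/" class="read-more">Read More</a>
</div>
"""

# "html.parser" ships with Python, so no extra parser package is needed here.
sample_page = bs4.BeautifulSoup(sample_html, "html.parser")

# Same filter as in get_article_urls: only <a> tags with the class "read-more".
sample_links = [a.attrs["href"]
                for a in sample_page.find_all("a", {"class": "read-more"})]
print(sample_links)
```

Only the two links carrying the `read-more` class are collected; the navigation link is ignored.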

### Get Article Info
In this step we will implement a function that extracts all desired information for one article URL. You will need to implement:
1. Get the page (Hint: You can use already implemented functions)
2. Extract all desired information (Title, Authors, Date, Tags, Text)
3. Combine all in a dictionary

Extracting the information works much like the previous code. You only need `page.find(...)` and `page.find_all(...)`.

In [112]:
def get_article_info(url, delay=1):
    
    # Wait for delay seconds to crawl the next page
    time.sleep(delay)
    
    page = get_page(url)
    
    # Extract information
    title = page.find('h1', {'class': 'tweet-title'}).text
    
    authors_raw = page.find_all('a', {'rel': 'author'})
    authors = [author.text for author in authors_raw]
    
    date = page.find('time').attrs['datetime']
    
    tags_raw = page.find_all('a', {'class': 'tag'})
    tags = [tag.get_text(strip=True) for tag in tags_raw]
    
    # Get the text. The two-stage filtering is needed because 
    # in some articles div.text also contains scripts and ads, which we don't want to include.
    # The relevant text can be found in all p tags in text_raw.
    text_raw = page.find('div', {'class': 'text'})
    text_raw = [t.get_text(strip=True) for t in text_raw.find_all('p')]
    text = ' '.join(text_raw)
    
    # Combine all information in one dictionary
    article = {
        'title': title,
        'url': url,
        'date': date,
        'authors': authors,
        'tags': tags,
        'text': text
    }
    
    return article

Test your code:

In [116]:
url = 'https://techcrunch.com/2018/03/02/2018-party-and-sxsw-panels/'
article = get_article_info(url, delay=0)
# Shorten text for better readability. Not needed in the real crawler.
article['text'] = article['text'][:300] + '...'
article

{'authors': ['Josh Constine'],
 'date': '2018-03-02 03:23:27',
 'tags': ['Entertainment', 'SXSW', 'TechCrunch'],
 'text': 'TechCrunch invites you to our annual Crunch By Crunch Fest party in Austin, Texas.RSVPto come meet our writers while enjoying free drinks and musical performances by live electronic pop wizardsAutograf, digital RnB drummer Mobley, angelic songwriter MIEARS, and yacht dance DJs Glassio. It’s going do...',
 'title': 'Come to TechCrunch’s party and SXSW panels',
 'url': 'https://techcrunch.com/2018/03/02/2018-party-and-sxsw-panels/'}
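
Note that `page.find(...)` returns `None` when no matching element exists, so chained access such as `.text` raises an `AttributeError` on pages with a different layout. The following is a minimal defensive sketch; the helper name `text_or_default` and the sample HTML are hypothetical and not part of the original solution.

```python
import bs4

def text_or_default(element, default=''):
    """Return the stripped text of a BeautifulSoup element, or default if it is None."""
    if element is None:
        return default
    return element.get_text(strip=True)

# Hand-written HTML for illustration only; it has a title but no <time> tag.
sample_page = bs4.BeautifulSoup('<h1 class="tweet-title"> Hello </h1>', 'html.parser')

title = text_or_default(sample_page.find('h1', {'class': 'tweet-title'}))
date = text_or_default(sample_page.find('time'), default='unknown')
print(title, date)
```

Whether to fall back to a default value or to skip the article entirely is a design choice that depends on how complete the collected data needs to be.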

### Get the next URL
Finally, we need to extract the URL of the next page that lists articles. We use the same procedure as before: find the list item with the class **next** and read the `href` of the link inside it.<br>
![Next Button](img/next_button.png)

In [99]:
def get_next_url(page, base_url):
    
    list_item = page.find('li', {'class': 'next'})
    href = list_item.find('a').attrs['href']
    
    url = base_url + href
    return url
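
Plain concatenation with `base_url + href` only works when the `href` is a path relative to the site root. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles both relative and absolute links. The helper name `resolve_url` below is hypothetical and not part of the original solution.

```python
from urllib.parse import urljoin

def resolve_url(base_url, href):
    """Combine base_url and href; absolute hrefs are returned unchanged."""
    return urljoin(base_url, href)

# Relative link, as get_next_url expects it
relative_result = resolve_url('https://techcrunch.com', '/page/2/')
# Absolute link, which plain string concatenation would break
absolute_result = resolve_url('https://techcrunch.com', 'https://techcrunch.com/page/3/')

print(relative_result)
print(absolute_result)
```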

### Put it all together
Now that we have implemented all the important parts, the last challenge is to put them together and run the crawler. To do so, we will run through **number_of_pages** pages that list recent articles on TechCrunch. For each page we:
1. Get the page for the current URL.
2. Extract the article URLs from the page.
3. Get the information for every article.
4. Add the article information to the list `articles`, which contains the information for all articles.
5. Get the URL of the next page that lists articles.

In [100]:
current_url = base_url

articles = []
for n in range(number_of_pages):
    
    print('Crawling: {}'.format(current_url))    
    
    # Get a beautiful soup object representing the current page
    page = get_page(current_url)
    
    article_urls = get_article_urls(page)
    # Run through all articles and extract the desired information
    for url in article_urls:
        try:
            article_info = get_article_info(url, delay=0.3)
        except Exception as error:
            # Skip articles that cannot be fetched or parsed
            print('Error for article: {} ({})'.format(url, error))
            continue
        articles.append(article_info)
        
    # Find reference to next page listing articles
    current_url = get_next_url(page, base_url)
    
print('Finished crawling. Found {} Articles'.format(len(articles)))

Crawling: https://techcrunch.com
Finished crawling. Found 19 Articles
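
The five steps can also be wrapped in a function that receives the helpers as arguments, which makes the loop testable without network access. This is only a sketch under stated assumptions: the function `crawl`, the in-memory `fake_site`, and the stand-in helpers are made up for illustration, and `get_next_url` takes a single argument here, unlike the two-argument version used in this tutorial.

```python
def crawl(start_url, number_of_pages, get_page, get_article_urls,
          get_article_info, get_next_url):
    """Follow the five steps for number_of_pages listing pages."""
    articles = []
    current_url = start_url
    for _ in range(number_of_pages):
        page = get_page(current_url)                # step 1
        for url in get_article_urls(page):          # step 2
            try:
                article = get_article_info(url)     # step 3
            except Exception as error:
                print('Error for article: {} ({})'.format(url, error))
                continue                            # skip failed articles
            articles.append(article)                # step 4
        current_url = get_next_url(page)            # step 5
    return articles


# Stand-in helpers operating on an in-memory "site" (made up for illustration).
fake_site = {
    'page-1': {'articles': ['a', 'b'], 'next': 'page-2'},
    'page-2': {'articles': ['c'], 'next': 'page-3'},
    'page-3': {'articles': [], 'next': None},
}

def fake_get_article_info(url):
    # Simulate one article that cannot be parsed
    if url == 'b':
        raise ValueError('broken article')
    return {'url': url}

found = crawl(
    'page-1', 2,
    get_page=lambda url: fake_site[url],
    get_article_urls=lambda page: page['articles'],
    get_article_info=fake_get_article_info,
    get_next_url=lambda page: page['next'],
)
print(found)
```

The failing article is reported and skipped, so it is neither appended nor replaced by a stale result from the previous iteration.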


### Storing results to csv
To store our articles we use the library [Pandas](https://pandas.pydata.org/pandas-docs/stable/), which is the de facto standard Python library for working with dataframes.

In [92]:
df = pd.DataFrame(articles)
df.to_csv('my_crawled_articles.csv', index=False)
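
Because `authors` and `tags` are lists, the CSV file stores them as their string representation, which has to be parsed again when loading. One possible alternative, sketched here with the standard library `json` module, is to write one JSON object per line. The record, the `.jsonl` file name, and the temporary directory are assumptions for illustration.

```python
import json
import os
import tempfile

# Made-up record with the same kind of keys as the article dictionaries above.
sample_articles = [
    {'title': 'Example title', 'authors': ['Jane Doe'], 'tags': ['AI', 'Startups']},
]

# Hypothetical file name; a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), 'my_crawled_articles.jsonl')

# Write one JSON object per line
with open(path, 'w', encoding='utf-8') as f:
    for article in sample_articles:
        f.write(json.dumps(article, ensure_ascii=False) + '\n')

# Read the file back; lists come back as real Python lists
with open(path, encoding='utf-8') as f:
    restored = [json.loads(line) for line in f]

print(restored[0]['tags'])
```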

### Summary
In this section we learned how to write a basic web crawler that takes a starting URL and explores the articles published on TechCrunch in a given pattern. The web crawler we developed is a very simple one. You could enhance it, e.g. by:
- Storing the raw HTML files [Reference](https://www.digitalocean.com/community/tutorials/how-to-handle-plain-text-files-in-python-3)
- Adding logging to your code [Reference](https://docs.python.org/3/howto/logging-cookbook.html)
- Filtering the pages, e.g. to only collect articles with the tag _Artificial Intelligence_
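
As a rough illustration of the logging and filtering ideas, the following sketch filters already crawled articles by tag and logs how many were kept. The function name `filter_by_tag`, the logger name, and the sample records are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('crawler')  # logger name is an arbitrary choice

def filter_by_tag(articles, tag):
    """Return only the articles whose 'tags' list contains tag (case-insensitive)."""
    wanted = tag.lower()
    kept = [a for a in articles
            if wanted in (t.lower() for t in a.get('tags', []))]
    logger.info('Kept %d of %d articles with tag %r', len(kept), len(articles), tag)
    return kept

# Made-up records with the same kind of keys as the article dictionaries above.
sample_articles = [
    {'title': 'A', 'tags': ['Artificial Intelligence', 'Startups']},
    {'title': 'B', 'tags': ['Gadgets']},
    {'title': 'C', 'tags': ['artificial intelligence']},
]

ai_articles = filter_by_tag(sample_articles, 'Artificial Intelligence')
print([a['title'] for a in ai_articles])
```

Filtering after the crawl keeps the crawler itself unchanged; filtering inside the crawl loop would instead avoid storing articles that are not needed.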

<br><br>
## 3. Higher Level Webcrawling
In the first example we created a web crawler from scratch. Now we will use [Scrapy](https://scrapy.org/), probably the most widely used web crawling framework, to do the same thing.