# Web Crawling Tutorial (Solution)

## 1. Introduction to web crawling

<br><br>
## 2. Low Level Crawling
In the following we will crawl [TechCrunch.com](https://techcrunch.com) as an example. Please note that this is for demonstration only and should not be used in a commercial or similar context, so as not to violate TechCrunch's terms and conditions.
<br>
<br>
To implement our low-level crawler we will only use basic Python packages, such as [requests](http://docs.python-requests.org/en/master/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), and [Pandas](https://pandas.pydata.org/pandas-docs/stable/). First, we will implement the individual components of the web crawler as functions. Then we will combine them into a functional web crawler.

### Preparation
The preparation consists of importing all needed packages and defining the basic variables.

In [None]:
import requests
import bs4
import json
import time
import pandas as pd

In [None]:
base_url = 'https://techcrunch.com'
number_of_pages = 3

### Get the webpage for one url
First we will implement a simple function that takes a URL and returns a BeautifulSoup object. To do so, we send a GET request and parse the response text into a BeautifulSoup object.

In [None]:
def get_page(url):
    # HTTP Get request 
    response = requests.get(url)
    
    # Simple error handling: stop execution when no proper response is received
    if response.status_code != 200:
        raise RuntimeError('Error getting page {}'.format(url))
        
    # Converting raw response text to usable BeautifulSoup
    page = bs4.BeautifulSoup(response.text, "lxml")
    
    return page

In [None]:
page = get_page(base_url)
page.text[:300]

### Get Article URLs
Now we need to find the URLs of all articles listed on the page we previously retrieved. We can find the URLs in the **Read More** buttons.
![Example Article](img/article.png)
Each link has a structure like this:
```HTML
<a href="https://techcrunch.com/2018/03/02/some-random-article/"
   class="read-more"
   data-omni-sm="gbl_river_readmore,2">
        Read More
</a>
```
To identify all relevant links we can use BeautifulSoup's `find_all` function, which also lets us filter for specific classes. In our case the class is called **read-more**.


In [None]:
def get_article_urls(page):
    # Get a list of all links of class read-more
    a_s = page.find_all('a', {'class': 'read-more'})
    
    hrefs = []
    # Extract the href URLs for every a in a_s
    for a in a_s:
        hrefs.append(a.attrs['href'])
        
    return hrefs

Test your code:

In [None]:
url = 'https://techcrunch.com/page/2/'
page = get_page(url)
article_urls = get_article_urls(page)
article_urls[:3]
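
To check the class filter without hitting the network, the same `find_all` call can be run against a small inline HTML snippet. The following is a minimal sketch: the HTML string and the example.com URLs are made up for illustration, and Python's built-in `html.parser` is used instead of `lxml` so that no extra parser is required.

```python
import bs4

# Hypothetical listing snippet: two "Read More" links and one unrelated link.
SAMPLE_HTML = """
<div class="river">
  <a href="https://example.com/2018/03/02/first-article/" class="read-more">Read More</a>
  <a href="https://example.com/about/" class="nav">About</a>
  <a href="https://example.com/2018/03/02/second-article/" class="read-more extra">Read More</a>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same filter as in get_article_urls: only <a> tags carrying the class read-more.
# A tag with several classes still matches as long as read-more is one of them.
links = soup.find_all('a', {'class': 'read-more'})
hrefs = [a.attrs['href'] for a in links]

# The equivalent CSS selector is an alternative way to express the same filter.
hrefs_css = [a['href'] for a in soup.select('a.read-more')]

print(hrefs)
```

The CSS selector form is the same notation that Scrapy uses later in this tutorial, so it can be worth getting familiar with both.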

### Get Article Info
In this step we will implement a function that extracts all desired information for one article URL. The function needs to:
1. Get the page (Hint: You can use already implemented functions)
2. Extract all desired information (Title, Authors, Date, Tags, Text)
3. Combine all in a dictionary

The extraction works much like the previous code. You just need `page.find(...)` and `page.find_all(...)`.

In [None]:
def get_article_info(url, delay=1):
    
    # Wait for delay seconds to crawl the next page
    time.sleep(delay)
    
    page = get_page(url)
    
    # Extract information
    title = page.find('h1', {'class': 'tweet-title'}).text
    
    authors_raw = page.find_all('a', {'rel': 'author'})
    authors = [author.text for author in authors_raw]
    
    date = page.find('time').attrs['datetime']
    
    tags_raw = page.find_all('a', {'class': 'tag'})
    tags = [tag.get_text(strip=True) for tag in tags_raw]
    
    # Get the text. The two-stage filtering is needed because
    # in some articles div.text also contains scripts and ads, which we don't want to include.
    # The relevant text can be found in the p tags inside text_raw.
    text_raw = page.find('div', {'class': 'text'})
    text_raw = [t.get_text(strip=True) for t in text_raw.find_all('p')]
    text = ' '.join(text_raw)
    
    # Combine all information in one dictionary
    article = {
        'title': title,
        'url': url,
        'date': date,
        'authors': authors,
        'tags': tags,
        'text': text
    }
    
    return article

Test your code:

In [None]:
url = 'https://techcrunch.com/2018/03/02/2018-party-and-sxsw-panels/'
article = get_article_info(url, delay=0)
# Shorten text for better readability. Not needed in the real crawler.
article['text'] = article['text'][:300] + '...'
article

### Get the next URL
Finally we need to extract the URL of the next page listing articles. We will use the same procedure as before to find the href of the link labeled **Next**.<br>
![Next Button](img/next_button.png)

In [None]:
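from urllib.parse import urljoin

def get_next_url(page, base_url):
    
    # The pagination link sits inside the list item of class next
    list_item = page.find('li', {'class': 'next'})
    href = list_item.find('a').attrs['href']
    
    # urljoin handles relative as well as absolute hrefs
    url = urljoin(base_url, href)
    return url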

In [None]:
get_next_url(page, base_url)
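
Joining a base URL and an href by plain string concatenation only yields a valid URL when the href is relative and starts with a slash. The sketch below compares concatenation with the standard library's `urljoin` for both a relative and an absolute href. Both href values are assumed examples, not values taken from the live site.

```python
from urllib.parse import urljoin

base_url = 'https://techcrunch.com'

# A relative href, the form plain concatenation expects.
relative_href = '/page/2/'
# An absolute href, as some sites emit in their pagination links.
absolute_href = 'https://techcrunch.com/page/2/'

# Concatenation breaks as soon as the href is already absolute.
concatenated = base_url + absolute_href

# urljoin resolves both forms to the same URL.
joined_relative = urljoin(base_url, relative_href)
joined_absolute = urljoin(base_url, absolute_href)

print(concatenated)
print(joined_relative)
print(joined_absolute)
```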

### Put it all together
Now that we have implemented all the important parts, the last challenge is to put them together and run the crawler. To do so, we will run through **number_of_pages** pages listing recent articles on TechCrunch. For each page we:
1. Get the page for the current URL.
2. Extract the article URLs for the page.
3. Get the information for every article.
4. Add the article information to the list `articles`, which contains the information for all articles.
5. Find the reference to the next page listing articles.

In [None]:
current_url = base_url

articles = []
for n in range(number_of_pages):
    
    print('Crawling: {}'.format(current_url))    
    
    # 1. Get the page for the current URL
    page = get_page(current_url)
    
    # 2. Extract the article URLs for each page.
    article_urls = get_article_urls(page)
    
    # Run through all articles and extract the desired information
    for url in article_urls:
        
        try:
            # 3. Get the information for every article.
            article_info = get_article_info(url, delay=0.3)
        except Exception:
            # Skip this article so that no stale or undefined result is appended
            print('Error for article: {}'.format(url))
            continue
            
        # 4. Add the article information to the list articles, containing the information for all articles.
        articles.append(article_info)
        
    # 5. Find reference to next page listing articles
    current_url = get_next_url(page, base_url)
    
print('Finished crawling. Found {} Articles'.format(len(articles)))
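
Because `page.find(...)` returns `None` when an element is missing, a single unusual article can raise an exception inside `get_article_info`. One way to keep the crawl running is to record the failure and skip that article. The sketch below isolates this pattern so it can be tried offline: `collect_articles` and `fake_fetch` are hypothetical names, and the fake fetcher merely stands in for `get_article_info`.

```python
def collect_articles(article_urls, fetch):
    """Call fetch(url) for every URL; skip and record the URLs that fail."""
    articles, failed = [], []
    for url in article_urls:
        try:
            info = fetch(url)
        except Exception as error:
            # Remember the failure and continue with the next article.
            failed.append((url, repr(error)))
            continue
        articles.append(info)
    return articles, failed


def fake_fetch(url):
    # Stand-in for get_article_info: fails for one URL on purpose.
    if 'broken' in url:
        raise AttributeError("'NoneType' object has no attribute 'text'")
    return {'url': url, 'title': 'Title of ' + url}


urls = [
    'https://example.com/a/',
    'https://example.com/broken/',
    'https://example.com/b/',
]
articles, failed = collect_articles(urls, fake_fetch)
print(len(articles), len(failed))
```

Keeping the failed URLs in a separate list makes it possible to inspect or retry them after the crawl instead of losing them in the printed output.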

### Storing results to CSV
To store our articles we use the library [Pandas](https://pandas.pydata.org/pandas-docs/stable/), which is the most widely used Python library for handling dataframes.

In [None]:
df = pd.DataFrame(articles)
df.head()

In [None]:
df.to_csv('my_crawled_articles.csv', index=False)
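
One caveat with CSV: the `authors` and `tags` columns hold Python lists, and a CSV cell can only hold text. Each list is therefore written as its string representation and comes back as a plain string when the file is read again. The sketch below reproduces this effect with the standard library's `csv` module rather than Pandas, to stay dependency-free, and uses `ast.literal_eval` to turn the string back into a list. The sample row is invented.

```python
import ast
import csv
import io

# Invented sample row with a list-valued column, like 'tags' in our articles.
row = {'title': 'Some article', 'tags': ['Apps', 'Snapchat']}

# Write the row to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'tags'])
writer.writeheader()
writer.writerow(row)

# Read it back: the list has become a string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
tags_as_text = loaded['tags']

# literal_eval safely parses the string representation back into a list.
tags = ast.literal_eval(tags_as_text)
print(type(tags_as_text).__name__, tags)
```

With Pandas, the same conversion could be applied after `pd.read_csv`, for example via `df['tags'].apply(ast.literal_eval)`. If the lists matter for later analysis, storing the data as JSON avoids the problem altogether.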

### Summary
In this section we learned how to write a basic web crawler, which takes a starting URL and explores the articles published on TechCrunch following a given pattern. The web crawler we developed is a really simple one. You could enhance it, e.g. by:
- Storing all HTML files [Reference](https://www.digitalocean.com/community/tutorials/how-to-handle-plain-text-files-in-python-3)
- Adding logging to your code [Reference](https://docs.python.org/3/howto/logging-cookbook.html)
- Filtering the pages, e.g. to only collect articles with the tag _Artificial Intelligence_
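
As a starting point for the tag filter, the following sketch keeps only articles that carry a given tag. It assumes the dictionary layout produced by `get_article_info`. The helper name `has_tag` and the sample data are hypothetical, and the case-insensitive comparison is a design choice rather than a requirement.

```python
def has_tag(article, wanted_tag):
    """Return True if the article carries the tag (case-insensitive)."""
    wanted = wanted_tag.strip().lower()
    return any(tag.strip().lower() == wanted for tag in article['tags'])


# Invented sample in the same shape as the dictionaries built by get_article_info.
sample_articles = [
    {'title': 'A', 'tags': ['Artificial Intelligence', 'Google']},
    {'title': 'B', 'tags': ['Transportation', 'trucking']},
    {'title': 'C', 'tags': ['artificial intelligence ']},
]

ai_articles = [a for a in sample_articles if has_tag(a, 'Artificial Intelligence')]
print([a['title'] for a in ai_articles])
```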

### Questions?


<br><br>
## 3. Higher Level Webcrawling
In the first example we created a web crawler from scratch. Now we will use [Scrapy](https://scrapy.org/), probably the most widely used web crawling framework, to do the same thing.

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exporters import JsonItemExporter
import json
import logging
import pandas as pd

In [2]:
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open("articles_pipeline.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()
 
    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
 
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

In [3]:
class ArticleSpider(scrapy.Spider):
    name = 'articles'

    start_urls = ['https://techcrunch.com/']

    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Used for pipeline 1
        'FEED_FORMAT':'json',
        'FEED_URI': 'articles.json'
    }
    
    pagination_count = 0
    max_pages = 3
        
    def parse(self, response):
        print("Starting Crawling: {}".format(response.url))
        # follow links to article pages
        for href in response.css('a.read-more::attr(href)'):
            yield response.follow(href, self.parse_author)

        self.pagination_count += 1
        if self.pagination_count < self.max_pages:
            # follow pagination links
            next_href = response.css('li.next a::attr(href)').extract_first()
            next_page = response.urljoin(next_href)
            yield response.follow(next_page, self.parse)

    def parse_author(self, response):

        title   = response.css('h1.tweet-title::text').extract_first().strip()
        authors = response.xpath('//a[@rel="author"]/text()').extract()
        date    = response.css('time::attr(datetime)').extract_first()
        tags    = response.css('a.tag::text').extract()
        
        text_raw = response.css('div.text p::text').extract()
        text = ' '.join(text_raw)
        

        yield {
            'title': title,
            'authors': authors,
            'date': date,
            'tags': tags,
            'text': text
        }

In [4]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ArticleSpider)
process.start()

2018-03-04 19:31:44 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-03-04 19:31:44 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


<Deferred at 0x7ff155aac080>

Starting Crawling: https://techcrunch.com/
Starting Crawling: https://techcrunch.com/page/2/
Starting Crawling: https://techcrunch.com/page/3/


In [6]:
dfjson = pd.read_json('articles_pipeline.json')
print(dfjson.shape)
dfjson.head()

(58, 5)


| | authors | date | tags | text | title |
|---|---|---|---|---|---|
| 0 | [Josh Constine] | 2018-03-02 13:30:57 | [Apps, Snapchat, Evan Spiegel, snap inc, snapc... | “Timing”, Snapchat CEO Evan Spiegel said crypt... | Snapchat is stuck in the uncanny valley of AR ... |
| 1 | [Jonathan Salama] | 2018-03-02 07:45:38 | [Transportation, trucking] | \n Jonathan Salama is chief technology officer... | Blockchain will work in trucking — but only if... |
| 2 | [Sarah Perez] | 2018-03-02 10:53:10 | [Apps, iphone apps, storage, iOS apps, Apps] | These days, home movies aren’t recorded with h... | Air’s app lets you record high-quality home mo... |
| 3 | [Devin Coldewey] | 2018-03-03 17:05:20 | [eCommerce, Amazon, counterfeit] | It’s become a standard part of my dwindling Am... | Another small business complains of counterfei... |
| 4 | [Danny Crichton] | 2018-03-04 09:17:01 | [Government, Facebook, Google] | If there is one policy dilemma facing nearly e... | No one wants to build a “feel good” internet |
