# Web Crawling

In the previous example we saw how to extract information from a single webpage, but what if we want to do this for a whole website? This is called web crawling, and it is a recursive process: extract the information and links from a page, follow those links to new pages, extract their information and links, and repeat. When crawling we should be considerate of the servers hosting the pages: if we request pages too fast we can overload them, and they may block our IP address. We could crawl with just the `requests` package, but beyond a few pages this becomes difficult. The web is messy, so pages are not all written in the same way and some contain malformed HTML, which means we would need a lot of exception handling to keep our programs from crashing. In addition, what if we want to make concurrent requests, use rotating proxies or execute JavaScript? For these reasons it is often simpler to use a more robust scraping framework that has many useful features built in.
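The crawl loop described above can be sketched without any network access. This is only a minimal illustration under stated assumptions: `SITE` is a made-up in-memory link graph standing in for real pages, and `crawl` and `max_depth` are hypothetical names. The loop uses a queue rather than literal recursion, so a deep site cannot exhaust the call stack, and a real crawler would also pause between requests.

```python
from collections import deque

# Hypothetical in-memory "website": each page maps to the links found on it.
SITE = {
    "/home": ["/about", "/blog"],
    "/about": ["/home"],
    "/blog": ["/blog/post-1", "/about"],
    "/blog/post-1": ["/home"],
}


def crawl(start, get_links, max_depth=2):
    """Visit pages breadth-first, following links at most max_depth hops."""
    seen = {start}                # pages already queued, to avoid loops
    queue = deque([(start, 0)])   # (page, depth) pairs still to visit
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)        # stands in for "extract the information"
        if depth == max_depth:
            continue              # do not follow links any deeper
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


visited = crawl("/home", lambda page: SITE.get(page, []))
print(visited)
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other, which almost every real site does.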

# Scrapy

Scrapy is a complete framework for writing web crawlers, programs that extract structured data from websites. It provides a set of command line tools and an interactive shell to make the process easier.


# Creating The Project


We'll make a simple spider that starts on a Wikipedia page, extracts the title of the page and then follows the links to other pages. To start a new project, open a terminal and run:

```
scrapy startproject wikiSpider
```

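This creates the project skeleton. The tree below shows the project as it looks once the later steps of this tutorial have been run, so it also contains the spider file, a log file and Python's `__pycache__` folders.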

```

wikiSpider
├── scrapy.cfg
├── wiki.log
└── wikiSpider
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── __pycache__
    │   ├── __init__.cpython-36.pyc
    │   ├── items.cpython-36.pyc
    │   └── settings.cpython-36.pyc
    ├── settings.py
    └── spiders
        ├── articleSpider.py
        ├── __init__.py
        └── __pycache__
            ├── Article.cpython-36.pyc
            ├── articleSpider.cpython-36.pyc
            └── __init__.cpython-36.pyc



```

# Items

Before we start scraping we should have an idea of what information we want from the webpage. We therefore start by defining an `Article` class that inherits from Scrapy's `Item` class, which we'll use to store the collected data. We can think of an `Item` as a Python dict, a container for information, but with extra protection against mistakes. If we make a typo in a key of a normal dict, the assignment silently creates a new key; with an `Item`, trying to set or read a field that was not declared raises a `KeyError`. In addition, `Item` objects are passed to the pipelines in `pipelines.py`, which we can use to further process the extracted data, for example by writing it to a database. Add the following code to `items.py`.

```python
from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like
    title = Field()
    content = Field()
```

# Shell

The shell is the perfect place to test our code before we copy it into our spider script. It lets us build the spider interactively, which hopefully leads to less buggy code. To start it, run the following in a terminal from within the project folder.
```
scrapy shell
```

We can then download the html using:

```python
fetch('https://en.wikipedia.org/wiki/London')
```

We now have a `response` object that we can query with CSS selectors, XPath or regular expressions. Use Chrome DevTools to inspect the page and work out the selector for the element of interest. We can then use the `.css` method to apply the selector to the response object.

```python
div = response.css('#mw-content-text > div > div:nth-child(134) > div > div.thumbcaption')
```

The `.css` method returns a selector list, so we can apply another selector to the result.

```python
l =  div.css('div.legend::text')
```

Notice the pseudo-element `::text`, which we use to extract the text from each div. Finally we can apply a regex to get the clean text we want. Unlike `.css`, `.re` returns a list of strings, not a selector list.


```python
l.re(r'\w.+\)')
```

This returns the ethnic breakdown of London's population.
```python
['White British (44.9%)',
 'Other White (14.9%)',
 'Asian (18.4%)',
 'Black (13.3%)',
 'Arab (1.3%)',
 'Mixed (5%)',
 'Other (2.2%)']
```
If the text were cleaner we could use `.extract()` rather than `.re()` to get the text we want. Once you have built up an expression in the shell that extracts the data, you can copy it into your spider script, which we will create next.
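The regex from the shell session can also be tried on its own with Python's `re` module. The sketch below assumes that the raw text nodes are padded with whitespace, so the sample strings are invented, and it only approximates what `.re` does for a pattern without groups.

```python
import re

# Hypothetical raw text nodes, padded the way scraped text often is.
raw_legends = ["\xa0White British (44.9%)", "\n  Asian (18.4%)", "\n"]

# Start at the first word character and run up to the last closing bracket.
pattern = re.compile(r"\w.+\)")

cleaned = []
for text in raw_legends:
    cleaned.extend(pattern.findall(text))

print(cleaned)
```

Text nodes that contain only whitespace produce no match and simply drop out of the result.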

## Spiders



In the `spiders` folder we then need to define a spider to scrape the articles. A spider is a Python class that subclasses one of the Scrapy spider classes; in this case we'll use `CrawlSpider`. Scrapy provides a command line tool that generates a template for a spider. To generate the template, `cd` into the `spiders` folder and run the following.

```
scrapy genspider -t crawl article en.wikipedia.org
```

For more info run:

```
scrapy genspider --help
```

Modify the spider code so it looks like this.

```python
from scrapy.spiders import CrawlSpider, Rule
from wikiSpider.items import Article
from scrapy.linkextractors import LinkExtractor

class ArticleSpider(CrawlSpider):

    name = "article"  # spider name, used on the command line
    allowed_domains = ["en.wikipedia.org"]  # only follow links within this domain
    start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    rules = [
        # The link extractor uses a regex to define which links to follow:
        # URLs under /wiki/ that have no colon after that prefix
        Rule(LinkExtractor(allow=r'(/wiki/)((?!:).)*$'), callback="parse_item", follow=True)
    ]

    # Spiders can take custom settings; these take priority over settings.py
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DOWNLOAD_DELAY' : 0.25
    }

    def parse_item(self, response):
        item = Article() 
        title = response.css('#firstHeading::text').extract_first()
        content =  response.css('#toc *::text').extract()
        print(f"Title is: {title}")  # an f-string does not fail if title is None
        item['title'] = title
        item['content'] = content
        return item
```

The code in the top part of the class defines the crawling logic, essentially which pages to visit, whereas the code in `parse_item` is the parsing logic: which items on each page we are interested in. Let's break down the top half of the code in more detail.

* `name` - a string which we use to call our spider from the command line.
* `custom_settings` - additional settings specific to this spider.
* `allowed_domains` - a list of domains which we'll let our spider visit.
* `start_urls` - the URL of the page or pages to start on.
* [LinkExtractor](https://doc.scrapy.org/en/latest/topics/link-extractors.html) - takes a regex or a list of regexes (the `allow` argument) matching the links we wish to extract.
* `Rule` - takes a `LinkExtractor` and, here, two other arguments. `callback` names the method that is called on each page to execute our parsing logic. `follow` is a boolean that specifies whether links should be followed from each response extracted with this rule; it defaults to `True` only when no `callback` is given, so we set it explicitly.

In the second half we create a new instance of our `Article` class, add the h1 text to the `title` field and the text of the table of contents to the `content` field. We use CSS selectors to extract the data. Finally we return the item; we could use command line flags or a pipeline to write the item to a file.
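It can help to check which links the `allow` pattern in the rule accepts before running a crawl. The sketch below tests the same regex with Python's `re` module against a handful of made-up candidate URLs; it assumes the pattern is applied as a search against the full URL, which is how we understand `LinkExtractor` to use it.

```python
import re

# The same pattern that is passed to LinkExtractor(allow=...) in the spider.
allow = re.compile(r"(/wiki/)((?!:).)*$")

# Hypothetical links of the kinds found on a Wikipedia article.
candidates = [
    "https://en.wikipedia.org/wiki/London",
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "https://en.wikipedia.org/wiki/File:London_Montage.jpg",
    "https://en.wikipedia.org/wiki/Category:Capitals_in_Europe",
    "https://en.wikipedia.org/w/index.php?title=London&action=edit",
]

followed = [url for url in candidates if allow.search(url)]

for url in candidates:
    print("follow" if url in followed else "skip  ", url)
```

The negative lookahead `(?!:)` rejects any colon after `/wiki/`, which filters out namespaced pages such as `File:` and `Category:` and keeps ordinary articles.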

# Crawling

We can now start the spider by running the following in the terminal from the project directory.

```
scrapy crawl article
```

Note that `article` is the name we set in our `ArticleSpider` class. You can stop the spider with `Ctrl+C` or `Ctrl+D`, depending on your operating system.

Scrapy produces a lot of debugging information, often a little too much. There are 5 logging levels, listed here from most to least severe:

* CRITICAL
* ERROR
* WARNING
* INFO
* DEBUG

We can change the log level by putting `LOG_LEVEL = "ERROR"` in `settings.py`, or we can pass it as a command line flag.

```
scrapy crawl article --logfile="wiki.log" --loglevel="ERROR"
```
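Scrapy logs through Python's standard `logging` module, so the effect of a threshold such as `ERROR` can be illustrated with the numeric values that module assigns to each level. This is only a sketch of the filtering rule, not of Scrapy's own logging setup, and the variable names are made up.

```python
import logging

level_names = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]
threshold = logging.ERROR  # the equivalent of LOG_LEVEL = "ERROR"

# A message is emitted only if its level is at or above the threshold.
for name in level_names:
    number = getattr(logging, name)
    status = "shown" if number >= threshold else "hidden"
    print(f"{name:<8} {number:>2} {status}")

shown = [name for name in level_names if getattr(logging, name) >= threshold]
```

With the threshold at `ERROR`, only `CRITICAL` and `ERROR` messages reach the log, which removes most of the per-request noise.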


Scrapy uses the `Item` object to determine which information to save. You can use the commands below to output the items to CSV or JSON.

```
scrapy crawl article -o articles.csv
scrapy crawl article -o articles.json
```

For longer crawls we may want to pause and resume at a later point. This can be achieved by running:

```
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```

We can stop the spider with Ctrl+C, and then restart it later by running the same command.

## Pipelines

Another option is to save the items using a pipeline. Pipelines are often used to further process the `Items` extracted by our spider. Typical use cases include:
* Cleaning the scraped data.
* Removing duplicate data.
* Saving the data to a database or file.


To activate a pipeline you need to add the code below to the `settings.py` file.

```
ITEM_PIPELINES = {
    'wikiSpider.pipelines.CsvPipeline': 500,
}
```

We'll use a pipeline to clean up the table of contents data and write it to a CSV file along with the page title. The item's keys will be used as the column names.

```python
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
from scrapy.exporters import CsvItemExporter

class CsvPipeline(object):

    def open_spider(self, spider):
        self.file = open('output.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        
        #drop strings that contain only whitespace
        item['content'] = [ s for s in item['content'] if not s.isspace() ]
        self.exporter.export_item(item)
        return item
```
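The cleaning step in `process_item` is easy to try in isolation. The sample list below is invented to resemble the text nodes that a selector like `'#toc *::text'` might return; the second list comprehension is an optional, stricter variant and not part of the pipeline above.

```python
# Hypothetical text nodes from a table of contents.
content = ["\n", "Contents", "\n\t", "1", " ", "History", ""]

# Same filter as in the pipeline: drop strings made only of whitespace.
kept = [s for s in content if not s.isspace()]
print(kept)

# Stricter variant: also trim padding and drop empty strings.
stripped = [s.strip() for s in content if s.strip()]
print(stripped)
```

Note that `"".isspace()` is `False`, so the pipeline's filter keeps empty strings; the stricter variant removes them as well.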



# Proxies

Proxies allow us to speed up scraping, as we can make many concurrent requests without the risk of getting our personal IP address banned. If you want to use proxies with Scrapy you can use the [scrapy-proxies](https://github.com/aivarsk/scrapy-proxies) middleware. Below is an example Python script that generates a list of proxies for use with scrapy-proxies.

```python
from io import StringIO
import os

import pandas as pd
import requests

# Some sites reject requests that do not send a browser-like User-Agent
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0"}
proxy_url = "https://free-proxy-list.net/"
r = requests.get(proxy_url, headers=headers)

# Write the proxy list to proxies.txt in the current working directory
path = os.path.join(os.getcwd(), 'proxies.txt')

if r.status_code == 200:

    print('Request Successful')

    # Parse the HTML tables on the page; the first one holds the proxy list
    dfs = pd.read_html(StringIO(r.text))
    df = dfs[0]
    # Keep only the proxies that support HTTPS
    df = df[df.Https == "yes"]

    proxies = []

    for _, row in df[["IP Address", "Port"]].iterrows():
        ip = row["IP Address"]
        port = row["Port"]
        s = f"https://{ip}:{int(port)}"
        proxies.append(s)

    print(f'Writing {len(proxies)} proxies to {path}')

    with open(path, "w") as f:
        f.write("\n".join(proxies))
else:
    print(f'Request failed with status code {r.status_code}')
```
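Once `proxies.txt` exists, the middleware has to be switched on in `settings.py`. The fragment below follows the scrapy-proxies README as far as we recall it, so treat the setting names and priority numbers as assumptions and check the project's README before relying on them. The relative path `proxies.txt` assumes the crawl is started from the folder where the script above was run.

```python
# settings.py (fragment) - assumed configuration for scrapy-proxies

# Retry often, since free proxies fail frequently.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# File with one proxy URL per line, as written by the script above.
PROXY_LIST = 'proxies.txt'

# 0 is understood to mean: pick a different random proxy for every request.
PROXY_MODE = 0
```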

## Exercise

* Extract all of the quotes and author names from [Quotes to Scrape](http://quotes.toscrape.com/), and write them to a CSV file.
* Pick a website of your choice and scrape it using Scrapy. Make sure that the page doesn't load its data dynamically with JavaScript. You can check this by right-clicking on the page and choosing `View Page Source`; if the data is there, you can scrape it with Scrapy without the hassle of dynamic JavaScript.

# Resources

* [Web Scraping in Python using Scrapy](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/)
* [Getting Started with Web Scraping using Scrapy](https://www.youtube.com/watch?v=vkA1cWN4DEc&list=PLZyvi_9gamL-EE3zQJbU5N3nzJcfNeFHU&t=0s&index=1)
* [Scrapy Resources](https://scrapy.org/resources/)
* [Scrapy exporting json and csv](http://www.scrapingauthority.com/2016/09/19/scrapy-exporting-json-and-csv/)
* [Xpath and CSS selectors](https://devhints.io/xpath)