# 1. Getting Started with Web Scraping
> Web scraping is the process of extracting data from web pages. 

> The first step of web scraping is identifying what data points we want and where they are on the webpage. 

>> For this example we want to scrape the quote, author name, and tags from quotes.toscrape.com/random

> We can use scrapy shell to test how to extract the data from our webpage. 

> In your terminal, or command line, type: 

**scrapy shell 'quotes.toscrape.com/random'**

> This command makes scrapy fetch the page and gives us a **response object** that we can use to access the data from the url we entered. 

> In the scrapy shell, type: 

**view(response)** 

>will open the page we fetched in a web browser 

<img src="img\ss1.png">

**print(response.text)**
>will print the entire page content in html 

<img src="img\ss2.png">

> Now let's use our browser's **inspect** tool to find what we want to extract on the page

<img src="img\ss3.png">

> We can see above that our quote is wrapped in a **span element** with a **text class**.

> We will use the response.css method to select which points of the webpage we want to extract. Below is the basic framework we will use to extract data with the scrapy shell.
	
**response.css('element.class').extract()**

> In the scrapy shell, type: 

**response.css('span.text').extract()**

<img src="img\ss4.png">

> We successfully extracted what we wanted using the last command, but now we want to get rid of the HTML

> To get rid of the HTML type:

**response.css('span.text::text').extract()**

<img src="img\ss5.png">

> The result above is a list. If we want our data in string format instead, we have two options:

**response.css('span.text::text')[0].extract()**

**response.css('span.text::text').extract_first()**

<img src="img\ss6.png">

> Finally, we've extracted what we wanted. 

> Now **you** try extracting the author name (in string format) and the tags (in list format). In the next section we will create a spider to scrape this data.

> When you are finished with the scrapy shell you can type **quit()** to exit the shell

# 2. Creating your First Scrapy Spider 
> Now we will automate the data extraction by building a scrapy spider. In the terminal, or command line, we will type the following command:

**scrapy genspider quotes toscrape.com** 

>The first parameter in the genspider command is 'quotes' - this is the name of our spider. 

>The second parameter in the genspider command is 'toscrape.com' - this is the domain that we will be scraping. 

<img src="img\ss7.png">

>This command will create a **quotes.py** file in the directory you selected. I didn't change my directory so the file is found in my hoffmanbrandon1 user folder. You can open your file in any type of code editing software. I will be using Atom to manipulate my code. 

>If you would like, you can find more about Atom here: https://atom.io/

>The quotes.py file will look like this:

<img src="img\ss8.png">

>We can see from this that a spider is merely a python **class** with a few **attributes**:

>**name** - name we provided to genspider command 

>**allowed_domains** - we set this attribute to make sure our spider only follows links within certain domains 
	
>**start_urls** - here we find the urls that our spider will visit when we execute it 

>For this example, change the **start_urls** attribute to **['http://quotes.toscrape.com/random']** (start_urls is a list, so keep the square brackets)

>The most important part of our spider is the parse method. **Scrapy will call this method when responses to the start urls are received.**

>We want our spider to create a dictionary with the fields 'author_name', 'text', and 'tags'. We should have found how to scrape these fields in section 1. All we need to do is put these selectors in a dictionary under the parse method. Your final code for your parse method should look like this: 

<img src="img\ss9.png">


>Our first spider is finished!

>To run our spider we will go to our terminal, or command line, and type:

**scrapy runspider quotes.py**

>The output will be printed in the terminal. We can save the output to a file by typing:

**scrapy runspider quotes.py -o items.json**
	
>This will save our output to a file called items.json

>To open this file from the terminal we can type:

**more items.json**

<img src="img\ss10.png">
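
> The saved file can also be read back in Python. This is a small sketch using only the standard library; the sample item and the temporary file are hypothetical placeholders so the snippet runs on its own, and a real items.json will hold whatever quote the site served.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the items.json written by the spider.
sample_items = [
    {'author_name': 'Jane Doe', 'text': 'A sample quote.', 'tags': ['sample', 'demo']},
]
path = os.path.join(tempfile.mkdtemp(), 'items.json')
with open(path, 'w', encoding='utf-8') as f:
    json.dump(sample_items, f)

# The JSON output is a list of items, so json.load returns a list of dicts.
with open(path, encoding='utf-8') as f:
    items = json.load(f)

for item in items:
    print(item['author_name'], '-', item['text'], item['tags'])
```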


# 2.1 Real World Application

>Now we will attempt to create a similar spider that will scrape a job posting from Stack Overflow.

>Follow the link below and use the inspect tool in your browser to check out the webpage. 

>https://stackoverflow.com/jobs/159909/principal-data-scientist-zalando-se?so=i&pg=1&offset=5

<img src="img\ss11.png">


>The spider we want to create will extract the job title, company name, company location, and the perks underneath the company name & location. Because there is the possibility of multiple perks, only one perk, or no perks, we will extract the perks in list format. 

<img src="img\ss12.png">


>When we inspect the page we can see that all the information we want is nested inside a div.-description element. To make the data extraction easier, we will store this element in a variable and extract the information from it. 

>Let's get started. In our terminal, or command line, type:

**scrapy shell**

>After the shell has loaded type:

**fetch('url')**

>Copy the url from above, or if this job posting has been deleted, use a different job post url. 

>Now we will create a variable with the div.-description element. 

**data = response.css('div.-description')**

>With this variable we can extract the data inside using data.css() rather than extracting from the entire page with response.css() 

>Now let's check out the job title to try and extract it. 

<img src="img\ss13.png">


>Ok, so it looks like we want to extract the title attribute from the a element, which is nested in an h1 element. 

>In our scrapy shell we'll type: 

**data.css('h1.-title > a::attr(title)').extract_first()**

>Perfect! We got the data we wanted. Now try and do the same for the company name, company location, and perks... Don't feel bad if you can't extract them now. I'll supply the code to extract these, but I want you to notice how they are formatted in the HTML. Then use my code to see how I went about extracting them. 

>Job Title:

**data.css('h1.-title > a::attr(title)').extract_first()**

>Company Name:

**data.css('div.-name > a.employer::text')[0].extract()**

>Company Location:

**data.css('div.-location::text')[0].extract()**

>Perks:

**data.css('div.-perks > p::text').extract()**

>Now that we have all the data we need from this page we can **quit()** the scrapy shell and create our spider. 

>In the terminal, or command line, type: 

**scrapy genspider jobs stackoverflow.com**

<img src="img\ss14.png">


>Now we just need to change the start_urls attribute to the url we are using, and put our extraction logic under our parse function!

>In the end, our spider should look like this: 

<img src="img\ss15.png">


>Now let's run our spider, save the output to json, and view our json file. 

**scrapy runspider jobs.py -o jobs.json**

**more jobs.json**

<img src="img\ss16.png">


>We can see from the output that we have successfully extracted "perks", "job_title", "company", and "location".

# 3. Scraping Multiple Items per Page

>To extract multiple items per page, the best practice is to work out the extraction logic for one instance and then use a for loop to cycle through all the instances. For our example, we will use quotes.toscrape.com. When you go to this site you will see ten quotes on the page. 

>The data for each quote is nested inside a div.quote element, so we will use this element in our for loop. 

>In our scrapy shell we can set the div.quote element to a variable since it nests all the data we want. Then we can find the elements we want to scrape using the variable. (Just like in our last example)

<img src="img\ss17.png">


>Now we will open our scrapy shell and practice extracting the data we want. 

**scrapy shell quotes.toscrape.com**

>Next we will create our variable. We will put [0] at the end of our variable to focus on the first quote, but we will not put it in our spider's for loop.

**quote = response.css('div.quote')[0]**

>Now our quote variable holds all the data for the first quote. We can use the same extraction logic that we used in our first example, except instead of using response.css() we will use **quote.css()**

<img src="img\ss18.png">


>Now let's **quit()** our scrapy shell and create a new spider. 

**scrapy genspider multiple_quotes quotes.toscrape.com**

<img src="img\ss19.png">


>Now we will put our for loop with our extraction logic in the parse function and our spider will be complete!

<img src="img\ss20.png">


>Finally, let's run our spider, save the output to json, and view our output.

**scrapy runspider multiple_quotes.py -o mquotes.json**

**more mquotes.json**

<img src="img\ss21.png">


# 3.1 Real World Application

>Now we're going to extract the same information as we did in our last real world application, except we will extract it from the job listing page. The url we will be using is:

https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab

>Let's start by opening our scrapy shell and fetching this url.

>After we fetch the url we can inspect the page and see that each job is in a div element with a -job-summary class. 

<img src="img\ss22-1.png">


>We will use this information to create our variable in our shell. 

**data = response.css('div.-job-summary')[0]**

>Now we can further extract the data we want from here using **data.css()**. Unfortunately, the extraction logic isn't exactly like the logic we used earlier, so we will need to change a couple of things. 

>When we inspect the job title we see that the title is within a h2 element instead of h1 and the class is g-col10. Other than that, the extraction logic remains the same from earlier.

<img src="img\ss23.png">


>The extraction logic for the job title is:

**data.css('h2.g-col10 > a::attr(title)').extract_first()**

>Now try and find the extraction logic to extract company name, location, and perks.

>Job Title:

**data.css('h2.g-col10 > a::attr(title)').extract_first()**

>Company Name:

**data.css('div.-name::text')[0].extract()**

>Company Location:

**data.css('div.-location::text')[0].extract()**

>Perks:

**data.css('div.-perks > span::text').extract()**

>Now that we have all the data we need from this page we can **quit()** the scrapy shell and create our spider. 

>In the terminal, or command line, type: 

**scrapy genspider multiple_jobs stackoverflow.com**

>Now we just need to change the start_urls attribute to the url we are using, and put our extraction logic under our parse function!

>In the end, our spider should look like this: 

<img src="img\ss24.png">

>Now let's run our spider, but this time let's save the output to a csv file. 

**scrapy runspider multiple_jobs.py -o mjobs.csv**

>After the command runs, go to the directory that our csv file was saved to and open the file in Excel. The file should look similar to this: 

<img src="img\ss25.png">

# 4. Following Pagination Links

>In the previous sections, we learned how to extract data from a website. Now we will learn how to crawl a website. Crawling is the ability of our spider to jump from one page to another via hyperlinks. 

>On our example website quotes.toscrape.com there is a next button that will load ten more quotes. Our goal is to create a spider that will extract all quotes from the domain. The data extraction logic will remain the same from the last section. 

<img src="img\ss26.png">

>The pagination (next) button is nested in a list item (li) element with the class 'next', and inside of it we have an anchor element with the actual link. Because we want the link in the href attribute, our code to extract the link from the next button will look like:

**response.css('li.next > a::attr(href)').extract_first()**

>However, this is a relative url and we need an absolute url. 

**next_page_url = response.css('li.next > a::attr(href)').extract_first()**

**next_page_url = response.urljoin(next_page_url)**

>response.urljoin() - joins the base url from the **response** to the url that we passed as a **parameter**. 
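
> The joining step can be tried without scrapy. For ordinary pages it behaves much like Python's own urllib.parse.urljoin, with the response's url as the base. The sketch below uses that standard library function; the '/page/2/' value is an assumed example of what the next button's href might contain.

```python
from urllib.parse import urljoin

# The url of the page we are currently on (the base).
base_url = 'http://quotes.toscrape.com/'

# Assumed example of a relative link taken from the next button.
relative_url = '/page/2/'

absolute_url = urljoin(base_url, relative_url)
print(absolute_url)  # http://quotes.toscrape.com/page/2/

# A url that is already absolute is returned unchanged.
print(urljoin(base_url, 'http://quotes.toscrape.com/page/3/'))
```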

>After our spider finds the link to the next page, it needs to create a new request. 

**yield scrapy.Request(url=next_page_url, callback=self.parse)**

>Let's create our spider to scrape all the quotes from this site. 

**scrapy genspider all_quotes quotes.toscrape.com**

<img src="img\ss27.png">


>This is how our spider should look in the end. Notice how we put an **if statement** to verify that our spider will stop when there are **no more pages**

>Now when we run our spider, we can look at the output in the terminal and see that **item_scraped_count** = 100, meaning we successfully scraped all ten quotes from each of the 10 pages. 

<img src="img\ss28.png">


# 4.1 Real World Application

>To finish up our real world spider, we need to inspect the next pagination button and use the same code from above. 

>Our final spider should look like this:

<img src="img\ss29.png">

>Notice the code I put in our if statement. If you delete this code the spider will extract all the jobs from all the pages. I only wanted to get the data from the first two pages. 

I hope you enjoyed this tutorial on web scraping with Scrapy. If you would like to learn more about web scraping, I would advise you to check out Scrapy's tutorials! 

https://learn.scrapinghub.com/scrapy/