# Scraping Yelp

The aim of this exercise is to let a user run an automated search on <a href="https://www.yelp.fr/" target="_blank">Yelp</a> and store the results in a `.json` file. You will be guided through the steps: submitting a form request with search keywords, parsing the search results, crawling all the result pages, and storing the results in a file.

⚠ **Because Scrapy is not designed to launch several crawler processes from the same script, you will have to restart your notebook's kernel before completing each question!**

In [5]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.6.1-py2.py3-none-any.whl (264 kB)
Collecting w3lib>=1.17.0
  Using cached w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting cssselect>=0.9.1
  Using cached cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting parsel>=1.5.0
  Using cached parsel-1.6.0-py2.py3-none-any.whl (13 kB)
Collecting service-identity>=16.0.0
  Using cached service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting zope.interface>=4.1.3
  Using cached zope.interface-5.4.0-cp39-cp39-manylinux2010_x86_64.whl (255 kB)
Collecting tldextract
  Downloading tldextract-3.2.0-py3-none-any.whl (87 kB)
Collecting lxml>=3.5.0
  Using cached lxml-4.8.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.9 MB)
Collecting PyDispatcher>=2.0.5
  Using cached PyDispatcher-2.0.5-py3-none-any.whl
Collecting

1. Create a class `YelpSpider(scrapy.Spider)` with `start_urls = ['https://www.yelp.fr/']`. In this class, define a `parse(self, response)` method that automatically fills in the form on Yelp's homepage with "restaurant japonais" as the search keywords and "Paris" as the search location. Then define another method, `after_search(self, response)`, that parses the first page of results and yields the name and URL of each search result. Finally, declare a `CrawlerProcess` that stores the results in a file named `"restaurant_japonais-paris.json"`.

In [7]:
!python 02-Optional_Scraping_Yelp/yelp1.py

2022-02-24 14:53:49 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-02-24 14:53:49 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 36.0.1, Platform Linux-5.4.144+-x86_64-with-glibc2.31
2022-02-24 14:53:49 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20, 'USER_AGENT': 'Chrome/97.0'}
2022-02-24 14:53:49 [scrapy.extensions.telnet] INFO: Telnet Password: 3353827ae5fc5891
2022-02-24 14:53:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-02-24 14:53:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.do

2. Once you've managed to get the first page's results into `restaurant_japonais-paris.json`, extend the `after_search(self, response)` method so that it also crawls the following result pages and all the search results end up in `"restaurant_japonais-paris.json"`. Restart your notebook's kernel, execute the new `CrawlerProcess` and check that all the search results (and not only the first page) are now stored in the file.

In [8]:
!python 02-Optional_Scraping_Yelp/yelp2.py

2022-02-24 16:19:11 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-02-24 16:19:11 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 36.0.1, Platform Linux-5.4.144+-x86_64-with-glibc2.31
2022-02-24 16:19:11 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20, 'USER_AGENT': 'Chrome/97.0'}
2022-02-24 16:19:11 [scrapy.extensions.telnet] INFO: Telnet Password: d9e8b220bb7d5214
2022-02-24 16:19:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-02-24 16:19:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.do

Congrats, you've just built a proof of concept for an automated search on Yelp with Scrapy! Now, let's improve the script so that the user can run any search at any location 😎

3. Use Python's `input()` function to ask the user which keywords and location they would like to use, and save them in two variables: `search_keywords` and `search_location`. Then change the `parse(self, response)` method so that it fills Yelp's form with the user-defined keywords and location. Finally, change the `CrawlerProcess` so that it stores the results in a file named with the following format: `search_keywords-location.json`.

Try your search engine with different keywords and locations ✌️

Run the following command in a terminal so that you can interact with the script:
```shell
python 02-Optional_Scraping_Yelp/yelp3.py
```
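
The user-facing part of this question can be sketched with the standard library alone. The helper names `ask_user` and `build_feed_name` are hypothetical, and lower-casing the answers and replacing spaces with underscores is an assumption made here so that "restaurant japonais" and "Paris" give the same file name as in the previous questions.

```python
def ask_user():
    """Ask for the search terms on the command line."""
    search_keywords = input("What are you looking for? ")
    search_location = input("Where? ")
    return search_keywords, search_location


def build_feed_name(search_keywords, search_location):
    """Build the output file name from the user's answers."""
    def slug(text):
        # Lower-case the text and join its words with underscores.
        return "_".join(text.lower().split())

    return f"{slug(search_keywords)}-{slug(search_location)}.json"
```

In the script, `search_keywords, search_location = ask_user()` would run before the spider is defined. The two variables then replace the hard-coded strings passed as `formdata` to `FormRequest.from_response`, and the result of `build_feed_name(...)` can serve as the file name given to Scrapy's `FEEDS` setting.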

4. Create a script that uses the JSON file you just created to build the list of URLs you wish to scrape.

In [4]:
import os
os.getcwd()

'/home/jovyan/FULL_STACK_12_WEEK_PROGRAM/M04-Data_Collection_and_Management/D02-Web_Scraping/03-Instructors/01-Solutions'

In [2]:
import json

# Load the search results saved by the spider; the context manager closes the file
with open("02-Optional_Scraping_Yelp/restaurant_japonais-paris.json") as f:
    file = json.load(f)

In [3]:
# The scraped URLs are site-relative ("/biz/..."), so the base must not end with "/"
list_urls = ["https://www.yelp.fr" + element["url"] for element in file]
list_urls[:10]

['https://www.yelp.fr/biz/sanukiya-paris?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/sushi-yaki-paris-4?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/onigiriya-paris?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/aki-paris-2?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/okuda-paris?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/y-izakaya-paris?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/ippudo-paris-2?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/la-maison-du-sak%C3%A9-paris?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/teppanyaki-ginza-onodera-paris?osq=restaurant+japonais',
 'https://www.yelp.fr/biz/ginza-paris-5?osq=restaurant+japonais']

5. Scrape the list of URLs and gather the following data about each restaurant (or place):
    * name
    * stars
    * number of votes
    * address
    * opening hours
    * phone
    * amenities
    * reviews

In [10]:
!python 02-Optional_Scraping_Yelp/yelp4.py

2022-03-18 17:39:16 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-18 17:39:16 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) - [GCC 9.4.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 36.0.1, Platform Linux-5.4.144+-x86_64-with-glibc2.31
2022-03-18 17:39:16 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True, 'LOG_LEVEL': 20, 'USER_AGENT': 'Chrome/97.0'}
2022-03-18 17:39:16 [scrapy.extensions.telnet] INFO: Telnet Password: e155f1a4f63028fc
2022-03-18 17:39:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2022-03-18 17:3