**This notebook contains web scraping exercises to practise your scraping skills.**<br>**Remember:**
- **Check the status code of each request to make sure you got the expected response from the server.**
- **Print the response text of each request to see what kind of information you are getting and in what format.**
- **Look for patterns in the response text to extract the data requested in each question.**
- **Visit each URL and inspect its source code with the Chrome developer tools.**
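
The first two reminders can be sketched without touching the network. The helper names `describe_status` and `preview` below are hypothetical, introduced only for illustration; they use the standard-library `http.HTTPStatus` table to turn a status code into a readable phrase and to show the start of a response body.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (ok, phrase) for an HTTP status code; ok is True only for 2xx codes."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not in the standard table
        phrase = "Unknown status"
    return 200 <= code < 300, phrase


def preview(text, limit=300):
    """Return the first `limit` characters of a response body for a quick look."""
    return text[:limit]


# Typical use with requests (not run here, it needs network access):
#   response = requests.get(url, timeout=10)
#   ok, phrase = describe_status(response.status_code)
#   if ok:
#       print(preview(response.text))

print(describe_status(200))
print(describe_status(404))
```

Checking the status before parsing avoids the common mistake of scraping an error page as if it were the real content.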


- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

**All the libraries and modules you will need are included below. Feel free to explore other libraries, e.g. Scrapy.**

In [111]:
import requests
from pprint import pprint
from bs4 import BeautifulSoup
#import scrapy
from lxml import html
from lxml.html import fromstring
import urllib.request
from urllib.request import urlopen
import random
import re
import pandas as pd

### 1. Download and display the content of robots.txt for Wikipedia

Check [here](http://www.robotstxt.org/robotstxt.html) to learn more about ***robots.txt***

In [112]:
# This is the url you will scrape in this exercise
url = "https://en.wikipedia.org/robots.txt"

In [113]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("pre")
rows

[<pre>
 #
 # Localisable part of robots.txt for en.wikipedia.org
 #
 # Edit at https://en.wikipedia.org/w/index.php?title=MediaWiki:Robots.txt&amp;action=edit
 # Don't add newlines here. All rules set here are active for every user-agent.
 #
 # Please check any changes using a syntax validator such as http://tool.motoricerca.info/robots-checker.phtml
 # Enter https://en.wikipedia.org/robots.txt as the URL to check.
 #
 # https://bugzilla.wikimedia.org/show_bug.cgi?id=14075
 Disallow: /wiki/MediaWiki:Spam-blacklist
 Disallow: /wiki/MediaWiki%3ASpam-blacklist
 Disallow: /wiki/MediaWiki_talk:Spam-blacklist
 Disallow: /wiki/MediaWiki_talk%3ASpam-blacklist
 Disallow: /wiki/Wikipedia:WikiProject_Spam
 Disallow: /wiki/Wikipedia_talk:WikiProject_Spam
 #
 # Folks get annoyed when XfD discussions end up the number 1 google hit for
 # their name. 
 # https://phabricator.wikimedia.org/T16075
 Disallow: /wiki/Wikipedia:Articles_for_deletion
 Disallow: /wiki/Wikipedia%3AArticles_for_deletion
 Disall
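
Once the file is downloaded, its rules can be queried rather than just displayed. The sketch below uses the standard-library `urllib.robotparser` on a small inline sample. The two `Disallow` lines are taken from the output above; the `User-agent: *` line is an assumption added so that the parser has a group to attach the rules to.

```python
from urllib.robotparser import RobotFileParser

# Inline sample: the User-agent line is assumed, the Disallow lines come from the output above.
sample = """\
User-agent: *
Disallow: /wiki/MediaWiki:Spam-blacklist
Disallow: /wiki/Wikipedia:Articles_for_deletion
"""

parser = RobotFileParser()
parser.parse(sample.splitlines())

# Ask whether a generic crawler may fetch each URL under these rules.
allowed = parser.can_fetch("*", "https://en.wikipedia.org/wiki/Walt_Disney")
blocked = parser.can_fetch("*", "https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion")
print(allowed, blocked)
```

With a real download, the text of the response would replace `sample`.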

### 2. Display the name of the most recently added dataset on data.gov.

In [114]:
# This is the url you will scrape in this exercise
url ='http://catalog.data.gov/dataset?q=&sort=metadata_created+desc'

In [115]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 7]> <html lang="en" class="ie ie7"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="ie ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <!--[if lte ie 8]><script type="text/javascript" src="/fanstatic/vendor/:version:2018-01-17T19:07:30.98/html5.min.js"></script><![endif]-->
  <link href="/fanstatic/vendor/:version:2018-01-17T19:07:30.98/select2/select2.css" rel="stylesheet" type="text/css"/>
  <link href="/fanstatic/css/:version:2018-01-17T19:07:30.94/main.min.css" rel="stylesheet" type="text/css"/>
  <link href="/fanstatic/vendor/:version:2018-01-17T19:07:30.98/font-awesome/css/font-awesome.min.css" rel="stylesheet" type="text/css"/>
  <!--[if ie 7]><link rel="stylesheet" type="text/css" href="/fanstatic/vendor/:version:2018-01-17T19:07:30.98/font-awesome/css/font-awesome-ie7.min.css" /><![endif]-->
  <link href="/fanstatic/ckanext-harvest/:version:2018-01-17T1

In [116]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
rows = soup.select('.dataset-heading a')
rows = [e.text.strip() for e in rows]
rows

['French Frigate Shoals Site P1A 11/1/2002 17-18M',
 'Hanford Reach - Highway 24 Homestead Reclamation Project 2013',
 "Lateral boundary of the steady-state ground-water flow model by D'Agnese and others (2002), Death Valley regional ground-water flow system, Nevada and California",
 'Little River NWR Inventory and Monitoring Plan',
 'BLM REA COP 2010 Mule Deer: Winter Habitat',
 'Lacreek National Wildlife Refuge Narrative report: May, June, July, and August, 1952',
 'Lisianski Island Site P6 10/1/2002 44-45M',
 'Alligator River National Wildlife Refuge [Land Status Map: Sheet 5 of 6]',
 'Airborne geophysical survey: Plattsburgh, New York and Vermont',
 'Rose Atoll Site 25P 7/30/1999 10-11M',
 'k283ar.m77t - MGD77 data file for Geophysical data from field activity K-2-83-AR in Arctic and Beaufort Sea, Alaska from 08/05/1983 to 08/22/1983',
 'Management plan, Pacific coast brant',
 'The Road Inventory of Deer Flat National Wildlife Refuge',
 'Narrative report: Valentine National Wildlif
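
The question asks for a single name, while the cell above returns every heading on the page. A minimal sketch of picking only the first match with `select_one` follows. The inline HTML is hypothetical and only mirrors the `.dataset-heading a` structure assumed by the selector above; the two names are copied from the output.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the ".dataset-heading a" structure used above.
sample_html = """
<ul>
  <li><h3 class="dataset-heading"><a href="/dataset/one">
      French Frigate Shoals Site P1A 11/1/2002 17-18M </a></h3></li>
  <li><h3 class="dataset-heading"><a href="/dataset/two">
      Little River NWR Inventory and Monitoring Plan </a></h3></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns the first match or None, so guard against a missing element.
first_link = soup.select_one(".dataset-heading a")
latest_name = first_link.get_text(strip=True) if first_link else None
print(latest_name)
```

Because the listing is sorted by creation date in descending order, the first heading should be the most recently added dataset.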

### 3. Number of datasets currently listed on data.gov 

In [117]:
# This is the url you will scrape in this exercise
url = 'http://www.data.gov/'

In [118]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
rows = soup.select('a[href="/metrics"]')
rows = [e.text.strip() for e in rows]
rows[0]

'300,295 datasets'
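
The scraped value is still a string. If an actual number is needed, a short regular expression can pull it out. The helper name `parse_count` is hypothetical and the sketch assumes the count uses commas as thousands separators, as in the output above.

```python
import re


def parse_count(text):
    """Extract the first integer (commas allowed as thousands separators) from text."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group().replace(",", ""))


print(parse_count("300,295 datasets"))  # 300295
```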

### 4. Display all the image links from Walt Disney wikipedia page

In [119]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [120]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("img")
a = [("https:" + e.get('src').strip()) for e in rows]
a

['https://upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 'https://upload.wikimedia.

### 5. Retrieve an arbitrary Wikipedia page for "Python" and create a list of the links on that page

In [126]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [130]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("li a")
rows = [ ("https://en.wikipedia.org" + e["href"]) for e in rows]
rows

['https://en.wikipedia.org#Snakes',
 'https://en.wikipedia.org#Ancient_Greece',
 'https://en.wikipedia.org#Media_and_entertainment',
 'https://en.wikipedia.org#Computing',
 'https://en.wikipedia.org#Engineering',
 'https://en.wikipedia.org#Roller_coasters',
 'https://en.wikipedia.org#Vehicles',
 'https://en.wikipedia.org#Weaponry',
 'https://en.wikipedia.org#See_also',
 'https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(genus)',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_(film)',
 'https://en.wikipedia.org/wiki/Pythons_2',
 'https://en.wikipedia.org/wiki/Monty_Python',
 'https://en.wikipedia.org/wiki/Python_(Monty)_Pictures',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.w
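
Plain string concatenation turns in-page anchors into URLs such as `https://en.wikipedia.org#Snakes`, which drop the page path. One possible alternative is `urllib.parse.urljoin`, which resolves each `href` against the page URL. The `hrefs` list below is a hand-written sample (the `example.png` and `example.org` entries are made up) covering the usual cases: fragment, site-relative path, protocol-relative URL, absolute URL and a missing attribute.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Python"

# Sample href values; None stands for an <a> tag without an href attribute.
hrefs = [
    "#Snakes",
    "/wiki/Pythonidae",
    "//upload.wikimedia.org/example.png",
    "https://example.org/page",
    None,
]

# Skip missing hrefs and resolve everything else against the page URL.
links = [urljoin(page_url, href) for href in hrefs if href]
for link in links:
    print(link)
```

With BeautifulSoup, `e.get("href")` returns `None` for tags without the attribute, whereas `e["href"]` raises a `KeyError`.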

### 6. Number of Titles that have changed in the United States Code since its last release point 

In [131]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [132]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
table = soup.select("div .usctitle")
rows = [e.text.strip() for e in table]
rows

['All titles in the format selected compressed into a zip archive.',
 '',
 'Title 1 - General Provisions',
 'Title 3 - The President',
 'Title 4 - Flag and Seal, Seat of Government, and the States',
 'Title 6 - Domestic Security',
 'Title 7 - Agriculture',
 'Title 8 - Aliens and Nationality',
 'Title 9 - Arbitration',
 'Title 10 - Armed Forces',
 'Title 11 - Bankruptcy',
 'Title 12 - Banks and Banking',
 'Title 13 - Census',
 'Title 16 - Conservation',
 'Title 18 - Crimes and Criminal Procedure',
 'Title 23 - Highways',
 'Title 24 - Hospitals and Asylums',
 'Title 25 - Indians',
 'Title 26 - Internal Revenue Code',
 'Title 27 - Intoxicating Liquors',
 'Title 29 - Labor',
 'Title 30 - Mineral Lands and Mining',
 'Title 31 - Money and Finance',
 'Title 32 - National Guard',
 'Title 35 - Patents',
 'Title 36 - Patriotic and National Observances, Ceremonies, and Organizations',
 'Title 37 - Pay and Allowances of the Uniformed Services',
 'Title 39 - Postal Service',
 'Title 40 - Public Bui
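
The list above contains the titles matched by `.usctitle`, but the question asks how many titles have changed. The sketch below assumes that changed titles carry a separate CSS class, here called `usctitlechanged`; that class name and the inline markup are assumptions that should be verified in the page source with the developer tools.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "usctitlechanged" class is an assumption to verify on the real page.
sample_html = """
<div class="uscitemlist">
  <div class="usctitle">Title 1 - General Provisions</div>
  <div class="usctitlechanged">Title 2 - The Congress</div>
  <div class="usctitle">Title 3 - The President</div>
  <div class="usctitlechanged">Title 5 - Government Organization and Employees</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A class selector matches whole class tokens, so ".usctitle" does not match "usctitlechanged".
changed = [div.get_text(strip=True) for div in soup.select("div.usctitlechanged")]
print(len(changed))
print(changed)
```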

### 7. A Python list with the names on the FBI's Ten Most Wanted list

In [133]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [134]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "lxml")
rows = soup.select(".title a")
rows = [e.text for e in rows]
rows

['Most Wanted',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'LAMONT STEPHENSON',
 'JASON DEREK BROWN',
 'GREG ALYN CARLSON',
 'SANTIAGO VILLALBA MEDEROS',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'ALEXIS FLORES',
 'ALEJANDRO ROSALES CASTILLO',
 'YASER ABDEL SAID']
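
The result has eleven entries because the selector also picks up the `'Most Wanted'` heading. A simple way to keep only the ten names is to filter that entry out. The sketch below works on the list copied from the output above and assumes the heading text stays the same.

```python
# List copied from the scraped output above.
rows = [
    'Most Wanted',
    'BHADRESHKUMAR CHETANBHAI PATEL',
    'LAMONT STEPHENSON',
    'JASON DEREK BROWN',
    'GREG ALYN CARLSON',
    'SANTIAGO VILLALBA MEDEROS',
    'RAFAEL CARO-QUINTERO',
    'ROBERT WILLIAM FISHER',
    'ALEXIS FLORES',
    'ALEJANDRO ROSALES CASTILLO',
    'YASER ABDEL SAID',
]

# Drop the section heading and any surrounding whitespace.
names = [name.strip() for name in rows if name.strip().lower() != 'most wanted']
print(len(names))
print(names)
```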

### 8. Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) from the EMSC as a pandas DataFrame

In [137]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [146]:
#your code
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("td b a")
rows = [e.text.split() for e in rows]
rows

[['2018-10-31', '17:39:10.1'],
 ['2018-10-31', '17:12:55.2'],
 ['2018-10-31', '16:47:08.4'],
 ['2018-10-31', '16:32:33.7'],
 ['2018-10-31', '16:03:02.0'],
 ['2018-10-31', '15:48:24.4'],
 ['2018-10-31', '15:38:20.5'],
 ['2018-10-31', '15:34:30.0'],
 ['2018-10-31', '15:27:09.0'],
 ['2018-10-31', '15:08:00.0'],
 ['2018-10-31', '15:05:15.2'],
 ['2018-10-31', '14:32:00.0'],
 ['2018-10-31', '14:22:47.0'],
 ['2018-10-31', '14:16:57.0'],
 ['2018-10-31', '14:07:34.0'],
 ['2018-10-31', '13:51:18.0'],
 ['2018-10-31', '13:35:52.0'],
 ['2018-10-31', '13:03:18.5'],
 ['2018-10-31', '12:55:54.5'],
 ['2018-10-31', '12:51:40.0'],
 ['2018-10-31', '12:49:38.1'],
 ['2018-10-31', '12:36:31.9'],
 ['2018-10-31', '12:28:06.4'],
 ['2018-10-31', '12:25:50.8'],
 ['2018-10-31', '12:12:04.0'],
 ['2018-10-31', '12:11:03.0'],
 ['2018-10-31', '11:41:57.2'],
 ['2018-10-31', '11:37:54.3'],
 ['2018-10-31', '11:33:12.0'],
 ['2018-10-31', '11:28:38.0'],
 ['2018-10-31', '11:21:33.9'],
 ['2018-10-31', '11:15:02.0'],
 ['2018-

In [161]:
rows1 = soup.find_all("td", {"class":"tabev1"})
rows1 = [e.string.split() for e in rows1]
rows1

[['42.78'],
 ['12.71'],
 ['33.48'],
 ['116.79'],
 ['19.34'],
 ['154.99'],
 ['43.98'],
 ['78.80'],
 ['7.16'],
 ['106.66'],
 ['37.36'],
 ['20.55'],
 ['37.38'],
 ['121.74'],
 ['9.66'],
 ['83.78'],
 ['38.77'],
 ['15.23'],
 ['16.22'],
 ['97.84'],
 ['19.42'],
 ['155.29'],
 ['36.87'],
 ['3.21'],
 ['16.50'],
 ['94.36'],
 ['16.45'],
 ['95.31'],
 ['16.50'],
 ['95.20'],
 ['18.22'],
 ['103.51'],
 ['15.77'],
 ['96.11'],
 ['37.19'],
 ['20.66'],
 ['40.12'],
 ['19.94'],
 ['19.75'],
 ['102.15'],
 ['37.39'],
 ['20.73'],
 ['57.39'],
 ['66.16'],
 ['36.08'],
 ['117.86'],
 ['61.38'],
 ['149.44'],
 ['37.46'],
 ['20.55'],
 ['16.39'],
 ['95.13'],
 ['37.39'],
 ['20.54'],
 ['37.76'],
 ['21.01'],
 ['14.15'],
 ['93.46'],
 ['16.41'],
 ['95.15'],
 ['38.19'],
 ['36.28'],
 ['16.92'],
 ['100.04'],
 ['17.34'],
 ['95.24'],
 ['20.66'],
 ['105.62'],
 ['14.35'],
 ['93.57'],
 ['37.36'],
 ['20.79'],
 ['37.54'],
 ['20.64'],
 ['29.30'],
 ['72.44'],
 ['35.95'],
 ['96.76'],
 ['35.46'],
 ['29.31'],
 ['37.45'],
 ['20.72'],
 ['27.83

In [None]:
a = pd.Data
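
One way to combine the two lists into a table is sketched below. It assumes that the `tabev1` cells alternate between latitude and longitude, which matches the values shown above but should be checked against the page. The region name is not in either list, so it is left out, and only the first three events from the outputs are used.

```python
import pandas as pd

# First entries copied from the outputs above.
rows = [
    ["2018-10-31", "17:39:10.1"],
    ["2018-10-31", "17:12:55.2"],
    ["2018-10-31", "16:47:08.4"],
]
rows1 = [["42.78"], ["12.71"], ["33.48"], ["116.79"], ["19.34"], ["154.99"]]

# Assumption: values alternate latitude, longitude, latitude, longitude, ...
coords = [float(cell[0]) for cell in rows1]
latitudes = coords[0::2]
longitudes = coords[1::2]

df = pd.DataFrame(rows, columns=["date", "time"])
df["latitude"] = latitudes[:len(df)]
df["longitude"] = longitudes[:len(df)]

# Keep at most the 20 latest events, as the exercise asks.
df = df.head(20)
print(df)
```

The scraped magnitudes carry no hemisphere sign, so the N/S and E/W markers would have to be scraped separately to get signed coordinates.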

### 9. Display the date, days, title, city and country of the next 25 Hackevents as a table

In [35]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'

In [11]:
#your code

### 10. Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [52]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'
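
The try/except requirement can be sketched as follows. The function name `fetch_profile_html` is hypothetical, and the sketch assumes that an unknown account produces either an HTTP error status or a failed request. It only covers fetching the page; whether the counts can then be read from the returned HTML depends on how the site serves its pages.

```python
import requests


def fetch_profile_html(account, base_url="https://twitter.com/"):
    """Return the profile page HTML for `account`, or None if the request fails."""
    url = base_url + account
    try:
        response = requests.get(url, timeout=10)
        # Turn 4xx/5xx responses (for example 404 for an unknown account) into an exception.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Could not retrieve {url}: {error}")
        return None
    return response.text


# Example call (needs network access):
#   html_text = fetch_profile_html("some_account")
```

Catching `requests.exceptions.RequestException` covers HTTP errors, timeouts and connection problems in one place, so the program can report a missing account instead of crashing.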

In [12]:
#your code

### 11. Number of followers of a given Twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [51]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [13]:
#your code

### 12. List all language names and number of related articles in the order they appear in wikipedia.org

In [39]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [14]:
#your code

### 13. A list with the different kinds of datasets available in data.gov.uk 

In [5]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [15]:
#your code 

### 14. The total number of publications produced by the GAO (U.S. Government Accountability Office)

In [45]:
# This is the url you will scrape in this exercise
url = 'http://www.gao.gov/browse/date/custom'

In [16]:
#your code 

### 15. Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [27]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [17]:
#your code

### BONUS QUESTIONS

### 16. Scrape a certain number of tweets of a given Twitter account.

In [53]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [19]:
# your code

### 17. IMDb's Top 250 data (movie name, initial release, director name and stars) as a pandas DataFrame

In [35]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [20]:
# your code

### 18. Movie name, year and a brief summary of 10 random movies from IMDb's Top 250 as a pandas DataFrame

In [54]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [21]:
#your code

### 19. Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
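
This endpoint returns JSON rather than HTML, so no HTML parser is needed. The sketch below builds the query string with `urlencode`, which also escapes city names containing spaces, and extracts the four requested fields. The helper names, the `YOUR_API_KEY` placeholder and the sample payload are made up for illustration; the field names follow the current-weather response format as commonly documented and should be checked against a real response.

```python
import json
from urllib.parse import urlencode


def build_weather_url(city, api_key):
    """Build the current-weather URL with properly escaped query parameters."""
    query = urlencode({"q": city, "APPID": api_key, "units": "metric"})
    return "http://api.openweathermap.org/data/2.5/weather?" + query


def summarize_weather(payload):
    """Pick temperature, wind speed, description and weather group from the parsed JSON."""
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "description": payload["weather"][0]["description"],
        "weather": payload["weather"][0]["main"],
    }


# Made-up sample payload with the assumed structure.
sample_response = json.loads("""
{
  "weather": [{"main": "Clouds", "description": "broken clouds"}],
  "main": {"temp": 12.5},
  "wind": {"speed": 4.1},
  "name": "London"
}
""")

print(build_weather_url("New York", "YOUR_API_KEY"))
print(summarize_weather(sample_response))
```

With `requests`, the parsed payload would come from `requests.get(url).json()`.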

In [22]:
# your code

### 20. Book name, price and stock availability as a pandas DataFrame.

In [55]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [23]:
#your code