**This notebook contains web scraping exercises to practise your scraping skills.**<br>**Remember:**
- **Check the status code of each request to make sure you got the expected response from the server.**
- **Print the response text of each request to see what kind of information you are getting and in what format.**
- **Look for patterns in the response text to extract the data requested in each question.**
- **Visit each URL and inspect its source with the Chrome developer tools.**
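
The first two reminders can be folded into a small helper. The sketch below is only an illustration: `check_response` is a hypothetical name, and it takes the status code and body text as plain arguments so it can be used with `requests` (`r.status_code`, `r.text`) or any other client.

```python
from http import HTTPStatus


def check_response(status_code, text, preview=80):
    """Summarise a response, raising if the status code is not 2xx."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "Unknown status"
    if not 200 <= status_code < 300:
        raise RuntimeError(f"Request failed: {status_code} {phrase}")
    # Show only the start of the body so large pages do not flood the output
    return f"{status_code} {phrase}: {text[:preview]!r}"


# With requests this would be called as:
#     r = requests.get(url)
#     print(check_response(r.status_code, r.text))
print(check_response(200, "<html><head><title>Example</title></head></html>"))
```

Raising on a non-2xx code stops the cell early, which is usually easier to debug than parsing an error page by accident.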


- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

**All the libraries and modules you will need are included below. Feel free to explore other libraries, e.g. scrapy.**

In [1]:
import requests
# from pprint import pprint
from bs4 import BeautifulSoup
# import scrapy
from lxml import html
from lxml.html import fromstring
import urllib.request
from urllib.request import urlopen
import random
import re
import pandas as pd

### 1. Download and display the content of robots.txt for Wikipedia

Check [here](http://www.robotstxt.org/robotstxt.html) to learn more about ***robots.txt***

In [2]:
# This is the url you will scrape in this exercise
url = "https://en.wikipedia.org/robots.txt"

In [3]:
html_1 = requests.get(url).content  # html_1 avoids shadowing the lxml "html" import
print(html_1[0:50])  # Print only a sample to keep the output short

b'\xef\xbb\xbf# robots.txt for http://www.wikipedia.org/ and '
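
The output above starts with the bytes `\xef\xbb\xbf`, a UTF-8 byte order mark. As a hedged sketch, decoding with `utf-8-sig` strips it, and the standard library's `urllib.robotparser` can then answer whether a path may be fetched. The sample bytes below are hand-written for illustration and are not Wikipedia's actual rules; `parse_robots` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(raw_bytes):
    """Decode robots.txt bytes (dropping a leading BOM) and parse the rules."""
    text = raw_bytes.decode("utf-8-sig")
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hand-written sample; with requests you would pass requests.get(url).content
sample = b"\xef\xbb\xbf# sample rules\nUser-agent: *\nDisallow: /private/\n"
rules = parse_robots(sample)
print(rules.can_fetch("*", "https://example.org/private/page"))
print(rules.can_fetch("*", "https://example.org/wiki/Python"))
```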


### 2. Display the name of the most recently added dataset on data.gov.

In [4]:
# This is the url you will scrape in this exercise
url_2 ='http://catalog.data.gov/dataset?q=&sort=metadata_created+desc'

In [5]:
html_2 = requests.get(url_2).content
soup = BeautifulSoup(html_2, "lxml")
recent_data = soup.select(".dataset-content .dataset-heading")
print(recent_data[0].text.strip())

French Frigate Shoals Site P1A 11/1/2002 17-18M


### 3. Number of datasets currently listed on data.gov 

In [6]:
# This is the url you will scrape in this exercise
url3 = 'http://www.data.gov/'

In [7]:
html_3 = requests.get(url3).content
soup = BeautifulSoup(html_3,"lxml")
numb_ds = soup.select(".header.banner.frontpage-search .container .text-center.getstarted a[href]")
print(numb_ds[0].text)

300,295 datasets


### 4. Display all the image links from Walt Disney wikipedia page

In [8]:
# This is the url you will scrape in this exercise
url4 = 'https://en.wikipedia.org/wiki/Walt_Disney'
html_4 = requests.get(url4).content
soup = BeautifulSoup(html_4,"lxml")
images = soup.select(".image")
images_links = ["https://en.wikipedia.org" + link.attrs["href"] for link in images]
display(images_links)



['https://en.wikipedia.org/wiki/File:Walt_Disney_1946.JPG',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1942_signature.svg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_envelope_ca._1921.jpg',
 'https://en.wikipedia.org/wiki/File:Trolley_Troubles_poster.jpg',
 'https://en.wikipedia.org/wiki/File:Steamboat-willie.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_1935.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Snow_white_1937_trailer_screenshot_(13).jpg',
 'https://en.wikipedia.org/wiki/File:Disney_drawing_goofy.jpg',
 'https://en.wikipedia.org/wiki/File:DisneySchiphol1951.jpg',
 'https://en.wikipedia.org/wiki/File:WaltDisneyplansDisneylandDec1954.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_disney_portrait_right.jpg',
 'https://en.wikipedia.org/wiki/File:Walt_Disney_Grave.JPG',
 'https://en.wikipedia.org/wiki/File:Roy_O._Disney_with_Company_at_Press_Conference.jpg',
 'https://en.wikipedia.org/wiki/File:Disney_Display_Case.JPG',
 'https://en.wikipedia.org/wiki/File:Disney1968.jpg',
 'https://en.wikipedia.org/wiki/File:P_vip.svg',
 'https://en.wikipedia.org/w

### 5. Retrieve an arbitrary Wikipedia page for "Python" and create a list of the links on that page

In [9]:
# This is the url you will scrape in this exercise
url_5 ='https://en.wikipedia.org/wiki/Python' 

In [10]:
html_5 = requests.get(url_5).content
soup = BeautifulSoup(html_5,"lxml")
links = soup.select('a[href^="/wiki"]')  # Keep only the links that lead to other wiki pages
links_wiki = ["https://en.wikipedia.org" + i.attrs["href"] for i in links]  # The absolute links are stored here
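
Relative links such as `/wiki/...` can also be resolved against the page URL with `urllib.parse.urljoin`, which handles leading slashes and fragments without manual string concatenation. A minimal sketch follows; the `hrefs` list is illustrative and not taken from the live page.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Python"
# Illustrative hrefs in the forms that commonly appear on a wiki page
hrefs = ["/wiki/Python_(programming_language)", "/wiki/Pythonidae", "#See_also"]

absolute_links = [urljoin(page_url, href) for href in hrefs]
for link in absolute_links:
    print(link)
```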


### 6. Number of Titles that have changed in the United States Code since its last release point 

In [11]:
# This is the url you will scrape in this exercise
url_6 = 'http://uscode.house.gov/download/download.shtml'

In [12]:
html_6 = requests.get(url_6).content
soup = BeautifulSoup(html_6,"lxml")
changed_tit = soup.select(".uscitem .usctitlechanged")
len(changed_tit)

18

### 7. A Python list with the names on the FBI's Ten Most Wanted list

In [13]:
# This is the url you will scrape in this exercise
url_7 = 'https://www.fbi.gov/wanted/topten'

In [14]:
html_7 = requests.get(url_7).content
soup = BeautifulSoup(html_7,"lxml")
mw = soup.select(".title")
top_10 = [e.text.strip() for e in mw]
print(top_10[1:])  # Skip the first ".title" match, which is not a fugitive's name

['BHADRESHKUMAR CHETANBHAI PATEL', 'LAMONT STEPHENSON', 'JASON DEREK BROWN', 'GREG ALYN CARLSON', 'SANTIAGO VILLALBA MEDEROS', 'RAFAEL CARO-QUINTERO', 'ROBERT WILLIAM FISHER', 'ALEXIS FLORES', 'ALEJANDRO ROSALES CASTILLO', 'YASER ABDEL SAID']


### 8. Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) from the EMSC as a pandas dataframe

In [15]:
# This is the url you will scrape in this exercise
url_8 = 'https://www.emsc-csem.org/Earthquake/'

In [168]:
html_8 = requests.get(url_8).content
soup = BeautifulSoup(html_8,"lxml")
# ".tabev1" cells alternate between latitude and longitude values
latitud_s = soup.select("#tbody .tabev1")
latitud = [lat.text.replace("\xa0","") for lat in latitud_s]
# ".tabev2" cells include the N/S/E/W letters; keep only the alphabetic entries
cardinal = soup.select("#tbody .tabev2")
cardinal_2 = [lat.text.replace("\xa0","") for lat in cardinal]
cardinal_2 = [e for e in cardinal_2 if e.isalpha()]
region = soup.select(".tb_region")
region_s = [reg.text.replace("\xa0","") for reg in region]
# Each link text holds the date and the time separated by whitespace
time_date = soup.select("#tbody .tabev6 [href]")
date = [card.text.split()[0] for card in time_date]
time = [card.text.split()[1] for card in time_date]
# Even positions are latitudes, odd positions are longitudes
latitude = latitud[0::2]
longitude = latitud[1::2]
lat_card = cardinal_2[0::2]
long_card = cardinal_2[1::2]
dictionary = {"date":date,"time":time,"latitude2":latitude,"lat_card":lat_card,"long_card":long_card,"longitude2":longitude,"region name":region_s}
df = pd.DataFrame(dictionary)
# Join each value with its hemisphere letter, e.g. "36.68 N"
df["latitude"] = df["latitude2"] + " " + df["lat_card"]
df["longitude"] = df["longitude2"] + " " + df["long_card"]
df2 = df[["date","time","latitude","longitude",'region name']]
df2.index +=1  # Number the rows from 1
df2.head(20)



| | date | time | latitude | longitude | region name |
|---|---|---|---|---|---|
| 1 | 2018-10-31 | 20:29:42.8 | 36.68 N | 121.31 W | CENTRAL CALIFORNIA |
| 2 | 2018-10-31 | 20:23:05.4 | 36.57 N | 28.41 E | DODECANESE IS.-TURKEY BORDER REG |
| 3 | 2018-10-31 | 20:15:03.4 | 37.73 N | 20.76 E | IONIAN SEA |
| 4 | 2018-10-31 | 20:00:19.9 | 32.61 N | 48.57 E | WESTERN IRAN |
| 5 | 2018-10-31 | 19:47:30.0 | 8.72 N | 82.46 W | PANAMA-COSTA RICA BORDER REGION |
| 6 | 2018-10-31 | 19:35:42.0 | 37.64 N | 20.77 E | IONIAN SEA |
| 7 | 2018-10-31 | 19:04:59.5 | 40.13 N | 31.61 E | WESTERN TURKEY |
| 8 | 2018-10-31 | 19:03:10.2 | 43.64 N | 147.46 E | KURIL ISLANDS |
| 9 | 2018-10-31 | 19:02:48.0 | 35.89 N | 27.47 E | DODECANESE ISLANDS, GREECE |
| 10 | 2018-10-31 | 18:22:55.0 | 9.65 S | 117.25 E | SUMBAWA REGION, INDONESIA |
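
The table keeps coordinates as text such as `36.68 N`. For plotting or distance calculations, signed decimal degrees are usually more convenient. A minimal sketch, assuming the "value, space, hemisphere letter" format shown above; `to_signed_degrees` is a hypothetical helper name.

```python
def to_signed_degrees(value):
    """Convert '36.68 N' or '121.31 W' to a signed float (S and W are negative)."""
    number, hemisphere = value.split()
    sign = -1.0 if hemisphere.upper() in ("S", "W") else 1.0
    return sign * float(number)


# Two coordinate pairs copied from the table above
rows = [("36.68 N", "121.31 W"), ("9.65 S", "117.25 E")]
for lat, lon in rows:
    print(to_signed_degrees(lat), to_signed_degrees(lon))
```

With pandas, the same function could be applied column-wise, for example `df2["latitude"].map(to_signed_degrees)`.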


### 9. Display the date, days, title, city and country of the next 25 Hackevents as a table

In [17]:
# This is the url you will scrape in this exercise
url_9 ='https://hackevents.co/hackathons'

In [170]:
html_9 = requests.get(url_9).content
soup_9 = BeautifulSoup(html_9,"lxml")
date_m = soup_9.select(".date-month")
month = [month.text.split() for month in date_m]
date_day = soup_9.select(".date-day-number")
day = [day_.text.split() for day_ in date_day]
date_days = soup_9.select(".date-week-days")
week_days_ = [wday.text.split() for wday in date_days]
week_days = [dia[0] for dia in week_days_]
title_ = soup_9.select(".title")
title = [tit.text for tit in title_]
city_ = soup_9.select(".city")
city = [city2.text.strip() for city2 in city_]
country_ = soup_9.select(".country")
country = [ctr.text.strip() for ctr in country_]
date = [(a[0][0]+"-"+a[1][0]) for a in list(zip(month,day))]
keys = ["date","days", "title", "city", "country"]
values = [date,week_days,title,city,country]
dictionary = dict(zip(keys, values))
df = pd.DataFrame(dictionary)
df.index += 11
df

| | date | days | title | city | country |
|---|---|---|---|---|---|
| 11 | Nov-1 | Thu-Fri | Rocket APT Challenge | Boston | United States |
| 12 | Nov-2 | Fri-Sun | Hack Access Dublin 2017 - register your interest! | Dublin | Ireland |
| 13 | Nov-2 | Fri-Sun | Disrupt Puerto Rico - Conference & Hackathon | San Juan | Puerto Rico |
| 14 | Nov-3 | Sat-Sun | jacobsHack! 2018 | Bremen | Germany |
| 15 | Nov-3 | Sat | Women's Hackathon | St. Louis | USA |
| 16 | Nov-3 | Sat-Sun | jacobsHack! 2018 | Bremen | Germany |
| 17 | Nov-3 | Sat-Sun | jacobsHack! 2018 | Bremen | Germany |
| 18 | Nov-3 | Sat-Sun | HackTheMidlands 3.0 | Birmingham | United Kingdom |
| 19 | Nov-3 | Sat-Sun | jacobsHack! 2018 | Bremen | Germany |
| 20 | Nov-3 | Sat-Sun | jacobsHack! 2018 | Bremen | Germany |


### 10. Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account
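
A sketch of the try/except pattern this exercise asks for, using `urllib` from the imports above. `fetch_profile` and `fake_opener` are hypothetical names. The sketch assumes that an unknown account produces an HTTP error status such as 404, which may not hold for the live site, and it does not attempt to extract the tweet count. The `opener` parameter exists only so the error path can be exercised without network access.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_profile(account, opener=urlopen, base_url="https://twitter.com/"):
    """Return the profile page HTML, or None if the account cannot be fetched."""
    try:
        with opener(base_url + account) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # Assumption: the site answers with an error status for unknown accounts
        print(f"Account {account!r} not found (HTTP {err.code})")
    except URLError as err:
        print(f"Could not reach the server: {err.reason}")
    return None


def fake_opener(url):
    """Stand-in for urlopen that always reports a missing page."""
    raise HTTPError(url, 404, "Not Found", None, io.BytesIO())


print(fetch_profile("some_missing_account", opener=fake_opener))
```

`HTTPError` is caught before `URLError` because it is a subclass of it; the other order would hide the status code.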

In [19]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [20]:
#your code

### 11. Number of followers of a given Twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [21]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [22]:
#your code

### 12. List all language names and number of related articles in the order they appear in wikipedia.org

In [172]:
# This is the url you will scrape in this exercise
url_12 = 'https://www.wikipedia.org/'

In [296]:
html_12 =requests.get(url_12).content
soup_12 = BeautifulSoup(html_12,"lxml")
languages = soup_12.select("strong")
idiomas = [idiomas.text for idiomas in languages[1:]]
number = soup_12.select("bdi")
numeros = [(str(num).split()[1][-1] + str(num).split()[2]+ str(num).split()[3].replace("+</bdi>","")) for num in number[:10]]
lista = list(zip(idiomas,numeros))
lista

[('English', '5734000'),
 ('Español', '1481000'),
 ('日本語', '1124000'),
 ('Deutsch', '2228000'),
 ('Русский', '1502000'),
 ('Français', '2047000'),
 ('Italiano', '1467000'),
 ('中文', '1026000'),
 ('Português', '1007000'),
 ('Polski', '1303000')]
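
The string slicing used above to build each number depends on the exact markup and on the leading group having a single digit. A hedged alternative is to strip every non-digit character from the tag text. The sample strings below are hand-written in the display format suggested by the output (digit groups followed by `+`); `parse_count` is a hypothetical helper name.

```python
import re


def parse_count(text):
    """Turn a display count such as '5 734 000+' into the integer 5734000."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"No digits found in {text!r}")
    return int(digits)


# Groups may be separated by regular or non-breaking spaces
samples = ["5 734 000+", "1\xa0481\xa0000+", "12 345 000+"]
print([parse_count(s) for s in samples])
```

With BeautifulSoup this would be called on the tag text, for example `parse_count(num.text)`.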

### 13. A list with the different kinds of datasets available in data.gov.uk 

In [310]:
# This is the url you will scrape in this exercise
url_13 = 'https://data.gov.uk/'
html_13 = requests.get(url_13).content
soup_13 = BeautifulSoup(html_13,"lxml")
datasets  = soup_13.select("h2 [href]")
data_sets = [data.text for data in datasets]
display(data_sets)

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']

### 14. The total number of publications produced by the GAO (U.S. Government Accountability Office)

In [330]:
# This is the url you will scrape in this exercise
url_14 = 'http://www.gao.gov/browse/date/custom'
html_14 = requests.get(url_14).content
soup_14 = BeautifulSoup(html_14,"lxml")
number_of_pub = soup_14.select("h2.scannableTitle")
display(str(number_of_pub).split()[9])

'54,912'

### 15. Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [332]:
# This is the url you will scrape in this exercise
url_15 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [444]:
html_15 = requests.get(url_15).content
soup_15 = BeautifulSoup(html_15,"lxml")
languages = soup_15.select('a[href^="/wiki"]')
lang = [idioma.text for idioma in languages[12:22]]
number = soup_15.select("td")
numb = [numero.text for numero in number[2::4]][:10]
zipped = list(zip(lang,numb))

In [465]:
top10 = pd.DataFrame(zipped)
top10.columns = ["Language","Native Speakers in Millions 2007 (2010)"]
display (top10)

| | Language | Native Speakers in Millions 2007 (2010) |
|---|---|---|
| 0 | Mandarin | 935 (955) |
| 1 | Spanish | 390 (405) |
| 2 | English | 365 (360) |
| 3 | Hindi | 295 (310) |
| 4 | Arabic | 280 (295) |
| 5 | Portuguese | 205 (215) |
| 6 | Bengali | 200 (205) |
| 7 | Russian | 160 (155) |
| 8 | Japanese | 125 (125) |
| 9 | Punjabi | 95 (100) |


### BONUS QUESTIONS

### 16. Scrape a certain number of tweets of a given Twitter account.

In [31]:
# This is the url you will scrape in this exercise 
# You will need to append the account name to this url
url = 'https://twitter.com/'

In [32]:
# your code

### 17. IMDB's Top 250 data (movie name, initial release, director name and stars) as a pandas dataframe

In [33]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [34]:
# your code

### 18. Movie name, year and a brief summary of 10 random movies from the IMDB Top 250 as a pandas dataframe.

In [35]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [36]:
#your code
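
One way to handle the "random" part of this exercise is `random.sample`, which picks distinct items. The sketch below is not a solution to the scraping itself: `pick_random` is a hypothetical helper, and the titles are placeholders standing in for the scraped movie names.

```python
import random


def pick_random(items, k=10, seed=None):
    """Return k distinct items chosen at random; a seed makes the pick repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(items), k)


# Placeholder titles standing in for the 250 scraped movie names
titles = [f"Movie {i}" for i in range(1, 251)]
chosen = pick_random(titles, k=10, seed=42)
print(chosen)
```

Sampling first and then fetching only the ten chosen movie pages keeps the number of requests small.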

### 19. Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [37]:
#https://openweathermap.org/current
city = input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?q=' + city + '&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city:


In [38]:
# your code
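
A hedged sketch of two pieces this exercise needs: building the query string with `urlencode`, so that city names with spaces or accents are escaped, and picking the requested fields out of the JSON payload. `build_weather_url` and `summarise_weather` are hypothetical names, `YOUR_API_KEY` is a placeholder, and the sample payload is hand-written in the shape of a current-weather response rather than real data.

```python
import json
from urllib.parse import urlencode

API_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city, api_key, units="metric"):
    """Build the request URL; urlencode escapes spaces and accents in city names."""
    return API_URL + "?" + urlencode({"q": city, "APPID": api_key, "units": units})


def summarise_weather(payload):
    """Pick temperature, wind speed, weather and description out of the payload."""
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "weather": payload["weather"][0]["main"],
        "description": payload["weather"][0]["description"],
    }


# Hand-written sample in the shape of a current-weather response, not real data
sample = json.loads(
    '{"weather": [{"main": "Clouds", "description": "broken clouds"}],'
    ' "main": {"temp": 12.3}, "wind": {"speed": 4.1}}'
)
print(build_weather_url("San Juan", "YOUR_API_KEY"))
print(summarise_weather(sample))
```

With `requests`, the payload would come from `requests.get(url).json()` after checking the status code.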

### 20. Book name, price and stock availability as a pandas dataframe.

In [39]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [40]:
#your code