**You will find in this notebook some scrapy exercises to practise your scraping skills**.<br>**Remember:**
- **To get each request status code to ensure you get the proper response from the web***
- **To print the response text in each request to evaluate the what kind of info you are getting and its format.** 
- **To check for patterns in the response text to extract the data/info requested in each question.**
- **To visit each url and take a look on its code through Chrome developer tool.**


- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

**All the libraries and modules you will need to solve the questions are included in cell below. Feel free to explore other libraries at your convenience**

In [1]:
import requests
from pprint import pprint
from bs4 import BeautifulSoup
import scrapy
from lxml import html
from lxml.html import fromstring
import urllib.request
from urllib.request import urlopen
import random
import re
import pandas as pd

### 1.Download and display the content of robot.txt for Wikipedia

Check [here](http://www.robotstxt.org/robotstxt.html) to know what a ***robot.txt*** is

In [2]:
# This is the url you will scrape in this exercise
url = "https://en.wikipedia.org/robots.txt"

In [3]:
#Your code
response = requests.get(url)
test = response.text
print(response.status_code)


200


In [4]:
print("robots.txt for http://www.wikipedia.org/")
print("===================================================")
print(test)

robots.txt for http://www.wikipedia.org/
﻿# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.i

### 2. Display the name of the most recently added dataset on data.gov.

In [5]:
# This is the url you will scrape in this exercise
url ='http://catalog.data.gov/dataset?q=&sort=metadata_created+desc'

In [6]:
response = requests.get(url)
doc = html.fromstring(response.text)
title = doc.cssselect('h3.dataset-heading')[0].text_content()
print("The name of the most recently added dataset on data.gov is:")
print(title.strip())

The name of the most recently added dataset on data.gov is:
French Frigate Shoals Site P1A 11/1/2002 17-18M


### 3. Number of datasets currently listed on data.gov 

In [7]:
# This is the url you will scrape in this exercise
url = 'http://www.data.gov/'

In [8]:
response = requests.get('http://www.data.gov/')
doc = html.fromstring(response.text)
link = doc.cssselect('small a')[0]
print(link.text)

300,295 datasets


### 4. Display all the image links from Walt Disney wikipedia page

In [9]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [10]:
html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src':re.compile('.jpg')})
for image in images: 
    print(image['src']+'\n')

//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg

//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Disney_drawing_goofy.jpg

//upload.wikimedia.org/wikipedia/commons/thumb/1/13/DisneySchiphol1951.jpg/220px-DisneySchiphol1951.jpg

//upload.wikimedia.org/w

### 5. Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page

In [11]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [12]:
html = urlopen(url)
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

#mw-head
#p-search
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
#Snakes
#Ancient_Greece
#Media_and_entertainment
#Computing
#Engineering
#Roller_coasters
#Vehicles
#Weaponry
#See_also
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(mythology)
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_(film)
/wiki/Pythons_2
/wiki/Monty_Python
/wiki/Python_(Monty)_Pictures
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(programming_language)
/wiki/CPython
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=5
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(Efteling)
/w/index.php?title=Python&action=edit&section=7
/wiki/Py

### 6. Number of Titles that have changed in the United States Code since its last release point 

In [13]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [14]:
txt = requests.get(url).text
print(txt.count('class="usctitlechanged" id'))

18


### 7. A Python list with the top ten FBI's Most Wanted names 

In [15]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [16]:
response = requests.get(url)
test = response.text
print(response.status_code)
doc = html.fromstring(test)
h3title = doc.cssselect('h3.title')
names = [item.text_content().strip() for item in doc.cssselect('h3.title')]
print(names)

200


AttributeError: 'HTTPResponse' object has no attribute 'fromstring'

### 8.  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [17]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [18]:
response = requests.get(url)
print(response.status_code)

200


### 9. Display the date, days, title, city, country of next 25 Hackevents as a table

In [19]:
# This is the url you will scrape in this exercise
url ='https://hackevents.co/hackathons'

In [20]:
res = requests.get(url)
bs = BeautifulSoup(res.text, 'lxml')
hacks_data = bs.find_all('div',{'class':'hackathon '})
for i,f in enumerate(hacks_data,1):
    hacks_month = f.find('div',{'class':'date'}).find('div',{'class':'date-month'}).text.strip()
    hacks_date = f.find('div',{'class':'date'}).find('div',{'class':'date-day-number'}).text.strip()
    hacks_days = f.find('div',{'class':'date'}).find('div',{'class':'date-week-days'}).text.strip()
    hacks_final_date = "{} {}, {} ".format(hacks_date, hacks_month, hacks_days )
    hacks_name = f.find('div',{'class':'info'}).find('h2').text.strip()
    hacks_city = f.find('div',{'class':'info'}).find('p').find('span',{'class':'city'}).text.strip()
    hacks_country = f.find('div',{'class':'info'}).find('p').find('span',{'class':'country'}).text.strip()
    print("{:<5}{:<15}: {:<90}: {}, {}\n ".format(str(i)+')',hacks_final_date, hacks_name.title(), hacks_city, hacks_country))


1)   1 Nov, Thu-Fri : Rocket Apt Challenge                                                                      : Boston, United States
 
2)   2 Nov, Fri-Sun : Hack Access Dublin 2017 - Register Your Interest!                                         : Dublin, Ireland
 
3)   2 Nov, Fri-Sun : Disrupt Puerto Rico - Conference & Hackathon                                              : San Juan, Puerto Rico
 
4)   3 Nov, Sat-Sun : Jacobshack! 2018                                                                          : Bremen, Germany
 
5)   3 Nov, Sat     : Women'S Hackathon                                                                         : St. Louis, USA
 
6)   3 Nov, Sat-Sun : Jacobshack! 2018                                                                          : Bremen, Germany
 
7)   3 Nov, Sat-Sun : Jacobshack! 2018                                                                          : Bremen, Germany
 
8)   3 Nov, Sat-Sun : Hackthemidlands 3.0                        

### 10. Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [21]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [22]:
handle = input('Input your account name on Twitter: ')
temp = requests.get('https://twitter.com/'+handle)
bs = BeautifulSoup(temp.text,'lxml')

try:
    tweet_box = bs.find('li',{'class':'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets= tweet_box.find('a').find('span',{'class':'ProfileNav-value'})
    print("{} has {} number of tweets.".format(handle,tweets.get('data-count')))

except:
    print('Account name not found...')
  

Input your account name on Twitter: @BelenLinacero
@BelenLinacero has 3876 number of tweets.


### 11.Number of followers of a given twitter account

You will need to include a ***try/except block*** in case account/s name not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [23]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [24]:
handle = input('Input your account name on Twitter: ') 
temp = requests.get('https://twitter.com/'+handle)
bs = BeautifulSoup(temp.text,'lxml')
try:
    follow_box = bs.find('li',{'class':'ProfileNav-item ProfileNav-item--followers'})
    followers = follow_box.find('a').find('span',{'class':'ProfileNav-value'})
    print("Number of followers: {} ".format(followers.get('data-count')))
except:
    print('Account name not found...')

Input your account name on Twitter: @BelenLinacero
Number of followers: 15 


### 12. List all language names and number of related articles in the order they appear in wikipedia.org

In [25]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [26]:
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
nameList = bs.findAll('a', {'class' : 'link-box'})
for name in nameList:
    print(name.get_text())


English
5 734 000+ articles


Español
1 481 000+ artículos


日本語
1 124 000+ 記事


Deutsch
2 228 000+ Artikel


Русский
1 502 000+ статей


Français
2 047 000+ articles


Italiano
1 467 000+ voci


中文
1 026 000+ 條目


Português
1 007 000+ artigos


Polski
1 303 000+ haseł



### 13. A list with the different kind of datasets available in data.gov.uk 

In [27]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [28]:
html = urlopen(url)
soup = BeautifulSoup(html,"html.parser")
nameList= soup.findAll('h2')
for name in nameList:
    print(name.get_text())

Business and economy
Crime and justice
Defence
Education
Environment
Government
Government spending
Health
Mapping
Society
Towns and cities
Transport


### 14. The total number of publications produced by the GAO (U.S. Government Accountability Office)

In [29]:
# This is the url you will scrape in this exercise
url = 'http://www.gao.gov/browse/date/custom'

In [30]:
txt = requests.get(url).text
# Browsing Publications by Date  (1 - 10 of 53,004 items)  in Custom Date Range
mx = re.search('Browsing Publications by Date.+', txt).group()
m = re.search('[,\d]+(?= +items)', mx).group()
print(m)

54,912


### 15. Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [31]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

### BONUS QUESTIONS

### 16. Scrape a certain number of tweets of a given Twitter account.

In [33]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [35]:
handle = input('Input your account name on Twitter: ')
ctr = int(input('Input number of tweets to scrape: '))
res=requests.get('https://twitter.com/'+ handle)
bs=BeautifulSoup(res.content,'lxml')
all_tweets = bs.find_all('div',{'class':'tweet'})
if all_tweets:
    for tweet in all_tweets[:ctr]:
        context = tweet.find('div',{'class':'context'}).text.replace("\n"," ").strip()
        content = tweet.find('div',{'class':'content'})
        header = content.find('div',{'class':'stream-item-header'})
        user = header.find('a',{'class':'account-group js-account-group js-action-profile js-user-profile-link js-nav'}).text.replace("\n"," ").strip()
        time = header.find('a',{'class':'tweet-timestamp js-permalink js-nav js-tooltip'}).find('span').text.replace("\n"," ").strip()
        message = content.find('div',{'class':'js-tweet-text-container'}).text.replace("\n"," ").strip()
        footer = content.find('div',{'class':'stream-item-footer'})
        stat = footer.find('div',{'class':'ProfileTweet-actionCountList u-hiddenVisually'}).text.replace("\n"," ").strip()
        if context:
            print(context)
        print(user,time)
        print(message)
        print(stat)
        print()
else:
    print("List is empty/account name not found.")
  

Input your account name on Twitter: @BelenLinacero
Input number of tweets to scrape: 3
Belen Linacero‏ @BelenLinacero 3 jul.
https://aprendemosjuntos.elpais.com/especial/por-que-es-tan-importante-como-miras-a-tu-hijo-alex-rovira/ …https://aprendemosjuntos.elpais.com/especial/por-que-es-tan-importante-como-miras-a-tu-hijo-alex-rovira/ …
0 respuestas     0 retweets     0 Me gusta

Belen Linacero‏ @BelenLinacero 3 jul.
Como la vida misma...  El cambio es,  a veces,  imperceptible pero esta siempre ahí. Vivimos en permanente cambio seamos conscientes o no. https://www.facebook.com/100000221625055/posts/2272279312789434/ …
0 respuestas     0 retweets     0 Me gusta

Belen Linacero‏ @BelenLinacero 3 jul.
Hacia la mejor versión de uno mismo,  sin miedos ni dilaciones porque podemos conseguir lo que nos propongamos. Si hay gente que consigue lo que yo quiero en mi interior,  que me lo impide a mi?  Te lo has preguntado alguna vez? Interesante y útil...https://www.youtube.com/attribution_link?a

### 17. IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [36]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [37]:
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

imdb = []

# Store each item into dictionary (data), then put those into a list (imdb)
for index in range(0, len(movies)):
    # Seperate movie into: 'place', 'title', 'year'
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    movie_title = movie[len(str(index))+1:-7]
    year = re.search('\((.*?)\)', movie_string).group(1)
    place = movie[:len(str(index))-(len(movie))]
    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    imdb.append(data)

top_250_data = [print(item['place'], '-', item['movie_title'], '('+item['year']+') -', 'Starring:',\
  item['star_cast']) for item in imdb]
print(top_250_data)
  

1 - Cadena perpetua (1994) - Starring: Frank Darabont (dir.), Tim Robbins, Morgan Freeman
2 - El padrino (1972) - Starring: Francis Ford Coppola (dir.), Marlon Brando, Al Pacino
3 - El padrino: Parte II (1974) - Starring: Francis Ford Coppola (dir.), Al Pacino, Robert De Niro
4 - El caballero oscuro (2008) - Starring: Christopher Nolan (dir.), Christian Bale, Heath Ledger
5 - 12 hombres sin piedad (1957) - Starring: Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb
6 - La lista de Schindler (1993) - Starring: Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes
7 - El señor de los anillos: El retorno del rey (2003) - Starring: Peter Jackson (dir.), Elijah Wood, Viggo Mortensen
8 - Pulp Fiction (1994) - Starring: Quentin Tarantino (dir.), John Travolta, Uma Thurman
9 - El bueno, el feo y el malo (1966) - Starring: Sergio Leone (dir.), Clint Eastwood, Eli Wallach
1 -  El club de la lucha (1999) - Starring: David Fincher (dir.), Brad Pitt, Edward Norton
11 - El señor de los anillos: La comunid

157 - No es país para viejos (2007) - Starring: Ethan Coen (dir.), Tommy Lee Jones, Javier Bardem
158 - Pozos de ambición (2007) - Starring: Paul Thomas Anderson (dir.), Daniel Day-Lewis, Paul Dano
159 - El sexto sentido (1999) - Starring: M. Night Shyamalan (dir.), Bruce Willis, Haley Joel Osment
160 - Lo que el viento se llevó (1939) - Starring: Victor Fleming (dir.), Clark Gable, Vivien Leigh
161 - La cosa (El enigma de otro mundo) (El enigma de otro mundo) - Starring: John Carpenter (dir.), Kurt Russell, Wilford Brimley
162 - Fargo (1996) - Starring: Joel Coen (dir.), William H. Macy, Frances McDormand
163 - Gran Torino (2008) - Starring: Clint Eastwood (dir.), Clint Eastwood, Bee Vang
164 - El cazador (1978) - Starring: Michael Cimino (dir.), Robert De Niro, Christopher Walken
165 - Buscando a Nemo (2003) - Starring: Andrew Stanton (dir.), Albert Brooks, Ellen DeGeneres
166 - Masacre (ven y mira) (ven y mira) - Starring: Elem Klimov (dir.), Aleksey Kravchenko, Olga Mironova
167 - 

### 18. Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [38]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [39]:
def get_imd_movies(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    movies = soup.find_all("td", class_="titleColumn")
    random.shuffle(movies)
    return movies
def get_imd_summary(url):
    movie_page = requests.get(url)
    soup = BeautifulSoup(movie_page.text, 'html.parser')
    return soup.find("div", class_="summary_text").contents[0].strip()

def get_imd_movie_info(movie):
    movie_title = movie.a.contents[0]
    movie_year = movie.span.contents[0]
    movie_url = 'http://www.imdb.com' + movie.a['href']
    return movie_title, movie_year, movie_url

def imd_movie_picker():
    ctr=0
    print("--------------------------------------------")
    for movie in get_imd_movies('http://www.imdb.com/chart/top'):
        movie_title, movie_year, movie_url = get_imd_movie_info(movie)
        movie_summary = get_imd_summary(movie_url)
        print(movie_title, movie_year)
        print(movie_summary)
        print("--------------------------------------------")
        ctr=ctr+1
        if (ctr==10):
          break;   
if __name__ == '__main__':
    imd_movie_picker()

--------------------------------------------
Cinema Paradiso (1988)
A filmmaker recalls his childhood when falling in love with the pictures at the cinema of his home village and forms a deep friendship with the cinema's projectionist.
--------------------------------------------
Terminator 2: El juicio final (1991)
A cyborg, identical to the one who failed to kill Sarah Connor, must now protect her teenage son, John Connor, from a more advanced and powerful cyborg.
--------------------------------------------
12 años de esclavitud (2013)
In the antebellum United States,
--------------------------------------------
Hasta que llegó su hora (1968)
A mysterious stranger with a harmonica joins forces with a notorious desperado to protect a beautiful widow from a ruthless assassin working for the railroad.
--------------------------------------------
En busca del arca perdida (1981)
In 1936, archaeologist and adventurer Indiana Jones is hired by the U.S. government to find the Ark of the Co

### 19. Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [40]:
#https://openweathermap.org/current
city = city=input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city:Madrid


In [41]:
#https://github.com/stanfordjournalism/search-script-scrape
def weather_data(query):
    res=requests.get('http://api.openweathermap.org/data/2.5/weather?'+query+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric');
    return res.json();
def print_weather(result,city):
    print("{}'s temperature: {}°C ".format(city,result['main']['temp']))
    print("Wind speed: {} m/s".format(result['wind']['speed']))
    print("Description: {}".format(result['weather'][0]['description']))
    print("Weather: {}".format(result['weather'][0]['main']))
def main():
    city=input('Enter the city:')
    print()
    try:
        query='q='+city;
        w_data=weather_data(query);
        print_weather(w_data, city)
        print()
    except:
        print('City name not found...')
if __name__=='__main__':
    main()
  

Enter the city:Madrid

Madrid's temperature: 8°C 
Wind speed: 1.5 m/s
Description: moderate rain
Weather: Rain



### 20. Book name,price and stock availability as a pandas dataframe.

In [42]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'