![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Web Scraping

## Introduction

As you have learned in the lesson, web "scraping" (also called "web harvesting", "web data extraction" or even "web data mining"), can be defined as "the construction of an agent to download, parse, and organize data from the web in an automated manner". Or, in other words: instead of a human end-user clicking away in their web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads this task to a computer program which can execute it much faster, and more correctly, than a human can. 

Data scientists have often found web scraping to be a powerful tool to have in their arsenal, as many data science projects starts with the first step of obtaining an appropiate data set, so why not utilize the information the web provides?

In this lab, you will practice a series of exercises to test your web scraping skills. You will work on your own but remember the teaching staff is at your service whenever you encounter problems.

## Getting Started

Each exercise is independent from the previous one. If you get stuck in one exercise you can skip to the next one. Read each instruction carefully and provide your answer beneath it. 

## Resources

[Web Scraping Tutorial Dataquest](https://www.dataquest.io/blog/web-scraping-tutorial-python/)

[Web Scraping Tutorial Kdnuggets](https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html)

[HTML Scraping](https://docs.python-guide.org/scenarios/scrape/)

[The Anatomy of a Search Engine](http://infolab.stanford.edu/~backrub/google.html)

# Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended contennt.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

### Make sure you have all libraries installed before start the lab!

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
# starting by calling requests.get over the 'url'
get_html = requests.get(url)

In [4]:
# exploring the request.get() methods
get_html.status_code

200

In [5]:
get_html.encoding

'utf-8'

In [6]:
get_html.headers['content-type']

'text/html; charset=utf-8'

In [7]:
# calling the content method to return the page's content
html = get_html.content

In [8]:
# now parse the html object into BeautifulSoup
soup = BeautifulSoup(html, "lxml")

In [9]:
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-next-699624e4062c462162146d384a2e859d.css" integrity="sha512-aZYk5AYsRiFiFG04Si6FnQoHFwAugnodzKJXgafKqPWsrgrjoWRsapCn//vFuWqjSzr72ucZfPq8/ZbduuSeQg==" media="all" rel="stylesheet"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-next-7dbb92872a8f95d6138b81f09e806b04.css" integrity="sha512-fbuShyqPldYTi4HwnoBr

* The following exercises requires the same taks of importing content and parse into BeautifulSoup. Let's write a function to help us save some time.

In [10]:
def url_bs4(url):
    # get the data using get() from requests
    get_html = requests.get(url)
    # print the status code
    print(get_html.status_code)
    # print the encoding type
    print(get_html.encoding)
    # returns the content
    html = get_html.content
    # parse content to BeautifulSoup
    soup = BeautifulSoup(html)
    return(soup)

In [11]:
# use the function url_bs4 to return the parsed data from the url1
soup = url_bs4(url)

200
utf-8


#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (Á•ûÊ•ΩÂùÇË¶ö„ÄÖ)',
 'script-8']
 ```

In [12]:
# Now let's select the developers's profile
# Iterate through each element from soup and colect data from 
dev_list = [i.text.strip().replace(' ','').replace('\n\n', ' ') for i in soup.find_all('h1', {'class': 'h3 lh-condensed'})]
dev_list

['KentC.Dodds',
 'SethVargo',
 'VadimDemedes',
 'PaulBeusterien',
 'DanImhoff',
 'CalebPorzio',
 'TannerLinsley',
 'InesMontani',
 'Mr.doob',
 'JacobHoffman-Andrews',
 'TianonGravi',
 'TaylorOtwell',
 'MatthewJohnson',
 'MathiasBuus',
 'TimHolman',
 'AlonZakai',
 'HadleyWickham',
 'Bo-YiWu',
 'TobiasKoppers',
 'KentaroWada',
 'TeppeiFukuda',
 'MartinAtkins',
 'RyanMcKinley',
 'KlausPost',
 'JamesAgnew']

#### Display the trending Python repositories in GitHub

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [13]:
# This is the url you will scrape in this exercise
url2 = 'https://github.com/trending/python?since=daily'

In [14]:
# use the function url_bs4 to return the parsed data from the url2
soup2 = url_bs4(url2)

200
utf-8


In [15]:
# iterate through each element from soup2
tts = [i.text.strip().replace(' ','').replace('\n\n', ' ')
       for i in soup2.find_all('h1', {'class': 'h3 lh-condensed'})]
tts

['podgorskiy/  ALAE',
 'vt-vl-lab/  3d-photo-inpainting',
 'google/  jax',
 'RasaHQ/  rasa',
 'trailofbits/  algo',
 'horovod/  horovod',
 'rusty1s/  pytorch_geometric',
 'schenkd/  nginx-ui',
 'Kr1s77/  awesome-python-login-model',
 'The-Art-of-Hacking/  h4cker',
 'OpenMined/  PySyft',
 'formatc1702/  WireViz',
 'open-mmlab/  OpenLidarPerceptron',
 'pytorch/  fairseq',
 'TheAlgorithms/  Python',
 'Azure/  azure-cli',
 'deepinsight/  insightface',
 'gvanrossum/  patma',
 'kubeflow/  pipelines',
 'threat9/  routersploit',
 'aws/  aws-cli',
 'chubin/  wttr.in',
 'lucidrains/  stylegan2-pytorch',
 'arielgs/  LinkedBot',
 'Dod-o/  Statistical-Learning-Method_Code']

#### Display all the image links from Walt Disney wikipedia page

In [16]:
# This is the url you will scrape in this exercise
url3 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [17]:
# use the function url_bs4 to return the parsed data from the url3
soup3 = url_bs4(url3)

200
UTF-8


In [18]:
links = [i.get('src').strip('//') for i in soup3.find_all('img')]
links

['upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 'upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 'upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 'upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_a

#### Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page

In [19]:
# This is the url you will scrape in this exercise
url4 ='https://en.wikipedia.org/wiki/Python' 

In [20]:
# use the function url_bs4 to return the parsed data from the url2
soup4 = url_bs4(url4)

200
UTF-8


In [21]:
for link in soup4.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

#mw-head
#p-search
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
#Snakes
#Ancient_Greece
#Media_and_entertainment
#Computing
#Engineering
#Roller_coasters
#Vehicles
#Weaponry
#People
#Other_uses
#See_also
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(mythology)
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_(film)
/wiki/Pythons_2
/wiki/Monty_Python
/wiki/Python_(Monty)_Pictures
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(programming_language)
/wiki/CPython
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=5
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(Efteling)
/w/index.php?title=Python&action=edi

In [22]:
links4 = [element.find('a').get('href') for element in soup4.find_all('li') if element.find('a') is not None if element.find('a').get('href').startswith('/wiki/') and 'ython' in element.find('a').get('href')]
links4

['/wiki/Pythonidae',
 '/wiki/Python_(genus)',
 '/wiki/Python_(mythology)',
 '/wiki/Python_of_Aenus',
 '/wiki/Python_(painter)',
 '/wiki/Python_of_Byzantium',
 '/wiki/Python_of_Catana',
 '/wiki/Python_(film)',
 '/wiki/Pythons_2',
 '/wiki/Monty_Python',
 '/wiki/Python_(Monty)_Pictures',
 '/wiki/Python_(programming_language)',
 '/wiki/CPython',
 '/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 '/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 '/wiki/Python_(Efteling)',
 '/wiki/Python_(automobile_maker)',
 '/wiki/Python_(Ford_prototype)',
 '/wiki/Colt_Python',
 '/wiki/Python_(missile)',
 '/wiki/Python_(nuclear_primary)',
 '/wiki/Python_Anghelo',
 '/wiki/Cython',
 '/wiki/Python',
 '/wiki/Talk:Python',
 '/wiki/Python',
 '/wiki/Special:WhatLinksHere/Python',
 '/wiki/Special:RecentChangesLinked/Python']

#### Number of Titles that have changed in the United States Code since its last release point 

In [23]:
# This is the url you will scrape in this exercise
url5 = 'http://uscode.house.gov/download/download.shtml'

In [24]:
soup5 = url_bs4(url5)

200
UTF-8


In [25]:
[element.text.strip('\n').strip().strip('\n') for element in soup5.find_all('div', {'class': 'usctitlechanged'})]

['Title 5 - Government Organization and Employees Ÿ≠',
 'Title 8 - Aliens and Nationality',
 'Title 10 - Armed Forces Ÿ≠',
 'Title 16 - Conservation',
 'Title 19 - Customs Duties',
 'Title 36 - Patriotic and National Observances, Ceremonies, and Organizations Ÿ≠',
 'Title 50 - War and National Defense']

In [26]:
txt = requests.get(url5).text
count = txt.count('class="usctitlechanged"')
print(f'Number of titles changed: {count}')

Number of titles changed: 7


#### A Python list with the top ten FBI's Most Wanted names 

In [27]:
# This is the url you will scrape in this exercise
url6 = 'https://www.fbi.gov/wanted/topten'

In [28]:
soup6 = url_bs4(url6)

200
utf-8


In [29]:
[element.text.strip('\n') for element in soup6.find_all('h3', {'class': 'title'})]

['YASER ABDEL SAID',
 'ALEXIS FLORES',
 'EUGENE PALMER',
 'SANTIAGO VILLALBA MEDEROS',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN']

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [30]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [31]:
#using requests to get the contents 
htmlearth = requests.get(url).content
soupearth = BeautifulSoup(htmlearth, "html.parser")

In [32]:
#here we will get both lat and long information, clean it up and add it to a new list
latlong = [element.text for element in soupearth.findAll("td",{"class" : "tabev1"})]
latlong1=[]
for lat in latlong:
    latlong1.append(re.sub('[^0-9,.]', '', lat))

In [33]:
#in this cell we will separate the lat and long info according to their postition
#on the original list
lat = []
long = []
for num, l in enumerate(latlong1):
    if num%2==0:
        lat.append(l)
    else:
        long.append(l)

In [34]:
#let's do the same logic to get the directions of the lat and long
direction = soupearth.findAll("td",{"class" : "tabev2"})
direction1=[]
for dire in direction:
    d=re.sub('[^N, E , S, W]', '', str(dire))
    if d != " ":
        direction1.append(d)

In [35]:
latdir = []
longdir = []
for num, l in enumerate(direction1):
    if num%2==0:
        latdir.append(l)
    else:
        longdir.append(l)

In [36]:
#here we will get the date information as well as clean it and put it in datetime format
import datetime
date = [element.text for element in soupearth.findAll("td",{"class" : "tabev6"})]
date1=[]
for num, d in enumerate(date):
    d1 = re.sub('[^0-9,-,:]', '', str(d))
    d2 = pd.to_datetime(d1, format="%Y%m%d%H:%M:%S%f")
    date1.append(d2)

In [37]:
#here we will get the region info 
region = [element.text for element in soupearth.findAll("td",{"class" : "tb_region"})]
region1 = []
for r in region:
    region1.append(re.sub('[^A-Z]', ' ', str(r)))

In [38]:
#let's put all the info we got in list format and put it in a dataframe 
#for easier visualization
data=[]
for i in range(20):
    data.append([region1[i],date1[i],lat[i],latdir[i],long[i],longdir[i]])

In [39]:
earthdf = pd.DataFrame(data, columns = ['Region', 'Date and Time', "Latitude", "Direction Latitude", "Longitude", "Direction Longitude"]) 

In [40]:
earthdf.head()


Unnamed: 0,Region,Date and Time,Latitude,Direction Latitude,Longitude,Direction Longitude
0,OFFSHORE VALPARAISO CHILE,2020-06-25 13:05:31.022,32.21,S,71.59,W
1,OFFSHORE VALPARAISO CHILE,2020-06-25 12:58:29.029,32.24,S,71.69,W
2,JAVA INDONESIA,2020-06-25 12:53:07.034,6.22,S,108.32,E
3,PUERTO RICO REGION,2020-06-25 12:44:55.443,17.96,N,66.91,W
4,OFFSHORE OAXACA MEXICO,2020-06-25 12:43:01.044,15.62,N,96.16,W


#### Count the number of tweets by a given Twitter account.
#### Number of followers of a given twitter account
Ask the user for the handle (@handle) of a twitter account. You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [None]:
url = "https://twitter.com/"

In [None]:
#making sure the prfile exists
username = input("Type desired profile ")
if requests.get(url + username).status_code != 200:
    print("We couldn't find this twitter handle try a different one")

In [None]:
#requesting the twitter page content and parsing it with beautiful soup
htmltweet = requests.get(url + username).content
souptweet= BeautifulSoup(htmltweet, 'lxml')


In [None]:
#getting the number of tweets of the account
n_tweets = souptweet.find('span', {'class': 'ProfileNav-value'}).text

In [None]:
#getting the number of tweets of the account
followers = souptweet.find('li', {'class': 'ProfileNav-item--followers'}).a.find('span', {'class': 'ProfileNav-value'}).text

In [None]:
#this is just for fun, we can get all the info about the accounts tweets
tweet_df=pd.DataFrame()
tweets = souptweet.find_all('div', {'class':'tweet'})
for tweet in tweets:
    name = tweet.find('span', {'class': 'FullNameGroup'}).find('strong').text
    username = tweet.find('span', {'class': 'username'}).text
    time = tweet.find('small', {'class': 'time'}).a.text
    content = tweet.find('p', {'class': 'TweetTextSize TweetTextSize--normal js-tweet-text tweet-text'}).text
    statistics = tweet.find('div', {'class': 'ProfileTweet-actionCountList u-hiddenVisually'})
    answer=[i.text for i in statistics.find_all("span",{"class":"ProfileTweet-actionCountForAria"})][0]
    retweets=[i.text for i in statistics.find_all("span",{"class":"ProfileTweet-actionCountForAria"})][1]
    likes=[i.text for i in statistics.find_all("span",{"class":"ProfileTweet-actionCountForAria"})][2]
    new_df= pd.DataFrame({"Name":[name],"Username":[username],"Time":[time],"Content":[content],"Answers":[answer],"Retweets":[retweets],"Likes":[likes]})
    tweet_df=pd.concat([tweet_df,new_df],axis=0)
tweet_df.reset_index(inplace=True)
tweet_df.drop("index",axis=1,inplace=True)

In [None]:
tweet_df.head()

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [41]:
#Same logic here but we can see that as long as I still have a navigable string (instead of a list)
#i can aplly the find and find_all method multiple times
url_wiki = 'https://www.wikipedia.org/'
htmlwiki = requests.get(url_wiki).content
soupwiki= BeautifulSoup(htmlwiki, 'lxml')

In [42]:
lang=[l.strong.text for l in soupwiki.find("div",{"class":"central-featured"}).find_all("div")]
count=[c.small.bdi.text for c in soupwiki.find("div",{"class":"central-featured"}).find_all("div")]

In [43]:
wiki_df=pd.DataFrame({"Language":lang,"Number of articles":count})

In [44]:
wiki_df

Unnamed: 0,Language,Number of articles
0,English,6¬†105¬†000+
1,Êó•Êú¨Ë™û,1¬†213¬†000+
2,Espa√±ol,1¬†606¬†000+
3,Deutsch,2¬†446¬†000+
4,–†—É—Å—Å–∫–∏–π,1¬†637¬†000+
5,Fran√ßais,2¬†229¬†000+
6,Italiano,1¬†615¬†000+
7,‰∏≠Êñá,1¬†125¬†000+
8,Portugu√™s,1¬†036¬†000+
9,Polski,1¬†416¬†000+


#### A list with the different kind of datasets available in data.gov.uk.

In [45]:
url_uk = 'https://data.gov.uk/'
html_uk = requests.get(url_uk).content
soupuk= BeautifulSoup(html_uk, 'lxml')

In [46]:
[i.text for i in soupuk.find("div",{"class":"grid-row dgu-topics"}).find_all("h2")]

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport']

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [47]:
#we could use request and  get all the info cell by cell but there is a easier way to get 
#structered data from wikipedia using only pandas
url_wiki_lang = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [48]:
wiki_lang_df = pd.read_html(url_wiki_lang)[0]

In [49]:
wiki_lang_df.head()

Unnamed: 0,Rank,Language,Speakers(millions),% of World pop.(March 2019)[8],Language familyBranch
0,1,Mandarin Chinese,918.0,11.922,Sino-TibetanSinitic
1,2,Spanish,480.0,5.994,Indo-EuropeanRomance
2,3,English,379.0,4.922,Indo-EuropeanGermanic
3,4,Hindi (Sanskritised Hindustani)[9],341.0,4.429,Indo-EuropeanIndo-Aryan
4,5,Bengali,228.0,2.961,Indo-EuropeanIndo-Aryan


## Bonus


#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe.¬∂


In [50]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [51]:
html = requests.get(url).content;
soup = BeautifulSoup(html, "lxml");

movies = soup.find_all('td', {'class':'titleColumn'})
titles = [movie.find('a').text for movie in movies]
years = [movie.find('span').text[1:-1] for movie in movies]
directors = [movie.find('a').get('title').split(',')[0][:-7] for movie in movies]
actors = [' & '.join(movie.find('a').get('title').split(',')[1:]) for movie in movies]

movies_dict = {'Title': titles, 'Release': years, 'Director': directors, 'Actors': actors}

movies_df = pd.DataFrame(movies_dict)
movies_df

Unnamed: 0,Title,Release,Director,Actors
0,Um Sonho de Liberdade,1994,Frank Darabont,Tim Robbins & Morgan Freeman
1,O Poderoso Chef√£o,1972,Francis Ford Coppola,Marlon Brando & Al Pacino
2,O Poderoso Chef√£o II,1974,Francis Ford Coppola,Al Pacino & Robert De Niro
3,Batman: O Cavaleiro das Trevas,2008,Christopher Nolan,Christian Bale & Heath Ledger
4,12 Homens e uma Senten√ßa,1957,Sidney Lumet,Henry Fonda & Lee J. Cobb
...,...,...,...,...
245,Munna Bhai M.B.B.S.,2003,Rajkumar Hirani,Sanjay Dutt & Arshad Warsi
246,Trono Manchado de Sangue,1957,Akira Kurosawa,Toshir√¥ Mifune & Minoru Chiaki
247,Neon Genesis Evangelion: O Fim do Evangelho,1997,Hideaki Anno,Megumi Ogata & Megumi Hayashibara
248,Aladdin,1992,Ron Clements,Scott Weinger & Robin Williams


#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [52]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [53]:
from random import shuffle;

n_random = 10;

html = requests.get(url).content;
soup = BeautifulSoup(html, "lxml");
movies = soup.find_all('td', {'class':'titleColumn'})

shuffle(movies)

titles = [movie.find('a').text for movie in movies[0:n_random]]
years = [movie.find('span').text[1:-1] for movie in movies[0:n_random]]
links_to_movies = [movie.find('a').get('href') for movie in movies[0:n_random]]

summary = []
for link in links_to_movies:
    html = requests.get('https://www.imdb.com' + link).content;
    soup = BeautifulSoup(html, "lxml");
    summary.append(soup.find('div', {'class':'summary_text'}).text.strip());

movies_dict = {'Title': titles, 'Release': years, 'Summary': summary}

movies_df = pd.DataFrame(movies_dict)
movies_df

Unnamed: 0,Title,Release,Summary
0,Procurando Nemo,2003,After his son is captured in the Great Barrier...
1,A General,1926,When Union spies steal an engineer's beloved l...
2,Os Ca√ßadores da Arca Perdida,1981,"In 1936, archaeologist and adventurer Indiana ..."
3,"Star Wars, Epis√≥dio VI: O Retorno do Jedi",1983,After a daring mission to rescue Han Solo from...
4,Assassinato √†s Cegas,2018,A series of mysterious events change the life ...
5,Ford vs Ferrari,2019,American car designer Carroll Shelby and drive...
6,Rashomon,1950,The rape of a bride and the murder of her samu...
7,O Grande Ditador,1940,Dictator Adenoid Hynkel tries to expand his em...
8,O Exterminador do Futuro,1984,"In 1984, a human soldier is tasked to stop an ..."
9,O S√©timo Selo,1957,"A man seeks answers about life, death, and the..."


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [54]:
#We will repeat the steps of requests.get on this url but a part of the url will be the input by the user
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
weather_response = requests.get(url)

Enter the city: Barcelona


In [55]:
#this site has information in json format insted of html, 
#we will need to navigate it as a dictionary, notice that the json is made of 
#dictionaries inside dictionaries
temp = weather_response.json()["main"]["temp"]
ws=weather_response.json()["wind"]["speed"]
wm=weather_response.json()["weather"][0]["main"]
description=weather_response.json()["weather"][0]["description"]

In [56]:
#Let's put it in dataframe format for better visualization
weather_df = pd.DataFrame({"Temperature":[temp],"Weather":[wm], "Description":[description],"Wind Speed":[ws]},[city],)
weather_df

Unnamed: 0,Temperature,Weather,Description,Wind Speed
Barcelona,27.65,Clouds,few clouds,3.1


#### Find the book name, price and stock availability as a pandas dataframe.

In [57]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [58]:
#First let's get the response for the first webpage and explore the tags on its contents
book_response = requests.get('http://books.toscrape.com/')
book_soup = BeautifulSoup(book_response.content)
#this is the tag that contains all the information we need it will be easier to see the 
#specifc tags needed if we look at this piece of html
[item for item in book_soup.find_all("article",{"class":"product_pod"})]

[<article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">¬£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>, <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thu

In [59]:
#Now that we found the tags with the information we want let's create a dataframe with that information
books_df = pd.DataFrame()
books_df["Title"]= [item.h3.a["title"] for item in book_soup.find_all("article",{"class":"product_pod"})]
books_df["Price"] = [item.p.text for item in book_soup.find_all("div",{"class":"product_price"})]
books_df["Stock"] = [available.text.replace("\n","") for item in book_soup.find_all("div",{"class":"product_price"}) for available in item.find_all("p",{"class":"instock availability"})]

In [60]:
books_df

Unnamed: 0,Title,Price,Stock
0,A Light in the Attic,¬£51.77,In stock
1,Tipping the Velvet,¬£53.74,In stock
2,Soumission,¬£50.10,In stock
3,Sharp Objects,¬£47.82,In stock
4,Sapiens: A Brief History of Humankind,¬£54.23,In stock
5,The Requiem Red,¬£22.65,In stock
6,The Dirty Little Secrets of Getting Your Dream...,¬£33.34,In stock
7,The Coming Woman: A Novel Based on the Life of...,¬£17.93,In stock
8,The Boys in the Boat: Nine Americans and Their...,¬£22.60,In stock
9,The Black Maria,¬£52.15,In stock


In [61]:
#The first page is ready, in this loop we will create a dataframe for every page in the website
#and concat those dataframes with the one we did for the first page  
for i in range(2,51):
    book_response = requests.get('http://books.toscrape.com/')
    book_soup = BeautifulSoup(book_response.content)
    url = "http://books.toscrape.com/catalogue/page-"+str(i)+".html"
    title= [item.h3.a["title"] for item in book_soup.find_all("article",{"class":"product_pod"})]
    price = [item.p.text for item in book_soup.find_all("div",{"class":"product_price"})]
    stock = [available.text.replace("\n","") for item in book_soup.find_all("div",{"class":"product_price"}) for available in item.find_all("p",{"class":"instock availability"})]
    new_df= pd.DataFrame({"Title":title,"Price":price,"Stock":stock})
    books_df= pd.concat([books_df,new_df],axis=0)
#our dataframe got a little messy from the concat let's clean up our index 
books_df.reset_index(inplace=True)
books_df.drop("index",axis=1,inplace=True)

In [62]:
books_df

Unnamed: 0,Title,Price,Stock
0,A Light in the Attic,¬£51.77,In stock
1,Tipping the Velvet,¬£53.74,In stock
2,Soumission,¬£50.10,In stock
3,Sharp Objects,¬£47.82,In stock
4,Sapiens: A Brief History of Humankind,¬£54.23,In stock
...,...,...,...
995,Our Band Could Be Your Life: Scenes from the A...,¬£57.25,In stock
996,Olio,¬£23.88,In stock
997,Mesaerion: The Best Science Fiction Stories 18...,¬£37.59,In stock
998,Libertarianism for Beginners,¬£51.33,In stock
