# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each URL and inspect its source with Chrome DevTools. You'll need to identify the HTML tags, class names, etc. used for the content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)
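
The tip about checking the status code can be sketched with the standard library alone. The helper below is a hypothetical name, not part of `requests`; it only turns a numeric code into a readable label so that unexpected responses stand out.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '200 OK (success)' for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in http.HTTPStatus
        phrase = "Unknown"
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"

print(describe_status(200))
print(describe_status(404))
```

In the exercises below this could be called as `describe_status(r.status_code)`. `requests` also offers `r.raise_for_status()`, which raises an exception for 4xx and 5xx responses.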

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
import random
import re
# import scrapy
import json

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
#your code

# reading the web page and print the status code
r = requests.get(url)
r.status_code

200

In [4]:
# print the response text
print(r.text[:500])





<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazona


In [5]:
# parsing the HTML using beautifulsoup
soup = bs(r.text, 'html.parser')
# soup

In [6]:
type(soup)

bs4.BeautifulSoup

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like this:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
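
The string-cleaning step can be sketched as follows. This is one possible approach, not the required one: `clean_text` is a hypothetical helper and the raw strings are made-up samples of what an element's text may look like.

```python
def clean_text(raw):
    """Collapse runs of whitespace (spaces, tabs, line breaks) into single spaces."""
    # str.split() with no argument splits on any whitespace and drops empty pieces
    return " ".join(raw.split())

# Made-up samples imitating the text of scraped elements
raw_names = ["\n      Rob Dodson\n", "  Lukas\n   Taegert-Atkinson  "]
clean_names = [clean_text(raw) for raw in raw_names]
print(clean_names)
```

Unlike chaining `.replace('\n', '')` and `.replace(' ', '')`, this keeps the single space between a first name and a surname.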

In [7]:
# name = soup.find('div', class_='col-md-6').text
# name

In [8]:
#your code

# find a single name
name = soup.find('h1', class_='h3').text.replace('\n','').strip()
name

'Rob Dodson'

In [9]:
# find a single nickname
nickname = soup.find('p', class_='f4 text-normal mb-1').text.replace('\n','').replace(' ','')
nickname

'robdodson'

In [10]:
# find all the names & nicknames

# for name in soup.find_all('h1', class_='h3'):
#     print(name.text.replace('\n','').strip())

# list comprehension NAMES
names = [name.text.replace('\n','').strip() for name in soup.find_all('h1', class_='h3')]
names

['Rob Dodson',
 'MichaIng',
 'Gleb Bahmutov',
 'Lukas Taegert-Atkinson',
 'Till Krüss',
 'Jesse Duffield',
 'ᴜɴᴋɴᴡᴏɴ',
 'Arve Knudsen',
 'Niklas von Hertzen',
 'Stephen Celis',
 'Damian Dulisz',
 'Yufan You',
 'Christian Clauss',
 'Jirka Borovec',
 'Timothy Edmund Crosley',
 'James Newton-King',
 'Michael Shilman',
 'Mike Penz',
 'Alex Hall',
 'Diego Sampaio',
 'Dries Vints',
 'JK Jung',
 'Steven',
 'Daniel Martí',
 'Łukasz Magiera']

In [11]:
# list comprehension NICKNAMES
nicknames = [nickname.text.replace('\n','').replace(' ','') for nickname in soup.find_all('p', class_='f4 text-normal mb-1')]
nicknames

['robdodson',
 'bahmutov',
 'lukastaegert',
 'tillkruss',
 'jesseduffield',
 'unknwon',
 'aknuds1',
 'niklasvh',
 'stephencelis',
 'shentao',
 'ouuan',
 'cclauss',
 'Borda',
 'timothycrosley',
 'JamesNK',
 'shilman',
 'mikepenz',
 'alexmojaki',
 'sampaiodiego',
 'driesvints',
 'jkjung-avt',
 'styfle',
 'mvdan',
 'magik6k']

In [12]:
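# final list of names
# NOTE: zip() pairs items by position. `names` has 25 entries but `nicknames`
# only 24 (MichaIng has no nickname element), so every pair after the first is
# shifted by one and the last name is dropped.
list_names = []
for name, nickname in zip(names, nicknames):
    list_names.append(f'{name} ({nickname})')
print(list_names)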

['Rob Dodson (robdodson)', 'MichaIng (bahmutov)', 'Gleb Bahmutov (lukastaegert)', 'Lukas Taegert-Atkinson (tillkruss)', 'Till Krüss (jesseduffield)', 'Jesse Duffield (unknwon)', 'ᴜɴᴋɴᴡᴏɴ (aknuds1)', 'Arve Knudsen (niklasvh)', 'Niklas von Hertzen (stephencelis)', 'Stephen Celis (shentao)', 'Damian Dulisz (ouuan)', 'Yufan You (cclauss)', 'Christian Clauss (Borda)', 'Jirka Borovec (timothycrosley)', 'Timothy Edmund Crosley (JamesNK)', 'James Newton-King (shilman)', 'Michael Shilman (mikepenz)', 'Mike Penz (alexmojaki)', 'Alex Hall (sampaiodiego)', 'Diego Sampaio (driesvints)', 'Dries Vints (jkjung-avt)', 'JK Jung (styfle)', 'Steven (mvdan)', 'Daniel Martí (magik6k)']
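
Pairing two separately collected lists only works when every developer has both fields. A minimal sketch of an alternative is to look up the name and the nickname inside each developer's own container, so a missing nickname cannot shift the others. The sample markup is hypothetical and heavily simplified; in particular, the assumption that each developer sits in its own `<article>` element should be checked in DevTools.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one <article> per developer (an assumption).
SAMPLE_HTML = """
<article>
  <h1 class="h3 lh-condensed"><a href="/robdodson"> Rob Dodson </a></h1>
  <p class="f4 text-normal mb-1"><a href="/robdodson"> robdodson </a></p>
</article>
<article>
  <h1 class="h3 lh-condensed"><a href="/MichaIng"> MichaIng </a></h1>
</article>
<article>
  <h1 class="h3 lh-condensed"><a href="/bahmutov"> Gleb Bahmutov </a></h1>
  <p class="f4 text-normal mb-1"><a href="/bahmutov"> bahmutov </a></p>
</article>
"""

def developer_labels(html):
    """Build 'Name (nickname)' labels, looking up both fields per container."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    for card in soup.find_all("article"):
        name_tag = card.find("h1", class_="h3")
        if name_tag is None:
            continue  # not a developer card
        name = name_tag.get_text(strip=True)
        nick_tag = card.find("p", class_="f4 text-normal mb-1")
        if nick_tag is None:
            labels.append(name)  # developer without a nickname
        else:
            labels.append(f"{name} ({nick_tag.get_text(strip=True)})")
    return labels

labels = developer_labels(SAMPLE_HTML)
print(labels)
```

With this structure a developer without a nickname is listed by name only, and the remaining entries stay aligned.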


#### Display the trending Python repositories in GitHub

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [13]:
# This is the url you will scrape in this exercise
url1 = 'https://github.com/trending/python?since=daily'

In [14]:
r = requests.get(url1)
r.status_code

200

In [15]:
print(r.text[:500])





<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazona


In [16]:
soup = bs(r.text, 'html.parser')
# soup

In [17]:
type(soup)

bs4.BeautifulSoup

In [18]:
# find a single repo
# (str.replace matches literal text, not regex patterns, so the leading '/' is stripped later with re.sub)
repo = soup.find('h1', class_='h3 lh-condensed').find('a')['href']
repo

'/executablebooks/jupyter-book'

In [19]:
# list comprehension REPOS
repos = [repo.find('a')['href'] for repo in soup.find_all('h1', class_='h3 lh-condensed')]
repos

['/executablebooks/jupyter-book',
 '/sherlock-project/sherlock',
 '/cvg/Hierarchical-Localization',
 '/PrefectHQ/prefect',
 '/RUB-SysSec/mobile_sentinel',
 '/naiveHobo/InvoiceNet',
 '/iswbm/magic-python',
 '/google-research/bert',
 '/geerlingguy/ansible-for-devops',
 '/ansible/ansible',
 '/aio-libs/aiohttp',
 '/pythonstock/stock',
 '/rusty1s/pytorch_geometric',
 '/ekzhang/fastseg',
 '/mks0601/I2L-MeshNet_RELEASE',
 '/microsoft/playwright-python',
 '/d2l-ai/d2l-en',
 '/clovaai/stargan-v2',
 '/shibing624/pycorrector',
 '/huggingface/transformers',
 '/yangjianxin1/GPT2-chitchat',
 '/UKPLab/sentence-transformers',
 '/stanfordnlp/stanza',
 '/secdev/scapy',
 '/google/diff-match-patch']

In [20]:
# final list of repos
list_repos= []
for i in repos:
    i = re.sub(r"^\W+", "", i)
    list_repos.append(i)
print(list_repos)

['executablebooks/jupyter-book', 'sherlock-project/sherlock', 'cvg/Hierarchical-Localization', 'PrefectHQ/prefect', 'RUB-SysSec/mobile_sentinel', 'naiveHobo/InvoiceNet', 'iswbm/magic-python', 'google-research/bert', 'geerlingguy/ansible-for-devops', 'ansible/ansible', 'aio-libs/aiohttp', 'pythonstock/stock', 'rusty1s/pytorch_geometric', 'ekzhang/fastseg', 'mks0601/I2L-MeshNet_RELEASE', 'microsoft/playwright-python', 'd2l-ai/d2l-en', 'clovaai/stargan-v2', 'shibing624/pycorrector', 'huggingface/transformers', 'yangjianxin1/GPT2-chitchat', 'UKPLab/sentence-transformers', 'stanfordnlp/stanza', 'secdev/scapy', 'google/diff-match-patch']
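
The difference between literal replacement and regex substitution is easy to miss, so here is a small sketch using one of the hrefs from the output above. It compares `str.replace`, `re.sub` and `str.lstrip` on the same input.

```python
import re

href = "/executablebooks/jupyter-book"

# str.replace searches for the literal text '^\W+', which does not occur
# in the href, so the string comes back unchanged.
unchanged = href.replace(r"^\W+", "")

# re.sub interprets the pattern as a regular expression and removes
# the leading run of non-word characters.
via_regex = re.sub(r"^\W+", "", href)

# When the prefix is a single known character, lstrip is enough.
via_lstrip = href.lstrip("/")

print(unchanged)
print(via_regex)
print(via_lstrip)
```

For this task `lstrip("/")` is arguably the simplest choice, since the only unwanted character is the leading slash.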


#### Display all the image links from the Walt Disney Wikipedia page

In [21]:
# This is the url you will scrape in this exercise
url2 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [22]:
#your code
r = requests.get(url2)
r.status_code    

200

In [23]:
print(r.text[:500])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Walt Disney - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"297a02b6-04a9-4477-ba78-b88b7c6963c2","wgCSP


In [24]:
soup = bs(r.text, 'html.parser')
# soup

In [25]:
# find a table image
img_table = soup.find('table', class_='infobox biography vcard').find('a')['href'].replace(r'/wiki/File:','')
img_table

'Walt_Disney_1946.JPG'

In [26]:
# imm = soup.find('a', attrs={'class':'image'}).find('img')['alt']
# imm

In [27]:
im = soup.find('a', attrs={'class':'image'}).find('img')['src']
im

'//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG'

In [28]:
# list comprehension IMAGES: the src of every thumbnail image
# (str.replace does not understand regex patterns, so the file names are extracted later with re)
imgs = [img.find('img')['src'] for img in soup.find_all('div', class_='thumbinner')]
imgs

['//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_

In [29]:
enlace = '//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg/170px-Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg'

In [30]:
# regex to eliminate the link before the image
search_links = re.compile('//upload.*?px-.*?-?_?(.*)')

In [31]:
search_links.findall(enlace)

['Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg']

In [32]:
# findall returns a list of captured groups, so each file name is printed wrapped in a list
for src in imgs:
    print(search_links.findall(src))

['Walt_Disney_envelope_ca._1921.jpg']
['seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg']
['Trolley_Troubles_poster.jpg']
['Walt_Disney_and_his_cartoon_creation_%22Mickey_Mouse%22_-_National_Board_of_Review_Magazine.jpg']
['Steamboat-willie.jpg']
['Walt_Disney_1935.jpg']
['Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg']
['Disney_drawing_goofy.jpg']
['DisneySchiphol1951.jpg']
['WaltDisneyplansDisneylandDec1954.jpg']
['Walt_disney_portrait_right.jpg']
['Walt_Disney_Grave.JPG']
['Roy_O._Disney_with_Company_at_Press_Conference.jpg']
['Disney_Display_Case.JPG']
['Disney1968.jpg']


In [33]:
# for i in imgs:
#     print(type(i))

In [34]:
# list comprehension IMAGES
# imgs = [img.find('a')['href'].replace(r'/wiki/File:','') for img in soup.find_all('div', class_='thumbinner')]
# imgs

In [35]:
# sum table image + page images
tot_imgs = 1 + len(imgs)
tot_imgs

16

In [36]:
print(f"In Walt Disney's wikipedia page there are {tot_imgs} images")

In Walt Disney's wikipedia page there are 16 images


#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page

In [37]:
# This is the url you will scrape in this exercise
url3 ='https://en.wikipedia.org/wiki/Python' 

In [38]:
#your code
r = requests.get(url3)
r.status_code  

200

In [39]:
print(r.text[:500])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6b140b43-59bb-4dbc-abde-d705e0a874ae","wgCSPNonce


In [40]:
soup = bs(r.text, 'html.parser')
# soup

In [41]:
# print the href of every link on the wiki page
for link in soup.find_all("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])

#mw-head
#searchInput
https://en.wiktionary.org/wiki/Python
https://en.wiktionary.org/wiki/python
#Snakes
#Ancient_Greece
#Media_and_entertainment
#Computing
#Engineering
#Roller_coasters
#Vehicles
#Weaponry
#People
#Other_uses
#See_also
/w/index.php?title=Python&action=edit&section=1
/wiki/Pythonidae
/wiki/Python_(genus)
/w/index.php?title=Python&action=edit&section=2
/wiki/Python_(mythology)
/wiki/Python_of_Aenus
/wiki/Python_(painter)
/wiki/Python_of_Byzantium
/wiki/Python_of_Catana
/w/index.php?title=Python&action=edit&section=3
/wiki/Python_(film)
/wiki/Pythons_2
/wiki/Monty_Python
/wiki/Python_(Monty)_Pictures
/w/index.php?title=Python&action=edit&section=4
/wiki/Python_(programming_language)
/wiki/CPython
/wiki/CMU_Common_Lisp
/wiki/PERQ#PERQ_3
/w/index.php?title=Python&action=edit&section=5
/w/index.php?title=Python&action=edit&section=6
/wiki/Python_(Busch_Gardens_Tampa_Bay)
/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)
/wiki/Python_(Efteling)
/w/index.php?title=Python&action=
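
The printed hrefs mix page anchors, edit links, external URLs and article links. If only the article links are wanted, a minimal sketch could filter on the `/wiki/` prefix and resolve each href against the page URL. The filter rule and the helper name are assumptions; the sample hrefs are taken from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Python"

def article_links(hrefs, base_url=BASE_URL):
    """Keep hrefs starting with '/wiki/' and return them as absolute URLs, without duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href.startswith("/wiki/"):
            continue  # skip anchors, edit links and external URLs
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

sample_hrefs = [
    "#mw-head",
    "https://en.wiktionary.org/wiki/Python",
    "/w/index.php?title=Python&action=edit&section=1",
    "/wiki/Pythonidae",
    "/wiki/Python_(genus)",
    "/wiki/Pythonidae",
]
links = article_links(sample_hrefs)
print(links)
```

In the notebook, the input list could be built with `[a['href'] for a in soup.find_all('a', href=True)]`.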

#### Number of Titles that have changed in the United States Code since its last release point 

In [42]:
# This is the url you will scrape in this exercise
url4 = 'http://uscode.house.gov/download/download.shtml'

In [43]:
#your code
r = requests.get(url4)
r.status_code  

200

In [44]:
print(r.text[:500])

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <meta http-equiv="X-UA-Compatible" content="IE=8" />
        <meta http-equiv="pragma" content="no-cache" /><!-- HTTP 1.0 -->
        <meta http-equiv="cache-control" content="no-cache,must-revalidate" 


In [45]:
soup = bs(r.text, 'html.parser')
# soup

In [46]:
txt = soup.find('div', class_='usctitlechanged').text.replace('\n','').strip()
txt

'Title 5 - Government Organization and Employees ٭'

In [47]:
# list comprehension: titles that have changed since the last release point
txts = [t.text.replace('\n','').strip() for t in soup.find_all('div', class_='usctitlechanged')]
print(len(txts))

7
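
If the title number and name are needed separately, each string can be split with a regular expression. The sketch below assumes every entry follows the pattern seen above, `Title <number> - <name>`, optionally followed by a marker character; `parse_title` is a hypothetical helper.

```python
import re

# 'Title', a number, a dash, then the name; trailing non-word characters
# (spaces, the star marker) are left out of the name.
TITLE_PATTERN = re.compile(r"^Title\s+(\d+)\s*-\s*(.+?)\W*$")

def parse_title(text):
    """Return (number, name) for a title string, or None if it does not match."""
    match = TITLE_PATTERN.match(text.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

parsed = parse_title("Title 5 - Government Organization and Employees ٭")
print(parsed)
```

Names that end in punctuation, such as a closing parenthesis, would lose that character with this pattern, so it is worth checking the results against the page.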


#### A Python list with the top ten FBI's Most Wanted names 

In [48]:
# This is the url you will scrape in this exercise
url5 = 'https://www.fbi.gov/wanted/topten'

In [49]:
#your code 
r = requests.get(url5)
r.status_code  

200

In [50]:
print(r.text[:500])

<!DOCTYPE html>
<html lang="en" data-gridsystem="bs3">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="canonical" href="https://www.fbi.gov/wanted/topten"><title>Ten Most Wanted Fugitives &#8212; FBI</title>
<link rel="alternate" href="https://www.fbi.gov/wanted/topten/RSS" title="Ten Most Wanted Fugitives - RSS 1.0" type="application/rss+xml">
<link rel="alternate" href="https:/


In [51]:
soup = bs(r.text, 'html.parser')
# soup

In [52]:
mwf = soup.find('h3', class_='title').text.replace('\n','')
mwf

'ALEXIS FLORES'

In [53]:
# list comprehension FUGITIVES
mw_fugitives = [mwf.text.replace('\n','') for mwf in soup.find_all('h3', class_='title')]
mw_fugitives

['ALEXIS FLORES',
 'EUGENE PALMER',
 'RAFAEL CARO-QUINTERO',
 'ROBERT WILLIAM FISHER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'YASER ABDEL SAID',
 'SANTIAGO VILLALBA MEDEROS']

In [54]:
len(mw_fugitives)

10

#### Info on the 20 latest earthquakes (date, time, latitude, longitude and region name) from the EMSC as a pandas DataFrame

In [55]:
# This is the url you will scrape in this exercise
url6 = 'https://www.emsc-csem.org/Earthquake/'

In [56]:
#your code
r = requests.get(url6)
r.status_code  

200

In [57]:
print(r.text[:500])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xml:lang="en" lang="en">
<head><meta name="google-site-verification" content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" /><meta name="msvalidate.01" content="BCAA3C04C41AE6E6AFAF117B9469C66F" /><meta name="y_key" content="43b36314ccb77957" /><!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
<meta http-equ


In [58]:
soup = bs(r.text, 'html.parser')
# soup

In [59]:
tables = soup.find_all("table")
#  tables

In [60]:
table = tables[3]
tab_data = [[cell.text for cell in row.find_all(["th","td"])]
                        for row in table.find_all("tr")]
# tab_data

In [61]:
df = pd.DataFrame(tab_data)
df.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,,,,,,,,,,,,,
1,CitizenResponse,,,,Date & Time UTC,Latitude degrees,Longitude degrees,Depth km,Mag [+],Region name [+],Last update [-],,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,12345678910›»,,,,,,,,,,,,
5,,,,earthquake2020-08-16 18:01:46.015min ago,23.12,S,68.96,W,110,ML,2.7,"ANTOFAGASTA, CHILE",2020-08-16 18:16
6,,,,earthquake2020-08-16 17:30:52.046min ago,24.57,N,94.70,E,35,M,2.8,MYANMAR-INDIA BORDER REGION,2020-08-16 17:41
7,,,,earthquake2020-08-16 17:27:57.049min ago,32.92,S,69.07,W,13,ML,2.7,"MENDOZA, ARGENTINA",2020-08-16 17:43
8,,,,earthquake2020-08-16 17:00:52.61hr 16min ago,35.96,N,117.03,W,4,ML,2.6,CENTRAL CALIFORNIA,2020-08-16 17:26
9,,,,earthquake2020-08-16 16:45:21.01hr 32min ago,33.19,S,70.00,W,11,ML,2.8,"MENDOZA, ARGENTINA",2020-08-16 17:20


In [62]:
# use row 1, which holds the column names, as the header
df.columns = df.iloc[1,:]
df.drop(index=1,inplace=True)

In [63]:
# keep only rows with at least 10 non-null cells
# (recent pandas versions reject `how` and `thresh` together; `thresh` alone gives the same filter)
df = df.dropna(thresh=10)

In [64]:
df = df.reset_index(drop=True)
df.head(20)

1,CitizenResponse,Unnamed: 2,Unnamed: 3,Unnamed: 4,Date & Time UTC,Latitude degrees,Longitude degrees,Depth km,Mag [+],Region name [+],Last update [-],NaN,NaN.1
0,,,,earthquake2020-08-16 18:01:46.015min ago,23.12,S,68.96,W,110,ML,2.7,"ANTOFAGASTA, CHILE",2020-08-16 18:16
1,,,,earthquake2020-08-16 17:30:52.046min ago,24.57,N,94.7,E,35,M,2.8,MYANMAR-INDIA BORDER REGION,2020-08-16 17:41
2,,,,earthquake2020-08-16 17:27:57.049min ago,32.92,S,69.07,W,13,ML,2.7,"MENDOZA, ARGENTINA",2020-08-16 17:43
3,,,,earthquake2020-08-16 17:00:52.61hr 16min ago,35.96,N,117.03,W,4,ML,2.6,CENTRAL CALIFORNIA,2020-08-16 17:26
4,,,,earthquake2020-08-16 16:45:21.01hr 32min ago,33.19,S,70.0,W,11,ML,2.8,"MENDOZA, ARGENTINA",2020-08-16 17:20
5,,,,earthquake2020-08-16 16:40:41.91hr 36min ago,30.4,N,94.84,E,20,mb,4.7,EASTERN XIZANG,2020-08-16 17:25
6,,,,earthquake2020-08-16 16:21:49.81hr 55min ago,20.77,S,173.78,W,10,mb,4.9,TONGA,2020-08-16 17:03
7,,,,earthquake2020-08-16 16:20:36.01hr 56min ago,18.9,N,121.47,E,26,M,3.4,"LUZON, PHILIPPINES",2020-08-16 16:30
8,,,,earthquake2020-08-16 16:17:19.02hr 00min ago,33.27,S,71.76,W,30,ML,2.5,"OFFSHORE VALPARAISO, CHILE",2020-08-16 16:43
9,,,,earthquake2020-08-16 16:11:02.52hr 06min ago,0.28,S,125.24,E,40,mb,4.7,MOLUCCA SEA,2020-08-16 17:00
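
The table above still has the date and time fused into one cell, and the hemisphere letters sit in separate columns. A possible clean-up step is sketched below. It assumes the cell layout seen in the output (seconds with one decimal digit, then the "ago" text), and `parse_row` is a hypothetical helper that takes the six relevant cells of one row.

```python
import re

# Date, whitespace, then a time with one decimal digit in the seconds (assumed format)
WHEN_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}\.\d)")

def parse_row(cells):
    """cells: [when, lat, N/S, lon, E/W, region] taken from one table row."""
    when, lat, ns, lon, ew, region = cells
    match = WHEN_PATTERN.search(when)
    if match is None:
        raise ValueError(f"no date/time found in {when!r}")
    date, time = match.groups()
    # Southern latitudes and western longitudes become negative numbers
    latitude = float(lat) * (-1 if ns.strip() == "S" else 1)
    longitude = float(lon) * (-1 if ew.strip() == "W" else 1)
    return {
        "date": date,
        "time": time,
        "latitude": latitude,
        "longitude": longitude,
        "region": region.strip(),
    }

row = ["earthquake2020-08-16 18:01:46.015min ago", "23.12", "S", "68.96", "W", "ANTOFAGASTA, CHILE"]
record = parse_row(row)
print(record)
```

A list of such dictionaries can be passed straight to `pd.DataFrame(...)` to get the five requested columns.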


#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [65]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url7 = 'https://twitter.com/'

In [66]:
# adapted from w3resource.com:
# https://www.w3resource.com/python-exercises/web-scraping/web-scraping-exercise-19.php

handle = input('Input your account name on Twitter: ')
temp = requests.get(url7 + handle)
soup = bs(temp.text, 'lxml')

try:
    tweet_box = soup.find('li', {'class': 'ProfileNav-item ProfileNav-item--tweets is-active'})
    tweets = tweet_box.find('a').find('span', {'class': 'ProfileNav-value'})
    print("{} has posted {} tweets.".format(handle, tweets.get('data-count')))
except AttributeError:
    # soup.find() returned None: the expected element is not in the HTML
    print('Account name not found...')

# DOESN'T WORK: the tweet counter is never found, so the except branch always runs.
# The ProfileNav element is missing from the downloaded HTML (the page is likely rendered with JavaScript).

Input your account name on Twitter:  @NBA


Account name not found...
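
Part of the try/except requirement can be handled before any request is sent, by validating the account name itself. The sketch below strips a leading `@` and rejects names that cannot be valid; the rule used (letters, digits and underscores, at most 15 characters) is stated here as an assumption about Twitter's username format, and `normalize_handle` is a hypothetical helper.

```python
import re

# Assumed username rule: letters, digits and underscores, 1 to 15 characters
HANDLE_PATTERN = re.compile(r"[A-Za-z0-9_]{1,15}")

def normalize_handle(raw):
    """Strip whitespace and a leading '@'; raise ValueError for impossible names."""
    handle = raw.strip().lstrip("@")
    if not HANDLE_PATTERN.fullmatch(handle):
        raise ValueError(f"not a valid account name: {raw!r}")
    return handle

for raw in [" @NBA", "not a handle!"]:
    try:
        print(normalize_handle(raw))
    except ValueError as error:
        print(f"Account name not found: {error}")
```

Catching a specific exception such as `ValueError` keeps unrelated errors visible, which a bare `except:` would hide.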


#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case the account name is not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [67]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url8 = 'https://twitter.com/'

In [68]:
#your code
# r = requests.get(url8)
# r.status_code  

In [69]:
# print(r.text[:500])

In [70]:
# soup = bs(r.text, 'html.parser')
# soup

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [71]:
# This is the url you will scrape in this exercise
url9 = 'https://www.wikipedia.org/'

In [72]:
#your code
r = requests.get(url9)
r.status_code  

200

In [73]:
print(r.text[:500])

<!DOCTYPE html>
<html lang="mul" class="no-js">
<head>
<meta charset="utf-8">
<title>Wikipedia</title>
<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta name="viewport" content="initial-scale=1,user-scalable=yes">
<link rel="apple-touch


In [74]:
soup = bs(r.text, 'html.parser')
# soup

In [75]:
lang = soup.find('a', class_ = 'link-box').find('strong').text
lang

'English'

In [76]:
art = soup.find('a', class_ = 'link-box').find('small').text.replace('\xa0','.')
art

'6.137.000+ articles'

In [77]:
# r.text was decoded with the wrong charset (UTF-8 bytes read as Latin-1), which garbles
# non-Latin names; parsing the raw bytes lets BeautifulSoup use the page's declared UTF-8
soup = bs(r.content, 'html.parser')
wikipedia = [(wiki.find('strong').text, wiki.find('small').text.replace('\xa0','.')) for wiki in soup.find_all('a', class_ = 'link-box')]
wikipedia

[('English', '6.137.000+ articles'),
 ('日本語', '1.222.000+ 記事'),
 ('Español', '1.617.000+ artículos'),
 ('Deutsch', '2.467.000+ Artikel'),
 ('Русский', '1.651.000+ статей'),
 ('Français', '2.241.000+ articles'),
 ('Italiano', '1.627.000+ voci'),
 ('中文', '1.136.000+ 條目'),
 ('Português', '1.041.000+ artigos'),
 ('Polski', '1.423.000+ haseł')]

In [78]:
# a simpler way: print the text of each link box
for wiki in soup.find_all('a', class_ = 'link-box'):
    print(wiki.get_text())


English
6 137 000+ articles


日本語
1 222 000+ 記事


Español
1 617 000+ artículos


Deutsch
2 467 000+ Artikel


Русский
1 651 000+ статей


Français
2 241 000+ articles


Italiano
1 627 000+ voci


中文
1 136 000+ 條目


Português
1 041 000+ artigos


Polski
1 423 000+ haseł



In [79]:
# check languages
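
One way to check the language names is to look for text that was decoded with the wrong charset. The sketch below repairs the common case where UTF-8 bytes were read as Latin-1, and also turns an article label into a number. Both helpers are hypothetical names, and the repair only covers this one kind of mis-decoding.

```python
import re

def fix_mojibake(text):
    """Undo 'UTF-8 bytes decoded as Latin-1'; return the text unchanged if that does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

def article_count(label):
    """Extract the digits from a label such as '6 137 000+ articles' as an int."""
    digits = re.sub(r"\D", "", label)
    return int(digits) if digits else None

# "Espa\u00c3\u00b1ol" is how 'Español' looks after the wrong decoding
print(fix_mojibake("Espa\u00c3\u00b1ol"))
print(fix_mojibake("English"))
print(article_count("6\xa0137\xa0000+ articles"))
```

Repairing strings afterwards is a fallback. Decoding correctly in the first place, for example by parsing `r.content` or setting `r.encoding = 'utf-8'` before reading `r.text`, avoids the problem.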

#### A list with the different kinds of datasets available in data.gov.uk 

In [80]:
# This is the url you will scrape in this exercise
url10 = 'https://data.gov.uk/'

In [81]:
#your code 
r = requests.get(url10)
r.status_code  

200

In [82]:
print(r.text[:500])


<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]-->
<!--[if gt IE 8]><!--><html lang="en"><!--<![endif]-->
<html class="govuk-template">
  <head>
    <meta charset="utf-8">
    <title>Find open data - data.gov.uk</title>

    <meta name="theme-color" content="#0b0c0c" />

    <meta name="viewport" content="width=device-width, initial-scale=1">
    
    <link rel="stylesheet" media="screen" href="/find-assets/application-05aa6420d403adc1fdc7a0dc1fb860dc097af29300239c7e5


In [83]:
soup = bs(r.text, 'html.parser')
# soup

In [84]:
# collect the href of every link styled as 'govuk-link'
data = []
for i in soup.find_all("a", class_ = 'govuk-link'):
    if 'href' in i.attrs:
        data.append(i.attrs['href'])
print(data)

['/cookies', '/cookies', 'http://www.smartsurvey.co.uk/s/3SEXD/', '/search?filters%5Btopic%5D=Business+and+economy', '/search?filters%5Btopic%5D=Crime+and+justice', '/search?filters%5Btopic%5D=Defence', '/search?filters%5Btopic%5D=Education', '/search?filters%5Btopic%5D=Environment', '/search?filters%5Btopic%5D=Government', '/search?filters%5Btopic%5D=Government+spending', '/search?filters%5Btopic%5D=Health', '/search?filters%5Btopic%5D=Mapping', '/search?filters%5Btopic%5D=Society', '/search?filters%5Btopic%5D=Towns+and+cities', '/search?filters%5Btopic%5D=Transport']


In [85]:
# eliminate the first three elements of the list and modify strings
for d in data[3:]:
    dataset = re.sub(r'.*?=', '', d)  
    print(dataset)

Business+and+economy
Crime+and+justice
Defence
Education
Environment
Government
Government+spending
Health
Mapping
Society
Towns+and+cities
Transport
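
The regex above leaves the `+` signs in place because the values are still URL-encoded. As an alternative sketch, `urllib.parse` can decode the query string directly. `topic_from_href` is a hypothetical helper; the parameter name `filters[topic]` is the decoded form of `filters%5Btopic%5D` seen in the hrefs.

```python
from urllib.parse import parse_qs, urlsplit

def topic_from_href(href):
    """Return the decoded 'filters[topic]' value of a search link, or None if absent."""
    query = parse_qs(urlsplit(href).query)
    values = query.get("filters[topic]")
    return values[0] if values else None

hrefs = [
    "/cookies",
    "/search?filters%5Btopic%5D=Business+and+economy",
    "/search?filters%5Btopic%5D=Defence",
]
topics = [topic_from_href(h) for h in hrefs if topic_from_href(h) is not None]
print(topics)
```

Because links without the parameter return `None`, there is no need to skip the first three list items by position.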


In [86]:
# a simpler way: read the topic names from the <h3> headings
for tit in soup.find_all("h3"):
    print(tit.get_text())

Business and economy
Crime and justice
Defence
Education
Environment
Government
Government spending
Health
Mapping
Society
Towns and cities
Transport


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [87]:
# This is the url you will scrape in this exercise
url11 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [88]:
#your code
r = requests.get(url11)
r.status_code  

200

In [89]:
print(r.text[:500])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of languages by number of native speakers - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"66b378ee-


In [90]:
soup = bs(r.text, 'html.parser')
# soup

In [91]:
# find all the tables in the page
tables = soup.find_all("table")
# tables

In [92]:
# find the table I need to analyze
table = tables[0]
tab_data = [[cell.text.replace('\n','') for cell in row.find_all(["th","td"])]
                        for row in table.find_all("tr")]
# tab_data

In [93]:
df = pd.DataFrame(tab_data)
df.head(20)

Unnamed: 0,0,1,2,3,4,5
0,Rank,Language,Speakers(millions),% of World pop.(March 2019)[8],Language family,Branch
1,1,Mandarin Chinese,918,11.922,Sino-Tibetan,Sinitic
2,2,Spanish,480,5.994,Indo-European,Romance
3,3,English,379,4.922,Indo-European,Germanic
4,4,Hindi (Sanskritised Hindustani)[9],341,4.429,Indo-European,Indo-Aryan
5,5,Bengali,228,2.961,Indo-European,Indo-Aryan
6,6,Portuguese,221,2.870,Indo-European,Romance
7,7,Russian,154,2.000,Indo-European,Balto-Slavic
8,8,Japanese,128,1.662,Japonic,Japanese
9,9,Western Punjabi[10],92.7,1.204,Indo-European,Indo-Aryan


In [94]:
df.columns = df.iloc[0,:]
df.drop(index=0,inplace=True)
df.head(10)

Unnamed: 0,Rank,Language,Speakers(millions),% of World pop.(March 2019)[8],Language family,Branch
1,1,Mandarin Chinese,918.0,11.922,Sino-Tibetan,Sinitic
2,2,Spanish,480.0,5.994,Indo-European,Romance
3,3,English,379.0,4.922,Indo-European,Germanic
4,4,Hindi (Sanskritised Hindustani)[9],341.0,4.429,Indo-European,Indo-Aryan
5,5,Bengali,228.0,2.961,Indo-European,Indo-Aryan
6,6,Portuguese,221.0,2.87,Indo-European,Romance
7,7,Russian,154.0,2.0,Indo-European,Balto-Slavic
8,8,Japanese,128.0,1.662,Japonic,Japanese
9,9,Western Punjabi[10],92.7,1.204,Indo-European,Indo-Aryan
10,10,Marathi,83.1,1.079,Indo-European,Indo-Aryan
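
The scraped cells still carry Wikipedia footnote markers such as `[9]`, and the numbers are stored as text. A small clean-up sketch, using hypothetical helper names and sample values copied from the table above:

```python
import re

# Footnote markers look like [8], [9], [10]
FOOTNOTE = re.compile(r"\[\d+\]")

def strip_footnotes(text):
    """Remove footnote markers such as '[9]' from a cell."""
    return FOOTNOTE.sub("", text).strip()

def to_number(text):
    """Convert a numeric cell (possibly with a footnote or thousands separators) to float."""
    return float(strip_footnotes(text).replace(",", ""))

print(strip_footnotes("Hindi (Sanskritised Hindustani)[9]"))
print(strip_footnotes("% of World pop.(March 2019)[8]"))
print(to_number("92.7"))
```

On the DataFrame, the same helpers could be applied column by column, for example with `df['Language'].map(strip_footnotes)`.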


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [95]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url12 = 'https://twitter.com/'

In [96]:
# your code
r = requests.get(url12)
r.status_code  

200

In [97]:
print(r.text[:500])

<!DOCTYPE html>
<html dir="ltr" lang="en">
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" />
<link rel="preconnect" href="//abs.twimg.com" />
<link rel="preconnect" href="//api.twitter.com" />
<link rel="preconnect" href="//pbs.twimg.com" />
<link rel="preconnect" href="//t.co" />
<link rel="preconnect" href="//video.twimg.com" />
<link rel="dns-prefetch" href="//abs.twimg.com" />
<link rel="dns-prefe


In [98]:
soup = bs(r.text, 'html.parser')
# soup

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [99]:
# This is the url you will scrape in this exercise 
url13 = 'https://www.imdb.com/chart/top'

In [100]:
# your code
r = requests.get(url13)
r.status_code  

200

In [101]:
print(r.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    
    
    

    
    
    

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">
            <style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
           


In [102]:
soup = bs(r.text, 'html.parser')
# soup

In [103]:
# find all the tables in the page
tables = soup.find_all("table")
# tables

In [104]:
# name of the movie
name = soup.find('td', class_ = 'titleColumn').find('a').text
name

'Cadena perpetua'

In [105]:
# director and stars
stars = soup.find('td', class_ = 'titleColumn').find('a')['title']
stars

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'

In [106]:
# release date
date = soup.find('td', class_ = 'titleColumn').find('span', class_ = 'secondaryInfo').text.replace('(','').replace(')','')
date

'1994'

In [107]:
# find the table I need to analyze
table = tables[0]
tab_data = [[(cell.find('a').text, cell.find('a')['title'],cell.find('span', class_ = 'secondaryInfo').text.replace('(','').replace(')','')) 
              for cell in row.find_all("td", class_ = 'titleColumn')]
                        for row in table.find_all("tr")]
# tab_data

In [108]:
# verify cell type
# for cell in tab_data:
#     print(type(cell))

In [109]:
df = pd.DataFrame(tab_data)
# each cell of column 0 holds a (name, credits, year) tuple; expand it into three columns
df = df[0].apply(pd.Series)

In [110]:
# row 0 comes from the table's header row, which has no 'titleColumn' cell, so it is all NaN: drop it
df = df.drop(index=0)

In [111]:
# rename columns
df.columns = ['Name', 'Director and cast', 'Date release']
df

Unnamed: 0,Name,Director and cast,Date release
1,Cadena perpetua,"Frank Darabont (dir.), Tim Robbins, Morgan Fre...",1994
2,El padrino,"Francis Ford Coppola (dir.), Marlon Brando, Al...",1972
3,El padrino: Parte II,"Francis Ford Coppola (dir.), Al Pacino, Robert...",1974
4,El caballero oscuro,"Christopher Nolan (dir.), Christian Bale, Heat...",2008
5,12 hombres sin piedad,"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",1957
...,...,...,...
246,La batalla de Argel,"Gillo Pontecorvo (dir.), Brahim Hadjadj, Jean ...",1966
247,"Swades: We, the People","Ashutosh Gowariker (dir.), Shah Rukh Khan, Gay...",2004
248,Trono de sangre,"Akira Kurosawa (dir.), Toshirô Mifune, Minoru ...",1957
249,Lagaan: Érase una vez en la India,"Ashutosh Gowariker (dir.), Aamir Khan, Raghuvi...",2001
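
The nested comprehension above produces one list per table row, which is why the result has to be unpacked with `apply(pd.Series)` and the empty header row dropped. An alternative sketch searches for the `titleColumn` cells directly and builds one dict per movie. The HTML below is a hypothetical two-row snippet shaped like the markup used in the cells above, and `chart_rows` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the chart markup; the real page is much larger.
SAMPLE_HTML = """
<table>
  <tr><th>Rank and title</th></tr>
  <tr><td class="titleColumn">1.
    <a title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Cadena perpetua</a>
    <span class="secondaryInfo">(1994)</span></td></tr>
  <tr><td class="titleColumn">5.
    <a title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 hombres sin piedad</a>
    <span class="secondaryInfo">(1957)</span></td></tr>
</table>
"""


def chart_rows(html):
    """Return one dict per movie cell found in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    # Header rows contain no 'titleColumn' cell, so they are skipped automatically.
    for cell in soup.find_all('td', class_='titleColumn'):
        link = cell.find('a')
        year = cell.find('span', class_='secondaryInfo').text.strip('()')
        rows.append({
            'Name': link.text,
            'Director and cast': link['title'],
            'Date release': year,
        })
    return rows


rows = chart_rows(SAMPLE_HTML)
print(rows[0])
```

Passing the list of dicts to `pd.DataFrame(rows)` gives named columns in one step.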


#### Movie name, year and a brief summary of 10 random movies from IMDb's Top 250 as a pandas DataFrame.

In [112]:
#This is the url you will scrape in this exercise
url14 = 'http://www.imdb.com/chart/top'

In [113]:
#your code
r = requests.get(url14)
r.status_code  

200

In [114]:
print(r.text[:500])




<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    
    
    

    
    
    

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">
            <style>
                body#styleguide-v2 {
                    background: no-repeat fixed center top #000;
                }
           


In [115]:
soup = bs(r.text, 'html.parser')
# soup

In [116]:
# I could not find the movie summaries in this page's HTML, so only name and year are collected below.
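
If the summaries are not in the chart page, they would have to come from each movie's own page. A sketch of the first step only: turning the relative links of the chart into absolute URLs. The `href` value shown is an assumed example, and fetching and parsing the individual pages is left out because their markup is not shown in this notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.imdb.com/chart/top'


def movie_page_urls(hrefs, base=BASE_URL):
    """Resolve relative hrefs (e.g. from cell.find('a')['href']) against the chart URL."""
    return [urljoin(base, href) for href in hrefs]


# '/title/tt0111161/' is an assumed example of a relative link.
urls = movie_page_urls(['/title/tt0111161/'])
print(urls)
```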

In [117]:
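# table with the name and year of every movie (10 are sampled at random below)
# look the table up in the soup parsed in this exercise instead of reusing `tables` from the previous one
table1 = soup.find_all("table")[0]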
tab_data1 = [[(cell.find('a').text, cell.find('span', class_ = 'secondaryInfo').text.replace('(','').replace(')',''))
              for cell in row.find_all("td", class_ = 'titleColumn')]
                        for row in table1.find_all("tr")]
# tab_data1

In [118]:
df = pd.DataFrame(tab_data1).dropna()  # the header row has no title cell and comes out empty
# randomly select 10 rows from the DataFrame
df = df.sample(10)
df

Unnamed: 0,0
59,"(La vida de los otros, 2006)"
85,"(El retorno del Jedi, 1983)"
147,"(Toro salvaje, 1980)"
53,"(Alien, el octavo pasajero: El montaje del dir..."
166,"(Jurassic Park (Parque Jurásico), 1993)"
151,"(Tres anuncios en las afueras, 2017)"
213,"(El salario del miedo, 1953)"
139,"(Una mente maravillosa, 2001)"
211,"(Siempre a tu lado (Hachiko), 2009)"
42,"(Gladiator (El gladiador), 2000)"


In [119]:
# each cell holds a (name, year) tuple; expand it into two columns
df = df[0].apply(pd.Series)

In [120]:
# rename columns
df.columns = ['Name', 'Date release']
df

Unnamed: 0,Name,Date release
59,La vida de los otros,2006
85,El retorno del Jedi,1983
147,Toro salvaje,1980
53,"Alien, el octavo pasajero: El montaje del dire...",1979
166,Jurassic Park (Parque Jurásico),1993
151,Tres anuncios en las afueras,2017
213,El salario del miedo,1953
139,Una mente maravillosa,2001
211,Siempre a tu lado (Hachiko),2009
42,Gladiator (El gladiador),2000
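
`df.sample(10)` returns different rows on every run; passing `random_state` to it makes the draw repeatable. The same idea is sketched below with the standard `random` module (imported at the top of the notebook) on a short list of titles taken from the earlier output. `pick_random` and the seed value are illustrative choices.

```python
import random

movies = [
    ('Cadena perpetua', '1994'),
    ('El padrino', '1972'),
    ('El padrino: Parte II', '1974'),
    ('El caballero oscuro', '2008'),
    ('12 hombres sin piedad', '1957'),
]


def pick_random(items, k, seed=None):
    """Return k distinct items; the same seed always gives the same selection."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    return rng.sample(items, k)


first = pick_random(movies, 3, seed=0)
second = pick_random(movies, 3, seed=0)
print(first == second)
```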


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [121]:
# API docs: https://openweathermap.org/current
city = input('Enter the city:')
url15 = 'http://api.openweathermap.org/data/2.5/weather?q=' + city + '&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city: Barcelona
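
Concatenating the city straight into the URL breaks for names with spaces or accents. A minimal sketch that lets `urllib.parse.urlencode` do the escaping and keeps the key out of the notebook; the environment variable name `OWM_API_KEY` and the helper `build_weather_url` are assumptions for illustration.

```python
import os
from urllib.parse import urlencode

BASE = 'http://api.openweathermap.org/data/2.5/weather'


def build_weather_url(city, api_key, units='metric'):
    """Build the request URL with every query parameter percent-encoded."""
    query = urlencode({'q': city, 'APPID': api_key, 'units': units})
    return f'{BASE}?{query}'


# OWM_API_KEY is a hypothetical variable name; a placeholder is used if it is unset.
api_key = os.environ.get('OWM_API_KEY', 'YOUR_API_KEY')
print(build_weather_url('San Sebastián', api_key))
```

`requests.get(BASE, params={...})` performs the same encoding internally, so either form avoids hand-built query strings.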


In [122]:
# your code
r = requests.get(url15)
r.status_code 

200

In [123]:
print(r.json())

{'coord': {'lon': 2.16, 'lat': 41.39}, 'weather': [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04d'}], 'base': 'stations', 'main': {'temp': 26.24, 'feels_like': 28.82, 'temp_min': 25, 'temp_max': 27.22, 'pressure': 1011, 'humidity': 78}, 'visibility': 10000, 'wind': {'speed': 3.1, 'deg': 110}, 'clouds': {'all': 75}, 'dt': 1597601884, 'sys': {'type': 1, 'id': 6398, 'country': 'ES', 'sunrise': 1597554109, 'sunset': 1597603776}, 'timezone': 7200, 'id': 3128760, 'name': 'Barcelona', 'cod': 200}


In [124]:
# Extract data from the JSON payload (parsed once instead of on every access)
data = r.json()
name = data['name']
temp = data['main']['temp']
ws = data['wind']['speed']
weath = data['weather'][0]['main']          # weather group, e.g. 'Clouds'
descr = data['weather'][0]['description']   # condition within the group

# Results
print(f'Weather in {name}')
print(f'Temperature: {temp} C')
print(f'Wind speed: {ws}')
print(f'Description: {descr}')
print(f'Weather: {weath}')

Weather in Barcelona
Temperature: 26.24 C
Wind speed: 3.1
Description: broken clouds
Weather: Clouds
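
Indexing the payload directly raises `KeyError` as soon as a field is missing, for example when the request fails. A defensive sketch that returns `None` for absent fields; the sample dict is a trimmed copy of the response printed above, and `summarize_weather` is a hypothetical helper whose keys follow the field names in the JSON.

```python
def summarize_weather(payload):
    """Pick the requested fields out of a current-weather payload, tolerating gaps."""
    # 'weather' is a list of conditions; fall back to an empty dict if it is missing or empty.
    weather = (payload.get('weather') or [{}])[0]
    return {
        'city': payload.get('name'),
        'temperature_c': payload.get('main', {}).get('temp'),
        'wind_speed': payload.get('wind', {}).get('speed'),
        'weather': weather.get('main'),
        'description': weather.get('description'),
    }


sample = {
    'weather': [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04d'}],
    'main': {'temp': 26.24},
    'wind': {'speed': 3.1, 'deg': 110},
    'name': 'Barcelona',
    'cod': 200,
}
summary = summarize_weather(sample)
print(summary)
```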


#### Book name, price and stock availability as a pandas DataFrame.

In [125]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url16 = 'http://books.toscrape.com/'

In [126]:
#your code
r = requests.get(url16)
r.status_code  

200

In [127]:
r.encoding

'ISO-8859-1'

In [128]:
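# requests assumed ISO-8859-1, but the page declares UTF-8 in its meta tag;
# override the encoding so that '£' in the prices is decoded correctly (otherwise it shows up as 'Â£')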
r.encoding = 'utf-8'
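
Why the override matters: in UTF-8 the pound sign is stored as two bytes, and reading those bytes as ISO-8859-1 turns them into two separate characters. The sketch below reproduces the effect without any network access, using the first price from this page as sample data.

```python
price = '£51.77'
raw = price.encode('utf-8')  # the bytes a UTF-8 server would send

# Decoding with the wrong codec yields a stray 'Â' in front of the pound sign.
wrong = raw.decode('ISO-8859-1')
right = raw.decode('utf-8')

print(repr(wrong))
print(repr(right))
```

Comparing `r.encoding` with `r.apparent_encoding` before parsing is one way to spot this kind of mismatch.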

In [129]:
print(r.text[:500])

<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" /


In [130]:
soup = bs(r.text, 'html.parser')
# soup

In [131]:
# find all the tables in the page
tables = soup.find_all("table")
# tables

In [132]:
# name
name = soup.find('article', class_ = 'product_pod').find('img')['alt']
name

'A Light in the Attic'

In [133]:
# price
price = soup.find('article', class_ = 'product_pod').find('div', class_ = 'product_price').find('p', class_ = 'price_color').text
price

'£51.77'

In [134]:
# stock availability
stock = soup.find('article', class_ = 'product_pod').find('div', class_ = 'product_price').find('p', class_ = 'instock availability').text.replace('\n','').strip()
stock

'In stock'

In [135]:
# one (name, price, stock) tuple per product card on the page
books = [
    (
        book.find('img')['alt'],
        book.find('div', class_='product_price').find('p', class_='price_color').text,
        book.find('div', class_='product_price').find('p', class_='instock availability').text.replace('\n', '').strip(),
    )
    for book in soup.find_all('article', class_='product_pod')
]
books

[('A Light in the Attic', '£51.77', 'In stock'),
 ('Tipping the Velvet', '£53.74', 'In stock'),
 ('Soumission', '£50.10', 'In stock'),
 ('Sharp Objects', '£47.82', 'In stock'),
 ('Sapiens: A Brief History of Humankind', '£54.23', 'In stock'),
 ('The Requiem Red', '£22.65', 'In stock'),
 ('The Dirty Little Secrets of Getting Your Dream Job', '£33.34', 'In stock'),
 ('The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  '£17.93',
  'In stock'),
 ('The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  '£22.60',
  'In stock'),
 ('The Black Maria', '£52.15', 'In stock'),
 ('Starving Hearts (Triangular Trade Trilogy, #1)', '£13.99', 'In stock'),
 ("Shakespeare's Sonnets", '£20.66', 'In stock'),
 ('Set Me Free', '£17.46', 'In stock'),
 ("Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
  '£52.29',
  'In stock'),
 ('Rip it Up and Start Again', '£35.02', 'In stock'),
 ('Our Band Could Be Your Life: Scen

In [136]:
df = pd.DataFrame(books)
df.columns = ['Name', 'Price', 'Stock availability']
df

Unnamed: 0,Name,Price,Stock availability
0,A Light in the Attic,£51.77,In stock
1,Tipping the Velvet,£53.74,In stock
2,Soumission,£50.10,In stock
3,Sharp Objects,£47.82,In stock
4,Sapiens: A Brief History of Humankind,£54.23,In stock
5,The Requiem Red,£22.65,In stock
6,The Dirty Little Secrets of Getting Your Dream...,£33.34,In stock
7,The Coming Woman: A Novel Based on the Life of...,£17.93,In stock
8,The Boys in the Boat: Nine Americans and Their...,£22.60,In stock
9,The Black Maria,£52.15,In stock
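
The `Price` and `Stock availability` columns are still plain strings, so they cannot be sorted or filtered numerically. A possible clean-up step, sketched with two rows from the output above; `parse_price` and `in_stock` are hypothetical helper names.

```python
import re


def parse_price(text):
    """Turn a price string such as '£51.77' into a float."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    return float(match.group())


def in_stock(text):
    """True when the availability text reads 'In stock', ignoring case and padding."""
    return text.strip().lower() == 'in stock'


sample_books = [
    ('A Light in the Attic', '£51.77', 'In stock'),
    ('Tipping the Velvet', '£53.74', 'In stock'),
]
cleaned = [(name, parse_price(price), in_stock(stock)) for name, price, stock in sample_books]
print(cleaned)
```

On the DataFrame the same helpers could be applied with `df['Price'].map(parse_price)` and `df['Stock availability'].map(in_stock)`.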
