# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You will need to identify the HTML tags, class names, and other attributes used in the content you are expected to extract.
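
The first tip can be sketched as a small helper. This is only an illustration and not part of the lab: the function names `is_success` and `describe_status` are made up here, and the integer passed in would normally come from `response.status_code` after a `requests.get(...)` call.

```python
from http import HTTPStatus


def is_success(status_code):
    """Return True for 2xx codes, which mean the request succeeded."""
    return 200 <= status_code < 300


def describe_status(status_code):
    """Return the code with its standard reason phrase, e.g. '404 Not Found'."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


# With requests, the integer would come from response.status_code.
for code in (200, 404):
    action = "parse it" if is_success(code) else "stop and investigate"
    print(describe_status(code), "->", action)
```

Checking the code before parsing avoids extracting data from an error page that merely looks like HTML.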

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
import bs4 
import pandas as pd

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [51]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [52]:
# your code here

response = requests.get(url)
html = response.content
html

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKc

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
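
The four instructions above can be sketched on a small inline sample instead of the live page. The HTML below is a trimmed-down assumption about how each developer entry is marked up (an `<h1>` with class `h3 lh-condensed` wrapping a link); the real page may differ or change over time, so check it in DevTools first.

```python
import bs4

# Trimmed-down sample shaped like a developer entry on the trending page (assumed markup).
sample_html = """
<h1 class="h3 lh-condensed">
 <a href="/alex">
             Alex Gaynor
 </a> </h1>
<h1 class="h3 lh-condensed">
 <a href="/johnkerl">
             John Kerl
 </a> </h1>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

names = []
for heading in soup.find_all("h1", class_="lh-condensed"):
    link = heading.find("a")
    # split() + join() collapses line breaks and runs of spaces into single spaces
    display_name = " ".join(link.get_text().split())
    # The profile path is "/username", so drop the leading slash
    username = link["href"].lstrip("/")
    names.append(f"{display_name} ({username})")

print(names)
```

Reading the text and the `href` attribute through BeautifulSoup avoids depending on the exact order of attributes in the raw HTML string.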

In [53]:
# your code here

parsed_html = bs4.BeautifulSoup(html, "html.parser") 
print(parsed_html.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRw

In [54]:
tags = parsed_html.find_all('h1',{'class':'h3 lh-condensed'})
tags

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":772,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="ec47cca81ff9ecd231335746bcd4d9eb24dbae4e87e2443d0c78e9997dc8b852" data-view-component="true" href="/alex">
             Alex Gaynor
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":2008794,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="e1f1d1f98f729dae4ac67476fee4b893abe22159a3d4da09590da9bc16cdfa56" data-view-component="true" href="/johnkerl">
             John Kerl
 </a> </h1>,
 <h1 class="h3 lh-co

In [75]:
lista_developers = []
for t in tags:
    link = t.find("a")
    # Display name: the link text without surrounding whitespace and line breaks
    name = link.get_text(strip=True)
    # Username: the profile path without the leading slash
    username = link["href"].lstrip("/")
    lista_developers.append(name + " (" + username + ")")

lista_developers

['Alex Gaynor (alex)',
 'John Kerl (johnkerl)',
 'Yair Morgenstern (yairm210)',
 'Hari Sekhon (HariSekhon)',
 'Jason Quense (jquense)',
 'Ary Borenszweig (asterite)',
 'Hadley Wickham (hadley)',
 'Hoang (hoangvvo)',
 'Daniel Imms (Tyriar)',
 'Anders Jenbo (AJenbo)',
 'Manu MA (manucorporat)',
 'Lee Robinson (leerob)',
 'Marten Seemann (marten-seemann)',
 'R.I.Pienaar (ripienaar)',
 'Paul Miller (paulmillr)',
 'Adeeb Shihadeh (adeebshihadeh)',
 'Brandon Morelli (bmorelli25)',
 'Facundo Olano (facundoolano)',
 'Fernando Cejas (android10)',
 'Viktor Szépe (szepeviktor)',
 'abhishek thakur (abhishekkrthakur)',
 'Jakub T. Jankiewicz (jcubic)',
 'pajlada (pajlada)',
 'Igor Pečovnik (igorpecovnik)',
 'dylan (dylanaraps)']

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.
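
As a rough sketch of that idea, the repository name can be read from the link target rather than from the visible text. The sample markup below is a simplified assumption (an owner prefix in a `<span>`, followed by the repository name), so verify the real structure in DevTools before relying on it.

```python
import bs4

# Simplified, assumed markup for two entries on the trending repositories page.
sample_html = """
<h1 class="h3 lh-condensed">
  <a href="/donnemartin/system-design-primer">
    <span class="text-normal">donnemartin /</span>
    system-design-primer
  </a>
</h1>
<h1 class="h3 lh-condensed">
  <a href="/pallets/flask">
    <span class="text-normal">pallets /</span>
    flask
  </a>
</h1>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

repositories = []
for heading in soup.find_all("h1", class_="lh-condensed"):
    # The href looks like "/owner/repository"; the last segment is the repository name
    owner, repository = heading.find("a")["href"].strip("/").split("/")
    repositories.append(repository)

print(repositories)
```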

In [3]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [5]:
# your code here
response = requests.get(url)
html = response.content
html

b'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" rel="stylesheet" href="https://github.githubassets.com/assets/light-92c7d381038e.css" /><link crossorigin="anonymous" media="all" integrity="sha512-1KkMNn8M/al/dtzBLupRwkIOgnA9MWkm8oxS+solP87jByEvY/g4BmoxLihRogKc

In [6]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
print(parsed_html.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-92c7d381038e.css" integrity="sha512-ksfTgQOOnE+FFXf+yNfVjKSlEckJAdufFIYGK7ZjRhWcZgzAGcmZqqArTgMLpu90FwthqcCX4ldDgKXbmVMeuQ==" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-d4a90c367f0c.css" integrity="sha512-1KkMNn8M/al/dtzBLupRw

In [45]:
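tags = parsed_html.find_all('h1',{'class':'h3 lh-condensed'})

lista_repositorios = []

for t in tags:
    t = str(t)
    # The repository name follows the closing </span> of the owner prefix
    t = t.split('\n</span>\n')
    t = t[-1]
    # Drop the trailing closing tags ("</a> </h1>" plus a line break, 11 characters in total)
    t = t[:-11].strip()
    lista_repositorios.append(t)
    
lista_repositorios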

['SMSBoom',
 'PicoBoot',
 'system-design-primer',
 'RedditVideoMakerBot',
 'you-get',
 'yt-dlp',
 'searx',
 'searxng',
 'flask',
 'sherlock',
 'DALLE2-pytorch',
 'discoart',
 'GHunt',
 'd2l-en',
 'audio',
 'openpilot',
 'Python',
 'EdgeNeXt',
 'tinygrad',
 'learn-cantrill-io-labs',
 'Cura',
 'peps',
 'professional-programming',
 'wifite2',
 'edgedb']

#### Display all the image links from the Walt Disney Wikipedia page.
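
One way to approach this, sketched on an inline sample rather than the live page: read the `src` attribute of each `<img>` tag and resolve it against the page URL. The sample markup is an assumption about how the article embeds images; `page_url` and `image_links` are names chosen for this illustration.

```python
from urllib.parse import urljoin

import bs4

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

# Assumed markup: the src is protocol-relative, i.e. it starts with "//".
sample_html = """
<a class="image" href="/wiki/File:Walt_Disney_1946.JPG">
  <img src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG"/>
</a>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

# urljoin fills in the missing scheme (and host, for relative paths) from page_url
image_links = [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]
print(image_links)
```

`urljoin` handles absolute, protocol-relative and relative values alike, which is more forgiving than prepending a fixed prefix.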

In [1]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [5]:
# your code here
response = requests.get(url)
html = response.content
type(html)

bytes

In [8]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
print(parsed_html.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Walt Disney - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"8cd75702-fd7c-488a-ab23-19d1284bff17","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Walt_Disney","wgTitle":"Walt Disney","wgCurRevisionId":1094713831,"wgRevisionId":1094713831,"wgArticleId":32917,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing links to subscription-only content","Articles with short description","Short description is different from Wikidata","Wikipedi

In [31]:
tags = parsed_html.find_all('a',{'class':'image'})

image_links = []

for t in tags:
    t = str(t)
    # Take the value of the last src="..." attribute inside the link
    t = t.split('src=')
    t = t[-1].split('"')
    t = t[1]
    # The src is protocol-relative ("//upload.wikimedia.org/..."), so prepend the scheme
    t = "https:"+t
    image_links.append(t)
    
image_links

['https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'https://upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.jpg/170px-Walt_Disney_1935.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Di

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page.

In [213]:
import re
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [36]:
# your code here
response = requests.get(url)
html = response.content
type(html)

bytes

In [37]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
print(parsed_html.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Python - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"49230fc6-63f5-4b79-bdcc-6a3196110054","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":1087251762,"wgRevisionId":1087251762,"wgArticleId":46332325,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages",

In [91]:
from urllib.parse import urljoin

pattern = '/wiki/'
tags = parsed_html.find_all('a', href=True)

lista_url_final = []
for t in tags:
    href = t['href']
    if pattern in href:
        # urljoin leaves absolute links untouched and resolves relative ones
        # against the page being scraped (en.wikipedia.org, not en.wiktionary.org)
        lista_url_final.append(urljoin(url, href))

lista_url_final

['https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 'https://en.wikipedia.org/wiki/Pythonidae',
 'https://en.wikipedia.org/wiki/Python_(genus)',
 'https://en.wikipedia.org/wiki/Python_(mythology)',
 'https://en.wikipedia.org/wiki/Python_(programming_language)',
 'https://en.wikipedia.org/wiki/CMU_Common_Lisp',
 'https://en.wikipedia.org/wiki/PERQ#PERQ_3',
 'https://en.wikipedia.org/wiki/Python_of_Aenus',
 'https://en.wikipedia.org/wiki/Python_(painter)',
 'https://en.wikipedia.org/wiki/Python_of_Byzantium',
 'https://en.wikipedia.org/wiki/Python_of_Catana',
 'https://en.wikipedia.org/wiki/Python_Anghelo',
 'https://en.wikipedia.org/wiki/Python_(Efteling)',
 'https://en.wikipedia.org/wiki/Python_(Busch_Gardens_Tampa_Bay)',
 'https://en.wikipedia.org/wiki/Python_(Coney_Island,_Cincinnati,_Ohio)',
 'https://en.wikipedia.org/wiki/Python_(automobile_maker)',
 'https://en.wikipedia.org/wiki/Python_(Ford_prototype)',
 'https://en.wikipedia.org

#### Find the number of titles that have changed in the United States Code since its last release point.

In [23]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [24]:
# your code here
response = requests.get(url)
html = response.content
type(html)

bytes

In [25]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
print(parsed_html.prettify())


<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <script>
   setInterval(function(){if(!document.getElementById('OPTSmartBannerScript')){var js = document.createElement('script');js.id = 'OPTSmartBannerScript';js.src = 'https://conexionseguraempresas.movistar.es/public/SecureBar/icon.js?preview=0&type=service';var first = document.getElementsByTagName('script')[0];first.parentNode.insertBefore(js, first);}},1000);
  </script>
  <script>
   var g_icon_parameters = { "servicesStatus" : "W=1;V=1;P=1;"}
  </script>
  <head>
   <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
   <meta content="IE=8" http-equiv="X-UA-Compatible"/>
   <meta content="no-cache" http-equiv="pragma"/>
   <!-- HTTP 1.0 -->
   <meta content="no-cache,must-revalidate" http-equiv="cache-control"/>
   <!-- HTTP 1.1 -->
   <meta conten

In [38]:
tags = parsed_html.find_all('div',{'class':'usctitlechanged'})
tags

list_titles = []

for t in tags:
    t = str(t)
    t = t.split("-")
    t = t[1]
    t = t.split("<")
    t = t[0]
    t = t.strip()
    list_titles.append(t)
    
list_titles

['Domestic Security',
 'Agriculture',
 'Bankruptcy',
 'Crimes and Criminal Procedure',
 'Education',
 'Foreign Relations and Intercourse',
 'Judiciary and Judicial Procedure',
 'Crime Control and Law Enforcement',
 'The Public Health and Welfare']
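
The exercise asks for a number, so the last step is to count the matching elements; in the notebook, `len(list_titles)` gives that count (nine for the list above). A minimal sketch on an inline sample follows. The markup is an assumption: changed titles carry the class `usctitlechanged` as in the cell above, while the `usctitle` class for unchanged titles and the exact "Title N - Name" text are guesses.

```python
import bs4

# Assumed, simplified markup: one unchanged title and two changed ones.
sample_html = """
<div class="usctitle">Title 5 - Government Organization and Employees</div>
<div class="usctitlechanged">Title 6 - Domestic Security</div>
<div class="usctitlechanged">Title 7 - Agriculture</div>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")
changed = soup.find_all("div", class_="usctitlechanged")

# Keep the part after "Title N - " as the title name
changed_names = [div.get_text(strip=True).split(" - ", 1)[1] for div in changed]

print(len(changed_names), changed_names)
```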

#### Create a Python list with the names of the FBI's ten most wanted fugitives.

In [94]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

# your code here
response = requests.get(url)
html = response.content
type(html)

bytes

In [105]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
print(parsed_html.prettify())

<!DOCTYPE html>
<html data-gridsystem="bs3" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="https://www.fbi.gov/wanted/topten" rel="canonical"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="Ten Most Wanted Fugitives | Federal Bureau of Investigation" name="twitter:title"/>
  <meta content="Federal Bureau of Investigation" property="og:site_name"/>
  <meta content="Ten Most Wanted Fugitives | Federal Bureau of Investigation" property="og:title"/>
  <meta content="website" property="og:type"/>
  <meta content="@FBI" name="twitter:site"/>
  <meta content="https://www.facebook.com/FBI" property="og:article:publisher"/>
  <meta content="The FBI is offering rewards for information leading to the apprehension of the Ten Most Wanted Fugitives. Select the images of suspects to display more information." name="twitter:descri

In [130]:
tags = parsed_html.find_all('h3',{'class':'title'})

lista_most_wanted = []

for t in tags:
    t = str(t)
    t = t.split(">")
    t = t[2]
    t = t.split("<")
    t = t[0]
    lista_most_wanted.append(t)
    
lista_most_wanted

['RUJA IGNATOVA',
 'ARNOLDO JIMENEZ',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'YULAN ADONAY ARCHAGA CARIAS',
 'RAFAEL CARO-QUINTERO',
 'EUGENE PALMER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'JASON DEREK BROWN']

#### Display the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [233]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [146]:
# your code here


In [147]:
response = requests.get(url)
html = response.content
type(html)

bytes

In [148]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 
print(parsed_html.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification"/>
  <meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/>
  <meta content="43b36314ccb77957" name="y_key"/>
  <!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
  <meta content="en" http-equiv="Content-Language"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="all" name="robots"/>
  <meta content="earthquake,earthquakes,last earthquake,earthquake today,earthquakes today,earth quake,earth quakes,real time seismicity,seismic,seismicity,seismicity map,seismology,sismologie,EMSC,CSEM,seismicity on google earth,sumatra,tsunami,tsunamis,map,maps,richter,mercalli,moment tensors,epicenter,magnitude,seismology,foreshock,aftersho

In [232]:
tags = parsed_html.find_all('td',{'class': 'tabev6'})

Date = []
Time = []
#DateTime
for t in tags:
    t = str(t)
    t = t.split(">")
    t = t[5]
    t = t.replace("\xa0","")
    t_date = t[0:10]
    t_hour = t[10:20]
    Date.append(t_date)
    Time.append(t_hour)
    
print(Date)
print(Time)

['2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11', '2022-07-11']
['16:22:03.7', '16:10:15.0', '16:02:04.3', '15:56:43.0', '15:45:00.8', '15:44:54.0', '15:39:18.2', '15:12:19.0', '15:02:04.5', '14:51:24.0', '14:36:57.0', '14:30:24.1', '14:30:04.8', '14:24:22.0', '14:24:08.2', '14:20:55.0', '14:19:14.9', '14:18:07.6', '14:15:22.0', '14:06:55.3', '13:55:45.0', '13:

In [202]:
tags_lo_la = parsed_html.find_all('td',{'class': 'tabev1'})
tags_lo_la

latitude = []
longitude = []
count = 2

for t in tags_lo_la:
    if count%2 == 0:
        t = str(t)
        t = t.split(">")
        t = t[1]
        t = t.replace("\xa0","")
        t = t.split("<")
        t = t[0]
        latitude.append(t)
    else:
        t = str(t)
        t = t.split(">")
        t = t[1]
        t = t.replace("\xa0","")
        t = t.split("<")
        t = t[0]
        longitude.append(t) 
    count +=1
    
print(longitude)
print(latitude)

['155.49', '71.63', '43.07', '125.46', '155.50', '125.60', '155.49', '20.44', '116.09', '85.15', '71.00', '164.57', '155.21', '90.27', '24.78', '105.76', '9.77', '3.63', '121.42', '179.24', '120.74', '43.95', '155.50', '30.86', '70.49', '23.45', '3.70', '3.72', '70.09', '3.63', '178.66', '72.11', '67.65', '122.19', '66.84', '155.50', '67.01', '86.97', '28.42', '155.50', '119.37', '167.17', '125.91', '46.95', '122.93', '3.59', '66.15', '96.82', '3.57', '85.55']
['19.18', '30.34', '38.67', '9.31', '19.18', '9.31', '19.19', '39.73', '38.69', '10.74', '34.02', '49.32', '18.86', '13.46', '34.36', '6.78', '44.44', '35.50', '17.63', '36.15', '12.33', '41.18', '19.18', '36.05', '19.58', '39.65', '35.52', '35.48', '19.55', '35.48', '35.37', '31.72', '24.12', '0.59', '17.95', '19.18', '23.45', '11.74', '37.07', '19.22', '1.84', '45.56', '5.96', '12.08', '0.31', '35.46', '22.48', '15.67', '35.48', '11.20']


In [209]:
tags_NS_EW_mag = parsed_html.find_all('td',{'class': 'tabev2'})
tags_NS_EW_mag

NS =[]
EW = []
mag = []

count=0

for t in tags_NS_EW_mag:
    t = str(t)
    t = t.split(">")
    t = t[1]
    t = t.replace("\xa0","")
    t = t.split("<")
    t = t[0]
    if t == 'N' or t == 'S':
        NS.append(t)
    elif t == 'E' or t == 'W':
        EW.append(t)
    else:
        mag.append(t)
        
print(NS)
print(EW)
print(mag)

['N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'S', 'N', 'N', 'N', 'S', 'N', 'N', 'N', 'S', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'N', 'S', 'S', 'S', 'N', 'N', 'N', 'S', 'N', 'N', 'N', 'S', 'S', 'N', 'N', 'S', 'N', 'S', 'N', 'N', 'N']
['W', 'W', 'E', 'E', 'W', 'E', 'W', 'E', 'W', 'W', 'W', 'E', 'W', 'W', 'E', 'E', 'E', 'W', 'E', 'E', 'E', 'E', 'W', 'E', 'W', 'E', 'W', 'W', 'W', 'W', 'E', 'W', 'W', 'E', 'W', 'W', 'W', 'W', 'E', 'W', 'E', 'E', 'E', 'E', 'E', 'W', 'W', 'W', 'W', 'W']
['2.5', '3.1', '2.7', '3.3', '2.2', '3.3', '2.1', '2.3', '3.5', '2.8', '2.8', '4.7', '2.5', '3.8', '2.5', '3.6', '2.1', '2.4', '3.6', '3.2', '3.2', '3.2', '2.0', '2.3', '3.2', '2.6', '2.1', '1.8', '2.5', '1.9', '3.8', '3.3', '4.0', '3.8', '2.1', '2.4', '4.5', '3.4', '2.0', '3.6', '2.8', '3.3', '3.8', '4.6', '2.9', '2.0', '3.4', '3.7', '2.1', '3.1']


In [235]:
latitude_complete = []
longitude_complete = []
# Pair every coordinate with its hemisphere letter, including the last event
for i in range(len(NS)):
    element = latitude[i] + " " + NS[i]
    element2 = longitude[i] + " " + EW[i]
    latitude_complete.append(element)
    longitude_complete.append(element2)
    
print(latitude_complete)
print(longitude_complete)

['19.18 N', '30.34 S', '38.67 N', '9.31 N', '19.18 N', '9.31 N', '19.19 N', '39.73 N', '38.69 N', '10.74 N', '34.02 S', '49.32 S', '18.86 N', '13.46 N', '34.36 N', '6.78 S', '44.44 N', '35.50 N', '17.63 N', '36.15 S', '12.33 N', '41.18 N', '19.18 N', '36.05 N', '19.58 N', '39.65 N', '35.52 N', '35.48 N', '19.55 S', '35.48 N', '35.37 S', '31.72 S', '24.12 S', '0.59 N', '17.95 N', '19.18 N', '23.45 S', '11.74 N', '37.07 N', '19.22 N', '1.84 S', '45.56 S', '5.96 N', '12.08 N', '0.31 S', '35.46 N', '22.48 S', '15.67 N', '35.48 N', '11.20 N']
['155.49 W', '71.63 W', '43.07 E', '125.46 E', '155.50 W', '125.60 E', '155.49 W', '20.44 E', '116.09 W', '85.15 W', '71.00 W', '164.57 E', '155.21 W', '90.27 W', '24.78 E', '105.76 E', '9.77 E', '3.63 W', '121.42 E', '179.24 E', '120.74 E', '43.95 E', '155.50 W', '30.86 E', '70.49 W', '23.45 E', '3.70 W', '3.72 W', '70.09 W', '3.63 W', '178.66 E', '72.11 W', '67.65 W', '122.19 E', '66.84 W', '155.50 W', '67.01 W', '86.97 W', '28.42 E', '155.50 W', '119.37 E', '1

In [226]:
tags_region = parsed_html.find_all('td',{'class': 'tb_region'})
tags_region

region = []

for t in tags_region:
    t = str(t)
    t = t.split(">")
    t = t[1]
    t = t.split("\xa0")
    t = t[1]
    t = t.split("<")
    t = t[0]
    region.append(t)
    
region

['ISLAND OF HAWAII, HAWAII',
 'COQUIMBO, CHILE',
 'EASTERN TURKEY',
 'MINDANAO, PHILIPPINES',
 'ISLAND OF HAWAII, HAWAII',
 'MINDANAO, PHILIPPINES',
 'ISLAND OF HAWAII, HAWAII',
 'GREECE',
 'NEVADA',
 'COSTA RICA',
 'REGION METROPOLITANA, CHILE',
 'AUCKLAND ISLANDS, N.Z. REGION',
 'HAWAII REGION, HAWAII',
 'OFFSHORE EL SALVADOR',
 'CRETE, GREECE',
 'SUNDA STRAIT, INDONESIA',
 'NORTHERN ITALY',
 'STRAIT OF GIBRALTAR',
 'LUZON, PHILIPPINES',
 'OFF E. COAST OF N. ISLAND, N.Z.',
 'MINDORO, PHILIPPINES',
 "GEORGIA (SAK'ART'VELO)",
 'ISLAND OF HAWAII, HAWAII',
 'WESTERN TURKEY',
 'DOMINICAN REPUBLIC',
 'AEGEAN SEA',
 'STRAIT OF GIBRALTAR',
 'STRAIT OF GIBRALTAR',
 'TARAPACA, CHILE',
 'STRAIT OF GIBRALTAR',
 'OFF E. COAST OF N. ISLAND, N.Z.',
 'OFFSHORE COQUIMBO, CHILE',
 'ANTOFAGASTA, CHILE',
 'MINAHASA, SULAWESI, INDONESIA',
 'PUERTO RICO REGION',
 'ISLAND OF HAWAII, HAWAII',
 'JUJUY, ARGENTINA',
 'NEAR COAST OF NICARAGUA',
 'WESTERN TURKEY',
 'ISLAND OF HAWAII, HAWAII',
 'SULAWESI, INDONES

In [188]:
when = []

for t in tags:
    t = str(t)
    t = t.split(">")
    time = t[-3]
    time = time.split("<")
    time = time[0]
    when.append(time)

when

['15min ago',
 '26min ago',
 '35min ago',
 '40min ago',
 '52min ago',
 '52min ago',
 '57min ago',
 '1hr 24min ago',
 '1hr 35min ago',
 '1hr 45min ago',
 '2hr 00min ago',
 '2hr 06min ago',
 '2hr 07min ago',
 '2hr 12min ago',
 '2hr 13min ago',
 '2hr 16min ago',
 '2hr 17min ago',
 '2hr 19min ago',
 '2hr 21min ago',
 '2hr 30min ago',
 '2hr 41min ago',
 '2hr 42min ago',
 '2hr 55min ago',
 '3hr 06min ago',
 '3hr 11min ago',
 '3hr 12min ago',
 '3hr 17min ago',
 '3hr 19min ago',
 '3hr 23min ago',
 '3hr 29min ago',
 '3hr 43min ago',
 '3hr 53min ago',
 '3hr 57min ago',
 '4hr 06min ago',
 '4hr 07min ago',
 '4hr 35min ago',
 '4hr 47min ago',
 '4hr 57min ago',
 '5hr 12min ago',
 '5hr 42min ago',
 '6hr 01min ago',
 '6hr 15min ago',
 '6hr 18min ago',
 '6hr 20min ago',
 '6hr 32min ago',
 '6hr 34min ago',
 '6hr 42min ago',
 '6hr 45min ago',
 '6hr 54min ago',
 '7hr 07min ago']

In [236]:
df = pd.DataFrame(list(zip(Date, Time, latitude_complete, longitude_complete, region)),
               columns =['Date', 'Time', 'Latitude', 'Longitude', 'Region'])
df

Unnamed: 0,Date,Time,Latitude,Longitude,Region
0,2022-07-11,16:22:03.7,19.18 N,155.49 W,"ISLAND OF HAWAII, HAWAII"
1,2022-07-11,16:10:15.0,30.34 S,71.63 W,"COQUIMBO, CHILE"
2,2022-07-11,16:02:04.3,38.67 N,43.07 E,EASTERN TURKEY
3,2022-07-11,15:56:43.0,9.31 N,125.46 E,"MINDANAO, PHILIPPINES"
4,2022-07-11,15:45:00.8,19.18 N,155.50 W,"ISLAND OF HAWAII, HAWAII"
5,2022-07-11,15:44:54.0,9.31 N,125.60 E,"MINDANAO, PHILIPPINES"
6,2022-07-11,15:39:18.2,19.19 N,155.49 W,"ISLAND OF HAWAII, HAWAII"
7,2022-07-11,15:12:19.0,39.73 N,20.44 E,GREECE
8,2022-07-11,15:02:04.5,38.69 N,116.09 W,NEVADA
9,2022-07-11,14:51:24.0,10.74 N,85.15 W,COSTA RICA
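
The page returns 50 events while the exercise asks for the 20 latest. Since the events are listed newest first, keeping the first 20 rows is enough. The sketch below uses made-up stand-in columns instead of the scraped lists; `latest_20` is a name chosen for this illustration.

```python
import pandas as pd

# Illustrative stand-ins for the scraped columns (50 rows, newest first).
Date = ["2022-07-11"] * 50
Time = [f"16:{59 - minute:02d}:00.0" for minute in range(50)]
Region = ["GREECE"] * 50

df = pd.DataFrame({"Date": Date, "Time": Time, "Region": Region})

# The newest events come first, so the first 20 rows are the 20 latest.
latest_20 = df.head(20)
print(latest_20.shape)
```

In the notebook the same call would be `df.head(20)` on the dataframe built above.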


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.
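
A minimal sketch of the try/except part only, assuming a missing account results in an HTTP error status. The function name `fetch_profile_html` and its `get` parameter (which lets you swap in a fake for testing) are hypothetical. Note also that Twitter may not serve tweet or follower counts to a plain HTTP request, so the parsing step is deliberately left out.

```python
import requests


def fetch_profile_html(handle, get=requests.get):
    """Return the profile page HTML for a handle, or None if the request fails."""
    url = "https://twitter.com/" + handle.lstrip("@")
    try:
        response = get(url, timeout=10)
        # raise_for_status() turns 4xx/5xx answers (e.g. 404) into exceptions
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Could not load {url}: {error}")
        return None
    return response.text
```

In the notebook this could be called as `fetch_profile_html(input('Enter a Twitter handle: '))`, checking for `None` before parsing anything.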

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Count the number of followers of a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. Include a ***try/except block*** to handle account names that are not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [237]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [238]:
# your code here
response = requests.get(url)
html = response.content
type(html)

bytes

In [239]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

In [273]:
list_languages = []

for i in range(1,10):
    tags = parsed_html.find_all('div',{'class': f'central-featured-lang lang{i}'})
    t = str(tags)
    t= t.split("<strong>")
    t = t[1]
    t = t.split("<")
    t = t[0]
    n = str(tags)
    n = n.split('"ltr">')
    n = n[1]
    n = n.split("+")
    n = n[0]
    n = n.replace("\xa0","")
    n = n + " articles"
    element = t + " " + n
    list_languages.append(element)
    
list_languages

['English 6458000 articles',
 '日本語 1314000 articles',
 'Русский 1798000 articles',
 'Deutsch 2667000 articles',
 'Español 1755000 articles',
 'Français 2400000 articles',
 'Italiano 1742000 articles',
 '中文 1256000 articles',
 'Português 1085000 articles']
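
An alternative sketch that does not depend on how many featured languages there are: match every block by the shared `central-featured-lang` class and read the name and count from their own tags. The sample markup is a simplified assumption based on the strings the cell above splits on (`<strong>`, `dir="ltr"`, a trailing `+`).

```python
import re

import bs4

# Simplified, assumed markup; \xa0 is the non-breaking space used between digit groups.
sample_html = """
<div class="central-featured-lang lang1">
  <a href="//en.wikipedia.org/">
    <strong>English</strong>
    <small><bdi dir="ltr">6\xa0458\xa0000+</bdi> <span>articles</span></small>
  </a>
</div>
<div class="central-featured-lang lang2">
  <a href="//ja.wikipedia.org/">
    <strong>日本語</strong>
    <small><bdi dir="ltr">1\xa0314\xa0000+</bdi> <span>記事</span></small>
  </a>
</div>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

languages = []
for block in soup.find_all("div", class_="central-featured-lang"):
    name = block.find("strong").get_text(strip=True)
    # Drop everything that is not a digit (non-breaking spaces, the trailing "+")
    articles = int(re.sub(r"\D", "", block.find("bdi").get_text()))
    languages.append((name, articles))

print(languages)
```

Storing the count as an integer keeps it usable for sorting or arithmetic later.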

#### Create a list with the different kinds of datasets available on data.gov.uk.

In [3]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [4]:
# your code here
response = requests.get(url)
html = response.content
type(html)

bytes

In [5]:
parsed_html = bs4.BeautifulSoup(html, "html.parser") 

In [8]:
tags = parsed_html.find_all('a',{'class': 'govuk-link'})

lista_datasets = []
count = 0

for i in tags:
    # The first four govuk-link anchors are skipped; the topic links start after them
    if count<4:
        count+=1
    else:
        t = str(i)
        # The topic name is the value of the last query parameter in the href
        t = t.split("=")
        t = t[-1]
        t = t.split(">")
        t = t[0]
        # Turn "+" back into spaces and drop the closing quote of the href attribute
        t = t.replace("+"," ").replace('"',"")
        lista_datasets.append(t)
        
lista_datasets

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [45]:
# This is the url you will scrape in this exercise
import lxml
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [46]:
# your code here
table = pd.read_html(url)[0]
table

Unnamed: 0,Rank,Language,Native Speakers(millions),Percentageof world pop.(March 2019)[10],Language family,Branch
0,1,Mandarin Chinese,929.0,11.922%,Sino-Tibetan,Sinitic
1,2,Spanish,474.7,5.994%,Indo-European,Romance
2,3,English,372.9,4.922%,Indo-European,Germanic
3,4,Hindi (Sanskritised Hindustani)[11],343.9,4.429%,Indo-European,Indo-Aryan
4,5,Bengali,233.7,4.000%,Indo-European,Indo-Aryan
...,...,...,...,...,...,...
86,87,Czech,10.7,0.139%,Indo-European,Balto-Slavic
87,88,Taʽizzi-Adeni Arabic,10.5,0.136%,Afroasiatic,Semitic
88,89,Uyghur,10.4,0.135%,Turkic,Karluk
89,90,Eastern Min,10.3,0.134%,Sino-Tibetan,Sinitic
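
The output above shows the whole table (90 rows), while the exercise asks for the top 10. Because the table is already sorted by rank, `table.head(10)` would keep only those rows. The sketch below shows the same idea on a few rows copied from the output above, using `head(4)` so that it fits the sample; stripping the footnote markers is an optional extra.

```python
import pandas as pd

# A few rows copied from the table above.
table = pd.DataFrame(
    {
        "Rank": [1, 2, 3, 4, 5],
        "Language": [
            "Mandarin Chinese",
            "Spanish",
            "English",
            "Hindi (Sanskritised Hindustani)[11]",
            "Bengali",
        ],
        "Native Speakers(millions)": [929.0, 474.7, 372.9, 343.9, 233.7],
    }
)

# The table is already sorted by rank, so head(n) keeps the n most spoken languages.
top = table.head(4)
# Strip footnote markers such as "[11]" from the language names
top = top.assign(Language=top["Language"].str.replace(r"\[\d+\]", "", regex=True))
print(top)
```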


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [None]:
# your code here

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here
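
A sketch of the parsing step, assuming the API answers with the JSON layout documented for the current weather endpoint. The response body below is a made-up sample trimmed to the fields used; with `requests`, `response.json()` would give the same kind of dictionary.

```python
import json

# Hypothetical response body, trimmed to the fields used below; the values are made up.
sample_body = """
{
  "name": "Madrid",
  "weather": [{"main": "Clear", "description": "clear sky"}],
  "main": {"temp": 31.4},
  "wind": {"speed": 3.6}
}
"""

data = json.loads(sample_body)

report = {
    "city": data["name"],
    "temperature": data["main"]["temp"],  # degrees Celsius, because the URL asks for units=metric
    "wind_speed": data["wind"]["speed"],  # metres per second with units=metric
    "weather": data["weather"][0]["main"],
    "description": data["weather"][0]["description"],
}
print(report)
```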

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here
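
A possible starting point, sketched on an inline sample rather than the live site. The markup is an assumption modelled on the bookstore's listing page (one `<article class="product_pod">` per book); confirm the tag and class names in DevTools before using them.

```python
import bs4
import pandas as pd

# Assumed, simplified markup for two books on the listing page.
sample_html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
    In stock
  </p>
</article>
<article class="product_pod">
  <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
  <p class="instock availability">
    In stock
  </p>
</article>
"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

rows = []
for book in soup.find_all("article", class_="product_pod"):
    rows.append(
        {
            # The visible link text may be shortened, so read the full title attribute
            "Title": book.find("h3").find("a")["title"],
            "Price": book.find("p", class_="price_color").get_text(strip=True),
            "Availability": book.find("p", class_="availability").get_text(strip=True),
        }
    )

books = pd.DataFrame(rows)
print(books)
```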