# Web Scraping Lab

This notebook contains web scraping exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, class names, and other attributes used in the content you are expected to extract.
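
The first tip can be made routine with a small helper. The sketch below uses only the standard library's `http.HTTPStatus` to describe a status code; `describe_status` is a hypothetical helper name, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes outside the standard registry (for example 999) end up here.
        phrase = "Unknown status"
    outcome = "success" if 200 <= code < 300 else "check before parsing"
    return f"{code} {phrase} ({outcome})"


# With requests, the code would come from `response.status_code`.
print(describe_status(200))
print(describe_status(403))
```

Calling this right after each request makes it obvious when a page returned an error instead of the intended content.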

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. Feel free to use additional libraries if you prefer.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import bs4
import ssl
import sqlite3
import re
import mechanicalsoup
import html
import urllib.parse

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
developers_url = 'https://github.com/trending/developers'
print(developers_url)

https://github.com/trending/developers


#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.
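
The whitespace clean-up in step 3 can be sketched with plain string methods. The helper name `clean_name` and the sample strings are illustrative assumptions; real values would come from the text of each HTML element.

```python
def clean_name(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(raw.split())


# Illustrative raw strings, shaped like text taken from HTML elements.
raw_names = ["\n            Stephen Celis\n", "  atomiks  ", "Bo-Yi\n   Wu"]
clean_names = [clean_name(name) for name in raw_names]
print(clean_names)
```

Using `"".join(raw.split())` instead would remove the inner spaces as well, which is closer to the sample output shown in this exercise.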

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

In [3]:
# your code here
git_developers = requests.get(developers_url)
git_developers.status_code

200

In [4]:
type(git_developers)

requests.models.Response

In [5]:
# Flat text
# print(git_developers.text)

In [6]:
# Flat bytes
# git_developers.content

In [7]:
# Git developers headers
# git_developers.headers

In [8]:
# Request headers
# git_developers.request.headers

In [9]:
# Request method
git_developers.request.method

'GET'

In [10]:
# Check once more the requested url
git_developers.request.url

'https://github.com/trending/developers'

In [11]:
git_dev_html = git_developers.content
len(git_dev_html)

470115

In [12]:
parsed_git_dev_html = bs4.BeautifulSoup(git_dev_html, "html.parser")
# type(parsed_git_dev_html)

In [13]:
# Check html structure
print(parsed_git_dev_html.prettify())

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-a11y-link-underlines="true" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-0eace2597ca3.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-a167e256da9c.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://

In [14]:
# parsed_git_dev_html.head
# parsed_git_dev_html.body
# parsed_git_dev_html.title

In [15]:
# Find all "h1" tags with the class used for developer names:
articles = parsed_git_dev_html.find_all("h1", {"class": "h3 lh-condensed"})
articles

[<h1 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":658,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="8c1451b2cb5d954c4958014265dcbfc0252451fa59114d2d3436cb14f6c65700" data-view-component="true" href="/stephencelis">
             Stephen Celis
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":1148717,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="7f87bfdd30084fffd05911d217b92b80888d757af0891ad4fe294977cc7c744d" data-view-component="true" href="/emilk">
             Emil Erner

In [16]:
developers = []
for h1 in articles:  # Each item is an <h1> element wrapping the developer link
    # Look for the <a> tags directly in the parsed element (no need to re-parse it)
    for link in h1.find_all("a"):
        # get_text(strip=True) returns the link text without surrounding whitespace
        developers.append(link.get_text(strip=True))
print(developers)

['Stephen Celis', 'Emil Ernerfeldt', 'Tomek Zawadzki', 'Shyam Tawli', 'Simon Warta', 'atomiks', 'Kazuho Oku', 'oobabooga', 'Matthias Fey', 'Rich Harris', 'Yann Collet', 'Chris Maltby', 'Bae Junehyeon', 'Volodymyr Agafonkin', 'Nicolas Patry', 'Bo-Yi Wu', 'Rui Chen', 'tangly1024', 'Tianon Gravi', 'Scott Chacon', 'Eugene Yurtsev', 'Olivier Goffart', 'dgtlmoon', 'Claire', 'Muhun']
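
The same extraction can also be written with a CSS selector. This is a minimal sketch on a hypothetical inline snippet shaped like the `<h1>` elements shown above, so it runs without a network request; the selector assumes the class names stay `h3 lh-condensed`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the <h1> elements shown above.
sample_html = """
<h1 class="h3 lh-condensed">
  <a class="Link" href="/stephencelis">
            Stephen Celis
  </a> </h1>
<h1 class="h3 lh-condensed">
  <a class="Link" href="/atomiks">
            atomiks
  </a> </h1>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector: every <a> inside an <h1 class="h3 lh-condensed">.
name_links = soup.select("h1.h3.lh-condensed a")
developer_names = [link.get_text(strip=True) for link in name_links]
print(developer_names)
```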


#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of the developer names.

In [17]:
# This is the url you will scrape in this exercise
repo_url = 'https://github.com/trending/python?since=daily'
print(repo_url)

https://github.com/trending/python?since=daily


In [18]:
# your code here
git_repos = requests.get(repo_url)
git_repos.status_code

200

In [19]:
git_repos_html = git_repos.content
len(git_repos_html)

651861

In [20]:
parsed_git_repos_html = bs4.BeautifulSoup(git_repos_html, "html.parser")

In [21]:
repo_articles = parsed_git_repos_html.find_all("h2", {"class": "h3 lh-condensed"})
repo_articles

[<h2 class="h3 lh-condensed">
 <a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":738733003,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="914c705a07049ebfd4e3718fe3e107f471c994a5b62493f26d44ae8916ac6e02" data-view-component="true" href="/danielmiessler/fabric">
 <svg aria-hidden="true" class="octicon octicon-repo mr-1 color-fg-muted" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="M2 2.5A2.5 2.5 0 0 1 4.5 0h8.75a.75.75 0 0 1 .75.75v12.5a.75.75 0 0 1-.75.75h-2.5a.75.75 0 0 1 0-1.5h1.75v-2h-8a1 1 0 0 0-.714 1.7.75.75 0 1 1-1.072 1.05A2.495 2.495 0 0 1 2 11.5Zm10.5-1h-8a1 1 0 0 0-1 1v6.708A2.486 2.486 0 0 1 4.5 9h8ZM5 12.25a.25.25 0 0 1 .25-.25h3.5a.25.25 0 0 1 .25.25v3.25a.25.25 0 0 1-.4.2l-1.45-1.087a.

In [22]:
# repository = repo_articles[0].a.get('href')
# repository

In [23]:
repositories = []
for a in repo_articles:
    repository = a.find('a').get('href').strip()
    repositories.append(repository)
repositories

['/danielmiessler/fabric',
 '/InkboxSoftware/excelCPU',
 '/haotian-liu/LLaVA',
 '/InstantID/InstantID',
 '/AILab-CVC/YOLO-World',
 '/mlflow/mlflow',
 '/facebookresearch/codellama',
 '/leptonai/leptonai',
 '/PKU-YuanGroup/MoE-LLaVA',
 '/Fanghua-Yu/SUPIR',
 '/deepseek-ai/DeepSeek-Coder',
 '/facebookresearch/llama',
 '/stanfordnlp/dspy',
 '/X-PLUG/MobileAgent',
 '/ansible/ansible',
 '/lich0821/WeChatFerry',
 '/zaigie/palworld-server-tool',
 '/open-compass/opencompass',
 '/FlagOpen/FlagEmbedding',
 '/microsoft/sample-app-aoai-chatGPT',
 '/getredash/redash',
 '/pytorch/pytorch',
 '/vllm-project/vllm',
 '/ultralytics/ultralytics',
 '/pytorch/vision']
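
Each `href` has the form `/owner/name`, so it can be split into its parts if only the repository name is wanted. A small sketch, using a few values copied from the output above; the dictionary keys are illustrative choices.

```python
# A few hrefs copied from the output above.
hrefs = ["/danielmiessler/fabric", "/mlflow/mlflow", "/pytorch/vision"]

repos = []
for href in hrefs:
    # "/owner/name" -> "owner", "name"
    owner, name = href.strip("/").split("/", 1)
    repos.append({"owner": owner, "name": name, "full_name": f"{owner}/{name}"})

for repo in repos:
    print(repo["full_name"])
```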

#### Display all the image links from the Walt Disney Wikipedia page.

In [24]:
# This is the url you will scrape in this exercise
disney_url = 'https://en.wikipedia.org/wiki/Walt_Disney'
print(disney_url)

https://en.wikipedia.org/wiki/Walt_Disney


In [25]:
# your code here
disney_wiki = requests.get(disney_url)
disney_wiki.status_code

200

In [26]:
disney_wiki_html = disney_wiki.content
len(disney_wiki_html)

596111

In [27]:
parsed_disney_wiki_html = bs4.BeautifulSoup(disney_wiki_html, "html.parser")

In [28]:
disney_imgs = parsed_disney_wiki_html.find_all("a", {"class": "mw-file-description"})

In [29]:
img = disney_imgs[0]
img

<a class="mw-file-description" href="/wiki/File:Walt_Disney_1946.JPG"><img class="mw-file-element" data-file-height="675" data-file-width="450" decoding="async" height="330" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/330px-Walt_Disney_1946.JPG 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/440px-Walt_Disney_1946.JPG 2x" width="220"/></a>

In [30]:
# Images links
disney_img_links = []
for a in disney_imgs:
    img = a.find('img').get('src').strip()
    disney_img_links.append(img)
disney_img_links

['//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg/220px-Walt_Disney_Snow_white_1937_trailer_screenshot_%2813%29.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/1/15/Disney_drawing_goofy.jpg/170px-Disney_drawing_goofy.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/8c/WaltDisneyplansDisneylandDec1954.jpg/220px-WaltDisneyplansDisneylandDec1954.jpg',
 '//upload.wikimedia.org
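
The `src` values start with `//`, which means they are protocol-relative: they inherit the scheme of the page they appear on. A minimal sketch of turning one into a full URL with the standard library's `urljoin`, using a value copied from the output above:

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Walt_Disney"

# One src value copied from the output above.
src = (
    "//upload.wikimedia.org/wikipedia/commons/thumb/1/15/"
    "Disney_drawing_goofy.jpg/170px-Disney_drawing_goofy.jpg"
)

# A URL starting with "//" takes its scheme from the base URL.
absolute_src = urljoin(PAGE_URL, src)
print(absolute_src)
```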

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page.

In [31]:
# This is the url you will scrape in this exercise
python_url ='https://en.wikipedia.org/wiki/Python' 
print(python_url)

https://en.wikipedia.org/wiki/Python


In [32]:
# your code here
python_wiki = requests.get(python_url)
python_wiki.status_code

200

In [33]:
python_wiki_html = python_wiki.content
# len(python_wiki_html)
print(python_wiki_html)

b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Python - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-

In [34]:
parsed_python_wiki_html = bs4.BeautifulSoup(python_wiki_html, "html.parser")
# print(parsed_python_wiki_html)

In [35]:
import re

links = []
text = []

for link in parsed_python_wiki_html.find_all('a', {'href': re.compile("^/wiki/")}):
    text.append(link.text)
    links.append(link.get('href'))

In [36]:
print(len(links))
print(len(text))

61
61


In [37]:
# Note: starting at index 1 skips the first matched link.
for i in range(1, len(links)):
    # Each href already starts with "/", so the base URL has no trailing slash
    print(text[i] + '---> ' + 'https://en.wikipedia.org' + links[i])

Contents---> https://en.wikipedia.org/wiki/Wikipedia:Contents
Current events---> https://en.wikipedia.org/wiki/Portal:Current_events
Random article---> https://en.wikipedia.org/wiki/Special:Random
About Wikipedia---> https://en.wikipedia.org/wiki/Wikipedia:About
Help---> https://en.wikipedia.org/wiki/Help:Contents
Learn to edit---> https://en.wikipedia.org/wiki/Help:Introduction
Community portal---> https://en.wikipedia.org/wiki/Wikipedia:Community_portal
Recent changes---> https://en.wikipedia.org/wiki/Special:RecentChanges
Upload file---> https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard






---> https://en.wikipedia.org/wiki/Main_Page

Search
---> https://en.wikipedia.org/wiki/Special:Search
learn more---> https://en.wikipedia.org/wiki/Help:Introduction
Contributions---> https://en.wikipedia.org/wiki/Special:MyContributions
Talk---> https://en.wikipedia.org/wiki/Special:MyTalk
Article---> https://en.wikipedia.org/wiki/Python
Talk---> https://en.wikipedia.or
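
The list mixes article links with navigation pages such as `Help:` and `Special:`, and some targets appear more than once. A possible clean-up is sketched below on hypothetical `(text, href)` pairs shaped like the output above. Treating any colon in the title as a namespace marker is a heuristic assumption, since some article titles legitimately contain colons.

```python
# Hypothetical (text, href) pairs shaped like the output above.
pairs = [
    ("Help", "/wiki/Help:Contents"),
    ("Article", "/wiki/Python"),
    ("learn more", "/wiki/Help:Introduction"),
    ("Learn to edit", "/wiki/Help:Introduction"),
    ("Python", "/wiki/Python"),
    ("Monty Python", "/wiki/Monty_Python"),
]

seen = set()
article_links = []
for _text, href in pairs:
    title = href[len("/wiki/"):]
    # Heuristic (assumption): a colon marks a namespace page such as Help: or Special:.
    if ":" in title or href in seen:
        continue
    seen.add(href)
    article_links.append(href)

print(article_links)
```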

#### Find the number of titles that have changed in the United States Code since its last release point.

In [38]:
# This is the url you will scrape in this exercise
us_url = 'http://uscode.house.gov/download/download.shtml'
print(us_url)

http://uscode.house.gov/download/download.shtml


In [39]:
# your code here
us_code_url = requests.get(us_url)
us_code_url.status_code

200

In [40]:
us_code_house = us_code_url.content

In [41]:
parsed_us_code_house = bs4.BeautifulSoup(us_code_house, "html.parser")

In [42]:
us_num_updates = parsed_us_code_house.find_all("div", {"class": "usctitlechanged"})
us_num_updates

[<div class="usctitlechanged" id="us/usc/t25">
 
           Title 25 - Indians
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t26">
 
           Title 26 - Internal Revenue Code
 
         </div>,
 <div class="usctitlechanged" id="us/usc/t49">
 
           Title 49 - Transportation <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>,
 <div class="usctitlechanged" id="us/usc/t51">
 
           Title 51 - National and Commercial Space Programs <span class="footnote"><a class="fn" href="#fn">٭</a></span>
 </div>]

In [43]:
code = us_num_updates[1]
code.get_text()

'\n\n          Title 26 - Internal Revenue Code\n\n        '

In [44]:
title_changes = [title.get_text(strip=True) for title in us_num_updates]
title_changes

['Title 25 - Indians',
 'Title 26 - Internal Revenue Code',
 'Title 49 - Transportation٭',
 'Title 51 - National and Commercial Space Programs٭']
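
The exercise asks for the number of changed titles, which is the length of this list. The sketch below also strips a trailing footnote marker; the sample values are shaped like the list above, with `*` standing in for whatever marker character the page uses.

```python
import re

# Sample values shaped like the list above; "*" stands in for the footnote marker.
title_changes = [
    "Title 25 - Indians",
    "Title 26 - Internal Revenue Code",
    "Title 49 - Transportation*",
]

# Drop any trailing characters that are not letters, digits, or underscores.
clean_titles = [re.sub(r"\W+$", "", title) for title in title_changes]
num_changed = len(clean_titles)

print(f"Titles changed: {num_changed}")
print(clean_titles)
```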

#### Create a Python list with the names on the FBI's Top Ten Most Wanted list.

In [45]:
# This is the url you will scrape in this exercise
fbi_url = 'https://www.fbi.gov/wanted/topten'
print(fbi_url)

https://www.fbi.gov/wanted/topten


In [46]:
# your code here

fbi_most_wanted = requests.get(fbi_url)
fbi_most_wanted.status_code

'''I could not get access'''

'I could not get access'
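
The notebook does not record which status code came back here. One thing that may be worth trying when a site rejects a script is to send a browser-like `User-Agent` header. The sketch below only builds the request with the standard library, and nothing is sent. The header value is an illustrative assumption, and whether the site accepts it has not been verified.

```python
from urllib.request import Request

fbi_url = "https://www.fbi.gov/wanted/topten"

# Illustrative header value (an assumption), not a guaranteed fix.
headers = {"User-Agent": "Mozilla/5.0 (compatible; scraping-lab-exercise)"}

# Building the Request object does not contact the server.
request = Request(fbi_url, headers=headers)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
print(request.full_url)
```

With `requests`, the equivalent call would be `requests.get(fbi_url, headers=headers)`, followed by a check of `status_code`.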

####  Display the 20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe.

In [47]:
# This is the url you will scrape in this exercise
emsc_url = 'https://www.emsc-csem.org/Earthquake/'
print(emsc_url)

https://www.emsc-csem.org/Earthquake/


In [48]:
# your code here
emsc_quake = requests.get(emsc_url)
emsc_quake.status_code

200

In [49]:
print(emsc_quake.text)

<!DOCTYPE html>
<html lang="en"><head><meta charset="UTF-8"><meta name="google-site-verification" content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" /><meta name="msvalidate.01" content="BCAA3C04C41AE6E6AFAF117B9469C66F" /><meta name="y_key" content="43b36314ccb77957" /><meta name="robots" content="all" /><meta name="description"  lang="en" content="Get informed on the latest earthquakes occurred around the globe. earthquakes today - recent and latest earthquakes, earthquake map and earthquake information. Earthquake information for europe. EMSC (European Mediterranean Seismological Centre) provides real time earthquake information for seismic events with magnitude larger than 5 in the European Mediterranean area and larger than 7 in the rest of the world."/><meta property="fb:app_id" content="705855916142039"/><meta property="og:locale" content="en_FR"/><meta property="og:type" content="website"/><meta property="og:site_name" content="EMSC - European-Mediterranean Seismological Cen

In [50]:
emsc_quake.content

b'<!DOCTYPE html>\n<html lang="en"><head><meta charset="UTF-8"><meta name="google-site-verification" content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" /><meta name="msvalidate.01" content="BCAA3C04C41AE6E6AFAF117B9469C66F" /><meta name="y_key" content="43b36314ccb77957" /><meta name="robots" content="all" /><meta name="description"  lang="en" content="Get informed on the latest earthquakes occurred around the globe. earthquakes today - recent and latest earthquakes, earthquake map and earthquake information. Earthquake information for europe. EMSC (European Mediterranean Seismological Centre) provides real time earthquake information for seismic events with magnitude larger than 5 in the European Mediterranean area and larger than 7 in the rest of the world."/><meta property="fb:app_id" content="705855916142039"/><meta property="og:locale" content="en_FR"/><meta property="og:type" content="website"/><meta property="og:site_name" content="EMSC - European-Mediterranean Seismological 

In [51]:
emsc_quake.headers

{'Server': 'Apache/2.4.6 (CentOS) PHP/7.2.22', 'X-Powered-By': 'PHP/7.2.22', 'Cache-Control': 'max-age=120', 'ServerFred': '192.168.160.12', 'Keep-Alive': 'timeout=5, max=99', 'Content-Type': 'text/html; charset=UTF-8', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Connection': 'Keep-Alive', 'Date': 'Fri, 02 Feb 2024 16:12:52 GMT', 'Age': '16', 'Content-Length': '4684'}

In [52]:
emsc_quake.request.headers

{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}

In [53]:
parsed_emsc_quake = bs4.BeautifulSoup(emsc_quake.text, "html.parser")
# parsed_emsc_quake.prettify()

In [54]:
# Start from the header cells (th)
th_elements = parsed_emsc_quake.find_all('th',{"class": "tbdat"})

all_td_elements = []

# Iterate over th elements to find td elements
for th_element in th_elements:
    td_elements = th_element.find_all('td')
    # Extend the list with td_elements
    all_td_elements.extend(td_elements)

# Now all_td_elements contains all td elements
print(all_td_elements)

# Note: td (data) cells sit in their own rows, not inside th cells, so the list above is empty. Header example: <th class="tbdat">Date &amp; Time<div>UTC</div></th>

[]
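
The empty list is expected: in an HTML table, `<td>` cells are never children of `<th>` cells. Both live inside `<tr>` rows. The sketch below shows the difference on a hypothetical inline table; the class names `tbdat` and `tbreg` are the ones used in this notebook, and the cell values are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical table; only the class names come from the notebook.
sample_html = """
<table>
  <tr><th class="tbdat">Date</th><th class="tbreg">Region</th></tr>
  <tr><td>2024-02-02</td><td>REGION A</td></tr>
  <tr><td>2024-02-01</td><td>REGION B</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Searching inside header cells finds nothing: <td> is not nested in <th>.
nested = [td for th in soup.find_all("th") for td in th.find_all("td")]

# Walking the rows instead reaches the data cells.
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td>, so it is skipped
        rows.append(cells)

print(nested)
print(rows)
```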


# Checkpoint

In [13]:
import ssl
import sqlite3

browser = mechanicalsoup.StatefulBrowser()
browser.open('https://www.emsc-csem.org/Earthquake_information/')

# Extract table headers https://www.youtube.com/watch?v=MkGQmZoMuRM&ab_channel=PythonSimplified

th = browser.page.find_all('th',{'class':['tbdat','tblat','tblon','tbdep','tbmag','tbreg']})
distribution = [value.text for value in th]
pd.DataFrame(distribution).T # Transpose so the header labels form a single row

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | Date & TimeUTC | Lat.degrees | Lon.degrees | Depthkm | Mag.[+] | Region |
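
Labels such as `Date & TimeUTC` run together because `.text` joins the nested strings with nothing between them. A minimal sketch of the difference, using the header cell quoted earlier in the notebook:

```python
from bs4 import BeautifulSoup

# Header cell as quoted earlier in the notebook.
th_html = '<th class="tbdat">Date &amp; Time<div>UTC</div></th>'
th = BeautifulSoup(th_html, "html.parser").th

glued = th.text                        # nested strings joined with nothing between
spaced = th.get_text(" ", strip=True)  # nested strings joined with a single space

print(repr(glued))
print(repr(spaced))
```

Passing a separator to `get_text` keeps the unit readable as a separate word in each column label.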


In [56]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the EMSC website
emsc_url = 'https://www.emsc-csem.org/Earthquake/'

# Send the GET request to the website
emsc_quake = requests.get(emsc_url)

# Check the request status (200 means a successful response)
if emsc_quake.status_code == 200:
    emsc_quake_html = emsc_quake.content

    # Parse the HTML content
    parsed_emsc_quake_html = BeautifulSoup(emsc_quake_html, "html.parser")

    # Find the table that contains the earthquake information
    emsc_table = parsed_emsc_quake_html.find('table')

    # Create empty lists to store the information
    dates = []
    times = []
    latitudes = []
    longitudes = []
    regions = []

    # Iterate over the table rows (skip the first header row)
    for row in emsc_table.find_all('tr')[1:21]:  # Take the first 20 rows
        columns = row.find_all('td')

        # Check that the row has enough columns
        if len(columns) >= 12:
            # Extract the data from the required columns
            date = columns[0].text.strip()
            time = columns[1].text.strip()
            latitude = columns[4].text.strip()
            longitude = columns[5].text.strip()
            region = columns[11].text.strip()

            # Append the data to the lists
            dates.append(date)
            times.append(time)
            latitudes.append(latitude)
            longitudes.append(longitude)
            regions.append(region)

    # Build a pandas DataFrame from the collected data
    earthquake_data = pd.DataFrame({
        'Date': dates,
        'Time': times,
        'Latitude': latitudes,
        'Longitude': longitudes,
        'Region': regions
    })

    # Print the DataFrame
    print(earthquake_data)

else:
    print("Error fetching the page.")


Empty DataFrame
Columns: [Date, Time, Latitude, Longitude, Region]
Index: []


#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.
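
The try/except pattern can be sketched without touching the network by validating the handle first. The rule used here (1 to 15 letters, digits, or underscores) and the helper name `normalize_handle` are assumptions for illustration. A real solution would also need to handle the case where the request for a well-formed handle finds no account.

```python
import re

# Assumption: handles are 1-15 characters of letters, digits, or underscores.
HANDLE_PATTERN = re.compile(r"[A-Za-z0-9_]{1,15}")


def normalize_handle(raw: str) -> str:
    """Strip whitespace and a leading '@', then validate the handle."""
    handle = raw.strip().lstrip("@")
    if not HANDLE_PATTERN.fullmatch(handle):
        raise ValueError(f"not a valid handle: {raw!r}")
    return handle


# In the exercise the value would come from input(); fixed samples are used here.
for raw in ["@example_user", "  @some_user ", "not a handle!"]:
    try:
        print("https://twitter.com/" + normalize_handle(raw))
    except ValueError as error:
        print(f"Skipping input: {error}")
```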

In [57]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [58]:
# your code here

#### Number of followers of a given Twitter account
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [59]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [60]:
# your code here

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [61]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [62]:
# your code here

#### Display a list of the different kinds of datasets available on data.gov.uk.

In [63]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [64]:
# your code here

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [65]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [66]:
# your code here

## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [67]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [68]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [69]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [70]:
# your code here

#### Display the movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [71]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [72]:
# your code here

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'
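
Concatenating user input straight into the URL breaks down for city names that contain spaces or other special characters. A minimal sketch that encodes the query with the standard library's `urlencode`; `build_weather_url` is a hypothetical helper name and `YOUR_API_KEY` is a placeholder, not a working key.

```python
from urllib.parse import urlencode

BASE = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city: str, api_key: str) -> str:
    """Return the request URL with every query parameter encoded."""
    params = {"q": city, "APPID": api_key, "units": "metric"}
    return f"{BASE}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a working key.
print(build_weather_url("New York", "YOUR_API_KEY"))
```

With `requests`, passing the same dictionary as `params=` to `requests.get` performs the encoding automatically.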

In [None]:
# your code here

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here