### Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

Tips:

Check the response status code for each request to ensure you have obtained the intended contennt.
Print the response text in each request to understand the kind of info you are getting and its format.
Check for patterns in the response text to extract the data/info requested in each question.
Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.

Below are the libraries and modules you may need. requests, BeautifulSoup and pandas are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import re
# import scrapy

Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [3]:
webpage = requests.get(url).content
#webpage

In [4]:
soup = BeautifulSoup(webpage, "html")
soup

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" integrity="sha512-UXiu4O52iBFkqt6Kx5t+pqHYP2/LWWIw9+l5ia74TWw+xPzpH44BFfAQp7yzCe0XFGZa72Xiqyml6tox1KkUjw==" media="all" rel="stylesheet"/><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" integrity="sha512-IX1PnI5wWBz8Kgb1JI0f2QFa/WuRQQHJHe0vkKinQ

Display the names of the trending developers retrieved in the previous step.
Your output should be a Python list of developer names. Each name should not contain any html tag.

Instructions:

Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

Use BeautifulSoup to extract all the html elements that contain the developer names.

Use string manipulation techniques to replace whitespaces and linebreaks (i.e. \n) in the text of each html element. Use a list to store the clean names.

Print the list of names.

In [5]:
h3 = soup.find_all('h1', {'class': 'h3 lh-condensed'})
h3

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":4690128,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="32954dfd8fbcdfcefdee7e6a6925f949dafe893ec726ba4da57712466e05bee2" data-view-component="true" href="/homuler">
             Junrou Nishida
 </a> </h1>,
 <h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":515813,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="adaf10211a18914dfe233564e602becbc378317e6f14c903cc22be773dbcf63b" data-view-component="true" href="/sbc100">
             Sam Clegg
 </a> </h1>,
 <h1 class="h

In [6]:
names = [element.text for element in h3]
clean_names = [i.strip() for i in names]
clean_names

['Junrou Nishida',
 'Sam Clegg',
 'Stefan Prodan',
 'Tianon Gravi',
 'chencheng (‰∫ëË∞¶)',
 'Payton Swick',
 'Vasco Asturiano',
 'Kirill M√ºller',
 'Krasimir Tsonev',
 'Jason Quense',
 'Jan-Otto Kr√∂pke',
 'utam0k',
 'William Candillon',
 'Leigh McCulloch',
 'moxey.eth',
 "Na'aman Hirschfeld",
 'Steve Macenski',
 'Olivier Halligon',
 'Mattias Wadman',
 'Agniva De Sarker',
 'Felix Krause',
 'Nick Raienko',
 'Florian Rival',
 'Robin Cornett',
 'Dotan Simha']

Display the trending Python repositories in GitHub

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [17]:
# This is the url you will scrape in this exercise
url_2 = 'https://github.com/trending/python?since=daily'

In [18]:
webpage_2 = requests.get(url_2).content
#webpage_2

In [19]:
soup_2 = BeautifulSoup(webpage_2, "html")

In [20]:
h3_2 = soup_2.find_all('h1', {'class': 'h3 lh-condensed'})
h3_2

[<h1 class="h3 lh-condensed">
 <a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_REPOSITORIES_PAGE","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":524744625,"originating_url":"https://github.com/trending/python?since=daily","user_id":null}}' data-hydro-click-hmac="bd78bc11cfa3b94f4c5b69effa7bd3ea12a0004d5adaf18e8573855b6219767a" data-view-component="true" href="/CodeXTF2/Burp2Malleable">
 <svg aria-hidden="true" class="octicon octicon-repo mr-1 color-fg-muted" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
 <path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3

In [46]:
repos = [element.text for element in h3_2] 
clean_repos = [i.strip().replace('\n\n     ', '') for i in repos]
clean_repos

['CodeXTF2 / Burp2Malleable',
 'SinicaGroup / Class-agnostic-Few-shot-Object-Counting',
 'aiogram / aiogram',
 'jackfrued / Python-100-Days',
 'deepinsight / insightface',
 'OpenBB-finance / OpenBBTerminal',
 'kovidgoyal / kitty',
 'FDc0d3 / F-Tool',
 'google / jax',
 'hhyo / Archery',
 'aws / deep-learning-containers',
 'zulip / zulip',
 'open-mmlab / mmediting',
 'nccgroup / ScoutSuite',
 'open-mmlab / mmrotate',
 'apache / airflow',
 'rgerum / pylustrator',
 'Delgan / loguru',
 'TCM-Course-Resources / Practical-Ethical-Hacking-Resources',
 'pittcsc / Summer2023-Internships',
 'splunk / security_content',
 'openai / gym',
 'tgbot-collection / YYeTsBot',
 'adrienverge / yamllint']

Display all the image links from Walt Disney wikipedia page

In [73]:
# This is the url you will scrape in this exercise
url_3 = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [74]:
webpage_3 = requests.get(url_3).content

In [75]:
soup_3 = BeautifulSoup(webpage_3, "html")

In [81]:
img = [img.get('src') for img in soup_3.find_all('img')]
img

['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg/220px-Walt_Disney_Birthplace_Exterior_Hermosa_Chicago_Illinois.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 '//upload.wikime

Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page

In [82]:
# This is the url you will scrape in this exercise
url_4 ='https://en.wikipedia.org/wiki/Python'

In [4]:
def init_bs(url):
    webpage = requests.get(url).content
    return BeautifulSoup(webpage, "html")

In [97]:
soup_4 = init_bs('https://en.wikipedia.org/wiki/Python')
soup_4

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Python - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e9f42f06-898a-4508-beb1-69f47035a687","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Python","wgTitle":"Python","wgCurRevisionId":1104201971,"wgRevisionId":1104201971,"wgArticleId":46332325,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Disambiguation pages with short descriptions","Short description is different from Wikidata","All article disambiguation pages","All disambiguation pages","Animal common nam

In [117]:
import re
lista_links = []
for link in soup_4.find_all('a', attrs={'href': re.compile("^https://")}):
    lista_links.append(link.get('href'))
    
lista_links

['https://en.wiktionary.org/wiki/Python',
 'https://en.wiktionary.org/wiki/python',
 'https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Python&namespace=0',
 'https://en.wikipedia.org/w/index.php?title=Python&oldid=1104201971',
 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en',
 'https://www.wikidata.org/wiki/Special:EntityPage/Q747452',
 'https://commons.wikimedia.org/wiki/Category:Python',
 'https://af.wikipedia.org/wiki/Python',
 'https://als.wikipedia.org/wiki/Python',
 'https://ar.wikipedia.org/wiki/%D8%A8%D8%A7%D9%8A%D8%AB%D9%88%D9%86_(%D8%AA%D9%88%D8%B6%D9%8A%D8%AD)',
 'https://az.wikipedia.org/wiki/Python',
 'https://bn.wikipedia.org/wiki/%E0%A6%AA%E0%A6%BE%E0%A6%87%E0%A6%A5%E0%A6%A8_(%E0%A6%A6%E0%A7%8D%E0%A6%AC%E0%A7%8D%E0%A6%AF%E0%A6%B0%E0%A7%8D%E0%A6%A5%E0%A6%A4%E0%A6%BE_%E0%A6%A8%E0%A6%BF%E0%A6%B0%E0%A6%B8%E0%A6%A8)',
 'https://be.wikipedia.org/wiki/Python',
 'ht

Number of Titles that have changed in the United States Code since its last release point

In [119]:
# This is the url you will scrape in this exercise
url_5 = 'http://uscode.house.gov/download/download.shtml'

In [125]:
soup_5 = init_bs(url_5)
soup_5

<?xml version='1.0' encoding='UTF-8' ?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=8" http-equiv="X-UA-Compatible"/>
<meta content="no-cache" http-equiv="pragma"/><!-- HTTP 1.0 -->
<meta content="no-cache,must-revalidate" http-equiv="cache-control"/><!-- HTTP 1.1 -->
<meta content="0" http-equiv="expires"/>
<link href="/javax.faces.resource/favicon.ico.xhtml?ln=images" rel="shortcut icon"/><link href="/javax.faces.resource/cssLayout.css.xhtml?ln=css" rel="stylesheet" type="text/css"/><script src="/javax.faces.resource/jsf.js.xhtml?ln=javax.faces" type="text/javascript"></script><link href="/javax.faces.resource/static.css.xhtml?ln=css" rel="stylesheet" type="text/css"/></head><body style="display:none;"><script src="/javax.faces.resource/browserPreferences.js.xhtml?ln=script

In [200]:
title = soup_5.find_all('div', {'class': 'usctitle'})[-1].text
re.findall('\d+', title)

['54']

In [189]:
titles_alt = [i.text.strip() for i in soup_5.find_all('div', {'class': 'usctitle'})[-1]]
titles_alt = [re.findall('\d', i) for i in titles_alt]
titles_alt

['Title 54 - National Park Service and Related Programs', 'Ÿ≠', '']

In [180]:
titles = [i.text for i in soup_5.find_all('div', {'class': 'usctitle'})] 
#for i in titles:
#    print(re.findall('Title ?\d\d?', i))
titles = [re.findall('Title? \d\d?', i) for i in titles]
titles.remove(titles[0])
len(titles)

54

A Python list with the top ten FBI's Most Wanted names

In [202]:
# This is the url you will scrape in this exercise
url_6 = 'https://www.fbi.gov/wanted/topten'

In [204]:
soup_6 = init_bs(url_6)
soup_6

<!DOCTYPE html>
<html data-gridsystem="bs3" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="https://www.fbi.gov/wanted/topten" rel="canonical"/><meta content="summary_large_image" name="twitter:card"/>
<meta content="Ten Most Wanted Fugitives | Federal Bureau of Investigation" name="twitter:title"/>
<meta content="Federal Bureau of Investigation" property="og:site_name"/>
<meta content="Ten Most Wanted Fugitives | Federal Bureau of Investigation" property="og:title"/>
<meta content="website" property="og:type"/>
<meta content="@FBI" name="twitter:site"/>
<meta content="https://www.facebook.com/FBI" property="og:article:publisher"/>
<meta content="The FBI is offering rewards for information leading to the apprehension of the Ten Most Wanted Fugitives. Select the images of suspects to display more information." name="twitter:description"/>
<meta content="ht

In [206]:
names_fbi = [name.text for name in soup_6.find_all('h3', {'class': 'title'})]
names_fbi = [i.strip() for i in names_fbi]
names_fbi

['YULAN ADONAY ARCHAGA CARIAS',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'OMAR ALEXANDER CARDENAS',
 'ALEJANDRO ROSALES CASTILLO',
 'RUJA IGNATOVA',
 'JASON DEREK BROWN',
 'ARNOLDO JIMENEZ',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'RAFAEL CARO-QUINTERO']

20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [536]:
# This is the url you will scrape in this exercise
url_7 = 'https://www.emsc-csem.org/Earthquake/'

In [537]:
soup_7 = init_bs(url_7)
soup_7

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/">
<head><meta content="srFzNKBTd0FbRhtnzP--Tjxl01NfbscjYwkp4yOWuQY" name="google-site-verification"/><meta content="BCAA3C04C41AE6E6AFAF117B9469C66F" name="msvalidate.01"/><meta content="43b36314ccb77957" name="y_key"/><!-- 5-Clk8f50tFFdPTU97Bw7ygWE1A -->
<meta content="en" http-equiv="Content-Language"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="all" name="robots"/>
<meta content="earthquake,earthquakes,last earthquake,earthquake today,earthquakes today,earth quake,earth quakes,real time seismicity,seismic,seismicity,seismicity map,seismology,sismologie,EMSC,CSEM,seismicity on google earth,sumatra,tsunami,tsunamis,map,maps,richter,mercalli,moment tensors,epicenter,magnitude,seismology,foreshock,aftershock,tremor" name="keywo

In [550]:
earth_table = soup_7.find_all('table')[3]
table_h = earth_table.find_all('th')
table_h = [i.text for i in table_rows_h][1:]
table_h

['Date & Time UTC',
 'Latitude degrees',
 'Longitude degrees',
 'Depth km',
 'Mag  [+]',
 'Region name  [+]',
 'Last update [-]']

In [539]:
rows = earth_table.find("tbody").find_all("tr")

In [540]:
l = []
for tr in rows:
    td = tr.find_all('td')
    row = [str(tr.get_text()).strip() for tr in td]
    l.append(row[3:])

print(l)

[['earthquake2022-08-18\xa0\xa0\xa017:19:02.011min ago', '20.85', 'S', '69.32', 'W', '19', 'M', '4.2', 'TARAPACA, CHILE', '2022-08-18 17:22'], ['earthquake2022-08-18\xa0\xa0\xa016:21:00.01hr 09min ago', '19.20', 'N', '69.75', 'W', '23', 'M', '3.1', 'DOMINICAN REPUBLIC', '2022-08-18 17:16'], ['earthquake2022-08-18\xa0\xa0\xa015:56:54.51hr 33min ago', '36.26', 'N', '22.33', 'E', '2', 'ML', '3.2', 'SOUTHERN GREECE', '2022-08-18 16:41'], ['earthquake2022-08-18\xa0\xa0\xa015:52:24.61hr 37min ago', '37.53', 'N', '118.89', 'W', '4', 'Md', '2.1', 'CENTRAL CALIFORNIA', '2022-08-18 17:04'], ['earthquake2022-08-18\xa0\xa0\xa015:41:30.01hr 48min ago', '17.95', 'N', '66.82', 'W', '11', 'Md', '2.5', 'PUERTO RICO REGION', '2022-08-18 16:32'], ['earthquake2022-08-18\xa0\xa0\xa015:24:52.92hr 05min ago', '20.01', 'S', '133.96', 'E', '10', 'ML', '2.5', 'NORTHERN TERRITORY, AUSTRALIA', '2022-08-18 15:35'], ['earthquake2022-08-18\xa0\xa0\xa015:23:44.42hr 06min ago', '40.01', 'N', '27.69', 'E', '6', 'ML', '

In [541]:
for i in l:
    del i[6]

print(l)

[['earthquake2022-08-18\xa0\xa0\xa017:19:02.011min ago', '20.85', 'S', '69.32', 'W', '19', '4.2', 'TARAPACA, CHILE', '2022-08-18 17:22'], ['earthquake2022-08-18\xa0\xa0\xa016:21:00.01hr 09min ago', '19.20', 'N', '69.75', 'W', '23', '3.1', 'DOMINICAN REPUBLIC', '2022-08-18 17:16'], ['earthquake2022-08-18\xa0\xa0\xa015:56:54.51hr 33min ago', '36.26', 'N', '22.33', 'E', '2', '3.2', 'SOUTHERN GREECE', '2022-08-18 16:41'], ['earthquake2022-08-18\xa0\xa0\xa015:52:24.61hr 37min ago', '37.53', 'N', '118.89', 'W', '4', '2.1', 'CENTRAL CALIFORNIA', '2022-08-18 17:04'], ['earthquake2022-08-18\xa0\xa0\xa015:41:30.01hr 48min ago', '17.95', 'N', '66.82', 'W', '11', '2.5', 'PUERTO RICO REGION', '2022-08-18 16:32'], ['earthquake2022-08-18\xa0\xa0\xa015:24:52.92hr 05min ago', '20.01', 'S', '133.96', 'E', '10', '2.5', 'NORTHERN TERRITORY, AUSTRALIA', '2022-08-18 15:35'], ['earthquake2022-08-18\xa0\xa0\xa015:23:44.42hr 06min ago', '40.01', 'N', '27.69', 'E', '6', '2.1', 'WESTERN TURKEY', '2022-08-18 15:4

In [551]:
import pandas as pd

#columnas = table_rows_h
df = pd.DataFrame(l)
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,earthquake2022-08-18¬†¬†¬†17:19:02.011min ago,20.85,S,69.32,W,19,4.2,"TARAPACA, CHILE",2022-08-18 17:22
1,earthquake2022-08-18¬†¬†¬†16:21:00.01hr 09min ago,19.2,N,69.75,W,23,3.1,DOMINICAN REPUBLIC,2022-08-18 17:16
2,earthquake2022-08-18¬†¬†¬†15:56:54.51hr 33min ago,36.26,N,22.33,E,2,3.2,SOUTHERN GREECE,2022-08-18 16:41
3,earthquake2022-08-18¬†¬†¬†15:52:24.61hr 37min ago,37.53,N,118.89,W,4,2.1,CENTRAL CALIFORNIA,2022-08-18 17:04
4,earthquake2022-08-18¬†¬†¬†15:41:30.01hr 48min ago,17.95,N,66.82,W,11,2.5,PUERTO RICO REGION,2022-08-18 16:32
5,earthquake2022-08-18¬†¬†¬†15:24:52.92hr 05min ago,20.01,S,133.96,E,10,2.5,"NORTHERN TERRITORY, AUSTRALIA",2022-08-18 15:35
6,earthquake2022-08-18¬†¬†¬†15:23:44.42hr 06min ago,40.01,N,27.69,E,6,2.1,WESTERN TURKEY,2022-08-18 15:41
7,earthquake2022-08-18¬†¬†¬†14:52:24.02hr 37min ago,8.33,N,82.76,W,11,4.0,PANAMA-COSTA RICA BORDER REGION,2022-08-18 16:00
8,earthquake2022-08-18¬†¬†¬†14:47:00.72hr 43min ago,37.97,N,27.2,E,7,2.0,WESTERN TURKEY,2022-08-18 15:13
9,earthquake2022-08-18¬†¬†¬†14:46:57.02hr 43min ago,6.8,N,125.1,E,11,3.8,"MINDANAO, PHILIPPINES",2022-08-18 15:05


In [552]:
df[1] = df[1]+df[2]
df.drop(2, inplace=True, axis=1)

In [553]:
df[3] = df[3]+df[4]
df.drop(4, inplace=True, axis=1)
df

Unnamed: 0,0,1,3,5,6,7,8
0,earthquake2022-08-18¬†¬†¬†17:19:02.011min ago,20.85S,69.32W,19,4.2,"TARAPACA, CHILE",2022-08-18 17:22
1,earthquake2022-08-18¬†¬†¬†16:21:00.01hr 09min ago,19.20N,69.75W,23,3.1,DOMINICAN REPUBLIC,2022-08-18 17:16
2,earthquake2022-08-18¬†¬†¬†15:56:54.51hr 33min ago,36.26N,22.33E,2,3.2,SOUTHERN GREECE,2022-08-18 16:41
3,earthquake2022-08-18¬†¬†¬†15:52:24.61hr 37min ago,37.53N,118.89W,4,2.1,CENTRAL CALIFORNIA,2022-08-18 17:04
4,earthquake2022-08-18¬†¬†¬†15:41:30.01hr 48min ago,17.95N,66.82W,11,2.5,PUERTO RICO REGION,2022-08-18 16:32
5,earthquake2022-08-18¬†¬†¬†15:24:52.92hr 05min ago,20.01S,133.96E,10,2.5,"NORTHERN TERRITORY, AUSTRALIA",2022-08-18 15:35
6,earthquake2022-08-18¬†¬†¬†15:23:44.42hr 06min ago,40.01N,27.69E,6,2.1,WESTERN TURKEY,2022-08-18 15:41
7,earthquake2022-08-18¬†¬†¬†14:52:24.02hr 37min ago,8.33N,82.76W,11,4.0,PANAMA-COSTA RICA BORDER REGION,2022-08-18 16:00
8,earthquake2022-08-18¬†¬†¬†14:47:00.72hr 43min ago,37.97N,27.20E,7,2.0,WESTERN TURKEY,2022-08-18 15:13
9,earthquake2022-08-18¬†¬†¬†14:46:57.02hr 43min ago,6.80N,125.10E,11,3.8,"MINDANAO, PHILIPPINES",2022-08-18 15:05


In [554]:
df.columns = table_h
df.head(20)

Unnamed: 0,Date & Time UTC,Latitude degrees,Longitude degrees,Depth km,Mag [+],Region name [+],Last update [-]
0,earthquake2022-08-18¬†¬†¬†17:19:02.011min ago,20.85S,69.32W,19,4.2,"TARAPACA, CHILE",2022-08-18 17:22
1,earthquake2022-08-18¬†¬†¬†16:21:00.01hr 09min ago,19.20N,69.75W,23,3.1,DOMINICAN REPUBLIC,2022-08-18 17:16
2,earthquake2022-08-18¬†¬†¬†15:56:54.51hr 33min ago,36.26N,22.33E,2,3.2,SOUTHERN GREECE,2022-08-18 16:41
3,earthquake2022-08-18¬†¬†¬†15:52:24.61hr 37min ago,37.53N,118.89W,4,2.1,CENTRAL CALIFORNIA,2022-08-18 17:04
4,earthquake2022-08-18¬†¬†¬†15:41:30.01hr 48min ago,17.95N,66.82W,11,2.5,PUERTO RICO REGION,2022-08-18 16:32
5,earthquake2022-08-18¬†¬†¬†15:24:52.92hr 05min ago,20.01S,133.96E,10,2.5,"NORTHERN TERRITORY, AUSTRALIA",2022-08-18 15:35
6,earthquake2022-08-18¬†¬†¬†15:23:44.42hr 06min ago,40.01N,27.69E,6,2.1,WESTERN TURKEY,2022-08-18 15:41
7,earthquake2022-08-18¬†¬†¬†14:52:24.02hr 37min ago,8.33N,82.76W,11,4.0,PANAMA-COSTA RICA BORDER REGION,2022-08-18 16:00
8,earthquake2022-08-18¬†¬†¬†14:47:00.72hr 43min ago,37.97N,27.20E,7,2.0,WESTERN TURKEY,2022-08-18 15:13
9,earthquake2022-08-18¬†¬†¬†14:46:57.02hr 43min ago,6.80N,125.10E,11,3.8,"MINDANAO, PHILIPPINES",2022-08-18 15:05


Display the date, days, title, city, country of next 25 hackathon events as a Pandas dataframe table

Count number of tweets by a given Twitter account.

You will need to include a try/except block for account names not found.
Hint: the program should count the number of tweets for any provided account

In [8]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url_8 = 'https://twitter.com/rosalia'

In [9]:
handle = input('Input your account name on Twitter: ')

Input your account name on Twitter: rosalia


In [2]:
from selenium import webdriver

In [3]:
from selenium.webdriver.common.keys import Keys

In [4]:
from selenium.webdriver.chrome.service import Service

In [5]:
from webdriver_manager.chrome import ChromeDriverManager

In [6]:
import time

In [13]:
twitter = requests.get('https://twitter.com/rosalia').content
soup_8 = BeautifulSoup(twitter, "html")
soup_8

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head><meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" name="viewport"/><link href="//abs.twimg.com" rel="preconnect"/><link href="//abs.twimg.com" rel="dns-prefetch"/><link href="//api.twitter.com" rel="preconnect"/><link href="//api.twitter.com" rel="dns-prefetch"/><link href="//pbs.twimg.com" rel="preconnect"/><link href="//pbs.twimg.com" rel="dns-prefetch"/><link href="//t.co" rel="preconnect"/><link href="//t.co" rel="dns-prefetch"/><link href="//video.twimg.com" rel="preconnect"/><link href="//video.twimg.com" rel="dns-prefetch"/><link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/feature-switch-manifest.7104b318.js" nonce="NTEzMmRmZTItOTRhNy00NWM2LThlODQtZDU3MDA0ZGJmYmQ2" rel="preload"/><link as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/client-web-legacy/polyfills.54bbd058.js" 

In [24]:
#import os
#os.chmod('/Users/miguel/Ironhack/chromedriver 3', 755)

In [14]:
driver = webdriver.Chrome('/Users/miguel/Ironhack/chromedriver 3')
driver.get('https://twitter.com/rosalia')
time.sleep(4)
source = driver.page_source

driver.quit()

  driver = webdriver.Chrome('/Users/miguel/Ironhack/chromedriver 3')


In [11]:
from selenium.webdriver.common.by import By

In [13]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://twitter.com/rosalia')
#source = driver.page_source
time.sleep(7)
#soup_8 = BeautifulSoup(source, "html")
#tweets = [tweet.text for tweet in driver.find_element("div")]
driver.find_element(By.CLASS_NAME("css-901oao css-1hf3ou5"))
#tweets

driver.quit()


TypeError: 'str' object is not callable

In [49]:
print(tweets)

['\n\nJavaScript is not available.\nWe‚Äôve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using twitter.com. You can see a list of supported browsers in our Help Center.\nHelp Center\n\nTerms of Service\nPrivacy Policy\nCookie Policy\nImprint\nAds info\n      ¬© 2022 Twitter, Inc.\n    \n', 'Don‚Äôt miss what‚Äôs happeningPeople on Twitter are the first to know.Log inSign upR O S A L √ç A6,180 TweetsSee new TweetsFollowClick to Follow rosaliaR O S A L √ç A@rosaliaMOTOMAMIBarcelonaJoined September 201391 Following3.7M FollowersTweetsTweets & repliesMediaLikesR O S A L √ç A‚Äôs TweetsPinned TweetR O S A L √ç A@rosalia¬∑Jul 28DESPECH√Å OUT NOW VIDEO PRONTOOOO\n\nhttps://rosalia.lnk.to/DESPECHA1,0829,85974.7KR O S A L √ç A RetweetedYoko Ono @yokoono¬∑Aug 17I wish that you get everything you want in life. Wish well, and you will receive it tenfold. Wish badly and you will receive it tenfold.503931,920R O S A L √ç 

List all language names and number of related articles in the order they appear in wikipedia.org

In [3]:
# This is the url you will scrape in this exercise
url_9 = 'https://www.wikipedia.org/'

In [5]:
soup_9 = init_bs(url_9)
lang = soup_9.find_all('a')[:10]

In [6]:
names = [element.text.strip().replace('\n', ' ').replace('\xa0', ' ') for element in lang]
names

['English 6 458 000+ articles',
 'Êó•Êú¨Ë™û 1 314 000+ Ë®ò‰∫ã',
 'Espa√±ol 1 755 000+ art√≠culos',
 '–†—É—Å—Å–∫–∏–π 1 798 000+ —Å—Ç–∞—Ç–µ–π',
 'Deutsch 2 667 000+ Artikel',
 'Fran√ßais 2 400 000+ articles',
 'Italiano 1 742 000+ voci',
 '‰∏≠Êñá 1 256 000+ Êù°ÁõÆ / Ê¢ùÁõÆ',
 'Portugu√™s 1 085 000+ artigos',
 'ÿßŸÑÿπÿ±ÿ®Ÿäÿ© 1 159 000+ ŸÖŸÇÿßŸÑÿ©']

A list with the different kind of datasets available in data.gov.uk

In [710]:
# This is the url you will scrape in this exercise
url_10 = 'https://data.gov.uk/'

In [711]:
soup_10 = init_bs(url_10)
soup_10

<!DOCTYPE html>
<!--[if lt IE 9]><html class="lte-ie8" lang="en"><![endif]--><!--[if gt IE 8]><!--><html lang="en"><!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title>Find open data - data.gov.uk</title>
<meta content="#0b0c0c" name="theme-color"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="/find-assets/application-8b7545934ebe6ea0b37d1329e4ab1781289bbe3194049d0fcf13e5b605f3d694.css" media="screen" rel="stylesheet"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="J4cvbCmWz1JwbxMikECgddl7aZZ3yat6w9P-41GZtMvIZkIs0Jvm04co5wm26gyL5Hyp8JxnQGyy_Z7SB5i_jA" name="csrf-token"/>
</head><body class="govuk-template__body">
<script>document.body.className = ((document.body.className) ? document.body.className + ' js-enabled' : 'js-enabled');</script>
<div aria-label="cookie banner" class="gem-c-cookie-banner govuk-clearfix" data-module="cookie-banner" data-nosnippet="" id="global-cookie-message" role="region">
<div aria-label="Cookies

In [717]:
data = soup_10.find_all('h3')
data = [dat.text for dat in data]
data

['Business and economy',
 'Crime and justice',
 'Defence',
 'Education',
 'Environment',
 'Government',
 'Government spending',
 'Health',
 'Mapping',
 'Society',
 'Towns and cities',
 'Transport',
 'Digital service performance',
 'Government reference data']

Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [7]:
# This is the url you will scrape in this exercise
url_11 = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [8]:
soup_11 = init_bs(url_11)
soup_11

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of languages by number of native speakers - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"8e4b02dd-00df-4b9a-83a7-9e341e93f8ce","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_languages_by_number_of_native_speakers","wgTitle":"List of languages by number of native speakers","wgCurRevisionId":1102319770,"wgRevisionId":1102319770,"wgArticleId":405385,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is 

In [9]:
tables = soup_11.find_all('table', {'class': 'wikitable'})[0]
tables

<table class="wikitable sortable">
<caption>Languages with at least 10 million first-language speakers<sup class="reference" id="cite_ref-Ethnologue2022_9-1"><a href="#cite_note-Ethnologue2022-9">[9]</a></sup>
</caption>
<tbody><tr>
<th>Rank
</th>
<th>Language
</th>
<th>Native Speakers<br/><small>(millions)</small>
</th>
<th>Percentage<br/>of world pop.<br/><small>(March 2019)<sup class="reference" id="cite_ref-10"><a href="#cite_note-10">[10]</a></sup></small>
</th>
<th>Language family
</th>
<th>Branch
</th></tr>
<tr>
<td>1
</td>
<td><a href="/wiki/Mandarin_Chinese" title="Mandarin Chinese">Mandarin Chinese</a>
</td>
<td>929.0
</td>
<td>11.922%
</td>
<td><a href="/wiki/Sino-Tibetan_languages" title="Sino-Tibetan languages">Sino-Tibetan</a>
</td>
<td><a href="/wiki/Varieties_of_Chinese" title="Varieties of Chinese">Sinitic</a>
</td></tr>
<tr>
<td>2
</td>
<td><a href="/wiki/Spanish_language" title="Spanish language">Spanish</a>
</td>
<td>474.7
</td>
<td>5.994%
</td>
<td><a href="/wiki/I

In [10]:
rows = tables.find_all('tr')
rows_h = [i.text.strip().split('\n\n') for i in rows][0]
rows = [i.text.strip().split('\n\n') for i in rows][1:]
rows

[['1', 'Mandarin Chinese', '929.0', '11.922%', 'Sino-Tibetan', 'Sinitic'],
 ['2', 'Spanish', '474.7', '5.994%', 'Indo-European', 'Romance'],
 ['3', 'English', '372.9', '4.922%', 'Indo-European', 'Germanic'],
 ['4',
  'Hindi (Sanskritised Hindustani)[11]',
  '343.9',
  '4.429%',
  'Indo-European',
  'Indo-Aryan'],
 ['5', 'Bengali', '233.7', '4.000%', 'Indo-European', 'Indo-Aryan'],
 ['6', 'Portuguese', '232.4', '2.870%', 'Indo-European', 'Romance'],
 ['7', 'Russian', '154.0', '2.000%', 'Indo-European', 'Balto-Slavic'],
 ['8', 'Japanese', '125.3', '1.662%', 'Japonic', 'Japanese'],
 ['9', 'Western Punjabi[12]', '92.7', '1.204%', 'Indo-European', 'Indo-Aryan'],
 ['10', 'Yue Chinese', '85.2', '0.949%', 'Sino-Tibetan', 'Sinitic'],
 ['11', 'Vietnamese', '84.6', '0.987%', 'Austroasiatic', 'Vietic'],
 ['12', 'Marathi', '83.1', '1.079%', 'Indo-European', 'Indo-Aryan'],
 ['13', 'Telugu', '82.0', '1.065%', 'Dravidian', 'South-Central'],
 ['14', 'Turkish', '82.2', '1.031%', 'Turkic', 'Oghuz'],
 ['1

In [11]:
rows_h

['Rank',
 'Language',
 'Native Speakers(millions)',
 'Percentageof world pop.(March 2019)[10]',
 'Language family',
 'Branch']

In [12]:
wiki_df = pd.DataFrame(rows, columns=rows_h)
wiki_df.head(10)

Unnamed: 0,Rank,Language,Native Speakers(millions),Percentageof world pop.(March 2019)[10],Language family,Branch
0,1,Mandarin Chinese,929.0,11.922%,Sino-Tibetan,Sinitic
1,2,Spanish,474.7,5.994%,Indo-European,Romance
2,3,English,372.9,4.922%,Indo-European,Germanic
3,4,Hindi (Sanskritised Hindustani)[11],343.9,4.429%,Indo-European,Indo-Aryan
4,5,Bengali,233.7,4.000%,Indo-European,Indo-Aryan
5,6,Portuguese,232.4,2.870%,Indo-European,Romance
6,7,Russian,154.0,2.000%,Indo-European,Balto-Slavic
7,8,Japanese,125.3,1.662%,Japonic,Japanese
8,9,Western Punjabi[12],92.7,1.204%,Indo-European,Indo-Aryan
9,10,Yue Chinese,85.2,0.949%,Sino-Tibetan,Sinitic
