# Web Scraping Lab

You will find in this notebook some scrapy exercises to practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended contennt.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit each url and take a look at its source through Chrome DevTools. You'll need to identify the html tags, special class names etc. used for the html content you are expected to extract.

- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide) documentation 
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`,  `BeautifulSoup` and `pandas` are imported for you. If you prefer to use additional libraries feel free to uncomment them.

In [39]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import regex
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import re
# import scrapy

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [2]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [9]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [10]:
content


<!DOCTYPE html>

<html data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
<link href="https://avatars.githubusercontent.com" rel="preconnect"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-813262e6aaf2a511d6a4b5ec319417a0.css" integrity="sha512-gTJi5qrypRHWpLXsMZQXoL53mXDuVqfZc7AfuiFXreLhf7Pk1RMvXJMWJsiS8dpkFDfq/7t6bFZK+3xS1Ak+Lg==" media="all" rel="stylesheet"><link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-08c766d1eb354e2e3a66e15c28acfe8e.css" integrity="sha512-CMdm0es1Ti46ZuFcKKz+jobtyuFMFz3OIWx

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to replace whitespaces and linebreaks (i.e. `\n`) in the *text* of each html element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```

In [28]:
#your code
#Primero localizo los titulos donde estan los nombres
nombres=content.findAll("h1",attrs={"class":"h3 lh-condensed"})

In [37]:
#extraigo los nombres
nombres_sucio=[]
for nombre in nombres:
    nombres_sucio.append(nombre.text)

In [59]:
#Quito espacios con strip()
nombres_limpio=[]
for nombre in nombres_sucio:
    nombres_limpio.append(nombre.strip())
    

In [60]:
nombres_limpio

['Taner Şener',
 'William Boman',
 'Rico Suter',
 'angus croll',
 'Drew Powers',
 'bdring',
 'Stefan Prodan',
 'Eric Holscher',
 'Jared Palmer',
 'Daniel Vaz Gaspar',
 '陈帅',
 'afc163',
 'Martin Atkins',
 'Fatih Arslan',
 'Anthony Shaw',
 'Péter Szilágyi',
 'Saleem Abdulrasool',
 'Matt Glaman',
 'patak',
 'Felix Angelov',
 'DIYgod',
 'Tom Preston-Werner',
 'Andrew Kane',
 '文翼',
 'Toni de la Fuente']

#### Display the trending Python repositories in GitHub

The steps to solve this problem is similar to the previous one except that you need to find out the repository names instead of developer names.

In [61]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [62]:
#your code
resp=requests.get(url,"html.parser")
content=BeautifulSoup(resp.content)

In [90]:
#llego directamente a los titulos
repositorios=content.findAll("h1",attrs={"class":"h3 lh-condensed"})

In [104]:
#creo lista y lo limpio directamente
trend_repos=[]
for i in range(len(repositorios)):
    trend_repos.append((repositorios[i].text).strip().replace("\n","").replace("  ",""))

In [105]:
trend_repos

['huggingface /datasets',
 'Python-World /python-mini-projects',
 'bregman-arie /devops-exercises',
 'apache /superset',
 'ansible /ansible',
 'jackfrued /Python-100-Days',
 'RasaHQ /rasa',
 'locustio /locust',
 'facebookresearch /demucs',
 'davidbombal /red-python-scripts',
 'joke2k /django-environ',
 'xiangmingzhe0928 /hpv4g',
 'gto76 /python-cheatsheet',
 'pytorch /fairseq',
 'fofapro /fapro',
 'nccgroup /ScoutSuite',
 'tiangolo /fastapi',
 'huggingface /transformers',
 'mikel-brostrom /Yolov5_DeepSort_Pytorch',
 'cupy /cupy',
 'yqchilde /JDMemberCloseAccount',
 'samuelcolvin /pydantic',
 'aws /aws-cli',
 'giampaolo /psutil',
 'MIC-DKFZ /nnUNet']

#### Display all the image links from Walt Disney wikipedia page

In [106]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [108]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [125]:
#primero busco todas las imagenes
imagenes=content.findAll("img")

In [129]:
#creo una lista y añado los links
links=[]
for imagen in imagenes:
    links.append(imagen["src"])

In [130]:
links

['//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 '//upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/150px-Walt_Disney_1942_signature.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 '//upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/5/57/

#### Retrieve an arbitary Wikipedia page of "Python" and create a list of links on that page

In [131]:
# This is the url you will scrape in this exercise
url ='https://en.wikipedia.org/wiki/Python' 

In [132]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [136]:
#Los links siempre van con la etiqueta a
links=content.findAll("a")

In [141]:
lista_links=[]
for link in links:
    lista_links.append(link.text)

In [142]:
lista_links

['',
 'Jump to navigation',
 'Jump to search',
 'Python',
 'python',
 'Pythonidae',
 'Python (genus)',
 '1 Computing',
 '2 People',
 '3 Roller coasters',
 '4 Vehicles',
 '5 Weaponry',
 '6 Other uses',
 '7 See also',
 'edit',
 'Python (programming language)',
 'CMU Common Lisp',
 'PERQ 3',
 'edit',
 'Python of Aenus',
 'Python (painter)',
 'Python of Byzantium',
 'Python of Catana',
 'Python Anghelo',
 'edit',
 'Python (Efteling)',
 'Python (Busch Gardens Tampa Bay)',
 'Python (Coney Island, Cincinnati, Ohio)',
 'edit',
 'Python (automobile maker)',
 'Python (Ford prototype)',
 'edit',
 'Python (missile)',
 'Python (nuclear primary)',
 'Colt Python',
 'edit',
 'PYTHON',
 'Python (film)',
 'Python (mythology)',
 'Monty Python',
 'Python (Monty) Pictures',
 'edit',
 'Cython',
 'Pyton',
 'Pithon',
 '',
 'disambiguation',
 'internal link',
 'https://en.wikipedia.org/w/index.php?title=Python&oldid=1048703433',
 'Categories',
 'Disambiguation pages',
 'Human name disambiguation pages',
 'Disa

#### Number of Titles that have changed in the United States Code since its last release point 

In [143]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [144]:
#your code- Titles in bold have been changed since the last release point.
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [146]:
titles_change=content.findAll("div",attrs={"class":"usctitlechanged"})

In [152]:
for titulos in titles_change:
    print((titulos.text).strip())
print(f'{len(titles_change)} titulos han cambiado')

Title 21 - Food and Drugs
Title 22 - Foreign Relations and Intercourse
Title 47 - Telecommunications
Title 50 - War and National Defense
4 titulos han cambiado


#### A Python list with the top ten FBI's Most Wanted names 

In [153]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [154]:
#your code 
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [158]:
#lista de nombres
mas_buscados=content.findAll("h3",attrs={"class":"title"})

In [161]:
wanted=[]
for delincuente in mas_buscados:
    wanted.append((delincuente.text).replace("\n",""))

In [162]:
wanted

['YULAN ADONAY ARCHAGA CARIAS',
 'EUGENE PALMER',
 'BHADRESHKUMAR CHETANBHAI PATEL',
 'ALEJANDRO ROSALES CASTILLO',
 'ARNOLDO JIMENEZ',
 'JASON DEREK BROWN',
 'ALEXIS FLORES',
 'JOSE RODOLFO VILLARREAL-HERNANDEZ',
 'OCTAVIANO JUAREZ-CORRO',
 'RAFAEL CARO-QUINTERO']

####  20 latest earthquakes info (date, time, latitude, longitude and region name) by the EMSC as a pandas dataframe

In [163]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [164]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [195]:
#el tiempo tengo que dividirlo entre date y time
dates=content.findAll("td",attrs={"class":"tabev6"})
latitude=content.findAll("td",attrs={"class":"tabev1"})
longitude=content.findAll("td",attrs={"class":"tabev6"})

In [223]:
#lo voy a hacer a partir de la fila
filasImpares=content.findAll("tr",attrs={"class":"ligne1 normal"})[:10]
filasPares=content.findAll("tr",attrs={"class":"ligne2 normal"})[:10]
#ultimos 20


In [304]:
terrem=[]
for i in filasImpares:
    terremoto={
        "fecha":i.find("a").text.split()[0],
        "hora":i.find("a").text.split()[1],
        "latitud":i.findAll("td",attrs={"class":"tabev1"})[0].text,
        "longitud":i.findAll("td",attrs={"class":"tabev1"})[1].text,
        "region":i.findAll("td",attrs={"class":"tb_region"})[0].text
    }
    terrem.append(terremoto)


In [321]:
pd.DataFrame(terr)

Unnamed: 0,fecha,latitud,longitud,region
0,2021-11-13,28.56,17.84,"CANARY ISLANDS, SPAIN REGION"
1,2021-11-13,35.51,3.68,STRAIT OF GIBRALTAR
2,2021-11-13,57.84,32.65,REYKJANES RIDGE
3,2021-11-13,37.2,4.55,SPAIN
4,2021-11-13,41.43,20.13,ALBANIA
5,2021-11-13,34.89,24.24,"CRETE, GREECE"
6,4,37.12,3.61,SPAIN
7,2021-11-13,10.93,86.67,OFF COAST OF COSTA RICA
8,2021-11-13,17.94,66.83,PUERTO RICO REGION
9,2021-11-13,26.22,93.01,"ASSAM, INDIA"


#### Count number of tweets by a given Twitter account.

You will need to include a ***try/except block*** for account names not found. 
<br>***Hint:*** the program should count the number of tweets for any provided account

In [323]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
#your code

#### Number of followers of a given twitter account

You will need to include a ***try/except block*** in case account/s name not found. 
<br>***Hint:*** the program should count the followers for any provided account

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
#your code

#### List all language names and number of related articles in the order they appear in wikipedia.org

In [324]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [325]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [336]:
centro=content.findAll("div",attrs={"class":"central-featured-lang"})

In [350]:
lista=[]
for idioma in centro:
    dic={
        "idioma":idioma.find("strong").text,
        "numero_articulos":idioma.find("bdi").text
    }
    lista.append(dic)

In [351]:
pd.DataFrame(lista)

Unnamed: 0,idioma,numero_articulos
0,English,6 383 000+
1,日本語,1 292 000+
2,Русский,1 756 000+
3,Deutsch,2 617 000+
4,Español,1 717 000+
5,Français,2 362 000+
6,中文,1 231 000+
7,Italiano,1 718 000+
8,Português,1 074 000+
9,Polski,1 490 000+


#### A list with the different kind of datasets available in data.gov.uk 

In [352]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [353]:
#your code 
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [356]:
nombres=content.findAll("h3",attrs={"class":"govuk-heading-s"})

In [357]:
for nombre in nombres:
    print(nombre.text)

Business and economy
Crime and justice
Defence
Education
Environment
Government
Government spending
Health
Mapping
Society
Towns and cities
Transport
Digital service performance
Government reference data


#### Top 10 languages by number of native speakers stored in a Pandas Dataframe

In [358]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [359]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [364]:
#creo una lista con las 10 primeras filas de idiomas
idiomas=content.findAll("tr")[2:12]

In [368]:
for idioma in idiomas:
    print(idioma.find("a").text)

Mandarin Chinese
Spanish
English
Hindi
Bengali
Portuguese
Russian
Japanese
Western Punjabi
Marathi


### BONUS QUESTIONS

#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code

#### IMDB's Top 250 data (movie name, Initial release, director name and stars) as a pandas dataframe

In [369]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [370]:
# your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [380]:
filas=content.findAll("tr")[1:]

In [404]:
filas[0].findAll("a")[1]["title"]

'Frank Darabont (dir.), Tim Robbins, Morgan Freeman'

In [450]:
peliculas=[]
for i in filas:
    dic={
        "pelicula":(" ".join(i.find("td",attrs={"class":"titleColumn"}).text.split())),
        "director y estrellas": i.findAll("a")[1]["title"],
        "estreno":i.findAll("span",attrs={"class":"secondaryInfo"}),
        "puntuacion":i.findAll("strong")
    }
    peliculas.append(dic)

In [451]:
#Queda pendiente la limpieza de las ultimas dos columnas!
pd.DataFrame(peliculas)

Unnamed: 0,pelicula,director y estrellas,estreno,puntuacion
0,1. Cadena perpetua (1994),"Frank Darabont (dir.), Tim Robbins, Morgan Fre...",[[(1994)]],[[9.2]]
1,2. El padrino (1972),"Francis Ford Coppola (dir.), Marlon Brando, Al...",[[(1972)]],[[9.1]]
2,3. El padrino: Parte II (1974),"Francis Ford Coppola (dir.), Al Pacino, Robert...",[[(1974)]],[[9.0]]
3,4. El caballero oscuro (2008),"Christopher Nolan (dir.), Christian Bale, Heat...",[[(2008)]],[[9.0]]
4,5. 12 hombres sin piedad (1957),"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",[[(1957)]],[[8.9]]
...,...,...,...,...
245,246. La princesa prometida (1987),"Rob Reiner (dir.), Cary Elwes, Mandy Patinkin",[[(1987)]],[[8.0]]
246,247. Las noches de Cabiria (1957),"Federico Fellini (dir.), Giulietta Masina, Fra...",[[(1957)]],[[8.0]]
247,"248. Paris, Texas (1984)","Wim Wenders (dir.), Harry Dean Stanton, Nastas...",[[(1984)]],[[8.0]]
248,249. Tres colores: Rojo (1994),"Krzysztof Kieslowski (dir.), Irène Jacob, Jean...",[[(1994)]],[[8.0]]


#### Movie name, year and a brief summary of the top 10 random movies (IMDB) as a pandas dataframe.

In [418]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [419]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [426]:
peliculas=content.findAll("td",attrs={"class":"titleColumn"})[:10]

In [444]:
#busco los resumenes en cada una de sus paginas
resumen=[]
new_url="https://www.imdb.com"
for titulo in peliculas:
    resp=requests.get(f'{new_url}{titulo.find("a")["href"]}' )
    content=BeautifulSoup(resp.content,"html.parser")
    resumen.append(content.find("p",attrs={"class":"GenresAndPlot__Plot-cum89p-6"}).text)



In [453]:
pelis=pd.DataFrame(peliculas[:10])

In [455]:
pelis["resumen"]=resumen

In [456]:
pelis

Unnamed: 0,pelicula,director y estrellas,estreno,puntuacion,resumen
0,1. Cadena perpetua (1994),"Frank Darabont (dir.), Tim Robbins, Morgan Fre...",[[(1994)]],[[9.2]],Two imprisoned men bond over a number of years...
1,2. El padrino (1972),"Francis Ford Coppola (dir.), Marlon Brando, Al...",[[(1972)]],[[9.1]],The aging patriarch of an organized crime dyna...
2,3. El padrino: Parte II (1974),"Francis Ford Coppola (dir.), Al Pacino, Robert...",[[(1974)]],[[9.0]],The early life and career of Vito Corleone in ...
3,4. El caballero oscuro (2008),"Christopher Nolan (dir.), Christian Bale, Heat...",[[(2008)]],[[9.0]],When the menace known as the Joker wreaks havo...
4,5. 12 hombres sin piedad (1957),"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",[[(1957)]],[[8.9]],The jury in a New York City murder trial is fr...
5,6. La lista de Schindler (1993),"Steven Spielberg (dir.), Liam Neeson, Ralph Fi...",[[(1993)]],[[8.9]],"In German-occupied Poland during World War II,..."
6,7. El señor de los anillos: El retorno del rey...,"Peter Jackson (dir.), Elijah Wood, Viggo Morte...",[[(2003)]],[[8.9]],Gandalf and Aragorn lead the World of Men agai...
7,8. Pulp Fiction (1994),"Quentin Tarantino (dir.), John Travolta, Uma T...",[[(1994)]],[[8.8]],"The lives of two mob hitmen, a boxer, a gangst..."
8,"9. El bueno, el feo y el malo (1966)","Sergio Leone (dir.), Clint Eastwood, Eli Wallach",[[(1966)]],[[8.8]],A bounty hunting scam joins two men in an unea...
9,10. El señor de los anillos: La comunidad del ...,"Peter Jackson (dir.), Elijah Wood, Ian McKellen",[[(2001)]],[[8.8]],A meek Hobbit from the Shire and eight compani...


#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [458]:
#https://openweathermap.org/current
city = city=input('Enter the city:')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

Enter the city:London


In [459]:
# your code
resp=requests.get(url).json()


In [460]:
resp

{'cod': 401,
 'message': 'Invalid API key. Please see http://openweathermap.org/faq#error401 for more info.'}

In [None]:
#pide api key

#### Book name,price and stock availability as a pandas dataframe.

In [517]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [518]:
#your code
resp=requests.get(url)
content=BeautifulSoup(resp.content,"html.parser")

In [526]:
#Cojo los enlaces de cada una de las paginas
lista=[]
next_page=requests.get(url)
cont=0
while cont<=51:
    lista.append(content.findAll("h3"))
    next_page=requests.get(f'{url}{content.findAll("li",attrs={"class":"next"})[0].findAll("a")[0]["href"]}' )
    cont+=1
    print(f"pagina scrapeada {cont}")


pagina scrapeada 1
pagina scrapeada 2
pagina scrapeada 3
pagina scrapeada 4
pagina scrapeada 5
pagina scrapeada 6
pagina scrapeada 7
pagina scrapeada 8
pagina scrapeada 9
pagina scrapeada 10
pagina scrapeada 11
pagina scrapeada 12
pagina scrapeada 13
pagina scrapeada 14
pagina scrapeada 15
pagina scrapeada 16
pagina scrapeada 17
pagina scrapeada 18
pagina scrapeada 19
pagina scrapeada 20
pagina scrapeada 21
pagina scrapeada 22
pagina scrapeada 23
pagina scrapeada 24
pagina scrapeada 25
pagina scrapeada 26
pagina scrapeada 27
pagina scrapeada 28
pagina scrapeada 29
pagina scrapeada 30
pagina scrapeada 31
pagina scrapeada 32
pagina scrapeada 33
pagina scrapeada 34
pagina scrapeada 35
pagina scrapeada 36
pagina scrapeada 37
pagina scrapeada 38
pagina scrapeada 39
pagina scrapeada 40
pagina scrapeada 41
pagina scrapeada 42
pagina scrapeada 43
pagina scrapeada 44
pagina scrapeada 45
pagina scrapeada 46
pagina scrapeada 47
pagina scrapeada 48
pagina scrapeada 49
pagina scrapeada 50
pagina sc

In [532]:
#scrapeo cada uno de los libros de antes
cont=0
books=[]
for n in range(len(lista)):
    for i in lista[n]:
        new_url=f'{url}{i.find("a")["href"]}' 
        resp=requests.get(new_url)
        content=BeautifulSoup(resp.content,"html.parser")
        dic={
            "titulo": content.find("h1").text,
            "precio":content.find("p",attrs={"class":"price_color"}),
            "stock":content.find("p",attrs={"class":"instock availability"})
                }
        books.append(dic)
        cont+=1
        print(f"libro {cont} scrapeado")

libro 1 scrapeado
libro 2 scrapeado
libro 3 scrapeado
libro 4 scrapeado
libro 5 scrapeado
libro 6 scrapeado
libro 7 scrapeado
libro 8 scrapeado
libro 9 scrapeado
libro 10 scrapeado
libro 11 scrapeado
libro 12 scrapeado
libro 13 scrapeado
libro 14 scrapeado
libro 15 scrapeado
libro 16 scrapeado
libro 17 scrapeado
libro 18 scrapeado
libro 19 scrapeado
libro 20 scrapeado
libro 21 scrapeado
libro 22 scrapeado
libro 23 scrapeado
libro 24 scrapeado
libro 25 scrapeado
libro 26 scrapeado
libro 27 scrapeado
libro 28 scrapeado
libro 29 scrapeado
libro 30 scrapeado
libro 31 scrapeado
libro 32 scrapeado
libro 33 scrapeado
libro 34 scrapeado
libro 35 scrapeado
libro 36 scrapeado
libro 37 scrapeado
libro 38 scrapeado
libro 39 scrapeado
libro 40 scrapeado
libro 41 scrapeado
libro 42 scrapeado
libro 43 scrapeado
libro 44 scrapeado
libro 45 scrapeado
libro 46 scrapeado
libro 47 scrapeado
libro 48 scrapeado
libro 49 scrapeado
libro 50 scrapeado
libro 51 scrapeado
libro 52 scrapeado
libro 53 scrapeado
li

libro 417 scrapeado
libro 418 scrapeado
libro 419 scrapeado
libro 420 scrapeado
libro 421 scrapeado
libro 422 scrapeado
libro 423 scrapeado
libro 424 scrapeado
libro 425 scrapeado
libro 426 scrapeado
libro 427 scrapeado
libro 428 scrapeado
libro 429 scrapeado
libro 430 scrapeado
libro 431 scrapeado
libro 432 scrapeado
libro 433 scrapeado
libro 434 scrapeado
libro 435 scrapeado
libro 436 scrapeado
libro 437 scrapeado
libro 438 scrapeado
libro 439 scrapeado
libro 440 scrapeado
libro 441 scrapeado
libro 442 scrapeado
libro 443 scrapeado
libro 444 scrapeado
libro 445 scrapeado
libro 446 scrapeado
libro 447 scrapeado
libro 448 scrapeado
libro 449 scrapeado
libro 450 scrapeado
libro 451 scrapeado
libro 452 scrapeado
libro 453 scrapeado
libro 454 scrapeado
libro 455 scrapeado
libro 456 scrapeado
libro 457 scrapeado
libro 458 scrapeado
libro 459 scrapeado
libro 460 scrapeado
libro 461 scrapeado
libro 462 scrapeado
libro 463 scrapeado
libro 464 scrapeado
libro 465 scrapeado
libro 466 scrapeado


libro 827 scrapeado
libro 828 scrapeado
libro 829 scrapeado
libro 830 scrapeado
libro 831 scrapeado
libro 832 scrapeado
libro 833 scrapeado
libro 834 scrapeado
libro 835 scrapeado
libro 836 scrapeado
libro 837 scrapeado
libro 838 scrapeado
libro 839 scrapeado
libro 840 scrapeado
libro 841 scrapeado
libro 842 scrapeado
libro 843 scrapeado
libro 844 scrapeado
libro 845 scrapeado
libro 846 scrapeado
libro 847 scrapeado
libro 848 scrapeado
libro 849 scrapeado
libro 850 scrapeado
libro 851 scrapeado
libro 852 scrapeado
libro 853 scrapeado
libro 854 scrapeado
libro 855 scrapeado
libro 856 scrapeado
libro 857 scrapeado
libro 858 scrapeado
libro 859 scrapeado
libro 860 scrapeado
libro 861 scrapeado
libro 862 scrapeado
libro 863 scrapeado
libro 864 scrapeado
libro 865 scrapeado
libro 866 scrapeado
libro 867 scrapeado
libro 868 scrapeado
libro 869 scrapeado
libro 870 scrapeado
libro 871 scrapeado
libro 872 scrapeado
libro 873 scrapeado
libro 874 scrapeado
libro 875 scrapeado
libro 876 scrapeado


In [533]:
pd.DataFrame(books)

Unnamed: 0,titulo,precio,stock
0,A Light in the Attic,[£51.77],"[\n, [], \n \n In stock (22 availabl..."
1,Tipping the Velvet,[£53.74],"[\n, [], \n \n In stock (20 availabl..."
2,Soumission,[£50.10],"[\n, [], \n \n In stock (20 availabl..."
3,Sharp Objects,[£47.82],"[\n, [], \n \n In stock (20 availabl..."
4,Sapiens: A Brief History of Humankind,[£54.23],"[\n, [], \n \n In stock (20 availabl..."
...,...,...,...
1035,Our Band Could Be Your Life: Scenes from the A...,[£57.25],"[\n, [], \n \n In stock (19 availabl..."
1036,Olio,[£23.88],"[\n, [], \n \n In stock (19 availabl..."
1037,Mesaerion: The Best Science Fiction Stories 18...,[£37.59],"[\n, [], \n \n In stock (19 availabl..."
1038,Libertarianism for Beginners,[£51.33],"[\n, [], \n \n In stock (19 availabl..."
