## Task M10 T01

### Exercise 1
***
Perform web scraping on two of the three proposed websites, using BeautifulSoup first and Selenium afterwards.

- http://quotes.toscrape.com

- https://www.bolsamadrid.es

- www.wikipedia.es (run a search first and scrape some of the content)



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Web scraping** is the automated extraction of information directly from web pages. Its main goal is to turn unstructured data (such as the HTML text of a page) into structured data that can be stored and analysed.

Several tools and libraries, in different programming languages, make web scraping easier to implement (for example BeautifulSoup and Scrapy in Python).

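As a minimal sketch of what "unstructured to structured" means, the block below turns a small HTML fragment into a list of records. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages. The fragment `SAMPLE_HTML` and the class `QuoteParser` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment, shaped like a page that lists quotes.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">"First sample quote."</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">"Second sample quote."</span>
  <small class="author">Author Two</small>
</div>
"""


class QuoteParser(HTMLParser):
    """Collect the text of `text` and `author` elements into records."""

    def __init__(self):
        super().__init__()
        self.records = []    # structured output: one dict per quote
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class == "text":
            # A new quote starts: open a fresh record.
            self.records.append({"Quote": "", "Author": ""})
            self._field = "Quote"
        elif tag == "small" and css_class == "author":
            self._field = "Author"

    def handle_data(self, data):
        # Only keep text that belongs to a field we are reading.
        if self._field and self.records:
            self.records[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("span", "small"):
            self._field = None


parser = QuoteParser()
parser.feed(SAMPLE_HTML)
records = parser.records
```

Each entry of `records` is a dictionary with `Quote` and `Author` keys, which is the shape that a DataFrame constructor accepts directly.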
#### BeautifulSoup

With **BeautifulSoup** we will scrape http://quotes.toscrape.com, a website designed specifically for practising and learning web scraping techniques. It was created by Scrapinghub as an educational resource for anyone who wants to practise extracting data from websites.

The site lists quotes from different authors, and every page contains information that can be extracted with these techniques.

(Web scraping performed on 20/12/2023)

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
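url = 'http://quotes.toscrape.com'
# A timeout keeps the request from hanging indefinitely if the server does not answer
response = requests.get(url, timeout=10)

if response.status_code == 200:
    # Parse the HTML and collect the quote texts and the author names
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.find_all('span', class_='text')
    authors = soup.find_all('small', class_='author')

    # Both lists follow page order, so they can be paired by position
    for quote, author in zip(quotes, authors):
        print(f"{author.text}: {quote.text}")

else:
    print(f"Couldn't retrieve the page. Status code: {response.status_code}")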

Albert Einstein: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
J.K. Rowling: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Albert Einstein: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Jane Austen: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Marilyn Monroe: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Albert Einstein: “Try not to become a man of success. Rather become a man of value.”
André Gide: “It is better to be hated for what you are than to be loved for what you are not.”
Thomas A. Edison: “I have not failed. I've just found 10,000 ways that won't work.”
Eleanor Roosevelt: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Ste

From the information extracted by web scraping, we build a DataFrame.

In [4]:
df = pd.DataFrame({'Author': [author.text for author in authors],
                   'Quote': [quote.text for quote in quotes]})

df

| | Author | Quote |
|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | “There are only two ways to live your life. On... |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... |
| 5 | Albert Einstein | “Try not to become a man of success. Rather be... |
| 6 | André Gide | “It is better to be hated for what you are tha... |
| 7 | Thomas A. Edison | “I have not failed. I've just found 10,000 way... |
| 8 | Eleanor Roosevelt | “A woman is like a tea bag; you never know how... |
| 9 | Steve Martin | “A day without sunshine is like, you know, nig... |

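Pairing authors and quotes by position works only while both lists have the same length. The sketch below adds that check before building the records. The helper name `pair_records` is hypothetical, and the three sample entries are copied from the printed output above.

```python
def pair_records(authors, quotes):
    """Pair authors and quotes by position, refusing lists of different lengths."""
    if len(authors) != len(quotes):
        raise ValueError(f"{len(authors)} authors but {len(quotes)} quotes")
    return [{"Author": a, "Quote": q} for a, q in zip(authors, quotes)]


# Sample values taken from the first page of results shown above.
authors = ["J.K. Rowling", "Jane Austen", "Albert Einstein"]
quotes = [
    "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
    "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
    "“Try not to become a man of success. Rather become a man of value.”",
]

records = pair_records(authors, quotes)
```

Without the length check, `zip` silently stops at the shorter list, so a missing author on the page would go unnoticed.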

The information is correct, but only the data from the first page was extracted. Next, we scrape all the pages (10 in total).

In [5]:
base_url = 'http://quotes.toscrape.com'
current_page = 1
quotes_list = []

while True:
    url = f'{base_url}/page/{current_page}/'
    response = requests.get(url, timeout=10)  # timeout so a stalled request cannot block the loop

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        quotes = soup.find_all('span', class_='text')
        authors = soup.find_all('small', class_='author')

        if not quotes:
            # No more quotes found, exit the loop
            break
        else:
            # Append quotes and authors to the list
            for quote, author in zip(quotes, authors):
                quotes_list.append({'Author': author.text, 'Quote': quote.text})

        # Move to the next page
        current_page += 1
    else:
        print(f'Couldn\'t retrieve the page. Status code: {response.status_code}')
        break


df_quote = pd.DataFrame(quotes_list)

df_quote

| | Author | Quote |
|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | “There are only two ways to live your life. On... |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... |
| ... | ... | ... |
| 95 | Harper Lee | “You never really understand a person until yo... |
| 96 | Madeleine L'Engle | “You have to write the book that wants to be w... |
| 97 | Mark Twain | “Never tell the truth to people who are not wo... |
| 98 | Dr. Seuss | “A person's a person, no matter how small.” |

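The stop condition of the loop above (keep requesting pages until one comes back without quotes) can be isolated from the network code. The following is a sketch under that assumption. `scrape_all_pages`, `fake_fetch` and `FAKE_SITE` are hypothetical names, and the in-memory dictionary stands in for the real site.

```python
def scrape_all_pages(fetch_page, first_page=1):
    """Call fetch_page(n) for consecutive page numbers until a page has no records."""
    all_records = []
    page = first_page
    while True:
        records = fetch_page(page)
        if not records:
            break  # an empty page marks the end of the listing
        all_records.extend(records)
        page += 1
    return all_records


# Hypothetical in-memory "site" with two pages of results.
FAKE_SITE = {
    1: [{"Author": "Author One", "Quote": "Quote A"},
        {"Author": "Author Two", "Quote": "Quote B"}],
    2: [{"Author": "Author One", "Quote": "Quote C"}],
}


def fake_fetch(page_number):
    """Return the records of a page, or an empty list past the last page."""
    return FAKE_SITE.get(page_number, [])


quotes = scrape_all_pages(fake_fetch)
```

Passing the fetch function as an argument lets the loop be checked without any HTTP request. In the notebook, that function would wrap the `requests.get` and `find_all` calls.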

#### Selenium

We will use **Selenium** to scrape the English Wikipedia page on The Beatles' discography. We want to extract the album title and the details (release date and record label) of the original UK studio albums. In this case, the extracted information goes directly into a DataFrame.

- https://en.wikipedia.org/wiki/The_Beatles_discography

(Web scraping performed on 20/12/2023)

In [6]:
import webdriver_manager
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

In [24]:
# Browser configuration: run Chrome without opening a window
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# Create a Chrome browser instance
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

# Page URL
url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
driver.get(url)

# Locate the first wikitable through Selenium (this handle is not used below)
discography_table = driver.find_element(By.XPATH, "//table[contains(@class, 'wikitable')]")

# Parse the rendered page source with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find the table that follows the "Original UK studio albums" heading
uk_studio_albums_section = soup.find('span', {'id': 'Original_UK_studio_albums'}).find_next('table')

# Collect one record per album; the slice skips the first two (header) rows
rows = []
for row in uk_studio_albums_section.find_all('tr')[2:]:
    columns = row.find_all(['th', 'td'])

    if len(columns) >= 3:
        rows.append({'Title': columns[0].text.strip(),
                     # Details cell: holds both "Released: ..." and "Label: ..."
                     'Released': columns[1].text.strip(),
                     # Third cell: not the label yet, it is overwritten in a later cell
                     'Label': columns[2].text.strip()})

# Close the browser
driver.quit()

# Build the DataFrame once from the collected records
df_beatles = pd.DataFrame(rows, columns=['Title', 'Released', 'Label'])

In [25]:
df_beatles

| | Title | Released | Label |
|---|---|---|---|
| 0 | Please Please Me | Released: 22 March 1963\nLabel: Parlophone | 1 |
| 1 | With the Beatles[A] | Released: 22 November 1963\nLabel: Parlophone ... | 1 |
| 2 | A Hard Day's Night | Released: 10 July 1964\nLabel: Parlophone | 1 |
| 3 | Beatles for Sale | Released: 4 December 1964\nLabel: Parlophone | 1 |
| 4 | Help! | Released: 6 August 1965\nLabel: Parlophone | 1 |
| 5 | Rubber Soul | Released: 3 December 1965\nLabel: Parlophone | 1 |
| 6 | Revolver | Released: 5 August 1966\nLabel: Parlophone | 1 |
| 7 | Sgt. Pepper's Lonely Hearts Club Band | Released: 26 May 1967\nLabel: Parlophone (UK),... | 1 |
| 8 | The Beatles ("The White Album") | Released: 22 November 1968\nLabel: Apple | 1 |
| 9 | Yellow Submarine[B] | Released: 13 January 1969\nLabel: Apple (UK), ... | 3 |


In [26]:
# Split the release date from the record label (both are stored in the 'Released' column)

df_beatles[['Released', 'Label']] = df_beatles['Released'].str.extract(r'Released: (.+?)\nLabel: (.+)', expand=True)

df_beatles

| | Title | Released | Label |
|---|---|---|---|
| 0 | Please Please Me | 22 March 1963 | Parlophone |
| 1 | With the Beatles[A] | 22 November 1963 | Parlophone (UK), Capitol (Canada), Odeon (France) |
| 2 | A Hard Day's Night | 10 July 1964 | Parlophone |
| 3 | Beatles for Sale | 4 December 1964 | Parlophone |
| 4 | Help! | 6 August 1965 | Parlophone |
| 5 | Rubber Soul | 3 December 1965 | Parlophone |
| 6 | Revolver | 5 August 1966 | Parlophone |
| 7 | Sgt. Pepper's Lonely Hearts Club Band | 26 May 1967 | Parlophone (UK), Capitol (US) |
| 8 | The Beatles ("The White Album") | 22 November 1968 | Apple |
| 9 | Yellow Submarine[B] | 13 January 1969 | Apple (UK), Capitol (US) |

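The same regular expression can be tried on a single cell with the standard `re` module, which makes it easier to see what each capture group returns. The sketch below also goes two steps beyond the notebook: it strips the trailing footnote marker (`[A]`, `[B]`) from the title and parses the date. Both steps are assumptions about how the data might be cleaned further, and `parse_album` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Same pattern as in the str.extract call: date up to the line break, label after it.
DETAILS_PATTERN = re.compile(r"Released: (.+?)\nLabel: (.+)")
# Trailing footnote marker such as "[A]" at the end of a title.
FOOTNOTE_PATTERN = re.compile(r"\[[A-Z]\]$")


def parse_album(title, details):
    """Return a cleaned record, or None when the details cell has another shape."""
    match = DETAILS_PATTERN.search(details)
    if match is None:
        return None
    released_text, label = match.groups()
    return {
        "Title": FOOTNOTE_PATTERN.sub("", title),
        # %B expects an English month name under Python's default locale.
        "Released": datetime.strptime(released_text, "%d %B %Y").date(),
        "Label": label,
    }


album = parse_album(
    "With the Beatles[A]",
    "Released: 22 November 1963\nLabel: Parlophone (UK), Capitol (Canada), Odeon (France)",
)
```

Returning `None` for cells that do not match keeps malformed rows visible instead of raising halfway through a table.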

### Exercise 2
***
Document the dataset you generated in a Word file, including the kind of information that dataset pages on Kaggle provide.

*****
### About Dataset (Quotes)

#### Context
This dataset compiles quotes from various authors. Each quote is attributed to a specific author, and the dataset aims to capture insightful and thought-provoking statements for analysis and exploration, allowing users to gain insights into the perspectives of different authors.

#### Content
100 rows and 2 columns, each representing a quote along with its corresponding author. The authors include well-known figures such as Albert Einstein, Marilyn Monroe or Jane Austen.

#### Columns' Description

- **Author:** The name of the author who made the quote.
- **Quote:** The actual quote from the respective author.


*Example Quote:* 
- **Albert Einstein**: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

#### Acknowledgements

Data from 'http://quotes.toscrape.com'

***

### About Dataset (The Beatles)

#### Context
This dataset compiles information about the discography of The Beatles, one of the most iconic and influential bands in the history of music. The dataset focuses on their original studio albums released in the United Kingdom, providing details about each album's title, release date, and label.

#### Content
12 rows and 3 columns, each representing an original studio album by The Beatles, along with its release date and label.

#### Columns' Description

- **Title:** The title of the original studio album.
- **Released:** The release date of the respective album.
- **Label:** The record label associated with the album.

*Example Album entry:* 
- **Please Please Me**: Released on 22 March 1963 by Parlophone.

#### Acknowledgements

Data from 'https://en.wikipedia.org/wiki/The_Beatles_discography'
***

### Exercise 3
***
Choose any website you like and scrape it using the Selenium library.

We will scrape a newspaper's page to extract the headlines of the main news stories, in this case The Guardian.

- https://www.theguardian.com/europe

(Web scraping performed on 20/12/2023)

In [37]:
# Browser configuration (no '--headless' argument is added, so a Chrome window opens)
options = Options()

# Create a Chrome browser instance
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

# Page URL
url = "https://www.theguardian.com/europe"
driver.get(url)

# Implicit wait: element lookups keep retrying for up to 20 seconds
driver.implicitly_wait(20)

# Find the links inside the <h2 class="headline"> elements
headlines = driver.find_elements(By.XPATH, '//h2[@class="headline"]/a')

# Collect the text of each headline while the browser is still open
rows = [{'Headline': headline.text} for headline in headlines]

# Close the browser
driver.quit()

# Build the DataFrame once from the collected headlines
df_news = pd.DataFrame(rows, columns=['Headline'])

In [38]:
df_news

| | Headline |
|---|---|
