## Task M10 T01 BeautifulSoup

### Exercise 1
***
Scrape two of the three proposed websites, using BeautifulSoup first and Selenium afterwards.

- http://quotes.toscrape.com

- https://www.bolsamadrid.es

- www.wikipedia.es (run a search first and scrape some of the content)



In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### BeautifulSoup

In [25]:
import requests
from bs4 import BeautifulSoup

In [26]:
url = 'http://quotes.toscrape.com'
response = requests.get(url)

if response.status_code == 200:
   
    soup = BeautifulSoup(response.text, 'html.parser')

    quotes = soup.find_all('span', class_='text')
    authors = soup.find_all('small', class_='author')

    for quote, author in zip(quotes, authors):
        print(f"{author.text}: {quote.text}")

else:
    print(f'Couldn\'t retrieve the page. Status code: {response.status_code}')

Albert Einstein: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
J.K. Rowling: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Albert Einstein: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Jane Austen: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Marilyn Monroe: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Albert Einstein: “Try not to become a man of success. Rather become a man of value.”
André Gide: “It is better to be hated for what you are than to be loved for what you are not.”
Thomas A. Edison: “I have not failed. I've just found 10,000 ways that won't work.”
Eleanor Roosevelt: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Ste

Using the information extracted by web scraping, we build a DataFrame.

In [27]:
df = pd.DataFrame({'Author': [author.text for author in authors],
                   'Quote': [quote.text for quote in quotes]})

df

Unnamed: 0,Author,Quote
0,Albert Einstein,“The world as we have created it is a process ...
1,J.K. Rowling,"“It is our choices, Harry, that show what we t..."
2,Albert Einstein,“There are only two ways to live your life. On...
3,Jane Austen,"“The person, be it gentleman or lady, who has ..."
4,Marilyn Monroe,"“Imperfection is beauty, madness is genius and..."
5,Albert Einstein,“Try not to become a man of success. Rather be...
6,André Gide,“It is better to be hated for what you are tha...
7,Thomas A. Edison,"“I have not failed. I've just found 10,000 way..."
8,Eleanor Roosevelt,“A woman is like a tea bag; you never know how...
9,Steve Martin,"“A day without sunshine is like, you know, nig..."


The information is correct, but only the data from the first page has been extracted. Next, we scrape all the pages (10 in total).

In [28]:
base_url = 'http://quotes.toscrape.com'
current_page = 1
quotes_list = []

while True:
    url = f'{base_url}/page/{current_page}/'
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        quotes = soup.find_all('span', class_='text')
        authors = soup.find_all('small', class_='author')

        if not quotes:
            # No more quotes found, exit the loop
            break
        else:
            # Append quotes and authors to the list
            for quote, author in zip(quotes, authors):
                quotes_list.append({'Author': author.text, 'Quote': quote.text})

        # Move to the next page
        current_page += 1
    else:
        print(f'Couldn\'t retrieve the page. Status code: {response.status_code}')
        break


df_quote = pd.DataFrame(quotes_list)

df_quote

Unnamed: 0,Author,Quote
0,Albert Einstein,“The world as we have created it is a process ...
1,J.K. Rowling,"“It is our choices, Harry, that show what we t..."
2,Albert Einstein,“There are only two ways to live your life. On...
3,Jane Austen,"“The person, be it gentleman or lady, who has ..."
4,Marilyn Monroe,"“Imperfection is beauty, madness is genius and..."
...,...,...
95,Harper Lee,“You never really understand a person until yo...
96,Madeleine L'Engle,“You have to write the book that wants to be w...
97,Mark Twain,“Never tell the truth to people who are not wo...
98,Dr. Seuss,"“A person's a person, no matter how small.”"
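
The loop above stops when a page returns no quotes. An alternative is to follow the pager's "Next" link and to read each quote and its author from the same container, so the two can never get out of step. The sketch below makes assumptions that should be checked against the live page: that every quote sits in a `div.quote` element and that the pager uses `li.next`. The helper names `extract_quotes` and `next_page_url` are hypothetical, and the sample HTML is hand-written for an offline demonstration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_quotes(html):
    """Return one {'Author', 'Quote'} dict per quote container in the page."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # Assumption: every quote sits in its own <div class="quote"> container.
    for block in soup.select("div.quote"):
        quote = block.select_one("span.text")
        author = block.select_one("small.author")
        if quote is not None and author is not None:
            records.append({"Author": author.get_text(strip=True),
                            "Quote": quote.get_text(strip=True)})
    return records


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the pager exposes the link as <li class="next"><a href="...">.
    link = soup.select_one("li.next > a")
    return urljoin(current_url, link["href"]) if link is not None else None


# Offline demonstration on a hand-written fragment (not real site output).
sample_html = """
<div class="quote"><span class="text">"Quote one"</span>
  <small class="author">Author A</small></div>
<div class="quote"><span class="text">"Quote two"</span>
  <small class="author">Author B</small></div>
<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>
"""
sample_records = extract_quotes(sample_html)
sample_next = next_page_url(sample_html, "http://quotes.toscrape.com/page/1/")
print(sample_records)
print(sample_next)
```

In a live run, each page would be fetched with `requests.get`, passed to `extract_quotes`, and the loop would continue with the URL returned by `next_page_url` until it returns `None`.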


### Exercise 2
***
Document your generated dataset in a Word file, including the kind of information that dataset files on Kaggle provide.


### About Dataset

#### Context
This dataset compiles quotes from various authors. Each quote is attributed to a specific author, and the dataset aims to capture insightful and thought-provoking statements for analysis and exploration, allowing users to gain insights into the perspectives of different authors.

#### Content
100 rows and 2 columns; each row holds a quote and its author. The authors include well-known figures such as Albert Einstein, Marilyn Monroe, and Jane Austen.

#### Columns' Description

- **Author:** The name of the author who made the quote.
- **Quote:** The actual quote from the respective author.


*Example Quote:* 
- **Albert Einstein**: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
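
The figures in this description can be derived from the scraped records rather than typed by hand. The sketch below assumes the records are a list of dictionaries with `Author` and `Quote` keys, like the entries appended to `quotes_list`. The helper name `describe_records` is hypothetical, and the three sample records are taken from the first-page output shown earlier.

```python
from collections import Counter

# Three records copied from the first-page output, for illustration only.
sample_quotes = [
    {"Author": "Albert Einstein",
     "Quote": "“The world as we have created it is a process of our thinking. "
              "It cannot be changed without changing our thinking.”"},
    {"Author": "J.K. Rowling",
     "Quote": "“It is our choices, Harry, that show what we truly are, "
              "far more than our abilities.”"},
    {"Author": "Albert Einstein",
     "Quote": "“There are only two ways to live your life. One is as though nothing "
              "is a miracle. The other is as though everything is a miracle.”"},
]


def describe_records(records):
    """Summarise row count, column names, and quotes per author."""
    columns = sorted({key for record in records for key in record})
    per_author = Counter(record["Author"] for record in records)
    return {
        "rows": len(records),
        "columns": columns,
        # most_common() lists authors from most to least quoted.
        "quotes_per_author": per_author.most_common(),
    }


summary = describe_records(sample_quotes)
print(summary["rows"], summary["columns"])
print(summary["quotes_per_author"])
```

Calling `describe_records(quotes_list)` on the full scrape would give the row count and author list to quote in the description.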

#### Acknowledgements

Data from 'http://quotes.toscrape.com'

### Exercise 3
***
Choose any website you like and scrape it, using the Selenium library first and Scrapy afterwards.

We will scrape the English Wikipedia page for The Beatles' discography with **BeautifulSoup**. We want to extract the album title and the details (release date and record label) of the original UK studio albums.

- https://en.wikipedia.org/wiki/The_Beatles_discography

#### BeautifulSoup

In [29]:
import requests
from bs4 import BeautifulSoup

# URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/The_Beatles_discography"

# Make a GET request to the page
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the discography table
    discography_table = soup.find('table', {'class': 'wikitable'})

    # Check if the table was found
    if discography_table:
        # Get the rows of the table
        rows = discography_table.find_all('tr')[2:]  # Skip the first two rows of the table (header rows)

        # Print details of the albums
        for row in rows:
            columns = row.find_all(['th', 'td'])
            if len(columns) >= 2:
                title = columns[0].get_text(strip=True)
                details = columns[1].get_text(strip=True)
                print(f"\nTitle: {title}")
                print(f"Album Details: {details}")

    else:
        print("Discography table not found on the page.")

else:
    print(f"Error retrieving the page. Status code: {response.status_code}")


Title: Please Please Me
Album Details: Released: 22 March 1963Label:Parlophone

Title: With the Beatles[A]
Album Details: Released: 22 November 1963Label: Parlophone (UK),Capitol(Canada),Odeon(France)

Title: A Hard Day's Night
Album Details: Released: 10 July 1964Label: Parlophone

Title: Beatles for Sale
Album Details: Released: 4 December 1964Label: Parlophone

Title: Help!
Album Details: Released: 6 August 1965Label: Parlophone

Title: Rubber Soul
Album Details: Released: 3 December 1965Label: Parlophone

Title: Revolver
Album Details: Released: 5 August 1966Label: Parlophone

Title: Sgt. Pepper's Lonely Hearts Club Band
Album Details: Released: 26 May 1967Label: Parlophone (UK), Capitol (US)

Title: The Beatles("The White Album")
Album Details: Released: 22 November 1968Label:Apple

Title: Yellow Submarine[B]
Album Details: Released: 13 January 1969Label: Apple (UK), Capitol (US)

Title: Abbey Road
Album Details: Released: 26 September 1969Label: Apple

Title: Let It Be
Album Det

In [31]:
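# Titles and details transcribed from the scraper output above
# (footnote markers removed, multiple labels separated with ' | ')
results = [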
    {"Title": "Please Please Me", "Album Details": "Released: 22 March 1963Label:Parlophone"},
    {"Title": "With the Beatles", "Album Details": "Released: 22 November 1963Label: Parlophone (UK) | Capitol (Canada) | Odeon (France)"},
    {"Title": "A Hard Day's Night", "Album Details": "Released: 10 July 1964Label: Parlophone"},
    {"Title": "Beatles for Sale", "Album Details": "Released: 4 December 1964Label: Parlophone"},
    {"Title": "Help!", "Album Details": "Released: 6 August 1965Label: Parlophone"},
    {"Title": "Rubber Soul", "Album Details": "Released: 3 December 1965Label: Parlophone"},
    {"Title": "Revolver", "Album Details": "Released: 5 August 1966Label: Parlophone"},
    {"Title": "Sgt. Pepper's Lonely Hearts Club Band", "Album Details": "Released: 26 May 1967Label: Parlophone (UK) | Capitol (US)"},
    {"Title": "The Beatles (The White Album)", "Album Details": "Released: 22 November 1968Label:Apple"},
    {"Title": "Yellow Submarine", "Album Details": "Released: 13 January 1969Label: Apple (UK) | Capitol (US)"},
    {"Title": "Abbey Road", "Album Details": "Released: 26 September 1969Label: Apple"},
    {"Title": "Let It Be", "Album Details": "Released: 8 May 1970Label: Apple"}
]

titles = []
released_dates = []
labels = []

for result in results:
    title = result["Title"]
    details = result["Album Details"]

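    # The release date is the text between the "Released:" and "Label:" markers
    released_start = details.find("Released:")
    released_end = details.find("Label:")
    released = details[released_start + len("Released:"):released_end].strip()

    # The label is everything after the "Label:" marker
    label_start = details.find("Label:")
    label = details[label_start + len("Label:"):].strip()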

    titles.append(title)
    released_dates.append(released)
    labels.append(label)

data2 = {'Title': titles, 'Released': released_dates, 'Label': labels}
df_beatles = pd.DataFrame(data2)


df_beatles

Unnamed: 0,Title,Released,Label
0,Please Please Me,22 March 1963,Parlophone
1,With the Beatles,22 November 1963,Parlophone (UK) | Capitol (Canada) | Odeon (Fr...
2,A Hard Day's Night,10 July 1964,Parlophone
3,Beatles for Sale,4 December 1964,Parlophone
4,Help!,6 August 1965,Parlophone
5,Rubber Soul,3 December 1965,Parlophone
6,Revolver,5 August 1966,Parlophone
7,Sgt. Pepper's Lonely Hearts Club Band,26 May 1967,Parlophone (UK) | Capitol (US)
8,The Beatles (The White Album),22 November 1968,Apple
9,Yellow Submarine,13 January 1969,Apple (UK) | Capitol (US)
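
The `results` list above was written out by hand from the scraper output. The same split could be applied directly to the scraped strings, which run the date and the label together (for example `1963Label:`) because `get_text(strip=True)` joins text nodes without a separator. The sketch below uses only the standard `re` module and assumes every details string has the shape `Released: <date>Label: <labels>`. The helper names `split_details` and `clean_title` are hypothetical.

```python
import re

# Assumption: details strings look like "Released: <date>Label: <labels>",
# as in the scraper output printed earlier.
DETAILS_PATTERN = re.compile(
    r"Released:\s*(?P<released>.+?)\s*Label:\s*(?P<label>.+)"
)


def split_details(details):
    """Split a scraped details string into a (released, label) pair."""
    match = DETAILS_PATTERN.search(details)
    if match is None:
        # The string does not follow the expected shape.
        return None, None
    return match.group("released").strip(), match.group("label").strip()


def clean_title(title):
    """Drop footnote markers such as [A] or [B] from a scraped title."""
    return re.sub(r"\[[A-Za-z0-9]+\]", "", title).strip()


print(split_details("Released: 22 March 1963Label:Parlophone"))
print(clean_title("With the Beatles[A]"))
```

Inside the scraping loop, `clean_title(title)` and `split_details(details)` could fill the `titles`, `released_dates`, and `labels` lists without the intermediate hand-written list.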
