# Extracting Website Data with BeautifulSoup and requests and Storing It as CSV for Data Analysis

# 1. Aims, objectives and background

## 1.1. Introduction

The internet holds a vast amount of information, but much of it is embedded in HTML pages rather than published as ready-made datasets. Web scraping, the process of extracting data from websites, turns that content into structured data. In this project, we use the Python libraries BeautifulSoup and requests to extract data from a website of interest. With this data, we aim to perform analyses and create interactive dashboards that visualize trends and patterns.

## 1.2. Aims and objectives

This project aims to achieve the following objectives:

1. To familiarize ourselves with web scraping techniques using BeautifulSoup and requests, understanding their capabilities and limitations.
2. To extract data from a target website efficiently and effectively, ensuring accuracy and reliability in the retrieved information.
3. To store the extracted data in a CSV (Comma-Separated Values) file format, facilitating easy access and manipulation for data analysis purposes.
4. To explore the selected dataset, identifying patterns, trends, and insights through exploratory data analysis techniques.
5. To critically assess the dataset, considering factors such as data selection criteria, limitations, and the ethics of data sourcing.

By accomplishing these objectives, we aim to demonstrate the practical application of web scraping tools in data extraction and analysis, while also emphasizing the importance of ethical considerations and data integrity.


## 1.3. Steps of the project
1. Data Extraction: Utilize BeautifulSoup and requests to extract data from the target website, adhering to web scraping best practices and handling potential challenges such as dynamic content and anti-scraping measures.

2. Data Storage: Store the extracted data in a CSV file format, organizing it into appropriate columns and ensuring data consistency and integrity.

3. Data Analysis: Perform exploratory data analysis on the extracted dataset, employing statistical and visual techniques to uncover insights, trends, and anomalies.

4. Dataset Assessment: Evaluate the selected dataset, considering factors such as data relevance, completeness, and potential biases. Reflect on the limitations of the data and discuss ethical considerations surrounding its use.
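
One of the best practices mentioned in step 1 is to avoid sending requests faster than the target site can comfortably serve them. A minimal sketch of a fixed-delay throttle follows. `Throttle` is a hypothetical helper that is not part of the original notebook, and the one-second interval is an arbitrary assumption rather than a documented limit of any site.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough that calls are at least min_interval seconds apart."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(min_interval=1.0)  # assumed delay; adjust to the site's guidance
# Intended use inside a scraping loop (left commented out so the sketch runs offline):
# throttle.wait()
# response = requests.get(url + str(page))
```

Calling `throttle.wait()` before each request keeps the loop simple while spacing out the traffic.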



## 1.4. Dataset

### Data Selection: 
The dataset selected for this project will be sourced from a publicly available website, chosen based on its relevance to the project objectives and feasibility of data extraction.

### Data Limitations: 
While web scraping can provide access to vast amounts of data, it's important to acknowledge its limitations, including potential inconsistencies, inaccuracies, and legal or ethical concerns.

### Ethics of Data Source: 
Ethical considerations take priority when extracting data from websites: the project respects terms of service, copyright laws, and privacy regulations, and aims for transparency and accountability in how data is sourced and used.

By adhering to best practices in web scraping and data handling, this project aims to demonstrate the value of ethical and responsible data extraction and analysis in the digital age.
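
One concrete way to respect a site's wishes is to consult its `robots.txt` before fetching a page. The sketch below uses the standard-library `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `my-scraper` user agent, and the `example.com` URLs are assumptions made up for illustration, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Assumed sample rules for illustration only; a real site publishes its own robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/page/1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

Against a live site, the same parser can load the rules itself with `set_url(...)` followed by `read()`. Note that `robots.txt` does not replace reading the site's terms of service.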


## Importing libraries

In [1]:
from bs4 import BeautifulSoup as BSoup
import re
import requests
import csv
import json
import pandas as pd

### Main URL of website 'Quotes to Scrape' and columns for final csv files

In [13]:
url = "https://quotes.toscrape.com/page/"
columns_quotes = ['id', 'author', 'quote', 'tags']  # Columns for the quotes CSV
columns_author = ['author', 'born_date', 'born_location']  # Columns for the authors CSV

### Function to get quotes from all pages of the site

In [5]:
def get_quote_set(url, start_page=1, end_page=None):
    """Collect [id, author, quote, tags, authorUrl] rows from consecutive pages."""
    quote_set = []
    page = start_page
    next_page = True
    uid = 0

    while next_page and (end_page is None or page <= end_page):
        print(url + str(page))
        response = requests.get(url + str(page))
        response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
        source = BSoup(response.content, 'html.parser')
        print(source.find('title').get_text())

        # Look up the "Next" link once; guard against a page that has no pager
        pager = source.find('ul', 'pager')
        next_link = pager.find('li', 'next') if pager is not None else None
        if next_link is not None:
            txt_next = next_link.find('a').get_text()
            print(f"Processing {page} {next_page} {txt_next}")
        else:
            txt_next = None

        # Keep going only while the pager offers a "Next" link
        next_page = bool(txt_next and "Next" in txt_next)

        for quote_div in source.find_all('div', 'quote'):
            quote = quote_div.find(attrs={'itemprop': 'text'}).get_text().strip()
            author = quote_div.find(attrs={'class': 'author'}).get_text().strip()
            tags = quote_div.find(attrs={'itemprop': 'keywords'}).get('content').strip()
            authorUrl = quote_div.find(href=re.compile(r"/author/")).get('href')

            uid += 1
            quote_set.append([uid, author, quote, tags, authorUrl])

        page += 1

    return quote_set
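
The pagination check in the function above can be tried offline. The sketch below assumes two hand-written HTML fragments that imitate the pager markup the function looks for. They are not copies of the site's pages, and `has_next_page` is a hypothetical helper that is not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hand-written fragments that imitate the pager markup (an assumption, not the real pages).
PAGE_WITH_NEXT = """
<ul class="pager">
  <li class="next"><a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a></li>
</ul>
"""
LAST_PAGE = """
<ul class="pager">
  <li class="previous"><a href="/page/9/"><span aria-hidden="true">&larr;</span> Previous</a></li>
</ul>
"""


def has_next_page(html):
    """Return True when the pager contains an <li class="next"> element."""
    soup = BeautifulSoup(html, "html.parser")
    pager = soup.find("ul", "pager")  # second positional argument filters by CSS class
    if pager is None:
        return False
    return pager.find("li", "next") is not None


print(has_next_page(PAGE_WITH_NEXT))
print(has_next_page(LAST_PAGE))
```

Testing for the presence of the `next` list item avoids depending on the exact link text, which may include an arrow character.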

### Main Program to get list of quotes.

In [6]:
# Main program for getting quotes
url = "https://quotes.toscrape.com/page/"
quote_set = get_quote_set(url)

print("Quotes:")
for quote_info in quote_set:
    print(quote_info)

https://quotes.toscrape.com/page/1
Quotes to Scrape
Processing 1 True Next →
https://quotes.toscrape.com/page/2
Quotes to Scrape
Processing 2 True Next →
https://quotes.toscrape.com/page/3
Quotes to Scrape
Processing 3 True Next →
https://quotes.toscrape.com/page/4
Quotes to Scrape
Processing 4 True Next →
https://quotes.toscrape.com/page/5
Quotes to Scrape
Processing 5 True Next →
https://quotes.toscrape.com/page/6
Quotes to Scrape
Processing 6 True Next →
https://quotes.toscrape.com/page/7
Quotes to Scrape
Processing 7 True Next →
https://quotes.toscrape.com/page/8
Quotes to Scrape
Processing 8 True Next →
https://quotes.toscrape.com/page/9
Quotes to Scrape
Processing 9 True Next →
https://quotes.toscrape.com/page/10
Quotes to Scrape
Quotes:
[1, 'Albert Einstein', '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'change,deep-thoughts,thinking,world', '/author/Albert-Einstein']
[2, 'J.K. Rowling', '“It is our choice

### Creating a list of author URLs and the final dataframe for quotes

In [16]:
col_names = ['id','author','quote','tags','authorUrl']
quote_df = pd.DataFrame(quote_set, columns=col_names)

# Keep one row per distinct (author, authorUrl) pair
unique_authors_df = quote_df[['author', 'authorUrl']].drop_duplicates()
# List of author page URLs (relative paths) to visit
author_list = unique_authors_df['authorUrl'].tolist()

# Drop authorUrl from the quotes dataframe
quotes_df = quote_df[columns_quotes]
quotes_df


| | id | author | quote | tags |
|---|---|---|---|---|
| 0 | 1 | Albert Einstein | “The world as we have created it is a process ... | change,deep-thoughts,thinking,world |
| 1 | 2 | J.K. Rowling | “It is our choices, Harry, that show what we t... | abilities,choices |
| 2 | 3 | Albert Einstein | “There are only two ways to live your life. On... | inspirational,life,live,miracle,miracles |
| 3 | 4 | Jane Austen | “The person, be it gentleman or lady, who has ... | aliteracy,books,classic,humor |
| 4 | 5 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... | be-yourself,inspirational |
| ... | ... | ... | ... | ... |
| 95 | 96 | Harper Lee | “You never really understand a person until yo... | better-life-empathy |
| 96 | 97 | Madeleine L'Engle | “You have to write the book that wants to be w... | books,children,difficult,grown-ups,write,write... |
| 97 | 98 | Mark Twain | “Never tell the truth to people who are not wo... | truth |
| 98 | 99 | Dr. Seuss | “A person's a person, no matter how small.” | inspirational |
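
Because the `tags` column stores several tags in one comma-separated string, a first exploratory step is to split it and count tag frequencies. The sketch below uses only a few rows copied from the table above, and `count_tags` is a hypothetical helper that is not part of the notebook.

```python
from collections import Counter

# A few rows taken from the table above: (author, comma-separated tags).
sample_rows = [
    ("Albert Einstein", "change,deep-thoughts,thinking,world"),
    ("J.K. Rowling", "abilities,choices"),
    ("Marilyn Monroe", "be-yourself,inspirational"),
    ("Dr. Seuss", "inspirational"),
]


def count_tags(rows):
    """Split each comma-separated tag string and count how often each tag occurs."""
    counts = Counter()
    for _author, tags in rows:
        counts.update(tag for tag in tags.split(",") if tag)
    return counts


tag_counts = count_tags(sample_rows)
print(tag_counts.most_common(1))  # [('inspirational', 2)]
```

On the full dataframe, a comparable result should be obtainable with `quotes_df['tags'].str.split(',').explode().value_counts()`.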


### Function to get author name, birth date, and birth location

In [10]:
def get_author_set(author_list):
    """Fetch each author's page once and return [author, born_date, born_location] rows."""
    author_data = []
    seen_urls = set()  # author URLs that have already been fetched

    for author_url in author_list:
        if author_url in seen_urls:
            continue
        seen_urls.add(author_url)

        author_url_full = "https://quotes.toscrape.com" + author_url
        response = requests.get(author_url_full)
        source_author = BSoup(response.content, 'html.parser')
        author = source_author.find(attrs={'class': 'author-title'}).get_text().strip()
        born_date = source_author.find(attrs={'class': 'author-born-date'}).get_text().strip()
        born_location = source_author.find(attrs={'class': 'author-born-location'}).get_text().strip()
        # Strip only the leading "in"; replace('in', '') would also delete it inside place names
        born_location = re.sub(r'^in\s+', '', born_location)
        author_data.append([author, born_date, born_location])

    return author_data

### Main Program to get author data list

In [11]:
author_data = get_author_set(author_list)
print("\nAuthors:")
for author_info in author_data:
    print(author_info)


Authors:
['Albert Einstein', 'March 14, 1879', 'Ulm, Germany']
['J.K. Rowling', 'July 31, 1965', 'Yate, South Gloucestershire, England, The United Kingdom']
['Jane Austen', 'December 16, 1775', 'Steventon Rectory, Hampshire, The United Kingdom']
['Marilyn Monroe', 'June 01, 1926', 'The United States']
['André Gide', 'November 22, 1869', 'Paris, France']
['Thomas A. Edison', 'February 11, 1847', 'Milan, Ohio, The United States']
['Eleanor Roosevelt', 'October 11, 1884', 'The United States']
['Steve Martin', 'August 14, 1945', 'Waco, Texas, The United States']
['Bob Marley', 'February 06, 1945', 'Nine Mile, Saint Ann, Jamaica']
['Dr. Seuss', 'March 02, 1904', 'Springfield, MA, The United States']
['Douglas Adams', 'March 11, 1952', 'Cambridge, England, The United Kingdom']
['Elie Wiesel', 'September 30, 1928', 'Sighet, Romania']
['Friedrich Nietzsche', 'October 15, 1844', 'Röcken bei Lützen, Prussian Province of Saxony, Germany']
['Mark Twain', 'November 30, 1835', 'Florida, Missouri, The United Stat

### Creating final dataframe for author data

In [17]:
authors_df = pd.DataFrame(author_data, columns=columns_author)
authors_df

| | author | born_date | born_location |
|---|---|---|---|
| 0 | Albert Einstein | March 14, 1879 | Ulm, Germany |
| 1 | J.K. Rowling | July 31, 1965 | Yate, South Gloucestershire, England, The Unit... |
| 2 | Jane Austen | December 16, 1775 | Steventon Rectory, Hampshire, The United Kingdom |
| 3 | Marilyn Monroe | June 01, 1926 | The United States |
| 4 | André Gide | November 22, 1869 | Paris, France |
| 5 | Thomas A. Edison | February 11, 1847 | Milan, Ohio, The United States |
| 6 | Eleanor Roosevelt | October 11, 1884 | The United States |
| 7 | Steve Martin | August 14, 1945 | Waco, Texas, The United States |
| 8 | Bob Marley | February 06, 1945 | Nine Mile, Saint Ann, Jamaica |
| 9 | Dr. Seuss | March 02, 1904 | Springfield, MA, The United States |
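
Two author fields benefit from light cleaning before analysis. The birthplace needs only its leading "in" removed, and the birth date is easier to work with as a date object than as text. The sketch below assumes the raw birthplace text starts with "in " and that dates follow the "Month DD, YYYY" pattern seen in the output. `clean_location` and `parse_born_date` are hypothetical helpers that are not part of the notebook.

```python
from datetime import datetime


def clean_location(raw):
    """Remove only a leading 'in ' from the birthplace text, leaving the rest intact."""
    text = raw.strip()
    return text[3:].strip() if text.startswith("in ") else text


def parse_born_date(raw):
    """Convert a string such as 'March 14, 1879' into a datetime.date.

    %B matches full month names in the current locale (English by default).
    """
    return datetime.strptime(raw.strip(), "%B %d, %Y").date()


raw_location = "in Springfield, MA, The United States"  # assumed raw page text
print(raw_location.replace("in", "").strip())  # also deletes "in" inside "Springfield"
print(clean_location(raw_location))
print(parse_born_date("March 02, 1904"))
```

A blanket `str.replace('in', '')` removes every occurrence of the two letters, which is why prefix-only removal is the safer choice for place names.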


### Storing the 'quotes' and 'authors' dataframes in .csv files

In [19]:
authors_df.to_csv("authors.csv", index=False)
quotes_df.to_csv("quotes.csv", index=False)

print("CSV files created successfully!")


CSV files created successfully!
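
Since the `tags` and `born_location` values themselves contain commas, it is worth confirming that such fields survive a write and read cycle. The sketch below uses the standard-library `csv` module and an in-memory buffer in place of the notebook's files. The two sample rows are copied from the quotes output, and the reduced column set is an assumption made for brevity.

```python
import csv
import io

rows = [
    [1, "Albert Einstein", "change,deep-thoughts,thinking,world"],
    [2, "J.K. Rowling", "abilities,choices"],
]

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "author", "tags"])
writer.writerows(rows)

# Fields that contain commas are quoted automatically, so they survive a round trip.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["tags"])
```

Values read back this way are strings, so numeric columns such as `id` need converting before any arithmetic.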
