# WEB SCRAPING PROJECT 

# Project Overview

In this project, I developed a web scraping tool using Python to extract and analyze data from the website "Books to Scrape". Leveraging libraries such as `requests`, `BeautifulSoup`, and `pprint`, the tool retrieves and parses HTML content to gather information about books listed on the site.

## Key Features

### Data Extraction

- **Images:** Extracted image URLs and descriptions.
- **Categories:** Retrieved book categories and their URLs.
- **Books:** Collected details on each book including title, price, availability, and product ID.

### Category Analysis

- **Category Statistics:** Counted the number of books in each category.

### Sentiment Analysis

- Used the `TextBlob` library to score the sentiment of each book's product description. The description text is extracted from the product page, its polarity is computed, and the score is mapped to a coarse label such as "joy" or "fear".


# Python Libraries

`requests`

The `requests` library is used to send HTTP requests to retrieve the HTML code of a web page. It is a simple and powerful tool for interacting with web services.

`BeautifulSoup`

The `BeautifulSoup` library helps to parse HTML code and extract information from web pages.

`pprint`

`pprint` is used to display Python data structures in a more readable way.




In [378]:
from bs4 import BeautifulSoup
from pprint import pprint
import requests

In [366]:
url = "https://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)


200


In [367]:
pics = soup.find_all("img")

print(type(pics))

# Print each image path next to its alt text (the book title)
for pic in pics:
    if 'src' in pic.attrs:
        print(pic['src'], " -->", pic.get('alt'))


<class 'bs4.element.ResultSet'>
media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg  --> A Light in the Attic
media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg  --> Tipping the Velvet
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg  --> Soumission
media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg  --> Sharp Objects
media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg  --> Sapiens: A Brief History of Humankind
media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg  --> The Requiem Red
media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg  --> The Dirty Little Secrets of Getting Your Dream Job
media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg  --> The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg  --> The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
media/cache/58/46/5846057e28022268153beff6d352b06c.jpg  --> The Black Maria

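# Extract Categories From the Sidebar

In [368]:
# The category list lives in the sidebar's 'side_categories' div
categories_div = soup.find('div', class_='side_categories')

# The first <li> wraps a nested <ul> that holds one <li> per category
links = categories_div.find('ul').find('li').find('ul')
categories = [link.text.strip() for link in links.children if link.name]
print(categories)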

['Travel', 'Mystery', 'Historical Fiction', 'Sequential Art', 'Classics', 'Philosophy', 'Romance', 'Womens Fiction', 'Fiction', 'Childrens', 'Religion', 'Nonfiction', 'Music', 'Default', 'Science Fiction', 'Sports and Games', 'Add a comment', 'Fantasy', 'New Adult', 'Young Adult', 'Science', 'Poetry', 'Paranormal', 'Art', 'Psychology', 'Autobiography', 'Parenting', 'Adult Fiction', 'Humor', 'Horror', 'History', 'Food and Drink', 'Christian Fiction', 'Business', 'Biography', 'Thriller', 'Contemporary', 'Spirituality', 'Academic', 'Self Help', 'Historical', 'Christian', 'Suspense', 'Short Stories', 'Novels', 'Health', 'Politics', 'Cultural', 'Erotica', 'Crime']


# Extract Product Title

In [369]:
articles_div = soup.find('section')
articles_name = articles_div.select('article.product_pod')

products_title = []

for article in articles_name:
    product_title = article.find('h3').find('a').get('title')
    products_title.append(product_title)

pprint(products_title)

['A Light in the Attic',
 'Tipping the Velvet',
 'Soumission',
 'Sharp Objects',
 'Sapiens: A Brief History of Humankind',
 'The Requiem Red',
 'The Dirty Little Secrets of Getting Your Dream Job',
 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, '
 'Victoria Woodhull',
 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the '
 '1936 Berlin Olympics',
 'The Black Maria',
 'Starving Hearts (Triangular Trade Trilogy, #1)',
 "Shakespeare's Sonnets",
 'Set Me Free',
 "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
 'Rip it Up and Start Again',
 'Our Band Could Be Your Life: Scenes from the American Indie Underground, '
 '1981-1991',
 'Olio',
 'Mesaerion: The Best Science Fiction Stories 1800-1849',
 'Libertarianism for Beginners',
 "It's Only the Himalayas"]


# Extract Product Data

In [370]:
import re

articles = []
# Product URLs look like 'catalogue/<slug>_<id>/index.html'; capture the digits after '_'
pattern = r'_(\d+)'

for article in articles_name:
    link = article.find('h3').find('a')
    price_box = article.find('div', class_="product_price")

    product_url = link['href']
    product_title = link.get('title')
    product_price = price_box.find('p', class_="price_color").text.strip()
    product_stock = price_box.find('p', class_="instock availability").text.strip()
    product_id = re.findall(pattern, product_url)  # list of matches, e.g. ['1000']

    if not product_id:
        product_id = None  # Python has no NULL; None marks a missing ID

    product_data = {
        'product_id': product_id,
        'title': product_title,
        'price': product_price,
        'stock': product_stock,
        'URL': product_url
    }

    articles.append(product_data)

pprint(articles)

[{'URL': 'catalogue/a-light-in-the-attic_1000/index.html',
  'price': 'Â£51.77',
  'product_id': ['1000'],
  'stock': 'In stock',
  'title': 'A Light in the Attic'},
 {'URL': 'catalogue/tipping-the-velvet_999/index.html',
  'price': 'Â£53.74',
  'product_id': ['999'],
  'stock': 'In stock',
  'title': 'Tipping the Velvet'},
 {'URL': 'catalogue/soumission_998/index.html',
  'price': 'Â£50.10',
  'product_id': ['998'],
  'stock': 'In stock',
  'title': 'Soumission'},
 {'URL': 'catalogue/sharp-objects_997/index.html',
  'price': 'Â£47.82',
  'product_id': ['997'],
  'stock': 'In stock',
  'title': 'Sharp Objects'},
 {'URL': 'catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
  'price': 'Â£54.23',
  'product_id': ['996'],
  'stock': 'In stock',
  'title': 'Sapiens: A Brief History of Humankind'},
 {'URL': 'catalogue/the-requiem-red_995/index.html',
  'price': 'Â£22.65',
  'product_id': ['995'],
  'stock': 'In stock',
  'title': 'The Requiem Red'},
 {'URL': 'catalogue/the-dirty
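
The records above hold raw scraped strings: `product_id` is a one-element list and `price` carries a stray `Â` before the pound sign. The `Â£` looks like UTF-8 bytes that were decoded as ISO-8859-1, which `requests` falls back to when a text response declares no charset; setting `response.encoding = 'utf-8'` or passing `response.content` to `BeautifulSoup` should avoid it at the source. Below is a minimal sketch of cleaning one record after the fact. The helper names `repair_mojibake`, `price_to_float`, and `clean_record` are hypothetical and not part of the original notebook.

```python
import re

def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as ISO-8859-1."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way; leave it alone.
        return text

def price_to_float(price_text):
    """Pull the numeric part out of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None

def clean_record(record):
    """Return a copy of a scraped record with a scalar ID and a numeric price."""
    cleaned = dict(record)
    ids = record.get("product_id") or []
    cleaned["product_id"] = int(ids[0]) if ids and ids[0] else None
    cleaned["price_text"] = repair_mojibake(record["price"])
    cleaned["price"] = price_to_float(cleaned["price_text"])
    return cleaned

# Sample record copied from the output above
sample = {
    "URL": "catalogue/a-light-in-the-attic_1000/index.html",
    "price": "Â£51.77",
    "product_id": ["1000"],
    "stock": "In stock",
    "title": "A Light in the Attic",
}
cleaned = clean_record(sample)
print(cleaned["product_id"], cleaned["price_text"], cleaned["price"])
```

Keeping the repaired text in a separate `price_text` field is only one possible design; it preserves the currency symbol while `price` becomes a number that can be sorted or averaged.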

# Extract Category's URL

In [330]:
categs = categories_div.find('ul').find('li').find('ul').find_all('a')
categ_links = [link['href'] for link in categs]

pprint(categ_links)

['catalogue/category/books/travel_2/index.html',
 'catalogue/category/books/mystery_3/index.html',
 'catalogue/category/books/historical-fiction_4/index.html',
 'catalogue/category/books/sequential-art_5/index.html',
 'catalogue/category/books/classics_6/index.html',
 'catalogue/category/books/philosophy_7/index.html',
 'catalogue/category/books/romance_8/index.html',
 'catalogue/category/books/womens-fiction_9/index.html',
 'catalogue/category/books/fiction_10/index.html',
 'catalogue/category/books/childrens_11/index.html',
 'catalogue/category/books/religion_12/index.html',
 'catalogue/category/books/nonfiction_13/index.html',
 'catalogue/category/books/music_14/index.html',
 'catalogue/category/books/default_15/index.html',
 'catalogue/category/books/science-fiction_16/index.html',
 'catalogue/category/books/sports-and-games_17/index.html',
 'catalogue/category/books/add-a-comment_18/index.html',
 'catalogue/category/books/fantasy_19/index.html',
 'catalogue/category/books/new-adul
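
The `href` values above are relative to the home page. Below is a small sketch of pairing each category name with an absolute URL using the standard library's `urljoin`. It assumes `https://books.toscrape.com/` as the base and reuses the first three category names and links from the outputs above; `category_urls` is a name chosen for illustration.

```python
from urllib.parse import urljoin

base_url = "https://books.toscrape.com/"
names = ["Travel", "Mystery", "Historical Fiction"]
hrefs = [
    "catalogue/category/books/travel_2/index.html",
    "catalogue/category/books/mystery_3/index.html",
    "catalogue/category/books/historical-fiction_4/index.html",
]

# urljoin resolves each relative href against the base. Unlike plain string
# concatenation, it also copes with a base that lacks the trailing slash.
category_urls = {name: urljoin(base_url, href) for name, href in zip(names, hrefs)}

for name, url in category_urls.items():
    print(f"{name}: {url}")
```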

# Count Products per Category

In [331]:
# URL
base_url = "https://books.toscrape.com/"
categories_statistics = []

for link in categ_links:

    url = base_url + link
    print(f"Fetching URL: {url}")
    resp = requests.get(url)

    if resp.status_code == 200:

        # Parse the category page before querying it
        soup = BeautifulSoup(resp.text, 'html.parser')
        categ_title = soup.find('div', class_="page-header action").find('h1')
        number_products = soup.find('form', class_="form-horizontal").find('strong')

        # Record the category only when both elements were found
        if categ_title and number_products:
            category_data = {
                'category_title': categ_title.text.strip(),
                'books_counter': number_products.text.strip()
            }
            categories_statistics.append(category_data)

    else:
        print(f"Failed to retrieve the page: {url}, Status code: {resp.status_code}")

# RESULTS
print('\n\n')
pprint(categories_statistics)

Fetching URL: https://books.toscrape.com/catalogue/category/books/travel_2/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/classics_6/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/philosophy_7/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/romance_8/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/fiction_10/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/childrens_11/index.html
Fetching URL: https://books.toscrape.com/catalogue/category/books/religion_12/index.html
Fetching 
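
Each `books_counter` value is scraped as text. Below is a minimal sketch of converting the counts to integers and ranking the categories. The figures in `sample_stats` are made-up placeholders, not results scraped from the site, and `rank_categories` is a hypothetical helper.

```python
# Placeholder data in the same shape as the scraped statistics (values are invented)
sample_stats = [
    {"category_title": "Category A", "books_counter": "12"},
    {"category_title": "Category B", "books_counter": "3"},
    {"category_title": "Category C", "books_counter": "27"},
]

def rank_categories(stats):
    """Return (title, count) pairs sorted from most to fewest books."""
    pairs = [(row["category_title"], int(row["books_counter"])) for row in stats]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

ranking = rank_categories(sample_stats)
for title, count in ranking:
    print(f"{title}: {count}")

total_books = sum(count for _, count in ranking)
print("Total books:", total_books)
```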

# Extract Books With a One-Star Rating

In [337]:
# `url` still holds the last category page fetched by the loop above
response = requests.get(url)

if response.status_code == 200:

    soup = BeautifulSoup(response.text, 'html.parser')
    artics = soup.find_all('article', class_='product_pod')

    # Keep the titles of books whose rating tag has the class 'star-rating One'
    one_star_titles = [artic.h3.a['title'] for artic in artics if artic.find('p', class_='star-rating One')]

    pprint(one_star_titles)

else:
    print(f"Failed to retrieve the page: {url}, Status code: {response.status_code}")

['The Long Shadow of Small Ghosts: Murder and Memory in an American City']
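
The site encodes each rating as a word in the tag's `class` attribute, such as `star-rating One`. Below is a minimal sketch of turning that class list into a number so that books can be filtered by any rating. It assumes the other ratings follow the same naming (`Two` through `Five`); `rating_from_classes` is a hypothetical helper, and the list it receives has the shape `BeautifulSoup` returns for `tag['class']`.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_from_classes(classes):
    """Map a class list like ['star-rating', 'One'] to an integer rating."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None  # no recognizable rating word in the class list

print(rating_from_classes(["star-rating", "One"]))
```

With such a helper, a filter like "three stars or fewer" becomes a numeric comparison instead of one class lookup per rating.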


## Libraries added

`deep_translator`

The `deep_translator` library provides a straightforward way to use translation services in Python. Here, its `GoogleTranslator` class is used to translate text through Google Translate before the sentiment step. It supports a wide range of languages and is helpful for applications requiring multilingual capabilities.

`TextBlob`

`TextBlob` is a library for processing textual data and performing natural language processing (NLP) tasks. It provides a simple API for diving into common NLP tasks such as part-of-speech tagging, noun phrase extraction, and sentiment analysis.


In [238]:
pip install textblob

Collecting textblob
  Obtaining dependency information for textblob from https://files.pythonhosted.org/packages/02/07/5fd2945356dd839974d3a25de8a142dc37293c21315729a41e775b5f3569/textblob-0.18.0.post0-py3-none-any.whl.metadata
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
Installing collected packages: textblob
Successfully installed textblob-0.18.0.post0
Note: you may need to restart the kernel to use updated packages.


In [248]:
pip install deep-translator

Collecting deep-translator
  Obtaining dependency information for deep-translator from https://files.pythonhosted.org/packages/38/3f/61a8ef73236dbea83a1a063a8af2f8e1e41a0df64f122233938391d0f175/deep_translator-1.11.4-py3-none-any.whl.metadata
  Downloading deep_translator-1.11.4-py3-none-any.whl.metadata (30 kB)
Downloading deep_translator-1.11.4-py3-none-any.whl (42 kB)
Installing collected packages: deep-translator
Successfully installed deep-translator-1.11.4
Note: you may need to restart the kernel to use updated packages.


In [376]:
from deep_translator import GoogleTranslator
from textblob import TextBlob

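def extract_sentiment_textblob(text):
    """Translate `text` to English, then map its TextBlob polarity to a coarse label."""
    translated_text = GoogleTranslator(source='fr', target='en').translate(text)

    blob = TextBlob(translated_text)
    sentiment = blob.sentiment.polarity  # float in [-1.0, 1.0]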

    if sentiment >= 0.5:
        return "love"
    elif sentiment > 0.3:
        return "joy"
    elif sentiment <= -0.5:
        return "hatred"
    elif sentiment < -0.3:
        return "sadness"
    elif sentiment <= -0.2:
        return "angry"
    elif sentiment > 0:
        return "satisfaction"
    elif sentiment < 0:
        return "fear"
    else:
        return "neutral"
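
Because the function above calls a translation service, its thresholds are hard to check in isolation. Below is a sketch, assuming the same cut-offs, that separates the polarity-to-label step into a pure function. `polarity_to_label` is a hypothetical name, and the labels remain a rough reading of a single polarity score in [-1, 1] rather than true emotion detection.

```python
def polarity_to_label(polarity):
    """Map a polarity score in [-1, 1] to the labels used in this project."""
    if polarity >= 0.5:
        return "love"
    elif polarity > 0.3:
        return "joy"
    elif polarity <= -0.5:
        return "hatred"
    elif polarity < -0.3:
        return "sadness"
    elif polarity <= -0.2:
        return "angry"
    elif polarity > 0:
        return "satisfaction"
    elif polarity < 0:
        return "fear"
    else:
        return "neutral"

# Probe one value from each band to see where the boundaries fall
for score in (0.5, 0.31, 0.1, 0.0, -0.1, -0.2, -0.31, -0.5):
    print(score, "->", polarity_to_label(score))
```

The order of the branches matters: the `angry` check must come before the generic `fear` check, otherwise scores at or below -0.2 would never reach it.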




# Extract Feeling From Product Description

Uses the `extract_sentiment_textblob(text)` function defined above.

In [377]:
for book in articles:
    url_s = "https://books.toscrape.com/" + book.get('URL')
    res = requests.get(url_s)
    soup = BeautifulSoup(res.text, 'html.parser')
    paragraphs = soup.find_all("p")

    # The description is read from the fourth <p> on the page, so index 3 must exist
    if len(paragraphs) > 3:
        book['feelings'] = extract_sentiment_textblob(paragraphs[3].get_text())
    else:
        book['feelings'] = "NULL"

    print(book['feelings'])


pprint(articles)


satisfaction
satisfaction
fear
fear
satisfaction
satisfaction
satisfaction
satisfaction
satisfaction
satisfaction
fear
joy
satisfaction
satisfaction
satisfaction
satisfaction
satisfaction
satisfaction
satisfaction
satisfaction
[{'URL': 'catalogue/a-light-in-the-attic_1000/index.html',
  'feelings': 'satisfaction',
  'price': 'Â£51.77',
  'product_id': ['1000'],
  'stock': 'In stock',
  'title': 'A Light in the Attic'},
 {'URL': 'catalogue/tipping-the-velvet_999/index.html',
  'feelings': 'satisfaction',
  'price': 'Â£53.74',
  'product_id': ['999'],
  'stock': 'In stock',
  'title': 'Tipping the Velvet'},
 {'URL': 'catalogue/soumission_998/index.html',
  'feelings': 'fear',
  'price': 'Â£50.10',
  'product_id': ['998'],
  'stock': 'In stock',
  'title': 'Soumission'},
 {'URL': 'catalogue/sharp-objects_997/index.html',
  'feelings': 'fear',
  'price': 'Â£47.82',
  'product_id': ['997'],
  'stock': 'In stock',
  'title': 'Sharp Objects'},
 {'URL': 'catalogue/sapiens-a-brief-history-of-hu