# Programming for Data Science

## Project Title: Book Price Tracker

**Students:** Salah aldeen - 202111136, Rashed Alfayez - 202210706



### Project Summary 
This project builds a complete web-data pipeline: collecting book data from a website using web scraping and crawling, storing it as JSON, accelerating data collection with multithreading, and performing data manipulation & analysis using Pandas. 

---



## 1) Data Source
**Website:** Books to Scrape (educational website designed for scraping practice)

**Collected Fields:**
- Title
- Price
- Rating (1–5)
- `In Stock?` (True/False)
- Product link

---



## 2) Tools & Libraries
- **requests**: send HTTP requests to fetch web pages
- **BeautifulSoup (bs4)**: parse HTML and extract data
- **json**: read/write JSON files
- **concurrent.futures (ThreadPoolExecutor)**: multithreading for faster crawling
- **Pandas**: data cleaning, filtering, grouping, analysis, exporting results
---


In [421]:
import os
import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor, as_completed




Create the project folders (`data`, `outputs`) and define the file paths.

In [436]:
BASE_DIR = os.getcwd()
DATA_DIR = os.path.join(BASE_DIR, "data")
OUTPUT_DIR = os.path.join(BASE_DIR, "outputs")

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

RAW_JSON_PATH = os.path.join(DATA_DIR, "raw_books.json")
CLEAN_JSON_PATH = os.path.join(DATA_DIR, "clean_books.json")

BASE_URL = "https://books.toscrape.com/"


## 3) JSON
Helper functions to save and load datasets in JSON format.

In [392]:
def save_json(data, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

def load_json(path):
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
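
As a quick check of the two helpers, the following minimal sketch round-trips a small record through a temporary file. The helpers are repeated so the snippet runs on its own; the sample record and the file names are made up for illustration and are not part of the scraped data.

```python
import json
import os
import tempfile


def save_json(data, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_json(path):
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up record with a non-ASCII character to exercise ensure_ascii=False
records = [{"title": "Café Stories", "price": 12.5, "rating": 4, "In Stock?": True}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_books.json")
    save_json(records, path)
    loaded = load_json(path)
    # A path that does not exist falls back to an empty list
    missing = load_json(os.path.join(tmp, "does_not_exist.json"))
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

print(loaded == records)
print(missing)
```

Because `ensure_ascii=False` is set, the accented character is written to the file as-is rather than as a `\u` escape, which keeps the JSON readable when titles contain non-ASCII text.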


## 4) Web Scraping
### What is Web Scraping?
Web scraping is the process of extracting structured information from HTML web pages.

### How it is implemented here
- Fetch the page HTML using `requests.get()`.
- Convert HTML text into a parseable structure using **BeautifulSoup**.
- Identify each book entry using the HTML pattern:  
  `article.product_pod`
- Extract fields:
  - **title** from the `<a title="...">` 
  - **price** from `<p class="price_color">`
  - **rating** from the second class name on `<p class="star-rating Three">` (here `Three`)
  - **availability** from `<p class="instock availability">`
  - **link** from `href`, converted to a full URL
---
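
The pattern `article.product_pod` in the list above is written in CSS selector form. As a minimal sketch, the same fields can be pulled out with BeautifulSoup's `select()` / `select_one()`. The HTML below is hand-written to imitate the structure described above (it is not fetched from the site), and the book names in it are invented.

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the described structure (not real site data)
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/book-a_1/index.html" title="Book A">Book A</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock</p>
</article>
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="catalogue/book-b_2/index.html" title="Book B">Book B</a></h3>
  <p class="price_color">£12.50</p>
  <p class="instock availability">In stock</p>
</article>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# CSS selector: <article> tags carrying the class "product_pod"
pods = soup_demo.select("article.product_pod")

rows = []
for pod in pods:
    link = pod.select_one("h3 a")
    # The rating word is the class name that is not "star-rating"
    rating_classes = pod.select_one("p.star-rating")["class"]
    rating_word = next(c for c in rating_classes if c != "star-rating")
    rows.append({
        "title": link["title"],
        "price_text": pod.select_one("p.price_color").get_text(strip=True),
        "rating_word": rating_word,
        "href": link["href"],
    })

print(rows)
```

Whether to use `select()` or `find_all()` is mostly a matter of taste; both locate the same elements for this page structure.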


Test request to make sure the website is reachable:

In [437]:
headers = {"User-Agent": "Mozilla/5.0"}  # defined here so this cell runs on its own
resp = requests.get(BASE_URL, headers=headers, timeout=30)
print("Status:", resp.status_code)
print("Length:", len(resp.text))
print("First 120 chars:\n", resp.text[:120])

Status: 200
Length: 51294
First 120 chars:
 <!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]


 **Parse HTML and Locate Book Blocks**


In [400]:
soup = BeautifulSoup(resp.text, "html.parser")
print(type(soup))
books = soup.find_all("article", class_="product_pod")
print("Number of books on this page:", len(books))

first_book = books[0]
print(first_book.prettify()[:800])




<class 'bs4.BeautifulSoup'>
Number of books on this page: 20
<article class="product_pod">
 <div class="image_container">
  <a href="catalogue/a-light-in-the-attic_1000/index.html">
   <img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
  </a>
 </div>
 <p class="star-rating Three">
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
  <i class="icon-star">
  </i>
 </p>
 <h3>
  <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">
   A Light in the ...
  </a>
 </h3>
 <div class="product_price">
  <p class="price_color">
   Â£51.77
  </p>
  <p class="instock availability">
   <i class="icon-ok">
   </i>
   In stock
  </p>
  <form>
   <button class="btn btn-primary btn-block" data-loading-


Extract title, price, rating, stock status, and link from one book block.

In [None]:
def parse_book(article):
    # 1) title + link
    a_tag = article.find("h3").find("a")
    title = a_tag.get("title", "").strip()
    relative_link = a_tag.get("href", "").strip()
    full_link = urljoin(BASE_URL, relative_link)

    # 2) price
    price_text = article.find("p", class_="price_color").get_text(strip=True)
    price_text = price_text.replace("Â", "").replace("£", "").strip()  
    price = float(price_text)

    # 3) availability
    availability_text = article.find("p", class_="instock availability").get_text(" ", strip=True)
    availability = "In stock" in availability_text

    # 4) rating 
    rating_tag = article.find("p", class_="star-rating")
    rating_classes = rating_tag.get("class", [])
    rating_word = [c for c in rating_classes if c != "star-rating"][0]
    rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    rating = rating_map.get(rating_word, None)

    return {
        "title": title,
        "price": price,
        "rating": rating,
        "In Stock?": availability,
        "link": full_link
    }
# test
sample = parse_book(first_book)
sample


{'title': 'A Light in the Attic',
 'price': 51.77,
 'rating': 3,
 'In Stock?': True,
 'link': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'}
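
The sample above comes from the first page, where each `href` already starts with `catalogue/`. The links recorded later in this notebook for other pages (for example `https://books.toscrape.com/in-her-wake_980/ind...`) lack that segment, which suggests that hrefs on `catalogue/page-N.html` are relative to `catalogue/`. A minimal sketch, under that assumption, of resolving each href against the URL of the page it was found on instead of `BASE_URL`; the helper name `resolve_link` and the second href are hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"


def resolve_link(page_url, href):
    """Resolve a book href against the page it was found on (hypothetical helper)."""
    return urljoin(page_url, href)


# First page: the href already includes "catalogue/"
first = resolve_link(BASE_URL, "catalogue/a-light-in-the-attic_1000/index.html")

# Later page: href shape assumed to be relative to "catalogue/"
page_2_url = urljoin(BASE_URL, "catalogue/page-2.html")
later = resolve_link(page_2_url, "in-her-wake_980/index.html")

# Same href joined against BASE_URL loses the "catalogue/" segment
against_base = urljoin(BASE_URL, "in-her-wake_980/index.html")

print(first)
print(later)
print(against_base)
```

If the assumption holds, passing the page URL into `parse_book` would keep the stored links valid for every page.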

In [343]:
page_books = [parse_book(b) for b in books]
print("Parsed books:", len(page_books))

save_json(page_books, RAW_JSON_PATH)
print("Saved to:", RAW_JSON_PATH)


Parsed books: 20
Saved to: c:\Users\Hp\Desktop\PROJECT_DS\data\raw_books.json


## 6) Web Crawling
### What is Web Crawling?
Web crawling is the process of navigating multiple pages of a site to collect a complete dataset.


### How it is implemented here
- Read the total page count from the pagination text, e.g. `Page 1 of 50`
- Build page URLs using `build_page_url(page_number)`
- Loop from page 1 to last page:
  - fetch page HTML
  - extract all `article.product_pod`
  - parse each book using `parse_book`
- Save the full raw dataset into `raw_books.json`
---

Function to fetch any page and return it as a BeautifulSoup object.

In [344]:
def get_soup(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get(url, headers=headers, timeout=30)
    
    if r.status_code != 200:
        raise Exception(f"Request failed! Status={r.status_code} for URL: {url}")
    
    return BeautifulSoup(r.text, "html.parser")
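
`get_soup` parses `r.text`, and the `Â£51.77` seen in the earlier `prettify()` output (handled by `replace("Â", "")` in `parse_book`) is the typical symptom of UTF-8 bytes being decoded as ISO-8859-1. That this is the cause here is an assumption, not something the notebook confirms. The sketch below reproduces the effect with plain strings, without any network access:

```python
# Bytes as a UTF-8 server would send the price text
raw_bytes = "£51.77".encode("utf-8")

# Decoding with the wrong charset produces the stray leading character
garbled = raw_bytes.decode("iso-8859-1")

# Decoding as UTF-8 gives the clean text
clean = raw_bytes.decode("utf-8")

print(repr(garbled))
print(repr(clean))

# With clean text, only the currency symbol needs stripping
price = float(clean.replace("£", ""))
print(price)
```

If that is indeed what happens, setting `r.encoding = "utf-8"` before reading `r.text` would be one way to avoid the extra cleanup step.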


 Read the total number of pages from the website 


In [345]:
home_soup = get_soup(BASE_URL)
current_text = home_soup.find("li", class_="current").get_text(" ", strip=True)
print("Page text:", current_text)
total_pages = int(current_text.split("of")[-1].strip())
print("Total pages:", total_pages)



Page text: Page 1 of 50
Total pages: 50
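
Splitting on `"of"` works for the text shown above. As a slightly stricter alternative, the sketch below uses a regular expression that tolerates extra whitespace and fails loudly if the pagination text changes shape. The function name `parse_page_counter` is hypothetical.

```python
import re


def parse_page_counter(text):
    """Return (current_page, total_pages) from text like 'Page 1 of 50'."""
    match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", text)
    if match is None:
        raise ValueError(f"Unexpected pagination text: {text!r}")
    return int(match.group(1)), int(match.group(2))


current, total = parse_page_counter("Page 1 of 50")
print(current, total)

# Extra whitespace and newlines around the numbers are tolerated
print(parse_page_counter("  Page 12   of\n 50 "))
```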


Build the correct URL for each page number

In [425]:
def build_page_url(page_number):
    if page_number == 1:
        return BASE_URL
    return urljoin(BASE_URL, f"catalogue/page-{page_number}.html")

# test
print(build_page_url(1))
print(build_page_url(2))
print(build_page_url(50))


https://books.toscrape.com/
https://books.toscrape.com/catalogue/page-2.html
https://books.toscrape.com/catalogue/page-50.html


Crawl all pages sequentially and measure the runtime

In [347]:
start = time.perf_counter()

all_books = []

for page in range(1, total_pages + 1):
    page_url = build_page_url(page)
    soup_page = get_soup(page_url)
    
    articles = soup_page.find_all("article", class_="product_pod")
    page_data = [parse_book(a) for a in articles]
    
    all_books.extend(page_data)
    print(f"Page {page}/{total_pages} -> {len(page_data)} books | Total so far: {len(all_books)}")
end = time.perf_counter()

print("DONE. Total books collected:", len(all_books))
print(f"Time taken (Not Threaded): {end - start:.2f} seconds")



Page 1/50 -> 20 books | Total so far: 20
Page 2/50 -> 20 books | Total so far: 40
Page 3/50 -> 20 books | Total so far: 60
Page 4/50 -> 20 books | Total so far: 80
Page 5/50 -> 20 books | Total so far: 100
Page 6/50 -> 20 books | Total so far: 120
Page 7/50 -> 20 books | Total so far: 140
Page 8/50 -> 20 books | Total so far: 160
Page 9/50 -> 20 books | Total so far: 180
Page 10/50 -> 20 books | Total so far: 200
Page 11/50 -> 20 books | Total so far: 220
Page 12/50 -> 20 books | Total so far: 240
Page 13/50 -> 20 books | Total so far: 260
Page 14/50 -> 20 books | Total so far: 280
Page 15/50 -> 20 books | Total so far: 300
Page 16/50 -> 20 books | Total so far: 320
Page 17/50 -> 20 books | Total so far: 340
Page 18/50 -> 20 books | Total so far: 360
Page 19/50 -> 20 books | Total so far: 380
Page 20/50 -> 20 books | Total so far: 400
Page 21/50 -> 20 books | Total so far: 420
Page 22/50 -> 20 books | Total so far: 440
Page 23/50 -> 20 books | Total so far: 460
Page 24/50 -> 20 books |

In [348]:
save_json(all_books, RAW_JSON_PATH)
print("Saved raw dataset to:", RAW_JSON_PATH)


Saved raw dataset to: c:\Users\Hp\Desktop\PROJECT_DS\data\raw_books.json


## 7) Multithreading
### Why multithreading?
Crawling many pages sequentially can be slow because each HTTP request waits for a network response.  
Multithreading allows multiple page requests to run in parallel to reduce total runtime.

### Implementation approach
- Use `ThreadPoolExecutor` with a chosen number of worker threads.
- Submit one task per page (fetch + parse page).
- Collect results as tasks complete.
- Combine all book records into one dataset.

**What we print in the notebook:**
- `Time taken (Not Threaded): ... seconds`
- `Time taken (threaded): ... seconds`
---
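
The approach above can be sketched without touching the network by replacing each request with a short `time.sleep`. The function `fake_fetch`, the delay, and the page range are stand-ins chosen for illustration; real timings depend on the site and the connection.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(page_number, delay=0.05):
    """Stand-in for fetch + parse: wait like a network call, return fake records."""
    time.sleep(delay)
    return [f"book-{page_number}-{i}" for i in range(2)]


pages = list(range(1, 9))

# Sequential baseline: each "request" waits for the previous one
start = time.perf_counter()
sequential = [fake_fetch(p) for p in pages]
sequential_time = time.perf_counter() - start

# Threaded version: one task per page, results collected as they complete
start = time.perf_counter()
by_page = {}
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = {executor.submit(fake_fetch, p): p for p in pages}
    for future in as_completed(futures):
        by_page[futures[future]] = future.result()
threaded_time = time.perf_counter() - start

# Tasks finish in arbitrary order, so rebuild the list in page order
threaded = [by_page[p] for p in pages]

print(f"Time taken (Not Threaded): {sequential_time:.2f} seconds")
print(f"Time taken (threaded): {threaded_time:.2f} seconds")
```

Keying the results by page number is optional; it only matters when the final dataset should follow the site's page order rather than completion order.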


In [349]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


Function that fetches and parses a single page:

In [424]:
def fetch_page_books(page_number):
    """Fetch one listing page and return its parsed books (empty list on error)."""
    try:
        page_url = build_page_url(page_number)
        soup_page = get_soup(page_url)
        articles = soup_page.find_all("article", class_="product_pod")
        return [parse_book(a) for a in articles]
    except Exception as e:
        print(f"[ERROR] Page {page_number}: {e}")
        return []


Crawl pages in parallel using ThreadPoolExecutor

In [None]:
start = time.perf_counter()

all_books_threaded = []

max_workers = 10 

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = {executor.submit(fetch_page_books, p): p for p in range(1, total_pages + 1)}
    
    for future in as_completed(futures):
        page_num = futures[future]
        page_data = future.result()
        all_books_threaded.extend(page_data)
        print(f"Finished page {page_num} -> {len(page_data)} books | Total: {len(all_books_threaded)}")

end = time.perf_counter()
print(f"\nThreaded crawling DONE. Total books: {len(all_books_threaded)}")
print(f"Time taken (threaded): {end - start:.2f} seconds")


Finished page 8 -> 20 books | Total: 20
Finished page 1 -> 20 books | Total: 40
Finished page 5 -> 20 books | Total: 60
Finished page 4 -> 20 books | Total: 80
Finished page 3 -> 20 books | Total: 100
Finished page 10 -> 20 books | Total: 120
Finished page 2 -> 20 books | Total: 140
Finished page 7 -> 20 books | Total: 160
Finished page 6 -> 20 books | Total: 180
Finished page 9 -> 20 books | Total: 200
Finished page 15 -> 20 books | Total: 220
Finished page 13 -> 20 books | Total: 240
Finished page 20 -> 20 books | Total: 260
Finished page 12 -> 20 books | Total: 280
Finished page 14 -> 20 books | Total: 300
Finished page 17 -> 20 books | Total: 320
Finished page 19 -> 20 books | Total: 340
Finished page 16 -> 20 books | Total: 360
Finished page 11 -> 20 books | Total: 380
Finished page 18 -> 20 books | Total: 400
Finished page 21 -> 20 books | Total: 420
Finished page 26 -> 20 books | Total: 440
Finished page 28 -> 20 books | Total: 460
Finished page 22 -> 20 books | Total: 480
Finis

Save the multithreaded dataset into a JSON file

In [352]:
RAW_JSON_THREADED_PATH = os.path.join(DATA_DIR, "raw_books_threaded.json")
save_json(all_books_threaded, RAW_JSON_THREADED_PATH)
print("Saved threaded dataset to:", RAW_JSON_THREADED_PATH)


Saved threaded dataset to: c:\Users\Hp\Desktop\PROJECT_DS\data\raw_books_threaded.json


## 8) Pandas 

### Loading, Cleaning & Analysis
- Load the JSON dataset into a Pandas DataFrame
- Check dataset shape and data types (`df.info()` / `df.describe()`)
- Check missing values (`df.isnull().sum()`)
- Fill missing values:
  - `price` filled with mean
  - `rating` filled with median then convert to int
- Grouping:
  - mean price by rating
  - min/max/mean/median price by rating
- Filtering:
  - create a subset for NOT in stock (`In Stock? == False`)
  - create best deals: in stock + rating >= 4 + price < 30
---
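
The steps listed above can be tried on a tiny hand-made DataFrame before running them on the full dataset. The four rows below are invented for illustration and the name `df_demo` is hypothetical; only the column names match the scraped data.

```python
import pandas as pd

# Invented rows with one missing price and one missing rating
df_demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "price": [10.0, None, 30.0, 20.0],
    "rating": [5, 4, None, 1],
    "In Stock?": [True, True, False, True],
})

# Fill missing values: price with the mean, rating with the median, then cast to int
df_demo["price"] = df_demo["price"].fillna(df_demo["price"].mean())
df_demo["rating"] = df_demo["rating"].fillna(df_demo["rating"].median()).astype(int)

# Grouping: price statistics per rating
stats = df_demo.groupby("rating")["price"].agg(["min", "max", "mean", "median"])

# Filtering: not in stock, and best deals (in stock + rating >= 4 + price < 30)
not_in_stock = df_demo[df_demo["In Stock?"] == False]
best = df_demo[df_demo["In Stock?"] & (df_demo["rating"] >= 4) & (df_demo["price"] < 30)]

print(stats)
print(not_in_stock["title"].tolist())
print(best["title"].tolist())
```

Note that filled values take part in the group statistics, so a group containing imputed prices can show the global mean as one of its values.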


Load the JSON dataset into a Pandas DataFrame


In [406]:
data = load_json(RAW_JSON_THREADED_PATH)
df = pd.DataFrame(data)

print("Dataset shape:",df.shape)
df.head(5)


Dataset shape: (1000, 5)


| | title | price | rating | In Stock? | link |
|---|---|---|---|---|---|
| 0 | In Her Wake | 12.84 | 1.0 | True | https://books.toscrape.com/in-her-wake_980/ind... |
| 1 | How Music Works | 37.32 | 2.0 | True | https://books.toscrape.com/how-music-works_979... |
| 2 | Foolproof Preserving: A Guide to Small Batch J... | 30.52 | 3.0 | True | https://books.toscrape.com/foolproof-preservin... |
| 3 | Chase Me (Paris Nights #2) | 25.27 | 5.0 | True | https://books.toscrape.com/chase-me-paris-nigh... |
| 4 | Black Dust | 34.53 | 5.0 | True | https://books.toscrape.com/black-dust_976/inde... |


Check dataset shape and data types

In [370]:
df.info()
df.describe()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      1000 non-null   object 
 1   price      976 non-null    float64
 2   rating     992 non-null    float64
 3   In Stock?  1000 non-null   bool   
 4   link       1000 non-null   object 
dtypes: bool(1), float64(2), object(2)
memory usage: 32.4+ KB


| | price | rating |
|---|---|---|
| count | 976.0 | 992.0 |
| mean | 35.102059 | 2.925403 |
| std | 14.470956 | 1.434925 |
| min | 10.0 | 1.0 |
| 25% | 22.095 | 2.0 |
| 50% | 36.11 | 3.0 |
| 75% | 47.535 | 4.0 |
| max | 59.99 | 5.0 |


Check missing values 

In [371]:
print("The null Values are:\n ",df.isnull().sum())


The null Values are:
  title         0
price        24
rating        8
In Stock?     0
link          0
dtype: int64


Fill missing values for price (with the mean) and rating (with the median)

In [411]:
price_mean = df["price"].mean()
print("The mean price is:", price_mean)
df["price"] = df["price"].fillna(price_mean)
print("_________________________________________________________")

rating_median = df["rating"].median()
print("The median rating is:", rating_median)
df["rating"] = df["rating"].fillna(rating_median)
print("_________________________________________________________")


print("The null Values are:\n ",df.isnull().sum())




The mean price is: 35.1020594262295
_________________________________________________________
The median rating is: 3.0
_________________________________________________________
The null Values are:
  title        0
price        0
rating       0
In Stock?    0
link         0
dtype: int64


Convert rating from float to int

In [427]:
df["rating"] = df["rating"].astype(int)
df["rating"].sort_values(ascending=False)


999    5
565    5
611    5
610    5
601    5
      ..
683    1
136    1
137    1
661    1
0      1
Name: rating, Length: 1000, dtype: int32

Grouping: average price by rating


In [414]:
#grouping
avg_price_by_rating = df.groupby("rating")["price"].mean()
print(avg_price_by_rating)

rating
1    34.481671
2    35.004810
3    34.281062
4    36.285889
5    35.724771
Name: price, dtype: float64


min/max/mean/median price by rating

In [415]:
group_rating = df.groupby("rating")["price"].agg([ "min", "max", "mean", "median"])
group_rating


| rating | min | max | mean | median |
|---|---|---|---|---|
| 1 | 10.4 | 59.64 | 34.481671 | 35.102059 |
| 2 | 10.02 | 59.95 | 35.00481 | 36.17 |
| 3 | 10.16 | 59.99 | 34.281062 | 33.29 |
| 4 | 10.01 | 59.45 | 36.285889 | 37.51 |
| 5 | 10.0 | 59.92 | 35.724771 | 36.83 |


Filter the books that are NOT in stock

In [429]:

# Subset of books that are NOT in stock
df_not_in_stock = df[df["In Stock?"] == False].copy()
print("Not in stock:", df_not_in_stock.shape)
df_not_in_stock



Not in stock: (9, 5)


| | title | price | rating | In Stock? | link |
|---|---|---|---|---|---|
| 882 | Catastrophic Happiness: Finding Joy in Childho... | 37.35 | 2 | False | https://books.toscrape.com/catastrophic-happin... |
| 886 | Boy Meets Boy | 21.12 | 3 | False | https://books.toscrape.com/boy-meets-boy_134/i... |
| 891 | Big Little Lies | 35.102059 | 1 | False | https://books.toscrape.com/big-little-lies_129... |
| 895 | Beautiful Creatures (Caster Chronicles #1) | 21.55 | 5 | False | https://books.toscrape.com/beautiful-creatures... |
| 898 | Are We There Yet? | 10.66 | 3 | False | https://books.toscrape.com/are-we-there-yet_12... |
| 914 | Beyond Good and Evil | 43.38 | 1 | False | https://books.toscrape.com/beyond-good-and-evi... |
| 936 | Paper Girls, Vol. 1 (Paper Girls #1-5) | 21.71 | 4 | False | https://books.toscrape.com/paper-girls-vol-1-p... |
| 945 | When I'm Gone | 51.96 | 3 | False | https://books.toscrape.com/when-im-gone_95/ind... |
| 954 | The Wicked + The Divine, Vol. 1: The Faust Act... | 36.52 | 2 | False | https://books.toscrape.com/the-wicked-the-divi... |


Create the best-deals subset: in stock + high rating (>= 4) + low price (< 30)

In [380]:
df_best_deals = df[(df["In Stock?"] == True) & (df["rating"] >= 4) & (df["price"] < 30)].copy()
print("Best deals:", df_best_deals.shape)
df_best_deals.sort_values(by="price").head(10)


Best deals: (135, 5)


| | title | price | rating | In Stock? | link |
|---|---|---|---|---|---|
| 618 | An Abundance of Katherines | 10.0 | 5 | True | https://books.toscrape.com/an-abundance-of-kat... |
| 421 | The Origin of Species | 10.01 | 4 | True | https://books.toscrape.com/the-origin-of-speci... |
| 539 | History of Beauty | 10.29 | 4 | True | https://books.toscrape.com/history-of-beauty_5... |
| 574 | NaNo What Now? Finding your editing process, r... | 10.41 | 4 | True | https://books.toscrape.com/nano-what-now-findi... |
| 835 | Green Eggs and Ham (Beginner Books B-16) | 10.79 | 4 | True | https://books.toscrape.com/green-eggs-and-ham-... |
| 590 | The Power Greens Cookbook: 140 Delicious Super... | 11.05 | 5 | True | https://books.toscrape.com/the-power-greens-co... |
| 396 | Dear Mr. Knightley | 11.21 | 5 | True | https://books.toscrape.com/dear-mr-knightley_6... |
| 363 | City of Fallen Angels (The Mortal Instruments #4) | 11.23 | 4 | True | https://books.toscrape.com/city-of-fallen-ange... |
| 621 | The Darkest Corners | 11.33 | 5 | True | https://books.toscrape.com/the-darkest-corners... |
| 501 | Naturally Lean: 125 Nourishing Gluten-Free, Pl... | 11.38 | 5 | True | https://books.toscrape.com/naturally-lean-125-... |


Save cleaned data + outputs (JSON/CSV)

In [None]:
# Save clean dataset
clean_records = df.to_dict(orient="records")
save_json(clean_records, CLEAN_JSON_PATH)
CLEAN_CSV_PATH = os.path.join(DATA_DIR, "clean_books.csv")
df.to_csv(CLEAN_CSV_PATH, index=False)

print("Saved clean dataset to:", CLEAN_JSON_PATH)
print("Saved clean CSV to:", CLEAN_CSV_PATH)


In [419]:
group_rating.to_csv(os.path.join(OUTPUT_DIR, "price_stats_by_rating.csv"))
df_best_deals.to_csv(os.path.join(OUTPUT_DIR, "best_deals.csv"), index=False)

print("Saved analysis outputs into outputs/ folder")


Saved analysis outputs into outputs/ folder


## 9) Conclusion
This project collected book data from a website using web scraping and crawling, saved results as JSON, improved crawling speed using multithreading, and analyzed the final dataset using Pandas (missing values handling, filtering, and grouping). Clean datasets and analysis outputs were exported as JSON/CSV files.

