# Scraping Laptop Data from the BestBuy Canada Website

**Updated for modular architecture with Loguru logging**

This notebook is an example of scraping data from the BestBuy Canada website. It demonstrates how to use the refactored web-scraping project and its modular code structure.

The goal is to help decide which computer to buy for data science projects by analyzing laptop listings from BestBuy. We request pages from the website, parse the HTML with BeautifulSoup, and automatically collect data about each laptop.

After saving the scraped data to a CSV file, we clean it, examine descriptive statistics, and create visualizations.

## New Project Structure

The project has been reorganized into:
- **src/** - All Python modules (config, scraper, data_cleaner, visualizer, logger)
- **data/** - CSV data files
- **notebooks/** - This tutorial notebook
- **docs/** - Documentation

## Setup: Import Required Libraries and Modules

In [1]:
# Standard library imports
import sys
from pathlib import Path

# Add parent directory to path to import from src/
sys.path.insert(0, str(Path.cwd().parent))

# External library imports
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from requests import get

# Project module imports
from src.logger import logger
from src.config import get_config, build_url
from src.scraper import scrape_all_laptops, extract_laptop_data
from src.data_cleaner import save_data, load_and_process_data, clean_price, clean_votes
from src.visualizer import visualize_data, create_histograms, create_boxplots

%matplotlib inline

## Using the Logger

The project now uses **Loguru** for logging instead of `print` statements, so each message carries a timestamp, a severity level, and its source location.

In [2]:
# Demonstrate logging
logger.info("Starting laptop data analysis tutorial")
logger.debug("This is a debug message")
logger.warning("This is a warning message")
logger.success("Logger is working correctly!")

2025-12-09 14:12:33 | INFO     | __main__:<module>:2 - Starting laptop data analysis tutorial
2025-12-09 14:12:33 | SUCCESS  | __main__:<module>:5 - Logger is working correctly!


## Configuration

Load the scraping configuration from the config module.

In [3]:
# Get configuration
config = get_config()
logger.info(f"Configuration loaded: {len(config['pages'])} pages, {len(config['ram_sizes'])} RAM sizes")
print(f"Pages to scrape: {config['pages'][:5]}...")  # Show first 5
print(f"RAM sizes: {config['ram_sizes']}")
print(f"Output file: {config['output_file']}")

2025-12-09 14:12:38 | INFO     | __main__:<module>:3 - Configuration loaded: 19 pages, 7 RAM sizes
Pages to scrape: ['1', '2', '3', '4', '5']...
RAM sizes: ['2', '4', '8', '12', '16', '32', '64']
Output file: laptops_rating2019.csv
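
To make the configuration step concrete, here is a minimal sketch of what `get_config` and `build_url` could look like. It is not the project's actual `src/config.py`: the function names carry a `_sketch` suffix so they do not shadow the imported ones, the page list `'1'` to `'19'` is an assumption based on the logged count of 19 pages, and the URL pattern is copied from the test URL logged in the next section.

```python
# Hypothetical sketch of the config helpers; the real src/config.py may differ.
BASE_URL = "https://www.bestbuy.ca/en-ca/category/windows-laptops/36711"

# Pre-encoded category path copied from the logged test URL.
# The RAM size is appended to the end of this prefix.
PATH_PREFIX = (
    "category%3AComputers%2B%2526%2BTablets%3B"
    "category%3ALaptops%2B%2526%2BMacBooks%3B"
    "category%3AWindows%2BLaptops%3B"
    "custom0ramsize%3A"
)


def get_config_sketch():
    """Return the scraping settings as a plain dict."""
    return {
        "pages": [str(i) for i in range(1, 20)],  # assumed to be '1' .. '19'
        "ram_sizes": ["2", "4", "8", "12", "16", "32", "64"],
        "output_file": "laptops_rating2019.csv",
    }


def build_url_sketch(page, ram_size):
    """Build the listing URL for one page number and one RAM size (both strings)."""
    return f"{BASE_URL}?page={page}&path={PATH_PREFIX}{ram_size}"
```

Keeping the settings in one dict and the URL logic in one function means the scraper itself never has to know how the query string is assembled.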


## Test Single URL Request

Let's test a single request to understand the HTML structure.

In [4]:
# Build a test URL
test_url = build_url('1', '8')
logger.info(f"Test URL: {test_url}")

try:
    response = get(test_url, timeout=10)
    response.raise_for_status()
    logger.success(f"Successfully fetched URL. Status code: {response.status_code}")
    logger.info(f"Response length: {len(response.text)} characters")
except Exception as e:
    logger.error(f"Error fetching URL: {e}")

2025-12-09 14:13:01 | INFO     | __main__:<module>:3 - Test URL: https://www.bestbuy.ca/en-ca/category/windows-laptops/36711?page=1&path=category%3AComputers%2B%2526%2BTablets%3Bcategory%3ALaptops%2B%2526%2BMacBooks%3Bcategory%3AWindows%2BLaptops%3Bcustom0ramsize%3A8
2025-12-09 14:13:12 | ERROR    | __main__:<module>:11 - Error fetching URL: HTTPSConnectionPool(host='www.bestbuy.ca', port=443): Read timed out. (read timeout=10)
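
The request above timed out after 10 seconds. One common way to cope with transient timeouts is to retry with an increasing delay. The helper below is a sketch of that idea, not part of the project: `fetch_with_retries` and its parameters are hypothetical names, and it accepts any callable so it can be tried without network access.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, doubling the wait after each failure.

    `fetch` is any callable that returns a response or raises on failure.
    The last exception is re-raised if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

In this notebook, `fetch` could be something like `lambda u: get(u, timeout=30)`, where the 30-second timeout is only an example value. Note that retries do not help if the site blocks automated requests outright.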


## Parse HTML with BeautifulSoup

In [None]:
# Parse the HTML
html_soup = BeautifulSoup(response.text, 'html.parser')
logger.info(f"HTML parsed. Type: {type(html_soup)}")

# Find laptop containers
laptop_containers = html_soup.find_all('div', class_='item-inner clearfix')
logger.info(f"Found {len(laptop_containers)} laptop containers")

## Extract Data from First Laptop

Let's examine the first laptop container to understand the data structure.

In [None]:
if laptop_containers:
    first_laptop = laptop_containers[0]
    
    # Check if it has a rating
    if first_laptop.find('div', class_='rating-stars-yellow') is not None:
        laptop_data = extract_laptop_data(first_laptop)
        logger.info("Extracted data from first laptop:")
        for key, value in laptop_data.items():
            print(f"  {key}: {value}")
    else:
        logger.warning("First laptop has no rating")
else:
    logger.error("No laptop containers found!")
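
Because the live request may fail, it can help to try the same parsing logic on a small inline snippet first. The HTML below is invented for illustration: only the `item-inner clearfix` and `rating-stars-yellow` class names come from the cells above, while the `prod-title` heading is a hypothetical stand-in for whatever the real page uses.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only; the live page may be structured differently.
SAMPLE_HTML = """
<div class="item-inner clearfix">
  <h4 class="prod-title">Laptop A</h4>
  <div class="rating-stars-yellow"></div>
</div>
<div class="item-inner clearfix">
  <h4 class="prod-title">Laptop B</h4>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-word class_ string matches the class attribute exactly as written.
containers = soup.find_all("div", class_="item-inner clearfix")

# Keep only the containers that include a rating element.
rated = [c for c in containers if c.find("div", class_="rating-stars-yellow") is not None]

rated_titles = [c.find("h4").get_text(strip=True) for c in rated]
print(len(containers), rated_titles)
```

Here both containers are found, but only the first survives the rating filter, which mirrors the `is not None` check in the cell above.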

## Option 1: Run Full Scraper

**Warning:** This makes many requests (the configuration above lists 19 pages and 7 RAM sizes) and can take 20+ minutes. Consider using the existing data instead.

In [None]:
# UNCOMMENT TO RUN FULL SCRAPING (can take 20+ minutes)
# logger.info("Starting full web scraping...")
# data = scrape_all_laptops(config, build_url)
# logger.success(f"Scraping complete! Collected {len(data['names'])} laptops")
#
# # Save the data
# output_path = f"../data/{config['output_file']}"
# save_data(data, output_path)
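
To get a feel for the cost of a full run, the sketch below counts the requests and estimates the duration. It assumes that the scraper visits every page and RAM-size combination, which the notebook does not confirm, and the 10 seconds per request is an assumed average including any pause between requests. `plan_requests` and `estimate_minutes` are hypothetical helpers.

```python
from itertools import product


def plan_requests(pages, ram_sizes):
    """Return every (page, ram_size) pair a full crawl would visit."""
    return list(product(pages, ram_sizes))


def estimate_minutes(n_requests, seconds_per_request):
    """Rough wall-clock estimate: number of requests times average seconds each."""
    return n_requests * seconds_per_request / 60


pages = [str(i) for i in range(1, 20)]               # 19 pages (assumed '1' .. '19')
ram_sizes = ["2", "4", "8", "12", "16", "32", "64"]  # 7 RAM sizes, as printed earlier

pairs = plan_requests(pages, ram_sizes)
print(len(pairs))                        # 133
print(estimate_minutes(len(pairs), 10))  # roughly 22 minutes at an assumed 10 s each
```

Under these assumptions the estimate is in line with the "20+ minutes" noted in the cell above.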

## Option 2: Load Existing Data

Let's use the existing scraped data from the data/ directory.

In [None]:
# Load existing data
data_file = '../data/laptops_rating2019.csv'
logger.info(f"Loading data from {data_file}")

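# Only a cell's last top-level expression is rendered automatically,
# so call display() explicitly inside the try block.
from IPython.display import display

try:
    df_raw = pd.read_csv(data_file)
    logger.success(f"Data loaded successfully! Shape: {df_raw.shape}")
    display(df_raw.head())
except Exception as e:
    logger.error(f"Error loading data: {e}")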

## Data Cleaning

Use the data_cleaner module to process the raw data.

In [None]:
# Example: Clean a single price
sample_price = '$1,234.56'
cleaned_price = clean_price(sample_price)
logger.info(f"Price cleaning example: '{sample_price}' -> {cleaned_price}")

# Example: Clean a vote count
sample_vote = '(123)'
cleaned_vote = clean_votes(sample_vote)
logger.info(f"Vote cleaning example: '{sample_vote}' -> {cleaned_vote}")
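
For readers without the `src/` modules at hand, here is a rough sketch of cleaning functions that would produce the conversions shown above. These are not the project's implementations: the `_sketch` names are hypothetical, and the return types (`float` for prices, `int` for votes) are assumptions.

```python
import re


def clean_price_sketch(text):
    """Strip currency symbols and thousands separators: '$1,234.56' -> 1234.56."""
    digits = re.sub(r"[^0-9.]", "", text)  # keep only digits and the decimal point
    return float(digits)


def clean_votes_sketch(text):
    """Keep only the digits of a vote label: '(123)' -> 123."""
    digits = re.sub(r"\D", "", text)  # drop parentheses and any other non-digits
    return int(digits)


print(clean_price_sketch("$1,234.56"))
print(clean_votes_sketch("(123)"))
```

Both functions raise `ValueError` on input with no digits, so a real pipeline would need to decide how to handle missing prices or vote counts.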

In [None]:
# Clean the entire dataframe
logger.info("Cleaning dataframe...")
df_clean = load_and_process_data(data_file)
logger.success("Data cleaning complete!")

# Display cleaned data info
print("\nCleaned Data Info:")
df_clean.info()

In [None]:
# Display first few rows
df_clean.head()

## Descriptive Statistics

In [None]:
logger.info("Computing descriptive statistics...")
stats = df_clean[['prices', 'ratings', 'votes']].describe()
print("\nDescriptive Statistics:")
print(stats)

## Find Interesting Insights

In [None]:
# Find laptop with most votes
max_votes_idx = df_clean['votes'].idxmax()
most_voted = df_clean.loc[max_votes_idx]
logger.info(f"Most voted laptop: {most_voted['laptops']} with {most_voted['votes']} votes")
print("\nMost Voted Laptop:")
print(f"  Name: {most_voted['laptops']}")
print(f"  Price: ${most_voted['prices']:.2f}")
print(f"  Rating: {most_voted['ratings']}%")
print(f"  Votes: {most_voted['votes']:.0f}")

In [None]:
# Find laptops with perfect rating (100%)
perfect_ratings = df_clean[df_clean['ratings'] == 100.0]
logger.info(f"Found {len(perfect_ratings)} laptops with 100% rating")
if len(perfect_ratings) > 0:
    print("\nLaptops with 100% rating:")
    print(perfect_ratings[['laptops', 'prices', 'votes']].head())

## Data Visualization

Use the visualizer module to create charts.

In [None]:
# Create histograms
logger.info("Creating histograms...")
create_histograms(df_clean)

In [None]:
# Create boxplots
logger.info("Creating boxplots...")
create_boxplots(df_clean)

## Complete Visualization Pipeline

In [None]:
# Use the complete visualization function
logger.info("Running complete visualization pipeline...")
visualize_data(df_clean)

## Custom Analysis

Perform additional custom analysis on the data.

In [None]:
# Price distribution by rating category
logger.info("Analyzing price by rating category...")

# Categorize ratings
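df_clean['rating_category'] = pd.cut(
    df_clean['ratings'],
    bins=[0, 70, 85, 95, 100],
    labels=['Low (0-70)', 'Medium (70-85)', 'High (85-95)', 'Excellent (95-100)'],
    include_lowest=True,  # otherwise a rating of exactly 0 falls outside every bin
)

# Group by category and show mean, median, and count of prices
# (observed=False keeps empty categories and avoids a FutureWarning in recent pandas)
price_by_rating = df_clean.groupby('rating_category', observed=False)['prices'].agg(['mean', 'median', 'count'])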
print("\nPrice Statistics by Rating Category:")
print(price_by_rating)
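
The bin edges deserve a closer look, because `pd.cut` treats intervals as right-closed. The toy example below uses made-up ratings and shortened labels to show where boundary values land, and how `include_lowest=True` changes the treatment of the lowest edge.

```python
import pandas as pd

ratings = pd.Series([0, 70, 70.5, 95, 100])  # made-up values on the bin boundaries
bins = [0, 70, 85, 95, 100]
labels = ["Low", "Medium", "High", "Excellent"]

# Default: intervals are (0, 70], (70, 85], (85, 95], (95, 100], so 0 is unbinned.
default_cut = pd.cut(ratings, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 becomes "Low".
inclusive_cut = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

A rating of exactly 70 counts as "Low" and exactly 95 as "High", since each value belongs to the interval it closes.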

In [None]:
# Visualize price vs rating category
# DataFrame.boxplot creates its own figure, so no separate plt.figure() call is needed
df_clean.boxplot(column='prices', by='rating_category', figsize=(12, 6))
plt.title('Price Distribution by Rating Category')
plt.suptitle('')  # Remove automatic title
plt.ylabel('Price ($)')
plt.xlabel('Rating Category')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()

## Summary

This notebook demonstrated:

1. **Modular Code Structure**: Using functions from `src/` modules
2. **Loguru Logging**: Leveled, timestamped logging instead of print statements
3. **Data Pipeline**: Configuration → Scraping → Cleaning → Visualization
4. **Data Analysis**: Descriptive statistics and custom insights
5. **Visualization**: Histograms, boxplots, and custom charts

### Next Steps:

- Explore different RAM sizes and price ranges
- Add more sophisticated analysis (correlation, regression)
- Create a dashboard for interactive exploration
- Schedule regular scraping to track price trends over time

In [None]:
logger.success("Tutorial complete! You've learned how to use the webscraping project modules.")