<a id="1"></a>
# <div style="text-align: center; background-color: #569db3; color: white; padding: 14px; line-height: 1;border-radius:20px">Document — Web Scraping [Dubizzle_mobile_phones.com]</div>

## 🧭 Table of Contents
1. [⚙️ Code Approaches](#️code)
2. [🌐 Which Website Was Scraped](#-which-website-was-scraped)
3. [💾 Data Collected](#-data-collected)
4. [🚧 Challenges and Solutions](#-challenges-and-solutions)
5. [📘 How to Run the Script](#-how-to-run-the-script)

<h1 style="color:red; font-weight:bold; background-color:#f5f5f5; border:2px solid red; padding:10px; border-radius:10px; font-family:monospace;">
<code>1- Code approaches:-</code>
</h1>

In this notebook, I demonstrate **two different approaches** to writing Python web scraping code:

**<span style="color:blue;">1. Monolithic Approach (All-in-One Code)</span>**

   - In this approach, the entire scraping process is written in one continuous block of code.  
   - It includes fetching pages, parsing HTML, extracting product details, and writing to CSV all together.  
   - This method is simple for small scripts but can become hard to maintain and read for larger projects.

**<span style="color:blue;">2. Modular Approach (Using Functions)</span>**
   - In this approach, the code is divided into **functions** for each specific task, such as:
     - Creating the CSV file
     - Fetching a webpage
     - Parsing a product card
     - Writing data to CSV
     - Main loop controlling the scraping
   - This method improves **readability, reusability, and maintainability**.
   - Each function has a clear responsibility, making the code easier to debug and extend.

Both methods achieve the same end result: scraping mobile phone data from the website and saving it to a CSV file.  
The difference lies in **code organization and readability**.
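
To make the modular layout concrete, here is a minimal sketch of how the functions listed above could fit together. The function names, the three-column `FIELDS` list, and the tag names in `parse_card` are illustrative assumptions, not the notebook's actual code, and the page-fetching step is left out so the example runs offline on a made-up HTML snippet.

```python
import csv
import io

from bs4 import BeautifulSoup

FIELDS = ["product_name", "price", "url"]  # shortened column list for the sketch


def create_csv(handle):
    """Create the CSV writer and write the header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    return writer


def parse_card(card):
    """Extract one listing from a product card (tag names are assumptions)."""
    title = card.find("h2")
    price = card.find("span")
    link = card.find("a")
    return {
        "product_name": title.get_text(strip=True) if title else "N/A",
        "price": price.get_text(strip=True) if price else "N/A",
        "url": link["href"] if link and link.has_attr("href") else "N/A",
    }


def write_row(writer, row):
    """Write a single listing to the CSV."""
    writer.writerow(row)


def scrape(html, handle):
    """Main loop for one page of already-fetched HTML; returns the row count."""
    writer = create_csv(handle)
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    for card in soup.find_all("article"):
        write_row(writer, parse_card(card))
        count += 1
    return count


# Made-up markup standing in for a listing page.
sample_html = """
<article><a href="/ad/1"><h2>iPhone 13</h2></a><span>25000 EGP</span></article>
<article><a href="/ad/2"><h2>Galaxy S22</h2></a></article>
"""
buffer = io.StringIO()
rows_written = scrape(sample_html, buffer)
print(buffer.getvalue())
```

Because each step is its own function, a change to the site's markup should only require touching `parse_card`, while the CSV and loop logic stay the same.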


<h1 style="color:red; font-weight:bold; background-color:#f5f5f5; border:2px solid red; padding:10px; border-radius:10px; font-family:monospace;">
<code>2- Which Website Was Scraped :- </code>
</h1>

- **Website:** [Dubizzle.com](https://www.dubizzle.com.eg)  
- **Category:** Mobile Phones  
- The script navigates through **all pages** of the mobile phone category.

<h1 style="color:red; font-weight:bold; background-color:#f5f5f5; border:2px solid red; padding:10px; border-radius:10px; font-family:monospace;">
<code>3- Data Collected </code>
</h1>

### 🧠 Data Description

The dataset generated by this scraper contains detailed information about each mobile phone listing from Dubizzle Egypt.

Here’s what each field represents:

- **`product_name`** : Name of the mobile phone listed.  
- **`price`** : The price of the phone.  
- **`seller`** : The name or type of the seller (individual or store).  
- **`city`** : The city where the phone is listed.  
- **`Governorate`** : The governorate (region) of the listing.  
- **`Brand`** : The brand of the phone (e.g., Samsung, Apple).  
- **`Model`** : The phone’s model name or number.  
- **`RAM`** : The RAM capacity of the phone (e.g., 4 GB).  
- **`Storage`** : The storage size (e.g., 128 GB).  
- **`Battery_Capacity`** : The phone’s battery capacity (if listed).  
- **`Ad_Type`** : Whether the listing is new, used, or other.  
- **`Payment_Option`** : Available payment methods, if provided.  
- **`Warranty`** : Warranty information, if applicable.  
- **`Condition`** : Condition of the phone (new/used).  
- **`page_number`** : The page number from which the listing was scraped.  
- **`url`** : The full URL of the product details page.  
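
As a rough illustration of this schema, the sketch below writes the header and one row using Python's built-in `csv` module. The column names come from the list above; the `make_row` helper and the sample values are hypothetical and only show how unlisted fields could default to `'N/A'`.

```python
import csv
import io

COLUMNS = [
    "product_name", "price", "seller", "city", "Governorate", "Brand",
    "Model", "RAM", "Storage", "Battery_Capacity", "Ad_Type",
    "Payment_Option", "Warranty", "Condition", "page_number", "url",
]


def make_row(**known):
    """Build a full row, filling any field not supplied with 'N/A'."""
    return {column: known.get(column, "N/A") for column in COLUMNS}


# An in-memory buffer stands in for the real output file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(make_row(product_name="Example Phone", Brand="Samsung", page_number=1))

lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[1])
```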



<h1 style="color:red; font-weight:bold; background-color:#f5f5f5; border:2px solid red; padding:10px; border-radius:10px; font-family:monospace;">
<code>4- Some Challenges and Solutions :-</code>
</h1>



### <span style="color:blue;"><code>Challenge 1:</code> Anti-scraping protection on sites like Dubizzle</span>

**`Problem:`**  
Some websites (e.g., Dubizzle) detect rapid, repeated requests coming from the same machine and treat that as automated scraping. When this happens the site may:

- Slow down page responses,  
- Serve empty or different HTML (so your scraper runs but finds no data), or  
- Temporarily block the client for several minutes.

Often, after waiting a while and trying again, the scraper works normally, which suggests the restriction is temporary.

**`Solution:`**  
To reduce the chance of being flagged as a bot, the scraper applies two measures:

1. **Add delays between requests**  
   Insert a pause between consecutive requests so the traffic rate looks human-like. Use randomized delays (e.g., `random.uniform(2, 6)`) rather than a fixed constant to avoid a predictable pattern.

2. **Use realistic request headers**  
   Send browser-like headers (e.g., a common `User-Agent`) to make requests appear similar to a real browser session.


**`Why this helps:`**  
Randomized delays and realistic headers make the scraper's traffic look more like ordinary browsing, which lowers (but does not eliminate) the chance of triggering anti-scraping systems and being temporarily blocked.
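
The two measures can be combined in one small helper. This is a sketch under assumptions: the function name `polite_get`, the example `User-Agent` string, the 15-second timeout, and the injectable `session` and `sleep` parameters are illustrative choices rather than the notebook's actual code.

```python
import random
import time

import requests

# Browser-like headers; the exact User-Agent string is only an example.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def polite_get(url, session=None, min_delay=2.0, max_delay=6.0, sleep=time.sleep):
    """Wait a random interval, then request the page with browser-like headers."""
    sleep(random.uniform(min_delay, max_delay))
    client = session or requests  # fall back to the module-level requests.get
    return client.get(url, headers=HEADERS, timeout=15)
```

Calling `polite_get("https://www.dubizzle.com.eg")` would pause for 2 to 6 seconds before sending the request. Passing a `requests.Session()` as `session` reuses the same connection and cookies across pages.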



### <span style="color:blue;"><code>Challenge 2:</code> Dynamic Class Names in HTML</span>



**`Problem:`**  
Most of the website’s elements (like `<div>` and `<a>` tags) used **Bootstrap or dynamically generated class names**.  
This caused issues because when the website was updated, class names often changed — which meant the scraper couldn’t find the elements anymore, and no data was returned.

**`Solution:`**  
To make the scraper more stable and more resilient to site updates:
- I made the code depend on **static or consistent class names** whenever possible.  
- When all class names were dynamic, I used **CSS selectors with partial matches** (e.g., selecting a part of the class name that was unlikely to change).  
- This allowed the scraper to locate elements reliably even if the site structure changed slightly.
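
A partial-match selector might look like the following sketch. The markup and class names here are made up to imitate generated names with a stable prefix; the real Dubizzle class names are not shown in this document.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates generated class names with a stable prefix.
html = """
<div class="card_x9f2a">
  <span class="price_3kd81">12000 EGP</span>
  <a class="title_77qpz" href="/ad/42">Example Phone</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# [class^="price_"] matches class attributes that start with "price_";
# [class*="title"] matches class attributes that contain "title" anywhere.
price = soup.select_one('span[class^="price_"]')
title = soup.select_one('a[class*="title"]')

print(price.get_text(strip=True))
print(title.get_text(strip=True), title["href"])
```

Note that `^=` tests the start of the whole `class` attribute, so when an element carries several classes the `*=` form is usually the safer choice.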



### <span style="color:blue;"><code>Challenge 3:</code> Handling Request Failures</span>

**`Problem:`** Some requests failed due to network issues or missing pages.  
**`Solution:`** Wrapped each `requests.get()` call inside a `try-except` block to catch exceptions and continue scraping without stopping the program.
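
A minimal sketch of that pattern is shown below. The function name `fetch_html`, the timeout value, and the injectable `get` parameter are assumptions added for illustration; the `raise_for_status()` call is an optional extra that also treats 4xx/5xx responses (such as missing pages) as failures.

```python
import requests


def fetch_html(url, get=requests.get):
    """Return the page HTML, or None if the request fails for any reason."""
    try:
        response = get(url, timeout=15)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.text
    except requests.RequestException as error:
        # Log the failure and let the caller move on to the next URL.
        print(f"Skipping {url}: {error}")
        return None
```

The caller can then check for `None` and continue with the next page instead of stopping the whole run.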

### <span style="color:blue;"><code>Challenge 4:</code> Missing Data in Some Products</span>

**`Problem:`** Some product pages had missing fields like seller or warranty.  
**`Solution:`** Added conditional checks (`if ... else 'N/A'`) before writing data to the CSV to prevent errors and keep the dataset consistent.
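
One way to avoid repeating the same conditional for every field is a small helper, sketched here. The name `text_or_na` and the element ids in the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_na(tag):
    """Return the tag's stripped text, or 'N/A' when the tag is missing or empty."""
    if tag is None:
        return "N/A"
    return tag.get_text(strip=True) or "N/A"


# Made-up detail page: the seller is present, the warranty is not.
soup = BeautifulSoup('<div><span id="seller">Ahmed Store</span></div>', "html.parser")

seller = text_or_na(soup.find(id="seller"))
warranty = text_or_na(soup.find(id="warranty"))
print(seller, warranty)
```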

### <span style="color:blue;"><code>Challenge 5:</code>  Incorrect Product Name Extraction</span>


**`Problem:`** 
During scraping, the **product name** field sometimes did not contain the actual phone name.  
Instead, it occasionally held seller information (such as the seller's name or location) or another unrelated title from the page.  

This happened because the `<h1>` tag used for the product name was **not always consistent**: in some listings, the same tag contained seller details or irrelevant text.  
As a result, the dataset ended up with incorrect values in the `product_name` column.


**`Solution:`**
To reduce this problem, a **regular expression check** was added to filter the text extracted from `<h1>`.  
The regex accepts the text only if it contains at least one **Latin letter or digit** (`[A-Za-z0-9]`); otherwise the field is set to `'N/A'`.

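```python
import re

# Look up the <h1> once; it may be missing on some pages.
h1 = second_soup.find('h1')
h1_text = h1.get_text(strip=True) if h1 else ''

# Keep the text only if it contains at least one Latin letter or digit.
product_name = h1_text if re.search(r'[A-Za-z0-9]', h1_text) else 'N/A'
```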



<h1 style="color:red; font-weight:bold; background-color:#f5f5f5; border:2px solid red; padding:10px; border-radius:10px; font-family:monospace;">
<code>📘 5- How to Run the Script and What Sites Were Scraped </code>
</h1>

## 1. Overview
This script scrapes **mobile phone data** from the website [Dubizzle.com](https://www.dubizzle.com.eg), specifically from the **Mobile Phones** category.  
It extracts detailed information about each phone (price, seller, condition, specifications, etc.) and saves the data into a CSV file.

---

## 2. How to Run the Script

**Install required libraries:**

    pip install requests beautifulsoup4

The `re` module is part of the Python standard library and does not need to be installed.

## 🧠 How the Script Works

Run the notebook cell in your Python environment.  
The script will:

- Start from **page 1** of the “Mobile Phones” category.  
- Visit each product page to extract detailed information.  
- Continue automatically to the next page until no more products are available.  
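
The steps above can be sketched as a simple loop. Everything here is an assumption made for illustration: the function name `scrape_all_pages`, its two callback parameters, and the `fake_site` stand-in data that lets the loop run without network access.

```python
def scrape_all_pages(fetch_listing_urls, scrape_product, start_page=1):
    """Walk listing pages in order until a page yields no product links."""
    rows = []
    page = start_page
    while True:
        urls = fetch_listing_urls(page)
        if not urls:  # an empty page marks the end of the category
            break
        for url in urls:
            row = scrape_product(url)
            row["page_number"] = page
            row["url"] = url
            rows.append(row)
        page += 1
    return rows


# Stand-in data so the loop can run offline: two pages with ads, then none.
fake_site = {1: ["/ad/1", "/ad/2"], 2: ["/ad/3"]}
rows = scrape_all_pages(
    fetch_listing_urls=lambda page: fake_site.get(page, []),
    scrape_product=lambda url: {"product_name": f"Phone from {url}"},
)
print(len(rows), "rows collected")
```

Keep in mind that an empty page can also be the result of a temporary block (see Challenge 1), so a real run may want to retry once before treating an empty page as the end.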

### 🗂️ Output
- A CSV file named **`dubizzle_mobile_phones.csv`** will be created in your working directory.  
- Each row represents one mobile phone with its full details.  
