# Practice Lab: Web Scraping with Beautiful Soup

In this lab, you will practice how to scrape data from a webpage using Python. You will use the `Beautiful Soup` package to extract data from a webpage and perform some basic data analysis.

In the previous lab, you used pandas for web scraping, which works well when the data is neatly structured in a table. However, in many cases, HTML pages are formatted differently. Data might be scattered throughout the page, and the surrounding HTML can vary significantly. This is where Beautiful Soup becomes useful — it allows you to identify repeating patterns in the HTML and extract the data accordingly.

In this scenario, you're working as a data analyst for a real estate company. Your task is to scrape apartment listings from a webpage and identify the five most affordable city-center apartments with at least two bedrooms and a parking space. This helps the team track pricing trends and spot high-value opportunities.

## General instructions
- **Replace any instances of `None` with your own code**. All `None`s must be replaced.
- **Compare your results with the expected output** shown below the code.
- **Check the solution** using the expandable cell to verify your answer. If needed, you can copy the code and paste it into the cell.

Happy coding!

<div style="background-color: #FAD888; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
<strong>Important note</strong>: Code blocks that still contain <code>None</code> will not run properly. If you run them before completing the exercise, you will likely get an error.
</div>

## Table of contents
- [Step 0: Inspect the webpage](#step0)
- [Step 1: Import modules](#import-modules)
- [Step 2: Get the HTML code](#get-html)
- [Step 3: Parse the HTML](#parse)
- [Step 4: Extract the elements and create a pandas DataFrame](#extract)
- [Step 5: Clean and process the columns](#clean-and-process)
- [Step 6: Find the right apartments](#find-the-apartments)

<a id="step0"></a>

## Step 0: Inspect the webpage

Begin by inspecting the webpage with the apartment listings.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Open the <a href="https://dlai-lc-dag.s3.us-east-2.amazonaws.com/apartment_finder.html"> webpage</a>. </li>
            <li>Inspect the HTML code.</li>
            <ul>
                <li>Right click anywhere on the page.</li>
                <li>Select the <code>Inspect</code> option.</li>
            </ul>
        </ol>
</div>

You should see something like this:

<img src="imgsL3/inspection.png">

Each apartment listing on the page is contained within a `<div>` element with the class `grid-item` (yellow). Inside each of these `grid-item` containers, the key information is organized into `<div>` elements, each with a specific class:

- **`info`** (orange): Contains general details such as the number of bedrooms, location, and price.  
- **`details`** (purple): Provides more in-depth information such as the area, floor, and additional features.  
- **`photo`** (green): Holds the image(s) associated with the listing.

<a id="import-modules"></a>

## Step 1: Import modules
First, you need to import the necessary modules. You will use `requests` to access the webpage, `BeautifulSoup` to extract information from it, and `pandas` to create a DataFrame for analyzing the data.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

<a id="get-html"></a>

## Step 2: Get the HTML code
Use the `requests.get()` function to extract the HTML from the webpage. The URL is given below.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Extract the HTML from the webpage. </li>
            <ul>
                <li>Use the <code>requests.get()</code> function to extract the HTML.</li>
                <li>Use <code>.status_code</code> to check whether your request was successful.</li>
            </ul>
        </ol>
</div>
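Outside of this lab, a request can hang or come back with an error page, so it can help to set a timeout and fail loudly on a bad status. The sketch below shows one possible way to do that. The helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the lab.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper, not part of the lab.
    """
    # The timeout stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With a helper like this, a failed request stops the script with an exception instead of passing an error page on to the parser.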

In [2]:
# URL of the webpage to scrape
url = "https://dlai-lc-dag.s3.us-east-2.amazonaws.com/apartment_finder.html"

### START CODE HERE ###

# send a GET request to the url
response = requests.get(url)

# get the status of the response for troubleshooting
status = response.status_code

### END CODE HERE ###

print(status)

200


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 


```
200
```

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# send a GET request to the url
response = requests.get(url)

# get the status of the response for troubleshooting
status = response.status_code
```
</details>

If the above code succeeded, you can print out the HTML of the webpage to inspect it and find out what you are actually searching for. You should be able to find the structure described at the beginning of the lab.

In [3]:
print(response.text)

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Welcome to CityDwellers</title>
    <link rel="stylesheet" href="styles_green.css">
</head>
<body>

    <div class="content">
        <header>
            <h1>Welcome to CityDwellers - Your Ultimate Apartment Finder</h1>
        </header>
        <section>
            <p>At CityDwellers, we believe that finding your dream apartment should be exciting, not exhausting. Our platform is designed to simplify your search, offering a seamless experience from browsing to signing the lease. Whether you're looking for a cozy studio, a spacious family home, or a luxurious penthouse, your perfect match is just a few clicks away.</p>
        </section>
        <section>
            <h2>Why Choose CityDwellers?</h2>
            <ul>
                <li>Vast Selection: Browse thousands of listings in prime locations. From bustling city centers to qui

By closely inspecting the HTML, you can see that there is an HTML structure that repeats over and over and looks something like this:
```
<div class="grid-item" onclick="toggleDetails(this)">
    <div class="info">
        <p><strong>3 Bedroom</strong></p>
        <p>Location: Southern Suburbs</p>
        <p>Price: $1272</p>
        <div class="details">
            <p>Area: 52 sqm</p>
            <p>Floor: 2</p>
            <p>Furnishing: Unfurnished</p>
            <p>Facing: East</p>
            <p>Parking: Yes</p>
            <p>Bathrooms: 1</p>
            <p>Balcony: 1</p>
            <p>Overlooking: Garden/Park, Pool</p>
        </div>
    </div>
    <div class="photo">
        <img src="imgs/unfurnished/med/2.jpg" alt="House">
        <button class="arrow left" onclick="prevImage(event, this)">&#10094;</button>
        <button class="arrow right" onclick="nextImage(event, this)">&#10095;</button>
    </div>
</div>
```

<a id="parse"></a>

## Step 3: Parse the HTML

As the next step, you will use `BeautifulSoup` to parse the text and extract the individual items that contain apartment data.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the cell below to parse the text and extract the individual items that contain apartment data. </li>
            <ul>
                <li>Use <code>BeautifulSoup</code> to parse the response text.</li>
                <li>Use <code>soup.find_all()</code> to find all instances of the <code>div</code> with the correct class (check the output above to find out which class). Pass the class as the <code>class_</code> named parameter.</li>
            </ul>
        </ol>
</div>

In [4]:
### START CODE HERE ###

# Parse the HTML content, naming the parser explicitly
soup = BeautifulSoup(response.text, "html.parser")

# Find all the grid items
grid_items = soup.find_all("div", class_="grid-item")

### END CODE HERE ###

# Print the number of grid items that were found
print(len(grid_items))

72


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 


```
72
```

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Find all the grid items
grid_items = soup.find_all("div", class_="grid-item")
```
</details>
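Passing the parser name explicitly keeps the result consistent across machines, since Beautiful Soup otherwise picks whichever parser happens to be installed and emits a warning. The sketch below uses a made-up HTML string (not the lab page) to show `find_all()` with `class_` next to the equivalent CSS selector in `select()`.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page has many more listings
html = """
<div class="grid-item"><p>Listing A</p></div>
<div class="grid-item featured"><p>Listing B</p></div>
<div class="footer"><p>Contact us</p></div>
"""

# Name the parser explicitly so the same one is used everywhere
soup = BeautifulSoup(html, "html.parser")

# class_ matches any element that has "grid-item" among its classes
items = soup.find_all("div", class_="grid-item")
print(len(items))  # 2

# The same query written as a CSS selector
same_items = soup.select("div.grid-item")
print([item.p.text for item in same_items])  # ['Listing A', 'Listing B']
```

Both calls return the same two elements here, so which one to use is mostly a matter of taste.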

<a id="extract"></a>

## Step 4: Extract the elements and create a pandas DataFrame

Next, you will search through `grid_items` to extract the individual pieces of information about each apartment. Each piece of information is stored in its own paragraph (`<p>`), and you can find all the paragraphs using `item.find_all("p")`, much the same way as you found all the grid items before. This code is provided for you. The `paragraphs` variable then holds a list of all the paragraphs in one listing, and you can access them one by one by index.

For this task, you need to extract the data to create a pandas DataFrame. Four of the columns are already extracted, and the rest are left for you to finish. To complete this lab, you need to populate at least the `parking` column, because you will need it for the analysis later on. You can populate the remaining columns for practice.

Since each line of code has quite a lot to unpack, here is a breakdown of what happens in the `number_of_bedrooms` column. This is the code used to extract the value:

`"number_of_bedrooms": paragraphs[0].text.split(" ")[0].strip()`

As you can see, this is a dictionary entry, where `number_of_bedrooms` is the key and the rest of the expression is the value. First, you access the first item in the `paragraphs` list with `paragraphs[0]`. Then you use `.text` to get the text inside this paragraph without any HTML tags, for example `3 Bedroom`. Because you only want to keep the number, you use `.split(" ")` to split the text on the space and select the first element with `[0]`. Finally, `.strip()` removes any whitespace that may be left.
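The chain can be tried step by step on a plain string. The sketch below uses text taken from the sample listing shown earlier in place of `paragraphs[0].text`, so it runs without the webpage.

```python
# Text as it would come out of paragraphs[0].text for one listing
text = "3 Bedroom"

parts = text.split(" ")      # ['3', 'Bedroom']
first = parts[0]             # '3'
number_of_bedrooms = first.strip()
print(number_of_bedrooms)    # 3

# The same idea for a "Label: value" paragraph such as the price
price_text = "Price: $1272"
price = price_text.split(": ")[1].strip()
print(price)                 # $1272
```

Note that both results are still strings, which is why the columns need to be cast later in the lab.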

In this image you can see which index in the `paragraphs` variable corresponds to each piece of information:

<div style="text-align: center">
<img src="imgsL3/listing_screenshot.png" width="800"/>
</div>
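To check the index of each paragraph yourself, you can print them with `enumerate()`. The sketch below parses a single listing copied from the repeating structure shown earlier (with the photo block left out) rather than the live page.

```python
from bs4 import BeautifulSoup

# One listing, copied from the repeating structure (photo block omitted)
html = """
<div class="grid-item">
    <div class="info">
        <p><strong>3 Bedroom</strong></p>
        <p>Location: Southern Suburbs</p>
        <p>Price: $1272</p>
        <div class="details">
            <p>Area: 52 sqm</p>
            <p>Floor: 2</p>
            <p>Furnishing: Unfurnished</p>
            <p>Facing: East</p>
            <p>Parking: Yes</p>
            <p>Bathrooms: 1</p>
            <p>Balcony: 1</p>
            <p>Overlooking: Garden/Park, Pool</p>
        </div>
    </div>
</div>
"""

item = BeautifulSoup(html, "html.parser").find("div", class_="grid-item")

# find_all() searches nested tags too and returns them in document order
paragraphs = item.find_all("p")
for index, paragraph in enumerate(paragraphs):
    print(index, paragraph.text)
```

Because the paragraphs inside `details` are nested within `info`, they simply continue the numbering after the first three.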

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the cell below to extract the individual pieces of information about the apartments. </li>
            <ul>
                <li>Four of the columns are already done for you.</li>
                <li>Extract the data for "furnishing" and "parking".</li>
            </ul>
        <strong>Note</strong>: this part of code will run even if you don't change anything, but the DataFrame will not contain the values.
        </ol>
</div>

In [12]:
# List to store apartment data
apartments = []

# Loop through each grid item and extract the details
for item in grid_items:
    paragraphs = item.find_all("p")
    
    apartment = {
        "number_of_bedrooms": paragraphs[0].text.split(" ")[0].strip(),
        "location": paragraphs[1].text.split(": ")[1].strip(),
        "price": paragraphs[2].text.split(": ")[1].strip(),
        "area": paragraphs[3].text.split(": ")[1].strip(),
        
        ### START CODE HERE ###

        "furnishing": paragraphs[5].text.split(": ")[1].strip(),
        "parking": paragraphs[7].text.split(": ")[1].strip(),
        # Optionally you can extract other columns for practice
        
        ### END CODE HERE ###
    }
    
    apartments.append(apartment)

# Create a pandas DataFrame from the list of apartments
df = pd.DataFrame(apartments)

# Display the DataFrame
df.head()

|   | number_of_bedrooms | location | price | area | furnishing | parking |
|---|---|---|---|---|---|---|
| 0 | 3 | Southern Suburbs | $1272 | 52 sqm | Unfurnished | Yes |
| 1 | 3 | Central | $6120 | 154 sqm | Unfurnished | Yes |
| 2 | 1 | Other | $745 | 34 sqm | Unfurnished | Yes |
| 3 | 1 | Southeastern Suburbs | $1048 | 43 sqm | Unfurnished | Yes |
| 4 | 3 | Central | $2200 | 63 sqm | Partially Furnished | Yes |


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 


<img src="./imgsL3/output_step4.png" width=600>

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
        "furnishing": paragraphs[5].text.split(": ")[1].strip(),
        "parking": paragraphs[7].text.split(": ")[1].strip(),
        # Optionally you can extract other columns for practice
        # "floor": paragraphs[4].text.split(": ")[1].strip(),
        # "facing": paragraphs[6].text.split(": ")[1].strip(),
        # "bathrooms": paragraphs[8].text.split(": ")[1].strip(),
        # "balcony": paragraphs[9].text.split(": ")[1].strip(),
        # "overlooking": paragraphs[10].text.split(": ")[1].strip()
```
</details>

Now you can start looking through the apartments. Remember, you are searching for the cheapest central apartments with at least two bedrooms and a parking space. As a first step, check the data types of the columns and the kinds of values in each column of interest: `number_of_bedrooms`, `location`, `price`, and `parking`.

In [None]:
# Display the information about the DataFrame
df.info()

<a id="clean-and-process"></a>

## Step 5: Clean and process the columns

As you can see, all of the columns have the `object` data type. At the very least, you need to cast `number_of_bedrooms` and `price` to a numeric type so that you can compare them with the greater-than operator and sort by price. Additionally, you should check the values in each of the columns to see what they are and how they may affect your search.
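Why the cast matters can be seen with plain Python values. The sketch below uses three prices from the listings, stored as strings the way the scraper returns them, and compares string ordering with numeric ordering.

```python
# Prices as they come out of the scraper: strings, not numbers
prices_as_text = ["1272", "6120", "745"]

# Strings are compared character by character, so "745" sorts last
print(sorted(prices_as_text))     # ['1272', '6120', '745']

# After casting, the order reflects the actual amounts
prices_as_numbers = [int(p) for p in prices_as_text]
print(sorted(prices_as_numbers))  # [745, 1272, 6120]

# The same pitfall affects comparisons
print("10" >= "2")  # False
print(10 >= 2)      # True
```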

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the cell below to cast the number of bedrooms to <code>"Int64"</code>. </li>
        </ol>
</div>

In [7]:
### START CODE HERE ###

# Convert the number_of_bedrooms column to integer
df["number_of_bedrooms"] = df["number_of_bedrooms"].astype("Int64")

### END CODE HERE ###

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Convert the number_of_bedrooms column to integer
df["number_of_bedrooms"] = df["number_of_bedrooms"].astype("Int64")
```
</details>

Now have a look at the values in the `price` column (check the DataFrame above). You cannot cast it directly to a numeric type because each value has a dollar sign in front of the number. You need to remove it first.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Use the cell below to cast the price column to int. </li>
            <ul>
                <li>First remove the dollar signs. Hint: you can use <code>.str.replace()</code> and just replace it with an empty string.</li>
                <li>Then cast it to an integer type. A float would also work, but all of the prices are whole numbers, so it does not matter here.</li>
            </ul>
        </ol>
</div>

In [8]:
### START CODE HERE ###

# Remove the dollar sign from the price column
df["price"] = df["price"].str.replace("$", "", regex=False)

# Convert the price column to integer
df["price"] = df["price"].astype("Int64")

### END CODE HERE ###

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Remove the dollar sign from the price column
df["price"] = df["price"].str.replace("$", "", regex=False)

# Convert the price column to integer
df["price"] = df["price"].astype(int)
```
</details>
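One detail worth knowing is that the dollar sign is also a regular-expression metacharacter. The pandas `.str.replace()` method accepts a `regex` argument whose default has changed across pandas releases, so passing `regex=False` is one way to make the intent explicit. The sketch below uses only Python's built-in `str.replace()` and the standard `re` module to show the difference between the two interpretations.

```python
import re

price = "$1272"

# Plain string replacement treats "$" as an ordinary character
print(price.replace("$", ""))      # 1272

# As a regular expression, "$" means "end of string", so nothing is removed
print(re.sub("$", "", price))      # $1272

# Escaping the character makes the regex match a literal dollar sign
print(re.sub(r"\$", "", price))    # 1272
```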

Now have a look at the `location` column. Perhaps there are multiple values there that you are interested in.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Find all the unique values in the <code>location</code> column and print them out. </li>
        </ol>
</div>

In [13]:
### START CODE HERE ###

# Find the unique values in the location column and print them out
print(df["location"].unique())

### END CODE HERE ###

['Southern Suburbs' 'Central' 'Other' 'Southeastern Suburbs' 'Periphery'
 'Northern Suburbs' 'Western Suburbs']


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 


```
['Southern Suburbs' 'Central' 'Other' 'Southeastern Suburbs' 'Periphery'
 'Northern Suburbs' 'Western Suburbs']
```

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Find the unique values in the location column
print(df["location"].unique())
```
</details>

It looks like you are good to go. The `location` column has only a few different values, and `Central` is the only one you are interested in.

Lastly, check for all the values in the `parking` column.


<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Find all the unique values in the <code>parking</code> column and print them out.</li>
        </ol>
</div>

In [14]:
### START CODE HERE ###

# Find the unique values in the parking column and print them out
print(df["parking"].unique())

### END CODE HERE ###

['Yes' 'No']


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 


```
['Yes' 'No']
```

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Find the unique values in the parking column
print(df["parking"].unique())
```
</details>

The `parking` column has only two different values in it and you are interested in all the rows that have `Yes` in them.

<a id="find-the-apartments"></a>

## Step 6: Find the right apartments

Now you can finally filter and sort your DataFrame to find the apartments that you are looking for.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
    <strong>▶▶▶ Directions</strong> 
        <ol>
            <li>Filter and sort the DataFrame to find the cheapest five apartments with a central location, at least two bedrooms, and a parking space.</li>
            <ul>
                <li>Filter the DataFrame for apartments with central location.</li>
                <li>Filter the DataFrame for apartments with two or more bedrooms.</li>
                <li>Filter the DataFrame for apartments with parking.</li>
                <li>Sort the DataFrame by price in ascending order. Hint: use <code>.sort_values()</code> and pass <code>price</code> to the named argument <code>by</code>. This returns a sorted DataFrame.</li>
            </ul>
        </ol>
</div>

In [18]:
### START CODE HERE ###

# Filter the DataFrame for apartments with central location
central_apartments_df = df[df["location"] == "Central"]

# Filter the DataFrame for apartments with two or more bedrooms
two_bedroom_apartments_df = central_apartments_df[central_apartments_df["number_of_bedrooms"] >= 2]

# Filter the DataFrame for apartments with parking
apartments_with_parking_df = two_bedroom_apartments_df[two_bedroom_apartments_df["parking"] == "Yes"]

# Sort the DataFrame by price in ascending order
sorted_apartments_df = apartments_with_parking_df.sort_values(by="price")

### END CODE HERE ###

# Get the cheapest five apartments
cheapest_five_apartments_df = sorted_apartments_df.head(5)

# Display the result
cheapest_five_apartments_df

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 


<img src="./imgsL3/output_step6.png" width=500>

</details>

<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Filter the DataFrame for apartments with central location
central_apartments_df = df[df["location"] == "Central"]

# Filter the DataFrame for apartments with two or more bedrooms
two_bedroom_apartments_df = central_apartments_df[central_apartments_df["number_of_bedrooms"] >= 2]

# Filter the DataFrame for apartments with parking
apartments_with_parking_df = two_bedroom_apartments_df[two_bedroom_apartments_df["parking"] == "Yes"]

# Sort the DataFrame by price in ascending order
sorted_apartments_df = apartments_with_parking_df.sort_values(by="price")
```
</details>
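The three filters can also be combined into a single boolean mask. The sketch below is an alternative way to write the same logic, shown on a small made-up DataFrame (`toy_df` and its values are illustrative, not the scraped data).

```python
import pandas as pd

# Made-up listings for illustration; these are not the scraped values
toy_df = pd.DataFrame(
    {
        "number_of_bedrooms": [1, 2, 3, 2, 2],
        "location": ["Central", "Central", "Central", "Periphery", "Central"],
        "price": [900, 2500, 2100, 1200, 1800],
        "parking": ["Yes", "No", "Yes", "Yes", "Yes"],
    }
)

# Each condition is a boolean Series; & combines them row by row.
# The parentheses are required because & binds tighter than == and >=.
mask = (
    (toy_df["location"] == "Central")
    & (toy_df["number_of_bedrooms"] >= 2)
    & (toy_df["parking"] == "Yes")
)

cheapest = toy_df[mask].sort_values(by="price").head(5)
print(cheapest["price"].tolist())  # [1800, 2100]
```

Step-by-step filtering, as in the exercise, makes each intermediate result easy to inspect, while a single mask avoids creating the intermediate DataFrames.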

Congratulations on finishing this lab!

You used `requests` to get the HTML of a webpage, parsed it with `BeautifulSoup`, and created a table of available apartments. Then you found the five cheapest central apartments with at least two bedrooms and a parking space.

Hope you enjoyed it! 