### *Practice + full code step by step explanation on how to scrap a webpage with requests and beautifull soup.*

In [12]:
# Import the necessary libraries, requests for making HTTP requests and BeautifulSoup for parsing HTML content.
# The writer function from the CSV module in Python is used to create and write data into a CSV writer object. 
# Pandas to create a DF with the CSV

from bs4 import BeautifulSoup
import requests
from csv import writer
import pandas as pd

In [2]:
# Defining the URL / Set the target URL

url= "https://www.pararius.com/apartments/amsterdam?ac=1"

In [3]:
# Sending an HTTP GET Request: Use requests.get(url) to send an HTTP GET request to the specified URL and 
# store the response in the page variable.

url= "https://www.pararius.com/apartments/amsterdam?ac=1"
page = requests.get(url)


In [4]:
# Parsing the HTML: Create a BeautifulSoup object soup to parse the HTML content of the webpage using 'html.parser'.

soup = BeautifulSoup(page.content, 'html.parser')

In [5]:
# Extracting Specific Elements: Use soup.find_all() to locate all <section> elements with the class "listing-search-item" 
# and store them in the lists variable. These elements likely represent individual housing listings on the webpage.

lists = soup.find_all('section', class_="listing-search-item")

In [25]:
# Creating a CSV File: Open a CSV file named 'housing.csv' in write mode with UTF-8 encoding and create a CSV writer object.
# Define a header row containing column names ('Title', 'Location', 'Price', 'Area') and write it to the CSV file.
# Iterate through each list in the lists collection (representing housing listings).
# Extract specific information (title, location, price, area) for each listing using list.find() for relevant HTML elements and classes.
# Replace newline characters with an empty string to clean the extracted text.
# Store the extracted information as a list called info.

with open('housing.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Title', 'Location', 'Price', 'Area']
    thewriter.writerow(header)

    for list in lists:
        title = list.find('a', class_="listing-search-item__link--title").text.replace('\n', '')
        location = list.find('div', class_="listing-search-item__sub-title'").text.replace('\n', '')
        price = list.find('div', class_="listing-search-item__price").text.replace('\n', '')
        area = list.find('li', class_="illustrated-features__item illustrated-features__item--surface-area").text.replace('\n', '')

        info = [title, location, price, area]
        thewriter.writerow(info)

Note: The with statement in Python is used for context management. It ensures that certain operations are carried out before and after a block of code. In the context of file handling, the with statement is often used to open and automatically close files.

Code explanation:

***with open('housing.csv', 'w', encoding='utf8', newline='') as f:***

This line opens a file named 'housing.csv' in write mode ('w') with UTF-8 encoding (encoding='utf8').
The newline='' argument is used to ensure that the correct line endings are used in the file.
The as f part assigns the opened file object to the variable f.
The with statement ensures that the file is properly closed when the block of code inside it is exited, even if an error occurs.

***thewriter = writer(f):***

This line creates a CSV writer object (thewriter) that is associated with the opened file f. The writer function is part of the csv module and is used to write CSV data to a file.

***header = ['Title', 'Location', 'Price', 'Area']:***

This line defines a list called header that contains the column names for the CSV file. These column names will be used as the first row in the CSV file.

***thewriter.writerow(header):***

This line writes the header list (containing column names) as the first row of the CSV file using the writerow method of the CSV writer object thewriter



In [24]:
# Let's see the output opened in a Pandas DataFrame

df = pd.read_csv('housing.csv')
df.head()

Unnamed: 0,Title,Location,Price,Area
0,Flat Van Ostadestraat ...,1074 XE Amsterdam (Nieuwe Pijp) ...,"€2,800 per month ...",75 m²
1,Flat Van Nijenrodeweg ...,1082 HH Amsterdam (Buitenveldert-W...,"€2,250 per month ...",90 m²
2,Flat Vrolikstraat ...,1091 VE Amsterdam (Oosterparkbuurt...,"€1,350 per month ...",30 m²
3,Flat Oostenburgervoors...,1018 MR Amsterdam (Oostelijke Eila...,"€1,850 per month ...",75 m²
4,House Zonneweg 24 ...,1033 CJ Amsterdam (Tuindorp Oostza...,"€1,950 per month ...",90 m²


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Title     30 non-null     object
 1   Location  30 non-null     object
 2   Price     30 non-null     object
 3   Area      30 non-null     object
dtypes: object(4)
memory usage: 1.1+ KB


This code is designed to scrape data from a single webpage specified in the provided URL. To extend its functionality to scrape data from multiple pages and consolidate it into a single CSV file, you can make adjustments as follows:

In [36]:
# Initialize an empty list to store data from all pages
all_data = []

# Start on the first page
page_number = 1

while True:
    url = f"https://www.pararius.com/apartments/amsterdam/page-{page_number}"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    lists = soup.find_all('section', class_="listing-search-item")

    if not lists:
        # If no listings are found on the page, exit the loop
        break

    for list in lists:
        title = list.find('a', class_="listing-search-item__link--title").text.replace('\n', '')
        location = list.find('div', class_="listing-search-item__sub-title'").text.replace('\n', '')
        price = list.find('div', class_="listing-search-item__price").text.replace('\n', '')
        area = list.find('li', class_="illustrated-features__item illustrated-features__item--surface-area").text.replace('\n', '')

        info = [title, location, price, area]
        all_data.append(info)

    # Move to the next page
    page_number += 1

# Create a DataFrame from the accumulated data
df = pd.DataFrame(all_data, columns=['Title', 'Location', 'Price', 'Area'])

# Save the DataFrame to a CSV file
df.to_csv('housing_final.csv', encoding='utf8', index=False)

In [37]:
df_final = pd.read_csv('housing_final.csv')

In [38]:
df_final

Unnamed: 0,Title,Location,Price,Area
0,Flat Vrolikstraat 262 ...,1092 TX Amsterdam (Oosterparkbuurt...,"€2,095 per month ...",58 m²
1,Flat Van Ostadestraat ...,1074 XE Amsterdam (Nieuwe Pijp) ...,"€2,800 per month ...",75 m²
2,Flat Van Nijenrodeweg ...,1082 HH Amsterdam (Buitenveldert-W...,"€2,250 per month ...",90 m²
3,Flat Vrolikstraat ...,1091 VE Amsterdam (Oosterparkbuurt...,"€1,350 per month ...",30 m²
4,Flat Oostenburgervoors...,1018 MR Amsterdam (Oostelijke Eila...,"€1,850 per month ...",75 m²
...,...,...,...,...
1967,Flat Distelweg ...,1031 HD Amsterdam (Noordelijke IJ-...,"€2,850 per month ...",87 m²
1968,Flat Prinsengracht 311...,1016 GX Amsterdam (Grachtengordel-...,"€2,000 per month ...",60 m²
1969,Flat Rustenburgerstraa...,1072 HG Amsterdam (Nieuwe Pijp) ...,"€3,950 per month ...",150 m²
1970,Flat Tussen Meer ...,1069 DT Amsterdam (Osdorp-Midden) ...,"€3,400 per month ...",84 m²
