### Project Summary: Web Scraping Data for Palm Beach (NSW) from Australian Census Site

#### Overview
This project involves scraping data for the suburb of Palm Beach (NSW) from the Australian Bureau of Statistics (ABS) website for the year 2021. The goal is to extract and clean data from an HTML page and store it in an Excel file for analysis.

#### Tools and Libraries Used
- **BeautifulSoup**: For parsing and extracting data from HTML content.
- **Requests**: For sending HTTP requests to fetch the webpage.
- **Pandas**: For organizing data and saving it to an Excel file.

#### Key Steps in the Project

1. **Fetch Data from URL**
   - A request is sent to the ABS website for the specified URL.
   - Example Code:
     ```python
     url = "https://www.abs.gov.au/census/find-census-data/quickstats/2021/SAL13143"
     page = requests.get(url)
     ```

2. **Parse HTML Content**
   - The HTML content is parsed and prettified for easier data extraction.
   - Example Code:
     ```python
     soup = BeautifulSoup(page.text, "html.parser")
     pretty_soup = BeautifulSoup(soup.prettify(), "html.parser")
     ```

3. **Extract and Clean Data**
   - Data is extracted from HTML elements using BeautifulSoup. Text is stripped of extra spaces and newlines.
   - Example Code:
     ```python
     entire_data = pretty_soup.find_all(['th', 'td'])
     paragraph_texts = [p.get_text().strip() for p in entire_data]
     ```

4. **Create DataFrames**
   - Two separate DataFrames are created to store data from two different parts of the HTML table.
   - Example Code:
     ```python
     df1 = pd.DataFrame(columns=['Data', 'Title'])
     df2 = pd.DataFrame(columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'])
     ```

5. **Populate DataFrames**
   - The first DataFrame (`df1`) is populated with data from the first part of the table.
   - The second DataFrame (`df2`) is populated with data from the second part of the table.
   - Example Code:
     ```python
     for i in range(0, len(summary1) - 1, 2):
         df1.loc[len(df1)] = [summary1[i], summary1[i + 1]]
     
     for i in range(0, len(summary2) - 6, 7):
         row_data = [summary2[i], summary2[i + 1], summary2[i + 2], summary2[i + 3], summary2[i + 4], summary2[i + 5], summary2[i + 6]] if i + 6 < len(summary2) else [None] * 7
         df2.loc[len(df2)] = row_data
     ```

6. **Save Data to Excel File**
   - The data is saved to an Excel file with two sheets: one for each DataFrame.
   - Example Code:
     ```python
     with pd.ExcelWriter('2021_data.xlsx') as writer:
         df1.to_excel(writer, sheet_name='Sheet1', index=False)
         df2.to_excel(writer, sheet_name='Sheet2', index=False)
     ```

This project demonstrates web scraping, data cleaning, and file handling using Python libraries to extract and save structured data for further analysis.


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import datetime
import time
import smtplib

In [55]:
def send_mail(title,price):
    server = smtplib.SMTP("smtp.gmail.com", 587)
    server.starttls()
    server.login('kunalchugriaa@gmail.com','XXXX')
    
    subject = "This your "+ title+ " that you want is now available at Rs"+ price+ " Now is your chance to buy!"
    body = "Kunal, This is the moment we have been waiting for. Now is your chance to pick up the "+title+". Don't mess it up! Link here: https://www.amazon.in/Lino-Perros-Womens-Handbag-White/dp/B07L4QGTR4/ref=sr_1_7?crid=3QH24RE1AY651&dib=eyJ2IjoiMSJ9.qwHVo2qRftLZKMJ8OkPin242tF3no1xFoGVD7_rX6X0jgqrJJSmWOVzOakb_7rag4fjcj0DyruCzvVLpT5DVjjIM9HYVwnaKt525IAIkHbUHIz-wc0Yc_7CTl6ERyrE2dvjU5HNBGRZs6J767SlUcKyLTN7G_f8WBveWNGA-NCd5JRvpf6w7NHpw6nBPcqlhU9pJmxsW1iyE7WculrUO2wBY4UWiLjxTwXfAzSLAhKHFhC1LbhsLnPnOVCJ-b3_Rrhi5Dmc_TTmExX8Wt1LQLqx2JAu5OhWKPkJX2fynXDo._6t-lyct6brhbTJ0gZL2nr7EOpGk1tRHBimaX7bTOoY&dib_tag=se&keywords=lino+perros+handbags+for+women&qid=1725187294&sprefix=lino+p%2Caps%2C222&sr=8-7"
   
    msg = f"Subject: {subject}\n\n{body}"
    
    server.sendmail( 'uditrawlani6@gmail.com',
        'kunalchugriaa@gmail.com',
        msg
     
    )

In [64]:
import csv
from bs4 import BeautifulSoup
import requests
import pandas as pd
import datetime
import time
import smtplib
import os

def check_price(list_of_urls,minimum_prices):
   no_of_items=[]
   for i in range(0,len(list_of_urls)):
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
        page=requests.get(list_of_urls[i],headers=headers)
        soup1 = BeautifulSoup(page.content,'html.parser')
        soup2 = BeautifulSoup(soup1.prettify(),"html.parser")
        title = soup2.find(id="productTitle").get_text().strip()
        price = soup2.find(class_='a-price-whole').get_text().strip()
        title1 = ['Title','Price','Date']
        today=str(datetime.datetime.now())
        today = today[:today.find('.')]
        data = [title,price,today]
        if not (os.path.exists("AmazonWebScrapData.csv")):
            with open ("AmazonWebScrapData.csv",'w',newline='') as f:
                writer = csv.writer(f)
                writer.writerow(title1)
                writer.writerow(data)
        with open ("AmazonWebScrapData.csv",'a+',newline='') as f:
            writer = csv.writer(f)
            writer.writerow(data)
        if int(price.replace(',',''))<=int(minimum_prices[i]):
            no_of_items.append(i)
            send_mail(title,price)
   return no_of_items

    

In [None]:
list_of_urls = input('Kindly enter the list of URLs you want to track(comma separated) - ').split(',')
minimum_prices = input('Please enter the minimum prices of the items at which you''d like to receive email notifications(comma separated) - ').split(',')
# list_of_urls = ['https://www.amazon.in/Lino-Perros-Womens-Handbag-White/dp/B07L4QGTR4/ref=sr_1_7?crid=1ZHW7GA6SM1JV&dib=eyJ2IjoiMSJ9.qwHVo2qRftLZKMJ8OkPin242tF3no1xFoGVD7_rX6X0C2KTaSMaGllAkX9wS_l7yE-Ffjm_N8lLFkqxMLKhO5oCexMskWQA6pp1dMzkjfzDMSMdX6w-5SHT6j8q0Ojejft7KHcHaD471-H6QXm6M6noCog0wcQ8qOL39nI-MTZYeRw7KSRkHZ5u4E0Z5XMB3uIHl_DW2yuwBplEJDJMXpmOpzUpb-mOp6_eyW6MokNaE9O5prYhKHbHa_ygBZ5oW8DgJPYCLMagLzry0MbK8gkdnmYcV724mdD2iQ62JI0Q.PKx70NB4WI4L048qyqo-N4am8jrym3SszKzgbZQPhNI&dib_tag=se&keywords=lino+perros+handbags+for+women&qid=1725735417&sprefix=lino+perr%2Caps%2C247&sr=8-7','https://www.amazon.in/Poor-Charlies-Almanack-Essential-Charles/dp/1953953239/ref=sr_1_2?crid=3TBA04OE556YD&dib=eyJ2IjoiMSJ9.vg5pUX_jT-eNweisJvvTB-3iO6rDUSySV6Q1RkC4mqNbc3HbeETghszrSphrfCLNH17Jt1WH5d12Jr5eNSGmQiH48C-VzFkaui_zGNNa9PN6iy7ErdUExzs37V5rDob2_5iqDrnbUYjj9FIqE54Lfyu2iZuF9Fnfo8zd4iiF4oUKljt3djdObJxxRxZOLqxKjdeglSJWYqVeowuC4Hb-N4PROOKzBEb-GIvSr73e55Q.uas7GmnSoFY0EEm7-6dFX4rh3AzJt66Cy8S7Csbf0BM&dib_tag=se&keywords=poor+charlie%27s+almanack&qid=1725735365&sprefix=poor+cha%2Caps%2C209&sr=8-2']
interval = int(input('Please specify the regular interval at which the price should be retrieved and checked -'))
while True:
    no_of_items_satified = check_price(list_of_urls,minimum_prices)
    time.sleep(interval)
    if len(no_of_items_satified)==0:
        continue
    else:
        for i in sorted(no_of_items_satified,reverse=True):
            minimum_prices.pop(i)
            list_of_urls.pop(i)
    if len(minimum_prices)==0 and len(list_of_urls)==0:
        break

In [70]:
import pandas as pd
df=pd.read_csv(r"C:\Users\hp\Desktop\Data science\Portfolio Projects\Amazon Web Scrapping Project Python\AmazonWebScrapData.csv",encoding='ISO-8859-1')
df

Unnamed: 0,Title,Price,Date
0,Lino Perros Women Handbag White,1619,2024-09-08 15:01:44
1,Lino Perros Women Handbag White,1619,2024-09-08 15:01:44
2,Poor Charlies Almanack: The Essential Wit and...,1873,2024-09-08 15:01:53
3,Lino Perros Women Handbag White,1619,2024-09-08 15:02:00
4,Poor Charlies Almanack: The Essential Wit and...,1873,2024-09-08 15:02:08
...,...,...,...
96,Poor Charlies Almanack: The Essential Wit and...,1873,2024-09-08 15:11:42
97,Lino Perros Women Handbag White,1619,2024-09-08 15:11:50
98,Poor Charlies Almanack: The Essential Wit and...,1873,2024-09-08 15:11:58
99,Lino Perros Women Handbag White,1619,2024-09-08 15:12:07
