## Module 10 Assignment: Scraping a Website - Heny Patel

## Project Description

Purpose:
The goal of this project is to build a web scraping tool with Selenium that extracts data from the Charities Bureau website by parsing a table of charitable organizations operating in New York State. The extracted data is then stored in a CSV file and uploaded to an S3 bucket on AWS.

Aim:
The primary objective is to create a reliable web scraping script with Selenium that can iterate over multiple pages of the website and compile the results into a single CSV file. The project also demonstrates how to combine Selenium with AWS services, particularly S3, to store the extracted data.
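
The aim of compiling several result pages into one CSV file can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's own code: `pages_to_csv` and the sample pages are hypothetical, and each page is assumed to have already been scraped into a list of rows of cell text.

```python
import csv
import io

# Column names used for the scraped table in this notebook.
COLUMNS = ["Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State"]


def pages_to_csv(pages):
    """Combine rows from several scraped pages into one CSV string.

    `pages` is a list of pages; each page is a list of rows, and each row
    is a list of cell texts. Rows that do not have exactly one cell per
    column (for example an empty header row) are skipped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for page in pages:
        for row in page:
            if len(row) == len(COLUMNS):
                writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sample: two pages, the first one starting with an empty row.
page_1 = [[], ["#FeedHamburg", "48-37-35", "854150318", "NFP", "HAMBURG", "NY"]]
page_2 = [["#HicksStrong Inc.", "48-10-48", "842612081", "NFP", "CLIFTON PARK", "NY"]]

combined = pages_to_csv([page_1, page_2])
print(combined)
```

Writing the header once and appending every page's rows to the same buffer keeps the result a single, well-formed CSV regardless of how many pages are scraped.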

In [1]:
pip install selenium webdriver_manager

Note: you may need to restart the kernel to use updated packages.


#### The code below sets up Selenium WebDriver to automate interactions with the Chrome browser. It ensures the correct version of ChromeDriver is installed and then initializes a new Chrome instance controlled by Selenium, which allows automated control of the browser for tasks like web scraping or testing.

In [4]:
# importing libraries

import selenium


# the webdriver module from Selenium, which is used to automate web browser interaction

from selenium import webdriver



# Service class specifically for Chrome

from selenium.webdriver.chrome.service import Service 



# webdriver_manager is a tool that provides a way to automatically manage browser drivers, 
# and makes sure that the correct version is used without needing to manually download and set the driver.

from webdriver_manager.chrome import ChromeDriverManager




# Keys provides special key constants, e.g. for pressing the Enter key to submit a form on a webpage.

from selenium.webdriver.common.keys import Keys



# By is used to specify by which method the elements (ID, name, CSS selector, XPath etc.) should be located on a web page

from selenium.webdriver.common.by import By



# WebDriverWait waits for certain elements to become available on the page before interacting with them.

from selenium.webdriver.support.ui import WebDriverWait



# expected_conditions contains a set of predefined conditions for waiting for certain events on a web page,
# such as the element to be clickable

from selenium.webdriver.support import expected_conditions as EC




# Ensure the correct version of ChromeDriver is installed and ready to be used by Selenium

service = Service(ChromeDriverManager().install())


# Initializes a new instance of the Chrome browser controlled by Selenium’s WebDriver

driver = webdriver.Chrome(service=service) 

# Chrome is now open and controlled by Selenium.
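
The `WebDriverWait` and `expected_conditions` imports above exist so that a script can wait for an element instead of pausing for a fixed time. The underlying idea is polling with a timeout, sketched here with the standard library only so that it runs without a browser. `wait_until` and `fake_page_loaded` are hypothetical names for illustration; in Selenium, `WebDriverWait(driver, timeout).until(condition)` plays this role.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for "the results table is present on the page":
# it only succeeds on the third check.
checks = {"count": 0}


def fake_page_loaded():
    checks["count"] += 1
    return "table found" if checks["count"] >= 3 else None


found = wait_until(fake_page_loaded, timeout=2.0, poll_interval=0.01)
print(found, "after", checks["count"], "checks")
```

Compared with a fixed pause, this returns as soon as the condition holds and fails loudly when it never does.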


In [5]:
import boto3
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Scrape the given website
# First step: start the WebDriver
s = Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=s)

browser.get('https://www.charitiesnys.com/RegistrySearch/search_charities.jsp')

# Identify the XPath location of the element that needs to be selected.
inputElement = browser.find_element(By.XPATH, '//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[2]/td[2]/input[1]')  # locates the EIN input field
inputElement.send_keys('0')
# Locate the search button and click it (click() returns None, so the result is not stored)
browser.find_element(By.XPATH, '//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[10]/td/input[1]').click()
sleep(4)  # give the results page time to load
table = browser.find_element(By.CSS_SELECTOR, 'table.Bordered')
sleep(1)

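# Create the DataFrame
# Start with an empty list that will hold one list of cell texts per table row
rows = []

# Extract the text of every <td> cell in every row of the table.
# A row without <td> cells (such as a header row) yields an empty list,
# which becomes an all-empty row in the DataFrame.
for row in table.find_elements(By.CSS_SELECTOR, 'tr'):
    rows.append([cell.text for cell in row.find_elements(By.CSS_SELECTOR, 'td')])

# Build the DataFrame and assign the column headers
df = pd.DataFrame(rows, columns=["Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State"])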
df

Unnamed: 0,Organization Name,NY Reg #,EIN,Registrant Type,City,State
0,,,,,,
1,"""Forever Captain Poodaman"" The Ahmad Butler Fo...",48-07-16,843800926.0,NFP,PHILADELPHIA,PA
2,"""Incredibly Blessed"" Inc",49-54-61,842071758.0,NFP,STATEN ISLAND,NY
3,"""R"" S.U.C.C.E.S.S. Foundation Inc.",49-06-59,874012670.0,NFP,ROCHESTER,NY
4,"""Studio 5404"" Inc.",44-39-58,463180470.0,NFP,MASSAPAQUA,NY
5,"""THEY ARE HAITIAN"" FUND, INC.",20-63-46,300170128.0,NFP,HUDSON,NY
6,"""Y"" Dive, Inc.",48-45-01,854252095.0,NFP,SAINT ALBANS,NY
7,(ASMA) American Syrian Multicultural Associati...,42-84-63,273130182.0,NFP,BROOKLYN,NY
8,#FeedHamburg,48-37-35,854150318.0,NFP,HAMBURG,NY
9,#HicksStrong Inc.,48-10-48,842612081.0,NFP,CLIFTON PARK,NY


In [6]:
# Drop the first row, which is empty, and keep only the scraped records
scrape = df[1:]
scrape

Unnamed: 0,Organization Name,NY Reg #,EIN,Registrant Type,City,State
1,"""Forever Captain Poodaman"" The Ahmad Butler Fo...",48-07-16,843800926,NFP,PHILADELPHIA,PA
2,"""Incredibly Blessed"" Inc",49-54-61,842071758,NFP,STATEN ISLAND,NY
3,"""R"" S.U.C.C.E.S.S. Foundation Inc.",49-06-59,874012670,NFP,ROCHESTER,NY
4,"""Studio 5404"" Inc.",44-39-58,463180470,NFP,MASSAPAQUA,NY
5,"""THEY ARE HAITIAN"" FUND, INC.",20-63-46,300170128,NFP,HUDSON,NY
6,"""Y"" Dive, Inc.",48-45-01,854252095,NFP,SAINT ALBANS,NY
7,(ASMA) American Syrian Multicultural Associati...,42-84-63,273130182,NFP,BROOKLYN,NY
8,#FeedHamburg,48-37-35,854150318,NFP,HAMBURG,NY
9,#HicksStrong Inc.,48-10-48,842612081,NFP,CLIFTON PARK,NY
10,#WalkAway Foundation,47-15-80,832820906,NFP,CARLSBAD,CA
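
Before uploading, it can help to check that each scraped record looks well formed. The sketch below is a hedged illustration, not part of the original notebook: `find_invalid_rows` is a hypothetical helper, and the two patterns are assumptions inferred only from the sample output above (nine-digit EINs and `NN-NN-NN` registration numbers), not from any official specification.

```python
import re

# Assumed formats, inferred from the sample output only.
EIN_PATTERN = re.compile(r"\d{9}")
REG_PATTERN = re.compile(r"\d{2}-\d{2}-\d{2}")


def find_invalid_rows(rows):
    """Return (index, row) pairs that do not match the assumed formats.

    Each row is expected to be
    [name, ny_reg_number, ein, registrant_type, city, state].
    """
    invalid = []
    for index, row in enumerate(rows):
        if (
            len(row) != 6
            or not REG_PATTERN.fullmatch(row[1])
            or not EIN_PATTERN.fullmatch(row[2])
        ):
            invalid.append((index, row))
    return invalid


sample = [
    ["#FeedHamburg", "48-37-35", "854150318", "NFP", "HAMBURG", "NY"],
    # EIN deliberately altered here to show a failing record.
    ["#HicksStrong Inc.", "48-10-48", "842612081.0", "NFP", "CLIFTON PARK", "NY"],
    [],  # an empty row, like the first row of the scraped table
]

problems = find_invalid_rows(sample)
for index, row in problems:
    print(f"row {index} looks malformed: {row}")
```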


In [7]:
!pip install --upgrade pandas



In [10]:
import logging
import boto3
from botocore.exceptions import ClientError

def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region. If a region is not specified,
    the bucket is created in the S3 default region (us-east-1).

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-east-2'
    :return: True if bucket created, else False
    """
    try:
        if region is None:
            s3_client = boto3.client('s3')
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client = boto3.client('s3', region_name=region)
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location)
    except ClientError as e:
        logging.error(e)
        return False
    return True

#create the bucket
bucket_name = 'm10-assignment-heny-1' 
region = 'us-east-2'

if create_bucket(bucket_name, region):
    print(f'Bucket "{bucket_name}" created successfully.')
else:
    print(f'Failed to create bucket "{bucket_name}". Please check the logs for more details.')

Bucket "m10-assignment-heny-1" created successfully.
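
Bucket names are shared across all AWS accounts, so `create_bucket` can fail for a name that is already taken or malformed. A cheap local check can catch malformed names before any call to AWS. The sketch below covers only a subset of the S3 naming rules (length of 3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit; no consecutive dots). `looks_like_valid_bucket_name` is a hypothetical helper, not part of boto3.

```python
import re

# First and last characters must be a lowercase letter or digit;
# total length must be between 3 and 63 characters.
BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Check a bucket name against a subset of the S3 naming rules."""
    return BUCKET_NAME_PATTERN.fullmatch(name) is not None and ".." not in name


for candidate in ["m10-assignment-heny-1", "M10_Assignment", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```

A name that passes this check can still be rejected by AWS, for instance when another account already owns it, so the `ClientError` handling in `create_bucket` remains necessary.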


In [11]:
import boto3
import pandas as pd
from io import StringIO
from datetime import datetime


csv_buffer = StringIO()
df.to_csv(csv_buffer)

timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Initialize an S3 resource
s3_resource = boto3.resource('s3')
bucket_name = 'm10-assignment-heny-1'
file_name = f'charities_bureau_scrape_{timestamp}.csv'  # Append timestamp to file name

# Save a local copy of the CSV
df.to_csv(file_name)

# Upload the in-memory CSV to the bucket
s3_resource.Object(bucket_name, file_name).put(Body=csv_buffer.getvalue())

print(f"File successfully uploaded to S3 bucket: {bucket_name}")

File successfully uploaded to S3 bucket: m10-assignment-heny-1
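
Appending a timestamp to the object key means each run creates a new object instead of overwriting the previous one. The naming step can be isolated in a small function, sketched here under the assumption that the same `%Y-%m-%d_%H-%M-%S` format is kept. `timestamped_key` is a hypothetical helper, and the fixed dates are used only to make the result reproducible.

```python
from datetime import datetime


def timestamped_key(prefix, now=None, extension="csv"):
    """Build an object key such as '<prefix>_2024-01-02_03-04-05.csv'.

    `now` defaults to the current local time; passing it explicitly
    makes the function easy to test.
    """
    if now is None:
        now = datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}_{stamp}.{extension}"


earlier = timestamped_key("charities_bureau_scrape", datetime(2024, 1, 2, 3, 4, 5))
later = timestamped_key("charities_bureau_scrape", datetime(2024, 11, 20, 18, 30, 0))
print(earlier)
print(later)
```

Because every field is zero-padded and ordered from year to second, sorting the keys alphabetically also sorts them chronologically.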


In [12]:
import boto3
import pandas as pd
from io import StringIO

csv_buffer = StringIO()
scrape.to_csv(csv_buffer)

# Initialize an S3 resource
s3_resource = boto3.resource('s3')
bucket_name = 'm10-assignment-heny-1' 
file_name = 'charities_bureau_scrape_heny.csv'

s3_resource.Object(bucket_name, file_name).put(Body=csv_buffer.getvalue())

print(f"File {file_name} uploaded to {bucket_name}")

File charities_bureau_scrape_heny.csv uploaded to m10-assignment-heny-1
