## ChatGPT-assisted scraping: Creating a bot to pull data from Maryland employees database

Our new friend ChatGPT: enemy of English professors, cautious friend of computer programmers and, to journalists, well, something in between. The tool has become indispensable for me and many of my colleagues in the news apps class this semester, whether we use it to rewrite entire scripts, debug one line of code or explain a concept found in one of our tutorials.

In this blog post, I’ll show you how I used ChatGPT to create a semi-functional scraper for the vacancies in this Maryland state employees database. This scraping exercise was much more a practice run than anything, as the vast majority of data on the website I was pulling from turned out to be wildly dated and inaccurate. It became harder to draw any journalistic conclusions from the data by the minute, with the exception of a snide “why hasn’t the state updated its employees database in some areas since at least 2006?”

But overall, the process turned out to be a fun puzzle that was worth solving. And it wouldn’t have been possible without ChatGPT being there to guide me through what scraping an admittedly hellish website could look like. I was mildly successful and had to pivot my angle for the scraper often, but hopefully this blog post will help others understand how ChatGPT-assisted scraping can help, or hurt, someone’s coding process.

A few disclaimers before we dive in:

#### How ChatGPT should and shouldn't be used to help write code in this context

As we’ll discuss later in this post, ChatGPT can be an indispensable friend in helping folks understand single lines of code or overall templates that can be built off of to achieve a specific web scraping or programming goal. Despite this, the program’s knowledge is not absolute. 

How helpful ChatGPT can be for scraping – or any other purpose – depends almost entirely on a person’s own ability to ask smart, thoughtful questions. When I first started using ChatGPT to help me build a scraper, I did indeed make a first, half-hearted effort to give it my webpage and say “pull all of this information down! I know nothing about this webpage, but you surely will, pesky AI!” 

![Screen%20Shot%202023-05-15%20at%201.16.45%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%201.16.45%20PM.png)

In [1]:
# And here is the code ChatGPT generated:

import requests
from bs4 import BeautifulSoup

# Define the base URL and the initial page
base_url = 'https://www.doit.state.md.us/phonebook/indlisting.asp'
page_url = base_url + '?letter=A'

# Initialize an empty list to store the data
data = []

# Loop through each page (A to Z)
for letter in range(65, 91):  # ASCII codes for A to Z
    page_url = base_url + '?letter=' + chr(letter)
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all tables on the page
    tables = soup.find_all('table')

    # Loop through each table
    for table in tables:
        # Find all rows in the table
        rows = table.find_all('tr')
        for row in rows:
            # Find all cells in the row
            cells = row.find_all('td')
            # Extract the data from each cell
            if len(cells) == 4:
                # Ignore the header row
                name = cells[0].get_text().strip()
                title = cells[1].get_text().strip()
                phone = cells[2].get_text().strip()
                email = cells[3].get_text().strip()
                # Append the data to the list
                data.append([name, title, phone, email])

# Print the data
for item in data:
    print(item)

['Chris Aadland', 'Natural Resources, Department of', 'Power Plant Assessment Division B-3', '(410)260-8736']
['Christopher Aadland', 'Natural Resources, Department of', 'Habitat Restoration', '(410)260-8736']
['David Aalders', 'Public Safety & Correctional Services, Department of', 'Eastern Correctional Institution', '(410)845-4000\xa0Ext.\xa04415']
['Bizhan Aarabi', 'University of Maryland, Baltimore', 'School of Medicine', '(410)328-7371']
['Joy Aaron', 'Maryland Department of Health', 'Maryland Board of Physical Therapy Examiners', '(410)764-4716']
['Mark C. Aaron', 'University of Maryland, Baltimore', 'School of Nursing', 'No Listing']
['Clayton Anthony Aarons', 'Judiciary of Maryland', 'District #5  - Prince George`s County (Upper Marlboro)', '(301)070/4060']
['Sean R. Abad', 'University of Maryland, Baltimore', 'Office of Development and Alumni Relations', '(410)706-8494']
['Eli Abada', 'Maryland Department of Health', 'Program Analysis & MCOs', '(410)767-1392']
['Adam Abadir', 

Obviously, that didn’t work. None of the code from that first, zero-effort attempt was usable because I did not do enough legwork to make sure ChatGPT understood the patterns and nuances of this particular site. That’s your first lesson in AI-assisted scraping: the AI can only be as smart as you. It only understands the data as much as you do. So put in some legwork to make sure you’re able to help it help you.

That first scraper ChatGPT generated for me only pulled down data from one page, rather than the 100+ that I needed. So, it was back to the drawing board. 

#### Main code issues ChatGPT was super helpful in solving

If there's one thing ChatGPT is good for, it's generating templates. When our class first started this web scraping assignment and was working off a few previous scrapers we had built, I really just needed a fresh start. I thought it would be much easier to start from a clean template and then adapt the code we had already learned to my own scraping project.

Basically, I wanted to get a sense of which elements of a scraping template belong in *all* scrapers, regardless of what website they pull from, versus which aspects are add-ons based on the needs and challenges posed by each individual website. ChatGPT sent me on my way during this process and can do similar things for other folks looking to build off a baseline template for any of their coding needs.

For example, here is the basic, no-frills Beautiful Soup scraper template I generated. Even just asking ChatGPT to explain individual lines of code within it expanded my understanding of how the different elements of the scraper worked together programmatically.

##### The template itself:

![Screen%20Shot%202023-05-15%20at%201.31.16%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%201.31.16%20PM.png)



In [2]:
import requests
from bs4 import BeautifulSoup

# Set the URL you want to scrape
url = "https://example.com/table"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Find the table element using Beautiful Soup
table = soup.find("table")

# Create a list to store the table data
table_data = []

# Find all table rows using Beautiful Soup
rows = table.find_all("tr")

# Loop through each row and extract the table data
for row in rows:
    # Find all table cells in the row
    cells = row.find_all("td")
    
    # Extract the text from each cell and append it to the table data list
    row_data = []
    for cell in cells:
        row_data.append(cell.text.strip())
    
    # Append the row data to the table data list
    table_data.append(row_data)

# Print the table data
print(table_data)

AttributeError: 'NoneType' object has no attribute 'find_all'

##### Asking a clarifying question to better understand how certain HTML tags worked in a scraping context

![Screen%20Shot%202023-05-15%20at%201.32.34%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%201.32.34%20PM.png)

##### Updating the template based on a better understanding of my website and its limitations for just doing bare-minimum scraping

![Screen%20Shot%202023-05-15%20at%201.33.00%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%201.33.00%20PM.png)

In [3]:
# and that updated scraper code to work from:

import requests
from bs4 import BeautifulSoup

# Set the URL of the first page to scrape
url = "https://example.com/table?page=1"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# Create a list to store the table data
table_data = []

# Loop through all pages of the table
page = 1
while True:
    # Find the table element using Beautiful Soup
    table = soup.find("table")
    
    # Find all table rows using Beautiful Soup
    rows = table.find_all("tr")
    
    # Loop through each row and extract the table data
    for row in rows:
        # Find all table cells in the row
        cells = row.find_all("td")
        
        # Extract the text from each cell and append it to the table data list
        row_data = []
        for cell in cells:
            row_data.append(cell.text.strip())
        
        # Append the row data to the table data list
        table_data.append(row_data)
    
    # Check if there is a next page of the table
    next_button = soup.find("a", {"class": "next-page"})
    if next_button is None:
        # Stop looping if there is no next page
        break
    
    # Set the URL of the next page to scrape
    page += 1
    url = f"https://example.com/table?page={page}"
    
    # Send a GET request to the URL
    response = requests.get(url)
    
    # Parse the HTML content using Beautiful Soup
    soup = BeautifulSoup(response.content, "html.parser")

# Print the table data
print(table_data)

AttributeError: 'NoneType' object has no attribute 'find_all'

### Iterating through pages within the state employees vacancy directory

The main challenge I faced while scraping the Maryland state employees directory was making my scraper iterate through two different URL patterns at once. The website was organized into groups of pages within each individual letter of the alphabet. Within each letter tab, there were anywhere from 1 to 20 pages of results containing employees whose last names began with that letter. The offset in each URL increased by 15 (as there were 15 employees listed on each page), so pages associated with the letter 'A' would begin with a URL of "https://www.doit.state.md.us/phonebook/IndListing.asp?FirstLetter=A&offset=0", the next page in the letter 'A' would have a URL of "https://www.doit.state.md.us/phonebook/IndListing.asp?FirstLetter=A&offset=15" and so on.

I wanted to iterate through each of these URLs for each individual letter until I reached that letter's final page. I turned to ChatGPT for help, as I knew this needed to take on some form of "for loop inside for loop inside for loop" format but had little experience putting those pieces together before working on this task.
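The URL pattern described above can be sketched as a pair of nested loops: an outer loop over letters and an inner loop over offsets that step by 15. This is only a minimal sketch, not the scraper from this project. The helper names (`listing_url`, `all_listing_urls`) and the last-offset values in `LAST_OFFSETS` are hypothetical and exist only to show the pattern; the real last offset for each letter would have to be looked up on the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.doit.state.md.us/phonebook/IndListing.asp"
PAGE_SIZE = 15  # each results page lists 15 employees

# Hypothetical last offsets, used only to illustrate the pattern
LAST_OFFSETS = {"A": 30, "B": 15}


def listing_url(letter, offset):
    """Build the URL for one results page of one letter."""
    query = urlencode({"FirstLetter": letter, "offset": offset})
    return f"{BASE_URL}?{query}"


def all_listing_urls(last_offsets, page_size=PAGE_SIZE):
    """Return every page URL, letter by letter, in offset order."""
    urls = []
    for letter, last_offset in last_offsets.items():           # outer loop: letters
        for offset in range(0, last_offset + 1, page_size):    # inner loop: pages
            urls.append(listing_url(letter, offset))
    return urls


urls = all_listing_urls(LAST_OFFSETS)
for url in urls:
    print(url)
```

Building the full list of URLs first, before requesting anything, makes it easy to eyeball whether the pattern is right before pointing a scraper at the site.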

A huge cautionary tale in ChatGPT-assisted scraping is that, no matter how many instructions you give it to do otherwise, ChatGPT is quite stubborn in how it feels something should get done. When I asked ChatGPT for help with this iterating task, I tried to specify as much as possible how to complete it: the tool needed to use URLs that are incremented by 15 and move on to the next multiple of 15 once it was done with a given page. Then, once it reached the last page in a given letter, it needed to move to the next letter and repeat the process. As we will see in the following examples, the AI insisted on detecting the last page of a given letter by looking for an HTML button tag that signified it.

The only issue was that no such HTML button tag existed on these webpages. Each "next" URL was attached to a small image of an arrow with its own href. See the following examples to understand more about ChatGPT's roundabout way of attempting to help me.

![Screen%20Shot%202023-05-15%20at%202.50.14%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%202.50.14%20PM.png)


In [4]:
##### ChatGPT's initial code for iterating through each letter in an alphabet:

import requests
from bs4 import BeautifulSoup

# Define the base URL
base_url = 'https://www.doit.state.md.us/phonebook/indlisting.asp'

# Initialize an empty list to store the data
data = []

# Loop through each page (A to Z)
for letter in range(65, 91):  # ASCII codes for A to Z
    page_url = base_url + '?letter=' + chr(letter)
    response = requests.get(page_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the pagination links on the page
    pagination = soup.find('div', {'class': 'pagination'})
    if pagination is not None:
        links = pagination.find_all('a')
        num_pages = len(links) + 1
    else:
        num_pages = 1

    # Loop through each page
    for page in range(1, num_pages + 1):
        # Construct the URL for the page
        if page == 1:
            page_url = base_url + '?letter=' + chr(letter)
        else:
            page_url = base_url + '?letter=' + chr(letter) + '&page=' + str(page)
        response = requests.get(page_url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all tables on the page
        tables = soup.find_all('table')

        # Loop through each table
        for table in tables:
            # Find all rows in the table
            rows = table.find_all('tr')
            for row in rows:
                # Find all cells in the row
                cells = row.find_all('td')
                # Extract the data from each cell
                if len(cells) == 4:
                    # Ignore the header row
                    name = cells[0].get_text().strip()
                    title = cells[1].get_text().strip()
                    phone = cells[2].get_text().strip()
                    email = cells[3].get_text().strip()
                    # Append the data to the list
                    data.append([name, title, phone, email])

# Print the data
for item in data:
    print(item)

['Chris Aadland', 'Natural Resources, Department of', 'Power Plant Assessment Division B-3', '(410)260-8736']
['Christopher Aadland', 'Natural Resources, Department of', 'Habitat Restoration', '(410)260-8736']
['David Aalders', 'Public Safety & Correctional Services, Department of', 'Eastern Correctional Institution', '(410)845-4000\xa0Ext.\xa04415']
['Bizhan Aarabi', 'University of Maryland, Baltimore', 'School of Medicine', '(410)328-7371']
['Joy Aaron', 'Maryland Department of Health', 'Maryland Board of Physical Therapy Examiners', '(410)764-4716']
['Mark C. Aaron', 'University of Maryland, Baltimore', 'School of Nursing', 'No Listing']
['Clayton Anthony Aarons', 'Judiciary of Maryland', 'District #5  - Prince George`s County (Upper Marlboro)', '(301)070/4060']
['Sean R. Abad', 'University of Maryland, Baltimore', 'Office of Development and Alumni Relations', '(410)706-8494']
['Eli Abada', 'Maryland Department of Health', 'Program Analysis & MCOs', '(410)767-1392']
['Adam Abadir', 

![Screen%20Shot%202023-05-15%20at%202.51.36%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%202.51.36%20PM.png)


In [5]:
import requests
from bs4 import BeautifulSoup

# Define the base URL
base_url = 'https://www.doit.state.md.us/phonebook/indlisting.asp'

# Initialize an empty list to store the data
data = []

# Loop through each page (A to Z)
for letter in range(65, 91):  # ASCII codes for A to Z
    # Initialize the page number
    page_num = 1
    # Construct the initial URL
    url = f'{base_url}?letter={chr(letter)}&page={page_num}'
    while True:
        # Send a request to the URL
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find all tables on the page
        tables = soup.find_all('table')

        # Loop through each table
        for table in tables:
            # Find all rows in the table
            rows = table.find_all('tr')
            for row in rows:
                # Find all cells in the row
                cells = row.find_all('td')
                # Extract the data from each cell
                if len(cells) == 4:
                    # Ignore the header row
                    name = cells[0].get_text().strip()
                    title = cells[1].get_text().strip()
                    phone = cells[2].get_text().strip()
                    email = cells[3].get_text().strip()
                    # Append the data to the list
                    data.append([name, title, phone, email])

        # Check if there is a "Next" link
        next_link = soup.find('a', {'title': 'Next'})
        if next_link is not None:
            # Increment the page number and construct the new URL
            page_num += 1
            url = f'{base_url}?letter={chr(letter)}&page={page_num}'
        else:
            # No more pages, break out of the loop
            break

# Print the data
for item in data:
    print(item)

['Chris Aadland', 'Natural Resources, Department of', 'Power Plant Assessment Division B-3', '(410)260-8736']
['Christopher Aadland', 'Natural Resources, Department of', 'Habitat Restoration', '(410)260-8736']
['David Aalders', 'Public Safety & Correctional Services, Department of', 'Eastern Correctional Institution', '(410)845-4000\xa0Ext.\xa04415']
['Bizhan Aarabi', 'University of Maryland, Baltimore', 'School of Medicine', '(410)328-7371']
['Joy Aaron', 'Maryland Department of Health', 'Maryland Board of Physical Therapy Examiners', '(410)764-4716']
['Mark C. Aaron', 'University of Maryland, Baltimore', 'School of Nursing', 'No Listing']
['Clayton Anthony Aarons', 'Judiciary of Maryland', 'District #5  - Prince George`s County (Upper Marlboro)', '(301)070/4060']
['Sean R. Abad', 'University of Maryland, Baltimore', 'Office of Development and Alumni Relations', '(410)706-8494']
['Eli Abada', 'Maryland Department of Health', 'Program Analysis & MCOs', '(410)767-1392']
['Adam Abadir', 

![Screen%20Shot%202023-05-15%20at%203.25.40%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%203.25.40%20PM.png)

Here is where ChatGPT's little antics started to get frustrating. No matter how many instructions I gave it to iterate over specific links, the code it kept generating tried to find "next" attributes in the HTML on the page to iterate through. All of the above code blocks, despite differences in how they are structured, effectively return the same results and provide only the smallest subset of data from the site when run as a scraper. Frustrating, right?

This process continued for quite a while, involving a long back-and-forth between ChatGPT and me. During the process, my main strategy for understanding how the AI was implementing my suggestions was to ask for an explanation whenever I saw it had added a new line of code, then tell it to adjust that line if it was not the proper strategy to use.

Later in this process, I attempted to adapt the general vibe of ChatGPT's above code into a usable format where I could have a separate scraping loop for each individual letter and then append the values from each letter onto a list as they accumulated. ChatGPT was helpful in getting me to this point, as the baseline structure of the page-by-page scraper was easy to repurpose on a micro-scale. Here's the code I started with for trying to bring down each individual letter:

In [14]:
import requests
from bs4 import BeautifulSoup

# A
a_list = []  # create the list once, outside the loop, so it is not reset on every page
page = 0     # offsets start at 0 and go up by 15
while page <= 870:
    url = f"https://www.doit.state.md.us/phonebook/IndListing.asp?FirstLetter=A&offset={page}"
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, features="html.parser")
    table = soup.find('table')
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        # only rows with four cells hold employee data
        if len(cells) == 4:
            employee_name = cells[0].get_text().strip()
            agency_name = cells[1].get_text().strip()
            office = cells[2].get_text().strip()
            phone = cells[3].get_text().strip()
            a_list.append(
                [employee_name, agency_name, office, phone])
    page = page + 15
print(a_list)

# B
b_list = []
page = 0  # reset the offset before starting the next letter
while page <= 2010:
    url = f"https://www.doit.state.md.us/phonebook/IndListing.asp?FirstLetter=B&offset={page}"
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, features="html.parser")
    table = soup.find('table')
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        if len(cells) == 4:
            employee_name = cells[0].get_text().strip()
            agency_name = cells[1].get_text().strip()
            office = cells[2].get_text().strip()
            phone = cells[3].get_text().strip()
            b_list.append(
                [employee_name, agency_name, office, phone])
    page = page + 15

This solution was also far from the best, but it laid the groundwork for figuring out how different scraping methods might get me closer to my end goal.
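One way the per-letter loops could be collapsed into a single loop is to stop each letter when a page comes back with no rows, instead of hard-coding a last offset or hunting for a "next" button. The sketch below is not the code from this project. It assumes, without having confirmed it against the site, that an offset past the last page returns no employee rows. `fetch_rows` is a hypothetical helper that would download and parse one page; here it is replaced by a fake so the example runs offline.

```python
from string import ascii_uppercase

PAGE_SIZE = 15  # each results page lists 15 employees


def scrape_directory(fetch_rows, letters=ascii_uppercase, page_size=PAGE_SIZE):
    """Collect rows for every letter, ending each letter at its first empty page.

    fetch_rows(letter, offset) is a hypothetical helper that returns the rows
    on one page as a list, or an empty list past the last page (an assumption).
    """
    results = {}
    for letter in letters:
        rows_for_letter = []
        offset = 0
        while True:
            rows = fetch_rows(letter, offset)
            if not rows:
                break  # no rows on this page, so move on to the next letter
            rows_for_letter.extend(rows)
            offset += page_size
        results[letter] = rows_for_letter
    return results


# Stand-in for the real site: pretend "A" has 32 rows and "B" has 15
FAKE_COUNTS = {"A": 32, "B": 15}


def fake_fetch_rows(letter, offset):
    total = FAKE_COUNTS.get(letter, 0)
    last = min(offset + PAGE_SIZE, total)
    return [[f"{letter}-employee-{i}"] for i in range(offset, last)]


directory = scrape_directory(fake_fetch_rows, letters="AB")
print({letter: len(rows) for letter, rows in directory.items()})
```

Passing the page-fetching function in as an argument keeps the looping logic separate from the network code, which makes the loop testable without hitting the state's server.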

The main takeaway from troubleshooting difficult code with ChatGPT's help: sometimes even AI isn't able to help you with convoluted, poorly structured webpages. I had no idea going into this process that it would be relatively difficult to adjust the way ChatGPT was contextualizing and thinking about a problem through the resources it had available to pull. It could be possible that ChatGPT didn't have a concrete example in its back pocket for iterating through every page within every letter of a given webpage. I mean, it surely stumped me, so it may not be super far off for this complex issue to stump a super-powered AI, despite that conclusion being unexpected at the start of this process.

![Screen%20Shot%202023-05-15%20at%203.47.01%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%203.47.01%20PM.png)

You can even see my frustration with ChatGPT's methods in one of the 1000 Python files I saved to my repository during this project, named "chatgpt_troubleshooting.py" or some similar title. If I revisited this process with clearer eyes and a less pressing deadline, I'm sure I could repurpose the suggestions ChatGPT provided into something workable. But reshaping ChatGPT code into a more structured format is far from the best use of time when an assignment deadline is approaching, so I cut my losses and made the scope of my scraper much, much smaller.

I continued to troubleshoot my scraper using ChatGPT, but pivoted my angle a bit to make the problems I addressed using ChatGPT a bit more cut-and-dry and straightforward. I think the way I was conducting my queries for ChatGPT when trying to understand my iteration problem was causing both me and the AI system to chase our tails a bit. 

This is a key element of learning how ChatGPT can be helpful in solving coding-related issues. I definitely chose a website that proved more difficult to scrape than I initially expected because of all of its strange little quirks on each individual page and within its URLs. For the rest of my project, I decided to (1) lower my expectations for the kind of scraper I thought I could build with the time we had allotted for it in class and (2) lower my expectations of what ChatGPT could bend over backward to help me achieve. 

### Structured Messages for ChatGPT

The best use of ChatGPT for web scraping came from having it explain specific examples within the code blocks that I was repurposing for class or helping me beef up my ability to recall Python tricks I had learned in previous semesters. For example, once I decided I would only pull down vacant positions with my scraper, rather than the name and position of every single listed state employee, I needed to make my scraper write and overwrite CSV files each time it ran so that I would be able to see what values were changing in the vacancies.

The initial goal with this scraper was to keep track of vacancies among state employees as they were filled so we could watch whether the new gubernatorial administration in Maryland was taking action on a key campaign promise. My scraping troubles were compounded by the realization that a vast majority of the data within the Maryland employees database was inaccurate. With that said, I wanted to just pull down vacant positions to see if and how these change on a regular basis.

I used ChatGPT to help me understand more about how to read and write out to CSV files in Python. I feel confident in my Python skills, but let it be known that it's often much easier to ask ChatGPT for a quick template code than parsing through the 100 pages of Python notes some people (me!) keep from previous classes. Here's an example of how ChatGPT was helpful in this context:

![Screen%20Shot%202023-05-15%20at%204.07.29%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.07.29%20PM.png)

Some may call this laziness, but it was much easier than parsing through the different situations where our news apps class has written or rewritten CSV files for different purposes. In my final code, I implemented this solution to write CSV files based on the updates that would come from my vacancy data:

In [None]:
import csv
import time

import pandas as pd

# table_data is the list of rows collected by the scraper earlier in the script
df = pd.DataFrame(table_data, columns=[
    'employee_name', 'agency_name', 'office', 'phone_number'])
print("this should be today's data")
print(df)
total_vacancies = len(df.index)

# write a dated snapshot of today's data; "with" closes the file when done
with open("md_employees" + time.strftime('%Y_%m_%d') + ".csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["employee_name", "agency_name", "office", "phone"])
    writer.writerows(table_data)

# write a file that writes over itself each time the scraper runs so that
# we can see what has changed in GitHub.
with open("md_employees_total.csv", "w", newline="") as output_file:
    csvfile = csv.writer(output_file)
    csvfile.writerow(["employee_name", "agency_name", "office", "phone"])
    csvfile.writerows(table_data)

This provides a perfect segue into where ChatGPT was the most helpful: crafting a scraper and an eventual bot that would report the results of the scraper.

ChatGPT is your friend when troubleshooting error messages and creating structured statements for a bot to say. The end of our scraping project involved getting a SlackBot to convey the proper information to a given channel based on some filtration or update to the data being scraped. 

![Screen%20Shot%202023-05-15%20at%204.15.49%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.15.49%20PM.png)

In the above example, I used ChatGPT to help me format my Slack message (a first iteration of the message that included the date of the update – a feature I eventually did away with as I continued to refine what the message would say). 

The major improvement to my SlackBot that ChatGPT helped me achieve was creating a method for checking the previous CSV file to see if there had been an increase or decrease in state employee data across recent days. Here's what that request process looked like from ChatGPT's interface:

![Screen%20Shot%202023-05-15%20at%204.24.08%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.24.08%20PM.png)


In [None]:
# ChatGPT generated code:

import csv
import filecmp

# Define the file names and paths
current_file = 'md_employees2023_04_05.csv'
previous_file = 'md_employees2023_04_04.csv'

# Use filecmp to compare the two files
are_files_equal = filecmp.cmp(current_file, previous_file)

# If the files are different, print the differences
if not are_files_equal:
    with open(current_file, 'r') as current:
        with open(previous_file, 'r') as previous:
            current_csv = csv.reader(current)
            previous_csv = csv.reader(previous)
            for current_row, previous_row in zip(current_csv, previous_csv):
                if current_row != previous_row:
                    print(f'Difference found: {current_row} vs {previous_row}')

                

##### Hannah's clarifying questions to make this solution more tailored and Pythonic for our specific SlackBot

![Screen%20Shot%202023-05-15%20at%204.25.20%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.25.20%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.25.48%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.25.48%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.26.12%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.26.12%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.26.36%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.26.36%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.27.39%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.27.39%20PM.png)

Finally, after taking and workshopping the code from those many, many clarifications I provided to ChatGPT, I had a relatively workable template to use for the CSV files that I wanted to save and iterate through in order to tell the difference in state employee vacancies each time the scraper ran. Here's the final code block ChatGPT provided me on this topic, a helpful adjustment compared to that first, small code segment that compared two CSVs in a less intuitive way.

In [None]:
import datetime  # needed for datetime.date and datetime.timedelta below
import filecmp
import os

# Get the current directory of the script
script_dir = os.path.dirname(os.path.abspath(__file__))

# Get the directory where the CSV files are stored relative to the script directory
csv_dir = os.path.join(script_dir, '..', 'csv_files')

# Get the absolute paths of the current and previous day's CSV files
today_file = os.path.join(csv_dir, f"md_employees{datetime.date.today().strftime('%Y_%m_%d')}.csv")
yesterday_file = os.path.join(csv_dir, f"md_employees{(datetime.date.today() - datetime.timedelta(days=1)).strftime('%Y_%m_%d')}.csv")

# Compare the two files and print the differences
if filecmp.cmp(today_file, yesterday_file):
    print("The files are the same.")
else:
    print("The files are different.")
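`filecmp.cmp` only reports whether two files are identical, and comparing rows pairwise with `zip` can misreport changes once a row is added or removed in the middle of a file. As a rough alternative, the sketch below compares two snapshots as sets of rows to list what was added and what was removed. It is a sketch under assumptions rather than the approach used in this project: the function names are hypothetical, the rows are made-up placeholder data, and identical duplicate rows collapse into one because sets are used.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a CSV snapshot and return its data rows as tuples (header skipped)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return [tuple(row) for row in reader]


def compare_snapshots(previous_path, current_path):
    """Return the rows added to and removed from the current snapshot."""
    previous = set(read_rows(previous_path))
    current = set(read_rows(current_path))
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


def write_snapshot(path, rows):
    """Write placeholder rows with the same header the scraper uses."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["employee_name", "agency_name", "office", "phone"])
        writer.writerows(rows)


# Made-up placeholder rows written to a temporary folder for the demo
with tempfile.TemporaryDirectory() as tmp:
    yesterday_path = os.path.join(tmp, "md_employees2023_04_04.csv")
    today_path = os.path.join(tmp, "md_employees2023_04_05.csv")
    write_snapshot(yesterday_path, [
        ("Vacant", "Agency A", "Office 1", "No Listing"),
        ("Vacant", "Agency B", "Office 2", "No Listing"),
    ])
    write_snapshot(today_path, [
        ("Vacant", "Agency B", "Office 2", "No Listing"),
        ("Vacant", "Agency C", "Office 3", "No Listing"),
    ])
    changes = compare_snapshots(yesterday_path, today_path)

print("Added:", changes["added"])
print("Removed:", changes["removed"])
```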

Take a look at the code I wrote to structure the message my bot would produce after each update. It's easy to see how ChatGPT's suggestions helped me understand how to make this code make sense, even if the AI never wrote out the exact code I wanted to include.

In [None]:
today = date.today()
    yesterday = today - timedelta(days=1)

    print("this should be yesterday's data")
    yesterday_csv = pd.read_csv(
        "md_employees" + (yesterday.strftime('%Y_%m_%d')) + ".csv")
    print(yesterday_csv)

    total_vacancies = len(df. index)
    yesterday_vacancies = len(yesterday_csv. index)

    slack_token = os.environ.get('SLACK_API_TOKEN')
    client = WebClient(token=slack_token)

    msg = f"Hello from VacancyBot! There are currently {total_vacancies} state employee vacancies.\n"
    if total_vacancies < yesterday_vacancies:
        print("Vacancies have decreased.")
        msg += f"Vacancies have increased since {yesterday}, when there were {yesterday_vacancies}.\n"
        msg += f"Find an updated file of state employees vacancies here: https://github.com/hannahzziegler/newsapps_bot/blob/main/md_employees{today.strftime('%Y_%m_%d')}.csv \nSee any changes in vacancies in this master file: https://github.com/hannahzziegler/newsapps_bot/commits/30dc35f7f5a3eb6534f454a04377553815459a68/md_employees_total.csv \nCheck out the Maryland state employees database: https://www.doit.state.md.us/phonebook/indlisting.asp"
    elif total_vacancies > yesterday_vacancies:
        print(f"Vacancies have increased.")
        msg += f"Vacancies have increased since {yesterday}, when there were {yesterday_vacancies}.\n"
        msg += f"Find an updated file of state employees vacancies here: https://github.com/hannahzziegler/newsapps_bot/blob/main/md_employees{today.strftime('%Y_%m_%d')}.csv \nSee any changes in vacancies in this master file: https://github.com/hannahzziegler/newsapps_bot/commits/30dc35f7f5a3eb6534f454a04377553815459a68/md_employees_total.csv \nCheck out the Maryland state employees database: https://www.doit.state.md.us/phonebook/indlisting.asp"
    elif total_vacancies == yesterday_vacancies:
        print("The number of vacancies has not changed.")
        msg += f"The number of vacancies has not changed since yesterday's update.\nFind the most recent file of state employees vacancies here: https://github.com/hannahzziegler/newsapps_bot/blob/main/md_employees{today.strftime('%Y_%m_%d')}.csv"
    print(msg)
    try:
        response = client.chat_postMessage(
            channel="slack-bots",
            text=msg,
            unfurl_links=True,
            unfurl_media=True
        )
        print("success!")
    except SlackApiError as e:
        assert e.response["ok"] is False
        assert e.response["error"]
        print(f"Got an error: {e.response['error']}")
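
The snippet above reads yesterday's CSV by name, so it assumes that file is always there. On the first run, or after a skipped day, it would not be. Here is a minimal sketch of a guard for that case. The helper name `load_previous_count` and its `folder` parameter are hypothetical, and it counts rows with the standard-library `csv` module rather than pandas so that it runs without extra installs. The idea is the same as `len(yesterday_csv.index)`.

```python
import csv
from datetime import date, timedelta
from pathlib import Path
from typing import Optional


def load_previous_count(day: date, folder: str = ".") -> Optional[int]:
    """Return the number of data rows in the CSV saved for `day`, or None if it is missing."""
    # Same naming pattern the scraper uses: md_employeesYYYY_MM_DD.csv
    path = Path(folder) / f"md_employees{day.strftime('%Y_%m_%d')}.csv"
    if not path.exists():
        return None
    with path.open(newline="") as f:
        # DictReader consumes the header row, so only data rows are counted.
        return sum(1 for _ in csv.DictReader(f))


yesterday = date.today() - timedelta(days=1)
yesterday_vacancies = load_previous_count(yesterday)
if yesterday_vacancies is None:
    print("No file for yesterday; skipping the comparison.")
else:
    print(f"Yesterday there were {yesterday_vacancies} vacancies.")
```

With a guard like this, the bot could still post today's count on days when there is nothing to compare it to.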

The final message my bot pushed to Slack each day looked like this. ChatGPT helped with two pieces of it: formatting the dates properly, and counting the rows in the new CSV before comparing that count to the number of vacancy rows from the day before. 

![Screen%20Shot%202023-05-15%20at%204.19.28%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.19.28%20PM.png)
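
The three-way comparison behind that message can also be pulled out into a small function, which makes it easy to check that each branch's wording matches its condition. This is only a sketch: `build_message` is a hypothetical helper, not part of my bot, and it leaves out the links and the Slack call.

```python
from datetime import date


def build_message(today_count: int, yesterday_count: int, yesterday: date) -> str:
    """Build the Slack text from two row counts (links omitted for brevity)."""
    msg = f"Hello from VacancyBot! There are currently {today_count} state employee vacancies.\n"
    if today_count < yesterday_count:
        direction = "decreased"
    elif today_count > yesterday_count:
        direction = "increased"
    else:
        # Counts are equal, so there is no direction to report.
        return msg + "The number of vacancies has not changed since yesterday's update.\n"
    return msg + f"Vacancies have {direction} since {yesterday}, when there were {yesterday_count}.\n"


print(build_message(331, 340, date(2023, 5, 14)))
```

Because the function only takes numbers and a date, it can be tried with made-up counts before anything gets posted to a channel.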

### Takeaways from Crafting Messages with ChatGPT

ChatGPT was massively helpful in this project because it helped me understand the hassle of creating new CSV files and determining the differences between them. But if the code and the conversations with the AI model above signal anything, it should be that AI-assisted scraping is all about rephrasing and rewriting queries so that you slowly inch closer to what your ideal template code should look like. My most useful phrases during this process all followed the template of "okay, gotcha, can you update that above code to [insert stipulation or filter for the data here]?" 

These are the statements that will get you through the frustration of talking through any major coding task with a robot as your right-hand man. Patience is key, and the solution you get will rarely be exactly what you wanted out of the query. Learning to ask ChatGPT specific questions about specific areas of confusion or code trouble is the most helpful way to use the tool for any programming task. 

ChatGPT, overall, is most helpful for explaining error messages or breaking down what certain lines of code do, as in the examples below:

![Screen%20Shot%202023-05-15%20at%204.45.39%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.45.39%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.46.17%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.46.17%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.47.04%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.47.04%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.48.35%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.48.35%20PM.png)

![Screen%20Shot%202023-05-15%20at%204.49.04%20PM.png](attachment:Screen%20Shot%202023-05-15%20at%204.49.04%20PM.png)

Here's my completed scraper (which is currently broken, but I haven't had the time to go figure out what's up with it during this finals season). It's easy to tell where the ChatGPT influences are throughout it, but note that I stayed true to the central thesis of AI-assisted scraping and never expected the program to complete this entire process for me. 

In [None]:
# this is a working scraper that only pulls down the state employee vacancies,
# rather than all of the data. so we have *something* to work with!!!!
# the main thing I wanted to figure out here was making it so I don't have to
# hard-code the number of x15 multiples

import csv
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
from slack import WebClient
from slack.errors import SlackApiError
import datetime
from datetime import date, timedelta
import filecmp


def check_vacancies():
    try:
        with open('last_checked.txt', 'r') as f:
            last_checked_date = f.read().strip()
    except FileNotFoundError:
        with open('last_checked.txt', 'w') as f:
            last_checked_date = '2023-03-30'  # default date if file doesn't exist
            f.write(last_checked_date)

    page = 0
    table_data = []
    while page <= 330:
        url = f"https://www.doit.state.md.us/phonebook/IndListing.asp?FirstLetter=vacant&Submit=Search&offset={page}"
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html, features="html.parser")
        table = soup.find('table')
        rows = table.find_all('tr')
        for row in rows:
            cells = row.find_all('td')
            if len(cells) == 4:
                employee_name = cells[0].get_text().strip()
                agency_name = cells[1].get_text().strip()
                office = cells[2].get_text().strip()
                phone = cells[3].get_text().strip()
                table_data.append([employee_name, agency_name, office, phone])
        page = page + 15

    # print(table_data)

    df = pd.DataFrame(table_data, columns=[
        'employee_name', 'agency_name', 'office', 'phone_number'])
    print("this should be today's data")
    print(df)
    total_vacancies = len(df.index)

    # use a context manager so today's file is flushed and closed after writing
    with open("md_employees" + time.strftime('%Y_%m_%d') + ".csv", "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["employee_name", "agency_name", "office", "phone"])
        writer.writerows(table_data)

    # write a file that writes over itself each time the scraper runs so that
    # we can see what has changed in GitHub.
    with open("md_employees_total.csv", "w") as output_file:
        csvfile = csv.writer(output_file)
        csvfile.writerow(["employee_name", "agency_name", "office", "phone"])
        csvfile.writerows(table_data)

    today = date.today()
    yesterday = today - timedelta(days=1)

    print("this should be yesterday's data")
    yesterday_csv = pd.read_csv(
        "md_employees" + (yesterday.strftime('%Y_%m_%d')) + ".csv")
    print(yesterday_csv)

    total_vacancies = len(df.index)
    yesterday_vacancies = len(yesterday_csv.index)

    slack_token = os.environ.get('SLACK_API_TOKEN')
    client = WebClient(token=slack_token)

    msg = f"Hello from VacancyBot! There are currently {total_vacancies} state employee vacancies.\n"
    if total_vacancies < yesterday_vacancies:
        print("Vacancies have decreased.")
        msg += f"Vacancies have decreased since {yesterday}, when there were {yesterday_vacancies}.\n"
        msg += f"Find an updated file of state employees vacancies here: https://github.com/hannahzziegler/newsapps_bot/blob/main/md_employees{today.strftime('%Y_%m_%d')}.csv \nSee any changes in vacancies in this master file: https://github.com/hannahzziegler/newsapps_bot/commits/30dc35f7f5a3eb6534f454a04377553815459a68/md_employees_total.csv \nCheck out the Maryland state employees database: https://www.doit.state.md.us/phonebook/indlisting.asp"
    elif total_vacancies > yesterday_vacancies:
        print("Vacancies have increased.")
        msg += f"Vacancies have increased since {yesterday}, when there were {yesterday_vacancies}.\n"
        msg += f"Find an updated file of state employees vacancies here: https://github.com/hannahzziegler/newsapps_bot/blob/main/md_employees{today.strftime('%Y_%m_%d')}.csv \nSee any changes in vacancies in this master file: https://github.com/hannahzziegler/newsapps_bot/commits/30dc35f7f5a3eb6534f454a04377553815459a68/md_employees_total.csv \nCheck out the Maryland state employees database: https://www.doit.state.md.us/phonebook/indlisting.asp"
    elif total_vacancies == yesterday_vacancies:
        print("The number of vacancies has not changed.")
        msg += f"The number of vacancies has not changed since yesterday's update.\nFind the most recent file of state employees vacancies here: https://github.com/hannahzziegler/newsapps_bot/blob/main/md_employees{today.strftime('%Y_%m_%d')}.csv"
    print(msg)
    try:
        response = client.chat_postMessage(
            channel="slack-bots",
            text=msg,
            unfurl_links=True,
            unfurl_media=True
        )
        print("success!")
    except SlackApiError as e:
        assert e.response["ok"] is False
        assert e.response["error"]
        print(f"Got an error: {e.response['error']}")


check_vacancies()
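
The comment at the top of the scraper mentions wanting to avoid hard-coding the number of x15 multiples, yet the loop still stops at `page <= 330`. One way to drop that limit is sketched below. It rests on an assumption I have not verified against the site: that a request past the last page comes back with no data rows. `scrape_all_pages`, `fetch_rows`, `max_pages` and `fake_fetch` are all hypothetical names, and the fake fetcher stands in for the `requests` and BeautifulSoup code so the loop can be tried without hitting the website.

```python
from typing import Callable, List

PAGE_SIZE = 15  # the scraper above steps the offset by 15 per page


def scrape_all_pages(
    fetch_rows: Callable[[int], List[List[str]]],
    max_pages: int = 1000,
) -> List[List[str]]:
    """Keep requesting pages until one comes back with no rows."""
    all_rows: List[List[str]] = []
    offset = 0
    # max_pages is a safety cap in case the site never returns an empty page.
    for _ in range(max_pages):
        rows = fetch_rows(offset)
        if not rows:
            break
        all_rows.extend(rows)
        offset += PAGE_SIZE
    return all_rows


def fake_fetch(offset: int) -> List[List[str]]:
    """Stand-in for the real request: pretends the site lists 40 vacancies."""
    data = [[f"Vacant {i}", "Agency", "Office", "410-555-0100"] for i in range(40)]
    return data[offset:offset + PAGE_SIZE]


rows = scrape_all_pages(fake_fetch)
print(len(rows))  # prints 40
```

In the real scraper, `fetch_rows` would wrap the request-and-parse steps from the `while` loop and return the four-cell rows for a single offset.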