# Grabbing the monthly data on drug seizures published by US Customs and Border Protection

US Customs and Border Protection (CBP) updates a [dashboard](https://www.cbp.gov/newsroom/stats/drug-seizure-statistics) monthly and also provides CSVs of the data [here](https://www.cbp.gov/document/stats/nationwide-drug-seizures). The scraper in this script goes through the following steps:
1. grab the HTML of the website with the CSV files
2. grab the date of the latest CSV
3. compare the date from step 2 to the date of the most recent dataset we have
4. if the date from step 2 equals the date of our most recent data, the script stops; if it is more recent, we download the latest CSV

Before actually writing our code, we need to import the libraries that we're going to use.

In [70]:
import requests
import urllib3
from bs4 import BeautifulSoup
from datetime import datetime
import re
import pandas as pd

## 1. grab the HTML of the website with the CSV files

In [56]:
# Specify the URL
url = "http://www.cbp.gov/document/stats/nationwide-drug-seizures"

# Silence urllib3 warnings so they don't clutter the notebook output
urllib3.disable_warnings()

# Send a browser-like User-Agent so the site treats us as a normal visitor, not a bot
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

# Get the HTML from the URL
response = requests.get(url, headers=headers)

This next cell is optional and will be deleted before the code goes into production. If the request in the last step succeeds, the server returns the status code 200. So if we get that status, the notebook prints the HTML so we can see it, and if not, it tells us we have an error.

In [57]:
if response.status_code == 200:
    print(response.text)  # Print the content of the response
else:
    print(f'Request failed with status code: {response.status_code}')

<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="og: https://ogp.me/ns#">
  <head>
    <meta charset="utf-8" />
<meta name="description" content="Return to the Public Data Portal." />
<meta name="keywords" content="Statistics" />
<link rel="canonical" href="https://www.cbp.gov/document/stats/nationwide-drug-seizures" />
<meta property="og:site_name" content="U.S. Customs and Border Protection" />
<meta property="og:type" content="website" />
<meta property="og:url" content="https://www.cbp.gov/document/stats/nationwide-drug-seizures" />
<meta property="og:title" content="Nationwide Drug Seizures" />
<meta property="og:description" content="Securing America&#039;s Borders" />
<meta property="og:image:url" content="https://www.cbp.gov/sites/default/files/cbp-seal-1200-630-px-2021.jpg" />
<meta name="twitter:card" content="summary" />
<meta name="twitter:title" content="Nationwide Drug Seizures" />
<meta name="twitter:description" content="Return to the Public Data Portal." />
<meta name

## 2. grab the date of the latest CSV

This code takes the HTML we just grabbed and pulls the date out of the first row of a table on the page. The way the HTML is structured[1], the third `<td>` cell on the page holds that date.

[1] As of the time this code was written. Websites are often updated, and if CBP changes its website, this code will have to be modified accordingly.
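
Since the position of the date cell can change with the page layout, it may help to make the date parsing fail with a readable message. The following is a minimal sketch, not part of the original scraper: the helper name `parse_cbp_date` is hypothetical, the sample date is made up, and the cell text is assumed to look like `MM/DD/YYYY`, the format the parsing code in this step expects.

```python
from datetime import datetime


def parse_cbp_date(cell_text):
    """Turn text such as ' 02/14/2024 ' into a datetime (hypothetical helper)."""
    # Remove every whitespace character, including newlines inside the cell
    cleaned = "".join(cell_text.split())
    try:
        return datetime.strptime(cleaned, "%m/%d/%Y")
    except ValueError as err:
        # A failure here may mean the selected cell no longer holds a date
        raise ValueError(
            f"Expected a date like MM/DD/YYYY but got {cleaned!r}; "
            "the page layout may have changed."
        ) from err


# Made-up sample text, padded the way table cells often are
print(parse_cbp_date("\n    02/14/2024\n  "))
```

Raising a descriptive error here would point straight at the likely cause instead of leaving a bare `strptime` traceback.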

In [75]:
# Parse the content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Find all <td> tags and keep the third one, which holds the date
date_cell = soup.find_all('td')[2]

# Extract the cell's text and remove all whitespace
latest_date_str = re.sub(r'\s+', '', date_cell.get_text())

latest_date = datetime.strptime(latest_date_str, "%m/%d/%Y")
print(latest_date)

TypeError: 'datetime.datetime' object is not subscriptable

## 4. if the date from step 2 equals the date of our most recent data, the script stops; if it is more recent, we download the latest CSV

I know this is out of order, but bear with me. In step 3, we compare two dates and then, based on the result, call different functions. In order for that to work, we need to have the functions already defined. So here, we'll define all the options, then we'll run the comparison and the program will execute whichever function is needed.

In [84]:
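base_url = "https://www.cbp.gov"

def get_dataset():
    """Download the latest CSV and save it under data/ with the date in its name."""
    # The first link inside the table body points to the latest CSV
    link_tags = soup.find('tbody')
    link_stub = link_tags.find('a').get('href')

    # The href is treated as site-relative, so prepend the domain
    full_url = base_url + link_stub
    filename = "data/cbp_data_" + latest_date.strftime("%Y-%m-%d") + ".csv"

    query_parameters = {"downloadformat": "csv"}
    data_response = requests.get(full_url, params=query_parameters)
    if data_response.status_code == 200:
        # Write the raw bytes; the data/ directory must already exist
        with open(filename, mode="wb") as file:
            file.write(data_response.content)
    else:
        raise ValueError(f'Request failed with status code: {data_response.status_code}')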
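
Because `get_dataset` writes the date into the filename, the date of the newest dataset already on disk could be recovered from the `data/` folder rather than kept in a variable. This is only a sketch under that assumption: `find_current_date` is a hypothetical helper that is not part of the original notebook, and it relies on the `cbp_data_YYYY-MM-DD.csv` naming pattern used above.

```python
import re
from datetime import datetime
from pathlib import Path

# Matches names such as cbp_data_2024-02-14.csv
FILENAME_PATTERN = re.compile(r"^cbp_data_(\d{4}-\d{2}-\d{2})\.csv$")


def find_current_date(data_dir="data"):
    """Return the newest date found in the saved filenames, or None if there are none."""
    folder = Path(data_dir)
    if not folder.is_dir():
        return None

    dates = []
    for path in folder.iterdir():
        match = FILENAME_PATTERN.match(path.name)
        if match:
            dates.append(datetime.strptime(match.group(1), "%Y-%m-%d"))

    # max() on an empty list would raise, so handle that case explicitly
    return max(dates) if dates else None
```

Returning `None` when nothing has been downloaded yet gives the first run an explicit value to check, instead of relying on whether a variable exists.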
    
    

## 3. compare the date from step 2 to the date of the most recent dataset we have

The function `compare_dates` takes the variable `current_date`, which will be undefined the first time this runs and will be the date of the last dataset we grabbed from then on, and compares it to the variable we just created, `latest_date`. When `current_date` is undefined or earlier than `latest_date`, it calls `get_dataset` to get the latest dataset. If `current_date` is equal to `latest_date`, that means we already have the latest dataset and the program stops. If `current_date` is later than `latest_date`, that means there's an error or CBP has taken down some of its data. This should be investigated.

In [85]:
def compare_dates(latest_date):
    # current_date (a datetime) only exists once an earlier run has stored it.
    # Compare datetime objects, not MM/DD/YYYY strings, which do not sort chronologically.
    if 'current_date' not in globals():
        get_dataset()
    elif current_date < latest_date:
        get_dataset()
    elif current_date == latest_date:
        print("Dataset is already up to date")
    else:
        raise ValueError("Error:\nSomething is wrong. The current dataset\nseems to be more recent than the most\nrecent data. You should investigate whether\nthe site structure has changed or a dataset\nwas removed.")

compare_dates(latest_date)
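
The branching described above can also be written as a small function that returns a decision instead of acting on it, which makes the logic easy to check without touching the network. This is a sketch, not the notebook's implementation: `decide_action` and its return labels are hypothetical, the dates are made up, and both dates are assumed to be `datetime` objects, since `MM/DD/YYYY` strings do not sort in calendar order.

```python
from datetime import datetime


def decide_action(current_date, latest_date):
    """Return 'download' or 'up_to_date'; raise if our data looks newer than the site's."""
    # None stands in for "no dataset downloaded yet"
    if current_date is None or current_date < latest_date:
        return "download"
    if current_date == latest_date:
        return "up_to_date"
    raise ValueError(
        "The current dataset seems to be more recent than the latest one on the site."
    )


# Made-up dates for illustration
december = datetime(2023, 12, 1)
january = datetime(2024, 1, 15)

print(decide_action(None, january))      # first run
print(decide_action(december, january))  # site has newer data
print(decide_action(january, january))   # nothing new

# The same dates as MM/DD/YYYY strings compare in the wrong order:
# this prints False even though December 2023 is earlier than January 2024
print("12/01/2023" < "01/15/2024")
```

Keeping the decision separate from the download means the three outcomes can be exercised with plain dates before the scraper ever runs against the live site.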