# Webscraper for Yourway and World Courier Packages
This script scrapes tracking information from the couriers Yourway and World Courier, neither of which provides an API. It returns a spreadsheet with the scraped pickup and delivery datetimes where available. The datetimes are not adjusted for time zones or daylight saving time, because the SQL server that is updated with the results of this script adjusts for them automatically.

For demonstration purposes, a file named *demoData.xlsx* is used. It contains five rows of data. The first two are Yourway packages. The last three are World Courier packages, but one has an invalid airbill number and will not return anything. (This was true at the time of writing; couriers do not keep tracking information indefinitely.)

Start by importing libraries. pandas is used for data management and NumPy for NULL values. Selenium's webdriver creates a computer-controlled web browser, and Selenium's Keys emulates keyboard input. datetime is used to parse the datetime information pulled from the tracking websites, and time is used to pause between browser actions:

In [35]:
## Import libraries
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from datetime import datetime

Tracking information comes from a Power BI report that lists samples that are out of compliance with certain metrics that relate to the viability of the samples. An Excel file is generated from the report. The sheet has at least the following columns:
* Study
* Site Code
* Participant ID
* Visit Number
* Barcode
* Kit Barcode
* Parent ID
* Cell Line Number
* Processed Date
* Courier Delivery Date
* Courier Pickup Date
* Site Ship Date
* Collection Date
* Carrier
* Airbill Number

The datetime columns are kept because processing a data change request requires recording the old value as well.

Rows where the carrier, airbill, or ship date is unknown are removed, because all three are required inputs on the tracking websites. Airbills and site ship dates are formatted as strings so that Selenium can type them into the websites.

A list of state abbreviations is also created so that time zones can be determined later.

In [36]:
## Get the flagged PBMC records and select only the relevant columns/rows
flag=pd.read_excel("demoData.xlsx") # Turn the Excel sheet with missing delivery and pick-up date into a Pandas dataframe
flag.dropna(subset=["Carrier","Airbill Number","Site Ship Date"],inplace=True) # Remove rows that are missing the Carrier, Airbill Number, or Site Ship Date
flag["Airbill Number"]=flag["Airbill Number"].astype('int64') # Pandas turns the airbill numbers into floats. Turn them into integers
flag["Airbill Number"]=flag["Airbill Number"].astype(str) # Turn the airbill numbers into strings (we send strings to the website through the webdriver)
flag["Site Ship Date Str"]=flag["Site Ship Date"].astype(str) # Turn the site ship date into strings

# Create a list of states
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

flag

Unnamed: 0,Study,Site Code,Participant ID,Visit Number,Barcode,Kit Barcode,Parent ID,Cell Line Number,Processed Date,Courier Delivery Date,Courier Pickup Date,Site Ship Date,Collection Date,Carrier,Airbill Number,Site Ship Date Str
0,Study1,1,Arnold,Month A,,1234,1234-H,,NaT,,,2022-11-21,2022-11-21 13:15:00,Custom Courier,527132,2022-11-21
1,Study1,2,Bobby,Month B,11111-A1,5678,,,2022-10-20 12:45:00,,,2022-10-19,2022-10-19 11:30:00,Custom Courier,525119,2022-10-19
2,Study2,3,Charlie,0,11112-A1,9101,2345-G,Alpha,2022-12-20 12:45:00,,,2022-12-18,2022-12-18 01:39:00,World Courier,608751583020,2022-12-18
3,Study2,3,Dennis,0,11113-A1,2345,3456-K,Bravo,2023-02-20 12:15:00,,,2023-02-18,2023-02-18 09:40:00,World Courier,110816182,2023-02-18
4,Study2,3,Eugene,0,11114-A1,3456,4567-P,Hotel,2023-01-17 11:25:00,,,2023-01-15,2023-01-15 01:15:00,World Courier,713775188,2023-01-15
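
The cell above casts the airbill numbers to integers before turning them into strings because a float would otherwise be typed into the tracking form with a trailing `.0`. The sketch below shows the same idea on plain Python values. The helper name `airbill_to_str` is hypothetical and is not part of the original script.

```python
import math

def airbill_to_str(value):
    """Return an airbill as a digit string, or None if it is missing.

    Hypothetical helper for illustration only.
    """
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None  # A missing airbill cannot be tracked
        return str(int(value))  # 527132.0 -> "527132"
    return str(value).strip()

print(str(527132.0))             # 527132.0  (what a float cast alone would send)
print(airbill_to_str(527132.0))  # 527132
```

Handling missing values inside the helper is a design choice for this sketch; the script itself drops those rows with `dropna` first.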


Next, I wrote a function that accesses the tracking websites of each courier. The script goes to the website, enters the information needed to access tracking information, then returns the shipping information extracted from the webpage:

In [37]:
## Define a function that scrapes information row-by-row
def scrape(carrier,airbill,pickupdate,pickupdate_str):
    # Change what the function does based on the carrier (since each website has their tracking page set up differently)
    if carrier=="Custom Courier": # Custom Courier usually refers to Yourway
        time.sleep(2) # Wait for 2 seconds (this is to prevent accidentally DDOSing the tracking website)
        driver.get("https://dispatch.yourwaytransport.com/us/tracking/") # Go to the URL for Yourway's tracking page
        jobNum=driver.find_element("xpath",'''/html/body/div/div[3]/div/form/fieldset[1]/div/div/input''') # Finds the box to put in the tracking number
        jobNum.send_keys(airbill) # Types the airbill into the box to put in the tracking number
        puDate=driver.find_element("xpath",'''//*[@id="pu_date_input"]''') # Finds the box for the pick-up date (the form also accepts an account number, but we don't have one at the moment)
        puDate.send_keys(pickupdate_str) # Types the pick-up date into that box
        track=driver.find_element("xpath",'''/html/body/div/div[3]/div/form/div/input''') # Finds the "track shipment" button
        track.click() # Clicks the "track shipment" button
        time.sleep(1) # Waits for a second
        foundChk=driver.find_element("xpath",'''/html/body''') # Get all of the text on the page
        if "There is not a job that correlates with the search criteria entered." in foundChk.text: # Checks if this error message is on the webpage
            return (np.nan, np.nan, np.nan,np.nan) # Returns a tuple of NULL values

        else: # If we don't see this error message...
            timeStringList=[] # List for string representation of time
            info=driver.find_element("xpath",'''/html/body/div/div[3]/div[1]/div[4]/ul''') # Find the section of the webpage that has the shipment's details
            info=info.text # Gets the text from the section of the webpage that has the shipment's details
            info=info.split('\n') # Split the text whenever there is a newline
            location=driver.find_element("xpath",'''/html/body/div/div[3]/div[1]/div[5]''') # Find the section of the webpage that has the delivery and pickup locations
            location=location.text.split('\n') # Split its text whenever there is a newline
            location_list=[location[5],location[2]] # Gives us the location strings for the delivery and pickup locations, respectively
            for m in range(len(location_list)): # Blank out any location that does not end in a recognized state abbreviation
                if (len(location_list[m])<2) or (location_list[m][-2:].upper() not in states):
                    location_list[m]=np.nan
            info=[info[7],info[3]] # Take the items at indices 7 and 3 (Python is 0-indexed, so these are the 8th and 4th items). These are the delivery and pickup timestamps, respectively (they are still strings!)
            info[0]=info[0]+":00" # Add :00 to the end of the delivery timestamp
            info[1]=info[1]+":00" # Add :00 to the end of the pickup timestamp
            copyInfo=[datetime.strptime(info[0],"%m/%d/%y %H:%M:%S"),datetime.strptime(info[1],"%m/%d/%y %H:%M:%S")] # Turn both strings into datetime objects

            return (copyInfo[0], location_list[0] ,copyInfo[1], location_list[1]) # Output the delivery timestamp and location, then the pickup timestamp and location
    elif carrier=="World Courier": # If the carrier is World Courier...
        # Some explanation: here, the code simulates someone clicking on the date field on World Courier's tracking page.
        # There should be PDFs included with this script. They show visually and mathematically what the code below does

        # The first year button on the calendar (a year ending in 0) is the second child element. Thus, you need to add 2 to the last digit of the year.
        yearLastInt=(pickupdate.year % 10)+2  # To get the last digit, we do year mod 10 (look up modular arithmetic to learn more). Then, to get the integer needed to select the button, we add 2
        yearCSSSelector='''span.year:nth-child({})'''.format(yearLastInt) # Build the CSS Selector string so that we have the browser select the right thing to click
        # January is the second button, so to select this, we need to add 1 to the integer representation of the month
        monthNum=(pickupdate.month)+1 
        monthCSSSelector='''span.month:nth-child({})'''.format(monthNum) # Build the CSS Selector string so that we have the browser select the right thing to click
        # The button for the day changes on what day the first of the month lands on. 
        tempDate=pd.Timestamp(pickupdate.year,pickupdate.month,1) # Create a temporary timestamp object for the first day of the month
        dayOfWk=(tempDate.dayofweek) # Find what day the first of the month lands on
        # pandas numbers the days of the week as Mon=0, Tues=1,...,Sat=5, Sun=6
        if dayOfWk==6:
            dayOfWk=8 # If the first lands on a Sunday, its cell number is 8
        else:
            dayOfWk+=9 # If the first lands on any other day, its cell number is the integer representation of the day of the week plus 9
        dayNum=dayOfWk+pickupdate.day-1 # To get the cell number, we add the above number, the day number, then subtract 1. See explain2 for more info
        dayCSSSelector='''span.day:nth-child({})'''.format(dayNum) # Build the CSS Selector string so that we have the browser select the right thing to click
        
        driver.get('''https://portal.worldcourier.com/en/FastTrack-Shipment''') # Have the browser go to World Courier's tracking page
        airbill_form=driver.find_element("xpath",'''//*[@id="fast-track-shipment-form_house-waybill_input"]''') # Find the box for the airbill
        airbill_form.send_keys(airbill) # Type in the airbill
        pickdate=driver.find_element("xpath",'''/html/body/div[2]/main/div/div[2]/form/span[2]/div/div[1]/input''').click() # Click the date field to open the calendar interface
        time.sleep(1) # Wait 1 second
        monthmenu=driver.find_element("xpath",'''/html/body/div[2]/main/div/div[2]/form/span[2]/div/div[2]/header/span[2]''').click()# Click the header of the calendar to choose the month
        time.sleep(1) # Wait 1 second
        yearmenu=driver.find_element("xpath",'''/html/body/div[2]/main/div/div[2]/form/span[2]/div/div[3]/header/span[2]''').click() # Click the header of the calendar to choose the year
        time.sleep(1) # Wait 1 second
        year=driver.find_element("css selector",yearCSSSelector).click() # Click on the year
        time.sleep(1) # Wait 1 second
        month=driver.find_element("css selector",monthCSSSelector).click() # Click on the month
        time.sleep(1) # Wait 1 second
        day=driver.find_element("css selector",dayCSSSelector).click() # Click on the day
        time.sleep(1) # Wait 1 second
        trackButton=driver.find_element("xpath",'''/html/body/div[2]/main/div/div[2]/form/button''').click() # Click on the track shipment button
        time.sleep(3) # Wait 3 seconds
        driver.find_element('tag name','body').send_keys(Keys.PAGE_DOWN) # Scroll down
        time.sleep(1) # Wait 1 second
        pageText=driver.find_element("xpath",'''/html/body''') # Get all of the text on the page
        if "Sorry. We can't find anything based on your search terms. Please check your house waybill number and try again." in pageText.text: # If we get this error message on the page...
            return (np.nan, np.nan,np.nan,np.nan) # Return NULL values
        else:
            txt=pageText.text # Get the text from the page
            txt=txt.split('\n') # Split the text whenever there is a newline. Put them into a list
            ## Search for the index in the list where the string has "Pick-Up Date "/'Delivered On/POD ' in it
            for k in txt:
                if "Pick-Up Date " in k:
                    pick_ind=txt.index(k)
                elif 'Delivered On/POD ' in k:
                    deliver_ind=txt.index(k)
                elif 'Pick Up From' in k:
                    pickupLocation=(txt.index(k))+1
                elif 'Deliver To' in k:
                    deliverLocation=(txt.index(k))+1
                else:
                    continue
            locations=[txt[pickupLocation],txt[deliverLocation]]
            txt=[txt[pick_ind],txt[deliver_ind]] # Use the 2 indices to create a list with only the relevant strings
            
            
            timeText=[]
            txt[0]=txt[0].replace('Pick-Up Date ','') # In the first item in the list, remove 'Pick-Up Date '
            # Find the index where ' AM' or ' PM' is in the string
            if " AM" in txt[0]:
                index=txt[0].find(" AM")
            elif " PM" in txt[0]:
                index=txt[0].find(" PM")
            txt[0]=txt[0][:index]+":00"+txt[0][index:] # Put :00 in between the minutes and AM/PM
            timeText.append(txt[0])
            txt[0] = datetime.strptime(txt[0],"%Y-%m-%d %I:%M:%S %p") # Turn the string into a datetime object
            txt[1]=txt[1].replace('Delivered On/POD ','') # In the second item in the list, remove 'Delivered On/POD '
            # Find the index where ' AM' or ' PM' is in the string
            if " AM" in txt[1]:
                index=txt[1].find(" AM")
            elif " PM" in txt[1]:
                index=txt[1].find(" PM")
            txt[1]=txt[1][:index]+":00"+txt[1][index:] # Put :00 in between the minutes and AM/PM
            timeText.append(txt[1])
            txt[1] = datetime.strptime(txt[1],"%Y-%m-%d %I:%M:%S %p") # Turn the string into a datetime object
            return (txt[1],locations[1],txt[0],locations[0]) # Return the delivery timestamp and location, then the pickup timestamp and location

Here, a Firefox window is launched and the webscraping is done on every row of the dataframe:

In [38]:
## Start the browser
driver=webdriver.Firefox()

## Apply the function
flag["New Courier Delivery Date"],flag["Delivery city state"],flag["New Courier Pickup Date"],flag["Pickup city state"]=zip(*flag.apply(lambda x: scrape(x["Carrier"],x["Airbill Number"],x["Site Ship Date"],x["Site Ship Date Str"]), axis=1))


flag.to_excel("demoRes.xlsx",index=False) # Create an Excel spreadsheet with the scraped timestamps
driver.quit() # Close the browser

flag

Unnamed: 0,Study,Site Code,Participant ID,Visit Number,Barcode,Kit Barcode,Parent ID,Cell Line Number,Processed Date,Courier Delivery Date,Courier Pickup Date,Site Ship Date,Collection Date,Carrier,Airbill Number,Site Ship Date Str,New Courier Delivery Date,Delivery city state,New Courier Pickup Date,Pickup city state
0,Study1,1,Arnold,Month A,,1234,1234-H,,NaT,,,2022-11-21,2022-11-21 13:15:00,Custom Courier,527132,2022-11-21,2022-11-22 12:40:00,"Seattle, WA",2022-11-21 15:30:00,
1,Study1,2,Bobby,Month B,11111-A1,5678,,,2022-10-20 12:45:00,,,2022-10-19,2022-10-19 11:30:00,Custom Courier,525119,2022-10-19,2022-10-20 10:05:00,"Seaattle, WA",2022-10-19 17:00:00,"Sacramento, CA"
2,Study2,3,Charlie,0,11112-A1,9101,2345-G,Alpha,2022-12-20 12:45:00,,,2022-12-18,2022-12-18 01:39:00,World Courier,608751583020,2022-12-18,NaT,,NaT,
3,Study2,3,Dennis,0,11113-A1,2345,3456-K,Bravo,2023-02-20 12:15:00,,,2023-02-18,2023-02-18 09:40:00,World Courier,110816182,2023-02-18,NaT,,NaT,
4,Study2,3,Eugene,0,11114-A1,3456,4567-P,Hotel,2023-01-17 11:25:00,,,2023-01-15,2023-01-15 01:15:00,World Courier,713775188,2023-01-15,2023-01-16 16:46:00,"INDIANAPOLIS, IN 46241-5502",2023-01-15 16:45:00,"LITTLE ROCK, AR 72202-3500"
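
The timestamps in the table above were parsed by appending `:00` to each scraped string and then using a format that includes seconds. As an alternative sketch, `datetime.strptime` can parse the minute-resolution strings directly. The function names and the sample strings below are assumptions inferred from the format strings used in `scrape`; they are not captured from the live websites.

```python
from datetime import datetime

def parse_yourway(stamp):
    """Parse a Yourway timestamp such as "11/22/22 12:40" (assumed layout)."""
    return datetime.strptime(stamp.strip(), "%m/%d/%y %H:%M")

def parse_world_courier(line, label):
    """Parse a World Courier line such as "Pick-Up Date 2023-01-15 04:45 PM".

    `label` is the prefix to strip, e.g. "Pick-Up Date " or "Delivered On/POD ".
    """
    text = line.replace(label, "", 1).strip()
    return datetime.strptime(text, "%Y-%m-%d %I:%M %p")

print(parse_yourway("11/22/22 12:40"))
# 2022-11-22 12:40:00
print(parse_world_courier("Pick-Up Date 2023-01-15 04:45 PM", "Pick-Up Date "))
# 2023-01-15 16:45:00
```

Parsing directly removes the need to locate ` AM` or ` PM` in the string, and an unexpected layout raises a `ValueError` instead of silently producing a wrong time.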


For a little data cleaning, I wrote a function to correct the pickup and delivery locations to a standard format:

In [39]:
def locationChk(locName):
    if locName in ["Seattle, WA","Seaattle, WA","SEATTLE, WA 98101","SEATTLE, WA 98101-2795"]:
        return "Seattle, WA"
    elif locName in ["Inianapolis, IN","INDIANAPOLIS, IN 46241-5502"]:
        return "Indianapolis, IN"
    elif locName=="Sacramento, CA":
        return "Sacramento, CA"
    elif locName=="LITTLE ROCK, AR 72202-3500":
        return "Little Rock, AR"
    else:
        return None

flag["Delivery city state"]=flag["Delivery city state"].map(locationChk) # Standardize each delivery location
flag["Pickup city state"]=flag["Pickup city state"].map(locationChk) # Standardize each pickup location

flag


Unnamed: 0,Study,Site Code,Participant ID,Visit Number,Barcode,Kit Barcode,Parent ID,Cell Line Number,Processed Date,Courier Delivery Date,Courier Pickup Date,Site Ship Date,Collection Date,Carrier,Airbill Number,Site Ship Date Str,New Courier Delivery Date,Delivery city state,New Courier Pickup Date,Pickup city state
0,Study1,1,Arnold,Month A,,1234,1234-H,,NaT,,,2022-11-21,2022-11-21 13:15:00,Custom Courier,527132,2022-11-21,2022-11-22 12:40:00,"Seattle, WA",2022-11-21 15:30:00,
1,Study1,2,Bobby,Month B,11111-A1,5678,,,2022-10-20 12:45:00,,,2022-10-19,2022-10-19 11:30:00,Custom Courier,525119,2022-10-19,2022-10-20 10:05:00,"Seattle, WA",2022-10-19 17:00:00,"Sacramento, CA"
2,Study2,3,Charlie,0,11112-A1,9101,2345-G,Alpha,2022-12-20 12:45:00,,,2022-12-18,2022-12-18 01:39:00,World Courier,608751583020,2022-12-18,NaT,,NaT,
3,Study2,3,Dennis,0,11113-A1,2345,3456-K,Bravo,2023-02-20 12:15:00,,,2023-02-18,2023-02-18 09:40:00,World Courier,110816182,2023-02-18,NaT,,NaT,
4,Study2,3,Eugene,0,11114-A1,3456,4567-P,Hotel,2023-01-17 11:25:00,,,2023-01-15,2023-01-15 01:15:00,World Courier,713775188,2023-01-15,2023-01-16 16:46:00,"Indianapolis, IN",2023-01-15 16:45:00,"Little Rock, AR"
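
`locationChk` lists every known spelling by hand, which works for a small demo but needs a new entry for each new address. A more general sketch, under the assumption that locations look like `City, ST` with an optional ZIP code, is shown below. The names `CITY_ALIASES` and `normalize_location` are hypothetical. Unlike `locationChk`, this version passes unknown cities through instead of returning `None` for them.

```python
import re

# Hypothetical alias table for misspellings seen on the courier pages
CITY_ALIASES = {"Seaattle": "Seattle", "Inianapolis": "Indianapolis"}

# "CITY, ST", optionally followed by a ZIP or ZIP+4 code
_LOCATION_RE = re.compile(
    r"^\s*(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Za-z]{2})(?:\s+\d{5}(?:-\d{4})?)?\s*$"
)

def normalize_location(raw):
    """Return "City, ST", or None when the value is missing or unrecognized."""
    if not isinstance(raw, str):
        return None  # Covers NaN and None
    match = _LOCATION_RE.match(raw)
    if match is None:
        return None
    city = match.group("city").title()
    city = CITY_ALIASES.get(city, city)  # Correct known misspellings
    return f"{city}, {match.group('state').upper()}"

print(normalize_location("LITTLE ROCK, AR 72202-3500"))  # Little Rock, AR
print(normalize_location("Seaattle, WA"))                # Seattle, WA
```

The function takes a single value, so it could be applied to a column in the same way as `locationChk`.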


Lastly, I export the result as an Excel spreadsheet:

In [24]:
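flag.to_excel("demoRes.xlsx",index=False) # Overwrite demoRes.xlsx with the cleaned locations; index=False leaves out the row index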