# Exploring Delays on Caltrain
by Özge Terzioğlu (ozterz@stanford.edu | www.github.com/ozterz)
1 March 2023

Caltrain provides commuters from San Francisco to San Jose and Gilroy with commuter rail service. Many Bay Areans rely on Caltrain to get to work, the airport, sporting events, and more. Caltrain states on its website that its vision is to: "Provide a safe, reliable, sustainable modern rail system that meets the growing mobility needs of the San Francisco Bay Area region." 

However, students who rely on Caltrain to get to San Francisco, and commuters alike, often criticize Caltrain for being consistently late and therefore unreliable.

Caltrain has been tweeting train delays in real time since 2012, first from its @Caltrain account and, as of October 1, 2020, from the @CaltrainAlerts account. As of September 2022 the delay tweets are "hybrid": most of them are automated, but when a human is available they tweet out the reason for the delay, since that context is not an automated feature.

Caltrain hopes to improve the reliability of its trains and delay tweets when its electrification project finishes in 2024. Newer trains will provide quicker, more reliable service with updated GPS features allowing tweets to be faster and more accurate for commuters. 

For this project, I collected all of the tweets from @Caltrain and @CaltrainAlerts using Twitter's API and the Python library tweepy (made specifically for calling Twitter's APIs).

This dataset spans from 2012 to early 2023. 

I collected the tweets in January in the last days of it being cost-free using an Academic Access Developer account (I had to apply for this but I was approved within 48 hours)... my luck must be good! 

## Let's pull tweets using Twitter's API (main.py)

### Importing our tools

In [None]:
import tweepy
import os
import csv

### Define our authentication token variable 

Disclaimer: I wrote this script while teaching myself, so I pasted my token directly into the code. A few weeks later I learned in class that I could store it in a local file and keep that file out of GitHub. Do NOT hard-code your token, for security reasons!

In [None]:
bearer_token = "INSERT_YOUR_TOKEN_HERE"

In [None]:
client = tweepy.Client(bearer_token = bearer_token, wait_on_rate_limit = True)
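One common way to keep the token out of the script is to read it from an environment variable. This is only a minimal sketch: the helper name load_bearer_token and the variable name TWITTER_BEARER_TOKEN are my own placeholders, not something Twitter or tweepy requires.

```python
import os


def load_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    """Return the token stored in an environment variable (the name is a placeholder)."""
    token = os.environ.get(env_var)
    if not token:
        # Fail early with a readable message instead of sending an empty token
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```

With this in place, the assignment above could become bearer_token = load_bearer_token(), and the token never appears in the file you push to GitHub.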

### Create a csv file where we'll store our collected tweets

The column headers were named by looking at the API documentation and seeing which data it collects.

In [None]:
with open("main_account_data.csv", mode = "w") as tweet_file:
    tweet_writer = csv.writer(tweet_file, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    tweet_writer.writerow(["created_at","id","text"])
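Tweets often contain commas and quotation marks, which is why the quoting settings matter. The sketch below uses the same writer settings on an in-memory buffer to show that such text survives a round trip. The sample row is made up for illustration, not a real tweet.

```python
import csv
import io

# Write to an in-memory buffer instead of a file so nothing on disk is touched
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(["created_at", "id", "text"])
# Made-up row whose text contains both a comma and double quotes
writer.writerow(["2020-01-01 12:00:00+00:00", 1, 'Train 102 is late, "about" 15 min'])

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back recovers the original text in a single field
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1][2])
```

QUOTE_MINIMAL only wraps a field in quotes when it has to, so the file stays readable while the text column stays in one piece.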

Now, let's use a for loop to go through all the tweets from the @Caltrain account, from its creation until the date it stopped tweeting delays. We open the file "main_account_data.csv" that we created above in *append* mode, and the nested for loop writes each tweet there as a new row.

In [None]:
# Open the file once in append mode rather than reopening it for every tweet
with open("main_account_data.csv", mode = "a") as tweet_file:
    tweet_writer = csv.writer(tweet_file, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    for item in tweepy.Paginator(client.search_all_tweets, query = 'from:Caltrain', max_results = 500, start_time = "2006-03-21T00:00:00Z", end_time = "2020-10-01T00:00:00Z", tweet_fields = ["created_at","id","text"]):
        # item.data is None when a page comes back empty, so fall back to an empty list
        for tweets in item.data or []:
            tweet_writer.writerow([tweets["created_at"], tweets["id"], tweets["text"]])

To pull the tweets from the @CaltrainAlerts account, I used the same script, but replaced query = 'from:Caltrain' with query = 'from:CaltrainAlerts'. I also created and opened a new file called "alert_acct_tweets.csv" to save this account's tweets to, keeping each account's data separate so I could check for errors before merging the two files.
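Instead of copying the script for each account, the writing step could be wrapped in a function. The sketch below is one possible shape, not what I actually ran: save_pages is a hypothetical helper that accepts any iterable of pages with a .data attribute, which is how tweepy's Paginator pages are used in the loop above.

```python
import csv


def save_pages(pages, path):
    """Append every tweet from an iterable of response pages to a CSV file.

    Returns the number of rows written. `pages` is assumed to yield objects
    with a `.data` attribute holding a list of tweets (or None when empty).
    """
    count = 0
    with open(path, mode="a", newline="", encoding="utf-8") as tweet_file:
        writer = csv.writer(tweet_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for page in pages:
            for tweet in page.data or []:
                writer.writerow([tweet["created_at"], tweet["id"], tweet["text"]])
                count += 1
    return count
```

Each account would then need only one call, passing its own Paginator and its own file name ("main_account_data.csv" or "alert_acct_tweets.csv").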

## Time to filter tweets for delays (filter.py)

Since the main @Caltrain account tweeted about both delays and other Caltrain-related notifications from 2012 to 2020, we need to filter only for delays. To decide what to filter for, I scanned the collected tweets looking for consistent formatting. I filtered for words pertaining to delays, like "late", "waiting", "terminated", and "single tracking", as well as words pertaining to incidents, like "incident" or "emergency".

### Importing our tools 

In [1]:
import csv
import os
import pandas as pd
import re
from pathlib import Path 

### Identify the path to the csv files holding the tweets on our local machine

In [None]:
cwd = Path(__file__).parent
localfile = Path.joinpath(cwd, "main_account_data.csv")
outfile = Path.joinpath(cwd, "filtered_tweets.csv")
alert_acct_data = Path.joinpath(cwd, "alert_acct_tweets.csv")

Outfile will be used to store the tweets containing notifications about delays only, hence the name "filtered_tweets."

### Combine the csv files holding tweets from @Caltrain and @CaltrainAlerts

Since both files have the same column names, we don't need to do anything to prepare for the merge.

In [None]:
mainacct_tweets = pd.read_csv(localfile)
alert_tweets = pd.read_csv(alert_acct_data)
mainacct_tweets = pd.concat([mainacct_tweets, alert_tweets])

### Use regular expressions to declare what you want to filter the tweets for

I want the regex to look for aforementioned key words pertaining to delays. 

In [None]:
# Raw string (r"...") so that \b is a regex word boundary rather than a backspace character
include = r"[0-9]+[\'\"]|[-][0-9]+|delay|\blate\b|\bwaiting\b|\bterminated\b|\bstopped\b|\bemergency\b|\bstruck\b|\bc[as]ncel+ed\b|\bbroke down\b|\bdisabled\b|\bfatality\b|\bkilled\b|\bslow\b|\bdown\b|\bsingle track\b|\bsingle tracking\b|\boff loading\b|\bdrop\b|\bdead\b|\bstop\b|\bstalled\b|\bmechanical issue\b|\bincident\b|\bholding\b|\brestriction\b|\bannulled\b|\bnot running\b|\bclosed\b"
exclude = tuple(["@", "RT"])
exclude_regex = r"\bBoard of Directors\b|\bPublic Session\b|\bzoom\b"
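One detail worth checking in patterns like these is whether they are raw strings. In a regular Python string "\b" means a backspace character, while in a raw string it reaches the regex engine as a word boundary. The short check below shows the difference on a made-up sentence.

```python
import re

text = "Train 102 is running late"  # made-up example, not a real tweet

plain_pattern = "\blate\b"   # "\b" is a backspace character here
raw_pattern = r"\blate\b"    # "\b" is a regex word boundary here

plain_match = re.search(plain_pattern, text)
raw_match = re.search(raw_pattern, text)

print(plain_match)            # None: the text contains no backspace characters
print(raw_match.group(0))     # late
```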

Set variables that filter the "text" column using the above regex variables. 

In [None]:
include_mask = mainacct_tweets["text"].str.contains(include)
exclude_mask = mainacct_tweets["text"].str.startswith(exclude)
exclude_words = mainacct_tweets["text"].str.contains(exclude_regex)

Store the filtered tweets to the outfile we created earlier. 

In [None]:
filtered_tweets = mainacct_tweets[(include_mask & ~exclude_mask) & ~exclude_words]
filtered_tweets.to_csv(outfile)
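To see how the three masks combine, here is the same logic applied to one string at a time with plain Python. It is a sketch under a few assumptions: INCLUDE is a shortened stand-in for the full include pattern, is_delay_tweet is a hypothetical helper name, and the sample strings are invented.

```python
import re

INCLUDE = r"delay|\blate\b|\bterminated\b"  # shortened stand-in for the full include pattern
EXCLUDE_PREFIXES = ("@", "RT")
EXCLUDE_WORDS = r"\bBoard of Directors\b|\bPublic Session\b|\bzoom\b"


def is_delay_tweet(text):
    """Mirror (include_mask & ~exclude_mask) & ~exclude_words for a single tweet."""
    if text.startswith(EXCLUDE_PREFIXES):   # replies and retweets
        return False
    if re.search(EXCLUDE_WORDS, text):      # meeting announcements
        return False
    return re.search(INCLUDE, text) is not None


samples = [
    "Train 102 is 15 min late",
    "@rider sorry for the delay",
    "RT Expect delays this evening",
    "Board of Directors meeting delayed to 10am",
    "Happy Friday!",
]
results = [is_delay_tweet(sample) for sample in samples]
print(results)
```

Only the first sample passes: the next two are dropped by the prefix check, the fourth by the excluded words, and the last has no delay keyword at all.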

## Separating the "created_at" date into two columns (time and date) (clean.py)

This step is necessary if we want to aggregate delays by time of day, by day, by year, etc. Datetime can be annoying, so hang in there.

In [3]:
import csv
import pandas as pd
import os
from pathlib import Path
from datetime import datetime
from pytz import timezone

As usual, let's read the local file with our data here.

In [None]:
cwd = Path(__file__).parent
localfile = Path.joinpath(cwd, "filtered_tweets.csv")
all_tweets = pd.read_csv(localfile)

Let's create a new column where we'll store the time.

In [None]:
all_tweets.insert(2, "time", None, allow_duplicates = True)

When we merged our two csv files together, the indexes from the previous files combined, so the numbers became meaningless. Let's get rid of that for the sake of my sanity.

In [None]:
del all_tweets["Unnamed: 0"]

Time to get down to business. First, we'll establish the timezone the tweet timestamps are in (UTC) and the timezone we want them displayed in (Pacific Time, stored in the variable PST). I know the timestamps are in UTC because I read the API documentation.

In [None]:
UTC = timezone("UTC")
PST = timezone("America/Los_Angeles")

Now we'll cycle through every row and do the following:
1) read in the created_at string into a datetime object so we can manipulate it.
2) Let the datetime object know that the current timezone is UTC.
3) Create a clone of the datetime object and convert that to PST.
4) Have the cloned object spit out two strings: one for the date in YYYY-MM-DD format and one for the time in HH:MM:SS format.

In [None]:
for tweet in all_tweets.index:
    # Parse the timestamp string, e.g. "2020-01-01 12:00:00+00:00"
    utc_time = datetime.strptime(all_tweets.at[tweet, "created_at"], "%Y-%m-%d %H:%M:%S+00:00")
    utc_time = utc_time.replace(tzinfo = UTC)
    pst_time = utc_time.astimezone(PST)
    date = pst_time.strftime("%Y-%m-%d")
    time = pst_time.strftime("%H:%M:%S")
    # Update the row inside the loop so every tweet is converted, not just the last one
    all_tweets.at[tweet, "created_at"] = date
    all_tweets.at[tweet, "time"] = time

Each row's columns are updated with the newly separated date and time inside the loop. Next we'll simply store the result in our csv file.

In [None]:
all_tweets.to_csv("filtered_tweets.csv")
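Here is the conversion for a single timestamp, which makes the four steps easier to check by hand. This sketch swaps in the standard-library zoneinfo module (Python 3.9+) in place of pytz, and split_created_at is a hypothetical helper name. The two timestamps are made up to show a winter and a summer case.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def split_created_at(created_at):
    """Convert a UTC timestamp string into (date, time) strings in Pacific Time."""
    # %z parses the "+00:00" offset, so the datetime already knows it is UTC
    utc_time = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S%z")
    local_time = utc_time.astimezone(PACIFIC)
    return local_time.strftime("%Y-%m-%d"), local_time.strftime("%H:%M:%S")


winter = split_created_at("2023-01-15 16:30:00+00:00")
summer = split_created_at("2022-07-01 01:15:00+00:00")
print(winter)  # ('2023-01-15', '08:30:00'), eight hours behind UTC
print(summer)  # ('2022-06-30', '18:15:00'), seven hours behind UTC, on the previous day
```

The summer case shows why the date has to be taken after the conversion: a tweet sent in the evening in California already carries the next day's date in UTC.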

## You thought we were done filtering the text? Think again. (filter_text.py)

At this point, the important information in each tweet is all packed into a single column: "text". It would be difficult to do a deep analysis without having the minutes delayed, the approaching station, route number, and reason for the delay in their own separate columns. We must head into the weeds using regular expressions (regex). Let's go!

### Importing our tools

In [None]:
import csv
import os
import pandas as pd
import re
from pathlib import Path

Define path for our data files from our local machine to Python. Establish an outfile as a home where the new dataframe we'll create in this script will live. 

In [None]:
cwd = Path(__file__).parent
filtered_tweets = Path.joinpath(cwd, "NEW_filtered_tweets.csv")
outfile = Path.joinpath(cwd, "analyze_tweets.csv")
start_file = pd.read_csv(filtered_tweets)

Define our regex variables aka what do we want to look for and eventually pull from the "text" column in our filtered_tweets.csv file? 

**Huge Disclaimer:** This step took a week or so. I meticulously scoured the csv file of about 10,000 rows looking for the different variations of how each component was written. Since the delay tweets were manually written (human created) for 10 years, there was little standardization of how delays were tweeted out. Inevitably, this dataset may not be entirely accurate or reflective of the actual delays. For one, I couldn't grab delays in the triple digits (when I looked through the csv file, there were few of them, but they are still crucial to have as data). I modified the regex variables a lot with the help of https://regexr.com/. Filtering text is a huge pain in the butt. If there's an easier way to go about this, I would love to learn more!

In [None]:
route_num = r"\bNo. [0-9]\b|\b#[0-9]\b|[0-9]{3}|#[A-Za-z]{2}[0-9]+\b|\b[A-Za-z]{2}[0-9]\b|\b[0-9]{3} -"
minutes_delayed = r"-?[0-9]+[\'\"]|[-][0-9]+|[0-9]{2}.|[0-9]{2,3} min"
delay_reason = r"\bdue to\b|\bbecause of\b|\bfollowing\b"
station = r"@?[A-Z][aA-zZ][A-Z]\b"
terminated = r"terminated\b|\bwill terminate\b|\bis terminating\b|\bcanceled\b|\bcancelled\b|\bannulled\b"
grab_reason = r"(?<=due to )(\w| )*(?:.)|(?<=because of )(\w| )*(?:.)|(?<=following )(\w| )*(?:.)"
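For the triple-digit delays mentioned in the disclaimer, one idea is to capture one to three digits only when a minute marker follows. This is a sketch of that idea rather than the pattern used in this project: MINUTES and extract_minutes are hypothetical names, the test strings are invented, and the pattern has not been run against the real dataset.

```python
import re

# One to three digits, not preceded by another digit, followed by a minute marker
MINUTES = re.compile(r"(?<!\d)(\d{1,3})\s*(?:minutes?\b|mins?\b|['\"])", flags=re.IGNORECASE)


def extract_minutes(text):
    """Return the first delay length found in the text as an int, or None."""
    match = MINUTES.search(text)
    return int(match.group(1)) if match else None


print(extract_minutes("Train 102 running 15 min late"))   # 15
print(extract_minutes("SB 221 is 120' late at Palo Alto"))  # 120
print(extract_minutes("Train 305 is holding"))             # None
```

Because a marker is required, three-digit route numbers such as 102 are skipped unless they happen to be followed by "min" or a quote mark.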

Now that's out of the way, let's open the csv file where we'll store these newly filtered key items in their own respective columns. In the for loop, we will:
1) Loop through our starting file, "filtered_tweets.csv", and use the regular expressions module to set a condition for each item of interest in the "text" column (route number, minutes delayed, delay reason, station, and if the train was terminated). Since the most consistent data point was the route number, we will pull each of our other items of interest in relation to the route number identified in the tweet to ensure that the information in each row has a relation. 
2) If the route number is identified in the "text" column via the regex variable we made for filtering (found_route), it will be added to found_route_substr. 
3) Then, we will look for the other items of interest associated to this route number. If the item of interest is not found in the string, we will set it equal to an empty string. We will do this for minutes, reason, and station.
4) We wanted a column indicating if a train was terminated or not. If the word terminated is found in the string, we will assign that found string to the terminated column. If it's not found, we will write "No" to the row. 
5) Lastly, we only want rows that carry some delay information along with the route number. So a row is written to our outfile for analysis (analyze_tweets.csv) only if we found at least one of the following: minutes delayed, a reason, or a terminated status. If all three are missing, the row is skipped.

In [None]:
with open("analyze_tweets.csv", mode = "w") as new_file:
    writer = csv.writer(new_file, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    writer.writerow(["created_at","time","route","station","delay_length","delay_reason","terminated_train"])

    for tweet in start_file.index:
        route_str = start_file.loc[tweet,"text"]
        if not re.search(route_num, route_str):
            continue

        # Every route number in the tweet, plus the text that follows each one
        found_route = re.findall(route_num, route_str)
        found_route_substr = re.split(route_num, route_str)[1:]

        for index, substr in enumerate(found_route_substr):
            # Minutes delayed: keep the first match; print the text when there are several
            all_mins = re.findall(minutes_delayed, substr)
            found_mins = all_mins[0] if all_mins else ""
            if len(all_mins) > 1:
                print(substr)
                print(all_mins)

            # Delay reason: keep the last phrase matched by grab_reason
            found_reason = ""
            if re.search(delay_reason, substr, flags = re.IGNORECASE):
                for search in re.finditer(grab_reason, substr, flags = re.IGNORECASE):
                    found_reason = search.group(0)

            # Station: first match, or an empty string
            all_stations = re.findall(station, substr)
            found_station = all_stations[0] if all_stations else ""

            # Terminated: the matched phrase, or "No"
            all_terms = re.findall(terminated, substr)
            found_term = all_terms[0] if all_terms else "No"

            if found_mins != "" or found_reason != "" or found_term != "No":
                writer.writerow([start_file.loc[tweet,"created_at"],start_file.loc[tweet,"time"],found_route[index],found_station,found_mins,found_reason,found_term])
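The pairing trick in the loop above is easier to see on one string. re.findall returns the route numbers and re.split returns the text between them, so dropping the first piece lines each route up with the text that follows it. In this sketch ROUTE and REASON are simplified stand-ins for route_num and grab_reason, pair_routes_with_text is a hypothetical helper name, and the sample tweet is made up.

```python
import re

ROUTE = r"[0-9]{3}"                      # simplified stand-in for route_num
REASON = r"(?<=due to )(\w| )*(?:.)"     # first alternative of grab_reason


def pair_routes_with_text(text):
    """Pair each route number with the text between it and the next route number."""
    routes = re.findall(ROUTE, text)
    segments = re.split(ROUTE, text)[1:]  # drop whatever came before the first route
    return list(zip(routes, segments))


sample = "Train 102 is 15 min late due to mechanical issue. Train 221 is 10 min late."
pairs = pair_routes_with_text(sample)

for route, segment in pairs:
    reason = re.search(REASON, segment)
    print(route, "|", reason.group(0) if reason else "")
```

The first pair yields the reason "mechanical issue." (trailing period included, which is why a cleanup step is still needed), and the second pair yields no reason at all.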

## Are we there yet? No. Since we pulled phrases from long strings, it's bound to be messy. Grab a mop and let's get to work. (format_data.py)

### Grabbing our cleaning supplies:

In [5]:
import csv
import os
import pandas as pd 
import re
from pathlib import Path

Creating a path from our local machine to Python for our files to travel here on. We've created quite the path of files up to this point. Let's write whatever manipulated data we'll end up creating to the same csv (analyze_tweets.csv) because we're just cleaning up the formatting.

In [None]:
cwd = Path(__file__).parent
start_file = Path.joinpath(cwd, "analyze_tweets.csv")
tweets = pd.read_csv(start_file)

Let's create new columns to store our cleaned up data in.

In [None]:
if "clean_route" not in tweets.columns:
    tweets.insert(3, "clean_route", None, allow_duplicates = True)
    tweets.insert(5, "clean_delay", None, allow_duplicates = True)
    tweets.insert(7, "clean_terminated", None, allow_duplicates = True)

We're starting with the train route. If the route has extraneous characters like the hashtag, we want to remove them for ease of analysis. When I scanned the data in the beginning steps, it looked like route numbers were either just numbers or started with a hashtag.

In [None]:
for index, search in enumerate(tweets["route"]):
    flag = r"#"
    clean_route = re.sub(flag, "", search)
    tweets.at[index, "clean_route"] = clean_route

Next, we'll move on to the length of delay (which was recorded in minutes). When scanning the dataset, I found that oftentimes there would be random upper or lowercase letters attached to the beginning or end of the minutes, so let's remove them so we can treat the clean_delay column like floating point numbers.

In [None]:
for index, search in enumerate(tweets["delay_length"]):
    # \D matches anything that is not a digit, which already covers letters
    non_digits = r"\D"
    # Empty cells are read in as float NaN, so convert them to a string first
    if isinstance(search, float):
        search = str(search)
    clean_mins = re.sub(non_digits, "", search)
    tweets.at[index, "clean_delay"] = clean_mins
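To actually treat the cleaned values as floating point numbers, the digits still have to be converted, and empty results need a fallback. The sketch below shows one way to do that for a single value. clean_delay here is a hypothetical helper, and the inputs are invented examples of the formats described above.

```python
import re


def clean_delay(raw):
    """Strip every non-digit character and return the delay as a float, or None."""
    digits = re.sub(r"\D", "", str(raw))
    return float(digits) if digits else None


print(clean_delay("15'"))          # 15.0
print(clean_delay("-10"))          # 10.0, the minus sign is dropped with the other non-digits
print(clean_delay("20 min"))       # 20.0
print(clean_delay(float("nan")))   # None, because "nan" contains no digits
```

One caveat: because the decimal point is also a non-digit, a value like 12.5 would come out as 125.0, so this only suits whole-minute delays.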

A terminated train was denoted in many different ways (terminated, will terminate, is terminating, annulled, canceled, or cancelled). Let's just replace all of them with "Yes" for simplicity's sake.

In [None]:
for index, search in enumerate(tweets["terminated_train"]):
    # Cover every phrase that the terminated regex in filter_text.py can capture
    flag_2 = r"terminated|will terminate|is terminating|annulled|canceled|cancelled"
    clean_term = re.sub(flag_2, "Yes", search)
    tweets.at[index, "clean_terminated"] = clean_term

Lastly, we'll send these freshly cleaned up tweets back to our "analyze_tweets.csv" for analysis!

In [1]:
tweets.to_csv(start_file)
