# SE-2200E Notebook 1: Data Collection

Ningsong Shen

December 14, 2020

## Summary

In this notebook, I describe the process by which we collect bus arrival time data for use in the project. The approach relies on simple scripts, a virtual machine, and one-click pipeline processing. Most of the work was done with scripts, so this notebook is mainly descriptive, with some sample code.

## The Problem and Task Description

To predict transit times and improve schedules, we need to know what current arrival times look like. London Transit has a great [open data portal](http://www.londontransit.ca/open-data/) where they provide real-time updates on bus arrivals and a repository of schedule information. Unfortunately, apart from specific timepoints, historical arrival times are not tracked (confirmed via email), so it is necessary to collect our own data.

This realization prompted me to speed up development of a collection framework, because at least a few months of data would be needed for an analysis of sufficient quality. The framework would have to accomplish the following:
- ping the endpoint frequently for up-to-date information
- download the data and clean it up
- store the relevant information in an easy-to-access way

I decided to use Python to accomplish this task given its ease of use and extensive data science libraries available.

## Understanding Data Formats and API Usage

The common data format used to provide transit information is called the General Transit Feed Specification (GTFS). The static schedule feed is a set of comma-separated text files, while the live feed (GTFS Realtime) is defined with Protocol Buffers and, once decoded, has a nested structure similar in form to JSON. Together they contain a variety of information about a transit system's operations. The standard allows third-party applications to use this information and provide live updates to users. Similarly, we can use this format to collect the data that is needed. Certain terms are important, as defined below (or linked).

Below is an example of the scheduled data:
```
trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,timepoint
1342560,6:13:00,6:13:00,KIPPADEL,1,,0,1,1
```
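As a minimal sketch of how a row like this could be read, the snippet below parses the sample with Python's `csv` module and converts the arrival time to seconds. The helper names `gtfs_time_to_seconds` and `read_stop_times` are illustrative, not part of the project's scripts. Counting seconds rather than building a clock time is deliberate, since GTFS allows times past `24:00:00` for trips that run beyond midnight of the service day.

```python
import csv
import io

SAMPLE = """trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,timepoint
1342560,6:13:00,6:13:00,KIPPADEL,1,,0,1,1
"""


def gtfs_time_to_seconds(value):
    """Convert a GTFS H:MM:SS string to seconds after the start of the service day."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def read_stop_times(text):
    """Parse stop_times rows into dicts, adding the arrival time in seconds."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["arrival_seconds"] = gtfs_time_to_seconds(row["arrival_time"])
        rows.append(row)
    return rows


rows = read_stop_times(SAMPLE)
print(rows[0]["stop_id"], rows[0]["arrival_seconds"])  # KIPPADEL 22380
```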

An example of the live data can be found here: https://developers.google.com/transit/gtfs-realtime/examples/trip-updates-full


The live data is the type of information that we'd be pulling from the London Transit API. It is noteworthy that the timestamps being provided are estimates of the arrival times for the next five bus stops for each trip. Each time a bus sends an update, it is delivered to the API, which sends out updated information. Once the bus passes a stop, the time for that stop no longer appears. Since we only want the latest estimate for each stop (i.e. the one closest to when the bus actually arrives), we will have to filter out the earlier estimates later. In the meantime, we will pull everything to ensure that we don't miss anything.
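The filtering step described above could look roughly like the following. This is a sketch under assumptions: it supposes each downloaded snapshot has already been flattened into records, and the field names `fetched_at` and `arrival_estimate` as well as the stop `STOP_B` are hypothetical. For each trip and stop, only the estimate from the most recent snapshot is kept.

```python
def latest_estimates(records):
    """Keep, for each (trip_id, stop_id), the estimate from the most recent snapshot."""
    latest = {}
    for rec in records:
        key = (rec["trip_id"], rec["stop_id"])
        # A later snapshot replaces any earlier estimate for the same trip and stop.
        if key not in latest or rec["fetched_at"] >= latest[key]["fetched_at"]:
            latest[key] = rec
    return latest


# Hypothetical flattened records; times are seconds for illustration only.
records = [
    {"trip_id": "1342560", "stop_id": "KIPPADEL", "fetched_at": 100, "arrival_estimate": 400},
    {"trip_id": "1342560", "stop_id": "KIPPADEL", "fetched_at": 160, "arrival_estimate": 385},
    {"trip_id": "1342560", "stop_id": "STOP_B", "fetched_at": 160, "arrival_estimate": 520},
]

result = latest_estimates(records)
print(result[("1342560", "KIPPADEL")]["arrival_estimate"])  # 385
```

Because a stop disappears from the feed once the bus passes it, the last estimate seen is only an approximation of the true arrival time, with an error bounded roughly by the polling interval.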

## Manual Experimentation and Data Transformation

The first steps were to test out the API myself. I pulled data from the website and did some inspection on the format, volume, and potential processing methods. I also looked at the different information provided and decided on the ones that I wanted to keep.

Later, I set up a simple script to test data collection on my local machine. This allowed me to see that the processes were working. 

## Automation with a Virtual Machine

After simple testing succeeded on a local machine, I moved to a deployment on a remote machine so that the data could be reliably pulled 24 hours a day, 7 days a week. The AWS virtual machine that I used had a generous vCPU allocation and limited storage, which suited my collection purpose. After a bit of tinkering, updating, and upgrading, I could run my Python scripts on the machine.

To ensure that the data was being pulled consistently, I set up a cron job that would run my scripts. I had considered alternatives like a service or a background process, but cron was the best fit given the repetitive, job-like nature of checking the API. To determine the frequency of the checking, I had to balance precision with storage constraints. More frequent checking meant a more precise bus time estimate, but also more space used. I settled on checking every minute, which is good enough for bus tracking since the updates are not extremely precise themselves.
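A collector suited to being launched by cron can be very small, since cron handles the scheduling and each run only needs to save one snapshot. The sketch below is illustrative rather than the actual script: the crontab line, the file naming scheme, and the `fetch` parameter (a stand-in for whatever function downloads the feed) are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical crontab entry that runs the collector once a minute:
# * * * * * /usr/bin/python3 /home/user/collect.py


def snapshot_path(out_dir, now):
    """Build a unique, sortable file name from the UTC fetch time."""
    return Path(out_dir) / now.strftime("trip_updates_%Y%m%dT%H%M%SZ.txt")


def save_snapshot(fetch, out_dir, now=None):
    """Call fetch() for the raw feed bytes and write them to a timestamped file."""
    now = now or datetime.now(timezone.utc)
    path = snapshot_path(out_dir, now)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return path


def snapshots_per_day(interval_seconds):
    """Number of files produced per day at a given polling interval."""
    return 86400 // interval_seconds


print(snapshots_per_day(60))  # 1440
```

At one request per minute this produces 1440 files per day, which makes the precision versus storage trade-off concrete: halving the interval doubles the number of files.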

## Data Storage and Transfer to Local Machine

The AWS virtual machine that was used had very limited persistent storage. Moreover, the persistent storage that was available did not survive if the machine was inadvertently suspended. Adding a third-party storage or database solution would have been quite complex and would have added to the cost of the project. The solution, then, was to download the files to my local machine, which I could then sync with the university's provided OneDrive solution.

The method was to use SFTP to transfer the files to my local machine on a weekly basis. Although this was quite fast as a one-time operation, it soon became apparent that the list of commands to run was quite long and would be tedious to repeat over the long term. I wrote an automated script that would perform the entire download process, including connecting, moving to the right remote folder, downloading, deleting duplicate files, and logging details. It was a bit more challenging than initially imagined, as there were many consistency issues with deleting existing files from the remote machine (limited storage), maintaining exactly one copy of each file locally, and accounting for any potential disruptions while downloading with limited bandwidth. This became a one-click process that would take care of the download so that I wouldn't have to. I just needed to remember to click it each week.
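The consistency rules described above (one local copy per file, and remote deletion only after a successful download) can be sketched as follows. This is not the actual transfer script: to keep the example self-contained, two ordinary directories stand in for the remote and local ends of the SFTP connection, and the function names are illustrative. The ordering is the point: copy to a temporary name, rename, verify, and only then delete the source.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large text files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_folder(remote_dir, local_dir):
    """Copy each file once, verify it, and only then delete the source copy."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    transferred = []
    for source in sorted(Path(remote_dir).iterdir()):
        if not source.is_file():
            continue
        target = local_dir / source.name
        if not target.exists() or sha256_of(target) != sha256_of(source):
            # Write to a temporary name first so an interrupted transfer
            # never leaves a partial file under the final name.
            partial = target.with_name(target.name + ".part")
            shutil.copyfile(source, partial)
            partial.replace(target)
            transferred.append(source.name)
        if sha256_of(target) == sha256_of(source):
            source.unlink()  # free space on the source only after verification
    return transferred
```

Running `sync_folder` twice is safe: a file that already exists locally with identical content is not copied again, so an interrupted run can simply be restarted.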

Note that the large text files were kept in their entirety, even though much of the data was redundant. A system to process the data will be designed later, so the raw files must be retained for now, which accounts for the additional local storage cost.

## Larger Scale Processing and Automation

As more files were collected, the need for more organization was apparent and I developed more robust tools to handle the data.

The existing patchwork of scripts had worked correctly, but it was a chore to get it all done each time. A fair amount of concentration was required to ensure that the scripts were run in the correct order, that nothing had malfunctioned, that the network and the connection with the VM were up, and that the correct files were transferred and processed. Most taxing of all was ensuring that only one copy of each file remained and was securely stored.

The solution to this was a master script that ran all necessary parts in the correct order. This involved a bit of shell scripting within Python and other operating system instructions, but it became possible to do all the processing needed in one click. However, another problem remained: the scripting was hard to read and would be difficult to modify in the future if I ever needed to. So I spent more time modularizing the code and organizing it into logical components so that future me would be thankful!
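One way such a master script can be structured is shown below, as a sketch rather than the actual implementation. The step names are hypothetical, and each step here is a trivial inline command standing in for a real script. Steps run strictly in order, and the run stops at the first failure so that later stages never operate on incomplete data.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each (name, command) step in order and stop at the first failure."""
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            raise RuntimeError(f"step '{name}' failed with exit code {result.returncode}")
        completed.append(name)
    return completed


# Hypothetical stages; in practice each command would launch a separate script,
# e.g. [sys.executable, "download.py"].
steps = [
    ("download", [sys.executable, "-c", "print('downloading')"]),
    ("deduplicate", [sys.executable, "-c", "print('deduplicating')"]),
]

print(run_pipeline(steps))
```

Keeping the steps in a plain list also helps with the readability concern: adding, removing, or reordering a stage is a one-line change.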

Finally, it was satisfying to see the single source of truth on my local computer. All the files were neatly organized, backed up, and ready to use. And, it was good to know that I could start collecting more data at any time, in one click.

## Data Collection Timeline and Diminishing Returns

The collection period of approximately 2 months is adequate given the limited computing resources and the smaller scale of prediction that will occur in the project. Although more data is better, there are diminishing returns, given that patterns of bus arrival times do not deviate significantly from one period to the next. For more substantial production systems, the current system could be deployed as a persistent, reliable service that runs at all hours.