# An Analysis of CitiBikes in New York City

## Introduction

This report will analyze a data set on the CitiBikes in New York City from 2013 until the end of 2016. The three main questions that will be answered here are:
  * What is the distribution of the lengths of time spent per trip?
  * What is the most popular route in the system?
  * Were any new stations added throughout the time period?

To answer these questions, I'll use BigQuery on Google Cloud Platform to pull the specific attributes I need, and Python to handle and plot the extracted data.

The data set I'll be using is one of BigQuery's public datasets, and can be found at http://www.tiny.cc/nyccitibikesgoogle. It contains 16 attributes and over 30 million entries. The four attributes I'll need to answer the questions above are <code>tripduration</code>, <code>start_station</code>, <code>end_station</code> and <code>starttime</code>.


## Distribution

### SQL

The only attribute needed for analyzing the distribution of the amount of time spent per trip is the <code>tripduration</code> column. This column holds an integer: the total number of seconds each trip took. To simplify the data without losing too much of the information hidden within it, I grouped the entries into intervals of 15, 30 and 60 seconds. An example of the SQL query used to do this, here for the 15-second intervals, is shown below:
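```
    SELECT
      -- index of the 15-second interval the trip falls into
      (tripduration - MOD(tripduration, 15)) / 15 AS qmin_interval,
      COUNT(*) AS num_trips
    FROM `bigquery-public-data.new_york.citibike_trips`
    -- one output row per interval
    GROUP BY
      qmin_interval
    ORDER BY
      qmin_interval ASC
    LIMIT
      15000
```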
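
To make the grouping concrete, here is a minimal Python sketch of the same arithmetic on a handful of made-up trip durations. The function name `bin_durations` and the sample values are illustrative assumptions, not part of the original analysis; for non-negative integers, `d // 15` gives the same interval index as `(d - MOD(d, 15)) / 15` in the query.

```python
from collections import Counter


def bin_durations(durations, width=15):
    """Count trips per interval of `width` seconds.

    The interval index is duration // width, which equals
    (duration - duration % width) / width for non-negative integers.
    """
    counts = Counter(d // width for d in durations)
    # Sort by interval index, mirroring ORDER BY qmin_interval ASC
    return sorted(counts.items())


# Hypothetical trip durations in seconds (not taken from the real data set)
sample_durations = [61, 74, 75, 89, 90, 300, 314]
binned = bin_durations(sample_durations)
print(binned)
```

Passing `width=30` or `width=60` reproduces the other two interval sizes mentioned above.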

On the Google Cloud Platform, data can easily be visualized by exporting the results from your query to Google Data Studio. A comma-separated values file can also be exported if you want to plot your own graphs using other languages like Python or R. I chose to do the latter, using Python to help find the curve that best fit my data.


### Python

There were a few steps in getting from the .csv file to a fitted distribution plot. The first step was to read the data into the program. This is done with help from the <code>pandas</code> function <code>read_csv(filename)</code>. I also defined two new variables, ```qmin_interval``` and ```num_trips```, one for each column of the file.

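```
    import pandas

    # Load the query results exported from BigQuery
    trip_data = pandas.read_csv("trip_per_quarter_minute.csv")
    qmin_interval = trip_data["qmin_interval"]  # interval index
    num_trips = trip_data["num_trips"]          # trips counted in each interval
```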

Since I wanted to find a probability distribution that matched the data, I needed to convert the counts in the ```num_trips``` column to proportions of the total. This was done by simply dividing the ```num_trips``` array by the sum of the same column.

```
prob_trips = num_trips/sum(num_trips)
```
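
As a quick sanity check on this step, the sketch below normalizes a small, made-up list of counts and confirms that the resulting proportions sum to one. The helper name `normalize` and the sample counts are assumptions for illustration only.

```python
def normalize(counts):
    """Turn raw counts into proportions of the total."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical per-interval trip counts (not taken from the real data set)
sample_counts = [2, 2, 1, 5]
sample_probs = normalize(sample_counts)

print(sample_probs)
print(sum(sample_probs))
```

Because each row covers exactly one unit on the ```qmin_interval``` axis, these proportions can be compared directly with a probability density evaluated at the interval index. If the x-axis were in seconds instead, the proportions would also need to be divided by the interval width.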

The next step was to plot the data and see what trends showed up. After some trial and error, I concluded that the data followed a *log-normal distribution* with σ ≈ 0.75. The plot of ```scipy.stats.lognorm.pdf(x, s, loc, scale)``` on top of the data from my .csv file is shown below.

![Log-normal fit plotted on top of the trip-duration data]()
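
The trial-and-error search for σ can also be automated. The following is a minimal sketch, not the procedure actually used for this report: it writes out the log-normal density with only the standard library (it evaluates the same density as `scipy.stats.lognorm.pdf(x, s, loc=0, scale=scale)`) and picks the σ from a grid of candidates that minimizes the sum of squared errors. The data here is synthetic, and the `scale` value and the candidate grid are hypothetical.

```python
import math


def lognorm_pdf(x, sigma, scale):
    """Log-normal density with shape `sigma`, zero location and the given scale."""
    if x <= 0:
        return 0.0
    z = math.log(x / scale) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


def fit_sigma(xs, probs, scale, candidates):
    """Return the candidate sigma with the smallest sum of squared errors."""
    def sse(sigma):
        return sum((lognorm_pdf(x, sigma, scale) - p) ** 2
                   for x, p in zip(xs, probs))
    return min(candidates, key=sse)


# Synthetic stand-in for the binned data: an exact log-normal curve.
# Both values below are assumptions chosen for the demonstration.
true_sigma, scale = 0.75, 40.0
xs = list(range(1, 401))
probs = [lognorm_pdf(x, true_sigma, scale) for x in xs]

# Candidate shapes from 0.25 to 1.75 in steps of 0.05
candidates = [0.25 + 0.05 * i for i in range(31)]
best_sigma = fit_sigma(xs, probs, scale, candidates)
print(best_sigma)
```

With real data, `xs` and `probs` would be the ```qmin_interval``` and ```prob_trips``` columns, and the scale would need to be searched over as well rather than fixed in advance.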






