## Setup
- Create a virtual development environment to keep the Python version and packages manageable.  
- Import public datasets from BigQuery.

### Miniconda
Download the Miniconda installer from the official [Anaconda](https://www.anaconda.com/download) site.   
Use the Miniconda command prompt to create and activate a new environment, specifying the Python version and the packages to install. 

**Name the environment and choose a Python version.**  
`conda create -n my_env python=3.11`      
**List all created environments.**   
`conda env list`  
**Activate an environment by name.**  
`conda activate my_env`  
**Install essential packages for data analysis.**
```bash
conda install -c conda-forge \
  notebook \
  pandas \
  numpy \
  matplotlib \
  seaborn \
  scipy \
  scikit-learn
```
- Jupyter Notebook: `notebook`
- Data processing and numeric operations: `pandas`, `numpy`
- Visualization: `matplotlib`, `seaborn`
- Statistics: `scipy`
- Machine learning: `scikit-learn`
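
As a quick sanity check, the installation can be verified from Python. This is a minimal sketch rather than part of the original setup: the helper name `missing_packages` is hypothetical, and note that `scikit-learn` is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names of the packages installed above.
# scikit-learn is imported under the name "sklearn".
REQUIRED = ["notebook", "pandas", "numpy", "matplotlib", "seaborn", "scipy", "sklearn"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

print("Missing packages:", missing_packages(REQUIRED) or "none")
```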

**List the packages explicitly installed in the current environment (dependencies not included).**  
`conda env export --from-history`

**Add the environment as a Jupyter kernel**  
Install the IPython kernel package and register your environment as a Jupyter Notebook kernel.
```bash
conda install ipykernel
python -m ipykernel install --user --name my_env --display-name "Python (my_env)"
```
**Launch Jupyter Notebook.**  
`jupyter notebook`  
Select the desired kernel in the Jupyter Notebook interface.
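
To confirm that the notebook is running on the intended kernel, one option is to print the interpreter path from a cell. This is a minimal sketch; the expectation that the path points inside `my_env` assumes the kernel was registered from that environment as described above.

```python
import sys

# Path of the Python interpreter backing the current kernel.
# For the kernel registered above, it is expected to point inside the my_env environment.
print(sys.executable)

# Python version of the kernel, to compare against the version chosen at creation time.
print(sys.version_info[:3])
```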

In [4]:
conda env list


# conda environments:
#
base                   /opt/miniconda3
my_env               * /opt/miniconda3/envs/my_env


Note: you may need to restart the kernel to use updated packages.


### Connect to BigQuery
Install the gcloud CLI system-wide by following the [official guide](https://cloud.google.com/sdk/docs/install), then set up a project to authorize the use of public datasets from BigQuery.  
Install the `bigframes` package and use it to import the [New York Citi Bike dataset](https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-citi-bike?hl=en&inv=1&invt=AbxWgg) from BigQuery.  

**Create a new project in gcloud.**  
`gcloud projects create data-ana-0`  
**List all available projects.**  
`gcloud projects list`  
**Initialize a project.**  
`gcloud init`  
Answer the prompts in the CLI to finish the setup.

**Install `bigframes` via conda.**  
`conda install conda-forge::bigframes`  
Restart the kernel after the installation.

In [4]:
import bigframes.pandas as bpd

PROJECT_ID = "data-ana-0"
bpd.options.bigquery.project = PROJECT_ID

sql_query = """
SELECT * 
FROM `bigquery-public-data.new_york.citibike_trips`
LIMIT 1000
"""

df = bpd.read_gbq(sql_query)
df.head()

| | tripduration | starttime | stoptime | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bikeid | usertype | birth_year | gender |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 405 | 2016-08-04 08:07:33+00:00 | 2016-08-04 08:14:18+00:00 | 520 | W 52 St & 5 Ave | 40.759923 | -73.976485 | 305 | E 58 St & 3 Ave | 40.760958 | -73.967245 | 17789 | Subscriber | 1973 | male |
| 1 | 1018 | 2013-07-11 14:14:05+00:00 | 2013-07-11 14:31:03+00:00 | 520 | W 52 St & 5 Ave | 40.759923 | -73.976485 | 442 | W 27 St & 7 Ave | 40.746647 | -73.993915 | 19159 | Subscriber | 1967 | female |
| 2 | 2080 | 2016-09-09 17:35:27+00:00 | 2016-09-09 18:10:08+00:00 | 520 | W 52 St & 5 Ave | 40.759923 | -73.976485 | 415 | Pearl St & Hanover Square | 40.704718 | -74.00926 | 22270 | Subscriber | 1988 | male |
| 3 | 659 | 2013-10-28 16:44:28+00:00 | 2013-10-28 16:55:27+00:00 | 520 | W 52 St & 5 Ave | 40.759923 | -73.976485 | 285 | Broadway & E 14 St | 40.734546 | -73.990741 | 15775 | Subscriber | 1995 | female |
| 4 | 718 | 2016-08-07 17:46:03+00:00 | 2016-08-07 17:58:01+00:00 | 520 | W 52 St & 5 Ave | 40.759923 | -73.976485 | 492 | W 33 St & 7 Ave | 40.7502 | -73.990931 | 20344 | Subscriber | 1966 | female |


In [11]:
sql_query = """
SELECT DISTINCT end_station_name
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
WHERE end_station_id = 520
ORDER BY end_station_name
"""
df = bpd.read_gbq(sql_query)
df

| | end_station_name |
|---|---|
| 0 | W 52 St & 5 Ave |


By comparing "start_station_id"/"start_station_name" with "end_station_id"/"end_station_name", we can conclude that the IDs of the start stations and the end stations come from the same index.
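
The same comparison can be sketched in plain Python: collect the names seen for each ID on the start side and on the end side, then look for IDs that map to different names. The sample pairs below are taken from the two query results above; the helper `conflicting_ids` is a hypothetical name, and a check on the full table would need the complete set of ID/name pairs.

```python
from collections import defaultdict

# (station_id, station_name) pairs taken from the query results above.
start_pairs = [(520, "W 52 St & 5 Ave")]
end_pairs = [
    (305, "E 58 St & 3 Ave"),
    (442, "W 27 St & 7 Ave"),
    (520, "W 52 St & 5 Ave"),
]

def conflicting_ids(start_pairs, end_pairs):
    """Return the IDs that appear on both sides with different names."""
    names = defaultdict(lambda: {"start": set(), "end": set()})
    for station_id, name in start_pairs:
        names[station_id]["start"].add(name)
    for station_id, name in end_pairs:
        names[station_id]["end"].add(name)
    return sorted(
        station_id
        for station_id, sides in names.items()
        if sides["start"] and sides["end"] and sides["start"] != sides["end"]
    )

# An empty list means no conflicting IDs were found in the sample.
print(conflicting_ids(start_pairs, end_pairs))
```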

## Exploratory Data Analysis
By examining the "Details" tab of the `citibike_trips` table in the "new_york_citibike" dataset in BigQuery Studio, I found that the table:  
- has 58,937,715 rows in total and takes 7.47 GB of storage.

The "Schema" tab, together with the first few rows returned above, shows that the table has:  
- 16 columns.

Each row represents one Citi Bike trip, including:  
- start/end time,  
- start/end location,  
- bike ID,  
- user information.

### Span of time of the entries

In [7]:
sql_query = """
SELECT
  MIN(starttime) AS earliest_trip,
  MAX(starttime) AS latest_trip
FROM
  `bigquery-public-data.new_york_citibike.citibike_trips`
"""
df = bpd.read_gbq(sql_query)
df

| | earliest_trip | latest_trip |
|---|---|---|
| 0 | 2013-07-01 00:00:00 | 2018-05-31 23:59:59.606000 |


Taking the min/max of the `starttime` column shows that the entries in the table run from July 2013 to May 2018, roughly **5 years**.  
On average, that is **11,787,543 (roughly 12 million) trips recorded every year** and **982,295 (roughly 1 million) trips recorded each month**.
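
The averages follow from dividing the total row count by the rounded time span. A short sketch of the arithmetic, assuming a span of exactly 5 years (60 months) as in the estimate above:

```python
TOTAL_TRIPS = 58_937_715   # row count from the "Details" tab
YEARS = 5                  # July 2013 to May 2018, rounded to 5 years
MONTHS = YEARS * 12

# Integer division, since a fraction of a trip is not meaningful here.
trips_per_year = TOTAL_TRIPS // YEARS
trips_per_month = TOTAL_TRIPS // MONTHS

print(f"{trips_per_year:,} trips per year")    # 11,787,543 trips per year
print(f"{trips_per_month:,} trips per month")  # 982,295 trips per month
```

Since the data actually covers 59 months rather than 60, the true monthly average is slightly higher than this estimate.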

### List of stations
*) The DataFrames returned by `bigframes` did not display their rows in the order requested by the SQL query, so in the following sections I run the SQL queries in the BigQuery UI and save the results locally for further exploration.

**SQL query to get a full list of all unique Citi Bike stations**  
```sql
SELECT
  station_id,
  station_name,
  station_latitude,
  station_longitude
FROM (
  SELECT
    start_station_id AS station_id,
    start_station_name AS station_name,
    start_station_latitude AS station_latitude,
    start_station_longitude AS station_longitude
  FROM `bigquery-public-data.new_york_citibike.citibike_trips`
  WHERE start_station_id IS NOT NULL

  UNION DISTINCT

  SELECT
    end_station_id AS station_id,
    end_station_name AS station_name,
    end_station_latitude AS station_latitude,
    end_station_longitude AS station_longitude
  FROM `bigquery-public-data.new_york_citibike.citibike_trips`
  WHERE end_station_id IS NOT NULL
)
ORDER BY station_id
```
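
Once the result of the query above has been saved locally from the BigQuery UI, it can be loaded for further exploration. The sketch below assumes a CSV export with the four columns selected in the query; the file name `stations.csv` and the helpers `load_stations` and `unique_station_ids` are hypothetical.

```python
import csv

def load_stations(path):
    """Read the exported station list into a list of dicts with typed values."""
    stations = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            stations.append({
                "station_id": int(row["station_id"]),
                "station_name": row["station_name"],
                "station_latitude": float(row["station_latitude"]),
                "station_longitude": float(row["station_longitude"]),
            })
    return stations

def unique_station_ids(stations):
    """Return the sorted distinct station IDs.

    An ID may appear on several rows if its name or coordinates
    vary between trips, because UNION DISTINCT compares all four columns.
    """
    return sorted({s["station_id"] for s in stations})
```

For example, `stations = load_stations("stations.csv")` followed by `len(unique_station_ids(stations))` gives the number of distinct station IDs in the export. `pandas.read_csv` would work just as well if a DataFrame is preferred.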

In [15]:
sql_query = """
SELECT DISTINCT start_station_id
FROM `bigquery-public-data.new_york_citibike.citibike_trips`
ORDER BY start_station_id 
LIMIT 100
"""
df = bpd.read_gbq(sql_query)
df.head()

| | start_station_id |
|---|---|
| 0 | 265 |
| 1 | 244 |
| 2 | 143 |
| 3 | 271 |
| 4 | 174 |


## Visualization of the Flow between Stations