# Get open source data

We will use the well-known "Chicago Taxi Dataset".

### 0.1.0 Download it from the official source

In [None]:
### First, create a local folder to hold the data (-p avoids an error if the folder already exists)
!mkdir -p chicagodata

In [None]:
# Get the dataset Taxi Trips as CSV
!curl --get 'https://data.cityofchicago.org/resource/wrvz-psew.csv' \
  --data-urlencode '$limit=10000' \
  --data-urlencode '$where=trip_start_timestamp >= "2023-01-01" AND trip_start_timestamp < "2023-02-01"' \
  --data-urlencode '$select=tips,trip_start_timestamp,trip_seconds,trip_miles,pickup_community_area,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_community_area,fare,tolls,extras,trip_total' \
  | tr -d '"' > "./chicagodata/trip.csv"
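The `curl` call above sends three SoQL parameters (`$limit`, `$where`, `$select`) as a URL-encoded query string. As a minimal sketch, the same URL can be built in Python with the standard library only. `build_query_url` is a hypothetical helper name, and the snippet only constructs the URL; it does not download anything.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofchicago.org/resource/wrvz-psew.csv"

# Same columns as the $select clause of the curl command
COLUMNS = [
    "tips", "trip_start_timestamp", "trip_seconds", "trip_miles",
    "pickup_community_area", "pickup_centroid_latitude",
    "pickup_centroid_longitude", "dropoff_community_area",
    "fare", "tolls", "extras", "trip_total",
]


def build_query_url(limit=10000, start="2023-01-01", end="2023-02-01"):
    """Build the query URL for trips with start <= trip_start_timestamp < end."""
    params = {
        "$limit": limit,
        "$where": (
            f'trip_start_timestamp >= "{start}" '
            f'AND trip_start_timestamp < "{end}"'
        ),
        "$select": ",".join(COLUMNS),
    }
    # urlencode percent-encodes "$", ",", quotes and spaces for us
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url()
print(url)
```

Keeping the date range as function arguments makes it easy to fetch another month without editing the query string by hand.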

### 0.1.1 Quick analysis

In [None]:
### Import dependencies
import pandas as pd
import seaborn as sns

In [None]:
### read data using pandas reader with appropriate type and separator
data =...
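One possible way to approach typed CSV reading is sketched below on a small in-memory sample rather than on the real file. The two rows are made up for illustration, and the dtype choices (a nullable `Int64` for the community area, parsed dates for the timestamp) are assumptions, not the lab's required answer.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row sample in the same layout as trip.csv (values are made up)
sample_csv = StringIO(
    "tips,trip_start_timestamp,trip_seconds,trip_miles,pickup_community_area\n"
    "2.5,2023-01-05T10:15:00.000,600,1.8,8\n"
    "0.0,2023-01-06T18:30:00.000,1200,5.2,\n"
)

sample = pd.read_csv(
    sample_csv,
    sep=",",                                   # the export is comma-separated
    parse_dates=["trip_start_timestamp"],      # read timestamps as datetimes
    dtype={"pickup_community_area": "Int64"},  # nullable integer keeps missing values
)
print(sample.dtypes)
```

The nullable `Int64` dtype is used here because an identifier column with missing values would otherwise be read as floats.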

In [None]:
### Use pandas to see first lines of the data
data....

In [None]:
# Describe and inspect the distribution of the numerical columns
numerical_data = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
data.select_dtypes(include=...).describe()
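As a small illustration of how `select_dtypes` filters columns before `describe()`, here is a sketch on a toy frame. The frame and its values are invented for the example, and `include="number"` is used as a shorthand that matches all numeric dtypes.

```python
import pandas as pd

# Toy frame (made-up values): two numeric columns and one text column
toy = pd.DataFrame(
    {
        "tips": [0.0, 2.5, 4.0],
        "fare": [8.0, 12.5, 20.0],
        "payment_note": ["a", "b", "c"],
    }
)

# Only the numeric columns survive the filter, so describe() summarizes just those
numeric_summary = toy.select_dtypes(include="number").describe()
print(numeric_summary)
```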

In [None]:
# Using seaborn, plot the distribution of the continuous variable: tips
sns....

In [None]:
# Using seaborn, plot the distribution of the categorical variable: pickup_community_area
sns....

In [None]:
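### Using seaborn heatmap, plot the correlation matrix of the current dataset
### numeric_only=True skips non-numeric columns such as a text timestamp (requires pandas >= 1.5)
correlation = data.corr(numeric_only=True)
hm = sns.heatmap(...,
                 cbar=True,
                 annot=True,
                 square=True,
                 fmt='.2f',
                 yticklabels=correlation....,
                 xticklabels=correlation....)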
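A correlation matrix is only defined for numeric columns, and recent pandas versions raise an error when `corr()` meets a text column. The sketch below shows the behaviour on a toy frame with invented values, assuming pandas 1.5 or later for the `numeric_only` argument.

```python
import pandas as pd

# Toy frame (made-up values): one text column and two numeric columns
toy = pd.DataFrame(
    {
        "trip_start_timestamp": ["2023-01-05", "2023-01-06", "2023-01-07", "2023-01-08"],
        "trip_miles": [1.0, 2.0, 3.0, 4.0],
        "fare": [5.0, 7.0, 9.0, 11.0],
    }
)

# The text column is left out; only numeric columns enter the matrix
correlation = toy.corr(numeric_only=True)
print(correlation)
```

Because `fare` is an exact linear function of `trip_miles` in this toy data, the two columns are perfectly correlated; real trips will not be.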

**Feel free to add any visualization or description that helps you summarize the data.**

**Now that we have a first understanding of the data, we want to store it for later use.**

### 0.1.2 Storage

We will use `MinIO` to store our dataset. You can look for MinIO in the product portal or go directly to https://storage.course.aiengineer.codex-platform.com/login

#### Use the minio client

In [None]:
# Import dependencies
from minio import Minio
import urllib3
import os

In [None]:
# Check that "MINIO-ACCESS-KEY" is correctly set to "name-surname_SA"
os.getenv("MINIO-ACCESS-KEY")
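Since the client needs both the access key and the secret key, it can help to check that both variables are set without printing the secret itself. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper, not part of the MinIO client.

```python
import os

REQUIRED_VARS = ("MINIO-ACCESS-KEY", "MINIO-SECRET-KEY")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty, without exposing any values."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
print("Missing variables:", missing or "none")
```

The optional `environ` argument exists only so the check can be tried against a plain dictionary instead of the real environment.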

In [None]:
## Create a client with the access key and the secret key given
client = Minio(
    "storage-api.course.aiengineer.codex-platform.com",
    access_key=os.getenv("MINIO-ACCESS-KEY"),
    secret_key=os.getenv("MINIO-SECRET-KEY"),
    secure=True,
    http_client=urllib3.PoolManager(
        retries=urllib3.Retry(
            total=5,
            backoff_factor=0.2,
            status_forcelist=[500, 502, 503, 504],
        ),
    ),
)

In [None]:
### Using the minio documentation, list and print buckets
buckets = client....
for bucket in buckets:
    ...

In [None]:
## Define a bucket name matching your credentials: firstname-lastname
# This bucket has already been created for you before the lab
bucket = ''  # name-surname

#### Put taxi dataset into your bucket

In [None]:
# Install dependencies: pyarrow
%pip install pyarrow

In [None]:
# Import dependencies
from io import BytesIO
import pyarrow

In [None]:
### We will persist the data as "parquet" instead of CSV, because parquet preserves column types and encoding
### Convert data to parquet using pandas (if you struggle with the parquet engine used by pandas, choose pyarrow)
parquet_bytes=data....

### Use BytesIO to wrap the parquet bytes into a bytes stream object
parquet_buffer = ...

In [None]:
### Define the path to your chicago taxi dataset (intermediate folders are created automatically)
path_minio="datasets/chicago/trips.parquet"

In [None]:
### Upload the parquet file
### Fill in the parameters using the put_object documentation
client.put_object(...,
                  path_minio,
                  data=...,
                  length=len(...),
                  content_type='application/parquet')
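The call above needs a file-like object for `data` and a byte count for `length`. A `BytesIO` buffer has no `len()`, so the length has to come from the raw bytes. The sketch below illustrates this pattern; `upload_bytes` and `RecordingClient` are hypothetical names, and the stand-in client only records the call so the example runs without a server.

```python
from io import BytesIO


def upload_bytes(client, bucket_name, object_name, payload,
                 content_type="application/octet-stream"):
    """Upload an in-memory bytes payload through a MinIO-style client."""
    return client.put_object(
        bucket_name,
        object_name,
        data=BytesIO(payload),  # put_object reads from a file-like object
        length=len(payload),    # size in bytes, taken from the raw payload
        content_type=content_type,
    )


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting a server."""

    def __init__(self):
        self.calls = []

    def put_object(self, bucket_name, object_name, data, length, content_type):
        self.calls.append((bucket_name, object_name, data.read(), length, content_type))


fake = RecordingClient()
upload_bytes(fake, "firstname-lastname", "datasets/chicago/trips.parquet",
             b"PAR1", "application/parquet")
print(fake.calls)
```

With the real `client` from this notebook, the same helper would be called with the parquet bytes produced by pandas as `payload`.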

#### Verify it's stored

In [None]:
### Use the API to list the objects in the bucket
objects = client....
for obj ...
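Object storage has no real folders, so objects under a path such as `datasets/chicago/` only show up when the listing is recursive or restricted to that prefix. The following sketch shows one way to collect object names. `object_names` and `StubClient` are hypothetical, and the stub's object list is invented so the example runs offline.

```python
from types import SimpleNamespace


def object_names(client, bucket_name, prefix=None):
    """Return the names of all objects in a bucket, descending into sub-paths."""
    listing = client.list_objects(bucket_name, prefix=prefix, recursive=True)
    return [obj.object_name for obj in listing]


class StubClient:
    """Hypothetical stand-in that mimics list_objects with made-up object names."""

    def list_objects(self, bucket_name, prefix=None, recursive=False):
        stored = ["datasets/chicago/trips.parquet", "notes/readme.txt"]
        return (
            SimpleNamespace(object_name=name)
            for name in stored
            if prefix is None or name.startswith(prefix)
        )


names = object_names(StubClient(), "firstname-lastname", prefix="datasets/")
print(names)
```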

**In this notebook, we downloaded open data, ran a quick analysis, and stored the data in our object storage.**