# CSV to CSV with BlazingSQL

[This notebook](https://github.com/gumdropsteve/silent-disco/blob/master/bsql_csv_to_csv_etl.ipynb) demonstrates a CSV-to-CSV ETL: it reads 3 CSV files from the [gumdropsteve/turbo-telegram](https://github.com/gumdropsteve/turbo-telegram) repository, takes a sample of the rows, and writes that sample to this repository's data directory.

**Data Check**

The next cell checks that the data used in this demo has been downloaded as expected.

If you do not have a clone of the turbo-telegram repo ([v0.0.2-beta](https://github.com/gumdropsteve/turbo-telegram/tree/0.0.2-beta) or [later](https://github.com/gumdropsteve/turbo-telegram)) in the same directory as this repo (silent-disco), an empty `turbo-telegram` directory and a `turbo-telegram/data` sub-directory will be created, and the data will be downloaded to `turbo-telegram/data`.

In [1]:
from os import path, system

if not path.exists('../turbo-telegram'):
    print('making blank turbo-telegram repo & data directory')
    system('mkdir ../turbo-telegram')
    system('mkdir ../turbo-telegram/data')

base_url = 'https://raw.githubusercontent.com/gumdropsteve/turbo-telegram/0.0.3-beta/data'
files = ['nyc_taxi_jan15', 'nyc_taxi_feb15', 'nyc_taxi_march15']
    
for fn in files:
    if not path.isfile(f'../turbo-telegram/data/{fn}.csv'):
        print(f'downloading {fn}.csv to ../turbo-telegram/data')
        system(f'wget -P ../turbo-telegram/data {base_url}/{fn}.csv')
        
if not path.exists('data'):
    print('making data directory for this repo')
    system('mkdir data')

downloading nyc_taxi_feb15.csv to ../turbo-telegram/data
making data directory for this repo
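
The data check above shells out to `mkdir` and `wget`. As a minimal sketch of the same check-then-download logic using only the Python standard library, the hypothetical helper below creates the directory and fetches whichever files are missing. The name `ensure_csvs` and its `fetch` parameter are assumptions made for illustration and are not part of the notebook; `fetch` defaults to `urllib.request.urlretrieve` and can be swapped for a stub when testing offline.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_csvs(names, data_dir, base_url, fetch=urlretrieve):
    """Download any missing '<name>.csv' into data_dir; return the names fetched.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    data_dir = Path(data_dir)
    # parents=True creates intermediate directories; exist_ok=True makes reruns safe
    data_dir.mkdir(parents=True, exist_ok=True)

    fetched = []
    for name in names:
        target = data_dir / f'{name}.csv'
        if not target.is_file():
            # fetch(url, filename) follows the call shape of urllib.request.urlretrieve
            fetch(f'{base_url}/{name}.csv', str(target))
            fetched.append(name)
    return fetched
```

With the `files` and `base_url` values from the cell above, a call would look like `ensure_csvs(files, '../turbo-telegram/data', base_url)`. Because the directory is created with `parents=True`, there is no separate `mkdir` step whose path could drift from the download path.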


## ETL

In [2]:
%%time
from blazingsql import BlazingContext
bc = BlazingContext()

# identify path to this directory (cwd) & above directory (main)
from os import getcwd
cwd = getcwd()
main = '/'.join(cwd.split('/')[:-1])

# create table from the 3 csv files stored in turbo-telegram repo
bc.create_table('taxi', f'{main}/turbo-telegram/data/nyc_taxi_*.csv', header=0)

# overwrite table w/ 1m samples of the prior table's 1.9m rows
bc.create_table('taxi', bc.sql('select * from taxi').to_pandas().sample(1000000))

# query the whole table & store results as csv in this repo's data directory
bc.sql('select * from taxi').to_csv('data/nyc_yellow_cab.csv', index=False)

# forget the table exists
bc.drop_table('taxi')

# create a table from the file we just wrote & see how it looks
bc.create_table('new_taxi', f'{cwd}/data/nyc_yellow_cab.csv', header=0)
bc.sql('select * from new_taxi')

BlazingContext ready
CPU times: user 9.6 s, sys: 2.53 s, total: 12.1 s
Wall time: 12.2 s


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_x,pickup_y,RateCodeID,store_and_fwd_flag,dropoff_x,dropoff_y,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge
0,2,2015-03-29 04:35:02,2015-03-29 05:06:31,2,8.46,-8237041.014,4976969.222,1,N,-8231378.729,4965405.816,2,29.0,0.5,0.5,0.00,0.0,30.30
1,2,2015-02-17 18:56:56,2015-02-17 19:07:43,2,2.75,-8237333.173,4971417.775,1,N,-8233394.119,4970193.574,2,11.0,1.0,0.5,0.00,0.0,12.80
2,1,2015-02-04 22:23:41,2015-02-04 22:37:13,1,2.80,-8235801.885,4977181.703,1,N,-8236263.055,4972415.178,1,11.5,0.5,0.5,2.55,0.0,15.35
3,1,2015-01-07 12:26:10,2015-01-07 12:46:03,1,3.70,-8235904.650,4977015.194,1,N,-8239372.343,4969838.949,1,15.5,0.0,0.5,3.26,0.0,0.30
4,2,2015-01-17 18:39:27,2015-01-17 18:44:30,1,0.92,-8235334.770,4975885.027,1,N,-8236558.611,4977024.164,2,5.5,0.0,0.5,0.00,0.0,0.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,1,2015-03-19 06:17:23,2015-03-19 06:20:22,1,0.80,-8233500.281,4977410.447,1,N,-8234279.939,4975993.217,1,4.5,0.0,0.5,1.05,0.0,6.35
999996,1,2015-01-01 19:44:50,2015-01-01 19:58:00,1,3.10,-8237107.259,4971248.563,1,N,-8236611.268,4976348.066,1,12.0,0.0,0.5,2.55,0.0,0.00
999997,1,2015-02-15 16:58:12,2015-02-15 17:03:35,1,0.40,-8235828.213,4978243.614,1,N,-8234988.255,4977852.252,1,5.0,0.0,0.5,1.16,0.0,6.96
999998,1,2015-03-08 19:58:02,2015-03-08 20:08:31,1,1.90,-8236211.248,4973078.115,1,N,-8234247.665,4976628.927,1,9.0,0.5,0.5,1.50,0.0,11.80
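
To check the sample-and-write logic of the cell above without a GPU, the steps can be sketched on the host with only the Python standard library: read every CSV matching a glob pattern, draw `n` rows without replacement, and write them to a single file with one header row. This is not how BlazingSQL performs the work; the function name `sample_csvs` and its `seed` parameter are assumptions introduced for illustration, and the sketch holds all rows in memory.

```python
import csv
import glob
import random


def sample_csvs(pattern, out_path, n, seed=None):
    """Read every CSV matching pattern and write n randomly chosen rows to out_path.

    Hypothetical CPU-only helper for illustration, not part of the original notebook.
    """
    header, rows = None, []
    for fn in sorted(glob.glob(pattern)):
        with open(fn, newline='') as f:
            reader = csv.reader(f)
            file_header = next(reader)  # first line holds the column names
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f'{fn} has different columns')
            rows.extend(reader)

    if header is None:
        raise FileNotFoundError(f'no files match {pattern}')

    # random.sample draws without replacement and raises ValueError if n > len(rows)
    sampled = random.Random(seed).sample(rows, n)

    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)  # one header row for the combined output
        writer.writerows(sampled)
    return len(sampled)
```

Passing a fixed `seed` makes the sample reproducible between runs, which the notebook's `.sample(1000000)` call does not attempt.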


#### Runtime Details

In [3]:
conda list blazingsql

# packages in environment at /opt/conda-environments/rapids-nightly:
#
# Name                    Version                   Build  Channel
blazingsql                0.13.0a         cuda10.0_py37_267    blazingsql-nightly/label/cuda10.0

Note: you may need to restart the kernel to use updated packages.


In [4]:
!nvidia-smi

Fri Apr 10 10:10:57 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 440.48.02    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   41C    P0    27W /  70W |    670MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  