# CSV to CSV with BlazingSQL

[This Notebook](https://github.com/gumdropsteve/silent-disco/blob/master/bsql_csv_to_csv_etl.ipynb) demonstrates a CSV-to-CSV ETL: it reads 3 CSV files from the [gumdropsteve/turbo-telegram](https://github.com/gumdropsteve/turbo-telegram) repository, takes a sample of the data, and writes that sample to this repository's data directory.

**Data Check**

The next cell will check to make sure the data used in this demo is downloaded as expected. 

If you do not have a clone of the turbo-telegram repo ([v0.0.2-beta](https://github.com/gumdropsteve/turbo-telegram/tree/0.0.2-beta) or [later](https://github.com/gumdropsteve/turbo-telegram)) in the same directory as this repo (silent-disco), an empty `turbo-telegram` directory and `turbo-telegram/data` sub-directory will be created, and the data will be downloaded to `turbo-telegram/data`.

**Note**: This Notebook assumes it has been downloaded as part of the [silent-disco](https://github.com/gumdropsteve/silent-disco) repository.

In [1]:
from os import path, system

if not path.exists('../turbo-telegram'):
    print('making blank turbo-telegram repo & data directory')
    system('mkdir ../turbo-telegram')
    system('mkdir ../turbo-telegram/data')

base_url = 'https://raw.githubusercontent.com/gumdropsteve/turbo-telegram/0.0.3-beta/data'
files = ['nyc_taxi_jan15', 'nyc_taxi_feb15', 'nyc_taxi_march15']
    
for fn in files:
    if not path.isfile(f'../turbo-telegram/data/{fn}.csv'):
        print(f'downloading {fn}.csv to ../turbo-telegram/data')
        system(f'wget -P ../turbo-telegram/data {base_url}/{fn}.csv')
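
The check above shells out to `mkdir` and `wget`. As a minimal sketch of the same idea using only the Python standard library, the helpers below create the data directory if needed and fetch only the files that are absent. `missing_csvs` and `download_missing` are hypothetical names, not part of the notebook, and the `fetch` parameter is an assumption added so the download step can be swapped out or exercised offline; it defaults to `urllib.request.urlretrieve`.

```python
import os
from urllib.request import urlretrieve


def missing_csvs(data_dir, names):
    """Return the names whose <name>.csv is not yet present in data_dir."""
    return [
        name for name in names
        if not os.path.isfile(os.path.join(data_dir, f"{name}.csv"))
    ]


def download_missing(data_dir, base_url, names, fetch=urlretrieve):
    """Create data_dir if needed, then fetch only the CSVs that are missing.

    `fetch` is a hypothetical hook taking (url, destination_path).
    Returns the list of names that were fetched.
    """
    # makedirs also creates any missing parent directory
    os.makedirs(data_dir, exist_ok=True)
    fetched = missing_csvs(data_dir, names)
    for name in fetched:
        fetch(f"{base_url}/{name}.csv", os.path.join(data_dir, f"{name}.csv"))
    return fetched
```

Calling `download_missing('../turbo-telegram/data', base_url, files)` with the `base_url` and `files` values from the cell above should behave like the original loop, without depending on `wget` being installed.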

## ETL

In [2]:
%%time
from blazingsql import BlazingContext
bc = BlazingContext()

# identify path to turbo-telegram repo
from os import getcwd
main = getcwd().replace('/silent-disco', '')

# create table from 3 csv files stored in turbo-telegram repo
bc.create_table('taxi', f'{main}/turbo-telegram/data/nyc_taxi_*.csv', header=0)

# overwrite table w/ a 1m-row sample of the prior table's 1.9m rows
bc.create_table('taxi', bc.sql('select * from taxi').to_pandas().sample(1000000))

# query the whole table & store results as csv in this repo's data directory
bc.sql('select * from taxi').to_csv('data/nyc_yellow_cab.csv', index=False)

# forget the table exists
bc.drop_table('taxi')

# create a table from the file we just wrote & see how it looks
bc.create_table('new_taxi', f'{main}/silent-disco/data/nyc_yellow_cab.csv', header=0)
bc.sql('select * from new_taxi')

BlazingContext ready
CPU times: user 9.65 s, sys: 2.48 s, total: 12.1 s
Wall time: 12.2 s


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_x,pickup_y,RateCodeID,store_and_fwd_flag,dropoff_x,dropoff_y,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge
0,1,2015-01-08 18:49:37,2015-01-08 18:52:53,1,0.50,-8236359.026,4973672.723,1,N,-8236812.552,4972639.888,2,4.0,1.0,0.5,0.00,0.00,0.30
1,2,2015-02-20 17:34:08,2015-02-20 17:46:59,1,2.09,-8236550.118,4977245.055,1,N,-8236637.596,4977126.199,2,10.5,1.0,0.5,0.00,0.00,12.30
2,1,2015-03-09 20:13:17,2015-03-09 20:25:41,1,2.60,-8234781.875,4975580.644,1,Y,-8234781.875,4975580.644,1,11.5,0.5,0.5,1.00,0.00,13.80
3,2,2015-02-22 14:54:38,2015-02-22 15:13:47,1,7.28,-8236710.636,4967613.450,1,N,-8232552.462,4977599.949,1,23.0,0.0,0.5,4.76,0.00,28.56
4,1,2015-02-22 18:40:03,2015-02-22 19:09:55,2,16.40,-8222387.186,4978506.586,1,N,-8238673.369,4970121.864,1,45.0,0.0,0.5,10.20,5.33,61.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,1,2015-01-09 18:21:44,2015-01-09 18:34:08,1,1.40,-8238223.240,4975761.703,1,N,-8236490.667,4973926.605,1,9.5,1.0,0.5,2.25,0.00,0.30
999996,2,2015-02-08 19:48:07,2015-02-08 19:54:55,2,1.32,-8236527.187,4972522.769,1,N,-8234779.327,4973607.152,1,7.0,0.0,0.5,1.56,0.00,9.36
999997,2,2015-03-26 16:10:04,2015-03-26 16:14:09,1,0.52,-8233129.137,4978629.945,1,N,-8233704.113,4978096.711,2,4.5,1.0,0.5,0.00,0.00,6.30
999998,2,2015-02-08 10:41:32,2015-02-08 10:49:18,2,1.76,-8235618.436,4975266.182,1,N,-8233364.393,4976517.927,2,8.0,0.0,0.5,0.00,0.00,8.80
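
The ETL cell derives `main` by string-replacing `/silent-disco` in the working directory, which assumes the notebook is run from the repository root. A minimal sketch of a more tolerant lookup follows. `repo_parent` is a hypothetical helper, not part of the notebook or of BlazingSQL, that walks up from the working directory until it finds a directory with the repository's name. It uses `PurePosixPath`, so it assumes POSIX-style paths like those in the notebook.

```python
from pathlib import PurePosixPath


def repo_parent(cwd, repo_name="silent-disco"):
    """Return the directory that contains repo_name.

    Checks cwd itself and then each ancestor, so it also works when
    the notebook is run from a sub-directory of the repository.
    """
    path = PurePosixPath(cwd)
    for candidate in (path, *path.parents):
        if candidate.name == repo_name:
            return candidate.parent
    raise ValueError(f"{repo_name!r} not found in {cwd!r}")
```

With `main = repo_parent(getcwd())`, the glob passed to `bc.create_table` could be built as `f'{main}/turbo-telegram/data/nyc_taxi_*.csv'`.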


#### Runtime Details

In [3]:
conda list blazingsql

# packages in environment at /opt/conda-environments/rapids-nightly:
#
# Name                    Version                   Build  Channel
blazingsql                0.13.0a         cuda10.0_py37_267    blazingsql-nightly/label/cuda10.0

Note: you may need to restart the kernel to use updated packages.


In [4]:
!nvidia-smi

Fri Apr 10 09:49:11 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 440.48.02    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   39C    P0    27W /  70W |    672MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0  