# [Beginners Guide to BlazingSQL](https://medium.com/dropout-analytics/beginners-guide-to-blazingsql-9ab6c2a9c6ad?source=friends_link&sk=1c4a81ea2cb0a061423c2d370acb60f4)

In [1]:
from blazingsql import BlazingContext
bc = BlazingContext()

BlazingContext ready


### Linear Regression on Data from AWS S3

We are going to predict the fare of a [NYC Yellow Taxi](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) cab ride by fitting cuML's `LinearRegression()` on a `cudf.DataFrame`. This DataFrame will be generated from a SQL query on an Apache Parquet dataset that resides in a public AWS S3 bucket. For more information on BlazingSQL and cuDF, see [The DataFrame Notebook](https://app.blazingsql.com/jupyter/user-redirect/lab/workspaces/auto-b/tree/Welcome_to_BlazingSQL_Notebooks/intro_notebooks/the_dataframe.ipynb).

**Connect AWS S3 bucket**

In [2]:
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

(True,
 '',
 OrderedDict([('type', 's3'),
              ('bucket_name', 'blazingsql-colab'),
              ('access_key_id', ''),
              ('secret_key', ''),
              ('session_token', ''),
              ('encryption_type', <S3EncryptionType.NONE: 1>),
              ('kms_key_amazon_resource_name', '')]))

**Create table**

In [3]:
bc.create_table('taxi', 's3://blazingsql-colab/yellow_taxi/1_0_0.parquet')

**Query table**

In [4]:
bc.sql('select * from taxi')

| Unnamed: 0 | vendor_id | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | rate_code_id | store_and_fwd_flag | pickup_loc_id | dropoff_loc_id | payment_type | Fare_amount | Extra | MTA_tax | Improvement_surcharge | Tip_amount | Tolls_amount | Total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-01-09 11:13:28 | 2017-01-09 11:25:45 | 1 | 3.300000 | 1 | N | 263 | 161 | 1 | 12.5 | 0.0 | 0.5 | 2.00 | 0.0 | 0.3 | 15.300000 |
| 1 | 1 | 2017-01-09 11:32:27 | 2017-01-09 11:36:01 | 1 | 0.900000 | 1 | N | 186 | 234 | 1 | 5.0 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 7.250000 |
| 2 | 1 | 2017-01-09 11:38:20 | 2017-01-09 11:42:05 | 1 | 1.100000 | 1 | N | 164 | 161 | 1 | 5.5 | 0.0 | 0.5 | 1.00 | 0.0 | 0.3 | 7.300000 |
| 3 | 1 | 2017-01-09 11:52:13 | 2017-01-09 11:57:36 | 1 | 1.100000 | 1 | N | 236 | 75 | 1 | 6.0 | 0.0 | 0.5 | 1.70 | 0.0 | 0.3 | 8.500000 |
| 4 | 2 | 2017-01-01 00:00:00 | 2017-01-01 00:00:00 | 1 | 0.020000 | 2 | N | 249 | 234 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 52.799999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18826826 | 1 | 2018-05-01 16:31:07 | 2018-05-01 16:43:07 | 0 | 1.300000 | 1 | N | 161 | 234 | 1 | 9.0 | 1.0 | 0.5 | 2.15 | 0.0 | 0.3 | 12.950000 |
| 18826827 | 1 | 2018-05-01 16:44:57 | 2018-05-01 17:11:43 | 0 | 4.300000 | 1 | N | 234 | 13 | 1 | 19.5 | 1.0 | 0.5 | 5.30 | 0.0 | 0.3 | 26.600000 |
| 18826828 | 2 | 2018-05-01 16:10:21 | 2018-05-01 16:27:40 | 2 | 4.020000 | 1 | N | 262 | 234 | 1 | 16.0 | 1.0 | 0.5 | 1.80 | 0.0 | 0.3 | 19.600000 |
| 18826829 | 2 | 2018-05-01 16:45:53 | 2018-05-01 17:39:44 | 1 | 17.700001 | 2 | N | 132 | 148 | 1 | 52.0 | 4.5 | 0.5 | 14.32 | 0.0 | 0.3 | 71.620003 |


**Hand off Results**

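Extract the desired features with `.sql()`, then split the data into training and test sets using cuML's `train_test_split()` function. With `train_size=0.8`, 80% of the rows go to training and the remaining 20% are held out for testing.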

In [5]:
%%time
from cuml.preprocessing.model_selection import train_test_split

X = bc.sql('SELECT trip_distance, tolls_amount FROM taxi')
y = bc.sql('SELECT fare_amount FROM taxi')['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

CPU times: user 942 ms, sys: 481 ms, total: 1.42 s
Wall time: 1.72 s
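
To make the split step concrete, here is a minimal pure-Python sketch of what an 80/20 shuffle split does. It is an illustration only, not cuML's implementation: `simple_train_test_split`, its `seed` parameter, and the toy rows are all hypothetical and were made up for this example.

```python
import random


def simple_train_test_split(X, y, train_size=0.8, seed=0):
    """Hypothetical helper: shuffle row indices, then cut them into two parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    cut = int(len(indices) * train_size)      # number of training rows
    train_idx, test_idx = indices[:cut], indices[cut:]
    # index X and y with the same positions so features and targets stay aligned
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# made-up toy rows of (trip_distance, tolls_amount); fare = 2.5 * distance + 3
X_toy = [(float(d), 0.0) for d in range(1, 11)]
y_toy = [2.5 * d + 3.0 for d, _ in X_toy]

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_toy, y_toy, train_size=0.8)
print(len(X_tr), len(X_te))
```

The key property is that the same shuffled indices are applied to both `X` and `y`, so each feature row keeps its matching fare after the split.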


Then we call `.fit()` to train the linear regression on the training split and `.predict()` to generate fare predictions for the test split.

In [6]:
%%time
from cuml import LinearRegression

# instantiate the Linear Regression model
lr = LinearRegression()

# train the model
lr.fit(X_train, y_train)

# make predictions for test X values
y_pred = lr.predict(X_test)

CPU times: user 460 ms, sys: 134 ms, total: 594 ms
Wall time: 605 ms
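
For intuition about what fitting and predicting involve, the sketch below performs an ordinary least squares fit in plain Python. It is simplified to a single feature (trip distance) rather than the two features used above, and it does not reflect how cuML solves the problem on the GPU. The helpers `fit_line` and `predict_line` and the toy numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_line(xs, slope, intercept):
    """Apply the fitted line to new x values."""
    return [slope * x + intercept for x in xs]


# made-up toy data: fare = 2.5 * distance + 3
distances = [1.0, 2.0, 3.0, 4.0, 5.0]
fares = [5.5, 8.0, 10.5, 13.0, 15.5]

slope, intercept = fit_line(distances, fares)
predicted = predict_line([6.0], slope, intercept)
print(slope, intercept, predicted)
```

Because the toy fares lie exactly on a line, the fit recovers a slope of 2.5 and an intercept of 3.0 up to floating-point rounding. Real taxi data is noisy, so the fitted line only approximates the fares.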


We can convert the test and predicted values to pandas with `.to_pandas()` and then compute the model's R² score with scikit-learn's `r2_score()`.

In [7]:
from sklearn.metrics import r2_score

y_test = y_test.to_pandas()
y_pred = y_pred.to_pandas()

r2_score(y_true=y_test, y_pred=y_pred)

0.806373087168271
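
The R² score is defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up toy values; the helper name `r2` and the numbers are hypothetical and only illustrate the formula.

```python
def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean of y_true
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# made-up toy values, not taken from the taxi dataset
y_true_toy = [5.0, 10.0, 15.0, 20.0]
y_pred_toy = [6.0, 9.0, 16.0, 19.0]

score = r2(y_true_toy, y_pred_toy)
print(round(score, 3))
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean fare. The value of about 0.81 reported above therefore indicates that the model accounts for roughly 81% of the variance in fare on the test split.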