# Car price modeling with Snowpark

## Set up your local Python development environment for Snowpark


https://docs.snowflake.com/en/developer-guide/snowpark/python/setup

## Set up the connection to Snowflake

Apply for a Snowflake trial, .....
Make a note of the username, password and account name.
Enable Anaconda in the Admin > Billing & Terms section.

Create a Python file `connection_config.py` with the following contents:

```python
connection_parameters = {
    "account": "JTJLRSJ-MR87367", 
    "user": "snowflaketrialuser",
    "password": "yourpassword",
    "warehouse": "COMPUTE_WH",
    "role": "accountadmin",
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF10"
}
```
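Hard-coding a password in a file that may end up in version control is risky. As a minimal sketch, the same dictionary could be built from environment variables instead. The variable names (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`) and the helper `load_connection_parameters` are assumptions made for illustration; they are not part of Snowpark.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
REQUIRED = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
}

# Defaults copied from the connection_config.py example.
DEFAULTS = {
    "warehouse": "COMPUTE_WH",
    "role": "accountadmin",
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF10",
}


def load_connection_parameters(env=None):
    """Build the connection dictionary from environment variables."""
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    params = {key: env[var] for key, var in REQUIRED.items()}
    params.update(DEFAULTS)
    return params
```

With this in place, `connection_config.py` could contain `connection_parameters = load_connection_parameters()` and the rest of the notebook would stay unchanged.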


In [1]:
import os
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from connection_config import connection_parameters

import pandas as pd

#### Current Environment Details
def current_snowflake_env():
    snowflake_environment = session.sql('select current_user(), current_role(), current_database(), current_schema(), current_version(), current_warehouse()').collect()
    print('User                     : {}'.format(snowflake_environment[0][0]))
    print('Role                     : {}'.format(snowflake_environment[0][1]))
    print('Database                 : {}'.format(snowflake_environment[0][2]))
    print('Schema                   : {}'.format(snowflake_environment[0][3]))
    print('Warehouse                : {}'.format(snowflake_environment[0][5]))
    print('Snowflake version        : {}'.format(snowflake_environment[0][4]))

#### Set up a connection with Snowflake
session = Session.builder.configs(connection_parameters).create()


In [2]:
current_snowflake_env()

User                     : SNOWFLAKETRIALUSER
Role                     : ACCOUNTADMIN
Database                 : SNOWFLAKE_SAMPLE_DATA
Schema                   : TPCH_SF10
Warehouse                : COMPUTE_WH
Snowflake version        : 7.14.0


In [3]:
session.add_packages("snowflake-snowpark-python", "pandas", "xgboost==1.7.3")

The version of package xgboost in the local environment is 1.7.4, which does not fit the criteria for the requirement xgboost==1.7.3. Your UDF might not work when the package version is different between the server and your local environment
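The warning appears because the pinned version differs from the one installed locally. One way to keep the two aligned, sketched here, is to build the requirement string from the locally installed version. `pinned_requirement` is a hypothetical helper, and whether that exact version is available on the Snowflake side is an assumption you would need to verify.

```python
from importlib import metadata


def pinned_requirement(package, lookup=metadata.version):
    """Return 'package==<local version>', or the bare name if not installed."""
    try:
        return f"{package}=={lookup(package)}"
    except metadata.PackageNotFoundError:
        return package


# Possible usage in the notebook (requires an open Snowpark session):
# session.add_packages(
#     "snowflake-snowpark-python", "pandas", pinned_requirement("xgboost")
# )
```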


## Set up a new database

In [4]:
session.sql('CREATE OR REPLACE database cars_data').collect()


[Row(status='Database CARS_DATA successfully created.')]

In [5]:
session.sql('USE SCHEMA cars_data.public').collect()

[Row(status='Statement executed successfully.')]

## Get the cars data

We scraped cars-for-sale listings from several car websites; for each car we have....

In [6]:
car_prices = pd.read_csv("https://raw.githubusercontent.com/longhowlam/snowpark_cars_model/master/autos_tekoop.zip", encoding = "ISO-8859-1")

In [24]:
### extract the number from the vermogen (power) column
### a raw string keeps Python from treating \d as a string escape
car_prices['power'] = car_prices['vermogen'].str.extract(r'(\d+)')
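The pattern `(\d+)` captures the first run of digits in each value, so `314kW` yields `314` and a missing value stays missing. The sketch below shows the same pattern with the standard `re` module on a few values; `first_number` is a hypothetical helper, and unlike the pandas call it converts the match to an integer.

```python
import re

DIGITS = re.compile(r"(\d+)")


def first_number(value):
    """Return the first run of digits in value as an int, or None."""
    if value is None:
        return None
    match = DIGITS.search(str(value))
    return int(match.group(1)) if match else None


for raw in ["314kW", " 245kW", None]:
    print(repr(raw), "->", first_number(raw))
```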

In [25]:
display(car_prices.sample(7))

Unnamed: 0,bouwjaar,km_stand,brandstof,motorinhoud,vermogen,transmissie,type,kleur,deur,prijs,merk,model,vraagprijs,power
3397,2022,1,Elektrisch,,,Automaat,Hatchback,Zwart,5-deurs,€ 41.665,Volkswagen,ID.3,41665,
12061,2016,124097,Elektrisch,,314kW,Automaat,SUV / Terreinwagen,Zwart,5-deurs,€ 53.950,Tesla,Model,53950,314.0
47411,2015,70259,Benzine,998cc,60kW,Handgeschakeld,Hatchback,Grijs,5-deurs,€ 8.825,Ford,Fiesta,8825,60.0
179647,2020,10350,Benzine,1998cc,146kW,Automaat,Cabriolet,Grijs,2-deurs,€ 52.950,BMW,Z4,52950,146.0
208413,2007,320055,Diesel,1422cc,51kW,Handgeschakeld,Stationwagon,Blauw,5-deurs,€ 1.499,Skoda,Fabia,1499,51.0
143623,2016,148912,Hybride,1395cc,115kW,Automaat,Stationwagon,Grijs,5-deurs,€ 24.185,Volkswagen,Passat,24185,115.0
2624,2018,56052,Elektrisch,,,Automaat,SUV / Terreinwagen,Rood,5-deurs,€ 69.800,Jaguar,I-Pace,69800,
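In the sample, `vraagprijs` looks like the numeric form of the formatted `prijs` string (for example `41.665` next to `41665`). How that column was derived is not shown here. As a minimal sketch of one way to do such a conversion, assuming the dot is always a thousands separator and prices carry no decimals, the hypothetical helper `parse_price` keeps only the digits:

```python
import re


def parse_price(text):
    """Convert a Dutch-formatted price such as '€ 41.665' to an int."""
    digits = re.sub(r"\D", "", text)  # drop currency sign, spaces and dots
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_price("€ 41.665"))  # 41665
```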


## Create a Snowflake table

In [26]:
## With quote_identifiers set to False, identifiers are passed on to Snowflake without quoting,
## so Snowflake coerces them to uppercase.

session.write_pandas(car_prices, "CAR_PRICES", auto_create_table = True, quote_identifiers = False, overwrite = True)

<snowflake.snowpark.table.Table at 0x18696ec7d90>
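With `quote_identifiers=False` the column names reach Snowflake unquoted. Snowflake folds unquoted identifiers to uppercase and keeps double-quoted ones exactly as written. The sketch below mimics that rule in plain Python to show why `vraagprijs` ends up as `VRAAGPRIJS`; `resolve_identifier` is a hypothetical helper that ignores details such as escaped quotes inside a name.

```python
def resolve_identifier(name):
    """Mimic how Snowflake stores an identifier (simplified)."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1]  # quoted: case is preserved
    return name.upper()    # unquoted: folded to uppercase


for name in ["vraagprijs", '"vraagprijs"', "km_stand"]:
    print(name, "->", resolve_identifier(name))
```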

## Prepare data using Snowpark
Now that the table is in Snowflake, we no longer use pandas for data manipulation but Snowpark instead.

In [27]:
cars_sf = session.table('CARS_DATA.PUBLIC.CAR_PRICES')

In [28]:
cars_sf.show()

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"  |"TYPE"               |"KLEUR"  |"DEUR"    |"PRIJS"     |"MERK"      |"MODEL"  |"VRAAGPRIJS"  |"POWER"  |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2018        |54700       |Elektrisch   |NULL           | 245kW      |Automaat       |Hatchback            | Rood    | 5-deurs  |€ 54.999    |Tesla       |Model    |54999         |245      |
|2017        |56266       |Elektrisch   |NULL           |NULL        |Automaat       | Hatchback           |Wit      | 5-deurs  |€ 22.949    |Volkswagen  |e-Golf   |22949         |NULL     |
|2021        |1498        |Elektrisch   |NULL

### Create new columns age (from bouwjaar) and N_doors (from deur)

In [29]:
cars_sf = (
    cars_sf
    .with_column('age' , 2023 - cars_sf['BOUWJAAR'])
    .with_column('N_doors', cars_sf["DEUR"].substring(1,2))
)
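Snowflake's `SUBSTRING` counts positions from 1, so `substring(1, 2)` keeps the first two characters. The `DEUR` values in the table output of this section carry a leading space (` 5-deurs`), which is why `N_DOORS` comes out as ` 5` rather than `5`. The plain-Python sketch below mimics that behaviour for positive start positions and shows one way to get a clean integer; `sql_substring` and `door_count` are hypothetical helpers.

```python
def sql_substring(text, start, length):
    """Emulate SQL SUBSTRING for start >= 1 (positions are 1-based)."""
    return text[start - 1:start - 1 + length]


def door_count(deur):
    """Parse the number of doors from a value such as ' 5-deurs'."""
    return int(deur.strip().split("-")[0])


print(repr(sql_substring(" 5-deurs", 1, 2)))  # ' 5'
print(door_count(" 5-deurs"))                  # 5
```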

In [30]:
cars_sf.sample(n=10).show()

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"   |"TYPE"           |"KLEUR"  |"DEUR"    |"PRIJS"     |"MERK"         |"MODEL"   |"VRAAGPRIJS"  |"POWER"  |"AGE"  |"N_DOORS"  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|1989        |97056       |Benzine      | 3946cc        | 140kW      |Handgeschakeld  | Cabriolet       | Zwart   | 2-deurs  |€ 44.950    |Morgan         |Plus      |44950         |140      |34     | 2         |
|2014        |89384       |Benzine      | 1798cc        | 104kW      |Handgeschakeld  | Hatchback       | Grijs   | 5-deurs  |€ 12.450    |Honda        

In [31]:
cars_sf.count()

231000

### Remove outliers

In [32]:
cars_clean = (
    cars_sf
    .filter(F.col("KM_STAND") <= 500000)
    .filter(F.col("AGE") <= 20 )
    .filter(F.col("TRANSMISSIE").in_(F.lit("Handgeschakeld"), F.lit("Automaat")) )
    .filter(F.col("VRAAGPRIJS") <= 100000)
    .filter(F.col("BRANDSTOF").in_(F.lit("Benzine"), F.lit("Diesel")) )
)
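The chained filters keep a row only if every condition holds. As a plain-Python mirror of those rules (for illustration only; the real filtering runs inside Snowflake), the sketch below applies them to three rows taken from the outputs shown earlier. `keep_row` is a hypothetical helper, and the reference year 2023 matches the `age` calculation.

```python
def keep_row(row, reference_year=2023):
    """Return True if a car passes all outlier filters used in the notebook."""
    age = reference_year - row["bouwjaar"]
    return (
        row["km_stand"] <= 500_000
        and age <= 20
        and row["transmissie"] in ("Handgeschakeld", "Automaat")
        and row["vraagprijs"] <= 100_000
        and row["brandstof"] in ("Benzine", "Diesel")
    )


sample = [
    {"merk": "Ford", "bouwjaar": 2015, "km_stand": 70259, "brandstof": "Benzine",
     "transmissie": "Handgeschakeld", "vraagprijs": 8825},
    {"merk": "Tesla", "bouwjaar": 2016, "km_stand": 124097, "brandstof": "Elektrisch",
     "transmissie": "Automaat", "vraagprijs": 53950},
    {"merk": "Morgan", "bouwjaar": 1989, "km_stand": 97056, "brandstof": "Benzine",
     "transmissie": "Handgeschakeld", "vraagprijs": 44950},
]

kept = [row["merk"] for row in sample if keep_row(row)]
print(kept)  # ['Ford']
```

The Tesla drops out on fuel type and the Morgan on age, which is the intended effect of restricting the data to petrol and diesel cars of at most 20 years old.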

In [33]:
## drop the columns that we don't need
cars_clean = cars_clean.drop("PRIJS")

In [34]:
cars_clean.show()

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"   |"TYPE"               |"KLEUR"  |"DEUR"    |"MERK"      |"MODEL"   |"VRAAGPRIJS"  |"POWER"  |"AGE"  |"N_DOORS"  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2020        |7281        |Benzine      | 1199cc        | 96kW       |Automaat        | SUV / Terreinwagen  | Rood    | 5-deurs  |Citroën     |C3        |26950         |96       |3      | 5         |
|2015        |26120       |Benzine      | 1242cc        | 51kW       |Handgeschakeld  | Hatchback           | Wit     | 3-deurs  |Fiat        |500       |9750          |51       |8      | 3         |


### Save the data into a Snowflake table

In [15]:
cars_clean.count()

179490

In [35]:
cars_clean.write.mode("overwrite").save_as_table("CARS_DATA.PUBLIC.CARS_CLEAN")

## Gracefully close the Snowflake session

In [74]:
session.close()
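If a cell fails midway, a `session.close()` placed at the end is never reached. A minimal sketch of one way to guarantee the close uses `contextlib.closing` from the standard library, which calls `close()` on any object that provides it. `run_with_session` and its two callable arguments are hypothetical names chosen for this example.

```python
from contextlib import closing


def run_with_session(create_session, work):
    """Create a session, run work(session), and always close the session."""
    with closing(create_session()) as session:
        return work(session)


# Possible usage with Snowpark (needs valid connection parameters):
# n_rows = run_with_session(
#     lambda: Session.builder.configs(connection_parameters).create(),
#     lambda session: session.table("CARS_DATA.PUBLIC.CARS_CLEAN").count(),
# )
```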