# Car price modeling with Snowpark

## Set up your local Python development environment for Snowpark

Follow the Snowflake setup guide: https://docs.snowflake.com/en/developer-guide/snowpark/python/setup

## Set up the connection to Snowflake

Apply for a Snowflake trial, .....
Make a note of the username, password, and account name.
Enable Anaconda in the Admin > Billing & Terms section.

Create a Python file connection_config.py with the following contents:

```python
connection_parameters = {
    "account": "JTJLRSJ-MR87367", 
    "user": "snowflaketrialuser",
    "password": "yourpassword",
    "warehouse": "COMPUTE_WH",
    "role": "accountadmin",
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF10"
}
```
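
Hardcoding a password in a source file makes it easy to commit by accident. As a hedged alternative, the sketch below builds the same dictionary from environment variables. The variable names (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`) and the helper `load_connection_parameters` are assumptions made for illustration and are not part of Snowpark; the defaults mirror the values above.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
REQUIRED = {
    "account": "SNOWFLAKE_ACCOUNT",
    "user": "SNOWFLAKE_USER",
    "password": "SNOWFLAKE_PASSWORD",
}

# Defaults taken from the connection_config.py example above.
DEFAULTS = {
    "warehouse": "COMPUTE_WH",
    "role": "accountadmin",
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF10",
}


def load_connection_parameters(env=None):
    """Build the connection dictionary from environment variables."""
    env = os.environ if env is None else env
    # Collect every required variable that is unset or empty.
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    params = {key: env[var] for key, var in REQUIRED.items()}
    params.update(DEFAULTS)
    return params
```

With this helper, connection_config.py could define `connection_parameters = load_connection_parameters()` and the rest of the notebook would stay unchanged.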


In [12]:
import os
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from connection_config import connection_parameters

import pandas as pd

#### Current Environment Details
def current_snowflake_env():
    snowflake_environment = session.sql('select current_user(), current_role(), current_database(), current_schema(), current_version(), current_warehouse()').collect()
    print('User                     : {}'.format(snowflake_environment[0][0]))
    print('Role                     : {}'.format(snowflake_environment[0][1]))
    print('Database                 : {}'.format(snowflake_environment[0][2]))
    print('Schema                   : {}'.format(snowflake_environment[0][3]))
    print('Warehouse                : {}'.format(snowflake_environment[0][5]))
    print('Snowflake version        : {}'.format(snowflake_environment[0][4]))

#### Set up a connection with Snowflake
session = Session.builder.configs(connection_parameters).create()


In [13]:
current_snowflake_env()

User                     : SNOWFLAKETRIALUSER
Role                     : ACCOUNTADMIN
Database                 : SNOWFLAKE_SAMPLE_DATA
Schema                   : TPCH_SF10
Warehouse                : COMPUTE_WH
Snowflake version        : 7.10.1


In [14]:
session.add_packages("snowflake-snowpark-python", "pandas", "xgboost==1.7.3")

## Set up a new database

In [5]:
session.sql('CREATE OR REPLACE database cars_data').collect()


[Row(status='Database CARS_DATA successfully created.')]

In [6]:
session.sql('USE SCHEMA cars_data.public').collect()

[Row(status='Statement executed successfully.')]

## Get the cars data

We scraped for-sale listings from several car sites. For each car we have....

In [7]:
car_prices = pd.read_csv("https://raw.githubusercontent.com/longhowlam/snowpark_cars_model/master/autos_tekoop.zip", encoding = "ISO-8859-1")

In [8]:
display(car_prices.sample(7))

Unnamed: 0,bouwjaar,km_stand,brandstof,motorinhoud,vermogen,transmissie,type,kleur,deur,prijs,merk,model,vraagprijs
220317,2009,148756,Benzine,6208cc,386kW,Automaat,Sedan,Zwart,4-deurs,€ 34.950,Mercedes-Benz,S-klasse,34950
219767,2008,282017,Diesel,2720cc,140kW,Automaat,Bedrijfswagens,Zwart,5-deurs,€ 9.950,Land,Rover,9950
65122,2020,15590,Benzine,1998cc,141kW,Automaat,Hatchback,Groen,5-deurs,€ 35.975,MINI,Cooper,35975
115982,2016,85000,Benzine,999cc,44kW,Handgeschakeld,Hatchback,Zwart,5-deurs,€ 8.750,Volkswagen,Up!,8750
198813,2007,295896,Diesel,2179cc,118kW,Handgeschakeld,SUV / Terreinwagen,Blauw,5-deurs,€ 5.300,Land,Rover,5300
177174,2008,126038,Benzine,1149cc,43kW,Handgeschakeld,Hatchback,Grijs,3-deurs,€ 2.980,Renault,Twingo,2980
112372,2012,103016,Benzine,1390cc,90kW,Handgeschakeld,SUV / Terreinwagen,Grijs,5-deurs,€ 10.849,Skoda,Yeti,10849


## Create a Snowflake table

In [9]:
## With quote_identifiers set to False, identifiers are passed on to Snowflake without quoting,
## so Snowflake coerces them to uppercase.

session.write_pandas(car_prices, "CAR_PRICES", auto_create_table = True, quote_identifiers = False, overwrite = True)

<snowflake.snowpark.table.Table at 0x7fe45cfefb20>
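
To see what `quote_identifiers = False` means for the column names, the sketch below mimics Snowflake's rule for unquoted identifiers in plain Python: they must start with a letter or underscore, may contain only letters, digits, underscores, and dollar signs, and are stored in uppercase. The helper names `is_valid_unquoted` and `resolved_name` are hypothetical and exist only for this illustration; the sketch covers ASCII names and ignores reserved keywords.

```python
import re

# Unquoted Snowflake identifiers: a letter or underscore first, then letters,
# digits, underscores, or dollar signs.
_UNQUOTED = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def is_valid_unquoted(name):
    """Return True if `name` can be used as an unquoted identifier."""
    return _UNQUOTED.fullmatch(name) is not None


def resolved_name(name, quoted=False):
    """Return the name Snowflake would store for this identifier."""
    if quoted:
        return name  # quoted identifiers keep their exact case
    if not is_valid_unquoted(name):
        raise ValueError(f"{name!r} must be quoted to be a valid identifier")
    return name.upper()


columns = ["bouwjaar", "km_stand", "vraagprijs"]
print([resolved_name(c) for c in columns])
```

This is why the table is queried later with uppercase column names such as `BOUWJAAR` and `KM_STAND`.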

## Prepare data using Snowpark
Now that the table exists in Snowflake, we use Snowpark instead of pandas for data manipulation.

In [15]:
cars_sf = session.table('CARS_DATA.PUBLIC.CAR_PRICES')

In [16]:
cars_sf.show()

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"  |"TYPE"               |"KLEUR"  |"DEUR"    |"PRIJS"     |"MERK"      |"MODEL"  |"VRAAGPRIJS"  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2018        |54700       |Elektrisch   |NULL           | 245kW      |Automaat       |Hatchback            | Rood    | 5-deurs  |€ 54.999    |Tesla       |Model    |54999         |
|2017        |56266       |Elektrisch   |NULL           |NULL        |Automaat       | Hatchback           |Wit      | 5-deurs  |€ 22.949    |Volkswagen  |e-Golf   |22949         |
|2021        |1498        |Elektrisch   |NULL           |NULL        |Automaat       | SUV / Te

### Create new columns age (from bouwjaar) and N_doors (from deur)

In [33]:
cars_sf = (
    cars_sf
    .with_column('age' , 2023 - cars_sf['BOUWJAAR'])
    .with_column('N_doors', cars_sf["DEUR"].substring(1,2))
)

In [36]:
cars_sf.sample(n=10).show()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"   |"TYPE"               |"KLEUR"  |"DEUR"    |"PRIJS"     |"MERK"      |"MODEL"  |"VRAAGPRIJS"  |"AGE"  |"N_DOORS"  |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2020        |32856       |Benzine      | 2995cc        | 260kW      |Automaat        | Cabriolet           | Groen   | 2-deurs  |€ 82.995    |Audi        |S5       |82995         |3      | 2         |
|2003        |278482      |Benzine      | 1796cc        | 90kW       |Handgeschakeld  | Sedan               | Blauw   | 4-deurs  |€ 1.395     |Opel        |Vectra   |1395          |20     | 4 
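
The `substring(1, 2)` call follows SQL conventions: positions are 1-based and the second argument is a length, so it returns the first two characters. Because the `DEUR` values in the sample above start with a space, `N_DOORS` comes out as a string such as `" 2"` rather than a number. The plain-Python sketch below mirrors both derived columns on single values; `sql_substring` and `car_age` are hypothetical helpers for illustration, and the reference year 2023 is taken from the cell above.

```python
def sql_substring(value, start_pos, length):
    """Mimic SQL SUBSTRING for a positive, 1-based start position."""
    start = start_pos - 1  # convert to Python's 0-based indexing
    return value[start:start + length]


def car_age(build_year, reference_year=2023):
    """Age in years relative to the reference year used in the notebook."""
    return reference_year - build_year


n_doors_raw = sql_substring(" 2-deurs", 1, 2)
print(repr(n_doors_raw))         # ' 2'
print(int(n_doors_raw.strip()))  # 2
print(car_age(2020))             # 3
```

Stripping the whitespace and casting to an integer, as in the second print, is the step that would still be needed before `N_DOORS` can serve as a numeric feature.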

### Remove outliers and drop the prijs column

### Make vermogen (power), motorinhoud (engine displacement), and deur (doors) numeric

In [65]:
#### Do some cleaning by removing outliers
cars_clean = (
    cars_sf
    .filter(F.col("KM_STAND") <= 500000)
)
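
The cell above only filters on mileage; the numeric conversions named in the headings are not shown. As a rough sketch of the parsing rules, the plain-Python helper below extracts the leading integer from values such as `6208cc`, `386kW`, and `4-deurs`, and returns `None` for missing values. `leading_int` and `clean_row` are hypothetical names, and the example row is based on the first sample row shown earlier. The same rules would still have to be expressed with Snowpark column functions to run inside Snowflake.

```python
import re

# Optional leading whitespace followed by one or more digits.
_LEADING_INT = re.compile(r"\s*(\d+)")


def leading_int(value):
    """Return the integer at the start of `value`, or None if absent."""
    if value is None:
        return None
    match = _LEADING_INT.match(value)
    return int(match.group(1)) if match else None


def clean_row(row, max_km=500000):
    """Return a cleaned copy of `row`, or None if it is a mileage outlier."""
    if row["KM_STAND"] > max_km:
        return None
    # Drop the formatted price; VRAAGPRIJS already holds the numeric value.
    cleaned = {k: v for k, v in row.items() if k != "PRIJS"}
    for column in ("MOTORINHOUD", "VERMOGEN", "DEUR"):
        cleaned[column] = leading_int(row[column])
    return cleaned


row = {"KM_STAND": 148756, "MOTORINHOUD": "6208cc", "VERMOGEN": "386kW",
       "DEUR": "4-deurs", "PRIJS": "€ 34.950", "VRAAGPRIJS": 34950}
print(clean_row(row))
```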

In [66]:
cars_clean.show()

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"  |"TYPE"               |"KLEUR"  |"DEUR"    |"PRIJS"     |"MERK"      |"MODEL"  |"VRAAGPRIJS"  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2018        |54700       |Elektrisch   |NULL           | 245kW      |Automaat       |Hatchback            | Rood    | 5-deurs  |€ 54.999    |Tesla       |Model    |54999         |
|2017        |56266       |Elektrisch   |NULL           |NULL        |Automaat       | Hatchback           |Wit      | 5-deurs  |€ 22.949    |Volkswagen  |e-Golf   |22949         |
|2021        |1498        |Elektrisch   |NULL           |NULL        |Automaat       | SUV / Te

## Gracefully close the Snowflake session

In [74]:
session.close()
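
To make sure the session is closed even when a cell raises an error, the close call can be tied to a `with` block. The sketch below uses `contextlib.closing` from the standard library, which calls `close()` on exit. `FakeSession` is a stand-in class invented here so the example runs without a Snowflake account; with a real Snowpark session you would pass the `session` object created earlier instead.

```python
from contextlib import closing


class FakeSession:
    """Hypothetical stand-in for a Snowpark session (illustration only)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


session = FakeSession()
try:
    with closing(session):
        # Simulate a failure in the middle of the analysis.
        raise RuntimeError("something failed mid-analysis")
except RuntimeError:
    pass

print(session.closed)  # True
```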