# Car price modeling with Snowpark

## Set up your local Python development environment for Snowpark


Follow the official Snowpark for Python setup guide: https://docs.snowflake.com/en/developer-guide/snowpark/python/setup

## Set up a connection to Snowflake

Apply for a Snowflake trial, .....
Make a note of the username, password and account name.
Enable Anaconda in the Admin > Billing & Terms section.

Create a Python file connection_config.py with the following contents:

```python
connection_parameters = {
    "account": "JTJLRSJ-MR87367", 
    "user": "snowflaketrialuser",
    "password": "yourpassword",
    "warehouse": "COMPUTE_WH",
    "role": "accountadmin",
    "database": "SNOWFLAKE_SAMPLE_DATA",
    "schema": "TPCH_SF10"
}
```
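
A password hardcoded in a file is easy to commit by accident. As a minimal sketch of an alternative, assuming the credentials are exported as environment variables: the variable names (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD` and so on) and the helper `build_connection_parameters` are hypothetical, not something Snowpark defines.

```python
import os


def build_connection_parameters(env=None):
    """Build the Snowpark connection dict from environment variables.

    The variable names are illustrative assumptions; use whatever your
    own setup provides.
    """
    env = os.environ if env is None else env
    required = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "account": env["SNOWFLAKE_ACCOUNT"],
        "user": env["SNOWFLAKE_USER"],
        "password": env["SNOWFLAKE_PASSWORD"],
        # Optional settings fall back to the values of the hardcoded example.
        "warehouse": env.get("SNOWFLAKE_WAREHOUSE", "COMPUTE_WH"),
        "role": env.get("SNOWFLAKE_ROLE", "accountadmin"),
        "database": env.get("SNOWFLAKE_DATABASE", "SNOWFLAKE_SAMPLE_DATA"),
        "schema": env.get("SNOWFLAKE_SCHEMA", "TPCH_SF10"),
    }
```

With this in connection_config.py, the literal dict could be replaced by `connection_parameters = build_connection_parameters()`, so the rest of the notebook stays unchanged.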


In [1]:
import os
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from connection_config import connection_parameters

import pandas as pd

#### Current Environment Details
def current_snowflake_env():
    snowflake_environment = session.sql('select current_user(), current_role(), current_database(), current_schema(), current_version(), current_warehouse()').collect()
    print('User                     : {}'.format(snowflake_environment[0][0]))
    print('Role                     : {}'.format(snowflake_environment[0][1]))
    print('Database                 : {}'.format(snowflake_environment[0][2]))
    print('Schema                   : {}'.format(snowflake_environment[0][3]))
    print('Warehouse                : {}'.format(snowflake_environment[0][5]))
    print('Snowflake version        : {}'.format(snowflake_environment[0][4]))

#### Set up a connection with Snowflake
session = Session.builder.configs(connection_parameters).create()


In [2]:
current_snowflake_env()

User                     : SNOWFLAKETRIALUSER
Role                     : ACCOUNTADMIN
Database                 : SNOWFLAKE_SAMPLE_DATA
Schema                   : TPCH_SF10
Warehouse                : COMPUTE_WH
Snowflake version        : 7.11.6


In [3]:
session.add_packages("snowflake-snowpark-python", "pandas", "xgboost==1.7.3")

## Set up a new database

In [4]:
session.sql('CREATE OR REPLACE database cars_data').collect()


[Row(status='Database CARS_DATA successfully created.')]

In [5]:
session.sql('USE SCHEMA cars_data.public').collect()

[Row(status='Statement executed successfully.')]

## Get the cars data

We scraped cars-for-sale listings from several car sites. For each car we have....

In [6]:
# The euro signs and accented characters in the file appear to be UTF-8 encoded,
# so read it as UTF-8; any undecodable bytes are replaced instead of raising.
car_prices = pd.read_csv("https://raw.githubusercontent.com/longhowlam/snowpark_cars_model/master/autos_tekoop.zip", encoding = "utf-8", encoding_errors = "replace")

In [7]:
display(car_prices.sample(7))

| Unnamed: 0 | bouwjaar | km_stand | brandstof | motorinhoud | vermogen | transmissie | type | kleur | deur | prijs | merk | model | vraagprijs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 203707 | 2021 | 7976 | Benzine | 1998cc | 135kW | Automaat | Hatchback | Zwart | 5-deurs | € 69.022 | BMW | 4-serie | 69022 |
| 36127 | 2020 | 37646 | Benzine | 1991cc | 155kW | Automaat | Coupé | Zwart | 5-deurs | € 64.995 | Mercedes-Benz | GLC-klasse | 64995 |
| 8861 | 2021 | 24 | Elektrisch | | 150kW | Automaat | SUV / Terreinwagen | Grijs | 5-deurs | € 65.900 | Volkswagen | ID.4 | 65900 |
| 4141 | 2022 | 10 | Elektrisch | | | Automaat | SUV / Terreinwagen | Zwart | 5-deurs | € 53.070 | Skoda | Enyaq | 53070 |
| 62733 | 2021 | 53336 | Benzine | 1498cc | 110kW | Automaat | SUV / Terreinwagen | Grijs | 5-deurs | € 35.450 | Skoda | Karoq | 35450 |
| 220984 | 2013 | 29100 | Diesel | 2198cc | 90kW | Handgeschakeld | Bedrijfswagens | Grijs | 3-deurs | € 71.995 | Land | Rover | 71995 |
| 143905 | 2017 | 37723 | Benzine | 999cc | 52kW | Handgeschakeld | Hatchback | Paars | 5-deurs | € 8.950 | Mitsubishi | Space | 8950 |
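
In the sample above, `prijs` appears to be a formatted copy of `vraagprijs`: a currency symbol, then the amount with a dot as thousands separator. The sketch below checks that on a few of the sampled rows. `parse_prijs` is a hypothetical helper, and the assumption that prices never carry a decimal part comes only from the rows shown.

```python
import re


def parse_prijs(prijs: str) -> int:
    """Parse a Dutch-formatted price such as "€ 69.022" into whole euros.

    Dots are thousands separators here, so every non-digit character is
    dropped. Assumes there is no decimal part, as in the sampled rows.
    """
    digits = re.sub(r"\D", "", prijs)
    if not digits:
        raise ValueError(f"no digits in price: {prijs!r}")
    return int(digits)


# (prijs, vraagprijs) pairs copied from the sample above.
sample = [("€ 69.022", 69022), ("€ 8.950", 8950), ("€ 35.450", 35450)]
for prijs, vraagprijs in sample:
    assert parse_prijs(prijs) == vraagprijs
```

If the same holds for the full table, `prijs` carries no extra information and only `vraagprijs` is needed for modeling.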


## Create a Snowflake table

In [8]:
## quote_identifiers set to False, 
## identifiers are passed on to Snowflake without quoting, i.e. identifiers will be coerced to uppercase by Snowflake.

session.write_pandas(car_prices, "CAR_PRICES", auto_create_table = True, quote_identifiers = False, overwrite = True)

<snowflake.snowpark.table.Table at 0x7f9c1c3fbbe0>
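
To illustrate the comment in the cell above: Snowflake folds unquoted identifiers to uppercase and keeps the exact case of double-quoted ones, which is why the lowercase pandas column `bouwjaar` is addressed as `BOUWJAAR` in the rest of this notebook. The helper below is a hypothetical plain-Python model of that rule for simple names; it is not part of Snowpark.

```python
def resolved_name(identifier: str, quoted: bool) -> str:
    """Model how Snowflake stores a simple identifier.

    Unquoted identifiers are folded to uppercase; double-quoted
    identifiers keep their exact case.
    """
    return identifier if quoted else identifier.upper()


pandas_columns = ["bouwjaar", "km_stand", "vraagprijs"]

# quote_identifiers=False: names reach Snowflake unquoted.
print([resolved_name(c, quoted=False) for c in pandas_columns])

# quote_identifiers=True: names are quoted and stay lowercase, so later
# queries would have to quote them as well, e.g. "bouwjaar".
print([resolved_name(c, quoted=True) for c in pandas_columns])
```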

## Prepare data using Snowpark
Now that we have a table in Snowflake, we no longer use pandas for data manipulation but Snowpark instead.

In [9]:
cars_sf = session.table('CARS_DATA.PUBLIC.CAR_PRICES')

In [10]:
cars_sf.show()

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"  |"TYPE"               |"KLEUR"  |"DEUR"    |"PRIJS"     |"MERK"      |"MODEL"  |"VRAAGPRIJS"  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2018        |54700       |Elektrisch   |NULL           | 245kW      |Automaat       |Hatchback            | Rood    | 5-deurs  |€ 54.999    |Tesla       |Model    |54999         |
|2017        |56266       |Elektrisch   |NULL           |NULL        |Automaat       | Hatchback           |Wit      | 5-deurs  |€ 22.949    |Volkswagen  |e-Golf   |22949         |
|2021        |1498        |Elektrisch   |NULL           |NULL        |Automaat       | SUV / Te

### Create new columns: age from bouwjaar and N_doors from deur

In [11]:
cars_sf = (
    cars_sf
    .with_column('age' , 2023 - cars_sf['BOUWJAAR'])
    .with_column('N_doors', cars_sf["DEUR"].substring(1,2))
)

In [12]:
cars_sf.sample(n=10).show()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"       |"TRANSMISSIE"   |"TYPE"               |"KLEUR"   |"DEUR"    |"PRIJS"     |"MERK"      |"MODEL"  |"VRAAGPRIJS"  |"AGE"  |"N_DOORS"  |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2019        |11500       |Benzine      | 1199cc        | 81kW            |Handgeschakeld  | SUV / Terreinwagen  | Rood     | 5-deurs  |€ 24.940    |Peugeot     |2008     |24940         |4      | 5         |
|1997        |144857      |Benzine      | 1298cc        | 50kW            |Automaat        | Hatchback           | Blauw    | 2-deurs  |€ 899       |Suzuki      |Swift 

### Remove outliers and the prijs column

### Make vermogen, motorinhoud and deur numeric
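
The cells in this notebook do not yet convert vermogen, motorinhoud and deur to numbers. As a sketch of the parsing rule only, written in plain Python rather than Snowpark so it can be tried without a connection: each of the three columns holds a number followed by a unit or suffix (`135kW`, `1998cc`, `5-deurs`), sometimes with a leading space, so taking the first run of digits is enough. `leading_int` is a hypothetical helper name. In Snowpark the same idea would need a regular-expression or substring function followed by a cast to an integer type, which is not shown here.

```python
import re
from typing import Optional


def leading_int(value: Optional[str]) -> Optional[int]:
    """Return the first run of digits in value as an int, or None.

    Handles the leading spaces seen in the table output (" 245kW") and
    missing values (NULL in Snowflake, None here).
    """
    if value is None:
        return None
    match = re.search(r"\d+", value)
    return int(match.group()) if match else None


# Sample values copied from the table output in this notebook.
print(leading_int(" 245kW"))    # vermogen in kW      -> 245
print(leading_int(" 1199cc"))   # motorinhoud in cc   -> 1199
print(leading_int(" 5-deurs"))  # number of doors     -> 5
print(leading_int(None))        # missing motorinhoud -> None
```

Missing values are kept as missing rather than replaced by 0, since electric cars in this data have no motorinhoud and a 0 would read as a real measurement.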

In [13]:
#### Do some cleaning by removing outliers
cars_clean = (
    cars_sf
    .filter(F.col("KM_STAND") <= 500000)
)

In [14]:
cars_clean.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"BOUWJAAR"  |"KM_STAND"  |"BRANDSTOF"  |"MOTORINHOUD"  |"VERMOGEN"  |"TRANSMISSIE"  |"TYPE"               |"KLEUR"  |"DEUR"    |"PRIJS"     |"MERK"      |"MODEL"  |"VRAAGPRIJS"  |"AGE"  |"N_DOORS"  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2018        |54700       |Elektrisch   |NULL           | 245kW      |Automaat       |Hatchback            | Rood    | 5-deurs  |€ 54.999    |Tesla       |Model    |54999         |5      | 5         |
|2017        |56266       |Elektrisch   |NULL           |NULL        |Automaat       | Hatchback           |Wit      | 5-deurs  |€ 22.949    |Volkswagen  |e-Golf   |22949         |6      | 5      

## Gracefully close the Snowflake session

In [74]:
session.close()