# DS-forum
Friday 12 May 2023

This tutorial is shamelessly stolen and somewhat modified for today's session. You can find the original via the link given in the text below.


![picture](./static/polars-header.jpeg)

# Intro from the original author
In this article we are going to take a closer look at [Polars](https://github.com/ritchie46/polars). Polars is a new DataFrame library implemented in Rust with convenient Python bindings. The [benchmark of H2Oai](https://h2oai.github.io/db-benchmark/) shows that it is one of the fastest DataFrame libraries of the moment. From the Polars book: '_The goal of Polars is being a fast DataFrame library that utilizes the available cores on your machine. Its ideal use case is data too big for pandas and too small for spark. Similar to spark Polars consists of a query planner that may (and probably does) optimize your query in order to do less work or reduce memory usage._'.

Polars offers both an eager and a lazy API. The lazy API is said to be 'somewhat similar to spark'. It allows the query to be optimised before it is run, promising 'blazingly' fast performance.

In this article, we will give a first introduction in Python and work with some of the functionality of this new dataframe package to get an idea of what it has to offer. In the first part of the article we will use the eager API from Polars, and at the end we will use the lazy API to check the syntax and see the differences.

To explore the functionalities of Polars we are going to use the [Wine Review dataset](https://www.kaggle.com/zynicide/wine-reviews/) with 150k wine reviews with variety, location, winery, price, and descriptions.

You can download the dataset that we will use [on Kaggle](https://www.kaggle.com/zynicide/wine-reviews/?select=winemag-data_first150k.csv).

It is also possible to run the cells in this article yourself and play around with the code along the way. You can find this article in Jupyter notebook format on my [Github page](https://github.com/r-brink/polars-tutorial/blob/master/polars-tutorial.ipynb).

## Installing Polars

We can easily install Polars via PyPI with the following command:

`pip install polars==0.17.12`

In this article, we will specifically use the 0.17.12 release of Polars, the version pinned in the command above. Polars is still in an early stage of development, so a lot may change before the first truly stable version, 1.0.

*Note: as a best practice, don't forget to create and activate your virtual environment before installing Polars*

# Import relevant packages

We have prepared a requirements file, so run `pip install -r requirements.txt` in the terminal, in the environment you want to work in, and everything should run smoothly.


# Ready to get started?

At PyCon we learned that the following 7 operations should get us through most tasks:
- select,
- with_columns,
- filter,
- join,
- groupby,
- agg,
- sort


To work with Polars and start analysing the Wine Review dataset we are going to import two packages: Polars and Matplotlib.

In [None]:
import polars as pl
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#ssb_yrke_7_4_api = 'http://data.ssb.no/api/klass/v1/correspondencetables/426.csv'

Polars already offers much of the functionality that you will be familiar with if you have worked with Pandas before. We can find an overview, including examples (for most), in the [reference guide](https://ritchie46.github.io/polars/python/polars/index.html).

Let's load the dataset and start our analyses.

In [None]:
data = pl.read_csv('winemag-data_first150k.csv')
print(type(data))

Now that the data is read into the dataframe, let's have a closer look at it.

## Starting with the eager API

### Dataset inspection

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.dtypes

Below we use `sample()` to get random rows from the dataset and get a feel for the data that is available. Polars also offers common functions like `head` and `tail`.

In [None]:
print(data.sample(n=5))

In [None]:
data.head()

The dataset has a lot to offer. With 11 variables and over 150k rows there is a lot of data to analyse. We see a couple of variables that are interesting to look into, like `price`, `country`, `points`.

Before we continue, we want to check whether there are any nulls in the dataset.

In [None]:
data.null_count()

It seems that a little less than 10% of the `price` values are missing. We can either drop the rows with missing values or fill them. In this article, we choose the mean as the filling strategy.

# with_columns()

In [None]:
data = data.with_columns(pl.col('price').fill_null(strategy='mean').alias('price'))  # the alias is entirely unnecessary here
data.null_count()

## EXERCISE: Given an exchange rate of 10 NOK, create a column with the Norwegian price:

In [None]:
krone_kurs = 10
(data
 .with_columns((pl.col('price')*krone_kurs)
               .alias('price_nok')))

### Some analyses

The next step is to dive in a little deeper and have a closer look at the dataset with some more complex functions.

The goal that we want to achieve in the following part is to have a closer look at the countries and how they compare in terms of price and points.

In [None]:
# Analyses of wine prices
print(f'Median price: {data["price"].median()}')
print(f'Average price: {data["price"].mean()}')
print(f'Maximum price: {data["price"].max()}')
print(f'Minimum price: {data["price"].min()}')

In [None]:
# Analyses of wine points
print(f'Median points: {data["points"].median()}')
print(f'Average points: {data["points"].mean()}')
print(f'Maximum points: {data["points"].max()}')
print(f'Minimum points: {data["points"].min()}')

The minimum number of points shows that there is no such thing as bad wine.

# filter()

In [None]:
data.filter((pl.col('price') > 10))

In [None]:
# Get a list of unique countries that are in the dataset
data['country'].unique().to_list()

In [None]:
print(f'There are {len(data["country"].unique())} countries in the list')

## EXERCISE: Find the "errors" in the list of countries, check how many rows are affected, and remove those rows.


There are two strange values in our dataset: an undefined country ("") and a country called 'US-France'.

In [None]:
data.filter((pl.col('country').is_null()) | (pl.col('country') == 'US-France'))

There were only 6 of them, so it was safe to drop them.

In [None]:
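# Keep only rows with a known country other than 'US-France', and store the result so the rows are really removed
data = data.filter(pl.col("country").is_not_null() & (pl.col("country") != "US-France"))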

Time to look into which countries produce the best wine according to the points, and which have the highest price for a bottle.

# groupby() & agg()

We group by country, select the `points` variable and call the mean to see the average number of points.

In [None]:

(data
 .groupby('country')
 .agg(pl.col('points')
      .mean()
      .alias('points_mean')))


# sort()
After that we sort the list by 'average points'.

In [None]:
(data
 .groupby('country')
 .agg(pl.col('points')
      .mean()
      .alias('points_mean'))
    .sort('points_mean', descending=True)
 )

England is leading the list for the best wines. I wonder what they think about that on the other side of the Channel, in France.

## EXERCISE: Find the maximum price per country and sort from highest to lowest.

In [None]:
data.groupby('country').agg(pl.col('price').max().alias('price_max')).sort('price_max', descending=True)

### Plotting while using Polars

To get better insight into the differences, it always helps to have some nice plots. Where Pandas has plotting functionality built in, we have to rely on our Matplotlib skills for Polars. We focus on the top 15 countries.

In [None]:
# Get a list of the top 15 countries by taking the first 15 rows of the groupby that we did earlier
top_15_countries = (
    data
    .groupby('country')
    .agg(pl.col('points').mean().alias('points_mean'))
    .sort('points_mean', descending=True)[0:15, 0]
)

In [None]:
top_15_countries

# join()

In [None]:
pl.DataFrame({'country': top_15_countries})

In [None]:
df_top15 = pl.DataFrame({'country': top_15_countries}).join(data, on='country', how='left')

In [None]:
df_top15

Now that we have the top 15 countries, it is time to take a closer look at the distribution of points per `country`.

In [None]:
# How to filter
df_top15.filter(pl.col('country') == 'France')

In [None]:
# Note: with_columns() does not filter rows; here it overwrites `country` with a boolean column
df_top15.with_columns(pl.col('country') == 'France')['points']

In [None]:
fig, ax = plt.subplots(figsize=(15, 5))

for i, x in enumerate(df_top15['country'].unique()):

    ax.boxplot(df_top15.filter(pl.col('country') == x)['points'], labels=[str(x)], positions=[i])

plt.xticks(rotation=90)
plt.xlabel('Countries')
plt.ylabel('Points')
plt.show()

## Time to go lazy

The lazy API offers a way to optimise your queries, similar to Spark. The major benefit over Spark is that we don't have to set up a separate environment and can therefore continue working from our notebook.

More information can be found in the [Polars-book](https://ritchie46.github.io/polars-book/lazy_polars/intro.html)

In [None]:
import polars as pl

In [None]:
lazy_df = pl.scan_csv('winemag-data_first150k.csv', ignore_errors=True)
print(type(lazy_df))

Printing the type shows that we now have a `LazyFrame`, so the query can be built up without reading the full dataset yet. Next, we group by `country` and compute the average `points`, to compare with the result from the eager API that we used earlier.

Similar to the filters that we did with the eager API we are going to filter the unknown and 'US-France' values in the `country` variable first.

In [None]:
lazy_df = (
    lazy_df
    .filter(pl.col('country').is_not_null())
    .filter(pl.col('country') != "US-France")
)

As we can see nothing happens right away. From the documentation: '_This is due to the lazyness, nothing will happen until specifically requested. This allows Polars to see the whole context of a query and optimize just in time for execution._'

In [None]:
lazy_df = (
    lazy_df
    .groupby('country')
    .agg([pl.mean('points').alias('avg_points')])
    .sort("avg_points", descending=True)
)

As we can see, the syntax of the lazy API is different from what we did in the beginning. Although it takes some getting used to, the syntax gives a nice overview of the different steps we want to take.

To actually see the results we can do two things: `collect()` and `fetch()`. The difference is that `fetch` takes the first 500 rows and then runs the query, whereas `collect` runs the query over all the results. Below we can see the differences for our case.

In [None]:
print(lazy_df.collect())

In [None]:
print(lazy_df.fetch())

In [None]:
print(f'The length of collect() is {len(lazy_df.collect())}')
print(f'The length of fetch() is {len(lazy_df.fetch())}')

## Output

We have the output that we were looking for. Polars offers several ways to output our analyses, including formats useful for further analysis, e.g. a pandas dataframe (`to_pandas()`) or NumPy arrays (`to_numpy()`).

In [None]:
lazy_df.collect().write_csv('results.csv')

## Final word


Polars is a new package that is gaining a lot of attention. At the time of writing this article, it has gathered more than 1300 stars on GitHub, which is impressive given that it has been around for less than a year. It offers almost all the functions that we need to manipulate our dataframe. In addition, it offers a lazy API that helps us optimise our queries before we execute them. Although we didn't touch on it in this article, the H2O benchmark shows that it is very efficient and fast. Especially with larger datasets, it becomes worthwhile to look into the benefits that the lazy API has to offer.

I hope this article showed some of the potential Polars has to offer. There is a lot more to explore. The developer behind Polars is very responsive to issues. For (beginning) open source developers there are plenty of opportunities to contribute, both on the Python and Rust side. If you want to know more about the design decisions in Polars, I highly recommend [this blogpost](https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/) from the developer behind the package.

[link to Polars' Github page](https://github.com/ritchie46/polars)



![polars-logo](./static/polars-logo-dark.svg)