
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://raw.githubusercontent.com/databricks/koalas/master/Koalas-logo.png" width="220"/>
</div>

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark. By unifying the two ecosystems with a familiar API, Koalas offers a seamless transition between small and large data.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
- Demonstrate the similarities of the Koalas API with the pandas API
- Understand the differences in syntax for the same DataFrame operations in Koalas vs PySpark

[Koalas Docs](https://koalas.readthedocs.io/en/latest/index.html), [Koalas Github](https://github.com/databricks/koalas), Spark+AI Summit Talks from [Niall Turbitt](https://www.youtube.com/watch?v=iUpBSHoqzLM&feature=youtu.be) & [Takuya Ueshin](https://www.youtube.com/watch?v=G_-9VbyHcx8&feature=youtu.be)

`koalas` comes pre-installed on DBR 7.1 for ML.

### [Performance](https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html)

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2019/08/koalas-image4.png" width="1000"/>
</div>

**Pandas** DataFrames are mutable, eagerly evaluated, and maintain row order. They are restricted to a single machine, and are very performant when the data sets are small, as shown in a).

**Spark** DataFrames are distributed, lazily evaluated, immutable, and do not maintain row order. They are very performant when working at scale, as shown in b) and c).

**Koalas** provides the best of both worlds: the pandas API with the performance benefits of Spark. However, it is not as fast as implementing your solution natively in Spark. The InternalFrame design described in the following sections explains why.
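
The eager versus lazy distinction above can be illustrated without a cluster. The following is a plain-Python analogy, not pandas or Spark code: a list comprehension stands in for eager evaluation, and the built-in `map` stands in for a lazily built plan that only runs when an action (here `list`) asks for results. The `double` function and the `calls` log are made up for the illustration.

```python
calls = []  # records every evaluation so we can see when work actually happens


def double(x):
    calls.append(x)
    return x * 2


data = [1, 2, 3]

# Eager (pandas-like): the transformation runs as soon as the line executes.
eager_result = [double(x) for x in data]
calls_after_eager = len(calls)

# Lazy (Spark-like): map() only describes the work; nothing has run yet.
calls.clear()
lazy_plan = map(double, data)
calls_before_action = len(calls)

# An "action" (here list()) finally triggers the computation.
lazy_result = list(lazy_plan)
calls_after_action = len(calls)

print(calls_after_eager, calls_before_action, calls_after_action)  # 3 0 3
```

Koalas sits on the lazy side of this picture: a pandas-style call generally extends a Spark plan rather than computing a result immediately.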

## InternalFrame

The InternalFrame holds the current Spark DataFrame and internal immutable metadata.

It manages mappings from Koalas column names to Spark column names, as well as from Koalas index names to Spark column names. 

When a user calls an API method, the Koalas DataFrame updates the Spark DataFrame and the metadata in the InternalFrame. It creates a new InternalFrame, or copies the current one, with the new state and returns a new Koalas DataFrame.

![](https://files.training.databricks.com/images/301/InternalFrame.png)
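
The flow described above can be sketched with a toy model. This is not Koalas's actual implementation: `ToyInternalFrame`, `ToyKoalasFrame`, and `with_constant` are hypothetical names, a tuple of dicts stands in for the Spark DataFrame, and the index name `listing_id` is invented. The sketch only shows the pattern of immutable metadata plus "copy with new state, return a new frame".

```python
from dataclasses import dataclass, replace
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ToyInternalFrame:
    """Toy stand-in for InternalFrame: data plus read-only name mappings."""
    sdf: Tuple[dict, ...]          # stand-in for the Spark DataFrame (a tuple of rows)
    column_map: Mapping[str, str]  # Koalas column name -> Spark column name
    index_map: Mapping[str, str]   # Koalas index name  -> Spark column name


class ToyKoalasFrame:
    """Toy stand-in for a Koalas DataFrame wrapping one ToyInternalFrame."""

    def __init__(self, internal: ToyInternalFrame):
        self._internal = internal

    def with_constant(self, name: str, value) -> "ToyKoalasFrame":
        # Update the "Spark DataFrame": build new rows instead of mutating old ones.
        new_sdf = tuple({**row, name: value} for row in self._internal.sdf)
        # Update the metadata: register the mapping for the new column.
        new_columns = MappingProxyType({**self._internal.column_map, name: name})
        # Copy the InternalFrame with the new state and wrap it in a new frame.
        new_internal = replace(self._internal, sdf=new_sdf, column_map=new_columns)
        return ToyKoalasFrame(new_internal)


internal = ToyInternalFrame(
    sdf=({"__index_level_0__": 0, "price": 100.0},
         {"__index_level_0__": 1, "price": 250.0}),
    column_map=MappingProxyType({"price": "price"}),
    index_map=MappingProxyType({"listing_id": "__index_level_0__"}),
)
kdf_v1 = ToyKoalasFrame(internal)
kdf_v2 = kdf_v1.with_constant("currency", "USD")

print(sorted(kdf_v1._internal.column_map))  # ['price']
print(sorted(kdf_v2._internal.column_map))  # ['currency', 'price']
```

The original frame is untouched, which is why every such call pays for building a new wrapper and new metadata on top of the Spark plan.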

## InternalFrame Metadata Updates Only

Sometimes only the metadata needs to change and the Spark DataFrame does not. In that case, the new structure looks like this:

![](https://files.training.databricks.com/images/301/InternalFrameMetadata.png)

## InternalFrame Inplace Updates

In other cases, a Koalas DataFrame updates its internal state instead of returning a new DataFrame, for example when the argument `inplace=True` is provided. The new structure then looks like this:

![](https://files.training.databricks.com/images/301/InternalFrameInPlace.png)
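
The metadata-only and in-place cases can be sketched in the same spirit. Again this is a toy model under stated assumptions, not the library's code: `ToyInternalFrame`, `ToyKoalasFrame`, and `rename_column` are hypothetical, and a tuple of dicts stands in for the Spark DataFrame. A rename only re-points a Koalas name to the same Spark column, so the data object is shared; with `inplace=True` the existing frame swaps in the new InternalFrame and returns `None`.

```python
from dataclasses import dataclass, replace
from typing import Mapping, Tuple


@dataclass(frozen=True)
class ToyInternalFrame:
    sdf: Tuple[dict, ...]          # stand-in for the Spark DataFrame
    column_map: Mapping[str, str]  # Koalas column name -> Spark column name


class ToyKoalasFrame:
    def __init__(self, internal: ToyInternalFrame):
        self._internal = internal

    def rename_column(self, old: str, new: str, inplace: bool = False):
        # Metadata-only update: change the Koalas name, keep the Spark column and data.
        columns = dict(self._internal.column_map)
        columns[new] = columns.pop(old)
        new_internal = replace(self._internal, column_map=columns)
        if inplace:
            # In-place update: this frame now points at the new InternalFrame.
            self._internal = new_internal
            return None
        return ToyKoalasFrame(new_internal)


kdf = ToyKoalasFrame(
    ToyInternalFrame(sdf=({"price": 100.0},), column_map={"price": "price"})
)

# Metadata-only: a new frame that shares the same underlying data object.
renamed = kdf.rename_column("price", "nightly_price")
shares_data = renamed._internal.sdf is kdf._internal.sdf

# In-place: no new frame is returned; kdf itself changes.
result = kdf.rename_column("price", "cost", inplace=True)

print(shares_data, result, dict(kdf._internal.column_map))  # True None {'cost': 'price'}
```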

### Read in the dataset

* PySpark
* pandas
* Koalas

Read in Parquet with PySpark

In [0]:
df = spark.read.parquet("dbfs:/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.parquet/")
display(df)

Read in Parquet with pandas

In [0]:
import pandas as pd

pdDF = pd.read_parquet("/dbfs/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.parquet/")
pdDF.head()

Read in Parquet with Koalas. You'll notice Koalas generates an index column for you, like in pandas.

Koalas also supports reading from Delta (`read_delta`), but pandas does not support that yet.

In [0]:
import databricks.koalas as ks

kdf = ks.read_parquet("dbfs:/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.parquet/")
kdf.head()

### [Index Types](https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type)

![](https://files.training.databricks.com/images/301/koalas_index.png)

In [0]:
ks.set_option("compute.default_index_type", "distributed-sequence")
kdf_dist_sequence = ks.read_parquet("dbfs:/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.parquet/")
kdf_dist_sequence.head()
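
As the linked options documentation describes, `sequence` builds a consecutive index but may need to bring the data through a single node, `distributed-sequence` builds a consecutive index in a distributed manner, and `distributed` relies on Spark's `monotonically_increasing_id`, which yields increasing but non-consecutive values. The sketch below is a plain-Python illustration of that last difference, assuming the layout Spark documents for `monotonically_increasing_id` (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The helper names and partition sizes are hypothetical.

```python
def sequence_index(partition_sizes):
    """Consecutive index 0..n-1, regardless of how rows are partitioned."""
    return list(range(sum(partition_sizes)))


def distributed_index(partition_sizes):
    """Mimic the monotonically increasing ID layout, partition by partition."""
    ids = []
    for partition_id, size in enumerate(partition_sizes):
        base = partition_id << 33  # partition ID occupies the upper bits
        ids.extend(base + offset for offset in range(size))
    return ids


partition_sizes = [3, 2]  # two partitions holding 3 and 2 rows

print(sequence_index(partition_sizes))     # [0, 1, 2, 3, 4]
print(distributed_index(partition_sizes))  # [0, 1, 2, 8589934592, 8589934593]
```

The `distributed` values are cheap to produce because no partition needs to know how many rows the others hold, but they cannot be used as row positions.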

### Converting a Koalas DataFrame to/from a Spark DataFrame

Create a Koalas DataFrame from a PySpark DataFrame

In [0]:
kdf = ks.DataFrame(df)
display(kdf)

An alternative way to create a Koalas DataFrame from a PySpark DataFrame

In [0]:
kdf = df.to_koalas()
display(kdf)

Go from a Koalas DataFrame to a Spark DataFrame

In [0]:
display(kdf.to_spark())

### Value Counts

Get value counts of the different property types with PySpark

In [0]:
display(df.groupby("property_type").count().orderBy("count", ascending=False))

Get value counts of the different property types with Koalas

In [0]:
kdf["property_type"].value_counts()

### Visualizations with Koalas DataFrames

In [0]:
kdf.plot(kind="hist", x="bedrooms", y="price", bins=200)

### SQL on Koalas DataFrames

In [0]:
ks.sql("select distinct(property_type) from {kdf}")

### Interesting Facts

* With Koalas you can read from Delta Tables and read in a directory of files
* If you use `apply` on a Koalas DataFrame that has fewer than 1,000 rows (the default limit), Koalas uses pandas as a shortcut; the limit can be adjusted using `compute.shortcut_limit`
* When you create bar plots, only the top n rows are used; n can be adjusted using `plotting.max_rows`
* `.apply` ([docs](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.apply.html#databricks.koalas.DataFrame.apply)) uses return type hints, similar to pandas UDFs
* Checking the execution plan and caching a Koalas DataFrame are not immediately intuitive
* Koalas are marsupials whose max speed is 30 kph (20 mph)
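
The `apply` and `compute.shortcut_limit` points above can be combined in a short sketch. It assumes the `price` column is numeric; the function names, the threshold of 200, and the limit of 5000 are illustrative. The Koalas module and the DataFrame are passed in as arguments (a hypothetical design choice) so the helper does not rely on notebook globals. A function with a return type hint tells Koalas the result type up front, while one without a hint relies on the sampling shortcut, whose size `ks.option_context` can raise temporarily.

```python
def price_in_thousands(price) -> float:
    """Return type hint present: the result type is declared, not inferred."""
    return price / 1000.0


def price_bucket(price):
    """No return type hint: the result type has to be inferred from a sample."""
    return "high" if price > 200 else "low"


def apply_both(ks, kdf, shortcut_limit=5000):
    """ks: the databricks.koalas module; kdf: a Koalas DataFrame with a 'price' column."""
    hinted = kdf["price"].apply(price_in_thousands)
    # Raise the shortcut limit only for the call that depends on it.
    with ks.option_context("compute.shortcut_limit", shortcut_limit):
        unhinted = kdf["price"].apply(price_bucket)
    return hinted, unhinted
```

In a notebook this would be called as `apply_both(ks, kdf)` after `import databricks.koalas as ks`. The option returns to its previous value when the `with` block exits.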

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>