### 15 minutes intro to Koalas 


**Pandas** is a great tool to analyze small datasets on a single machine. As we have discussed in a previous notebook, once the need for bigger datasets arises, users often choose **PySpark**.  
However, converting code from pandas to PySpark is not easy, because the PySpark APIs differ considerably from the pandas APIs.  

<br>
<img src="https://miro.medium.com/max/900/1*zV6AdoIf6tpAirU3KnxMlw.jpeg" width="512" height="384" />

**Koalas** flattens the learning curve significantly by providing pandas-like APIs on top of PySpark.  
With Koalas, users can take advantage of the benefits of PySpark with minimal effort, and thus get to value much faster.

+ [Code Source](https://github.com/databricks/koalas)  
+ [Koalas: Easy Transition from pandas to Apache Spark](https://www.databricks.com/blog/2019/04/24/koalas-easy-transition-from-pandas-to-apache-spark.html)  
+ [10 Minutes from pandas to Koalas on Apache Spark](https://www.databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html)  

This is a short introduction to Koalas, geared mainly at new users.  
This notebook shows you some key differences between Pandas and Koalas.  

Customarily, we import Koalas as follows:

In [0]:
import pandas as pd
import numpy as np
import pyspark.pandas as ps   # Koalas ships with PySpark 3.2+ as the pandas API on Spark


## Object Creation

You can create a Koalas series by passing a list of values, letting Koalas create a default integer index:

In [0]:
s = ps.Series([1, 3, 5, np.nan, 6, 8])

In [0]:
s

You may notice that the values in the Series are ordered differently from how they were in the list. This is an inherent feature of Koalas and PySpark. We will discuss the reason for this later.

In [0]:
type(s)

We can also create a Koalas DataFrame by passing a dict of objects that can be converted to Series-like objects.

In [0]:
kdf = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])

kdf

We create a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns.
We start by creating the datetime index:

In [0]:
dates = pd.date_range('20130101', periods=6)

dates

Next, we create the Pandas dataframe filled with random numbers:

In [0]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

pdf

Now, this pandas DataFrame can be converted to a Koalas DataFrame using `from_pandas()`:

In [0]:
kdf = ps.from_pandas(pdf)
type(kdf)

This Koalas DataFrame looks and behaves much like a pandas DataFrame:

In [0]:
kdf.head()

It is also possible to create a Koalas DataFrame from a Spark DataFrame.  

First, create a Spark DataFrame from the pandas DataFrame using `createDataFrame()`:

In [0]:
sdf = spark.createDataFrame(pdf)

In [0]:
sdf.show()

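With `pyspark.pandas`, a Spark DataFrame is converted to a Koalas (pandas-on-Spark) DataFrame with the `pandas_api()` method (PySpark 3.3 and later; PySpark 3.2 uses `to_pandas_on_spark()`).
In the original `databricks.koalas` package the equivalent method is `to_koalas()`, which is attached to the Spark DataFrame automatically when Koalas is imported.

In [0]:
kdf = sdf.pandas_api()  # with the legacy databricks.koalas package: sdf.to_koalas()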

In [0]:
kdf.head()

Koalas Dataframes have specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes).  
Types that are common to both Spark and pandas are currently supported.

In [0]:
kdf.dtypes

## Viewing Data

See the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html).

See the top rows of the `kdf` frame. The results may not be the same as in pandas, though: unlike pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of index.
When asked for the head of a DataFrame, Spark just takes the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` or `.iloc` instead.

In [0]:
kdf.head()
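
The advice to prefer label-based selection over `head()` can be illustrated with a small sketch. It uses plain pandas so that it runs without a Spark session, and it shuffles the rows only to mimic a frame whose physical row order is arbitrary; the names `shuffled`, `by_label`, and `first_two` are illustrative and not part of the notebook.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame({"A": range(6)}, index=dates)

# Shuffle the rows to mimic a frame whose physical row order is arbitrary.
shuffled = df.sample(frac=1, random_state=0)

# Label-based selection returns the requested rows regardless of row order.
by_label = shuffled.loc[[dates[0], dates[1]]]

# Sorting by the index first makes head() return a well-defined set of rows.
first_two = shuffled.sort_index().head(2)

print(by_label)
print(first_two)
```

Both selections return the rows labelled 2013-01-01 and 2013-01-02, whatever order the rows happen to be stored in.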

You can display the index, columns, and underlying NumPy data of the DataFrame. An index column can also be assigned to a DataFrame (see later).

In [0]:
kdf.index

In [0]:
kdf.columns

In [0]:
nparray = kdf.to_numpy()
print(nparray)

`describe()` shows a quick statistical summary of your data:

In [0]:
kdf.describe()

Transposing your data also works as usual:

In [0]:
kdf.T

Of course, you can also sort the Koalas DataFrame, for example by its index:

In [0]:
kdf.sort_index(ascending=False)

Similarly, you can sort by value:

In [0]:
kdf.sort_values(by='B')

## Missing Data
Koalas primarily uses the value `np.nan` to represent missing data. These NaN values are by default not included in computations.

In [0]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [0]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [0]:
kdf1 = ps.from_pandas(pdf1)

In [0]:
kdf1

Drop any rows that have missing data:

In [0]:
kdf1.dropna(how='any')

Fill missing data with a given value:

In [0]:
kdf1.fillna(value=5)
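
As a minimal sketch of what these two calls do, the example below applies them to a tiny hand-made frame. It uses plain pandas so that it runs without a Spark session; the frame `df` and its column names are assumptions made up for illustration, and the `how="all"` variant is shown only for contrast with `how="any"`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [10.0, 20.0, np.nan, np.nan],
})

# how="any": drop a row if at least one of its values is missing.
any_dropped = df.dropna(how="any")

# how="all": drop a row only if every value in it is missing.
all_dropped = df.dropna(how="all")

# Replace every missing value with the constant 5.
filled = df.fillna(value=5)

print(any_dropped)
print(all_dropped)
print(filled)
```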

## Operations

### Stats
Operations in general exclude missing data.

Compute a descriptive statistic:

In [0]:
kdf.mean()
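
To make the "missing data is excluded" behaviour concrete, here is a small sketch on the same values used for the Series at the start of this notebook. It uses plain pandas so that it runs without a Spark session; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# By default the NaN is skipped: (1 + 3 + 5 + 6 + 8) / 5.
mean_skipping_nan = s.mean()

# With skipna=False the NaN propagates into the result.
mean_with_nan = s.mean(skipna=False)

# count() reports only the non-missing values.
non_missing = s.count()

print(mean_skipping_nan, mean_with_nan, non_missing)
```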

### Spark Configurations

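Various PySpark configurations can be applied internally in Koalas.
For example, you can enable Arrow optimization to greatly speed up the internal pandas conversion. See <a href="https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html">PySpark Usage Guide for Pandas with Apache Arrow</a>. Note that since Spark 3.0 the key `spark.sql.execution.arrow.enabled` used in the cells below is deprecated in favor of `spark.sql.execution.arrow.pyspark.enabled`.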

In [0]:
prev = spark.conf.get("spark.sql.execution.arrow.enabled")  # Remember the current value so it can be restored.
ps.set_option("compute.default_index_type", "distributed")  # Use a distributed default index to avoid indexing overhead.

import warnings
warnings.filterwarnings("ignore")  # Ignore warnings coming from Arrow optimizations.

In [0]:
spark.conf.set("spark.sql.execution.arrow.enabled", True)
%timeit ps.range(300000).to_pandas()

In [0]:
spark.conf.set("spark.sql.execution.arrow.enabled", False)
%timeit ps.range(300000).to_pandas()

In [0]:
ps.reset_option("compute.default_index_type")
spark.conf.set("spark.sql.execution.arrow.enabled", prev)  # Restore the previous value.
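
The save, set, and restore pattern used in the cells above can be wrapped in a context manager so that the previous value is restored even if the timed code raises. The following is only a sketch: `temporary_conf` is a hypothetical helper, not part of PySpark or Koalas, and `DictConf` is a stand-in for `spark.conf` so that the example runs without a Spark session.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the with-block, then restore the old value."""
    previous = conf.get(key)
    conf.set(key, value)
    try:
        yield
    finally:
        conf.set(key, previous)


class DictConf:
    """Stand-in (assumption) for spark.conf: exposes get(key) and set(key, value)."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


KEY = "spark.sql.execution.arrow.enabled"
conf = DictConf({KEY: "false"})

with temporary_conf(conf, KEY, "true"):
    inside = conf.get(KEY)   # value while the block is running
after = conf.get(KEY)        # value once the block has exited

print(inside, after)
```

In the notebook, one would presumably pass `spark.conf` instead of the `DictConf` stand-in.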

## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
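
The three steps above can be sketched in plain Python, without pandas or Spark, to show what a group-by sum does conceptually. The sample `rows` are made up for illustration; this is not how Koalas implements grouping internally.

```python
from collections import defaultdict

rows = [("foo", 1.0), ("bar", 2.0), ("foo", 3.0), ("bar", 4.0), ("foo", 5.0)]

# Split: collect the values belonging to each key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Apply: run a function (here, sum) on each group independently.
sums = {key: sum(values) for key, values in groups.items()}

# Combine: gather the per-group results into one structure, ordered by key.
result = dict(sorted(sums.items()))

print(result)  # {'bar': 6.0, 'foo': 9.0}
```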

In [0]:
kdf = ps.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [0]:
kdf

Group the data and then apply the [sum()](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.sum.html#databricks.koalas.groupby.GroupBy.sum) function to the resulting groups:

In [0]:
kdf.groupby('A').sum()

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [0]:
kdf.groupby(['A', 'B']).sum()
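
The hierarchical index produced by a multi-column group-by can be inspected and flattened as in the sketch below. It uses plain pandas with a small deterministic frame so that it runs without a Spark session; the data and the names `summed` and `flat` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "one", "two"],
    "C": [1, 2, 3, 4],
})

# Grouping by two columns yields a two-level (hierarchical) index.
summed = df.groupby(["A", "B"]).sum()

# A single group is addressed with a tuple of labels, one per level.
foo_one = summed.loc[("foo", "one"), "C"]

# reset_index() turns the index levels back into ordinary columns.
flat = summed.reset_index()

print(summed)
print(foo_one)
print(flat)
```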

## Plotting
See the <a href="https://koalas.readthedocs.io/en/latest/reference/frame.html#plotting">Plotting</a> docs.

In [0]:
pser = pd.Series(np.random.randn(1000),
                 index=pd.date_range('1/1/2000', periods=1000))

In [0]:
kser = ps.Series(pser)

In [0]:
kser = kser.cummax()

In [0]:
kser.plot()

On a DataFrame, the <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.frame.DataFrame.plot.html#databricks.koalas.frame.DataFrame.plot">plot()</a> method is a convenience to plot all of the columns with labels:

In [0]:
pdf = pd.DataFrame(np.random.randn(1000, 4), index=pser.index,
                   columns=['A', 'B', 'C', 'D'])

In [0]:
kdf = ps.from_pandas(pdf)

In [0]:
kdf = kdf.cummax()

In [0]:
kdf.plot()

## Getting data in/out
See the <a href="https://koalas.readthedocs.io/en/latest/reference/io.html">Input/Output</a> docs.

##### Let's first check the path to our tmp folder

In [0]:
%fs ls

### CSV

CSV is straightforward and easy to use. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html#databricks.koalas.DataFrame.to_csv">here</a> to write a CSV file and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_csv.html#databricks.koalas.read_csv">here</a> to read a CSV file.

In [0]:
kdf.to_csv('dbfs:/tmp/foo.csv')
ps.read_csv('dbfs:/tmp/foo.csv').head(10)

### Parquet

Parquet is an efficient and compact file format that is faster to read and write. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html#databricks.koalas.DataFrame.to_parquet">here</a> to write a Parquet file and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_parquet.html#databricks.koalas.read_parquet">here</a> to read a Parquet file.

In [0]:
kdf.to_parquet('dbfs:/tmp/bar.parquet')
ps.read_parquet('dbfs:/tmp/bar.parquet').head(10)

### Spark IO

In addition, Koalas fully supports Spark's various data sources, such as ORC and external data sources. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html#databricks.koalas.DataFrame.to_spark_io">here</a> to write a DataFrame to a specified data source and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_spark_io.html#databricks.koalas.read_spark_io">here</a> to read one from a data source.

In [0]:
kdf.to_spark_io('dbfs:/tmp/zoo.orc', format="orc")
ps.read_spark_io('dbfs:/tmp/zoo.orc', format="orc").head(10)