## A beautiful way to work with data

This is an API proposal for accessing data.

Dataframes have rows and columns.

* To access columns, just use df.cols()
* To access rows, just use df.rows()
* I/O operations to load and save data live in Optimus: op.load.csv() and op.save.csv()

Easy and simple

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from optimus import *

from pyspark.sql.session import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType

sc = SparkSession.builder.getOrCreate()

In [3]:
# Create optimus
op = Optimus(sc)

Using a created Spark Session...
Done.


## Create dataframe
### Spark

Creating a dataframe with the plain Spark API (Scala shown here) is verbose and ugly:

```
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
```

In [4]:
# Thanks Mr Powers
df = op.create.df([
                ("  I like     fish  ", 1, "dog", "housé", "cat-car", "a","1"),
                ("    zombies", 2, "cat", "tv", "dog-tv", "b","2"),
                ("simpsons   cat lady", 2, "frog", "table","eagle-tv-plus","1","3"),
                (None, 3, "eagle", "glass", "lion-pc", "c","4")
            ],
            [
                ("words", "str", True),
                ("num", "int", True),
                ("animals", "str", True),
                ("thing", StringType(), True),
                ("two strings", StringType(), True),
                ("filter", StringType(), True),
                ("num 2", "string", True)

            ])

df.show()

+-------------------+---+-------+-----+-------------+------+-----+
|              words|num|animals|thing|  two strings|filter|num 2|
+-------------------+---+-------+-----+-------------+------+-----+
|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|
|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|
|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|
|               null|  3|  eagle|glass|      lion-pc|     c|    4|
+-------------------+---+-------+-----+-------------+------+-----+



## Create Columns
### Spark
* You cannot create multiple columns at the same time
* You need to wrap constant values in the lit function, whose name is far from self-explanatory

### Pandas
* Similar behavior


In [5]:
df = df.cols().create("new_col_1", 1)
df.show()

+-------------------+---+-------+-----+-------------+------+-----+---------+
|              words|num|animals|thing|  two strings|filter|num 2|new_col_1|
+-------------------+---+-------+-----+-------------+------+-----+---------+
|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|        1|
|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|        1|
|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|        1|
|               null|  3|  eagle|glass|      lion-pc|     c|    4|        1|
+-------------------+---+-------+-----+-------------+------+-----+---------+



In [6]:
from pyspark.sql.functions import *

# Note: the result is assigned to sf, so df itself is unchanged in the output below
sf = df.cols().create([
    ("new_col_2", 2.22),
    ("new_col_3", lit(3)),
    ("new_col_4", "test"),
    ("new_col_5", df['num']*2)
    ])

df.show()

+-------------------+---+-------+-----+-------------+------+-----+---------+
|              words|num|animals|thing|  two strings|filter|num 2|new_col_1|
+-------------------+---+-------+-----+-------------+------+-----+---------+
|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|        1|
|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|        1|
|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|        1|
|               null|  3|  eagle|glass|      lion-pc|     c|    4|        1|
+-------------------+---+-------+-----+-------------+------+-----+---------+



## Select columns
### Spark
* You can not select columns by string and index at the same time

### Pandas
* You can not select columns by string and index at the same time

In [7]:
columns = ["words", 1, "animals", 3]
df.cols().select(columns).show()

+-------------------+---+-------+-----+
|              words|num|animals|thing|
+-------------------+---+-------+-----+
|  I like     fish  |  1|    dog|housé|
|            zombies|  2|    cat|   tv|
|simpsons   cat lady|  2|   frog|table|
|               null|  3|  eagle|glass|
+-------------------+---+-------+-----+
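
The mixed selection above can be sketched in plain Python. This is only an illustration of the idea, not the Optimus implementation: `resolve_columns` is a hypothetical helper that turns a mix of names and positional indices into column names, which could then be handed to a regular select.

```python
from typing import List, Sequence, Union


def resolve_columns(all_columns: Sequence[str],
                    selection: Sequence[Union[str, int]]) -> List[str]:
    """Turn a mix of column names and positional indices into column names."""
    resolved = []
    for item in selection:
        if isinstance(item, int):
            # Integers are treated as positions in the dataframe's column list
            resolved.append(all_columns[item])
        elif item in all_columns:
            resolved.append(item)
        else:
            raise ValueError(f"Unknown column: {item!r}")
    return resolved


all_columns = ["words", "num", "animals", "thing",
               "two strings", "filter", "num 2", "new_col_1"]
print(resolve_columns(all_columns, ["words", 1, "animals", 3]))
# ['words', 'num', 'animals', 'thing']
```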



In [8]:
df.cols().select(regex = "n.*").show()

+---+-----+---------+
|num|num 2|new_col_1|
+---+-----+---------+
|  1|    1|        1|
|  2|    2|        1|
|  2|    3|        1|
|  3|    4|        1|
+---+-----+---------+
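
A rough sketch of the regex selection, assuming the pattern is matched from the start of each column name (this is consistent with the output above, where "animals" and "thing" are not selected even though they contain an "n"). `select_by_regex` is a hypothetical name used only for illustration.

```python
import re


def select_by_regex(all_columns, pattern):
    """Return the column names that match the pattern from their first character."""
    compiled = re.compile(pattern)
    # re.match anchors at the start of the string, so "n.*" keeps only
    # names that begin with "n"
    return [name for name in all_columns if compiled.match(name)]


all_columns = ["words", "num", "animals", "thing",
               "two strings", "filter", "num 2", "new_col_1"]
print(select_by_regex(all_columns, "n.*"))
# ['num', 'num 2', 'new_col_1']
```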



## Rename Column
### Spark
You cannot rename multiple columns at once using the vanilla Spark API


### Pandas
* Almost the same behavior https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html
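
As a minimal sketch of the two rename styles used in this section (a list of old/new pairs, or a function applied to every name), the new column names could be computed as follows. `rename_columns` is a hypothetical helper, not the Optimus code; in PySpark the resulting list could then be applied with something like `df.toDF(*new_names)`.

```python
def rename_columns(names, pairs=None, func=None):
    """Compute new column names from (old, new) pairs and/or a function."""
    names = list(names)
    if pairs is not None:
        mapping = dict(pairs)
        # Names that are not mentioned in the pairs are left unchanged
        names = [mapping.get(name, name) for name in names]
    if func is not None:
        names = [func(name) for name in names]
    return names


columns = ["words", "num", "two strings"]
print(rename_columns(columns, pairs=[("num", "number")]))
# ['words', 'number', 'two strings']
print(rename_columns(columns, func=str.upper))
# ['WORDS', 'NUM', 'TWO STRINGS']
```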

In [24]:
df.cols().rename([('num','number')]).show()

+-------------------+------+-------+-----+-------------+------+-----+---------+
|              words|number|animals|thing|  two strings|filter|num 2|new_col_1|
+-------------------+------+-------+-----+-------------+------+-----+---------+
|  I like     fish  |     1|    dog|housé|      cat-car|     a|    1|        1|
|            zombies|     2|    cat|   tv|       dog-tv|     b|    2|        1|
|simpsons   cat lady|     2|   frog|table|eagle-tv-plus|     1|    3|        1|
|               null|     3|  eagle|glass|      lion-pc|     c|    4|        1|
+-------------------+------+-------+-----+-------------+------+-----+---------+



In [26]:
df.cols().rename(func = str.lower).show()

function
+-------------------+---+-------+-----+-------------+------+-----+---------+
|              words|num|animals|thing|  two strings|filter|num 2|new_col_1|
+-------------------+---+-------+-----+-------------+------+-----+---------+
|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|        1|
|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|        1|
|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|        1|
|               null|  3|  eagle|glass|      lion-pc|     c|    4|        1|
+-------------------+---+-------+-----+-------------+------+-----+---------+



In [11]:
df.cols().rename(func = str.upper).show()

+-------------------+---+-------+-----+-------------+------+-----+---------+
|              WORDS|NUM|ANIMALS|THING|  TWO STRINGS|FILTER|NUM 2|NEW_COL_1|
+-------------------+---+-------+-----+-------------+------+-----+---------+
|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|        1|
|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|        1|
|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|        1|
|               null|  3|  eagle|glass|      lion-pc|     c|    4|        1|
+-------------------+---+-------+-----+-------------+------+-----+---------+



## Cast columns

### Spark
* Can not cast multiple columns

### Pandas
This is an opinionated way to handle column casting.
One of the first things every data cleaning process needs to accomplish is defining a data dictionary.
Because of that, we prefer to pass a list of tuples like this:

df.cols().cast(
[("words","str"),
("num","int"),
("animals","float"),
("thing","str")]
)

instead of the pandas approach:

pd.Series([1], dtype='int32')
pd.Series([2], dtype='string')

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html
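
To illustrate the data dictionary idea, here is a small sketch that validates a list of (column, type) tuples before any cast is attempted. The helper `normalize_casts` and the alias table are assumptions made for this example; the proposal itself does not specify which type names are accepted beyond the ones used in this notebook.

```python
# Hypothetical alias table: short Python-style names mapped to one canonical name
TYPE_ALIASES = {
    "str": "string", "string": "string",
    "int": "integer", "integer": "integer",
    "float": "float",
    "bool": "boolean", "boolean": "boolean",
}


def normalize_casts(columns, casts):
    """Check a data dictionary and return it with canonical type names."""
    normalized = []
    for name, type_name in casts:
        if name not in columns:
            raise ValueError(f"Unknown column: {name!r}")
        if type_name not in TYPE_ALIASES:
            raise ValueError(f"Unknown type: {type_name!r}")
        normalized.append((name, TYPE_ALIASES[type_name]))
    return normalized


columns = ["words", "num", "animals", "thing", "num 2"]
print(normalize_casts(columns, [("num", "str"), ("num 2", "int")]))
# [('num', 'string'), ('num 2', 'integer')]
```

Keeping the whole dictionary in one list makes it easy to review the intended types in a single place before touching the data.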

In [31]:
df.cols().cast([("num", "string"),("num 2", "integer")])

DataFrame[words: string, num: string, animals: string, thing: string, two strings: string, filter: string, num 2: int]

## Keep columns
### Spark
* You can not remove multiple columns

### Pandas
* Handled in pandas with drop


In [23]:
from pyspark.sql.functions import *
df.withColumn("num", col("num").cast(StringType()))


DataFrame[words: string, num: string, animals: string, thing: string, two strings: string, filter: string, num 2: string]

In [16]:
df.cols().keep("num").show()

+---+
|num|
+---+
|  1|
|  2|
|  2|
|  3|
+---+



## Move columns
### Spark
Does not exist in Spark

### Pandas
Does not exist in pandas
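
Since neither library offers this directly, moving a column comes down to reordering the list of column names. The sketch below shows one possible way to compute the new order; `move_column` is a hypothetical helper, and the "before" option is an assumption (this notebook only demonstrates "after").

```python
def move_column(names, column, ref_column, position="after"):
    """Return the column names with `column` placed before/after `ref_column`."""
    if position not in ("before", "after"):
        raise ValueError("position must be 'before' or 'after'")
    if column not in names:
        raise ValueError(f"Unknown column: {column!r}")
    # Take the column out first, then find where the reference column sits
    reordered = [name for name in names if name != column]
    ref_index = reordered.index(ref_column)
    insert_at = ref_index + 1 if position == "after" else ref_index
    reordered.insert(insert_at, column)
    return reordered


columns = ["words", "num", "animals", "thing", "two strings"]
print(move_column(columns, "words", "thing", "after"))
# ['num', 'animals', 'thing', 'words', 'two strings']
```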

In [17]:
df.cols().move("words", "thing", "after").show()

+---+-------+-----+-------------------+-------------+------+-----+---------+
|num|animals|thing|              words|  two strings|filter|num 2|new_col_1|
+---+-------+-----+-------------------+-------------+------+-----+---------+
|  1|    dog|housé|  I like     fish  |      cat-car|     a|    1|        1|
|  2|    cat|   tv|            zombies|       dog-tv|     b|    2|        1|
|  2|   frog|table|simpsons   cat lady|eagle-tv-plus|     1|    3|        1|
|  3|  eagle|glass|               null|      lion-pc|     c|    4|        1|
+---+-------+-----+-------------------+-------------+------+-----+---------+



## Sorting Columns
### Spark
You cannot sort columns using the vanilla Spark API

### Pandas
Similar to pandas
http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values
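
Sorting columns here means ordering the column names alphabetically, not ordering the rows. A minimal sketch of that ordering, with `sort_columns` as a hypothetical helper:

```python
def sort_columns(names, reverse=False):
    """Return the column names in alphabetical (or reverse alphabetical) order."""
    return sorted(names, reverse=reverse)


columns = ["words", "num", "animals", "thing",
           "two strings", "filter", "num 2", "new_col_1"]
print(sort_columns(columns))
# ['animals', 'filter', 'new_col_1', 'num', 'num 2', 'thing', 'two strings', 'words']
```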

In [18]:
df.cols().sort().show()

+-------+------+---------+---+-----+-----+-------------+-------------------+
|animals|filter|new_col_1|num|num 2|thing|  two strings|              words|
+-------+------+---------+---+-----+-----+-------------+-------------------+
|    dog|     a|        1|  1|    1|housé|      cat-car|  I like     fish  |
|    cat|     b|        1|  2|    2|   tv|       dog-tv|            zombies|
|   frog|     1|        1|  2|    3|table|eagle-tv-plus|simpsons   cat lady|
|  eagle|     c|        1|  3|    4|glass|      lion-pc|               null|
+-------+------+---------+---+-----+-----+-------------+-------------------+



In [19]:
df.cols().sort(reverse = True).show()

+-------------------+-------------+-----+-----+---+---------+------+-------+
|              words|  two strings|thing|num 2|num|new_col_1|filter|animals|
+-------------------+-------------+-----+-----+---+---------+------+-------+
|  I like     fish  |      cat-car|housé|    1|  1|        1|     a|    dog|
|            zombies|       dog-tv|   tv|    2|  2|        1|     b|    cat|
|simpsons   cat lady|eagle-tv-plus|table|    3|  2|        1|     1|   frog|
|               null|      lion-pc|glass|    4|  3|        1|     c|  eagle|
+-------------------+-------------+-----+-----+---+---------+------+-------+



## Drop columns
### Spark 
* You can not delete multiple columns

### Pandas
* Almost the same as pandas
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
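
One way to accept either a single name or a list of names, sketched in plain Python. `remaining_columns` is a hypothetical helper that only computes which columns survive; raising an error on unknown names is an assumption about the desired behavior.

```python
def remaining_columns(names, to_drop):
    """Return the column names left after dropping one name or a list of names."""
    if isinstance(to_drop, str):
        to_drop = [to_drop]  # allow a single column name as a plain string
    missing = [name for name in to_drop if name not in names]
    if missing:
        raise ValueError(f"Unknown columns: {missing}")
    drop_set = set(to_drop)
    return [name for name in names if name not in drop_set]


columns = ["words", "num", "animals", "thing"]
print(remaining_columns(columns, "num"))
# ['words', 'animals', 'thing']
print(remaining_columns(columns, ["num", "words"]))
# ['animals', 'thing']
```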

In [20]:
df2 = df.cols().drop("num")
df2 = df.cols().drop(["num","words"])
df2.show()

+-------+-----+-------------+------+-----+---------+
|animals|thing|  two strings|filter|num 2|new_col_1|
+-------+-----+-------------+------+-----+---------+
|    dog|housé|      cat-car|     a|    1|        1|
|    cat|   tv|       dog-tv|     b|    2|        1|
|   frog|table|eagle-tv-plus|     1|    3|        1|
|  eagle|glass|      lion-pc|     c|    4|        1|
+-------+-----+-------------+------+-----+---------+



## Chaining

The cols and rows functions organize and encapsulate Optimus' functionality apart from the Apache Spark Dataframe API. This has a disadvantage when chaining, because the user needs to invoke cols or rows at every step.

At the same time, it can be helpful when reading the code, because every line is self-explanatory.
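
The accessor pattern behind this trade-off can be sketched with two toy classes. `Frame` and `ColumnAccessor` are stand-ins invented for this example (they only track column names, not data): every accessor method returns a new frame, which is why cols() has to be called again at each step of the chain.

```python
class ColumnAccessor:
    """Groups the column operations; each method returns a new Frame."""

    def __init__(self, frame):
        self._frame = frame

    def rename(self, pairs):
        mapping = dict(pairs)
        return Frame([mapping.get(n, n) for n in self._frame.columns])

    def drop(self, names):
        drop_set = set(names)
        return Frame([n for n in self._frame.columns if n not in drop_set])

    def sort(self, reverse=False):
        return Frame(sorted(self._frame.columns, reverse=reverse))


class Frame:
    """Toy stand-in for a dataframe that only keeps its column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def cols(self):
        return ColumnAccessor(self)


result = (
    Frame(["words", "num", "animals", "thing"])
    .cols().rename([("num", "number")])
    .cols().drop(["number", "words"])
    .cols().sort(reverse=True)
)
print(result.columns)
# ['thing', 'animals']
```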

In [21]:
df\
    .cols().rename([('num','number')])\
    .cols().drop(["number","words"])\
    .withColumn("new_col_2", lit("spongebob"))\
    .cols().create("new_col_1", 1)\
    .cols().sort(reverse= True)\
    .show()

+-------------+-----+-----+---------+---------+------+-------+
|  two strings|thing|num 2|new_col_2|new_col_1|filter|animals|
+-------------+-----+-----+---------+---------+------+-------+
|      cat-car|housé|    1|spongebob|        1|     a|    dog|
|       dog-tv|   tv|    2|spongebob|        1|     b|    cat|
|eagle-tv-plus|table|    3|spongebob|        1|     1|   frog|
|      lion-pc|glass|    4|spongebob|        1|     c|  eagle|
+-------------+-----+-----+---------+---------+------+-------+



## Split Columns
### Spark

### Pandas

In [22]:
df.cols().split("two strings","-", n=3).show()

+-------------------+---+-------+-----+-------------+------+-----+---------+-----+-----+-----+
|              words|num|animals|thing|  two strings|filter|num 2|new_col_1|COL_0|COL_1|COL_2|
+-------------------+---+-------+-----+-------------+------+-----+---------+-----+-----+-----+
|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|        1|  cat|  car| null|
|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|        1|  dog|   tv| null|
|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|        1|eagle|   tv| plus|
|               null|  3|  eagle|glass|      lion-pc|     c|    4|        1| lion|   pc| null|
+-------------------+---+-------+-----+-------------+------+-----+---------+-----+-----+-----+



In [23]:
df.cols().split("two strings","-", get = 1).show()

+-------------------+---+-------+-----+-------------+------+-----+---------+-----+
|              words|num|animals|thing|  two strings|filter|num 2|new_col_1|COL_1|
+-------------------+---+-------+-----+-------------+------+-----+---------+-----+
|  I like     fish  |  1|    dog|housé|      cat-car|     a|    1|        1|  car|
|            zombies|  2|    cat|   tv|       dog-tv|     b|    2|        1|   tv|
|simpsons   cat lady|  2|   frog|table|eagle-tv-plus|     1|    3|        1|   tv|
|               null|  3|  eagle|glass|      lion-pc|     c|    4|        1|   pc|
+-------------------+---+-------+-----+-------------+------+-----+---------+-----+
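
The behavior of the two calls above can be approximated per value in plain Python. This is a sketch under assumptions, not the Optimus implementation: `split_parts` pads with None up to n parts and discards any extra parts, and `split_get` returns None when the requested position does not exist.

```python
def split_parts(value, separator, n):
    """Split a value into exactly n parts, padding with None when too short."""
    if value is None:
        return [None] * n
    parts = value.split(separator)[:n]  # extra parts are discarded (assumption)
    return parts + [None] * (n - len(parts))


def split_get(value, separator, index):
    """Return only the part at `index`, or None if it does not exist."""
    if value is None:
        return None
    parts = value.split(separator)
    return parts[index] if 0 <= index < len(parts) else None


print(split_parts("cat-car", "-", 3))
# ['cat', 'car', None]
print(split_get("eagle-tv-plus", "-", 1))
# tv
```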



## Pandas comparison
Pandas vs Spark
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/