## A beautiful way to work with data

This is an API proposal for accessing data.

DataFrames have rows and columns.

* To access columns, use `df.cols()`
* To access rows, use `df.rows()`
* I/O operations for loading and saving data live in Optimus: `op.load.csv()` and `op.save.csv()`

Easy and simple.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from optimus import *

from pyspark.sql.session import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType

sc = SparkSession.builder.getOrCreate()

In [4]:
# Create the Optimus instance from the existing Spark session
op = Optimus(sc)

Using a created Spark Session...
Done.


## Create dataframe
### Spark

Creating a DataFrame in plain Spark (Scala) is verbose:

```
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
```

In [33]:
# Thanks Mr Powers
df = op.create.df(
    # Rows
    [
        ("  I like     fish  ", 1, "dog dog", "housé", 5, "a"),
        ("    zombies", 2, "cat", "tv", 6, "b"),
        ("simpsons   cat lady", 2, "frog", "table", 7, "1"),
        (None, 3, "eagle", "glass", 8, "c")
    ],
    # Schema: (column name, data type, nullable)
    [
        ("words", "str", True),
        ("num", "int", True),
        ("animals", "str", True),
        ("thing", StringType(), True),
        ("second", "int", True),
        ("filter", StringType(), True)
    ])

df.show()

+-------------------+---+-------+-----+------+------+
|              words|num|animals|thing|second|filter|
+-------------------+---+-------+-----+------+------+
|  I like     fish  |  1|dog dog|housé|     5|     a|
|            zombies|  2|    cat|   tv|     6|     b|
|simpsons   cat lady|  2|   frog|table|     7|     1|
|               null|  3|  eagle|glass|     8|     c|
+-------------------+---+-------+-----+------+------+



## Filter by type

In [87]:
df.rows().filter_by_type("filter", type="integer").show()

+-------------------+---+-------+-----+------+------+
|              words|num|animals|thing|second|filter|
+-------------------+---+-------+-----+------+------+
|  I like     fish  |  1|dog dog|housé|     5|     a|
|            zombies|  2|    cat|   tv|     6|     b|
|               null|  3|  eagle|glass|     8|     c|
+-------------------+---+-------+-----+------+------+
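
Judging only from the output above, the call appears to drop the rows whose `filter` value can be read as an integer (the row holding `"1"` is gone). The pure-Python sketch below illustrates that reading on the same sample rows. It is not how Optimus implements the call: `looks_like_integer` and `drop_rows_by_type` are hypothetical helpers introduced for illustration.

```python
# Sample rows copied from the DataFrame created earlier.
rows = [
    ("  I like     fish  ", 1, "dog dog", "housé", 5, "a"),
    ("    zombies", 2, "cat", "tv", 6, "b"),
    ("simpsons   cat lady", 2, "frog", "table", 7, "1"),
    (None, 3, "eagle", "glass", 8, "c"),
]


def looks_like_integer(value):
    """Hypothetical check: True for ints and for strings that parse as ints."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    if isinstance(value, str):
        try:
            int(value.strip())
            return True
        except ValueError:
            return False
    return False


def drop_rows_by_type(rows, column_index, predicate):
    """Hypothetical helper: keep only rows whose cell does NOT match the predicate."""
    return [row for row in rows if not predicate(row[column_index])]


# Column 5 is "filter"; the row holding "1" is removed.
filtered = drop_rows_by_type(rows, 5, looks_like_integer)
for row in filtered:
    print(row)
```

Whether matching rows are dropped or kept is an assumption drawn from the sample output, so the direction of the filter should be checked against the actual API.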



## Lookup

In [88]:
df.rows().lookup("animals", ["dog", "cat", "eagle"], "just animals").show()

+-------------------+---+------------+-----+------+------+
|              words|num|     animals|thing|second|filter|
+-------------------+---+------------+-----+------+------+
|  I like     fish  |  1|     dog dog|housé|     5|     a|
|            zombies|  2|just animals|   tv|     6|     b|
|simpsons   cat lady|  2|        frog|table|     7|     1|
|               null|  3|just animals|glass|     8|     c|
+-------------------+---+------------+-----+------+------+
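
The output above suggests that `lookup` replaces a cell only when it matches one of the listed values exactly: `"cat"` and `"eagle"` are replaced, while `"dog dog"` is left alone. A minimal pure-Python sketch of that behavior follows, assuming exact matching. `lookup_replace` is a hypothetical helper, not the Optimus implementation.

```python
# Sample rows copied from the DataFrame created earlier.
rows = [
    ("  I like     fish  ", 1, "dog dog", "housé", 5, "a"),
    ("    zombies", 2, "cat", "tv", 6, "b"),
    ("simpsons   cat lady", 2, "frog", "table", 7, "1"),
    (None, 3, "eagle", "glass", 8, "c"),
]


def lookup_replace(rows, column_index, candidates, replacement):
    """Hypothetical helper: replace cells that exactly match one of the candidates."""
    wanted = set(candidates)
    result = []
    for row in rows:
        cells = list(row)
        if cells[column_index] in wanted:
            cells[column_index] = replacement
        result.append(tuple(cells))
    return result


# Column 2 is "animals".
replaced = lookup_replace(rows, 2, ["dog", "cat", "eagle"], "just animals")
for row in replaced:
    print(row[2])
```

Exact matching is the simplest rule consistent with the sample output; substring matching would also have replaced `"dog dog"`.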



In [102]:
df

DataFrame[words: string, num: int, animals: string, thing: string, second: int, filter: string]

## Apply by type

In [120]:
def func(value):
    # Increment the value and return it as a string
    return str(int(value) + 1)

df.rows().apply_by_type([('num', 'integer', func)]).show()

+-------------------+---+-------+-----+------+------+
|              words|num|animals|thing|second|filter|
+-------------------+---+-------+-----+------+------+
|  I like     fish  |  2|dog dog|housé|     5|     a|
|            zombies|  3|    cat|   tv|     6|     b|
|simpsons   cat lady|  3|   frog|table|     7|     1|
|               null|  4|  eagle|glass|     8|     c|
+-------------------+---+-------+-----+------+------+
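
As a rough illustration of the idea, the sketch below applies a function to every cell of a column whose value matches a type check, using plain Python tuples instead of a Spark DataFrame. `apply_by_type_sketch` and `is_int` are hypothetical names, and the matching rule is an assumption rather than the Optimus implementation.

```python
# Sample rows copied from the DataFrame created earlier.
rows = [
    ("  I like     fish  ", 1, "dog dog", "housé", 5, "a"),
    ("    zombies", 2, "cat", "tv", 6, "b"),
    ("simpsons   cat lady", 2, "frog", "table", 7, "1"),
    (None, 3, "eagle", "glass", 8, "c"),
]


def func(value):
    # Same transformation as in the cell above: increment, return as a string.
    return str(int(value) + 1)


def is_int(value):
    """Hypothetical type check: True for ints, excluding booleans."""
    return isinstance(value, int) and not isinstance(value, bool)


def apply_by_type_sketch(rows, column_index, predicate, transform):
    """Hypothetical helper: transform the cells of one column that match the predicate."""
    result = []
    for row in rows:
        cells = list(row)
        if predicate(cells[column_index]):
            cells[column_index] = transform(cells[column_index])
        result.append(tuple(cells))
    return result


# Column 1 is "num".
updated = apply_by_type_sketch(rows, 1, is_int, func)
print([row[1] for row in updated])
```

Because `func` returns a string, the transformed cells in this sketch are strings. How the column type is handled on the Spark side is not shown in the source.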



In [111]:
df.select("num").show()

== Physical Plan ==
*(1) Project [words#380, pythonUDF0#1973 AS num#1960, animals#382, thing#383, second#384, filter#385]
+- BatchEvalPython [<lambda>(num#381)], [words#380, num#381, animals#382, thing#383, second#384, filter#385, pythonUDF0#1973]
   +- Scan ExistingRDD[words#380,num#381,animals#382,thing#383,second#384,filter#385]

+---+
|num|
+---+
|  1|
|  2|
|  2|
|  3|
+---+



In [74]:
parameters = [('product', 'integer', 'None')]
print(type(parameters))
# Each entry unpacks as (column, data type, function)
for column, data_type, func in parameters:
    print(column)

<class 'list'>
product
