# Optimus

## Pandas and Optimus side by side

From the web:

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. 

Here is a list of the 90th-percentile functions used in pandas, thanks to Devin Petersohn: https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/

|Pandas|Optimus|
|---|---|
|pd.read_csv()|op.read.csv()|
|pd.DataFrame()|op.create.df()|
|df.append()|df.row().append()|
|df.mean()|df.cols().mean()|
|df.head()|df.head()|
|df.drop()|df.cols().drop()|
|df.sum()|df.cols().sum()|
|df.to_csv()|df.save().csv()|
|df.get()|NI|
|df.mode()|df.cols().mode()|
|df.astype()|df.cols().cast(), astype() as alias|
|df.sub()|NI|
|pd.concat()|op.concat()|
|df.apply()|df.cols().apply()|
|df.groupby()|df.groupby()|
|df.join()|df.join()|
|df.fillna()|df.fillna()|
|df.max()|df.cols().max()|
|df.reset_index()|NA|

NI = Not implemented
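
For reference, the pandas side of the table can be exercised on a small frame. This is a minimal sketch that uses only the pandas API; the column names and values are illustrative assumptions, and the Optimus calls from the right-hand column are not shown because their exact signatures are not given here. Note that `DataFrame.append` was removed in pandas 2.0, so appending rows is done with `pd.concat` in current pandas.

```python
import pandas as pd

# Illustrative data; names and values are assumptions made for this sketch.
df = pd.DataFrame({
    "num": [1, 2, 2, 3],
    "animals": ["dog", "cat", "frog", None],
})

# Column statistics (df.mean, df.sum, df.max, df.mode in the table).
mean_num = df["num"].mean()
total_num = df["num"].sum()
max_num = df["num"].max()
mode_num = df["num"].mode().tolist()

# Type casting, null filling, column dropping and row preview.
as_text = df.astype({"num": str})
filled = df.fillna({"animals": "unknown"})
dropped = df.drop(columns=["animals"])
first_two = df.head(2)

# Appending rows: pd.concat replaces the removed DataFrame.append.
appended = pd.concat([df, df.head(1)], ignore_index=True)
```

Each call returns a new object and leaves `df` unchanged, which is the behaviour the Optimus column of the table is being compared against.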



## Pandas, Optimus and Spark side by side

|Pandas|Optimus|Spark|
|---|---|---|
|pd.read_csv()|op.read.csv()|spark.read.csv()|
|pd.DataFrame()|op.create.df()|spark.createDataFrame()|
|df.append()|df.row().append()|df.union()|
|df.mean()|df.cols().mean()|done via agg function|
|df.head()|df.head()|df.show()|
|df.drop()|df.cols().drop()|df.drop()|
|df.sum()|df.cols().sum()|done via agg function|
|df.to_csv()|df.save().csv()|df.write.csv()|
|df.get()|NI|NI|
|df.mode()|df.cols().mode()|done via agg function|
|df.astype()|df.cols().cast(), astype() as alias|Column.cast() via df.withColumn()|
|df.sub()|NI|NI|
|pd.concat()|op.concat()|NI|
|df.apply()|df.cols().apply()|NI|
|df.groupby()|via Spark DataFrame|df.groupby()|
|df.join()|via Spark DataFrame|df.join()|
|df.fillna()|via Spark DataFrame|df.fillna()|
|df.max()|df.cols().max()|done via agg function|
|df.reset_index()|NA|NA|

NI = Not implemented


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from optimus import *

In [3]:
# Create optimus
op = Optimus(master="local", app_name= "optimus")


             ____        __  _                     
            / __ \____  / /_(_)___ ___  __  _______
           / / / / __ \/ __/ / __ `__ \/ / / / ___/
          / /_/ / /_/ / /_/ / / / / / / /_/ (__  ) 
          \____/ .___/\__/_/_/ /_/ /_/\__,_/____/  
              /_/                                  
              
Just checking that all necessary environments vars are present...
-----
PYSPARK_PYTHON=python
SPARK_HOME=C:\opt\spark\spark-2.3.1-bin-hadoop2.7
JAVA_HOME=C:\java8
-----
Starting or getting SparkSession and SparkContext...
Setting checkpoint folder ( local ). If you are in a cluster initialize optimus with master='your_ip' as param
Deleting previous folder if exists...
Creating the checkpoint directory...
Optimus successfully imported. Have fun :).


In [4]:
op.get_ss()

In [5]:
op.get_sc()

## Create dataframe
### Spark

In plain Spark (Scala shown here) this is ugly, because the rows and the schema have to be built separately:

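```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// The rows, without any column information
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

// The schema: column name, type, nullable
val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

// Rows and schema are combined only at creation time
val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
```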

In [6]:
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType

df = op.create.df(
    [
        ("words", "str", True),
        ("num", "int", True),
        ("animals", "str", True),
        ("thing", StringType(), True),
        ("two strings", StringType(), True),
        ("filter", StringType(), True),
        ("num 2", "string", True),
        ("date", "string", True),
        ("num 3", "string", True)
    ],
    [
        ("  I like     fish  ", 1, "dog", "&^%$#housé", "cat-car", "a", "1", "20150510", "3"),
        ("    zombies", 2, "cat", "tv", "dog-tv", "b", "2", "20160510", "3"),
        ("simpsons   cat lady", 2, "frog", "table", "eagle-tv-plus", "1", "3", "20170510", "4"),
        (None, 3, "eagle", "glass", "lion-pc", "c", "4", "20180510", "5")
    ]
)

df.show()

+-------------------+---+-------+----------+-------------+------+-----+--------+-----+
|              words|num|animals|     thing|  two strings|filter|num 2|    date|num 3|
+-------------------+---+-------+----------+-------------+------+-----+--------+-----+
|  I like     fish  |  1|    dog|&^%$#housé|      cat-car|     a|    1|20150510|    3|
|            zombies|  2|    cat|        tv|       dog-tv|     b|    2|20160510|    3|
|simpsons   cat lady|  2|   frog|     table|eagle-tv-plus|     1|    3|20170510|    4|
|               null|  3|  eagle|     glass|      lion-pc|     c|    4|20180510|    5|
+-------------------+---+-------+----------+-------------+------+-----+--------+-----+



## concat
### Spark
Not available in vanilla Spark as a single `concat` call; DataFrames are combined pairwise with `df.union()`.

### Pandas
Almost the same functionality as `pd.concat()`.
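
For comparison, the pandas call takes a list of frames. A minimal sketch, using the first three columns of the example data above; `ignore_index=True` is a choice made here so that the result gets a fresh 0..7 index, since pandas otherwise keeps each frame's original row labels.

```python
import pandas as pd

# First three columns of the example frame, rebuilt in pandas.
df = pd.DataFrame({
    "words": ["  I like     fish  ", "    zombies", "simpsons   cat lady", None],
    "num": [1, 2, 2, 3],
    "animals": ["dog", "cat", "frog", "eagle"],
})

# Stack the frame on top of itself; rows are appended in order.
stacked = pd.concat([df, df], ignore_index=True)

print(stacked)
```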

In [7]:
op.concat(df,df).show()

+-------------------+---+-------+----------+-------------+------+-----+--------+-----+
|              words|num|animals|     thing|  two strings|filter|num 2|    date|num 3|
+-------------------+---+-------+----------+-------------+------+-----+--------+-----+
|  I like     fish  |  1|    dog|&^%$#housé|      cat-car|     a|    1|20150510|    3|
|            zombies|  2|    cat|        tv|       dog-tv|     b|    2|20160510|    3|
|simpsons   cat lady|  2|   frog|     table|eagle-tv-plus|     1|    3|20170510|    4|
|               null|  3|  eagle|     glass|      lion-pc|     c|    4|20180510|    5|
|  I like     fish  |  1|    dog|&^%$#housé|      cat-car|     a|    1|20150510|    3|
|            zombies|  2|    cat|        tv|       dog-tv|     b|    2|20160510|    3|
|simpsons   cat lady|  2|   frog|     table|eagle-tv-plus|     1|    3|20170510|    4|
|               null|  3|  eagle|     glass|      lion-pc|     c|    4|20180510|    5|
+-------------------+---+-------+----------