## A better way to manipulate data

This is an API proposal for accessing and manipulating data.

DataFrames would expose two objects, rows and columns. Both would have the same access methods:

* create
* read
* move
* update
* delete
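
The shared interface could be sketched as an abstract base class that both accessors implement. This is only an illustration in plain Python, not Optimus code: `CrudAccessor`, `ColumnAccessor`, and the dict-of-lists storage are hypothetical names chosen to show the shape of the proposal.

```python
from abc import ABC, abstractmethod


class CrudAccessor(ABC):
    """Hypothetical interface shared by the rows and columns accessors."""

    @abstractmethod
    def create(self, key, value): ...

    @abstractmethod
    def read(self, key): ...

    @abstractmethod
    def move(self, key, position): ...

    @abstractmethod
    def update(self, key, func): ...

    @abstractmethod
    def delete(self, key): ...


class ColumnAccessor(CrudAccessor):
    """Toy implementation over a dict of column name -> list of values."""

    def __init__(self, data):
        self._data = dict(data)  # shallow copy so the caller's dict is not mutated

    def create(self, key, value):
        n_rows = len(next(iter(self._data.values()), []))
        self._data[key] = [value] * n_rows  # constant column
        return self

    def read(self, key):
        return list(self._data[key])

    def move(self, key, position):
        names = [name for name in self._data if name != key]
        names.insert(position, key)
        self._data = {name: self._data[name] for name in names}
        return self

    def update(self, key, func):
        self._data[key] = [func(v) for v in self._data[key]]
        return self

    def delete(self, key):
        del self._data[key]
        return self

    def names(self):
        return list(self._data)


cols = ColumnAccessor({"num": [1, 2, 3], "animals": ["dog", "cat", "frog"]})
cols.create("flag", 0).move("flag", 0).update("num", lambda v: v * 2).delete("animals")
print(cols.names())      # ['flag', 'num']
print(cols.read("num"))  # [2, 4, 6]
```

A rows accessor would implement the same five methods, keyed by row position instead of column name.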

For example:

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from optimus import *

from pyspark.sql.session import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType

sc = SparkSession.builder.getOrCreate()

In [3]:
# Create optimus
op = Optimus(sc)

Using a created Spark Session...
Done.


## Create dataframe
### Spark

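In plain Spark (Scala shown here), this is ugly: the rows, the schema, and the DataFrame each need a separate declaration.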

```
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
```

In [38]:
df = op.create.df([
                ("  I like     fish  ", 1, "dog", "housé" ),
                ("    zombies", 2, "cat", "tv"),
                ("simpsons   cat lady", 2, "frog", "table"),
                (None, 3, "eagle", "glass")
            ],
            [
                ("words", "str", True),
                ("num", "int", True),
                ("animals", "str", True),
                ("thing", StringType(), True)
            ])

df.show()

+-------------------+---+-------+-----+
|              words|num|animals|thing|
+-------------------+---+-------+-----+
|  I like     fish  |  1|    dog|housé|
|            zombies|  2|    cat|   tv|
|simpsons   cat lady|  2|   frog|table|
|               null|  3|  eagle|glass|
+-------------------+---+-------+-----+



In [37]:
df

DataFrame[words: string, num: int, animals: string, thing: string]

In [5]:
df.cols()

<optimus.dataframe.columns.Columns at 0x1dec0b73e48>

## Create Columns
### Spark
* You cannot create multiple columns in a single call.
* Constant values must be wrapped in the `lit` function, whose name is far from obvious.

### Pandas



In [6]:
dfr = df.cols().create("new_col_1", 1)
dfr.show()
df

+-------------------+---+-------+-----+---------+
|              words|num|animals|thing|new_col_1|
+-------------------+---+-------+-----+---------+
|  I like     fish  |  1|    dog|housé|        1|
|            zombies|  2|    cat|   tv|        1|
|simpsons   cat lady|  2|   frog|table|        1|
|               null|  3|  eagle|glass|        1|
+-------------------+---+-------+-----+---------+



DataFrame[words: string, num: int, animals: string, thing: string]

In [7]:
df.cols()._df

DataFrame[words: string, num: int, animals: string, thing: string]

In [8]:
df

DataFrame[words: string, num: int, animals: string, thing: string]

In [9]:
df.cols()._df

DataFrame[words: string, num: int, animals: string, thing: string]

In [12]:
dfr = df.cols().create([
    ("new_col_2", 2.22),
    ("new_col_3", lit(3)),
    ("new_col_4", "test"),
    ("new_col_5", df['num']*2)
    ])
dfr.show()
df.show()

+-------------------+---+-------+-----+---------+---------+---------+---------+
|              words|num|animals|thing|new_col_2|new_col_3|new_col_4|new_col_5|
+-------------------+---+-------+-----+---------+---------+---------+---------+
|  I like     fish  |  1|    dog|housé|     2.22|        3|     test|        2|
|            zombies|  2|    cat|   tv|     2.22|        3|     test|        4|
|simpsons   cat lady|  2|   frog|table|     2.22|        3|     test|        4|
|               null|  3|  eagle|glass|     2.22|        3|     test|        6|
+-------------------+---+-------+-----+---------+---------+---------+---------+

+-------------------+---+-------+-----+
|              words|num|animals|thing|
+-------------------+---+-------+-----+
|  I like     fish  |  1|    dog|housé|
|            zombies|  2|    cat|   tv|
|simpsons   cat lady|  2|   frog|table|
|               null|  3|  eagle|glass|
+-------------------+---+-------+-----+
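
The two call shapes used above, `create(name, value)` and `create([(name, value), ...])`, could be normalized into one list of pairs before any column is built. The helper below is a hypothetical sketch in plain Python, not the Optimus implementation; `normalize_column_specs` is an assumed name.

```python
def normalize_column_specs(name_or_specs, value=None):
    """Return a list of (name, value) pairs for either call shape."""
    if isinstance(name_or_specs, str):
        # Single column: create("new_col_1", 1)
        return [(name_or_specs, value)]
    specs = []
    for spec in name_or_specs:
        if len(spec) != 2:
            raise ValueError(f"expected (name, value), got {spec!r}")
        name, val = spec
        if not isinstance(name, str):
            raise TypeError(f"column name must be a string, got {name!r}")
        specs.append((name, val))
    return specs


print(normalize_column_specs("new_col_1", 1))
# [('new_col_1', 1)]
print(normalize_column_specs([("new_col_2", 2.22), ("new_col_4", "test")]))
# [('new_col_2', 2.22), ('new_col_4', 'test')]
```

With a single internal shape, the code that actually adds the columns only has to loop over pairs.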



## Select columns
### Spark
You cannot select columns by name and by positional index in the same call.

### Pandas
You cannot select columns by name and by positional index in the same call.

In [None]:
columns = ["words", 1, "animals", 3]
dfr.cols().select(columns).show()
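
One way such a mixed selection could be resolved is to translate every selector into a column name first. This is a sketch under assumptions: `resolve_columns` is a hypothetical helper, and the four base column names are used as the available columns.

```python
def resolve_columns(available, selectors):
    """Map a mix of column names and positional indices to column names."""
    resolved = []
    for selector in selectors:
        if isinstance(selector, bool):
            # bool is a subclass of int, so reject it explicitly
            raise TypeError("bool is not a valid column selector")
        if isinstance(selector, int):
            resolved.append(available[selector])  # IndexError if out of range
        elif isinstance(selector, str):
            if selector not in available:
                raise KeyError(f"unknown column: {selector!r}")
            resolved.append(selector)
        else:
            raise TypeError(f"unsupported selector: {selector!r}")
    return resolved


available = ["words", "num", "animals", "thing"]
print(resolve_columns(available, ["words", 1, "animals", 3]))
# ['words', 'num', 'animals', 'thing']
```

The resolved names could then be handed to an ordinary name-based select.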

In [13]:
dfr.cols().select(regex = "n.*").show()

+---+---------+---------+---------+------------------+
|num|new_col_2|new_col_3|new_col_4|         new_col_5|
+---+---------+---------+---------+------------------+
|  1|     2.22|        3|     test|               2.2|
|  2|     2.22|        3|     test|               4.4|
|  2|     2.22|        3|     test|               4.4|
|  3|     2.22|        3|     test|6.6000000000000005|
+---+---------+---------+---------+------------------+
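
Judging from the output above, where `animals` and `thing` are not selected, the pattern appears to be matched from the start of each column name. A minimal sketch of that behavior, assuming `re.match` semantics and a hypothetical helper named `select_by_regex`:

```python
import re


def select_by_regex(available, pattern):
    """Return the column names that match `pattern` from their first character."""
    compiled = re.compile(pattern)
    return [name for name in available if compiled.match(name)]


available = ["words", "num", "animals", "thing",
             "new_col_2", "new_col_3", "new_col_4", "new_col_5"]
print(select_by_regex(available, "n.*"))
# ['num', 'new_col_2', 'new_col_3', 'new_col_4', 'new_col_5']
```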



## Rename
### Spark

In [33]:
df.cols()
#dfr = df.cols().rename([('num','number')])


<optimus.dataframe.columns.Columns at 0x202e92d5b70>

In [43]:
df

DataFrame[words: string, num: int, animals: string, thing: string]

In [42]:
#dfr = df.cols().rename([('num','number')])
df.cols()._df

DataFrame[words: string, num: int, animals: string, thing: string, new_col_2: double, new_col_3: int, new_col_4: string, new_col_5: double]

In [48]:
print("*** Load files operations")
op.load.csv()

print ("*** Rows CRUD operations")

# Rows CRUD Operations
op.rows.create()
op.rows.select()
op.rows.update()
op.rows.delete()


op.rows.apply()

print ("*** Columns CRUD operations")

# Column Operation
op.columns.create()
op.columns.select(by_index = 1)

*** Load files operations


AttributeError: 'Optimus' object has no attribute 'load'

## Pandas comparison
Pandas vs Spark
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

In [83]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])

# Drop col
df.drop(['B', 'C'], axis=1)
# or df.drop(columns=['B', 'C'])

|   | A | D  |
|---|---|----|
| 0 | 0 | 3  |
| 1 | 4 | 7  |
| 2 | 8 | 11 |


In [80]:
# Drop by index
df.drop([0, 1])

|   | A | B | C  | D  |
|---|---|---|----|----|
| 2 | 8 | 9 | 10 | 11 |


In [100]:
# Spark
df.drop('B')
df.drop('C')

# Multiple columns
columns_to_drop = ['A', 'B']
df.drop(*columns_to_drop)

# Optimus

DataFrame[words: string, num: int, animals: string, thing: string]

In [109]:
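class A:
    def __init__(self):
        self.hola = "como estas?"

    def cols(self):
        print("columns")

        def create(name):
            print(name)
            print(self.hola)

        # Attach `create` to the function object behind the method.
        # A bare `cols` only resolves here if a global with that name exists.
        A.cols.create = create

        return A.cols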

In [110]:
a = A()

In [111]:
a.cols().create("hola")

columns
hola
como estas?


In [112]:
def FakeObject():
    def test():
        print ("foo")
    FakeObject.test = test
    return FakeObject
x = FakeObject()
x.test()

foo
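
Function attributes make the chained call work, but the attribute lives on a single shared function object, so each call overwrites the closure set by the previous one. An alternative worth considering is a small accessor class that holds a reference to its parent. The sketch below is hypothetical: `Frame` is a toy stand-in for a DataFrame, and this `Columns` class is not the Optimus one.

```python
class Columns:
    """Hypothetical accessor that keeps a reference to its parent frame."""

    def __init__(self, frame):
        self._frame = frame

    def create(self, name, value):
        # Return a new frame; the parent is left untouched.
        data = dict(self._frame.data)
        data[name] = [value] * self._frame.n_rows
        return Frame(data)

    def names(self):
        return list(self._frame.data)


class Frame:
    """Toy stand-in for a DataFrame: a dict of column name -> list of values."""

    def __init__(self, data):
        self.data = data
        self.n_rows = len(next(iter(data.values()), []))

    def cols(self):
        # A fresh accessor per call, bound to this frame only.
        return Columns(self)


df = Frame({"num": [1, 2, 3]})
dfr = df.cols().create("new_col_1", 1)
print(dfr.cols().names())  # ['num', 'new_col_1']
print(df.cols().names())   # ['num']
```

Because every accessor is created from the frame it is called on, `df.cols()` and `dfr.cols()` cannot see each other's columns.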
