## A better way to manipulate data

This is an API proposal for accessing and manipulating data in Optimus.

A DataFrame has rows and columns, and each is reached through its own accessor:

* To access columns, use `df.cols()`.
* To access rows, use `df.rows()`.
* I/O operations to load and save data live on the Optimus object itself.

In [1]:
%load_ext autoreload
%autoreload 2

In [20]:
from optimus import *

from pyspark.sql.session import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType

sc = SparkSession.builder.getOrCreate()

In [21]:
# Create optimus
op = Optimus(sc)

Using a created Spark Session...
Done.


## Create dataframe
### Spark

Creating a DataFrame with the plain Spark API (Scala shown here) is verbose:

```
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
```

In [22]:
df = op.create.df([
                ("  I like     fish  ", 1, "dog", "housé" ),
                ("    zombies", 2, "cat", "tv"),
                ("simpsons   cat lady", 2, "frog", "table"),
                (None, 3, "eagle", "glass")
            ],
            [
                ("words", "str", True),
                ("num", "int", True),
                ("animals", "str", True),
                ("thing", StringType(), True)
            ])

df.show()

+-------------------+---+-------+-----+
|              words|num|animals|thing|
+-------------------+---+-------+-----+
|  I like     fish  |  1|    dog|housé|
|            zombies|  2|    cat|   tv|
|simpsons   cat lady|  2|   frog|table|
|               null|  3|  eagle|glass|
+-------------------+---+-------+-----+



In [23]:
df.cols().rename([('num','number')])

DataFrame[words: string, number: int, animals: string, thing: string]

## Create Columns
### Spark
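* `withColumn` adds only one column per call, so creating several columns means chaining calls.
* Constant values must be wrapped with the `lit` function (`pyspark.sql.functions.lit`) before they can be used as a column.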

### Pandas



In [24]:
df = df.cols().create("new_col_1", 1)
df.show()

NameError: name 'Column' is not defined

In [None]:
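from pyspark.sql.functions import lit

# Create several columns in one call; plain values and Column expressions can be mixed
df = df.cols().create([
    ("new_col_2", 2.22),
    ("new_col_3", lit(3)),
    ("new_col_4", "test"),
    ("new_col_5", df['num']*2)
    ])

df.show()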

## Select columns
### Spark
You cannot select columns by name and by index in the same call.

### Pandas
You cannot select columns by name and by index in the same call.

In [None]:
columns = ["words", 1, "animals", 3]
df.cols().select(columns).show()
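
The mixed selection above has to be translated into plain column names before Spark can use it. The following is a minimal sketch of that step, assuming integers are positions in the current column order; `resolve_columns` is a hypothetical helper written for illustration, not part of Optimus or PySpark.

```python
def resolve_columns(all_columns, selection):
    """Translate a mixed list of names and integer positions into names.

    Hypothetical helper for illustration; not part of Optimus or PySpark.
    """
    resolved = []
    for item in selection:
        if isinstance(item, int):
            # Integers are treated as positions in the current column order
            resolved.append(all_columns[item])
        elif item in all_columns:
            resolved.append(item)
        else:
            raise ValueError(f"Unknown column: {item!r}")
    return resolved


names = resolve_columns(["words", "num", "animals", "thing"],
                        ["words", 1, "animals", 3])
print(names)
# With a PySpark DataFrame, the result could be passed on as: df.select(*names)
```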

In [None]:
df.cols().select(regex = "n.*").show()
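
How the `regex` argument should match is not specified in this proposal. As one possible reading, the sketch below keeps the columns whose whole name matches the pattern; `match_columns` is a hypothetical helper, and the choice of `re.fullmatch` over `re.search` is an assumption.

```python
import re


def match_columns(all_columns, regex):
    """Return the column names that match the pattern in full.

    Hypothetical helper; full-name matching is an assumption.
    """
    pattern = re.compile(regex)
    return [name for name in all_columns if pattern.fullmatch(name)]


matched = match_columns(["words", "num", "animals", "thing"], "n.*")
print(matched)  # ['num']
```

With `re.search` instead, the same pattern would also pick up `animals` and `thing`, since both contain an `n`, so the matching rule is worth stating explicitly in the final API.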

## Rename
### Spark

In [None]:
df.cols().rename([('num','number')]).show()
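
For comparison, PySpark's `withColumnRenamed` renames a single column per call. A list of `(old, new)` pairs can instead be applied in one pass by computing the new header first. This is only a sketch: `renamed_columns` is a hypothetical helper, not part of Optimus, and it works on the list of column names rather than on a DataFrame.

```python
def renamed_columns(columns, pairs):
    """Return the column names after applying (old, new) rename pairs.

    Hypothetical helper for illustration; not part of Optimus or PySpark.
    """
    mapping = dict(pairs)
    unknown = [old for old in mapping if old not in columns]
    if unknown:
        raise ValueError(f"Unknown columns: {unknown}")
    # Columns without a rename pair keep their current name
    return [mapping.get(name, name) for name in columns]


new_names = renamed_columns(["words", "num", "animals", "thing"],
                            [("num", "number")])
print(new_names)
# With a PySpark DataFrame, the new header could be applied with: df.toDF(*new_names)
```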

In [None]:
df.cols().cast([("num", "string")])

## Keep columns
### Spark

In [None]:
df.cols().keep("num").show()

In [None]:
df.show()

## Move columns
### Spark


In [None]:
df2 = df.cols().move("words", "thing", "after")
df2.show()
df2
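
Spark has no dedicated call for moving a column, so a move comes down to selecting the columns in a new order. The sketch below computes that order, assuming `move(column, ref_col, position)` means "place `column` directly before or after `ref_col`"; `move_column` is a hypothetical helper, not the Optimus implementation.

```python
def move_column(columns, column, ref_col, position):
    """Return the column order with `column` placed before/after `ref_col`.

    Hypothetical helper; the argument meaning is an assumption.
    """
    if position not in ("before", "after"):
        raise ValueError("position must be 'before' or 'after'")
    if column not in columns or ref_col not in columns:
        raise ValueError("both columns must exist")
    if column == ref_col:
        return list(columns)
    # Take the column out, then reinsert it next to the reference column
    remaining = [name for name in columns if name != column]
    index = remaining.index(ref_col)
    if position == "after":
        index += 1
    return remaining[:index] + [column] + remaining[index:]


new_order = move_column(["words", "num", "animals", "thing"],
                        "words", "thing", "after")
print(new_order)
# With a PySpark DataFrame: df.select(*new_order)
```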

## Drop columns
### Spark 

Spark's `drop` does not accept a list of column names; to delete multiple columns, the names must be passed as separate arguments.

In [None]:
# Drop a single column
df2 = df.cols().drop("num")
# Drop several columns at once
df2 = df.cols().drop(["num","words"])
df2.show()
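
Accepting either a single name or a list, as in the two calls above, only needs a small normalization step in front of Spark's own `drop`. A minimal sketch follows; `as_column_list` is a hypothetical helper, not part of Optimus.

```python
def as_column_list(columns):
    """Accept one column name or an iterable of names and return a list.

    Hypothetical helper for illustration; not part of Optimus or PySpark.
    """
    if isinstance(columns, str):
        return [columns]
    return list(columns)


print(as_column_list("num"))
print(as_column_list(["num", "words"]))
# With a PySpark DataFrame: df.drop(*as_column_list(["num", "words"]))
```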

## Load file

In [25]:
print("*** Load files operations")
op.load.csv()

print("*** Rows CRUD operations")

# Rows CRUD Operations
op.rows.create()
op.rows.select()
op.rows.update()
op.rows.delete()


op.rows.apply()

print("*** Columns CRUD operations")

# Columns CRUD operations
op.columns.create()
op.columns.select(by_index = 1)

*** Load files operations


AttributeError: 'Optimus' object has no attribute 'load'

## Pandas comparison
Pandas vs Spark
https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

In [83]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3,4),columns=['A', 'B', 'C', 'D'])

# Drop col
df.drop(['B', 'C'], axis=1)
# or df.drop(columns=['B', 'C'])

   A   D
0  0   3
1  4   7
2  8  11


In [80]:
# Drop by index
df.drop([0, 1])

   A  B   C   D
2  8  9  10  11


In [100]:
# Spark
df.drop('B')
df.drop('C')

# Multiple columns
columns_to_drop = ['A', 'B']
df.drop(*columns_to_drop)

# Optimus

DataFrame[words: string, num: int, animals: string, thing: string]