# Reference

https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html

https://pandas.pydata.org/pandas-docs/stable/10min.html

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import sys
sys.path.append("..")

## Install Optimus

From the command line:

`pip install optimuspyspark`

From a notebook, you can use:

`!pip install optimuspyspark`

## Import Optimus and start it

In [5]:
from optimus import Optimus
op = Optimus(master="local")

## Dataframe creation

Create a dataframe by passing a list of column names and a list of rows. Unlike pandas, you need to specify the column names.


In [6]:
df = op.create.df(
    [
        "names",
        "height",
        "function",
        "rank",
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz",13.0, "First Lieutenant", 8),
        ("Megatron",None, "None", None),
        
    ]).h_repartition(1)
df.table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7.0
Optim'us,28.0,Leader,10.0
ironhide&,26.0,Security,7.0
Jazz,13.0,First⸱Lieutenant,8.0
Megatron,,,


Create a dataframe by passing a list of tuples that specify each column's data type. The data type can be given as a string or as a Spark data type: https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/types/package-summary.html

You can also use some Optimus predefined types:
* "str" = StringType()
* "int" = IntegerType()
* "float" = FloatType()
* "bool" = BooleanType()

In [76]:
df = op.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz",13.0, "First Lieutenant", 8),
        ("Megatron",None, "None", None),
        
    ])
df.table()

names  1 (string)  nullable,height  2 (float)  nullable,function  3 (string)  nullable,rank  4 (int)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7.0
Optim'us,28.0,Leader,10.0
ironhide&,26.0,Security,7.0
Jazz,13.0,First⸱Lieutenant,8.0
Megatron,,,


Create a dataframe and specify whether each column accepts null values. The third element of each column tuple is the nullable flag.

In [78]:
df = op.create.df(
    [
        ("names", "str", True),
        ("height", "float", True),
        ("function", "str", True),
        ("rank", "int", True),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz",13.0, "First Lieutenant", 8),
        ("Megatron",None, "None", None),
        
    ])
df.table()

names  1 (string)  nullable,height  2 (float)  nullable,function  3 (string)  nullable,rank  4 (int)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7.0
Optim'us,28.0,Leader,10.0
ironhide&,26.0,Security,7.0
Jazz,13.0,First⸱Lieutenant,8.0
Megatron,,,
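
As a rough illustration of what the nullable flag expresses, the plain-Python sketch below checks rows against `(name, type, nullable)` specs. The helper `find_null_violations` and the `strict_columns` spec are hypothetical and are not part of Optimus or Spark. They only mimic the idea that a non-nullable column should not hold `None`. How Spark actually enforces the flag has more nuance.

```python
# Column specs in the same (name, type, nullable) shape used in the cell above.
columns = [
    ("names", "str", True),
    ("height", "float", True),
    ("function", "str", True),
    ("rank", "int", True),
]

rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Megatron", None, "None", None),
]


def find_null_violations(columns, rows):
    """Return (row_index, column_name) pairs where a non-nullable column holds None."""
    violations = []
    for row_index, row in enumerate(rows):
        for (name, _dtype, nullable), value in zip(columns, row):
            if value is None and not nullable:
                violations.append((row_index, name))
    return violations


# With every column nullable, the Megatron row is accepted.
print(find_null_violations(columns, rows))

# Hypothetical stricter spec for illustration: "height" may not be null.
strict_columns = [
    ("names", "str", True),
    ("height", "float", False),
    ("function", "str", True),
    ("rank", "int", True),
]
print(find_null_violations(strict_columns, rows))
```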


Create a dataframe from a pandas dataframe

In [35]:
import pandas as pd
import numpy as np

data = [("bumbl#ebéé  ", 17.5, "Espionage", 7),
         ("Optim'us", 28.0, "Leader", 10),
         ("ironhide&", 26.0, "Security", 7)]
labels = ["names", "height", "function", "rank"]

# Create pandas dataframe
pdf = pd.DataFrame.from_records(data, columns=labels)

df = op.create.df(pdf = pdf)
df.table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


## Viewing data
Here is how to view the first 10 rows of a dataframe.

In [36]:
df.table(10)

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


## Partitions
Partitions are the way Spark divides the data across your local computer or cluster so that it can be processed in parallel. Partitioning can greatly affect Spark performance.

Take 5 minutes to read this article:
https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
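
To build some intuition, here is a conceptual sketch in plain Python that splits rows into a fixed number of partitions by hashing a key. It is a toy under stated assumptions: the function `partition_rows` is made up for illustration, and Spark's own partitioners are more sophisticated.

```python
import zlib


def partition_rows(rows, num_partitions, key_index=0):
    """Assign each row to a partition based on a stable hash of one field."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        key = str(row[key_index]).encode("utf-8")
        partitions[zlib.crc32(key) % num_partitions].append(row)
    return partitions


rows = [("Optim'us", 28.0), ("ironhide&", 26.0), ("Jazz", 13.0), ("Megatron", None)]
parts = partition_rows(rows, num_partitions=2)

# Each partition could be processed independently, for example by a different core.
for i, part in enumerate(parts):
    print(i, part)
```

Every row lands in exactly one partition, and rows with the same key always land in the same one.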

## Lazy operations
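Spark evaluates transformations lazily: calling one only records it, and the work runs when an action (for example `collect()` or `count()`) needs a result. The following discussion explains why: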

https://stackoverflow.com/questions/38027877/spark-transformation-why-its-lazy-and-what-is-the-advantage
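
Deferred work can be sketched with plain Python generators. This is only an analogy and not Spark code: the functions `add_one` and `keep_large` are invented for the example, and consuming the generator stands in for a Spark action.

```python
log = []


def add_one(values):
    """A lazy 'transformation': nothing runs until the result is consumed."""
    for v in values:
        log.append(f"add_one({v})")
        yield v + 1


def keep_large(values, threshold):
    """Another lazy step that keeps only values above a threshold."""
    for v in values:
        log.append(f"keep_large({v})")
        if v > threshold:
            yield v


heights = [17.5, 28.0, 26.0]

# Building the pipeline does no work yet.
pipeline = keep_large(add_one(heights), threshold=20)
work_before = list(log)

# Consuming the pipeline triggers the whole computation.
result = list(pipeline)

print(work_before)  # []
print(result)       # [29.0, 27.0]
```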

## Immutability
Spark dataframes are immutable: a transformation returns a new dataframe and leaves the original unchanged.
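
A plain-Python analogy (not Spark code): the hypothetical function below returns a new tuple of rows instead of modifying its input. This mirrors what happens later in this notebook, where a cell such as `df.cols.append(...)` displays a new dataframe while `df` itself keeps its original columns.

```python
rows = (("Optim'us", 28.0), ("ironhide&", 26.0))


def with_height_plus_one(rows):
    """Return a new tuple of rows; the input is left untouched."""
    return tuple((name, height + 1) for name, height in rows)


new_rows = with_height_plus_one(rows)

print(rows)      # the original data is unchanged
print(new_rows)  # the transformed copy
```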

## Spark Architecture
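A Spark application has a driver process that plans the work and executors that run tasks on the partitions of your data.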

## Columns and Rows

Optimus organizes operations into columns and rows. This is a little different from how pandas works, where all operations are centered on the DataFrame class. We think this approach makes it easier to access and transform data. For a deep dive into the design decisions, please read:

https://towardsdatascience.com/announcing-optimus-v2-agile-data-science-workflows-made-easy-c127a12d9e13

Sort by column names

In [37]:
df.cols.sort().table()

function  1 (string)  nullable,height  2 (double)  nullable,names  3 (string)  nullable,rank  4 (bigint)  nullable
Espionage,17.5,bumbl#ebéé⸱⸱,7
Leader,28.0,Optim'us,10
Security,26.0,ironhide&,7


Sort rows by the value of the `rank` column

In [38]:
df.rows.sort("rank").table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
Optim'us,28.0,Leader,10
bumbl#ebéé⸱⸱,17.5,Espionage,7
ironhide&,26.0,Security,7
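
The output above is in descending order of `rank`, and the two rows with rank 7 keep their original relative order. The plain-Python sketch below reproduces that ordering. That descending is the default is inferred from the output shown here, not taken from the Optimus documentation, and Spark itself does not promise any particular order for ties.

```python
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
]

RANK = 3  # position of the "rank" field in each row

# sorted() is stable, so rows with equal rank keep their input order.
sorted_rows = sorted(rows, key=lambda row: row[RANK], reverse=True)

for row in sorted_rows:
    print(row)
```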


In [39]:
df.describe().table()

summary  1 (string)  nullable,names  2 (string)  nullable,height  3 (string)  nullable,function  4 (string)  nullable,rank  5 (string)  nullable
count,3,3.0,3,3.0
mean,,23.83333333333333,,8.0
stddev,,5.575242894559244,,1.7320508075688772
min,Optim'us,17.5,Espionage,7.0
max,ironhide&,28.0,Security,10.0
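
The `stddev` row matches the sample standard deviation (dividing by n - 1). As a quick cross-check of the numeric columns, here is a sketch that uses only Python's standard library. The `summary` dictionary is just an illustrative layout, not the structure Optimus returns.

```python
import statistics

heights = [17.5, 28.0, 26.0]
ranks = [7, 10, 7]

summary = {
    name: {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }
    for name, values in {"height": heights, "rank": ranks}.items()
}

for name, stats in summary.items():
    print(name, stats)
```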


## Selection

Unlike pandas, Spark dataframes don't support random row access, so methods like `loc` in pandas are not available.

Also, Spark dataframes don't have an index, so methods like `iloc` are not available either.

Select and show a specific column

In [40]:
df.cols.select("names").table()

names  1 (string)  nullable
bumbl#ebéé⸱⸱
Optim'us
ironhide&


Select the rows of a dataframe where a condition is met

In [41]:
df.rows.select(df["rank"]>7).table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
Optim'us,28.0,Leader,10


Select rows by specific values in a column

In [42]:
df.rows.is_in("rank",[7, 10]).table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


Create a unique id for every row.

In [43]:
df.create_id().table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable,id  5 (bigint)
bumbl#ebéé⸱⸱,17.5,Espionage,7,0
Optim'us,28.0,Leader,10,1
ironhide&,26.0,Security,7,2


Create new columns

In [44]:
df.cols.append("Affiliation","Autobot").table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable,Affiliation  5 (string)
bumbl#ebéé⸱⸱,17.5,Espionage,7,Autobot
Optim'us,28.0,Leader,10,Autobot
ironhide&,26.0,Security,7,Autobot


## Missing Data

In [45]:
df.rows.drop_na("*",how='any').table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


Filling missing data.

In [46]:
df.cols.fill_na("*","N//A").table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


To get a boolean mask of where values are NaN:

In [47]:
df.cols.is_na("*").table()

names  1 (boolean),height  2 (boolean),function  3 (boolean),rank  4 (boolean)
False,False,False,False
False,False,False,False
False,False,False,False


# Operations

## Stats

In [51]:
df.cols.mean("height")

23.833333333333332

In [52]:
df.cols.mean("*")

{'names': {'mean': None},
 'height': {'mean': 23.833333333333332},
 'function': {'mean': None},
 'rank': {'mean': 8.0}}

### Apply

In [61]:
def func(value, args):
    return value + 1

df.cols.apply("height",func,"float").table()

names  1 (string)  nullable,height  2 (float)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,18.5,Espionage,7
Optim'us,29.0,Leader,10
ironhide&,27.0,Security,7


### Histogramming

In [63]:
df.cols.count_uniques("*")

{'names': {'approx_count_distinct': 3},
 'height': {'approx_count_distinct': 3},
 'function': {'approx_count_distinct': 3},
 'rank': {'approx_count_distinct': 2}}
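
The key `approx_count_distinct` suggests that these counts come from Spark's approximate distinct-count aggregate, which can be inexact on large data. For a dataset this small, the exact counts can be cross-checked in plain Python. This is a sketch and not how Optimus computes them.

```python
labels = ["names", "height", "function", "rank"]
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
]

# Collect each column's values into a set and count the distinct ones.
exact_uniques = {
    label: len({row[i] for row in rows})
    for i, label in enumerate(labels)
}

print(exact_uniques)
```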

### String Methods

In [70]:
df\
    .cols.lower("names")\
    .cols.upper("function").table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,17.5,ESPIONAGE,7
optim'us,28.0,LEADER,10
ironhide&,26.0,SECURITY,7


## Merge

### Concat

Optimus provides an intuitive way to concat dataframes by columns or rows.

In [85]:
df_new = op.create.df(
    [
        "class"
    ],
    [
        ("Autobot"),
        ("Autobot"),
        ("Autobot"),
        ("Autobot"),
        ("Decepticons"),
        
        
    ]).h_repartition(1)

op.concat([df,df_new], "columns").table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable,class  5 (string)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7.0,Autobot
Optim'us,28.0,Leader,10.0,Autobot
Jazz,13.0,First⸱Lieutenant,8.0,Autobot
ironhide&,26.0,Security,7.0,Autobot
Megatron,,,,Decepticons


In [86]:
df_new = op.create.df(
    [
        "names",
        "height",
        "function",
        "rank",
    ],
    [
        ("Grimlock", 22.9, "Dinobot Commander", 9),               
    ]).h_repartition(1)

op.concat([df,df_new], "rows").table()


names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7.0
Optim'us,28.0,Leader,10.0
ironhide&,26.0,Security,7.0
Jazz,13.0,First⸱Lieutenant,8.0
Megatron,,,
Grimlock,22.9,Dinobot⸱Commander,9.0


Operations like `join` and `group` are handled using Spark directly.

In [91]:
import pandas as pd

pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})

sdf = op.create.df(pdf=pdf)
sdf.table()
sdf.melt(id_vars=['A'], value_vars=['B', 'C']).table()

A  1 (string)  nullable,B  2 (bigint)  nullable,C  3 (bigint)  nullable
a,1,2
b,3,4
c,5,6


A  1 (string)  nullable,variable  2 (string),value  3 (string)  nullable
a,B,1
a,C,2
b,B,3
b,C,4
c,B,5
c,C,6
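
To make the reshaping explicit, here is a minimal plain-Python sketch of the unpivot that `melt` performs. The function below is a stand-in written for this example and is not the Optimus implementation. It also keeps the original integer values, whereas the output above shows `value` as a string column.

```python
def melt(rows, columns, id_vars, value_vars):
    """Unpivot value_vars into (id..., variable, value) rows."""
    index = {name: i for i, name in enumerate(columns)}
    melted = []
    for row in rows:
        ids = tuple(row[index[c]] for c in id_vars)
        # Emit one output row per value column, tagged with the column name.
        for var in value_vars:
            melted.append(ids + (var, row[index[var]]))
    return melted


columns = ["A", "B", "C"]
rows = [("a", 1, 2), ("b", 3, 4), ("c", 5, 6)]

long_rows = melt(rows, columns, id_vars=["A"], value_vars=["B", "C"])
for row in long_rows:
    print(row)
```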


In [None]:
pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                            'B': {0: 1, 1: 3, 2: 5},
                            'C': {0: 2, 1: 4, 2: 6}})

In [11]:
df = op.create.df(
    [("A", "str"), ("B", "int"), ("C", "int")],
    [
        ("a", 1, 2),
        ("b", 3, 4),
        ("c", 5, 6),
    ])
df.melt(id_vars=['A'], value_vars=['B', 'C']).table()

[[('A', StringType, True), ('B', StringType, True), ('C', StringType, True)],
 [('a', '1', '2'), ('b', '3', '4'), ('c', '5', '6')]]

A  1 (string)  nullable,variable  2 (string),value  3 (string)  nullable
a,B,1
a,C,2
b,B,3
b,C,4
c,B,5
c,C,6


In [44]:
a = [("A","str"), ("B","int"), ("C","int")],
[
    ("a",1,2),
    ("b",3,4),
    ("c",5,6),        
]

[('a', 1, 2), ('b', 3, 4), ('c', 5, 6)]

In [142]:
#df.to_json()
df.table()

A  1 (string)  nullable,B  2 (string)  nullable,C  3 (string)  nullable
a,1,2
b,3,4
c,5,6


In [140]:
value = df.collect()
[tuple(v.asDict().values()) for v in value]


[('a', '1', '2'), ('b', '3', '4'), ('c', '5', '6')]

In [7]:
df.cols.names()

['names', 'height', 'function', 'rank']

In [40]:
df.export()

"[(A, StringType(), True)(B, StringType(), True)(C, StringType(), True)], [('a', '1', '2'), ('b', '3', '4'), ('c', '5', '6')]"

In [22]:
source_df = op.create.df(
    ["A", "B", "C"],
    [
        ("a", 1, 2),
        ("b", 3, 4),
        ("c", 5, 6),
    ])


actual_df = source_df.melt(id_vars=['A'], value_vars=['B', 'C']).table()

expected_df = op.create.df([('A', StringType, True), ('B', StringType, True), ('C', StringType, True)],
                           [('a', '1', '2'), ('b', '3', '4'), ('c', '5', '6')])

assert (expected_df.collect() == actual_df.collect())

A  1 (string)  nullable,variable  2 (string),value  3 (string)  nullable
a,B,1
a,C,2
b,B,3
b,C,4
c,B,5
c,C,6


NameError: name 'StringType' is not defined

In [20]:
df.schema

StructType(List(StructField(A,StringType,true),StructField(B,StringType,true),StructField(C,StringType,true)))

In [24]:
eval("StringType()")

NameError: name 'StringType' is not defined