# Reference

https://databricks.com/blog/2015/08/12/from-pandas-to-apache-sparks-dataframe.html

https://pandas.pydata.org/pandas-docs/stable/10min.html

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
sys.path.append("..")

## Install Optimus

Install Optimus from the command line with

pip install optimuspyspark

From a notebook you can use

!pip install optimuspyspark

## Import Optimus and start it

In [3]:
from optimus import Optimus
op = Optimus()

## Dataframe creation

Create a dataframe by passing a list of column names and a list of rows. Unlike pandas, you need to specify the column names.

In [49]:
df = op.create.df(
    [
        "names",
        "height",
        "function",
        "rank",
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz",13.0, "First Lieutenant", 8),
        ("Megatron",None, "None", None),
        
    ])
df.table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7.0
Optim'us,28.0,Leader,10.0
ironhide&,26.0,Security,7.0
Jazz,13.0,First⸱Lieutenant,8.0
Megatron,,,


Create a dataframe by passing a list of tuples that specify each column's data type. The data type can be given as a string or as a Spark data type. https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/types/package-summary.html

In [7]:
df = op.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
    ])
df.table()

names  1 (string)  nullable,height  2 (float)  nullable,function  3 (string)  nullable,rank  4 (int)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


Create a dataframe and specify whether each column accepts null values

In [23]:
df = op.create.df(
    [
        ("names", "str", True),
        ("height", "float", False),
        ("function", "str", True),
        ("rank", "int", True),
    ],
    [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
    ])
df.table()

names  1 (string)  nullable,height  2 (string),function  3 (string)  nullable,rank  4 (string)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7
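
The three creation examples above pass column specs in three shapes: a bare name, a (name, type) tuple, and a (name, type, nullable) tuple. The sketch below shows one way such specs could be normalized into a single form. `normalize_spec` is a hypothetical helper written for illustration, not part of Optimus, and the default of `nullable=True` is an assumption based on the outputs above, where columns without an explicit flag are shown as nullable.

```python
# Hypothetical helper for illustration only; this is NOT Optimus code.
def normalize_spec(spec):
    """Return (name, dtype, nullable) for a bare name or a 2/3-item tuple."""
    if isinstance(spec, str):
        # Only a column name was given: no type, assumed nullable.
        return (spec, None, True)
    if len(spec) == 2:
        name, dtype = spec
        return (name, dtype, True)
    name, dtype, nullable = spec
    return (name, dtype, nullable)


specs = ["names", ("height", "float"), ("height", "float", False)]
normalized = [normalize_spec(s) for s in specs]
print(normalized)
```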


Create a dataframe from a pandas dataframe

In [31]:
import pandas as pd
import numpy as np

data = [("bumbl#ebéé  ", 17.5, "Espionage", 7),
         ("Optim'us", 28.0, "Leader", 10),
         ("ironhide&", 26.0, "Security", 7)]
labels = ["names", "height", "function", "rank"]

# Create pandas dataframe
pdf = pd.DataFrame.from_records(data, columns=labels)

df = op.create.df(pdf)
df.table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


## Viewing data

Here is how to view the first 10 rows of a dataframe

In [33]:
df.table(10)

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
bumbl#ebéé⸱⸱,17.5,Espionage,7
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7


Sort the columns by name

In [37]:
df.cols.sort().table()

function  1 (string)  nullable,height  2 (double)  nullable,names  3 (string)  nullable,rank  4 (bigint)  nullable
Espionage,17.5,bumbl#ebéé⸱⸱,7
Leader,28.0,Optim'us,10
Security,26.0,ironhide&,7
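
As a rough illustration of what sorting columns by name means, the plain-Python sketch below reorders a header and one row alphabetically by column name. It only mirrors the result shown above and makes no claim about how Optimus implements `cols.sort()`.

```python
columns = ["names", "height", "function", "rank"]
row = ("bumbl#ebéé  ", 17.5, "Espionage", 7)

# Positions of the columns once their names are in alphabetical order.
order = sorted(range(len(columns)), key=lambda i: columns[i])

sorted_columns = [columns[i] for i in order]
sorted_row = tuple(row[i] for i in order)

print(sorted_columns)
print(sorted_row)
```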


Sort the rows by the value of the rank column

In [36]:
df.rows.sort("rank").table()

names  1 (string)  nullable,height  2 (double)  nullable,function  3 (string)  nullable,rank  4 (bigint)  nullable
Optim'us,28.0,Leader,10
ironhide&,26.0,Security,7
bumbl#ebéé⸱⸱,17.5,Espionage,7
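
The output above lists the highest rank first. A minimal plain-Python sketch of the same ordering is shown here, assuming a descending sort on the rank value. Python's `sorted` keeps tied rows in their original order, whereas the output above shows the two rank-7 rows in the opposite order, so the order of tied rows should not be relied on.

```python
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
]

RANK = 3  # position of the "rank" column in each tuple

# Descending sort on rank; ties keep their input order in plain Python.
by_rank = sorted(rows, key=lambda r: r[RANK], reverse=True)

for r in by_rank:
    print(r)
```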


Show summary statistics for each column

In [50]:
df.describe().table()

summary  1 (string)  nullable,names  2 (string)  nullable,height  3 (string)  nullable,function  4 (string)  nullable,rank  5 (string)  nullable
count,3,3.0,3,3.0
mean,,23.83333333333333,,8.0
stddev,,5.575242894559244,,1.7320508075688772
min,Optim'us,17.5,Espionage,7.0
max,ironhide&,28.0,Security,10.0
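
The numbers in the summary above can be reproduced with the standard library. This sketch recomputes the height statistics for the three-row dataframe; the stddev value matches the sample standard deviation (dividing by n - 1). The min and max of the string column follow plain string comparison, which is why "Optim'us" (uppercase first letter) comes out as the minimum.

```python
import statistics

heights = [17.5, 28.0, 26.0]
names = ["bumbl#ebéé  ", "Optim'us", "ironhide&"]

summary = {
    "count": len(heights),
    "mean": statistics.mean(heights),
    "stddev": statistics.stdev(heights),  # sample standard deviation
    "min": min(heights),
    "max": max(heights),
}
print(summary)

# String columns are compared character by character.
print(min(names), max(names))
```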


## Selection

Unlike pandas, Spark DataFrames don't support random row access, so methods like iloc in pandas are not available.
Spark DataFrames also don't have indexes, so methods like loc in pandas are not available.

Select and show a specific column

In [34]:
df.cols.select("names").table()

names  1 (string)  nullable
bumbl#ebéé⸱⸱
Optim'us
ironhide&
Jazz


Select the rows of a dataframe where a condition is met

In [25]:
df.rows.select(df["rank"]>7).table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable,id  5 (bigint)
Optim'us,28.0,Leader,10,25769803776
Jazz,13.0,First⸱Lieutenant,8,60129542144
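
Conceptually, a row selection keeps only the rows for which the condition is true. The sketch below applies the same `rank > 7` condition to plain Python tuples. The explicit `None` check is there because Python raises an error when comparing `None` with a number, while in Spark a comparison against null is not true and the row is simply dropped.

```python
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
    ("Jazz", 13.0, "First Lieutenant", 8),
    ("Megatron", None, "None", None),
]

RANK = 3  # position of the "rank" column in each tuple

# Keep rows whose rank is present and greater than 7.
selected = [r for r in rows if r[RANK] is not None and r[RANK] > 7]

print([r[0] for r in selected])
```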


Select rows whose column value is in a list of values

In [26]:
df.rows.isin("rank",[7, 10]).table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable,id  5 (bigint)
bumbl#ebéé⸱⸱,17.5,Espionage,7,8589934592
Optim'us,28.0,Leader,10,25769803776
ironhide&,26.0,Security,7,42949672960
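
The membership filter above can be sketched in plain Python as a test against a set of allowed values. This is only an illustration of the idea, not the Optimus implementation.

```python
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
    ("Jazz", 13.0, "First Lieutenant", 8),
    ("Megatron", None, "None", None),
]

RANK = 3  # position of the "rank" column in each tuple
wanted = {7, 10}

# Keep rows whose rank is one of the wanted values.
matching = [r for r in rows if r[RANK] in wanted]

print([r[0] for r in matching])
```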


Add an id column to the dataframe

In [31]:
df.create_id().table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable,id  5 (bigint)
bumbl#ebéé⸱⸱,17.5,Espionage,7,8589934592
Optim'us,28.0,Leader,10,25769803776
ironhide&,26.0,Security,7,42949672960
Jazz,13.0,First⸱Lieutenant,8,60129542144
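
The id values above are large and unevenly spaced. They are consistent with the layout documented for Spark's `monotonically_increasing_id`, which places the partition id in the upper bits and the record number within the partition in the lower 33 bits. Whether `create_id` relies on that function is an assumption here, and the partition ids 1, 3, 5 and 7 below are inferred from the values, not stated in the notebook. `make_id` is a hypothetical name used only for this sketch.

```python
# Hypothetical reconstruction of the id layout, for illustration only.
def make_id(partition_id, record_number):
    """Partition id in the upper bits, record number in the lower 33 bits."""
    return (partition_id << 33) + record_number


# Assumed: each row sits alone in partitions 1, 3, 5 and 7.
ids = [make_id(p, 0) for p in (1, 3, 5, 7)]
print(ids)
```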


Append a new column with a constant value

In [40]:
df.cols.append("Affiliation","Autobot").table()

names  1 (string)  nullable,height  2 (string)  nullable,function  3 (string)  nullable,rank  4 (string)  nullable,id  5 (bigint),Affiliation  6 (string)
bumbl#ebéé⸱⸱,17.5,Espionage,7,8589934592,Autobot
Optim'us,28.0,Leader,10,25769803776,Autobot
ironhide&,26.0,Security,7,42949672960,Autobot
Jazz,13.0,First⸱Lieutenant,8,60129542144,Autobot
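
Appending a constant column amounts to adding one name to the header and the same value to every row. A plain-Python sketch of that idea, independent of Optimus:

```python
columns = ["names", "height", "function", "rank"]
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
]

# Add the column name once and the constant value to each row.
new_columns = columns + ["Affiliation"]
new_rows = [row + ("Autobot",) for row in rows]

print(new_columns)
print(new_rows)
```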


## Missing Data

In [None]:
# Topics: dropna, fill na, is na

Fill the null values in every column with a given value

In [52]:
df.cols.fill_na("*","N//A").table()

names  1 (string),height  2 (string),function  3 (string),rank  4 (string)
names,height,function,rank
names,height,function,rank
names,height,function,rank
names,height,function,rank
names,N//A,function,N//A


Check which values are null in every column

In [50]:
df.cols.is_na("*").table()

names  1 (boolean),height  2 (boolean),function  3 (boolean),rank  4 (boolean)
False,False,False,False
False,False,False,False
False,False,False,False
False,False,False,False
False,True,False,True
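
To make the null handling concrete, the sketch below builds a null mask and a filled copy for two rows in plain Python. Note that the string "None" in the function column is an ordinary value, not a missing one, which is why only height and rank are flagged for Megatron. This shows the intended idea only and is not Optimus code.

```python
rows = [
    ("Jazz", 13.0, "First Lieutenant", 8),
    ("Megatron", None, "None", None),
]

# True where the value is missing, False otherwise.
is_na = [tuple(v is None for v in row) for row in rows]

# Replace every missing value with the placeholder "N//A".
filled = [tuple("N//A" if v is None else v for v in row) for row in rows]

print(is_na)
print(filled)
```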
