- Title: Using Optimus for Data Profiling in PySpark
- Slug: pyspark-optimus-data-profiling
- Date: 2019-12-19 09:44:16
- Category: Computer Science
- Tags: programming, Python, HPC, high performance computing, PySpark, Optimus, data profiling, data profile
- Author: Ben Du
- Modified: 2019-12-19 09:44:16


## Tips & Traps

1. Optimus requires Python 3.6+.
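
One way to fail early on an unsupported interpreter is to check `sys.version_info` before importing Optimus. The sketch below is a generic standard-library check, not something Optimus itself provides; `MIN_PYTHON` and `python_is_supported` are hypothetical names chosen for illustration.

```python
import sys

# Minimum interpreter version stated above (Optimus requires Python 3.6+).
MIN_PYTHON = (3, 6)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# Stop with a clear message instead of failing later during the import.
if not python_is_supported():
    raise RuntimeError("Optimus needs Python %d.%d or newer" % MIN_PYTHON)
```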

In [1]:
import findspark

# A symbolic link to the Spark home is created at /opt/spark for convenience.
findspark.init('/opt/spark')

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session with Hive support enabled.
spark = (
    SparkSession.builder.appName('PySpark Example')
    .enableHiveSupport()
    .getOrCreate()
)

In [5]:
from optimus import Optimus
ops = Optimus(master="local")

In [6]:
df = ops.create.df(
    [
        ("names", "str"),
        ("height", "float"),
        ("function", "str"),
        ("rank", "int"),
    ], [
        ("bumbl#ebéé  ", 17.5, "Espionage", 7),
        ("Optim'us", 28.0, "Leader", 10),
        ("ironhide&", 26.0, "Security", 7),
        ("Jazz", 13.0, "First Lieutenant", 8),
        ("Megatron", None, "None", None),
    ]
)
df.table()

| names (string, nullable) | height (float, nullable) | function (string, nullable) | rank (int, nullable) |
| --- | --- | --- | --- |
| bumbl#ebéé⋅⋅ | 17.5 | Espionage | 7.0 |
| Optim'us | 28.0 | Leader | 10.0 |
| ironhide& | 26.0 | Security | 7.0 |
| Jazz | 13.0 | First⋅Lieutenant | 8.0 |
| Megatron | | | |


In [7]:
ops.profiler.run(df)

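Dataset overview:

| Metric | Value |
| --- | --- |
| Number of columns | 4 |
| Number of rows | 5 |
| Total Missing (%) | 2 |
| Total size in memory | -1 Bytes |

Column types:

| Type | Count |
| --- | --- |
| Categorical | 0 |
| Numeric | 0 |
| Date | 0 |
| Array | 0 |
| Not available | 0 |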

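Column `names`:

| Metric | Value |
| --- | --- |
| Unique | 4.0 |
| Unique (%) | |
| Missing | 0.0 |
| Missing (%) | |

| Data type | Count |
| --- | --- |
| String | 5.0 |
| Integer | |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0.0 |
| Null | 0.0 |

| Value | Count | Frequency (%) |
| --- | --- | --- |
| Jazz | 1 | 20.0% |
| Megatron | 1 | 20.0% |
| bumbl#ebéé | 1 | 20.0% |
| Optim'us | 1 | 20.0% |
| ironhide& | 1 | 20.0% |
| "Missing" | 0 | % |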

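Column `height`:

| Metric | Value |
| --- | --- |
| Unique | 4.0 |
| Unique (%) | |
| Missing | 1.0 |
| Missing (%) | |

| Data type | Count |
| --- | --- |
| String | |
| Integer | |
| Decimal | 4.0 |
| Bool | |
| Date | |
| Missing | 0.0 |
| Null | 1.0 |

| Statistic | Value |
| --- | --- |
| Mean | 21.125 |
| Minimum | 13.0 |
| Maximum | 28.0 |
| Zeros(%) | 0.0 |

| Quantile statistic | Value |
| --- | --- |
| Minimum | 13.0 |
| 5-th percentile | 13.0 |
| Q1 | 13.0 |
| Median | 17.5 |
| Q3 | 26.0 |
| 95-th percentile | 28.0 |
| Maximum | 28.0 |
| Range | |
| Interquartile range | |

| Descriptive statistic | Value |
| --- | --- |
| Standard deviation | 7.07549 |
| Coef of variation | |
| Kurtosis | -1.70021 |
| Mean | 21.125 |
| MAD | |
| Skewness | -0.15561 |
| Sum | 84.5 |
| Variance | 50.0625 |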

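Column `function`:

| Metric | Value |
| --- | --- |
| Unique | 5.0 |
| Unique (%) | |
| Missing | 0.0 |
| Missing (%) | |

| Data type | Count |
| --- | --- |
| String | 5.0 |
| Integer | |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0.0 |
| Null | 0.0 |

| Value | Count | Frequency (%) |
| --- | --- | --- |
| First Lieutenant | 1 | 20.0% |
| Leader | 1 | 20.0% |
| Security | 1 | 20.0% |
| Espionage | 1 | 20.0% |
| | 1 | 20.0% |
| "Missing" | 0 | % |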

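Column `rank`:

| Metric | Value |
| --- | --- |
| Unique | 3.0 |
| Unique (%) | |
| Missing | 1.0 |
| Missing (%) | |

| Data type | Count |
| --- | --- |
| String | |
| Integer | 4.0 |
| Decimal | |
| Bool | |
| Date | |
| Missing | 0.0 |
| Null | 1.0 |

| Statistic | Value |
| --- | --- |
| Mean | 8.0 |
| Minimum | 7.0 |
| Maximum | 10.0 |
| Zeros(%) | 0.0 |

| Quantile statistic | Value |
| --- | --- |
| Minimum | 7.0 |
| 5-th percentile | 7.0 |
| Q1 | 7.0 |
| Median | 7.0 |
| Q3 | 8.0 |
| 95-th percentile | 10.0 |
| Maximum | 10.0 |
| Range | |
| Interquartile range | |

| Descriptive statistic | Value |
| --- | --- |
| Standard deviation | 1.41421 |
| Coef of variation | |
| Kurtosis | -1.0 |
| Mean | 8.0 |
| MAD | |
| Skewness | 0.8165 |
| Sum | 32.0 |
| Variance | 2.0 |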


<optimus.profiler.profiler.Profiler at 0x7fa299b875f8>
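
The numeric summaries in the profile can be cross-checked by hand. The following is a minimal sketch that uses only Python's standard library on the same sample rows. It assumes that nulls are skipped and that the variance is the sample variance (dividing by n - 1), which is consistent with the reported figures. `summarize` is a hypothetical helper and is not part of Optimus.

```python
import statistics

# The same sample rows used to build the DataFrame above.
rows = [
    ("bumbl#ebéé  ", 17.5, "Espionage", 7),
    ("Optim'us", 28.0, "Leader", 10),
    ("ironhide&", 26.0, "Security", 7),
    ("Jazz", 13.0, "First Lieutenant", 8),
    ("Megatron", None, "None", None),
]


def summarize(values):
    """Summarize one numeric column, ignoring None entries (assumed null handling)."""
    present = [v for v in values if v is not None]
    return {
        "null": len(values) - len(present),
        "unique": len(set(present)),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
        "sum": sum(present),
        # Sample variance and standard deviation (n - 1 in the denominator).
        "variance": statistics.variance(present),
        "stdev": statistics.stdev(present),
    }


height = summarize([row[1] for row in rows])
rank = summarize([row[3] for row in rows])

print(height)
print(rank)
```

On this data the `height` column has a mean of 21.125 and a variance of 50.0625, and the `rank` column has a mean of 8 and a variance of 2, which agrees with the profile.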

## References

https://github.com/ironmussa/Optimus

https://github.com/ironmussa/Optimus/tree/master/examples

https://htmlpreview.github.io/?https://github.com/ironmussa/Optimus/blob/master/docs/cheatsheet/optimus_cheat_sheet.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions