spark-meta

Meta data utilities for the Spark DataFrame

Profile

Data profiling works similar to df.describe(), but acts on non-numeric columns.

To use profile execute the implicit method profile on a DataFrame. An example follows.

import com.jasonsatran.spark.meta.profile.DataFrameUtils._

val df = Seq(

      ("Mets","1986", "New York", "27896702"),
      ("Yankees", "2009", "New York", "3063405"),
      ("Cubs", "2016", "Chicago", "3232420"),
      ("White Sox","2005","Chicago", "1746293"),
      ("Nationals","","Washington", "2481938"),
      ("Senators","1924","Washington",null)
    ).toDF("Team", "Last Championship", "City", "2016 Attendance")

df.profile.show

will produce results like this:

+-----------------+------------+-------------+-------------+-----------+------------+---------------+----------+
|      Column Name|Record Count|Unique Values|Empty Strings|Null Values|Percent Fill|Percent Numeric|Max Length|
+-----------------+------------+-------------+-------------+-----------+------------+---------------+----------+
|             Team|           6|            6|            0|          0|       100.0|            0.0|         9|
|Last Championship|           6|            6|            1|          0|        83.3|           83.3|         4|
|             City|           6|            3|            0|          0|       100.0|            0.0|        10|
|  2016 Attendance|           6|            6|            0|          1|        83.3|           83.3|         8|
+-----------------+------------+-------------+-------------+-----------+------------+---------------+----------+

Results Explained

The profile command is similar to the Describe except that it gives us statistics on non-numeric columns. This allows us to get a quick overview of the contents of a DataFrame before getting into in depth analysis.

We provide the following metrics on each column of the input DataFrame

Record Count: The total number of records in the column
Unique Values: The distinct count of unique values in a column
Empty Strings: The count of empty strings in a column
Null Values: The count of null values in a column
Percent Fill: The percentage of records that are not empty string or null
Percent Numeric: The percentage of records that are numeric
Max Len: The max length of data

A Note about Performance

Limit the columns to only those that are required as these calculations are processor intensive. Several queries are executed for each column.

In other words before running profile on a DataFrame it is advisable to select only the required colums:

someDataFrame.select("column1","column2").profile

will perform better than

someDataFrame.profile

Another Note about Performance

For larger data frames run it on a sample

someDataFrame.sample(true, .01).select("col1").profile

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
project		project
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

project

project

src

src

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

build.sbt

build.sbt

Repository files navigation

spark-meta

Profile

Results Explained

A Note about Performance

Another Note about Performance

About

Releases

Packages

Contributors 2

Languages

License

jasonsatran/spark-meta

Folders and files

Latest commit

History

Repository files navigation

spark-meta

Profile

Results Explained

A Note about Performance

Another Note about Performance

About

Topics

Resources

License

Stars

Watchers

Forks

Languages