Datasurveyor

Author:

Nick Buker

Introduction:

Datasurveyor is a small collection of tools for exploratory data analysis. It is built on Pandas, and its tools accept either a DataFrame or a Series. Results are returned as a tidy DataFrame for easy viewing. Currently, datasurveyor focuses on rapidly identifying data quality issues, but the scope will likely expand as the package becomes "battle tested".

Installing datasurveyor:

Datasurveyor can be installed via pip. As always, use of a project-level virtual environment is recommended. Note: Datasurveyor requires Python >= 3.6.

$ pip install datasurveyor

Using Datasurveyor

To demonstrate the tools available in datasurveyor, let's use a Pandas DataFrame named df.

	id	name	state	platform	app_inst	lylty	spend
0	1	Nick	WA	ios	True	0	0
1	2	Gina	OR	android	True	1	nan
2	3	Rob	WA	ios	False	0	10
3	4	Adam	ID	web	True	1	150
4	5	Hanna	WA	ios	True	1	12
5	6	Susan	Null	android	False	0	0
6	7	Quentin	WA	ios	True	1	nan
7	8	Caitlyn	unknown	web	True	0	8
8	9	Matt	WA	web	True	1	50
9	10	Nick	WA	ios	True	0	-10

A data dictionary for df is below.

column	dtype	description
id	int64	unique customer identifier
name	object	customer name
state	object	state of residence
platform	object	system platform
app_inst	bool	app installation flag
lylty	int64	loyalty program flag
spend	float64	total customer spend
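
To follow along, the example data can be rebuilt by hand. The sketch below reconstructs df from the table above. It is not taken from the package itself, and it assumes that the nan entries in spend are NumPy NaN values and that Null and unknown in state are plain strings.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame from the table above.
df = pd.DataFrame({
    "id": range(1, 11),
    "name": ["Nick", "Gina", "Rob", "Adam", "Hanna",
             "Susan", "Quentin", "Caitlyn", "Matt", "Nick"],
    # "Null" and "unknown" are assumed to be ordinary strings, not missing values.
    "state": ["WA", "OR", "WA", "ID", "WA",
              "Null", "WA", "unknown", "WA", "WA"],
    "platform": ["ios", "android", "ios", "web", "ios",
                 "android", "ios", "web", "web", "ios"],
    "app_inst": [True, True, False, True, True,
                 False, True, True, True, True],
    "lylty": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
    # The two missing spend values are assumed to be NaN.
    "spend": [0, np.nan, 10, 150, 12, 0, np.nan, 8, 50, -10],
})

print(df.dtypes)
```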

Binary features

Description

The methods within BinaryFeatures are intended for use with binary data (data with two possible values). Datasurveyor expects binary features to be stored as bools or integers (with values of 0 or 1). In the example data, app_inst and lylty are binary features.

Importing BinaryFeatures

The binary feature tools can be imported with the command below.

from datasurveyor import BinaryFeatures as BF

Checking if all values are the same

The check_all_same method can be used to check if binary features contain exclusively the same value. This method can be applied to a single binary feature or a collection of binary features.

BF.check_all_same(df['app_inst'])

	all_same
0	False

BF.check_all_same(df[['app_inst', 'lylty']])

	column	all_same
0	app_inst	False
1	lylty	False

Checking if values are mostly the same

The check_mostly_same method can be used to check if binary features contain mostly the same value (default threshold 95%). This method can be applied to a single binary feature or a collection of binary features.

BF.check_mostly_same(df['app_inst'])

	mostly_same	thresh	mean
0	False	0.95	0.8

BF.check_mostly_same(df[['app_inst', 'lylty']])

	column	mostly_same	thresh	mean
0	app_inst	False	0.95	0.8
1	lylty	False	0.95	0.5

The user can specify whatever threshold is appropriate for their use case. If thresh=0.7 is applied, the method will flag features with at least 70% the same value.

BF.check_mostly_same(df['app_inst'], thresh=0.7)

	mostly_same	thresh	mean
0	True	0.7	0.8

BF.check_mostly_same(df[['app_inst', 'lylty']], thresh=0.7)

	column	mostly_same	thresh	mean
0	app_inst	True	0.7	0.8
1	lylty	False	0.7	0.5

Checking the range

The check_outside_range method can be used to detect features with values other than the expected 0 and 1. Note that an out-of-range condition is only possible for binary features stored as integers, since bools can only be True or False.

BF.check_outside_range(df['app_inst'])

	outside_range
0	False

BF.check_outside_range(df[['app_inst', 'lylty']])

	column	outside_range
0	app_inst	False
1	lylty	False
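
Since the example data contains no out-of-range values, it may help to see a case that would be flagged. The sketch below uses plain Pandas rather than datasurveyor itself; the helper outside_range and the bad_flag series are hypothetical and exist only for illustration.

```python
import pandas as pd

# The lylty column from the example data, plus a made-up column with bad values.
lylty = pd.Series([0, 1, 0, 1, 1, 0, 1, 0, 1, 0])
bad_flag = pd.Series([0, 1, 2, 1, -1])  # hypothetical: 2 and -1 are invalid


def outside_range(feature: pd.Series) -> bool:
    """Hypothetical helper: True if any value is something other than 0 or 1."""
    return bool((~feature.isin([0, 1])).any())


lylty_outside = outside_range(lylty)
bad_outside = outside_range(bad_flag)
print(lylty_outside, bad_outside)
```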

Categorical features

Description

The methods within CategoricalFeatures are intended for use with categorical data (data denoting categories). Datasurveyor expects categorical features to be stored as object (string) or integer type. In the example data, state and platform are categorical features.

Importing CategoricalFeatures

The categorical feature tools can be imported with the command below.

from datasurveyor import CategoricalFeatures as CF

Checking if values are mostly the same

The check_mostly_same method can be used to check if categorical features contain mostly the same value (default threshold 95%). This method can be applied to a single categorical feature or a collection of categorical features.

CF.check_mostly_same(df['state'])

	mostly_same	thresh	most_common	count	prop
0	False	0.95	WA	6	0.6

CF.check_mostly_same(df[['state', 'platform']])

	column	mostly_same	thresh	most_common	count	prop
0	state	False	0.95	WA	6	0.6
1	platform	False	0.95	ios	5	0.5

The user can specify whatever threshold is appropriate for their use case. If thresh=0.6 is applied, the method will flag features with at least 60% the same value.

CF.check_mostly_same(df['state'], thresh=0.6)

	mostly_same	thresh	most_common	count	prop
0	True	0.6	WA	6	0.6

CF.check_mostly_same(df[['state', 'platform']], thresh=0.6)

	column	mostly_same	thresh	most_common	count	prop
0	state	True	0.6	WA	6	0.6
1	platform	False	0.6	ios	5	0.5
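
The most_common, count, and prop columns can be reproduced with value_counts. This is a rough equivalent in plain Pandas, assuming the flag is set when prop is greater than or equal to the threshold; it is not the package's own code.

```python
import pandas as pd

# The state column from the example data.
state = pd.Series(["WA", "OR", "WA", "ID", "WA",
                   "Null", "WA", "unknown", "WA", "WA"])

counts = state.value_counts()   # sorted from most to least frequent
most_common = counts.index[0]   # the dominant category
count = int(counts.iloc[0])     # how many rows hold it
prop = count / len(state)       # share of rows holding it

# Assumption: a feature is flagged when prop reaches the threshold.
thresh = 0.6
flagged = prop >= thresh
print(most_common, count, prop, flagged)
```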

Checking number of categories

The check_n_categories method can be used to count the number of categories. This method can be applied to a single categorical feature or a collection of categorical features.

CF.check_n_categories(df['state'])

	n_categories
0	4

CF.check_n_categories(df[['state', 'platform']])

	column	n_categories
0	state	4
1	platform	3

General features

Description

The methods within GeneralFeatures are intended for use with any data. Datasurveyor expects inputs to be of type Pandas Series or DataFrame, but has no type expectations for the data within those structures.

Importing GeneralFeatures

The general feature tools can be imported with the command below.

from datasurveyor import GeneralFeatures as GF

Checking for nulls

The check_nulls method can be used to check for nulls. This method can be applied to a single feature or a collection of features.

GF.check_nulls(df['spend'])

	nulls_present	null_count	prop_null
0	True	2	0.2

GF.check_nulls(df)

	column	nulls_present	null_count	prop_null
0	id	False	0	0
1	name	False	0	0
2	state	False	0	0
3	platform	False	0	0
4	app_inst	False	0	0
5	lylty	False	0	0
6	spend	True	2	0.2
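
The null_count and prop_null columns correspond to a simple isna count. A minimal sketch in plain Pandas, offered as an illustration rather than as datasurveyor's implementation:

```python
import numpy as np
import pandas as pd

# The spend column from the example data, with two missing values.
spend = pd.Series([0, np.nan, 10, 150, 12, 0, np.nan, 8, 50, -10])

null_count = int(spend.isna().sum())  # number of missing entries
prop_null = null_count / len(spend)   # share of rows that are missing
nulls_present = null_count > 0
print(nulls_present, null_count, prop_null)
```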

Checking for fuzzy nulls

The check_fuzzy_nulls method can be used to check for values that commonly denote nulls. This method can be applied to a single feature or a collection of features.

GF.check_fuzzy_nulls(df['state'])

	fuzzy_nulls_present	fuzzy_null_count	prop_fuzzy_null
0	True	1	0.1

GF.check_fuzzy_nulls(df)

	column	fuzzy_nulls_present	fuzzy_null_count	prop_fuzzy_null
0	id	False	0	0
1	name	False	0	0
2	state	True	1	0.1
3	platform	False	0	0
4	app_inst	False	0	0
5	lylty	False	0	0
6	spend	False	0	0

The default items checked for are: 'null', 'Null', 'NULL', '' (empty string), and ' ' (single space). The user can specify additional items to check for using the add_fuzzy_nulls argument.

GF.check_fuzzy_nulls(df['state'], add_fuzzy_nulls=['unknown'])

	fuzzy_nulls_present	fuzzy_null_count	prop_fuzzy_null
0	True	2	0.2

GF.check_fuzzy_nulls(df, add_fuzzy_nulls=['unknown'])

	column	fuzzy_nulls_present	fuzzy_null_count	prop_fuzzy_null
0	id	False	0	0
1	name	False	0	0
2	state	True	2	0.2
3	platform	False	0	0
4	app_inst	False	0	0
5	lylty	False	0	0
6	spend	False	0	0
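
The fuzzy null check amounts to a membership test against a list of placeholder strings. The sketch below mimics that behavior in plain Pandas under the defaults listed above; the helper count_fuzzy_nulls and the constant DEFAULT_FUZZY_NULLS are hypothetical names, not part of datasurveyor.

```python
import pandas as pd

# Default placeholder values, as listed in the text above.
DEFAULT_FUZZY_NULLS = ["null", "Null", "NULL", "", " "]

# The state column from the example data.
state = pd.Series(["WA", "OR", "WA", "ID", "WA",
                   "Null", "WA", "unknown", "WA", "WA"])


def count_fuzzy_nulls(feature: pd.Series, add_fuzzy_nulls=None) -> int:
    """Hypothetical helper: count values that commonly stand in for nulls."""
    candidates = DEFAULT_FUZZY_NULLS + list(add_fuzzy_nulls or [])
    return int(feature.isin(candidates).sum())


default_count = count_fuzzy_nulls(state)
extended_count = count_fuzzy_nulls(state, add_fuzzy_nulls=["unknown"])
print(default_count, extended_count)
```

Only 'Null' matches the defaults; adding 'unknown' brings the count to two, in line with the outputs above.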

Unique features

Description

The methods within UniqueFeatures are intended for use with data where each observation has a unique value. Datasurveyor expects unique features to be stored as datetime, object (string), or integer type. In the example data, id is a unique feature.

Importing UniqueFeatures

The unique feature tools can be imported with the command below.

from datasurveyor import UniqueFeatures as UF

Checking uniqueness

The check_uniqueness method can be used to check if potentially unique features contain unique values. This method can be applied to a single unique feature or a collection of unique features.

UF.check_uniqueness(df['id'])

	dupes_present	dupe_count	prop_dupe
0	False	0	0

UF.check_uniqueness(df[['id', 'name']])

	column	dupes_present	dupe_count	prop_dupe
0	id	False	0	0
1	name	True	1	0.1
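
The duplicate count for name comes from the second occurrence of Nick. Assuming that a duplicate means any repeat after the first occurrence, the same figures can be reproduced with Pandas' duplicated method. This is a sketch for illustration, not the package's own code.

```python
import pandas as pd

# The name column from the example data; "Nick" appears twice.
name = pd.Series(["Nick", "Gina", "Rob", "Adam", "Hanna",
                  "Susan", "Quentin", "Caitlyn", "Matt", "Nick"])

# duplicated() marks every repeat after the first occurrence.
dupe_count = int(name.duplicated().sum())
prop_dupe = dupe_count / len(name)
dupes_present = dupe_count > 0
print(dupes_present, dupe_count, prop_dupe)
```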

Contributing to datasurveyor

If you are interested in contributing to this project:

Fork the datasurveyor repo.
Clone the forked repository to your machine.
Create a git branch.
Make changes and push them to GitHub.
Submit your changes for review by creating a pull request. In order to be approved, changes should include:
- Appropriate updates to the README.md
- Google style docstrings
- Tests providing proper coverage of new code

Testing

Datasurveyor uses pytest as its testing framework, and the test scripts can be found in the tests/ directory. To run the tests, create a virtual environment, install the contents of requirements-dev.txt, and run the following command from the root directory of the project.

$ pytest

To run tests and view coverage, use the below command:

$ pytest --cov=datasurveyor
