Data-preppy...

...is a collection of scripts for working with CSV files and preparing data, for example as input to scikit-learn prediction methods. A file is never held in memory in full, so very large files can be processed.
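To illustrate the row-at-a-time idea, here is a minimal sketch of computing per-column statistics in a single pass. It is not the code from summarize_csv.py; the function name `stream_column_stats` and the choice of statistics are assumptions made for this example, and it is written for Python 3 with only the standard library (the project itself targets Python 2.7).

```python
import csv
import io


def stream_column_stats(lines):
    """Collect per-column counts in one pass, keeping only one row in memory.

    `lines` is any iterable of text lines, e.g. an open file object.
    (Hypothetical helper for illustration, not part of data-preppy.)
    """
    reader = csv.reader(lines)
    header = next(reader)
    stats = {
        name: {"filled": 0, "empty": 0, "numeric": 0, "total": 0.0}
        for name in header
    }
    for row in reader:
        for name, cell in zip(header, row):
            col = stats[name]
            if cell.strip() == "":
                col["empty"] += 1
                continue
            col["filled"] += 1
            try:
                # Count the cell as numeric only if it parses as a float.
                col["total"] += float(cell)
                col["numeric"] += 1
            except ValueError:
                pass
    return stats


# Small in-memory sample; a real call would pass an open file instead.
sample = io.StringIO("age,city\n30,Lund\n,Umea\n50,\n")
result = stream_column_stats(sample)
mean_age = result["age"]["total"] / result["age"]["numeric"]
```

Because the reader yields one row at a time and only the running totals are stored, memory use depends on the number of columns rather than the number of rows.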

I need to write better documentation for this, but for now you sometimes need to read the source.

For now it is Python 2.7 only. I gratefully accept pull requests to fix this and other issues.

Depends on numpy, scikit-learn, a little bit on pandas and optionally sknn (a neural network for scikit-learn).

Quick usage

./summarize_csv.py table.csv  # Calculates stats (including unicode sparklines) and writes them to stdout. Save the stats to a file such as stat.txt
# Then add "+" in front of the fields that you want to keep
./filter.py stat.txt table.csv filtered_table.csv # filter columns
./csv_fix.py stat.txt filtered_table.csv fixed_table.csv  # impute standard values in empty cells, split category fields into several boolean fields
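The last step above (imputing empty cells and splitting category fields into boolean fields) can be sketched roughly as follows. This is not how csv_fix.py is implemented, and it does not read the stat.txt format; the function `impute_and_expand`, its two dictionary arguments, and the `name=value` column naming are all assumptions chosen for this Python 3 example.

```python
import csv
import io


def impute_and_expand(rows, numeric_fill, category_values):
    """Yield rows with empty cells filled and category fields split into 0/1 fields.

    numeric_fill: {column: value to use for empty cells}      (hypothetical input)
    category_values: {column: ordered list of categories}     (hypothetical input)
    """
    for row in rows:
        out = {}
        for name, cell in row.items():
            cell = cell or ""  # DictReader gives None for missing trailing cells
            if name in category_values:
                # One boolean field per known category value.
                for value in category_values[name]:
                    out["%s=%s" % (name, value)] = "1" if cell == value else "0"
            elif name in numeric_fill and cell.strip() == "":
                out[name] = str(numeric_fill[name])
            else:
                out[name] = cell
        yield out


source = io.StringIO("age,color\n30,red\n,blue\n50,\n")
fixed = list(
    impute_and_expand(
        csv.DictReader(source),
        numeric_fill={"age": 40},
        category_values={"color": ["red", "blue"]},
    )
)
```

The generator processes one row at a time, so it fits the same streaming approach as the rest of the tools; the fill values and category lists would have to come from an earlier pass over the data.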

Other files

add_columns.py - Take two or more CSV files with the same number of rows and combine their columns.
add_rows.py - Append the rows of two CSV files that have the same headers.
cluster_csv.py - Cluster the rows in a CSV and add a cluster group column to the rows. Can be useful to get a "holistic" view of each row.
convert_coords.py - Convert SWEREF coords to GPS. Not really similar to the other tools.
dict_merge.py - Run through a CSV and replace cells via a translation dict.
filter_column_names.py - Remove fields from CSV by name
kill_outliers.py - Run after summarize_csv if you want to remove (presumably broken) rows whose field types differ from the rest.
merge_csv.py - Merge two CSV files according to an ID column. Like a join.
metrics.py - Calculate and graph performance metrics for classification
predict.py - Some prediction methods
smart_csv.py - The class used to read CSVs in the other files. Guesses encoding, delimiter, etc.
split_csv.py - Split a CSV into training and test sets
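As an illustration of how a tool like split_csv.py can work without loading the file, here is a minimal sketch of a streaming training/test split. It is not taken from split_csv.py; the function `split_rows`, the `test_fraction` and `seed` parameters, and the per-row random assignment are assumptions made for this Python 3 example.

```python
import csv
import io
import random


def split_rows(lines, train_out, test_out, test_fraction=0.25, seed=0):
    """Send each row to the test or training output as it is read.

    Each row is assigned independently at random, so the test share is only
    approximately `test_fraction`. (Hypothetical helper for illustration.)
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    train_writer = csv.writer(train_out)
    test_writer = csv.writer(test_out)
    # Both output files get a copy of the header row.
    train_writer.writerow(header)
    test_writer.writerow(header)
    counts = {"train": 0, "test": 0}
    for row in reader:
        if rng.random() < test_fraction:
            test_writer.writerow(row)
            counts["test"] += 1
        else:
            train_writer.writerow(row)
            counts["train"] += 1
    return counts


# In-memory demo; real use would pass open file objects instead.
source = io.StringIO(
    "id,value\n" + "".join("%d,%d\n" % (i, i * i) for i in range(100))
)
train_buf, test_buf = io.StringIO(), io.StringIO()
counts = split_rows(source, train_buf, test_buf)
```

Assigning rows one by one avoids holding the file in memory, at the cost of not hitting the requested fraction exactly; fixing the seed makes the split reproducible.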
