# Open-FF data set customizer
Use this notebook to convert Open-FF's full data set, which is stored in "parquet" format, into a different format and, if desired, to filter it down to a smaller subset.

In [None]:
# Fetch the Open-FF code repository.
# When running in Colab, uncomment the two lines below;
#   leave them commented out when running locally.

# !git clone https://github.com/gwallison/openFF.git &>/dev/null;
# %run openFF/notebooks/Data_set_customizer_support.py

In [None]:
# Local setup: keep the lines below commented out unless running locally, in which case they replace the cell above.
import sys
sys.path.insert(0,'c:/MyDocs/integrated/') # adjust to your setup
%run Data_set_customizer_support.py

# Optional: Filter by state

In [None]:
states = prep_states()
states

Select the state(s) that you want in the output file. The selection box contains only those states present in the Open-FF data. Use `shift`-click or `ctrl`-click to select more than one state.

Then run the next cell.

In [None]:
df = filter_by_statelist(df,states)
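
For readers who want to apply the same kind of filter outside this notebook, the following is a minimal pandas sketch. It is not the code behind `filter_by_statelist`; the function name `keep_states`, the column name `bgStateName`, and the choice to treat an empty selection as "keep all states" are assumptions made for illustration.

```python
import pandas as pd


def keep_states(df, selected, state_col="bgStateName"):
    """Return only the rows whose state is in `selected`.

    `state_col` is an assumed column name; adjust it to your data.
    An empty selection is treated here as "keep every state".
    """
    if not selected:
        return df
    return df[df[state_col].isin(selected)]


# Tiny stand-in for the Open-FF dataframe (values are made up).
example = pd.DataFrame(
    {
        "bgStateName": ["ohio", "texas", "ohio", "colorado"],
        "OperatorName": ["A", "B", "C", "D"],
    }
)

subset = keep_states(example, ["ohio", "colorado"])
print(subset)
```

Using `isin` keeps the filter a single vectorized operation, which matters for a data set with millions of records.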

# Optional: Include chemical data?
If you are interested in chemical records at all, select the "include chemical data" option.  However, if you are only interested in variables like location, operator name, total base water volume, and/or date, deselecting this option will greatly reduce the output file size.

If you choose to include chemical data, you will be given the option to filter which chemicals are included. The options are:
- all (which will include non-chemical categories like "proprietary" and "ambiguousID"),
- specific sets, or
- a custom list that you enter by hand.

In [None]:
inc_chk_box = show_inc_chem_checkbox()
inc_chk_box

In [None]:
chem_set = show_chem_set(inc_chk_box)
chem_set

In [None]:
cus_chem = check_for_custom_list(chem_set,inc_chk_box)
cus_chem

In [None]:
df = filter_by_chem_set(df,chem_set,cus_chem)

# Optional: Select columns to include
There are over 100 different columns in the Open-FF full data set, which includes both the original FracFocus columns and columns that Open-FF generates. You will rarely need most of them, and selecting a subset will keep your final custom data set smaller.

Choosing the Standard set will reduce the columns to a smaller but typical set and it will also remove disclosures and records that have been flagged as duplicates.  The Full set keeps all records (duplicates can be filtered later using the `in_std_filtered` flag).
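
If you keep the Full set and want to drop flagged duplicates later, a filter along these lines may work. This is a sketch under the assumption that `in_std_filtered` is a boolean column in which `True` marks records that the Standard set would keep; check the column in your own output before relying on it. The `DisclosureId` values are made up.

```python
import pandas as pd

# Tiny stand-in for a Full-set export (values are made up).
full = pd.DataFrame(
    {
        "DisclosureId": ["a", "a", "b"],
        # Assumption: True means the record passes the Standard filter.
        "in_std_filtered": [True, False, True],
    }
)

# Keep only the records that are not flagged as duplicates.
deduped = full[full["in_std_filtered"]]
print(deduped)
```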

Choose a column set:

In [None]:
col_set = show_col_set()
col_set

Process the selection:

In [None]:
df = filter_by_col_set(df,col_set,inc_chk_box)

# Select the output format

The current formats available are:
- **"parquet"** - a compressed, structured format for large files (recommended if you have the ability to use it).
- **"CSV"** - a traditional, text-based format that is the standard input for spreadsheets. It can require 10x the storage and processing time of "parquet". This format does not keep Open-FF formatting, so you may need to specify, for example, which columns are text and which are numeric (e.g. APINumber should be text because of leading zeros).
- **"Excel"** - similar to CSV, but it will keep some formatting. However, the maximum sheet size is 1,048,576 rows by 16,384 columns, which will not hold all Open-FF records. Filtered files may fit.
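
The leading-zero issue with CSV can be seen in a small pandas sketch. The data below is made up, and the in-memory `io.StringIO` buffer stands in for a CSV file saved from this notebook; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Tiny stand-in for a CSV export; the APINumber value is made up.
csv_text = "APINumber,TotalBaseWaterVolume\n05123456780000,1250000\n"

# Default parsing infers an integer type, which drops the leading zero.
as_inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as text keeps the identifier intact.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"APINumber": str})

print(as_inferred["APINumber"].iloc[0])
print(as_text["APINumber"].iloc[0])
```

Passing a `dtype` mapping only for the identifier columns leaves pandas free to infer numeric types for everything else.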

In [None]:
format_type = show_formats()
format_type

Finally, make the output file:

In [None]:
# save the output file
make_output_file(df,format_type)