# Loading and Saving DataFrames

Loading and saving data in pandas has a few gotchas and annoyances that we can
smooth over with two simplifying functions, `pd.load()` and `df.save()`.  This
notebook covers both.

First, we import pandas and Clear Data.

In [11]:
# Because this is the development repo, we add the source folder to the path
# in this ugly way, but if you've done pip install clear-data, you can skip
# this path setup entirely.
import os
import sys
sys.path.append( os.getcwd()+"/../src" )

# In your own code, do just this:
import pandas as pd
import clear_data

Let's imagine we have an example DataFrame containing employee information.
(To see more about generating example data, see the notebook
[Generating Example Data](generating-example-data.ipynb).)

In [12]:
df = pd.example()
df.head()

Unnamed: 0,LastName,FirstName,ID,Department,Salary,YearsAtCompany
0,Sanchez,Penelope,960718,Research,66761.13,10
1,Young,Sebastian,520924,Management,113989.47,3
2,Carter,Jack,160330,Human Resources,72433.12,3
3,Moore,Mason,457142,Human Resources,62649.34,7
4,Baker,Daniel,429878,Management,148397.5,5


## Let's say we wanted to save the file

Old way:

In [13]:
# Choose your favorite method, such as:
df.to_csv( 'output.csv' )
# or: df.to_html( 'output.html' )
# or: df.to_excel( 'output.xlsx' )
# or any of many other formats, some of which have particulars you must get
# right when writing, such as the JSON orientation and the TSV separator.
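To make those particulars concrete, here is a minimal sketch in plain pandas (no Clear Data involved). The two-row table reuses values from the example above, and the temporary folder and file names are chosen only for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# A two-row stand-in for the employee table shown earlier.
df = pd.DataFrame({
    "LastName": ["Sanchez", "Young"],
    "Salary": [66761.13, 113989.47],
})

# Write into a throwaway folder so the sketch leaves nothing behind in your project.
out_dir = tempfile.mkdtemp()

# TSV is written with to_csv(); the tab separator has to be given explicitly.
tsv_path = os.path.join(out_dir, "output.tsv")
df.to_csv(tsv_path, sep="\t", index=False)

# JSON needs an orientation; "records" writes one object per row.
json_path = os.path.join(out_dir, "output.json")
df.to_json(json_path, orient="records")

# Read the raw files back to see what was actually written.
with open(tsv_path) as f:
    header = f.readline().rstrip("\n")
with open(json_path) as f:
    rows = json.load(f)
```

Each format has its own writer and its own arguments to remember, which is the bookkeeping a single extension-driven call is meant to remove.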

Clear Data way:

In [14]:
df.save( 'output.json' ) # file extension determines how to save

But that is not the exciting part.  Saving data is much easier than loading it.
Any of the following annoyances can come up when loading data:

 * Specifying the correct separator for CSV vs. TSV.
 * Specifying the sheet name when loading from Excel files.
 * Lifting the result DataFrame out of a dict of sheets when loading from
   Excel files, or out of a list of tables when loading from HTML.
 * Downloading an HDF file before reading, because `pd.read_hdf()` does not
   support URLs.
 * Figuring out the correct orientation when loading from JSON.
 * Determining whether you need to apply normalization when loading from JSON.
 * Switching the XML parser if loading from HTML fails.
 * Knowing the correct argument types and meanings unique to each load function
   (e.g., `pd.read_excel()` takes different arguments than `pd.read_orc()`).
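A few of the items in this list can be reproduced with plain pandas alone. The sketch below is only an illustration: it reads from in-memory strings rather than files, and the nested JSON record is made up to show why normalization is sometimes needed.

```python
import io

import pandas as pd

# CSV and TSV share one reader; only the separator differs, and you must supply it.
tsv_text = "LastName\tDepartment\nSanchez\tResearch\nYoung\tManagement\n"
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# JSON has to be read with the same orientation it was written with.
json_text = from_tsv.to_json(orient="split")
from_json = pd.read_json(io.StringIO(json_text), orient="split")

# Nested JSON needs flattening before it looks like a table.
# (This nested record is invented for the example.)
nested = [{"name": {"last": "Sanchez", "first": "Penelope"}, "dept": "Research"}]
flat = pd.json_normalize(nested)
```

Here `flat` ends up with dotted column names such as `name.last`, one more detail you would have to know about in advance.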

But Clear Data handles all of that for you.  Just use `pd.load()`:

In [15]:
reloaded_df = pd.load( 'output.json' )
reloaded_df.head()

Unnamed: 0,LastName,FirstName,ID,Department,Salary,YearsAtCompany
0,Sanchez,Penelope,960718,Research,66761.13,10
1,Young,Sebastian,520924,Management,113989.47,3
2,Carter,Jack,160330,Human Resources,72433.12,3
3,Moore,Mason,457142,Human Resources,62649.34,7
4,Baker,Daniel,429878,Management,148397.5,5
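To give a feel for the idea of letting the file extension choose the reader, here is a hypothetical sketch built on plain pandas. The `READERS` table and the `load_by_extension()` helper are names invented for this illustration. This is not Clear Data's actual implementation, which handles many more formats and cases.

```python
import os
import tempfile

import pandas as pd

# Hypothetical mapping from file extension to a plain-pandas reader.
# This is NOT how Clear Data is implemented; it only illustrates the idea.
READERS = {
    ".csv": lambda path: pd.read_csv(path),
    ".tsv": lambda path: pd.read_csv(path, sep="\t"),
    ".json": lambda path: pd.read_json(path, orient="records"),
    ".pkl": lambda path: pd.read_pickle(path),
}


def load_by_extension(path):
    """Pick a reader based on the file extension (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in READERS:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return READERS[ext](path)


# Usage: write a small TSV file, then load it without naming the separator.
df = pd.DataFrame({"LastName": ["Carter", "Moore"], "YearsAtCompany": [3, 7]})
path = os.path.join(tempfile.mkdtemp(), "example.tsv")
df.to_csv(path, sep="\t", index=False)
loaded = load_by_extension(path)
```

The caller only names the file; the format-specific arguments live in one place instead of being repeated at every call site.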


Not all formats preserve data types perfectly.  If you care about that
issue, choose a format such as Parquet, Pickle, ORC, or HDF.
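As a small illustration of what "not preserved" means, the sketch below round-trips a categorical column through CSV and through Pickle using plain pandas. Treating `Department` as a categorical column is an assumption made for this example, and the file names are arbitrary.

```python
import os
import tempfile

import pandas as pd

# Three rows from the example table, with Department stored as a categorical.
df = pd.DataFrame({
    "ID": [960718, 520924, 160330],
    "Department": pd.Categorical(["Research", "Management", "Human Resources"]),
})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "types.csv")
pkl_path = os.path.join(out_dir, "types.pkl")

# Save the same DataFrame in a text format and in a binary format.
df.to_csv(csv_path, index=False)
df.to_pickle(pkl_path)

from_csv = pd.read_csv(csv_path)
from_pkl = pd.read_pickle(pkl_path)

# CSV stores only text, so the categorical dtype is lost on reload;
# Pickle stores the pandas object itself, so the dtype survives.
csv_kept = isinstance(from_csv["Department"].dtype, pd.CategoricalDtype)
pkl_kept = isinstance(from_pkl["Department"].dtype, pd.CategoricalDtype)
```

The values themselves come back intact from both files; only the column's type differs after the CSV round trip.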