# Saving and Serialising a DataFrame


In [1]:
import numpy as np
import pandas as pd

In [4]:
# Let's make a new DataFrame and save it out using various formats
df = pd.DataFrame(np.random.random(size=(100000, 4)), columns=["A", "B", "C", "D"])
df.head()

Unnamed: 0,A,B,C,D
0,0.069474,0.016839,0.607693,0.960414
1,0.755562,0.792302,0.638826,0.257696
2,0.766277,0.049024,0.264378,0.898995
3,0.263386,0.18859,0.977028,0.101986
4,0.052184,0.381186,0.655244,0.827316


In [6]:
df.to_csv("save.csv", index=False, float_format="%0.4f")

In [7]:
df.to_pickle("save.pkl")

In [8]:
# pip install tables
df.to_hdf("save.hdf", key="data", format="table")

In [9]:
# pip install pyarrow  (older pandas versions used the feather-format package)
df.to_feather("save.fth")

In [11]:
# If you want to get the timings you can see in the video, you'll need this extension:
# https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/execute_time/readme.html
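If you don't want to install the extension, you can get rough timings by hand. The sketch below is one way to do it with `time.perf_counter`, not the method used in the video. The helper `time_call`, the `df_demo` frame, and the `demo.*` file names are made up for illustration, and the frame is deliberately smaller than the one above so it runs quickly. It only covers CSV and pickle so it needs no optional packages.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd


def time_call(func, *args, **kwargs):
    """Return the wall-clock seconds taken by one call to func (hypothetical helper)."""
    start = time.perf_counter()
    func(*args, **kwargs)
    return time.perf_counter() - start


# A smaller numeric frame than the one above, purely to keep the sketch fast.
df_demo = pd.DataFrame(np.random.random(size=(10000, 4)), columns=["A", "B", "C", "D"])

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Time each write, then the matching read of the file just written.
    timings["to_csv"] = time_call(df_demo.to_csv, tmp / "demo.csv", index=False)
    timings["read_csv"] = time_call(pd.read_csv, tmp / "demo.csv")
    timings["to_pickle"] = time_call(df_demo.to_pickle, tmp / "demo.pkl")
    timings["read_pickle"] = time_call(pd.read_pickle, tmp / "demo.pkl")

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

A single call is a noisy measurement, so treat the numbers as a rough guide. In a notebook, the built-in `%timeit` magic repeats the call for you and is usually a better choice.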

Now this is a very easy test: it's only numeric data. If we add strings and categorical data, things can slow down a lot! Let's try this on mixed astronaut data from Kaggle: https://www.kaggle.com/nasa/astronaut-yearbook

In [14]:
df = pd.read_csv("astronauts.csv")
df.head()

Unnamed: 0,Name,Year,Group,Status,Birth Date,Birth Place,Gender,Alma Mater,Undergraduate Major,Graduate Major,Military Rank,Military Branch,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr),Missions,Death Date,Death Mission
0,Joseph M. Acaba,2004.0,19.0,Active,5/17/1967,"Inglewood, CA",Male,University of California-Santa Barbara; Univer...,Geology,Geology,,,2,3307,2,13.0,"STS-119 (Discovery), ISS-31/32 (Soyuz)",,
1,Loren W. Acton,,,Retired,3/7/1936,"Lewiston, MT",Male,Montana State University; University of Colorado,Engineering Physics,Solar Physics,,,1,190,0,0.0,STS 51-F (Challenger),,
2,James C. Adamson,1984.0,10.0,Retired,3/3/1946,"Warsaw, NY",Male,US Military Academy; Princeton University,Engineering,Aerospace Engineering,Colonel,US Army (Retired),2,334,0,0.0,"STS-28 (Columbia), STS-43 (Atlantis)",,
3,Thomas D. Akers,1987.0,12.0,Retired,5/20/1951,"St. Louis, MO",Male,University of Missouri-Rolla,Applied Mathematics,Applied Mathematics,Colonel,US Air Force (Retired),4,814,4,29.0,"STS-41 (Discovery), STS-49 (Endeavor), STS-61 ...",,
4,Buzz Aldrin,1963.0,3.0,Retired,1/20/1930,"Montclair, NJ",Male,US Military Academy; MIT,Mechanical Engineering,Astronautics,Colonel,US Air Force (Retired),2,289,2,8.0,"Gemini 12, Apollo 11",,


In [15]:
df.to_csv("save.csv", index=False, float_format="%0.4f")

In [16]:
pd.read_csv("save.csv");

In [17]:
df.to_pickle("save.pkl")

In [18]:
pd.read_pickle("save.pkl");

In [32]:
df.to_hdf("save.hdf", key="data", format="table")

In [22]:
pd.read_hdf("save.hdf");

In [23]:
df.to_feather("save.fth")

In [24]:
pd.read_feather("save.fth");

In [34]:
%ls

 Volume in drive C is System
 Volume Serial Number is 48F0-A822

 Directory of C:\Users\shint1\Google Drive\SDS\DataManip\2. Notebooks and Datasets\2_Data\Lectures

16/02/2020  01:18 PM    <DIR>          .
16/02/2020  01:18 PM    <DIR>          ..
16/02/2020  12:55 PM    <DIR>          .ipynb_checkpoints
14/02/2020  10:50 PM            38,725 1_Loading.ipynb
14/02/2020  11:32 PM            32,118 2_NumpyVPandas.ipynb
16/02/2020  12:54 PM             9,216 3_CreatingDataFrames.ipynb
16/02/2020  01:18 PM            18,019 4_SavingAndSerialising.ipynb
20/09/2019  10:04 AM            81,593 astronauts.csv
01/10/2019  08:15 PM            11,328 heart.csv
18/01/2020  01:19 PM            35,216 heart.pkl
16/02/2020  01:14 PM            87,030 save.csv
16/02/2020  01:15 PM           107,240 save.fth
16/02/2020  01:19 PM         4,108,481 save.hdf
16/02/2020  01:15 PM            90,693 save.pkl
              11 File(s)      4,619,659 bytes
               3 Dir(s)  244,606,853,120 bytes free
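The `%ls` output above depends on your operating system. As a portable alternative, here is a minimal sketch that reads the file sizes from Python instead. It writes a small stand-in frame (`df_sizes`, a hypothetical name) to a temporary directory rather than reusing the astronaut data, and only uses CSV and pickle so it runs without optional packages.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Stand-in numeric data; swap in your own DataFrame to compare real sizes.
df_sizes = pd.DataFrame(np.random.random(size=(1000, 4)), columns=["A", "B", "C", "D"])

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    df_sizes.to_csv(tmp / "save.csv", index=False, float_format="%0.4f")
    df_sizes.to_pickle(tmp / "save.pkl")
    # Record the size in bytes of every file written above.
    for path in sorted(tmp.iterdir()):
        sizes[path.name] = path.stat().st_size

for name, n_bytes in sizes.items():
    print(f"{name:10s} {n_bytes:>10,d} bytes")
```

The same loop works for `save.hdf` and `save.fth` if you have `tables` and `pyarrow` installed and add the matching `to_hdf` and `to_feather` calls.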


### Recap

In terms of file size, HDF5 is by far the largest for this example (about 4.1 MB, against roughly 87 to 107 KB for the others). Everything else is approximately equal. For small data sizes, CSV is often the easiest as it's human-readable. HDF5 is great for *loading* in huge amounts of data quickly. Pickle is faster than CSV, but not human-readable.
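One more difference worth knowing about is what comes back when you load the file. The cells above write CSV with `float_format="%0.4f"`, which rounds every float to four decimal places, while pickle stores the values as they are. The sketch below checks both round trips on a small random frame; `df_rt` and the `rt.*` file names are made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

df_rt = pd.DataFrame(np.random.random(size=(100, 4)), columns=["A", "B", "C", "D"])

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # CSV round trip with the same float_format as above: values get rounded.
    df_rt.to_csv(tmp / "rt.csv", index=False, float_format="%0.4f")
    from_csv = pd.read_csv(tmp / "rt.csv")

    # Pickle round trip: values and dtypes are stored as-is.
    df_rt.to_pickle(tmp / "rt.pkl")
    from_pickle = pd.read_pickle(tmp / "rt.pkl")

csv_exact = df_rt.equals(from_csv)
pickle_exact = df_rt.equals(from_pickle)
# Largest difference between the original and the CSV copy, over all cells.
csv_max_error = (df_rt - from_csv).abs().max().max()

print("CSV exact:        ", csv_exact)
print("Pickle exact:     ", pickle_exact)
print("CSV max abs error:", csv_max_error)
```

If the rounding matters for your data, drop `float_format` or pick a format with more digits; the file gets larger in exchange.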

There are lots of options, so don't get hung up on any of them. CSV and pickle are easy and work fine in most cases.