# Saving and Serialising a DataFrame


In [1]:
import numpy as np
import pandas as pd

In [2]:
# Let's make a new dataframe and save it out in various formats
df = pd.DataFrame(np.random.random(size=(100000, 4)), columns=["A", "B", "C", "D"])
df.head()

Unnamed: 0,A,B,C,D
0,0.863149,0.314732,0.669747,0.702656
1,0.546542,0.563607,0.780532,0.312281
2,0.024058,0.473108,0.44798,0.811878
3,0.888702,0.392524,0.830159,0.452014
4,0.266793,0.44978,0.589546,0.882689


In [3]:
df.to_csv("save.csv", index=False, float_format="%0.3f")

In [4]:
df.to_pickle("save.pkl")

In [8]:
# pip install tables
df.to_hdf("save.hdf", key="data", format="table")

In [9]:
# pip install pyarrow  (pandas needs pyarrow to read and write Feather files)
df.to_feather("save.fth")

In [11]:
# If you want to get the timings you can see in the video, you'll need this extension:
# https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/execute_time/readme.html
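If you would rather not install the extension, a rough timing can be taken in plain Python. The following is a minimal sketch, not the method used in the video: `time_call` is a hypothetical helper, the frame is deliberately smaller than the one above so it runs quickly, and files go to a temporary directory so nothing in the working folder is overwritten.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Smaller than the 100000-row frame above, purely to keep the sketch fast.
demo = pd.DataFrame(np.random.random(size=(10_000, 4)), columns=["A", "B", "C", "D"])

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "save.csv")
    pkl_path = os.path.join(tmp, "save.pkl")

    _, timings["csv write"] = time_call(demo.to_csv, csv_path, index=False)
    csv_back, timings["csv read"] = time_call(pd.read_csv, csv_path)
    _, timings["pickle write"] = time_call(demo.to_pickle, pkl_path)
    pkl_back, timings["pickle read"] = time_call(pd.read_pickle, pkl_path)

for label, seconds in timings.items():
    print(f"{label:>12}: {seconds * 1000:.1f} ms")
```

A single run like this is noisy; inside a notebook, the built-in `%timeit` magic repeats the call and gives a steadier number.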

Now this is a very easy test: it's only numeric data. If we add strings and categorical data, things can slow down a lot! Let's try this on the mixed astronaut data from Kaggle: https://www.kaggle.com/nasa/astronaut-yearbook

In [7]:
df = pd.read_csv("astronauts.csv")
df.head()

Unnamed: 0,Name,Year,Group,Status,Birth Date,Birth Place,Gender,Alma Mater,Undergraduate Major,Graduate Major,Military Rank,Military Branch,Space Flights,Space Flight (hr),Space Walks,Space Walks (hr),Missions,Death Date,Death Mission
0,Joseph M. Acaba,2004.0,19.0,Active,5/17/1967,"Inglewood, CA",Male,University of California-Santa Barbara; Univer...,Geology,Geology,,,2,3307,2,13.0,"STS-119 (Discovery), ISS-31/32 (Soyuz)",,
1,Loren W. Acton,,,Retired,3/7/1936,"Lewiston, MT",Male,Montana State University; University of Colorado,Engineering Physics,Solar Physics,,,1,190,0,0.0,STS 51-F (Challenger),,
2,James C. Adamson,1984.0,10.0,Retired,3/3/1946,"Warsaw, NY",Male,US Military Academy; Princeton University,Engineering,Aerospace Engineering,Colonel,US Army (Retired),2,334,0,0.0,"STS-28 (Columbia), STS-43 (Atlantis)",,
3,Thomas D. Akers,1987.0,12.0,Retired,5/20/1951,"St. Louis, MO",Male,University of Missouri-Rolla,Applied Mathematics,Applied Mathematics,Colonel,US Air Force (Retired),4,814,4,29.0,"STS-41 (Discovery), STS-49 (Endeavor), STS-61 ...",,
4,Buzz Aldrin,1963.0,3.0,Retired,1/20/1930,"Montclair, NJ",Male,US Military Academy; MIT,Mechanical Engineering,Astronautics,Colonel,US Air Force (Retired),2,289,2,8.0,"Gemini 12, Apollo 11",,


In [8]:
df.to_csv("save.csv", index=False, float_format="%0.4f")

In [9]:
pd.read_csv("save.csv");

In [10]:
df.to_pickle("save.pkl")

In [11]:
pd.read_pickle("save.pkl");
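With mixed data, speed is not the only difference between CSV and pickle: a CSV file stores only text, so column types such as categoricals and datetimes are not recorded in it, whereas pickle restores the frame as it was. The sketch below illustrates this on a tiny hand-made frame (the names and values are made up for illustration and are not taken from the Kaggle file).

```python
import os
import tempfile

import pandas as pd

# Hand-made stand-in for mixed data: text, a categorical and a datetime column.
mixed = pd.DataFrame(
    {
        "Name": ["Person A", "Person B", "Person C"],
        "Status": pd.Categorical(["Active", "Retired", "Retired"]),
        "Birth Date": pd.to_datetime(["1967-05-17", "1936-03-07", "1946-03-03"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "mixed.csv")
    pkl_path = os.path.join(tmp, "mixed.pkl")

    mixed.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    mixed.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

# After the CSV round trip the categorical and datetime columns come back as
# plain text; after the pickle round trip the dtypes match the original.
print(from_csv.dtypes)
print(from_pickle.dtypes)
```

If the types matter after a CSV round trip, `pd.read_csv` accepts `dtype=` and `parse_dates=` arguments to restore them on load.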

In [12]:
df.to_hdf("save.hdf", key="data", format="table")

In [13]:
pd.read_hdf("save.hdf");

In [14]:
df.to_feather("save.fth")

ImportError: Missing optional dependency 'pyarrow'.  Use pip or conda to install pyarrow.
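The error above means Feather support is unavailable until pyarrow is installed. If a script has to run on machines where that may be the case, one option is to catch the `ImportError` and fall back to a format that needs no extra package. This is only a sketch: `save_feather_or_pickle` is a hypothetical helper, and falling back to pickle is an assumption about what is acceptable for your data.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_feather_or_pickle(df, stem):
    """Write df as Feather if possible, otherwise as pickle; return the path used.

    Hypothetical helper: pandas raises ImportError when pyarrow is missing.
    """
    feather_path = stem + ".fth"
    try:
        df.to_feather(feather_path)
        return feather_path
    except ImportError:
        pickle_path = stem + ".pkl"
        df.to_pickle(pickle_path)
        return pickle_path


frame = pd.DataFrame({"A": np.random.random(5), "B": np.arange(5)})

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_feather_or_pickle(frame, os.path.join(tmp, "save"))
    # Pick the matching reader from the extension that was actually written.
    if saved_path.endswith(".fth"):
        loaded = pd.read_feather(saved_path)
    else:
        loaded = pd.read_pickle(saved_path)

print(os.path.basename(saved_path))
```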

In [None]:
pd.read_feather("save.fth");

In [16]:
%ls

 Volume in drive G is HDD Storge 2
 Volume Serial Number is D2CA-B02B

 Directory of G:\Qaurter2Notebooks\piaic_q2_class_reseouces\practiceResource

01/10/2021  12:15 PM    <DIR>          .
01/10/2021  12:15 PM    <DIR>          ..
01/10/2021  12:12 PM    <DIR>          .ipynb_checkpoints
01/09/2021  10:33 AM            15,377 Answers.ipynb
01/10/2021  12:03 PM            81,593 astronauts.csv
01/10/2021  12:10 PM            33,930 dataInspecting.ipynb
01/10/2021  12:04 PM            19,860 dataloading.ipynb
01/10/2021  12:14 PM            18,591 dataSavingAndSerialising.ipynb
01/10/2021  11:57 AM            11,328 heart.csv
01/09/2021  10:31 AM            35,216 heart.pkl
01/09/2021  10:53 AM            32,414 NumpyVPandas.ipynb
01/09/2021  10:33 AM             2,812 Questions.ipynb
01/10/2021  12:14 PM            87,030 save.csv
01/10/2021  12:15 PM           801,617 save.hdf
01/10/2021  12:15 PM            90,693 save.pkl
01/09/2021  10:30 AM            18,594 SavingAndSerialising.i

### Recap

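In terms of file size, HDF5 is by far the largest for this example: in the listing above, `save.hdf` is about 800 KB, while `save.csv` and `save.pkl` are both around 90 KB. Feather is missing from the comparison because `to_feather` failed without pyarrow. For small data sizes, CSV is often the easiest, as it's human readable. HDF5 is great for *loading* in huge amounts of data quickly. Pickle is faster than CSV, but not human readable.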

There are lots of options, so don't get hung up on any of them. CSV and pickle are easy and work fine for most cases.
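The size comparison in the recap was read off the `%ls` output, which looks different on every operating system. A portable alternative is `os.path.getsize`. The sketch below covers only CSV and pickle, since those need no optional packages; the `writers` dictionary is an illustrative structure, and entries for HDF5 or Feather could be added to it once `tables` or `pyarrow` is installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.random(size=(1_000, 4)), columns=["A", "B", "C", "D"])

# Map each output file name to a function that writes `frame` to a given path.
writers = {
    "save.csv": lambda path: frame.to_csv(path, index=False),
    "save.pkl": frame.to_pickle,
}

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, write in writers.items():
        path = os.path.join(tmp, name)
        write(path)
        sizes[name] = os.path.getsize(path)  # size in bytes

# Report smallest first.
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name:<10} {size:>10,} bytes")
```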