- Date: 2020-08-07 09:32:09
- Author: Ben Du
- Title: Pandas IO
- Slug: pandas-io
- Category: Computer Science
- Tags: Computer Science, pandas, IO, DataFrame, Parquet, BytesIO, read, write, schema, version
- Modified: 2020-08-07 09:32:09


In [6]:
import io
import pandas as pd
import sys

## Tips and Traps

1. Prefer the Parquet format 
    over other binary or text formats whenever possible.

2. Both the `pyarrow` and `fastparquet` Python libraries can handle Parquet files. 
    `pyarrow` is preferred over `fastparquet`.
    
3. When writing a pandas DataFrame to a Parquet file,
    you can specify the arguments `schema` and `version`
    to control the output schema of the DataFrame precisely.
    Notice that the `version` argument is required 
    to avoid this
    [issue](https://github.com/pandas-dev/pandas/issues/37327).

### pandas.read_parquet

### DataFrame.to_parquet

When writing a pandas DataFrame to a Parquet file,
you can specify the arguments `schema` and `version`
to control the output schema of the DataFrame precisely.
Notice that the `version` argument is required 
to avoid this
[issue](https://github.com/pandas-dev/pandas/issues/37327).
Of course,
you can also manually cast the data type of each column 
before writing the DataFrame,
but this is not as efficient as the first approach. 

In [1]:
import numpy as np
import pyarrow as pa
import pandas as pd

In [None]:
df_j0 = pd.read_csv("rank_j0.csv")
schema = pa.schema(
    [
        ("id", pa.uint64()),
        ("mod", pa.uint32()),
        ("dups", pa.uint8()),
        ("rank", pa.uint32()),
    ]
)
df_j0.to_parquet("rank_j0.parquet", version="2.6", schema=schema)

Columns of mixed types can cause problems when writing a DataFrame to Parquet;
see this [issue](https://github.com/pandas-dev/pandas/issues/21228).

#### Output Types of Columns

A column of `object` dtype that contains only nulls 
can end up as a null integer column when the Parquet file is read into PySpark!

https://stackoverflow.com/questions/49172428/how-to-specify-logical-types-when-writing-parquet-files-from-pyarrow

https://stackoverflow.com/questions/50110044/how-to-force-parquet-dtypes-when-saving-pd-dataframe

## DataFrame.to_csv

## Read/Write A Pandas DataFrame From/To A Binary Stream

In [4]:
bio = io.BytesIO()

In [5]:
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1], "z": [1, 1, 1, 1, 1]})

df.head()

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 5 | 1 |
| 1 | 2 | 4 | 1 |
| 2 | 3 | 3 | 1 |
| 3 | 4 | 2 | 1 |
| 4 | 5 | 1 | 1 |


In [7]:
sys.getsizeof(bio)

96

In [8]:
df.to_parquet(bio)

In [9]:
sys.getsizeof(bio)

3360

In [10]:
pd.read_parquet(bio)

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 5 | 1 |
| 1 | 2 | 4 | 1 |
| 2 | 3 | 3 | 1 |
| 3 | 4 | 2 | 1 |
| 4 | 5 | 1 | 1 |


## References

https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery

http://www.legendu.net/misc/blog/python-pandas-read_csv/
    
http://www.legendu.net/misc/blog/read-and-write-parquet-files-in-python/

https://docs.python.org/3/library/io.html

https://www.devdungeon.com/content/working-binary-data-python

https://webkul.com/blog/using-io-for-creating-file-object/