# Python Parquet

Python works well with Parquet for data analysis. Parquet stores data column by column and compresses it, which keeps files small and lets a reader load only the columns it needs, so large datasets are faster to read. Python libraries such as Pandas load Parquet files directly into DataFrames, which keeps both processing time and memory use down.

## CSV to Parquet with Pandas

Pandas is a widely used Python library for data science. Its two core data structures are the DataFrame (a labeled, two-dimensional table, similar to a spreadsheet) and the Series (a labeled one-dimensional array). Together they cover loading, cleaning, and transforming datasets before analysis.
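
To make the two structures concrete, here is a minimal sketch that builds a small DataFrame by hand and pulls one column out as a Series. The rows mirror the sample data used in this section; the variable names `presidents` and `birth_years` are illustrative only.

```python
import pandas as pd

# Hand-built table with the same two rows as the sample data (illustrative).
presidents = pd.DataFrame(
    {
        "full_name": ["teddy roosevelt", "abe lincoln"],
        "birth_year": [1901, 1809],
    }
)

# Selecting a single column returns a Series that shares the DataFrame's index.
birth_years = presidents["birth_year"]

print(type(presidents).__name__)   # DataFrame
print(type(birth_years).__name__)  # Series
```

A DataFrame can be thought of as a set of Series, one per column, that share the same row labels.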

In [1]:
import pandas as pd

In [2]:
# read the CSV into a DataFrame; the file's unnamed first column
# (likely a saved row index) is loaded as 'Unnamed: 0'
df = pd.read_csv('data/us_presidents.csv')

df

| Unnamed: 0 | full_name | birth_year |
|---|---|---|
| 0 | teddy roosevelt | 1901 |
| 1 | abe lincoln | 1809 |


In [3]:
# save the DataFrame as a Parquet file (needs pyarrow or fastparquet installed)
df.to_parquet('data/us_presidents.parquet')

In [4]:
# read the Parquet file back into a DataFrame
df = pd.read_parquet('data/us_presidents.parquet')

df

| Unnamed: 0 | full_name | birth_year |
|---|---|---|
| 0 | teddy roosevelt | 1901 |
| 1 | abe lincoln | 1809 |


## CSV to Parquet with PySpark

In [5]:
from pyspark.sql import SparkSession

# start (or reuse) a Spark session that runs locally
spark = SparkSession.builder \
  .master("local") \
  .appName("parquet_example") \
  .getOrCreate()

In [6]:
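# read the CSV, using the first row as column names;
# without a schema, every column is loaded as a string
data = spark.read \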
    .csv('data/us_presidents.csv', header=True)

data.show()

+---------------+----------+
|      full_name|birth_year|
+---------------+----------+
|teddy roosevelt|      1901|
|    abe lincoln|      1809|
+---------------+----------+



In [7]:
# repartition(1) yields a single part file; the output path is a directory
data.repartition(1) \
    .write.mode('overwrite') \
    .parquet('data/pyspark_us_presidents')

In [8]:
# read the Parquet directory back into a DataFrame
data = spark.read.parquet('data/pyspark_us_presidents')

data.show()

+---------------+----------+
|      full_name|birth_year|
+---------------+----------+
|teddy roosevelt|      1901|
|    abe lincoln|      1809|
+---------------+----------+

