# Apache Parquet

> [Apache Parquet](https://parquet.apache.org/) is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high-performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

### Install pyarrow

In [None]:
%pip install --upgrade --quiet pyarrow

### Import dependencies

In [8]:
import os
from langchain_community.document_loaders import ParquetLoader

### Define the parquet file to load

In [12]:
# Build the path relative to the current working directory.
FILE_PATH = os.path.join(os.path.abspath(""), "example_data", "mlb_teams_2012.parquet")

### Load the file and display the first 10 Documents

In [22]:
# Use the 'Team' column as page_content; the remaining columns become metadata.
loader = ParquetLoader(file_path=FILE_PATH, content_columns='Team')
docs = loader.load()
for doc in docs[:10]:
    print(f"{doc.__class__.__name__} -> {doc}")

Document -> page_content='Nationals' metadata={'Payroll': 81.34, 'Wins': 98}
Document -> page_content='Reds' metadata={'Payroll': 82.2, 'Wins': 97}
Document -> page_content='Yankees' metadata={'Payroll': 197.96, 'Wins': 95}
Document -> page_content='Giants' metadata={'Payroll': 117.62, 'Wins': 94}
Document -> page_content='Braves' metadata={'Payroll': 83.31, 'Wins': 94}
Document -> page_content='Athletics' metadata={'Payroll': 55.37, 'Wins': 94}
Document -> page_content='Rangers' metadata={'Payroll': 120.51, 'Wins': 93}
Document -> page_content='Orioles' metadata={'Payroll': 81.43, 'Wins': 93}
Document -> page_content='Rays' metadata={'Payroll': 64.17, 'Wins': 90}
Document -> page_content='Angels' metadata={'Payroll': 154.49, 'Wins': 89}


### Alternatively, specify multiple columns to join into the `page_content` of the Document

In [24]:
loader2 = ParquetLoader(file_path=FILE_PATH, content_columns=['Team', 'Payroll'])
docs2 = loader2.load()
for doc in docs2[:10]:
    print(f"{doc.__class__.__name__} -> {doc}")

Document -> page_content='Nationals 81.34' metadata={'Wins': 98}
Document -> page_content='Reds 82.2' metadata={'Wins': 97}
Document -> page_content='Yankees 197.96' metadata={'Wins': 95}
Document -> page_content='Giants 117.62' metadata={'Wins': 94}
Document -> page_content='Braves 83.31' metadata={'Wins': 94}
Document -> page_content='Athletics 55.37' metadata={'Wins': 94}
Document -> page_content='Rangers 120.51' metadata={'Wins': 93}
Document -> page_content='Orioles 81.43' metadata={'Wins': 93}
Document -> page_content='Rays 64.17' metadata={'Wins': 90}
Document -> page_content='Angels 154.49' metadata={'Wins': 89}


### Alternatively, specify which columns to present as `metadata` in the Document

In [26]:
loader3 = ParquetLoader(file_path=FILE_PATH, content_columns=['Team'], metadata_columns='Payroll')
docs3 = loader3.load()
for doc in docs3[:10]:
    print(f"{doc.__class__.__name__} -> {doc}")

Document -> page_content='Nationals' metadata={'Payroll': 81.34}
Document -> page_content='Reds' metadata={'Payroll': 82.2}
Document -> page_content='Yankees' metadata={'Payroll': 197.96}
Document -> page_content='Giants' metadata={'Payroll': 117.62}
Document -> page_content='Braves' metadata={'Payroll': 83.31}
Document -> page_content='Athletics' metadata={'Payroll': 55.37}
Document -> page_content='Rangers' metadata={'Payroll': 120.51}
Document -> page_content='Orioles' metadata={'Payroll': 81.43}
Document -> page_content='Rays' metadata={'Payroll': 64.17}
Document -> page_content='Angels' metadata={'Payroll': 154.49}
