Add option to specify metadata columns in CSV loader #11576

benchello · 2023-10-09T20:49:30Z

Description

This PR adds the option to specify additional metadata columns in the CSVLoader beyond just Source.

The current CSV loader puts every column into page_content. To send some columns to page_content and others to metadata, we currently have to do something like the following:

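import pandas as pd
from langchain.schema import Document

# Read every CSV row as a dict keyed by column name.
records = pd.read_csv("path_to_csv").to_dict("records")

# Build each Document by hand: one column for content, the rest for metadata.
documents = [
    Document(
        page_content=doc["content"],
        metadata={
            "last_modified_by": doc["last_modified_by"],
            "point_of_contact": doc["point_of_contact"],
        },
    )
    for doc in records
]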

Usage

Example Usage:

from langchain.document_loaders import CSVLoader

csv_test = CSVLoader(
    file_path="path_to_csv",
    metadata_columns=["last_modified_by", "point_of_contact"],
)
documents = csv_test.load()

Example CSV:

content,last_modified_by,point_of_contact
"hello world","Person A","Person B"

Example Result:

Document {
  page_content: "hello world"
  metadata: {
    row: '0',
    source: 'path_to_csv',
    last_modified_by: 'Person A',
    point_of_contact: 'Person B',
  }
}
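
For readers who want to see the column split in isolation, here is a minimal standard-library sketch of the idea. It is not the CSVLoader implementation from this PR: `SimpleDocument` and `rows_to_documents` are hypothetical names used only for illustration, and the choice to label each remaining column as `name: value` in the page content is an assumption of the sketch, so its page_content reads `content: hello world` rather than the bare value.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_text, source, metadata_columns=()):
    """Turn CSV rows into documents, moving chosen columns into metadata."""
    # skipinitialspace=True tolerates a space after each comma.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    documents = []
    for i, row in enumerate(reader):
        # Fail early if a requested metadata column is absent.
        missing = [col for col in metadata_columns if col not in row]
        if missing:
            raise ValueError(f"Metadata columns not found in CSV: {missing}")
        # Columns not claimed as metadata make up the page content
        # (the "name: value" formatting is an assumption of this sketch).
        content = "\n".join(
            f"{key}: {value}"
            for key, value in row.items()
            if key not in metadata_columns
        )
        metadata = {"source": source, "row": i}
        metadata.update({col: row[col] for col in metadata_columns})
        documents.append(SimpleDocument(page_content=content, metadata=metadata))
    return documents


sample = (
    "content, last_modified_by, point_of_contact\n"
    '"hello world", "Person A", "Person B"\n'
)
docs = rows_to_documents(
    sample, "path_to_csv", ["last_modified_by", "point_of_contact"]
)
```

Raising on a missing column, rather than silently skipping it, is a design choice of the sketch; it surfaces a mistyped column name at load time instead of producing documents with incomplete metadata.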

baskaryan · 2023-10-09T21:49:32Z

thanks @benchello!

Co-authored-by: Ben Chello <bchello@dropbox.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

Ben Chello added 2 commits October 9, 2023 16:35

Add option to specify document metadata in python CSVLoader

d75aa92

Rebase and update

fb6c1d5

dosubot bot added the "Ɑ: doc loader" and "🤖:improvement" labels Oct 9, 2023

baskaryan added 3 commits October 9, 2023 14:46

cr

8c5a624

undo

e677b95

cr

e36d264

baskaryan added the "lgtm" label Oct 9, 2023

baskaryan merged commit 5de64e6 into langchain-ai:master Oct 9, 2023
32 checks passed
