Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to specify metadata columns in CSV loader #11576

Merged
merged 5 commits into from Oct 9, 2023

Conversation

benchello
Copy link
Contributor

Description

This PR adds the option to specify additional metadata columns in the CSVLoader beyond just Source.

The current CSV loader includes all columns in page_content and if we want to have columns specified for page_content and metadata we have to do something like the below.:

csv = pd.read_csv(
        "path_to_csv"
    ).to_dict("records")

documents = [
        Document(
            page_content=doc["content"],
            metadata={
                "last_modified_by": doc["last_modified_by"],
                "point_of_contact": doc["point_of_contact"],
            }
        ) for doc in csv
    ]

Usage

Example Usage:

csv_test  =  CSVLoader(
      file_path="path_to_csv", 
      metadata_columns=["last_modified_by", "point_of_contact"]
 )

Example CSV:

content, last_modified_by, point_of_contact
"hello world", "Person A", "Person B"

Example Result:

Document {
 page_content: "hello world"
 metadata: {
 row: '0',
 source: 'path_to_csv',
 last_modified_by: 'Person A',
 point_of_contact: 'Person B',
 }

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 9, 2023
@vercel
Copy link

vercel bot commented Oct 9, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Oct 9, 2023 9:49pm

@baskaryan baskaryan added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Oct 9, 2023
@baskaryan
Copy link
Collaborator

thanks @benchello!

@baskaryan baskaryan merged commit 5de64e6 into langchain-ai:master Oct 9, 2023
32 checks passed
hoanq1811 pushed a commit to hoanq1811/langchain that referenced this pull request Feb 2, 2024
)

#### Description
This PR adds the option to specify additional metadata columns in the
CSVLoader beyond just `Source`.

The current CSV loader includes all columns in `page_content` and if we
want to have columns specified for `page_content` and `metadata` we have
to do something like the below.:
```
csv = pd.read_csv(
        "path_to_csv"
    ).to_dict("records")

documents = [
        Document(
            page_content=doc["content"],
            metadata={
                "last_modified_by": doc["last_modified_by"],
                "point_of_contact": doc["point_of_contact"],
            }
        ) for doc in csv
    ]
```
#### Usage
Example Usage:
```
csv_test  =  CSVLoader(
      file_path="path_to_csv", 
      metadata_columns=["last_modified_by", "point_of_contact"]
 )
```
Example CSV:
```
content, last_modified_by, point_of_contact
"hello world", "Person A", "Person B"
```

Example Result:
```
Document {
 page_content: "hello world"
 metadata: {
 row: '0',
 source: 'path_to_csv',
 last_modified_by: 'Person A',
 point_of_contact: 'Person B',
 }
```

---------

Co-authored-by: Ben Chello <bchello@dropbox.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants