ENH: Randomised row selection with read_csv() #58760

Closed
1 of 3 tasks
Arjun-G-Ravi opened this issue May 18, 2024 · 4 comments
Labels
Closing Candidate (may be closeable, needs more eyeballs), Enhancement, IO CSV (read_csv, to_csv)

Comments

@Arjun-G-Ravi

Arjun-G-Ravi commented May 18, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

If I have a large dataset with 100K rows and want to select 1000 rows at random, I first need to load the whole dataset into RAM using read_csv() (or any of the other read_*() functions) and then use a separate sampling function on it. If I cannot load all of the rows into RAM (due to hardware limitations), I cannot do this at all.

Feature Description

Add a parameter random to read_csv() (and the other read_*() functions) in pandas to load a random part of the dataset into RAM.

Example

df = pd.read_csv('data.csv', nrows=1000, random=True)

This should load 1000 rows from the dataset sampled at random, without ever loading the whole dataset into the RAM.

Alternative Solutions

The current way to do this is using

df = pd.read_csv('your_file.txt')
random_sample = df.sample(n=100)

This requires us to load everything to RAM.

So we have to find a way to sample the rows randomly, and then only load them into the RAM.

Additional Context

I believe this is an outstanding improvement, and will be widely used by the community.

@Arjun-G-Ravi added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels on May 18, 2024
@twoertwein
Member

I think you can use skiprows for this (I did not test it)

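import random

import pandas as pd

def selectFirstNRandom(size: int):
    selected_count = 0

    def skip(index: int) -> bool:
        nonlocal selected_count  # the counter lives in the enclosing scope
        if index == 0:
            return False  # always keep the header line
        if selected_count >= size:
            return True  # quota reached: skip every remaining row
        skip_row = random.random() > 0.5  # keep each row with probability 0.5
        if not skip_row:
            selected_count += 1
        return skip_row

    return skip

df = pd.read_csv('your_file.txt', skiprows=selectFirstNRandom(100))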
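A coin flip per row picks rows that cluster near the start of the file. If a uniform sample is wanted, one option is a two-pass variant of the same skiprows idea: count the lines first, choose the line numbers to keep, then let read_csv skip the rest. This is only a sketch. The helper name read_csv_sample and its seed argument are made up for illustration, and it assumes a header on the first line, one record per physical line (no quoted fields containing newlines), and no blank lines.

```python
import random
from typing import Optional

import pandas as pd


def read_csv_sample(path: str, n: int, seed: Optional[int] = None) -> pd.DataFrame:
    """Read a uniform random sample of n data rows (hypothetical helper)."""
    # Pass 1: count data rows without parsing them (line 0 is the header).
    with open(path, "rb") as f:
        total = max(sum(1 for _ in f) - 1, 0)
    rng = random.Random(seed)
    # Choose the file line indices to keep; data rows are lines 1..total.
    keep = set(rng.sample(range(1, total + 1), min(n, total)))
    # Pass 2: skip every data line that was not chosen, but keep the header.
    return pd.read_csv(path, skiprows=lambda i: i != 0 and i not in keep)
```

Something like read_csv_sample('your_file.txt', 100) would then parse only the 100 chosen rows. The file is still scanned twice, but the full table is never held in memory.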

@Aloqeely
Member

I'm not sure if there will be much interest in this feature.

@twoertwein's skiprows solution probably works, but I think using chunksize and iterator is cleaner, although it's less random.

it = pd.read_csv("test.csv", chunksize=1000, iterator=True)

chunk1 = it.get_chunk()  # returns DataFrame with first 1000 lines
chunk2 = it.get_chunk()  # returns DataFrame with the next thousand lines

For more information on the chunksize and iterator parameters, you can see the IO Tools user guide

@rhshadrach
Member

I'm not sure if there will be much interest in this feature.

Agreed @Aloqeely - CSV reading in pandas is already very complex and I think we would like to simplify it where possible. This feature seems too niche to me to have it supported directly in pandas.

If I have a large dataset with 100K rows, and I want to select 1000 rows at random, I first need to load the whole dataset into my RAM

How does one determine which 1000 rows to choose without knowing the total number of rows? Wouldn't that require reading the entire file anyways?
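One known technique for this situation is reservoir sampling (Algorithm R): it does read the entire file once, but it never needs the row count up front and keeps at most k rows in memory. The sketch below uses the standard csv module for the pass and pandas only to build the result. The function name reservoir_sample_csv is hypothetical, and all values come back as strings, so dtypes would need converting afterwards.

```python
import csv
import random
from typing import Optional

import pandas as pd


def reservoir_sample_csv(path: str, k: int, seed: Optional[int] = None) -> pd.DataFrame:
    """Uniformly sample k rows in one pass, keeping at most k rows in memory."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for i, row in enumerate(reader):
            if i < k:
                reservoir.append(row)  # fill the reservoir with the first k rows
            else:
                j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
                if j < k:
                    reservoir[j] = row  # row i is kept with probability k / (i + 1)
    # Values are strings at this point; convert dtypes afterwards if needed.
    return pd.DataFrame(reservoir, columns=header)
```

So the answer to the question is yes for I/O and no for memory: every line is visited, but the full dataset is never loaded.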

@rhshadrach added the Closing Candidate (may be closeable, needs more eyeballs) and IO CSV (read_csv, to_csv) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on May 18, 2024
@mroeschke
Member

Thanks for the request but it seems like there's not much interest in this feature so closing

5 participants