ENH: Randomised row selection with read_csv() #58760

Arjun-G-Ravi · 2024-05-18T03:42:11Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

If I have a large dataset with 100K rows, and I want to select 1000 rows at random, I first need to load the whole dataset into my RAM using read_csv() (or any of the other read_*() functions) and then use a separate sampling function to sample from the whole dataset. If I cannot load all of the rows into RAM (due to hardware limitations), I cannot do this at all.

Feature Description

Add a parameter random to read_csv() (and the other read_*() functions) in pandas to randomly load a part of the dataset into RAM.

Example

df = pd.read_csv('data.csv', nrows=1000, random=True)

This should load 1000 rows from the dataset, sampled at random, without ever loading the whole dataset into RAM.

Alternative Solutions

The current way to do this is using

df = pd.read_csv('your_file.txt')
random_sample = df.sample(n=100)

This requires us to load everything to RAM.

So we have to find a way to sample the rows randomly, and then only load them into the RAM.

Additional Context

I believe this is an outstanding improvement, and will be widely used by the community.


twoertwein · 2024-05-18T11:04:45Z

I think you can use skiprows for this (I did not test it)

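import random

def selectFirstNRandom(size: int):
    selected_count = 0

    def skip(index: int) -> bool:
        nonlocal selected_count  # needed to rebind the enclosing counter
        if index == 0:
            return False  # always keep the header row
        if selected_count >= size:
            return True  # quota reached: skip every remaining row
        skip_row = random.random() > 0.5  # keep each row with probability 0.5
        if not skip_row:
            selected_count += 1
        return skip_row

    return skip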

df = pd.read_csv('your_file.txt', skiprows=selectFirstNRandom(100))

Aloqeely · 2024-05-18T11:49:16Z

I'm not sure if there will be much interest in this feature.

@twoertwein's skiprows solution probably works, but I think using chunksize and iterator is cleaner, although it's less random.

it = pd.read_csv("test.csv", chunksize=1000, iterator=True)

chunk1 = it.get_chunk()  # returns DataFrame with first 1000 lines
chunk2 = it.get_chunk()  # returns DataFrame with the next thousand lines

For more information on the chunksize and iterator parameters, you can see the IO Tools user guide
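As a rough sketch of how the chunked approach could give a sample spread over the whole file rather than only the first chunk, one could draw a fraction of the rows from every chunk and concatenate the pieces. The helper name sample_csv_in_chunks and its parameters are hypothetical, not part of pandas, and the result has approximately (not exactly) frac times the total number of rows.

```python
import pandas as pd


def sample_csv_in_chunks(path, frac, chunksize=1000, seed=None):
    """Return roughly `frac` of the rows of a CSV, reading one chunk at a time.

    Hypothetical helper, not a pandas API. Only the current chunk and the
    accumulated sample are held in memory.
    """
    pieces = []
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for i, chunk in enumerate(reader):
            # Vary the seed per chunk so chunks do not all pick the same positions.
            state = None if seed is None else seed + i
            pieces.append(chunk.sample(frac=frac, random_state=state))
    return pd.concat(pieces)
```

For example, sample_csv_in_chunks("test.csv", frac=0.01) would keep about 1% of the rows. Each row is chosen only against the other rows in its own chunk, so this is a per-chunk sample rather than a single draw of exactly n rows from the whole file.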

rhshadrach · 2024-05-18T11:55:13Z

> I'm not sure if there will be much interest in this feature.

Agreed @Aloqeely - CSV reading in pandas is already very complex and I think we would like to simplify it where possible. This feature seems too niche to me to have it supported directly in pandas.

> If I have a large dataset with 100K rows, and I want to select 1000 rows at random, I first need to load the whole dataset into my RAM

How does one determine which 1000 rows to choose without knowing the total number of rows? Wouldn't that require reading the entire file anyways?
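One standard technique for this situation is reservoir sampling, which draws a uniform sample of k rows in a single streaming pass without knowing the row count in advance. The file is still read from start to finish, but only k rows are held in memory at once. Below is a minimal sketch using only the standard library; the function name reservoir_sample_csv is hypothetical, and the code assumes the first row of the file is a header.

```python
import csv
import random


def reservoir_sample_csv(path, k, seed=None):
    """Return (header, rows): a uniform random sample of up to k data rows.

    Hypothetical helper implementing reservoir sampling (Algorithm R).
    """
    rng = random.Random(seed)
    reservoir = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes the first row is a header
        for n, row in enumerate(reader):
            if n < k:
                reservoir.append(row)  # fill the reservoir with the first k rows
            else:
                # Row n replaces a random slot with probability k / (n + 1).
                j = rng.randint(0, n)
                if j < k:
                    reservoir[j] = row
    return header, reservoir
```

The result could then be passed to pd.DataFrame(rows, columns=header), although every value would be a string and would need converting to the intended dtypes.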

mroeschke · 2024-05-19T18:12:18Z

Thanks for the request, but it seems like there's not much interest in this feature, so closing.

Arjun-G-Ravi added the Enhancement and Needs Triage labels May 18, 2024

rhshadrach added the Closing Candidate and IO CSV labels and removed the Needs Triage label May 18, 2024

mroeschke closed this as completed May 19, 2024
