ENH: Randomised row selection with read_csv() #58760
Comments
I think you can use a callable for skiprows, for example:

import random
import pandas as pd

def selectFirstNRandom(size: int):
    selected_count = 0
    def skip(index: int):
        nonlocal selected_count
        if index == 0:
            return False  # always keep the header row
        # Once `size` rows have been kept, skip everything else.
        if selected_count >= size:
            return True
        # Keep each remaining row with probability 0.5.
        skip_row = random.random() > 0.5
        if not skip_row:
            selected_count += 1
        return skip_row
    return skip

df = pd.read_csv('your_file.txt', skiprows=selectFirstNRandom(100))
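Note that the callable above keeps the first size rows that happen to pass a coin flip, so the sample is biased toward the top of the file. If an exact, uniform sample is needed and a second pass over the file is acceptable, a possible workaround (a sketch only; sample_csv_rows is a hypothetical helper and it assumes a single header row, not anything pandas provides) is to count the data rows first, draw the indices to keep, and skip everything else:

import random
import pandas as pd

def sample_csv_rows(path: str, n: int, seed: int = 0) -> pd.DataFrame:
    # First pass: count data rows (excluding the header) without loading them.
    with open(path) as f:
        total = sum(1 for _ in f) - 1

    # Choose which data rows to keep, then skip every other row.
    rng = random.Random(seed)
    keep = set(rng.sample(range(1, total + 1), min(n, total)))
    return pd.read_csv(path, skiprows=lambda i: i != 0 and i not in keep)

df = sample_csv_rows('data.csv', 1000)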
I'm not sure if there will be much interest in this feature. @twoertwein's suggestion of reading the file in chunks should already cover this:

it = pd.read_csv("test.csv", chunksize=1000, iterator=True)
chunk1 = it.get_chunk()  # returns DataFrame with the first 1000 lines
chunk2 = it.get_chunk()  # returns DataFrame with the next 1000 lines

For more information, see the pandas documentation on iterating through files chunk by chunk.
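If only an approximate random subset is needed, the chunked reader can be combined with DataFrame.sample() so that only the kept rows ever stay in memory (a sketch; the chunk size and sampling fraction are arbitrary choices, not pandas defaults):

import pandas as pd

pieces = []
for chunk in pd.read_csv("test.csv", chunksize=10_000):
    # Keep a random 1% of each chunk; the rest of the chunk is discarded.
    pieces.append(chunk.sample(frac=0.01, random_state=0))
sample = pd.concat(pieces, ignore_index=True)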
Agreed @Aloqeely - CSV reading in pandas is already very complex and I think we would like to simplify it where possible. This feature seems too niche to me to have it supported directly in pandas.
How does one determine which 1000 rows to choose without knowing the total number of rows? Wouldn't that require reading the entire file anyway?
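One standard answer to that question is reservoir sampling, which maintains an exact-size uniform sample in a single pass without knowing the total row count up front. A minimal sketch on top of the chunked reader (reservoir_sample_csv is an illustrative helper, not an existing pandas function):

import random
import pandas as pd

def reservoir_sample_csv(path: str, n: int, seed: int = 0) -> pd.DataFrame:
    rng = random.Random(seed)
    reservoir = []  # holds a uniform sample of all data rows seen so far
    seen = 0
    for chunk in pd.read_csv(path, chunksize=10_000):
        for _, row in chunk.iterrows():
            seen += 1
            if len(reservoir) < n:
                reservoir.append(row)
            else:
                # The current row enters the reservoir with probability n / seen.
                j = rng.randrange(seen)
                if j < n:
                    reservoir[j] = row
    return pd.DataFrame(reservoir).reset_index(drop=True)

df = reservoir_sample_csv("data.csv", 1000)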
Thanks for the request, but it seems like there's not much interest in this feature, so closing.
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
If I have a large dataset with 100K rows and I want to select 1000 rows at random, I first need to load the whole dataset into RAM using read_csv() (or any of the other read_*() readers) and then use a separate random sampling function to sample from the whole dataset. If I cannot fit all of the rows into RAM (due to hardware limitations), I cannot do this at all.

Feature Description
Add a random parameter to read_csv() (and the other read_*() readers) so that only a random subset of the dataset is loaded into RAM.
Example
df = pd.read_csv('data.csv', nrows=1000, random=True)
This should load 1000 rows sampled at random from the dataset, without ever loading the whole dataset into RAM.

Alternative Solutions
The current way to do this is to read the entire file with read_csv() and then sample from the resulting DataFrame, e.g. with DataFrame.sample().
This requires loading everything into RAM first.
So we have to find a way to sample the rows at random and load only those rows into RAM.
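For concreteness, the current workaround looks roughly like this (a sketch; the file name is taken from the example above):

import pandas as pd

# The whole file is materialised in memory before any sampling happens.
df = pd.read_csv('data.csv')
sample = df.sample(n=1000, random_state=0)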
Additional Context
I believe this would be an outstanding improvement and would be widely used by the community.