Feature Type

- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
There are many cases where users want to know how many records are in a large CSV file before loading it. It would be useful if pandas had functionality to scan a CSV file (without reading the entire file into memory at once) and report information such as:

- the number of records in the file, determined by counting newlines
- the column names
- the delimiter
- any non-standard characters in the data, shown to the user so a suitable encoding can be chosen
- the optimal chunk size for reading the file with pandas
- the memory required to hold the entire file, and the memory required to hold the data in each column
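A minimal sketch of what such a scan could look like, using only the standard library: the function name `scan_csv`, its return format, and the sample-size parameter are illustrative assumptions, not an existing pandas API.

```python
# Hypothetical sketch of the proposed scan: report the delimiter, column
# names, and file size without reading the whole file into memory.
import csv
import os


def scan_csv(path, sample_bytes=64 * 1024):
    """Inspect a CSV file's header region without reading the full file."""
    with open(path, "r", newline="") as fh:
        sample = fh.read(sample_bytes)          # read only a small prefix
    dialect = csv.Sniffer().sniff(sample)       # guess the delimiter
    header = sample.splitlines()[0].split(dialect.delimiter)
    return {
        "delimiter": dialect.delimiter,
        "columns": header,
        "size_bytes": os.path.getsize(path),    # bytes on disk
    }


# Demonstration on a small temporary file.
with open("demo.csv", "w", newline="") as fh:
    fh.write("a;b;c\n1;2;3\n4;5;6\n")

info = scan_csv("demo.csv")
print(info["delimiter"])   # ;
print(info["columns"])     # ['a', 'b', 'c']
os.remove("demo.csv")
```

A real implementation would also sample later regions of the file, since the first few kilobytes may not be representative of the whole.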
Feature Description
Sample code showing how to count rows with the NumPy library (the original snippet's comment said 500 MB for a 512 MB chunk, and `np.memmap` defaults to a writable mode, so a read-only mode is specified here):

```python
# Adapted from https://stackoverflow.com/a/64744699
import numpy as np

chunk = 1024 * 1024 * 512  # Process 512 MB at a time.
f = np.memmap(r"C:\path\to\csvfile\somefile.csv", mode="r")  # read-only byte map
num_newlines = sum(
    np.sum(f[i:i + chunk] == ord("\n"))
    for i in range(0, len(f), chunk)
)
del f
print(num_newlines)
```
Alternative Solutions
It is hard to find alternative solutions, especially when the file size is 10+ gigabytes.
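One partial workaround that does exist today: `pandas.read_csv` accepts a `chunksize` argument and returns an iterator of DataFrames, so rows can be counted without holding the whole file in memory. The file name and chunk size below are illustrative; a small demo file is created so the snippet is self-contained.

```python
import os

import pandas as pd

# Build a small demo file so the example runs on its own.
with open("big.csv", "w") as fh:
    fh.write("a,b\n")
    for i in range(10):
        fh.write(f"{i},{i * 2}\n")

# Stream the file 4 rows at a time; only one chunk is in memory at once.
n_rows = sum(len(chunk) for chunk in pd.read_csv("big.csv", chunksize=4))
print(n_rows)  # 10
os.remove("big.csv")
```

This still parses every field, so it is slower than a raw newline count, which is part of the motivation for a dedicated scanning feature.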
Additional Context
No response