ENH: Add functionality to scan CSV files and show useful info (e.g. the record count) without loading the entire file #57181

@KelumPerera

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Users often want to know how many records a large CSV file contains before loading it.

It would be good if pandas had some functionality to scan a CSV file (without reading the entire file into memory at once) and report useful information to the user:
- The number of records in the file, obtained by counting newlines
- The column names
- The delimiter
- Any non-standard characters in the data, shown so the user can decide on a suitable encoding
- The optimal chunk size (in records) for reading the file with pandas
- The memory required to hold the entire file, and the memory required to hold each column
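
A rough sketch of how a few of the items above (delimiter, column names, non-standard characters) could be gathered from only the first part of a file, using the standard library's `csv.Sniffer`. The function name `peek_csv`, its `sample_bytes` parameter and the returned keys are made up for illustration and are not a pandas API. The sketch also assumes that the first row is a header and that UTF-8 is an acceptable first guess for decoding the sample.

```python
import csv
import io


def peek_csv(path, sample_bytes=64 * 1024):
    """Inspect only the first `sample_bytes` of a CSV file (hypothetical helper)."""
    with open(path, "rb") as fh:
        raw = fh.read(sample_bytes)

    # Bytes outside the ASCII range hint that the encoding needs attention.
    non_ascii = sorted({b for b in raw if b > 0x7F})

    # Assumption: UTF-8 is a reasonable first guess; undecodable bytes are replaced.
    text = raw.decode("utf-8", errors="replace")

    # csv.Sniffer guesses the delimiter from the sample; it raises csv.Error on failure.
    dialect = csv.Sniffer().sniff(text)

    # Assumption: the first row holds the column names.
    columns = next(csv.reader(io.StringIO(text), dialect))

    return {
        "delimiter": dialect.delimiter,
        "columns": columns,
        "non_ascii_bytes": non_ascii,
    }
```

Because only a fixed-size sample is read, the cost does not depend on the file size, but characters that first appear later in the file would be missed.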

Feature Description

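Sample code that counts the number of rows using the NumPy library:

# Copied from https://stackoverflow.com/a/64744699
import numpy as np

chunk = 1024 * 1024 * 512  # Process 512 MiB at a time.
# Map the file read-only as raw bytes; data is paged in only when sliced.
f = np.memmap(r"C:\path\to\csvfile\somefile.csv", dtype=np.uint8, mode="r")
# Count newline bytes one chunk at a time.
num_newlines = sum(
    np.sum(f[i:i + chunk] == ord("\n"))
    for i in range(0, len(f), chunk)
)
del f  # Release the memory map.
print(num_newlines)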
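
If NumPy is not available, roughly the same count can be obtained with plain buffered reads. This is a minimal sketch, not something pandas provides: `count_newlines` and `chunk_size` are illustrative names. Like the memmap version, it counts newline bytes, so the header line is included and line breaks inside quoted fields are counted too.

```python
def count_newlines(path, chunk_size=16 * 1024 * 1024):
    """Count newline bytes in a file, reading at most `chunk_size` bytes at a time."""
    total = 0
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"" at EOF.
        for block in iter(lambda: fh.read(chunk_size), b""):
            total += block.count(b"\n")
    return total
```

Subtracting one for the header gives an approximate record count, assuming the file ends with a newline and no field contains an embedded line break.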

Alternative Solutions

It is hard to find alternative solutions, especially when the file is 10+ gigabytes in size.

Additional Context

No response

Metadata

    Labels

    Enhancement
    IO CSV (read_csv, to_csv)
    Needs Discussion (Requires discussion from core team before further action)
