Reading CSV files is very important when working with real-world datasets. Here we will learn how to read data from files. CSV stands for comma-separated values. We will also learn about StringIO.



It is not strictly necessary for a CSV file to exist on your local machine in order to read it. It depends on where you're reading it from and what tools you're using: pandas can read from a local path, a URL, or any file-like object.

In [2]:
import pandas as pd
import numpy as np


# Local file path, assuming iris.csv is in the current directory
df_local = pd.read_csv('iris.csv')

# Online file URL: Iris dataset from a real GitHub repo
df_online = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

# Print both
print("📂 Locally present:\n", df_local.head())
print("\n🌐 Globally (online) present:\n", df_online.head())

📂 Locally present:
    sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

🌐 Globally (online) present:
    sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


| Feature        | Unit | Description                                                                |
| -------------- | ---- | -------------------------------------------------------------------------- |
| `sepal_length` | cm   | Length of the **sepal** (the outer, leaf-like part that encloses the bud). |
| `sepal_width`  | cm   | Width of the **sepal**.                                                    |
| `petal_length` | cm   | Length of the **petal** (the inner, often more colorful part).             |
| `petal_width`  | cm   | Width of the **petal**.                                                    |
| `species`      | n/a  | The type of iris flower (e.g., **setosa**, **versicolor**, **virginica**). |


🧠 What is StringIO?
StringIO stands for "String Input/Output".

It lets you treat a string as if it were a file, meaning you can read from or write to a string just like you would with an actual file.


🔧 What does it do?
You can write to a string as if it's a file.

You can read from a string as if you're reading from a file.

It is especially useful when working with libraries that expect file-like objects, like pandas.read_csv().
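The read and write behavior described above can be sketched with the standard library alone. This is a minimal illustration, not part of the original notebook; the buffer contents (`name,score` and the two rows) are made-up sample values.

```python
from io import StringIO

# Create an empty in-memory text buffer and write to it like a file
buffer = StringIO()
buffer.write("name,score\n")
buffer.write("alice,90\n")
buffer.write("bob,85\n")

# getvalue() returns everything written so far as a plain string
contents = buffer.getvalue()
print(contents)

# Move the cursor back to the start before reading, just as with a real file
buffer.seek(0)
first_line = buffer.readline()        # reads one line
remaining_lines = buffer.readlines()  # reads the rest as a list of lines
print(first_line)
print(remaining_lines)
```

Note the `seek(0)` call: after writing, the cursor sits at the end of the buffer, so reading without rewinding would return nothing.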



🧠 What is an in-memory file?

An in-memory file is a temporary file-like object that exists only in RAM (memory), not on your hard drive. It behaves just like a real file, but because it avoids disk access it is usually faster and more convenient for temporary tasks. Its contents are lost once the object is discarded or the program ends.
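As a rough sketch of the idea, the snippet below writes a DataFrame as CSV text into a `StringIO` buffer and reads it back, without creating any file on disk. The small DataFrame is a hypothetical example made up for illustration.

```python
from io import StringIO
import pandas as pd

# Hypothetical small DataFrame used only for illustration
df = pd.DataFrame({"col1": ["x", "a"], "col2": [1, 2]})

# Write the CSV text into an in-memory buffer instead of a file on disk
buffer = StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer, then read the CSV text back into a new DataFrame
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)
print(df_roundtrip)
```

The same pattern is handy in tests or quick experiments where writing a temporary file would only add cleanup work.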



In [None]:
from io import StringIO

# Below is a string that mimics a CSV (Comma-Separated Values) file
# It has 3 columns: col1, col2, col3
# And 3 rows of data:
# Row 1: x, y, 1
# Row 2: a, b, 2
# Row 3: c, d, 3
data = ('col1,col2,col3\n'
        'x,y,1\n'
        'a,b,2\n'
        'c,d,3')

# This checks the type of 'data', which is a string representing CSV content
type(data)  # Output: <class 'str'>

StringIO(data)  # This creates a file-like object from the string

pd.read_csv(StringIO(data))  # This reads the CSV data into a DataFrame
# Output the DataFrame created from the string CSV


# Suppose you want to read only a certain column
pd.read_csv(StringIO(data), usecols=['col1'])  # This reads only 'col1' from the CSV data

# Suppose we want to read only two columns
df3 = pd.read_csv('iris.csv', usecols=['sepal_length', 'sepal_width'])  # This reads only 'sepal_length' and 'sepal_width' from the local CSV file


# Now, to save df3 as a new CSV file
df3.to_csv('df3test.csv')  # This saves df3 to 'df3test.csv' in the current directory; the row index is written too unless index=False is passed
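One detail worth knowing when saving a DataFrame: by default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below demonstrates this with in-memory buffers so that no files are created; the two-row DataFrame is a made-up stand-in for `df3`.

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for df3, with two rows of iris-like values
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

# Default behavior: the index is written as an extra, unnamed first column
with_index = StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

If the index carries no meaning (as with the default 0, 1, 2, ... index), passing `index=False` usually gives a cleaner file.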




In [16]:
data1 = ('col1,col2,col3,col4\n'
         'x,y,1,10\n'
         'a,b,2,20\n'
         'c,d,3')
# Create a new DataFrame from the string data1.
# The last row has only 3 values, so pandas fills the missing col4 value with NaN
# (which is why col4 is read as float64 rather than int64).
df4 = pd.read_csv(StringIO(data1))

df4.info()  # This will show the information about the DataFrame df4, including the number of entries, columns, and data types


df4.isnull().sum()  # This will count the number of null (NaN) values in each column of df4


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    3 non-null      object 
 1   col2    3 non-null      object 
 2   col3    3 non-null      int64  
 3   col4    2 non-null      float64
dtypes: float64(1), int64(1), object(2)
memory usage: 228.0+ bytes


col1    0
col2    0
col3    0
col4    1
dtype: int64
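Once the missing value has been located, a common next step is to either fill it or drop the affected row. The following is a minimal sketch of both options using the same `data1` string; the fill value `0` is an arbitrary choice for illustration, not a recommendation.

```python
from io import StringIO
import pandas as pd

data1 = ('col1,col2,col3,col4\n'
         'x,y,1,10\n'
         'a,b,2,20\n'
         'c,d,3')
df4 = pd.read_csv(StringIO(data1))

# Option 1: replace the missing value in col4 with a placeholder (0 is arbitrary)
df4_filled = df4.fillna({'col4': 0})
print(df4_filled)

# Option 2: drop every row that contains at least one missing value
df4_dropped = df4.dropna()
print(df4_dropped)
```

Both calls return new DataFrames and leave `df4` unchanged.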

Similarly, you can read a CSV file, convert its data types while reading, and select only certain columns or rows.
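As a brief sketch of those options, `pd.read_csv` accepts `dtype=` to convert columns while reading and `nrows=` to limit how many rows are read. The example reuses the small CSV string from earlier; the choice of `float64` for `col3` is only for demonstration.

```python
from io import StringIO
import pandas as pd

data = ('col1,col2,col3\n'
        'x,y,1\n'
        'a,b,2\n'
        'c,d,3')

# dtype= converts a column while reading; here col3 is parsed as float instead of int
df_float = pd.read_csv(StringIO(data), dtype={'col3': 'float64'})
print(df_float.dtypes)

# nrows= limits how many data rows are read (the header line is not counted)
df_head = pd.read_csv(StringIO(data), nrows=2)
print(df_head)

# usecols= and nrows= can be combined to read just a slice of the data
df_slice = pd.read_csv(StringIO(data), usecols=['col1', 'col3'], nrows=2)
print(df_slice)
```

A fresh `StringIO(data)` is created for each call because a buffer that has been read once is left with its cursor at the end.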