## Reading data with Pandas

Pandas is a Python library widely used for data analysis and manipulation, and it provides simple functions to read many file types into DataFrames. The most commonly used function is pd.read_csv() for reading comma-separated value (CSV) files, but pandas can handle many other formats as well. It supports reading .txt files using read_csv() or read_table(), Excel files with read_excel(), JSON files with read_json(), HTML tables using read_html(), SQL databases using read_sql(), and binary formats like Parquet, HDF5, and Feather. These functions import data directly into a structured, tabular form where it can be cleaned, filtered, and analyzed. Each function also offers many optional parameters to handle missing values, set headers, parse dates, specify delimiters, and more, which makes pandas flexible enough for a wide variety of real-world data sources.
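
As a small illustration of how the reader functions share one pattern, the sketch below loads the same table from CSV, tab-separated, and JSON text. The inline strings are hypothetical stand-ins for files on disk; io.StringIO is used only so the example runs without any files.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for three files on disk.
csv_text = "name,score\nAlice,88\nBob,76\n"
tsv_text = "name\tscore\nAlice\t88\nBob\t76\n"
json_text = '[{"name": "Alice", "score": 88}, {"name": "Bob", "score": 76}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))  # read_table defaults to a tab separator
from_json = pd.read_json(io.StringIO(json_text))

# Each call returns a DataFrame with the same columns and rows.
for frame in (from_csv, from_tsv, from_json):
    print(list(frame.columns), frame.shape)
```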

## Writing data with Pandas

Pandas provides simple functions to write DataFrames to external files. The most common method is DataFrame.to_csv() which writes data to a .csv file. Similarly, you can use to_excel() for .xlsx files, to_json() for JSON, to_html() for HTML tables, and even to_sql() for databases. These functions support many customization options such as choosing a delimiter, including/excluding the index column, specifying headers, formatting dates, and controlling the output file path. This functionality is especially useful for exporting processed data, storing model-ready datasets, or creating reports. Writing with pandas ensures that your data retains its structure, making it easy to reuse or share.
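
The writer functions can be tried out without touching the disk. In this minimal sketch the DataFrame contents are made up for illustration; to_csv() returns a string when no path is given, and any writer also accepts a file-like buffer.

```python
import io
import json

import pandas as pd

# Hypothetical data used only for this illustration.
df = pd.DataFrame({"name": ["Alice", "Bob"], "score": [88, 76]})

# With no path, to_csv() returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)

# A custom delimiter, written into an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)
semicolon_text = buffer.getvalue()

# One JSON object per row.
json_text = df.to_json(orient="records")

print(csv_text)
print(semicolon_text)
print(json.loads(json_text))
```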

We will work through reading several types of data, including:

1. CSV files (.csv)
2. Raw text files (.txt)
3. JSON data from a file or from an API
4. Data from SQL query over a database

# The read_csv method

The read_csv() function is one of the most commonly used functions in the pandas library. It reads a comma-separated values (CSV) file and converts it directly into a pandas DataFrame, a structured table whose rows and columns can be easily manipulated for analysis.

This function matters to data analysts and data scientists because so much real-world data is distributed in .csv format. Whether you're working with survey data, financial reports, exported SQL results, or scraped datasets, read_csv() is often the very first step of your analysis pipeline.

Syntax:
import pandas as pd
df = pd.read_csv("filename.csv")


filepath_or_buffer

    The file path (local or URL) to the CSV file.

sep

    Delimiter used (default is ',', but can be '\t', ';', etc.).

header  

    Row number(s) to use as the column names. Use None if there’s no header.

names

    List of column names to use (when header=None).

index_col

    Column(s) to set as the index (e.g. "ID").

usecols

    A list of column names or indices to load.

dtype

    Specify data types for columns.

parse_dates

    Convert specified columns to datetime format.

na_values

    Additional strings to recognize as NA/NaN.

skiprows

    Number of lines to skip at the start of the file, or a list of line numbers to skip.

nrows

    Number of rows of the file to read.

encoding

    Specify file encoding (e.g. "utf-8", "ISO-8859-1").

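on_bad_lines (replaces the deprecated error_bad_lines, which was removed in pandas 2.0)

    What to do with lines that have too many fields: "error" (the default), "warn", or "skip".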

engine

    Backend parser to use ("c", "python").
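
To see several of these parameters working together, here is a minimal sketch on a made-up semicolon-separated export. The data, the column names, and the "missing" placeholder are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical export: one junk line on top, ';' as delimiter,
# and the word "missing" used as a placeholder for absent values.
raw = (
    "# exported by a hypothetical tool\n"
    "id;city;temp;note\n"
    "1;Oslo;4.5;ok\n"
    "2;Lima;missing;ok\n"
    "3;Cairo;31.0;check\n"
    "4;Pune;29.5;ok\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                        # fields are separated by semicolons
    skiprows=1,                     # skip the junk line above the header
    usecols=["id", "city", "temp"], # ignore the "note" column
    index_col="id",                 # use "id" as the index
    na_values=["missing"],          # treat "missing" as NaN
    nrows=3,                        # read only the first three data rows
)
print(df)
print(df.dtypes)
```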


For now, we will read a CSV file from our local folder. I have downloaded the "btc-market-price.csv" file for this.

In [340]:
import pandas as pd

In [341]:
pd.read_csv?
# You can do this to read the documentation for the read_csv function.

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None'
    ...

Every time we call read_csv we must provide a valid path to the CSV file. Any valid string path is accepted, and the string can also be a URL. The pandas.read_csv() function supports URL schemes such as http://, https://, ftp://, s3://, gs://, and file://, so it can read CSV files directly from web links, cloud storage services, and local file paths.

In [342]:
# csv_url = "https://github.com/datasets/gdp/blob/main/data/gdp.csv"
# The URL above points to the GitHub page for the file (an HTML page), not to the file itself, so passing it caused a parse error.
csv_url = "https://raw.githubusercontent.com/datasets/gdp/refs/heads/main/data/gdp.csv"
# I got this URL by clicking "Raw" on the dataset page and copying the address of the page that opened.

pd.read_csv(csv_url).head()

Unnamed: 0,Country Name,Country Code,Year,Value
0,Afghanistan,AFG,2000,3521418000.0
1,Afghanistan,AFG,2001,2813572000.0
2,Afghanistan,AFG,2002,3825701000.0
3,Afghanistan,AFG,2003,4520947000.0
4,Afghanistan,AFG,2004,5224897000.0
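
The page-URL versus raw-URL distinction can be captured in a small helper. The function name github_blob_to_raw is hypothetical, not part of pandas or any GitHub library, and the sketch assumes the common https://github.com/<owner>/<repo>/blob/<branch>/<path> layout. It produces the shorter raw.githubusercontent.com/<owner>/<repo>/<branch>/<path> form rather than the refs/heads/ form used above.

```python
def github_blob_to_raw(url: str) -> str:
    """Turn a GitHub 'blob' page URL into a raw-content URL (hypothetical helper)."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        # Not a blob page URL: return it unchanged.
        return url
    remainder = url[len(prefix):]
    # Drop the first "/blob" segment and switch to the raw-content host.
    return "https://raw.githubusercontent.com/" + remainder.replace("/blob/", "/", 1)


page_url = "https://github.com/datasets/gdp/blob/main/data/gdp.csv"
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

The resulting string can then be passed to pd.read_csv() in place of the page URL.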


# What is a ParserError in pandas.read_csv()?

A ParserError means that Pandas couldn’t properly read and split the file into rows and columns. It’s usually triggered when:

	• The number of columns differs across rows

	• The file is not actually a CSV, even though it ends in .csv

	• You’re trying to read HTML, JSON, or webpage text as if it were raw CSV
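
The first cause is easy to reproduce. This sketch uses made-up inline data in which one row has an extra field; it catches the pandas.errors.ParserError and then rereads the data with on_bad_lines="skip", which drops the offending row instead of failing.

```python
import io

import pandas as pd

# Hypothetical data: the "Bob" row has three fields, the header only two.
bad_csv = "name,age\nAlice,24\nBob,27,extra\nEva,25\n"

raised = False
try:
    pd.read_csv(io.StringIO(bad_csv))
except pd.errors.ParserError as exc:
    raised = True
    print("ParserError:", exc)

# Skip malformed lines instead of raising.
cleaned = pd.read_csv(io.StringIO(bad_csv), on_bad_lines="skip")
print(cleaned)
```

Skipping lines silently can hide data problems, so it is usually worth checking how many rows were dropped before relying on the result.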

In [343]:
df = pd.read_csv("dataset/btc-market-price.csv")
df.head()

# reading locally available csv files

Unnamed: 0,2017-04-02 00:00:00,1099.169125
0,2017-04-03 00:00:00,1141.813
1,2017-04-04 00:00:00,1141.600363
2,2017-04-05 00:00:00,1133.079314
3,2017-04-06 00:00:00,1196.307937
4,2017-04-07 00:00:00,1190.45425


In the code above we let pandas infer everything about our data, but in most cases we need to tell pandas explicitly how we want the data to be loaded. We do this with the parameters discussed below.

# First row behaviour with the header parameter
The CSV file we are reading has two columns, Timestamp and Price, but it does not have a header row. By default pandas treats the first row as the header, which is wrong for files like this one that lack a header. To override this behavior we use the header parameter.

In [344]:
df = pd.read_csv("dataset/btc-market-price.csv", header = None)

In [345]:
df

Unnamed: 0,0,1
0,2017-04-02 00:00:00,1099.169125
1,2017-04-03 00:00:00,1141.813000
2,2017-04-04 00:00:00,1141.600363
3,2017-04-05 00:00:00,1133.079314
4,2017-04-06 00:00:00,1196.307937
...,...,...
360,2018-03-28 00:00:00,7960.380000
361,2018-03-29 00:00:00,7172.280000
362,2018-03-30 00:00:00,6882.531667
363,2018-03-31 00:00:00,6935.480000


# Missing values with the na_values parameter
We can pass the na_values parameter a list of values that should be recognized as NA/NaN. In this case, empty strings, ?, and - will be treated as null values.



In [346]:
df = pd.read_csv("dataset/btc-market-price.csv", header = None, na_values = ['',"",'?','-'])
df.sample(n=10)


Unnamed: 0,0,1
225,2017-11-13 00:00:00,6550.227533
26,2017-04-28 00:00:00,1331.294429
263,2017-12-21 00:00:00,16047.51
199,2017-10-18 00:00:00,5546.1761
115,2017-07-26 00:00:00,2495.028586
314,2018-02-10 00:00:00,8319.876566
72,2017-06-13 00:00:00,2748.185086
6,2017-04-08 00:00:00,1181.149838
224,2017-11-12 00:00:00,5716.301583
116,2017-07-27 00:00:00,2647.625
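
The sample above contains no placeholder values, so the effect of na_values is not visible there. A minimal sketch with made-up rows (the prices are illustrative, not taken from the file) shows the difference:

```python
import io

import pandas as pd

# Hypothetical rows where "?" and "-" stand in for missing prices.
raw = (
    "day,price\n"
    "2017-04-02,1099.17\n"
    "2017-04-03,?\n"
    "2017-04-04,-\n"
    "2017-04-05,1133.08\n"
)

plain = pd.read_csv(io.StringIO(raw))
marked = pd.read_csv(io.StringIO(raw), na_values=["?", "-"])

# Without na_values the placeholders are kept as text, so the column is not numeric.
print(plain["price"].tolist())
# With na_values they become NaN and the column is parsed as float.
print(marked["price"].tolist(), marked["price"].dtype)
```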


In [347]:
df = pd.read_csv("dataset/btc-market-price.csv",
                 header = None, 
                 na_values=['',"",'?','-'],
                 names = ["Timestamp", "Price"])

In [348]:
df 

Unnamed: 0,Timestamp,Price
0,2017-04-02 00:00:00,1099.169125
1,2017-04-03 00:00:00,1141.813000
2,2017-04-04 00:00:00,1141.600363
3,2017-04-05 00:00:00,1133.079314
4,2017-04-06 00:00:00,1196.307937
...,...,...
360,2018-03-28 00:00:00,7960.380000
361,2018-03-29 00:00:00,7172.280000
362,2018-03-30 00:00:00,6882.531667
363,2018-03-31 00:00:00,6935.480000


In [349]:
df.head()

Unnamed: 0,Timestamp,Price
0,2017-04-02 00:00:00,1099.169125
1,2017-04-03 00:00:00,1141.813
2,2017-04-04 00:00:00,1141.600363
3,2017-04-05 00:00:00,1133.079314
4,2017-04-06 00:00:00,1196.307937


In [350]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Timestamp  365 non-null    object 
 1   Price      365 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.8+ KB


# Column Types using dtype parameter

Without the dtype parameter, pandas tries to infer the data types automatically. We can use dtype to force pandas to use a specific type for a column. In this example we force the Price column to be float.

In [351]:
df = pd.read_csv("dataset/btc-market-price.csv",
                 header = None,
                 na_values = ['',"",'?','-'],
                 names = ["Timestamp", "Price"],
                 dtype = {"Price": "float"})


In [352]:
df.head()

Unnamed: 0,Timestamp,Price
0,2017-04-02 00:00:00,1099.169125
1,2017-04-03 00:00:00,1141.813
2,2017-04-04 00:00:00,1141.600363
3,2017-04-05 00:00:00,1133.079314
4,2017-04-06 00:00:00,1196.307937


In [353]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Timestamp  365 non-null    object 
 1   Price      365 non-null    float64
dtypes: float64(1), object(1)
memory usage: 5.8+ KB
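
Price was already inferred as float64, so forcing it changes nothing here. A case where dtype does matter is an identifier with leading zeros; the sketch below uses a made-up zip column to show it.

```python
import io

import pandas as pd

# Hypothetical data: zip codes with leading zeros.
raw = "zip,price\n01234,10\n00501,20\n"

inferred = pd.read_csv(io.StringIO(raw))
forced = pd.read_csv(io.StringIO(raw), dtype={"zip": str, "price": "float64"})

# Inference reads the zip codes as integers and drops the leading zeros.
print(inferred["zip"].tolist())
# Forcing str keeps them exactly as written; price becomes float.
print(forced["zip"].tolist(), forced["price"].dtype)
```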


In [354]:
pd.to_datetime(df["Timestamp"].head())
# Quick check that the column can be parsed as datetime before overwriting it

0   2017-04-02
1   2017-04-03
2   2017-04-04
3   2017-04-05
4   2017-04-06
Name: Timestamp, dtype: datetime64[ns]

In [355]:
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

In [356]:
df.head()

Unnamed: 0,Timestamp,Price
0,2017-04-02,1099.169125
1,2017-04-03,1141.813
2,2017-04-04,1141.600363
3,2017-04-05,1133.079314
4,2017-04-06,1196.307937


In [357]:
df.dtypes

Timestamp    datetime64[ns]
Price               float64
dtype: object

# Date parser using parse_dates parameter

Another way of handling datetime values is to pass the parse_dates parameter the position of the date column.


In [358]:
df = pd.read_csv("dataset/btc-market-price.csv",
                 header = None,
                 na_values = ['',"",'?','-'],
                 names = ["Timestamp", "Price"],
                 dtype = {"Price": "float"},
                 parse_dates = [0])

# parse_dates = [0] simply says to parse the index 0 column to datetime


In [359]:
df.head()

Unnamed: 0,Timestamp,Price
0,2017-04-02,1099.169125
1,2017-04-03,1141.813
2,2017-04-04,1141.600363
3,2017-04-05,1133.079314
4,2017-04-06,1196.307937


In [360]:
df.dtypes

Timestamp    datetime64[ns]
Price               float64
dtype: object
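
parse_dates also accepts column names, which can read more clearly than positions. A minimal sketch on two inline rows (prices rounded for illustration), followed by a use of the .dt accessor that only works once the column is a real datetime:

```python
import io

import pandas as pd

# Two rows in the same layout as the file; prices rounded for illustration.
raw = "2017-04-02 00:00:00,1099.17\n2017-04-03 00:00:00,1141.81\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["Timestamp", "Price"],
    parse_dates=["Timestamp"],  # refer to the column by name instead of position
)

print(df.dtypes)
# The .dt accessor is available because the column holds datetimes, not strings.
print(df["Timestamp"].dt.month.tolist())
```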

# Adding an index with the index_col parameter

In [361]:
df = pd.read_csv("dataset/btc-market-price.csv",
                 header = None,
                 na_values = ['',"",'?','-'],
                 names = ["Timestamp", "Price"],
                 dtype = {"Price": "float"},
                 parse_dates= [0],
                 index_col=[0])


In [362]:
df.head()

Unnamed: 0_level_0,Price
Timestamp,Unnamed: 1_level_1
2017-04-02,1099.169125
2017-04-03,1141.813
2017-04-04,1141.600363
2017-04-05,1133.079314
2017-04-06,1196.307937
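
One benefit of a datetime index is that rows can be selected by date strings. The sketch below assumes three made-up rows in the same layout as the file and selects one month with .loc.

```python
import io

import pandas as pd

# Hypothetical rows spanning two months; values are illustrative only.
raw = (
    "2017-04-02 00:00:00,1099.17\n"
    "2017-04-03 00:00:00,1141.81\n"
    "2017-05-01 00:00:00,1417.17\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["Timestamp", "Price"],
    parse_dates=[0],
    index_col=0,
)

# With a DatetimeIndex, a partial date string selects every row in that month.
april = df.loc["2017-04"]
print(type(df.index).__name__)
print(april)
```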


# Additional challenging parsing

Custom data delimiters using the sep parameter

Here, I have made a sample data for the demonstration. 

In [363]:
exam_df = pd.read_csv("dataset/sample_data_gt.csv")

In [364]:
exam_df

Unnamed: 0,Name>Age>Score
0,Alice>24>88
1,Bob>27>76
2,Charlie>22>93
3,David>30>85
4,Eva>25>90


We can specify the delimiter with the sep parameter. If we don't pass sep, pandas assumes a comma; it tries to detect the separator automatically only when we pass sep=None, which requires the Python engine.
In most CSV files the separator is a comma (,), so the default works. That is not always the case, though: we also find semicolons (;), tabs (\t, especially in TSV files), whitespace, or other characters.
In this case the separator is the > character.

In [365]:
exam_df = pd.read_csv("dataset/sample_data_gt.csv", sep=">")

In [366]:
exam_df

# This is how the sep parameter helps us separate the data properly

Unnamed: 0,Name,Age,Score
0,Alice,24,88
1,Bob,27,76
2,Charlie,22,93
3,David,30,85
4,Eva,25,90
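
When the delimiter is not known in advance, pandas can try to detect it: passing sep=None together with engine="python" hands the first line to Python's csv.Sniffer. The sketch below uses made-up semicolon and tab samples. Detection is a heuristic that works best with common delimiters, so for an unusual one such as > passing sep explicitly remains the safer choice.

```python
import io

import pandas as pd

# Hypothetical samples with two common delimiters.
semicolon_text = "Name;Age;Score\nAlice;24;88\nBob;27;76\n"
tab_text = "Name\tAge\tScore\nAlice\t24\t88\nBob\t27\t76\n"

# sep=None asks the Python engine to sniff the delimiter from the first line.
from_semicolons = pd.read_csv(io.StringIO(semicolon_text), sep=None, engine="python")
from_tabs = pd.read_csv(io.StringIO(tab_text), sep=None, engine="python")

print(list(from_semicolons.columns))
print(list(from_tabs.columns))
```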


# Custom data encoding

Files are stored using different encodings. We have probably heard of ASCII, UTF-8, latin1, etc.

While reading data, a custom encoding can be specified with the encoding parameter.

    encoding="UTF-8": used if the data is UTF-8 encoded.
    encoding="iso-8859-1": used if the data is ISO/IEC 8859-1 (Latin-1, often called "extended ASCII") encoded.

In our case we don't need to set the encoding, as the data loads without any problems.
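
To see what happens when the encoding does matter, this sketch builds a small ISO-8859-1 byte string in memory (the city names are made up for illustration) and reads it first with the default UTF-8 decoding and then with the right encoding.

```python
import io

import pandas as pd

# Hypothetical file contents, encoded as ISO-8859-1 rather than UTF-8.
raw_bytes = "city,country\nMálaga,España\nZürich,Schweiz\n".encode("iso-8859-1")

# The default assumes UTF-8, which cannot decode the accented bytes.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    print("decoded as UTF-8")
except UnicodeDecodeError as exc:
    print("UTF-8 failed:", exc)

# Naming the actual encoding reads the text correctly.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="iso-8859-1")
print(df)
```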

# Get rid of blank lines

When set to True, the skip_blank_lines parameter skips blank lines while reading the file.
If we set it to False, every blank line is loaded into the DataFrame as a row of NaN values.

In [367]:
#syntax
# df = pd.read_csv("sample.csv", skip_blank_lines=False)
# print(df)

# The default is skip_blank_lines=True, so blank lines are skipped unless we pass False
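
A runnable version of the idea, using a made-up three-line sample with one blank line in the middle:

```python
import io

import pandas as pd

# Hypothetical data with a blank line between the two records.
raw = "Name,Age\nAlice,24\n\nBob,27\n"

skipped = pd.read_csv(io.StringIO(raw))                       # default: blank lines are skipped
kept = pd.read_csv(io.StringIO(raw), skip_blank_lines=False)  # blank line becomes a NaN row

print(len(skipped), "rows when blank lines are skipped")
print(kept)
```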

# Excluding specific rows

We can use skiprows to:
1. Skip a given number of rows from the beginning of a file by passing an integer. **This removes the header too.**
2. Skip specific rows by passing a list of row indices.

In [None]:
exam_df = pd.read_csv("dataset/output.csv", 
                      skiprows = 2)
# skiprows = 2 skips the first two lines of the file, including the header.


In [369]:
exam_df

Unnamed: 0,Alice,24,88
0,Bob,27,76
1,Charlie,22,93
2,Eva,25,90
3,Frank,29,82
4,Grace,31,87
5,Helen,26,91
6,Ian,28,79
7,Julia,23,95


Since the header counts as row 0, we can skip data rows 1 and 3 by passing skiprows = [1,3].

In [None]:
exam_df = pd.read_csv("dataset/output.csv", skiprows = [1,3])

# This simply means don't read row 1 and 3 at all.

# You generally don’t skip the last row because it might contain actual, important data.

In [371]:
exam_df

Unnamed: 0,Name,Age,Score
0,Alice,24,88
1,Charlie,22,93
2,Eva,25,90
3,Frank,29,82
4,Grace,31,87
5,Helen,26,91
6,Ian,28,79
7,Julia,23,95
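
To make the line numbering concrete, here is a self-contained sketch on a hypothetical six-line file held in memory (not the output.csv used above). It shows the integer and list forms, plus a third form that read_csv also accepts: a function that receives each line number and returns True for lines to skip.

```python
import io

import pandas as pd

# Hypothetical file: line 0 is the header, lines 1-5 are data rows.
raw = (
    "Name,Age,Score\n"
    "Alice,24,88\n"
    "Bob,27,76\n"
    "Charlie,22,93\n"
    "David,30,85\n"
    "Eva,25,90\n"
)

# Integer: drop the first two lines, so the "Bob" line is used as the header.
first_two = pd.read_csv(io.StringIO(raw), skiprows=2)

# List: drop lines 1 and 3 (Alice and Charlie); the header on line 0 is kept.
by_index = pd.read_csv(io.StringIO(raw), skiprows=[1, 3])

# Callable: drop every even-numbered line after the header (Bob and David).
by_rule = pd.read_csv(io.StringIO(raw), skiprows=lambda i: i > 0 and i % 2 == 0)

print(list(first_two.columns))
print(by_index["Name"].tolist())
print(by_rule["Name"].tolist())
```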


# Loading specific columns 

When we want to load specific columns instead of everything, we can pass the required columns to the **usecols** parameter.

This is better performance-wise: instead of loading the entire file and then deleting the columns we don't need, we load only the required columns into the DataFrame.

In [None]:
pd.read_csv("dataset/output.csv", usecols = ["Age"])
# This is using the Column name 

Unnamed: 0,Age
0,24
1,22
2,25
3,29
4,31
5,26
6,28
7,23


In [None]:
pd.read_csv("dataset/output.csv", usecols = [1])

# This is using the column index

Unnamed: 0,Age
0,24
1,22
2,25
3,29
4,31
5,26
6,28
7,23
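
Besides names and positions, usecols also accepts a function that is called with each column name and keeps the columns for which it returns True. A minimal sketch on made-up inline data comparing the three forms:

```python
import io

import pandas as pd

# Hypothetical data in the same shape as the exam file.
raw = "Name,Age,Score\nAlice,24,88\nBob,27,76\n"

by_name = pd.read_csv(io.StringIO(raw), usecols=["Name", "Score"])
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 2])
by_rule = pd.read_csv(io.StringIO(raw), usecols=lambda col: col != "Age")

# All three selections load the same two columns.
for frame in (by_name, by_position, by_rule):
    print(list(frame.columns))
```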


# Using a Series instead of a DataFrame

In [378]:
exam_proto = pd.read_csv("dataset/output.csv", usecols = ["Age"])

In [383]:
type(exam_proto)

pandas.core.frame.DataFrame

In [None]:
exam_proto = exam_proto.squeeze()
# squeeze() converts the single-column DataFrame into a Series

# Squeezing makes sense when you’re working with a single column or row and want a simpler structure (Series) to perform analysis, math, or plotting.

In [386]:
type(exam_proto)

pandas.core.series.Series
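
One detail worth knowing: squeeze() removes every axis of length one, so a DataFrame with a single column and a single row collapses all the way to a scalar. Passing "columns" squeezes only the column axis and always yields a Series. A small sketch with a made-up one-row sample:

```python
import io

import pandas as pd

# Hypothetical file with exactly one data row.
raw = "Name,Age\nAlice,24\n"
one_cell = pd.read_csv(io.StringIO(raw), usecols=["Age"])

scalar = one_cell.squeeze()           # 1x1 DataFrame collapses to a single value
series = one_cell.squeeze("columns")  # only the column axis is squeezed

print(type(scalar).__name__, scalar)
print(type(series).__name__, len(series))
```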

# Save to CSV file

In [372]:
exam_df.to_csv("dataset/output.csv")

In [373]:
pd.read_csv("dataset/output.csv")

Unnamed: 0.1,Unnamed: 0,Name,Age,Score
0,0,Alice,24,88
1,1,Charlie,22,93
2,2,Eva,25,90
3,3,Frank,29,82
4,4,Grace,31,87
5,5,Helen,26,91
6,6,Ian,28,79
7,7,Julia,23,95


In [374]:
exam_df.to_csv("dataset/output.csv", index = False)

In [375]:
pd.read_csv("dataset/output.csv")
# With index = False the index is not written to the file

Unnamed: 0,Name,Age,Score
0,Alice,24,88
1,Charlie,22,93
2,Eva,25,90
3,Frank,29,82
4,Grace,31,87
5,Helen,26,91
6,Ian,28,79
7,Julia,23,95
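
The extra "Unnamed: 0" column seen earlier comes from the index being written as an unlabeled first column. This sketch, on a small made-up DataFrame, shows the two ways to avoid it: leave the index out when writing, or read it back as the index with index_col=0.

```python
import io

import pandas as pd

# Hypothetical data used only for this illustration.
df = pd.DataFrame({"Name": ["Alice", "Charlie"], "Age": [24, 22]})

with_index = df.to_csv()                # index written as an unlabeled first column
without_index = df.to_csv(index=False)  # index left out

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# If the index was written, index_col=0 restores it instead of creating "Unnamed: 0".
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(restored.columns))
```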
