### Notes on Ch 6: Data Loading, Storage, and File Formats

Pandas provides various functions to read tabular data into a DataFrame. The most commonly used function is `pandas.read_csv`, which reads delimited data from a file, URL, or file-like object. Other functions are available for different formats, such as Excel, JSON, HDF5, and more.
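
As a minimal sketch of the "file-like object" case, the snippet below feeds `read_csv` an in-memory `io.StringIO` buffer instead of a path. The inline text is a hypothetical stand-in for a file on disk, not data taken from the chapter's example files.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a CSV file on disk.
csv_text = (
    "a,b,c,d,message\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# read_csv accepts any object with a read() method, not only a path or URL.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The same call works unchanged with an open file handle, which makes small inline buffers like this convenient for experimenting with parser options.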

#### Reading and Writing Data in Text Format

The `read_csv` function is extensively used to convert text data into a DataFrame. It has various optional arguments to handle different scenarios:

* <b>Indexing</b>

You can treat one or more columns as the index of the returned DataFrame and specify whether to get column names from the file, from arguments you provide, or not at all.

* <b>Type Inference and Data Conversion</b>

This involves user-defined value conversions and specifying a custom list of missing value markers.

* <b>Date and Time Parsing</b>

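The `parse_dates` argument converts columns to datetimes and can combine date and time information spread over multiple columns into a single column. Note that the column-combining form is deprecated as of pandas 2.2.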

* <b>Iterating</b>

Support for iterating over chunks of very large files is provided.

* <b>Unclean Data Issues</b>

This includes skipping rows or a footer, ignoring comments, and parsing numbers that use commas as thousands separators.
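
Several of the options listed above can be sketched on small in-memory inputs. Both strings below are hypothetical examples invented for illustration. The first call combines `comment`, `thousands`, `skipfooter`, and single-column `parse_dates`; the second shows chunked iteration with `chunksize`.

```python
import io

import pandas as pd

# Hypothetical messy export: a comment line, quoted amounts with thousands
# separators, and a footer line that is not part of the table.
raw = (
    "# exported by a hypothetical tool\n"
    "date,amount\n"
    '2024-01-01,"1,200"\n'
    '2024-01-02,"3,450"\n'
    '2024-01-03,"10,000"\n'
    "TOTAL ROWS: 3\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    comment="#",           # ignore text after '#', including whole comment lines
    thousands=",",         # parse "1,200" as the integer 1200
    skipfooter=1,          # drop the last line of the input
    engine="python",       # skipfooter requires the Python parsing engine
    parse_dates=["date"],  # convert the date column to datetimes
)
print(df)
print(df.dtypes)

# Chunked reading: with chunksize, read_csv returns an iterator of DataFrames
# instead of loading everything at once.
clean = "key,value\na,1\nb,2\nc,3\nd,4\ne,5\n"
rows_seen = 0
running_total = 0
for chunk in pd.read_csv(io.StringIO(clean), chunksize=2):
    rows_seen += len(chunk)
    running_total += int(chunk["value"].sum())
print(rows_seen, running_total)
```

Accumulating per-chunk results, as the loop does here, keeps memory use bounded by the chunk size rather than by the size of the file.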



In [3]:
import pandas as pd

# Reading a CSV file with a header row
df = pd.read_csv("../data/ex1.csv")
df

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo


In [5]:
# File without a Header

df_no_header = pd.read_csv("../data/ex2.csv", header=None)
print(df_no_header)
print("")
# Adding a header
df_custom_header = pd.read_csv("../data/ex2.csv", names=['a', 'b', 'c', 'd', 'message'])
print(df_custom_header)

   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo


In [6]:
# Using a Column as an Index
df_indexed = pd.read_csv("../data/ex2.csv", names=['a', 'b', 'c', 'd', 'message'], index_col='message')
df_indexed

         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12


In [8]:
# Hierarchical Indexing
parsed = pd.read_csv("../data/csv_mindex.csv", index_col=['key1', 'key2'])
parsed

           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16


In [10]:
# Handling Whitespace-delimited File
# A raw string keeps Python from treating \s as an invalid escape sequence
result = pd.read_csv("../data/ex3.txt", sep=r'\s+')
result

            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
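
When the header line has one fewer name than the data rows have fields, `read_csv` uses the first column as the index. The sketch below reproduces that behavior on an in-memory copy of the first two rows shown in the output above. The exact spacing of the text is an assumption, since the contents of `ex3.txt` are not shown here.

```python
import io

import pandas as pd

# Whitespace-aligned text: the header names 3 columns, each data row has 4 fields.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r"\s+" is a regular expression matching one or more whitespace characters.
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(df.index.tolist())
print(df.columns.tolist())
```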


In [11]:
# Skipping Rows
skipped_rows = pd.read_csv("../data/ex4.csv", skiprows=[0, 2, 3])
skipped_rows

   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
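
To make explicit which lines `skiprows=[0, 2, 3]` discards, here is a sketch on a hypothetical input with stray note lines at those zero-based positions. The text is a stand-in and is not the actual content of `ex4.csv`. `skiprows` also accepts a callable that receives the line number, shown as an equivalent alternative.

```python
import io

import pandas as pd

# Hypothetical stand-in: note lines sit at zero-based positions 0, 2, and 3.
text = (
    "note: exported data\n"
    "a,b,c,d,message\n"
    "note: values below are unverified\n"
    "note: contact the data owner\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
)

# A list of line numbers to skip; the header is then read from line 1.
df = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])

# Equivalent form: a callable that returns True for lines to skip.
df_callable = pd.read_csv(io.StringIO(text), skiprows=lambda i: i in {0, 2, 3})

print(df)
```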


In [13]:
# Handling Missing Values
result_missing = pd.read_csv("../data/ex5.csv")
print(result_missing)
print("")
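# 'NULL' is already one of pandas' default missing-value markers,
# so passing it in na_values leaves the output unchanged here
result_custom_missing = pd.read_csv("../data/ex5.csv", na_values=['NULL'])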
print(result_custom_missing)

  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
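
The two outputs above are identical, so the effect of `na_values` is easier to see with markers that are not already treated as missing. The sketch below uses an in-memory table reconstructed from the printed output, which is an assumption about the file's contents. It compares the defaults, per-column markers passed as a dict, and `keep_default_na=False`.

```python
import io

import pandas as pd

# Reconstructed from the output above (an assumption about ex5.csv's contents).
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Defaults: empty fields and strings such as "NA" are read as NaN.
default = pd.read_csv(io.StringIO(text))

# Per-column markers, added on top of the defaults.
custom = pd.read_csv(
    io.StringIO(text),
    na_values={"message": ["foo"], "something": ["two"]},
)

# keep_default_na=False turns off the built-in markers, so "NA" stays a string
# and only the markers listed in na_values count as missing.
strict = pd.read_csv(io.StringIO(text), keep_default_na=False, na_values=["foo"])

print(default.isna().sum())
print(custom.isna().sum())
print(strict["message"].tolist())
```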
