In [1]:
import pandas as pd

This notebook shows basic techniques for reading data in multiple formats into a Pandas DataFrame. We start with data stored in CSV files.

In [2]:
%%writefile data.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Overwriting data.csv


The simplest thing we can do with a CSV file is to read it as is. In this case:

* The first row will be considered the column names
* The following rows are directly read as rows in the DataFrame

In [3]:
filename = "data.csv"
df = pd.read_csv(filename)
df

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
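
The two points above can be checked directly. The following is a minimal sketch, assuming the same contents as data.csv held in an in-memory string (the names csv_text and df_default are illustrative and not part of the notebook):

```python
import io

import pandas as pd

# Same content as data.csv, kept in memory so the snippet is self-contained.
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

df_default = pd.read_csv(io.StringIO(csv_text))

# The first row became the column labels.
column_labels = list(df_default.columns)

# The remaining three rows became the data rows.
shape = df_default.shape

print(column_labels)
print(shape)
# Numeric columns are inferred as integers; the message column is kept as text.
print(df_default.dtypes)
```

Reading from io.StringIO instead of a file on disk is only a convenience here; pd.read_csv accepts both paths and file-like objects.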


Pandas can also read CSV files that don't have a header. In this case we pass header=None, and the columns are labeled with integers starting at 0.

In [4]:
%%writefile data_no_header.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Overwriting data_no_header.csv


In [5]:
filename = "data_no_header.csv"
df = pd.read_csv(filename, header=None)
df

,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
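
Because the column labels are integers rather than strings, columns are selected with integer keys. A small sketch of this, again assuming the file contents are held in an in-memory string (csv_text and df_no_header are illustrative names):

```python
import io

import pandas as pd

# Same content as data_no_header.csv.
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# The column labels are the integers 0 to 4, not the strings "0" to "4".
column_labels = list(df_no_header.columns)

# Select the last column by its integer label.
messages = df_no_header[4].tolist()

print(column_labels)
print(messages)
```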


We can specify the column names as well. When names is passed without a header argument, Pandas assumes that the file has no header row.

In [6]:
filename = "data_no_header.csv"
df = pd.read_csv(filename, names=['a', 'b', 'c', 'd', 'message'])
df

,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
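
One consequence of this rule is worth seeing once: if the file does have a header row and only names is passed, the header line is read as data. The sketch below illustrates this and shows that passing header=0 together with names replaces the existing header instead. The replacement names (w, x, y, z, text) and variable names are made up for the illustration:

```python
import io

import pandas as pd

# A file that DOES contain a header row.
csv_with_header = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n"
new_names = ["w", "x", "y", "z", "text"]

# names alone: no header row is assumed, so the header line is read as data.
df_kept = pd.read_csv(io.StringIO(csv_with_header), names=new_names)

# names plus header=0: the header line is consumed and replaced by new_names.
df_replaced = pd.read_csv(
    io.StringIO(csv_with_header), names=new_names, header=0
)

print(df_kept)
print(df_replaced)
```

In the first result the old header shows up as an extra row, which also prevents the numeric columns from being parsed as integers.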


We can also use one of the columns in the data as the index. In this case, the column used as the index is no longer included among the data columns.

In [7]:
filename = "data_no_header.csv"
names=['a', 'b', 'c', 'd', 'message']
df = pd.read_csv(filename, names=names, index_col='message')
df

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12
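
A short sketch of what this means in practice, assuming the same data held in an in-memory string (df_indexed and the other variable names are illustrative): the message column disappears from the columns, and rows can be looked up by message value with .loc.

```python
import io

import pandas as pd

# Same content as data_no_header.csv.
csv_text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

df_indexed = pd.read_csv(
    io.StringIO(csv_text), names=names, index_col="message"
)

# 'message' has moved from the columns into the index.
message_is_column = "message" in df_indexed.columns
index_name = df_indexed.index.name

# Rows can now be looked up by their message value.
world_row = df_indexed.loc["world"].tolist()

print(message_is_column, index_name)
print(world_row)
```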


Using this technique, it is also possible to create hierarchical indexes by passing a list of column names as index_col.

In [8]:
%%writefile data_hierarchy.csv
1,1,3,4
1,2,3,4
2,4,2,5
2,4,5,6
2,1,1,1

Overwriting data_hierarchy.csv


In [9]:
filename = "data_hierarchy.csv"
names=['a', 'b', 'c', 'd']
df = pd.read_csv(filename, names=names, index_col=['a','b'])
df

a,b,c,d
1,1,3,4
1,2,3,4
2,4,2,5
2,4,5,6
2,1,1,1
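
The resulting index is a MultiIndex with one level per column listed in index_col. The following sketch, which assumes the same data held in an in-memory string (df_multi and the other variable names are illustrative), inspects the levels and selects on the outer level:

```python
import io

import pandas as pd

# Same content as data_hierarchy.csv.
csv_text = "1,1,3,4\n1,2,3,4\n2,4,2,5\n2,4,5,6\n2,1,1,1\n"

df_multi = pd.read_csv(
    io.StringIO(csv_text),
    names=["a", "b", "c", "d"],
    index_col=["a", "b"],
)

# The index has two levels, named after the columns they came from.
n_levels = df_multi.index.nlevels
level_names = list(df_multi.index.names)

# The pair (2, 4) occurs twice, so the index is not unique.
index_is_unique = df_multi.index.is_unique

# Selecting on the outer level returns the rows with a == 1, indexed by b.
rows_a1 = df_multi.loc[1]

print(n_levels, level_names, index_is_unique)
print(rows_a1)
```

Nothing requires the index to be unique when reading; duplicate keys such as (2, 4) are kept as separate rows.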
