#  Data Loading, Storage, and File Formats

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels as sm

## Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object.

![alt text](images/parsingdata.png "Parsing functions in pandas")

The optional arguments for these functions fall into a few categories:

- **Indexing**: Can treat one or more columns as the index of the returned DataFrame, and controls whether to get column names from the file, from the user, or not at all.
- **Type inference and data conversion**: Includes user-defined value conversions and custom lists of missing value markers.
- **Datetime parsing**: Includes combining date and time information spread over multiple columns into a single column in the result.
- **Iterating**: Support for iterating over chunks of very large files.
- **Unclean data issues**: Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Some of these functions, like pandas.read_csv, perform *type inference*, because the column data types are not part of the data format. That means you don’t necessarily have to specify which columns are numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data types stored in the format.
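
To make type inference concrete, here is a minimal sketch. The CSV text and its column names are hypothetical, and `StringIO` stands in for a file on disk. It prints the dtypes pandas infers, then overrides the inference for one column with the `dtype` argument.

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
csv_text = """id,price,in_stock,label
1,9.5,True,apple
2,12.0,False,pear
3,7.25,True,plum
"""

# Let pandas infer a dtype for every column.
inferred = pd.read_csv(StringIO(csv_text))
print(inferred.dtypes)

# Override inference for one column by passing an explicit dtype.
as_text = pd.read_csv(StringIO(csv_text), dtype={"id": str})
print(as_text.dtypes)
```

Passing `dtype` is useful when inference would lose information, for example identifiers with leading zeros.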

In [4]:
!type examples\ex1.csv

a, b, c, d
2, 5, 7, 8
1, 6, 8, 4
2, 5, 7, 1 


In [5]:
df = pd.read_csv("examples\\ex1.csv")

In [6]:
df

Unnamed: 0,a,b,c,d
0,2,5,7,8
1,1,6,8,4
2,2,5,7,1


We could also have used **read_table** and specified the delimiter:

In [7]:
pd.read_table("examples/ex1.csv", sep=",")

Unnamed: 0,a,b,c,d
0,2,5,7,8
1,1,6,8,4
2,2,5,7,1


A file will not always have a header row. 

In [9]:
!type examples\ex2.csv

2, 5, 7, 8, alo
1, 6, 8, 4, big
2, 5, 7, 1, nice 


In [12]:
pd.read_csv("examples\\ex2.csv", header=None)

Unnamed: 0,0,1,2,3,4
0,2,5,7,8,alo
1,1,6,8,4,big
2,2,5,7,1,nice


You can allow pandas to assign default column names, or you can specify names yourself:

In [15]:
pd.read_csv("examples/ex2.csv", names=["a", "b", "c", "d", "message"])

Unnamed: 0,a,b,c,d,message
0,2,5,7,8,alo
1,1,6,8,4,big
2,2,5,7,1,nice


Suppose you wanted the message column to be the index of the returned DataFrame.
You can either indicate you want the column at index 4 or named 'message' using the **index_col** argument:

In [16]:
names = ["a", "b", "c", "d", "message"]

In [17]:
pd.read_csv("examples\\ex2.csv", names=names, index_col="message")

message,a,b,c,d
alo,2,5,7,8
big,1,6,8,4
nice,2,5,7,1


In the event that you want to form a hierarchical index from multiple columns, pass a list of column numbers or names:

In [18]:
parsed = pd.read_csv("examples\\csv_mindex.csv", index_col=["key1", "key2"])

In [19]:
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. Consider a text file that looks like this:

In [21]:
list(open("examples\\ex3.txt"))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491']

While you could do some munging by hand, the fields here are separated by a variable amount of whitespace. In these cases, you can pass a regular expression as a delimiter for **read_table**. Here the delimiter can be expressed by the regular expression **\\s+**, so we have:

In [31]:
result = pd.read_table("examples\\ex3.txt", sep=r"\s+")

In [32]:
result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


Because there was one fewer column name than the number of fields in each data row, read_table infers that the first column should be the DataFrame’s index in this special case.
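
The same behavior can be reproduced without a file. This is a small sketch with made-up values, using `read_csv` with a regular-expression separator, which accepts the same `sep` argument as `read_table`. The header has three names while each data row has four fields, so the first field becomes the index.

```python
from io import StringIO

import pandas as pd

# Hypothetical whitespace-aligned text: 3 header names, 4 fields per data row.
text = """      A     B     C
aaa   1.0   2.0   3.0
bbb   4.0   5.0   6.0
"""

# A raw string keeps the backslash in the regular expression intact.
frame = pd.read_csv(StringIO(text), sep=r"\s+")
print(frame)
print(frame.index.tolist())
```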

![alt text](images/readcsvargs.png "Some read_csv/read_table function arguments")

You can skip the first, third, and fourth rows of a file with **skiprows**:

In [35]:
!type examples\ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [36]:
pd.read_csv("examples/ex4.csv", skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some **sentinel** value. By default, pandas uses a set of commonly occurring sentinels, such as **NA** and **NULL**:

In [37]:
!type examples\ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo


In [38]:
result = pd.read_csv("examples/ex5.csv")

In [39]:
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [40]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The **na_values** option can take either a list or set of strings to consider missing
values:

In [46]:
result = pd.read_csv("examples/ex5.csv", na_values=["NULL"])

In [47]:
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [48]:
result = pd.read_csv("examples/ex5.csv", na_values=["6"])

In [49]:
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2.0,3.0,4,
1,two,5,,,8,world
2,three,9,10.0,11.0,12,foo


Different NA sentinels can be specified for each column in a dict:

In [50]:
sentinels = {"message": ["foo", "NA"], "something": ["two"]}

In [52]:
df = pd.read_csv("examples/ex5.csv", na_values=sentinels)

In [53]:
df

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


In [56]:
df.columns[df.isna().any()].tolist()

['something', 'c', 'message']
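
The sentinel options can be compared side by side. The sketch below inlines the contents of ex5.csv shown earlier so it runs without the file. The `keep_default_na` argument is not used elsewhere in this notebook; it is included here on the assumption that you may want to switch off the built-in sentinel list entirely.

```python
from io import StringIO

import pandas as pd

# Same contents as examples/ex5.csv, inlined so the sketch is self-contained.
csv_text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""

# Default behaviour: empty fields and "NA" both become missing values.
default = pd.read_csv(StringIO(csv_text))

# Per-column sentinels, mirroring the dict used above.
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
custom = pd.read_csv(StringIO(csv_text), na_values=sentinels)

# keep_default_na=False disables the built-in sentinel list, so only the
# values listed in na_values are treated as missing.
strict = pd.read_csv(StringIO(csv_text), keep_default_na=False, na_values=["NA"])

print(default.isna().sum())
print(custom.isna().sum())
print(strict.isna().sum())
```

With `keep_default_na=False`, the empty field in column `c` is kept as an empty string rather than being converted to NaN.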

### Reading Text Files in Pieces

Before we look at a large file, we make the pandas display settings more compact:

In [57]:
pd.options.display.max_rows = 10

In [58]:
result = pd.read_csv("examples/ex6.csv")

In [59]:
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
4195,0.467976,-0.038649,-0.295344,-1.824726,L
4196,-0.358893,1.404453,0.704965,-0.200638,B
4197,-0.501840,0.659254,-0.421691,-0.057688,G
4198,0.204886,1.074134,1.388361,-0.982404,R


If you want to only read a small number of rows (avoiding reading the entire file), specify that with **nrows**:

In [60]:
df = pd.read_csv("examples/ex6.csv", nrows=5)

In [61]:
df.columns

Index(['one', 'two', 'three', 'four', 'key'], dtype='object')

To read a file in pieces, specify a **chunksize** as a number of rows:

In [62]:
chunker = pd.read_csv("examples/ex6.csv", chunksize=500)

In [63]:
chunker

<pandas.io.parsers.TextFileReader at 0x1a9b5a2ac48>

The **TextFileReader** object returned by read_csv allows you to iterate over the parts of the file according to the *chunksize*.

For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

In [70]:
chunker = pd.read_csv("examples/ex6.csv", chunksize=1000)

In [66]:
for piece in chunker:
    print(piece.shape)

(1000, 5)
(1000, 5)
(1000, 5)
(1000, 5)
(200, 5)


In [71]:
chunker

<pandas.io.parsers.TextFileReader at 0x1a9b5a66348>

In [72]:
tot = pd.Series([], dtype="int64")
for piece in chunker:
    tot = tot.add(piece["key"].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)

In [73]:
tot[:10]

 R    840.0
 Q    840.0
 L    840.0
 G    840.0
 B    840.0
dtype: float64

**TextFileReader** is also equipped with a *get_chunk* method that enables you to read pieces of an arbitrary size.
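
As a rough illustration of *get_chunk*, the sketch below reads differently sized pieces from one reader. The ten-row CSV is generated in memory and is purely hypothetical. `iterator=True` is used here so that no chunk size is fixed in advance.

```python
from io import StringIO

import pandas as pd

# Hypothetical 10-row CSV built in memory instead of examples/ex6.csv.
rows = "\n".join(f"{i},{i * i}" for i in range(10))
buffer = StringIO("n,square\n" + rows + "\n")

# iterator=True returns a TextFileReader without fixing a chunk size.
reader = pd.read_csv(buffer, iterator=True)

first = reader.get_chunk(3)   # the first three data rows
second = reader.get_chunk(5)  # the next five data rows

print(len(first), len(second))
print(second["n"].tolist())
```

Each call continues from where the previous one stopped, so the reader behaves like a cursor over the file.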

## Writing Data to Text Format

Data can also be exported to a delimited format. Let’s consider one of the CSV files read before:

In [74]:
data = pd.read_csv("examples/ex5.csv")

In [75]:
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [76]:
data.to_csv("examples/out.csv")

In [77]:
!type examples\out.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Other delimiters can be used, of course (writing to *sys.stdout* so it prints the text result to the console):

In [78]:
import sys

In [79]:
data.to_csv(sys.stdout, sep="|")

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


Missing values appear as empty strings in the output. You might want to denote them
by some other sentinel value:

In [80]:
data.to_csv(sys.stdout, na_rep="NULL")

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


With no other options specified, both the row and column labels are written. Both of
these can be disabled:

In [81]:
data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


You can also write only a subset of the columns, and in an order of your choosing:

In [82]:
data.to_csv(sys.stdout, index=False, columns=["a", "b", "c"])

a,b,c
1,2,3.0
5,6,
9,10,11.0


Series also has a **to_csv** method:

In [83]:
dates = pd.date_range("1/1/2000", periods=7)

In [84]:
ts = pd.Series(np.arange(7), index=dates)

In [85]:
ts.to_csv("examples/tseries.csv", header=False)


In [86]:
!type examples\tseries.csv

2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6


### Working with Delimited Formats

## JSON Data

In [87]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

To convert a JSON string to Python form, use **json.loads**:

In [88]:
import json

In [89]:
result = json.loads(obj)

In [90]:
type(result)

dict

In [91]:
result["siblings"]

[{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
 {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]

**json.dumps**, on the other hand, converts a Python object back to JSON:

In [92]:
asjson = json.dumps(result)

In [93]:
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

In [94]:
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])

In [95]:
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


**pandas.read_json** can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. Its default options assume that each object in the JSON array is a row in the table:

In [96]:
data = pd.read_json("examples/example.json")

In [97]:
data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


If you need to export data from pandas to JSON, one way is to use the **to_json** methods on Series and DataFrame:

In [98]:
print(data.to_json())

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}


In [99]:
print(data.to_json(orient="records"))

[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
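
A quick way to check that an export is readable again is a round trip. The sketch below rebuilds the small frame shown above, writes it with `orient="records"`, and reads the string back through `StringIO`. It is an illustration only, not a recommendation of one orientation over another.

```python
from io import StringIO

import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# orient="records" writes a JSON array with one object per row.
as_records = frame.to_json(orient="records")

# Wrapping the string in StringIO lets read_json treat it like a file.
restored = pd.read_json(StringIO(as_records), orient="records")

print(as_records)
print(restored)
```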


### XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the ubiquitous HTML and
XML formats. Examples include **lxml, Beautiful Soup, and html5lib**. While lxml is
comparatively much faster in general, the other libraries can better handle malformed
HTML or XML files.

pandas has a built-in function, **read_html**, which uses libraries like lxml and Beautiful Soup to automatically parse tables out of HTML files as DataFrame objects.

In [100]:
pip install lxml beautifulsoup4 html5lib

Note: you may need to restart the kernel to use updated packages.


The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within `<table>` tags. The result is a list of DataFrame objects:

In [104]:
tables = pd.read_html("examples/tables.html")

In [105]:
tables

[        0   1   2   3
 0    amir   2   3   6
 1  masoud  12  53  66
 2   saeed  26  32  61]

In [106]:
len(tables)

1

In [107]:
failures = tables[0]

In [108]:
failures

Unnamed: 0,0,1,2,3
0,amir,2,3,6
1,masoud,12,53,66
2,saeed,26,32,61


In [109]:
failures.sum()

0    amirmasoudsaeed
1                 40
2                 88
3                133
dtype: object

### Parsing XML with lxml.objectify

In [110]:
from lxml import objectify

In [111]:
path = "examples\\lstxml.xml"
with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

In [112]:
parsed

<lxml.etree._ElementTree at 0x1a9b5a47188>

In [113]:
root

<Element ROOT at 0x1a9b5b15448>

In [114]:
root.INDICATOR

<Element INDICATOR at 0x1a9b5fddc08>

In [115]:
data = []
skip_fields = ["PARENT_SEQ", "INDICATOR_SEQ", "DESIRED_CHANGE", "DECIMAL_PLACES"]

In [116]:
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

In [117]:
data

[{'AGENCY_NAME': 'Metro-North Railroad',
  'INDICATOR_NAME': 'Escalator Availability',
  'DESCRIPTION': 'Percent of the time that escalators are operational\nsystemwide. The availability rate is based on physical observations performed\nthe morning of regular business days only. This is a new indicator the agency\nbegan reporting in 2009.',
  'PERIOD_YEAR': 2011,
  'PERIOD_MONTH': 12,
  'CATEGORY': 'Service Indicators',
  'FREQUENCY': 'M',
  'INDICATOR_UNIT': '%',
  'YTD_TARGET': 97.0,
  'YTD_ACTUAL': '',
  'MONTHLY_TARGET': 97.0,
  'MONTHLY_ACTUAL': ''}]

In [118]:
pd.DataFrame(data)

Unnamed: 0,AGENCY_NAME,INDICATOR_NAME,DESCRIPTION,PERIOD_YEAR,PERIOD_MONTH,CATEGORY,FREQUENCY,INDICATOR_UNIT,YTD_TARGET,YTD_ACTUAL,MONTHLY_TARGET,MONTHLY_ACTUAL
0,Metro-North Railroad,Escalator Availability,Percent of the time that escalators are operat...,2011,12,Service Indicators,M,%,97.0,,97.0,


XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:

In [119]:
from io import StringIO

tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()

In [120]:
root

<Element a at 0x1a9b5f6b1c8>

In [121]:
root.get("href")

'http://www.google.com'

In [122]:
root.text

'Google'
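
Tag attributes and text can be collected into records in the same spirit. The sketch below deliberately swaps lxml for the standard library's `xml.etree.ElementTree`, so it runs without extra packages. The `<links>` fragment is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with two links; any well-formed XML works the same way.
fragment = """
<links>
  <a href="http://www.google.com" rel="search">Google</a>
  <a href="https://pandas.pydata.org">pandas</a>
</links>
"""

root = ET.fromstring(fragment)

records = []
for link in root.iter("a"):
    # .attrib is a dict of the tag's attributes; .text is its content.
    records.append({"text": link.text, **link.attrib})

print(records)
```

A list of dicts like `records` can be passed straight to the DataFrame constructor, as was done with the INDICATOR elements above.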

## Binary Data Formats

One of the easiest ways to store data (also known as serialization) efficiently in binary
format is using Python’s built-in **pickle** serialization. pandas objects all have a **to_pickle** method that writes the data to disk in pickle format:

In [125]:
frame = pd.read_csv("examples/ex2.csv", header=None)

In [126]:
frame

Unnamed: 0,0,1,2,3,4
0,2,5,7,8,alo
1,1,6,8,4,big
2,2,5,7,1,nice


In [128]:
frame.to_pickle("examples/frame_pickle")

You can read any “pickled” object stored in a file by using the built-in **pickle** directly, or even more conveniently using **pandas.read_pickle**:

In [129]:
pd.read_pickle("examples/frame_pickle")

Unnamed: 0,0,1,2,3,4
0,2,5,7,8,alo
1,1,6,8,4,big
2,2,5,7,1,nice


**pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.**
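
Both ways of reading a pickled object can be sketched as follows. The frame contents are made up, and the temporary directory is an assumption made here so that the example leaves no files behind.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "label": ["x", "y", "z"]})

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)

    # Option 1: the standard-library pickle module.
    with open(path, "rb") as handle:
        via_pickle = pickle.load(handle)

    # Option 2: the pandas convenience function.
    via_pandas = pd.read_pickle(path)

print(via_pickle.equals(frame), via_pandas.equals(frame))
```

Unpickling executes code embedded in the file, so only load pickles that come from a source you trust.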

pandas has built-in support for the _HDF5_ binary format. Its _MessagePack_ support was deprecated in pandas 0.25 and removed in 1.0. Some other storage formats for pandas or NumPy data include _bcolz_ and _Feather_.

### Using HDF5 Format

HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. 

The “HDF” in HDF5 stands for hierarchical data format. 

Each HDF5 file can store multiple datasets and supporting metadata. 

Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compression modes, enabling data with repeated patterns to be stored more efficiently. 

HDF5 can be a good choice for working with very large datasets that don’t fit into memory, as you can efficiently read and write small sections of much larger arrays.

While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dict and handles the low-level details:

In [152]:
frame = pd.DataFrame({"a": np.random.randn(100)})

In [153]:
frame

Unnamed: 0,a
0,-1.456977
1,-0.174095
2,-1.458757
3,1.295580
4,-0.564461
...,...
95,-1.437044
96,-1.143115
97,0.213776
98,-0.244823


In [154]:
store = pd.HDFStore("mydata.h5")

In [155]:
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

In [156]:
store["obj1"] = frame

In [157]:
store["obj1_col"] = frame["a"]

In [158]:
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

Objects contained in the HDF5 file can then be retrieved with the same dict-like API:

In [159]:
store["obj1"]

Unnamed: 0,a
0,-1.456977
1,-0.174095
2,-1.458757
3,1.295580
4,-0.564461
...,...
95,-1.437044
96,-1.143115
97,0.213776
98,-0.244823


In [160]:
store["obj1_col"]

0    -1.456977
1    -0.174095
2    -1.458757
3     1.295580
4    -0.564461
        ...   
95   -1.437044
96   -1.143115
97    0.213776
98   -0.244823
99    1.352503
Name: a, Length: 100, dtype: float64

HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally slower, but it supports query operations using a special syntax:

In [161]:
store.put("obj2", frame, format="table")

In [162]:
store.select("obj2", where=["index >= 10 and index <= 15"])

Unnamed: 0,a
10,0.932567
11,-0.121031
12,-0.104757
13,0.335039
14,0.614553
15,-0.33913


In [163]:
store.close()

The **put** method is an explicit version of the store['obj2'] = frame assignment, but it allows us to set other options like the storage format.

The pandas.read_hdf function gives you a shortcut to these tools:

In [164]:
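frame.to_hdf("mydata.h5", key="obj3", format="table")

In [165]:
pd.read_hdf("mydata.h5", key="obj3", where=["index < 5"])

ValueError: The file 'mydata.h5' is already opened, but not in read-only mode (as requested).

This error means another HDFStore handle on mydata.h5 was still open when the cell ran. Close every open store with store.close() before calling pandas.read_hdf.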

If you work with large quantities of data locally, explore PyTables and h5py to see how they can suit your needs. Since many data analysis problems are I/O-bound (rather than CPU-bound), using a tool like HDF5 can massively accelerate your applications.

HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple
writers do so simultaneously, the file can become corrupted.

### Reading Microsoft Excel Files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the **ExcelFile** class or **pandas.read_excel** function. 
Internally these tools use the add-on packages **xlrd** and **openpyxl** to read XLS and XLSX files, respectively.
To use **ExcelFile**, create an instance by passing a path to an xls or xlsx file:

In [166]:
xlsx = pd.ExcelFile("examples/ex1.xlsx")

In [167]:
xlsx

<pandas.io.excel._base.ExcelFile at 0x1a9b635bb48>

Data stored in a sheet can then be read into a DataFrame with **pandas.read_excel** (or, equivalently, the ExcelFile's **parse** method):

In [168]:
pd.read_excel(xlsx, "Sheet1")

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can also simply pass the filename to **pandas.read_excel**:

In [169]:
frame = pd.read_excel("examples/ex1.xlsx", "Sheet1")

In [170]:
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using pandas objects’ **to_excel** method:

In [171]:
writer = pd.ExcelWriter("examples/ex2.xlsx")

In [172]:
frame.to_excel(writer, sheet_name="Sheet1")

In [173]:
writer.close()

You can also pass a file path to **to_excel** and avoid the ExcelWriter:

In [174]:
frame.to_excel("examples/ex3.xlsx")

## Interacting with Web APIs

Many websites expose public APIs that provide data feeds as JSON or another format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the **requests** package.

In [175]:
import requests

url = "http://localhost:8888/notebooks/Python%20Tutorial/Data%20Loading%2C%20Storage%2C%20and%20File%20Formats/Data%20Loading%2C%20Storage%2C%20and%20File%20Formats.ipynb#"
resp = requests.get(url)

In [176]:
resp

<Response [200]>

In [177]:
# data = resp.json()
# data[0]['title']

In [178]:
# issues = pd.DataFrame(data, columns=['number', 'title','labels', 'state'])
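
The commented-out cells above assume the response body is a JSON list of records. The sketch below shows that last step offline, using a hypothetical payload in place of a real `resp.json()` result. The field names and values are invented for illustration.

```python
import json

import pandas as pd

# Hypothetical response body; a real call would obtain it with resp.json().
payload = json.loads("""
[
  {"number": 101, "title": "Fix parser bug", "labels": [], "state": "open", "extra": 1},
  {"number": 102, "title": "Update docs", "labels": ["docs"], "state": "closed", "extra": 2}
]
""")

# columns= selects and orders the fields of interest, dropping the rest.
issues = pd.DataFrame(payload, columns=["number", "title", "labels", "state"])
print(issues)
```

With a live endpoint, calling `resp.raise_for_status()` before `resp.json()` is a common way to fail early on HTTP errors.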

## Interacting with Databases

In [179]:
import sqlite3

query = """
CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER);
"""

In [180]:
con = sqlite3.connect("mydata.sqlite")

In [181]:
con.execute(query)
con.commit()

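OperationalError: table test already exists

This error appears when the cell is rerun against an existing mydata.sqlite file. The table and the rows from the earlier run are still stored there, which is why the query results further down show each row twice.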

Then, insert a few rows of data:

In [182]:
data = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]

In [183]:
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [184]:
con.executemany(stmt, data)

<sqlite3.Cursor at 0x1a9a2961490>

In [185]:
con.commit()

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:

In [186]:
cursor = con.execute("select * from test")
rows = cursor.fetchall()

You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor’s description attribute:

In [187]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [188]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
3,Atlanta,Georgia,1.25,6
4,Tallahassee,Florida,2.6,3
5,Sacramento,California,1.7,5


The SQLAlchemy project is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that enables you to read data easily from a general SQLAlchemy connection. Here, we’ll connect to the same SQLite database with SQLAlchemy and read data from the table created before:

In [189]:
import sqlalchemy as sqla

In [190]:
db = sqla.create_engine("sqlite:///mydata.sqlite")

In [191]:
pd.read_sql("select * from test", db)

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
3,Atlanta,Georgia,1.25,6
4,Tallahassee,Florida,2.6,3
5,Sacramento,California,1.7,5
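
For experiments that should start from a clean slate on every run, one option is an in-memory SQLite database. The sketch below reuses the table definition and rows from this section. The `":memory:"` database and the parameterized filter are choices made here for illustration, and pandas.read_sql is given the sqlite3 connection directly rather than a SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

# ":memory:" creates a throwaway database, so reruns start from a clean slate.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")

rows = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows)
con.commit()

# Parameters are passed separately from the SQL text rather than formatted in.
frame = pd.read_sql(
    "SELECT * FROM test WHERE d >= ? ORDER BY d DESC", con, params=(5,)
)
con.close()

print(frame)
```

Because the database lives only in memory, rerunning the cell never hits the "table already exists" error and never duplicates rows.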
