# Interacting with Databases

In many applications, data rarely comes from text files, since they are a fairly inefficient
way to store large amounts of data. SQL-based relational databases (such as SQL Server,
PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL (so-called
NoSQL) databases have become quite popular. The choice of database usually depends
on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has
some functions to simplify the process. As an example, I’ll create an in-memory SQLite
database with Python’s built-in sqlite3 driver:

In [1]:
from pandas import DataFrame, Series
import pandas as pd
import sys
import numpy as np
import json

In [2]:
import sqlite3

In [4]:
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""

In [5]:
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()

In [6]:
# Then, insert a few rows of data:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]

stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)

con.commit()

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list
of tuples when selecting data from a table:

In [7]:
cursor = con.execute('Select * from test')

In [8]:
rows = cursor.fetchall()

In [9]:
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]
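
Plain tuples carry no column names. As a side note specific to the sqlite3 module (other drivers have their own mechanisms, which are not assumed here), setting the connection's `row_factory` to `sqlite3.Row` lets each row be indexed by column name. The sketch below uses a separate in-memory connection, named `demo_con` purely for illustration, so that it does not interfere with `con` above:

```python
import sqlite3

# Separate in-memory database; demo_con is an illustrative name
demo_con = sqlite3.connect(':memory:')
# sqlite3.Row makes each fetched row addressable by column name as well as position
demo_con.row_factory = sqlite3.Row

demo_con.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
demo_con.execute("INSERT INTO test VALUES (?, ?, ?, ?)",
                 ('Atlanta', 'Georgia', 1.25, 6))

row = demo_con.execute("SELECT * FROM test").fetchone()
city_and_count = (row['a'], row['d'])  # access by column name
column_names = row.keys()              # column names, in table order
```

Positional access (`row[0]`) still works on such rows, so code written for plain tuples does not need to change.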

You can pass the list of tuples to the DataFrame constructor, but you also need the
column names, contained in the cursor’s description attribute:

In [10]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [15]:
DataFrame(rows, columns=list(zip(*cursor.description))[0])

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | Atlanta | Georgia | 1.25 | 6 |
| 1 | Tallahassee | Florida | 2.6 | 3 |
| 2 | Sacramento | California | 1.7 | 5 |
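
The two steps above (fetch the rows, then pull the column names out of the cursor's description) can be wrapped in a small function. This is only a minimal sketch: `query_to_frame` and `demo_con` are hypothetical names introduced here, and the shortcut `con.execute` is a sqlite3 convenience, so with other drivers you would likely need to create a cursor explicitly first.

```python
import sqlite3

import pandas as pd


def query_to_frame(con, query, params=()):
    """Run a query on a sqlite3 connection and return the result as a DataFrame."""
    cursor = con.execute(query, params)
    # The first field of each description entry is the column name
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)


# Rebuild the example table in a fresh in-memory database
demo_con = sqlite3.connect(':memory:')
demo_con.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
demo_con.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [('Atlanta', 'Georgia', 1.25, 6),
     ('Tallahassee', 'Florida', 2.6, 3),
     ('Sacramento', 'California', 1.7, 5)])
demo_con.commit()

frame = query_to_frame(demo_con, "SELECT * FROM test")
```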


This is quite a bit of munging that you’d rather not repeat each time you query the
database. pandas has a read_sql_query function, available from its pandas.io.sql module
(and as pd.read_sql_query), that simplifies the process. Older pandas releases called it
read_frame, which has since been removed. Just pass the select statement and the
connection object:

In [16]:
import pandas.io.sql as sql

In [19]:
sql.read_sql_query('select * from test', con)

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | Atlanta | Georgia | 1.25 | 6 |
| 1 | Tallahassee | Florida | 2.6 | 3 |
| 2 | Sacramento | California | 1.7 | 5 |
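
The same function also accepts query parameters, and a DataFrame can be written back to the database with its `to_sql` method. The sketch below is a round trip under a few assumptions: it uses a fresh in-memory sqlite3 connection (named `demo_con` for illustration), and the `?` placeholder is sqlite3's parameter style, which differs between drivers.

```python
import sqlite3

import pandas as pd

demo_con = sqlite3.connect(':memory:')

frame = pd.DataFrame({
    'a': ['Atlanta', 'Tallahassee', 'Sacramento'],
    'b': ['Georgia', 'Florida', 'California'],
    'c': [1.25, 2.6, 1.7],
    'd': [6, 3, 5],
})

# to_sql creates the table and inserts the rows;
# index=False keeps the row labels out of the table
frame.to_sql('test', demo_con, index=False)

# Values are passed separately from the SQL text instead of being
# formatted into the string
result = pd.read_sql_query(
    'SELECT a, d FROM test WHERE d > ? ORDER BY a',
    demo_con,
    params=(4,),
)
```

Passing values through `params` rather than building the SQL string by hand leaves the quoting to the driver.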


## Storing and Loading Data in MongoDB

NoSQL databases take many different forms. Some are simple dict-like key-value stores
like BerkeleyDB or Tokyo Cabinet, while others are document-based, with a dict-like
object being the basic unit of storage. I’ve chosen MongoDB (http://mongodb.org) for
my example. I started a MongoDB instance locally on my machine and connected to it
on the default port using pymongo, the official Python driver for MongoDB:

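In [21]:
# pymongo is a third-party package and must be installed first (for example, pip install pymongo)
import pymongo
# pymongo.Connection was removed in pymongo 3.0; MongoClient is its replacement
con = pymongo.MongoClient('localhost', port=27017)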
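
Because documents are dict-like, a list of them maps naturally onto a DataFrame. The sketch below does not talk to MongoDB at all: the `documents` list is made-up sample data standing in for what a query such as pymongo's `collection.find()` would yield, so it runs without a server or the driver installed.

```python
import pandas as pd

# Made-up sample documents; in a real session these would come from a query
documents = [
    {'_id': 1, 'city': 'Atlanta', 'state': 'Georgia', 'tags': ['south']},
    {'_id': 2, 'city': 'Tallahassee', 'state': 'Florida'},
]

# Each key becomes a column; keys missing from a document become NaN
frame = pd.DataFrame(documents)

# Passing columns= selects (and orders) only the fields of interest
subset = pd.DataFrame(documents, columns=['city', 'state'])
```

Documents do not have to share the same keys, which is why the second row ends up with a missing value in the `tags` column.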