<DIV ALIGN=CENTER>

# Introduction to Cassandra
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

In the previous course, we discussed relational databases, SQL, and
using Python to work with relational databases. With the rapid growth
in large data sets, however, there has been an explosion in new database
technologies. In this IPython Notebook, we explore [MongoDB][mdb], one
of the more popular new database technologies.  [MongoDB][mdbw] is a
NoSQL document-oriented database, which means it is _not only SQL_ and
stores data as documents. The data are stored using dynamic schemas that
employ the _BSON_ format, a binary, JSON-like format. For more information,
the [MongoDB documentation website][mdbd] provides a wealth of useful
information.
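
To make the idea of a JSON-like document concrete, the following minimal
sketch builds a document as a plain Python dictionary and round-trips it
through JSON text with the standard `json` module. The field names and
values are hypothetical, chosen only for illustration, and real BSON is
a binary encoding that also supports types (such as dates) that plain
JSON lacks.

```python
import json

# A hypothetical document: field names map to values, which may be
# nested lists or other documents (illustrative data only).
document = {
    'name': 'example',
    'tags': ['nosql', 'document'],
    'details': {'version': 1, 'active': True},
}

# Serialize to JSON text; BSON stores a similar structure in binary form.
text = json.dumps(document, sort_keys=True)
print(text)

# Parsing the text recovers an equal dictionary.
restored = json.loads(text)
print(restored == document)
```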

-----
[mdb]: https://www.mongodb.org
[mdbw]: https://en.wikipedia.org/wiki/MongoDB
[mdbd]: https://docs.mongodb.org/manual/

## Python with MongoDB

To use Python to interact with MongoDB, we need to use a suitable Python
library. The recommended Python library is [_pymongo_][pymdb], which
provides support for establishing a connection between a Python program
and a MongoDB server as well as support tools for working with MongoDB. 

We have already installed _pymongo_ in the course Docker container;
however, you can easily install it by using `pip`. For example, to
install _pymongo_ for use with Python 3 for the current user, we can
execute:

```console
pip3 install pymongo --user
```

Once this library is installed, we can import the MongoDB client to
establish a connection and retrieve data and MongoDB information.

```python
from pymongo import MongoClient
```

-----

[pymdb]: http://api.mongodb.org/python/current/

In [1]:
!pip install cassandra-driver --user

Collecting cassandra-driver
  Using cached cassandra-driver-3.1.1.tar.gz
Building wheels for collected packages: cassandra-driver
  Running setup.py bdist_wheel for cassandra-driver ... done
  Stored in directory: /home/data_scientist/.cache/pip/wheels/8c/5f/2a/b3ea09402c02a0c9282f27e11a0ae7ebca7ab69048bd0c3448
Successfully built cassandra-driver
Installing collected packages: cassandra-driver
Successfully installed cassandra-driver
You are using pip version 8.0.3, however version 8.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [1]:
from cassandra.cluster import Cluster
from cassandra.policies import WhiteListRoundRobinPolicy

# Restrict the driver to the single listed node
lbp = WhiteListRoundRobinPolicy(['40.124.12.119'])
cluster = Cluster(contact_points=['40.124.12.119'], load_balancing_policy=lbp)

# Open a session and select the info490 keyspace
session = cluster.connect()
session.set_keyspace('info490')
session.execute('USE info490')

# Fetch every row from the test table
rows = session.execute('SELECT * FROM test')

# Print the name column of each row
for row in rows:
    print(row.name)

anchal


## Local MongoDB Server

To use a local MongoDB server, for instance, a MongoDB server running
inside our course Docker container, we need to first start the server.
To do this, open a terminal window inside the Docker container, most
easily done using the _New_ menu on the JupyterHub Server homepage,
followed by _Terminal_.

![New Terminal](images/new-term.png)

Inside this new terminal window, start up the MongoDB server by issuing
the following command:

```console
mongod --nojournal
```

This will start the mongo database daemon with no journaling (since we
are not worried about crash safety). This will produce a list of
messages such as the following in your terminal window.

![New MongoDB local server](images/new-mongod.png)

At this point the local server is ready to start accepting connections.
To open a connection to the localhost using pymongo, we establish a new
MongoDB client:


```python
client = MongoClient()
```

which assumes a local server on the default port. Alternatively, we can
explicitly list the hostname and port. This form is preferred because it
makes the server and port number easy to recognize and easy to change
when we move to a remote MongoDB server.

```python
client = MongoClient("mongodb://localhost:27017")
```

which connects to the local MongoDB daemon using the default local host
name and port.

-----

## Remote MongoDB Server

To connect to a remote MongoDB server, for instance by using the course
cluster system, we simply need the IP address for the server and the
port number on which the MongoDB daemon is listening. For this course,
Notebooks running on the course JupyterHub Server can access a MongoDB
server on `10.0.3.126` and the default port number of `27017`:


```python
client = MongoClient("mongodb://10.0.3.126:27017")
```
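
The local and remote connection strings follow the same
`mongodb://host:port` pattern. As a minimal sketch, the helper below
assembles such a string from a host and a port; `mongo_uri` is a
hypothetical name introduced here for illustration and is not part of
_pymongo_.

```python
def mongo_uri(host='localhost', port=27017):
    """Build a MongoDB connection string (hypothetical helper)."""
    return 'mongodb://{0}:{1}'.format(host, port)

# Defaults give the local server; pass a host for the remote server.
local_uri = mongo_uri()
remote_uri = mongo_uri('10.0.3.126')

print(local_uri)
print(remote_uri)
```

Either string could then be passed to `MongoClient`, so switching
servers only requires changing the arguments to the helper.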

-----

In [None]:
# Establish a connection to MongoDB (uncomment only one of these lines)

# For remote course server use
#client = MongoClient("mongodb://10.0.3.126:27017")

# For local Docker server use
client = MongoClient("mongodb://localhost:27017")

-----
## MongoDB Database

MongoDB provides storage for collections of documents. To manage a set
of related collections, MongoDB uses the concept of a database. Thus a
MongoDB database is similar to a standard relational database, which
contains a collection of tables.

In the next few sections, we explore the _pymongo_ library in a similar
manner as the official [_pymongo_ tutorial][pymt]. In addition, in this
Notebook we use dictionary style access to acquire a database,
collection, or document. There is also an attribute style method to
access these items, but dictionary style is preferred since it reinforces
the concept that MongoDB is a document style database and that Python
dictionaries are used to define document schemas. In addition, the
dictionary style enables names to be used that might not be legal Python
names, such as `test-database`. 
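
As a quick illustration of why some names rule out attribute style
access, the sketch below checks a few candidate names with the built-in
`str.isidentifier` method. The list of names is an assumption made for
this example; only `test-database` appears in this Notebook.

```python
# Candidate database names: only some are legal Python identifiers.
names = ['test-database', 'test_database', 'students']

for name in names:
    # A name that is not an identifier cannot follow the dot in
    # attribute style access; dictionary style access accepts any string.
    print(name, name.isidentifier())
```

A name such as `test-database` fails this check because Python would
parse the hyphen as a subtraction.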

-----
[pymt]: http://api.mongodb.org/python/current/tutorial.html

In [None]:
# We will delete our working database if it exists before recreating it.

dbname = 'test-database'
if dbname in client.database_names():
    client.drop_database(dbname)
    
print('Existing databases:', client.database_names())

In [None]:
db = client['test-database']
print('Existing databases:', client.database_names())

----

MongoDB utilizes _lazy evaluation_ when creating databases or
collections, which simply means these objects are not created until
they are actually needed. This was shown in the previous code cell, where
we acquired a new `test-database` but the new database did not show up in
the list of active MongoDB databases. This database will not even be
created when we add a collection; instead it will be created when we
first add data to a collection, which is demonstrated in the next few
code cells.

We now create a new collection, named `test_collection`, into which we
can insert new data.

-----

In [None]:
collection = db['test_collection']

print('Existing databases:', client.database_names())
print('Existing collections:', db.collection_names())

-----

## Adding Data

Given a collection, we can easily add new _documents_ to our MongoDB
collection by employing a Python dictionary to map the document schema
to the document data. In the following code cell, we first create a
`student` document, followed by a `students` collection to hold
`student` documents, and we insert the first student by using the
`insert_one` method on the `students` collection. We retrieve this new
student's id, which we display as a validation of this process. After
this process, we display the newly created database and collection.

-----

In [None]:
student = {'fname': 'Jane',
           'lname': 'Doe',
           'company': 'bdg surf shop'}

students = db['students']

jane_id = students.insert_one(student).inserted_id
print("New Student ID: ", jane_id)

In [None]:
print('Existing databases:', client.database_names())
print('Existing collections:', db.collection_names())

-----

Unlike relational database tables, a MongoDB collection can store
documents that have different schemas. We demonstrate this in the next
two code cells, where we create two new students that each have a
different schema from the original student. After inserting these new
students, we count the number of documents in the `students` collection.

-----

In [None]:
student = {'fname': 'John',
           'lname': 'Doe',
           'company': 'bdg surf shop',
           'lucky_numbers': [2, 5, 9, 13, 27]}

john_id = students.insert_one(student).inserted_id
print("New Student ID: ", john_id)

In [None]:
import datetime

student = {'fname': 'Pat',
           'lname': 'Doe',
           'company': 'bdg surf shop',
           'hire_date': datetime.datetime.utcnow()}

pat_id = students.insert_one(student).inserted_id
print("New Student ID: ", pat_id)

In [None]:
print("Number of students = ", students.count())

-----
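
Because documents in one collection need not share a schema, it can be
useful to see which fields are common to all documents and which are
optional. The following sketch does this in plain Python for stand-ins
of the three student documents; the generated `_id` values are omitted
and the `hire_date` string is a placeholder, so treat the data as an
assumption rather than what the server returns.

```python
# Stand-ins for the three student documents (placeholder hire_date,
# no server-generated _id field).
docs = [
    {'fname': 'Jane', 'lname': 'Doe', 'company': 'bdg surf shop'},
    {'fname': 'John', 'lname': 'Doe', 'company': 'bdg surf shop',
     'lucky_numbers': [2, 5, 9, 13, 27]},
    {'fname': 'Pat', 'lname': 'Doe', 'company': 'bdg surf shop',
     'hire_date': 'placeholder'},
]

# Fields present in every document versus fields seen in any document.
common = set.intersection(*(set(d) for d in docs))
every = set.union(*(set(d) for d in docs))

print('Common fields:', sorted(common))
print('Optional fields:', sorted(every - common))
```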

### Retrieving Data

MongoDB provides `find_one` and `find` methods that can be used to find
one or more documents in a collection. The first method, `find_one`,
simply returns one document (by default the first document in the
collection) unless an argument is supplied that specifically selects
documents. For example, the second code cell is used to find one
document with a specific id value. More generally, the `find` method can
be used to iterate over all (or given a suitable argument, a limited set
of) documents in the collection, as demonstrated in the third code cell.

-----

In [None]:
students.find_one()

In [None]:
students.find_one({"_id": pat_id})

In [None]:
for student in students.find():
    print(student)

-----

We can also insert multiple documents at once by collecting the new
documents in a Python `list` and using the `insert_many` method to
perform a bulk insert.

-----

In [None]:
new_students = [
    {'fname': 'Mike',
     'lname': 'Simone',
     'company': 'Del Ray Enterprises',
    'products': [{'id': 1, 'name': 'eyeware'}, {'id': 2, 'name': 'hat'},]},
    {'fname': 'Clair',
     'lname': 'Hwu',
     'company': 'Hoboken Surfware Incorporated',
     'comment': 'Great supplier, fast, fair, and courteous.'}]

result = students.insert_many(new_students)

print(result.inserted_ids)

In [None]:
print("Number of students = ", students.count())

-----

As previously mentioned, we can also use the `find` method to quickly
identify specific documents in a collection, over which we can iterate
to perform additional operations. In the following code cells, we first
search for documents with the _last name_ attribute equal to `Hwu`,
after which we apply the `count` method to the set of documents
returned by searching for _last name_ equal to `Doe`.

-----

In [None]:
for student in students.find({"lname": "Hwu"}):
    print(student)

In [None]:
print("Number of students = ", students.find({"lname": "Doe"}).count())

-----

Given a document, we can also extract specific values by employing
dictionary style access, which should make sense since the document is
accessed in Python as a dictionary object. In the following example, we
extract the first and last names for all documents. This requires that
all documents contain these fields; if one does not, a `KeyError` is
raised. Handling these conditions is beyond the scope of this
Notebook.

-----

In [None]:
for student in students.find():
    print(student['fname'], student['lname'])

----
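
Although a full treatment of missing fields is out of scope here, one
minimal way to tolerate them is the dictionary `get` method, which
returns a fallback value instead of raising a `KeyError`. The documents
and the `'unknown'` fallback below are hypothetical, chosen only to show
the pattern.

```python
# Hypothetical documents; the second lacks an 'lname' field.
docs = [
    {'fname': 'Jane', 'lname': 'Doe'},
    {'fname': 'Clair'},
]

names = []
for doc in docs:
    # dict.get returns the fallback rather than raising KeyError.
    names.append((doc.get('fname', 'unknown'), doc.get('lname', 'unknown')))

print(names)
```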

## Querying

MongoDB also supports a [rich query][mdbq] syntax, but it likely will
seem odd to anyone familiar with SQL. The full set includes comparison,
logical, element tests, evaluation methods, geospatial, array, and
projection operations. These operators begin with a `$` character, and
the rest of the name identifies the specific operator. For example,
`$gte` is _greater than or equal to_. 

The format for the query is to encode the target field as the key of a
dictionary, and the operator and any associated values as a second
dictionary that serves as the value for that key. For example, to test
if the field `age` is less than 20, we write the following query:
`{age: {$lt: 20}}`.
This is demonstrated in the following code cell where we identify the
documents with last name equal to `Doe`, after which we sort the
documents by first name. When using pymongo, we enclose the attributes
and operators in quotes to ensure they are passed correctly to the
MongoDB server.

-----

[mdbq]: http://www.mongodb.org/display/DOCS/Advanced+Queries

In [None]:
for student in students.find({"lname": {'$eq': 'Doe'}}).sort('fname'):
    print(student)

-----
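
To make the shape of these query dictionaries more concrete, the sketch
below evaluates a small subset of comparison operators against plain
Python dictionaries. This is a toy model for illustration only: the
`matches` function and `OPERATORS` table are hypothetical names, the
sample documents are invented, and the code does not reflect how the
MongoDB server actually evaluates queries.

```python
import operator

# Toy subset of MongoDB comparison operators, mapped to Python functions.
OPERATORS = {
    '$eq': operator.eq,
    '$lt': operator.lt,
    '$lte': operator.le,
    '$gt': operator.gt,
    '$gte': operator.ge,
}

def matches(doc, query):
    """Return True if doc satisfies every field condition in query.

    Hypothetical helper for illustration; not MongoDB's implementation.
    """
    for field, condition in query.items():
        if field not in doc:
            return False
        # In this toy model, a bare value is shorthand for {'$eq': value}.
        if not isinstance(condition, dict):
            condition = {'$eq': condition}
        for op, target in condition.items():
            if not OPERATORS[op](doc[field], target):
                return False
    return True

# Invented sample documents with an 'age' field.
docs = [{'name': 'a', 'age': 18}, {'name': 'b', 'age': 25}]
young = [d['name'] for d in docs if matches(d, {'age': {'$lt': 20}})]
print(young)
```

The outer dictionary names the field and the inner dictionary names the
operator and its comparison value, which mirrors the structure passed
to `find`.
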
## Breakout Session

During this breakout, you should work with the previous MongoDB examples
in order to better learn how MongoDB works and how it differs from
pure relational databases. Specific additional problems you can attempt
include the following:

1. Write a compound query that searches for students with last name `Doe`
and first name `Jane`.

2. Write a compound query that searches for students with last name `Doe`
and a first name that begins with the letter 'j' (you might use the
MongoDB regular expression query operator for this).

3. Make a student with a totally different schema and insert this new
student into the `students` collection.

Additional, more advanced problems:

1. Read in Airline data (100k rows) and store in a new collection.

2. Repeat the previous exercise, but drop any column on insert that holds
an `NA` value.

-----

-----
### Additional References


1. The [MongoDB Manual][mdbm]
2. The Python edition of [Getting Started with MongoDB][pymdb]
3. Python MongoDB [Library Reference][pyml]

-----

[mdbm]: https://docs.mongodb.org/manual/
[pymdb]: https://docs.mongodb.org/getting-started/python/
[pyml]: http://api.mongodb.org/python/current/installation.html

### Return to the [Week Three](index.ipynb) index.

-----