# Getting Started with MongoDB for Spatial Analysis

## Launch MongoDB locally using Docker

### Create a persistent data volume

To get started, create a persistent data volume that Docker can use to store your database's files.  A Docker volume persists any data loaded into the database, even if you stop, restart, or re-create the Docker container housing your database.  Run the following from the command line to create a Docker volume for your MongoDB database:

```bash
docker volume create mongodb_volume
```

To check that the volume was created successfully, run the following from your command line:

```bash
docker volume ls
```

You should see a volume called `mongodb_volume` listed in the results.

### Launch the Docker container

Now you're ready to launch MongoDB using Docker!  Run the following from your command line:

```bash
docker run -d \
--name mongodb \
-p 27017:27017 \
--mount source=mongodb_volume,target=/data/db \
mongo:4.0
```

What does this command do?  Breaking it down, here's what each argument means:

- **-d** : runs the container in "detached" mode, so that it keeps running in the background and will not shut off if you close out of your console window
- **--name** : the name Docker will give to your container; if you don't specify a name here, Docker will give your container a randomly-generated name
- **-p** : a port mapping, indicating that Docker should forward traffic on port 27017 of your local machine to port 27017 inside the MongoDB container
- **--mount** : takes the persistent volume named `mongodb_volume` and mounts it at `/data/db` inside the container; this is where MongoDB stores the database's data files
- **-e** : sets environment variables inside the container; the command above does not pass any, because MongoDB needs none for this basic setup
- At the very end of the command, you'll notice we list **mongo:4.0** as the final argument.  This specifies the Docker image and version to run inside of the container, and will download and launch MongoDB version 4.0 (the most current version as of this writing).  If newer versions are available, you can specify `mongo:latest` to get the most recent version of MongoDB.
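
If you would rather launch the container from a script, the same arguments can be assembled programmatically.  The following is a minimal Python sketch, not part of the original setup: the function name `build_docker_run_command` is hypothetical, and the script only builds and prints the command rather than running it.

```python
import shlex


def build_docker_run_command(name, port, volume, image):
    """Assemble the `docker run` arguments described above as a list."""
    return [
        "docker", "run",
        "-d",                                           # detached mode
        "--name", name,                                 # container name
        "-p", f"{port}:{port}",                         # host:container port mapping
        "--mount", f"source={volume},target=/data/db",  # persistent volume
        image,                                          # image and version tag
    ]


command = build_docker_run_command("mongodb", 27017, "mongodb_volume", "mongo:4.0")
print(shlex.join(command))
```

To actually start the container from Python, you could pass the same list to `subprocess.run(command, check=True)`.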

To check that the container is running, execute the following command in the command line:

```bash
docker container ls
```

You should see something like this, indicating that the container is successfully running:

```
CONTAINER ID     IMAGE          COMMAND                  CREATED          STATUS          PORTS                      NAMES
blahblahblah     mongo:4.0      "docker-entrypoint.s…"   X seconds ago    Up X seconds    0.0.0.0:27017->27017/tcp   mongodb
```

<hr>

## Connect to the database

There are two ways to connect to the MongoDB database once it is up and running:

#### Method #1: Mongo Compass app

MongoDB offers a desktop client you can use to perform basic queries and other administrative tasks.  To use the client, download the **Mongo Compass app** from the [MongoDB download center](https://www.mongodb.com/download-center#compass).  Be sure to choose the version of the app that matches the operating system on your local computer.  Once it's installed, open the app, enter the connection details as follows, then click "Connect":

![MongoDB Compass connect app](img_mongodb/mongodb_compass_connect.png)

#### Method #2: mongo shell

The other way to connect to the MongoDB database is via the command line, using the **mongo shell**.  To start, run the following command to open a bash shell inside the Docker container:

```bash
docker exec -it mongodb /bin/bash
```

Once you're inside the container with the bash prompt ready, run the following command to enter the mongo shell:

```bash
mongo --host localhost:27017
```

You'll notice that neither of these approaches requires you to authenticate or set a username or password.  This is because MongoDB starts with access control disabled by default, so anyone who can reach the port can read and write data.  You will need to enable an authentication method and take other security precautions to secure the database installation if you want to use MongoDB in a production environment.
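
For reference, here is how the connection details used above map onto a MongoDB connection string, with and without credentials.  This is a minimal Python sketch: the helper name `build_mongo_uri` and the example username and password are hypothetical, and nothing here connects to the database.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="localhost", port=27017, database="", username=None, password=None):
    """Build a MongoDB connection string, with optional credentials."""
    credentials = ""
    if username is not None and password is not None:
        # Percent-encode so characters like '@' or ':' don't break the URI.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    return f"mongodb://{credentials}{host}:{port}/{database}"


# No authentication, as in the default setup described above.
print(build_mongo_uri())
# → mongodb://localhost:27017/

# Hypothetical credentials, for a deployment with access control enabled.
print(build_mongo_uri(database="twitter_sample", username="analyst", password="p@ss:word"))
# → mongodb://analyst:p%40ss%3Aword@localhost:27017/twitter_sample
```

Depending on which database the user was created in, a real deployment may also need an `authSource` option appended to the URI.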

<hr>

## Load data into the database

### Create database

#### Method 1: Mongo Compass desktop app

To create a new database, click the "Create Database" button in the Mongo Compass app:

![Mongo Compass create database button](img_mongodb/mongodb_create_db.png)

Then, give the database a name, and also add a name for the first collection in the database (this guide uses `twitter_sample` and `tweets`).  A **collection** is a set of documents--in this case, tweets--that are stored together within a database.  If you are familiar with relational databases, you can think of a collection as being similar to a "table".  The primary difference is that a collection doesn't enforce any kind of tabular structure on your data.  Unlike rows in a table, which must follow a pre-defined set of attributes/columns, documents in a collection can contain nested values (ex: arrays, dictionaries), can omit fields that other documents have, and are generally more flexible in their structure.
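
To make the difference from a table concrete, here is a small Python sketch with two made-up tweet documents that could live in the same collection even though their fields differ.  The field names and values are illustrative assumptions, not taken from the actual data set.

```python
# Two hypothetical tweet documents destined for the same collection.
tweets = [
    {"id": 1, "text": "Hello from Minneapolis", "coordinates": [-93.265, 44.978]},
    {"id": 2, "text": "No location on this one", "hashtags": ["mongodb", "gis"],
     "user": {"name": "example_user", "followers": 42}},
]

# A relational table would need one fixed set of columns for both rows;
# a collection accepts each document with whatever fields it has.
for tweet in tweets:
    print(tweet["id"], sorted(tweet.keys()))

# Only these fields happen to be present in both documents.
shared_fields = set(tweets[0]) & set(tweets[1])
print(sorted(shared_fields))
# → ['id', 'text']
```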

<img src="img_mongodb/mongodb_create_db_modal.png" alt="Mongo Compass create database modal" style="width: 400px;"/>


#### Method 2: mongo shell

To view a list of existing databases from within the mongo shell, run the following command:

```bash
show dbs
```

If you've already created the `twitter_sample` database using the Mongo Compass app, you should see it listed here.  Otherwise, to switch to a database--either existing or new--run the `use` command, followed by the database name.  This will activate the chosen database; if it doesn't exist yet, MongoDB creates it as soon as you first store data in it (for example, by creating a collection):

```bash
use twitter_sample
```

To create a collection within a database, run the following command:

```bash
db.createCollection("tweets")
```

### Create indexes

At this point, you'll want to create at least one index on the tweet "id" field to help detect duplicates and guarantee uniqueness when loading tweets into the database.  Once this index is set up, you don't need to create any other indexes before loading the data; later on we'll look into adding indexes to optimize query performance when you're ready to query the data.

#### Method 1: Mongo Compass desktop app

To set up an index using the Compass app, navigate to the "Indexes" tab within the collection called "tweets" that you have just created.  There, you may notice that an index named `_id_` already exists.  This is the default index on the `_id` field, which holds the internal [ObjectId](https://docs.mongodb.com/manual/reference/method/ObjectId/) that Mongo assigns to each document when it stores the document in the collection:

![Mongo Compass index tab with default index](img_mongodb/mongodb_index_tab.png)

You can leave this index alone, and just note that this is _not_ the same as the "id" field coming from the Twitter data.  To create an index for the "id" field within the tweets you'll be loading, click the "Create Index" button and enter the following parameters.  Then click "Create" to add the index:

<img src="img_mongodb/mongodb_create_index_modal.png" alt="Mongo Compass create index modal" style="width: 400px;"/>


#### Method 2: mongo shell

You can perform exactly the same index creation operation in the mongo shell.  Just make sure you're `use`-ing the correct database (ex: `use twitter_sample`), and then execute the following from the mongo shell:

```bash
db.tweets.createIndex(
   { id: 1 },
   { name: "id_index", unique: true }
)
```

You can run `db.tweets.getIndexes()` to check that the index was created successfully.
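
To illustrate what the unique index guards against, here is a small Python sketch that separates repeated tweet ids on the client side.  It is only an illustration of the idea, not how MongoDB implements the index, and the helper name `split_duplicates` and the sample data are hypothetical.  With the unique index in place, the database itself rejects the second insert of an existing "id" with a duplicate key error.

```python
def split_duplicates(tweets):
    """Separate tweets into first occurrences and repeats, keyed on "id"."""
    seen_ids = set()
    unique, duplicates = [], []
    for tweet in tweets:
        if tweet["id"] in seen_ids:
            duplicates.append(tweet)  # a unique index would reject this insert
        else:
            seen_ids.add(tweet["id"])
            unique.append(tweet)
    return unique, duplicates


sample = [
    {"id": 101, "text": "first"},
    {"id": 102, "text": "second"},
    {"id": 101, "text": "first, delivered twice"},
]
unique, duplicates = split_duplicates(sample)
print(len(unique), len(duplicates))
# → 2 1
```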

### Execute load scripts

Insert one vs. bulk insert

```python
import Clean_Load_Scripts as cleanNLoad

data_folder = '/Users/linkalis/Desktop/twitter_data/twitter_sample_5GB_split/'
logs_folder = '/Users/linkalis/Desktop/twitter_data/twitter_sample_5GB_split/logs/'

extractor = cleanNLoad.Extractor(data_folder, logs_folder, initialize=False)

while extractor.next_file_available():
    next_file_data, next_file_name = extractor.get_next_file() # read in the next file
    cleaner = cleanNLoad.Cleaner(next_file_data, next_file_name, logs_folder) # clean the data (fix bounding boxes, add centroids, etc.)
    cleaned_data = cleaner.clean_data()
    loader = cleanNLoad.Loader(cleaned_data, next_file_name, logs_folder) # initialize the loader
    loader.get_connection("mongodb", "localhost", "27017", None, None, "twitter_sample", "tweets") # create a database connection
    loader.load_batch_data() # load the file's data as a batch
```
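
The cleaning step above mentions adding centroids, but the `Cleaner` implementation isn't shown here.  As a rough idea of what that step might involve, the following Python sketch computes the center of a bounding box.  It assumes the bounding box is a GeoJSON-style Polygon whose first ring lists its corners as `[longitude, latitude]` pairs; the function name `bounding_box_centroid` and the sample box are hypothetical.

```python
def bounding_box_centroid(bounding_box):
    """Return a GeoJSON Point at the center of a rectangular bounding box.

    Assumes `bounding_box` is a GeoJSON-style Polygon whose first ring lists
    the corners as [longitude, latitude] pairs.
    """
    ring = bounding_box["coordinates"][0]
    lons = [lon for lon, lat in ring]
    lats = [lat for lon, lat in ring]
    # For an axis-aligned box, the center is the midpoint of the extremes.
    center_lon = (min(lons) + max(lons)) / 2
    center_lat = (min(lats) + max(lats)) / 2
    return {"type": "Point", "coordinates": [center_lon, center_lat]}


box = {
    "type": "Polygon",
    "coordinates": [[[-94.0, 44.0], [-92.0, 44.0], [-92.0, 46.0], [-94.0, 46.0]]],
}
print(bounding_box_centroid(box))
# → {'type': 'Point', 'coordinates': [-93.0, 45.0]}
```

Storing the centroid as a GeoJSON Point keeps it in a form that spatial indexes and queries can use later on.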

<hr>

## Query the data

### Basic queries

### Spatial queries

Add spatial indexes:
    
https://docs.mongodb.com/manual/core/geohaystack/

### Advanced queries

Text query: try with and without special indexes

Aggregate query