## Bonus Lab: MongoDB Introduction

This week's lab is a mini-lab.

Reminder: save your work. Go to File > Save a Copy in Drive to make sure your work is saved.

## DataFrame Selection

This is here just for reference. Don't forget it!

**Selecting rows by numeric index**

Provide a slice in `x:y` notation: `df[10:14]`

**Selecting rows by index name**

Provide the name to `.loc[]`: `df.loc['Sherlock Holmes']`

**Selecting rows by inclusion criteria**

Provide any collection (e.g. a list or Series) of True/False values:

```
df[[True, False, False, True, True]]
```

```
df[df.year > 1996]
```
    
**Selecting multiple columns**

Provide a collection of strings, referencing the column names:

```
df[['genres', 'year']]
```
    
**Selecting single column (as Series)**

```
df['year']
```

Or:

```
df.year
```

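Treat the latter as a shortcut, not the main way: it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute.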

The output is a Series. To select a single column as a DataFrame, provide a list with only one value: `df[['year']]`.
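To make the Series/DataFrame distinction concrete, here is a small sketch. The three-row table (its titles, years, and genres) is invented purely for illustration and is not part of the lab data:

```python
import pandas as pd

# Hypothetical toy data, made up for this example
df = pd.DataFrame(
    {"year": [1994, 1999, 2001], "genres": ["drama", "sci-fi", "fantasy"]},
    index=["Film A", "Film B", "Film C"],
)

as_series = df["year"]        # a single column name returns a Series
as_frame = df[["year"]]       # a one-item list returns a one-column DataFrame
recent = df[df.year > 1996]   # a True/False Series keeps the rows marked True

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(list(recent.index))
```

The same column comes back in both cases; only the container around it differs.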

## Opening JSON

This code shows how to load JSON into Python. The `from smart_open import open` line is only needed when reading from a URL; if you're reading a local file, Python's built-in `open` is enough.

In [1]:
import json
from smart_open import open
with open('https://raw.githubusercontent.com/organisciak/Scripting-Course/master/data/cooking.json') as f:
    data = json.load(f)

print("There are ", len(data), "items.")

There are  39774 items.


What's done here? 

You load a text file and 'parse' it so that Python understands it as JSON.

The `with open(...) as f` syntax opens a file and assigns the open file to a variable named `f`. On its own, `f` can only give you the raw text of the file; the `json.load` function parses that text (one big string!) into a Python object.
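The parsing step can be tried on a much smaller scale. This is a minimal sketch that assumes a hand-written one-record string in place of the real file, with `io.StringIO` standing in for the opened file:

```python
import io
import json

# A short, hand-written JSON string (an assumption for this example, not the lab file)
raw_text = '[{"cuisine": "greek", "id": 10259, "ingredients": ["garlic", "pepper"]}]'

# io.StringIO behaves like an opened text file, so json.load can read from it
with io.StringIO(raw_text) as f:
    records = json.load(f)

# Before parsing we have text; after parsing we have nested Python objects
print(type(raw_text).__name__, "->", type(records).__name__)

# Parsed data can be indexed step by step: first record, its ingredients, second ingredient
print(records[0]["ingredients"][1])
```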

This particular data is a big list of foods. Try printing the first few items:

In [None]:
data[:2]

[{'cuisine': 'greek',
  'id': 10259,
  'ingredients': ['romaine lettuce',
   'black olives',
   'grape tomatoes',
   'garlic',
   'pepper',
   'purple onion',
   'seasoning',
   'garbanzo beans',
   'feta cheese crumbles']},
 {'cuisine': 'southern_us',
  'id': 25693,
  'ingredients': ['plain flour',
   'ground pepper',
   'salt',
   'tomatoes',
   'ground black pepper',
   'thyme',
   'eggs',
   'green tomatoes',
   'yellow corn meal',
   'milk',
   'vegetable oil']}]

### Questions

Q1) What type of cuisine is the 300th item (index=299) in the dataset? (*5pts*)

In [None]:
q1_answer = "" #@param {type:"string"}

Q2) What food is the tenth ingredient of the 200th item in the data? Tip: you can count by hand for the answer, but see if you can grab it with code. (*5pts*)

In [None]:
q2_answer = "" #@param {type:"string"}

Q3) If you had run `data2 = f.read()` instead of `data = json.load(f)` in the loading code, what type of information would the `data2` variable hold? How or why is it different? (Tip: try running your code from Q1 on `data2` to see what it looks like, or use the `type()` function to figure out the type of variable that `data2` is compared to `data`.) (*5pts*)

In [None]:
q3_answer = "" #@param {type:"string"}


Q4) What's the difference between running `print(data2[300:500])`, and having the notebook auto-print by running a cell with `data2[300:500]` on the last line? (*5pts*)

In [None]:
q4_answer = "" #@param {type:"string"}

## Setting up MongoDB

MongoDB is a separate database program, not a part of Python. If you work with Jupyter on your own computer, you have to install MongoDB yourself. For this class, since we're using Colab online, we'll also use an online service that already has MongoDB installed.

For this class, **Dr. Organisciak has created a read-only database that you can work with**, so you **don't have to create your own database**.

*If* you would like to create your own online-hosted database *anyway*, you can do so with the following steps:
1. Create an account with [MongoDB Atlas](https://www.mongodb.com/cloud/atlas), by clicking the Try Free button.
2. Create a shared cluster in a free tier.
3. Click 'Connect', then create a user, a password, and add `0.0.0.0/0` to the whitelist. (This whitelist setting means people can connect from anywhere if they have your username and password, so it's okay for a learning environment but less secure for critical applications).
4. For 'Choose a Connection Method', select 'connect your application', then 'Python', then copy the connection string. This is the address of your server.
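A connection string embeds the username and password, so characters such as `@` or `/` in a password have to be percent-encoded before they go into the URL. The connection cell in this lab does that with `quote_plus`. Here is a minimal sketch of the idea; the username, password, and cluster address are all made up for the example:

```python
from urllib.parse import quote_plus

# Hypothetical credentials and host, invented for illustration
user = "example_user"
password = "p@ss/word:1"
cluster_url = "cluster0.example.mongodb.net"

# Escape the parts that may contain special characters, then fill in the template
uri = "mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(
    quote_plus(user), quote_plus(password), cluster_url
)

print(quote_plus(password))
print(uri)
```

Without the escaping, the `@` inside the password would be mistaken for the separator between the credentials and the host.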

In [7]:
#@title Connect to a MongoDB database
#@markdown This cell connects to a remote MongoDB instance.
#@markdown When it asks for your password, paste in the password posted to Canvas.
!pip install dnspython
from urllib.parse import quote_plus
from pymongo import MongoClient
import os
from getpass import getpass

if os.path.exists('credentials.txt'):
    # Allow loading credentials for user, pw, url, one per line
    with open('credentials.txt', mode='r') as f:
        user, mongopw, cluster_url = [l.strip() for l in f.readlines()]
else:
    user = "scriptingStudent" #@param {type:"string"}
    cluster_url = "cluster0-ga5s0.mongodb.net" #@param {type:"string"}
    mongopw = getpass('Enter your MongoDB password for "{}":\n'.format(user))
    with open('credentials.txt', mode='w') as f:
        f.write("{}\n{}\n{}".format(user, mongopw, cluster_url))

client = MongoClient("mongodb+srv://{}:{}@{}/test?retryWrites=true&w=majority".format(quote_plus(user), quote_plus(mongopw), cluster_url))
db = client.week9

Collecting dnspython
  Downloading dnspython-2.2.1-py3-none-any.whl (269 kB)

The database that we're using is called 'week9', and has been set to a variable called `db`:

In [8]:
db

Database(MongoClient(host=['cluster0-shard-00-02-ga5s0.mongodb.net:27017', 'cluster0-shard-00-00-ga5s0.mongodb.net:27017', 'cluster0-shard-00-01-ga5s0.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='Cluster0-shard-0', tls=True), 'week9')

### Connecting to a Collection

Here, I connect to the 'cooking' collection of `db`. If it doesn't exist, it will be created. (The account you're connecting with is read-only, though, so you won't be able to create new collections in Dr. O's database!)

In [9]:
collection = db.cooking

### Inserting Data

Here's how you insert the `data` from before. It's a list (`data[0]` is one 'document', `data[1]` is another one, and so on), so we can use `insert_many`:

```
collection.insert_many(data)
```

**The data is already on our class database**, so you don't have to insert it (and you couldn't anyway: your account has read access only, not write access).

To see a count of how many records we have, run:

In [None]:
collection.estimated_document_count()

It's also possible to use `insert_one` for a single record:

```collection.insert_one(your_record)```

(Again, we're connected to our class database read-only, so you can't insert here unless you create your own DB).

### Questions

Q5) How many documents are in the cooking collection? (*easiest 3pts of the class*)
   - 7954
   - 39774
   - 79548

In [None]:
q5_answer = "" #@param ["", "7954", "39774", "79548"]

Q6) Match the cuisine to the number of documents that are that type of cuisine. Tip: You can run `collection.count_documents()` with the same input you would give to `collection.find()` (*9pts*).
   - Cuisines: a) 'japanese', b) 'mexican', c) 'italian'
   - Counts: 6438, 7838, 1423

In [None]:
q6a_answer = "" #@param ["", "6438", "7838", "1423"]

In [None]:
q6b_answer = "" #@param ["", "6438", "7838", "1423"]

In [None]:
q6c_answer = "" #@param ["", "6438", "7838", "1423"]

Q7) How many results have liver as an ingredient? Tip: If you want to inspect the results to double-check but worry about printing a BFD (a 'big dataset'), you can try `find_one`. Remember that for the answer itself, you want to use `count_documents`! (*6pts*)

In [None]:
q7_answer = "" #@param {type:"string"}

`$in` and `$all` questions:
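As a quick refresher before these questions, here is a sketch of how the two operators differ. The three sample records and the `wanted` list are invented for illustration, and the list comprehensions are only a plain-Python imitation of the matching rule, not MongoDB itself:

```python
# Tiny stand-in for a collection; these records are made up for this example
sample = [
    {"id": 1, "ingredients": ["garlic", "salt"]},
    {"id": 2, "ingredients": ["garlic", "olive oil", "salt"]},
    {"id": 3, "ingredients": ["sugar"]},
]
wanted = ["garlic", "olive oil"]

# Filter documents in the shape that find() and count_documents() accept
any_filter = {"ingredients": {"$in": wanted}}    # at least one of the values
all_filter = {"ingredients": {"$all": wanted}}   # every one of the values

# Plain-Python imitation of which sample records each filter would match
match_any = [d["id"] for d in sample if set(wanted) & set(d["ingredients"])]
match_all = [d["id"] for d in sample if set(wanted) <= set(d["ingredients"])]

print(match_any)  # records containing garlic OR olive oil
print(match_all)  # records containing garlic AND olive oil
```

To run a filter like these against the real data, pass the dictionary to `collection.count_documents()` or `collection.find()`.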

Q8) How many results have both 'duck' and 'chinese five-spice powder'? (*6pts*)

In [None]:
q8_answer = "" #@param {type:"string"}


Q9) How many results have either 'duck' or 'chinese five-spice powder'? (*6pts*)

In [None]:
q9_answer = "" #@param {type:"string"}

## Submission Instructions

In [None]:
#@markdown ### First, Enter your name for grading
my_name = "" #@param { type:'string' }

#@markdown _Have you saved your work for yourself? Don't forget to Save a Copy in Drive so that you have your progress._

In [20]:
#@markdown ### Second, check your work:

#@markdown - Have you answered all the questions?
#@markdown     - Some answers can be checked automatically - just run this cell.
#@markdown - Does this notebook run from top to bottom?
#@markdown     - Go to "Runtime > Restart and run all..." to check. Do all the cells run, to the very bottom, or is there a cell in the middle with an error?
#@markdown - Have you completed all the answers where you entered code, keeping the `# Answer-Qx` line at the start of those cells?

#@markdown *A lab that the professor has to fix manually will lose 10pts - run the checks!*

#@markdown ### Finally, submit it

#@markdown - Download the file with "File > Download .ipynb" and submit it to the Canvas assignment page.