## CORD-19-Research-Data-Set-Atlas

This workbook covers querying data from a MongoDB database hosted on Atlas.

For this code to run, you'll need to create a MongoDB database on Atlas named "covid-19" with a collection named "noncomm-subset".

### Download data

You can download the data from:

https://pages.semanticscholar.org/coronavirus-research

I'm using the non-commercial use subset (76 MB download). To keep the example size manageable, I moved the first 70 records into the data directory of this repository.

### Install Mongo

https://docs.mongodb.com/manual/installation/

### Mongo Client

You may want to get familiar with the MongoDB client and CRUD operations before working with Python.

### Build an Atlas MongoDB from the CORD-19 Subset

Build a database named "covid-19", and a collection named "noncomm-subset".

Add records with mongoimport (substitute in your username, password, and Atlas URI):

mongoimport --uri "mongodb+srv://your_username:your_password@your_atlas_uri/covid-19" --collection noncomm-subset --drop --file filename.json

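For example, to upload one file, mine would be (still redacting password and username) the command below. Note that --drop deletes the existing collection before each import, so leave it off when you add more files to the same collection.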

mongoimport --uri "mongodb+srv://username:password@python-mongodb-workshop.unmjr.gcp.mongodb.net/covid-19" --collection noncomm-subset --drop --file 1b58422e266ab9339c919119923229d080f27360.json 
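If you'd rather load the files from Python than run mongoimport once per file, here is a minimal sketch. It assumes each `.json` file in the data directory holds a single paper document; the helper name `load_papers` is made up for this example.

```python
import json
from pathlib import Path


def load_papers(data_dir):
    """Read every *.json file in data_dir and return the parsed documents."""
    papers = []
    # sorted() keeps the load order stable between runs
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            papers.append(json.load(handle))
    return papers


# Hypothetical usage once you hold a pymongo collection object:
#   docs = load_papers("data")
#   if docs:
#       collection.insert_many(docs)
```

`insert_many` rejects an empty list, which is why the usage sketch checks `docs` first.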


In [None]:
import json
from pymongo import MongoClient

In [None]:
# replace the user, password, and host with your own Atlas connection details
MDB_URL = "mongodb+srv://pymongo_workshop_user:pwd@python-mongo-workshop.jkemn.gcp.mongodb.net/"
client = MongoClient(MDB_URL)
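If your password contains characters such as `@` or `/`, the connection string has to carry them percent-encoded. The sketch below builds the URI with `urllib.parse.quote_plus`; the helper name `build_atlas_uri` and the credential values are placeholders I made up for illustration.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host, database=""):
    """Return a mongodb+srv URI with the credentials percent-encoded."""
    return "mongodb+srv://{}:{}@{}/{}".format(
        quote_plus(username), quote_plus(password), host, database
    )


# Placeholder values; substitute your own credentials and cluster host.
uri = build_atlas_uri("your_username", "p@ss/word", "your_atlas_uri", "covid-19")
print(uri)
```

The resulting string can be passed to `MongoClient` in place of a hand-written URL.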

In [None]:
client.list_database_names()

In [None]:
db = client.get_database("covid-19")

In [None]:
db.list_collection_names()

In [None]:
pmc_content = db['noncomm-subset']

In [None]:
for c in db['noncomm-subset'].find().limit(10):
    print(c)

In [None]:
# to print the titles only and suppress the id

for c in db['noncomm-subset'].find({},{ 'metadata.title': 1, '_id': 0 }):
    print(c)

In [None]:
# find one paper by paper_id

for c in db['noncomm-subset'].find({'paper_id': '00a00d0edc750db4a0c299dd1ec0c6871f5a4f24'}):
    print(c)

In [None]:
# query on nested documents
# see: https://docs.mongodb.com/manual/tutorial/query-embedded-documents/

for c in db['noncomm-subset'].find({'metadata.title': 'ACE/ACE2 Ratio and MMP-9 Activity as Potential Biomarkers in Tuberculous Pleural Effusions'}):
    print(c)

In [None]:
# query on an array of embedded documents
# see: https://docs.mongodb.com/manual/tutorial/query-array-of-documents/

for c in db['noncomm-subset'].find({'metadata.authors.first': 'Wen-Yeh'}):
    print(c)
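To see why `'metadata.authors.first'` can match even though `authors` is an array, it may help to model how a dotted path is followed. The function below is a simplified pure-Python illustration, not MongoDB's actual implementation: it ignores numeric array indexes and other query semantics. The second author name in the sample document is invented.

```python
def resolve_path(value, path):
    """Collect every value reachable from value by following a dotted path.

    Simplified model: dicts are descended by key, lists fan out to each element.
    """
    if not path:
        return [value]
    if isinstance(value, list):
        found = []
        for item in value:
            found.extend(resolve_path(item, path))
        return found
    if isinstance(value, dict):
        key, _, rest = path.partition(".")
        if key in value:
            return resolve_path(value[key], rest)
    return []


def matches(doc, path, expected):
    """True if any value reached through path equals expected."""
    return expected in resolve_path(doc, path)


# Sample document shaped like the papers in this collection (values are illustrative).
paper = {"metadata": {"title": "Example", "authors": [{"first": "Wen-Yeh"}, {"first": "Ana"}]}}
print(matches(paper, "metadata.authors.first", "Wen-Yeh"))
```

The path fans out across every element of `authors`, so a paper matches when any one of its authors has the requested first name.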

### Query on a text index

To query on a search phrase or word, you'll need to build a text index on the fields you want to search. For this tutorial, we'll do this with the MongoDB shell.

Note: if the field is nested, put its dotted name in quotation marks when you build the text index.

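The collection name contains a hyphen, so in the shell it has to be reached with `db.getCollection("noncomm-subset")` rather than dot notation.

To build the index:
```
db.getCollection("noncomm-subset").createIndex( { "body_text.text": "text" } )
```

To list all the indexes you have on a collection:
```
db.getCollection("noncomm-subset").getIndexes()
```

To remove the index:
```
db.getCollection("noncomm-subset").dropIndex("body_text.text_text")
```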

In [None]:
# a regex query works without the text index, but it has to scan every document
import re
regx = re.compile("pleural", re.IGNORECASE)
for r in db['noncomm-subset'].find({'body_text.text': { '$regex': regx } }):
    print(r)
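When the search term comes from user input, regex metacharacters such as parentheses should be escaped first. A minimal sketch, assuming Python's `re.escape` output is acceptable to MongoDB's regex engine for ordinary punctuation; `contains_filter` is a made-up helper name and the sample term is illustrative.

```python
import re


def contains_filter(field, term):
    """Build a case-insensitive 'contains' filter that treats term literally."""
    return {field: {"$regex": re.escape(term), "$options": "i"}}


query = contains_filter("body_text.text", "ACE/ACE2 (ratio)")
print(query)

# Hypothetical usage against the collection from this workbook:
#   for r in db['noncomm-subset'].find(query):
#       print(r)
```

Passing the pattern as a string with `$options` keeps the filter a plain dictionary, which is convenient to log or reuse.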

In [None]:
# query the text index created on the body_text.text field
# in the browser use { "body_text.text": "text" } 

In [None]:
for c in db['noncomm-subset'].find({'$text':{'$search':'Pleural'}}):
    print(c)
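Text search results come back in no particular order by default. The sketch below asks MongoDB for the relevance score and sorts on it; it assumes the text index on `body_text.text` already exists, and `search_body_text` is a made-up helper name.

```python
def search_body_text(collection, phrase, limit=5):
    """Return the best text-search matches for phrase, most relevant first."""
    score = {"$meta": "textScore"}
    cursor = collection.find(
        {"$text": {"$search": phrase}},
        # project only the title plus the computed relevance score
        {"metadata.title": 1, "_id": 0, "score": score},
    )
    return list(cursor.sort([("score", score)]).limit(limit))


# Hypothetical usage against the collection from this workbook:
#   for hit in search_body_text(db['noncomm-subset'], 'pleural'):
#       print(hit)
```

Projecting only the title keeps the output readable, since each full paper document is large.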

In [None]:
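# count papers grouped by the first country found among each paper's author affiliations
first_country = {'$arrayElemAt': ['$metadata.authors.affiliation.location.country', 0]}
pipeline = [{'$group': {'_id': first_country, 'count': {'$sum': 1}}}]
for a in db['noncomm-subset'].aggregate(pipeline):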
    print(a)
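The aggregation above counts each paper once, under a single country. A variant that counts every author instead can be sketched with `$unwind`, which emits one document per array element. This is an alternative I'm adding for comparison, not part of the original workbook; `authors_by_country_pipeline` is a made-up name.

```python
def authors_by_country_pipeline():
    """Build a pipeline that counts authors (not papers) per affiliation country."""
    country = "$metadata.authors.affiliation.location.country"
    return [
        # one output document per author in each paper
        {"$unwind": "$metadata.authors"},
        # authors with no country recorded end up grouped under None
        {"$group": {"_id": country, "count": {"$sum": 1}}},
        # largest groups first
        {"$sort": {"count": -1}},
    ]


pipeline = authors_by_country_pipeline()
print(pipeline)

# Hypothetical usage against the collection from this workbook:
#   for a in db['noncomm-subset'].aggregate(pipeline):
#       print(a)
```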