# Tutorial pymongo

Agenda:

* [Intro](#intro)
  * Connection
  * Create
  * Read
  * Update
  * Delete
* [Scrape a Table](#scrape-a-table)
  * Get HTML
  * Parse HTML
  * Store in MongoDB

See also the Tutorial provided by [MongoDB](https://api.mongodb.com/python/current/tutorial.html).

# Intro

The recommended way to install [pymongo](https://docs.mongodb.com/drivers/pymongo) is:
```shell
python -m pip install pymongo
```

## Connection

In [None]:
# client for a MongoDB instance
from pymongo import MongoClient

Any of the following connects to a MongoDB instance running on localhost (27017 is the default port):

* `client = MongoClient()`
* `client = MongoClient('localhost', 27017)`
* `client = MongoClient('mongodb://localhost:27017/')`

How to connect to your Atlas cluster:

> The `mongodb+srv://` scheme needs extra dependencies (the `srv` extra installs `dnspython`):
> `pip install 'pymongo[tls,srv]'`


```python
client = MongoClient("mongodb+srv://<username>:<password>@cluster0-tnsbt.mongodb.net/<dbname>?retryWrites=true&w=majority")
```
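Usernames and passwords that contain reserved characters such as `@` or `/` need to be percent-escaped before they go into the connection string. The sketch below shows one way to do that with the standard library; `build_srv_uri` is a hypothetical helper, and the host, user and password are made-up placeholders rather than real Atlas values.

```python
from urllib.parse import quote_plus


def build_srv_uri(username, password, host, dbname):
    """Assemble a mongodb+srv URI with percent-escaped credentials.

    Hypothetical helper for illustration; the query options mirror the
    ones used in the Atlas connection string above.
    """
    return "mongodb+srv://{}:{}@{}/{}?retryWrites=true&w=majority".format(
        quote_plus(username),  # escape reserved characters in the user name
        quote_plus(password),  # ... and in the password
        host,
        dbname,
    )


# placeholder values, not a real cluster
uri = build_srv_uri("alice", "p@ss/word", "cluster0.example.mongodb.net", "smm695")
print(uri)
```

The resulting string can then be passed to `MongoClient(uri)`.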

In [None]:
# let's connect to the localhost
client = MongoClient()

# let's create a database 
db = client.smm695

# and a collection
wb_1 = db.wb_1

# print connection
print("""
Database
==========
{}

Collection
==========
{}
""".format(db, wb_1), flush=True
)

**Note**:

_Databases and collections are created "lazily" in MongoDB_. At this point nothing has happened on the server: both the database and the collection are created by the first insert.
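One way to observe the lazy creation is to ask the server which collections exist before and after the first insert. The helper below is a hypothetical convenience wrapper around `list_collection_names()`; it is only a sketch and assumes `db` is a pymongo `Database` connected to a running server.

```python
def collection_exists(db, name):
    """Return True if the server already holds a collection called `name`.

    `db` is expected to be a pymongo Database; any object exposing
    list_collection_names() works.
    """
    return name in db.list_collection_names()


# Usage with the tutorial's objects (needs a running MongoDB):
#   collection_exists(db, "wb_1")   # False on a fresh database
#   wb_1.insert_one({"name": "Mercury"})
#   collection_exists(db, "wb_1")   # True once a document has been inserted
```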

## Create

In [None]:
# document to insert
planet = {"name": "Mercury", "mass": 0.06, "moons": 0}

# insert one
insert_1 = wb_1.insert_one(planet)

# getting _id
insert_1.inserted_id

# print info
print("""
Collection
==========
{}

_id
==========
{}
""".format(db.list_collection_names(), insert_1.inserted_id), flush=True)

In [None]:
# array of documents
planets = [{
    "name": "Venus",
    "mass": 0.82,
    "moons": 0
}, {
    "name": "Earth",
    "mass": 1.00,
    "moons": 1
},
{
    "name": "Mars",
    "mass": 0.11,
    "moons": 2
}]

# insert several documents at once with insert_many
insert_2 = wb_1.insert_many(planets)

# getting _id
insert_2.inserted_ids

## Read

In [None]:
from pprint import pprint as pp

In [None]:
# let's inspect a document
pp(wb_1.find_one())

In [None]:
# let's find all planets without moons
for obs in wb_1.find({'moons': 0}):
    pp(obs)

## Update

In [None]:
# add a new planet
new_planet = {"name": "Jupiter", "mass": 317.8, "moons": 75, "rings": True}
wb_1.insert_one(new_planet).inserted_id

In [None]:
# update the number of moons
name = "Jupiter"
update_1 = wb_1.update_one({"name": name}, {"$inc": {"moons": 4}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_1.matched_count, update_1.modified_count), flush=True)

In [None]:
# set a new field
update_2 = wb_1.update_one({"name": name}, {"$set": {"type": "gas giants"}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_2.matched_count, update_2.modified_count), flush=True)

In [None]:
# unset rings
update_3 = wb_1.update_one({"rings": True}, {"$unset": {"rings": ""}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_3.matched_count, update_3.modified_count), flush=True)

In [None]:
# set "type" field for the other planets
update_4 = wb_1.update_many({"moons": {"$lte" : 2}}, {"$set": {"type": "terrestrial planets"}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_4.matched_count, update_4.modified_count), flush=True)

In [None]:
# let's see what we have
for planet in wb_1.find():
    pp(planet)
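To make the three update operators used in this section concrete, here is a toy model of `$inc`, `$set` and `$unset` acting on a plain Python dictionary. `apply_update` is a hypothetical helper written for this illustration; it is not part of PyMongo, it ignores every other operator, and the real work happens on the server.

```python
import copy


def apply_update(document, update):
    """Return a copy of `document` with $inc, $set and $unset applied."""
    result = copy.deepcopy(document)
    # $inc adds to a numeric field (a missing field starts from 0)
    for field, amount in update.get("$inc", {}).items():
        result[field] = result.get(field, 0) + amount
    # $set creates the field or overwrites its value
    for field, value in update.get("$set", {}).items():
        result[field] = value
    # $unset removes the field; the value given for it is ignored
    for field in update.get("$unset", {}):
        result.pop(field, None)
    return result


jupiter = {"name": "Jupiter", "mass": 317.8, "moons": 75, "rings": True}
jupiter = apply_update(jupiter, {"$inc": {"moons": 4}})
jupiter = apply_update(jupiter, {"$set": {"type": "gas giants"}})
jupiter = apply_update(jupiter, {"$unset": {"rings": ""}})
print(jupiter)
```

After the three steps the dictionary has 79 moons, a `type` field and no `rings` field, which mirrors what the `update_one` calls do to the stored document.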

## Delete

In [None]:
# delete one
delete_1 = wb_1.delete_one({"name": "Mercury"})

# print info
print('Deleted:', delete_1.deleted_count)

In [None]:
# delete many
delete_2 = wb_1.delete_many({"type": "terrestrial planets"})

# print info
print('Deleted:', delete_2.deleted_count)

In [None]:
# drop collection
wb_1.drop()

# list collections
db.list_collection_names()

# Scrape a Table

Here, we are going to scrape a table from a Wikipedia [page](https://en.wikipedia.org/wiki/Planet) and store the results in MongoDB.

[![Jupiter System Montage](https://images-assets.nasa.gov/image/PIA18469/PIA18469~orig.jpg)](https://images-assets.nasa.gov/image/PIA18469/PIA18469~orig.jpg)
[NASA: Building Planets Through Collisions](https://images.nasa.gov/details-PIA18469)

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
# let's connect to the localhost
client = MongoClient()

# let's create a database 
db = client.smm695

# collection
wiki_planets = db.wiki_planets

# print connection
print("""
Database
==========
{}

Collection
==========
{}
""".format(db, wiki_planets), flush=True
)

In [None]:
# set a user agent
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.53 Safari/534.3'
header = {'user-agent': user_agent}

# set a proxy
proxy = {'https': 'http://***.***.***.***:***'}

# wikipedia page
url = "https://en.wikipedia.org/wiki/Planet"

In [None]:
# get the wiki page 
wb = requests.get(url, 
                  headers=header, 
                  #proxies=proxy, 
                  timeout=15)

# parsing the webpage text
sp = BeautifulSoup(wb.text, 'html.parser')

print("""
User-Agent:
===========
{}

IP:
===========
{}

""".format(wb.request.headers['user-agent'],
           #wb.headers['X-Client-IP']
           '****'
          ), flush=True)

In [None]:
# get target table
target = sp.find('table', {'class' : "wikitable sortable"})

print(target.prettify())
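The structure of such a table is simple: each `tr` is a row, `th` cells hold the column titles and `td` cells hold the values. The sketch below illustrates this on a tiny invented HTML snippet. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages or network access; the class name `TableRows` and the sample data are assumptions made for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # keep text only while inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# invented sample, shaped like a small "wikitable"
html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Mass</th><th>Moons</th></tr>
  <tr><td>Venus</td><td>0.82</td><td>0</td></tr>
  <tr><td>Earth</td><td>1.00</td><td>1</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)

# the first row holds the titles, the remaining rows hold the values
columns, *body = parser.rows
records = [dict(zip(columns, row)) for row in body]
print(records)
```

Each dictionary in `records` has the shape of a document that could be handed to `insert_one`; all values are still strings at this stage.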

In [None]:
# 'th' gives us the columns' titles
columns = [th.text.strip('\n') for th in target.find_all('th')]

# 'tr' defines rows and 'td' is the body to extract
for row in target.find_all('tr'):
    cells = row.find_all('td')
    if not cells:
        continue  # skip rows without 'td' cells (e.g. the header row)
    # the planet name is the title of the link in the second cell
    d = {'Name': cells[1].a.get('title')}
    for i in range(2, 13):
        # drop non-ASCII characters and the footnote markers [i] and [j]
        column_name = (columns[i].encode('ascii', 'ignore').decode('ascii')
                       .replace('[i]', '').replace('[j]', ''))
        text = cells[i].text.strip('\n')
        try:
            # negative values may use U+2212 (minus sign) instead of '-'
            d[column_name] = float(text.replace('\u2212', '-'))
        except ValueError:
            # keep non-numeric cells as text
            d[column_name] = text
    insert = wiki_planets.insert_one(d)
    print(insert.inserted_id)
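The two clean-up steps in the cell above (tidying the column titles and turning cell text into numbers) can be pulled out into small functions, which makes them easy to try on single values. This is a sketch under assumptions: `clean_column_name` and `to_number` are hypothetical helpers, and the regular expression removes any bracketed marker, not only `[i]` and `[j]`.

```python
import re


def clean_column_name(title):
    """Drop bracketed footnote markers such as [i] and non-ASCII characters."""
    title = re.sub(r"\[[^\]]*\]", "", title)
    return title.encode("ascii", "ignore").decode("ascii").strip()


def to_number(text):
    """Return the cell as a float when possible, else as stripped text."""
    cleaned = text.strip()
    try:
        # U+2212 is the typographic minus sign, which float() rejects
        return float(cleaned.replace("\u2212", "-"))
    except ValueError:
        return cleaned


print(clean_column_name("Mass[i]"))
print(to_number("\u22120.5"))
print(to_number("N/A"))
```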

In [None]:
# let's see what we imported:
for doc in wiki_planets.find():
    pp(doc)
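Running the scraping cell a second time inserts every planet again, because `insert_one` creates a new document for each fresh dictionary. One possible way around this is to replace documents by name with `replace_one(..., upsert=True)`, which inserts only when no match exists. The function below is a hedged sketch: `upsert_planets` is a hypothetical helper, it assumes the `Name` field identifies a planet uniquely, and it expects documents that do not yet carry an `_id`.

```python
def upsert_planets(collection, documents, key="Name"):
    """Replace-or-insert each document, matching on `key`.

    `collection` is expected to be a pymongo Collection.
    Returns a tuple (inserted, replaced).
    """
    inserted = replaced = 0
    for doc in documents:
        result = collection.replace_one({key: doc[key]}, doc, upsert=True)
        # upserted_id is set only when a new document was created
        if result.upserted_id is not None:
            inserted += 1
        else:
            replaced += 1
    return inserted, replaced


# Usage with the tutorial's collection (needs a running MongoDB):
#   upsert_planets(wiki_planets, [{"Name": "Earth", "moons": 1}])
```

With this approach the scraping loop would collect the dictionaries in a list first and store them in one call at the end.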