# Tutorial pymongo

Agenda:

* [Intro](#intro)
  * Connection
  * Create
  * Read
  * Update
  * Delete
* [Scrape a Table](#scrape-a-table)
  * Get HTML
  * Parse HTML
  * Store in MongoDB

See also the tutorial provided by [MongoDB](https://api.mongodb.com/python/current/tutorial.html).

# Intro

The recommended way to install [pymongo](https://docs.mongodb.com/drivers/pymongo) is:
```shell
python -m pip install pymongo
```

## Connection

In [1]:
# client for a MongoDB instance
from pymongo import MongoClient

How to connect to localhost (the three calls are equivalent, since `localhost` and port `27017` are the defaults):

* `client = MongoClient()`
* `client = MongoClient('localhost', 27017)`
* `client = MongoClient('mongodb://localhost:27017/')`

How to connect to your Atlas cluster:

> You need to install some dependencies:
`pip install 'pymongo[tls,srv]'`


```python
client = MongoClient("mongodb+srv://<username>:<password>@cluster0-tnsbt.mongodb.net/<dbname>?retryWrites=true&w=majority")
```
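
If the username or password contains characters such as `@`, `:` or `/`, they need to be percent-escaped before they go into the connection string. The following is a minimal sketch using only the standard library; the helper name `build_atlas_uri`, the credentials, and the host `cluster0.example.mongodb.net` are placeholders, not values from this tutorial.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host, dbname):
    """Return a mongodb+srv URI with percent-escaped credentials."""
    return "mongodb+srv://{}:{}@{}/{}?retryWrites=true&w=majority".format(
        quote_plus(username), quote_plus(password), host, dbname
    )


# hypothetical credentials and host, for illustration only
uri = build_atlas_uri("alice", "p@ss:word/1", "cluster0.example.mongodb.net", "smm695")
print(uri)
```

The resulting string can then be passed to `MongoClient(uri)`.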

In [2]:
# let's connect to localhost
client = MongoClient()

# let's create a database 
db = client.smm695

# and a collection
wb_1 = db.wb_1

# print connection
print("""
Database
==========
{}

Collection
==========
{}
""".format(db, wb_1), flush=True
)


Database
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'smm695')

Collection
Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'smm695'), 'wb_1')



**Note**:

_Databases and collections are created "lazily" in MongoDB_. At this point, nothing has happened on the server: both the database and the collection will be created with the first insert.
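
One way to observe the lazy creation is to ask the server which collections it knows about. A small sketch, where `collection_exists` is a hypothetical helper and `db` is any pymongo `Database` object:

```python
def collection_exists(db, name):
    """Return True once the collection `name` is present on the server."""
    # list_collection_names() queries the server, so it only reports
    # collections that have actually been created
    return name in db.list_collection_names()
```

On a fresh server, `collection_exists(db, "wb_1")` should return `False` before the first insert and `True` after it.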

## Create

In [3]:
# document to insert
planet = {"name": "Mercury", "mass": 0.06, "moons": 0}

# insert one
insert_1 = wb_1.insert_one(planet)

# getting _id
insert_1.inserted_id

# print info
print("""
Collection
==========
{}

_id
==========
{}
""".format(db.list_collection_names(), insert_1.inserted_id), flush=True)


Collection
['wb_1']

_id
5ee1e08dc6d6b98fa8eb6d7f



In [4]:
# array of documents
planets = [{
    "name": "Venus",
    "mass": 0.82,
    "moons": 0
}, {
    "name": "Earth",
    "mass": 1.00,
    "moons": 1
},
{
    "name": "Mars",
    "mass": 0.11,
    "moons": 2
}]

# using insert_many
insert_2 = wb_1.insert_many(planets)

# getting the _ids
insert_2.inserted_ids

[ObjectId('5ee1e08ec6d6b98fa8eb6d80'),
 ObjectId('5ee1e08ec6d6b98fa8eb6d81'),
 ObjectId('5ee1e08ec6d6b98fa8eb6d82')]
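
Running the insert cells a second time stores the same planets again, because every inserted document receives a new `_id`. One way to avoid duplicates is an upsert keyed on `name`. This is a minimal sketch: `save_planet` is a hypothetical helper, and treating `name` as a unique key is an assumption, not something the collection enforces.

```python
def save_planet(collection, planet):
    """Insert `planet`, or replace the existing document with the same name."""
    result = collection.replace_one({"name": planet["name"]}, planet, upsert=True)
    # upserted_id is None when an existing document was replaced
    return result.upserted_id
```

For example, `save_planet(wb_1, {"name": "Venus", "mass": 0.82, "moons": 0})` can be called repeatedly without growing the collection.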

## Read

In [5]:
from pprint import pprint as pp

In [6]:
# let's inspect a document
pp(wb_1.find_one())

{'_id': ObjectId('5ee1e08dc6d6b98fa8eb6d7f'),
 'mass': 0.06,
 'moons': 0,
 'name': 'Mercury'}


In [7]:
# let's find all planets without moons
for obs in wb_1.find({'moons': 0}):
    pp(obs)

{'_id': ObjectId('5ee1e08dc6d6b98fa8eb6d7f'),
 'mass': 0.06,
 'moons': 0,
 'name': 'Mercury'}
{'_id': ObjectId('5ee1e08ec6d6b98fa8eb6d80'),
 'mass': 0.82,
 'moons': 0,
 'name': 'Venus'}
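
Filters are plain dictionaries, so they can be built programmatically, and a second dictionary (the projection) limits which fields come back. A small sketch, where `moons_between` is a hypothetical helper:

```python
def moons_between(low, high):
    """Build a filter matching documents with low <= moons <= high."""
    return {"moons": {"$gte": low, "$lte": high}}


# return only name and moons, hiding the _id field
projection = {"_id": 0, "name": 1, "moons": 1}

query = moons_between(1, 2)
print(query)
```

With the four planets inserted so far, `wb_1.find(query, projection)` should yield Earth and Mars.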


## Update

In [8]:
# add a new planet
new_planet = {"name": "Jupiter", "mass": 317.8, "moons": 75, "rings": True}
wb_1.insert_one(new_planet).inserted_id

ObjectId('5ee1e092c6d6b98fa8eb6d83')

In [9]:
# update the number of moons
name = "Jupiter"
update_1 = wb_1.update_one({"name": name}, {"$inc": {"moons": 4}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_1.matched_count, update_1.modified_count), flush=True)


Matched: 1

Modified: 1



In [10]:
# set a new field
update_2 = wb_1.update_one({"name": name}, {"$set": {"type": "gas giants"}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_2.matched_count, update_2.modified_count), flush=True)


Matched: 1

Modified: 1



In [11]:
# unset rings
update_3 = wb_1.update_one({"rings": True}, {"$unset": {"rings": ""}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_3.matched_count, update_3.modified_count), flush=True)


Matched: 1

Modified: 1



In [12]:
# set "type" field for the other planets
update_4 = wb_1.update_many({"moons": {"$lte" : 2}}, {"$set": {"type": "terrestrial planets"}})

# print info
print("""
Matched: {}

Modified: {}
""".format(update_4.matched_count, update_4.modified_count), flush=True)


Matched: 4

Modified: 4



In [13]:
# let's see what we have
for planet in wb_1.find():
    pp(planet)

{'_id': ObjectId('5ee1e08dc6d6b98fa8eb6d7f'),
 'mass': 0.06,
 'moons': 0,
 'name': 'Mercury',
 'type': 'terrestrial planets'}
{'_id': ObjectId('5ee1e08ec6d6b98fa8eb6d80'),
 'mass': 0.82,
 'moons': 0,
 'name': 'Venus',
 'type': 'terrestrial planets'}
{'_id': ObjectId('5ee1e08ec6d6b98fa8eb6d81'),
 'mass': 1.0,
 'moons': 1,
 'name': 'Earth',
 'type': 'terrestrial planets'}
{'_id': ObjectId('5ee1e08ec6d6b98fa8eb6d82'),
 'mass': 0.11,
 'moons': 2,
 'name': 'Mars',
 'type': 'terrestrial planets'}
{'_id': ObjectId('5ee1e092c6d6b98fa8eb6d83'),
 'mass': 317.8,
 'moons': 79,
 'name': 'Jupiter',
 'type': 'gas giants'}
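
To make the effect of the three update operators used above explicit, here is a plain-Python illustration that applies them to a dictionary. It is only a sketch of their behavior on top-level fields (`apply_update` is a hypothetical helper, and dotted field paths are not handled); it is not how the server implements updates.

```python
import copy


def apply_update(document, update):
    """Return a copy of `document` with $set, $inc and $unset applied."""
    result = copy.deepcopy(document)
    for field, value in update.get("$set", {}).items():
        result[field] = value  # create or overwrite the field
    for field, amount in update.get("$inc", {}).items():
        result[field] = result.get(field, 0) + amount  # missing fields start at 0
    for field in update.get("$unset", {}):
        result.pop(field, None)  # removing an absent field is a no-op
    return result


jupiter = {"name": "Jupiter", "mass": 317.8, "moons": 75, "rings": True}
for update in (
    {"$inc": {"moons": 4}},
    {"$set": {"type": "gas giants"}},
    {"$unset": {"rings": ""}},
):
    jupiter = apply_update(jupiter, update)

print(jupiter)
# {'name': 'Jupiter', 'mass': 317.8, 'moons': 79, 'type': 'gas giants'}
```

The result matches the Jupiter document printed above, apart from the `_id` that MongoDB adds.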


## Delete

In [14]:
# delete one
delete_1 = wb_1.delete_one({"name": "Mercury"})

# print info
print('Deleted:', delete_1.deleted_count)

Deleted: 1


In [15]:
# delete many
delete_2 = wb_1.delete_many({"type": "terrestrial planets"})

# print info
print('Deleted:', delete_2.deleted_count)

Deleted: 3
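
Because `delete_many` removes every matching document, it can help to count the matches first. A minimal sketch, assuming a hypothetical helper `delete_if_expected`; the count and the delete are two separate calls, so this is a sanity check rather than a guarantee when other clients write to the collection.

```python
def delete_if_expected(collection, query, expected):
    """Delete documents matching `query` only if exactly `expected` match."""
    found = collection.count_documents(query)
    if found != expected:
        return 0  # leave the collection untouched
    return collection.delete_many(query).deleted_count
```

For instance, `delete_if_expected(wb_1, {"type": "terrestrial planets"}, 3)` would only delete when exactly three documents match.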


In [16]:
# drop collection
wb_1.drop()

# list collections
db.list_collection_names()

[]

# Scrape a Table

Here, we are going to scrape a table from a Wikipedia [page](https://en.wikipedia.org/wiki/Planet) and store the results in MongoDB.

[![Jupiter System Montage](https://images-assets.nasa.gov/image/PIA18469/PIA18469~orig.jpg)](https://images-assets.nasa.gov/image/PIA18469/PIA18469~orig.jpg)
[NASA: Building Planets Through Collisions](https://images.nasa.gov/details-PIA18469)

In [17]:
import requests
from bs4 import BeautifulSoup 

In [18]:
# let's connect to localhost
client = MongoClient()

# let's create a database 
db = client.smm695

# collection
wiki_planets = db.wiki_planets

# print connection
print("""
Database
==========
{}

Collection
==========
{}
""".format(db, wiki_planets), flush=True
)


Database
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'smm695')

Collection
Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'smm695'), 'wiki_planets')



In [19]:
# set a user agent
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.53 Safari/534.3'
header = {'user-agent': user_agent}

# set a proxy
proxy = {'https': 'http://***.***.***.***:***'}

# wikipedia page
url = "https://en.wikipedia.org/wiki/Planet"

In [20]:
# get the wiki page 
wb = requests.get(url, 
                  headers=header, 
                  #proxies=proxy, 
                  timeout=15)

# parsing the webpage text
sp = BeautifulSoup(wb.text, 'html.parser')

print("""
User-Agent:
===========
{}

IP:
===========
{}

""".format(wb.request.headers['user-agent'],
           #wb.headers['X-Client-IP']
           '****'
          ), flush=True)


User-Agent:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.53 Safari/534.3

IP:
****




In [21]:
# get target table
target = sp.find('table', {'class' : "wikitable sortable"})

print(target.prettify())

<table class="wikitable sortable" style="margin: 1em auto; text-align: center;">
 <tbody>
  <tr>
   <th>
   </th>
   <th class="unsortable">
    Name
   </th>
   <th>
    Equatorial
    <br/>
    diameter
    <sup class="reference" id="cite_ref-relativeearth_113-0">
     <a href="#cite_note-relativeearth-113">
      [i]
     </a>
    </sup>
   </th>
   <th>
    <a href="/wiki/Planetary_mass" title="Planetary mass">
     Mass
    </a>
    <sup class="reference" id="cite_ref-relativeearth_113-1">
     <a href="#cite_note-relativeearth-113">
      [i]
     </a>
    </sup>
   </th>
   <th>
    <a class="mw-redirect" href="/wiki/Semi-major_axis" title="Semi-major axis">
     Semi-major axis
    </a>
    (
    <a href="/wiki/Astronomical_unit" title="Astronomical unit">
     AU
    </a>
    )
   </th>
   <th>
    <a href="/wiki/Orbital_period" title="Orbital period">
     Orbital period
    </a>
    <br/>
    (years)
    <sup class="reference" id="cite_ref-relativeearth_113-2">
     <a href=

In [22]:
# 'th' gives us the column titles
# (named `columns` so the request `header` dict defined earlier is not overwritten)
columns = [th.text.strip('\n') for th in target.find_all('th')]

# 'tr' defines rows and 'td' holds the cells to extract
for row in target.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 0:
        d = {'Name': cells[1].a.get('title')}
        for i in range(2, 13):
            # drop non-ASCII characters and footnote markers from the column title
            column_name = columns[i].encode('ascii', 'ignore').decode(
                'utf-8').replace('[i]', '').replace('[j]', '')
            text = cells[i].text.strip('\n')
            try:
                # replace the Unicode minus sign (U+2212) so float() can parse negatives
                d[column_name] = float(text.replace('\u2212', '-'))
            except ValueError:
                # non-numeric cells (e.g. 'minimal', 'no') are stored as text
                d[column_name] = text
        insert = wiki_planets.insert_one(d)
        print(insert.inserted_id)

5ee1e0d2c6d6b98fa8eb6d85
5ee1e0d2c6d6b98fa8eb6d86
5ee1e0d2c6d6b98fa8eb6d87
5ee1e0d2c6d6b98fa8eb6d88
5ee1e0d2c6d6b98fa8eb6d89
5ee1e0d2c6d6b98fa8eb6d8a
5ee1e0d2c6d6b98fa8eb6d8b
5ee1e0d2c6d6b98fa8eb6d8c
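
The two clean-up steps in the loop above (tidying column titles and converting cells to numbers) can also be written as small, separately testable functions. This is a sketch under assumptions: `clean_column_name` and `parse_cell` are hypothetical helpers, and the snake_case keys they produce differ from the keys stored by the cell above.

```python
import re


def clean_column_name(title):
    """Turn a scraped header into a lowercase key made of letters, digits and '_'."""
    title = re.sub(r"\[[a-z]\]", "", title)       # drop footnote markers such as [i]
    title = re.sub(r"[^0-9A-Za-z]+", "_", title)  # replace any other run of symbols
    return title.strip("_").lower()


def parse_cell(text):
    """Return the cell as a float when possible, otherwise as stripped text."""
    text = text.strip()
    try:
        return float(text.replace("\u2212", "-"))  # U+2212 is the Unicode minus sign
    except ValueError:
        return text


print(clean_column_name("Axial tilt (\u00b0)"))       # axial_tilt
print(clean_column_name("Orbital period(years)[i]"))  # orbital_period_years
print(parse_cell("\u2212243.02\n"))                   # -243.02
print(parse_cell("minimal"))                          # minimal
```

Keys without spaces, parentheses or dots are easier to use in later queries, for example `wiki_planets.find({"axial_tilt": {"$gt": 20}})`.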


In [23]:
# let's see what we imported:
for doc in wiki_planets.find():
    pp(doc)

{'Atmosphere': 'minimal',
 'Axial tilt ()': 0.04,
 'Confirmedmoons': 0.0,
 'Equatorialdiameter': 0.382,
 "Inclinationto Sun's equator ()": 3.38,
 'Mass': 0.06,
 'Name': 'Mercury (planet)',
 'Orbital period(years)': 0.24,
 'Orbitaleccentricity': 0.206,
 'Rings': 'no',
 'Rotation period(days)': 58.64,
 'Semi-major axis (AU)': 0.39,
 '_id': ObjectId('5ee1e0d2c6d6b98fa8eb6d85')}
{'Atmosphere': 'CO2, N2',
 'Axial tilt ()': 177.36,
 'Confirmedmoons': 0.0,
 'Equatorialdiameter': 0.949,
 "Inclinationto Sun's equator ()": 3.86,
 'Mass': 0.82,
 'Name': 'Venus',
 'Orbital period(years)': 0.62,
 'Orbitaleccentricity': 0.007,
 'Rings': 'no',
 'Rotation period(days)': -243.02,
 'Semi-major axis (AU)': 0.72,
 '_id': ObjectId('5ee1e0d2c6d6b98fa8eb6d86')}
{'Atmosphere': 'N2, O2, Ar',
 'Axial tilt ()': 23.44,
 'Confirmedmoons': 1.0,
 'Equatorialdiameter': 1.0,
 "Inclinationto Sun's equator ()": 7.25,
 'Mass': 1.0,
 'Name': 'Earth',
 'Orbital period(years)': 1.0,
 'Orbitaleccentricity': 0.017,
 'Rings': 