# Question 1

#### Imports

In [24]:
import pandas as pd
from pymongo import MongoClient
import numpy as np
import warnings

This line suppresses the deprecation warning raised by cursor.count().

In [25]:
warnings.simplefilter('ignore')
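
As an alternative to silencing the warning, counts can be taken with count_documents, which this notebook already uses further down. Cursor.count() was deprecated in PyMongo 3.7 and, as far as I know, removed in PyMongo 4. The sketch below assumes `collection` is a pymongo Collection such as db.phoneOrig; the helper name `count_matching` is hypothetical and not part of PyMongo.

```python
def count_matching(collection, query=None):
    """Count documents matching `query` without going through a cursor.

    `collection` is assumed to be a pymongo Collection (for example db.phoneOrig).
    `count_matching` is a hypothetical helper, not a PyMongo function.
    """
    # An empty filter counts every document in the collection.
    return collection.count_documents(query or {})
```

With a live connection, count_matching(db.phoneOrig, {'components.area': {'$ne': 800}}) would replace the find(...).count() pattern.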

#### DB connection

In [2]:
client = MongoClient('localhost', 27017)

In [3]:
db = client['book']

#### Phones and country data from book database on cs1

I originally found and exported a collection called phones from a database called book on cs1. This collection contained 100000 phone numbers. I imported it into my own book database as phoneOrig, and then imported the countries.json file from cs1.

In [10]:
db.list_collection_names()

['phoneOrig', 'countries']

#### Exploration of phones and countries

In [11]:
db.phoneOrig.find_one()

{'_id': 78005550007.0,
 'components': {'country': 7.0,
  'area': 800.0,
  'prefix': 555.0,
  'number': 5550007.0},
 'display': '+7 800-5550007'}

This displays the document with the maximum country code.

In [22]:
cur = db.phoneOrig.aggregate([{'$sort':{'components.country':-1}}])
cur.next()

{'_id': 88005550030.0,
 'components': {'country': 8.0,
  'area': 800.0,
  'prefix': 555.0,
  'number': 5550030.0},
 'display': '+8 800-5550030'}

This displays the document with the minimum country code.

In [23]:
cur = db.phoneOrig.aggregate([{'$sort':{'components.country':1}}])
cur.next()

{'_id': 18005550018.0,
 'components': {'country': 1.0,
  'area': 800.0,
  'prefix': 555.0,
  'number': 5550018.0},
 'display': '+1 800-5550018'}
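
The two sorts above can probably be folded into a single aggregation. The sketch below defines a \$group pipeline with \$min and \$max, then checks the same reduction in plain Python on the three sample documents shown above. The names `min_max_pipeline`, `minCountry`, and `maxCountry` are my own, and the pipeline was not run against the original data.

```python
# One $group stage over the whole collection (_id: None means a single group).
min_max_pipeline = [
    {'$group': {
        '_id': None,
        'minCountry': {'$min': '$components.country'},
        'maxCountry': {'$max': '$components.country'},
    }}
]

# The same reduction in plain Python, using only the country codes
# from the three sample documents printed in this notebook.
samples = [
    {'components': {'country': 7.0}},
    {'components': {'country': 8.0}},
    {'components': {'country': 1.0}},
]
codes = [doc['components']['country'] for doc in samples]
print(min(codes), max(codes))  # prints: 1.0 8.0
```

With a live connection, next(db.phoneOrig.aggregate(min_max_pipeline)) should return one document carrying both values.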

### Country Document Example
Each country document contains a 'callingCode' field, which I plan to match against the country code of each phone number. I will have to deal with the fact that a country sometimes has multiple codes and that the codes are stored as strings.

In [37]:
db.countries.find_one()

{'_id': ObjectId('5fc70731d65a4788957605b4'),
 'name': {'common': 'Anguilla',
  'official': 'Anguilla',
  'native': {'eng': {'official': 'Anguilla', 'common': 'Anguilla'}}},
 'tld': ['.ai'],
 'cca2': 'AI',
 'ccn3': '660',
 'cca3': 'AIA',
 'cioc': '',
 'currency': ['XCD'],
 'callingCode': ['1264'],
 'capital': 'The Valley',
 'altSpellings': ['AI'],
 'region': 'Americas',
 'subregion': 'Caribbean',
 'languages': {'eng': 'English'},
 'translations': {'deu': {'official': 'Anguilla', 'common': 'Anguilla'},
  'fra': {'official': 'Anguilla', 'common': 'Anguilla'},
  'hrv': {'official': 'Anguilla', 'common': 'Angvila'},
  'ita': {'official': 'Anguilla', 'common': 'Anguilla'},
  'jpn': {'official': 'アングィラ', 'common': 'アンギラ'},
  'nld': {'official': 'Anguilla', 'common': 'Anguilla'},
  'por': {'official': 'Anguilla', 'common': 'Anguilla'},
  'rus': {'official': 'Ангилья', 'common': 'Ангилья'},
  'spa': {'official': 'Anguila', 'common': 'Anguilla'},
  'svk': {'official': 'Anguilla', 'common': 'Ang
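
Because 'callingCode' is a list of strings while the phone documents store the country code as a number, some conversion is needed before matching. A minimal sketch, assuming that missing fields and non-numeric entries should simply be skipped (that policy and the helper name `calling_codes_as_ints` are my assumptions):

```python
def calling_codes_as_ints(country):
    """Return a country's calling codes as a list of ints.

    Entries that cannot be parsed as integers are skipped (an assumed policy).
    """
    codes = []
    for code in country.get('callingCode', []):
        try:
            codes.append(int(code))
        except (TypeError, ValueError):
            continue
    return codes

print(calling_codes_as_ints({'callingCode': ['1264']}))  # prints: [1264]
```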

Since the country codes are between 1 and 8, I decided to see how many countries I could match these phone numbers to.

In [32]:
cur = db.countries.find({'callingCode': 
                         {'$elemMatch': 
                          {'$in': ['1','2','3','4','5','6','7','8']}}},
                        {'name.common':1, 'callingCode':1, 'ccn3':1})
cur.count()

3

The countries with calling codes in this range are Canada, Russia, and the United States. Furthermore, Canada and the US both have calling code 1. I would like to use some other information, such as the 'ccn3' field, to differentiate them.

In [33]:
cur.next()

{'_id': ObjectId('5fc70731d65a4788957605d3'),
 'name': {'common': 'Canada'},
 'ccn3': '124',
 'callingCode': ['1']}

In [34]:
cur.next()

{'_id': ObjectId('5fc70731d65a478895760689'),
 'name': {'common': 'Russia'},
 'ccn3': '643',
 'callingCode': ['7']}

In [35]:
cur.next()

{'_id': ObjectId('5fc70731d65a4788957606aa'),
 'name': {'common': 'United States'},
 'ccn3': '840',
 'callingCode': ['1']}
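
To make the ambiguity concrete, the three results above can be grouped by calling code in plain Python. This sketch uses only the fields printed above and shows that pairing the calling code with 'ccn3' gives a key that is unique within this small sample; whether that holds for the full collection is not checked here.

```python
from collections import defaultdict

# The three projected documents returned by the query above.
countries = [
    {'name': {'common': 'Canada'}, 'ccn3': '124', 'callingCode': ['1']},
    {'name': {'common': 'Russia'}, 'ccn3': '643', 'callingCode': ['7']},
    {'name': {'common': 'United States'}, 'ccn3': '840', 'callingCode': ['1']},
]

# Map each calling code to the countries that use it.
by_code = defaultdict(list)
for country in countries:
    for code in country['callingCode']:
        by_code[code].append(country['name']['common'])

shared = {code: names for code, names in by_code.items() if len(names) > 1}
print(shared)  # prints: {'1': ['Canada', 'United States']}

# Pairing the calling code with ccn3 separates Canada from the United States.
keys = {(code, c['ccn3']) for c in countries for code in c['callingCode']}
print(len(keys))  # prints: 3
```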

I queried the data to see if the area code could be used to differentiate the numbers, but every area code is 800.

In [62]:
db.phoneOrig.find({'components.area': {'$ne':800}}).count()

0

I decided that, instead of matching only a subset of these numbers, I would create a dataset that could be matched fully.

# Populate phone info

This function takes a country document and creates a run of consecutive phone number documents for each of its calling codes. I decided to use the 'ccn3' field to populate the area code. I had to add a default area code (800) because one country was missing this code.

In [63]:
def populatePhones(nxt, start, stop):
    """Insert phone documents numbered start..stop for each calling code of a country."""
    # Use ccn3 as the area code; fall back to 800 if it is missing or empty.
    try:
        area = int(nxt['ccn3'])
    except (KeyError, ValueError):
        area = 800

    for code in nxt['callingCode']:
        country = int(code)
        for i in range(start, stop + 1):
            db.phones.insert_one({
              'components': {
                'country': country,
                'area': area,
                'prefix': i // 1000,  # leading digits of the number
                'number': i
              },
              'display': "+" + str(country) + " " + str(area) + "-" + str(i)
            })
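
Inserting one document at a time means one round trip per phone number. A possible variation, sketched here under the assumption that holding one country's documents in memory is acceptable, is to build the documents first and insert them in a batch. `build_phone_docs` and its `default_area` parameter are hypothetical names; this is not the function that produced the data in this notebook.

```python
def build_phone_docs(country_doc, start, stop, default_area=800):
    """Build phone documents for every calling code of one country document."""
    # Same area-code rule as populatePhones: ccn3, or a default if unusable.
    try:
        area = int(country_doc['ccn3'])
    except (KeyError, ValueError):
        area = default_area

    docs = []
    for code in country_doc['callingCode']:
        country = int(code)
        for i in range(start, stop + 1):
            docs.append({
                'components': {
                    'country': country,
                    'area': area,
                    'prefix': i // 1000,
                    'number': i,
                },
                'display': f'+{country} {area}-{i}',
            })
    return docs

docs = build_phone_docs({'callingCode': ['1264'], 'ccn3': '660'}, 500000, 500500)
print(len(docs), docs[0]['display'])  # prints: 501 +1264 660-500000
```

With a live connection, db.phones.insert_many(docs) would then write the whole batch in one call.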

The only fields required for the function are 'callingCode' and 'ccn3'.

In [64]:
cur = db.countries.find({},{'callingCode':1, 'ccn3':1})

In [65]:
for nxt in cur:
    populatePhones(nxt, 500000, 500500)

### Example of new phones data

In [66]:
db.phones.find_one()

{'_id': ObjectId('5fc904886a0ddd82a7728ad9'),
 'components': {'country': 1264, 'area': 660, 'prefix': 500, 'number': 500000},
 'display': '+1264 660-500000'}

I expect the new collection with country info to contain the same number of documents as the phones collection.

In [67]:
db.phones.count_documents({})

124248
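
As a quick sanity check on this count: each calling code receives the numbers 500000 through 500500 inclusive, so the total should be a multiple of 501. The figure of 248 calling-code entries below is inferred from the division, not read from the countries collection.

```python
numbers_per_code = 500500 - 500000 + 1   # range(start, stop + 1) includes stop
total_documents = 124248                 # count reported above

# If the total divides evenly, the quotient is the number of calling-code entries.
code_entries, remainder = divmod(total_documents, numbers_per_code)
print(numbers_per_code, code_entries, remainder)  # prints: 501 248 0
```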

In [68]:
db.list_collection_names()

['phones', 'phoneOrig', 'countries']

### Matching countries and populating a new collection

This runs through the countries collection and matches the phone numbers. I chose to combine the data in Python before inserting it into the new phoneInfo collection. I did this because everything could be done in one step, but I could also have used update operations with \\$set to perform this action in several steps.

In [69]:
cur1 = db.countries.find({},{'_id':0})

In [70]:
for country in cur1:
    codes = country['callingCode']
    # Same area-code rule as populatePhones: ccn3, or 800 if it is missing or empty.
    try:
        area = int(country['ccn3'])
    except (KeyError, ValueError):
        area = 800

    for code in codes:
        cur2 = db.phones.find({'components.country': int(code),
                               'components.area': area},
                              {'_id': 0})
        # Attach the full country document to each matching phone number.
        for phone in cur2:
            phone['countryInfo'] = country
            db.phoneInfo.insert_one(phone)
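
The update-based alternative mentioned above might look roughly like the following. The sketch only builds the filter and update documents, so it runs without a database; `build_set_updates` is a hypothetical helper, and I did not run this approach against the collections in this notebook.

```python
def build_set_updates(country, default_area=800):
    """Return (filter, update) pairs that attach `country` to its phone documents."""
    # Same area-code rule as the matching loop: ccn3, or a default if unusable.
    try:
        area = int(country['ccn3'])
    except (KeyError, ValueError):
        area = default_area

    updates = []
    for code in country['callingCode']:
        flt = {'components.country': int(code), 'components.area': area}
        updates.append((flt, {'$set': {'countryInfo': country}}))
    return updates

anguilla = {'name': {'common': 'Anguilla'}, 'ccn3': '660', 'callingCode': ['1264']}
for flt, update in build_set_updates(anguilla):
    print(flt)  # prints: {'components.country': 1264, 'components.area': 660}
```

With a live connection, each pair could be passed to db.phones.update_many(flt, update). Note that this modifies the phones collection in place rather than producing a separate phoneInfo collection.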

In [71]:
db.list_collection_names()

['phones', 'phoneOrig', 'phoneInfo', 'countries']

The new collection contains a phone number and all of the country information.

In [72]:
db.phoneInfo.find_one()

{'_id': ObjectId('5fc907966a0ddd82a7747031'),
 'components': {'country': 1264, 'area': 660, 'prefix': 500, 'number': 500000},
 'display': '+1264 660-500000',
 'countryInfo': {'name': {'common': 'Anguilla',
   'official': 'Anguilla',
   'native': {'eng': {'official': 'Anguilla', 'common': 'Anguilla'}}},
  'tld': ['.ai'],
  'cca2': 'AI',
  'ccn3': '660',
  'cca3': 'AIA',
  'cioc': '',
  'currency': ['XCD'],
  'callingCode': ['1264'],
  'capital': 'The Valley',
  'altSpellings': ['AI'],
  'region': 'Americas',
  'subregion': 'Caribbean',
  'languages': {'eng': 'English'},
  'translations': {'deu': {'official': 'Anguilla', 'common': 'Anguilla'},
   'fra': {'official': 'Anguilla', 'common': 'Anguilla'},
   'hrv': {'official': 'Anguilla', 'common': 'Angvila'},
   'ita': {'official': 'Anguilla', 'common': 'Anguilla'},
   'jpn': {'official': 'アングィラ', 'common': 'アンギラ'},
   'nld': {'official': 'Anguilla', 'common': 'Anguilla'},
   'por': {'official': 'Anguilla', 'common': 'Anguilla'},
   'rus': 

The collection also has the same number of documents as the phones collection.

In [73]:
db.phoneInfo.count_documents({})

124248

In [74]:
client.close()