# 400_mongo_datasets

## Purpose
This notebook investigates whether a MongoDB cluster would serve our project better than pickle files for saving and loading our dataframes. We became very interested in using MongoDB after the talk Joe Drumgoole gave us at the start of the project. The datasets we prepared previously have to be converted to JSON format before they can be uploaded to the cluster.

We set up a free cluster called 'DSnP' and used the code below to convert the datasets into JSON format for upload.

Note: the pymongo package for Python is needed to access the MongoDB cluster. Your IP address also needs to be whitelisted on the cluster.

## Datasets
* _Input_: 300_dataset1.pkl, 350_dataset2.pkl
* _Output_: dataset1.json, dataset2.json

In [10]:
from pymongo import MongoClient
import os
import sys
import pandas as pd
module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)
%matplotlib inline

# Saving the prepared first dataset as a JSON file for MongoDB

In [11]:
ds1_df = pd.read_pickle('../../data/processed/300_dataset1.pkl')
ds1_df.head(5)

Unnamed: 0,company_name,roles,country_code,state_code,region,city,status,category_list,category_group_list,funding_rounds,...,Sales and Marketing,Science and Engineering,Sports,Sustainability,Transportation,Travel and Tourism,Video,Technology,Finance,Communication
0,Intel,"company,investor",USA,CA,SF Bay Area,Santa Clara,ipo,"Hardware,Manufacturing,Product Design,Semicond...","Design,Hardware,Manufacturing,Science and Engi...",1,...,0,1,0,0,0,0,0,0,0,0
1,Intercomp,company,USA,OH,Cleveland,Medina,operating,"Hardware,Software","Hardware,Software",1,...,0,0,0,0,0,0,0,1,0,0
2,Microsoft,"company,investor",USA,WA,Seattle,Redmond,ipo,"Cloud Computing,Collaboration,Consumer Electro...","Consumer Electronics,Hardware,Internet Service...",1,...,0,0,0,0,0,0,0,1,0,0
3,Compaq,"company,investor",USA,CA,SF Bay Area,Palo Alto,acquired,"Hardware,Information Technology,Software","Hardware,Information Technology,Software",1,...,0,0,0,0,0,0,0,1,0,0
4,Toyota Motor Corporation,"company,investor",JPN,Unknown,Unknown,Unknown,ipo,"Automotive,Mobile,Transportation","Mobile,Transportation",1,...,0,0,0,0,1,0,0,0,0,1


In [12]:
ds1_df.shape

(78357, 53)

This is the dataframe we will use for our analysis and results for the following RQ:
- "Correlation between the Industry and Location of a startup and the amount of funding received"

In [5]:
ds1_df.to_json("../../data/mongo_json/dataset1.json", orient="records")

# Saving the prepared second dataset as a JSON file for MongoDB

In [13]:
ds2_df = pd.read_pickle('../../data/processed/350_dataset2.pkl')
ds2_df.head(5)

Unnamed: 0,first_name,last_name,gender,company_name,funding_rounds,funding_total_usd,primary_role,country_code,state_code,city,...,Sales and Marketing,Science and Engineering,Sports,Sustainability,Transportation,Travel and Tourism,Video,Technology,Finance,Communication
0,Steve,Wozniak,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,...,0,0,0,0,0,0,0,1,0,0
1,Kevin,Harvey,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,...,0,0,0,0,0,0,0,1,0,0
2,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,...,0,0,0,0,0,0,0,1,0,0
3,Armas,Markkula,male,Apple,4,6150250000.0,company,USA,CA,Cupertino,...,0,0,0,0,0,0,0,1,0,0
4,Kristee,Rosendahl,female,Apple,4,6150250000.0,company,USA,CA,Cupertino,...,0,0,0,0,0,0,0,1,0,0


'ds2_df' is the resulting dataframe that we will use for our analysis and results. We therefore save it as a JSON file with orient='records', so that each row is written as its own JSON object keyed by column name.

In [9]:
ds2_df.to_json("../../data/mongo_json/dataset2.json", orient="records")
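
To make the effect of orient='records' concrete, the sketch below converts a tiny dataframe and parses the result back. The two-row dataframe and its values are made up for illustration and are not taken from our datasets.

```python
import json

import pandas as pd

# Toy dataframe standing in for ds1_df / ds2_df (illustrative values only).
toy_df = pd.DataFrame(
    {"company_name": ["Intel", "Intercomp"], "funding_rounds": [1, 1]}
)

# With no path argument, to_json returns the JSON text as a string.
records_json = toy_df.to_json(orient="records")

# orient="records" yields a JSON array with one object per row,
# keyed by column name; the dataframe index is not written out.
records = json.loads(records_json)
print(records)
```

A file in this shape is a single JSON array, which mongoimport reads when given its --jsonArray option; each object in the array then becomes one document.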

## Testing some queries from MongoDB using the pymongo package

In [8]:
client = MongoClient("mongodb+srv://user_dsnp:dsnp2018@dsnpcluster-humda.mongodb.net/admin")
db = client.DSnP
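
The connection string above contains the username and password in plain text. A minimal sketch of one alternative is to assemble the URI from environment variables and percent-escape the credentials. The function name `build_mongo_uri`, the variable names `MONGO_USER`, `MONGO_PASSWORD` and `MONGO_HOST`, and the fallback values are all hypothetical and only serve the illustration.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, database="admin"):
    """Build a mongodb+srv URI, percent-escaping the credentials."""
    return "mongodb+srv://{}:{}@{}/{}".format(
        quote_plus(user), quote_plus(password), host, database
    )


# MONGO_USER, MONGO_PASSWORD and MONGO_HOST are hypothetical variable names;
# the fallbacks are placeholders, not real credentials.
uri = build_mongo_uri(
    os.environ.get("MONGO_USER", "example_user"),
    os.environ.get("MONGO_PASSWORD", "p@ss/word"),
    os.environ.get("MONGO_HOST", "cluster.example.net"),
)

# In the notebook this string would be passed on as:
# client = MongoClient(uri)
```

Escaping matters when a password contains characters such as '@' or '/', which would otherwise be read as part of the URI structure.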

In [13]:
ds1_mongo_df = pd.DataFrame(list(db.Dataset_1.find()))
ds1_mongo_df.head(5)

Unnamed: 0,Administrative Services,Advertising,Agriculture and Farming,Biotechnology,Clothing and Apparel,Commerce and Shopping,Communication,Community and Lifestyle,Consumer Goods,Content and Publishing,...,funding_rounds,funding_total_usd,last_funding_on,org_uuid,primary_role,region,roles,state_code,status,type
0,0,0,0,0,0,0,0,0,0,0,...,1,549000.0,1970-12-31,6681b1b0-0cea-6a4a-820d-60b15793fa66,company,Cleveland,company,OH,operating,organization
1,0,0,0,0,0,0,0,0,0,0,...,1,1500000.0,1982-02-14,10a3b2fd-b142-046b-7d8f-3b1aa4877aca,company,SF Bay Area,"company,investor",CA,acquired,organization
2,0,0,0,0,0,0,0,0,0,0,...,1,2510000.0,1968-07-31,1e4f199c-363b-451b-a164-f94571075ee5,company,SF Bay Area,"company,investor",CA,ipo,organization
3,0,0,0,0,0,0,0,0,0,0,...,1,1000000.0,1981-09-01,fd80725f-53fc-7009-9878-aeecf1e9ffbb,company,Seattle,"company,investor",WA,ipo,organization
4,0,0,0,0,0,0,1,0,0,0,...,1,42000000.0,1982-04-14,12b90373-ab49-a56a-4b4e-c7b3e9236faf,company,Unknown,"company,investor",Unknown,ipo,organization


In [14]:
ds1_mongo_df.shape

(78357, 54)

In [None]:
ds2_mongo_df = pd.DataFrame(list(db.Dataset_2.find()))
ds2_mongo_df.head(5)

In [None]:
ds2_mongo_df.shape

**Note that the dataframe loaded from MongoDB has an extra column called _"_id"_ that holds each row's id in MongoDB, which is why its shape is (78357, 54) rather than (78357, 53).**
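
If the extra column is unwanted, it can be removed before the dataframe is built. The following is a minimal sketch that uses a hand-written list in place of the query result; the helper name `drop_mongo_id` and the sample documents are hypothetical.

```python
def drop_mongo_id(documents):
    """Return copies of the documents without MongoDB's "_id" field."""
    return [
        {key: value for key, value in doc.items() if key != "_id"}
        for doc in documents
    ]


# Stand-in for list(db.Dataset_1.find()); ids and values are illustrative.
documents = [
    {"_id": "id-0", "company_name": "Intel", "funding_rounds": 1},
    {"_id": "id-1", "company_name": "Intercomp", "funding_rounds": 1},
]

rows = drop_mongo_id(documents)
print(sorted(rows[0]))  # column names left after removing "_id"

# Alternative: ask the server to leave the field out via a projection,
# e.g. db.Dataset_1.find({}, {"_id": 0}).
```

Passing `rows` to pd.DataFrame would then give the same set of columns as the pickled dataframe.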

In [5]:
pickle_save_time = %timeit -o ds1_df.to_pickle("../../data/processed/300_dataset1.pkl")
pickle_save_time

153 ms ± 5.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<TimeitResult : 153 ms ± 5.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)>

In [6]:
pickle_read_time = %timeit -o pd.read_pickle("../../data/processed/300_dataset1.pkl")
pickle_read_time

118 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


<TimeitResult : 118 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)>

#### The following cell will not run unless your IP address has been whitelisted.

In [None]:
mongo_ds1_time = %timeit -o pd.DataFrame(list(db.Dataset_1.find()))
mongo_ds1_time
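
The %timeit magic is only available inside IPython/Jupyter. As a rough equivalent for a plain script, the sketch below times any zero-argument callable with the standard-library timeit module and reports a mean and standard deviation per call. The names `time_call` and `example_workload` are hypothetical; in practice the workload would be a load such as pd.read_pickle(...) or the MongoDB query above.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Return (mean, stdev) of the per-call time of func, in seconds."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


def example_workload():
    # Hypothetical stand-in for a load such as pd.read_pickle(...).
    return sum(range(10_000))


mean_s, stdev_s = time_call(example_workload, repeat=7, number=10)
print(f"{mean_s * 1e3:.3f} ms ± {stdev_s * 1e3:.3f} ms per loop")
```

Timing both loading routes with the same helper and the same `repeat` and `number` settings keeps the comparison like for like.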

### MongoDB: Why didn't it work?

We spent a lot of time hoping that we could use Mongo for our project. However, given our time constraints and the pickle timings above, it was not ideal. There are several reasons why we decided to stick with pickle files to save and load our dataframes:
- Saving the dataframe to JSON and then manually running 'mongoimport' to load it into the cluster was very time consuming.
- The MongoDB cluster could not be reached from our laptops on UCD Wireless; it was only accessible over eduroam, and eduroam did not work on one of our laptops.
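
For reference, the manual export and import step in the first bullet could in principle be replaced by inserting the rows through pymongo. The sketch below is not part of the workflow in this notebook: `upload_records` and `FakeCollection` are hypothetical names, the batch size is an arbitrary choice, and the fake collection only exists so the example runs without a cluster. The single pymongo call assumed is Collection.insert_many.

```python
def upload_records(collection, records, batch_size=1000):
    """Insert records into collection in batches; return the number sent."""
    sent = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        collection.insert_many(batch)  # pymongo Collection.insert_many
        sent += len(batch)
    return sent


class FakeCollection:
    """Hypothetical in-memory stand-in so the sketch runs without a cluster."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)


# In the notebook the records could come from
# json.loads(ds1_df.to_json(orient="records")) and the collection
# would be db.Dataset_1.
records = [{"company_name": f"company_{i}"} for i in range(2500)]
fake = FakeCollection()
print(upload_records(fake, records))
```

This would still require network access to the cluster, so it does not address the second bullet.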

Because we were both going to be collaborating on this project simultaneously, it made sense not to pursue the use of Mongo any longer. However, it was interesting to learn about the functions Mongo has, and we became very familiar with the tools available while investigating whether we could use it for our project.
