## Import Packages

In [1]:
import pandas as pd
import json

## Task 1

Insert 4 vehicles from the State College Auto Database (can be the same ones you used in Redis) into MongoDB. One easy way to do this is to create a .json file from the data and then import the .json file into a new collection using the `mongoimport` tool, as you did in HW #3 to import the restaurant dataset.

In [14]:
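# Load the vehicle listings; the first CSV column is used as the row index.
cars = pd.read_csv('cars.csv', sep=',', index_col=0)
# Write one JSON document per line so mongoimport reads each row as a document.
with open('cars.json', 'w') as outfile:
    for record in cars.to_dict('index').values():
        json.dump(record, outfile)
        outfile.write('\n')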

In [15]:
! mongoimport --db local --collection cars --file cars.json

2018-11-06T01:22:33.494-0500	connected to: localhost
2018-11-06T01:22:34.035-0500	imported 10 documents


![](1.png)

## Task 2

Query your database and demonstrate that you can search on at least 2 attributes of the vehicles. Demonstrate a query on a single attribute and a query on more than one attribute. 

- query on a single attribute
![](2.png)

- query on more than one attribute
![](3.png)
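
The two kinds of query shown in the screenshots above can also be written as plain filter documents. The sketch below is a minimal illustration, assuming the field names `Make` and `Year` from the imported data; the helper names and the sample values are hypothetical and are not taken from the screenshots.

```python
def single_attribute_filter(make):
    """Filter on one attribute: equality on Make."""
    return {"Make": make}


def multi_attribute_filter(make, year):
    """Filter on two attributes; MongoDB combines top-level keys with AND."""
    return {"Make": make, "Year": year}


# Hypothetical sample values, for illustration only.
print(single_attribute_filter("Honda"))
print(multi_attribute_filter("Honda", 2015))
```

With pymongo, either dictionary could be passed to `collection.find(...)`; in the mongo shell the same document goes into `db.cars.find(...)`.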




## Task 3

Demonstrate a query for a vehicle not in the database. 

![](4.png)
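
A query that matches nothing is not an error: in pymongo, `find_one` returns `None` and `find` returns an empty cursor. The sketch below shows one way to handle that case, assuming a pymongo-style collection object is passed in; `lookup_vehicle` and the VIN in the usage comment are hypothetical.

```python
def lookup_vehicle(collection, vin):
    """Return a short status string for the vehicle with the given VIN.

    `collection` is assumed to expose find_one(filter), as a pymongo
    collection does.
    """
    doc = collection.find_one({"VIN": vin})
    if doc is None:
        # No document matched the filter.
        return "no vehicle with VIN {}".format(vin)
    return "found: {}".format(doc.get("Title", "(untitled)"))


# Usage (assumes a running MongoDB server and pymongo installed):
#   from pymongo import MongoClient
#   cars = MongoClient("localhost", 27017)["local"]["cars"]
#   print(lookup_vehicle(cars, "NOT-A-REAL-VIN"))
```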

## Task 4

What are differences you found between using MongoDB and Redis for loading and storing this data?  What would be a role for a database like Redis in storing this vehicle information? Can you imagine challenges in using the method you used for data import into MongoDB when dealing with Big Data?  How might you address any challenges in a big data environment? 

With MongoDB, the whole dataset can be imported directly from a JSON file with `mongoimport`, whereas in Redis each record has to be inserted one key-value pair at a time. In this case, Redis acts like a Python dictionary whose values are themselves dictionaries (nested key-value pairs).
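
To make the contrast concrete, the sketch below shows the shape the same records take in each store. It is an illustration only: the `car:<VIN>` key scheme, the helper name, and the sample record are assumptions, not the layout used in the earlier Redis assignment.

```python
def to_redis_hashes(records):
    """Turn vehicle dicts into (key, field-mapping) pairs, one Redis hash per vehicle."""
    pairs = []
    for record in records:
        key = "car:{}".format(record["VIN"])  # hypothetical key scheme
        # Redis hash values are flat strings, so every field is stringified.
        mapping = {field: str(value) for field, value in record.items()}
        pairs.append((key, mapping))
    return pairs


# Hypothetical sample record using field names from the dataset.
sample = [{"VIN": "TESTVIN0000000001", "Make": "Honda", "Year": 2015}]

for key, mapping in to_redis_hashes(sample):
    # With redis-py this would be one call per vehicle:
    #   r.hset(key, mapping=mapping)
    print(key, mapping)

# MongoDB accepts the same records unchanged, with their types intact,
# for example via collection.insert_many(sample) in pymongo.
```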

- challenges:
    - the dataset is too large to insert in one pass
    - the import takes up too much memory
    - importing the whole dataset takes too long
    
    
- possible solutions to address the challenges:
    - import the data in batches
        - `mongoimport --db local --collection cars --file cars.json --batchSize 1`
    - disable journaling to avoid the extra resource consumption of tracking the changes made (at the cost of durability if the server crashes)
        - `mongod --nojournal`
    - import the dataset in parallel with `--numInsertionWorkers`
        - `mongoimport --db local --collection cars --file cars.json --numInsertionWorkers 8`
    - convert the text data to BSON format
    - use GridFS when a file exceeds the BSON document size limit of 16 MB
    - if complex analysis of the data is involved, such as applying a machine learning algorithm for classification, one option is to use mongodb-spark-connector to analyze the data with Spark MLlib, then import the dataset from HDFS back into MongoDB
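
The batching idea from the list above can also be applied from Python instead of the command line. The following is a minimal sketch, assuming a pymongo-style collection with `insert_many`; the helper names `batched` and `import_in_batches` and the default batch size are hypothetical choices, not part of the original workflow.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items without loading everything at once."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def import_in_batches(collection, documents, batch_size=1000):
    """Insert documents batch by batch and return how many were sent.

    `collection` is assumed to expose insert_many(docs, ordered=...),
    as a pymongo collection does. `documents` can be any iterable,
    including a generator that streams records from disk.
    """
    total = 0
    for batch in batched(documents, batch_size):
        # ordered=False lets the server keep going past a failed document.
        collection.insert_many(batch, ordered=False)
        total += len(batch)
    return total
```

Because only one batch is held in memory at a time, peak memory depends on `batch_size` rather than on the size of the dataset.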

In [None]:
hdfs dfs -put cars.csv /user/data
# --packages takes a single comma-separated list of Maven coordinates
pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.1,org.mongodb:mongo-java-driver:3.8.0,org.apache.spark:spark-sql_2.11:2.3.0,com.stratio.datasource:spark-mongodb_2.11:0.12.0,org.mongodb:casbah_2.11:3.0.0,org.apache.spark:spark-catalyst_2.11:2.2.1

In [1]:
from pyspark.sql import SparkSession, Row

def parseInput(line):
    # Naive CSV split: fields[0] is the row index, and a value that
    # itself contains a comma would shift the remaining columns.
    fields = line.split(',')
    return Row(Title = fields[1], 
               Year = fields[2],
               Mileage = fields[3],
               Make = fields[4],
               Model = fields[5],
               Trim = fields[6],
               Style = fields[7],
               Engine = fields[8],
               Exterior_Color = fields[9],
               Interior_Color = fields[10],
               VIN = fields[11],
               Stock = fields[12],
               description = fields[13])

if __name__ == '__main__':
    spark = SparkSession.builder.appName('MongoDBIntegration').getOrCreate()
    # Read the CSV file that was uploaded to HDFS
    lines = spark.sparkContext.textFile('hdfs://localhost:9002/user/data/cars.csv')
    # Skip the header row so it is not stored as a vehicle
    header = lines.first()
    cars = lines.filter(lambda line: line != header).map(parseInput)
    carsDataset = spark.createDataFrame(cars)
    
    # write into MongoDB
    carsDataset.write \
               .format('com.stratio.datasource.mongodb') \
               .options(host="localhost:27017", database="local", collection="cars2") \
               .mode('append') \
               .save()

![](6.png)

![](7.png)