# Using MongoDB

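## Data Preparation
### Problem Description
In this problem set, you will work with another type of infobox data: audit it, clean it, come up with a data model, insert it into MongoDB, and then run some queries against the database. The data set contains data about arachnids.

For this exercise, your task is to parse the file, process only the fields that appear as keys in the FIELDS dictionary, and return a list of dictionaries of cleaned values.

You should complete the following steps:

- Rename the dictionary keys according to the mapping in the FIELDS dictionary.
- Remove the redundant parenthetical description from "rdf-schema#label", for example "(spider)".
- If "name" is "NULL" or contains non-alphanumeric characters, set it to the same value as "label".
- If the value of a field is "NULL", convert it to None.
- If "synonym" has a value, convert it to an array (list) by stripping the "{}" characters and splitting the string on "|". The rest of the cleanup is up to you, for example removing "*" prefixes. If there is only a single synonym, the value should still be a list.
- Strip leading and trailing whitespace from all fields, if there is any.

The output structure should look like this: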
```
[ { 'label': 'Argiope',
    'uri': 'http://dbpedia.org/resource/Argiope_(spider)',
    'description': 'The genus Argiope includes rather large and spectacular spiders that often ...',
    'name': 'Argiope',
    'synonym': ["One", "Two"],
    'classification': {
                      'family': 'Orb-weaver spider',
                      'class': 'Arachnid',
                      'phylum': 'Arthropod',
                      'order': 'Spider',
                      'kingdom': 'Animal',
                      'genus': None
                      }
  },
  { 'label': ... , }, ...
]
```
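
As a rough illustration of the string-cleaning rules above, the sketch below applies them to a few made-up sample values. The helper names `clean_label`, `clean_name`, and `clean_synonyms` and the sample strings are hypothetical and not part of the exercise template; `str.isalnum()` is used here as one strict reading of "alphanumeric".

```python
import re


def clean_label(raw):
    """Remove a parenthetical qualifier such as "(spider)" and trim whitespace."""
    return re.sub(r"\(.*?\)", "", raw).strip()


def clean_name(name, label):
    """Fall back to the label when the name is "NULL" or not purely alphanumeric."""
    name = name.strip()
    if name == "NULL" or not name.isalnum():
        return label
    return name


def clean_synonyms(raw):
    """Turn "{a|b}" or a single value into a list; "NULL" becomes None."""
    if raw == "NULL":
        return None
    parts = raw.strip().strip("{}").split("|")
    # Drop "*" prefixes and surrounding whitespace from each synonym
    return [p.replace("*", "").strip() for p in parts]


# Hypothetical sample values, for illustration only
print(clean_label("Argiope (spider)"))             # Argiope
print(clean_name("NULL", "Argiope"))               # Argiope
print(clean_synonyms("{* One | Two}"))             # ['One', 'Two']
print(clean_synonyms("Cyrene Peckham & Peckham"))  # ['Cyrene Peckham & Peckham']
```

A single synonym still comes back as a one-element list, which keeps the field type consistent across documents.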

In [1]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
In this problem set you work with another type of infobox data, audit it,
clean it, come up with a data model, insert it into MongoDB and then run some
queries against your database. The set contains data about Arachnid class
animals.

Your task in this exercise is to parse the file, process only the fields that
are listed in the FIELDS dictionary as keys, and return a list of dictionaries
of cleaned values. 

The following things should be done:
- keys of the dictionary changed according to the mapping in FIELDS dictionary
- trim out redundant description in parentheses from the 'rdf-schema#label'
  field, like "(spider)"
- if 'name' is "NULL" or contains non-alphanumeric characters, set it to the
  same value as 'label'.
- if a value of a field is "NULL", convert it to None
- if there is a value in 'synonym', it should be converted to an array (list)
  by stripping the "{}" characters and splitting the string on "|". Rest of the
  cleanup is up to you, e.g. removing "*" prefixes etc. If there is a singular
  synonym, the value should still be formatted in a list.
- strip leading and trailing whitespace from all fields, if there is any
- the output structure should be as follows:

[ { 'label': 'Argiope',
    'uri': 'http://dbpedia.org/resource/Argiope_(spider)',
    'description': 'The genus Argiope includes rather large and spectacular spiders that often ...',
    'name': 'Argiope',
    'synonym': ["One", "Two"],
    'classification': {
                      'family': 'Orb-weaver spider',
                      'class': 'Arachnid',
                      'phylum': 'Arthropod',
                      'order': 'Spider',
                      'kingdom': 'Animal',
                      'genus': None
                      }
  },
  { 'label': ... , }, ...
]

  * Note that the value associated with the classification key is a dictionary
    with taxonomic labels.
"""
import codecs
import csv
import json
import pprint
import re

DATAFILE = 'arachnid.csv'
FIELDS ={'rdf-schema#label': 'label',
         'URI': 'uri',
         'rdf-schema#comment': 'description',
         'synonym': 'synonym',
         'name': 'name',
         'family_label': 'family',
         'class_label': 'class',
         'phylum_label': 'phylum',
         'order_label': 'order',
         'kingdom_label': 'kingdom',
         'genus_label': 'genus'}


def process_file(filename, fields):

    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        # Skip the first three rows after the header row
        for _ in range(3):
            next(reader)

        for line in reader:
            # YOUR CODE HERE
            # Drop the parenthetical qualifier from the label, e.g. "(spider)"
            line["rdf-schema#label"] = re.sub(r'\(.+\)', '', line['rdf-schema#label']).strip()
            if line['rdf-schema#label'] == 'NULL':
                line['rdf-schema#label'] = None

            # Fall back to the label when the name is missing or has non-word characters
            if line['name'] == 'NULL' or re.search(r'\W', line['name']):
                line['name'] = line['rdf-schema#label']

            if line['synonym'] == 'NULL':
                line['synonym'] = None
            else:
                # str.replace returns a new string, so the result must be kept
                line['synonym'] = [syn.replace('*', '').strip()
                                   for syn in parse_array(line['synonym'])]

            item = {}
            item['classification'] = {}

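            for key in fields:
                value = line[key]
                # Trim string values and map the "NULL" marker to None
                if isinstance(value, str):
                    value = value.strip()
                    if value == 'NULL':
                        value = None

                # Taxonomy columns (*_label) are nested under 'classification'
                if re.search(r'_label', key):
                    item['classification'][fields[key]] = value
                else:
                    item[fields[key]] = value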

            data.append(item)
    return data


def parse_array(v):
    if (v[0] == "{") and (v[-1] == "}"):
        v = v.lstrip("{")
        v = v.rstrip("}")
        v_array = v.split("|")
        v_array = [i.strip() for i in v_array]
        return v_array
    return [v]


def test():
    data = process_file(DATAFILE, FIELDS)
    print ("Your first entry:")
    pprint.pprint(data[0])
    first_entry = {
        "synonym": None, 
        "name": "Argiope", 
        "classification": {
            "kingdom": "Animal", 
            "family": "Orb-weaver spider", 
            "order": "Spider", 
            "phylum": "Arthropod", 
            "genus": None, 
            "class": "Arachnid"
        }, 
        "uri": "http://dbpedia.org/resource/Argiope_(spider)", 
        "label": "Argiope", 
        "description": "The genus Argiope includes rather large and spectacular spiders that often have a strikingly coloured abdomen. These spiders are distributed throughout the world. Most countries in tropical or temperate climates host one or more species that are similar in appearance. The etymology of the name is from a Greek name meaning silver-faced."
    }

    assert len(data) == 76
    assert data[0] == first_entry
    assert data[17]["name"] == "Ogdenia"
    assert data[48]["label"] == "Hydrachnidiae"
    assert data[14]["synonym"] == ["Cyrene Peckham & Peckham"]

if __name__ == "__main__":
    test()

Your first entry:
{'classification': {'class': 'Arachnid',
                    'family': 'Orb-weaver spider',
                    'genus': None,
                    'kingdom': 'Animal',
                    'order': 'Spider',
                    'phylum': 'Arthropod'},
 'description': 'The genus Argiope includes rather large and spectacular '
                'spiders that often have a strikingly coloured abdomen. These '
                'spiders are distributed throughout the world. Most countries '
                'in tropical or temperate climates host one or more species '
                'that are similar in appearance. The etymology of the name is '
                'from a Greek name meaning silver-faced.',
 'label': 'Argiope',
 'name': 'Argiope',
 'synonym': None,
 'uri': 'http://dbpedia.org/resource/Argiope_(spider)'}
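
The entry printed above has its taxonomy fields grouped under `classification`. A compact sketch of that reshaping step follows, assuming a row is a flat dict of strings keyed by the original column names. The `reshape` helper, the shortened `FIELDS_SAMPLE` mapping, and the sample row are hypothetical.

```python
# Shortened, illustrative version of the FIELDS mapping
FIELDS_SAMPLE = {"rdf-schema#label": "label",
                 "URI": "uri",
                 "family_label": "family",
                 "genus_label": "genus"}


def reshape(row, fields):
    """Rename keys via `fields` and nest *_label columns under 'classification'."""
    item = {"classification": {}}
    for source_key, target_key in fields.items():
        value = row[source_key].strip()
        if value == "NULL":
            value = None
        if source_key.endswith("_label"):
            item["classification"][target_key] = value
        else:
            item[target_key] = value
    return item


# Made-up row; the "unused" column is ignored because it is not in the mapping
row = {"rdf-schema#label": " Argiope ",
       "URI": "http://dbpedia.org/resource/Argiope_(spider)",
       "family_label": "Orb-weaver spider",
       "genus_label": "NULL",
       "unused": "x"}
print(reshape(row, FIELDS_SAMPLE))
```

Iterating over the mapping rather than the row is what restricts the output to the fields listed in the mapping.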


### Bug
AttributeError: 'DictReader' object has no attribute 'next'  
Python 3 iterators have no `.next()` method, so change `reader.next()` to `next(reader)`.
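
A small sketch of the fix, using an in-memory CSV so it runs without the data file. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up CSV text standing in for the real file
sample = io.StringIO("name,kind\nmeta,skip me\nArgiope,spider\n")
reader = csv.DictReader(sample)

# Python 3: advance the reader with the built-in next(), not reader.next()
skipped = next(reader)
rows = list(reader)

print(skipped["name"])          # meta
print(rows[0]["name"])          # Argiope
print(hasattr(reader, "next"))  # False on Python 3
```

The built-in `next()` is also available in Python 2.6 and later, so this form works on both major versions.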

## Inserting Data into MongoDB
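
The script in this section reads `arachnid.json`, but the step that writes that file is not shown. One plausible way to produce it, assuming the file is simply the list returned by `process_file` serialized as JSON, is sketched here. The `save_json` helper, the sample record, and the temporary path are illustrative assumptions.

```python
import json
import os
import tempfile


def save_json(data, path):
    """Write a list of cleaned records to a JSON file."""
    with open(path, "w") as f:
        json.dump(data, f, indent=2)


# Illustrative record shaped like the cleaned output; not real exercise data
records = [{"label": "Argiope", "synonym": None,
            "classification": {"class": "Arachnid", "genus": None}}]

path = os.path.join(tempfile.mkdtemp(), "arachnid.json")
save_json(records, path)

with open(path) as f:
    loaded = json.load(f)

# None is written as JSON null and read back as None
print(loaded == records)  # True
```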


In [2]:
"""
Complete the insert_data function to insert the data into MongoDB.
"""

import json

def insert_data(data, db):

    # Your code here. Insert the data into a collection 'arachnid'
    db.arachnid.insert_many(data)


if __name__ == "__main__":
    
    from pymongo import MongoClient
    client = MongoClient("mongodb://123.207.27.93:27017")
    db = client.examples

    with open('arachnid.json') as f:
        data = json.load(f)
        insert_data(data, db)
        print(db.arachnid.find_one())

{'_id': ObjectId('5ceb90046c04431d277f4255'), 'synonym': None, 'name': 'Argiope', 'classification': {'kingdom': 'Animal', 'family': 'Orb-weaver spider', 'order': 'Spider', 'phylum': 'Arthropod', 'genus': None, 'class': 'Arachnid'}, 'uri': 'http://dbpedia.org/resource/Argiope_(spider)', 'label': 'Argiope', 'description': 'The genus Argiope includes rather large and spectacular spiders that often have a strikingly coloured abdomen. These spiders are distributed throughout the world. Most countries in tropical or temperate climates host one or more species that are similar in appearance. The etymology of the name is from a Greek name meaning silver-faced.'}
