# Management and Pre-Processing Assessment

In this assessment you will go through the process of obtaining data, cleaning it, and then querying it from a database.  We are using data about food hygiene from UK open data.  The data stored is a copy of the official data.

Before attempting the tasks, you may find it helpful to complete the practice exercises "HTML and Page Scraping" and "Using MongoDB to Retrieve Information".

You may validate your answers by clicking "Validate" on the "Assignments" tab for this exercise.  Validation is performed automatically, using the tests in this notebook.  The final submission will be both machine checked and human marked.

## Question 0: Setup [1 mark]

Run the following cell to import the core dependencies required for this exercise.

In [22]:
# You don't need to write anything here
import requests
import json
from bs4 import BeautifulSoup
from urllib.robotparser import RobotFileParser
from nose.tools import assert_equal, assert_raises
from pymongo import MongoClient

In [23]:
# Check that the required libraries and functions have been imported
# You don't need to write anything here

try:
    imports = [requests, BeautifulSoup, RobotFileParser, assert_equal, assert_raises, json, MongoClient]
except NameError as e:
    print(e)
    raise AssertionError('You appear to be missing one of the required libraries or functions')
assert True
print('Successfully imported libraries and functions')

Successfully imported libraries and functions


## Question 1: Web APIs and Page Scraping



### Question 1(a) [2 marks]

Write a function `get_establishment_by_id` which accepts a parameter `id` and returns the name of that business as a string.  It should obtain the data from the [food hygiene ratings API](http://ratings.food.gov.uk/open-data/en-GB), using version 2 of the API.
- You may **assume that the ID exists**
- You should use the **`Establishments`** endpoint  

To complete this question you may wish to look at the information found [here](http://docs.python-requests.org/en/master/user/quickstart/).   

N.B. The version of requests installed on the server is relatively recent.  In a previous update, there was a breaking change which meant that only strings or bytes-like objects could be passed as header values.  As such, if you wish to pass an integer, you will have to pass it as a string, e.g. `{'header_name': '4321'}`.  
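The note on header values can be illustrated with a minimal sketch. `build_headers` is a hypothetical helper introduced only for this illustration; it is not part of `requests`. It converts every value to a string before the headers are handed to a request.

```python
def build_headers(api_version, accept='application/json'):
    """Return request headers whose values are all strings.

    build_headers is a hypothetical helper for illustration only.
    """
    return {
        'x-api-version': str(api_version),  # an int such as 2 becomes '2'
        'accept': accept,
    }


headers = build_headers(2)
print(headers)
```

The resulting dictionary could then be passed as the `headers` argument of `requests.get`.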

*Hint: Week 3, Guided Exercise 2, Scraping With Requests and Beautiful Soup*

In [24]:
def get_establishment_by_id(id):
    # YOUR CODE HERE
    # Request headers: select API version 2 and ask for a JSON response
    heads={
        'x-api-version':'2',
        'accept':'application/json'
    }
    # Look up the establishment with the given ID
    url = 'http://api.ratings.food.gov.uk/establishments/' + str(id)
    r = requests.get(url, headers=heads)
    
    # Return the business name of that establishment
    return r.json()['BusinessName']

In [25]:
assert_equal(get_establishment_by_id(990000), '1N1 Fashion N Pizza')
assert_equal(get_establishment_by_id(511819), 'Star Karahi')
assert_equal(get_establishment_by_id(692630), 'Baldiesburn Bed & Breakfast')
print('All tests successfully passed')

All tests successfully passed


### Question 1(b) [2 marks]

The data for this question is stored in HTML format at http://138.68.148.20/.  Use the Python `requests` library for any requests to the server.

**Write a function** `check_robots`, which accepts a **parameter** `url` and returns whether the server at http://138.68.148.20/ permits you to scrape that page.  

*Hint: Week 3, Guided Exercise 2, Robots.txt*
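As a rough illustration of how `RobotFileParser` evaluates rules, the sketch below parses a robots.txt held in memory instead of downloading one. The rules in `SAMPLE_ROBOTS_TXT`, the `example.test` host and the helper name `is_allowed` are all invented for this example and do not describe the real server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /data/scotland/
"""


def is_allowed(robots_txt, url, user_agent='*'):
    """Check a URL against robots.txt rules supplied as text."""
    rp = RobotFileParser()
    # parse() accepts the lines of a robots.txt file, so no network access is needed
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


allowed = is_allowed(SAMPLE_ROBOTS_TXT, 'http://example.test/index.html')
blocked = is_allowed(SAMPLE_ROBOTS_TXT, 'http://example.test/data/scotland/glasgow_city')
print(allowed, blocked)
```

Against a live server, `set_url` followed by `read` fetches the rules instead of `parse`.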

In [26]:
def check_robots(url):
    # YOUR CODE HERE
    """
    Use the RobotFileParser to check if a page on the server can be visited
    """
    # Extract the root of the URL so that its robots.txt can be located
    root = 'http://' + url.replace('http://', '').split('/')[0]
    rp = RobotFileParser()
    # Point the parser at the server's robots.txt and download it
    rp.set_url(root + '/robots.txt')
    rp.read()

    # Return whether a generic user agent ("*") may fetch the URL
    return rp.can_fetch("*", url)
    

In [27]:

# Testing whether your code works correctly.
# You don't need to write anything here

# Confirm an allowed page returns True
assert check_robots('http://138.68.148.20/index.html')
# Confirm a disallowed page returns False
assert not check_robots('http://138.68.148.20/data/scotland/glasgow_city')
print('Passed all the tests')

Passed all the tests


### Question 1(c) [3 marks]

Write a function which takes a URL as a **parameter**, and reads the **XML** on the page it points to.  The function should **return** a `dict` with the number of records in `EstablishmentCollection`, and the name of the first business.  
HINT: You can use `BeautifulSoup` for parsing XML as well as HTML.  The function should behave as follows:
- The function should use the Python **`requests`** library.
- **If** the page is banned by robots.txt, then it should not be visited, and should return **`None`**
- **If** the page does not return a **200 status code** in response, then it should not attempt to parse the result, and return **`None`**
- If the page is an **XML** file, it should return a dict in the following format: `{'first_business': 'business name', 'amount_of_records': 1234}`

N.B. Equality between Python `dict` objects does not depend on key order, so we will not take into account which key appears first.  

*Hint: Week 3, Guided Exercise 2, Parsing HTML - Scraping with Requests and Beautiful Soup*
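The expected return format can be sketched offline with Python's built-in `xml.etree.ElementTree`, which is one of the core libraries the task permits. The XML below is invented for illustration: the element names `EstablishmentDetail`, `BusinessName` and `ItemCount` follow the ones used in this notebook, while the root element, the `Header` wrapper and the business names are assumptions. `summarise` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the real pages are much larger
SAMPLE_XML = """\
<FHRSEstablishment>
  <Header>
    <ItemCount>2</ItemCount>
  </Header>
  <EstablishmentCollection>
    <EstablishmentDetail>
      <BusinessName>Example Cafe</BusinessName>
    </EstablishmentDetail>
    <EstablishmentDetail>
      <BusinessName>Sample Bakery</BusinessName>
    </EstablishmentDetail>
  </EstablishmentCollection>
</FHRSEstablishment>
"""


def summarise(xml_text):
    """Return the first business name and the record count from XML text."""
    root = ET.fromstring(xml_text)
    # find() returns the first matching element in document order
    first_name = root.find('.//EstablishmentDetail/BusinessName').text
    count = int(root.find('.//ItemCount').text)
    return {'first_business': first_name, 'amount_of_records': count}


summary = summarise(SAMPLE_XML)
print(summary)
```

The same structure applies when the text comes from a `requests` response, once the robots.txt and status code checks have passed.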

In [28]:

def parse_xml(url):
    """
    This function should parse the XML file, for example http://138.68.148.20/west_midlands/cannock_chase
    NOTE: Unlike for HTML, you need to use 'xml' as the second parameter for BeautifulSoup
    You may use any of Python's core libraries, or other libraries installed if you wish rather than BeautifulSoup
    """
    # YOUR CODE HERE 
    # Do not visit pages that robots.txt disallows
    if not check_robots(url):
        return None

    # Request the page once, asking for an XML response
    heads = {
        'x-api-version': '2',
        'accept': 'application/xml'
    }
    response = requests.get(url, headers=heads)

    # Do not attempt to parse anything other than a 200 response (e.g. a 404)
    if response.status_code != 200:
        return None

    # Parse the XML and build the answer dictionary
    soup = BeautifulSoup(response.text, 'xml')
    retDict = {}
    retDict["first_business"] = str(soup.findAll('EstablishmentDetail')[0].find('BusinessName').string)
    retDict["amount_of_records"] = int(soup.find('ItemCount').string)
    return retDict

In [29]:

# You don't need to write anything here
# Confirm that the function calls the check_robots function
tmp_check_robots = check_robots
del check_robots

try:
    parse_xml('http://138.68.148.20/data/west_midlands/cannock_chase')
except NameError:
    pass
else:
    raise AssertionError("parse_xml does not call check_robots")
finally:
    check_robots = tmp_check_robots

# TEST NOT VISITING PAGES PROHIBITED BY ROBOTS
# THIS SHOULD NOT CALL requests.get

tmp_requests = requests
del requests

try:
    parse_xml('http://138.68.148.20/data/scotland/glasgow_city')
    parse_xml('http://138.68.148.20/data/scotland/clackmannanshire')
except NameError:
    raise AssertionError("The function should not be using requests on this URL")
finally:
    requests = tmp_requests
    # TEST OUTPUT RESPONSE
    assert_equal(parse_xml('http://138.68.148.20/data/west_midlands/cannock_chase'),
    {'amount_of_records': 731, 'first_business': '1st Choice Pizza/Fish & Chips'})
    assert_equal(parse_xml('http://138.68.148.20/data/wales/swansea'),
    {'amount_of_records': 1700, 'first_business': '360 Beach and Watersports Centre'})
    # TEST HANDLING 404
    assert_equal(parse_xml('http://138.68.148.20/data/calderdale'), None)

    print('All tests successfully passed')



All tests successfully passed


## Question 2: Retrieving Data from MongoDB

We will assume that you have successfully cleaned the data, and have stored it in the MongoDB database.  Using the following PyMongo configuration, answer the following questions about the data:

In [30]:
# These are the credentials to connect to the database
# You don't need to write anything here, but you need to run this cell

client = MongoClient('mongodb://cpduser:M13pV5woDW@mongodb/health_data')
db = client.health_data

### Question 2(a) [1 mark]

Write a **function** `get_count`, which takes a PyMongo collection object as a parameter and **returns** the number of businesses in the collection.  

*Hint: Week 3, Guided Exercise 4, Using MongoDB to Retrieve Information*

In [31]:
def get_count(collection):
    # YOUR CODE HERE
    """
    Return an integer which gives the amount of unique businesses in the given collection
    """
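    # Count every document in the collection; an empty filter matches all documents.
    # count_documents() replaces Collection.count(), which was removed in PyMongo 4
    # (this assumes PyMongo 3.7 or later is installed).
    size = collection.count_documents({})

    # Return a plain int so that it compares equal in the assertions
    return int(size)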


In [32]:
# You don't need to write anything here
assert_equal(get_count(db.uk), 511819)
assert_equal(get_count(db.swansea), 1700)
assert_equal(get_count(db.westminster), 4315)
assert_equal(get_count(db.newcastle_upon_tyne), 2308)
print('Passed all the tests')

Passed all the tests


### Question 2(b) [3 marks]

Write a **function** `get_rating_value_percentage` which **returns** the **percentage** of businesses which were awarded an overall `RatingValue` of 5.  The function should accept a parameter `collection` of type `Collection`, for which it should return the percentage as a **float** between 0 and 1 (that is, as a proportion, so 50% is returned as `0.5`).  

*Hint: Week 3, Guided Exercise 4, Cursors*
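The calculation itself is a ratio of two counts, which can be sketched without a database. `FakeCollection` below is a stand-in written for this illustration; it mimics only the `count_documents` method of a PyMongo collection and supports only simple equality filters. The sample documents and the helper name `proportion_with_rating` are likewise invented.

```python
class FakeCollection:
    """Stand-in for a PyMongo collection, for illustration only."""

    def __init__(self, documents):
        self.documents = documents

    def count_documents(self, query):
        # Supports only simple equality filters such as {'RatingValue': 5};
        # an empty filter matches every document
        return sum(
            all(doc.get(key) == value for key, value in query.items())
            for doc in self.documents
        )


def proportion_with_rating(collection, rating=5):
    """Return the share of documents with the given RatingValue, from 0 to 1."""
    total = collection.count_documents({})
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty collection
    return collection.count_documents({'RatingValue': rating}) / total


sample = FakeCollection([
    {'RatingValue': 5}, {'RatingValue': 3}, {'RatingValue': 5}, {'RatingValue': 4},
])
share = proportion_with_rating(sample)
print(share)
```

Guarding against an empty collection is a design choice of this sketch; the assessment tests do not exercise that case.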

In [33]:
def get_rating_value_percentage(collection):
    """
    Return a float between 0 and 1 of the amount with a RatingValue of 5
    """
    # YOUR CODE HERE
    # Reuse the function above to get the total number of documents
    totalSize = get_count(collection)
    # Count the documents whose RatingValue is 5; count_documents() is used
    # because Cursor.count() was removed in PyMongo 4
    targetSize = collection.count_documents({'RatingValue': 5})

    # Division in Python 3 already returns a float
    return targetSize / totalSize

In [34]:
# You don't need to write anything here
assert_equal(get_rating_value_percentage(db.uk), 0.5287240215779406)
assert_equal(get_rating_value_percentage(db.swansea), 0.6688235294117647)
assert_equal(get_rating_value_percentage(db.westminster), 0.4600231749710313)
assert_equal(get_rating_value_percentage(db.newcastle_upon_tyne), 0.5966204506065858)
print('Passed all the tests')

Passed all the tests


### Question 2(c) [3 marks]

Write a **function** `get_no_geocode` which will find establishments with region Scotland which do not have a `Geocode` recorded.  The parameter `establishment_type` is a string, which will indicate the type of establishment to search for.  All queries should be run on the `uk` collection.

The function should **return** a PyMongo **`Cursor`** object, with only the following fields:
- `BusinessName`, `BusinessType`, and `LocalAuthorityName`.  
- `_id` should not be included  

*Hint: Week 3, Guided Exercise 4, Returning Part of a Document*
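A query of this kind has two parts, a filter and a projection, which can be sketched as plain dictionaries. The helper name `build_no_geocode_query` is hypothetical, and the lowercase region value `'scotland'` is an assumption about how the region is stored. In MongoDB, filtering a field on `None` matches documents where the field is null as well as documents where it is missing.

```python
def build_no_geocode_query(establishment_type):
    """Return (filter, projection) dictionaries suitable for Collection.find."""
    query_filter = {
        'Region': 'scotland',            # assumed spelling of the stored region
        'BusinessType': establishment_type,
        'Geocode': None,                 # matches a null or a missing Geocode
    }
    projection = {
        'BusinessName': 1,
        'BusinessType': 1,
        'LocalAuthorityName': 1,
        '_id': 0,                        # _id is returned unless excluded explicitly
    }
    return query_filter, projection


query_filter, projection = build_no_geocode_query('Restaurant/Cafe/Canteen')
print(query_filter)
print(projection)
```

The two dictionaries would be passed as the first and second arguments of `find`, which returns a `Cursor`.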

In [35]:
def get_no_geocode(establishment_type):
    # YOUR CODE HERE
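    # Query the uk collection for matching establishments and return the Cursor.
    # {'Geocode': None} matches documents where Geocode is null or missing.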
    retResult = db.uk.find({'Region': 'scotland', 'BusinessType' : establishment_type, 'Geocode': None}, 
                      {'BusinessName': 1, 'BusinessType':1, 'LocalAuthorityName':1, '_id':0 })
    return retResult


In [36]:

# You don't need to write anything here

cursor = get_no_geocode('Restaurant/Cafe/Canteen' )
for cur in cursor:

    assert '_id' not in cur
    assert 'BusinessType' in cur
    assert_equal(cur['BusinessType'], 'Restaurant/Cafe/Canteen')
    assert 'BusinessName' in cur
    assert 'LocalAuthorityName' in cur

    assert_equal(len(list(get_no_geocode('Takeaway/sandwich shop'))), 405)
    assert_equal(len(list(get_no_geocode('Retailers - other'))), 1079)
print('Passed all the tests')



Passed all the tests


### Question 2(d) [5 marks]

What were the earliest and latest dates on which an inspection was carried out? Write a **function** which returns a dict in the form `{'earliest_date': 'YYYY-MM-DD', 'latest_date': 'YYYY-MM-DD'}`.  

*Hint: Week 3, Guided Exercise 4, MongoDB Aggregation Framework*

In [37]:
def get_earliest_and_latest_dates(collection):
    # YOUR CODE HERE
    # Let MongoDB compute the earliest and latest dates in one aggregation.
    # ASSUMPTION: the inspection date is stored in a field named 'RatingDate';
    # adjust the field name if the collection uses a different one.
    cursor = collection.aggregate(
        [
            # Ignore documents that have no inspection date recorded
            {"$match": {"RatingDate": {"$ne": None}}},
            # Group every document together and take the min and max date
            {"$group": {"_id": None,
                        "earliest": {"$min": "$RatingDate"},
                        "latest": {"$max": "$RatingDate"}}}
        ]
    )
    result = next(cursor)

    def to_text(value):
        # Format datetime values as YYYY-MM-DD; for string values, keep the
        # first ten characters (assumes ISO-style strings such as 2016-09-15T00:00:00)
        return value.strftime('%Y-%m-%d') if hasattr(value, 'strftime') else str(value)[:10]

    return {'earliest_date': to_text(result['earliest']),
            'latest_date': to_text(result['latest'])}

In [None]:
# You don't need to write anything here
assert_equal(get_earliest_and_latest_dates(db.uk),{'earliest_date': '1989-01-01', 'latest_date': '2016-09-15'})
assert_equal(get_earliest_and_latest_dates(db.swansea),{'earliest_date': '2010-10-06', 'latest_date': '2016-08-16'})
assert_equal(get_earliest_and_latest_dates(db.westminster), 
                {'earliest_date': '1999-01-27', 'latest_date': '2016-09-13'})
assert_equal(get_earliest_and_latest_dates(db.newcastle_upon_tyne), 
                {'earliest_date': '2005-07-08', 'latest_date': '2016-09-06'})
print('Passed all the tests')

## Question 3: Exploring and fixing data [5 marks]

During this week Huw has talked about issues which may arise when integrating data. For this task, consider the data described in this notebook, and any other source you wish.
- Provide two concise examples of possible issues, and their mitigation in relation to these data
- Each example should be approximately one paragraph


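Using multiple data sources in a data science analysis can reveal additional trends and insights that a single data source cannot provide.
However, integrating two or more data sources can raise issues with both data values and data attributes, as described below.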

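The first issue is inconsistency between data sources: two or more sources may hold the same data values under different names. For example, in a project I am currently working on, the attribute for 'progress status' was named STATUS in one RDBMS and PROCESS in another, because the two data sources were managed by different vendors. We resolved this by using JOINs in the RDBMS queries that mapped both attributes to the same name.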
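The same mitigation can be sketched in Python by mapping each source's attribute name to one shared name before the records are combined. The shared name `progress_status`, the sample records and the helper `harmonise_fields` are invented for this illustration; only `STATUS` and `PROCESS` come from the example above.

```python
# Map each source-specific attribute name to one shared name
FIELD_ALIASES = {'STATUS': 'progress_status', 'PROCESS': 'progress_status'}


def harmonise_fields(record, aliases=FIELD_ALIASES):
    """Return a copy of the record with aliased keys renamed."""
    return {aliases.get(key, key): value for key, value in record.items()}


source_a = {'id': 1, 'STATUS': 'open'}
source_b = {'id': 2, 'PROCESS': 'closed'}
combined = [harmonise_fields(source_a), harmonise_fields(source_b)]
print(combined)
```

Keys that are not listed in the alias table are left unchanged.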

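The second issue is differing data properties. Even when numeric data look alike, the analysis will differ greatly if the sources use different unit systems. For example, when integrating Asian (Korean) data, which uses the metric system, with UK data that uses imperial units, the data should be standardised on either the metric or the imperial system, chosen according to the quantity, quality and reliability of each source, so that the analysis gives correct results. Values in lb (pounds) and kg, miles and km, and Fahrenheit (°F) and Celsius (°C) must be converted to a common unit before the data can be used accurately.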
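A minimal sketch of such a standardisation, assuming metric is chosen as the common system, is shown below. The function names are hypothetical; the conversion factors are the standard definitions of the pound and the mile.

```python
LB_TO_KG = 0.45359237    # one pound in kilograms, by definition
MILE_TO_KM = 1.609344    # one mile in kilometres, by definition


def pounds_to_kg(pounds):
    """Convert a mass in pounds to kilograms."""
    return pounds * LB_TO_KG


def miles_to_km(miles):
    """Convert a distance in miles to kilometres."""
    return miles * MILE_TO_KM


def fahrenheit_to_celsius(degrees_f):
    """Convert a temperature in degrees Fahrenheit to degrees Celsius."""
    return (degrees_f - 32) * 5 / 9


print(pounds_to_kg(10), miles_to_km(1), fahrenheit_to_celsius(212))
```

Applying the conversions once, when the data are loaded, keeps every later comparison in a single unit system.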
