In [12]:
import pandas as pd
import pathlib
import numpy as np
from pprint import pprint
from functools import partial

from src.data import datasets, utils, Dataset
from src.data.datasets import build_dataset_dict, load_dataset, fetch_and_unpack
from src.data.fetch import fetch_text_file, hash_file
from src.data.utils import list_dir, head_file, normalize_labels, read_space_delimited
from src.paths import interim_data_path, raw_data_path, processed_data_path

In [13]:
%load_ext autoreload
%autoreload 2

# Adding and Processing the Yelp Dataset

The Yelp dataset is free to use under a reasonable license agreement, but you must register to download a copy.
https://www.yelp.com/dataset/download



In [14]:
dataset_name='yelp'

First, we should download the raw data tarfile and check its hash. If you have an independent way of verifying the hash, do so. Otherwise, you can compute it automatically the first time around, as shown below.

This hash will be used for comparison on subsequent downloads. We can use `build_dataset_dict` to download from a URL or to generate hashes from an existing file.
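
For intuition, the compute-once, compare-later pattern described above can be sketched with the standard library's `hashlib`. This is an illustration only: `sha1_of_file` and `verify_hash` are hypothetical names, not the project's `hash_file` helper, and the demo file is a throwaway stand-in for the real tarball.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks (hypothetical helper)."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives never need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_hash(path, expected):
    """Raise ValueError if the file's SHA-1 digest differs from `expected`."""
    actual = sha1_of_file(path)
    if actual != expected:
        raise ValueError(f"hash mismatch for {path}: {actual} != {expected}")
    return True


# Demonstrate on a throwaway file rather than the real tarball.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"hello yelp")
    recorded = sha1_of_file(demo)           # first time: compute and record
    check_ok = verify_hash(demo, recorded)  # later: compare against the record
```

Recording the digest once and comparing on later fetches catches a corrupted or silently replaced download, although it cannot prove that the first copy was the right one.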

In [19]:
help(build_dataset_dict)

Help on function build_dataset_dict in module src.data.datasets:

build_dataset_dict(hash_type='sha1', hash_value=None, url=None, name=None, file_name=None, from_txt=None)
    Build a raw dataset dictionary entry for a file.
    
    This will fetch the file if `url` is specified.
    If `hash_value` is specified, the file hash will be checked.
    Otherwise, the hash will be computed.
    
    hash_type: {'sha1', 'md5', 'sha256'}
    hash_value: string or None
        if None, hash will be computed from downloaded file
    file_name: string or None
        Name of downloaded file. If None, will be the last component of the URL
    url: string
        URL to fetch
    from_txt: string
        contents of file to create.
        One of `url` or `from_txt` must be specified
    name: string
        this field can be used to indicate the type of datafile being downloaded.
        Usually, this is just informational. However, if you specify names `DESCR` or `LICENSE`,
        the contents 

In [21]:
rawdata = build_dataset_dict(file_name='yelp_dataset.tar.gz'); rawdata

{'name': None,
 'file_name': 'yelp_dataset.tar.gz',
 'hash_type': 'sha1',
 'hash_value': '096ac5ced8a9229ecc5116e77b6be8d8f90fdacb'}

The **name** field can be used to indicate the type of datafile being downloaded. Usually, this is just informational. However, if you specify names `DESCR` or `LICENSE`, the downloaded (text) file will be used as the dataset description and license text, respectively.

Usually you will want to give these files unique names so they don't clash with other downloaded files (for example, "LICENSE.txt" is a terrible name to use). We use the **file_name** option for this:

In [22]:
# Notice that the raw data file is in the RAW directory
list_dir(raw_data_path)

['.gitkeep', 'yelp_dataset.tar.gz']

Every dataset should have an associated **license** and **description**. These are stored as `LICENSE` and `DESCR` attributes respectively.

There are 3 typical ways to add license and description text:
* Download text files from somewhere on the Internet, using the `name` parameter in `build_dataset_dict` to tag them as either `LICENSE` or `DESCR`,
* Create a `{dataset_name}.license` or `{dataset_name}.readme` file in `src/data`. When a dataset matching this name is created, this information will automatically be used, or
* Specify the text directly when using `build_dataset_dict` (via `from_txt`).

To set the LICENSE and DESCR for this dataset, we can simply create text files in the appropriate location.
If the base of the filename matches the dataset name (`yelp` in this case), they will be used automatically.
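
The name-matching convention can be pictured with a small lookup like the one below. This is a sketch under assumptions, not the project's actual code: `find_dataset_text` is a hypothetical helper, and a temporary directory stands in for `src/data`.

```python
import tempfile
from pathlib import Path


def find_dataset_text(dataset_name, kind, search_dir):
    """Return the text of `{dataset_name}.{kind}` in `search_dir`, or None.

    Hypothetical helper; `kind` is 'license' or 'readme'.
    """
    candidate = Path(search_dir) / f"{dataset_name}.{kind}"
    if candidate.exists():
        return candidate.read_text()
    return None


# A temporary directory stands in for src/data in this illustration.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "yelp.license").write_text("placeholder license text")
    found_license = find_dataset_text("yelp", "license", tmp)
    found_readme = find_dataset_text("yelp", "readme", tmp)  # never created
```

Here `found_license` holds the placeholder text, while `found_readme` is `None` because no matching file was created.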

In this case, the license is available from https://s3-media1.fl.yelpcdn.com/assets/srv0/engineering_pages/06cb5ad91db8/assets/vendor/yelp-dataset-agreement.pdf

In [36]:
%%writefile ../src/data/yelp.license
YELP DATASET TERMS OF USE

Last Updated: July 26, 2018
This document governs the terms under which you may access and use the data that Yelp
makes available for download through this website (or made available by other means) for
academic purposes (the “Data”). This document incorporates the terms of the following
additional document, including all future amendments or modifications thereto (collectively, and
together with this document, the “Data Agreement” ):

Yelp Terms of Service:

By accessing or using the Data, you agree to be bound by the Data Agreement and represent
that the contact information you provide to Yelp is correct. If you access or use the Data on
behalf of a university, school, or other entity, you represent that you have authority to bind such
entity and its affiliates to the Data Agreement and that it is fully binding upon them. In such
case, the term “you” and “your” will refer to such entity and its affiliates. If you do not have
authority, or if you do not agree with the terms of the Data Agreement, you may not access or
use the Data. You should read and keep a copy of each component of the Data Agreement for
your records. In the event of a conflict among them, the terms of this document will control.

1. Purpose

The Data is made available by Yelp Inc. (“Yelp”) to enable you to access valuable
local information to develop an academic project as part of an ongoing course of study. With this
in mind, Yelp reserves the right to continually review and evaluate all uses of the Data provided
under the Data Agreement.

2. Changes

Yelp reserves the right to modify or revise the Data Agreement at any time. If the
change is deemed to be material and it is foreseeable that such change could be adverse to
your interests, Yelp will provide you notice of the change to this Data Agreement by sending you
an email to the email you provided to Yelp. Your continued use of the Data after the notice of
material change will constitute your acceptance of and agreement to such changes. 

If YOU DO
NOT WISH TO BE BOUND TO ANY NEW TERMS, YOU MUST TERMINATE THE DATA
AGREEMENT BY IMMEDIATELY CEASING USE OF THE DATA AND DELETING IT FROM
ANY SYSTEMS OR MEDIA.

3. License

Subject to the terms set forth in the Data Agreement (specifically the restrictions set
forth in Section 4 below), Yelp grants you a royalty-free, non-exclusive, revocable,
non-sublicensable, non-transferable, fully paid-up right and license during the Term to use,
access, and create derivative works of the Data in electronic form for academic purposes only.
You may not use the Data for any other purpose without Yelp’s prior written consent. You
acknowledge and agree that Yelp may request information about, review, audit, and/or monitor
your use of the Data at any time in order to confirm compliance with the Data Agreement.
Nothing herein shall be construed as a license to use Yelp’s registered trademarks or service
marks, or any other Yelp branding.

4. Restrictions

You agree that you will not, and will not encourage, assist, or enable others to:
A. display, perform, or distribute any of the Data, or use the Data to update or create
your own business listing information (i.e. you may not publicly display any of the Data to any
third party, especially reviews and other user generated content, as this is a private data set
challenge and not a license to compete with or disparage with Yelp);
B. use the Data in connection with any commercial purpose;
C. use the Data in any manner or for any purpose that may violate any law or regulation,
or any right of any person including, but not limited to, intellectual property rights, rights of
privacy and/or rights of personality, or which otherwise may be harmful (in Yelp's sole
discretion) to Yelp, its providers, its suppliers, end users of this website, or your end users;
D. use the Data on behalf of any third party without Yelp’s consent;
E. create, redistribute or disclose any summary of, or metrics related to, the Data (e.g.,
the number of reviewed business included in the Data and other statistical analysis) to any third
party or on any website or other electronic media not expressly covered by this Agreement, this
provision however, excludes any disclosures necessary for academic purposes, including
without limitation the publication of academic articles concerning your use of the Data;
F. use the Data in a manner that is competitive in nature with Yelp;
G. display Data in a manner that could reasonably imply an endorsement, relationship or
affiliation with or sponsorship between you or a third party and Yelp, other than your permitted
use of the Data under the terms of the Data Agreement;
H. rent, lease, sell, transfer, assign, or sublicense, any part of the Data;
I. modify, rate, rank, review, vote or comment on, or otherwise respond to the content
contained in the Data;
J. display the Data or publicly communicate in any way, or on any site, in a manner that
disparages Yelp or its products or services, or infringes any Yelp intellectual property or other
rights;
K. use the Data in a manner that could reasonably be interpreted to suggest that Yelp is
the author or entity that is responsible, in whole or in part, for the creation or development of any
Data or that such Data represents the views of Yelp; or
L. use the Data for any purpose prohibited by law.

5. Ownership

As between you and Yelp, the Data and any derivative works you create from the
Data, and all intellectual property rights contained in the foregoing, are and will at all times
remain the sole and exclusive property of Yelp and are protected by applicable intellectual
property laws and treaties (whether those rights happen to be registered or not, and wherever in
the world those rights may exist), or as otherwise set forth in the contest rules where the various
submitted solutions must be made available under a specified open source license, such as the
MIT License.

6. Indemnity

You agree that your use of the Data is at your own risk and you agree to hold
harmless, defend (subject to Yelp's right to participate with counsel it selects) and indemnify
Yelp and its subsidiaries, affiliates, officers, agents, employees and suppliers from and against
any and all claims, damages, liabilities, costs and fees (including reasonable attorneys’ fee)
arising from, or in any way related to your or your end users’ use or implementation of the Data.
You will not agree to any settlement that imposes any obligation on Yelp without Yelp's prior
consent.
                  
7. No Warranties by Yelp; No Entitlement to Support from Yelp

THE DATA IS PROVIDED
“AS IS”, “WITH ALL FAULTS” AND “AS AVAILABLE” WITHOUT WARRANTY, OF ANY KIND
AND AT YOUR SOLE RISK. EXCEPT TO THE MAXIMUM EXTENT REQUIRED BY
APPLICABLE LAW, YELP DISCLAIMS ALL WARRANTIES, REPRESENTATIONS,
CONDITIONS, AND DUTIES, WHETHER EXPRESS, IMPLIED OR STATUTORY,
REGARDING THE DATA, INCLUDING, WITHOUT LIMITATION, ANY AND ALL IMPLIED
WARRANTIES OF MERCHANTABILITY, ACCURACY, RESULTS OF USE, RELIABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE, INTERFERENCE WITH QUIET
ENJOYMENT AND NON-INFRINGEMENT OF THIRD-PARTY RIGHTS. FURTHER, YELP
DISCLAIMS ANY WARRANTY THAT YOUR USE OF THE DATA WILL BE UNINTERRUPTED,
SECURE, TIMELY OR ERROR FREE. FOR THE AVOIDANCE OF DOUBT, YOU
ACKNOWLEDGE AND AGREE THAT THE DATA AGREEMENT DOES NOT ENTITLE YOU
TO ANY SUPPORT FOR THE DATA. NO ADVICE OR INFORMATION, WHETHER ORAL OR
IN WRITING, OBTAINED BY YOU FROM YELP WILL CREATE ANY WARRANTY NOT
EXPRESSLY STATED IN THE DATA AGREEMENT.

8. Limitation of Liability

THE DATA IS BEING PROVIDED FREE OF CHARGE.
ACCORDINGLY, YOU AGREE THAT YELP SHALL HAVE NO LIABILITY ARISING FROM OR
BASED ON YOUR USE OF THE DATA. REGARDLESS OF WHETHER ANY REMEDY SET
FORTH HEREIN FAILS OF ITS ESSENTIAL PURPOSE OR OTHERWISE, AND EXCEPT FOR
BODILY INJURY, IN NO EVENT SHALL YELP OR ITS SUBSIDIARIES, AFFILIATES,
OFFICERS, AGENTS, EMPLOYEES AND SUPPLIERS BE LIABLE TO YOU OR TO ANY
THIRD PARTY UNDER ANY TORT, CONTRACT, NEGLIGENCE, STRICT LIABILITY OR
OTHER LEGAL OR EQUITABLE THEORY FOR ANY LOST PROFITS, LOST OR
CORRUPTED DATA, COMPUTER FAILURE OR MALFUNCTION, INTERRUPTION OF
BUSINESS, OR OTHER SPECIAL, INDIRECT, INCIDENTAL OR CONSEQUENTIAL
DAMAGES OF ANY KIND ARISING OUT OF THE USE OR INABILITY TO USE THE DATA,
EVEN IF YELP HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH LOSS OR DAMAGES
AND WHETHER OR NOT SUCH LOSS OR DAMAGES ARE FORESEEABLE. ANY CLAIM
ARISING OUT OF OR RELATING TO THE DATA AGREEMENT MUST BE BROUGHT WITHIN
(1) YEAR AFTER THE OCCURRENCE OF THE EVENT GIVING RISE TO SUCH CLAIM. IF
SUCH CLAIM IS NOT FILED, THEN THAT CLAIM IS PERMANENTLY BARRED. THIS
APPLIES TO YOU AND YOUR SUCCESSORS, AND TO YELP AND ITS SUCCESSORS.
NOTWITHSTANDING THE FOREGOING, SINCE THIS LICENSE IS PROVIDED TO YOU AT
NO CHARGE, YELP’S MAXIMUM LIABILITY UNDER THIS DATA AGREEMENT SHALL NOT,
IN ANY EVENT, EXCEED US$50.00.

9. Limited Relationship
                  
Yelp and You are, and will remain, independent contractors, and
nothing in the Data Agreement will be construed as creating an employer-employee
relationship, partnership or joint venture. Although you are permitted to publicize your use of the
Data, you agree not to make any other statements, without the prior written consent of Yelp,
implying a different kind of relationship between you and Yelp, including any implied
endorsement by Yelp. You do not have any authority of any kind to bind Yelp in any respect
whatsoever.

10. Term and Termination
                  
This Data Agreement is effective as of the date you download or
otherwise access the Data (“Effective Date” ) and shall continue in full force and effect for a term
of twelve (12) months from the Effective Date, unless earlier terminated by the parties or expires
in accordance with this Section 11 (the “Term”). Either party may immediately terminate this
Data Agreement, for any reason or for no reason, by providing written notice to the other party.
Yelp will provide notice of termination to the email account you provided to Yelp during
registration and termination will be effective upon delivery of the email notice. Yelp reserves the
right, in its sole discretion (for any reason or for no reason) and at any time without notice to
you, to change, suspend or discontinue the Data and/or suspend or terminate your further
access to the Data. Any termination of the Data Agreement will also immediately terminate the
licenses granted to you hereunder. Upon any termination of the Data Agreement, you will
promptly: (i) delete and remove all Data from any location, including any web pages, scripts,
widgets, applications and any other software in your possession or under your control; (ii)
destroy and remove from all computers, hard drives, networks and other storage media in your
possession or under your control all copies of any Data; and (iii) upon Yelp’s request, certify in
writing to Yelp that such actions have been taken.

11. Miscellaneous
                  
The Data Agreement encompasses the entire agreement between you and
Yelp regarding the subject matter discussed therein. The Data Agreement, and any disputes
arising from or relating to the interpretation thereof, will be governed by and construed under the
laws of the State of California without regard to its conflict of law provisions. You agree to
personal jurisdiction by and venue in the state and federal courts of the State of California, City
of San Francisco. The failure of Yelp to exercise or enforce any right or provision of the Data
Agreement will not constitute a waiver of such right or provision. The failure of either party to
exercise in any respect any right provided for herein will not be deemed a waiver of any further
rights hereunder. If any provision of the Data Agreement is found to be unenforceable or invalid,
that provision will be replaced with terms that most closely match the intent of the provision that
is not enforceable to the minimum extent necessary so that the remaining Data Agreement will
otherwise remain in full force and effect and enforceable. The Data Agreement is not
assignable, transferable or sublicensable, in whole or in part, by you except with Yelp's prior
written consent. Any attempt to do so is void. Yelp may assign the Data Agreement, in whole or
in part, at any time with or without notice to you. The section titles in the Data Agreement are for
convenience only and have no legal or contractual effect.

12. Survival 

Sections 4 through 13 will survive any expiration or termination of this Data
Agreement for any reason.

13.  Contact and Violations
                  
Please contact Yelp with any questions regarding the Data
Agreement. Please report any violations of the Data Agreement couvidat@yelp.com.


Writing ../src/data/yelp.license


Alternatively, we can store the text in a string and add it using `from_txt`:

In [52]:
readme_txt = '''
Yelp Dataset JSON

Each file is composed of a single object type, one JSON-object per-line.

Take a look at some examples to get you started: https://github.com/Yelp/dataset-examples.

Note: the follow examples contain inline comments, which are technically not valid JSON. This is done here to simplify the documentation and explaining the structure, the JSON files you download will not contain any comments and will be fully valid JSON.
business.json

Contains business data including location data, attributes, and categories.

{
    // string, 22 character unique string business id
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, the business's name
    "name": "Garaje",

    // string, the neighborhood's name
    "neighborhood": "SoMa",

    // string, the full address of the business
    "address": "475 3rd St",

    // string, the city
    "city": "San Francisco",

    // string, 2 character state code, if applicable
    "state": "CA",

    // string, the postal code
    "postal code": "94107",

    // float, latitude
    "latitude": 37.7817529521,

    // float, longitude
    "longitude": -122.39612197,

    // float, star rating, rounded to half-stars
    "stars": 4.5,

    // interger, number of reviews
    "review_count": 1198,

    // integer, 0 or 1 for closed or open, respectively
    "is_open": 1,

    // object, business attributes to values. note: some attribute values might be objects
    "attributes": {
        "RestaurantsTakeOut": true,
        "BusinessParking": {
            "garage": false,
            "street": true,
            "validated": false,
            "lot": false,
            "valet": false
        },
    },

    // an array of strings of business categories
    "categories": [
        "Mexican",
        "Burgers",
        "Gastropubs"
    ],

    // an object of key day to value hours, hours are using a 24hr clock
    "hours": {
        "Monday": "10:00-21:00",
        "Tuesday": "10:00-21:00",
        "Friday": "10:00-21:00",
        "Wednesday": "10:00-21:00",
        "Thursday": "10:00-21:00",
        "Sunday": "11:00-18:00",
        "Saturday": "10:00-21:00"
    }
}

review.json

Contains full review text data including the user_id that wrote the review and the business_id the review is written for.

{
    // string, 22 character unique review id
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // integer, star rating
    "stars": 4,

    // string, date formatted YYYY-MM-DD
    "date": "2016-03-09",

    // string, the review itself
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",

    // integer, number of useful votes received
    "useful": 0,

    // integer, number of funny votes received
    "funny": 0,

    // integer, number of cool votes received
    "cool": 0
}

user.json

User data including the user's friend mapping and all the metadata associated with the user.

{
    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",

    // string, the user's first name
    "name": "Sebastien",

    // integer, the number of reviews they've written
    "review_count": 56,

    // string, when the user joined Yelp, formatted like YYYY-MM-DD
    "yelping_since": "2011-01-01",

    // array of strings, an array of the user's friend as user_ids
    "friends": [
        "wqoXYLWmpkEH0YvTmHBsJQ",
        "KUXLLiJGrjtSsapmxmpvTA",
        "6e9rJKQC3n0RSKyHLViL-Q"
    ],

    // integer, number of useful votes sent by the user
    "useful": 21,

    // integer, number of funny votes sent by the user
    "funny": 88,

    // integer, number of cool votes sent by the user
    "cool": 15,

    // integer, number of fans the user has
    "fans": 1032,

    // array of integers, the years the user was elite
    "elite": [
        2012,
        2013
    ],

    // float, average rating of all reviews
    "average_stars": 4.31,

    // integer, number of hot compliments received by the user
    "compliment_hot": 339,

    // integer, number of more compliments received by the user
    "compliment_more": 668,

    // integer, number of profile compliments received by the user
    "compliment_profile": 42,

    // integer, number of cute compliments received by the user
    "compliment_cute": 62,

    // integer, number of list compliments received by the user
    "compliment_list": 37,

    // integer, number of note compliments received by the user
    "compliment_note": 356,

    // integer, number of plain compliments received by the user
    "compliment_plain": 68,

    // integer, number of cool compliments received by the user
    "compliment_cool": 91,

    // integer, number of funny compliments received by the user
    "compliment_funny": 99,

    // integer, number of writer compliments received by the user
    "compliment_writer": 95,

    // integer, number of photo compliments received by the user
    "compliment_photos": 50
}

checkin.json

Checkins on a business.

{
    // nested object of the day of the week with key of
    // the hour (using a 24hr clock) with the count of checkins
    // for that hour (e.g. 14:00 - 14:59).
    "time": {
        "Wednesday": {
            "14:00": 2,
            "16:00": 1,
            "2:00": 1,
            "0:00": 1
        },
        "Sunday": {
            "16:00": 8,
            "14:00": 3,
            "15:00": 3,
            "13:00": 1,
            "18:00": 2,
            "23:00": 1,
            "21:00": 1,
            "17:00": 2
        },
        "Friday": {
            "16:00": 1,
            "13:00": 1,
            "11:00": 2,
            "23:00": 2
        },
    },

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg"
}

tip.json

Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.

{
    // string, text of the tip
    "text": "Secret menu - fried chicken sando is da bombbbbbb Their zapatos are good too.",

    // string, when the tip was written, formatted like YYYY-MM-DD
    "date": "2013-09-20",

    // integer, how many likes it has
    "likes": 172,

    // string, 22 character business id, maps to business in business.json
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, 22 character unique user id, maps to the user in user.json
    "user_id": "49JhAJh8vSQ-vM4Aourl0g"
}

photo.json

Contains photo data including the caption and classification (one of "food", "drink", "menu", "inside" or "outside").

{
    // string, 22 character unique photo id
    "photo_id": "_nN_DhLXkfwEkwPNxne9hw",


    // string, 22 character business id, maps to business in business.json
    "business_id" : "tnhfDv5Il8EaGSXZGiuQGg",

    // string, the photo caption, if any
    "caption" : "carne asada fries",

    // string, the category the photo belongs to, if any
    "label" : "food"
}
'''
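
The readme above describes each file as one JSON object per line. A minimal sketch of reading that format with the standard `json` module follows; `iter_json_lines` is a hypothetical helper, and the sample line reuses a few fields from the `business.json` example rather than real downloaded data.

```python
import io
import json


def iter_json_lines(handle):
    """Yield one parsed object per non-empty line (hypothetical helper)."""
    for line in handle:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# In-memory stand-in for a file such as business.json; fields taken from the readme.
sample = io.StringIO(
    '{"business_id": "tnhfDv5Il8EaGSXZGiuQGg", "name": "Garaje", "stars": 4.5}\n'
    '\n'
)
records = list(iter_json_lines(sample))
```

For tabular work, pandas can read the same layout with `pd.read_json(path, lines=True)`.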


Next, we combine the complete set of files into a URL list and use this to build our JSON file entry. In our case, we start with just the one file.

In [46]:
url_list = [rawdata]; url_list


[{'name': None,
  'file_name': 'yelp_dataset.tar.gz',
  'hash_type': 'sha1',
  'hash_value': '096ac5ced8a9229ecc5116e77b6be8d8f90fdacb'}]

In [53]:
url_list += [build_dataset_dict(from_txt=readme_txt, 
                                file_name=f'{dataset_name}.readme', name='DESCR')]

In [54]:
newds_dict = datasets.add_dataset_by_urllist(dataset_name, url_list)
pprint(newds_dict)

{'action': 'fetch_and_process',
 'load_function': functools.partial(<function new_dataset at 0x7fb27c02dae8>, dataset_name='yelp'),
 'load_function_args': [],
 'load_function_kwargs': {'dataset_name': 'yelp'},
 'load_function_module': 'src.data.datasets',
 'load_function_name': 'new_dataset',
 'url_list': [{'file_name': 'yelp_dataset.tar.gz',
               'hash_type': 'sha1',
               'hash_value': '096ac5ced8a9229ecc5116e77b6be8d8f90fdacb',
               'name': None},
              {'contents': '\n'
                           'Yelp Dataset JSON\n'
                           '\n'
                           'Each file is composed of a single object type, one '
                           'JSON-object per-line.\n'
                           '\n'
                           'Take a look at some examples to get you started: '
                           'https://github.com/Yelp/dataset-examples.\n'
                           '\n'
                           'Note: the follow examples

Notice that a generic `load_function` (`new_dataset`) has been used to process the data. It does nothing more than populate the DESCR and LICENSE fields (if possible), creating an otherwise empty `Dataset` object.
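
The `functools.partial` wrapper seen in the output above simply pre-binds `dataset_name`, so the load function can later be called with no arguments. A toy illustration, where `new_dataset_sketch` is a hypothetical stand-in that returns a plain dictionary rather than a real `Dataset`:

```python
from functools import partial


def new_dataset_sketch(dataset_name=None, descr=None, license_txt=None):
    """Hypothetical stand-in for a generic loader: metadata only, no data."""
    return {
        "dataset_name": dataset_name,
        "DESCR": descr,
        "LICENSE": license_txt,
        "data": None,
        "target": None,
    }


# Bind the dataset name now; call with no arguments later.
load_function = partial(new_dataset_sketch, dataset_name="yelp")
dset_sketch = load_function()
```

The bound keyword arguments remain inspectable through `load_function.keywords`, which fits the way the dictionary above records `load_function_kwargs` alongside the function itself.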

In [55]:
# Now, call the (generic) load function and notice that the LICENSE and DESCR have been set
dset = newds_dict['load_function']()
type(dset)

src.data.dset.Dataset

In [56]:
print(dset.DESCR)


Yelp Dataset JSON

Each file is composed of a single object type, one JSON-object per-line.

Take a look at some examples to get you started: https://github.com/Yelp/dataset-examples.

Note: the follow examples contain inline comments, which are technically not valid JSON. This is done here to simplify the documentation and explaining the structure, the JSON files you download will not contain any comments and will be fully valid JSON.
business.json

Contains business data including location data, attributes, and categories.

{
    // string, 22 character unique string business id
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",

    // string, the business's name
    "name": "Garaje",

    // string, the neighborhood's name
    "neighborhood": "SoMa",

    // string, the full address of the business
    "address": "475 3rd St",

    // string, the city
    "city": "San Francisco",

    // string, 2 character state code, if applicable
    "state": "CA",

    // string, the postal code
   

In [57]:
license = getattr(dset, 'LICENSE', None)
print(license)

YELP DATASET TERMS OF USE

Last Updated: July 26, 2018
This document governs the terms under which you may access and use the data that Yelp
makes available for download through this website (or made available by other means) for
academic purposes (the “Data”). This document incorporates the terms of the following
additional document, including all future amendments or modifications thereto (collectively, and
together with this document, the “Data Agreement” ):

Yelp Terms of Service:

By accessing or using the Data, you agree to be bound by the Data Agreement and represent
that the contact information you provide to Yelp is correct. If you access or use the Data on
behalf of a university, school, or other entity, you represent that you have authority to bind such
entity and its affiliates to the Data Agreement and that it is fully binding upon them. In such
case, the term “you” and “your” will refer to such entity and its affiliates. If you do not have
authority, or if you do not agre

Now, reload the dataset from scratch and check that the license is there.

In [58]:
newds_dict = datasets.add_dataset_by_urllist(dataset_name, url_list)
dset = datasets.load_dataset(dataset_name)
print(dset.LICENSE)

2018-08-31 16:23:08,467 - fetch - INFO - No compression detected. Copying...
2018-08-31 16:23:08,471 - fetch - INFO - Decompresing yelp.readme


YELP DATASET TERMS OF USE

Last Updated: July 26, 2018
This document governs the terms under which you may access and use the data that Yelp
makes available for download through this website (or made available by other means) for
academic purposes (the “Data”). This document incorporates the terms of the following
additional document, including all future amendments or modifications thereto (collectively, and
together with this document, the “Data Agreement” ):

Yelp Terms of Service:

By accessing or using the Data, you agree to be bound by the Data Agreement and represent
that the contact information you provide to Yelp is correct. If you access or use the Data on
behalf of a university, school, or other entity, you represent that you have authority to bind such
entity and its affiliates to the Data Agreement and that it is fully binding upon them. In such
case, the term “you” and “your” will refer to such entity and its affiliates. If you do not have
authority, or if you do not agre

# Here be dragons
Below this point, the notebook has not yet been edited for the Yelp dataset; the remaining cells still refer to the lvq-pak example.

## Processing the data
The next step is to write the importer that actually processes the data we will be using for this dataset.

The important things to generate are the `data` and `target` entries. A `metadata` entry is optional, but recommended if you want to save additional information about the dataset.

Usually, this functionality gets bundled up into a function and added to `datasets.py`.


In [None]:
# Unpack the file
untar_dir = fetch_and_unpack(dataset_name)
unpack_dir = untar_dir / 'lvq_pak-3.1'
list_dir(unpack_dir)

In this dataset, the training and test datasets are stored in files named `ex1.dat` and `ex2.dat`, respectively.

In [None]:
datafile_train = unpack_dir / 'ex1.dat'
datafile_test = unpack_dir / 'ex2.dat'

datafile_train.exists() and datafile_test.exists()

According to the documentation, the data format is space-delimited, with the class label included as the last column. Let's have a look.

In [None]:
print(head_file(datafile_train))

Indeed, the datafile consists of 1 line containing the dimension of the data, a comment, and then 21 space-delimited columns, the final column being the target class label. 

**Note:** We have to be a little careful importing the data, because '#' is used both as the comment delimiter, and as a class label.

Fortunately, we have a helper function for this. We will get a little cheeky and skip the first 2 lines (hoping there are no other comments). The documentation also says there are 1962 entries in each of the training and test datasets.
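
To see why skipping rows by position sidesteps the `#` ambiguity, here is a rough sketch of a space-delimited reader. It is not the project's `read_space_delimited`: the function name and the toy input are made up for illustration, and only the "last column is the label" layout is taken from the description above.

```python
import io


def read_space_delimited_sketch(handle, skiprows=()):
    """Split each line on whitespace; the last field is the label (hypothetical helper).

    Rows are skipped by index, so a '#' label is never mistaken for a comment.
    """
    skip = set(skiprows)
    data, target = [], []
    for lineno, line in enumerate(handle):
        if lineno in skip:
            continue
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        data.append([float(value) for value in fields[:-1]])
        target.append(fields[-1])
    return data, target


# Toy input: a dimension line, a comment line, then two rows (one labelled '#').
toy = io.StringIO("2\n# toy comment line\n0.1 0.2 A\n0.3 0.4 #\n")
toy_data, toy_target = read_space_delimited_sketch(toy, skiprows=[0, 1])
```

The trade-off is the one noted above: any comment line beyond the skipped rows would be parsed as data and fail loudly on the `float` conversion.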

In [None]:
data, target = read_space_delimited(datafile_train, skiprows=[0,1])
data2, target2 = read_space_delimited(datafile_test, skiprows=[0])

data.shape, target.shape, data2.shape, target2.shape

In [None]:
target

This seems to work, so let's wrap this functionality up into a processing function.
By convention, the function takes a `dataset_name`, and any other options that may be useful for reading the data, and returns a dictionary that matches the `Dataset` constructor signature.

We will place this function in `localdata.py` (and add it to `__all__`) to make it visible to our dataset code.

In [None]:
%%file ../src/data/localdata.py
"""
Custom dataset processing/generation functions should be added to this file
"""

from src.data.utils import read_space_delimited, normalize_labels
from src.paths import interim_data_path
import numpy as np

__all__ = ['process_lvq_pak']

def process_lvq_pak(dataset_name='lvq-pak', kind='all', numeric_labels=True, metadata=None):
    """
    kind: {'test', 'train', 'all'}, default 'all'
    numeric_labels: boolean (default: True)
        if set, target is a vector of integers, and label_map is created in the metadata
        to reflect the mapping to the string targets
    """
    
    untar_dir = interim_data_path / dataset_name
    unpack_dir = untar_dir / 'lvq_pak-3.1'

    if kind == 'train':
        data, target = read_space_delimited(unpack_dir / 'ex1.dat', skiprows=[0,1])
    elif kind == 'test':
        data, target = read_space_delimited(unpack_dir / 'ex2.dat', skiprows=[0])
    elif kind == 'all':
        data1, target1 = read_space_delimited(unpack_dir / 'ex1.dat', skiprows=[0,1])
        data2, target2 = read_space_delimited(unpack_dir / 'ex2.dat', skiprows=[0])
        data = np.vstack((data1, data2))
        target = np.append(target1, target2)
    else:
        raise ValueError(f'Unknown kind: {kind}')

    if numeric_labels:
        if metadata is None:
            metadata = {}
        mapped_target, label_map = normalize_labels(target)
        metadata['label_map'] = label_map
        target = mapped_target

    dset_opts = {
        'dataset_name': dataset_name,
        'data': data,
        'target': target,
        'metadata': metadata
    }
    return dset_opts
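
The `numeric_labels` option in the function above relies on `normalize_labels` returning integer targets together with a label map. The exact form of that return value is not shown in this notebook, so the sketch below is an assumption: it maps labels to indices in sorted order and returns the sorted labels as the map. The name `normalize_labels_sketch` is hypothetical.

```python
def normalize_labels_sketch(target):
    """Map arbitrary labels to integers 0..n-1 in sorted label order (hypothetical)."""
    unique_labels = sorted(set(target))
    index_of = {label: i for i, label in enumerate(unique_labels)}
    mapped = [index_of[label] for label in target]
    # The list of unique labels doubles as the inverse map: label_map[i] is label i.
    return mapped, unique_labels


toy_labels = ["B", "A", "#", "A"]
mapped_target, label_map = normalize_labels_sketch(toy_labels)
recovered = [label_map[i] for i in mapped_target]
```

Keeping the map in the metadata means the original string labels can always be recovered from the integer targets, as `recovered` shows.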


Let's make sure this works as expected

In [None]:
from src.data.localdata import process_lvq_pak

for kind in ['train', 'test', 'all']:
    dset_opts = process_lvq_pak(kind=kind)
    dset = Dataset(**dset_opts)
    print(f'{kind}: data={dset.data.shape} target={dset.target.shape}')

This all looks good.


In [None]:
datasets.add_dataset_from_function(dataset_name, process_lvq_pak);

Finally, re-load the dataset and save a copy of it

In [None]:
lvq = load_dataset(dataset_name)
print(str(lvq))
lvq.dump()

The saved data is stored in the `processed_data_path`. A copy of just the metadata is also stored, so that it may be quickly checked without loading the entire dataset.

In [None]:
list_dir(processed_data_path)