# Data Download
This file will show the process of downloading the data needed for this project.

# Table of Contents

- [Places](#1)
- [Reviews](#2)

The data is downloaded from [Google Local](https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local) data published by Professor Julian McAuley from University of California, San Diego. The data contains reviews about businesses along with their geographical location. The raw data has 3.116.785 businesses and 11.453.845 reviews, which spans across locations all over the world.

The data is stored as multiple non comma seperated dictionaries and must be converted into readable json format. This process in done below. The code takes approximate 6 hours to run and outputs the two data files "reviews.csv" and "places.csv". In this data downloading process, we have decided to only use data from London, UK and Manhattan, US.

**If you are interested in the raw data and the issues contained in it, you are welcome to read the paragraph below. Otherwise, you'll find our code further below.**

The Google Local data is comprised of dictionaries separated by new line operators. A sample of four lines from the `places.clean.json` file is shown below:
```json
{'name': u'Diamond Valley Lake Marina', 'price': None, 'address': [u'2615 Angler Ave', u'Hemet, CA 92545'], 'hours': [[u'Monday', [[u'6:30 am--4:15 pm']]], [u'Tuesday', [[u'6:30 am--4:15 pm']]], [u'Wednesday', [[u'6:30 am--4:15 pm']], 1], [u'Thursday', [[u'6:30 am--4:15 pm']]], [u'Friday', [[u'6:30 am--4:15 pm']]], [u'Saturday', [[u'6:30 am--4:15 pm']]], [u'Sunday', [[u'6:30 am--4:15 pm']]]], 'phone': u'(951) 926-7201', 'closed': False, 'gPlusPlaceId': '104699454385822125632', 'gps': [33.703804, -117.003209]}
{'name': u'Blue Ribbon Cleaners', 'price': None, 'address': [u'Parole', u'Annapolis, MD'], 'hours': None, 'phone': u'(410) 266-6123', 'closed': False, 'gPlusPlaceId': '103054478949000078829', 'gps': [38.979759, -76.547538]}
{'name': u'Portofino', 'price': None, 'address': [u'\u0443\u043b. \u0422\u0443\u0442\u0430\u0435\u0432\u0430, 1', u'Nazran, Ingushetia, Russia', u'366720'], 'hours': [[u'Monday', [[u'9:30 am--9:00 pm']]], [u'Tuesday', [[u'9:30 am--9:00 pm']]], [u'Wednesday', [[u'9:30 am--9:00 pm']], 1], [u'Thursday', [[u'9:30 am--9:00 pm']]], [u'Friday', [[u'9:30 am--9:00 pm']]], [u'Saturday', [[u'9:30 am--9:00 pm']]], [u'Sunday', [[u'9:30 am--9:00 pm']]]], 'phone': u'8 (963) 173-38-38', 'closed': False, 'gPlusPlaceId': '109810290098030327104', 'gps': [43.22776, 44.762726]}
{'name': u"Dicola's Pizzeria", 'price': None, 'address': [u'626 Can Do Expy # 1 , Hazle, PA 18202'], 'hours': None, 'phone': u'(570) 384-0520', 'closed': False, 'gPlusPlaceId': '104869934485244376571', 'gps': [40.9908, -76.0117]}
```
At first glance, this looks like dictionaries that should be simple to load into Python. However, the issue is that the data is not consistent on whether apostrophes or quotations marks are used to encapsulate strings in the dictionaries. For example, the first three lines are consistent with using apostrophes, which are not recognized as `json` format, but the fourth line uses quotation marks for the name since the name has an apostrophe in it, i.e. `Dicola's Pizzeria`.

Therefore, the solution we found was to iterate over all the rows in the data files, convert the row to json format and write the row to a new file with comma separation. The solution is time-consuming and not very elegant, but since we are only running this once we went with it. The same treatment applies for the `reviews.clean.json` file.

In [3]:
import pandas as pd
import requests
import gzip
import shutil
import json
import re
import ast
import os
from tqdm import tqdm
data_path = "data/"

In [2]:
# Function for unzipping gzip and saving as json
def unzip_gzip(file):
    with gzip.open(file, "rb") as f_in:
        with open(file.rsplit(".", 1)[0], "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

# Places <a class="anchor" id="1"></a>

In [3]:
# Specify file path
url = "http://deepyeti.ucsd.edu/jmcauley/datasets/googlelocal/places.clean.json.gz"
filename = url.split("/")[-1]

In [4]:
# Download zip file from website
with open(data_path + filename, "wb") as f:
    r = requests.get(url)
    f.write(r.content)

In [6]:
# Unzip and save locally
unzip_gzip(data_path + filename)

In [8]:
# Inspiration from https://gist.github.com/mbrzusto/23fe728966247f25f3ec
fr=open(data_path + "places.clean.json")
fw=open(data_path + "places.json", "w")
written_firstline = 0
for line in tqdm(fr):
    json_dat = ast.literal_eval(line)
    full_address = ", ".join(json_dat['address'])
    in_ny = re.findall(r"NY\s\d{5}", json_dat['address'][-1]) # addresses in New York end with NY XXXXX
    in_london = re.findall(r'London.*?United Kingdom', full_address) # addresses in London contains the word London followed by zip code of varying length and United Kingdom
    if in_ny or in_london:
        if written_firstline == 0: # If file is empty
            fw.write("[")
            json.dump(json_dat, fw)
            written_firstline += 1
        else:
            fw.write(",\n")
            json.dump(json_dat, fw)
fw.write("]")

fw.close()
fr.close()

3114353it [04:42, 11020.28it/s]


In [9]:
# Convert json file to csv
with open(data_path + "places.json", "r") as f:
    content = json.loads(f.read())

df = pd.DataFrame(content)
df.to_csv(data_path + "places.csv", index=False, sep=";")

`gPlusPlaceId` is a unique ID for each business.

In [13]:
print(df.shape)
print(df.gPlusPlaceId.nunique())

(102851, 8)
102851


# Reviews <a class="anchor" id="2"></a>

In [4]:
places = pd.read_csv(data_path + "places.csv", sep=";")

In [11]:
# Download gzip data from website
url = "http://deepyeti.ucsd.edu/jmcauley/datasets/googlelocal/reviews.clean.json.gz"
filename = url.split("/")[-1]
with open(data_path + filename, "wb") as f:
    r = requests.get(url)
    f.write(r.content)

In [12]:
# Unzip and save reviews locally
unzip_gzip(data_path + filename)

In [5]:
places_ids = places.gPlusPlaceId.values
fr=open(data_path + "reviews.clean.json")
fw=open(data_path + "reviews.json", "w")
written_firstline = 0
for line in tqdm(fr):
    json_dat = ast.literal_eval(line)
    if json_dat['gPlusPlaceId'] in places_ids: # only get reviews of businesses in the places file
        if written_firstline == 0:
            fw.write("[")
            json.dump(json_dat, fw)
            written_firstline += 1
        else:
            fw.write(",\n")
            json.dump(json_dat, fw)
                 
fw.write("]")

fw.close()
fr.close()

11453845it [7:45:05, 410.44it/s]


In [6]:
# Convert file from json to csv
with open(data_path + "reviews.json", "r") as f:
    content = json.loads(f.read())

df = pd.DataFrame(content)
df.to_csv(data_path + "reviews.csv", index=False, sep=";")


## Final remarks
Now that the data is downloaded it is ready to be prepared for analysis. This is done in the [data processing notebook](./DataProcessing.ipynb) which is the next suggested step to learn more about the project.