# Data Download

The data is downloaded from [Google Local](https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local) data published by Professor Julian McAuley from University of California, San Diego. The data contains reviews about businesses along with their geographical location. The raw data has 3.116.785 businesses and 11.453.845 reviews, which spans across locations all over the world.

The data is stored as multiple non comma seperated dictionaries and must be converted into readable json format. This process in done below. The code takes approximate 6 hours to run and outputs the two data files "reviews.csv" and "places.csv". In this data downloading process, we have decided to only use data from United Kingdom and New York, USA.

In [None]:
import pandas as pd
import requests
import gzip
import shutil
import json
import re
import ast
import os
from tqdm import tqdm

: 

In [18]:
# Unzipping and saving as json
def unzip_gzip(file):
    with gzip.open(file, "rb") as f_in:
        with open(file.rsplit(".", 1)[0], "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)

# Places

In [3]:
# Specify file path
url = "http://deepyeti.ucsd.edu/jmcauley/datasets/googlelocal/places.clean.json.gz"
filename = url.split("/")[-1]

In [None]:
# Download file
with open(filename, "wb") as f:
    r = requests.get(url)
    f.write(r.content)

In [7]:
unzip_gzip(filename)

In [5]:
# Inspiration from https://gist.github.com/mbrzusto/23fe728966247f25f3ec
fr=open("places.clean.json")
fw=open("places.json", "w")
written_firstline = 0
for line in tqdm(fr):
    json_dat = ast.literal_eval(line)
    full_address = ", ".join(json_dat['address'])
    in_ny = re.findall(r"NY\s\d{5}", json_dat['address'][-1]) # addresses in New York end with NY XXXXX
    in_london = re.findall(r'London.*?United Kingdom', full_address) # addresses in London contains the word London followed by zip code of varying length and United Kingdom
    if in_ny or in_london:
        if written_firstline == 0: # If file is empty
            fw.write("[")
            json.dump(json_dat, fw)
            written_firstline += 1
        else:
            fw.write(",\n")
            json.dump(json_dat, fw)
fw.write("]")

fw.close()
fr.close()

3114353it [13:29, 3845.68it/s]


In [6]:
with open("places.json", "r") as f:
    content = json.loads(f.read())

In [7]:
df = pd.DataFrame(content)
df.to_csv("places.csv", index=False, sep=";")

`gPlusPlaceId` is a unique ID for each business.

In [13]:
print(df.shape)
print(df.gPlusPlaceId.nunique())

(102851, 8)
102851


# Reviews

In [14]:
places = pd.read_csv("places.csv", sep=";")

In [4]:
url = "http://deepyeti.ucsd.edu/jmcauley/datasets/googlelocal/reviews.clean.json.gz"
filename = url.split("/")[-1]
with open(filename, "wb") as f:
    r = requests.get(url)
    f.write(r.content)

In [7]:
unzip_gzip(filename)

In [15]:
places_ids = places.gPlusPlaceId.values
fr=open("reviews.clean.json")
fw=open("reviews.json", "w")
written_firstline = 0
for line in tqdm(fr):
    json_dat = ast.literal_eval(line)
    if json_dat['gPlusPlaceId'] in places_ids:
        if written_firstline == 0:
            fw.write("[")
            json.dump(json_dat, fw)
            written_firstline += 1
        else:
            fw.write(",\n")
            json.dump(json_dat, fw)
                 
fw.write("]")

fw.close()
fr.close()

62135it [06:16, 165.11it/s]


KeyboardInterrupt: 

In [None]:
with open("reviews.json", "r") as f:
    content = json.loads(f.read())

df = pd.DataFrame(content)
df.to_csv("reviews.csv", index=False, sep=";")