# Notebook for Geo Cleaning, Pulling Yelp Data, and Pulling Secondary Yelp Information
## Guide
This notebook is broken down into three sections and is meant to be run interactively. Each city is processed once, and the flow can be run cell by cell.
The steps are as follows:
1. Convert the data coming out of QGIS so that we have the correct longitude and latitude information.
    
    1a. Write the converted locations to GeoData/Cityname/convertedlocs_cityname.csv in case they are needed later.
2. Pass the locations into the function that queries Yelp. The API key is redacted.
    
    2a. Note that we over-engineered the querying function to account for failures that should not occur.
3. Export the data to ./GeoInfo/raw_cityname.csv.
    
    3a. A function is provided to read this file back in.
4. De-duplicate the data. Duplicates are expected, because the search radius around any one point likely overlaps its neighbors' at least 4 times.
    
    4a. Yelp has a second API endpoint that returns secondary information for each business, so we query that as well.
5. Pass the data on for cleaning.

# GeoData Cleaning out of QGIS
Using pandas and pyproj, we read in the 1 km spaced locations and transform their projection from EPSG:3857 to EPSG:4326 (4326 is the latitude/longitude coordinate system that Google Maps and most GPS services use). The converted values are appended as columns 0 (longitude) and 1 (latitude).
The result is then exported to the appropriate file, ./GeoData/cityname/convertedlocs_cityname.csv.
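
As a cross-check on the converted values, the EPSG:3857 to EPSG:4326 conversion can also be written in closed form, because Web Mercator is a spherical projection. The sketch below is not what the cells in this notebook run (they use pyproj); the function names are ours, and it assumes the standard sphere radius of 6378137 m. Its results should agree closely with the pyproj output.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)

def mercator_to_lonlat(x, y):
    """Convert Web Mercator metres to (longitude, latitude) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat

def lonlat_to_mercator(lon, lat):
    """Forward projection, included only so the check can round-trip."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y

# Round-trip the first Frisco grid point that the pull loop prints in this notebook
x, y = lonlat_to_mercator(-96.92274796186854, 33.219369763091024)
lon, lat = mercator_to_lonlat(x, y)
print(lon, lat)
```

The return order is (longitude, latitude), matching columns 0 and 1 of the converted files.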

## Frisco

In [5]:
import pandas as pd
from pyproj import Proj, transform

data = pd.read_csv('./GeoData/Frisco/Frisco_1km.csv')
inProj = Proj(init='epsg:3857')
outProj = Proj(init='epsg:4326')
def f(x):
    return transform(inProj, outProj, x['X'], x['Y'])

coordinates = data.apply(lambda x: f(x), result_type='expand', axis=1)
data = pd.concat([data, coordinates], axis=1)

data.to_csv('./GeoData/Frisco/convertedlocs_Frisco.csv',index=None)

  in_crs_string = _prepare_from_proj_string(in_crs_string)
  in_crs_string = _prepare_from_proj_string(in_crs_string)
  return transform(inProj, outProj, x['X'], x['Y'])


## Dallas

In [1]:
import pandas as pd
from pyproj import Proj, transform

data = pd.read_csv('./GeoData/Dallas/Dallas_1km_grid.csv')
inProj = Proj(init='epsg:3857')
outProj = Proj(init='epsg:4326')
def f(x):
    return transform(inProj, outProj, x['X'], x['Y'])

coordinates = data.apply(lambda x: f(x), result_type='expand', axis=1)
data = pd.concat([data, coordinates], axis=1)

data.to_csv('./GeoData/Dallas/convertedlocs_dallas.csv',index=None)

  in_crs_string = _prepare_from_proj_string(in_crs_string)
  in_crs_string = _prepare_from_proj_string(in_crs_string)
  return transform(inProj, outProj, x['X'], x['Y'])


## Enterprise

In [4]:
import pandas as pd
from pyproj import Proj, transform

data = pd.read_csv('./GeoData/Enterprise/Enterprise_1km.csv')
inProj = Proj(init='epsg:3857')
outProj = Proj(init='epsg:4326')
def f(x):
    return transform(inProj, outProj, x['X'], x['Y'])

coordinates = data.apply(lambda x: f(x), result_type='expand', axis=1)
data = pd.concat([data, coordinates], axis=1)

data.to_csv('./GeoData/Enterprise/convertedlocs_Enterprise.csv',index=None)

  in_crs_string = _prepare_from_proj_string(in_crs_string)
  in_crs_string = _prepare_from_proj_string(in_crs_string)
  return transform(inProj, outProj, x['X'], x['Y'])


## Conroe

In [39]:
import pandas as pd
from pyproj import Proj, transform

data = pd.read_csv('./GeoData/Conroe/conroe_1km_dots.csv')
inProj = Proj(init='epsg:3857')
outProj = Proj(init='epsg:4326')
def f(x):
    return transform(inProj, outProj, x['X'], x['Y'])

coordinates = data.apply(lambda x: f(x), result_type='expand', axis=1)
data = pd.concat([data, coordinates], axis=1)

data.to_csv('./GeoInfo/convertedlocs_Conroe.csv',index=None)

  in_crs_string = _prepare_from_proj_string(in_crs_string)
  in_crs_string = _prepare_from_proj_string(in_crs_string)
  return transform(inProj, outProj, x['X'], x['Y'])


## Meridian

In [25]:
import pandas as pd
from pyproj import Proj, transform

data = pd.read_csv('./GeoData/Meridian/MerdianGrid.csv')
inProj = Proj(init='epsg:3857')
outProj = Proj(init='epsg:4326')
def f(x):
    return transform(inProj, outProj, x['X'], x['Y'])

coordinates = data.apply(lambda x: f(x), result_type='expand', axis=1)
data = pd.concat([data, coordinates], axis=1)
data.to_csv('./GeoInfo/convertedlocs_Merdian.csv',index=None)

  in_crs_string = _prepare_from_proj_string(in_crs_string)
  in_crs_string = _prepare_from_proj_string(in_crs_string)
  return transform(inProj, outProj, x['X'], x['Y'])


# Yelp Pull by Grid
Here we have the functions, the for loop, and the methods we use to save and export the data.
Note that the cells here are meant to be run manually, since error handling was necessary while the loop ran.

In [6]:
import requests
import json
import time
import pickle
import pandas as pd

+ Replace API_KEY
+ This endpoint is for business search

In [7]:
API_KEY = 'REMOVED'
ENDPOINT = 'https://api.yelp.com/v3/businesses/search'
HEADERS = {'Authorization': 'bearer %s' % API_KEY}
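
Since the key has to be replaced before running, one option is to read it from the environment instead of pasting it into the cell. This is a minimal sketch rather than what the notebook does; the helper name `build_headers` and the variable name `YELP_API_KEY` are hypothetical.

```python
import os

def build_headers(api_key=None):
    """Build the Authorization header, reading the key from the environment if not given."""
    # YELP_API_KEY is a hypothetical variable name chosen for this sketch
    key = api_key or os.environ.get('YELP_API_KEY')
    if not key:
        raise RuntimeError('No API key supplied; set YELP_API_KEY or pass api_key')
    # Same header format as the HEADERS cell above
    return {'Authorization': 'bearer %s' % key}

print(build_headers('example-key'))
```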

## Functions

In [8]:
def build_PARAMS(lat,long,radius,offset):
    """
    Purpose:
    Build out Param dictionary for the search.
    Params:
    * lat -> Float of the latitude
    * long -> Float of longitude
    * radius -> distance in meters to check around the geographic point
    * offset -> an int giving how many results to skip (50 per page)
    Output:
    The params we want to pass to search in dictionary form.
    """
    params = {'term': 'restaurant',
                  'offset': offset, # this offset is to prevent duplicates
                  'limit': 50, 
                  'latitude':float(lat),
                  'longitude':float(long),
                  'radius':radius}
    return(params)

def handle_good_return(response,output):
    """
    Purpose:
    Handle a response that we know is good and contains data
    Params:
    * response -> The requests.Response from the Yelp API (its body is JSON)
    * output -> the dict whose 'businesses' list we're building out
    Output:
    * output -> the same dict with the new businesses appended
    """
    business_data = response.json()
    output['businesses'] += business_data['businesses']
    return(output)

def handle_bad_response(lat,long,offset_val,output):
    """
    Purpose:
    Try to search again if there was a bad response
    Params:
    * lat -> Float of the latitude
    * long -> Float of longitude
    * offset_val -> an int giving how many results to skip
    * output -> the dict whose 'businesses' list we're building out
    Output:
    If a retry succeeds, the updated output is passed back out.
    Otherwise an Exception with an informative message is raised, which stops the calling loop.
    """
    retry_flag=True
    # Up to three retries, 10 seconds apart
    for i in range(0,3):
        time.sleep(10)
        PARAMETERS=build_PARAMS(lat,long,1000,offset_val)
        retry_request=requests.get(url = ENDPOINT, params= PARAMETERS, 
            headers = HEADERS)
        code=retry_request.status_code
        if code==200:
            retry_flag=False
            output=handle_good_return(retry_request,output)
            break
        else:
            print(f'Attempt number {i} failed will retry')
            remaining=int(retry_request.headers['ratelimit-remaining'])
            if remaining <= 10:
                print('Maybe hitting daily limit will sleep until then and retry')
                handle_sleep(retry_request)
    if retry_flag:
        raise Exception(f'Something is broken here is lat {lat}, '
            f'long {long}, offset_val {offset_val}')
    else:
        return(output)

def handle_sleep(retry_request):
    """
    Purpose:
    Pause the queries until the rate limit resets
    Input:
    retry_request -> the failed requests.Response, whose headers carry the reset time
    Output:
    Nothing - the function is just used as a time spacer
    """
    reset_time=pd.to_datetime(retry_request.headers['ratelimit-resettime'])
    now=pd.Timestamp.utcnow()
    remaining_time=reset_time-now
    print(f'Should reset in {remaining_time}')
    # total_seconds() keeps the days component; clamp at 0 if the reset has already passed
    sleeptime=max(remaining_time.total_seconds(),0)
    time.sleep(sleeptime)


def handle_multiple_queries(lat,long,li_of_offsets,output,snagflag):
    """
    Purpose:
    Page through the remaining results for one geographic point
    Params:
    * lat -> Float of the latitude
    * long -> Float of longitude
    * li_of_offsets -> a list of ints, one offset per remaining page
    * output -> the dict whose 'businesses' list we're building out
    * snagflag -> a Bool meant to handle problems
    Output:
    Output the updated dict, and the snagflag to tell the loop if there was an issue that needs to be retried
    """
    for offset_val in li_of_offsets:
        PARAMETERS=build_PARAMS(lat,long,1000,offset_val)
        return_request=requests.get(url = ENDPOINT, params= PARAMETERS, 
        headers = HEADERS)
        code=return_request.status_code
        if code==200:
            output=handle_good_return(return_request,output)
        else:
            print(f'Request Failed, {code}, going to retry in 10 seconds')
            try:
                # handle_bad_response takes four arguments (no snagflag)
                output=handle_bad_response(lat,long,offset_val,output)
            except Exception:
                snagflag=True
                print('Hit a Snag')
                break
    return(output,snagflag)
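

The wait in `handle_sleep` comes down to subtracting the current UTC time from the reset timestamp in the response headers. Here is a minimal standard-library sketch of that calculation. The function name `seconds_until_reset` and the example timestamps are made up for illustration, and the ISO 8601 timestamp format is an assumption.

```python
from datetime import datetime, timezone

def seconds_until_reset(reset_time_str, now=None):
    """Return how many seconds to wait until the reset time (never negative)."""
    reset_time = datetime.fromisoformat(reset_time_str)
    if reset_time.tzinfo is None:
        # Assume UTC when the timestamp carries no offset
        reset_time = reset_time.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # total_seconds() includes whole days, unlike the .seconds attribute
    return max((reset_time - now).total_seconds(), 0.0)

# Made-up timestamps for illustration only
now = datetime(2021, 1, 1, 17, 26, 8, tzinfo=timezone.utc)
print(seconds_until_reset('2021-01-02T00:00:00+00:00', now=now))
```

Clamping at zero avoids passing a negative value to `time.sleep` when the reset time has already passed.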



## Pull For Loop
This loop is the main brain for pulling the information and getting the data out. It takes the functions above and handles all the queries.
**Note** This loop is meant to be run manually so that problems can be troubleshot as they come up. We did not encounter any errors while running it, so it's *assumed* that the functions are robust enough.
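
The paging rule the loop relies on can be sketched on its own: the first query returns up to 50 businesses and reports a `total`, and the remaining pages are requested at offsets 50, 100, and so on. The helper name `page_offsets` is ours, and the 1000-result cap is taken from the comment in the loop itself.

```python
def page_offsets(total, limit=50, max_results=1000):
    """Offsets for the pages after the first one, refusing totals above the cap."""
    if total > max_results:
        raise ValueError(f'{total} results exceeds the {max_results} cap; shrink the radius')
    return list(range(limit, total, limit))

print(page_offsets(120))  # two more pages are needed after the first
```

A total of 50 or fewer yields an empty list, which is why the loop only pages when the total exceeds 50.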

In [9]:
for _,row in data.iterrows():
    #backwards so 0 last
    lat=row[1]
    long=row[0]
    offset=0
    break

In [14]:
PARAMETERS=build_PARAMS(lat,long,40_000,offset)

In [15]:
starting_response = requests.get(url = ENDPOINT, params= PARAMETERS, headers = HEADERS)

In [16]:
starting_response.content

b'{"error": {"code": "ACCESS_LIMIT_REACHED", "description": "You\'ve reached the access limit for this client. See instructions for requesting a higher access limit at https://www.yelp.com/developers/documentation/v3/rate_limiting"}}'

In [5]:
output = {}
output['businesses'] = []
last_remains=5000
try_num=0
snagflag=False
# Okay we're going to iterate over the geographic points
for _,row in data.iterrows():
    #backwards so 0 last
    lat=row[1]
    long=row[0]
    offset=0
    print(f'Starting query on ({lat},{long})')
    print(f'Output Size is {len(output["businesses"])}')
    # 1 km radius, matching the paging and retry helpers, so 'total' refers to the same search
    PARAMETERS=build_PARAMS(lat,long,1000,offset)
    starting_response = requests.get(url = ENDPOINT, params= PARAMETERS, headers = HEADERS)
    #Congrats it worked
    if starting_response.status_code==200:
        #write it out before doing anything else
        output=handle_good_return(starting_response,output)
        #Now what happens if we have more than 50 results
        length=starting_response.json()['total']
        #1000 is the limit so if it's over we need to handle it differently
        # Easy Case
        if length>50 and length<=1000:
            # build out offset list
            li_of_offsets=list(range(50,length,50))
            #build output in separate func
            output,snagflag=handle_multiple_queries(lat,long,li_of_offsets,output,snagflag)
            if snagflag:
                raise Exception(f'Paging failed at ({lat},{long})')
        elif length>1000:
            #need to reset radius maybe?
            raise Exception(f'More than 1000 results at ({lat},{long})')
    else:
        output=handle_bad_response(lat,long,offset,output)
    

Starting query on (33.219369763091024,-96.92274796186854)
Output Size is 0
Attempt number 0 failed will retry
Maybe hitting daily limit will sleep until then and retry
Should reset in 0 days 06:33:51.408616


### Output Data
Here we store the data in text format in case we need it again.

In [None]:
pd.DataFrame(output['businesses']).to_csv('./GeoInfo/raw_Frisco.csv')

### Function for reading Data Back in

In [14]:
def read_in_old_raw_data(filepath):
    """Simple function to read old data back in.
    filepath is expected to be a string"""
    return(pd.read_csv(filepath,index_col=0))

Most of these are showing up outside the city limits, so it's no wonder.

## Pull Yelp secondary information

Here we assume you still have the output information in memory. If not, pass the path of the raw CSV as `output` and set the re_read flag to True, so that the read function above is used.

In [None]:
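def dedup_output(output, cityname,re_read=False):
    """De-duplicate the pulled businesses on their Yelp id and write the result out.
    output is the dict built by the pull loop, or a filepath string when re_read is True"""
    if isinstance(output,str) and re_read:
        frame=read_in_old_raw_data(output)
    else:
        frame=pd.DataFrame(output['businesses'])
    # duplicated() marks the repeats, so negate the mask to keep the first copy of each id
    frame=frame[~frame['id'].duplicated(keep='first')]
    fn='./GeoInfo/dedup_'+cityname+'.csv'
    frame.to_csv(fn)
    return frame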
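
De-duplication here means keeping the first record seen for each business id and dropping the later copies. The same rule can be sketched without pandas. The helper name `keep_first_by_id` and the sample records are made up for illustration.

```python
def keep_first_by_id(businesses):
    """Return the businesses with repeated 'id' values dropped, keeping the first copy."""
    seen = set()
    unique = []
    for business in businesses:
        if business['id'] not in seen:
            seen.add(business['id'])
            unique.append(business)
    return unique

# Made-up records standing in for results from overlapping queries
sample = [{'id': 'a', 'name': 'Cafe A'},
          {'id': 'b', 'name': 'Diner B'},
          {'id': 'a', 'name': 'Cafe A'}]
print([b['id'] for b in keep_first_by_id(sample)])
```

Comparing the row counts before and after is a quick way to confirm that duplicates were removed and unique businesses kept.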

### Example of how Dallas' data was run

In [None]:
frame=dedup_output(output,'dallas')

### Continue search
Here we simply iterate through the business ids and request the detailed information for each one.

In [12]:

business_id = frame['id'].to_list()
business_details = []

for b_id in business_id:
    # use a separate name so the search ENDPOINT defined earlier is not overwritten
    details_url = 'https://api.yelp.com/v3/businesses/%s' % b_id
    response = requests.get(url = details_url, headers = HEADERS)
    business_data = response.json()
    business_details.append(business_data)
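
Because this loop appends whatever JSON comes back, a failed request would leave an error payload mixed in with the business details. A minimal sketch of separating the two afterwards. The helper name `split_details` and the sample payloads are made up, and the error shape follows the ACCESS_LIMIT_REACHED response shown earlier in this notebook.

```python
def split_details(payloads):
    """Separate business payloads from error payloads (those with an 'error' key)."""
    good, bad = [], []
    for payload in payloads:
        (bad if 'error' in payload else good).append(payload)
    return good, bad

# Made-up payloads for illustration only
sample = [{'id': 'a', 'name': 'Cafe A'},
          {'error': {'code': 'ACCESS_LIMIT_REACHED', 'description': '...'}}]
good, bad = split_details(sample)
print(len(good), len(bad))
```

Only the `good` list would then be written out, and the ids behind `bad` could be queried again later.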

Write the details out to CSV

In [13]:
pd.DataFrame(business_details).to_csv('./All_Available_Dallas_Data_MoreINFO.csv')