# Google API Data - Goody & Sigel's

Mark Labinski
<br>
Data Scientist
<br>
Goody Goody Liquor

## File Information

### Description 

A Python script to predict retail store performance from data pulled from the Google Places API. The script counts the local landmarks of each type (supermarkets, liquor stores, gyms, schools, churches, bus stops, etc.) around each store, to explore whether a store's success can be predicted from these nearby places. It also accepts keyword inputs, so you can count __specific__ places (e.g., search for "Walgreens" instead of just "pharmacies").

The default script takes a list of geocoordinates, one pair for each sample you wish to classify. Enter the Google Place types you wish to count and a list of radii to search within, and the script will pull and count these places for each store. Because this is a supervised learning script, target classes must be supplied in order to train the classifier.

<div class='alert alert-warning'>
This script can be used in conjunction with my Census and Yelp API scripts (www.github.com/marklabinski) to use Google, Yelp, and Census data as feature inputs in a machine learning experiment. Combine these with my neural network scripts to create one amazing research project!
</div>

Examples of potential uses:
 
> - Retail : Classify McDonald's locations by success (top 25%, top 50%, bottom 50%, bottom 25%) based on the number of Burger Kings, Taco Bells, schools, bus stops, malls, etc. within a 1, 3, 5, and 10 mile radius. New candidate locations can then be given to the algorithm to predict the best places to open a new store.
> - Real estate : Predict "on the rise" neighborhoods based on the number of Starbucks opened within the last year, the number of transplants within the past 3 years, median age, the change in income over the past 2 years, etc.
> - Social sciences : Find correlations between lower-performing schools and their immediate surroundings.
> - The possibilities are limitless and this script generalizes to __ANYTHING__ - get creative!


### Functions
>- __GoogleRequest__ : Grabs and parses JSON data from Google API
>- __GoogleSearch__ : Performs Google API search using GoogleRequest

 <br>

__Dependencies__:
 urllib, json, requests, pandas, seaborn, matplotlib, numpy, scikit-learn
<br>

__Note__: The default script here reads a dataset with confidential information, but you can substitute your own data by creating a new pandas dataframe to read it into (this is noted below in the script). To exceed the Google API's standard usage limits, a premium API key must be purchased.

<br>

For API documentation, see:
https://developers.google.com/places/web-service/search


In [9]:
import urllib, json, requests
import pandas as pd
import numpy as np
import time
%matplotlib notebook

# Data Collection

#### Read in locations (Latitude, Longitude) around which to base Google search

In [3]:
root = 'C:/Users/markl/OneDrive/Documents/GG/'
fn = 'df_data.xlsx'
data = pd.read_excel(root+fn)
data.head()

Unnamed: 0,store,X,Y,supermarket1,liquorstore1,walmart1,convenience1,gas1,bank1,supermarket2,...,stdLiqD3,stdLiqD5,stdLiqD10,stdTotalBevDH,stdTotalBevD1,stdTotalBevD3,stdTotalBevD5,stdTotalBevD10,total,class
0,P679671,32.68313,-97.396815,1,2,0,3,8,9,5,...,14566.755885,20170.684427,27748.503819,25999.351244,26399.073228,30058.206796,37098.217513,51138.175439,3481503,0
1,P255015,32.822277,-96.862946,0,4,0,10,14,11,1,...,35910.510537,52131.606772,39592.521387,76787.750512,60881.346205,99762.648386,100874.007159,77314.162573,1200000,0
2,P255017,32.9382,-96.72049,0,2,0,9,10,4,11,...,15948.627376,15781.6983,24158.169235,192.890079,21452.991344,30764.585853,32293.487742,45410.073489,8300000,1
3,P251689,32.95099,-96.830994,1,10,1,9,10,20,4,...,32036.307401,26315.734435,30285.027666,50484.418333,65082.348322,59016.37438,48769.391239,58013.563303,8365586,1
4,P255034,33.071754,-96.75158,2,2,1,4,6,4,4,...,20771.251724,27004.854424,25518.624371,15244.702581,15434.634532,31506.233057,48094.847427,47713.674039,1700000,0


Above is our original, complete dataset. Let's pull out just the store ID and coordinates (X = latitude, Y = longitude) and work with those.

In [10]:
df = data[['store','X','Y']]
df.head()

Unnamed: 0,store,X,Y
0,P679671,32.68313,-97.396815
1,P255015,32.822277,-96.862946
2,P255017,32.9382,-96.72049
3,P251689,32.95099,-96.830994
4,P255034,33.071754,-96.75158


### Define function to pull JSON data from Google API

In [11]:
def GoogleRequest(lat,lng,radius,types,key):
    """
    Function for grabbing and parsing JSON data from Google API.
    Input
        : latitude
        : longitude
        : radius (miles)
        : google types
        : API key    
    Output
        : JSON response from Google, parsed into nested dicts and lists
    
    For API documentation, see: https://developers.google.com/places/web-service/search
    
    """
    # Making the URL
    
    AUTH_KEY = key
    LOCATION = str(lat) + "," + str(lng)
    RADIUS = radius * 1609.344 # convert miles to meters (Google requirement)
    TYPES = types
    # 'type' replaces the deprecated 'types' parameter; 'sensor' is no longer required
    googUrl = ('https://maps.googleapis.com/maps/api/place/nearbysearch/json'
           '?location=%s'
           '&radius=%s'
           '&type=%s'
           '&key=%s') % (LOCATION, RADIUS, TYPES, AUTH_KEY)
    
    # Grabbing the JSON result
    
    response = requests.get(googUrl)
    jsonData = response.json()
    return jsonData

### Pull data from Google Places API


For API documentation, see:
https://developers.google.com/places/web-service/search


For supported types, visit:
https://developers.google.com/places/web-service/supported_types

For keyword search, use syntax:
>'type&keyword=XXX' (ex: 'gas_station&keyword=7/11')
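
The `'type&keyword=XXX'` syntax works because `GoogleRequest` pastes the string straight into the URL, so the `&keyword=...` part becomes a second query parameter. A minimal sketch of a more explicit alternative follows, assuming you are willing to pass the keyword as its own argument. `build_nearby_url` and its `keyword` argument are hypothetical names, not part of the original script; `urlencode` handles escaping characters such as the `/` in "7/11".

```python
from urllib.parse import urlencode

NEARBY_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'
METERS_PER_MILE = 1609.344

def build_nearby_url(lat, lng, radius_miles, place_type, key, keyword=None):
    """Hypothetical helper: build a Nearby Search URL with escaped parameters."""
    params = {
        'location': '%s,%s' % (lat, lng),
        'radius': radius_miles * METERS_PER_MILE,  # Google expects meters
        'type': place_type,
        'key': key,
    }
    if keyword is not None:
        params['keyword'] = keyword  # sent as its own parameter, safely escaped
    return NEARBY_URL + '?' + urlencode(params)

url = build_nearby_url(32.68313, -97.396815, 1, 'gas_station',
                       'YOUR_API_KEY', keyword='7/11')
print(url)
```

The resulting URL can be passed to `requests.get` exactly as `googUrl` is in `GoogleRequest`.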


In [8]:
  #####################
 ## OPTIMIZING TEST ##
#####################
def GoogleSearch2(df,google_types,new_columns,radii,api_key,save_as='df_google.csv'):
    """
    Function to make Google API search request for locations in df
        Input
             : df           - pandas dataframe with latitude/longitude for each location around which to do Google search
             : google_types - list of google types to search for
                                 Format: 'type' or 'type&keyword=xxx' (ex: 'gas_station&keyword=7/11')
             : new_columns  - list of column names to use when storing google data in df
             : radii        - list of radii to use when searching
             : api_key      - your API key
             : save_as      - save df to csv as ___
                                 Format: 'google_date.csv'
        Output
             : df, now with added new_columns containing Google search results
                   
    For supported Google search types, visit: https://developers.google.com/places/web-service/supported_types
    """
    # Initialize search 
    search = {}
    
    # Create every result column once, up front; creating them inside the
    # store loop would reset earlier stores' counts to 0 on each pass
    for name in new_columns:
        for radius in radii:
            df[name+str(radius)] = 0
    
    # Loop through test stores, then through every (type, radius) pair
    for index,data in df.iterrows():
        for idx in range(len(google_types)):
            for radius in radii:
                # Use the api_key argument rather than the global MyKey
                search[index] = GoogleRequest(lat=data['X'],lng=data['Y'],radius=radius,types=google_types[idx],key=api_key)
                df.loc[index,new_columns[idx]+str(radius)] = len(search[index]['results'])
            
    df.to_csv(save_as)
    return df

In [7]:
def GoogleSearch(df,google_types,new_columns,radii,api_key,save_as='df_google.csv'):
    """
    Function to make Google API search request for locations in df
        Input
             : df           - pandas dataframe with latitude/longitude for each location around which to do Google search
             : google_types - list of google types to search for
                                 Format: 'type' or 'type&keyword=xxx' (ex: 'gas_station&keyword=7/11')
             : new_columns  - list of column names to use when storing google data in df
             : radii        - list of radii to use when searching
             : api_key      - your API key
             : save_as      - save df to csv as ___
                                 Format: 'google_date.csv'
        Output
             : df, now with added new_columns containing Google search results
                   
    For supported Google search types, visit: https://developers.google.com/places/web-service/supported_types
    """
    # Initialize search 
    search = {}
    
    # Loop through all radii and Google types, adding one column at a time to df
    for r in range(len(radii)):
        for idx in range(len(google_types)):
            df[new_columns[idx]+str(radii[r])]=0  
            for row in df.iterrows():    # Loop through test stores
                index,data = row
                # Use the api_key argument rather than the global MyKey
                search[index] = GoogleRequest(lat=data['X'],lng=data['Y'],radius=radii[r],types=google_types[idx],key=api_key)
                df.loc[index,new_columns[idx]+str(radii[r])] = len(search[index]['results'])
            
    df.to_csv(save_as)
    return df
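
Note that `len(search[index]['results'])` only counts the first page of a response. Per Google's Nearby Search documentation, each page holds at most 20 results and further pages are requested with `next_page_token` (up to 60 results in total), so a count of 20 may really mean "at least 20". Below is a rough sketch of counting across pages, assuming a hypothetical `fetch_page(token)` callable that wraps the HTTP request and returns the parsed JSON. Neither `count_places` nor `fetch_page` is part of the original script.

```python
import time

def count_places(fetch_page, max_pages=3, pause=2.0):
    """Count results across paginated responses.

    fetch_page(token) is a hypothetical callable: it should return the parsed
    JSON for the first page when token is None, or for the page identified by
    token otherwise.
    """
    total = 0
    token = None
    for page in range(max_pages):
        data = fetch_page(token)
        total += len(data.get('results', []))
        token = data.get('next_page_token')
        if not token:
            break
        if page < max_pages - 1:
            time.sleep(pause)  # a new token needs a short delay before it becomes valid
    return total

# Offline demonstration with canned pages instead of real API calls
fake_pages = {
    None: {'results': [{}] * 20, 'next_page_token': 'page2'},
    'page2': {'results': [{}] * 20, 'next_page_token': 'page3'},
    'page3': {'results': [{}] * 5},
}
print(count_places(fake_pages.get, pause=0))
```

Even with pagination the count is capped at 60, so for common place types at large radii the feature will saturate.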

In [19]:
# API key (avoid committing a real key; load it from an environment variable or config file)
MyKey = 'YOUR_API_KEY'

# List of Google types to use
google_types = ['department_store&keyword=Walmart','department_store&keyword=Target','department_store','bus_stop','bar','pharmacy']

# List of column names for new dataframe
new_columns = ['walmart','target','dept_store','bus','bar','pharmacy']

# List of search radii
radii = [0.5,1,3,5]

save_as = 'C:/Users/markl/OneDrive/Documents/GG/df_google_v2.csv'

GoogleSearch(df,google_types,new_columns,radii,MyKey,save_as)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
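
This warning appears because `df` was created as a selection of columns from `data` (`data[['store','X','Y']]`), and `GoogleSearch` then adds columns to it. A common remedy, sketched here on a made-up two-row frame, is to take an explicit copy when subsetting:

```python
import pandas as pd

# Toy stand-in for the confidential dataset (values are illustrative only)
data = pd.DataFrame({'store': ['A', 'B'],
                     'X': [32.7, 32.8],
                     'Y': [-97.4, -96.9],
                     'total': [100, 200]})

# .copy() makes df an independent frame, so adding columns to it is unambiguous
df = data[['store', 'X', 'Y']].copy()
df['walmart1'] = 0
df.loc[0, 'walmart1'] = 3

print(df)
print('walmart1' in data.columns)
```

The original `data` frame is left untouched, and the new column lives only in `df`.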


In [20]:
df

Unnamed: 0,store,X,Y,walmart0.5,walmart1,walmart3,walmart5,target0.5,target1,target3,...,bus3,bus5,bar0.5,bar1,bar3,bar5,pharmacy0.5,pharmacy1,pharmacy3,pharmacy5
0,P679671,32.68313,-97.396815,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,P255015,32.822277,-96.862946,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,P255017,32.9382,-96.72049,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,P251689,32.95099,-96.830994,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,P255034,33.071754,-96.75158,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,P455948,33.071592,-97.046911,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,P679678,32.725684,-97.422516,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,P692933,32.49201,-94.728782,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,P692935,32.504187,-94.769017,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,P157674,32.875958,-96.760978,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
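
Every count in this table is 0. One possible explanation (an assumption, since the raw responses are not shown) is that the requests were rejected or rate-limited: in that case the response still parses as JSON, but `results` is empty and the `status` field carries a value such as `OVER_QUERY_LIMIT` or `REQUEST_DENIED`. Here is a minimal sketch of a check that separates a genuine zero from a failed request; `count_results` is a hypothetical helper name.

```python
def count_results(json_data):
    """Return the number of places in a Places API response.

    Raises RuntimeError when the status is anything other than 'OK' or
    'ZERO_RESULTS', so failed requests are not silently recorded as 0.
    """
    status = json_data.get('status')
    if status == 'OK':
        return len(json_data.get('results', []))
    if status == 'ZERO_RESULTS':
        return 0
    message = json_data.get('error_message', 'no error message provided')
    raise RuntimeError('Places API request failed: %s (%s)' % (status, message))

# Canned responses standing in for real API output
print(count_results({'status': 'OK', 'results': [{'name': 'a'}, {'name': 'b'}]}))
print(count_results({'status': 'ZERO_RESULTS', 'results': []}))
try:
    count_results({'status': 'REQUEST_DENIED', 'results': []})
except RuntimeError as err:
    print(err)
```

Calling a helper like this in place of `len(search[index]['results'])` would surface quota or key problems on the first failed request.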


# Data Exploration / Pre-Processing

First, we need to divide the stores into classes based on total sales. 

### Three Classes

In [6]:
# If we need to reload...
root = 'C:/Users/markl/OneDrive/Documents/GG/'
df = pd.read_csv(root+'df_google.csv')

In [4]:
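# Split the range of total sales into three equal-width bands
totalmax = df['total'].max()
totalmin = df['total'].min()
third = (totalmax - totalmin) / 3
top3 = totalmax - third          # lower bound of the top band
low3 = totalmax - (third * 2)    # lower bound of the middle band

# Class 0 = top band, 1 = middle band, 2 = bottom band
for row in df.iterrows():
    idx, data = row
    if df.loc[idx,'total'] > top3 :
        df.loc[idx,'class'] = 0
    elif df.loc[idx,'total'] > low3 :    # also catches totals exactly equal to top3
        df.loc[idx,'class'] = 1
    else :                               # totals at or below low3
        df.loc[idx,'class'] = 2   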
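
The same equal-width banding can be sketched without an explicit loop using `pd.cut`. This is an alternative formulation rather than the original method, shown on made-up totals. Values that fall exactly on a boundary may land in a different band than in the loop above, because `pd.cut` uses right-closed intervals.

```python
import pandas as pd

# Illustrative totals only, not the real sales figures
df = pd.DataFrame({'total': [0, 10, 20, 30, 40, 50, 60]})

# Three equal-width bins over the range of 'total';
# labels run from the lowest bin (class 2) to the highest (class 0)
df['class'] = pd.cut(df['total'], bins=3, labels=[2, 1, 0]).astype(int)

print(df)
print(df['class'].value_counts().sort_index())
```

Equal-width bands on sales tend to produce unbalanced classes when a few stores sell far more than the rest, which is what the class counts in the next cell show.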

In [5]:
df.groupby('class').count()

Unnamed: 0_level_0,Unnamed: 0,store,X,Y,supermarket1,liquorstore1,walmart1,convenience1,gas1,bank1,...,walmart3,convenience3,gas3,bank3,supermarket0.5,liquorstore0.5,walmart0.5,convenience0.5,gas0.5,total
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,3,3,3,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
1,7,7,7,7,7,7,7,7,7,7,...,7,7,7,7,7,7,7,7,7,7
2,20,20,20,20,20,20,20,20,20,20,...,20,20,20,20,20,20,20,20,20,20


In [217]:
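# Features: every column from index 3 up to (not including) 'total' and 'class'.
# Note that index 3 is Y (longitude); the two-class section below starts at index 4.
X = df.iloc[:,3:-2]
y = df.iloc[:,-1]    # target: 'class'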

In [315]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils import class_weight

nsims = 10000
accuracies=pd.DataFrame()
for n in range(nsims):
    # Split into train/test (a new random stratified split each simulation)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)

    # Weight classes inversely to their frequency to offset the class imbalance
    classes = np.unique(y_train)
    class_weights = class_weight.compute_class_weight('balanced', classes=classes, y=y_train)

    clf = SVC(kernel='linear',class_weight=dict(zip(classes, class_weights)))      

    clf.fit(X_train,y_train)
    # Record test-set accuracy for this split
    accuracies.loc[n,'score'] = clf.score(X_test,list(y_test.values))

In [316]:
accuracies.mean()

score    0.673084
dtype: float64
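
To put this number in context, it helps to compare it with a majority-class baseline. Using the class counts from the `groupby` output above (3, 7, and 20 stores), a classifier that always predicts class 2 would already be right about two thirds of the time:

```python
# Class counts taken from the df.groupby('class').count() output
class_counts = {0: 3, 1: 7, 2: 20}

n_total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / n_total

print(majority_class)
print(round(baseline_accuracy, 4))
```

The mean SVC accuracy of 0.673 is only marginally above this baseline, so per-class metrics such as recall may be more informative than accuracy for this split.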

### Two Classes

Let's use pandas' `describe()` to pull the 25%, 50%, and 75% values for the 'total' column of df. Stores above the median become class 1; the rest become class 0.

In [73]:
describe = df['total'].describe()

# Define classes: the four quartile branches collapse to a split at the median,
# so stores above the median are class 1 and all others are class 0
median = describe.loc['50%']
df['class'] = (df['total'] > median).astype(int)

In [74]:
X = df.iloc[:,4:-2]
y = df.iloc[:,-1]

In [78]:
df.groupby('class').mean()

Unnamed: 0_level_0,Unnamed: 0,X,Y,supermarket1,liquorstore1,walmart1,convenience1,gas1,bank1,supermarket2,...,walmart3,convenience3,gas3,bank3,supermarket0.5,liquorstore0.5,walmart0.5,convenience0.5,gas0.5,total
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,17.333333,32.198402,-96.036489,1.6,4.6,0.533333,8.733333,10.733333,9.266667,5.0,...,2.4,17.933333,18.733333,17.2,0.466667,2.0,0.266667,2.533333,3.8,2278484.0
1,11.666667,32.892335,-96.714263,1.733333,5.533333,0.666667,9.6,10.733333,11.466667,5.8,...,2.466667,19.466667,20.0,18.2,0.6,2.733333,0.266667,3.133333,4.266667,7766664.0


In [40]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=10)    # keep the 10 highest-ranked features
rfe = rfe.fit(X, y )
print(rfe.support_)
print(rfe.ranking_)

[ True False False False False False  True False  True False  True False
  True False  True  True  True False  True  True False False False]
[ 1 11 12  6 10  7  1  4  1  8  1 14  1  3  1  1  1 13  1  1  2  5  9]


In [42]:
cols = ['supermarket1','supermarket2','gas1','walmart2','gas2','supermarket3',
        'walmart3', 'convenience3','gas3','supermarket0.5','liquorstore0.5']
Xnew = X[cols]
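
The hand-typed `cols` list contains 11 names, whereas RFE was asked for 10 features, so the two may have drifted apart. Below is a sketch of deriving the names from the selection mask instead. The column names and mask are placeholders, and `selected_columns` is a hypothetical helper; with the real objects the equivalent would be `X.columns[rfe.support_]`.

```python
def selected_columns(columns, support):
    """Return the column names whose entry in the boolean mask is True."""
    return [name for name, keep in zip(columns, support) if keep]

# Placeholder names and mask for illustration only
columns = ['supermarket1', 'liquorstore1', 'walmart1', 'gas1']
support = [True, False, False, True]

cols = selected_columns(columns, support)
print(cols)
```

Deriving the list this way keeps `Xnew` in sync with whatever RFE selects if the feature set or `n_features_to_select` changes.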

#### Try different classification techniques

In [43]:
# Try different classification techniques 
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, precision_score, recall_score
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron
from sklearn.svm import LinearSVC

clf_A = LogisticRegression()
clf_B = LinearSVC()
clf_C = SGDClassifier(shuffle=True)    # defined but left out of clf_list below
clf_D = Perceptron()

clf_list = [clf_A, clf_B, clf_D]

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(Xnew, y,
                                                    test_size = 0.2,
                                                    random_state = 10,
                                                    stratify = y)

for clf in clf_list:
    clf.fit(X_train,list(y_train.values))
    y_pred = clf.predict(X_test)
    print(clf)
    print('Accuracy Train: ',clf.score(X_train,list(y_train.values)))
    print('Accuracy Test: ',clf.score(X_test,list(y_test.values)))
    print('Precision: ',precision_score(list(y_test.values),y_pred))
    print('Recall: ',recall_score(list(y_test.values),y_pred))
    print('F1 Score: ',f1_score(list(y_test.values),y_pred))
    print('Confusion Matrix: \n',confusion_matrix(list(y_test.values), y_pred))
    print(' ')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Accuracy Train:  0.875
Accuracy Test:  0.6666666666666666
Precision:  0.6
Recall:  1.0
F1 Score:  0.7499999999999999
Confusion Matrix: 
[[1 2]
 [0 3]]
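
These metrics can be verified by hand from the confusion matrix, whose rows are true classes and whose columns are predicted classes (scikit-learn's convention). A short worked check for the matrix above:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[1, 2],
      [0, 3]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With only six test stores, each misclassification moves accuracy by about 17 percentage points, so these scores should be read as rough estimates.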
 
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Accuracy Train:  0.9166666666666666
Accuracy Test:  0.6666666666666666
Precision:  0.6
Recall:  1.0
F1 Score:  0.7499999999999999
Confusion Matrix: 
[[1 2]
 [0 3]]
 
Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=F