# 100 ZIP Profiles

## Purpose:
<p>This notebook builds a profile for each ZIP code present in the input file. Each profile contains business-related metrics for that area.</p>

## Input:
'business_2_interactions.pkl'
## Output:
'ZIPprofiles.pkl' and 'ZIPprofiles.csv' (the CSV copy is needed for use in Python 2)

In [1]:
import sys
import pandas as pd
import numpy as np
import os
module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
%matplotlib inline

## Set-up for creating our ZIP Profiles
First, let's load our dataset of businesses into a dataframe.

In [2]:
businessPrepped = pd.read_pickle('../../data/analysis/business_2_interactions.pkl')

In [3]:
businessPrepped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174567 entries, 0 to 174566
Data columns (total 11 columns):
business_id             174567 non-null object
name                    174567 non-null object
state                   174566 non-null object
stars                   174567 non-null float64
review_count            174567 non-null int64
is_open                 174567 non-null int64
postal_code             174567 non-null object
categories              174567 non-null object
checkins                174567 non-null float64
tipcount                174567 non-null float64
interactionsWeighted    174567 non-null float64
dtypes: float64(4), int64(2), object(5)
memory usage: 16.0+ MB


Let's remind ourselves of what's in the business dataset, so we know what we'd like to use for our ZIP code area metrics.

In [4]:
businessPrepped.head()

Unnamed: 0,business_id,name,state,stars,review_count,is_open,postal_code,categories,checkins,tipcount,interactionsWeighted
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",AZ,4.0,22,1,85044,Dentists;General Dentistry;Health & Medical;Or...,39.0,5.0,109.685216
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",PA,3.0,11,1,15317,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...,15.0,1.0,46.415652
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",AZ,1.5,18,1,85017,Departments of Motor Vehicles;Public Services ...,6.0,0.0,53.123477
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",AZ,3.0,9,0,85282,Sporting Goods;Shopping,120.0,3.0,151.415652
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",OH,3.5,116,1,44221,American (New);Nightlife;Bars;Sandwiches;Ameri...,263.0,17.0,611.190138


We would like to know the mean rating of open businesses in an area, as well as the mean rating of closed businesses.
<p>Let's define functions that create series of open and closed business ratings, which can later be grouped and aggregated by area.</p>

In [5]:

#Returns NaN for closed businesses, and the current star rating for open businesses.
def openStars(row):
    if row['is_open'] == 0: #If business is not open, return NaN. Otherwise, return the rating of the open business.
        return np.nan
    return row['stars']

#Returns NaN for open businesses, and the star rating for closed businesses.
def closedStars(row):
    if row['is_open'] == 1: #If business is open, return NaN. Otherwise, return the rating of the closed business.
        return np.nan
    return row['stars']


Now let's apply these functions, and observe the resulting dataframe.

In [6]:
# axis = 1, indicates that we will iterate over rows. Each row will have openStars performed on it.
#The resulting series will be added to the original dataframe as 'open_stars'.
businessPrepped['open_stars'] = businessPrepped.apply(lambda row : openStars(row), axis=1)

#Similarly, closedStars is performed on the dataframe below.
businessPrepped['closed_stars'] = businessPrepped.apply(lambda row : closedStars(row), axis=1)

businessPrepped.head()

Unnamed: 0,business_id,name,state,stars,review_count,is_open,postal_code,categories,checkins,tipcount,interactionsWeighted,open_stars,closed_stars
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",AZ,4.0,22,1,85044,Dentists;General Dentistry;Health & Medical;Or...,39.0,5.0,109.685216,4.0,
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",PA,3.0,11,1,15317,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...,15.0,1.0,46.415652,3.0,
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",AZ,1.5,18,1,85017,Departments of Motor Vehicles;Public Services ...,6.0,0.0,53.123477,1.5,
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",AZ,3.0,9,0,85282,Sporting Goods;Shopping,120.0,3.0,151.415652,,3.0
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",OH,3.5,116,1,44221,American (New);Nightlife;Bars;Sandwiches;Ameri...,263.0,17.0,611.190138,3.5,
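
The same split can probably be done without a row-wise `apply`. The sketch below uses `Series.where` on a small made-up frame (the `toy` dataframe and its values are assumptions for illustration, not rows from the dataset) and should produce the same kind of NaN-masked columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for businessPrepped; the values are made up for illustration.
toy = pd.DataFrame({
    "stars": [4.0, 3.0, 1.5, 3.0],
    "is_open": [1, 1, 1, 0],
})

# Series.where keeps a value where the condition is True and puts NaN elsewhere.
toy["open_stars"] = toy["stars"].where(toy["is_open"] == 1)
toy["closed_stars"] = toy["stars"].where(toy["is_open"] == 0)

print(toy)
```

Because the mean aggregation skips NaN by default, either approach yields a per-area mean over only the open (or only the closed) businesses.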


As with the open and closed business ratings above, let's split the 'categories' column into 'open_categories' and 'closed_categories' columns. Each will hold the category string of either open or closed businesses.

In [7]:

#Returns an empty string "" for closed businesses, and the category string for open businesses.
def openCategories(row):
    if row['is_open'] == 0:
        return ""
    return row['categories']

#Returns an empty string "" for open businesses, and the category string for closed businesses.
def closedCategories(row):
    if row['is_open'] == 1:
        return ""
    return row['categories']



We'll apply these new functions to the dataframe, save the resulting series as new columns, and then observe the results.

In [8]:
# Axis = 1 indicates that we are iterating over rows and not columns
businessPrepped['closed_categories'] = businessPrepped.apply(lambda row : closedCategories(row), axis=1)
businessPrepped['open_categories'] = businessPrepped.apply(lambda row : openCategories(row), axis=1)
businessPrepped.head()


##  Creating our new ZIP profile dataframe.

Let's create ZIP profiles containing the following information:
- ZIP Code
- State
- Number of businesses
- Number of Open Businesses
- % Businesses Closed
- Avg. Review Count of the businesses in the area.
- Avg. Tip Counts of the businesses
- Avg. Checkins
- Avg. Interactions (A weighted combination of Reviews,Tips and Checkins.)
- Mean Star Rating of Open Businesses
- Mean Star Rating of Closed Businesses
- Standard Deviation of all Star Ratings in each area.
- Count of each open category in each area.
- Count of each closed category in each area.

Let's create a function that will aggregate category columns in different groups.

The categories of a business are stored as a single string, delimited by semicolons.

We will add these to a dictionary for the area, which will record the occurrences of every local category.

In [9]:
def categoryAgg(categoryColumn):
    areaCategories= {} # Empty Dictionary to fill.
    
    # For each business categories string...
    for value in categoryColumn:
        #Split this string on the semi-colons, and store them in a list.
        categoryList = value.lower().split(";")
        #For each category in this list...
        for category in categoryList:
            #If the category is already in our dictionary, increment the number associated by 1.
            if category in areaCategories:
                areaCategories[category] += 1
            #Otherwise, if the category isn't an empty string, add it to our dictionary with a value of 1.
            elif category != "":
                areaCategories[category] = 1
    return areaCategories
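
For comparison, here is a minimal sketch of the same counting logic using `collections.Counter`. The function name `category_counts` and the `sample` strings are hypothetical; the intent is only to show what a category dictionary looks like for a small group.

```python
from collections import Counter

def category_counts(category_column):
    """Count lower-cased categories across semicolon-delimited strings."""
    counts = Counter()
    for value in category_column:
        # Skip empty pieces, e.g. the "" placeholder used for excluded businesses.
        counts.update(c for c in value.lower().split(";") if c)
    return dict(counts)

# Made-up category strings for three businesses in one area.
sample = ["Sporting Goods;Shopping", "", "Shopping;Fashion"]
result = category_counts(sample)
print(result)
```

Lower-casing before counting means that 'Shopping' and 'shopping' are treated as one category.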

<p>Now we'll create a function that will record a dictionary of the number of various chains in each area. It will do this by aggregating the business 'name' column in a group.</p>
<p>A business qualifies as a 'chain' if its name is one of the 10 most frequently occurring names, which we uncovered in our dataset overview.</p>

In [10]:
#Operates on 'name' column and returns a dictionary counting the number of different chains.
def chainAgg(nameColumn):
    #Self-determined list of chains based on their appearances in dataset.
    #Names in the dataset include literal double quotes, so the entries here do too.
    chainList = ['"Starbucks"', '"McDonald\'s"', '"Subway"', '"Pizza Hut"', '"Taco Bell"', '"Burger King"', '"Walgreens"', '"Wendy\'s"',
                 '"The UPS Store"', '"Tim Hortons"']
    
    #Empty Dictionary to record chain occurrences. 
    areaChains= {}
    
    #For each name in the column, if it's in our list of chains...
    for name in nameColumn:
        if name in chainList:
            #if the chain is already in the 'areaChains' dictionary, increment the corresponding count.
            if name in areaChains:
                areaChains[name] += 1
            #otherwise add the chain to the dictionary with a value of 1.
            else :
                areaChains[name] = 1
    return areaChains
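
The matching is an exact string comparison, so quoting matters. The sketch below illustrates this with a hypothetical helper `chain_counts` and a deliberately shortened chain set (both are assumptions for illustration, not the notebook's own list).

```python
from collections import Counter

# Names in this dataset carry literal double quotes, so the chain names must too.
# This short set is illustrative only, not the notebook's full list.
CHAINS = {'"Starbucks"', '"McDonald\'s"', '"Subway"'}

def chain_counts(name_column):
    """Count how often each known chain name appears in a column of names."""
    return dict(Counter(name for name in name_column if name in CHAINS))

# Made-up names: the last entry lacks the surrounding quotes and is not counted.
names = ['"Starbucks"', '"McDonald\'s"', '"Starbucks"', '"Dental by Design"', 'Subway']
result = chain_counts(names)
print(result)
```

A misplaced quote or apostrophe in a chain entry silently produces zero matches rather than an error, so it is worth checking each entry against the values in the 'name' column.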

Now, let's group the businesses by postal/ZIP code, and also by state so that we retain the state for each ZIP code.

<p>We'll also aggregate them to get our desired metrics (e.g. 'is_open' will be summed together to find the number of open businesses in each area.)</p>

In [11]:

zipProfileGroups = businessPrepped.groupby(['postal_code', 'state']).agg({
    'business_id': 'count', 'is_open': 'sum',
    'review_count': 'mean', 'checkins': 'mean', 'tipcount': 'mean', 'interactionsWeighted': 'mean',
    'open_stars': 'mean', 'closed_stars': 'mean', 'stars': 'std',
    'open_categories': categoryAgg, 'closed_categories': categoryAgg, 'name': chainAgg})
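
As an aside, pandas 0.25 and later support named aggregation, which assigns the output column names during the `agg` call itself. The sketch below assumes such a version is available (the notebook's environment may be older) and uses a made-up `toy` frame with hypothetical output names.

```python
import pandas as pd

# Made-up rows: two businesses in one ZIP code, one business in another.
toy = pd.DataFrame({
    "postal_code": ["85044", "85044", "15317"],
    "state": ["AZ", "AZ", "PA"],
    "business_id": ["a", "b", "c"],
    "is_open": [1, 0, 1],
    "stars": [4.0, 3.0, 3.5],
})

# Each keyword is an output column: name=(input column, aggregation).
profiles = toy.groupby(["postal_code", "state"]).agg(
    num_businesses=("business_id", "count"),
    num_open=("is_open", "sum"),
    std_rating=("stars", "std"),
)
print(profiles)
```

Note that the sample standard deviation is NaN for a group with a single business, which is why the rating standard deviation can be missing for small areas.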

The aggregated result is indexed by both postal code and state. We will wrap it in a dataframe and reset the 'state' index level so that it becomes a regular column, leaving the ZIP code alone as our index.

In [12]:
zipProfileDf = pd.DataFrame(zipProfileGroups)
zipProfileDf.reset_index(level = 1, inplace = True)
zipProfileDf.head(3)

postal_code,state,business_id,is_open,review_count,checkins,tipcount,interactionsWeighted,open_stars,closed_stars,stars,open_categories,closed_categories,name
2224,OH,1,1,7.0,13.0,0.0,31.325797,2.5,,,"{'restaurants': 1, 'fast food': 1, 'burgers': 1}",{},{}
5440,VT,2,2,4.0,2.5,1.0,15.589855,4.25,,1.06066,"{'bed & breakfast': 1, 'hotels & travel': 1, '...",{},{}
5452,PA,1,1,49.0,1.0,7.0,147.606374,4.0,,,"{'restaurants': 1, 'cafes': 1, 'delis': 1}",{},{}


Now we'll calculate the percentage of businesses that are closed in each area. <p>This is the number of closed businesses (the total, currently 'business_id', minus the number of open businesses, currently 'is_open'), divided by the total number of businesses in the area and multiplied by 100.</p>

In [13]:
zipProfileDf['%closed'] = ((zipProfileDf['business_id'] - zipProfileDf['is_open']) / zipProfileDf['business_id'])*100

zipProfileDf[zipProfileDf['business_id'] > 20].sort_values('%closed', ascending = False).head(1)

postal_code,state,business_id,is_open,review_count,checkins,tipcount,interactionsWeighted,open_stars,closed_stars,stars,open_categories,closed_categories,name,%closed
15289,PA,23,7,36.173913,90.652174,5.956522,200.948429,3.642857,3.5,0.76742,"{'museums': 1, 'colleges & universities': 2, '...","{'sandwiches': 1, 'restaurants': 13, 'bakeries...",{},69.565217
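
As a quick check of the formula, the figures for ZIP code 15289 in the row above (23 businesses, 7 of them open) can be worked through by hand. The variable names below are illustrative only.

```python
# Figures taken from the 15289 row: 23 businesses in total, 7 still open.
num_businesses = 23
num_open = 7

# Closed share = (total - open) / total, expressed as a percentage.
pct_closed = (num_businesses - num_open) / num_businesses * 100
print(round(pct_closed, 6))
```

The result agrees with the '%closed' value of 69.565217 shown for that row.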


We will now alter the column titles to accurately reflect their contents. We will also rearrange the columns as desired.

In [14]:
#renaming columns to accurately reflect their contents.
zipProfileDf.columns=['state','num_businesses','num_open','num_reviews','num_checkins','num_tips','num_interactions','open_rating','closed_rating','std.dev_rating',\
                      'open_categories','closed_categories','chains','%closed']

#Now Rearrange Columns
zipProfileDf = zipProfileDf[['state', 'num_businesses','num_open','%closed','num_reviews','num_checkins','num_tips','num_interactions','open_rating','closed_rating',\
                             'std.dev_rating','open_categories','closed_categories','chains']]


Let's count the number of chains in each area as well. We can then use this number to calculate the percentage of businesses in an area that are chains.

<p>First we'll define a function to count the number of chains using the dictionary holding the counts of chains in an area.</p>

In [15]:
def countChains(chainDict):
    #declare a counter to record number of chains in dictionary
    count = 0
    
    #if chain dictionary is empty return 0
    if not chainDict:
        return count
    
    #add each value in the dictionary to the count, these values are the number of specific chain businesses in the area.
    count = sum(chainDict.values())
        
    return count

Now let's apply this function to the 'chains' column of our dataframe, which holds these chain dictionaries.

In [16]:
zipProfileDf['num_chains'] = zipProfileDf['chains'].apply(countChains)

Let's calculate the percentage of businesses that are chains in an area as well.

In [17]:
zipProfileDf['%chains'] = (zipProfileDf['num_chains'] / zipProfileDf['num_businesses'])*100

Let's make sure there are no percentages outside the 0-100 range.

In [19]:
percentageMask = (zipProfileDf['%chains'] < 0) | (zipProfileDf['%chains'] > 100)
zipProfileDf[percentageMask].head()

postal_code,state,num_businesses,num_open,%closed,num_reviews,num_checkins,num_tips,num_interactions,open_rating,closed_rating,std.dev_rating,open_categories,closed_categories,chains,num_chains,%chains
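
A value cannot be both below 0 and above 100, so an out-of-range check has to combine the two conditions with OR. The sketch below shows this on a made-up series `pct`, alongside what should be an equivalent check using `Series.between` (inclusive on both ends by default).

```python
import pandas as pd

# Made-up percentages, two of which fall outside the valid range.
pct = pd.Series([0.0, 12.5, 100.0, 101.0, -3.0])

# Out of range means below 0 OR above 100.
out_of_range = (pct < 0) | (pct > 100)

# Same check expressed as "not within [0, 100]".
same_check = ~pct.between(0, 100)

# Combining the two conditions with AND can never match any value.
never_true = (pct < 0) & (pct > 100)

print(pct[out_of_range])
```

An empty result from the OR version is the meaningful confirmation that every percentage is valid.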


## Review
Let's review the information of our dataframe.

In [20]:
zipProfileDf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1536 entries, 02224 to nan
Data columns (total 16 columns):
state                1536 non-null object
num_businesses       1536 non-null int64
num_open             1536 non-null int64
%closed              1536 non-null float64
num_reviews          1536 non-null float64
num_checkins         1536 non-null float64
num_tips             1536 non-null float64
num_interactions     1536 non-null float64
open_rating          1498 non-null float64
closed_rating        939 non-null float64
std.dev_rating       1198 non-null float64
open_categories      1536 non-null object
closed_categories    1536 non-null object
chains               1536 non-null object
num_chains           1536 non-null int64
%chains              1536 non-null float64
dtypes: float64(9), int64(3), object(4)
memory usage: 204.0+ KB


It appears that some businesses have a postal code of 'nan'. Let's remove the ZIP profiles corresponding to it.

In [21]:
zipProfileDf = zipProfileDf.drop('nan')

Now we'll review the info again. We no longer have any 'nan' entries, and the number of profiles has dropped from 1536 to 1514.

In [22]:
zipProfileDf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1514 entries, 02224 to YO22
Data columns (total 16 columns):
state                1514 non-null object
num_businesses       1514 non-null int64
num_open             1514 non-null int64
%closed              1514 non-null float64
num_reviews          1514 non-null float64
num_checkins         1514 non-null float64
num_tips             1514 non-null float64
num_interactions     1514 non-null float64
open_rating          1476 non-null float64
closed_rating        931 non-null float64
std.dev_rating       1182 non-null float64
open_categories      1514 non-null object
closed_categories    1514 non-null object
chains               1514 non-null object
num_chains           1514 non-null int64
%chains              1514 non-null float64
dtypes: float64(9), int64(3), object(4)
memory usage: 201.1+ KB
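
The row count falls by 22 rather than by 1, which is consistent with the label 'nan' appearing on several rows of the index (presumably one per state, although that is an inference rather than something shown above). A minimal sketch with a made-up `toy` frame illustrates that `drop` removes every row carrying the given label:

```python
import pandas as pd

# Made-up profiles where the index label 'nan' occurs twice.
toy = pd.DataFrame(
    {"state": ["AZ", "ON", "QC"], "num_businesses": [5, 1, 2]},
    index=pd.Index(["85044", "nan", "nan"], name="postal_code"),
)

# drop() removes all rows whose index label matches, not just the first.
cleaned = toy.drop("nan")
print(len(toy), "->", len(cleaned))
```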


Now we save the dataframe to a .pkl file and a .csv file (for Python 2).

In [23]:
zipProfileDf.to_pickle('../../data/analysis/ZIPprofiles.pkl')
zipProfileDf.to_csv('../../data/analysis/ZIPprofiles.csv')
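
When the CSV copy is read back, two details are easy to miss: ZIP codes such as 02224 lose their leading zero unless they are read as strings, and the dictionary columns come back as plain text. The sketch below round-trips a made-up one-row frame through an in-memory buffer; the `toy` data is an assumption, and `ast.literal_eval` is one possible way to rebuild the dictionaries.

```python
import ast
import io

import pandas as pd

# Made-up single profile with a leading-zero ZIP code and a dictionary column.
toy = pd.DataFrame(
    {"state": ["OH"], "chains": [{'"Subway"': 2}]},
    index=pd.Index(["02224"], name="postal_code"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read ZIP codes as strings and parse the dictionary text back into a dict.
restored = pd.read_csv(
    buffer,
    dtype={"postal_code": str},
    converters={"chains": ast.literal_eval},
).set_index("postal_code")

print(restored)
```

The pickle file keeps both the string index and the dictionary objects as they are, so these steps only matter for the CSV copy.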