# 300 State Profiles

## Purpose:
<p>This notebook creates a dataset of profiles for the states present in the input file. Each profile contains business-related metrics for that state.</p>

## Input:
'businessFinal.pkl'
## Output:
'stateProfiles.pkl' & 'stateProfiles.csv' (the CSV is needed for use in Python 2)

In [1]:
import os
import sys
import pandas as pd
import numpy as np
module_path = os.path.abspath(os.path.join('../../data/..'))
if module_path not in sys.path:
    sys.path.append(module_path)
    
%matplotlib inline

## Set-up for creating our State Profiles
First, let's load our dataset of businesses into a dataframe.

In [2]:
businessPrepped = pd.read_pickle('../../data/analysis/businessFinal.pkl')

In [3]:
businessPrepped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174567 entries, 0 to 174566
Data columns (total 11 columns):
business_id             174567 non-null object
name                    174567 non-null object
state                   174566 non-null object
stars                   174567 non-null float64
review_count            174567 non-null int64
is_open                 174567 non-null int64
postal_code             174567 non-null object
categories              174567 non-null object
checkins                174567 non-null float64
tipcount                174567 non-null float64
interactionsWeighted    174567 non-null float64
dtypes: float64(4), int64(2), object(5)
memory usage: 16.0+ MB



Let's remind ourselves of what's in the business dataset, so we know what we'd like to use for our state profile metrics.

In [4]:
businessPrepped.head()

Unnamed: 0,business_id,name,state,stars,review_count,is_open,postal_code,categories,checkins,tipcount,interactionsWeighted
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",AZ,4.0,22,1,85044,Dentists;General Dentistry;Health & Medical;Or...,39.0,5.0,109.685216
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",PA,3.0,11,1,15317,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...,15.0,1.0,46.415652
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",AZ,1.5,18,1,85017,Departments of Motor Vehicles;Public Services ...,6.0,0.0,53.123477
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",AZ,3.0,9,0,85282,Sporting Goods;Shopping,120.0,3.0,151.415652
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",OH,3.5,116,1,44221,American (New);Nightlife;Bars;Sandwiches;Ameri...,263.0,17.0,611.190138


We would like to know the mean rating of open businesses in an area, as well as the mean rating of closed businesses.
<p>Let's define functions that produce one series of ratings for open businesses and one for closed businesses, which can later be grouped and aggregated by area.</p>

In [5]:

#Returns NaN for closed businesses, and the current star rating for open businesses.
def openStars(row):
    if row['is_open'] == 0: #If business is not open, return NaN. Otherwise, return the rating of the open business.
        return np.nan
    return row['stars']

#Returns NaN for open businesses, and the current star rating for closed businesses.
def closedStars(row):
    if row['is_open'] == 1: #If business is open, return NaN. Otherwise, return the rating of the closed business.
        return np.nan
    return row['stars']


Now let's apply these functions, and observe the resulting dataframe.

In [6]:
# axis = 1, indicates that we will iterate over rows. Each row will have openStars performed on it.
#The resulting series will be added to the original dataframe as 'open_stars'.
businessPrepped['open_stars'] = businessPrepped.apply(lambda row : openStars(row), axis=1)

#Similarly, closedStars is performed on the dataframe below.
businessPrepped['closed_stars'] = businessPrepped.apply(lambda row : closedStars(row), axis=1)

businessPrepped.head()

Unnamed: 0,business_id,name,state,stars,review_count,is_open,postal_code,categories,checkins,tipcount,interactionsWeighted,open_stars,closed_stars
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",AZ,4.0,22,1,85044,Dentists;General Dentistry;Health & Medical;Or...,39.0,5.0,109.685216,4.0,
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",PA,3.0,11,1,15317,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...,15.0,1.0,46.415652,3.0,
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",AZ,1.5,18,1,85017,Departments of Motor Vehicles;Public Services ...,6.0,0.0,53.123477,1.5,
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",AZ,3.0,9,0,85282,Sporting Goods;Shopping,120.0,3.0,151.415652,,3.0
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",OH,3.5,116,1,44221,American (New);Nightlife;Bars;Sandwiches;Ameri...,263.0,17.0,611.190138,3.5,
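Row-wise `apply` calls a Python function once per row, which can be slow on a dataframe of this size. As a minimal sketch of an alternative, the same two columns can be built with `Series.where`, which keeps values where a condition holds and fills NaN elsewhere. The `toy` dataframe below is a hypothetical stand-in for `businessPrepped`, with illustrative values only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for businessPrepped (illustrative values only).
toy = pd.DataFrame({'stars': [4.0, 3.0, 1.5, 3.0],
                    'is_open': [1, 1, 1, 0]})

# Keep the rating where the business is open; NaN otherwise.
toy['open_stars'] = toy['stars'].where(toy['is_open'] == 1)

# Keep the rating where the business is closed; NaN otherwise.
toy['closed_stars'] = toy['stars'].where(toy['is_open'] == 0)

print(toy)
```

Because the NaN values are skipped by `mean`, grouping on these columns later gives the mean rating of open and of closed businesses separately.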


As with the open and closed business ratings above, let's split the category column into 'open_categories' and 'closed_categories'. Each will record the category string of either open or closed businesses.

In [7]:

#Returns Empty String "" for closed businesses, and the current category string for open businesses.
def openCategories(row):
    if row['is_open'] == 0:
        return ""
    return row['categories']

#Returns Empty String "" for open businesses, and the current category string for closed businesses.
def closedCategories(row):
    if row['is_open'] == 1:
        return ""
    return row['categories']



We'll apply these new functions to the dataframe, save the resulting series as new columns, and then observe the results.

In [8]:
# Axis = 1 indicates that we are iterating over rows and not columns
businessPrepped['closed_categories'] = businessPrepped.apply(lambda row : closedCategories(row), axis=1)
businessPrepped['open_categories'] = businessPrepped.apply(lambda row : openCategories(row), axis=1)
businessPrepped.head()

Unnamed: 0,business_id,name,state,stars,review_count,is_open,postal_code,categories,checkins,tipcount,interactionsWeighted,open_stars,closed_stars,closed_categories,open_categories
0,FYWN1wneV18bWNgQjJ2GNg,"""Dental by Design""",AZ,4.0,22,1,85044,Dentists;General Dentistry;Health & Medical;Or...,39.0,5.0,109.685216,4.0,,,Dentists;General Dentistry;Health & Medical;Or...
1,He-G7vWjzVUysIKrfNbPUQ,"""Stephen Szabo Salon""",PA,3.0,11,1,15317,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...,15.0,1.0,46.415652,3.0,,,Hair Stylists;Hair Salons;Men's Hair Salons;Bl...
2,KQPW8lFf1y5BT2MxiSZ3QA,"""Western Motor Vehicle""",AZ,1.5,18,1,85017,Departments of Motor Vehicles;Public Services ...,6.0,0.0,53.123477,1.5,,,Departments of Motor Vehicles;Public Services ...
3,8DShNS-LuFqpEWIp0HxijA,"""Sports Authority""",AZ,3.0,9,0,85282,Sporting Goods;Shopping,120.0,3.0,151.415652,,3.0,Sporting Goods;Shopping,
4,PfOCPjBrlQAnz__NXj9h_w,"""Brick House Tavern + Tap""",OH,3.5,116,1,44221,American (New);Nightlife;Bars;Sandwiches;Ameri...,263.0,17.0,611.190138,3.5,,,American (New);Nightlife;Bars;Sandwiches;Ameri...


##  Creating our new State profile dataframe.

Let's create state profiles containing the following information:
- State
- Number of businesses
- Number of Open Businesses
- % Businesses Closed
- Avg. Review Count of the businesses in the area.
- Avg. Tip Counts of the businesses
- Avg. Checkins
- Avg. Interactions (A weighted combination of Reviews,Tips and Checkins.)
- Mean Star Rating of Open Businesses
- Mean Star Rating of Closed Businesses
- Standard Deviation of all Star Ratings in each area.
- Count of each open category in each area.
- Count of each closed category in each area.

Let's create a function that will aggregate category columns in different groups.

The categories of a business are stored as a single string, delimited by semicolons.

We will add these to a dictionary for the area, which will record the occurrences of every local category.

In [9]:
def categoryAgg(categoryColumn):
    areaCategories= {} # Empty Dictionary to fill.
    
    # For each business categories string...
    for value in categoryColumn:
        #Split this string on the semi-colons, and store them in a list.
        categoryList = value.lower().split(";")
        #For each category in this list...
        for category in categoryList:
            #If the category is already in our dictionary, increment the number associated by 1.
            if category in areaCategories:
                areaCategories[category] += 1
            #Else if the category isn't an empty string, add it to our dictionary with an initial value of 1.
            elif category != "":
                areaCategories[category] = 1
    return areaCategories
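The counting logic above can also be sketched with `collections.Counter` from the standard library. This is an illustrative alternative, not the notebook's own method; `category_counts` and the `sample` series are hypothetical names introduced here.

```python
from collections import Counter

import pandas as pd

def category_counts(category_column):
    """Count lower-cased, semicolon-delimited categories, skipping empty strings."""
    counts = Counter()
    for value in category_column:
        # An empty string splits to [''], which the filter below drops.
        counts.update(c for c in value.lower().split(';') if c)
    return dict(counts)

# Hypothetical sample: two businesses with categories and one empty entry.
sample = pd.Series(['Food;Coffee & Tea', '', 'Coffee & Tea;Food;Cafes'])
result = category_counts(sample)
print(result)
```

The empty string in `sample` mimics the value that `openCategories` and `closedCategories` return for businesses that do not belong in the column, so it contributes nothing to the counts.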

<p>Now we'll create a function that will record a dictionary of the number of various chains in each area. It will do this by aggregating the business 'name' column in a group.</p>
<p>A business qualifies as a 'chain' if it is one of the top 10 occurring names which we uncovered in our dataset overview.</p>

In [10]:
#Operates on 'name' column and returns a dictionary counting the number of different chains.
def chainAgg(nameColumn):
    #Self-determined list of chains based on their appearances in dataset.
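    #Names in the dataset include their surrounding double quotes, so each entry must include them too.
    chainList = ['"Starbucks"', '"McDonald\'s"', '"Subway"', '"Pizza Hut"', '"Taco Bell"', '"Burger King"', '"Walgreens"', '"Wendy\'s"',
                 '"The UPS Store"', '"Tim Hortons"']
    
    #Empty Dictionary to record chain occurrences.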
    areaChains= {}
    
    #For each name in the column, if it's in our list of chains...
    for name in nameColumn:
        if name in chainList:
            #if the chain is already in the 'areaChains' dictionary, increment the corresponding count.
            if name in areaChains:
                areaChains[name] += 1
            #otherwise add the chain to the dictionary with a value of 1.
            else :
                areaChains[name] = 1
    return areaChains
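For comparison, the same kind of chain tally can be sketched with pandas built-ins: `Series.isin` filters the names and `value_counts` counts them. The shortened `chains` list and the `names` series are hypothetical examples, not the notebook's data.

```python
import pandas as pd

# Shortened, hypothetical chain list; names keep their surrounding double quotes.
chains = ['"Starbucks"', '"Subway"']

# Hypothetical 'name' column for one area.
names = pd.Series(['"Starbucks"', '"Nippy Taxis"', '"Subway"', '"Starbucks"'])

# Keep only names that are in the chain list, then count each one.
chain_counts = names[names.isin(chains)].value_counts().to_dict()
print(chain_counts)
```

Note that the match is exact, so a name only counts as a chain if it equals a list entry character for character, quotes included.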

Now, let's group the businesses by state.

<p>We'll also aggregate them to get our desired metrics (e.g. 'is_open' will be summed together to find the number of open businesses in each area.)</p>

In [11]:

stateProfileGroups = businessPrepped.groupby(['state']).agg({
    'business_id': 'count', 'is_open': 'sum',
    'review_count': 'mean', 'checkins': 'mean', 'tipcount': 'mean', 'interactionsWeighted': 'mean',
    'open_stars': 'mean', 'closed_stars': 'mean', 'stars': 'std',
    'open_categories': categoryAgg, 'closed_categories': categoryAgg, 'name': chainAgg})
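To make the grouping step concrete, here is a small sketch of `groupby(...).agg(...)` with a dictionary that mixes built-in aggregations (given as strings) and a custom function. The `toy` dataframe and the `rating_spread` helper are hypothetical and only illustrate the mechanism.

```python
import pandas as pd

# Hypothetical miniature of the business dataframe.
toy = pd.DataFrame({
    'state': ['AZ', 'AZ', 'OH'],
    'business_id': ['a', 'b', 'c'],
    'is_open': [1, 0, 1],
    'stars': [4.0, 3.0, 3.5],
})

def rating_spread(stars):
    """Custom aggregator: difference between highest and lowest rating in the group."""
    return stars.max() - stars.min()

# Each key is a column; each value is the aggregation applied to that column per state.
profile = toy.groupby('state').agg({
    'business_id': 'count',   # number of businesses
    'is_open': 'sum',         # number of open businesses
    'stars': rating_spread,   # custom function receives the group's Series
})

# Turn the 'state' index back into an ordinary column.
profile = profile.reset_index()
print(profile)
```

A custom function passed this way receives one column of one group at a time, which is how `categoryAgg` and `chainAgg` are called in the cell above.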

The aggregation result is indexed by 'state'. We'll wrap it in a dataframe and reset the index, so that 'state' becomes an ordinary column and the rows get a default integer index.

In [12]:
stateProfileDf = pd.DataFrame(stateProfileGroups)
stateProfileDf.reset_index(inplace = True)
stateProfileDf.head(20)

Unnamed: 0,state,business_id,is_open,review_count,checkins,tipcount,interactionsWeighted,open_stars,closed_stars,stars,open_categories,closed_categories,name
0,01,10,10,4.0,9.0,1.5,23.39884,3.7,,0.788811,"{'restaurants': 10, 'italian': 2, 'pizza': 3, ...",{},{}
1,3,1,1,4.0,5.0,0.0,15.471884,5.0,,,"{'austrian': 1, 'restaurants': 1, 'beer garden...",{},{}
2,30,1,1,3.0,0.0,0.0,7.853913,3.0,,,"{'restaurants': 1, 'french': 1}",{},{}
3,6,3,3,3.333333,4.666667,0.666667,15.138551,3.666667,,1.154701,"{'austrian': 1, 'restaurants': 1, 'nightlife':...",{},{}
4,AB,1,1,3.0,0.0,0.0,7.853913,5.0,,,"{'event planning & services': 1, 'private tuto...",{},{}
5,ABE,3,3,4.0,12.333333,0.666667,24.550531,4.0,,0.866025,"{'restaurants': 1, 'coffee & tea': 1, 'food': ...",{},{}
6,AK,1,1,22.0,14.0,2.0,76.831303,2.5,,,"{'restaurants': 1, 'fast food': 1, 'burgers': 1}",{},{}
7,AL,1,1,10.0,7.0,4.0,43.651593,5.0,,,"{'hair removal': 1, 'event planning & services...",{},{}
8,AR,2,2,15.0,4.0,1.0,45.887535,3.25,,2.474874,"{'festivals': 1, 'arts & entertainment': 1, 'h...",{},{}
9,AZ,52214,44045,31.173498,104.974087,6.551519,203.737084,3.762606,3.546028,1.060034,"{'dentists': 1362, 'general dentistry': 1041, ...","{'sporting goods': 125, 'shopping': 1409, 'boo...","{'""The UPS Store""': 99, '""Taco Bell""': 109, '""..."


We can see that there are a lot of states with very few businesses. If we take a closer look, we can see that some of them were not expected to be in our dataset. For example, 'NYK' is North Yorkshire in England.


In [13]:
businessPrepped[businessPrepped['state'] == 'NYK']

Unnamed: 0,business_id,name,state,stars,review_count,is_open,postal_code,categories,checkins,tipcount,interactionsWeighted,open_stars,closed_stars,closed_categories,open_categories
1009,qzA-93v2lwf9qnCOPyWiUQ,"""Scarborough Deaf Club""",NYK,4.0,4,1,YO11,Public Services & Government,0.0,0.0,10.471884,4.0,,,Public Services & Government
1792,PjzIBRm4pxV_mivCZtmZlw,"""Duke Of Wellington""",NYK,5.0,3,1,YO21,Hotels & Travel;Event Planning & Services;Hotels,0.0,0.0,7.853913,5.0,,,Hotels & Travel;Event Planning & Services;Hotels
1841,PX9ceQzCEX5AEn73_UlZ2g,"""the coffee bean""",NYK,4.0,5,1,YO12,Food;Coffee & Tea,0.0,0.0,13.089855,4.0,,,Food;Coffee & Tea
5755,Rv7TH2KWTUXpjFPwTTp4Xg,"""Bonnet & Sons""",NYK,3.5,3,1,YO11,Coffee & Tea;Food;Cafes;Restaurants,0.0,1.0,10.471884,3.5,,,Coffee & Tea;Food;Cafes;Restaurants
7447,zIYzpJ70IEgQI_kAZF0a4g,"""Nippy Taxis""",NYK,2.5,3,1,YO11,Hotels & Travel;Transportation;Taxis,0.0,0.0,7.853913,2.5,,,Hotels & Travel;Transportation;Taxis
8692,qFBEXTIBrAsTsAGaMXOTEw,"""Mainline Menswear""",NYK,3.5,3,1,YO11,Shopping;Men's Clothing;Fashion,0.0,0.0,7.853913,3.5,,,Shopping;Men's Clothing;Fashion
9450,Wijq97TO27mhi_OG-xaW-w,"""Golden Grid Restaurant""",NYK,3.5,3,1,YO11,Seafood;Restaurants,1.0,0.0,8.853913,3.5,,,Seafood;Restaurants
10126,zadkJh0PvDYhP8sw0E8hLQ,"""Scarborough Railway Station""",NYK,3.0,3,1,YO11,Train Stations;Hotels & Travel,9.0,0.0,16.853913,3.0,,,Train Stations;Hotels & Travel
10391,2LdSbmZnJJS5DiwvaAp4Yw,"""Saba Thai Restaurant""",NYK,3.5,6,1,,Shopping,0.0,0.0,15.707826,3.5,,,Shopping
11981,xcPmvEfUMlgUFfLmrF1tQA,"""Pattisons""",NYK,3.5,3,1,YO12,Shopping;Flowers & Gifts;Florists,0.0,0.0,7.853913,3.5,,,Shopping;Flowers & Gifts;Florists


On investigation of 'NYK' we can see that the postal codes cover only a small region of the NYK area. Because of this, coupled with the overall small number of businesses, we do not feel this is representative of a state.

<p>We'll assume a similar situation for most states with low numbers of businesses. To avoid these states, we'll focus our state study on the US and Canada.</p>

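Now we'll calculate the share of businesses that are closed in each area. <p>This is the number of closed businesses (the total number of businesses, currently 'business_id', minus the number of open businesses, currently 'is_open') divided by the total number of businesses in the area. Note that the result is stored as a fraction between 0 and 1 rather than being multiplied by 100.</p>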

In [14]:
stateProfileDf['%closed'] = (stateProfileDf['business_id'] - stateProfileDf['is_open']) / stateProfileDf['business_id']
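As a quick worked check of this formula, the sketch below uses the 'AZ' and '01' counts shown in the grouped output earlier. The `profile` dataframe is a hypothetical two-row extract built by hand.

```python
import pandas as pd

# Two rows copied from the grouped output: total businesses and open businesses.
profile = pd.DataFrame({'business_id': [52214, 10],
                        'is_open': [44045, 10]},
                       index=['AZ', '01'])

# Closed businesses = total - open; dividing by the total gives a fraction in [0, 1].
closed_fraction = (profile['business_id'] - profile['is_open']) / profile['business_id']

# Multiply by 100 if a true percentage is wanted.
closed_percent = closed_fraction * 100
print(closed_fraction)
```

For 'AZ' this gives roughly 0.156 (about 15.6% closed), while '01' has no closed businesses and so evaluates to 0.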

We will now alter the column titles to accurately reflect their contents. We will also rearrange the columns as desired.

In [15]:
stateProfileDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 14 columns):
state                   67 non-null object
business_id             67 non-null int64
is_open                 67 non-null int64
review_count            67 non-null float64
checkins                67 non-null float64
tipcount                67 non-null float64
interactionsWeighted    67 non-null float64
open_stars              65 non-null float64
closed_stars            27 non-null float64
stars                   40 non-null float64
open_categories         67 non-null object
closed_categories       67 non-null object
name                    67 non-null object
%closed                 67 non-null float64
dtypes: float64(8), int64(2), object(4)
memory usage: 7.4+ KB


In [16]:
stateProfileDf.columns=['state','num_businesses','num_open','num_reviews','num_checkins','num_tips','num_interactions'\
                        ,'open_rating','closed_rating',\
                        'std.dev_rating','open_categories','closed_categories','chains','%closed']

#Now Rearrange Columns
stateProfileDf = stateProfileDf[['state', 'num_businesses','num_open','%closed','num_interactions','num_reviews','num_checkins','num_tips',\
                                 'open_rating','closed_rating',\
                             'std.dev_rating','open_categories','closed_categories','chains']]

stateProfileDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 14 columns):
state                67 non-null object
num_businesses       67 non-null int64
num_open             67 non-null int64
%closed              67 non-null float64
num_interactions     67 non-null float64
num_reviews          67 non-null float64
num_checkins         67 non-null float64
num_tips             67 non-null float64
open_rating          65 non-null float64
closed_rating        27 non-null float64
std.dev_rating       40 non-null float64
open_categories      67 non-null object
closed_categories    67 non-null object
chains               67 non-null object
dtypes: float64(8), int64(2), object(4)
memory usage: 7.4+ KB


Let's count the number of chains in each area as well. We can also use this number to calculate the percentage of chains in an area.

<p>First we'll define a function to count the number of chains using the dictionary holding the counts of chains in an area.</p>

In [17]:
def countChains(chainDict):
    #declare a counter to record number of chains in dictionary
    count = 0
    
    #if chain dictionary is empty return 0
    if not chainDict:
        return count
    
    #add each value in the dictionary to the count, these values are the number of specific chain businesses in the area.
    count = sum(chainDict.values())
        
    return count

Now let's apply this function to the 'chains' column of our dataframe, which holds these chain dictionaries.

In [18]:
stateProfileDf['num_chains'] = stateProfileDf['chains'].apply(countChains)

Let's calculate the percentage of businesses that are chains in an area as well.

In [19]:
stateProfileDf['%chains'] = (stateProfileDf['num_chains'] / stateProfileDf['num_businesses'])*100

Let's make sure there are no percentages outside the 0-100 range.

In [20]:
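#A value is out of range if it is below 0 OR above 100, so the two conditions are combined with | (a value can never satisfy both at once).
percentageMask = (stateProfileDf['%chains'] < 0) | (stateProfileDf['%chains'] > 100)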
stateProfileDf[percentageMask].head()

Unnamed: 0,state,num_businesses,num_open,%closed,num_interactions,num_reviews,num_checkins,num_tips,open_rating,closed_rating,std.dev_rating,open_categories,closed_categories,chains,num_chains,%chains
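A range check of this kind can also be sketched with `Series.between`, which is inclusive at both ends by default. The `pct` series below is a hypothetical example containing two deliberately invalid values.

```python
import pandas as pd

# Hypothetical percentages, including two out-of-range values.
pct = pd.Series([2.04, 0.0, 100.0, -1.0, 120.5])

# between(0, 100) is True inside the closed interval; ~ inverts it to flag bad rows.
out_of_range = ~pct.between(0, 100)
print(pct[out_of_range])
```

One thing to keep in mind with this form is that `between` returns False for NaN, so missing values would be flagged as out of range as well.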


## Review

We can see below that after the first 12 states there is a considerable drop-off in the number of businesses in our dataset. We expect that the remaining states are unlikely to be intended inclusions.
<p>The original dataset was supposed to cover 11 metropolitan areas, so we had expected to find only 11 states. However, South and North Carolina both appear in the top 12, so we assume that they are both intended to be included.</p>

In [21]:
stateProfileDf.sort_values('num_businesses', ascending = False).head(14)

Unnamed: 0,state,num_businesses,num_open,%closed,num_interactions,num_reviews,num_checkins,num_tips,open_rating,closed_rating,std.dev_rating,open_categories,closed_categories,chains,num_chains,%chains
9,AZ,52214,44045,0.156452,203.737084,31.173498,104.974087,6.551519,3.762606,3.546028,1.060034,"{'dentists': 1362, 'general dentistry': 1041, ...","{'sporting goods': 125, 'shopping': 1409, 'boo...","{'""The UPS Store""': 99, '""Taco Bell""': 109, '""...",1064,2.037768
43,NV,33086,27491,0.169105,391.281489,55.140754,211.120444,13.676298,3.733495,3.611707,1.030425,"{'real estate services': 453, 'real estate': 1...","{'italian': 193, 'restaurants': 2284, 'office ...","{'""Subway""': 151, '""Burger King""': 48, '""Pizza...",633,1.913196
47,ON,30208,24723,0.181574,106.094045,20.996524,41.997782,3.486659,3.421227,3.358979,0.933462,"{'bakeries': 809, 'bagels': 87, 'food': 4820, ...","{'italian': 295, 'french': 90, 'restaurants': ...","{'""Starbucks""': 249, '""Pizza Hut""': 47, '""Subw...",655,2.1683
38,NC,12956,11099,0.143331,147.725217,23.728466,73.450293,4.642714,3.590819,3.469844,1.013534,"{'restaurants': 2924, 'american (traditional)'...","{'home & garden': 35, 'furniture stores': 17, ...","{'""Wendy's""': 39, '""Pizza Hut""': 32, '""Taco Be...",293,2.2615
46,OH,12609,10920,0.133952,110.799835,19.32564,50.748037,3.612658,3.562546,3.408822,0.999437,"{'american (new)': 475, 'nightlife': 998, 'bar...","{'bakeries': 60, 'food': 334, 'restaurants': 1...","{'""Taco Bell""': 44, '""Pizza Hut""': 47, '""Wendy...",343,2.720279
48,PA,10109,8663,0.143041,127.65982,22.732615,57.86418,3.927589,3.621147,3.523513,0.97275,"{'hair stylists': 74, 'hair salons': 293, 'men...","{'breakfast & brunch': 35, 'gluten-free': 5, '...","{'""Pizza Hut""': 24, '""Subway""': 41, '""Taco Bel...",199,1.968543
50,QC,8169,6925,0.152283,87.107396,17.91884,32.985678,2.754315,3.680578,3.553859,0.846419,"{'italian': 335, 'restaurants': 4019, 'mexican...","{'arts & entertainment': 21, 'festivals': 1, '...","{'""Tim Hortons""': 36, '""Starbucks""': 41, '""Piz...",98,1.199657
63,WI,4754,3973,0.164283,121.931394,23.083088,52.731384,3.3496,3.670526,3.480154,0.973218,"{'tires': 71, 'oil change stations': 63, 'auto...","{'convenience stores': 4, 'desserts': 9, 'food...","{'""Walgreens""': 19, '""The UPS Store""': 7, '""St...",100,2.103492
20,EDH,3795,3078,0.188933,62.134891,12.618972,24.549934,1.737549,3.793372,3.743375,0.715964,"{'active life': 101, 'parks': 22, 'local flavo...","{'restaurants': 362, 'food': 182, 'coffee & te...","{'""Subway""': 4, '""Pizza Hut""': 4, '""Starbucks""...",28,0.737813
11,BW,3118,2746,0.119307,37.660966,11.35279,6.754971,0.452534,3.812454,3.759409,0.791569,"{'italian': 297, 'restaurants': 1514, 'cafes':...","{'food': 80, 'caterers': 8, 'restaurants': 269...","{'""Burger King""': 13, '""Subway""': 10, '""Starbu...",31,0.994227


We'll take a selection of the top 12 states in terms of number of businesses.

In [22]:
stateProfileSelection = stateProfileDf.sort_values('num_businesses', ascending = False).head(12)
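The sort-then-head pattern can also be written with `DataFrame.nlargest`. The sketch below uses a hypothetical four-row `toy` dataframe and keeps the top 3 rather than the top 12.

```python
import pandas as pd

# Hypothetical profile rows; counts for AZ, NV and ON are taken from the table above.
toy = pd.DataFrame({'state': ['AK', 'AZ', 'ON', 'NV'],
                    'num_businesses': [1, 52214, 30208, 33086]})

# Keep the 3 rows with the largest num_businesses, ordered from largest to smallest.
top = toy.nlargest(3, 'num_businesses')
print(top)
```

As with `sort_values(...).head(...)`, the selected rows keep their original index labels.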

Now we'll review our dataframe one last time before saving it.

In [23]:
stateProfileSelection.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12 entries, 9 to 52
Data columns (total 16 columns):
state                12 non-null object
num_businesses       12 non-null int64
num_open             12 non-null int64
%closed              12 non-null float64
num_interactions     12 non-null float64
num_reviews          12 non-null float64
num_checkins         12 non-null float64
num_tips             12 non-null float64
open_rating          12 non-null float64
closed_rating        12 non-null float64
std.dev_rating       12 non-null float64
open_categories      12 non-null object
closed_categories    12 non-null object
chains               12 non-null object
num_chains           12 non-null int64
%chains              12 non-null float64
dtypes: float64(9), int64(3), object(4)
memory usage: 1.6+ KB


In [24]:
stateProfileSelection.to_pickle('../../data/analysis/stateProfiles.pkl')
stateProfileSelection.to_csv('../../data/analysis/stateProfiles.csv')
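One caveat when reading the CSV back: the dictionary columns are written as their string representation, so they come back as plain strings. A minimal sketch of restoring them with `ast.literal_eval` follows, using an in-memory buffer and a hypothetical one-row `profiles` dataframe instead of the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row profile with a dictionary column.
profiles = pd.DataFrame({'state': ['AZ'], 'chains': [{'"Subway"': 3}]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
profiles.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
was_string = isinstance(restored['chains'].iloc[0], str)  # dictionaries come back as text

# literal_eval safely parses the text back into a Python dictionary.
restored['chains'] = restored['chains'].apply(ast.literal_eval)
print(restored['chains'].iloc[0])
```

The pickle file does not need this step, since it preserves the Python objects as they are.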