# Step 1.1: Extracting Basic Features



In the previous step we cleaned the data and exported a few clean datasets to the **data/step1** folder:

- **biz.csv**: Businesses in the 8 selected states, with neighborhood information derived from Zillow neighborhoods and census tract information derived from the geographical boundaries of census tracts. Some businesses may not be associated with any neighborhood name.
- **neighborhoods/neighborhoods.shp**: Shape file of all Zillow neighborhoods in the 8 states that have at least one business.
- **census-tracts/census-tracts.shp**: Shape file of all census tracts in the 8 states that have at least one business.

In this notebook, we aggregate a few neighborhood-level and census-tract-level measurements from the Yelp data. In a later step we will apply clustering algorithms to these measurements to separate neighborhoods by their business dynamics.

In [90]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas import json_normalize  # top-level import (pandas >= 1.0); the pandas.io.json path is deprecated

sns.set(
    style="white",
    color_codes=True,
    rc={
        'axes.linewidth': 0.5,
        'lines.linewidth': 2,
        'axes.labelsize': 14,
        'axes.titlesize': 14
    }
)

pd.set_option('display.max_rows', 4)

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Basic features

In [77]:
# Load data
biz = pd.read_csv('../data/step1/biz.csv', dtype={
    'postal_code': str,
    'CT_ID': str
})

In [78]:
display(biz)

| | address | business_id | city | hours | is_open | name | neighborhood | postal_code | review_count | stars | ... | CT_LAND | CT_WATER | CT_BIZ_COUNT | State | County | City | Name | Nhood | Nhood_area | Nhood_biz_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 691 Richmond Rd | YDf95gJZaq05wvo7hTQbbQ | Richmond Heights | {'Monday': '10:00-21:00', 'Tuesday': '10:00-21... | 1 | Richmond Town Square | | 44143 | 17 | 2.0 | ... | 4183953.0 | 0.0 | 29.0 | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115521 | 540 Marks St | scMIE4jyGp7FkWrMKAgjxA | Henderson | {} | 0 | Walmart | | 89014 | 42 | 2.5 | ... | 18327522.0 | 0.0 | 792.0 | NV | Clark | Henderson | Whitney Ranch | Whitney Ranch, Henderson, NV | 7.521403e+06 | 577.0 |


In [79]:
display(biz.columns)

Index(['address', 'business_id', 'city', 'hours', 'is_open', 'name',
       'neighborhood', 'postal_code', 'review_count', 'stars', 'state',
       'Active Life', 'Arts & Entertainment', 'Automotive', 'Beauty & Spas',
       'Event Planning & Services', 'Food', 'Health & Medical',
       'Home Services', 'Hotels & Travel', 'Local Services', 'Nightlife',
       'Restaurants', 'Shopping', 'BusinessAcceptsCreditCards',
       'RestaurantsPriceRange2', 'BikeParking', 'GoodForKids',
       'RestaurantsTakeOut', 'OutdoorSeating', 'RestaurantsGoodForGroups',
       'RestaurantsAttire', 'GoodForMeal', 'Ambience', 'Alcohol', 'geometry',
       'CT_ID', 'CT_LAND', 'CT_WATER', 'CT_BIZ_COUNT', 'State', 'County',
       'City', 'Name', 'Nhood', 'Nhood_area', 'Nhood_biz_count'],
      dtype='object')

In [81]:
display(biz.postal_code)

0         44143
          ...  
115521    89014
Name: postal_code, Length: 115522, dtype: object

### Alcohol

Possible values for `Alcohol`:

In [82]:
biz.Alcohol.unique()

array([nan, 'none', 'full_bar', 'beer_and_wine'], dtype=object)

In [83]:
# Treat 'none' as missing, so that any non-null value means the business serves alcohol
biz.loc[biz['Alcohol'] == 'none', 'Alcohol'] = np.nan
biz.Alcohol.unique()

array([nan, 'full_bar', 'beer_and_wine'], dtype=object)
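
As a minimal sketch of why this recoding is useful: once `'none'` is missing, the share of non-null `Alcohol` values in a group is the share of businesses that serve alcohol. The toy frame below (neighborhood labels `A` and `B` and all values) is made up for illustration and is not part of the Yelp data.

```python
import numpy as np
import pandas as pd

# Toy data; labels and values are hypothetical.
toy = pd.DataFrame({
    'Nhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Alcohol': ['full_bar', 'none', np.nan, 'beer_and_wine', 'none', np.nan],
})

# Same recoding as above: treat 'none' as missing.
toy.loc[toy['Alcohol'] == 'none', 'Alcohol'] = np.nan

# Share of businesses per neighborhood with a non-missing Alcohol value.
alcohol_ratio = toy.groupby('Nhood')['Alcohol'].agg(lambda x: x.notnull().mean())
print(alcohol_ratio)
```

Here neighborhood `A` has 2 of 4 businesses with a non-missing value and `B` has none, so the ratios are 0.5 and 0.0.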

### Price range

In [84]:
biz.RestaurantsPriceRange2.values[:10]

array([  2.,   2.,  nan,   1.,   2.,  nan,  nan,   2.,  nan,  nan])
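
Many businesses have no price range, which matters when averaging. The sketch below reuses the ten values printed above to show that `mean` skips missing values by default, so the average describes only the businesses that report a price range; the `fillna(0)` line is included only as a contrast.

```python
import numpy as np
import pandas as pd

# The first ten values shown above.
price = pd.Series([2, 2, np.nan, 1, 2, np.nan, np.nan, 2, np.nan, np.nan])

mean_skipna = price.mean()            # averages the 5 non-missing values only
share_known = price.notnull().mean()  # share of businesses with a price range
mean_as_zero = price.fillna(0).mean() # what the mean would be if NaN counted as 0

print(mean_skipna, share_known, mean_as_zero)
```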

## Build the feature matrix

After examining the extreme/unique values of the features, we now aggregate the measurements to neighborhoods and census tracts.

In [95]:
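# Share of non-missing values in a Series
notnull_ratio = lambda x: x.notnull().mean()
# Curried helper: isval_ratio(v) returns a function giving the share of values equal to v
isval_ratio = lambda x: lambda y: (x == y).mean()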

def basic_agg(nbh, by='Nhood'):
    """Calculate basic aggregate statistics per group (neighborhood or census tract)."""
    stats = nbh.groupby(by).agg({
        'business_id': ['count'],
        'is_open': ['mean'],
        'Nhood_area': ['first'],  # area of a neighborhood (not of the tract, even when by='CT_ID')
        'review_count': ['mean'],
        'stars': ['mean'],
        'RestaurantsPriceRange2': ['mean'],  # by default will skip NA
        'Alcohol': [('ratio', notnull_ratio)]
    }).sort_values(('business_id', 'count'), ascending=False)
    # business density; note that 10e6 == 1e7, so assuming areas are in m²
    # this is businesses per 10 km² (divide by 1e6 instead for per km²)
    stats['biz_density'] = (
        stats[('business_id','count')] /
        (stats[('Nhood_area', 'first')] / 10e6)
    )
    stats.columns = stats.columns.map(lambda x: '_'.join(x))
    stats = pd.DataFrame(stats.to_records())
    return stats
    
nbh_stats_min_10 = basic_agg(biz[biz['Nhood_biz_count'] >= 10])
ct_stats_min_20 = basic_agg(biz[biz['CT_BIZ_COUNT'] >= 20], by='CT_ID')
display(nbh_stats_min_10)
display(ct_stats_min_20)

| | Nhood | business_id_count | is_open_mean | Nhood_area_first | review_count_mean | stars_mean | RestaurantsPriceRange2_mean | Alcohol_ratio | biz_density_ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Paradise, Las Vegas, NV | 5454 | 0.819215 | 1.132623e+08 | 50.751008 | 3.674276 | 1.734807 | 0.126146 | 481.536897 |
| 1 | Spring Valley, Las Vegas, NV | 4030 | 0.822333 | 9.008128e+07 | 48.276923 | 3.855707 | 1.705340 | 0.098015 | 447.373756 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 440 | Arrow Head Acres, North Las Vegas, NV | 10 | 1.000000 | 8.902987e+05 | 10.600000 | 3.650000 | 1.500000 | 0.000000 | 112.321848 |
| 441 | McClellan Park, Madison, WI | 10 | 0.800000 | 2.381347e+06 | 18.200000 | 3.700000 | 1.857143 | 0.300000 | 41.993045 |


| | CT_ID | business_id_count | is_open_mean | Nhood_area_first | review_count_mean | stars_mean | RestaurantsPriceRange2_mean | Alcohol_ratio | biz_density_ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32003006700 | 1767 | 0.802490 | 8.657817e+06 | 203.042445 | 3.594793 | 2.371550 | 0.353707 | 2040.930281 |
| 1 | 04013216816 | 1142 | 0.823117 | 2.707229e+08 | 31.990368 | 3.885727 | 2.009960 | 0.075306 | 42.183359 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1452 | 04013522800 | 20 | 0.950000 | NaN | 28.500000 | 3.675000 | 1.833333 | 0.050000 | NaN |
| 1453 | 39153530104 | 20 | 0.850000 | NaN | 13.900000 | 3.650000 | 1.538462 | 0.250000 | NaN |
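
Two details of the aggregation are easy to miss: `10e6` equals `1e7` rather than `1e6`, and joining the two column levels with `'_'` leaves a trailing underscore on `biz_density_`. The following is a minimal sketch on toy data, not the notebook's own pipeline. It assumes areas are in square meters, and the frame contents are hypothetical. It computes a per-km² density and flattens the column names without the trailing underscore.

```python
import pandas as pd

# Toy data; names and numbers are illustrative, areas assumed to be in m^2.
toy = pd.DataFrame({
    'Nhood': ['A', 'A', 'B'],
    'business_id': ['x1', 'x2', 'x3'],
    'Nhood_area': [2e6, 2e6, 5e5],
})

stats = toy.groupby('Nhood').agg({
    'business_id': ['count'],
    'Nhood_area': ['first'],
})

# 1 km^2 == 1e6 m^2 (10e6 would be 1e7).
stats['biz_density'] = (
    stats[('business_id', 'count')] /
    (stats[('Nhood_area', 'first')] / 1e6)
)

# Join the two column levels and drop the trailing '_' left by an empty second level.
stats.columns = ['_'.join(col).rstrip('_') for col in stats.columns]
stats = stats.reset_index()
print(stats)
```

Switching the notebook itself to `1e6` would change the `biz_density_` values shown above by a factor of ten, so any downstream thresholds would need to be adjusted accordingly.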


## Export the feature matrix as a CSV file

These exported feature matrices will later be used in clustering.

In [97]:
nbh_stats_min_10.to_csv('../data/step2/nbh_stats_min_10.csv', index=False)

In [98]:
ct_stats_min_20.to_csv('../data/step2/ct_stats_min_20.csv', index=False)
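
When these files are read back in a later step, `CT_ID` needs the same `dtype` treatment used for `biz.csv` above, because some tract IDs start with a zero. The sketch below uses an in-memory buffer instead of the real file, and a one-row toy frame built from a tract ID shown in the table above, to illustrate the difference.

```python
import io
import pandas as pd

# Toy frame with one tract ID from the table above (note the leading zero).
ct = pd.DataFrame({'CT_ID': ['04013216816'], 'business_id_count': [1142]})

buf = io.StringIO()
ct.to_csv(buf, index=False)

buf.seek(0)
as_default = pd.read_csv(buf)                    # CT_ID is parsed as an integer
buf.seek(0)
as_str = pd.read_csv(buf, dtype={'CT_ID': str})  # CT_ID is kept as text

print(as_default['CT_ID'].iloc[0])
print(as_str['CT_ID'].iloc[0])
```

With the default parsing the leading zero is lost, which would break joins against the census tract shapefile.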