1. Create a Jupyter .ipynb notebook for data analysis.
2. Import your data or data files and save them as dataframes.
3. Examine your data, columns, and rows; rename columns and adjust indexing and encoding as appropriate.
4. Clean null and blank values, and consider dropping rows. Manipulate data and adjust data types as appropriate, including dates and times, and set appropriate indices. Adjust specific values and replace strings and characters as part of the data wrangling process.
5. Explore the data with graphs and visualizations using matplotlib and seaborn, or alternative visualization packages (Plotly, bokeh, altair, vincent).
6. Perform additional analysis by creating new columns for calculations, including aggregate functions, counts, and groupbys.
7. Encode categorical variables with a variety of techniques, such as logical conditions, where clauses, or one-hot encoding.
8. Re-run calculations, including crosstabs or pivots, and create new graphs to see the results.
9. Create correlation matrices, pairplots, and heatmaps to determine which attributes should be features for your models and which should not.
10. Identify the response variable(s) that you want to predict, classify, or interpret with data science.
11. Perform additional feature engineering as necessary, including min/max scaling, normalization, and any additional pipeline changes that may be helpful when you run machine learning.
12. Merge or concatenate datasets, if you have not already, based on common keys or unique items for more in-depth analysis.
13. Add comments and markdown throughout the Jupyter notebook to explain the interpretation of your results or to comment on code that may not be human-readable, and to help you recall what you are referencing.
14. Create a markdown .md milestone report that shows and explains the results of what you have accomplished to date in this part of your course project. Consider also creating a .pdf or .pptx to display initial results, aha moments, or findings that would be novel or fascinating for your final presentations.


# Part 3: Exploratory Data Analysis
Project: New Coffee Shop Location
<br/>

## 1. Import data files and save as dataframes.

In [2]:
import pandas as pd
from matplotlib import pyplot as plt

%matplotlib inline

### First I'll import the zip_codes file (which contains all zip codes in the DC metro area) into a Python list.

In [6]:
import csv
with open('./data/zip_codes.csv', 'r') as f:
    reader = csv.reader(f)
    zip_codes = list(reader)
    
# convert list of lists attained from csv.reader to single flat list
zip_codes_flat = [item for sublist in zip_codes for item in sublist]
# zip_codes_flat[0:5]
# type(zip_codes_flat[1])

### Import ACS Population Dataset

### Import Yelp dataset.
- After importing the dataset, I found that it was missing all data from the DC metro area, so I chose to use the Yelp API instead.

In [4]:
# businesses = pd.read_json('./data/business.json', lines=True)
# businesses.shape
# businesses.dtypes
# businesses.head()
# businesses[businesses.city.isin(zip_codes_flat)]

In [5]:
# This takes a few minutes to run
# Query the Yelp API for each zip code in the DC metro area

import requests

url = "https://api.yelp.com/v3/businesses/search"

dict_responses = {}

for i in zip_codes_flat:
    querystring = {'location': i}

    headers = {
        'authorization': "Bearer -Vxv2h6JQ4BvtAFJPnXnFESk2A0ZGIM-Uplb3lm3HbZ0fvyJqyzOaebRuahcJxtKTqdep7oi6ZAyOcHOLU9t0KMp9SK5NZS-TYgmQeC0mjYXRq2XWM0acx31HfMPW3Yx",
        'cache-control': "no-cache",
        'postman-token': "46f50419-2ee9-b8f7-52ec-4f53e3173476"
        }

    response = requests.request("GET", url, headers=headers, params=querystring)
    
    response_todict = response.json()
    response_todict[i] = response_todict.pop('businesses')
    
    dict_responses.update(response_todict)
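
A possible variation on the loop above, sketched under a few assumptions: the API key is read from an environment variable named `YELP_API_KEY` (a hypothetical name) instead of being written into the notebook, and only the `businesses` list of each response is stored per zip code. The helper names `fetch_businesses` and `collect_businesses` are illustrative and not part of the original notebook.

```python
import os

YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"

# Assumed environment variable name; evaluates to None if it is not set
API_KEY = os.environ.get("YELP_API_KEY")


def fetch_businesses(zip_code, api_key):
    """Query the Yelp search endpoint for one zip code and return the parsed JSON."""
    import requests  # imported lazily so collect_businesses can be tried offline

    response = requests.get(
        YELP_SEARCH_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        params={"location": zip_code},
        timeout=30,
    )
    response.raise_for_status()  # raise on HTTP error codes instead of parsing them
    return response.json()


def collect_businesses(zip_codes, fetch):
    """Map each zip code to the list stored under 'businesses' in its response."""
    results = {}
    for zip_code in zip_codes:
        payload = fetch(zip_code)
        # keep only the business list; any other top-level keys are ignored
        results[zip_code] = payload.get("businesses", [])
    return results


# Hypothetical usage (needs network access and a valid key):
# dict_responses = collect_businesses(
#     zip_codes_flat, lambda z: fetch_businesses(z, API_KEY)
# )
```

Because only the business lists are stored, a dictionary built this way would contain one entry per zip code and nothing else.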

In [165]:
# Clean up the response from the api

list_of_businesses = []

for key, item in dict_responses.items():
    list_of_businesses.append(item)

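# drop the first two entries: dict_responses.update() also copied each response's remaining
# top-level keys (presumably 'total' and 'region'), which are not lists of businesses
del list_of_businesses[0:2]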

In [201]:
# more cleanup; list is currently list of lists of dictionaries due to how api was called
# convert to just list of dictionaries

list_of_businesses_formatting = []

for i in list_of_businesses:
    for j in i:
        list_of_businesses_formatting.append(j)
        
list_of_businesses_formatting[0]

{'alias': 'zaytinya-washington',
 'categories': ['Greek', 'Turkish', 'Lebanese'],
 'coordinates': {'latitude': 38.89904, 'longitude': -77.02349},
 'display_phone': '(202) 638-0800',
 'distance': 853.4585424100367,
 'id': 'GBkFa8TJwkaUJsJXXGkTTg',
 'image_url': 'https://s3-media2.fl.yelpcdn.com/bphoto/qf8Mc1iHho1dN-CJFItEwQ/o.jpg',
 'is_closed': False,
 'location': {'address1': '701 9th St NW',
  'address2': '',
  'address3': '',
  'city': 'Washington, DC',
  'country': 'US',
  'display_address': ['701 9th St NW', 'Washington, DC 20001'],
  'state': 'DC',
  'zip_code': '20001'},
 'name': 'Zaytinya',
 'phone': '+12026380800',
 'price': '$$$',
 'rating': 4.0,
 'review_count': 3944,
 'transactions': [],
 'url': 'https://www.yelp.com/biz/zaytinya-washington?adjust_creative=mC5YxpuGROU8gtr7Nb8j_w&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=mC5YxpuGROU8gtr7Nb8j_w'}
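
The nested loop above flattens a list of lists into a single list. As a minimal sketch of the same idea, `itertools.chain.from_iterable` from the standard library does this in one call; the `nested` data below is a made-up stand-in for `list_of_businesses`.

```python
from itertools import chain

# toy stand-in for list_of_businesses: one inner list per zip code
nested = [
    [{"name": "A"}, {"name": "B"}],
    [{"name": "C"}],
]

# chain.from_iterable walks each inner list in order and yields its items
flat = list(chain.from_iterable(nested))
print(len(flat))
```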

#### Next, I need to normalize the JSON data so that each data point is in its own column. `json_normalize` returns a dataframe.

In [203]:
# Change formatting of categories column to only include "title", not "alias" (makes normalization easier)
for business in list_of_businesses_formatting:
    business['categories'] = [category['title'] for category in business['categories']]

In [204]:
# This isn't technically normalized all the way since categories still has multiple values.
# I think I can still use it for filtering so I won't worry about it now.
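# Note: this import path is deprecated as of pandas 1.0, where the function is exposed as pd.json_normalize
from pandas.io.json import json_normalize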
businesses_df = json_normalize(list_of_businesses_formatting)
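
To illustrate what the normalization step does, here is a small sketch using the top-level `pd.json_normalize` available in pandas 1.0 and later. The `records` list is invented sample data shaped like the API response; nested dictionaries become columns whose names join the keys with a dot.

```python
import pandas as pd

# invented sample record with the same nesting as the API response
records = [
    {
        "name": "Example Cafe",
        "coordinates": {"latitude": 38.9, "longitude": -77.0},
        "location": {"city": "Washington, DC", "zip_code": "20001"},
    },
]

# nested keys are flattened into dotted column names
df = pd.json_normalize(records)
print(sorted(df.columns))
```

This is why the dataframe above has columns such as `coordinates.latitude` and `location.zip_code`.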

In [205]:
businesses_df.head()

| | alias | categories | coordinates.latitude | coordinates.longitude | display_phone | distance | id | image_url | is_closed | location.address1 | location.address2 | location.address3 | location.city | location.country | location.display_address | location.state | location.zip_code | name | phone | price | rating | review_count | transactions | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | zaytinya-washington | [Greek, Turkish, Lebanese] | 38.89904 | -77.02349 | (202) 638-0800 | 853.458542 | GBkFa8TJwkaUJsJXXGkTTg | https://s3-media2.fl.yelpcdn.com/bphoto/qf8Mc1... | False | 701 9th St NW | | | Washington, DC | US | [701 9th St NW, Washington, DC 20001] | DC | 20001 | Zaytinya | 12026380800 | $$$ | 4.0 | 3944 | [] | https://www.yelp.com/biz/zaytinya-washington?a... |
| 1 | old-ebbitt-grill-washington | [Bars, American (Traditional), Breakfast & Bru... | 38.898005 | -77.033362 | (202) 347-4800 | 1428.775801 | iyBbcXtQSBfiwFQZwVBNaQ | https://s3-media2.fl.yelpcdn.com/bphoto/KBCezp... | False | 675 15th St NW | | | Washington, DC | US | [675 15th St NW, Washington, DC 20005] | DC | 20005 | Old Ebbitt Grill | 12023474800 | $$ | 4.0 | 6532 | [] | https://www.yelp.com/biz/old-ebbitt-grill-wash... |
| 2 | a-baked-joint-washington-9 | [Coffee & Tea, Breakfast & Brunch, Sandwiches] | 38.902411 | -77.017139 | (202) 408-6985 | 547.273915 | SpCeYPhky4gsWa9-IBtw2A | https://s3-media1.fl.yelpcdn.com/bphoto/iTBw1K... | False | 440 K St NW | | | Washington, DC | US | [440 K St NW, Washington, DC 20001] | DC | 20001 | A Baked Joint | 12024086985 | $ | 4.5 | 1214 | [] | https://www.yelp.com/biz/a-baked-joint-washing... |
| 3 | le-diplomate-washington | [Brasseries, French, Cafes] | 38.911359 | -77.031575 | (202) 332-3333 | 1086.375403 | j9qYRR8HCXm_GEnetijOGA | https://s3-media2.fl.yelpcdn.com/bphoto/2EljPz... | False | 1601 14th St NW | | | Washington, DC | US | [1601 14th St NW, Washington, DC 20009] | DC | 20009 | Le Diplomate | 12023323333 | $$$ | 4.0 | 2447 | [] | https://www.yelp.com/biz/le-diplomate-washingt... |
| 4 | rasika-washington | [Indian] | 38.895008 | -77.021286 | (202) 637-1222 | 1267.48223 | CwdlygqT4cWwOtQGsYdoBw | https://s3-media4.fl.yelpcdn.com/bphoto/rkzs8J... | False | 633 D St NW | | | Washington, DC | US | [633 D St NW, Washington, DC 20004] | DC | 20004 | Rasika | 12026371222 | $$$ | 4.5 | 2631 | [] | https://www.yelp.com/biz/rasika-washington?adj... |


## 2. Examine your data, columns and rows and rename and adjust indexing and encoding as appropriate

### I'll start with the ACS dataset.

### Next, the Yelp Dataset.

In [206]:
businesses_df.shape

(6946, 24)

In [207]:
businesses_df.head()

| | alias | categories | coordinates.latitude | coordinates.longitude | display_phone | distance | id | image_url | is_closed | location.address1 | location.address2 | location.address3 | location.city | location.country | location.display_address | location.state | location.zip_code | name | phone | price | rating | review_count | transactions | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | zaytinya-washington | [Greek, Turkish, Lebanese] | 38.89904 | -77.02349 | (202) 638-0800 | 853.458542 | GBkFa8TJwkaUJsJXXGkTTg | https://s3-media2.fl.yelpcdn.com/bphoto/qf8Mc1... | False | 701 9th St NW | | | Washington, DC | US | [701 9th St NW, Washington, DC 20001] | DC | 20001 | Zaytinya | 12026380800 | $$$ | 4.0 | 3944 | [] | https://www.yelp.com/biz/zaytinya-washington?a... |
| 1 | old-ebbitt-grill-washington | [Bars, American (Traditional), Breakfast & Bru... | 38.898005 | -77.033362 | (202) 347-4800 | 1428.775801 | iyBbcXtQSBfiwFQZwVBNaQ | https://s3-media2.fl.yelpcdn.com/bphoto/KBCezp... | False | 675 15th St NW | | | Washington, DC | US | [675 15th St NW, Washington, DC 20005] | DC | 20005 | Old Ebbitt Grill | 12023474800 | $$ | 4.0 | 6532 | [] | https://www.yelp.com/biz/old-ebbitt-grill-wash... |
| 2 | a-baked-joint-washington-9 | [Coffee & Tea, Breakfast & Brunch, Sandwiches] | 38.902411 | -77.017139 | (202) 408-6985 | 547.273915 | SpCeYPhky4gsWa9-IBtw2A | https://s3-media1.fl.yelpcdn.com/bphoto/iTBw1K... | False | 440 K St NW | | | Washington, DC | US | [440 K St NW, Washington, DC 20001] | DC | 20001 | A Baked Joint | 12024086985 | $ | 4.5 | 1214 | [] | https://www.yelp.com/biz/a-baked-joint-washing... |
| 3 | le-diplomate-washington | [Brasseries, French, Cafes] | 38.911359 | -77.031575 | (202) 332-3333 | 1086.375403 | j9qYRR8HCXm_GEnetijOGA | https://s3-media2.fl.yelpcdn.com/bphoto/2EljPz... | False | 1601 14th St NW | | | Washington, DC | US | [1601 14th St NW, Washington, DC 20009] | DC | 20009 | Le Diplomate | 12023323333 | $$$ | 4.0 | 2447 | [] | https://www.yelp.com/biz/le-diplomate-washingt... |
| 4 | rasika-washington | [Indian] | 38.895008 | -77.021286 | (202) 637-1222 | 1267.48223 | CwdlygqT4cWwOtQGsYdoBw | https://s3-media4.fl.yelpcdn.com/bphoto/rkzs8J... | False | 633 D St NW | | | Washington, DC | US | [633 D St NW, Washington, DC 20004] | DC | 20004 | Rasika | 12026371222 | $$$ | 4.5 | 2631 | [] | https://www.yelp.com/biz/rasika-washington?adj... |


#### Looping over zip codes in the API calls produced many duplicate businesses. Since `id` is unique per business, I examine and remove the duplicates below.

In [208]:
businesses_df.id.value_counts().head()

VUfflugAZa3MxbtzDTGnEA    20
iyBbcXtQSBfiwFQZwVBNaQ    19
VA8aPObRynlwR1TGzbzraQ    19
GBkFa8TJwkaUJsJXXGkTTg    15
CNATPQgruDWBVRjXfzT8KQ    14
Name: id, dtype: int64

In [209]:
# test line to make sure drop_duplicates works properly; will execute inplace next
# businesses_df.drop_duplicates(subset='id').id.value_counts().head()

businesses_df.drop_duplicates(subset='id', inplace=True)
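
As a small illustration of this de-duplication step, the sketch below counts repeated ids and then keeps the first row for each one. The dataframe `df` is made-up toy data, not the Yelp results.

```python
import pandas as pd

# toy data: ids "a" and "b" each appear twice
df = pd.DataFrame({
    "id": ["a", "b", "a", "c", "b"],
    "name": ["Cafe A", "Cafe B", "Cafe A", "Cafe C", "Cafe B"],
})

# duplicated() flags every repeat after the first occurrence of an id
n_dupes = int(df["id"].duplicated().sum())

# keep the first row per id and rebuild a clean 0..n-1 index
deduped = df.drop_duplicates(subset="id").reset_index(drop=True)
print(n_dupes, len(deduped))
```

Returning a new dataframe rather than using `inplace=True` keeps the original available for comparison; either style removes the same rows.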

#### Now I will drop columns that I won't need during this project.

In [210]:
list(businesses_df)

['alias',
 'categories',
 'coordinates.latitude',
 'coordinates.longitude',
 'display_phone',
 'distance',
 'id',
 'image_url',
 'is_closed',
 'location.address1',
 'location.address2',
 'location.address3',
 'location.city',
 'location.country',
 'location.display_address',
 'location.state',
 'location.zip_code',
 'name',
 'phone',
 'price',
 'rating',
 'review_count',
 'transactions',
 'url']

In [211]:
businesses_df.drop(['alias',
         'id',
         'display_phone',
         'distance', 
         'image_url', 
         'is_closed', 
         'location.address2', 
         'location.address3', 
         'location.display_address',
         'phone',
         'transactions',
         'url'], axis=1, inplace=True)

#### Next I'll clean up the column names.

In [212]:
businesses_df.columns = ['categories', 'latitude', 'longitude', 'address', 'city', 'country', 'state', 'zip_code', 'name', 'price', 'rating', 'review_count']

In [213]:
# reorder columns: swap the positions of 'name' and 'categories' so that 'name' comes first
cols = businesses_df.columns.tolist()
cols[0] = 'name'
cols[8] = 'categories'
businesses_df = businesses_df[cols]
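
The renaming and reordering above rely on column positions, which can silently mislabel data if the column order ever changes. A name-based alternative is sketched below on a toy dataframe; `rename_map` and `ordered` are illustrative names, and only a few of the columns are shown.

```python
import pandas as pd

# toy dataframe with a few of the dotted column names produced by normalization
df = pd.DataFrame({
    "categories": [["Indian"]],
    "coordinates.latitude": [38.895008],
    "location.zip_code": ["20004"],
    "name": ["Rasika"],
})

# rename by old name -> new name, independent of column position
rename_map = {
    "coordinates.latitude": "latitude",
    "location.zip_code": "zip_code",
}
df = df.rename(columns=rename_map)

# move 'name' to the front and keep the remaining columns in their existing order
ordered = ["name"] + [col for col in df.columns if col != "name"]
df = df[ordered]
print(list(df.columns))
```

Note that this puts `name` first and shifts the other columns right, rather than swapping `name` with `categories` as the cell above does.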

In [214]:
businesses_df.head()

| | name | latitude | longitude | address | city | country | state | zip_code | categories | price | rating | review_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Zaytinya | 38.89904 | -77.02349 | 701 9th St NW | Washington, DC | US | DC | 20001 | [Greek, Turkish, Lebanese] | $$$ | 4.0 | 3944 |
| 1 | Old Ebbitt Grill | 38.898005 | -77.033362 | 675 15th St NW | Washington, DC | US | DC | 20005 | [Bars, American (Traditional), Breakfast & Bru... | $$ | 4.0 | 6532 |
| 2 | A Baked Joint | 38.902411 | -77.017139 | 440 K St NW | Washington, DC | US | DC | 20001 | [Coffee & Tea, Breakfast & Brunch, Sandwiches] | $ | 4.5 | 1214 |
| 3 | Le Diplomate | 38.911359 | -77.031575 | 1601 14th St NW | Washington, DC | US | DC | 20009 | [Brasseries, French, Cafes] | $$$ | 4.0 | 2447 |
| 4 | Rasika | 38.895008 | -77.021286 | 633 D St NW | Washington, DC | US | DC | 20004 | [Indian] | $$$ | 4.5 | 2631 |
