# Capstone Project - The Battle of Electronic Stores 

## Table of Contents 
* [Business Problem](#intro)
* [Data](#data)
* [Methodology](#meth)
* [Analysis](#analysis)

## Business Problem <a name="intro"></a>

This project is geared toward stakeholders interested in identifying an optimal location for a **computer electronics store** business in Los Angeles, California. The business is aimed at computer enthusiasts and professional online gamers, and will offer a large variety of computers, computer parts, electronics, software, and gaming supplies. 

We will focus our attention on locations which **do not have an electronics store in the area and are not primarily residential neighborhoods**. We would also prefer to consider locations which **contain malls, shopping centers, and/or a large number of retail stores in the area**. 

Five locations will then be recommended based on how closely each satisfies the above criteria, along with a description of the advantages of choosing each location over the other candidates.  

## Data <a name="data"></a>

The data used consists of the following:

* Names of neighborhoods and regions within Los Angeles County were obtained through web scraping the Los Angeles Times' Mapping LA project using the Beautiful Soup library. This data has been previously used by the City of Los Angeles Open Data portal to map out neighborhood boundaries.

* We make use of the neighborhood names data to obtain the longitude and latitude coordinates of each neighborhood using the geopy library. 

* We use the Foursquare API to obtain the most common venues in each neighborhood.

* The number of electronic stores within each neighborhood will also be used to determine if a location will be recommended to our stakeholders or not. 

### Importing Libraries

In [1]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json 

import requests
# json_normalize is exposed at the top level of pandas since version 1.0
from pandas import json_normalize

from bs4 import BeautifulSoup

from geopy.geocoders import Nominatim, GoogleV3

!pip -q install folium 
import folium

### Neighborhoods

We'll create a Pandas Dataframe from the table data in the Los Angeles Times' Mapping LA project webpage using Beautiful Soup. 

In [2]:
# Let's store the URL of the Los Angeles Times' Mapping LA project webpage.

url_la = "http://maps.latimes.com/neighborhoods/neighborhood/list/"

content = requests.get(url_la).text

# Using BeautifulSoup, we use Python's html.parser to parse through the html file.
soup = BeautifulSoup(content, 'html.parser')

Now we locate the table on the webpage using the BeautifulSoup 'find()' method, which scans the HTML document for the first element matching the given tag name and attributes. To identify these two arguments, we inspect the page with Google Chrome's Developer Tools. 

In our case, both arguments are name='table' and attrs={'class': 'datagrid'}.
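As a minimal sketch of how `find()` selects an element by tag name and attributes, the snippet below runs it on a small hand-written HTML string. The markup is a simplified, hypothetical stand-in for the Mapping LA page (the real page is larger and may be structured differently); the two data rows reuse names from the table shown later in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the Mapping LA page markup.
html = """
<html><body>
  <table class="sidebar"><tr><td>ignored</td></tr></table>
  <table class="datagrid">
    <tr><th>Name</th><th>Region</th></tr>
    <tr><td>Acton</td><td>Antelope Valley</td></tr>
    <tr><td>Adams-Normandie</td><td>South L.A.</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first element matching both the tag name and the attributes,
# so the "sidebar" table is skipped.
table = soup.find("table", attrs={"class": "datagrid"})

# Collect the text of each data row (rows that contain <td> cells).
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find("td") is not None
]
print(rows)
```

If no element matches, `find()` returns `None`, so it can be worth checking the result before passing it on.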

In [3]:
la_table = soup.find('table', attrs={'class':'datagrid'})

# read_html returns a list of dataframes; the table we want is the first one.
df_la = pd.read_html(str(la_table))[0]

# We'll rename one of the column names. 
df_la.rename(columns={"Name":"Neighborhoods"}, inplace=True)

print(f"The number of neighborhoods in the Los Angeles County, CA are {len(df_la['Neighborhoods'].unique())}.")
df_la.head(20)

The number of neighborhoods in the Los Angeles County, CA are 272.


Unnamed: 0,Neighborhoods,Region
0,Acton,Antelope Valley
1,Adams-Normandie,South L.A.
2,Agoura Hills,Santa Monica Mountains
3,Agua Dulce,Northwest County
4,Alhambra,San Gabriel Valley
5,Alondra Park,South Bay
6,Altadena,Verdugos
7,Angeles Crest,Angeles Forest
8,Arcadia,San Gabriel Valley
9,Arleta,San Fernando Valley


Note that the Los Angeles Times' Mapping LA project covers the entire **Los Angeles County**, which contains mountain regions, national parks, and nearby cities. Our focus is on the **City of Los Angeles**, so we will use the City of Los Angeles open portal data to restrict our dataframe to neighborhoods that fall inside the city of LA.

In [4]:
# @hidden_cell

import os
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# The following code accesses a file in your IBM Cloud Object Storage.
# The API key is read from an environment variable so that it is not stored in the notebook.
# (IBM_API_KEY_ID is a variable name chosen here; set it before running this cell.)
client_b5d0dc04fc64440db4d6430c0bbbd1f5 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=os.environ['IBM_API_KEY_ID'],
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_b5d0dc04fc64440db4d6430c0bbbd1f5.get_object(Bucket='applieddatasciencecapstone-donotdelete-pr-hfzojppabnbdxd',Key='LA_Times_Neighborhood_Boundaries.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )


In [5]:
# The City of Los Angeles open portal provides a csv file with 114 neighborhoods within the city.  

df_data_1 = pd.read_csv(body)

print(f"The City of Los Angeles open portal data contains {len(df_data_1['name'].unique())} neighborhoods")
df_data_1.head()

The City of Los Angeles open portal data contains 114 neighborhoods


Unnamed: 0,OBJECTID,name
0,1,Adams-Normandie
1,2,Arleta
2,3,Arlington Heights
3,4,Atwater Village
4,5,Baldwin Hills/Crenshaw


Now that we have the list of neighborhoods that lie within the city of Los Angeles, let us remove all other neighborhoods from 'df_la'.  

In [6]:
# We'll only keep neighborhoods in city of LA.
df_la = df_la[df_la["Neighborhoods"].isin(df_data_1['name'])]

print(f"The total number of neighborhoods in the city of Los Angeles are {len(df_la['Neighborhoods'].unique())}")

df_la.head(20)

The total number of neighborhoods in the city of Los Angeles are 114


Unnamed: 0,Neighborhoods,Region
1,Adams-Normandie,South L.A.
9,Arleta,San Fernando Valley
10,Arlington Heights,Central L.A.
13,Atwater Village,Northeast L.A.
17,Baldwin Hills/Crenshaw,South L.A.
19,Bel-Air,Westside
23,Beverly Crest,Westside
24,Beverly Grove,Central L.A.
26,Beverlywood,Westside
27,Boyle Heights,Eastside


### Obtaining Latitudes and Longitudes

In [7]:
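# Create the geocoder once and reuse it for every lookup.
geolocator = Nominatim(user_agent="my-project")

lng = []
lat = []
for name in df_la['Neighborhoods']:
    # Note: "{name}, CA" can match a same-named place elsewhere in California
    # (Fairfax resolves to roughly 37.99, -122.59 in the venue output further below).
    address = f'{name}, CA'

    try:
        location = geolocator.geocode(address)
    except Exception:
        location = None

    if location is None:
        print(f"Could not find latitude/longitude coordinates for {name}, CA")
        # Append NaN so the lists stay aligned with the dataframe rows.
        lat.append(np.nan)
        lng.append(np.nan)
    else:
        lat.append(location.latitude)
        lng.append(location.longitude)

# Create latitude/longitude columns (copy first, since df_la is a filtered slice)
df_la = df_la.copy()
df_la["Latitude"] = lat
df_la["Longitude"] = lng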

# Now we have geo coordinates for each neighborhood
# We'll sort the dataframe based on regions
df_la.sort_values("Region", inplace=True)

df_la.head(30)

Unnamed: 0,Neighborhoods,Region,Latitude,Longitude
91,Hancock Park,Central L.A.,34.06778,-118.332635
106,Hollywood Hills West,Central L.A.,34.110485,-118.373388
105,Hollywood Hills,Central L.A.,34.131179,-118.335547
104,Hollywood,Central L.A.,34.098003,-118.329523
126,Larchmont,Central L.A.,34.079837,-118.31787
95,Harvard Heights,Central L.A.,34.047111,-118.305483
89,Griffith Park,Central L.A.,34.135814,-118.294789
137,Los Feliz,Central L.A.,34.108214,-118.290032
146,Mid-City,Central L.A.,34.041527,-118.36037
147,Mid-Wilshire,Central L.A.,34.056862,-118.345803


### Foursquare

Great! We now have a complete list of all the neighborhoods that reside in the city of Los Angeles.

Our next step will be to use the Foursquare API to find out what types of businesses exist in a given neighborhood. The information we get back will also indicate whether a neighborhood is primarily residential. 

In [9]:
# @hidden_cell

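import os

# Credentials are read from environment variables rather than stored in the notebook.
# (The variable names below are chosen here; set them before running this cell.)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']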
VERSION = '20180604'
LIMIT = 100

Let's explore the businesses around all the neighborhoods. 

In [27]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Obtains the Shop & Service venues within `radius` meters (500 by default) of each location."""

    category_id = '4d4b7105d754a06378d81259'  # Shop & Service venue category
    venues_list = []

    for name, lat, lng in zip(names, latitudes, longitudes):

        # Skip neighborhoods without coordinates
        if pd.isna(lat) or pd.isna(lng):
            continue

        # API URL
        url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}".format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            category_id,
            radius,
            LIMIT)

        # GET request
        results = requests.get(url).json()['response']['groups'][0]['items']

        # Append the relevant info from results
        venues_list.append([(
            name,
            lat,
            lng,
            ven['venue']['name'],
            ven['venue']['location']['lat'],
            ven['venue']['location']['lng'],
            ven['venue']['categories'][0]['name']) for ven in results])

    # Build the dataframe once, after all neighborhoods have been queried
    columns = ['Neighborhood',
               'Neighborhood Latitude',
               'Neighborhood Longitude',
               'Venue',
               'Venue Latitude',
               'Venue Longitude',
               'Venue Category']
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=columns)

    return nearby_venues
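To make the response handling easier to follow, here is a small sketch that flattens a mocked response into rows without calling the API. The dictionary below is a hand-built stand-in that contains only the fields `getNearbyVenues()` reads (`response`, `groups`, `items`, `venue`, `location`, `categories`); a real Foursquare response carries many more fields. The venue values are taken from the output shown later in this notebook.

```python
import pandas as pd

# Mocked payload containing only the fields getNearbyVenues() reads.
mock_json = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Chloe Massage Spa",
                               "location": {"lat": 34.067983, "lng": -118.33241},
                               "categories": [{"name": "Massage Studio"}]}},
                    {"venue": {"name": "Leading Tax Group",
                               "location": {"lat": 34.065761, "lng": -118.327884},
                               "categories": [{"name": "Lawyer"}]}},
                ]
            }
        ]
    }
}

# Same path into the payload as in getNearbyVenues().
items = mock_json["response"]["groups"][0]["items"]

# One row per venue, tagged with the neighborhood that was queried.
rows = [
    {
        "Neighborhood": "Hancock Park",
        "Venue": item["venue"]["name"],
        "Venue Latitude": item["venue"]["location"]["lat"],
        "Venue Longitude": item["venue"]["location"]["lng"],
        "Venue Category": item["venue"]["categories"][0]["name"],
    }
    for item in items
]

venues = pd.DataFrame(rows)
print(venues)
```

An empty `items` list simply produces no rows, which is why neighborhoods without nearby venues do not appear in the final dataframe.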

Let's call the getNearbyVenues() function we created above and print out some basic information about our new dataframe.

In [28]:
la_venues = getNearbyVenues(df_la['Neighborhoods'], df_la['Latitude'], df_la['Longitude'])

print(f"There are {la_venues.shape[0]} venues within the Shop & Service venue category that Foursquare recognizes.")
print(f"The number of neighborhoods Foursquare returns with venues is {len(la_venues['Neighborhood'].unique())} out of the {len(df_la['Neighborhoods'])} neighborhoods we were considering.")

la_venues.head(15)

There are 3521 venues within the Shop & Service venue category that Foursquare recognizes.
The number of neighborhoods Foursquare returns with venues is 105 out of the 114 neighborhoods we were considering.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hancock Park,34.06778,-118.332635,Chloe Massage Spa,34.067983,-118.33241,Massage Studio
1,Hancock Park,34.06778,-118.332635,Los Angeles Mobile App Developers,34.068979,-118.328118,Business Service
2,Hancock Park,34.06778,-118.332635,Leading Tax Group,34.065761,-118.327884,Lawyer
3,Hollywood Hills West,34.110485,-118.373388,Laurel Canyon Country Store,34.108925,-118.369616,Grocery Store
4,Hollywood Hills West,34.110485,-118.373388,Jeff Pinette | Photography,34.108522,-118.37787,Photography Studio


Notice that we now have fewer neighborhoods than we passed to the getNearbyVenues() function. A neighborhood appears in `la_venues` only if Foursquare returned at least one Shop & Service venue within the 500 m radius of its coordinates, so neighborhoods with no such venues nearby contribute no rows and drop out. This works in our favor, because every remaining location has businesses around it. 
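To see which neighborhoods dropped out, one option is to compare the names that were queried with the names that came back. The sketch below uses toy stand-ins for `df_la` and `la_venues`; which neighborhood is missing here is made up for illustration and says nothing about the real results.

```python
import pandas as pd

# Toy stand-ins for df_la and la_venues (illustrative rows only).
df_la_toy = pd.DataFrame({"Neighborhoods": ["Hancock Park", "Hollywood", "Griffith Park"]})
la_venues_toy = pd.DataFrame({
    "Neighborhood": ["Hancock Park", "Hancock Park", "Hollywood"],
    "Venue": ["Chloe Massage Spa", "Leading Tax Group", "Input Output"],
})

# True for neighborhoods that have at least one venue row.
has_venue = df_la_toy["Neighborhoods"].isin(la_venues_toy["Neighborhood"])

# Neighborhoods that were queried but contributed no venue rows.
no_venues = df_la_toy.loc[~has_venue, "Neighborhoods"].tolist()
print(no_venues)
```

Applied to the real dataframes, the same two lines would list the neighborhoods that Foursquare returned nothing for.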

Electronics stores generally do not stock computer parts or hardware in-store, but many offer them on their websites for in-store pickup or delivery. We will take a conservative approach and assume that all electronics stores compete with the stakeholder's business. This allows us to remove neighborhoods that host electronics stores from our data and focus on recommending locations that do not have them.

Let's go ahead and isolate the electronics stores in a separate dataframe called `elec_neigh`.

In [29]:
elec_neigh = la_venues[la_venues['Venue Category'] == 'Electronics Store']

print(f"The number of electronics stores is {elec_neigh.shape[0]} within {len(elec_neigh['Neighborhood'].unique())} neighborhoods")
elec_neigh.head(15)

The number of electronics stores is 57 within 33 neighborhoods


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
49,Hollywood,34.098003,-118.329523,Input Output,34.098247,-118.328875,Electronics Store
56,Hollywood,34.098003,-118.329523,Ultimate Ears,34.098309,-118.330139,Electronics Store
73,Hollywood,34.098003,-118.329523,Computer & Iphone Repair Los Angeles,34.098426,-118.326359,Electronics Store
89,Hollywood,34.098003,-118.329523,Emergency Lights Co.,34.098526,-118.33297,Electronics Store
124,Harvard Heights,34.047111,-118.305483,Han's Appliance & TV,34.048672,-118.308586,Electronics Store
278,Fairfax,37.987293,-122.587967,Acoustic Frontiers LLC,37.987803,-122.584658,Electronics Store
296,East Hollywood,34.090428,-118.296625,Loco 4 Tech,34.090909,-118.296061,Electronics Store
346,Downtown,34.042849,-118.247673,iStore,34.043041,-118.25042,Electronics Store
526,Echo Park,34.074,-118.260874,Chris Castro Mac and PC Repair,34.072537,-118.263425,Electronics Store
546,Echo Park,34.074,-118.260874,Rewind Audio,34.076937,-118.264068,Electronics Store


This is great! We now have the electronics stores in each neighborhood. 

Now that we have a good idea of where such businesses are established, let's create a dataframe called `la_clean` that consists of the neighborhoods with no electronics stores.

In [31]:
la_clean = la_venues[~la_venues['Neighborhood'].isin(elec_neigh['Neighborhood'])]

print(f"The number of neighborhoods without electronics stores are {len(la_clean['Neighborhood'].unique())} out of {len(la_venues['Neighborhood'].unique())}.")
print(f"The number of businesses in these neighborhoods are {la_clean.shape[0]}.")
la_clean.head(10)

The number of neighborhoods without electronics stores are 72 out of 105.
The number of businesses in these neighborhoods are 1467.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hancock Park,34.06778,-118.332635,Chloe Massage Spa,34.067983,-118.33241,Massage Studio
1,Hancock Park,34.06778,-118.332635,Los Angeles Mobile App Developers,34.068979,-118.328118,Business Service
2,Hancock Park,34.06778,-118.332635,Leading Tax Group,34.065761,-118.327884,Lawyer
3,Hollywood Hills West,34.110485,-118.373388,Laurel Canyon Country Store,34.108925,-118.369616,Grocery Store
4,Hollywood Hills West,34.110485,-118.373388,Jeff Pinette | Photography,34.108522,-118.37787,Photography Studio
5,Hollywood Hills West,34.110485,-118.373388,Built By Blank,34.108528,-118.377884,IT Services
6,Hollywood Hills,34.131179,-118.335547,State Wide Construction and Remodeling,34.130489,-118.340357,Construction & Landscaping
7,Hollywood Hills,34.131179,-118.335547,wineVEIL,34.131387,-118.34086,Wine Shop
108,Larchmont,34.079837,-118.31787,Design Build Maintain DBM,34.079659,-118.319839,Construction & Landscaping
109,Larchmont,34.079837,-118.31787,Social Security Law Attorney,34.083375,-118.31563,Lawyer


This looks good! We've obtained the venues in neighborhoods that have Shop & Service businesses nearby (our indication that an area is not purely residential) and do not have an electronics store. This satisfies the criteria we identified in our business problem. 
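The two filtering steps above can be sketched on toy data. Most rows mirror entries from the outputs in this notebook; the Hollywood grocery store row is hypothetical and is included only to show that every venue in an excluded neighborhood is removed, not just the electronics stores. The final count per neighborhood is one possible way to gauge retail density and is not part of the original analysis.

```python
import pandas as pd

# Toy venue table (the Hollywood "Grocery Store" row is hypothetical).
venues = pd.DataFrame({
    "Neighborhood": ["Hancock Park", "Hancock Park", "Hollywood", "Hollywood", "Larchmont"],
    "Venue Category": ["Massage Studio", "Lawyer", "Electronics Store", "Grocery Store", "Lawyer"],
})

# Step 1: neighborhoods that contain at least one electronics store.
is_electronics = venues["Venue Category"] == "Electronics Store"
excluded = venues.loc[is_electronics, "Neighborhood"].unique()

# Step 2: drop every venue row belonging to those neighborhoods.
clean = venues[~venues["Neighborhood"].isin(excluded)]

# Venue count per remaining neighborhood, largest first.
counts = clean.groupby("Neighborhood").size().sort_values(ascending=False)
print(counts)
```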

Before we move on, let's review the data we have collected so far.

* `df_la` contains the names of all 114 neighborhoods in the city of Los Angeles along with their associated regions. The dataframe is organized by regions, meaning all neighborhoods belonging to a certain region are grouped together. This dataframe also contains the geographical coordinates of each neighborhood.


* `la_venues` provides us information about the businesses within a neighborhood. We have business names, geographical coordinates, and the category a given business falls in. 


* `elec_neigh` provides similar information as `la_venues` with the exception that the businesses represented in this dataframe are strictly electronics stores.


* `la_clean` is also derived from the `la_venues` dataframe. It contains the businesses located in neighborhoods that have no electronics stores. The information in this dataframe will be used to make our recommendation later in the analysis. 

This concludes the data collection phase. We will now move on to our analysis using the data we have collected so far. 

## Methodology <a name="meth"></a>

## Analysis <a name="analysis"></a>

Let's begin this section by visualizing the data we obtained in the previous section. 

We'll start by mapping out all the neighborhoods in Los Angeles.

In [34]:
address = 'Los Angeles, California'

geolocator = Nominatim(user_agent='my-proj')
location = geolocator.geocode(address)
latitude1 = location.latitude
longitude1 = location.longitude
print(f'The geographical coordinates of LA are {latitude1}, {longitude1}.')

map_LA = folium.Map(location=[latitude1, longitude1], zoom_start=12)

for lat1, lng1, neigh in zip(df_la['Latitude'], df_la['Longitude'], df_la['Neighborhoods']):
    label1 = folium.Popup(neigh, parse_html=True)
    folium.CircleMarker(
        [lat1, lng1],
        radius = 5,
        popup = label1,
        color = 'blue',
        fill = True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_LA)
    
print(f"Displayed below are the {df_la.shape[0]} neighborhoods in the city of Los Angeles.")
map_LA

The geographical coordinates of LA are 34.0536909, -118.242766.
Displayed below are the 114 neighborhoods in the city of Los Angeles.


Let's take a look at the locations of electronics stores.

In [41]:
address = 'Los Angeles, California'

geolocator = Nominatim(user_agent='my-proj')
location = geolocator.geocode(address)
latitude2 = location.latitude
longitude2 = location.longitude

print(f'The geographical coordinates of LA are {latitude2}, {longitude2}.')

print(f"Mapping {elec_neigh.shape[0]} electronics stores.")

electronics_map = folium.Map(location=[latitude2, longitude2], zoom_start=12)

for lat2, lng2, busin in zip(elec_neigh['Venue Latitude'], elec_neigh['Venue Longitude'], elec_neigh['Venue']):
    label2 = folium.Popup(busin, parse_html=True)
    folium.CircleMarker(
        [lat2, lng2], 
        radius = 3.5,
        popup = label2,
        color = 'red',
        fill = True,
        fill_opacity = 0.7).add_to(electronics_map)
    
electronics_map

The geographical coordinates of LA are 34.0536909, -118.242766.
Mapping 57 electronics stores.


From the map above, we see that many of the electronics stores sit near Downtown Los Angeles and stretch toward the Santa Monica area, with a few scattered businesses south of Northridge.
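A visual impression like this can be backed up with a simple count of stores per neighborhood. As a minimal sketch, the snippet below counts only the ten rows shown in the `elec_neigh` preview earlier, so it does not reflect the full set of 57 stores; on the real data the same call would be `elec_neigh['Neighborhood'].value_counts()`.

```python
import pandas as pd

# Neighborhood column of the ten electronics-store rows shown in the elec_neigh preview.
preview = pd.DataFrame({
    "Neighborhood": ["Hollywood"] * 4
    + ["Harvard Heights", "Fairfax", "East Hollywood", "Downtown", "Echo Park", "Echo Park"],
})

# Number of electronics stores per neighborhood, largest first.
stores_per_neighborhood = preview["Neighborhood"].value_counts()
print(stores_per_neighborhood)
```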

Let's go ahead and map out venues that are not electronics stores to get a better understanding of potential locations to recommend.

In [42]:
address = 'Los Angeles, California'

geolocator = Nominatim(user_agent='my-proj')
location = geolocator.geocode(address)
latitude3 = location.latitude
longitude3 = location.longitude

print(f'The geographical coordinates of LA are {latitude3}, {longitude3}.')

print(f"Mapping {la_clean.shape[0]} venues.")

neigh_map = folium.Map(location=[latitude3, longitude3], zoom_start=12)

for lat3, lng3, busin in zip(la_clean['Venue Latitude'], la_clean['Venue Longitude'], la_clean['Venue']):
    label3 = folium.Popup(busin, parse_html=True)
    folium.CircleMarker(
        [lat3, lng3], 
        radius = 3.5,
        popup = label3,
        color = 'purple',
        fill = True,
        fill_opacity = 0.7).add_to(neigh_map)
    
neigh_map

The geographical coordinates of LA are 34.0536909, -118.242766.
Mapping 1467 venues.
