# **Methodology To Be Adopted**
## How to gather the data and build the approach to come up with the solution

Now that a consultant has been identified to come up with a solution, let us look at the approach the consultant can take.

Steps for building the required data:

**1. Have a detailed list of Postal Codes for California State along with the Latitude and Longitude details for each and every postal code**

Several websites publish the postal codes for California together with their latitude and longitude.
Once a website is chosen, the details can be extracted in one of two ways:
1. Scrape the data from the website using the html5lib library and load it into pandas
2. Download the details in CSV format, make the necessary changes to the data, and then load it into pandas

To scrape the data using html5lib, install the package with the following command -
**pip install html5lib**

The lxml package is also needed to parse the contents of the webpage - **pip install lxml**

Next Step -
Import Pandas - **import pandas as pd**

Use the following syntax to load the details into pandas -
dfs = pd.read_html('URL', header = 0)

pd.read_html returns a list of DataFrames, one per table found on the page. The postal code table will contain at least the following columns -

1. Postal Code
2. Borough
3. City
4. State
5. Latitude
6. Longitude

A few changes can be made, such as removing the State column, since its value is the same for every row - **'California'**

**2. Since California has many cities, shortlist only Los Angeles to narrow down the research. The assumption is that Los Angeles is a better pick than the other cities.**

This can be done by copying the DataFrame into a new one that keeps only the rows where City is **'Los Angeles'**

**3. Now that we have the latitude and longitude details for all the postal codes, the next thing we need is the neighborhood details for these postal codes**

Neighborhood details here mean the venues surrounding each postal code

We can get these details from Foursquare location data

Getting the Credentials First

CLIENT_ID = 'Foursquare ID'
CLIENT_SECRET = 'Foursquare Secret'
VERSION = 'Version Number'

Generating the URL with which requests can be made to the Foursquare API for fetching venue data (latitude, longitude, radius and limit must already be defined)

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, radius, limit)
url
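
As an alternative to formatting the query string by hand, the same URL can be assembled from a parameter dictionary. This is only a sketch: the helper name `build_explore_url` and the placeholder credentials are assumptions made for illustration, and the endpoint and parameter names are the ones used in the URL above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, latitude, longitude,
                      radius=500, limit=100):
    """Return the venues/explore URL for one coordinate pair (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{latitude},{longitude}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" instead of encoding it as %2C
    return f"{EXPLORE_ENDPOINT}?{urlencode(params, safe=',')}"


# Placeholder credentials; the coordinates are those of Los Angeles.
example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20200329",
                                34.0536909, -118.2427666)
print(example_url)
```

Keeping the parameters in a dictionary makes it easy to change the radius or limit for individual requests without editing a long format string.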

**4. Using the Foursquare Location data for getting the details about the nearby Venues**

We already have the credentials in place and the URL created. 
Now we can request the venue details from the Foursquare API.

The first step is to import the requests library - **import requests**

The result returned by Foursquare will be in JSON format.
To read it properly and run the required operations on it, the JSON details need to be transferred to a pandas DataFrame.

For this we need the JSON normalization function - **from pandas.io.json import json_normalize** (pandas 1.0 and later also expose it as **pd.json_normalize**)

The request is made using the following code - **results = requests.get(url).json()**
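
Before normalizing, it can help to see which parts of the response actually matter. The sketch below flattens a response into one record per venue using plain Python. The `response` → `groups` → `items` → `venue` nesting matches the output shown later in this notebook; the `categories` field and the sample category value `'Park'` are assumptions, and `venues_to_rows` is a hypothetical helper name.

```python
def venues_to_rows(results):
    """Flatten a venues/explore response into one dict per venue."""
    rows = []
    for group in results.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item.get("venue", {})
            location = venue.get("location", {})
            # Assumed field: a list of category dicts, first entry used.
            categories = venue.get("categories") or [{}]
            rows.append({
                "name": venue.get("name"),
                "category": categories[0].get("name"),
                "lat": location.get("lat"),
                "lng": location.get("lng"),
            })
    return rows


# Hand-made sample shaped like the real response (category value is made up).
sample = {"response": {"groups": [{"items": [{"venue": {
    "name": "Grand Park",
    "location": {"lat": 34.05503441823839, "lng": -118.24517873806079},
    "categories": [{"name": "Park"}],
}}]}]}}

rows = venues_to_rows(sample)
print(rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(rows)` to continue the analysis in pandas.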

**5. What to expect from the Neighborhood Data and why do we need it?**

Which neighborhoods should be more appealing for an Ice Cream Parlor?

Since we want our Ice Cream Parlor to be within easy reach of customers, the first level of filtering is to identify areas with a good population density. 

How to figure out whether the density is low or high - 
This can be inferred from the existing venues in a locality. Places with many restaurants, shopping malls, schools, colleges, universities, firms, and corporate/business parks indicate a high density. 

In general, this is all about finding areas that are always filled with potential customers rather than choosing areas that are too residential and lack the human activity needed to sustain the business.

**Important venues to be tracked from the Foursquare Location Data -**

1. Restaurants/Pubs
2. Malls
3. Multiplexes, Cinema Theatres, Local Operas
4. Schools
5. Universities/Colleges
6. Industries/Firms/Business Parks
7. Airports/Bus Terminals/Metro Stations

Apart from these, once we have the venue details, we can add a few more to the list.
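
One possible way to apply this list is to tag each venue category returned by Foursquare with the list item it belongs to. The sketch below is illustrative only: the keyword sets are guesses rather than real Foursquare category names, and `interest_group` is a hypothetical helper.

```python
import re

# Hypothetical keyword sets, one per item in the list above; adjust them to
# the category names that actually come back from Foursquare.
INTEREST_KEYWORDS = {
    "Restaurants/Pubs": {"restaurant", "pub"},
    "Malls": {"mall"},
    "Multiplexes, Cinema Theatres, Local Operas": {"multiplex", "cinema", "theater", "theatre", "opera"},
    "Schools": {"school"},
    "Universities/Colleges": {"university", "college"},
    "Industries/Firms/Business Parks": {"office", "factory", "business"},
    "Airports/Bus Terminals/Metro Stations": {"airport", "bus", "metro"},
}


def interest_group(category_name):
    """Return the matching list item for a venue category, or None."""
    # Compare whole words so that "pub" does not match "Public Art".
    words = set(re.findall(r"[a-z]+", category_name.lower()))
    for group, keywords in INTEREST_KEYWORDS.items():
        if words & keywords:
            return group
    return None


sample_categories = ["Mexican Restaurant", "Shopping Mall", "Public Art", "Bus Station"]
tagged = {name: interest_group(name) for name in sample_categories}
print(tagged)
```

Venues whose category maps to `None` are candidates for removal, or for a new list item if they turn out to be relevant.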

**6. Finding Top Venues for all the Postal Codes to check the areas of our interest, the areas that will suit an Ice Cream Parlor**

This will be done by finding the 10 most common venue categories, in order, for each postal code. One-hot encoding is used for this: each venue category becomes an indicator column, and the indicators are summed to count the venues of each category in each neighbourhood. 
The counts are then sorted in descending order to produce the Top 10 venues for each Borough/Postal Code.
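
The counting and ranking step can be sketched in plain Python. Summing one-hot indicator columns per postal code gives the same numbers as counting categories directly, which is what the sketch does; in pandas the same result would typically come from `pd.get_dummies` followed by a group-by. The helper name `top_venue_categories` and the sample venue categories are made up for illustration.

```python
from collections import Counter, defaultdict


def top_venue_categories(venues, top_n=10):
    """Rank venue categories per postal code.

    venues: iterable of (postal_code, category) pairs.
    Returns {postal_code: [(category, count), ...]} sorted by count descending.
    """
    counts = defaultdict(Counter)
    for postal_code, category in venues:
        counts[postal_code][category] += 1

    ranked = {}
    for postal_code, counter in counts.items():
        # Sort by count descending, then by name so ties are ordered predictably.
        ordered = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
        ranked[postal_code] = ordered[:top_n]
    return ranked


# Made-up venues for two of the postal codes in the data set.
sample_venues = [
    ("90004", "Cafe"), ("90004", "Cafe"), ("90004", "Park"),
    ("90006", "Shopping Mall"),
]
top = top_venue_categories(sample_venues, top_n=2)
print(top)
```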

At this point it is important to check for venues that are not of interest and remove them from the list.

We will then have a reduced list of Boroughs whose venues or localities are of interest to us.

**7. Using the Top Venues data, clusters of areas will be formed based on the similarity of their venues and then presented to the Client - IceStorm**

Clusters can be formed using the similarity of the top venues for each Borough. Clustering can be done using K-Means.
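
To make the clustering step concrete, here is a minimal plain-Python sketch of K-Means (Lloyd's algorithm). It is meant only to show the idea; in practice a library implementation such as scikit-learn's `KMeans` would normally be used. Each point is assumed to be a vector of venue-category frequencies for one postal code, and the sample vectors are made up.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric vectors; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up frequency vectors: (share of restaurants, share of parks).
sample_points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels, centroids = kmeans(sample_points, k=2)
print(labels)
```

Postal codes that receive the same label have a similar venue mix and can be presented to the client as one group.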

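The clusters can then be plotted on a map. Note that geopy provides geocoding (coordinates for an address) rather than plotting, so a mapping library such as folium would be needed to draw the map.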

**8. Once we have details on the venues in each cluster, we can prioritize them based on the list of important venues in point no. 5 and come up with 2-3 Postal Codes where IceStorm can start its Ice Cream Parlor.**

We can then present the shortlisted clusters to the Client and see whether further specifications are available to narrow down the result.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('California PinCodes.csv')

In [4]:
df.head(5)

Unnamed: 0,Sr. no,Borough,PinCode,District,State,Lattitude,Longitude
0,1,Beverly Hills,90210,Los Angeles,California,34.09,-118.406
1,2,Los Angeles,90002,Los Angeles,California,33.95,-118.246
2,3,Los Angeles,90003,Los Angeles,California,33.965,-118.273
3,4,Los Angeles,90004,Los Angeles,California,34.076,-118.303
4,5,Los Angeles,90006,Los Angeles,California,34.049,-118.292


In [5]:
df.drop(columns = 'State', inplace = True)

In [6]:
df

Unnamed: 0,Sr. no,Borough,PinCode,District,Lattitude,Longitude
0,1,Beverly Hills,90210,Los Angeles,34.09,-118.406
1,2,Los Angeles,90002,Los Angeles,33.95,-118.246
2,3,Los Angeles,90003,Los Angeles,33.965,-118.273
3,4,Los Angeles,90004,Los Angeles,34.076,-118.303
4,5,Los Angeles,90006,Los Angeles,34.049,-118.292
...,...,...,...,...,...,...
195,196,Inglewood,90311,Los Angeles,33.962,-118.353
196,197,Inglewood,90312,Los Angeles,33.962,-118.353
197,198,Santa Monica,90402,Los Angeles,34.035,-118.503
198,199,Santa Monica,90406,Los Angeles,34.019,-118.491


In [7]:
dfnew = df[df.District == 'Los Angeles'].copy()  # explicit copy, so the in-place drop below does not raise SettingWithCopyWarning

In [8]:
dfnew.drop(columns = 'Sr. no', inplace = True)


In [9]:
dfnew.reset_index()

Unnamed: 0,index,Borough,PinCode,District,Lattitude,Longitude
0,0,Beverly Hills,90210,Los Angeles,34.09,-118.406
1,1,Los Angeles,90002,Los Angeles,33.95,-118.246
2,2,Los Angeles,90003,Los Angeles,33.965,-118.273
3,3,Los Angeles,90004,Los Angeles,34.076,-118.303
4,4,Los Angeles,90006,Los Angeles,34.049,-118.292
...,...,...,...,...,...,...
159,195,Inglewood,90311,Los Angeles,33.962,-118.353
160,196,Inglewood,90312,Los Angeles,33.962,-118.353
161,197,Santa Monica,90402,Los Angeles,34.035,-118.503
162,198,Santa Monica,90406,Los Angeles,34.019,-118.491


We have now narrowed the data down to the postal codes in the Los Angeles district, together with their latitudes and longitudes

## Setting up the Foursquare Credentials

In [10]:
CLIENT_ID = 'FFC1QZGY50XQSEEWBX4EP1AWTB1XNSYKZ2GJBWPYAA2NRMQ1' # your Foursquare ID
CLIENT_SECRET = 'JX4FLWTBWZKOV1UIDS5HOY5CP1PFBJMLOHQYS4SBBSRPBLNV' # your Foursquare Secret
VERSION = '20200328'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FFC1QZGY50XQSEEWBX4EP1AWTB1XNSYKZ2GJBWPYAA2NRMQ1
CLIENT_SECRET:JX4FLWTBWZKOV1UIDS5HOY5CP1PFBJMLOHQYS4SBBSRPBLNV


In [1]:
!conda install -c conda-forge geopy --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          92 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.21.0-py_0



Downloading and Extracting Packages
geopy-1.21.0         | 58 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ##################################### |

In [11]:
from geopy.geocoders import Nominatim

In [25]:
address = 'Los Angeles, California'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles city are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Los Angeles city are 34.0536909, -118.2427666.


In [26]:
radius = 500
limit = 100
VERSION = '20200329' # kept as a string, like the earlier definition
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, radius, limit)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=FFC1QZGY50XQSEEWBX4EP1AWTB1XNSYKZ2GJBWPYAA2NRMQ1&client_secret=JX4FLWTBWZKOV1UIDS5HOY5CP1PFBJMLOHQYS4SBBSRPBLNV&v=20200329&ll=34.0536909,-118.2427666&radius=500&limit=100'

In [27]:
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0 also offers pd.json_normalize)

In [28]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e9c912bd03993002897f518'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Civic Center',
  'headerFullLocation': 'Civic Center, Los Angeles',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 25,
  'suggestedBounds': {'ne': {'lat': 34.0581909045, 'lng': -118.23734531946405},
   'sw': {'lat': 34.0491908955, 'lng': -118.24818788053594}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4fecf601067d351381ea64fa',
       'name': 'Grand Park',
       'location': {'address': '200 N Grand Ave',
        'crossStreet': 'btwn Temple & 1st St',
        'lat': 34.05503441823839,
        'lng': -118.24517873806079,
        'labeledLatLngs': [

### Now we have all the data that can be analysed and worked on to arrive at the solution
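
The request above covers only the city-centre coordinate. One way to repeat it for every row of `dfnew` is sketched below. The function names, the `fetch` parameter, the placeholder credentials and the stubbed response are assumptions for illustration; the stub stands in for the live API so that the sketch runs offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def fetch_json(url):
    """Download a URL and decode its JSON body (standard library only)."""
    with urlopen(url) as response:
        return json.load(response)


def venues_for_postal_codes(postal_rows, credentials, fetch=fetch_json,
                            radius=500, limit=100):
    """Collect (pin_code, venue_name) pairs for every postal code.

    postal_rows: iterable of (pin_code, latitude, longitude).
    credentials: dict with client_id, client_secret and v.
    """
    collected = []
    for pin_code, latitude, longitude in postal_rows:
        params = dict(credentials, ll=f"{latitude},{longitude}",
                      radius=radius, limit=limit)
        url = f"{EXPLORE_ENDPOINT}?{urlencode(params, safe=',')}"
        results = fetch(url)
        for group in results.get("response", {}).get("groups", []):
            for item in group.get("items", []):
                collected.append((pin_code, item["venue"]["name"]))
    return collected


# Offline demonstration: a stubbed response replaces the live API call.
def fake_fetch(url):
    return {"response": {"groups": [{"items": [{"venue": {"name": "Grand Park"}}]}]}}


demo = venues_for_postal_codes(
    [(90004, 34.076, -118.303)],
    {"client_id": "YOUR_ID", "client_secret": "YOUR_SECRET", "v": "20200329"},
    fetch=fake_fetch,
)
print(demo)
```

With the real data, the rows could be supplied as `zip(dfnew['PinCode'], dfnew['Lattitude'], dfnew['Longitude'])`, leaving `fetch` at its default.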

## End Of Notebook