# Major attractions nearby and Data Science jobs (in MA).

#### Date: 21 March 2019
#### Author: Nasir Ahmad
#### Email : nasir398@hotmail.com
#### Publication: This article is based purely on personal research for non-commercial use. Third-party services used here are credited, with no intention of copyright violation. 


## Introduction

Data science jobs are available around the globe, yet the job hunt is not easy when there are too many places to choose from. In this article I focus on the kinds of locations where Data Science jobs are available in Massachusetts, US. 
There are two common approaches to finding a job:
1.	Look for a new job around your current location. 
2.	Search by keyword and then go through each job description. 

Both methods leave out one basic question: what kind of neighborhood is the company located in? Everyone knows that high-tech jobs are available in Silicon Valley, Seattle, New York, Boston and so on. But what if you are not a big fan of living in a populous city? What if you just want a peaceful countryside where you can live and code for a living? 
This article answers the following question:

**“What are the major attractions nearby locations where Data Science jobs are being offered in Massachusetts?”**


## Data Section

1. Foursquare is a platform that provides information about places in a given neighborhood. In this article Foursquare's listings are used to find popular sites in a given area. It offers developers an API that returns results as JSON, which can then be processed as required. 

https://foursquare.com/city-guide

2. Adzuna is a website that provides job listings for any given location, along with other job-related services. This article uses the Adzuna developer API to get information about Data Science jobs in Massachusetts. 

https://www.adzuna.com/

3. Massachusetts is divided into 14 counties, as shown in the image below. 

https://upload.wikimedia.org/wikipedia/commons/b/b6/Massachusetts-counties-map.gif


#### Let's start by importing the libraries used in this project

In [None]:
import requests 
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; older versions expose it in pandas.io.json)

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import pandas as pd
import numpy as np

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt 

print('Libraries imported.')

The Adzuna developer API has detailed documentation and can return JSON results broken down by geographic location. 
I recommend this link: https://api.adzuna.com/static/swagger-ui/index.html#!/adzuna

Create a free developer account to obtain the app_id and app_key, which are prerequisites for calling the API. 


In [None]:
#Set parameters for API call
app_id ='51c4103a' # assigned by Adzuna
app_key='6bf4141ed2bb578e5c972eeb1a68e5b7' # Assigned by Adzuna
search_word='Data Science'#Keyword for which search is going to be performed 
search_location='Massachusetts' # location where the job search will be performed 

url ='https://api.adzuna.com:443/v1/api/jobs/us/geodata?app_id={}&app_key={}&what={}&where={}'.format(app_id,app_key,search_word,search_location)


We will use requests to call the URL and get the JSON response in return. 
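
The search keyword contains a space, so the query string needs percent-encoding at some point. As a minimal sketch, assuming the same endpoint as above, the URL can also be assembled from a dictionary with the standard library; `build_adzuna_url` is a hypothetical helper name and the credentials are placeholders, not real keys.

```python
from urllib.parse import urlencode

# Endpoint copied from the notebook; the credentials below are placeholders.
BASE_URL = 'https://api.adzuna.com:443/v1/api/jobs/us/geodata'

def build_adzuna_url(app_id, app_key, what, where):
    """Return the geodata URL with every parameter safely encoded."""
    params = {
        'app_id': app_id,
        'app_key': app_key,
        'what': what,    # search keyword
        'where': where,  # search location
    }
    return BASE_URL + '?' + urlencode(params)

example_url = build_adzuna_url('YOUR_APP_ID', 'YOUR_APP_KEY',
                               'Data Science', 'Massachusetts')
print(example_url)
```

Naming each parameter in a dictionary also makes it harder to pass values in the wrong order than with positional `format` placeholders.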

In [None]:
results = requests.get(url).json()
results

JSON is good for communication, but in a notebook we can accomplish much more with dataframes. Next we normalize the JSON into a dataframe, which looks like this:
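
To make the flattening step less of a black box, here is a plain-Python sketch of the idea behind `json_normalize`: nested dictionaries become single-level records whose keys are joined with dots. The `flatten` helper and the sample record are illustrative assumptions; the record only mimics the shape this notebook relies on (`count`, `location.area`, `location.display_name`), and its values are made up.

```python
def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into one level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as they are
            flat[full_key] = value
    return flat

# Hypothetical record shaped like one entry of results['locations']
sample = {
    'count': 12,
    'location': {
        'area': ['US', 'Massachusetts', 'Example County'],
        'display_name': 'Example County, Massachusetts',
    },
}
flat_sample = flatten(sample)
print(sorted(flat_sample))
```

The dotted names are why the cleaning step later splits each column name on `.` and keeps the last part.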


In [None]:
jobs_ma = results['locations']  
data_jobs_ma = json_normalize(jobs_ma) # flatten JSON
data_jobs_ma

For our purposes, the data we need is the number of jobs in each county, so we clean the data to get the required data set. 

In [None]:
# filter columns and clean data
filtered_columns = ['count', 'location.area', 'location.display_name']
data_jobs_ma =data_jobs_ma.loc[:, filtered_columns]
data_jobs_ma.columns = [col.split(".")[-1] for col in data_jobs_ma.columns]
data_jobs_ma.rename({'count':'job_count','display_name':'Neighborhood'}, axis=1, inplace=True)
#data_jobs_ma =data_jobs_ma.drop(labels='area',axis=1)


In [None]:

data_jobs_ma

Let's visualize the data as a bar graph and see how it looks. 

In [None]:
#Make a dataframe which is graph friendly 
jobs_ma_display = data_jobs_ma
#jobs_ma_display.Neighborhood=[col.split(",")[0] for col in jobs_ma_display.Neighborhood]
jobs_ma_display['Neighborhood'] = jobs_ma_display['Neighborhood'].str.replace(', Massachusetts','')
jobs_ma_display = jobs_ma_display.set_index(jobs_ma_display.Neighborhood)
jobs_ma_display =jobs_ma_display.drop(labels='Neighborhood',axis=1)

#assign the type of plot as Bar
jobs_ma_display.plot(kind='bar',figsize=(15, 9))

plt.xlabel('MA County')
plt.ylabel('Number of Data Science Jobs')

plt.title('Data Science jobs in Massachusetts', y=1) 

plt.show()

Interestingly, most of the Data Science jobs are in Middlesex and Suffolk counties. Now let's break down the job counts in the other counties by plotting a pie chart.  
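
Matplotlib's pie chart expects the `explode` sequence to have exactly one entry per wedge, so a hand-written list can break whenever the number of counties returned by the API changes. The sketch below derives the offsets from the data instead; `make_explode` is a hypothetical helper and the counts are made up for illustration.

```python
def make_explode(values, max_offset=0.4):
    """Return one offset per value: the smaller the value, the larger the offset."""
    if not values:
        return []
    largest = max(values)
    if largest <= 0:
        return [0.0] * len(values)
    return [round(max_offset * (1 - v / largest), 3) for v in values]

# Made-up job counts, sorted from largest to smallest
counts = [80, 40, 20, 5]
explode = make_explode(counts)
print(explode)
```

In this notebook the result could be passed as `explode=make_explode(list(jobs_ma_display['job_count']))` once the filtered dataframe exists.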

In [None]:
#The default colors and boundaries are not very pretty, so I prefer my own
colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'orange', 'pink','brown','black','red','blue','silver']
explode_list = [0,  0,0,0.03,0.04,0.05,0.08, 0.22, 0.44] # ratio for each county with which to offset each wedge.

#Keep only counties with fewer than 100 jobs. This criterion makes sure the major counties are excluded from our data. 
jobs_ma_display = jobs_ma_display[jobs_ma_display['job_count'] <100]


jobs_ma_display['job_count'].plot(kind='pie',
                            figsize=(20, 9),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,         # turn off labels on pie chart
                            pctdistance=1.2,    # the ratio between the center of each pie slice and the start of the text generated by autopct 
                            colors=colors_list,  # add custom colors
                            explode=explode_list # 'explode' lowest counties
                            )


plt.title(' Data Science jobs in Massachusetts Except Middlesex & Suffolk county', y=1.12) 
plt.axis('equal') 

# add legend
plt.legend(labels=jobs_ma_display['job_count'].index, loc='upper left') 

plt.show()

Looks like Worcester and Norfolk now take the lead, accounting for the major portion of the data science jobs outside Middlesex and Suffolk. 

This brings us to the end of the Adzuna section. 
Next we will explore the Foursquare API for places in Massachusetts. 

The Foursquare API relies heavily on geographical coordinates. An address can also be used, but to be on the safe side we use the latitude and longitude of each county. 

In [None]:
#function to return the latitude and longitude of an address
def get_log_lat(address):
    try:
        geolocator = Nominatim(user_agent="ma_explorer")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
    except Exception:
        # geocoding failed or found no match: fall back to the -1 sentinel
        latitude = -1
        longitude = -1
    return latitude, longitude


In [None]:
#Let's loop through the counties of Massachusetts to add coordinates 
data_jobs_ma['latitude'] = 0.0
data_jobs_ma['longitude'] = 0.0
for idx, area in zip(data_jobs_ma.index, data_jobs_ma['area']):
    print(area)
    lat, lng = get_log_lat(area)
    # assign with .loc to avoid chained-assignment problems
    data_jobs_ma.loc[idx, 'latitude'] = lat
    data_jobs_ma.loc[idx, 'longitude'] = lng

To confirm the changes, explore the dataframe.
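
Beyond eyeballing the dataframe, a quick sanity check can flag rows where geocoding failed. The helper in this notebook falls back to (-1, -1), which is itself a valid coordinate pair, so the sketch below tests against a rough bounding box instead. The box values are approximate assumptions for illustration, and `looks_like_ma` is a hypothetical name.

```python
# Rough bounding box for Massachusetts; approximate values assumed for this sketch.
MA_BOUNDS = {'lat_min': 41.0, 'lat_max': 43.0, 'lon_min': -73.6, 'lon_max': -69.8}

def looks_like_ma(latitude, longitude, bounds=MA_BOUNDS):
    """Return True when the point falls inside the rough Massachusetts box."""
    return (bounds['lat_min'] <= latitude <= bounds['lat_max']
            and bounds['lon_min'] <= longitude <= bounds['lon_max'])

inside = looks_like_ma(42.40, -71.38)  # a point in eastern Massachusetts
sentinel = looks_like_ma(-1, -1)       # the failure sentinel
print(inside, sentinel)
```

Rows that fail the check are worth geocoding again or dropping before they are sent to Foursquare.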

In [None]:
data_jobs_ma

## 2. Explore Massachusetts using Foursquare API

The Foursquare API provides data about sites near a given place. Below is the information required for calling the API: 

In [None]:
CLIENT_ID = 'BKONW20YSKTSNRHM2ECIEPYFYA14SSA4QELHWVVP25N5SJLZ' # your Foursquare ID
CLIENT_SECRET = '1VZQZHXLZ3TT1CNSJ2O113HROIHMIPU24KF3MDLE1J4OSSJI' # your Foursquare Secret
VERSION = '20180604' 
radius= 22000 # the approx radius of smallest county
LIMIT = 100


Create a function to invoke the API:
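
The function in this notebook reads venues from `response` → `groups[0]` → `items` and takes the first category of each venue. As a hedged sketch of just the parsing step, the snippet below works on a hand-written payload of that shape (no network call) and tolerates a venue without categories; `extract_venues` and the payload contents are illustrative assumptions.

```python
def extract_venues(payload, neighborhood):
    """Return (neighborhood, venue name, lat, lng, category) tuples."""
    rows = []
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder when a venue has no category
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((neighborhood,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up payload that only mimics the structure the notebook relies on
sample_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Example Park',
                           'location': {'lat': 42.0, 'lng': -71.0},
                           'categories': [{'name': 'Park'}]}},
                {'venue': {'name': 'Unlabelled Spot',
                           'location': {'lat': 42.1, 'lng': -71.1},
                           'categories': []}},
            ]
        }]
    }
}
rows = extract_venues(sample_payload, 'Example County')
print(rows)
```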

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=22000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Execute the above code for the neighborhoods of MA

In [None]:
ma_venues = getNearbyVenues(names=data_jobs_ma['Neighborhood'],
                                   latitudes=data_jobs_ma['latitude'],
                                   longitudes=data_jobs_ma['longitude']
                                  )

In [None]:
print(ma_venues.shape)
ma_venues.head()

Let's see the venue count for each neighborhood

In [None]:
ma_venues.groupby('Neighborhood').count()

#### Analyze each neighborhood

In [None]:
# one hot encoding
ma_onehot = pd.get_dummies(ma_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
ma_onehot['Neighborhood'] = ma_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [ma_onehot.columns[-1]] + list(ma_onehot.columns[:-1])
ma_onehot = ma_onehot[fixed_columns]

ma_onehot.head()

In [None]:
#lets see the stats 

ma_onehot.shape

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category
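
Taking the mean of a one-hot column per neighborhood gives the share of that neighborhood's venues that fall in the category. The plain-Python sketch below computes the same shares by counting, on made-up data; `category_frequencies` is a hypothetical helper, and unlike the dataframe version it simply omits categories that never occur in a neighborhood instead of storing 0.

```python
from collections import Counter, defaultdict

# Made-up (neighborhood, venue category) pairs for illustration
venues = [
    ('A', 'Park'), ('A', 'Park'), ('A', 'Cafe'), ('A', 'Museum'),
    ('B', 'Cafe'), ('B', 'Cafe'),
]

def category_frequencies(pairs):
    """Return {neighborhood: {category: share of that neighborhood's venues}}."""
    counts = defaultdict(Counter)
    for hood, category in pairs:
        counts[hood][category] += 1
    return {
        hood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for hood, counter in counts.items()
    }

freqs = category_frequencies(venues)
print(freqs)
```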

In [None]:
ma_grouped = ma_onehot.groupby('Neighborhood').mean().reset_index()
ma_grouped.head()

In [None]:
ma_grouped.shape

In [None]:
##Let's print each neighborhood along with its top 8 most common venues
num_top_venues = 8

for hood in ma_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = ma_grouped[ma_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
##Let's put that into a pandas dataframe
##First, let's write a function to sort the venues in descending order.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#### Now let's create the new dataframe and display the top 10 venues for each neighborhood
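
The column headers need ordinal suffixes (1st, 2nd, 3rd, 4th, ...). For ten columns a simple lookup is enough; as a sketch that also holds for larger counts such as 11th or 22nd, a small helper can pick the suffix. `ordinal` is a hypothetical name introduced only for this illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 10th to 20th, 110th to 120th, and so on
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Build the header row for the top 10 venues
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)
]
print(columns)
```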

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = ma_grouped['Neighborhood']

for ind in np.arange(ma_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ma_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

In [None]:
ma_grouped.head()

## 3. Methodology: K-Means Clustering of Counties

Since the objective of this article is to characterize possible locations, let us run *k*-means to group the neighborhoods into 3 clusters.
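
For readers new to *k*-means, the toy below shows the two steps the algorithm alternates between: assign each point to its nearest centroid, then move each centroid to the mean of its members. It is only an illustration in plain Python with hand-picked starting centroids and made-up frequency vectors; the notebook itself relies on scikit-learn's `KMeans`, which chooses its starting points differently.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers, starting from the given centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Made-up venue-frequency vectors: (share of parks, share of cafes)
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels, centroids = kmeans(points, [points[0], points[2]])
print(labels)
```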

In [None]:
# set number of clusters
kclusters = 3

ma_grouped_clustering = ma_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ma_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
data_jobs_ma.head()

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ma_merged = data_jobs_ma

# merge with data_jobs_ma to add latitude/longitude for each neighborhood
ma_merged = ma_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
ma_merged =ma_merged.drop(labels='area',axis=1)

ma_merged.head()

Finally, let's visualize the resulting clusters
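
The marker size on the map grows with the logarithm of the job count, so counties with very many jobs do not dwarf the rest. A plain logarithm is undefined at zero, so as a small variation (an assumption, not what the notebook does) the sketch below uses `log1p`, which stays finite for a county with no jobs; `marker_radius` is a hypothetical helper.

```python
import math

def marker_radius(job_count, base=15):
    """Radius that grows with the log of the job count and tolerates zero."""
    return base + math.log1p(max(job_count, 0))

# Made-up job counts for illustration
for count in (0, 10, 1000):
    print(count, round(marker_radius(count), 2))
```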

In [None]:
# create map
latitude=42.40 
longitude=-71.38
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=8)
folium.TileLayer('cartodbpositron').add_to(map_clusters)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster,job_count in zip(ma_merged['latitude'], ma_merged['longitude'], ma_merged['Neighborhood'], ma_merged['Cluster Labels'],ma_merged['job_count']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=15+np.log(job_count),
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        weight = 5,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
ma_merged

<a id='item5'></a>

## The End