# The Importance of Location for Startup Companies
## (Machine Learning and API Focused Project)

# Introduction
## Background
   Artificial Intelligence (AI) is challenging ethical barriers and transcending the capabilities of modern-day technology. IBM Research defines AI as “anything that makes machines act more intelligently” [1]. It is important to note, though, that AI is an umbrella term under which machine learning (ML) falls. The goal of ML is to build models that make predictions without being explicitly programmed [2]. In other words, the machine essentially becomes an independent learner over time. Fortune 500 companies already using these tools in today’s market include Netflix, Amazon, Google, IBM, and Twitter.

   AI and ML are changing the way the world conducts business and customer service. It is therefore vital to keep pace with the present and to plan for the future. In the business world, location is one of the many factors that shape a company's future success. Another ingredient is one most people already know: networking. A startup cannot be venture-backed without money, and money does not appear out of thin air. In a nutshell, it comes down to who you know and who you are as a person [3][4].


## Problem
I decided to devise an unconventional problem. A client in the United States is looking to relocate their AI startup but needs direction on where to go. The US has what is called the AI triangle, a group of three cities that dominates research and breakthroughs in the AI field: San Francisco, CA; Boston, MA; and Pittsburgh, PA.

## Plan
I am going to show the client why Pittsburgh would be the best choice using a two-pronged approach: a low cost of living and a flourishing social life for networking.

## Audience
Interested parties include data scientists, entrepreneurs, researchers, engineers, and anyone else involved in the AI/ML field.

# Data
## Sources
1. The per capita GDP was found here: https://odn.data.socrata.com/dataset/ODN-GDP/mkpy-jf3j. This was used to compare the standard of living in California, Massachusetts, and Pennsylvania.

2. The cost of living in each city was found here: https://www.numbeo.com/cost-of-living/. I then created my own Excel table with the information I gathered for Boston, San Francisco, and Pittsburgh, exported it as a CSV, and loaded it into my Jupyter notebook.

3. The latitude and longitude coordinates were found through Wikipedia [5]. I compiled the coordinates for each neighborhood into an Excel file and exported it as a CSV.

4. The Foursquare API is documented here: https://developer.foursquare.com/. This was used to gather venues in Pittsburgh. I then organized the results into the ten most common venues in each neighborhood.

## Cleaning
Little cleaning was needed because the data I found or created was relatively good to go. However, the Foursquare API returned no information for 3 of the 89 neighborhoods, so I dropped the rows containing NaN with `df = df.dropna()`.

# Methodology
I used Anaconda to create my Jupyter notebook for this project. I first loaded all the necessary libraries, then organized the GDP data by state and computed each state's average over 20 years.

![Screen%20Shot%202019-07-14%20at%2010.22.13%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2010.22.13%20AM.png)

![Screen%20Shot%202019-07-14%20at%2010.22.26%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2010.22.26%20AM.png)
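
The per-state averaging step can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names (`state`, `year`, `per_capita_gdp`) are assumptions, the numbers are made-up placeholders and not real GDP figures, and only two years are shown instead of 20.

```python
import pandas as pd

# Placeholder data: the column names and values are illustrative assumptions,
# not the figures from the ODN GDP dataset.
gdp = pd.DataFrame({
    "state": ["California", "California",
              "Massachusetts", "Massachusetts",
              "Pennsylvania", "Pennsylvania"],
    "year": [2016, 2017] * 3,
    "per_capita_gdp": [100.0, 110.0, 120.0, 130.0, 90.0, 100.0],
})

# Average the yearly values for each state, then rank the states.
mean_gdp = (
    gdp.groupby("state")["per_capita_gdp"]
       .mean()
       .sort_values(ascending=False)
)
print(mean_gdp)
```

Sorting the averages in descending order makes the ranking easy to read before it is plotted.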

I prefer to present data as a graph and/or map because it helps a client understand the data quickly and accurately.

![Screen%20Shot%202019-07-14%20at%2010.27.50%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2010.27.50%20AM.png)

A map of the United States was created to show how far apart the cities are in relation to each other. This is important because some people are not fully aware of the distance. Additionally, the client wants to move their AI startup and would ideally like for it to be in the AI triangle.

![Screen%20Shot%202019-07-14%20at%2010.53.36%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2010.53.36%20AM.png)
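
To put numbers on how far apart the three cities are, the great-circle distances can be estimated with the haversine formula. This is a sketch under stated assumptions: the formula is not part of the original notebook, the coordinates are approximate city-center values, and the Earth is treated as a sphere with a mean radius of 6371 km.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

# Approximate city-center coordinates (latitude, longitude) in degrees.
CITIES = {
    "San Francisco": (37.7749, -122.4194),
    "Boston": (42.3601, -71.0589),
    "Pittsburgh": (40.4406, -79.9959),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Distance for every pair of cities in the triangle.
distances = {
    (c1, c2): haversine_km(CITIES[c1], CITIES[c2])
    for c1, c2 in combinations(CITIES, 2)
}
for (c1, c2), km in distances.items():
    print(f"{c1} to {c2}: {km:.0f} km")
```

These are straight-line distances over the Earth's surface, so actual travel distances will be longer.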

A new dataframe was created to compare the cost of living and rent in each city. A bar graph was then created for visualization.

![Screen%20Shot%202019-07-14%20at%2010.56.14%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2010.56.14%20AM.png)

![Screen%20Shot%202019-07-14%20at%2010.57.21%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2010.57.21%20AM.png)

Now that I have shown why Pittsburgh would be the best choice of the three cities with regard to cost of living and rent, I am going to use the Foursquare API to display the various venues throughout the city. The first step was to find the coordinates for all of the neighborhoods and assign each neighborhood to one of five areas. Python’s folium library was then used to visualize Pittsburgh with the neighborhoods superimposed on top. In total, the venues fell into 222 unique categories.

![Screen%20Shot%202019-07-14%20at%2010.59.47%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2010.59.47%20AM.png)

![Screen%20Shot%202019-07-14%20at%2011.02.45%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.02.45%20AM.png)

![Screen%20Shot%202019-07-14%20at%2011.03.36%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.03.36%20AM.png)

![Screen%20Shot%202019-07-14%20at%2011.00.57%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.00.57%20AM.png)
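
Counting the unique venue categories is a one-line operation once the venues are in a dataframe. The sketch below assumes a table with `Neighborhood`, `Venue`, and `Venue Category` columns; the column names and the rows are placeholders, not the actual Foursquare results.

```python
import pandas as pd

# Placeholder venue table; the real one was built from Foursquare API responses.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C"],
    "Venue": ["v1", "v2", "v3", "v4", "v5"],
    "Venue Category": ["Coffee Shop", "Park", "Coffee Shop", "Bar", "Park"],
})

# Number of distinct categories across the whole city.
n_categories = venues["Venue Category"].nunique()

# Number of venues returned for each neighborhood.
venues_per_neighborhood = venues.groupby("Neighborhood")["Venue"].count()

print(f"There are {n_categories} unique categories.")
print(venues_per_neighborhood)
```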

After taking a look at the venue categories in the neighborhoods, I applied one-hot encoding. This matters because “many ML algorithms cannot work with label data directly” [6], which means the category labels must be converted to a numerical form.

![Screen%20Shot%202019-07-14%20at%2011.10.49%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.10.49%20AM.png)
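
As a minimal sketch of what one-hot encoding does to a venue table, consider the toy example below. The column names and rows are placeholders, and the final step of averaging the indicator columns per neighborhood is an assumption about how the encoded data is typically summarized before clustering, not something stated in the text above.

```python
import pandas as pd

# Placeholder venue table with one categorical column.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "Venue Category": ["Coffee Shop", "Park", "Coffee Shop", "Bar", "Bar"],
})

# One-hot encode: one 0/1 indicator column per category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)

# Put the neighborhood name back as the first column.
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Assumed summary step: the mean of each indicator is the share of a
# neighborhood's venues that belong to that category.
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` is now a purely numerical profile of a neighborhood, which is the form a clustering algorithm needs.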

I then created a table which shows the top 10 most common venues for each neighborhood.

![Screen%20Shot%202019-07-14%20at%2011.43.13%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.43.13%20AM.png)

![Screen%20Shot%202019-07-14%20at%2011.43.20%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.43.20%20AM.png)
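
One way to build such a table is to sort each neighborhood's category frequencies and keep the top entries. The sketch below uses a hypothetical helper, `most_common_venues`, and a made-up frequency table with three categories; it keeps the top 2 rather than the top 10 to stay short.

```python
import pandas as pd

# Placeholder frequency table: rows are neighborhoods, columns are categories.
freq = pd.DataFrame(
    {"Bar": [0.2, 0.6], "Coffee Shop": [0.5, 0.3], "Park": [0.3, 0.1]},
    index=pd.Index(["A", "B"], name="Neighborhood"),
)

def most_common_venues(freq_row, n):
    """Return the n category names with the highest frequency in one row."""
    return freq_row.sort_values(ascending=False).index[:n].tolist()

n = 2  # the report uses 10
top = pd.DataFrame(
    [most_common_venues(row, n) for _, row in freq.iterrows()],
    index=freq.index,
    columns=[f"Most Common Venue #{i + 1}" for i in range(n)],
)
print(top)
```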

Some of the neighborhoods share venue categories, so I used the K-Means algorithm, a type of unsupervised machine learning, to cluster the neighborhoods.

I ran the elbow method several times, and the optimal k turned out to be 7. I recognize that this is not the most rigorous way to choose k, but it was a method within the scope of this project.


![Screen%20Shot%202019-07-14%20at%2011.50.06%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.50.06%20AM.png)
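
The elbow method fits K-Means for a range of k values and looks for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below illustrates the idea with scikit-learn on synthetic data; the three Gaussian blobs are a stand-in for the neighborhood frequency table, so the elbow here falls at 3 rather than 7.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated 2-D blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Fit K-Means for each candidate k and record the inertia.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against k gives the elbow curve. Because the bend can be ambiguous on real data, a score such as the silhouette coefficient is a common cross-check.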

I then assigned cluster labels, dropped NaN values, and converted cluster labels to int type in a new dataframe as shown below:

![Screen%20Shot%202019-07-14%20at%2011.58.15%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.58.15%20AM.png)

![Screen%20Shot%202019-07-14%20at%2011.58.22%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.58.22%20AM.png)
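
A likely reason the labels need converting to int is that joining them onto the full neighborhood list leaves NaN for neighborhoods without venue data, and pandas stores a column containing NaN as float. The sketch below reproduces that situation; the frame names, column names, and values are all placeholders.

```python
import pandas as pd

# Placeholder neighborhood list with coordinates (made-up values).
neighborhoods = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [40.44, 40.45, 40.46],
    "Longitude": [-79.99, -79.98, -79.97],
})

# Cluster labels exist only for neighborhoods that had venue data.
labels = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Cluster Labels": [0, 1],
})

# A left join keeps every neighborhood; "C" gets NaN, so the column is float.
merged = neighborhoods.merge(labels, on="Neighborhood", how="left")

# Drop unlabeled rows, then restore the integer type.
merged = merged.dropna(subset=["Cluster Labels"])
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged)
```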

Next, I created a map that assigns each cluster its own color from a rainbow palette.

![Screen%20Shot%202019-07-14%20at%2011.59.50%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.59.50%20AM.png)

![Screen%20Shot%202019-07-14%20at%2011.59.57%20AM.png](attachment:Screen%20Shot%202019-07-14%20at%2011.59.57%20AM.png)
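
A rainbow palette simply spaces the cluster colors evenly around the hue wheel. The sketch below generates one hex color per cluster with Python's standard `colorsys` module; this is a stand-in for whatever colormap the notebook used, and `rainbow_hex` is a hypothetical helper name.

```python
import colorsys

def rainbow_hex(k):
    """Return k hex colors evenly spaced around the hue wheel."""
    colors = []
    for i in range(k):
        # Full saturation and brightness; only the hue changes per cluster.
        r, g, b = colorsys.hsv_to_rgb(i / k, 1.0, 1.0)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

# One color per cluster label (0 through 6).
palette = rainbow_hex(7)
for label, color in enumerate(palette):
    print(label, color)
```

Each neighborhood marker on the map can then be drawn with `palette[label]`, where `label` is its cluster number.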

Lastly, I organized the neighborhoods by cluster label (0, 1, 2, 3, 4, 5, or 6) to examine the most common venues of the neighborhoods in each cluster. An example is shown below:

![Screen%20Shot%202019-07-14%20at%2012.02.56%20PM.png](attachment:Screen%20Shot%202019-07-14%20at%2012.02.56%20PM.png)
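
Selecting the members of one cluster is a boolean filter on the label column. The sketch below uses placeholder data and a hypothetical helper, `cluster_members`.

```python
import pandas as pd

# Placeholder clustered table (made-up labels and venues).
clustered = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Cluster Labels": [0, 1, 0, 2],
    "1st Most Common Venue": ["Bar", "Park", "Bar", "Coffee Shop"],
})

def cluster_members(df, label):
    """Return the rows for one cluster, without the label column."""
    return df.loc[df["Cluster Labels"] == label].drop(columns="Cluster Labels")

cluster_0 = cluster_members(clustered, 0)
print(cluster_0)
```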

# Results
Massachusetts had the highest per capita GDP, which suggests a higher standard of living than in Pennsylvania and California. After comparing the cost of living and rent indexes of the three cities, Pittsburgh was the best option. The Foursquare API returned 222 unique venue categories for Pittsburgh, and the optimal k for segmenting the neighborhoods by their most common venues was 7 clusters.

# Discussion
Pittsburgh was the most affordable city in the AI triangle, which suggests that the client should strongly consider moving their startup there. The large number of unique venue categories, in relation to the total population, suggests that there are numerous ways to network and to vary meeting locations. Additionally, Pittsburgh has public transportation throughout the city and reasonable parking permits for businesses and residents.

# Conclusion 
Pittsburgh is an AI triangle member, has a low cost of living, and boasts diverse networking venues, which would build a strong foundation for an AI startup.

# References

[1] AI definition: https://researcher.watson.ibm.com/researcher/view_group.php?id=135

[2] ML goal by Vishal Maini: https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12

[3] Venture Backed Startups: https://medium.com/swlh/the-truth-behind-how-venture-capital-chooses-startups-3c95a96ba836

[4] Whom You Know Matters: Venture Capital Networks and Investment Performance: https://www.stat.berkeley.edu/~aldous/Networks/hochberg.pdf

[5] Wikipedia: https://en.wikipedia.org/wiki/List_of_Pittsburgh_neighborhoods

[6] One-Hot Encoding: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/