# AI venue type investment model
## Author: [Carlos Morlan](https://www.linkedin.com/in/carlos-morlan-96343a15/)
### Published date: July 22<sup>nd</sup>, 2019

[![Battle of neighborhoods](https://www.garybarker.co.uk/files/uk-city-life-cartoon-illustration.jpg)](https://www.garybarker.co.uk)

# Table of contents

  - [Introduction](#Intro)
  - [Methodology](#Metho)

### <a name="Intro"></a>Introduction

I live in Mexico City, one of the largest and most populous cities in the world. One of its citizens' main concerns is that the country's economy is volatile; you can feel it in the air. A good proxy for a country's overall stability is the consistency of its economic growth. In my view, attracting more investment is a good way to improve that growth. Investments, whether made by the government or by private companies, should be planned around the needs of the different communities in the country's main cities.

This capstone project tries to show how Mexico City can attract new investment to strengthen Mexico's economy. The government and new investors should know which places are popular with citizens for having fun, eating out, or buying supplies. With that information, either can make better decisions about the type of business to open and how well people will receive the new venue. Moreover, if a popular place has special attributes, such as a medical center, investors can open the kinds of businesses needed nearby, such as laboratories, pharmacies, or even a hotel where out-of-town visitors can stay while the patients they accompany receive treatment. Small and large investors alike can use this information to find promising opportunities, and the city's communities get more and better services: it is a win-win situation.

### <a name="Metho"></a>Methodology

To make this happen, some ingredients are required:

- Major city spots identified by zone or neighborhood. 
- Popular venues identification based on social networks (www.foursquare.com).
- Data Science tools and algorithms.

Using the Data Science knowledge acquired in the courses of the IBM Data Science Professional track on Coursera, an AI venue type investment model will be created. What information will this model show? First, a map displaying 7 clusters, each grouping neighborhoods with similar venue types. As learned over the past weeks, data visualization is often the most effective way to communicate ideas. Second, for each cluster, a list of the most common venue types, ordered by frequency. With these lists, the government or private investors can decide where to open a new business (in which neighborhood) and of what type (based on the venue category from the social network used). They can also spot any type of business that is missing in specific neighborhoods.
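As a minimal sketch of the second output, the per-cluster list can be built by counting venue categories inside each cluster. The `venues` frame, its column names, and the category values below are hypothetical toy data, not results from the project.

```python
import pandas as pd

# Toy data (hypothetical): each venue is tagged with the cluster label
# of the neighborhood it belongs to.
venues = pd.DataFrame({
    'Cluster': [0, 0, 0, 0, 1, 1, 1],
    'Category': ['Taco Place', 'Bar', 'Taco Place', 'Taco Place',
                 'Pharmacy', 'Hospital', 'Pharmacy'],
})

# Count the categories within each cluster; value_counts() returns the
# counts sorted from most to least frequent.
by_cluster = {
    cluster: group['Category'].value_counts()
    for cluster, group in venues.groupby('Cluster')
}

for cluster, counts in by_cluster.items():
    print(f'Cluster {cluster}')
    print(counts.to_string())
```

A category that never appears in a cluster's list is a candidate for the "missing business" check mentioned above.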

I hope you agree that this analysis can be applied broadly, provided the essential data is available for the city or group of cities where the model is to be applied. Adding more data to the model in the future, such as real estate costs or crime rates, would improve it.

#### Major City Spots
To get this information I will use a public dataset from www.geonames.org, downloaded locally; see the [References](#Refer) section for more details. The data is tab-delimited text in UTF-8 encoding, with the following fields:

* country code      : iso country code, 2 characters
* postal code       : varchar(20)
* place name        : varchar(180)
* admin name1       : 1. order subdivision (state) varchar(100)
* admin code1       : 1. order subdivision (state) varchar(20)
* admin name2       : 2. order subdivision (county/province) varchar(100)
* admin code2       : 2. order subdivision (county/province) varchar(20)
* admin name3       : 3. order subdivision (community) varchar(100)
* admin code3       : 3. order subdivision (community) varchar(20)
* latitude          : estimated latitude (wgs84)
* longitude         : estimated longitude (wgs84)
* accuracy          : accuracy of lat/lng from 1=estimated, 4=geonameid, 6=centroid of addresses or shape

_FYI, this file doesn't have column headers._

```python
import pandas as pd
import numpy as np

# Read the source file: it is tab-delimited, and the postal code column (#2) should be treated as a string
postal_codes_tmp = pd.read_csv('MX.txt', sep='\t', header=None, dtype={1: str})

# Assign column headers because the file doesn't have them
postal_codes_tmp.columns = ['CountryCode', 'PostalCode', 'PlaceName', 'State', 'StateCode', 'TownHall', 'TownHallCode', 'AdminName3', 'AdminCode3', 'Latitude', 'Longitude', 'Accuracy']
# print(postal_codes_tmp.dtypes)

postal_codes_tmp.head()
```

| | CountryCode | PostalCode | PlaceName | State | StateCode | TownHall | TownHallCode | AdminName3 | AdminCode3 | Latitude | Longitude | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MX | 20000 | Zona Centro | Aguascalientes | 1 | Aguascalientes | 1 | Aguascalientes | 1.0 | 21.8734 | -102.2806 | 1 |
| 1 | MX | 20010 | Olivares Santana | Aguascalientes | 1 | Aguascalientes | 1 | Aguascalientes | 1.0 | 21.9644 | -102.3192 | 1 |
| 2 | MX | 20010 | Ramon Romo Franco | Aguascalientes | 1 | Aguascalientes | 1 | Aguascalientes | 1.0 | 21.9644 | -102.3192 | 1 |
| 3 | MX | 20010 | Las Brisas | Aguascalientes | 1 | Aguascalientes | 1 | Aguascalientes | 1.0 | 21.9644 | -102.3192 | 1 |
| 4 | MX | 20010 | San Cayetano | Aguascalientes | 1 | Aguascalientes | 1 | Aguascalientes | 1.0 | 21.9644 | -102.3192 | 1 |
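The rows above show several place names sharing one postal code and one pair of coordinates. One possible way to handle that, sketched here as an assumption and not necessarily the step the project takes, is to collapse the data to one row per postal code. The `sample` frame copies the five rows shown above; the `zones` and `PlaceNames` names are illustrative.

```python
import pandas as pd

# Five rows copied from the output above; in the notebook the full
# frame is postal_codes_tmp.
sample = pd.DataFrame({
    'PostalCode': ['20000', '20010', '20010', '20010', '20010'],
    'PlaceName': ['Zona Centro', 'Olivares Santana', 'Ramon Romo Franco',
                  'Las Brisas', 'San Cayetano'],
    'Latitude': [21.8734, 21.9644, 21.9644, 21.9644, 21.9644],
    'Longitude': [-102.2806, -102.3192, -102.3192, -102.3192, -102.3192],
})

# Collapse to one row per postal code: join the place names and keep
# the coordinates of the first row in each group (they are shared).
zones = (
    sample.groupby('PostalCode', as_index=False)
          .agg(PlaceNames=('PlaceName', ', '.join),
               Latitude=('Latitude', 'first'),
               Longitude=('Longitude', 'first'))
)
print(zones)
```

Querying venues once per postal code instead of once per place name avoids sending identical coordinates to the venue API several times.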


#### Popular Venues Identification
A Foursquare developer account was created to access the endpoints that return the popular venues within a given radius of a location.

_FYI, a 500-meter radius is used in this project unless noted otherwise._

```python
# Define Foursquare Credentials and Version
CLIENT_ID = '**************************************'
CLIENT_SECRET = '**************************************'
VERSION = '********'
LIMIT = 10
```
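As a rough sketch of how these settings could be turned into a request, the helper below builds a URL for the legacy Foursquare v2 `venues/explore` endpoint. The choice of that endpoint is an assumption, since the text does not name the one used. `build_explore_url` is a hypothetical helper, and the credential and version values are placeholders. No request is sent.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=10):
    """Build a Foursquare v2 venues/explore URL (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,            # API version date, formatted YYYYMMDD
        'll': f'{lat},{lng}',    # latitude,longitude of the search center
        'radius': radius,        # search radius in meters
        'limit': limit,          # maximum number of venues to return
    }
    # safe=',' keeps the comma inside `ll` unescaped for readability.
    query = urlencode(params, safe=',')
    return 'https://api.foursquare.com/v2/venues/explore?' + query

# Placeholder credentials and version; coordinates are from the first
# row of the postal code data.
url = build_explore_url('MY_ID', 'MY_SECRET', '20190722', 21.8734, -102.2806)
print(url)
```

The defaults mirror the 500-meter radius and `LIMIT = 10` stated above, so only the coordinates change from one neighborhood to the next.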

#### Data Science Tools & Algorithms
A Python notebook was created in the Skills Network Labs framework for this project; see the [References](#Refer) section for more details.

The K-Means clustering method will be used to group the neighborhoods, based on the 10 most common venues within a 500-meter radius of each identified zone/neighborhood. After all the venues are identified, the data for each cluster will be grouped by venue category to show the most frequent categories.
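The clustering step can be sketched as follows, assuming scikit-learn's `KMeans` and a one-hot encoding of venue categories averaged per neighborhood. The encoding is a common preparation for this kind of data, not something the text confirms. The `venues` frame is hypothetical toy data with only 4 neighborhoods, so it uses 2 clusters where the project uses 7.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue list (hypothetical): one row per venue found in a neighborhood.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D'],
    'Category': ['Taco Place', 'Taco Place', 'Bar', 'Taco Place', 'Bar',
                 'Pharmacy', 'Hospital', 'Pharmacy', 'Hospital', 'Pharmacy'],
})

# One-hot encode the categories, then average per neighborhood so each
# row holds the relative frequency of every category in that neighborhood.
onehot = pd.get_dummies(venues['Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
profile = onehot.groupby('Neighborhood').mean()

# Fit K-Means on the frequency profiles. The project uses 7 clusters;
# this toy set is too small for that, so 2 are used here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profile)
profile['Cluster'] = kmeans.labels_
print(profile)
```

In this toy set, the neighborhoods dominated by food and drink venues end up in one cluster and those dominated by health-related venues in the other.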

_FYI, the algorithm creates a `10 x n` matrix (the number of most common venues times the number of neighborhoods) for each cluster._
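One way to picture that matrix is a ranking table holding the top categories of each neighborhood. The sketch below uses hypothetical frequencies and `TOP_N = 2` instead of 10 to stay small. `most_common` and `ranked` are illustrative names, and the table is laid out with one row per neighborhood, that is, the transpose of `10 x n`.

```python
import pandas as pd

# Toy frequency table (hypothetical values): rows are neighborhoods,
# columns are venue categories, cells are relative frequencies.
profile = pd.DataFrame(
    {'Bar': [0.2, 0.6], 'Pharmacy': [0.1, 0.0], 'Taco Place': [0.7, 0.4]},
    index=['A', 'B'],
)

TOP_N = 2  # the project uses 10

def most_common(row, n):
    """Return the n category names with the highest frequency in a row."""
    return row.sort_values(ascending=False).index[:n].tolist()

# Build one row per neighborhood with its TOP_N categories in rank order.
ranked = pd.DataFrame(
    [most_common(profile.loc[name], TOP_N) for name in profile.index],
    index=profile.index,
    columns=[f'Rank {i + 1}' for i in range(TOP_N)],
)
print(ranked)
```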

### <a name="Refer"></a>References

- [Notebook image](https://www.garybarker.co.uk)
- [Volatile economies article](https://qz.com/1550062/the-most-and-least-volatile-economies-of-the-21st-century/)
- [Geolocation Mexico Postal Codes](http://download.geonames.org/export/zip/)
- [Foursquare endpoints](https://developer.foursquare.com/docs/api/endpoints)
- [Data Science framework](https://labs.cognitiveclass.ai)
- [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering)
