# Spending Public Resources Wisely

## Data Overview

### Coordinates
To complete this study, we start from a KMZ file containing all 96 districts of São Paulo and their coordinates. This file can be found [here](http://geosampa.prefeitura.sp.gov.br/PaginasPublicas/_SBC.aspx#). We'll use an online service to convert the KMZ file into a CSV file; coding a converter in Python is outside the scope of this analysis, so I recommend [this](https://mygeodata.cloud/converter/kmz-to-csvOs) service. After the conversion, we read the CSV into a raw dataframe, whose first rows should look as follows:

| |         X|         Y|gid| Name|description|        Field_2|Field_1|
|-|----------:|----------:|---:|-----:|-----------:|---------------:|-------:|
|**0**|-46.641307|-23.458062|  1|kml_1|        NaN|       MANDAQUI|     51|
|**1**|-46.709869|-23.937325|  2|kml_2|        NaN|       MARSILAC|     52|
|**2**|-46.661700|-23.594773|  3|kml_3|        NaN|          MOEMA|     32|
|**3**|-46.465239|-23.574191|  4|kml_4|        NaN|PARQUE DO CARMO|     57|
|**4**|-46.680754|-23.537946|  5|kml_5|        NaN|       PERDIZES|     60|

__X__ holds the longitude, __Y__ the latitude and __Field_2__ the district name.
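
The cleaning step implied here could look roughly like the sketch below. It is only an illustration: the sample rows are copied from the table above instead of being read from the converted file, and the new column names (`district`, `latitude`, `longitude`) are assumptions, not names used elsewhere in this study.

```python
import io

import pandas as pd

# First rows of the converted CSV, copied from the table above.
# In practice this would be pd.read_csv() on the converted file.
raw_csv = io.StringIO(
    "X,Y,gid,Name,description,Field_2,Field_1\n"
    "-46.641307,-23.458062,1,kml_1,,MANDAQUI,51\n"
    "-46.709869,-23.937325,2,kml_2,,MARSILAC,52\n"
    "-46.661700,-23.594773,3,kml_3,,MOEMA,32\n"
)
raw = pd.read_csv(raw_csv)

# Keep only the columns the analysis needs and give them descriptive names.
# The target names are illustrative choices (assumption).
districts = raw[["Field_2", "Y", "X"]].rename(
    columns={"Field_2": "district", "Y": "latitude", "X": "longitude"}
)

print(districts)
```

Renaming early keeps later steps readable, since __X__ and __Y__ are easy to swap by mistake when building latitude/longitude pairs.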

### Foursquare API - Explore

After cleaning, the dataframe of latitudes and longitudes can be passed to the Foursquare API to explore each district and retrieve its relevant venues. To do so, we will use personal credentials to call the API's __explore__ function. The response is a JSON document that contains, among other fields, the ID of each venue. Below is an example of what the __explore__ function returns, with the venue ID highlighted.

{
  "meta": {
    "code": 200,
    "requestId": "5ac51d7e6a607143d811cecb"
  },
  "response": {
    "venues": [
      {__"id": "5642aef9498e51025cf4a7a5"__,
        "name": "Mr. Purple",
        "location": {
          "address": "180 Orchard St",
          "crossStreet": "btwn Houston & Stanton St",
          "lat": 40.72173744277209,
          "lng": -73.98800687282996,
          "labeledLatLngs": [
            {
              "label": "display",
              "lat": 40.72173744277209,
              "lng": -73.98800687282996
            }
          ],
          "distance": 8,
          "postalCode": "10002",
          "cc": "US",
          "city": "New York",
          "state": "NY",
          "country": "United States",
          "formattedAddress": [
            "180 Orchard St (btwn Houston & Stanton St)",
            "New York, NY 10002",
            "United States"
          ]
        },
        "categories": [
          {
            "id": "4bf58dd8d48988d1d5941735",
            "name": "Hotel Bar",
            "pluralName": "Hotel Bars",
            "shortName": "Hotel Bar",
            "icon": {
              "prefix": "https://ss3.4sqi.net/img/categories_v2/travel/hotel_bar_",
              "suffix": ".png"
            },
            "primary": true
          }
        ],
        "venuePage": {
          "id": "150747252"
        }
      }
    ]
  }
}

We should be able to extract all venue IDs from the JSON response into a pandas dataframe.
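
As a minimal sketch of that extraction, assuming the response has the same shape as the example above (venues listed under `response.venues`): the helper name `venue_ids`, the output column names and the `"EXAMPLE"` district label are hypothetical, and the payload is a trimmed copy of the sample response.

```python
import pandas as pd

# Trimmed copy of the response shown above (one venue, most fields omitted).
response = {
    "meta": {"code": 200},
    "response": {
        "venues": [
            {
                "id": "5642aef9498e51025cf4a7a5",
                "name": "Mr. Purple",
                "location": {"lat": 40.72173744277209, "lng": -73.98800687282996},
            }
        ]
    },
}


def venue_ids(payload, district):
    """Return one row per venue: the district queried, the venue ID and its name."""
    venues = payload.get("response", {}).get("venues", [])
    rows = [
        {"district": district, "venue_id": v["id"], "venue_name": v["name"]}
        for v in venues
    ]
    # Passing columns explicitly keeps the schema stable even with no venues.
    return pd.DataFrame(rows, columns=["district", "venue_id", "venue_name"])


ids = venue_ids(response, district="EXAMPLE")
print(ids)
```

Calling the helper once per district and concatenating the results with `pd.concat` would give a single table of venue IDs for the whole city.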

### Foursquare API - Hours

The Foursquare API also offers an __hours__ function, which returns the most popular hours of a venue for each day of the week. It only needs the venue ID as a parameter. The response is again a JSON document, and the field we are looking for is __popular__, highlighted in the response below.

{
  "meta": {
    "code": 200,
    "requestId": "59a04e6fdd579714214cddc0"
  },
  "response": {
    "hours": {
      "timeframes": [
        {
          "days": [
            1,
            2,
            3,
            4,
            5
          ],
          "includesToday": true,
          "open": [
            {
              "start": "1600",
              "end": "+0100"
            }
          ],
          "segments": []
        },
        {
          "days": [
            6,
            7
          ],
          "open": [
            {
              "start": "1400",
              "end": "+0000"
            }
          ],
          "segments": []
        }
      ]
    },
    __"popular": {
      "timeframes": [
        {
          "days": [
            5
          ],
          "includesToday": true,
          "open": [
            {
              "start": "1700",
              "end": "+0000"
            }
          ],
          "segments": []
        },
        {
          "days": [
            6
          ],
          "open": [
            {
              "start": "1600",
              "end": "+0000"
            }
          ],
          "segments": []
        },
        {
          "days": [
            7
          ],
          "open": [
            {
              "start": "1500",
              "end": "2300"
            }
          ],
          "segments": []
        },
        {
          "days": [
            1,
            2,
            3
          ],
          "open": [
            {
              "start": "1800",
              "end": "2300"
            }
          ],
          "segments": []
        },
        {
          "days": [
            4
          ],
          "open": [
            {
              "start": "1800",
              "end": "+0000"
            }
          ],
          "segments": []
        }
      ]
    }__
  }
}
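
A request that yields a response like the one above could be assembled roughly as follows. This is a sketch under assumptions: it follows the Foursquare v2 URL pattern `venues/VENUE_ID/hours`, the helper name `hours_url` is hypothetical, the credentials are placeholders, and the version date passed as `v` is only an example value.

```python
from urllib.parse import urlencode


def hours_url(venue_id, client_id, client_secret, version="20180323"):
    """Build the URL for the hours request of a single venue.

    The version date is an example value (assumption); the credentials
    must be replaced with personal ones.
    """
    query = urlencode(
        {"client_id": client_id, "client_secret": client_secret, "v": version}
    )
    return f"https://api.foursquare.com/v2/venues/{venue_id}/hours?{query}"


url = hours_url(
    "5642aef9498e51025cf4a7a5",
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
)
print(url)
```

The resulting URL can then be fetched with any HTTP client and the body parsed as JSON.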

We will use a few pandas methods to turn __popular__ into a workable dataframe like the one below.

||days|start|end  |
|-|----:|-----:|-----:|
|**0**|1|1800|2300|
|**1**|2|1800|2300|
|**2**|3|1800|2300|
|**3**|4|1800|+0000|
|**4**|5|1700|+0000|
|**5**|6|1600|+0000|
|**6**|7|1500|2300|

In this context, __days__ is the day of the week (1 = Monday, 7 = Sunday), while __start__ and __end__ are times in HHMM format. A leading "+" in __end__ (for example "+0000") means the period ends on the following day.
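
One possible way to flatten __popular__ into that table is sketched below. The helper name `popular_to_frame` is hypothetical, and the input is the __popular__ field from the sample response with the empty `segments` lists left out.

```python
import pandas as pd

# The "popular" field from the sample hours response (segments omitted).
popular = {
    "timeframes": [
        {"days": [5], "open": [{"start": "1700", "end": "+0000"}]},
        {"days": [6], "open": [{"start": "1600", "end": "+0000"}]},
        {"days": [7], "open": [{"start": "1500", "end": "2300"}]},
        {"days": [1, 2, 3], "open": [{"start": "1800", "end": "2300"}]},
        {"days": [4], "open": [{"start": "1800", "end": "+0000"}]},
    ]
}


def popular_to_frame(popular):
    """Expand each timeframe into one row per (day, open slot) pair."""
    rows = [
        {"days": day, "start": slot["start"], "end": slot["end"]}
        for frame in popular.get("timeframes", [])
        for day in frame["days"]
        for slot in frame["open"]
    ]
    return (
        pd.DataFrame(rows, columns=["days", "start", "end"])
        .sort_values("days", kind="stable")  # order rows Monday to Sunday
        .reset_index(drop=True)
    )


hours = popular_to_frame(popular)
print(hours)
```

The times are kept as strings on purpose, so that the "+" prefix is not lost before it has been interpreted.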

Finally, we have the basic information needed to start the exploratory data analysis and then use the K-Nearest Neighbors algorithm to classify the districts of São Paulo.