In [1]:
import numpy as np
import pandas as pd
import os
#os.rename('../datasets/fsq/umn_foursquare_datasets/checkins.dat', '../datasets/fsq/umn_foursquare_datasets/checkins.csv')

## Task 1

Load the data on visits to establishments. It contains users' check-ins at establishments
and the geolocation of those establishments. Remove records with missing values. \
Output the number of records after cleaning.
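
The cleaning step can be sketched on toy data. The rows and column values below are hypothetical stand-ins for the padded, pipe-delimited check-in file; the regular expression is one possible way to catch blank cells of any width, not necessarily how the real file must be handled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic blank, space-padded cells in the raw file
raw = pd.DataFrame({
    "venue_id": [1, 2, 3, 4],
    "latitude": ["40.7", "   ", "34.0", ""],
    "longitude": ["-74.0", "   ", "-118.2", "-80.1"],
})

# Treat any empty or whitespace-only cell as missing, regardless of its width
cleaned = raw.replace(r"^\s*$", np.nan, regex=True).dropna()

# Convert the coordinate columns from strings to floats
cleaned = cleaned.astype({"latitude": float, "longitude": float})

print(len(cleaned))  # number of records left after cleaning
```

Matching on a pattern rather than on a fixed-width string of spaces keeps the cleaning step independent of how wide each column happens to be padded.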

In [2]:
data = pd.read_csv('../datasets/fsq/umn_foursquare_datasets/checkins.csv', sep = "|", low_memory = False)

data.columns = ['id', 'user_id', 'venue_id', 'latitude', 'longitude', 'created_at']
data.replace("                   ", np.nan, inplace=True)
data = data.dropna()
data.head()
print(len(data))

396634


## Task 2

This data covers establishments from all over the world. Using the geolocations and the [Reverse Geocoder](https://github.com/thampiman/reverse-geocoder) library,
determine the country for each geolocation. \
Find the **name** of the country with the second-largest number of records.
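
Reverse Geocoder returns two-letter country codes in its `cc` field, so the name has to be looked up separately. A minimal sketch, assuming a small hand-written lookup table (the `names` dictionary and the toy `cc` values are hypothetical):

```python
import pandas as pd

# Hypothetical country codes, shaped like the "cc" column built from the geocoder results
cc = pd.Series(["US", "US", "US", "ID", "ID", "BR"])

# Hypothetical lookup from ISO 3166-1 alpha-2 codes to country names
names = {"US": "United States", "ID": "Indonesia", "BR": "Brazil"}

counts = cc.value_counts()       # sorted by count, largest first
second_code = counts.index[1]    # code of the second most frequent country
second_name = names[second_code]

print(second_name)
```

Taking the code from `value_counts()` and mapping it to a name avoids hardcoding the answer in the output line.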

In [3]:
import reverse_geocoder as rg
pd.options.mode.chained_assignment = None

temp = rg.search(list(data[["latitude", "longitude"]].itertuples(False, None)))

temp_df = pd.DataFrame(temp, index = data.index)

new_data = data.join((temp_df)["cc"])

new_data["cc"].value_counts().nlargest(2)

print("Second country by number of records: Indonesia")

Loading formatted geocoded file...
Second country by number of records: Indonesia


## Task 3

We are only interested in locations in the United States. Remove locations in other countries.
Also, to reduce the number of geolocations, keep only the 50 most frequent establishments in the sample. \
Output the number of locations remaining after this filtering.

In [4]:
new_data = new_data[new_data["cc"] == "US"]
top_venue = new_data["venue_id"].value_counts().nlargest(50)

new_data = new_data.loc[new_data['venue_id'].isin(top_venue.index)]
len(new_data)

162099

## Task 4

Let's move on to the clustering problem. Use the [Mean Shift](https://scikit-learn.org/stable/modules/clustering.html#mean-shift) algorithm
to cluster the locations. Create the estimator as `MeanShift(bandwidth=0.1, bin_seeding=True)`.

    `bandwidth=0.1` is the width of the clustering kernel, in degrees. At the middle latitudes of the United States this is about 5-10 km.
    `bin_seeding=True` speeds up the algorithm.

Output the number of clusters produced by the clustering.
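
To get a feel for what a bandwidth of 0.1 degrees means on the ground, the conversion can be sketched as follows. This is a rough estimate assuming a spherical Earth with a mean radius of 6371 km; the helper `degrees_to_km` is a hypothetical name introduced only for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation (assumption)

def degrees_to_km(delta_deg, latitude_deg):
    """Return (north-south km, east-west km) spanned by delta_deg at a given latitude."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = delta_deg * km_per_degree
    # Meridians converge toward the poles, so a degree of longitude shrinks by cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns_km, ew_km = degrees_to_km(0.1, 40.0)
print(f"{ns_km:.1f} km north-south, {ew_km:.1f} km east-west")
```

At 40°N this gives roughly 11 km north-south and 8.5 km east-west, so the kernel is not perfectly circular on the ground when raw latitude and longitude are clustered directly.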

In [5]:
from sklearn.cluster import MeanShift
new_data = new_data[["latitude", "longitude"]]
clustering = MeanShift(bandwidth=0.1, bin_seeding=True).fit(new_data)
len(pd.unique(clustering.labels_))

2846

## Task 5

The centers of the resulting clusters are potential locations for company banners. Now we want to find the cluster centers
that are closest to the company's sales offices. \
Load the [data on company office coordinates](datasets/offices.csv). For each office, find the 5 cluster centers closest to it.
The company has 11 offices, so we should get 55 banner locations. \
Print the coordinates of the banner that is closest to a company office.
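
The nearest-center search can be sketched on toy data. The coordinates below are hypothetical, `k` is set to 2 instead of 5 to keep the example small, and distances are plain Euclidean distances in degrees computed with NumPy broadcasting (a simplification, not a geodesic distance).

```python
import numpy as np
import pandas as pd

# Hypothetical office coordinates and cluster centers (latitude, longitude)
offices = np.array([[40.0, -74.0], [34.0, -118.0]])
centers = np.array([
    [40.1, -74.1], [40.5, -74.5], [39.0, -75.0],
    [34.2, -118.2], [33.0, -117.0], [36.0, -120.0],
])

# Pairwise Euclidean distances in degrees, shape (n_offices, n_centers)
diff = offices[:, None, :] - centers[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

k = 2  # the task uses 5
nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest centers per office

# One row per (office, banner) pair
banners = pd.DataFrame({
    "index_office": np.repeat(np.arange(len(offices)), k),
    "latitude": centers[nearest.ravel(), 0],
    "longitude": centers[nearest.ravel(), 1],
    "distance": np.take_along_axis(dist, nearest, axis=1).ravel(),
})

# The single banner that is closest to any office
closest = banners.loc[banners["distance"].idxmin()]
print(closest[["latitude", "longitude"]].tolist())
```

A frame with one row per office and banner pair holds all candidate locations at once, so it can be used both to pick the overall closest banner and to plot every banner colored by office.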

In [6]:
from scipy.spatial.distance import cdist
offices_df = pd.read_csv('../datasets/offices.csv')
temp1 = list(offices_df[["latitude", "longitude"]].itertuples(False, None))
cluster_centers = clustering.cluster_centers_
banner_places_dist = []
best_places_coord_top5 = []
best_banners = []

# Euclidean distance (in degrees) from every office to every cluster center
dist = cdist(np.asarray(temp1), cluster_centers)

for k in range(len(temp1)):
    # Indices of the 5 cluster centers nearest to office k, closest first
    nearest = np.argsort(dist[k])[:5]
    banner_places_dist.append(dist[k][nearest].tolist())
    for m in nearest:
        best_places_coord_top5.append([k, dist[k][m], cluster_centers[m]])
    # The first index is the single closest banner for this office
    best_banners.append(cluster_centers[nearest[0]])

# [office index (0-10), distance from office to banner, [coordinates of one of the 5 nearest banners]]
best_places_coord_top5

# 11 coordinate pairs: the banner closest to each office
df = pd.DataFrame(best_banners, columns = ['latitude','longitude'])
df["index_office"] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
df

Unnamed: 0,latitude,longitude,index_office
0,37.688043,-122.409142,0
1,34.187689,-118.448805,1
2,33.809806,-118.144971,2
3,32.715963,-117.158197,3
4,29.301348,-94.797696,4
5,30.694357,-88.043054,5
6,27.949461,-82.464971,6
7,25.786986,-80.218559,7
8,30.332432,-81.654927,8
9,32.785318,-79.924742,9


## Task 6

Using the [scatter_mapbox](https://plotly.github.io/plotly.py-docs/generated/plotly.express.scatter_mapbox.html) function,
mark the points where the banners will be installed. You should get an image like the one below.
Use the color of each point to indicate which office the banner belongs to.

<center><img src="../misc/images/task_6.png" width="800" height="800"/></center>

In [18]:
import plotly.express as px
df["index_office"] = df["index_office"].astype(str)
fig = px.scatter_mapbox(df, lat="latitude", lon="longitude", color="index_office", zoom=3, mapbox_style='open-street-map')
fig.show()