## Data Interpretation

In this Jupyter notebook, we interpret and visualize the data collected in the previous notebook. First, we import all the packages needed in the following steps. Then, we load the data and provide descriptive statistics. We determine the location of the user and, based on that, select the 5 places nearest to the user's location. Finally, we visualize the data on an interactive Leaflet map.

### Import of packages

In this section, we import all packages that are necessary for *data interpretation* as well as for *visualization*. First, we import pandas, geopandas, json, numpy and matplotlib.pyplot in order to process the data. Then, we import Nominatim from geopy.geocoders in order to determine the location of the user. To perform geodetic computations, we import Geod from pyproj. Finally, we import folium and its MarkerCluster plugin to visualize the data.

In [57]:
import pandas as pd
import geopandas as gpd
import json
import numpy as np
import matplotlib.pyplot as plt

from geopy.geocoders import Nominatim

from pyproj import Geod

#If folium is not installed, we run the following line in a terminal: "python -m pip install folium"
import folium
from folium.plugins import MarkerCluster

### Data loading & descriptive statistics

The JSON data first have to be downloaded from the API. They are then merged and saved to a CSV file, which we load here.

In [58]:
swim_data = pd.read_csv('raw_data.csv')
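
The `Unnamed: 0` column that shows up in this dataset is typically what pandas produces when a CSV file was written together with its index and then read back without `index_col`. A minimal sketch on made-up in-memory data (the rows and the `csv_text` variable are hypothetical and are not taken from `raw_data.csv`):

```python
import io

import pandas as pd

# Hypothetical data standing in for the real file.
original = pd.DataFrame({"Name": ["A", "B"], "Average rating": [4.5, 2.0]})
csv_text = original.to_csv()  # the index is written as an unnamed first column

# Reading without index_col turns the old index into a column called 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

Whether `index_col=0` is appropriate for `raw_data.csv` depends on how that file was written, so this is only a possible simplification.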

We provide descriptive statistics for the numeric columns only. Since the columns *ID* and *id_a* (which are identical) are not useful for describing the data, we remove both of them, together with the leftover index column *Unnamed: 0*.

In [59]:
#We drop the identifier columns; the columns= keyword replaces the positional axis argument, which newer pandas versions no longer accept.
swim_desc = swim_data.drop(columns=['id_a', 'ID', 'Unnamed: 0'])
desc = swim_desc.describe(include = [np.number])

#We round the values in the columns Average rating and Number of ratings to whole numbers and display the results.
rounded_cols = ['Average rating', 'Number of ratings']
desc[rounded_cols] = desc[rounded_cols].round()

desc

| | Average rating | Number of ratings | Longitude | Latitude |
|---|---|---|---|---|
| count | 18.0 | 20.0 | 20.0 | 20.0 |
| mean | 4.0 | 4.0 | 49.11423 | 14.60351 |
| std | 1.0 | 3.0 | 2.100714 | 3.052217 |
| min | 2.0 | 0.0 | 41.76032 | 2.983775 |
| 25% | 4.0 | 2.0 | 49.065163 | 14.315799 |
| 50% | 4.0 | 4.0 | 50.001197 | 14.636598 |
| 75% | 5.0 | 6.0 | 50.152053 | 15.82921 |
| max | 5.0 | 12.0 | 50.279876 | 19.445608 |
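
The count for Average rating is lower than for the other columns because `describe` skips missing values. A small sketch on made-up data (the `toy` frame and its values are hypothetical) that shows this, and also how `DataFrame.round` accepts a dictionary when different columns should keep a different number of decimals:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the notebook, the values are made up.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Average rating": [4.8, np.nan, 2.0, 4.5],
    "Number of ratings": [6, 0, 1, 4],
})

toy_desc = toy.drop(columns=["ID"]).describe(include=[np.number])

# A dict passed to round() sets the number of decimals per column.
toy_desc = toy_desc.round({"Average rating": 1, "Number of ratings": 0})

rating_count = toy_desc.loc["count", "Average rating"]      # NaN is not counted
reviews_count = toy_desc.loc["count", "Number of ratings"]  # no missing values here

print(toy_desc)
```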


To prepare the data for the next part, we select only the columns Name, Latitude, Longitude, Average rating, Entrance and Nudist beach for each location. We display the first rows of the data to make sure that they are in the desired format.

In [60]:
swim_loc = swim_data[['Name','Latitude','Longitude', 'Average rating', 'Entrance', 'Nudist beach']]

swim_loc.head()

| | Name | Latitude | Longitude | Average rating | Entrance | Nudist beach |
|---|---|---|---|---|---|---|
| 0 | Costa Brava | 2.983775 | 41.76032 | | No entrance fee | Not suitable for nudists |
| 1 | Nádrž Dolní Žleb | 17.307842 | 49.751182 | 2.0 | No entrance fee | Not suitable for nudists |
| 2 | Kralupy nad Vltavou | 14.307153 | 50.256161 | 5.0 | Entrance fee | Not suitable for nudists |
| 3 | Koupaliště Flošna | 15.841977 | 50.205154 | 4.8 | Entrance fee | Suitable for nudists |
| 4 | Městské lázně | 15.828759 | 50.214217 | 4.5 | Entrance fee | Not suitable for nudists |


To find the coordinates corresponding to an address, we use Nominatim imported from geopy.geocoders. Nominatim uses OpenStreetMap data to find locations on Earth by name or address (geocoding) and return their coordinates. It can also do the reverse, that is, find an address from coordinates.

Any user of this application can insert their location (a city is enough) into the brackets, and the application will determine the full address as well as the coordinates of that place. We display this address together with the latitude and longitude of the place.

In [61]:
geolocator = Nominatim(user_agent='myapplication')

# Insert city of your location:
location = geolocator.geocode("Litomyšl")

print("Your location is ", location.address, ".")
print("Latitude = {}, Longitude = {}".format(location.latitude, location.longitude))

Your location is  Litomyšl, okres Svitavy, Pardubický kraj, Severovýchod, 570 01, Česká republika .
Latitude = 49.8725491, Longitude = 16.3101243
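
The cell above assumes that the geocoder finds a match. geopy's `geocode` returns `None` when nothing is found, in which case `location.address` would raise an `AttributeError`. A minimal sketch of a guard follows; `locate`, `FakeLocation` and `fake_geocode` are hypothetical names used so that the example runs without network access:

```python
from collections import namedtuple

# Hypothetical stand-in for the object returned by geolocator.geocode();
# only the three attributes used in the notebook are modelled.
FakeLocation = namedtuple("FakeLocation", ["address", "latitude", "longitude"])


def locate(geocode, query):
    """Return (latitude, longitude) for query, or raise if nothing was found."""
    result = geocode(query)
    if result is None:  # no match for the query
        raise ValueError(f"No location found for {query!r}")
    return result.latitude, result.longitude


def fake_geocode(query):
    """Offline replacement for geolocator.geocode, knowing a single city."""
    known = {"Litomyšl": FakeLocation("Litomyšl, Česká republika", 49.8725491, 16.3101243)}
    return known.get(query)


print(locate(fake_geocode, "Litomyšl"))
```

In the notebook, the same function could be called as `locate(geolocator.geocode, "Litomyšl")`.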


Once we know the location of the user, the application can measure the distance between that location and all the places in the dataset. We decided to use the World Geodetic System (WGS84) ellipsoid to measure the distances, as the estimated accuracy of this approach is very high.

In [62]:
#We use World Geodetic System ellipsoid to measure the distance.
g = Geod(ellps='WGS84')

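#We declare a function measuring the distance between pairs of latitude-longitude points.
#Geod.inv expects longitudes before latitudes and returns the forward azimuth, the back azimuth and the distance in metres.
#The arguments name and rating are not used in the computation.
def Distance(name, lat1, lon1, lat2, lon2, rating):
    az12,az21,dist = g.inv(lon1, lat1, lon2, lat2)
    return dist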

#As lat1 and lon1, we will use latitude and longitude of user's location.
name = swim_loc['Name']
lat1 = location.latitude
lon1 = location.longitude
lat2 = swim_loc['Latitude']
lon2 = swim_loc['Longitude']
rating = swim_loc['Average rating']

#We define new dataframe including values of columns from swim_loc.
df = pd.DataFrame({'Name':name,'Average rating':rating,'Latitude_1':lat1,'Longitude_1':lon1,'Latitude_2':lat2,'Longitude_2':lon2})

#We add a column with distance in metres using distance function.
df['Distance'] = Distance(df['Name'].tolist(),df['Latitude_1'].tolist(),df['Longitude_1'].tolist(),df['Latitude_2'].tolist(),df['Longitude_2'].tolist(),df['Average rating'].tolist())

#Then we convert column with distance to kilometers and display first rows of new dataframe.
df['Distance'] = (1/1000)*df['Distance']

df.head()

| | Name | Average rating | Latitude_1 | Longitude_1 | Latitude_2 | Longitude_2 | Distance |
|---|---|---|---|---|---|---|---|
| 0 | Costa Brava | | 49.872549 | 16.310124 | 2.983775 | 41.76032 | 5725.521722 |
| 1 | Nádrž Dolní Žleb | 2.0 | 49.872549 | 16.310124 | 17.307842 | 49.751182 | 4688.608866 |
| 2 | Kralupy nad Vltavou | 5.0 | 49.872549 | 16.310124 | 14.307153 | 50.256161 | 5000.285866 |
| 3 | Koupaliště Flošna | 4.8 | 49.872549 | 16.310124 | 15.841977 | 50.205154 | 4852.033516 |
| 4 | Městské lázně | 4.5 | 49.872549 | 16.310124 | 15.828759 | 50.214217 | 4853.787915 |
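
The distances above are in the thousands of kilometres even for Czech places, and the Costa Brava row stores 2.98 under Latitude and 41.76 under Longitude. This suggests that the two coordinate columns may be swapped in the dataset. A rough sanity check follows, assuming a spherical Earth (haversine formula) rather than the WGS84 ellipsoid used in the notebook; `haversine_km` is a hypothetical helper and not part of pyproj:

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km (an approximation)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


# User's location (Litomyšl), as returned by the geocoder.
user_lat, user_lon = 49.8725491, 16.3101243

# Plavecký bazén Ústí nad Orlicí, with the values exactly as stored in the dataset.
stored_latitude, stored_longitude = 16.402715, 49.975747

as_labelled = haversine_km(user_lat, user_lon, stored_latitude, stored_longitude)
swapped = haversine_km(user_lat, user_lon, stored_longitude, stored_latitude)

print(f"columns as labelled: {as_labelled:.0f} km")
print(f"columns swapped:     {swapped:.0f} km")
```

If the swapped order gives the plausible result, the `Latitude` and `Longitude` columns would need to be exchanged before calling `Distance` and before plotting the points on a map.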


We select the 5 places geographically nearest to the user's location and sort them by average rating. A "NaN" value in the Average rating column means that there are no ratings for that place, so we replace it with zero. When a place has no rating, the user has no feedback about it. Therefore, we prefer places with at least one rating and put places with no ratings at the bottom of our table.

As a result, we have a table of the 5 places with the shortest distance from the user's current location. Moreover, these places are sorted by average rating, with the highest-rated place at the top and the lowest-rated place at the bottom. In this way, the user of this application can easily choose a location for swimming.

In [63]:
#We define a new dataframe with the 5 geographically nearest places to the user's location, sorted by average rating.
df_map = df.nsmallest(5, columns=['Distance']).sort_values(['Average rating'], ascending=[False])

#We remove the columns Latitude_1 and Longitude_1, as those coordinates are identical for all rows and we don't need them.
df_map.drop(['Latitude_1','Longitude_1'], axis=1, inplace=True)

#Then, we reset the row index for easier manipulation of the data in future steps.
df_map.reset_index(drop=True, inplace=True)

#We replace NaN values with 0.
df_map.loc[np.isnan(df_map["Average rating"]), 'Average rating'] = 0

#We round the values in the columns Average rating and Distance to whole numbers and display the result.
av_rat1 = df_map[['Average rating']]
df_map.loc[:, av_rat1.columns] = np.round(av_rat1)

dist = df_map[['Distance']]
df_map.loc[:, dist.columns] = np.round(dist)

df_map

| | Name | Average rating | Latitude_2 | Longitude_2 | Distance |
|---|---|---|---|---|---|
| 0 | Romantická pláž | 5.0 | 14.768348 | 44.922321 | 4676.0 |
| 1 | Sluneční pláž | 5.0 | 15.782788 | 50.033879 | 4848.0 |
| 2 | Termální koupaliště Bešeňová | 4.0 | 19.445608 | 49.100166 | 4452.0 |
| 3 | Plavecký bazén Ústí nad Orlicí | 4.0 | 16.402715 | 49.975747 | 4786.0 |
| 4 | Nádrž Dolní Žleb | 2.0 | 17.307842 | 49.751182 | 4689.0 |
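
The selection logic (take the n nearest places first, then order only those by rating, with a missing rating counted as zero) can be sketched without pandas. The function name and the sample places below are hypothetical:

```python
import heapq


def nearest_sorted_by_rating(places, n=5):
    """Pick the n closest places, then order them by rating (missing rating counts as 0)."""
    nearest = heapq.nsmallest(n, places, key=lambda p: p["distance_km"])
    return sorted(nearest, key=lambda p: p["rating"] or 0, reverse=True)


# Hypothetical places; names, distances and ratings are made up.
places = [
    {"name": "A", "distance_km": 12.0, "rating": 3.5},
    {"name": "B", "distance_km": 40.0, "rating": 5.0},
    {"name": "C", "distance_km": 8.0, "rating": None},
    {"name": "D", "distance_km": 25.0, "rating": 4.8},
]

top = nearest_sorted_by_rating(places, n=3)
print([p["name"] for p in top])
```

Place B has the best rating but is not among the three nearest, so it is left out. Distance decides which places are selected, and rating only decides their order.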


### Data visualization

In the final part, we analyze and visualize the collected and summarized data.

As the first visualization, we map all the points from the original dataset using the Folium module. Since most of the places are located in the Czech Republic, we use the latitude and longitude of the Czech Republic as the initial coordinates of the map.

In [71]:
#We need to remove diacritics from the 'Name' column since the map does not display those letters correctly.
swim_loc['Name'] = swim_loc['Name'].str.translate(str.maketrans('áíéěšŠčČřŘžŽýůúÚň', 'aieesScCrRzZyuuUn'))

#We create map object using Folium. Based on our preferences, we set tiles to cartodbpositron and zoom_start to 8.
map0 = folium.Map(
    location=[49.8037633, 15.4749126],
    tiles='cartodbpositron',
    zoom_start=8
    )

#Then we create a list of latitude and longitude coordinate pairs.
map_locations0 = swim_loc[['Latitude', 'Longitude']]
locationlist0 = map_locations0.values.tolist()

#We create a marker cluster that groups points that overlap.
marker_cluster = MarkerCluster().add_to(map0)

#We set parameters of markers so that all points are well visible and recognizable.
#We add the markers to the cluster (not directly to the map) so that overlapping points are grouped.
for point in range(0, len(locationlist0)):
    folium.Marker(locationlist0[point],
                  popup=swim_loc['Name'][point],
                  icon=folium.Icon(color='darkblue', icon='tint')
                 ).add_to(marker_cluster)
    
map0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
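
As an alternative to listing every accented letter by hand, diacritics can be removed generically with Python's standard `unicodedata` module. This is a sketch of a different technique from the one used in the cell above; `strip_diacritics` is a hypothetical helper name:

```python
import unicodedata


def strip_diacritics(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Nádrž Dolní Žleb"))
print(strip_diacritics("Koupaliště Flošna"))
```

With pandas, the helper could be applied to a whole column, for example as `swim_loc['Name'].map(strip_diacritics)`.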
  


Then, we visualize a map where locations have markers whose color is based on the average rating of that location. In other words, we differentiate the places by average rating. To do that, we use a dictionary that assigns a color to each rating (rounded to a whole number). The list of ratings and corresponding colors looks like this:

* 5 - darker green
* 4 - lighter green
* 3 - orange
* 2 - lighter red
* 1 - darker red
* x (NaN) - grey

In [72]:
#We prepare data with needed columns. We work on a copy so that swim_loc itself is not modified.
swim_rating = swim_loc[['Name','Latitude','Longitude', 'Average rating']].copy()

#We round the column with average rating to whole numbers.
#We cast it to object so that it can hold both numbers and the string "x".
swim_rating['Average rating'] = swim_rating['Average rating'].round().astype(object)

#We replace NaN values with "x".
swim_rating.loc[swim_rating['Average rating'].isna(), 'Average rating'] = 'x'

#We create map object using Folium. Based on our preferences, we set tiles to cartodbpositron and zoom_start to 8.
#Initial coordinates of map are set to be the coordinates of Czech Republic.
map_rating = folium.Map(
    location=[49.8038, 15.4749],
    tiles='cartodbpositron',
    zoom_start=8
    )

#We create a dictionary for the colors of markers.
colordict = {'x': 'grey',
             5: 'seagreen', 
             4: 'springgreen', 
             3: 'orange',
             2: 'red',
             1: 'firebrick'
            }

#We set CircleMarker parameters and we add these markers to our map.
for lat, lon, name, rating in zip(swim_rating['Latitude'], swim_rating['Longitude'], swim_rating['Name'], swim_rating['Average rating']):
    folium.CircleMarker(
        location = [lat, lon],
        popup = ('Name: ' + str(name).capitalize() + '<br>' 'Average rating: ' + str(rating)),
        color = 'blue',
        fill_color = colordict[rating],
        fill = True,
        fill_opacity = 0.5
        ).add_to(map_rating)

#We visualize new map.
map_rating
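
A direct dictionary lookup such as `colordict[rating]` raises a `KeyError` for any rating that is not in the dictionary (for example 0). A minimal sketch of a more forgiving lookup, assuming the same colors as above; `rating_color` and `RATING_COLORS` are hypothetical names:

```python
import math

RATING_COLORS = {5: "seagreen", 4: "springgreen", 3: "orange", 2: "red", 1: "firebrick"}


def rating_color(rating, default="grey"):
    """Return the marker color for a rating; missing or unexpected values get the default."""
    if rating is None or (isinstance(rating, float) and math.isnan(rating)):
        return default
    # round() rounds halves to the nearest even number, as np.round does.
    return RATING_COLORS.get(round(rating), default)


for value in (4.8, 4.5, 2.0, float("nan"), 0):
    print(value, rating_color(value))
```

With this helper, the NaN values would not have to be replaced with "x" before plotting.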

For another visualization, we use the table of geographically nearest places from the previous part. In this map, we use the user's coordinates as the initial coordinates. As there are only five locations displayed on the map (apart from the user's location), we set the initial zoom of the map to 10.

In [73]:
#We need to remove diacritics from the 'Name' column since the map does not display those letters correctly.
df_map['Name'] = df_map['Name'].str.translate(str.maketrans('áíéěšŠčČřŘžŽýůúÚň', 'aieesScCrRzZyuuUn'))


#We create map object using Folium. Based on our preferences, we set tiles to cartodbpositron and zoom_start to 10.
map1 = folium.Map(
    location=[location.latitude, location.longitude],
    tiles='cartodbpositron',
    zoom_start=10)

#Then we create a list of latitude and longitude coordinate pairs.
map_locations = df_map[['Latitude_2', 'Longitude_2']]
locationlist = map_locations.values.tolist()

#We set Marker parameters of places to swim and we add these markers to our map.
for point in range(0, len(locationlist)):
    folium.Marker(locationlist[point],                                  
                  popup = df_map['Name'][point], 
                  icon=folium.Icon(color='darkblue', icon='tint')
                 ).add_to(map1)

#We set Marker parameters of user's location and we add this marker to our map.
folium.Marker(
    location =[location.latitude, location.longitude],
    icon = folium.Icon(color = 'gray', icon = 'home')
    ).add_to(map1)

#We visualize new map.
map1

Another visualization is useful for people who are looking for a place to swim and are not willing to pay an entrance fee. Places without an entrance fee are marked as green circles, while places with an entrance fee of any amount are marked as red circles.

In [74]:
#First, we prepare data with needed columns.
swim_entr = swim_loc[['Name','Latitude','Longitude', 'Entrance']]

#We create map object using Folium. Based on our preferences, we set tiles to cartodbpositron and zoom_start to 8.
map_entr = folium.Map(
    location=[49.8038, 15.4749],
    tiles='cartodbpositron',
    zoom_start=8
    )

#Then, we create a dictionary for the colors of markers.
colordict = {'Entrance fee': 'red', 
             'No entrance fee': 'lightgreen'
            }

#We set CircleMarker parameters and we add these markers to our map.
for lat, lon, name, entrance in zip(swim_entr['Latitude'], swim_entr['Longitude'], swim_entr['Name'], swim_entr['Entrance']):
    folium.CircleMarker(
        location = [lat, lon],
        popup = ('Name: ' + str(name).capitalize() + '<br>' 'Entrance: ' + str(entrance)),
        color = 'blue',
        fill_color = colordict[entrance],
        fill = True,
        fill_opacity = 0.8
        ).add_to(map_entr)

#Finally, we visualize new map.
map_entr

Since there are people who enjoy swimming and catching the sun without their clothes, we prepared a map representing locations to swim that are suitable for nudists. Places suitable for nudists are marked as pink circles and places not suitable for nudists are marked as dark red circles.

In [75]:
#We prepare data with needed columns.
swim_nudist = swim_loc[['Name','Latitude','Longitude', 'Average rating', 'Nudist beach']]

#We create map object. We set tiles to cartodbpositron and zoom_start to 8.
map_nudist = folium.Map(
    location=[49.8038, 15.4749],
    tiles='cartodbpositron',
    zoom_start=8
    )

#Then, we create a dictionary for the colors of markers.
colordict = {'Suitable for nudists': 'pink', 
             'Not suitable for nudists': 'darkred'
            }

#We set CircleMarker parameters and we add these markers to our map.
for lat, lon, name, nudist in zip(swim_nudist['Latitude'], swim_nudist['Longitude'], swim_nudist['Name'], swim_nudist['Nudist beach']):
    folium.CircleMarker(
        location = [lat, lon],
        popup = ('Name: ' + str(name).capitalize() + '<br>' 'Nudist beach: ' + str(nudist)),
        color = 'blue',
        fill_color = colordict[nudist],
        fill = True,
        fill_opacity = 0.8
        ).add_to(map_nudist)

#Finally, we visualize the map.
map_nudist
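
The rating, entrance fee and nudist maps all repeat the same loop with a different column and color dictionary. One possible way to reduce the repetition is a helper that prepares the marker arguments for any categorical column. The function `circle_marker_specs` and the sample rows are hypothetical; the sketch builds only plain dictionaries, so it runs without folium:

```python
def circle_marker_specs(rows, category_key, colors, default="grey"):
    """Build keyword arguments for one circle marker per row."""
    specs = []
    for row in rows:
        category = row[category_key]
        specs.append({
            "location": [row["Latitude"], row["Longitude"]],
            "popup": f"Name: {row['Name']}<br>{category_key}: {category}",
            "fill_color": colors.get(category, default),  # unknown categories get the default
            "fill": True,
            "fill_opacity": 0.8,
        })
    return specs


# Hypothetical rows; the column names mirror the notebook, the values are made up.
rows = [
    {"Name": "Place A", "Latitude": 50.0, "Longitude": 15.0, "Entrance": "Entrance fee"},
    {"Name": "Place B", "Latitude": 49.5, "Longitude": 16.0, "Entrance": "No entrance fee"},
]
entrance_colors = {"Entrance fee": "red", "No entrance fee": "lightgreen"}

specs = circle_marker_specs(rows, "Entrance", entrance_colors)
print(specs[0]["fill_color"], specs[1]["fill_color"])
```

Each dictionary could then be passed on as `folium.CircleMarker(**spec).add_to(map_entr)`, and rows could be obtained from a dataframe with `swim_entr.to_dict('records')`.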