# Location entropy

Location entropy measures how widely a location is shared among users. The greater the entropy, the more the location is shared. For example, location entropy can be used to separate public places, such as universities and movie theaters, from private places, such as homes.

The formula for location entropy is:

$$E_l = -\sum_i p(i) \log p(i)$$

where $i$ ranges over the users who visit location $l$; the definition of $p(i)$ is given below.

Location entropy is calculated for each location in the system. This measure is based on the following parameters:

1. Number of users who visit the location $l$
2. Number of locations within a radius $r$ from $l$, which are also visited by the users who visit $l$

For a given location $l$, let us assume that there are three locations $\{l_1, l_2, l_3\}$ within the radius $r$. Also assume that user $u_1$ visits $l$ as well as $l_1, l_2$; user $u_2$ visits $l$ as well as $l_1, l_3$; and user $u_3$ visits $l$ as well as $l_1, l_2, l_3$.

$p(u_1)$ is the number of places visited by $u_1$ within the vicinity of $l$ (that is, within radius $r$), divided by the total number of places within that vicinity. Note that $u_1$ must visit $l$, and that $l$ itself is not counted.

$$p(u_1) = 2/3$$

since user $u_1$ has visited $l_1$ and $l_2$, which are in the vicinity of $l$, and there are 3 locations within the vicinity of $l$ (namely $l_1, l_2, l_3$).

Similarly,
$$p(u_2) = 2/3$$
$$p(u_3) = 3/3 = 1$$
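
The definition of $p(i)$ can be sketched with plain Python sets. This is only an illustration of the toy example above: the names `vicinity`, `visited` and `visit_fraction` are made up for this sketch, and it assumes that $u_3$ also visits $l$, since $u_3$ is included in the entropy calculation.

```python
# Locations within radius r of l (l itself is not counted).
vicinity = {"l1", "l2", "l3"}

# Places visited by each user in the toy example.
visited = {
    "u1": {"l", "l1", "l2"},
    "u2": {"l", "l1", "l3"},
    "u3": {"l", "l1", "l2", "l3"},  # assumption: u3 also visits l
}

def visit_fraction(user_places, vicinity):
    """Fraction of the vicinity of l that this user has visited."""
    return len(user_places & vicinity) / len(vicinity)

# Only users who visit l itself take part in the entropy of l.
p = {u: visit_fraction(places, vicinity)
     for u, places in visited.items() if "l" in places}
print(p)
```

The intersection with `vicinity` is what removes $l$ itself from the count.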

Since every visitor adds a non-negative term to the sum, a location visited by many users tends to have a higher entropy.

Then $E_l$ can be calculated as:

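$$E_l = -\tfrac{2}{3}\log\tfrac{2}{3} - \tfrac{2}{3}\log\tfrac{2}{3} - 1\log 1 \approx -0.667(-0.176) - 0.667(-0.176) - 1(0) \approx 0.235$$

Here $\log$ is taken to base 10. The code in the rest of this notebook uses the natural logarithm (`np.log`), which scales every entropy by the same constant factor and therefore does not change the ranking of locations.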
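
The arithmetic for the toy example can be checked with a few lines of Python. This is a minimal sketch; the helper name `entropy` is chosen for illustration only, and it evaluates the sum with both base-10 and natural logarithms.

```python
import math

# p(u1), p(u2), p(u3) from the toy example.
p = [2 / 3, 2 / 3, 1.0]

def entropy(probabilities, log=math.log):
    # A term with p == 1 contributes nothing because log(1) == 0.
    return -sum(q * log(q) for q in probabilities)

entropy_base10 = entropy(p, log=math.log10)  # roughly 0.235
entropy_natural = entropy(p)                 # roughly 0.541
print(entropy_base10, entropy_natural)
```

The two results differ only by the constant factor $\ln 10$, so the choice of base does not affect which location has the higher entropy.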



## Data

Let us create some artificial data. This data can be used to test our program to calculate location entropy. 

We will consider the following as the unique locations:
$A, B, C, D, E, F, G, H$

We will consider the following as the unique users:
$1,2,3,4,5,6,7,8,9,10$


We will create two data frames:

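**user_loc_df**: Has 2 columns, $user$ and $location$. This data frame holds the locations visited by the users.

**distances_df**: Has the distance between each pair of locations. The code below generates distances only for locations $A$ to $G$, so $H$ has visits but no distance rows and therefore never appears in the results.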


In [193]:
##Import necessary packages
from itertools import product
import numpy as np
import pandas as pd

In [234]:
#Create distances_df
distances_dict = {}
distances_list = []
unique_locations = ['A','B','C','D','E','F','G']
for x,y in product(unique_locations, unique_locations):
    if x==y:
        continue
    #Use one key per unordered pair, so that the distance from x to y equals the distance from y to x
    key = "".join(sorted([x,y]))
    if key not in distances_dict:
        distances_dict[key] = np.random.randint(low = 1, high = 8)
    distances_list.append([x,y,distances_dict[key]])

distances_df = pd.DataFrame(distances_list,columns=['location_1','location_2','distance'])

#Keep an unfiltered copy; it is used to test our entropy function later
distances_df_original = distances_df.copy()
distances_df

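| | location_1 | location_2 | distance |
|---|---|---|---|
| 0 | A | B | 3 |
| 1 | A | C | 7 |
| 2 | A | D | 1 |
| 3 | A | E | 4 |
| 4 | A | F | 7 |
| 5 | A | G | 4 |
| 6 | B | A | 3 |
| 7 | B | C | 3 |
| 8 | B | D | 5 |
| 9 | B | E | 4 |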
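
Because the distances above are drawn at random without a seed, each run produces a different table. A seeded variant might look like the sketch below. The function name `make_distances` and its `seed` parameter are assumptions introduced here, not part of the original notebook, and its output will not match the tables shown in this document.

```python
from itertools import combinations, permutations

import numpy as np
import pandas as pd

def make_distances(locations, seed=0, low=1, high=8):
    """Build a symmetric, reproducible distance table (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Draw one distance per unordered pair; `high` is exclusive.
    pair_distance = {
        frozenset(pair): int(rng.integers(low, high))
        for pair in combinations(locations, 2)
    }
    # Emit both directions of every pair with the same distance.
    rows = [(a, b, pair_distance[frozenset((a, b))])
            for a, b in permutations(locations, 2)]
    return pd.DataFrame(rows, columns=["location_1", "location_2", "distance"])

demo_distances = make_distances(list("ABCDEFG"))
print(demo_distances.head())
```

Calling `make_distances` twice with the same seed returns identical frames, which makes later entropy values repeatable.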


In [235]:
#create user_loc_df

A_visitors = list(product([1,2,3,4,5,6,7,8],['A']))
B_visitors = list(product([2,9,5,4,6],['B']))
C_visitors = list(product([1,2],['C']))
D_visitors = list(product([9],['D']))
E_visitors = list(product([5,6],['E']))
F_visitors = list(product([6,8],['F']))
G_visitors = list(product([1,10],['G']))
H_visitors = list(product([1,7,8],['H']))

visitors_data = A_visitors + \
      B_visitors + \
      C_visitors + \
      D_visitors + \
      E_visitors + \
      F_visitors + \
      G_visitors + \
      H_visitors

user_loc_df = pd.DataFrame(visitors_data, columns=["user","location"])
display(user_loc_df)         



| | user | location |
|---|---|---|
| 0 | 1 | A |
| 1 | 2 | A |
| 2 | 3 | A |
| 3 | 4 | A |
| 4 | 5 | A |
| 5 | 6 | A |
| 6 | 7 | A |
| 7 | 8 | A |
| 8 | 2 | B |
| 9 | 9 | B |


### Vicinity

Let us consider 4 distance units as the vicinity. So we will filter out the rows of distances_df that have a distance > 4.

In [236]:
#Consider radius as 4

distances_df = distances_df[distances_df["distance"] <= 4]
distances_df

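| | location_1 | location_2 | distance |
|---|---|---|---|
| 0 | A | B | 3 |
| 2 | A | D | 1 |
| 3 | A | E | 4 |
| 5 | A | G | 4 |
| 6 | B | A | 3 |
| 7 | B | C | 3 |
| 9 | B | E | 4 |
| 10 | B | F | 3 |
| 11 | B | G | 3 |
| 13 | C | B | 3 |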


### Finding the denominator 
The denominator of $p(i)$ is the number of locations within the vicinity of a given location.

To obtain this, we group distances_df by location_1 and count location_2.

In [237]:
##Find the number of locations within a vicinity of 4 units for a given location

locations_vicinity_counts = pd.DataFrame(distances_df["location_1"].value_counts().reset_index())
locations_vicinity_counts.columns = ["location","count"]
locations_vicinity_counts


| | location | count |
|---|---|---|
| 0 | E | 6 |
| 1 | G | 5 |
| 2 | B | 5 |
| 3 | A | 4 |
| 4 | D | 4 |
| 5 | F | 4 |
| 6 | C | 2 |


The above data frame shows how many locations lie within a radius of 4 units of each location.
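
The same count can also be written as an explicit group-by, which mirrors the description above more directly. The sketch below uses a small hypothetical frame (`near`, not the notebook's data) that contains one duplicated row, to show that counting distinct neighbours with `nunique` is not affected by duplicates.

```python
import pandas as pd

# Hypothetical filtered distance rows; the X-Z row appears twice on purpose.
near = pd.DataFrame(
    [("X", "Y", 2), ("X", "Z", 4), ("X", "Z", 4), ("Y", "X", 2), ("Z", "X", 4)],
    columns=["location_1", "location_2", "distance"],
)

# Group by location_1 and count the distinct values of location_2.
vicinity_size = (
    near.groupby("location_1")["location_2"]
    .nunique()
    .rename("count")
    .reset_index()
    .rename(columns={"location_1": "location"})
)
print(vicinity_size)
```

With clean input, where every ordered pair appears exactly once, this gives the same numbers as `value_counts`.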

### Find the users who visited a location
To find the users who visited a location, we could use user_loc_df alone. But for each location $l$, we also need, for each user who visited $l$, the other locations in the vicinity of $l$ that the user visited.

Hence we first join the data frames user_loc_df and distances_df on user_loc_df.location = distances_df.location_1. We will call the result user_visit_loc_1.

In [238]:
##Find the users who visited a given location
#We have to perform inner join between user_loc_df and distances_df using location = location_1
user_visit_loc_1 = pd.merge(user_loc_df.rename(columns={"location":"location_1"}), distances_df,on=["location_1"])
user_visit_loc_1

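| | user | location_1 | location_2 | distance |
|---|---|---|---|---|
| 0 | 1 | A | B | 3 |
| 1 | 1 | A | D | 1 |
| 2 | 1 | A | E | 4 |
| 3 | 1 | A | G | 4 |
| 4 | 2 | A | B | 3 |
| 5 | 2 | A | D | 1 |
| 6 | 2 | A | E | 4 |
| 7 | 2 | A | G | 4 |
| 8 | 3 | A | B | 3 |
| 9 | 3 | A | D | 1 |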


Each row of the above data frame pairs a user (user column) and a location they visited (location_1 column) with a location in its vicinity (location_2 column). If we inner join this data frame with user_loc_df on user_loc_df.user = user_visit_loc_1.user and user_loc_df.location = user_visit_loc_1.location_2, we keep only the rows where the user has visited both location_1 and location_2. (Note that location_2 is within the given radius of location_1.)

In [239]:
user_visit_loc_2 = pd.merge(user_loc_df.rename(columns={"location":"location_2"}), user_visit_loc_1,on=["user","location_2"])
user_visit_loc_2

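| | user | location_2 | location_1 | distance |
|---|---|---|---|---|
| 0 | 1 | A | G | 4 |
| 1 | 2 | A | B | 3 |
| 2 | 4 | A | B | 3 |
| 3 | 5 | A | B | 3 |
| 4 | 5 | A | E | 4 |
| 5 | 6 | A | B | 3 |
| 6 | 6 | A | E | 4 |
| 7 | 2 | B | A | 3 |
| 8 | 2 | B | C | 3 |
| 9 | 5 | B | A | 3 |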
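
The effect of the two inner joins is easier to see on a tiny example. The sketch below uses made-up data (users `u1`, `u2` and locations `A`, `B`, none of which come from the notebook's data frames) and shows one consequence of using inner joins: a user who visited a location but nothing in its vicinity drops out entirely.

```python
import pandas as pd

# Hypothetical visits: u1 visits A and B, u2 visits only A.
visits = pd.DataFrame({"user": ["u1", "u1", "u2"], "location": ["A", "B", "A"]})
# A and B are within the radius of each other.
near = pd.DataFrame({"location_1": ["A", "B"], "location_2": ["B", "A"], "distance": [3, 3]})

# Step 1: attach every nearby location_2 to each (user, location_1) visit.
step_1 = visits.rename(columns={"location": "location_1"}).merge(near, on="location_1")
# Step 2: keep only the rows where the same user also visited location_2.
step_2 = visits.rename(columns={"location": "location_2"}).merge(step_1, on=["user", "location_2"])

dropped_users = set(visits["user"]) - set(step_2["user"])
print(step_2)
print(dropped_users)
```

Here `u2` disappears after the second join. Such a user would have $p = 0$, and leaving them out of the sum is consistent with treating their term as 0.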


In the above data frame, each row holds a place (location_1) visited by a user and another place (location_2, in the vicinity of location_1) visited by the same user. We will ignore the distance column.

If we group by user and location_1 and count location_2, we get the number of locations the user visited within the vicinity of location_1 (note that the user has also visited location_1).

In [240]:
user_loc_visit_counts = user_visit_loc_2.drop(["distance"],axis=1).groupby(["user","location_1"]).count().reset_index()
user_loc_visit_counts.rename(columns={"location_1":"location","location_2":"count"},inplace=True)
user_loc_visit_counts

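| | user | location | count |
|---|---|---|---|
| 0 | 1 | A | 1 |
| 1 | 1 | G | 1 |
| 2 | 2 | A | 1 |
| 3 | 2 | B | 2 |
| 4 | 2 | C | 1 |
| 5 | 4 | A | 1 |
| 6 | 4 | B | 1 |
| 7 | 5 | A | 2 |
| 8 | 5 | B | 2 |
| 9 | 5 | E | 2 |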


Each row of the above data frame shows a user, a location they visited, and the number of other locations they visited within its vicinity.

In [241]:
##Get the inner join of user_loc_visit_counts and locations_vicinity_counts on 
#user_loc_visit_counts.location = locations_vicinity_counts.location
final_df = pd.merge(user_loc_visit_counts,locations_vicinity_counts,on=["location"])
final_df



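| | user | location | count_x | count_y |
|---|---|---|---|---|
| 0 | 1 | A | 1 | 4 |
| 1 | 2 | A | 1 | 4 |
| 2 | 4 | A | 1 | 4 |
| 3 | 5 | A | 2 | 4 |
| 4 | 6 | A | 2 | 4 |
| 5 | 1 | G | 1 | 5 |
| 6 | 2 | B | 2 | 5 |
| 7 | 4 | B | 1 | 5 |
| 8 | 5 | B | 2 | 5 |
| 9 | 6 | B | 3 | 5 |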


We divide count_x by count_y to get $p(i)$, and then compute $-p(i) \log p(i)$ for each row.

In [242]:
final_df["p"] = final_df["count_x"]/final_df["count_y"]
final_df["final_p"] = -1 * np.log(final_df["p"])*final_df["p"]
final_df

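| | user | location | count_x | count_y | p | final_p |
|---|---|---|---|---|---|---|
| 0 | 1 | A | 1 | 4 | 0.25 | 0.346574 |
| 1 | 2 | A | 1 | 4 | 0.25 | 0.346574 |
| 2 | 4 | A | 1 | 4 | 0.25 | 0.346574 |
| 3 | 5 | A | 2 | 4 | 0.5 | 0.346574 |
| 4 | 6 | A | 2 | 4 | 0.5 | 0.346574 |
| 5 | 1 | G | 1 | 5 | 0.2 | 0.321888 |
| 6 | 2 | B | 2 | 5 | 0.4 | 0.366516 |
| 7 | 4 | B | 1 | 5 | 0.2 | 0.321888 |
| 8 | 5 | B | 2 | 5 | 0.4 | 0.366516 |
| 9 | 6 | B | 3 | 5 | 0.6 | 0.306495 |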
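
The `final_p` column can be reproduced for the distinct values of `p` in the table above with a short NumPy check. This is only a verification sketch; the variable names are chosen for illustration.

```python
import numpy as np

# Distinct p values that appear in the table.
p = np.array([0.25, 0.5, 0.2, 0.4, 0.6])

# Per-user entropy term, using the natural logarithm as in the notebook.
terms = -p * np.log(p)
print(dict(zip(p.tolist(), terms.round(6).tolist())))

# The term -p*log(p) peaks at p = 1/e, so it is not monotonic in p.
peak = 1 / np.e
print(peak, -peak * np.log(peak))
```

Because the term is largest near $p \approx 0.37$, a user with $p = 0.6$ contributes less than a user with $p = 0.4$, and a user with $p = 1$ contributes nothing.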


Finally, sum these terms for each location to get its entropy.

In [243]:
#Get the sum of final_p group by location, to get the entropy of locations

final_df = final_df[["location","final_p"]].groupby(["location"]).sum()
final_df.reset_index()

| | location | final_p |
|---|---|---|
| 0 | A | 1.732868 |
| 1 | B | 1.361416 |
| 2 | C | 0.346574 |
| 3 | E | 0.712778 |
| 4 | F | 0.346574 |
| 5 | G | 0.321888 |


We can conclude that A is the most shared place, since it has the highest entropy. Note that the entropy values may change depending on the radius you choose.
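
Locations D and H do not appear in the result. H has no rows in distances_df, and a location is also dropped by the inner joins when none of its visitors visited another place within the radius (in the run shown, D's only visitor is user 9, whose other visit is to B at distance 5). The sketch below ranks the locations and fills in the missing ones, under the assumption, made here and not stated in the original, that such locations should get an entropy of 0.

```python
import pandas as pd

# Entropies copied from the radius-4 output above.
entropies = pd.Series(
    {"A": 1.732868, "B": 1.361416, "C": 0.346574,
     "E": 0.712778, "F": 0.346574, "G": 0.321888},
    name="final_p",
)
all_locations = list("ABCDEFGH")

# Locations that were dropped by the inner joins.
missing = [loc for loc in all_locations if loc not in entropies.index]

# Assumption: a location with no qualifying visitor gets entropy 0.
full = entropies.reindex(all_locations, fill_value=0.0).sort_values(ascending=False)
print(missing)
print(full)
```

Whether 0 is the right value for such locations depends on the application; leaving them out, as the notebook does, is an equally valid choice.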

### Putting all together
Let us create a function that accepts distances_df, user_loc_df and radius ($r$) as input and returns the entropies of the locations:



In [244]:
def location_entropy(distances_df, user_loc_df, r):
    #Consider only locations which are within the vicinity of r from a given location
    distances_df = distances_df[distances_df["distance"] <= r]
    
    ##Find the number of locations within a vicinity of r units for a given location
    locations_vicinity_counts = pd.DataFrame(distances_df["location_1"].value_counts().reset_index())
    locations_vicinity_counts.columns = ["location","count"]
    
    ##Find the users who visited a given location
    #We have to perform inner join between user_loc_df and distances_df using location = location_1
    user_visit_loc_1 = pd.merge(user_loc_df.rename(columns={"location":"location_1"}), distances_df,on=["location_1"])
    
    #Find location l, and user u (who visited l), find locations which are also visited by u, and within the vicinity of l
    user_visit_loc_2 = pd.merge(user_loc_df.rename(columns={"location":"location_2"}), user_visit_loc_1,on=["user","location_2"])

    #group by to get counts of locations visited by user u who visited location l
    user_loc_visit_counts = user_visit_loc_2.drop(["distance"],axis=1).groupby(["user","location_1"]).count().reset_index()
    user_loc_visit_counts.rename(columns={"location_1":"location","location_2":"count"},inplace=True)
    
    ##Get the inner join of user_loc_visit_counts and locations_vicinity_counts on 
    #user_loc_visit_counts.location = locations_vicinity_counts.location
    final_df = pd.merge(user_loc_visit_counts,locations_vicinity_counts,on=["location"])
    
    
    final_df["p"] = final_df["count_x"]/final_df["count_y"]
    final_df["final_p"] = -1 * np.log(final_df["p"])*final_df["p"]
    
    #Get the sum of final_p group by location, to get the entropy of locations
    return final_df[["location","final_p"]].groupby(["location"]).sum().reset_index()

In [245]:
#we are supplying radius as 8
location_entropy(distances_df_original, user_loc_df, r=8)

| | location | final_p |
|---|---|---|
| 0 | A | 2.049819 |
| 1 | B | 1.676235 |
| 2 | C | 0.733033 |
| 3 | D | 0.321888 |
| 4 | E | 0.673012 |
| 5 | F | 0.6452 |
| 6 | G | 0.366204 |


In [246]:
#we are supplying radius as 6
location_entropy(distances_df_original, user_loc_df, r=6)

| | location | final_p |
|---|---|---|
| 0 | A | 1.94863 |
| 1 | B | 1.255482 |
| 2 | C | 0.693147 |
| 3 | E | 0.562335 |
| 4 | F | 0.6452 |
| 5 | G | 0.346574 |
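
As a sanity check, the toy example from the introduction can be run through the same pipeline. The sketch below is a compact re-implementation written for this check (the name `toy_location_entropy` and the toy distances are assumptions, not the notebook's code or data). It places $l_1, l_2, l_3$ at distance 1 from $l$ and at distance 10 from each other, and assumes that $u_3$ also visits $l$.

```python
import itertools

import numpy as np
import pandas as pd

def toy_location_entropy(distances, visits, r):
    """Same steps as location_entropy above, written compactly (illustrative)."""
    near = distances[distances["distance"] <= r]
    # Denominator: number of locations within r of each location.
    n_near = near.groupby("location_1").size().rename("n_near").reset_index()
    # Each visit paired with every location near the visited place.
    candidates = visits.rename(columns={"location": "location_1"}).merge(near, on="location_1")
    # Keep only the nearby locations that the same user also visited.
    shared = candidates.merge(
        visits.rename(columns={"location": "location_2"}), on=["user", "location_2"]
    )
    # Numerator: nearby locations visited per (user, location_1).
    counts = shared.groupby(["user", "location_1"]).size().rename("n_visited").reset_index()
    merged = counts.merge(n_near, on="location_1")
    p = merged["n_visited"] / merged["n_near"]
    merged["term"] = -p * np.log(p)
    return merged.groupby("location_1")["term"].sum()

places = ["l", "l1", "l2", "l3"]
rows = [(a, b, 1 if "l" in (a, b) else 10) for a, b in itertools.permutations(places, 2)]
distances = pd.DataFrame(rows, columns=["location_1", "location_2", "distance"])

user_places = {"u1": ["l", "l1", "l2"], "u2": ["l", "l1", "l3"], "u3": ["l", "l1", "l2", "l3"]}
visits = pd.DataFrame(
    [(u, loc) for u, locs in user_places.items() for loc in locs],
    columns=["user", "location"],
)

entropy = toy_location_entropy(distances, visits, r=2)
print(entropy)
```

For $l$ this gives about 0.541, the natural-log value of the sum worked out by hand in the introduction. The other three locations get 0, because their only neighbour within the radius is $l$ and every one of their visitors has visited it, so $p = 1$ for each.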


### References
https://www.cs.cmu.edu/~jasonh/publications/ubicomp2010-location-priv-final.pdf