## Assessment : Imtiaz Khan

### Dataset : http://snap.stanford.edu/data/index.html

### Overview

Gowalla is a location-based social networking website where users share their locations by checking in.
The friendship network is undirected, was collected through Gowalla's public API, and consists of 196,591 nodes and 950,327 edges. The dataset contains a total of 6,442,890 check-ins by these users over the period Feb. 2009 to Oct. 2010.

#### Sample time and location information of check-ins made by users

In [2]:
# [user]	 [check-in time]	  [latitude]	  [longitude]   [location id]
# 196514  2010-07-24T13:45:06Z    53.3648119      -2.2723465833   145064
# 196514  2010-07-24T13:44:58Z    53.360511233    -2.276369017    1275991
# 196514  2010-07-24T13:44:46Z    53.3653895945   -2.2754087046   376497
# 196514  2010-07-24T13:44:38Z    53.3663709833   -2.2700764333   98503
# 196514  2010-07-24T13:44:26Z    53.3674087524   -2.2783813477   1043431
# 196514  2010-07-24T13:44:08Z    53.3675663377   -2.278631763    881734
# 196514  2010-07-24T13:43:18Z    53.3679640626   -2.2792943689   207763
# 196514  2010-07-24T13:41:10Z    53.364905       -2.270824       1042822	

### Location Entropy

Location entropy was first introduced to describe the popularity of a location.

Let l be a location and

V(l,u) = {< u, l, t >: ∀t}   be the set of check-ins at location l by user u, and

V(l) = {< u, l, t >: ∀t, ∀u} be the set of all check-ins at location l by all users.

The probability that a randomly picked check-in from V(l) belongs to User u is 

P(u,l) = |V(l,u)|/|V(l)|.

If we treat the user behind a randomly picked check-in at l as a random variable, then its uncertainty is
given by the Shannon entropy as follows:

\begin{equation*}
H(l) = - \sum_{u \,:\, P(u,l) \neq 0} P(u,l) \log P(u,l)
\end{equation*}
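
As a quick numeric check of the formula, the following minimal sketch computes H(l) directly from per-user check-in counts at a single location. The function name `entropy_from_counts` and the example counts are illustrative assumptions, not values from the dataset; the natural logarithm is used.

```python
import math

def entropy_from_counts(counts):
    """Shannon entropy (natural log) of per-user check-in counts at one location."""
    total = sum(counts)
    # P(u,l) = |V(l,u)| / |V(l)|; users with zero check-ins are skipped,
    # matching the P(u,l) != 0 condition in the sum.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical locations: one visited by a single user, one shared evenly by four.
private_place = entropy_from_counts([12])
shared_place = entropy_from_counts([3, 3, 3, 3])
```

A location with a single visitor gives H = 0, while four equally frequent visitors give H = log 4, which is about 1.386.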




### Code For Location Entropy

#### Loading Required Libraries and Data

In [1]:
import pandas as pd

# Local path to the tab-separated SNAP check-in file; adjust for your machine.
path="C:/Users/imtiaz.a.khan/Downloads/Gowalla_totalCheckins.txt"
# The file has no header row, so the column names are supplied explicitly.
columnNames=['user','check-in time','latitude','longitude','location_id']
location_data = pd.read_csv(path,sep="\t",header=None,names=columnNames)
location_data.head()

Unnamed: 0,user,check-in time,latitude,longitude,location_id
0,0,2010-10-19T23:55:27Z,30.235909,-97.79514,22847
1,0,2010-10-18T22:17:43Z,30.269103,-97.749395,420315
2,0,2010-10-17T23:42:03Z,30.255731,-97.763386,316637
3,0,2010-10-17T19:26:05Z,30.263418,-97.757597,16516
4,0,2010-10-16T18:50:42Z,30.274292,-97.740523,5535878


#### Grouping by location for overall checkin count

In [2]:
df_location_grouped=location_data.groupby(['location_id'],as_index=False)['check-in time'].count()
df_location_grouped.columns=['location_id','TotalCheckinsForLocation']
df_location_grouped.head()

Unnamed: 0,location_id,TotalCheckinsForLocation
0,8904,12
1,8932,16
2,8936,12
3,8938,130
4,8947,570


#### Grouping by location and user for user specific checkin count

In [3]:
df_location_user_grouped=location_data.groupby(['location_id','user'],as_index=False)['check-in time'].count()
df_location_user_grouped.columns=['location_id','user','TotalCheckinsPerLocationPerUser']
df_location_user_grouped.head()

Unnamed: 0,location_id,user,TotalCheckinsPerLocationPerUser
0,8904,24,1
1,8904,256,3
2,8904,310,1
3,8904,343,4
4,8904,392,1


#### Merging the location-level and location-user-level check-in counts

In [4]:
Merged_df=df_location_user_grouped.merge(df_location_grouped,how="left",on='location_id')
Merged_df.head()

Unnamed: 0,location_id,user,TotalCheckinsPerLocationPerUser,TotalCheckinsForLocation
0,8904,24,1,12
1,8904,256,3,12
2,8904,310,1,12
3,8904,343,4,12
4,8904,392,1,12


##### The probability that a randomly picked check-in from V(l) belongs to User u is P(u,l) = |V(l,u)|/|V(l)|.

In [5]:
Merged_df['ProbOfCheckinOfLocToUser']=Merged_df['TotalCheckinsPerLocationPerUser']/Merged_df['TotalCheckinsForLocation']
Merged_df.head()

Unnamed: 0,location_id,user,TotalCheckinsPerLocationPerUser,TotalCheckinsForLocation,ProbOfCheckinOfLocToUser
0,8904,24,1,12,0.083333
1,8904,256,3,12,0.25
2,8904,310,1,12,0.083333
3,8904,343,4,12,0.333333
4,8904,392,1,12,0.083333
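
The two group-bys and the merge above can also be collapsed into a single pass. The sketch below is one possible alternative, shown on a small made-up frame (the `toy` data and the `per_user` name are assumptions for illustration): it counts check-ins per (location, user) pair and divides by the per-location total obtained with `transform("sum")`.

```python
import pandas as pd

# Made-up check-ins: location 1 has user 10 (twice) and user 11 (once);
# location 2 has user 10 only.
toy = pd.DataFrame({
    "location_id": [1, 1, 1, 2],
    "user": [10, 10, 11, 10],
})

# |V(l,u)|: check-ins per (location, user) pair.
per_user = (
    toy.groupby(["location_id", "user"])
       .size()
       .reset_index(name="TotalCheckinsPerLocationPerUser")
)

# |V(l)|: per-location total, broadcast back onto every (location, user) row.
per_user["TotalCheckinsForLocation"] = (
    per_user.groupby("location_id")["TotalCheckinsPerLocationPerUser"].transform("sum")
)

# P(u,l) = |V(l,u)| / |V(l)|
per_user["ProbOfCheckinOfLocToUser"] = (
    per_user["TotalCheckinsPerLocationPerUser"] / per_user["TotalCheckinsForLocation"]
)
```

Within each location the probabilities sum to 1, which is a convenient sanity check before computing the entropy.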


#### location_entropy Method

In [14]:
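import scipy.stats as sc
def location_Entropy(data):
    # data holds the rows of a single location_id group.
    # sc.entropy uses the natural log and normalises its input to sum to 1.
    data = data.copy()  # avoid writing into a slice of the original frame
    data['location_entropy']=sc.entropy(data['ProbOfCheckinOfLocToUser'])
    return data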
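
Two properties of `scipy.stats.entropy` are worth keeping in mind when reading the method above: it uses the natural logarithm unless a `base` is given, and it rescales its input to sum to 1 before computing the entropy. The short check below illustrates both on made-up probabilities (the values are assumptions for illustration only).

```python
import math
import scipy.stats as sc

# Made-up distribution over three users at one location.
probs = [0.5, 0.25, 0.25]

# Manual Shannon entropy with the natural logarithm.
manual = -sum(p * math.log(p) for p in probs)

h_nats = sc.entropy(probs)           # natural log by default
h_bits = sc.entropy(probs, base=2)   # same distribution, measured in bits

# Raw counts give the same answer because the input is normalised first.
h_from_counts = sc.entropy([2, 1, 1])
```

Because of the normalisation, passing only some of a location's rows still returns a value, but it is the entropy of that partial distribution rather than of the full location.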

#### Finally, applying location entropy to a sample of the dataset

In [27]:
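# Take the first 54 (location, user) rows as a small sample; the last location
# in the sample may be cut off, so its entropy covers only the rows kept.
sample_Merged_df=Merged_df.head(54)
sample_Merged_entropy_df=sample_Merged_df.groupby(['location_id'],as_index=False).apply(location_Entropy)
# Every row of a location carries the same entropy, so keep one row per location.
sample_Merged_entropy_df.groupby('location_id',as_index=False).first()['location_entropy']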

0    1.748155
1    2.512659
2    1.698783
3    2.144194
Name: location_entropy, dtype: float64
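
For the full dataset, the per-location entropy could also be computed without `apply`, by summing -P log P within each location. This is a sketch under the assumption that the frame has the `location_id` and `ProbOfCheckinOfLocToUser` columns built earlier; the small `toy_probs` frame with made-up values stands in for `Merged_df`.

```python
import numpy as np
import pandas as pd

# Stand-in for Merged_df with made-up values.
toy_probs = pd.DataFrame({
    "location_id": [1, 1, 2, 2, 2, 2],
    "ProbOfCheckinOfLocToUser": [0.5, 0.5, 0.25, 0.25, 0.25, 0.25],
})

p = toy_probs["ProbOfCheckinOfLocToUser"]
# Each row contributes -P(u,l) * log P(u,l); rows only exist for P(u,l) > 0.
contribution = -p * np.log(p)

# H(l): sum the contributions within each location.
location_entropy = (
    contribution.groupby(toy_probs["location_id"])
                .sum()
                .rename("location_entropy")
)
```

Unlike `scipy.stats.entropy`, this version does not renormalise, so it relies on the probabilities of each location already summing to 1.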

#### Conclusion

A high location entropy indicates a popular place with many visitors
that is not specific to any one user. A low location entropy, on the
other hand, implies a private place with few visitors, such as a
house, that is specific to a few people.
 

#### References

https://dl.acm.org/citation.cfm?id=2996985

http://www.cse.unt.edu/~huangyan/6350/paper/EBM.pdf