# Flow Detection

This notebook is used to design and experiment with the flow detection algorithm.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('../swiss_flows')

## Data Cleaning

Clean the data: 

In [4]:
from clean_tweets import clean_tweets

# clean_tweets('../data/clean_tweets')

Import the clean tweets: 

In [5]:
tweets = pd.read_csv('../data/clean_tweets.csv', parse_dates=[2])
tweets.head()

| | id | userId | createdAt | placeLongitude | placeLatitude | userLocation |
|---|---|---|---|---|---|---|
| 0 | 776522983837954049 | 735449229028675584 | 2016-09-15 20:48:01 | 8.96044 | 46.0027 | Earleen. |
| 1 | 776523000636203010 | 2741685639 | 2016-09-15 20:48:05 | 8.22414 | 46.8131 | Suisse |
| 2 | 776523045200691200 | 435239151 | 2016-09-15 20:48:15 | 5.94082 | 47.201 | Fontain |
| 3 | 776523058404290560 | 503244217 | 2016-09-15 20:48:18 | 6.16552 | 45.8011 | Shargeyah |
| 4 | 776523058504925185 | 452805259 | 2016-09-15 20:48:18 | 6.14319 | 46.2048 | İstanbul/Burgazada |


## Grouping by user

To detect flows, we need to analyse the different locations each person tweets from. This requires analysing the tweets user by user.

In [6]:
# Group by user
grouped = tweets.groupby('userId')

nb_user = tweets['userId'].nunique()
print('Number of different users : {}.'.format(nb_user))

Number of different users : 2763.


How many tweets per user do we get?

*Note: The following cells have been commented because they cause the evaluation of the notebook (Run All) to be slow.*

In [7]:
'''
df = grouped.agg('count')['id'].reset_index().rename(columns={'id': 'tweets'}, index=str)
df.head()
'''

"\ndf = grouped.agg('count')['id'].reset_index().rename(columns={'id': 'tweets'}, index=str)\ndf.head()\n"

In [8]:
'''
df['tweets'].value_counts().plot(kind='bar')
plt.title('Distribution of tweets per user')
plt.xlabel('Number of tweets')
plt.ylabel('Number of users')
plt.show()
'''

"\ndf['tweets'].value_counts().plot(kind='bar')\nplt.title('Distribution of tweets per user')\nplt.xlabel('Number of tweets')\nplt.ylabel('Number of users')\nplt.show()\n"

It seems most users tweeted only once, which is not ideal for our ultimate goal. However, the actual dataset contains many more tweets, and therefore many more users who tweeted more than once.
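A lighter way to obtain the same distribution is to count rows per `userId` with `groupby(...).size()` and then count how many users share each tweet count. The sketch below runs on a tiny made-up frame (`sample` is an assumption for illustration, not the project's dataset):

```python
import pandas as pd

# Toy data standing in for the real tweets (made up for illustration)
sample = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'userId': [10, 10, 10, 20, 30, 30],
})

# Number of tweets per user
tweets_per_user = sample.groupby('userId').size()

# Number of users for each tweet count, ordered by tweet count
distribution = tweets_per_user.value_counts().sort_index()
print(distribution)
```

On the real data, `distribution.plot(kind='bar')` would give the bar chart described above.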

## Filter users

Users who posted a single tweet provide no insight into the flows we wish to detect. Their tweets should be removed.

How many users do we initially have?

In [9]:
len(grouped)

2763

Let's take a look at the data, grouped by user.

In [10]:
# grouped is a DataFrameGroupBy object, it cannot be displayed like a dataframe
for name, group in grouped:
    print(name)
    print(group)
    print('--------------------------------------------------------------------')

2397
                      id  userId           createdAt  placeLongitude  \
374   776529870969049088    2397 2016-09-15 21:15:23         5.97056   
3127  776679574910472192    2397 2016-09-16 07:10:15         5.97056   

      placeLatitude       userLocation  
374         46.2907  London and Geneva  
3127        46.2907  London and Geneva  
--------------------------------------------------------------------
12582
                      id  userId           createdAt  placeLongitude  \
4540  776720080960909312   12582 2016-09-16 09:51:12         7.58314   
4891  776727261932314624   12582 2016-09-16 10:19:44         7.58314   
5296  776736953245044736   12582 2016-09-16 10:58:15         7.58314   
5662  776746758730899456   12582 2016-09-16 11:37:13         7.60826   
6140  776762011606740992   12582 2016-09-16 12:37:49         7.60826   
6273  776764790366666752   12582 2016-09-16 12:48:52         7.60826   
7073  776784241703133184   12582 2016-09-16 14:06:09         7.60826   
7107

We filter out the users who have only one tweet.

In [11]:
# Transform the DataFrameGroupBy object into a dictionary
user_tweets = {}
for name, group in grouped:
    # Keep only users with more than one tweet
    if len(group) > 1:
        # Remove the userId column since we use it as key
        user_tweets[name] = group.drop('userId', axis=1)
        
user_tweets

{697844817385099264:                       id           createdAt  placeLongitude  placeLatitude  \
 1291  776555114316304386 2016-09-15 22:55:41         6.14319        46.2048   
 1344  776557863514869760 2016-09-15 23:06:37         6.14319        46.2048   
 1346  776557914546966528 2016-09-15 23:06:49         6.14319        46.2048   
 1349  776558094126047232 2016-09-15 23:07:32         6.14319        46.2048   
 7985  776804978035920896 2016-09-16 15:28:33         6.14319        46.2048   
 
      userLocation  
 1291          Gnv  
 1344          Gnv  
 1346          Gnv  
 1349          Gnv  
 7985          Gnv  ,
 699995720103886849:                       id           createdAt  placeLongitude  placeLatitude  \
 8302  776812319133368320 2016-09-16 15:57:44         6.14319        46.2048   
 8728  776820423602401280 2016-09-16 16:29:56         6.14319        46.2048   
 
              userLocation  
 8302  Worldwide customers  
 8728  Worldwide customers  ,
 702599546224910338: 

The resulting data structure is a dictionary, which is easier to work with than the DataFrameGroupBy object we had. 
Each value is a DataFrame holding the tweets of one user. Its "userId" column would only repeat the corresponding key,
so we drop this column when building the dictionary.
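The same filtering can also be expressed without an explicit `if` inside the loop. The following is a minimal sketch on made-up data (`sample` and `user_tweets_sketch` are illustrative names, not part of the project): it broadcasts each user's tweet count back to the rows, keeps the rows of users with more than one tweet, and then builds the dictionary.

```python
import pandas as pd

# Toy data standing in for the real tweets (made up for illustration)
sample = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'userId': [10, 10, 20, 30],
})

# Number of tweets of each row's user, aligned with the original rows
group_sizes = sample.groupby('userId')['id'].transform('count')

# Keep only rows belonging to users with more than one tweet
multi = sample[group_sizes > 1]

# Dictionary keyed by userId, without the redundant userId column
user_tweets_sketch = {
    user_id: group.drop(columns='userId')
    for user_id, group in multi.groupby('userId')
}
```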

## Further filtering

### Time interval condition

Tweets that are too far apart in time cannot form a flow. In other words, a tweet emitted at time $t$ by a given user should be removed if this user has not emitted another tweet within the interval $[t - l/2, t + l/2]$, where $l$ is a fixed duration.

Therefore, for each user, we create all pairs of tweets which respect the above criterion, since they may represent a potential flow. We remove tweets which cannot pair with another tweet from the same user.
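As an illustration of this criterion, here is a minimal sketch that pairs the timestamps of a single user. The helper name `pairs_within_window` and the sample timestamps are assumptions made for this example; two tweets are paired when they are at most $l/2$ apart, which is one reading of the interval above.

```python
import itertools
import pandas as pd

def pairs_within_window(times, window):
    """Return index pairs (i, j), i < j, whose timestamps are at most window / 2 apart."""
    half = window / 2
    return [
        (i, j)
        for (i, ti), (j, tj) in itertools.combinations(enumerate(times), 2)
        if abs(ti - tj) <= half
    ]

# Made-up timestamps for one user
times = pd.to_datetime([
    '2016-09-15 21:00:00',
    '2016-09-16 07:00:00',
    '2016-09-20 12:00:00',
])

# With l = 2 days, only the first two tweets are close enough to pair
pairs = pairs_within_window(times, pd.Timedelta(days=2))
```

The third tweet appears in no pair, so it would be dropped.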

### Different nodes condition

The pairs generated represent potential flows only if the tweets were emitted from different Nodes. Consequently, we remove pairs where:
- at least one of the two tweets does not lie in a Node
- both tweets were emitted at the same Node
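The two conditions can be sketched as follows. `BoxNode`, `locate` and `is_candidate_flow` are hypothetical stand-ins written for this example (the project's `Node` class may locate points differently), and the bounding boxes are rough, made-up values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoxNode:
    """Hypothetical node: a named latitude/longitude bounding box."""
    name: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float

    def contains(self, lat, lon):
        return (self.lat_min <= lat <= self.lat_max
                and self.lon_min <= lon <= self.lon_max)

def locate(point, nodes):
    """Return the first node containing the (lat, lon) point, or None."""
    lat, lon = point
    for node in nodes:
        if node.contains(lat, lon):
            return node
    return None

def is_candidate_flow(p1, p2, nodes):
    """True if both points lie in a node and the two nodes differ."""
    n1, n2 = locate(p1, nodes), locate(p2, nodes)
    return n1 is not None and n2 is not None and n1 != n2

# Rough, made-up boxes around two cities
sketch_nodes = [
    BoxNode('Geneve', 46.1, 46.3, 6.0, 6.3),
    BoxNode('Lausanne', 46.4, 46.6, 6.5, 6.8),
]
geneva_point = (46.2048, 6.14319)
lausanne_point = (46.52, 6.63)
outside_point = (47.201, 5.94082)
```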

### Repetitive flows

Note that there is a problem here. Some users have a very large number of pairs that potentially represent flows, but many of those pairs represent the exact same flow. Indeed, certain users post several tweets from the same location, so a user may have several pairs representing the same flow during the same period. We should keep only one of those pairs.

The solution consists in:
- creating the flows corresponding to each pair
- if a user has several identical flows whose time periods overlap, keeping only one of those flows
- dropping the symmetrical flows if the flow is not directed: A <--> B is equivalent to B <--> A
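The deduplication of undirected flows can be sketched with plain tuples. This is a simplified illustration, not the project's `Flow` class: `unique_undirected` is a hypothetical helper, node names stand in for nodes, and all pairs are assumed to fall within the same time period.

```python
def unique_undirected(pairs):
    """Keep the first occurrence of each unordered (src, dst) pair."""
    seen = set()
    unique = []
    for src, dst in pairs:
        # A frozenset ignores order, so (A, B) and (B, A) share one key
        key = frozenset((src, dst))
        if key not in seen:
            seen.add(key)
            unique.append((src, dst))
    return unique

candidate_pairs = [
    ('Bern', 'Basel'),
    ('Basel', 'Bern'),   # symmetrical to the first pair
    ('Bern', 'Basel'),   # exact duplicate
    ('Bern', 'Zurich'),
]
kept = unique_undirected(candidate_pairs)
```

Using a set of keys keeps each membership test constant-time, whereas checking `flow not in flows` against a list scans the whole list for every candidate.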

In [12]:
from node import Node

# Generate the nodes
nodes = Node.generate_swiss_nodes() 



In [17]:
from flow import Flow
import itertools

# Duration in days
l = 2

user_flows = {}
for user_id, tweet_info in user_tweets.items():
    
    # Generate all possible pairs of tweet ids
    tweet_ids = list(tweet_info['id'])
    id_pairs = itertools.combinations(tweet_ids, 2)
    
    flows = []
    for id_pair in id_pairs:
        
        # Retrieve the two tweets of the pair (each is a one-row DataFrame)
        t1 = tweet_info[tweet_info['id']==id_pair[0]]
        t2 = tweet_info[tweet_info['id']==id_pair[1]]
        
        # Nodes corresponding to the tweets
        n1 = Node.locate_point((t1.placeLatitude, t1.placeLongitude), nodes)
        n2 = Node.locate_point((t2.placeLatitude, t2.placeLongitude), nodes)
        
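        # Time interval condition: calendar dates at most l days apart
        # (subtracting dates stays correct across a year boundary, unlike dayofyear)
        d1 = t1.createdAt.dt.normalize().iloc[0]
        d2 = t2.createdAt.dt.normalize().iloc[0]
        time_cond = abs((d1 - d2).days) <= l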
        
        # Node conditions
        geo_cond = n1 and n2 and (n1 != n2)
        
        if time_cond and geo_cond:
            # Build the flow
            flow = Flow(src=n1, dst=n2, directed=False)
            
            # Drop duplicate flows, and symmetrical flows if not directed
            is_duplicate = flow in flows
            is_mirror = not flow.directed and flow.symmetrical in flows
            if not is_duplicate and not is_mirror:
                flows.append(flow)

    # Save those flows
    if flows:
        user_flows[user_id] = flows

In [18]:
for user, flows in user_flows.items():
    print('---- New user ----')
    for f in flows:
        print(f)

---- New user ----
[Flow] Lausanne <--> Geneve None None, weight = 0.
---- New user ----
[Flow] Biel/Bienne <--> Bern None None, weight = 0.
---- New user ----
[Flow] Zurich <--> Luzern None None, weight = 0.
---- New user ----
[Flow] Biel/Bienne <--> Bern None None, weight = 0.
[Flow] Biel/Bienne <--> Winterthur None None, weight = 0.
[Flow] Biel/Bienne <--> Basel None None, weight = 0.
[Flow] Biel/Bienne <--> Sankt Gallen None None, weight = 0.
[Flow] Biel/Bienne <--> Geneve None None, weight = 0.
[Flow] Biel/Bienne <--> Luzern None None, weight = 0.
[Flow] Biel/Bienne <--> Zurich None None, weight = 0.
[Flow] Bern <--> Winterthur None None, weight = 0.
[Flow] Bern <--> Basel None None, weight = 0.
[Flow] Bern <--> Sankt Gallen None None, weight = 0.
[Flow] Bern <--> Geneve None None, weight = 0.
[Flow] Bern <--> Luzern None None, weight = 0.
[Flow] Bern <--> Zurich None None, weight = 0.
[Flow] Winterthur <--> Basel None None, weight = 0.
[Flow] Winterthur <--> Sankt Gallen None Non