# Flow Detection

This notebook is used to design and experiment with the flow detection algorithm.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('../swiss_flows')

## Data Cleaning

Clean the data: 

In [6]:
from clean_tweets import clean_tweets

# clean_tweets('../data/clean_tweets')

Import the clean tweets: 

In [7]:
tweets = pd.read_csv('../data/clean_tweets.csv', parse_dates=[2])
tweets.head()

| Unnamed: 0 | id | userId | createdAt | placeLongitude | placeLatitude | userLocation |
|---|---|---|---|---|---|---|
| 0 | 776522983837954049 | 735449229028675584 | 2016-09-15 20:48:01 | 8.96044 | 46.0027 | Earleen. |
| 1 | 776523000636203010 | 2741685639 | 2016-09-15 20:48:05 | 8.22414 | 46.8131 | Suisse |
| 2 | 776523045200691200 | 435239151 | 2016-09-15 20:48:15 | 5.94082 | 47.201 | Fontain |
| 3 | 776523058404290560 | 503244217 | 2016-09-15 20:48:18 | 6.16552 | 45.8011 | Shargeyah |
| 4 | 776523058504925185 | 452805259 | 2016-09-15 20:48:18 | 6.14319 | 46.2048 | İstanbul/Burgazada |


## Grouping by user

To detect flows, we need to analyse the different locations each person tweets from. This requires analysing tweets by user.

In [8]:
# Group by user
grouped = tweets.groupby('userId')

nb_user = tweets['userId'].nunique()
print('Number of different users : {}.'.format(nb_user))

Number of different users : 2763.


How many tweets per user do we get?

In [9]:
# Uncomment only if necessary (takes a long time)
# df = grouped.agg('count')['id'].reset_index().rename(columns={'id': 'tweets'}, index=str)
#df.head()
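
Since the aggregation above is reported to be slow, a lighter way to obtain the same counts might look like the following. This is only a sketch on a toy frame: `toy_tweets` and its values are made up for illustration, and only the `id` and `userId` column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the `tweets` DataFrame (values are invented for illustration).
toy_tweets = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'userId': [10, 10, 20, 30, 30, 30],
})

# Number of tweets per user: one row per user, counting rows in each group.
tweets_per_user = (
    toy_tweets.groupby('userId')
    .size()
    .rename('tweets')
    .reset_index()
)

# Distribution: how many users posted exactly n tweets.
distribution = tweets_per_user['tweets'].value_counts().sort_index()
print(distribution)
```

`size()` only counts rows per group instead of aggregating every column, which is usually cheaper than `agg('count')` on the whole frame.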

In [10]:
# Uncomment only after computing `df` in the previous cell
# df['tweets'].value_counts().plot(kind='bar')
# plt.title('Distribution of tweets per user')
# plt.xlabel('Number of tweets')
# plt.ylabel('Number of users')
# plt.show()

It seems that most users tweeted only once, which is not ideal for our ultimate goal. Keep in mind, however, that the full dataset contains many more tweets, and therefore many more users who tweeted more than once.

## Filter users

Users who posted only a single tweet provide no insight into the flows we wish to detect. Their tweets should be removed.

How many users do we initially have?

In [None]:
len(grouped)

2763

Let's take a look at the tweets of each user.

In [None]:
for name, group in grouped:
    print(name)
    print(group)
    print('')

2397
                      id  userId           createdAt  placeLongitude  \
374   776529870969049088    2397 2016-09-15 21:15:23         5.97056   
3127  776679574910472192    2397 2016-09-16 07:10:15         5.97056   

      placeLatitude       userLocation  
374         46.2907  London and Geneva  
3127        46.2907  London and Geneva  

12582
                      id  userId           createdAt  placeLongitude  \
4540  776720080960909312   12582 2016-09-16 09:51:12         7.58314   
4891  776727261932314624   12582 2016-09-16 10:19:44         7.58314   
5296  776736953245044736   12582 2016-09-16 10:58:15         7.58314   
5662  776746758730899456   12582 2016-09-16 11:37:13         7.60826   
6140  776762011606740992   12582 2016-09-16 12:37:49         7.60826   
6273  776764790366666752   12582 2016-09-16 12:48:52         7.60826   
7073  776784241703133184   12582 2016-09-16 14:06:09         7.60826   
7107  776785191796928512   12582 2016-09-16 14:09:56         7.60826   


We filter out the users who have only one tweet.

*Note: the grouped object is a `DataFrameGroupBy` object, which is harder to work with than a regular `DataFrame`. Here, we choose to represent the grouping of tweets per user using a dictionary*.

In [None]:
user_tweets = {}
for name, group in grouped:
    if len(group) > 1:
        user_tweets[name] = group
        
user_tweets
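
As an alternative to looping over the groups, the same filter could be expressed directly on the DataFrame. The sketch below uses a made-up `toy_tweets` frame, and `user_tweets_sketch` is a hypothetical name; it assumes the `id` column has no missing values, so that counting ids per user equals counting tweets.

```python
import pandas as pd

# Toy stand-in for the `tweets` DataFrame (values are invented for illustration).
toy_tweets = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'userId': [10, 10, 20, 30, 30, 30],
})

# For every row, the number of tweets posted by that row's user.
tweets_per_row = toy_tweets.groupby('userId')['id'].transform('count')

# Keep only the rows of users with more than one tweet.
multi_tweets = toy_tweets[tweets_per_row > 1]

# If a dictionary is still convenient, rebuild it from the filtered frame.
user_tweets_sketch = {
    user_id: group for user_id, group in multi_tweets.groupby('userId')
}
print(sorted(user_tweets_sketch))
```

Filtering first keeps the data in a single DataFrame, and the dictionary is only built for the users that remain.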

## Maximum time interval

Tweets which are excessively spaced out in time are not viable. Thus, a tweet emitted at time $t$ by a given user should be removed if this user hasn't emitted another tweet within the interval $[t - l/2, t + l/2]$, where $l$ is a set duration.

Therefore, for each user, we create all pairs of tweets which respect the above criterion, since each such pair may represent a flow. Tweets which cannot be paired with any other tweet are considered "isolated", and are thus removed.

In [None]:
import itertools

tweet_ids = list(user_tweets[697844817385099264]['id'])
list(itertools.combinations(tweet_ids, 2))

In [None]:
import itertools  # Generate pairs

l = 0  # Duration l from the criterion above (not yet set)
for user_id, user_df in user_tweets.items():

    # Generate all possible pairs of tweet ids
    tweet_ids = list(user_df['id'])
    all_pairs = itertools.combinations(tweet_ids, 2)

    # Keep only pairs whose time intervals overlap
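
The filtering step is left open above. One possible way to complete it is sketched below. It reads the criterion as "two tweets from the same user can be paired if they were posted at most $l/2$ apart", which is an interpretation, not a confirmed design. The helper `candidate_pairs`, the toy data, and the 4-hour value of `l` are all hypothetical.

```python
import itertools
import pandas as pd

def candidate_pairs(user_df, max_gap):
    """Return (earlier_id, later_id) pairs posted at most `max_gap` apart."""
    # Sort by time so that every pair is ordered chronologically.
    ordered = user_df.sort_values('createdAt')[['id', 'createdAt']]
    rows = ordered.itertuples(index=False)
    return [
        (first.id, second.id)
        for first, second in itertools.combinations(rows, 2)
        if second.createdAt - first.createdAt <= max_gap
    ]

# Toy tweets for a single user (values are invented for illustration).
toy_user = pd.DataFrame({
    'id': [1, 2, 3],
    'createdAt': pd.to_datetime([
        '2016-09-16 09:00:00',
        '2016-09-16 10:00:00',
        '2016-09-16 18:00:00',
    ]),
})

l = pd.Timedelta(hours=4)  # hypothetical duration, chosen only for the example
pairs = candidate_pairs(toy_user, l / 2)

# Tweets that appear in no pair are the "isolated" ones.
paired_ids = {tweet_id for pair in pairs for tweet_id in pair}
isolated = set(toy_user['id']) - paired_ids
print(pairs, isolated)
```

Generating every combination is quadratic in the number of tweets per user. That is acceptable for small groups, but a sliding window over the sorted timestamps would scale better for very active users.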
    

## Actual flows

The pairs generated in the previous step represent potential flows only if the tweets were emitted at different Nodes. Consequently, we remove pairs where:
- one or both tweet(s) do(es) not reside in a Node
- both tweets were emitted at the same Node
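
The two removal rules above can be sketched as follows. How tweets are assigned to Nodes is not defined in this notebook, so the `tweet_node` mapping, the Node names, and the example pairs are purely hypothetical.

```python
# Hypothetical tweet-to-Node assignment; None marks a tweet outside every Node.
tweet_node = {1: 'Geneva', 2: 'Lausanne', 3: None, 4: 'Geneva'}

# Hypothetical pairs of tweet ids produced by the previous step.
potential_pairs = [(1, 2), (1, 3), (1, 4), (2, 4)]

flows = []
for first_id, second_id in potential_pairs:
    origin = tweet_node.get(first_id)
    destination = tweet_node.get(second_id)
    # Rule 1: drop the pair if either tweet does not reside in a Node.
    if origin is None or destination is None:
        continue
    # Rule 2: drop the pair if both tweets were emitted at the same Node.
    if origin == destination:
        continue
    flows.append((origin, destination))

print(flows)
```

Each remaining tuple is an (origin, destination) pair of Nodes, which assumes the tweet ids in each pair are ordered chronologically.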