# Extract Movement/Behavior Pattern

## Methods for learning user behavior or mobility patterns, according to the referenced tweet-analysis literature

- Text-based/Content-based analysis [less useful]
- Geo-located information analysis
- Hybrid analysis [more practical]
- Others 

## Relevant Data
Based on what is included in the raw data, the following information can be extracted:

- Time: year, (season,) month, week, day, hour (day/night)...
- Location: longitude/latitude, user location (country/institute/unknown...) -> geopy -> cities in CH // cantons
- Content: (language translation?) topic/event -> behavior/event inference
- Counts (weights of effect): follower/friend/number of statuses (Machine Learning), 'Many tweets can include many topics; therefore, the ratio for total tweets is more important than the frequency.'[01]
- Road/Transport data: transport network, fee...
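
As a minimal sketch of how the coarser time features could be derived, assuming the `year`, `month`, `day` and `hour` columns of the processed tweet table used later in this notebook. The season mapping and the 06:00 to 17:59 daytime window are illustrative assumptions, not taken from the references.

```python
import pandas as pd

# Toy frame; column names follow the processed tweet table in this notebook.
df = pd.DataFrame({
    "year": [2010.0, 2010.0, 2010.0],
    "month": [3.0, 7.0, 12.0],
    "day": [9.0, 15.0, 24.0],
    "hour": [18.0, 5.0, 12.0],
})

# Assumption: meteorological seasons for the northern hemisphere.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

df["season"] = df["month"].astype(int).map(SEASON_BY_MONTH)

# Assumption: "day" means 06:00 to 17:59, everything else is "night".
df["is_daytime"] = df["hour"].between(6, 17)

# Day of week (0 = Monday) built from the year/month/day columns.
dates = pd.to_datetime(df[["year", "month", "day"]].astype(int))
df["weekday"] = dates.dt.dayofweek
```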

## Content/Text-based

[02] Geo-locating users (city level)

* purely on content;
* classification component: identify words in tweet;
* location estimate: lattice-based neighborhood smoothing model


[03] Text-based geo prediction

Integrated geolocation prediction framework & factors that affect prediction accuracy

* Feature selection methods for "location indicative words"
* Impact of non-geotagged tweets, language and user-declared metadata
* Impact of temporal variance on model generalisation & user differences

## Geo-located

[06] infer residence and mobility

Anchor-point clustering and classification
- dominant residential cluster (rc): the rc with the highest number of tweets -> usual residence
- aggregate the total number of tweets at the local-authority level of interest -> may not be accurate enough
- DBSCAN: based on 'density-reachability', with 2 parameters: a distance (i, for finding points in the same cluster; should be small enough) and a minimum number of points (minpts, to separate valid clusters from noise). No need to predefine the number of clusters; 3 kinds of points (core, density-reachable, noise); does not cluster all available points (filters out infrequent locations)
- Analysis: geolocated penetration rate
- short/long-term cluster: (tourist/short-term migrant) e.g. monthly pattern of net flows, student mobility (holidays...)
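
The DBSCAN step and the "dominant cluster" rule can be sketched as follows. This uses scikit-learn's `DBSCAN` with the haversine metric, which is a convenient stand-in and not necessarily the implementation used in [06]. The function name `dominant_cluster`, the parameter values `eps_km=0.5` and `min_pts=3`, and the sample coordinates are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

def dominant_cluster(lat_lon_deg, eps_km=0.5, min_pts=3):
    """Cluster one user's tweet locations; return (labels, centroid of largest cluster)."""
    points = np.asarray(lat_lon_deg, dtype=float)
    # The haversine metric expects [lat, lon] in radians; eps is an angle on the unit sphere.
    labels = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,
        min_samples=min_pts,
        metric="haversine",
        algorithm="ball_tree",
    ).fit_predict(np.radians(points))
    clustered = labels[labels >= 0]          # label -1 marks noise points
    if clustered.size == 0:
        return labels, None
    top = np.bincount(clustered).argmax()    # cluster with the most tweets
    return labels, points[labels == top].mean(axis=0)

# Hypothetical user: five tweets near one place, three near another, one isolated.
tweets = [
    (47.3876, 8.5272), (47.3878, 8.5275), (47.3874, 8.5270),
    (47.3877, 8.5268), (47.3875, 8.5274),
    (46.2006, 6.1318), (46.2008, 6.1320), (46.2004, 6.1316),
    (46.8957, 7.4424),
]
labels, home = dominant_cluster(tweets)
```

Here `home` is only a candidate for the usual residence; the isolated point is labelled as noise and never forced into a cluster, which is the filtering behaviour noted above.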



## Hybrid analysis

[01] Behavior analysis (tweet data & geo-tag)

label a small data set manually (small-scale supervised learning) -> SVM on the top unique words

- tweet data: learn (return-home) behavior using an SVM; verbal explanatory factors (distance, time of travel mode (train/bus/on foot; some modes are out of service in specific time periods)...);
- geo-tag: non-verbal explanatory factors;
- estimate behavior: combine the tweet-data and geo-tag factors using a discrete choice model (MNL, multinomial logit model)
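
For orientation, a multinomial logit model assigns each alternative a utility and turns the utilities into choice probabilities with a softmax. The sketch below shows only that probability step. The attribute columns (travel time, fee) and the coefficient values in `beta` are made up for illustration and are not estimates from [01].

```python
import numpy as np

def mnl_probabilities(features, beta):
    """Choice probabilities P_i = exp(V_i) / sum_j exp(V_j), with V = features @ beta."""
    utilities = features @ beta
    utilities = utilities - utilities.max()   # shift for numerical stability
    exp_u = np.exp(utilities)
    return exp_u / exp_u.sum()

# Rows: alternatives (train, bus, on foot).
# Columns: travel time in hours, fee in CHF (hypothetical values).
features = np.array([
    [0.5, 4.0],
    [0.8, 3.0],
    [1.5, 0.0],
])
beta = np.array([-2.0, -0.3])   # hypothetical coefficients, not fitted

probs = mnl_probabilities(features, beta)
```

In a real application `beta` would be estimated by maximum likelihood from observed choices rather than set by hand.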

[04] Extract relevant knowledge

extract representative terms 

- model with 3 dimension: word, location and time
- TF-IDF measures for weighting terms -> Generalized TF-IDF
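
As a rough illustration of term weighting over location and time, the sketch below computes plain TF-IDF where each "document" is the bag of words for one (location, time) cell. Treating cells as documents is an assumption made here for illustration; it is not the generalized measure defined in [04].

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a key to a non-empty token list; returns {key: {word: weight}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))          # count each word once per document
    scores = {}
    for key, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[key] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores

# Hypothetical (location, time-of-day) cells with their tokens.
docs = {
    ("Zürich", "evening"): ["train", "home", "train", "late"],
    ("Zürich", "morning"): ["coffee", "train", "work"],
    ("Genève", "evening"): ["lake", "sunset", "home"],
}
scores = tf_idf(docs)
evening_zh = scores[("Zürich", "evening")]
top_term = max(evening_zh, key=evening_zh.get)
```

A word that is frequent in one cell but also appears in most other cells (here `train`) is down-weighted relative to a word specific to that cell.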

Spatio-temporal data mining approaches (pattern mining): find groups of objects that stay together for a period of time.

- group of moving objects O, timestamp set T (during which the objects stay together)
- $\epsilon$: min number of objects; $min_t$: min number of timestamps
- particular pattern: swarm, convoy, GET_MOVE, gradual pattern.
- all share the same pre-processing: cluster with DBSCAN to group objects that are close to the same location
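
To make the two thresholds concrete, the sketch below searches for object groups that share a cluster in enough timestamps, without requiring the timestamps to be consecutive (a swarm-like condition). It enumerates groups of exactly `min_objects` objects by brute force, which is only practical for toy inputs and is not one of the mining algorithms from [04]. The input layout, the function name `find_groups` and the use of `-1` for noise are assumptions.

```python
from itertools import combinations

def find_groups(clusters_by_time, min_objects, min_timestamps):
    """clusters_by_time: {timestamp: {object_id: cluster_label}}, label -1 = noise.

    Returns {object_tuple: [timestamps]} for every group of exactly
    min_objects objects that share a cluster in at least min_timestamps
    (not necessarily consecutive) timestamps.
    """
    objects = sorted({obj for labels in clusters_by_time.values() for obj in labels})
    result = {}
    for group in combinations(objects, min_objects):
        together = []
        for t in sorted(clusters_by_time):
            labels = clusters_by_time[t]
            group_labels = {labels.get(obj) for obj in group}
            # All members must be present, in one cluster, and not noise.
            if len(group_labels) == 1 and not group_labels & {None, -1}:
                together.append(t)
        if len(together) >= min_timestamps:
            result[group] = together
    return result

# Hypothetical per-timestamp cluster labels (e.g. produced by DBSCAN).
clusters_by_time = {
    1: {"a": 0, "b": 0, "c": 1},
    2: {"a": 0, "b": 0, "c": 0},
    3: {"a": 2, "b": 1, "c": 1},
    4: {"a": 1, "b": 1, "c": -1},
}
groups = find_groups(clusters_by_time, min_objects=2, min_timestamps=3)
```

Requiring the shared timestamps to be consecutive instead would turn this into a convoy-like condition.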

[07] Multi-scale; mobility pattern; visual-analytics ****

Based on Apache Hadoop (distributed computing); LBSM (location-based social media)
- study mobility patterns using (geo-located) LBSM data -- visual-analytics methods
- extract, aggregate and summarize multi-level spatiotemporal mobility patterns

Scalable visual-analytics framework
- 1) Data processing unit: Hadoop model; extract movements and generate (multi-scale) space-time trajectories. For each user: save tuple records <id, loc, t> -> visitation behavior (displacement & radius of gyration)
- 2) Geo-visualization unit: 3D virtual globe web mapping interface
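
The two visitation measures can be sketched for a single user as below, on one machine and without Hadoop. Displacement is taken as the great-circle distance between consecutive records, and the radius of gyration as the root-mean-square distance of the records from their mean position. Using the arithmetic mean of latitude and longitude as the centre is a simplifying assumption, and the function names and sample records are hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def mobility_summary(records):
    """records: (lat, lon, t) tuples for one user; returns (displacements, radius)."""
    records = sorted(records, key=lambda r: r[2])   # order by timestamp
    lat = np.array([r[0] for r in records])
    lon = np.array([r[1] for r in records])
    # Distance between each record and the next one.
    displacements = haversine_km(lat[:-1], lon[:-1], lat[1:], lon[1:])
    # Assumption: the mean lat/lon approximates the centre of mass.
    dist_to_centre = haversine_km(lat, lon, lat.mean(), lon.mean())
    radius_of_gyration = np.sqrt(np.mean(dist_to_centre ** 2))
    return displacements, radius_of_gyration

# Hypothetical user who travels away and back again.
records = [
    (47.3876, 8.52725, 1),
    (46.2006, 6.13183, 2),
    (47.3876, 8.52725, 3),
]
displacements, radius = mobility_summary(records)
```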

[08] Travel survey method
- Literature review: [12-event], [18,19-activity pattern]
- Activity purpose identification: Modified LDA

build dictionary -> word counts (>20 times) -> LDA clusters -> top 3 words for each cluster -> correlation: the number of co-occurrences of the top 3 and other words -> activity tags (top 3 and correlation) -> assign every tweet its cluster tag(s) (>=1)

## Others

[05] Tweet Data Analytics(guidance)

In chapter 4 - Analyzing Twitter Data
- Retweet network(Streaming API): information of the way users communicate and how they value each other
- Text Measures: finding topics with latent Dirichlet allocation(LDA)

In chapter 5 - Visualizing Twitter Data
- Visualizing Geo-Spatial information: heatmaps with Kernel Density Estimation(KDE)
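
The density surface behind such a heat map can be sketched with `scipy.stats.gaussian_kde`. The points below are synthetic, drawn around two arbitrary centres, and the grid extent is an assumption; the default bandwidth (Scott's rule) is used without tuning.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic (lon, lat) points around two hypothetical hot spots.
spot_a = rng.normal([8.54, 47.38], 0.05, size=(200, 2))
spot_b = rng.normal([6.14, 46.20], 0.05, size=(100, 2))
points = np.vstack([spot_a, spot_b])

# gaussian_kde expects an array of shape (n_dims, n_points).
kde = gaussian_kde(points.T)

# Evaluate the density on a regular lon/lat grid.
lon_grid, lat_grid = np.meshgrid(
    np.linspace(5.9, 10.5, 120),
    np.linspace(45.8, 47.9, 80),
)
grid = np.vstack([lon_grid.ravel(), lat_grid.ravel()])
density = kde(grid).reshape(lon_grid.shape)

# Location of the densest grid cell.
peak = np.unravel_index(density.argmax(), density.shape)
peak_lon, peak_lat = lon_grid[peak], lat_grid[peak]
```

The `density` array can then be drawn as a heat map, for example with matplotlib's `imshow` or `pcolormesh`.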


## Conclusion
Relevant techniques:
- text measures: latent Dirichlet allocation (LDA) [05][08] identifies correlations between words and text corpora for topics
- Twitter Search API (for collecting geo-located tweets; can be ignored since we already have the data)
- Viz: heat map



In [1]:
import pandas as pd
import numpy as np
import csv
import pickle

In [11]:
# Load the pre-processed tweets; the context manager closes the file afterwards.
with open("twitter_swisscom_proc/xaa_proc.p", "rb") as f:
    twitter_data = pickle.load(f)
twitter_data.shape

(22611, 20)

In [18]:
twitter_user = twitter_data.groupby('userId')
twitter_user.size()

userId
2397.0          3
5033.0         28
7173.0          7
11323.0         1
11688.0         6
12187.0        23
17863.0         7
17893.0         1
18123.0        29
22573.0         2
36863.0         8
41483.0        42
49943.0         9
51233.0        88
53453.0        74
55223.0         7
63283.0         1
73363.0         4
80523.0        37
99983.0        33
120433.0      761
240193.0       18
368393.0       17
370153.0      196
371243.0        9
600483.0        1
613853.0       10
625553.0       60
626163.0      257
627253.0       18
             ... 
17751562.0      1
17755176.0      2
17782820.0      4
17787097.0      4
17787476.0      2
17793681.0     31
17864908.0      1
17870064.0      6
17910879.0     63
17926317.0      1
17964926.0      1
18019485.0    165
18060248.0      5
18083346.0     34
18124018.0    123
18156299.0      1
18186341.0      5
18196415.0    152
18213699.0     89
18218361.0      1
18222168.0      2
18225225.0     10
18225460.0     41
18265203.0      1
182

In [19]:
twitter_data.head()

Unnamed: 0,id,userId,text,longitude,latitude,inReplyTo,placeLatitude,placeLongitude,followersCount,friendsCount,statusesCount,userLocation,year,month,day,week_number,hour,city,state,country
551,10231423626,6257282.0,"The new apartment is nice, but there is no Wif...",7.58531,47.5455,,47.5367,7.57849,14249,9260.0,19585.0,"Potsdam, Germany",2010.0,3.0,9.0,10.0,18.0,Binningen,Basel-Landschaft,ch
609,10292646240,15602037.0,Is that wet yet solid stuff on my screen suppo...,8.52725,47.3876,,47.3791,8.50021,177,136.0,5167.0,"Zürich, Switzerland",2010.0,3.0,10.0,10.0,22.0,Zürich,Zürich,ch
611,10309829732,625553.0,I'm at DCTI - David Dufour in Geneva http://go...,6.13183,46.2006,,46.1996,6.13011,471,82.0,3363.0,"Geneva, Switzerland",2010.0,3.0,11.0,10.0,5.0,Genève,Genève,ch
612,10310391132,17341045.0,God morgon! :-),7.44235,46.8957,,46.9214,7.38855,586,508.0,9016.0,"Bern, Switzerland",2010.0,3.0,11.0,10.0,6.0,Köniz,Bern - Berne,ch
618,10311568050,634553.0,"At this very minute, the sun is pink.",6.199,46.2043,,46.1938,6.15415,2230,387.0,10605.0,"Geneva, Switzerland",2010.0,3.0,11.0,10.0,7.0,Genève,Genève,ch


In [28]:
twitter_data['userId'].nunique()

849

In [30]:
twitter_data['country'].unique()

array(['ch', 'us', 'it', 'fr', 'de', 'at', 'li'], dtype=object)