Many papers analyze users by their website visits, yet more than half of digital traffic
now comes from mobile devices and mobile apps (according to a
[comScore report](http://www.comscore.com/Insights/Blog/Major-Mobile-Milestones-in-May-Apps-Now-Drive-Half-of-All-Time-Spent-on-Digital)).

The goal is to predict users' demographic and lifestyle profiles from their
previous locations and past behavior at a given hour of the day.
If additional context is available (a truth set, the applications used, the user’s
tweets, etc.), we can use it to tune the model.

As a first step, let’s imagine we have a data set that contains a user ID, a timestamp, and
a location (latitude/longitude pair) for each record.

1) Detect “frequent spots”:
<ul>
<li>cluster the data with the k-means algorithm (represent users' trajectories as fixed-length
vectors of coordinates and compare those vectors by Euclidean
distance) or, as an alternative approach, with hidden Markov models</li>
<li>detect multiple interleaved periods using the Fourier transform and autocorrelation</li>
</ul>
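The periodicity bullet above can be illustrated with a minimal sketch that uses autocorrelation only (no Fourier transform). The hourly visit series is synthetic, and the function names `autocorrelation` and `dominant_period` are hypothetical helpers, not part of any library.

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of a numeric series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, max_lag):
    """Return the lag (in samples) with the highest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Synthetic assumption: the user is seen at one spot at 09:00 every day for 14 days.
hourly_visits = [1 if hour % 24 == 9 else 0 for hour in range(14 * 24)]
period = dominant_period(hourly_visits, max_lag=72)
print(period)
```

For this synthetic series the strongest lag is 24 hours, i.e. a daily pattern. Real traces with several interleaved periods would need more than a single maximum, for example inspecting all local peaks.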

| record | user | timestamp   | latitude  | longitude   |
|--------|------|-------------|-----------|-------------|
| r1     | u1   | 42499.375   | 37.786137 | -122.409143 |
| r2     | u1   | 42499.39583 | 37.785737 | -122.410922 |
| r3     | u1   | 42499.54167 | 37.787011 | -122.406039 |
| r4     | u2   | 42455.53125 | 37.7862   | -122.4096   |
| r5     | u3   | 42430.71875 | 37.785934 | -122.411144 |
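As a rough illustration of the clustering idea in step 1, here is a small, dependency-free k-means sketch. The first three points are rows r1 to r3 of the table above; the last three are made-up points added only so that a second cluster exists. The farthest-point initialization is a simplification chosen to keep the example deterministic, and Euclidean distance on raw degrees is only an approximation over short distances.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means on fixed-length tuples, Euclidean distance."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j])) for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


points = [
    (37.786137, -122.409143),  # r1
    (37.785737, -122.410922),  # r2
    (37.787011, -122.406039),  # r3
    (37.7700, -122.4300),      # hypothetical
    (37.7702, -122.4305),      # hypothetical
    (37.7698, -122.4297),      # hypothetical
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The same function accepts longer vectors, so a trajectory flattened into a fixed-length tuple of coordinates can be clustered in the same way.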

2) Label the spots based on timestamps and any available external context (such as the type of location from the Google Places API): “Office”, “Home”, “Shopping Mall”, etc.

| record | annotation                                           |
|--------|------------------------------------------------------|
| r1     | San Francisco, Starbucks, coffeehouse, working hours |
| r2     | San Francisco, Hilton, hotel, working hours          |
| r3     | San Francisco, Macy's, department store, lunch time  |
| r4     | San Francisco, road                                  |
| r5     | San Francisco, FedEx                                 |
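The time-of-day part of these annotations could be derived as sketched below. This assumes that the fractional part of the timestamp encodes the time of day (0.375 would then be 09:00), which is consistent with the annotations for r1 to r3 but is not stated in the data. The bucket boundaries and the `time_of_day` helper are hypothetical.

```python
def time_of_day(timestamp):
    """Map a day-serial timestamp to a coarse time-of-day label."""
    hour = (timestamp % 1) * 24  # fractional day -> hour of day (assumption)
    if 12 <= hour < 14:
        return "lunch time"      # hypothetical bucket
    if 9 <= hour < 18:
        return "working hours"   # hypothetical bucket
    return "off hours"           # hypothetical bucket


for record, timestamp in [("r1", 42499.375), ("r2", 42499.39583), ("r3", 42499.54167)]:
    print(record, time_of_day(timestamp))
```

The place part of the annotation (for example “Starbucks, coffeehouse”) would come from an external lookup and is not sketched here.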

3) Predict user profiles using decision trees with a generative grammar component (association rules and NLP are also applicable).

##### High-level examples
<ul>
<li>Frequent visits to “Victoria's Secret” => Gender: female</li>
<li>Frequent visits to Chinese, Japanese restaurants => Food interest: Asian</li>
</ul>
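A minimal sketch of how such high-level rules might be applied to labeled visits. The rule table only restates the two examples above; the `min_visits` threshold that defines “frequent” and the `infer_profile` helper are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rule table: place label -> (profile attribute, value).
PLACE_RULES = {
    "Victoria's Secret": ("Gender", "female"),
    "Chinese restaurant": ("Food interest", "Asian"),
    "Japanese restaurant": ("Food interest", "Asian"),
}


def infer_profile(visited_places, min_visits=3):
    """Keep only attributes supported by at least min_visits matching visits."""
    evidence = Counter(
        PLACE_RULES[place] for place in visited_places if place in PLACE_RULES
    )
    return {
        attribute: value
        for (attribute, value), count in evidence.items()
        if count >= min_visits
    }


visits = [
    "Chinese restaurant",
    "Japanese restaurant",
    "FedEx",
    "Chinese restaurant",
    "Victoria's Secret",
]
profile = infer_profile(visits)
print(profile)
```

Here only the food interest passes the threshold, because a single visit to one store is not treated as frequent.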

Let’s consider a finite set of users $V_A$, a finite set of profiles $V_T$, and a finite set of rules $A \to \varphi$, where $A \in V_A$ and $\varphi \in V_T$.

##### Example
Suppose we have users (A1, A2, A3, A4, A5) and the following rules:

| Conditional rules             | Decision rules                           |
|-------------------------------|------------------------------------------|
| A1 $|$ (s1=“+”  s2=“-”) → φ11 | A1 $|$ (φ1 = φ11) (s3 := “+”)            |
| A1 $|$ (s1=“+”  s2=“+”) → φ12 | A1 $|$ (φ1 = φ12) (s3 := “-”)            |
| A2 $|$ (s3=“+”  s4=“+”) → φ21 | A2 $|$ (φ2 = φ21) (s5 := “+”)            |
| A3 $|$ (s4=“+”) → φ31         | A3 $|$ (φ3 = φ31) (s2 := “-”)            |
| A3 $|$ (s4=“-”) → φ32         |                                          |
| A4 $|$ (s6=“+”) → φ41         | A4 $|$ (φ4 = φ41) (s1 := “-”)            |
| A4 $|$ (s6=“-”) → φ42         | A4 $|$ (φ4 = φ42) (s1 := “+”  s4 := “+”) |
| A5 $|$ (s1=“+”) → φ51         | A5 $|$ (φ5 = φ51) (s6 := “-”)            |
| A5 $|$ (s1=“-”) → φ52         | A5 $|$ (φ5 = φ52) (s6 := “+”)            |

Then the algorithm proceeds as follows. Columns s1 to s6 show the current settings (“.” means not yet set); “cond.” marks the application of a conditional rule and “dec.” the application of a decision rule:

| s1 | s2 | s3 | s4 | s5 | s6 | Profile $A_i$, rule type | Hypothesis                                     |
|----|----|----|----|----|----|--------------------------|------------------------------------------------|
| .  | .  | .  | .  | .  | .  | -                        |                                                |
| +  | -  | .  | .  | .  | .  | 1, cond.                 | H1 : s1=“+”  s2=“-”                            |
| +  | -  | +  | .  | .  | .  | 1, dec.                  |                                                |
| +  | -  | +  | +  | .  | .  | 2, cond.                 | H2 : s4=“+”                                    |
| +  | -  | +  | +  | +  | .  | 2, dec.                  |                                                |
| +  | -  | +  | +  | +  | .  | 3, cond.                 |                                                |
| +  | -  | +  | +  | +  | .  | 3, dec.                  | Confirmation for s2=“-” in H1                  |
| +  | -  | +  | +  | +  | +  | 4, cond.                 | H3 : s6=“+”                                    |
| -  | -  | +  | +  | +  | +  | 4, dec.                  | Rejection for s1=“+” in H1                     |
| +  | -  | +  | +  | +  | -  | 4, cond.                 | H3 : s6=“-”                                    |
| +  | -  | +  | +  | +  | -  | 4, dec.                  | Confirmation for s1=“+” in H1 and s4=“+” in H2 |
| +  | -  | +  | +  | +  | -  | 5, cond.                 |                                                |
| +  | -  | +  | +  | +  | -  | 5, dec.                  | Confirmation for s6=“-” in H3                  |
Therefore we obtain the following classification:

| A1  | A2  | A3  | A4  | A5  |
|-----|-----|-----|-----|-----|
| φ11 | φ21 | φ31 | φ42 | φ51 |
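One way to reproduce this classification is a depth-first search with chronological backtracking over the rule tables above. This is a sketch of one possible reading of the procedure, not necessarily the exact algorithm intended: a conditional rule is tried as a hypothesis, its decision rule is applied, and a contradiction with an already fixed setting rejects the most recent hypothesis. The `merge` and `classify` helpers are hypothetical names.

```python
# Each entry: (conditions on settings, profile, settings assigned by the decision rule).
RULES = {
    "A1": [({"s1": "+", "s2": "-"}, "φ11", {"s3": "+"}),
           ({"s1": "+", "s2": "+"}, "φ12", {"s3": "-"})],
    "A2": [({"s3": "+", "s4": "+"}, "φ21", {"s5": "+"})],
    "A3": [({"s4": "+"}, "φ31", {"s2": "-"}),
           ({"s4": "-"}, "φ32", {})],
    "A4": [({"s6": "+"}, "φ41", {"s1": "-"}),
           ({"s6": "-"}, "φ42", {"s1": "+", "s4": "+"})],
    "A5": [({"s1": "+"}, "φ51", {"s6": "-"}),
           ({"s1": "-"}, "φ52", {"s6": "+"})],
}


def merge(settings, updates):
    """Extend settings with updates; return None if any value contradicts."""
    merged = dict(settings)
    for key, value in updates.items():
        if merged.get(key, value) != value:
            return None
        merged[key] = value
    return merged


def classify(users, settings=None, chosen=None):
    """Return (profiles, settings) for the first consistent rule choice, else None."""
    settings = settings or {}
    chosen = chosen or {}
    if not users:
        return chosen, settings
    user, rest = users[0], users[1:]
    for condition, profile, decision in RULES[user]:
        hypothesis = merge(settings, condition)    # conditional rule
        if hypothesis is None:
            continue
        confirmed = merge(hypothesis, decision)    # decision rule
        if confirmed is None:
            continue                               # rejection: try the next rule
        result = classify(rest, confirmed, {**chosen, user: profile})
        if result is not None:
            return result
    return None


profiles, settings = classify(list(RULES))
print(profiles)
print(settings)
```

With the rules listed in order, the search rejects φ41 for A4 (its decision rule would set s1 to “-”) and settles on φ42, matching the trace above.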

Improvements and known issues:
<ul>
<li>GPS accuracy. The United States government currently [claims](http://www.gps.gov/systems/gps/performance/accuracy/) 4 meter RMS (7.8
meter 95% confidence interval) horizontal accuracy for civilian (SPS) GPS.
Vertical accuracy is worse. So in step 2 we should match not an exact latitude/longitude
pair but a circle with a radius of at least 8 meters (we choose 10 meters).</li>
<li>For demographic profiles, some open data sets can serve as truth sets, for example:
<ul>
<li>http://proximityone.com/location_based_demographics.htm</li>
<li>http://www.census.gov/topics/income-poverty/income.html</li>
</ul>
</li>
<li>To smooth our probabilities in case of high deviations, it is worth adding
weights to every profile. As a first approach, this step requires estimating the
overall population in the area using a [deep learning model](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004845).</li>
</ul>
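The 10 meter circle from the GPS accuracy item could be checked with a great-circle distance, as in the sketch below. The haversine formula and the mean Earth radius are standard; the `within_radius` helper and the nearby test point are assumptions for illustration, while r1 and r2 are taken from the sample table.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude pairs."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(fix, place, radius_m=10.0):
    """True if a GPS fix lies inside the circle around a known place."""
    return haversine_m(*fix, *place) <= radius_m


r1 = (37.786137, -122.409143)
r2 = (37.785737, -122.410922)
nearby = (37.786177, -122.409143)  # hypothetical fix a few meters north of r1

print(within_radius(nearby, r1))
print(within_radius(r2, r1))
```

Under this check the hypothetical nearby fix would be matched to the same place as r1, while r2 would be treated as a different location.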