# Understanding user and product interaction

The VP of Product at company ABC has asked you to review how users interact with the company's online travel website.
The data is stored in JSON files. Each row in these files lists all the cities that a user searched for within the same session (as well as some other information about the user).

Business questions:
1. There was a bug in the code and one country didn't get logged. Can you guess which country that was? How?

2. For each city, find the most likely city to be also searched for within the same session.
3. Travel sites are browsed by two kinds of users: those who are actually planning a trip and those who are just dreaming about a vacation. The former obviously have a much higher purchase intent. Users planning a trip often search for cities close to each other, while users who search for cities far away from each other are often just dreaming about a vacation. Based on this idea, can you come up with an algorithm that clusters sessions into two groups: high intent and low intent?



In [1]:
# install packages if needed
list.of.packages <- c("dplyr","jsonlite","arules")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")

In [3]:
library(dplyr)
library(jsonlite)
library(arules)

### Load JSON and transform it to a data frame

In [4]:
# load json file
json_data <- fromJSON("data/challenge_3.json", flatten=TRUE)

In [5]:
# extract values from user field: build one data frame per row, then stack them
# in a single rbind call instead of growing the result inside a loop
aux <- do.call(rbind, lapply(json_data$user, data.frame))

# bind user columns with the initial dataset
json_data <- cbind(json_data, aux)

# remove previous user column
json_data$user <- NULL

# convert the id, timestamp and cities columns to plain character columns
json_data <- json_data %>%
  mutate(
    session_id = as.character(session_id),
    unix_timestamp = as.character(unix_timestamp),
    cities = as.character(cities)
  )

## First Target: Guessing the missing Country

In [6]:
# Guessing the missing country
mysteriours_country <- json_data%>%filter(country=="")

# write cities in csv file
write.table(mysteriours_country[,"cities"], "data/mysteriours_country.csv", quote=FALSE, row.names=FALSE, col.names=FALSE, sep=",")

# number of sessions per country
json_data%>%count(country)

| country | n |
|---|---|
| (empty) | 2820 |
| DE | 3638 |
| ES | 1953 |
| FR | 2298 |
| IT | 1882 |
| UK | 3555 |
| US | 3876 |


In [7]:
head(mysteriours_country$cities)

<span style="color:red">It's probably Canada. Many Canadian cities (such as "Toronto ON", "Montreal QC" and "Vancouver BC", with 457, 334 and 207 searches respectively) appear in sessions whose country field is empty. In addition, Canada accounts for many of the searched cities yet does not appear in the list of distinct countries in the dataset. For more details, see the bottom of [Notebook 3.2](Challenge 3.2 - Understanding user and product interaction.ipynb).</span>
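
A count like the one quoted above could be obtained by splitting the city strings of the sessions with an empty country and tabulating the result. The following is a minimal sketch in base R, not the code from Notebook 3.2; the `toy` data frame is made up and only assumes the same `country` and `cities` column names as `json_data`.

```r
# made-up rows standing in for json_data (assumption: same column names)
toy <- data.frame(
  country = c("", "US", "", ""),
  cities = c("Toronto ON, Montreal QC", "New York NY",
             "Toronto ON", "Vancouver BC, Toronto ON"),
  stringsAsFactors = FALSE
)

# keep only the sessions whose country was not logged
missing <- toy[toy$country == "", ]

# split each session into cities, trim the space after each comma, and count
city_list <- trimws(unlist(strsplit(missing$cities, ",")))
city_freq <- sort(table(city_list), decreasing = TRUE)

print(city_freq)
```

On the real data, replacing `toy` with `json_data` would give the per-city search counts for the sessions with a missing country.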

### Writing the cities searched in sessions

In [9]:
# writing transactions in a csv file
write.table(json_data[,"cities"], "data/cities.csv", quote=FALSE, row.names=FALSE, col.names=FALSE, sep=",")

## Second Target: For each city, find the most likely city to be also searched for within the same session.


**The goal is**: For each city, find the most likely city to be also searched for within the same session.

**Solution**: The safest way to solve this is to calculate, for each pair of cities, how often the two are searched in the same session (frequency) and the corresponding ratio.

<span style="color:red">For this solution, see [Notebook 3.2](Challenge 3.2 - Understanding user and product interaction.ipynb).</span>
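
The frequency-and-ratio idea can be sketched in base R as follows. This is not the code from Notebook 3.2: the sessions are made up, and `ratio` is taken to mean the share of a city's sessions that also contain the other city, which is an assumption about the intended definition.

```r
# made-up sessions, one comma-separated string per session
sessions <- c("New York NY, Newark NJ",
              "New York NY, Newark NJ, Boston MA",
              "Boston MA, New York NY",
              "New York NY, Newark NJ")

# split on the comma and trim the whitespace around each city
city_sets <- lapply(strsplit(sessions, ","), function(x) unique(trimws(x)))

# session-by-city indicator matrix (1 if the city appears in the session)
all_cities <- sort(unique(unlist(city_sets)))
indicator <- t(sapply(city_sets, function(x) as.integer(all_cities %in% x)))
colnames(indicator) <- all_cities

# cooc[a, b] = number of sessions containing both a and b
cooc <- crossprod(indicator)
city_counts <- diag(cooc)   # number of sessions containing each city
diag(cooc) <- 0             # a city is not its own partner

# ratio[a, b] = share of a's sessions that also contain b (row a divided by count of a)
ratio <- cooc / city_counts

# for each city, the partner with the highest ratio
best_partner <- all_cities[apply(ratio, 1, which.max)]
names(best_partner) <- all_cities
print(best_partner)
```

Note that `which.max` returns the first maximum, so ties between partners would need an explicit rule on real data.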

### Association Rule Learning: Apriori
Another way to analyze this kind of data is to treat the cities searched in a session as a transaction, and then use an association rule learning algorithm such as Apriori to discover the frequent item sets (here, sets of cities) and the associations between cities.

In [11]:
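# transforming sessions to transactions
# note: splitting on "," keeps the space after each comma, so " Miami FL" and
# "Miami FL" are treated as different items; splitting on ",\\s*" avoids this
items <- strsplit(json_data$cities, ",")
transactions <- as(items, "transactions")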

In [12]:
# running apriori
rules <- apriori(transactions, parameter = list(support = 0.001, confidence = 0.05, target = "rules", minlen = 2))

Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
       0.05    0.1    1 none FALSE            TRUE       5   0.001      2
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 20 

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[107 item(s), 20022 transaction(s)] done [0.00s].
sorting and recoding items ... [86 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [581 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].


In [13]:
# checking the association rules
inspect(head(rules))

| | lhs | rhs | support | confidence | lift | count |
|---|---|---|---|---|---|---|
| [1] | `{ Hialeah FL}` | `{Jacksonville FL}` | 0.001198681 | 1.0000000 | 49.315271 | 24 |
| [2] | `{Jacksonville FL}` | `{ Hialeah FL}` | 0.001198681 | 0.0591133 | 49.315271 | 24 |
| [3] | `{ Miami FL}` | `{Jacksonville FL}` | 0.001198681 | 1.0000000 | 49.315271 | 24 |
| [4] | `{Jacksonville FL}` | `{ Miami FL}` | 0.001198681 | 0.0591133 | 49.315271 | 24 |
| [5] | `{ Bakersfield CA}` | `{Los Angeles CA}` | 0.001897912 | 0.9743590 | 9.828018 | 38 |
| [6] | `{ Saint Petersburg FL}` | `{Jacksonville FL}` | 0.002047747 | 1.0000000 | 49.315271 | 41 |


<span style="color:red">
This quick analysis shows a strong association between Hialeah and Jacksonville, together with the number of sessions in which the pair occurs (count) and its share of all sessions (support). Besides highlighting the most representative association rules, this type of analysis can relate a city not only to one other city but to several cities in a single rule.
</span>
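
The measures reported for rules [1] and [2] can be checked by hand from the counts. A small worked sketch: the transaction total comes from the Apriori log and the pair count from the rule output, while the number of sessions containing `Jacksonville FL` (406) is not printed anywhere and is inferred here from rule [2]'s confidence, so treat it as an assumption.

```r
n_transactions <- 20022   # from the apriori log ("20022 transaction(s)")
pair_count <- 24          # count column of rules [1] and [2]
rhs_count <- 406          # assumption: inferred as 24 / 0.0591133, not printed in the output

# support: share of all sessions that contain both cities
support <- pair_count / n_transactions

# confidence of rule [2] (Jacksonville => Hialeah): pair sessions over Jacksonville sessions
confidence_2 <- pair_count / rhs_count

# lift of rule [1]: its confidence (1.0) divided by the support of the right-hand side
lift_1 <- 1.0 / (rhs_count / n_transactions)

print(round(c(support = support, confidence = confidence_2, lift = lift_1), 6))
```

Under that assumption the three values agree with the printed support, confidence and lift up to rounding.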

## Third Target: An algorithm that clusters sessions into two groups: high intent and low intent.


Using the **latitude** and **longitude** of each city searched in a session, calculate the pairwise distances between the cities and then extract aggregate measures from those distances, such as the mean, the standard deviation, the minimum and the maximum.

Using a clustering algorithm such as **K-means** on these measures, we could then separate and visualize the sessions and assign the resulting clusters to the high-intent and low-intent groups.
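
A minimal sketch of this idea in base R, under stated assumptions. The `coords` lookup table (approximate coordinates), the `haversine_km` and `session_features` helpers and the toy sessions are all illustrative and not part of the challenge data. Sessions with a single city have no pairwise distance and would need a separate rule, so the sketch only handles sessions with at least two cities.

```r
# illustrative city coordinates (assumption: the challenge data has no such table)
coords <- data.frame(
  city = c("New York NY", "Newark NJ", "Los Angeles CA",
           "Toronto ON", "Montreal QC", "Vancouver BC"),
  lat = c(40.7128, 40.7357, 34.0522, 43.6532, 45.5017, 49.2827),
  lon = c(-74.0060, -74.1724, -118.2437, -79.3832, -73.5673, -123.1207),
  stringsAsFactors = FALSE
)

# great-circle distance in km between two points given in degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# mean and maximum pairwise distance for one session (needs at least two known cities)
session_features <- function(cities) {
  idx <- match(cities, coords$city)
  stopifnot(!anyNA(idx), length(idx) >= 2)
  pairs <- combn(idx, 2)
  d <- haversine_km(coords$lat[pairs[1, ]], coords$lon[pairs[1, ]],
                    coords$lat[pairs[2, ]], coords$lon[pairs[2, ]])
  c(mean_km = mean(d), max_km = max(d))
}

# made-up sessions
sessions <- list(
  s1 = c("New York NY", "Newark NJ"),
  s2 = c("Toronto ON", "Montreal QC"),
  s3 = c("New York NY", "Los Angeles CA"),
  s4 = c("Montreal QC", "Vancouver BC"),
  s5 = c("New York NY", "Newark NJ", "Toronto ON")
)
features <- t(sapply(sessions, session_features))

# two clusters; the one with the smaller mean distance is labelled high intent
set.seed(42)
km <- kmeans(features, centers = 2, nstart = 10)
high_cluster <- which.min(km$centers[, "mean_km"])
intent <- setNames(ifelse(km$cluster == high_cluster, "high", "low"), rownames(features))
print(intent)
```

Labelling the cluster with the smaller mean distance as high intent follows the premise of the business question. Whether two clusters separate cleanly on the real sessions would have to be checked, for example by plotting the features coloured by cluster.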