# Similarity Map of WhatsApp Users

In this tutorial we briefly show how to obtain a similarity map of the users in a WhatsApp group. We will use the library ``whatstk`` included in this project.

Below we provide some theoretical background. For more detailed documentation, you can refer to [Kohonen 1982](http://campus.fi.uba.ar/mod/resource/view.php?id=34864), [Kohonen 1998](http://www.sciencedirect.com/science/article/pii/S0925231298000307), [Rojas 1996](https://page.mi.fu-berlin.de/rojas/neural/chapter/K15.pdf), [Joschka Boedecker 2015](http://ml.informatik.uni-freiburg.de/_media/documents/teaching/ss15/som.pdf), or simply take a look at the [Wikipedia page](https://en.wikipedia.org/wiki/Self-organizing_map).

## 1. Self-Organizing-Maps
A self-organizing map (SOM) is an unsupervised learning method that performs dimensionality reduction onto an output space with a predefined topology.

The picture below (from this [course](http://www.pitt.edu/~is2470pb/Spring05/FinalProjects/Group1a/tutorial/)) illustrates the main idea of a SOM:

![Diagram of a Kohonen network](http://www.pitt.edu/~is2470pb/Spring05/FinalProjects/Group1a/tutorial/kohonen1.gif)


In this picture, $\boldsymbol{x} = [x_1, \dots, x_n]$ denotes the input vector of features. Note that the network contains only a single layer.

The network is divided into two stages: (1) competitive learning and (2) a topological output space.

### 1.1 Competitive Learning

Each neuron (unit) $k$ is represented by a _prototype vector_ $\boldsymbol{w}_k$. When feeding the network with the input vector $\boldsymbol{x}$, the closest unit (unit minimizing $||\boldsymbol{x}-\boldsymbol{w}_k||$) is known as the **winning unit**. In a Winner-Takes-All (WTA) approach, only the winning unit prototype vector is updated, i.e.

$$ \Delta \boldsymbol{w}_{win} = \eta(\boldsymbol{x}-\boldsymbol{w}_{win}).$$

Note that in the WTA approach, dead units (units that never win and therefore are never updated) can easily appear. Thus, at some point we need to allow dead units to learn a bit so that they can start claiming their territory!
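The WTA rule above can be sketched in a few lines of NumPy. This is only an illustration of the update formula, not the `whatstk` implementation; the function name `wta_step` and the toy values are hypothetical.

```python
import numpy as np

def wta_step(weights, x, eta):
    """One Winner-Takes-All update; returns the index of the winning unit."""
    # Distance from the input to every prototype vector (one row per unit)
    distances = np.linalg.norm(weights - x, axis=1)
    win = int(np.argmin(distances))
    # Only the winner moves towards the input: delta_w = eta * (x - w_win)
    weights[win] += eta * (x - weights[win])
    return win

# Three toy prototype vectors in 2D (hypothetical values)
weights = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
x = np.array([0.2, 0.0])
win = wta_step(weights, x, eta=0.5)
print(win, weights[win])
```

The unit at `[5.0, 5.0]` is far from the input and is left untouched. If it stays far from every input, it never wins, which is how a dead unit arises.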

### 1.2 Kohonen Map
An alternative to the WTA approach is to allow other units to be updated as well. In particular, all units are updated according to their proximity to the winning unit in the output space. The update rule becomes

$$ \Delta \boldsymbol{w}_{k} = \eta h_k(\boldsymbol{x}-\boldsymbol{w}_{k}),$$

where the term $h_k$ quantifies the proximity of the unit $\boldsymbol{w}_k$ to the winning unit in the output space (high if they are close) and $\eta$ represents the learning rate.
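As a minimal sketch of this update rule, the snippet below assumes a line-shaped output space and a Gaussian neighbourhood function for $h_k$. Both the Gaussian form and the name `som_step` are assumptions made for illustration; the text above does not prescribe a particular $h_k$.

```python
import numpy as np

def som_step(weights, x, eta, sigma):
    """One Kohonen update on a line topology; returns the winning unit index."""
    win = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Positions of the units in the output space (a line: 0, 1, 2, ...)
    positions = np.arange(len(weights))
    # Assumed Gaussian neighbourhood: 1 at the winner, decaying with grid distance
    h = np.exp(-((positions - win) ** 2) / (2 * sigma ** 2))
    # delta_w_k = eta * h_k * (x - w_k) for every unit k
    weights += eta * h[:, None] * (x - weights)
    return win

# Four toy units placed along the first axis (hypothetical values)
weights = np.zeros((4, 2))
weights[:, 0] = [0.0, 1.0, 2.0, 3.0]
x = np.array([0.0, 1.0])
win = som_step(weights, x, eta=0.5, sigma=1.0)
print(win)
print(weights)
```

Every unit moves towards the input, but units further from the winner on the line move less.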


This simple but powerful idea allows for easy visualization of high-dimensional data in an eye-friendly format. Typical output spaces are lines, circles or 2D grids (like in the picture).




## 2. Code

Let us now begin this brief tutorial by importing the basic libraries that we will be using.

### 2.1 Initialization

In [9]:
from __future__ import print_function

In [10]:
import sys
sys.path.append('../')

In [11]:
from whatstk.wparser import WhatsAppChat

#### Create chat object
We first create a WhatsAppChat object from our chat log file. For testing purposes, we provide a sample chat log file. However, please feel free to try with your own chats.

In [66]:
wpchat = WhatsAppChat("../chats/samplechat.txt")

#### Obtain some basic data
Let us now obtain some basic data from the chat.

In [67]:
# Obtain the names of the users from the chat
users = wpchat.usernames
# Obtain list of days with interventions
days = wpchat.days
# Obtain number of interventions in the chat
num_interventions = wpchat.num_interventions

In [68]:
print("Brief summary")
# Print name of users
print("\n *", len(users), "users found: ")
for user in users:
    print("\t", user)
# Number of days the chat has been active
print("\n * Chat was active", len(days), "days")
# Number of interventions
print(" * Chat had", num_interventions, "interventions")
# Average number of interventions per day
int_day = num_interventions / len(days)
print(" * Chat had an average of %.2f" % int_day, "interventions/day")
# Average number of interventions per day per user
int_day_pers = int_day/len(users)
print(" * Chat had an average of %.2f" % int_day_pers, "interventions/day/person")

Brief summary

 * 8 users found: 
	 Ash Ketchum
	 Brock
	 Jessie & James
	 Meowth
	 Misty
	 Prof. Oak
	 Raichu
	 Wobbuffet

 * Chat was active 6 days
 * Chat had 18 interventions
 * Chat had an average of 3.00 interventions/day
 * Chat had an average of 0.38 interventions/day/person


### 2.2 Obtain input data
We start by obtaining the number of interventions of each user per day of chat activity. We can do this by calling the method `interventions` of the class `WhatsAppChat` with the argument `'days'`, which returns a `DataFrame` with one column per username.

In [69]:
# Dataframe containing the number of user interventions per day (only days of chat activity are considered)
interventions_per_day = wpchat.interventions('days')

In [70]:
# Show dataframe
interventions_per_day

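| Day | Brock | Ash Ketchum | Misty | Raichu | Jessie & James | Prof. Oak | Meowth | Wobbuffet |
|---|---|---|---|---|---|---|---|---|
| 2016-08-06 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2016-08-07 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2016-08-10 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2016-08-11 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2016-09-11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2016-10-31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |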
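The table above has one row per active day and one column per user. As a rough sketch of how such a count table could be built with pandas from parsed messages, consider the snippet below. The `messages` frame and its column names are hypothetical, and this is not necessarily how `whatstk` computes it internally.

```python
import pandas as pd

# Hypothetical parsed messages: one (day, user) pair per intervention
messages = pd.DataFrame({
    "day": ["2016-08-06", "2016-08-06", "2016-08-06", "2016-08-07"],
    "user": ["Brock", "Brock", "Misty", "Misty"],
})
# Rows are days, columns are users, cells count the interventions
counts = pd.crosstab(messages["day"], messages["user"])
print(counts)
```

Days without any message do not appear as rows, which matches the note that only days of chat activity are considered.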


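To ease the learning, we center and normalize each dimension (day): we subtract the mean of that day across users and then divide by the range of that day (maximum minus minimum).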

In [71]:
# Center each dimension
interventions_per_day = interventions_per_day.sub(interventions_per_day.mean(axis=1), axis=0)
# Normalize each dimension
interventions_per_day = interventions_per_day.divide(interventions_per_day.max(axis=1)-interventions_per_day.min(axis=1), axis=0)
# Show dataframe
interventions_per_day

| Day | Brock | Ash Ketchum | Misty | Raichu | Jessie & James | Prof. Oak | Meowth | Wobbuffet |
|---|---|---|---|---|---|---|---|---|
| 2016-08-06 | 0.625 | 0.625 | 0.625 | -0.375 | -0.375 | -0.375 | -0.375 | -0.375 |
| 2016-08-07 | 0.5 | 0.5 | 0.5 | -0.5 | -0.5 | 0.5 | -0.5 | -0.5 |
| 2016-08-10 | -0.25 | 0.25 | -0.25 | 0.75 | 0.25 | -0.25 | -0.25 | -0.25 |
| 2016-08-11 | -0.125 | -0.125 | 0.875 | -0.125 | -0.125 | -0.125 | -0.125 | -0.125 |
| 2016-09-11 | -0.125 | -0.125 | -0.125 | -0.125 | -0.125 | -0.125 | 0.875 | -0.125 |
| 2016-10-31 | -0.25 | -0.25 | -0.25 | -0.25 | -0.25 | 0.75 | -0.25 | 0.75 |
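To make the arithmetic concrete, the snippet below applies the same two pandas operations to the 2016-08-06 row alone. Only the single-row frame `row` is introduced here for illustration; the values come from the tables above.

```python
import pandas as pd

# Row for 2016-08-06: interventions of the 8 users on that day
row = pd.DataFrame([[2.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0]], index=["2016-08-06"])
# Center: subtract the row mean (6 / 8 = 0.75) from every entry
centered = row.sub(row.mean(axis=1), axis=0)
# Normalize: divide by the row range (1.25 - (-0.75) = 2)
scaled = centered.divide(centered.max(axis=1) - centered.min(axis=1), axis=0)
print(scaled.values.tolist())
```

The three users with 2 interventions end up at 1.25 / 2 = 0.625 and the remaining five at -0.75 / 2 = -0.375, as in the normalized table.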


### 2.3 Self-Organizing Map

Once we have our WhatsAppChat object created, we are ready to have some fun.

In [72]:
from whatstk.learning.som import SelfOrganizingMap

In [73]:
# We define the number of units that we will be using. A large number leads to a good global
# fit but a poor local fit (a low number leads to the opposite)
num_units = 10
# We choose an output space defined by an array of neurons arranged in a line
topology = 'line'

In [74]:
# Initialize our SOM
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
    learning_rate_initial=1, topology=topology)

In [75]:
# Train our SOM
som.train()


* Training *
- Starting parameters: 
	 learning rate = 1
	 sigma = 5.0
- Ending parameters: 
	 learning rate = 0.00034
	 sigma = 0.25075
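
The training log shows both the learning rate and sigma shrinking over the 1000 epochs. The decay law is not stated in this tutorial, so the sketch below simply assumes a geometric (exponential) interpolation between a starting and a target value. The function `geometric_decay` and the target of 0.25 are hypothetical and are not taken from `whatstk`.

```python
def geometric_decay(initial, final, epoch, num_epochs):
    """Interpolate geometrically from `initial` (epoch 0) towards `final`."""
    return initial * (final / initial) ** (epoch / num_epochs)

# Hypothetical target of 0.25 for sigma; the last of 1000 epochs is epoch 999
sigma_end = geometric_decay(5.0, 0.25, epoch=999, num_epochs=1000)
print(round(sigma_end, 5))
```

Under these assumptions the last epoch gives a sigma of roughly 0.25075 when starting from 5.0, in line with the printed ending value. That agreement is an observation about the numbers, not a statement about the library's implementation.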


Finally, we can print the results of the similarity map.

In [76]:
som.print_results()


* Results Self Organizing Map *

0 - Misty
1
2 - Brock, Ash Ketchum
3
4 - Prof. Oak
5
6 - Wobbuffet
7
8 - Meowth
9
10 - Raichu, Jessie & James


There are other topologies available. Let us try them.

In [77]:
# Circle topology is the same as line, except that the first and last components are connected
topology = 'circle'
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
    learning_rate_initial=1, topology=topology)
som.train()
som.print_results()


* Training *
- Starting parameters: 
	 learning rate = 1
	 sigma = 5.0
- Ending parameters: 
	 learning rate = 0.00034
	 sigma = 0.25075

* Results Self Organizing Map *

0 - Jessie & James
1 - Meowth
2
3 - Wobbuffet
4 - Prof. Oak
5
6 - Misty
7
8 - Brock, Ash Ketchum
9
10 - Raichu
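
The difference between the two 1D topologies lies only in how the distance between units is measured in the output space. A minimal sketch follows; the helper `grid_distance` is hypothetical and is not part of `whatstk`.

```python
def grid_distance(i, j, num_units, topology="line"):
    """Distance between units i and j in a 1D output space."""
    d = abs(i - j)
    if topology == "circle":
        # On a circle we can also go around the other way
        d = min(d, num_units - d)
    return d

# With 10 units indexed 0..9, the two end units are far apart on a line
# but direct neighbours on a circle
print(grid_distance(0, 9, 10, "line"), grid_distance(0, 9, 10, "circle"))
```

Since the first and last units are connected in the circle topology, Jessie & James (first unit) and Raichu (last unit) are neighbours on the map even though they appear at opposite ends of the printout.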


In [78]:
# We now try a 2D-grid
# Number of units now denotes the number of units per side. In total, we have 
# num_units*num_units units

num_units = 5 
topology = '2dgrid'
som = SelfOrganizingMap(interventions_per_day, num_units, sigma_initial=num_units/2, num_epochs=1000,
    learning_rate_initial=1, topology=topology)
som.train()
som.print_results()


* Training *
- Starting parameters: 
	 learning rate = 1
	 sigma = 2.5
- Ending parameters: 
	 learning rate = 0.00034
	 sigma = 0.25058

* Results Self Organizing Map *

           0            1          2  3               4
0  Prof. Oak               Wobbuffet     Jessie & James
1                                                Raichu
2             Ash Ketchum                              
3                              Brock                   
4      Misty                                     Meowth
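
In the 2D grid, proximity between units is measured on the grid itself. The sketch below assumes row-major unit indexing and Euclidean distance between grid coordinates; both are assumptions made for illustration, and `grid_distances` is a hypothetical helper.

```python
import numpy as np

num_units = 5  # units per side, as in the cell above
# (row, col) coordinates of the 25 units, in row-major order
coords = np.array([(r, c) for r in range(num_units) for c in range(num_units)])

def grid_distances(win):
    """Euclidean distance in the output grid from unit `win` to every unit."""
    return np.linalg.norm(coords - coords[win], axis=1)

d = grid_distances(0)  # distances from the top-left unit
print(d.reshape(num_units, num_units).round(2))
```

Units that are close in this grid are updated together during training, so users placed in nearby cells can be read as having similar activity patterns.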
