![](imgs/miredtwitter_header2.png)  
<center>
<h1>
Dime a quiénes sigues y te diré qué prefieres
</h1>

<h2>
[Tell me who you follow and I'll tell you what you like]
</h2>

<h2>
(SNA, ML and a bit of NLP)
</h2>
<h3>Pablo Gabriel Celayes</h3>
<h4>November 16th, 2017 - PyData San Luis</h4>
</center>

# Roadmap

* Introduction

* Toolkit

* Dataset

* Social prediction

* Adding NLP


# Introduction: who am I?

* Mathematician (2006)

* Computer Scientist (2017)

* ![](imgs/mitwitter.png)

# Introduction: what I did

## Original idea

- *Personalized* content-based recommender system of news articles.
- How can we improve it with social information (from external sources or inferred relations)?
- Explicit preference information collected by the Cogfor platform.

![](imgs/cogfor.png)

## Mutation

- Build our own dataset ( users, preferences, connections ).
- Start predicting preferences using only social information.
- Improve using content-based features.


# Toolkit

## Data
![](imgs/tweepy.png)
![](imgs/sqlalchemy.png)

## Networks
![](imgs/networkx.png)
![](imgs/graphtool.png)

## Data analysis
![](imgs/numpy.png)
![](imgs/pandas.png)
![](imgs/jupyter.png)

## ML + NLP
![](imgs/sklearn.png)
![](imgs/nltk.png)
![](imgs/gensim.png)
![](imgs/twitterLDA.png)

## Visualization
![](imgs/bokeh.png)
![](imgs/gephi_small.png)
### pyLDAvis

# Dataset: Social graph

- Up to 3 steps along the $\texttt{follow}$ relation, starting from my own profile.
- We consider only **relevant** users ( >40 followed/followers)
- ~3M users
- ~10M connections


*** First step... ***

( it wouldn't be an SNA talk without a picture like this )

![](imgs/miredtwitter.png)


# Dataset: Content

- Subgraph $G$:
    - we start off with a small set of *seed* users
    - for each user, we add the 50 accounts she follows with the highest *affinity* ( affinity: rate of followed accounts in common )
    - we repeat the process until no new users are added
    - 5180 users
    - ~230k connections

- Tweets:
    - *timelines* between August 25 and September 24, 2015 (+ *retweets* and *favs*)
    - 2M tweets ( 1.6M in Spanish )
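
The construction of $G$ can be sketched as a fixed-point loop. This is only an illustration, not the code behind the talk: `follows` is a hypothetical dict mapping each user to the set of accounts she follows, and affinity is assumed to be the fraction of a user's followed accounts that the candidate also follows, which is one possible reading of "rate of common followed".

```python
def affinity(follows, u, v):
    """Assumed affinity: share of u's followed accounts that v also follows."""
    fu, fv = follows.get(u, set()), follows.get(v, set())
    return len(fu & fv) / len(fu) if fu else 0.0


def expand_subgraph(follows, seeds, k=50):
    """Add each user's k most affine followed accounts until nothing new appears."""
    selected = set(seeds)
    frontier = set(seeds)
    while frontier:
        new_users = set()
        for u in frontier:
            followed = follows.get(u, set())
            # Decreasing affinity; ties are broken by name to stay deterministic.
            ranked = sorted(followed, key=lambda v: (-affinity(follows, u, v), v))
            new_users.update(ranked[:k])
        frontier = new_users - selected
        selected |= frontier
    return selected


# Toy follow graph (hypothetical data).
follows = {
    "a": {"b", "c", "d"},
    "b": {"c", "d"},
    "c": {"d"},
    "d": {"a"},
    "e": {"a"},
}
subgraph_users = expand_subgraph(follows, seeds={"a"}, k=2)
```

The loop stops because `selected` only grows and the set of users is finite.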

# Social prediction

Given a user $u$, how much can we learn about the **content** she **prefers** from what we know about the preferences of the users in her **social neighborhood**?

- **Content** =  _"visible"_ tweets in Spanish ( $T_u$ )
- **Preferences** = retweets
- **Neighborhood** = followed + followed-by-followed ( $E_u$ )
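
A minimal sketch of how $E_u$ could be computed, assuming a hypothetical dict `follows` from each user to the set of accounts she follows. Excluding $u$ herself from the result is an assumption of this sketch (her own retweets are the prediction target).

```python
def neighborhood(follows, u):
    """E_u: accounts followed by u, plus accounts followed by those accounts."""
    followed = set(follows.get(u, ()))
    second_level = set()
    for v in followed:
        second_level |= set(follows.get(v, ()))
    # Assumption: u is left out of her own neighborhood.
    return (followed | second_level) - {u}


# Toy data (hypothetical).
follows = {"u": {"a", "b"}, "a": {"c", "u"}, "b": {"c", "d"}, "e": {"u"}}
E_u = neighborhood(follows, "u")
```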


## Visible tweets

* Shared by $u$ or by the users she follows
* We exclude those _written_ by $u$
* At most $10000$ ( we subsample negative examples when necessary )
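
One way the cap could be applied, as a sketch: keep every positive example (tweets retweeted by $u$) and draw a uniform random sample of the negatives. The function name, the uniform sampling and the fixed seed are assumptions, not details given in the talk.

```python
import random


def cap_visible_tweets(tweet_ids, retweeted_by_u, max_tweets=10000, seed=0):
    """Keep all positives and subsample negatives so the total fits max_tweets."""
    positives = [t for t in tweet_ids if t in retweeted_by_u]
    negatives = [t for t in tweet_ids if t not in retweeted_by_u]
    # Room left for negatives; if positives alone exceed the cap, none are kept.
    room = max(max_tweets - len(positives), 0)
    if len(negatives) > room:
        negatives = random.Random(seed).sample(negatives, room)
    return positives + negatives


# Toy usage with a small cap (hypothetical data).
capped = cap_visible_tweets(list(range(100)), retweeted_by_u={1, 2, 3}, max_tweets=10)
```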

## User selection

* Training and evaluating models is computationally expensive
* I'm lazy and I wanted to do everything on my laptop
* So we pick which users to focus on for our predictions:
    * $A$ = $1000$ most *active* (number of tweets)
    * $I$ = $1000$ most *important* (Katz centrality)
    * The lucky ones are $A \cap I$ : $194$ users
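
The selection $A \cap I$ can be sketched with NumPy on a toy graph. Katz centrality is computed here from its closed form $x = \beta (I - \alpha A^T)^{-1} \mathbf{1}$; the values of `alpha` and `beta` and all the data are assumptions for illustration (NetworkX also provides `katz_centrality` for real graphs).

```python
import numpy as np


def katz_centrality(adj, alpha=0.1, beta=1.0):
    """Closed-form Katz centrality; adj[i, j] = 1 when user i follows user j.

    alpha must be smaller than 1 / (largest eigenvalue of adj).
    """
    n = adj.shape[0]
    # x = alpha * A^T x + beta  <=>  (I - alpha * A^T) x = beta * 1
    return np.linalg.solve(np.eye(n) - alpha * adj.T, beta * np.ones(n))


def top_k(scores, k):
    """Indices of the k highest scores."""
    return set(np.argsort(-np.asarray(scores), kind="stable")[:k].tolist())


# Toy data: 5 users, (follower, followed) edges and tweet counts.
adj = np.zeros((5, 5))
for follower, followed in [(1, 0), (2, 0), (3, 0), (4, 0), (0, 1), (2, 1), (3, 4)]:
    adj[follower, followed] = 1
n_tweets = np.array([50, 400, 10, 300, 5])

active = top_k(n_tweets, k=2)                  # A: most active users
important = top_k(katz_centrality(adj), k=2)   # I: most central users
selected = active & important                  # A ∩ I
```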

## Feature extraction

* $E_u = \{u_1, u_2, \ldots , u_n \}$ 

* $T_u = \{ t_1, \ldots, t_m \}$

$$
  M_u := [ \verb|tweet_in_tl|(t_i, u_j) ]_{\substack{ 1 \leq i \leq m \\ 1 \leq j \leq n}}  
$$


$$
  y_u := [ \texttt{tweet_in_tl}(t_i, u) ]_{ 1 \leq i \leq m }  
$$
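
A minimal sketch of how $M_u$ and $y_u$ could be assembled, assuming a hypothetical dict `timelines` that maps each user to the set of tweet ids in her timeline (so `tweet_in_tl(t, v)` becomes a set membership test).

```python
import numpy as np


def build_features(tweets, neighbors, timelines, u):
    """Return (M_u, y_u) as 0/1 arrays: rows are tweets, columns are neighbors."""
    M = np.array([[int(t in timelines.get(v, set())) for v in neighbors]
                  for t in tweets])
    y = np.array([int(t in timelines.get(u, set())) for t in tweets])
    return M, y


# Toy data (hypothetical): 4 visible tweets, 2 neighbors.
timelines = {"u": {1, 3}, "a": {1, 2}, "b": {3}}
M_u, y_u = build_features(tweets=[1, 2, 3, 4], neighbors=["a", "b"],
                          timelines=timelines, u="u")
```

Each row of `M_u` says which neighbors shared the tweet; `y_u` says whether $u$ did.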



## Classification problem

* Predict $y_u$ from the rows of $M_u$
* Dataset split:
    * $70\%$ training ($M^{tr}_u, y^{tr}_u$)
    * $10\%$ tuning ($M^{tu}_u, y^{tu}_u$)
    * $20\%$ test ($M^{te}_u, y^{te}_u$)
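
The 70/10/20 split can be obtained with two calls to scikit-learn's `train_test_split`. The stratification, the random seed and the toy data are assumptions of this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
M = rng.integers(0, 2, size=(100, 5))   # toy stand-in for M_u
y = np.array([1] * 20 + [0] * 80)       # toy stand-in for y_u

# First hold out 20% for testing...
M_rest, M_te, y_rest, y_te = train_test_split(
    M, y, test_size=0.2, stratify=y, random_state=0)
# ...then 12.5% of the remaining 80% (10% of the total) for tuning.
M_tr, M_tu, y_tr, y_tu = train_test_split(
    M_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=0)
```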

## Support Vector Machines

* Goal: **maximize** margin and **minimize** errors
* Kernel functions make it possible to find **non-linear** decision boundaries:
    * *Radial Basis Function*
    * Polynomial

![](imgs/svm_linsep_err.png)
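
To illustrate why kernels matter, here is a small sketch on synthetic data (two concentric circles from `make_circles`, not the Twitter dataset): a linear SVM cannot separate the classes, while an RBF kernel can. Accuracy is measured on the training points only, to keep the example short.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.3, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(f"linear: {linear_acc:.2f}  rbf: {rbf_acc:.2f}")
```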



## Classification quality metric

$$\texttt{precision} := \frac{|\{x_i | f(x_i) = 1 \text{ and } y_i = 1 \}|}{|\{x_i | f(x_i) = 1 \}|}$$

$$ $$

$$\texttt{recall} := \frac{|\{x_i | f(x_i) = 1 \text{ and } y_i = 1 \}|}{|\{x_i | y_i = 1 \}|}$$

$$ $$

$$\texttt{F1} := \frac{2}{\frac{1}{\texttt{precision}} + \frac{1}{\texttt{recall}}} = \frac{2 \cdot \texttt{precision} \cdot \texttt{recall} }{\texttt{precision} + \texttt{recall}}$$
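
A direct transcription of the three definitions, as a sketch. Returning $0$ when a denominator is empty is a convention chosen here, not something stated in the talk.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 2 true positives, 3 predicted positives, 4 actual positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# precision = 2/3, recall = 1/2, F1 = 4/7
```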



## Hyper-parameter tuning

* Exhaustive search using $\verb|GridSearchCV|$
* $3$-fold cross validation
* Goal: maximize $F1$
* Grid:
```
{
    "C": [ 0.01, 0.1, 1 ],
    "class_weight": [ "balanced", None ], 
    "gamma": [ 0.1, 1, 10 ],
    "kernel": [ "rbf", "poly" ]
}
```
    * $C$: controls the balance between margin and error
    * $class\_weight$: gives more importance to the minority (positive) class
    * $gamma$: controls the shape of the decision boundary
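
A sketch of the search with the grid above, run on synthetic imbalanced data instead of $M_u$. The data, the seed and the `max_iter` cap (added only so the toy run finishes quickly, possibly with convergence warnings) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.01, 0.1, 1],
    "class_weight": ["balanced", None],
    "gamma": [0.1, 1, 10],
    "kernel": ["rbf", "poly"],
}

# Toy imbalanced problem: about 20% positives.
X, y = make_classification(n_samples=120, n_features=8,
                           weights=[0.8, 0.2], random_state=0)

# Exhaustive search, 3-fold cross validation, F1 as the selection criterion.
search = GridSearchCV(SVC(max_iter=10000), param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is refit on all the data passed to `fit` and can be used directly for prediction.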

## Results

* $F1$ on $M^{tu}_u$
* Average $87.7 \%$

![](imgs/f1s_social_valid.png)

# *... and without looking at the content!*



![](imgs/robot2.png)

# Adding NLP...

![](imgs/robotlee.jpg)

## (Further) selection of users

* $F1 < 0.75$ on $M^{tu}_u$ ( $27$ users )

* ... and $10$ random users

## Pre-processing

- Normalization
- Tokenization
- Dictionary
- *Bag of words*
- LDA


## Normalization

- remove URLs
- lower case
- remove accents
- collapse vowel and space repetitions

![](imgs/normalize.png)
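
The four normalization steps could look roughly like this with the standard library. The regular expressions are assumptions of this sketch; in particular, accent removal via Unicode decomposition also maps `ñ` to `n`, and vowel collapsing also shortens legitimate double vowels (`leer` becomes `ler`).

```python
import re
import unicodedata


def normalize(text):
    """Remove URLs, lowercase, strip accents, collapse vowel and space runs."""
    text = re.sub(r"https?://\S+", "", text)
    text = text.lower()
    # Decompose characters and drop the combining marks (accents).
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))
    text = re.sub(r"([aeiou])\1+", r"\1", text)   # "goooool" -> "gol"
    text = re.sub(r"\s+", " ", text).strip()
    return text


normalized = normalize("Goooool!!   Qué PARTIDO http://t.co/abc")
# normalized == "gol!! que partido"
```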

## Tokenization

- split into words
- remove punctuation
- _stemming_
- remove 1-character words

![](imgs/tokenize.png)
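
A sketch of the tokenization steps, with stemming left as a pluggable function. The talk does not say which stemmer was used; with NLTK installed, `SnowballStemmer("spanish").stem` could be passed as `stem`. The default here is the identity, so the example runs without extra dependencies.

```python
import re


def tokenize(text, stem=lambda word: word):
    """Split into words, drop punctuation, stem, and remove 1-character tokens."""
    words = re.findall(r"\w+", text)      # word characters only, no punctuation
    stems = [stem(w) for w in words]
    return [s for s in stems if len(s) > 1]


tokens = tokenize("gol!! que partido a las 9")
# tokens == ["gol", "que", "partido", "las"]
```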

## Dictionary

* Vocabulary: all of $T$, tokenized.
* Significant terms only ( present in at least $100$ tweets ).
* Informative terms only ( present in at most $30\%$ of $T$ ).

* Dictionary $D$ with ~11K terms.
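
The two frequency filters can be sketched with a document-frequency counter. Function and parameter names are hypothetical; gensim's `Dictionary.filter_extremes(no_below, no_above)` offers comparable filtering.

```python
from collections import Counter


def build_dictionary(tokenized_tweets, min_tweets=100, max_fraction=0.3):
    """Map each kept term to an index, filtering by document frequency."""
    doc_freq = Counter()
    for tokens in tokenized_tweets:
        doc_freq.update(set(tokens))      # count each term once per tweet
    n = len(tokenized_tweets)
    kept = sorted(term for term, df in doc_freq.items()
                  if df >= min_tweets and df <= max_fraction * n)
    return {term: i for i, term in enumerate(kept)}


# Toy corpus with thresholds scaled down to match its size.
corpus = [["gol", "partido"], ["gol", "que"], ["que", "partido"],
          ["que", "lluvia"], ["sol"]]
dictionary = build_dictionary(corpus, min_tweets=2, max_fraction=0.5)
# "que" is too frequent, "lluvia" and "sol" too rare.
```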



## *Bag of words*

- Text $t$:  $\rightarrow$ multiset (*bag*) of $D$-terms in $t$.
- Order is unimportant, but repetitions count.
- For tweets ( $\leq 140$ characters ), it's normally a *set* ( $0$ or $1$ occurrences).

- We fix an order for the dictionary $D = \{ t_1, \ldots, t_{11000} \}$, so each tweet $\rightarrow$ vector of integer (in practice boolean) features:

$$ v_{BOW}(tweet) = [count(t_i, tokens(tweet))]_{i=1}^{11000} $$

- Sparse representation is used for $v_{BOW}(tweet)$.
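
A minimal sketch of a sparse $v_{BOW}$: only the non-zero entries are stored, as sorted `(index, count)` pairs. The function name and the toy dictionary are hypothetical.

```python
from collections import Counter


def bow_vector(tokens, dictionary):
    """Sparse bag of words: sorted (term index, count) pairs for in-dictionary terms."""
    counts = Counter(t for t in tokens if t in dictionary)
    return sorted((dictionary[t], c) for t, c in counts.items())


dictionary = {"gol": 0, "partido": 1, "que": 2}
v = bow_vector(["gol", "gol", "que", "lluvia"], dictionary)
# v == [(0, 2), (2, 1)]; "lluvia" is not in the dictionary and is ignored.
```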


## LDA

* Discovers latent topics in a corpus of texts

* Can be used as a dimensionality reduction technique ( from the space of *terms* in $D$ to a space of *topics* )
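
As a sketch of the dimensionality-reduction use, the snippet below maps toy term-count vectors to topic proportions with scikit-learn's `LatentDirichletAllocation`. The library choice, the random counts and the number of topics are assumptions; the talk does not say which LDA implementation produced its results.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy document-term count matrix: 20 documents over a 12-term dictionary.
X = rng.poisson(1.0, size=(20, 12))

lda = LatentDirichletAllocation(n_components=3, random_state=0)
topics = lda.fit_transform(X)   # one row of topic proportions per document

print(X.shape, "->", topics.shape)
```

Each 12-dimensional count vector becomes a 3-dimensional vector of topic weights, which can then be appended to other features.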


### Example: LDA with 10 topics

![](imgs/lda10.png)

## Twitter-LDA

- Problem: tweets are short and usually talk about a single topic.

- Adaptations: 
    - Group all tweets from a user, and treat them as a single document.
    - Assign just one topic per tweet
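
The first adaptation (one document per user) amounts to pooling tokens by author. The sketch below shows only that grouping step, on hypothetical `(user, tokens)` pairs; it does not implement the Twitter-LDA model itself.

```python
from collections import defaultdict


def pool_by_user(tweets):
    """Concatenate the tokens of each user's tweets into one document per user."""
    docs = defaultdict(list)
    for user, tokens in tweets:
        docs[user].extend(tokens)
    return dict(docs)


user_docs = pool_by_user([
    ("a", ["gol"]),
    ("b", ["sol"]),
    ("a", ["partido"]),
])
# user_docs == {"a": ["gol", "partido"], "b": ["sol"]}
```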

## Evaluating on $M_u^{te}$

![](imgs/f1s_social_vs_socialldas.png)


* Best model: $\texttt{TwitterLDA10}$ ( average uplift of $1.7\%$ )
* NLP does not improve results much for now.
* Many cases improve on *train* but get worse on *test* (*overfitting*)

# Next steps?

- Fix overfitting in Social+NLP models

- Take temporality into account

- More features!

- Generalize ( models that do not depend on a fixed central user )

- _Retweetability_ in communities ( in progress )

- Word embeddings?


# Thank you!

**Code** https://github.com/pablocelayes/sna_classifier/tree/micai_datos2015

**Thesis (Spanish)** https://www.dropbox.com/preview/Public/tesisSNA.pdf

**Paper (English)** https://www.dropbox.com/preview/Public/retweet_prediction_micai_2017.pdf

**Twitter**: @PCelayes