# Recommending questions on CrossValidated
In this example, we'll try to recommend questions to the users of stats.stackexchange.com who are most likely to answer them.

## Loading the data
The full CrossValidated dataset is available at https://archive.org/details/stackexchange. Helper functions to obtain and process it are defined in `data.py`, and we are going to use them here.

In [1]:
import data

(interactions, question_features,
 user_features, question_vectorizer,
 user_vectorizer) = data.read_data() # This will download the data if not present

`interactions` is a sparse matrix whose (i, j) entry is 1 if the i-th user posted an answer to the j-th question; the goal is to recommend questions to the users who might answer them. `question_features` is a sparse matrix containing question metadata in the form of tags, and `user_features` plays the same role for users (here, user ID indicators). `question_vectorizer` and `user_vectorizer` are `sklearn.feature_extraction.DictVectorizer` instances that translate these features into vector form.
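
To make the role of the vectorizer concrete, here is a minimal sketch of how a `DictVectorizer` turns tag dictionaries into a sparse matrix and back. The toy tag dictionaries and the `toy_*` names are made up for illustration; they are not taken from `data.py`.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# Toy tag dictionaries, one per question (illustrative, not real data).
toy_questions = [
    {'bayesian': 1, 'prior': 1},
    {'distributions': 1, 'normality': 1},
    {'bayesian': 1, 'normality': 1},
]

# Each distinct tag becomes one column of a sparse matrix.
toy_vectorizer = DictVectorizer(dtype=np.int32)
toy_features = toy_vectorizer.fit_transform(toy_questions)
toy_shape = toy_features.shape

# inverse_transform maps rows back to tag dictionaries.
round_trip = toy_vectorizer.inverse_transform(toy_features[:1])

print(toy_vectorizer.feature_names_)
print(repr(toy_features))
print(round_trip)
```

Each row of the resulting matrix corresponds to a question and each column to a tag, which is the layout `question_features` follows.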

In [2]:
print(repr(interactions))
print(repr(question_features))
print(question_vectorizer.inverse_transform(question_features[:3]))
print(user_vectorizer.inverse_transform(user_features[0]))

<61885x63570 sparse matrix of type '<type 'numpy.int32'>'
	with 63006 stored elements in Compressed Sparse Row format>
<63570x1189 sparse matrix of type '<type 'numpy.int32'>'
	with 171861 stored elements in Compressed Sparse Row format>
[{'bayesian': 1, 'elicitation': 1, 'prior': 1}, {'distributions': 1, 'normality': 1}, {'open-source': 1, 'software': 1}]
[{'user_id:45': 1}]


We can split the dataset into train and test sets by using utility functions defined in `model.py`.

In [2]:
import model
import inspect
print(inspect.getsource(model.train_test_split))

def train_test_split(interactions):

    train = interactions.copy()
    test = interactions.copy()

    for i in range(len(train.data)):
        if random.random() < 0.2:
            train.data[i] = 0
        else:
            test.data[i] = 0

    train.eliminate_zeros()
    test.eliminate_zeros()

    return train, test



In [3]:
train, test = model.train_test_split(interactions)
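
The loop in `train_test_split` visits each stored interaction once and sends it to the test set with probability 0.2. As a sketch of the same idea, the split can also be expressed with a boolean mask over the stored entries. The function name `split_interactions`, its `test_fraction` and `seed` parameters, and the toy matrix are assumptions made for this example; this is not the code in `model.py`.

```python
import numpy as np
import scipy.sparse as sp


def split_interactions(interactions, test_fraction=0.2, seed=0):
    """Hypothetical mask-based variant of the random split above."""
    coo = interactions.tocoo()
    rng = np.random.RandomState(seed)

    # One random draw per stored interaction decides where it goes.
    in_test = rng.random_sample(coo.nnz) < test_fraction

    def subset(mask):
        return sp.csr_matrix(
            (coo.data[mask], (coo.row[mask], coo.col[mask])),
            shape=coo.shape)

    return subset(~in_test), subset(in_test)


# Toy binary interaction matrix (illustrative only).
toy = sp.random(50, 40, density=0.1, format='csr', random_state=0)
toy.data[:] = 1

toy_train, toy_test = split_interactions(toy)
print(toy_train.nnz, toy_test.nnz)
```

Either way, every interaction ends up in exactly one of the two matrices, and both keep the shape of the original.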

## Fitting models
### Traditional MF model
Let's start with a traditional collaborative filtering model that does not use any metadata. We can do this using `lightfm` -- we simply do not pass in any metadata matrices. We'll use the following function to train a WARP model.

In [17]:
print(inspect.getsource(model.fit_lightfm_model))

def fit_lightfm_model(interactions, post_features):

    model = lightfm.LightFM(loss='warp',
                            no_components=30)

    model.fit(interactions,
              item_features=post_features,
              epochs=10)

    return model



In [18]:
mf_model = model.fit_lightfm_model(train, None)

The following function will compute the AUC score on the test set:

In [20]:
print(inspect.getsource(model.auc_lightfm))
mf_score = model.auc_lightfm(mf_model, test, None)
print(mf_score)

def auc_lightfm(model, interactions, post_features):

    no_users, no_items = interactions.shape

    pid_array = np.arange(no_items, dtype=np.int32)

    scores = []

    for i in range(interactions.shape[0]):
        uid_array = np.empty(no_items, dtype=np.int32)
        uid_array.fill(i)
        predictions = model.predict(uid_array,
                                    pid_array,
                                    item_features=post_features,
                                    num_threads=2)
        y = np.squeeze(np.array(interactions[i].todense()))

        try:
            scores.append(roc_auc_score(y, predictions))
        except ValueError:
            # Just one class
            pass

    return sum(scores) / len(scores)

0.540323033962


Oops. That's barely better than random (an AUC of 0.5). In this case, this is because the CrossValidated dataset is very sparse: there just aren't enough interactions to support a traditional collaborative filtering model. In general, we'd also like to recommend questions that have no answers yet, and a pure collaborative model has nothing to go on for those, making it doubly ineffective.
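
To put a number on the sparsity, here is a quick back-of-the-envelope check using the shape and stored-element count from the `repr(interactions)` output printed earlier. The variable names are our own.

```python
# Figures copied from the repr(interactions) output above.
no_users, no_questions, no_answers = 61885, 63570, 63006

# Fraction of the user-question matrix that is non-zero.
density = float(no_answers) / (no_users * no_questions)

# Average number of answers per user row.
answers_per_user = float(no_answers) / no_users

print('density: {:.2e}'.format(density))
print('answers per user: {:.2f}'.format(answers_per_user))
```

That works out to roughly one answer per user on average, which leaves very little signal for a model that learns from interactions alone.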

### Content-based model
To remedy this, we can try using a content-based model. The following code uses question tags to estimate a logistic regression model for each user, predicting the probability that a user would want to answer a given question.

In [21]:
print(inspect.getsource(model.fit_content_models))

def fit_content_models(interactions, post_features):

    models = []

    for user_row in interactions:
        y = np.squeeze(np.array(user_row.todense()))

        model = LogisticRegression(C=0.4)
        try:
            model.fit(post_features, y)
        except ValueError:
            # Just one class
            pass

        models.append(model)

    return models



Running this and evaluating the AUC score gives:

In [22]:
content_models = model.fit_content_models(train, question_features)

In [23]:
content_score = model.auc_content_models(content_models, test, question_features)
print(content_score)

0.622322622991


That's a bit better, but not great.
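
The source of `model.auc_content_models` is not printed in this notebook. As a rough sketch of what such an evaluation could look like, assuming it mirrors `auc_lightfm` and averages per-user AUC scores, consider the following. The function name `auc_content_models_sketch` and the toy data are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def auc_content_models_sketch(models, interactions, post_features):
    """Hypothetical per-user AUC, averaged over users that can be scored."""
    scores = []
    for i, user_model in enumerate(models):
        if not hasattr(user_model, 'coef_'):
            continue  # model was never fitted (user had a single class)
        y = np.squeeze(np.asarray(interactions[i].todense()))
        if y.min() == y.max():
            continue  # AUC is undefined with only one class present
        predictions = user_model.decision_function(post_features)
        scores.append(roc_auc_score(y, predictions))
    return sum(scores) / len(scores)


# Toy data: 6 questions with 2 tags, 2 users (illustrative only).
toy_features = sp.csr_matrix(
    np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]]))
toy_train_y = np.array([1, 1, 0, 0, 0, 0])
toy_test = sp.csr_matrix(np.array([[0, 0, 0, 0, 1, 0],
                                   [0, 0, 0, 0, 0, 0]]))

fitted = LogisticRegression(C=0.4).fit(toy_features, toy_train_y)
toy_models = [fitted, LogisticRegression(C=0.4)]  # second one left unfitted

toy_score = auc_content_models_sketch(toy_models, toy_test, toy_features)
print(toy_score)
```

Skipping unfitted models matters here because `fit_content_models` appends a model even when fitting fails for a user with only one class.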

### Hybrid LightFM model
What happens if we estimate the LightFM model _with_ question features?

In [28]:
lightfm_model = model.fit_lightfm_model(train, post_features=question_features)
lightfm_score = model.auc_lightfm(lightfm_model, test, post_features=question_features)
print(lightfm_score)

0.646683848941

We can also pass user features to the model. This relies on versions of `fit_lightfm_model` and `auc_lightfm` that accept a `user_features` argument, which the listings printed earlier do not show. Note that the first score below is computed on the training set (`train` is passed to `auc_lightfm`), so it overstates performance; the test-set score appears in the last cell.


In [4]:
lightfm_model = model.fit_lightfm_model(train, post_features=question_features,
                                         user_features=user_features)
lightfm_score = model.auc_lightfm(lightfm_model, train, post_features=question_features,
                                  user_features=user_features)
print(lightfm_score)

0.82503800457


In [7]:
repr(user_features)

"<69776x85883 sparse matrix of type '<type 'numpy.int32'>'\n\twith 284212 stored elements in Compressed Sparse Row format>"

In [5]:
model.auc_lightfm(lightfm_model, test, post_features=question_features, user_features=user_features)

0.65162557761091566