# A Task-Construct space using an IRT-like factorization

We are going to develop a latent space shared between tasks and constructs, given a non-negative probability matrix, X, in which rows are tasks, columns are constructs, and each cell holds the probability that the corresponding task and construct are used simultaneously in the same PubMed publications.


## Setup

In [93]:
from sklearn.decomposition import NMF
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Load and prepare matrix X

In [None]:
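# Long-format table with one row per (test, construct) pair
df = pd.read_csv('data/pubmed/test_construct_matrix.csv')

# X = np.random.random((10,10))
# Pivot to wide format: rows are tests (tasks), columns are constructs
X = df.pivot(index='test', columns='construct', values='shared_corpus_size')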

## Matrix Factorization

The recommender-systems literature offers many ways to factorize a matrix, including methods that fill in missing values or approximate the factors with neural networks. In our case the full probability matrix is available, so those more sophisticated techniques are unnecessary and a simple Non-negative Matrix Factorization (NMF) suffices. The only hyper-parameter we tune is the number of latent dimensions (`n_components`), chosen by minimizing the MSE of the reconstruction.
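The hyper-parameter search described above could look roughly like the sketch below. It is not the notebook's actual tuning code: `X_demo` is a synthetic stand-in for X, and `reconstruction_mse`, the candidate ranks, and the seed are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic stand-in for X (assumption): a non-negative 30x20 matrix of exact rank 4.
rng = np.random.default_rng(0)
X_demo = rng.random((30, 4)) @ rng.random((4, 20))

def reconstruction_mse(X, n_components, seed=0):
    """Fit NMF with the given rank and return the mean squared reconstruction error."""
    model = NMF(n_components=n_components, init='random',
                max_iter=2000, random_state=seed)
    W = model.fit_transform(X)   # row (task) factors
    H = model.components_        # column (construct) factors
    return float(np.mean((X - W @ H) ** 2))

# Hypothetical candidate ranks; pick the one with the lowest reconstruction MSE.
candidates = [1, 2, 4, 6]
mse_by_rank = {k: reconstruction_mse(X_demo, k) for k in candidates}
best_k = min(mse_by_rank, key=mse_by_rank.get)
print(mse_by_rank, best_k)
```

The error on the fitted matrix generally does not grow as `n_components` increases, so the strict minimum tends to sit at the largest candidate. Looking for the rank beyond which the MSE stops improving noticeably may be a more informative criterion.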

In [None]:
model = NMF(n_components=10, max_iter=1000000, init='random')

W = model.fit_transform(X)
H = model.components_
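
For reference, the objective that scikit-learn's `NMF` minimizes with its default Frobenius loss and no regularization can be written as follows. The symbols m (number of tasks), n (number of constructs) and k (`n_components`) are introduced here for illustration only. Reading the rows of W as task embeddings and the columns of H as construct embeddings is an interpretation consistent with the setup above, not something the fit itself guarantees.

$$
X \approx W H, \qquad
W \in \mathbb{R}_{\ge 0}^{m \times k}, \quad
H \in \mathbb{R}_{\ge 0}^{k \times n}, \qquad
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2} \lVert X - W H \rVert_F^2
$$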

In [113]:
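# Sanity check (not the MSE itself): flat index of the largest residual,
# and of the largest entry in the reconstruction W@H and in X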
((W@H) - X).values.argmax(), (W@H).argmax(), X.values.argmax()
# X.mean()

(6810, 6725, 6725)
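
The values above are flat (row-major) indices, so they say where the extremes are but not how large the error is. A minimal sketch of computing the MSE and mapping a flat index back to a (task, construct) pair is given below. `X_demo` and `X_hat` are small hypothetical stand-ins for `X` and `W @ H`, and the sketch locates the largest absolute residual, whereas the cell above uses the signed one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X (tasks x constructs) and its reconstruction W @ H.
X_demo = pd.DataFrame(
    [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.3]],
    index=['task_a', 'task_b'],
    columns=['c1', 'c2', 'c3'],
)
X_hat = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.5, 0.3]])

# Mean squared reconstruction error over all cells.
residual = X_hat - X_demo.to_numpy()
mse = float(np.mean(residual ** 2))

# Flat index of the largest absolute residual, converted back to labels.
flat = int(np.abs(residual).argmax())
row, col = np.unravel_index(flat, residual.shape)
worst_pair = (X_demo.index[row], X_demo.columns[col])
print(mse, worst_pair)
```

In the notebook itself, the same steps would apply with `X` in place of `X_demo` and `W @ H` in place of `X_hat`.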

## Visualize embeddings

TODO