### Requirements
To run this notebook correctly, the following setup is recommended:
- a machine running Linux, or Windows with WSL enabled (preferably Ubuntu)
- at least 8 GB of RAM
- a Python virtual environment with the dependencies listed in requirements installed

In [None]:
import pandas as pd
import turicreate as tc
from gensim.models import KeyedVectors

"""
    Data imports
"""
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('data/ml-100k/u.user', sep='|', names=u_cols, encoding='latin-1')
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')


In [18]:
frame.print_rows(num_rows=10)

+---------+----------+--------+----------------+
| user_id | movie_id | rating | unix_timestamp |
+---------+----------+--------+----------------+
|   196   |   242    |   3    |   881250949    |
|   186   |   302    |   3    |   891717742    |
|    22   |   377    |   1    |   878887116    |
|   244   |    51    |   2    |   880606923    |
|   166   |   346    |   1    |   886397596    |
|   298   |   474    |   4    |   884182806    |
|   115   |   265    |   2    |   881171488    |
|   253   |   465    |   5    |   891628467    |
|   305   |   451    |   3    |   886324817    |
|    6    |    86    |   3    |   883603013    |
+---------+----------+--------+----------------+
[100000 rows x 4 columns]



## Entity Linking

Recognising entities in text is a well-known task in Natural Language Processing, and it has recently seen new developments thanks to knowledge bases that are freely available on the web.
The main goal of an Entity Linking (EL) system is to disambiguate, within its context, the mention of an entity $e$ belonging to a knowledge base $KB$.

To extract the embeddings of the entities, the movies in the MovieLens dataset must first be mapped to the URIs of the DBpedia ontology. The DBpedia Lookup tool (SPARQL queries, keyword index, ranking by Wikipedia in-links) could be used for this, but to save time we use an existing mapping that is available online. Note, however, that it was built against a 2016 snapshot of DBpedia.

**References**
- [SPRank: Semantic Path-based Ranking for Top-N Recommendations using Linked Open Data](https://sisinflab.poliba.it/publications/2016/DOTD16/SPRank%20Semantic%20Path-based%20Ranking%20for%20Top-N%20Recommendations%20using%20Linked%20Open%20Data%20-%20ACM%20TIST%202016.pdf)
- [DBPedia Lookup](https://github.com/dbpedia/lookup/)
- [LODrecsys-datasets](https://github.com/sisinflab/LODrecsys-datasets/)
- [Movie-RS](https://github.com/voitijaner/Movie-RSs-Master-Thesis-Submission-Voit)
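
As a rough illustration of the mapping step, the sketch below joins MovieLens movie ids to DBpedia URIs read from a tab-separated mapping and reports the ids left unmapped. The `load_mapping` helper and the id `9999` are hypothetical; the two rows and the column order (id, title, URI) mirror how the mapping file is read later in this notebook. Plain Python is used so the snippet runs without the notebook's dependencies.

```python
import csv
import io

# Two rows in the assumed id <TAB> title <TAB> URI layout.
TOY_MAPPING_TSV = (
    "1799\tSuicide Kings (1997)\thttp://dbpedia.org/resource/Suicide_Kings\n"
    "521\tRomeo Is Bleeding (1993)\thttp://dbpedia.org/resource/Romeo_Is_Bleeding\n"
)


def load_mapping(tsv_text):
    """Parse the TSV text into a {movie_id: uri} dictionary."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {int(movie_id): uri for movie_id, _title, uri in reader}


movie_ids = [1799, 521, 9999]  # 9999 is a made-up id with no mapping entry
mapping = load_mapping(TOY_MAPPING_TSV)
mapped = {m: mapping[m] for m in movie_ids if m in mapping}
unmapped = [m for m in movie_ids if m not in mapping]

print(mapped)
print(unmapped)
```

Movies that end up in `unmapped` have no DBpedia entity and therefore cannot receive an embedding.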

In [None]:
!wget https://raw.githubusercontent.com/sisinflab/LODrecsys-datasets/master/Movielens1M/MappingMovielens2DBpedia-1.2.tsv -O data/LODrecsys/mappings.tsv

### Generating embeddings example

In [None]:
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import (
    AnonymousWalker,
    CommunityWalker,
    HALKWalker,
    NGramWalker,
    RandomWalker,
    WalkletWalker,
    WLWalker, # Weisfeiler-Lehman
)
from pyrdf2vec.samplers import (
    ObjPredFreqSampler,
    PredFreqSampler,
    UniformSampler,
    ObjFreqSampler,
    PageRankSampler,
)

kg = KG("https://dbpedia.org/sparql/")
# Random walks (depth=4, walks_per_graph=500); word2vec (embedding size=200, Skip-Gram)
uris = []  # entity URIs to embed (left empty in this example)
depth = 4
walks_per_graph = 500
random_walker = RandomWalker(depth, walks_per_graph, n_jobs=6)  # adapt n_jobs to the number of available cores
transformer = RDF2VecTransformer(walkers=[random_walker], embedder=Word2Vec(size=200, sg=1), verbose=2)  # sg: 0 = CBOW, 1 = Skip-Gram
embeddings, literals = transformer.fit_transform(kg, uris)
transformer.save("movielens1M-dpbedia-model")
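
To give an intuition for what a random walker produces, here is a toy sketch of depth-limited random walks over a tiny hand-written graph. It is not pyRDF2Vec's implementation: the `TOY_GRAPH` triples and the `random_walks` helper are hypothetical, whereas the real library queries the SPARQL endpoint and applies its own sampling strategies. Each walk alternates entities and predicates, and the resulting sequences play the role of sentences fed to word2vec.

```python
import random

# Hypothetical miniature knowledge graph: subject -> [(predicate, object), ...]
TOY_GRAPH = {
    "dbr:Magnum_Force": [
        ("dbo:director", "dbr:Ted_Post"),
        ("dbo:starring", "dbr:Clint_Eastwood"),
    ],
    "dbr:Clint_Eastwood": [("dbo:birthPlace", "dbr:San_Francisco")],
    "dbr:Ted_Post": [("dbo:birthPlace", "dbr:Brooklyn")],
}


def random_walks(graph, root, depth, n_walks, seed=0):
    """Return n_walks walks starting at root, each following at most depth edges."""
    rng = random.Random(seed)
    walks = []
    for _walk in range(n_walks):
        walk, node = [root], root
        for _hop in range(depth):
            edges = graph.get(node)
            if not edges:  # dead end: the node has no outgoing edges
                break
            predicate, node = rng.choice(edges)
            walk += [predicate, node]
        walks.append(walk)
    return walks


walks = random_walks(TOY_GRAPH, "dbr:Magnum_Force", depth=4, n_walks=5)
for walk in walks:
    print(" -> ".join(walk))
```

In this toy graph every walk stops after two hops because the leaf entities have no outgoing edges. On DBpedia the walks would usually reach the configured depth.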

### Pretrained RDF2Vec Embeddings
Training RDF2Vec can take a considerable amount of time, so in this case it was not possible to train the model and tune parameters such as the embedding size, the depth of the walks, and the word2vec training strategy (CBOW or Skip-Gram). However, several models pretrained with different parameters are available on [kgvec2go](http://kgvec2go.org/download.html).
The parameters chosen for this task, among those available, are taken from [RDF2Vec: RDF Graph Embeddings and Their Applications](https://sisinflab.poliba.it/publications/2018/RRDDP18b/RDF2Vec-RDF-Graph-Embeddings-and-Their-Applications.pdf).

In [None]:
# DBpedia 2021-09 500 walks, depth: 4, SG, 200 dimensions
!wget http://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-sg-200/model.kv -O data/dbpedia/model.kv
!wget http://data.dws.informatik.uni-mannheim.de/kgvec2go/dbpedia/2021-09/classic-rdf2vec-sg-200/model.kv.vectors.npy -O data/dbpedia/model.kv.vectors.npy

In [20]:
vectors = KeyedVectors.load('data/dbpedia/model.kv')

In [36]:
mappings = pd.read_csv('data/LODrecsys/mappings.tsv', sep='\t', header=0, names=["movie_id", "movie_name", "movie_uri"])
# Keep only the movies whose DBpedia URI has a pretrained embedding
mappings = mappings[mappings["movie_uri"].apply(lambda uri: uri in vectors)]
# URIs in the same order as the filtered mappings, so rows stay aligned with movie_id
uris = list(mappings["movie_uri"])
mappings.head()

Unnamed: 0,movie_id,movie_name,movie_uri
0,1799,Suicide Kings (1997),http://dbpedia.org/resource/Suicide_Kings
1,521,Romeo Is Bleeding (1993),http://dbpedia.org/resource/Romeo_Is_Bleeding
2,3596,Screwed (2000),http://dbpedia.org/resource/Screwed_(2000_film)
3,3682,Magnum Force (1973),http://dbpedia.org/resource/Magnum_Force
4,2635,"Mummy's Curse, The (1944)",http://dbpedia.org/resource/The_Mummy's_Curse


In [37]:
dbpedia = pd.DataFrame([vectors[uri] for uri in uris])
dbpedia.insert(loc=0, column='movie_id', value=list(mappings["movie_id"]))
dbpedia.head()

Unnamed: 0,movie_id,0,1,2,3,4,5,6,7,8,...,190,191,192,193,194,195,196,197,198,199
0,1799,0.104363,-0.030366,0.024397,0.011784,0.074084,-0.009074,0.028095,-0.009043,0.032457,...,0.061471,-0.042182,0.076538,0.005617,0.068206,-0.091303,-0.021942,-0.035422,0.027923,0.011205
1,521,0.483485,-0.238453,-0.214155,-0.149148,0.374526,-0.263988,-0.193399,-0.581973,-0.548007,...,0.561658,-0.298676,-0.075646,0.285023,0.056442,-0.493393,-0.566069,-0.659976,-0.472822,1.9e-05
2,3596,0.038226,-0.022396,-0.189044,0.024776,0.49973,-0.102583,-0.520485,-0.944271,0.880034,...,0.79036,0.06305,0.331688,0.235169,-0.02107,-0.379494,-0.17678,-0.252495,0.529196,0.050202
3,3682,-0.132794,-0.716965,-0.296625,-0.064314,0.312275,-0.356286,-0.39258,-0.581087,0.227907,...,0.460223,0.149874,-0.045281,0.380723,-0.158305,-0.814315,-0.64002,-0.27009,0.297235,-0.442252
4,2635,0.186714,0.010445,-0.553984,0.019468,0.908595,-0.36153,0.011017,-0.416185,0.319215,...,0.44343,0.357368,0.396677,0.371717,0.056948,-0.438682,-0.347058,-0.45111,0.106972,0.116729


In [39]:
ratings = ratings[ratings["movie_id"].isin(mappings["movie_id"])]
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
2,22,377,1,878887116
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488


In [None]:
ratings_frame = tc.SFrame(ratings)
item_frame = tc.SFrame(dbpedia)
# ~80/20 split per user (default item_test_proportion=0.2)
train_data, test_data = tc.recommender.util.random_split_by_user(ratings_frame, user_id="user_id", item_id="movie_id")
train_data2 = train_data.copy()
train_data3 = train_data.copy()

rfm_with_rdf = tc.recommender.ranking_factorization_recommender.create(train_data, user_id="user_id", item_id="movie_id", target='rating', item_data=item_frame)
rfm = tc.recommender.ranking_factorization_recommender.create(train_data2, user_id="user_id", item_id="movie_id", target='rating')
itemknn = tc.recommender.item_similarity_recommender.create(train_data3,  user_id="user_id", item_id="movie_id", similarity_type="cosine")
# Top-k recommendations for every user, printed in a later cell
rfm_with_rdf_recommendations = rfm_with_rdf.recommend()
# rfm_recommendations = rfm.recommend()
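
The `similarity_type="cosine"` option compares items through the ratings they received. As a hedged sketch of the idea (not Turi Create's implementation, which may differ in normalisation and in which neighbours it keeps), the snippet below computes the cosine similarity between items from hypothetical user ratings, treating a missing rating as zero. The `item_ratings` data and the `cosine_similarity` helper are invented for the example.

```python
import math

# Hypothetical ratings: item -> {user: rating}
item_ratings = {
    "A": {"u1": 3, "u2": 4},
    "B": {"u1": 3, "u2": 4},
    "C": {"u3": 5},
    "D": {"u1": 3},
}


def cosine_similarity(ratings_i, ratings_j):
    """Cosine similarity between two sparse rating vectors (missing = 0)."""
    common_users = set(ratings_i) & set(ratings_j)
    dot = sum(ratings_i[u] * ratings_j[u] for u in common_users)
    norm_i = math.sqrt(sum(r * r for r in ratings_i.values()))
    norm_j = math.sqrt(sum(r * r for r in ratings_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


sim_ab = cosine_similarity(item_ratings["A"], item_ratings["B"])
sim_ac = cosine_similarity(item_ratings["A"], item_ratings["C"])
sim_ad = cosine_similarity(item_ratings["A"], item_ratings["D"])
print(sim_ab, sim_ac, sim_ad)
```

Items rated identically by the same users score 1, while items with no users in common score 0.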


In [None]:

comparing = tc.recommender.util.compare_models(test_data, [rfm_with_rdf, rfm, itemknn], model_names=["Ranking FM with RDF2Vec embs", "Ranking FM", "ItemKnn"], metric='rmse')
# 1.0273390850597666, 1.1255684868360116
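
For reference, the `rmse` metric requested above is the root-mean-square error between predicted and observed ratings, $\sqrt{\frac{1}{N}\sum_i (\hat{r}_i - r_i)^2}$. A minimal sketch with made-up numbers follows. The `rmse` helper and the values are illustrative only and are not output from the models above.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equally long rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [3, 1, 4, 5]
predicted = [4, 1, 2, 5]
error = rmse(actual, predicted)  # squared errors are 1, 0, 4, 0
print(error)
```

A lower value means the predicted ratings are, on average, closer to the observed ones.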

In [None]:
print(rfm_with_rdf_recommendations)

In [None]:
print(rfm)

In [None]:
for c in comparing:
    print(c)

In [87]:
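def evaluate(test_data, est, threshold=3.5):
    """Compute precision, recall and F1, treating ratings >= threshold as relevant.

    `est(user_id, movie_id)` must return the predicted rating for that pair.
    """
    tp = fn = fp = tn = 0
    test_data_pandas = test_data.to_dataframe()
    # iterrows() yields (index, row) pairs, so the index is discarded
    for _, row in test_data_pandas.iterrows():
        predicted = est(row["user_id"], row["movie_id"])
        if row["rating"] >= threshold:
            if predicted >= threshold:
                tp += 1
            else:
                fn += 1
        elif predicted >= threshold:
            fp += 1
        else:
            tn += 1
    # Guard against empty denominators
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return precision, recall, f1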
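
As a worked illustration of threshold-based metrics, the sketch below counts true and false positives on hypothetical (actual, predicted) rating pairs using the same 3.5 threshold. The pairs and the `classification_metrics` helper are invented for the example and do not come from the models above.

```python
def classification_metrics(pairs, threshold=3.5):
    """Return (precision, recall, f1) for (actual, predicted) rating pairs."""
    tp = sum(1 for a, p in pairs if a >= threshold and p >= threshold)
    fp = sum(1 for a, p in pairs if a < threshold and p >= threshold)
    fn = sum(1 for a, p in pairs if a >= threshold and p < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up (actual rating, predicted rating) pairs
pairs = [(5, 4.2), (4, 3.0), (2, 3.9), (1, 1.5), (4, 4.8)]
precision, recall, f1 = classification_metrics(pairs)
print(precision, recall, f1)
```

Here two relevant items are predicted as relevant, one is missed, and one irrelevant item is predicted as relevant, so precision, recall and F1 all equal 2/3.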



(0, user_id                 196
movie_id                242
rating                    3
unix_timestamp    881250949
Name: 0, dtype: int64)
(1, user_id                  22
movie_id                377
rating                    1
unix_timestamp    878887116
Name: 1, dtype: int64)
(2, user_id                 298
movie_id                474
rating                    4
unix_timestamp    884182806
Name: 2, dtype: int64)
(3, user_id                 253
movie_id                465
rating                    5
unix_timestamp    891628467
Name: 3, dtype: int64)
(4, user_id                  62
movie_id                257
rating                    2
unix_timestamp    879372434
Name: 4, dtype: int64)
(5, user_id                 210
movie_id                 40
rating                    3
unix_timestamp    891035994
Name: 5, dtype: int64)
(6, user_id                 119
movie_id                392
rating                    4
unix_timestamp    886176814
Name: 6, dtype: int64)
(7, user_id                