https://github.com/graus/recsys_summer_school

**Issues**:
- Need to install a C++ compiler. Resolution: don't try right now; find a buddy.
- spaCy model `en_core_web_md` is not found. Resolution: run `!python3 -m spacy download en_core_web_md` and restart the kernel.
- Typo in `candidate_retriever`. Resolution: change the call to `candidates = corpus.get(lambda x: "Obituaries" in x._.meta['categories'], limit=limit)` (remove the stray `o` and add the missing `]`).


# Requirements

Download the pre-cooked `textacy.Corpus` from Dropbox ([Download here](https://www.dropbox.com/s/wlbyvxdfnx748bg/textacy_corpus.bin.gz?dl=1), 118 MB) and save it to `./data/enwikinews/`.


In [1]:
import textacy
from datetime import datetime

dt_format = "%Y-%m-%dT%H:%M:%SZ"  # for parsing Wikinews' timestamps

In [2]:
corpus = textacy.Corpus.load("en_core_web_md", "./data/enwikinews/textacy_corpus.bin.gz")

# Assignment

This hands-on session aims to improve Wikinews' existing "Related articles" widget by using a two-step approach: first retrieve a set of candidate articles, then re-rank them. More specifically, we expect you to create:

### 1. Candidate retrieval

A method that takes a document from the `textacy.Corpus` as input and returns a (ranked) list of candidate 'related articles'. A textacy-based reimplementation of a simple `candidate_retriever` could be:


In [3]:
def candidate_retriever(doc, corpus, limit=10):
    """
    Retrieve candidate articles "related" to doc in corpus
    "related" here means documents that have one or more categories in common.
    """
    
    doc_categories = doc._.meta["categories"]
    
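    # Match if the candidate doc (`c`) shares at least one category with the input doc (`doc`).
    # The `[]` default guards against documents without a "categories" entry.
    match_func = lambda c: any(cat in doc_categories
                               for cat in c._.meta.get("categories", []))

    # match_func = lambda c: "Obituaries" in c._.meta['categories']  # Or: match category "Obituaries"

    # Note: the call below hard-codes the "Obituaries" filter and does not use `match_func`;
    # pass `match_func` instead to retrieve on any shared category.
    candidates = corpus.get(lambda x: "Obituaries" in x._.meta['categories'], limit=limit)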
    
    return candidates
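
To make the category-overlap idea from the docstring concrete, here is a minimal sketch that runs without textacy. It assumes each article is a plain dict with a `categories` list, as a stand-in for `doc._.meta`; the function name `retrieve_by_category_overlap` and the toy articles are hypothetical and not part of the notebook or of textacy. Unlike the cell above, this sketch also skips the input document itself, which is a design choice rather than something the notebook does.

```python
from itertools import islice


def retrieve_by_category_overlap(doc, articles, limit=10):
    """Return up to `limit` articles sharing at least one category with `doc`.

    Assumption: `doc` and every item in `articles` are dicts with a
    "categories" list (a stand-in for `doc._.meta` in the notebook).
    """
    doc_categories = set(doc.get("categories", []))

    # Lazily filter: keep articles with a non-empty category intersection,
    # skipping the input document itself.
    matches = (
        a for a in articles
        if a is not doc and doc_categories & set(a.get("categories", []))
    )
    # Stop after the first `limit` matches, in the order the articles are stored.
    return list(islice(matches, limit))


# Hypothetical toy data, for illustration only.
articles = [
    {"title": "A", "categories": ["Obituaries", "Music"]},
    {"title": "B", "categories": ["Sports"]},
    {"title": "C", "categories": ["Music"]},
    {"title": "D", "categories": []},
]
query = articles[0]
related = retrieve_by_category_overlap(query, articles)
print([a["title"] for a in related])  # ['C']
```

Because matching stops at the first `limit` hits, the result depends on the order in which articles are stored, not on how strongly they overlap.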

### 2. Re-ranking

The second step consists of re-ranking the set of candidate articles by some criterion. In the case of Wikinews, candidates are ranked by recency (newest first), i.e.:

In [4]:
def re_ranker(doc, candidates):
    """Sort candidates by creation date, newest first."""

    ranked_candidates = sorted(
        candidates,
        key=lambda x: datetime.strptime(x._.meta['dt_created'], dt_format),
        reverse=True,
    )

    # Alternative: rank by the (unnormalized) dot product of the document vectors:
    # ranked_candidates = sorted(candidates, key=lambda x: sum(doc.vector * x.vector), reverse=True)
    
    return ranked_candidates
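
The commented-out alternative in `re_ranker` scores candidates with a raw dot product, which grows with vector length as well as with direction. A minimal sketch of a length-normalized variant (cosine similarity) is shown below. It uses plain Python lists instead of spaCy vectors, and the names `cosine_similarity`, `rerank_by_similarity`, and the toy vectors are hypothetical, chosen only for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_by_similarity(doc_vector, candidates):
    """Sort candidates by cosine similarity to `doc_vector`, most similar first.

    Assumption: each candidate is a (title, vector) pair; in the notebook the
    vector would come from `x.vector` instead.
    """
    return sorted(
        candidates,
        key=lambda c: cosine_similarity(doc_vector, c[1]),
        reverse=True,
    )


# Hypothetical toy vectors, for illustration only.
doc_vector = [1.0, 0.0]
candidates = [
    ("long but off-topic", [3.0, 4.0]),   # dot product 3.0, cosine 0.6
    ("short but on-topic", [0.5, 0.0]),   # dot product 0.5, cosine 1.0
    ("orthogonal", [0.0, 2.0]),           # dot product 0.0, cosine 0.0
]
ranked = rerank_by_similarity(doc_vector, candidates)
print([title for title, _ in ranked])
# ['short but on-topic', 'long but off-topic', 'orthogonal']
```

With a raw dot product the first two candidates would swap places, since the longer vector scores 3.0 against 0.5 despite pointing in a less similar direction.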

In [5]:
candidates = candidate_retriever(corpus[16127], corpus)
ranked_candidates = re_ranker(corpus[16127], candidates)

In [6]:
print(corpus[16127]._.meta['dt_created'], corpus[16127]._.meta['title'])
print(80*'-')
for c in ranked_candidates:
    print(c._.meta['dt_created'], c._.meta['title'])

2018-04-21T02:39:45Z 28-year-old Swedish electronic dance music artist Avicii dies in Oman
--------------------------------------------------------------------------------
2009-05-23T05:24:13Z Former South Korean President dead after apparent suicide
2008-09-08T01:21:55Z Silent film actress Anita Page dies at age 98
2008-06-13T19:59:12Z Tim Russert, NBC News "Meet the Press" moderator dies at age 58
2007-08-14T02:02:52Z American philanthropist Brooke Astor dies at 105
2007-08-11T08:00:37Z Tony Wilson dies
2007-07-12T10:49:17Z Australian radio personality Stan Zemanek dies aged 60
2007-06-26T01:01:08Z Professional wrestler Chris Benoit and family found dead
2006-05-29T07:45:06Z Australian woman dies in backburning operation near Bathurst
2005-08-17T02:35:46Z Taizé ecumenical community founder Frère Roger assassinated
2005-02-11T20:58:49Z Arthur Miller dies, aged 89
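
All ten results above date from 2005 to 2009, although the input article is from 2018. One possible explanation is that `candidate_retriever` keeps only the first `limit` matches it encounters, so `re_ranker` never sees newer articles. The sketch below illustrates the difference between truncating before ranking and ranking before truncating. It assumes candidates are plain dicts with a `dt_created` string; `top_k_by_recency` and the toy pool are hypothetical names for this example.

```python
from datetime import datetime

dt_format = "%Y-%m-%dT%H:%M:%SZ"  # same timestamp format as in the notebook


def top_k_by_recency(candidates, k=10):
    """Sort all candidates newest first, then keep the top `k`.

    Assumption: each candidate is a dict with a "dt_created" string
    (a stand-in for `c._.meta['dt_created']` in the notebook).
    """
    ranked = sorted(
        candidates,
        key=lambda c: datetime.strptime(c["dt_created"], dt_format),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy pool, stored oldest first.
pool = [
    {"title": "old", "dt_created": "2005-02-11T20:58:49Z"},
    {"title": "middle", "dt_created": "2009-05-23T05:24:13Z"},
    {"title": "new", "dt_created": "2018-04-21T02:39:45Z"},
]

truncate_first = top_k_by_recency(pool[:2], k=2)  # cut to 2 candidates, then rank
rank_first = top_k_by_recency(pool, k=2)          # rank all candidates, then cut to 2

print([c["title"] for c in truncate_first])  # ['middle', 'old']
print([c["title"] for c in rank_first])      # ['new', 'middle']
```

In the notebook, the second variant would roughly correspond to calling `candidate_retriever` with a larger `limit` and slicing the output of `re_ranker`, at the cost of sorting a bigger candidate set.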
