# Requirements

Download the pre-built `textacy.Corpus` from Dropbox ([download here](https://www.dropbox.com/s/wlbyvxdfnx748bg/textacy_corpus.bin.gz?dl=1), 118 MB) and save it to `./data/enwikinews/`.


In [None]:
import textacy
from datetime import datetime

dt_format = "%Y-%m-%dT%H:%M:%SZ"  # for parsing Wikinews' timestamps

In [None]:
corpus = textacy.Corpus.load("en_core_web_md", "./data/enwikinews/textacy_corpus.bin.gz")

# Assignment

This hands-on session aims to improve Wikinews' existing "Related articles" widget by incorporating a two-step methodology: candidate retrieval followed by re-ranking. More specifically, we expect you to create:

### 1. Candidate retrieval

A method that takes a document from the `textacy.Corpus` as input and returns a (ranked) list of candidate 'related articles'. A simple textacy-based `candidate_retriever` could look like this:


In [None]:
def candidate_retriever(doc, corpus, limit=10):
    """
    Retrieve candidate articles "related" to doc in corpus
    "related" here means documents that have one or more categories in common.
    """
    
    doc_categories = doc._.meta["categories"]
    
    # Match if the candidate doc (`c`) has any category overlapping with input doc (`doc`)
    # (the `[]` default guards against candidates without a "categories" entry)
    match_func = lambda c: any(cat in doc_categories
                               for cat in c._.meta.get("categories", []))
    
    # match_func = lambda c: "Obituaries" in c._.meta['categories']  # Or: match category "Obituaries"
    
    # corpus.get yields the docs for which match_func is true, stopping after `limit` matches
    candidates = corpus.get(match_func, limit=limit)
    
    return candidates
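
To try the category-overlap idea without loading the corpus, the sketch below applies the same test to plain dictionaries. The names `toy_articles` and `shares_category` and all sample data are hypothetical, made up for illustration; only the `categories` key mirrors the metadata used above.

```python
from itertools import islice

# Hypothetical stand-ins for article metadata; not taken from the corpus.
toy_articles = [
    {"title": "A", "categories": ["Obituaries", "Music"]},
    {"title": "B", "categories": ["Sports"]},
    {"title": "C", "categories": ["Music"]},
    {"title": "D"},  # no "categories" key at all
]

def shares_category(query, candidate):
    """True if the two metadata dicts have at least one category in common."""
    query_cats = set(query.get("categories", []))
    return not query_cats.isdisjoint(candidate.get("categories", []))

query = toy_articles[0]
# Lazily filter the articles, skipping the query article itself.
matches = (a for a in toy_articles if a is not query and shares_category(query, a))
related = list(islice(matches, 10))  # cap the number of results, like `limit`
print([a["title"] for a in related])
```

This sketch skips the query article itself, which would otherwise always match its own categories. Whether to do the same in your retriever is a design choice.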

### 2. Re-ranking

The second step re-ranks the set of candidate articles by some criterion. Wikinews ranks them by recency (newest first), i.e.:

In [None]:
def re_ranker(candidates):
    """Sort candidates by creation date, newest first."""
    
    ranked_candidates = sorted(
        candidates,
        key=lambda x: datetime.strptime(x._.meta['dt_created'], dt_format),
        reverse=True,
    )
    
    return ranked_candidates
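
The recency sort hinges on parsing the timestamp strings with `dt_format`. A minimal sketch on made-up timestamps (the `toy_meta` records are hypothetical, not corpus data) shows the parsing and ordering in isolation:

```python
from datetime import datetime

dt_format = "%Y-%m-%dT%H:%M:%SZ"  # same pattern as defined at the top

# Hypothetical metadata records, for illustration only.
toy_meta = [
    {"title": "old", "dt_created": "2005-03-01T12:00:00Z"},
    {"title": "new", "dt_created": "2017-11-20T08:30:00Z"},
    {"title": "mid", "dt_created": "2010-06-15T23:59:59Z"},
]

# Parse each timestamp and sort descending, so the newest record comes first.
newest_first = sorted(
    toy_meta,
    key=lambda m: datetime.strptime(m["dt_created"], dt_format),
    reverse=True,
)
print([m["title"] for m in newest_first])
```

Swapping the `key` function is enough to rank by a different criterion, for example by a similarity score instead of a date.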

In [None]:
candidates = candidate_retriever(corpus[16127], corpus)
ranked_candidates = re_ranker(candidates)

In [None]:
for c in ranked_candidates:
    print(c._.meta['dt_created'], c._.meta['title'])