# Part 2: Building a recommendation system

In this analysis we will build a recommendation engine based on the IMDB actor biographies extracted in part 1. This notebook will guide you through the process. See [here](https://github.com/nestauk/taller_centro_cultura_digital/blob/master/data_analysis.ipynb) for one possible solution.

# NB: replace '???' with your own code!

In [None]:
# Text processing
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from joel_tools import SynonymBuilder

from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.decomposition import PCA

# For general data manipulations
import numpy as np
import pandas as pd

# For importing the biography data
import json

# Load this in advance - this will parse our text for us
PARSER = spacy.load('en')

## Getting started

First, load the data:

In [None]:
# Open the data from Part 1
with open("data/bios.json") as f:
    bios = json.load(f)

#### a) Take a look at the data and understand its structure. How many "rows" are there? How much variation is there in the length of the biographies?

## Extracting tokens

In data science terms, this isn't a lot of data, but it's enough to work with if we use some tricks. First we need to split each biography into 'tokens' (these could be individual words, or a number of consecutive words).
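
As a rough illustration of the difference between single-word tokens and tokens made of consecutive words, here is a minimal pure-Python sketch. The helper `ngrams` and the toy sentence are assumptions made for this example; they are not part of the workshop code, and the PARSER used below tokenizes in a more sophisticated way.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by a space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy sentence, invented for illustration (not taken from the biography data)
words = "he won an academy award".split()

unigrams = ngrams(words, 1)  # individual words
bigrams = ngrams(words, 2)   # pairs of consecutive words

print(unigrams)
print(bigrams)
```

Multi-word tokens keep phrases such as "academy award" together, which is the same motivation behind merging entities later in this notebook.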

#### b i) Run the PARSER on Robert De Niro's biography and inspect the tokens.

In [None]:
bio = ???  # Replace ??? with Robert De Niro's biography
tokens = PARSER(bio)
for t in tokens:
    print(t.text, "\t", t.is_alpha, t.is_punct, t.is_stop, t.ent_type_)

As you can see, the PARSER has automatically identified 'entities' such as names and places. It would be best if these were treated as a single token. We can do that by merging them.

In [None]:
# Merge entities, and take a note of them
entities = set()
for entity in tokens.ents:
    entity.merge()
    entities.add(entity.text)
    print(entity.text, entity.label_)
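
To make the idea of merging concrete, the sketch below joins hand-picked word ranges into single tokens using plain Python lists. It is only an illustration of the concept: `merge_spans` is a hypothetical helper, the span positions are chosen by hand, and this is not how spaCy implements merging internally.

```python
def merge_spans(words, spans):
    """Merge each (start, end) span of words into a single token.

    `spans` are half-open index ranges, assumed sorted and non-overlapping.
    """
    merged, i = [], 0
    for start, end in spans:
        merged.extend(words[i:start])              # words before the entity
        merged.append(" ".join(words[start:end]))  # the entity as one token
        i = end
    merged.extend(words[i:])                       # words after the last entity
    return merged

words = ["Robert", "De", "Niro", "was", "born", "in", "New", "York", "City"]
spans = [(0, 3), (6, 9)]  # hypothetical entity positions, hand-picked for this example

merged = merge_spans(words, spans)
print(merged)
```

After merging, nine words become five tokens, and each name or place is counted as one term rather than several unrelated ones.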

#### b ii) Do the entities make sense? Which types of entities will not be useful when generating a recommendation engine? Remember: our aim here is to reduce the size of the vocabulary by removing tokens which could introduce strange biases into our recommendation engine.

    <Put your answer here>

In [None]:
bad_entities = []

We can now generate a smaller list of tokens for "Robert De Niro" by excluding these bad entities:

In [None]:
tokens = [t for t in tokens
          if (t.is_alpha or t.text in entities)
          and not (t.is_punct or t.is_stop
                   or t.ent_type_ in bad_entities)]

#### b iii) Inspect the `t.lemma_` attribute for each token `t`. How is the `lemma_` related to the token? Why will it be useful for us?

In [None]:
for t in tokens:
    print(t, ???)

    <Put your answer here>

We can now combine the above steps into a single 'tokenizer' function, which generates our tokens:

In [None]:
def spacy_tokenizer(sentence, bad_entities):
    """Split a sentence into tokens"""
    # Initial parsing
    tokens = PARSER(sentence)
    
    # Merge entities, and take a note of them
    entities = set()
    for entity in tokens.ents:
        entity.merge()
        entities.add(entity.text)
        
    # Only accept tokens which are made of letters, and are not
    # organisations, people or dates
    tokens = [t for t in tokens
              if (t.is_alpha or t.text in entities)
              and not (t.is_punct or t.is_stop
                       or t.ent_type_ in bad_entities)]
    
    # Lemmatize, lowercase and strip excess spaces; entities and
    # pronouns keep their original (lowercased) text instead
    tokens = [t.lemma_.lower().strip()
              if (t.lemma_ != "-PRON-" and
                  t.text not in entities)
              else t.lower_ for t in tokens]

    # Finally, entities starting with articles are overkill for this analysis
    # so strip off the articles to maximise term counts
    for start in ["a", "an", "the"]:
        n = len(start) + 1
        tokens = [t[n:] if t.startswith(f"{start} ")
                  else t for t in tokens]
    return tokens
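
The final step of `spacy_tokenizer` can be tried in isolation, without loading spaCy. The sketch below runs the same loop on a made-up list of already lowercased tokens; the token strings are invented for illustration.

```python
# Invented tokens standing in for the output of the earlier tokenizer steps
tokens = ["the godfather", "an academy award", "actor", "a bronx tale", "another"]

for start in ["a", "an", "the"]:
    n = len(start) + 1  # length of the article plus the following space
    tokens = [t[n:] if t.startswith(f"{start} ") else t for t in tokens]

print(tokens)
```

Because the check includes the trailing space, words that merely begin with the same letters (such as "actor" or "another") are left untouched.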

#### b iv) Confirm that the `spacy_tokenizer` runs as expected on 10 biographies.

In [None]:
for i, name in enumerate(bios):
    if i == 10:
        break
    ???

## Cleaning the vocabulary

If we inspect the entire vocabulary, we might find some more patterns which we can exploit to reduce the size of our vocabulary.

####  c i) Create a list of every token from the first 100 biographies, and look at the unique terms with `set`.

In [None]:
vocab = []
for i, name in enumerate(bios):
    if i == 100:
        break
    ???

# Get the unique terms
vocab_set = set(vocab)

#### c ii) In the vocabulary there are some tokens that we might want to consider as being the same. For example, inspect every token in `vocab_set` which contains the substring "golden globe".

    <Put your answer here>

#### c iii) Discuss with a nearby participant a strategy for combining these terms together. Feel free to look at [SynonymBuilder](https://github.com/nestauk/taller_centro_cultura_digital/blob/master/joel_tools/synonym_builder.py) for inspiration.

#### c iv) Apply `SynonymBuilder` to your vocabulary, and confirm that it has reduced the size of the vocabulary as expected. No algorithm is perfect - what features of `SynonymBuilder` might you like to improve, looking at the results (for example, search for "golden globe" again)?

In [None]:
syn_builder = SynonymBuilder()
reduced_vocab = syn_builder.fit_transform(vocab)

???

    <Put your answer here>

#### c v) The PARSER appears to have done a bad job with some names (again, no algorithm is perfect!), and second names such as 'smith' still appear in the vocab. In the context of a recommendation engine, suggest three kinds of bias that this could introduce.

    1.
    2.
    3.

#### c vi) Remove all first names and second names from your vocabulary.

In [None]:
for name in bios.keys():
    ???

As with the tokenizer, we can wrap up all of this code into a function:

In [None]:
def clean_vocabulary(names, texts, bad_entities):
    """Generate the vocabulary to be used in the analysis. In order to maximise our
    chance of getting reasonable results, we need to increase our token counts. We therefore
    use the SynonymBuilder to decide which terms are actually the same.""" 
    # Build the basic vocab from the tokenizer
    vocab = []
    for text in texts:
        vocab += spacy_tokenizer(text, bad_entities)
        
    # Use the synonym builder to reduce the data size
    syn_builder = SynonymBuilder()
    vocab = syn_builder.fit_transform(vocab)
    texts = syn_builder.transform(texts)
    
    # Remove each actor's first and second names from the vocabulary
    vocab = set(vocab)
    for name in names:
        vocab = vocab - set(name.lower().split())
    return vocab, texts
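
The name-removal step at the end of `clean_vocabulary` relies on set difference. Here is a small standalone sketch of that step, using an invented vocabulary and two example names.

```python
# Invented vocabulary and names, for illustration only
vocab = {"actor", "robert", "niro", "taxi driver", "smith", "award", "de"}
names = ["Robert De Niro", "Will Smith"]

for name in names:
    # Split each name into lowercase words and drop any exact matches
    vocab = vocab - set(name.lower().split())

print(sorted(vocab))
```

Note that set difference only removes exact matches: a merged multi-word token such as "robert de niro" would survive this step, because it is not equal to any single word of the name.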

#### c vii) Apply `clean_vocabulary` to the first 50 biographies.

In [None]:
names = ???
texts = ???
vocab, texts = clean_vocabulary(names, texts, bad_entities)

## Vectorizing: turning text into data

We convert text into data by "vectorizing". The simplest way to vectorize is to apply a 'binary' vectorizer, which records only whether each token appears in each text. The default tokenizer will be used, and no 'cleaning' of the vocabulary is performed.
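
To see roughly what a binary vectorizer produces, here is a pure-Python sketch on two made-up sentences. The regular expression is meant to mimic the default token pattern of `CountVectorizer` (lowercased words of two or more characters); the sentences and variable names are assumptions for this example, and the real class returns a sparse matrix rather than nested lists.

```python
import re

# Two invented sentences, not taken from the biography data
docs = [
    "He won an Oscar, then he won a Golden Globe.",
    "She won a Golden Globe.",
]

# Lowercase, then keep words of two or more characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# Columns: the sorted set of all tokens seen in any document
vocabulary = sorted(set(token for tokens in tokenized for token in tokens))

# Rows: 1 if the token appears in the document at all, otherwise 0
matrix = [[int(token in tokens) for token in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Even though "he" and "won" occur twice in the first sentence, their entries are still 1: a binary count discards how often a token appears.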

#### d i) Apply a binary `CountVectorizer` to the first 50 biographies. What do you think of the vocabulary (column names)?

In [None]:
cv = CountVectorizer(binary=True)
data = cv.fit_transform(texts)
df_binary = pd.DataFrame(data.todense(), columns=cv.get_feature_names(), index=names)
df_binary

    <Put your answer here>

We can, of course, use our own tokenizer and cleaned vocabulary. 

#### d ii) Apply a binary `CountVectorizer` using the `spacy_tokenizer` and the cleaned vocabulary. How do you feel about the vocabulary? Are there still junk terms in it? Could you modify `spacy_tokenizer` to give a slightly better result (for example, by removing '"' quotation marks)?

In [None]:
bad_entities = ???
cv = CountVectorizer(binary=True, vocabulary=vocab, tokenizer=lambda x: spacy_tokenizer(x, bad_entities))
data = cv.fit_transform(texts)
df_binary = pd.DataFrame(data.todense(), columns=cv.get_feature_names(), index=names)
df_binary

#### d iii) Discuss with a nearby participant why using a binary count could be both good and bad. (Hint: think in terms of the importance of each token)

    <Put your answer here>

#### d iv) Generally a better strategy is to use the `TfidfVectorizer` instead of a binary count. Apply the `TfidfVectorizer`, and describe what low and high values of 'tfidf' physically mean.

In [None]:
bad_entities = ???
tv = TfidfVectorizer(vocabulary=vocab, tokenizer=lambda x: spacy_tokenizer(x, bad_entities))
data = tv.fit_transform(texts)
df_tfidf = pd.DataFrame(data.todense(), columns=tv.get_feature_names(), index=names)

# Get just the row for Robert De Niro
_df = df_tfidf.loc[(df_tfidf.index == 'Robert De Niro')]
_df = _df[_df != 0.0].dropna(axis=1)

# Sort by tfidf
_df.T.sort_values('Robert De Niro', ascending=True)

    <Put your answer here>
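
For reference, the arithmetic behind a tf-idf score can be sketched in pure Python. The sketch assumes the default settings of `TfidfVectorizer` (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and each row scaled to unit length); the three toy documents and the helper `tfidf_row` are invented for this example.

```python
import math

# Three invented, already-tokenized documents
docs = [
    ["actor", "award", "award", "film"],
    ["actor", "film", "comedy"],
    ["actor", "theatre"],
]
vocabulary = sorted(set(t for doc in docs for t in doc))
n_docs = len(docs)

# Document frequency: in how many documents does each token appear?
df = {t: sum(t in doc for doc in docs) for t in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(doc):
    """Term count times idf for each vocabulary token, scaled to unit length."""
    raw = [doc.count(t) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for token, value in zip(vocabulary, rows[0]):
    print(f"{token:8s} {value:.3f}")
```

In the first document, "award" (frequent here, absent elsewhere) scores higher than "actor" (present in every document), which is a useful starting point for the question above.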

#### d v) Now look at the 'shape' of your data (i.e. print `data.shape`). What do the two numbers correspond to? Soon we're going to try to find similar rows in our data. Why might the current shape of our data make that difficult?

In [None]:
???

    <Put your answer here>

## The recommendation engine

A common strategy for reducing the number of columns in your data is Principal Component Analysis ('PCA'). We'll use it to compact the columns before examining the similarity of rows, then apply the 'cosine similarity' and look at the most similar actors.

In [None]:
# Keep enough principal components to explain 75% of the variance
pca = PCA(0.75)
_data = pca.fit_transform(data.todense())

# Calculate similarity 
sims = cosine_similarity(_data) - np.eye(data.shape[0])

# Build a dataframe to display the results of the recommendation engine
most_similar = []
for name, row in zip(names, sims):
    highest = row.max()
    found = False
    for _name, score in zip(names, row):
        if np.isclose(score, highest):
            found = True
            break
    if not found:
        continue
    most_similar.append(dict(name=name, most_similar=_name, score=score))
    
df_sim = pd.DataFrame(most_similar, columns=["name","most_similar","score"]).sort_values("score", ascending=False)
df_sim.head(50)
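
The nearest-neighbour lookup in the cell above can be sketched without any libraries. The example below computes cosine similarities for three made-up vectors, removes each row's similarity with itself (the role played by subtracting `np.eye`), and picks the best match per row. The names "A", "B", "C", the vectors, and the helper `cosine` are placeholders invented for illustration; the PCA step is left out.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder names and vectors, standing in for actors and their reduced data
names = ["A", "B", "C"]
vectors = [[3.0, 1.0, 0.0], [6.0, 2.0, 1.0], [0.0, 1.0, 4.0]]

# Pairwise similarity matrix
sims = [[cosine(u, v) for v in vectors] for u in vectors]

# Every vector is perfectly similar to itself, so remove the diagonal
for i in range(len(sims)):
    sims[i][i] -= 1.0

# For each row, find the name with the highest remaining similarity
most_similar = {}
for name, row in zip(names, sims):
    best = max(range(len(row)), key=row.__getitem__)
    most_similar[name] = names[best]

print(most_similar)
```

"B" points in almost the same direction as "A" despite being about twice as long, so the two are each other's closest match under cosine similarity.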

#### e i) Play with the PCA value (between 0 and 1), and describe the effect on the recommendation engine's results.

    <Put your answer here>

#### e ii) Try using `euclidean_distances` instead of `cosine_similarity`. What is the effect? How can you explain this?

    <Put your answer here>

#### e iii) Run the recommendation system on the full dataset. Compare results with your fellow participants. Try to understand how you may have got different results.