Before we start, run the code cell below for a nicer layout.

In [None]:
%%html
<style>
h1 { margin-top: 3em !important; }
h2 { margin-top: 2em !important; }
#notebook-container { 
    width: 50% !important; 
    min-width: 800px;
}
</style>

# Cheese! Jaccard!

In the lecture we computed the Jaccard similarity of a few cheeses. We are doing the same here, but on a larger scale! Here's our plan:

1. Have a look at the data
2. Come up with a few questions
3. Prepare some tools
4. Load the data
5. Analyse it
6. Critique

This is roughly how we will conduct exploratory data analysis! In a real example, we would probably find in step 5 that our analysis did not work, and we would have to go back to step 2 to adapt the questions we want to answer.

## Step 1: Look at the data

First, let's have a look at the data in `01-resources/cheeses.txt` (use your favourite 
text editor, like `notepad`, `notepad++`, `sublime`...). The first few lines look as follows:

```
champignon de luxe garlic,soft,soft-ripened,garlicky,herbaceous,herbal,spicy,cream,...
bleu dauvergne,semi-soft,artisan,buttery,creamy,grassy,herbaceous,salty,spicy,tangy,...
paesanella caciotta,semi-soft,mild,milky,buttery,milky,white,buttery,chewy and soft,...
daphnes aged goat cheese,semi-soft,artisan,piquant,sharp,tart,goaty,dense and smooth,...
olomoucke tvaruzky,soft,soft-ripened,pungent,spicy,strong,yellow,crumbly and soft

```

So the values are **comma-separated** and we have one entry per line, with the first value
being the cheese's name and the remainder its description. This format is easy enough
that we will parse it using only basic Python, no need for a library.
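
As a rough sketch of what "basic Python" means here, the snippet below splits the last sample line shown above (the only one that is not truncated) with `str.split`. The variable names `line`, `name` and `props` are purely illustrative and not part of the exercise.

```python
# The last sample line from the excerpt above, copied verbatim.
line = "olomoucke tvaruzky,soft,soft-ripened,pungent,spicy,strong,yellow,crumbly and soft"

# Split on commas: the first field is the name, the remaining fields are the description.
name, *props = line.split(",")

print(name)        # olomoucke tvaruzky
print(len(props))  # 7
```

Note that a property such as `crumbly and soft` contains spaces but no commas, so it stays in one piece.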

A few notes we should take: the file has 1829 lines, so unless there is something unexpected somewhere in the middle we should get 1829 cheeses out of it.

## Step 2: Come up with questions

What do we want to know about cheese similarity? Let us imagine some scenarios
for the data analysis. First, let's say we have a business and we know that 
the cheese `doublet` is very successful in our target demographic, and we would like to suggest very similar cheeses to them.

> **Question 1:** Which cheese in our dataset is most similar to `doublet`?
  How do they differ?

Second, assume we are a cheese manufacturer that wants to survey the market. In particular, we would like to know what combinations of cheese properties
are successful. For that reason, we would like to find the **most common** property set.

> **Question 2:** What is the most frequent property set?

Finally, let's say we are working for a cheese award that prizes the most
distinct, innovative cheeses. From our dataset, we want to create a shortlist
of around a dozen candidates that should enter the competition. We decide that all cheeses that are different (Jaccard similarity < 1) from **every** other cheese in the set should be on the list.

> **Question 3:** What are the 'unique' cheeses in our dataset?





## Step 3: Preparing tools

There is really only one tool that we need that does not come with Python: 
a function to compute the Jaccard similarity. Let's quickly implement one in the following
cell. Once we run it (successfully), it will be available to all following cells.
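
As a reminder, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The following is a small worked example on two made-up property sets (not taken from the dataset), using Python's built-in set operators `&` and `|`. It is only meant to illustrate the arithmetic, not to prescribe an implementation.

```python
# Two made-up property sets, chosen only for illustration.
a = {"soft", "creamy", "mild"}
b = {"soft", "creamy", "tangy", "salty"}

shared = a & b     # intersection: elements in both sets
combined = a | b   # union: elements in at least one set

print(len(shared), len(combined))   # 2 5
print(len(shared) / len(combined))  # 0.4
```

If both sets are empty, the union is empty as well and the division fails. The test cell expects a similarity of 1 in that case, so it needs to be handled separately.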

> Complete all TODOs in the following cell. You can test the expected result by running the
cell below it afterwards, it should print `Everything fine!`.

In [None]:
# This function is supposed to compute the Jaccard similarity between two sets.
def jaccard(A, B): 
    # TODO: Implement it!
    raise NotImplementedError("jaccard is not implemented yet")

In [None]:
# Do not edit this cell! It will test whether your implementation works as expected.
assert jaccard(set("ABC"), set("BAC")) == 1, "jaccard(..) returned wrong value"
assert jaccard(set("ABC"), set()) == 0, "jaccard(..) returned wrong value"
assert jaccard(set(range(100)), set(range(50))) == .5, "jaccard(..) returned wrong value"
assert jaccard(set(range(100)), set(range(25))) == .25, "jaccard(..) returned wrong value"
assert jaccard(set(), set()) == 1, "Similarity of empty set with itself should be 1"
''.join(reversed("!enif gnihtyrevE"))

## Step 4: Loading the data

For our purposes it makes sense to load the data into a dictionary: as keys we use the
names of the cheeses (they are unique) and as values we will have sets containing the
cheese's properties. We use sets here because we will later compute Jaccard similarity!

The following cell contains incomplete code. Please complete it before continuing.

> Complete all TODOs in the following cell. You can test the expected result by running the
cell below it afterwards, it should print `Everything fine!`.

In [None]:
cheeses = {} # This dictionary will hold the cheese properties (as sets!),
             # using the cheese names as keys.
    
with open('01-resources/cheeses.txt') as f:
    for l in f:
        l = l.rstrip('\n') # Remove the trailing newline character, if any
        # TODO: Split the line `l` (using ',' as the separator).
        #       The first entry is the name of a cheese, the rest its properties.
        #       Store the properties in the dictionary `cheeses` and use the cheese
        #       name as the key.   

In [None]:
# Do not edit this cell! It will test whether your implementation works as expected.
assert len(cheeses) == 1829, "Not the right number of cheeses in the dictionary"
assert "red cloud" in cheeses, "Missing cheese: red cloud"
assert "limburger" in cheeses, "Missing cheese: limburger"
assert "champignon de luxe garlic" in cheeses, "Missing cheese: champignon de luxe garlic"
assert cheeses["red cloud"] == set('semi-hard,artisan,grassy,nutty,barnyardy,goaty,pungent,cream,creamy and firm,washed'.split(',')), \
       "Properties not a set?"
''.join(reversed("!enif gnihtyrevE"))

## Step 5-6: Question 1

We will perform steps 5 and 6 on a per-question basis because the questions do not really relate to each other.

> **Question 1:** Which cheese in our dataset is most similar to `doublet`?
  How do they differ?

Before we get into answering the question fully, let's test our tools. 
First, let us have a look at the `doublet` entry:

In [None]:
cheeses['doublet']

Let us test our implementation of the Jaccard similarity by comparing `doublet` to, say, `cheddar`:

In [None]:
print(sorted(cheeses['doublet']))
print(sorted(cheeses['cheddar']))
jaccard(cheeses['doublet'], cheeses['cheddar'])

If the above two outputs look sensible, our dataset and tools are in order. The above might seem pointless now, but simply testing our methods on simple examples helps to catch mistakes! 

Let's answer Question 1! We want to find the one cheese that is most similar to <a href="https://cheese.com/doublet/">Doublet</a> according to our dataset.

> Complete all TODOs in the following cell or write your own solution.

In [None]:
best_score = -1
most_similar = None
doublet_props = cheeses['doublet']
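for cheese, props in cheeses.items():
    # TODO: Compute jaccard similarity and compare to best_score.
    #       If the score is better, set most_similar accordingly.
    #       Skip 'doublet' itself, which trivially has a similarity of 1.
    pass  # Placeholder so that the cell runs; replace it with your solution.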
print("The most similar cheese is '{}' with a score of {:.2f}".format(most_similar, best_score))

<div style="background: #b3ffb3; padding: 1.2em; text-align:center;">
Your solution should be either <a href="https://cheese.com/shepherds-crook/">this cheese</a> 
or <a href="https://cheese.com/little-rydings/">this cheese</a>.
</div>

###  Critique

The question was rather simple and it looks like we got two instead of one
cheese. That seems good enough for what we set out to do, at least with
this dataset. 


## Step 5-6: Question 2

> **Question 2:** What is the most frequent property set?

There are several ways to approach this. I suggest using the
`Counter` data structure from the `collections` library
([documentation](https://docs.python.org/3.7/library/collections.html#collections.Counter)). Here is a quick example of how it is used:

In [None]:
from collections import Counter

counter = Counter()
for c in 'supercalifragilisticexpialidocious':
    counter[c] += 1
print("The three most common letters are:", counter.most_common(3))

Hint: the elements stored in `cheeses` are sets. Python sets are not hashable, which means that we cannot use them as dictionary keys or `Counter` keys. However, a Python set can easily be converted into a `frozenset` which _is_ hashable.
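
To make the hint concrete, here is a minimal sketch on made-up property sets (the list `toy_sets` is hypothetical and not part of the dataset). It shows the error raised when counting plain sets and how converting to `frozenset` avoids it.

```python
from collections import Counter

# Made-up property sets; the first and the third contain the same elements.
toy_sets = [{"soft", "mild"}, {"hard", "nutty"}, {"mild", "soft"}]

try:
    Counter(toy_sets)  # fails: sets are mutable and therefore not hashable
except TypeError as err:
    print("TypeError:", err)

# Converting each set to a frozenset makes it usable as a Counter key.
toy_counter = Counter(frozenset(s) for s in toy_sets)
print(toy_counter[frozenset({"soft", "mild"})])  # 2
```

Two frozensets with the same elements compare equal and hash identically regardless of element order, which is why the first and third entries are counted together.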

> Write a piece of code that answers Question 2 in the cell below.

In [None]:
# TODO: Answer Question 2

<div style="background: #b3ffb3; padding: 1.2em; text-align:center;">
    Your solution should be a set of properties that appears 12 times in the dataset.
</div>

###  Critique

Go back to your code and output the ten most frequent properties. Do you see a problem? Remember, we asked Question 2 in the context of finding out "what combinations of cheese properties are successful". 

> Identify the problem with our approach and propose a solution. Write a short text in the cell below.


**TODO**: write here

## Step 5-6: Question 3

> **Question 3:** What are the 'unique' cheeses in our dataset?

Recall that we defined 'unique' as a cheese that has Jaccard similarity <1 to *all* other cheeses.
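
One observation that may help: two sets have Jaccard similarity 1 exactly when they contain the same elements, so "similarity < 1" is the same as "the property sets are not equal". The sketch below applies this definition to a made-up dictionary `toy_cheeses` (hypothetical names and properties) using a brute-force comparison of every cheese with every other cheese. It is one possible approach under that assumption, not the only one.

```python
# Made-up data in the same shape as `cheeses`: name -> set of properties.
toy_cheeses = {
    "alpha": {"soft", "mild"},
    "beta": {"soft", "mild"},    # same property set as "alpha"
    "gamma": {"hard", "nutty"},
}

# A cheese is 'unique' if its property set differs from that of every other cheese.
unique = [
    name
    for name, props in toy_cheeses.items()
    if all(props != other_props
           for other_name, other_props in toy_cheeses.items()
           if other_name != name)
]
print(unique)  # ['gamma']
```

The nested loop compares every pair of cheeses, which is quadratic in the number of cheeses but still manageable for a dataset of this size.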

> Write a piece of code that answers Question 3 in the cell below.

###  Critique

How many 'unique' cheeses did you find? Over 1700? Remember, our goal was to shortlist around a dozen cheeses for our award. Can you think of a way 
that identifies cheeses that are 'more different' than other cheeses using Jaccard similarity?

> Identify the problem with our approach and think of a way to create a shortlist of around a dozen cheeses. Implement it in the cell above.

<h1 style="color: orange">🗲 Further challenges 🗲</h1>

If you solved all of the above and you still have plenty of time left, try the following challenges!

1. Write the solution to Question 2 in a single line of code.
2. Repeat the analysis of Question 2 but solve the problem you identified
   in the critique.
3. Write a function that takes a cheese name as input and displays its picture from Cheese.com in the notebook. 

Some hints for the third challenge: 
- The url of the cheese `le cendrillon` is https://cheese.com/le-cendrillon/ 
- Displaying an image as output of a cell works as follows:

In [None]:
from IPython.display import Image
from IPython.display import HTML  # Not used below, but may be handy for the challenge

Image(url="https://cheese.com/media/img/cheese/Le_Cendrillon.jpg")
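
Regarding the first hint, the following sketch builds a page URL from a cheese name. The helper `cheese_url` is hypothetical, and the rule it applies (lower-case the name and replace spaces with hyphens) is only an assumption generalised from the `le cendrillon` example. It may not hold for every cheese, and it says nothing about where the image itself is located.

```python
def cheese_url(name):
    """Guess the Cheese.com page URL for a cheese name (assumed URL pattern)."""
    slug = name.strip().lower().replace(" ", "-")
    return "https://cheese.com/{}/".format(slug)

print(cheese_url("le cendrillon"))  # https://cheese.com/le-cendrillon/
```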