## Module submission header
### Submission preparation instructions 
_Completion of this header is mandatory; failure to complete it incurs a 2-point deduction from the assignment._ Only add plain text in the designated areas, i.e., replacing the relevant 'NA's. You must fill out all group member names and Drexel email addresses in the markdown list below, under the header __Module submission group__. You are required to fill out descriptive notes pertaining to any tutoring support received in the completion of this submission under the __Additional submission comments__ section at the bottom of the header. If no tutoring support was received, leave NA in place. You may also list other optional comments pertaining to the submission at the bottom. _Any disruption of this header's formatting will make your group liable to the 2-point deduction._

### Module submission group
- Group member 1
    - Name: Mangesh Raut
    - Email: mbr63@drexel.edu
- Group member 2
    - Name: Josh Clark
    - Email: jc4577@drexel.edu
- Group member 3
    - Name: Mobin Rahimi
    - Email: mr3596@drexel.edu
- Group member 4
    - Name: Will Wu
    - Email: ww437@drexel.edu

### Additional submission comments
- Tutoring support received: NA
- Other (other): NA

# Assignment group 3: Probabilistic modeling and prediction

## Module B _(39 pts)_ Exploring conditional probability and prediction
In this section we're going to experiment with some recipes data again, which can be obtained from Kaggle:

- https://www.kaggle.com/kaggle/recipe-ingredients-dataset

As usual, they're also in the assignment's directory:

- `./data/train.json`

__B1.__ _(3 pts)_ To start, write a function called `read_recipes`, which takes a string argument called `path_to_recipes_json` that contains the path to a JSON file containing recipe data. The function should use the `json` package to load the data and then return `recipes`, which will be a list of dictionaries containing the converted JSON data.

(_Hint_: This function will be _very similar_ to the one you wrote for _A2 D1_.)

In [None]:
# B1:Function(3/3)

import json

def read_recipes(path_to_recipes_json):

    #--- Your code starts here
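    # Open the file in a context manager so it is closed after loading
    with open(path_to_recipes_json, "r") as recipes_file:
        recipes = json.load(recipes_file)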
    #--- Your code ends here
    
    return recipes

To test your function, let's provide it with the path to the `train.json` data and print the first three recipes.

Your output should look like this:

```
{'id': 10259, 'cuisine': 'greek', 'ingredients': ['feta cheese crumbles', 'garlic', 'seasoning', 'grape tomatoes', 'black olives', 'garbanzo beans', 'pepper', 'purple onion', 'romaine lettuce']} 

{'id': 25693, 'cuisine': 'southern_us', 'ingredients': ['ground pepper', 'ground black pepper', 'vegetable oil', 'plain flour', 'thyme', 'salt', 'green tomatoes', 'milk', 'yellow corn meal', 'eggs', 'tomatoes']} 

{'id': 20130, 'cuisine': 'filipino', 'ingredients': ['butter', 'green chilies', 'cooking oil', 'chicken livers', 'pepper', 'salt', 'grilled chicken breasts', 'garlic powder', 'soy sauce', 'mayonaise', 'yellow onion', 'eggs']}
```

In [None]:
# B1:SanityCheck

recipes = read_recipes('./data/train.json')

for recipe in recipes[:3]:
    print(recipe,"\n")

{'id': 10259, 'cuisine': 'greek', 'ingredients': ['romaine lettuce', 'black olives', 'grape tomatoes', 'garlic', 'pepper', 'purple onion', 'seasoning', 'garbanzo beans', 'feta cheese crumbles']} 

{'id': 25693, 'cuisine': 'southern_us', 'ingredients': ['plain flour', 'ground pepper', 'salt', 'tomatoes', 'ground black pepper', 'thyme', 'eggs', 'green tomatoes', 'yellow corn meal', 'milk', 'vegetable oil']} 

{'id': 20130, 'cuisine': 'filipino', 'ingredients': ['eggs', 'pepper', 'salt', 'mayonaise', 'cooking oil', 'green chilies', 'grilled chicken breasts', 'garlic powder', 'yellow onion', 'soy sauce', 'butter', 'chicken livers']} 



__B2.__ _(7 pts)_ For this problem, you will be creating a data structure to store information about ingredients' co-occurrence within recipes. Instead of building a network structure, you will build an adjacency list structure. In particular, you will construct a data structure that lets you symmetrically store the number of shared recipes between pairs of ingredients:

```
    {
        Ingredient: {
            CoIngredient: NumSharedRecipes
        }
    }
```

To do this, use a default dictionary of counters (`defaultdict(lambda : Counter())`) for the co-ingredient data.

Please write a function `count_coingredients` that takes our `recipes` as an argument (from _B1_). It should return a default dictionary  with the above structure, where the ingredient and co-ingredient counts are populated based on the provided recipes.

In [5]:
# B2:Function(7/7)

from collections import defaultdict, Counter

def count_coingredients(recipes):
    
    #--- Your code starts here
    coingredients = defaultdict(lambda : Counter())
    
    for recipe in recipes:
        for ingredient in recipe['ingredients']:
            for coingredient in recipe['ingredients']:
                coingredients[ingredient][coingredient] += 1
    #--- Your code ends here            
    
    return coingredients
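
As a small illustration of the structure this function builds, the sketch below runs the same counting logic on three made-up recipes. The toy recipes and the names `toy_recipes` and `toy_counts` are assumptions for illustration only and are not part of the assignment data; `defaultdict(Counter)` is used here as an equivalent shorthand for `defaultdict(lambda : Counter())`.

```python
from collections import defaultdict, Counter

# Hypothetical toy recipes (not from train.json), in the same format as B1's output
toy_recipes = [
    {"id": 1, "cuisine": "greek", "ingredients": ["feta", "olives", "tomato"]},
    {"id": 2, "cuisine": "greek", "ingredients": ["feta", "tomato"]},
    {"id": 3, "cuisine": "italian", "ingredients": ["tomato", "basil"]},
]

def count_coingredients(recipes):
    # Every ordered pair (a, b) within a recipe, including (a, a),
    # adds one to coingredients[a][b]
    coingredients = defaultdict(Counter)
    for recipe in recipes:
        for ingredient in recipe["ingredients"]:
            for coingredient in recipe["ingredients"]:
                coingredients[ingredient][coingredient] += 1
    return coingredients

toy_counts = count_coingredients(toy_recipes)

print(toy_counts["feta"]["tomato"])    # feta and tomato share 2 recipes
print(toy_counts["tomato"]["feta"])    # symmetric: also 2
print(toy_counts["tomato"]["tomato"])  # tomato appears in 3 recipes
print(toy_counts["feta"]["basil"])     # never co-occur: 0
```

Because both orderings of each pair are counted, the structure is symmetric, and the diagonal entry `toy_counts[x][x]` holds the number of recipes containing `x` (assuming no recipe lists the same ingredient twice).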

To test the function that you created, we will now apply it to the provided recipes data. Your output should look like the following:

```
# of recipes where romaine lettuce and feta cheese crumbles co-occur:
20

# of recipes where soy sauce and lo mein noodles co-occur:
13

# of recipes in which soy sauce appears (i.e., receipes where it appears with itself):
3296
```

In [6]:
# B2:SanityCheck

coingredients = count_coingredients(recipes)

print("Type of coingredients object:")
print(type(coingredients))
print()
print("# of recipes where romaine lettuce and feta cheese crumbles co-occur:")
print(coingredients['romaine lettuce']['feta cheese crumbles'])
print()
print("# of recipes where soy sauce and lo mein noodles co-occur:")
print(coingredients['soy sauce']['lo mein noodles'])
print()
print("# of recipes in which soy sauce appears (i.e., receipes where it appears with itself):")
print(coingredients['soy sauce']['soy sauce'])

Type of coingredients object:
<class 'collections.defaultdict'>

# of recipes where romaine lettuce and feta cheese crumbles co-occur:
20

# of recipes where soy sauce and lo mein noodles co-occur:
13

# of recipes in which soy sauce appears (i.e., receipes where it appears with itself):
3296


__B3.__ _(2 pts)_ In the response box below answer the following questions:
- Why didn't we choose to construct an adjacency matrix from our co-ingredients data?
- Why was this _adjacency list_ a more efficient choice for computing the co-ingredient frequencies?
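
One way to reason about these questions is to compare how many entries each representation would need to store. The sketch below does this on a made-up set of recipes; the data and variable names are hypothetical, and the comparison counts stored entries only (it ignores per-object overhead in Python).

```python
from collections import defaultdict, Counter

# Hypothetical toy recipes (not from train.json), given as plain ingredient lists
toy_recipes = [
    ["feta", "olives", "tomato"],
    ["soy sauce", "noodles"],
    ["tomato", "basil"],
]

# Adjacency list: only pairs that actually co-occur get an entry
adjacency = defaultdict(Counter)
for ingredients in toy_recipes:
    for a in ingredients:
        for b in ingredients:
            adjacency[a][b] += 1

num_ingredients = len(adjacency)
stored_entries = sum(len(row) for row in adjacency.values())
matrix_cells = num_ingredients ** 2  # a dense matrix allocates every pair

print(num_ingredients)  # 6 distinct ingredients
print(stored_entries)   # 16 entries stored by the adjacency list
print(matrix_cells)     # 36 cells in a dense adjacency matrix
```

Even in this tiny example, most cells of the dense matrix would be zeros. With thousands of distinct ingredients, most pairs never co-occur, so the gap is likely to be far larger.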

In [None]:
# B3:Inline(2/2)

# Which approach would take up less memory to represent the
# co-ingredient data: an adjacency list like the one we used in B2,
# or an adjacency matrix? Print your answer, either "Adjacency List" or "Adjacency Matrix"
print("Adjacency List")

Adjacency List


__B4.__ _(5 pts)_ Next, let's leverage our coingredient counts to start reasoning about ingredients in terms of probabilities. Write a function `prob_ingredient` that takes three arguments: `recipes`, our recipes from _B1_, `coingredients`, which will come from _B2_, and `ingredient`, which specifies the ingredient of interest. This function should return the probability a recipe contains the `ingredient`:

$$P(\text{a recipe contains ingredient } A)$$

This should be computed as:

$$\frac{\text{number of times ingredient }A\text{ is used in any recipe}}{\text{number of recipes in the dataset}}$$

**Hint:** You can find the number of times an ingredient appears across all recipes by looking at the number of times it co-occurs with itself in the `coingredients` structure.

In [7]:
# B4:Function(5/5)

def prob_ingredient(recipes, coingredients, ingredient):
    
    #--- Your code starts here
    probability = coingredients[ingredient][ingredient]/len(recipes)
    #--- Your code ends here
    
    return probability

To test our function, let's compute the probability that a recipe contains "feta cheese" or "soy sauce". Your results should look like the following:
```
Probability that a recipe contains `feta cheese`:
0.0067380700960426405

Probability that a recipe contains `soy sauce`:
0.08286820536028561
```

In [8]:
# B4:SanityCheck

print("Probability that a recipe contains `feta cheese`:")
print(prob_ingredient(recipes, coingredients, "feta cheese"))
print()
print("Probability that a recipe contains `soy sauce`:")
print(prob_ingredient(recipes, coingredients, "soy sauce"))
print()

Probability that a recipe contains `feta cheese`:
0.0067380700960426405

Probability that a recipe contains `soy sauce`:
0.08286820536028561



__B5.__ _(5 pts)_ Now, write a function called `prob_ingredient_pair` that takes `recipes` and `coingredients` just like the function from _B4_. Unlike the previous function, this one will take two ingredients `ingredient_a` and `ingredient_b`. This function should return the probability that a recipe contains both of the specified ingredients:

$$P(\text{a recipe contains ingredients } A \text{ and } B)$$

This should be computed as:

$$\frac{\text{number of times both ingredients were used in any recipe}}{\text{number of recipes in the dataset}}$$

In [9]:
# B5:Function(5/5)

def prob_ingredient_pair(recipes, coingredients, ingredient_a, ingredient_b):
    
    #--- Your code starts here
    num_recipes = len(recipes)
    probability = (coingredients[ingredient_a][ingredient_b]/num_recipes)
    #--- Your code ends here
    
    return probability
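
To make the two formulas from _B4_ and _B5_ concrete, here is a minimal sketch that computes a marginal and a joint probability by hand on four made-up recipes. The toy data and variable names are assumptions for illustration and do not come from `train.json`.

```python
from collections import defaultdict, Counter

# Hypothetical toy recipes (not from train.json)
toy_recipes = [
    {"ingredients": ["feta", "olives", "tomato"]},
    {"ingredients": ["feta", "tomato"]},
    {"ingredients": ["tomato", "basil"]},
    {"ingredients": ["soy sauce", "noodles"]},
]

# Same symmetric co-occurrence counts as in B2
toy_counts = defaultdict(Counter)
for recipe in toy_recipes:
    for a in recipe["ingredients"]:
        for b in recipe["ingredients"]:
            toy_counts[a][b] += 1

n = len(toy_recipes)

# Marginal: recipes containing the ingredient / all recipes
p_feta = toy_counts["feta"]["feta"] / n
p_tomato = toy_counts["tomato"]["tomato"] / n

# Joint: recipes containing both ingredients / all recipes
p_feta_and_tomato = toy_counts["feta"]["tomato"] / n
p_feta_and_basil = toy_counts["feta"]["basil"] / n

print(p_feta)             # 2 of 4 recipes -> 0.5
print(p_tomato)           # 3 of 4 recipes -> 0.75
print(p_feta_and_tomato)  # 2 of 4 recipes -> 0.5
print(p_feta_and_basil)   # never together -> 0.0
```

Note that the joint probability can never exceed either marginal, since every recipe containing both ingredients also contains each one individually.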

To test your function, let's use it to compute the probability that a randomly chosen recipe contains both `feta cheese` and `romaine lettuce`. Your output should look like this:
```
Probability that a recipe contains both `feta cheese` and `romaine lettuce`:
0.0003017046311660884
```

In [10]:
# B5:SanityCheck

print("Probability that a recipe contains both `feta cheese` and `romaine lettuce`:")
print(prob_ingredient_pair(recipes, coingredients, "feta cheese", "romaine lettuce"))

Probability that a recipe contains both `feta cheese` and `romaine lettuce`:
0.0003017046311660884


__B6.__ _(5 pts)_ Next, write a function called `prob_ingredient_given_ingredient` that takes the same arguments as your previous function (`recipes`, `coingredients`, `ingredient_a`, and `ingredient_b`), but instead returns the conditional probability that a recipe contains `ingredient_a` given that it already contains `ingredient_b`. This is the conditional probability:

$$P(\text{a recipe contains ingredient } A\mid\text{ it is a recipe that we know contains ingredient } B)$$

which can be computed as a quotient, following the definition of conditional probability (the same identity that underlies Bayes' rule):

$$
P(\text{a recipe contains ingredient } A\mid\text{it is a recipe that we know contains ingredient } B)=
\frac{P(\text{a recipe contains ingredients } A \text{ and } B)}{P(\text{a recipe contains ingredient } B)}
$$

i.e., using the output of our previous two functions. 
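
It may help to see why this quotient does not depend on the size of the dataset. Writing $n_{AB}$ for the number of recipes containing both ingredients, $n_B$ for the number containing $B$, and $N$ for the total number of recipes (symbols introduced here for illustration only), the formulas from _B4_ and _B5_ give:

$$
P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{n_{AB}/N}{n_B/N} = \frac{n_{AB}}{n_B}
$$

For instance, if a hypothetical ingredient $B$ appeared in 50 recipes, 10 of which also contained $A$, then $P(A \mid B) = 10/50 = 0.2$ regardless of $N$.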

In [11]:
# B6:Function(5/5)

def prob_ingredient_given_ingredient(recipes, coingredients, ingredient_a, ingredient_b):
    
    #--- Your code starts here
    probability = (prob_ingredient_pair(recipes, coingredients, ingredient_a, ingredient_b)/
                   prob_ingredient(recipes, coingredients, ingredient_b))
    #--- Your code ends here

    return probability

Next, let's test if your function works by using it to find the probability that a recipe contains `"feta cheese"`, given we know it contains `"romaine lettuce"`. Your output should look like this:
```
Probability a recipe contains `feta cheese` given that it contains `romaine lettuce`:
0.044444444444444446
```

In [12]:
# B6:SanityCheck

print("Probability a recipe contains `feta cheese` given that it contains `romaine lettuce`:")
print(prob_ingredient_given_ingredient(recipes, coingredients, "feta cheese", "romaine lettuce"))

Probability a recipe contains `feta cheese` given that it contains `romaine lettuce`:
0.044444444444444446


__B7.__ _(7 pts)_ Finally, write a function called `likely_coingredients` that takes three arguments: `recipes`, `coingredients`, and `ingredient`. Given a conditioning `ingredient`, the function should return the conditional probabilities for all ingredients where the conditional probability is non-zero (i.e., that co-occur at least once with the conditioning `ingredient`). The co-ingredients and their likelihoods should be returned in a `Counter()` object named `probs`.

In [13]:
# B7:Function(7/7)

def likely_coingredients(recipes, coingredients, ingredient):
    ## initialize a Counter() for the co-ingredient probabilities
    probs = Counter()
    
    #--- Your code starts here
    for coingredient in coingredients[ingredient]:
        probs[coingredient] = prob_ingredient_given_ingredient(recipes, coingredients, coingredient, ingredient)
    #--- Your code ends here
    
    return probs

Next, let's test this function by using it to find the ten ingredients that have the highest conditional probabilities given that we assume the recipe contains 'feta cheese'. Your output should look like this:
```
[('feta cheese', 1.0),
 ('olive oil', 0.5186567164179104),
 ('salt', 0.42537313432835827),
 ('purple onion', 0.22388059701492538),
 ('tomatoes', 0.1902985074626866),
 ('garlic cloves', 0.1902985074626866),
 ('dried oregano', 0.1902985074626866),
 ('garlic', 0.1791044776119403),
 ('pepper', 0.17537313432835822),
 ('extra-virgin olive oil', 0.17164179104477612)]
```

In [14]:
# B7:SanityCheck

likely_coingredients(recipes, coingredients, "feta cheese").most_common(10)

[('feta cheese', 1.0),
 ('olive oil', 0.5186567164179104),
 ('salt', 0.42537313432835827),
 ('purple onion', 0.22388059701492538),
 ('tomatoes', 0.1902985074626866),
 ('garlic cloves', 0.1902985074626866),
 ('dried oregano', 0.1902985074626866),
 ('garlic', 0.1791044776119403),
 ('pepper', 0.17537313432835822),
 ('extra-virgin olive oil', 0.17164179104477612)]

__B8.__ _(3 pts)_ Using the output from __B7:SanityCheck__, answer the following inline question.

In [15]:
# B8:Inline(3/3)

# Do you interpret these as fitting into a common cuisine of recipes? Print "Yes" or "No".
print("Yes")

Yes


__B9.__ _(2 pts)_ Finally, look at the numeric values from __B7:SanityCheck__. You'll notice that these probabilities do not add up to 1! Answer the following inline question regarding why:

In [16]:
# B9:Inline(2/2)

# Are the occurrences of ingredients from B7:SanityCheck's output mutually exclusive?
# Print "Yes" or "No"
print("No")

No
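
A quick way to check whether conditional probabilities like those in __B7__ must sum to 1 is to compute them on a tiny made-up dataset. The sketch below uses three hypothetical recipes (not from `train.json`) and the count-ratio form of the conditional probability; all names are illustrative.

```python
from collections import defaultdict, Counter

# Hypothetical toy recipes (not from train.json); every recipe contains feta
toy_recipes = [
    ["feta", "olives", "tomato"],
    ["feta", "tomato"],
    ["feta", "olive oil", "tomato"],
]

counts = defaultdict(Counter)
for ingredients in toy_recipes:
    for a in ingredients:
        for b in ingredients:
            counts[a][b] += 1

# P(other | feta) = count(other and feta) / count(feta)
n_feta = counts["feta"]["feta"]
probs = Counter({other: c / n_feta for other, c in counts["feta"].items()})

print(probs.most_common())
print(sum(probs.values()))  # greater than 1 for this toy data
```

Here `tomato` appears in every recipe that contains `feta`, so both have conditional probability 1.0 on their own. A single recipe contributes to the counts of several co-ingredients at once, so these events overlap and their probabilities are not constrained to sum to 1.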
