<a/ id='top'></a>

# CSCI4022 Homework 5; A-Priori

## Due Friday, March 4 at 11:59 pm to Canvas and Gradescope

#### Submit this file as a .ipynb with *all cells compiled and run* to the associated dropbox.

***

Your solutions to computational questions should include any specified Python code and results as well as written commentary on your conclusions.  Remember that you are encouraged to discuss the problems with your classmates, but **you must write all code and solutions on your own**.

**NOTES**: 

- Any relevant data sets should be available on Canvas. To make life easier on the graders if they need to run your code, do not change the relative path names here. Instead, move the files around on your computer.
- If you're not familiar with typesetting math directly into Markdown then by all means, do your work on paper first and then typeset it later.  Here is a [reference guide](https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference) linked on Canvas on writing math in Markdown. **All** of your written commentary, justifications and mathematical work should be in Markdown.  I also recommend the [wikibook](https://en.wikibooks.org/wiki/LaTeX) for LaTeX.
- Because you can technically evaluate notebook cells in a non-linear order, it's a good idea to do **Kernel $\rightarrow$ Restart & Run All** as a check before submitting your solutions.  That way if we need to run your code you will know that it will work as expected. 
- It is **bad form** to make your reader interpret numerical output from your code.  If a question asks you to compute some value from the data you should show your code output **AND** write a summary of the results in Markdown directly below your code. 
- 45 points of this assignment are in problems.  The remaining 5 are for neatness, style, and overall exposition of both code and text.
- This probably goes without saying, but... For any question that asks you to calculate something, you **must show all work and justify your answers to receive credit**. Sparse or nonexistent work will receive sparse or nonexistent credit. 
- There is *not a prescribed API* for these problems.  You may answer coding questions with whatever syntax or object typing you deem fit.  Your evaluation will primarily live in the clarity of how well you present your final results, so don't skip over any interpretations!  Your code should still be commented and readable to ensure you followed the given course algorithm.
- There are two ways to quickly make a .pdf out of this notebook for Gradescope submission.  Either:
 - Use File -> Download as PDF via LaTeX.  This will require your system path to find a working install of a TeX compiler.
 - Easier: Use File ->  Print Preview, and then Right-Click -> Print using your default browser and "Print to PDF"



---
**Shortcuts:**  [Problem 1](#p1) | [Problem 2](#p2) | [Extra Credit](#p3) |
---


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
import itertools as it #may use for .combinations/similar, if desired.

***
<a/ id='p1'></a>
[Back to top](#top)
# Problem 1 (Practice: Candidate Items; 20 pts)

In the A-Priori algorithm, there is a step in which we create a candidate list of frequent itemsets of size $k+1$ as we prune the frequent itemsets of size $k$.  In this problem we will create two functions to do that formally.

#### Part A:

There are two types of data objects in which we might be holding the frequency counts of itemsets.  If $k=2$, they may be stored in a triangular array.  Create a function `cand_trips` that takes a triangular array and returns all valid candidate triples as a list.  Recall that the itemset $\{i,j,k\}$ is only a candidate if all 3 of the itemsets in $\{\{i,j\}, \{i,k\}, \{k,j\}\}$ are frequent.

Some usage notes:

- The first input argument is `triang_counts`,  a zero-indexed triangular (numeric) array, by same convention as introduced in class.
- The second input argument is the positive integer support threshold `s`.
- The underlying itemset is 0-indexed, so e.g. `[0,1,3]` is a valid triple.
- You should not convert the input list `triang_counts` into a list of triples as part of your function.
- The return array `candidates` should be a list of 3-index lists of the item numbers of the triples.  So a final answer for some input might be:

`cand_trips` =
    `[[0,3,4], [1,2,7]]`

- An implementation note: there are two fundamentally different ways to think about implementing this function.  Option 1 involves thinking about the elements of `triang_counts` in terms of their locations on the corresponding *triangular matrix*: scan row $i$ for a pair of frequent pairs $\{\{i,j\}, \{i,k\}\}$ and then check if $\{j,k\}$ is in fact frequent.  Option 2 scans all of `triang_counts` for frequent item pairs (the "pruning" step) and saves those in some object with their indices, then scans *that* object for candidates.  Both are valid for this problem, but option 2 may generalize to higher $k$ better...
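
A minimal sketch of Option 1, assuming the triangular array stores the pairs row by row ($\{0,1\}, \{0,2\}, \dots, \{0,n-1\}, \{1,2\}, \dots$). The names `tri_index` and `cand_trips_by_row` are hypothetical helpers used only for illustration; they are not part of the assignment.

```python
import math

def tri_index(i, j, n):
    """Position of pair {i, j} (with i < j) in a row-major, zero-indexed
    triangular array over n items. Assumed layout, see the lead-in."""
    return i * (n - 1) - i * (i - 1) // 2 + (j - i - 1)

def cand_trips_by_row(triang_counts, s):
    # Recover the number of items n from len(triang_counts) = n * (n - 1) / 2
    n = (1 + math.isqrt(1 + 8 * len(triang_counts))) // 2
    candidates = []
    for i in range(n):
        # Items j > i for which the pair {i, j} is frequent
        partners = [j for j in range(i + 1, n)
                    if triang_counts[tri_index(i, j, n)] >= s]
        # Every two partners j < k give frequent {i, j} and {i, k};
        # the triple is a candidate only if {j, k} is frequent as well
        for a in range(len(partners)):
            for b in range(a + 1, len(partners)):
                j, k = partners[a], partners[b]
                if triang_counts[tri_index(j, k, n)] >= s:
                    candidates.append([i, j, k])
    return candidates
```

Because each triple is discovered from its smallest item $i$, every candidate appears exactly once and already in increasing item order.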

In [2]:
def cand_trips(triang_counts, s):
    # Recover the number of items n from len = n * (n - 1) / 2 (the reverse of int((n - 1) * n / 2) from nb08)
    elements = int((1 + np.sqrt(1 + 8 * len(triang_counts))) / 2)
    pairs = []
    candidates = []
    counter = 0
    # Pruning step: walk the triangular array in row order and keep the frequent pairs
    for i in range(elements):
        for j in range(i + 1, elements):
            if triang_counts[counter] >= s:
                pairs.append((i, j))
            counter += 1

    # Candidate step: find three frequent pairs a < b < c that together span exactly three items
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            # b must share an item with a
            if pairs[b][0] in pairs[a] or pairs[b][1] in pairs[a]:
                for c in range(b + 1, len(pairs)):
                    # both items of c must already appear in a or b
                    if pairs[c][0] in pairs[a] or pairs[c][0] in pairs[b]:
                        if pairs[c][1] in pairs[a] or pairs[c][1] in pairs[b]:
                            # sorted() keeps each triple in increasing item order
                            out = sorted(set(pairs[a] + pairs[b] + pairs[c]))
                            candidates.append(out)

    return candidates

#### Part B:

A quick test case.  Below is a matrix $C$ and code including its corresponding triangular array.

$C=\begin{bmatrix}
\cdot &10&7&3&2\\
\cdot &\cdot&6&4&3\\
\cdot &\cdot&\cdot&3&6\\
\cdot &\cdot&\cdot&\cdot&0\\
\cdot &\cdot&\cdot&\cdot&\cdot\\
\end{bmatrix}$
 
Input the given list into your function to verify that it returns the correct valid triples at $s=1$ and $s=6$.

In [3]:
triang_counts=[10,7,3,2,6,4,3,3,6 ,0]

print('For s>=6, candidate:', cand_trips(triang_counts, 6))
print('For s>=1, candidates:', cand_trips(triang_counts, 1))

#(10,7,3,2,6,4,3,3,6,0)
#((0, 1), (0, 2), (1, 2), (2, 4))
#(0, 1, 2)

#cand_trips(triang_counts, 1) returns all the possible triples except those that contain BOTH items 3 and 4.
#cand_trips(triang_counts, 6) returns only the triple [[0,1,2]]

For s>=6, candidate: [[0, 1, 2]]
For s>=1, candidates: [[0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 2, 4], [1, 2, 3], [1, 2, 4]]


#### Part C:

Suppose instead that our $k=2$ item counts were stored in a list of the form e.g.
`pairs_counts` =
    `[[0,1,12], [0,2,0], [0,3,11], ..., [7,8,103]]`
    
Where each element is a triple storing the two item indices and their count, $[i,j,c_{ij}]$. 

Create a function `cand_trips_list` that takes in a list of pairs counts and returns all valid candidate triples as a list.  

Some usage notes:

- The first input argument is `pairs_counts`,  a zero-indexed list of triples.
- The second input argument is the positive integer support threshold `s`.
- The underlying itemset is 0-indexed, so e.g. `[0,1,3]` is a valid triple.
- The return array `candidates` should be a list of 3-element lists, as above.

You should **not** convert the input list `pairs_counts` into a triangular array as part of your function.  After all, sometimes we use the list format for pairs because it saves memory compared to the triangular array format!  You may be able to borrow heavily from the logic of your first function, though!

In [4]:
def cand_trips_list(pairs_counts, s):
    candidates = []
    # Pruning step: keep the item indices (dropping the count) of each frequent pair
    pairs = [list(pair[:-1]) for pair in pairs_counts if pair[2] >= s]

    # Candidate step: find three frequent pairs a < b < c that together span exactly three items
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            # b must share an item with a
            if pairs[b][0] in pairs[a] or pairs[b][1] in pairs[a]:
                for c in range(b + 1, len(pairs)):
                    # both items of c must already appear in a or b
                    if pairs[c][0] in pairs[a] or pairs[c][0] in pairs[b]:
                        if pairs[c][1] in pairs[a] or pairs[c][1] in pairs[b]:
                            # sorted() keeps each triple in increasing item order
                            out = sorted(set(pairs[a] + pairs[b] + pairs[c]))
                            candidates.append(out)

    return candidates


#### Part D:

Do the test case again.  Below is the list representation of the same matrix $C$ from part B.
 
Input the given list into your function to verify that it returns the correct valid triples at $s=1$ and $s=6$.

In [5]:
pairs_counts=[[0,1,10], [0,2,7], [0,3,3], [0,4,2],\
             [1,2,6],[1,3,4], [1,4,3],\
             [2,3,3],[2,4,6],\
             [3,4,0]]
print(cand_trips_list(pairs_counts, 6))
print(cand_trips_list(pairs_counts, 1))
#Check that...
#cand_trips_list(pairs_counts, 1) returns all the possible triples except those that contain BOTH items 3 and 4.
#cand_trips_list(pairs_counts, 6) returns only the triple [[0,1,2]]

[[0, 1, 2]]
[[0, 1, 2], [0, 1, 3], [0, 1, 4], [0, 2, 3], [0, 2, 4], [1, 2, 3], [1, 2, 4]]


#### Part E

Describe *in words* how you would generalize your code in part D to work for generating candidate quadruples $[i_1, i_2, i_3, i_4]$ from an input list of triples counts (each element of the form $[i, j, k, c_{ijk}]$).

I would keep the same structure but work with frequent triples instead of frequent pairs. Each membership check in my `if` statements would gain another `or` clause so that the third item index, `pairs[x][2]`, is also compared against the earlier triples. A quadruple has four sub-triples, so I would also need a fourth loop, `for d in range(c + 1, len(pairs))`, containing its own `if` statement. The `tempTuples`, `out` and `append` lines would move inside that innermost loop.
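
One way this generalization could look for any itemset size, sketched under assumptions: the input is a list of rows `[i_1, ..., i_k, count]`, and a size-$(k+1)$ set is a candidate only if all of its size-$k$ subsets are frequent. The name `candidates_next_size` is hypothetical, and this is a set-based alternative to nested loops rather than the approach used in the cells above.

```python
from itertools import combinations

def candidates_next_size(itemset_counts, s):
    """Candidate (k+1)-itemsets from rows of the form [i_1, ..., i_k, count]."""
    # Pruning step: keep the frequent k-itemsets as sorted tuples
    frequent = {tuple(sorted(row[:-1])) for row in itemset_counts if row[-1] >= s}
    if not frequent:
        return []
    k = len(next(iter(frequent)))

    candidates = set()
    # Join step: two frequent k-itemsets that differ in one item give a (k+1)-itemset
    for a, b in combinations(sorted(frequent), 2):
        union = tuple(sorted(set(a) | set(b)))
        if len(union) != k + 1:
            continue
        # Check step: every k-subset of the union must itself be frequent
        if all(sub in frequent for sub in combinations(union, k)):
            candidates.add(union)
    return [list(c) for c in sorted(candidates)]

# Hypothetical triples counts: only {0, 1, 2, 3} has all four sub-triples frequent
triples_counts = [[0, 1, 2, 5], [0, 1, 3, 5], [0, 2, 3, 5], [1, 2, 3, 5], [0, 1, 4, 5]]
print(candidates_next_size(triples_counts, 5))  # [[0, 1, 2, 3]]
```

The same function applied to a list of pairs counts returns candidate triples, so no code changes are needed when $k$ grows.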


***
<a/ id='p2'></a>
[Back to top](#top)
# Problem 2 (Practice: A-Priori; 25 pts) 

Consider the recipe data set provided in `recipes.npy` (use `np.load`).  This includes 100,000 recipes from a variety of sources.

We want to use the baskets and the ingredients therein (see `ingredients.npy`) to perform an item basket analysis.

This data set is small enough to run directly from main memory, so you may do that if you wish.

Loading and accessing the data set is shown below:

In [6]:
ingredients=np.load('../ingredients.npy', allow_pickle = True)
recipes=np.load('../recipes.npy', allow_pickle = True)

In [7]:
print(recipes[:2]) #list of lists
print(ingredients[:2]) #inventory list
print(ingredients[recipes[1]]) #to access a recipe by string
# print(ingredients[0])

[array([ 233, 2754,   42,  120,  560,  345,  150, 2081,   12,   21])
 array([ 198,  249,    2,  194, 1884,  791,  965,  423,   53,   48,  798,
          31,  362, 1031,   94,   26,    8])                             ]
['salt' 'pepper']
['balsamic vinegar' 'boiling water' 'butter' 'cooking spray'
 'crumbled gorgonzola' 'currants' 'gorgonzola' 'grated orange' 'kosher'
 'kosher salt' 'orange rind' 'parsley' 'pine nuts' 'polenta' 'toasted'
 'vinegar' 'water']




#### a) Since the ingredients file already provides integer codes for each of our items, we can move directly into counting via the A-Priori algorithm.  Using the two given files, create a table of frequent single items at 1% support threshold. You may use Python's native classes to set up your lookup functions/tables.

Was 1% an appropriate support threshold?  Describe why or why not.  Keep in mind, the goal here is twofold: you want "actionable" conclusions, and output that's small enough that you or your grader can make sure that you have the right set!


In [8]:
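def freqItems(s):
    # First pass: count how many times each ingredient name appears across all recipes
    occurances = {}
    for i in range(0, len(recipes)):
        temp = ingredients[recipes[i]]
        for j in range(len(temp)):
            if(temp[j] in occurances):
                occurances[temp[j]] += 1
            else:
                occurances[temp[j]] = 1

    # Keep only the ingredients whose support (count / number of recipes) is at least s
    df = pd.DataFrame.from_dict(occurances, columns = ["Count"], orient = 'Index')
    dfPruned = df[df["Count"] / len(recipes)  >= s]
    # Move the ingredient names out of the index into their own column
    dfNewIndicies = dfPruned.reset_index()
    dfNewColumns = dfNewIndicies.rename(columns = {"index":"Ingredient"})
    return dfNewColumns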

In [9]:
thingy = freqItems(.01)
print(thingy)

       Ingredient  Count
0    basil leaves   1413
1          leaves   6635
2      mozzarella   2729
3        rosemary   2159
4          sliced  16155
..            ...    ...
288  coconut milk   1066
289       topping   1066
290  strawberries   1544
291     asparagus   1009
292   dried basil   1181

[293 rows x 2 columns]


I think that 1% is a poor support threshold because too many ingredients (293) clear it. The more ingredients that count as 'frequent', the more time it takes to find frequent pairs, and the less actionable the conclusions we can derive from those pairs become.
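
To put a number on that cost, assuming every pair of frequent items becomes a candidate pair, $m$ frequent items produce $\binom{m}{2} = \frac{m(m-1)}{2}$ candidates. A quick check with the 293 items found at the 1% threshold, next to a set of 11 frequent items for comparison (`candidate_pair_count` is a hypothetical helper name):

```python
from math import comb

def candidate_pair_count(m):
    """Number of candidate pairs generated by m frequent single items."""
    return comb(m, 2)

print(candidate_pair_count(293))  # 42778
print(candidate_pair_count(11))   # 55
```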


#### b) Use A-priori to find all frequent  pairs of items from your set of frequent items in a).  Use whatever support threshold you feel is most appropriate, but make sure your result is readable: you should list the top handful of most frequent pairs, sorted by their prevalence.

Report the confidences of the two association rules corresponding to the most frequent item pair.
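
A minimal sketch of the two A-Priori passes on toy data, assuming baskets are lists of integer item codes and the support threshold `s` is a fraction of the number of baskets. The names `apriori_pairs` and `confidence` are hypothetical. The second pass counts pairs basket by basket and only among items that survived the first pass.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    n = len(baskets)
    # Pass 1: count single items (each item at most once per basket)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, c in item_counts.items() if c / n >= s}

    # Pass 2: count only pairs whose two items are both frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {p: c for p, c in pair_counts.items() if c / n >= s}
    return item_counts, frequent_pairs

def confidence(item_counts, pair_counts, i, j):
    """conf(i -> j) = count({i, j}) / count({i})."""
    pair = tuple(sorted((i, j)))
    return pair_counts[pair] / item_counts[i]

# Toy baskets, not the recipe data
baskets = [[0, 1, 2], [0, 1], [1, 2], [0, 1, 3], [3]]
item_counts, frequent_pairs = apriori_pairs(baskets, 0.4)
print(frequent_pairs)                                 # {(0, 1): 3, (1, 2): 2}
print(confidence(item_counts, frequent_pairs, 0, 1))  # 1.0
print(confidence(item_counts, frequent_pairs, 1, 0))  # 0.75
```

Note that the two rules for the same pair share a numerator, so the rule whose left-hand item is rarer always has the higher confidence.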


In [10]:
def freqPairs(df, arrI, arrR):
    pairCount = {}
    # Candidate pairs: every unordered pair of frequent single items
    items = df["Ingredient"].to_list()
    pairs = [(a, b) for idx, a in enumerate(items) for b in items[idx + 1:]]
    # Count, for each candidate pair, the recipes that contain both ingredients
    for pair in pairs:
        for i in range(len(arrR)):
            if(pair[0] in arrI[arrR[i]] and pair[1] in arrI[arrR[i]]):
                if pair in pairCount:
                    pairCount[pair] += 1
                else:
                    pairCount[pair] = 1
                    
    # Tabulate the counts and sort ascending, so the most frequent pairs come last
    df = pd.DataFrame.from_dict(pairCount, columns = ["Count"], orient = 'Index')
    dfNewIndicies2 = df.reset_index()
    dfNewColumns2 = dfNewIndicies2.rename(columns = {"index":"Pairs"})
    dfNewColumns2.sort_values('Count', inplace = True)
    return dfNewColumns2

thingy2 = freqItems(.165)
print(thingy2)
thingy3 = freqPairs(thingy2, ingredients, recipes)

   Ingredient  Count
0      butter  29030
1       water  19771
2      garlic  29054
3       olive  20249
4   olive oil  20118
5       onion  20950
6      pepper  38472
7        salt  42163
8      ground  20674
9       sugar  32748
10      flour  22696


In [11]:
print(thingy3.tail(8))

                 Pairs  Count
23      (garlic, salt)  14388
6       (butter, salt)  14513
51       (salt, flour)  14845
46    (pepper, ground)  15020
50       (salt, sugar)  15506
22    (garlic, pepper)  19276
27  (olive, olive oil)  19981
45      (pepper, salt)  22494


In [12]:
lastPair = thingy3.iloc[-1].Pairs

pairSupport = (thingy3.iloc[-1].Count / len(recipes)) 
itemOneFull = thingy2.loc[thingy2["Ingredient"] == lastPair[0]]
itemOneValue = itemOneFull.Count.tolist()
itemOneSupport = itemOneValue[0] / len(recipes)
oneToTwo = pairSupport / itemOneSupport

print("The confidence of {} -> {} is {}".format(lastPair[0], lastPair[1], oneToTwo))

itemTwoFull = thingy2.loc[thingy2["Ingredient"] == lastPair[1]]
itemTwoValue = itemTwoFull.Count.tolist()
itemTwoSupport = itemTwoValue[0] / len(recipes)
twoToOne = pairSupport / itemTwoSupport

print("The confidence of {} -> {} is {}".format(lastPair[1], lastPair[0], twoToOne))

The confidence of pepper -> salt is 0.5846849656893325
The confidence of salt -> pepper is 0.5335009368403577


**c)**

Zach has to go to the store and stock his pantry.  He knows that his girlfriend has a (borderline unhealthy to those around her) love of garlic.  What should he purchase and keep in stock?  What are the two most frequent $\{garlic, x\}$ item pairs, and what are the two most **interesting** $garlic \to X$ associations?
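
A small sketch of the interest calculation, assuming the definition $\text{Interest}(I \to j) = \text{conf}(I \to j) - S(j)$, where $S(j)$ is the fraction of baskets that contain $j$. The function name `interest` is a hypothetical helper. The usage line plugs in the garlic, pepper and (garlic, pepper) counts reported in part b), over 100,000 recipes.

```python
def interest(pair_count, antecedent_count, consequent_count, n_baskets):
    """Interest of the rule antecedent -> consequent."""
    # conf(antecedent -> consequent): share of antecedent baskets that also hold the consequent
    conf = pair_count / antecedent_count
    # Subtract the baseline rate at which the consequent appears in any basket
    return conf - consequent_count / n_baskets

# garlic -> pepper: pair count 19276, garlic count 29054, pepper count 38472
print(round(interest(19276, 29054, 38472, 100_000), 3))
```

The antecedent count (here garlic) goes in the denominator of the confidence, while the consequent count only enters through the baseline term.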

In [63]:
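def findGarlic(arr, arr2, arr3, term):
    garlicPairs = []
    interests = []
    # Collect the rows of the pairs table whose pair contains `term`
    for row in arr.iterrows():
        if term in row[1][0]:
            garlicPairs.append(row)
                   
    for pair in garlicPairs:
        # Look up the single-item count of the other item in the pair
        if(pair[1][0][1] != term):
            row = arr2.loc[arr2["Ingredient"] == pair[1][0][1]]
            number = row.Count.tolist()
        else:
            row = arr2.loc[arr2["Ingredient"] == pair[1][0][0]]
            number = row.Count.tolist()

        # Score = (pair support / support of the other item) - support of the other item,
        # i.e. conf(other -> term) - S(other). The interest of term -> other would instead
        # divide the pair support by the support of `term`.
        interests.append((pair[1][0], ( (pair[1][1] / len(arr3)) / (number[0] / len(arr3)) ) - (number[0] / len(arr3)) ))
    return interests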
          
thingy4 = findGarlic(thingy3, thingy2, recipes, "garlic")
dfGarlicPairs = pd.DataFrame(thingy4, columns = ["pairs", "interest"])
dfTopTwo = dfGarlicPairs.nlargest(3, ["interest"])
print(dfTopTwo)

                 pairs  interest
6  (garlic, olive oil)  0.396891
7      (garlic, olive)  0.394428
5      (garlic, onion)  0.291837


The most interesting pairs are (garlic, olive oil) and (garlic, olive). However, I am not sure whether the garlic and olive pair is a mistake, because I wouldn't expect many recipes to use whole olives. My guess is that my algorithm for finding frequent ingredients mistakenly counted every entry containing 'olive oil' towards the ingredient 'olive' as well. This would explain why there are more counts of 'olive' than 'olive oil', which otherwise wouldn't make sense. So instead, I would say the most interesting pairs are (garlic, olive oil) and (garlic, onion), which aren't all that interesting, but whatever.

***
<a/ id='p3'></a>
[Back to top](#top)
# Problem 3 (Extra-Credit: A-Priori with hashing and more baskets; 10 pts each part) 

The data set in problem 2 had two very appealing properties that we typically do **not** assume to be the case:
- It came with an ingredient list provided
- It was small enough to fit into main memory.

To fully implement the model, you can get some extra credit by attempting variants of the data that do not have those properties.  We will tackle each problem individually.  You should answer each problem *in its own separate notebook* to ensure you're not using any variables from your solution to problem 2 above.

## EC1: A-P with hashing

#### EC1a) The file `recipesbying` contains the same data set as in problem 2, but the strings themselves live in each recipe.

Create a hash table as in nb08 that hashes each ingredient observed based on its string. In other words, create your own version of what **was** in `ingredients.npy` by creating your own hash and/or lookup functions.
Include a check to minimize and fix any collisions, as in nb08.
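
One possible shape for such a table, sketched under assumptions: a simple polynomial string hash with linear probing to resolve collisions. nb08 may use a different hash function or collision strategy; `string_hash`, `build_table` and `lookup` are hypothetical names, and the two baskets are toy data.

```python
def string_hash(name, n_buckets):
    """Polynomial rolling hash of a string, reduced modulo the table size."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) % n_buckets
    return h

def build_table(baskets, n_buckets):
    """Give every distinct ingredient string a slot; the slot number is its integer code."""
    table = [None] * n_buckets
    for basket in baskets:
        for name in basket:
            start = string_hash(name, n_buckets)
            # Linear probing: on a collision, step to the next slot until one fits
            for step in range(n_buckets):
                slot = (start + step) % n_buckets
                if table[slot] is None:
                    table[slot] = name
                    break
                if table[slot] == name:
                    break
            else:
                raise ValueError("table is full; increase n_buckets")
    return table

def lookup(table, name):
    """Integer code of an ingredient, following the same probe sequence."""
    n_buckets = len(table)
    start = string_hash(name, n_buckets)
    for step in range(n_buckets):
        slot = (start + step) % n_buckets
        if table[slot] is None:
            break
        if table[slot] == name:
            return slot
    raise KeyError(name)

baskets = [["salt", "pepper"], ["salt", "garlic", "olive oil"]]
table = build_table(baskets, n_buckets=11)
encoded = [[lookup(table, name) for name in basket] for basket in baskets]
decoded = [[table[code] for code in basket] for basket in encoded]
print(decoded == baskets)  # True
```

Checking that decoding the encoded baskets reproduces the original strings is a cheap way to confirm that no two ingredients ended up sharing a code.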



**EC1b)** Use A-priori to find all frequent items and all frequent pairs of items from your hashed data set in part EC1a).  Ensure that the results match those of problem 2.



## EC2: A-P with massive data

The `.npz` file `simplified-recipes-1M.npz` contains over 1 million recipes, and is the original source of the 100,000 recipes used in problem 2.  Using this file (and `ingredients.npy`, if desired), use A-priori to find all frequent items and all frequent pairs of items.  However, you should **not** load all of the file into main memory.  Instead, use `np.memmap` or other options to ensure that you never load into main memory more than 100,000 recipes at a time.  Include any processing in your submission, and use the same proportionate support threshold as you did in problem 2.  Do the most common items differ?

A few notes: 
- If you process the data to make it readable in other forms (`.npy`, `.csv`, etc.), that's fine, but show all processing code in your submission.
- For example, if you find `np.memmap` hard to get working, you may convert to `.csv` and use `pd.read_csv` with arguments `chunksize` or `skiprows`, `nrows`.
- You may be able to do the problem with very little additional work if you are clever about how you open the file and read over it.  In this case, set up your "loop" over baskets to only go over 100,000 rows of the file at a time, though, and be very explicit as to how you're avoiding the larger objects ever entering main memory.
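
A rough sketch of the chunked counting loop, assuming the baskets arrive from some lazily read source (a generator here). How the chunks are actually read from `simplified-recipes-1M.npz` is left open; `iter_chunks` and `count_items_streaming` are hypothetical names, and the five baskets are toy data.

```python
from collections import Counter
from itertools import islice

def iter_chunks(basket_iter, chunk_size):
    """Yield lists of at most chunk_size baskets without materializing the whole source."""
    basket_iter = iter(basket_iter)
    while True:
        chunk = list(islice(basket_iter, chunk_size))
        if not chunk:
            return
        yield chunk

def count_items_streaming(basket_iter, chunk_size):
    """First A-Priori pass: only one chunk of baskets is held in memory at a time."""
    counts = Counter()
    n_baskets = 0
    for chunk in iter_chunks(basket_iter, chunk_size):
        n_baskets += len(chunk)
        for basket in chunk:
            counts.update(set(basket))  # each item at most once per basket
    return counts, n_baskets

def toy_baskets():
    # Stand-in for a lazy reader over the large file
    yield from [[1, 2], [2, 3], [1, 2, 3], [2], [3, 3]]

counts, n_baskets = count_items_streaming(toy_baskets(), chunk_size=2)
print(counts[2], n_baskets)  # 4 5
```

Only the running `Counter` and the current chunk persist between iterations, so the memory footprint depends on `chunk_size` and the number of distinct items, not on the total number of recipes.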