In [3]:
import numpy as np
import pandas as pd

# HW9: Review

## Problem 1: Curse of Greed

This is a simple game involving a single unbiased die. Your *payoff* is the value that comes up on each roll. Continue rolling and progressively adding each roll to your total score; **beware of the two (2)!** If you roll a two, your score is returned to zero and the game ends.

You can avoid this by opting out of the game at any time. Opting out means that you bank your current score; your opponents may continue to accumulate points or be promptly returned to zero if a two appears. Each game ends when a two is rolled. The player with the highest banked score wins.

### What to Expect?

Would you opt out and bank your score if you only had 6 points? What about 9 points? 

You may have observed that it is reasonably common to go back to zero even for these relatively low totals. However, the player should probably consider themselves unlucky (or maybe greedy!!).

**Question:**
1. Calculate the expectation of the payoff for the next roll if you currently have $S$ points. If the next roll is a 2, your payoff is $-S$.
2. Determine the value of $S$ for which your expected return would be zero.

**Answer here**

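1) Expected payoff of the next roll, given a current score of $S$:

$$E[\text{payoff}] = 1\cdot\tfrac{1}{6} + (-S)\cdot\tfrac{1}{6} + 3\cdot\tfrac{1}{6} + 4\cdot\tfrac{1}{6} + 5\cdot\tfrac{1}{6} + 6\cdot\tfrac{1}{6} = \frac{19 - S}{6}$$

2) Setting the expected payoff to zero and solving for $S$:

$$0 = \frac{19 - S}{6} \quad\Rightarrow\quad S = 19$$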
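
As a quick check of the arithmetic above, here is a minimal sketch that evaluates the expected payoff exactly with fractions. The function name `expected_next_roll_payoff` and the sample scores are illustrative choices, not part of the assignment.

```python
from fractions import Fraction


def expected_next_roll_payoff(score):
    """Expected change in score from one more roll of a fair die.

    Rolling a 2 loses the whole current score (payoff -score);
    every other face adds its own value.
    """
    p = Fraction(1, 6)
    gain = sum(face * p for face in (1, 3, 4, 5, 6))
    return gain - score * p


# Expected payoff at a few sample scores (positive means rolling is favourable).
for s in (6, 9, 19, 25):
    print(s, expected_next_roll_payoff(s))

# Smallest score at which the expected payoff of another roll is exactly zero.
break_even = next(s for s in range(100) if expected_next_roll_payoff(s) == 0)
print("break-even score:", break_even)
```

Below the break-even score another roll has positive expected payoff, and above it the expected payoff is negative.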


## Problem 2: Naïve Bayes 

Imagine that a team of researchers recently conducted a survey of Gimme! customers. They asked 90 customers three binary questions about their drink: whether it was hot, bitter, and caffeinated.

The following table provides information about the frequencies of different conditions for different types of drinks. Note that hot/bitter/caffeinated are not exclusive categories, so the marginal counts do not sum to 90, while coffee/tea/hot chocolate *are* mutually exclusive.

| Type | Hot | Bitter | Caffeinated | Total |
| --- | --- | --- | --- | --- |
| Coffee | 25 | 35 | 35 | 40 |
| Tea | 25 | 10 | 15 | 30 |
| Hot Chocolate | 20 | 5 | 0 | 20 |
| **Total** | 70 | 50 | 50 | 90 |


* What is the probability that a drink is hot given that it is coffee?

* What is the probability that a drink is coffee given that it is hot?

* Assuming independence, what is the probability that a drink is hot *and* bitter given that it is coffee?

You get a drink from Gimme! that is hot, bitter, and caffeinated. Using Naive Bayes, do the following:
* Calculate the probability that this drink is coffee. Then, calculate the probability of tea. Do the same for hot chocolate. Print out each probability in an informative way. **Confidence check: probabilities should sum to 1, and hot chocolate should have probability zero.**



* **Laplace Correction**: In lecture 21, we discussed how zeros can complicate using Naive Bayes. We can use Laplace correction if we have zeroes in our data (see the 0 for P(Hot Chocolate | Caffeinated)). 
Imagine that you recorded three additional drinks, one coffee, one tea, and one hot chocolate, all three of which were hot, bitter, and caffeinated.
Recalculate the conditional probabilities for coffee, tea, and hot chocolate for an additional drink that is hot, bitter, and caffeinated. Print these values again. 

* What about your results changes when you incorporate Laplace correction?  Why is Laplace correction useful? 


* **Log probabilities**: Calculate the *log* probability of 15 hot, bitter, caffeinated coffees and 12 hot, bitter, caffeinated teas with Laplace correction (the joint probabilities, not the conditional probabilities). Print the log probability and the probability (i.e. the exponential of the log).

* Describe what differences you notice between calculating with and without taking the log. Why are log probabilities helpful for Naive Bayes? 




**Answer here**

1) Calculate the probability that this drink is coffee. Then, calculate the probability of tea. Do the same for hot chocolate. Print out each probability in an informative way. Confidence check: probabilities should sum to 1, and hot chocolate should have probability zero.  

For each drink type, the posterior is the Naive Bayes numerator divided by the evidence:

P(type | hot, bitter, caffeinated) = P(hot, bitter, caffeinated | type) × P(type) / P(hot, bitter, caffeinated)

The evidence P(hot, bitter, caffeinated) is the sum of the three numerators: 0.21267361111111 + 0.046296296296296 + 0 = 0.25896990740741.

P(coffee | hot, bitter, caffeinated):

(25/40) × (35/40) × (35/40) × (40/90) = 0.21267361111111, and 0.21267361111111 / 0.25896990740741 = `0.82122905027932`

P(tea | hot, bitter, caffeinated):

(25/30) × (10/30) × (15/30) × (30/90) = 0.046296296296296, and 0.046296296296296 / 0.25896990740741 = `0.17877094972067`

P(HC | hot, bitter, caffeinated):

(20/20) × (5/20) × (0/20) × (20/90) = 0, and 0 / 0.25896990740741 = `0`
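
The hand calculation above can be reproduced in a few lines. This is a minimal sketch using only the counts from the survey table; the names `counts`, `joint`, and `posteriors` are illustrative.

```python
# Counts from the survey table: (hot, bitter, caffeinated, total) per drink type.
counts = {
    "coffee": (25, 35, 35, 40),
    "tea": (25, 10, 15, 30),
    "hot chocolate": (20, 5, 0, 20),
}
n_drinks = 90


def joint(hot, bitter, caffeinated, total):
    """P(hot, bitter, caffeinated | type) * P(type), assuming the three
    features are independent given the drink type (the Naive Bayes assumption)."""
    return (hot / total) * (bitter / total) * (caffeinated / total) * (total / n_drinks)


joints = {drink: joint(*c) for drink, c in counts.items()}
evidence = sum(joints.values())  # P(hot, bitter, caffeinated)
posteriors = {drink: j / evidence for drink, j in joints.items()}

for drink, p in posteriors.items():
    print(f"P({drink} | hot, bitter, caffeinated) = {p:.4f}")
print(f"sum = {sum(posteriors.values()):.4f}")
```

Because the caffeinated count for hot chocolate is 0, its numerator is exactly 0 regardless of the other two features.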






In [18]:
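# Laplace correction: add the three extra hot, bitter, caffeinated drinks
# (one per type), so every feature count and class total grows by 1 and
# the overall total grows from 90 to 93.
joint_c = (26/41) * (36/41) * (36/41) * (41/93)
joint_t = (26/31) * (11/31) * (16/31) * (31/93)
joint_hc = (21/21) * (6/21) * (1/21) * (21/93)

# Evidence: total probability of a hot, bitter, caffeinated drink.
denom = joint_c + joint_t + joint_hc

a = joint_c / denom
print("Prob(C) = " + str(a))

b = joint_t / denom
print("Prob(T) = " + str(b))

c = joint_hc / denom
print("Prob(HC) = " + str(c))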

Prob(C) = 0.7988484892172525
Prob(T) = 0.18976512909798454
Prob(HC) = 0.011386381684762933


What about your results changes when you incorporate Laplace correction?  Why is Laplace correction useful?

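My results changed because the probability for hot chocolate is no longer 0 (it is now about 0.011), and the probabilities for coffee and tea shifted slightly to make room for it. Laplace correction is useful because Naive Bayes multiplies the conditional probabilities together, so a single zero count forces the whole product to zero no matter what the other features say. A zero also breaks log probabilities, because the log of 0 is undefined.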

In [7]:
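# Log of the joint probability of 15 coffees and 12 teas (Laplace-corrected).
# The log of a product is the sum of the logs, so each drink's log joint
# probability is multiplied by the number of times that drink occurs.
log_probability = 15 * np.log((26/41) * (36/41) * (36/41) * (41/93)) + 12 * np.log((26/31) * (11/31) * (16/31) * (31/93))

print(log_probability)
print(np.exp(log_probability))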

-58.68305860318951
3.267920384835903e-26


Describe what differences you notice between calculating with and without taking the log. Why are log probabilities helpful for Naive Bayes? 
 
Without the log, the joint probability is a tiny number (about 3.3e-26) that is hard to read and compare. With the log, the same quantity is about -58.68, which is a moderate magnitude. Log probabilities are helpful for Naive Bayes because the method multiplies many probabilities between 0 and 1, so the product quickly approaches 0 and can eventually underflow in floating point. Taking logs turns the product into a sum, which stays in a representable range and is easier to interpret.
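
To make the underflow point concrete, here is a small sketch with made-up numbers: 1,000 independent observations, each with probability 0.05. Both values are assumptions chosen only to force the underflow.

```python
import math

# Hypothetical illustration: n independent observations, each with probability p.
p, n = 0.05, 1000

# Multiplying directly: the running product drops below the smallest
# representable positive float and becomes exactly 0.0.
product = 1.0
for _ in range(n):
    product *= p

# Summing logs: the same quantity on the log scale stays finite.
log_prob = n * math.log(p)

print(product)   # 0.0 (underflow)
print(log_prob)
```

Once the product has underflowed, every class would score 0 and could not be compared, whereas the log scores can still be ranked.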

# SVD with Recipes
Below, we've loaded a dataset of ingredients in different recipes called `recipes_df`. In this dataframe, each column is an ingredient and each row is a recipe. We have binary values for each ingredient. This tells us whether that ingredient is (1) or is not (0) used in the given recipe.

This data is a subset of [a recipe ingredients dataset](https://www.kaggle.com/datasets/kaggle/recipe-ingredients-dataset) released on Kaggle by the meal planning site [Yummly](https://www.yummly.com/).

In [8]:
recipes_df = pd.read_csv('recipes.csv')
ingredient_names = recipes_df.columns[3:]
recipes_df.sample(5)

Unnamed: 0.1,Unnamed: 0,RecipeID,Cuisine,all-purpose_flour,avocado,bacon,baking_powder,baking_soda,balsamic_vinegar,bay_leaf,...,water,whipping_cream,white_onion,white_pepper,white_sugar,white_vinegar,whole_milk,worcestershire_sauce,yellow_onion,zucchini
294,294,3466,southern_us,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
176,176,42925,russian,1,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
245,245,18253,italian,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
440,440,10236,indian,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
216,216,48965,italian,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Problem 3: Generate the SVD

Our goal is to perform SVD on this ingredient/recipe matrix. There's some additional metadata that we don't need for now (index, recipe ID, and cuisine). Use `iloc` to extract the ingredient columns from `recipes_df` into a new variable `A`. Print `A`'s shape. **Confidence check: it should be 500 x 200**

Using `np.linalg.svd`, generate the singular value decomposition of `A`. Assign the output of this function to values `U`, `s`, and `Vt`. Describe what these three matrices entail in words, for our recipe data setting.

Print the shape of `s`, `Vt`, and `U`. Print out `s`. 

What is the rank of this matrix? What does rank represent in the SVD? What do the values in `s` represent? 

Use `sorted(zip(Vt[i,:], ingredient_names))` to inspect the first three components (ie set `i` to 0, 1, and 2). Describe what they might mean. (You're welcome to look at more, but the first three are the easiest to interpret.)

In [9]:
A = recipes_df.iloc[:,3:]
print(A.shape)
U, s, Vt = np.linalg.svd(A)

(500, 200)


**Answer here**

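`U` is the recipe-to-concept matrix: each row is a recipe, and each column says how strongly that recipe expresses one concept.

`s` holds the singular values, one per concept. Each value measures how strong that concept is across the whole dataset.

`Vt` is the transposed ingredient-to-concept matrix: each row is a concept, and each column says how strongly one ingredient contributes to that concept.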


In [10]:
print(s.shape)
print(Vt.shape)
print(U.shape)
print(s)

(200,)
(200, 200)
(500, 500)
[21.10221602 11.58855113 10.79328149 10.13306498  9.30608995  9.17719068
  8.87285695  8.38505874  8.11521132  7.8240836   7.52323668  7.37346903
  7.17680375  7.00194196  6.8977361   6.75295575  6.55602832  6.38737845
  6.19572994  6.09563127  6.07686413  5.94745346  5.8481218   5.82354006
  5.75244936  5.5655567   5.47711551  5.42078922  5.40221378  5.31212678
  5.27035256  5.17759153  5.16389005  5.12374155  5.01310225  4.97793047
  4.92042735  4.83913106  4.83095688  4.75492937  4.71855739  4.71119579
  4.65861492  4.61902167  4.56595423  4.53074599  4.49231466  4.41989605
  4.36692846  4.34559338  4.3213857   4.28665965  4.21785432  4.21290946
  4.17317402  4.14311217  4.12520511  4.04157241  4.03360865  3.98680757
  3.96123564  3.94454194  3.87020447  3.85290876  3.82133465  3.79139541
  3.78226401  3.74312054  3.72021907  3.69111924  3.65853014  3.60898892
  3.60160079  3.5629985   3.55217732  3.52083916  3.50981073  3.46919289
  3.4542132   3.426166


What is the rank of this matrix? What does rank represent in the SVD? What do the values in `s` represent? 




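The rank of the matrix is 200, assuming all 200 singular values in `s` are nonzero (the rank equals the number of nonzero singular values).

Rank in the SVD represents the number of linearly independent concepts (dimensions) needed to reconstruct `A` exactly.

The values in `s` represent the strength of each concept, sorted from largest to smallest.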
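
The link between rank and the singular values can be checked on a small example. This is a sketch on a made-up 6 x 4 matrix (not the recipe data) whose last column duplicates the first, so only three columns are linearly independent.

```python
import numpy as np

# Hypothetical toy data: 6 "recipes" x 4 "ingredients".
# Column 3 is a copy of column 0, so the matrix has rank 3.
toy = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(toy)
print(U.shape, s.shape, Vt.shape)  # (6, 6) (4,) (4, 4)

# Rank = number of singular values that are nonzero up to floating-point tolerance.
tol = s.max() * max(toy.shape) * np.finfo(float).eps
rank = int((s > tol).sum())
print(rank, np.linalg.matrix_rank(toy))

# Rebuilding the matrix from only the first `rank` concepts recovers it.
approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
print(np.allclose(approx, toy))
```

The same tolerance-based count could be applied to the `s` from the recipe matrix to confirm its rank rather than reading it off the shape.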

In [15]:
(sorted(zip(Vt[0,:], ingredient_names))) 


[(-0.6213226816922767, 'salt'),
 (-0.2694438904468339, 'onions'),
 (-0.2594171850598179, 'olive_oil'),
 (-0.23475629208246773, 'garlic_cloves'),
 (-0.2210848613718907, 'water'),
 (-0.2210373128399198, 'garlic'),
 (-0.1754831767880215, 'sugar'),
 (-0.14429573276348057, 'pepper'),
 (-0.1381685343340287, 'all-purpose_flour'),
 (-0.1240200689863969, 'ground_cumin'),
 (-0.11672945909648834, 'ground_black_pepper'),
 (-0.11299276022337235, 'tomatoes'),
 (-0.11056354144938572, 'vegetable_oil'),
 (-0.09649379188427604, 'eggs'),
 (-0.09600781010080782, 'butter'),
 (-0.088932310114394, 'chili_powder'),
 (-0.08631704959941976, 'carrots'),
 (-0.08499080422140343, 'black_pepper'),
 (-0.07453140769359044, 'green_onions'),
 (-0.06722933744607314, 'soy_sauce'),
 (-0.06505429481782642, 'jalapeno_chilies'),
 (-0.06483609364829297, 'kosher_salt'),
 (-0.06467361185523061, 'diced_tomatoes'),
 (-0.06249381203829212, 'chopped_cilantro_fresh'),
 (-0.06197293206693403, 'unsalted_butter'),
 (-0.05854787394701812

In [16]:
(sorted(zip(Vt[1,:], ingredient_names))) 


[(-0.3873822681676701, 'salt'),
 (-0.2718507574146312, 'sugar'),
 (-0.23234552995831212, 'all-purpose_flour'),
 (-0.18489161924934072, 'baking_powder'),
 (-0.18068564327629985, 'eggs'),
 (-0.1416549776768529, 'unsalted_butter'),
 (-0.13647248809321177, 'milk'),
 (-0.13208544086027857, 'baking_soda'),
 (-0.11293234097114054, 'vanilla_extract'),
 (-0.11117795136997699, 'buttermilk'),
 (-0.10266228674906146, 'large_eggs'),
 (-0.09380772978718818, 'butter'),
 (-0.0660723411672475, 'powdered_sugar'),
 (-0.059197224997851884, 'white_sugar'),
 (-0.05783723750887896, 'granulated_sugar'),
 (-0.05203696504892735, 'flour'),
 (-0.04564316935428831, 'egg_yolks'),
 (-0.04509321710680893, 'oil'),
 (-0.04219588540015279, 'vegetable_oil'),
 (-0.036053124635578866, 'large_egg_yolks'),
 (-0.035509420739454955, 'large_egg_whites'),
 (-0.029961950781753403, 'lemon_juice'),
 (-0.02749209680569748, 'cinnamon'),
 (-0.02610349664100929, 'white_vinegar'),
 (-0.025085335072505904, 'warm_water'),
 (-0.02293228077

In [17]:
(sorted(zip(Vt[2,:], ingredient_names))) 

[(-0.4884910524035593, 'garlic'),
 (-0.28737505417477305, 'onions'),
 (-0.1989183087526747, 'water'),
 (-0.17050345104476966, 'soy_sauce'),
 (-0.15146549158731065, 'vegetable_oil'),
 (-0.10447527552621672, 'ginger'),
 (-0.10401220187394511, 'chili_powder'),
 (-0.10287898573925684, 'green_onions'),
 (-0.09892423175311943, 'kosher_salt'),
 (-0.09385838984613427, 'sesame_oil'),
 (-0.09003282848496996, 'chicken_broth'),
 (-0.0865821272552729, 'oil'),
 (-0.0737713448804514, 'scallions'),
 (-0.0715761021477943, 'cayenne_pepper'),
 (-0.07024097395347348, 'sugar'),
 (-0.06975327530847157, 'garam_masala'),
 (-0.06276469540978863, 'fresh_ginger'),
 (-0.06258897488077084, 'tomato_paste'),
 (-0.06154117854067352, 'fish_sauce'),
 (-0.061218063374697734, 'butter'),
 (-0.05832219949984257, 'chicken_breasts'),
 (-0.05742623291259245, 'ground_cumin'),
 (-0.05663074731146862, 'cilantro_leaves'),
 (-0.05586524155738122, 'cinnamon_sticks'),
 (-0.054105188824750405, 'carrots'),
 (-0.04706477975330057, 'bro

These components are different ways of ordering the ingredients. I think the first is ordered by frequency: the ingredients with the largest weights (salt, onions, olive oil, garlic) are commonly used, whereas the ingredients at the other end are rarer. The second separates baking from savoury cooking, with sugar, flour, baking powder, and eggs grouped together. The third groups ingredients by the region of the cuisine, with garlic, soy sauce, ginger, and sesame oil grouped together.

## Problem 4: A/B Experimentation
You work at an online shopping company, and your boss asks you whether the company should increase the font of the “Buy” button from 12 point font to 14 point font.  What do you do?

Specifically, please describe in detail: the method you’d use, your hypothesis, and any considerations you need to make to execute on this method.

**Answer here**

I would use A/B testing, set up as follows:

1. Arm A is the current (control) arm, in which the button is in 12pt font.
2. Arm B is the experimental arm, in which the button is in 14pt font.
3. I would randomise all users on the site into two groups: one that experiences arm A every time they visit the site, and one that experiences arm B every time they visit.
4. I would use conversion rate as my metric, because the experiment is about whether users actually buy something, and I want to measure whether the button size changes how often users click "Buy". Each user either converts or does not, so the number of conversions in each arm is binomially distributed, and I would compare the two arms with Fisher's exact test.
5. My hypotheses:
   - Hnull: there is no difference in the conversion rates of arm A and arm B.
   - Halt: the conversion rates of arm A and arm B are not equal (there is a difference in conversion rates between the two arms).
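
As an illustration of the test named in the method above, here is a minimal pure-Python sketch of a two-sided Fisher's exact test on a 2x2 table. The function name and the conversion counts are hypothetical; in practice a library routine such as `scipy.stats.fisher_exact` would normally be used instead.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows are the two arms; columns are (converted, did not convert).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x conversions in the first arm,
        # holding the row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    # (the small factor guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


# Hypothetical counts: arm A (12pt) 30 of 500 converted, arm B (14pt) 45 of 500.
p_value = fisher_exact_two_sided(30, 470, 45, 455)
print(f"two-sided p-value: {p_value:.4f}")
```

A p-value below the preregistered significance level would lead to rejecting Hnull in favour of Halt.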

An important consideration is that users have different characteristics, and these should be accounted for to ensure proper randomisation between arms A and B. You can block by a criterion to help ensure that it doesn't interfere with the experiment being run, for example by making sure that both arms A and B contain a similar mix of users in terms of gender, age, and geographic region. Pitfalls to watch out for when running this experiment are violating SUTVA in terms of the new payment feature, and perhaps the New Zealand approach would be something to think about in terms of blocking and making sure that the people you roll out the new "Buy" button to are sufficiently randomised.


I would run the experiment for the desired time (which we can calculate) but here are a few considerations for stopping early: 

1. Benefit: if the increased font size leads to drastically increased conversion rates (users clicking the "Buy" button and buying the product), it could make sense to end early and increase the font size on the button for everyone. What qualifies as a "drastic increase" should be calculated and preregistered beforehand.
2. Harm: while subjects would likely not be harmed in this experiment, it could make sense to stop if the company's profit is harmed greatly and the company is losing a great deal of money. The size of the profit decrease that would halt the experiment should be preregistered beforehand.
3. Futility: if the experiment is running and the results of the arm A button and the arm B button stay consistent with each other, it might be worth halting the experiment to save money. If no difference is observed for a certain period of time, it is unlikely that a difference would suddenly appear. The level of consistency and the length of that period should be preregistered beforehand, as always.
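
The planned run time mentioned above can be estimated from a required sample size. The sketch below uses the common normal-approximation formula for comparing two proportions, which is an assumption on my part (it is not Fisher's exact test); the baseline rate, target rate, and daily traffic are all hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    ) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Hypothetical inputs: 6% baseline conversion, hoping to detect a lift to 7%.
n = users_per_arm(0.06, 0.07)
daily_users_per_arm = 1000  # hypothetical traffic
print(f"{n} users per arm, about {ceil(n / daily_users_per_arm)} days")
```

Smaller expected differences need many more users, which is one reason the minimum effect worth detecting should be fixed before the experiment starts.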
