Data from: https://cseweb.ucsd.edu/~jmcauley/datasets.html#foodcom

[RAW_interactions.csv.zip](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions?select=RAW_interactions.csv)

[RAW_recipes.csv.zip](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions?select=RAW_recipes.csv)

Reference:

**Generating Personalized Recipes from Historical User Preferences** <br>
Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley <br>
*EMNLP*, 2019

\*\* A link to the corresponding plot is given below each code snippet, since Plotly figures tend not to show up after the file has been uploaded.

In [None]:
!unzip -o data/RAW_recipes.csv.zip -d data/
!unzip -o data/RAW_interactions.csv.zip -d data/

In [2]:
import pandas as pd
import plotly.express as px
import numpy as np

In [3]:
df_recipes = pd.read_csv('data/RAW_recipes.csv')
df_ratings = pd.read_csv('data/RAW_interactions.csv')

#### Data cleaning

In [368]:
print(f'Recipes shape: {df_recipes.shape}')
print(f'Ratings shape: {df_ratings.shape}')

Recipes shape: (231637, 12)
Ratings shape: (1132367, 5)


About 230 thousand recipes and over 1 million individual ratings

In [369]:
df_ratings['user_id'].nunique()

226570

In [370]:
len(df_recipes) * df_ratings['user_id'].nunique()

52481995090

About 226 thousand individual users

This means a `user x item` matrix would have over 52 billion values, which is too large for these tools and resources (pandas refused) → some of the data needs to be filtered out
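A rough back-of-the-envelope estimate shows why a dense matrix of this size is impractical. The sketch below assumes 8 bytes per cell (float64), which is an assumption and not something the notebook states; under that assumption the dense matrix would need roughly 390 GiB.

```python
# Counts taken from the outputs above
n_recipes = 231_637
n_users = 226_570

n_cells = n_recipes * n_users

# Assumption: each cell is stored as a float64 (8 bytes)
bytes_per_cell = 8
dense_gib = n_cells * bytes_per_cell / 2**30

print(f'{n_cells:,} cells, about {dense_gib:,.0f} GiB as float64')
```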

In [371]:
df_recipes.describe()

Unnamed: 0,id,minutes,contributor_id,n_steps,n_ingredients
count,231637.0,231637.0,231637.0,231637.0,231637.0
mean,222014.708984,9398.546,5534885.0,9.765499,9.051153
std,141206.635626,4461963.0,99791410.0,5.995128,3.734796
min,38.0,0.0,27.0,0.0,1.0
25%,99944.0,20.0,56905.0,6.0,6.0
50%,207249.0,40.0,173614.0,9.0,9.0
75%,333816.0,65.0,398275.0,12.0,11.0
max,537716.0,2147484000.0,2002290000.0,145.0,43.0


In [372]:
df_recipes['minutes'].unique()

array([        55,         30,        130,         45,        190,
                0,         15,        120,        180,         70,
                5,       1460,       2970,        525,        500,
              110,         35,         20,         25,         10,
               40,        495,         90,         13,         26,
               12,         50,         18,        230,      14450,
            20160,        125,        135,         28,         60,
              160,       1470,         65,        150,         75,
                2,         32,        100,        330,        510,
              280,        175,         80,          6,        345,
              195,        300,        200,        105,         85,
               68,         38,          7,        315,        645,
               95,        255,          9,        185,        245,
                1,        260,        240,        250,        370,
             1500,        140,        171,       1440,      43

>Filter out rows where information is missing

As `.describe` shows, there are some odd values in the 'minutes' column (how long the recipe takes to make). Rows where minutes is more than 1440 (i.e. 24 h) or equal to 0 are removed. There are also recipes where the number of steps is 0; these are removed as well.

In [373]:
df_recipes[df_recipes['minutes'] == 2147483647]

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
144074,no bake granola balls,261647,2147483647,464080,2007-10-26,"['60-minutes-or-less', 'time-to-make', 'course...","[330.3, 23.0, 110.0, 4.0, 15.0, 24.0, 15.0]",9,"['preheat the oven to 350 degrees', 'spread oa...",healthy snacks that kids (and grown ups) will ...,"['rolled oats', 'unsweetened dried shredded co...",8


Hopefully it does not take over 4000 years to make *granola balls*
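The arithmetic behind that remark can be checked quickly. The value also happens to equal 2**31 - 1, the largest signed 32-bit integer, which suggests (as a guess, not something the dataset documents) that it is an overflow or placeholder rather than a real cooking time.

```python
minutes = 2_147_483_647

# The value is exactly the largest signed 32-bit integer
is_int32_max = minutes == 2**31 - 1

# Convert minutes to years, assuming 365-day years
years = minutes / (60 * 24 * 365)

print(is_int32_max, round(years))
```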

In [4]:
## remove invalid recipes
df_recipes = df_recipes[(df_recipes['minutes'] <= 1440) 
        & (df_recipes['minutes'] != 0) 
        & (df_recipes['n_steps'] != 0) 
        & (df_recipes['n_ingredients'] != 0)]

df_ratings = df_ratings[df_ratings['recipe_id'].isin(df_recipes['id'])]

>Filter out users who have given fewer than 24 reviews and recipes with fewer than 18 reviews

This needs to be repeated until no more changes occur, because:
- removing recipes changes how many reviews a user has given
- removing users changes how many reviews a recipe has

In [5]:
changed = True
while changed:
    prev_len = len(df_ratings)
    
    ## filter users
    user_counts = df_ratings['user_id'].value_counts()
    valid_users = user_counts[user_counts >= 24].index
    df_ratings = df_ratings[df_ratings['user_id'].isin(valid_users)]
    
    ## filter recipes
    review_counts = df_ratings['recipe_id'].value_counts()
    valid_ids = review_counts[review_counts >= 18].index
    df_ratings = df_ratings[df_ratings['recipe_id'].isin(valid_ids)]
    df_recipes = df_recipes[df_recipes['id'].isin(valid_ids)]

    df_ratings = df_ratings[df_ratings['recipe_id'].isin(df_recipes['id'].values)]
    
    changed = len(df_ratings) != prev_len
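To see why a single pass is not enough, the same idea can be tried on a tiny made-up data set. This is only a sketch: `iterative_filter` and the toy data are hypothetical and not part of the notebook, and the thresholds are lowered to 2 so the effect shows up with six rows.

```python
import pandas as pd

def iterative_filter(ratings, min_user_reviews, min_recipe_reviews):
    """Repeat the user and recipe filters until the row count stops changing."""
    while True:
        prev_len = len(ratings)

        # Keep users with enough reviews
        user_counts = ratings['user_id'].value_counts()
        valid_users = user_counts[user_counts >= min_user_reviews].index
        ratings = ratings[ratings['user_id'].isin(valid_users)]

        # Keep recipes with enough reviews
        recipe_counts = ratings['recipe_id'].value_counts()
        valid_recipes = recipe_counts[recipe_counts >= min_recipe_reviews].index
        ratings = ratings[ratings['recipe_id'].isin(valid_recipes)]

        if len(ratings) == prev_len:
            return ratings

# Toy data (assumption): user 3 only survives until recipe 'c' is dropped
toy = pd.DataFrame({
    'user_id':   [1, 1, 2, 2, 3, 3],
    'recipe_id': ['a', 'b', 'a', 'b', 'a', 'c'],
})

filtered = iterative_filter(toy, min_user_reviews=2, min_recipe_reviews=2)
print(filtered)
```

In the first pass every user has two reviews, but recipe 'c' has only one and is dropped. That leaves user 3 with a single review, so user 3 is only removed in the second pass.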

In [6]:
df_recipes.reset_index(drop=True, inplace=True)
df_ratings.reset_index(drop=True, inplace=True)

In [377]:
print(df_ratings['recipe_id'].nunique())
print(df_recipes['id'].nunique())

2908
2908


In [378]:
df_ratings['user_id'].nunique()

1842

In [379]:
print(f'Recipes shape: {df_recipes.shape}')
print(f'Ratings shape: {df_ratings.shape}')

Recipes shape: (2908, 12)
Ratings shape: (106288, 5)


In [10]:
len(df_recipes) * df_ratings['user_id'].nunique()

5356536

5 356 536

The `user x item` matrix will have just over 5 million values, which is more reasonable
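For illustration, one way such a matrix could be built is with `pivot_table`. The sketch below uses a small made-up frame with the same column names as `df_ratings`; whether the notebook later builds the matrix this way is an assumption.

```python
import pandas as pd

# Toy ratings (assumption): same column names as df_ratings
toy_ratings = pd.DataFrame({
    'user_id':   [1, 1, 2, 2],
    'recipe_id': [10, 20, 20, 30],
    'rating':    [5, 4, 3, 5],
})

# One row per user, one column per recipe; unrated pairs become NaN
user_item = toy_ratings.pivot_table(index='user_id', columns='recipe_id', values='rating')

n_filled = int(user_item.notna().sum().sum())
print(user_item.shape, n_filled, 'of', user_item.size, 'cells filled')
```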

In [11]:
len(df_ratings) / (len(df_recipes) * df_ratings['user_id'].nunique())

0.019842674444827776

1 - 0.02 = 0.98, i.e. about 98% sparsity
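The same calculation written as a small helper, using the counts printed above. The function name `sparsity` is made up for this sketch, and it assumes each (user, recipe) pair appears at most once in the ratings.

```python
def sparsity(n_ratings, n_users, n_items):
    """Share of user-item cells that have no rating."""
    return 1 - n_ratings / (n_users * n_items)

# Counts from the filtered data above
s = sparsity(n_ratings=106_288, n_users=1_842, n_items=2_908)
print(f'{s:.2%}')
```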

#### Exploring

In [381]:
df_recipes

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,how i got my family to eat spinach spinach ca...,25775,50,37305,2002-04-22,"['60-minutes-or-less', 'time-to-make', 'course...","[166.1, 16.0, 6.0, 32.0, 19.0, 26.0, 3.0]",5,"['preheat oven to 350 degrees', 'place spinach...","if spinach scares you, this is one recipe that...","['frozen chopped spinach', 'egg', 'salt', 'bla...",8
1,land of nod cinnamon buns,22526,35,29212,2002-03-14,"['60-minutes-or-less', 'time-to-make', 'course...","[575.3, 18.0, 116.0, 34.0, 28.0, 22.0, 34.0]",7,"['before you turn in for the night , grease a ...",i have made this several times and it's dead e...,"['rolls', 'brown sugar', 'instant vanilla pudd...",6
2,never weep whipped cream,74805,5,87877,2003-11-01,"['15-minutes-or-less', 'time-to-make', 'course...","[276.3, 45.0, 2.0, 1.0, 3.0, 91.0, 0.0]",4,['whip all ingredients together until firm pea...,"i don't know where i got this, but it works. t...","['whipping cream', 'vanilla instant pudding mi...",4
3,ant kelly s london broil marinade,155959,200,59476,2006-02-13,"['time-to-make', 'main-ingredient', 'preparati...","[673.5, 61.0, 18.0, 67.0, 133.0, 57.0, 2.0]",13,"['mix all marinade ingredients together', 'lig...",my niece shwana loves this! she always writes...,"['balsamic vinegar', 'soy sauce', 'worcestersh...",8
4,bar cheese,42151,35,44807,2002-10-03,"['60-minutes-or-less', 'time-to-make', 'course...","[707.1, 76.0, 91.0, 147.0, 74.0, 161.0, 9.0]",8,"['in a large sauce pan over low heat , melt th...",a friend shared this with me last year. i have...,"['velveeta cheese', 'mayonnaise', 'horseradish...",5
...,...,...,...,...,...,...,...,...,...,...,...,...
2903,zucchini oven fries,178820,25,219942,2006-07-24,"['30-minutes-or-less', 'time-to-make', 'course...","[128.1, 1.0, 17.0, 13.0, 12.0, 0.0, 8.0]",8,"['preheat oven to 400f', 'lightly coat baking ...",these are really good and have the added bonus...,"['zucchini', 'egg white', 'flour', 'cornstarch...",7
2904,zucchini pancakes,16702,35,27416,2002-01-04,"['60-minutes-or-less', 'time-to-make', 'course...","[363.0, 41.0, 9.0, 34.0, 19.0, 51.0, 7.0]",16,"['wash and shred zucchini on a fine shredder',...","this is a great vegetarian pancake, easy to ma...","['fresh zucchini', 'eggs', 'all-purpose flour'...",8
2905,zucchini ribbons with basil butter,34110,35,14015,2002-07-15,"['60-minutes-or-less', 'time-to-make', 'course...","[99.9, 11.0, 19.0, 3.0, 6.0, 14.0, 2.0]",8,"['bring pot of water to boil', 'mean-while , w...",found this in a magazine! made it with the gar...,"['zucchini', 'butter', 'olive oil', 'parmesan ...",6
2906,zucchini salsa canned,11217,105,4470,2001-08-29,"['weeknight', 'time-to-make', 'course', 'main-...","[211.9, 2.0, 139.0, 159.0, 10.0, 1.0, 15.0]",7,"['day one:in a large bowl combine', 'zucchini ...","this recipe is from a friend's, daughter's mot...","['zucchini', 'onions', 'green peppers', 'red p...",16


The recipe names vary in how well they describe the recipe and in how concise they are.

In [382]:
df_recipes['minutes'].value_counts().head(10)

minutes
30    209
20    195
35    184
25    181
40    157
10    154
5     138
45    135
15    134
50    128
Name: count, dtype: int64

In [383]:
fig = px.histogram(df_recipes, x='minutes', nbins=80,
                   title='Distribution of time to make',
                   labels={'minutes': 'Time (minutes)'})
fig.update_layout(bargap=0.1)
fig.show()

[plot](plots/dist_tt.png)

Recipes that take under 100 minutes are most common. The most frequent times are around 20-40 minutes, with a few recipes that take a long time. Keep in mind that different people may define cooking time in different ways.

In [384]:
df_recipes['n_steps'].value_counts().head(10)

n_steps
7     321
6     304
8     275
5     261
9     221
10    204
4     195
11    180
3     169
13    118
Name: count, dtype: int64

In [385]:
fig = px.histogram(df_recipes, x='n_steps', nbins=40,
                   title='Distribution of step count',
                   labels={'n_steps': 'steps'})
fig.update_layout(bargap=0.1)
fig.show()

[plot](plots/dist_sc.png)

Recipes with 6-8 steps are most common. The data leans toward recipes with fewer steps.

In [386]:
df_recipes['n_ingredients'].value_counts().head(10)

n_ingredients
7     341
6     327
8     322
9     319
10    278
5     259
11    226
4     175
12    141
3     118
Name: count, dtype: int64

In [387]:
fig = px.histogram(df_recipes, x='n_ingredients', nbins=20,
                   title='Distribution of ingredient count',
                   labels={'n_ingredients': 'ingredients'})
fig.update_layout(bargap=0.1)
fig.show()

[plot](plots/dist_ic.png)

7 ingredients is the most common count, but 6-9 are very even. Most recipes have relatively few ingredients, and overall the distribution is narrower than, for example, the step count.

My guess is that the most popular recipe has 7 ingredients, 8 steps and takes at most 30 minutes

In [414]:
fig = px.scatter(df_recipes, x='n_ingredients', y='n_steps',
                 title='Recipe Complexity: ingredients vs Steps',
                 labels={'n_ingredients': 'Number of ingredients', 'n_steps': 'Number of steps'},
                 opacity=0.6)
fig.show()

[plot](plots/rc_is.png)

Most recipes have fewer than 30 steps and fewer than 15 ingredients. A handful of recipes have many steps but few ingredients, or few steps but many ingredients. Keep in mind that these recipes are produced by a community, and different people split up the steps and ingredients in different ways.

In [415]:
fig = px.scatter(df_recipes, x='minutes', y='n_steps',
                 title='Recipe Complexity: Time vs Steps',
                 labels={'minutes': 'Time to make (min)', 'n_steps': 'Number of steps'},
                 opacity=0.6)
fig.show()

[plot](plots/rc_ts.png)

Most recipes are short and have few steps, but there are still quite a few recipes with few steps and a long cooking time; presumably these involve a long wait.

In [390]:
df_ratings

Unnamed: 0,user_id,recipe_id,date,rating,review
0,28649,33096,2002-07-29,5,This was very simple and very refreshing. Thi...
1,22973,33096,2003-08-11,5,"Merlot,\r\n I took the ingredients for making..."
2,37449,33096,2003-08-31,5,So easy and so good! My husband and son scarfe...
3,89831,33096,2004-03-15,5,Merlot...this is the second time that I made y...
4,101034,33096,2004-06-15,5,"What a great tasting, refreshing dessert this ..."
...,...,...,...,...,...
106283,416985,31311,2010-05-13,5,"These were great- made them for a ""key club"" (..."
106284,985795,31311,2010-05-15,5,These went great to make for a pot luck I was ...
106285,407007,31311,2010-05-30,5,These are addictive! They are a little more ti...
106286,119956,31311,2011-12-21,3,I had a lot of trouble with this simple cookie...


In [391]:
fig = px.bar(x=df_ratings['rating'].value_counts().index, y=df_ratings['rating'].value_counts().values, 
             title='Rating count', labels={'x': 'rating', 'y': 'count'})
fig.update_layout(bargap=0.1)
fig.show()

[plot](plots/rat_c.png)

In [392]:
fig = px.histogram(
    df_ratings.groupby('recipe_id')['rating'].mean(), x='rating', nbins=20, 
    title='Distribution of Mean Ratings per Recipe', labels={'rating': 'Mean Rating'},
)
fig.update_layout(bargap=0.1)
fig.show()

[plot](plots/dist_mr.png)

A rating of 5 is by far the most common one given, and most recipes have a high mean rating (>4). It may be that people only leave a review when the recipe was good, or the filtering may have happened to remove recipes with low ratings.
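One way to look into the second explanation would be to compare the share of each rating before and after filtering. The sketch below uses two small made-up frames as stand-ins; in the notebook this would require keeping a copy of the unfiltered `df_ratings`, which the code above does not do.

```python
import pandas as pd

# Toy stand-ins (assumption): ratings before and after filtering
ratings_before = pd.DataFrame({'rating': [5, 5, 4, 3, 1, 0, 5, 2]})
ratings_after = pd.DataFrame({'rating': [5, 5, 4, 5]})

# Share of each rating value, aligned on the rating value
comparison = pd.DataFrame({
    'before': ratings_before['rating'].value_counts(normalize=True),
    'after': ratings_after['rating'].value_counts(normalize=True),
}).fillna(0).sort_index()

print(comparison)
```

If the low ratings shrink noticeably in the `after` column, the filtering has shifted the distribution; if the two columns look alike, the skew was already in the raw data.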

In [393]:
df_ratings.groupby('recipe_id')['rating'] \
    .agg(['mean', 'count'])  \
    .sort_values(by=['count', 'mean'], ascending=[False, False])

Unnamed: 0_level_0,mean,count
recipe_id,Unnamed: 1_level_1,Unnamed: 2_level_1
27208,4.526582,395
89204,4.273273,333
39087,4.678808,302
22782,4.600000,300
69173,4.824916,297
...,...,...
7553,3.888889,18
41882,3.888889,18
44908,3.888889,18
179050,3.888889,18


In [394]:
df_ratings.groupby('recipe_id')['rating'] \
    .agg(['mean', 'count'])  \
    .sort_values(by=['mean', 'count'], ascending=[False, False])

Unnamed: 0_level_0,mean,count
recipe_id,Unnamed: 1_level_1,Unnamed: 2_level_1
66947,5.000000,50
154351,5.000000,41
77497,5.000000,38
222166,5.000000,38
186029,5.000000,35
...,...,...
95409,3.833333,18
38584,3.821429,28
81853,3.736842,38
35331,3.736842,19


In [395]:
print(df_recipes[df_recipes['id'] == 27208][['id','name', 'minutes', 'n_steps', 'n_ingredients']])
print(df_recipes[df_recipes['id'] == 66947][['id','name', 'minutes', 'n_steps', 'n_ingredients']])

         id                        name  minutes  n_steps  n_ingredients
2690  27208  to die for crock pot roast      545        7              5
         id                                       name  minutes  n_steps  \
2212  66947  refreshing mojito  by the pitcher mojitos       15        5   

      n_ingredients  
2212              5  


*to die for crock pot roast* has the most ratings (395) and is well liked, with a mean rating of about 4.5, while
*refreshing mojito  by the pitcher mojitos* has the highest mean rating (5.0), with the most ratings (50) among the recipes at that mean

In [396]:
df_ratings.groupby('user_id')['rating'] \
    .agg(['mean', 'count'])  \
    .sort_values(by='count', ascending=False)

Unnamed: 0_level_0,mean,count
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
140132,4.713204,1189
176615,4.895492,488
173579,4.908120,468
126440,4.868534,464
37449,4.927536,414
...,...,...
125524,4.875000,24
82772,4.958333,24
647466,4.666667,24
308434,4.541667,24


User `140132` has given the most reviews and apparently liked most of the recipes, or they only write a review when the recipe was good

In [401]:
df_ratings[df_ratings['user_id'] == 140132]['rating'].describe()

count    1189.000000
mean        4.713204
std         0.549872
min         0.000000
25%         4.000000
50%         5.000000
75%         5.000000
max         5.000000
Name: rating, dtype: float64

In [416]:
avg_ratings = df_ratings.groupby('recipe_id')['rating'].mean()
merged = df_recipes.merge(avg_ratings, left_on='id', right_on='recipe_id')

fig = px.scatter(merged, x='minutes', y='rating',
                 title='Average Rating vs Time to make',
                 labels={'minutes': 'Time (minutes)', 'rating': 'Average Rating'},
                 opacity=0.6)
fig.show()


[plot](plots/ar_tt.png)

Cooking time does not seem to have much influence on what people like: the ratings vary at most cooking times. The data clusters at high ratings and short recipes, but high ratings and short recipes were the most common in the data anyway.
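The visual impression could be backed up with a correlation coefficient. The sketch below uses made-up data and Pearson correlation via `Series.corr`; it is only an illustration of the check, not a result from the real data, and Pearson correlation is sensitive to the few very long recipes.

```python
import pandas as pd

# Toy data (assumption): same column names as df_recipes and df_ratings
recipes = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                        'minutes': [10, 20, 30, 60, 120]})
ratings = pd.DataFrame({'recipe_id': [1, 1, 2, 3, 3, 4, 5, 5],
                        'rating':    [5, 4, 5, 4, 5, 3, 5, 4]})

# Mean rating per recipe, joined onto the recipe table
avg = ratings.groupby('recipe_id')['rating'].mean().reset_index()
merged_toy = recipes.merge(avg, left_on='id', right_on='recipe_id')

# Pearson correlation between cooking time and mean rating
r = merged_toy['minutes'].corr(merged_toy['rating'])
print(f'Pearson r = {r:.2f}')
```

A value close to 0 would support the reading that cooking time has little linear relationship with the mean rating.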

In [417]:
avg_ratings = df_ratings.groupby('recipe_id')['rating'].mean()
merged = df_recipes.merge(avg_ratings, left_on='id', right_on='recipe_id')

fig = px.scatter(merged, x='n_ingredients', y='rating',
                 title='Average Rating vs Number of ingredients',
                 labels={'n_ingredients': 'Number of ingredients', 'rating': 'Average Rating'},
                 opacity=0.6)
fig.show()

[plot](plots/ar_ni.png)

The number of ingredients in a recipe does not seem to matter much for its rating either

In [413]:
review_counts = df_ratings['recipe_id'].value_counts().reset_index()
review_counts.columns = ['recipe_id', 'num_reviews']

fig = px.histogram(review_counts, x='num_reviews', nbins=50,
                   title='Distribution of Number of Reviews per Recipe',
                   labels={'num_reviews': 'Number of Reviews', 'count': 'Number of Recipes'})
fig.update_layout(bargap=0.1)
fig.show()

[plot](plots/dist_nr.png)

Most recipes have few reviews

In [420]:
review_counts = df_ratings['user_id'].value_counts().reset_index()
review_counts.columns = ['user_id', 'num_reviews']

fig = px.histogram(review_counts, x='num_reviews', nbins=60,
                   title='Distribution of Number of Reviews per User',
                   labels={'num_reviews': 'Number of Reviews', 'count': 'Number of Users'})
fig.update_layout(bargap=0.1)
fig.show()

[plot](plots/dist_ru.png)

Most users have given only a few reviews. Combined with the fact that most recipes have also received few reviews, there may be little overlap between users, which could make it hard for the collaborative recommender system in particular to make good recommendations.
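The overlap could be measured directly by counting how many recipes each pair of users has both reviewed. A minimal sketch on made-up data, assuming the same `user_id` and `recipe_id` columns as `df_ratings`:

```python
import pandas as pd

# Toy data (assumption): which user reviewed which recipe
toy_ratings = pd.DataFrame({
    'user_id':   [1, 1, 2, 2, 3],
    'recipe_id': [10, 20, 20, 30, 30],
})

# Binary user x recipe matrix: 1 if the user reviewed the recipe
rated = pd.crosstab(toy_ratings['user_id'], toy_ratings['recipe_id']).clip(upper=1)

# Entry (u, v) counts the recipes reviewed by both u and v;
# the diagonal holds each user's own review count
overlap = rated.dot(rated.T)
print(overlap)
```

With the 1842 users left after filtering, the result would be a 1842 x 1842 table, which is small enough to inspect, for example by looking at how many off-diagonal entries are 0.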