## Creating a combined quality dataset

In this notebook we build a quality dataset from the three datasets "recipes", "reviews" and "recipes_with_search_terms" in order to gain some interesting insights into the data.

To do so, we:
- left joined them on `RecipeId`
- kept only the recipes with at least 10 ratings
- introduced a column `number_of_ratings`, the number of records in reviews.csv for the recipe
- introduced a column `average_rating`, the average of all ratings for the recipe, in two steps:
  - Overall mean: calculate the mean of all ratings across the entire DataFrame and assign it to `average_rating` for all rows as an initial value
  - Per-recipe mean (`mean_of_all_ratings` in the loop): for each recipe, calculate the mean of the ratings of all rows with the same `RecipeId` and overwrite `average_rating` with it
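
The filtering and aggregation steps above can be sketched on a toy DataFrame. This is only a minimal illustration, not the notebook's actual code: the data is made up, the threshold is lowered to 2 so the example stays small (the notebook uses 10), and `toy` and `min_reviews` are hypothetical names.

```python
import pandas as pd

# Made-up reviews: recipe 1 has three reviews, recipe 2 has one, recipe 3 has two
toy = pd.DataFrame({
    "RecipeId": [1, 1, 1, 2, 3, 3],
    "Rating":   [5, 4, 3, 1, 2, 4],
})

min_reviews = 2  # hypothetical threshold; the notebook uses 10

# Number of reviews per recipe, broadcast back to every review row
toy["number_of_ratings"] = toy.groupby("RecipeId")["RecipeId"].transform("size")

# Keep only the recipes with enough reviews
toy = toy[toy["number_of_ratings"] >= min_reviews].copy()

# Mean rating per recipe, broadcast back to every review row
toy["average_rating"] = toy.groupby("RecipeId")["Rating"].transform("mean")

print(toy)
```

With `transform`, the per-recipe values are computed once per group and aligned back to the original rows, so no explicit loop over rows is needed.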

In [None]:
# load all datasets
import pandas as pd

recipes = pd.read_csv("../data/recipes.csv")
reviews = pd.read_csv("../data/reviews.csv")
search_terms_and_tags = pd.read_csv("../data/recipes_w_search_terms.csv")

In [None]:
reviews.shape

In [None]:
# 2. Prepare the datasets so they can be combined into one dataset on RecipeId
# (recipes.csv uses RecipeId and recipes_w_search_terms.csv uses id as primary key).
# The columns Name and Description are present in both recipe datasets. We just take the values from one dataset (doesn't matter which).
# Columns we don't need: Images from recipes.csv; DateSubmitted and DateModified from reviews.csv.
# AuthorId/AuthorName exist in both reviews.csv and recipes.csv, so they are renamed to avoid a clash after merging.

reviews.rename(columns={'AuthorId':'ReviewerId'}, inplace=True)
reviews.rename(columns={'AuthorName':'ReviewerName'}, inplace=True)
reviews.drop("DateSubmitted", axis=1, inplace=True)
reviews.drop("DateModified", axis=1, inplace=True)
reviews.head()

In [None]:
recipes.rename(columns={'AuthorId':'RecipeContributerId'}, inplace=True)
recipes.rename(columns={'AuthorName':'RecipeContributorName'}, inplace=True)
recipes.drop("Images", axis=1, inplace=True)
recipes.drop("Name", axis=1, inplace=True)
recipes.drop("Description", axis=1, inplace=True)
recipes.head()

In [None]:
search_terms_and_tags.rename(columns={'id':'RecipeId'}, inplace=True)
search_terms_and_tags.head()

In [None]:
# Merge all the datasets. reviews is the left table, so every review row is kept.
df_combined = pd.merge(reviews, recipes, on='RecipeId', how="left")
df_combined = pd.merge(df_combined, search_terms_and_tags, on='RecipeId', how="left")
print(len(df_combined))
df_combined.head()
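
Because both merges are left joins starting from `reviews`, a review whose `RecipeId` has no match in the other table is kept, with missing values in the added columns. Below is a minimal sketch of how one might count such rows using the `indicator` and `validate` options of `pd.merge`. The frames `toy_reviews` and `toy_recipes` and the column `ExampleColumn` are made up for illustration.

```python
import pandas as pd

# Made-up data: the review for RecipeId 9 has no matching recipe
toy_reviews = pd.DataFrame({"RecipeId": [1, 1, 2, 9], "Rating": [5, 4, 3, 2]})
toy_recipes = pd.DataFrame({"RecipeId": [1, 2], "ExampleColumn": ["a", "b"]})

merged = pd.merge(
    toy_reviews,
    toy_recipes,
    on="RecipeId",
    how="left",
    validate="many_to_one",  # raises if RecipeId is not unique in the right table
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Reviews that found no matching recipe
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

`validate="many_to_one"` also guards against duplicated keys in the right table, which would otherwise silently multiply review rows.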

In [None]:
# 3. For the recipes that we now have, keep only those that have at least 10 reviews in reviews.csv.
# Introduce a column "number_of_ratings", the number of records in reviews.csv for this recipe.
# The column "average_rating" (the average of all ratings for this recipe) is added in the next cell.

# .copy() makes the filtered result an independent DataFrame, so adding a column below
# does not trigger a SettingWithCopyWarning
df_combined = df_combined[df_combined.groupby('RecipeId')['RecipeId'].transform('size') >= 10].copy()
print(len(df_combined))
df_combined['number_of_ratings'] = df_combined.groupby('RecipeId')['RecipeId'].transform('size')
df_combined.sample(5)

In [None]:
# Initialise the column with the overall mean rating, then overwrite it with the per-recipe mean.
df_combined['average_rating'] = df_combined["Rating"].mean()

# Iterating over the groups avoids filtering the whole DataFrame once per row.
for recipe_id, all_rows_with_this_recipe_id in df_combined.groupby("RecipeId"):
    mean_of_all_ratings = all_rows_with_this_recipe_id['Rating'].mean()
    df_combined.loc[all_rows_with_this_recipe_id.index, "average_rating"] = mean_of_all_ratings
df_combined.sample(5)

In [None]:
df_combined.shape

## Data Visualization
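
Since the merge starts from `reviews`, `df_combined` holds one row per review, so a histogram of `number_of_ratings` or `average_rating` over all rows counts each recipe once per review. If a per-recipe view is wanted, one option is to keep a single row per `RecipeId` first. A minimal sketch on made-up data (`toy` and `per_recipe` are hypothetical names):

```python
import pandas as pd

# Made-up review-level data: recipe 1 has three reviews, recipe 3 has two
toy = pd.DataFrame({
    "RecipeId":          [1, 1, 1, 3, 3],
    "Rating":            [5, 4, 3, 2, 4],
    "number_of_ratings": [3, 3, 3, 2, 2],
    "average_rating":    [4.0, 4.0, 4.0, 3.0, 3.0],
})

# The per-recipe columns are identical within a recipe,
# so keeping the first row of each recipe loses nothing for them
per_recipe = toy.drop_duplicates(subset="RecipeId")[
    ["RecipeId", "number_of_ratings", "average_rating"]
]

print(len(toy), len(per_recipe))
```

Calling `per_recipe.hist(column='average_rating')` would then show the distribution over recipes rather than over reviews.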

In [None]:
# 4. So now we have a dataset with columns from reviews.csv, recipes.csv and recipes_w_search_terms.csv,
# and two additional columns: "average_rating" and "number_of_ratings". Depending on the number of
# records that we now have, either leave the dataset as is or sample a subset.
print(len(df_combined))

df_combined.hist(column='Rating')
df_combined.hist(column='number_of_ratings')
df_combined.hist(column='average_rating')



In [None]:
df_combined.to_csv("../data/combined_data.csv")