# Exploratory Data Analysis - Recipe Dataset
## Introduction
This notebook presents an **exploratory data analysis (EDA)** of a cooking recipes dataset from Kaggle, as part of my recipe recommender project. The goal of this analysis is to identify key features and trends in international cuisine recipes and to prepare the dataset for further modeling and recommendations.

**Dataset:** 7000+ International Cuisine Recipes (Kaggle)

**Objective:** Explore and visualize the dataset to gain insights and guide further analysis.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**EDA steps:**
1. Load libraries and preview the cleaned dataset
2. Review summary statistics and dataset shape
3. Analyze missing values and duplicate rows
4. Explore the distribution of cuisine types
5. Analyze the preparation time distribution and its outliers
6. Analyze ingredient usage
7. Assess overall data quality
8. Summarize key insights

## 1. Load Libraries and Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import os

# Load the cleaned CSV file
cleaned_data_path = os.path.join("..", "data", "cleaned", "Food_Recipe_cleaned.csv")
df_cleaned = pd.read_csv(cleaned_data_path)

# Preview the first few rows
print("Data preview:")
display(df_cleaned.head())

# Show general information about the dataset
# (info() prints its report directly and returns None, so it is not wrapped in display())
print("\nGeneral information about the dataset:")
df_cleaned.info()

## 2. Dataset Overview

In [None]:
print("General recipe statistics:")
display(df_cleaned.describe())

# Show dataset shape
print(f"\nDataset shape: {df_cleaned.shape[0]} rows, {df_cleaned.shape[1]} columns")

## 3. Missing Values Analysis

In [None]:
# Check for missing values
print("Missing values per column:")
missing_values = df_cleaned.isnull().sum()
display(missing_values[missing_values > 0])

# Check for duplicates
duplicates = df_cleaned.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")
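
Raw counts of missing values are easier to judge next to the share of rows they affect. The following is a minimal sketch of such a report; the helper name `missing_report` and the toy DataFrame are hypothetical and only stand in for `df_cleaned`.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have missing values
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data standing in for df_cleaned
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "cuisine": ["Indian", None, "Thai", "Indian"],
    "prep_time (in mins)": [10, 20, None, None],
})
report = missing_report(toy)
print(report)
```

In the notebook, the same helper could be called as `missing_report(df_cleaned)`.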

## 4. Cuisine Types Analysis

In [None]:
print("Cuisine type distribution:")
cuisine_counts = df_cleaned["cuisine"].value_counts()
display(cuisine_counts)

# Visualize cuisine distribution with pie chart
plt.figure(figsize=(8,8))
cuisine_counts.plot.pie(autopct="%1.1f%%")
plt.title("Cuisine Type Distribution")
plt.ylabel("")
plt.show()
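
If the dataset contains many cuisine labels, a pie chart with one slice per label can become hard to read. One possible remedy, sketched below, is to keep the most frequent cuisines and merge the rest into a single slice. The helper `top_n_with_other`, the `"Other"` label, and the toy counts are assumptions made for illustration.

```python
import pandas as pd


def top_n_with_other(counts: pd.Series, n: int = 10, other_label: str = "Other") -> pd.Series:
    """Keep the n largest categories and merge the remainder into one bucket."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n]
    remainder = ordered.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top


# Hypothetical counts standing in for df_cleaned["cuisine"].value_counts()
toy_counts = pd.Series({"Indian": 50, "Thai": 5, "Italian": 20, "Greek": 3})
grouped = top_n_with_other(toy_counts, n=2)
print(grouped)
```

The result is still a `Series`, so it can be plotted with `grouped.plot.pie(autopct="%1.1f%%")` in the same way as the full counts.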

## 5. Preparation Time Analysis

In [None]:
# Average preparation time
print("Average preparation time:", df_cleaned["prep_time (in mins)"].mean())

# Distribution of preparation times
plt.figure(figsize=(10,5))
sns.histplot(df_cleaned["prep_time (in mins)"], bins=20, kde=True)
plt.title("Distribution of Preparation Times")
plt.xlabel("Preparation Time (minutes)")
plt.ylabel("Number of Recipes")
plt.show()

# Boxplot of preparation time
plt.figure(figsize=(8, 5))
sns.boxplot(x=df_cleaned["prep_time (in mins)"])
plt.title("Boxplot of Preparation Time")
plt.show()
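
The boxplot shows outliers visually, but it does not say how many there are or which values they are. With seaborn's default settings the whiskers extend to 1.5 times the interquartile range (IQR), so the same rule can be applied numerically. This is a minimal sketch; the helper `iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    values = values.dropna()
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical preparation times in minutes
toy_prep = pd.Series([10, 15, 20, 25, 30, 600])
outliers = iqr_outliers(toy_prep)
print(f"{len(outliers)} outlier(s): {outliers.tolist()}")
```

In the notebook this could be applied as `iqr_outliers(df_cleaned["prep_time (in mins)"])`. Whether flagged recipes are errors or simply long preparations still needs a manual look.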

## 6. Ingredients Analysis

In [None]:
# Average number of ingredients per recipe (rows without an ingredient list are skipped)
ingredient_lists = df_cleaned["ingredients_name"].dropna().str.split(',')
print("Average number of ingredients per recipe:", ingredient_lists.str.len().mean())

# Visualization of most used ingredients
# Strip whitespace so that " salt" and "salt" count as the same ingredient
all_ingredients = ingredient_lists.explode().str.strip()
ingredient_counts = Counter(all_ingredients)
common_ingredients = ingredient_counts.most_common(10)
ingredients, counts = zip(*common_ingredients)

plt.figure(figsize=(10,5))
sns.barplot(x=list(ingredients), y=list(counts))
plt.xticks(rotation=45)
plt.title("Top 10 Most Used Ingredients")
plt.show()

print("\nTop 10 most common ingredients:")
for ingredient, count in common_ingredients:
    print(f"{ingredient}: {count}")
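
Counting comma-separated tokens measures how often a name is listed, which is not the same as the number of recipes that use it: differences in case or spacing split one ingredient into several keys, and a name repeated within one recipe is counted more than once. The sketch below normalizes the names and counts each ingredient at most once per recipe. The helper `ingredient_recipe_counts` and the toy data are hypothetical, and lowercasing is an assumption about how names should be merged.

```python
from collections import Counter

import pandas as pd


def ingredient_recipe_counts(ingredients: pd.Series) -> Counter:
    """Count, for each normalized ingredient name, the recipes that list it."""
    counts = Counter()
    for raw in ingredients.dropna():
        # A set ensures an ingredient is counted once per recipe
        names = {part.strip().lower() for part in str(raw).split(",")}
        names.discard("")  # ignore empty tokens left by trailing commas
        counts.update(names)
    return counts


# Hypothetical ingredient strings standing in for df_cleaned["ingredients_name"]
toy_ingredients = pd.Series(["Salt, Sugar, salt", "salt,pepper", None, "Sugar , "])
recipe_counts = ingredient_recipe_counts(toy_ingredients)
print(recipe_counts.most_common(3))
```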

## 7. Data Quality Assessment

In [None]:
# Basic data quality checks
print("DATA QUALITY SUMMARY:")
print("=" * 30)
print(f"Total recipes: {len(df_cleaned)}")
print(f"Total cuisines: {df_cleaned['cuisine'].nunique()}")
print(f"Missing values: {df_cleaned.isnull().sum().sum()}")
print(f"Duplicate rows: {df_cleaned.duplicated().sum()}")

# Check for any obvious data issues
print(f"\nPrep time range: {df_cleaned['prep_time (in mins)'].min()} - {df_cleaned['prep_time (in mins)'].max()} minutes")
if 'cook_time (in mins)' in df_cleaned.columns:
    print(f"Cook time range: {df_cleaned['cook_time (in mins)'].min()} - {df_cleaned['cook_time (in mins)'].max()} minutes")
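
The minimum and maximum give a first impression, but counting implausible values makes the check more concrete. The following sketch counts non-numeric, non-positive, and very large entries in a time column. The helper `time_column_issues` is hypothetical, and the 24-hour ceiling is an arbitrary threshold chosen for illustration, not a property of the dataset.

```python
import pandas as pd


def time_column_issues(df: pd.DataFrame, column: str, max_minutes: float = 24 * 60) -> dict:
    """Count entries in a time column that look implausible."""
    raw = df[column]
    values = pd.to_numeric(raw, errors="coerce")
    return {
        # Entries that were present but could not be parsed as numbers
        "non_numeric": int(values.isna().sum() - raw.isna().sum()),
        "non_positive": int((values <= 0).sum()),
        "above_max": int((values > max_minutes).sum()),
    }


# Hypothetical toy data standing in for df_cleaned
toy = pd.DataFrame({"prep_time (in mins)": [10, 0, -5, 2000, "abc", None]})
issues = time_column_issues(toy, "prep_time (in mins)")
print(issues)
```

The same call could be repeated for `cook_time (in mins)` when that column is present.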

## 8. Key Insights Summary

In [None]:
print("KEY INSIGHTS FROM EDA")
print("=" * 50)

# Most represented cuisine (value_counts() sorts in descending order)
cuisine_counts = df_cleaned['cuisine'].value_counts()
top_cuisine = cuisine_counts.index[0]
top_cuisine_count = cuisine_counts.iloc[0]
print(f"1. Most represented cuisine: {top_cuisine} ({top_cuisine_count} recipes)")

# Average prep time
avg_prep = df_cleaned['prep_time (in mins)'].mean()
print(f"2. Average preparation time: {avg_prep:.1f} minutes")

# Most common ingredient
most_common_ingredient = common_ingredients[0][0]
ingredient_freq = common_ingredients[0][1]
print(f"3. Most common ingredient: {most_common_ingredient} (listed {ingredient_freq} times across recipes)")

# Quick recipes
quick_recipes = (df_cleaned['prep_time (in mins)'] <= 30).sum()
quick_percentage = (quick_recipes / len(df_cleaned)) * 100
print(f"4. Quick recipes (≤30 min): {quick_recipes} ({quick_percentage:.1f}%)")

print(f"5. Dataset covers {df_cleaned['cuisine'].nunique()} different cuisines")

## Exploratory Analysis Conclusion
The exploratory data analysis gave an overview of the dataset's structure, its missing values and duplicate rows, and the distributions of cuisine types, preparation times, and ingredients. The histogram and boxplot of preparation times also made outliers visible. These insights form a foundation for any remaining data cleaning, feature engineering, and model development in the subsequent steps of the pipeline.

**Key Takeaways:**
- The dataset is well-structured with minimal missing values
- The distribution of cuisine types can be leveraged for recommendations
- Preparation times include some outliers; whether the distribution is close to normal should be verified (for example, by checking its skewness) rather than assumed
- Ingredient analysis reveals common patterns that can inform similarity calculations
- The data quality is sufficient for building a robust recommendation system
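
The takeaway about the shape of the preparation-time distribution can be checked numerically. A symmetric, roughly normal distribution has a skewness near zero and a mean close to its median, while a long right tail pushes both the skewness and the mean upward. The sketch below computes these three values; the helper `shape_summary` and the toy data are hypothetical and do not reflect the actual dataset.

```python
import pandas as pd


def shape_summary(values: pd.Series) -> dict:
    """Summarize the location and asymmetry of a numeric column."""
    values = values.dropna()
    return {
        "mean": float(values.mean()),
        "median": float(values.median()),
        "skew": float(values.skew()),
    }


# Hypothetical preparation times in minutes, with a long right tail
toy_prep = pd.Series([5, 10, 10, 15, 15, 20, 20, 30, 120, 600])
summary = shape_summary(toy_prep)
print(summary)
```

In the notebook this could be called as `shape_summary(df_cleaned["prep_time (in mins)"])`. A clearly positive skew would suggest working with the median or a log-transformed time, not the mean.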