In [None]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## An Exploratory Data Analysis (EDA) of Indian Food. 


# What we want to know

1. Which flavor is the most preferred?
2. Which flavor requires the most cooking time?
3. Which state has the most number of dishes?
4. What is the diet based on region?
5. Which dish has the highest total cooking time?
6. Which are the most common ingredients used throughout the country?

# Prepare



Reliable - Unsure 
Original - Yes, the original data source is on Kaggle 
Comprehensive - Yes, most of the regions and states have been covered
Current - No, the metadata shows that it is from 2017
Cited - In the metadata

# Process

We begin by loading the packages. The data is loaded with the read.csv() function. To understand the data we use the head(), str(), colnames(), and summary() functions. The unique values of a column can be listed with the unique() function.

In [None]:
library(ggplot2)


In [None]:
food_data <- read.csv("../input/indian-food-101/indian_food.csv")

In [None]:
head(food_data)
str(food_data)
summary(food_data)
colnames(food_data)

In [None]:
unique(food_data$state)
unique(food_data$flavor_profile)
unique(food_data$course)
unique(food_data$region)

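The dataset marks missing values with -1. Missing text values are replaced with "None" (state, region) or "unknown" (flavor_profile), and missing prep_time and cook_time values are set to 0.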

In [None]:
indian_food <- food_data %>%
  mutate(
    # Text columns: swap the "-1" placeholder for a readable label
    state = str_replace(string = state, pattern = "-1", replacement = "None"),
    region = str_replace(string = region, pattern = "-1", replacement = "None"),
    flavor_profile = str_replace(string = flavor_profile, pattern = "-1", replacement = "unknown"),
    # Numeric columns: treat a missing time as 0 minutes
    prep_time = ifelse(prep_time == -1, 0, prep_time),
    cook_time = ifelse(cook_time == -1, 0, cook_time)
  )
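
To see how many values a cleaning step like the one above would touch, one option is to count the -1 placeholders per column first. The sketch below uses a small made-up data frame (`toy_food` is hypothetical, not the Kaggle file) and base R only.

```r
# Hypothetical toy data that mimics the -1 placeholder convention
toy_food <- data.frame(
  name = c("Dish A", "Dish B", "Dish C"),
  state = c("Punjab", "-1", "Gujarat"),
  flavor_profile = c("spicy", "-1", "sweet"),
  prep_time = c(10, -1, 20),
  cook_time = c(-1, 30, 40),
  stringsAsFactors = FALSE
)

# Count placeholder entries in every column; as.character() lets the same
# comparison work for both text and numeric columns
sentinel_counts <- sapply(toy_food, function(col) sum(as.character(col) == "-1"))
sentinel_counts
```

Running the same `sapply()` call on `food_data` would show which columns contain placeholders before they are replaced.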


In [None]:
flavor_profile<- indian_food %>%
count(flavor_profile)
flavor_profile


In [None]:
courses <- indian_food %>%
count(course)
courses

Combine prep_time and cook_time into one column, "total_cookingtime", and drop the two original columns.

In [None]:
cleaned_food <- indian_food %>%
  mutate(total_cookingtime = prep_time + cook_time) %>%
  select(-prep_time, -cook_time)

In [None]:
summary(cleaned_food)

# Analyze

## 1. Which is the preferred flavor throughout the nation?

In [None]:
ggplot(flavor_profile) +
  geom_col(mapping = aes(x = reorder(flavor_profile, n), y = n, fill = flavor_profile)) +
  labs(title = "The Flavor Profile of India", x = "flavor profile", y = "count", subtitle = "Spice is Nice!")

## 2. Which flavor profile requires the most cooking time?

In [None]:
cooktime_flavor <- cleaned_food %>%
  group_by(flavor_profile) %>%
  summarise(Mean_Cookingtime = mean(total_cookingtime)) %>%
  arrange(flavor_profile)
head(cooktime_flavor)

In [None]:
ggplot(data = cooktime_flavor) +
  geom_col(mapping = aes(x = flavor_profile, y = Mean_Cookingtime, fill = flavor_profile)) +
  labs(title = "Flavor with the most cooking time")
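
Because missing times were set to 0 during cleaning, dishes without recorded times pull the group means down. A possible alternative, sketched here on made-up data (`toy_times` and both result names are hypothetical), is to mark missing times as NA and average only the dishes that have both times recorded.

```r
library(dplyr)

# Hypothetical toy data: one sweet dish has no recorded times (-1)
toy_times <- data.frame(
  flavor_profile = c("sweet", "sweet", "spicy", "spicy"),
  prep_time = c(10, -1, 20, 10),
  cook_time = c(30, -1, 40, 20)
)

# Approach used in this notebook: missing times count as 0 minutes
mean_with_zero <- toy_times %>%
  mutate(prep_time = ifelse(prep_time == -1, 0, prep_time),
         cook_time = ifelse(cook_time == -1, 0, cook_time),
         total_cookingtime = prep_time + cook_time) %>%
  group_by(flavor_profile) %>%
  summarise(Mean_Cookingtime = mean(total_cookingtime))

# Alternative: missing times become NA and are left out of the mean
mean_with_na <- toy_times %>%
  mutate(prep_time = na_if(prep_time, -1),
         cook_time = na_if(cook_time, -1),
         total_cookingtime = prep_time + cook_time) %>%
  group_by(flavor_profile) %>%
  summarise(Mean_Cookingtime = mean(total_cookingtime, na.rm = TRUE))

mean_with_zero
mean_with_na
```

In this toy example the sweet mean differs between the two approaches while the spicy mean does not, since only the sweet group has a dish with missing times.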

## 3. Which are the top 10 states with the most dishes?

In [None]:
# count() groups by state itself, so a separate group_by() is not needed
state <- cleaned_food %>%
  count(state) %>%
  arrange(desc(n)) %>%
  head(10)
state

In [None]:
ggplot(data = state) +
  geom_col(mapping = aes(x = reorder(state, n), y = n, fill = state)) +
  labs(title = "Top 10 states with max number of dishes", x = "state", y = "count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
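
Dishes with no recorded state were labelled "None" during cleaning, so that label can itself land in the top 10. A minimal sketch of one way to leave it out of the ranking, using made-up data (`toy_states` and `top_states` are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data: "None" marks dishes without a recorded state
toy_states <- data.frame(
  state = c("Gujarat", "None", "None", "None", "Punjab", "Gujarat"),
  stringsAsFactors = FALSE
)

top_states <- toy_states %>%
  filter(state != "None") %>%    # drop the placeholder label
  count(state, sort = TRUE) %>%  # dishes per state, largest first
  head(10)
top_states
```

Whether to drop the placeholder depends on the question being asked; keeping it shows how many dishes are not tied to a single state.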

## 4. What is the diet based on region?

In [None]:
diet_by_region <- cleaned_food %>%
select(diet, region) %>%
group_by(region) %>%
count(diet)


In [None]:
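# reorder() uses the mean of n by default; FUN = sum orders the stacked bars by total dishes per region
ggplot(data = diet_by_region) +
  geom_col(aes(x = reorder(region, n, FUN = sum), y = n, fill = diet)) +
  labs(title = "Preferred diet of India", x = "region", y = "count")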
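
Raw counts make regions with more dishes look more vegetarian simply because they have more rows. One way to compare regions on equal footing is to compute each diet's share within its region. The sketch below uses made-up data (`toy_diet` and `diet_share` are hypothetical names).

```r
library(dplyr)

# Hypothetical toy data: one row per dish
toy_diet <- data.frame(
  region = c("North", "North", "North", "South", "South"),
  diet = c("vegetarian", "vegetarian", "non vegetarian",
           "vegetarian", "non vegetarian"),
  stringsAsFactors = FALSE
)

diet_share <- toy_diet %>%
  count(region, diet) %>%         # dishes per region and diet
  group_by(region) %>%
  mutate(share = n / sum(n)) %>%  # fraction of the region's dishes
  ungroup()
diet_share
```

A similar effect can be had directly in the plot by passing `position = "fill"` to `geom_col()`, which scales every bar to the same height.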


## 5. Which dish has the most total cooking time?

In [None]:
dish_cookingtime <- cleaned_food %>%
select(name, total_cookingtime) %>%
arrange(desc(total_cookingtime)) %>%
head(10)
dish_cookingtime

In [None]:
ggplot(data=dish_cookingtime) + geom_col(aes(x=reorder(name,total_cookingtime), y= total_cookingtime, fill=name)) + 
labs(title = "Dish with most cooking time", x="name of dish", y = "total cooking time") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))


## 6. Which are the most common ingredients used all over the country?

In [None]:
ingredients <- cleaned_food %>%
 select(ingredients) %>%
 mutate(ingredients = str_split(ingredients,' ')) %>%
 unnest(ingredients) %>%
 group_by(ingredients) %>%
 count() %>%
 arrange(desc(n)) %>%
 head(20)
 

In [None]:
ggplot(data= ingredients) + geom_col(aes(x = reorder(ingredients,n), y = n, fill = ingredients)) +
 labs(title = 'Most Common Ingredients',x = 'Ingredients',y = 'Count') +
 theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
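
Splitting on spaces counts single words, so a multi-word ingredient is broken into separate tokens. If the ingredients column stores comma-separated lists (an assumption about the file layout, not verified here), splitting on commas keeps each ingredient whole. A minimal sketch on made-up data (`toy_ingredients` and `ingredient_counts` are hypothetical names):

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data: comma-separated ingredient lists
toy_ingredients <- data.frame(
  name = c("Dish A", "Dish B"),
  ingredients = c("Maida flour, sugar, ghee", "Gram flour, ghee, sugar"),
  stringsAsFactors = FALSE
)

ingredient_counts <- toy_ingredients %>%
  select(ingredients) %>%
  mutate(ingredients = str_split(ingredients, ",")) %>%  # one list element per ingredient
  unnest(ingredients) %>%
  # trim stray spaces and unify case so "Sugar" and " sugar" match
  mutate(ingredients = str_to_lower(str_trim(ingredients))) %>%
  count(ingredients, sort = TRUE)
ingredient_counts
```

Counts produced this way are per ingredient rather than per word, so they would not be directly comparable to the word-level counts plotted above.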

# Share

1. The preferred flavor is "spicy".
2. Sweet dishes take the most cooking time on average.
3. Gujarat, Punjab, and Maharashtra top the list of states with the most dishes.
4. A vegetarian diet seems to be the preferred diet throughout the country.
5. Shrikhand has the highest total cooking time (preparation time plus cooking time).
6. Some of the most common ingredients are flour, sugar, dal, and gram masala.
