# Restaurant Food Cost

#### Project Description

Who doesn’t love food? Most of us crave at least a few favourite dishes, and we often have a favourite restaurant that serves them just the way we like. One factor, however, can make us think twice about ordering our favourite food from our favourite restaurant: the cost. In this hackathon, you will predict the cost of food served by restaurants across different cities in India. You will use your data science skills to investigate the factors that really affect the cost, and you may even gain some interesting insights that help you choose what to eat and where.

You are provided with the following two files:
1. train.csv : Use this dataset to train the model. This file contains all the restaurant details as well as the target variable COST.
2. test.csv : Use the trained model to predict the cost of a two-person meal for each restaurant in this file.

#### Dataset Attributes
* TITLE: A feature of the restaurant that helps identify what it is suitable for and for whom.
* RESTAURANT_ID: A unique ID for each restaurant.
* CUISINES: The variety of cuisines that the restaurant offers.
* TIME: The opening hours of the restaurant.
* CITY: The city in which the restaurant is located.
* LOCALITY: The locality of the restaurant.
* RATING: The average rating of the restaurant by customers.
* VOTES: The overall votes received by the restaurant.
* COST: The average cost of a two-person meal.

#### Dataset Links
* https://github.com/FlipRoboTechnologies/ML-Datasets/tree/main/Restaurant%20Food%20Cost
* https://github.com/FlipRoboTechnologies/ML-Datasets/blob/main/Restaurant%20Food%20Cost/Data_Test.xlsx
* https://github.com/FlipRoboTechnologies/ML-Datasets/blob/main/Restaurant%20Food%20Cost/Data_Train.xlsx


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [4]:
train_data = pd.read_excel("https://github.com/FlipRoboTechnologies/ML-Datasets/raw/main/Restaurant%20Food%20Cost/Data_Train.xlsx")
test_data = pd.read_excel("https://github.com/FlipRoboTechnologies/ML-Datasets/raw/main/Restaurant%20Food%20Cost/Data_Test.xlsx")

In [15]:
# Combine train and test data for preprocessing
combined_data = pd.concat([train_data, test_data], ignore_index=True)

In [16]:
# Drop unnecessary columns
combined_data.drop(['RESTAURANT_ID', 'TIME'], axis=1, inplace=True)

In [17]:
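# Handle missing values (assign the result back; inplace fillna on a selected column is unreliable in recent pandas)
combined_data['RATING'] = combined_data['RATING'].fillna('NEW')
combined_data['VOTES'] = combined_data['VOTES'].fillna('0 votes')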

In [18]:
# Ensure 'VOTES' column is of string type
combined_data['VOTES'] = combined_data['VOTES'].astype(str)

In [19]:
# Extract the leading run of digits from 'VOTES' (raw string for the regex; expand=False returns a Series)
combined_data['VOTES'] = combined_data['VOTES'].str.extract(r'(\d+)', expand=False)

In [20]:
# Convert 'VOTES' to float, handle errors with coerce
combined_data['VOTES'] = pd.to_numeric(combined_data['VOTES'], errors='coerce')
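
The vote-cleaning steps above can be tried in isolation on a few sample values. This is a minimal sketch: the sample strings are hypothetical and only assume the "<n> votes" format implied by the fill value used earlier.

```python
import pandas as pd

# Hypothetical sample values in the assumed "<n> votes" format
votes = pd.Series(["44 votes", "1205 votes", None, "0 votes"])

# Pull out the first run of digits; expand=False yields a Series, not a DataFrame
digits = votes.str.extract(r"(\d+)", expand=False)

# Coerce to numbers; anything that did not match becomes NaN, then 0
votes_numeric = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(votes_numeric.tolist())
```

Filling unmatched values with 0 here is an illustrative choice; the notebook instead leaves them as NaN for the imputer to handle later.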

In [22]:
# Encode categorical variables as integer codes
# (cast to str first so missing values become their own 'nan' label)
encoder = LabelEncoder()
for col in ['TITLE', 'CUISINES', 'CITY', 'LOCALITY']:
    combined_data[col] = encoder.fit_transform(combined_data[col].astype(str))
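
To make the effect of label encoding concrete, here is a small sketch on hypothetical city names. It shows that `LabelEncoder` assigns integer codes according to the sorted order of the unique labels, and that the codes can be mapped back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, not taken from the dataset
cities = ["Mumbai", "Chennai", "Mumbai", "Kochi"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

classes = list(encoder.classes_)                     # sorted unique labels
decoded = encoder.inverse_transform(codes).tolist()  # codes mapped back to labels
print(classes, codes.tolist(), decoded)
```

Note that the integer codes imply an ordering that has no real meaning for cities or cuisines. Tree-based models such as random forests usually tolerate this, but it is a design choice worth keeping in mind.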

In [None]:
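# Convert 'RATING' to float: keep the part before any '/', and turn non-numeric
# placeholders such as 'NEW' into NaN (float('NEW') would raise a ValueError)
combined_data['RATING'] = pd.to_numeric(
    combined_data['RATING'].astype(str).str.split('/').str[0], errors='coerce')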
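
The rating conversion can be checked on a handful of values. This sketch uses hypothetical samples: 'NEW' is the fill value used earlier in the notebook, while '4.2/5' and '-' are assumed formats included only for illustration.

```python
import pandas as pd

# Hypothetical sample ratings; 'NEW' and '-' stand in for non-numeric placeholders
ratings = pd.Series(["3.6", "4.2/5", "NEW", "-", None])

# Keep the text before any '/', then coerce everything non-numeric to NaN
head = ratings.astype(str).str.split("/").str[0]
rating_numeric = pd.to_numeric(head, errors="coerce")
print(rating_numeric.tolist())
```

Leaving the placeholders as NaN lets a later imputation step decide how to fill them, rather than hard-coding a value at parse time.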

In [24]:
# Split the data back into train and test sets
train_processed = combined_data[:train_data.shape[0]]
test_processed = combined_data[train_data.shape[0]:]

# Split features and target variable
X = train_processed.drop('COST', axis=1)
y = train_processed['COST']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
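
The combine-then-split pattern used above can be illustrated on two tiny frames. This is a sketch with made-up values, assuming the test file has no COST column, so the combined frame holds NaN in COST for the test rows.

```python
import pandas as pd

# Hypothetical miniature train and test frames
train = pd.DataFrame({"VOTES": [10, 20, 30], "COST": [300, 500, 800]})
test = pd.DataFrame({"VOTES": [15, 25]})

# Stack the frames so preprocessing is applied identically to both
combined = pd.concat([train, test], ignore_index=True)

# Recover the two parts by position, using the original train length
n_train = len(train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop(columns="COST")  # target is empty for test rows
print(train_part.shape, test_part.shape)
```

Slicing by position only works because `ignore_index=True` keeps the train rows first and in their original order.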

In [None]:
# Define preprocessing for numerical and categorical features
# (after label encoding every column is numeric, so the categorical branch below may be empty)
numerical_features = X.select_dtypes(include='number').columns
numerical_transformer = SimpleImputer(strategy='mean')

categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [None]:
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the model
model = RandomForestRegressor(n_estimators=100, random_state=42)

In [None]:
# Create and fit the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)])

pipeline.fit(X_train, y_train)

# Predict on validation set
val_predictions = pipeline.predict(X_val)

# Evaluate the model
val_mse = mean_squared_error(y_val, val_predictions)
val_rmse = val_mse ** 0.5
print(f'Validation RMSE: {val_rmse}')

# Make predictions on the test data (drop the empty target column first)
test_predictions = pipeline.predict(test_processed.drop('COST', axis=1))

# Output predictions to a file
output = pd.DataFrame({'COST': test_predictions})
output.to_excel('predicted_costs.xlsx', index=False)
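
For reference, the validation metric above is the root mean squared error, the square root of the average squared difference between actual and predicted costs. A worked example on three hypothetical values, using only the standard library:

```python
import math

# Hypothetical actual and predicted costs
y_true = [300.0, 500.0, 800.0]
y_pred = [350.0, 450.0, 700.0]

# Squared error for each pair: 2500, 2500, 10000
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

mse = sum(squared_errors) / len(squared_errors)  # 15000 / 3 = 5000
rmse = math.sqrt(mse)                            # about 70.71
print(f"MSE: {mse}, RMSE: {rmse:.2f}")
```

Because RMSE is expressed in the same units as COST, it can be read directly as a typical prediction error, with larger misses weighted more heavily.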
