## Problem Statement

The first thing any visitor to India will take in, probably while staring out the window in awe as their aeroplane descends, is the sheer size of the country. It is densely populated and patchworked with distinct neighbourhoods, each with its own culinary identity. It would take several lifetimes to get to know all of the street stands, holes in the wall, neighbourhood favourites, and high-end destinations in the country.
For Indians, dining out is and always will be a joyous occasion. Everyone has their own favourite restaurants, from the street food stall across the road to the 5-star restaurants in the heart of the city. Some are favourites because of the memories attached to them, and some because the place has a fantastic ambience. Many other factors also contribute to how well liked a restaurant is, which in turn determines its popularity among the masses.

From a restaurant's business perspective, greater popularity may mean more visits, which increases annual turnover. For any restaurant to survive and do well, its annual turnover has to be substantial.

This problem attempts to predict the annual turnover of a set of restaurants across India from the variables given in the data set. These include restaurant attributes such as location, opening date, cuisine type, and theme, along with data pooled from other sources, such as a social media popularity index and Zomato ratings. Lastly, the data set adds a different flavour to the problem by including customer survey data and ratings provided by a mystery visitor (an audit done by a third party).

## Import Libraries

In [None]:
# to load and manipulate data
import pandas as pd
import numpy as np

# to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# to split data into training and test sets
from sklearn.model_selection import train_test_split

# to build decision tree model (a regressor, since annual turnover is a continuous target)
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

# to tune different models
from sklearn.model_selection import GridSearchCV

# to compute regression metrics (annual turnover is a continuous target)
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to perform t-SNE
from sklearn.manifold import TSNE

# to ignore unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

# to define a common seed value to be used throughout
RS = 23

## Data Overview

In [None]:
restaurants = pd.read_csv('data/updateme.csv')
data = restaurants.copy()

* Check how many rows and columns we have
* Check for null/duplicated data
* Check data types and update as necessary
* Check summaries and look for immediate oddities
  * Negative values where they should always be positive
  * Zero values when values should always be higher than zero

## Exploratory Data Analysis

### Univariate Analysis

### Bivariate Analysis

* Check both Pearson and Spearman correlations
* Get pairplots
* Use the above info to get an idea of most important factors
  * Dive deeper into those factors and how they stack up against our target
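
One possible way to compare the two correlation measures against the target is sketched below. The `target_correlations` helper and the column names are assumptions made for illustration, and the data is synthetic. Pearson measures linear association while Spearman measures monotonic association, so a large gap between them hints at a non-linear relationship with the target.

```python
import numpy as np
import pandas as pd


def target_correlations(df, target):
    """Pearson and Spearman correlation of each numeric feature with the target."""
    numeric = df.select_dtypes(include="number")
    pearson = numeric.corr(method="pearson")[target].drop(target)
    spearman = numeric.corr(method="spearman")[target].drop(target)
    table = pd.DataFrame({"pearson": pearson, "spearman": spearman})
    # Rank features by the strength of their monotonic relationship.
    order = table["spearman"].abs().sort_values(ascending=False).index
    return table.loc[order]


# Synthetic data with hypothetical column names.
rng = np.random.default_rng(23)
rating = rng.uniform(1, 5, size=200)
toy = pd.DataFrame(
    {
        "zomato_rating": rating,
        "random_noise": rng.normal(size=200),
        "city": ["Pune", "Goa"] * 100,               # non-numeric, ignored
        "annual_turnover": 1_000_000 * np.exp(rating),  # monotonic but non-linear
    }
)

corr_table = target_correlations(toy, "annual_turnover")
print(corr_table)
```

The top-ranked features are natural candidates for a closer look, for example with `sns.pairplot` restricted to those columns plus the target.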

## Feature Engineering
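The problem statement mentions an opening date and a cuisine type, so two plausible starting features are restaurant age and one-hot encoded cuisine. The sketch below assumes hypothetical column names (`opening_date`, `cuisine`) and an arbitrary reference date. None of these come from the actual data set.

```python
import pandas as pd


def add_basic_features(df, reference_date="2024-01-01"):
    """Derive restaurant age in years and one-hot encode the cuisine column."""
    out = df.copy()
    # Unparseable dates become NaT instead of raising an error.
    opened = pd.to_datetime(out["opening_date"], errors="coerce")
    # The reference date is an assumption; the data collection date would be a better choice.
    out["age_years"] = (pd.Timestamp(reference_date) - opened).dt.days / 365.25
    out = out.drop(columns=["opening_date"])
    # One indicator column per cuisine value, e.g. cuisine_indian.
    return pd.get_dummies(out, columns=["cuisine"], prefix="cuisine")


# Toy data with hypothetical column names.
toy = pd.DataFrame(
    {
        "opening_date": ["2014-01-01", "2020-01-01"],
        "cuisine": ["indian", "chinese"],
    }
)

features = add_basic_features(toy)
print(features)
```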

## Model Objective

* Linear Regression
* Decision Tree
* Random Forest
* Cluster
  * Linear Regression
  * Decision Tree
  * Random Forest
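
A minimal sketch of how these candidates might be compared on a held-out test set follows. Because annual turnover is a continuous target, it uses the regression forms of the listed model families (`LinearRegression`, `DecisionTreeRegressor`, `RandomForestRegressor`). The `compare_models` helper, the synthetic data, and the reading of the "Cluster" variant as "add a k-means cluster label as an extra feature" are all assumptions, not the notebook's confirmed approach.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

RS = 23  # common seed, matching the notebook's convention


def compare_models(X, y, use_clusters=False, n_clusters=3):
    """Fit three regressors and report test-set RMSE and R^2, best RMSE first."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=RS
    )
    if use_clusters:
        # Fit the scaler and k-means on the training split only, to avoid leakage.
        scaler = StandardScaler().fit(X_train)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=RS)
        km.fit(scaler.transform(X_train))
        # The integer cluster label is added as a plain feature (a simplification).
        X_train = X_train.assign(cluster=km.predict(scaler.transform(X_train)))
        X_test = X_test.assign(cluster=km.predict(scaler.transform(X_test)))

    models = {
        "linear_regression": LinearRegression(),
        "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=RS),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=RS),
    }
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        rows.append({"model": name, "rmse": rmse, "r2": float(r2_score(y_test, pred))})
    return pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)


# Synthetic data standing in for the real features and turnover.
rng = np.random.default_rng(RS)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f1", "f2", "f3"])
y = 5 + 3 * X["f1"] - 2 * X["f2"] + rng.normal(scale=0.1, size=300)

results = compare_models(X, y)
results_clustered = compare_models(X, y, use_clusters=True)
print(results)
print(results_clustered)
```

Putting both result tables side by side gives a first indication of whether the cluster feature helps. Hyperparameter tuning (for example with `GridSearchCV`) would come after this baseline.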

## Model Comparison and Selection