<a href="https://colab.research.google.com/github/graveo-wicaksana/DA_restaurantSales/blob/main/DA_restaurantSales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview
Explain the background

# Key Performance Indicators
1.   Find the most ordered item in each category
2.   List the preferred items for each payment method
3.   Describe the sales trend over a year to identify the months with high sales
4.   Recommend marketing strategies to increase sales and/or engage more customers



# Preparation Datasets
The dataset, "Restaurant Sales-Dirty Data for Cleaning Training", was published on Kaggle by Ahmed Mohamed. I downloaded the file and stored a copy in Google Drive so that this notebook keeps running even if the author moves the file. Readers are encouraged to download the file from the author's Kaggle page at this [link](https://www.kaggle.com/datasets/ahmedmohamed2003/restaurant-sales-dirty-data-for-cleaning-training).

The dataset will be processed using Python in this Jupyter Notebook.

In [1]:
#Load Library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#Load file and show top 5 records.
url = 'http://drive.google.com/file/d/1H1ya5-Dv4Pq-ony2SbsFMKvzZ3Yp_Df0/view?usp=sharing'
url = 'http://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
df.head()

Unnamed: 0,Order ID,Customer ID,Category,Item,Price,Quantity,Order Total,Order Date,Payment Method
0,ORD_705844,CUST_092,Side Dishes,Side Salad,3.0,1.0,3.0,2023-12-21,Credit Card
1,ORD_338528,CUST_021,Side Dishes,Mashed Potatoes,4.0,3.0,12.0,2023-05-19,Digital Wallet
2,ORD_443849,CUST_029,Main Dishes,Grilled Chicken,15.0,4.0,60.0,2023-09-27,Credit Card
3,ORD_630508,CUST_075,Drinks,,,2.0,5.0,2022-08-09,Credit Card
4,ORD_648269,CUST_031,Main Dishes,Pasta Alfredo,12.0,4.0,48.0,2022-05-15,Cash
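
The share-link rewrite in the cell above can be wrapped in a small helper. This is a minimal sketch: `drive_direct_url` is a hypothetical name, and it assumes the share link follows the `.../file/d/<FILE_ID>/view` pattern used in this notebook.

```python
def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive share link into the direct link used by pd.read_csv."""
    # Share links look like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file ID is the path segment right after "d".
    parts = share_url.split("/")
    try:
        file_id = parts[parts.index("d") + 1]
    except (ValueError, IndexError):
        raise ValueError(f"Not a recognised Drive share link: {share_url}")
    return "http://drive.google.com/uc?id=" + file_id


share_url = "http://drive.google.com/file/d/1H1ya5-Dv4Pq-ony2SbsFMKvzZ3Yp_Df0/view?usp=sharing"
print(drive_direct_url(share_url))
```

Looking up the segment after `d` instead of counting from the end keeps the helper working when the link has no trailing `/view` part changes, and a malformed link fails with a clear error instead of a wrong URL.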


In [3]:
#Show info of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17534 entries, 0 to 17533
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Order ID        17534 non-null  object 
 1   Customer ID     17534 non-null  object 
 2   Category        17534 non-null  object 
 3   Item            15776 non-null  object 
 4   Price           16658 non-null  float64
 5   Quantity        17104 non-null  float64
 6   Order Total     17104 non-null  float64
 7   Order Date      17534 non-null  object 
 8   Payment Method  16452 non-null  object 
dtypes: float64(3), object(6)
memory usage: 1.2+ MB


In [4]:
#Show descriptive statistics of the numeric columns using describe
df.describe()

Unnamed: 0,Price,Quantity,Order Total
count,16658.0,17104.0,17104.0
mean,6.586325,3.014149,19.914494
std,4.834652,1.414598,18.732549
min,1.0,1.0,1.0
25%,3.0,2.0,7.5
50%,5.0,3.0,15.0
75%,7.0,4.0,25.0
max,20.0,5.0,100.0


In [5]:
#Show descriptive statistics of the object columns using describe
df.describe(include="object")

Unnamed: 0,Order ID,Customer ID,Category,Item,Order Date,Payment Method
count,17534,17534,17534,15776,17534,16452
unique,17534,100,5,26,730,3
top,ORD_680707,CUST_066,Main Dishes,Pasta Alfredo,2023-11-25,Credit Card
freq,1,207,3551,998,42,5504


# Preprocessing Datasets

Clean the dataset by checking for and handling:
1. Duplicated records
2. Null records
3. Inconsistent records (e.g. formats and categories)
4. Typos
5. Invalid records


## Handling Duplicated Records

In [6]:
#check duplicated records
df.duplicated().sum()

np.int64(0)

## Handling Null Value

In [32]:
#Check null records.
print (df.isna().sum())
#Total entries: 17534
#Weakness: the counts do not show the share of nulls directly

#Show the fraction of nulls for every column that has any (multiply by 100 for a percentage)
na_columns = [i for i in df.columns if df[i].isna().mean() > 0]
df_na_columns = df[na_columns].isna().mean()
print (df_na_columns)
#Prefer this view to see the proportion of nulls per column

Order ID             0
Customer ID          0
Category             0
Item              1758
Price              876
Quantity           430
Order Total        430
Order Date           0
Payment Method    1082
dtype: int64
Item              0.100262
Price             0.049960
Quantity          0.024524
Order Total       0.024524
Payment Method    0.061709
dtype: float64
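
The two printouts above can be combined into one table. The following is a minimal sketch on a made-up frame; `null_summary` is a hypothetical helper and only the column names come from the dataset.

```python
import pandas as pd

# Tiny stand-in frame; the column names mirror the dataset, the values are made up.
sample = pd.DataFrame({
    "Item": ["Steak", None, "Water", None],
    "Price": [20.0, 3.0, None, 5.0],
    "Order Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"],
})


def null_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns that contain nulls."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one null, most affected first.
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)


print(null_summary(sample))
```

Calling `null_summary(df)` on the real dataset should list the same five columns as the output above, with counts and percentages side by side.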


### Numeric Datatype

In [8]:
#To handle null values in the numeric columns, use the relation Price * Quantity = Order Total
df2 = df.copy() #work on an explicit copy so the original df stays unchanged
# df[df.isna().any(axis=1)] #show all rows with NaN values

#Handling Price column
print ("Handling NaN value in Price column")
print ("Before: ", df2['Price'].isna().sum())
df2['Price'] = df2['Price'].fillna(df2['Order Total']/df2['Quantity']) #target: fill 876 NaN
print ("After: ", df2['Price'].isna().sum(), "\n") #430 NaN remain; Quantity and Order Total also have 430 NaN each, so these rows likely miss more than one of the three values

#Handling Quantity column
print ("Handling NaN value in Quantity column")
print ("Before: ", df2['Quantity'].isna().sum())
df2['Quantity'] = df2['Quantity'].fillna(df2['Order Total']/df2['Price']) #target: fill 430 NaN
print ("After: ", df2['Quantity'].isna().sum(), "\n") #still 430

#Handling Order Total column
print ("Handling NaN value in Order Total column")
print ("Before: ", df2['Order Total'].isna().sum())
df2['Order Total'] = df2['Order Total'].fillna(df2['Price']*df2['Quantity']) #target: fill 430 NaN
print ("After: ", df2['Order Total'].isna().sum()) #still 430



Handling NaN value in Price column
Before:  876
After:  430 

Handling NaN value in Quantity column
Before:  430
After:  430 

Handling NaN value in Order Total column
Before:  430
After:  430


Handling of the numeric columns is **pending**. The null values in the other columns are addressed first.
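
The relation Price × Quantity = Order Total can recover only one missing value per row, so every row that is still incomplete after the fills above must be missing at least two of the three values. A minimal sketch of how to separate the two cases, using made-up rows with the dataset's column names:

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "Price":       [3.0, None, 4.0, None],
    "Quantity":    [2.0, 2.0, None, None],
    "Order Total": [6.0, 10.0, None, 8.0],
})

numeric_cols = ["Price", "Quantity", "Order Total"]

# Number of missing values among the three related columns, per row.
missing_per_row = sample[numeric_cols].isna().sum(axis=1)

# One missing value can be derived from the other two; two or more cannot.
recoverable = sample[missing_per_row == 1]
unrecoverable = sample[missing_per_row >= 2]

print("Recoverable rows:", len(recoverable))
print("Unrecoverable rows:", len(unrecoverable))
```

Rows in the second group need another source of information, such as the menu map for Price, before the formula can be applied again.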

### Item column

In [9]:
#To handle missing Item values, use the menu map from the source. The menu map is written to a new DataFrame
df_menumap = pd.DataFrame({
    'Category': ['Starters', 'Starters', 'Starters', 'Starters', 'Starters', 'Starters',
                'Main Dishes', 'Main Dishes', 'Main Dishes', 'Main Dishes', 'Main Dishes',
                'Desserts', 'Desserts', 'Desserts', 'Desserts', 'Desserts',
                'Drinks', 'Drinks', 'Drinks', 'Drinks', 'Drinks',
                'Side Dishes', 'Side Dishes', 'Side Dishes', 'Side Dishes', 'Side Dishes'],
    'Item' : ['Chicken Melt', 'French Fries', 'Cheese Fries', 'Sweet Potato Fries', 'Beef Chili', 'Nachos Grande',
             'Grilled Chicken', 'Steak', 'Pasta Alfredo', 'Salmon', 'Vegetarian Platter',
             'Chocolate Cake', 'Ice Cream', 'Fruit Salad', 'Cheesecake', 'Brownie',
             'Coca Cola', 'Orange Juice', 'Lemonade', 'Iced Tea', 'Water',
             'Mashed Potatoes', 'Grilled Vegetables', 'Side Salad', 'Garlic Bread', 'Onion Rings'],
    'Price' : [8.0, 4.0, 5.0, 5.0, 7.0, 10.0,
               15.0, 20.0, 12.0, 18.0, 14.0,
               6.0, 5.0, 4.0, 7.0, 6.0,
               2.5, 3.0, 3.0, 2.5, 1.0,
               4.0, 5.0, 3.0, 4.0, 5.0]
})

#preview menumap
df_menumap

Unnamed: 0,Category,Item,Price
0,Starters,Chicken Melt,8.0
1,Starters,French Fries,4.0
2,Starters,Cheese Fries,5.0
3,Starters,Sweet Potato Fries,5.0
4,Starters,Beef Chili,7.0
5,Starters,Nachos Grande,10.0
6,Main Dishes,Grilled Chicken,15.0
7,Main Dishes,Steak,20.0
8,Main Dishes,Pasta Alfredo,12.0
9,Main Dishes,Salmon,18.0
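
Since every item in the menu map has exactly one price, a missing Price can be looked up from the Item whenever the Item is known. The sketch below illustrates the idea on made-up order rows; the three menu entries are copied from the menu map, and the variable names are assumptions.

```python
import pandas as pd

# Three entries copied from the menu map; the order rows are made up.
menu = pd.DataFrame({
    "Item": ["Steak", "Water", "Side Salad"],
    "Price": [20.0, 1.0, 3.0],
})
orders = pd.DataFrame({
    "Item": ["Steak", "Water", None, "Side Salad"],
    "Price": [None, 1.0, None, None],
})

# Item -> Price lookup; rows with a missing Item map to NaN and stay unfilled.
price_by_item = menu.set_index("Item")["Price"]
orders["Price"] = orders["Price"].fillna(orders["Item"].map(price_by_item))

print(orders)
```

Rows that lack both Item and Price remain NaN and need a different rule.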


In [43]:
#Examine menu items that share the same price within a category
duplicate_menu_cost = df_menumap.duplicated(subset=['Category', 'Price'], keep=False)
print(df_menumap[duplicate_menu_cost].sort_values(by=['Category', 'Price'], ascending=[True,False]))
#Starters: 5| Side Dishes: 5 or 4| Drinks: 2.5 or 3| Desserts: 6
#Purpose: find the prices that are shared by more than one item
#        Category                Item  Price
# 11     Desserts      Chocolate Cake    6.0 798 items
# 15     Desserts             Brownie    6.0 469 items
# 17       Drinks        Orange Juice    3.0 591 items
# 18       Drinks            Lemonade    3.0 479 items
# 16       Drinks           Coca Cola    2.5 756 items
# 19       Drinks            Iced Tea    2.5 328 items
# 22  Side Dishes  Grilled Vegetables    5.0 578 items
# 25  Side Dishes         Onion Rings    5.0 373 items
# 21  Side Dishes     Mashed Potatoes    4.0 799 items
# 24  Side Dishes        Garlic Bread    4.0 399 items
# 2      Starters        Cheese Fries    5.0 686 items
# 3      Starters  Sweet Potato Fries    5.0 471 items


#Count the existing records for each item
columns_filter = ['Category', 'Item', 'Price']
print(df2[columns_filter].groupby(['Category', 'Item']).count())
#Purpose: find which item appears most often among items that share a price
#These are the counts quoted next to the duplicate prices above
#                                 Price
# Category    Item
# Desserts    Brownie               469
#             Cheesecake            485
#             Chocolate Cake        798
#             Fruit Salad           449
#             Ice Cream             936
# Drinks      Coca Cola             756
#             Iced Tea              328
#             Lemonade              479
#             Orange Juice          591
#             Water                 956
# Main Dishes Grilled Chicken       822
#             Pasta Alfredo         998
#             Salmon                422
#             Steak                 574
#             Vegetarian Platter    382
# Side Dishes Garlic Bread          399
#             Grilled Vegetables    578
#             Mashed Potatoes       799
#             Onion Rings           373
#             Side Salad            978
# Starters    Beef Chili            470
#             Cheese Fries          686
#             Chicken Melt          442
#             French Fries          897
#             Nachos Grande         238
#             Sweet Potato Fries    471

#Examine rows in df2 with a missing Item for each ambiguous price
#Below is a rough calculation
# print (df2[(df2['Category']=='Starters') & (df2['Item'].isna()) & (df2['Price']==5.0)]) #TODO: tidy up and show as a grouped table
#Desserts at 6 dollars: 103 rows with missing Item --> Chocolate Cake
#Drinks at 3 dollars: 94 rows with missing Item --> Orange Juice
#Drinks at 2.5 dollars: 95 rows with missing Item --> Coca Cola
#Side Dishes at 5 dollars: 100 rows with missing Item --> Grilled Vegetables
#Side Dishes at 4 dollars: 82 rows with missing Item --> Mashed Potatoes
#Starters at 5 dollars: 83 rows with missing Item --> Cheese Fries

#Replace NaN values with the most frequent item for that price
dessert_6 = 'Chocolate Cake'
drink_3 = 'Orange Juice'
drink_2_5 = 'Coca Cola'
sideD_5 = 'Grilled Vegetables'
sideD_4 = 'Mashed Potatoes'
starters_5 = 'Cheese Fries'

#Define boolean masks to keep the filters readable
na_dessert = df2['Category']=='Desserts'
na_drink = df2['Category']=='Drinks'
na_sideD = df2['Category']=='Side Dishes'
na_starters = df2['Category']=='Starters'
na_6 = df2['Price']==6.0
na_3 = df2['Price']==3.0
na_2_5 = df2['Price']==2.5
na_5 = df2['Price']==5.0
na_4 = df2['Price']==4.0
na_item = df2['Item'].isna()

print (df2[na_dessert & na_6 & na_item].fillna({'Item': dessert_6})) #preview only; fill just the Item column, not other NaN columns

#Check NaN values for Category, Item, Price
# na_value = ['Item', 'Price', 'Quantity', 'Order Total']
# df2[df2[na_value].isna().any(axis=1)]

       Category                Item  Price
11     Desserts      Chocolate Cake    6.0
15     Desserts             Brownie    6.0
17       Drinks        Orange Juice    3.0
18       Drinks            Lemonade    3.0
16       Drinks           Coca Cola    2.5
19       Drinks            Iced Tea    2.5
22  Side Dishes  Grilled Vegetables    5.0
25  Side Dishes         Onion Rings    5.0
21  Side Dishes     Mashed Potatoes    4.0
24  Side Dishes        Garlic Bread    4.0
2      Starters        Cheese Fries    5.0
3      Starters  Sweet Potato Fries    5.0
                                Price
Category    Item                     
Desserts    Brownie               469
            Cheesecake            485
            Chocolate Cake        798
            Fruit Salad           449
            Ice Cream             936
Drinks      Coca Cola             756
            Iced Tea              328
            Lemonade              479
            Orange Juice          591
            Water      
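
The cell above chooses, for each ambiguous (Category, Price) pair, the item that already appears most often. A possible way to apply that rule to every pair in one pass is sketched below. The order rows are made up, the helper column `Top Item` is hypothetical, and the most-frequent-item rule is an imputation assumption taken from the notes in the cell, not a fact about the missing rows.

```python
import pandas as pd

# Made-up orders; names and prices follow the menu map.
orders = pd.DataFrame({
    "Category": ["Desserts", "Desserts", "Desserts", "Desserts", "Drinks", "Drinks"],
    "Item": ["Chocolate Cake", "Chocolate Cake", "Brownie", None, "Water", None],
    "Price": [6.0, 6.0, 6.0, 6.0, 1.0, 1.0],
})

# Most frequent known item for every (Category, Price) pair.
known = orders.dropna(subset=["Item"])
top_item = (
    known.groupby(["Category", "Price"])["Item"]
    .agg(lambda items: items.value_counts().idxmax())
    .rename("Top Item")
    .reset_index()
)

# Attach the candidate item to each row, then fill only the missing ones.
filled = orders.merge(top_item, on=["Category", "Price"], how="left")
filled["Item"] = filled["Item"].fillna(filled["Top Item"])
filled = filled.drop(columns="Top Item")

print(filled)
```

For prices that belong to a single item, the most frequent item is simply that item, so the same code covers both the unambiguous and the ambiguous cases.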

# Analyzing Datasets
Analyze the data through aggregations and charts, and give short recommendations.
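
As a starting point, the aggregations for KPI 1 and KPI 3 could look like the sketch below. The rows are made up, `clean` is a hypothetical name for the cleaned dataset, and "most ordered" is assumed to mean the highest summed Quantity.

```python
import pandas as pd

# Made-up cleaned rows; column names follow the dataset.
clean = pd.DataFrame({
    "Category": ["Drinks", "Drinks", "Drinks", "Main Dishes", "Main Dishes"],
    "Item": ["Water", "Water", "Lemonade", "Steak", "Salmon"],
    "Quantity": [2.0, 3.0, 4.0, 1.0, 2.0],
    "Order Total": [2.0, 3.0, 12.0, 20.0, 36.0],
    "Order Date": ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-10", "2023-02-11"],
})

# KPI 1: item with the highest total quantity in each category.
qty = clean.groupby(["Category", "Item"], as_index=False)["Quantity"].sum()
top_items = qty.loc[qty.groupby("Category")["Quantity"].idxmax()]

# KPI 3: total sales per calendar month.
months = pd.to_datetime(clean["Order Date"]).dt.to_period("M")
monthly_sales = clean.groupby(months)["Order Total"].sum()

print(top_items)
print(monthly_sales)
```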

# Conclusions
Summarize the conclusions as actionable steps and make sure they answer the KPIs.