
# Introduction

In this waiter tips prediction analysis, I work through a series of steps to uncover the patterns and factors that influence how much customers tip. First, I carry out exploratory data analysis (EDA), examining the relationship between the total bill and the tip, broken down by day, gender, and time of day. I then look at how the tips are distributed across day, gender, smoker status, and time. Next comes data preprocessing, which involves encoding the categorical features and checking for missing values. After that, I fit a linear regression model that predicts the tip from the remaining features, and evaluate it with Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, and Mean Absolute Error (MAE). The case study at the end applies the model to a single guest and gives some insight into the factors that can influence tipping in the service industry.

# Import Data

In [1]:
import pandas as pd

tips = pd.read_csv("/kaggle/input/waiter-tips-dataset-for-prediction/tips.csv")

In [2]:
tips.head()

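| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |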


In [3]:
tips.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [4]:
tips.shape

(244, 7)

Features explanation:

- **total_bill**: This feature records the total cost of a customer's bill at a restaurant. This is one of the most important features in the dataset, as the tip size is often closely related to the total cost paid by the customer.
- **tip**: This feature is the tip amount given by the customer to the waiter. It is the target variable, that is, the value the model is trained to predict.
- **sex**: This feature records the customer's gender. A customer's gender can potentially influence tipping rates.
- **smoker**: This feature records whether the customer is a smoker or not. Smoking habits can influence customer behavior in restaurants, including tipping rates.
- **day**: This feature records the day of the week when the transaction was made. In this dataset the values are "Thur", "Fri", "Sat", and "Sun". The day of the week can influence the level of visits to a restaurant and the size of tips given by customers.
- **time**: This feature records the time of day when the transaction was made, either "Lunch" or "Dinner". Meal times can influence crowd levels and service levels in restaurants.
- **size**: This feature records the size of the group of customers coming together. The size of the group can affect the total bill and also the size of the tip given.

# Exploratory Data Analysis (EDA)

#### Relationship between total_bill and tip based on day, gender, and time

In [5]:
import plotly.express as px

figure = px.scatter(data_frame=tips, x="total_bill", y="tip", size="size", color="day", trendline="ols")

# Add a title
figure.update_layout(
    title="Relationship between Total Bill and Tip by Day",
    xaxis_title="Total Bill",
    yaxis_title="Tip"
)

# Outline the markers. The marker size is left untouched so that size="size"
# keeps encoding the party size, and the selector targets the scatter traces
# (mode='markers') rather than the OLS trendline traces.
figure.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')),
                     selector=dict(mode='markers'))

# Show the plot
figure.show()

In [6]:
figure = px.scatter(data_frame=tips, x="total_bill", y="tip", size="size", color="sex", trendline="ols")

# Add a title
figure.update_layout(
    title="Relationship between Total Bill and Tip by Gender",
    xaxis_title="Total Bill",
    yaxis_title="Tip"
)

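# Outline the markers. The marker size is left untouched so that size="size"
# keeps encoding the party size, and the selector targets the scatter traces
# (mode='markers') rather than the OLS trendline traces.
figure.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')),
                     selector=dict(mode='markers'))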

# Show the plot
figure.show()

In [7]:
figure = px.scatter(data_frame=tips, x="total_bill", y="tip", size="size", color="time", trendline="ols")

# Add a title
figure.update_layout(
    title="Relationship between Total Bill and Tip by Time of Day",
    xaxis_title="Total Bill",
    yaxis_title="Tip"
)

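# Outline the markers. The marker size is left untouched so that size="size"
# keeps encoding the party size, and the selector targets the scatter traces
# (mode='markers') rather than the OLS trendline traces.
figure.update_traces(marker=dict(line=dict(width=2, color='DarkSlateGrey')),
                     selector=dict(mode='markers'))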

# Show the plot
figure.show()
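
The scatter plots show the relationship visually. A correlation coefficient puts a number on it. The following is a minimal sketch, assuming a small made-up sample with the same column names as `tips` (the values are illustrative, not taken from the dataset). It computes the Pearson correlation between `total_bill` and `tip` overall and within each `time` group.

```python
import pandas as pd

# Hypothetical mini-sample shaped like the tips data (illustrative values only)
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 12.0, 24.0, 36.0, 48.0],
    "tip":        [1.5, 3.0, 4.0, 6.5, 2.0, 3.0, 6.5, 7.0],
    "time":       ["Lunch"] * 4 + ["Dinner"] * 4,
})

# Pearson correlation between total_bill and tip over the whole sample
overall_corr = sample["total_bill"].corr(sample["tip"])

# The same correlation computed separately for each value of `time`
corr_by_time = {
    name: group["total_bill"].corr(group["tip"])
    for name, group in sample.groupby("time")
}

print(f"overall: {overall_corr:.3f}")
for name, value in corr_by_time.items():
    print(f"{name}: {value:.3f}")
```

Replacing `sample` with the real `tips` frame, and `"time"` with `"day"` or `"sex"`, would give the per-group figures that correspond to the three plots.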

#### Distribution of tips by day, gender, smoker, and time

In [8]:
figure = px.pie(tips, values='tip', names='day', hole=0.5)

# Add a title
figure.update_layout(
    title="Distribution of Tips by Day of the Week"
)

# Adjust the color palette
colors = px.colors.qualitative.Set3
figure.update_traces(marker=dict(colors=colors, line=dict(color='white', width=2)))

# Show the plot
figure.show()

In [9]:
figure = px.pie(tips, values='tip', names='sex', hole=0.5)

# Add a title
figure.update_layout(
    title="Distribution of Tips by Gender"
)

# Customize the color palette
colors = ['#1f77b4', '#ff7f0e']  # You can specify custom colors here
figure.update_traces(marker=dict(colors=colors, line=dict(color='white', width=2)))

# Show the plot
figure.show()

In [10]:
figure = px.pie(tips, values='tip', names='smoker', hole=0.5)

# Add a title
figure.update_layout(
    title="Distribution of Tips by Smoking Status"
)

# Customize the color palette
colors = px.colors.qualitative.Set2  # You can use a different color palette if desired
figure.update_traces(marker=dict(colors=colors, line=dict(color='white', width=2)))

# Show the plot
figure.show()

In [11]:
figure = px.pie(tips, values='tip', names='time', hole=0.5)

# Add a title
figure.update_layout(
    title="Distribution of Tips by Time of Day"
)

# Customize the color palette
colors = px.colors.qualitative.Set3  # You can use a different color palette if desired
figure.update_traces(marker=dict(colors=colors, line=dict(color='white', width=2)))

# Show the plot
figure.show()
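
As far as I understand `px.pie`, rows that share the same `names` value are grouped and their `values` are summed, so each slice is a category's share of the total tip amount rather than its average tip. A category with many visits can therefore dominate the chart even if its typical tip is modest. The sketch below, on a small made-up sample (not the real data), shows both views side by side.

```python
import pandas as pd

# Hypothetical mini-sample (illustrative values only)
sample = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "tip": [2.0, 3.0, 1.0, 4.0, 5.0, 5.0],
})

# What a pie slice represents: the category's share of the summed tips
tip_sum = sample.groupby("day")["tip"].sum()
tip_share = tip_sum / tip_sum.sum() * 100

# A different question: the average tip per visit in each category
tip_mean = sample.groupby("day")["tip"].mean()

summary = pd.DataFrame({"share_pct": tip_share, "mean_tip": tip_mean})
print(summary.round(2))
```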

# Data Preprocessing

#### Encoding features

In [12]:
tips['sex'] = tips['sex'].map({'Female':0, 'Male':1})
tips['smoker'] = tips['smoker'].map({'No':0, 'Yes':1})
tips['day'] = tips['day'].map({'Thur':0, 'Fri':1, 'Sat':2, 'Sun':3})
tips['time'] = tips['time'].map({'Lunch':0, 'Dinner':1})

In [13]:
tips.head()

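| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 3 | 1 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 3 | 1 | 3 |
| 2 | 21.01 | 3.50 | 1 | 0 | 3 | 1 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 3 | 1 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 3 | 1 | 4 |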
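
Two things are worth knowing about the `.map` encoding. Any label missing from the dictionary becomes `NaN`, and coding `day` as 0 to 3 asks a linear model to treat the days as evenly spaced steps along a single scale. One common alternative, shown here only as a sketch and not as the approach used in this notebook, is one-hot encoding with `pd.get_dummies`. The small frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the raw (unencoded) format; values are illustrative
raw = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "day": ["Sun", "Thur", "Fri", "Sat"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
})

# One indicator column per category; drop_first removes one level per feature
# to avoid perfectly collinear columns in a linear model
encoded = pd.get_dummies(raw, columns=["day", "time"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

With this encoding each remaining day gets its own coefficient, measured relative to the dropped level.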


#### Are there any missing values?

In [14]:
# Check for missing values
missing_values = tips.isnull().sum()

# Display columns with missing values and the count of missing values
missing_values = missing_values[missing_values > 0]

if not missing_values.empty:
    print("Columns with missing values:")
    for column, count in missing_values.items():
        print(f"{column}: {count} missing values")
else:
    print("There are no columns with missing values")

There are no columns with missing values


# Modelling

In [15]:
x = tips[['total_bill', 'sex', 'smoker', 'day', 'time', 'size']].values
y = tips['tip'].values

In [16]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
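
To make the split concrete: with 244 rows and `test_size=0.2`, scikit-learn rounds the test share up, which should give 49 test rows and 195 training rows. The sketch below checks this on placeholder arrays of the same shape (the numbers inside the arrays are dummies, not the tips data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the feature matrix and target
X_demo = np.arange(244 * 6).reshape(244, 6)
y_demo = np.arange(244)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.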

In [17]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
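
After fitting, the learned weights are available as `coef_` and `intercept_`. The sketch below shows how to pair them with feature names. It uses noise-free synthetic data with made-up coefficients (0.1 per unit of `total_bill`, 0.2 per person), so the numbers say nothing about the real tips model. They only demonstrate that the fitted values can be read back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["total_bill", "size"]

# Synthetic inputs (assumption: illustrative ranges, not the real data)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(5, 50, size=50),   # total_bill
    rng.integers(1, 7, size=50),   # size
])

# Noise-free synthetic target: tip = 0.5 + 0.1 * total_bill + 0.2 * size
y_demo = 0.5 + 0.1 * X_demo[:, 0] + 0.2 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with the feature it belongs to
coefficients = dict(zip(feature_names, demo_model.coef_))
print(coefficients, demo_model.intercept_)
```

Applying the same `dict(zip(...))` pattern to `model.coef_` and the six feature names used above would show how much the predicted tip changes per unit of each feature.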

# Evaluation

In [18]:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate model predictions on the test data
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE), expressed in squared tip units
mse = mean_squared_error(y_test, y_pred)

# Calculate Root Mean Squared Error (RMSE), expressed in the same units as the tip
rmse = np.sqrt(mse)

# Calculate R-squared (R^2), the unitless share of variance explained by the model
r_squared = r2_score(y_test, y_pred)

# Calculate Mean Absolute Error (MAE), expressed in the same units as the tip
mae = mean_absolute_error(y_test, y_pred)

# Print the metrics in their natural units (none of them is a percentage)
print("Mean Squared Error (MSE): {:.3f}".format(mse))
print("Root Mean Squared Error (RMSE): {:.3f}".format(rmse))
print("R-squared (R^2): {:.3f}".format(r_squared))
print("Mean Absolute Error (MAE): {:.3f}".format(mae))


Mean Squared Error (MSE): 0.696
Root Mean Squared Error (RMSE): 0.834
R-squared (R^2): 0.443
Mean Absolute Error (MAE): 0.669
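
For reference, the four metrics can be computed by hand. MSE is the mean of the squared errors (in squared tip units), RMSE is its square root (in tip units), MAE is the mean absolute error (in tip units), and R² is one minus the ratio of the residual sum of squares to the total sum of squares (unitless). A minimal sketch on four made-up values:

```python
import numpy as np

# Made-up true and predicted tips (illustrative values only)
y_true = np.array([3.0, 2.0, 4.0, 5.0])
y_hat = np.array([2.5, 2.0, 5.0, 4.0])

errors = y_true - y_hat                         # [0.5, 0.0, -1.0, 1.0]

mse = np.mean(errors ** 2)                      # 2.25 / 4 = 0.5625
rmse = np.sqrt(mse)                             # 0.75
mae = np.mean(np.abs(errors))                   # 2.5 / 4 = 0.625

ss_res = np.sum(errors ** 2)                    # 2.25
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 5.0
r2 = 1 - ss_res / ss_tot                        # 0.55

print(mse, rmse, mae, r2)
```

Note that RMSE is the square root of the unscaled MSE. Scaling the MSE before taking the root changes the result.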


# Case Study

A guest named John Doe came to the restaurant with the following details:
the guest is female, is not a smoker, came on a Friday at dinner time in a party of 3, and paid a bill of 40.5. How large a tip does the model predict for the waiter?

In [19]:
# Feature order: total_bill, sex, smoker, day, time, size
features = np.array([[40.5, 0, 0, 1, 1, 3]])
result = model.predict(features)
output_text = f"Waiters received a tip of {result[0]:.2f}"
output_text

'Waiters received a tip of 5.06'
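
Typing the encoded numbers by hand is easy to get wrong. A small helper can translate a guest's raw details into the feature row using the same mappings as the encoding step. The function name `encode_guest` and the constant names are hypothetical, introduced here only for illustration.

```python
# Mappings copied from the encoding step in the preprocessing section
SEX = {'Female': 0, 'Male': 1}
SMOKER = {'No': 0, 'Yes': 1}
DAY = {'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3}
TIME = {'Lunch': 0, 'Dinner': 1}


def encode_guest(total_bill, sex, smoker, day, time, size):
    """Return one feature row in the order the model was trained on.

    Raises KeyError if a label is not one of the known categories.
    """
    return [[total_bill, SEX[sex], SMOKER[smoker], DAY[day], TIME[time], size]]


features = encode_guest(40.5, 'Female', 'No', 'Fri', 'Dinner', 3)
print(features)  # [[40.5, 0, 0, 1, 1, 3]]
```

The resulting list can be passed to `model.predict` in place of the hand-written array.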