# Final Project Group 3

In [1]:
library(dplyr)
library(tidyverse)
library(ggplot2)
library(patchwork)
library(leaps)
library(car)



## (1) Introduction
Start with relevant background information on the topic to prepare those unfamiliar for the rest of your proposal.

Formulate one or two questions for investigation and detail the dataset that will be utilized to address these questions.

Additionally, align your question/objectives with the existing literature. To contextualize your study, include a minimum of two scientific publications (these should be listed in the References section).

Tipping is a social norm in North America, and many studies have examined the factors that affect tipping behavior. Lynn and McCall (2000) investigated how service quality ratings relate to tip amount and, as expected, found a positive correlation. However, service quality is not the whole picture: they point out that other factors, such as customer mood, also contribute to variation in tip amount. Data from the US (Peck and Deehan, 2024) show that tips make up, on average, 22.6% of a restaurant worker's income. It is therefore important to understand what affects tipping behavior, so that workers can make changes and potentially earn more.

We use the `tips.csv` dataset from https://www.kaggle.com/datasets/saurabhbadole/restaurant-tips-dataset, which contains the following variables:

1. `total_bill`: Total bill amount in dollars. _numerical_
2. `tip`: Tip amount in dollars. _numerical_
3. `sex`: Gender of the customer paying. _Male or Female, binary_
4. `smoker`: Whether the customer paying is a smoker. _Yes or No, binary_
5. `day`: Day of the week of the transaction. _Thur/Fri/Sat/Sun, categorical_
6. `time`: Time of day of the transaction. _Lunch/Dinner, binary_
7. `size`: Size of the dining party. _numerical_

### Research Question:

Which of the following factors are associated with tip percentage: total bill amount, sex of the payer, smoking status, day of the week, time of day, and party size?

## (2) Methods and Results
In this section, you will include:

a) “Exploratory Data Analysis (EDA)”

Demonstrate that the dataset can be read into R.
Clean and wrangle your data into a tidy format.
Plot the relevant raw data, tailoring your plot to address your question.
Make sure to explore the association of the explanatory variables with the response.
Any summary tables that are relevant to your analysis.
Be sure not to print output that takes up a lot of screen space.
Your EDA must be comprehensive with high quality plots.



### a) EDA

In [24]:
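# Main developer: Dominique
# Contributor: Tara (created git link)
url <- "https://raw.githubusercontent.com/tarauboviccc/stat301_project/main/tips.csv"
dt_tip <- read.csv(url)

# Convert categorical variables to factors and add the response:
# tip_pct is the tip as a proportion of the total bill.
tip <- dt_tip |>
  mutate(size = factor(size),
         sex = as.factor(sex),
         smoker = as.factor(smoker),
         day = factor(day, levels = c("Thur", "Fri", "Sat", "Sun")),
         time = as.factor(time),
         tip_pct = tip / total_bill)

# Number of observations for each party size
tip |> count(size)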

| size (fct) | n (int) |
|---|---|
| 1 | 4 |
| 2 | 156 |
| 3 | 38 |
| 4 | 37 |
| 5 | 5 |
| 6 | 4 |


* Because several party sizes have very few observations (only 4, 5, and 4 for sizes 1, 5, and 6), we collapse `size` into two levels: `2-` (one or two people) and `2+` (three or more people).

In [25]:
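# Main contributor: Dominique
# Collapse the six party sizes into two levels: "2-" (1 or 2 people) and "2+" (3 to 6 people)
tip <- tip %>%
    mutate(size = fct_collapse(size, '2-' = c('1', '2'), '2+' = c('3', '4', '5', '6')))

# Check the group sizes after collapsing
tip %>% count(size, sort = TRUE)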

| size (fct) | n (int) |
|---|---|
| 2- | 160 |
| 2+ | 84 |
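
With `size` collapsed, a compact summary table can complement the plots. The sketch below is one possible way to tabulate the count, mean, and median of `tip_pct` within each level of a categorical predictor. The helper name `summarise_tip_pct()` and the four toy rows are hypothetical and exist only so the block runs on its own; in the notebook, the `tip` data frame would be passed instead.

```r
library(dplyr)

# Hypothetical helper: count, mean, and median of tip_pct
# for each level of one grouping column.
summarise_tip_pct <- function(data, group_var) {
  data |>
    group_by({{ group_var }}) |>
    summarise(
      n = n(),
      mean_tip_pct = mean(tip_pct),
      median_tip_pct = median(tip_pct),
      .groups = "drop"
    )
}

# Toy rows invented for illustration only (not taken from tips.csv).
toy <- data.frame(
  size = factor(c("2-", "2-", "2+", "2+"), levels = c("2-", "2+")),
  tip_pct = c(0.10, 0.20, 0.15, 0.25)
)

toy_summary <- summarise_tip_pct(toy, size)
toy_summary
```

The same call could be repeated for `sex`, `smoker`, `day`, and `time`, for example `summarise_tip_pct(tip, smoker)`, to keep the printed output short while still covering every categorical predictor.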


In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)

tip_explore_plot <- 
    tip |>
    ggplot(aes(x = day, y = tip_pct, color = time))+
    geom_point()+
    stat_summary(fun = mean, geom = "crossbar", width = 0.5, color = "black")+
    scale_color_manual(values = c("Lunch" = "lightgreen", "Dinner" = "darkgreen")) +
    labs(title="Tip Percent by Day of Week and Time of Day",
         x="Day of the week",
         y="Tip Percent")

tip_box_plot <- tip |>
  ggplot(aes(x = smoker, y = tip_pct, fill = smoker)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3, color = "black") +
  scale_fill_manual(values = c("Yes" = "lightgray", "No" = "pink")) +
  labs(title = "Tip Percent by Smoking Status",
       x = "Smoker",
       y = "Tip Percent")

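curve_plot <- tip |>
  ggplot(aes(x = total_bill, y = tip_pct)) +
  geom_point(aes(size = size, alpha = size), color = "pink") +
  geom_smooth(method = "loess", se = FALSE, color = "lightgreen") +
  # `size` is a two-level factor, so manual (discrete) scales are used;
  # continuous scales would fail with "Discrete value supplied to continuous scale"
  scale_alpha_manual(values = c("2-" = 0.2, "2+" = 0.5)) +
  scale_size_manual(values = c("2-" = 2, "2+" = 4)) +
  labs(title = "Tip Percent vs Total Bill with Party Size",
       x = "Total Bill",
       y = "Tip Percent",
       size = "Party size",
       alpha = "Party size")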

gender_violin_plot <- tip |>
  ggplot(aes(x = sex, y = tip_pct, fill = sex)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  scale_fill_manual(values = c("Female" = "salmon", "Male" = "lightblue"))+
  labs(title = "Tip Percent Distribution by Sex",
       x = "Sex",
       y = "Tip Percent")
       

combined_plot <-  (tip_explore_plot + tip_box_plot) / (curve_plot + gender_violin_plot)
combined_plot

`geom_smooth()` using formula = 'y ~ x'


b) “Methods: Plan”

Describe in written English the methods you used to perform your analysis from beginning to end, and narrate the code that does the analysis.
If included, describe the “Feature Selection” process and how and why you choose the covariates of your final model.
Make sure to interpret/explain the results you obtain. It’s not enough to just say, “I fitted a linear model with these covariates, and my R-square is 0.87”.
If inference is the aim of your project, a detailed interpretation of your fitted model is required, as well as a discussion of relevant quantities (e.g., are the coefficients significant? How does the model fit the data)?
A careful model assessment must be conducted.
If prediction is the project's aim, describe the test data used or how it was created.
Ensure your tables and/or figures are labelled with a figure/table number.
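
The feature-selection step described above could be sketched as follows. This is only an illustration under stated assumptions: the notebook does not yet specify a method, so best-subset selection with `regsubsets()` is assumed here because the `leaps` package is loaded at the top. The data frame `sim` is simulated stand-in data (not `tips.csv`), its column names mirror the project's variables, and `day` is left out to keep the example short.

```r
library(leaps)

# Simulated stand-in data (NOT tips.csv); values are invented for illustration.
set.seed(301)
n <- 120
sim <- data.frame(
  total_bill = runif(n, 5, 50),
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  smoker = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  time = factor(sample(c("Dinner", "Lunch"), n, replace = TRUE)),
  size = factor(sample(c("2-", "2+"), n, replace = TRUE), levels = c("2-", "2+"))
)
sim$tip_pct <- 0.20 - 0.001 * sim$total_bill + rnorm(n, sd = 0.03)

# Exhaustive best-subset search over the candidate covariates.
subsets <- regsubsets(tip_pct ~ total_bill + sex + smoker + time + size,
                      data = sim, nvmax = 5, method = "exhaustive")
subset_summary <- summary(subsets)

# Choose the subset size with the highest adjusted R-squared
# and list the terms included in that model.
best_size <- which.max(subset_summary$adjr2)
best_terms <- names(which(subset_summary$which[best_size, ]))
best_terms
```

Adjusted R-squared is used here only as one possible criterion; `subset_summary$bic` and `subset_summary$cp` are available from the same summary object if a different criterion is preferred. Because coefficients and p-values from a model chosen this way are computed on the same data used for selection, a careful inference-focused analysis would typically select on one part of the data and fit on another.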

## (3) Discussion
In this section, you’ll interpret the results you obtained in the previous section with respect to the main question/goal of your project.

Summarize what you found and the implications/impact of your findings.
If relevant, discuss whether your results were what you expected to find.
Discuss how your model could be improved;
Discuss future questions/research this study could lead to.

## (4) References
At least two citations of literature relevant to the project. The citation format is your choice – just be consistent. Make sure to cite the source of your data as well.
*I'm using MLA -Dominique

Lynn, M., & McCall, M. (2000). Gratitude and gratuity: A meta-analysis of research on the service–tipping relationship. The Journal of Socio-Economics, 29(2), 203–214. https://doi.org/10.1016/S1053-5357(00)00062-7

Peck, E., & Deehan, M. (2024, November 18). Tips make up large share of Mass. restaurant workers' pay. Axios Boston. https://www.axios.com/local/boston/2024/11/18/tips-restaurant-workers-pay-massachusetts