
# Task
Analyze tip percentages from the `tip.csv` dataset based on categorical features like 'sex', 'smoker', 'day', 'time', and 'size'.

## Load the dataset

### Subtask:
Load the `tip.csv` file into a pandas DataFrame and display the first few rows and basic information.


**Reasoning**:
Load the `tip.csv` file into a pandas DataFrame, then inspect the first few rows and the output of `df.info()` to confirm the column names, data types, and absence of missing values before computing anything.



In [1]:
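import pandas as pd

# Path assumes the file was uploaded to the Colab session's /content directory.
df = pd.read_csv('/content/tip.csv')

# Preview the first rows to confirm column names and value formats.
print("First 5 rows of the DataFrame:")
print(df.head())

# df.info() prints its report directly, so it needs no print() wrapper.
print("\nBasic information about the DataFrame:")
df.info()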

First 5 rows of the DataFrame:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Basic information about the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


## Calculate tip percentage

### Subtask:
Calculate the tip percentage by dividing 'tip' by 'total_bill', multiplying by 100, and storing the result in a new column.


**Reasoning**:
To calculate the tip percentage, I will divide the 'tip' column by the 'total_bill' column and multiply by 100, storing the result in a new column named 'tip_percentage'. Then, I will display the first few rows of the DataFrame to verify the new column.



In [2]:
df['tip_percentage'] = (df['tip'] / df['total_bill']) * 100

print("DataFrame with 'tip_percentage' column:")
print(df.head())

DataFrame with 'tip_percentage' column:
   total_bill   tip     sex smoker  day    time  size  tip_percentage
0       16.99  1.01  Female     No  Sun  Dinner     2        5.944673
1       10.34  1.66    Male     No  Sun  Dinner     3       16.054159
2       21.01  3.50    Male     No  Sun  Dinner     3       16.658734
3       23.68  3.31    Male     No  Sun  Dinner     2       13.978041
4       24.59  3.61  Female     No  Sun  Dinner     4       14.680765
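
As a quick sanity check on the formula, the first row can be reproduced by hand: 1.01 / 16.99 × 100 ≈ 5.94, which matches the output above. The sketch below wraps the same calculation in a small helper. The helper name `tip_percentage`, the zero-bill guard, and the third (zero-bill) row are assumptions added for illustration; the dataset shown here has no such row.

```python
import pandas as pd

def tip_percentage(tip: pd.Series, total_bill: pd.Series) -> pd.Series:
    """Tip as a percentage of the bill; NaN where the bill is zero (assumed policy)."""
    safe_bill = total_bill.where(total_bill != 0)  # zero bills become NaN
    return tip / safe_bill * 100

# First two rows copied from the output above, plus a hypothetical zero-bill row.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 0.0],
    "tip": [1.01, 1.66, 0.0],
})
sample["tip_percentage"] = tip_percentage(sample["tip"], sample["total_bill"])
print(sample.round(2))
```

Returning NaN for a zero bill keeps such rows out of later group means instead of letting an infinite value distort them.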


## Analyze tip percentage by categories

### Subtask:
Group the data by relevant categorical columns (e.g., 'sex', 'smoker', 'day', 'time', 'size') and calculate the mean tip percentage for each category.


**Reasoning**:
I will group the DataFrame by each specified categorical column ('sex', 'smoker', 'day', 'time', 'size') and calculate the mean 'tip_percentage' for each group, storing and printing the results.



In [3]:
tip_by_sex = df.groupby('sex')['tip_percentage'].mean()
tip_by_smoker = df.groupby('smoker')['tip_percentage'].mean()
tip_by_day = df.groupby('day')['tip_percentage'].mean()
tip_by_time = df.groupby('time')['tip_percentage'].mean()
tip_by_size = df.groupby('size')['tip_percentage'].mean()

print("Mean tip percentage by sex:\n", tip_by_sex)
print("\nMean tip percentage by smoker:\n", tip_by_smoker)
print("\nMean tip percentage by day:\n", tip_by_day)
print("\nMean tip percentage by time:\n", tip_by_time)
print("\nMean tip percentage by size:\n", tip_by_size)

Mean tip percentage by sex:
 sex
Female    16.649074
Male      15.765055
Name: tip_percentage, dtype: float64

Mean tip percentage by smoker:
 smoker
No     15.932846
Yes    16.319604
Name: tip_percentage, dtype: float64

Mean tip percentage by day:
 day
Fri     16.991303
Sat     15.315172
Sun     16.689729
Thur    16.127563
Name: tip_percentage, dtype: float64

Mean tip percentage by time:
 time
Dinner    15.951779
Lunch     16.412793
Name: tip_percentage, dtype: float64

Mean tip percentage by size:
 size
1    21.729202
2    16.571919
3    15.215685
4    14.594901
5    14.149549
6    15.622920
Name: tip_percentage, dtype: float64


## Display the results

### Subtask:
Present the aggregated tip percentages in a clear table format for each category.
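
One possible way to present the results is a single long table with one row per feature level, showing the mean alongside the number of bills behind it. The sketch below is an assumption about how this could be done, not the notebook's own cell: the helper `summarize_by_feature` and the column names `feature`, `level`, `mean_tip_pct`, and `n` are hypothetical. To stay self-contained it runs on the five rows printed earlier; on the full data, pass `df` instead of `sample`.

```python
import pandas as pd

# First five rows of the dataset, copied from the output shown earlier.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No"] * 5,
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})
sample["tip_percentage"] = sample["tip"] / sample["total_bill"] * 100

def summarize_by_feature(df, features, value="tip_percentage"):
    """Stack per-feature group means and counts into one long table."""
    parts = []
    for feature in features:
        stats = (
            df.groupby(feature)[value]
            .agg(mean_tip_pct="mean", n="count")  # named aggregation
            .rename_axis("level")
            .reset_index()
        )
        stats["level"] = stats["level"].astype(str)  # mix of text and integer levels
        stats.insert(0, "feature", feature)
        parts.append(stats)
    return pd.concat(parts, ignore_index=True)

summary = summarize_by_feature(sample, ["sex", "smoker", "day", "time", "size"])
print(summary.round(2).to_string(index=False))
```

Including the count column makes it visible when a mean rests on only a handful of bills, which matters when comparing groups of very different sizes.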


## Summary

### Q&A
*   **Which categorical features influence tip percentages most?**
    The `size` of the dining party shows the largest variation in mean tip percentage: parties of one average 21.73%, while parties of two to six average between 14.15% and 16.57%. Among days, Friday has the highest mean tip percentage (16.99%). These are differences in group means only; no significance test was run, so "influence" here is descriptive rather than causal.
*   **Are there noticeable differences in tipping behavior based on 'sex', 'smoker' status, 'day', or 'time'?**
    Yes, though the differences are small, all under 2 percentage points:
    *   **Sex**: Female customers tip slightly more on average (16.65%) than male customers (15.77%).
    *   **Smoker Status**: Smokers show a slightly higher mean tip percentage (16.32%) compared to non-smokers (15.93%).
    *   **Day**: Friday has the highest mean tip percentage (16.99%), while Saturday has the lowest (15.32%).
    *   **Time**: Lunch has a slightly higher mean tip percentage (16.41%) than Dinner (15.95%).
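
To make the comparison across features concrete, a simple descriptive measure is the spread between the highest and lowest group mean within each feature. The sketch below computes it from the group means printed in the notebook output. The helper name `spread` is an assumption, and the measure is deliberately crude: it ignores group sizes and within-group variance and is not a significance test.

```python
# Group means (in percent) copied from the notebook output above.
group_means = {
    "sex": {"Female": 16.649074, "Male": 15.765055},
    "smoker": {"No": 15.932846, "Yes": 16.319604},
    "day": {"Fri": 16.991303, "Sat": 15.315172, "Sun": 16.689729, "Thur": 16.127563},
    "time": {"Dinner": 15.951779, "Lunch": 16.412793},
    "size": {1: 21.729202, 2: 16.571919, 3: 15.215685,
             4: 14.594901, 5: 14.149549, 6: 15.622920},
}

def spread(means):
    """Difference between the largest and smallest group mean."""
    return max(means.values()) - min(means.values())

spreads = {feature: spread(means) for feature, means in group_means.items()}

# Rank features from largest to smallest spread, in percentage points.
ranked = sorted(spreads.items(), key=lambda kv: kv[1], reverse=True)
for feature, value in ranked:
    print(f"{feature:<8}{value:5.2f}")
```

On these numbers `size` has a spread of about 7.58 percentage points, followed by `day` (about 1.68), `sex` (0.88), `time` (0.46), and `smoker` (0.39).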

### Data Analysis Key Findings
*   The `tip_percentage` was calculated and added as a new column, indicating the tip as a percentage of the total bill.
*   **Mean tip percentage by sex**: Female customers had a slightly higher mean tip percentage (16.65%) compared to male customers (15.77%).
*   **Mean tip percentage by smoker status**: Smokers had a slightly higher mean tip percentage (16.32%) than non-smokers (15.93%).
*   **Mean tip percentage by day**: Friday showed the highest mean tip percentage (16.99%), followed by Sunday (16.69%), Thursday (16.13%), and Saturday (15.32%).
*   **Mean tip percentage by time**: Lunch had a slightly higher mean tip percentage (16.41%) compared to Dinner (15.95%).
*   **Mean tip percentage by size**: Parties of one had by far the highest mean tip percentage (21.73%). The mean fell with each increase in party size up to size 5 (14.15%), then rose to 15.62% for size 6.

### Insights or Next Steps
*   **Insight**: Group size shows the largest spread in mean tip percentage, driven mainly by single diners. Two possible explanations, neither tested here: single diners may feel more obliged to tip generously, or their smaller bills may inflate the percentage when a roughly fixed tip amount is left.
*   **Next Steps**: Investigate the relationship between `total_bill` and `tip_percentage` across `size` categories to check whether a smaller `total_bill` for single diners inflates their `tip_percentage`. Also explore interactions between features, e.g., how `day` and `time` jointly relate to tipping behavior.
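
A minimal sketch of how these follow-ups might look in pandas is given below. The helper name `interaction_table` is hypothetical. To stay self-contained the demo uses only the five rows printed at the top of the notebook, which are all Sunday dinners, so it crosses `sex` with `size` instead; on the full data the call would be `interaction_table(df, "day", "time")`.

```python
import pandas as pd

# First five rows of the dataset, copied from the output shown earlier.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})
sample["tip_percentage"] = sample["tip"] / sample["total_bill"] * 100

def interaction_table(df, rows, cols, value="tip_percentage"):
    """Mean of `value` for every combination of two categorical columns."""
    return df.pivot_table(index=rows, columns=cols, values=value, aggfunc="mean")

# Mean tip percentage for each sex x size combination; NaN marks empty cells.
table = interaction_table(sample, "sex", "size")
print(table.round(2))

# Mean bill next to mean tip percentage per party size, to see whether
# smaller bills coincide with higher percentages.
per_size = sample.groupby("size")[["total_bill", "tip_percentage"]].mean()
print(per_size.round(2))
```

Cells with no observations come back as NaN, which is itself useful: it shows which combinations the data cannot speak to.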
