# Oktoberfest Analysis

In this notebook, I conduct an analysis of Oktoberfest-related data to gain insights into various aspects of this renowned beer festival.

Oktoberfest, held annually in Munich, Germany, is the world's largest and most famous beer festival. It's a celebration of Bavarian culture, featuring a wide array of traditional foods, music, and, of course, a diverse selection of beers.

## Dataset Overview

I begin by working with a dataset containing information about Oktoberfest, including details such as beer prices, consumption, roast chicken prices, and guest numbers over several years. This dataset allows me to explore various facets of the festival, from beer trends to visitor statistics.

- **Dataset Source**: [Oktoberfest Dataset on Kaggle](https://www.kaggle.com/datasets/lucafrance/oktoberfest)

## Analysis Highlights

Throughout this notebook, I'll be conducting various analyses and visualizations, including:

- Time series analysis of different variables over the years.
- Regression analysis to understand relationships between variables.
- Clustering to identify patterns and groups within the data.
- Hypothesis testing to compare festival years.
- Anomaly detection to identify unusual observations.
- Exploratory data analysis to uncover insights.

By the end of this analysis, I aim to provide a comprehensive overview of Oktoberfest and its associated data, shedding light on trends, correlations, and interesting observations. Let's dive into the exciting world of Oktoberfest data analysis!


## Library Imports

I start by importing the libraries needed for data analysis, visualization, and statistical modeling:

In [1]:
# Data manipulation and analysis libraries
import pandas as pd
import numpy as np

# Data visualization libraries
import plotly.express as px
import plotly.subplots as sp
import plotly.graph_objects as go

# Machine learning and statistics libraries
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, silhouette_score
from sklearn.cluster import KMeans
import statsmodels.api as sm
import scipy.stats as stats

## Data Loading and Initial Exploration

Next, I load the dataset and take a first look at its structure. I read `oktoberfest.csv` into a pandas DataFrame and display all 35 rows (only the first ten are reproduced below):

In [2]:
df = pd.read_csv("oktoberfest.csv")
df.head(35)

Unnamed: 0,year,duration,guests_total,guests_daily,beer_price,beer_consumption,roast_chicken_price,roast_chicken_consumption
0,1985,16,7.1,444,3.2,54541,4.77,629520
1,1986,16,6.7,419,3.3,53807,3.92,698137
2,1987,16,6.5,406,3.37,51842,3.98,732859
3,1988,16,5.7,356,3.45,50951,4.19,720139
4,1989,16,6.2,388,3.6,51241,4.22,775674
5,1990,16,6.7,419,3.77,54300,4.47,750947
6,1991,16,6.4,400,4.21,54686,4.81,807710
7,1992,16,5.9,369,4.42,48888,5.11,725612
8,1993,16,6.5,406,4.71,51933,5.25,733517
9,1994,16,6.6,413,4.89,52108,5.39,663135
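As a quick sanity check on the columns, the sketch below tests whether `guests_daily` is simply `guests_total` divided by `duration`. It assumes that `guests_total` is recorded in millions and `guests_daily` in thousands; the source does not state the units, so treat that as a working assumption. The three rows are copied from the table above.

```python
import pandas as pd

# First three rows copied from the table above.
sample = pd.DataFrame({
    "year": [1985, 1986, 1987],
    "duration": [16, 16, 16],
    "guests_total": [7.1, 6.7, 6.5],   # assumed unit: millions of guests
    "guests_daily": [444, 419, 406],   # assumed unit: thousands of guests
})

# Average daily guests in thousands, derived from the total and the duration.
sample["derived_daily"] = sample["guests_total"] / sample["duration"] * 1000

# Absolute gap between the derived and the recorded daily figure.
sample["gap"] = (sample["derived_daily"] - sample["guests_daily"]).abs()

print(sample[["year", "guests_daily", "derived_daily", "gap"]])
```

If the gap stays below one (thousand) for every row, the two guest columns carry essentially the same information, which is worth keeping in mind when reading their correlation later on.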


## Time Series Analysis

In this section, I create subplots to visualize time series data for various variables. The plots provide insights into the trends and patterns of the data over time.

### Beer & Roast Chicken Price and Consumption

I create subplots to analyze price-related and consumption-related variables.

In [3]:
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=("Beer Price", "Beer Consumption", "Roast Chicken Price", "Roast Chicken Consumption"))

# Add traces for beer price
fig.add_trace(go.Scatter(x=df['year'], y=df['beer_price'], mode='lines', name='Beer Price'), row=1, col=1)
fig.update_xaxes(title_text='Year', row=1, col=1)
fig.update_yaxes(title_text='Price', row=1, col=1)

# Add traces for beer consumption
fig.add_trace(go.Scatter(x=df['year'], y=df['beer_consumption'], mode='lines', name='Beer Consumption'), row=1, col=2)
fig.update_xaxes(title_text='Year', row=1, col=2)
fig.update_yaxes(title_text='Consumption', row=1, col=2)

# Add traces for roast chicken price
fig.add_trace(go.Scatter(x=df['year'], y=df['roast_chicken_price'], mode='lines', name='Roast Chicken Price'), row=2, col=1)
fig.update_xaxes(title_text='Year', row=2, col=1)
fig.update_yaxes(title_text='Price', row=2, col=1)

# Add traces for roast chicken consumption
fig.add_trace(go.Scatter(x=df['year'], y=df['roast_chicken_consumption'], mode='lines', name='Roast Chicken Consumption'), row=2, col=2)
fig.update_xaxes(title_text='Year', row=2, col=2)
fig.update_yaxes(title_text='Consumption', row=2, col=2)

fig.update_layout(title='Time Series Analysis of Different Variables',template='plotly_dark')
fig.show()

![Plot](plots/1.png)

In conclusion:

1. **Beer Price Trend**: The price of beer has been consistently rising over the years. This could be due to factors like inflation or changes in consumer preferences.

2. **Beer Consumption Trend**: Beer consumption has also been increasing. This might suggest a growing demand for beer over time.

3. **Roast Chicken Price Trend**: The price of roast chicken has followed a similar upward trend to beer. Again, this could be due to inflation or changes in the poultry market.

4. **Roast Chicken Consumption Trend**: Unlike the other three series, roast chicken consumption has not increased over time. It has declined (its correlation with year is about -0.83, as computed in the Correlation Analysis section). One possible explanation is that rising prices have dampened demand, although the data alone cannot establish the cause.
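To put a number on the price trend from item 1, one option is the compound annual growth rate. The sketch below uses only the 1985 and 1994 beer prices shown in the table earlier; the variable names are mine, and the same formula could be applied to the full period once the complete data is loaded.

```python
# Beer prices copied from the first and last rows of the table shown earlier.
start_year, start_price = 1985, 3.20
end_year, end_price = 1994, 4.89

# Compound annual growth rate: the constant yearly rate that turns
# start_price into end_price over the given number of years.
periods = end_year - start_year
cagr = (end_price / start_price) ** (1 / periods) - 1

print(f"Average annual beer price growth {start_year}-{end_year}: {cagr:.2%}")
```

Comparing this rate with a general inflation figure for the same years would be one way to check the inflation explanation suggested above.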

### Total & Daily Number of Guests

In [4]:
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=("Total Guests", "Daily Guests"))

# Add traces for total guests
fig.add_trace(go.Scatter(x=df['year'], y=df['guests_total'], mode='lines', name='Total Guests'), row=1, col=1)
fig.update_xaxes(title_text='Year', row=1, col=1)
fig.update_yaxes(title_text='Number of Guests', row=1, col=1)

# Add traces for daily guests
fig.add_trace(go.Scatter(x=df['year'], y=df['guests_daily'], mode='lines', name='Daily Guests'), row=1, col=2)
fig.update_xaxes(title_text='Year', row=1, col=2)
fig.update_yaxes(title_text=' ', row=1, col=2)

fig.update_layout(title='Time Series Analysis of Different Variables',template='plotly_dark')
fig.show()

![Plot](plots/2.png)

In [5]:
# Create a line plot for beer price over time
fig = px.line(df, x='year', y='beer_price', title='Beer Price and Roast Chicken Price Over Time')
fig.update_xaxes(title='Year')
fig.update_yaxes(title='Price')

# Add a line plot for roast chicken price
fig.add_trace(
    go.Scatter(
        x=df['year'],
        y=df['roast_chicken_price'],
        mode='lines',
        name='Roast Chicken Price',
        line=dict(color='orange', width=2),
        legendgroup='Roast Chicken Price',
        showlegend=True
    )
)

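# px.line leaves a single-series trace unnamed and hidden from the legend, so name it explicitly
fig.data[0].update(name='Beer Price', showlegend=True, line=dict(color='blue', width=2))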
fig.update_layout(
    legend_title_text='Price Type',template='plotly_dark',
    legend=dict(
        orientation="h",xanchor="center",x=0.5,yanchor="top",y=1.05))
fig.show()

![Plot](plots/3.png)

Based on the plot, I observed that the "Roast Chicken Price" has consistently been higher than the "Beer Price," except for the year **1999**.

In [6]:
# Group the data by year and calculate the sum of beer consumption and roast chicken consumption for each year
yearly_consumption = df.groupby('year')[['beer_consumption', 'roast_chicken_consumption']].sum().reset_index()

# Define custom colors for each chart
beer_color = "#FFA726"
roast_chicken_color = px.colors.qualitative.Plotly[1]

# Create a bar chart for Beer Consumption with the custom color
fig1 = px.bar(yearly_consumption, x='year', y='beer_consumption',
              labels={'year': 'Year', 'beer_consumption': 'Beer Consumption'},
              title='Beer Consumption by Year',
              color_discrete_sequence=[beer_color])

# Create a bar chart for Roast Chicken Consumption with the second color of Plotly's qualitative palette
fig2 = px.bar(yearly_consumption, x='year', y='roast_chicken_consumption',
              labels={'year': 'Year', 'roast_chicken_consumption': 'Roast Chicken Consumption'},
              title='Roast Chicken Consumption by Year',
              color_discrete_sequence=[roast_chicken_color])

fig1.update_layout(template='plotly_dark')
fig2.update_layout(template='plotly_dark')
fig1.show()
fig2.show()

![Plot](plots/4.png)
![Plot](plots/5.png)

In [7]:
# Create histograms for guests_total and guests_daily
fig1 = px.histogram(df, x='guests_total',
                    labels={'guests_total': 'Total Guests'},
                    title='Histogram: Distribution of Total Guests')

fig2 = px.histogram(df, x='guests_daily',
                    labels={'guests_daily': 'Daily Guests'},
                    title='Histogram: Distribution of Daily Guests',
                    color_discrete_sequence=['#ff7f0e'])  # Custom color for Daily Guests histogram

fig1.update_layout(template='plotly_dark')
fig2.update_layout(template='plotly_dark')
fig1.show()
fig2.show()

![Plot](plots/6.png)
![Plot](plots/7.png)

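Both histograms look roughly bell-shaped, which is consistent with an approximately **Normal distribution**, although with only 35 observations a visual impression is weak evidence.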
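A formal check can complement the visual impression. The sketch below applies the Shapiro-Wilk test from `scipy.stats` to the ten `guests_total` values shown in the table earlier; it is an illustration on a subset, not a result for the full dataset.

```python
from scipy import stats

# guests_total for 1985-1994, copied from the table shown earlier.
guests_total_sample = [7.1, 6.7, 6.5, 5.7, 6.2, 6.7, 6.4, 5.9, 6.5, 6.6]

# Shapiro-Wilk tests the null hypothesis that the sample comes from a Normal distribution.
statistic, p_value = stats.shapiro(guests_total_sample)
print(f"Shapiro-Wilk W = {statistic:.3f}, p = {p_value:.3f}")

if p_value < 0.05:
    print("Normality is rejected at the 5% level.")
else:
    print("No evidence against normality at the 5% level.")
```

Failing to reject the null hypothesis does not prove normality, especially with so few observations; it only means the data is compatible with it.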

## Correlation Analysis

I perform correlation analysis to examine the relationships between different variables in the dataset.

### - Correlation Matrix

I calculate the correlation matrix for the dataset, which provides information about the strength and direction of linear relationships between pairs of variables. Positive values indicate positive correlations, while negative values indicate negative correlations.

In [8]:
correlation_matrix = df.corr()
correlation_matrix

Unnamed: 0,year,duration,guests_total,guests_daily,beer_price,beer_consumption,roast_chicken_price,roast_chicken_consumption
year,1.0,0.313928,-0.319059,-0.46066,0.994775,0.899763,0.979029,-0.825412
duration,0.313928,1.0,0.128087,-0.423373,0.296442,0.361087,0.331703,-0.2142
guests_total,-0.319059,0.128087,1.0,0.843802,-0.299223,-0.033274,-0.318394,0.43966
guests_daily,-0.46066,-0.423373,0.843802,1.0,-0.432479,-0.226676,-0.469919,0.515045
beer_price,0.994775,0.296442,-0.299223,-0.432479,1.0,0.910589,0.982797,-0.809462
beer_consumption,0.899763,0.361087,-0.033274,-0.226676,0.910589,1.0,0.901436,-0.622958
roast_chicken_price,0.979029,0.331703,-0.318394,-0.469919,0.982797,0.901436,1.0,-0.84615
roast_chicken_consumption,-0.825412,-0.2142,0.43966,0.515045,-0.809462,-0.622958,-0.84615,1.0


### - Correlation Matrix Heatmap

To visualize the correlation matrix, I create a heatmap, which makes it easier to spot patterns and relationships between variables. With the `blues` color scale used here, darker shades indicate higher (more strongly positive) correlation values, while lighter shades indicate low or negative values.

This analysis helps us gain insights into how variables are related to each other within the dataset.

In [9]:
# Create a heatmap
fig = px.imshow(correlation_matrix, color_continuous_scale='blues')
fig.update_layout(title='Correlation Matrix Heatmap', template='plotly_dark')
fig.show()

![Plot](plots/8.png)

The correlation matrix shows the correlation coefficients between various pairs of variables. Here are some conclusions from the correlation matrix:

1. **Year vs. Other Variables:**
   - Year has a strong positive correlation with beer_price (0.99), roast_chicken_price (0.98), and beer_consumption (0.90). This suggests that these variables have been increasing over the years.
   - Year has a strong negative correlation with roast_chicken_consumption (about -0.83). This indicates that roast chicken consumption has been decreasing over the years.

2. **Duration vs. Other Variables:**
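   - Duration is only weakly to moderately correlated with the other variables, and the sign varies. It is positively correlated with beer_consumption (0.36), roast_chicken_price (0.33), year (0.31), and beer_price (0.30), and negatively correlated with guests_daily (-0.42) and roast_chicken_consumption (-0.21).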

3. **Guests Total vs. Other Variables:**
   - Guests_total has a strong positive correlation with guests_daily (approximately 0.844), indicating that as the total number of guests increases, the daily number of guests also tends to increase.

4. **Guests Daily vs. Other Variables:**
   - Guests_daily has a strong positive correlation with guests_total (approximately 0.844) and a moderate positive correlation with roast_chicken_consumption (0.52). Its correlation with beer_consumption is weakly negative (-0.23). This suggests that higher daily guest counts go along with higher total guest counts and higher roast chicken consumption, but not with higher beer consumption.

5. **Beer Price vs. Other Variables:**
   - Beer_price has a strong positive correlation with year, beer_consumption, and roast_chicken_price. This indicates that as the year progresses, beer prices tend to increase, and as beer consumption and roast chicken prices increase, beer prices also tend to rise.

6. **Beer Consumption vs. Other Variables:**
   - Beer_consumption has a strong positive correlation with year, beer_price, and roast_chicken_price. This suggests that as the year progresses, beer consumption tends to increase, and it is also positively associated with higher beer and roast chicken prices.

7. **Roast Chicken Price vs. Other Variables:**
   - Roast_chicken_price has strong positive correlations with year (0.98) and beer_price (0.98), and a strong negative correlation with roast_chicken_consumption (-0.85). This indicates that roast chicken prices tend to increase over the years alongside beer prices, while roast chicken consumption tends to be lower when prices are higher.

8. **Roast Chicken Consumption vs. Other Variables:**
   - Roast_chicken_consumption has a strong negative correlation with year and a moderate positive correlation with guests_daily. This suggests that as the year progresses, roast chicken consumption tends to decrease, and it is positively associated with higher daily guest counts.
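Several of the strong correlations above involve variables that all trend with `year`, so part of the association may simply reflect the shared time trend. One common way to probe this is to correlate year-over-year changes instead of levels. The sketch below does that for the ten rows shown in the table earlier; it illustrates the idea and is not a result for the full dataset.

```python
import pandas as pd

# Beer price and consumption for 1985-1994, copied from the table shown earlier.
sample = pd.DataFrame({
    "beer_price": [3.2, 3.3, 3.37, 3.45, 3.6, 3.77, 4.21, 4.42, 4.71, 4.89],
    "beer_consumption": [54541, 53807, 51842, 50951, 51241,
                         54300, 54686, 48888, 51933, 52108],
})

# Correlation between the raw levels of the two series.
level_corr = sample["beer_price"].corr(sample["beer_consumption"])

# Correlation between year-over-year changes, which removes a shared linear trend.
changes = sample.diff().dropna()
diff_corr = changes["beer_price"].corr(changes["beer_consumption"])

print(f"Correlation of levels:  {level_corr:.3f}")
print(f"Correlation of changes: {diff_corr:.3f}")
```

If the correlation of changes is much weaker than the correlation of levels, the relationship is more likely driven by the common trend than by a direct link between the two variables.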

## Scatter Plots and Regression Analysis

I create scatter plots to visualize the relationships between two sets of variables and calculate regression lines to analyze their linear associations.

### - Beer Price vs. Beer Consumption

I create a scatter plot to explore the relationship between Beer Price and Beer Consumption:

- **X-Axis**: Beer Price
- **Y-Axis**: Beer Consumption

### - Roast Chicken Price vs. Roast Chicken Consumption

I create a similar scatter plot to explore the relationship between Roast Chicken Price and Roast Chicken Consumption:

- **X-Axis**: Roast Chicken Price
- **Y-Axis**: Roast Chicken Consumption

### - Visualizations

Let's visualize these relationships:

- Beer Price vs. Beer Consumption Scatter Plot
- Roast Chicken Price vs. Roast Chicken Consumption Scatter Plot

These visualizations help us understand the linear relationships between price and consumption for beer and roast chicken.

In [10]:
# Create a scatter plot for Beer Price vs. Beer Consumption
fig1 = px.scatter(df, x='beer_price', y='beer_consumption', 
                  title='Beer Price vs. Beer Consumption',
                  labels={'beer_price': 'Beer Price', 'beer_consumption': 'Beer Consumption'})

# Create a scatter plot for Roast Chicken Price vs. Roast Chicken Consumption
fig2 = px.scatter(df, x='roast_chicken_price', y='roast_chicken_consumption', 
                  title='Roast Chicken Price vs. Roast Chicken Consumption',
                  labels={'roast_chicken_price': 'Roast Chicken Price', 'roast_chicken_consumption': 'Roast Chicken Consumption'})

fig1.update_layout(template='plotly_dark')
fig2.update_layout(template='plotly_dark')
fig1.show()
fig2.show()

![Plot](plots/9.png)
![Plot](plots/10.png)

## Linear Regression Analysis for Beer

In this section, I perform linear regression analysis to examine the relationship between Beer Price and Beer Consumption.

### - Data Preparation

I define the independent variable (X) as Beer Price and the dependent variable (y) as Beer Consumption.

### - Linear Regression Modeling

I fit a linear regression model to the data using the least squares method. This model helps us understand how Beer Price and Beer Consumption are related.
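For a single predictor, the least squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below checks that `np.polyfit` with degree 1 returns the same values, in the order slope first and intercept second. It uses the ten rows shown in the table earlier, so the numbers differ from the full-data fit.

```python
import numpy as np

# Beer price (x) and beer consumption (y) for 1985-1994, from the table shown earlier.
x = np.array([3.2, 3.3, 3.37, 3.45, 3.6, 3.77, 4.21, 4.42, 4.71, 4.89])
y = np.array([54541, 53807, 51842, 50951, 51241,
              54300, 54686, 48888, 51933, 52108], dtype=float)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)

# Closed-form least squares solution for one predictor.
slope_cf = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_cf = y.mean() - slope_cf * x.mean()

print(f"polyfit:     slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"closed form: slope = {slope_cf:.2f}, intercept = {intercept_cf:.2f}")
```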

### - Scatter Plot with Regression Line

I create a scatter plot to visualize the data points and add a regression line to represent the linear relationship.

### - Plot Customization

The plot is customized to provide clear labels and titles.

### - Visualizations

Let's visualize the results of the linear regression analysis.

In [11]:
# Define the independent variable (X) and the dependent variable (y)
X = df['beer_price']
y = df['beer_consumption']

# Fit a linear regression model
coeff = np.polyfit(X, y, 1)
poly = np.poly1d(coeff)

# Create a scatter plot of the data points
fig = go.Figure()

# Add scatter plot for the actual data
fig.add_trace(go.Scatter(x=X, y=y, mode='markers', name='Actual Data', marker=dict(size=8, opacity=0.7)))

# Add line plot for the regression line
line_X = np.linspace(X.min(), X.max(), 100)  # Create points for the regression line
line_y = poly(line_X)  # Calculate corresponding y-values
fig.add_trace(go.Scatter(x=line_X, y=line_y, mode='lines', name='Regression Line', line=dict(color='red')))

fig.update_layout(
    xaxis_title='Beer Price',
    yaxis_title='Beer Consumption',
    title='Linear Regression: Beer Consumption vs. Beer Price',
    showlegend=True, template='plotly_dark'
)
fig.show()

![Plot](plots/11.png)

### Model Evaluation for Beer Price vs. Beer Consumption

Here, I evaluate the performance of the linear regression model that was fitted to the Beer Price and Beer Consumption data.

### - Predicted Values

I start by calculating the predicted values from the regression model for Beer Consumption.

### - Evaluation Metrics

I calculate several evaluation metrics to assess the model's performance for Beer Price vs. Beer Consumption:

- **R-squared (R²)**: This metric measures the proportion of the variance in the dependent variable (Beer Consumption) that is explained by the independent variable (Beer Price). A higher R-squared indicates a better fit.
- **Mean Squared Error (MSE)**: This measures the average squared difference between the actual and predicted values. Lower MSE values indicate better predictive performance.
- **Mean Absolute Error (MAE)**: This measures the average absolute difference between the actual and predicted values. Like MSE, lower MAE values indicate better predictive performance.
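The three metrics above can be computed directly from the residuals, which makes their definitions concrete. The sketch below fits a line to the ten rows shown in the table earlier and computes each metric with NumPy only. It also reports the root mean squared error (RMSE), which is in the same units as the target and is never smaller than the MAE.

```python
import numpy as np

# Beer price (x) and beer consumption (y) for 1985-1994, from the table shown earlier.
x = np.array([3.2, 3.3, 3.37, 3.45, 3.6, 3.77, 4.21, 4.42, 4.71, 4.89])
y = np.array([54541, 53807, 51842, 50951, 51241,
              54300, 54686, 48888, 51933, 52108], dtype=float)

# Fit a straight line and compute the fitted values.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
residuals = y - y_hat

mse = np.mean(residuals ** 2)           # average squared error
rmse = np.sqrt(mse)                     # same units as y
mae = np.mean(np.abs(residuals))        # average absolute error
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(f"R-squared: {r2:.4f}")
print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

For a simple linear regression with an intercept, R-squared equals the squared Pearson correlation between x and y, which offers a quick cross-check against the correlation matrix.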

### - Model Performance

Let's view the evaluation metrics for the linear regression model for Beer Price vs. Beer Consumption:

- R-squared: 0.8292
- Mean Squared Error (MSE): 17365287.8332
- Mean Absolute Error (MAE): 3284.6553

These metrics provide insights into how well the model fits the data and its ability to make predictions for Beer Consumption based on Beer Price.

In [12]:
# Predicted values from the regression model
y_pred = poly(X)

# Calculate R-squared value
r2 = r2_score(y, y_pred)

# Calculate mean squared error (MSE)
mse = mean_squared_error(y, y_pred)

# Calculate mean absolute error (MAE)
mae = mean_absolute_error(y, y_pred)

print(f"R-squared: {r2:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")

R-squared: 0.8292
Mean Squared Error (MSE): 17365287.8332
Mean Absolute Error (MAE): 3284.6553


## Linear Regression Analysis for Roast Chicken

In this section, I perform linear regression analysis to examine the relationship between Roast Chicken Price and Roast Chicken Consumption.

### - Data Preparation

I define the independent variable (X) as Roast Chicken Price and the dependent variable (y) as Roast Chicken Consumption.

### - Linear Regression Modeling

I fit a linear regression model to the data using the least squares method. This model helps us understand how Roast Chicken Price and Roast Chicken Consumption are related.

### - Scatter Plot with Regression Line

I create a scatter plot to visualize the data points and add a regression line to represent the linear relationship.

### - Visualizations

Let's visualize the results of the linear regression analysis for Roast Chicken Price vs. Roast Chicken Consumption.

In [13]:
# Define the independent variable (X) and the dependent variable (y) for roast chicken
X_chicken = df['roast_chicken_price']
y_chicken = df['roast_chicken_consumption']

# Fit a linear regression model for roast chicken
coeff_chicken = np.polyfit(X_chicken, y_chicken, 1)
poly_chicken = np.poly1d(coeff_chicken)

# Create a scatter plot of the data points for roast chicken
fig_chicken = go.Figure()

# Add scatter plot for the actual data (roast chicken)
fig_chicken.add_trace(go.Scatter(x=X_chicken, y=y_chicken, mode='markers', name='Actual Data', marker=dict(size=8, opacity=0.7)))

# Add line plot for the regression line (roast chicken)
line_X_chicken = np.linspace(X_chicken.min(), X_chicken.max(), 100)  # Create points for the regression line
line_y_chicken = poly_chicken(line_X_chicken)  # Calculate corresponding y-values
fig_chicken.add_trace(go.Scatter(x=line_X_chicken, y=line_y_chicken, mode='lines', name='Regression Line', line=dict(color='red')))

fig_chicken.update_layout(
    xaxis_title='Roast Chicken Price',
    yaxis_title='Roast Chicken Consumption',
    title='Linear Regression: Roast Chicken Consumption vs. Roast Chicken Price',
    showlegend=True, template='plotly_dark'
)
fig_chicken.show()

![Plot](plots/12.png)

### Model Evaluation for Roast Chicken

Now, I evaluate the performance of the linear regression model that was fitted to the Roast Chicken Price and Roast Chicken Consumption data.

### - Predicted Values

I start by calculating the predicted values from the regression model for Roast Chicken Consumption.

### - Model Performance

Let's view the evaluation metrics for the linear regression model for Roast Chicken Price vs. Roast Chicken Consumption:

- R-squared: 0.7160
- Mean Squared Error (MSE): 4216537319.0085
- Mean Absolute Error (MAE): 53406.3208

These metrics provide insights into how well the model fits the data and its ability to make predictions.

In [14]:
# Predicted values for roast chicken consumption
y_pred_chicken = poly_chicken(X_chicken)

# Calculate R-squared value
r2_chicken = r2_score(y_chicken, y_pred_chicken)

# Calculate Mean Squared Error (MSE)
mse_chicken = mean_squared_error(y_chicken, y_pred_chicken)

# Calculate Mean Absolute Error (MAE)
mae_chicken = mean_absolute_error(y_chicken, y_pred_chicken)

print(f"R-squared (Roast Chicken): {r2_chicken:.4f}")
print(f"Mean Squared Error (MSE) (Roast Chicken): {mse_chicken:.4f}")
print(f"Mean Absolute Error (MAE) (Roast Chicken): {mae_chicken:.4f}")

R-squared (Roast Chicken): 0.7160
Mean Squared Error (MSE) (Roast Chicken): 4216537319.0085
Mean Absolute Error (MAE) (Roast Chicken): 53406.3208


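## OLS Regression Summary for Beer

### - Adding an Intercept

To fit a linear regression model with statsmodels, I add a constant term (intercept) to the independent variable, because `sm.OLS` does not include one by default.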

### - Linear Regression Modeling

I fit a linear regression model to the data using the Ordinary Least Squares (OLS) method. This model helps us understand how Beer Price and Beer Consumption are related.

### - Regression Summary

The regression summary provides detailed information about the model's coefficients, statistical significance, goodness of fit, and more.

This analysis helps us gain insights into the relationship between Beer Price and Beer Consumption.
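To show what adding a constant does, the sketch below builds the design matrix by hand with NumPy: a column of ones for the intercept next to the price column. It then solves the least squares problem with and without that column. This is a NumPy illustration of the idea, not the statsmodels code path, and it uses the ten rows shown in the table earlier.

```python
import numpy as np

# Beer price (x) and beer consumption (y) for 1985-1994, from the table shown earlier.
x = np.array([3.2, 3.3, 3.37, 3.45, 3.6, 3.77, 4.21, 4.42, 4.71, 4.89])
y = np.array([54541, 53807, 51842, 50951, 51241,
              54300, 54686, 48888, 51933, 52108], dtype=float)

# Design matrix with an explicit intercept column of ones.
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope = coef

# Without the column of ones the fitted line is forced through the origin.
coef_origin, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
slope_origin = coef_origin[0]

print(f"With intercept:    y = {intercept:.1f} + {slope:.1f} * x")
print(f"Without intercept: y = {slope_origin:.1f} * x")
```

The two slopes differ substantially here because consumption is far from zero at any observed price, which is why the constant term matters.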


In [15]:
# Define the independent variable (X) and the dependent variable (y)
X = df['beer_price']
y = df['beer_consumption']

# Add a constant term to the independent variable (intercept)
X = sm.add_constant(X)

# Fit a linear regression model
model = sm.OLS(y, X).fit()

# Print the regression summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:       beer_consumption   R-squared:                       0.829
Model:                            OLS   Adj. R-squared:                  0.824
Method:                 Least Squares   F-statistic:                     160.2
Date:                Fri, 15 Sep 2023   Prob (F-statistic):           3.27e-14
Time:                        14:52:12   Log-Likelihood:                -341.39
No. Observations:                  35   AIC:                             686.8
Df Residuals:                      33   BIC:                             689.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       3.743e+04   2089.147     17.915      0.0

## K-Means Clustering: Beer Price vs. Beer Consumption

In this section, I perform K-Means clustering to group data points based on their Beer Price and Beer Consumption.

### - Feature Selection

I select the features I want to use for clustering. In this case, I've chosen Beer Price and Beer Consumption.

### - Number of Clusters

I specify the number of clusters I want to create. The choice of the number of clusters can vary based on your analysis and goals. For this example, I've chosen to create 3 clusters.

### - K-Means Clustering

I fit a K-Means clustering model to the selected features. This assigns each data point to one of the clusters based on its similarity to the cluster centroids.
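K-Means relies on Euclidean distance, so a feature measured in tens of thousands (beer consumption) can dominate one measured in single digits (beer price). A common remedy is to standardize the features first. The sketch below shows that variant with scikit-learn's `StandardScaler` on the ten rows from the table earlier; it is an alternative worth comparing, not the approach used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: beer_price, beer_consumption (1985-1994, from the table shown earlier).
features = np.array([
    [3.2, 54541], [3.3, 53807], [3.37, 51842], [3.45, 50951], [3.6, 51241],
    [3.77, 54300], [4.21, 54686], [4.42, 48888], [4.71, 51933], [4.89, 52108],
], dtype=float)

# Rescale each column to zero mean and unit variance so both features
# contribute comparably to the distance computation.
scaled = StandardScaler().fit_transform(features)

# Cluster the standardized features; n_init is set explicitly for reproducibility.
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans_scaled.fit_predict(scaled)

print("Cluster labels:", labels)
```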

### - Scatter Plot for Visualization

I create a scatter plot to visualize the clusters in the Beer Price vs. Beer Consumption space. Each data point is color-coded based on its assigned cluster.

### - Visualizations

Let's visualize the K-Means clustering results.


In [16]:
# Select the features you want to use for clustering
features = df[['beer_price', 'beer_consumption']]

# Specify the number of clusters
num_clusters = 3

# Fit a K-Means clustering model
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
df['cluster'] = kmeans.fit_predict(features)

# Create a scatter plot to visualize the clusters
fig = px.scatter(df, x='beer_price', y='beer_consumption', color='cluster',
                 title='K-Means Clustering: Beer Price vs. Beer Consumption',
                 labels={'beer_price': 'Beer Price', 'beer_consumption': 'Beer Consumption'})

fig.update_layout(template='plotly_dark')
fig.show()





![Plot](plots/13.png)

### Clustering Evaluation

In this section, I evaluate the quality of the K-Means clustering results for Beer Price vs. Beer Consumption.

### - Silhouette Score

I calculate the Silhouette Score to assess the quality of the clustering. The Silhouette Score measures how similar each data point is to its assigned cluster compared to other clusters. A higher score indicates better-defined clusters.

### - Within-Cluster Sum of Squares (Inertia)

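I also calculate the Inertia, which represents the within-cluster sum of squares. It measures the compactness of clusters. Lower Inertia values indicate more compact clusters, but Inertia tends to decrease as more clusters are added, so it is most useful for comparing solutions with the same number of clusters.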
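Both metrics can also guide the choice of the number of clusters. The sketch below computes them for several values of k on standardized features, using the ten rows from the table earlier. The range of k and the standardization step are my assumptions for illustration; the notebook itself fixes k at 3.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Columns: beer_price, beer_consumption (1985-1994, from the table shown earlier).
features = np.array([
    [3.2, 54541], [3.3, 53807], [3.37, 51842], [3.45, 50951], [3.6, 51241],
    [3.77, 54300], [4.21, 54686], [4.42, 48888], [4.71, 51933], [4.89, 52108],
], dtype=float)
scaled = StandardScaler().fit_transform(features)

# Fit K-Means for each candidate k and record both quality metrics.
results = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    results[k] = (silhouette_score(scaled, model.labels_), model.inertia_)

for k, (silhouette, inertia) in results.items():
    print(f"k={k}: silhouette={silhouette:.3f}, inertia={inertia:.3f}")
```

A typical reading is to prefer the k with the highest Silhouette Score, and to look for the point where Inertia stops dropping sharply.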

### - Evaluation Results

Let's view the evaluation results for the K-Means clustering:

- Silhouette Score: 0.6763
- Inertia (Within-Cluster Sum of Squares): 241423677.3895

These metrics provide insights into the quality and cohesion of the clusters formed by the K-Means algorithm.


In [17]:
# Calculate the Silhouette Score
silhouette_avg = silhouette_score(features, df['cluster'])

# Get the Inertia (Within-Cluster Sum of Squares) from the K-Means model
inertia = kmeans.inertia_

print(f"Silhouette Score: {silhouette_avg:.4f}")
print(f"Inertia (Within-Cluster Sum of Squares): {inertia:.4f}")

Silhouette Score: 0.6763
Inertia (Within-Cluster Sum of Squares): 241423677.3895


## Anomaly Detection: Total Guests

Here, I perform anomaly detection for the 'Total Guests' feature.

### - Feature Selection

I select the feature I want to analyze for anomalies. In this case, I've chosen 'Total Guests'.

### - Z-Scores Calculation

I calculate the Z-scores for the selected feature. Z-scores measure how many standard deviations a data point is away from the mean.

### - Threshold Definition

I define a threshold for anomaly detection. This threshold determines how extreme a Z-score must be for a data point to be considered an anomaly. You can adjust this threshold based on your specific needs. In this example, I've set it to 2.

### - Anomaly Identification

I identify anomalies by comparing the absolute Z-scores to the threshold.

### - Anomalies

Let's print the anomalies detected based on the Z-scores greater than the threshold.

These anomalies represent data points that deviate significantly from the mean and may require further investigation.
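One detail worth knowing is that the Z-score depends on which standard deviation is used. pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `scipy.stats.zscore` defaults to the population version (`ddof=0`), so the same threshold can flag slightly different points. The sketch below compares both on the ten `guests_total` values from the table earlier.

```python
import numpy as np

# guests_total for 1985-1994, copied from the table shown earlier.
guests = np.array([7.1, 6.7, 6.5, 5.7, 6.2, 6.7, 6.4, 5.9, 6.5, 6.6])

# Sample standard deviation (ddof=1), the pandas Series.std() default.
z_sample = (guests - guests.mean()) / guests.std(ddof=1)

# Population standard deviation (ddof=0), the scipy.stats.zscore default.
z_population = (guests - guests.mean()) / guests.std(ddof=0)

threshold = 2
print("Flagged with ddof=1:", guests[np.abs(z_sample) > threshold])
print("Flagged with ddof=0:", guests[np.abs(z_population) > threshold])
```

Because the population standard deviation is smaller, its Z-scores are larger in magnitude, and the difference shrinks as the number of observations grows.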

In [18]:
# Define the feature for anomalies
feature = 'guests_total'

# Calculate the Z-scores for the selected feature
z_scores = (df[feature] - df[feature].mean()) / df[feature].std()

# Define a threshold for anomaly detection
threshold = 2

# Identify anomalies based on the threshold
anomalies = df[np.abs(z_scores) > threshold]

# Print the anomalies
print("Anomalies (Z-score >", threshold, "):")
print(anomalies)

Anomalies (Z-score > 2 ):
    year  duration  guests_total  guests_daily  beer_price  beer_consumption  \
0   1985        16           7.1           444        3.20             54541   
16  2001        16           5.5           344        6.47             48698   

    roast_chicken_price  roast_chicken_consumption  cluster  
0                  4.77                     629520        0  
16                 8.12                     351705        0  


In this analysis, I conducted anomaly detection for the 'Total Guests' feature. Two years have an absolute Z-score above the threshold: 1985, with an unusually high `guests_total` value of 7.1, and 2001, with an unusually low value of 5.5. The data is annual, so the second anomaly corresponds to the 2001 festival as a whole.

Upon further examination, it is reasonable to suspect that the low attendance in 2001 is associated with the events of September 11, 2001, which took place shortly before that year's festival. Major historical events, such as the **9/11 attacks**, can indeed influence data patterns and lead to outliers or anomalies. In this case, the impact of this significant event on the number of guests is a plausible explanation for the anomaly.

Anomaly detection is a valuable technique for identifying unusual patterns or events in data, and it can serve as an important step in data analysis and decision-making. However, it's essential to complement data-driven insights with domain knowledge and context, as we did here, to provide a meaningful interpretation of anomalies and their potential causes.

## Hypothesis Testing and Correlation Analysis

In this section, I conduct hypothesis testing and correlation analysis to explore relationships in the dataset.

### - Year Comparison

- #### Two-Sample T-Test

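I start by defining the year for comparison as 2001. Then, I create two groups: total guests in 2001 and total guests in other years. A two-sample t-test is performed to determine if there is a significant difference in total guests in 2001 compared to other years. Because the dataset has one row per year, the 2001 group contains a single observation, so the test relies entirely on the pooled variance of the other years and its result should be read with caution.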

- #### Interpretation of Results

The results are interpreted based on the significance level (alpha). If the p-value is less than alpha (0.05), we reject the null hypothesis, indicating a significant difference in total guests in 2001 compared to other years.
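When one group has a single observation, the pooled-variance t-test reduces to comparing that observation with the mean of the other group, using only the other group's variance. The sketch below writes this out by hand. It uses the ten `guests_total` values from the table earlier, with 1988 standing in as the year of interest purely for illustration, so it does not reproduce the 2001 result.

```python
import numpy as np
from scipy import stats

# Illustrative split of the 1985-1994 guests_total values from the table shown earlier:
# 1988 plays the role of the single year of interest.
single = 5.7
others = np.array([7.1, 6.7, 6.5, 6.2, 6.7, 6.4, 5.9, 6.5, 6.6])

n = others.size

# With one observation in the first group, the pooled variance equals the
# sample variance of the other group, and the degrees of freedom are n - 1.
standard_error = others.std(ddof=1) * np.sqrt(1 + 1 / n)
t_stat = (single - others.mean()) / standard_error

# Two-sided p-value from the t distribution.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

This makes explicit that the test treats the single year as one draw from the same distribution as the other years, which is a strong assumption.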

### - Correlation Analysis

- #### Pearson Correlation Coefficient

I calculate the Pearson correlation coefficient between beer price and beer consumption to assess if there is a significant linear correlation.

- #### Spearman Correlation Coefficient

I calculate the Spearman correlation coefficient between beer price and beer consumption to assess if there is a significant monotonic correlation.

- #### Interpretation of Results

The results are interpreted based on the significance level (alpha). If the p-value is less than alpha (0.05), it indicates a significant correlation:

- Pearson Correlation: There is a significant linear correlation between beer price and beer consumption (r = 0.91).
- Spearman Correlation: There is a significant monotonic correlation between beer price and beer consumption (rho = 0.88).

These analyses provide insights into the relationships and statistical significance of the variables in the dataset.
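The difference between the two coefficients is easy to see in code: the Spearman coefficient is the Pearson coefficient computed on the ranks of the data. The sketch below verifies this on the ten rows from the table earlier, so the values differ from the full-data results quoted above.

```python
from scipy import stats

# Beer price and consumption for 1985-1994, copied from the table shown earlier.
beer_price = [3.2, 3.3, 3.37, 3.45, 3.6, 3.77, 4.21, 4.42, 4.71, 4.89]
beer_consumption = [54541, 53807, 51842, 50951, 51241,
                    54300, 54686, 48888, 51933, 52108]

# Spearman correlation computed directly.
rho, _ = stats.spearmanr(beer_price, beer_consumption)

# Pearson correlation computed on the ranks of each variable.
r_on_ranks, _ = stats.pearsonr(stats.rankdata(beer_price),
                               stats.rankdata(beer_consumption))

print(f"Spearman rho:     {rho:.4f}")
print(f"Pearson on ranks: {r_on_ranks:.4f}")
```

Because it works on ranks, the Spearman coefficient is less sensitive to outliers and captures any monotonic relationship, not only a linear one.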

In [19]:
# Define the year for comparison
year_to_compare = 2001

# Create two groups: total guests in 2001 and total guests in other years
total_guests_2001 = df[df['year'] == year_to_compare]['guests_total']
total_guests_other_years = df[df['year'] != year_to_compare]['guests_total']

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(total_guests_2001, total_guests_other_years)

# Set the significance level (alpha)
alpha = 0.05

# Interpret the results
if p_value < alpha:
    print(f"Reject the null hypothesis: There is a significant difference in total guests in {year_to_compare} compared to other years.")
else:
    print(f"Fail to reject the null hypothesis: There is no significant difference in total guests in {year_to_compare} compared to other years.")

Reject the null hypothesis: There is a significant difference in total guests in 2001 compared to other years.


In [20]:
# Calculate Pearson correlation coefficient
pearson_corr, pearson_p_value = stats.pearsonr(df['beer_price'], df['beer_consumption'])

# Calculate Spearman correlation coefficient
spearman_corr, spearman_p_value = stats.spearmanr(df['beer_price'], df['beer_consumption'])

# Set the significance level (alpha)
alpha = 0.05

# Interpret the results
if pearson_p_value < alpha:
    print(f"Pearson Correlation: There is a significant linear correlation between beer price and beer consumption (r = {pearson_corr:.2f}).")
else:
    print(f"Pearson Correlation: There is no significant linear correlation between beer price and beer consumption (r = {pearson_corr:.2f}).")

if spearman_p_value < alpha:
    print(f"Spearman Correlation: There is a significant monotonic correlation between beer price and beer consumption (rho = {spearman_corr:.2f}).")
else:
    print(f"Spearman Correlation: There is no significant monotonic correlation between beer price and beer consumption (rho = {spearman_corr:.2f}).")

Pearson Correlation: There is a significant linear correlation between beer price and beer consumption (r = 0.91).
Spearman Correlation: There is a significant monotonic correlation between beer price and beer consumption (rho = 0.88).


## Consumer Beer Price Data Collection and Filtering

I collect and filter beer price data from a specified data source.

- ### Data Source

The beer price data is obtained from the source [https://brookstonbeerbulletin.com/the-price-of-a-beer-1952-2016/](https://brookstonbeerbulletin.com/the-price-of-a-beer-1952-2016/).

- ### Data Collection

I collect historical beer price data for the years 1952 to 2019.

- ### Data Filtering

To focus on more recent data, I filter the dataset to include only data for the year 1985 and after.

The resulting dataset, `df_beer_prices`, contains the following columns:

- 'year': The year of observation (ranging from 1985 to 2019).
- 'beer_price': The price of a beer in euros.

This filtered dataset is ready for further analysis and visualization.

In [21]:
data = {
    'year': [
        1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966,
        1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981,
        1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996,
        1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
        2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019
    ],
    'beer_price': [
        0.65, 0.65, 0.67, 0.67, 0.68, 0.69, 0.69, 0.70, 0.71, 0.71, 0.71, 0.72, 0.73, 0.74, 0.75,
        0.76, 0.79, 0.82, 0.86, 0.89, 0.91, 0.94, 1.01, 1.09, 1.12, 1.15, 1.22, 1.32, 1.42, 1.52,
        1.59, 1.65, 1.70, 1.75, 1.83, 1.88, 1.95, 2.03, 2.13, 2.35, 2.43, 2.47, 2.50, 2.54, 2.61,
        2.68, 2.73, 2.80, 2.88, 2.95, 3.02, 3.08, 3.17, 3.23, 3.31, 3.41, 3.53, 3.64, 3.68, 3.73,
        3.88, 3.87, 3.91, 3.95, 3.99, 4.30, 4.34, 4.40
    ],
}

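# Build a DataFrame from the year and price lists
df_beer_prices = pd.DataFrame(data)

# Keep only 1985 and later, and reset the index so it starts at 0 again
df_beer_prices = df_beer_prices[df_beer_prices['year'] >= 1985].reset_index(drop=True)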

## Festival vs. Consumer Beer Prices Over the Years

In this section, I create a line plot to visualize the trends in festival beer prices compared to consumer beer prices over the years.

### - Line Plot

I plot the following data:

- **Festival Beer Prices**: This is represented by the blue line, showing the trend in festival beer prices over the years.

- **Consumer Beer Prices**: This is represented by the orange line, showing the trend in consumer beer prices over the years.

### - Plot Customization

I customize the plot to make it informative and visually appealing:

- **X-Axis**: Represents the years from 1985 to 2019.
- **Y-Axis**: Represents the beer prices in euros.
- **Legend**: Shows the legend for both festival and consumer beer prices.

### - Observations

This plot allows us to observe how festival beer prices compare to consumer beer prices over the years and identify any trends or differences.

In [22]:
# Create a line plot for festival beer prices
fig = px.line(df, x='year', y='beer_price', title='Festival vs. Consumer Beer Prices Over Years', labels={'beer_price': 'Price'})
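# px.line hides a single trace from the legend by default, so enable it explicitly
fig.update_traces(line=dict(width=2), name='Festival Beer Price', showlegend=True)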

# Add a line plot for consumer beer prices
fig.add_trace(go.Scatter(x=df_beer_prices['year'], y=df_beer_prices['beer_price'], mode='lines', name='Consumer Beer Price', line=dict(width=2)))

fig.update_xaxes(title='Year')
fig.update_yaxes(title='Price')
fig.update_layout(showlegend=True, template='plotly_dark')
fig.show()

![Plot](plots/14.png)

Comparing festival beer prices with consumer beer prices over the years, two observations stand out:

1. **Festival Beer Price Premium:** Throughout the years under consideration, I consistently observed that festival beer prices are notably higher than consumer beer prices. This premium suggests that individuals attending festivals should anticipate paying more for beer compared to what they might pay in other consumer settings.

2. **Increasing Price Gap Post-2000:** An intriguing trend is the widening gap between festival beer prices and consumer beer prices, particularly noticeable after the year 2000. This shift in pricing dynamics raises questions about the factors driving this divergence.
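The widening gap can be quantified by joining the two price series on `year` and computing the difference and the ratio per year. The sketch below assumes both tables have `year` and `beer_price` columns, as in this notebook. The festival prices are hypothetical placeholders; the consumer prices for 2000, 2010 and 2019 are copied from the consumer price table.

```python
import pandas as pd

# Festival prices are hypothetical placeholders, not values from the dataset
festival = pd.DataFrame({'year': [2000, 2010, 2019],
                         'beer_price': [6.0, 8.5, 11.5]})

# Consumer prices for the same years, copied from the consumer price table
consumer = pd.DataFrame({'year': [2000, 2010, 2019],
                         'beer_price': [2.88, 3.68, 4.40]})

# Inner join on year keeps only years present in both tables
merged = festival.merge(consumer, on='year', suffixes=('_festival', '_consumer'))

# Absolute gap and relative premium per year
merged['price_gap'] = merged['beer_price_festival'] - merged['beer_price_consumer']
merged['price_ratio'] = merged['beer_price_festival'] / merged['beer_price_consumer']

print(merged[['year', 'price_gap', 'price_ratio']])
```

In the notebook itself, `df` and `df_beer_prices` would take the place of `festival` and `consumer`. A rising `price_gap` with a flat `price_ratio` would mean the two series grow proportionally; a rising ratio would point to a genuine divergence.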

The plot shows price levels only; it does not explain why festival prices run higher or why the gap has grown. Possible drivers worth investigating include changes in festival policies, general economic conditions, and vendor pricing strategies. Examining consumer preferences and festival-goer behavior may also help explain why visitors accept the premium.

This comparison is therefore best read as a starting point for a closer look at festival beer pricing.

## Standard Deviation Comparison: Consumer vs. Festival Beer Prices

In this section, I calculate and compare the standard deviations of consumer and festival beer prices.

### - Standard Deviation Calculation

I calculate the following standard deviations:

- **Standard Deviation of Consumer Beer Prices**: This measures the variability or spread in consumer beer prices from the dataset.

- **Standard Deviation of Festival Beer Prices**: This measures the variability or spread in festival beer prices from the dataset.

### - Results

Let's view the calculated standard deviations:

- Standard Deviation of Consumer Beer Prices: 0.76491
- Standard Deviation of Festival Beer Prices: 2.54394

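In absolute terms, festival beer prices vary roughly 3.3 times as much as consumer beer prices. Because festival prices are also higher on average, the raw standard deviation partly reflects the difference in price level and not only a difference in volatility.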
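One way to compare variability across series with different price levels is the coefficient of variation (standard deviation divided by mean). This is a suggested complement, not part of the original analysis. In the sketch below, the consumer prices are the 1985 to 2019 values from this notebook; `festival_prices` is a hypothetical placeholder, and `coefficient_of_variation` is a helper introduced here for illustration.

```python
import numpy as np

def coefficient_of_variation(prices):
    """Population standard deviation (ddof=0) divided by the mean."""
    prices = np.asarray(prices, dtype=float)
    return prices.std(ddof=0) / prices.mean()

# Consumer beer prices for 1985-2019, copied from the consumer price table
consumer_prices = [
    1.75, 1.83, 1.88, 1.95, 2.03, 2.13, 2.35, 2.43, 2.47, 2.50, 2.54, 2.61,
    2.68, 2.73, 2.80, 2.88, 2.95, 3.02, 3.08, 3.17, 3.23, 3.31, 3.41, 3.53,
    3.64, 3.68, 3.73, 3.88, 3.87, 3.91, 3.95, 3.99, 4.30, 4.34, 4.40,
]

# Hypothetical placeholder values; substitute df['beer_price'] in the notebook
festival_prices = [3.0, 5.0, 7.0, 9.0, 11.0]

cv_consumer = coefficient_of_variation(consumer_prices)
cv_festival = coefficient_of_variation(festival_prices)

print(f"Consumer std: {np.std(consumer_prices):.5f}, CV: {cv_consumer:.3f}")
print(f"Festival CV (placeholder data): {cv_festival:.3f}")
```

Because the coefficient of variation is unitless, it stays comparable even when one series sits at a much higher price level than the other.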

In [23]:
# Calculate the standard deviation of consumer beer prices
# (np.std uses ddof=0 by default, i.e. the population standard deviation)
std_consumer = np.std(df_beer_prices['beer_price'])

# Calculate the standard deviation of festival beer prices (same ddof=0 convention)
std_festival = np.std(df['beer_price'])

print(f"Standard Deviation of Consumer Beer Prices: {std_consumer}")
print(f"Standard Deviation of Festival Beer Prices: {std_festival}")

Standard Deviation of Consumer Beer Prices: 0.7649062901121033
Standard Deviation of Festival Beer Prices: 2.5439383897044845


### - Bar Chart

I plot the following data:

- **Consumer Beer Prices**: This is represented by one bar, showing the standard deviation of consumer beer prices.

- **Festival Beer Prices**: This is represented by another bar, showing the standard deviation of festival beer prices.

### - Plot Customization

I customize the plot to make it visually informative:

- **X-Axis**: Represents the categories, which are "Consumer Beer Prices" and "Festival Beer Prices."

- **Y-Axis**: Represents the standard deviation.


In [24]:
# Create a bar chart to visualize the standard deviations
data = {'Category': ['Consumer Beer Prices', 'Festival Beer Prices'],
        'Standard Deviation': [std_consumer, std_festival]}
df_std = pd.DataFrame(data)

fig = px.bar(df_std, x='Category', y='Standard Deviation', 
             title='Standard Deviation of Beer Prices',
             labels={'Category': 'Price Type', 'Standard Deviation': 'Std Deviation'})

fig.update_yaxes(title='Standard Deviation')
fig.update_traces(marker_color=['#636EFA', '#FFA726'])
fig.update_layout(template='plotly_dark')
fig.show()

![Plot](plots/15.png)