# Introduction

In this project, I'm exploring the dataset **"Texas Lottery® Sales by Fiscal Month/Year, Game and Retailer"**.

The project consists of three distinct parts:

- API querying with SoQL
- Local data processing with Python and Dask
- Data visualization with Tableau

This Notebook focuses on the second part: processing the local copy of the dataset using Python and Dask.

## Objectives

Within the scope of this part, I aim to achieve the following goals:

- Preprocessing the Data

- Analyzing Seasonal Trends:

  - Examining seasonal sales trends
  - Analyzing ticket prices over time and identifying seasonal variation

- Optimization and Export for Tableau:

  - Aggregating data to reduce dataset size
  - Imputing missing values and ensuring data completeness

## Tools Used

- **Python**: Primary programming language used for data manipulation and analysis.
- **Dask**: Parallel computing library that scales Python code from single machines to large clusters, used for handling large datasets that do not fit into memory.
- **Pandas**: Python library for data manipulation and analysis, used for cleaning and transforming the dataset.
- **Matplotlib/Seaborn**: Visualization libraries in Python, employed for creating plots and graphs to explore data trends.
- **NumPy**: A fundamental package for scientific computing in Python.
- **Jupyter Notebook**: Interactive computing environment used for documenting the data analysis process.

## Data Sources

The dataset **"Texas Lottery® Sales by Fiscal Month/Year, Game and Retailer"** has been provided by **Texas Lottery Commission**.

- [Link to the Dataset on data.texas.gov](https://data.texas.gov/dataset/Texas-Lottery-Sales-by-Fiscal-Month-Year-Game-and-/beka-uwfq/about_data) 
- Access & Use: This dataset is intended for public access and use.
- Copyright and Trademark Notice: [Link to texaslottery.com](https://www.texaslottery.com/export/sites/lottery/Misc/copyright.html) 


### Disclosure

The dataset utilized in this project was sourced from the Texas Open Data Portal, which is publicly available and free to use. It's important to note, however, that **the Texas Lottery Commission holds copyright and trademark protections over various elements associated with their data and brand.** This encompasses all logos, text, content, including underlying HTML code, designs, and graphics depicted on their Internet website, safeguarded under United States and international copyright and trademark laws and treaties.

I do not claim any rights over these elements, and no such material has been reproduced within this project. The use of the dataset is for analytical and educational purposes only, adhering to the guidelines stipulated by the Texas Open Data Portal and respecting the copyright and trademark notice issued by the Texas Lottery Commission. Any specific trademarks or service marks mentioned within the dataset are duly recognized as the property of the Texas Lottery Commission, and their use in this project does not imply any affiliation with or endorsement by the Commission.

### Dataset Version

Due to server response limitations and to ensure uninterrupted data analysis, a **local copy of the dataset** was downloaded and used for data manipulation and analysis within Python. This approach was adopted to mitigate potential server timeouts and connectivity issues encountered during direct API access, allowing for a more stable and efficient data analysis process.

- Data Last Updated: January 25, 2024
- Data Coverage: 
  - Start Date: September 2020
  - End Date: January 2024

Because of its size (7.5 GB), the dataset file exceeds GitHub's file size limits and is therefore not included in the project's repository.

# 1. Setting up the Environment

The following command installs the Python libraries required for this project:

`pip install dask pandas numpy matplotlib seaborn ipython "dask[distributed]" "dask[dataframe]"`

# 2. Loading and Preprocessing the Data
## 2.1. Importing Libraries

In [None]:
# Importing the Dask Client for distributed computing.
from dask.distributed import Client

# Dask DataFrame module for parallel computing with large datasets.
import dask.dataframe as dd

# The garbage collection module to manage memory and perform cleanup.
import gc

# Essential tool to work with tabular data structures in Python.
import pandas as pd

# For numerical computations.
import numpy as np 

# For plotting graphs.
import matplotlib.pyplot as plt

# To format the axis tick labels.
from matplotlib.ticker import FuncFormatter

# To handle date formatting on the x-axis.
import matplotlib.dates as mdates

# For more advanced data visualization.
import seaborn as sns

# Configuring Jupyter to display plots inline.
%matplotlib inline 

# Setting the option to display all columns.
pd.set_option('display.max_columns', None)

# Setting the option to display all rows.
pd.set_option('display.max_rows', None)

# Setting display option to avoid scientific notation.
pd.set_option('display.float_format', lambda x: '%.0f' % x)

## 2.2. Starting a Local Dask Client

The dataset under analysis contains nearly 30 million rows (about 7.5 GB), which presents a significant challenge for in-memory processing on a standard personal computer.

To address this, I'm going to use **Dask** - an open-source parallel computing library for large-scale data operations. Dask breaks down large datasets into manageable chunks and processes them in parallel, significantly speeding up data computations and analysis.

As I initialize a Dask Client, I'll specify a local directory for storing intermediate data. This is useful for handling large datasets or complex computations, as it allows Dask to efficiently manage temporary data and spill over to disk if the memory is insufficient.

In [None]:
# Initializing the Dask client with additional configurations.
client = Client(
    # Directory for intermediate data.
    local_directory='C:/PLACEHOLDER/PATH',
    memory_limit='4GB',  # Setting a memory limit for each worker.
    n_workers=4,  # Number of workers.
    processes=True,  # Using processes instead of threads.
    threads_per_worker=1  # Number of threads per worker.
)

client

### Processes vs. Threads

**Processes**: In Python, using processes means that each worker runs in its own independent memory space. This allows for true parallelism because each process is a separate instance of the Python interpreter, and they can run on multiple CPU cores simultaneously.

**Threads**: When using threads, multiple threads run within the same process and share the same memory space. However, Python's Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, which can limit the performance benefits of multithreading for CPU-bound tasks.

## 2.3. Loading the Dataset

Dask provides a DataFrame interface that closely mirrors Pandas, allowing users to perform data manipulation and analysis in a familiar way but on larger-than-memory datasets.

Initially, when I tried to load the dataset into a Dask DataFrame, I encountered a ValueError:

```python
dask_dataframe = dd.read_csv('Texas_Lottery_Sales_by_Fiscal_Month_Year_Game_and_Retailer.csv')
```

Output:

```shell
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------------+---------+----------+
| Column                      | Found   | Expected |
+-----------------------------+---------+----------+
| Retailer Location Address 2 | object  | float64  |
| Scratch Game Number         | float64 | int64    |
| Ticket Price                | float64 | int64    |
+-----------------------------+---------+----------+

The following columns also raised exceptions on conversion:

- Retailer Location Address 2
  ValueError("could not convert string to float: 'SUITE 180'")

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'Retailer Location Address 2': 'object',
       'Scratch Game Number': 'float64',
       'Ticket Price': 'float64'}

to the call to `read_csv`/`read_table`.

```
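
A likely explanation, consistent with the error message, is that Dask infers column dtypes from a sample at the start of the file, and a later block of the file then disagrees with that guess. The sketch below reproduces the effect with plain pandas on two tiny in-memory CSV snippets. The two "blocks" and their values are made up for illustration; only `SUITE 180` is taken from the error output above.

```python
import io

import pandas as pd

# Two hypothetical chunks of the same CSV, standing in for file blocks.
first_block = io.StringIO(
    "Retailer Location Address 2,Ticket Price\n"
    ",5\n"
    ",10\n"
)
later_block = io.StringIO(
    "Retailer Location Address 2,Ticket Price\n"
    "SUITE 180,\n"
    ",20\n"
)

# An all-empty column is inferred as float64 (NaN); whole numbers as int64.
inferred_first = pd.read_csv(first_block).dtypes
print(inferred_first)

# A text value rules out float64; a missing number turns int64 into float64.
inferred_later = pd.read_csv(later_block).dtypes
print(inferred_later)

# Declaring the dtypes up front makes every block agree.
dtype = {'Retailer Location Address 2': 'object', 'Ticket Price': 'float64'}
first_block.seek(0)
typed = pd.read_csv(first_block, dtype=dtype)
print(typed.dtypes)
```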

I'm going to follow the recommendation and specify dtypes in my `read_csv()` call.

In [None]:
# Defining the data types for the columns causing issues.
dtype = {
    'Retailer Location Address 2': 'object',
    'Scratch Game Number': 'float64',
    'Ticket Price': 'float64'
}

# Loading the dataset into a Dask DataFrame.
dask_dataframe = dd.read_csv(
    'Texas_Lottery_Sales_by_Fiscal_Month_Year_Game_and_Retailer.csv',
    dtype=dtype
)

If data is not evenly partitioned, some operations might load too much data into memory at once.

I repartition the data to have more, smaller partitions:

In [None]:
dask_dataframe = dask_dataframe.repartition(
    npartitions=dask_dataframe.npartitions * 2
)

# Persisting the DataFrame after repartitioning.
dask_dataframe = dask_dataframe.persist()

`persist()` triggers the computation and keeps the resulting partitions in the workers' memory. By default, Dask evaluates lazily and defers work until an action such as `compute()` is called; after `persist()`, later operations start from the in-memory partitions instead of re-reading and re-parsing the CSV file each time.

Persisting avoids repeated recomputation and speeds up subsequent steps, at the cost of holding the data in memory (or spilling it to disk once the worker memory limit is reached).

## 2.4. Initial Inspection

The dataset columns' description, provided by **data.texas.gov**:


| Column Name                                                                                                                       | Description                                                                                                                                     | Type        |
| --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | ----------- |
| Row ID                                                                                                                            | Unique key.                                                                                                                                     | Plain Text  |
| Fiscal Year                                                                                                                       | The fiscal year (Sept-Aug, i.e. Sept 2021-Aug 2022 = 2022, Sept 2022-Aug 2023= 2023 etc.) the pack settled/tickets were sold.                   | Number      |
| Fiscal Month                                                                                                                      | The fiscal month number (Sept-Aug, Sept =1, Oct=2, etc.) the pack settled/tickets were sold.                                                    | Number      |
| Fiscal Month Name and Number                                                                                                      | The fiscal month number and name (Sept - Aug, Sept =1, Oct=2, etc.) the pack settled/tickets were sold.                                         | Plain Text  |
| Calendar Year                                                                                                                     | The calendar year the pack settled/tickets were sold.                                                                                           | Number      |
| Calendar Month                                                                                                                    | The calendar month number the pack settled/tickets were sold.                                                                                   | Number      |
| Calendar Month Name and Number                                                                                                    | The calendar month number and name the pack settled/tickets were sold.                                                                          | Plain Text  |
| Month Ending Date                                                                                                                 | The month end date the pack settled/tickets were sold.                                                                                          | Date & Time |
| Game Category                                                                                                                     | The type of lottery game; i.e. Scratch, Lotto Texas®, Powerball®, etc.                                                                          | Plain Text  |
| Scratch Game Number                                                                                                               | The game number of the scratch ticket.                                                                                                          | Number      |
| Ticket Price                                                                                                                      | The price per ticket.                                                                                                                           | Number      |
| Retailer License Number                                                                                                           | The retailer license number that sold the ticket.                                                                                               | Number      |
| Retailer Location Name                                                                                                            | The retailer location name that sold the ticket.                                                                                                | Plain Text  |
| Retailer Number and Location Name                                                                                                 | The retailer location number/location name that sold the ticket. This number is the store number assigned to the location by the owning entity. | Plain Text  |
| Retailer Location Address 1                                                                                                       | The address line 1 of the retailer location that sold the ticket.                                                                               | Plain Text  |
| Retailer Location Address 2                                                                                                       | The address line 2 of the retailer location that sold the ticket.                                                                               | Plain Text  |
| Retailer Location City                                                                                                            | The city of the retailer location that sold the ticket.                                                                                         | Plain Text  |
| Retailer Location State                                                                                                           | The state of the retailer location that sold the ticket.                                                                                        | Plain Text  |
| Retailer Location Zip Code                                                                                                        | The zip code of the retailer location that sold the ticket.                                                                                     | Plain Text  |
| Retailer Location Zip Code +4                                                                                                     | The zip code +4 of the retailer location that sold the ticket.                                                                                  | Plain Text  |
| Retailer Location County                                                                                                          | The county of the retailer location that sold the ticket.                                                                                       | Plain Text  |
| Owning Entity Retailer Number                                                                                                     | This is the retailer number of the retailer owning entity who is financially responsible for the location where the pack settled/tickets sold.  | Number      |
| Owning Entity Retailer Name                                                                                                       | This is the name of the retailer owning entity who is financially responsible for the location where pack settled/tickets sold.                 | Plain Text  |
| Owning Entity/Chain Head Number and Name                                                                                          | This is the name and retailer number of the owning entity of the location financially responsible for the pack settled/tickets sold.            | Plain Text  |
| Gross Ticket Sales Amount                                                                                                         | This is the gross sales amount of the pack settled/tickets sold.                                                                                | Number      |
| Promotional Tickets Amount                                                                                                        | This is the dollar amount of free tickets given away as part of a promotion approved by the Lottery.                                            | Number      |
| Cancelled Tickets Amount                                                                                                          | This is the dollar amount of tickets that were printed then cancelled by retailer due to some sort of issue; e.g. printer jam, etc.             | Number      |
| Ticket Adjustments Amount                                                                                                         | This is the dollar amount of ticket adjustments made to the retailer's account; e.g. retailer request for adjustment for damaged tickets, etc.  | Number      |
| Ticket Returns Amount                                                                                                             | This is the dollar amount in ticket returns processed at the lottery warehouse and adjusted to retailer's account.                              | Number      |
| Net Ticket Sales Amount                                                                                                           | This is the net sales amount of the pack settled/tickets sold minus any promotional, cancelled, adjusted or returned tickets.                   | Number      |


### Dataset Shape

In [None]:
# Number of columns.
num_columns = len(dask_dataframe.columns)
print(f"Number of Columns: {num_columns}")

# Number of rows.
num_rows = dask_dataframe.shape[0].compute()
print(f"Number of Rows: {num_rows}")

Our dataset has 30 columns and almost 30 million rows (entries). 

### Viewing the First Few Rows

I examine the first few rows of the dataset to get a sense of the data structure and contents.

In [None]:
dask_dataframe.head()

## 2.5. Pruning Irrelevant Features

For large datasets, especially those that are close to or exceed the system's memory capacity, it can be beneficial to remove unnecessary columns early. This can reduce memory usage and improve processing speed.

Columns that I consider redundant or irrelevant to this project's analysis:

- `Fiscal Year`
- `Fiscal Month`
- `Fiscal Month Name and Number`
- `Calendar Month Name and Number`
- `Month Ending Date`
- `Scratch Game Number`
- `Retailer License Number`
- `Retailer Location Name`
- `Retailer Number and Location Name`
- `Retailer Location Address 1`
- `Retailer Location Address 2`
- `Retailer Location State`
- `Retailer Location Zip Code`
- `Retailer Location Zip Code +4`
- `Owning Entity Retailer Number`
- `Owning Entity Retailer Name`
- `Owning Entity/Chain Head Number and Name`
- `Ticket Adjustments Amount`

Regarding the `Retailer Location State` column, the assumption is that the entire dataset belongs to Texas. However, this assumption requires verification before the column can be safely removed.

In [None]:
# Counting the number of unique values in the "Retailer Location State" column.
unique_states = dask_dataframe['Retailer Location State'].nunique().compute()
unique_states

There are two unique values in the `Retailer Location State` column.

Let's check what they are.

In [None]:
# Getting the unique values.
unique_states = dask_dataframe['Retailer Location State'].unique().compute()

# Converting to a list for easier viewing.
unique_states_list = unique_states.tolist()
print("Unique values in the 'Retailer Location State' column: "
      f"{unique_states_list}")

The other state is Tennessee.

Checking how many rows are attributed to it in the dataset:

In [None]:
state_counts = dask_dataframe['Retailer Location State'].value_counts().compute()
state_counts

There is one row corresponding to the State of Tennessee. Given that all other rows are related to Texas, this lone Tennessee row can be considered an outlier and removed. Consequently, the `Retailer Location State` column becomes redundant and can also be safely removed.

In [None]:
# Filtering the DataFrame to exclude rows where "Retailer Location State" is "TN".
filtered_dask_dataframe = dask_dataframe[
    dask_dataframe['Retailer Location State'] != 'TN'
]

Dropping the rest of the columns I listed for removal earlier. 

In [None]:
columns_to_remove = [
    'Fiscal Year',
    'Fiscal Month',
    'Fiscal Month Name and Number',
    'Calendar Month Name and Number',
    'Month Ending Date',
    'Scratch Game Number',
    'Retailer License Number',
    'Retailer Location Name',
    'Retailer Number and Location Name',
    'Retailer Location Address 1',
    'Retailer Location Address 2',
    'Retailer Location State',
    'Retailer Location Zip Code',
    'Retailer Location Zip Code +4',
    'Owning Entity Retailer Number',
    'Owning Entity Retailer Name',
    'Owning Entity/Chain Head Number and Name',
    'Ticket Adjustments Amount'
]

# Dropping the specified columns from the DataFrame.
reduced_dask_dataframe = filtered_dask_dataframe.drop(columns=columns_to_remove)

# Persisting the reduced DataFrame to compute the operation and optimize
# further computations.
reduced_dask_dataframe = reduced_dask_dataframe.persist()

Checking the result:

In [None]:
reduced_dask_dataframe.head()

## 2.6. Data Types Check

Ensuring that each column is of the correct data type:

In [None]:
reduced_dask_dataframe.dtypes

Data types validated.

## 2.7. Summary Statistics

Generating summary statistics for numerical columns to understand their distribution, identify any obvious outliers, or spot missing values:

In [None]:
reduced_dask_dataframe.describe().compute()

From these summary statistics, the following columns appear right-skewed, indicating a concentration of lower values with fewer higher value outliers:

- `Ticket Price`
- `Gross Ticket Sales Amount`
- `Net Ticket Sales Amount`

And the following columns appear left-skewed, with a majority of values clustering towards the higher end and some extreme lower value outliers:

- `Promotional Tickets Amount`
- `Cancelled Tickets Amount`
- `Ticket Returns Amount`

However, it's important to note that the presence of extreme outliers in these columns, potentially with the exception of `Ticket Price`, might significantly influence these assessments. Such outliers can distort the mean and give an exaggerated sense of skewness. Therefore, these summary statistics alone might not fully capture the distribution patterns of the listed columns. Visualizing the data distributions could provide a more nuanced understanding of their characteristics.
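
To make the caveat about outliers concrete, here is a small sketch on synthetic numbers (not taken from the dataset). It compares the mean, the median, and `Series.skew()` for a symmetric sample before and after a single extreme value is appended.

```python
import pandas as pd

# Synthetic, symmetric sample (illustrative only).
values = pd.Series([8, 9, 10, 10, 10, 11, 12], dtype='float64')

# The same sample with one extreme high value appended.
with_outlier = pd.concat([values, pd.Series([500.0])], ignore_index=True)

summary = pd.DataFrame(
    {
        'mean': [values.mean(), with_outlier.mean()],
        'median': [values.median(), with_outlier.median()],
        'skew': [values.skew(), with_outlier.skew()],
    },
    index=['original', 'with outlier'],
)

print(summary)
```

The median stays at 10 in both cases, while the mean and the skewness change sharply, which is why a histogram or box plot is a safer basis for judging the shape of a distribution.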

## 2.8. Missing Values

### Identifying Missing Values

Checking for missing values in each column:

In [None]:
reduced_dask_dataframe.isnull().sum().compute()

As we can see, there is a large number of missing values in the column `Ticket Price`.

To understand this issue better, I'm going to analyze the distribution of null values within the `Game Category` column. This investigation will help identify if missing values are concentrated in specific categories and guide my approach to addressing these gaps.

In [None]:
# Grouping by "Game Category" and counting missing values in "Ticket Price"
# for each category.
missing_values_distribution = (
    reduced_dask_dataframe
    .groupby('Game Category')['Ticket Price']
    .apply(lambda x: x.isna().sum(), meta=('x', 'int64'))
    .compute()
)

# Displaying the result.
missing_values_distribution

In [None]:
# Grouping by "Game Category" and counting non-missing values in "Ticket Price"
# for each category.
nonnull_values_distribution = (
    reduced_dask_dataframe
    .groupby('Game Category')['Ticket Price']
    .apply(lambda x: x.notna().sum(), meta=('x', 'int64'))
    .compute()
)

# Displaying the result.
nonnull_values_distribution

The `Ticket Price` column is populated only for the `Scratch Tickets` game category, which explains the missing values in all other categories.
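
As a cross-check, the missing and non-missing counts per category can also be obtained in a single aggregation, without `groupby().apply()`. The sketch below uses plain pandas on a made-up toy frame; the category names other than `Scratch Tickets` are placeholders, and I am assuming that the same assign-then-aggregate pattern carries over to a Dask DataFrame.

```python
import numpy as np
import pandas as pd

# Toy data; only "Scratch Tickets" is a category name taken from the dataset.
toy = pd.DataFrame({
    'Game Category': ['Scratch Tickets', 'Scratch Tickets',
                      'Draw Game A', 'Draw Game A', 'Draw Game B'],
    'Ticket Price': [5.0, 20.0, np.nan, np.nan, np.nan],
})

# Flagging missing prices, then aggregating the flag per category:
# sum -> number of missing prices, count -> number of rows.
price_summary = (
    toy.assign(price_missing=toy['Ticket Price'].isna())
    .groupby('Game Category')['price_missing']
    .agg(['sum', 'count'])
    .rename(columns={'sum': 'missing', 'count': 'rows'})
)
price_summary['non_missing'] = price_summary['rows'] - price_summary['missing']

print(price_summary)
```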

## 2.9. Duplicate Rows

### Checking for Duplicates

Unlike Pandas, Dask DataFrames do not have a direct equivalent to the `duplicated()` method. My initial attempts to count the duplicated rows, shown below, exceeded the memory budget:

```python
deduplicated_df = reduced_dask_dataframe.drop_duplicates().persist()
row_count = reduced_dask_dataframe.shape[0].compute()
deduplicated_row_count = deduplicated_df.shape[0].compute()
number_of_duplicates = row_count - deduplicated_row_count
print(f"Number of Duplicate Rows: {number_of_duplicates}")
```


```python
duplicates_count = reduced_dask_dataframe.groupby('Row ID').size().compute()
duplicates_count = duplicates_count[duplicates_count > 1]
print(f"Number of Duplicate 'Row ID's: {len(duplicates_count)}")
```

After encountering those limitations, I adopted a Map-Reduce style approach. I used `map_partitions` to apply a function to each partition of my DataFrame and then combined the results. The idea is to identify duplicated `Row ID` values within each partition first (map step) and then collect the per-partition results into a single table (reduce step).

Before doing this, I ran `gc.collect()`, which triggers Python's garbage collection process. This process reclaims memory by clearing unused objects. The number returned represents the count of unreachable objects found and freed during that garbage collection cycle.

In [None]:
gc.collect()

In [None]:
# Defining a function to apply to each partition.
def find_duplicates(partition):
    # Finding duplicated "Row ID" within the partition.
    duplicated = partition[partition.duplicated('Row ID')]
    return duplicated

# Applying the function to each partition and computing to get results.
duplicates_per_partition = (
    reduced_dask_dataframe
    .map_partitions(find_duplicates)
    .compute()
)

# Now "duplicates_per_partition" contains all duplicates found in each chunk.
duplicates_per_partition

The output is an empty table, meaning no duplicated `Row ID` values were found within any single partition.
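
Because this check runs partition by partition, a `Row ID` repeated in two different partitions would go unnoticed. The sketch below illustrates the difference with plain pandas, using a list of small DataFrames as stand-ins for Dask partitions (toy `Row ID` values, not from the dataset), and adds a reduce step that combines per-partition counts.

```python
import pandas as pd

# Two toy "partitions"; Row ID "b" appears once in each of them.
partitions = [
    pd.DataFrame({'Row ID': ['a', 'b', 'c']}),
    pd.DataFrame({'Row ID': ['b', 'd', 'd']}),
]

# Map step as in find_duplicates: only within-partition repeats are flagged.
local_duplicates = pd.concat(
    [part[part.duplicated('Row ID')] for part in partitions]
)

# Reduce step: summing the per-partition counts of each Row ID, then keeping
# the IDs that occur more than once overall.
total_counts = (
    pd.concat([part['Row ID'].value_counts() for part in partitions])
    .groupby(level=0)
    .sum()
)
global_duplicates = sorted(total_counts[total_counts > 1].index)

print(local_duplicates['Row ID'].tolist())  # within-partition duplicates
print(global_duplicates)                    # duplicates across all partitions
```

For a column that is meant to be a unique key, the combined counts are about as large as the column itself, so on the full dataset this kind of global reduction may run into the same memory limits as the earlier attempts.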

## 2.10. Inspecting Categories

Inspecting unique values in the `Game Category` column:

In [None]:
unique_game_categories = (
    reduced_dask_dataframe['Game Category'].unique().compute()
)
unique_game_categories

There is no inconsistency.

Checking the number of unique values in the `Retailer Location City` column:

In [None]:
unique_cities = (
    reduced_dask_dataframe['Retailer Location City'].nunique().compute()
)
print(f"Number of unique Retailer Location Cities: {unique_cities}")

The column contains 1,242 unique values. Inspecting and standardizing them manually is impractical at this volume, and converting all city names to a uniform text case (e.g., all lowercase or all uppercase) could create issues in Tableau downstream. Given these complications and the fact that I won't be using the column in my Python analysis, I have decided to leave the `Retailer Location City` column unchanged and handle any inconsistencies directly in Tableau.

Let's take a look at the `Retailer Location County` column.

In [None]:
# Getting the number of unique values in "Retailer Location County" column.
unique_counties = (
    reduced_dask_dataframe['Retailer Location County'].nunique().compute()
)
unique_counties

Our dataset contains 250 unique counties, whereas the state of Texas actually has 254 counties. Before exporting the aggregated dataset for use in Tableau, I will add the missing counties and impute the missing values. This will ensure a complete and accurate geographical visualization.
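
The county-completion step can be sketched as a reindex against a reference list. Everything below is illustrative: the county names, the sales figures, and the choice of filling gaps with 0 are assumptions, and a real reference list would contain all 254 Texas counties.

```python
import pandas as pd

# Hypothetical aggregated sales per county (toy numbers).
county_sales = pd.Series(
    {'County A': 120.0, 'County C': 75.0},
    name='Net Ticket Sales Amount',
)

# Hypothetical reference list; in practice it would hold all 254 counties.
all_counties = ['County A', 'County B', 'County C', 'County D']

# Counties in the reference list that never appear in the data.
missing_counties = sorted(set(all_counties) - set(county_sales.index))

# Adding the missing counties and imputing their sales (0 is an assumption).
complete_sales = county_sales.reindex(all_counties, fill_value=0.0)

print(missing_counties)
print(complete_sales)
```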

# 3. Analyzing Seasonal Trends

I'll focus here on the types of temporal analysis that aren't going to be covered by the Tableau dashboard downstream.

## 3.1. Seasonal Sales Trends

Preparing data for plotting:

In [None]:
# Grouping by "Calendar Year" and "Calendar Month",
# then summing "Net Ticket Sales Amount".
aggregated_sales = (
    reduced_dask_dataframe
    .groupby(['Calendar Year', 'Calendar Month'])
    ['Net Ticket Sales Amount']
    .sum()
    .compute()
)

# Resetting index to convert the Series to a DataFrame.
aggregated_sales = aggregated_sales.reset_index()

# Combining "Calendar Year" and "Calendar Month" into a single datetime column
# for easier plotting.
aggregated_sales['Date'] = pd.to_datetime(
    aggregated_sales['Calendar Year'].astype(str)
    + '-'
    + aggregated_sales['Calendar Month'].astype(str)
)

# Sorting the DataFrame by "Date" to ensure the line chart follows
# a chronological order.
aggregated_sales = aggregated_sales.sort_values('Date')

# Extracting month and year for seasonal trend plotting.
aggregated_sales['Month'] = aggregated_sales['Date'].dt.month
aggregated_sales['Year'] = aggregated_sales['Date'].dt.year
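
An alternative to string concatenation, shown here as a sketch on toy values, is to let `pd.to_datetime` assemble the dates from `year`, `month`, and `day` columns. This avoids relying on string-format inference; anchoring each date to the first day of the month is an arbitrary choice.

```python
import pandas as pd

# Toy year/month pairs mirroring the layout of aggregated_sales.
toy_sales = pd.DataFrame({
    'Calendar Year': [2021, 2020, 2021],
    'Calendar Month': [1, 9, 12],
})

# pd.to_datetime accepts a DataFrame with "year", "month", "day" columns.
parts = toy_sales.rename(
    columns={'Calendar Year': 'year', 'Calendar Month': 'month'}
).assign(day=1)
toy_sales['Date'] = pd.to_datetime(parts[['year', 'month', 'day']])

# Sorting chronologically, as in the cell above.
toy_sales = toy_sales.sort_values('Date').reset_index(drop=True)
print(toy_sales)
```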

Plotting:

In [None]:
# Setting up the seaborn style.
sns.set(style="whitegrid", rc={"axes.facecolor": "whitesmoke"})

# Creating a figure.
plt.figure(figsize=(10, 6))

# Defining the time segments, excluding the incomplete years 2020 and 2024.
years = aggregated_sales['Year'].unique()
for year in years:
    if year > 2020 and year <= 2023:  # Filtering years within the range.
        subset = aggregated_sales[aggregated_sales['Year'] == year]
        sns.lineplot(
            data=subset,
            x='Month',
            y='Net Ticket Sales Amount',
            label=f'{year}'
        )

# Formatting the plot.
plt.title('Total Net Ticket Sales Amount Over Time (Seasonal Trends)', fontsize=16)
plt.xlabel('')
plt.ylabel('Net Ticket Sales Amount, $', fontsize=14)
plt.xticks(
    ticks=range(1, 13),
    labels=[
        'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
    ],
    fontsize=12
)
plt.yticks(fontsize=12)

# Formatting y-axis to avoid scientific notation.
plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{x:,.0f}'))

plt.grid(True, color='white')
sns.despine(left=True, bottom=True)  # Removing the frame.
plt.legend(title='Year', fontsize=12, title_fontsize=14)  # Adding legend.
plt.tight_layout()
plt.show()

Seasonal trends observed from 2021 to 2023 indicate the following patterns: sales drop significantly in February but rise again in March. From March onwards, sales decrease, reaching a local minimum in June. There is an increase in July followed by another decrease. Typically, there is another peak in sales from November to January.
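
The same year-over-year comparison can also be read from a month-by-year table. Below is a sketch with invented numbers (not the actual sales figures), assuming a frame shaped like `aggregated_sales`.

```python
import pandas as pd

# Toy stand-in for aggregated_sales (values are invented).
toy = pd.DataFrame({
    'Year': [2021, 2021, 2021, 2022, 2022, 2022],
    'Month': [1, 2, 3, 1, 2, 3],
    'Net Ticket Sales Amount': [100.0, 80.0, 110.0, 120.0, 90.0, 130.0],
})

# Rows are months, columns are years.
by_month = toy.pivot(
    index='Month', columns='Year', values='Net Ticket Sales Amount'
)

# Month-over-month percentage change within each year.
monthly_change = by_month.pct_change() * 100

print(by_month)
print(monthly_change.round(1))
```

A drop that shows up with the same sign in the same month across every year column is a stronger hint of seasonality than a dip in a single year.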

## 3.2. Ticket Price Over Time

Let's take a look at how `Ticket Price` changed over the years.

Preparing data for plotting:

In [None]:
# Grouping by "Calendar Year" and "Calendar Month", then computing
# the average "Ticket Price".
average_price = (
    reduced_dask_dataframe
    .groupby(['Calendar Year', 'Calendar Month'])
    ['Ticket Price']
    .mean()
    .compute()
)

# Resetting index to convert the MultiIndex DataFrame to a flat DataFrame.
average_price = average_price.reset_index()

# Creating a "Date" column that combines "Calendar Year" and "Calendar Month"
# for plotting.
average_price['Date'] = pd.to_datetime(
    average_price['Calendar Year'].astype(str)
    + '-'
    + average_price['Calendar Month'].astype(str)
)

# Sorting by date to ensure the line chart follows chronological order.
average_price = average_price.sort_values(by='Date')
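As a side note, the string concatenation above relies on pandas parsing strings such as `2020-9`. A more explicit variant, sketched here on a made-up frame (`toy_price` is hypothetical), assembles the date from numeric `year`, `month` and `day` columns, which `pd.to_datetime` accepts directly. Fixing the day to the first of the month is my assumption.

```python
import pandas as pd

# Made-up values for illustration only.
toy_price = pd.DataFrame({
    'Calendar Year': [2021, 2020, 2020],
    'Calendar Month': [1, 12, 9],
    'Ticket Price': [5.5, 5.0, 4.8],
})

# pd.to_datetime can assemble dates from columns named year/month/day.
date_parts = pd.DataFrame({
    'year': toy_price['Calendar Year'],
    'month': toy_price['Calendar Month'],
    'day': 1,  # Assumption: anchoring every month to its first day.
})
toy_price['Date'] = pd.to_datetime(date_parts)

# Sorting chronologically, as in the cell above.
toy_price = toy_price.sort_values(by='Date').reset_index(drop=True)
print(toy_price)
```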

Plotting:

In [None]:
# Setting up the seaborn style.
sns.set(style="whitegrid", rc={"axes.facecolor": "whitesmoke"})

# Creating a figure.
plt.figure(figsize=(10, 6))

# Plotting.
sns.lineplot(data=average_price, x='Date', y='Ticket Price')

# Formatting the plot.
plt.title('Average Ticket Price Over Time', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Average Ticket Price, $', fontsize=14)

# Customizing the x-axis date format.
date_format = mdates.DateFormatter("%b %Y")
plt.gca().xaxis.set_major_formatter(date_format)

plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True, color='white')
sns.despine(left=True, bottom=True)  # Removing the frame.

plt.tight_layout()
plt.show()

## 3.3. Ticket Price Seasonal Trends

Preparing data for plotting. I'll use the already existing `average_price` DataFrame.

In [None]:
# Extracting month and year for seasonal trend plotting.
average_price['Month'] = average_price['Date'].dt.month
average_price['Year'] = average_price['Date'].dt.year

Plotting:

In [None]:
# Setting up the seaborn style.
sns.set(style="whitegrid", rc={"axes.facecolor": "whitesmoke"})

# Creating a figure.
plt.figure(figsize=(10, 6))

# Drawing one line per complete year; the partially covered years
# 2020 and 2024 are skipped.
years = average_price['Year'].unique()
# Sorting years to ensure the legend matches the plot order.
for year in sorted(years, reverse=True):
    if year > 2020 and year <= 2023:  # Filtering years within the range.
        subset = average_price[average_price['Year'] == year]
        sns.lineplot(data=subset, x='Month', y='Ticket Price', label=f'{year}')

# Formatting the plot.
plt.title('Average Ticket Price Over Time (Seasonal Trends)', fontsize=16)
plt.xlabel('')
plt.ylabel('Average Ticket Price, $', fontsize=14)
plt.xticks(
    ticks=range(1, 13),
    labels=[
        'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
    ],
    fontsize=12
)
plt.yticks(fontsize=12)
plt.grid(True, color='white')
sns.despine(left=True, bottom=True)  # Removing the frame.

# Customizing legend to match the plot order.
handles, labels = plt.gca().get_legend_handles_labels()
order = [labels.index('2023'), labels.index('2022'), labels.index('2021')]
plt.legend(
    [handles[i] for i in order], [labels[i] for i in order],
    title='Year',
    fontsize=12,
    title_fontsize=14
)

plt.tight_layout()
plt.show()

According to our plot for the years 2021-2023, we can observe a rapid increase in ticket prices starting from January and peaking between April and June. From there, the price remains relatively stable or slightly decreases until the next winter.

# 4. Optimization and Export for Tableau

## 4.1. Aggregation

Aggregating the data to reduce the dataset size and make it suitable for use in Tableau:

In [None]:
# Performing the aggregation.
monthly_agg = reduced_dask_dataframe.groupby([
    'Retailer Location County',
    'Retailer Location City',
    'Game Category',
    'Calendar Year',
    'Calendar Month'
]).agg({
    'Gross Ticket Sales Amount': 'sum',
    'Net Ticket Sales Amount': 'sum',
    'Promotional Tickets Amount': 'sum',
    'Cancelled Tickets Amount': 'sum',
    'Ticket Returns Amount': 'sum',
    'Ticket Price': 'sum'
})

# Computing the result to get a Pandas DataFrame.
monthly_df = monthly_agg.compute()

# Resetting the index to turn the grouped columns into regular columns.
monthly_df_reset = monthly_df.reset_index()

Removing the `Ticket Price` column, as it contains data for only one game category and thus won't be used in the Tableau dashboard:

In [None]:
monthly_df_reset = monthly_df_reset.drop(columns=['Ticket Price'])

## 4.2. Imputing Missing Values

As observed earlier, our dataset contains data for only 250 out of the 254 Texas counties. The four missing counties would show up as gaps on the Tableau dashboard.

To address this, I will create a DataFrame with all possible combinations of counties, game categories, years, and months. This will allow me to identify the missing rows and impute their values, ensuring a complete representation on the dashboard.

Below is the list of all counties in Texas:

In [None]:
counties_list = [
    'Anderson', 'Andrews', 'Angelina', 'Aransas', 'Archer', 'Armstrong', 
    'Atascosa', 'Austin', 'Bailey', 'Bandera', 'Bastrop', 'Baylor', 'Bee',
    'Bell', 'Bexar', 'Blanco', 'Borden', 'Bosque', 'Bowie', 'Brazoria',
    'Brazos', 'Brewster', 'Briscoe', 'Brooks', 'Brown', 'Burleson', 'Burnet',
    'Caldwell', 'Calhoun', 'Callahan', 'Cameron', 'Camp', 'Carson', 'Cass',
    'Castro', 'Chambers', 'Cherokee', 'Childress', 'Clay', 'Cochran', 'Coke',
    'Coleman', 'Collin', 'Collingsworth', 'Colorado', 'Comal', 'Comanche',
    'Concho', 'Cooke', 'Coryell', 'Cottle', 'Crane', 'Crockett', 'Crosby',
    'Culberson', 'Dallam', 'Dallas', 'Dawson', 'Deaf Smith', 'Delta', 'Denton',
    'DeWitt', 'Dickens', 'Dimmit', 'Donley', 'Duval', 'Eastland', 'Ector',
    'Edwards', 'Ellis', 'El Paso', 'Erath', 'Falls', 'Fannin', 'Fayette', 
    'Fisher', 'Floyd', 'Foard', 'Fort Bend', 'Franklin', 'Freestone', 'Frio',
    'Gaines', 'Galveston', 'Garza', 'Gillespie', 'Glasscock', 'Goliad',
    'Gonzales', 'Gray', 'Grayson', 'Gregg', 'Grimes', 'Guadalupe', 'Hale',
    'Hall', 'Hamilton', 'Hansford', 'Hardeman', 'Hardin', 'Harris', 'Harrison',
    'Hartley', 'Haskell', 'Hays', 'Hemphill', 'Henderson', 'Hidalgo', 'Hill',
    'Hockley', 'Hood', 'Hopkins', 'Houston', 'Howard', 'Hudspeth', 'Hunt',
    'Hutchinson', 'Irion', 'Jack', 'Jackson', 'Jasper', 'Jeff Davis',
    'Jefferson', 'Jim Hogg', 'Jim Wells', 'Johnson', 'Jones', 'Karnes',
    'Kaufman', 'Kendall', 'Kenedy', 'Kent', 'Kerr', 'Kimble', 'King', 'Kinney',
    'Kleberg', 'Knox', 'Lamar', 'Lamb', 'Lampasas', 'La Salle', 'Lavaca',
    'Lee', 'Leon', 'Liberty', 'Limestone', 'Lipscomb', 'Live Oak', 'Llano',
    'Loving', 'Lubbock', 'Lynn', 'McCulloch', 'McLennan', 'McMullen',
    'Madison', 'Marion', 'Martin', 'Mason', 'Matagorda', 'Maverick', 'Medina',
    'Menard', 'Midland', 'Milam', 'Mills', 'Mitchell', 'Montague',
    'Montgomery', 'Moore', 'Morris', 'Motley', 'Nacogdoches', 'Navarro',
    'Newton', 'Nolan', 'Nueces', 'Ochiltree', 'Oldham', 'Orange', 'Palo Pinto', 
    'Panola', 'Parker', 'Parmer', 'Pecos', 'Polk', 'Potter', 'Presidio',
    'Rains', 'Randall', 'Reagan', 'Real', 'Red River', 'Reeves', 'Refugio',
    'Roberts', 'Robertson', 'Rockwall', 'Runnels', 'Rusk', 'Sabine',
    'San Augustine', 'San Jacinto', 'San Patricio', 'San Saba', 'Schleicher',
    'Scurry', 'Shackelford', 'Shelby', 'Sherman', 'Smith', 'Somervell',
    'Starr', 'Stephens', 'Sterling', 'Stonewall', 'Sutton', 'Swisher',
    'Tarrant', 'Taylor', 'Terrell', 'Terry', 'Throckmorton', 'Titus',
    'Tom Green', 'Travis', 'Trinity', 'Tyler', 'Upshur', 'Upton', 'Uvalde',
    'Val Verde', 'Van Zandt', 'Victoria', 'Walker', 'Waller', 'Ward',
    'Washington', 'Webb', 'Wharton', 'Wheeler', 'Wichita', 'Wilbarger',
    'Willacy', 'Williamson', 'Wilson', 'Winkler', 'Wise', 'Wood', 'Yoakum',
    'Young', 'Zapata', 'Zavala'
]
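Before building the combinations, it can help to see which names from a reference list are actually absent from the data. The sketch below uses a hypothetical helper, `find_missing`, and made-up inputs. Comparing names case-insensitively and ignoring surrounding whitespace is my assumption; I have not checked here how the dataset spells county names.

```python
def find_missing(reference, observed):
    """Return the sorted reference names that do not occur in observed."""
    # Normalizing case and whitespace (an assumption about the data).
    observed_norm = {str(name).strip().casefold() for name in observed}
    return sorted(
        name for name in reference
        if name.strip().casefold() not in observed_norm
    )


# Made-up inputs for illustration only.
reference_counties = ['Anderson', 'Borden', 'DeWitt', 'Loving']
observed_counties = ['ANDERSON', 'Dewitt ']

missing = find_missing(reference_counties, observed_counties)
print(missing)
```

With the real data, `reference` would be `counties_list` and `observed` the unique values of `Retailer Location County`.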

Generating a DataFrame of all possible combinations:

In [None]:
# Getting unique values from relevant columns.
game_categories = monthly_df_reset['Game Category'].unique()
years = monthly_df_reset['Calendar Year'].unique()
months = monthly_df_reset['Calendar Month'].unique()

# Creating a DataFrame from the Cartesian product of the unique values
# and the "counties_list".
all_combinations = pd.MultiIndex.from_product(
    [counties_list, game_categories, years, months], 
    names=[
        'Retailer Location County',
        'Game Category',
        'Calendar Year',
        'Calendar Month'
    ]
).to_frame(index=False)
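To make the size of such a grid tangible, here is a small sketch with made-up inputs (all `toy_*` names and the category labels are hypothetical). The number of rows equals the product of the input lengths, and every combination appears exactly once.

```python
import pandas as pd

# Made-up inputs for illustration only.
toy_counties = ['Borden', 'Loving']
toy_categories = ['Category A', 'Category B']
toy_years = [2021, 2022]
toy_months = [1, 2, 3]

toy_grid = pd.MultiIndex.from_product(
    [toy_counties, toy_categories, toy_years, toy_months],
    names=[
        'Retailer Location County',
        'Game Category',
        'Calendar Year',
        'Calendar Month'
    ]
).to_frame(index=False)

# The row count is the product of the input lengths: 2 * 2 * 2 * 3.
expected_rows = (
    len(toy_counties) * len(toy_categories)
    * len(toy_years) * len(toy_months)
)
print(len(toy_grid), expected_rows)
```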

Merging with the original DataFrame to ensure all combinations are present:

In [None]:
merged_df = pd.merge(
    all_combinations,
    monthly_df_reset,
    on=[
        'Retailer Location County',
        'Game Category',
        'Calendar Year',
        'Calendar Month'
    ],
    how='outer'
)
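To see which rows the merge actually adds, the `indicator=True` option of `pd.merge` can be used. It appends a `_merge` column marking each row as `left_only`, `right_only` or `both`. The sketch below runs on made-up frames (`toy_grid` and `toy_sales` are hypothetical), not on the real data.

```python
import pandas as pd

keys = [
    'Retailer Location County',
    'Game Category',
    'Calendar Year',
    'Calendar Month'
]

# Made-up frames for illustration only.
toy_grid = pd.DataFrame({
    'Retailer Location County': ['Borden', 'Travis'],
    'Game Category': ['Category A', 'Category A'],
    'Calendar Year': [2021, 2021],
    'Calendar Month': [1, 1],
})
toy_sales = pd.DataFrame({
    'Retailer Location County': ['Travis'],
    'Retailer Location City': ['Austin'],
    'Game Category': ['Category A'],
    'Calendar Year': [2021],
    'Calendar Month': [1],
    'Net Ticket Sales Amount': [1000.0],
})

checked = pd.merge(toy_grid, toy_sales, on=keys, how='outer', indicator=True)

# Rows present only in the grid are the ones that need imputation.
added_rows = checked[checked['_merge'] == 'left_only']
print(added_rows[keys])
```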

Cleaning out the rows that fall outside of our original time range:

In [None]:
# Removing rows for 2020 up to and including August.
merged_df = merged_df[
    ~(
        (merged_df['Calendar Year'] == 2020)
        & (merged_df['Calendar Month'] <= 8)
    )
]

# Removing rows for 2024 from February onwards.
merged_df = merged_df[
    ~(
        (merged_df['Calendar Year'] == 2024)
        & (merged_df['Calendar Month'] >= 2)
    )
]
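The same range filter can be written as a single comparison by combining year and month into one sortable integer key (`year * 100 + month`). This is just an alternative sketch on a made-up frame. The bounds `202009` and `202401` encode September 2020 and January 2024, which is the range implied by the two filters above.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Calendar Year': [2020, 2020, 2022, 2024, 2024],
    'Calendar Month': [8, 9, 6, 1, 2],
})

# Building a sortable key such as 202009 for September 2020.
period_key = toy['Calendar Year'] * 100 + toy['Calendar Month']

# Keeping September 2020 through January 2024 (both ends inclusive).
filtered = toy[period_key.between(202009, 202401)]
print(filtered)
```

Unlike the two separate filters, this version would also drop any rows before 2020 or after 2024, should such rows ever appear.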

Filling missing values for newly created rows:

In [None]:
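# Assigning the result instead of using inplace=True, since merged_df
# is a filtered slice of the previous DataFrame.
merged_df = merged_df.fillna(
    {
        'Promotional Tickets Amount': 0,
        'Cancelled Tickets Amount': 0,
        'Ticket Returns Amount': 0
    }
)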

For the `Gross Ticket Sales Amount` and `Net Ticket Sales Amount` columns, I replace the missing values with the mean over all rows that share the same `Game Category`, `Calendar Year`, and `Calendar Month`.

In [None]:
# Calculating mean values.
mean_values = (
    monthly_df_reset
    .groupby(['Game Category', 'Calendar Year', 'Calendar Month'])
    [['Gross Ticket Sales Amount', 'Net Ticket Sales Amount']]
    .mean()
    .reset_index()
)

# Merging the mean values back into the merged DataFrame.
merged_df = pd.merge(
    merged_df,
    mean_values,
    on=['Game Category', 'Calendar Year', 'Calendar Month'],
    how='left',
    suffixes=('', '_mean')
)

# Replacing missing values with the mean.
for column in ['Gross Ticket Sales Amount', 'Net Ticket Sales Amount']:
    merged_df[column] = np.where(
        merged_df[column].isna(),
        merged_df[column + '_mean'],
        merged_df[column]
    )

# Dropping the mean columns as they are no longer needed.
merged_df.drop(
    ['Gross Ticket Sales Amount_mean', 'Net Ticket Sales Amount_mean'],
    axis=1,
    inplace=True
)

`merged_df` now contains our original data plus any missing combinations, with `Gross Ticket Sales Amount` and `Net Ticket Sales Amount` filled with mean values where necessary.

In [None]:
merged_df.head()
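For reference, group-wise mean imputation can also be written more compactly with `groupby(...).transform('mean')` followed by `fillna`. The sketch below uses a made-up frame (`toy` is hypothetical) and is an alternative formulation, not the code used above. The group mean ignores missing values, so it is computed from the observed rows only.

```python
import numpy as np
import pandas as pd

group_keys = ['Game Category', 'Calendar Year', 'Calendar Month']

# Made-up values for illustration only.
toy = pd.DataFrame({
    'Retailer Location County': [
        'Travis', 'Harris', 'Loving', 'Travis', 'Loving'
    ],
    'Game Category': ['Category A'] * 3 + ['Category B'] * 2,
    'Calendar Year': [2021] * 5,
    'Calendar Month': [1] * 5,
    'Net Ticket Sales Amount': [100.0, 300.0, np.nan, 50.0, np.nan],
})

# Computing each group's mean (NaN is skipped) aligned to the rows.
group_mean = (
    toy.groupby(group_keys)['Net Ticket Sales Amount'].transform('mean')
)

# Filling only the missing entries with their group's mean.
toy['Net Ticket Sales Amount'] = (
    toy['Net Ticket Sales Amount'].fillna(group_mean)
)
print(toy)
```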

Saving the DataFrame to a CSV file:

In [None]:
merged_df.to_csv('monthly_aggregation_imputed_.csv', index=False)

## 4.3. Creating a Pivot Table

In the Tableau dashboard, I want to enable filtering by the following Measurement Types:

- `Gross Ticket Sales Amount`
- `Net Ticket Sales Amount`
- `Promotional Tickets Amount`
- `Cancelled Tickets Amount`
- `Ticket Returns Amount`

To achieve this, I will unpivot (melt) these measurement columns into a single `Amount` column, with a `Sales Type` column naming the measurement.

First, I'll remove the `Retailer Location City` column, as I'm not going to pivot to that level of granularity.

In [None]:
merged_df = merged_df.drop(columns='Retailer Location City')

Checking the resulting table before proceeding:

In [None]:
merged_df.head()

After removing the column, I will group the DataFrame by the remaining categorical columns and sum the numerical columns:

In [None]:
grouped_df = merged_df.groupby([
    'Retailer Location County', 
    'Game Category', 
    'Calendar Year', 
    'Calendar Month'
]).sum()

# Resetting the index of the grouped DataFrame.
grouped_df = grouped_df.reset_index()

Reshaping the table into long format with `pd.melt`:

In [None]:
# Melting the five amount columns into a long format: one row per
# combination of identifiers and measurement type.
long_format_df = pd.melt(
    frame=grouped_df, 
    id_vars=[
        'Retailer Location County',
        'Game Category',
        'Calendar Year',
        'Calendar Month'
    ], 
    value_vars=[
        'Gross Ticket Sales Amount', 
        'Net Ticket Sales Amount', 
        'Promotional Tickets Amount', 
        'Cancelled Tickets Amount', 
        'Ticket Returns Amount'
    ],
    var_name='Sales Type', 
    value_name='Amount'
)

The result:

In [None]:
long_format_df.head()
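To illustrate what the reshaping does, here is a minimal sketch on a made-up two-row frame (`toy_wide` is hypothetical). Each value column becomes a block of rows, so the result has the number of input rows times the number of melted columns.

```python
import pandas as pd

# Made-up values for illustration only.
toy_wide = pd.DataFrame({
    'Retailer Location County': ['Travis', 'Loving'],
    'Gross Ticket Sales Amount': [110.0, 20.0],
    'Net Ticket Sales Amount': [100.0, 18.0],
})

toy_long = pd.melt(
    frame=toy_wide,
    id_vars=['Retailer Location County'],
    value_vars=['Gross Ticket Sales Amount', 'Net Ticket Sales Amount'],
    var_name='Sales Type',
    value_name='Amount'
)

# 2 input rows * 2 melted columns = 4 rows in long format.
print(toy_long)
```

In Tableau, a filter on the `Sales Type` column should then be enough to switch between measurements.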

Saving the DataFrame to a CSV file:

In [None]:
long_format_df.to_csv('sales_pivot_table_.csv', index=False)