# Lecture 3: Ethical Considerations and Data Visualization
Date: November 16, 2023
Duration: 3 hours

## Table of Contents
1. [Ethical Implications in Media Data Analysis](#Ethical-Implications-in-Media-Data-Analysis)
2. [Data Visualization Techniques](#Data-Visualization-Techniques)
3. [Practical Session: Data Visualization with Python](#Practical-Session:-Data-Visualization-with-Python)
4. [Q&A and Discussion](#Q&A-and-Discussion)


## Ethical Implications in Media Data Analysis

### Privacy Concerns

In the realm of media data analysis, privacy concerns are paramount, especially when dealing with vast amounts of personal data. These concerns broadly fall into two categories: personal privacy concerns, relating to the exposure of sensitive information, and social privacy concerns, associated with the exploitation of personal data for illicit purposes.

**Personal Privacy Concerns**

The digital age has led to an exponential increase in the collection and storage of personal information. Media data analysis often involves delving into this sensitive data, which can include anything from personal preferences and online behavior to confidential communication and financial information. The primary risk here is the inadvertent exposure of such sensitive data. This exposure not only violates individual privacy but can also lead to identity theft, financial fraud, and other personal security issues.

A notable aspect of personal privacy concerns is the ethical responsibility to safeguard data. Analysts must ensure that data is anonymized and securely stored, and that personal identifiers are separated from the analyzed datasets. Furthermore, they must adhere to legal frameworks like the General Data Protection Regulation (GDPR) in the EU, which mandates strict guidelines on data collection, processing, and storage.
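The separation of personal identifiers from the analyzed data can be sketched as follows. This is a minimal illustration, not a complete privacy solution: the records, the `pseudonymize` helper, and the field names are all hypothetical, and a salted hash is only one possible technique. Note also that under the GDPR, pseudonymized data that can still be linked back to a person (here, through `identity_table`) remains personal data.

```python
import hashlib
import secrets

# Hypothetical raw records: the e-mail addresses and values are made up.
raw_records = [
    {"email": "alice@example.com", "age": 34, "minutes_watched": 120},
    {"email": "bob@example.com", "age": 51, "minutes_watched": 45},
    {"email": "alice@example.com", "age": 34, "minutes_watched": 60},
]

# Secret salt, generated once and stored apart from the analysis data.
SALT = secrets.token_bytes(16)


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Return a stable pseudonym for an identifier using a salted SHA-256 hash."""
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()[:16]


# Lookup table linking pseudonyms to identifiers: keep it in separate, restricted storage.
identity_table = {}
# Dataset handed to analysts: it contains no direct identifiers.
analysis_records = []

for record in raw_records:
    pid = pseudonymize(record["email"], SALT)
    identity_table[pid] = record["email"]
    analysis_records.append(
        {"person_id": pid, "age": record["age"], "minutes_watched": record["minutes_watched"]}
    )

for row in analysis_records:
    print(row)
```

The same person always receives the same `person_id`, so per-person analysis still works without exposing the e-mail address in the analysis dataset.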

**Social Privacy Concerns**

On a broader scale, social privacy concerns revolve around the misuse of personal data for illegal or unethical activities. A prime example is the Facebook-Cambridge Analytica scandal. In this case, Cambridge Analytica, a political consulting firm, harvested personal data from millions of Facebook users without their consent. The data was then used in attempts to influence voter opinion and behavior in several high-profile elections, demonstrating how personal data can be weaponized to manipulate social and political landscapes.

Such incidents highlight the dangers of personal data falling into the wrong hands, leading to large-scale privacy violations and the undermining of democratic processes. They also underscore the need for more stringent data governance and ethical standards in media data analysis.

The balance between leveraging data for insightful analysis and protecting individual and social privacy is delicate. As such, data analysts must be vigilant in implementing robust privacy protection measures and maintaining ethical integrity in their work. The future of data analysis in media depends not only on technological advancements but also on the adherence to ethical practices that respect and preserve the privacy of individuals and societies at large.

### Misinformation

Misinformation, often encapsulated in the term "fake news," poses a significant challenge in the digital age. The interplay of data analysis and misinformation, particularly in political propaganda, has become a matter of global concern. This phenomenon involves two key aspects: the packaging of false information using data and the targeted dissemination of this misinformation.

**Data-Driven Packaging and Dissemination of Misinformation**

Data plays a crucial role in shaping and spreading misinformation. By analyzing user data, propagators of fake news can create and package content that resonates with specific demographics, reinforcing existing biases or influencing opinions. This is not just limited to the content of the news but extends to its presentation and the context in which it is shared.

The targeting aspect is equally sophisticated. Data analytics enables the identification of specific user segments and geographical areas that are most susceptible to certain types of misinformation. This targeted approach ensures that fake news reaches the most receptive audiences, maximizing its impact. This strategy was notably seen in political campaigns where data was used to identify and influence potential voters with tailored misinformation.

**Fake News Debate and Data's Dual Role**

The debate around fake news underscores the dual role of data – as both a tool for enlightenment and a weapon for misinformation. While data analysis can uncover truths and provide clarity, it can also be used to distort reality and spread falsehoods. The challenge lies in discerning the intention and verifying the authenticity of data-driven narratives in media.

**Deepfakes: A New Frontier in Misinformation**

A particularly alarming development in the misinformation landscape is the advent of deepfakes. These are hyper-realistic audiovisual content created with advanced machine learning techniques, in which a person in an existing image or video is replaced with someone else's likeness. Deepfakes take misinformation to a new level, enabling the creation of entirely fabricated videos that are remarkably convincing.

The potential for misuse of deepfakes is vast, ranging from smearing public figures to inciting unrest or influencing elections. As the authenticity of digital content becomes harder to ascertain, misinformation becomes harder to combat. The creation and circulation of deepfakes have sparked debates on the ethics of AI and the need for robust detection techniques.

As the landscape of data-driven misinformation, particularly in the form of deepfakes, evolves, it underscores the need for digital literacy, ethical data practices, and advanced technological solutions to identify and combat fake news. This dynamic field continues to pose both challenges and opportunities for media professionals, policymakers, and the public at large.

### Algorithmic Bias

Algorithmic bias refers to systematic and repeatable errors in a computer system that create unfair outcomes, such as privileging one arbitrary group of users over others. Distinguishing between error and bias in algorithms is crucial. An error is a random deviation from accuracy or correctness, often due to noise or limitations in the data or the algorithm. Bias, on the other hand, is a non-random error that consistently skews results in a particular direction. Unlike random errors, biases can systematically disadvantage certain groups.
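The difference between random error and bias can be made concrete with a small simulation. This is an illustrative sketch: the true value, the noise level, and the size of the shift are arbitrary numbers chosen for the example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible
true_value = 100.0

# Random error: noise centred on zero, so over- and under-estimates cancel out.
noisy = [true_value + rng.gauss(0, 5) for _ in range(10_000)]

# Bias: the same kind of noise plus a constant shift of +3 (an arbitrary amount).
biased = [true_value + 3.0 + rng.gauss(0, 5) for _ in range(10_000)]

mean_random_error = statistics.mean(noisy) - true_value
mean_biased_error = statistics.mean(biased) - true_value

print(f"Mean error with random noise only: {mean_random_error:+.2f}")
print(f"Mean error with systematic bias:   {mean_biased_error:+.2f}")
```

Averaging over many measurements shrinks the random error towards zero, while the systematic shift stays in place however much data is collected.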

Bias in algorithms can be particularly dangerous because it can perpetuate and amplify societal prejudices. Algorithms, often perceived as neutral and objective, can inadvertently reinforce human prejudices if not carefully designed and monitored.

**Examples Demonstrating the Impact of Algorithmic Bias**

For instance, consider a hiring algorithm that has been trained on historical hiring data where a certain gender or ethnicity was preferred. This algorithm could continue to favor candidates from that group, perpetuating past biases. Similarly, in facial recognition technology, if the training data lacks diversity, the algorithm might perform poorly for underrepresented groups, leading to misidentification or discriminatory practices.

**Bias in Artificial Intelligence**

In AI, bias typically manifests in two primary ways: bias in training datasets and feedback loop bias.

**Bias in Training Datasets** 

AI systems learn to make decisions based on the data they are trained on. If this data is biased, the AI's decisions will reflect that bias. This is a common issue in machine learning where the training data might not be representative of the broader population or might contain historical biases. For example, an AI trained for loan approval using past data might deny loans to certain demographics if those groups were historically marginalized in financial systems.
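One simple way to look for this kind of bias before training is to compare outcome rates across groups in the historical data. The sketch below uses made-up loan decisions and hypothetical group labels; comparing approval rates is only a first diagnostic, not a full fairness audit.

```python
from collections import defaultdict

# Hypothetical past loan decisions as (group, approved) pairs. The data is made up.
past_decisions = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]


def approval_rates(decisions):
    """Return the share of approved applications for each group."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, was_approved in decisions:
        total[group] += 1
        approved[group] += int(was_approved)
    return {group: approved[group] / total[group] for group in total}


rates = approval_rates(past_decisions)
# Ratio of the lowest to the highest approval rate; 1.0 would mean equal rates.
rate_ratio = min(rates.values()) / max(rates.values())

print(rates)
print(f"Approval-rate ratio: {rate_ratio:.2f}")
```

A model fitted to these records would have every incentive to reproduce the gap between the two groups, since that gap is what the data rewards.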

**Feedback Loop Bias** 

This occurs when an AI system is trained on the results of its own decisions, leading to reinforcement and amplification of initial biases. A classic example is in predictive policing, where AI tools predict future crimes based on past crime data. If police disproportionately patrol certain neighborhoods, more crimes are recorded there, not necessarily because these areas are more prone to crime, but because of increased surveillance. An AI trained on this data will continue to flag these neighborhoods as high-risk, reinforcing the bias and potentially leading to over-policing.
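The mechanism can be illustrated with a deliberately simplified simulation. All names and numbers below are hypothetical: two neighborhoods have the same true crime rate, but crimes are more likely to be recorded where the patrol is sent, and the "model" always sends the patrol to the area with more recorded crime.

```python
# Two hypothetical neighborhoods with the SAME underlying crime rate.
TRUE_CRIMES_PER_DAY = 10      # identical in both areas (made-up number)
DETECTION_IF_PATROLLED = 0.5  # share of crimes recorded where police patrol
DETECTION_IF_NOT = 0.1        # share of crimes recorded where they do not

# Historical records start slightly uneven, e.g. because of earlier patrol patterns.
recorded = {"north": 55.0, "south": 50.0}

for day in range(30):
    # "Model": send the patrol wherever more crime has been recorded so far.
    patrolled = max(recorded, key=recorded.get)
    for area in recorded:
        rate = DETECTION_IF_PATROLLED if area == patrolled else DETECTION_IF_NOT
        recorded[area] += TRUE_CRIMES_PER_DAY * rate

print(recorded)
```

After 30 simulated days the recorded gap has grown from 5 to 125 crimes, even though both areas experienced exactly the same amount of crime. The data reflects where the patrol was sent, not where crime happened.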

The challenge with algorithmic bias in AI is that it can be self-perpetuating and often invisible, making it difficult to identify and correct. This calls for continuous monitoring and updating of AI systems, along with a diverse and inclusive approach to data collection and algorithm development. Understanding and mitigating algorithmic bias is not just a technical issue but a fundamental ethical imperative in the development and deployment of AI systems.

## Data Visualization Techniques

### Charts, Graphs, and Dashboards

Data visualization is a critical aspect of data analysis, helping to communicate complex data in a clear and effective manner. Different types of visualizations can be used depending on the nature of the data and the story one wishes to tell.

**Main Types of Visualizations and Python Libraries**

- **Line Charts:** Ideal for showing trends over time. They are simple and effective for time-series data.
  - Python library: `matplotlib`

![sphx_glr_simple_plot_001.webp](attachment:sphx_glr_simple_plot_001.webp)

- **Bar Charts:** Useful for comparing quantities across different categories.
  - Python libraries: `matplotlib` and `seaborn`

![barplot_7_0.png](attachment:barplot_7_0.png)

- **Pie Charts:** Best for showing the composition of a whole. They display data in a proportional manner.
  - Python library: `matplotlib`

![sphx_glr_pie_features_001.png](attachment:sphx_glr_pie_features_001.png)

- **Scatter Plots:** Great for showing the relationship between two continuous variables.
  - Python libraries: `matplotlib` and `seaborn`

![scatterplot_3_0.png](attachment:scatterplot_3_0.png)

- **Heatmaps:** Useful for representing the intensity of data, often used in correlation matrices.
  - Python library: `seaborn`

![seaborn-heatmap-5.png](attachment:seaborn-heatmap-5.png)

- **Histograms:** Ideal for showing the distribution of a dataset.
  - Python libraries: `matplotlib` and `seaborn`

![sphx_glr_histogram_features_001.png](attachment:sphx_glr_histogram_features_001.png)

Each of these visualizations helps in understanding different aspects of the data, and the choice depends on what aspect of the data one wishes to highlight.
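The practical session in this lecture works through line, bar, scatter, histogram, and heatmap examples with Seaborn, so the sketch below covers the remaining type, a pie chart, using `matplotlib`. The categories and percentages are invented purely for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical audience share by platform; the numbers are made up.
shares = {"Streaming": 55, "Broadcast TV": 30, "Cinema": 15}

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(
    list(shares.values()),
    labels=list(shares.keys()),
    autopct="%1.0f%%",  # print each slice's percentage
    startangle=90,      # start the first slice at the top
)
ax.set_title("Audience Share by Platform (illustrative data)")
# In a script, call plt.show() here; a notebook displays the figure automatically.
```

Pie charts tend to work best with a handful of categories; with many small slices, a bar chart is usually easier to read.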

**Dashboards**

Dashboards are comprehensive tools that provide a consolidated view of various data points and metrics. They are interactive and allow for real-time data visualization, making them invaluable for monitoring and decision-making processes.

Python libraries for creating dashboards include:

- **Dash:** A Python framework for building web applications, ideal for creating interactive dashboards.
- **Plotly:** Often used in conjunction with Dash, it offers sophisticated visualization capabilities.
- **Streamlit:** An open-source Python framework that turns scripts into shareable web applications, notable for its ease of use.

Here's an example of a dashboard created using Streamlit:

https://gptlab.streamlit.app/

In summary, effective data visualization through charts, graphs, and dashboards can significantly enhance the understanding and communication of complex data sets. Python, with its rich ecosystem of libraries, provides powerful tools to create these visualizations.

### Tools

In the field of data analysis, especially in media, the ability to visualize and interpret data effectively is crucial. Business Intelligence (BI) tools like Tableau and Power BI, along with Python libraries, play a pivotal role in this aspect.

**Tableau and Power BI: Business Intelligence Tools**

- **Tableau:** Known for its user-friendly interface, Tableau is a powerful BI tool that enables users to create complex and visually appealing data visualizations. It excels in handling large datasets and offers a wide range of visualization types. Tableau's drag-and-drop functionality makes it accessible even to those with limited coding skills. However, its advanced features have a steeper learning curve, and the cost can be a limiting factor for smaller organizations or individual users.
- **Power BI:** Developed by Microsoft, Power BI is known for its integration with other Microsoft products like Excel and Azure. It's particularly useful for businesses already embedded in the Microsoft ecosystem. Power BI offers robust data modeling capabilities and real-time analytics. While it is generally more cost-effective than Tableau, it may not be as powerful in handling very large datasets and complex visualizations.

**Python Libraries for Data Visualization**

Python, with its extensive range of libraries, is a versatile tool for data visualization, catering to both simple and complex requirements. The most popular libraries include:

- **Matplotlib:** A foundational library that offers great flexibility and control over every element of a plot. Ideal for creating static, animated, and interactive visualizations.
- **Seaborn:** Built on top of Matplotlib, Seaborn simplifies the creation of more complex statistical visualizations.
- **Plotly:** Enables interactive visualizations that can be embedded in web applications, offering features similar to BI tools like Tableau and Power BI.

Python libraries are incredibly powerful for customized data analysis and visualization, particularly for those comfortable with coding. They offer the flexibility to create bespoke visualizations tailored to specific analytical needs. However, they require programming knowledge, which might be a barrier for non-technical users.

**Comparing BI Tools and Python Libraries**

BI tools like Tableau and Power BI are more user-friendly and require less technical know-how, making them suitable for users who prefer a more intuitive, drag-and-drop interface. They are also ideal for creating dashboards and reports that are easy to share and understand by non-technical stakeholders.

Python libraries, on the other hand, offer more customization and are better suited for in-depth data analysis. They are more cost-effective (often free) and are constantly updated by a large community of developers.

**Other Visualization Tools**

Other notable tools include D3.js for creating interactive web-based visualizations and ggplot2 in R for creating visually appealing and complex graphics.

Each of these tools and libraries has its unique strengths and limitations. The choice depends on the specific requirements of the project, the technical proficiency of the users, and the nature of the data being analyzed.

## Practical Session: Data Visualization with Python

### Guglielmo

I need a set of Jupyter Notebook cells dedicated to content (explanation) and code for data visualization in Python.
I ask you to prepare some examples of the most popular types of visualization in Python using the two datasets I have uploaded. 
Choose the one that is most suitable depending on the type of visualization. 
Use the Seaborn Library.
You will write an explanation, the code and the visualization.
We will proceed one example at a time.
Start with a line plot.

### Data Analysis GPT

Great! We'll start by exploring the two datasets you've uploaded to understand their structure and content. This will help us determine the best dataset to use for creating a line plot example.

A line plot is typically used to visualize data points sequentially connected by straight line segments. It's often used to display trends over time, so any dataset with a time series would be suitable.

Let's first load these datasets and take a quick look at their contents to decide which one would be more appropriate for creating a line plot. I'll start by loading and inspecting the first few rows of each dataset.

### Dataset Overview

1. **Cinema Ticket Dataset (`cinema_ticket_data`):**
   - This dataset includes fields like `film_code`, `cinema_code`, `total_sales`, `tickets_sold`, and `date`.
   - It appears to have a time component (`date`), which could be suitable for a line plot, especially if we are interested in trends over time such as total sales or tickets sold.

2. **Episodes Dataset (`episodes_data`):**
   - This dataset contains information about different episodes, including `No. in series`, `Title`, `Original air date`, and `U.S. viewers (millions)`.
   - The `Original air date` suggests a time series component, which might be interesting to plot, such as tracking viewership over time.
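The loading and inspection step itself is not shown in this transcript. A minimal sketch of what it could look like is given below; because the uploaded files are not available here, it reads a few made-up rows from an in-memory CSV that uses the cinema column names listed above. The name `cinema_sample` and all values are hypothetical.

```python
import io

import pandas as pd

# Stand-in for the uploaded cinema file: made-up rows with the columns named above.
sample_csv = io.StringIO(
    "film_code,cinema_code,total_sales,tickets_sold,date\n"
    "1492,304,3900000,26,2018-05-05\n"
    "1492,352,3360000,42,2018-05-05\n"
    "1483,304,1200000,10,2018-05-06\n"
)

# With a real file, pass its path to pd.read_csv instead of the StringIO object.
cinema_sample = pd.read_csv(sample_csv, parse_dates=["date"])

print(cinema_sample.head())   # first rows
print(cinema_sample.dtypes)   # column types; 'date' is parsed as a datetime
print(cinema_sample.shape)    # (rows, columns)
```

Checking `dtypes` early is useful because date columns often load as plain strings, and time-based plots need proper datetime values.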

### Line Plot Example

Given that both datasets have a time component, we can create line plots for each. However, since we are proceeding one example at a time, I'll start with the Cinema Ticket Dataset. We can plot the total sales over time.

Let's proceed by creating a line plot showing the trend of `total_sales` over time using the Seaborn library. I'll first aggregate the total sales by date to get a clearer picture of the trend.

Here's a line plot that visualizes the trend of total cinema sales over time using the `cinema_ticket_data` dataset.

![file-xhN2ZiZId8pTkQHu1LFEb2wT.png](attachment:file-xhN2ZiZId8pTkQHu1LFEb2wT.png)

### Explanation of the Visualization:
- **X-axis (Date):** Represents the dates on which the sales were recorded. 
- **Y-axis (Total Sales):** Shows the total sales in the cinema on each date.
- **Line Trend:** The line connects the total sales day by day, providing a visual representation of how sales fluctuated over time.

This type of visualization is particularly useful for identifying trends, seasonal patterns, or anomalies in time-series data.

Would you like to proceed with another type of visualization using one of the datasets, or is there anything specific you'd like to explore further in this plot?

In [None]:
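import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes `cinema_ticket_data` and `episodes_data` are already loaded as pandas DataFrames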

# Convert 'date' column to datetime format
cinema_ticket_data['date'] = pd.to_datetime(cinema_ticket_data['date'])

# Aggregate total sales by date
total_sales_by_date = cinema_ticket_data.groupby('date')['total_sales'].sum().reset_index()

# Create a line plot
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='total_sales', data=total_sales_by_date)
plt.title('Total Cinema Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### Guglielmo

Can you please create another Line Plot example using the episodes dataset?

### Data Analysis GPT

Absolutely! For the episodes dataset, we can create a line plot to visualize the trend in U.S. viewership over time. Given that each episode has an original air date and a corresponding viewership number, this is a good opportunity to see how the audience size changed throughout the series.

Let's proceed by creating a line plot that shows the U.S. viewership (in millions) of each episode over time. We'll first ensure that the `Original air date` is correctly formatted as a date and then plot the data.

Here's the line plot for the episodes dataset, showcasing the trend in U.S. viewership over time.

![line_plot1.png](attachment:line_plot1.png)

### Explanation of the Visualization:
- **X-axis (Original Air Date):** This axis displays the dates on which the episodes were aired.
- **Y-axis (U.S. Viewers in Millions):** Indicates the number of viewers in the United States for each episode, measured in millions.
- **Line Trend:** The line connects the viewership figures for each episode, offering a visual depiction of how the audience size fluctuated throughout the series.

This visualization can be particularly insightful for understanding the popularity of the series over time, pinpointing any episodes with particularly high or low viewership, and identifying any general trends or patterns.

Would you like to explore another type of data visualization, or is there something else you would like to know about this plot?

In [None]:
# Convert 'Original air date' to datetime format
episodes_data['Original air date'] = pd.to_datetime(episodes_data['Original air date'])

# Create a line plot for the episodes data
plt.figure(figsize=(12, 6))
sns.lineplot(x='Original air date', y='U.S. viewers (millions)', data=episodes_data)
plt.title('U.S. Viewership of Episodes Over Time')
plt.xlabel('Original Air Date')
plt.ylabel('U.S. Viewers (millions)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### Guglielmo

Now please prepare a bar plot example

### Data Analysis GPT

For a bar plot example, let's use the cinema ticket dataset again. A bar plot is great for comparing different groups or categories. In this context, we could compare the total sales or the number of tickets sold for different films or at different cinemas.

I'll create a bar plot that shows the total sales for different films. We'll aggregate the sales by `film_code` and then plot these aggregated values. This will give us a visual representation of which films had the highest total sales in the dataset.

Let's proceed with this bar plot example:

Here's a bar plot that visualizes the total sales of the top 10 films from the cinema ticket dataset.

![bar.png](attachment:bar.png)

### Explanation of the Visualization:
- **X-axis (Film Code):** Displays the codes of different films.
- **Y-axis (Total Sales):** Represents the total sales accumulated for each film.
- **Bars:** Each bar corresponds to a film and its height indicates the total sales for that film.

This type of visualization is particularly useful for comparing quantities across different categories—in this case, the financial performance of different films. The plot clearly shows which films had the highest sales, making it easy to identify the most commercially successful films in the dataset.

Would you like to explore another type of data visualization, or do you have any specific modifications in mind for this plot?

In [None]:
# Aggregate total sales by film code
total_sales_by_film = cinema_ticket_data.groupby('film_code')['total_sales'].sum().reset_index()

# Sort by total sales and keep the top 10 films
top_films = total_sales_by_film.sort_values(by='total_sales', ascending=False).head(10)

# Create a bar plot; `order` keeps the bars sorted by sales rather than by film code
plt.figure(figsize=(12, 6))
sns.barplot(x='film_code', y='total_sales', data=top_films, order=top_films['film_code'].tolist())
plt.title('Top 10 Films by Total Sales')
plt.xlabel('Film Code')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### Guglielmo

Can you use the tickets dataset and create a scatterplot of the relationship between occupancy percentage and total sales?

### Data Analysis GPT

Certainly! Creating a scatter plot to explore the relationship between occupancy percentage (`occu_perc`) and total sales (`total_sales`) in the cinema ticket dataset is a great idea. This will help us understand if there's a correlation between how full the cinema was and the sales generated.

Let's create this scatter plot:

![file-yFav5GRUk5ObYu0yrBCsbHHY.png](attachment:file-yFav5GRUk5ObYu0yrBCsbHHY.png)

Here's the scatter plot illustrating the relationship between occupancy percentage and total sales from the cinema ticket dataset.

### Explanation of the Visualization:
- **X-axis (Occupancy Percentage):** Indicates the percentage of the cinema's capacity that was filled.
- **Y-axis (Total Sales):** Shows the total sales generated for each data point.
- **Data Points:** Each point represents a unique instance of cinema sales. The location of the point shows the occupancy percentage and the corresponding total sales for that instance.

This visualization can be insightful for seeing if there's a trend or correlation between how full a cinema is and the total sales it generates. For instance, a positive correlation would be indicated by higher total sales with higher occupancy percentages.
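The visual impression from a scatter plot can be complemented with a correlation coefficient. The sketch below computes Pearson's r on a few made-up rows; only the column names `occu_perc` and `total_sales` come from the dataset, and the values are invented.

```python
import pandas as pd

# Made-up rows using the column names from the cinema ticket dataset.
sample = pd.DataFrame({
    "occu_perc": [10.0, 25.0, 40.0, 60.0, 85.0],
    "total_sales": [400_000, 900_000, 1_500_000, 2_600_000, 3_300_000],
})

# Pearson correlation: +1 is a perfect positive linear relationship, 0 means none.
r = sample["occu_perc"].corr(sample["total_sales"])
print(f"Pearson r = {r:.2f}")
```

On the real data the same call would be `cinema_ticket_data['occu_perc'].corr(cinema_ticket_data['total_sales'])`. Pearson's r only captures linear association, so it is worth reading alongside the plot rather than instead of it.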

In [None]:
# Create a scatter plot for occupancy percentage vs total sales
plt.figure(figsize=(10, 6))
sns.scatterplot(x='occu_perc', y='total_sales', data=cinema_ticket_data)
plt.title('Total Sales vs. Occupancy Percentage')
plt.xlabel('Occupancy Percentage')
plt.ylabel('Total Sales')
plt.show()


### Guglielmo

The occupancy percentage should not go over 100%. How come you have higher values? Is there any wrong data?

### Data Analysis GPT

You're absolutely right; occupancy percentages should logically not exceed 100%. The presence of values over 100% suggests there might be inaccuracies or anomalies in the data. To investigate this, let's first check the range of values in the `occu_perc` (occupancy percentage) column. If there are values exceeding 100%, we can then look more closely to understand the nature of these anomalies. Let's proceed with this analysis.

The analysis of the `occu_perc` (occupancy percentage) column in the cinema ticket dataset reveals:

- The minimum occupancy percentage is 0.0% (which is expected).
- The maximum occupancy percentage is 147.5%, which is clearly beyond the logical maximum of 100%.
- There are indeed entries with occupancy percentages exceeding 100%.

This suggests there are anomalies or inaccuracies in the dataset. The reasons for these anomalies could vary—it might be due to data entry errors, issues in the way occupancy percentage is calculated, or other unknown factors.

Before proceeding with further analysis or visualization involving occupancy percentage, it would be advisable to address these anomalies. This could involve correcting the data (if the correct values are known), removing these anomalous entries, or consulting with the data provider for clarification.
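The check described above, and one possible way of handling the anomalies, might look like the following sketch. It runs on made-up values; only the column name `occu_perc` is taken from the dataset, and dropping rows is just one of the options listed, not necessarily the right one for this data.

```python
import pandas as pd

# Made-up rows; only the column name `occu_perc` comes from the dataset.
sample = pd.DataFrame({"occu_perc": [0.0, 35.2, 88.0, 100.0, 147.5, 112.3]})

print(sample["occu_perc"].describe())  # min, max, mean, quartiles

# Flag rows whose occupancy exceeds the logical maximum of 100%.
over_capacity = sample[sample["occu_perc"] > 100]
print(f"{len(over_capacity)} of {len(sample)} rows exceed 100%")

# One option among several: keep only plausible rows for later plots.
cleaned = sample[sample["occu_perc"].between(0, 100)]
```

Keeping the flagged rows in a separate frame such as `over_capacity` makes it easier to inspect them or raise them with the data provider before deciding whether to drop them.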

Would you like to proceed with any specific action regarding this dataset, or explore something else?

### Guglielmo

No, let's move on to an example of a histogram

### Data Analysis GPT

For a histogram, the `ticket_price` variable from the cinema ticket dataset is a good candidate. A histogram of this variable shows how ticket prices are distributed: how they vary, what the most common price ranges are, and whether there are any notable outliers.

Let's proceed with creating a histogram for the `ticket_price` data:

Here's the histogram displaying the distribution of ticket prices in the cinema ticket dataset.

![isto.png](attachment:isto.png)

### Explanation of the Visualization:
- **X-axis (Ticket Price):** Shows the range of ticket prices.
- **Y-axis (Frequency):** Indicates the number of occurrences for each price range.
- **Bars:** Each bar represents a specific range of ticket prices. The height of the bar reflects how frequently tickets within that price range occur in the dataset.
- **Curve (KDE - Kernel Density Estimate):** The smooth line provides an estimate of the distribution, highlighting the data's spread and central tendencies.

This histogram is useful for understanding the typical ticket prices and identifying common pricing strategies. It also helps in spotting any unusual price points, which might be either exceptionally high or low compared to the majority.

Would you like to explore another visualization or delve deeper into this dataset?

In [None]:
# Create a histogram for ticket prices
plt.figure(figsize=(10, 6))
sns.histplot(cinema_ticket_data['ticket_price'], bins=30, kde=True)
plt.title('Distribution of Ticket Prices')
plt.xlabel('Ticket Price')
plt.ylabel('Frequency')
plt.show()


### Guglielmo

Now a heatmap

### Data Analysis GPT

Creating a heatmap is an excellent way to visualize patterns in data, particularly when you want to explore relationships or comparisons across two dimensions. For this example, we can use the cinema ticket dataset to create a heatmap that shows the relationship between two variables, such as the day of the week and the total sales or occupancy percentage.

However, the cinema ticket dataset doesn't contain a direct day-of-the-week column. We can derive it from the `date` column. Then, we can aggregate data, for example, by calculating average total sales or average occupancy percentage for each day of the week.

Let's proceed with creating a heatmap that shows the average total sales for each day of the week, across different months. This will give us an idea of how sales trends vary by day and month.

Here's the heatmap representing the average total sales by day of the week and month using the cinema ticket dataset.

![file-3kzBmBL6niBgrvtWbujcX4jw.png](attachment:file-3kzBmBL6niBgrvtWbujcX4jw.png)

### Explanation of the Visualization:
- **X-axis (Month):** Represents different months.
- **Y-axis (Day of the Week):** Shows days of the week, from Monday to Sunday.
- **Color Intensity:** The color indicates the level of average total sales. With the `viridis` colormap, brighter colors (towards yellow) represent higher sales, while darker colors (towards purple) indicate lower sales.
- **Values:** The actual average sales figures are annotated on the heatmap for clarity.

This heatmap allows for easy comparison of sales patterns across different days and months. It can reveal trends such as which days of the week are generally more profitable and how these trends vary across different months.

Would you like to delve into another type of data visualization, or do you have any specific questions about this heatmap?

In [None]:
import numpy as np

# Extract day of the week from the date (Monday=0, Sunday=6)
cinema_ticket_data['day_of_week'] = cinema_ticket_data['date'].dt.dayofweek

# Aggregate average total sales by day of the week and month
sales_heatmap_data = cinema_ticket_data.groupby(['day_of_week', 'month'])['total_sales'].mean().unstack()

# Create a heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(sales_heatmap_data, cmap='viridis', annot=True, fmt=".0f")
plt.title('Average Total Sales by Day of the Week and Month')
plt.xlabel('Month')
plt.ylabel('Day of the Week')
plt.yticks(np.arange(7) + 0.5, ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], rotation=0)  # +0.5 centers each label on its row
plt.show()


### Guglielmo

Why are months 1 and 12 not represented?

### Data Analysis GPT

The absence of data for months 1 (January) and 12 (December) in the heatmap suggests that there may not be records for these months in the cinema ticket dataset. This could be due to several reasons: the dataset might not cover the full year, data for these months might not have been collected or included, or there could simply be no records for these months in the dataset.

To confirm this, let's check the range of months present in the dataset:

The cinema ticket dataset contains data for months 2 through 11 (February to November), which confirms that there are no records for January (month 1) and December (month 12). This is why these months are not represented in the heatmap.
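A coverage check of this kind can be sketched as follows. The dates are made up and stand in for the dataset's `date` column; the variable names are hypothetical.

```python
import pandas as pd

# Made-up dates standing in for the dataset's `date` column.
sample = pd.DataFrame({"date": pd.to_datetime(["2018-02-21", "2018-05-05", "2018-11-04"])})

# Months that occur in the data, as plain Python integers.
months_present = sorted(int(m) for m in sample["date"].dt.month.unique())
months_missing = [m for m in range(1, 13) if m not in months_present]

print("Date range:", sample["date"].min().date(), "to", sample["date"].max().date())
print("Months present:", months_present)
print("Months missing:", months_missing)
```

Running such a check before plotting helps distinguish a genuine gap in the data from a bug in the aggregation.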

The dataset's time coverage is an important aspect to consider when interpreting the results of any analysis. In this case, it means we are unable to analyze or draw conclusions about cinema sales patterns for January and December.

Would you like to explore any other aspects of this dataset, or would you prefer to look at another type of data visualization?

## Q&A and Discussion

