## Navigational Links

[<-- Previous Week](week_13.ipynb) | [<-- Back to Course Overview](course_overview.ipynb) | [Next Week -->](course_overview.ipynb)

# Week 14: Data Visualization with Matplotlib

Welcome to Week 14, the final week of Module 4! This week, we will explore Matplotlib, Python's most popular library for creating static, animated, and interactive visualizations. Building on our work with Pandas, you'll learn how to effectively represent your data graphically, uncovering patterns and insights. We'll cover fundamental plot types like line plots, scatter plots, and bar charts, as well as essential customization techniques such as adding labels, titles, and legends. Mastering data visualization is crucial for communicating your data analysis findings clearly and persuasively.

### Reading: Chapter 4 of 'Python Data Science Handbook'

For a comprehensive understanding of this week's topics, please refer to Chapter 4 of the open-source textbook, which covers Matplotlib in detail:
[Python Data Science Handbook - Chapter 4 (Visualization with Matplotlib)](https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html)

## Installation: Matplotlib Library

The projects in this module require the Matplotlib library. Run the following cell to install it if you haven't already. The leading `!` is Jupyter/Colab syntax for running a shell command from a notebook cell.

In [None]:
!pip install matplotlib
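
Once the install finishes, a quick check like the one below confirms that the library imports and shows which version the notebook is using. This is only a sanity-check sketch; the variable name `installed_version` is illustrative, and the version printed depends on your environment.

```python
import matplotlib

# The version string varies by environment, so no specific value is assumed here
installed_version = matplotlib.__version__
print(f"Matplotlib version: {installed_version}")
```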

## Interactive Lab: Basic Data Visualization

This section provides hands-on exercises to familiarize you with creating various types of plots using Matplotlib. Experiment with the code cells and modify them to test different scenarios.

#### Exercise 1: Line Plot - Trends Over Time

Line plots are ideal for showing trends over a continuous range, such as time. They connect individual data points with line segments.

**Try It Yourself:** Plot the monthly average temperature for a year.

In [None]:
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
avg_temps = [2, 4, 8, 13, 17, 21, 23, 22, 18, 13, 7, 3]

plt.figure(figsize=(10, 6))
plt.plot(months, avg_temps, marker='o', linestyle='-', color='skyblue')
plt.title('Monthly Average Temperatures')
plt.xlabel('Month')
plt.ylabel('Average Temperature (°C)')
plt.grid(True)
plt.show()
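
As a possible extension of this exercise, the sketch below draws two lines on the same axes and adds a legend. The second line is a centered 3-month moving average computed from the same `avg_temps` values; the moving average and the name `smoothed` are illustrative additions, not part of the original exercise. It uses Matplotlib's object-oriented interface (`fig, ax = plt.subplots()`), which is equivalent to the `plt.*` calls above.

```python
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
avg_temps = [2, 4, 8, 13, 17, 21, 23, 22, 18, 13, 7, 3]

# Centered 3-month moving average; Jan and Dec have no full window, so they are skipped
smoothed = [sum(avg_temps[i - 1:i + 2]) / 3 for i in range(1, len(avg_temps) - 1)]

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(months, avg_temps, marker='o', color='skyblue', label='Monthly average')
ax.plot(months[1:-1], smoothed, linestyle='--', color='steelblue', label='3-month moving average')
ax.set_title('Monthly Average Temperatures')
ax.set_xlabel('Month')
ax.set_ylabel('Average Temperature (°C)')
ax.legend()  # uses the label= text given to each plot call
ax.grid(True)
plt.show()
```

Each `label=` argument becomes one legend entry, so the legend stays in sync with the lines if you add or remove a series.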

#### Exercise 2: Scatter Plot - Relationships Between Variables

Scatter plots are used to observe relationships between two different quantitative variables. They show individual data points as marks.

**Try It Yourself:** Create a scatter plot of study hours versus exam scores to see whether the two are correlated.

In [None]:
import matplotlib.pyplot as plt

study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

plt.figure(figsize=(8, 6))
plt.scatter(study_hours, exam_scores, color='lightcoral', marker='x')
plt.title('Study Hours vs. Exam Scores')
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.grid(True)
plt.show()
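
To put a number on the relationship the scatter plot shows, one option is to compute the Pearson correlation coefficient and overlay a least-squares line. The sketch below assumes NumPy is available and uses `np.corrcoef` and `np.polyfit`; the names `r`, `slope`, and `intercept` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

study_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

# Pearson correlation coefficient: values near +1 mean a strong positive linear relationship
r = np.corrcoef(study_hours, exam_scores)[0, 1]

# Least-squares straight line: exam_score ≈ slope * study_hours + intercept
slope, intercept = np.polyfit(study_hours, exam_scores, deg=1)
fitted_scores = [slope * hours + intercept for hours in study_hours]

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(study_hours, exam_scores, color='lightcoral', marker='x', label='Observed scores')
ax.plot(study_hours, fitted_scores, color='gray', linestyle='--', label='Least-squares fit')
ax.set_title(f'Study Hours vs. Exam Scores (r = {r:.2f})')
ax.set_xlabel('Study Hours')
ax.set_ylabel('Exam Score')
ax.legend()
ax.grid(True)
plt.show()
```

The exercise data lies exactly on a straight line (each extra hour adds 5 points), so `r` comes out as 1 up to floating-point rounding. Real data would normally give a value below 1.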

#### Exercise 3: Bar Chart - Comparing Categories

Bar charts are excellent for comparing quantities among different categories. They use rectangular bars with lengths proportional to the values they represent.

**Try It Yourself:** Create a bar chart showing the sales figures for different product categories.

In [None]:
import matplotlib.pyplot as plt

product_categories = ['Electronics', 'Clothing', 'Books', 'Home Goods', 'Food']
sales = [15000, 12000, 8000, 10000, 18000]

plt.figure(figsize=(10, 6))
plt.bar(product_categories, sales, color='mediumseagreen')
plt.title('Sales by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Sales ($)')
plt.grid(axis='y', linestyle='--')
plt.show()
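
Bar charts are often easier to read when each bar shows its value. A minimal sketch, assuming Matplotlib 3.4 or newer (the version that introduced `Axes.bar_label`), is shown below; the dollar formatting and the name `value_labels` are illustrative choices.

```python
import matplotlib.pyplot as plt

product_categories = ['Electronics', 'Clothing', 'Books', 'Home Goods', 'Food']
sales = [15000, 12000, 8000, 10000, 18000]

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(product_categories, sales, color='mediumseagreen')

# Axes.bar_label needs Matplotlib >= 3.4; labels are built by hand to control the formatting
value_labels = ax.bar_label(bars, labels=[f'${value:,}' for value in sales], padding=3)

ax.set_title('Sales by Product Category')
ax.set_xlabel('Product Category')
ax.set_ylabel('Sales ($)')
ax.grid(axis='y', linestyle='--')
ax.set_axisbelow(True)  # draw the grid lines behind the bars
plt.show()
```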

## Mini-Project: Sales Data Visualization

**Task:** You are given sales data for different regions over three quarters. Create a series of visualizations using Matplotlib to summarize and present this data.

**Data:**
```python
sales_data = {
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3', 'Q3'],
    'Region': ['East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South'],
    'Sales': [150, 200, 100, 180, 170, 220, 110, 190, 160, 210, 105, 185]
}
```

**Instructions:**
1.  **Create a DataFrame:** Convert the `sales_data` dictionary into a Pandas DataFrame.
2.  **Total Sales per Region (Bar Chart):** Create a bar chart showing the total sales for each region across all quarters. Add appropriate labels and a title.
3.  **Sales Trend per Region (Line Plot):** Create a line plot that shows the sales trend for each region over the three quarters. Each region should be a separate line. Add a legend, labels, and a title.
4.  **Sales Distribution (Histogram):** Create a histogram to visualize the distribution of individual sales figures. Add labels and a title.
5.  **Quarterly Sales (Pie Chart):** Calculate the total sales for each quarter and display it as a pie chart. Add labels and a title.
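
Most of these steps start from an aggregation, so it can help to prepare the numbers before plotting anything. The sketch below shows one possible way to do that with Pandas for the data above; the variable names (`region_totals`, `quarter_totals`, `trend`) are illustrative, and the plotting itself is left to you.

```python
import pandas as pd

sales_data = {
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3', 'Q3'],
    'Region': ['East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South'],
    'Sales': [150, 200, 100, 180, 170, 220, 110, 190, 160, 210, 105, 185]
}
df = pd.DataFrame(sales_data)

# Steps 2 and 5: one total per region and one total per quarter
region_totals = df.groupby('Region')['Sales'].sum()
quarter_totals = df.groupby('Quarter')['Sales'].sum()

# Step 3: reshape to one row per quarter and one column per region,
# so each column can be drawn as its own line
trend = df.pivot(index='Quarter', columns='Region', values='Sales')

print(region_totals)
print(quarter_totals)
print(trend)
```

`pivot` works here because every (Quarter, Region) pair appears exactly once; with repeated pairs you would need to aggregate first, for example with `groupby`.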


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Data: a wide-format variant of the sales data (one column per quarter, plus a Product column).
# The unit tests below use this same layout.
sales_data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Q1_Sales': [100, 150, 120, 130, 110, 140, 125, 135, 105, 145, 115, 120],
    'Q2_Sales': [110, 160, 130, 140, 120, 150, 135, 145, 115, 155, 125, 130],
    'Q3_Sales': [105, 155, 125, 135, 115, 145, 130, 140, 110, 150, 120, 125]
}
df_sales = pd.DataFrame(sales_data)
print("Original Sales Data:\n" + str(df_sales.head()))

# 1. Total Sales by Region (Bar Chart)
df_sales['Total_Sales'] = df_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum(axis=1)
regional_sales = df_sales.groupby('Region')['Total_Sales'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
regional_sales.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.tight_layout()
plt.show()

# 2. Sales Distribution by Product (Pie Chart)
product_sales = df_sales.groupby('Product')['Total_Sales'].sum()

plt.figure(figsize=(8, 8))
plt.pie(product_sales, labels=product_sales.index, autopct='%1.1f%%', startangle=90, colors=plt.cm.Paired.colors)
plt.title('Sales Distribution by Product')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()

# 3. Quarterly Sales Trend by Region (Line Plot)
quarterly_sales = df_sales.groupby('Region')[['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum()

plt.figure(figsize=(12, 7))
for region in quarterly_sales.index:
    plt.plot(['Q1', 'Q2', 'Q3'], quarterly_sales.loc[region], marker='o', label=region)

plt.title('Quarterly Sales Trend by Region')
plt.xlabel('Quarter')
plt.ylabel('Sales')
plt.grid(True, linestyle='--')
plt.legend(title='Region')
plt.tight_layout()
plt.show()


## Unit Tests for Sales Data Visualization

Testing visualizations programmatically can be complex, often involving image comparison. For simplicity, we'll focus on testing the data transformations that lead to the visualizations.
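
A middle ground between image comparison and testing only the data is to inspect the objects Matplotlib creates, such as the bar rectangles and the title text. The sketch below illustrates the idea with a hypothetical helper, `plot_region_totals`, which is not part of the mini-project; the totals are the per-region sums of the long-format data in the task description.

```python
import matplotlib.pyplot as plt

def plot_region_totals(totals):
    """Hypothetical helper: draw a bar chart from a {region: total} mapping and return the Axes."""
    fig, ax = plt.subplots()
    ax.bar(list(totals.keys()), list(totals.values()))
    ax.set_title('Total Sales per Region')
    ax.set_xlabel('Region')
    ax.set_ylabel('Total Sales')
    return ax

# Per-region sums of the task-description data (e.g. East = 150 + 170 + 160)
totals = {'East': 480, 'West': 630, 'North': 315, 'South': 555}
ax = plot_region_totals(totals)

# Each bar is a Rectangle stored in ax.patches, in the order the bars were drawn
bar_heights = [patch.get_height() for patch in ax.patches]
assert bar_heights == list(totals.values()), 'Bar heights do not match the totals'
assert ax.get_title() == 'Total Sales per Region', 'Unexpected title'
assert ax.get_ylabel() == 'Total Sales', 'Unexpected y-axis label'
print('Plot structure checks passed.')
```

Checks like these do not verify how the chart looks, only that it was built from the expected values and labels.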

In [None]:
import pandas as pd
import numpy as np

# Helper function to generate data for testing (to match mini-project structure)
def generate_test_sales_data():
    sales_data = {
        'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
        'Product': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
        'Q1_Sales': [100, 150, 120, 130, 110, 140, 125, 135, 105, 145, 115, 120],
        'Q2_Sales': [110, 160, 130, 140, 120, 150, 135, 145, 115, 155, 125, 130],
        'Q3_Sales': [105, 155, 125, 135, 115, 145, 130, 140, 110, 150, 120, 125]
    }
    return pd.DataFrame(sales_data)

# Helper function to run the data transformation parts of the analysis for testing
def run_sales_data_transformations(df_input):
    df_input['Total_Sales'] = df_input[['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum(axis=1)
    regional_sales = df_input.groupby('Region')['Total_Sales'].sum().sort_values(ascending=False)
    product_sales = df_input.groupby('Product')['Total_Sales'].sum()
    quarterly_sales = df_input.groupby('Region')[['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum()
    return df_input, regional_sales, product_sales, quarterly_sales

# Test Cases
print('--- Running Sales Data Visualization Unit Tests ---')

# Generate test data
df_test = generate_test_sales_data()
df_transformed, regional_sales_test, product_sales_test, quarterly_sales_test = run_sales_data_transformations(df_test.copy())

# Test 1: Verify Total_Sales column
expected_total_sales_first_row = 100 + 110 + 105
assert df_transformed['Total_Sales'].iloc[0] == expected_total_sales_first_row, f"Test 1 Failed: Total_Sales for first row incorrect. Expected {expected_total_sales_first_row}, got {df_transformed['Total_Sales'].iloc[0]}"
print('Test 1 Passed: Total_Sales calculation is correct.')

# Test 2: Verify regional_sales totals (e.g., North region)
expected_north_sales = df_test[df_test['Region'] == 'North'][['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum().sum()
assert regional_sales_test['North'] == expected_north_sales, f"Test 2 Failed: North regional sales incorrect. Expected {expected_north_sales}, got {regional_sales_test['North']}"
print('Test 2 Passed: North regional sales aggregation is correct.')

# Test 3: Verify product_sales totals (e.g., Product A)
expected_product_a_sales = df_test[df_test['Product'] == 'A'][['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum().sum()
assert product_sales_test['A'] == expected_product_a_sales, f"Test 3 Failed: Product A sales incorrect. Expected {expected_product_a_sales}, got {product_sales_test['A']}"
print('Test 3 Passed: Product A sales aggregation is correct.')

# Test 4: Verify quarterly_sales for a specific region and quarter (e.g., North Q1)
expected_north_q1_sales = df_test[df_test['Region'] == 'North']['Q1_Sales'].sum()
assert quarterly_sales_test.loc['North', 'Q1_Sales'] == expected_north_q1_sales, f"Test 4 Failed: North Q1 sales incorrect. Expected {expected_north_q1_sales}, got {quarterly_sales_test.loc['North', 'Q1_Sales']}"
print('Test 4 Passed: North Q1 sales aggregation is correct.')

print('\nAll Unit Tests Completed.')


## Hints/Solution (Optional, Expand to View)
This section contains a suggested implementation for the Sales Data Visualization mini-project. Review it if you get stuck or want to compare your approach.

In [None]:
# Suggested solution for Sales Data Visualization mini-project
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Data (as provided in the mini-project description)
sales_data = {
    'Quarter': ['Q1', 'Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2', 'Q2', 'Q3', 'Q3', 'Q3', 'Q3'],
    'Region': ['East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South'],
    'Sales': [150, 200, 100, 180, 170, 220, 110, 190, 160, 210, 105, 185]
}
df_sales = pd.DataFrame(sales_data)
print("Original DataFrame:\n" + str(df_sales))

# 1. Create a DataFrame (already done above)

# 2. Total Sales per Region (Bar Chart)
regional_sales = df_sales.groupby('Region')['Sales'].sum().reset_index()

plt.figure(figsize=(10, 6))
plt.bar(regional_sales['Region'], regional_sales['Sales'], color=['skyblue', 'lightcoral', 'lightgreen', 'gold'])
plt.title('Total Sales per Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# 3. Sales Trend per Region (Line Plot)
quarterly_sales_by_region = df_sales.groupby(['Quarter', 'Region'])['Sales'].sum().unstack().fillna(0)

plt.figure(figsize=(12, 7))
for region in quarterly_sales_by_region.columns:
    plt.plot(quarterly_sales_by_region.index, quarterly_sales_by_region[region], marker='o', label=region)
plt.title('Sales Trend per Region Over Quarters')
plt.xlabel('Quarter')
plt.ylabel('Total Sales')
plt.legend(title='Region')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

# 4. Sales Distribution (Histogram)
plt.figure(figsize=(10, 6))
plt.hist(df_sales['Sales'], bins=5, color='lightsalmon', edgecolor='black')
plt.title('Distribution of Individual Sales Figures')
plt.xlabel('Sales Amount')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# 5. Quarterly Sales (Pie Chart)
quarterly_total_sales = df_sales.groupby('Quarter')['Sales'].sum()

plt.figure(figsize=(8, 8))
plt.pie(quarterly_total_sales, labels=quarterly_total_sales.index, autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff','#99ff99'])
plt.title('Total Sales Distribution by Quarter')
plt.ylabel('') # Hide y-label for pie chart
plt.show()


In [None]:
# Alternative solution using the wide-format data from the mini-project code cell and unit tests
import pandas as pd
import matplotlib.pyplot as plt

# Data (same as in the mini-project code cell, not the task description)
sales_data_solution = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Q1_Sales': [100, 150, 120, 130, 110, 140, 125, 135, 105, 145, 115, 120],
    'Q2_Sales': [110, 160, 130, 140, 120, 150, 135, 145, 115, 155, 125, 130],
    'Q3_Sales': [105, 155, 125, 135, 115, 145, 130, 140, 110, 150, 120, 125]
}
df_sales_solution = pd.DataFrame(sales_data_solution)

# Calculate Total Sales
df_sales_solution['Total_Sales'] = df_sales_solution[['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum(axis=1)

# 1. Total Sales by Region (Bar Chart)
regional_sales_solution = df_sales_solution.groupby('Region')['Total_Sales'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
regional_sales_solution.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.tight_layout()
plt.show()

# 2. Sales Distribution by Product (Pie Chart)
product_sales_solution = df_sales_solution.groupby('Product')['Total_Sales'].sum()

plt.figure(figsize=(8, 8))
plt.pie(product_sales_solution, labels=product_sales_solution.index, autopct='%1.1f%%', startangle=90, colors=plt.cm.Paired.colors)
plt.title('Sales Distribution by Product')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()

# 3. Quarterly Sales Trend by Region (Line Plot)
quarterly_sales_solution = df_sales_solution.groupby('Region')[['Q1_Sales', 'Q2_Sales', 'Q3_Sales']].sum()

plt.figure(figsize=(12, 7))
for region in quarterly_sales_solution.index:
    plt.plot(['Q1', 'Q2', 'Q3'], quarterly_sales_solution.loc[region], marker='o', label=region)

plt.title('Quarterly Sales Trend by Region')
plt.xlabel('Quarter')
plt.ylabel('Sales')
plt.grid(True, linestyle='--')
plt.legend(title='Region')
plt.tight_layout()
plt.show()


## Navigational Links

[<-- Back to Course Overview](course_overview.ipynb)

