# Visualization: Day 2
## Learning Outcomes

 - Analyze data trends and patterns: Identify key causes, trends, and sector impacts of data breaches.
 - Apply effective visualization techniques: Use bar charts, line charts, heatmaps, and other visuals to represent data insights.
 - Build interactive dashboards: Combine visualizations into an interactive dashboard for clear, structured data presentation.
 
 There are many kinds of visualizations.
 
 Now that you understand the basic grammar, you can explore other examples in the Altair gallery:
 https://altair-viz.github.io/gallery/index.html
 
 Here is the API reference: https://altair-viz.github.io/user_guide/api.html
 
 All of the customizations we will do today can be found in the documentation.
 
 

## Breaches Dataset and Environment Setup

In [1]:
!pip install altair
import pandas as pd
import altair as alt



### Breaches Dataset
The original dataset is available here:
https://docs.google.com/spreadsheets/d/1i0oIJJMRG-7t1GT-mr4smaTTU7988yXVz8nPlwaJ8Xk/edit?gid=2#gid=2

A series of visualizations has been created from this dataset:
https://informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/


Some of the data wrangling I did **last night**:
- formatted the `date` column
- standardized the `records_lost` column
- collapsed multi-valued `method` and `sector` entries to a single value (e.g. changed `government, military` to just `government`, and `web, health` to just `health`)
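
To make those wrangling steps concrete, here is a minimal sketch on a made-up three-row table. The raw values, the date format, and the `sector_map` dictionary are all assumptions for illustration; this is not the actual cleaning script used for `cleaned_breaches.csv`.

```python
import pandas as pd

# Hypothetical raw rows; the values and raw formats are invented for illustration.
raw = pd.DataFrame({
    "organisation": ["Org A", "Org B", "Org C"],
    "date": ["15/01/2019", "20/03/2020", "30/11/2021"],
    "records_lost": ["1,200,000", "350000", "2,000,000"],
    "sector": ["government, military", "web, health", "finance"],
})

cleaned = raw.copy()

# 1. Format the date column as real datetimes (assumed day/month/year input).
cleaned["date"] = pd.to_datetime(cleaned["date"], format="%d/%m/%Y")

# 2. Standardize records_lost: drop thousands separators, store as integers.
cleaned["records_lost"] = (
    cleaned["records_lost"].str.replace(",", "", regex=False).astype("int64")
)

# 3. Collapse multi-valued sectors to a single label (assumed mapping).
sector_map = {"government, military": "government", "web, health": "health"}
cleaned["sector"] = cleaned["sector"].replace(sector_map)

print(cleaned)
```

Values that are not in `sector_map` (like `finance`) pass through unchanged.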



In [2]:
# Read in the data
url_path = "https://raw.githubusercontent.com/kemiolamudzengi/viz_4_all/main/cleaned_breaches.csv"
# Note: parse_dates=True only tries to parse the index, so `date` is read as text (object)
breaches = pd.read_csv(url_path, parse_dates=True)

print("********************** The size of the dataset **********************")
print(breaches.shape)
print("********************** The data types of the attributes **********************")
breaches.info()  # info() prints its report directly and returns None, so no print() is needed
print("********************** The first three records in the dataset **********************")
print(breaches.head(3))



********************** The size of the dataset **********************
(487, 13)
********************** The data types of the attributes **********************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487 entries, 0 to 486
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   organisation        487 non-null    object
 1   records_lost        487 non-null    int64 
 2   year                487 non-null    int64 
 3   date                487 non-null    object
 4   story               482 non-null    object
 5   sector              487 non-null    object
 6   method              487 non-null    object
 7   interesting_story   98 non-null     object
 8   data_sensitivity    487 non-null    int64 
 9   source_name         487 non-null    object
 10  first_source_link   486 non-null    object
 11  second_source_link  46 non-null     object
 12  ID                  487 non-null    int64 
dtypes: int64(4)

## Part 1: Recap of Day 1 

### LBD Task: Scatter Plot of Breaches

Using the `breaches` dataset, create a scatter plot with the following encodings:
 - **x** and **size**: `records_lost`
 - **y**: `sector`
 - **color**: `data_sensitivity`
 - **tooltip**: Show `organisation`, `date`, `records_lost`, `method`, and `story`.

In [3]:
alt.Chart(breaches).mark_circle().encode(
    x = 'records_lost',
    size = 'records_lost',
    y = 'sector',
    color = 'data_sensitivity',
    tooltip = ['organisation', 'date', 'records_lost', 'method',  'story'],

)

#### Lil' Data Wrangling
Notice how a few sectors have very few rows.
Let's exclude the rows that fall under the `academic`, `legal`, `media`, `ngo`, and `misc` sectors.

In [4]:
# Remove rows where 'sector' is in the specified list
breaches_filtered = breaches[~breaches['sector'].isin(['academic', 'legal', 'media', 'ngo', 'misc'])]


**For the rest of the session today we will be using the breaches_filtered dataset**
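
If you want to check which sectors are sparse before dropping them, `value_counts()` is a quick way to look. The sketch below uses a made-up six-row table (the `toy` frame is hypothetical) so it runs without downloading anything; the same two calls work on `breaches`.

```python
import pandas as pd

# Hypothetical toy data standing in for the breaches table.
toy = pd.DataFrame({
    "organisation": ["A", "B", "C", "D", "E", "F"],
    "sector": ["web", "web", "health", "legal", "ngo", "health"],
})

# How many breaches does each sector have?
print(toy["sector"].value_counts())

# Drop the sparse sectors; ~ negates the boolean mask returned by isin().
sparse_sectors = ["academic", "legal", "media", "ngo", "misc"]
toy_filtered = toy[~toy["sector"].isin(sparse_sectors)]

print(toy_filtered["sector"].unique())
```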


### Simple Scatter Plot
Now let's recreate the viz from the previous step using the `breaches_filtered` dataset.

Using the breaches_filtered dataset, create a scatter plot with the following encodings:
 - **x** and **size**: `records_lost`
 - **y**: `sector`
 - **color**: `data_sensitivity`
 - **tooltip**: Show `organisation`, `date`, `records_lost`, `method`, and `story`.

In [5]:
alt.Chart(breaches_filtered).mark_circle().encode(
    x = 'records_lost',
    size = 'records_lost',
    y = 'sector',
    color = 'data_sensitivity:N',
    tooltip = ['organisation', 'date', 'records_lost', 'method',  'story'],

)

## Part 2: Customizing Visualizations

### Chart Labels, Title, and Size
**Pay attention to how I switch from the attribute format to the method format.**
Notice how


x = 'records_lost'

becomes

x = alt.X('records_lost')



#### Equivalent Code, Different Syntax


In [6]:
alt.Chart(breaches_filtered).mark_circle().encode(
    x=alt.X('records_lost:Q'),
    y=alt.Y('sector:N'),
    size=alt.Size('records_lost'),
    color=alt.Color('data_sensitivity:N'),
    tooltip=alt.Tooltip(['organisation', 'date', 'records_lost', 'method', 'story'])
)


### Instructions:
   - **Add Axis Titles**: Change the x-axis title to `"Records Lost"` and the y-axis title to `"Sector"`.
     - This is done using `alt.X().title()` for the x-axis and `alt.Y().title()` for the y-axis.
   - **Add a Chart Title**: Give the chart an overall title, `"Data Breaches by Sector and Sensitivity"`.
     - Use `.properties(title="Your Title")` to add a title to the entire chart.
   - **Resize the Chart**: Adjust the width to `450` and height to `250` using `.properties(width=450, height=250)`.

In [7]:
alt.Chart(breaches_filtered).mark_circle().encode(
    x=alt.X('records_lost:Q').title("Records Lost"),
    y=alt.Y('sector:N').title("Sector"),
    size=alt.Size('records_lost'),
    color=alt.Color('data_sensitivity:N'),
    tooltip=alt.Tooltip(['organisation', 'date', 'records_lost', 'method', 'story'])
).properties(
    width = 450,
    height = 250,
    title = "Data Breaches by Sector and Sensitivity",
)


### **Step 2: Customize Point Size and Opacity**

1. **Objective**: Make the plot more informative by adjusting the size of the points based on `records_lost` and setting their opacity.
2. **Instructions**:
   - **Set Point Size**: Use `size=alt.Size('records_lost:Q')` to encode the size of the points based on `records_lost`, so larger values appear bigger.
     - Use `.scale(domain=[1e6, 1e9], range=[100, 1000])` to map breaches of 1 million to 1 billion records onto point sizes from 100 to 1000.
   - **Set Opacity**: Pass `opacity=0.6` to `.mark_circle()` to make points slightly transparent.


In [8]:
alt.Chart(breaches_filtered).mark_circle(opacity = 0.6).encode(
    x=alt.X('records_lost:Q').title("Records Lost"),
    y=alt.Y('sector:N').title("Sector"),
    size=alt.Size('records_lost').scale(domain=[1e6,1e9], range = [100, 1000]), 
    color=alt.Color('data_sensitivity:N'),
    tooltip=alt.Tooltip(['organisation', 'date', 'records_lost', 'method', 'story'])
).properties(
    width = 450,
    height = 250,
    title = "Data Breaches by Sector and Sensitivity",
)
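
To build intuition for what `domain` and `range` do, here is a rough plain-Python sketch of a linear scale. The helper `linear_scale` is hypothetical and is not part of Altair; it assumes the size scale is linear with no clamping, which is my understanding of the default for a quantitative field. The resulting size is the area of the point, not its radius.

```python
def linear_scale(value, domain, range_):
    """Map value from domain to range_ by linear interpolation (no clamping)."""
    d0, d1 = domain
    r0, r1 = range_
    t = (value - d0) / (d1 - d0)  # position of value within the domain, 0..1
    return r0 + t * (r1 - r0)


domain, size_range = [1e6, 1e9], [100, 1000]

for records in (1e6, 5.005e8, 1e9):
    size = linear_scale(records, domain, size_range)
    print(f"{records:>15,.0f} records -> size {size:.0f}")
```

A breach at the bottom of the domain gets the smallest size, one at the top gets the largest, and everything in between is placed proportionally.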


### Step 3: Apply a Color Scheme

1. **Objective**: Improve the chart's visual clarity by changing the color scheme for `data_sensitivity`.
2. **Instructions**:
   - **Set Color Scheme**: Use `.scale(scheme='purplered')` to change the color scheme of `data_sensitivity` to `purplered` for better contrast.
   
   There are so many color schemes https://vega.github.io/vega/docs/schemes/ 


In [9]:
alt.Chart(breaches_filtered).mark_circle(opacity = 0.6).encode(
    x=alt.X('records_lost:Q').title("Records Lost"),
    y=alt.Y('sector:N').title("Sector"),
    size=alt.Size('records_lost').scale(domain=[1e6,1e9], range = [100, 1000]), 
    color=alt.Color('data_sensitivity:N').scale(scheme='purplered'),
    tooltip=alt.Tooltip(['organisation', 'date', 'records_lost', 'method', 'story'])
).properties(
    width = 450,
    height = 250,
    title = "Data Breaches by Sector and Sensitivity",
)


### Step 4: Customize Grid Lines, Scale Formatting, and Title Styling

1. **Objective**: Fine-tune the grid lines, format the x-axis scale, and style the chart title.
2. **Instructions**:
   - **Grid Lines**: Enable grid lines on the x-axis, showing only 4 major grid lines with `.axis(grid=True, tickCount=4)`.
   - **Remove Y-Axis Grid and Ticks**: For a cleaner look, remove grid lines and ticks on the y-axis with `.axis(grid=False, ticks=False)`.
   - **Title Font Color and Size**: Change the title font color to `darkblue` and increase the font size to `20` using `alt.TitleParams`.


In [30]:
first = alt.Chart(breaches_filtered).mark_circle(opacity = 0.6).encode(
    # x-axis: grid lines on, roughly 4 ticks
    x=alt.X('records_lost:Q').title("Records Lost").axis(grid=True, tickCount = 4),
    # y-axis: no grid lines or tick marks
    y=alt.Y('sector:N').title("Sector").axis(grid=False, ticks = False),
    size=alt.Size('records_lost').scale(domain=[1e6,1e9], range = [100, 1000]),
    color=alt.Color('data_sensitivity:N').scale(scheme='purplered'),
    tooltip=alt.Tooltip(['organisation', 'date', 'records_lost', 'method', 'story'])
).properties(
    width = 450,
    height = 250,
    title = alt.TitleParams(
        text = "Data Breaches by Sector and Sensitivity",
        color = "darkblue",
        fontSize = 20)
)

# Display the chart (the variable `first` is reused later in the dashboard)
first


WOW.
COMPARE WHAT YOU STARTED WITH AND WHERE WE ARE NOW.


There are many different customizations you can make to a chart so that it adheres to good design principles.

We have barely scratched the surface.
Notice the links to Altair's documentation throughout this notebook.
Read it and apply it.

**The sky is just the ceiling, not the limit**

## Part 3: Theoretical Underpinnings

_See the slides for the dets. (ha ha, I sound young and hip, using dets. instead of details)_

## Part 4: Exploratory Data Analysis


There are different ways we can explore our data: we can start with a fixed set of questions, or we can just visualize each attribute to see what we **stumble** across.

We will take a more targeted approach.

Here are the four questions we will seek to answer:
 1. What are the Most Common Causes of Data Breaches?
 2. How Has the Number of Data Breaches Changed Over Time?
 3. Which Sectors Experience the Most Significant Data Losses?
 4. How Are Data Sensitivity Levels Distributed Across Sectors?

### **1. What are the Most Common Causes of Data Breaches?**

- **Visualization**: Stacked Bar Chart and Normalized Bar Chart
- **Purpose**: Identify and compare the frequency of different breach methods, grouped by `data_sensitivity`, to understand which breach causes are most prevalent.

#### **Instructions**:
1. **Stacked Bar Chart**: 
   - **X-axis**: Use `count()` for the number of breaches.
   - **Y-axis**: Use `method`.
   - **Color**: Differentiate by `data_sensitivity` level.
   - **Title**: “Most Common Causes of Data Breaches by Sensitivity Level”
2. **Normalized Bar Chart**: 
   - **Update**: Convert the stacked bar chart to a normalized bar chart to observe the relative proportions of each cause.


In [23]:
# Stacked Bar Chart
stacked_bar_chart = alt.Chart(breaches_filtered).mark_bar().encode(
    x = alt.X('count()'),
    y = 'method',
    color  = 'data_sensitivity:O',
).properties(
    title  = "Most Common Causes of Data Breaches by Sensitivity Level"
)
# Normalized Bar Chart

normalized_bar_chart = alt.Chart(breaches_filtered).mark_bar().encode(
    x = alt.X('count()').stack("normalize"),
    y = 'method',
    color  = 'data_sensitivity:O',
).properties(
    title  = "Most Common Causes of Data Breaches by Sensitivity Level (normalized)"
)



WAIT, WHERE IS MY CHART?
Ha ha, gotcha.
It is stored as an object in the variable `stacked_bar_chart`. To viz it,
just type the name of the variable in the next cell.



In [24]:
stacked_bar_chart


In [25]:
normalized_bar_chart
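
If you want the numbers behind the two bar charts, `pd.crosstab` gives the same counts and proportions as a table. The sketch below uses a made-up five-row frame (`toy` and its values are hypothetical); swap in `breaches_filtered` to check the real charts.

```python
import pandas as pd

# Hypothetical toy data standing in for breaches_filtered.
toy = pd.DataFrame({
    "method": ["hacked", "hacked", "hacked", "lost device", "lost device"],
    "data_sensitivity": [1, 1, 3, 2, 2],
})

# Raw counts: what the stacked bar chart draws.
counts = pd.crosstab(toy["method"], toy["data_sensitivity"])

# Row-wise proportions: what the normalized bar chart draws.
shares = pd.crosstab(toy["method"], toy["data_sensitivity"], normalize="index")

print(counts)
print(shares.round(2))
```

Each row of `shares` sums to 1, which is why every bar in the normalized chart has the same length.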

### **2. How Has the Number of Data Breaches Changed Over Time?**

- **Visualization**: Line Chart
- **Purpose**: Show the trend in breach frequency over time, grouped by `data_sensitivity` to observe how breach occurrences vary over the years.

#### **Instructions**:
1. **Line Chart**:
   - **X-axis**: Encode `year(date)`.
   - **Y-axis**: Count the number of breaches.
   - **Color**: Use `data_sensitivity` to distinguish different sensitivity levels.
   - **Title**: “Data Breaches Over Time by Sensitivity Level”


In [26]:
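# Line Chart for breach trends over time
line_chart = alt.Chart(breaches_filtered).mark_line().encode(
    x = 'year(date):T',            # :T tells Altair to treat the text `date` column as temporal
    y = 'count()',
    color = 'data_sensitivity:N',  # :N gives one line and one legend entry per sensitivity level
).properties(title = "Data Breaches Over Time by Sensitivity Level")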


line_chart
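
The line chart counts breaches per year and sensitivity level. A rough pandas equivalent is sketched below on a made-up five-row frame (`toy` is hypothetical, and its ISO-style date strings are an assumption about the format). Because `date` is read as text, it is converted with `pd.to_datetime` first.

```python
import pandas as pd

# Hypothetical toy data; dates are stored as text, like the `date` column in `breaches`.
toy = pd.DataFrame({
    "date": ["2019-01-15", "2019-06-01", "2020-03-20", "2020-07-04", "2020-11-30"],
    "data_sensitivity": [1, 2, 1, 1, 2],
})

# Convert the text column to real datetimes so the year can be extracted.
toy["date"] = pd.to_datetime(toy["date"])

# Count breaches per (year, sensitivity level): the aggregation behind the line chart.
per_year = (
    toy.groupby([toy["date"].dt.year.rename("year"), "data_sensitivity"])
    .size()
    .rename("breaches")
    .reset_index()
)

print(per_year)
```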


### **3. Which Sectors Experience the Most Significant Data Losses?**

- **Visualization**: Pie Chart
- **Purpose**: Show each `sector`'s share of the total records lost, for a quick comparison of data loss magnitude across sectors.

#### **Instructions**:
1. **Pie Chart**:
   - **Theta**: Sum of `records_lost`.
   - **Color**: Encode `sector` to differentiate each slice.
   - **Legend**: The color legend identifies each `sector`.
   - **Title**: “Proportion of Data Loss by Sector”


In [27]:
# Pie Chart for data losses by sector
pie_chart = alt.Chart(breaches_filtered).mark_arc().encode(
    theta = 'sum(records_lost)',
    color = 'sector'
).properties(title = "Proportion of Data Loss by Sector")

pie_chart
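
Pie slices are hard to compare by eye, so it can help to print the shares as numbers too. A minimal sketch on made-up values (`toy` is hypothetical) follows.

```python
import pandas as pd

# Hypothetical toy data standing in for breaches_filtered.
toy = pd.DataFrame({
    "sector": ["web", "web", "health", "finance"],
    "records_lost": [6_000_000, 2_000_000, 1_000_000, 1_000_000],
})

# Total records lost per sector, largest first.
totals = toy.groupby("sector")["records_lost"].sum().sort_values(ascending=False)

# Each sector's share of all records lost: the size of its pie slice.
shares = totals / totals.sum()

print(shares)
```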

### **4. How Are Data Sensitivity Levels Distributed Across Sectors?**

- **Visualization**: Heatmap
- **Purpose**: Use a heatmap to display the distribution of `data_sensitivity` levels across different `sector`s.

#### **Instructions**:
1. **Heatmap**:
   - **X-axis**: Encode `data_sensitivity`.
   - **Y-axis**: Encode `sector`.
   - **Color**: Use color intensity to show the count of breaches.
   - **Title**: “Distribution of Data Sensitivity Levels Across Sectors”

In [29]:
# Heatmap for data sensitivity levels across sectors
heatmap = alt.Chart(breaches_filtered).mark_rect().encode(
    x = 'data_sensitivity:N',
    y = 'sector',
    color  = 'count()'
).properties(title = "Distribution of Data Sensitivity Levels Across Sectors")
heatmap

## Part 5: Dashboard
Bringing it all together.

#### **Instructions**:
1. Use **Horizontal Concatenation** (`|`) and **Vertical Concatenation** (`&`) to organize the visualizations into a clear layout.
2. For example:
   ```python
   # Arrange the charts into a dashboard layout
   (stacked_bar_chart & normalized_bar_chart) | (line_chart & pie_chart) & heatmap
   ```

3. **Objective**: Create a comprehensive dashboard that provides a holistic view of data breach trends and characteristics across different dimensions.


In [31]:
(stacked_bar_chart & normalized_bar_chart) | (line_chart & pie_chart) & heatmap
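
A note on how that expression is grouped: in Python, `&` binds more tightly than `|`, so the last part is read as `((line_chart & pie_chart) & heatmap)` before the `|` is applied. The sketch below uses the standard-library `ast` module to confirm this without building any charts; the string is just the layout expression from the cell above.

```python
import ast

layout = "(stacked_bar_chart & normalized_bar_chart) | (line_chart & pie_chart) & heatmap"
tree = ast.parse(layout, mode="eval").body

# The outermost operation is |, so the dashboard is two columns side by side.
print(type(tree.op).__name__)  # BitOr

# Left column: stacked_bar_chart stacked on top of normalized_bar_chart.
print(ast.unparse(tree.left))

# Right column: line_chart, pie_chart, and heatmap stacked vertically.
print(ast.unparse(tree.right))
```

If you want a different grouping, add parentheses, or spell the layout out with `alt.hconcat(...)` and `alt.vconcat(...)`.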

### Customizing the Dashboard

In [32]:
# Resize individual charts
stacked_bar_chart = stacked_bar_chart.properties(width=270, height=250)
normalized_bar_chart = normalized_bar_chart.properties(width=270, height=250)
line_chart = line_chart.properties(width=300, height=200)
pie_chart = pie_chart.properties(width=200, height=200)
heatmap = heatmap.properties(width=270, height=250)

# Dashboard heading
dashboard_heading = alt.Chart().mark_text(
    align='center',
    fontSize=20,
    fontWeight='bold'
).encode(
    text=alt.value("Data Breach Dashboard")
).properties(width=900, height=30)

# Concatenate heading and dashboard with bounding box
dashboard_with_heading = alt.vconcat(
    dashboard_heading,
    (stacked_bar_chart | normalized_bar_chart | heatmap) & (line_chart | pie_chart | first)
).configure_view(
    stroke='black',
    strokeWidth=2,
    cornerRadius=5
)

dashboard_with_heading



Congratulations oooo!

Tomorrow we will look into how to make these visualizations interactive.



Dr. K