# Assignment 06

## Due: See Date in Moodle

## This Week's Assignment

In this week's assignment you'll learn how to:

- apply the fundamentals of data visualization.

- connect data moves to visualization.

- use `pandas` and `matplotlib` visualization methods.

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

## Purpose and Fundamentals of Data Visualization

Data visualization is the practice of representing data graphically to reveal patterns, trends, and insights that might be difficult to interpret from raw numbers alone. Effective visualizations simplify complex information, making it easier to communicate findings, support decision-making, and identify relationships within data. The fundamentals of data visualization include selecting appropriate graphical representations, ensuring clarity and accuracy, and emphasizing key insights while avoiding misleading representations. 

Tufte, E. R. (2001). _The visual display of quantitative information (2nd ed.)_. Graphics Press.

While data visualization involves many concepts, techniques, and considerations, our discussion will focus on fundamental visualization types commonly used in exploratory data analysis, including bar charts, histograms, box plots, line charts, and scatter plots. These visualizations can be created in Python and R without requiring advanced coding techniques, making them accessible to beginners.

Import the `pandas` library with the appropriate alias and load the `titanic.csv` dataset from the `data` folder into a `pandas` `DataFrame` named `titanic`. Display the first five rows to verify that the data loaded correctly.

**Note:** Use separate code cells, one for importing the library and another for loading the dataset.

In [None]:
## Import the pandas library
...

In [None]:
## Load the dataset
titanic = pd.read_csv('...')
titanic.head()

Let's examine the metadata for the `titanic` `DataFrame`.

In [None]:
titanic.info()

**Question 1.** Write three research questions that can be answered using the Titanic dataset. Use the guidelines below to help. 

- Choose two or more variables that show relationships, not causation. Avoid causal terms like **"affect"** or **"impact"**. For example, instead of asking _“What is the average salary?”_, frame it as _“What is the relationship between years of experience and salary?”_

- Be specific and avoid vague or overly broad phrasing. For example, _“How does study time relate to final exam scores among high school students?”_ is more precise and meaningful than _“Is there a connection between studying and grades?”_.

- Consider multiple dimensions by exploring variations within subgroups and different conditions. For example, *“What is the relationship between physical activity level and heart rate, and does it differ by age group?”*

**Note:** Present your questions in a numbered list.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

## Visualizations

We will use plotting features from `pandas` and from `matplotlib` to create our visualizations. Run the cell below to import the libraries we need.

**Note:** To learn more about visualizations in `matplotlib` click [here](https://matplotlib.org/) and for documentation on creating visualizations using `pandas` click [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

Let's import the `matplotlib.pyplot` module from the `Matplotlib` library.

**Note:** A **library** is a collection of modules that provide functionality for a specific purpose. In this case, `Matplotlib` is the library, and it contains multiple modules, including `pyplot`. A **module** is a single file containing Python code (e.g., functions) that can be imported and used in a program.

In [None]:
# Import the matplotlib.pyplot module
import ... as ...

# Set default parameters for all visualizations
# This sets the default figure size for all Matplotlib plots in the session to use a 
# wider figure (12 inches) with a shorter height (5 inches).
plt.rcParams['figure.figsize'] = (12, 5)

# This sets the dots per inch (DPI) for the figure, which controls the resolution.
# Higher DPI (e.g., 200) results in sharper images, but larger file sizes.
# Setting dpi = 100 makes the figure clearer compared to the default which is typically 72 DPI.
plt.rcParams['figure.dpi'] = 100

### Categorical Data

#### Bar Plot

A bar plot displays the counts or aggregated values from a column that contains categorical data. It can be created using Pandas with pre-aggregated values

```python
df.plot.bar()
```

or using Matplotlib

```python
plt.bar()
```
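As a minimal sketch of both approaches, the cell below uses a small hypothetical `Series` of fruit names (not part of the assignment data). It assumes `value_counts()` as the aggregation step and then passes the counts to either `Series.plot.bar()` or `plt.bar()`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data, used only for illustration
fruit = pd.Series(['apple', 'pear', 'apple', 'kiwi', 'apple', 'pear'])

# Aggregate first: one count per category
counts = fruit.value_counts()

# Option 1: pandas plots the pre-aggregated counts
ax_pandas = counts.plot.bar(rot = 0)

# Option 2: Matplotlib takes the categories and bar heights explicitly
plt.figure()
bars = plt.bar(x = counts.index, height = counts.values)
```

In both cases the aggregation happens before plotting; the plotting call only draws one bar per row of `counts`.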

**Example 1.** Create a table to summarize the counts of categories in the `embarked` column.

In [None]:
tbl = ...
tbl

In [None]:
type(tbl)

In [None]:
tbl.index

**Example 2.** Create a bar chart using `df.plot.bar()` to visualize the counts of the `embarked` column.

In [None]:
# The semicolon is included at the end of a plotting command to prevent the display of additional text output
tbl.plot.bar(rot = 0);

**Question 2.** Create a bar chart using `plt.bar()` to visualize the counts of the `embarked` column.  

**Notes:**  

- `x` represents the unique categories in the `embarked` column.  

- `height` represents the count of occurrences for each category.  

In [1]:
...

**Question 3.** Create a bar chart to visualize the number of passengers who survived and did not survive. Include a title and axis labels to describe the visualization.


**Note:** You can use either `matplotlib` or `pandas`.

In [None]:
...

plt.xticks(ticks = [0, 1], labels = ['Did Not Survive', 'Survived'])
plt.ylabel("Number of Passengers")
plt.title("Passenger Survival Count");

The `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` functions in Matplotlib's `pyplot` module are used to add labels and a title to a plot.  

- `plt.title("Title Text")` sets the title of the plot.  

- `plt.xlabel("$x-$axis Label")` adds a label to the $x-$axis.  

- `plt.ylabel("$y-$axis Label")` adds a label to the $y-$axis.  

These functions modify the **current axes** (the plot most recently created) to provide context for the data being visualized.

### Numerical Data

#### Histograms

A histogram displays the distribution of numerical data by grouping values into bins and showing their frequency. It can be created using Pandas with numerical columns

```python
df.plot.hist()
```

or using Matplotlib

```python
plt.hist()
```

**Example 3.**  Use the `.hist()` `Series` method to create a histogram showing the distribution of passenger ages on the Titanic.

In [None]:
titanic['age'].hist();

**Example 4.** Add customizations to the plot from **Example 3** by removing the gridlines, including axis labels, applying an edge color to the bars, and adding a title.

In [None]:
titanic['age'].hist(edgecolor = 'white')

plt.xlabel('Age (Years)')
plt.ylabel('Frequency')
plt.title('Age Distribution of Titanic Passengers')
plt.grid(False);

**Example 5.** Add customizations to the plot from **Example 3** by setting the number of bins to 20.

In [None]:
titanic['age'].hist(bins = 20, edgecolor = 'white')

plt.xlabel('Age (Years)')
plt.ylabel('Frequency')
plt.title('Age Distribution of Titanic Passengers')
plt.grid(False);

**Example 6.** Change the $y-$axis from frequency to density, so that the areas of the bars sum to 1.

In [None]:
titanic['age'].hist(edgecolor = 'white', density = True)

plt.xlabel('Age (Years)')
plt.ylabel('Density')
plt.title('Age Distribution of Titanic Passengers')
plt.grid(False);

##### Frequency or Density?

A histogram can display data in two different ways: frequency or density. The choice depends on the analysis you need. 

Use a frequency histogram when:

- You want to see the raw number of observations in each bin.

- The dataset size is fixed, and comparisons across datasets aren’t needed.

Use a density histogram when:

- You are comparing distributions across datasets of different sizes.

- You need probabilities or normalized values (e.g., working with probability distributions).

- You plan to overlay a probability density function (PDF) (e.g., normal distribution curve).
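The difference can be checked numerically. The sketch below uses `numpy.histogram` on a small hypothetical sample; Matplotlib's `density = True` option applies the same normalization, dividing each count by the sample size and the bin width.

```python
import numpy as np

# Hypothetical sample, used only for illustration
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Frequency: raw counts per bin
freq, edges = np.histogram(values, bins = 4)

# Density: counts rescaled so that the bar areas sum to 1
dens, _ = np.histogram(values, bins = 4, density = True)

bin_widths = np.diff(edges)
total_area = (dens * bin_widths).sum()

print(freq.sum())   # the counts add up to the sample size
print(total_area)   # the bar areas add up to 1 (up to rounding)
```

Note that the density values themselves are not percentages: it is the area of each bar (height times bin width) that gives the share of observations in that bin.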

**Example 7.** Compare the distribution of `age` across different survival statuses using the `by =` parameter.

In [None]:
titanic['survived'].value_counts()

In [None]:
titanic['age'].hist(edgecolor = 'white', by = titanic['survived'], rot = 0);

In [None]:
titanic['age'].hist(edgecolor = 'white', density = True, by = titanic['survived'], rot = 0);

**Question 4.** Examine the histograms comparing age distributions of Titanic survivors and non-survivors. What trends do you notice, and what insights can be drawn from the distributions? In particular, explore how survival rates vary across age groups, the differences between frequency and density histograms, and additional factors (e.g., sex, class) that could be analyzed for deeper insights.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

#### Scatterplots

A scatter plot visualizes the relationship between two numerical variables by plotting individual data points on a coordinate plane. It can be created using Pandas with numerical columns

```python
df.plot.scatter(x = , y = )
```

or using Matplotlib

```python
plt.scatter(x, y)
```
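A minimal sketch of both calls, using a hypothetical `study` DataFrame with made-up `hours` and `score` columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numerical data, used only for illustration
study = pd.DataFrame({'hours': [1, 2, 3, 4, 5],
                      'score': [52, 60, 71, 80, 88]})

# Option 1: pandas takes column names
ax = study.plot.scatter(x = 'hours', y = 'score')

# Option 2: Matplotlib takes the values themselves
plt.figure()
points = plt.scatter(study['hours'], study['score'])
```

The pandas version labels the axes with the column names automatically, while the Matplotlib version leaves the axes unlabeled until `plt.xlabel()` and `plt.ylabel()` are called.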

**Question 5.** Create a scatter plot to visualize the relationship between `age` and `fare`.

In [None]:
...

**Example 8.** Customize the marker type and color.

**Note:** The marker is the shape used to represent each data point in the scatter plot. It can be customized using the `marker` parameter, with options such as circles (`'o'`), squares (`'s'`), triangles (`'^'`), and more.

In [None]:
color = 'lightblue'
shape = 's'

titanic.plot.scatter(x = 'age', y = 'fare', c = color, marker = shape);

**Question 6.** Color the markers in a scatter plot based on sex to distinguish male from female passengers in the Titanic dataset.

In [None]:
titanic['sex'].value_counts()

In [None]:
# Map the 'sex' column to colors: assign 'blue' for males and 'orange' for females
colors = ...

# Create variables for x and y values
x_values = titanic['age']
y_values = titanic['fare']

# Create scatter plot using matplotlib directly with two colors
plt.scatter(x = x_values, y = y_values, c = colors, alpha = 0.75)

# Add handles
handle_m = plt.scatter([], [], color = 'blue', label = 'Male')
handle_f = plt.scatter([], [], color = 'orange', label = 'Female')

# Add legend
plt.legend(handles = [handle_m, handle_f], title = 'Sex')

# Add labels and title
plt.xlabel('age')
plt.ylabel('fare')
plt.title('Scatter Plot Colored by Sex');

**Notes:**

- The `.map()` method in Pandas is used to transform or map values in a `Series` based on a given dictionary, function, or another `Series`.

- The `alpha` parameter in Matplotlib controls the transparency (opacity) of plotted elements. It accepts a value between 0 and 1:

    - `alpha=1`: Fully opaque (default)

    - `alpha=0.5`: 50% transparent

    - `alpha=0`: Fully transparent (invisible)
    
- Normally, `plt.legend()` builds its entries from plotted elements that were given a `label`. Here a single `plt.scatter()` call draws both colors at once, so it cannot supply one entry per category. Since we want a legend for color categories (e.g., 'Male' and 'Female'), we manually create legend handles without affecting the actual scatter plot.

- These lines create empty scatter plots (no points are drawn) that are used only for the legend. 
```python
handle_m = plt.scatter([], [], color = 'blue', label = 'Male')
handle_f = plt.scatter([], [], color = 'orange', label = 'Female')
```

- This line adds a legend to a Matplotlib plot by explicitly defining which legend handles should be displayed. 
```python
plt.legend(handles = [handle_m, handle_f], title = 'Sex')
```
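The notes above can be put together in one short sketch. The `pets` DataFrame, its columns, and the `palette` dictionary are hypothetical and chosen only to show the pattern on different data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, used only for illustration
pets = pd.DataFrame({
    'age': [2, 5, 7, 3],
    'weight': [4.0, 30.0, 5.5, 22.0],
    'species': ['cat', 'dog', 'cat', 'dog'],
})

# .map() with a dictionary translates each category into a color name
palette = {'cat': 'purple', 'dog': 'green'}
point_colors = pets['species'].map(palette)

plt.figure()
plt.scatter(x = pets['age'], y = pets['weight'], c = point_colors, alpha = 0.75)

# Empty scatter calls draw nothing but carry a color and a label for the legend
handles = [plt.scatter([], [], color = color, label = name)
           for name, color in palette.items()]
legend = plt.legend(handles = handles, title = 'Species')
```

Building the handles from the same dictionary that colors the points keeps the legend and the markers in agreement.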

**Question 7.** Examine the scatter plot comparing age and fare, with colors distinguishing male and female passengers. What patterns or trends do you observe, and what insights can be drawn from the distribution of points? In particular, explore how fare varies across different age groups, whether there are noticeable differences between male and female passengers, and what additional factors (e.g., survival status, class) could provide deeper insights.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

### Box Plots

A box plot visualizes the distribution of numerical data by displaying the median, quartiles, and potential outliers. It helps identify skewness, spread, and variability in the data. A box plot can be created using Pandas with numerical columns

```python
df.plot.box()
```

or using Matplotlib

```python
plt.boxplot()
```

<img src="https://lsc.studysixsigma.com/wp-content/uploads/sites/6/2015/12/1435.png" width="800" height="400">

**Source:** https://www.leansigmacorporation.com/box-plot-with-minitab/ 

**Author:** Michael Parker
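The quantities a box plot draws can be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers, which is also the default whisker setting (`whis = 1.5`) in `plt.boxplot()`; the sample values are hypothetical.

```python
import numpy as np

# Hypothetical sample with one unusually large value
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 30])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers)
```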

**Question 8.** Create a box plot to compare the distribution of the numerical variable `fare` across passenger classes.

In [None]:
...

plt.xlabel('Passenger Class')
plt.ylabel('Fare')
plt.title('Fare Distribution by Passenger Class')

# Remove the automatic Pandas suptitle
plt.suptitle("");

**Question 9.** Examine the box plot comparing fare distributions across passenger classes. What patterns or trends do you observe, and what insights can be drawn from the spread of fares within each class? In particular, explore how fare varies across different classes, whether there are significant differences in fare ranges, and what additional factors (e.g., survival status, age) could provide deeper insights.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

### Line Charts

A line chart visualizes trends over time or sequential data by connecting data points with a continuous line. It is useful for showing changes, patterns, or trends in numerical data. It can be created using Pandas with numerical columns

```python
df.plot.line()
```

or using Matplotlib

```python 
plt.plot()
```

#### Time Series Data

To visualize trends over time, we need time-series data, which is a collection of data points recorded at successive time intervals. This type of data allows us to observe patterns, trends, and fluctuations over time, making it useful for analyzing changes, forecasting future values, etc.

We'll examine the [AIDS dataset from CORGIS](https://corgis-edu.github.io/corgis/csv/aids/). The corresponding data sheet can be found [here](https://docs.google.com/document/d/1SbQhneStEjX_nulIoLNPuIy1v2qfeVilZqVtw_Ke5aI/edit?tab=t.0).

The UNAIDS Organization is an entity of the United Nations that works to reduce the transmission of AIDS and provide resources to those currently affected by the disease. The following data set contains information on the number of people affected by the disease, new cases being reported, and AIDS-related deaths for a large set of countries from 1990 to 2015.

Load the `aids.csv` file from the data folder.

In [None]:
aids = pd.read_csv('...')

Let's check the metadata.

In [None]:
aids.info()

**Example 9.** Rename the relevant columns to follow the `snake_case` convention, select the necessary columns from the `aids` dataframe, and store the result in a new dataframe named `df`.

**Columns (Variables)**

- `Year`

- `Country`

- `Data.AIDS-Related Deaths.Adults`

- `Data.AIDS-Related Deaths.Female Adults`

- `Data.AIDS-Related Deaths.Male Adults`

In [None]:
df = aids.rename(
    columns = {'Year' : 'year',
               'Country' : 'country',
               'Data.AIDS-Related Deaths.Adults' : 'aids_deaths_adult',
               'Data.AIDS-Related Deaths.Female Adults' : 'aids_deaths_female_adults',
               'Data.AIDS-Related Deaths.Male Adults' : 'aids_deaths_male_adults'
              }
)[['year', 'country', 'aids_deaths_adult', 'aids_deaths_female_adults', 'aids_deaths_male_adults']]

Verify that the new dataframe contains the expected columns and data types.

In [None]:
df.info()

**Example 10.** Create a line chart showing the annual trend of AIDS-related deaths among adults.

In [None]:
df.plot(x = 'year', y = 'aids_deaths_adult');

**Example 11.** Correct the line chart from **Example 10**. The data has one row per country per year, so the values must be summed by year before plotting.

In [None]:
grps = df.groupby('year')
grps

In [None]:
grps['aids_deaths_adult'].sum()

In [None]:
tbl = grps['aids_deaths_adult'].sum()
tbl

In [None]:
tbl.plot.line(marker = 'o', ms = 3);
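The same group-then-plot pattern is sketched below on a tiny hypothetical table (`cases`, with two made-up countries), which makes it easier to see why the aggregation is needed.

```python
import pandas as pd

# Hypothetical long-format data: one row per country per year
cases = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2002, 2002],
    'country': ['A', 'B', 'A', 'B', 'A', 'B'],
    'deaths': [10, 200, 12, 210, 9, 190],
})

# Plotting 'deaths' against 'year' directly would connect all six rows in order,
# so the line would jump between countries within each year.

# Aggregating first leaves exactly one value per year
yearly_total = cases.groupby('year')['deaths'].sum()

ax = yearly_total.plot.line(marker = 'o', ms = 3)
```

After grouping, `year` becomes the index of `yearly_total`, so pandas uses it for the horizontal axis without an explicit `x =` argument.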

**Question 10.** Modify the line plot from **Example 11** to begin 2 years before the peak and extend to the end of the data.

In [None]:
...

## Encoding Multiple Features

When we talk about encoding multiple features in data visualization, we mean representing more than one variable in a single chart by using different visual encodings such as color, shape, size, position, or multiple axes. We explored this concept in **Question 6**, where we plotted points with the following encodings:  

- The $x-$coordinate represented `age`.

- The $y-$coordinate represented `fare`.  

- Color distinguished between male and female, representing the variable `sex`.  

This visualization displayed how `age` and `fare` were distributed while incorporating `sex` as a distinguishing factor.

**Example 12.** Create a line chart in the same figure that visualizes trends over time for `aids_deaths_female_adults` and `aids_deaths_male_adults`. Label the axes appropriately, add a title, and include a legend to distinguish between the two groups. Ensure the x-axis represents years correctly and that the trends are clearly displayed.

In [None]:
# Group the data by year and sum the total AIDS deaths for adult females and males
grps = df.groupby('year')[['aids_deaths_female_adults', 'aids_deaths_male_adults']].sum()

# Create a line chart to visualize the trends over time
grps.plot(kind = 'line');

plt.xlabel("Year")
plt.ylabel("Total Deaths")
plt.title("AIDS Deaths by Sex Over Time")
plt.legend(['Female Adults', 'Male Adults'], title = 'AIDS Deaths');

**Example 13.** Create a stacked bar chart in the same figure that visualizes the distribution of survival status across embarkation towns. Label the axes appropriately, add a title, and include a legend to distinguish between survived and did not survive.

1. Create a contingency table

   - Use `pd.crosstab()` to count the number of survivors (1) and non-survivors (0) for each embarkation town (`embarked`). **Note:** `margins = True` adds row and column totals.

In [None]:
# Create a contingency table
# Count of survivors and non-survivors by embarkation town
tbl = pd.crosstab(titanic['embarked'], titanic['survived'], margins = True)
tbl

Since Southampton had more passengers, it's better to compare proportions than raw counts.

In [None]:
# Create a contingency table 
# Proportion of survivors and non-survivors by embarkation town
tbl = pd.crosstab(titanic['embarked'], titanic['survived'], normalize = 'index')
tbl

2. Plot the stacked bar chart

   - Use `tbl.plot(kind = 'bar', stacked = True)` to create a stacked representation.

In [None]:
# Create stacked bar chart
tbl.plot(kind = 'bar', stacked = True, rot = 0)

# Add whitespace above bars for the legend
plt.ylim(0, 1.5)

# Add title, label axes, and label legend
plt.xlabel('Embarked Town')
plt.ylabel('Proportion of Passengers')
plt.title('Survival Status by Embarkation Town')
plt.legend(title = 'Survived', labels = ['Did Not Survive', 'Survived']);
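To see what `normalize = 'index'` does, the sketch below builds both kinds of table from a small hypothetical `orders` DataFrame (the `store` and `returned` columns are made up for illustration).

```python
import pandas as pd

# Hypothetical data, used only for illustration
orders = pd.DataFrame({
    'store': ['north', 'north', 'north', 'north', 'south', 'south'],
    'returned': [0, 0, 0, 1, 0, 1],
})

# Raw counts: the larger store dominates
counts = pd.crosstab(orders['store'], orders['returned'])

# Row proportions: each row is divided by its own total
shares = pd.crosstab(orders['store'], orders['returned'], normalize = 'index')

print(counts)
print(shares)
```

Because every row of `shares` sums to 1, a stacked bar chart of it reaches the same height for each group, which makes groups of different sizes directly comparable.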

**Example 14.** Create a side-by-side bar chart in the same figure that visualizes the distribution of survival status across embarkation towns. Label the axes appropriately, add a title, and include a legend to distinguish between survived and did not survive.

In [None]:
# Create side-by-side bar chart
tbl.plot(kind = 'bar', stacked = False, rot = 0)

# Add list for tick labels
# The order must match the rows of tbl, which pd.crosstab() sorts alphabetically
labels = ['Cherbourg, France', 'Queenstown, Ireland', 'Southampton, UK']

# Customize x-axis labels
plt.xticks(ticks = range(len(labels)), labels = labels, rotation = 0, ha = 'center')

# Add title, label axes, and label legend
plt.xlabel('')
plt.ylabel('Proportion of Passengers')
plt.title('Survival Status by Embarkation Town')
plt.legend(title = 'Survived', labels = ['Did Not Survive', 'Survived']);

**Note:**

- This line modifies the x-axis tick labels in Matplotlib.

```python
plt.xticks(ticks = range(len(labels)), labels = labels, rotation = 0, ha = 'center')
```

| Code                         | Description |
|------------------------------|-------------|
| `plt.xticks(...)`            | Customizes the $x-$axis tick positions and labels. |
| `ticks = range(len(labels))` | Specifies the tick positions (0, 1, 2) based on the number of labels. |
| `labels = labels`            | Assigns custom names to the ticks, in order of position. |
| `rotation = 0`               | Ensures labels stay horizontal (0-degree rotation). |
| `ha = 'center'`              | Aligns labels horizontally centered under each tick mark. |
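Because `plt.xticks()` assigns labels purely by position, the list has to follow the order of the plotted index. One alternative, sketched here on a hypothetical `visits` table, is to rename the index with a dictionary before plotting so that each label stays attached to its own row.

```python
import pandas as pd

# Hypothetical table of counts indexed by short codes
visits = pd.Series({'B': 12, 'A': 30, 'C': 7}).sort_index()

# Map each code to a full name; the order of the dictionary does not matter
full_names = {'C': 'Gamma Park', 'A': 'Alpha Park', 'B': 'Beta Park'}
labeled = visits.rename(index = full_names)

# The renamed index supplies the tick labels directly
ax = labeled.plot.bar(rot = 0)
tick_text = [tick.get_text() for tick in ax.get_xticklabels()]
```

This is a design choice rather than a requirement: `plt.xticks()` works equally well as long as the label list is in the same order as the rows being plotted.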

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.