# Assignment 07

## Due: See Date in Moodle

## This Week's Assignment

In this week's assignment, you'll be introduced to visualizations in Python. You'll learn how to:

- import modules used to create visualizations in Python

- create visualizations for categorical and numerical data

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

## Purpose and Fundamentals of Data Visualization

Data visualization is the practice of representing data graphically to reveal patterns, trends, and insights that might be difficult to interpret from raw numbers alone. Effective visualizations simplify complex information, making it easier to communicate findings, support decision-making, and identify relationships within data. The fundamentals of data visualization include selecting appropriate graphical representations, ensuring clarity and accuracy, and emphasizing key insights while avoiding misleading representations (Tufte, 2001).

While data visualization involves many concepts, techniques, and considerations, our discussion will focus on fundamental visualization types commonly used in exploratory data analysis, including bar charts, histograms, box plots, line charts, and scatter plots. These visualizations can be created in both Python and R. 

### Matplotlib

`matplotlib.pyplot` is a module within the Matplotlib library that makes it easy to create visualizations in Python. 

```python
import matplotlib.pyplot as plt
```

Each command in `pyplot`, such as `plt.plot()` and `plt.title()`, controls one part of a figure. This allows you to build and customize your visualization one step at a time.
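As a minimal sketch of this step-by-step style, the example below draws a line and then adds a title and axis labels with separate `pyplot` commands. The `hours` and `scores` values are made up for illustration and are not part of the assignment dataset.

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [52, 61, 70, 78, 88]

plt.figure()                               # start a new, empty figure
plt.plot(hours, scores, marker="o")        # draw the line with point markers
plt.title("Quiz Score by Hours Studied")   # add a title
plt.xlabel("Hours studied")                # label the x-axis
plt.ylabel("Quiz score")                   # label the y-axis
```

Each call changes the current figure, so the order of the customization commands after `plt.plot()` does not matter here.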

**Note:** To learn more about visualizations in `matplotlib`, see the [Matplotlib documentation](https://matplotlib.org/stable/). For documentation on creating visualizations using `pandas`, see the [pandas visualization user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

**Question 1.** Import the necessary libraries for analyzing data and creating our visualizations.

In [None]:
...

**Question 2.** Load your dataset into a `pandas` `DataFrame` and verify that it loaded correctly by displaying the `DataFrame`'s metadata.

In [None]:
...

## Data Dive with Visualizations

In the Data Moves framework, visualize is one of the key data moves and often builds on earlier steps like selecting, filtering, grouping, and summarizing. After you choose which data to focus on, organize it into meaningful groups, and calculate summaries or aggregates, the visualize move transforms those results into charts or graphs that reveal patterns, trends, or relationships.
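The sequence of moves described above might look like the following sketch. The `sales` table and its column names are hypothetical and were invented for this illustration. They are not part of the assignment dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East", "West"],
    "year": [2023, 2023, 2023, 2023, 2024, 2024],
    "amount": [100, 140, 90, 110, 200, 150],
})

# Filter: keep only the rows for one year.
sales_2023 = sales[sales["year"] == 2023]

# Group and summarize: mean amount for each region.
mean_amount = sales_2023.groupby("region")["amount"].mean()

# Visualize: turn the summary into a bar chart.
ax = mean_amount.plot.bar(rot=0)
ax.set_ylabel("Mean amount")
```

The chart is drawn from the small summary table `mean_amount`, not from the raw rows, which is why the earlier moves come first.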

## Bar Chart

A bar chart is a graphical representation of categorical data where each category is represented by a bar, with the height or length of the bar corresponding to its value. It is used to compare the frequency, count, or other metrics across different categories. Bar charts are ideal when you need to visually compare discrete categories or show trends over time. 

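This can be done using `pandas`:

```python
# Count how often each category occurs, then plot the counts.
df['column_name'].value_counts().plot.bar()
```

or using `matplotlib`:

```python
# plt.bar needs both the category labels and the bar heights.
counts = df['column_name'].value_counts()
plt.bar(counts.index, counts.values)
```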
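The following sketch shows a frequency bar chart built from start to finish on a small table. The `df` table and its `color` column are made up for illustration and are not part of the assignment dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# value_counts() returns one row per category, sorted from most to least frequent.
counts = df["color"].value_counts()

# pandas: one bar per category, with bar height equal to the count.
ax = counts.plot.bar(rot=0)

# matplotlib: the same chart, drawn on a separate figure.
plt.figure()
bars = plt.bar(counts.index, counts.values)
```

Both versions draw three bars because the column contains three distinct categories.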

**Question 3.** Select a categorical variable from your dataset and create a bar chart to display the frequency of each category.

In [None]:
tbl = ...
tbl

In [None]:
tbl.plot.bar();

**Question 4.** Add a title to your bar chart.

In [None]:
tbl.plot.bar(rot = 1)

plt.title(...);

## Histogram

A histogram is a graphical representation of the distribution of numerical data. Unlike a bar chart, which displays categorical data, a histogram groups continuous data into bins (ranges) and shows the frequency or count of data points within each bin. The height of each bar represents the number of data points that fall within each bin.

This can be done using `pandas`:

```python
df['column_name'].plot.hist()
```

or using `matplotlib`:

```python
plt.hist(df['column_name'])
```
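To see how binning works, the sketch below counts the values in each bin with `numpy.histogram` and then draws the matching histogram. The `values` list is made up for illustration, and the choice of four bins is an assumption for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range from 1 to 9 into 4 equal-width bins and count the values in each.
counts, bin_edges = np.histogram(values, bins=4)
print(bin_edges)  # [1. 3. 5. 7. 9.]
print(counts)     # [3 5 1 1]

# plt.hist performs the same binning and draws one bar per bin.
n, edges, patches = plt.hist(values, bins=4)
```

Each bin includes its left edge but not its right edge, except for the last bin, which includes both. That is why the value 9 is counted in the final bin.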

**Question 5.** Select a numerical variable from your dataset and create a histogram to display the distribution of its values across different ranges (bins).

In [None]:
...

**Question 6.** Customize your histogram by adding contrasting edge colors to the bars and including a title and axis labels.

In [None]:
...

plt.title(...)
plt.xlabel(...)
plt.ylabel(...);

## Box Plot

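A box plot displays the distribution of numerical data based on a five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. A box plot is made up of the following components: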

- Median (Q2): The line inside the box represents the median (the middle value of the data set).

- First quartile (Q1): The lower edge of the box, representing the 25th percentile (where 25% of the data lies below this value).

- Third quartile (Q3): The upper edge of the box, representing the 75th percentile (where 75% of the data lies below this value).

- Interquartile Range (IQR): The range between the first quartile (Q1) and third quartile (Q3), which contains the middle 50% of the data.

- Whiskers: These extend from the edges of the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3, respectively. They represent the range of most of the data.

- Outliers: Data points outside the whiskers are considered outliers and are usually plotted as individual dots.
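The components listed above can be computed by hand, as in the sketch below. The `values` series is made up for illustration, and the quartiles use the default linear interpolation in `pandas`.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
values = pd.Series([2, 4, 5, 6, 7, 8, 9, 10, 30])

q1 = values.quantile(0.25)      # first quartile: 5.0
median = values.quantile(0.50)  # median: 7.0
q3 = values.quantile(0.75)      # third quartile: 9.0
iqr = q3 - q1                   # interquartile range: 4.0

# Points farther than 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr    # -1.0
upper_fence = q3 + 1.5 * iqr    # 15.0

# Whiskers end at the most extreme values that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()   # 2 and 10

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())  # [30]
```

Note that the upper whisker stops at 10, the largest value inside the fence, and not at the fence value of 15.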

This can be done using `pandas`:

```python
df.plot.box(column = 'column_name')
```

or using `matplotlib`:

```python
plt.boxplot(df['column_name'])
```
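Box plots are also useful for comparing one numerical variable across groups. The sketch below is one possible approach that uses `DataFrame.boxplot` with its `by` parameter. The `df` table and its `group` and `score` columns are made up for illustration and are not part of the assignment dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "score": [61, 70, 74, 82, 55, 68, 90, 95],
})

# Draw one box per category in "group" on a shared axis.
df.boxplot(column="score", by="group")

plt.suptitle("")             # clear the automatic "Boxplot grouped by ..." heading
plt.title("Score by Group")  # replace the default title
plt.xlabel("Group")
plt.ylabel("Score")
```

When `by` is used, `pandas` adds its own heading above the plot, so `plt.suptitle("")` is called to remove it before setting a custom title.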

**Question 7.** Select a **different numerical variable from the one used in your histogram** and create a box plot to display its distribution, highlighting the median, quartiles, and any potential outliers.

In [None]:
...

**Question 8.** Create a box plot that compares crab shucked weight across sex categories.

In [None]:
...

**Note:** Customize your box plot by adding a title and axis labels and by replacing the default $x$-axis tick labels with the full category names.

In [None]:
...

plt.title(...)
plt.suptitle('')
plt.xlabel('')
plt.xticks(ticks = [...], labels = [...])
plt.ylabel(...);

## Scatter Plot

A scatter plot is a graphical representation used to display the relationship between two continuous variables. Each point on the plot represents an observation, with the position of the point determined by the values of the two variables. The $x$-axis represents one variable, and the $y$-axis represents the other.

This can be done using `pandas`:

```python
df.plot.scatter(x = 'x_column_name', y = 'y_column_name')
```

or using `matplotlib`:

```python
plt.scatter(df['x_column_name'], df['y_column_name'])
```
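As a short sketch, the example below draws the same scatter plot with both libraries. The `df` table and its `height_cm` and `weight_kg` columns are made up for illustration and are not part of the assignment dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 172, 180],
    "weight_kg": [50, 58, 63, 70, 80],
})

# pandas: pass the column names, and the axis labels are filled in automatically.
ax = df.plot.scatter(x="height_cm", y="weight_kg")

# matplotlib: pass the column values, and add the axis labels yourself.
plt.figure()
points = plt.scatter(df["height_cm"], df["weight_kg"])
plt.xlabel("height_cm")
plt.ylabel("weight_kg")
```

Each row of the table becomes one point, so both plots contain five points.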

**Question 9.** Select two numerical variables from your dataset and create a scatter plot to display their relationship. Use one variable for the $x$-axis and the other for the $y$-axis.

In [None]:
...

**Question 10.** Customize your scatter plot by adding a contrasting edge color to the points and including a title and axis labels.

In [None]:
...

plt.title(...)
plt.xlabel(...)
plt.ylabel(...);

## References

Tufte, E. R. (2001). _The visual display of quantitative information_ (2nd ed.). Graphics Press.

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.