<a href="https://colab.research.google.com/github/kaorimiyazonoo/file_processing/blob/main/Scientific_Visualizations_with_Python_Session_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scientific Visualizations with Python

---

Welcome to the Scientific Visualizations with Python course with the Lane Medical Library!

**Audience**: Scientists or medical professionals interested in visualizing data using Python. Some introductory experience in Python is expected.

## Python Prerequisites

❓ **Try it yourself!** Retrieve the second to fourth elements in `odds` using index range notation.

Indices in Python start at zero. Recall that for a list `x`, the expression `x[0:2]` retrieves the first two elements in the list: the slice starts at index 0 and stops before index 2.
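As a quick refresher on slice notation, here is a minimal sketch using a made-up list called `letters` (not part of the course data). A slice `x[start:stop]` includes the element at `start` and excludes the element at `stop`.

```python
# Hypothetical example list, used only to illustrate slicing
letters = ['a', 'b', 'c', 'd', 'e']

first_two = letters[0:2]   # elements at indices 0 and 1
last_three = letters[2:5]  # elements at indices 2, 3, and 4

print(first_two)
print(last_three)
```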

In [None]:
# List of odd numbers
odds = list(range(1, 21, 2))

# YOUR CODE HERE

❓ **Try it yourself!** Import the `math` module and calculate the square root of 17.

In [None]:
# YOUR CODE HERE

## Tabular Data

When making figures, we will be dealing with **tabular data** in **flat files**. Tabular data are data organized in tables, which you can picture as a spreadsheet. Flat files contain one record per line. This is one of the most common ways scientific data are stored, and file formats include tab-separated values (TSV), comma-separated values (CSV), and Excel spreadsheets (XLSX).

A popular library for working with this kind of data is `pandas`. We will use `pandas` to load sample data that Google Colab provides on each instance.

In [None]:
import pandas as pd
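The text formats mentioned above differ mainly in the character that separates columns. The sketch below is illustrative only: it builds two tiny made-up tables in memory with `io.StringIO` (so no files are needed) and reads them with `pd.read_csv`, passing `sep='\t'` for the tab-separated case.

```python
import io

import pandas as pd

# Made-up records, not taken from the course data
csv_text = "sample,value\nA,1.5\nB,2.0\n"
tsv_text = "sample\tvalue\nA\t1.5\nB\t2.0\n"

# read_csv assumes commas by default; sep='\t' switches it to tabs
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# Check whether both calls produce the same table
print(from_csv.equals(from_tsv))
```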

Let us explore the data provided to us! This is California Housing Prices data from the 1990 US Census. Google provides documentation on the data set [here](https://docs.google.com/document/d/e/2PACX-1vRhYtsvc5eOR2FWNCwaBiKL6suIOrxJig8LcSBbmCbyYsayia_DvPOOBlXZ4CAlQ5nlDD8kTaIDRwrN/pub).

The data are provided as CSV files. To view the raw contents of the file, let us use a `bash` command to print its first 6 lines. Notice that there is a header row containing column names, followed by one record per line, so the output shows the header and the first 5 records.

In [None]:
!head -n 6 sample_data/california_housing_train.csv

`pandas` has a `read_csv` function to open files. The function returns a `DataFrame` object that represents the data. We will store this as `cal_housing`.

The `DataFrame` object has many useful functions to operate on the data. To start, we will use the `.head()` function to visualize the first few lines of data. These should match the records from the `bash` command above!

In [None]:
cal_housing = pd.read_csv('sample_data/california_housing_train.csv')

print(type(cal_housing))

In [None]:
cal_housing.head()

Each column in the `DataFrame` is an object called a `Series`. You can retrieve a `Series` using either of the two notations below, and we will use them interchangeably. However, only the second, index notation can create a new column, and the first, attribute notation does not work for column names that contain spaces or clash with existing `DataFrame` attributes. As such, I generally recommend using the second method to avoid confusion.

```
cal_housing.longitude.head()
cal_housing['longitude'].head()
```

In [None]:
longitude_column = cal_housing['longitude']

print(type(longitude_column))

In [None]:
longitude_column.head()
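The difference between the two notations matters when adding a column. The following sketch uses a small made-up `DataFrame` (the name `toy` and its values are assumptions for illustration) to show that both notations read the same data, while a new column is created with index notation.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'width': [2.0, 3.0], 'height': [4.0, 5.0]})

# Both notations return the same Series
same = toy.width.equals(toy['width'])
print(same)

# Creating a new column requires index notation
toy['area'] = toy['width'] * toy['height']
print(toy)
```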

❓ **Try it yourself!** Retrieve the `total_bedrooms` column in the `cal_housing` data frame. Visualize the first few lines using the `.head()` function.

In [None]:
# YOUR CODE HERE

❓ **Try it yourself!** The `Series` object also has a `.sum()` function. What do you think it does? Try using it on the `population` column. The `Series` object also has a `.mean()` function. Try using it on the `population` column as well.

In [None]:
# YOUR CODE HERE

A common data processing step is to filter records based on their values. For example, suppose we want to keep only the records where the `latitude` is smaller than `34`. The syntax for this is the following.

```
cal_housing[cal_housing['latitude'] < 34]
```

This will return a new `DataFrame` object with the filtered records.

In [None]:
cal_housing[cal_housing['latitude'] < 34].head()
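It can help to see that the filter works in two steps. The comparison first produces a `Series` of `True`/`False` values (often called a boolean mask), and indexing the `DataFrame` with that mask keeps only the rows marked `True`. A minimal sketch on made-up coordinates (the names `toy`, `mask`, and `south` are hypothetical):

```python
import pandas as pd

# Made-up coordinates for illustration only
toy = pd.DataFrame({
    'latitude': [32.7, 34.1, 37.8, 33.9],
    'longitude': [-117.2, -118.2, -122.4, -118.4],
})

# Step 1: the comparison yields one True/False value per row
mask = toy['latitude'] < 34
print(mask)

# Step 2: indexing with the mask keeps only the True rows
south = toy[mask]
print(south)
```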

❓ **Try it yourself!** Confirm that the filtering worked by using the `.min()` and `.max()` functions of the `Series` object. Use these to make sure that the maximum latitude is less than 34 in the resulting `DataFrame` object.

In [None]:
# YOUR CODE HERE

## Loading Some Data

Here, we create a `DataFrame` called `cal_housing` from the California housing sample data that Google Colab provides with every notebook. Each row summarizes a small census area (a group of households) rather than a single house, as columns such as `population` and `households` suggest, but for simplicity we will refer to each row as a house. All of the variables are continuous (e.g., `latitude` or `median_income`).

We will create two discrete variables. The first is called `median_age`, and it places each house into one of six bins based on its `housing_median_age`. The second is `id`, which is a unique value for each house.

The `.cut()` function in the `pandas` library is quite useful. It converts a continuous variable into a categorical variable by creating bins.

The `.index` property of a `DataFrame` contains a label for each row. Because we do not specify an index when reading the file, the labels are unique integers starting at zero. Here, we add them as the `id` column to the data frame.

In [None]:
# Read data using the pandas read_csv function
cal_housing = pd.read_csv('sample_data/california_housing_train.csv')

# Create the median_age variable using the pandas cut function
cal_housing['median_age'] = pd.cut(
    cal_housing.housing_median_age, bins=[0, 10, 20, 30, 40, 50, 60],
    labels=['0-10', '10-20', '20-30', '30-40', '40-50', '50-60'],
    include_lowest=True
)

# Store the index of the DataFrame as a variable, which assigns a unique value
# to each house
cal_housing['id'] = cal_housing.index

# Visualize the first few lines
cal_housing.head()
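To see how `pd.cut` treats values that fall exactly on a bin edge, here is a small sketch with made-up ages. By default each bin includes its right edge, so an age of 10 lands in the `'0-10'` bin, and `include_lowest=True` lets a value of 0 fall into the first bin as well.

```python
import pandas as pd

# Made-up ages chosen to sit on and between bin edges
ages = pd.Series([0, 10, 15, 52])

binned = pd.cut(
    ages, bins=[0, 10, 20, 30, 40, 50, 60],
    labels=['0-10', '10-20', '20-30', '30-40', '40-50', '50-60'],
    include_lowest=True
)

print(binned.tolist())
```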

## Bar Plot

The bar plot is great for plotting the distribution of a single, discrete variable. In the California housing data set, we created the `median_age` discrete variable. We will create a bar plot for this variable.

First, we need to process the data to extract the distribution of `median_age`. To get the distribution, we need to **count** the number of houses in each bin. `pandas` provides many helpful functions to accomplish this. First, we tell `pandas` that we are binning by `median_age` using the `.groupby()` function. Next, we tell it to count the instances of `id` in each group using the `.count()` function.

The `reset_index()` function turns the index into a column. Try running the code without the `reset_index()` function to see what it does!

In [None]:
plot_data = cal_housing.groupby('median_age', observed=True)['id'].count()
plot_data = plot_data.reset_index()
plot_data['median_age'] = plot_data['median_age'].astype(str)

plot_data.head()
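The two processing steps can be tried on a tiny made-up table where the counts are easy to check by eye. In this sketch, the column name `age_bin` and the values are assumptions for illustration. It prints the result before and after `reset_index()` so the difference is visible.

```python
import pandas as pd

# Made-up table: six houses in three age bins
toy = pd.DataFrame({
    'age_bin': ['0-10', '0-10', '10-20', '20-30', '20-30', '20-30'],
    'id': [0, 1, 2, 3, 4, 5],
})

# Count the ids in each group; the result is a Series indexed by age_bin
counts = toy.groupby('age_bin')['id'].count()
print(counts)

# reset_index() moves age_bin out of the index into a regular column
table = counts.reset_index()
print(table)
```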

Now, we are ready to visualize the data. `matplotlib` is one of the most widely used data visualization libraries in Python. To see how quickly we can visualize this data, import it and then use three lines of code to create a bar plot!

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Create a new figure that is 8-by-8 inches
plt.figure(figsize=(8, 8))

# Create a bar plot
# Use the median age bins on the horizontal axis
# Set the height of each bar to be the number of houses in each bin
plt.bar(plot_data.median_age, plot_data.id)

# Show the plot
# Not necessary to call this in a Jupyter notebook
plt.show()

❓ **Try it yourself!** The `pyplot` interface has three useful functions to make the graph more readable:

1. `plt.xlabel()`
2. `plt.ylabel()`
3. `plt.title()`

Use the internet to figure out what these functions do, and use them to make your plot more readable!

In [None]:
# YOUR CODE HERE

❓ **Try it yourself!** The `plt.bar` function (like many other plotting functions) has a `color` argument that can be used to change the color of the graphic elements. Some colors are available by name, such as `'red'` or `'green'`. You can also use hex codes for RGB colors, such as `'#8C1515'`. Use this argument to change the color of the bars.

In [None]:
# YOUR CODE HERE

❓ **Try it yourself!** The `color` argument can also take a list of colors, one per bar. Color the first two bars, the middle two bars, and the last two bars so that each pair has its own color.

In [None]:
# YOUR CODE HERE

❓ **Try it yourself!** Another method to visualize bars is using the `hatch` argument in the `plt.bar` function. Try setting the `hatch` to `'///'` and see what happens!

In [None]:
# YOUR CODE HERE

## Histogram

The histogram is an extension of the bar plot for continuous variables. We will look at the `median_house_value` variable in the California housing data.

A histogram also bins the data, but does so automatically for you. The goal of a histogram is to visualize the actual distribution of the data using an approximation. A bar plot, in contrast, assumes a fixed set of discrete categories.

In [None]:
plt.figure(figsize=(8, 8))

plt.hist(cal_housing.median_house_value)

plt.xlabel('Median House Value')
plt.ylabel('Number of Houses')
plt.title('Distribution of Median House Value in California')
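To make the automatic binning concrete, the sketch below reproduces the counting step with `np.histogram` on made-up values. It assumes the `matplotlib` default of 10 equal-width bins spanning the minimum to the maximum of the data; the array `values` is hypothetical.

```python
import numpy as np

# Made-up values for illustration only
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 11])

# Split the range of the data into 10 equal-width bins and count each one
counts, edges = np.histogram(values, bins=10)

print(edges)   # 11 edges define 10 bins
print(counts)  # number of values that fall in each bin
```

Every value is counted exactly once, so the counts always add up to the number of values. Only the number and width of the bins change how the distribution looks.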

❓ **Try it yourself!** The bin with a lot of houses at the very tail of the distribution might be an artifact. To check, use the `bins` argument in the `plt.hist()` function. Set the bins to `50` to visualize the distribution better.

In [None]:
# YOUR CODE HERE

This plot should show that there are many houses with a median house value of around 500,000. This probably means that any median house value larger than 500,000 was capped at this value. We want to plot the distribution without those values.

❓ **Try it yourself!** Filter the `cal_housing` data frame to `median_house_value` less than 500,000. Plot the histogram of `median_house_value` after filtering.

In [None]:
# YOUR CODE HERE

❓ **Try it yourself!** Change the color of histogram bars using the `color` argument in `plt.hist()`.

In [None]:
# YOUR CODE HERE

❓ **Try it yourself!** Plot the distribution of the `latitude` column using a histogram. Can you explain the pattern?

In [None]:
# YOUR CODE HERE

![Map of California with latitude and longitude lines](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.mapsofworld.com%2Fusa%2Fstates%2Fcalifornia%2Fmaps%2Fcalifornia-lat-long-map.jpg&f=1&nofb=1&ipt=de5aa3a686ab5b16d19c2c80ccc86f70fc36052fbafc1cd9f7e915aa885857e7&ipo=images)

In [None]:
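# Create a tall, narrow figure to roughly match the shape of California
plt.figure(figsize=(4, 8))

# Plot each record by location; small, transparent points reveal dense areas
plt.scatter(cal_housing.longitude, cal_housing.latitude, s=10, alpha=0.2, color='black')

# Mark latitudes 34 and 38, near Los Angeles and the San Francisco Bay Area
plt.axhline(34)
plt.axhline(38)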

## Saving Figures

Saving high-quality figures is important for publication. Whenever possible, export figures in a **vector graphics format**. Vector formats store shapes and text rather than pixels, so the figure stays sharp at any size and remains editable in other software, outside of the Python environment.

In this course, we will use the portable document format (PDF). Other examples of vector graphics formats include scalable vector graphics (SVG) and encapsulated PostScript (EPS).

In [None]:
plt.figure(figsize=(8, 8))

plt.hist(cal_housing.median_house_value)

plt.xlabel('Median House Value')
plt.ylabel('Number of Houses')
plt.title('Distribution of Median House Value in California')

plt.savefig('figure.pdf')
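`plt.savefig` picks the output format from the file extension, so the same figure can be written to more than one vector format. A minimal sketch with made-up data; the file names `example.pdf` and `example.svg` are placeholders:

```python
import os

import matplotlib.pyplot as plt

# Made-up data for illustration only
fig = plt.figure(figsize=(4, 4))
plt.hist([1, 2, 2, 3, 3, 3, 4, 4, 5])
plt.xlabel('Value')
plt.ylabel('Count')

# The file extension determines the output format
for extension in ['pdf', 'svg']:
    plt.savefig('example.' + extension)

# List which of the expected files now exist
saved = [name for name in ['example.pdf', 'example.svg'] if os.path.exists(name)]
print(saved)
```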

There are some cases where a vector graphics format might not be appropriate. For example, if you are plotting millions of points (as when analyzing single-cell RNA-seq data), the vector graphics file will be very large and slow to open. In such cases, use a raster format such as portable network graphics (PNG) or joint photographic experts group (JPEG).

In [None]:
plt.figure(figsize=(8, 8))

plt.hist(cal_housing.median_house_value)

plt.xlabel('Median House Value')
plt.ylabel('Number of Houses')
plt.title('Distribution of Median House Value in California')

plt.savefig('figure.png')
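A raster file has a fixed pixel size, set by the figure size in inches multiplied by the `dpi` argument of `plt.savefig`. The sketch below assumes a resolution of 300 dpi, a value often requested for print, and uses made-up data and a placeholder file name.

```python
import matplotlib.pyplot as plt

# Made-up data for illustration only
fig = plt.figure(figsize=(4, 3))
plt.hist([1, 2, 2, 3, 3, 3, 4, 4, 5])

# 4 x 3 inches at 300 dots per inch gives a 1200 x 900 pixel image
plt.savefig('example_300dpi.png', dpi=300)

# Read the file back to check its size in pixels
height, width = plt.imread('example_300dpi.png').shape[:2]
print(width, height)
```

If the journal or printer specifies a different resolution, change `dpi` accordingly. The figure size in inches stays the same while the pixel count scales.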