# Data Visualization #

Here we will visualize some data!

We will be using **pandas** to load the data from file, and **matplotlib** to visualize it. More on pandas later.

Make sure you have downloaded the files population.csv[<sup>1</sup>](#fn1) and gdp_and_life_expectancy.csv[<sup>2</sup>](#fn2) from Canvas and put them in the same folder as this Jupyter notebook.

[<sup id="fn1">1</sup>]("fn1") United Nations, Department of Economic and Social Affairs, Population Division (2024). World Population Prospects 2024, Online Edition.

[<sup id="fn2">2</sup>]("fn2") World Development Indicators (WDI) database, published by the World Bank; World regions and names from Our World in Data: OurWorldinData.org/world-region-map-definitions

pandas and matplotlib are modules, so we need to import them.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# this lets us show plots in the jupyter notebook
%matplotlib inline

## Line plot ##

Load the population.csv file.

In [None]:
world_population = pd.read_csv("population.csv")

Let's look at some information about it.

In [None]:
world_population

In [None]:
world_population.info()

In [None]:
world_population.head()

In [None]:
plt.plot(world_population["Year"], world_population["Population"])
plt.show()

Let's add axis labels and a title!

In [None]:
plt.plot(world_population["Year"], world_population["Population"])
plt.xlabel("Year")
plt.ylabel("Population")
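# Assumption: the Population column is given in thousands of people,
# so 1,000,000 units in the data correspond to one billion people ("1B").
million = 1000000
plt.yticks([0, 2*million, 4*million, 6*million, 8*million, 10*million, 12*million], ["0", "2B", "4B", "6B", "8B", "10B", "12B"])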
plt.title("World population")

plt.savefig("world_pop.png")
plt.show()

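The `plt.savefig` call in the cell above saves the figure to the file world_pop.png. It should come before `plt.show()`, because once the figure has been shown the saved image may turn out empty.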
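The tick labels in the cell above were typed in by hand. As an alternative, here is a minimal sketch that lets matplotlib build the labels with `FuncFormatter`. The helper name `billions` and the plotted numbers are made up for illustration, and the sketch assumes the values are given in thousands of people.

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def billions(value, position):
    """Format a tick value given in thousands of people as billions."""
    # Assumption: 1,000,000 (thousand people) equals one billion people.
    return f"{value / 1_000_000:g}B"


# Made-up toy values, only there to have something to draw.
toy_years = [1950, 2000, 2050]
toy_population = [2_000_000, 6_000_000, 9_000_000]

fig, ax = plt.subplots()
ax.plot(toy_years, toy_population)
ax.yaxis.set_major_formatter(FuncFormatter(billions))  # label every y tick via billions()
plt.show()
```

With this approach the labels stay correct even if matplotlib picks different tick positions.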

Let's try and figure out when the population starts decreasing.
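One possible way to do this is to look for the year with the largest population: the decrease starts right after that peak. The sketch below uses a tiny made-up table called `toy_population` (not the real data) with the same column names as population.csv.

```python
import pandas as pd

# Made-up stand-in for world_population; the real values come from population.csv.
toy_population = pd.DataFrame({
    "Year": [2070, 2080, 2090, 2100],
    "Population": [100, 120, 115, 110],
})

# idxmax() returns the index label of the row with the largest population.
peak_index = toy_population["Population"].idxmax()
peak_year = int(toy_population.loc[peak_index, "Year"])

print("Population peaks in", peak_year)
```

To try it on the notebook's data, replace `toy_population` with `world_population`.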

## Another plot ##

Now let's make a graph of the relationship between life expectancy and GDP per capita in 2023. (Inspired by the picture from https://www.gapminder.org/answers/how-does-income-relate-to-life-expectancy/)

In [None]:
from IPython.display import Image

# display the reference picture (expected in the same folder as this notebook)
Image(url="world_development.png")

In [None]:
df = pd.read_csv("gdp_and_life_expectancy.csv")

Let's take a look at the data!

In [None]:
df

In [None]:
df.info()

In [None]:
df.head()

The columns le_1990 and le_2023 contain the **life expectancy** in 1990 and 2023, respectively. The columns gdp_1990 and gdp_2023 contain the **GDP per capita (in USD)** in 1990 and 2023, respectively.

In [None]:
plt.plot(df.gdp_2023, df.le_2023)
plt.show()

## Scatter Plot ##

Oops. **plot** connects the points with lines in the order they appear in the data, so it only makes sense when the points are in some kind of order (like in a time series). Let's make a scatter plot instead!

In [None]:
plt.scatter(df.gdp_2023, df.le_2023)
plt.show()

Let's change the x-axis to be a log scale.

In [None]:
plt.scatter(df.gdp_2023, df.le_2023)
plt.semilogx()
plt.show()
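Why does a log scale help here? On a logarithmic axis the position of a point follows the logarithm of its value, so multiplying the GDP by 10 always moves the point the same distance to the right. A small sketch with made-up GDP values:

```python
import math

# Made-up GDP per capita values, each 10 times larger than the previous one.
gdp_values = [1_000, 10_000, 100_000]

# On a log axis, the position is proportional to log10 of the value.
positions = [math.log10(value) for value in gdp_values]

# Distance between neighbouring points: the same for every factor of 10.
steps = [b - a for a, b in zip(positions, positions[1:])]

print(positions)
print(steps)
```

On a linear axis the countries with a low GDP per capita would be squeezed together on the left; the log scale spreads them out.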

Let's add a title and axis labels.

In [None]:
plt.scatter(df.gdp_2023, df.le_2023)
plt.semilogx()
plt.xlabel("GDP per capita [USD]")
plt.ylabel("Life expectancy [years]")
plt.title("Relationship between life expectancy and GDP per capita")
plt.show()

Looks pretty nice! (We will get back to it in a bit!)

## Histogram ##
Let's make a histogram of the life expectancy in 2023!

In [None]:
plt.hist(df.le_2023, edgecolor="black")
plt.show()

### Bins ###

The number of bins is important. Too many bins make the plot noisy and hard to read; too few hide the shape of the distribution. The default value is 10, but let's try some other values.

In [None]:
plt.hist(df.le_2023, bins=5, edgecolor="black")
plt.show()
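To see what the number of bins does without drawing anything, we can count the values per bin with `np.histogram`. This is a small sketch on made-up numbers (the array `values` is not taken from the data file).

```python
import numpy as np

# Made-up life expectancy values, for illustration only.
values = np.array([52, 55, 61, 64, 66, 70, 71, 73, 74, 76, 78, 79, 80, 82, 84])

for bins in (3, 5, 10):
    # counts: number of values in each bin; edges: the bin boundaries (bins + 1 of them)
    counts, edges = np.histogram(values, bins=bins)
    print(bins, "bins:", counts, "total:", counts.sum())
```

Every value always lands in exactly one bin, so the total stays the same; only how finely the values are split up changes.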

We can compare it with the life expectancy in 1990 (in column le_1990).

In [None]:
import numpy as np
print(np.histogram(df.le_1990, bins=5))
print(np.histogram(df.le_2023, bins=5))

In [None]:
plt.hist(df.le_2023, bins=5, alpha=0.5, label="2023")
plt.hist(df.le_1990, bins=5, alpha=0.5, label="1990")
plt.legend()  # show which color belongs to which year
plt.savefig("test2.png")
plt.show()
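One thing to keep in mind: with `bins=5`, each histogram gets its own bin edges, based on the smallest and largest value of that column, so the bars of the two years do not line up exactly. A possible fix is to compute one set of edges for both columns. The sketch below uses made-up arrays `le_old` and `le_new` in place of the real columns.

```python
import numpy as np

# Made-up values standing in for df.le_1990 and df.le_2023.
le_old = np.array([45, 50, 58, 62, 65, 68, 70, 72, 75, 77])
le_new = np.array([55, 60, 66, 70, 73, 75, 78, 80, 82, 84])

# One set of 5 bins covering the full range of both arrays.
shared_edges = np.histogram_bin_edges(np.concatenate([le_old, le_new]), bins=5)

counts_old, _ = np.histogram(le_old, bins=shared_edges)
counts_new, _ = np.histogram(le_new, bins=shared_edges)

print(shared_edges)
print(counts_old)
print(counts_new)
```

The same edges can be passed to matplotlib, for example `plt.hist(df.le_2023, bins=shared_edges, alpha=0.5)`, so that both histograms share the same bins.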

### Now on to the exercise! ###