# Lecture 2-3: Data analysis and visualization

Despite the central place of data in scientific work, we often don't think much about how to properly organize or store it. Today we'll discuss the organizing principles of [tidy data](https://vita.had.co.nz/papers/tidy-data.pdf), an increasingly common approach to organizing data that makes it easy to understand and to analyze.

For our examples below, we'll use the Python library `pandas`, which [stores data](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html) in a structure called a `DataFrame`. As we'll see below, it's very easy to manipulate data in this format.
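
As a minimal sketch of what "tidy" means in practice, the toy tables below hold the same made-up plant heights in two layouts. The column names (`plant`, `week`, `height_cm`) and all values are invented for illustration and do not come from any dataset used in this lecture.

```python
import pandas as pd

# Wide layout: the "week" variable is hidden inside the column names
wide = pd.DataFrame({
    "plant": ["a", "b"],
    "week_1": [3.0, 2.0],
    "week_2": [4.5, 3.5],
})

# Tidy layout: one row per (plant, week) observation, one column per variable
tidy = pd.DataFrame({
    "plant": ["a", "a", "b", "b"],
    "week": [1, 2, 1, 2],
    "height_cm": [3.0, 4.5, 2.0, 3.5],
})

print(wide)
print(tidy)

# In the tidy layout every height lives in one column,
# so a question about all heights is a single column operation
overall_mean = tidy["height_cm"].mean()
print(overall_mean)
```

In the wide layout the same question would require knowing which columns hold heights and combining them by hand.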

### Example: Examining the `iris` dataset from `seaborn`

As a first test, let's load a precompiled dataset called `iris` from the `seaborn` package, which we've used extensively for plotting. This dataset is stored in a "tidy" format. We'll start by just looking at the rough properties of the data.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plot
import pandas as pd
import numpy as np
%matplotlib inline

df = sns.load_dataset("iris")
df.head(20)

In [None]:
df.columns

In [None]:
len(df)

### Showing the relationships between variables

As we can see from the column names, `iris` describes measurements from different species of iris flowers. We can look more closely to see how different variables are related. For example, what is the relationship between the length of the petals and sepals?

In [None]:
sns.jointplot(x='petal_length', y='sepal_length', data=df)

### Selecting out subsets

In the plot above, all species were included together. What happens if we look at just a single species -- is the relationship different? We can explore this by using `pandas` to select the subset of the data that corresponds to a single species and then repeating the plot.

In [None]:
print(np.unique(df.species))

In [None]:
df_setosa = df[df.species == 'setosa']
df_versicolor = df[df.species == 'versicolor']
df_virginica = df[df.species == 'virginica']

sns.jointplot(x='petal_length', y='sepal_length', data=df_setosa)
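
To see what the selection step is doing, here is a small sketch on a hypothetical stand-in for `iris` (the three rows and their values are invented). The comparison builds a boolean mask, and indexing with that mask keeps only the matching rows.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `iris` (values invented for illustration)
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica"],
    "petal_length": [1.4, 1.3, 6.0],
})

# The comparison returns a boolean Series with one True/False per row
mask = toy["species"] == "setosa"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
toy_setosa = toy[mask]
print(len(toy_setosa))
```

Writing `toy.species` instead of `toy["species"]` selects the same column; the bracket form also works for column names that contain spaces.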

### Visualizing many variables at once

We can also use `seaborn` to show how many variables in the data are related to one another. What other questions might you ask about this data? Because the data is stored in a simple format, it's easy to quickly analyze the data using `pandas`.

In [None]:
sns.pairplot(df, hue="species")
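
One example of a quick follow-up question is "what is the average measurement for each species?" The sketch below answers it with `groupby` on a hypothetical four-row table (values invented for illustration); the same call pattern should work on the full `iris` data frame.

```python
import pandas as pd

# Hypothetical four-row table with the same kind of columns as `iris`
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.0, 2.0, 5.0, 6.0],
    "sepal_length": [5.0, 5.0, 6.0, 7.0],
})

# Split the rows by species, then average every numeric column within each group
summary = toy.groupby("species").mean()
print(summary)
```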

### A quick test on another dataset

The `planets` dataset contains information about recently discovered planets. We can use this to explore, for example, how the distance of the planet is related to the year in which it was discovered, or how methods of discovery have changed over time. See [here](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html) for some more examples.

In [None]:
df = sns.load_dataset("planets")
df.head(10)

In [None]:
sns.jointplot(x='year', y='distance', data=df)
plot.yscale('log')

In [None]:
sns.catplot(x='year', kind='count', aspect=4, data=df, hue='method')

### Example: Tidying the TB dataset

Earlier, we saw an example of a dataset about rates of tuberculosis in different countries that was very hard to interpret and to work with. Here, we'll use `pandas` to load the dataset and tidy it for analysis. This exercise follows the description [here](http://www.jeannicholashould.com/tidy-data-in-python.html).

FYI: `iso2` refers to the [two-letter country code](https://www.nationsonline.org/oneworld/country_code_list.htm) for different countries. This dataset was collected by the [WHO](https://www.who.int/).

In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/hadley/tidy-data/master/data/tb.csv')
df.head(10)

In [None]:
# A first step: 'melting' the dataset to extract the sex and age range
# In this process all column names except for 'iso2' and 'year' are stored as 'sex_and_age'
# All values are stored as 'cases'

df = pd.melt(df, id_vars=["iso2", "year"], value_name="cases", var_name="sex_and_age")
df.head(10)
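
To make the effect of `pd.melt` easier to see, here is a sketch on a hypothetical two-row table shaped like the TB data. The country codes `AA` and `BB` and the counts are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table shaped like the TB data (codes and counts invented)
wide = pd.DataFrame({
    "iso2": ["AA", "BB"],
    "year": [2000, 2000],
    "m014": [1, 3],
    "f014": [2, 4],
})

# Keep `iso2` and `year` as identifiers; every other column becomes
# a (sex_and_age, cases) pair on its own row
long = pd.melt(wide, id_vars=["iso2", "year"],
               var_name="sex_and_age", value_name="cases")
print(long)
```

The two rows and two value columns of the wide table become four rows, one per country and column name.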

In [None]:
# Parse the column names to extract sex, age lower and upper bounds
# (a raw string keeps Python from interpreting the backslashes in the pattern)
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})")

# Name the extracted columns
tmp_df.columns = ["sex", "age_lower", "age_upper"]

# Create a single `age` column using `age_lower` and `age_upper`
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]

# Merge the data frames together
df = pd.concat([df, tmp_df], axis=1)

df.head()
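
The sketch below applies the same pattern to a hypothetical list of column names, to show how each name is split. The names are chosen for illustration; whether names like `m04` or `mu` appear in the real data is an assumption to check against `df.columns`.

```python
import pandas as pd

# Hypothetical column names in the style of the TB data
names = pd.Series(["m014", "f1524", "m04", "mu"])

# (\D) one non-digit, (\d+) one or more digits, (\d{2}) exactly two digits,
# so a name needs at least three digits after the letter to match
parts = names.str.extract(r"(\D)(\d+)(\d{2})")
parts.columns = ["sex", "age_lower", "age_upper"]
print(parts)
```

Names with fewer than three digits do not match the pattern and come back as missing values, so rows built from such names would be removed by a later `dropna()`.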

In [None]:
# Drop the helper columns, then drop any rows that contain missing values
df = df.drop(['sex_and_age', 'age_lower', 'age_upper'], axis=1)
df = df.dropna()

# Rename `iso2` to `country`
df = df.rename(index=str, columns={'iso2': 'country'})

# Sort the data frame
df = df.sort_values(["country", "year", "sex", "age"], ascending=True)
df.head(10)

### Examining the dataset

Now, it is much easier for us to analyze the data. For example, we can check: how many cases of tuberculosis were there in the US among men? How does the number of cases vary over time?

In [None]:
df_US = df[(df.country == 'US') & (df.sex == 'm')]
df_US.head(20)

In [None]:
sns.lineplot(x='year', y='cases', data=df_US, hue='age')
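
If we want a single number per year rather than one line per age group, one option is to sum over the age groups. The sketch below does this on a hypothetical tidy table with the same columns as the cleaned TB data (the country code `AA` and the counts are invented).

```python
import pandas as pd

# Hypothetical tidy table with the same columns as the cleaned TB data
tidy = pd.DataFrame({
    "country": ["AA", "AA", "AA", "AA"],
    "year": [2000, 2000, 2001, 2001],
    "sex": ["m", "m", "m", "m"],
    "age": ["0-14", "15-24", "0-14", "15-24"],
    "cases": [1.0, 2.0, 3.0, 4.0],
})

# Combine two conditions with `&`; each condition needs its own parentheses
subset = tidy[(tidy["country"] == "AA") & (tidy["sex"] == "m")]

# Add up the cases across age groups within each year
per_year = subset.groupby("year")["cases"].sum()
print(per_year)
```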