# Data analysis basics with pandas

## Installation

Even though we have Python installed, we still need to install some extra pieces of software! Much of Python's power comes from its ecosystem of packages (also called libraries or modules) that are made by other people or companies.

In [None]:
# First we need to download some things!
# Run this cell to get the necessary data and software
import os
import urllib.request
import zipfile

# Install required packages
# %pip (rather than !pip) installs into the environment this notebook is running in
%pip install -q pandas altair lxml tqdm requests

# Download and extract data files
url = 'https://github.com/jsoma/2025-birn/raw/main/docs/01-pandas-data.zip'
print(f'Downloading data from {url}...')
urllib.request.urlretrieve(url, '01-pandas-data.zip')

print('Extracting 01-pandas-data.zip...')
with zipfile.ZipFile('01-pandas-data.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

os.remove('01-pandas-data.zip')
print('✓ Data files extracted!')
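
If you want to double-check that the install worked, here is a minimal sketch that only uses Python's standard library. It assumes the import name of each package matches the name passed to pip, which is true for the five packages above but not for every package.

```python
import importlib.util

# Import names for the packages installed above (here they match the pip names)
packages = ["pandas", "altair", "lxml", "tqdm", "requests"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in packages if importlib.util.find_spec(name) is None]

if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All packages are importable!")
```

If anything is listed as missing, re-running the install cell (and restarting the kernel) is usually the first thing to try.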

## Using pandas

To use pandas, we first need to **import it**. Then we can go ahead with reading in our data and analyzing it.

In [None]:
import pandas as pd

# This creates a "dataframe" - the pandas version of a spreadsheet
df = pd.read_csv("countries.csv")
df

In [None]:
df.head()

In [None]:
df.head(2)

In [None]:
df.tail()

In [None]:
df.sort_values(by='gdp')

In [None]:
df.sort_values(by='gdp', ascending=False)

In [None]:
df.sort_values('life_expectancy', ascending=False).head(10)

In [None]:
df.head(10).sort_values('life_expectancy', ascending=False)
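
The last two cells look similar but answer different questions, because the order of `.sort_values()` and `.head()` matters. Here is a small sketch of the difference; the `demo` dataframe and its numbers are made up for illustration and are not from `countries.csv`.

```python
import pandas as pd

# Made-up numbers, only for illustration
demo = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "life_expectancy": [60, 85, 70, 90],
})

# Sort everything first, then keep the top 2: the 2 highest values overall
top_overall = demo.sort_values("life_expectancy", ascending=False).head(2)

# Keep the first 2 rows first, then sort: only rows A and B, reordered
first_rows_sorted = demo.head(2).sort_values("life_expectancy", ascending=False)

print(top_overall["country"].tolist())        # ['D', 'B']
print(first_rows_sorted["country"].tolist())  # ['B', 'A']
```

If you want "the top 10", sort first and take the head second.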

In [None]:
df['life_expectancy']

In [None]:
df['life_expectancy'].median()

In [None]:
df['life_expectancy'] > 75

In [None]:
df[df['life_expectancy'] > 75]
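
The two cells above work as a pair: the comparison produces a column of `True`/`False` values, and putting that column inside `df[...]` keeps only the rows marked `True`. A minimal sketch with made-up data (the `demo` dataframe is hypothetical), including one way to combine two conditions:

```python
import pandas as pd

# Made-up data, only for illustration
demo = pd.DataFrame({
    "country": ["A", "B", "C"],
    "continent": ["Asia", "Europe", "Asia"],
    "life_expectancy": [60, 80, 78],
})

# Step 1: the comparison gives one True/False per row
mask = demo["life_expectancy"] > 75
print(mask.tolist())  # [False, True, True]

# Step 2: using it inside [] keeps only the True rows
over_75 = demo[mask]
print(over_75["country"].tolist())  # ['B', 'C']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
asia_over_75 = demo[(demo["life_expectancy"] > 75) & (demo["continent"] == "Asia")]
print(asia_over_75["country"].tolist())  # ['C']
```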

In [None]:
df['continent'].value_counts()

In [None]:
df['continent'].unique()

In [None]:
df[df['continent'] == 'Europe']

In [None]:
df[df['continent'] == 'Europe'].sort_values(by='life_expectancy', ascending=False)

In [None]:
df[df['continent'] == 'Europe']['life_expectancy'].median()

In [None]:
df['life_expectancy'].describe()

In [None]:
df.groupby('continent')['life_expectancy'].median()

In [None]:
df.groupby('continent')['life_expectancy'].median().reset_index()

In [None]:
df.groupby('continent').agg({
    'life_expectancy': 'median',
    'gdp': 'max'
})
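
The dictionary form of `.agg()` reuses the original column names in its output. If you would rather name the result columns yourself, pandas also supports "named aggregation". This is a sketch on made-up data; the `demo` dataframe and the output column names are my own choices, not something from the dataset.

```python
import pandas as pd

# Made-up data, only for illustration
demo = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia"],
    "life_expectancy": [80, 82, 75],
    "gdp": [100, 300, 200],
})

# new_column_name=(source_column, aggregation)
summary = demo.groupby("continent").agg(
    median_life_expectancy=("life_expectancy", "median"),
    max_gdp=("gdp", "max"),
).reset_index()  # turn 'continent' back into a normal column

print(summary)
```

The result has one row per continent, with columns `continent`, `median_life_expectancy` and `max_gdp`.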

In [None]:
# Do some math with two columns and save the result as a new column
df['gdp_per_capita'] = df['gdp'] / df['population']
df.head(2)

## Saving

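When you save your CSV, you almost always want to include `index=False`. If you don't, pandas also writes the index (the row numbers) as an extra first column with no name. It shows up as `Unnamed: 0` when the file is read back in, which is irritating to you and your coworkers!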

In [None]:
df.to_csv("output.csv", index=False)
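
To see what goes wrong without `index=False`, here is a small sketch that saves a made-up dataframe both ways and reads each version back in. It writes to in-memory buffers with `io.StringIO` instead of real files, purely so the example leaves nothing on disk; the `demo` dataframe is hypothetical.

```python
import io
import pandas as pd

# Made-up data, only for illustration
demo = pd.DataFrame({"country": ["A", "B"], "population": [10, 20]})

# Save once with the default settings (the index is included)...
with_index = io.StringIO()
demo.to_csv(with_index)

# ...and once with index=False
without_index = io.StringIO()
demo.to_csv(without_index, index=False)

# Rewind both buffers so they can be read from the start
with_index.seek(0)
without_index.seek(0)

cols_with_index = list(pd.read_csv(with_index).columns)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'country', 'population']
print(cols_without_index)  # ['country', 'population']
```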

## Graphing

There's a good way to graph and a bad way to graph: the default behind pandas' built-in `.plot()` is [matplotlib](https://matplotlib.org/), which is 100% the worst. A great alternative is [Altair](https://altair-viz.github.io/gallery/index.html), which is more useful and produces prettier (and interactive!) graphics.

In [None]:
df.plot(x='gdp_per_capita', y='life_expectancy', kind='scatter')

In [None]:
import altair as alt

alt.Chart(df).mark_circle(size=50).encode(
    x='gdp_per_capita',
    y='life_expectancy',
    color='continent',
    tooltip=['country', 'continent', 'life_expectancy', 'population']
).properties(
    width=800,
    height=300
).interactive()