# Introduction to data analysis with pandas

NICAR 2022, Jonathan Soma / js4571@columbia.edu / [@dangerscarf](https://twitter.com/dangerscarf)

We'll be working with a very simple, very incorrect, very boring dataset called `countries.csv`.

## Download the file we're going to analyze

In [None]:
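import requests

# Download the CSV and save it next to this notebook
url = "https://raw.githubusercontent.com/jsoma/NICAR22-pandas/main/simple-data/countries.csv"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
with open('countries.csv', 'w', encoding='utf-8') as f:
    f.write(response.text)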

## Read in our data

We're going to be reading in a file called `countries.csv`.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("countries.csv")
df.head()
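
Before sorting or summarizing, it can help to check what `read_csv` actually produced. Here's a minimal sketch that does that on a tiny made-up table instead of the real file. The rows in `csv_text` are invented for illustration, and the column names are assumed to match the ones used in the rest of this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for countries.csv: these rows are invented for illustration
csv_text = """country,continent,population,life_expectancy,gdp
Aland,Africa,1000,60.5,50000
Borland,Asia,2000,70.0,300000
Corland,Africa,500,55.0,10000
"""

# read_csv accepts any file-like object, not just a filename
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape         # (number of rows, number of columns)
column_types = sample.dtypes          # one data type per column
missing_counts = sample.isna().sum()  # number of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_counts)
```

The same three checks should work on the real `df`, which is a quick way to spot numbers that were read in as text.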

## Sorting

In [None]:
# Sort from smallest population to largest (the default)
df.sort_values(by='population')

In [None]:
# Sort from largest population to smallest
df.sort_values(by='population', ascending=False)

## Summary statistics

In [None]:
# What's the median life expectancy?
df.life_expectancy.median()

In [None]:
# What's the sum of every row's population?
df.population.sum()

In [None]:
# What's the median population?
df.population.median()

In [None]:
# Let's get a lot of different calculations
df.life_expectancy.describe()

In [None]:
# How many countries are on each continent?
df.continent.value_counts()
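
`value_counts()` gives raw counts. If you'd rather see each continent's share of the total, one option is the `normalize=True` argument. This is a small sketch on an invented list of continents, not the real data.

```python
import pandas as pd

# Hypothetical data: four made-up rows, invented for illustration
sample = pd.DataFrame({
    "continent": ["Africa", "Africa", "Asia", "Europe"],
})

# Raw counts, largest first
counts = sample.continent.value_counts()

# Same thing, but as a fraction of all rows
shares = sample.continent.value_counts(normalize=True)

print(counts)
print(shares)
```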

### Plot it!

In [None]:
# How many countries are on each continent?
df.continent.value_counts().plot(kind='barh')

In [None]:
# How many countries are on each continent?
# Plot with biggest on top
df.continent.value_counts().sort_values().plot(kind='barh')

## Grouped statistics

In [None]:
# Median life expectancy by continent
df.groupby('continent').life_expectancy.median()

In [None]:
# Total population by continent
df.groupby('continent').population.sum()
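
Each of the grouped calculations above returns one number per continent. If you want several at once, a possible approach is `agg` with named outputs. The table below is invented for illustration, and the output names (`median_life_expectancy`, `total_population`, `countries`) are just labels picked for this sketch.

```python
import pandas as pd

# Hypothetical data: made-up countries, invented for illustration
sample = pd.DataFrame({
    "country": ["Aland", "Borland", "Corland", "Dorland"],
    "continent": ["Africa", "Asia", "Africa", "Asia"],
    "population": [1000, 2000, 500, 4000],
    "life_expectancy": [60.0, 70.0, 50.0, 80.0],
})

# Each keyword is a new column name; each tuple is (source column, calculation)
summary = sample.groupby("continent").agg(
    median_life_expectancy=("life_expectancy", "median"),
    total_population=("population", "sum"),
    countries=("country", "count"),
)

print(summary)
```

The result is a dataframe with one row per continent, so it can be sorted or plotted like any other.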

## Calculating new columns

In [None]:
# Calculate the per-capita GDP
df['per_capita_gdp'] = df.gdp / df.population
df.head()

In [None]:
# Who has the highest per-capita GDP?
df.sort_values(by='per_capita_gdp', ascending=False).head()
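
Dividing one column by another works row by row, so every country gets its own result. Here's a sketch of that on an invented three-row table, plus `nlargest` as an alternative to sorting and then taking the top rows. The numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data: made-up countries, invented for illustration
sample = pd.DataFrame({
    "country": ["Aland", "Borland", "Corland"],
    "population": [1000, 2000, 500],
    "gdp": [50000, 300000, 10000],
})

# The division happens once per row: 50000/1000, 300000/2000, 10000/500
sample["per_capita_gdp"] = sample.gdp / sample.population

# The two rows with the highest per-capita GDP, highest first
top_two = sample.nlargest(2, "per_capita_gdp")

print(top_two)
```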

## Graphing

In [None]:
# Plot the five countries with the highest per-capita GDP
df.sort_values(by='per_capita_gdp', ascending=False).head().plot(y='per_capita_gdp', x='country', kind='barh')

## Filtering

In [None]:
df[df.continent == 'Africa']

In [None]:
africa = df[df.continent == 'Africa']

In [None]:
africa.sort_values(by='per_capita_gdp', ascending=False).head()

In [None]:
africa.sort_values(by='per_capita_gdp').plot(
    x='country',
    y='per_capita_gdp',
    kind='barh',
    figsize=(8,12)
)
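
The filtering cells above rely on `df.continent == 'Africa'` producing a True/False value for every row, which pandas then uses to keep only the True rows. Here's a sketch of that on an invented table, including one way to combine two conditions with `&`. The data and the 800 cutoff are made up for illustration.

```python
import pandas as pd

# Hypothetical data: made-up countries, invented for illustration
sample = pd.DataFrame({
    "country": ["Aland", "Borland", "Corland"],
    "continent": ["Africa", "Asia", "Africa"],
    "population": [1000, 2000, 500],
})

# Each comparison gives one True/False per row
is_africa = sample.continent == "Africa"
is_large = sample.population > 800  # arbitrary cutoff for this sketch

# & means "and": keep rows where both conditions are True
large_african = sample[is_africa & is_large]

print(is_africa.tolist())
print(large_african)
```

When writing the conditions inline instead of saving them to variables, each one needs its own parentheses, as in `sample[(sample.continent == "Africa") & (sample.population > 800)]`.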