In [None]:
# | execute: false

# UPPP 135 Week 2: Working with ***Data***

<a target="_blank" href="https://colab.research.google.com/github/knaaptime/uppp135-winter26-assn/week2/00_geodata.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

we gain access to a wide library of additional functionality by importing *packages*, which are bundles of code we can reuse. After you `import` a package, you interact with its contents (members) using "dot notation" (tab completion is your friend).

it's common to *alias* a package when you import it (for example, `import numpy as np`) so that you can access its functionality with fewer keystrokes.

Here we use `numpy` for numerical computing, `pandas` for representing data in tabular form, and `geopandas`, which extends pandas for geospatial operations. We also import `filesystem` from `fsspec`, which we use later to read a file over the web.

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
from fsspec import filesystem
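
as a small illustration of aliasing and dot notation, the sketch below imports `numpy` under its alias and reaches into it with dots. The array values are made up for the example.

```python
import numpy as np  # "np" is the alias; numpy's members are now reachable as np.<name>

# np.array is a function that lives inside the numpy package
values = np.array([2, 4, 6, 8])

# objects have members too: `mean` is a method (note the parentheses),
# while `shape` is an attribute (no parentheses)
average = values.mean()
dimensions = values.shape

print(average, dimensions)
```

typing `np.` or `values.` and then pressing tab in a notebook should list the members you can use.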

## Pandas for Tabular Data

pieces of text are called *strings* and we need to enclose them in quotation marks.

When tabular data are stored in a common format like csv, xlsx, or json, we can "read" them into the computer's memory to create a *dataframe* that represents our table. Most of the time we can read the data right from a web address without downloading the file separately.

In [None]:
schedule_url="https://github.com/knaaptime/uppp135-winter26-assn/raw/refs/heads/main/week2/_schedule_table.csv"

schedule = pd.read_csv(schedule_url)
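
to see what `read_csv` is doing without depending on the network, here is a minimal sketch that reads a tiny csv written directly in the cell. The table contents are invented for illustration and are not the real class schedule.

```python
import io

import pandas as pd

# a tiny made-up table written as csv text: the first line holds the column names
csv_text = """Week,Topic
1,Introduction
2,Working with Data
3,Making Maps
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
toy_schedule = pd.read_csv(io.StringIO(csv_text))

print(toy_schedule)
```

the result is an ordinary dataframe, the same kind of object we get when reading from a URL.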

the object in the last line of the cell you execute will be shown in the jupyter display

In [None]:
schedule

the `shape` attribute holds the number of rows and columns in the dataframe

In [None]:
schedule.shape

we can access the column names using the `columns` attribute

In [None]:
schedule.columns

columns themselves can be accessed in two ways:

using dot-notation (case sensitive)

:::{.callout-tip}

dot notation is convenient because you can tab-complete, but it's also brittle because it can't handle special characters. What if the column name has a space in it?
:::

In [None]:
schedule.Topic

you can also use bracket-notation where you put the column name (in quotations--it's a string) inside brackets (still case-sensitive)

In [None]:
schedule['Topic']
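
here is a minimal sketch of the difference between the two notations, using a hypothetical table whose second column name contains a space (the names and values are made up):

```python
import pandas as pd

# hypothetical table; "Guest Speaker" has a space in its name
toy = pd.DataFrame({"Week": [1, 2], "Guest Speaker": ["Ada", "Grace"]})

# bracket notation works for any column name
speakers = toy["Guest Speaker"]

# dot notation only works when the name is a valid Python identifier
weeks = toy.Week

# toy.Guest Speaker   <- uncommenting this line would raise a SyntaxError
print(speakers.tolist(), weeks.tolist())
```

when in doubt, bracket notation is the safer choice.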

columns can hold different kinds of values. Notice in the output above that the `Topic` column has the "object" type (its `dtype`), which is how pandas typically labels columns of text

In [None]:
schedule['Week']

the `Week` column holds numeric data (specifically integers) so we can do math operations on that column, like adding all the values together

In [None]:
schedule['Week'].sum()
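
a quick way to check the type of every column at once is the `dtypes` attribute. The sketch below uses a made-up table; the exact label shown for text columns can depend on your pandas version.

```python
import pandas as pd

# made-up table with one numeric column and one text column
toy = pd.DataFrame({"Week": [1, 2, 3], "Topic": ["Intro", "Data", "Maps"]})

# dtypes lists the type of each column
print(toy.dtypes)

# math works on the numeric column
week_total = toy["Week"].sum()

# pandas can also tell us directly whether a column is numeric
week_is_numeric = pd.api.types.is_numeric_dtype(toy["Week"])
topic_is_numeric = pd.api.types.is_numeric_dtype(toy["Topic"])

print(week_total, week_is_numeric, topic_is_numeric)
```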

to see just the top few rows of a table, use the `head` method. With empty parentheses it shows the first 5 rows; to see a specific number of rows, put that number in the parentheses

In [None]:
schedule.head()
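
as a small sketch of how the argument to `head` behaves, here is a made-up ten-row table (`tail` is the matching method for the bottom of the table):

```python
import pandas as pd

# ten made-up rows, numbered 1 through 10
toy = pd.DataFrame({"Week": range(1, 11)})

first_five = toy.head()    # no argument: the first 5 rows
first_three = toy.head(3)  # the first 3 rows
last_two = toy.tail(2)     # the last 2 rows

print(len(first_five), len(first_three), last_two["Week"].tolist())
```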

there's also a "convenience method" called `plot` that we can call on a single column or the entire dataframe. By default it generates a line plot

In [None]:
schedule['Week'].plot()

passing the `kind` keyword lets us change the type of plot

In [None]:
schedule['Week'].plot(kind='bar')

## Geopandas for Geospatial Data

to read geospatial data, we use the `geopandas` package. It too can read data right off the web using a URL. It can also read files from inside zip archives. Here we read in census tracts from California that I have stored in a github repository

In [None]:
fs = filesystem('https')
tracts = gpd.read_parquet("https://github.com/oturns/example_datasets/raw/refs/heads/main/acs/ca_tracts_2021.pq", filesystem=fs)

a geopandas *geodataframe* supports the same operations as a pandas *dataframe*

In [None]:
tracts.head()

this dataset is much larger than our class schedule table

In [None]:
tracts.shape

something that makes geodataframes different is that they contain a **geometry** column that encodes each row using points, lines, or polygons

In [None]:
tracts.geometry

this is a dataset of polygons, which are encoded as a list of the coordinates of their vertices.

:::{.callout-warning}
the coordinates are stored in a particular system. One system you are probably familiar with is latitude/longitude, used in most GPS systems. But there are many! See the video below for a quick primer on coordinate systems and why they matter

one important thing to remember is that we define coordinates as (x,y), meaning the first number is the horizontal axis and the second is the vertical axis. When we say "latitude/longitude", we're putting this in reverse! Longitude is the x dimension. Thus coordinates are stored (lon,lat)

:::

In [None]:
tracts.crs
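
to make the (lon, lat) ordering concrete, here is a minimal sketch that builds a single point with `shapely`, the library geopandas uses for geometries. The coordinates are a rounded, approximate location for the UC Irvine campus and are only for illustration.

```python
from shapely.geometry import Point

# approximate, rounded coordinates (illustration only)
latitude = 33.64     # north-south position: the y axis
longitude = -117.84  # east-west position: the x axis

# geometry constructors take x first, then y: (lon, lat)
campus = Point(longitude, latitude)

print(campus.x, campus.y)
```

swapping the two numbers would still create a valid point, just in the wrong place, so this mistake is easy to miss.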

coordinate systems provide a way to translate the round(ish) globe into a flat map

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo("WWp1k0SlMUU")

another difference with geodataframes is the default `plot` method now creates a map

In [None]:
tracts.plot()

to select multiple columns, we use double-bracket-notation

In [None]:
tracts[['geoid', 'median_home_value', 'geometry']].head()
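
the "double brackets" are really a Python list of column names placed inside the usual selection brackets. A minimal sketch with a made-up table (all values are invented):

```python
import pandas as pd

# made-up table with three columns
toy = pd.DataFrame({
    "geoid": ["06059000100", "06059000200"],
    "median_home_value": [750000, 820000],
    "population": [4100, 3900],
})

# a single name in brackets returns one column (a Series)
one_column = toy["geoid"]

# a list of names in brackets returns a smaller table (a DataFrame)
wanted = ["geoid", "median_home_value"]
two_columns = toy[wanted]  # same as toy[["geoid", "median_home_value"]]

print(type(one_column).__name__, type(two_columns).__name__)
```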

we can *subset* the dataframe (i.e. create a particular selection of rows) by putting a condition inside our brackets. For example, we can create a new dataframe of Orange County by subsetting the California table, selecting only those rows with FIPS codes beginning with `06059`

In [None]:
oc = tracts[tracts.geoid.str.startswith('06059')]
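
it can help to see the subsetting in two steps. The sketch below uses a made-up table with shortened, invented ids: the condition first produces one True/False value per row, and the brackets then keep only the True rows.

```python
import pandas as pd

# made-up table; the ids are shortened and invented for illustration
toy = pd.DataFrame({
    "geoid": ["06059001", "06037001", "06059002"],
    "population": [4100, 5200, 3900],
})

# step 1: the condition gives one True/False value per row
mask = toy["geoid"].str.startswith("06059")

# step 2: putting the mask inside brackets keeps only the True rows
toy_oc = toy[mask]

print(mask.tolist())
print(toy_oc)
```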

In [None]:
oc.head()

In [None]:
oc.shape

In [None]:
oc.plot()

another convenience method is `explore` which generates an interactive map

In [None]:
oc.explore()

## Geospatial Operations

we can read in traditional formats like shapefiles, even if they are inside zip files. This file contains cities in Orange County, from the county's open data portal

In [None]:
oc_cities = gpd.read_file("https://github.com/knaaptime/uppp135-winter26-assn/raw/refs/heads/main/week4/OCTraffic_Cities.zip")

In [None]:
oc_cities.head()

In [None]:
oc_cities.crs

notice the cities and the tracts have different coordinate systems. To do geospatial operations on two dataframes, we need them to be in the same CRS. To convert between coordinate systems, we use the `to_crs` method on one (or both) dataframes to get them into the same system

In [None]:
# both of these do the same thing
oc = oc.to_crs(3857)
oc = oc.to_crs(oc_cities.crs)
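
as a small sketch of what `to_crs` does, the example below builds a one-row geodataframe by hand and reprojects it. The point is a made-up location near Irvine, and EPSG:4326 (plain longitude/latitude) is assumed as the starting system for this example only.

```python
import geopandas as gpd
from shapely.geometry import Point

# one made-up point, stored as (lon, lat) in EPSG:4326
pts = gpd.GeoDataFrame(
    {"name": ["somewhere near Irvine"]},
    geometry=[Point(-117.84, 33.64)],
    crs="EPSG:4326",
)

# to_crs returns a new geodataframe; the original is left untouched
pts_mercator = pts.to_crs(3857)

print(pts.crs.to_epsg(), pts_mercator.crs.to_epsg())
print(pts_mercator.geometry.iloc[0])
```

notice that the reprojected coordinates are large numbers measured in meters rather than degrees, which is why two layers in different systems will not line up until one is converted.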

we can also subset by condition as usual

In [None]:
irvine = oc_cities[oc_cities['city']=='Irvine']

In [None]:
irvine.plot()

we can put multiple layers onto the same interactive map by plotting them on the same map object

In [None]:
m = oc.explore(tooltip=False)
irvine.explore(color='red', tooltip=False, m=m)

we can now subset by *geographic condition*, i.e. selecting all census tracts that intersect (touch or overlap) the city of Irvine

In [None]:
irvine_tracts = oc[oc.intersects(irvine.union_all())]
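
a geographic condition works like any other condition: it produces one True/False value per row. The sketch below uses three made-up square "tracts" and one rectangular "city" shape; in the cell above, `irvine.union_all()` plays the role of the city shape.

```python
import geopandas as gpd
from shapely.geometry import box

# three made-up squares: A and B sit side by side, C is off on its own
tiles = gpd.GeoDataFrame(
    {"name": ["A", "B", "C"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(3, 0, 4, 1)],
)

# a made-up rectangle that straddles A and B but is far from C
city_shape = box(0.5, 0.2, 1.5, 0.8)

# intersects gives one True/False value per row
mask = tiles.intersects(city_shape)
selected = tiles[mask]

print(mask.tolist())
print(selected["name"].tolist())
```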

to create a choropleth map of a single column, we pass the name of the column to the `plot` or `explore` methods

In [None]:
irvine_tracts.explore('median_household_income', tooltip=False)

geopandas also has some handy tools, like geocoding. Geocoding is the process of turning textual location data (like an address) into a geospatial representation (like a point)

In [None]:
uci = gpd.tools.geocode('UC Irvine, school of social ecology')

In [None]:
uci

In [None]:
uci = uci.to_crs(3857)

In [None]:
m = irvine_tracts.explore()
uci.explore(m=m, color='red', marker_kwds={'radius':5})

The FIPS code for Los Angeles County is `037`. Can you create a map of LA county tracts with UCLA shown on top?