Urban Data Science & Smart Cities <br>
URSP688Y Spring 2025<br>
Instructor: Chester Harvey <br>
Urban Studies & Planning <br>
National Center for Smart Growth <br>
University of Maryland

# Demo 3 - Functions and Intro to Pandas

- Writing functions
- Installing packages
- Importing packages
- Pandas
  - DataFrames
  - Calculations with columns
  - Selection and filtering
  - Grouping

## Functions

Functions are reusable blocks of code that perform a specific task. Often, they take inputs and produce outputs.

<img src="https://miro.medium.com/v2/resize:fit:880/0*xMEO8AbXwdsgnHSH.png" alt="Diagram of a function with input and output" width="400"/>

- Some basic functions are built into Python (e.g., `print`).

- We can write our own custom functions.

- We can use custom functions other people have written.

In [1]:
# Let's write a function that takes an age as input and tells us whether a person is an adult
def check_adult(age):
    if age < 18:
        adult = False
    else:
        adult = True
    return adult

In [2]:
check_adult(20)

True
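
Because a comparison such as `age >= 18` already evaluates to `True` or `False`, the same check can be written more compactly. The sketch below is one equivalent version; the name `check_adult_compact` is made up here for illustration.

```python
def check_adult_compact(age):
    # The comparison itself produces a boolean, so we can return it directly
    return age >= 18

print(check_adult_compact(20))
print(check_adult_compact(17))
```

The longer `if`/`else` version is easier to extend with extra steps, so either form is reasonable.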

In [3]:
def label_age(name, age): 
    if age < 18:
        label = 'a child'
    else:
        label = 'an adult'
    return f'{name} is {label}'

In [4]:
label_age('Chester', 22)

'Chester is an adult'

#### Namespaces

Functions are a good way to understand a somewhat complicated (but, in the end, VERY useful) aspect of Python: namespaces.

Namespaces are the sections of code in which certain variables, _names_, exist and are accessible to other code. Having different namespaces makes it possible for the same variable name to store different values in different places. 

Namespaces minimize name clutter (because you don't need many versions of a variable name), maximize flexibility, and allow code to be written in ways that are generalizable to lots of applications.

The function we just wrote has two arguments, `name` and `age`, which are variables inside the function. It also defines another variable, `label`, which is only usable inside the function. We say these variables are _local_ to the function. We can see the variables local to a namespace by printing the output of the `locals` function (notice that it doesn't need any arguments).

In [5]:
def label_age(name, age): 
    if age < 18:
        label = 'a child'
    else:
        label = 'an adult'
    
    print(f'Local variables: {locals()}')
    
    return f'{name} is {label}'

In [6]:
label_age('Chester', 22)

Local variables: {'name': 'Chester', 'age': 22, 'label': 'an adult'}


'Chester is an adult'

What's going on here?

In [7]:
name = 'Chester'

def label_age(age): 
    if age < 18:
        label = 'a child'
    else:
        label = 'an adult'
    
    print(f'Local variables: {locals()}')
    
    return f'{name} is {label}'

label_age(22)

Local variables: {'age': 22, 'label': 'an adult'}


'Chester is an adult'
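
One way to read the result above: `name` is not in the function's local namespace, so Python looks for it in the surrounding (global) namespace, where it was defined. The reverse does not work, because names created inside a function are not visible outside it. The sketch below shows both directions; `greeting`, `greet`, `person`, and `punctuation` are hypothetical names chosen for illustration.

```python
greeting = 'Hello'  # defined in the global namespace

def greet(person):
    punctuation = '!'  # local to the function
    # `greeting` is not local, so Python finds it in the global namespace
    return f'{greeting}, {person}{punctuation}'

print(greet('Chester'))

# `punctuation` only existed inside the function while it ran
error_name = None
try:
    print(punctuation)
except NameError as error:
    error_name = type(error).__name__
    print(f'{error_name}: {error}')
```

Relying on global names inside functions works, but passing values in as arguments usually makes code easier to reuse and review.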

## Importing packages

Now that we have basic data structures under our belts—integers, floats, booleans, strings, lists, and dictionaries—we can put them together into a more complex and capable data structure: a table.

We could write our own custom code to combine lists and dictionaries into a table, *or* we could use someone else's code (actually, many, many other people's code) to do this in a way that has become an industry standard.

The easiest way to use other people's code in a way that is well-tested and documented is through a **package**.

To use a package that's not already in our environment, we first have to install it.

In [8]:
# ! conda install pandas

Next, we import it into the current namespace.

Packages are often imported with aliases for brevity. I'll use the standard aliases, but they are technically arbitrary, just like variable names.
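
As a small illustration of aliasing, the sketch below uses `statistics`, a module from Python's standard library (chosen here only because it needs no installation). The alias `st` and the list `grades` are arbitrary names for this example.

```python
import statistics as st  # `st` is an arbitrary alias

grades = [83, 97, 77]
print(st.mean(grades))

# Without an alias, the same function is reached through the full module name
import statistics
print(statistics.mean(grades))
```

Both names point to the same module, so the two calls return the same value.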

## Pandas

[_Pandas_](https://pandas.pydata.org/) (Python Data Analysis Library) is currently the most popular way to analyze tables in Python.

The tabular data structure at the heart of Pandas is the DataFrame.

Let's import `pandas` with the alias `pd` for short.

In [9]:
import pandas as pd

## DataFrames

Now we can use Pandas to make a DataFrame.

Notice that we're just entering dictionaries, strings, and ints? Under the hood, Pandas stores each column with a single data type (a `dtype`), such as 64-bit integers for the grades. But it will give us a lot of tools to do sophisticated things with them.

In [10]:
columnwise_data = {
    'english': {'Daniela': 83, 'Zoe': 97, 'Rowen': 77, 'Jude': 95, 'Austin': 87, 'Jasper': 92, 'Liora': 88, 'Kieran': 72},
    'math': {'Daniela': 95, 'Zoe': 83, 'Rowen': 73, 'Jude': 80, 'Austin': 100, 'Jasper': 94, 'Liora': 89, 'Kieran': 96},
    'science': {'Daniela': 90, 'Zoe': 87, 'Rowen': 95, 'Jude': 73, 'Austin': 80, 'Jasper': 99, 'Liora': 87, 'Kieran': 90},
    'school':{'Daniela': 'Fairview', 'Zoe': 'New Vista', 'Rowen': 'Fairview', 'Jude': 'New Vista', 'Austin': 'New Vista', 'Jasper': 'Fairview', 'Liora': 'New Vista', 'Kieran': 'Fairview'},
}

df = pd.DataFrame(columnwise_data)
df

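| | english | math | science | school |
|---|---|---|---|---|
| Daniela | 83 | 95 | 90 | Fairview |
| Zoe | 97 | 83 | 87 | New Vista |
| Rowen | 77 | 73 | 95 | Fairview |
| Jude | 95 | 80 | 73 | New Vista |
| Austin | 87 | 100 | 80 | New Vista |
| Jasper | 92 | 94 | 99 | Fairview |
| Liora | 88 | 89 | 87 | New Vista |
| Kieran | 72 | 96 | 90 | Fairview |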
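
To check how Pandas is storing each column, the `dtypes` attribute lists one data type per column. The sketch below rebuilds a shortened, two-student version of the table (named `small` here) so that it runs on its own.

```python
import pandas as pd

small = pd.DataFrame({
    'english': {'Daniela': 83, 'Zoe': 97},
    'school': {'Daniela': 'Fairview', 'Zoe': 'New Vista'},
})

# One dtype per column: grades are stored as integers, while text columns
# appear as `object` in many pandas versions (newer versions may report
# a dedicated string type instead)
print(small.dtypes)
```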


## Previewing DataFrames

DataFrames can get big fast. It can be helpful just to see the first few rows, or just to see the column names.

The `head` method shows the first five rows by default, or you can pass the number of rows you want to see as an argument.

In [11]:
df.head()

| | english | math | science | school |
|---|---|---|---|---|
| Daniela | 83 | 95 | 90 | Fairview |
| Zoe | 97 | 83 | 87 | New Vista |
| Rowen | 77 | 73 | 95 | Fairview |
| Jude | 95 | 80 | 73 | New Vista |
| Austin | 87 | 100 | 80 | New Vista |
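
To see a different number of rows, pass that number to `head`. The companion method `tail` does the same from the bottom of the table. This sketch uses a one-column table named `grades`, built here only so the example runs on its own.

```python
import pandas as pd

grades = pd.DataFrame(
    {'english': [83, 97, 77, 95, 87, 92, 88, 72]},
    index=['Daniela', 'Zoe', 'Rowen', 'Jude', 'Austin', 'Jasper', 'Liora', 'Kieran'],
)

first_three = grades.head(3)  # first three rows
last_two = grades.tail(2)     # last two rows

print(first_three)
print(last_two)
```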


The `columns` attribute is very handy for listing all the columns. I tend to add the `tolist` method so Jupyter prints them out nicely without extra clutter.

In [12]:
df.columns.tolist()

['english', 'math', 'science', 'school']

The `value_counts` method is very handy for previewing unique values in a column.

In [13]:
df['school'].value_counts()

school
Fairview     4
New Vista    4
Name: count, dtype: int64

#### Slicing

Just like lists, we can select parts of tables based on indexes. This is called 'slicing.'

Rows and columns are identified by the bold headers to the left and top, respectively. You can index data based on these headers.

In [14]:
df['english'] # One column 

Daniela    83
Zoe        97
Rowen      77
Jude       95
Austin     87
Jasper     92
Liora      88
Kieran     72
Name: english, dtype: int64

In [15]:
df[['english','math']] # Multiple columns; note that the input is a list

| | english | math |
|---|---|---|
| Daniela | 83 | 95 |
| Zoe | 97 | 83 |
| Rowen | 77 | 73 |
| Jude | 95 | 80 |
| Austin | 87 | 100 |
| Jasper | 92 | 94 |
| Liora | 88 | 89 |
| Kieran | 72 | 96 |


In [16]:
df.loc['Daniela'] # One row based on index value

english          83
math             95
science          90
school     Fairview
Name: Daniela, dtype: object

In [17]:
df.iloc[0] # One row based on index order (starting with 0)

english          83
math             95
science          90
school     Fairview
Name: Daniela, dtype: object
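
`loc` and `iloc` can also select rows and columns at the same time, or several rows at once. The sketch below shows a few common patterns on a three-student table named `scores`, built here so the example is self-contained.

```python
import pandas as pd

scores = pd.DataFrame(
    {'english': [83, 97, 77], 'math': [95, 83, 73]},
    index=['Daniela', 'Zoe', 'Rowen'],
)

# Row label first, then column label
one_value = scores.loc['Zoe', 'math']

# Lists select several labels at once
two_rows = scores.loc[['Daniela', 'Rowen'], ['math']]

# Positions 0 and 1; as with lists, the end position is excluded
first_two = scores.iloc[0:2]

print(one_value)
print(two_rows)
print(first_two)
```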

## Filtering

You can also retrieve a subset of a DataFrame based on a condition. This requires making a 'boolean mask', then selecting by that mask. Pandas will only return the rows or columns that are `True` in the mask.

In [18]:
df['school'] == 'Fairview'

Daniela     True
Zoe        False
Rowen       True
Jude       False
Austin     False
Jasper      True
Liora      False
Kieran      True
Name: school, dtype: bool

In [19]:
df[df['school'] == 'Fairview']

| | english | math | science | school |
|---|---|---|---|---|
| Daniela | 83 | 95 | 90 | Fairview |
| Rowen | 77 | 73 | 95 | Fairview |
| Jasper | 92 | 94 | 99 | Fairview |
| Kieran | 72 | 96 | 90 | Fairview |
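
Masks can be combined to filter on more than one condition. In Pandas this is done with `&` (and) and `|` (or), with each condition wrapped in its own parentheses. A minimal sketch on a four-student table named `scores`, built here for illustration:

```python
import pandas as pd

scores = pd.DataFrame(
    {
        'math': [95, 83, 73, 80],
        'school': ['Fairview', 'New Vista', 'Fairview', 'New Vista'],
    },
    index=['Daniela', 'Zoe', 'Rowen', 'Jude'],
)

# Each condition needs its own parentheses
mask = (scores['school'] == 'Fairview') & (scores['math'] >= 90)
fairview_high_math = scores[mask]

print(fairview_high_math)
```

Python's own `and` and `or` keywords do not work on whole columns, which is why the symbols are used instead.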


## Grouping

A very powerful thing to do with tables is to group rows, then make calculations within groups. This is similar to a PivotTable in Excel.

Let's calculate the average grade in English by school.

In [20]:
df.groupby('school')['english'].mean()

school
Fairview     81.00
New Vista    91.75
Name: english, dtype: float64
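
Grouping also works across several columns and several calculations at once. The sketch below uses the `agg` method on a four-student table named `scores`, built here so the example runs on its own.

```python
import pandas as pd

scores = pd.DataFrame({
    'english': [83, 97, 77, 95],
    'math': [95, 83, 73, 80],
    'school': ['Fairview', 'New Vista', 'Fairview', 'New Vista'],
})

# Mean and maximum of two columns, calculated separately for each school
by_school = scores.groupby('school')[['english', 'math']].agg(['mean', 'max'])

print(by_school)
```

The result has two levels of column headers: the original column name on top and the calculation underneath.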

## Wide vs Long Tables

Data scientists often talk about tables being organized in two ways: wide and long.

- Wide: Multiple attributes for the same object stored in each row
- Long: Only one attribute per row (potential for multiple rows per object)

Let's restructure our table so it's long and see what the differences are.

In [21]:
# The first step is to convert the row index into a column with the header 'name'
df = df.reset_index().rename(columns={'index':'name'})

In [22]:
# Then we can use the melt function to convert from wide to long
df_long = pd.melt(
    df, 
    id_vars=['name','school'], 
    value_vars=['english','math','science'], 
    var_name='subject', 
    value_name='grade',
)
df_long

| | name | school | subject | grade |
|---|---|---|---|---|
| 0 | Daniela | Fairview | english | 83 |
| 1 | Zoe | New Vista | english | 97 |
| 2 | Rowen | Fairview | english | 77 |
| 3 | Jude | New Vista | english | 95 |
| 4 | Austin | New Vista | english | 87 |
| 5 | Jasper | Fairview | english | 92 |
| 6 | Liora | New Vista | english | 88 |
| 7 | Kieran | Fairview | english | 72 |
| 8 | Daniela | Fairview | math | 95 |
| 9 | Zoe | New Vista | math | 83 |


With the data in this format, we can more easily calculate grade averages across all subjects.

In [23]:
df_long['grade'].mean()

np.float64(87.58333333333333)

We can still easily break down by subject using groups.

In [24]:
df_long.groupby('subject')['grade'].mean()

subject
english    86.375
math       88.750
science    87.625
Name: grade, dtype: float64
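
Long and wide tables hold the same information, so it is possible to convert in either direction. As a sketch of the reverse step, the `pivot` method turns a long table back into a wide one. The small table `long_table` below is built here so the example is self-contained.

```python
import pandas as pd

long_table = pd.DataFrame({
    'name': ['Daniela', 'Zoe', 'Daniela', 'Zoe'],
    'subject': ['english', 'english', 'math', 'math'],
    'grade': [83, 97, 95, 83],
})

# One row per name, one column per subject, grades as the cell values
wide_table = long_table.pivot(index='name', columns='subject', values='grade')

print(wide_table)
```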

## Loading Data from a File

Enough with these toy data! Let's get our hands on some real-world data by loading a table from a file.

Let's load data from the [Maryland Eviction Case Database](https://opendata.maryland.gov/Housing/District-Court-of-Maryland-Eviction-Case-Data/mvqb-b4hf/data).

In [25]:
df = pd.read_csv('District_Court_of_Maryland_Eviction_Case_Data_2024Q4.csv')


In [26]:
df

| | Unnamed: 0 | Event Date | Event Type | Event Comment | County | Location | Tenant City | Tenant State | Tenant ZIP Code | Case Type | Case Number | Evicted Date | Event Year | Eviction Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 334193 | 2024-09-02 | Warrant of Restitution - Return of Service - E... | | Baltimore | Essex | Dundalk | MD | 21222.0 | Failure to Pay Rent | D-08-LT-24-20104-040 | | 2024.0 | |
| 1 | 334194 | 2024-09-02 | Warrant of Restitution - Return of Service - E... | | Baltimore | Essex | DUNDALK | MD | 21222.0 | Failure to Pay Rent | D-085-LT-24-002047 | | 2024.0 | |
| 2 | 334195 | 2024-09-02 | Petition - For Warrant of Restitution Filed | OFFIT::Warrant of Restitution | Anne Arundel | Glen Burnie | Millersville | MD | 21108.0 | Failure to Pay Rent | D-072-LT-24-28356-003 | | 2024.0 | |
| 3 | 334196 | 2024-09-02 | Petition - For Warrant of Restitution Filed | OFFIT::Warrant of Restitution | Anne Arundel | Glen Burnie | Millersville | MD | 21108.0 | Failure to Pay Rent | D-072-LT-24-28356-006 | | 2024.0 | |
| 4 | 334197 | 2024-09-02 | Petition - For Warrant of Restitution Filed | OFFIT::Warrant of Restitution | Anne Arundel | Glen Burnie | Millersville | MD | 21108.0 | Failure to Pay Rent | D-072-LT-24-28356-010 | | 2024.0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 76840 | 411035 | 2024-12-18 | Petition - For Warrant of Restitution Filed | | Worcester | Snow Hill | SNOW HILL | MD | 21863.0 | Failure to Pay Rent | D-024-LT-24-000117 | | 2024.0 | |
| 76841 | 411036 | 2024-12-18 | Petition - For Warrant of Restitution Filed | | Worcester | Snow Hill | SNOW HILL | MD | 21863.0 | Failure to Pay Rent | D-024-LT-24-000120 | | 2024.0 | |
| 76842 | 411037 | 2024-12-18 | Petition - For Warrant of Restitution Filed | WOR | Worcester | Snow Hill | Snow Hill | MD | 21863.0 | Failure to Pay Rent | D-024-LT-24-39869-001 | | 2024.0 | |
| 76843 | 411038 | 2024-12-18 | Petition - For Warrant of Restitution Filed | Warrant of restitution | Worcester | Snow Hill | Snow Hill | MD | 21863.0 | Failure to Pay Rent | D-024-LT-24-39869-002 | | 2024.0 | |
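
After loading a file, a few quick checks help confirm that it was read as expected. The eviction file itself is not reproduced here, so the sketch below reads a three-row sample from a string instead. The sample uses three of the column names and a few values copied from the preview above; `sample_csv` and `sample` are names made up for this example.

```python
import io
import pandas as pd

# A tiny stand-in for the real file, with three of its columns
sample_csv = io.StringIO(
    'Event Date,County,Case Type\n'
    '2024-09-02,Baltimore,Failure to Pay Rent\n'
    '2024-09-02,Anne Arundel,Failure to Pay Rent\n'
    '2024-12-18,Worcester,Failure to Pay Rent\n'
)

# parse_dates converts the listed columns from text to dates while loading
sample = pd.read_csv(sample_csv, parse_dates=['Event Date'])

print(sample.shape)                     # (number of rows, number of columns)
print(sample.dtypes)                    # data type of each column
print(sample['County'].value_counts())  # number of rows per county
```

With the real file, the same calls would be made on the path passed to `pd.read_csv` above.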


## Errors and debugging

Errors are frustrating and inevitable. Even professional programmers probably spend most of their time debugging.

Luckily, there are good tools and techniques for making debugging a little easier.

Despite these, you will probably nearly tear your hair out with some frequency, especially as a beginner. It will get better with time.

There are two broad types of errors in programming: logic and syntax. Both prevent your program from achieving its goal, but logic errors can be harder to detect because the code may still run.

### Logic errors
These are issues with how you have approached or executed your problem. If your code runs but produces nonsensical results, there is probably a logic error. However, your erroneous code might also produce logical but *wrong* results; you might never notice until the problem has rippled downstream. It's best to address this proactively by planning your code well so it's less likely to be illogical, and writing readable code that can be easily reviewed.

Here's a logic error. Can you find it? (Hint: the issue comes down to how one line is written, but it's still a logic error because the code runs without raising an error.)

In [27]:
def check_adult(age):
    if age > 18:
        adult = False
    else:
        adult = True
    return adult

check_adult(20)

False
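
A few quick checks with known answers can expose a logic error like this one. The sketch below runs the same function on several ages and compares each result with the answer we expect; the dictionary `expected` and the list `mismatches` are names made up for this example.

```python
def check_adult(age):  # same function as above, error included
    if age > 18:
        adult = False
    else:
        adult = True
    return adult

# Ages paired with the answers we expect
expected = {10: False, 17: False, 18: True, 20: True}

mismatches = []
for age, answer in expected.items():
    result = check_adult(age)
    if result != answer:
        mismatches.append(age)
    print(f'age {age}: expected {answer}, got {result}')

print(f'Ages with unexpected results: {mismatches}')
```

Including values right at the boundary (here, 18) is a useful habit, because that is where comparison mistakes tend to hide.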

### Syntax errors
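These are more obvious because your code will simply fail. There are lots of tools for figuring out where and why. (Strictly speaking, Python separates true syntax errors, which are caught before the code runs, from exceptions raised while it runs, such as the `TypeError` below. Both stop your code with an error message and are debugged in much the same way.)

Error messages are usually the starting place for debugging a syntax error.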

In [28]:
def check_adult(age):
    if age < 18:
        adult = False
    else:
        adult = True
    return adult

check_adult('20')

TypeError: '<' not supported between instances of 'str' and 'int'

The error message tells us where the problem is located.

Sometimes, it can be helpful to turn on line numbers.
- In Colab: `Tools -> Settings -> Editor -> Show line numbers`
- In JupyterLab: `View -> Show Line Numbers`

The `TypeError` tells us that the issue is related to the type of a variable on this line, but it's still pretty vague.

Time to start [Googling](https://www.google.com/).
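
While investigating, it can help to catch the error and print the type of the value that caused it. The sketch below does this for the same function. Converting with `int` is one possible fix, assuming the input is always a string of digits; `age_input` is a name made up for this example.

```python
def check_adult(age):
    if age < 18:
        adult = False
    else:
        adult = True
    return adult

age_input = '20'

try:
    result = check_adult(age_input)
except TypeError as error:
    # Printing the type often reveals the mismatch the message describes
    print(f'{error} (age_input is a {type(age_input).__name__})')
    # Convert the string to an integer and try again
    result = check_adult(int(age_input))

print(result)
```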


## Style guidelines for Python
- At the very least, do things consistently
- One statement per line
- Try to limit line length (PEP 8 recommends 79 characters for code and 72 for comments and docstrings)
- Use four spaces to indent
- Put spaces around operators (e.g., `1 + 1` or `day = 'Monday'`) (except in keyword function arguments)
- Use blank lines intentionally and consistently
- Use meaningful names
- Name variables and functions with `lowercase_underscores`
- Constants are often named in `ALL_CAPS_WITH_UNDERSCORES` (e.g., `SPEED_OF_LIGHT = 2.99792458e+8`)
- Name custom classes with `CapWords`
- In general, avoid spaces in folder and file names used for programming

See [Code Readability](https://github.com/ncsg/ursp688y_sp2024/blob/main/README.md#code-readability) on the syllabus. [CS61A](https://cs61a.org/articles/composition/) has an excellent composition guide. [PEP 8](https://peps.python.org/pep-0008/) is a standard Python style guide. [Google](https://google.github.io/styleguide/pyguide.html) publishes their internal Python style guide.
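
As a brief illustration of several guidelines from the list above, here is a short sketch. The constant `SPEED_LIMIT_MPH` and the function `is_speeding` are hypothetical names invented for this example.

```python
SPEED_LIMIT_MPH = 25  # constant: all caps with underscores


def is_speeding(speed_mph, limit_mph=SPEED_LIMIT_MPH):
    # Meaningful lowercase_underscore names, four-space indents,
    # and spaces around operators
    return speed_mph > limit_mph


# No spaces around `=` when passing a keyword argument
print(is_speeding(30))
print(is_speeding(30, limit_mph=35))
```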