<div>
<img src="img/python_logo.png" width="100px" style="float: left; "/> 
<div style="font-size: 40px; padding-top: 20px">Python basics</div>   
<div style="font-size: 30px; padding-top: 20px">Part 4 - Dictionaries and Functions</div>   
    
<div>
    <img style="float: left" src="img/dictionary.png" width=150/>
    <img src="img/recipe.png" width=150/>
</div>
</div>

 


<h2 style="clear: both">COMM4190 Spring 2025</h2>

<h3>Instructor: Matt O'Donnell (mbod@asc.upenn.edu)</h3>

-----

<div class="alert alert-info">

## Overview
    
This notebook is the fourth of four that will cover some of the very basics of Python programming that serve as the building blocks for being able to use Python for computational text analysis. These are:

### Part 1 

* __1. OBJECTS__ - a core concept for programming Python, where an __object__ holds data of different _types_ and a group of actions ( _functions_ ) that can be applied to these data
    * __Data types__ - data comes in different kinds/flavors, e.g. numbers, fractions, proportions, categories, text, etc., that are best represented by different objects (e.g. `int`, `float`, `str`) 
    * __Named Pointers__ - labels that can be attached to objects to make it possible to reference the object in subsequent steps

* __2. `str` STRING OBJECTS__ - a string is an ordered sequence of characters and the core object we use in Python to represent text data.
    * __String specific functions__

### Part 2

* __3. LISTS__ a simple ordered sequence data structure
    * _indexing_ - a way of accessing a single item within a list
    * _slicing_ - a way of accessing a sub-sequence of items within a list
    * list-specific functions
        * `.count()` and `.index()`
        * `.append()`


### Part 3


* __4. CONDITIONS__ - a way of carrying out different actions based on a logic test

* __5. LOOPS__ - a way of repeating an action or series of actions in code

* __6. LOOP and FILTER paradigm__ - a frequently used combination of loop and conditional constructs

### Part 4 (This notebook)

* __7. DICTIONARIES__ - a data structure that links a label or key with a piece of data
  * Dictionaries are _unordered sets of key-value pairs_, where the keys are unique identifiers linking to (or indexing) an object (the value)



* __8. FUNCTIONS__ - a reusable and flexible way of grouping steps together for reuse

</div>




-----

## Overview

* This notebook will cover:
    1. __DICTIONARIES__ - are data structures in Python where each _value_ has an associated label called its _key_ that allows you to access the value directly.
        * A __dictionary__ (`dict` object) is __an UNORDERED SET__ of **KEY**-**VALUE** pairs
        * You can define a `dict` object with curly braces `{` `}`, e.g.
          ```
          my_dict = { 'Bat': 11,
                      'Apple': 434,
                      'Mouse': 32 }
          ```
            * each pair of value+label has the format: `key : value`
            * and pairs are separated by commas
            * Keys are __UNIQUE__ in the `dict` object
              
        * To access an item in a `dict` object you can:
             1. use square brackets and the key
                 ```
                 my_dict['Bat']
                 ```
             2. use the `.get()` function:
                 ```
                 my_dict.get('Bat')
                 ```
          
          both of these will return the value `11`
        * The `dict` object specific functions we most often use are:
            1. `.get()` - to return the value for a specific key (or `None` if that key does not exist in the `dict`)
            2. `.items()` - to return a list of `(key, value)` tuples
            3. `.keys()` - to return a list of all of the keys in the dictionary
            4. `.values()` - to return a list of all of the values
         <br/><br/>
    2. __FUNCTIONS__ - are a way to take a block of code and make it flexible and reusable in multiple places in your code without having to repeat all code each time.
        * The code steps you want to turn into a function become a code block within a function definition.
        * Functions can have zero or more arguments that are pointers in the definition that can be used in the function code block. Specific objects can be associated with these pointers when you use the function from other parts of your code.
        * The syntax to define a function follows the pattern:
          ```   
            def FUNCTION_NAME(arguments):
                # code block

                return RESULT
          ```
            where:
            * `FUNCTION_NAME` is the pointer to your function
            * `arguments` are any pointers to objects that are going to be used in the function and will be specified when the function is called.
            * `RESULT` is the object(s) you want to pass back from the function


------

## Encoding data in Python

![](img/data_encoding.png)


* So far we have focused on using `list` objects to hold our data. Lists are a simple __ORDERED SEQUENCE OF OBJECTS__. 


* In a `list` object _the order is crucial_ and the position of an item in a list (its __INDEX__) is the only way we can keep track of what the item refers to.
    * When we want to have a label to go with each piece of data we need to create two lists of the same length and make sure the items in each are aligned.
        * _Note: It would also be possible to use a more complex nested data structure (see below)_



* For example, to record the heights of three people:
    1. Jay - 69"
    2. Deborah - 65"
    3. Min - 62"
        
    we could do this:
    ```
    names =  ['Jay', 'Deborah', 'Min']
    heights = [ 69, 65, 62 ]
    ```    
    <br/>
    Then we can link the height data with the name label by using the same index:
    
    ```
    
    print(f'{ names[0] } is { heights[0] } inches tall.')
    ```



In [None]:
# create two lists using position to align name and height values

names =  ['Jay', 'Deborah', 'Min']
heights = [ 69, 65, 62 ]

In [None]:
# use indexing to display matching name and height values

print(names[0], heights[0])

In [None]:
# use an f-string to incorporate data values into text

print(f'{ names[0] } is { heights[0] } inches tall.')

* When you have two lists of the same length you can combine them item-by-item using the `zip(list1, list2)` function:

In [None]:
list(zip(names,heights))

* This should look familiar from the output we get from a `Counter` object we have been using to create frequency lists.

* We can then use a loop to walk through the values item-by-item:

In [None]:
for name, height in zip(names, heights):
    print(f'{ name } is { height } inches tall.')
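
* One thing worth knowing about `zip()` is that it stops at the end of the shorter list. The sketch below uses a hypothetical `heights` list with one value missing to show that the unmatched name is silently dropped, which is why the two lists need to stay the same length.

```python
names = ['Jay', 'Deborah', 'Min']
heights = [69, 65]    # hypothetical: the third height is missing

# zip() pairs items by position and stops when the shorter list runs out
pairs = list(zip(names, heights))

print(pairs)          # 'Min' has no partner, so no pair is created for it
```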

## 1. Dictionaries

* The second data structure built into Python is the __DICTIONARY__. 
    * A dictionary is _an unordered set of KEY-VALUE pairs_.
    * Dictionaries are realized as `dict` objects.

* This allows each piece of data (the value) to have an associated label bound to it. It removes the need to use position/order to align values and labels.


* Returning to our simple height data set, this would be encoded like this:

    | KEY | VALUE |
    | --- | ----- |
    | Jay | 69   |
    | Deborah | 65 |
    | Min | 62 |
    
    
### Creating a dictionary with a `dict` object
    
* The syntax to create a `dict` object in Python is:
    
    ```
    {
       key : value,
       key : value,
       ...
    }
    ```
    
    * curly braces `{ }` mark the boundaries (the beginning and end) of the dictionary 
    * each pair of value+label has the format: `key : value`
    * and pairs are separated by commas



* Keys are UNIQUE in the dict object

* Let's create a dictionary with the name and height data and reference the resulting `dict` object with a named pointer `height_dict`:

In [None]:
height_dict = {
    'Min': 62,
    'Deborah': 65,
    'Jay': 69
}

In [None]:
type(height_dict)

* Just like with a list we can use the `len()` function to find out how many items are in a `dict` object.

In [None]:
len(height_dict)

* But unlike a `list` we __DO NOT__ use numeric indexing

In [None]:
# this will create an error
# trying to get the first item (key-value pair)
height_dict[0]

* This is because items in a dictionary are _NOT accessed by position_. Python treats `0` as a key and, since no such key exists, raises a `KeyError`.


* Instead, we use the `key` as a way to retrieve items from a `dict` object.


* The syntax for accessing an item in a dictionary is:
    ```
    dict[key]
    ```
    
    e.g.
    ```
    height_dict['Jay']
    ```

In [None]:
# retrieve the value for the item 
# with key 'Jay'
height_dict['Jay']

* The key must exist in the `dict` and match the key exactly.
    * For instance, using `jay` or `JAY` instead of `Jay` will cause a `KeyError`

In [None]:
height_dict['jay']
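
* One way to avoid a `KeyError` is to test whether the key exists before using it. The sketch below uses the `in` operator for this; the list of keys to try is made up for illustration.

```python
height_dict = {'Min': 62, 'Deborah': 65, 'Jay': 69}

# try one key that exists and one that differs only in case
for key in ['Jay', 'jay']:
    if key in height_dict:
        print(key, 'found:', height_dict[key])
    else:
        print(key, 'is not a key in height_dict')
```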

* The dictionary-specific function `.get(key)` provides another way to access an item in a `dict` object, one that doesn't raise an error if the key does not exist.

In [None]:
# try to retrieve an item using a nonexistent key
height_dict.get('jay')

* If the key does not exist the `.get()` function will return `None` (which does not display any output). 


* If the key does exist in the `dict` then the value associated with the key will be returned.

In [None]:
height_dict.get('Jay')

In [None]:
height_dict.get('Min')
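
* The `.get()` function also accepts an optional second argument, which is returned instead of `None` when the key is missing. A minimal sketch, where the fallback string `'unknown'` is just an illustrative choice:

```python
height_dict = {'Min': 62, 'Deborah': 65, 'Jay': 69}

# key exists: the stored value is returned and the fallback is ignored
known = height_dict.get('Jay', 'unknown')

# key does not exist: the fallback is returned instead of None
missing = height_dict.get('jay', 'unknown')

print(known, missing)
```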

#### Creating an _EMPTY DICTIONARY_ and adding items to it

In [None]:
# create an empty dictionary
height2_dict = {}

In [None]:
len(height2_dict)

* We can add an item to the dict object referenced by `height2_dict` using the syntax:
    ```
    height2_dict[key] = value
    ```

In [None]:
# add a key:value pair Jay:69 
height2_dict['Jay'] = 69

In [None]:
len(height2_dict)

In [None]:
# look at the contents of the dict
height2_dict

In [None]:
height2_dict['Min']=62
height2_dict['Deborah']=65

In [None]:
len(height2_dict)

In [None]:
height2_dict

In [None]:
height2_dict['Jay'] = 70

In [None]:
height2_dict
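
* The last two cells show what happens when the key already exists: because keys are unique, the assignment replaces the stored value rather than adding a second `'Jay'` item. The sketch below repeats the idea on a small dictionary so the difference between updating and adding is visible in the length.

```python
height2_dict = {'Jay': 69}

height2_dict['Jay'] = 70    # existing key: the value is replaced
height2_dict['Min'] = 62    # new key: a new key-value pair is added

print(len(height2_dict))    # counts two items, not three
print(height2_dict)
```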

#### Using the `.keys()`, `.values()` and `.items()` dictionary functions

* A `dict` object has a number of additional object specific functions
  * `.keys()` - returns a list of all the keys in the `dict`
  * `.values()` - returns a list of all the values
  * `.items()` - returns a list of `(key, value)` pairs
      


| KEY | VALUE |
| --- | ----- |
| Jay | 69   |
| Deborah | 65 |
| Min | 62 |

* We can get a list of all the keys in a `dict` (i.e. select the __KEY__ column in the table)

In [None]:
height_dict.keys()

* And similarly a list of all the values

In [None]:
height_dict.values()

* We will often use these in a loop to walk through the items in a dictionary.
  * For example, we can use the `.keys()` function to get a list of keys and walk through this list to access the associated value at each step

In [None]:
# first illustrate what walking through the list of keys
# looks like by displaying each value

for name in height_dict.keys():
    print(name)

* Now at each step we use the current key, referenced by the loop pointer `name` to retrieve the associated value

In [None]:
for name in height_dict.keys():
    print(name, 'is', height_dict[name], 'inches tall' )

* Just as we used the `zip()` function in a loop to combine two or more lists, the dictionary-specific function `.items()` gives us the `(key, value)` pairs of a `dict` as a sequence we can loop over.

In [None]:
height_dict.items()

* And we can then create a loop with two pointers, one to the key and other to the value

In [None]:
for key, value in height_dict.items():
    print(f"Key is '{key}' & associated value is '{value}'.")
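
* Since `zip()` and `.items()` both produce `(key, value)`-shaped pairs, the two aligned lists from the start of the notebook can be turned into a dictionary directly. A short sketch, assuming the same `names` and `heights` lists as before:

```python
names = ['Jay', 'Deborah', 'Min']
heights = [69, 65, 62]

# dict() accepts a sequence of (key, value) pairs, which is what zip() produces
height_dict = dict(zip(names, heights))

for name, height in height_dict.items():
    print(f'{name} is {height} inches tall.')
```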

------



![](img/state_population.png)

### Using a dictionary as a lookup table/mapping

* A common use for a dictionary is to create a mapping or lookup table between two forms of a piece of data, such as an abbreviation and the long form.


* An example would be the two letter abbreviations for US states.


* Here we have a selection of states and create a dictionary where:
    * the __KEYS__ are the two letter abbreviations
    * the __VALUES__ are the full state name

In [None]:
state_mapping = {
     'AK': 'Alaska',
     'AL': 'Alabama',
     'AR': 'Arkansas',
     'AZ': 'Arizona',
     'CA': 'California',
     'CO': 'Colorado',
     'MO': 'Missouri',
     'MT': 'Montana',
     'NE': 'Nebraska',
     'NH': 'New Hampshire',
     'NJ': 'New Jersey',
     'NV': 'Nevada'
}

* So we can use the abbreviation to look up the associated state name.

In [None]:
state_mapping['CO']

In [None]:
# use the dict.get(KEY) function 

state_mapping.get('MT')

In [None]:
print("NV is the 2 letter abbreviation for", state_mapping['NV'])

* This kind of lookup or mapping dictionary can be really useful when you have other data sources that use the keys.


* For example, if we had some data about the population in some US states (ranked by population size):

| State | Population |
| ----- | ---------- |
| CA  | 39613493 |
| NJ  | 8874520  |
| AZ  | 7520103  |
| MO  | 6169038  |
| CO  | 5893634  |
| AL  | 4934193  |
| NV  | 3185786  |
| AR  | 3033946  |
| NE  | 1951996  |
| NH  | 1372203  |
| MT  | 1085004  |
| AK  | 724357   |


* We could encode this table as a `dict` object, where:
    * the 2 letter state abbreviation is the __KEY__
    * the __VALUES__ are the population counts

In [None]:
population_dict = {
 'CA': 39613493,
 'NJ': 8874520,
 'AZ': 7520103,
 'MO': 6169038,
 'CO': 5893634,
 'AL': 4934193,
 'NV': 3185786,
 'AR': 3033946,
 'NE': 1951996,
 'NH': 1372203,
 'MT': 1085004,
 'AK': 724357
}

* We can look up a state name using the `state_mapping` dictionary and the population in the `population_dict` using a common key, e.g. `CA`

In [None]:
print(state_mapping['CA'], 'population is', population_dict['CA'])

* We can incorporate this into a loop to display the states in our population table with the abbreviations replaced by the full state names.

In [None]:
for state_abbrev, population in population_dict.items():
    # look up the state name using abbreviation
    sname = state_mapping[state_abbrev]
    
    # display state and population
    print(f'{sname:>15} population: {population}')
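
* The loop above assumes every abbreviation in `population_dict` is also a key in `state_mapping`. If one is missing, `state_mapping[state_abbrev]` raises a `KeyError`. The sketch below uses cut-down copies of the two dictionaries, with `'AZ'` deliberately left out of the mapping, and falls back to the abbreviation itself using `.get()`. The `:,` in the f-string adds thousands separators.

```python
# cut-down copies of the dictionaries above; 'AZ' is deliberately
# left out of the mapping to show the fallback
state_mapping = {'CA': 'California', 'NJ': 'New Jersey'}
population_dict = {'CA': 39613493, 'NJ': 8874520, 'AZ': 7520103}

lines = []
for state_abbrev, population in population_dict.items():
    # fall back to the abbreviation when there is no full name
    sname = state_mapping.get(state_abbrev, state_abbrev)
    lines.append(f'{sname:>15} population: {population:,}')

for line in lines:
    print(line)
```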

* We will see this use of dictionaries frequently as we move on to working with larger, more complex datasets.


----

## 2. Functions

* Python allows you to create your own functions that are:
    1. A block of code to do something, e.g. the steps involved in normalizing and tokenizing a text
    2. These steps are _wrapped_ up with a function pointer and zero or more __INPUT ARGUMENTS__
    3. Optionally the function can __RETURN__ the result of running the steps in the code block.
    
    
* The syntax template for a function is:
    
```
def function_name(ARGS):
    # CODE BLOCK
    # ...
    
    return output   # optional
```

* The simplest kind of function is one that wraps up one or more steps of code with a label (the __function name__) and no arguments (extra pieces of data [objects] passed to the function).


* For example, imagine we wanted to display the message:
    ```
    Hello!
    How are you?
    
    
    ```
  
  We could do:

In [None]:
print('Hello!')
print('How are you?')

* If we wanted to repeat this message multiple times, it would become tedious to repeat those two lines of code over and over again.

In [None]:
print('Hello!')
print('How are you?')

print('Hello!')
print('How are you?')

print('Hello!')
print('How are you?')

* So we could define a function called `say_hi` that had those two lines of code as the function code block.


* The function definition would look like this:

In [None]:
def say_hi():
    print('Hello!')
    print('How are you?')

* The first line will have the format:
    ```
    def FUNCTION_NAME():
    ```
  * keyword `def`
  * followed by the function name 
  * then a set of parentheses that could contain any _arguments_ that could be passed to the function (but here we are not using arguments so the parentheses are empty)
  * finally we need a colon `:` to indicate the start of the code block (this should be familiar now from `for` loops and `if` conditional constructions).
  
  
* Once the function definition cell is executed we can use the function name just like we would a general function like `print()` or `len()` or `type()`, etc.

In [None]:
say_hi()

In [None]:
say_hi(), say_hi(), say_hi()

* When we call the function, Python looks for the function definition that is referenced by the function name (`say_hi`) and executes the code block.


* Let's modify the function `say_hi` to require a name to be specified when the function is called. Then we can use the specified value in the function code block to print a specific greeting.

In [None]:
def say_hi(name):
    print(f'Hello {name}!')
    print('How are you?')

In [None]:
say_hi('Zeek')

In [None]:
say_hi('Zelda')

* So as we change the argument we give to `say_hi` the greeting changes.

#### Returning an object from a function

* Most often we will want to define functions that, given some data, carry out a series of steps on these data and pass back the result.


* For instance, if we wanted to add two numbers together, we could have a function:

In [None]:
def add_nums(n1, n2):
    result = n1 + n2
    print(f'{n1} + {n2} = {result}')

* It expects two numeric arguments, which it will add together and then display the result using `print()`.

In [None]:
add_nums(5,7)

In [None]:
add_nums(12,11)

* It works as we expect. But usually we want the answer back as an object we can use in some other code steps.


* To pass back an object we use `return OBJ` as the last line of the function code block.

In [None]:
def add_nums(n1, n2):
    result = n1+n2
    return result    # pass back the result object 

* So now when we call `add_nums(arg1, arg2)` we get back an object that contains the result of adding the two numbers together.

In [None]:
add_nums(1,15)

In [None]:
add_nums(-10,3)
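
* The difference between the two versions is easiest to see by storing what each one hands back. In the sketch below the printing version is renamed `add_and_print` (a name used here only so both can exist side by side). A function without a `return` statement passes back `None`.

```python
def add_and_print(n1, n2):
    # displays the result but does not pass anything back
    print(f'{n1} + {n2} = {n1 + n2}')

def add_nums(n1, n2):
    # passes the result back to the caller
    return n1 + n2

printed = add_and_print(5, 7)
returned = add_nums(5, 7)

print(printed)          # None, because nothing was returned
print(returned * 2)     # the returned int can be used in further steps
```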

* So we might use the result of calling `add_nums()` twice in a conditional construction
    * Is 10+15 > 11+13
    * If the answer is 'Yes' (`True`) then display:
        > 10+15 > 11+13

In [None]:
if add_nums(10,15) > add_nums(11,13):
    print('10+15 > 11+13')

* Another example would be to calculate the average or mean of a list of numbers.


* So far, given a list of numbers called `nums`, we have done this:
    ```
    total = sum(nums)
    avg = total/len(nums)
    
    
    ```
 
 
* If we want to do it for a second list, `nums2`:
    ```
    total2 = sum(nums2)
    avg2 = total2/len(nums2)
    
    
    ```
    
  and so on.
  
  
* But we can define a function called `mean` that takes a list of numbers as its argument and returns the result of summing the numbers and dividing by the length of the list.

In [None]:
def mean(nums):
    total = sum(nums)
    avg = total / len(nums)
    
    return avg

* Then we can calculate the mean of any list of numbers easily, with one line of code.

In [None]:
mean([1,5,10])

In [None]:
mean([434, 3434, 343434])

In [None]:
# use range(1,101) to generate a sequence of integers
# from 1 to 100

mean(range(1,101))
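
* One case `mean()` does not handle is an empty list: `len(nums)` is `0`, so the division raises a `ZeroDivisionError`. A possible guard is sketched below. The name `safe_mean` and the choice to return `None` for an empty list are assumptions made for illustration, not part of the notebook's function.

```python
def mean(nums):
    total = sum(nums)
    return total / len(nums)

def safe_mean(nums):
    # hypothetical wrapper: return None instead of dividing by zero
    if len(nums) == 0:
        return None
    return mean(nums)

print(safe_mean([1, 5, 10]))
print(safe_mean([]))
```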

### Example: Function to calculate the median

* The median of a list of numbers is usually defined as __the number in the middle of the list when the list is sorted in ascending order__.


* This is straightforward _when the list has an odd number of items_, because there is a single middle item.


* When _the list has an even number of items_, the median is defined as the average of the two middle items.


* For example, given the list:
    ```
    [ 15, 1, 74, 3, 2 ]
    ```
  1. sort the list
    ```
    [ 1, 2, 3, 15, 74 ]
    ```
  2. select the middle (the 3rd of 5 items, i.e. index `2`)
    ```
    3
    ```
    
    
* If we add another item to the list to make it 6 items long:
    ```
    [ 15, 1, 74, 3, 2, 9 ]
    ```
  1. sort the list
    ```
    [ 1, 2, 3, 9, 15, 74 ]
    ```
  2. select the two items in the middle (the 3rd and 4th of 6 items, i.e. slice `2:4`)
    ```
    [3, 9]
    ```
  3. calculate the mean of these two numbers
    ```
    6
    ```

* Given a list of 5 items, the middle is the _3rd item_, with an index of `2`. We could try to calculate this index by dividing the length by 2 with the `/` operator, which Python uses for division, but the result is a `float`:

In [None]:
5 / 2

* But list indexes must be whole numbers (`int` objects)

* So we could use the `int()` function

In [None]:
int(5/2)

* Or more conveniently we can use the `//` operator, which does __integer division__.

In [None]:
5//2

* Another useful operator is the __modulo operator__, written with the `%` character. It returns the remainder left over after integer division, e.g. 2 goes into 5 twice with a _remainder_ of 1.

In [None]:
5 % 2

* This is helpful for determining whether a number is odd (will have a modulo > 0) or even (modulo value will be `0`).
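
* As a small sketch of this test, the loop below labels a few example numbers as odd or even by checking the remainder after dividing by 2. The numbers and the `labels` dictionary are made up for illustration.

```python
labels = {}

for n in [4, 5, 6, 7]:
    # a remainder of 0 means the number divides evenly by 2
    if n % 2 == 0:
        labels[n] = 'even'
    else:
        labels[n] = 'odd'

print(labels)
```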


#### Defining the `median()` function

In [None]:
def median(nums):
    
    # get the length of the list of numbers
    length = len(nums)
    
    # use integer division to get the index of the middle item
    middle_idx = length//2
    
    # create a copy of the nums sorted in ascending order
    nums_ordered = sorted(nums)
    
    # if the list has an odd number of items there will be a single middle item
    # and that can be returned
    if length % 2 == 1:
        middle_item = nums_ordered[middle_idx]
    else:
        # if the list has an even number of items calculate the mean of the middle two
        middle_item = sum(nums_ordered[middle_idx-1:middle_idx+1])/2

    # return the middle item
    return middle_item

In [None]:
median([ 15, 1, 74, 3, 2 ])

In [None]:
median([ 1, 2, 3, 9, 15, 74 ])
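
* One way to check a function like this is to compare it against an existing implementation. The sketch below restates `median()` in compact form and compares its results with `statistics.median` from Python's standard library on the two example lists. It is a quick sanity check, not a thorough test.

```python
import statistics

def median(nums):
    length = len(nums)
    middle_idx = length // 2
    nums_ordered = sorted(nums)
    if length % 2 == 1:
        # odd length: a single middle item
        return nums_ordered[middle_idx]
    # even length: mean of the two middle items
    return sum(nums_ordered[middle_idx-1:middle_idx+1]) / 2

results = []
for nums in ([15, 1, 74, 3, 2], [1, 2, 3, 9, 15, 74]):
    ours = median(nums)
    theirs = statistics.median(nums)
    results.append((ours, theirs))
    print(nums, ours, theirs, ours == theirs)
```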