# Agenda, day 3: Dictionaries and files

1. Jupyter Lite
2. Q&A
3. Dictionaries ("dicts")
    - Creating them
    - Retrieving from them
    - Iterating over them
4. How can we use dictionaries in a variety of ways?
    - As a read-only database
    - To accumulate values we know of
    - To accumulate the unknown
5. How do dicts work?
6. Files
    - Reading from files
    - Different ways to read from files
    - Writing to files (a little)
    - The `with` construct



In [1]:
print('Hello!')

Hello!


# WASM -- WebAssembly

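WASM is a universal, low-level instruction format that runs in your browser; it's a target that other languages can be compiled to, rather than a language you write by hand. Any programming language that knows how to run on a WASM platform can run in your browser.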

Python can run in WASM! Which means that Python can run in your browser!

Does that mean that Jupyter can run in your browser? Until last week, the answer was "Yes, but".

This is known as Jupyter Lite.

It worked great... except for the `input` function, which didn't work so well.

# Dictionaries

Dictionaries are the most important data structure in Python. Let's put that in some context:

- Strings are for storing (and retrieving) text.
- Lists are for storing and retrieving values of the same type, where we put them in the list, and can retrieve them via their index. We know that lists are mutable, meaning that we can modify their values, make them longer, and make them shorter.
- Tuples are for storing and retrieving values of different types. They are immutable, meaning that we cannot change the values, make them longer, or make them shorter.

For certain kinds of tasks, none of these is really flexible or efficient enough.

Dictionaries are extremely fast and extremely flexible. The idea behind dicts is that you don't have individual values. Rather, you have keys (the indexes) and values. There is no such thing as an individual item in a dict; it's always based on pairs. 

Some basic things to know about dicts:

- Every key has a value, and every value has a key. There isn't any such thing as a valueless key or a keyless value.
- Keys are unique within a dict. This is like how a string, list, or tuple only has one item at each index; there aren't 3 items at index 3.
- A key can be anything at all in Python... if it is immutable (strictly speaking, "hashable"). Meaning, we normally use numbers (ints and floats) and strings as dict keys.
- In a dict, I can dictate what the keys are, so long as they are unique and immutable. I am not constrained by any ordering.
- The values can be absolutely anything at all, without any restrictions whatsoever -- they can be mutable/immutable, big or small, repeat if you want, etc.
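
Here is a quick sketch of the rule about keys, trying a tuple (immutable) and a list (mutable) as keys. The variable names are made up for illustration. What Python actually checks is whether the key is hashable, which for the built-in types roughly lines up with being immutable.

```python
d = {}

# a tuple is immutable, so it can be a key
d[(2025, 6)] = 'June 2025'

# a list is mutable, so Python refuses to use it as a key
try:
    d[[2025, 6]] = 'June 2025'
    list_key_worked = True
except TypeError:
    list_key_worked = False

print(d)                # {(2025, 6): 'June 2025'}
print(list_key_worked)  # False
```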

So what's the advantage of a dict?

The first one that you normally encounter is that whereas a list has integer indexes (0, 1, 2, 3) that don't have anything to do with the data, a dict's keys are set by you, and can be relevant to the data:

- ID numbers
- Usernames
- IP addresses
- Dates
- SKUs in a store

### To create a dict:

- We use `{}`
- The key and value in each pair are separated by `:`
- Pairs are separated by `,`

In [2]:
d = {'a':10, 'b':20, 'c':30}

In [3]:
type(d)

dict

In [4]:
# How many key-value pairs are in the dict?

len(d)  # we count the pairs (not the individual items)

3

In [5]:
# how can I retrieve from a dict? I use [], just like with a string/list/tuple
# in the [], I put the key that I want to retrieve

d['a'] 

10

In [6]:
d['b']

20

In [7]:
d['c']

30

In [8]:
d['x']

KeyError: 'x'

In [9]:
# get the key exactly right!
d['a ']

KeyError: 'a '

In [10]:
d['A']

KeyError: 'A'

In [11]:
# how can I know if a key is in the dict?
# we use the "in" operator
# this returns True if the key is in the dict
# note that "in" on a dict ignores the values COMPLETELY

d

{'a': 10, 'b': 20, 'c': 30}

In [12]:
'a' in d

True

In [13]:
10 in d

False

In [14]:
# if I use d['a'], I get the value associated with the key 'a'
d['a']

10

In [15]:
# what if I have the value 10? Can I get the key based on it?
# NO. Dicts are a one-way street. You can use keys to get values, but not vice versa
# among other things, this is because every key is unique, but values are not guaranteed to be unique
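
That said, if you really do need the key (or keys) for a given value, you can search for them yourself. This is a minimal sketch, assuming a small dict in which the value 10 appears twice. Note that it has to check every pair, so it is far slower than a normal lookup by key.

```python
d = {'a': 10, 'b': 20, 'c': 10}

matching_keys = []
for one_key in d:             # iterating over a dict gives us its keys
    if d[one_key] == 10:
        matching_keys.append(one_key)

print(matching_keys)   # ['a', 'c']
```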

# Who needs dicts?

There are a ton of places in the programming world that can solve their problems with these key-value stores. In fact, dicts are known by many other names, because many other languages have them, as well:

- Hash tables
- Hashes
- Hash maps
- Maps
- Key-value stores
- Name-value stores
- Associative arrays

The idea that you have a key that lets you retrieve the value is everywhere -- if you choose a good key (e.g., a user's ID number) then the value can be a big record of their information.
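
For example, here is a small sketch of a dict whose keys are ID numbers and whose values are records (themselves dicts). The IDs, names, and e-mail addresses are invented for illustration.

```python
users = {
    1001: {'name': 'Alice', 'email': 'alice@example.com'},
    1002: {'name': 'Bob', 'email': 'bob@example.com'},
}

user_id = 1002
if user_id in users:
    record = users[user_id]    # one lookup gets us the whole record
    print(f"{record['name']} can be reached at {record['email']}")
```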

In [16]:
d = {'a':10, 'b':20, 'c':30}

k = 'a'

d[k]

10

In [18]:
k = input('Enter a key: ').strip()

if k in d:   # if the user's string is a key in d
    print(f'Great, the value is {d[k]}')
else:
    print(f'{k} is not a key in d')

Enter a key:  z


z is not a key in d


# Exercise: Restaurant

1. Define a dict, called `menu`, in which the keys are strings -- the names of items on a restaurant's menu -- and the values are integers -- the prices of those items.
2. Define `total` to be 0.
3. Repeatedly ask the user to order something:
    - If they enter the empty string (`''`), then stop asking and print the total.
    - If they enter the name of something on the menu, then tell them the price, add it to total, and print the new total before asking for the next item
    - If they enter the name of something *not* on the menu, then scold them and let them try again.
4. Print the total after ordering.

Example:

    Order: sandwich
    sandwich costs 10, total is 10
    Order: tea
    tea costs 8, total is 18
    Order: cake
    cake costs 5, total is 23
    Order: elephant
    sorry, we're fresh out of elephant today!
    Order: [ENTER]
    Total is 23

In [19]:
menu = {'sandwich':10, 'tea':8, 'cake':5, 'apple':3}
total = 0

while True:   # infinite loop!
    order = input('Order: ').strip()

    if order == '':
        break

    if order in menu:
        price = menu[order]  # get the price from the menu
        total += price       # update the total
        print(f'{order} costs {price}, total is now {total}')
    else:
        print(f'{order} is not on the menu; try again')

print(f'Your total is {total}')        

Order:  sandwich


sandwich costs 10, total is now 10


Order:  apple


apple costs 3, total is now 13


Order:  elephant


elephant is not on the menu; try again


Order:  


Your total is 13


In [22]:
# ID

menu = {'sandwich': 10, 'tea': 8, 'cake' : 5}

total = 0


while True:
    order = input('Please put, in your order! ').strip()
    
    if not order:  # if I got the empty string
        print(f'Total is: {total}')
        break
    elif order in menu:
        total += menu[order]
        print(f'Total is: {total}')   # no extra input() here; the top of the loop asks again
    else:
        print(f'Sorry, we are out of {order}')
        

Please put, in your order!  tea


Total is: 8


Please put, in your order!  


Total is: 8


In [23]:
print(menu)

{'sandwich': 10, 'tea': 8, 'cake': 5}


# Are dictionaries mutable?

You might remember that strings are immutable, but lists are mutable. That means:

- We can replace values in a list
- We can add new elements to the list, making it longer/larger
- We can remove elements from the list, making it shorter

Can we do this with a dict? Yes!

In [24]:
d = {'a':10, 'b':20, 'c':30}

# how can I replace a value with another value
# answer: just assign to the key (very similar to how we replace a value in a list)

d['b'] = 99

d

{'a': 10, 'b': 99, 'c': 30}

In [26]:
# how can I add a new key-value pair to our dict?
# with a list, we add with the "append" method 
# with a dict, it's far easier -- we just assign! 

d['x'] = 8877   # if 'x' already existed as a key, we've replaced the value. If not, then we've added a new key-value pair

In [27]:
d

{'a': 10, 'b': 99, 'c': 30, 'x': 8877}

In [28]:
# what about removing values?
# we can, with the dict.pop method
# note that it's pretty rare in my experience to remove a key-value pair from a dict

d.pop('x')  # this removes the pair and returns the value

8877

In [29]:
d

{'a': 10, 'b': 99, 'c': 30}

In [30]:
d.pop('x')

KeyError: 'x'
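
If you aren't sure whether a key is there, one option is to hand `dict.pop` a second argument, which it returns instead of raising `KeyError`. A minimal sketch (the `None` default is just one possible choice):

```python
d = {'a': 10, 'b': 99, 'c': 30}

removed = d.pop('x', None)   # 'x' isn't a key, so we get None back, and no error
print(removed)               # None

removed = d.pop('b', None)   # 'b' is a key, so the pair is removed and 99 is returned
print(removed)               # 99
print(d)                     # {'a': 10, 'c': 30}
```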

In [31]:
# can I put more complex types in the values?
# answer: of course!

d = {'a':[],
     'b':[],
     'c':[]}

In [32]:
d['a']

[]

In [33]:
d['a'].append(10)

In [34]:
d['b'].append(20)

In [35]:
d['c'].append(30)

In [36]:
d['c'].append(40)

In [37]:
d

{'a': [10], 'b': [20], 'c': [30, 40]}

In [38]:
d['a']

[10]

In [39]:
d['a'][0]

10

# Read-only databases

In the restaurant example, we used a dict as a read-only database in memory. We could have updated/changed it, but we didn't -- it was set, and we read from it.

I've seen this in a wide variety of places:

- Month names -> month numbers
- Month numbers -> month names
- User IDs -> user records
- Country names -> international calling codes
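
As a small sketch of the first item in that list, here is a dict that maps month names to month numbers. Only three months are included, to keep it short, and the dict is only read from, never changed.

```python
month_numbers = {'January': 1, 'February': 2, 'March': 3}

name = 'February'
if name in month_numbers:
    print(f'{name} is month number {month_numbers[name]}')
else:
    print(f'I do not know about {name}')
```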

# Next up: Using dicts

- Accumulating with dicts
- Accumulating the unknown

# Accumulating

We've seen already that we can use a list for accumulating information when a program is running:

- Track inputs from the user
- Track error messages
- Track user logins

If we use a dict, then we can have a number of these accumulators tracking all together. Each key represents one item we're tracking, and each value represents the number of times it appeared.

In [40]:
# Example: Odds and evens

counts = {'odds':0, 'evens':0}  # setting up my accumulation dict

while True:
    s = input('Enter a number: ').strip()

    if s == '':
        break

    if not s.isdigit():
        print(f'{s} is not numeric; try again')
        continue

    n = int(s)

    if n % 2 == 0:             # if, when we divide n by 2, we have a remainder of 0
        counts['evens'] += 1   # ... increment counts['evens'] by 1
    else:
        counts['odds'] += 1    # otherwise, increment counts['odds']

print(counts)        

Enter a number:  10
Enter a number:  15
Enter a number:  18
Enter a number:  23
Enter a number:  24
Enter a number:  29
Enter a number:  712
Enter a number:  hello


hello is not numeric; try again


Enter a number:  


{'odds': 3, 'evens': 4}


# Exercise: Digits, vowels, and others (dict edition)

1. Set up a dict with keys `digits`, `vowels`, and `others`, and all of them having 0 for values.
2. Ask the user, repeatedly, to enter a string.
    - If the user enters an empty string, stop asking
3. Go through each character in the user's input
    - If the character is a digit, add 1 to `counts['digits']`
    - If the character is a vowel, add 1 to `counts['vowels']`
    - Otherwise, add 1 to `counts['others']`
4. Print `counts`

In [41]:
counts = {'digits':0,
          'vowels':0,
          'others':0}

while True:
    s = input('Enter text: ').strip()

    if s == '':
        break

    for one_character in s:
        if one_character.isdigit():
            counts['digits'] += 1
        elif one_character in 'aeiou':
            counts['vowels'] += 1
        else:
            counts['others'] += 1

print(counts)            

Enter text:  hello out there
Enter text:  this is great!
Enter text:  my shoe size is 46
Enter text:  


{'digits': 2, 'vowels': 15, 'others': 30}


In [42]:
# AE

counts = {'digits':0, 'vowels':0, 'others':0}

while True:
      stringy = input("Enter a string (or press Enter to quit): ")
      if stringy == '':
            break
      
      for char in stringy:
            if char.isdigit():
                  counts['digits'] += 1
            elif char.lower() in 'aeiou':
                  counts['vowels'] += 1
            else:
                  counts['others'] += 1

print(counts)

Enter a string (or press Enter to quit):  hello!! 123
Enter a string (or press Enter to quit):  goodbye!
Enter a string (or press Enter to quit):  


{'digits': 3, 'vowels': 5, 'others': 11}


In [43]:
# ID

counts = {'digits':0, 'vowels': 0, 'others': 0}

while True:
    inpt = input('Please, enter a string: ')
    
    if inpt == '':
        break
    for char in inpt:
        if char.isdigit():
            counts['digits'] += 1
        elif char in 'aeouiAEOUI':
            counts['vowels'] += 1
        else:
            counts['others'] += 1
            
print(f"vowels: {counts['vowels']}")
print(f"digits: {counts['digits']}")
print(f"others: {counts['others']}")

Please, enter a string:  hello!! 123
Please, enter a string:  


vowels: 2
digits: 3
others: 6


# Accumulate the unknown

What if I don't know what data I'm going to get? I can still count what comes in, but I'll need a different strategy.

For example, let's say that I want to count how many times each character appears in a user's input string. I can use a dict for this, but I don't want to create a dict in which the keys are EVERY CHARACTER IN THE WORLD and the values are all 0.

Instead, we can start with an empty dict. As we go through each character:

- If we've seen the character before, we'll add 1 to its count
- If we haven't seen the character before, we'll add a new key-value pair with the character and 1

In the end, we can print the whole dict

In [44]:
# Example

counts = {}  # empty dict -- it will be populated by characters as keys and integers (counts) as values

s = input('Enter text: ').strip()

for one_character in s:
    counts[one_character] += 1   # this is what I want!

print(counts)    

Enter text:  hello


KeyError: 'h'

In [46]:
# Example

counts = {}  # empty dict -- it will be populated by characters as keys and integers (counts) as values

s = input('Enter text: ').strip()

for one_character in s:
    if one_character in counts:      # if one_character is already a key in counts...
        counts[one_character] += 1   #     increment by 1
    else:
        counts[one_character] = 1    # add a new key-value pair

print(counts)    

Enter text:  whatever you say


{'w': 1, 'h': 1, 'a': 2, 't': 1, 'e': 2, 'v': 1, 'r': 1, ' ': 2, 'y': 2, 'o': 1, 'u': 1, 's': 1}
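
Another way to write the same loop, as a sketch: `dict.get` takes a key and a default value, and returns the default if the key isn't there. That lets us fold the `if`/`else` into a single line. The input string is hard-coded here so that the example runs without typing anything.

```python
counts = {}
s = 'hello'

for one_character in s:
    # counts.get returns the current count, or 0 if we haven't seen this character yet
    counts[one_character] = counts.get(one_character, 0) + 1

print(counts)   # {'h': 1, 'e': 1, 'l': 2, 'o': 1}
```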


# Exercise: Rainfall

1. Define an empty dict, `rainfall`. We will fill it with city names (strings) as keys and mm of rain (integers) as values.
2. Ask the user, repeatedly, to enter the name of a city.
    - If the city name is empty, stop asking
3. If you got a city, ask (a separate question) the user for the number of mm of rain that fell in that city most recently.
    - Let's assume that you will get numeric input.
4. If you've seen this city before, add its rainfall to the existing total
5. If you haven't seen this city before, add the city + the mm rain as a key-value pair.
6. When you're done, print the entire dict.

Example:

    City: Tel Aviv
    Rain: 4
    City: Jerusalem
    Rain: 3
    City: Tel Aviv
    Rain: 2
    City: [ENTER]

    {'Tel Aviv': 6,
     'Jerusalem': 3}

In [48]:
rainfall = {}

while True:
    city_name = input('City: ').strip()

    if city_name == '':   # didn't get a city name? break!
        break

    s = input('Rain: ').strip()
    mm_rain = int(s)  # we should, at some point, check to see if this won't break things

    if city_name in rainfall:            # if we've seen city_name before...
        rainfall[city_name] += mm_rain   # ... add mm_rain to the existing value
    else:
        rainfall[city_name] = mm_rain    # if this city is new, then add the key, value pair

print(rainfall)        

City:  a
Rain:  5
City:  b
Rain:  4
City:  a
Rain:  3
City:  


{'a': 8, 'b': 4}


In [None]:
# ID


rainfall = {}

while True:
    city = input('Enter a City name: ').strip()

    if city == '':
        break
    
    rain = int(input('Enter the amount of rain in mm: ').strip())
    if city not in rainfall:
        rainfall[city] = rain
    else:
        rainfall[city] += rain

print(rainfall)
    



# Getting valid numeric input

If we want to ensure that we got valid numeric input, so far we've used `str.isdigit`. It's imperfect, but it works for simple cases.

But what if we want a more sophisticated check? What if we want to let people enter fractional numbers (i.e., floats)? How can we check for that?

Python basically says: Don't check. Assume the best, but prepare for the worst.

Meaning: Assume it'll work, but put that code inside of a `try` block. That means: let's hope that it works ... but if it doesn't, and we get an "exception" (a fancy word for an error), then we can trap that error and go on.

In [None]:
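try:
    amount = float(input(f"Enter rainfall amount for {city}: "))   # float() raises ValueError on bad input

    if city in rainfall:          # avoid a KeyError for a city we haven't seen before
        rainfall[city] += amount
    else:
        rainfall[city] = amount
except ValueError:
    print("Invalid input. Please enter a numeric value.")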
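
To see why `str.isdigit` is imperfect, here is a small sketch that runs both checks on a handful of made-up strings. For each one, it records whether `isdigit` accepts it and whether `float` can convert it.

```python
candidates = ['42', '3.5', '-4', 'hello', '']
results = {}

for text in candidates:
    try:
        float(text)          # we only care whether this raises an exception
        float_ok = True
    except ValueError:
        float_ok = False

    results[text] = (text.isdigit(), float_ok)

for text, checks in results.items():
    print(f'{text!r}: isdigit={checks[0]}, float works={checks[1]}')
```

Only `'42'` passes both checks. `'3.5'` and `'-4'` are rejected by `isdigit` even though `float` handles them without complaint.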

# Rainfall, data analytics edition

What if, instead of wanting to know the total amount of rain in a given city, we wanted to know some more general stats about the rain in that city:

- min rainfall (least in a day)
- max rainfall (most in a day)
- mean (average)
- total

To do this, it's probably easiest not to add to the accumulation each time, but rather to have a list of integers for each city.

In [49]:
rainfall = {}

while True:
    city_name = input('City: ').strip()

    if city_name == '':   # didn't get a city name? break!
        break

    s = input('Rain: ').strip()
    mm_rain = int(s)  # we should, at some point, check to see if this won't break things

    if city_name in rainfall:                 # if we've seen city_name before...
        rainfall[city_name].append(mm_rain)   # ... append mm_rain to the existing list
    else:
        rainfall[city_name] = [mm_rain]      # if this city is new, then add the key, value pair with the value in a 1-element list

print(rainfall)        

City:  a
Rain:  5
City:  b
Rain:  4
City:  c
Rain:  3
City:  a
Rain:  10
City:  b
Rain:  9
City:  c
Rain:  8
City:  


{'a': [5, 10], 'b': [4, 9], 'c': [3, 8]}


In [52]:
# how can I calculate the min, max, mean, and total on city a?

city = 'a'

min(rainfall[city])  # the min function returns the smallest value in a list

5

In [53]:
max(rainfall[city])  # the max function returns the largest value in a list

10

In [54]:
sum(rainfall[city])  # the sum function returns the total of a list

15

In [55]:
sum(rainfall[city]) / len(rainfall[city])  # get the mean

7.5

# Next up

1. Iterating over a dict (different ways to do it)
2. How are dicts implemented?
3. Files

You'll want to download a zipfile with a few text files I'll use in exercises.

https://files.lerner.co.il/exercise-files.zip

# Iterating over a dict

We've seen that we can iterate over a variety of data structures in Python:

- strings (we get each character)
- lists (we get each element)
- tuples (we get each element)

What happens if we iterate over a dict?

We get the keys!

In [59]:
d = {'a':10, 'b':20, 'c':30}

for one_item in d:
    print(one_item)

a
b
c


In [60]:
for one_key in d:
    print(f'{one_key}: {d[one_key]}')

a: 10
b: 20
c: 30


In [61]:
# someone has probably noticed that there is a dict method called dict.keys
# if we call it, we get the keys back

d.keys()

dict_keys(['a', 'b', 'c'])

In [62]:
for one_key in d.keys():
    print(f'{one_key}: {d[one_key]}')

a: 10
b: 20
c: 30


Should we prefer iterating over `d` or `d.keys()`?

Answer: `d`, by a *LOT*.

If we iterate over `d`, it's shorter (to write) and faster (to execute), because it doesn't need to search for the method and then execute it.

In general, iterating over `d.keys()` or searching in it with `in` is a bad idea; I'd even say the same about using `d.keys()` at all. You should have a very good reason for invoking `d.keys()`.
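
To see that the two spellings give the same answers, here is a small sketch that checks membership and collects the keys both ways. It doesn't measure speed; it only shows that nothing is lost by leaving `.keys()` out.

```python
d = {'a': 10, 'b': 20, 'c': 30}

print('a' in d)             # True
print('a' in d.keys())      # True (same answer, more typing)

keys_direct = []
for one_key in d:
    keys_direct.append(one_key)

keys_via_method = []
for one_key in d.keys():
    keys_via_method.append(one_key)

print(keys_direct == keys_via_method)   # True
```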

In [63]:
# what if I want to iterate over the values?
# we can do this with the dict.values method, *but* we won't have the keys

for one_value in d.values():
    print(one_value)

10
20
30


In [64]:
# My favorite way to iterate over a dict is to use the dict.items method

for one_item in d.items():
    print(one_item)

('a', 10)
('b', 20)
('c', 30)


In [66]:
# you might remember from last week that we saw how we can use "tuple unpacking"
# right in the for loop

for key, value in d.items():
    print(f'{key}: {value}')

a: 10
b: 20
c: 30


Why do I like `dict.items`?

1. We get both the key and the value. We don't need to start rooting around and retrieving them.
2. They're in variables we can name, which makes our code more expressive.


In [67]:
rainfall

{'a': [5, 10], 'b': [4, 9], 'c': [3, 8]}

In [70]:
# now I can use a "for" loop on rainfall to calculate all of our stats for each city

for city_name, rain_list in rainfall.items():
    print(city_name)
    total = sum(rain_list)
    min_rain = min(rain_list)
    max_rain = max(rain_list)
    mean = total / len(rain_list)

    print(f'\t{total=}\n\t{min_rain=}\n\t{max_rain=}\n\t{mean=}')
    

a
	total=15
	min_rain=5
	max_rain=10
	mean=7.5
b
	total=13
	min_rain=4
	max_rain=9
	mean=6.5
c
	total=11
	min_rain=3
	max_rain=8
	mean=5.5


# Exercise: Input to dict

1. Create an empty dict.
2. Repeatedly ask the user to enter both a word and number, separated by whitespace.
   - If they enter the empty string, break out of the loop
3. Add the new key-value pair to the dict, where the word is the key and the number is the value.
4. After adding all of the key-value pairs, iterate over the dict and display each key-value pair.
5. Print the pair with the highest number and the lowest number.

Example:

    Enter pair: a 17
    Enter pair: b 20
    Enter pair: c 7
    Enter pair: d 35
    Enter pair: [ENTER]

    a 17
    b 20
    c 7
    d 35

    lowest is c:7
    highest is d:35    

In [75]:
pairs = {}
lowest_value = None
highest_value = None

# add to the dict
while True:
    s = input('Enter pair: ').strip()

    if s == '':
        break

    key, value = s.split()   # assumption #1: we only have two items here, key and value
    value = int(value)       # assumption #2: value can be turned into an int
    pairs[key] = value

    if lowest_value is None:   # first time? Just use value
        lowest_value = value

    if highest_value is None:  # first time? Just use value
        highest_value = value

    if value < lowest_value:  # do we have a new lowest_value?
        lowest_value = value

    if value > highest_value:  # do we have a new highest_value?
        highest_value = value

# iterate over the dict, printing it
for key, value in pairs.items():
    if value == lowest_value:
        note = '(lowest)'
    elif value == highest_value:
        note = '(highest)'
    else:
        note = ''

    print(f'{key}: {value} {note}')

Enter pair:  a 10
Enter pair:  b 20
Enter pair:  c 30
Enter pair:  d 17
Enter pair:  e 5
Enter pair:  f 32
Enter pair:  


a: 10 
b: 20 
c: 30 
d: 17 
e: 5 (lowest)
f: 32 (highest)
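
As an aside, here is a shorter sketch for finding the lowest and highest pairs once the dict is filled. It relies on `min` and `max` accepting a `key` argument, a function that says what to compare by; passing `pairs.get` makes them compare the dict's keys by their values. The data is hard-coded from the exercise's example.

```python
pairs = {'a': 17, 'b': 20, 'c': 7, 'd': 35}

lowest_key = min(pairs, key=pairs.get)    # the key whose value is smallest
highest_key = max(pairs, key=pairs.get)   # the key whose value is largest

print(f'lowest is {lowest_key}:{pairs[lowest_key]}')     # lowest is c:7
print(f'highest is {highest_key}:{pairs[highest_key]}')  # highest is d:35
```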


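In [74]:
# from an earlier run, apparently before the int(value) line was added:
# the values were still strings, and strings are compared character by character, so '7' beat '35'
highest_value

'7'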

In [76]:
# can we have a list of dicts?

mylist = [{'a':10, 'b':20},
          {'a':15, 'b':25},
          {'a':30, 'b':35}]
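
Yes, and each element of the list is then a full dict that we can retrieve from. A short sketch, using the same `mylist`, that grabs one value and adds up the `'a'` values across all of the dicts:

```python
mylist = [{'a':10, 'b':20},
          {'a':15, 'b':25},
          {'a':30, 'b':35}]

total_a = 0
for one_dict in mylist:
    total_a += one_dict['a']

print(mylist[1]['b'])   # 25 (index 1 gets the second dict, then 'b' gets the value)
print(total_a)          # 55
```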

# How do dicts work?

When you search for a value in a list, Python basically runs a `for` loop. As you can imagine, that means if you have a very long list, finding something there (or finding if it is there) can take a long time. 

By contrast, finding a key in a dict is almost instantaneous. 

The reason is how they are designed.

Lists work as you might expect:

- Each value has an index
- The indexes start at 0
- As we add new values, they go at the end (usually)

Dicts, though, work a different way:
- We store each key-value pair together
- Where do we store it? At a memory location determined by running a function on the key
- That function, known as a "hash" function, gives us what looks like a random number, but is consistent
- `hash('a')` and `hash('b')` are nowhere near each other, but each will always give the same answer (within a single run of Python)
- When we store ('a', 10), Python runs `hash('a')`, gets a memory location, and stores it there.
- If we want `d['a']`, it runs `hash('a')`, jumps to that location, and retrieves the value (10).
- Or it jumps to that location and finds that the pair isn't there.
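
To make those steps concrete, here is a toy sketch of the idea. It is not how Python's real dict is implemented (the real thing is written in C and handles collisions differently), and the names `buckets`, `store`, and `lookup` are invented for this example. The point is only that `hash(key)` tells us where to look, so we never scan the whole table.

```python
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]   # 8 empty slots, each a list of [key, value] pairs

def store(key, value):
    bucket = buckets[hash(key) % NUM_BUCKETS]   # the hash picks the slot
    for pair in bucket:
        if pair[0] == key:      # key already stored? replace its value
            pair[1] = value
            return
    bucket.append([key, value])

def lookup(key):
    bucket = buckets[hash(key) % NUM_BUCKETS]   # jump straight to the same slot
    for stored_key, stored_value in bucket:
        if stored_key == key:
            return stored_value
    raise KeyError(key)         # we looked in the only place it could be

store('a', 10)
store('b', 20)
store('a', 99)        # replaces the earlier value for 'a'

print(lookup('a'))    # 99
print(lookup('b'))    # 20
```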

Dicts are flexible, in that we can use whatever we want (almost) as a dict key. Our data doesn't need artificial IDs, as in a list.

Dicts are fast, in that the lookup is very speedy.

# Files

What are files?

As programmers, we think of files as stored versions of data structures.

We have in-memory data structures. But when we turn the computer off, then those go away. Having a file means that we can (later on) retrieve those data structures, and continue from where we left off.

There are lots of formats for doing this, but we are going to use text files. 

How can we read from a text file?

In olden times, a program could just talk to the disk and read from it. Nowadays, that's a problem both from a security perspective and an OS API perspective. 

If you want to talk to a file, you thus need to:

- Tell the operating system that you want to open it
- Open it for reading
- This gives you a "file object," or "file handle," or "file agent" that works on your behalf
- You ask the file handle to read from the file
- When you're done, it's a good idea to close the file object, so that it releases the resources (memory and the operating system's file handle) it was using.
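
Those steps can be sketched as follows. So that the example runs anywhere, it first creates a small file in a temporary directory (the filename and contents are made up, and writing gets its own discussion later), and then opens, reads, and closes it.

```python
import os
import tempfile

# set-up only: create a small text file to read from
filename = os.path.join(tempfile.mkdtemp(), 'greeting.txt')
f = open(filename, 'w')
f.write('hello\nout there\n')
f.close()

f = open(filename)     # ask the OS to open the file for reading; we get a file object back
contents = f.read()    # ask the file object for the file's contents
f.close()              # tell the OS that we're done with the file

print(contents)
print(f.closed)        # True
```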

In [78]:
# open our file!
f = open('/etc/passwd')  

In [79]:
# I want to see the contents of this file
# I can thus invoke the method f.read()
# this will return the entire contents of the file as a (potentially very long) string

f.read()

'##\n# User Database\n# \n# Note that this file is consulted directly only when the system is running\n# in single-user mode.  At other times this information is provided by\n# Open Directory.\n#\n# See the opendirectoryd(8) man page for additional information about\n# Open Directory.\n##\nnobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\nroot:*:0:0:System Administrator:/var/root:/bin/sh\ndaemon:*:1:1:System Services:/var/root:/usr/bin/false\n_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico\n_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false\n_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false\n_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false\n_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false\n_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false\n_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false\n_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/fa

In [81]:
# what happens if the file is 200 GB in size?
# in theory, f.read() still works fine; in practice, you would need enough memory to hold it all
# if the file is small, it's convenient to get it in one string

# but there is another way, and a more idiomatic way

f = open('/etc/passwd')

for one_line in f:   # iterate over a file object, you get each line, as a string, one at a time
    print(one_line.strip())  # print always adds a newline when it displays things, and we already had a newline in one_line

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [None]:
for one_line in open('/etc/passwd'):   # this is an "absolute path" on my computer -- sort of like c:\abcd\efgh.txt
    print(one_line.strip())  

# Filenames 101

- If the filename doesn't contain any `/` (or `\` on Windows), then we just look in the current directory, the same one in which Jupyter is running.
- If the filename contains `/` but doesn't start with it, then it's a "relative path," starting in the current directory but looking either in a subdirectory or (if it contains `..`) in a parent or parallel directory.
- If the filename starts with `/`, then it's an "absolute path," an unambiguous description of a file's location on disk.
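
One way to check which kind of path you have is the `isabs` function in the standard library. A small sketch: the example paths are made up (and need not exist), and it uses `posixpath`, the Unix flavor of `os.path`, so that the answers are the same on every operating system.

```python
import posixpath

kinds = {}
for one_path in ['notes.txt', 'data/notes.txt', '../notes.txt', '/etc/passwd']:
    if posixpath.isabs(one_path):    # does it start with / ?
        kinds[one_path] = 'absolute'
    else:
        kinds[one_path] = 'relative'

for one_path, kind in kinds.items():
    print(f'{one_path}: {kind}')
```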

In [82]:
%pwd

'/Users/reuven/Courses/Current/OReilly-2025-06June-python'

# Next up

1. Doing things with the files we read
2. A little bit about writing
3. A little about `with`

In [85]:
# I can read through /etc/passwd
# let's print the usernames from this file

for one_line in open('/etc/passwd'):
    if one_line[0] == '#':
        continue
    print(one_line.split(':')[0])  

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps

In [87]:
for one_line in open('linux-etc-passwd.txt'):

    if one_line[0] == '#':
        continue

    if one_line.strip() == '':   # if the line only contains whitespace...
        continue
    
    print(one_line.split(':')[0])  

root
daemon
bin
sys
sync
games
man
lp
mail
news
uucp
proxy
www-data
backup
list
irc
gnats
nobody
syslog
messagebus
landscape
jci
sshd
user
reuven
postfix
colord
postgres
dovecot
dovenull
postgrey
debian-spamd
memcache
genadi
shira
atara
shikma
amotz
mysql
clamav
amavis
opendkim
gitlab-redis
gitlab-psql
git
opendmarc
dkim-milter-python
deploy
redis
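
Here is a sketch that combines this kind of line-by-line filtering with a dict: it maps each username to the last field on its line (the login shell). To keep the example self-contained, it loops over a short list of strings rather than a real file; the sample lines are copied from the `/etc/passwd` output shown earlier.

```python
lines = ['# User Database\n',
         '\n',
         'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n',
         'root:*:0:0:System Administrator:/var/root:/bin/sh\n']

shells = {}
for one_line in lines:
    if one_line.startswith('#') or one_line.strip() == '':
        continue     # skip comments and blank lines

    fields = one_line.strip().split(':')
    shells[fields[0]] = fields[-1]    # username is the first field, shell is the last

print(shells)   # {'nobody': '/usr/bin/false', 'root': '/bin/sh'}
```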


In [88]:
!head mini-access-log.txt

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"
66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.12 - - [30/J

# Exercise: IP address count

1. Create an empty dict, `counts`. We will fill this with IP addresses (strings) as keys and counts (integers) as values.
2. Go through the `mini-access-log.txt` file, one line at a time.
3. Grab the IP address:
    - If it's already in our dict as a key, increment the count by 1
    - If it isn't in our dict as a key, add it, with a value of 1
4. Iterate over the dict, printing each key-value pair

In [95]:
counts = {}

for one_line in open('mini-access-log.txt'):
    ip_address = one_line.split()[0]

    if ip_address in counts:      # if we have seen this IP address before,
        counts[ip_address] += 1   # ... add 1 to its count
    else:
        counts[ip_address] = 1

for key, value in counts.items():
    print(f'{key}\t{value}')

67.218.116.165	2
66.249.71.65	3
65.55.106.183	2
66.249.65.12	32
65.55.106.131	2
65.55.106.186	2
74.52.245.146	2
66.249.65.43	3
65.55.207.25	2
65.55.207.94	2
65.55.207.71	1
98.242.170.241	1
66.249.65.38	100
65.55.207.126	2
82.34.9.20	2
65.55.106.155	2
65.55.207.77	2
208.80.193.28	1
89.248.172.58	22
67.195.112.35	16
65.55.207.50	3
65.55.215.75	2
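
A natural follow-up is to show the busiest IP addresses first. This is a sketch under a few assumptions: it uses a hard-coded handful of the counts from above rather than re-reading the file, and it leans on `sorted` with a `key` function, neither of which has been covered yet. The helper name `by_count` is made up.

```python
counts = {'67.218.116.165': 2, '66.249.65.12': 32,
          '66.249.65.38': 100, '89.248.172.58': 22}

def by_count(pair):
    return pair[1]      # each pair is (ip_address, count); sort by the count

ranked = sorted(counts.items(), key=by_count, reverse=True)   # largest count first

for ip_address, count in ranked:
    print(f'{ip_address}\t{count}')
```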
