# Agenda: Week 3, Dictionaries and files

1. Recap of topics + Q&A
2. Dictionaries
    - Define them
    - Retrieving from them
    - Modifying them (dicts are mutable)
    - Uses for dictionaries (including accumulating)
    - Looping over dicts
    - How do dictionaries work?
3. Files
    - What are files?
    - Reading from (text) files
    - Writing to files and the "with" statement

# The story so far...

We organize our data into data structures.  So far, we've seen a bunch of data structures:

- Integers and floats (numbers)
- Strings
- Lists
- Tuples

We saw, last week, that strings, lists, and tuples are all part of the same "sequence" family. They all implement similar functionality.  (They are also different, but there's a lot in common.)

What can sequences do?
- Retrieve one item from an index
- Retrieve a slice using `[start:end]`
- Iterate with a `for` loop
- Search with `in`
- Get the length with `len`
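
As a quick recap, here is a small sketch of those five operations. The variable names (`s`, `mylist`, `t`) and their contents are made up for illustration.

```python
s = 'abcde'                 # a string
mylist = [10, 20, 30, 40]   # a list
t = (100, 200, 300)         # a tuple

first_char = s[0]           # retrieve one item from an index
middle = mylist[1:3]        # retrieve a slice: from index 1, up to (not including) 3

total = 0
for one_item in t:          # iterate with a for loop
    total += one_item

found = 20 in mylist        # search with in
size = len(s)               # get the length

print(first_char, middle, total, found, size)
```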

# Dictionaries (`dict`)

Lists and tuples are what we call ordered sequences of data: We know what order they'll be in (i.e., their index order). And we can do all of our sequence stuff with them.

However, every list (and every tuple) has fixed indexes -- starting with 0, then 1, then 2... all the way up to `len(seq) - 1`, which will be the highest index.

Dictionaries are well known in the programming world, often with other names:
- Hash tables
- Hashes
- Key-value stores
- Name-value stores
- Associative arrays
- Mappings
- Keymaps
- Hashmaps

The important thing to understand about dicts is that we can dictate not just the values, but also the keys -- which are what we call the indexes in a dictionary.

So every list with 10 elements will have indexes from 0 through 9. But a dict with 10 elements... we don't know what the keys will be without checking, because whoever created the dict got to decide what the keys are.

There are many *many* cases in programming when we would want these key-value pairs:

- Usernames and user IDs
- User IDs and usernames
- Month names and numbers
- Month numbers and names
- Filenames and file contents
- Filenames and timestamps
- Timestamps and lists of users who were logged in then

Dictionaries are arguably the most important data structure in Python. The language itself uses them behind the scenes for many things.

In [1]:
# how do I create a dictionary?

# (1) we use {}, curly braces, around the dict
# (2) every key has a value, and every value has a key
# (3) each key-value pair is separated by a :
# (4) the pairs are separated by ,

d = {'a':10, 'b':20, 'c':30}

In [2]:
type(d)  # what kind of data structure is d?

dict

In [3]:
len(d)   # how many pairs are there in d?

3

In [4]:
# what is the value associated with the key 'a'?

d['a']   # I use [], just like with strings, lists, and tuples, but here I give a key, not a numeric index

10

In [5]:
d['b']

20

In [6]:
d['c']

30

In [7]:
d['x']   # what happens if we ask for a key that doesn't exist?

KeyError: 'x'

In [8]:
# if I want to know whether a key exists, so as to avoid such an error,
# I can do that with the "in" operator

# in only checks in the keys for an exact match.  It never checks the values.

'a' in d

True

In [9]:
'b' in d

True

In [10]:
10 in d

False

In [11]:
d

{'a': 10, 'b': 20, 'c': 30}

# Two more notes about keys

1. The keys in a dict are unique. No key can exist more than once.
2. Anything at all can be a dict value. But only immutable values (numbers and strings, typically) can be keys.
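
Here's a short sketch of both points. The dict contents are invented for the example.

```python
d = {'a': 10, 'b': 20, 'a': 30}   # 'a' appears twice in the definition
print(d)                           # the later value wins, and there's only one 'a'
print(len(d))

# numbers and strings can both be keys, even in the same dict
months = {1: 'Jan', 2: 'Feb', 'Jan': 1, 'Feb': 2}
print(months[1])
print(months['Feb'])
```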

In [12]:
# if you really want, you can run the method .values() on a dictionary, and get back the values

d.values()    # special data structure, sort of like a list (but not really)

dict_values([10, 20, 30])

In [13]:
20 in d.values()  # this searches for 20 in the values of the dict d

True

In [None]:
# defining a dict is always {key:value, key:value, key:value}

d = {'a':10, 'b':20, 'c':30}

# Exercise: Restaurant

1. Define a new dict, called `menu`, in which the keys are the items on a restaurant's menu, and the values are the prices (in whatever currency you want).
2. Define `total` to be 0.
3. Ask the user, repeatedly, what they want to order.
    - If they enter an empty string, stop asking -- exit the loop
    - If they enter a string that is an entry in the menu (i.e., a key in `menu`), then print the price, and the new total
    - If they enter a string that is *not* an entry in the menu, then scold the user
4. After exiting the loop, print the total

Example:

    Order: sandwich
    sandwich is 10, total is 10
    Order: tea
    tea is 3, total is 13
    Order: elephant
    we are fresh out of elephant today!
    Order: [ENTER]
    total is 13
    
Hints/reminders:
- Use a `while True` loop for an infinite loop
- You can get input from the user with `input`
- You can check if the user just pressed ENTER by comparing the result of `input` with an empty string.
- Check if a key is in a dict with `in`
- Retrieve a value from a dict with `[]`
- Don't forget that a key can be assigned to a variable, and the variable can then be used for searching/retrieving.

In [None]:
menu = {'sandwich':10, 'tea':3, 'apple':4, 'cake':6}

total = 0

# ask the user, repeatedly, to order something

while True:
    s = input('Order: ').strip()   # input gets a string from the user, strip removes leading/trailing whitespace
    
    if s == '':
        break   # exits our "while" loop right away
        
    if s in menu:   # is the user's input a key in our "menu" dict?
        price = menu[s]  # get the price for s
        total += price   # add the price to the total
        print(f'{s} costs {price}, total is now {total}')
        
    else:   # the user's order is *not* a key in our dict
        print(f'We are out of {s} today!')
        
print(f'Total is {total}')

In [14]:
months = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4}

months['Feb']

2

In [15]:
# a dict's values can be ABSOLUTELY ANYTHING YOU WANT
# a dict's values can be strings
# a dict's values can be lists
# a dict's values can be dictionaries
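
To make that concrete, here is a small sketch of one dict with several kinds of values. The `menu_item` dict and everything in it are invented for illustration.

```python
menu_item = {'name': 'sandwich',                    # a string
             'price': 10,                           # an integer
             'toppings': ['lettuce', 'tomato'],     # a list
             'sizes': {'small': 8, 'large': 12}}    # another dict

print(menu_item['name'])
print(menu_item['toppings'][0])      # get the list, then index 0 of that list
print(menu_item['sizes']['large'])   # get the inner dict, then its 'large' key
```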

In [16]:
# can we modify a dictionary?
# in other words: are they mutable?

# ints, strings, and tuples are all immutable
# lists are mutable -- we can change the object without getting a new object back

In [17]:
d = {'a':10, 'b':20, 'c':30}

# how can I change a value in a dict?
# answer: just assign a new value to that key

d['a'] = 999
d

{'a': 999, 'b': 20, 'c': 30}

In [18]:
# how can I add a new key-value pair to a dict?
# unlike lists, which have a special, different "append" method, in dicts... you just assign

d['x'] = 876   # after this assignment, we find that there's a new key-value pair, x:876
d

{'a': 999, 'b': 20, 'c': 30, 'x': 876}

In [19]:
# if I assign to the 'x' key a second time, we'll simply update the value
d['x'] = 543

d

{'a': 999, 'b': 20, 'c': 30, 'x': 543}

In [21]:
# let's keep track of fastest times for people on a team

times = {}   # empty dictionary

while True:
    print(times)
    name = input('Enter a name: ').strip()
    
    if name == '':   # exit the loop if we got an empty string
        break
        
    new_value = input('Enter new value: ').strip()
    
    times[name] = new_value   # note, we're storing a string here, not an integer
    
print(times)      # print our dictionary

{}
Enter a name: Reuven
Enter new value: 10
{'Reuven': '10'}
Enter a name: Atara
Enter new value: 20
{'Reuven': '10', 'Atara': '20'}
Enter a name: Reuven
Enter new value: 15
{'Reuven': '15', 'Atara': '20'}
Enter a name: 
{'Reuven': '15', 'Atara': '20'}
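
The loop above stores whatever the user typed, as a string, and simply overwrites any earlier value. If we really want the *fastest* time, one possible approach is to convert to an integer and keep the smaller number. This is only a sketch: the `new_results` list stands in for what the user would type, and it assumes that lower numbers are faster.

```python
times = {}

# stand-in for user input: (name, time) pairs, with the times as strings
new_results = [('Reuven', '10'), ('Atara', '20'), ('Reuven', '15'), ('Atara', '18')]

for name, new_value in new_results:
    new_time = int(new_value)      # convert the string to an integer

    if name not in times:          # first time we see this person
        times[name] = new_time
    elif new_time < times[name]:   # a faster (smaller) time replaces the old one
        times[name] = new_time

print(times)
```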


In [22]:
times.values()   # this returns all of the values in our dict

dict_values(['15', '20'])

In [23]:
times.keys()    # this returns all of the keys in our dict

dict_keys(['Reuven', 'Atara'])

In [24]:
# we normally don't need the "keys" method, because we can search using "in" directly on the dict

'Reuven' in times   

True

In [25]:
'Joe' in times

False

In [26]:
# I could search in times.keys().... but we shouldn't

'Reuven' in times.keys()   # this works, but it first runs the keys method -- an extra step we don't need

True

Modern Python dictionaries (Python 3.7 and later) are ordered by key insertion.

Meaning: When you define your dict, the items will be kept in the order of your definition.  If I define

    {'a':10, 'b':20, 'c':30}
    
Then the dict is ordered 'a' -> 'b' -> 'c'.

But if I now add a key-value pair, it'll be at the end:

    d['x'] = 100
    
Now the order will be 'a' -> 'b' -> 'c' -> 'x'

The order normally doesn't make much difference, except if you're iterating over a dict in a for loop, which we'll do later.

# Next up:

Two more paradigms for working with dicts
- Accumulating with known keys
- Accumulating from nothing


# Accumulating with known keys

In this paradigm, we define a dict with keys and starting values. (Often, the values will be 0.) Then we use our dict to count how many times something occurs.

An example: Let's count odd and even numbers in a list.

In [28]:
mylist = [10, 15, 12, 13, 17, 18]

counts = {'odds':0,    # this is my dict -- I define the keys and starting values
          'evens':0}

for one_number in mylist:
    print(f'Checking {one_number}')
    
    # we can divide the number by 2, and check the remainder -- using the % "modulus" operator
    # if the remainder is 0, the number is even
    # if the remainder is 1, the number is odd
    
    if one_number % 2 == 0:
        counts['evens'] += 1     # update the value associated with the "evens" key
        print(f'\t{one_number} is even')    # \t == tab
    else:
        counts['odds'] += 1      # update the value associated with the "odds" key
        print(f'\t{one_number} is odd')
        
print(counts)          # how does the dict look?

Checking 10
	10 is even
Checking 15
	15 is odd
Checking 12
	12 is even
Checking 13
	13 is odd
Checking 17
	17 is odd
Checking 18
	18 is even
{'odds': 3, 'evens': 3}


In [29]:
# variation -- instead of initializing the dict with 0
# let's initialize it with empty lists!

# then we won't *count* the numbers, but we'll sort them into odds and evens

mylist = [10, 15, 12, 13, 17, 18]

counts = {'odds':[],    # this is my dict -- I define the keys and starting values
          'evens':[]}

for one_number in mylist:
    print(f'Checking {one_number}')
    
    if one_number % 2 == 0:
        counts['evens'].append(one_number)
        print(f'\t{one_number} is even')    # \t == tab
    else:
        counts['odds'].append(one_number)
        print(f'\t{one_number} is odd')
        
print(counts)          # how does the dict look?

Checking 10
	10 is even
Checking 15
	15 is odd
Checking 12
	12 is even
Checking 13
	13 is odd
Checking 17
	17 is odd
Checking 18
	18 is even
{'odds': [15, 13, 17], 'evens': [10, 12, 18]}


# Exercise: Vowels, digits, and others (dict edition)

1. Define a dict in which there are three keys: `vowels`, `digits`, and `others`. Each value should be 0.
2. Ask the user to enter a string.
3. Go through each character in that string, and find if it's a vowel (a, e, i, o, u) or a digit (0-9), or neither of these -- and add 1 to the appropriate dict value.
4. When you're done, print the dict.

Example:

    Enter a string: hello!! 123
    {'vowels':2, 'digits':3, 'others': 6}

In [30]:
# simple version of our vowels-digits-others program

counts = {'vowels':0, 'digits':0, 'others':0}

s = input('Enter a string: ').strip()

for one_character in s:
    if one_character in 'aeiou':   # is it a vowel?
        counts['vowels'] += 1
    elif one_character.isdigit():  # is it a digit, 0-9?
        counts['digits'] += 1
    else:
        counts['others'] += 1
        
print(counts)        

Enter a string: hello!! 123
{'vowels': 2, 'digits': 3, 'others': 6}


In [31]:
# more complex version of our vowels-digits-others program: 
# ask the user to enter multiple strings, stopping when they enter an empty string

counts = {'vowels':0, 'digits':0, 'others':0}

while True:

    s = input('Enter a string: ').strip()
    
    if s == '':   # exit the while loop if we got an empty string
        break

    for one_character in s:
        if one_character in 'aeiou':   # is it a vowel?
            counts['vowels'] += 1
        elif one_character.isdigit():  # is it a digit, 0-9?
            counts['digits'] += 1
        else:
            counts['others'] += 1
        
print(counts)        

Enter a string: hello!! 123
Enter a string: goodbye?? 456
Enter a string: whatever!
Enter a string: 
{'vowels': 8, 'digits': 6, 'others': 19}


# Paradigms for dictionary use

1. Define the dict at the top of the program, and read from it, but don't change it.
2. Define the dict at the top of the program with keys and starting values. The keys won't change, but the values will, as we accumulate information.
3. Define an empty dict at the top of the program, and as the program runs, we accumulate both keys and values.

In [33]:
# Example of paradigm 3: count characters

# we'll ask the user to enter a string, and we'll count how often 
# each character appears in the string.  Our counts will be done
# in a dict called "counts".

counts = {}    # start with an empty dict!

s = input('Enter a string: ').strip()

for one_character in s:   # iterating over s, one character at a time, assigning each to one_character
    
    # if we have already seen this character before, add 1 to its value
    if one_character in counts:
        counts[one_character] += 1

    # this is the first time we're seeing this character
    else:  
        counts[one_character] = 1
    
print(counts)    

Enter a string: hello
{'h': 1, 'e': 1, 'l': 2, 'o': 1}


# Exercise: Rainfall

We're going to collect information about how much rain has fallen in a number of cities. Before running the program, we don't know what cities we'll be collecting information about.  But at the end of running the program, we will have collected names and values.

1. Define a new, empty dict called `rainfall`. 
2. Ask the user, repeatedly, to enter the name of a city.  We don't know, in advance, what those names might be.
3. If the user enters an empty string, `break` out of that loop -- stop asking.
4. If the user does *not* enter an empty string, then ask how much rain fell (in mm).
    - If this is the first time that we're seeing this city, then assign the key-value pair to `rainfall`, with the key being the city name, and the value being the amount of rain.
    - If this is *not* the first time that we're seeing this city, then add the new value to the existing one.
    - NOTE that you'll need to convert the user's rain number to an integer
5. After the loop exits, print `rainfall`.

Example:

    City: Jerusalem
    Rain: 5
    City: Tel Aviv
    Rain: 4
    City: Jerusalem
    Rain: 3
    City: [ENTER]
    {'Jerusalem':8, 'Tel Aviv':4}
    

In [38]:
rainfall = {}   # empty dictionary

while True:
    city_name = input('Enter city: ').strip()
    
    if city_name == '':
        break
        
    mm_rain = input('Rain: ').strip()    # this is a string!
    
    mm_rain = int(mm_rain)    # convert mm_rain to be an integer
    
    if city_name in rainfall:   # have we seen this city before -- is it a key in rainfall?
        rainfall[city_name] += mm_rain    
    else:
        rainfall[city_name] = mm_rain   # new city, just assign to the dict
    
print(rainfall)    

Enter city: a
Rain: 5
Enter city: b
Rain: 4
Enter city: a
Rain: 3
Enter city: 
{'a': 8, 'b': 4}


# Useful dictionary methods

In [41]:
# let's say I have a dict, and want to let the user enter a key and get a value

d = {'a':10, 'b':20, 'c':30}

while True:
    k = input('Enter a key: ').strip()
    
    if k == '':
        break
        
    if k in d:
        print(f'd[{k}] == {d[k]}')
    else:
        print(f'd does not have the key {k}')

Enter a key: a
d[a] == 10
Enter a key: b
d[b] == 20
Enter a key: x
d does not have the key x
Enter a key: 


# This happens a lot!

We want to retrieve a value from a dict based on the key, but it's annoying to constantly be checking if the key is there.

The solution? The `get` method, which works like `[]` but doesn't give an error if the key doesn't exist.  Rather, it returns the special value `None`, or (if you prefer) any value you specify.



In [42]:
# rewrite this code to use "get"

d = {'a':10, 'b':20, 'c':30}

while True:
    k = input('Enter a key: ').strip()
    
    if k == '':
        break
        
    print(f'd[{k}] == {d.get(k)}')


Enter a key: a
d[a] == 10
Enter a key: b
d[b] == 20
Enter a key: x
d[x] == None
Enter a key: 


In [43]:
# rewrite this code to use "get" with a default value

d = {'a':10, 'b':20, 'c':30}

while True:
    k = input('Enter a key: ').strip()
    
    if k == '':
        break
        
    print(f'd[{k}] == {d.get(k, "No key")}')


Enter a key: a
d[a] == 10
Enter a key: b
d[b] == 20
Enter a key: z
d[z] == No key
Enter a key: 


In [44]:
# rewritten rainfall program using .get

rainfall = {}   # empty dictionary

while True:
    city_name = input('Enter city: ').strip()
    
    if city_name == '':
        break
        
    mm_rain = input('Rain: ').strip()    # this is a string!
    
    mm_rain = int(mm_rain)    # convert mm_rain to be an integer
    
    # the first time we encounter a city, get will return 0 (because city_name isn't a key)
    # after that, get will return the current value
    rainfall[city_name] = rainfall.get(city_name, 0) + mm_rain
    
print(rainfall)    

Enter city: 5
Rain: 


ValueError: invalid literal for int() with base 10: ''
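
The `ValueError` above comes from calling `int` on an empty string. One possible way to avoid it is to check the string with `isdigit` before converting. In this sketch, the `entries` list is a stand-in for what the user would type (so there's no call to `input`), and the city names and numbers are just examples.

```python
rainfall = {}

# stand-in for user input: (city, rain) pairs, all strings
entries = [('Jerusalem', '5'), ('Tel Aviv', ''), ('Jerusalem', '3'),
           ('Tel Aviv', 'abc'), ('Tel Aviv', '4')]

for city_name, mm_rain in entries:
    if mm_rain.isdigit():    # only convert strings that contain digits, and nothing else
        rainfall[city_name] = rainfall.get(city_name, 0) + int(mm_rain)
    else:                    # '' and 'abc' both end up here
        print(f'Ignoring non-numeric rain value {mm_rain!r} for {city_name}')

print(rainfall)
```

Note that `isdigit` is only a rough check: it also rejects negative numbers and decimals such as `'2.5'`.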

# Next up:

- Dicts
    - How to iterate over a dict with a `for` loop
    - How do dicts work?
- Files
    - If you can download the zip file from https://files.lerner.co.il/exercise-files.zip, that's great -- but if not, you're fine
    - Reading from files
    - A little writing to files, as well
    


# Iterating over dictionaries (with `for` loops)

- If we iterate over a string, we get its characters
- If we iterate over a list or tuple, we get its elements
- What happens if/when we iterate over a dict? (Will it even work?)

In [45]:
d = {'a':10, 'b':20, 'c':30}

for one_item in d:   # when we iterate over a dict, we get the keys!
    print(one_item)

a
b
c


In [46]:
# this is why we almost never need to use the .keys() method!
# for searching, it's faster/easier to use "in" on the dict itself
# for iterating, it's faster/easier to run the for loop on the dict itself

In [47]:
for one_key in d:
    print(f'{one_key}: {d[one_key]}')

a: 10
b: 20
c: 30


In [48]:
# I'd really prefer it if I could get keys and values together
# good news: I can, with the .items() method on the dict

for t in d.items():    # we get a 2-element tuple back with each iteration
    print(t)

('a', 10)
('b', 20)
('c', 30)


In [49]:
# we can use tuple unpacking in our for loop, and assign keys and values to variables

for key, value in d.items():
    print(f'{key}: {value}')

a: 10
b: 20
c: 30


In [50]:
rainfall

{}

In [51]:
# what is d.items() really returning?
# it looks sorta kinda like a list of 2-element tuples

d.items()

dict_items([('a', 10), ('b', 20), ('c', 30)])

In [52]:
counts

{'h': 1, 'e': 1, 'l': 2, 'o': 1}

In [53]:
for key, value in counts.items():
    print(f'{key}: {value}')

h: 1
e: 1
l: 2
o: 1


In [54]:
counts = {'a':3, 'b':5, 'c':10, 'd':2, 'e':8}

for key, value in counts.items():
    print(f'{key}: {value}')

a: 3
b: 5
c: 10
d: 2
e: 8


In [55]:
# we cannot add a string and an integer together in Python -- that leads to an error.
# but we CAN multiply them!

'a' * 5

'aaaaa'

In [58]:
# we can use this when printing our "counts" dict:

for key, value in counts.items():
    print(f'{key}: {value * "x"}')     # histogram!

a: xxx
b: xxxxx
c: xxxxxxxxxx
d: xx
e: xxxxxxxx


# How do dicts work?

They're based on something known as a "hash function." Hash functions are very common in software today, especially in encryption and security.

The idea of a hash function is that you give it an input, and you get a numeric output. The output looks random, but it isn't: The same input always gives the same output. With the hash functions used in security, there's also no practical way to get the input based on the output.

How is this connected?

### A tale of two offices

Imagine that you're looking for a friend who works somewhere in an office building. There are two ways you might find them:

- Searching one office at a time: The more offices there are, the longer it might take to find your friend. It's even possible that they aren't in the building, in which case you'll go through the whole building, and not find them.  This is similar to searching in a string, list, or tuple with `in`.  It basically runs a `for` loop, and searches through our data, one element at a time.

- A sign indicates that there are 26 offices, and each office is used by the people whose last names start with the appropriate letter -- so everyone whose last name starts with "A" is in office 1, with "B" in office 2, etc.  You can go directly to the right office, without any searching!  If your friend is there, great! If not, then you found out quickly.  This is how dictionary searches work.

Behind the scenes, Python stores a dict's key-value pair in a memory location that is decided by a hash function run on the key.  The key determines where things are stored.
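
Python's builtin `hash` function lets us peek at this idea. The sketch below is a big simplification, and not how Python's dicts are actually implemented: it just turns each key's hash into one of 8 numbered slots (8 is an arbitrary number chosen for the example), much like the sign pointing to one of 26 offices.

```python
number_of_slots = 8     # arbitrary, just for this illustration

for key in ['a', 'b', 'c', 10, 3.5]:
    h = hash(key)                   # an integer calculated from the key
    slot = h % number_of_slots      # always between 0 and 7
    print(f'{key!r}: hash {h}, slot {slot}')

# the same key gives the same hash every time within one run of Python,
# which is how a lookup can go straight to the right place
same = hash('a') == hash('a')
print(same)

# small integers hash to themselves
print(hash(10))
```

The numbers printed for strings will normally differ from one run of Python to the next, but within a single run they stay the same.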



In [59]:
mylist = [10, 20, 30]

d[mylist] = 'hello'

TypeError: unhashable type: 'list'
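
Lists are mutable, so Python can't hash them, and they can't be dict keys. If you do want a sequence as a key, one option is to turn the list into a tuple first, as in this sketch:

```python
d = {'a': 10, 'b': 20, 'c': 30}
mylist = [10, 20, 30]

t = tuple(mylist)     # a tuple with the same elements, but immutable
d[t] = 'hello'        # a tuple of immutable elements is allowed as a key

print(d[t])
print((10, 20, 30) in d)
```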

In [60]:
d = {'a':10, 'b':20, 'c':30}

for key, value in d.items():
    print(f'{key}: {value}')

a: 10
b: 20
c: 30


In [61]:
d.pop('b')   # remove the key-value pair associated with 'b'

20

In [62]:
d['b'] = 100   # re-insert b -- which is now at the end

for key, value in d.items():
    print(f'{key}: {value}')

a: 10
c: 30
b: 100


# Files!

What are files?  Basically, they are a way to store the contents of the computer's memory so that the data is still available after the computer is turned off, or when we want to use it on another computer.  We can write data from memory to a file, and we can then read from a file back into memory, into data structures.

How is data organized in files?  There are many, many, *many* different formats out there.  We're going to talk about text files, which only contain readable characters.  Can Python work with Excel, Word, PDF, and PowerPoint files (among others)? Yes, but we're going to ignore those.  Working with them requires reading lots of specifications -- it's much easier to assume that someone has probably solved this problem already, and use their implementation.

In order to read from a file, we need to ask the operating system for some help, for an agent that will talk to the file on our behalf. In some languages, this agent is called a "file handle." In Python, we just call it a "file object" -- or if you want to be pedantic, a "file-like object."  

To create a file object that'll allow us to read from a file on the filesystem, we use the `open` function.

    f = open(FILENAME)       # to read from a file, we pass open its name.  Open then returns a file object.
    f = open(FILENAME, 'r')  # explicitly open for reading
    

In [63]:
# I want to open the Unix file /etc/passwd
# this does *not* exist on Windows!  

f = open('/etc/passwd')  # the filename is a string, on Unix/Mac it contains /.  On Windows, \ -- use a raw string!

In [64]:
type(f)

_io.TextIOWrapper

In [65]:
# how can we now read the contents of the file we opened for reading?

# we can read the data in f in a variety of ways.  But the easiest and best way,
# in my opinion, is to iterate over it in a "for" loop

# every iteration will return a string - the next line in the file, up to and including its \n character

for one_line in f:
    print(one_line)


##

# User Database

# 

# Note that this file is consulted directly only when the system is running

# in single-user mode.  At other times this information is provided by

# Open Directory.

#

# See the opendirectoryd(8) man page for additional information about

# Open Directory.

##

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false

root:*:0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1:System Services:/var/root:/usr/bin/false

_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico

_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false

_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false

_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false

_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false

_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false

_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false

_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/fal
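
Why the blank line after every line of output? Each string we get from the file already ends with `\n`, and `print` adds a newline of its own. This sketch uses a short list of strings as a stand-in for the lines of a file, so it runs anywhere:

```python
# stand-in for lines read from a file; each ends with a newline character
lines = ['root:*:0:0\n', 'daemon:*:1:1\n']

for one_line in lines:
    print(one_line)            # two newlines: one from the line, one from print

for one_line in lines:
    print(one_line.strip())    # strip removes the trailing \n (and other whitespace)

for one_line in lines:
    print(one_line, end='')    # or tell print not to add its own newline

stripped = lines[0].strip()
```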

In [66]:
f = open('/etc/passwd')  

for one_line in f:
    print(one_line.strip())  


##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [67]:
for one_line in open('/etc/passwd'):
    print(one_line.strip())  

##
# User Database
#
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33:

In [71]:
# what if I want to print only the usernames (i.e., before the first : on each main line)?

for one_line in open('/etc/passwd'):
    if one_line[0] != '#':             # only look for a username to print if we aren't in a comment line
        print(one_line.split(':')[0])   # turn each line into a list, using : as a delimiter, and get index 0

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps
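
We can combine this with what we know about dicts. The sketch below builds a dict from usernames to user IDs (the third field on each line). To keep it self-contained, it uses a few lines copied from the output above rather than opening `/etc/passwd`; the names `lines` and `users` are made up for the example.

```python
# a few lines in the same format as /etc/passwd
lines = ['# User Database\n',
         'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n',
         'root:*:0:0:System Administrator:/var/root:/bin/sh\n',
         'daemon:*:1:1:System Services:/var/root:/usr/bin/false\n']

users = {}

for one_line in lines:
    if one_line[0] != '#':                    # ignore comment lines
        fields = one_line.strip().split(':')  # list of strings, one per field
        users[fields[0]] = int(fields[2])     # username -> user ID, as an integer

print(users)
```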

In [75]:
for one_line in open('/etc/passwd'):
    if one_line[0] == '#':
        print("No name!")
    else:
        print(one_line.split(':')[0])   

No name!
No name!
No name!
No name!
No name!
No name!
No name!
No name!
No name!
No name!
nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydev

In [72]:
# in Jupyter, if a line starts with ! that means: run a program on the operating system, in the shell
# I use a Mac, and I've installed the Unix utility wget, which downloads things from a URL
# if you don't have all of these pieces, then this cell won't work for you.
# instead, paste the URL into your browser, and then move the files to wherever you're running Jupyter

!wget https://files.lerner.co.il/exercise-files.zip

--2022-10-27 21:40:12--  https://files.lerner.co.il/exercise-files.zip
Resolving files.lerner.co.il (files.lerner.co.il)... 138.197.26.202
Connecting to files.lerner.co.il (files.lerner.co.il)|138.197.26.202|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6148 (6.0K) [application/zip]
Saving to: ‘exercise-files.zip’


2022-10-27 21:40:13 (23.2 MB/s) - ‘exercise-files.zip’ saved [6148/6148]



In [73]:
!unzip exercise-files.zip

Archive:  exercise-files.zip
  inflating: mini-access-log.txt     
  inflating: wcfile.txt              
  inflating: shoe-data.txt           
  inflating: nums.txt                
  inflating: linux-etc-passwd.txt    


In [74]:
!head mini-access-log.txt

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"
66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.12 - -

# Exercise: Print IP addresses

1. Open the file `mini-access-log.txt`, which is a logfile from an actual Apache Web server I ran years ago.
2. On each line, grab the IP address - which is the first thing on each line, before the first space character.
3. Print the IP addresses from the file.

# How can we open files?

The `open` function opens a file, based on its name. How can we specify the filename?

### Unix/Mac/Linux

- `open('myfile.txt')` means: open `myfile.txt` in the current directory
- `open('abc/myfile.txt')` means: open `myfile.txt`, which is in the `abc` subdirectory under the current directory
- `open('/abc/myfile.txt')` means: open `myfile.txt`, which is in the `abc` directory at the top of our filesystem hierarchy

### Windows

- `open('myfile.txt')` means: open `myfile.txt` in the current directory
- `open(r'abc\myfile.txt')` means: open `myfile.txt`, which is in the `abc` subdirectory under the current directory
- `open(r'c:\abc\myfile.txt')` means: open `myfile.txt`, which is in the `abc` directory at the top of our C drive
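Why the `r` before the Windows paths? Here is a small sketch of the difference between a regular string and a raw string when a path contains a backslash. The filename `abc\new.txt` is made up for illustration; I chose it because `\n` inside a regular string means "newline."

```python
# A regular string treats \n as a single newline character; a raw string does not.
regular = 'abc\new.txt'     # hypothetical path, contains \n by accident
raw = r'abc\new.txt'        # raw string: the backslash stays a backslash

print(len(regular))         # 10, because \n counts as one character
print(len(raw))             # 11, because \ and n are two separate characters

print('\n' in regular)      # True: there's a newline hiding in the "path"
print('\n' in raw)          # False
```

With a raw string, Python leaves the backslashes alone, so the path reaches `open` the way we typed it.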


In [76]:
!ls *.txt

linux-etc-passwd.txt  mini-access-log.txt  nums.txt  shoe-data.txt  wcfile.txt


In [79]:
for one_line in open('mini-access-log.txt'):
    print(one_line.split()[0])   # split without an argument splits on whitespace, then grab first item

67.218.116.165
66.249.71.65
65.55.106.183
65.55.106.183
66.249.71.65
66.249.71.65
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.131
65.55.106.131
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.186
65.55.106.186
66.249.65.12
66.249.65.12
66.249.65.12
74.52.245.146
74.52.245.146
66.249.65.43
66.249.65.43
66.249.65.43
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.25
65.55.207.25
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.94
65.55.207.94
66.249.65.12
65.55.207.71
66.249.65.12
66.249.65.12
66.249.65.12
98.242.170.241
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38


# Next up:

1. More ways to read from files, and to parse them
2. Writing to files and `with`



In [81]:
# read the entire file into memory, into a single string

f = open('/etc/passwd')
s = f.read()     # potentially dangerous: a very large file could use up all of our memory

len(s)

7868
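Here is a rough sketch comparing three ways of reading from a file: `read` (one big string), `readlines` (a list of strings), and iterating with `for` (one line at a time). So that the example runs anywhere, it first creates a small throwaway file; the filename `sample.txt` and its contents are assumptions of mine, and the setup uses writing, which we get to later in these notes.

```python
import os
import tempfile

# Setup: create a small throwaway file (hypothetical name and contents)
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
f = open(path, 'w')
f.write('first\nsecond\nthird\n')
f.close()

# 1. read() returns the entire file as a single string
f = open(path)
whole = f.read()
f.close()

# 2. readlines() returns a list of strings, one per line (newlines included)
f = open(path)
lines = f.readlines()
f.close()

# 3. Iterating gives us one line at a time, so we never hold the whole file
count = 0
for one_line in open(path):
    count += 1

print(len(whole))    # 19 characters, including the three newlines
print(lines)         # ['first\n', 'second\n', 'third\n']
print(count)         # 3
```

The first two approaches load everything into memory at once. The `for` loop is usually the safer choice when we don't know how big the file is.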

In [82]:
# Using the Unix "cat" command to view the file

!cat nums.txt

5
	10     
	20
  	3
		   	20        

 25


# Exercise: Sum numbers

1. On each line of the file `nums.txt`, there is at most one number. (There might be zero numbers.) That number might be surrounded by whitespace.
2. Define a variable, `total`, and set it to 0.
3. Go through each line of the file:
    - If, after removing leading and trailing whitespace with `strip`, what remains returns True from `isdigit`, then turn that into an integer and add it to `total`
4. After going through each line, print `total`


In [85]:
total = 0

for one_line in open('nums.txt'):
    s = one_line.strip()      # remove leading/trailing whitespace from one_line, assign to s
    
    # isdigit is a string method; it returns True if the string is non-empty and contains only digits, so here it tells us that int(s) won't raise an error

    if s.isdigit():           # if s only contains the digits 0-9 (and isn't empty)
        total += int(s)       # get an integer based on s, and add to total
        
print(f'total = {total}')        

total = 83


# Exercise: IP address count

Earlier, we iterated over the lines of `mini-access-log.txt`, grabbing and printing the IP addresses.  This time, I want you to first define an empty dict, `counts`.  Our plan is for the keys of `counts` to be IP addresses (strings) and the values of `counts` to be the number of times each IP address appeared.

1. Define `counts`, an empty dict.
2. Go through `mini-access-log.txt`, one line at a time.
3. Grab the IP address from the line.
    - If this is the first time you're seeing an IP address, add it to `counts` with the address as the key, and 1 as the value
    - If this is *not* the first time, then add 1 to the current count
4. Iterate over the dictionary, printing each key and value    

In [90]:
counts = {}

for one_line in open('mini-access-log.txt'):
    ip_address = one_line.split()[0]
    
    if ip_address in counts:      # if ip_address is a key in counts
        counts[ip_address] += 1   # add 1 to the existing count
    else:
        counts[ip_address] = 1    # otherwise, set it to be 1
    
counts    

{'67.218.116.165': 2,
 '66.249.71.65': 3,
 '65.55.106.183': 2,
 '66.249.65.12': 32,
 '65.55.106.131': 2,
 '65.55.106.186': 2,
 '74.52.245.146': 2,
 '66.249.65.43': 3,
 '65.55.207.25': 2,
 '65.55.207.94': 2,
 '65.55.207.71': 1,
 '98.242.170.241': 1,
 '66.249.65.38': 100,
 '65.55.207.126': 2,
 '82.34.9.20': 2,
 '65.55.106.155': 2,
 '65.55.207.77': 2,
 '208.80.193.28': 1,
 '89.248.172.58': 22,
 '67.195.112.35': 16,
 '65.55.207.50': 3,
 '65.55.215.75': 2}
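Here is a sketch of a slightly shorter way to do the same counting, using `dict.get` with a default of 0 in place of the `if`/`else`. It also does step 4 of the exercise, printing each key and value. To keep it self-contained, it runs on a short list of made-up lines in the same shape as the logfile, rather than on `mini-access-log.txt` itself.

```python
# A few made-up lines, shaped like the logfile (IP address first, then a space)
lines = [
    '10.0.0.1 - - "GET /a"',
    '10.0.0.2 - - "GET /b"',
    '10.0.0.1 - - "GET /c"',
]

counts = {}

for one_line in lines:
    ip_address = one_line.split()[0]

    # get returns the current count, or 0 if ip_address isn't yet a key
    counts[ip_address] = counts.get(ip_address, 0) + 1

# iterate over the dict, printing each key and value
for key, value in counts.items():
    print(f'{key}: {value}')
```

Both versions produce the same dict. The `if`/`else` version is more explicit; the `get` version is more compact.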

# Writing to files

If we want to write to a file, we still need to open it -- but we need to open it for writing.  When we open a file for reading, we can do it either as

    f = open(FILENAME)
    
or, more explicitly, as 

    f = open(FILENAME, 'r')    # the 2nd argument indicates "reading"
    
To open the file for writing, we would say

    f = open(FILENAME, 'w')    # the 'w' means: open for writing
    
When you open a file for writing, one of two things happens:

1. You get an exception, because the file could not be opened (for example, you don't have permission to write to it)
2. The file is opened for writing, and now contains 0 bytes. Any previous contents are *GONE*.

If you want to add to an existing file, you can open it in `a` ("append") mode.

When a file is open for writing, we can write data to it using the `write` method. Note that `write` doesn't add a newline to the end -- which `print` does.
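Here is a small sketch of the difference between `'w'` and `'a'`. It writes to a throwaway file in a temporary directory (the name `modes.txt` is my own choice), opening it three times: twice with `'w'` and once with `'a'`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')   # hypothetical filename

f = open(path, 'w')      # creates the file
f.write('first\n')
f.close()

f = open(path, 'w')      # 'w' again: the previous contents are gone
f.write('second\n')
f.close()

f = open(path, 'a')      # 'a': keep what's there, and add to the end
f.write('third\n')
f.close()

f = open(path)
contents = f.read()
f.close()

print(contents)          # shows second and third, but not first
```

Notice that every `write` includes its own `\n`. Without it, all three strings would end up on a single line.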

In [94]:
f = open('myfile.txt', 'w')

f.write('abcd\n')
f.write('efgh\n')
f.write('ijklmnop\n')

9

Have we really written this data to the file?

No. 

Every time we write to the file, our data actually goes into a "memory buffer" (first in Python, and then in the operating system). Only when the buffer fills up does it get "flushed" to disk. Until then, if the power goes out, our data is not actually stored.

You can force the data to be written by calling `f.flush()`.

Also, if you close the file (meaning: you don't want to write to it any more) with `f.close()`, it's automatically flushed first.

When Python exits, it automatically flushes and closes any remaining open files.
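Here is a rough sketch of buffering in action. It checks the size of the file on disk before and after calling `f.flush()`. The filename `buffered.txt` is my own choice. With a write this small, I'd expect the "before" size to be 0, but the exact buffering behavior depends on your Python and operating system, so treat the numbers as illustrative.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'buffered.txt')   # hypothetical filename

f = open(path, 'w')
f.write('abcd\n')

before = os.path.getsize(path)   # size on disk before flushing (typically 0)

f.flush()                        # force the buffer to be written out

after = os.path.getsize(path)    # now the data is in the file
f.close()

print(f'before flush: {before} bytes')
print(f'after flush: {after} bytes')
```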

In [95]:
# cat is a Unix command to show the contents of a file

!cat myfile.txt

In [96]:
f.close()

In [97]:
!cat myfile.txt

abcd
efgh
ijklmnop


# Automating flushing+closing

If we are writing to a file, and we know when we're done writing to it, we can use the `with` statement that Python provides, to ensure that the file is flushed and closed.

Here's the syntax:

```python
with open(FILENAME, 'w') as f:     # this assigns f = open(FILENAME, 'w')
    f.write('abcd\n')
    f.write('efghi\n')
    f.write('jklmnop\n')
    
# at this point, the file is flushed + closed    
```
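What does `with` save us from writing? Roughly speaking, it behaves like the sketch below, which uses `try` and `finally` to make sure that `f.close()` runs even if something goes wrong in the middle. This is an approximation, not exactly what Python does internally, and the filename `withdemo.txt` is my own choice.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'withdemo.txt')   # hypothetical filename

f = open(path, 'w')
try:
    f.write('abcd\n')
    f.write('efghi\n')
finally:
    f.close()         # runs whether or not write raised an exception

print(f.closed)       # True
```

The `with` version is shorter and harder to get wrong, which is why it's the standard way to write to files.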

In [99]:
# we can use with when reading, too

with open('/etc/passwd', 'r') as f:
    for one_line in f:
        print(len(one_line), end=' ')   # put a ' ' after each printout, not \n

3 16 3 76 71 18 2 70 18 3 59 50 54 72 62 64 70 61 71 70 70 72 56 66 62 67 52 63 60 69 58 50 50 54 66 67 59 63 64 61 62 61 62 61 55 55 74 53 65 65 55 56 50 56 88 66 61 70 81 65 62 56 75 65 54 64 75 72 85 72 67 53 55 69 77 74 94 85 97 73 84 68 71 70 63 55 82 74 64 66 76 55 78 80 56 63 82 76 63 55 69 61 99 73 55 63 79 100 57 83 62 77 104 55 67 92 89 60 62 51 74 75 53 

In [100]:
f.closed

True

In [101]:
# write to a file

with open('myfile.txt', 'w') as f:
    f.write('abcd\n')
    f.write('efghijklmn\n')  # \n means: newline

In [102]:
!cat myfile.txt

abcd
efghijklmn


In [105]:
s = 'abcd\nefgh'
print(s)

abcd
efgh


# Exercise: Dict to config file

I define a "config file" as a file in which we have names and values, separated by a `=`.  

1. Define a small dict with keys and values.
2. Iterate over this dict, one pair at a time.
3. Write each pair to a new file (`myconfig.txt`), writing each pair in `key=value` format in the file.
4. The file should contain the same number of lines as there are pairs in the dict.

In [106]:
d = {'a':10, 'b':20, 'c':'hello out there'}

with open('myconfig.txt', 'w') as f:  # open the file for writing, auto-closing at the end of the block
    for key, value in d.items():      # iterate over the dict, one key-value pair at a time
        f.write(f'{key}={value}\n')   # write the key-value pair as one line in our file

In [107]:
!cat myconfig.txt

a=10
b=20
c=hello out there
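We can also go in the other direction. Here is a sketch that writes the same dict to a config file and then reads it back into a new dict, `loaded` (a name I made up). It assumes that every line contains a `=`, and it writes to a temporary directory so that it doesn't clobber any existing `myconfig.txt`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'myconfig.txt')

d = {'a': 10, 'b': 20, 'c': 'hello out there'}

with open(path, 'w') as f:
    for key, value in d.items():
        f.write(f'{key}={value}\n')

loaded = {}

with open(path) as f:
    for one_line in f:
        # strip removes the trailing newline; split on the first = only,
        # in case the value itself contains a =
        key, value = one_line.strip().split('=', 1)
        loaded[key] = value

print(loaded)    # {'a': '10', 'b': '20', 'c': 'hello out there'}
```

Notice that the integers 10 and 20 came back as the strings `'10'` and `'20'`. Text files contain only characters, so everything we read from them is a string until we convert it.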


In [109]:
# what happens if we try to open a file, but it doesn't exist? Or we don't have permission?
# we get an EXCEPTION.

# I can trap exceptions with special Python keywords, try and except

try:
    open('asdfafdafasffa.txt')
except FileNotFoundError as e:
    print(f'Problem: {e}')

Problem: [Errno 2] No such file or directory: 'asdfafdafasffa.txt'
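Here is a sketch of how trapping the exception lets a program keep going. It tries to read two files, one that exists and one that doesn't, and stores either the number of characters read or `None` in a dict. The filenames and the `results` dict are my own inventions for this example.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
existing = os.path.join(folder, 'exists.txt')     # we'll create this one
missing = os.path.join(folder, 'missing.txt')     # we won't create this one

with open(existing, 'w') as f:
    f.write('hello\n')

results = {}

for filename in [existing, missing]:
    try:
        with open(filename) as f:
            results[filename] = len(f.read())     # number of characters in the file
    except FileNotFoundError as e:
        print(f'Problem: {e}')
        results[filename] = None                  # remember that we couldn't read it

print(results[existing])    # 6
print(results[missing])     # None
```

Without the `try`, the missing file would stop the program with an error at that point.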


# Next week: Functions!