# Agenda: Week 3 (Dictionaries and files)

1. Recap data structures so far
2. Dictionaries ("dicts")
    - Storing
    - Retrieving
    - Keys and values
    - Modifying dictionaries
3. Accumulating in dicts
4. Accumulating the unknown
5. Looping over dicts
6. Files
    - File objects
    - Reading from files
    - Writing to files
    - Using the `with` statement when working with files
    
We'll use files from this zipfile: https://files.lerner.co.il/exercise-files.zip    

# Data structures so far

We've seen that different types of data can be stored in different types of structures. Each structure provides us with different functionality and speed trade-offs.

Integers are really small and very fast -- but very annoying if we're going to use them for text.

Strings are great for text - but very annoying if we want to store multiple things (that aren't characters).

Lists are great for collections -- but very annoying if I just want to hold onto the text of a book or article.

Generally speaking:
- Integers are for whole numbers, and floats are for numbers with a fractional part
- Strings are for text (of any length), or collections of characters
- Lists are the go-to ordered collection in Python.  Lists can contain anything at all, and can be any size we want. We can append to the end, and we can remove from anywhere.
- Tuples are Python's version of structs or records. They are immutable, but more importantly, we typically use tuples when we have a collection of different types.

Some examples:
- If I have a text document, I'll store that in a string.
- If I have a few words that our company wants to be sure are never used in a press release, then we could put those in a list, and search for each of them in outgoing correspondence.
- If I have information about an employee -- their name, age, address, and salary -- then I'll put those in a tuple, because it's a collection of information of different types.
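
The three examples above might look roughly like this in code. This is only a sketch: every name and value here is invented for illustration.

```python
# All names and values below are invented for illustration.

# A text document: one string
document = 'Our new product is the best thing since sliced bread.'

# Words that must never appear in a press release: a list
forbidden_words = ['best', 'guarantee', 'free']

for one_word in forbidden_words:
    if one_word in document:      # "in" searches for a substring
        print(f'Found forbidden word: {one_word}')

# Information about one employee: a tuple holding different types
employee = ('Dana', 42, '123 Main St.', 85000)   # name, age, address, salary
print(f'{employee[0]} is {employee[1]} years old')
```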

# Getting help in Jupyter

Every Python environment is a bit different, but in Jupyter, you have a few ways to get help, and to find out what your options are.

1. At any point, you can press `TAB` to complete what you're typing. If you press `TAB` halfway through a variable name, then Jupyter will try to complete it. If the completion is ambiguous, then it'll show you a menu of possibilities. If you press `TAB` just after a `.`, then it'll show you all of the methods that you can invoke on that type of object.
2. You can use the `help` function to find out more about something, as in `help(len)` or `help(str.upper)`. Notice that if I'm getting help on a function, I don't invoke it with `()`, but just pass its name as an argument to `help`.
3. You can put any variable, function, or other name in Jupyter, and put a `?` after it, to get information about it. If you use `??`, you'll sometimes get more information, such as the definition of a function.

In [2]:
variable_x = 100
variable_y = [10, 20, 30]
variable_z = 'Hello, out there!'

In [3]:
help(variable_y)

Help on list object:

class list(object)
 |  list(iterable=(), /)
 |  
 |  Built-in mutable sequence.
 |  
 |  If no argument is given, the constructor creates a new empty list.
 |  The argument must be an iterable if specified.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate sign

In [5]:
variable_y??

# Dictionaries (aka "dicts")

When we store data in a list, we know several things:

1. We can store any type of data that we want.
2. Each new element in the list is put at a new index, 1 higher than the previous final element's index. Indexes start at 0.
3. We can update the values in a list by assigning to the list at a particular index.

There are some problems with this, though:

1. We have to use integers to retrieve our values.
2. Those integers start at 0, and are rather inflexible.
3. If we want to search for a value in our list, we need to (potentially) go through all of the values until we'll either find it or see that it's not there.

Imagine we're running a new streaming service. Someone wants to know whether we have a particular movie in stock. Can you imagine looking through 1m films in a list, one at a time, to see if we have it in stock? That would take forever!

Dictionaries provide a wonderful alternative:
- They are more flexible with their indexes (known as "keys")
- They are far faster to search through than lists
- They also provide us with more semantic power than lists do

A dict is also known by many other names in other languages:
- key-value store
- name-value store
- hash table
- hash map
- hash
- map
- associative array

You can think of a dict as a two-column table, in which the left column contains keys and the right column contains values.

A list can only have integer keys, and they start with 0, and go up by 1.

A dict, by contrast, can have keys of *any immutable type* (in practice, this almost always means integers and strings, although floats and tuples of immutable values work too), and values of any type at all. We, the users, can say what the dict keys are. We aren't restricted to 0, 1, 2, etc. We can use:

- ID numbers
- names
- usernames
- IP addresses

In [6]:
# to define a dict, we use {}
# each key-value pair is defined with the key, then :, then the value
# pairs are separated by ,

# in a dict, every key must have a value, and every value must have a key
# also: keys are unique! It's impossible for a dict to have the same key twice

d = {'a':10, 'b':20, 'c':30}   # here, I define a dict and assign it to d

In [7]:
type(d)

dict

In [8]:
# how big is this dict?

len(d)  # len usually gives us the number of elements in a data structure. Here, it gives us the number of *pairs*.

3

In [9]:
# I can search the keys for something, and find out if it's there, with "in"
'a' in d

True

In [10]:
30 in d   # we only look at keys, so this is False

False

In [11]:
# I can retrieve from the dict using []

d['a']   

10

In [12]:
d['b']

20

In [13]:
d['c']

30

In [15]:
d['x']   # this key does not exist, thus there is no value for it

KeyError: 'x'
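
Keys don't have to be single letters. Here is a small sketch of dicts keyed by ID numbers, usernames, and IP addresses. The variable names and data are made up for illustration.

```python
# Hypothetical data, just to show different kinds of keys.

# integer keys: ID numbers -> names
employees = {1001: 'Dana', 2047: 'Lee'}

# string keys: usernames -> full names
users = {'dana': 'Dana Smith', 'lee': 'Lee Jones'}

# string keys: IP addresses -> number of requests
requests = {'192.168.1.1': 5, '10.0.0.7': 12}

print(employees[2047])         # retrieve with an integer key
print(users['dana'])           # retrieve with a string key
print(requests['10.0.0.7'])

print(2047 in employees)       # "in" checks the keys, whatever their type
print('Lee' in employees)      # 'Lee' is a value, not a key
```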

# First paradigm for dict use: A read-only database

Many times, we'll create a dict and never modify it. Then, inside of the program, we can read from that dict and use it as a database, but we won't change it.

Example: Months -> numbers. Or numbers -> months.

In [17]:
months = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4}

In [18]:
months['Jan']  # we use [] to retrieve from dicts and strings, not just lists

1

In [19]:
months['Dec']

KeyError: 'Dec'
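
The reverse direction (numbers -> months) works the same way, with integer keys and string values. In this sketch the name `month_names` is invented, and the `in` check is one way to avoid the `KeyError` shown above.

```python
# The reverse of the "months" dict: integer keys, string values
month_names = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr'}

print(month_names[1])      # retrieve with an integer key

# checking with "in" before retrieving lets us avoid a KeyError
month_number = 12
if month_number in month_names:
    print(month_names[month_number])
else:
    print(f'No month is defined for {month_number}')
```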

# Exercise: Restaurant

0. Define `total` to be 0.
1. Define a dict called `menu` whose keys are the entree names and whose values are their prices.
2. Ask the user, repeatedly, to enter an order:
    - If the order is empty (empty string), then stop asking and exit the program
    - If the order is a key in the dict, then add the price to `total`, and print the item, new total, and price.
    - If the order is *not* in the dict, then give the user a scolding
    
Example:

    Order: sandwich
    sandwich is 12, total is 12
    Order: tea
    tea is 10, total is 22
    Order: elephant
    we are fresh out of elephant today!

In [20]:
total = 0


menu = {'sandwich':12, 'tea':10, 'apple':1, 'cake':5}   # 4 key-value pairs

In [21]:
len(menu)

4

In [22]:
menu['sandwich']

12

In [23]:
order = 'sandwich'   # assign to a variable
menu[order]

12

In [25]:
total = 0

menu = {'sandwich':12, 'tea':10, 'apple':1, 'cake':5}   # 4 key-value pairs

while True:   # infinite loop -- it'll last until we encounter "break"

    order = input('Order: ').strip()

    if order == '':   # this is how we can get out of the infinite loop!
        break
        
    # is this order on the menu?
    if order in menu:
        price = menu[order]
        total += price
        print(f'{order} costs {price}; total is now {total}')
    else:
        print(f'We are out of {order} today!')
        
print(f'Price is {total}')        

Order: sandwich
sandwich costs 12; total is now 12
Order: sandwich
sandwich costs 12; total is now 24
Order: tea
tea costs 10; total is now 34
Order: cake
cake costs 5; total is now 39
Order: apple
apple costs 1; total is now 40
Order: table
We are out of table today!
Order: 
Price is 40


In [26]:
# f-strings are special

x = 10

print('The value of x is ' + str(x))   # here, we're combining two strings into a new one

The value of x is 10


In [27]:
# I can also say:

print(f'The value of x is {x}')  # here, we're depending on {} returning a string, no matter the type of x

The value of x is 10



# Dicts are mutable

You might remember that Python data can be *mutable* or *immutable*. The question is whether we can change an existing data structure. This is *not* the same as whether we can assign a value to a variable!

We can *always* assign a new value to a variable. For example:

```python
s = 'abcd'      # assigning a string to s
s = s.upper()   # assigning a new string to s 
```

The above is *not* mutable data, though.  Strings are immutable.

By contrast, lists are mutable. We change a list with `append` and the like, without assigning or re-assigning it to a variable.

Dictionaries are also mutable. We can change them without re-assigning them to a variable.
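
As a quick sketch of the difference, here is a string (immutable) next to a list (mutable). The variable names are arbitrary.

```python
s = 'abcd'
# s[0] = 'X'    # this would raise a TypeError, because strings are immutable

mylist = [10, 20, 30]
mylist.append(40)      # the list itself changes; no assignment to mylist is needed
mylist[0] = 99         # we can also replace an element in place

print(s)
print(mylist)
```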

In [29]:
d = {'a':10, 'b':20, 'c':30}

# how can I modify an existing key-value pair?
# I can update the value by assigning to it

d['a'] = 2345    # the key 'a' already exists, and we change/update/modify the existing value

d

{'a': 2345, 'b': 20, 'c': 30}

In [30]:
# how can I add a new key-value pair to my dict?
# in exactly the same way -- we just assign to a key that doesn't (yet) exist

d['x'] = 987   # 'x' doesn't exist as a key, so we add a new key-value pair

d

{'a': 2345, 'b': 20, 'c': 30, 'x': 987}

In [31]:
# how can I remove a key-value pair?
# (by the way, I almost never remove key-value pairs from a dict)

d.pop('x')  # this removes the pair with 'x' as a key, and returns its value

987

In [32]:
d

{'a': 2345, 'b': 20, 'c': 30}

# Dicts are always key -> value

In a dict, keys are unique. There's no way for the same key to exist more than once. Values can repeat, if we want.

Thus, we can retrieve values from dicts with simple retrieval, using `d[k]`. We know that `k` either exists as a key, or not.

Can we retrieve keys based on values? The answer is: No! You always want to think about dicts as using the keys to get values, but not the other way around. 

In [33]:
d.keys()  # this returns a list-like object that contains all dict keys

dict_keys(['a', 'b', 'c'])

# Next up

1. Using dicts for accumulation
2. Using dicts to accumulate starting from nothing

# Dicts for accumulation (paradigm 2)

The idea of this dict paradigm is that we create a dictionary with keys and initialized values, typically 0. We won't add keys to the dict while the program runs, but we will increment the values over time, to reflect data we've found.

For example: Let's say that a list contains integers, and I want to know how many of them are odd, and how many are even.

In [34]:
numbers = [10, 15, 20, 25, 30]

# create a dict to keep track of the numbers
counts = {'evens': 0,
          'odds': 0}

for one_number in numbers:
    
    if one_number % 2 == 0:    # if there is no remainder after dividing by 2... it must be even
        counts['evens'] += 1   # add 1 to the current value of counts['evens']
    else:
        counts['odds'] += 1
        
print(counts)        

{'evens': 3, 'odds': 2}


# Exercise: Vowels, digits, and others (dict edition)

1. Create a dict called `counts`, with three keys: `vowels`, `digits`, and `others`. The values for all three keys should be 0.
2. Ask the user to enter a string.
3. Go through the string, one character at a time:
    - If the character is a vowel (a, e, i, o, u), then add 1 to `vowels`
    - If the character is a digit (0-9), then add 1 to `digits`
    - Otherwise, add 1 to `others`
4. Print the dict at the end of the program's run    

In [36]:
counts = {'vowels':0, 
          'digits':0,
          'others':0}

s = input('Enter a string: ').strip()

for one_character in s:            # go through s, one character at a time
    if one_character.isdigit():    # if one_character is the string '0' through '9'
        counts['digits'] += 1      # add 1 to counts['digits']
    elif one_character in 'aeiou': # if it's a vowel...
        counts['vowels'] += 1      # add 1 to counts['vowels']
    else:
        counts['others'] += 1
        
print(counts)        
    

Enter a string: hello 123 !!
{'vowels': 2, 'digits': 3, 'others': 7}


In [37]:
s = '     '  # 5 spaces
len(s)

5

# Accumulating the unknown (paradigm 3)

Sometimes, we don't know what we're going to want to accumulate, but we do want to count it. For example, let's say that I want to know how many times each letter appears in a book. I can use paradigm 2, set up a dict with a-z as keys, and then count how many times each appears.

But if I want to count words, paradigm 2 is a poor fit: I would have to go through the whole article or book once, creating a key-value pair for each unique word, and then iterate through it a second time to accumulate.

In such cases, we can create an empty dict at the start of the program and then populate it -- keys and values alike -- as we encounter them. 

In [38]:
# example: letter frequencies

counts = {}   # empty dict

s = input('Enter a string: ').strip()

for one_character in s:
    counts[one_character] += 1   # add 1 to the existing value of one_character, whatever was there before 
    

Enter a string: hello


KeyError: 'h'

In [39]:
# example: letter frequencies

counts = {}   # empty dict

s = input('Enter a string: ').strip()

for one_character in s:
    if one_character in counts:    # does one_character exist in counts?
        counts[one_character] += 1  
    else:
        counts[one_character] = 1  # first time, just assign 1 

Enter a string: hello out there, this is a great new day


In [40]:
counts

{'h': 3,
 'e': 5,
 'l': 2,
 'o': 2,
 ' ': 8,
 'u': 1,
 't': 4,
 'r': 2,
 ',': 1,
 'i': 2,
 's': 2,
 'a': 3,
 'g': 1,
 'n': 1,
 'w': 1,
 'd': 1,
 'y': 1}
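
The same pattern handles the word-counting case mentioned earlier. This is a minimal sketch, assuming that words are separated by whitespace. The sample text is invented.

```python
# Sample text invented for illustration
text = 'the cat saw the dog and the dog saw the cat'

counts = {}   # empty dict; keys will be words, values will be counts

for one_word in text.split():     # split on whitespace to get a list of words
    if one_word in counts:
        counts[one_word] += 1     # seen before, so add 1
    else:
        counts[one_word] = 1      # first time, just assign 1

print(counts)
```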

# Rainfall 

1. Define an empty dict. Eventually, the dict's keys will be cities, and the dict's values will be how much rain (in mm) fell there in the last 24 hours.
2. Ask the user repeatedly to enter the name of a city.
    - If they give us an empty string, then stop asking.
3. Ask the user how much rain fell there in the last 24 hours.
4. Check if the city was already mentioned in our dict:
    - If not, then add the new city and amount to the dict as a key-value pair
    - If so, then just add the new amount to the old amount
5. Print the entire dict.

```
Example:

    City: Chicago
    Rain: 5
    City: New York
    Rain: 3
    City: Seattle
    Rain: 4
    City: Seattle
    Rain: 2
    City: [ENTER]

{'Chicago': 5, 'New York': 3, 'Seattle': 6}
```



In [44]:
rainfall = {}    # empty dict, eventually with string keys (city names) and integer values (mm rain)

while True:  
    city_name = input('City: ').strip()
    
    if city_name == '':   # no city name? exit the loop!
        break
        
    mm_rain = input('Rain: ').strip()
    mm_rain = int(mm_rain)
    
    if city_name in rainfall:
        rainfall[city_name] += mm_rain  # add to existing value
    else:   
        rainfall[city_name] = mm_rain   # no existing value, so add one
    
print(rainfall)    
    

City: Chicago
Rain: 5
City: New York
Rain: 3
City: Seattle
Rain: 4
City: Seattle
Rain: 2
City: 
{'Chicago': 5, 'New York': 3, 'Seattle': 6}


# Next up

1. Looping over dicts
2. How do dicts work?
3. Intro to files (please remember to download the zipfile from https://files.lerner.co.il/exercise-files.zip)


# Looping over dicts

We've seen that we can use `for` loops to iterate over a bunch of different data structures:

- Strings -- we get one character at a time
- Lists -- we get one element at a time
- Tuples -- we get one element at a time

What happens when we iterate over a dictionary?

In [45]:
d = {'a':10, 'b':20, 'c':30}

for one_item in d:    # when we iterate over a dict, we get the keys (in chronological order of insertion)
    print(one_item)

a
b
c


In [46]:
# one way to get keys and values is this:

for one_key in d:
    print(f'{one_key}: {d[one_key]}')   # key + the value

a: 10
b: 20
c: 30


In [47]:
# I prefer a slightly different way, though

for t in d.items():  # d.items() gives us one (key, value) tuple for each pair in d
    print(t)

('a', 10)
('b', 20)
('c', 30)


In [48]:
# we can use tuple unpacking to make this more readable
# we know that the tuple we get with each iteration will be (key, value) -- two elements

for key, value in d.items(): 
    print(f'{key}: {value}')

a: 10
b: 20
c: 30


In [49]:
# what about iterating over d.keys()?

for one_key in d.keys():  # this works just fine... but please don't do it!
    print(one_key)

a
b
c


In [50]:
# similarly, we can search in a dict with "in" (remember, it only looks at the keys)

'b' in d

True

In [51]:
'c' in d

True

In [52]:
# can we search in d.keys()?  -- but why do this, when you can say "'b' in d"?

'b' in d.keys()

True

In [53]:
# there is, however, d.values()

d.values()

dict_values([10, 20, 30])

In [54]:
d = {'a':10, 'b':20, 'c':30, 'd':30, 'e':20, 'f':10}
d

{'a': 10, 'b': 20, 'c': 30, 'd': 30, 'e': 20, 'f': 10}

In [55]:
d.values()

dict_values([10, 20, 30, 30, 20, 10])
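
Because values can repeat, there's no shortcut for going from a value back to its key(s). If you really need to, you have to check every pair, just as you would in a list. A sketch, where `looking_for` and `matching_keys` are names I made up:

```python
d = {'a': 10, 'b': 20, 'c': 30, 'd': 30, 'e': 20, 'f': 10}

looking_for = 30
matching_keys = []

for key, value in d.items():      # we must look at every pair
    if value == looking_for:
        matching_keys.append(key)

print(matching_keys)
```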

In [None]:
# can a key have more than one value?

# no, because keys are unique (only one instance per dict) and they only get one value.
# BUT that value could be a list, tuple, or dict containing other values.
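
For instance, here is a sketch of a dict whose values are lists. The names and phone numbers are invented.

```python
# Hypothetical data: each person (key) has a list of phone numbers (value)
phones = {'Dana': ['555-1234', '555-9876'],
          'Lee': ['555-0000']}

phones['Lee'].append('555-1111')   # the value is a list, so we can append to it

print(phones['Lee'])
print(len(phones))                 # still 2 pairs; only a value changed
print(len(phones['Dana']))         # 2 phone numbers for Dana
```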

# Exercise: Age statistics

1. Define a dict in which the keys are names of people in your family, and the values are their ages.
2. Print the average (mean) age of all these people.

In [57]:
people = {'Reuven': 52,
          'Atara': 21,
          'Shikma': 19,
          'Amotz': 17}

# a few different possibilities!

# option 1: grab the values in a for loop, put them into a list, and then calculate

ages = []
for one_item in people.values():
    ages.append(one_item)
    
print(sum(ages) / len(ages))    

27.25


In [58]:
# option 1a (variation): calculate while we're iterating

counter = 0
total = 0

for one_item in people.values():
    counter += 1
    total += one_item
    
print(total / counter)

27.25


In [65]:
# option 2: pass the ages (values) straight to sum, and divide by the number of pairs

sum(people.values()) / len(people)

27.25

# How do dictionaries work?

Dictionaries have some weird (and some useful) properties:

1. Keys are unique and immutable
2. Via a key, we can get a value, but not vice versa
3. Values can be anything at all
4. Searching for a key is very, very fast

This is all possible thanks to what's known as a "hash function." 

As an analogy, think about finding a person in a large office building.

Here's how a dictionary would do it: Check the first letter of the person's last name, and turn that into a number (A=1, B=2, C=3, until Z=26). That's the number of the office in which the person will be.

When we store `d['a'] = 1`, Python takes the key (`'a'`), and runs a function on it. That function returns a number, which is used to decide where in memory the key-value pair should be stored.

So what does this way of storing and retrieving have to do with the rules I wrote earlier in this cell?

1. Keys must be unique, because otherwise they would sit on top of each other in memory, thanks to the hash function.
2. Keys must be immutable, because if they could change, then the hash function would calculate something different than what they actually are, and the data would get lost.
3. Searching is very fast, because of the office metaphor -- you just jump to the appropriate office/location, and retrieve the data.
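
Here is a toy sketch of the office analogy, next to Python's builtin `hash` function. The letter-to-number calculation is only the analogy from above, *not* how Python really hashes strings. The claim about integer hashes assumes the standard CPython interpreter.

```python
# The office analogy: first letter of the last name -> office number
for last_name in ['Adams', 'Lerner', 'Zhang']:
    office = ord(last_name[0].upper()) - ord('A') + 1    # A=1, B=2, ... Z=26
    print(f'{last_name}: office {office}')

# Python's real version is the builtin "hash" function.
# In CPython, a small integer hashes to itself.
# String hashes are large numbers that change each time you restart Python.
print(hash(12345))
print(hash('a'))

# Mutable objects, such as lists, refuse to be hashed, so they can't be keys.
# Either of these lines would raise a TypeError:
#     hash([10, 20, 30])
#     d = {[10, 20, 30]: 'abc'}

# An immutable tuple works fine as a key
d = {(10, 20): 'abc'}
print(d[(10, 20)])
```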

# Files

We, as computer users, normally think of files as Word documents, Excel spreadsheets, or PDF documents. In the computer world, a file is just any collection of data stored in a way that we can then retrieve it later on and continue with our work as if we had just entered that data. We're going to concentrate on text files in this class, in part because they're the easiest to deal with, and in part because they're quite common -- logfiles and the like are all text files.

If I want to read a file from the filesystem, I can't! I can't do it directly, without help. I need the assistance of the operating system.  Your program asks the OS to provide an "agent," often known as a "file handle," which will allow us access to the file.  We then work with this file handle, reading from the file via the handle, and writing to the file via the handle.

In Python, these agents and file handles are typically just known as "file objects" or "file-like objects."

In [66]:
# let's look through the file called "mini-access-log.txt"
# I'll need to get a file object (file handle), and then use it to read from the file

f = open('mini-access-log.txt')    # the "open" function returns a new file object

In [67]:
# there are several ways to read from a file
# the best (in many cases) is to iterate over it, line by line
# how? with a for loop!

for one_line in f:    # each iteration gives me one line, up to and including the next \n 
    print(one_line)

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"

66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.65.12 - -

In [68]:
# shorter way to do it 

for one_line in open('mini-access-log.txt'):  # the file mini-access-log.txt is in the same directory as Jupyter
    print(one_line)

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"

66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.65.12 - -
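
The output above is double-spaced because each line we get from the file ends with `'\n'`, and `print` then adds a newline of its own. Here is a sketch of two ways to avoid that. So that it runs without the zipfile, `io.StringIO` (a file-like object backed by a string) stands in for `open`, and the sample lines are invented.

```python
import io

# Invented sample lines; io.StringIO is a file-like object standing in for open(...)
sample = 'first line\nsecond line\nthird line\n'

# Option 1: strip the trailing newline (and any other trailing whitespace) before printing
for one_line in io.StringIO(sample):
    print(one_line.rstrip())

# Option 2: keep the line as it is, and tell print not to add its own newline
for one_line in io.StringIO(sample):
    print(one_line, end='')
```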

# Exercise: Word count

1. There is a Unix utility called `wc` which counts lines, words, and characters in a file. (You can run it, if you're on Unix, as `wc FILENAME`.)
2. I want you to write a short Python program that implements the same thing as `wc`. That is:
    - It prints the number of lines in a file (including blank lines)
    - It prints the number of words in a file 
    - It prints the number of characters in a file
3. Ask the user what file they want to open, open it, iterate over it with a `for` loop, and accumulate the counts for `lines`, `characters`, and `words`.
4. Finally, print their values.

There is a file in the zipfile I gave you called `wcfile.txt`. It's probably a good idea to use it for this exercise.

In [69]:
!cat wcfile.txt

This is a test file.

It contains 28 words and 20 different words.

It also contains 165 characters.

It also contains 11 lines.

It is also self-referential.

Wow!


In [72]:
# (1) I'm going to assume that wcfile.txt is in the directory from which Jupyter was run
# (2) I'm going to iterate over the file, one line at a time using "for"

number_of_lines = 0
number_of_characters = 0
number_of_words = 0

for one_line in open('wcfile.txt'): 
    number_of_lines += 1   # each iteration means 1 more line
    
    number_of_characters += len(one_line)   # how many characters in the current line? add to number_of_characters
    
    words_on_line = one_line.split()
    number_of_words += len(words_on_line)   # how many words on this line? Add to the running total
    
print(f'lines: {number_of_lines}')    
print(f'characters: {number_of_characters}')    
print(f'words: {number_of_words}')    

lines: 11
characters: 165
words: 28


In [73]:
# let's rewrite this code, so that we don't accumulate our information into
# three variables, but rather into one.

# I can use a dictionary to keep track of all of them!

counts = {'lines':0,
          'characters':0,
          'words':0}

for one_line in open('wcfile.txt'): 
    counts['lines'] += 1   # each iteration means 1 more line
    
    counts['characters'] += len(one_line)   # how many characters in the current line? add to counts['characters']
    
    words_on_line = one_line.split()
    counts['words'] += len(words_on_line)   # how many words on this line? Add to the running total
    
for key, value in counts.items():
    print(f'{key}: {value}')

lines: 11
characters: 165
words: 28


# Next up

1. (More on) reading from files
2. Using data structures + reading from files
3. Writing to files and `with`

In [77]:
# what if the file contains things we don't care about, or want?
# we can always use an "if" statement to decide if the current line is interesting
# if so, we go ahead and do our thing... if not, then we go onto the next line.

# for example, I have a file here, linux-etc-passwd.txt, which is based on the standard
# Linux (Unix) "passwd" file where usernames and info are stored.

# Each line of the file should contain the username, user ID, group ID, etc., separated by : characters

# We need to ignore comment lines and blank lines

for one_line in open('linux-etc-passwd.txt'):
    if one_line.startswith('#'):
        continue  # ignore comment lines
        
    if one_line.strip() == '':
        continue   # ignore blank lines
    
    print(one_line.split(':')[0])  # turn each line into a list of fields, and get index 0 -- the username

root
daemon
bin
sys
sync
games
man
lp
mail
news
uucp
proxy
www-data
backup
list
irc
gnats
nobody
syslog
messagebus
landscape
jci
sshd
user
reuven
postfix
colord
postgres
dovecot
dovenull
postgrey
debian-spamd
memcache
genadi
shira
atara
shikma
amotz
mysql
clamav
amavis
opendkim
gitlab-redis
gitlab-psql
git
opendmarc
dkim-milter-python
deploy
redis


# Exercise: Summing numbers

1. Define `total` to be 0.
2. Go through the file `nums.txt`.  Each line of this file will either:
    - Be blank (or contain only whitespace)
    - Contain one integer, with possible whitespace before or after it
3. If the line contains an integer, then call `int` on it and add the result to `total`.
4. If the line contains only whitespace, which you can find by running `strip` on it and getting an empty string back, then ignore it.
5. Print `total`, which should be 83.

In [78]:
s = '     abcd      '

s.strip()

'abcd'

In [79]:
s.upper()

'     ABCD      '

In [81]:
s.strip().upper()   # method chaining -- each method runs on the string returned by the one before it

'ABCD'

In [83]:
total = 0

for one_line in open('nums.txt'):
    s = one_line.strip()   # remove leading/trailing whitespace from one_line
    
    if s.isdigit():
        total += int(one_line)
    
print(total)    

83


In [84]:
total = 0

for one_line in open('nums.txt'):

    if one_line.strip().isdigit():  # remove leading/trailing whitespace, and check if it contains only digits
        total += int(one_line)
    
print(total)    

83


In [85]:
# there's a big difference between the empty string, '', and a space character, ' '

len('')   # how big is the empty string

0

In [86]:
len(' ')  # how big is the space character?

1

In [None]:
# if you compare one_line.strip() with the empty string, that'll check to see if
# nothing at all survived the stripping

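# if you compare it with the space character, ' ', the comparison will always be False,
# because strip() has already removed any leading and trailing spaces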

In [87]:
# it turns out that int is very forgiving if you have whitespace before or after the numbers
int('5')

5

In [88]:
int('     5      ')

5

In [89]:
int('\n\n\n\r\r\r\t\t\t   5  \t\t\t')

5

In [90]:
# but....
int('')

ValueError: invalid literal for int() with base 10: ''
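
Since `int` copes with surrounding whitespace but not with the empty string, another way to write the solution is to skip blank lines explicitly, as step 4 of the exercise suggests. This is a sketch with invented sample data, in which `io.StringIO` stands in for `open('nums.txt')`.

```python
import io

# Invented sample data; io.StringIO stands in for open('nums.txt')
sample = '5\n\n   10   \n\t\n20\n   \n'

total = 0

for one_line in io.StringIO(sample):
    s = one_line.strip()    # remove leading/trailing whitespace

    if s == '':             # nothing survived the stripping, so the line was blank
        continue

    total += int(s)

print(total)
```

One trade-off: `'-5'.isdigit()` returns `False`, so the `isdigit` version silently skips negative numbers. This version accepts them, but raises a `ValueError` if a line contains something that isn't an integer at all.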

# Writing to files

So far, we've only read from files. If you call `open` with only a filename as an argument, it'll open that file for reading.

If you want to open the file for writing, you need to pass a second argument, `'w'`, which tells `open` to write to the file, *not* to read from it. (It is possible to do both, but it's very messy, and I urge you not to do it.)

If you open a file for writing, with the `'w'` argument, one of two things will then happen:
- The file will be open for writing, and will contain 0 bytes. Any previous content is obliterated.
- You get an error message, saying you cannot open the file.
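
Here is a small sketch of the first case. It works in a temporary directory (via the standard `tempfile` module) so that no real file gets clobbered, and the filename `demo.txt` is arbitrary. It calls `close` on each file object, which is discussed a bit further down.

```python
import os
import tempfile

# Work in a temporary directory, so that we don't clobber any real files
dirname = tempfile.mkdtemp()
filename = os.path.join(dirname, 'demo.txt')

f = open(filename, 'w')
f.write('first version\n')
f.close()
print(os.path.getsize(filename))   # size in bytes, after writing one line

f = open(filename, 'w')            # opening with 'w' again empties the file right away
f.close()
print(os.path.getsize(filename))   # the previous content is gone
```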



In [91]:
f = open('myfile.txt', 'w')

In [92]:
f

<_io.TextIOWrapper name='myfile.txt' mode='w' encoding='UTF-8'>

In [93]:
f.write('hello!\n')    # I use the "write" method, and have to ensure my string ends with \n

7

In [94]:
f.write('hello again!\n')

13

In [95]:
# let's take a look at our file!

!cat myfile.txt

In [96]:
# it turns out that output to files is buffered -- meaning, only when there's enough 
# data will it be written to disk.  

# this means that so far, the file wasn't written!

# we can get around this by closing the file, or if we want to keep it open, flushing the buffer

f.flush()

In [97]:
!cat myfile.txt

hello!
hello again!


In [98]:
f.close()   # not going to write to this file any more

The `with` keyword in Python is useful in a number of ways, but the most common one is to open a file (especially for writing), such that after the `with` block, the file is automatically flushed and closed.



In [101]:
# open takes two arguments:
# (1) filename
# (2) mode -- 'r' (reading), 'w' (writing), 'a' (appending), etc.

# if you want to write to a file, you'll likely use 'w'.  We open the file in the context
# of the "with" statement, and that ensures the file is flushed and closed by the end of the block.

with open('myfile.txt', 'w') as f:  
    # inside of this block, we can write to f
    
    f.write('hello 12345\n')
    
    f.write('!!!')

    # as the block is closing, the file will be flushed and closed -- guaranteeing the buffer is empty

In [100]:
!cat myfile.txt

hello 12345
!!!
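
The `'a'` (append) mode mentioned above works the same way with `with`, and `with` is just as usable for reading. A sketch, using a throwaway file in a temporary directory (the name `demo.txt` is arbitrary):

```python
import os
import tempfile

filename = os.path.join(tempfile.mkdtemp(), 'demo.txt')   # throwaway file

with open(filename, 'w') as f:      # 'w' creates (or empties) the file
    f.write('first line\n')

with open(filename, 'a') as f:      # 'a' keeps the existing content and adds to the end
    f.write('second line\n')

with open(filename) as f:           # no mode given, so the file is opened for reading
    for one_line in f:
        print(one_line, end='')
```

Because each `with` block closes its file, the data written in one block is safely on disk before the next block opens the file.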