Introduction to dictionaries and reading files
====================================================

Dictionaries are data structures consisting of a collection of *key-value pairs*. Some languages refer to dictionaries as *associative arrays*. The basic idea is that you reference items by their key rather than by an index location (as you'd do in a numpy ndarray or a list). In that sense they are like "lookup tables". Key-value pairs are separated by commas, each key is separated from its value by a colon, and the whole collection is surrounded by curly braces (squiggly brackets).

In [1]:
# Create a dictionary with three items in it. Note
# the curly braces used when **constructing** a dictionary.

params = {"start" : 1,
          "stop" : 10,
          "step" : 1}

print(type(params))
print(params)

<class 'dict'>
{'start': 1, 'stop': 10, 'step': 1}


To reference a specific value, just use its key. Notice the square brackets when **referencing** dictionary elements.

In [2]:
params["stop"]

10

Modifying existing dictionary elements is easy.

In [3]:
# Look, we can even change data types on the fly and mix different data
# types together in the same dictionary. That's what dictionaries are great for.

params["start"] = "A"
params["stop"] = "B"

# add a new entry
params["note"] = "Dictionaries can store values having differing types"

print(params)

{'start': 'A', 'stop': 'B', 'step': 1, 'note': 'Dictionaries can store values having differing types'}


Dictionaries have numerous properties and methods. Let's explore a few.

In [14]:
# Recreate our params dict
params = {"start" : 1,
          "stop" : 10,
          "step" : 2}

To check if a key exists, we can use the `in` keyword and the collection of dictionary keys.

In [4]:
# Does a certain key exist

param = 'step'

if param in params.keys():
    print(f'Yes, {param} is a valid key')
else:
    print(f'No, {param} is NOT a valid key')

Yes, step is a valid key
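
As a small aside, the `in` keyword can also be applied to the dictionary itself, since membership tests on a dictionary look at its keys. The sketch below uses a stand-in `params` dictionary like the one above; the names `has_step`, `has_size`, and `has_ten` are made up for illustration.

```python
# Stand-in dictionary mirroring the params example above.
params = {"start": 1, "stop": 10, "step": 1}

# `in` tests membership among the keys, so calling .keys() is optional.
has_step = "step" in params
has_size = "size" in params

# Values are not searched: 10 is a value here, not a key.
has_ten = 10 in params

print(has_step, has_size, has_ten)  # True False False
```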


In [5]:
params.keys()

dict_keys(['start', 'stop', 'step', 'note'])

In [6]:
# iterate over the keys
for k in params.keys():
    print(k)

start
stop
step
note


**IMPORTANT:** Before Python 3.7, the language did **not** guarantee any particular key order. As of Python 3.7, dictionaries preserve insertion order: iterating over a dictionary yields the keys in the order in which they were first inserted.
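
Here is a minimal sketch of what insertion ordering means in practice, assuming Python 3.7 or later. The `counts` dictionary and the month names are made up for illustration.

```python
# Keys come back in the order they were first inserted (Python 3.7+).
counts = {}
for month in ["Oct", "Jan", "Mar"]:
    counts[month] = 0

print(list(counts))  # ['Oct', 'Jan', 'Mar']

# Updating an existing key does not move it...
counts["Oct"] = 5
# ...while a brand new key goes to the end.
counts["Feb"] = 0

print(list(counts))  # ['Oct', 'Jan', 'Mar', 'Feb']
```

Note that insertion order is not the same as sorted order. If you need the keys sorted, `sorted(counts)` returns them as a sorted list.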

What if we want to list the key-value pairs in a dictionary? The `dict.items()` method does this. Technically it returns something known as a *view*. However, it's iterable and behaves much like a list of tuples. Let's see this.

In [9]:
for key, val in params.items():
    print(f'key={key}, value={val}')

key=start, value=A
key=stop, value=B
key=step, value=1
key=note, value=Dictionaries can store values having differing types
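
To make the "view" idea a bit more concrete, here is a short sketch using a stand-in `params` dictionary. It shows that a view can be converted to an actual list of tuples, and that the view itself reflects later changes to the dictionary. The variable names `pairs` and `as_list` are made up for illustration.

```python
# Stand-in dictionary for illustration.
params = {"start": 1, "stop": 10, "step": 1}

# items() returns a view object, not a list.
pairs = params.items()

# Converting the view gives an ordinary list of (key, value) tuples.
as_list = list(pairs)
print(as_list)  # [('start', 1), ('stop', 10), ('step', 1)]

# The view is "live": it sees keys added after it was created,
# while the list we made earlier is a snapshot.
params["note"] = "added later"
print(len(pairs))    # 4
print(len(as_list))  # 3
```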


Example: Using dictionaries as a container for counts
-----------------------------------------------------

In one of the examples we'll explore, we are going to use dictionaries to hold counts of website hits per month based on an Apache log file. The records look like:

    local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150
    local - - [24/Oct/1994:13:41:41 -0600] "GET 1.gif HTTP/1.0" 200 1210
    local - - [24/Oct/1994:13:43:13 -0600] "GET index.html HTTP/1.0" 200 3185
    local - - [24/Oct/1994:13:43:14 -0600] "GET 2.gif HTTP/1.0" 200 2555
    local - - [24/Oct/1994:13:43:15 -0600] "GET 3.gif HTTP/1.0" 200 36403
    local - - [24/Oct/1994:13:43:17 -0600] "GET 4.gif HTTP/1.0" 200 441
    
Each month is represented by a three character abbreviation. Let's say that our basic strategy is to:

* create an empty dictionary called `monthly_counts`
* read a line and get the month into a variable, for example `month = 'Oct'`
* increment the count for that month via `monthly_counts[month] = monthly_counts[month] + 1`

In [19]:
# Create an empty dictionary
monthly_counts = {}

Now, let's assume that the variable `month` has the value 'Oct'. What happens if we try to increment the dictionary value for that key?

In [20]:
month = 'Oct'
monthly_counts[month] = monthly_counts[month] + 1

KeyError: 'Oct'

Ah, so if we haven't added a key yet, we can't assume it starts out with a value of 0 (or anything else, for that matter). Of course, we could simply add a bunch of initialization lines such as `monthly_counts['Jan'] = 0`, `monthly_counts['Feb'] = 0`, and so on. However, there's another way of accessing a dictionary value using its `get` method. The beauty of the `get` method is that it has an optional second parameter in which you can specify the return value if the key doesn't exist.



In [21]:
monthly_counts.get('Oct', 0)

0
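
One detail worth seeing for yourself: `get` only *returns* the default value, it does not store it in the dictionary. The short sketch below illustrates this; the variable names `value` and `missing` are made up for illustration.

```python
monthly_counts = {}

# get() returns the default because 'Oct' is not a key yet...
value = monthly_counts.get("Oct", 0)
print(value)  # 0

# ...but the dictionary itself is still empty.
print(monthly_counts)  # {}

# Without a second argument, get() returns None for a missing key
# instead of raising a KeyError.
missing = monthly_counts.get("Oct")
print(missing)  # None
```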

In [22]:
month = 'Oct'
monthly_counts[month] = monthly_counts.get(month,0) + 1
print(monthly_counts['Oct'])

1
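
Putting the pieces together, the counting strategy might look like the sketch below. The `months` list is made-up stand-in data; in the real program each month would come from a line of the log file.

```python
# Made-up sequence of month abbreviations, standing in for values
# that would be pulled out of each log line.
months = ["Oct", "Oct", "Nov", "Oct", "Nov", "Dec"]

monthly_counts = {}

for month in months:
    # Start from 0 the first time a month is seen, then add 1.
    monthly_counts[month] = monthly_counts.get(month, 0) + 1

print(monthly_counts)  # {'Oct': 3, 'Nov': 2, 'Dec': 1}
```

Because `get` supplies the starting value, no month has to be initialized ahead of time, and months that never appear in the data simply never become keys.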


Reading Files
-------------

So far we've just loaded data into numpy arrays using `loadtxt`. Soon we'll use the `csv` package for doing a similar thing. When we learn Pandas we'll see functions like `read_csv`. However, often you need to read a text file line by line and do some data scraping, parsing, manipulating, transforming, ..., whatever. Here are a few of the basic ideas.

### Example 1: open file, read a line, strip, print, repeat until no more lines, close file

In [1]:
# Store input filename in variable. Include necessary path info.
in_filename = "data/apache-mini.log"

# Open the input file for reading. in_file is a "file object" or "file handle".
in_file = open(in_filename, 'r')

# Init counter to keep track of line numbers (not necessary but sometimes useful)
line_number = 0

# Loop through each line in the file. Check out the nice looping syntax for traversing a file.
for line in in_file:
    # The variable line contains the current line as one big string and includes things like
    # end of line characters. Also, to be clear, 'line' is a variable name we made up. We could have
    # called it 'peanutbutter' had we chosen to.
    
    # Let's strip off any end of line characters
    # After running this cell, let's comment out this line to see what happens.
    line = line.rstrip()
    
    # Increment the line counter
    line_number += 1
    
    # Print the line and line number. What do you think the ':6' is for? Hint: There are < 1 million rows.
    print(f'{line_number:6}: {line}')
    
# After the loop is done, close the file
in_file.close()

     1: local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150
     2: local - - [24/Oct/1994:13:41:41 -0600] "GET 1.gif HTTP/1.0" 200 1210
     3: local - - [24/Oct/1994:13:43:13 -0600] "GET index.html HTTP/1.0" 200 3185
     4: local - - [24/Oct/1994:13:43:14 -0600] "GET 2.gif HTTP/1.0" 200 2555
     5: local - - [24/Oct/1994:13:43:15 -0600] "GET 3.gif HTTP/1.0" 200 36403
     6: local - - [24/Oct/1994:13:43:17 -0600] "GET 4.gif HTTP/1.0" 200 441
     7: local - - [24/Oct/1994:13:46:45 -0600] "GET index.html HTTP/1.0" 200 3185
     8: local - - [24/Oct/1994:13:46:45 -0600] "GET 2.gif HTTP/1.0" 200 2555
     9: local - - [24/Oct/1994:13:46:47 -0600] "GET 3.gif HTTP/1.0" 200 36403
    10: local - - [24/Oct/1994:13:46:50 -0600] "GET 4.gif HTTP/1.0" 200 441
    11: local - - [24/Oct/1994:13:47:19 -0600] "GET index.html HTTP/1.0" 200 150
    12: local - - [24/Oct/1994:13:47:19 -0600] "GET 1.gif HTTP/1.0" 200 1210
    13: local - - [24/Oct/1994:13:47:41 -0600] "GET index.

In [2]:
# What is in_file?

type(in_file)

_io.TextIOWrapper
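
Two comments in Example 1 invite a little experimentation: what `rstrip()` removes, and what the `:6` in the f-string does. The sketch below explores both using a single hard-coded string in place of a line read from the file. The variable names `raw_line`, `stripped`, and `label` are made up for illustration.

```python
# A line as it would arrive from the file, including the trailing newline.
raw_line = 'local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150\n'

# repr() makes the otherwise invisible end of line character visible.
print(repr(raw_line[-5:]))  # ' 150\n'

# rstrip() removes trailing whitespace, which includes the newline.
# Without it, print() would add its own newline after the one already
# in the string, and the output would be double spaced.
stripped = raw_line.rstrip()
print(repr(stripped[-5:]))  # '0 150'

# ':6' is a minimum field width: the number is right-aligned in a
# field at least 6 characters wide, so line numbers line up in a column.
label = f'{7:6}'
print(repr(label))  # '     7'
```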

### Example 2: an alternate way of opening and closing

Now let's see a more "Pythonic" way of working with files.

In [23]:
# Store input filename in variable. Include necessary path info.
in_filename = "data/apache-mini.log"

# Init counter to keep track of line numbers (not necessary but sometimes useful)
line_number = 0

# Open the input file for reading using a `with` block
with open(in_filename, 'r') as in_file:
    # Loop through each line in the file. Check out the nice looping syntax for traversing a file.
    for line in in_file:
        # The variable line contains the current line as one big string and includes things like
        # end of line characters. 
        
        # Let's strip off any end of line characters
        line = line.rstrip()
        
        # Increment the line counter
        line_number += 1
        
        # Print the line and line number
        print(f'{line_number:6}: {line}')
                
# After the loop is done, there is no need to close the file. It's already been
# closed for you. :) To see that:

if in_file.closed:
    print("\nFile already closed.")
else:
    print("\nFile NOT closed yet")


     1: local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150
     2: local - - [24/Oct/1994:13:41:41 -0600] "GET 1.gif HTTP/1.0" 200 1210
     3: local - - [24/Oct/1994:13:43:13 -0600] "GET index.html HTTP/1.0" 200 3185
     4: local - - [24/Oct/1994:13:43:14 -0600] "GET 2.gif HTTP/1.0" 200 2555
     5: local - - [24/Oct/1994:13:43:15 -0600] "GET 3.gif HTTP/1.0" 200 36403
     6: local - - [24/Oct/1994:13:43:17 -0600] "GET 4.gif HTTP/1.0" 200 441
     7: local - - [24/Oct/1994:13:46:45 -0600] "GET index.html HTTP/1.0" 200 3185
     8: local - - [24/Oct/1994:13:46:45 -0600] "GET 2.gif HTTP/1.0" 200 2555
     9: local - - [24/Oct/1994:13:46:47 -0600] "GET 3.gif HTTP/1.0" 200 36403
    10: local - - [24/Oct/1994:13:46:50 -0600] "GET 4.gif HTTP/1.0" 200 441
    11: local - - [24/Oct/1994:13:47:19 -0600] "GET index.html HTTP/1.0" 200 150
    12: local - - [24/Oct/1994:13:47:19 -0600] "GET 1.gif HTTP/1.0" 200 1210
    13: local - - [24/Oct/1994:13:47:41 -0600] "GET index.
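
For the curious, the `with` block in Example 2 behaves roughly like the `try`/`finally` pattern sketched below, which closes the file even if an error occurs while reading. This is an approximation for illustration, not a description of Python's internals. The sketch writes its own small throwaway file (the name `sample_path` and its contents are made up) so that it runs without `data/apache-mini.log`.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write('first line\nsecond line\n')
    sample_path = tmp.name

# Roughly what `with open(...) as in_file:` does for us.
in_file = open(sample_path, 'r')
try:
    lines = [line.rstrip() for line in in_file]
finally:
    # Runs whether or not the reading step raised an error.
    in_file.close()

print(lines)           # ['first line', 'second line']
print(in_file.closed)  # True

# Clean up the throwaway file.
os.remove(sample_path)
```

The `with` version is shorter and harder to get wrong, which is why it is generally preferred.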

### Example 3: splitting lines into a list

One common thing you might want to do when reading a formatted text file is to split each line on some sort of special character such as a comma, tab, or space. Let's split the Apache log on spaces, so that each line becomes a list. We'll store each of these lists in a master list. Sometimes this does exactly what you need in terms of getting lines ready for import into something like a Pandas DataFrame.

We can always use more powerful tools like [regex](http://regexr.com/) to do this job. And of course, Python supports regex. We'll see this a little later.

In [24]:
# Store input filename in variable. Include necessary path info.
in_filename = "data/apache-mini.log"

# Init counter to keep track of line numbers (not necessary but sometimes useful)
line_number = 0

# Create empty list
loglines = []

# Open the input file for reading
with open(in_filename, 'r') as in_file:
    # Loop through each line in the file. Check out the nice looping syntax for traversing a file.
    for line in in_file:
        
        # Let's strip off any end of line characters
        line = line.rstrip()
        
        # Before we split on the spaces, let's get rid of the brackets around the date
        line = line.replace('[', '')
        line = line.replace(']', '')
        
        # Now split the line using space as our delimiter
        logline_list = line.split(' ')
        
        # Append the logline list to the master list
        loglines.append(logline_list)
        
        # Increment the line counter
        line_number += 1
        
# All done, print the list
print(loglines)


[['local', '-', '-', '24/Oct/1994:13:41:41', '-0600', '"GET', 'index.html', 'HTTP/1.0"', '200', '150'], ['local', '-', '-', '24/Oct/1994:13:41:41', '-0600', '"GET', '1.gif', 'HTTP/1.0"', '200', '1210'], ['local', '-', '-', '24/Oct/1994:13:43:13', '-0600', '"GET', 'index.html', 'HTTP/1.0"', '200', '3185'], ['local', '-', '-', '24/Oct/1994:13:43:14', '-0600', '"GET', '2.gif', 'HTTP/1.0"', '200', '2555'], ['local', '-', '-', '24/Oct/1994:13:43:15', '-0600', '"GET', '3.gif', 'HTTP/1.0"', '200', '36403'], ['local', '-', '-', '24/Oct/1994:13:43:17', '-0600', '"GET', '4.gif', 'HTTP/1.0"', '200', '441'], ['local', '-', '-', '24/Oct/1994:13:46:45', '-0600', '"GET', 'index.html', 'HTTP/1.0"', '200', '3185'], ['local', '-', '-', '24/Oct/1994:13:46:45', '-0600', '"GET', '2.gif', 'HTTP/1.0"', '200', '2555'], ['local', '-', '-', '24/Oct/1994:13:46:47', '-0600', '"GET', '3.gif', 'HTTP/1.0"', '200', '36403'], ['local', '-', '-', '24/Oct/1994:13:46:50', '-0600', '"GET', '4.gif', 'HTTP/1.0"', '200', '44

Well, not so pretty. Sometimes we need to "pretty print" - https://docs.python.org/3/library/pprint.html.

In [25]:
from pprint import pprint

In [26]:
pprint(loglines)

[['local',
  '-',
  '-',
  '24/Oct/1994:13:41:41',
  '-0600',
  '"GET',
  'index.html',
  'HTTP/1.0"',
  '200',
  '150'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13:41:41',
  '-0600',
  '"GET',
  '1.gif',
  'HTTP/1.0"',
  '200',
  '1210'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13:43:13',
  '-0600',
  '"GET',
  'index.html',
  'HTTP/1.0"',
  '200',
  '3185'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13:43:14',
  '-0600',
  '"GET',
  '2.gif',
  'HTTP/1.0"',
  '200',
  '2555'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13:43:15',
  '-0600',
  '"GET',
  '3.gif',
  'HTTP/1.0"',
  '200',
  '36403'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13:43:17',
  '-0600',
  '"GET',
  '4.gif',
  'HTTP/1.0"',
  '200',
  '441'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13:46:45',
  '-0600',
  '"GET',
  'index.html',
  'HTTP/1.0"',
  '200',
  '3185'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13:46:45',
  '-0600',
  '"GET',
  '2.gif',
  'HTTP/1.0"',
  '200',
  '2555'],
 ['local',
  '-',
  '-',
  '24/Oct/1994:13
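
By default `pprint` wraps at 80 characters, which is why every field above ends up on its own line. Its `width` parameter can be raised so that each inner list fits on one line. The sketch below uses just the first two records shown above as a stand-in for the full `loglines` list; `pformat` is the companion function that returns the formatted string instead of printing it.

```python
from pprint import pformat, pprint

# First two records from the output above, standing in for the full list.
loglines = [
    ['local', '-', '-', '24/Oct/1994:13:41:41', '-0600',
     '"GET', 'index.html', 'HTTP/1.0"', '200', '150'],
    ['local', '-', '-', '24/Oct/1994:13:41:41', '-0600',
     '"GET', '1.gif', 'HTTP/1.0"', '200', '1210'],
]

# With a wider limit, each inner list fits on a single line.
pprint(loglines, width=120)

# pformat returns the same text as a string, handy for checking or logging.
formatted = pformat(loglines, width=120)
print(len(formatted.splitlines()))  # 2
```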

... or of course, we could iterate over the list and print a line at a time for finer control. 

In [27]:
for logline in loglines:
    print(logline)

['local', '-', '-', '24/Oct/1994:13:41:41', '-0600', '"GET', 'index.html', 'HTTP/1.0"', '200', '150']
['local', '-', '-', '24/Oct/1994:13:41:41', '-0600', '"GET', '1.gif', 'HTTP/1.0"', '200', '1210']
['local', '-', '-', '24/Oct/1994:13:43:13', '-0600', '"GET', 'index.html', 'HTTP/1.0"', '200', '3185']
['local', '-', '-', '24/Oct/1994:13:43:14', '-0600', '"GET', '2.gif', 'HTTP/1.0"', '200', '2555']
['local', '-', '-', '24/Oct/1994:13:43:15', '-0600', '"GET', '3.gif', 'HTTP/1.0"', '200', '36403']
['local', '-', '-', '24/Oct/1994:13:43:17', '-0600', '"GET', '4.gif', 'HTTP/1.0"', '200', '441']
['local', '-', '-', '24/Oct/1994:13:46:45', '-0600', '"GET', 'index.html', 'HTTP/1.0"', '200', '3185']
['local', '-', '-', '24/Oct/1994:13:46:45', '-0600', '"GET', '2.gif', 'HTTP/1.0"', '200', '2555']
['local', '-', '-', '24/Oct/1994:13:46:47', '-0600', '"GET', '3.gif', 'HTTP/1.0"', '200', '36403']
['local', '-', '-', '24/Oct/1994:13:46:50', '-0600', '"GET', '4.gif', 'HTTP/1.0"', '200', '441']
['loca
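
Once each line is a list, pulling out the month and counting hits with a dictionary is a small step. The sketch below is one possible way to do it, assuming the date always sits at index 3 and looks like `24/Oct/1994:13:41:41`. The first two records come from the output above; the third, with month `Nov`, is made up so that more than one key appears.

```python
# Stand-in for the loglines list built in Example 3.
loglines = [
    ['local', '-', '-', '24/Oct/1994:13:41:41', '-0600',
     '"GET', 'index.html', 'HTTP/1.0"', '200', '150'],
    ['local', '-', '-', '24/Oct/1994:13:41:41', '-0600',
     '"GET', '1.gif', 'HTTP/1.0"', '200', '1210'],
    # Made-up record for illustration only.
    ['local', '-', '-', '02/Nov/1994:09:15:00', '-0600',
     '"GET', 'index.html', 'HTTP/1.0"', '200', '150'],
]

monthly_counts = {}

for logline in loglines:
    # Assumed layout: the date field is the fourth item in the list.
    date_field = logline[3]

    # '24/Oct/1994:13:41:41' -> ['24', 'Oct', '1994:13:41:41'] -> 'Oct'
    month = date_field.split('/')[1]

    # Count the hit, starting from 0 for a month we haven't seen yet.
    monthly_counts[month] = monthly_counts.get(month, 0) + 1

print(monthly_counts)  # {'Oct': 2, 'Nov': 1}
```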

## Using Spyder or PyCharm - Python IDEs (Integrated Development Environments)
While these Jupyter notebooks are great for interactive computing, tutorials like this one, and relatively small programs, as your programs grow larger and more complex you'll likely want to use an IDE. Both Spyder and PyCharm are IDEs for Python and have all kinds of great features like:

* a visual debugger that allows you to step through your code, see values of variables, set breakpoints and watches
* syntax highlighting
* tab completion
* integration with version control tools such as git
* help organizing code into projects
* integrated IPython console
* a slew of other features

For now, we'll just:

* launch Spyder, 
* open a Python script version of our Apache log file reader program, 
* learn how to use the debugger.

Anaconda Python includes an IDE called Spyder. Just type `spyder` at the shell command line prompt.

You can find a Spyder Tutorial in the Help menu.

PyCharm is a full-featured IDE aimed at Python developers. I use it extensively. However, it can be a bit overwhelming for new programmers. Spyder is a little easier for beginners to use and is focused on scientific computing. While PyCharm forces the use of a Project construct, projects are optional in Spyder (just as R projects are optional in RStudio). I do recommend using projects.

Ok, now I'll show you how to run and debug our little program.

If you want to experiment with PyCharm, you can [download the free Community Edition](https://www.jetbrains.com/pycharm/download/#section=linux) and then [check out their Quick Start Guide](https://www.jetbrains.com/help/pycharm/quick-start-guide.html#create).

