jcchurch edited this page Jan 10, 2012 · 28 revisions

CS 390 - Data Analysis

Day 1. January 2, 2011. Monday

Discuss the Syllabus


Businesses run on large amounts of data. The first problem that you encounter in data analysis is determining what problem you are trying to solve. Usually these problems are abstract. "How do we get more customers to buy our product?" is an example of an ambiguous question. The data isn't going to tell us what needs to be done. An analysis of the data can tell us under what circumstances we can expect customers to behave against the wishes of the business. From there, we can discuss steps to address customer concerns.

A Real World Example from Your Instructor

I work for a company that buys textbooks. Our business model is interesting. We do dynamic pricing: we gather information found on the Internet and use it to quote you, our customer, a price.

Once we offer a price on a book, the student has the opportunity to accept or reject the offer. It's a binary decision: yes or no.

The owner of the company approached me and asked "How do we get more customers to accept our offers?" There is always an easy answer: Pay more money per book! But how much do we increase? If we increase, will we really see an increase in customer acceptance? If so, how much more of an increase?

Obviously, the data isn't going to tell us anything specific. But we did know four things about each transaction:

  • The employee who is conducting the transaction.
  • The location of the transaction.
  • The price quoted for the transaction.
  • The reaction of the customer.

There is always an easy answer: Pay more money per book! A hard answer (and the point of this class) is to analyze the data for a relationship.

Using our data, we perform the following steps:

  1. We clean the data. This is boring, necessary, and always frustrating.
  2. We extract a sample of the data.
  3. We make plots of our sample and try to discover any trends.
  4. We make an assumption.
  5. We test our assumption on a completely new, independent test sample from the original data set.
  6. If we are satisfied that our assumption accurately predicts the results from the new, independent sample, we call our assumption a model. If we are not satisfied, we throw out the assumption, the sample data, and the test data, and start over at Step 2.

There is a possible criticism of the above plan. Notice that we repeat the steps until we reach an assumption that passes our test. If we keep drawing new samples without ever changing our assumption, until one sample finally happens to agree with it, this is an example of "data dredging" and is considered unethical.
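Steps 2 and 5 of the plan can be sketched in a few lines. This is only a minimal illustration and not the code we used at the company: the function name split_sample, its seed argument, and the list of 100 numbers standing in for cleaned transactions are all made up for the example. It is written so that it runs under both Python 2 and Python 3.

```python
import random

def split_sample(records, sample_size, seed=None):
    """Return (sample, test): two non-overlapping random subsets of records."""
    rng = random.Random(seed)      # a private generator, so the split is repeatable
    shuffled = list(records)       # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    sample = shuffled[:sample_size]                 # Step 2: the sample we plot and study
    test = shuffled[sample_size:2 * sample_size]    # Step 5: held back until testing
    return sample, test

# Hypothetical stand-in for a cleaned data set of 100 transactions.
transactions = list(range(100))
sample, test = split_sample(transactions, 20, seed=1)
print(len(sample))
print(len(test))
```

Because the two slices come from one shuffled copy, no record can land in both the sample and the test set, which is what "independent" means in Step 5.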

Back to my story of my boss asking me how to improve customer acceptance rates. I discovered that there was no correlation between the acceptance rate and the offered price, a small, yet noticeable correlation between the acceptance rate and the location, and a small, yet noticeable correlation between the acceptance rate and the employee. In the end, we decided not to change our prices but to improve employee training.

Three types of data.

All data boils down to three types:

  • Univariate data. Data with only one variable. Usually we are only looking at the shape of the data.
  • Bivariate data. Data with two variables. Usually we are looking to find a relationship (if any) in the datasets.
  • Multivariate data. Data with more than two variables. We are looking to discover a relationship (if any) between any or all combinations of the variables.

What is Python?

Python is a high-level, object oriented, scalable, and extensible language. It is highly portable. The Python interpreter has been well tested on the three biggest operating systems to ensure that when you write your code on one machine it can easily run on a Python interpreter running on a completely different machine.

  • High-level: There are several layers of processing between your code and the end result. These layers do work for you.
  • Object oriented: Python has objects built into the language from the very beginning of the history of Python.
  • Scalable: All Python scripts double as modules with zero extra code. Modules allow you to import other scripts into your code with ease, thus reducing time spent rewriting code.
  • Extensible: For those times when Python is too slow to get the job done, you can write C extensions that can be called by your Python scripts.

Best of all: Python has a simple syntax that is fun to learn. It is easier to read than most languages. It usually requires fewer lines of code than C/C++/Java to develop an application (a Python program is often around half the length of the equivalent Java program).

Python's language syntax is based around a language called ABC, which few people use any more. Python draws inspiration from a variety of languages: C (expression handling and printf), Haskell (functional programming), Java (memory management), Perl (text processing), and many other languages.

There are very few symbols in Python compared to other languages. There are also very few reserved words. The creator of Python, Guido van Rossum, is a stubborn minimalist when it comes to language design. That's an inspiration drawn from the language Lisp, which is also minimalist. The opposite of Python is probably C++, which gives you every conceivable tool as part of the core language.

How Python compares to other languages

Much like Java, Python is a byte-code compiled language. Unlike Java, you would never notice the byte-code files unless someone pointed them out to you.

  • Basic/Ruby/JavaScript/R
    • Source Code→Interpreter→Output
  • C/Pascal
    • Source Code→Compiler→Object Code, Object Code→Execution→Output
  • Java
    • Source Code→Compiler→Byte Code, Byte-Code→Java Virtual Machine→Output
  • Perl/PHP/Python
    • Source Code→Compiler→Byte Code→Perl/PHP/Python Virtual Machine→Output
  • Python (if source code used as a library)
    • Source Code→Compiler→Byte Code→(Written to File and Python Virtual Machine)→Output
  • Jython
    • Source Code→Compiler→Byte Code→(Written to File and Java Virtual Machine)→Output


The most noticeable difference between Python and other languages is that proper indentation is required by the interpreter. Most languages convert all space down to a single character and then parse through your code. Not Python. Indentation is essential to organizing a program. (This is how ABC code also looks.)

# Global Block. Everything exists in the global block.
# We are now in Block 1
    # Still in Block 1
        # We are now in Block 1-1 (sub-block of Block 1)
        # Still in Block 1-1
    # Back in Block 1
        # Block 1-2

# We change indentation, so this is Block 2

How would this look in a language like C or Java?

// Global Block. Everything exists in the global block
{
    // Block 1
    {
        // Block 1-1
    }
    // Block 1
    {
        // Block 1-2
    }
}
{
    // Block 2
}

Remember that indentation isn't required by C or Java, so this is the same thing as the last example. There is no Python equivalent to this.

/* Global Block */ { /* Block 1 */ { /* Block 1-1 */ } /* Block 1 */ { /* Block 1-2 */ } } { /* Block 2 */ }

We indent with tabs or spaces. It is not wise to use both tabs AND spaces. So, tabs or spaces?

This is a "religious argument" and I am from the Church of Spaces. All of those who use tabs will be deemed heretics. All blocks shall be indented with 4 spaces. This is the end of the matter.

First Steps in Python

Start the Interpreter

Go to a command line on your favorite operating system and type “python”. If you get an error, you are probably using Windows. Download Python, install it, and try again.

Hello, world!

Type the following:

print "Hello, world!"

That was simple. It turns out that anything that can be done on the Python command line can also be done within a Python script. This is very handy, as it helps to interactively step through your code and see what is changing.

#!/usr/bin/env python
print "Hello, world!"

Let's look at the first line. That's important. There are three parts here.

  • '#!' → Shebang. On Unix operating systems, it's possible to treat a text file as an executable. '#!' as the first two characters means this is an executable text file. This holds true for Mac OS X as well. On Windows, this is ignored. '#' is also the beginning of a Python comment, so the interpreter itself skips this line.
  • /usr/bin/env → This is the command that will search your system path and execute a command. If python exists in one directory on your computer and a different directory on your friend's computer, then using '/usr/bin/env' will find the python executable and the program will work.
  • python → The name of the interpreter

This first line starts the Python interpreter. Every line after the first line is sent to the interpreter for execution. This should be on line 1, character 1 of every Python script. It will never change. Do not put comments before this line. Do not put blank spaces before this line.

Differences between Python 2 and Python 3.

We will be using Python 2 in this course. Python 3 was released in late 2008, making it 3 years old. Python 2 and Python 3 are incompatible, yet very similar languages. Why don't we use the newer language? It's because even after 3 years, when people say "python", most of the time they are referring to the older Python 2 syntax. It represents the status quo. Most textbooks are still using the Python 2 syntax, including the one selected for this course.

Here's a taste of the biggest difference between Python 2 and Python 3. The word "print" is a reserved word in Python 2. It has been demoted to a built-in function in Python 3.

#!/usr/bin/env python3
print("Hello, world!")

Basic Math

Python can perform basic math like a normal calculator. The basic “int” type is built into Python, and it's a machine-sized integer (32 or 64 bits, depending on the platform). There is also a “long” type in Python that is an arbitrary precision type. As computer scientists, you should be aware of the difference between a standard integer and a BigInt. In practice, Python will change between these two types automatically with no extra programming. You, the programmer, get this for free.

>>> 2+2
4
>>> 6/3
2
>>> 4*5
20
>>> 11/2
5

You'll notice that there is no need of a semicolon in Python. As long as there is only one statement on a line, there is no need for a semicolon. If it makes you feel better, you can put a semicolon there and Python will ignore it.

You'll also notice that Python performs integer division. That's because both operands are integers, so the fractional part is discarded, much as in most other languages (Python rounds down, so -11/2 gives -6). To get a floating point value, one of the two values needs to be a float. (This is another change between Python 2 and Python 3. Python 2 performs division in the same manner as C or Java or another programming-oriented language. Python 3 performs division in the same manner as Matlab, Mathematica, R or another math-oriented language.)

Two more operators.

// -> performs floor division no matter what the numbers are (the result is rounded down, even for floats).
** -> calculates a number raised to a power.
>>> 11.0/2
5.5
>>> 11.0 // 2
5.0
>>> 2 ** 16
65536
>>> 25 ** 0.5
5.0
>>> 2**100
1267650600228229401496703205376L

Having a built-in exponent operator is handy. Any number raised to the half power is the square root of that value. Using it is a quick way to find square roots without having to call a math function. You'll also notice that any time Python converts from the basic integer to the long type, it adds an “L” to the end. That “L” only shows up when the interpreter echoes a value back to you; printing the number leaves it off. You don't have to worry about these strange “L”s appearing in your program's output.
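Going back to division for a moment: if you want code that gives the same answers under Python 2 and Python 3, one approach (a sketch, not the only way) is to say explicitly which kind of division you mean. The variable names a and b are just for illustration.

```python
a = 11
b = 2

print(a // b)         # floor division: 5
print(float(a) / b)   # true division once one operand is a float: 5.5
print(divmod(a, b))   # quotient and remainder together: (5, 1)
print(-a // b)        # floor division rounds down, not toward zero: -6
```

The last line is the one that surprises C and Java programmers, since those languages round toward zero and would give -5.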

Boolean Logic

In an effort to reduce the number of symbols used by a language, Python has renamed logical and, or and not.

  • Logical Operator OR is “or”. The equivalent in C/C++/Java is ||
  • Logical Operator AND is “and”. The C equivalent is &&
  • Logical Operator NOT is “not”. The C equivalent is !
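Here is a small sketch of the three operators working together. The variable names (age, has_ticket, is_staff) are invented for the example. It also shows that “and” and “or” short-circuit, just as && and || do in C.

```python
age = 20
has_ticket = False
is_staff = True

can_enter = (age >= 18 and has_ticket) or is_staff
print(can_enter)          # True, because is_staff is True
print(not can_enter)      # False

# "and" and "or" short-circuit: the right side runs only if it is needed.
values = []
safe = len(values) > 0 and values[0] > 10   # values[0] is never evaluated here
print(safe)               # False
```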


Python has several built-in types (int, float, string, list, and two more types called a dictionary and a tuple).

Python is dynamically typed: a variable has no declared type, and Python works out the type of each value at run time and determines the appropriate storage structure. This goes hand in hand with a philosophy called “duck typing”, named after the classic quote on ducks. If it walks like a duck and quacks like a duck, then it must be a duck. Code cares about what a value can do, not about what it was declared to be. If something looks like an int, Python stores it as an int.

Use the built-in function “type” to determine the type of a variable if you are not sure.

>>> a=5
>>> b=3.14
>>> c="hello"
>>> d=2**100
>>> type(a)
<type 'int'>
>>> type(b)
<type 'float'>
>>> type(c)
<type 'str'>
>>> type(d)
<type 'long'>

In the tradition of C and Java, double quotes are used for strings and single quotes are used for single characters. Not so in Python. Single quotes and double quotes play the same role. If you begin a string with a double quote, it must end with a double quote. If you begin a string with a single quote, it must end with a single quote. This is helpful if you wish to embed a double quote symbol into a string: just mark the string with single quotes.

Python has no notion of a character type. The closest thing to a character is a string of length 1. Python also uses the triple-single quote operator and a triple-double quote operator. This is useful if you wish to embed both symbols into a string without escaping them. Triple-single quote operator and triple-double quote operators also allow you to extend a string to multiple lines. If you begin a string with one operator, it must also end with the same operator.

>>> s = "hello"
>>> t = 'this is also a string'
>>> u = 'Patrick Henry once said, "Give me liberty, or give me death!"'
>>> u
'Patrick Henry once said, "Give me liberty, or give me death!"'
>>> v = "It's nice to use words with a contraction."
>>> v
"It's nice to use words with a contraction."
>>> w = """This is a long string.
... You can see that it goes on for multiple lines.
... Use the same operator to end the string."""
>>> w
'This is a long string.\nYou can see that it goes on for multiple lines.\nUse the same operator to end the string.'
>>> print w
This is a long string.
You can see that it goes on for multiple lines.
Use the same operator to end the string.

Because Python doesn't require a string to be used with anything, a string can stand by itself as a statement. Programmers often use a triple-quoted string as a multi-line comment in their programs instead of the pound sign.

Reading user input

There are two basic input statements in Python.

  • input(prompt) → Allows the user to input strings, then it immediately evaluates those strings as Python code. You'll probably never use this. This is now considered a flaw in Python's original design.

  • raw_input(prompt) → Allows the user to input strings. This is what we'll use to prompt the user for information.

    >>> age = raw_input("What is your age? ")
    What is your age? 32
    >>> age
    '32'

The data is always returned as a string. If you wish to convert this value to a number, use the “int” function. Here, I'm changing 'age' from the string value “32” to the integer value “32”. Python doesn't care that I'm reassigning this variable on the fly with a new type.

>>> age = int(age)
>>> age
32

Conditional Statements

The classic 'if/else' statement is here. In Python, it's called 'if/elif/else'. The syntax for a basic if statement is this:

if condition:
    # true block
elif another condition:
    # true block
elif yet another condition:
    # true block
else:
    # All conditions failed block

As a side note, there is no “switch” statement in Python. Use an if/elif/else chain instead.
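To make that concrete, here is a sketch of a switch-like decision written two ways. The function names describe and describe_lookup and the status codes are made up for the example. The second version uses a dictionary, which these notes cover later, and is a common alternative when every branch just picks a value.

```python
def describe(code):
    """Name a status code using an if/elif/else chain."""
    if code == 200:
        return "OK"
    elif code == 404:
        return "Not Found"
    elif code == 500:
        return "Server Error"
    else:
        return "Unknown"

# The same decision as a dictionary lookup; get() supplies the default case.
STATUS_NAMES = {200: "OK", 404: "Not Found", 500: "Server Error"}

def describe_lookup(code):
    return STATUS_NAMES.get(code, "Unknown")

print(describe(404))
print(describe_lookup(123))
```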

Lists

In Python, a list is the basic data structure to represent a sequence of objects. This is not to be confused with the term “Array”, which is never used to represent anything in the core Python implementation. A list is written between '[' and ']' brackets. A list containing no objects is simply '[]'. Examples of lists:

>>> numbers = [10, 20, 30, 40]
>>> words = ["the","quick","brown","fox"]
>>> mixed = [10, "the", 20, "quick", 30, "brown"]

List elements can be reassigned and referenced by the same syntax as found in C or Java:

>>> numbers[1] = 25
>>> numbers[3] = [41,42,43] # (This embeds a list into a list)
>>> numbers[0] = numbers[2]

Lists can be concatenated:

>>> numbers = [10, 20, 30, 40]
>>> words = ["the","quick","brown","fox"]
>>> numbers + words
[10, 20, 30, 40, 'the', 'quick', 'brown', 'fox']
>>> numbers += [50] # This is a "push" stack operation
>>> numbers
[10, 20, 30, 40, 50]
>>> numbers = [5] + numbers # This is an "enqueue" queue operation
>>> numbers
[5, 10, 20, 30, 40, 50]

File Reading

(I jumped ahead in my notes on Monday. We reordered.)

Reading from a file works in much the same way that it does in C or Java. First, you issue a command to open a file. The command returns a "file handle". You then use the file handle to issue commands such as reading from or writing to the file.

There are three methods associated with file reading:

  • read - Read an entire document into a string.
  • readline - Read a single line of a document into a string.
  • readlines - Read all the lines and return the document as a list of strings, where one string equals one line.
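The three methods can be compared side by side with a small sketch. The file name sample.txt and its contents are invented for the example. I use open() here because it exists in both Python 2 and Python 3, whereas file() is available only in Python 2.

```python
import os
import tempfile

# Build a small throwaway file to read back (the name sample.txt is arbitrary).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
fh = open(path, "w")
fh.write("first\nsecond\nthird\n")
fh.close()

fh = open(path)
whole = fh.read()          # the entire file as one string
fh.close()

fh = open(path)
first = fh.readline()      # just the first line, newline included
fh.close()

fh = open(path)
lines = fh.readlines()     # a list with one string per line
fh.close()

print(repr(whole))
print(repr(first))
print(lines)
```

Notice that readline and readlines keep the trailing newline on each line, which is why stripping each line is so common.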

But this is my preferred way to read a file:

for line in file(filename):
    line = line.strip()
    print line

This does the job.

Day 2. January 3, 2011. Tuesday

Review of Monday's class

  • Basic math
  • Brief overview of data structures
  • Strings
  • Boolean logic
  • Conditional Statements
  • Reading from a file
  • Basic string processing
  • Basics to Lists

File Writing

By default, if we open a file, it is assumed that we intend to read from it. To write to a file, pass "w" as the second argument, much as in most other languages.

fh = file(filename, "w")

Unlike the print statement, the write command will not add newlines on the end of each write call.
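Here is a minimal sketch of writing a few lines and reading them back. The file name words.txt is arbitrary, and I use open() rather than file() so that the example also runs under Python 3.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "words.txt")   # arbitrary throwaway file

fh = open(path, "w")
for word in ["the", "quick", "brown", "fox"]:
    fh.write(word + "\n")    # write() adds nothing, so supply the newline yourself
fh.close()                   # closing flushes the data to disk

fh = open(path)
contents = fh.read()
fh.close()
print(repr(contents))
```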

A list can be sliced by referencing the list with the colon operator inside the square brackets. Think of slice boundaries as “fenceposts” rather than “blocks”. The first and second numbers refer to the fenceposts within the list (with the numbering starting at 0).

>>> [21,22,23,24,25][0]
21
>>> [21,22,23,24,25][0:5]
[21, 22, 23, 24, 25]
>>> [21,22,23,24,25][0:4]
[21, 22, 23, 24]
>>> [21,22,23,24,25][1:5]
[22, 23, 24, 25]
>>> [21,22,23,24,25][1:3]
[22, 23]
>>> [21,22,23,24,25][:-1] # What remains after a "pop" stack operation
[21, 22, 23, 24]
>>> [21,22,23,24,25][1:] # What remains after a "dequeue" queue operation
[22, 23, 24, 25]

Lists can be indexed with negative values, which count back from the end:

>>> [21,22,23,24,25][-1]
25
>>> [21,22,23,24,25][-2]
24

Lists can be sliced with a skip value:

>>> [21,22,23,24,25][1:4]
[22, 23, 24]
>>> [21,22,23,24,25][1:4:2]
[22, 24]
>>> [21,22,23,24,25][::-1] # This reverses a list
[25, 24, 23, 22, 21]

Boolean searches are simple in Python using “in” and “not in” operators:

>>> 21 in [21, 22, 23, 24, 25]
True
>>> 21 not in [21, 22, 23, 24, 25]
False
>>> 4 in [21, 22, 23, 24, 25]
False
>>> 4 not in [21, 22, 23, 24, 25]
True

Common built-in list operations:

>>> len([21, 22, 23, 24, 25]) # Length of a list
5
>>> min([21, 22, 23, 24, 25]) # Minimum value of a list
21
>>> max([21, 22, 23, 24, 25]) # Maximum value of a list
25
>>> sum([21, 22, 23, 24, 25]) # Sum of a list
115

Methods to the list object:

>>> numbers = [21,22,23,24,25]
>>> numbers.insert(2, 42) # Inserting element into list
>>> numbers
[21, 22, 42, 23, 24, 25]
>>> numbers.insert(10, 42) # Inserting element beyond the end of the list
>>> numbers
[21, 22, 42, 23, 24, 25, 42]
>>> numbers.pop(2) # Removes an element from a list and returns it.
42
>>> numbers
[21, 22, 23, 24, 25, 42]

Built-in method: range

“range” is a special built-in function for generating list sequences. It takes 1, 2, or 3 arguments.

Depending on the arguments, range returns the following list:

  • range(m) creates a list of integers from 0 to m-1
  • range(n,m) creates a list of integers from n to m-1
  • range(n,m,p) creates a list of integers from n up to (but not including) m, in steps of p

For new Python programmers, you may be wondering why the range function doesn't give you the final integer in the range. The reason is so that you can call "range(len([21, 22, 23, 24, 25]))" and produce a list of all of the index values in a list.

>>> range(5) # Generates 0 to 4
[0, 1, 2, 3, 4]
>>> range(1,5) # Generates 1 to 4
[1, 2, 3, 4]
>>> range(0,10,2) # Generates even numbers from 0 to 8
[0, 2, 4, 6, 8]
>>> range(0,-10,-1) # Generates numbers from 0 to -9
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]

Loops

There exist two loop constructs in Python: “while” and “for”. “while” is a pre-conditional loop and acts just like it does in the C/C++/Java family of languages. There is no post-conditional “do while” construct in Python.
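When you really do need the body to run at least once, a common workaround is “while True” with a “break” at the bottom. Here is a sketch; the function name digits is made up for the example.

```python
def digits(n):
    """Collect the decimal digits of a non-negative integer, least significant first."""
    result = []
    while True:                  # the body always runs at least once
        result.append(n % 10)
        n = n // 10
        if n == 0:               # the "post-condition" is tested at the bottom
            break
    return result

print(digits(472))
print(digits(0))
```

A plain “while n > 0” loop would return an empty list for 0, which is exactly the case a post-conditional loop is meant to handle.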

>>> i = 0
>>> while i < 10:
...     print i,
...     i = i + 1
0 1 2 3 4 5 6 7 8 9

The “for” construct is a list iterator. To print a sequence of numbers, you first have to generate a list of numbers. For that, it's best to use the “range” function. “for” uses the following syntax:

for iterator variable in list object :

>>> for i in range(10):
...  print i,
0 1 2 3 4 5 6 7 8 9

The ideal use of the “for” construct is to iterate over a list.

>>> words = ["the","quick","brown","fox"]
>>> for w in words:
...  print w
the
quick
brown
fox

Using the “range” and “len”, we can get the positions of each element in a list:

>>> words = ["the","quick","brown","fox"]
>>> for i in range(len(words)):
...     print "%s: %s" % (i, words[i])
0: the
1: quick
2: brown
3: fox
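As an aside, Python also has a built-in function called enumerate that hands you the position and the element together, so the range(len(...)) step can be skipped. A small sketch (the list name labeled is just for illustration):

```python
words = ["the", "quick", "brown", "fox"]

labeled = []
for i, w in enumerate(words):      # enumerate yields (position, element) pairs
    labeled.append("%s: %s" % (i, w))

print(labeled)
```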

Methods

Methods are simple in Python:

def Method Name ( Comma separated arguments ):

Because of duck typing, a method can return any data type it wants. It is the programmer's responsibility to make sure it is used in the right context. (Side note: This is different compared to a language like Java, which is statically typed, so checking how a method is used is the compiler's responsibility. Perl operates similarly to Python by using a dynamic typing system. Haskell's You-Must-Be-Mad-Authoritarian style typing system is so strict that your program won't compile even when you think it should, and it's usually because of a data type mix-up.)

>>> def hello():
...     print "hello"
>>> hello()
hello

Python passes all values by reference. But that's not to say that you can change a value after it is passed. Data types in Python are put into two classes: mutable types (which can change) and immutable types (which cannot change):

  • Immutable types (which cannot change):
    • Numbers (including Integers, Booleans, Floating point, and Complex Numbers)
    • Strings (including regular Strings and Unicode Strings)
    • Tuples (which are essentially constant lists)
  • Mutable types (which can change):
    • Lists
    • Dictionaries (Python's name for associative arrays or hash tables)
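The difference shows up as soon as you pass values into a method. In this sketch (the names add_item and add_one are invented), the list is changed for the caller, but the number is not.

```python
def add_item(items):
    items.append(99)        # changes the very list the caller passed in

def add_one(number):
    number = number + 1     # builds a new int and rebinds the local name only
    return number

my_list = [1, 2, 3]
add_item(my_list)
print(my_list)              # the caller's list has grown

my_number = 10
result = add_one(my_number)
print(my_number)            # the caller's number is unchanged
print(result)
```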

Here's an example of the selection sort:

def sort(items):
    n = len(items)
    for i in range(n):
        smallest = i # Index of the smallest value seen so far
        for j in range(i+1, n):
            if items[j] < items[smallest]:
                smallest = j
        items[i], items[smallest] = items[smallest], items[i] # Swaps two variables.

a = [4,5,7,3,5,3,65,7,4,34,5,76,5,43,3,5,7]
sort(a) # Sorts the list in place
print a
[3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 7, 7, 7, 34, 43, 65, 76]

Here's an example of the quadratic equation problem to show how a method can return a list. It requires the math library to get the square root function.

>>> import math
>>> def quad(a, b, c):
...     s = b*b - 4 * a * c
...     if s < 0:
...         return []
...     if s == 0:
...         return [ -b / (2 * a) ]
...     sr = math.sqrt(s)
...     return [ (-b + sr) / (2 * a), (-b - sr) / (2 * a) ]
>>> quad(3,1,-2) # 3x^2 + x - 2, 2 solutions
[0.66666666666666663, -1.0]
>>> quad(2,4,2) # 2x^2 + 4x + 2, 1 solution
[-1]
>>> quad(6,-2,27) # 6x^2 - 2x + 27, 0 solutions
[]

List Comprehension

List Comprehension is an odd syntax for generating lists that was taken straight from a language called Haskell. (Dr. Cunningham is a big fan of Haskell.) The syntax for list comprehension works like this:

[ expression for iterator value in list if conditional statement ]

>>> [i for i in range(3)]
[0, 1, 2]
>>> [i+1 for i in range(3)]
[1, 2, 3]
>>> [i**2 for i in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> [i for i in range(100) if i % 2 == 1]
[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61
>>> binary = "00101010"
>>> [i for i in range(8)]
[0, 1, 2, 3, 4, 5, 6, 7]
>>> [2**i for i in range(7,-1,-1)]
[128, 64, 32, 16, 8, 4, 2, 1]
>>> times = [2**i for i in range(7,-1,-1)]
>>> [int(binary[i])*times[i] for i in range(8)]
[0, 0, 32, 0, 8, 0, 2, 0]
>>> sum([int(binary[i])*times[i] for i in range(8)])
42
>>> [int(binary[i])*(2**(len(binary)-1-i)) for i in range(len(binary))]
[0, 0, 32, 0, 8, 0, 2, 0]
>>> sum([int(binary[i])*(2**(len(binary)-1-i)) for i in range(len(binary))])
42

Prime numbers with List Comprehension using the Sieve of Eratosthenes technique:

primes = range(2,120)
i = 0
while i < len(primes):
    primes = [x for x in primes if x % primes[i] != 0 or x == primes[i]]
    i = i + 1
print primes
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113]

For extremely large lists, this technique is slow. This is a less elegant, but more efficient method:

def sieve(count):
    primes = range(2,count)
    i = 0
    while i < len(primes):
        j = i+1
        while j < len(primes):
            if primes[j] % primes[i] == 0:
                del primes[j] # The next value slides into position j, so don't advance
            else:
                j = j + 1
        i = i + 1
    return primes
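Strictly speaking, both versions above find primes by testing divisibility. The classic Sieve of Eratosthenes never divides: it crosses off the multiples of each prime in a table of flags. Here is a sketch of that variant for comparison. The function name sieve_of_eratosthenes and the is_prime table are my own naming, and the code runs under both Python 2 and Python 3.

```python
def sieve_of_eratosthenes(limit):
    """Return all primes strictly below limit by crossing off multiples."""
    if limit < 2:
        return []
    is_prime = [True] * limit          # is_prime[n] answers "is n still a candidate?"
    is_prime[0] = False
    is_prime[1] = False
    i = 2
    while i * i < limit:
        if is_prime[i]:
            # Smaller multiples of i were already crossed off by smaller primes.
            for multiple in range(i * i, limit, i):
                is_prime[multiple] = False
        i = i + 1
    return [n for n in range(limit) if is_prime[n]]

print(sieve_of_eratosthenes(30))
```

The price is memory: the table holds one flag for every number below the limit, not just for the primes.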

Python Modules

Python's modules are the equivalent to Java's packages. A module is a file containing variables, methods, and classes that contains code needed for common reuse. There is nothing that distinguishes a Python module from a normal Python script other than the way the files are used. Python modules still end in ”.py” and there is no special code to change a Python script to a Python module. math is a Python module for math routines.
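One common convention lets the same file behave well both as a script and as a module. A minimal sketch, assuming the file is saved under the made-up name stats.py:

```python
# Imagine this file is saved as stats.py (a made-up name for this example).

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))

if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(mean([1, 2, 3, 4]))
```

Running the file directly prints the result, while another script can import it and call mean without triggering the print.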

There are multiple ways to import a module, but these are the two most common ways:

import math

Imports the math module, but all calls to that module must begin with math.

from math import *

Imports the math module and all calls to that module look like native fields, methods, and classes.

>>> import math
>>> math.pi
3.1415926535897931
>>> math.e
2.7182818284590451
>>> math.sin(0)
0.0
>>> math.sin(math.pi / 2.0)
1.0
>>> from math import *
>>> pi
3.1415926535897931
>>> e
2.7182818284590451
>>> sin(0)
0.0
>>> sin(pi/2.0)
1.0

The advantage to using “from math import *” is that it allows you to write cleaner, more readable code. The disadvantage is that it may clobber existing variables in your script or collide with other imported modules. My advice is to always use the “import X” call.
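Here is a small sketch of the clobbering problem. I import only the single name e to keep the example short, but “from math import *” would do the same thing to every public name in the module. The variables e and edges are invented for the example.

```python
e = 5                 # my own variable, say an edge count
from math import e    # "from math import *" would do this for every name in math
print(e)              # no longer 5: it is now Euler's number

import math
edges = 5             # with "import math" nothing of mine is touched
print(edges)
print(math.e)
```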

Plotting in Python using matplotlib

The matplotlib library is a Python library that allows you to generate graphs and charts. The nice thing about matplotlib is that it was modeled after Matlab's plotting libraries. In my opinion, the plotting libraries are the only good thing about Matlab. Since matplotlib gives you most of Matlab's best features in Python, why use Matlab?

The command to import matplotlib's plotting library is this:

import matplotlib.pyplot as plt

For now, the main function we will be using is called "plot". The syntax for plot is as follows:

plt.plot(x, y, description)

Where the following variables are used:

  • x is a list of numerical values intended for the x axis. Optional.
  • y is a list of numerical values intended for the y axis.
  • description allows you to change the color and dot shapes of the plot.

The x values are optional. If you plot with a single value, it is assumed that the values represent y values and it generates an integer sequence of x values for you.

The description is optional. The description defines the points and color of the line. The format is the color code then the shape code. By default, plots are generated with a blue line.

For example, 'r.' generates a scatter plot using red dots.

Here are the color codes:

  • b: blue
  • g: green
  • r: red
  • c: cyan
  • m: magenta
  • y: yellow
  • k: black
  • w: white

Here are the shape codes:

  • .: dots
  • -: lines
  • +: pluses

There are actually many shape codes. I'm putting the most important ones here.

Python's Dictionaries

A dictionary is the Python word for an associative array or hash table. A real dictionary is used to look up the definitions of words, and a definition is found based on how the word is spelled and nothing more. A Python dictionary is different from a list in that a list is a sequential ordering of values, whereas a dictionary has no meaningful ordering of values. The ordering appears random to us, but Python decides it based on how quickly a value can be accessed from its input key.

Dictionaries require keys to be mapped to values. In Python ...

  • a dictionary is declared using {} symbols.
  • values are retrieved from the dictionary using the [] notation (which is identical to lists).
  • A key-value pair is defined by two pieces of data separated by the : (colon) symbol.
  • key-value pairs are separated by the , (comma) symbol.

The value of a key-value pair is a normal container, meaning it can hold any normal data structure. The key of a key-value pair is limited to hashable types, such as numbers or strings. Data structures such as lists and dictionaries are not hashable, therefore they cannot be used as the key of a key-value pair.

>>> empty = {}
>>> empty
{}
>>> single = {"planet":"Earth"}
>>> single["planet"]
'Earth'
>>> pets = {"dog":"Fido.", "cat":"Mittens."}
>>> pets
{'dog': 'Fido.', 'cat': 'Mittens.'}
>>> pets['dog']
'Fido.'
>>> pets['cat']
'Mittens.'
>>> numbers = {3: "Stooges", 12: "items in a dozen", 64: "squares on a checkerboard"}
>>> numbers
{64: 'squares on a checkerboard', 3: 'Stooges', 12: 'items in a dozen'}
>>> numbers[3]
'Stooges'
>>> numbers[12]
'items in a dozen'
>>> numbers[64]
'squares on a checkerboard'
>>> numbers[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 0
>>> len(numbers)
3

Even after a key-value pair has been set, it can be updated and modified:

>>> pets
{'dog': 'Fido.', 'cat': 'Mittens.'}
>>> pets["mouse"] = "Squeak" # Add a new pet "mouse"
>>> pets
{'mouse': 'Squeak', 'dog': 'Fido.', 'cat': 'Mittens.'}
>>> del pets["cat"] # Delete the pet "cat"
>>> pets
{'mouse': 'Squeak', 'dog': 'Fido.'}

There are several built-in functions that assist with working with Dictionaries.

>>> numbers
{64: 'squares on a checkerboard', 3: 'Stooges', 12: 'items in a dozen'}
>>> len(numbers)
3
>>> numbers.keys()
[64, 3, 12]
>>> numbers.values()
['squares on a checkerboard', 'Stooges', 'items in a dozen']
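A few more dictionary idioms are worth knowing. This is a sketch using the same numbers dictionary as above; the names report, has_three, has_zero, and missing are just for illustration.

```python
numbers = {64: "squares on a checkerboard", 3: "Stooges", 12: "items in a dozen"}

report = []
for key in sorted(numbers):             # sorted() walks the keys in order
    report.append("%s %s" % (key, numbers[key]))
print(report)

has_three = 3 in numbers                # "in" tests the keys, not the values
has_zero = 0 in numbers
missing = numbers.get(0, "no entry")    # get() returns a default instead of raising KeyError
print(missing)
```

Sorting the keys yourself is the way to get predictable output, since the dictionary's own ordering should not be relied on.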

Day 3. January 4, 2011. Wednesday

Review of the last class

Remind the students that they should be reading Chapter 2 and Chapter 3 in the textbook.

Topics discussed in the last class.

  • The matplotlib plotting library
  • Methods
  • List Comprehension
  • Dictionaries
  • The Monty Hall Problem (you won't be tested on this)
  • Loops
  • Writing to a file

Return to Dictionaries discussion

Working with a dictionary in real problems:

>>> hhga = """And Saint Attila raised the hand grenade up on high, saying, "O Lord, bless this Thy hand grenade that..."""
>>> words = {}
>>> for w in hhga.lower().split():
...     if w in words:
...         words[w] += 1
...     else:
...         words[w] = 1
>>> words
{'being': 2, 'lobbest': 1, 'four': 1, 'more,': 1, 'blow': 1, 'to': 4, 'lord': 2, 'then': 3, 'sight,': 1, 'five': 1, 'n
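The same counting loop can be shortened with the dictionary's get method, which returns a default when the key is missing. This is a sketch: the function name count_words and the sample sentence are made up, and the sentence is deliberately not the quotation above.

```python
def count_words(text):
    """Map each lowercased, whitespace-separated word to how often it appears."""
    counts = {}
    for w in text.lower().split():
        counts[w] = counts.get(w, 0) + 1   # 0 is used the first time a word is seen
    return counts

sample_text = "the quick fox and the lazy dog and the cat"   # made-up sample
counts = count_words(sample_text)
print(counts["the"])
print(counts["and"])
```

Note that split() only breaks on whitespace, so punctuation stays attached to words, just as 'more,' and 'sight,' do in the output above.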

Python Classes

Classes are similar to classes in Java in that they have most of the same properties. Classes are made up of two parts: fields and methods. In Java, the fields of a class must be explicitly defined.

class Bicycle {
    private int gear; // Private Field: This field is not accessible from outside the class, unless get/set used
    public int speed; // Public Field: This field is accessible outside of the class
    Bicycle() {
        gear = 1;
        speed = 0;
    }
    int getSpeed() {
        return speed;
    }
    int getGear() {
        return gear;
    }
    void setSpeed(int speed) {
        this.speed = speed;
    }
    void setGear(int gear) {
        this.gear = gear;
    }
    public String toString() {
        return "Gear: "+gear+" Speed: "+speed;
    }
}

In Java, there are four access levels for fields (public, private, protected, and package) and the default is package. In Python, there are just two (public and private) and the default is public. (Actually, Python's private fields are only renamed behind the scenes, so they can still be reached from outside the class with a small trick.)

class Bicycle:
    def __init__(self):
        self.__gear = 1 # Private field, get/set methods must be used to change this value
        self.speed = 0 # Public field
    def getGear(self):
        return self.__gear
    def setGear(self, gear):
        self.__gear = gear
    def __str__(self):
        return "Gear: %s Speed: %s" % (self.__gear, self.speed)

There are a few things that need to be pointed out in the above example: To make a field private in Java, you use the keyword private. To do the same thing in Python, you begin the field with two underscore marks.

Python's self is the equivalent to Java's this.

Unfortunately, self must be the first argument to every Python method inside a class. That does get annoying after a while.

Because fields are public by default, get and set operations are not needed for them. The constructor is named __init__ in Python, but in Java, the constructor has the same name as the class. Python's __str__ method is the equivalent to Java's “public String toString()” method. It must return a string data type.

>>> b = Bicycle()
>>> b
<__main__.Bicycle instance at 0xb7ef52ec>
>>> print b
Gear: 1 Speed: 0
>>> b.speed
0
>>> b.speed = 5 # Change speed of bike directly in object
>>> b.speed
5
>>> b.__gear # Will not work because __gear is a private field
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: Bicycle instance has no attribute '__gear'
>>> b.__gear = 2 # Adds a meaningless value to the object. This doesn't change the internal __gear
>>> b.getGear() # Internal gear setting left unchanged
1
>>> b.__gear # But the meaningless value is still retained.
2
>>> b.setGear(3) # Correct way to change gear on bike
>>> b.getGear()
3
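
The "tweak" that makes a private field reachable is name mangling: inside a class body, Python stores __gear under the name _Bicycle__gear. The sketch below uses a trimmed-down copy of the Bicycle class (only the gear field is kept, which is a simplification for illustration) to show that the field is hidden rather than truly locked away.

```python
class Bicycle(object):
    def __init__(self):
        self.__gear = 1  # stored on the instance as _Bicycle__gear

    def getGear(self):
        return self.__gear


b = Bicycle()

# Reading the "private" field from outside the class via its mangled name.
mangled = b._Bicycle__gear

# Writing through the mangled name changes the real field,
# bypassing any set method.
b._Bicycle__gear = 4
gear_after = b.getGear()

# No attribute is literally named "__gear" on the instance.
has_plain_name = hasattr(b, "__gear")

print("mangled read: %s, gear after write: %s" % (mangled, gear_after))
```

This is why the double underscore is best treated as a strong hint to other programmers, not as a security feature.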

Summary statistics.

Summary statistics are the tools that you are probably already familiar with:

  • Mean or Average
  • Median: The sorted middle
  • Harmonic Mean: The reciprocal of the average of the reciprocals of your values
  • Geometric Mean: The n-th root of the product of your values.
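
The four definitions above can be sketched in a few lines of plain Python. The function names and the sample data are made up for illustration, and the harmonic and geometric means assume every value is positive.

```python
import math


def mean(values):
    return sum(values) / float(len(values))


def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def harmonic_mean(values):
    # Reciprocal of the average of the reciprocals.
    return len(values) / sum(1.0 / v for v in values)


def geometric_mean(values):
    # exp(mean of logs) equals the n-th root of the product,
    # and avoids overflow from multiplying many values together.
    return math.exp(sum(math.log(v) for v in values) / len(values))


data = [1.0, 2.0, 4.0]  # made-up sample
print("mean=%.4f median=%.4f harmonic=%.4f geometric=%.4f" % (
    mean(data), median(data), harmonic_mean(data), geometric_mean(data)))
```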

We tend to put too much emphasis on these tools. We really shouldn't. Using our example from the first homework assignment, you learned that the average county's gross income per taxpayer was $42,000 per year. That barely tells you the story.

What is the difference between Mean and Median? Both are summary statistics, in that they try to condense a very large picture ("How much does each county in the United States make?") into a single number. Mean and Median have different biases.

All summary statistics have biases. Knowing which biases you want to guard against and which ones you don't care about should factor into selecting a summary statistic.

  • Mean is biased towards outliers. Extreme data points will skew the mean towards these values.
  • Median is biased towards repetition. Duplicate data will skew the median towards these values.

Knowing the picture that you wish to present to your audience will determine if you should use a mean or median.
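
To see the outlier bias in action, here is a small sketch with made-up incomes (in thousands of dollars). It leans on the statistics module from the standard library, which assumes Python 3.4 or newer. Adding one extreme value drags the mean far away while the median barely moves.

```python
import statistics

incomes = [30, 32, 35, 38, 40]     # made-up values, thousands of dollars
with_outlier = incomes + [1000]    # one extreme data point

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("before: mean=%.1f median=%.1f" % (mean_before, median_before))
print("after:  mean=%.1f median=%.1f" % (mean_after, median_after))
```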


A histogram is a binning or quantizing technique. Once a dataset has been quantized, simply sort the data and count the repetitions. After that, graph your counts. You need a quantizing function. Essentially quantizing is a form of rounding numbers. We could quantize any number of ways.

The book suggests Scott's Rule for determining the bin width w from the standard deviation σ and the number of points n:

w = 3.5σ / n^(1/3)

Another binning technique is Sturges' Rule, which gives the number of bins k rather than a width:

k = 1 + log2(n)
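
Both rules are easy to try out. The sketch below makes two assumptions worth stating: it uses the sample standard deviation (dividing by n - 1) for σ in Scott's Rule, and it uses the usual textbook form of Sturges' Rule, which takes a base-2 logarithm and yields a bin count rather than a bin width. The function names and data are made up.

```python
import math


def scott_bin_width(values):
    """Bin width from Scott's Rule: 3.5 * sigma / n^(1/3)."""
    n = len(values)
    mu = sum(values) / float(n)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return 3.5 * sigma / n ** (1.0 / 3.0)


def sturges_bin_count(n):
    """Number of bins from Sturges' Rule: 1 + log2(n), rounded up."""
    return int(math.ceil(1 + math.log(n, 2)))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
width = scott_bin_width(data)
bins_for_100 = sturges_bin_count(100)
print("Scott width: %.3f, Sturges bins for n=100: %d" % (width, bins_for_100))
```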

Here is a simple version of a histogram: Rounding floating point numbers into integer data.
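
A minimal sketch of that idea, using a dictionary to hold the counts. The function name and the sample values are made up, and the text-based "plot" is only there so the result is visible without a plotting library.

```python
def rounding_histogram(values):
    """Quantize each value to the nearest integer and count repetitions."""
    counts = {}
    for v in values:
        bin_label = int(round(v))
        counts[bin_label] = counts.get(bin_label, 0) + 1
    return counts


samples = [0.2, 0.9, 1.1, 1.4, 2.6, 2.7, 3.0]  # made-up data
hist = rounding_histogram(samples)

# Print one row of stars per bin, in sorted order.
for bin_label in sorted(hist):
    print("%3d %s" % (bin_label, "*" * hist[bin_label]))
```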

Histograms are extremely intuitive. At a glance you can tell where most of the data is aggregated. In matplotlib, these are simple to create.

import matplotlib.pyplot as plt

The Gaussian Distribution

The Bell Curve:

f(x) = e^(-0.5 * x * x) / sqrt(2*pi)

import sys
import math
import matplotlib.pyplot as plt

SQRT_2PI = math.sqrt(2.0 * math.pi)

def gaussian(x):
    return math.exp(-0.5*x*x)/SQRT_2PI

def function_kde(x, y, binpoints):
    bins = [0] * len(binpoints)
    for i in range(len(binpoints)):
        bins[i] = gaussian( (binpoints[i] - x) / float(y) ) / float(y)

    return bins

if __name__=='__main__':

    low = -5.0
    high = 5.0
    nbins = 201

    delta = (high - low) / (nbins - 1)

    # Evenly spaced evaluation points; computing each from its index avoids
    # the floating point drift of repeatedly adding delta.
    binpoints = [low + i * delta for i in range(nbins)]

    g = function_kde(0, 1, binpoints)

    # Plot the curve
    plt.plot(binpoints, g)
    plt.show()

Kernel Density Estimators

Kernel Density Estimators are a method of smoothing a graph. This is a relatively new approach to graph smoothing, but it is simple enough to do via a computer program.

import sys
import math
import matplotlib.pyplot as plt

SQRT_2PI = math.sqrt(2.0 * math.pi)

def gaussian(x):
    return math.exp(-0.5*x*x)/SQRT_2PI

def function_kde(x, y, h, binpoints):
    bins = [0] * len(binpoints)
    for i in range(len(binpoints)):
        bins[i] = y * gaussian( (binpoints[i] - x) / float(h) ) / float(h)

    return bins

if __name__=='__main__':

    x = []
    y = []

    for line in open('presidential_days_in_office.txt'):
        line = line.strip()
        [order, name, daysInOffice] = line.split("\t")
        x.append( float(order) )
        y.append( float(daysInOffice) )

    n = len(x)
    low = min(x)
    high = max(x)
    nbins = n * 1 

    binpoints = [0] * nbins
    masterbin = [0] * nbins

    delta = (high - low) / (nbins - 1)

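    # Evenly spaced evaluation points; computing each from its index avoids
    # the floating point drift of repeatedly adding delta.
    binpoints = [low + i * delta for i in range(nbins)]

    for i in range(n):
        bins = function_kde(x[i], y[i], 3, binpoints)

        for j in range(nbins):
            masterbin[j] += bins[j]

    # Plot the master bin
    plt.plot(binpoints, masterbin)
    plt.plot(x, y)
    plt.show()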

Day 4. January 5, 2011. Thursday

Linear Transformation

Oftentimes we have data and want to compare its shape with another dataset. The stock market is an example of this problem. Imagine that you have a stock worth $5 and another stock worth $500. The $5 stock will go up and down in small amounts, usually pennies. The $500 stock will change by drastically larger values, maybe even in $5 and $10 increments. Just because one stock moves $5 and another stock moves 5 cents does not mean that the first stock is better. If you attempted to graph the history of these two stocks on the same plot, the smaller stock would be a flat line compared to the larger stock. We need an objective way to compare two things that exist on different scales.

A simple linear transformation solves this problem.

For a series of values in vector x, f(xi) = (xi - low) / (high - low)

  • low represents the lowest x
  • high represents the highest x
  • xi represents a single element in vector x

This function scales any dataset so that the values lie between 0 and 1 while preserving the shape of the dataset. It is worth committing to memory. This allows you to easily compare the shape of two lines on the same plot.
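
As a quick check of the transformation, here is a sketch in plain Python with two made-up price series, one around $5 and one around $500. The function name rescale is hypothetical, and the code assumes the series is not constant (high must differ from low).

```python
def rescale(values):
    """Map values onto [0, 1] with (xi - low) / (high - low)."""
    low = min(values)
    high = max(values)
    span = float(high - low)  # assumes high != low
    return [(v - low) / span for v in values]


cheap = [5.00, 5.05, 4.95, 5.10]        # made-up $5 stock
pricey = [500.0, 505.0, 495.0, 510.0]   # made-up $500 stock

cheap_scaled = rescale(cheap)
pricey_scaled = rescale(pricey)

# After rescaling, the two series have the same shape on the same scale.
for c, p in zip(cheap_scaled, pricey_scaled):
    print("%.3f  %.3f" % (c, p))
```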

Typically in stock market research, you want to know which of two or more stocks has the higher yield (i.e. best growth over a predefined time span). After a researcher has computed the simple linear transformation of each series, they subtract each series' first value from that series. This is done to ensure that all plots start out at 0 on the left hand side of the graph. The right hand side of the graph will then quickly identify the best and worst stock growth.

Numpy Tutorial

In Python, the default list data structure is called "list". It's an internal data structure native to the Python environment.

NumPy is a linear algebra library used to simplify the process of working with vectors and matrices. The default data structure in NumPy is the "array", which maps directly to a one-dimensional C array of n elements.

We can create an array in NumPy three different ways:

  • Converting one from an existing Python list

  • Generating one from a function

  • Reading data straight from a text file

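    import numpy as np

    # 1. Converting one from an existing Python list
    vector = np.array([0., 1., 2., 3., 4.])

    # 2. Generating one from a function
    vector = np.arange(0, 5, 1, dtype=float)
    vector = np.linspace(0, 4, 5)
    vector = np.zeros(5)
    vector[0] = 0
    vector[1] = 1
    vector[2] = 2
    vector[3] = 3
    vector[4] = 4

    # 3. Reading data straight from a text file
    vector = np.loadtxt("data")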

In Python, when the plus sign is used on two lists, it represents concatenation. In NumPy, the plus sign represents pair-wise (element-by-element) addition. Loops aren't needed if you wish to add two vectors together.
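
A short side-by-side sketch of the difference, with made-up values. The variable names are only for illustration.

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [10, 20, 30]

# On Python lists, + concatenates.
joined = a_list + b_list

# On NumPy arrays, + adds matching elements.
a = np.array(a_list)
b = np.array(b_list)
summed = a + b

# The same result with plain lists needs a loop or a list comprehension.
loop_version = [x + y for x, y in zip(a_list, b_list)]

print(joined)
print(summed)
```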

The nice thing about NumPy arrays is that they automatically work with matplotlib's plotting libraries even though they are of a different data type.

Let's do a simple example. Let's plot a sin curve with various amplitude and frequency. The amplitude is the max peak of the line. The frequency represents how many times the curve crosses the x-axis.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 4*np.pi, int(np.pi*1000))
y = np.sin(x)
plt.plot(x, y)

To increase the frequency of a sin curve, we multiply each value in x by a number larger than 1. In plain Python, this usually requires a loop or list comprehension. In NumPy, it's simpler.

y = np.sin(x*2)
plt.plot(x, y)

Likewise, if we wish to increase the amplitude of our function, that also is simple enough.

y = np.sin(x)*2
plt.plot(x, y)

NumPy can multiply each element in a vector times each element in a second vector if each vector has the same length. There is an identity in trigonometry called the Pythagorean Identity. It is this:

1 = sin(x)^2 + cos(x)^2

We can implement and test this identity in NumPy code.

y = np.sin(x)*np.sin(x) + np.cos(x)*np.cos(x)
plt.plot(x, y)

You can even raise a NumPy vector to a power.

y = np.sin(x)**2 + np.cos(x)**2
plt.plot(x, y)

Once you are done with your vector manipulations and need to return the vector to a Python list (which is sometimes more accessible), you can use the method "tolist()" to change a NumPy array back to a Python list.

Cumulative Distribution Function Review

On Tuesday's homework assignment, I gave you the formula for a Cumulative Distribution Function but failed to give it any kind of purpose. It is a parameter-less function that tells you the "area so far". In other words, if your series is about how much of an item you've gained each day, the CDF is a running total of all of the items that you have gathered up to that point in time.

There is a simple function in NumPy for computing the cumulative sum of a vector, and it has the name "cumsum". For example, if you wish to find the "midpoint" of a time series (where exactly half of your observations by cumulative sum have been seen), then find the spot in the CDF closest to the point where the normalized cumulative sum equals 0.5.

Once normalized to end at 1, the CDF tells you what fraction of events happened before a given point in time.

pth percentile: smallest x for which cdf(x) ≥ p/100

This is how it is determined if you are in the 98th (or in my case 68th) percentile on math and reading scores on your ACT.
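
The percentile rule can be sketched without any libraries. The running total below does by hand what NumPy's cumsum does, and the function names and the daily counts are made up for illustration. The sketch assumes the values are non-negative so that the running total never decreases.

```python
def cumulative_sum(values):
    """Running total of a series (a hand-rolled version of cumsum)."""
    total = 0.0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out


def normalized_cdf(values):
    """Running total divided by the grand total, so the last entry is 1."""
    running = cumulative_sum(values)
    return [r / running[-1] for r in running]


def percentile_index(values, p):
    """Smallest index whose CDF value is at least p/100."""
    for i, c in enumerate(normalized_cdf(values)):
        if c >= p / 100.0:
            return i
    return len(values) - 1


daily_items = [1, 1, 2, 4, 2]  # made-up: items gathered per day
cdf = normalized_cdf(daily_items)
midpoint = percentile_index(daily_items, 50)
print("CDF: %s" % cdf)
print("Half of the items were gathered by day index %d" % midpoint)
```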

A Comparison of Two Stocks on the Stock Market: Apple and Microsoft

#!/usr/bin/env python

import numpy as np
import matplotlib.pyplot as plt

# Read the data of AAPL and MSFT
AAPL = np.loadtxt("aapl.csv", delimiter=',', skiprows=1, usecols=[6])
MSFT = np.loadtxt("msft.csv", delimiter=',', skiprows=1, usecols=[6])

# In both datasets, the data is in order from NEWEST to OLDEST.
# To make things plot from OLDEST to NEWEST, we reverse the order.
AAPL = AAPL[::-1]
MSFT = MSFT[::-1]

# Pass the data through a linear transformation.
AAPL_linear = (AAPL - min(AAPL)) / (max(AAPL) - min(AAPL))
MSFT_linear = (MSFT - min(MSFT)) / (max(MSFT) - min(MSFT))

plt.plot(AAPL, 'b-')
plt.plot(MSFT, 'g-')
plt.show()


# Now center the starting points at Zero
AAPL_linear -= AAPL_linear[0]
MSFT_linear -= MSFT_linear[0]

plt.plot(AAPL_linear, 'b-')
plt.plot(MSFT_linear, 'g-')
plt.show()


# Compute the CDF
# This must be done manually
AAPL_CDF = np.cumsum(AAPL)
MSFT_CDF = np.cumsum(MSFT)

AAPL_CDF = (AAPL_CDF - min(AAPL_CDF)) / (max(AAPL_CDF) - min(AAPL_CDF))
MSFT_CDF = (MSFT_CDF - min(MSFT_CDF)) / (max(MSFT_CDF) - min(MSFT_CDF))

plt.plot(AAPL_CDF, 'b-')
plt.plot(MSFT_CDF, 'g-')
plt.show()

Scatter Plots

So far in the class, we've only studied univariate data. This is data where we simply look at the shape of a single variable to get an idea of how the data looks. When data is missing, we have tools, such as Kernel Density Estimators, to smooth over the gaps.

Scatter plots are our first step into bivariate data. This is the simplest way to compare two variables: plot them! What do they look like? In this example, we simply plot the x and the y, much like your fourth graph in Homework assignment #2.

Linear Regression

The most basic tool in the bivariate data toolkit is simple linear regression. In this tool, we are trying to pass a straight line through a data set. Hopefully this straight line will be a good description and predictor of the overall dataset. This is typically represented with the equation

y = mx + b

Some textbooks write this equation differently:

y = b1 * x + b2

It means the same thing.

What we are attempting to do is to draw a line through the points that minimizes the vertical distance between the line and each point. The sum of the squares of these vertical distances is called the error term. The most general approach to minimizing it is the "Method of Least Squares".

When we have a bivariate data set, we need to compute the following:

  • Number of records: n
  • Sum of x: EX
  • Sum of y: EY
  • The sum of squares of x: EX2
  • The sum of x times y: EXY
  • m = (n*EXY - EX*EY) / (n*EX2 - EX*EX)
  • b = (EY - m*EX) / n
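
The quantities in the list above translate almost word for word into code. This is a minimal sketch, with the function name linreg and the sample points chosen for illustration (the points lie exactly on y = 2x + 1, so the answer is easy to check). It assumes the x values are not all identical, otherwise the denominator is zero.

```python
def linreg(xs, ys):
    """Least squares slope m and intercept b for y = m*x + b."""
    n = float(len(xs))
    EX = sum(xs)                                # sum of x
    EY = sum(ys)                                # sum of y
    EX2 = sum(x * x for x in xs)                # sum of squares of x
    EXY = sum(x * y for x, y in zip(xs, ys))    # sum of x times y
    m = (n * EXY - EX * EY) / (n * EX2 - EX * EX)
    b = (EY - m * EX) / n
    return m, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # made-up points on y = 2x + 1
m, b = linreg(xs, ys)

# The fitted line evaluated at each x.
fitted = [m * x + b for x in xs]
print("m=%.3f b=%.3f" % (m, b))
```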

The Quartet

Show the quartet data.

Day 5. January 6, 2011. Friday

We went over R and had a test. I did not feel 100% today.

Day 6. January 9, 2011. Monday

Logarithmic Plots

There is a fascinating data set of the heart rate of various mammals compared to their body masses. I couldn't find the exact one used by our textbook, but I did find one in this academic paper.


If we plot this data, it doesn't look very meaningful. (Plot it using R)

It wasn't until someone decided to plot these values using a logarithmic plot that they discovered that it has a linear shape! Researchers determined that the line had a slope of -1/4, and thus your heart rate can be estimated, up to a constant factor, by the following relationship:

heart rate (bpm) ∝ (mass in kg)^(-1/4)

If only one axis is logarithmic, it's called a semilog plot. If both axes are logarithmic, it's called a log-log plot.

In Python/matplotlib, we have these options:

  • X-axis is a logarithm: semilogx(X, Y)
  • Y-axis is a logarithm: semilogy(X, Y)
  • Both axes are logarithmic: loglog(X, Y)

In R, we have the following options:

  • X-axis is a logarithm: plot(X, Y, log="x")
  • Y-axis is a logarithm: plot(X, Y, log="y")
  • Both axes are logarithmic: plot(X, Y, log="xy")

Other Types of Bivariate Regression Analysis (Non-Linear Analysis)

There are many types of bivariate regression equations that don't involve linear analysis. Here are a few examples.

  • y = (a/x)+b
  • y = a*sqrt(x)+b
  • y = a*log(x)+b
  • y = x^m + b

Imagine that you have a data set, and it sort of looks linear. But maybe not. Maybe it's actually a logarithmic curve like we saw in our last example. Maybe not. Maybe it's something else. What can you do? Try them all!

The trick to solving each of these regression fitting problems is to trick each equation into thinking it's actually a linear regression problem. We can do this by transforming the X values with the function inside each equation (1/x, sqrt(x), or log(x)), then passing the new values of X and the current values of Y into our linear regression solver. Once we have our new estimates of each value, we can compute the correlation ratio between the estimates and the actual values. We then compute R-squared and see which equation fits best.
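
One possible shape for this "try them all" approach is sketched below. It is only an illustration under stated assumptions: the names TRANSFORMS, r_squared, and best_fit are made up, R-squared is computed as 1 minus the ratio of residual to total sum of squares, all x values are assumed positive, and the y = x^m + b form is left out because it cannot be linearized by transforming x alone. The test data is generated from a logarithmic curve so the expected winner is known.

```python
import math


def linreg(xs, ys):
    """Least squares slope and intercept."""
    n = float(len(xs))
    ex, ey = sum(xs), sum(ys)
    ex2 = sum(x * x for x in xs)
    exy = sum(x * y for x, y in zip(xs, ys))
    m = (n * exy - ex * ey) / (n * ex2 - ex * ex)
    return m, (ey - m * ex) / n


def r_squared(xs, ys, m, b):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / float(len(ys))
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Each candidate equation is linear in a transformed x.
TRANSFORMS = {
    "linear: y = a*x + b": lambda x: x,
    "inverse: y = a/x + b": lambda x: 1.0 / x,
    "sqrt: y = a*sqrt(x) + b": math.sqrt,
    "log: y = a*log(x) + b": math.log,
}


def best_fit(xs, ys):
    results = {}
    for name, transform in TRANSFORMS.items():
        tx = [transform(x) for x in xs]
        a, b = linreg(tx, ys)
        results[name] = (r_squared(tx, ys, a, b), a, b)
    best = max(results, key=lambda name: results[name][0])
    return best, results


xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * math.log(x) + 2.0 for x in xs]  # made-up logarithmic data
best, results = best_fit(xs, ys)
for name in sorted(results):
    print("%-26s R^2 = %.4f" % (name, results[name][0]))
```

A high R-squared from one of these candidates is still only a hint. Plot the fitted curve against the data before trusting it.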

Disclaimer: Remember our problems with Anscombe's quartet: sometimes we fit a regression curve to something that obviously doesn't have a regression pattern to it. We are shoving a round peg into a square hole. It can even happen if we have a library of regression patterns. At this point, we are shoving a round peg into a variety of square holes. If the conclusions drawn from this approach do not make any sense when plotted, then you should abandon this approach.

How to build a Bivariate Regression Tester Software

We can build our own software to do several of these regression tests at once. In fact, that is your homework for tonight.

Let's get a start on it.

I want you to create your own Python module to do this. A python module is just an extra python script that you can call with an "import" statement. Call this script "".

Things we'll need to write:

  • The linreg algorithm, which we've written several times so far in this course.
  • The correlation ratio algorithm.
  • The estimated fit based on our results algorithm.
  • A conversion algorithm for each of our proposed regression equations.

The correlation matrix.

When we discuss the correlation ratio, we are actually talking about something small in a bigger equation: the correlation matrix.

The correlation matrix is a grid of correlation ratios for every variable compared to itself and to every other variable. This will become important when we get to multiple regression analysis, but for now, a true correlation matrix for a bivariate data set produces a 2x2 matrix.
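
A sketch of the 2x2 case in plain Python follows. It assumes that the "correlation ratio" in these notes is the Pearson correlation coefficient; the function names and data are made up. The diagonal entries compare each variable with itself, so they come out as 1, and the matrix is symmetric.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Grid of correlations of every variable against every variable."""
    return [[correlation(a, b) for b in columns] for a in columns]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]   # made-up, nearly linear in xs
matrix = correlation_matrix([xs, ys])
for row in matrix:
    print("  ".join("%.4f" % value for value in row))
```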

The estimated fit based on our results.

This is a simple enough algorithm to write. I was able to do it in one short line with list comprehension.

A conversion algorithm per regression equation.

Do it. I have already shown you how.

Software used to compute LOESS

Day 7. January 10, 2011. Tuesday

T-test of a regression equation

Confidence Intervals

Multiple Regression Analysis

Selecting which variables to use.

In the beginning, we'll use all of them.

Computing the regression line with numpy.

A'Ax = A'd

Solve for x.
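
A minimal NumPy sketch of solving these normal equations is shown below. The data is made up so that the true coefficients are known (d = 1 + 2*x1 + 3*x2), and the column of ones in A is an assumption added so that the fit includes an intercept.

```python
import numpy as np

# Made-up predictors and a response built from known coefficients.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
d = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: one column per variable, plus a column of ones
# for the intercept.
A = np.column_stack([np.ones(len(x1)), x1, x2])

# Solve A'A x = A'd for x.
coeffs = np.linalg.solve(A.T.dot(A), A.T.dot(d))

# The estimated fit is the design matrix times the coefficients.
estimate = A.dot(coeffs)
print(coeffs)
```

For badly conditioned data, np.linalg.lstsq(A, d) is generally a more numerically stable way to get the same coefficients.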

Coming up with an estimate fit.

Using R to evaluate our regression

Using Adjusted R-squared to evaluate our regression

Coding Schemes

Interpretation of Coefficients

F-test of a regression equation

Spare Topics

These are all topics that I hope to reach eventually. It's always good to have a backup plan.

Try/Except blocks

It is still good to check to see if we are allowed to open the file for writing.
>>> try:
...     f = open("my_file", "w")
...     f.write("This will be written to a file (we hope).")
... except IOError:
...     print """The file "my_file" would not open!"""
...

Just like in Java, any exception can be caught with a bare “except” clause. It's better to know what you are catching. I used “IOError” in the example, but how would I know this? Here's a trick: trigger the error in the terminal and it will tell you the exception name to catch.

>>> int("cat")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: invalid literal for int(): cat
>>> file("doesnotexist", "r")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IOError: [Errno 2] No such file or directory: 'doesnotexist'
>>> 1/0
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ZeroDivisionError: integer division or modulo by zero
>>> range(5)[10]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: list index out of range

Python's urllib library

The urllib library is good for downloading the source code of websites. The “urlopen” function returns a file-like object for a web page, and that object's “read” method returns the actual source code.

>>> import urllib
>>> yahoo = urllib.urlopen("").read()
>>> print yahoo

String Interpolation in Python

For many things done in class so far, I've used string interpolation. The syntax that Python uses is borrowed from C's printf function. Because the original printf is so pervasive in computing, many languages have replicated that style of string interpolation in their language. In my opinion, C still does it best. Java's System.out.printf is very good, but it doesn't support all of the conversion symbols that C does. Perl, Ruby, and PHP all have good printf functions. Python has string interpolation built into the language itself, which means you don't need the print statement to use it.

Conversion Symbols and Syntax

  • s: String conversion using str(). This is slow, but works for most needs.
  • d: Decimal number.
  • u: Unsigned decimal number.
  • f: Floating point number.
  • x: Lowercase hexadecimal number
  • X: Uppercase hexadecimal number
  • e: Lowercase exponential number
  • E: Uppercase exponential number
  • %: Percent sign


% (label) - + 0 number .number d
  • (label) Replace with a label name (optional)
  • - Left justifies the printing (optional)
  • + Prints positive numbers with a leading plus sign (optional)
  • 0 Zero pads integer (optional)
  • number Represents the total width of the space needed to print value (optional)
  • .number Represents the minimum number of digits needed to print value (optional). Note: using the above 0 option will override whatever value is used here.
  • d Signals that this is an integer.

Floating point number

% (label) - + 0 number .number f
  • (label) Replace with a label name (optional)
  • - Left justifies the printing (optional)
  • + Prints positive numbers with a leading plus sign (optional)
  • 0 Zero pads integer part (optional)
  • number Represents the total width of the space needed to print value (optional)
  • .number Represents the spaces needed in the decimal part and zero pads out to that many places (optional)
  • f Signals that this is a floating point number.

Using an integer.

>>> b = 4
>>> b
4
>>> "%d" % (b)
'4'
>>> "%5d" % (b)
'    4'
>>> "%05d" % (b)
'00004'
>>> "%+05d" % (b)
'+0004'
>>> "%5.2d" % (b)
'   04'
>>> "%5.3d" % (b)
'  004'
>>> "%05.2d" % (b)
'00004'
>>> "%+05.2d" % (b)
'+0004'

Using a floating point number.

>>> a = 2.13
>>> a
>>> "%.3f" % (a)
'2.130'
>>> "%10.3f" % (a)
'     2.130'
>>> "%010.3f" % (a)
'000002.130'
>>> "%+010.3f" % (a)
'+00002.130'
>>> "%0+10.3f" % (a)
'+00002.130'
>>> "%+010.3f" % (a)
'+00002.130'
>>> "%-+010.3f" % (a)
'+2.130    '

Using labels at variable replacement points within strings

If multiple replacement points are used in a single string, they each must be labeled or trail the string in order. String interpolation points can be labeled using ”%(label)(format code)”. The list that follows the % needs to be a dictionary style list of key/value pairs. Because it is a dictionary style list, it must begin and end with ”{” and ”}” rather than ”(” and ”)”. The argument list can be changed out with different dictionaries.

>>> badjoke = "There are %(animal)s on a %(vehicle)s."
>>> badjoke % {'animal': 'pink elephants', 'vehicle': 'parade float'}
'There are pink elephants on a parade float.'
>>> samjackson = {}
>>> samjackson['animal'] = 'snakes'
>>> samjackson['vehicle'] = 'plane'
>>> badjoke % samjackson
'There are snakes on a plane.'
>>> jameschurch = {}
>>> jameschurch['animal'] = 'aardvarks'
>>> jameschurch['vehicle'] = 'gondola'
>>> badjoke % jameschurch
'There are aardvarks on a gondola.'

Sum of squares

I like programming challenges. I encourage students to solve problems from Javabat or Project Euler to improve their skills. I think it's quite fun. Here's a simple programming task:

Each line of an input file named “” begins with a positive integer n and is followed by n integer values. For each of these lines, write to the screen the sum of the squares of the n values. The end of the file is indicated by a line that begins with 0. Assume that the file is valid.

For example, the sum of squares of 5, 19, and -2 is: 5^2 + 19^2 + (-2)^2 = 390.
3 5 19 -2
2 14 71
5 -3 -11 0 100 -2
4 -10 46 123 1
0

My solution.

#!/usr/bin/env python
if __name__ == '__main__':
    for line in open("", "r"):
        nums = line.split()
        if nums[0] == '0':
            break  # A line that begins with 0 marks the end of the input
        del nums[0]
        print sum([int(i)*int(i) for i in nums])

Output of executing


Random Numbers

Random numbers can easily be generated using the “random” module.


random.randint(a,b) Returns a random integer number, a <= Value <= b

random.uniform(a,b) Returns a random floating point number, a <= Value < b

random.random() Returns a random floating point number in the range of 0.0 <= Value < 1.0

random.choice(list) Returns a randomly selected item from list

random.shuffle(list) Shuffles all of the values in list in place, returns nothing

>>> random.random() # Random float x, 0.0 <= x < 1.0
>>> random.uniform(1, 10) # Random float x, 1.0 <= x < 10.0
>>> random.randint(1, 10) # Integer from 1 to 10, endpoints included
>>> random.randrange(0, 101, 2) # Even integer from 0 to 100
>>> random.choice('abcdefghij') # Choose a random element
>>> items = [1, 2, 3, 4, 5, 6, 7]
>>> random.shuffle(items)
>>> items
[7, 3, 2, 5, 6, 4, 1]
>>> random.sample([1, 2, 3, 4, 5], 3)
[4, 1, 5]

Regular Expressions

A “Regular Expression” is a way of describing the format of some textual information. In a mathematical sense, a regular expression is equivalent to a finite state automaton. We use Regular Expressions to validate complex textual patterns to make sure they conform to a particular requirement, such as the format of a date, a phone number, or an email address. Regular Expression libraries are found in several open source tools, including Python.

  • A regular expression is defined as a sequence of atoms or groups of atoms.
  • An atom is any single character.
  • A group of atoms is defined by a sequence of atoms inside ”()” parentheses.
  • By default, a regular expression can match anywhere inside of a string.

For example, the expression “cat” will match inside any string containing the letter c, followed by the letter a, followed by the letter t. This includes “cat”, “catapult”, and “Yucatan”.

An atom or group of atoms can be modified to signify how often that atom is allowed to be repeated. The modifier symbol immediately follows an atom or group. There are 4 modifier symbols (?, *, +, and {}), and the braces come in several forms.

  • ?, min: 0, max: 1, Makes an atom optional
  • *, min:0, max: Infinite, Greedy
  • +, min:1, max: Infinite, Greedy
  • {4}, min: 4, max: 4
  • {3,5}, min: 3, max: 5, Greedy
  • {3,}, min:3, max: Infinite, Greedy
  • {,3}, min:0, max: 3, Greedy
  • {,}, Varies depending on regex engine. Sometimes this means '*'. Sometimes this means {0}.

An Alternation allows a sample of text to match more than one regular expression. To combine regular expressions into one expression, separate them using the “|” pipe. The regular expression “cat|dog” will match either the letter c, followed by the letter a, followed by the letter t, or the letter d, followed by the letter o, followed by the letter g. This expression will match “cat”, “catapult”, and “Yucatan”, as well as “dog”, “dogma” and “boondoggled”.

A Character Class is an atom that matches any one of several characters. Typically, this is defined by packing characters into ”[]”. To negate a character class, so that it matches any character not listed, make the first character in that class the “^” symbol.
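
The rules above can be tried out with Python's re module. The sketch below is only an illustration with made-up test strings. It uses the “^” and “$” anchors, which pin a match to the start and end of the string, so that the repetition counts can be checked exactly.

```python
import re

# A plain expression matches anywhere inside a string.
words = ["cat", "catapult", "Yucatan", "dog"]
contains_cat = [w for w in words if re.search("cat", w)]

# "?" makes the preceding atom optional (0 or 1 times).
optional_u = [bool(re.search("^colou?r$", w))
              for w in ("color", "colour", "colouur")]

# "{3,5}" allows between 3 and 5 repetitions.
three_to_five = [bool(re.search("^a{3,5}$", "a" * k)) for k in range(1, 7)]

# "+" is greedy: it grabs as many repetitions as it can.
greedy = re.search("a+", "caaat").group(0)

# Alternation matches either expression.
pets = [w for w in ("dogma", "boondoggled", "bird")
        if re.search("cat|dog", w)]

# A character class matches one character from the set;
# a leading "^" inside the brackets negates it.
vowels = re.findall("[aeiou]", "regular")
non_vowels = re.findall("[^aeiou]", "regular")

print(contains_cat)
print(vowels, non_vowels)
```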

Python's Regular Expressions

>>> import re
>>> re.search("dog","dogma")
<_sre.SRE_Match object at 0xb7e8f058>
>>> if re.search("dog","dogma"):
...     print "Hello"
...
>>> m = re.search("dog","dogma")
>>> m = re.search("^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$","")
>>> m
<_sre.SRE_Match object at 0xb7e8f058>