## Welcome to the Python course @ Precima

Hopefully everyone has read the syllabus, cheat sheets and course outline on the website.

They are available at www.jeremy.kiwi.nz/pythoncourse.

All lessons are made as iPython notebooks and will be available on this site to download. Grab todays lesson from [this link!](/pythoncourse/assets/notebooks/r&d/lesson 00.ipynb).

Today we will briefly go over course logistics and the Anaconda environment and then dive into Python data types.

### Python

Python is a programming language first released in 1991 by Guido van Rossum and one of the most popular languages used in computing today. Python currently ranks as the [5th most common language used on github](https://github.com/blog/2047-language-trends-on-github). 

Python is used as a scripting language, as well as in web development and to create applications - some of the more popular websites and applications running at least partially on Python include: Google, Youtube, Facebook, Instagram, Reddit, Dropbox, Civilization IV, EVE Online and BitTorrent.

Python as a language is based on readability, flexibility, simplicity and extensibility. 

The extensibility of Python has caused it to be adopted, along with R, as one of the premier data science programming languages.

Python has had many added on modules (or libraries) added to it, to allow data science work which we will cover in this course.

In general compared to R, Python is faster, more programmer focussed and less restrictive in licensing. The downside is new statistical methods tend to appear in R before Python, although as the community grows, this has become less of a problem.

### Anaconda, iPython and Spyder

We will use Anaconda, a distribution of Python by Continuum Analytics, put together for use in data science. 

Anaconda comes with most of the modules we need for data analysis, as well as Jupyter notebooks, and the IDE we will use for the first couple of lessons, Spyder.

We have installed the launcher, which allows updating and launching these apps, the Spyder IDE which allows coding and running scripts in an integrated environment, the iPython-QT console, which is an advanced console allowing inline graphing, and the iPython notebook, which allows development in interactive notebooks in the browser. These programs all run Python code - the difference is in how you interact with the environment. 

iPython notebooks are currently in the process of being rebranded into Jupyter notebooks - the launcher and documentation will refer to them interchangeably.

Coding along with the lesson is encouraged! 



### 2.7 vs 3

In 2008, Python version 3.0 was released. Due to the number and nature of changes, Python 2.6 and Python 3.0 were not compatible. This has lead to a split in the Python community, as many users were unwilling to fix existing code to 3 compatibility, and have since continued to develop Python 2.6 into 2.7, while 3.0 has been developed in parallel and currently stands at version 3.5.

In this course, we will use Python 3. This is due to the majority of users having no legacy code to worry about, the better memory management, and the availability of the data science stack in Python 3. If you are a die hard Python 2.7 user, feel free to continue using it, although you will need to fix the code yourself.

In the level we will be coding at the changes are not too big, the largest differences we will see are `print()` vs `print`, `xrange` vs `range` and other generators. Python code found online will often be 2.7, but should be readable by a 3 trained user.


### Course Logistics

This stream of the course will cover applied data science in Python. We will start by covering the basic data types and structures, move on to loops and statements, then functions and classes. In lessons 4 and 5 we will introduce the standard data  modules, Pandas and NumPy, then move on into graphing, reproducible research and machine learning in the later stages of the course. Please let me know if you have any particular applications or problems, and we will try and address the common ones in class.

We are tracking progress, to make sure this course is giving value for your time - Please make sure you answer the quizzes and assessments we will send out later in the course.

### Outside Resources

A number of people have asked about text books or online resources for the course. Most Python tutorials online will assume a knowledge at around the level we will reach at the end of lesson 3. I'm using [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) for the numpy and pandas section - It won't be necessary to purchase but is a good book if a little outdated. For the other lessons, links will be provided to the relevant docs.



## Basic Data Types

### Numbers

Python has ints, floats and complex numbers - we can check using the `type` function

In [1]:
type(3)

int

In [2]:
type(3.0)

float

In [4]:
type(1 + 0.3J)

complex

Internally, floats are stored as double in `C` - use sys.float_info for more information about tolerances etc.

Complex is stored as two floats - the real and imaginary parts.

Integers have unlimited precision

In [5]:
import sys
sys.float_info

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

In [6]:
z = 10**1000000 +1
z % 10

1

Python 2.7 has another class, `large` for large numbers, Python 3 makes the conversion implicilty and keeps type `int`:

In [7]:
print(type(z))
del(z)

<class 'int'>


We can assign values to variables:

In [8]:
a = 23
b = 2 ** 4
c, d = 2 - 4, 3 / 2 

There are some rules about variable names - they cannot start with numbers, must be alphanumeric with no spaces, and should be lowercase with underscores and somewhat informative about what they represent:

Check out the [official Python style guide](http://docs.python-guide.org/en/latest/writing/style/), or [Googles Python style guide](https://google.github.io/styleguide/pyguide.html) for more tips regarding variable names and general coding style. 

In iPython based interpreters, we can use `whos` to see all our declared variables (or use the spyder IDE)

In [9]:
whos

Variable   Type      Data/Info
------------------------------
a          int       23
b          int       16
c          int       -2
d          float     1.5
sys        module    <module 'sys' (built-in)>


### Warning - Floating Point Errors 

We would expect the below to be equal to -0.2 - what is going on?

In [10]:
-0.1 + 0.2 - 0.3

-0.19999999999999998

This is a floating point error. It occurs as computers work in binary and 1/10 is a repeating fraction in binary, much like 1/3 is in base 10. We will discuss mitigation techniques later in the course - for now be aware when working with floats.

### Booleans

We have the standard boolean operators

Which can then be chained using `and` `or` with not as a negation

In [11]:
1 == 1 and 2 >= 3

False

In [12]:
1 <= 1 or 2 > 3 

True

In [13]:
not True

False

In addition to True and False, we also have None, which we will discuss later in the course

In Python, booleans are short circuit operators - once one conditon is impossible we stop evaluating. this can get a good speed up on code if you have one expensive comparison and one that is not - put the cheap calculation first

In [14]:
#del(l)
x = 5
#l is not defined! This would give and error if it hits l 
print(x == 5 or l == 4)
print(x == 3 and l == 4)

True
False


### Warning - Bitwise Operations

You probably don't want to use `|`, `&` or `^` which many other languages do for `or` `and` and `not`- they are for [bitwise comparison](https://wiki.python.org/moin/BitwiseOperators).

In [15]:
12 | 9

13

In [16]:
print(bin(12))
print(bin(9))
print(bin(13))

0b1100
0b1001
0b1101


### Strings

Strings are character type data:

In [17]:
a = "abcde"
g = 'use the \\ to escape \'special\' characters or \n make new lines and \t tabs'
print(g)

use the \ to escape 'special' characters or 
 make new lines and 	 tabs


In [18]:
g ='''
this is 
a long
string'''
print(g)


this is 
a long
string


We can do (some) math on strings

In [19]:
print(a * 2)
print(a + d)
print(a * a)

abcdeabcde


TypeError: Can't convert 'float' object to str implicitly

## Subsetting

When we want to get a subset of a string (or many other Python objects) we use subsetting.

Python is a 0-indexed language, meaning we start counting at 0. For a digression, see here for a [possible origin of 0 indexing](http://exple.tive.org/blarg/2013/10/22/citation-needed/). 

This leads to the famous quote "The two big problems in computer science are cache invalidation, naming things and off by one errors".

In [20]:
my_string = 'Hello world'
#a single element
print(my_string[0])
#a slice
print(my_string[0:3])
#from the start
print(my_string[:3])
#to the end
print(my_string[3:])
#negative indices count from the end
print(my_string[-3:])

H
Hel
Hel
lo world
rld


Indexing can also use a third place ie `x[from:to:by]`, but this is not used often in practice.

In [21]:
print(my_string[::2])
print(my_string[::-1])
print(my_string[0:6:2])

Hlowrd
dlrow olleH
Hlo


Strings are immutable - we cannot easily change elements of them in place:

In [22]:
my_string[0] = 'h'

TypeError: 'str' object does not support item assignment

### Methods

Methods are functions which are specific to a certain type of object. They are called using x.method()

We can see a list in Spyder by using tab or use help(object) to get a list

In [23]:
my_string = 'abcde'
print(my_string.upper())
print(my_string.upper().lower())
print(my_string.capitalize())
#help(my_string)

ABCDE
abcde
Abcde


### Functions

We have a number of built in functions in Python, similar to methods but not linked to a specific class. See the list at the [official Python docs website](https://docs.python.org/3.5/library/functions.html)

For help, use help(function) or read the online docs.


### Print Formatting

We can format our print output using our defined variables:

In [24]:
x = 13.13
print('blah blah %s' %(x))
print('Floating point numbers: %1.2f' %(13.144))
print('First: %s, Second: %1.2f, Third: %r' %('hi!', 3.14, 22))
print('First: {y}, Second: {x}, Third: {z}'.format(y = "hi", z = 12.1, x = 12))

blah blah 13.13
Floating point numbers: 13.14
First: hi!, Second: 3.14, Third: 22
First: hi, Second: 12, Third: 12.1


## Basic Data Containers


### Lists

Lists are the workhorse of Python - they are a way of containing multiple pieces of data in a single place.

Lists may be recognised as similar to arrays in other languages - we don't need to preassign type, size or any other attributes


In [25]:
my_list = [1, 2, 3, 4]
#subsetting similar to strings
print(my_list[0:2])
#add similarly to strings
print(my_list * 2)
print(my_list + [1, 2, 3])
print(my_list + [[1, 2, 3, 4]])

[1, 2]
[1, 2, 3, 4, 1, 2, 3, 4]
[1, 2, 3, 4, 1, 2, 3]
[1, 2, 3, 4, [1, 2, 3, 4]]


Lists can contain multiple data types, including other lists:

In [26]:
my_list = [True, "hi", 1, 3.4, [1, 2, 3]]
print(type(my_list))
#nested subset
print([my_list[1][1], my_list[-1][-1]])

<class 'list'>
['i', 3]


Lists are mutable

In [27]:
my_list[1] = "bye"
print(my_list)

[True, 'bye', 1, 3.4, [1, 2, 3]]


### List Methods

Lists have their own methods. Many of these will modify a list in place, so beware

In [28]:
my_list = [True, "hi", 1, 3.4, [1, 2, 3]]
my_list.append('a new list item')
my_list

[True, 'hi', 1, 3.4, [1, 2, 3], 'a new list item']

In [29]:
my_list.extend([1,2,3])
my_list

[True, 'hi', 1, 3.4, [1, 2, 3], 'a new list item', 1, 2, 3]

In [30]:
my_list = [3,4,1,2]
my_list.sort()
print(my_list)
my_list.sort(reverse=True)
print(my_list)

[1, 2, 3, 4]
[4, 3, 2, 1]


### Warning

Python lists are not copied on reassignment. This will cause problems!

In [31]:
my_list = [1, 2, 3, 4]
my_list2 = my_list
my_list2[0] = 5
my_list

[5, 2, 3, 4]

We can fix it using slicing, copying or the list() function:

In [32]:
my_list = [1, 2, 3, 4]
my_list2 = my_list[:]
my_list2[1] = 5
my_list

[1, 2, 3, 4]

In [33]:
my_list = [1, 2, 3, 4]
my_list2 = list(my_list)
my_list2[1] = 5
my_list

[1, 2, 3, 4]

In [34]:
my_list = [1, 2, 3, 4]
my_list2 = my_list.copy()
my_list2[1] = 5
my_list

[1, 2, 3, 4]

### Tuples

Tuples are more or less immutable lists, and as such have less methods associated with them

In [35]:
my_tup = (1,2,3,4,5,1)
print(my_tup.index(5))
print(my_tup[2])
print(my_tup.count(1))
print(list(my_tup))
my_tup[0] = 1

4
3
2
[1, 2, 3, 4, 5, 1]


TypeError: 'tuple' object does not support item assignment

In [36]:
#multiple assignment technically uses tuples
a,b,c,d = 1,2,3,4

Tuples are a great choice when using parameters in a script - we cannot overwrite them by mistake without reassigning the entire tuple

### Sets

Sets work like the mathematical notion of sets - only unique elements are allowed, and they are unordered

In [37]:
my_set = {1,2,3,4,5,1,2}
print(my_set)
print(type(my_set))

{1, 2, 3, 4, 5}
<class 'set'>


In [38]:
# no subsetting by index!
# hence no mutability either
my_set[0]

TypeError: 'set' object does not support indexing

### Dictionaries

Python dictionaries are a very useful data structure, which we will modify into DataFrames inside pandas.

For now they are very similar to hashes or lookup tables from other languages - we access data inside them by key, rather than index. As such, they are unordered, similar to sets.


In [39]:
#key:values pairs, keys must be unique
my_dict = {'key1' : "val1", 'key2':'val2', 1:[1,2,3,4], 2:"a"}
print(my_dict)

{1: [1, 2, 3, 4], 'key2': 'val2', 2: 'a', 'key1': 'val1'}


In [40]:
print(my_dict[1])
print(my_dict['key2'])
print(my_dict[0])

[1, 2, 3, 4]
val2


KeyError: 0

In [41]:
my_dict['newkey'] = 'newval'
my_dict.update({"newkey2":'newval2'})
print(my_dict)

{1: [1, 2, 3, 4], 'key2': 'val2', 'newkey2': 'newval2', 'newkey': 'newval', 2: 'a', 'key1': 'val1'}


In [42]:
print(my_dict.keys())
print(my_dict.items())
print(my_dict.values())

dict_keys([1, 'key2', 'newkey2', 'newkey', 2, 'key1'])
dict_items([(1, [1, 2, 3, 4]), ('key2', 'val2'), ('newkey2', 'newval2'), ('newkey', 'newval'), (2, 'a'), ('key1', 'val1')])
dict_values([[1, 2, 3, 4], 'val2', 'newval2', 'newval', 'a', 'val1'])


In [43]:
my_dict['key1']

'val1'

## Python Statements

Python uses indentation, rather than braces, as its primary way of parsing loops and statements. This was originally to force some degree of readability on code.

### If, Else, Elif

We can do if and else statements

In [44]:
x = 5

if not True:
    print(x)
elif x == 5:
    print(x+1)
else:
    print("no")


#if True:
#print(x)

6


and nest them as deep as we would like:

In [45]:
x = [10]

if x[0] >= 10 and len(x) == 1:
    if x[0] % 2 == 0:
        if abs(x[0]) > 50:
            print(x)
        else:
            print(x[0] + 1)
            if x[0] // 3 == 3:
                x.append(1)
    elif x[0] * 3 > 25:
        x.append("hi")
    else:
        x.append("bye")
else:
    print(x[0]+5)
x

11


[10, 1]

### For loops

For loops work similarly in syntax to if and else. We can use them on any of the major data structures we just introduced

In [46]:
l = [1,2,3,4,5,6,7,8]

for num in l:
    if num % 2 == 0:
        print("even")
    else:
        print("odd")
        
print("\n")
s = {1,2,3,4,5,6,7,8}

for num in s:
    if num % 2 == 0:
        print("even")
    else:
        print("odd")

odd
even
odd
even
odd
even
odd
even


odd
even
odd
even
odd
even
odd
even


On dictionaries, we have to do it a little differently

In [47]:
d = {'k1':'v1', 'k2':'v2', 'k3':'v3'}

for k, v in d.items():
    print(k + v)
    
#similar for nested lists:

l = [[1,2],[3,4],[5,6],[7,8]]

for one, two in l:
    print(one * two)
    #no modifying!
    one = one * 2
l

k3v3
k2v2
k1v1
2
12
30
56


[[1, 2], [3, 4], [5, 6], [7, 8]]

### While loops

Similarly, we can use while loops, and run a loop until a condition is met

In [48]:
x = 5

while x > 1:
    print(x)
    x -= 1

5
4
3
2


### Break, Pass, Continue

We can skip, break or pass elements in our loops

In [49]:
l = [1,2,3,4,"hi",6]

for num in l:
    if type(num) == int:
        print(num)
    else:
        print("it's not a number")
        break

for num in l:
    if type(num) == int:
        print(num)
    else:
        print("it's not a number")
        continue
        print("how did you get here?")
        
for num in l:
    if type(num) == int:
        print(num)
    else:
        print("it's not a number")
        pass
        print("how did you get here?")

        

1
2
3
4
it's not a number
1
2
3
4
it's not a number
6
1
2
3
4
it's not a number
how did you get here?
6


continue and break only go up to the nearest closest loop - if we are nested we don't go right to the end of the statement:

In [50]:
l = [1,2,3,4,"hi",6]

for num in l:
    if type(num) == int:
        if num % 2 == 1:
            continue
            print(num)
        else:
            print(num)
    else:
        continue

2
4
6


### Range

we saw earlier, we cannot modify in place using a loop. We can instead use `range` to generate indices and use these (NB range in Python 3 is similar to xrange in Python 2.7, not `range`). Range objects are not instantly ennnumerated - we dont have to hold them in memory.

In [51]:
#similar to the subset from:to:by
range(0,10,2)
type(range(0,10))

range

In [52]:
l = [1,2,3,4,5,6]
range(len(l))

range(0, 6)

In [53]:
for i in range(len(l)):
    l[i] = l[i]*2
l

[2, 4, 6, 8, 10, 12]

In [54]:
#also useful for generating simple lists:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

### Comprehensions

Comprehensions are are syntactic sugar for for loops. We can make lists (or set or dicts) using this method. Tuples might seem like they should also have a method, but this is reserved for generators, which will discuss later on in the course.


In [55]:
#for loop
l = []
for i in range(10):
    l.append(i * 5)
l

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

In [56]:
#comprehension
[i * 5 for i in range(10)]

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

In [57]:
#dict comprehension
{i * 10 : j * 2  for i, j in {1: 'a', 2: 'b'}.items()}

{10: 'aa', 20: 'bb'}

In [58]:
celsius = [0,100,25,37]
[((9/5)*temp + 32) for temp in celsius]

[32.0, 212.0, 77.0, 98.60000000000001]

In [59]:
#nested
matrix = [[1,2,3],[4,5,6],[7,8,9]]
[[el * 2 for el in row] for row in matrix]

[[2, 4, 6], [8, 10, 12], [14, 16, 18]]

### Ternary expressions

Ternary expressions are a fancy way of doing `if` `else` and is very pythonic:

In [60]:
a = 12
a if a > 11 else b

12

In [61]:
a = 10
b = "no"
a if a > 11 else b

'no'

we can chain this with our comprehensions

In [62]:
[a if a else 2 for a in [0,1,0,3]]

[2, 1, 2, 3]

or our nested comprehensions

In [63]:
matrix = [[1,2,3],[4,5,6],[7,8,9]]
[[el * 2 if el % 2 == 0 else el for el in row] for row in matrix]

[[1, 4, 3], [8, 5, 12], [7, 16, 9]]

As these comprehensions are not faster than a for loop, it is worth using for loops once you can't follow clearly what the comprehension is doing (I would almost never use a nested comprehension)! 

## Intro to Functions

Functions are one of the key parts of any programming language. Today we will touch on basic syntax and definition, and move into a more thorough exploration in the next class.

In [64]:
def celcius_to_fahr(temp):
    return (9/5)*temp + 32

celcius_to_fahr(100)

212.0

In [65]:
def kelvin_to_celcius(temp):
    return temp - 273

kelvin_to_celcius(100)

-173

In [66]:
def kelvin_to_fahr(temp):
    return celcius_to_fahr(kelvin_to_celcius(temp))

kelvin_to_fahr(273)

32.0

In [67]:
def myfirstfun(arg1, arg2):
    '''Here is the docstring. this will be displayed to a user calling help(myfirstfun)
    It can be as long as you'd like.
    This argument takes two arguments, and returns the sum of them
    '''
    return(arg1 + arg2)

help(myfirstfun)

Help on function myfirstfun in module __main__:

myfirstfun(arg1, arg2)
    Here is the docstring. this will be displayed to a user calling help(myfirstfun)
    It can be as long as you'd like.
    This argument takes two arguments, and returns the sum of them



In [68]:
myfirstfun(1, 2)

3

In [69]:
myfirstfun("h", "i")

'hi'

We might not want our function to work on strings! We can Raise an error is the arguments are not numeric:

In [70]:
def myfirstfun(arg1, arg2):
    '''Here is the docstring. this will be displayed to a user calling help(myfirstfun)
    It can be as long as you'd like.
    This argument takes two arguments, and returns the sum of them
    '''
    for i in arg1, arg2:
        assert type(i) == int or type(i) == float 
    return(arg1 + arg2)

myfirstfun(1,2)

3

In [71]:
myfirstfun("h", "i")

AssertionError: 

That's it for today. Next week we will move onto advanced functions - scope, recursion, decorators and hashing, as well as classes - how to define our own data types and methods

## Motivation

How much can we actually learn and predict from Data Science using Python?

Recently, a match fixing ring in professional tennis was alleged by a joint investigation between [BBC news](http://www.bbc.com/sport/tennis/35319202) and [Buzzfeed](http://www.buzzfeed.com/heidiblake/the-tennis-racket#.eplO3d4px), resulting in a large amount of news coverage. 

The data analysis carried out was done in Python, and released online as an [iPython Notebook](https://github.com/BuzzFeedNews/2016-01-tennis-betting-analysis/blob/master/notebooks/tennis-analysis.ipynb). This story, and its continued fall out, was [front page news on the Guardian](http://www.theguardian.com/sport/2016/feb/09/revealed-tennis-umpires-secretly-banned-gambling-scam) yesterday (9-Feb-2016).

Have a read through the notebook, take note of the functions and methods used on the data, and see if you believe the analysis.