# Introduction to Python for Data Science

## Welcome! <a id='welcome'></a>

This is a 4-week (8-hour) course that will introduce you to the basics of handling, manipulating, and modeling data with Python. This notebook is a review of some Python language essentials. We'll go over this material during the first class, but taking a look at it before then will help you out, especially if you haven't used Python before.

The environment you're in right now is called a Jupyter notebook. [Project Jupyter](http://jupyter.org/) is an interactive environment that data scientists use for collaborating and communicating the results of projects. Each cell in the notebook can either contain text or code (often Python, but R, Julia, and lots of other languages are supported). This allows you to combine snippets of code with explanations and documentation.

Each cell in a notebook can be executed independently, but object and function declarations persist across cells. For example, I can define a variable in one cell...

In [4]:
my_variable = 10

... and then access that variable in a later cell:

In [5]:
print(my_variable)

10


We'll be using Jupyter notebooks extensively in this class. I'll give a more detailed introduction during the first class, but for now, the most important thing is to understand how to run code in the notebook.

As I mentioned above, there are two fundamental types of cells in a notebook - text (i.e. markdown) and code. When you click on a code cell, you should see a cursor appear in the cell that allows you to edit the code in that cell. A cell can have multiple lines - to begin a new line, press `Enter`. When you want to run the cell's code, press `Shift`+`Enter`.

Try changing the values of the numbers that are added together in the cell below, and observe how the output changes:

In [1]:
a = 10
b = 15
print(a + b)

25


You can also edit the text in markdown cells. To display the editable, raw markdown in a text cell, double click on the cell. You can now put your cursor in the cell and edit it directly. When you're done editing, press `Shift`+`Enter` to render the cell into a more readable format.

Try editing the text cell below with your name:

**Make some edits here ->** Hello, my name is Nick!

To change whether a cell contains text or code, use the drop-down in the toolbar. When you're in a code cell, it will look like this:

![](images/code_cell.png)

and when you're in a text cell, it will look like this:

![](images/markdown_cell.png)

Now that you know how to navigate the notebook, let's review some basic Python.

## What is Python? <a id='whatispython'></a>

This is actually a surprisingly tricky question! There are (at least) two answers:

- A language specification
- A program on your computer that interprets and executes code written to that language specification

### Python (the language)

Python is an open source programming language that is extremely popular in the data science and web development communities. The roots of its current popularity in data science and scientific computing have an interesting history, but suffice to say that it's darn near impossible to be a practicing data scientist these days without at least being familiar with Python.

The guiding principles behind the design of the Python language specification are described in "The Zen of Python", which you can find [here](https://www.python.org/dev/peps/pep-0020/) or by executing:

In [62]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


To boil this down a bit, Python syntax should be easy to write, and well-written Python code should be easy to read. Code that follows these norms is called *Pythonic*. We'll touch a bit more on what it means to write Pythonic code in class.

#### What's the deal with whitespace?

A distinctive feature of Python is that *whitespace matters*, because indentation defines scope. Many other programming languages use braces or `begin`/`end` keywords to define scope. For example, in JavaScript, you write a `for` loop like this:

```
var count;
for (count = 0; count < 10; count++) {
    console.log(count);
    console.log("<br />");
}
```

The curly braces here define the code executed in each iteration of the for loop. Similarly, in Ruby you write a `for` loop like this:

```
for count in 0..9
   puts "#{count}"
end
```

In this snippet, the code executed in each iteration of the `for` loop is whatever comes between the first line and the `end` keyword.

In Python, `for` loops look a bit different:

In [13]:
print('Entering the for loop:\n')
for count in range(10):
    print(count)
    print('Still in the for loop.')

print("\nNow I'm done with the for loop.")

Entering the for loop:

0
Still in the for loop.
1
Still in the for loop.
2
Still in the for loop.
3
Still in the for loop.
4
Still in the for loop.
5
Still in the for loop.
6
Still in the for loop.
7
Still in the for loop.
8
Still in the for loop.
9
Still in the for loop.

Now I'm done with the for loop.


Note that there is no explicit symbol or keyword that defines the code executed during each iteration - it's the indentation that defines the scope of the loop. When you define a function or class, or write a control structure like a `for` loop or `if` statement, you should indent the next line (4 spaces is customary). Each subsequent line at that same level of indentation is considered part of the scope. You only escape the scope when you return to the previous level of indentation.
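To see how this plays out with more than one level, here is a minimal sketch (the function name `describe_numbers` is made up for this example). Each extra level of indentation opens a new block, and returning to a shallower level closes it:

```python
def describe_numbers(limit):
    """Label each integer below `limit` as even or odd."""
    labels = []
    for number in range(limit):      # body of the function: one level in
        if number % 2 == 0:          # body of the for loop: two levels in
            labels.append('even')    # body of the if: three levels in
        else:
            labels.append('odd')
    return labels                    # back to one level: the loop has ended


result = describe_numbers(4)
print(result)
```

If you accidentally indent the `return` line one level further, it becomes part of the loop and the function returns after the first iteration, which is a common beginner bug.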

### Python (the interpreter)

If you open up the terminal on your computer and type `python`, it runs a program that looks something like this:

![](images/repl.png)

This is a program called CPython (written in C, hence the name) that parses, interprets, and executes code written to the Python language standard. CPython is known as the "reference implementation" of Python - it is an open source project (you can [download](https://www.python.org/downloads/source/) and build the source code yourself if you're feeling adventurous) supported by the [Python Software Foundation](https://www.python.org/psf/). For most of its history it was led by Guido van Rossum, the original creator of Python, who held the title "Benevolent Dictator for Life" until he stepped down in 2018.

When you type simply `python` into the command line, CPython brings up a REPL (**R**ead-**E**val-**P**rint **L**oop, pronounced "repple"), which is essentially an infinite loop that reads each line as you write it, evaluates (executes) the code, and prints the result.

For example, try typing

```
>>> x = "Hello world"
>>> print(x)
```

in the REPL. After you hit `Enter` on the first line, the interpreter assigns the value "Hello world" to a string variable `x`. After you hit `Enter` on the second line, it prints the value of `x`.

We can accomplish the same result by typing the same code

```
x = "Hello world"
print(x)
```

into a file called `test.py` and running `python test.py` from the command line. The only difference is that when you provide the argument `test.py` to the `python` command, the REPL doesn't appear. Instead, the CPython interpreter interprets the contents of `test.py` line-by-line until it reaches the end of the file, then exits. We won't use the REPL much in this course, but it's good to be aware that it exists. In fact, behind the pretty front end, this Jupyter notebook is essentially just wrapping the CPython interpreter, executing commands line by line as we enter them.

So to review, "Python" sometimes refers to a language specification and sometimes refers to an interpreter that's installed on your computer. We will use the two definitions interchangeably in this course; hopefully, it should be obvious from context which definition we're referring to.

## Variables, Objects, Operators, and Naming

One fundamental idea in Python is that *everything is an object*. This is different from some other languages like C and Java, which have fundamental, primitive datatypes like `int` and `char`. In Python, even things like integers and strings have attributes and methods that you can access. For example, if you want to read some documentation about an object such as `thing_1` below, you can access its `__doc__` attribute like this:

In [16]:
thing_1 = 47    # define an int object
print(thing_1.__doc__)

int(x=0) -> integer
int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments
are given.  If x is a number, return x.__int__().  For floating point
numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string,
bytes, or bytearray instance representing an integer literal in the
given base.  The literal can be preceded by '+' or '-' and be surrounded
by whitespace.  The base defaults to 10.  Valid bases are 0 and 2-36.
Base 0 means to interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4


In [17]:
thing_1 = 'blah'    # reassign thing_1 to a string object
print(thing_1.__doc__)

str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.


To learn more about what attributes and methods a given object has, you can call `dir(my_object)`:

In [18]:
dir(thing_1)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

That's interesting - it looks like the string object has a method called `__add__`. Let's see what it does -

In [19]:
thing_2 = 'abcd'
thing_3 = thing_1.__add__(thing_2)
print(thing_3)

blahabcd


So calling `__add__` with two strings creates a new string that is the concatenation of the two originals. As an aside, there are a lot more methods we can call on strings - `split`, `upper`, `find`, etc. We'll come back to this.

The `+` operator in Python is just syntactic sugar for the `__add__` method:

In [20]:
thing_4 = thing_1 + thing_2
print(thing_4)
print(thing_3 == thing_4)

blahabcd
True


Any object you can add to another object in Python has an `__add__` method. With integer addition, this works exactly as we would expect:

In [21]:
int_1 = 11
int_2 = 22
sum_1 = int_1.__add__(int_2)
sum_2 = int_1 + int_2
print(sum_1)
print(sum_2)
print(sum_1 == sum_2)

33
33
True


But it's unclear what to do when someone tries to add an `int` to a `str`:

In [22]:
thing_1 + int_1

TypeError: Can't convert 'int' object to str implicitly
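Python refuses to guess here, so the fix is to convert explicitly. A minimal sketch, reusing the values from the cells above (the names `as_text` and `as_number` are just for illustration):

```python
thing_1 = 'blah'
int_1 = 11

# Convert the int to a str if you want concatenation...
as_text = thing_1 + str(int_1)
print(as_text)

# ...or convert a numeric string to an int if you want arithmetic.
as_number = int('22') + int_1
print(as_number)
```

Which conversion is right depends on what you mean, which is exactly why Python makes you say it.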

### Data types

There are a few native Python data types, each of which we'll use quite a bit. The properties of these types work largely the same way as they do in other languages. If you're ever confused about what type a variable `my_var` is, you can always call `type(my_var)`.

#### Booleans

Just like in other languages, `bool`s take values of either `True` or `False`. All of the traditional Boolean operations are present:

In [24]:
bool_1 = True
type(bool_1)

bool

In [43]:
dir(bool_1)

['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']

In [25]:
bool_2 = False

In [26]:
bool_1 == bool_2

False

In [27]:
bool_1 + bool_2

1

In [28]:
bool_1 and bool_2

False

In [31]:
type(bool_1 * bool_2)

int
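The arithmetic results above may look odd at first. They work because `bool` is a subclass of `int`, with `True` behaving like 1 and `False` like 0. Here is a small sketch of why that's handy (the `temperatures` list is made-up example data):

```python
# bool is a subclass of int, so True and False act as 1 and 0 in arithmetic
print(isinstance(True, int))

# The logical operators are spelled out as words
print(True or False, not True)

# A common use: count how many values satisfy a condition
temperatures = [61, 75, 82, 58, 90]
is_warm = [t > 70 for t in temperatures]
n_warm = sum(is_warm)
print(is_warm)
print(n_warm)
```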

#### Integers

Python `int`s are whole (positive, negative, or 0) numbers, implemented as `long` objects of arbitrary size, so they can grow as large as memory allows instead of overflowing. Again, all of the standard operations are present:

In [40]:
int_1 = 2
type(int_1)

int

In [44]:
dir(int_1)

['__abs__',
 '__add__',
 '__and__',
 '__bool__',
 '__ceil__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floor__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__index__',
 '__init__',
 '__int__',
 '__invert__',
 '__le__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__',
 '__round__',
 '__rpow__',
 '__rrshift__',
 '__rshift__',
 '__rsub__',
 '__rtruediv__',
 '__rxor__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 '__xor__',
 'bit_length',
 'conjugate',
 'denominator',
 'from_bytes',
 'imag',
 'numerator',
 'real',
 'to_bytes']

In [41]:
int_2 = 3
print(int_1 - int_2)

-1


In [48]:
int_1.__pow__(int_2)

8

In [46]:
int_1 ** int_2

8

One change from Python 2 to Python 3 is the default way that integers are divided. In Python 2, the result of `2/3` is `0`, the result of `4/3` is `1`, etc. In other words, dividing integers in Python 2 always returned an integer, with the result rounded down (floored) and the remainder discarded. In Python 3, dividing two integers with `/` always returns a `float` that approximates the exact quotient. For example:

In [63]:
int_1 / int_2

0.6666666666666666

In [64]:
type(int_1 / int_2)

float

In [65]:
int_1.__truediv__(int_2)

0.6666666666666666

In [53]:
int_1.__divmod__(int_2)

(0, 2)
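If you do want integer division in Python 3, the `//` operator (the `__floordiv__` method) provides it, and `%` gives the remainder. A short sketch using the same values as above (the result variable names are just for illustration):

```python
int_1 = 2
int_2 = 3

true_quotient = int_1 / int_2     # true division: always a float
floor_quotient = int_1 // int_2   # floor division: rounds toward negative infinity
remainder = int_1 % int_2         # remainder left over after floor division

# divmod returns the floor quotient and the remainder together
pair = divmod(int_1, int_2)

# With a negative operand, flooring is not the same as truncating toward zero
negative_floor = -7 // 2

print(floor_quotient, remainder, pair, negative_floor)
```

Note the last line: `-7 // 2` is `-4`, not `-3`, because the result is rounded down rather than toward zero.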

#### Floats

Python floats are also consistent with other languages: in CPython they are implemented using a C `double`, i.e. double-precision floating point numbers.

In [56]:
float_1 = 23.46
type(float_1)

float

In [57]:
dir(float_1)

['__abs__',
 '__add__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getformat__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__int__',
 '__le__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmod__',
 '__rmul__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__setformat__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 'as_integer_ratio',
 'conjugate',
 'fromhex',
 'hex',
 'imag',
 'is_integer',
 'real']

In [66]:
float_2 = 3.

In [67]:
float_1 / float_2

7.82

With `int`s and `float`s, we can also use comparison operators like in other languages:

In [69]:
int_1 < int_2

True

In [70]:
float_1 >= int_2

True

In [71]:
float_1 == float_2

False
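One caveat with `==` on floats, shared with most other languages: many decimal fractions can't be stored exactly in binary, so results that "should" be equal sometimes aren't. A minimal sketch of the problem and one common workaround, `math.isclose` from the standard library:

```python
import math

total = 0.1 + 0.2
exactly_equal = (total == 0.3)             # False: neither side is stored exactly
close_enough = math.isclose(total, 0.3)    # True: equal within a relative tolerance

print(total)
print(exactly_equal)
print(close_enough)
```

When comparing computed floats, it's usually safer to test for closeness than for exact equality.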

#### Strings

In [58]:
str_1 = 'hello'
type(str_1)

str

In [59]:
dir(str_1)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

We already saw that the + operator concatenates two strings. Generalizing from this, what do you expect the * operator to do?

In [72]:
a = 'Hi'
print(a*5)

HiHiHiHiHi


There are a number of very useful methods built into Python `str` objects. A few that you might find yourself needing to use when dealing with text data include:

In [74]:
# count the number of occurrences of a sub-string
"Hi there I'm Nick".count('i')

2

In [75]:
# Find the index of the first occurrence of a sub-string (an optional second argument sets where the search starts)
"Hi there I'm Nick".find('i')

1

In [77]:
"Hi there I'm Nick".find('i', 2)

14

In [78]:
# Insert variables into a string
digit = 7
'The digit "7" should appear at the end of this sentence: {}.'.format(digit)

'The digit "7" should appear at the end of this sentence: 7.'

In [79]:
another_digit = 15
'This sentence will have two digits at the end: {} and {}.'.format(digit, another_digit)

'This sentence will have two digits at the end: 7 and 15.'

In [80]:
# Replace a sub-string with another sub-string
my_sentence = "Hi there I'm Nick"
my_sentence.replace('e', 'E')

"Hi thErE I'm Nick"

In [81]:
my_sentence.replace('N', '')

"Hi there I'm ick"

There are plenty more useful string functions - use either the `dir()` function or Google to learn more about what's available.
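Here are a few more that come up constantly when cleaning text data. This is just a sampler, not a complete list (the variable names are illustrative):

```python
my_sentence = "Hi there I'm Nick"

words = my_sentence.split()          # split on whitespace into a list of words
shouted = my_sentence.upper()        # upper-case copy; the original is unchanged
rejoined = '-'.join(words)           # glue the words back together with a separator
trimmed = '   padded   '.strip()     # remove leading and trailing whitespace

print(words)
print(shouted)
print(rejoined)
print(trimmed)
```

Strings are immutable, so each of these methods returns a new object rather than modifying `my_sentence` in place.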

So, to sum it up - basic data types like `bool`, `int`, `float`, and `str` are all objects in Python. The methods in each of these object classes define what operations can be done on them and how those operations are performed. For the sake of readability, however, many of the common operations like + and < are provided as syntactic sugar.

### An aside: What's the deal with those underscores???

When we were looking at the methods in the various data type classes above, we saw a bunch of methods like `__add__` and `__pow__` with double leading underscores and double trailing underscores (sometimes shortened to "dunders"). As it turns out, underscores are a bit of a *thing* in Python. Idiomatic use dictates a few important uses of underscores in variable and function names:

- Underscores are used to separate words in names. That is, idiomatic Python uses snake_case (`my_variable`) rather than camelCase (`myVariable`).
- A single leading underscore (`_my_function` or `_my_variable`) denotes a function or variable that is not meant for end users to access directly. Python doesn't have a sense of strong encapsulation, i.e. there are no strictly "private" methods or variables like in Java, but a leading underscore is a way of "weakly" signaling that the entity is for private use only.
- A single trailing underscore (`type_`) is used to avoid conflict with Python built-in functions or keywords. In my opinion, this is often poor style. Try to come up with a more descriptive name instead.
- Double leading and double trailing underscores (`__init__`, `__add__`) mark special variables or methods that are tied to some sort of "magic" syntax. As we saw above, the `__add__` method of an object describes what the result of `some_object + another_object` is.

For lots more detail on the use of underscores in Python, check out [this](https://hackernoon.com/understanding-the-underscore-of-python-309d1a029edc#.3ll4ywc85) post.
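To make the "magic" a little more concrete, here is a sketch of a toy class that defines its own dunder methods. The `Money` class is hypothetical and exists only for this illustration; we'll cover classes properly later.

```python
class Money:
    """A toy class (made up for this example) holding an amount in cents."""

    def __init__(self, cents):
        # Called when a new object is created with Money(...)
        self.cents = cents

    def __add__(self, other):
        # Called when Python evaluates `self + other`
        return Money(self.cents + other.cents)

    def __repr__(self):
        # Called when the object is displayed, e.g. by print()
        return 'Money({})'.format(self.cents)


lunch = Money(1250)
coffee = Money(375)
total = lunch + coffee
print(total)
```

Because `Money` defines `__add__`, the `+` operator works on it just like it does on `int`s and `str`s.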

## Collections of objects

Single variables can only take us so far. Eventually, we're going to want ways of storing many individual variables in a single, structured format.

### Lists

The list is one of the most commonly used Python data structures. A list is an ordered collection of (potentially heterogeneous) objects. Similar structures that exist in other languages are often called arrays.

In [62]:
my_list = ['a', 'b', 'c', 'a']

In [63]:
len(my_list)

4

In [64]:
my_list.append(1)
print(my_list)

['a', 'b', 'c', 'a', 1]


To access individual list elements by their position, use square brackets:

In [65]:
my_list[0]    # indexing in Python starts at 0!

'a'

In [66]:
my_list[4]

1

In [67]:
my_list[-1]    # negative indexes count backward from the end of the list

1

Lists can hold arbitrary objects!

In [68]:
type(my_list[0])

str

In [69]:
type(my_list[-1])

int

In [70]:
# let's do something crazy
my_list.append(my_list)
type(my_list[-1])

list

In [71]:
my_list

['a', 'b', 'c', 'a', 1, [...]]

In [72]:
my_list[-1]

['a', 'b', 'c', 'a', 1, [...]]

In [73]:
my_list[-1][-1]

['a', 'b', 'c', 'a', 1, [...]]

In [74]:
my_list[-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1][-1]

['a', 'b', 'c', 'a', 1, [...]]

Lists are also *mutable* objects, meaning that any part of them can be changed at any time. This makes them very flexible objects for storing data in a program.

In [75]:
my_list = ['a', 'b', 1]

In [76]:
my_list[0] = 'c'
my_list

['c', 'b', 1]

In [77]:
my_list.remove(1)
my_list

['c', 'b']
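Mutability has one consequence that surprises a lot of newcomers: assigning a list to a second name does not copy it. A minimal sketch (the names `original`, `alias`, and `copied` are illustrative):

```python
original = ['a', 'b', 1]

alias = original            # a second name for the same list object
copied = original.copy()    # a new list with the same elements

original[0] = 'c'

print(alias)     # the change is visible through the alias
print(copied)    # the copy is unaffected
print(alias is original, copied is original)
```

If you need an independent list to modify, make a copy explicitly.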

### Tuples

A tuple in Python is very similar to a list, except that tuples are *immutable*. This means that once they're defined, they can't be changed. Otherwise, they act very much like lists.

In [78]:
my_tuple = ('a', 'b', 1, 'a')

In [79]:
my_tuple[2]

1

In [80]:
my_tuple[0] = 'c'

TypeError: 'tuple' object does not support item assignment

In [81]:
my_tuple.append('c')

AttributeError: 'tuple' object has no attribute 'append'

In [82]:
my_tuple.remove(1)

AttributeError: 'tuple' object has no attribute 'remove'
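The read-only parts of the list interface still work on tuples, and immutability buys a couple of extras. A short sketch (the `grid` dictionary is made up for illustration; dictionaries are covered below):

```python
my_tuple = ('a', 'b', 1, 'a')

# Read-only operations work just like they do on lists
n_items = len(my_tuple)
n_a = my_tuple.count('a')

# Unpacking assigns each element to its own name
first, second, third, fourth = my_tuple

# To "change" a tuple, build a new one instead
longer = my_tuple + ('c',)

# Tuples of hashable elements can be dictionary keys; lists cannot
grid = {(0, 0): 'origin'}

print(n_items, n_a, third, longer, grid[(0, 0)])
```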

### Sets

A set in Python acts somewhat like a list that contains only unique objects.

In [85]:
my_set = {'a', 'b', 1, 'a'}
print(my_set)    # note the order of the items in the output

{1, 'b', 'a'}


In [86]:
my_set.add('c')
print(my_set)

{1, 'c', 'b', 'a'}


Note above that the order of items in a set doesn't have the same meaning as in lists and tuples.

In [88]:
my_set[0]

TypeError: 'set' object does not support indexing

Sets are used for a couple of reasons. Sometimes, finding the number of unique items in a list or tuple is important. In this case, we can convert the list/tuple to a set, then call `len` on the new set. For example,

In [89]:
my_list = ['a', 'a', 'a', 'a', 'b', 'b', 'b']
my_list

['a', 'a', 'a', 'a', 'b', 'b', 'b']

In [90]:
my_set = set(my_list)
len(my_set)

2

The other reason is that using the `in` keyword to test whether a collection contains an object is much faster for a set than for a list.

In [91]:
my_list = list(range(1000000))    # list of numbers 0 - 999,999
my_set = set(my_list)

In [92]:
%%timeit
999999 in my_list

100 loops, best of 3: 12.8 ms per loop


In [93]:
%%timeit
999999 in my_set

The slowest run took 18.90 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 57.9 ns per loop


Any idea why there's such a discrepancy?
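As a hint: a list has to be scanned element by element until a match is found, while a set uses hashing to jump more or less directly to where the item would be. If you want to experiment outside a notebook, where the `%%timeit` magic isn't available, here is a sketch using the standard library's `timeit` module. Absolute timings will differ on your machine; the gap between the two is the point.

```python
import timeit

my_list = list(range(1000000))    # list of numbers 0 - 999,999
my_set = set(my_list)

# Time 10 membership tests against each container.
# 999999 is the last element, i.e. the worst case for the list.
list_seconds = timeit.timeit(lambda: 999999 in my_list, number=10)
set_seconds = timeit.timeit(lambda: 999999 in my_set, number=10)

print('list: {:.6f} s'.format(list_seconds))
print('set:  {:.6f} s'.format(set_seconds))
```

Try searching for `0` instead of `999999` and see how the list timing changes.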

### Dictionaries

The final fundamental data structure we'll cover is the Python dictionary (aka "hash" in some other languages). A dictionary is a map of *keys* to *values*.

In [94]:
my_dict = {'name': 'Nick',
           'birthday': 'July 13',
           'years_in_durham': 4}

In [95]:
my_dict['name']

'Nick'

In [96]:
my_dict['years_in_durham']

4

In [98]:
my_dict['favorite_restaurant'] = 'Mateo'
my_dict['favorite_restaurant']

'Mateo'

In [101]:
my_dict['age']    # hey, that's personal. Also, it's not a key in the dictionary.

KeyError: 'age'

In addition to accessing values by keys, you can retrieve the keys and values by themselves. In Python 3, these come back as list-like *view* objects; wrap them in `list()` if you need actual lists:

In [103]:
my_dict.keys()

dict_keys(['name', 'birthday', 'favorite_restaurant', 'years_in_durham'])


In [104]:
my_dict.values()

dict_values(['Nick', 'July 13', 'Mateo', 4])


Note that if you're using Python 3.5 or earlier, the order that you insert key/value pairs into the dictionary doesn't correspond to the order they're stored in by default (we inserted `favorite_restaurant` after `years_in_durham`!). This changed in CPython 3.6 (released in December 2016), where dictionaries keep insertion order as an implementation detail, and Python 3.7 made that ordering an official guarantee of the language.
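A few more dictionary idioms you'll likely use all the time, sketched with the same example data (the names `has_age`, `age`, and `key_list` are illustrative):

```python
my_dict = {'name': 'Nick',
           'birthday': 'July 13',
           'years_in_durham': 4}

# Test for a key before using it...
has_age = 'age' in my_dict

# ...or use .get(), which returns a default instead of raising a KeyError
age = my_dict.get('age', 'unknown')

# .items() yields (key, value) pairs, which unpack neatly in a for loop
for key, value in my_dict.items():
    print('{}: {}'.format(key, value))

# The result of .keys() can be sorted or converted into a regular list
key_list = sorted(my_dict.keys())
print(key_list)
```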

## Control structures



As data scientists, we're data-driven people, and we want our code to be data-driven, too. Control structures are a way of adding a logical flow to your programs, making them reactive to different conditions. These concepts are largely the same as in other programming languages, so I'll quickly introduce the syntax here for reference without much comment.

### if-elif-else

Like most programming languages, Python provides a way of conditionally evaluating lines of code.

In [6]:
x = 3

if x < 2:
    print('x less than 2')
elif x < 4:
    print('x less than 4, greater than or equal to 2')
else:
    print('x greater than or equal to 4')

x less than 4, greater than or equal to 2


### For loops

In Python, a for loop iterates over the contents of a container like a list. For example:

In [1]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)

a
b
c


To iterate a specific number of times, you can create a `range` object with the `range` function:

In [4]:
for i in range(5):   # iterate over all integers (starting at 0) less than 5
    print(i)

0
1
2
3
4


In [5]:
for i in range(2, 6, 3):    # iterate over integers (starting at 2) less than 6, increasing by 3
    print(i)

2
5
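Two built-ins pair especially well with for loops: `enumerate`, when you need the position as well as the element, and `zip`, when you want to walk two containers in step. A minimal sketch (the `numbers` list and result names are made up for illustration):

```python
my_list = ['a', 'b', 'c']

# enumerate yields (index, element) pairs, so no manual counter is needed
indexed = []
for idx, element in enumerate(my_list):
    indexed.append((idx, element))

# zip pairs up the elements of several containers
numbers = [1, 2, 3]
paired = []
for letter, number in zip(my_list, numbers):
    paired.append(letter * number)

print(indexed)
print(paired)
```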


### While loops

Python also has while loops. For stylistic reasons, while loops are used somewhat less often than for loops. For example, compare the two following blocks of code:

In [10]:
my_list = ['a', 'b', 'c']
idx = 0
while idx < len(my_list):
    print(my_list[idx])
    idx += 1

a
b
c


In [11]:
my_list = ['a', 'b', 'c']
for element in my_list:
    print(element)

a
b
c


There are occasionally other reasons for using while loops (waiting for an external input, for example), but we won't make extensive use of them in this course.
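A while loop is the natural choice when you don't know in advance how many iterations you'll need. Here is a toy sketch (the values are arbitrary) that keeps halving a number until it drops below a threshold:

```python
value = 100.0
threshold = 1.0
n_halvings = 0

# Keep going until the condition becomes False
while value >= threshold:
    value /= 2
    n_halvings += 1

print(n_halvings, value)
```

Make sure something inside the loop eventually makes the condition false, or the loop will run forever.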

## Functions

Of course, as data scientists, one of our most important jobs is to manipulate data in a way that provides insight. In other words, we need ways of taking raw data, doing some things to it, and returning nice, clean, processed data back. This is the job of functions!

### Built-in Python functions
It turns out that Python has a ton of functions built in already. When we have a task that can be accomplished by a built-in function, it's almost always a good idea to use it. This is because many of the Python built-in functions are actually written in C, not Python, and C tends to be much faster for certain tasks.

The full list of built-in functions is in the official docs: https://docs.python.org/3.5/library/functions.html

In [107]:
my_list = range(1000000)

In [108]:
%%timeit
sum(my_list)

10 loops, best of 3: 20.4 ms per loop


In [110]:
%%timeit
my_sum = 0
for element in my_list:
    my_sum += element
my_sum

10 loops, best of 3: 63.2 ms per loop


Some common mathematical functions that are built into Python:

- `sum`
- `divmod`
- `round`
- `abs`
- `max`
- `min`

And some other convenience functions, some of which we've already seen:

- `int`, `float`, `str`, `set`, `list`, `dict`: for converting between data types and data structures
- `len`: for finding the number of elements in a data structure
- `type`: for finding the type that an object belongs to
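Here's a quick sketch of several of these in action (the `values` list is made-up example data):

```python
values = [3, -7, 12, 5]

total = sum(values)                       # add up all the elements
largest = max(values)
smallest = min(values)
magnitude = abs(smallest)                 # absolute value
quotient, remainder = divmod(total, 5)    # floor quotient and remainder at once
rounded = round(2.71828, 2)               # round to 2 decimal places

# Converting between types
as_int = int('42')
as_float = float('3.5')
unique_letters = set('banana')

print(total, largest, smallest, magnitude, quotient, remainder, rounded)
print(as_int, as_float, len(unique_letters), type(unique_letters))
```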

### Custom functions

Of course, there are plenty of times we want to do something that isn't provided by a built-in. In that case, we can define our own functions. The syntax is quite simple:

In [111]:
def double_it(x):
    return x * 2

In [112]:
double_it(5)

10

Python has *dynamic typing*, which (in part) means that the arguments to functions aren't assigned a specific type:

In [113]:
double_it('hello')   # remember 'Hi' * 5 from before?

'hellohello'

In [114]:
double_it({'a', 'b'})    # but there's no notion of multiplication for sets

TypeError: unsupported operand type(s) for *: 'set' and 'int'

#### Required arguments vs optional arguments

When defining a function, you can add defaults to arguments that you want to be optional. When defining and providing arguments, required arguments always go first, and the order they're provided in matters. Optional arguments follow, and can be passed by their keyword in any order.

In [124]:
def multiply_them(x, y, extra_arg1=None, extra_arg2=None):
    if extra_arg1 is not None:
        print(extra_arg1)
    if extra_arg2 is not None:
        print(extra_arg2)
    
    print('multiplying {} and {}...'.format(x, y))
    return x * y

In [125]:
multiply_them(3, 5)

multiplying 3 and 5...


15

In [126]:
multiply_them(3, 5, extra_arg1='hello')

hello
multiplying 3 and 5...


15

In [127]:
multiply_them(3, 5, extra_arg2='world', extra_arg1='hello')

hello
world
multiplying 3 and 5...


15

In [128]:
multiply_them(extra_arg2='world', extra_arg1='hello', 3, 5)

SyntaxError: positional argument follows keyword argument (<ipython-input-128-89490f4161a8>, line 1)
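For contrast, here is a sketch of some calls that do work. The function is repeated (slightly shortened) so the block runs on its own, and the `result_*` names are just for illustration:

```python
def multiply_them(x, y, extra_arg1=None, extra_arg2=None):
    if extra_arg1 is not None:
        print(extra_arg1)
    if extra_arg2 is not None:
        print(extra_arg2)
    return x * y


# Positional arguments first, then keyword arguments in any order
result_1 = multiply_them(3, 5, extra_arg2='world', extra_arg1='hello')

# Required arguments can also be passed by keyword, in which case
# their order no longer matters
result_2 = multiply_them(y=5, x=3, extra_arg1='hello')

# Optional arguments can be passed positionally too, but then order matters
result_3 = multiply_them(3, 5, 'hello', 'world')

print(result_1, result_2, result_3)
```

Passing optional arguments by keyword, as in the first two calls, tends to be easier to read than relying on their position.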

## Modules

Knowing how to create your own functions can be a rabbit hole - once you know that you can make Python do whatever you want it to do, it can be easy to go overboard. Good data scientists are efficient data scientists, however - you shouldn't reinvent the wheel by reimplementing a bunch of functionality that someone else worked hard on. Doing anything nontrivial can take a ton of time, and without spending even more time to write tests, squash bugs, and address corner cases, your code can easily end up being much less reliable than code that someone else has spent time perfecting.

Python has a very robust standard library of external modules that come with every Python installation. For even more specialized work, the Python community has also open-sourced *tens of thousands* of packages, any of which is a simple `pip install` away.

#### The standard library
A tour of some of the interesting ones, especially for data science.
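As a small taste, here is a sketch that uses three standard library modules: `collections`, `statistics`, and `math`. This is my own pick of examples rather than a complete tour, and the `measurements` list is made-up data:

```python
import math
import statistics
from collections import Counter

# Counter tallies how often each item appears
letters = ['a', 'a', 'a', 'a', 'b', 'b', 'b']
counts = Counter(letters)
most_common = counts.most_common(1)    # the single most frequent item and its count

# statistics provides basic descriptive statistics
measurements = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(measurements)
median = statistics.median(measurements)

# math provides common mathematical functions and constants
root = math.sqrt(16)

print(counts['a'], counts['b'], most_common)
print(mean, median, root)
```

Modules are brought in with `import`, and nothing needs to be installed first because they ship with Python.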

#### Third party libraries and the Python Package Index

When someone creates and open-sources a Python package that isn't meant for the standard library, the most common way of distributing the package to other people is by using the [Python Package Index](https://pypi.python.org/pypi) (PyPI, pronounced pie-pee-eye).

Packages on PyPI are installed from the command line with `pip`, e.g. `pip install <package-name>`.

## Wrapping up

This notebook is fairly information-dense, especially if you haven't used Python before. Keep it close by for reference as the course goes along! Thankfully, Python syntax is fairly friendly toward beginners, so picking up the basics usually doesn't take too long. I hope you'll find as the course goes along that the Python syntax starts to feel more natural. Don't get discouraged; know when to ask for help, and look online for resources. And remember - the Python ecosystem is deep, and it can take years to master!

### Other resources
 - The official docs - https://docs.python.org/3.5/ - always start with the official docs!
 - [*Automate the Boring Stuff*](https://automatetheboringstuff.com/) and [*Dive Into Python*](http://www.diveintopython.net/) are two often-recommended, free, online books for diving further into the language.
 - Googling! Lots of good answers out there on the common programming help websites.