# Unit 2: Data Types

In [1]:
from shared import display_unit_toc
display_unit_toc('notebook.ipynb')

# Table of Contents

* [Unit 2: Data Types](#Unit-2:-Data-Types)
 * [Objects](#Objects)
 * [Integers and Floats](#Integers-and-Floats)
 * [Booleans](#Booleans)
 * [Errors](#Errors)
 * [Strings](#Strings)
 * [Bytes](#Bytes)
 * [Functions](#Functions)
 * [Dates](#Dates)
 * [Collections: Lists, Tuples, Sets](#Collections:-Lists,-Tuples,-Sets)
  * [Tuples](#Tuples)
  * [Sets](#Sets)
 * [Collections: Dictionaries](#Collections:-Dictionaries)
 * [Vectors and Matrices: Numpy](#Vectors-and-Matrices:-Numpy)
 * [Data Frames: Pandas](#Data-Frames:-Pandas)
 * [Saving Data: Pickle](#Saving-Data:-Pickle)

## Objects

Python is an object-oriented programming language: *Everything* is an object. And objects have **object types**, also called **classes**.

We usually make objects by calling the class name as a function and passing any object-related information as its arguments. A class used this way is called a **class constructor**.

For instance, we can make a new integer using the `int()` function as a class constructor:

In [2]:
x = int(3)
print(x)

print ('\n--\n')

x = int()
print(x)

3

--

0


Note above we can use `print()` or the object name to see its contents. And we can use `type()` to see its type.

Surprisingly, we can see that `int()` works as a class constructor even without an argument. By typing `?int` we can open the help documentation and see if this makes sense. In this case, the help says `int(x=0) -> integer`, which means that the function `int()` takes an input `x` which, if not provided, will default to `x=0`. And the `-> integer` means that the function returns an integer. So this blank constructor works because `int()` has a default for when input doesn't exist.

We don't have to construct every object with an explicit constructor call. It's easier to write `x = 3`, for instance, than `x = int(3)`. However, a constructor comes in handy when we want to convert (cast) an object to a different class:

In [3]:
x = 3.14159
print(type(x), x)

print('\n--\n')

y = int(x)
print(type(y), y)

<class 'float'> 3.14159

--

<class 'int'> 3
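The same constructor-based casting works for the other built-in types. The following is a minimal sketch with illustrative variable names (none of them come from the notebook). Note that `int()` truncates toward zero rather than rounding:

```python
# Casting between built-in types with class constructors
n_from_text = int('42')      # str -> int
f_from_text = float('3.5')   # str -> float
text_from_n = str(42)        # int -> str

# int() drops the decimal part (truncates toward zero); round() rounds
truncated_pos = int(3.99)
truncated_neg = int(-3.99)
rounded = round(3.99)

print(n_from_text, f_from_text, text_from_n)
print(truncated_pos, truncated_neg, rounded)
```

If the text can't be interpreted as a number (for example `int('three')`), the constructor raises a `ValueError` instead of returning a value.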


The native object types we'll cover here include:

* Integer `int`
* Floating point number `float`
* Boolean `bool`
* Complex number `complex`
* String `str`
* Bytes `bytes`
* List `list`
* Set `set`
* Tuple `tuple`
* Dictionary `dict`
* Function `function` (built-in functions and methods report the type `builtin_function_or_method`)

By having a type (also called a class), an object inherits

* **attributes**, i.e. attached variables
* **methods**, i.e. attached functions

We can access both using `TAB`-complete. For instance, we can assign a variable `x = 3.14159` then type `x.` and hit `TAB`. As a result, we see the `float` object has several attributes/methods:

* `as_integer_ratio`
* `conjugate`
* `fromhex`
* `hex`
* `imag`
* `is_integer`
* `real`

But which are which? Any attributes we can access like so

In [4]:
x.real

3.14159

For the ones that are methods rather than data attributes, Python will tell us the object we've chosen is a function:

In [5]:
x.is_integer

<function float.is_integer>

Alternatively, we can use `type()`:

In [6]:
print(type(x.real))
print(type(x.is_integer))

<class 'float'>
<class 'builtin_function_or_method'>
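A quick way to sort the members into attributes and methods programmatically is the built-in `callable()`, combined with `getattr()` to look up a member by name. This is only a sketch of one possible approach; the list of names is taken from the TAB-completion results above:

```python
x = 3.14159

# callable() is True for methods (attached functions), False for plain data attributes
for name in ['real', 'imag', 'is_integer', 'hex']:
    member = getattr(x, name)      # same as writing x.real, x.imag, ...
    kind = 'method' if callable(member) else 'attribute'
    print(name, '->', kind)

# Methods must be called with parentheses to get a result
print(x.is_integer())
```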



Another way to explore the structure of an object is using the `dir()` function:

In [7]:
dir(x)

['__abs__',
 '__add__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getformat__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__le__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rmod__',
 '__rmul__',
 '__round__',
 '__rpow__',
 '__rsub__',
 '__rtruediv__',
 '__setattr__',
 '__setformat__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__trunc__',
 'as_integer_ratio',
 'conjugate',
 'fromhex',
 'hex',
 'imag',
 'is_integer',
 'real']

Notice that, in addition to the ones we saw above, the `float` object has many more attributes and methods whose names begin and end with a double underscore `__`. These tend to be implementation-level hooks (for example, `__add__` is what runs when we use `+`) and are rarely called directly.

For further detail on the implementation-level attributes see [Python 3.6 documentation](https://docs.python.org/3.6/reference/datamodel.html#objects-values-and-types).


## Integers and Floats

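The two main numeric data types in Python are integers (`int`) and floats (`float`, or floating point numbers). Integers have no decimal component; in Python 3 they have arbitrary precision, so their size is limited only by available memory. Floats can have a decimal component but have finite precision and range: on most platforms they are 64-bit double-precision values, with a largest finite value of about `1.8e308` (see `sys.float_info`).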

Python tries to assign integers when possible. For addition, subtraction, and multiplication this works fine, since the result of combining integers is always an integer, too. However, in Python 2 this can cause unintended behavior when we divide: dividing one integer by another returns an integer, discarding the fractional part (so `100 / 3` gives `33`). In these cases we often want to cast one of the numbers to `float` so that the result will also be a floating point number. We can either use the `float()` function or add a decimal point.

This behavior has changed in Python 3, which returns a division result as type `float` regardless of the inputs.

In [8]:
100 / 3

33.333333333333336

In [9]:
100 / float(3)

33.333333333333336

In [10]:
100. / 3

33.333333333333336

In [11]:
# Compare these types:
print(type(100), type(100.))

<class 'int'> <class 'float'>
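When we do want an integer-style result in Python 3, the floor-division operator `//` provides it, and `%` gives the remainder. A short sketch:

```python
# True division always returns a float in Python 3, even when the result is whole
print(100 / 3)
print(6 / 3)

# Floor division and remainder
quotient = 100 // 3
remainder = 100 % 3
print(quotient, remainder)
print(divmod(100, 3))   # both at once, as a tuple

# Floor division rounds down (toward negative infinity), not toward zero
negative_quotient = -7 // 2
print(negative_quotient)
```

The last line is worth remembering: `-7 // 2` is `-4`, whereas `int(-7 / 2)` truncates to `-3`.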


## Booleans

A **boolean** is a binary logical data type, i.e. it takes the value `True` or `False`. The results of logical comparisons like "is greater than" (`>`), "is equal to" (`==`), and "is not equal to" (`!=`) are returned as booleans. Booleans can be combined with the two operators `and` and `or`, and negated with the operator `not`. Also, parentheses `( )` can be used to group boolean statements together, which can be important for expressing more complex conditions.




In [12]:
-10 < -20

False

In [13]:
2 + 3 == 5

True

In [14]:
2 + 3 != 5.1

True

In [15]:
type(2) == int

True

In [16]:
x = 3.14159
not (int(x) < 3 or int(x) > 2)

False

In [17]:
not int(x) < 3 or int(x) > 2

True

Python will also treat the booleans `True` and `False` like the integers `1` and `0` when the context calls for it. For instance, if we add two booleans it will operate on them like integers. Similarly, integers will be treated like booleans if used in clauses with `and`, `or`, `not`. The integer `0` is treated as `False`, while all other integers are treated as `True`.

In [18]:
False - True

-1

In [19]:
print(3 * 0 or -100)
print(bool(3 * 0 or -100))

-100
True
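Two everyday uses of this behavior are counting how many conditions hold and supplying a fallback value. The sketch below uses made-up numbers and variable names purely for illustration:

```python
game_margins = [37, -3, -14, 7, -4]  # illustrative score margins

# Each comparison is a boolean; sum() treats True as 1 and False as 0
wins = sum(margin > 0 for margin in game_margins)
print(wins)

# `or` returns the first truthy operand, which makes a simple default value
nickname = ''                      # empty strings count as False
display_name = nickname or 'unknown'
print(display_name)
```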


Booleans are useful for `if` / `else` constructions, which we'll explore in Unit 3.

In [20]:
from datetime import datetime
today = datetime.now()
print(today, '\n')

if today.month >= 3 and today.month < 6:
    print('Spring')
elif today.month >= 6 and today.month < 9:
    print('Summer')
elif today.month >= 9 and today.month < 12:
    print('Fall')
else:
    print('Winter')
    

2018-09-08 05:56:06.649811 

Fall


## Errors

Like all other data, Python errors are objects too, with their type specifying the type of error. Let's try a few error-throwing operations.

First, let's see what happens if we write a clause that doesn't make logical sense. Try running some of these examples:

In [21]:
# True not and False
# True not False
# 3 2

For all three cases we've written operations that aren't clearly defined. In other words, Python doesn't know how to interpret the syntax and throws a `SyntaxError`. The line above the error shows the offending line and often (though not always) the part of the line that's directly responsible.

What about other error types?

In [22]:
import math
# math.sqrt(-4)

Recall the (real) square root isn't defined for negative numbers. Thus the `sqrt` function receives an unexpected value and throws a corresponding `ValueError`. 

One more example:

In [23]:
x = 3.14
#  print(y)

Even though this cell contains two errors, Python reports only the first one it encounters. As the interpreter scans our code, it gets to the line `print(y)` and finds unexpected leading whitespace, so it raises an `IndentationError` before it ever tries to look up the undefined name `y` (which would raise a `NameError`). This is because Python is *whitespace sensitive*: It uses the indentation level of code as information, as we saw with `if/else` blocks. We'll discuss this more in the next unit, but for now you can try correcting the indentation and see how it changes the error.
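To see that errors really are objects with types, we can catch one and inspect it. This is a minimal sketch using `try`/`except`, a construct this unit has not introduced yet; `undefined_name` is a deliberately undefined, hypothetical variable:

```python
import math

try:
    math.sqrt(-4)
except ValueError as err:
    # err is an ordinary object: it has a type and a message
    error_type = type(err).__name__
    print(error_type, '-', err)

try:
    print(undefined_name)  # this name is deliberately never defined
except NameError as err:
    name_error_type = type(err).__name__
    print(name_error_type, '-', err)

# Error classes form a hierarchy, like any other classes
print(issubclass(IndentationError, SyntaxError))
```

Syntax and indentation errors can't be caught this way within the same cell, because Python detects them while reading the code, before any of it runs.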

## Strings

To this point we have dealt exclusively with quantitative (numeric) data. Of course, there are also qualitative (non-numeric) data. These tend to be expressed as **strings**, or text-based variables.

A string variable in Python is enclosed by a pair of quotes. Either single or double quotes are fine, so long as we are consistent.


In [24]:
eastern_playoff_teams = 'Raptors, Celtics, 76ers, Cavaliers, Pacers, Heat, Bucks, Wizards'
print(eastern_playoff_teams)

western_playoff_teams = "Rockets, Warriors, Trailblazers, Jazz, Thunder, Pelicans, Spurs, Timberwolves"
print(western_playoff_teams)

Raptors, Celtics, 76ers, Cavaliers, Pacers, Heat, Bucks, Wizards
Rockets, Warriors, Trailblazers, Jazz, Thunder, Pelicans, Spurs, Timberwolves


Strings can be concatenated, i.e. combined, with a `+` sign. 


In [25]:
all_playoff_teams = eastern_playoff_teams + ', ' + western_playoff_teams
print(all_playoff_teams)

Raptors, Celtics, 76ers, Cavaliers, Pacers, Heat, Bucks, Wizards, Rockets, Warriors, Trailblazers, Jazz, Thunder, Pelicans, Spurs, Timberwolves


Strings have several useful built-in methods for accessing and changing their data. For instance, `replace()` takes two arguments and replaces every match of the first argument with the second argument. If there are no matches, it doesn't replace anything. Because strings are immutable, `replace()` returns a new string and leaves the original unchanged, so we need to assign the result to a variable to keep it.

This can be a handy way to delete text from a string by replacing the text we want to delete with an empty string `''`.

In [26]:
western_playoff_teams.replace('Timberwolves', 'Nuggets')
print(western_playoff_teams)
print(western_playoff_teams.endswith('Nuggets'))

print('\n--\n')

western_playoff_teams = western_playoff_teams.replace('Timberwolves', 'Nuggets')
print(western_playoff_teams)
print(western_playoff_teams.endswith('Nuggets'))


Rockets, Warriors, Trailblazers, Jazz, Thunder, Pelicans, Spurs, Timberwolves
False

--

Rockets, Warriors, Trailblazers, Jazz, Thunder, Pelicans, Spurs, Nuggets
True


The function `find()` can test whether the string variable calling `find()` contains a given sub-string. It returns the location of the first match, if one exists, or else `-1`.

In [27]:
print(eastern_playoff_teams.find('76ers'))
print(eastern_playoff_teams.find('Hawks'))

18
-1


Special cases of `find()` are the functions `startswith()` and `endswith()`.

Other functions like `upper()`, `lower()`, and `capitalize()` can be used to force strings to match a given format.

Another key string function is `split()`. Split separates a string into multiple strings, breaking each time it matches the argument string. For instance, by splitting on a comma and space `, ` we can separate our long strings above into two lists of smaller strings:

In [28]:
print(eastern_playoff_teams.split(', '))
print(western_playoff_teams.split(', '))

['Raptors', 'Celtics', '76ers', 'Cavaliers', 'Pacers', 'Heat', 'Bucks', 'Wizards']
['Rockets', 'Warriors', 'Trailblazers', 'Jazz', 'Thunder', 'Pelicans', 'Spurs', 'Nuggets']


Note our single strings (one pair of quotes each) are now broken into multiple strings (eight pairs of quotes each). The commas are no longer part of the string and instead they now separate the multiple strings we received after splitting. The most common delimiters we split on are commas `,`, spaces ` `, tabs `\t`, and new lines `\n`.

One question to ask yourself: What would have happened if we had omitted the space above? If you're not entirely sure, try running this file as `.ipynb` and making the change yourself.


#### String formatting

Strings have a method `format()` that allows us to build strings with placeholders, which allow flexible filling of variables into strings. The placeholder(s) is specified by braces `{ }` which either a) are empty or b) have a variable name. Our input to `format()` is the variable we're filling in.


In [29]:
team_cities = {'Rockets': 'Houston', 'Warriors': 'Golden State', 'Hawks': 'Atlanta', 'Wizards': 'Washington'}

for team, city in team_cities.items():
    print('The {team} play at an arena in {city}'.format(team=team, city=city))
#     print('The {} play at an arena in {}'.format(team, city)) # same

The Rockets play at an arena in Houston
The Warriors play at an arena in Golden State
The Hawks play at an arena in Atlanta
The Wizards play at an arena in Washington
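Python 3.6 and later also offer formatted string literals ("f-strings"), which put the variable names directly inside the braces. A brief sketch, assuming Python 3.6+; `win_pct` and `summary` are illustrative names:

```python
team_cities = {'Rockets': 'Houston', 'Warriors': 'Golden State'}

for team, city in team_cities.items():
    # The f prefix lets us put variable names directly inside the braces
    print(f'The {team} play at an arena in {city}')

# A format spec after a colon controls number formatting (3 decimal places here)
win_pct = 59 / 82
summary = f'Win percentage: {win_pct:.3f}'
print(summary)
```

The same format specs (such as `:.3f`) also work inside the braces of `format()`.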



#### Regular expressions

Another common tool for manipulating strings is the module `re`. The name `re` is short for "regular expressions", which are a powerful syntax for matching and replacing strings. Regular expressions can handle exact string matches as well as more general concepts like "find all digits within the string".

In [30]:
import re

# \d matches a digit; + continues the match as long as it keeps matching digits
# The r prefix makes this a raw string, so the backslash reaches `re` unchanged
x = re.search(r'\d+', eastern_playoff_teams)

print(x.group())

76
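Besides `re.search()`, which stops at the first match, the module provides `re.findall()` to collect every match and `re.sub()` to replace matches. The string below is invented for illustration:

```python
import re

standings = 'Raptors 59-23, Celtics 55-27, 76ers 52-30'  # illustrative string

# findall returns every non-overlapping match as a list of strings
numbers = re.findall(r'\d+', standings)
print(numbers)

# sub replaces every match; here each win-loss record becomes a placeholder
masked = re.sub(r'\d+-\d+', 'W-L', standings)
print(masked)
```

Notice that the `76` in `76ers` is picked up by the first pattern but not the second, since it isn't followed by a hyphen and more digits.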


We'll explore regular expressions and string matching more in Unit 2 Lab.

## Bytes

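A `bytes` object is a sequence of raw bytes (integers from 0 to 255). A string (`str`) holds Unicode text; encoding a string, for example as UTF-8, produces bytes, and decoding bytes produces a string. We can write a bytes literal by prefixing the quotes with `b`.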

In [31]:
name = u'James Naismith'
print(type(name))
print(name)

print('\n--\n')

name_bytes = b'James Naismith'
print(type(name_bytes))
print(name_bytes)


<class 'str'>
James Naismith

--

<class 'bytes'>
b'James Naismith'
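The link between the two types is `str.encode()` and `bytes.decode()`. A small sketch, assuming UTF-8 as the encoding; the example name is chosen only because it contains a non-ASCII character:

```python
name = 'Jos\u00e9'   # the text 'José', with the accented letter written as an escape

# Encoding turns text into raw bytes; decoding reverses it
name_bytes = name.encode('utf-8')
print(type(name_bytes), name_bytes)

round_trip = name_bytes.decode('utf-8')
print(round_trip == name)

# Characters and bytes are not the same thing: the accented letter takes two bytes in UTF-8
print(len(name), len(name_bytes))

# Indexing bytes gives integers between 0 and 255
first_byte = name_bytes[0]
print(first_byte)
```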


## Functions

We start a function definition with the keyword `def` followed by the function name. In parentheses after the function name, we specify which arguments the function receives. The function below, which computes a point on a line, has only one input. However, a function like `str.replace()` takes multiple inputs. We'll explore functions more in Unit 3.

In [32]:
def my_line(x):
    return 3*x + 2

x = 2
y = my_line(x)
print(y)

type(my_line)

8


function
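As a preview of functions with several inputs, here is a hedged variation on `my_line` in which the slope and intercept become arguments with default values. The parameter names `slope` and `intercept` are hypothetical additions, not part of the original function:

```python
def my_line(x, slope=3, intercept=2):
    """Return the value of the line slope*x + intercept at x."""
    return slope * x + intercept

print(my_line(2))                          # defaults reproduce 3*x + 2
print(my_line(2, slope=1))                 # override one argument by name
print(my_line(2, slope=-1, intercept=0))   # override both
```

Default values work the same way as the `x=0` default we saw for `int()`: if the caller omits the argument, the default is used.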

## Dates

Dates are another common data type, especially when we're dealing with time series. While we could represent dates as multiple integers (i.e. one variable for `year`, another for `month`, etc.), this ignores an important feature of dates: their arithmetic doesn't match typical integer arithmetic. For instance, if `day` is `31` then the next day is not day `32`; the count rolls over to day `1` of the following month.

Thankfully, Python's `datetime` package implements a class for dates that can handle date arithmetic and text-formatting. Let's import the package and construct a date variable `leap_day` using the class constructor `datetime.date()`:

In [33]:
import datetime
leap_day = datetime.date(2020, 2, 29)

print(leap_day.year, leap_day.month, leap_day.day, leap_day.weekday())


2020 2 29 5


Our variable `leap_day` has attributes `year`, `month`, and `day` among others. It also has a method `weekday()` that returns the day of the week for the date. Python counts beginning at `0`, as we'll see with lists below. So starting with `Monday = 0` we see that `leap_day` falls on a Saturday.

Another powerful feature of `datetime` is its class `datetime.timedelta`. Adding `date + timedelta` will produce a new date. For instance we can shift five weeks ahead from `leap_day` by running

In [34]:
leap_day + datetime.timedelta(weeks=5)

datetime.date(2020, 4, 4)

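Note you can replace the argument `weeks` with `days`, `hours`, `minutes`, etc. When adding to a `datetime.date`, only the whole days of the `timedelta` count; to shift by hours or minutes, use a `datetime.datetime` object instead.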
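Date arithmetic also works in the other direction: subtracting two dates yields a `timedelta`. The sketch below additionally shows `strftime()` for the text formatting mentioned earlier; `next_leap_day` is an illustrative variable:

```python
import datetime

leap_day = datetime.date(2020, 2, 29)
next_leap_day = datetime.date(2024, 2, 29)

# Subtracting two dates gives a timedelta
gap = next_leap_day - leap_day
print(gap.days)

# The day after Feb 29 rolls over to March automatically
day_after = leap_day + datetime.timedelta(days=1)
print(day_after)

# strftime formats a date as text using format codes
formatted = leap_day.strftime('%Y-%m-%d')
print(formatted)
```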

## Collections: Lists, Tuples, Sets

### Lists

Arguably the most frequently used Python object is the list. A **list** is a collection of objects. These objects can have mixed types and can include duplicates.

The typical way to declare a list is using square brackets `[ ]` to signify the list and commas `,` to separate each object.


In [35]:
eastern_playoff_list = ['Raptors', 'Celtics', '76ers', 'Cavaliers', 'Pacers', 'Heat', 'Bucks', 'Wizards']
print(eastern_playoff_list)
print(type(eastern_playoff_list))

['Raptors', 'Celtics', '76ers', 'Cavaliers', 'Pacers', 'Heat', 'Bucks', 'Wizards']
<class 'list'>


To compute the length of a list we use `len()`:

In [36]:
len(eastern_playoff_list)

8

We can access elements of a list with square brackets `[ ]`, similar to calling a function. Note that, unlike R or Matlab, Python starts counting at `0`. So the eight teams above are indexed `0-7`:


In [37]:
print(eastern_playoff_list[0])
print(eastern_playoff_list[2])
print(eastern_playoff_list[7])
# print(eastern_playoff_list[8])  # try this in Jupyter Notebook or Spyder

Raptors
76ers
Wizards


Additionally, Python allows us to do reverse indexing, beginning at the end of the list. We access items in reverse by using negative numbers, starting with `-1`:

In [38]:
print(eastern_playoff_list[-1])
print(eastern_playoff_list[-2])
print(eastern_playoff_list[-8])

Wizards
Bucks
Raptors
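Square brackets can also select a range of elements at once, which is called slicing. A short sketch using the same team list:

```python
eastern_playoff_list = ['Raptors', 'Celtics', '76ers', 'Cavaliers',
                        'Pacers', 'Heat', 'Bucks', 'Wizards']

# list[start:stop] includes start but excludes stop
top_three = eastern_playoff_list[0:3]
print(top_three)

# Omitting start or stop means "from the beginning" or "to the end"
bottom_two = eastern_playoff_list[-2:]
print(bottom_two)

# A third number sets the step size
every_other = eastern_playoff_list[::2]
print(every_other)
```

A slice always returns a new list, so the original list is left unchanged.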


Similar to strings, we can combine lists in order by using the `+` operator:

In [39]:
western_playoff_list = western_playoff_teams.split(', ')
all_playoff_list = eastern_playoff_list + western_playoff_list
print(all_playoff_list)

['Raptors', 'Celtics', '76ers', 'Cavaliers', 'Pacers', 'Heat', 'Bucks', 'Wizards', 'Rockets', 'Warriors', 'Trailblazers', 'Jazz', 'Thunder', 'Pelicans', 'Spurs', 'Nuggets']


There are other ways to change the contents of a list, too:
* `append(object)` adds `object` to the end of the list
* `insert(position, object)` inserts `object` at index `position`
* `remove(object)` deletes the first occurrence of `object` in the list


In [40]:
all_playoff_list.append('Hawks')
all_playoff_list.insert(1, 'Hawks')
all_playoff_list

['Raptors',
 'Hawks',
 'Celtics',
 '76ers',
 'Cavaliers',
 'Pacers',
 'Heat',
 'Bucks',
 'Wizards',
 'Rockets',
 'Warriors',
 'Trailblazers',
 'Jazz',
 'Thunder',
 'Pelicans',
 'Spurs',
 'Nuggets',
 'Hawks']

In [41]:
# Try running this cell multiple times
all_playoff_list.remove('Hawks')
all_playoff_list

['Raptors',
 'Celtics',
 '76ers',
 'Cavaliers',
 'Pacers',
 'Heat',
 'Bucks',
 'Wizards',
 'Rockets',
 'Warriors',
 'Trailblazers',
 'Jazz',
 'Thunder',
 'Pelicans',
 'Spurs',
 'Nuggets',
 'Hawks']

### Tuples

Tuples are similar to lists. Instead of declaring them with square brackets `[ ]`, though, we use parentheses `( )` or no boundary markings at all.


In [42]:
eastern_playoff_tuple = ('Raptors', 'Celtics', '76ers', 'Cavaliers', 'Pacers', 'Heat', 'Bucks', 'Wizards')

print(type(eastern_playoff_tuple))

western_playoff_tuple = 'Rockets', 'Warriors', 'Trailblazers', 'Jazz', 'Thunder', 'Pelicans', 'Spurs', 'Timberwolves'

print(type(western_playoff_tuple))

<class 'tuple'>
<class 'tuple'>


Like lists, we can combine tuples with `+`


In [43]:
all_playoff_tuple = eastern_playoff_tuple + western_playoff_tuple
print(all_playoff_tuple)

('Raptors', 'Celtics', '76ers', 'Cavaliers', 'Pacers', 'Heat', 'Bucks', 'Wizards', 'Rockets', 'Warriors', 'Trailblazers', 'Jazz', 'Thunder', 'Pelicans', 'Spurs', 'Timberwolves')


We can even convert smoothly between tuples and lists using the class constructor functions. Notice, however, that tuples are immutable: once created, their contents can't be changed, so they have far fewer attached methods than lists (no `append()`, `insert()`, or `remove()`). Their reduced functionality means they take up less memory:

In [44]:
print(all_playoff_tuple.__sizeof__())

list_from_tuple = list(all_playoff_tuple)
print(list_from_tuple.__sizeof__())

152
232
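The practical consequence of a tuple's reduced functionality is that it can be read but not modified in place. A minimal sketch (the `seeds` tuple is illustrative, and `try`/`except` is used only to keep the example running past the error):

```python
seeds = ('Raptors', 'Celtics', '76ers')

# Reading works just like a list
print(seeds[0], len(seeds))

# Writing does not: tuples cannot be changed in place
try:
    seeds[0] = 'Hawks'
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

# Tuples also support unpacking into separate variables
first, second, third = seeds
print(second)
```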


### Sets

Like lists and tuples, sets are collections of objects. To declare a set we use braces `{ }` for the set boundaries and separate objects with commas.


However, sets have several unique properties. First of all, sets do *not* accept duplicates. Second, sets are unordered, so we can't access their contents by position with square brackets like we do with lists or tuples.


In [45]:

upcoming_opponents = {'Warriors', 'Rockets', 'Pelicans', 'Rockets', 'Bulls', 'Warriors', 'Magic'}
print(upcoming_opponents)

print(type(upcoming_opponents))


{'Pelicans', 'Rockets', 'Magic', 'Bulls', 'Warriors'}
<class 'set'>


Because they don't accept duplicates, casting to a set is a useful way of removing duplicates. For instance, we can cast from `list -> set -> list` to get the unique values and then reacquire a list's functionality. Note that the original order of the list is not preserved:

In [46]:
upcoming_games = ['Warriors', 'Rockets', 'Pelicans', 'Rockets', 'Bulls', 'Warriors', 'Magic']

upcoming_opponents = list(set(upcoming_games))
print(upcoming_opponents)


['Pelicans', 'Rockets', 'Magic', 'Bulls', 'Warriors']


Sets come equipped with the common set operations:

* `union()` combines two sets
* `intersection()` keeps only the objects both sets have in common
* `difference()` removes the objects both sets have in common (Note order matters: $A-B \neq B-A$)
* `add()` inserts an object into the set
* `remove()` deletes an object

In [47]:
sixers_players = {'Joel Embiid', 'Ben Simmons', 'JJ Redick', 'Markelle Fultz', 'Robert Covington'}
celtics_players = {'Kyrie Irving', 'Gordon Hayward', 'Al Horford', 'Jayson Tatum', 'Jaylen Brown'}

all_players = sixers_players.union(celtics_players)
print(all_players)

all_players_no_rookies = all_players.difference({'Markelle Fultz', 'Jayson Tatum'})
print(all_players_no_rookies)

sixers_players_no_rookies = sixers_players.intersection(all_players_no_rookies)
print(sixers_players_no_rookies)

sixers_players.add('TJ McConnell')
print(sixers_players)

sixers_players.remove('TJ McConnell')
print(sixers_players)

{'Kyrie Irving', 'Jaylen Brown', 'Jayson Tatum', 'Gordon Hayward', 'Al Horford', 'Joel Embiid', 'JJ Redick', 'Ben Simmons', 'Robert Covington', 'Markelle Fultz'}
{'Jaylen Brown', 'Gordon Hayward', 'Robert Covington', 'Al Horford', 'Joel Embiid', 'JJ Redick', 'Ben Simmons', 'Kyrie Irving'}
{'Joel Embiid', 'JJ Redick', 'Robert Covington', 'Ben Simmons'}
{'TJ McConnell', 'Joel Embiid', 'JJ Redick', 'Ben Simmons', 'Robert Covington', 'Markelle Fultz'}
{'Joel Embiid', 'JJ Redick', 'Ben Simmons', 'Robert Covington', 'Markelle Fultz'}
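The main set methods also have operator shorthands, and membership is tested with `in`. In this sketch `watch_list` is a hypothetical second set invented for the example:

```python
sixers = {'Joel Embiid', 'Ben Simmons', 'JJ Redick'}
watch_list = {'Ben Simmons', 'Jayson Tatum'}   # hypothetical second set

either = sixers | watch_list        # same as union()
both = sixers & watch_list          # same as intersection()
only_sixers = sixers - watch_list   # same as difference()
only_watch = watch_list - sixers    # order matters for difference

print(either)
print(both)
print(only_sixers)
print(only_watch)

# Membership tests use `in`
print('JJ Redick' in sixers)
```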


## Collections: Dictionaries

Another important Python collection is the dictionary. Dictionaries look similar to sets in that they're bounded by braces `{ }` with elements separated by commas. In a way, this notation is appropriate, since a dictionary's keys behave like a set: each key can appear only once.

Dictionaries are unique, however, because each element is actually a *pair* of objects: a **key** and a **value**. This key/value pair acts like a word/definition pair in a real dictionary. We have a word (key) in mind and then use the dictionary to look up the definition (value) associated with that particular word (key). For this reason, a Python dictionary is often called a lookup table.

When declaring a dictionary, we separate each key/value pair with a comma, and we also separate the key from the value with a colon `:` like so:


In [48]:
player_schools = {
    'Joel Embiid': 'Kansas',
    'Ben Simmons': 'LSU',
    'JJ Redick': 'Duke',
    'Markelle Fultz': 'Washington',
    'Robert Covington': 'Tennessee State'
}

type(player_schools)

dict

To look up the value for a particular key, we use square brackets `[ ]` and the key name:

In [49]:
player_schools['Markelle Fultz']

'Washington'

We can also add key/value pairs or update values in the same way we declare variables, assuming the dictionary we're adding to has already been declared:

In [50]:
player_schools['TJ McConnell'] = 'Arizona'
player_schools['Markelle Fultz'] = 'University of Washington'


While the main purpose of dictionaries is looking up values, it's also sometimes useful to get the keys or values alone. In Python 3 these come back as view objects, which we can loop over or pass to `list()`:

In [51]:
print(player_schools.keys())
print(player_schools.values())

dict_keys(['Joel Embiid', 'Ben Simmons', 'JJ Redick', 'Markelle Fultz', 'Robert Covington', 'TJ McConnell'])
dict_values(['Kansas', 'LSU', 'Duke', 'University of Washington', 'Tennessee State', 'Arizona'])


And we can also get a list of all key/value pairs using `items()`

In [52]:
player_schools.items()

dict_items([('Joel Embiid', 'Kansas'), ('Ben Simmons', 'LSU'), ('JJ Redick', 'Duke'), ('Markelle Fultz', 'University of Washington'), ('Robert Covington', 'Tennessee State'), ('TJ McConnell', 'Arizona')])
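Looking up a key that doesn't exist with square brackets raises a `KeyError`. The sketch below shows two common ways to guard against that, plus a loop over `items()`; the missing player name is just an example of a key that isn't in the dictionary:

```python
player_schools = {'Joel Embiid': 'Kansas', 'Ben Simmons': 'LSU', 'JJ Redick': 'Duke'}

# Check for a key before looking it up
print('Ben Simmons' in player_schools)
print('Dario Saric' in player_schools)

# get() returns a default instead of raising KeyError for a missing key
school = player_schools.get('Dario Saric', 'unknown')
print(school)

# Views can be converted to real lists when we need list behavior
key_list = sorted(player_schools.keys())
print(key_list)

# Loop over key/value pairs with items()
for player, school_name in player_schools.items():
    print(player, 'attended', school_name)
```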

## Vectors and Matrices: Numpy

The [`numpy`](https://docs.scipy.org/doc/numpy/reference/) package implements vectors, matrices, and vectorized operations, including [several important matrix decompositions](https://docs.scipy.org/doc/numpy-1.15.0/reference/routines.linalg.html).

We'll start by making a **numpy array**, which is similar to a list in how we construct it and access its elements.

In [53]:
import numpy as np

game_points = [112, 87, 90, 103, 126]
gp_array = np.array(game_points)
gp_array

array([112,  87,  90, 103, 126])

In [54]:
print(gp_array[1])

87


In [55]:
gp_array[3]

103

In [56]:
gp_array[[0, 1, 3, 4]]


array([112,  87, 103, 126])

In [57]:
gp_array[0:4]

array([112,  87,  90, 103])

Numpy has vector functions common to calculators, Excel, and other software. Some functions are common to the Numpy module `np` and the array class methods (e.g. `min`, `max`, `mean`, `std`, `round`), and other functions are specific to one or the other (e.g. `np.where()`, `np.concatenate()`).

In [58]:
gp_array.max()

126

In [59]:
gp_array.mean()

103.6

In [60]:
gp_array.std().round(2)

14.37

In [61]:
np.where(gp_array > 100)

(array([0, 3, 4]),)
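The comparison inside `np.where()` is itself a vectorized operation: it produces an array of booleans that can be used directly as an index. A brief sketch, assuming `numpy` is installed:

```python
import numpy as np

gp_array = np.array([112, 87, 90, 103, 126])

# A comparison on an array returns an array of booleans (a mask)
mask = gp_array > 100
print(mask)

# Indexing with the mask keeps only the elements where it is True
high_scores = gp_array[mask]
print(high_scores)

# Arithmetic is applied element by element, with no explicit loop
centered = gp_array - gp_array.mean()
print(centered)
```

So `np.where()` tells us *where* the condition holds, while the mask lets us pull out the matching values themselves.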

In [62]:
opponent_points = [75, 90, 104, 96, 130]
points_matrix = np.matrix([game_points, opponent_points])
points_matrix

matrix([[112,  87,  90, 103, 126],
        [ 75,  90, 104,  96, 130]])

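A matrix is essentially a list of lists and can be accessed with square brackets `[ ]`, with a colon `:` (or nothing at all) in place of any row or column index where we want the whole range. Note that the NumPy documentation no longer recommends the `np.matrix` class for new code and suggests regular two-dimensional arrays (`np.array`) instead.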

In [63]:
score_margin = points_matrix[0, :] - points_matrix[1, :]
# score_margin = points_matrix[0,] - points_matrix[1,]  # try this
print(score_margin)

print('\n--\n')

print(np.where(score_margin > 0))

[[ 37  -3 -14   7  -4]]

--

(array([0, 0]), array([0, 3]))
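For comparison, here is the same calculation sketched with a regular two-dimensional array built by `np.array`. The variable names `points` and `wins` are illustrative. Because indexing a row of a 2-D array gives a 1-D array, `np.where()` returns a single array of positions rather than a pair:

```python
import numpy as np

game_points = [112, 87, 90, 103, 126]
opponent_points = [75, 90, 104, 96, 130]

points = np.array([game_points, opponent_points])   # 2-D array, shape (2, 5)
print(points.shape)

# Row 0 minus row 1; the result is a 1-D array
score_margin = points[0] - points[1]
print(score_margin)

# With a 1-D array, np.where returns a single array of positions
wins = np.where(score_margin > 0)[0]
print(wins)
```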


## Data Frames: Pandas

The [pandas](http://pandas.pydata.org/pandas-docs/stable/) package will be familiar to anyone who has programmed in R. It implements a `DataFrame` class akin to R's data frames. This `DataFrame` is a labeled matrix composed of multiple `Series` objects, where each `Series` is a labeled vector.

To see this in action, let's use a `pandas` function to scrape some online data.


In [64]:
import pandas as pd
df = pd.read_html('https://www.basketball-reference.com/leagues/NBA_2018.html')
df

[          Eastern Conference   W   L   W/L%    GB   PS/G   PA/G   SRS
 0       Toronto Raptors* (1)  59  23  0.720     —  111.7  103.9  7.29
 1        Boston Celtics* (2)  55  27  0.671   4.0  104.0  100.4  3.23
 2    Philadelphia 76ers* (3)  52  30  0.634   7.0  109.8  105.3  4.30
 3   Cleveland Cavaliers* (4)  50  32  0.610   9.0  110.9  109.9  0.59
 4        Indiana Pacers* (5)  48  34  0.585  11.0  105.6  104.2  1.18
 5            Miami Heat* (6)  44  38  0.537  15.0  103.4  102.9  0.15
 6       Milwaukee Bucks* (7)  44  38  0.537  15.0  106.5  106.8 -0.45
 7    Washington Wizards* (8)  43  39  0.524  16.0  106.6  106.0  0.53
 8        Detroit Pistons (9)  39  43  0.476  20.0  103.8  103.9 -0.26
 9     Charlotte Hornets (10)  36  46  0.439  23.0  108.2  108.0  0.07
 10      New York Knicks (11)  29  53  0.354  30.0  104.5  108.0 -3.53
 11        Brooklyn Nets (12)  28  54  0.341  31.0  106.6  110.3 -3.67
 12        Chicago Bulls (13)  27  55  0.329  32.0  102.9  110


Our "data frame" `df` actually ended up being a list with four different data frames. We'll break them up


In [65]:
east_standings = df[0]
west_standings = df[1]

east_conf_standings = df[2]
west_conf_standings = df[3]

east_standings

| | Eastern Conference | W | L | W/L% | GB | PS/G | PA/G | SRS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Toronto Raptors* (1) | 59 | 23 | 0.72 | — | 111.7 | 103.9 | 7.29 |
| 1 | Boston Celtics* (2) | 55 | 27 | 0.671 | 4.0 | 104.0 | 100.4 | 3.23 |
| 2 | Philadelphia 76ers* (3) | 52 | 30 | 0.634 | 7.0 | 109.8 | 105.3 | 4.3 |
| 3 | Cleveland Cavaliers* (4) | 50 | 32 | 0.61 | 9.0 | 110.9 | 109.9 | 0.59 |
| 4 | Indiana Pacers* (5) | 48 | 34 | 0.585 | 11.0 | 105.6 | 104.2 | 1.18 |
| 5 | Miami Heat* (6) | 44 | 38 | 0.537 | 15.0 | 103.4 | 102.9 | 0.15 |
| 6 | Milwaukee Bucks* (7) | 44 | 38 | 0.537 | 15.0 | 106.5 | 106.8 | -0.45 |
| 7 | Washington Wizards* (8) | 43 | 39 | 0.524 | 16.0 | 106.6 | 106.0 | 0.53 |
| 8 | Detroit Pistons (9) | 39 | 43 | 0.476 | 20.0 | 103.8 | 103.9 | -0.26 |
| 9 | Charlotte Hornets (10) | 36 | 46 | 0.439 | 23.0 | 108.2 | 108.0 | 0.07 |


Now Jupyter even formats the table nicely. 😍 Let's try working with this table.

First, notice the column named "Eastern Conference" seems to have picked up some descriptive text used on the website, but the column name is really "Team". Let's change the column name:


In [66]:
column_names = list(east_standings.columns)
print(column_names)
column_names.remove('Eastern Conference')
column_names.insert(0, 'Team')
east_standings.columns = column_names
east_standings

['Eastern Conference', 'W', 'L', 'W/L%', 'GB', 'PS/G', 'PA/G', 'SRS']


| | Team | W | L | W/L% | GB | PS/G | PA/G | SRS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Toronto Raptors* (1) | 59 | 23 | 0.72 | — | 111.7 | 103.9 | 7.29 |
| 1 | Boston Celtics* (2) | 55 | 27 | 0.671 | 4.0 | 104.0 | 100.4 | 3.23 |
| 2 | Philadelphia 76ers* (3) | 52 | 30 | 0.634 | 7.0 | 109.8 | 105.3 | 4.3 |
| 3 | Cleveland Cavaliers* (4) | 50 | 32 | 0.61 | 9.0 | 110.9 | 109.9 | 0.59 |
| 4 | Indiana Pacers* (5) | 48 | 34 | 0.585 | 11.0 | 105.6 | 104.2 | 1.18 |
| 5 | Miami Heat* (6) | 44 | 38 | 0.537 | 15.0 | 103.4 | 102.9 | 0.15 |
| 6 | Milwaukee Bucks* (7) | 44 | 38 | 0.537 | 15.0 | 106.5 | 106.8 | -0.45 |
| 7 | Washington Wizards* (8) | 43 | 39 | 0.524 | 16.0 | 106.6 | 106.0 | 0.53 |
| 8 | Detroit Pistons (9) | 39 | 43 | 0.476 | 20.0 | 103.8 | 103.9 | -0.26 |
| 9 | Charlotte Hornets (10) | 36 | 46 | 0.439 | 23.0 | 108.2 | 108.0 | 0.07 |


While this works, there's also another way using `rename()`:

In [67]:
west_standings.rename(columns={'Western Conference': 'Team'}, inplace=True)
west_standings

| | Team | W | L | W/L% | GB | PS/G | PA/G | SRS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Houston Rockets* (1) | 65 | 17 | 0.793 | — | 112.4 | 103.9 | 8.21 |
| 1 | Golden State Warriors* (2) | 58 | 24 | 0.707 | 7.0 | 113.5 | 107.5 | 5.79 |
| 2 | Portland Trail Blazers* (3) | 49 | 33 | 0.598 | 16.0 | 105.6 | 103.0 | 2.6 |
| 3 | Oklahoma City Thunder* (4) | 48 | 34 | 0.585 | 17.0 | 107.9 | 104.4 | 3.42 |
| 4 | Utah Jazz* (5) | 48 | 34 | 0.585 | 17.0 | 104.1 | 99.8 | 4.47 |
| 5 | New Orleans Pelicans* (6) | 48 | 34 | 0.585 | 17.0 | 111.7 | 110.4 | 1.48 |
| 6 | San Antonio Spurs* (7) | 47 | 35 | 0.573 | 18.0 | 102.7 | 99.8 | 2.89 |
| 7 | Minnesota Timberwolves* (8) | 47 | 35 | 0.573 | 18.0 | 109.5 | 107.3 | 2.35 |
| 8 | Denver Nuggets (9) | 46 | 36 | 0.561 | 19.0 | 110.0 | 108.5 | 1.57 |
| 9 | Los Angeles Clippers (10) | 42 | 40 | 0.512 | 23.0 | 109.0 | 109.0 | 0.15 |


Let's try accessing some of the data. We can get a column's data using its name like an attribute after a dot `.`. Or we can pass the column name as a string in square brackets `[ ]`.

In [68]:
east_standings.W

0     59
1     55
2     52
3     50
4     48
5     44
6     44
7     43
8     39
9     36
10    29
11    28
12    27
13    25
14    24
Name: W, dtype: int64

In [69]:
east_standings['W']

0     59
1     55
2     52
3     50
4     48
5     44
6     44
7     43
8     39
9     36
10    29
11    28
12    27
13    25
14    24
Name: W, dtype: int64

These columns are `pandas.Series` objects, which are essentially a cross between a Python dictionary and a `numpy.ndarray`. 


In [70]:
east_standings.W.mean()

40.2

In [71]:
east_standings.W.std().round(2)

11.55

In [72]:
east_standings.W.values # values accesses the underlying Numpy array

array([59, 55, 52, 50, 48, 44, 44, 43, 39, 36, 29, 28, 27, 25, 24])

In [73]:
east_standings.W.keys()

RangeIndex(start=0, stop=15, step=1)

We'll explore how to process data frames in Unit 6. 

In the meantime, just note that data frames can write to and read from several file types:
* `html`, as we saw above
* `csv`, useful for working in R or Excel
* `json`
* `xlsx`

In general we can use functions `pd.read_****()` to read from files and data frame functions `df.to_****()` for writing to files.

In [74]:
east_standings.to_csv('east.csv')

east_standings_loaded = pd.read_csv('east.csv')
east_standings_loaded.mean(numeric_only=True)  # average only the numeric columns

Unnamed: 0      7.000000
W              40.200000
L              41.800000
W/L%            0.490333
PS/G          106.086667
PA/G          106.440000
SRS            -0.508667
dtype: float64

In [75]:
east_standings.to_json('east.json')

east_standings_loaded = pd.read_json('east.json')
east_standings_loaded.mean(numeric_only=True)  # average only the numeric columns

W        40.200000
L        41.800000
W/L%      0.490333
PS/G    106.086667
PA/G    106.440000
SRS      -0.508667
dtype: float64

In [76]:
east_standings.to_excel('east.xlsx')

east_standings_loaded = pd.read_excel('east.xlsx')
east_standings_loaded.mean(numeric_only=True)  # average only the numeric columns

W        40.200000
L        41.800000
W/L%      0.490333
PS/G    106.086667
PA/G    106.440000
SRS      -0.508667
dtype: float64

## Saving Data: Pickle

We've seen how to save data frames. Often we want to save other Python objects to external files, too. This way it becomes easy to resume our progress later, particularly if our processing has involved time-consuming steps. The `pickle` module can accommodate this type of saving and loading.

The function `pickle.dump()` can be used to write a `.p` file, which contains Python data (in byte form, hence the `wb` below). And the function `pickle.load()` can then be used to load from the `.p` file.

In [77]:
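import pickle

# Write the dictionary to disk; `with` closes the file for us
with open('player_schools.p', 'wb') as f:
    pickle.dump(player_schools, f)

# Read it back into a new variable
with open('player_schools.p', 'rb') as f:
    player_schools2 = pickle.load(f)

player_schools2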

{'Ben Simmons': 'LSU',
 'JJ Redick': 'Duke',
 'Joel Embiid': 'Kansas',
 'Markelle Fultz': 'University of Washington',
 'Robert Covington': 'Tennessee State',
 'TJ McConnell': 'Arizona'}
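To see the "byte form" directly, we can use `pickle.dumps()` and `pickle.loads()`, which work with a `bytes` object in memory instead of a file. This is a small sketch with illustrative variable names. One caution from the Python documentation applies to both forms: only unpickle data from sources you trust, since loading a pickle can run arbitrary code.

```python
import pickle

player_schools = {'Joel Embiid': 'Kansas', 'Ben Simmons': 'LSU'}

# dumps() returns the pickled bytes directly instead of writing a file
payload = pickle.dumps(player_schools)
print(type(payload))

# loads() rebuilds an equal, but separate, object from those bytes
restored = pickle.loads(payload)
print(restored == player_schools, restored is player_schools)
```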