# Data Structures
[Data Structures](https://docs.python.org/3.5/tutorial/datastructures.html)

### Python Tuple

In [158]:
t = 12345, 54321, 'hello!'
# note: tuples may be written with or without parentheses
t[0]

12345

In [159]:
t

(12345, 54321, 'hello!')

In [160]:
# nested tuples
u = t, (1,2,3,4,5)
u

((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))

In [161]:
# the 0th is now my original tuple
u[0]

(12345, 54321, 'hello!')

In [162]:
# 1th is the second tuple
u[1]

(1, 2, 3, 4, 5)

In [163]:
# tuples are immutable
t[0] = 88888

TypeError: 'tuple' object does not support item assignment

In [165]:
# a tuple can contain a list
# the nested list stays mutable even though the tuple itself is not
tup1 = 1, 2, 3
list1 = ['a', 'b', 'c']

tup2 = tup1, list1
tup2

((1, 2, 3), ['a', 'b', 'c'])
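The cell above puts a list inside a tuple. As a minimal sketch (the names `rebound` and the appended `'d'` are illustrative, not from the original notebook), the following shows that the nested list can still be changed in place, while the tuple's own slots cannot be reassigned:

```python
tup1 = 1, 2, 3
list1 = ['a', 'b', 'c']
tup2 = tup1, list1

# Mutating the nested list in place works; tup2 holds a reference
# to the same list object, so it sees the change.
tup2[1].append('d')
print(tup2)

# Reassigning a slot of the tuple itself is still rejected.
try:
    tup2[1] = ['x']
    rebound = True
except TypeError:
    rebound = False
print(rebound)
```

So "immutable" here describes the tuple's references, not the objects those references point to.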

Tuples usually contain a heterogeneous sequence of elements
that are accessed via unpacking or indexing, while lists are usually homogeneous and are accessed by iterating over the list.

In [166]:
# empty tuple has empty parentheses
empty_tup = ()
empty_tup

()

In [167]:
# one item tuple needs a trailing comma
one_item_tup = 'hello',
# note: without that trailing comma, the variable is a string
one_item_tup

('hello',)

In [168]:
# sequence unpacking - left side number of variables must
# equal number of items in the tuple. Assignment is made
# accordingly
u = 99999, 88888, 'goodbye'
a, b, c = u
b

88888
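The unpacking rule above (same number of names as items) can be checked directly. This is a small sketch; the variable names `mismatch_raised`, `first`, and `rest` are made up for illustration. It also shows starred unpacking, which is one way to collect the leftover items when the counts differ:

```python
u = 99999, 88888, 'goodbye'

# Two names for three items: Python raises ValueError.
try:
    a, b = u
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)

# A starred name collects whatever is left over, as a list.
first, *rest = u
print(first, rest)
```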

### Python Dictionary

Dictionaries are indexed by 'keys', which can be any immutable type, such as strings, numbers, and tuples (if the tuples contain only immutable items such as numbers or strings). Lists cannot be used as keys since they can be modified in place using index assignments. Think of a dict as a set of key:value pairs with the requirement that keys are unique within one dict. The Python 3.5 tutorial linked above describes this set as unordered; since Python 3.7, dicts preserve insertion order. The main operations on a dict are storing a value with a key and extracting the value given that key. Storing with a key that is already in use replaces the old value.
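The key rules in the paragraph above can be tried out in a few lines. This is only a sketch; `grid`, `list_key_ok`, and `nested_key_ok` are hypothetical names chosen for the example:

```python
tel = {'jack': 4098}

# Storing under a key that already exists replaces the old value.
tel['jack'] = 1111

# A tuple of immutable items is hashable, so it can be a key.
grid = {(0, 1): 'north'}

# A list is mutable and unhashable, so using it as a key raises TypeError.
try:
    bad = {[0, 1]: 'north'}
    list_key_ok = True
except TypeError:
    list_key_ok = False

# A tuple that contains a list is not hashable either.
try:
    bad = {(0, [1]): 'north'}
    nested_key_ok = True
except TypeError:
    nested_key_ok = False

print(tel, grid, list_key_ok, nested_key_ok)
```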

In [169]:
# making a dict
tel = {'jack': 4098, 'sape': 4139}
tel

{'jack': 4098, 'sape': 4139}

In [170]:
# add a new key:value pair; the output shown is sorted by key, not by insertion position
tel['guido'] = 4127
tel

{'guido': 4127, 'jack': 4098, 'sape': 4139}

In [171]:
# show item based on key
tel['jack']

4098

In [172]:
# delete item based on key
del tel['guido']
tel

{'jack': 4098, 'sape': 4139}

In [173]:
# list keys in arbitrary order
list(tel.keys())

['sape', 'jack']

In [174]:
# list keys in sorted order
sorted(tel.keys())

['jack', 'sape']

In [175]:
# check if a key exists
'guido' in tel

False

In [176]:
'jack' in tel

True

In [177]:
# manually create a dict
dict_values = {2: 4, 4: 16, 6: 36}
dict_values

{2: 4, 4: 16, 6: 36}

In [178]:
# use dict comprehension to create a dict
dict_values_2 = {x: x**2 for x in (2, 4, 6)}
# note the loop and how x is used to create the key and compute the value
dict_values_2

{2: 4, 4: 16, 6: 36}

In [179]:
# the dict constructor
dict([('sape', 4139), ('guido', 4127), ('jack', 4098)])

{'guido': 4127, 'jack': 4098, 'sape': 4139}

In [180]:
# specify pairs using keyword arguments in constructor
dict(sape=4139, guido=4127, jack=4098)

{'guido': 4127, 'jack': 4098, 'sape': 4139}

### Python lists

In [185]:
i_am_a_list = [1,2,3,4,5,6]
i_am_a_list

[1, 2, 3, 4, 5, 6]

In [186]:
type(i_am_a_list)

list

### NumPy ndarray (N-dimensional array)

In [181]:
# An ndarray is a (usually fixed-size) multidimensional
# container of items of the same type and size.

In [204]:
import numpy as np

# creates a 2 x 3 (2 rows, 3 columns) 2D array of int32s
# without np.int32, defaults to the platform's default integer type (int64 on most 64-bit systems)
x = np.array([[1,2,3], [4,5,6]], np.int32)
"""
[1,2,3]
[4,5,6]

Note: the input is a list of nested lists; NumPy stores the values as one typed array rather than as nested lists.

"""

# object type is numpy.ndarray
type(x)

numpy.ndarray

In [205]:
# shape is 2 x 3 (2 rows, 3 columns)
x.shape

(2, 3)

In [206]:
# data type is int32
x.dtype

dtype('int32')

In [207]:
# access element [row, column]
# row 1, column 2 (both zero-based)
x[1,2]

6

#### learn slicing first

In [208]:
# Slicing - [start index (included) : stop index (excluded) : step], where
# step is the number of positions to advance from the start index each time
y = np.array([0,1,2,3,4,5,6,7,8,9])
y

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [191]:
# include 0th, exclude 9th, step every two
y[0:9:2]

array([0, 2, 4, 6, 8])

In [192]:
# negative indexes count from the right (-1 is the last item),
# so -10 is the 0th item of this 10-item array
y[-10:9:2]

array([0, 2, 4, 6, 8])

In [193]:
# if start (included) is to the right of stop (excluded), step
# must be negative, moving leftward from start
y[-3:3:-1]

array([7, 6, 5, 4])
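A few more slicing forms follow from the same start:stop:step rule. The sketch below assumes the same ten-item array; the names `evens`, `reversed_y`, and `same` are illustrative. Omitted parts fall back to defaults, and the built-in `slice` object is the explicit form of the colon notation:

```python
import numpy as np

y = np.arange(10)  # same values as np.array([0, 1, ..., 9])

# Omitting start and stop means "from one end to the other".
evens = y[::2]

# A step of -1 with no start or stop walks the array backwards.
reversed_y = y[::-1]

# y[0:9:2] is shorthand for indexing with a slice object.
same = y[slice(0, 9, 2)]

print(evens, reversed_y, same)
```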

#### back to ndarray

In [209]:
x

array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)

In [215]:
# slicing an ndarray
# in this case, [all rows, column 1]
z = x[:,1]
z

array([2, 5], dtype=int32)

In [217]:
# slice the first 2x2
z3 = x[:2,:2]
z3

array([[1, 2],
       [4, 5]], dtype=int32)

In [220]:
# basic slicing returns a view that shares memory with x, so the original is changed
z[0] = 9 # changes the 0th item in the z variable
z

array([9, 5], dtype=int32)

In [221]:
# z is a view of x, so x now shows the change too
x

array([[1, 9, 3],
       [4, 5, 6]], dtype=int32)
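The change to `x` above happens because a basic slice shares memory with the array it came from. As a hedged sketch (the names `view` and `independent` are made up here), `np.shares_memory` can confirm this, and `.copy()` is one way to get a slice that is safe to modify:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)

view = x[:, 1]                # basic slice: shares memory with x
independent = x[:, 1].copy()  # explicit copy: owns its own memory

print(np.shares_memory(x, view))
print(np.shares_memory(x, independent))

# Writing to the copy leaves x untouched.
independent[0] = 9
x_after_copy_write = x.copy()

# Writing to the view changes x as well.
view[0] = 9
print(x)
```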

### pandas Series

In [222]:
import numpy as np
import pandas as pd

In [None]:
# Series is a one-dimensional labeled array capable of holding
# any data type. It can be built from a Python dict, an ndarray, or a scalar value (like 5)

In [226]:
# Series using a dict
mydict = {'one':1, 'two':2, 'three':3}
s_dict = pd.Series(mydict)
s_dict

one      1
three    3
two      2
dtype: int64

In [231]:
# Series using a tuple; note the letters passed as index labels
mytupl = (1,2,3)
s_tupl = pd.Series(mytupl, index=['a','b','c'])
s_tupl

a    1
b    2
c    3
dtype: int64

In [234]:
# Series using a list, also with letters passed as index labels
mylist = [1,2,3]
s_mylist = pd.Series(mylist, index=['a','b','c'])
s_mylist

a    1
b    2
c    3
dtype: int64

In [227]:
# Series using an ndarray
s = pd.Series(np.array([1,2,3,4,5,6]), index=['a','b','c','d','e','f'])
s

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

In [228]:
# return index
s.index

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

In [94]:
# have index automatically assigned by not specifying it
t = pd.Series(np.random.randn(5))
t

0   -0.260825
1   -0.318876
2    0.321525
3   -2.724488
4    2.456584
dtype: float64

#### More on Series using a dict

In [235]:
# using a dict, the keys become the axis labels
# NOTE: this output is sorted by key; newer pandas versions keep the dict's insertion order instead
d = {'a':0., 'c':1., 'b':2.}
s_fromdict = pd.Series(d)
s_fromdict

a    0.0
b    2.0
c    1.0
dtype: float64

In [236]:
# passing an index explicitly forces an ordering
# note: the index is given as a list
# note: NaN (Not a Number) marks the missing value for 'd', a label that is not in the dict
s_fromdict_ordered = pd.Series(d, index=['b', 'c', 'd', 'a'])
s_fromdict_ordered

b    2.0
c    1.0
d    NaN
a    0.0
dtype: float64

#### Series using a scalar

In [237]:
# a scalar is a single value
# the scalar is repeated once for each index label
s_fromscalar = pd.Series(5., index=['a','b','c','d','e'])
s_fromscalar

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

#### Series is ndarray-like

In [240]:
# my original Series using an ndarray
s

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

In [241]:
# access the 0th item by position (newer pandas prefers s.iloc[0])
s[0]

1

In [108]:
# so I can slice
s[:3]

a    1
b    2
c    3
dtype: int64

In [111]:
# boolean indexing: the comparison builds a True/False mask and
# only the items where it is True are kept; s.median() is 3.5
s[s > s.median()]

d    4
e    5
f    6
dtype: int64

In [112]:
# a list of positions retrieves those items in the given order
s[[4,3,1]]

e    5
d    4
b    2
dtype: int64
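The cells above index a letter-labeled Series with integers, which pandas treats as positions. A minimal sketch of the explicit alternatives follows: `.iloc` for positions and `.loc` for labels (both are standard pandas accessors; the result names are made up for illustration). One difference worth noting is that label slices include the stop label, while position slices exclude the stop position:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 4, 5, 6]), index=['a', 'b', 'c', 'd', 'e', 'f'])

first_by_position = s.iloc[0]   # position 0
first_by_label = s.loc['a']     # label 'a'

picked = s.iloc[[4, 3, 1]]      # positions 4, 3, 1 in that order

by_position = s.iloc[:3]        # positions 0, 1, 2 (stop excluded)
by_label = s.loc['a':'c']       # labels 'a' through 'c' (stop included)

print(first_by_position, first_by_label)
print(picked)
print(by_position.equals(by_label))
```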

In [242]:
# the exponential function e^x, applied to each element (e is Euler's number, about 2.718)
np.exp(s)

a      2.718282
b      7.389056
c     20.085537
d     54.598150
e    148.413159
f    403.428793
dtype: float64

#### series is dict-like

In [243]:
# again, from my original Series from an ndarray
s['a']

1

In [244]:
'e' in s

True

In [245]:
'z' in s

False

In [246]:
# the key is missing, so get() returns None and nothing is displayed
s.get('x')

In [247]:
# with a default supplied, get() returns that default instead (nan here)
s.get('x', np.nan)

nan

#### vectorized operations and label alignment with Series

In [248]:
# my original Series using an ndarray
s

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

In [249]:
s + s

a     2
b     4
c     6
d     8
e    10
f    12
dtype: int64

In [250]:
s * 2

a     2
b     4
c     6
d     8
e    10
f    12
dtype: int64

In [251]:
np.exp(s)

a      2.718282
b      7.389056
c     20.085537
d     54.598150
e    148.413159
f    403.428793
dtype: float64

In [252]:
# original Series with ndarray
s

a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

In [253]:
# slice and add: values are matched by label, so labels present
# in only one operand ('a' and 'f') give NaN
s[1:] + s[:-1]

a     NaN
b     4.0
c     6.0
d     8.0
e    10.0
f     NaN
dtype: float64

In [254]:
# drop the NaN entries (labels found in only one operand)
p = s[1:]+s[:-1]
p.dropna()

b     4.0
c     6.0
d     8.0
e    10.0
dtype: float64
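Dropping the NaN rows is one way to handle labels that appear in only one operand. Another option, sketched here under the assumption that a missing value should count as zero, is `Series.add` with its `fill_value` parameter (the names `left`, `right`, and `total` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 4, 5, 6]), index=['a', 'b', 'c', 'd', 'e', 'f'])

left = s.iloc[1:]    # labels b..f
right = s.iloc[:-1]  # labels a..e

# Labels missing from one side are treated as 0 instead of producing NaN.
total = left.add(right, fill_value=0)
print(total)
```

Whether zero is the right stand-in depends on the data; `dropna()` is the safer choice when a missing value really means "unknown".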

In [255]:
# you can add a name
q = pd.Series(np.random.randn(5), name = 'something')
q

0    2.602992
1    0.935702
2    0.640102
3    0.100129
4   -1.569747
Name: something, dtype: float64

In [256]:
# rename returns a new Series with a different name
q2 = q.rename('different')
q2

0    2.602992
1    0.935702
2    0.640102
3    0.100129
4   -1.569747
Name: different, dtype: float64

In [257]:
q2.name

'different'

### DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
Accepts:
* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

The 'index' parameter gives the row labels and the 'columns' parameter gives the column labels.

In [258]:
# from dict of Series
# the row labels are the union of the Series indexes
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [259]:
# only show certain indices and in certain order
d2 = pd.DataFrame(d, index=['d','b','a'])
d2

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [260]:
# only show certain columns and in certain order
d3 = pd.DataFrame(d, index=['d','b','a'], columns=['two'])
d3

   two
d  4.0
b  2.0
a  1.0


In [149]:
# return index 
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [150]:
# return columns
df.columns

Index(['one', 'two'], dtype='object')
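The "dict of Series" picture of a DataFrame can be checked by pulling a column back out. This is a small sketch using the same data as above; `col` and `missing_rows` are names invented for the example:

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Selecting a column by its label returns a Series named after the column.
col = df['one']
print(type(col), col.name)

# Row 'd' exists only in 'two', so column 'one' holds NaN there.
missing_rows = list(df.index[df['one'].isna()])
print(missing_rows)
```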

In [152]:
#### DataFrame from dict of lists
d = {'one': [1.,2.,3.,4.], 'two':[5.,6.,7.,8.]}
pd.DataFrame(d)

   one  two
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0


In [154]:
# change index (row labels)
pd.DataFrame(d, index=['a','b','c','d'])

   one  two
a  1.0  5.0
b  2.0  6.0
c  3.0  7.0
d  4.0  8.0


In [261]:
np.zeros(2,)

array([ 0.,  0.])

# Stopped at 'From structured or record array'...