# Python for Data Analysis (Notes)

## Overview
- Chpt. 1: Preliminaries
- Chpt. 2-3: Python Language Basics, Built-in Data Structures, Functions, Files
- Chpt. 4: NumPy Basics (Vectorized Computation)
- Chpt. 5: pandas
- Chpt. 6: Data Loading, Storage, File Formats
- Chpt. 7: Data Cleaning, Preparation
- Chpt. 8: Data Wrangling
- Chpt. 9: Data Visualization (Matplotlib, seaborn)
- Chpt. 10: Data Aggregation, Group Operations
- Chpt. 11: Time Series
- Chpt. 12: Advanced pandas
- Chpt. 13: Modeling Libraries (Patsy, statsmodels, scikit-learn)
- Chpt. 14: Case Study
- Appendix A: Advanced NumPy

## Chpt. 1-2: Preliminaries, Python Language Basics
- (p2) Python: **scripting language** -> can be used to quickly write small programs or scripts to automate other tasks
- (p3) Python: challenging language for building highly concurrent, multithreaded applications (esp. apps with CPU-bound threads) 
    - reason: **global interpreter lock (GIL)**, a mechanism that prevents the interpreter from executing more than one Python instruction at a time
- (p5) pandas: name derived from _panel data_ and _Python data analysis_
- (p6) IPython & Jupyter: encourage an _execute-explore_ workflow
- (p10) do NOT update **conda** packages with **pip** -> can lead to environment problems
- (p14) **Syntactic sugar**
- (p23) object introspection: put a question mark `?` before or after a variable (e.g. `obj?`) to display general information about the object
- (p28) **magic commands**: command-line programs to be run within IPython
- (p31) Everything is a _**Python object**_: each object has an associated _type_ and internal data 
- (p32,33) Assignment of a variable (bound variables) -> creating a **reference** -> _binding_: binding a name to an object
- (p33) object _references_ have no type associated
- (p34) **variables** are names for objects within a particular **namespace**, the type information is stored in the object itself.
- (p34) check if an object is an instance of a particular type using `isinstance` function
- (p35) Objects have **attributes** and **methods** -> accessed via `obj.attribute_name` or `getattr()`
- (p35) **duck typing**: check if an object has certain methods or behavior
- (p37) to check whether two references refer to the same object, use the `is` keyword (or `is not`); `is` is not the same as the `==` operator, which compares values
- (p37) **Binary operators**
- (p39) **scalar types**: `None`, `str` (Unicode strings), `bytes` (raw ASCII bytes), `float` (double-precision), `bool`, `int`
- (p39) floor division operator: `//`
- (p38,40) Python strings and tuples are immutable
- (p41) the backslash character `\` is an _escape character_; for strings with many backslashes, preface the leading quote of the string with `r` (stands for _raw_) so that the characters are interpreted as is
- (p41) string objects -> `format` method: substitutes formatted arguments into the string, thereby producing a new string; format specs include `{:.2f}`, `{:d}`, `{:s}`
- (p42) Bytes and Unicode
- (p43) **Boolean** values are combined with `and` and `or` keywords
- (p44) **None**: null value type, common default value for function arguments, reserved keyword, a unique instance of `NoneType`
- (p44,45) `datetime`, `date`, `time` type
- (p49) ternary expression: `value = true-expr if condition else false-expr`
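
The notes above on binding, `is` versus `==`, `isinstance`, and `str.format` can be illustrated with a minimal sketch. The variable names are illustrative only, and the format string mirrors the style of spec listed in the notes.

```python
# Binding: assignment creates a new reference to the same object, not a copy.
a = [1, 2, 3]
b = a            # b is bound to the same list object as a
c = list(a)      # list() builds a new list with equal contents

a.append(4)      # mutating through one name is visible through the other
print(b)         # [1, 2, 3, 4]

same_object = a is b     # True: both names refer to one object
equal_copy = a == c      # False: c was copied before the append
distinct = a is not c    # True: c is a separate object

# isinstance accepts a tuple of types to check several at once.
is_number = isinstance(5, (int, float))   # True

# str.format substitutes formatted arguments and returns a new string.
template = '{0:.2f} {1:s} are worth US${2:d}'
formatted = template.format(4.5560, 'Argentine Pesos', 1)
print(formatted)  # 4.56 Argentine Pesos are worth US$1
```

Because `b` and `a` share one object, the append shows up under both names, while the copy `c` keeps its original contents.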



In [None]:
# an object is iterable if it implements the iterator protocol,
# i.e. it has an __iter__ 'magic method'; the `iter` function is an easy way to check
def isiterable(obj):
    try:
        iter(obj)
        return True
    except TypeError: # not iterable
        return False

# enable a function to accept any kind of sequence (list, tuple, ndarray)
# or even an iterator: convert the input to a list if it is not one already
x = (1, 2, 3) # example input, any iterable works here
if not isinstance(x, list) and isiterable(x):
    x = list(x)

In [2]:
from datetime import datetime, date, time

dt = datetime(2011, 10, 29, 20, 30, 21)
print(dt.day)
print(dt.minute)
# extract date and time objects by calling methods on a datetime instance
print(dt.date())
print(dt.time())
# format a `datetime` as a string
print(dt.strftime('%m/%d/%Y %H:%M'))
# parse a string into `datetime` objects
print(datetime.strptime('20091031', '%Y%m%d'))
# before aggregating or grouping time series data
print(dt.replace(minute=0, second=0))


29
30
2011-10-29
20:30:21
10/29/2011 20:30
2009-10-31 00:00:00
2011-10-29 20:00:00


## Chpt. 3: Built-in Data Structures
- **Tuple** (p52-54)
    - convert any sequence or iterator to a tuple by invoking `tuple`: `tup = tuple('string')`
    - once tuple is created, it's not possible to modify which object is stored in each slot
    - if the object inside a tuple is mutable, the object can be modified in-place
    - concatenate tuples using `+` to produce a longer tuple
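    - `(a, b) * n` creates a longer tuple by repeating the elements; the objects themselves are not copied, only the references to them
    - syntax `*rest`: captures the remaining elements when unpacking (`a, b, *rest = vals`); also used in function signatures to capture an arbitrarily long list of positional arguments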
    - use underscore `_` for unwanted variables: `a, b, *_ = vals`
    - useful method: `count`
    
- **List** (p54-61)
    - function `list()`: materialize an iterator or generator expression
    - methods: `append(obj)`, `insert(idx, obj)` (computationally expensive compared with `append`; consider `collections.deque` for inserting at both ends), `pop(idx)`, `remove(obj)`
    - check if a list contains a value using the `in` keyword -> linear scan across the list (much slower than with a _dict_ or _set_, which use hash tables for constant-time lookup)
    - concatenating & combining: `+` (expensive, since a new list is created and the objects copied over) or `extend(some_list)` (appends in place, usually preferable)
    - sort: 
        - in-place with `sort(key=lambda...)`
        - return a sorted copy with `sorted()`
    - binary search with the built-in `bisect` module: `bisect.bisect` finds and returns the location where an element should be inserted to keep the list sorted, `bisect.insort` inserts the element into that location
    - `bisect` functions do not check whether the list is sorted; using them on an unsorted list may succeed without error but give incorrect results
    - slicing: the indexing operator `[]` takes `start:stop:step`; `seq[::-1]` reverses the sequence
    - built-in sequence functions: `enumerate()`, `sorted`, `zip`, `reversed`
    
- **dict** (p61-65)
    - hash map, or associative array
    - elements are accessed, inserted, and set with the same `[]` syntax used for accessing elements of a list or tuple
    - delete values using `del` keyword or `pop` method
    - methods: 
        - `keys()`, `values()` return iterable views of the dict's keys and values (wrap them in `list()` to materialize)
        - merge one dict into another using `update()` -> in-place changes
    - create a dict from two sequences by pairing them up: `mapping = dict(zip(keys, values))`
    - default values:
        - `get`: `value = some_dict.get(key, default_value)`
        - `setdefault`
        - `collections.defaultdict`
    - _values_ can be any Python object
    - **hashability**
        - _keys_ generally have to be immutable objects like scalar types (int, float, string) or tuples (all objects in the tuple must be immutable too)
        - check if an object is hashable using `hash(obj)` function
        - convert a list to tuple in order to use it as a key: `d[tuple([1, 2, 3])]`
        
- **Set** (p65-67)
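    - create a set: 1. `set` function, 2. _set literal_ with curly braces, e.g. `{1, 2, 3}` (empty braces `{}` create a dict, so use `set()` for an empty set)
    - set operations: union, intersection, difference, symmetric difference (may use binary operators: `|` for union, `&` for intersection, `-` for difference, `^` for symmetric difference)
    - set elements must be hashable (generally immutable); list-like elements must first be converted into tuples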
    - check if a set is a subset or superset of another set using `set_a.issubset(set_b)`, `set_a.issuperset(set_b)`
    
- List, Set, and Dict Comprehensions
    - `[expr for val in collection if condition]`
    - `dict_comp = {key-expr: value-expr for value in collection if condition}`
    - `set_comp = {expr for value in collection if condition}`
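
The three comprehension forms above can be sketched on one small list. The list `strings` and the result names are illustrative assumptions, not part of the notes.

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# List comprehension: keep strings longer than two characters, upper-cased.
upper_long = [s.upper() for s in strings if len(s) > 2]

# Set comprehension: the distinct string lengths.
unique_lengths = {len(s) for s in strings}

# Dict comprehension: map each string to its position in the list.
loc_mapping = {val: index for index, val in enumerate(strings)}

print(upper_long)               # ['BAT', 'CAR', 'DOVE', 'PYTHON']
print(sorted(unique_lengths))   # [1, 2, 3, 4, 6]
print(loc_mapping['dove'])      # 4
```

The set is printed through `sorted` because set iteration order should not be relied on.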

In [4]:
vals = 1, 2, 2, 4, 5, 3, 2 
a, b, *rest = vals
print(a, b)
print(rest)
print(vals)
print(vals.count(2))

1 2
[2, 4, 5, 3, 2]
(1, 2, 2, 4, 5, 3, 2)
3


In [5]:
gen = range(10)
print(gen)
print(list(gen))

range(0, 10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
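
The List notes on sorting and slicing can be sketched as follows. The sample data is chosen for illustration; only built-in list behavior is used.

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# sorted() returns a new list and leaves the original untouched.
by_length = sorted(words, key=len)
print(by_length)   # ['He', 'saw', 'six', 'small', 'foxes']
print(words)       # ['saw', 'small', 'He', 'foxes', 'six']

# list.sort() sorts in place and returns None.
result = words.sort()
print(result)      # None
print(words)       # ['He', 'foxes', 'saw', 'six', 'small']

# Slicing with start:stop:step; the stop index is not included.
seq = [7, 2, 3, 7, 5, 6, 0, 1]
middle = seq[1:5]          # [2, 3, 7, 5]
every_other = seq[::2]     # [7, 3, 5, 0]
reversed_seq = seq[::-1]   # [1, 0, 6, 5, 7, 3, 2, 7]
```

Note that the default string sort places uppercase letters before lowercase ones, which is why `'He'` comes first after `words.sort()`.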


In [8]:
import bisect
c = [1, 2, 2, 2, 3, 4, 7]
print(bisect.bisect(c, 2), c)
print(bisect.insort(c, 6), c)

4 [1, 2, 2, 2, 3, 4, 7]
None [1, 2, 2, 2, 3, 4, 6, 7]


In [9]:
# use enumerate to compute a dict mapping the values (unique) of a sequence to their location
some_list = ['PEK', 'SH', 'GZ', 'SZ']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
mapping

{'PEK': 0, 'SH': 1, 'GZ': 2, 'SZ': 3}

In [11]:
# use zip with * to "unzip" a sequence, similar to converting a list of rows into a list of columns
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'), ('Schilling', 'Curt')]
first_names, last_names = zip(*pitchers)
print(first_names, last_names)

('Nolan', 'Roger', 'Schilling') ('Ryan', 'Clemens', 'Curt')


In [13]:
# dict: default values
# categorize a list of words by their first letters
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
print(by_letter)

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}


In [15]:
# use collections.defaultdict
from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
print(by_letter)

defaultdict(<class 'list'>, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})
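
The Set and hashability notes from the list above can be sketched in the same spirit. The helper `is_hashable` is a hypothetical name introduced here for illustration; it simply wraps the built-in `hash`.

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

union = a | b            # same as a.union(b)
intersection = a & b     # same as a.intersection(b)
difference = a - b       # elements in a but not in b
sym_diff = a ^ b         # elements in exactly one of the two sets

print(sorted(union))          # [1, 2, 3, 4, 5, 6, 7, 8]
print(sorted(intersection))   # [3, 4, 5]
print(sorted(difference))     # [1, 2]
print(sorted(sym_diff))       # [1, 2, 6, 7, 8]
print({1, 2, 3}.issubset(a))  # True
print(a.issuperset({1, 2}))   # True

# Empty braces create a dict, not a set.
print(type({}).__name__, type(set()).__name__)   # dict set


def is_hashable(obj):
    """Hypothetical helper: True if obj can be a dict key or set element."""
    try:
        hash(obj)
        return True
    except TypeError:
        return False


print(is_hashable((1, 2, (2, 3))))   # True: a tuple of hashable objects
print(is_hashable((1, 2, [2, 3])))   # False: the inner list is mutable

# Convert a list to a tuple in order to use it as a key.
d = {}
d[tuple([1, 2, 3])] = 5
print(d)   # {(1, 2, 3): 5}
```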


## Chpt. 3: Functions and Files
- (p70) **None** is returned automatically when there is no return statement 
- (p70) **keyword** arguments _must_ follow the **positional arguments** (if any)
- (p70) **namespace**: variable scope
- (p71) heavy use of the `global` keyword -> may indicate a need for OOP (classes)
- (p72) **functions are objects**, can be used as arguments to other functions -> example below for cleaning strings
- (p73) anonymous/**lambda function** (never given an explicit `__name__` attribute) -> useful in cases where data transformation functions take functions as arguments
- (p74) Currying: derive new functions from existing ones by _partial argument application_
- (p75) iterator
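- (p75) generator: a concise way to construct a new iterable object; it returns a sequence of values lazily, pausing after each one until the next is requested
    - use the `yield` keyword instead of `return`
    - generator expression: like a list comprehension, but within parentheses instead of brackets
        - can be used as a function argument in place of a list comprehension, e.g. `sum(x ** 2 for x in range(100))`, without building the intermediate list
- (p76) **itertools** module: `itertools.groupby(iterable[, keyfunc])` generates a group for each run of consecutive elements with the same key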
- (p78) `try-except-else-finally` block:
    - catch multiple exception types by writing a tuple of exception types: `except (TypeError, ValueError):`
    - `finally`: always executes regardless of whether `try` block succeeds or not
    - `else`: only executes when `try` block succeeds
- (p79) running `%run script` in IPython prints a full call stack trace (traceback) by default when an exception is raised
    - `%xmode Plain` or `%xmode Verbose`: control the amount of context shown
    - `%debug` or `%pdb`: step into the stack (interactive post-mortem debugging) after an error
- (p80) EOL(end-of-line) markers
- (p81,82) file method: `read([size])`, `readlines([size])`, `tell`, `seek(pos)`
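
The `try-except-else-finally` notes above can be sketched with a small function. The `log` list is an assumption added purely to record which clauses ran; it is not part of the notes.

```python
def attempt_float(x):
    """Try to convert x to float; return (value, log of clauses that ran)."""
    log = []
    try:
        value = float(x)
    except (TypeError, ValueError):   # a tuple catches several exception types
        log.append('except')
        value = x                     # fall back to the input unchanged
    else:
        log.append('else')            # runs only if the try block succeeded
    finally:
        log.append('finally')         # runs whether or not try succeeded
    return value, log


print(attempt_float('1.2345'))     # (1.2345, ['else', 'finally'])
print(attempt_float('something'))  # ('something', ['except', 'finally'])
print(attempt_float((1, 2)))       # ((1, 2), ['except', 'finally'])
```

The string `'something'` raises `ValueError` and the tuple raises `TypeError`, so both are handled by the same `except` clause.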

- compare two approaches of string cleaning

In [16]:
# use regular expression to clean data
import re
def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip() # strip leading and trailing whitespace
        value = re.sub('[!#?]', '', value) # remove punctuation symbols
        value = value.title() # standardize on proper capitalization
        result.append(value) # collect the cleaned string
    return result

In [17]:
def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

> The second approach is a more functional pattern that makes it easy to modify how the strings are transformed at a very high level. The `clean_strings` function is therefore more reusable and generic.
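
As a sketch of how the functional version composes with currying and lambdas from the notes above: the operations list can mix built-in methods, a partially applied `re.sub`, and a lambda. The `collapse_spaces` step and the sample `states` list are assumptions added for illustration; `clean_strings` is repeated so the block runs on its own.

```python
import re
from functools import partial


def clean_strings(strings, ops):
    """Apply each function in ops, in order, to every string."""
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result


# Partial application: fix the pattern and replacement arguments of re.sub.
remove_punctuation = partial(re.sub, '[!#?]', '')
# A lambda works as well for a one-off transformation (hypothetical extra step).
collapse_spaces = lambda value: re.sub(r'\s+', ' ', value)

states = ['   Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda',
          'south   carolina##', 'West virginia?']
clean_ops = [str.strip, remove_punctuation, collapse_spaces, str.title]
cleaned = clean_strings(states, clean_ops)
print(cleaned)
# ['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida',
#  'South Carolina', 'West Virginia']
```

Adding or reordering a cleaning step only means editing `clean_ops`; the `clean_strings` function itself stays unchanged.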

In [22]:
# generator
def squares(n=10):
    for i in range(1, n+1):
        yield i**2

gen = squares()
for x in gen:
    print(x, end=' ')

1 4 9 16 25 36 49 64 81 100 

In [23]:
# generator expression
gen = (x**2 for x in range(100))
sum(gen)

328350
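
The `itertools.groupby` note from the list above can be sketched as follows. The sample names are illustrative; the key point is that `groupby` only groups consecutive elements, so unsorted input may yield the same key more than once.

```python
import itertools

first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

# Each group is itself an iterator, so materialize it with list().
runs = [(letter, list(group))
        for letter, group in itertools.groupby(names, first_letter)]
print(runs)
# [('A', ['Alan', 'Adam']), ('W', ['Wes', 'Will']),
#  ('A', ['Albert']), ('S', ['Steven'])]

# Sorting by the same key first yields one group per distinct key.
ordered = sorted(names, key=first_letter)
grouped = {letter: list(group)
           for letter, group in itertools.groupby(ordered, first_letter)}
print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```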

In [None]:
# file > get an EOL-free list of lines
with open(path) as f:
    lines = [x.rstrip() for x in f]
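
The file methods listed in the notes (`read`, `readlines`, `tell`, `seek`) can be tried on a throwaway file. This is a minimal sketch that creates its own temporary file, so the path and contents are assumptions made for the example.

```python
import os
import tempfile

# Write a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('first line\nsecond line\nthird line\n')
    path = tmp.name

try:
    with open(path, encoding='utf-8') as f:
        head = f.read(5)        # read the first 5 characters
        position = f.tell()     # current position of the file handle
        f.seek(0)               # jump back to the start of the file
        lines = f.readlines()   # list of lines, EOL markers included
        stripped = [line.rstrip() for line in lines]   # EOL-free lines
finally:
    os.remove(path)             # clean up even if reading fails

print(head)       # first
print(lines)      # ['first line\n', 'second line\n', 'third line\n']
print(stripped)   # ['first line', 'second line', 'third line']
```

The `with` statement closes the file automatically when the block exits, and the `finally` clause removes the temporary file in all cases.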

## Chpt. 4, Appendix A: NumPy

### Skill Checklist

#### Fundamental
- array object: ndarray
- arithmetic with NumPy arrays
- indexing, slicing, boolean indexing, fancy indexing
- transpose arrays, swap axes
- universal functions (fast element-wise array function)
- array-oriented programming
- linear algebra
- pseudorandom number generation

#### Advanced
- array manipulation
- broadcasting
- advanced ufunc usage
- structured arrays
- indirect sorts: argsort, lexsort
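
A few items from the checklists above can be sketched in a handful of lines. This assumes NumPy is installed and imported as `np`; the arrays are illustrative only.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # 3x4 ndarray holding 0..11

doubled = arr * 2                   # vectorized element-wise arithmetic
evens = arr[arr % 2 == 0]           # boolean indexing, returns a 1-D array
picked = arr[[2, 0]]                # fancy indexing: rows 2 and 0, in that order
transposed = arr.T                  # transpose, shape (4, 3)
roots = np.sqrt(arr)                # universal function (element-wise)

# Broadcasting: the column means (shape (4,)) are subtracted from every row.
demeaned = arr - arr.mean(axis=0)

# Indirect sort: argsort returns the indices that would sort the array.
values = np.array([5, 0, 1, 3, 2])
order = values.argsort()
print(order)           # [1 2 4 3 0]
print(values[order])   # [0 1 2 3 5]
```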


## Chpt. 5-8, 10-12: pandas 

### Skill Checklist

#### Fundamental
- DataFrame
    - Create a pandas DataFrame
    - Dealing with Rows and Columns
    - Indexing, Selection, Filtering
    - Boolean Indexing
    - Conversion Functions: map, apply, applymap
    - Iterating over Rows and Columns
    - Missing Data
    - String Manipulation
    - Working with Text Data
    - Working with Dates and Times
    - Merging, Joining and Concatenating
    - Delete rows/columns
    
- Series
    - Create a pandas Series
    - Accessing elements


#### Intermediate
- JSON
- Web Scraping (HTML, XML)
- melt
- pivot
- Time Series

#### Advanced
- Categorical Data
- Advanced GroupBy Use
- Method Chaining