CSV Data Source: http://introcs.cs.princeton.edu/java/data/
- surnames.csv
- 151,671 surnames by race/ethnicity
    - modified file to only include top 1000 to reduce file size
- data taken from 2000 US Census

# TOC
[DataCamp](#DataCamp)<br>
[Chapter 3: Data Manipulation with Pandas](#Chapter-3:-Data-Manipulation-with-Pandas)
- [Installing and Using Pandas](#Installing-and-Using-Pandas)
- [Introducing Pandas Objects](#Introducing-Pandas-Objects)
  - [The Pandas Series Object](#The-Pandas-Series-Object)
    - [Series as a Generalized NumPy Array](#Series-as-a-Generalized-NumPy-Array)
    - [Series as a Specialized Dictionary](#Series-as-a-Specialized-Dictionary)
    - [Constructing Series Objects](#Constructing-Series-Objects)
  - [The Pandas DataFrame Object](#The-Pandas-DataFrame-Object)
    - [DataFrame as a Generalized NumPy Array](#DataFrame-as-a-Generalized-NumPy-Array)
    - [DataFrame as a Specialized Dictionary](#DataFrame-as-a-Specialized-Dictionary)
    - [Constructing DataFrame Objects](#Constructing-DataFrame-Objects)
      - [From a Single Series Object](#From-a-Single-Series-Object)
      - [From a List of Dicts](#From-a-List-of-Dicts)
      - [From a Dictionary of Series Objects](#From-a-Dictionary-of-Series-Objects)
      - [From a Two-Dimensional NumPy Array](#From-a-Two-Dimensional-NumPy-Array)
      - [From a NumPy Structured Array](#From-a-NumPy-Structured-Array)

---
# DataCamp
- slides from tutorial downloaded & notes taken on them

# Chapter 3: Data Manipulation with Pandas
- Pandas is built on top of NumPy & provides an efficient implementation of `DataFrame`
- `DataFrame`s are essentially multidimensional arrays w/attached row & column labels
    - often w/mixed data types and/or missing data

## Installing and Using Pandas
- [Pandas Documentation](http://pandas.pydata.org)

In [1]:
import pandas as pd     # import Pandas
import numpy as np     # import NumPy
pd.__version__     # check version of Pandas
# pd?     # built-in Pandas documentation

'0.20.1'

## Introducing Pandas Objects
- 3 fundamental Pandas data structures: `Series`, `DataFrame`, `Index`

### The Pandas Series Object
- a `Series` is a 1D array of indexed data
- can be created from list or array w/`pd.Series()`
    - values can be accessed with `.values`
        - familiar NumPy array
    - indices can be accessed with `.index`
        - array-like object of type `pd.Index`
- data can be accessed by associated index via square brackets

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])     # Series from list
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
data.values     # access values

array([ 0.25,  0.5 ,  0.75,  1.  ])

In [4]:
data.index     # access index

RangeIndex(start=0, stop=4, step=1)

In [5]:
data[0]     # access specific data element

0.25

In [6]:
data[1:4]     # access specific data slice

1    0.50
2    0.75
3    1.00
dtype: float64

#### Series as a Generalized NumPy Array
- NumPy array has implicitly defined index to access values, but Pandas Series has explicitly defined index associated w/values
- in Series, index doesn't have to be continuous int
  - can be any desired type
  - can be nonsequential

In [7]:
pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
# non-int index

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
# nonsequential index

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

#### Series as a Specialized Dictionary
- Series maps typed keys to set of typed values (being type-specific makes it more efficient)
- by default (in the Pandas 0.20 used here) Series will be created where index is drawn from sorted keys
- typical dictionary-style access can be performed
- Series supports array-style slicing

In [9]:
population_dict = {'California': 38332521,
                           'Texas': 26448193,
                           'New York': 19651127,
                           'Florida': 19552860,
                           'Illinois': 12882135}
population = pd.Series(population_dict)     # Series from dict
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [10]:
population['California']     # dictionary-style access

38332521

In [11]:
population['California':'Illinois']     # array-style slicing

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
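
The sorted index above is what Pandas 0.20 produces. Newer releases (an assumption here: Pandas 0.23+ on Python 3.6+) keep the dictionary's insertion order instead, so the same label slice would cover different rows. A minimal sketch that calls `sort_index()` explicitly so the slice behaves the same either way:

```python
import pandas as pd

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# sort_index() gives an alphabetical index regardless of Pandas version
population_sorted = pd.Series(population_dict).sort_index()

# label-based slice; both endpoints are included
subset = population_sorted['California':'Illinois']
print(subset)
```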

#### Constructing Series Objects
- when starting from scratch, constructing a Series is always some form of `pd.Series(data, index=index)`
  - index is an optional argument
  - data can be one of many entities
- when data is a list or NumPy array, index defaults to int sequence
- when data is a scalar, it's repeated to fill specified index
- when data is a dictionary, index defaults to sorted dictionary keys (in the Pandas 0.20 used here)
- in every case, index can be explicitly set if different result is preferred
  - if data is dictionary, Series will only be populated w/explicitly identified keys

In [12]:
# for list & default int index, see first example of series object above
pd.Series(5, index=[100,200,300])     # scalar data, specified index

100    5
200    5
300    5
dtype: int64

In [13]:
pd.Series({2:'a', 1:'b', 3:'c'})     # dictionary data/index

1    b
2    a
3    c
dtype: object

In [14]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])     # dictionary data/index w/specified index

3    c
2    a
dtype: object

### The Pandas DataFrame Object
- `DataFrame` can also be thought of as a generalized NumPy array or specialized Python dictionary

#### DataFrame as a Generalized NumPy Array
- DataFrame is analog of 2D array w/both flexible row indices & flexible column names
- `.index` attribute gives access to index labels
- `.columns` attribute returns an Index object holding column labels

In [15]:
# construct DataFrame using population from above & area from next line
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})
states

|  | area | population |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
| New York | 141297 | 19651127 |
| Texas | 695662 | 26448193 |


In [16]:
states.index     # access index labels

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [17]:
states.columns     # access column labels

Index(['area', 'population'], dtype='object')

In [18]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

#### DataFrame as a Specialized Dictionary
- where dictionary maps key to value, DataFrame maps column name to a Series of column data
- in a 2D NumPy array, `data[0]` returns the first row
- in a DataFrame, `data['col0']` returns the first column
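
A small sketch of that difference, using a hypothetical two-row frame `df` (made up for illustration, not part of the notes): a single integer index on the raw array picks a row, while a single column name on the DataFrame picks a column as a Series.

```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame({'area': [423967, 170312], 'pop': [38332521, 19552860]},
                  index=['California', 'Florida'])

arr = df.values          # underlying 2D NumPy array
first_row = arr[0]       # single index on an array -> first row
area_col = df['area']    # single "index" on a DataFrame -> a column (Series)

print(first_row)
print(area_col)
```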

#### Constructing DataFrame Objects
##### From a Single Series Object
- DataFrame is a collection of Series objects & a single-column DataFrame can be constructed from a single Series

In [19]:
pd.DataFrame(population, columns=['population'])

|  | population |
|---|---|
| California | 38332521 |
| Florida | 19552860 |
| Illinois | 12882135 |
| New York | 19651127 |
| Texas | 26448193 |


##### From a List of Dicts
- any list of dictionaries can be made into DataFrame
- even if some keys in dict are missing, Pandas will fill them in w/`NaN` values
  - `NaN` = not a number

In [20]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]     # list of dictionaries made from list comprehension
pd.DataFrame(data)

|  | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


In [21]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
# DataFrame from list of dicts w/some keys missing

|  | a | b | c |
|---|---|---|---|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |


##### From a Dictionary of Series Objects
- DataFrame can be made from dictionary of Series objects too

In [22]:
pd.DataFrame({'population':population, 'area':area})

|  | area | population |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
| New York | 141297 | 19651127 |
| Texas | 695662 | 26448193 |


##### From a Two-Dimensional NumPy Array
- given 2D array of data, we can create DataFrame w/any specified column & index names
- if omitted, int index will be used for each

In [23]:
pd.DataFrame(np.random.rand(3,2), columns=['foo','bar'], index=['a','b','c'])

|  | foo | bar |
|---|---|---|
| a | 0.375149 | 0.400204 |
| b | 0.693348 | 0.400132 |
| c | 0.41284 | 0.407879 |
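
For the case where `columns` and `index` are left out, a minimal sketch (the zero-filled array is just placeholder data) showing that both default to an integer range:

```python
import numpy as np
import pandas as pd

# no columns= or index= given, so both default to integer ranges
df_default = pd.DataFrame(np.zeros((3, 2)))

print(df_default.index)      # row labels 0, 1, 2
print(df_default.columns)    # column labels 0, 1
```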


##### From a NumPy Structured Array
- DataFrame operates like a structured array & can be created from one

In [24]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0,  0.), (0,  0.), (0,  0.)], 
      dtype=[('A', '<i8'), ('B', '<f8')])

In [25]:
pd.DataFrame(A)

|  | A | B |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |


### The Pandas Index Object
- both Series & DataFrame have an explicit index that lets you reference & modify data
- `Index` object can be thought of either as an immutable array or as an ordered set
    - technically a multiset since it may contain repeated values
- construct an Index object from a list of ints using `pd.Index()`

#### Index as Immutable Array
- can use standard Python indexing notation to retrieve values/slices
- Index objects have many attributes familiar from NumPy arrays
- indices are immutable & can't be modified by normal means
    - ex: if you try to do `ind[1] = 0` you'll get an error
    - makes it safer to share indices between multiple DataFrames & arrays w/o potential for side effects from accidental index modification

In [26]:
ind = pd.Index([2,3,5,7,11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [27]:
ind[1]

3

In [28]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [29]:
print(ind.size,
     ind.shape,
     ind.ndim,
     ind.dtype)

5 (5,) 1 int64
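
The immutability noted above can be checked directly. A minimal sketch: assigning to an element of an Index raises a `TypeError` and leaves the index unchanged (the `raised` flag is only there to make the outcome visible).

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0     # item assignment is not supported on an Index
except TypeError as err:
    raised = True
    print('TypeError:', err)

print(raised, list(ind))     # index contents are unchanged
```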


#### Index as Ordered Set
- Pandas objects are designed to facilitate operations across datasets
    - many of these operations (ex: joins) depend on set arithmetic
- Index object follows a lot of conventions used by Python's `set` structure so unions, intersections, differences & other combos can be computed easily
    - all examples below can also be accessed via object methods

In [30]:
ind2 = pd.Index([1,3,5,7,9])

ind & ind2     # intersection
ind.intersection(ind2)     # intersection using object method

Int64Index([3, 5, 7], dtype='int64')

In [31]:
ind | ind2     # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [32]:
ind ^ ind2     # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')
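
The operator forms above reflect Pandas 0.20. As far as I know, recent releases (an assumption: Pandas 2.x) no longer treat `&`, `|` & `^` on an Index as set operations, so the method forms are the safer spelling. A sketch of all three using the same two indexes:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])
ind2 = pd.Index([1, 3, 5, 7, 9])

both = ind.intersection(ind2)              # values present in both indexes
either = ind.union(ind2)                   # values present in at least one
only_one = ind.symmetric_difference(ind2)  # values present in exactly one

print(list(both), list(either), list(only_one))
```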

## Data Indexing and Selection
### Data Selection in Series
#### Series as Dictionary
- Series provides mapping from collection of keys to a collection of values
- can use dictionary-like Python expressions & methods to examine keys/indices & values
- can be modified w/dictionary-like syntax

In [33]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [34]:
data['b']     # accessing value by key

0.5

In [35]:
'a' in data

True

In [36]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [37]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [38]:
data['e'] = 1.25    # add new key/value pair
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

#### Series as One-Dimensional Array
- provides array-style item selection via same basic mechanisms as NumPy arrays: slices, masking, fancy indexing
- when slicing w/explicit index, final index is _included_ in the slice
- when slicing w/implicit index, final index is _excluded_ from slice

In [39]:
data['a':'c']     # slicing by explicit index

a    0.25
b    0.50
c    0.75
dtype: float64

In [40]:
data[0:2]     # slicing by implicit integer index

a    0.25
b    0.50
dtype: float64

In [41]:
data[(data>0.3) & (data<0.8)]     # masking

b    0.50
c    0.75
dtype: float64

In [42]:
data[['a','e']]     # fancy indexing

a    0.25
e    1.25
dtype: float64

#### Indexers: loc, iloc, and ix
- Pandas has special _indexer_ attributes to avoid confusion in the case of integer indexes
    - to avoid the whole explicit/implicit selection and slicing thing shown above
- `loc` allows indexing & slicing that always references the explicit index
- `iloc` allows indexing & slicing that always references the implicit (Python-style) index
- `ix` is a hybrid of `loc` & `iloc` (deprecated as of Pandas 0.20 & removed in 1.0, so prefer `loc`/`iloc`)
    - for Series objects it's equivalent to standard []-based indexing

In [45]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])     # redefine data to have int index
data

1    a
3    b
5    c
dtype: object

In [46]:
data.loc[1]     # indexing using explicit index

'a'

In [48]:
data.iloc[1]     # indexing using implicit index

'b'

In [47]:
data.loc[1:3]     # slicing using explicit index

1    a
3    b
dtype: object

In [49]:
data.iloc[1:3]     # slicing using implicit index

3    b
5    c
dtype: object

### Data Selection in DataFrame
#### DataFrame as a Dictionary
- the individual Series that make up the columns of a DataFrame can be accessed via dictionary-style indexing of column name
- can use attribute-style access w/column names that are strings
    - that's b/c it accesses the exact same object as the dictionary style access
- this won't work in all cases
    - won't work if column names aren't strings
    - won't work if column names conflict w/DataFrame methods
- avoid column assignment via attribute
    - use `data['pop'] = z` instead of `data.pop = z`
- dictionary-style syntax can be used to modify or add to the object

In [50]:
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

|  | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |
| New York | 141297 | 19651127 |
| Texas | 695662 | 26448193 |


In [51]:
data['area']     # dictionary-style indexing

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [52]:
data.area     # attribute-style access

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [54]:
data.area is data['area']     # dictionary-style & attribute-style are accessing same thing

True

In [56]:
data.pop is data['pop']
# not the same b/c pop is a built-in DataFrame method

False

In [57]:
data['density'] = data['pop'] / data['area']     # new column via dictionary-style syntax
data

|  | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
| New York | 141297 | 19651127 | 139.076746 |
| Texas | 695662 | 26448193 | 38.01874 |
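
To illustrate why column assignment via attribute is discouraged, a sketch with a hypothetical two-row frame `df` and a made-up attribute name `ratio`: attribute assignment for a name that isn't already a column just sets a Python attribute, while dictionary-style assignment creates the column. The warning filter is there b/c recent Pandas versions warn about this pattern.

```python
import warnings
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame({'area': [423967, 170312], 'pop': [38332521, 19552860]},
                  index=['California', 'Florida'])

with warnings.catch_warnings():
    warnings.simplefilter('ignore')       # silence the pandas warning, if any
    df.ratio = df['pop'] / df['area']     # sets a plain attribute, not a column

print('ratio' in df.columns)              # attribute assignment added no column

df['density'] = df['pop'] / df['area']    # dictionary-style assignment adds one
print('density' in df.columns)
```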


#### DataFrame as Two-Dimensional Array
- can examine raw data array using `.values` attribute
- can transpose the full DataFrame to swap rows & columns using `.T`
- passing single index to an array accesses a row
- passing single "index" to DataFrame accesses a column
- using `iloc`, can index underlying array as if it's a NumPy array (using implicit Python-style index) but DataFrame index & column labels are kept in result
    - same w/`loc` except you have to provide the index & column names
- `ix` is hybrid of `iloc` and `loc`
    - remember that for int indices, it's subject to the same explicit/implicit confusion as int-indexed Series objects
- any NumPy-style data access-patterns can be used w/in these indexers
    - ex: w/in loc you can combine masking & fancy indexing
- any of the above indexing conventions can be used to set or modify values

In [68]:
# use same data DataFrame from above
data.values     # examine raw data

array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])

In [59]:
data.T     # transpose the DataFrame

|  | California | Florida | Illinois | New York | Texas |
|---|---|---|---|---|---|
| area | 423967.0 | 170312.0 | 149995.0 | 141297.0 | 695662.0 |
| pop | 38332520.0 | 19552860.0 | 12882140.0 | 19651130.0 | 26448190.0 |
| density | 90.41393 | 114.8061 | 85.88376 | 139.0767 | 38.01874 |


In [67]:
data.values[0]     # passing single index to array accesses a row
# have to use .values; data[0] alone raises a KeyError since DataFrame looks up 0 as a column label

array([  4.23967000e+05,   3.83325210e+07,   9.04139261e+01])

In [66]:
data['area']     # passing single index to DataFrame accesses column

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

In [74]:
data.iloc[:2,:3]     # use iloc indexer

|  | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.413926 |
| Florida | 170312 | 19552860 | 114.806121 |


In [70]:
data.loc[:'Illinois', :'pop']     # use loc indexer

|  | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


In [73]:
data.ix[:2, :'pop']     # use ix indexer

|  | area | pop |
|---|---|---|
| California | 423967 | 38332521 |
| Florida | 170312 | 19552860 |
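
Since `ix` isn't available in newer Pandas releases, here is one possible way to get the same mixed selection (rows by position, columns by label) w/`loc` or `iloc` alone. It rebuilds the same `data` frame & sorts the index so row order matches the output above regardless of version.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop}).sort_index()
data['density'] = data['pop'] / data['area']

# first two rows by position, columns up to & including 'pop' by label
mixed_loc = data.loc[data.index[:2], :'pop']

# same selection expressed purely by position
stop = data.columns.get_loc('pop') + 1
mixed_iloc = data.iloc[:2, :stop]

print(mixed_loc)
```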


In [75]:
data.loc[data.density > 100, ['pop', 'density']]     # combine masking & fancy indexing w/in loc

|  | pop | density |
|---|---|---|
| Florida | 19552860 | 114.806121 |
| New York | 19651127 | 139.076746 |


In [76]:
data.iloc[0,2] = 90     # modify California's population density via iloc
data

|  | area | pop | density |
|---|---|---|---|
| California | 423967 | 38332521 | 90.0 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |
| New York | 141297 | 19651127 | 139.076746 |
| Texas | 695662 | 26448193 | 38.01874 |


#### Additional Indexing Conventions
- indexing refers to columns
- slicing refers to rows
- slices can also refer to rows by number rather than by index
- direct masking operations are interpreted row-wise & not column-wise
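
A sketch of the four conventions above on the same `data` frame (rebuilt here so the block is self-contained; the index is sorted explicitly so row order doesn't depend on the Pandas version):

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop}).sort_index()
data['density'] = data['pop'] / data['area']

col = data['area']                           # indexing -> a column
rows_by_label = data['Florida':'Illinois']   # slicing by label -> rows (endpoint included)
rows_by_pos = data[1:3]                      # slicing by number -> rows (endpoint excluded)
dense = data[data['density'] > 100]          # masking -> rows where condition holds

print(rows_by_label)
print(dense)
```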

## Operating on Data in Pandas