CSV Data Source: http://introcs.cs.princeton.edu/java/data/
- surnames.csv
- 151,671 surnames by race/ethnicity
    - modified file to only include top 1000 to reduce file size
- data taken from 2000 US Census

# TOC
[DataCamp](#DataCamp)<br>
[Chapter 3: Data Manipulation with Pandas](#Chapter-3:-Data-Manipulation-with-Pandas)
- [Installing and Using Pandas](#Installing-and-Using-Pandas)
- [Introducing Pandas Objects](#Introducing-Pandas-Objects)
  - [The Pandas Series Object](#The-Pandas-Series-Object)
    - [Series as a Generalized NumPy Array](#Series-as-a-Generalized-NumPy-Array)
    - [Series as a Specialized Dictionary](#Series-as-a-Specialized-Dictionary)
    - [Constructing Series Objects](#Constructing-Series-Objects)
  - [The Pandas DataFrame Object](#The-Pandas-DataFrame-Object)
    - [DataFrame as a Generalized NumPy Array](#DataFrame-as-a-Generalized-NumPy-Array)
    - [DataFrame as a Specialized Dictionary](#DataFrame-as-a-Specialized-Dictionary)
    - [Constructing DataFrame Objects](#Constructing-DataFrame-Objects)
      - [From a Single Series Object](#From-a-Single-Series-Object)
      - [From a List of Dicts](#From-a-List-of-Dicts)
      - [From a Dictionary of Series Objects](#From-a-Dictionary-of-Series-Objects)
      - [From a Two-Dimensional NumPy Array](#From-a-Two-Dimensional-NumPy-Array)
      - [From a NumPy Structured Array](#From-a-NumPy-Structured-Array)

---
# DataCamp
- slides from tutorial downloaded & notes taken on them

# Chapter 3: Data Manipulation with Pandas
- Pandas is built on top of NumPy & provides an efficient implementation of `DataFrame`
- `DataFrame`s are essentially multidimensional arrays w/attached row & column labels
    - often w/mixed data types and/or missing data

## Installing and Using Pandas
- [Pandas Documentation](http://pandas.pydata.org)

In [1]:
import pandas as pd     # import Pandas
import numpy as np     # import NumPy
pd.__version__     # check version of Pandas
# pd?     # built-in Pandas documentation

'0.20.1'

## Introducing Pandas Objects
- 3 fundamental Pandas data structures: `Series`, `DataFrame`, `Index`

### The Pandas Series Object
- a `Series` is a 1D array of indexed data
- can be created from list or array w/`pd.Series()`
    - values can be accessed with `.values`
        - familiar NumPy array
    - indices can be accessed with `.index`
        - array-like object of type `pd.Index`
- data can be accessed by associated index via square brackets

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])     # Series from list
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
data.values     # access values

array([ 0.25,  0.5 ,  0.75,  1.  ])

In [4]:
data.index     # access index

RangeIndex(start=0, stop=4, step=1)

In [5]:
data[0]     # access specific data element

0.25

In [6]:
data[1:4]     # access specific data slice

1    0.50
2    0.75
3    1.00
dtype: float64

#### Series as a Generalized NumPy Array
- NumPy array has implicitly defined index to access values, but Pandas Series has explicitly defined index associated w/values
- in Series, index doesn't have to be consecutive ints
  - can be any desired type
  - can be nonsequential

In [7]:
pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
# non-int index

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
# nonsequential index

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
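
With a nonsequential int index, label and position no longer agree. A minimal sketch reusing the Series from In [8]; `.loc` (label-based) and `.iloc` (position-based) are standard Pandas accessors not otherwise used in these notes:

```python
import pandas as pd

# same Series as In [8]: int labels that are not in order
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[3]       # explicit index: value labeled 3
by_position = data.iloc[3]   # implicit index: 4th value

print(by_label)      # 0.75
print(by_position)   # 1.0
```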

#### Series as a Specialized Dictionary
- Series maps typed keys to set of typed values (being type-specific makes it more efficient)
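- by default Series will be created where index is drawn from sorted keys (behavior of Pandas 0.20.1, used here; Pandas 0.23+ on Python 3.6+ keeps dict insertion order instead)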
- typical dictionary-style access can be performed
- Series supports array-style slicing

In [9]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)     # Series from dict
population

California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64

In [10]:
population['California']     # dictionary-style access

38332521

In [11]:
population['California':'Illinois']     # array-style slicing

California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
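
One difference from plain array slicing: a slice by label includes its endpoint, while a slice by position does not. A minimal sketch, assuming the same population data as above (keys entered in sorted order so the result matches on any Pandas version) and using `.iloc` for the positional slice:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Florida': 19552860,
                        'Illinois': 12882135, 'New York': 19651127,
                        'Texas': 26448193})

label_slice = population['California':'Illinois']   # 'Illinois' is included
position_slice = population.iloc[0:2]               # position 2 is excluded

print(len(label_slice))      # 3
print(len(position_slice))   # 2
```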

#### Constructing Series Objects
- when starting from scratch, constructing a Series is always some form of `pd.Series(data, index=index)`
  - index is an optional argument
  - data can be one of many entities
- when data is a list or NumPy array, index defaults to int sequence
- when data is a scalar, it's repeated to fill specified index
- when data is a dictionary, index defaults to sorted dictionary keys (in Pandas 0.20.1; Pandas 0.23+ on Python 3.6+ keeps insertion order)
- in every case, index can be explicitly set if different result is preferred
  - if data is dictionary, Series will only be populated w/explicitly identified keys

In [12]:
# for list & default int index, see first example of series object above
pd.Series(5, index=[100,200,300])     # scalar data, specified index

100    5
200    5
300    5
dtype: int64

In [13]:
pd.Series({2:'a', 1:'b', 3:'c'})     # dictionary data/index

1    b
2    a
3    c
dtype: object

In [14]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])     # dictionary data/index w/specified index

3    c
2    a
dtype: object
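
The explicit index also works in the other direction: a label that's not a key in the dictionary gets `NaN`. A small sketch extending In [14]; the extra label `4` is made up for illustration:

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# 4 is not a key in the dict, and key 1 is not in the index
s = pd.Series(letters, index=[3, 2, 4])

print(s)
print(s.isnull())   # True only for label 4
```

Key `1` is dropped because it isn't in the index, and label `4` is kept w/a missing value.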

### The Pandas DataFrame Object
- `DataFrame` can also be thought of as a generalized NumPy array or specialized Python dictionary

#### DataFrame as a Generalized NumPy Array
- DataFrame is analog of 2D array w/both flexible row indices & flexible column names
- `.index` attribute gives access to index (row) labels
- `.columns` attribute returns an Index object holding column labels

In [15]:
# construct DataFrame using population from above & area from next line
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})
states

              area  population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193


In [16]:
states.index     # access index labels

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

In [17]:
states.columns     # access column labels

Index(['area', 'population'], dtype='object')

In [18]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

#### DataFrame as a Specialized Dictionary
- where dictionary maps key to value, DataFrame maps column name to a Series of column data
- in a 2D NumPy array, `data[0]` will return the first row
- in a DataFrame, `data['col0']` will return the column named `col0`
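
A minimal sketch of the difference between array-style and dictionary-style indexing. The array values and the column names `col0`/`col1` are placeholders, not data from these notes:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])
df = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]           # NumPy: [0] selects the first row
first_col = df['col0']       # DataFrame: ['col0'] selects a column (a Series)
row_by_position = df.iloc[0]     # rows need .iloc (position) or .loc (label)

print(first_row.tolist())        # [1, 2]
print(first_col.tolist())        # [1, 3, 5]
print(row_by_position.tolist())  # [1, 2]
```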

#### Constructing DataFrame Objects
##### From a Single Series Object
- DataFrame is a collection of Series objects & a single-column DataFrame can be constructed from a single Series

In [24]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
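
Another way to get the same single-column result is `Series.to_frame()`, which isn't used in these notes. A sketch w/only two of the states to keep it short:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})

# name= sets the label of the single column
frame = population.to_frame(name='population')

print(frame.columns.tolist())   # ['population']
print(frame.shape)              # (2, 1)
```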


##### From a List of Dicts
- any list of dictionaries can be made into DataFrame
- even if some keys in dict are missing, Pandas will fill them in w/`NaN` values
  - `NaN` = not a number

In [25]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]     # list of dictionaries made from list comprehension
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [27]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
# DataFrame from list of dictionaries where some keys are missing

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
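
The `1.0`/`4.0` in the output are a side effect of the fill: `NaN` is a float, so any int column that receives one becomes `float64`. A short sketch to check this on the same data (`.isnull()` and `.dtypes` are standard Pandas, not used elsewhere in these notes):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

missing = df.isnull()   # True where Pandas filled in NaN
dtypes = df.dtypes      # 'a' and 'c' become float64, 'b' stays int64

print(missing)
print(dtypes)
```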


##### From a Dictionary of Series Objects
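This is the same construction used for `states` in In [15]. A minimal self-contained sketch w/only two of the states:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})

# dict keys become column names; the Series are aligned on their index
states = pd.DataFrame({'population': population, 'area': area})

print(states)
```

Column order depends on the Pandas version (sorted keys in older versions, insertion order in newer ones).
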
##### From a Two-Dimensional NumPy Array
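A minimal sketch, assuming a small made-up array; the labels `foo`/`bar` and `a`/`b`/`c` are placeholders. If `columns` or `index` is omitted, an int sequence is used for it:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)   # 3 rows, 2 columns: 0..5

labeled = pd.DataFrame(values, columns=['foo', 'bar'], index=['a', 'b', 'c'])
default = pd.DataFrame(values)        # int row & column labels

print(labeled)
print(default.columns.tolist())       # [0, 1]
```
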
##### From a NumPy Structured Array
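A minimal sketch, assuming a zero-filled structured array w/made-up field names `A` (int) and `B` (float); the field names become the column names:

```python
import numpy as np
import pandas as pd

# structured array: each record has an int64 field 'A' and a float64 field 'B'
records = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

df = pd.DataFrame(records)

print(df.columns.tolist())   # ['A', 'B']
print(df.dtypes)
```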