In [16]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame



# 5.1 Intro to Pandas Structures
## Series
A one-dimensional array-like object containing a sequence of values and an associated array of data labels (called its *index*)

In [4]:
obj = pd.Series([4, 7, -5, 3])
obj

print(obj.values)
print(obj.index)


[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


The display shows the index on the left and the values on the right. When no index is given, the default is the integers 0 through N - 1

Get Values: `obj.values` => `[ 4  7 -5  3] `

Get index: `obj.index` => `RangeIndex(start=0, stop=4, step=1)`

Create Series with index identifying each data point with label:

In [5]:
obj2 = pd.Series([4, 8, 2, 0], index=['a', 'b', 'c', 'd'])
obj2

a    4
b    8
c    2
d    0
dtype: int64

Unlike NumPy arrays, you can use labels in the index when selecting a single value or a set of values:
`obj2['a']` => `4`

Think of a Series as a fixed-length, ordered dict: it maps index values to data values
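
The dict analogy can be checked directly. A minimal sketch, reusing the `obj2` values from the cell above; the names `has_b`, `has_e`, `subset`, and `as_dict` are illustrative only:

```python
import pandas as pd

# Same data as obj2 above; the labels behave like dict keys.
obj2 = pd.Series([4, 8, 2, 0], index=['a', 'b', 'c', 'd'])

has_b = 'b' in obj2          # membership tests the index, not the values
has_e = 'e' in obj2
subset = obj2[['c', 'a']]    # a list of labels selects values in that order
as_dict = obj2.to_dict()     # convert back to a plain Python dict

print(has_b, has_e)
print(subset)
print(as_dict)
```

Note that `in` looks at the labels only, just as it looks at the keys of a dict.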

### Create series from python dict:

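The index holds the dict's keys (in sorted order in older pandas versions; recent versions keep the dict's insertion order). To override, pass an `index` listing the keys in the order you want them to appear; a key with no matching value gets `NaN`
Detect missing data with the `isnull` and `notnull` functions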

In [11]:
sdata = {'Ohio': 35, 'Texas': 12, 'Alberta': 80}
obj3 = pd.Series(sdata)
obj3

states = ['Alberta', 'Ohio', 'Texas', 'Cali']
obj4 = pd.Series(sdata, index=states)
obj4

pd.isnull(obj4)

# Series automatically aligns by index label in arithmetic ops

obj3 + obj4

Alberta    160.0
Cali         NaN
Ohio        70.0
Texas       24.0
dtype: float64

### Alter Series index in place by assignment

In [12]:
obj.index =['Doug', 'Steve', 'Chris', 'Laine']
obj

Doug     4
Steve    7
Chris   -5
Laine    3
dtype: int64

## Data Frames

Represents a rectangular table of data and contains an ordered collection of columns - each of which can be different value types (numeric, string, bool etc)

Has both a row index and a column index

Can be thought of as a dict of Series that all share the same index

### Constructing a DF

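A common way is to pass a dict of equal-length lists or NumPy arrays

The DF gets its index assigned automatically, as with Series. Columns follow the dict's key order in recent pandas versions (the output below shows `state`, `year`, `pop`); older versions placed them in sorted order

For large DFs, `df.head()` selects only the first 5 rows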

Specify columns - `pd.DataFrame(data, columns=['year', 'state', 'pop'])`


In [13]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


### Retrieving Data
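**Column:** Retrieved as a Series by either dict-like notation or by attribute:

`df['state']` OR `df.state`

**Row:** Retrieved by label with the special `loc` attribute (or by integer position with `iloc`):

`df.loc['three']` (needs a row labelled `'three'`; for `df` above, whose index is 0 through 5, use e.g. `df.loc[3]`)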
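
A minimal sketch of column and row retrieval. The frame `labeled` and its string row labels are hypothetical, added only so that the label `'three'` exists:

```python
import pandas as pd

# Hypothetical frame with string row labels so that 'three' is a valid label.
labeled = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                        'pop': [1.5, 1.7, 2.4]},
                       index=['one', 'two', 'three'])

state_col = labeled['state']      # column returned as a Series
by_label = labeled.loc['three']   # row selected by its label
by_position = labeled.iloc[2]     # the same row selected by integer position

print(state_col)
print(by_label)
```

Both row lookups return a Series whose index is the frame's column names.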

### Modifying Columns
When assigning a list or array to a column, its length must match the length of the DF

**Delete Col:** `del df['year']`
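
A short sketch of the length rule and of `del`, using a small made-up frame; the column names `order` and `bad` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'year': [2000, 2001, 2001]})

df['debt'] = 16.5              # a scalar is broadcast to every row
df['order'] = np.arange(3)     # an array must have len(df) elements

error = None
try:
    df['bad'] = [1, 2]         # wrong length: pandas raises ValueError
except ValueError as exc:
    error = str(exc)

del df['year']                 # removes the column from df itself

print(error)
print(df)
```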

In [19]:
df['debt'] = 16.5
# OR
df['happiness'] = np.arange(6.)
df

    state  year  pop  debt  happiness
0    Ohio  2000  1.5  16.5        0.0
1    Ohio  2001  1.7  16.5        1.0
2    Ohio  2002  3.6  16.5        2.0
3  Nevada  2001  2.4  16.5        3.0
4  Nevada  2002  2.9  16.5        4.0
5  Nevada  2003  3.2  16.5        5.0


### Nested Dict of Dicts:

**interpret outer dict keys as columns and inner keys as row indices**

### Transpose (swap rows and cols) = `df.T`

In [20]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

df3 = pd.DataFrame(pop)

df3  # shown untransposed; df3.T would swap the rows and columns

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


### Row and Col Name attributes

In [23]:
df3.index.name='year'
df3.columns.name='state'
df3

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [24]:
# .values returns the data as a 2D ndarray

df3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

# Index Objects
Index objects hold axis labels and other metadata (like axis name or names)

Any array or other sequence of labels you use when constructing a Series or DF is internally converted to an Index

Index objects are immutable

Unlike Python sets, a pandas Index can contain duplicate labels

`dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])`
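
A minimal sketch of both properties, immutability and duplicate labels. The Series `s` and the names `is_unique`, `message`, and `foo_rows` are illustrative assumptions:

```python
import pandas as pd

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
is_unique = dup_labels.is_unique     # False, because labels repeat

message = None
try:
    dup_labels[1] = 'baz'            # item assignment on an Index is rejected
except TypeError as exc:
    message = str(exc)

s = pd.Series([1, 2, 3, 4], index=dup_labels)
foo_rows = s['foo']                  # a duplicated label returns a Series, not a scalar

print(is_unique)
print(message)
print(foo_rows)
```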

In [30]:
obj = pd.Series(range(3), index=['a','b','c'])
index = obj.index

index[1:]

# immutability makes it safer to share Index objects among data structures

labels = pd.Index(np.arange(3))
labels

obj2 = pd.Series([1.5, -2, 0], index = labels)
obj2

obj2.index is labels

True

# 5.2 - Essential Functionality
## Reindexing
`reindex` is an important method: it creates a new object with the data conformed to a new index

`method='ffill'` forward-fills values (carries the last valid value forward) when reindexing



In [34]:
obj = pd.Series([4.5, 7.2, -3, 4.8], index = ['d', 'b', 'c', 'a'])

# reindex rearranges the data according to the new index
# and introduces missing values (NaN) for labels that were not already present

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

obj3 = pd.Series(['blue', 'yellow', 'red'], index = [0, 2, 4])
print(obj3)
obj3.reindex(range(6), method='ffill')


0      blue
2    yellow
4       red
dtype: object


0      blue
1      blue
2    yellow
3    yellow
4       red
5       red
dtype: object

In [39]:
# reindex can alter either the (row) index, the columns, or both
# when passed only a sequence, it reindexes the rows

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index =['a', 'b', 'd'], columns=['ohio', 'texas', 'cali'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

# columns reindexed with `columns` keyword
frame.reindex(columns=['texas', 'cali', 'quebec'])


   ohio  texas  cali
a     0      1     2
b     3      4     5
d     6      7     8


   texas  cali  quebec
a      1     2     NaN
b      4     5     NaN
d      7     8     NaN


## Dropping entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries

The `drop` method returns a new object with the indicated values deleted from an axis

`obj.drop('c')` OR `obj.drop(['c', 'd'])`

Index values can be deleted from either axis

Drop values from the columns by passing `axis=1` or `axis='columns'`

`data.drop('two', axis=1)`

`inplace=True` modifies the object in place without returning a new object - DESTROYS ANY DATA DROPPED

`obj.drop('c', inplace=True)` 
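
A small sketch of dropping from each axis. The 4x4 frame `data` is an assumption made for this example, and the result names are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

fewer_rows = data.drop(['Colorado', 'Ohio'])   # axis=0 (rows) is the default
fewer_cols = data.drop('two', axis=1)          # same as axis='columns'

# Without inplace=True the original frame is left untouched.
original_shape = data.shape

print(fewer_rows)
print(fewer_cols)
print(original_shape)
```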

## Indexing, Selection, and Filtering
### Indexing
Works like NumPy array indexing, except that you can use index values instead of only ints

`obj['b']` || `obj[1]` || `obj[2:4]` || `obj[['a', 'b', 'd']]` || `obj[obj<2]`

Slicing with labels differs from normal Python slicing in that the endpoint is inclusive

Setting with these methods modifies the corresponding section of the Series

`obj['b':'c'] = 5`
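
To make the inclusive endpoint concrete, here is a minimal sketch comparing a label slice with a positional slice. The Series values are made up for the example, and `iloc` is used for the positional case to keep the two kinds of slicing explicit:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['b':'c']        # label slice: includes both 'b' and 'c'
by_position = obj.iloc[1:2]    # positional slice: stops before position 2

print(list(by_label.index))
print(list(by_position.index))

obj['b':'c'] = 5               # assignment through a label slice edits obj itself
print(obj)
```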

Indexing into a DF retrieves one or more columns, given either a single value or a sequence:


In [44]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data
data['two']
# data[['three', 'one']]



Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

Row selection by slicing or with a boolean array:

`data[:2]`

`data[data['three'] > 5]`
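
A sketch of the two row selections, plus setting values through a boolean frame. The frame is rebuilt here so the block runs on its own, and the names `first_two`, `mask`, `filtered`, and `clipped` are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

first_two = data[:2]            # a slice selects rows
mask = data['three'] > 5        # boolean Series aligned on the row index
filtered = data[mask]           # keeps only rows where the mask is True

clipped = data.copy()
clipped[clipped < 5] = 0        # a boolean frame selects individual cells

print(first_two)
print(filtered)
print(clipped)
```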

### Selection with loc and iloc

`loc` and `iloc` let you select a subset of rows and cols from a DF using either axis labels (`loc`) or integers (`iloc`)

Both work with slices in addition to single labels or lists of labels

`data.loc[:'Utah', 'two']` OR `data.iloc[:, :3][data.three > 5]`

Select a single row and multiple columns by label:


In [47]:
# Using labels
data.loc['Colorado', ['two', 'three']]

# Using integers
data.iloc[2, [3, 0, 1]]


four    11
one      8
two      9
Name: Utah, dtype: int32

### Arithmetic and Data Alignment

When adding objects together, if the index pairs are not the same, the index in the result is the union of the index pairs

Labels that aren't in both objects produce missing values (NaN) where they don't overlap

Missing values can be filled with a special value (e.g. 0)

`df1.add(df2, fill_value=0)`



In [55]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

s1 + s2
    
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(df1 + df2)

df2.add(df1, fill_value=0)


            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0


## Function Application and Mapping

NumPy ufuncs (element-wise array methods) work with pandas objects



In [57]:
frame = pd.DataFrame(np.random.randn(4,3), columns=list('efg'), index=['ab', 'ont', 'bc', 'que'])
print(frame)
np.abs(frame)

            e         f         g
ab  -2.039418 -1.177167 -1.192470
ont -0.374649  0.575577  0.787218
bc  -0.319430  1.773635  1.547850
que  0.166655 -0.562240 -1.335780


            e         f         g
ab   2.039418  1.177167  1.192470
ont  0.374649  0.575577  0.787218
bc   0.319430  1.773635  1.547850
que  0.166655  0.562240  1.335780


Apply a function on 1D arrays to each column or row with the `apply` method

Here `f` computes the difference between the max and min of a Series and is invoked once on each column in `frame`

The result is a Series with the columns of `frame` as its index

If you pass `axis='columns'` to `apply`, the function is invoked once per row instead

`frame.apply(f, axis='columns')`
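
A minimal sketch contrasting the two directions. It uses a small fixed frame rather than random data so the results are reproducible; the function name `spread` is an illustrative stand-in for `f`:

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.0],
                      [4.0, 0.5, -1.0]],
                     columns=list('efg'), index=['ab', 'ont'])

def spread(x):
    # x is one column (default) or one row (axis='columns') as a Series
    return x.max() - x.min()

per_column = frame.apply(spread)                  # indexed by column name
per_row = frame.apply(spread, axis='columns')     # indexed by row label

print(per_column)
print(per_row)
```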

In [59]:
f = lambda x: x.max() - x.min()
frame.apply(f)

e    2.206074
f    2.950803
g    2.883630
dtype: float64

function passed to apply need not return a scalar value - can return a series with multiple values

In [63]:
def f(x): 
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)


            e         f         g
min -0.734470 -0.638758 -0.137836
max  1.255270  1.099627  1.329661


## Sorting and Ranking

#### `sort_index` Sort by row or col index - sorted in ascending order by default


`obj.sort_index()` OR `obj.sort_index(axis=1)` OR `obj.sort_index(ascending=False)`

#### `sort_values` Sort by values - sorted in ascending order by default
Missing values are sorted to the end by default

`obj.sort_values()`

For a DF, use the data in one or more columns as the sort key

`frame.sort_values(by='b')`

The `rank` method assigns ranks from one through the number of valid data points in an array; by default, ties share the mean of their ranks
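
A short sketch of sorting and ranking on made-up data; every variable name below is illustrative:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
sorted_vals = obj.sort_values()            # NaN values land at the end

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_b = frame.sort_values(by='b')           # rows ordered by column 'b'
by_col_labels = frame.sort_index(axis=1)   # columns ordered by their labels

ranks = pd.Series([7, -5, 7, 4]).rank()    # default method: ties get the mean rank

print(sorted_vals)
print(by_b)
print(by_col_labels)
print(ranks)
```

In the last Series the two 7s would occupy ranks 3 and 4, so each receives 3.5.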

# 5.3 Summarizing and Computing Descriptive Stats

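Reductions handle missing values on their own: NaNs are skipped by default (pass `skipna=False` to disable). If an entire row or col is NaN, statistics such as `mean` return NaN (recent pandas versions return 0 for `sum`)

`df.sum()` OR `df.sum(axis=1)`

`df.describe()` returns multiple summary statistics at once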
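
A minimal sketch of the missing-value handling, on a small made-up frame. `mean` is used for the row-wise case because its all-NaN result is NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['one', 'two'])

col_sums = df.sum()                         # NaNs skipped, one value per column
row_means = df.mean(axis=1)                 # row 'c' is all NaN, so its mean is NaN
strict = df.mean(axis=1, skipna=False)      # any NaN in a row makes the result NaN
summary = df.describe()                     # count, mean, std, min, quartiles, max

print(col_sums)
print(row_means)
print(strict)
print(summary)
```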

### Correlation and Covariance



