### <font color="brown">Pandas</font>

https://pandas.pydata.org/docs/user_guide/index.html<br>
(You can also reach this from Jupyter Notebook through Help -> pandas Reference -> User Guide)

#### Pandas has two key data structures: Series and DataFrame

In [1]:
from pandas import Series

---

#### <font color="brown">Series is a 1D array-like object containing an array of data (of any NumPy datatype),<br> and an associated array of data labels called *index*</font>

In [2]:
aser = Series([1, 5, -2, 16])
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [3]:
aser.values, aser.index

(array([ 1,  5, -2, 16], dtype=int64), RangeIndex(start=0, stop=4, step=1))

**Both values and index have data types**

In [4]:
aser.values.dtype, aser.index.dtype

(dtype('int64'), dtype('int64'))

**Can explicitly specify index**

In [5]:
aser = Series([1, 5, -2, 16], index=range(0,4))
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [7]:
aser.values, aser.index

(array([ 1,  5, -2, 16], dtype=int64), RangeIndex(start=0, stop=4, step=1))

**Index can be string labels**

In [8]:
ser = Series([1, 5, -2, 16], index=['a','b','x','d'])
ser

a     1
b     5
x    -2
d    16
dtype: int64

In [9]:
print(ser.index.dtype) 

object


---

#### <font color="brown">Accessing Series values</font>

**Can use index label subscripts to access and assign values, like a dictionary**

In [10]:
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [11]:
aser[2]

-2

In [12]:
ser

a     1
b     5
x    -2
d    16
dtype: int64

In [13]:
ser['x']

-2

In [18]:
ser['a'] = 10
ser

a    10
b     5
x    -2
d    16
dtype: int64

**Can select several values at once with a list of index labels (like NumPy fancy indexing)**

In [21]:
ser[['x','a','b']]  

x    -2
a    10
b     5
dtype: int64
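
Plain `[]` subscripts can be ambiguous when the index itself holds integers, so one option is to be explicit: `.loc` for labels and `.iloc` for positions. The following is a minimal sketch that rebuilds `ser` with the values shown above, so it runs on its own.

```python
from pandas import Series

# Same values and labels as ser above (after ser['a'] = 10)
ser = Series([10, 5, -2, 16], index=['a', 'b', 'x', 'd'])

by_label = ser.loc['x']         # label-based lookup
by_position = ser.iloc[2]       # position-based lookup (third element)
subset = ser.loc[['x', 'a']]    # list of labels, returned in the order requested

print(by_label, by_position)
print(subset)
```

Both lookups reach the same element here because `'x'` is the label at position 2.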

---

#### <font color="brown">NumPy-like array operations work as before; the index tags along</font>

In [22]:
import numpy as np

print(ser, '\n')

res = ser[ser > 0]
print(res, '\n')

res = ser * 2
print(res, '\n')

res = np.power(ser,2)
print(res, '\n')

ser = ser ** 2
print(ser, '\n')

a    10
b     5
x    -2
d    16
dtype: int64 

a    10
b     5
d    16
dtype: int64 

a    20
b    10
x    -4
d    32
dtype: int64 

a    100
b     25
x      4
d    256
dtype: int64 

a    100
b     25
x      4
d    256
dtype: int64 
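
One way to check that the index really does "tag along" is to inspect the index of each result. This is a small sketch that rebuilds `ser` with the pre-squaring values so it is self-contained; `np.abs` is used only as an example of a NumPy ufunc.

```python
import numpy as np
from pandas import Series

ser = Series([10, 5, -2, 16], index=['a', 'b', 'x', 'd'])

positive = ser[ser > 0]     # boolean mask keeps the labels of the surviving rows
doubled = ser * 2           # element-wise arithmetic keeps every label
absolute = np.abs(ser)      # a NumPy ufunc returns a Series, not a bare ndarray

print(list(positive.index))
print(doubled.index.equals(ser.index))
print(type(absolute).__name__)
```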



---

#### <font color="brown">Series is like an ordered dictionary</font>

**Can test membership on the index (like key membership in a dictionary)**

In [23]:
ser

a    100
b     25
x      4
d    256
dtype: int64

In [24]:
'x' in ser

True

In [25]:
'y' in ser

False

**Can create a Series out of a Python dictionary**

In [26]:
udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
useries = Series(udict)
print(useries)
print(useries.index)

Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
dtype: int64
Index(['Rutgers', 'Princeton', 'MIT', 'USC'], dtype='object')


**Make a new Series from udict, with an explicit index that replaces 'Princeton' with 'Purdue'**

In [29]:
univs = ['Purdue','Rutgers','MIT','USC']
useries2 = Series(udict, index=univs)
useries2

# dtype changes from int to float because NaN is a float value

Purdue         NaN
Rutgers    55000.0
MIT        20000.0
USC        40000.0
dtype: float64
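
Passing `index=` to the constructor behaves like building the Series first and then calling `reindex`. If the float conversion is unwanted, pandas' nullable `'Int64'` dtype is one possible workaround. A hedged sketch, with `udict` and `univs` redefined so it runs standalone:

```python
from pandas import Series

udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
univs = ['Purdue', 'Rutgers', 'MIT', 'USC']

from_dict = Series(udict, index=univs)      # as in the cell above
reindexed = Series(udict).reindex(univs)    # same result via reindex
nullable = reindexed.astype('Int64')        # nullable integers; the gap shows as <NA>

print(from_dict.equals(reindexed))
print(nullable)
```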

In [30]:
# What if the dictionary has list values?
adict = {"one": [1,2,3,4], "two": [4,5,6]}
aser = Series(adict)
aser

one    [1, 2, 3, 4]
two       [4, 5, 6]
dtype: object

In [31]:
np.power(aser,2)

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

---

##### <font color="brown">Checking for null/not null values</font>
**NaN is treated as null (a missing value)**

In [32]:
useries2

Purdue         NaN
Rutgers    55000.0
MIT        20000.0
USC        40000.0
dtype: float64

In [33]:
useries2.isnull()  

Purdue      True
Rutgers    False
MIT        False
USC        False
dtype: bool

In [34]:
useries2.notnull()

Purdue     False
Rutgers     True
MIT         True
USC         True
dtype: bool
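
Because the result of `isnull()` is a boolean Series, it can be summed to count missing values or used as a mask to find their labels. A minimal sketch, with `useries2` rebuilt from the values shown above (`isna()` and `notna()` are aliases of `isnull()` and `notnull()`):

```python
import numpy as np
from pandas import Series

useries2 = Series([np.nan, 55000.0, 20000.0, 40000.0],
                  index=['Purdue', 'Rutgers', 'MIT', 'USC'])

n_missing = useries2.isnull().sum()                        # each True counts as 1
missing_labels = list(useries2[useries2.isnull()].index)   # labels whose value is NaN
has_any = useries2.isnull().any()                          # is anything missing at all?

print(n_missing, missing_labels, has_any)
```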

---

##### <font color="brown">Naming the Series, and the index</font>

In [35]:
useries

Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
dtype: int64

In [36]:
useries.name = "student population"
useries.index.name = "university"
useries

university
Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
Name: student population, dtype: int64

---

##### <font color="brown">Adding two Series: automatic alignment of differently indexed data</font>

**If an index label appears in one Series and not the other, the result for that label is NaN**

In [37]:
# a label present on only one side gives NaN in the result
useries + useries2

MIT           40000.0
Princeton         NaN
Purdue            NaN
Rutgers      110000.0
USC           80000.0
dtype: float64
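
If NaN is not the desired result for one-sided labels, the `add` method with `fill_value` is one alternative to `+`. The sketch below uses two small made-up Series (`a` and `b` are illustrative, not from the notebook) to contrast the two.

```python
from pandas import Series

# Hypothetical data chosen so that only 'y' appears on both sides
a = Series({'x': 1, 'y': 2})
b = Series({'y': 10, 'z': 20})

plain = a + b                      # labels on only one side give NaN
filled = a.add(b, fill_value=0)    # the missing side is treated as 0

print(plain)
print(filled)
```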

---

##### <font color="brown">Changing the index</font>

In [45]:
useries

university
Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
Name: student population, dtype: int64

In [46]:
print('Original index: ',useries.index)
useries.index = ['RU','Princeton U','MIT','USC']
print('\nUpdated index: ',useries.index)

Original index:  Index(['Rutgers', 'Princeton', 'MIT', 'USC'], dtype='object', name='university')

Updated index:  Index(['RU', 'Princeton U', 'MIT', 'USC'], dtype='object')


In [40]:
useries

RU             55000
Princeton U    15000
MIT            20000
USC            40000
Name: student population, dtype: int64

In [41]:
useries.name

'student population'

---

##### <font color="brown">Dropping NaNs</font>

In [49]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**Alternatively, you can use an alias for np.nan (popular), as follows**

In [53]:
from numpy import nan as NA

dat = Series([1, NA, 2.6, NA, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**dropna**

In [54]:
dat.dropna()

0    1.0
2    2.6
4    6.0
dtype: float64

In [55]:
dat  # is the original modified?

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

---

##### <font color="brown">Filling NaNs</font>

In [56]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**1. Filling (replacing) NaNs with a specific value**

In [57]:
dat.fillna(dat.mean())   # replace each NaN with mean

0    1.0
1    3.2
2    2.6
3    3.2
4    6.0
dtype: float64

In [58]:
dat   # is the original modified?

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

**<font color="red">fillna returns a new Series, the original is unchanged</font>**

**2. Filling (replacing) NaNs with existing value using 'forward fill'**

In [59]:
dat.fillna(method='ffill')   # forward fill value into following NaN sequence

0    1.0
1    1.0
2    2.6
3    2.6
4    6.0
dtype: float64

In [60]:
dat2 = Series([1, np.nan, 2.6, np.nan, np.nan, 6])
print(dat2)

0    1.0
1    NaN
2    2.6
3    NaN
4    NaN
5    6.0
dtype: float64


In [61]:
dat2.fillna(method='ffill')   # forward fill value into following NaN sequence

0    1.0
1    1.0
2    2.6
3    2.6
4    2.6
5    6.0
dtype: float64

**3. Filling (replacing) NaNs with existing value using 'back fill'**

In [62]:
dat

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

In [63]:
dat.fillna(method='bfill')   # back fill value into preceding NaN sequence

0    1.0
1    2.6
2    2.6
3    6.0
4    6.0
dtype: float64

In [64]:
dat2

0    1.0
1    NaN
2    2.6
3    NaN
4    NaN
5    6.0
dtype: float64

In [65]:
dat2.fillna(method='bfill')   # back fill value into preceding NaN sequence

0    1.0
1    2.6
2    2.6
3    6.0
4    6.0
5    6.0
dtype: float64
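
Newer pandas releases warn about the `method=` argument of `fillna`, so the dedicated `ffill()` and `bfill()` methods may be the safer spelling. A hedged sketch on the same `dat2` values, also showing the optional `limit` parameter:

```python
import numpy as np
from pandas import Series

dat2 = Series([1, np.nan, 2.6, np.nan, np.nan, 6])

forward = dat2.ffill()           # same result as fillna(method='ffill')
backward = dat2.bfill()          # same result as fillna(method='bfill')
limited = dat2.ffill(limit=1)    # fill at most one consecutive NaN per gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

As with `fillna`, each call returns a new Series and leaves `dat2` unchanged.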

---

##### <font color="brown">Filtering with notnull()</font>

In [66]:
dat[dat.notnull()]

0    1.0
2    2.6
4    6.0
dtype: float64

In [67]:
dat1 = dat.copy()
dat1

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

**Updating in place (modifying original) with inplace parameter**

In [68]:
dat1.dropna(inplace=True)  
dat1

0    1.0
2    2.6
4    6.0
dtype: float64

**Or modify original by reassigning to it**

In [69]:
dat1 = dat.copy()
dat1

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

In [70]:
dat1 = dat1[dat1.notnull()]
dat1

0    1.0
2    2.6
4    6.0
dtype: float64
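
The three approaches above can be compared side by side. This sketch rebuilds `dat` so it runs standalone; note that a call with `inplace=True` returns `None`, so its result should not be assigned.

```python
import numpy as np
from pandas import Series

dat = Series([1, np.nan, 2.6, np.nan, 6])

dropped = dat.dropna()                # new Series; dat is untouched
masked = dat[dat.notnull()]           # boolean-mask equivalent

dat1 = dat.copy()
result = dat1.dropna(inplace=True)    # modifies dat1 and returns None

print(dropped.equals(masked), len(dat), len(dat1), result)
```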

---

##### <font color="brown">Counting occurrences of values</font>

In [71]:
valser = Series(np.random.randint(1,10,20))
valser

0     6
1     2
2     4
3     2
4     2
5     3
6     1
7     3
8     6
9     3
10    3
11    8
12    7
13    5
14    7
15    3
16    7
17    9
18    1
19    2
dtype: int32

In [72]:
valser.value_counts()

3    5
2    4
7    3
6    2
1    2
4    1
8    1
5    1
9    1
dtype: int64

In [73]:
valser.value_counts().index

Int64Index([3, 2, 7, 6, 1, 4, 8, 5, 9], dtype='int64')

In [74]:
valser.value_counts()[7]

3
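
`value_counts()` sorts by count, highest first. The sketch below uses a short fixed list instead of random numbers (an assumption made for repeatability) to show two variations: sorting by value, and returning fractions with `normalize=True`.

```python
from pandas import Series

# Fixed sample data so the counts are predictable
valser = Series([6, 2, 4, 2, 2, 3, 1, 3, 6, 3])

counts = valser.value_counts()                  # sorted by count, descending
by_value = counts.sort_index()                  # sorted by the value instead
shares = valser.value_counts(normalize=True)    # fractions that sum to 1

print(by_value)
print(shares)
```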

---

##### <font color="brown">Mapping values with map function</font>

In [75]:
def mapper(val):
    return val**2 + 5

In [76]:
aser = Series(np.arange(1,5))
aser

0    1
1    2
2    3
3    4
dtype: int32

In [77]:
aser.map(mapper)   # each value is transformed via the mapper function

0     6
1     9
2    14
3    21
dtype: int64

In [78]:
aser

0    1
1    2
2    3
3    4
dtype: int32

In [80]:
aser.map(mapper, inplace = True)   # each value is transformed via the mapper function

TypeError: map() got an unexpected keyword argument 'inplace'

In [79]:
aser.map(lambda v: v**2 + 5)   # can do the same with a lambda function

0     6
1     9
2    14
3    21
dtype: int64
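
Since `map` has no `inplace` option, the usual way to keep the result is to assign it back. `map` also accepts a dictionary, in which case values without a matching key become NaN. A minimal sketch (the lookup dictionary is made up for illustration):

```python
from pandas import Series

aser = Series([1, 2, 3, 4])

aser = aser.map(lambda v: v**2 + 5)         # reassign to keep the mapped values
labels = aser.map({6: 'six', 9: 'nine'})    # dict lookup; unmatched values become NaN

print(aser.tolist())
print(labels)
```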

---

##### <font color="brown">Resetting the index</font>

In [81]:
useries

RU             55000
Princeton U    15000
MIT            20000
USC            40000
Name: student population, dtype: int64

**Reset index to numbers**

In [82]:
useries = useries.reset_index() 
useries

| | index | student population |
|---|---|---|
| 0 | RU | 55000 |
| 1 | Princeton U | 15000 |
| 2 | MIT | 20000 |
| 3 | USC | 40000 |


In [85]:
type(useries)   

pandas.core.frame.DataFrame

**Can change column names**

In [84]:
useries.columns = ['Univ','Student Population']
useries

| | Univ | Student Population |
|---|---|---|
| 0 | RU | 55000 |
| 1 | Princeton U | 15000 |
| 2 | MIT | 20000 |
| 3 | USC | 40000 |
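
The column names can also be chosen while resetting, rather than afterwards. The sketch below is one possible way, using `rename_axis` and the `name` parameter of `Series.reset_index`; the Series is rebuilt here so the example is self-contained.

```python
from pandas import Series

useries = Series([55000, 15000, 20000, 40000],
                 index=['RU', 'Princeton U', 'MIT', 'USC'],
                 name='student population')

# Name the index first, then name the value column while resetting.
udf = useries.rename_axis('Univ').reset_index(name='Student Population')

# drop=True discards the old labels and returns a Series, not a DataFrame.
renumbered = useries.reset_index(drop=True)

print(udf)
print(type(renumbered).__name__)
```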


---

---

### <font color="brown">Pandas - DataFrame</font>

#### DataFrame is a tabular spreadsheet-like data structure consisting of an ordered collection of columns, each of which can be a different value type

In [3]:
import numpy as np
from pandas import DataFrame

#### <font color="brown">DataFrame Creation</font>

**1. Creating a DataFrame from a dictionary**

In [4]:
popdat = {'state': ['Arizona','Arizona','Arizona','Virginia','Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat)
popdf

| | state | year | pop |
|---|---|---|---|
| 0 | Arizona | 2005 | 5.9 |
| 1 | Arizona | 2010 | 6.6 |
| 2 | Arizona | 2015 | 6.8 |
| 3 | Virginia | 2010 | 7.9 |
| 4 | Virginia | 2015 | 8.3 |


In [5]:
popdf.shape

(5, 3)

**Can order the columns of a DataFrame during creation with the columns parameter**

In [89]:
popdf = DataFrame(popdat, columns=['year','state', 'pop'])
popdf

| | year | state | pop |
|---|---|---|---|
| 0 | 2005 | Arizona | 5.9 |
| 1 | 2010 | Arizona | 6.6 |
| 2 | 2015 | Arizona | 6.8 |
| 3 | 2010 | Virginia | 7.9 |
| 4 | 2015 | Virginia | 8.3 |


**If you give a column name that's not a key in the dictionary, you get NaNs (like Series index)**

In [90]:
popdf1 = DataFrame(popdat, columns=['year','state', 'population'])
popdf1

| | year | state | population |
|---|---|---|---|
| 0 | 2005 | Arizona | NaN |
| 1 | 2010 | Arizona | NaN |
| 2 | 2015 | Arizona | NaN |
| 3 | 2010 | Virginia | NaN |
| 4 | 2015 | Virginia | NaN |


**Index and column names**

In [91]:
print('Index:',popdf.index)
print('Columns:',popdf.columns)

Index: RangeIndex(start=0, stop=5, step=1)
Columns: Index(['year', 'state', 'pop'], dtype='object')


**values property gives an ndarray**

In [92]:
popdf.values

array([[2005, 'Arizona', 5.9],
       [2010, 'Arizona', 6.6],
       [2015, 'Arizona', 6.8],
       [2010, 'Virginia', 7.9],
       [2015, 'Virginia', 8.3]], dtype=object)

In [93]:
type(popdf.values)

numpy.ndarray
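
The `object` dtype above appears because the columns hold mixed types. Each column keeps its own dtype inside the DataFrame, and selecting only the numeric columns gives a numeric array. A hedged sketch using `to_numpy()`, with `popdf` rebuilt so it runs on its own:

```python
from pandas import DataFrame

popdat = {'state': ['Arizona', 'Arizona', 'Arizona', 'Virginia', 'Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat, columns=['year', 'state', 'pop'])

print(popdf.dtypes)                            # one dtype per column

mixed = popdf.to_numpy()                       # mixed columns give an object array
numeric = popdf[['year', 'pop']].to_numpy()    # int + float columns give a float array

print(mixed.dtype, numeric.dtype)
```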

In [94]:
popdf.name  

AttributeError: 'DataFrame' object has no attribute 'name'

---

**2. Creating a DataFrame from a nested dictionary**

In [95]:
popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
popdf2

| | Arizona | Virginia |
|---|---|---|
| 2005 | 5.9 | NaN |
| 2010 | 6.6 | 7.9 |
| 2015 | 6.8 | 8.3 |


In [96]:
popdf2.T

| | 2005 | 2010 | 2015 |
|---|---|---|---|
| Arizona | 5.9 | 6.6 | 6.8 |
| Virginia | NaN | 7.9 | 8.3 |


In [97]:
popdf2    # is the original dataframe modified?

| | Arizona | Virginia |
|---|---|---|
| 2005 | 5.9 | NaN |
| 2010 | 6.6 | 7.9 |
| 2015 | 6.8 | 8.3 |


In [98]:
# columns= selects dictionary keys, it does not rename them; a name that matches no key has no data (NaN)
popdf3 = DataFrame(popdat2,columns=['AZ','VA'])
popdf3

| | AZ | VA |
|---|---|---|


In [99]:
popdf3 = DataFrame(popdat2,columns=['VA','Arizona'])
popdf3

| | VA | Arizona |
|---|---|---|
| 2005 | NaN | 5.9 |
| 2010 | NaN | 6.6 |
| 2015 | NaN | 6.8 |
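
To actually rename the columns while keeping the data, one option is to build the DataFrame first and then call `rename`. A minimal sketch (the short names `AZ` and `VA` follow the cells above):

```python
from pandas import DataFrame

popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}

# columns= only selects keys, so unmatched names carry no data
selected = DataFrame(popdat2, columns=['AZ', 'VA'])

# rename maps old column names to new ones and keeps the values
renamed = DataFrame(popdat2).rename(columns={'Arizona': 'AZ', 'Virginia': 'VA'})

print(renamed)
```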


---

**3. Creating a DataFrame from a 2D NumPy array**

In [None]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

**Change index and column names**

In [None]:
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

**Or set them up at creation time**

In [None]:
randdf = DataFrame(rand2d, index=['one', 'two', 'three'],
                   columns = ['first', 'second'])
randdf
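
Since the random values differ on every run, a seeded generator can make the example repeatable. The sketch below assumes NumPy's `default_rng` (the seed `0` is arbitrary) and checks that labels address the same cells as array positions.

```python
import numpy as np
from pandas import DataFrame

rng = np.random.default_rng(0)    # seeded generator, an assumption for repeatability
rand2d = rng.random((3, 2))

randdf = DataFrame(rand2d, index=['one', 'two', 'three'],
                   columns=['first', 'second'])

# Row label 'two' and column label 'second' correspond to array position [1, 1]
print(randdf.loc['two', 'second'] == rand2d[1, 1])
```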

---

#### <font color="brown">Columns</font>

**Membership**

In [None]:
popdf

In [None]:
'debt' in popdf.columns  

**Each column is a Series**

**Column can be referenced by using column name as index into dataframe**

In [None]:
print(popdf['state'])
print(popdf['state'].name)
print(popdf['state'].values)
print(popdf['state'].index)

**Alternatively, a column can be referenced as an attribute of the dataframe**

In [None]:
popdf.state

**Can select a subset of columns with a list of column names, similar to selecting rows of an ndarray or index labels of a Series**

In [None]:
popdf[['state','pop']]
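
Two details are easy to miss here: a single name returns a Series while a list of names returns a DataFrame, and attribute access does not work for a column whose name collides with a DataFrame method (here `pop`). A hedged sketch with `popdf` rebuilt for self-containment:

```python
from pandas import DataFrame, Series

popdat = {'state': ['Arizona', 'Arizona', 'Arizona', 'Virginia', 'Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat)

col = popdf['state']             # single name: a Series
sub = popdf[['state']]           # list with one name: a one-column DataFrame
two = popdf[['state', 'pop']]    # list of names: columns in the order requested

print(type(col).__name__, type(sub).__name__)
print(callable(popdf.pop))       # popdf.pop is the DataFrame.pop method, not the column
print(popdf['pop'].tolist())     # subscript access still reaches the column
```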

**Changing column names**

In [None]:
popdf

In [None]:
popdf.columns = ['Year','State','Pop']   # renames by position; the data is not reordered
popdf

In [None]:
# restore to original
popdf.columns = ['year','state','pop']
popdf