### <font color="brown">Pandas</font>

https://pandas.pydata.org/docs/user_guide/index.html<br>
(You can also get at this from Jupyter notebook through Help -> pandas Reference -> User Guide)

#### Pandas has two key data structures: Series and DataFrame

In [None]:
from pandas import Series

---

#### <font color="brown">Series is a 1D array-like object containing an array of data (of any NumPy datatype),<br> and an associated array of data labels called *index*</font>

In [None]:
aser = Series([1, 5, -2, 16])
aser

In [None]:
aser.values, aser.index

**Both values and index have data types**

In [None]:
aser.values.dtype, aser.index.dtype

**Can explicitly specify index**

In [None]:
aser = Series([1, 5, -2, 16], index=range(0,4))
aser

In [None]:
aser.values, aser.index

**Index can be string labels**

In [None]:
ser = Series([1, 5, -2, 16], index=['a','b','x','d'])
ser

In [None]:
print(ser.index.dtype) 

---

#### <font color="brown">Accessing Series values</font>

**Can use index label subscripts to access and assign values, like a dictionary**

In [None]:
aser

In [None]:
aser[2]

In [None]:
ser

In [None]:
ser['x']

In [None]:
ser['a'] = 10
ser

**Can access several values at once by passing a list of index labels (like NumPy fancy indexing)**

In [None]:
ser[['x','a','b']]  
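
**Label vs. position.** Plain `[]` on a Series looks up index labels, which gets ambiguous when the labels are themselves integers. A minimal sketch of the explicit accessors `.loc` (by label) and `.iloc` (by position); the Series `s` is a made-up example that mirrors `ser` above:

```python
from pandas import Series

s = Series([1, 5, -2, 16], index=['a', 'b', 'x', 'd'])

by_label = s.loc['x']          # label-based lookup
by_position = s.iloc[2]        # position-based lookup (third element)
subset = s.loc[['x', 'a']]     # list of labels, returned in the order asked for

print(by_label, by_position)
print(subset)
```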

---

#### <font color="brown">NumPy-like array operations work as before; the index tags along</font>

In [None]:
import numpy as np

print(ser, '\n')

res = ser[ser > 0]
print(res, '\n')

res = ser * 2
print(res, '\n')

res = np.power(ser,2)
print(res, '\n')

ser = ser ** 2
print(ser, '\n')

---

#### <font color="brown">Series is like an ordered dictionary</font>

**Can do membership on index (like key membership in dictionary)**

In [None]:
ser

In [None]:
'x' in ser
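
**Membership tests the index, not the values.** As with dictionary keys, `in` only looks at the labels. A small sketch, using a made-up Series `s`, of how to test the values instead:

```python
from pandas import Series

s = Series([1, 5, -2, 16], index=['a', 'b', 'x', 'd'])

label_present = 'x' in s            # True: 'x' is an index label
value_as_label = 16 in s            # False: 16 is a value, not a label
value_present = 16 in s.values      # True: search the underlying array
mask = s.isin([5, 16])              # boolean Series marking matching values

print(label_present, value_as_label, value_present)
print(mask)
```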

**Can create a Series out of a Python dictionary**

In [None]:
udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
useries = Series(udict)
print(useries)
print(useries.index)

**Make a new series out of a subset of useries, with an explicit index that replaces 'Princeton' with 'Purdue'**

In [None]:
univs = ['Purdue','Rutgers','MIT','USC']
useries2 = Series(udict, index=univs)
useries2

In [None]:
# What if dictionary has list values
adict = {"one": [1,2,3,4], "two": [4,5,6]}
aser = Series(adict)
aser

In [None]:
np.power(aser,2)
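
**Each value in that Series is a whole Python list (dtype `object`)**, so arithmetic is attempted on the list objects themselves rather than on the numbers inside them. One possible workaround, sketched here with a hypothetical `list_ser`, is to transform each list with `map`:

```python
from pandas import Series

list_ser = Series({"one": [1, 2, 3, 4], "two": [4, 5, 6]})

# Square the numbers inside each list, one list at a time
squared = list_ser.map(lambda lst: [v ** 2 for v in lst])

print(list_ser.dtype)
print(squared)
```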

---

##### <font color="brown">Checking for null/not null values</font>
**NaN is equivalent to null**

In [None]:
useries2

In [None]:
useries2.isnull()  

In [None]:
useries2.notnull()

---

##### <font color="brown">Naming the Series, and the index</font>

In [None]:
useries

In [None]:
useries.name = "student population"
useries.index.name = "university"
useries

---

##### <font color="brown">Adding two Series: automatic alignment of differently indexed data</font>

**If an index appears in one and not the other, result is NaN**

In [None]:
useries + useries2
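
**Avoiding NaN for labels present on only one side.** A minimal sketch using the `add` method with `fill_value`; the two Series `a` and `b` are made-up data for illustration:

```python
from pandas import Series

a = Series({'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000})
b = Series({'Rutgers': 1000, 'MIT': 500, 'Purdue': 2000})

plain = a + b                     # labels missing from either side give NaN
filled = a.add(b, fill_value=0)   # the missing side is treated as 0 before adding

print(plain)
print(filled)
```

Note that `fill_value` only helps when a label is missing on one side; if the value is missing on both sides, the result is still NaN.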

---

##### <font color="brown">Changing the index</font>

In [None]:
useries

In [None]:
print('Original index: ',useries.index)
useries.index = ['RU','Princeton U','MIT','USC']
print('\nUpdated index: ',useries.index)

In [None]:
useries

---

##### <font color="brown">Dropping NaNs</font>

In [None]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

**Alternatively, you can use an alias for np.nan (popular), as follows**

In [None]:
from numpy import nan as NA

dat = Series([1, NA, 2.6, NA, 6])
print(dat)

**dropna**

In [None]:
dat.dropna()

In [None]:
dat  # is the original modified?

---

##### <font color="brown">Filling NaNs</font>

In [None]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

**1. Filling (replacing) NaNs with a specific value**

In [None]:
dat.fillna(dat.mean())   # replace each NaN with mean

In [None]:
dat   # ? is original modified?

**<font color="red">fillna returns a new Series, the original is unchanged</font>**

**2. Filling (replacing) NaNs with existing value using 'forward fill'**

In [None]:
dat.ffill()   # forward fill: copy the last valid value into the NaNs that follow it

In [None]:
dat2 = Series([1, np.nan, 2.6, np.nan, np.nan, 6])
print(dat2)

In [None]:
dat2.ffill()   # forward fill: copy the last valid value into the NaNs that follow it

**3. Filling (replacing) NaNs with existing value using 'back fill'**

In [None]:
dat

In [None]:
dat.bfill()   # back fill: copy the next valid value into the NaNs that precede it

In [None]:
dat2

In [None]:
dat2.bfill()   # back fill: copy the next valid value into the NaNs that precede it
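
**Edge cases for forward and back fill.** A short sketch, on a made-up Series `gaps`, of what happens to NaNs at the very start or end, and of the `limit` parameter that caps how many NaNs are filled per gap:

```python
import numpy as np
from pandas import Series

gaps = Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = gaps.ffill()           # leading NaN stays: nothing before it to copy
backward = gaps.bfill()          # trailing NaN stays: nothing after it to copy
limited = gaps.ffill(limit=1)    # fill at most one NaN in each run of NaNs

print(forward)
print(backward)
print(limited)
```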

---

##### <font color="brown">Filtering with notnull()</font>

In [None]:
dat[dat.notnull()]

In [None]:
dat1 = dat.copy()
dat1

**Updating in place (modifying original) with inplace parameter**

In [None]:
dat1.dropna(inplace=True)  
dat1

**Or modify original by reassigning to it**

In [None]:
dat1 = dat.copy()
dat1

In [None]:
dat1 = dat1[dat1.notnull()]
dat1

---

##### <font color="brown">Counting occurrences of values</font>

In [None]:
valser = Series(np.random.randint(1,10,20))
valser

In [None]:
valser.value_counts()

In [None]:
valser.value_counts().index

In [None]:
valser.value_counts().get(7, 0)   # count for the value 7; 0 if 7 never came up in the random draw

---

##### <font color="brown">Mapping values with map function</font>

In [None]:
def mapper(val):
    return val**2 + 5

In [None]:
aser = Series(np.arange(1,5))
aser

In [None]:
aser.map(mapper)   # each value is transformed via the mapper function

In [None]:
aser

In [None]:
aser.map(lambda v: v**2 + 5)   # can do the same with a lambda function
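
**`map` also accepts a dictionary.** Besides a function, a lookup table can be passed, as in this small sketch with made-up state codes; values with no entry in the dictionary come back as NaN:

```python
from pandas import Series

codes = Series(['NJ', 'MA', 'CA', 'TX'])
names = {'NJ': 'New Jersey', 'MA': 'Massachusetts', 'CA': 'California'}

mapped = codes.map(names)   # 'TX' has no entry in the dict, so it becomes NaN

print(mapped)
```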

---

##### <font color="brown">Resetting the index</font>

In [None]:
useries

**Reset index to numbers**

In [None]:
useries = useries.reset_index() 
useries

In [None]:
type(useries)   

**Can change column names**

In [None]:
useries.columns = ['Univ','Student Population']
useries
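
**Keeping a Series after resetting the index.** As seen above, `reset_index()` moves the old index into a column and so returns a DataFrame. A minimal sketch, with a hypothetical Series `pop`, of the `drop=True` option, which discards the old index instead:

```python
from pandas import Series

pop = Series({'Rutgers': 55000, 'MIT': 20000}, name='population')

as_frame = pop.reset_index()            # old index becomes a column -> DataFrame
as_series = pop.reset_index(drop=True)  # old index discarded -> still a Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series)
```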

---

---

### <font color="brown">Pandas - DataFrame</font>

#### DataFrame is a tabular spreadsheet-like data structure consisting of an ordered collection of columns, each of which can be a different value type

In [None]:
import numpy as np
from pandas import DataFrame

#### <font color="brown">DataFrame Creation</font>

**1. Creating a DataFrame from a dictionary**

In [None]:
popdat = {'state': ['Arizona','Arizona','Arizona','Virginia','Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat)
popdf

In [None]:
popdf.shape

**Can order the columns of a DataFrame as needed during creation with the columns parameter**

In [None]:
popdf = DataFrame(popdat, columns=['year','state', 'pop'])
popdf

**If you give a column name that's not a key in the dictionary, you get a column of NaNs (like Series index)**

In [None]:
popdf1 = DataFrame(popdat, columns=['year','state', 'population'])
popdf1

**Index and columns names**

In [None]:
print('Index:',popdf.index)
print('Columns:',popdf.columns)

**values property gives an ndarray**

In [None]:
popdf.values

In [77]:
type(popdf.values)

numpy.ndarray

In [None]:
popdf.name   # raises AttributeError: unlike a Series, a DataFrame has no name attribute

---

**2. Creating a DataFrame from a nested dictionary**

In [None]:
popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
popdf2

In [None]:
popdf2.T

In [None]:
popdf2    # is the original dataframe modified?

In [None]:
popdf3 = DataFrame(popdat2,columns=['AZ','VA'])
popdf3

In [None]:
popdf3 = DataFrame(popdat2,columns=['VA','Arizona'])
popdf3
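
**How the nested dictionary is laid out.** Outer keys become columns and inner keys become the row index; an inner key missing for one column shows up as NaN. A sketch of this, plus `DataFrame.from_dict` with `orient='index'` as one way to put the outer keys on the rows instead (the variable names here are made up):

```python
from pandas import DataFrame

nested = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
          'Virginia': {2010: 7.9, 2015: 8.3}}

by_state_cols = DataFrame(nested)                             # outer keys -> columns
by_state_rows = DataFrame.from_dict(nested, orient='index')   # outer keys -> rows

print(by_state_cols)
print(by_state_rows)
```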

---

**3. Creating a DataFrame from a 2D NumPy array**

In [None]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

**Change index and column names**

In [None]:
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

**Or set them up at creation time**

In [None]:
randdf = DataFrame(rand2d, index=['one', 'two', 'three'],
                   columns = ['first', 'second'])
randdf

---

#### <font color="brown">Columns</font>

**Membership**

In [None]:
popdf

In [None]:
'debt' in popdf.columns  

**Each column is a Series**

**Column can be referenced by using column name as index into dataframe**

In [None]:
print(popdf['state'])
print(popdf['state'].name)
print(popdf['state'].values)
print(popdf['state'].index)

**Alternatively, a column can be referenced as an attribute of the dataframe**

In [None]:
popdf.state
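
**Caution with attribute access.** It only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. A small sketch with a made-up `df`: a column named `pop` collides with the `DataFrame.pop` method, so bracket access is the safer habit:

```python
from pandas import DataFrame

df = DataFrame({'state': ['Arizona', 'Virginia'], 'pop': [6.8, 8.3]})

state_col = df.state          # works: 'state' does not clash with anything
pop_attr = df.pop             # NOT the column: this is the DataFrame.pop method
pop_col = df['pop']           # bracket access always returns the column

print(callable(pop_attr))
print(pop_col)
```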

**Can get at a subset of columns with list, similar to rows of ndarray or index of Series**

In [None]:
popdf[['state','pop']]

**Changing column names**

In [None]:
popdf

In [None]:
popdf.columns = ['Year','State','Population']   # relabels the columns in place; the data is not reordered
popdf

In [None]:
# restore to original
popdf.columns = ['year','state','pop']
popdf
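
**Renaming selected columns.** Assigning to `columns` replaces every label at once. A minimal sketch, on a made-up `df`, of the `rename` method, which changes only the columns named in the mapping and returns a new DataFrame:

```python
from pandas import DataFrame

df = DataFrame({'state': ['Arizona', 'Virginia'], 'pop': [6.8, 8.3]})

renamed = df.rename(columns={'pop': 'population'})  # returns a new DataFrame

print(renamed.columns)
print(df.columns)    # the original is unchanged
```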