## CS 210 Spring 2024 - Mar 18
### NumPy and Pandas

---

### <font color="brown">NumPy</font>

In [11]:
import numpy as np

#### <font color="brown">Boolean arrays</font>
**Boolean values are coerced to 1 (True) and 0 (False)**

##### Exercise: Find number of positive values in an array

In [12]:
arr = np.array([1,-5,2,3,-4,6])
arr > 0

array([ True, False,  True,  True, False,  True])

In [13]:
(arr > 0).sum()    # number of True values 

4

In [14]:
print((arr > 0).any())
print((arr > 0).all())

True
False
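Because `True` counts as 1, the same coercion also gives the fraction of positive values through `mean()`. A minimal sketch using the array above (`np.count_nonzero` is shown only as an alternative to `sum()`):

```python
import numpy as np

arr = np.array([1, -5, 2, 3, -4, 6])
mask = arr > 0                            # boolean array

n_pos = int(mask.sum())                   # True counts as 1, False as 0
n_pos_alt = int(np.count_nonzero(mask))   # same count, different call
frac_pos = float(mask.mean())             # fraction of True values

print(n_pos, n_pos_alt, frac_pos)
```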


---

### <font color="brown">Pandas</font>

https://pandas.pydata.org/docs/user_guide/index.html<br>

#### Pandas has two key data structures: Series and DataFrame

In [15]:
import pandas as pd
from pandas import Series

---

#### <font color="brown">Series is a 1D array-like object containing an array of data (of any NumPy datatype),<br> and an associated array of data labels called *index*</font>

In [16]:
aser = Series([1, 5, -2, 16])
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [17]:
aser.values, aser.index

(array([ 1,  5, -2, 16]), RangeIndex(start=0, stop=4, step=1))

**Both values and index have data types**

In [18]:
aser.values.dtype, aser.index.dtype

(dtype('int64'), dtype('int64'))

**Can explicitly specify index**

In [19]:
aser = Series([1, 5, -2, 16], index=range(0,4))
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [20]:
aser.values, aser.index

(array([ 1,  5, -2, 16]), RangeIndex(start=0, stop=4, step=1))

**Index can be string labels**

In [21]:
ser = Series([1, 5, -2, 16], index=['a','b','x','d'])
ser

a     1
b     5
x    -2
d    16
dtype: int64

In [22]:
print(ser.index.dtype) # Note the index type, it's not string

object


**Note: If you use strings for labels, the index dtype is `object`, not a string type**

---

#### <font color="brown">Accessing Series values</font>

**Can use index label subscripts to access and assign values, like a dictionary**

In [23]:
aser

0     1
1     5
2    -2
3    16
dtype: int64

In [24]:
aser[2]

-2

In [25]:
ser

a     1
b     5
x    -2
d    16
dtype: int64

In [26]:
ser['x']

-2

In [27]:
ser['a'] = 10
ser

a    10
b     5
x    -2
d    16
dtype: int64

**Can select several values at once by passing a list of labels (NumPy-like fancy indexing)**

In [28]:
ser[['x','a','b']]  

x    -2
a    10
b     5
dtype: int64
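Plain `[]` subscripts work with labels here, but pandas also offers explicit accessors: `.loc` for labels and `.iloc` for integer positions. A small sketch, assuming the same `ser` as above (after the assignment `ser['a'] = 10`):

```python
from pandas import Series

ser = Series([10, 5, -2, 16], index=['a', 'b', 'x', 'd'])

by_label = ser.loc['x']              # label-based lookup
by_position = ser.iloc[2]            # position-based lookup (third element)
subset = ser.loc[['x', 'a', 'b']]    # list of labels, in the order requested

print(by_label, by_position)
print(subset)
```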

---

#### <font color="brown">NumPy-like array operations work as before, and the index tags along</font>

In [29]:
import numpy as np

print(ser, '\n')

res = ser[ser > 0]
print(res, '\n')

res = ser * 2
print(res, '\n')

res = np.power(ser,2)
print(res, '\n')

ser = ser ** 2
print(ser, '\n')

a    10
b     5
x    -2
d    16
dtype: int64 

a    10
b     5
d    16
dtype: int64 

a    20
b    10
x    -4
d    32
dtype: int64 

a    100
b     25
x      4
d    256
dtype: int64 

a    100
b     25
x      4
d    256
dtype: int64 



---

#### <font color="brown">Series is like an ordered dictionary</font>

**Can test membership on the index (like key membership in a dictionary)**

In [30]:
'x' in ser

True

**Can create a Series out of a Python dictionary**

In [31]:
udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
useries = Series(udict)
print(useries)
print(useries.index)

Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
dtype: int64
Index(['Rutgers', 'Princeton', 'MIT', 'USC'], dtype='object')


**Make a new series out of a subset of useries, with an explicit index that replaces 'Princeton' with 'Purdue'**

In [32]:
univs = ['Purdue','Rutgers','MIT','USC']
useries2 = Series(udict, index=univs)
useries2

Purdue         NaN
Rutgers    55000.0
MIT        20000.0
USC        40000.0
dtype: float64

##### In the above:

- Explicit index overrides the keys of udict
- Indexes common to the argument index (univs) and the udict index are kept with their udict values
- But for any index in univs that is not in udict, the value is NaN.

##### Also note that the dtype has changed from int to float because of the NaN
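The dtype change can be checked directly. The sketch below rebuilds both Series and also compares against `reindex`, which appears to give the same result as passing `index=` to the constructor:

```python
from pandas import Series

udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
univs = ['Purdue', 'Rutgers', 'MIT', 'USC']

useries = Series(udict)                 # all keys present, integer values
useries2 = Series(udict, index=univs)   # 'Purdue' has no value -> NaN

print(useries.dtype, useries2.dtype)    # integer dtype vs float dtype
n_missing = int(useries2.isna().sum())  # number of NaN entries

# reindex on the full Series selects/aligns the same labels
same = useries.reindex(univs).equals(useries2)
print(n_missing, same)
```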

In [33]:
# What if dictionary has list values
adict = {"one": [1,2,3,4], "two": [4,5,6]}
aser = Series(adict)
aser

one    [1, 2, 3, 4]
two       [4, 5, 6]
dtype: object

**You still get a Series, but the values are now Python objects (lists)**
**So you can't apply the usual NumPy operations**

In [34]:
np.power(aser,2)

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
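With object values, one workaround is to apply an ordinary Python function to each element with `map`. This is only a sketch of that idea, not the only way to handle list-valued Series:

```python
from pandas import Series

adict = {"one": [1, 2, 3, 4], "two": [4, 5, 6]}
aser = Series(adict)

lengths = aser.map(len)                                 # length of each list
squares = aser.map(lambda lst: [v ** 2 for v in lst])   # square inside each list

print(lengths)
print(squares)
```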

---

##### <font color="brown">Checking for null/not null values</font>
**NaN is equivalent to null**

In [None]:
useries2

Purdue         NaN
Rutgers    55000.0
MIT        20000.0
USC        40000.0
dtype: float64

In [None]:
useries2.isnull()  

Purdue      True
Rutgers    False
MIT        False
USC        False
dtype: bool

In [None]:
useries2.notnull()

Purdue     False
Rutgers     True
MIT         True
USC         True
dtype: bool

---

##### <font color="brown">Naming the Series, and the index</font>

In [None]:
useries

Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
dtype: int64

In [None]:
useries.name = "student population"
useries.index.name = "university"
useries

university
Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
Name: student population, dtype: int64

---

##### <font color="brown">Adding two Series, auto alignment of differently indexed data</font>

**If an index appears in one and not the other, result is NaN**

In [None]:
useries + useries2

MIT           40000.0
Princeton         NaN
Purdue            NaN
Rutgers      110000.0
USC           80000.0
dtype: float64
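If NaN is not wanted for labels that appear on only one side, the `add` method takes a `fill_value` that stands in for the missing side. A sketch using the same two Series (note that 'Purdue' stays NaN, since it has no value on either side):

```python
import numpy as np
from pandas import Series

udict = {'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000}
useries = Series(udict)
useries2 = Series(udict, index=['Purdue', 'Rutgers', 'MIT', 'USC'])

plain = useries + useries2                     # 'Princeton' -> NaN
filled = useries.add(useries2, fill_value=0)   # missing side treated as 0

print(plain)
print(filled)
```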

---

##### <font color="brown">Changing the index</font>

In [None]:
useries

university
Rutgers      55000
Princeton    15000
MIT          20000
USC          40000
Name: student population, dtype: int64

In [None]:
print('Original index: ',useries.index)
useries.index = ['RU','Princeton U','MIT','USC']
print('\nUpdated index: ',useries.index)

Original index:  Index(['Rutgers', 'Princeton', 'MIT', 'USC'], dtype='object', name='university')

Updated index:  Index(['RU', 'Princeton U', 'MIT', 'USC'], dtype='object')


In [None]:
useries

RU             55000
Princeton U    15000
MIT            20000
USC            40000
Name: student population, dtype: int64

**Note that index no longer has a name since we didn't specify one when we modified the index**
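Two ways to keep the index name, sketched under the assumption that the labels are the ones used above: restore the name after assigning, or rename individual labels with `rename`, which leaves the index name in place.

```python
from pandas import Series

useries = Series({'Rutgers': 55000, 'Princeton': 15000, 'MIT': 20000, 'USC': 40000})
useries.index.name = "university"

# Option 1: assign a new list of labels, then set the name again
s1 = useries.copy()
s1.index = ['RU', 'Princeton U', 'MIT', 'USC']
s1.index.name = "university"

# Option 2: rename only the labels that change; returns a new Series
s2 = useries.rename({'Rutgers': 'RU', 'Princeton': 'Princeton U'})

print(s1.index)
print(s2.index)
```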

---

##### <font color="brown">Dropping NaNs</font>

In [None]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**Alternatively, you can use an alias for np.nan (a common convention), as follows**

In [None]:
from numpy import nan as NA

dat = Series([1, NA, 2.6, NA, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**dropna**

In [None]:
dat.dropna()

0    1.0
2    2.6
4    6.0
dtype: float64

In [None]:
dat  # is the original modified?

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

**<font color="red">dropna returns a new Series, the original is unchanged</font>**

---

##### <font color="brown">Filling NaNs</font>

In [None]:
dat = Series([1, np.nan, 2.6, np.nan, 6])
print(dat)

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64


**1. Filling (replacing) NaNs with a specific value**

In [None]:
dat.fillna(dat.mean())   # replace each NaN with mean

0    1.0
1    3.2
2    2.6
3    3.2
4    6.0
dtype: float64

In [None]:
dat   # is the original modified?

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

**<font color="red">fillna returns a new Series, the original is unchanged</font>**

**2. Filling (replacing) NaNs with existing value using 'forward fill'**

In [None]:
dat.ffill()   # forward fill: copy the last valid value into the NaNs that follow it (fillna(method='ffill') is deprecated since pandas 2.1)

0    1.0
1    1.0
2    2.6
3    2.6
4    6.0
dtype: float64

In [None]:
dat2 = Series([1, np.nan, 2.6, np.nan, np.nan, 6])
print(dat2)

0    1.0
1    NaN
2    2.6
3    NaN
4    NaN
5    6.0
dtype: float64


In [None]:
dat2.ffill()   # forward fill: copy the last valid value into the NaNs that follow it

0    1.0
1    1.0
2    2.6
3    2.6
4    2.6
5    6.0
dtype: float64

**3. Filling (replacing) NaNs with existing value using 'back fill'**

In [None]:
dat

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

In [None]:
dat.bfill()   # back fill: copy the next valid value into the NaNs that precede it (fillna(method='bfill') is deprecated since pandas 2.1)

0    1.0
1    2.6
2    2.6
3    6.0
4    6.0
dtype: float64

In [None]:
dat2

0    1.0
1    NaN
2    2.6
3    NaN
4    NaN
5    6.0
dtype: float64

In [None]:
dat2.bfill()   # back fill: copy the next valid value into the NaNs that precede it

0    1.0
1    2.6
2    2.6
3    6.0
4    6.0
5    6.0
dtype: float64
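A small sketch of forward and back fill using the `ffill()` and `bfill()` methods on `dat2`, with the optional `limit` argument capping how many consecutive NaNs are filled in each gap:

```python
import numpy as np
from pandas import Series

dat2 = Series([1, np.nan, 2.6, np.nan, np.nan, 6])

fwd = dat2.ffill()                  # last valid value carried forward
bwd = dat2.bfill()                  # next valid value carried backward
fwd_limited = dat2.ffill(limit=1)   # fill at most one NaN per gap

print(fwd.tolist())
print(bwd.tolist())
print(fwd_limited.tolist())
```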

---

##### <font color="brown">Filtering with notnull()</font>

In [None]:
dat[dat.notnull()]

0    1.0
2    2.6
4    6.0
dtype: float64

In [None]:
dat1 = dat.copy()
dat1

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

**Updating in place (modifying original) with inplace parameter**

In [None]:
dat1.dropna(inplace=True)  
dat1

0    1.0
2    2.6
4    6.0
dtype: float64

**Or modify original by reassigning to it**

In [None]:
dat1 = dat.copy()
dat1

0    1.0
1    NaN
2    2.6
3    NaN
4    6.0
dtype: float64

In [None]:
dat1 = dat1[dat1.notnull()]
dat1

0    1.0
2    2.6
4    6.0
dtype: float64

---

##### <font color="brown">Counting occurrences of values</font>

In [None]:
valser = Series(np.random.randint(1,10,20))
valser

0     5
1     9
2     9
3     6
4     6
5     8
6     9
7     7
8     6
9     8
10    7
11    5
12    3
13    4
14    7
15    8
16    4
17    6
18    5
19    3
dtype: int64

In [None]:
valser.value_counts()

6    4
9    3
8    3
7    3
5    3
4    2
3    2
dtype: int64

**Note that value_counts returns the counts sorted in decreasing order**

In [None]:
valser.value_counts().index

Int64Index([6, 9, 8, 7, 5, 4, 3], dtype='int64')

In [None]:
valser.value_counts()[7]

3
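`value_counts` can also be sorted by value rather than by count, or asked for relative frequencies. A sketch on a short fixed Series (made-up values, so the result is reproducible, unlike the random one above):

```python
from pandas import Series

valser = Series([5, 9, 9, 6, 6, 8, 9, 7, 6, 8])

counts = valser.value_counts()                 # sorted by count, descending
by_value = valser.value_counts().sort_index()  # sorted by the values instead
shares = valser.value_counts(normalize=True)   # fractions that sum to 1

print(by_value)
print(shares)
```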

---

##### <font color="brown">Mapping values with map function</font>

In [None]:
def mapper(val):
    return val**2 + 5

In [None]:
aser = Series(np.arange(1,5))
aser

0    1
1    2
2    3
3    4
dtype: int64

In [None]:
aser.map(mapper)   # each value is transformed via the mapper function

0     6
1     9
2    14
3    21
dtype: int64

In [None]:
aser

0    1
1    2
2    3
3    4
dtype: int64

**<font color="red">map returns a new Series, the original is unchanged</font>**

In [None]:
aser.map(lambda v: v**2 + 5)   # can do the same with a lambda function

0     6
1     9
2    14
3    21
dtype: int64
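`map` also accepts a dictionary, in which case each value is looked up as a key; values with no entry become NaN. A sketch with a hypothetical lookup table `names`:

```python
from pandas import Series

aser = Series([1, 2, 3, 4])

names = {1: 'one', 2: 'two', 3: 'three'}   # hypothetical lookup table, no entry for 4
mapped = aser.map(names)                   # 4 is not a key -> NaN

print(mapped)
```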

---

##### <font color="brown">Resetting the index</font>

In [None]:
useries

RU             55000
Princeton U    15000
MIT            20000
USC            40000
Name: student population, dtype: int64

**Reset index to numbers**

In [None]:
useries = useries.reset_index() 
useries

| | index | student population |
|---|---|---|
| 0 | RU | 55000 |
| 1 | Princeton U | 15000 |
| 2 | MIT | 20000 |
| 3 | USC | 40000 |


**<font color="red">When you reset the index, the original index becomes the 1st column, and the values become the 2nd column</font>**

In [None]:
type(useries)   

pandas.core.frame.DataFrame

**<font color="red">The Series turns into a DataFrame</font>**

**Can change column names**

In [None]:
useries.columns = ['Univ','Student Population']
useries

| | Univ | Student Population |
|---|---|---|
| 0 | RU | 55000 |
| 1 | Princeton U | 15000 |
| 2 | MIT | 20000 |
| 3 | USC | 40000 |
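Two variations on `reset_index`, sketched on a rebuilt copy of the Series: `drop=True` discards the old labels and keeps the result a Series, and naming the index first (here with `rename_axis`) gives the new column that name instead of `index`.

```python
from pandas import Series

useries = Series([55000, 15000, 20000, 40000],
                 index=['RU', 'Princeton U', 'MIT', 'USC'],
                 name="student population")

as_frame = useries.reset_index()             # DataFrame; old index becomes a column
as_series = useries.reset_index(drop=True)   # still a Series; old labels discarded
named = useries.rename_axis("Univ").reset_index()  # column takes the index name

print(type(as_series).__name__)
print(list(as_frame.columns), list(named.columns))
```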


---

### <font color="brown">Pandas - DataFrame</font>

#### DataFrame is a tabular spreadsheet-like data structure consisting of an ordered collection of columns, each of which can be a different value type

In [None]:
import numpy as np
from pandas import DataFrame

#### <font color="brown">DataFrame Creation</font>

**1. Creating a DataFrame from a dictionary**

In [None]:
popdat = {'state': ['Arizona','Arizona','Arizona','Virginia','Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat)
popdf

| | state | year | pop |
|---|---|---|---|
| 0 | Arizona | 2005 | 5.9 |
| 1 | Arizona | 2010 | 6.6 |
| 2 | Arizona | 2015 | 6.8 |
| 3 | Virginia | 2010 | 7.9 |
| 4 | Virginia | 2015 | 8.3 |


**In the above, all lists must be of the same length; otherwise pandas raises a ValueError**
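A quick way to see the length check in action is to build a deliberately mismatched dictionary (the data here is made up for illustration) and catch the exception:

```python
from pandas import DataFrame

# 'state' has 2 entries but 'year' has 3
bad = {'state': ['Arizona', 'Virginia'], 'year': [2005, 2010, 2015]}

try:
    DataFrame(bad)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```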

**Like an ndarray, a dataframe has a shape property**

In [None]:
popdf.shape

(5, 3)

**Can order the columns of a DataFrame as needed during creation with the columns parameter**

In [None]:
popdf = DataFrame(popdat, columns=['year','state', 'pop'])
popdf

| | year | state | pop |
|---|---|---|---|
| 0 | 2005 | Arizona | 5.9 |
| 1 | 2010 | Arizona | 6.6 |
| 2 | 2015 | Arizona | 6.8 |
| 3 | 2010 | Virginia | 7.9 |
| 4 | 2015 | Virginia | 8.3 |


**If you give a column name that's not a key in the dictionary, you get NaN's (like Series index)**

In [None]:
popdf1 = DataFrame(popdat, columns=['year','state', 'population'])
popdf1

| | year | state | population |
|---|---|---|---|
| 0 | 2005 | Arizona | NaN |
| 1 | 2010 | Arizona | NaN |
| 2 | 2015 | Arizona | NaN |
| 3 | 2010 | Virginia | NaN |
| 4 | 2015 | Virginia | NaN |


**Index and columns names**

In [None]:
print('Index:',popdf.index)
print('Columns:',popdf.columns)

Index: RangeIndex(start=0, stop=5, step=1)
Columns: Index(['year', 'state', 'pop'], dtype='object')


**values property gives an ndarray**

In [None]:
popdf.values

array([[2005, 'Arizona', 5.9],
       [2010, 'Arizona', 6.6],
       [2015, 'Arizona', 6.8],
       [2010, 'Virginia', 7.9],
       [2015, 'Virginia', 8.3]], dtype=object)

In [None]:
type(popdf.values)

numpy.ndarray
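Because the columns hold different types, the combined array falls back to `dtype=object`. Selecting only the numeric columns first gives a numeric array. A sketch on a small made-up frame with the same column names:

```python
from pandas import DataFrame

popdf = DataFrame({'year': [2005, 2010],
                   'state': ['Arizona', 'Virginia'],
                   'pop': [5.9, 7.9]})

print(popdf.dtypes)                        # one dtype per column

mixed = popdf.values                       # mixed column types -> object array
numeric = popdf[['year', 'pop']].values    # int + float columns -> float array

print(mixed.dtype, numeric.dtype)
```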

In [None]:
popdf.name  

AttributeError: 'DataFrame' object has no attribute 'name'

**<font color="red">Unlike Series, DataFrame does not have a name property</font>**

---

**2. Creating a DataFrame from a nested dictionary**

In [None]:
popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
popdf2

| | Arizona | Virginia |
|---|---|---|
| 2005 | 5.9 | NaN |
| 2010 | 6.6 | 7.9 |
| 2015 | 6.8 | 8.3 |


**The outer keys become the column names, and the inner keys become the index labels<br>
Index labels are lined up, and NaNs fill in the missing values**

**You can flip the columns and index by transposing**

In [None]:
popdf2.T

| | 2005 | 2010 | 2015 |
|---|---|---|---|
| Arizona | 5.9 | 6.6 | 6.8 |
| Virginia | NaN | 7.9 | 8.3 |


In [None]:
popdf2   

| | Arizona | Virginia |
|---|---|---|
| 2005 | 5.9 | NaN |
| 2010 | 6.6 | 7.9 |
| 2015 | 6.8 | 8.3 |


**Transpose does not change the original dataframe**

In [None]:
popdf3 = DataFrame(popdat2,columns=['AZ','VA'])
popdf3

| | AZ | VA |
|---|---|---|


**<font color="red">In the above, the column names don't match any of the keys of popdat2, so the resulting dataframe has columns named AZ and VA, but no values</font>**

In [None]:
popdf3 = DataFrame(popdat2,columns=['VA','Arizona'])
popdf3

| | VA | Arizona |
|---|---|---|
| 2005 | NaN | 5.9 |
| 2010 | NaN | 6.6 |
| 2015 | NaN | 6.8 |


**In the above, Arizona is retained from popdat2, but column 'VA' is not a key in popdat2, so that column is filled with NaNs**
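The `columns` parameter only selects and orders existing keys; it does not rename them. If the goal is to relabel the columns as AZ and VA, one option is `rename(columns=...)` after construction, sketched here:

```python
from pandas import DataFrame

popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}

popdf2 = DataFrame(popdat2)

# Map old column names to new ones; returns a new DataFrame
renamed = popdf2.rename(columns={'Arizona': 'AZ', 'Virginia': 'VA'})

print(renamed)
```

As with `dropna` and `map`, `rename` returns a new object and leaves `popdf2` unchanged.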