In [2]:
import pandas as pd
import numpy as np

# ICS 434: DATA SCIENCE FUNDAMENTALS

## Arithmetic Operations and Data Alignment

---

### Vectorization

* A vectorized operation applies one operation to every element of an array at once
  * Implemented in the `numpy` library
  * Extends naturally to Pandas, which uses NumPy to store data
* By taking advantage of the parallel (SIMD) capabilities of modern CPUs, it is much more efficient than Python *for loops*
  * The whole collection is processed without an explicit Python loop


### Vectorization -- Cont'd

* A `Series` holds data of a single type (it is homogeneous)
* `Series` operations are delegated to optimized, compiled C code
  * This process is known as vectorization
  * The result is a tremendous speedup compared to the analogous computation in pure Python, which:
    * Iterates through the elements one at a time
    * Checks the data type of each item before operating on it
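
The homogeneity requirement can be checked directly. The sketch below is illustrative only (the variable names `ints`, `mixed`, and `squares` are not part of the course material): a `Series` built from mixed Python objects falls back to the generic `object` dtype, while an all-integer `Series` keeps a compact integer dtype that the compiled code can operate on.

```python
import pandas as pd

# All elements share one type, so pandas stores them in a compact integer array
ints = pd.Series([1, 2, 3])

# Mixed Python objects force the generic `object` dtype
mixed = pd.Series([1, "two", 3.0])

print(ints.dtype)   # an integer dtype (typically int64)
print(mixed.dtype)  # object

# The vectorized operation is applied to every element, with no explicit loop
squares = ints ** 2
print(squares.tolist())  # [1, 4, 9]
```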


In [3]:
x = pd.Series(range(1_000_000))
type(x)

pandas.core.series.Series

In [4]:
x.head()

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [7]:
%%time
a = range(1_000_000)
b = []
for elem in a:
    b.append(elem**2)

CPU times: user 275 ms, sys: 11.4 ms, total: 287 ms
Wall time: 291 ms


In [11]:
%%time
y = x**2

CPU times: user 5 ms, sys: 4.89 ms, total: 9.89 ms
Wall time: 6.07 ms


### The Numerical Python (NumPy) Library

* `numpy` is a Python library for array-based computing implemented in the `C` language
  * The *de facto* standard for scientific computing
    * Foundation for the Python scientific stack
  
* Pandas uses `numpy` to store the data behind the scenes    

```python
>>> x = pd.Series([1, 2, 3])
>>> print(type(x.values))

<class 'numpy.ndarray'>
```

In [12]:
x = pd.Series([1, 2, 3])
print(type(x.values))

<class 'numpy.ndarray'>


### The Numerical Python (NumPy) Library -- Cont'd

* Python lists and `numpy` arrays are similar in structure. NumPy arrays differ in that they are densely packed with values of a uniform type, as opposed to references to objects. This is especially useful when working with large amounts of data:
  * More of the data fits in the CPU cache
  * Data can be traversed quickly
* `numpy` objects have various methods to work with the data they contain
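
A minimal sketch of the two points above, assuming an explicit `np.int64` dtype (chosen here only so the byte counts are predictable): every element occupies the same number of bytes, and the array exposes methods such as `sum`, `mean`, and `max`.

```python
import numpy as np

# 1,000 integers stored as one contiguous block of 8-byte values
arr = np.array(range(1000), dtype=np.int64)

print(arr.dtype)     # int64
print(arr.itemsize)  # 8 bytes per element
print(arr.nbytes)    # 8000 bytes for the whole array

# Methods operate on the packed data directly
print(arr.sum())     # 499500
print(arr.mean())    # 499.5
print(arr.max())     # 999
```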

### Broadcasting

* Broadcasting is a set of rules for applying binary functions (e.g., addition, subtraction, multiplication) to arrays of different sizes
* The rules determine how the two arrays interact
  * When working with arrays in one dimension, the rules are fairly simple:
    1. If the sizes of the two arrays do not match, the array of size 1 is stretched to match the other array
    2. If the sizes disagree and neither is equal to 1, an error is raised
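
The two one-dimensional rules can be sketched as a small helper. `broadcast_length` is a hypothetical function written for illustration, not part of NumPy; `np.broadcast_to` is the NumPy call that makes the "stretching" of rule 1 visible.

```python
import numpy as np

def broadcast_length(n, m):
    """Hypothetical helper: result length for two 1-D arrays of lengths n and m."""
    if n == m:
        return n
    if n == 1:   # rule 1: the size-1 array is stretched
        return m
    if m == 1:
        return n
    # rule 2: sizes disagree and neither is 1
    raise ValueError(f"lengths {n} and {m} cannot be broadcast together")

x = np.array([1, 2, 3])
y = np.array([1])

# Explicit view of what NumPy does implicitly with the size-1 array
stretched = np.broadcast_to(y, x.shape)
print(stretched)   # [1 1 1]

result = x + y
print(result)      # [2 3 4]
```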

In [13]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
x+y

array([4, 4, 4])

In [14]:
x = np.array([1, 2, 3])
y = np.array([1])
x+y

array([2, 3, 4])

In [15]:
# Second condition throws an error
x = np.array([1, 2, 3])
y = np.array([1, 2])
x+y

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

### Broadcasting -- Cont'd

<img src="https://www.dropbox.com/scl/fi/8fig4p3t4oz0el5vdbqkp/broadcasting.png?rlkey=qr4v5205nyuas81ke7rscy1ao&dl=1" alt="drawing" style="width:500px">

### Vectorization: Arithmetic Operations

* Arithmetic operations in `pandas` are vectorized
  * You can add or multiply two columns and the operation is vectorized, i.e., applied to all rows at once
  
* We explore using a trivial dataset

```python
df_1 = pd.DataFrame({'AA':{'A':79, 'C':2, 'T':13, 'X':21},
                     'BB':{'A':11, 'C':2, 'T':2, 'X':9}})
df_1["AA"] + df_1["BB"]
```


In [16]:
df_1 = pd.DataFrame({'AA':{'A':79, 'C':2, 'T':13, 'X':21},
                     'BB':{'A':11, 'C':2, 'T':2, 'X':9}})

df_1

   AA  BB
A  79  11
C   2   2
T  13   2
X  21   9


In [17]:
df_1["AA"] + df_1["BB"]

A    90
C     4
T    15
X    30
dtype: int64

### Arithmetic Operations -- Cont'd

* Arithmetic operations are not limited to data with the same index
  * i.e., the two `Series` do not need to have the same labels or the same size

```python
df_2 = pd.DataFrame({'AA':{'A':21, 'D':14, 'T':5},
                     'CC':{'A':12, 'D':28, 'T':121}})
df_2
```

In [18]:
df_2 = pd.DataFrame({'AA':{'A':21, 'D':14, 'T':5},
                     'CC':{'A':12, 'D':28, 'T':121}})
df_2

   AA   CC
A  21   12
D  14   28
T   5  121


In [12]:
df_1["AA"]  + df_2["AA"] 

A    100.0
C      NaN
D      NaN
T     18.0
X      NaN
Name: AA, dtype: float64

### Vectorized Arithmetic Between `Series` (`DataFrame` Column)

* Column-wise vectorized operations require aligning the data on the index
  * A new index is created from the union of the indices of both `Series`
  * Labels present in only one of the `Series` yield missing values (`NaN`) in the result

<img src="https://www.dropbox.com/scl/fi/r0aeoaqfwblq3in3ymlkw/alignment_arithmetic_col.png?rlkey=iosbgr4rknf7mj8vsndwgcypt&dl=1" alt="drawing" style="width:500px">
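
The alignment step can be made explicit with `Series.align`, which reindexes both operands to the union of their labels before any arithmetic takes place. This is a sketch for illustration; `s1` and `s2` simply reuse the values of `df_1["AA"]` and `df_2["AA"]`.

```python
import pandas as pd

s1 = pd.Series({'A': 79, 'C': 2, 'T': 13, 'X': 21})
s2 = pd.Series({'A': 21, 'D': 14, 'T': 5})

# Both Series are reindexed to the union of their labels;
# labels missing from one side are filled with NaN
left, right = s1.align(s2)
print(left.index.tolist())   # ['A', 'C', 'D', 'T', 'X']

# Adding the aligned Series gives the same result as s1 + s2
total = left + right
print(total)
```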


### Vectorized Arithmetic Between `Series` (`DataFrame` Row)

* Operations on row `Series` work the same way as columns 

```python
df_1.loc["A"] + df_2.loc["D"]
```
<img src="https://www.dropbox.com/scl/fi/bdmi1636ihr5t54kffj9i/E5_5_alignment_arithmetic_row.png?rlkey=w1xfklk9jjag0fo4uf1ba3lqm&dl=1" alt="drawing" style="width:700px">


In [13]:
df_1.loc["A"] + df_2.loc["D"]

AA    93.0
BB     NaN
CC     NaN
dtype: float64

### Vectorized Arithmetic on DataFrames

* We can also perform arithmetic operations on the whole `DataFrame`
  * This is simply an extension of the alignments on columns or rows
 

```python
df_1 + df_2
```

<img src="https://www.dropbox.com/scl/fi/d5nq66w06u4c7v1y20plt/alignment_arithmetic_df.png?rlkey=avbdrlqpq77iof3cqtpoowogc&dl=1" alt="drawing" style="width:750px">


In [19]:
df_1 = pd.DataFrame({'AA':{'A':79, 'C':2, 'T':12, 'X':21},
                     'BB':{'A':11, 'C':2, 'T':2, 'X':9}})
df_2 = pd.DataFrame({'AA':{'A':21, 'D':14, 'T':5},
                     'BB':{'A':12, 'D':28, 'T':121}})
df_1 + df_2

      AA     BB
A  100.0   23.0
C    NaN    NaN
D    NaN    NaN
T   17.0  123.0
X    NaN    NaN


### Question? 

* What would the outcome be if a column is present in one `DataFrame` and not in the other?

* Ex.: `df_1 + df_2`?
<img src="https://www.dropbox.com/scl/fi/ppquie519614d9n6aiwnq/question.png?rlkey=3ox0vktrl3clfxji5k2h8v9h3&dl=1" alt="drawing" style="width:500px">


In [20]:
df_2 = pd.DataFrame({'AA':{'A':21, 'D':14, 'T':5},
                     'CC':{'A':12, 'D':28, 'T':121}})

In [21]:
df_1 + df_2

      AA   BB   CC
A  100.0  NaN  NaN
C    NaN  NaN  NaN
D    NaN  NaN  NaN
T   17.0  NaN  NaN
X    NaN  NaN  NaN
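
When the `NaN`s produced by alignment are not wanted, one option (not covered in the slides, shown here as a hedged sketch) is the method form `DataFrame.add` with `fill_value`, which substitutes a value for an entry that is missing on one side only. Entries missing on both sides remain `NaN`.

```python
import pandas as pd

df_1 = pd.DataFrame({'AA': {'A': 79, 'C': 2, 'T': 12, 'X': 21},
                     'BB': {'A': 11, 'C': 2, 'T': 2, 'X': 9}})
df_2 = pd.DataFrame({'AA': {'A': 21, 'D': 14, 'T': 5},
                     'CC': {'A': 12, 'D': 28, 'T': 121}})

# Operator form: columns found in only one DataFrame come out entirely NaN
plain = df_1 + df_2
print(plain.columns.tolist())   # ['AA', 'BB', 'CC']

# Method form: a value missing on one side is treated as 0 before adding
filled = df_1.add(df_2, fill_value=0)
print(filled)
```

Here `filled.loc['D', 'BB']` is still `NaN`, because row `D` is absent from `df_1` and column `BB` is absent from `df_2`.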


### Vectorization Example with the Medical Spending Data Set

* An arithmetic operation between two `Series` yields a `Series`
  * We can use the resulting `Series` in new computations
  

* Example: we can compute the average spending per beneficiary of the `spending_df` `DataFrame` using the following steps:
  1. For each `unique_id` (row), divide `spending` by `nb_beneficiaries`
     * This gives the spending per beneficiary for each `unique_id`
  2. Compute the mean of the resulting values


In [22]:
spending_df = pd.read_csv('data/spending_10k.csv', index_col='unique_id')
spending_df.head()

            doctor_id          specialty          medication  nb_beneficiaries  spending
unique_id
NX531425   1255626040    FAMILY PRACTICE       METFORMIN HCL                30    135.24
QG879256   1699761833    FAMILY PRACTICE         ALLOPURINOL                30    715.76
FW363228   1538148804  INTERNAL MEDICINE  LOSARTAN POTASSIUM               146   1056.47
WD733417   1730200619         PSYCHIATRY          OLANZAPINE                13  28226.97
XW149832   1023116894    FAMILY PRACTICE  PRAVASTATIN SODIUM               348   8199.48


In [23]:
spending_benef =  spending_df['spending'] / spending_df['nb_beneficiaries']

type(spending_benef)

pandas.core.series.Series

In [24]:
spending_benef

unique_id
NX531425       4.508000
QG879256      23.858667
FW363228       7.236096
WD733417    2171.305385
XW149832      23.561724
               ...     
ZK471712       7.081061
QM412803       7.266515
NN242894       3.525000
ER452896      51.684865
WH309636       2.113333
Length: 10000, dtype: float64

In [19]:
spending_benef.mean()

131.92616419345254

In [20]:
# operation chaining
(spending_df['spending'] / spending_df['nb_beneficiaries']).mean()

131.92616419345254

### Broadcasting

* What happens with an arithmetic operation that involves a `Series` and a single value (`scalar`)?
  * The logic is the same for `DataFrame` and a single value

* The `scalar` is expanded (broadcast) to fit the dimension of the `Series` (or `DataFrame`)
  
```python
df_1['AA'] + 1.2
```
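
As a sketch of what the expansion amounts to, the scalar can be written out by hand as a `Series` with the same index. The names `shifted`, `explicit`, and `whole` are illustrative only.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'AA': {'A': 79, 'C': 2, 'T': 13, 'X': 21},
                     'BB': {'A': 11, 'C': 2, 'T': 2, 'X': 9}})

# The scalar is applied to every element of the Series
shifted = df_1['AA'] + 1.2

# Equivalent by-hand version: repeat the scalar once per index label
explicit = df_1['AA'] + pd.Series(1.2, index=df_1.index)
print(np.allclose(shifted, explicit))   # True

# The same logic applies to a whole DataFrame
whole = df_1 + 1.2
print(whole.shape)   # (4, 2)
```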

### Broadcasting -- Cont'd

<img src="https://www.dropbox.com/scl/fi/g7k95046sa33mbiqsao7y/alignment.png?rlkey=qwzuulgxwnwknx4u1yr1nrivf&dl=1" alt="drawing" style="width:600px">


### Subsetting


* We often need to subset a dataset. We already know how to extract a single row

```python 
spending_df.iloc[1]
```
* We know how to subset a dataset to extract a range of rows

```python 
spending_df.iloc[1:32]
```

* But it's also very useful to subset the data based on some condition:
  * Extract all the entries in cardiology
  * How many flights out of HNL were going to Nebraska?
  * How many Biki bicycles were checked out in a specific geographical region (e.g., Manoa)?

### Comparison Operations

* Comparison operators ("`<`" , "`>`" , "`==`" , "`>=`" , "`<=`" , "`!=`") are applied element-wise, in the same fashion as arithmetic operations

  * They result in Booleans (`True` or `False`)

```python
>>> df_3 = pd.DataFrame({'AA':{'A':64, 'C':2, 'T':6, 'X':22}, 
                         'BB':{'A':3,  'C':2, 'T':7, 'X':12}})
```

```python
>>> df_1.loc[:, "AA"] > df_3.loc[:, "AA"]
A     True
C    False
T     True
X    False
Name: AA, dtype: bool
```



### Comparison Result

* The result of the comparison is a `Series` with the same index as `df_1` and `df_3`
  * IMPORTANT: comparisons between two `Series` or `DataFrame`s can only be carried out if both objects are identically labeled
  * i.e., they have the same row **and** column names

* All entries are of type `bool`
  * An entry is `True` if that position passed the comparison test and `False` otherwise

```python
>>> df_1.loc[:, "AA"] > df_2.loc[:, "AA"]
...
ValueError: Can only compare identically-labeled Series objects
```
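
If two differently labeled `Series` still need to be compared, one possible workaround (a sketch, not the approach used in the slides) is to restrict both to their shared labels first, using `Index.intersection`. The names `s1`, `s2`, and `common` are illustrative.

```python
import pandas as pd

s1 = pd.Series({'A': 79, 'C': 2, 'T': 13, 'X': 21}, name='AA')
s2 = pd.Series({'A': 21, 'D': 14, 'T': 5}, name='AA')

# The comparison operator refuses to align differently labeled Series
message = None
try:
    s1 > s2
except ValueError as err:
    message = str(err)
print(message)

# Keep only the labels found in both Series, then compare
common = s1.index.intersection(s2.index)
result = s1.loc[common] > s2.loc[common]
print(common.tolist())   # ['A', 'T']
print(result.tolist())   # [True, True]
```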

In [21]:
df_1 = pd.DataFrame({'AA':{'A':79, 'C':2, 'T':13, 'X':21},
                     'BB':{'A':11, 'C':2, 'T':2, 'X':9}})
df_3 = pd.DataFrame({'AA':{'A':64, 'C':2, 'T':6, 'X':22},
                     'BB':{'A':3,  'C':2, 'T':7, 'X':12}})

df_1["AA"] > df_3["AA"]

A     True
C    False
T     True
X    False
Name: AA, dtype: bool

In [22]:
## Not the same number of rows nor the same index

df_4 = pd.DataFrame({'AA':{'I':64, 'J':2, 'K':6}})

df_1["AA"] > df_4["AA"]

ValueError: Can only compare identically-labeled Series objects

### Comparison Operations  and Indexing

* Comparison operators are ideal for querying and subsetting

  * SQL-like functionality although at much lower efficiency (both RAM and CPU intensive compared to dedicated DBMS)
  

* We can subset a `Series` using a *list* (`[]`) or a `Series` of Booleans that has the same length



In [26]:
x = pd.Series([95, 23, 67])

my_filter = pd.Series([True, False, True])

x[my_filter]

0    95
2    67
dtype: int64

In [27]:
boolean_list = [True, True, False, False]
df_1.loc[:, 'AA'][boolean_list]

A    79
C     2
Name: AA, dtype: int64

In [28]:
df_1['AA'] > 3

A     True
C    False
T     True
X     True
Name: AA, dtype: bool

### Comparison Operations  and Indexing -- Cont'd

* Subsetting can also leverage broadcasting
<img src="https://www.dropbox.com/scl/fi/ly9rt948jrdznobeycatb/E4_2_filter_dataframe.png?rlkey=rbx13q2m3rwi8nr5ddzrhyhx3&dl=1" alt="drawing" style="width:600px">
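
A minimal sketch of the idea: the scalar `3` is broadcast against column `AA` to build a Boolean mask, and the mask then selects the matching rows. The names `mask` and `subset` are illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({'AA': {'A': 79, 'C': 2, 'T': 13, 'X': 21},
                     'BB': {'A': 11, 'C': 2, 'T': 2, 'X': 9}})

# Broadcasting: every value in AA is compared against the scalar 3
mask = df_1['AA'] > 3
print(mask.tolist())          # [True, False, True, True]

# Rows where the mask is True are kept
subset = df_1[mask]
print(subset.index.tolist())  # ['A', 'T', 'X']
```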


### Boolean Operations

* Conditional expressions allow us to create more sophisticated queries for our dataset
  * Queries commonly rely on boolean logic (boolean operators)

* Pandas' Boolean operations are slightly different from Python's

  * `and` is replaced with `&`
  * `or` is replaced with `|`
  * `not` is replaced with `~`

* The operators above return a Boolean object of the same shape as the input
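
The sketch below exercises all three operators on the values of `df_1` and shows what happens when Python's `and` is used instead. Parentheses around each comparison are needed because `&` and `|` bind more tightly than `>`. The variable names are illustrative.

```python
import pandas as pd

aa = pd.Series({'A': 79, 'C': 2, 'T': 13, 'X': 21})
bb = pd.Series({'A': 11, 'C': 2, 'T': 2, 'X': 9})

both = (aa > 3) & (bb > 3)     # element-wise "and"
either = (aa > 3) | (bb > 3)   # element-wise "or"
neither = ~either              # element-wise "not"

print(both.tolist())     # [True, False, False, True]
print(either.tolist())   # [True, False, True, True]
print(neither.tolist())  # [False, True, False, False]

# Python's `and` asks for a single truth value, which a Series cannot provide
and_error = None
try:
    (aa > 3) and (bb > 3)
except ValueError as err:
    and_error = str(err)
print(and_error)
```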

### Boolean Operations Example

* Find the rows in `df_1` where the values in columns "AA" and "BB" are both greater than 3.
```python
>>> true_false_series = (df_1['AA'] > 3) & (df_1['BB'] > 3) 
>>> true_false_series
A     True
C    False
T    False
X     True
dtype: bool
```

In [26]:
true_false_series = (df_1['AA'] > 3) & (df_1['BB'] > 3) 
true_false_series

A     True
C    False
T    False
X     True
dtype: bool

In [27]:
df_1[true_false_series]

   AA  BB
A  79  11
X  21   9


In [28]:
df_1[(df_1['AA'] > 3) & (df_1['BB'] > 3)]

   AA  BB
A  79  11
X  21   9


### Boolean Operations Examples with the Medical Spending Data Set


* Let's subset the `spending_df` to retain rows that:

  * Have 'DENTIST' as `specialty` 
  * Have spending below \$100 
  
* Save results to a new `DataFrame`

### Examples -- Cont'd

* Have 'DENTIST' as `specialty`

```python
spending_df['specialty'] == 'DENTIST'
``` 

In [29]:
(spending_df['specialty'] == 'DENTIST').head()

unique_id
NX531425    False
QG879256    False
FW363228    False
WD733417    False
XW149832    False
Name: specialty, dtype: bool

### Examples -- Cont'd

* Have spending below \$100 
    
```python 
spending_df['spending'] < 100
``` 

In [30]:
(spending_df['spending'] < 100).head()

unique_id
NX531425    False
QG879256    False
FW363228    False
WD733417    False
XW149832    False
Name: spending, dtype: bool

### Examples -- Cont'd


* Combine both conditions

```python
(spending_df.loc[:, 'specialty'] == 'DENTIST') & (spending_df.loc[:, 'spending'] < 100)
```

### Examples -- Cont'd


* Combine both conditions

```python
(spending_df['specialty'] == 'DENTIST') & (spending_df['spending'] < 100)
```

In [29]:
(spending_df['specialty'] == 'DENTIST') & (spending_df['spending'] < 100)

unique_id
NX531425    False
QG879256    False
FW363228    False
WD733417    False
XW149832    False
            ...  
ZK471712    False
QM412803    False
NN242894    False
ER452896    False
WH309636    False
Length: 10000, dtype: bool

### Examples -- Cont'd

* Subset the original dataset

```python
dentist_and_small_spending = (spending_df.loc[:, 'specialty'] == 'DENTIST') & (spending_df.loc[:, 'spending'] < 100)
spending_df.loc[dentist_and_small_spending]
```

### Examples -- Cont'd

* Subset the original dataset

```python
dentist_and_small_spending = (spending_df['specialty'] == 'DENTIST') & (spending_df['spending'] < 100)
spending_df.loc[dentist_and_small_spending]
```

In [32]:
dentist_and_small_spending = (spending_df['specialty'] == 'DENTIST') & (spending_df['spending'] < 100)
spending_df.loc[dentist_and_small_spending].head()

            doctor_id  specialty                 medication  nb_beneficiaries  spending
unique_id
XY759578   1114930567    DENTIST                  IBUPROFEN                23     67.19
NR408938   1407936230    DENTIST                  IBUPROFEN                20     64.20
KK703203   1669464095    DENTIST            CLINDAMYCIN HCL                14     94.95
RI403710   1063709178    DENTIST  HYDROCODONE/ACETAMINOPHEN                14     74.48
YM646458   1528120383    DENTIST                AMOXICILLIN                23     63.84
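
Because `True` counts as 1, the same mask can also answer "how many rows match?". The sketch below is self-contained and does not read the CSV file: `sample_df` is a four-row stand-in assembled from rows displayed in the outputs above, so the count it prints says nothing about the full 10,000-row dataset.

```python
import pandas as pd

# Stand-in for spending_df, built from four rows shown earlier
sample_df = pd.DataFrame(
    {
        'specialty': ['FAMILY PRACTICE', 'PSYCHIATRY', 'DENTIST', 'DENTIST'],
        'spending': [135.24, 28226.97, 67.19, 94.95],
    },
    index=pd.Index(['NX531425', 'WD733417', 'XY759578', 'KK703203'],
                   name='unique_id'),
)

mask = (sample_df['specialty'] == 'DENTIST') & (sample_df['spending'] < 100)

# Summing a Boolean Series counts the True entries
n_matches = int(mask.sum())
print(n_matches)   # 2

# The same mask subsets the rows
matches = sample_df.loc[mask]
print(matches.index.tolist())   # ['XY759578', 'KK703203']
```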
