<div align="center">
    <h1>DS-210: Programming for Data Science</h1>
    <h1>Lecture 10</h1>
</div>

1. Complexity Analysis (e.g. memory management in vectors)
2. Hash maps [(§8.3)](https://doc.rust-lang.org/book/ch08-03-hash-maps.html)
3. Hash maps with custom types

# 1. Complexity Analysis (e.g. memory management in vectors)

Let's dive deeper into algorithmic complexity analysis by considering how memory is managed in Rust Vecs.


## Last time: vectors `Vec<T>`

* Dynamic-length array/list 
* Allowed operations:
  * access item at specific location
  * `push`: add something to the end
  * `pop`: remove an element from the end



* Python: list
* C++: `vector<T>`
* Java: `ArrayList<T>` / `Vector<T>`


<div align="center">
    <h3>How to implement this efficiently?</h3>
</div>

## Select implementation details

### Challenges

* Size changes: allocate on the heap?
* What to do if a new element is added?
  * Allocate a larger array and copy everything? 
  * Linked list?

### Solution

* Allocate more space than needed!
* When out of space:
  * Increase storage size by, say, 100%
  * Copy everything

### Under the hood
Variable of type `Vec<T>` contains:
* pointer to allocated memory
* size: the current number of items
* capacity: how many items could currently fit

**Important:** size $\le$ capacity
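
As a rough illustration of the three fields above, here is a minimal sketch of a growable array that doubles its storage when it runs out of space. `GrowArray` is a hypothetical toy type that only stores `i32` values; the real `Vec<T>` manages raw memory, and its exact growth factor is an implementation detail.

```rust
// A toy growable array of i32 values (hypothetical `GrowArray`, not the real Vec<T>).
struct GrowArray {
    data: Box<[i32]>, // allocated storage; its length plays the role of capacity
    len: usize,       // number of slots currently in use (the "size")
}

impl GrowArray {
    fn with_capacity(capacity: usize) -> Self {
        GrowArray { data: vec![0; capacity].into_boxed_slice(), len: 0 }
    }

    fn capacity(&self) -> usize {
        self.data.len()
    }

    fn push(&mut self, value: i32) {
        if self.len == self.capacity() {
            // Out of space: allocate twice as much (at least 1) and copy everything.
            let new_capacity = (self.capacity() * 2).max(1);
            let mut bigger = vec![0; new_capacity].into_boxed_slice();
            bigger[..self.len].copy_from_slice(&self.data[..self.len]);
            self.data = bigger;
        }
        self.data[self.len] = value;
        self.len += 1;
    }

    fn pop(&mut self) -> Option<i32> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1; // capacity is left unchanged
        Some(self.data[self.len])
    }
}

fn main() {
    let mut a = GrowArray::with_capacity(2);
    for i in 1..=5 {
        a.push(i);
        println!("length = {}, capacity = {}", a.len, a.capacity());
    }
    println!("{:?}", a.pop());
}
```

The invariant size $\le$ capacity holds after every call, because `push` grows the storage before writing and `pop` only decreases `len`.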

## Example

Method `capacity()` reports the current storage size

In [2]:
// print out the current size and capacity

// define a generic function `info` that takes one argument, `vector`,
// of generic `Vec` type and prints its length and capacity
fn info<T>(vector:&Vec<T>) {  
    println!("length = {}, capacity = {}",vector.len(),vector.capacity());
}

In [3]:
fn test () {
    let boop = "helo";
    match boop {
        "heelo" => {println!("whee")},
        _ => {println!("oops")},
    }
}
test()

oops


()

In [4]:
// Let's keep adding elements to Vec and see what happens to capacity

let mut v = Vec::with_capacity(7); // instantiate empty Vec with capacity 7
let mut capacity = v.capacity();
info(&v);

for i in 1..=1000 {
    v.push(i);  // push the index onto the Vec

    // if capacity changed, print the length and new capacity
    if v.capacity() != capacity {
        capacity = v.capacity();
        info(&v);
    }
};
info(&v);

length = 0, capacity = 7
length = 8, capacity = 14
length = 15, capacity = 28
length = 29, capacity = 56
length = 57, capacity = 112
length = 113, capacity = 224
length = 225, capacity = 448
length = 449, capacity = 896
length = 897, capacity = 1792
length = 1000, capacity = 1792


In [5]:
// what happens when we decrease the Vec by popping off values?

info(&v);

// `while let` is a control flow construct that will continue
// as long as pattern `Some(_) = v.pop()` matches.
// `v.pop()` returns an `Option<T>`:
//    `Some(value)` if there is a value to pop,
//    otherwise `None`, which ends the loop.
while let Some(_) = v.pop() {}

info(&v);

length = 1000, capacity = 1792
length = 0, capacity = 1792


<br>

**Questions:** 

- What is happening as we push elements?
- When does it happen?
- How much is it changing by?
- What happens when we pop? Is capacity changing?

<br>


## Example (continued)

In [6]:
// shrinking the size manually
info(&v);

for i in 1..=13 {
    v.push(i);
}

info(&v);

v.shrink_to_fit();

info(&v);
// note: size and capacity not guaranteed
//       to be the same

length = 0, capacity = 1792
length = 13, capacity = 1792
length = 13, capacity = 13


In [7]:
// creating vector with specific capacity
let mut v2 : Vec<i32> = Vec::with_capacity(1234);
info(&v2);

// avoids reallocation if you know how many items
// to expect

length = 0, capacity = 1234


<br>

`.get()` versus `.pop()`

<br>

In [8]:
// Does not remove from the vector
println!("{:?} {:?}", v.get(1), v);
// But this one does
println!("{:?} {:?}", v.pop(), v);

Some(2) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Some(13) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]


### Some other useful functions
* `append` Add vector at the end of another `vec.append(&mut vec2)`
* `clear` Remove all elements from the vector `vec.clear()`
* `dedup` Remove consecutive identical elements `vec.dedup()`, most useful when combined with `sort`
* `drain` Remove a slice from the vector `vec.drain(2..4)`  -- removes and shifts -- **expensive**
* `remove` Remove an element from the vector `vec.remove(2)` -- removes and shifts -- **expensive**
* `sort` Sort the elements of a mutable vector `vec.sort()`
* Complete list at https://doc.rust-lang.org/std/vec/struct.Vec.html
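
A short sketch that exercises the methods listed above on small vectors. The helper name `sorted_unique` is made up for this example; everything else is the standard `Vec` API.

```rust
// Sort, then drop consecutive duplicates, leaving each distinct value once.
fn sorted_unique(mut v: Vec<i32>) -> Vec<i32> {
    v.sort();
    v.dedup();
    v
}

fn main() {
    let mut a = vec![3, 1, 2];
    let mut b = vec![2, 3, 3];

    a.append(&mut b); // moves everything out of b; b is left empty
    println!("{:?} {:?}", a, b);

    let mut u = sorted_unique(a);
    println!("{:?}", u);

    u.extend([4, 5, 6]);
    // drain removes indices 1 and 2 and shifts the remaining elements left
    let removed: Vec<i32> = u.drain(1..3).collect();
    println!("{:?} {:?}", removed, u);

    // remove takes out index 2 and shifts the tail left
    let third = u.remove(2);
    println!("{} {:?}", third, u);

    u.clear(); // length becomes 0
    println!("{:?}", u);
}
```

`dedup` only removes neighbouring duplicates, which is why the helper sorts first.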

## Sketch of analysis: Amortization

* Inserting an element is not constant time (i.e. $O(1)$) under all conditions

### However

* **Assumption:** allocating memory size $t$ takes either $O(t)$ or $O(1)$ time


* **Slow operations:** $O($ current_size $)$ time
* **Fast operations:** $O(1)$ time


### What is the _average_ time?

- Consider an initial 100-capacity Vec.
- Continually add elements
- First 100 added elements: $O(?)$
- For 101st element: $O(?)$

So on average for the first 101 elements: $ (??) / 101 \approx ?? $

* **On average:** $O(1)$ time
* Fast operations pay for slow operations


* **Terminology:** $O(1)$ *amortized* time
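
One way to see the amortized bound is to count copies directly. The sketch below simulates the doubling policy; the function name `copies_for_pushes` and the cost model (one unit of work per copied element) are assumptions made for illustration, not measurements of `Vec<T>`. Starting from capacity 100, the first 100 pushes copy nothing and the 101st copies 100 elements, so the average over 101 pushes is about one copy per push. Over a long run, the total number of copies stays below $2n$ for $n$ pushes.

```rust
// Count how many element copies `n` pushes cause under a doubling policy.
// This models the policy described above, not the exact behaviour of Vec<T>.
fn copies_for_pushes(n: usize, initial_capacity: usize) -> usize {
    let mut capacity = initial_capacity.max(1);
    let mut len = 0;
    let mut copies = 0;
    for _ in 0..n {
        if len == capacity {
            copies += len; // slow push: every existing element is copied
            capacity *= 2;
        }
        len += 1; // fast part of the push: write one element
    }
    copies
}

fn main() {
    for n in [101, 1_000, 1_000_000] {
        let copies = copies_for_pushes(n, 100);
        println!(
            "n = {:>9}: copies = {:>9}, copies per push = {:.3}",
            n,
            copies,
            copies as f64 / n as f64
        );
    }
}
```

The copies-per-push ratio stays bounded by a constant as `n` grows, which is what "$O(1)$ amortized" means.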

### Dominant terms and constants in $O()$ notation

We ignore constants and all but dominant terms as $n \rightarrow \infty$ :

$$ O(n/2) \rightarrow O(n) $$

$$ O(n^2 + 100n + 50) \rightarrow O(n^2) $$

$$ O(2^n + n^{100}) \rightarrow ?? $$


<div align="center">
    <img src="order_plot.png" alt="Order of n plots" style="width: 80%; height: auto;">
</div>

### Shrinking?

* Can be implemented this way too
* Example: shrink by 50% if less than 25% used
* Most implementations don't shrink automatically
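
The example policy above can be sketched on top of `Vec::shrink_to`. The function name `maybe_shrink` and the 25% / 50% thresholds are assumptions taken from the bullet above; `Vec` itself does not apply any such policy automatically.

```rust
// Hypothetical policy: if less than 25% of the capacity is in use,
// ask for the storage to be shrunk to half of the current capacity.
fn maybe_shrink<T>(v: &mut Vec<T>) {
    if v.len() * 4 < v.capacity() {
        // shrink_to keeps the capacity at least as large as the length
        // and the requested lower bound.
        v.shrink_to(v.capacity() / 2);
    }
}

fn main() {
    let mut v: Vec<i32> = Vec::with_capacity(1000);
    v.extend(0..100);
    println!("before: length = {}, capacity = {}", v.len(), v.capacity());
    maybe_shrink(&mut v);
    println!("after:  length = {}, capacity = {}", v.len(), v.capacity());
}
```

Shrinking only when usage drops well below the halved capacity avoids repeatedly growing and shrinking when the length hovers around a threshold.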

### Notations

$O(n)$ -> Upper bound: running time grows no faster than a constant times n (most often quoted for the worst case)

$\Omega(n)$ -> Lower bound: running time grows at least as fast as a constant times n  

$\Theta(n)$ -> Tight bound: running time is both $O(n)$ and $\Omega(n)$  


## Digression (Sorting Vectors in Rust)

Sorting on integer vectors works fine.

In [9]:
// This works great
let mut a = vec![1, 4, 3, 6, 8, 12, 5];
a.sort();
println!("{:?}", a);

[1, 3, 4, 5, 6, 8, 12]


But sorting on floating point vectors does not work directly.

In [10]:
// But the compiler does not like this one, since sort depends on total order
let mut a = vec![1.0, 4.0, 3.0, 6.0, 8.0, 12.0, 5.0];

In [11]:
a.sort();
println!("{:?}", a);

Error: the trait bound `f64: Ord` is not satisfied

Why?

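Because floats in Rust include the special value `NaN`, which compares as neither less than, equal to, nor greater than any value (itself included), there is no total order on floats. (`inf`, by contrast, compares normally.)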

More technically, floats in Rust don't implement the [`Ord` trait](https://doc.rust-lang.org/std/cmp/trait.Ord.html).

In [12]:
let mut x: f64 = 6.8;
x/0.0

inf

We can push `inf` onto a `Vec`.

In [13]:
a.push(x/0.0);
println!("{:?}", a);

[1.0, 4.0, 3.0, 6.0, 8.0, 12.0, 5.0, inf]


In [14]:
let mut x: f64 = -1.0;
x.sqrt()

NaN

<br>

Similarly, we can push `NaN` onto a `Vec`.

<br>

In [15]:
a.push(x.sqrt());
println!("{:?}", a);

[1.0, 4.0, 3.0, 6.0, 8.0, 12.0, 5.0, inf, NaN]


<br>

We can work around this by not relying on the Rust implementation of `sort()`, but rather defining our own comparison and using the [`.sort_by()`](https://doc.rust-lang.org/std/primitive.slice.html#method.sort_by) function.

<br>

In [16]:
// This is ok: sort_by does not require Ord; it uses the comparison function you pass in
let mut a: Vec<f32> = vec![1.0, 4.0, 3.0, 6.0, 8.0, 12.0, 5.0];
// a.sort();
a.sort_by(|x, y| x.partial_cmp(y).unwrap());
println!("{:?}", a);

[1.0, 3.0, 4.0, 5.0, 6.0, 8.0, 12.0]


<br>

Just be careful! `partial_cmp` returns `None` when either value is `NaN`, so the `unwrap()` will panic if the vector contains a `NaN`.

<br>

In [17]:
// When partial order is not well defined in the inputs you get a panic
let mut a = vec![1.0, 4.0, 3.0, 6.0, 8.0, 12.0, 5.0];
let mut x: f32 = -1.0;
x = x.sqrt();
a.push(x);
println!("{:?}", a);
a.sort_by(|x, y| x.partial_cmp(y).unwrap());
println!("{:?}", a);

[1.0, 4.0, 3.0, 6.0, 8.0, 12.0, 5.0, NaN]


thread '<unnamed>' panicked at src/lib.rs:120:35:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::panic
   3: core::option::unwrap_failed
   4: core::slice::sort::shared::smallsort::insert_tail
   5: std::panic::catch_unwind
   6: _run_user_code_15
   7: evcxr::runtime::Runtime::run_loop
   8: evcxr::runtime::runtime_hook
   9: evcxr_jupyter::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
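
One alternative that avoids the panic is the standard library method `f64::total_cmp`, which implements the IEEE 754 total order and therefore also assigns a position to `NaN`. This is a sketch of a different technique from the `partial_cmp` approach above; the helper name `sort_floats` is made up for this example. In this order a `NaN` with a positive sign sorts after `inf` and one with a negative sign sorts before `-inf`, so the example fixes the sign explicitly.

```rust
// Sort floats using the IEEE 754 total order, which also places NaN.
fn sort_floats(v: &mut [f64]) {
    v.sort_by(|a, b| a.total_cmp(b));
}

fn main() {
    let mut a = vec![1.0, 4.0, 3.0, f64::INFINITY, 8.0, 12.0, 5.0];
    // Force a positive sign so that this NaN sorts after +inf.
    a.push(f64::NAN.copysign(1.0));
    sort_floats(&mut a);
    println!("{:?}", a);
}
```

If `NaN` values should never appear in the data, filtering them out or reporting an error before sorting may be the clearer choice.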


# 2. Hash maps

## Collection `HashMap<K,V>`

**Goal:** a mapping from elements of `K` to elements of `V`

* elements of `K` called *keys* -- **must be unique**
* elements of `V` called *values*  -- **need not be unique**

Similar structure in other languages:

* Python: dictionaries
* C++: `unordered_map<K,V>`
* Java: `HashMap<K,V>` / `Hashtable<K,V>`

In [18]:
// creating a hash map and inserting pair

use std::collections::HashMap;

// number of wins in a local Counterstrike league
let mut wins = HashMap::<String,u16>::new();

// insert adds a new key/value pair if the key does not exist and overwrites the old value if it does
wins.insert(String::from("Boston University"),24);
wins.insert(String::from("Harvard"),22);
wins.insert(String::from("Boston College"),20);
wins.insert(String::from("Northeastern"),32);

Extracting a reference: returns `Option<&V>`

In [19]:
wins.get("Boston University")

Some(24)

In [20]:
wins.get("MIT")

None

<br>

To insert a value only if the key is not already present, use `.entry().or_insert()`.

<br>

In [21]:
wins.entry(String::from("MIT")).or_insert(10);
wins.get("MIT")

Some(10)

<br>

Updating a value based on the old value:

<br>

In [22]:
{ // code block to limit how long the reference lasts
    let entry = wins.entry(String::from("Boston University")).or_insert(10);
    *entry += 50;
}
//wins.insert(String::from("Boston University"),24);
wins.get("Boston University")

Some(74)
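
A common use of the same `entry().or_insert()` pattern is counting occurrences. The sketch below is a minimal example; the function name `word_counts` and the sample sentence are made up for illustration.

```rust
use std::collections::HashMap;

// Count how many times each whitespace-separated word occurs in `text`.
fn word_counts(text: &str) -> HashMap<String, u32> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // Insert 0 the first time a word is seen, then bump the stored count.
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = word_counts("the quick fox jumps over the lazy dog the end");
    println!("the: {:?}", counts.get("the"));
    println!("cat: {:?}", counts.get("cat"));
}
```

Because `or_insert` returns a mutable reference to the value, the update happens in place with a single lookup.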

## Iterating

You can iterate over each key-value pair with a `for` loop similar to vectors.

<br>

In [23]:
for (k,v) in &wins {
    println!("{}: {}",k,v);
};

for (k,v) in wins.iter() {
    println!("Iter {}: {}",k,v);
};


Northeastern: 32
Harvard: 22
MIT: 10
Boston College: 20
Boston University: 74
Iter Northeastern: 32
Iter Harvard: 22
Iter MIT: 10
Iter Boston College: 20
Iter Boston University: 74


<br>

To modify values, you have to use mutable versions:

<br>

In [24]:
for (k,v) in &mut wins {
    *v += 1;
};

for (k,v) in &wins {
    println!("{}: {}",k,v);
};

for (k,v) in wins.iter_mut() {
    *v += 1;
};

for (k,v) in wins.iter() {
    println!("Mut iter {}: {}",k,v);
};



Northeastern: 33
Harvard: 23
MIT: 11
Boston College: 21
Boston University: 75
Mut iter Northeastern: 34
Mut iter Harvard: 24
Mut iter MIT: 12
Mut iter Boston College: 22
Mut iter Boston University: 76
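
Note that the iteration order of a `HashMap` is arbitrary and can differ between runs. When a stable order is needed for printing, one option is to collect the pairs into a `Vec` and sort them. The helper name `sorted_pairs` is an assumption for this sketch.

```rust
use std::collections::HashMap;

// Return references to the (key, value) pairs of `map`, sorted by key.
fn sorted_pairs(map: &HashMap<String, u16>) -> Vec<(&String, &u16)> {
    let mut pairs: Vec<_> = map.iter().collect();
    // Tuples compare by their first field first; keys are unique,
    // so this orders the pairs by key.
    pairs.sort();
    pairs
}

fn main() {
    let mut wins = HashMap::new();
    wins.insert(String::from("Northeastern"), 32);
    wins.insert(String::from("Harvard"), 22);
    wins.insert(String::from("Boston College"), 20);

    for (k, v) in sorted_pairs(&wins) {
        println!("{}: {}", k, v);
    }
}
```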


### Using HashMaps with Match statements

In [25]:
use std::collections::HashMap;

let mut crispy_crêpes_café = HashMap::new();
crispy_crêpes_café.insert(String::from("Nutella Crêpe"),5.85);
crispy_crêpes_café.insert(String::from("Strawberries and Nutella Crêpe"),8.75);
crispy_crêpes_café.insert(String::from("Roma Tomato, Pesto and Spinach Crêpe"),8.90);
crispy_crêpes_café.insert(String::from("Three Mushroom Crêpe"),8.90);

fn on_the_menu(cafe: &HashMap<String,f64>, s:String) {
    print!("{}: ",s);
    match cafe.get(&s) {  // .get() returns an Option enum
        None => println!("not on the menu"),
        Some(price) => println!("${:.2}",price),
    }
}
on_the_menu(&crispy_crêpes_café, String::from("Four Mushroom Crêpe"));
on_the_menu(&crispy_crêpes_café, String::from("Three Mushroom Crêpe"));



Four Mushroom Crêpe: not on the menu
Three Mushroom Crêpe: $8.90


## How Hash Tables Work

### Storage

* Array (e.g. `Vec<Option<T>>`) representing $B$ (e.g. 8) buckets (capacity),
  * each holding something like an `Option<T>` enum with `Some((key, value, hash))`
* **Hash function:**
  * Like a pseudorandom number generator with key as the seed, e.g. $\textrm{hash}: \textrm{``apple''} \rightarrow 2678277905398556038$
  * Then take modulo of capacity $B=8$, e.g. `index = hash % 8 = 6`
  * So ultimately maps keys into one of the buckets
    * $h: \text{Key} \rightarrow \{0,1,\ldots,B-1\}$
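
The two steps above (hash the key, then reduce modulo the number of buckets) can be sketched with the standard library's `DefaultHasher`. The function name `bucket_index` is made up for this example, and the algorithm behind `DefaultHasher` is not specified by the standard library, so the printed bucket numbers may change between Rust versions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map any hashable key to one of `buckets` slots:
// hash the key first, then take the remainder.
// Assumes `buckets` is greater than zero.
fn bucket_index<K: Hash>(key: &K, buckets: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() % buckets
}

fn main() {
    for key in ["apple", "banana", "cherry"] {
        println!("{:>6} -> bucket {}", key, bucket_index(&key, 8));
    }
}
```

The same key always lands in the same bucket within a run, which is what makes lookup by key possible.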

### General ideas
  * Store keys (and associated values and hashes) in buckets
  * Indexing: Use hash function to find bucket holding key and value.

### Collision: two keys mapped to the same bucket  
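  * Two different keys producing the same full hash value is very unlikely, but after taking the hash modulo $B$, two keys landing in the same bucket is common when $B$ is small
  * What to do if two keys land in the same bucket?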



## Handling collisions

### Probing

* Each bucket entry: (key, value, hash)
* Use a deterministic algorithm to find an open bucket

**Inserting:**
  * entry $h(k)$ busy: try $h(k) + 1$, $h(k) + 2$, etc. (wrapping around modulo $B$)
  * insert into first empty

**Searching:**
  * try $h(k)$, $h(k) + 1$, $h(k)+2$, etc. (wrapping around modulo $B$)
  * stop when found or empty entry
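
The inserting and searching rules above can be sketched as a toy table with linear probing. `ProbeTable` is a hypothetical type for illustration only: it has a fixed number of buckets, never grows, does not support removal, and does not store the hash alongside the key. It is not how the standard `HashMap` is implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy open-addressing table with linear probing (hypothetical `ProbeTable`).
struct ProbeTable {
    buckets: Vec<Option<(String, i32)>>,
}

impl ProbeTable {
    // Assumes `capacity` is greater than zero.
    fn new(capacity: usize) -> Self {
        ProbeTable { buckets: vec![None; capacity] }
    }

    // h(k): hash the key, then reduce it to a bucket index.
    fn home(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.buckets.len() as u64) as usize
    }

    // Returns false if every bucket is occupied by a different key.
    fn insert(&mut self, key: &str, value: i32) -> bool {
        let start = self.home(key);
        let n = self.buckets.len();
        for step in 0..n {
            let i = (start + step) % n; // h(k), h(k)+1, ... wrapping around
            let free_or_same = match &self.buckets[i] {
                None => true,
                Some((k, _)) => k == key,
            };
            if free_or_same {
                self.buckets[i] = Some((key.to_string(), value));
                return true;
            }
        }
        false
    }

    fn get(&self, key: &str) -> Option<i32> {
        let start = self.home(key);
        let n = self.buckets.len();
        for step in 0..n {
            match &self.buckets[(start + step) % n] {
                None => return None, // empty bucket: the key cannot be further along
                Some((k, v)) if k == key => return Some(*v),
                Some(_) => {} // occupied by another key: keep probing
            }
        }
        None
    }
}

fn main() {
    let mut t = ProbeTable::new(8);
    for (i, name) in ["apple", "banana", "cherry", "date"].iter().enumerate() {
        t.insert(name, i as i32);
    }
    println!("{:?} {:?}", t.get("cherry"), t.get("fig"));
}
```

Stopping a search at the first empty bucket is only valid because this sketch never removes entries; supporting removal needs extra bookkeeping such as tombstones.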

## Growing the collection: amortization

Keep track of the number of filled entries.

When the number of keys $\ge 0.75 B$

* Double $B$
* Pick new hash function
* Move the information
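
As with `Vec`, the growth of the standard `HashMap` can be observed through `capacity()`. The sketch below records the length and capacity each time the capacity changes; the function name `capacity_steps` is made up, and the exact capacities, load factor, and whether anything is rehashed are implementation details of the standard library that may differ from the scheme above.

```rust
use std::collections::HashMap;

// Insert the keys 0..n and record (length, capacity) at the start
// and each time the capacity changes.
fn capacity_steps(n: u32) -> Vec<(usize, usize)> {
    let mut map = HashMap::new();
    let mut steps = vec![(map.len(), map.capacity())];
    for i in 0..n {
        map.insert(i, i);
        if map.capacity() != steps.last().unwrap().1 {
            steps.push((map.len(), map.capacity()));
        }
    }
    steps
}

fn main() {
    for (len, cap) in capacity_steps(1000) {
        println!("length = {}, capacity = {}", len, cap);
    }
}
```

Only a handful of lines are printed for 1000 insertions, so the expensive "move everything" step is rare, which is the basis of the amortized analysis.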

## Adversarial data

* Could create lots of collisions

* Potential basis for *denial of service attacks*

### What makes a good hash function?

* Uniform distribution of inputs to the buckets available!!!
* Consistent hashing adds the property that not too many things move around when the number of buckets changes

http://www.partow.net/programming/hashfunctions/index.html  
https://en.wikipedia.org/wiki/Consistent_hashing  
https://doc.rust-lang.org/std/collections/struct.HashMap.html  

### To Dig Deeper (Optional)

Inspect and debug/single-step through a [simple implementation](./hashmap_impl) that supports creation, insert, get and remove.

See how index is found from hashing the key.

See how collision is handled.

## Hashing with custom types in Rust

How do we use custom datatypes as keys?

Required for hashing:
  1. check whether two keys $k_1, k_2$ of type `K` are equal
  1. compute a hash function for elements of `K`

In [26]:
use std::collections::HashMap;

struct Point {
    x:i64,
    y:i64,
}

let point = Point{x:2,y:-1};

let mut elevation = HashMap::new();

elevation.insert(point,2.3);


Error: the trait bound `Point: Eq` is not satisfied

Error: the trait bound `Point: Hash` is not satisfied

For a type to work as a key in a `HashMap`, it needs three traits:
  * `PartialEq`
    * ✅ Symmetry: If a == b, then b == a.
    * ✅ Transitivity: If a == b and b == c, then a == c.
    * ❌ Reflexivity is NOT guaranteed (because e.g. NaN != NaN in floats).
  * `Eq`
    * ✅ Reflexivity: a == a is always true.
    * ✅ Symmetry
    * ✅ Transitivity
  * `Hash`
    * Supports deterministic output of a hash function
    * Consistency with Equality -- if two values are equal $a == b$, then their hashes are equal
    * Need not be invertible -- in general you cannot reconstruct the original value from the hash
    * etc...

Default implementation:

In [27]:
use std::collections::HashMap;

#[derive(Debug,Hash,Eq,PartialEq)]
struct DistanceKM(u64);

let mut tired = HashMap::new();

tired.insert(DistanceKM(30),true);
tired

{DistanceKM(30): true}
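
The same `derive` line also fixes the earlier `Point` example. This is a minimal sketch; the extra `Clone` and `Copy` derives and the second inserted point are additions for illustration and are not required for use as a key.

```rust
use std::collections::HashMap;

// Deriving PartialEq, Eq and Hash makes Point usable as a HashMap key.
#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)]
struct Point {
    x: i64,
    y: i64,
}

fn main() {
    let mut elevation = HashMap::new();
    elevation.insert(Point { x: 2, y: -1 }, 2.3);
    elevation.insert(Point { x: 0, y: 0 }, 0.5);

    // A separately constructed Point with the same fields finds the same entry,
    // because the derived Eq and Hash compare and hash field by field.
    println!("{:?}", elevation.get(&Point { x: 2, y: -1 }));
    println!("{:?}", elevation.get(&Point { x: 9, y: 9 }));
}
```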

### Traits that you can derive automatically

* Clone: Allow user to make an explicit copy
* Copy: Allow user to make an implicit copy
* Debug: Allow user to print contents
* Default: Allow user to initialize with default values (Default::default())
* Hash: Allow user to use it as a key to a hash map or set.
* Eq: Allow user to test for equality
* Ord: Allow user to sort and fully order types
* PartialEq: Obeys most rules for equality but not all
* PartialOrd: Obeys most rules for ordering but not all

### Using Floats as Keys

Not all basic types support the Eq and Hash traits (f32 and f64 do not).  The reasons have to do with the NaN and Infinity problems we discussed last time.  

* If you find yourself needing floats as keys, consider converting the float to a collection of integers
* A floating point representation consists of a sign, an exponent and a mantissa, each an integer

<div align="center">
    <img src="Single-Precision-IEEE-754-Floating-Point-Standard.jpg" alt="Float number" style="width: 50%; height: auto;">
</div>
<div style="text-align:center">
    <em>From https://www.geeksforgeeks.org/ieee-standard-754-floating-point-numbers/</em>
</div>

`float_num = sign * mantissa * 2^exponent` where  
- `sign` is -1 or 1  
- `mantissa` is an unsigned integer built from the 23-bit mantissa field (single precision)  
- `exponent` is a signed integer derived from the 8-bit exponent field (single precision)  

In [28]:
// External crate (not part of the standard library) with traits for numeric types
:dep num-traits="0.2"

In [29]:
use num_traits::Float;

let num:f64 = 3.14159;  // Some float
println!("num: {:32.21}", num);


num:          3.141589999999999882618


<br>

**Question:** Why is the number printed different than the number assigned?

<br>

Let's decompose the floating point number into its components:

In [30]:
let base:f64 = 2.0;

// Deconstruct the floating point
let (mantissa, exponent, sign) = Float::integer_decode(num);
println!("mantissa: {} exponent: {} sign: {}", mantissa, exponent, sign);

// Convert to f64
let sign_f:f64 = sign as f64;
let mantissa_f:f64 = mantissa as f64;
let exponent_f:f64 = base.powf(exponent as f64);

// Recalculate the floating point value
let new_num:f64 = sign_f * mantissa_f * exponent_f;

println!("{:32.31} {:32.31}", num, new_num);

mantissa: 7074231776675438 exponent: -51 sign: 1
3.1415899999999998826183400524314 3.1415899999999998826183400524314
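
A simpler variant of the same idea, using only the standard library, is to key on the raw 64-bit pattern returned by `f64::to_bits`, which packs the sign, exponent and mantissa fields into a single `u64`. The wrapper type `FloatKey` is hypothetical and is only a sketch: two floats are treated as the same key only when their bits match exactly, so `0.0` and `-0.0` are different keys and rounding differences matter.

```rust
use std::collections::HashMap;

// Hypothetical wrapper: compare and hash an f64 by its raw 64-bit pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct FloatKey(u64);

impl FloatKey {
    fn new(x: f64) -> Self {
        FloatKey(x.to_bits())
    }

    // Recover the original float from the stored bits.
    fn value(self) -> f64 {
        f64::from_bits(self.0)
    }
}

fn main() {
    let mut names = HashMap::new();
    names.insert(FloatKey::new(3.14159), "approximately pi");
    names.insert(FloatKey::new(0.3), "three tenths");

    println!("{:?}", names.get(&FloatKey::new(3.14159)));
    // Caveat: 0.1 + 0.2 is not bit-identical to 0.3, so this lookup misses.
    println!("{:?}", names.get(&FloatKey::new(0.1 + 0.2)));
    println!("{}", FloatKey::new(2.5).value());
}
```

If nearby values should count as the same key, rounding to a fixed number of decimal places and storing the result as an integer is a common alternative.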


## `HashSet<K>` 

"A HashMap without values"

* No value associated with keys
* Just a set of items
* Same implementation
* Fast membership tests: expected $O(1)$ time per lookup


In [31]:
use std::collections::HashSet;

// create
let mut covid = HashSet::new();

// insert values
for i in 2019..=2022 {
    covid.insert(i);
};

In [32]:
// iterate over values in the set
for year in &covid {
    print!("{} ",year);
}
println!();

2022 2021 2019 2020 


<br>

**Question:** Why aren't the years in the order we inserted them?

<br>

We can use `.get()` and `.insert()`, similarly to how we used them in HashMaps.

In [33]:
// Returns `None` if not in the HashSet
covid.get(&2015)

None

In [34]:
covid.get(&2021)

Some(2021)

In [35]:
covid.insert(2020);
for year in &covid {
    print!("{} ",year);
}
println!();

2022 2021 2019 2020 
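
Inserting `2020` a second time left the set unchanged, and `insert` reports this through its `bool` return value. The sketch below uses that to remove duplicates while keeping the input order, and also shows `contains` and `intersection`. The function name `unique_in_order` and the sample data are made up for this example.

```rust
use std::collections::HashSet;

// Keep the first occurrence of each value, preserving input order.
fn unique_in_order(items: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &x in items {
        // insert returns true only if x was not already in the set
        if seen.insert(x) {
            out.push(x);
        }
    }
    out
}

fn main() {
    println!("{:?}", unique_in_order(&[2020, 2019, 2020, 2022, 2019]));

    let a: HashSet<i32> = (2019..=2022).collect();
    let b: HashSet<i32> = (2021..=2024).collect();
    let mut both: Vec<i32> = a.intersection(&b).copied().collect();
    both.sort(); // set iteration order is arbitrary, so sort before printing
    println!("{:?} {}", both, a.contains(&2015));
}
```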


## In-Class Piazza Poll





https://piazza.com/class/m5qyw6267j12cj/post/206