<div align="center">
    <h1>DS-210: Programming for Data Science</h1>
    <h1>Lecture 16</h1>
</div>

1. Strings: `String` and `&str` (§8.2)
2. Lifetimes (§10.3)
3. Closures (§13.1-2)

Read section 8.2 of The Rust Programming Language for strings, section 10.3 for lifetimes, and sections 13.1 and 13.2 for closures and iterators.

# 1. Strings: `String` and `&str`


## Unicode Standard and UTF-8

* **Unicode** – character encoding standard that assigns a number (code point) to every character.
    * version 15.1 defines 149,813 characters and 161 scripts, including emoji, symbols, etc.
* **Unicode code point** -- the code space has room for up to $17×2^{16}=1,114,112$ entries.
    * i.e. U+0000 – U+10FFFF in hexadecimal
* **Unicode Transformation Format** (e.g. UTF-8) – defines how code points are stored as bytes; UTF-8 is a variable-length encoding using one to four bytes per character
    * First 128 chars same as ASCII

<div align="center">
    <img src="UTF8-codepoints.png" width="70%">
</div>

1. 1st row (1 byte) covers ASCII
2. 2nd row (2 bytes) covers the remainder of almost all Latin-script alphabets
3. 3rd row (3 bytes) covers the rest of the Basic Multilingual Plane, including Chinese, Japanese and Korean characters
4. 4th row (4 bytes) covers emoji, historic scripts, math symbols
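
The four rows can be checked directly in Rust. The sketch below relies on `char::len_utf8` and `char::encode_utf8` from the standard library; the helper name `utf8_len` and the sample characters are chosen here only for illustration.

```rust
// Number of bytes needed to encode a single character in UTF-8.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    // One sample character for each row of the table.
    for c in ['a', 'é', '中', '🦕'] {
        let mut buf = [0u8; 4];
        let encoded = c.encode_utf8(&mut buf);
        println!(
            "{} U+{:04X} -> {} byte(s): {:02X?}",
            c,
            c as u32,
            utf8_len(c),
            encoded.as_bytes()
        );
    }
}
```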

https://en.wikipedia.org/wiki/Unicode<br>
https://en.wikipedia.org/wiki/UTF-8 

## Rust and strings

* We have avoided this topic so far


* It's complicated


* Unicode is complicated


* Advantages: internationalization and emojis out of the box

* **Rust:** Unicode strings are a first–class citizen


* **Classical programming languages:**
  * ASCII strings are the default
  * Easier to manage
  * Additional libraries needed to deal with Unicode


### Helper Function to Print Variable Type

In [2]:
// helper function to get the type of a variable
use std::any::type_name;

fn type_of<T>(_: T) -> &'static str {
    type_name::<T>()
}

// Usage example:
let x = 42;
println!("{}", type_of(x)); // prints "i32"

i32


## Reminder: Single characters (Unicode scalar values)

* Type: `char`
* Size: 4 bytes
* Note the single quotes!

In [3]:
let a : char = 'a';
let b = '🦕';

Dinosaurs:<br>
&nbsp;&nbsp;&nbsp;🦕 (U+1F995)<br>
&nbsp;&nbsp;&nbsp;🦖 (U+1F996)

In [4]:
// Mayan numeral (not all unicode characters are supported everywhere)
let c = '𝋥';

In [5]:
std::mem::size_of_val(&a)

4

In [6]:
std::mem::size_of_val(&b)

4

In [7]:
let c = char::from_u32(0x1F995);
println!("UTF32 character {:?}", c);
let c = char::from_u32(0x1F996);
println!("UTF32 character {:?}", c);


UTF32 character Some('🦕')
UTF32 character Some('🦖')
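
`char::from_u32` returns an `Option` because not every 32-bit number is a valid Unicode scalar value. A small sketch of the failing cases (the helper name `describe` is made up for this example):

```rust
// Turn a raw number into a printable description, if it is a valid char.
fn describe(code: u32) -> String {
    match char::from_u32(code) {
        Some(c) => format!("U+{:04X} is '{}'", code, c),
        None => format!("U+{:04X} is not a Unicode scalar value", code),
    }
}

fn main() {
    println!("{}", describe(0x1F995)); // a valid code point
    println!("{}", describe(0xD800)); // surrogate: reserved for UTF-16, never a char
    println!("{}", describe(0x110000)); // beyond U+10FFFF
}
```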


## String literals

* String literal = a string written directly in the source code `"like this"`
* Note the double quotes
* What type are they?

In [8]:
let sample = "Hello, DS210!";
println!("{}", type_of(sample));

&str


In [9]:
let sample: String = "Hello, DS210!".to_string();
println!("{}", type_of(sample));

alloc::string::String


In [10]:
let sample: &str = "Hello, DS210!";
println!("{}", type_of(sample));

&str


`&str` is a **string slice**; internally it behaves like a `&[u8]` whose bytes are guaranteed to be valid UTF-8
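
To make the relation to `&[u8]` concrete, here is a minimal sketch using `str::as_bytes` and `std::str::from_utf8` from the standard library. The helper name `bytes_of` is an assumption of this example.

```rust
// View a string slice as raw bytes (copied into a Vec for printing).
fn bytes_of(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    println!("{:?}", bytes_of("Rust"));

    // Going the other way is checked: arbitrary bytes need not be valid UTF-8.
    let good: [u8; 4] = [0x52, 0x75, 0x73, 0x74];
    let bad: [u8; 2] = [0xFF, 0xFE];
    println!("{:?}", std::str::from_utf8(&good));
    println!("{:?}", std::str::from_utf8(&bad).is_ok());
}
```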

In [11]:
let byte_escape = "I'm writing \x52\x75\x73\x74!";
println!("What are you doing\x3F (\\x3F means ?) {}", byte_escape);

What are you doing? (\x3F means ?) I'm writing Rust!


In [12]:
// This will not work: a \x escape takes exactly two hex digits (one byte)
// let unicode_codepoint = "\x211D";

// ...or Unicode code points.
let unicode_codepoint = "\u{211D}";
let character_name = "\"DOUBLE-STRUCK CAPITAL R\"";
println!("Unicode character {} (U+211D) is called {}",
                unicode_codepoint, character_name );

Unicode character ℝ (U+211D) is called "DOUBLE-STRUCK CAPITAL R"


## Encoding of characters

As `char` values, `a` and `🦕` were both 4 bytes. Inside a string they are stored as UTF-8, so their sizes differ:

In [13]:
std::mem::size_of_val("a")

1

In [14]:
std::mem::size_of_val("🦕")

4

Characters need 1–4 bytes to be encoded.

In [15]:
let dinos = "🦕🦖";
std::mem::size_of_val(dinos)

8

In [16]:
let mixed = "a🦖b🦕";
std::mem::size_of_val(mixed)

10

In [None]:
// Iterating through characters
for (i, c) in mixed.chars().enumerate() {
    println!("{} {} {}", i, c, std::mem::size_of_val(&c));
}
println!("{:?} {:?}", mixed.chars().nth(1), mixed.chars().nth_back(1));

0 a 4
1 🦖 4
2 b 4
3 🦕 4
Some('🦖') Some('b')


**Question:** why are all the sizes 4?

Substrings can be selected by byte range, but the range must fall on character boundaries; otherwise the program panics at runtime.

In [18]:
// error
dinos[0..1]




thread '<unnamed>' panicked at src/lib.rs:137:6:
byte index 1 is not a char boundary; it is inside '🦕' (bytes 0..4) of `🦕🦖`
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::str::slice_error_fail_rt
   3: core::str::slice_error_fail
   4: std::panic::catch_unwind
   5: _run_user_code_17
   6: evcxr::runtime::Runtime::run_loop
   7: evcxr::runtime::runtime_hook
   8: evcxr_jupyter::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


In [19]:
// proper way to get substrings
let first_two: String = dinos.chars().take(2).collect();
first_two

"🦕🦖"

In [20]:
let dinos = "🦕🦖";
dinos[4..8]

"🦖"

In [21]:
let sample = "Hello, world!";
sample[7..]

"world!"

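If a panic is not acceptable, `str::get` returns an `Option` instead, and `char_indices` gives the byte position of each character. The sketch below assumes two hypothetical helpers, `safe_slice` and `first_n_chars`; only the standard library calls inside them are real API.

```rust
// Byte-range slicing that returns None instead of panicking.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

// The first `n` characters of `s`, borrowed from `s` (no allocation).
fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_pos, _)) => &s[..byte_pos],
        None => s, // `s` has at most n characters: return everything
    }
}

fn main() {
    let mixed = "a🦖b🦕";
    println!("{:?}", safe_slice(mixed, 0, 2)); // range cuts into the dinosaur
    println!("{:?}", safe_slice(mixed, 1, 5)); // range covers exactly one character
    println!("{:?}", first_n_chars(mixed, 2));
}
```

Compared with `chars().take(n).collect()`, the slice-based version avoids building a new `String`.
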
## Strings

* String type is dynamic: `Vec<u8>` internally
* Can add characters and strings to the end

In [22]:
let mut sample = String::new();

//append string
sample.push_str("abc");
sample

"abc"

In [23]:
// append character
sample.push('d');
sample

"abcd"

## Converting literals to type `String`

Use `.to_string()` or `String::from(...)`

In [24]:
let string_1 = "This is a test".to_string();     // works for any type that implements Display
let string_2 = String::from("This is a test");   // explicit conversion via the From trait
let string_3 = "Hello".to_owned();               // borrowed -> owned (ToOwned trait)
let string_4: String = "Hello".into();           // Into trait; the target type must be known
println!("{} {} ", string_1 == string_2, string_3 == string_4);  // check if content is the same

true true 


Can also use macro `format!(...)`:
  * same syntax as `println!(...)`
  * produces an object of type `String`

In [25]:
let sample: String = format!("{} == {}",string_1,string_2);
sample

"This is a test == This is a test"

## String concatenation via `+`

* Takes ownership of the first parameter
* Second parameter: `&str`

In [26]:
let string_1 = "abc".to_string();
let string_2 = "def".to_string();

In [27]:
println!("{}", string_1 + &string_2);

//println!("{}", string_1); // error
println!("{}", string_2);

abcdef
def


### Experiments to try

1. Run the 2nd cell above a 2nd time.  What happened?
2. Uncomment line 3 of 2nd cell and run both cells. What happened?

Why `+` takes ownership of `string_1`:
 * reason: efficiency
 * no need to copy the content of the first string (unless the container size has to be increased)
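
The buffer reuse can be observed by comparing heap pointers before and after the concatenation. This is only a sketch: the function name `concat_reuses_buffer` is invented here, and the capacity of 16 is an arbitrary value large enough to hold both parts.

```rust
// Returns true if `a + &b` kept using the heap buffer that `a` already owned.
fn concat_reuses_buffer() -> bool {
    let mut a = String::with_capacity(16); // room for both parts
    a.push_str("abc");
    let b = String::from("def");

    let before = a.as_ptr();
    let joined = a + &b; // `a` is moved into the result, `b` is only borrowed
    let after = joined.as_ptr();

    joined == "abcdef" && before == after
}

fn main() {
    println!("buffer reused: {}", concat_reuses_buffer());
}
```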

## Writing generic code
* Use string slices `&str` for parameters if possible (e.g. when you won't modify the string)
* This will work with `String` and `&str`

In [28]:
fn show(message: &str) {
    println!("{}",message);
}

In [29]:
// automatic conversion to &str from &String
let my_string = String::from("ds210");
show(&my_string);
show("ds210");

ds210
ds210


In [30]:
// &str has the same limitation as any slice (not extensible)
let s1 = "String1".to_string();
let s2 = "String2".to_string();
let s3 = "str1";
let s4 = "str2";
println!("{} {} {} {}", s1, s2, s3, s4);

String1 String2 str1 str2


## Which of the following will work?

In [None]:
let s1 = "String1".to_string();
let s2 = "String2".to_string();
let s3 = "str1";
let s4 = "str2";
println!("{}", s1+s2);

Error: mismatched types

In [32]:
let s1 = "String1".to_string();
let s2 = "String2".to_string();
let s3 = "str1";
let s4 = "str2";
println!("{}", s1+s3);

String1str1


In [33]:
let s1 = "String1".to_string();
let s2 = "String2".to_string();
let s3 = "str1";
let s4 = "str2";
println!("{}", s1+&s2);

String1String2


In [34]:
let s1 = "String1".to_string();
let s2 = "String2".to_string();
let s3 = "str1";
let s4 = "str2";
println!("{}", s3.to_string()+&s1);

str1String1


In [None]:
let s1 = "String1".to_string();
let s2 = "String2".to_string();
let s3 = "str1";
let s4 = "str2";
println!("{}", *s3+s4);

Error: cannot add `&str` to `str`

In [None]:
let s1 = "String1".to_string();
let s2 = "String2".to_string();
let s3 = "str1";
let s4 = "str2";
println!("{}", s3+s4);

Error: cannot add `&str` to `&str`
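
When both operands are `&str`, one of them has to become an owned `String` first, or a new `String` has to be built some other way. A minimal sketch of three options (the function names are made up for this example):

```rust
// Build a new String with the format! macro.
fn join_with_format(a: &str, b: &str) -> String {
    format!("{}{}", a, b)
}

// Concatenate all slices of an array into a new String.
fn join_with_concat(a: &str, b: &str) -> String {
    [a, b].concat()
}

// Make the left side an owned String first, then use `+`.
fn join_with_plus(a: &str, b: &str) -> String {
    a.to_owned() + b
}

fn main() {
    let s3 = "str1";
    let s4 = "str2";
    println!("{}", join_with_format(s3, s4));
    println!("{}", join_with_concat(s3, s4));
    println!("{}", join_with_plus(s3, s4));
}
```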

# 2. Lifetimes  ([Rust Programming Language §10.3](https://doc.rust-lang.org/book/ch10-03-lifetime-syntax.html))

* Lifetimes ensure that references are valid for as long as we need them to be


In [37]:
// References are a kind of type
{
let a = 32;
println!("a is of type: {}", type_of(a));
let b = &a;
println!("b is of type: {}", type_of(b));
}

a is of type: i32
b is of type: &i32


()

* The goal is to enable Rust compiler to prevent __dangling references.__

In [None]:
{
let r;

{
    let x = 5;
    r = &x;
}

println!("r: {r}");
}

Error: `x` does not live long enough

### The Rust Compiler _Borrow Checker_

* Annotate the lifetimes of `r` and `x`.

* Rust names lifetimes with a special pattern, `'a` (a single quote followed by an identifier)

```rust
fn main() {
    let r;                // ---------+-- 'a
                          //          |
    {                     //          |
        let x = 5;        // -+-- 'b  |
        r = &x;           //  |       |
    }                     // -+       |
                          //          |
    println!("r: {r}");   //          |
}                         // ---------+
```

* We can see that `x` goes out of scope before we use a reference, `r`, to `x`.

* We can fix the scopes so that the lifetimes overlap

In [40]:
{
    let x = 5;            // ----------+-- 'b
                          //           |
    let r = &x;           // --+-- 'a  |
                          //   |       |
    println!("r: {r}");   //   |       |
                          // --+       |
}                         // ----------+

r: 5


()

## Generic Lifetimes in Functions

* Let's see an example of why we need to be able to specify lifetimes.

* Say we want to compare two strings and pick the longest one

```rust
fn main() {
    let string1 = String::from("abcd");
    let string2 = "xyz";

    let result = longest(string1.as_str(), string2);
    println!("The longest string is {result}");
}
```

In [None]:
// compare two string slices and return reference to the longest
fn longest(x: &str, y: &str) -> &str {
    if x.len() > y.len() {
        x
    } else {
        y
    }
}

Error: missing lifetime specifier

* **Question:** Why is this a problem?

<br><br><br>

* **Answer:** We don't know which reference will be returned and so we can't know the lifetime of the return reference.

## The Solution: Lifetime Annotation Syntax

* names of lifetime parameters must start with an apostrophe (') and are usually all lowercase and very short, like generic types

```rust
&i32        // a reference with inferred lifetime
&'a i32     // a reference with an explicit lifetime
&'a mut i32 // a mutable reference with an explicit lifetime
```

* now we can annotate our function with lifetime

In [42]:
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() > y.len() {
        x
    } else {
        y
    }
}

### How to interpret the above code

* We use the same syntax as for _generic types_: `fn longest<'a>(...`

* For some lifetime `'a`, both parameters are valid for at least as long as `'a`, i.e. `(x: &'a str, y: &'a str)`

* The returned string slice is also valid for at least as long as `'a`, i.e. `-> &'a str`

* At each call, `'a` becomes the region where the lifetimes of both arguments overlap

In [43]:
{
    let string1 = String::from("abcd");
    let string2 = "xyz";

    let result = longest(string1.as_str(), string2);
    println!("The longest string is {result}");
}

The longest string is abcd


()

* The above is not an issue, because both `string1` and `string2` are still valid when `result` is used.

In [45]:
// this code is still fine
{
    let string1 = String::from("long string is long");

    {
        let string2 = String::from("xyz");
        let result = longest(string1.as_str(), string2.as_str());
        println!("The longest string is {result}");
    }
}

The longest string is long string is long


()

* The above is not an issue either: `result` is used only inside the inner block, where both arguments are still valid

* But what about below?

In [None]:
{
    let string1 = String::from("long string is long");
    let result;
    {
        let string2 = String::from("xyz");
        result = longest(string1.as_str(), string2.as_str());
    }
    println!("The longest string is {result}");
}

Error: `string2` does not live long enough

* We're trying to use `result` after `string2`, the argument with the shorter lifetime, has been dropped
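
One possible way out, sketched here under the assumption that copying the text is acceptable, is to turn the borrowed result into an owned `String` before the inner scope ends. The helper `longest_owned` is hypothetical and not part of the lecture code.

```rust
fn longest<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() > y.len() {
        x
    } else {
        y
    }
}

// Hypothetical helper: copies the winner so nothing borrowed escapes.
fn longest_owned(x: &str, y: &str) -> String {
    longest(x, y).to_string()
}

fn main() {
    let string1 = String::from("long string is long");
    let result;
    {
        let string2 = String::from("xyz");
        // The returned String owns its data, so it may outlive string2.
        result = longest_owned(string1.as_str(), string2.as_str());
    }
    println!("The longest string is {}", result);
}
```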

### Lifetime of return type must match lifetime of at least one parameter

* This won't work

In [None]:
fn first_str<'a>(_x: &str, _y: &str) -> &'a str {
    let result = String::from("really long string");
    result.as_str()
}

Error: cannot return value referencing local variable `result`

* Problem is the return reference is to `result` which gets dropped at end of function
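
A common fix, shown here as a sketch, is to return an owned `String` instead of a reference, which moves the data out to the caller:

```rust
// Returning an owned String transfers ownership to the caller,
// so nothing refers to a dropped local variable.
fn first_str(_x: &str, _y: &str) -> String {
    let result = String::from("really long string");
    result
}

fn main() {
    let s = first_str("a", "b");
    println!("{}", s);
}
```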

## Lifetime Annotations in Struct Definitions

* So far, we've only used structs that fully owned member types.

* We can define structs to hold references, but then we need lifetime annotations

In [2]:
#[derive(Debug)]
struct ImportantExcerpt<'a> {
    part: &'a str,
}

{
    let novel = String::from("Call me Ishmael. Some years ago...");
    let first_sentence = novel.split('.').next().unwrap();
    let i = ImportantExcerpt {
        part: first_sentence,
    };
    println!("{:?}", i);
}

ImportantExcerpt { part: "Call me Ishmael" }


()

* An instance of `ImportantExcerpt` can't outlive the reference it holds in the `part` field.
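
The same annotation lets a function return such a struct. In the sketch below, `first_sentence` is a hypothetical helper; the commented-out `drop` marks the point where the borrow checker would reject the program.

```rust
#[derive(Debug)]
struct ImportantExcerpt<'a> {
    part: &'a str,
}

// Hypothetical helper: the returned struct borrows from `text`.
fn first_sentence<'a>(text: &'a str) -> ImportantExcerpt<'a> {
    let part = text.split('.').next().unwrap_or(text);
    ImportantExcerpt { part }
}

fn main() {
    let novel = String::from("Call me Ishmael. Some years ago...");
    let excerpt = first_sentence(&novel);
    println!("{:?}", excerpt);
    // drop(novel); // compile error if uncommented: `novel` is still borrowed below
    println!("{}", excerpt.part);
}
```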

## Lifetime Elision

```
e·li·sion
/əˈliZH(ə)n/
noun

the omission of a sound or syllable when speaking (as in I'm, let's, e'en).

* an omission of a passage in a book, speech, or film.
  "the movie's elisions and distortions have been carefully thought out"

* the process of joining together or merging things, especially abstract ideas.
  "unease at the elision of so many vital questions"
```

But why does this function compile without errors?

In [8]:
fn first_word(s: &str) -> &str {
    let bytes = s.as_bytes();

    for (i, &item) in bytes.iter().enumerate() {
        if item == b' ' {
            return &s[0..i];
        }
    }

    &s[..]
}


Shouldn't we have to write this?

```rust
fn first_word<'a>(s: &'a str) -> &'a str {
```

The compiler developers decided that some patterns were so common and simple to infer that the
compiler could just infer and automatically generate the lifetime specifications.

* **input lifetimes:** lifetimes on function or method parameters

* **output lifetimes:** lifetimes on return values

### Three Rules for Compiler Lifetime Inference



#### First Rule

Assign a lifetime parameter to each parameter that is a reference.

```rust
// function with one parameter
fn foo<'a>(x: &'a i32);

// a function with two parameters gets two separate lifetime parameters:
fn foo<'a, 'b>(x: &'a i32, y: &'b i32);

// and so on.
```


#### Second Rule

If there is exactly one input lifetime parameter, that lifetime is assigned to all output lifetime parameters

```rust
fn foo<'a>(x: &'a i32) -> &'a i32
```


#### Third Rule -- Methods

If there are multiple input lifetime parameters, but one of them is `&self` or `&mut self` because this is a **method**, the lifetime of self is assigned to all output lifetime parameters. 


#### Let's Test Our Understanding

You're the compiler and you see this function.

```rust
fn first_word(s: &str) -> &str {
```

Do any rules apply? Which one would you apply first?
<br><br><br>

First rule. Apply input lifetime annotations.

```rust
fn first_word<'a>(s: &'a str) -> &str {
```
<br><br><br>

Second rule. Apply output lifetime annotation.

```rust
fn first_word<'a>(s: &'a str) -> &'a str {
```
<br><br>

#### Test Our Understanding Again

What about if you see this function signature?

```rust
fn longest(x: &str, y: &str) -> &str {
```
<br><br><br>

We can apply the first rule again. Each parameter gets its own lifetime.

```rust
fn longest<'a, 'b>(x: &'a str, y: &'b str) -> &str {
```
Can we apply any more rules?
<br><br><br>
No. Produce a compiler error asking for annotations.
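
The annotation does not have to tie every parameter together. As a sketch, if the result can only come from `x`, it is enough to give `x` and the return type the same lifetime. The function name `first_of` is chosen here for illustration.

```rust
// Only `x` can be returned, so only `x` shares the lifetime 'a with the result.
fn first_of<'a>(x: &'a str, _y: &str) -> &'a str {
    x
}

fn main() {
    let string1 = String::from("abcd");
    let result;
    {
        let string2 = String::from("xyz");
        result = first_of(string1.as_str(), string2.as_str());
    } // string2 is dropped here, but result borrows only from string1
    println!("The first string is {}", result);
}
```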

### Lifetime Annotations in Method Definitions

Let's take a look at the third rule again:

> If there are multiple input lifetime parameters, but one of them is `&self` or `&mut self` because this is a **method**, the lifetime of self is assigned to all output lifetime parameters.

Previously, we defined a struct with a field that takes a string slice reference.

In [None]:
#[derive(Debug)]
struct ImportantExcerpt<'a> {
    part: &'a str,
}

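For the `impl` block, the lifetime parameter must be declared after `impl` and repeated after the struct name, in the same style as generic type parameters.

But we don't have to annotate the following method. The **First Rule** gives `&self` its own lifetime, and the return type contains no reference.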

In [None]:
impl<'a> ImportantExcerpt<'a> {
    fn level(&self) -> i32 {
        3
    }
}


For the following method...

In [None]:
impl<'a> ImportantExcerpt<'a> {
    fn announce_and_return_part(&self, announcement: &str) -> &str {
        println!("Attention please: {announcement}");
        self.part
    }
}


There are two input lifetimes so:

* Rust applies the first lifetime elision rule and gives both `&self` and `announcement` their own lifetimes.

* Then, because one of the parameters is `&self`, the return type gets the lifetime of `&self`, and all lifetimes have been accounted for.
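
Written out by hand, the signature that these two steps produce would look roughly like the sketch below. The lifetime names `'s` and `'b` are arbitrary choices for this example; the compiler's internal names are not visible.

```rust
struct ImportantExcerpt<'a> {
    part: &'a str,
}

impl<'a> ImportantExcerpt<'a> {
    // Explicit form of the elided signature: the result is tied to `&self`.
    fn announce_and_return_part<'s, 'b>(&'s self, announcement: &'b str) -> &'s str {
        println!("Attention please: {}", announcement);
        self.part
    }
}

fn main() {
    let novel = String::from("Call me Ishmael. Some years ago...");
    let excerpt = ImportantExcerpt { part: &novel[..15] };
    let part = excerpt.announce_and_return_part("new excerpt");
    println!("{}", part);
}
```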

## The Static Lifetime

* a special lifetime designation

* lives for the entire duration of the program

In [17]:
let s: &'static str = "I have a static lifetime.";

* use only if necessary

* prefer more fine-grained lifetimes whenever possible

For more, see for example:
- https://doc.rust-lang.org/rust-by-example/scope/lifetime/static_lifetime.html
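
One case where `'static` is a natural fit is a function that only ever returns string literals. A minimal sketch (`level_name` is a hypothetical function):

```rust
// String literals are stored in the program binary,
// so returning one as &'static str is always valid.
fn level_name(level: u8) -> &'static str {
    match level {
        0 => "low",
        1 => "medium",
        _ => "high",
    }
}

// A reference to a local String can never be 'static:
// fn broken() -> &'static str {
//     let s = String::from("temporary");
//     &s // compile error: `s` is dropped at the end of the function
// }

fn main() {
    println!("{}", level_name(1));
}
```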

## Combining Lifetimes with Generics and Trait Bounds

Let's look at an example that combines:
- lifetimes
- generics with trait bounds

In [18]:
use std::fmt::Display;

fn longest_with_an_announcement<'a, T>(
    x: &'a str,
    y: &'a str,
    ann: T,
) -> &'a str
where
    T: Display,  // T must implement the Display trait
{
    println!("Announcement! {ann}");
    if x.len() > y.len() {
        x
    } else {
        y
    }
}

* And let's execute the code:

In [23]:
{
    let string1 = String::from("short");
    let string2 = "longer";

    let result = longest_with_an_announcement(string1.as_str(), string2, "Hear ye! Hear ye!");
    println!("The longest string is {result}");
}

Announcement! Hear ye! Hear ye!
The longest string is longer


()

Let's break down the function declaration:

```rust
fn longest_with_an_announcement<'a, T>(
    x: &'a str,
    y: &'a str,
    ann: T,
) -> &'a str
where
    T: Display,   // T must implement the Display trait
```

- It has two generic parameters:
    - `'a`: A lifetime parameter
    - `T`: A type parameter
- It takes three arguments:
    - `x`: A string slice with lifetime `'a`
    - `y`: A string slice with lifetime `'a`
    - `ann`: A value of generic type `T`
- Returns a string slice with lifetime `'a`
- The `where` clause specifies that type `T` must implement the `Display` trait
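
Because `T` only has to implement `Display`, the announcement can be a number or a `String` as well as a string slice. The calls below are an illustrative sketch reusing the function from the lecture:

```rust
use std::fmt::Display;

fn longest_with_an_announcement<'a, T>(x: &'a str, y: &'a str, ann: T) -> &'a str
where
    T: Display, // T must implement the Display trait
{
    println!("Announcement! {}", ann);
    if x.len() > y.len() {
        x
    } else {
        y
    }
}

fn main() {
    // T = i32
    println!("{}", longest_with_an_announcement("short", "longer", 42));
    // T = f64
    println!("{}", longest_with_an_announcement("short", "longer", 2.5));
    // T = String
    println!("{}", longest_with_an_announcement("abcd", "xyz", String::from("note")));
}
```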


---


# 3. Closures (anonymous functions)

* Closures are anonymous functions you can:
    * save in a variable, or
    * pass as arguments to other functions

We have seen them before in Python (as lambda functions):

```Python
>>> x = lambda a, b: a * b
>>> print(x(5,6))
30
```


In Rust (with implicit or explicit type specification):
```
|a, b| a * b
|a: i32, b: i32| -> i32 {a * b}
```
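
Both uses, saving a closure in a variable and passing one as an argument, can be sketched as follows. The helper `apply` is a name made up for this example; `Fn(i32, i32) -> i32` is the standard trait bound for a callable with that signature.

```rust
// Hypothetical helper: applies any two-argument function or closure to a pair.
fn apply<F>(f: F, a: i32, b: i32) -> i32
where
    F: Fn(i32, i32) -> i32,
{
    f(a, b)
}

fn main() {
    let multiply = |a: i32, b: i32| a * b; // saved in a variable
    println!("{}", apply(multiply, 5, 6));
    println!("{}", apply(|a, b| a + b, 5, 6)); // passed directly as an argument
}
```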

## Basic Closure Syntax

* types are inferred

In [6]:
{
    // Example 1: Basic closure syntax
    let add = |x, y| x + y;
    println!("Basic closure: 5 + 3 = {}", add(5, 3));
}

Basic closure: 5 + 3 = 8


()

* But once inferred, the type cannot change.

In [None]:
{
    let example_closure = |x| x;

    let s = example_closure(String::from("hello"));
    let n = example_closure(5);
}

Error: mismatched types
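
If the same logic is needed for several types, one option is a generic function, which is instantiated separately for each type. Another is to write one closure per type. A short sketch (the names `identity`, `id_string` and `id_int` are illustrative):

```rust
// A generic function is instantiated for every type it is called with.
fn identity<T>(x: T) -> T {
    x
}

fn main() {
    let s = identity(String::from("hello"));
    let n = identity(5);
    println!("{} {}", s, n);

    // With closures, use one closure per type instead.
    let id_string = |x: String| x;
    let id_int = |x: i32| x;
    println!("{} {}", id_string(String::from("hello")), id_int(5));
}
```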

## Basic Closure Syntax with Explicit Types

* Type annotations in closures are _optional_ unlike in functions.

* Required in functions because those are interfaces exposed to users.

For comparison:

```rust
fn  add_one_v1   (x: u32) -> u32 { x + 1 }  // function
let add_one_v2 = |x: u32| -> u32 { x + 1 }; // closures...
let add_one_v3 = |x|             { x + 1 }; // ... remove types
let add_one_v4 = |x|               x + 1  ; // ... remove braces
```

In [7]:
{
    // Example 2: Basic closure syntax with explicit types
    let add = |x: i32, y: i32| -> i32 {x + y};
    println!("Basic closure: 5 + 3 = {}", add(5, 3));
}

Basic closure: 5 + 3 = 8


()

## Closure Capturing a Variable from the Environment

Note how `multiplier` is used from the environment.

In [9]:
{
    // Example 3: Closure capturing a variable from environment
    let multiplier = 2;
    let multiply = |x| x * multiplier;
    println!("Closure with captured variable: 4 * {} = {}", multiplier, multiply(4));
}

Closure with captured variable: 4 * 2 = 8


()
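
A closure can capture a variable by shared reference (as above), by mutable reference, or by value with `move`. The sketch below shows all three; `make_adder` and `count_three_calls` are hypothetical names used only here.

```rust
// `move` transfers ownership of `n` into the closure, so it can be returned.
fn make_adder(n: i32) -> impl Fn(i32) -> i32 {
    move |x| x + n
}

// Counts calls by mutating a captured variable; the closure itself must be `mut`.
fn count_three_calls() -> u32 {
    let mut count: u32 = 0;
    let mut bump = || count += 1; // captures `count` by mutable reference
    bump();
    bump();
    bump();
    count // the mutable borrow ended with the last call to `bump`
}

fn main() {
    let multiplier = 2;
    let multiply = |x: i32| x * multiplier; // captures `multiplier` by shared reference
    println!("{} {}", multiply(4), multiplier); // `multiplier` is still usable

    let add_ten = make_adder(10);
    println!("{}", add_ten(5));
    println!("{}", count_three_calls());
}
```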

## Closure with Multiple Statements


In [21]:
{
    // Example 4: Closure with multiple statements
    let process = |x: i32| {
        let doubled = x * 2;
        doubled + 1
    };
    println!("Multi-statement closure: process(3) = {}", process(3));
}

Multi-statement closure: process(3) = 7


()

### Digression -- You can assign regular functions to variables as well

In [11]:
{
    fn median2(arr: &mut [i32]) -> i32 {
        arr.sort();
        println!("{}", arr[2]);
        arr[2]
    }

    let f = median2;
    f(&mut [1,4,5,6,4]);
}

4


()

## Sample application: lazy evaluation of a value
Compute a value only if needed

In [12]:
// What does it compute?
fn expensive_function(i:u32) -> u128 {
    if i <= 1 {
        i as u128
    } else {
        expensive_function(i-1) + expensive_function(i-2)
    }
}

In [13]:
expensive_function(44)

701408733

In [14]:
// This function always computes expensive_function(44), even if not needed.
// Method unwrap_or takes a default value as a parameter.
fn value_or_fib44(input:Option<u128>) -> u128 {
    input.unwrap_or(expensive_function(44))
}

In [15]:
use std::time::SystemTime;
let d = SystemTime::now();
// slow
let ret = value_or_fib44(None);
let elapsed = d.elapsed().unwrap().as_millis();
println!("{} {}",ret, elapsed);

701408733 1708


In [16]:
let d = SystemTime::now();
// still slow: the default value is computed even though it is not used
let ret = value_or_fib44(Some(123));
let elapsed = d.elapsed().unwrap().as_millis();
println!("{} {}", ret, elapsed);

123 1699


## Sample application: lazy evaluation of a value
Compute a value only if needed

In [17]:
// This function computes expensive_function(44) only if needed.
// Method unwrap_or_else's parameter is a function that computes
// the default value, not the default value itself. 
fn value_or_fib44_version_2(input:Option<u128>) -> u128 {
    input.unwrap_or_else(|| expensive_function(44))
}

In [18]:
// slow
let d = SystemTime::now();
value_or_fib44_version_2(None);
let elapsed = d.elapsed().unwrap().as_millis();
println!("{}", elapsed);

1751


In [19]:
// fast
let d = SystemTime::now();
value_or_fib44_version_2(Some(1));
let elapsed = d.elapsed().unwrap().as_millis();
println!("{}", elapsed);

0
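
Instead of timing, the difference can also be shown by counting how often the default is computed. In this sketch, `expensive`, `eager_calls` and `lazy_calls` are hypothetical stand-ins, and the constant simply reuses the value printed by `expensive_function(44)`.

```rust
use std::cell::Cell;

// Stand-in for an expensive computation; it records every call in `calls`.
fn expensive(calls: &Cell<u32>) -> u128 {
    calls.set(calls.get() + 1);
    701_408_733
}

// Number of times the default was computed with eager `unwrap_or`.
fn eager_calls(input: Option<u128>) -> u32 {
    let calls = Cell::new(0);
    let _value = input.unwrap_or(expensive(&calls));
    calls.get()
}

// Number of times the default was computed with lazy `unwrap_or_else`.
fn lazy_calls(input: Option<u128>) -> u32 {
    let calls = Cell::new(0);
    let _value = input.unwrap_or_else(|| expensive(&calls));
    calls.get()
}

fn main() {
    println!("eager, Some: {}", eager_calls(Some(1)));
    println!("lazy,  Some: {}", lazy_calls(Some(1)));
    println!("lazy,  None: {}", lazy_calls(None));
}
```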


* This programming pattern appears in many places.
* Another example: default value for an entry in HashMap

In [20]:
let mut map = std::collections::HashMap::<i32,i32>::new();
map.insert(1, 1);
*map.entry(1).or_insert_with(|| expensive_function(44) as i32) *= -1;
*map.entry(2).or_insert_with(|| expensive_function(44) as i32) *= -1;
println!("{}:{:?}    {}:{:?}",1,map.get(&1),2,map.get(&2));

1:Some(-1)    2:Some(-701408733)



## In-Class Poll

https://piazza.com/class/m5qyw6267j12cj/post/368

