<div align="center">
    <h1>DS-210: Programming for Data Science</h1>
    <h1>Lecture 21</h1>
</div>

* Plotting support in Rust with Plotters
*  Data Engineering in Rust
    1. Reading CSV Files
    2. Deserializing CSV Files
    3. Cleaning CSV Files
    4. Converting CSV Data to NDArray representation

# Plotters -- Rust Drawing Library

From the [website](https://docs.rs/plotters/latest/plotters/):

> Plotters is a drawing library designed for rendering figures, plots, and charts, in pure Rust.

Full documentation is at https://docs.rs/plotters/latest/plotters/.



## Installation and Configuration

### Rust Project

In `Cargo.toml`, add the following dependency:

```toml
[dependencies]
plotters = "0.3.6"
```

### Or inside Jupyter notebook

For 2021 edition:

```
:dep plotters = { version = "^0.3.6", default_features = false, features = ["evcxr", "all_series"] }
```

For 2024 edition, `default_features` became `default-features` (dash instead of underscore):

```
:dep plotters = { version = "^0.3.6", default-features = false, features = ["evcxr", "all_series"] }
```


## Plotters Tutorial

We'll go through the [interactive tutorial](https://docs.rs/plotters/latest/plotters/#interactive-tutorial-with-jupyter-notebook),
reproduced here.

Import everything defined in `prelude` which includes `evcxr_figure()`.

In [2]:
// Rust 2021 edition syntax
//:dep plotters = { version = "^0.3.6", default_features = false, features = ["evcxr", "all_series"] }

// Rust 2024 edition syntax: the key is now spelled 'default-features'
:dep plotters = { version = "^0.3.6", default-features = false, features = ["evcxr", "all_series"] }

extern crate plotters;

// Import all the plotters prelude functions
use plotters::prelude::*;

// To create a figure that can be displayed in Jupyter notebook, use evcxr_figure function.
// The first param is the resolution of the figure.
// The second param is the closure that performs the drawing.
evcxr_figure((300, 100), |name| {
    // Do the drawings
    name.fill(&BLUE)?;
    // Tell plotters that everything is ok
    Ok(())
})


* `evcxr_figure((xsize, ysize), |name| { your code })` creates a figure of the given size and runs your drawing code on it.
* `name` is the handle for accessing the figure.

In [3]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }

extern crate plotters;
use plotters::prelude::*;

evcxr_figure((320,50), |root| {
    root.fill(&GREEN)?;
    root.draw(&Text::new("Hello World from Plotters!", (15, 15), ("Arial", 20).into_font()))?;
    Ok(())
})

## Sub-Drawing Areas

The handle that `evcxr_figure` passes to the closure is a `DrawingArea`.

`DrawingArea` is a key concept: it is the handle into which things are actually drawn.
Plotters supports different types of drawing areas depending on context.

* Inside a Jupyter notebook, the drawing area is backed by an SVG (Scalable Vector Graphics) backend
* When used from the terminal, the most common backend is `BitMapBackend`

For full documentation on what you can do with DrawingArea see https://docs.rs/plotters/latest/plotters/drawing/struct.DrawingArea.html

Key capabilities:

* fill: Fill it with a background color
* draw_mesh: Draw a mesh on it
* draw_text: Add some text in graphic form
* present: Make it visible (may not be needed for all backend types)
* titled: Add a title and return the remaining area
* split_*: Split it into subareas in a variety of ways

### Split Drawing Areas Example

We can make a [Sierpiński carpet](https://en.wikipedia.org/wiki/Sierpi%C5%84ski_carpet)
by recursively splitting the drawing area.

The Sierpiński carpet is a plane fractal first described by Wacław Sierpiński in 1916. 
The carpet is a generalization of the Cantor set to two dimensions...


In [5]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }
extern crate plotters;
use plotters::prelude::*;
use plotters::coord::Shift;

pub fn sierpinski_carpet(
    depth: u32, 
    drawing_area: &DrawingArea<SVGBackend, Shift>) -> Result<(), Box<dyn std::error::Error>> {
    if depth > 0 {
        // Split the drawing area into 9 equal parts
        let sub_areas = drawing_area.split_evenly((3,3));

        // Iterate over the sub-areas
        for (idx, sub_area) in (0..).zip(sub_areas.iter()) {
            if idx == 4 { // idx == 4 is the center sub-area
                // If the sub-area is the center one, fill it with white
                sub_area.fill(&WHITE)?;
            } else {
                sierpinski_carpet(depth - 1, sub_area)?;
            }
        }
    }
    Ok(())
}
evcxr_figure((480,480), |root| {
    root.fill(&BLACK)?;
    sierpinski_carpet(5, &root)
}).style("width: 200px")  /* You can add CSS style to the result */
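To get a feel for how much work the recursion above does, here is a small sketch in plain Rust (no Plotters needed). The function names `white_fills` and `smallest_cell_px` are made up for this illustration, and the pixel estimate assumes each split divides the side exactly by 3, which integer pixel sizes only approximate.

```rust
/// Number of `fill(&WHITE)` calls made by a carpet recursion of the given
/// depth: each level fills 1 center cell and recurses into the other 8.
fn white_fills(depth: u32) -> u64 {
    if depth == 0 {
        0
    } else {
        1 + 8 * white_fills(depth - 1)
    }
}

/// Approximate side length, in pixels, of the smallest cell after `depth`
/// splits of a square figure that is `side_px` pixels wide.
fn smallest_cell_px(side_px: f64, depth: u32) -> f64 {
    side_px / 3f64.powi(depth as i32)
}
```

For the figure above, `white_fills(5)` counts 4681 white squares, and `smallest_cell_px(480.0, 5)` is a little under 2 pixels, so going much deeper than depth 5 would not show more detail at this size.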


## Charts

Drawing areas are too basic for scientific drawings, so the next important concept is a chart.

Charts can be used to plot functions, datasets, bar graphs, scatter plots, 3D objects, and more.

Full documentation at https://docs.rs/plotters/latest/plotters/chart/struct.ChartBuilder.html and
https://docs.rs/plotters/latest/plotters/chart/struct.ChartContext.html 

In [6]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }
extern crate plotters;
use plotters::prelude::*;

evcxr_figure((640, 240), |root| {
    // The following code will create a chart context
    let mut chart = ChartBuilder::on(&root)
        // the caption for the chart
        .caption("Hello Plotters Chart Context!", ("Arial", 20).into_font())
        // the X and Y coordinate spaces for the chart
        .build_cartesian_2d(0f32..1f32, 0f32..1f32)?;
    // Then we can draw a series on it!
    chart.draw_series((1..10).map(|x|{
        let x = x as f32/10.0;
        Circle::new((x,x), 5, &RED)
    }))?;
    Ok(())
}).style("width:60%")


## Common chart components

Adding a mesh, and X and Y labels

In [7]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }
extern crate plotters;
use plotters::prelude::*;

evcxr_figure((640, 480), |root| {
    // The following code will create a chart context
    let mut chart = ChartBuilder::on(&root)
        .caption("Chart with Axis Label", ("Arial", 20).into_font())
        .x_label_area_size(80)
        .y_label_area_size(80)
        .build_cartesian_2d(0f32..1f32, 0f32..1f32)?;
    
    chart.configure_mesh()
        .x_desc("Here's the label for X")
        .y_desc("Here's the label for Y")
        .draw()?;

    // Then we can draw a series on it!
    chart.draw_series((1..10).map(|x|{
        let x = x as f32/10.0;
        Circle::new((x,x), 5, &RED)
    }))?;
    
    Ok(())
}).style("width: 60%")


Then let's disable mesh lines for the X axis

In [8]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }
extern crate plotters;
use plotters::prelude::*;

evcxr_figure((640, 480), |root| {
    // The following code will create a chart context
    let mut chart = ChartBuilder::on(&root)
        .caption("Chart Context with Mesh and Axis", ("Arial", 20).into_font())
        .x_label_area_size(40)
        .y_label_area_size(40)
        .build_cartesian_2d(0f32..1f32, 0f32..1f32)?;
    
    chart.configure_mesh()
        .y_labels(10)
        .light_line_style(&TRANSPARENT)
        .disable_x_mesh()
        .draw()?;
    
    // Then we can draw a series on it!
    chart.draw_series((1..10).map(|x|{
        let x = x as f32/10.0;
        Circle::new((x,x), 5, &RED)
    }))?;

    Ok(())
}).style("width: 60%")

## Adding subcharts

Simple: split your drawing area, then add a chart in each of the resulting sub-areas.

In [9]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }

extern crate plotters;
use plotters::prelude::*;
evcxr_figure((640, 480), |root| {
    let sub_areas = root.split_evenly((2,2));
    
    for (idx, area) in (1..).zip(sub_areas.iter()) {
        // The following code will create a chart context
        let mut chart = ChartBuilder::on(&area)
            .caption(format!("Subchart #{}", idx), ("Arial", 15).into_font())
            .x_label_area_size(40)
            .y_label_area_size(40)
            .build_cartesian_2d(0f32..1f32, 0f32..1f32)?;

        chart.configure_mesh()
            .y_labels(10)
            .light_line_style(&TRANSPARENT)
            .disable_x_mesh()
            .draw()?;

        // Then we can draw a series on it!
        chart.draw_series((1..10).map(|x|{
            let x = x as f32/10.0;
            Circle::new((x,x), 5, &RED)
        }))?;
    }

    Ok(())
}).style("width: 60%")


## Drawing on Charts with the Series Abstraction

* Unlike most plotting libraries, `Plotters` doesn't actually define any types of chart.

* All charts are abstracted to the concept of a series.
    * By doing so, you can put a histogram series and a line plot series into the same chart context.

* A series is defined as an iterator of elements.

This gives `Plotters` a great deal of flexibility in drawing charts. You can implement your own types of series and still use the coordinate translation and chart elements.

There are a few types of predefined series, just for convenience:

- Line Series
- Histogram
- Point Series


## Scatter Plot

First, generate random numbers

In [10]:
:dep rand = { version = "0.6.5" }
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series"] }

extern crate rand;

use rand::distributions::Normal;
use rand::distributions::Distribution;
use rand::thread_rng;
let sd = 0.13;
let random_points:Vec<(f64,f64)> = {
    let mut norm_dist = Normal::new(0.5, sd);
    let (mut x_rand, mut y_rand) = (thread_rng(), thread_rng());
    let x_iter = norm_dist.sample_iter(&mut x_rand);
    let y_iter = norm_dist.sample_iter(&mut y_rand);
    x_iter.zip(y_iter).take(1000).collect()
};
println!("{}", random_points.len());


1000


To draw the series, we provide an iterator on the elements and then map a closure.

In [11]:

extern crate plotters;
use plotters::prelude::*;

evcxr_figure((480, 480), |root| {
    // The following code will create a chart context
    let mut chart = ChartBuilder::on(&root)
        .caption("Normal Distribution w/ 2 sigma", ("Arial", 20).into_font())
        .x_label_area_size(40)
        .y_label_area_size(40)
        .build_cartesian_2d(0f64..1f64, 0f64..1f64)?;
    
    chart.configure_mesh()
        .disable_x_mesh()
        .disable_y_mesh()
        .draw()?;
    
    // Draw little green circles. Remember that closures can capture variables from the enclosing scope
    chart.draw_series(random_points.iter().map(|(x,y)| Circle::new((*x,*y), 3, GREEN.filled())))?;
    
    // You can always freely draw on the drawing backend.  So we can add background after the fact
    let area = chart.plotting_area();
    let two_sigma = sd * 2.0;
    let chart_width = 480;
    let radius = two_sigma * chart_width as f64;
    area.draw(&Circle::new((0.5, 0.5), radius, RED.mix(0.3).filled()))?;
    area.draw(&Cross::new((0.5, 0.5), 5, &RED))?;
    
    Ok(())
}).style("width:60%")
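Note that in the cell above the circle's center is given in data coordinates while its radius is given in pixels, which is why `two_sigma` is multiplied by the chart width. A minimal sketch of that conversion, assuming a linear axis; the helper name `data_len_to_px` is hypothetical and not part of Plotters:

```rust
/// Convert a length in data units to pixels for a linear axis that maps
/// `axis_min..axis_max` onto `axis_px` pixels.
fn data_len_to_px(len: f64, axis_min: f64, axis_max: f64, axis_px: f64) -> f64 {
    len * axis_px / (axis_max - axis_min)
}
```

With `sd = 0.13` the two-sigma length is 0.26, so `data_len_to_px(0.26, 0.0, 1.0, 480.0)` gives about 124.8 pixels. The cell uses the full figure width of 480; since the Y label area takes 40 of those pixels, passing 440 instead (about 114.4 pixels) is probably closer to the real plotting area.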


## Histograms

We can also draw histograms, using the predefined histogram series struct to build them easily. The following code demonstrates how to create histograms for both the X and Y values of `random_points`.

In [12]:
// Rust 2021
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series", "all_elements"] }

// Rust 2024
:dep plotters = { version = "^0.3.0", default-features = false, features = ["evcxr", "all_series", "all_elements"] }


extern crate plotters;
use plotters::prelude::*;

evcxr_figure((640, 480), |root| {
    let areas = root.split_evenly((2,1));
    let mut charts = vec![];
    
    // The following code will create a chart context
    for (area, name) in areas.iter().zip(["X", "Y"].into_iter()) {
        let mut chart = ChartBuilder::on(&area)
            .caption(format!("Histogram for {}", name), ("Arial", 20).into_font())
            .x_label_area_size(40)
            .y_label_area_size(40)
            .build_cartesian_2d(0u32..100u32, 0f64..0.5f64)?;
        chart.configure_mesh()
            .disable_x_mesh()
            .disable_y_mesh()
            .y_labels(5)
            .x_label_formatter(&|x| format!("{:.1}", *x as f64 / 100.0))
            .y_label_formatter(&|y| format!("{}%", (*y * 100.0) as u32))
            .draw()?;
        charts.push(chart);
    }
    // Histogram is just another series but a nicely encapsulated one
    let hist_x = Histogram::vertical(&charts[0])
        .style(RED.filled())
        .margin(0)
        .data(random_points.iter().map(|(x,_)| ((x*100.0) as u32, 0.01)));
    
    let hist_y = Histogram::vertical(&charts[1])
        .style(GREEN.filled())
        .margin(0)
        .data(random_points.iter().map(|(_,y)| ((y*100.0) as u32, 0.01)));
    
    charts[0].draw_series(hist_x)?;
    charts[1].draw_series(hist_y)?;
    
    Ok(())
}).style("width:60%")
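The `.data(...)` calls above feed the histogram `(bin, weight)` pairs. As far as I understand it, the series then adds up the weights that share a bin. Here is a plain-Rust sketch of that bookkeeping; `bin_weights` is a hypothetical helper, not a Plotters function:

```rust
use std::collections::BTreeMap;

/// Sum a fixed `weight` into integer bins, mirroring the `(bin, 0.01)`
/// pairs that the cell above passes to the histogram series.
fn bin_weights(values: &[f64], bins_per_unit: f64, weight: f64) -> BTreeMap<u32, f64> {
    let mut bins = BTreeMap::new();
    for v in values {
        // `as u32` truncates toward zero and saturates, so negative
        // values end up in bin 0.
        let bin = (v * bins_per_unit) as u32;
        *bins.entry(bin).or_insert(0.0) += weight;
    }
    bins
}
```

With 1000 points and a weight of 0.01, a bar of height 0.30 corresponds to 30 points. Using `1.0 / n` as the weight would instead make the bar heights sum to 1. Samples outside `0.0..1.0` are possible with a normal distribution; they either saturate into bin 0 or land beyond the charted `0..100` range.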


## Fancy combination of histogram and scatter

Split the drawing area into a 2x2 grid and use three of the cells to draw two histograms and a scatter plot

In [13]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series", "all_elements"] }

use plotters::prelude::*;

evcxr_figure((640, 480), |root| {
    let root = root.titled("Scatter with Histogram Example", ("Arial", 20).into_font())?;
    
    // Split the drawing area into a grid with specified X and Y breakpoints
    let areas = root.split_by_breakpoints([560], [80]);

    let mut x_hist_ctx = ChartBuilder::on(&areas[0])
        .y_label_area_size(40)
        .build_cartesian_2d(0u32..100u32, 0f64..0.5f64)?;
    let mut y_hist_ctx = ChartBuilder::on(&areas[3])
        .x_label_area_size(40)
        .build_cartesian_2d(0f64..0.5f64, 0..100u32)?;
    let mut scatter_ctx = ChartBuilder::on(&areas[2])
        .x_label_area_size(40)
        .y_label_area_size(40)
        .build_cartesian_2d(0f64..1f64, 0f64..1f64)?;
    scatter_ctx.configure_mesh()
        .disable_x_mesh()
        .disable_y_mesh()
        .draw()?;
    scatter_ctx.draw_series(random_points.iter().map(|(x,y)| Circle::new((*x,*y), 3, GREEN.filled())))?;
    let x_hist = Histogram::vertical(&x_hist_ctx)
        .style(RED.filled())
        .margin(0)
        .data(random_points.iter().map(|(x,_)| ((x*100.0) as u32, 0.01)));
    let y_hist = Histogram::horizontal(&y_hist_ctx)
        .style(GREEN.filled())
        .margin(0)
        .data(random_points.iter().map(|(_,y)| ((y*100.0) as u32, 0.01)));
    x_hist_ctx.draw_series(x_hist)?;
    y_hist_ctx.draw_series(y_hist)?;
    
    Ok(())
}).style("width:60%")
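It can be hard to see which index in `areas` is which cell. The sketch below computes the rectangles for a set of breakpoints in plain Rust. It assumes the cells are listed row by row, which is consistent with how the cell above uses `areas[0]`, `areas[2]` and `areas[3]`, and it ignores the strip taken by the title. `grid_cells` is a made-up helper, not the Plotters implementation.

```rust
/// Rectangles `(x0, y0, x1, y1)` produced by cutting a `width` x `height`
/// area at the given breakpoints, listed row by row.
fn grid_cells(width: i32, height: i32, xs: &[i32], ys: &[i32]) -> Vec<(i32, i32, i32, i32)> {
    // Add the outer edges to the breakpoints.
    let mut x_edges = vec![0];
    x_edges.extend_from_slice(xs);
    x_edges.push(width);
    let mut y_edges = vec![0];
    y_edges.extend_from_slice(ys);
    y_edges.push(height);

    // Walk the rows top to bottom, and the columns left to right in each row.
    let mut cells = Vec::new();
    for row in y_edges.windows(2) {
        for col in x_edges.windows(2) {
            cells.push((col[0], row[0], col[1], row[1]));
        }
    }
    cells
}
```

For `grid_cells(640, 480, &[560], &[80])`, cell 0 is the wide, short strip at the top left (the X histogram), cell 2 is the large area at the bottom left (the scatter plot), and cell 3 is the narrow strip at the bottom right (the Y histogram). Cell 1, the small top-right corner, is left empty.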


## Drawing Lines

This still uses the `draw_series` call, with `LineSeries` as a convenient wrapper.

In [14]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series", "all_elements"] }

use plotters::prelude::*;

evcxr_figure((640, 480), |root_area| {
    root_area.fill(&WHITE)?;

    let root_area = root_area.titled("Line Graph", ("sans-serif", 60))?;

    let x_axis = (-3.4f32..3.4).step(0.1);

    let mut cc = ChartBuilder::on(&root_area)
        .margin(5)
        .set_all_label_area_size(50)
        .caption("Sine and Cosine", ("sans-serif", 40))
        .build_cartesian_2d(-3.4f32..3.4, -1.2f32..1.2f32)?;

    cc.configure_mesh()
        .x_labels(20)
        .y_labels(10)
        .disable_mesh()
        .x_label_formatter(&|v| format!("{:.1}", v))
        .y_label_formatter(&|v| format!("{:.1}", v))
        .draw()?;

    cc.draw_series(LineSeries::new(x_axis.values().map(|x| (x, x.sin())), &RED))?
        .label("Sine")
        .legend(|(x, y)| PathElement::new(vec![(x, y), (x + 20, y)], RED));

    cc.draw_series(LineSeries::new(x_axis.values().map(|x| (x, x.cos())), &BLUE))?
        .label("Cosine")
        .legend(|(x, y)| PathElement::new(vec![(x, y), (x + 20, y)], BLUE));

    cc.configure_series_labels().border_style(BLACK).draw()?;

    Ok(())
}).style("width:60%")

## 3D Plotting

The big difference is in the `ChartBuilder` call: instead of `build_cartesian_2d` we use `build_cartesian_3d`.

Unlike 2D plots, 3D plots use the `configure_axes` function to configure the chart components.


In [15]:
//:dep plotters = { version = "^0.3.0", default_features = false, features = ["evcxr", "all_series", "all_elements"] }

use plotters::prelude::*;

evcxr_figure((640, 480), |root| {
    let root = root.titled("3D Plotting", ("Arial", 20).into_font())?;
    
    let mut chart = ChartBuilder::on(&root)
        .build_cartesian_3d(-10.0..10.0, -10.0..10.0, -10.0..10.0)?;
    
    chart.configure_axes().draw()?;
    
    // Draw a red circle in the XOZ plane
    chart.draw_series(LineSeries::new(
        (-314..314).map(|a| a as f64 / 100.0).map(|a| (8.0 * a.cos(), 0.0, 8.0 *a.sin())),
        &RED,
    ))?;
    // Draw a green circle in the YOZ plane
    chart.draw_series(LineSeries::new(
        (-314..314).map(|a| a as f64 / 100.0).map(|a| (0.0, 8.0 * a.cos(), 8.0 *a.sin())),
        &GREEN,
    ))?;
    
    Ok(())
})


## For more examples check
https://plotters-rs.github.io/plotters-doc-data/evcxr-jupyter-integration.html

## What about using it from the terminal?

The key difference is in how you define your drawing area.

* Inside a Jupyter notebook we create a drawing area using `evcxr_figure`

* In the terminal context we create a drawing area using one of the following:

```rust
let root = BitMapBackend::new("0.png", (640, 480)).into_drawing_area();

// or

let root = SVGBackend::new("0.svg", (1024, 768)).into_drawing_area();

// or

let root = BitMapBackend::gif("0.gif", (600, 400), 100)?.into_drawing_area();
```

Let's take a look at the terminal example ([demo](./demo/)).

## What if you don't want output to a file or a browser, but a standalone application?

Things get very messy and machine-specific there. You need to integrate with the underlying OS graphics libraries. For macOS and Linux this is the `CairoBackend` library, but I don't know what it is for Windows.

Here's an example from the terminal using GTK.

On MacOS, install these dependencies first:

```sh
brew install gtk4
brew install pkg-config
```

Then `cargo run` in ([plotters-gtk-demo](./plotters-gtk-demo/)).

# CSV Files and Basic Data Engineering


1. Reading CSV Files
2. Deserializing CSV Files
3. Cleaning CSV Files
4. Converting CSV Data to NDArray representation

* By default, the `csv` reader generates `StringRecord`s, which are structs containing an array of strings

* Missing fields will be represented as empty strings

In [16]:
:dep csv = { version = "^1.3" }

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.records() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}


StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
StringRecord(["Oakman", "AL", "", "33.7133333", "-87.3886111"])
StringRecord(["Richards Crossroads", "AL", "", "31.7369444", "-85.2644444"])
StringRecord(["Sandfort", "AL", "", "32.3380556", "-85.2233333"])


()

### What if there are malformed records with mismatched fields?

In [17]:
:dep csv = { version = "^1.3" }

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.records() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}


StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])



thread '<unnamed>' panicked at src/lib.rs:164:25:
a CSV record: Error(UnequalLengths { pos: Some(Position { byte: 125, line: 4, record: 3 }), expected_len: 5, len: 8 })
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: std::panic::catch_unwind
   4: _run_user_code_16
   5: evcxr::runtime::Runtime::run_loop
   6: evcxr::runtime::runtime_hook
   7: evcxr_jupyter::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


### Let's make this safe for malformed records.  Match statements to the rescue

In [18]:
:dep csv = { version = "^1.3" }

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.records() {
    // An error may occur, so report it and move on to the next record.
    match result {
        Ok(record) => { 
          if count < 5 {
              println!("{:?}", record);
          }
          count += 1; 
        },
        Err(err) => {
            println!("error reading CSV record {}", err);
        }  
    }
}

StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
error reading CSV record CSV error: record 3 (line: 4, byte: 125): found record with 8 fields, but the previous record has 5 fields
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])


()
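To see where the "found record with 8 fields" error comes from, here is a deliberately naive sketch that does a similar check by hand. It splits on commas only and does not handle quoting or escaping, which is one reason to use the `csv` crate instead. The name `check_field_counts` is hypothetical, and the expected count is taken from the first line as an assumption.

```rust
/// Split each line on commas and flag lines whose field count differs
/// from the first line. Naive: no quoting or escaping is handled.
fn check_field_counts(lines: &[&str]) -> Vec<Result<Vec<String>, String>> {
    // The first line (for example the header) sets the expected count.
    let expected = lines.first().map_or(0, |l| l.split(',').count());
    lines
        .iter()
        .enumerate()
        .map(|(i, line)| {
            let fields: Vec<String> = line.split(',').map(String::from).collect();
            if fields.len() == expected {
                Ok(fields)
            } else {
                Err(format!(
                    "line {}: found {} fields, expected {}",
                    i + 1,
                    fields.len(),
                    expected
                ))
            }
        })
        .collect()
}
```

Each element of the result is an `Ok` or an `Err`, so the caller can use the same `match` as above to keep the good rows and report the bad ones.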

### If your csv file has headers and you want to access them then you can use the headers function

By default, the first row is treated as a special header row.

In [2]:
:dep csv = { version = "^1.3" }
{
let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Read and print the header row.
let headers = rdr.headers()?;
println!("Headers: {:?}", headers);

// Loop over each record.
for result in rdr.records() {
    // An error may occur, so report it and move on to the next record.
    match result {
        Ok(record) => { 
          if count < 5 {
              println!("{:?}", record);
          }
          count += 1; 
        },
        Err(err) => {
            println!("error reading CSV record {}", err);
        }  
    }
}
}

Headers: StringRecord(["City", "State", "Population", "Latitude", "Longitude"])
StringRecord(["Davidsons Landing", "AK", "", "65.2419444", "-165.2716667"])
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])
error reading CSV record CSV error: record 3 (line: 4, byte: 125): found record with 8 fields, but the previous record has 5 fields
StringRecord(["Kenai", "AK", "7610", "60.5544444", "-151.2583333"])


()

### You can customize your reader in many ways:
```rust
let mut rdr = csv::ReaderBuilder::new()
        .has_headers(false)
        .delimiter(b';')
        .double_quote(false)
        .escape(Some(b'\\'))
        .flexible(true)
        .comment(Some(b'#'))
        .from_path("Some path");
```

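What is the difference between a `ReaderBuilder` and a `Reader`? `Reader::from_path` gives you a reader with the default settings, while `ReaderBuilder` lets you change those settings before it builds the `Reader`.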

## 2. Deserializing CSV Files

StringRecords are not particularly useful in computation.  They typically have to be converted to floats or integers before we can work with them.
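Done by hand, that conversion might look like the sketch below. It assumes the five-column layout of `uspop.csv` shown earlier (city, state, population, latitude, longitude); the names `TypedRow` and `parse_row` are invented for this example.

```rust
/// A typed row: city, state, optional population, latitude, longitude.
type TypedRow = (String, String, Option<u64>, f64, f64);

/// Convert five string fields into typed values.
/// An empty population field becomes `None`.
fn parse_row(fields: &[&str]) -> Result<TypedRow, String> {
    if fields.len() != 5 {
        return Err(format!("expected 5 fields, found {}", fields.len()));
    }
    let population = if fields[2].is_empty() {
        None
    } else {
        Some(fields[2].parse::<u64>().map_err(|e| e.to_string())?)
    };
    let latitude = fields[3].parse::<f64>().map_err(|e| e.to_string())?;
    let longitude = fields[4].parse::<f64>().map_err(|e| e.to_string())?;
    Ok((
        fields[0].to_string(),
        fields[1].to_string(),
        population,
        latitude,
        longitude,
    ))
}
```

This works, but it is tedious and depends on the column order. Deserialization does the same conversion for us.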


You can deserialize your CSV data into either:

- a record with types you define, or

- a `HashMap` of key-value pairs

### Custom Record

In [None]:
:dep csv = { version = "^1.3" }
use std::collections::HashMap;

type StrRecord = (String, String, Option<u64>, f64, f64);

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:StrRecord = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}


("Davidsons Landing", "AK", None, 65.2419444, -165.2716667)
("Kenai", "AK", Some(7610), 60.5544444, -151.2583333)
("Oakman", "AL", None, 33.7133333, -87.3886111)
("Richards Crossroads", "AL", None, 31.7369444, -85.2644444)
("Sandfort", "AL", None, 32.3380556, -85.2233333)


()

### HashMap

In [5]:
:dep csv = { version = "^1.3" }
use std::collections::HashMap;

type Record = HashMap<String, String>;

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:Record = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}

{"State": "AK", "City": "Davidsons Landing", "Latitude": "65.2419444", "Population": "", "Longitude": "-165.2716667"}
{"City": "Kenai", "Latitude": "60.5544444", "State": "AK", "Longitude": "-151.2583333", "Population": "7610"}
{"State": "AL", "Latitude": "33.7133333", "Longitude": "-87.3886111", "Population": "", "City": "Oakman"}
{"Population": "", "City": "Richards Crossroads", "State": "AL", "Latitude": "31.7369444", "Longitude": "-85.2644444"}
{"Latitude": "32.3380556", "State": "AL", "Longitude": "-85.2233333", "City": "Sandfort", "Population": ""}


()
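Two things stand out in the output above: the keys appear in a different order on every row, and every value is still a string. The order varies because `HashMap` does not keep keys in any particular order. A plain-Rust sketch of the same idea, with hypothetical helper names `row_to_map` and `sorted_view`:

```rust
use std::collections::{BTreeMap, HashMap};

/// Pair each header with the field in the same position.
fn row_to_map(headers: &[&str], fields: &[&str]) -> HashMap<String, String> {
    headers
        .iter()
        .zip(fields.iter())
        .map(|(h, f)| (h.to_string(), f.to_string()))
        .collect()
}

/// The same data viewed through a `BTreeMap`, which iterates in sorted key order.
fn sorted_view(map: &HashMap<String, String>) -> BTreeMap<&str, &str> {
    map.iter().map(|(k, v)| (k.as_str(), v.as_str())).collect()
}
```

If you want stable output while debugging, printing `sorted_view(&record)` is one option. The values would still need to be parsed before you can compute with them.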

### This works, but it is hard to tell which type is associated with which CSV field
### You can do better by using serde and structs

In [6]:
:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]  // derive the Deserialize trait
#[serde(rename_all = "PascalCase")]
struct SerRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,  // account for the fact that some records have no population
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let mut count = 0;

// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:SerRecord = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}


The type of the variable rdr was redefined, so was lost.


SerRecord { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
SerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
SerRecord { latitude: 33.7133333, longitude: -87.3886111, population: None, city: "Oakman", state: "AL" }
SerRecord { latitude: 31.7369444, longitude: -85.2644444, population: None, city: "Richards Crossroads", state: "AL" }
SerRecord { latitude: 32.3380556, longitude: -85.2233333, population: None, city: "Sandfort", state: "AL" }


()

### What about deserializing with invalid data?

In [7]:
:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct FSerRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize() {
    // An error may occur, so abort the program in an unfriendly way.
    // We will make this more friendly later!
    let record:FSerRecord = result.expect("a CSV record");
    // Print a debug version of the record.
    if count < 5 {
        println!("{:?}", record);
    }
    count += 1;
}


FSerRecord { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
FSerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }



thread '<unnamed>' panicked at src/lib.rs:168:36:
a CSV record: Error(UnequalLengths { pos: Some(Position { byte: 125, line: 4, record: 3 }), expected_len: 5, len: 8 })
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: std::panic::catch_unwind
   4: _run_user_code_7
   5: evcxr::runtime::Runtime::run_loop
   6: evcxr::runtime::runtime_hook
   7: evcxr_jupyter::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


### Deserialization failed so we need to deal with bad records just like before.  Match statement to the rescue

In [3]:
:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct GSerRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("usbad.csv").unwrap();
let mut count = 0;

// Loop over each record.
// We need to specify the type we are deserializing to because the compiler
// cannot infer it from the match statement.
for result in rdr.deserialize::<GSerRecord>() {
    // An error may occur, so report it and move on to the next record.
    match result {
        Ok(record) => {
            // Print a debug version of the record.
            if count < 5 {
                println!("{:?}", record);
            }
            count += 1;
        },
        Err(err) => {
            println!("{}", err);
        }
    }
}


GSerRecord { latitude: 65.2419444, longitude: -165.2716667, population: None, city: "Davidsons Landing", state: "AK" }
GSerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }
CSV error: record 3 (line: 4, byte: 125): found record with 8 fields, but the previous record has 5 fields
GSerRecord { latitude: 60.5544444, longitude: -151.2583333, population: Some(7610), city: "Kenai", state: "AK" }


()

### Some more complex work.  Let's filter cities over a population threshold

In [4]:
:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We don't need to derive `Debug` (which doesn't require Serde), but it's a
// good habit to do it for all your types.
//
// Notice that the field names in this struct are NOT in the same order as
// the fields in the CSV data!
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct FilterRecord {
    latitude: f64,
    longitude: f64,
    population: Option<u64>,
    city: String,
    state: String,
}

let mut rdr = csv::Reader::from_path("uspop.csv").unwrap();
let minimum_pop: u64 = 50_000;
// Loop over each record.
for result in rdr.deserialize::<FilterRecord>() {
    // An error may occur, so report it and move on to the next record.
    match result {
        Ok(record) => {
            // `map_or` is a combinator on `Option`. It take two parameters:
            // a value to use when the `Option` is `None` (i.e., the record has
            // no population count) and a closure that returns another value of
            // the same type when the `Option` is `Some`. In this case, we test it
            // against our minimum population count that we got from the command
            // line.
            if record.population.map_or(false, |pop| pop >= minimum_pop) {
                println!("{:?}", record);
            }
        },
        Err(err) => {
            println!("{}", err);
        }
    }
}


FilterRecord { latitude: 34.0738889, longitude: -117.3127778, population: Some(52335), city: "Colton", state: "CA" }
FilterRecord { latitude: 34.0922222, longitude: -117.4341667, population: Some(169160), city: "Fontana", state: "CA" }
FilterRecord { latitude: 33.7091667, longitude: -117.9527778, population: Some(56133), city: "Fountain Valley", state: "CA" }
FilterRecord { latitude: 37.4283333, longitude: -121.9055556, population: Some(62636), city: "Milpitas", state: "CA" }
FilterRecord { latitude: 33.4269444, longitude: -117.6111111, population: Some(62272), city: "San Clemente", state: "CA" }
FilterRecord { latitude: 41.1669444, longitude: -73.2052778, population: Some(139090), city: "Bridgeport", state: "CT" }
FilterRecord { latitude: 34.0230556, longitude: -84.3616667, population: Some(77218), city: "Roswell", state: "GA" }
FilterRecord { latitude: 39.7683333, longitude: -86.1580556, population: Some(773283), city: "Indianapolis", state: "IN" }
FilterRecord { latitude: 45.12, lon

()
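
To see `map_or` in isolation, here is a small std-only sketch of the same filter without the CSV reader. The helper names `meets_minimum` and `filter_cities` are made up for this illustration, and the `(name, population)` tuples stand in for the deserialized records.

```rust
/// True only when a population is present and meets the threshold.
/// A missing population (`None`) falls back to the default value, `false`.
fn meets_minimum(population: Option<u64>, minimum_pop: u64) -> bool {
    population.map_or(false, |pop| pop >= minimum_pop)
}

/// Keep the names of the cities that pass the filter.
fn filter_cities(cities: &[(&str, Option<u64>)], minimum_pop: u64) -> Vec<String> {
    cities
        .iter()
        .filter(|(_, pop)| meets_minimum(*pop, minimum_pop))
        .map(|(name, _)| name.to_string())
        .collect()
}
```

With a threshold of `50_000`, a city with `None` or `Some(7610)` is dropped and a city with `Some(52335)` is kept, which matches the behavior of the cell above.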

## Cleaning CSV Files

Once each row has been deserialized into a record, you can push the records to a vector and then iterate over the vector to fix them up.
Typed deserialization does not work well when the fields themselves are malformed, as the errors in the output of the next cell show.

In [4]:
:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We derive `Debug` (which doesn't require Serde) so that records can be
// printed with `{:?}`. It's a good habit to do it for all your types.
//
// Serde matches the struct fields to the CSV columns by header name.
// Every numeric field is typed here, so any malformed value causes an error.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct DirtyRecord {
    CustomerNumber: Option<u32>,
    CustomerName: String,
    S2016: Option<f64>,
    S2017: Option<f64>,
    PercentGrowth: Option<f64>,
    JanUnits:Option<u64>,
    Month: Option<u8>,
    Day: Option<u8>,
    Year: Option<u16>,
    Active: String,
}

let mut rdr = csv::Reader::from_path("sales_data_types.csv").unwrap();
let mut count = 0;
// Loop over each record.
for result in rdr.deserialize::<DirtyRecord>() {
    // A row may fail to deserialize. For now we print the error and move on
    // to the next row. We will make this more friendly later!
    match result {
        Ok(record) => {
            // Print a debug version of the record.
            if count < 5 {
                println!("{:?}", record);
            }
            count += 1;
        },
        Err(err) => {
            println!("{}", err);
        }
    }
}


CSV deserialize error: record 1 (line: 2, byte: 85): field 0: invalid digit found in string
CSV deserialize error: record 2 (line: 3, byte: 161): field 2: invalid float literal
CSV deserialize error: record 3 (line: 4, byte: 236): field 2: invalid float literal
CSV deserialize error: record 4 (line: 5, byte: 305): field 2: invalid float literal
CSV deserialize error: record 5 (line: 6, byte: 370): field 2: invalid float literal


()
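
The messages above ("invalid digit found in string", "invalid float literal") read like the errors that the standard library's `str::parse` produces. A quick way to predict which fields will fail is to try the same parse by hand. The generic helper `parses_as` below is a hypothetical name for this sketch, and the sample strings are the kind of values that appear in this file.

```rust
use std::str::FromStr;

/// Report whether a raw CSV field parses cleanly as the target type `T`.
fn parses_as<T: FromStr>(field: &str) -> bool {
    field.parse::<T>().is_ok()
}
```

For example, `parses_as::<u32>("10002.0")` is `false` because of the decimal point, and `parses_as::<f64>("$125,000.00")` is `false` because of the `$` and the comma, while `parses_as::<u64>("500")` is `true`.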

An alternative is to read everything as Strings and clean them up using String methods.

In [6]:
:dep csv = { version = "^1.3" }
:dep serde = { version = "^1", features = ["derive"] }

// This lets us write `#[derive(Deserialize)]`.
use serde::Deserialize;

// We derive `Debug` (which doesn't require Serde) so that records can be
// printed with `{:?}`. It's a good habit to do it for all your types.
//
// Every field is read as a `String` so that deserialization cannot fail on
// malformed numbers. We convert to proper types in a separate cleaning step.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "PascalCase")]
struct DirtyRecord {
    CustomerNumber: String,
    CustomerName: String,
    S2016: String,
    S2017: String,
    PercentGrowth: String,
    JanUnits:String,
    Month: String,
    Day: String,
    Year: String,
    Active: String,
}

#[derive(Debug, Default)]
struct CleanRecord {
    CustomerNumber: u64,
    CustomerName: String,
    S2016: f64,
    S2017: f64,
    PercentGrowth: f32,
    JanUnits:u64,
    Month: u8,
    Day: u8,
    Year: u16,
    Active: bool,

}

fn cleanRecord(r: DirtyRecord) -> CleanRecord {
    // Note: every `unwrap()` below panics if a field is still malformed
    // after cleaning.
    CleanRecord {
        // Values such as "10002.0" are parsed as a float first, then truncated.
        CustomerNumber: r.CustomerNumber.trim_matches('"').parse::<f64>().unwrap() as u64,
        // Strip the currency symbol and thousands separators before parsing.
        S2016: r.S2016.replace('$', "").replace(',', "").parse::<f64>().unwrap(),
        S2017: r.S2017.replace('$', "").replace(',', "").parse::<f64>().unwrap(),
        // Convert a percentage such as "30.00%" into the fraction 0.30.
        PercentGrowth: r.PercentGrowth.replace('%', "").parse::<f32>().unwrap() / 100.0,
        // Non-numeric entries such as "Closed" become 0.
        JanUnits: r.JanUnits.parse::<u64>().unwrap_or(0),
        Month: r.Month.parse::<u8>().unwrap(),
        Day: r.Day.parse::<u8>().unwrap(),
        Year: r.Year.parse::<u16>().unwrap(),
        Active: r.Active == "Y",
        // We own `r`, so the name can be moved instead of cloned.
        CustomerName: r.CustomerName,
    }
}

fn process_csv_file() -> Vec<CleanRecord> {
    let mut rdr = csv::Reader::from_path("sales_data_types.csv").unwrap();
    let mut v:Vec<DirtyRecord> = Vec::new();
    // Loop over each record.
    for result in rdr.deserialize::<DirtyRecord>() {
        // A row may fail to deserialize. For now we print the error and
        // move on to the next row.
        match result {
            Ok(record) => {
                // Print a debug version of the record.
                println!("{:?}", record);
                v.push(record);
            },
            Err(err) => {
                println!("{}", err);
            }
        }
    }

    println!("");

    let mut cleanv: Vec<CleanRecord> = Vec::new();
    for r in v {
        let cleanrec = cleanRecord(r);
        println!("{:?}", cleanrec);
        cleanv.push(cleanrec);
    }
    return cleanv;
}

process_csv_file();

DirtyRecord { CustomerNumber: "10002.0", CustomerName: "QuestIndustries", S2016: "$125,000.00", S2017: "$162500.00", PercentGrowth: "30.00%", JanUnits: "500", Month: "1", Day: "10", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "552278", CustomerName: "SmithPlumbing", S2016: "$920,000.00", S2017: "$101,2000.00", PercentGrowth: "10.00%", JanUnits: "700", Month: "6", Day: "15", Year: "2014", Active: "Y" }
DirtyRecord { CustomerNumber: "23477", CustomerName: "ACMEIndustrial", S2016: "$50,000.00", S2017: "$62500.00", PercentGrowth: "25.00%", JanUnits: "125", Month: "3", Day: "29", Year: "2016", Active: "Y" }
DirtyRecord { CustomerNumber: "24900", CustomerName: "BrekkeLTD", S2016: "$350,000.00", S2017: "$490000.00", PercentGrowth: "4.00%", JanUnits: "75", Month: "10", Day: "27", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "651029", CustomerName: "HarborCo", S2016: "$15,000.00", S2017: "$12750.00", PercentGrowth: "-15.00%", JanUnits: "Closed", Month: "2", Day: "2", 
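
The cleaning function in the cell above calls `unwrap()` on most fields, so a single bad value panics. One possible friendlier variant returns a `Result` and lets the caller decide what to do with a bad row. This is only a sketch: the helper names (`parse_currency`, `parse_percent`, `parse_units`, `clean_sales`) are hypothetical, it covers just three of the fields, and it uses `f64` for the growth fraction where the cell above uses `f32`.

```rust
use std::error::Error;

/// Parse a currency string such as "$125,000.00" into a float.
fn parse_currency(raw: &str) -> Result<f64, Box<dyn Error>> {
    let cleaned = raw.trim().replace('$', "").replace(',', "");
    Ok(cleaned.parse::<f64>()?)
}

/// Parse a percentage such as "30.00%" into a fraction (0.30).
fn parse_percent(raw: &str) -> Result<f64, Box<dyn Error>> {
    let cleaned = raw.trim().replace('%', "");
    Ok(cleaned.parse::<f64>()? / 100.0)
}

/// Parse a unit count, treating non-numeric markers such as "Closed" as zero.
fn parse_units(raw: &str) -> u64 {
    raw.trim().parse::<u64>().unwrap_or(0)
}

/// Clean the three numeric sales fields of one row.
/// The `?` operator returns the first parsing failure to the caller.
fn clean_sales(
    s2016: &str,
    s2017: &str,
    growth: &str,
) -> Result<(f64, f64, f64), Box<dyn Error>> {
    Ok((
        parse_currency(s2016)?,
        parse_currency(s2017)?,
        parse_percent(growth)?,
    ))
}
```

A caller can then `match` on the result of `clean_sales`, pushing the cleaned values on `Ok` and logging or skipping the row on `Err`, in the same style as the deserialization loop.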

## Let's convert the vector of structs to an ndarray that can be fed into other libraries

Remember that an ndarray must contain elements of a single type, so make sure the "columns" you pick share a type, or convert them appropriately.

In [7]:
:dep ndarray = { version = "^0.15.6" }
use ndarray::Array2;

let cleanv = process_csv_file();
let mut flat_values: Vec<f64> = Vec::new();
for s in &cleanv {
    flat_values.push(s.S2016);
    flat_values.push(s.S2017);
    flat_values.push(s.PercentGrowth as f64);
}
let array = Array2::from_shape_vec((cleanv.len(), 3), flat_values).expect("Error creating ndarray");
println!("{:?}", array);


DirtyRecord { CustomerNumber: "10002.0", CustomerName: "QuestIndustries", S2016: "$125,000.00", S2017: "$162500.00", PercentGrowth: "30.00%", JanUnits: "500", Month: "1", Day: "10", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "552278", CustomerName: "SmithPlumbing", S2016: "$920,000.00", S2017: "$101,2000.00", PercentGrowth: "10.00%", JanUnits: "700", Month: "6", Day: "15", Year: "2014", Active: "Y" }
DirtyRecord { CustomerNumber: "23477", CustomerName: "ACMEIndustrial", S2016: "$50,000.00", S2017: "$62500.00", PercentGrowth: "25.00%", JanUnits: "125", Month: "3", Day: "29", Year: "2016", Active: "Y" }
DirtyRecord { CustomerNumber: "24900", CustomerName: "BrekkeLTD", S2016: "$350,000.00", S2017: "$490000.00", PercentGrowth: "4.00%", JanUnits: "75", Month: "10", Day: "27", Year: "2015", Active: "Y" }
DirtyRecord { CustomerNumber: "651029", CustomerName: "HarborCo", S2016: "$15,000.00", S2017: "$12750.00", PercentGrowth: "-15.00%", JanUnits: "Closed", Month: "2", Day: "2", 
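
`Array2::from_shape_vec((nrows, ncols), v)` reads the flat vector in row-major order by default and returns an error when the vector length does not equal `nrows * ncols`. The std-only sketch below shows that layout with plain vectors. The names `flatten_rows` and `at` are hypothetical helpers for this illustration, not part of ndarray.

```rust
/// Flatten rows of equal length into one row-major vector and return it
/// together with its (nrows, ncols) shape.
/// Returns `None` when the rows do not all have the same length.
fn flatten_rows(rows: &[Vec<f64>]) -> Option<(Vec<f64>, (usize, usize))> {
    let ncols = rows.first().map_or(0, |r| r.len());
    if rows.iter().any(|r| r.len() != ncols) {
        return None;
    }
    // Row 0 comes first, then row 1, and so on.
    let flat: Vec<f64> = rows.iter().flatten().copied().collect();
    Some((flat, (rows.len(), ncols)))
}

/// Look up element (row, col) in a row-major buffer with `ncols` columns.
fn at(flat: &[f64], ncols: usize, row: usize, col: usize) -> f64 {
    flat[row * ncols + col]
}
```

In the cell above, each record contributes three consecutive values (`S2016`, `S2017`, `PercentGrowth`), so record `i` occupies positions `3 * i` through `3 * i + 2` of `flat_values`.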

## If your data does not need cleaning

Data rarely arrives clean, but sometimes the preprocessing happens in another environment and you are handed a clean file to work with. Or you clean the data once and reuse it to train many different models. For those cases there is a crate, `ndarray-csv`, that lets you go directly from CSV to ndarray!

https://docs.rs/ndarray-csv/latest/ndarray_csv/

In [3]:
:dep csv = { version = "^1.3.1" }
:dep ndarray = { version = "^0.15.6" }
:dep ndarray-csv = { version = "^0.5.3" }

extern crate ndarray;
extern crate ndarray_csv;

use csv::{ReaderBuilder, WriterBuilder};
use ndarray::{array, Array2};
use ndarray_csv::{Array2Reader, Array2Writer};
use std::error::Error;
use std::fs::File;

fn main() -> Result<(), Box<dyn Error>> {
    // Our 2x3 test array
    let array: Array2<u64> = array![[1, 2, 3], [4, 5, 6]];

    // Write the array into the file.
    {
        let file = File::create("test.csv")?;
        let mut writer = WriterBuilder::new().has_headers(false).from_writer(file);
        writer.serialize_array2(&array)?;
    }

    // Read the array back from the same file. No header row was written,
    // so headers stay disabled here as well.
    let file = File::open("test.csv")?;
    let mut reader = ReaderBuilder::new().has_headers(false).from_reader(file);
    let array_read: Array2<u64> = reader.deserialize_array2((2, 3))?;

    // Ensure that we got the original array back
    assert_eq!(array_read, array);
    println!("{:?}", array_read);
    Ok(())
}

main();

[[1, 2, 3],
 [4, 5, 6]], shape=[2, 3], strides=[3, 1], layout=Cc (0x5), const ndim=2


# In-Class Poll

https://piazza.com/class/m5qyw6267j12cj/post/466
