NB: It would be better to develop this pipeline in an IDE (such as Visual Studio Code), which provides signature annotations, linting, auto-imports and formatting, and testing/benchmarking. This notebook just showcases the iterative development style.  

You are studying student performance on different tests for a school board. The data is in a CSV file that looks like this:

In [1]:
rawData := `student, date, grade
John Doe, 1/1/20, 70
John Doe, 1/5/20, 60
John Doe, 1/10/20, 65
John Doe, 1/15/20, 50
John Doe, 1/30/20, 80
John Doe, 2/1/20, 80
John Doe, 2/15/20, 85
Jane Doe, 1/1/20, 75
Jane Doe, 1/5/20, 60
Jane Doe, 1/10/20, 70
Jane Doe, 1/15/20, 60
Jane Doe, 1/30/20, 95
Jane Doe, 2/1/20, 90
Jane Doe, 2/15/20, 85
Jane Doe, 2/28/20, 95
`

The board first asks to see the average grade by student across all tests. 

You want to build a reusable data pipeline and validate that it works for any similar data. The first step is to convert the data into a smaller, more generalized form so that tests are easier to reason about. 

For this, run tada.WriteMockCSV(), copy and paste the string output, and throw some extra null data into the output for good measure.

In [2]:
import (
    "bytes"
    "encoding/csv"
    "fmt"
    "log"
    "strconv"
    "strings"
    "testing"
    "time"

    "github.com/ptiger10/tada"
)

In [3]:
w := new(bytes.Buffer)
tada.WriteMockCSV(strings.NewReader(rawData), w, 5)
w.String()

student,date,grade
foo,2020-01-01,1
baz,,3
bar,2020-02-02,1
bar,2020-01-01,1
,2020-02-02,3


Now we write a test to make sure that the data pipeline is outputting the right values. 

For this, we use df.EqualsCSV(), which compares a DataFrame against the expected CSV output (here read from a string).
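
To make the idea of a cell-by-cell CSV comparison concrete, here is a minimal standard-library sketch. It is not how tada implements its comparison; `readCells` and `diffCSV` are hypothetical helpers written for this illustration, and the `[row][column]` addressing (with the header counted as row 0) is an assumption chosen to resemble the diff output shown later in this notebook.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// readCells parses a CSV string into rows of cells.
// Hypothetical helper for illustration; not part of tada.
func readCells(s string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(s))
	r.TrimLeadingSpace = true // tolerate "bar, 4.5" style spacing
	return r.ReadAll()
}

// diffCSV compares two CSV strings cell by cell and returns one message
// per difference. Cells are addressed as [row][column], header = row 0.
func diffCSV(got, want string) ([]string, error) {
	g, err := readCells(got)
	if err != nil {
		return nil, err
	}
	w, err := readCells(want)
	if err != nil {
		return nil, err
	}
	if len(g) != len(w) {
		return []string{fmt.Sprintf("row count: %d -> %d", len(g), len(w))}, nil
	}
	var diffs []string
	for i := range g {
		if len(g[i]) != len(w[i]) {
			diffs = append(diffs, fmt.Sprintf("column count in row %d: %d -> %d", i, len(g[i]), len(w[i])))
			continue
		}
		for j := range g[i] {
			if g[i][j] != w[i][j] {
				diffs = append(diffs, fmt.Sprintf("modified: [%d][%d] = %s -> %s", i, j, g[i][j], w[i][j]))
			}
		}
	}
	return diffs, nil
}

func main() {
	got := "student,mean_grade\nbar,4.5\nbaz,3.5"
	want := "student, mean_grade\nbar, 4.5\nbaz, 3"
	diffs, err := diffCSV(got, want)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	for _, d := range diffs {
		fmt.Println(d) // prints: modified: [2][1] = 3.5 -> 3
	}
}
```

Because the comparison in this sketch is purely textual, `3.5` and `3.50` would count as different cells, which is one reason to format numeric columns consistently before comparing.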

In [14]:
// Normally, you would supply a *testing.T to this function, 
// include the expected input and output in the test itself,
// and report errors with t.Errorf.
// However, the normal Go test workflow is not well supported in a notebook.

mockInput := `student,date,grade
,2020-01-02,2
baz,2020-01-02,3
baz,2020-02-02,4
bar,2020-01-01,4
bar,2019-12-31,5`
    
want := `student, mean_grade
bar, 4.5
baz, 3`

func TestTransform() {
    df, err := tada.ReadCSV(strings.NewReader(mockInput))
    if err != nil {
        fmt.Println("Error:", err)
        return // df is unusable if the CSV could not be read
    }
    ret := transform(df)
    ok, diffs, err := ret.EqualsCSV(strings.NewReader(want), true)
    if err != nil {
        fmt.Println("Error:", err)
        return // the comparison itself failed, so there are no diffs to show
    }
    if !ok {
        fmt.Println("transform() has diffs:")
        fmt.Println("--text view--")
        fmt.Println(diffs)
        
        fmt.Println("--table view--")
        fmt.Println(diffs.AsTable())
        
        fmt.Println("--df that was returned--")
        fmt.Println(ret)
    } else {
        fmt.Println("PASS")
    }
}

Next, we write a function that satisfies the test.

In [15]:
func transform(df *tada.DataFrame) *tada.DataFrame {
    df.InPlace().DropNull()
    df.InPlace().Sort(tada.Sorter{Name: "student", DType: tada.String})
    return df.GroupBy("student").Mean("grade")
}

In [16]:
TestTransform()

transform() has diffs:
--text view--
modified: [2][1] = 3.5 -> 3

--table view--
+--+--------+
|  |        |
|  |        |
|  | 3.5->3 |
+--+--------+

--df that was returned--
+---------++------------+
| student || mean_grade |
|---------||------------|
|     bar ||        4.5 |
|     baz ||        3.5 |
+---------++------------+
name: mean



Uh oh. The test failed at position [2][1] (i.e., third row, second column, counting the header as row 0). transform() returned 3.5 as the mean grade for baz, but we were expecting 3. Double-checking the mock input, baz has grades of 3 and 4, so we should have been expecting 3.5. We update the test.
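
One way to double-check an expected value like this without relying on the pipeline under test is to recompute it with the standard library alone. The sketch below is a hypothetical cross-check, not tada code: `meanByStudent` is a name invented here, and it assumes that a "null" row is any row with an empty field.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// meanByStudent averages the grade column per student, skipping rows
// that contain an empty field. Hypothetical helper; assumes the columns
// are student, date, grade, in that order, with a header row.
func meanByStudent(data string) (map[string]float64, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.TrimLeadingSpace = true
	rows, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	sums := map[string]float64{}
	counts := map[string]int{}
	for i, row := range rows {
		if i == 0 {
			continue // skip the header
		}
		if row[0] == "" || row[1] == "" || row[2] == "" {
			continue // treat rows with an empty field as null
		}
		grade, err := strconv.ParseFloat(row[2], 64)
		if err != nil {
			return nil, err
		}
		sums[row[0]] += grade
		counts[row[0]]++
	}
	means := make(map[string]float64, len(sums))
	for student, total := range sums {
		means[student] = total / float64(counts[student])
	}
	return means, nil
}

func main() {
	mockInput := `student,date,grade
,2020-01-02,2
baz,2020-01-02,3
baz,2020-02-02,4
bar,2020-01-01,4
bar,2019-12-31,5`

	means, err := meanByStudent(mockInput)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	students := make([]string, 0, len(means))
	for s := range means {
		students = append(students, s)
	}
	sort.Strings(students) // map iteration order is random, so sort for stable output
	for _, s := range students {
		fmt.Printf("%s: %v\n", s, means[s]) // bar: 4.5, then baz: 3.5
	}
}
```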

In [17]:
mockInput := `student,date,grade
,2020-01-02,2
baz,2020-01-02,3
baz,2020-02-02,4
bar,2020-01-01,4
bar,2019-12-31,5`

want := `student, mean_grade
bar, 4.5
baz, 3.5`

In [18]:
TestTransform()

PASS


Once we are comfortable with our test coverage, we run the real data through the function.

In [20]:
df, err := tada.ReadCSV(strings.NewReader(rawData))
if err != nil {
    log.Fatal(err)
}
transform(df)

+----------++------------+
| student  || mean_grade |
|----------||------------|
| Jane Doe ||      78.75 |
| John Doe ||         70 |
+----------++------------+
name: mean


## Revision 1

The board changed its mind. Now it wants to see only average scores grouped by student and by month. So you update the mock input to have at least a couple of observations in different months, write a new expected output, and write a new transform function.
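
As a rough illustration of what "grouped by student and by month" means, the following standard-library sketch buckets observations by a combined student and `YYYY-MM` key. It is only a sketch under stated assumptions: the `obs` type and `meanByStudentMonth` function are hypothetical, dates are assumed to use the `2006-01-02` layout of the mock input, and null rows are assumed to have been dropped already.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// obs is a hypothetical record type for one test result.
type obs struct {
	student string
	date    string // assumed layout: 2006-01-02
	grade   float64
}

// meanByStudentMonth averages grades per "student,YYYY-MM" key.
func meanByStudentMonth(data []obs) (map[string]float64, error) {
	sums := map[string]float64{}
	counts := map[string]int{}
	for _, o := range data {
		t, err := time.Parse("2006-01-02", o.date)
		if err != nil {
			return nil, err
		}
		key := o.student + "," + t.Format("2006-01") // truncate the date to its month
		sums[key] += o.grade
		counts[key]++
	}
	means := make(map[string]float64, len(sums))
	for k, total := range sums {
		means[k] = total / float64(counts[k])
	}
	return means, nil
}

func main() {
	data := []obs{
		{"baz", "2020-02-01", 3},
		{"baz", "2020-02-02", 4},
		{"bar", "2020-01-01", 4},
		{"bar", "2019-12-01", 5},
		{"bar", "2019-12-02", 7},
	}
	means, err := meanByStudentMonth(data)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	keys := make([]string, 0, len(means))
	for k := range means {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sorts by student, then by month within a student
	for _, k := range keys {
		fmt.Printf("%s,%.1f\n", k, means[k])
	}
	// Prints:
	// bar,2019-12,6.0
	// bar,2020-01,4.0
	// baz,2020-02,3.5
}
```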

In [21]:
mockInput := `student,date,grade
,2020-01-02,2
baz,2020-02-01,3
baz,2020-02-02,4
bar,2020-01-01,4
bar,2019-12-01,5
bar,2019-12-02,7`

want := `student, date, mean_grade
bar, 2019-12, 6.0
bar, 2020-01, 4.0
baz, 2020-02, 3.5`

In [22]:
func transform(df *tada.DataFrame) *tada.DataFrame {
    df.InPlace().DropNull()
    df.InPlace().Resample(map[string]tada.Resampler{"date": {ByMonth: true}})
    df.InPlace().Sort([]tada.Sorter{
        {Name: "student", DType: tada.String}, 
        {Name: "date", DType: tada.DateTime}}...)
    df = df.GroupBy("student", "date").Mean("grade")
    
    monthFormat := tada.ApplyFormatFn{DateTime: func(v time.Time) string {return v.Format("2006-01")}}
    decimalFormat := tada.ApplyFormatFn{Float64: func(v float64) string {return strconv.FormatFloat(v, 'f', 1, 64)}}
    df.InPlace().ApplyFormat(map[string]tada.ApplyFormatFn{
        "date": monthFormat,
        "mean_grade": decimalFormat})
    return df
}

In [23]:
TestTransform()

PASS


In [24]:
df, err := tada.ReadCSV(strings.NewReader(rawData))
if err != nil {
    log.Fatal(err)
}
transform(df)

+----------+---------++------------+
| student  |  date   || mean_grade |
|----------|---------||------------|
| Jane Doe | 2020-01 ||       72.0 |
|          | 2020-02 ||       90.0 |
| John Doe | 2020-01 ||       65.0 |
|          | 2020-02 ||       82.5 |
+----------+---------++------------+
name: mean


## Revision 2
Looking at your most recent report, the school board notices that students seem to have unusually low scores in January. Someone discovers that the school district changed test scanning software on January 16, 2020. Now they want a report on average scores across all students for all tests before and after that key date.

You want your function to be able to perform this analysis given any date. This requires a new test as well.

In [25]:
mockInput := `student,date,grade
,2020-01-02,2
baz,2020-02-01,5
baz,2020-02-02,4
bar,2020-01-01,1
bar,2019-12-01,2
bar,2019-12-02,3`


// if date is 2020-01-02
want := `period, mean_grade
d < 2020-01-02, 2.0
d >= 2020-01-02, 4.5`

In [29]:
func TestTransformWithDate(testDate time.Time) {
    df, err := tada.ReadCSV(strings.NewReader(mockInput))
    if err != nil {
        fmt.Println("Error:", err)
        return // df is unusable if the CSV could not be read
    }
    ret := transformWithDate(df, testDate)
    ok, diffs, err := ret.EqualsCSV(strings.NewReader(want), true)
    if err != nil {
        fmt.Println("Error:", err)
        return // the comparison itself failed, so there are no diffs to show
    }
    if !ok {
        fmt.Println("transformWithDate() has diffs:")
        fmt.Println("--text view--")
        fmt.Println(diffs)
        
        fmt.Println("--table view--")
        fmt.Println(diffs.AsTable())
        
        fmt.Println("--df that was returned--")
        fmt.Println(ret)
    } else {
        fmt.Println("PASS")
    }
}

In [34]:
func transformWithDate(df *tada.DataFrame, date time.Time) *tada.DataFrame {
    df.InPlace().DropNull()
    beforeDate := tada.FilterFn{DateTime: func(v time.Time) bool{ return v.Before(date)}}
    period, err := df.Where(
        map[string]tada.FilterFn{"date": beforeDate}, 
        fmt.Sprintf("d < %v", date.Format("2006-01-02")),
        fmt.Sprintf("d >= %v", date.Format("2006-01-02")),
    )
    if err != nil {
        log.Fatal(err)
    }
    df.InPlace().WithCol("period", period)
    df.InPlace().Sort(tada.Sorter{Name: "date", DType: tada.DateTime})
    ret := df.GroupBy("period").Mean("grade")
    
    decimalFormat := tada.ApplyFormatFn{Float64: func(v float64) string {return strconv.FormatFloat(v, 'f', 1, 64)}}
    ret.InPlace().ApplyFormat(map[string]tada.ApplyFormatFn{
        "mean_grade": decimalFormat})
    return ret
} 

In [35]:
testDate := time.Date(2020, time.January, 2, 0, 0, 0, 0, time.UTC)
TestTransformWithDate(testDate)

PASS


In [36]:
df, err := tada.ReadCSV(strings.NewReader(rawData))
if err != nil {
    log.Fatal(err)
}
d := time.Date(2020,1,16,0,0,0,0,time.UTC)
transformWithDate(df, d)

+-----------------++------------+
|     period      || mean_grade |
|-----------------||------------|
|  d < 2020-01-16 ||       63.8 |
| d >= 2020-01-16 ||       87.1 |
+-----------------++------------+
name: mean
