I want to make sure the framework produces the same result as the original notebook, so here is a comparison of the enriched dataset (the one where the city column has been calculated) as produced by the original notebook and by the framework.

In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Read the two datasets. Since they should contain the same data and are sorted the same way, the pandas indexes should be the same, which will allow a direct comparison.

In [2]:
customers1 = pd.read_csv('city-values-ref.csv', quotechar='"', escapechar='\\').sort_values(by='address', ascending=False) 
customers2 = pd.read_csv('city-values-res.csv', quotechar='"', escapechar='\\').sort_values(by='address', ascending=False) 

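Reindex the second dataset against the first to be sure the rows line up by index label, since `compare` only works on identically labelled DataFrames. I'm not really sure how significant this is yet.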

In [3]:
customers2 = customers2.reindex(customers1.index, fill_value=0)
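
To get a feel for what the reindex does, here is a minimal sketch on two toy DataFrames (`ref` and `res` are made-up stand-ins for the real datasets, not the actual data). It suggests that `reindex` aligns rows by label and that `fill_value=0` would quietly fill in any row missing from the second dataset, so checking the labels first may be worthwhile.

```python
import pandas as pd

# Hypothetical stand-ins for customers1 and customers2.
ref = pd.DataFrame({"city": ["A", "B", "C"]}, index=[2, 0, 1])
res = pd.DataFrame({"city": ["B", "C"]}, index=[0, 1])  # label 2 is missing

# reindex aligns rows by index label, not by position;
# labels absent from `res` are filled with fill_value.
aligned = res.reindex(ref.index, fill_value=0)
print(aligned)

# Listing the missing labels makes a dropped row visible
# instead of letting it show up as a zero-filled row.
missing = ref.index.difference(res.index)
print(list(missing))
```

If `missing` is empty for the real datasets, the reindex only reorders rows and the `fill_value` never comes into play.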

Compare the two datasets

In [4]:
comparison = customers1.compare(customers2)
print(comparison)

          city            
          self       other
25053  PRESTON  CANTERBURY


The comparison shows one difference, at index label 25053, so let's look at it.

In [5]:
customers1.loc[25053]

address        FORSTAL HOUSE THE FORSTAL,\nPRESTON,\nCANTERBURY,\nENGLAND,\nCT3 1DT
total_spend                                                                    4100
city                                                                        PRESTON
Name: 25053, dtype: object

In [6]:
customers2.loc[25053]

address        FORSTAL HOUSE THE FORSTAL,\nPRESTON,\nCANTERBURY,\nENGLAND,\nCT3 1DT
total_spend                                                                    4100
city                                                                     CANTERBURY
Name: 25053, dtype: object
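
A note on looking up the row: `compare` reports index labels, and after `sort_values` a label is not necessarily the same as a position. The sketch below uses a made-up three-row DataFrame to show the difference between a label lookup (`.loc`) and a positional lookup (`.iloc`). The two only agree when the data happens to be in index order already.

```python
import pandas as pd

# Hypothetical frame, sorted so that labels and positions diverge.
df = pd.DataFrame({"address": ["a", "c", "b"]}).sort_values(
    by="address", ascending=False
)

by_label = df.loc[1]      # the row whose index label is 1
by_position = df.iloc[1]  # the second row in sorted order

print(by_label["address"], by_position["address"])
```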

I can see what's happening here: the address matches the criteria for both Preston and Canterbury, so what matters is how the logic is implemented. When it finds a match, does it stop there, or does it continue and use a later match? The order of the data will affect this as well.

In this scenario, first match wins:
```
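// `names` and `other` come from the enclosing scope.
var calculateValue: UDF1<*, *> = UDF1<String, String> { value ->
    for (name in names) {
        if (value.contains("\n$name,")) {
            // Stop at the first name found on its own address line.
            return@UDF1 name
        }
    }
    return@UDF1 other
}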
```

And here, last match wins:
```
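// `names` and `other` come from the enclosing scope.
var calculateValue: UDF1<*, *> = UDF1<String, String> { value ->
    var result: String = other
    for (name in names) {
        if (value.contains("\n$name,")) {
            // Keep going, so a later match overwrites an earlier one.
            result = name
        }
    }
    return@UDF1 result
}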
```
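
To check this reasoning against the row that differed, the two strategies can be sketched in Python. This is only an illustration: the order of `names` and the `OTHER` fallback value are assumptions, not taken from the real code.

```python
# The address from the differing row, with real newlines.
address = "FORSTAL HOUSE THE FORSTAL,\nPRESTON,\nCANTERBURY,\nENGLAND,\nCT3 1DT"

names = ["PRESTON", "CANTERBURY"]  # assumed order, for illustration only
OTHER = "OTHER"                    # hypothetical fallback value


def first_match(value, names, other=OTHER):
    """Return the first name that appears on its own address line."""
    for name in names:
        if f"\n{name}," in value:
            return name
    return other


def last_match(value, names, other=OTHER):
    """Return the last name that appears on its own address line."""
    result = other
    for name in names:
        if f"\n{name}," in value:
            result = name
    return result


print(first_match(address, names))  # PRESTON
print(last_match(address, names))   # CANTERBURY
```

With this ordering, the two strategies give exactly the two values seen in the comparison. Reversing `names` would swap them, which is why the order matters as much as the strategy.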

Deterministic results are always good, so it seems reasonable to make sure the data is sorted. This avoids seemingly random differences as the data is updated over time. The "first wins" or "last wins" behaviour of the code should also be documented for those who want to know.
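
As a rough sketch of what that could look like, assuming the thing to sort is the list of candidate city names (my reading, not something the current code does), a first-match lookup over a sorted copy gives the same answer whatever order the names arrive in. The function name and the `OTHER` fallback are hypothetical.

```python
def first_match_sorted(value, names, other="OTHER"):
    """First match wins, over an alphabetically sorted copy of names."""
    for name in sorted(names):
        if f"\n{name}," in value:
            return name
    return other


address = "FORSTAL HOUSE THE FORSTAL,\nPRESTON,\nCANTERBURY,\nENGLAND,\nCT3 1DT"

# Both input orders give the same result.
print(first_match_sorted(address, ["PRESTON", "CANTERBURY"]))
print(first_match_sorted(address, ["CANTERBURY", "PRESTON"]))
```

Alphabetical order is an arbitrary tie-break. It makes the result stable, but it does not say which city is the right one for an address like this.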

I'll leave this as a future exercise for now.
