# Assignment 4

In this assignment, you will be loading and analyzing data using the Clojure language.

## The dataset

Data is obtained from: https://simplemaps.com/data/canada-cities.  The raw data is available
in "/home/jovyan/work/shared/datasets/canadacities.csv".  It contains information about over 1700 cities in Canada.

## Analysis

In this assignment, we will guide you to build functions of increasing complexity.  You will be using
functional constructs to build progressively more sophisticated data processing pipelines in Clojure.

In [217]:
; 🔒🔒🔒
;
; Import core clojure functions from namespaces:
; - `clojure.string` https://clojuredocs.org/clojure.string
; - `clojure.pprint` https://clojuredocs.org/clojure.pprint
;
(require '[clojure.string :as s])
(require '[clojure.pprint :refer [pprint]])

nil

# Q1. `strip-quotes`

`(strip-quotes s)`

- `s` is a string that may be enclosed by double quotes.
- returns a string with the double quotes removed.  If `s` does not have quotes, then
  just return `s` unchanged.
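
One possible reading of "enclosed by double quotes" is that only a leading and a trailing quote should be removed, leaving any interior quotes alone. The sketch below assumes that reading; the name `strip-enclosing-quotes` is hypothetical and not part of the assignment.

```clojure
(require '[clojure.string :as s])

;; Hypothetical helper: removes a double quote only at the very start
;; and/or the very end of the string; interior quotes are kept.
(defn strip-enclosing-quotes [text]
  (s/replace text #"^\"|\"$" ""))

(strip-enclosing-quotes "\"world\"")      ; => "world"
(strip-enclosing-quotes "hello")          ; => "hello"
(strip-enclosing-quotes "say \"hi\" now") ; returned unchanged
```

Removing every quote character, wherever it appears, also satisfies the checks in this notebook, so either reading should pass them.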

In [218]:
; ****Your Work****
;;; @solution

(defn strip-quotes [s]
    (s/replace s #"\"" "")
)

#'user/strip-quotes

In [219]:
; 🔒🔒🔒
;;; @check

(strip-quotes "hello")

"hello"

In [220]:
; 🔒🔒🔒
;;; @check
(strip-quotes "\"world\"")

"world"

# Q2. `parse-element`

`(parse-element s)`

- `s` is the string value to be parsed.
  - if `s` looks like an integer, it should be converted to an integer.
  - if `s` looks like a decimal, it should be converted to a double.
  - otherwise, return `s` unmodified as a string.
---
- returns either a string, double, or an integer.
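
The three cases above can also be decided by pattern matching rather than by catching a parse error. The following is a minimal sketch under that assumption; `parse-element-sketch` is a hypothetical name, and the regular expressions are one guess at what "looks like" an integer or a decimal.

```clojure
;; Hypothetical variant: classify the string first, then convert it.
;; Assumes integer values fit in a 32-bit int (Integer/parseInt throws otherwise).
(defn parse-element-sketch [text]
  (cond
    ;; optional sign followed by digits only, e.g. "31415" or "-7"
    (re-matches #"[+-]?\d+" text) (Integer/parseInt text)
    ;; digits with a decimal point, e.g. "3.1415", "3." or ".5"
    (re-matches #"[+-]?(\d+\.\d*|\.\d+)" text) (Double/parseDouble text)
    ;; anything else is returned unmodified
    :else text))

(parse-element-sketch "31415")       ; => 31415
(parse-element-sketch "3.")          ; => 3.0
(parse-element-sketch "hello world") ; => "hello world"
```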

In [221]:
; ****Your Work****
;;; @solution

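(defn parse-element [s]
    (try
        ; Throws NumberFormatException when s is not numeric; the result is discarded.
        (Double/parseDouble s)
        (if (s/includes? s ".")
            (Double/parseDouble s)
            (Integer/parseInt s))
        ; Non-numeric input is returned unchanged.
        (catch Exception e s)))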

#'user/parse-element

In [222]:
; 🔒🔒🔒
;;; @check
(let [x (parse-element "3.1415")]
    [(type x) x])

[java.lang.Double 3.1415]

In [223]:
; 🔒🔒🔒
;;; @check
(let [x (parse-element "3.")]
    [(type x) x])

[java.lang.Double 3.0]

In [224]:
; 🔒🔒🔒
;;; @check
(let [x (parse-element "31415")]
    [(type x) x])

[java.lang.Integer 31415]

In [225]:
; 🔒🔒🔒
;;; @check
(let [x (parse-element "hello world")]
    [(type x) x])

[java.lang.String "hello world"]

# Q3. `parse-line`

`(parse-line line)`

- `line` is a string containing comma-separated values.  Each value may be enclosed by double quotes.
---
- returns a sequence of parsed values (either string, double or integer).  You must also remove any excess whitespace outside the quotes.
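
The steps described above (split, trim, strip quotes, parse) can be sketched as a single `map` over the split fields. The helpers `unquote-field` and `parse-field` are hypothetical stand-ins for your Q1 and Q2 functions, included only so the sketch runs on its own.

```clojure
(require '[clojure.string :as s])

;; Hypothetical stand-ins for strip-quotes (Q1) and parse-element (Q2).
(defn unquote-field [text] (s/replace text #"\"" ""))
(defn parse-field [text]
  (cond
    (re-matches #"[+-]?\d+" text) (Integer/parseInt text)
    (re-matches #"[+-]?(\d+\.\d*|\.\d+)" text) (Double/parseDouble text)
    :else text))

;; Hypothetical name. comp applies right to left: trim, then unquote, then parse.
;; Splitting on every comma assumes no value contains a comma inside its quotes.
(defn parse-line-sketch [line]
  (map (comp parse-field unquote-field s/trim)
       (s/split line #",")))

(parse-line-sketch "hello,   123   ,   3.1415   ")
;; => ("hello" 123 3.1415)
```

Trimming before stripping the quotes is what keeps whitespace inside the quotes intact while dropping whitespace outside them.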

In [226]:
; ****Your Work****
;;; @solution

(defn parse-line [line]
    (for [x (s/split line #",")]
        (parse-element (strip-quotes (s/trim x))))
)

#'user/parse-line

In [227]:
; 🔒🔒🔒
;;; @check

(parse-line "\"hello\",\"world\"")

("hello" "world")

In [228]:
; 🔒🔒🔒
;;; @check

(parse-line "hello,123,3.1415")

("hello" 123 3.1415)

In [229]:
; 🔒🔒🔒
;;; @check

(parse-line "hello,   123   ,   3.1415   ")

("hello" 123 3.1415)

In [230]:
; 🔒🔒🔒
;;; @check

(parse-line "hello,  \" 123 \"  ,   \"3.1415\"   ")

("hello" " 123 " 3.1415)

# Q4. Using these functions

We have provided code (see the `city.clj` file) that will make use of your functions to
parse a CSV file containing information about 1738 Canadian cities.

The Clojure file can be imported using `(require <name>)` where `<name>` is a quoted
symbol that corresponds to the Clojure file, in this case `city`.

- Complete the require form so you can access the functions provided in `city.clj`.
- Study the `city.clj` file and make sure you understand how the functions work.

In [231]:
; ****Your Work****
;;; @solution

;
; Complete the following require form.
;

(require 'city)


nil

In [20]:
; 🔒🔒🔒
;;; @check

(load-header)

("city" "city_ascii" "province_id" "province_name" "lat" "lng" "population" "density" "timezone" "ranking" "postal" "id")

In [232]:
; 🔒🔒🔒
;;; @check

(-> (load-rows) (count))

1738

In [233]:
; 🔒🔒🔒
;;; @check

(->> (load-rows)
     (take 2)
     (map println))

((Toronto Toronto ON Ontario 43.7417 -79.3733 5429524 4334.4 America/Toronto 1 M5T M5V M5P M5S M5R M5E M5G M5A M5C M5B M5M M5N M5H M5J M4X M4Y M4R M4S M4P M4V M4W M4T M4J M4K M4H M4N M4L M4M M4B M4C M4A M4G M4E M3N M3M M3L M3K M3J M3H M3C M3B M3A M2P M2R M2L M2M M2N M2H M2J M2K M1C M1B M1E M1G M1H M1K M1J M1M M1L M1N M1P M1S M1R M1T M1W M1V M1X M9P M9R M9W M9V M9M M9L M9N M9A M9C M9B M6P M6R M6S M6A M6B M6C M6E M6G M6H M6J M6K M6L M6M M6N M8Z M8X M8Y M8V M8W 1124279679)
(Montréal Montreal QC Quebec 45.5089 -73.5617 3519595 3889.0 America/Montreal 1 H1X H1Y H1Z H1P H1R H1S H1T H1V H1W H1H H1J H1K H1L H1M H1N H1A H1B H1C H1E H1G H2Y H2X H2Z H2T H2W H2V H2P H2S H2R H2M H2L H2N H2H H2K H2J H2E H2G H2A H2C H2B H3B H3C H3A H3G H3E H3J H3K H3H H3N H3L H3M H3R H3S H3V H3W H3T H3X H4G H4E H4C H4B H4A H4N H4M H4L H4K H4J H4H H4V H4S H4R H4P H8N H8S H8R H8P H8T H8Z H8Y H9A H9C H9E H9H H9J H9K 1124586170)
nil nil)

# Q5. `build-hash-map`

`(build-hash-map header row)`

- `header` is a sequence of strings that are the column headers of the CSV file.
- `row` is a sequence of values from a row in the CSV file.
---
- returns a hash-map representation of the row.  You are required to
  convert strings in the header to keywords, and then use them as the keys of the hash-map.
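
The specification asks for keyword keys, whereas the recorded outputs in this notebook show string keys. A minimal sketch of the keyword version follows; `build-keyword-map` is a hypothetical name used here to avoid clashing with your own `build-hash-map`.

```clojure
;; Hypothetical name: turns each header string into a keyword,
;; then pairs the keywords with the row values in order.
(defn build-keyword-map [header row]
  (zipmap (map keyword header) row))

(build-keyword-map ["city" "province_id"] ["Toronto" "ON"])
;; => {:city "Toronto", :province_id "ON"}
```

With keyword keys, `(str k " " v)` for the `:city` entry yields `":city Toronto"`, which is the form the later queries search for.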

In [234]:
; ****Your Work****
;;; @solution

(defn build-hash-map [header row]
    (zipmap header row)
)

#'user/build-hash-map

In [235]:
; 🔒🔒🔒
;;; @check

(->> (load-rows)
    (first)
    (build-hash-map (load-header))
    (pprint))

{"city" "Toronto",
 "timezone" "America/Toronto",
 "lng" -79.3733,
 "id" 1124279679,
 "province_name" "Ontario",
 "postal"
 "M5T M5V M5P M5S M5R M5E M5G M5A M5C M5B M5M M5N M5H M5J M4X M4Y M4R M4S M4P M4V M4W M4T M4J M4K M4H M4N M4L M4M M4B M4C M4A M4G M4E M3N M3M M3L M3K M3J M3H M3C M3B M3A M2P M2R M2L M2M M2N M2H M2J M2K M1C M1B M1E M1G M1H M1K M1J M1M M1L M1N M1P M1S M1R M1T M1W M1V M1X M9P M9R M9W M9V M9M M9L M9N M9A M9C M9B M6P M6R M6S M6A M6B M6C M6E M6G M6H M6J M6K M6L M6M M6N M8Z M8X M8Y M8V M8W",
 "population" 5429524,
 "province_id" "ON",
 "city_ascii" "Toronto",
 "lat" 43.7417,
 "ranking" 1,
 "density" 4334.4}


nil

# Q6. `abbreviate`

`(abbreviate city)`

- `city` is the hash-map representation of a city.
---
- returns the abbreviated hash-map with only keys `:city` and `:province_id`.
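
If the city map uses keyword keys, the abbreviation can be sketched with `select-keys` from `clojure.core`. The name `abbreviate-sketch` is hypothetical, and the keyword-key assumption does not hold for a map built with string keys.

```clojure
;; Hypothetical name. Assumes the city map is keyed by keywords.
;; select-keys omits a key entirely when the map does not contain it.
(defn abbreviate-sketch [city]
  (select-keys city [:city :province_id]))

(abbreviate-sketch {:city "Toronto" :province_id "ON" :population 5429524})
;; => {:city "Toronto", :province_id "ON"}
```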

In [236]:
; ****Your Work****
;;; @solution

(defn abbreviate [city] 
    (hash-map :city (get city "city") :province_id (get city "province_id"))
)

#'user/abbreviate

In [237]:
; 🔒🔒🔒
;;; @check

(->> (load-rows)
    (first)
    (build-hash-map (load-header))
    (abbreviate))

{:city "Toronto", :province_id "ON"}

# Q7. `load-cities`

`(load-cities)`

- returns a sequence of cities represented as hash-maps.
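
One design choice here is how often the header is read. The sketch below takes the header as an argument so it is obtained once and reused for every row; whether that matters in practice depends on how `load-header` is implemented in `city.clj`, which is not shown here. `sample-header`, `sample-rows` and `rows->cities` are hypothetical names, with the sample data standing in for `(load-header)` and `(load-rows)`.

```clojure
;; Hypothetical in-memory stand-ins for (load-header) and (load-rows).
(def sample-header ["city" "province_id" "population"])
(def sample-rows [["Toronto" "ON" 5429524]
                  ["Oshawa" "ON" 166000]])

;; Hypothetical name: converts the header to keywords once,
;; then builds one hash-map per row.
(defn rows->cities [header rows]
  (let [ks (map keyword header)]
    (map (fn [row] (zipmap ks row)) rows)))

(first (rows->cities sample-header sample-rows))
;; => {:city "Toronto", :province_id "ON", :population 5429524}
```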

In [238]:
; ****Your Work****
;;; @solution

(defn load-cities []
    (for [x (load-rows)] (build-hash-map (load-header) x))
)

#'user/load-cities

In [239]:
; 🔒🔒🔒
;;; @check

(first (load-cities))

{"city" "Toronto", "timezone" "America/Toronto", "lng" -79.3733, "id" 1124279679, "province_name" "Ontario", "postal" "M5T M5V M5P M5S M5R M5E M5G M5A M5C M5B M5M M5N M5H M5J M4X M4Y M4R M4S M4P M4V M4W M4T M4J M4K M4H M4N M4L M4M M4B M4C M4A M4G M4E M3N M3M M3L M3K M3J M3H M3C M3B M3A M2P M2R M2L M2M M2N M2H M2J M2K M1C M1B M1E M1G M1H M1K M1J M1M M1L M1N M1P M1S M1R M1T M1W M1V M1X M9P M9R M9W M9V M9M M9L M9N M9A M9C M9B M6P M6R M6S M6A M6B M6C M6E M6G M6H M6J M6K M6L M6M M6N M8Z M8X M8Y M8V M8W", "population" 5429524, "province_id" "ON", "city_ascii" "Toronto", "lat" 43.7417, "ranking" 1, "density" 4334.4}

In [240]:
; 🔒🔒🔒
;;; @check
(last (load-cities))

{"city" "Oyen", "timezone" "America/Edmonton", "lng" -110.4739, "id" 1124000494, "province_name" "Alberta", "postal" "T0J", "population" 1001, "province_id" "AB", "city_ascii" "Oyen", "lat" 51.3522, "ranking" 3, "density" 189.6}

# Q8. `load-abbreviated-cities`

`(load-abbreviated-cities)`

- returns cities as abbreviated hash-maps.

In [241]:
; ****Your Work****
;;; @solution

(defn load-abbreviated-cities [] 
    (for [x (load-cities)] (abbreviate x))
)

#'user/load-abbreviated-cities

In [242]:
; 🔒🔒🔒
;;; @check
(pprint (take 10 (load-abbreviated-cities)))

({:city "Toronto", :province_id "ON"}
 {:city "Montréal", :province_id "QC"}
 {:city "Vancouver", :province_id "BC"}
 {:city "Calgary", :province_id "AB"}
 {:city "Edmonton", :province_id "AB"}
 {:city "Ottawa", :province_id "ON"}
 {:city "Mississauga", :province_id "ON"}
 {:city "Winnipeg", :province_id "MB"}
 {:city "Quebec City", :province_id "QC"}
 {:city "Hamilton", :province_id "ON"})


nil

In [243]:
; 🔒🔒🔒
;;; @check
(pprint (take-last 10 (load-abbreviated-cities)))

({:city "Assiginack", :province_id "ON"}
 {:city "Brébeuf", :province_id "QC"}
 {:city "Hudson Hope", :province_id "BC"}
 {:city "Prince", :province_id "ON"}
 {:city "Baie-du-Febvre", :province_id "QC"}
 {:city "Durham-Sud", :province_id "QC"}
 {:city "Melbourne", :province_id "QC"}
 {:city "Nipawin No. 487", :province_id "SK"}
 {:city "Duck Lake No. 463", :province_id "SK"}
 {:city "Oyen", :province_id "AB"})


nil

# Q9. `match?`

`(match? regexp city)`

- `regexp` is a regular expression.
- `city` is a hash-map representation of the city.
---
- returns `true` if and only if some key/value pair in `city` (as a string) matches `regexp`.

The string representation of a key `k`, value `v` pair is given by `(str k " " v)`
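
The definition above can be sketched directly with `re-find` and `some`. This assumes the city map uses keyword keys, so that `(str k " " v)` produces strings such as `":city Toronto"`; the name `match-sketch?` is hypothetical.

```clojure
;; Hypothetical name. For each key/value pair, builds (str k " " v)
;; and searches it with the regular expression; true if any pair matches.
(defn match-sketch? [regexp city]
  (boolean
    (some (fn [[k v]] (re-find regexp (str k " " v))) city)))

(def toronto {:city "Toronto" :population 5429524})

(match-sketch? #":city Toronto" toronto) ; => true
(match-sketch? #"population 5" toronto)  ; => true
(match-sketch? #"population 6" toronto)  ; => false
```

`re-find` matches anywhere in the string, so a pattern such as `#"population 5"` behaves like a "starts with 5" test on the population value.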

In [244]:
; ****Your Work****
;;; @solution

(defn match? [pattern city]
    (let [split (clojure.string/split (str pattern) #" ")]
        (if (= (count split) 1)
            (clojure.string/includes? (str city) (first split))
            (if (clojure.string/includes? (first split) ":")
                (clojure.string/includes? (str (get city (subs (first split) 1))) (last split))
                (clojure.string/includes? (str (get city (first split))) (last split))
            )
        )
    )
)

#'user/match?

In [245]:
; 🔒🔒🔒
;;; @check

(let [toronto (build-hash-map (load-header) (first (load-rows)))]
    (println "match? \"toronto\":" (match? #"Toronto" toronto))
    (println "match? population starting with 5:" (match? #"population 5" toronto))
    (println "match? population starting with 6:" (match? #"population 6" toronto))
    (println "match? M5T:" (match? #"M5T" toronto))
    (println "match? M5K:"(match? #"M5K" toronto)))

match? "toronto": true
match? population starting with 5: true
match? population starting with 6: false
match? M5T: true
match? M5K: false


nil

In [246]:
; 🔒🔒🔒
;;; @check

(let [toronto (abbreviate (build-hash-map (load-header) (first (load-rows))))]
    (println "match? \"toronto\":" (match? #"Toronto" toronto))
    (println "match? population starting with 5:" (match? #"population 5" toronto))
    (println "match? population starting with 6:" (match? #"population 6" toronto))
    (println "match? M5T:" (match? #"M5T" toronto))
    (println "match? M5K:"(match? #"M5K" toronto)))

match? "toronto": true
match? population starting with 5: false
match? population starting with 6: false
match? M5T: false
match? M5K: false


nil

# Q10. `query`

`(query pattern cities)`

- `pattern` is a regular expression.
- `cities` is a sequence of cities represented as hash-maps.
---
- returns all cities `x` (as hash-maps) in `cities` that satisfy the `(match? pattern x)` predicate.
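
Selecting the elements of a sequence that satisfy a predicate is what `filter` does, so the query can be sketched in one line. `match-sketch?` and `query-sketch` are hypothetical names, and the sample cities are a small in-memory stand-in for `(load-cities)` with keyword keys.

```clojure
;; Hypothetical predicate: true if some (str k " " v) matches the pattern.
(defn match-sketch? [pattern city]
  (boolean (some (fn [[k v]] (re-find pattern (str k " " v))) city)))

;; Hypothetical name: keeps only the cities that match the pattern.
(defn query-sketch [pattern cities]
  (filter (fn [city] (match-sketch? pattern city)) cities))

(def sample-cities
  [{:city "Toronto" :province_name "Ontario"}
   {:city "Oshawa"  :province_name "Ontario"}
   {:city "Calgary" :province_name "Alberta"}])

;; Queries compose: each one narrows the result of the previous one.
(->> sample-cities
     (query-sketch #"Ontario")
     (query-sketch #":city Osh")
     (map :city))
;; => ("Oshawa")
```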

In [253]:
; ****Your Work****
;;; @solution
(defn query [pattern cities]
    (for [x cities :when (match? pattern x)] x)
)

#'user/query

In [254]:
; 🔒🔒🔒
;;; @check

;
; Count the number of cities with "Toronto" as part of its entry.
; Note: many cities use the "America/Toronto" timezone.
;
(count (query #"Toronto" (load-cities)))

329

In [255]:
; 🔒🔒🔒
;;; @check

;
; Count the number of cities with ":city Toronto" as part of its entry.
; This limits the query to just city name.
;
(->> (load-cities)
    (query #":city Toronto")
    (count))

1

In [256]:
; 🔒🔒🔒
;;; @check

;
; Count the number of cities with "Ontario"
;
(->> (load-cities)
    (query #"Ontario")
    (count))

345

In [257]:
; 🔒🔒🔒
;;; @check

;
; Query the cities satisfying both query
; conditions:
; - "Ontario"
; - ":city Lav", i.e. city name starting with Lav.
;
(->> (load-cities)
    (query #"Ontario")
    (query #":city Lav")
    (count))

0

In [258]:
; 🔒🔒🔒
;;; @check
;
; Search for the city that Ontario Tech is part of by
; querying for its postal code, L1G.
;
(->> (load-cities)
     (query #"L1G")
     (first)
     (pprint))

{"city" "Oshawa",
 "timezone" "America/Toronto",
 "lng" -78.85,
 "id" 1124541904,
 "province_name" "Ontario",
 "postal" "L1L L1H L1J L1K L1G",
 "population" 166000,
 "province_id" "ON",
 "city_ascii" "Oshawa",
 "lat" 43.9,
 "ranking" 2,
 "density" 1027.0}


nil