Redundant code removal, CSV ID column name dependency removal, stress test support, and alternative color palette by SaifKhan21 · Pull Request #244 · mgastner/cartogram-cpp

SaifKhan21 · 2025-04-02T15:02:53Z

@adisidev
A lot of new changes have been pushed so I decided to make a new pull request for them. The changes that have been made are as follows:

Removal of redundant code, mainly in read_geojson.cpp
Support for program to check for matching ID columns in CSV and GeoJSON, removing the need for order and name of ID column in the CSV
(minor) Formatting of code to Clang format file
Support for stress testing for new command line arguments (--output_equal_area_map, --plot_polygons, --export_preprocessed, and --export_time_report), note that --add-grid command testing has not been added as its related issue Add Grid to SVG Outputs #239 has not been addressed yet
Support for stress testing for missing Color and Inset columns
Added alternative color palette if some but not all colors have been provided when plotting polygons

…and/or first column

…/inset columns

…or plotting polygons

SaifKhan21 · 2025-04-02T15:06:13Z

Removal of redundant code, mainly in read_geojson.cpp

Renaming and repurposing of extract_unique_properties(), from where it initially just returned a vector of the unique keys in the GEOJSON file, it is now extract_unique_properties_map() which returns a map of the unique keys and item values of the GEOJSON file
Moving the error check for whether there are no unique properties inside the GEOJSON file from read_geojson() to extract_unique_properties()
Removing the redundant code in generate_csv_template() and have it use the new extract_unique_properties_map()
Passing the vector of unique keys as a parameter to extract_properties_map() instead of having it call the extract_unique_properties_map() again
Removing set_id_header() as it's only ever used in that one instance in read_geojson()

In the following commit, I restructured and renamed the class variable properties_map_ which was a std::map<std::string, std::map<std::string, std::string>> to unique_properties_map_ which is now a std::map<std::string, std::vector<std::string>>. To that end, I have deleted the now redundant extract_properties_map(). Instead, the new unique_properties_map_ class variable will be assigned the return value from extract_unique_properties_map().

Some variables that were initially declared as auto have also been assigned its proper variable type to make the code more readable as requested.

In the read_csv.cpp file, modifications have been made in update_header_info() to now utilize the new unique_properties_map_ class variable instead. Redundant code that assigned new values to the class variable has been removed as well since the variable is never used again once the function ends.

SaifKhan21 · 2025-04-02T15:19:52Z

Support for program to check for matching ID columns in CSV and GeoJSON, removing the need for order and name of ID column in the CSV

Initially, the ID column in the CSV file needed to have the same name as the ID column in the GeoJSON file, even if the user provided the name of the CSV ID column in the arguments. In addition, if the user didn't provide the name of the ID column in the arguments, the first column in the CSV needed to be the ID column. Even if the ID column in the CSV file had the same name as the ID column in the GeoJSON file, if the column name was not provided and it isn't the first column, the program would not be able to match the columns.

Now, the program is able to go through the columns in both the CSV and GeoJSON files to find the matching ID columns, removing the need to have the ID columns to have the same name as each other. This also removes the need for the ID column to be the first column in the CSV file if the user doesn't include a column name in the arguments. If the user does provide the name of the CSV ID column in the arguments, the program will check that column first and try to match it with the GeoJSON columns.

SaifKhan21 · 2025-04-02T15:21:17Z

(minor) Formatting of code to Clang format file

A quick minor change to ensure all files in the project adhered to the standardized code structure provided in the Clang format file.

SaifKhan21 · 2025-04-02T15:45:12Z

Support for stress testing for new command line arguments and missing Color and Inset columns

Included more tests for the stress_test.sh shell script that includes testing for program command line argument features, such as --output_equal_area_map, --plot_polygons, --export_preprocessed, and --export_time_report. The program checks if the CSV file that's being tested has a flag mentioned in its filename and uses the appropriate flag if it does (if the CSV file has export_preprocessed in its filename, it will use the --export_preprocessed flag for it). This also means that copies of existing CSV files have been created with these added flags in their filenames.

Additional tests have been added as well, such as for CSV files that have no Color or Inset column. An additional test where the Color column has values for some but not all of its rows has been added as well, in order to test out the alternative color palette when auto coloring which is also included in this pull request.

Note that --add-grid command testing has not been added as its related issue #239 has not been addressed yet. Once the issue has been addressed and resolved, testing for the feature will be included as well.

SaifKhan21 · 2025-04-02T15:57:13Z

Added alternative color palette if some but not all colors have been provided when plotting polygons

Requested in #242, an alternative color palette is used if some but not all rows in a CSV file has been provided a color value when plotting polygons (--plot_polygons). Instead of the default Pastel color palette (https://colorbrewer2.org/#type=qualitative&scheme=Pastel1&n=8) that is used if no colors are provided, the Dark color palette (https://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=8) is used in its place.

Based on my interpretation of the request, in the case where some but not all CSV rows have color values, the program will not change the color values of rows that already have them and will only apply the Dark color palette for the ones that don't. If this interpretation is incorrect and the program should replace all color values with the palette in the CSV, please let me know.

Copilot

Pull Request Overview

This pull request refactors various parts of the code by removing redundant code and reformatting functions to eliminate dependencies on CSV ID column names while introducing support for new stress test command line arguments and an alternative color palette. Key changes include:

Simplifying function declarations in header files with formatting updates.
Removing the redundant set_id_header function and updating update_id_header_info to accept an optional parameter and return an int.
Reordering and cleaning up include directives in the inset_state header.

Reviewed Changes

Copilot reviewed 24 out of 30 changed files in this pull request and generated 1 comment.

File	Description
include/write_image.hpp	Reformatted function declaration for clarity.
include/parse_arguments.hpp	Simplified function declaration formatting.
include/inset_state.hpp	Adjusted include order and reformatted transform_polygons declaration.
include/cartogram_info.hpp	Revised property type and updated ID header-related functions.

Files not reviewed (6)

sample_data/algeria_by_wilaya_since_2019/algeria_population_by_wilaya_2022_no_colors.csv: Language not supported
sample_data/austria_by_state_since_1945/austria_population_by_state_2020_output_equal_area_map.csv: Language not supported
sample_data/belgium_by_region_since_1995/belgium_population_by_region_2022_plot_polygons.csv: Language not supported
sample_data/brazil_by_state_since_1988/brazil_population_by_state_2021_no_inset.csv: Language not supported
sample_data/croatia_by_county_since_1992/croatia_population_by_county_2021_export_preprocessed.csv: Language not supported
sample_data/germany_by_state_since_1990/germany_population_by_state_2011_export_time_report.csv: Language not supported

Copilot · 2025-04-02T17:42:12Z

  void project_to_equal_area();

-  void update_id_header_info(const std::string &);
+  int update_id_header_info(const std::optional<std::string> &);


The update_id_header_info function now returns an int instead of void. Ensure that all callers are updated to handle the new integer return type or document the expected behavior clearly.

… class variable

…specific data

SaifKhan21 · 2025-04-14T08:18:50Z

@nihalzp
There have been multiple new changes to the dev branch since the last review. I will be going through each major change in their own separate comments.

SaifKhan21 · 2025-04-14T08:30:10Z

Split up function into two according to purpose and removed redundant class variable
As discussed in our meeting on 4th April, the function update_id_header_info was seen as having too many different uses and purposes, to the point where its name did not accurately reflect what the function was doing. To solve that, the function was split into two functions, one called match_id_columns and another that is still called update_id_header_info.

The function match_id_columns has the purpose of finding the matching ID columns between the GeoJSON and CSV file. It returns a string which is the name of the ID column in the GeoJSON file, which is used later on. The function update_id_header_info is now a void function and has the sole purpose of updating the ID header info in the stored GeoJSON maps.

Another minor change in this commit is the removal of the unique_properties_ class variable which was redundant.

SaifKhan21 · 2025-04-14T08:47:36Z

Restructured stress testing for flags for all sample data instead of specific data
Another issue discussed in the meeting on 4th April was the implementation of stress testing for all the different flags that can be enabled in the cartogram-cpp program.

Changes were made to restructure the stress testing for different flags entirely. Instead of the script testing the different flags only if the CSV file has the aforementioned flag in its filename ("belgium_population_by_region_2022_plot_polygons.csv" will trigger the test for --plot_polygons), the script now tests all the different flags for each of the GeoJSON and CSV files. With this change, the additional CSV files that were initially added to test using the CSV filename have been removed.

When stress testing the different flags, the script will first test all of the sample data files with no flags as per normal before proceeding to test all the sample data with one of the flags enabled such as --plot_polygons. After the testing for that has completed, it will test all the sample data with another flag. This continues until all the various flags have been tested. In the results folder, each flag will stores its outputs and lists of successful/failed runs in a separate folder (tests for --plot_polygons will be stored in the "plot_polygons" folder). To stress test all the different flags using the script, add the --flags flag when running the script file (bash stress_test.sh --flags).

SaifKhan21 · 2025-04-14T08:58:21Z

Implement draft legend feature for inset polygon map
Implement full legend feature functionality for polygon maps
Finally, the biggest change since the last review is the implementation of the legends feature when the user runs --plot_polygons and --add_grid. Now when those flags are used, there will be a legend at the top left of the generated SVG files indicating how much the size of one grid cell is covered, be it area or population. In addition, the total area/population in each map is included as well. For example, on the Singapore sample data, one grid cell covers roughly 20 square kilometres on the equal area map and 100K people on the insetted map and displays a total of 783 square kilometres and 3.9 million people respectively. This feature mostly follows the same functionality that is available in cartogram-web and on go-cart.io.

nihalzp

Overall, it looks good.

Additionally, add const to variables that do not change over their lifetime to improve readability. It may also help the compiler optimise in some cases.

nihalzp · 2025-04-14T15:27:07Z

+  auto first_element = unique_properties_map_.begin();
+  id_header_ = first_element->first;


Suggested change

auto first_element = unique_properties_map_.begin();

id_header_ = first_element->first;

id_header_ = unique_properties_map_.begin()->first;

nihalzp · 2025-04-14T15:34:40Z

-
-  set_id_header(unique_properties_[0]);
+  unique_properties_map_ = extract_unique_properties_map(j);



Add an assertion here to ensure that unique_properties_map is not empty. While this is currently guaranteed by extract_unique_properties_map, future changes to its implementation may break this assumption.

nihalzp · 2025-04-14T15:45:00Z

+    per_grid_cell_ = 1;

    double total_area = total_inset_area();
    // double total_area = equal_area_projection_area_to_earth_area(total_inset_area());
    size_t multiplier_idx = 0;
-    while (per_grid_cell < 0.015 * total_area) {
-      per_grid_cell *= multipliers[multiplier_idx];
+    while (per_grid_cell_ < 0.015 * total_area) {
+      per_grid_cell_ *= multipliers[multiplier_idx];
      multiplier_idx = (multiplier_idx + 1) % multipliers.size();
    }
    std::cerr << "Total area: " << total_area << std::endl;
-    std::cerr << "Per grid cell: " << per_grid_cell << std::endl;
-    // per_grid_cell = earth_area_to_equal_area_projection_area(per_grid_cell);
-    const double grid_line_spacing = sqrt(per_grid_cell);
+    std::cerr << "Per grid cell: " << per_grid_cell_ << std::endl;
+    // per_grid_cell_ = earth_area_to_equal_area_projection_area(per_grid_cell_);
+    const double grid_line_spacing = sqrt(per_grid_cell_);


I think we should not have per_grid_cell_ as a data member of InsetState. A better design could that we have a function that returns the value of per_grid_cell_, with parameters such as the InsetState object, equal_area (indicating whether the area is for cartogram or equal area map), multipliers.

nihalzp · 2025-04-14T15:49:43Z

+  // Display value per grid cell in billions/millions/thousands
+  std::string grid_cell_label = "";
+  if (grid_cell_value >= 1000000000) {
+    int billions = grid_cell_value / 1000000000;
+    grid_cell_label = std::to_string(billions) + "B";
+  } else if (grid_cell_value >= 1000000) {
+    int millions = grid_cell_value / 1000000;
+    grid_cell_label = std::to_string(millions) + "M";
+  } else if (grid_cell_value >= 1000) {
+    int thousands = grid_cell_value / 1000;
+    grid_cell_label = std::to_string(thousands) + "K";
+  } else {
+    grid_cell_label = std::to_string(grid_cell_value);
+  }
+
+  // Display total value in billions/millions/thousands with 1 decimal place
+  std::string total_label = "Total: ";
+  std::stringstream sstream;
+  if (total_value >= 1000000000) {
+    double billions = (double)total_value / 1000000000;
+    sstream << std::fixed << std::setprecision(1) << billions;
+    total_label += sstream.str() + "B";
+  } else if (total_value >= 1000000) {
+    double millions = (double)total_value / 1000000;
+    sstream << std::fixed << std::setprecision(1) << millions;
+    total_label += sstream.str() + "M";
+  } else if (total_value >= 1000) {
+    double thousands = (double)total_value / 1000;
+    sstream << std::fixed << std::setprecision(1) << thousands;
+    total_label += sstream.str() + "K";
+  } else {
+    total_label += std::to_string(total_value);
+  }


It would be better if they were part of separate functions.

nihalzp · 2025-04-14T16:00:48Z

+  printf "All *.geojson, *.csv, *.svg, *.png and *.ps files deleted.\n" | color $red
  exit ${failed}
 fi



At the end, it would be useful to display summary statistics grouped by flags. For example: 5 failures without any flag, 3 failures with the plot_polygons flag, etc. Additionally, for each failed test, we should display the exact command that can be run from the root folder to reproduce it.

SaifKhan21 · 2025-04-16T08:38:04Z

@nihalzp
I have implemented the various changes suggested in the pull request review. These changes are:

(read_geojson.cpp) Removing the use of the first_element variable and directly assigning id_header_ to the first key in unique_properties_map_
(read_geojson.cpp) Adding an assertion to ensure unique_properties_map_ is not empty after calling extract_unique_properties_map() to allow for easier debugging in the future
(write_image.cpp) Removed the per_grid_cell_ data variable from the InsetState class and and created a new function called compute_per_grid_cell() that returns the area per grid cell which is now used instead
(write_image.cpp) Removed lengthy grid cell and total label creation in write_legend_on_surface() and assigned it under a new function called get_legend_labels()
(stress_test.sh) When testing with all the various flags, the summary statistics of which tests passed and failed are now grouped by flags
(General) Added const keywords to variables which are never re-assigned new values
(General) Applied Clang formatting for files that were modified

SaifKhan21 and others added 8 commits March 25, 2025 20:32

Repurposed unique properties function and removed redundant code

4b34910

Removed extract properties map function and repurposed class variable

a0f352d

CSV ID column name no longer needs to match GeoJSON header name

f8acec9

Merge branch 'mgastner:dev' into dev

bd46ba6

Formatted code to Clang format file

674a182

Every CSV column is checked for ID header instead of just given name …

f6f5127

…and/or first column

Added stress testing support for new command line arguments and color…

be2b2a2

…/inset columns

Added alternative color palette if some but not all colors provided f…

27d4dc9

…or plotting polygons

nihalzp requested a review from Copilot April 2, 2025 17:41

Copilot AI reviewed Apr 2, 2025

View reviewed changes

nihalzp requested review from adisidev and nihalzp April 2, 2025 18:38

SaifKhan21 added 4 commits April 5, 2025 02:32

Split up function into two according to purpose and removed redundant…

1bdc917

… class variable

Restructured stress testing for flags for all sample data instead of …

a3ecaf9

…specific data

Implement draft legend feature for inset polygon map

51fe119

Implement full legend feature functionality for polygon maps

05b5fa4

nihalzp reviewed Apr 14, 2025

View reviewed changes

SaifKhan21 added 2 commits April 16, 2025 16:05

Implement changes from pull request review mgastner#244

34e3db1

Unassign legend labels function from InsetState class

886ac9d

nihalzp approved these changes Apr 21, 2025

View reviewed changes

nihalzp merged commit b8fbea3 into mgastner:dev Apr 21, 2025

		auto first_element = unique_properties_map_.begin();
		id_header_ = first_element->first;

	auto first_element = unique_properties_map_.begin();
	id_header_ = first_element->first;
	id_header_ = unique_properties_map_.begin()->first;


		set_id_header(unique_properties_[0]);
		unique_properties_map_ = extract_unique_properties_map(j);

Conversation

SaifKhan21 commented Apr 2, 2025

Uh oh!

SaifKhan21 commented Apr 2, 2025

Uh oh!

SaifKhan21 commented Apr 2, 2025

Uh oh!

SaifKhan21 commented Apr 2, 2025

Uh oh!

SaifKhan21 commented Apr 2, 2025

Uh oh!

SaifKhan21 commented Apr 2, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

SaifKhan21 commented Apr 14, 2025

Uh oh!

SaifKhan21 commented Apr 14, 2025

Uh oh!

SaifKhan21 commented Apr 14, 2025

Uh oh!

SaifKhan21 commented Apr 14, 2025

Uh oh!

nihalzp left a comment

Choose a reason for hiding this comment

Uh oh!

nihalzp Apr 14, 2025

Choose a reason for hiding this comment

Uh oh!

nihalzp Apr 14, 2025

Choose a reason for hiding this comment

Uh oh!

nihalzp Apr 14, 2025

Choose a reason for hiding this comment

Uh oh!

nihalzp Apr 14, 2025

Choose a reason for hiding this comment

Uh oh!

nihalzp Apr 14, 2025

Choose a reason for hiding this comment

Uh oh!

SaifKhan21 commented Apr 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants