Skip to content

Redundant code removal, CSV ID column name dependency removal, stress test support, and alternative color palette#244

Merged
nihalzp merged 14 commits intomgastner:devfrom
SaifKhan21:dev
Apr 21, 2025
Merged

Redundant code removal, CSV ID column name dependency removal, stress test support, and alternative color palette#244
nihalzp merged 14 commits intomgastner:devfrom
SaifKhan21:dev

Conversation

@SaifKhan21
Copy link
Copy Markdown
Contributor

@adisidev
A lot of new changes have been pushed so I decided to make a new pull request for them. The changes that have been made are as follows:

  • Removal of redundant code, mainly in read_geojson.cpp
  • Support for program to check for matching ID columns in CSV and GeoJSON, removing the need for order and name of ID column in the CSV
  • (minor) Formatting of code to Clang format file
  • Support for stress testing for new command line arguments (--output_equal_area_map, --plot_polygons, --export_preprocessed, and --export_time_report), note that --add-grid command testing has not been added as its related issue Add Grid to SVG Outputs #239 has not been addressed yet
  • Support for stress testing for missing Color and Inset columns
  • Added alternative color palette if some but not all colors have been provided when plotting polygons

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

Removal of redundant code, mainly in read_geojson.cpp

  • Renaming and repurposing of extract_unique_properties(), from where it initially just returned a vector of the unique keys in the GEOJSON file, it is now extract_unique_properties_map() which returns a map of the unique keys and item values of the GEOJSON file
  • Moving the error check for whether there are no unique properties inside the GEOJSON file from read_geojson() to extract_unique_properties()
  • Removing the redundant code in generate_csv_template() and have it use the new extract_unique_properties_map()
  • Passing the vector of unique keys as a parameter to extract_properties_map() instead of having it call the extract_unique_properties_map() again
  • Removing set_id_header() as it's only ever used in that one instance in read_geojson()

In the following commit, I restructured and renamed the class variable properties_map_ which was a std::map<std::string, std::map<std::string, std::string>> to unique_properties_map_ which is now a std::map<std::string, std::vector<std::string>>. To that end, I have deleted the now redundant extract_properties_map(). Instead, the new unique_properties_map_ class variable will be assigned the return value from extract_unique_properties_map().

Some variables that were initially declared as auto have also been assigned its proper variable type to make the code more readable as requested.

In the read_csv.cpp file, modifications have been made in update_header_info() to now utilize the new unique_properties_map_ class variable instead. Redundant code that assigned new values to the class variable has been removed as well since the variable is never used again once the function ends.

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

Support for program to check for matching ID columns in CSV and GeoJSON, removing the need for order and name of ID column in the CSV

Initially, the ID column in the CSV file needed to have the same name as the ID column in the GeoJSON file, even if the user provided the name of the CSV ID column in the arguments. In addition, if the user didn't provide the name of the ID column in the arguments, the first column in the CSV needed to be the ID column. Even if the ID column in the CSV file had the same name as the ID column in the GeoJSON file, if the column name was not provided and it isn't the first column, the program would not be able to match the columns.

Now, the program is able to go through the columns in both the CSV and GeoJSON files to find the matching ID columns, removing the need to have the ID columns to have the same name as each other. This also removes the need for the ID column to be the first column in the CSV file if the user doesn't include a column name in the arguments. If the user does provide the name of the CSV ID column in the arguments, the program will check that column first and try to match it with the GeoJSON columns.

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

(minor) Formatting of code to Clang format file

A quick minor change to ensure all files in the project adhered to the standardized code structure provided in the Clang format file.

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

Support for stress testing for new command line arguments and missing Color and Inset columns

Included more tests for the stress_test.sh shell script that includes testing for program command line argument features, such as --output_equal_area_map, --plot_polygons, --export_preprocessed, and --export_time_report. The program checks if the CSV file that's being tested has a flag mentioned in its filename and uses the appropriate flag if it does (if the CSV file has export_preprocessed in its filename, it will use the --export_preprocessed flag for it). This also means that copies of existing CSV files have been created with these added flags in their filenames.

Additional tests have been added as well, such as for CSV files that have no Color or Inset column. An additional test where the Color column has values for some but not all of its rows has been added as well, in order to test out the alternative color palette when auto coloring which is also included in this pull request.

Note that --add-grid command testing has not been added as its related issue #239 has not been addressed yet. Once the issue has been addressed and resolved, testing for the feature will be included as well.

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

Added alternative color palette if some but not all colors have been provided when plotting polygons

Requested in #242, an alternative color palette is used if some but not all rows in a CSV file has been provided a color value when plotting polygons (--plot_polygons). Instead of the default Pastel color palette (https://colorbrewer2.org/#type=qualitative&scheme=Pastel1&n=8) that is used if no colors are provided, the Dark color palette (https://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=8) is used in its place.

Based on my interpretation of the request, in the case where some but not all CSV rows have color values, the program will not change the color values of rows that already have them and will only apply the Dark color palette for the ones that don't. If this interpretation is incorrect and the program should replace all color values with the palette in the CSV, please let me know.

@nihalzp nihalzp requested a review from Copilot April 2, 2025 17:41
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request refactors various parts of the code by removing redundant code and reformatting functions to eliminate dependencies on CSV ID column names while introducing support for new stress test command line arguments and an alternative color palette. Key changes include:

  • Simplifying function declarations in header files with formatting updates.
  • Removing the redundant set_id_header function and updating update_id_header_info to accept an optional parameter and return an int.
  • Reordering and cleaning up include directives in the inset_state header.

Reviewed Changes

Copilot reviewed 24 out of 30 changed files in this pull request and generated 1 comment.

File Description
include/write_image.hpp Reformatted function declaration for clarity.
include/parse_arguments.hpp Simplified function declaration formatting.
include/inset_state.hpp Adjusted include order and reformatted transform_polygons declaration.
include/cartogram_info.hpp Revised property type and updated ID header-related functions.
Files not reviewed (6)
  • sample_data/algeria_by_wilaya_since_2019/algeria_population_by_wilaya_2022_no_colors.csv: Language not supported
  • sample_data/austria_by_state_since_1945/austria_population_by_state_2020_output_equal_area_map.csv: Language not supported
  • sample_data/belgium_by_region_since_1995/belgium_population_by_region_2022_plot_polygons.csv: Language not supported
  • sample_data/brazil_by_state_since_1988/brazil_population_by_state_2021_no_inset.csv: Language not supported
  • sample_data/croatia_by_county_since_1992/croatia_population_by_county_2021_export_preprocessed.csv: Language not supported
  • sample_data/germany_by_state_since_1990/germany_population_by_state_2011_export_time_report.csv: Language not supported

Comment thread include/cartogram_info.hpp Outdated
void project_to_equal_area();

void update_id_header_info(const std::string &);
int update_id_header_info(const std::optional<std::string> &);
Copy link

Copilot AI Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update_id_header_info function now returns an int instead of void. Ensure that all callers are updated to handle the new integer return type or document the expected behavior clearly.

Copilot uses AI. Check for mistakes.
@nihalzp nihalzp requested review from adisidev and nihalzp April 2, 2025 18:38
@SaifKhan21
Copy link
Copy Markdown
Contributor Author

@nihalzp
There have been multiple new changes to the dev branch since the last review. I will be going through each major change in their own separate comments.

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

Split up function into two according to purpose and removed redundant class variable
As discussed in our meeting on 4th April, the function update_id_header_info was seen as having too many different uses and purposes, to the point where its name did not accurately reflect what the function was doing. To solve that, the function was split into two functions, one called match_id_columns and another that is still called update_id_header_info.

The function match_id_columns has the purpose of finding the matching ID columns between the GeoJSON and CSV file. It returns a string which is the name of the ID column in the GeoJSON file, which is used later on. The function update_id_header_info is now a void function and has the sole purpose of updating the ID header info in the stored GeoJSON maps.

Another minor change in this commit is the removal of the unique_properties_ class variable which was redundant.

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

Restructured stress testing for flags for all sample data instead of specific data
Another issue discussed in the meeting on 4th April was the implementation of stress testing for all the different flags that can be enabled in the cartogram-cpp program.

Changes were made to restructure the stress testing for different flags entirely. Instead of the script testing the different flags only if the CSV file has the aforementioned flag in its filename ("belgium_population_by_region_2022_plot_polygons.csv" will trigger the test for --plot_polygons), the script now tests all the different flags for each of the GeoJSON and CSV files. With this change, the additional CSV files that were initially added to test using the CSV filename have been removed.

When stress testing the different flags, the script will first test all of the sample data files with no flags as per normal before proceeding to test all the sample data with one of the flags enabled such as --plot_polygons. After the testing for that has completed, it will test all the sample data with another flag. This continues until all the various flags have been tested. In the results folder, each flag will stores its outputs and lists of successful/failed runs in a separate folder (tests for --plot_polygons will be stored in the "plot_polygons" folder). To stress test all the different flags using the script, add the --flags flag when running the script file (bash stress_test.sh --flags).

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

Implement draft legend feature for inset polygon map
Implement full legend feature functionality for polygon maps
Finally, the biggest change since the last review is the implementation of the legends feature when the user runs --plot_polygons and --add_grid. Now when those flags are used, there will be a legend at the top left of the generated SVG files indicating how much the size of one grid cell is covered, be it area or population. In addition, the total area/population in each map is included as well. For example, on the Singapore sample data, one grid cell covers roughly 20 square kilometres on the equal area map and 100K people on the insetted map and displays a total of 783 square kilometres and 3.9 million people respectively. This feature mostly follows the same functionality that is available in cartogram-web and on go-cart.io.

Copy link
Copy Markdown
Collaborator

@nihalzp nihalzp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, it looks good.

Additionally, add const to variables that do not change over their lifetime to improve readability. It may also help the compiler optimise in some cases.

Comment thread src/cartogram_info/read_geojson.cpp Outdated
Comment on lines +410 to +411
auto first_element = unique_properties_map_.begin();
id_header_ = first_element->first;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto first_element = unique_properties_map_.begin();
id_header_ = first_element->first;
id_header_ = unique_properties_map_.begin()->first;


set_id_header(unique_properties_[0]);
unique_properties_map_ = extract_unique_properties_map(j);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an assertion here to ensure that unique_properties_map is not empty. While this is currently guaranteed by extract_unique_properties_map, future changes to its implementation may break this assumption.

Comment thread src/inset_state/write_image.cpp Outdated
Comment on lines +406 to +418
per_grid_cell_ = 1;

double total_area = total_inset_area();
// double total_area = equal_area_projection_area_to_earth_area(total_inset_area());
size_t multiplier_idx = 0;
while (per_grid_cell < 0.015 * total_area) {
per_grid_cell *= multipliers[multiplier_idx];
while (per_grid_cell_ < 0.015 * total_area) {
per_grid_cell_ *= multipliers[multiplier_idx];
multiplier_idx = (multiplier_idx + 1) % multipliers.size();
}
std::cerr << "Total area: " << total_area << std::endl;
std::cerr << "Per grid cell: " << per_grid_cell << std::endl;
// per_grid_cell = earth_area_to_equal_area_projection_area(per_grid_cell);
const double grid_line_spacing = sqrt(per_grid_cell);
std::cerr << "Per grid cell: " << per_grid_cell_ << std::endl;
// per_grid_cell_ = earth_area_to_equal_area_projection_area(per_grid_cell_);
const double grid_line_spacing = sqrt(per_grid_cell_);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not have per_grid_cell_ as a data member of InsetState. A better design could that we have a function that returns the value of per_grid_cell_, with parameters such as the InsetState object, equal_area (indicating whether the area is for cartogram or equal area map), multipliers.

Comment thread src/inset_state/write_image.cpp Outdated
Comment on lines +941 to +973
// Display value per grid cell in billions/millions/thousands
std::string grid_cell_label = "";
if (grid_cell_value >= 1000000000) {
int billions = grid_cell_value / 1000000000;
grid_cell_label = std::to_string(billions) + "B";
} else if (grid_cell_value >= 1000000) {
int millions = grid_cell_value / 1000000;
grid_cell_label = std::to_string(millions) + "M";
} else if (grid_cell_value >= 1000) {
int thousands = grid_cell_value / 1000;
grid_cell_label = std::to_string(thousands) + "K";
} else {
grid_cell_label = std::to_string(grid_cell_value);
}

// Display total value in billions/millions/thousands with 1 decimal place
std::string total_label = "Total: ";
std::stringstream sstream;
if (total_value >= 1000000000) {
double billions = (double)total_value / 1000000000;
sstream << std::fixed << std::setprecision(1) << billions;
total_label += sstream.str() + "B";
} else if (total_value >= 1000000) {
double millions = (double)total_value / 1000000;
sstream << std::fixed << std::setprecision(1) << millions;
total_label += sstream.str() + "M";
} else if (total_value >= 1000) {
double thousands = (double)total_value / 1000;
sstream << std::fixed << std::setprecision(1) << thousands;
total_label += sstream.str() + "K";
} else {
total_label += std::to_string(total_value);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if they were part of separate functions.

Comment thread tests/stress_test.sh
printf "All *.geojson, *.csv, *.svg, *.png and *.ps files deleted.\n" | color $red
exit ${failed}
fi

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end, it would be useful to display summary statistics grouped by flags. For example: 5 failures without any flag, 3 failures with the plot_polygons flag, etc. Additionally, for each failed test, we should display the exact command that can be run from the root folder to reproduce it.

@SaifKhan21
Copy link
Copy Markdown
Contributor Author

@nihalzp
I have implemented the various changes suggested in the pull request review. These changes are:

  • (read_geojson.cpp) Removing the use of the first_element variable and directly assigning id_header_ to the first key in unique_properties_map_
  • (read_geojson.cpp) Adding an assertion to ensure unique_properties_map_ is not empty after calling extract_unique_properties_map() to allow for easier debugging in the future
  • (write_image.cpp) Removed the per_grid_cell_ data variable from the InsetState class and and created a new function called compute_per_grid_cell() that returns the area per grid cell which is now used instead
  • (write_image.cpp) Removed lengthy grid cell and total label creation in write_legend_on_surface() and assigned it under a new function called get_legend_labels()
  • (stress_test.sh) When testing with all the various flags, the summary statistics of which tests passed and failed are now grouped by flags
  • (General) Added const keywords to variables which are never re-assigned new values
  • (General) Applied Clang formatting for files that were modified

@nihalzp nihalzp merged commit b8fbea3 into mgastner:dev Apr 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants