# An Exploration of Misleading Data Visualization Using Toronto Immunization Coverage Data

In this report, I present several visualizations built from immunization data from Toronto's Open Data Portal to demonstrate how leading language, visual color cues, and misrepresentation of data can mislead viewers of such graphics.

## Cherrypicking data

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import folium
from shapely.geometry import Point, Polygon

In [2]:
df_2019 = pd.read_csv("immunization-coverage-2018-2019.csv")

In [3]:
# select schools where the DTP or the MMR religious exemption rate is at least 10%
high_rer_rate_schools = df_2019[(df_2019["DTP Religious exemption rate (%)"] >= 10)
                                | (df_2019["MMR Religious exemption rate (%)"] >= 10)]
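
A note on this kind of filter: when a selection should match on either of two columns, the per-column comparisons have to be combined element-wise with `|`. Python's `or` between two non-empty strings simply returns the first string, so indexing a DataFrame with `"A" or "B"` reads one column only. The sketch below illustrates the difference on a made-up three-school table; the `toy` frame and its values are assumptions for illustration, not rows from the Toronto dataset.

```python
import pandas as pd

DTP = "DTP Religious exemption rate (%)"
MMR = "MMR Religious exemption rate (%)"

# Python's `or` returns its first truthy operand, so this is just the DTP label.
assert (DTP or MMR) == DTP

# Made-up rows for illustration only; not values from the Toronto dataset.
toy = pd.DataFrame({
    "School": ["A", "B", "C"],
    DTP: [12.0, 2.0, 1.0],
    MMR: [3.0, 11.0, 0.5],
})

# Indexing with `DTP or MMR` only ever looks at the DTP column.
dtp_only = toy[toy[DTP or MMR] >= 10]

# Element-wise OR (`|`), with each comparison parenthesized, checks both columns per row.
either = toy[(toy[DTP] >= 10) | (toy[MMR] >= 10)]

print(dtp_only["School"].tolist())  # ['A']
print(either["School"].tolist())    # ['A', 'B']
```

School B, whose MMR rate alone crosses the threshold, is only picked up by the element-wise version.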

In [4]:
map1_list = high_rer_rate_schools[['Lat','Lng']].values.tolist()
map1 = folium.Map(location=[43.6532,-79.3832], tiles='Stamen Toner', zoom_start=12)
# draw one red circle per selected school
for lat_lng in map1_list:
    folium.CircleMarker(lat_lng, radius=15, color='red', fill=True).add_to(map1)
map1

<h3><center>More than 10% of children in these Toronto schools are unimmunized</center></h3>

The language of the graphic's title, the red color of its markers, and the selective choice of which data points to show together convey an alarmist message. Out of context, this graphic could be used to distress parents who are comparing schools for their children.
With a cluster of points concentrated downtown, the markers may suggest to some parents that immunization exemptions are more prevalent in these areas than they had previously thought.

However, **the map does not even show all schools listed in the original data**, so viewers cannot judge what proportion of Toronto's schools is represented here. Furthermore, **the enrolled population at these schools is not considered along with their immunization rates**. Lastly, **the map does not specify which rate is shown: both MMR and DTP religious exemption rates have been considered**, and a school is included if either one is 10% or higher, with no indication of whether the MMR or the DTP religious exemption rate triggered its inclusion. DTP coverage refers to the rate of immunization for diphtheria, tetanus, and polio. MMR coverage refers to the rate of immunization for measles, mumps, and rubella.
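
The three gaps named above can each be expressed as a small calculation. The following is a minimal sketch on made-up data, not an analysis of the real file: the `toy` table, its values, and the column name `Enrolled population` are assumptions for illustration (the exemption-rate column names are the ones used in this notebook).

```python
import pandas as pd

DTP = "DTP Religious exemption rate (%)"
MMR = "MMR Religious exemption rate (%)"
ENROLLED = "Enrolled population"  # assumed column name, for illustration only

# Made-up rows; not values from the Toronto dataset.
toy = pd.DataFrame({
    "School": ["A", "B", "C", "D"],
    ENROLLED: [40, 600, 350, 500],
    DTP: [12.5, 1.0, 0.5, 2.0],
    MMR: [12.5, 1.5, 10.0, 0.0],
})

dtp_high = toy[DTP] >= 10
mmr_high = toy[MMR] >= 10
flagged = toy[dtp_high | mmr_high]

# 1. What proportion of all schools would the map show?
share_of_schools = len(flagged) / len(toy)

# 2. What proportion of all enrolled students attend a flagged school?
share_of_students = flagged[ENROLLED].sum() / toy[ENROLLED].sum()

# 3. Which rate triggered each school's inclusion?
toy["Trigger"] = "neither"
toy.loc[dtp_high & ~mmr_high, "Trigger"] = "DTP only"
toy.loc[~dtp_high & mmr_high, "Trigger"] = "MMR only"
toy.loc[dtp_high & mmr_high, "Trigger"] = "both"

print(f"{share_of_schools:.0%} of schools, {share_of_students:.0%} of students")
print(toy[["School", "Trigger"]])
```

In this toy table half of the schools are flagged, but they account for only about a quarter of enrolled students, because one flagged school has just 40 students. A map that draws every marker at the same size hides that difference.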

To provide a more contextualized look at the data, the following revision shows what this visualization looks like with all schools from the original dataset included (while still omitting the enrolled population and still not specifying which immunization is being considered).

In [5]:
# all remaining schools: both religious exemption rates are below 10%
other_schools = df_2019[(df_2019["DTP Religious exemption rate (%)"] < 10)
                        & (df_2019["MMR Religious exemption rate (%)"] < 10)]

In [6]:
map2_list = other_schools[['Lat','Lng']].values.tolist()
# add one blue circle per remaining school to the existing map
for lat_lng in map2_list:
    folium.CircleMarker(lat_lng, radius=15, color='blue', fill=True).add_to(map1)
map1

<h3><center>Red circles represent Toronto schools with a religious exemption rate of 10% or higher for either DTP or MMR immunization</center></h3>

With more accurate titling, and inclusion of all datapoints from the original dataset, much of the alarmist impact of the first graphic is reduced.

## Misrepresentation of data 

To explore an even more dubious approach to this dataset, in the following steps I create a new "measure", the "Artificial DTP Exemption %". It is computed by subtracting the original dataset's "DTP coverage rate (%)" from 100.

In [7]:
# Manufacturing a measure of exemption from DTP immunization, based on coverage rate
df_2019["Artificial DTP Exemption %"] = 100 - df_2019["DTP coverage rate (%)"]

# Selecting schools with a manufactured DTP exemption rate over 10%
high_rer_from_artificial = df_2019[df_2019["Artificial DTP Exemption %"] > 10]
len(high_rer_from_artificial)

377

In [8]:
map3_list = high_rer_from_artificial[['Lat','Lng']].values.tolist()
map3 = folium.Map(location=[43.6932,-79.3872], tiles='Stamen Toner', zoom_start=12)
# draw one red circle per school selected by the manufactured rate
for lat_lng in map3_list:
    folium.CircleMarker(lat_lng, radius=15, color='red', fill=True).add_to(map3)
map3

<h6><div style="text-align: right">*calculated from <i>(100% - 2019 DTP coverage %)</i> </div></h6>
<h3><center>Red circles represent Toronto schools with more than 10% children unimmunized for DTP</center></h3>

An *even more* questionable approach has been taken in creating the above graphic.

The original data from which the graphic was created has been manipulated, but the manipulation has been reduced to an asterisk, which indicates that the unimmunized rate has been calculated from the immunization coverage rate. While this might seem "reasonable" at first, **this rate is completely artificial, as this column does not exist in the original data**. The original data provides an immunization coverage rate and a religious exemption rate. These two proportions do not cover the entire student population: the remaining percentage consists of students who may be exempt for medical or administrative reasons, or who are not complete for age for immunization. This information is found in the immunization data's README file.
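
A short worked example makes the gap between the two rates concrete. The figures below are made up for a single hypothetical school, and the sketch assumes that the coverage rate and the religious exemption rate describe non-overlapping groups of students, so that whatever is left over falls into the other categories described above.

```python
def remainder_pct(coverage_pct, religious_exemption_pct):
    """Percentage of students who are neither covered nor religiously exempt.

    Assumes the two input rates describe non-overlapping groups.
    """
    return 100.0 - coverage_pct - religious_exemption_pct


# Made-up figures for one hypothetical school; not from the Toronto dataset.
coverage = 88.0            # DTP coverage rate (%)
religious_exemption = 1.5  # DTP religious exemption rate (%)

# The manufactured rate lumps every student who is not covered into one number.
artificial_rate = 100.0 - coverage

# The part of that number that is not a religious exemption at all.
other = remainder_pct(coverage, religious_exemption)

print(artificial_rate)           # 12.0
print(other)                     # 10.5
print(artificial_rate > 10)      # True: the school gets a red circle
print(religious_exemption > 10)  # False: its actual religious exemption rate is low
```

In this hypothetical case the school is flagged even though only 1.5% of its students hold a religious exemption; the other 10.5 percentage points come from categories the graphic never mentions.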

**Creating an artificial rate from existing data, and combining different kinds of exemptions in the process, creates a misleading picture of the original data.** Presenting it this way would be an unethical data practice.

If interrogated with some of Darrell Huff's famous questions (specifically, *what's missing?* and *did somebody change the subject?*), these map visualizations would be revealed as deceptive: they rely on biased selection processes, and on language and visual cues reminiscent of "hotspots", to manipulate the viewer's emotional response. Open data practices, including sharing the data processing methods behind a visualization, are integral to combating such manipulation in information visualization.

## References

Huff, D. (1954). How to talk back to a statistic. In <i>How to lie with statistics</i>. 

Open Toronto. (2019). <i>Immunization coverage for students 2018-2019</i> [Data set]. https://open.toronto.ca/dataset/immunization-coverage-for-students/