# 1. Abstract

[Gichoya et al.](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00063-2/fulltext) created a model that can predict a patient's race with near-perfect accuracy from a chest radiograph alone. In this project, we set out to reproduce their results. We took pretrained ResNet18 and EfficientNetB0 models, trained them on a subset of the [CheXpert](https://stanfordmlgroup.github.io/competitions/chexpert/) dataset containing equal proportions of Black, White and Asian patients, and tested them on another subset of CheXpert. Some models were trained only on frontal chest X-rays and some on both lateral and frontal chest X-rays, but all models achieved around 75% accuracy and displayed no gender bias. Compared to Gichoya et al.'s results, our models still need optimization, as well as more bias tests to confirm that they are functioning properly. However, given the constraints on time and computing power, we believe that this project has been relatively successful.

The GitHub repository for the project code can be found [here](https://github.com/kennyerss/csci451-project).

# 2. Introduction

Race-based medicine, according to [Cerdena et al.](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(2032076-6/fulltext), is characterized by medical research that treats race as an essential and biological factor. When translated into clinical practice, race-based medicine relies on racial stereotypes and leads to faulty, inequitable care. For example, because Asian patients are presumed to have more visceral body fat than patients of other races, they are considered to be at higher risk for diabetes. Because race-based medicine serves as a shortcut that lets doctors work around a patient's personal pathological history, it can either exaggerate or underestimate a patient's risk for certain illnesses. Not only can race-based medicine reinforce racist stereotypes, it can also limit access to treatment for patients wrongly considered lower-risk for a disease.

It is understandable that pathological information about a patient can be extracted from their chest X-ray. However, [Adleberg et al.](https://pubmed.ncbi.nlm.nih.gov/35964688/) created a deep learning model that can extract self-reported information such as age, gender, race, ethnicity and insurance status with almost 100% certainty. Similarly, [Gichoya et al.](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00063-2/fulltext) created a model that can predict race from chest X-rays. They also confirmed that disease distribution and body habitus among patients did not strongly affect the predictions of their model. This means that their model, and by extension other deep learning models used in medicine, can pick up a patient's race from their medical images. This could lead to race-specific errors that clinical radiologists without access to demographic information would not be able to detect, resulting in faulty medical decision-making.

Our goal in undertaking this project is to find out whether deep learning models can detect race from racially ambiguous data such as chest X-rays. We would also like to investigate the potential ethical risks that the success of this project poses. First, the results could be misused to justify racial inequalities by pointing to identifiable physiological differences between the races. Second, this project may introduce a new possibility for surveillance: medical surveillance.

# 3. Values Statement

Racial classification from chest X-rays is not a popular topic, so we do not believe that there are any users for this model. The only instance in which we think this model could be useful is when we need to expose a celebrity for cultural appropriation by showing that their chest X-ray does not belong in that cultural group per se; but of course, this is a stretch. Despite the lack of people who might actually use the model, the results of this project may be used to justify racial inequalities, because the model appears to show that there are identifiable differences between the races: not just superficial differences in skin tone or hair texture, but differences in bones and other subcutaneous features that we cannot see. If this model fell into the hands of a eugenicist, the repercussions would be dire. In that case, this model would work in the direction that Dr. Timnit Gebru has warned us against: eugenics comes back to us again and again in increasingly progressive and scientific forms.

In short, potential users of our project are propagandists working to discriminate against minorities. Even though this is not our goal, it will undoubtedly marginalize a lot of people if eugenicists happen upon our results. However, since we are not the first to undertake this project, and there are more comprehensive studies of the same subject by Adleberg et al. and Gichoya et al., we believe that we are not worsening this risk.

As for our goal, we aim to check whether it is indeed possible to classify race this way. We also want to know what makes this possible: whether the model is picking up on something that is not indicative of race and using that to make its decision, or whether there is indeed a racial difference. In the past, we learned about an image classification model that was trained to detect criminals but turned out to detect whether someone was smiling. This could be the case for our model, but until there is an actual test, we cannot conclude anything about how exactly our model is making its decisions.

Gichoya et al., however, have done several tests to confirm that their model was indeed using physiological differences to make its classification. If we were allowed more time and computing power, we might arrive at the same conclusion, which would work in favor of our professed fears. Therefore, our second deliverable, to provide a rebuttal against those who may extrapolate our results to their ends, is a paper demonstrating our findings on the ambiguity of race in medicine.

The potential justification of racial inequalities is, of course, still potential. We have yet to receive a newsletter extrapolating the results of Gichoya et al. to call for the reintroduction of racial segregation. However, the success of this project still shows us a very real, ongoing injustice: deep learning models currently used in medicine are also capable of identifying a patient's race from their racially ambiguous medical images, thus turning the patient's race into a vector in the decision-making matrix of the model. Again, this works in the direction that Dr. Gebru has warned us against: inequitable medical care administered by supposedly fair machines. Working on this project, we do not aim to solve a problem, but to see what the problem is.

# 4. Materials and Methods

## Our Data

We used the [CheXpert dataset](https://stanfordmlgroup.github.io/competitions/chexpert/) collected by [Irvin & Rajpurkar et al.](https://arxiv.org/abs/1901.07031). This dataset contains 224,316 frontal and lateral chest radiographs of 65,240 patients. Each radiograph is labeled with information such as age, gender, race, ethnicity and medical conditions, but we are primarily concerned with race and gender.

White patients make up the vast majority of this dataset, as shown in the following figure, and we were concerned that this could bias the model's predictions toward the majority class.

![imbalance](imbalance.png)

To account for this imbalance, we trained our model on a racially balanced subset of the CheXpert dataset. Even though there are more male than female patients in this training set, we later found that the model does not appear to exhibit gender bias.

![balance1](balance.png) ![balance2](gender.png)
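
The balancing step can be sketched as drawing the same number of records at random from each racial group. This is a minimal illustration under assumptions, not the project's actual code: the function `balanced_subset`, the record keys `path` and `race`, and the toy file names are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(records, label_key, per_class, seed=0):
    """Return `per_class` randomly chosen records for every label value."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the records by their label (here: self-reported race).
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"only {len(group)} records for {label!r}")
        subset.extend(rng.sample(group, per_class))  # sample without replacement

    rng.shuffle(subset)  # avoid long runs of a single class
    return subset


# Toy imbalanced data standing in for the real demographic table (hypothetical).
toy_records = (
    [{"path": f"white_{i}.jpg", "race": "White"} for i in range(50)]
    + [{"path": f"black_{i}.jpg", "race": "Black"} for i in range(12)]
    + [{"path": f"asian_{i}.jpg", "race": "Asian"} for i in range(9)]
)
toy_balanced = balanced_subset(toy_records, label_key="race", per_class=5)
```

Sampling down to the size of the smallest group discards data from the majority class, which is the price paid for equal class proportions.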

## Our Method

We trained our model on 10,000 frontal chest X-rays, such as the one in the following figure, with race as the target label. We used only 10,000 images because of limited computing power. This subset is equally divided among Asian, Black and White patients, and excludes other races to keep the classification task simple.

![chest X-ray](view1_frontal.jpg)

As for our models, we used ResNet and EfficientNet because they are popular deep learning architectures for image classification. Specifically, we used pretrained EfficientNetB0 and ResNet18 models. We also implemented some ResNet18 models on our own but achieved a lower accuracy.

For the training data, we accessed the .csv file containing demographic details about the patients, extracted the path to each radiograph, and labeled each image with its owner's race. The images were converted into tensors and loaded for training, and the models were trained using the Adam optimizer. Because of the large size of the data, we did all training and testing on Google Colab.
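
The labeling step described above might look roughly like the following sketch. The column names `Path` and `Race`, the mapping `RACE_TO_INDEX`, and the toy rows are assumptions made for illustration; the real demographic file may use different headers and category names.

```python
import csv
import io

# Hypothetical mapping from race category to integer class index.
RACE_TO_INDEX = {"Asian": 0, "Black": 1, "White": 2}


def load_labeled_paths(csv_file, path_col="Path", race_col="Race"):
    """Read (image path, class index) pairs, skipping races outside the mapping."""
    samples = []
    for row in csv.DictReader(csv_file):
        label = RACE_TO_INDEX.get(row[race_col])
        if label is None:
            continue  # excluded category, as in the three-class setup
        samples.append((row[path_col], label))
    return samples


# Toy CSV content standing in for the real file (hypothetical rows).
toy_csv = io.StringIO(
    "Path,Race\n"
    "patient00001/study1/view1_frontal.jpg,White\n"
    "patient00002/study1/view1_frontal.jpg,Asian\n"
    "patient00003/study1/view1_frontal.jpg,Other\n"
    "patient00004/study1/view1_frontal.jpg,Black\n"
)
toy_samples = load_labeled_paths(toy_csv)
```

In a PyTorch pipeline, pairs like these would typically back a `Dataset` whose `__getitem__` opens the image at the given path, converts it to a tensor, and returns it together with the class index.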

To optimize a model, we trained it on 10,000 images in a loop over different learning rates for the Adam optimizer and $\gamma$ values for the exponential scheduler. In the same loop, we then tested the model on 2,500 images to find the best-performing combination of parameters. Cross-entropy loss was used for all models.
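
The search loop can be sketched as a plain grid search over learning rate and $\gamma$. In the sketch below, `grid_search` is a hypothetical helper, and `fake_train_and_validate` is a stand-in that returns a made-up score so the example runs without a GPU; in practice it would train a model with the given settings and return its accuracy on the 2,500 held-out images. The scores it produces are not experimental results.

```python
from itertools import product


def grid_search(train_and_validate, learning_rates, gammas):
    """Evaluate every (lr, gamma) pair and return the best pair plus all scores."""
    results = {}
    for lr, gamma in product(learning_rates, gammas):
        # One full train-then-validate run per parameter combination.
        results[(lr, gamma)] = train_and_validate(lr=lr, gamma=gamma)
    best = max(results, key=results.get)  # pair with the highest validation score
    return best, results


def fake_train_and_validate(lr, gamma):
    # Placeholder scoring function, NOT real training: an arbitrary surface
    # that peaks at lr=0.001, gamma=0.735 purely for demonstration.
    return 1.0 - abs(lr - 0.001) * 100 - abs(gamma - 0.735)


best_params, all_scores = grid_search(
    fake_train_and_validate,
    learning_rates=[0.001, 0.005, 0.01],
    gammas=[0.6, 0.735, 0.9],
)
```

Because the same 2,500 images are used both to pick the parameters and to report the score, a separate held-out test set would give a less optimistic estimate of accuracy.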

As mentioned before, there may be a gender bias in our model because there are more male than female patients in our training dataset. We checked for this by splitting our test set into male and female subsets and testing the model on each one. We then examined gender bias by comparing the accuracy and confusion matrix for each subset.
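
A minimal sketch of this per-subset check is shown below. The helper names and the toy labels, predictions, and gender tags are hypothetical; class indices 0, 1, 2 are assumed to stand for Asian, Black and White.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def evaluate_by_group(y_true, y_pred, groups, num_classes=3):
    """Compute accuracy and a confusion matrix separately for each group."""
    report = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        t = [y_true[i] for i in idx]
        p = [y_pred[i] for i in idx]
        report[group] = {
            "accuracy": accuracy(t, p),
            "confusion": confusion_matrix(t, p, num_classes),
        }
    return report


# Toy data (hypothetical): true class, predicted class, and gender per image.
toy_true = [0, 1, 2, 0, 1, 2, 0, 1]
toy_pred = [0, 1, 1, 0, 2, 2, 0, 1]
toy_gender = ["F", "F", "F", "F", "M", "M", "M", "M"]
toy_report = evaluate_by_group(toy_true, toy_pred, toy_gender)
```

Similar accuracies can hide different error patterns, which is why the confusion matrices are compared as well as the overall scores.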

# 5. Results

We achieved a fairly accurate model that can predict with up to 80% accuracy whether a person is Asian, Black or White. Pretrained ResNet and EfficientNet models obtained similar accuracies and losses, so we display only the EfficientNet results. As the following figures show, the training score and loss gradually improved while the validation score and loss plateaued after a few epochs, which indicates some overfitting. We tried to optimize the model by changing the scheduler type and varying the Adam learning rate from 0.001 to 0.01, but the overfitting did not go away. The score, however, remained consistently at around 75% throughout these changes.

![EfficientNetB0 loss](loss0735.png) ![EfficientNetB0 score](score0735.png)

It is important to note that we did our training and testing on racially balanced datasets. Before optimization, we could only achieve good accuracy if we tested the model on a racially balanced test dataset. If we tested the model on a predominantly White dataset, the model tended to predict that everyone was White. After optimizing our model with an Adam learning rate of 0.001 and an exponential scheduler with $\gamma = 0.735$, we achieved good accuracy on the imbalanced set without the model predicting that everybody was White.
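
To give a sense of what these settings imply, an exponential scheduler multiplies the learning rate by $\gamma$ at every step. Assuming the scheduler is stepped once per epoch (an assumption, since the stepping interval is not stated here), the learning rate after $t$ epochs with initial rate $\eta_0 = 0.001$ is:

$$
\eta_t = \eta_0 \, \gamma^{t}, \qquad \eta_{10} = 0.001 \times 0.735^{10} \approx 4.6 \times 10^{-5}
$$

So after ten epochs the learning rate would have fallen to under 5% of its starting value, which means most of the learning happens in the first few epochs.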

We also wrote our own implementation of ResNet18, which obtained somewhat lower results. The issue of overfitting remained, and this model achieved a score of up to 68% when tested on unseen data. Because the pretrained EfficientNet model returned better results, we explored gender bias on that model.

![ResNet18 Self-Implementation Loss](self-loss.png) ![ResNet18 Self-Implementation Score](self-score.png)

We tested our model on male and female subsets of unseen data. The model obtained around 75% accuracy and similar confusion matrices for both subsets. The rate at which a patient of either gender is misclassified is almost the same across the races. Given the similar results for the two genders, we found no evidence of gender bias.

![Male Confusion Matrix](male.png) ![Female Confusion Matrix](female.png)

# 6. Conclusion

The models that our project produced can classify race with up to 80% accuracy based on chest X-rays. Given the constraints on our time and computing power, we consider this a reasonable result next to the models built by [Adleberg et al.](https://pubmed.ncbi.nlm.nih.gov/35964688/) and [Gichoya et al.](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00063-2/fulltext). Our initial goal for the project was to see whether such models are actually feasible, and we achieved this goal.

We also investigated the ethical issues that these models could pose. We speculate that if the results of this project were used in bad faith, existing racial inequalities would be reinforced and worsened. Physiological differences between the races include not only visible features such as skin tone, but also unobservable, subcutaneous features such as bones. This has also been observed by [Maglo et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4756148/). They learned that this phenomenon results from racism itself, because prolonged exposure to effects of racism such as toxic stress can lead to observable physiological changes.

Another ethical issue that our models can pose is improved surveillance tactics. We learned that during World War II, people of certain ethnicities were incarcerated in various countries, and there were protocols that officials followed to check whether an incarceree actually belonged to the condemned ethnic group. We also learned that some people evaded incarceration by assuming a different identity. With the introduction of an absolutely accurate racial classification model, nobody is safe, because there is no way to lie your way out of persecution.

But of course, this is a stretch. Still, the fact that our deep learning models can learn race is deeply concerning, because it means that race-based medicine can be administered by medical machines that are supposedly unbiased and reliable. The only way to overcome this is for the medical industry itself to stop using race as a vector in decision making.

Right now, our model is still in its infancy, and we do not know for sure what the model is looking at to make its decision. If we had more time and Google Colab premium, the first thing we would do is loop over all parameters of the Adam optimizer and the exponential scheduler to optimize our model and overcome overfitting. Once we achieved almost 100% accuracy, we would like to know whether the model was looking at physiological differences between patients to make its decision, or whether it was looking at something else. To achieve this, we would need more data from different sources and more Google Colab premiums.

# 7. Group Contributions Statement

Trong: I downloaded the CheXpert dataset to a shared drive to keep the file paths in our Google Colab notebooks consistent; visualized racial and gender imbalances in the training data; trained ResNet18 models with different schedulers; visualized training and validation losses for each model; and investigated gender bias for the pretrained EfficientNetB0 model. I tried to implement the code from [here](https://github.com/Emory-HITI/AI-Vengers/tree/main#readme) but it didn't work. I also wrote the project presentation script and the blog post, and finalized the ethics research.

Jay-U:

Kent: