# 1. Abstract

For this project, we aim to build a model that can predict a patient's race with high accuracy from a chest X-ray alone. Using pretrained EfficientNet and ResNet models, we built a classifier that predicts whether a patient is Asian, Black or White with over 75% accuracy and without observable gender bias.

The GitHub repository for the project code can be found [here](https://github.com/kennyerss/csci451-project).

# 2. Introduction

It is understandable that medical information about a patient can be extracted from their chest X-ray. However, [Adleberg et al.](https://pubmed.ncbi.nlm.nih.gov/35964688/) created a model that can extract non-medical information such as age, gender, race, ethnicity and insurance status with almost 100% accuracy. Similarly, [Gichoya et al.](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00063-2/fulltext) successfully created a model that can predict race from chest X-rays. These studies trained their models on large chest X-ray datasets such as [ChexPert](https://stanfordmlgroup.github.io/competitions/chexpert/), which is the same dataset that we trained our model on.

Our primary question in this project is whether it is possible to detect race from data as racially ambiguous as chest X-rays. However, showing that racial classification can be extracted from chest X-rays may reinforce the idea that people of different races are inherently different, and so perpetuate existing inequalities. Another ethical concern is deanonymization. If we can extract somebody's age, gender, race, and even insurance status just from a chest X-ray, it will pave the way for stronger and more accurate surveillance models that can extract civilians' personal information from seemingly anonymous data.

# 3. Values Statement

Racial classification from chest X-rays is not a popular topic, so we do not believe that there are any users for this model. The only instance in which we think this model could be useful is when we need to expose a celebrity for cultural appropriation by showing that their chest X-ray does not belong in that cultural group per se; but of course, this is a stretch. Despite the lack of people who might actually use the model, the results of this project may be used to justify the current racial hierarchy, because the model appears to suggest that there are inherent differences between the races: not just superficial differences in skin tone or hair texture, but differences in bones and other internal features that we cannot see. If this model fell into the hands of a eugenicist, the repercussions would be dire. In that case, this model would work in the direction that Dr. Timnit Gebru has warned us against: eugenics comes back to us again and again in increasingly progressive and scientific forms.

In short, the potential users of our project are propagandists working to discriminate against minorities. Even though this is not our goal, the model could marginalize a lot of people if eugenicists happen upon it.

As for our goal, we aim to check if it is indeed possible to classify race this way. We also want to know what could lead to this possibility: whether the model is picking up on something that is not indicative of race and using that to make its decision, or there is indeed a racial difference. In the past, we learned about an image classification model that was trained to detect criminals but turned out to detect if someone is smiling or not. This could be the case for our model, but until there is an actual test, we cannot conclude anything on how exactly our model is making its decisions.

In other words, our project does not aim to solve a problem, but to see what the problem is. We cannot conclude whether race determines underlying physiological differences just by looking at the efficacy of our model, because that would be unscientific and destructive. Therefore, our second deliverable is a paper presenting our findings on the ambiguity of race in medicine, intended as a rebuttal to those who may extrapolate our results to their own ends.

# 4. Materials and Methods

## Our Data
We use the [ChexPert dataset](https://stanfordmlgroup.github.io/competitions/chexpert/) collected by [Irvin & Rajpurkar et al.](https://arxiv.org/abs/1901.07031). This dataset contains 224,316 frontal and lateral chest radiographs of 65,240 patients. Each radiograph is labeled with information such as age, gender, race, ethnicity and medical conditions, but we are primarily concerned with race and gender. 

White patients occupy the vast majority of this dataset, as shown by the following figure, and we are concerned that this may lead to a racial bias in the model's classification algorithm.
![imbalance](imbalance.png)

To account for this imbalance, we trained our model on a racially balanced subset of the ChexPert dataset. Even though there are more male than female patients in this training set, we later found that the model does not exhibit gender bias.

![balance1](balance.png) ![balance2](gender.png)
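The balancing step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it assumes pandas, and the column names `path` and `race`, the helper `balanced_subset`, and the toy table are all hypothetical stand-ins for the real demographics file.

```python
import pandas as pd

def balanced_subset(df, label_col, classes, n_per_class, seed=0):
    """Sample the same number of rows for each class in `classes`."""
    parts = []
    for cls in classes:
        rows = df[df[label_col] == cls]
        if len(rows) < n_per_class:
            raise ValueError(f"only {len(rows)} rows available for class {cls!r}")
        parts.append(rows.sample(n=n_per_class, random_state=seed))
    # Shuffle so the classes are interleaved rather than stacked in blocks.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy metadata table (hypothetical columns) with a White majority.
demo = pd.DataFrame({
    "path": [f"img_{i}.jpg" for i in range(12)],
    "race": ["White"] * 7 + ["Black"] * 3 + ["Asian"] * 2,
})
subset = balanced_subset(demo, "race", ["Asian", "Black", "White"], n_per_class=2)
print(subset["race"].value_counts())
```

Sampling a fixed number of rows per class is only one way to balance a dataset; it discards majority-class images, which fits a setting where computing power limits the training set size anyway.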

## Our Method

We trained our model on 10,000 frontal chest X-rays, such as the one in the following figure, with race as the target label. We limited the training set to 10,000 images because of limited computing power. This subset is equally divided among Asian, Black and White patients. We excluded other races to keep the classification task simple.

![chest X-ray](view1_frontal.jpg)

As for our models, we used ResNet and EfficientNet because they are popular deep learning architectures for image classification. Specifically, we used pretrained EfficientNetB0 and ResNet18 models. We also implemented some ResNet18 models on our own but achieved a lower accuracy.

For the training data, we accessed the .csv file containing demographic details about the patients, extracted the path to each radiograph, and labeled each image with its owner's race. The images were converted into tensors and then loaded for training with the Adam optimizer. Because of the large size of the data, we did all training and testing on Google Colab.

To optimize a model, we would train it on 10,000 images in a loop using different learning rates for the Adam optimizer and $\gamma$ values for the exponential scheduler. In the same loop, we would then test the model on 2,500 images to find the optimal parameters. Cross entropy loss was used for all models.
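The search loop described above might look something like the sketch below. It assumes PyTorch; the helper names `accuracy` and `grid_search`, the candidate values, and the tiny synthetic dataset with a linear model are illustrative placeholders, not the project's actual setup.

```python
import itertools
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader):
    """Fraction of correctly classified samples in `loader`."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

def grid_search(make_model, train_loader, eval_loader, lrs, gammas, epochs=3):
    """Train a fresh model for every (lr, gamma) pair and score it on `eval_loader`."""
    loss_fn = nn.CrossEntropyLoss()
    results = {}
    for lr, gamma in itertools.product(lrs, gammas):
        model = make_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
            scheduler.step()  # multiply the learning rate by gamma once per epoch
        results[(lr, gamma)] = accuracy(model, eval_loader)
    best = max(results, key=results.get)
    return best, results

# Synthetic stand-in data: 8 features, 3 classes.
torch.manual_seed(0)
x, y = torch.randn(120, 8), torch.randint(0, 3, (120,))
train = DataLoader(TensorDataset(x[:90], y[:90]), batch_size=30)
held_out = DataLoader(TensorDataset(x[90:], y[90:]), batch_size=30)
best, results = grid_search(lambda: nn.Linear(8, 3), train, held_out,
                            lrs=[1e-3, 1e-2], gammas=[0.735, 0.9])
print(best, results[best])
```

Creating a fresh model for each pair keeps the runs independent, so a good score cannot be inherited from an earlier configuration.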

As mentioned before, there may be a gender bias in our model because there are more male than female patients in our training dataset. We checked for this by splitting our test set into male and female subsets and testing the model on each one. We then compared the score and confusion matrix of the two subsets.
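A per-subset evaluation of this kind can be sketched with NumPy alone. The function names and the toy labels below are made up for illustration; classes are encoded as integers 0, 1, 2 and the group array holds each patient's recorded gender.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

def evaluate_by_group(y_true, y_pred, groups, num_classes):
    """Compute accuracy and a confusion matrix separately for each group."""
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        cm = confusion_matrix(y_true[mask], y_pred[mask], num_classes)
        report[str(g)] = {"accuracy": np.trace(cm) / cm.sum(), "confusion": cm}
    return report

# Toy predictions for eight patients (illustrative values only).
y_true = np.array([0, 1, 2, 0, 1, 2, 0, 1])
y_pred = np.array([0, 1, 1, 0, 2, 2, 0, 1])
sex = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
report = evaluate_by_group(y_true, y_pred, sex, num_classes=3)
for group, stats in report.items():
    print(group, stats["accuracy"])
    print(stats["confusion"])
```

Comparing the matrices row by row shows whether a given race is misclassified more often for one gender than for the other, which a single overall accuracy figure would hide.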

# 5. Results

We achieved a fairly accurate model that can predict with up to 80% accuracy whether a person is Asian, Black or White. Pretrained ResNet and EfficientNet models obtained similar accuracies and losses, so we display only the EfficientNet results. As we can see in the following figures, the training score and loss gradually improved while the validation score and loss plateaued after a few epochs. This means that there was some overfitting. We tried to optimize the model by changing the scheduler type and varying the Adam learning rate from 0.001 to 0.01, but the overfitting did not go away. The score, however, remained consistently at around 75% throughout these changes.

![EfficientNetB0 loss](loss0735.png) ![EfficientNetB0 score](score0735.png)

It is important to note that we did our training and testing on racially balanced datasets. Before optimization, we could only achieve good accuracy if we tested the model on a racially balanced test dataset. If we tested the model on a predominantly White dataset, the model tended to predict that everyone was White. After optimizing our model with an Adam learning rate of 0.001 and an exponential scheduler with $\gamma = 0.735$, we achieved good accuracy on the imbalanced set without the model predicting that everybody was White.
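To give a sense of how quickly this schedule shrinks the learning rate, here is a short worked calculation. It assumes the standard exponential decay rule and that the scheduler is stepped once per epoch, neither of which is stated explicitly above; $\eta_t$ denotes the learning rate after $t$ epochs and the epoch counts are chosen only for illustration.

$$
\eta_t = \eta_0 \, \gamma^{t}, \qquad \eta_0 = 0.001, \quad \gamma = 0.735
$$

$$
\eta_5 = 0.001 \times 0.735^{5} \approx 2.1 \times 10^{-4}, \qquad
\eta_{10} = 0.001 \times 0.735^{10} \approx 4.6 \times 10^{-5}
$$

Under these assumptions the learning rate falls by roughly a factor of five every five epochs, so later epochs make only small updates to the weights.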

We also wrote our own implementation of ResNet18, which obtained somewhat lower results. The issue of overfitting remained, and this model achieved a score of up to 68% when tested on unseen data. Because the pretrained EfficientNet model returned better results, we explored gender bias on that model.

![ResNet18 Self-Implementation Loss](self-loss.png) ![ResNet18 Self-Implementation Score](self-score.png)

We tested our model on male and female subsets of unseen data. The model obtained around 75% accuracy and similar confusion matrices for both subsets. For each race, the rate at which a patient is misclassified is almost the same for both genders. Given these similar results, we see no evidence of gender bias in the model.

![Male Confusion Matrix](male.png) ![Female Confusion Matrix](female.png)

# 6. Conclusion

The model that our project produced can classify race from chest X-rays with up to 80% accuracy. Given the constraints on our time and computing power, this is comparable to the models built by [Adleberg et al.](https://pubmed.ncbi.nlm.nih.gov/35964688/) and [Gichoya et al.](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00063-2/fulltext). Our initial goal for the project was to see if such a model is actually feasible, and we achieved this goal.

We also investigated the ethical issues that this model could pose. We found that if the results of this project were used in bad faith, existing racial inequalities would be reinforced and worsened. Under such a reading, physiological differences between the races would include not only visible features such as skin tone, but also intrinsic, unobservable features such as bones. This reading has been refuted by [Maglo et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4756148/). They found that such differences result from racism itself, because prolonged exposure to effects of racism such as toxic stress can lead to observable physiological changes.

Another ethical issue that our model can pose is improved surveillance tactics. We learned that during World War II, people of certain ethnicities were incarcerated in various countries, and there were protocols that officials followed to check if an incarceree was actually in the condemned ethnic group. We also learned that there were survivors, and there were those who actually evaded incarceration by assuming a different identity. With the introduction of this racial classification model, nobody is safe, because there is nowhere to hide.

But of course, this is a stretch. Right now, the model is still in its infancy, and we do not know for sure what the model is looking at to make its decision. If we had more time and Google Colab premium, the first thing we would do is loop over all parameters of the Adam optimizer and the exponential scheduler to optimize our model and overcome overfitting. Once we achieved almost 100% accuracy, we would like to know whether the model was looking at physiological differences between patients to make its decision, or it was looking at something else. To achieve this, we would need more data from different sources and more Google Colab premiums.

# 7. Group Contributions Statement

Trong: I downloaded the ChexPert dataset to a shared drive to make the file paths in our Google Colab notebooks consistent; trained ResNet18 models with different schedulers; visualized racial and gender imbalances in the training data; and investigated gender bias for the pretrained EfficientNetB0 model. I also wrote the project presentation script and the blog post, and finalized the ethics research.

Jay-U:

Kent: