ML Course, Bogotá, Colombia  (&copy; Josh Bloom; June 2019)

In [1]:
%run ../talktools.py

# Bias, Reproducibility, GDPR, and Ethics in ML

> *With great power comes great responsibility.*

ML models, and the systems they are deployed in, are built by people and used by people, so no model is devoid of bias. The best we can hope for is to understand how biased we can be and to build protections into ML systems that minimize the effects of that bias.

Bias lies not just in how we process the data but also in how we collect the data (and the results) in the first place. Data are collected by people, who have their own (sometimes unspoken) reasons for collecting them. What we decide to optimize might make sense in some context but could lead to unintended consequences if we are not thoughtful about the broader societal impacts.

Unfortunately, people often use ML systems blindly, thinking that they are perfect mathematical entities that
cannot possibly be wrong. This is dangerous!

In [5]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Rep. Gomez asks FBI what it&#39;s doing to thwart risks of bias in <a href="https://twitter.com/hashtag/facialrecognition?src=hash&amp;ref_src=twsrc%5Etfw">#facialrecognition</a> algorithms, especially misidentifying women of color, cites <a href="https://twitter.com/ACLU?ref_src=twsrc%5Etfw">@ACLU</a> study on high rates of misID. <a href="https://twitter.com/FBI?ref_src=twsrc%5Etfw">@FBI</a>: Bias for the algorithm? No we don&#39;t train on that. It&#39;s a mathematical equation that comes back.</p>&mdash; Angelique Carson (@privacypen) <a href="https://twitter.com/privacypen/status/1135947573183819776?ref_src=twsrc%5Etfw">June 4, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

In fact, models are subject to attack. A potentially fatal flaw of ML models is that
if you know the weights of the system, you can devise inputs that
give vastly incorrect answers. This is called an **adversarial attack**.
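
To make the idea concrete, here is a minimal sketch of one well-known attack of this kind, the fast gradient sign method, applied to a toy logistic-regression model. The weights `w`, bias `b`, input `x`, and step size `eps` are all made-up illustrative values, not taken from any real system. The attacker uses the known weights to nudge every feature slightly in the direction that increases the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical "trained" weights for a 100-feature logistic regression
# (illustrative values only; an attacker is assumed to know them).
w = np.tile([0.5, -0.5], 50)
b = 0.0

def predict_proba(x):
    """Model's probability that x belongs to class 1."""
    return sigmoid(w @ x + b)

def fgsm_perturb(x, y_true, eps):
    """Shift each feature by at most eps in the direction that
    increases the log-loss for the true label."""
    # For logistic regression, d(loss)/dx = (p - y) * w.
    grad = (predict_proba(x) - y_true) * w
    return x + eps * np.sign(grad)

x = 0.1 * np.sign(w)                      # clean input, true label 1
x_adv = fgsm_perturb(x, y_true=1, eps=0.2)

p_clean = predict_proba(x)
p_adv = predict_proba(x_adv)
print(f"clean input:       P(class 1) = {p_clean:.3f}")
print(f"adversarial input: P(class 1) = {p_adv:.3f}")
print(f"largest change to any feature: {np.max(np.abs(x_adv - x)):.2f}")
```

No single feature moves by more than `eps`, yet the small changes add up across all 100 features and flip a confident prediction. Real image classifiers have many thousands of input features, which is part of why perturbations that are barely visible can be enough.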

In [6]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I like this simple adversarial attack. Left image is well predicted by machine with confidence 99%. Right image is wrong predicted (it thinks it&#39;s a 7) with confidence 99%. Both images differ in a few points. Source: <a href="https://t.co/79C0KxXd2T">https://t.co/79C0KxXd2T</a> <a href="https://twitter.com/hashtag/datascience?src=hash&amp;ref_src=twsrc%5Etfw">#datascience</a> <a href="https://t.co/E1dObv7DQb">pic.twitter.com/E1dObv7DQb</a></p>&mdash; Javier Nogales (@fjnogales) <a href="https://twitter.com/fjnogales/status/1133287485650493440?ref_src=twsrc%5Etfw">May 28, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<img src="imgs/flaws.png">

Remember...you've never tested your model on data that hasn't been created yet. You might think you know your accuracy, but you don't.

In [7]:
from IPython.display import YouTubeVideo
YouTubeVideo("i1sp4X57TL4")

> One project, conducted in collaboration with Google, involved probing machine-learning algorithms trained to generate automatic responses from e-mail messages (in this case the Enron e-mail data set). The effort showed that by creating the right messages, it is possible to have the machine model spit out sensitive data such as credit card numbers. The findings were used by Google to prevent Smart Compose, the tool that auto-generates text in Gmail, from being exploited.

https://www.technologyreview.com/s/613170/emtech-digital-dawn-song-adversarial-machine-learning/

<img src="imgs/medical.png">

<img src="imgs/tesla.png">

https://www.technologyreview.com/f/613254/hackers-trick-teslas-autopilot-into-veering-towards-oncoming-traffic/

Fighting against attacks is an active area of research (https://deepdrive.berkeley.edu/project/adversarial-deep-learning-autonomous-driving)
<img src="imgs/ucb_drive.png">

## Reproducibility

While you might not be able to protect against all forms of bias and attacks, one thing you can do right now is ensure that your results are reproducible. If there's a problem, you should be able to work out after the fact where the issue came in. We talked about setting the random seed early in the learning workflow. You should also think about curating your data in such a way that you can go back to the version that was used to train a given model. Using git (to capture the code that generates workflows) is a good idea. For big files you could consider checking them into [git LFS](https://git-lfs.github.com/) (or [zenodo](https://zenodo.org/) if public).
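
As a rough sketch of what this can look like in practice, the snippet below fixes the random seeds and records a fingerprint (SHA-256 hash) of the training data next to the model settings. The helper names (`set_seeds`, `make_run_record`), the seed value, and the tiny in-memory dataset are all hypothetical stand-ins. For a real file you would hash its contents, reading in chunks if it is large.

```python
import hashlib
import json
import random

import numpy as np

SEED = 42  # arbitrary illustrative value

def set_seeds(seed):
    """Seed the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)

def make_run_record(seed, data_bytes, params):
    """Bundle what is needed to trace a result back to its inputs."""
    return {
        "seed": seed,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,
    }

# Same seed, same "random" numbers.
set_seeds(SEED)
first = np.random.rand(3)
set_seeds(SEED)
second = np.random.rand(3)
print("draws match:", np.array_equal(first, second))

# Stand-in for the bytes of a training file.
data = b"x,y\n1,2\n3,4\n"
record = make_run_record(SEED, data, {"model": "logreg", "C": 1.0})
print(json.dumps(record, indent=2))
```

Saving a record like this alongside each trained model (and committing it with the code) lets you check later whether the data on disk are still the data the model was trained on.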

Reproducibility is also a gateway to transparency. Understanding why you got an answer and being able to communicate that is a good thing.

In [2]:
%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Fascinating new law from France banning publication of statistical analyses of judges&#39; decisions. Seems like an attempt to maintain mystique/legitimacy of legal system as above the flaws of particular humans. <a href="https://t.co/RGIfjgn9de">https://t.co/RGIfjgn9de</a></p>&mdash; Angela Walch (@angela_walch) <a href="https://twitter.com/angela_walch/status/1135935554053451776?ref_src=twsrc%5Etfw">June 4, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


In fact, the "General Data Protection Regulation" law passed in the EU holds:

> When data is collected, data subjects must be clearly informed about the extent of data collection, the legal basis for processing of personal data, how long data is retained, if data is being transferred to a third-party and/or outside the EU, and any automated decision-making that is made on a solely algorithmic basis. Data subjects must be informed of their privacy rights under the GDPR, including their right to revoke consent to data processing at any time, their right to view their personal data and access an overview of how it is being processed, their right to obtain a portable copy of the stored data, the right to erasure of data under certain circumstances, the right to contest any automated decision-making that was made on a solely algorithmic basis, and the right to file complaints with a Data Protection Authority.  -- [Wikipedia](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation)

So reproducibility, transparency, and *explainability* are now legal requirements in part of the world.

## Ethics

What counts as ethical behavior is, in the end, subjective. That said, there are some principles that people have developed (e.g., from ["The Institute for Ethical AI & Machine Learning"](https://ethical.institute)).

**Responsible Machine Learning Principles**

<img src="imgs/ethical.png">

Consider this a draft of the Hippocratic Oath for data scientists. You might want to try to internalize it.