Data Privacy and Data Ethics

Introduction

Data ethics and data privacy are integral to any data project. There are obvious cases, such as protecting the privacy of individuals' health records under HIPAA. There are also many gray areas surrounding what constitutes personally identifiable information (PII), and these arise across many industries, including advertising, finance, and consumer goods. You may have noticed that, starting around the summer of 2018, many websites began showing privacy policy notices asking you to accept the use of cookies. This was a result of Europe's GDPR legislation. You are also probably aware of the Cambridge Analytica debacle surrounding the 2016 United States presidential election. As a data practitioner, it is your responsibility to uphold data ethics in a fast-changing environment.

Objectives

You will be able to:

  • Determine whether or not a data science procedure meets an ethics standard

Examples

Data Breaches

If the data you are handling is valuable, then security should be a primary concern. Data breaches are all too common, and such leaks of sensitive information could often have been avoided if businesses and organizations had followed standard security protocols. While there are thousands of such cases, two of the largest breaches to catch the public's attention were Cambridge Analytica's misuse of Facebook data to influence political elections and Equifax's exposure of roughly 147 million individuals' Social Security numbers, credit data, and other personal information.

Identifying PII

PII stands for personally identifiable information. While some data, such as Social Security numbers and medical records, are clear examples of PII, other pieces of data may or may not qualify as PII depending on the jurisdiction. In the United States, for example, there are two key federal regulations: the Health Insurance Portability and Accountability Act (HIPAA) and the Privacy Act of 1974. While in theory these acts govern the collection, use, and maintenance of personal data, the scope of what constitutes PII and the regulations surrounding handling and using such data are generally antiquated. For example, a user's IP address has been categorized as non-PII by several U.S. courts, despite being a unique identifier for most individuals' home internet connections. Protections were further eroded by the FCC's rollback of net neutrality rules under Chairman Ajit Pai, which took effect in mid-2018. Beyond federal jurisdiction, several states, most notably California, have their own data protection laws that benefit and protect users and consumers.

GDPR

GDPR stands for the General Data Protection Regulation. It was passed by the European Union on April 14th, 2016 and went into effect on May 25th, 2018. GDPR protects the data rights of individuals in the European Union and is an example of how legislation will have to change and adapt to the online, digital era of the 21st century. GDPR has implemented more widespread regulations surrounding what constitutes PII and has set fines of up to 4% of a company's annual global revenue (or €20 million, whichever is greater).
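
For a concrete sense of that fine structure, here is a minimal sketch in Python; the revenue figure is entirely hypothetical and used only for illustration.

```python
# Hypothetical revenue figure, used only to illustrate the fine structure above.
annual_global_revenue_eur = 2_000_000_000  # assumed: €2 billion in annual revenue

# The maximum fine under the higher GDPR tier is 4% of annual global revenue
# or €20 million, whichever is greater.
max_fine_eur = max(0.04 * annual_global_revenue_eur, 20_000_000)
print(f"Maximum potential GDPR fine: €{max_fine_eur:,.0f}")  # €80,000,000
```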

Data Best Practices

There are two primary practices that you should follow when dealing with PII and other sensitive data. The first is to encrypt sensitive data: when in doubt, encrypt. The second is to ask yourself what level of information you really need. Large organizations typically have data-cleaning teams that first scrub sensitive fields such as names and addresses before passing the data off to analysts and others to mine, as sketched below. Ultimately, any well-thought-out strategy will include multiple layers, safeguards, and other measures to ensure data is safe and secure.
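
Here is one minimal sketch of that kind of scrubbing step, assuming a pandas DataFrame with hypothetical column names and values; it drops a field analysts don't need and pseudonymizes an identifier that must be kept so records can still be joined.

```python
import hashlib

import pandas as pd

# Hypothetical example data; the column names and values are assumptions.
df = pd.DataFrame({
    "name": ["Ada Lovelace", "Alan Turing"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "zip_code": ["10001", "94105"],
    "purchase_amount": [42.50, 17.25],
})

def pseudonymize(value: str) -> str:
    """Replace a sensitive value with a one-way SHA-256 hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# Drop fields downstream analysts don't need, and pseudonymize identifiers
# that must be kept so that records can still be joined.
scrubbed = df.drop(columns=["ssn"]).assign(name=df["name"].map(pseudonymize))

print(scrubbed)
```

Note that plain hashing is weak protection for low-entropy values like names, since the hashes can be reversed by guessing inputs; real pipelines typically rely on salted hashing, tokenization, or encryption with managed keys.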

Data Collection Processes

When collecting data, it is important to ensure you are not gathering it in a manner that will generate bias. For example, if Data Scientists are not careful in the way they phrase survey questions, they can generate misleading results. If a poll contained the question "How poorly has Politician X performed when it comes to the economy?", the phrasing adds a negative connotation to the question. That phrasing might lead people to say Politician X performed worse than if they had merely been asked "How has Politician X performed when it comes to the economy?"

In some cases, choosing which variables to collect and how to define them can also introduce bias. You'll notice that in some of the datasets we use, gender is represented as a binary value and race is referenced in an insensitive manner. This is an artifact of the societal conditions at the time the data was collected. As soon-to-be Data Scientists, it will be your responsibility to ensure that data collection is done in an inclusive manner.

Algorithm Bias

People often trust algorithms and their output based on measurements such as "this algorithm has 99.9% accuracy." However, while algorithms such as linear regression are mathematically sound and powerful tools, the resulting models are simply reflections of the data that is fed in. For example, logistic regression and other algorithms are used to inform a wide range of decisions, including whether to provide someone with a loan, the severity of a criminal sentence, or whether to hire an individual for a job. (Do a quick search online for algorithm bias, or check out the Additional Resources section below.) In all of these scenarios, it is important to remember that the algorithm is simply reflective of the underlying data itself. If an algorithm is trained on a dataset where African Americans have faced disproportionate criminal prosecution, the algorithm will continue to perpetuate those racial injustices. Similarly, algorithms trained on data reflecting a gender pay gap will continue to promote that bias. For these reasons, substantial thought and analysis regarding problem setup and the resulting model are incredibly important.
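
One simple way to sanity-check a model for the disparities described above is to compute the same metric separately for each group rather than relying on a single aggregate number. The following sketch uses a hypothetical audit DataFrame; the column names, groups, and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical audit data: model predictions alongside a sensitive attribute.
results = pd.DataFrame({
    "group":     ["A", "A", "A", "B", "B", "B"],
    "actual":    [1, 0, 1, 1, 0, 0],
    "predicted": [1, 0, 1, 0, 1, 0],
})

# A single aggregate accuracy can hide large disparities between groups.
overall_accuracy = (results["actual"] == results["predicted"]).mean()

# Break the same metrics out per group to surface disparate performance.
per_group = (
    results.assign(correct=results["actual"] == results["predicted"])
           .groupby("group")[["correct", "predicted"]]
           .mean()
           .rename(columns={"correct": "accuracy", "predicted": "positive_rate"})
)

print(f"Overall accuracy: {overall_accuracy:.2f}")
print(per_group)
```

Even when overall accuracy looks healthy, a breakdown like this can reveal that one group receives far less accurate predictions or a very different positive rate.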

A handful of resources providing further information on some of the topics discussed here can be found in the Additional Resources section below.

Gray Areas and Forward Thinking

Aside from practices that are overtly illegal under current legislation, data privacy and ethics raise a myriad of thought experiments. For example, should IP addresses or cookies be considered PII? How should security camera footage be handled? What about vehicles such as Google Street View cars, which capture video and pictures of public places? Some companies are now even taking pictures of license plates to track car movements. Should they be allowed to maintain massive databases of this information? What regulations should be placed on these and other potentially sensitive datasets?

All of these examples question where and when limits should be put on data. Science fiction stories such as 1984 are much closer to reality than one might expect. Moreover, injustices and questionable practices still abound. For example, despite public outcry at debacles like Cambridge Analytica, many companies with nearly identical practices still exist, such as Applecart in New York City, which collects and sells user data to the Republican Party, amongst others.

To stay current, you should also identify some news sources that cover tech trends. One great resource is the Electronic Frontier Foundation (EFF).

EFF recently put together an article called Fix It Already, outlining fixable problems that technology companies continue to ignore. Take a look at the article and get involved to put pressure on these organizations and your representatives to shape up. Here's a quick preview of their list:

  • Android should let users deny and revoke apps' Internet permissions.
  • Apple should let users encrypt their iCloud backups.
  • Facebook should leave your phone number where you put it.
  • Slack should give free workspace administrators control over data retention.
  • Twitter should end-to-end encrypt direct messages.
  • Venmo should let users hide their friends lists.
  • Verizon should stop pre-installing spyware on its users’ phones.
  • WhatsApp should get your consent before you’re added to a group.
  • Windows 10 should let users keep their disk encryption keys to themselves.

Disclaimer

As a final note, be aware that the nature of online data means you may also encounter offensive or inappropriate content. For example, if acquiring data from an API such as Twitter's, there is potential to encounter lewd or offensive material. While many of these services will eventually screen out and remove particularly egregious cases, plenty of trolls still exist.

Additional Resources

There is a multitude of resources for getting involved with data privacy and ethics; here are a few to get you started.

Summary

In this lesson, you got a preview of some of the many issues regarding data privacy and ethics. From GDPR to being aware of your own data aura, there's plenty to keep you busy and on your toes regarding this fascinating perspective on the data industry.
