# Feature Engineering + Project 4 Review

## Feature Engineering: Concepts In Need of Review

- __Create "Interaction" Features__: Interaction features are new metrics we __derive__ from our existing data (NOTE: we've already touched on this concept this semester). Examples include creating new ratios / proportions from our existing data, binning / bucketing numeric data, and de-structuring date / time values.


- __Combining Sparse Classifications__: With categorical data, "sparse classes" are those that have very few observations within a particular categorical variable. Such "sparse data" can cause problems with a wide variety of predictive and machine learning models (e.g., causing the models to "overfit" the data used for model training purposes). These challenges can oftentimes be overcome by combining sparse classes into new classifications that aggregate the sparse items based on some shared characteristic(s). How you identify such "shared characteristics" is highly subjective, i.e., you need to apply your __domain knowledge__ to identify them.


- __Adding "Dummy" Variables__: As we've previously discussed, we cannot use raw categorical data as input to a model that requires numerical data. Instead, we create a new "0/1" binary indicator variable for each categorical data value.
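
The three techniques above can be sketched in pandas as follows. This is a minimal illustration on made-up data: the column names (`price`, `sqft`, `sale_date`, `heating`) and the decision to merge `Wood` and `Coal` into a single `Solid` class are assumptions chosen for demonstration, not part of any course dataset.

```python
import pandas as pd

# Hypothetical toy data; every column name and value is illustrative only.
df = pd.DataFrame({
    "price": [250000, 180000, 320000, 150000, 410000, 275000],
    "sqft": [2000, 1500, 2500, 1200, 3000, 2200],
    "sale_date": pd.to_datetime([
        "2021-01-15", "2021-03-02", "2021-07-19",
        "2021-07-30", "2021-11-05", "2021-12-24",
    ]),
    "heating": ["Gas", "Electric", "Gas", "Wood", "Electric", "Coal"],
})

# 1. Interaction features: a ratio, a binned numeric value, and a date part.
df["price_per_sqft"] = df["price"] / df["sqft"]
df["size_bucket"] = pd.cut(
    df["sqft"],
    bins=[0, 1500, 2500, float("inf")],   # intervals are closed on the right
    labels=["small", "medium", "large"],
)
df["sale_month"] = df["sale_date"].dt.month

# 2. Combine sparse classes: "Wood" and "Coal" each appear only once, so we
#    merge them based on a shared characteristic (both are solid fuels).
#    This grouping is a domain-knowledge assumption.
df["heating_grouped"] = df["heating"].replace({"Wood": "Solid", "Coal": "Solid"})

# 3. Dummy variables: one 0/1 indicator column per remaining category.
dummies = pd.get_dummies(df["heating_grouped"], prefix="heating", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df[["price_per_sqft", "size_bucket", "sale_month", "heating_grouped"]])
print(dummies)
```

Each row ends up with exactly one `1` across the `heating_*` columns, which is what a model requiring numeric input expects in place of the raw category labels.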

### Confusing these concepts will derail your analytical work, so make sure you are clear on each of them!!


## Project 4 Comments

- Converting categorical values to digits __DOES NOT__ change the fact that the contents of the variable are CATEGORICAL. Converting categorical values to digits is simply a "renaming" of the categorical values, and the resulting digits have no more meaning for mathematical calculation purposes than do the original character-based categorical values.


- Correlation metrics __CANNOT__ be derived from raw categorical data (whether in string or digit format). When performing EDA work with categorical data it is best to make use of bar plots for purposes of understanding the relationship between categorical explanatory variables and the given response variable.


- Dummy variables are derived from categorical explanatory variables for purposes of model building. Failing to use dummy variables for categorical inputs to a model that requires numerical data will result in your models being invalid.
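
To see why digit-coded categories carry no numeric meaning, the following sketch applies two equally arbitrary digit codings to the same categorical column and computes a correlation for each. The data, the column names (`color`, `sales`), and both coding schemes are hypothetical.

```python
import pandas as pd

# Hypothetical data: a categorical explanatory variable and a numeric response.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "green", "blue"],
    "sales": [10, 20, 30, 12, 33, 19],
})

# Two arbitrary "renamings" of the same categories as digits.
coding_a = {"red": 1, "blue": 2, "green": 3}
coding_b = {"red": 3, "blue": 1, "green": 2}

# Pearson correlation between the digit codes and the response.
corr_a = df["color"].map(coding_a).corr(df["sales"])
corr_b = df["color"].map(coding_b).corr(df["sales"])
print(f"coding A: {corr_a:.3f}, coding B: {corr_b:.3f}")

# What a bar plot would show: the mean response per category.
# This summary does not depend on how the categories are labeled.
means = df.groupby("color")["sales"].mean()
print(means)
```

The two codings produce correlations of opposite sign even though the underlying data never changed, so the "correlation" reflects only the choice of labels. The per-category means are the same under either coding, which is why bar plots are the safer EDA tool here.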

# Module 14: "Data Ethics" and "Ethics in Statistics"

As data analytics practitioners, we can often find ourselves working with data pertaining to specific individuals, and that data can often contain very personal details that many individuals would, if they could, prevent others from using for any purpose. Furthermore, the ways in which we analyze and report on data we are working with can (whether intentionally or not) easily lead both ourselves and others to misleading or inaccurate analytical and/or interpretational conclusions. 

This raises the question:

__*As data analytics practitioners, how can we ensure that we are maintaining an ethical approach to both data usage and statistical analysis?*__

## Data Ethics

### What is "Data Ethics"?

- According to Wikipedia, __data ethics__ "..refers to systemising, defending, and recommending concepts of right and wrong conduct in relation to data, in particular personal data."


- Specifically, __individual rights__ (e.g., right to privacy, right to consent, etc.) should __always__ take precedence over institutional or commercial interests.


- Ideally, the practice of "__Data Ethics__" should apply to all collectors and disseminators of structured or unstructured data such as data analytics professionals, data brokers, governments, and large corporations.


### So what are the guiding principles of data ethics?

According to __dataethics.eu__ (https://dataethics.eu/data-ethics-principles/):

- __INDIVIDUAL DATA CONTROL__: Humans should be in control of and empowered by their data. The individual has the primary control over the usage of their data, the context in which his/her data is processed and how it is activated.


- __TRANSPARENCY__: The purpose of and ways in which data is being processed must be fully transparent and explainable to the individual, including any potential risks to the individual such as economic, privacy, social, ethical, or societal consequences.


- __ACCOUNTABILITY__: All collectors and disseminators of structured and unstructured data should have guidelines in place to ensure that any use of personal data adheres to ethical data usage practices, and should ensure that such guidelines are honored by subcontractors. Furthermore, all collectors and disseminators of structured and unstructured data should ensure they have appropriate controls and systems in place to prevent hacking of the data they have amassed.


- __EQUALITY__: All personal data should be treated in a consistent manner without adversely affecting any individual's self-determination and control of their personal information, and should not expose any individual to discrimination or stigmatisation, for example due to their financial, social or health related conditions.


__dataethics.eu__ provides a useful questionnaire that can help practitioners address each of these principles: https://dataethics.eu/data-ethics-principles/

### Some Examples of "Data Ethics" Practices Not Being Followed:

- __OfficeMax__ Sends Mail to Father Mentioning That His Daughter Died in a Crash:
https://www.latimes.com/nation/la-na-officemax-mess-20140121-story.html


- __Target__ Realizes a Teen is Pregnant Before She's Told Anyone: https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#347420aa6668


- __Facebook__ fails to protect users' data and privacy: https://www.cnbc.com/2019/03/06/some-advertisers-are-quitting-facebook-after-privacy-scandals.html


- https://usa.inquirer.net/17723/facebook-data-privacy-issues-facebook-shared-your-private-messages-with-its-partners?utm_expid=.XqNwTug2W6nwDVUSgFJXed.1


- __Google__ fails to protect users' data and privacy: https://thevpn.guru/8-google-data-privacy-concerns/


These are just a few examples: there are literally hundreds more we can find via a simple web search. Here are some of the worst from 2018:

- https://www.businessinsider.com/data-hacks-breaches-biggest-of-2018-2018-12#20-orbitz-880000-2


- https://www.fastcompany.com/90272858/how-our-data-got-hacked-scandalized-and-abused-in-2018


..and here is a summary of some of the higher-profile data breaches of 2022:

- https://www.identityforce.com/blog/2022-data-breaches

__*In summary, a lack of appropriate data ethics practices is evident throughout the business world AND throughout governmental agencies worldwide.*__


### Generally speaking, at present the EU appears to have the strongest data ethics laws and regulations

- The General Data Protection Regulation (GDPR) imposes strict data privacy rules on all entities that target the EU market for goods and services:
https://gdpr.eu/


- In the USA, there is no single nationwide set of data privacy rules and regulations: instead there is a patchwork of national and state/local level rules and regulations.


- Many countries around the world are currently in the process of developing data privacy guidelines that either adhere to or are substantially similar to the EU's GDPR. A useful summary of data privacy laws around the world can be found here: https://i-sight.com/resources/a-practical-guide-to-data-privacy-laws-by-country/



### Our goal as data analytics practitioners: Neither ignore nor contribute to the problem!!

- Instead, __*always*__ follow the data ethics principles discussed here in your daily work.

## Ethics in Statistics

We are bombarded with misleading (and sometimes deliberately false) statistical data and statistics-based arguments on a daily basis via journalists, politicians, bureaucrats, and others who are attempting to impose their own points of view on us. Here are some great examples:

- https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/misleading-graphs/


Unfortunately, statistical analysis, while a powerful analytical tool, is also an easily abused form of analysis due to the ease with which data collection, data analysis, and reporting of analytical results can be manipulated as a result of either overt or unconscious biases and/or agendas.


As a result, a statistician's "Code of Ethics" has evolved to help address this problem.


### What is an "Ethical Statistician"?

- According to the American Statistical Association (ASA) (https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx) an ethical statistician "..uses methodology and data that are relevant and appropriate; without favoritism or prejudice; and in a manner intended to produce valid, interpretable, and __reproducible results__. The ethical statistician does not knowingly accept work for which he/she is not sufficiently qualified, is honest with the client about any limitation of expertise, and consults other statisticians when necessary or in doubt."


- A detailed set of guidelines from the ASA is provided here: https://www.amstat.org/asa/files/pdfs/EthicalGuidelines.pdf


- A key aspect of ethical statistical practice is __making your original data available for others to work with__. 


- Another key aspect of ethical statistics is __accepting when your peers have determined that your analysis is invalid due to a flaw in the approach you chose to use to analyze your data__.


- Another key aspect of ethical statistics is __ensuring that you've considered whether any confounding variables could have influenced your results, and being willing to accept that your results can be invalidated if a future researcher provides evidence that one or more confounding variables have either tangibly influenced your reported results or have rendered your analysis invalid__.


Unfortunately, it is all too easy to find examples of either flawed or deliberately misleading statistical analysis, or instances in which researchers overtly refuse to share their original research data, all of which could have been mitigated if the ASA's guidelines had been adhered to.


### Example 1. Social Science Reproducibility Crisis

Social science research is heavily dependent on statistical methods. Unfortunately, many social science research efforts appear to be deficient in their adherence to ethical statistics practices. The lack of reproducibility of significant amounts of social science research has been a huge issue for more than a decade now.

Some examples:

- Only 13 of 21 Major Social Science Experiments Published in Top Journals Could be Replicated: https://www.washingtonpost.com/news/speaking-of-science/wp/2018/08/27/researchers-replicate-just-13-of-21-social-science-experiments-published-in-top-journals/?noredirect=on&utm_term=.b5fc779f2625


- Psychology researchers ".. are too willing to run small and statistically weak studies that throw up misleading fluke results, to futz around with the data until they get something interesting, or to only publish positive results while hiding negative ones in their file drawers..": https://www.theatlantic.com/science/archive/2018/11/psychologys-replication-crisis-real/576223/
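
The "futz around with the data until they get something interesting" problem can be quantified with a short calculation. The sketch below assumes a study with no real effect, a significance threshold of 0.05, and a number of independent outcome measures that are each tested separately; under those assumptions each p-value is uniformly distributed, so the chance of at least one "significant" fluke is `1 - 0.95**k`. The function names and the simulation settings are illustrative choices.

```python
import random

ALPHA = 0.05  # conventional significance threshold (assumed)


def chance_of_fluke(n_tests: int, alpha: float = ALPHA) -> float:
    """Probability that at least one of n_tests independent tests is
    'significant' when there is no true effect at all."""
    return 1 - (1 - alpha) ** n_tests


def simulate(n_tests: int, n_studies: int = 20000,
             alpha: float = ALPHA, seed: int = 0) -> float:
    """Monte Carlo check: under the null, each p-value is uniform on [0, 1)."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_studies)
    )
    return hits / n_studies


for k in (1, 5, 20):
    print(f"{k:>2} tests: exact {chance_of_fluke(k):.3f}, "
          f"simulated {simulate(k):.3f}")
```

With 20 outcome measures the exact probability is roughly 64%, so a researcher who tests many outcomes and reports only the "significant" one will often publish a result that is pure noise.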


__What steps should people engaging with social science data take to ensure they are adhering to ethical statistics practices?__


### Example 2. Climate Data

Widespread firsthand human observations of climate and weather are primarily limited to the past 150 years or so. Prior to that, weather and climate data was generally not widely collected in a systematic manner. So given how little firsthand data we have on the Earth's weather and climate, how do we reconcile concerns about climate change relative to how long the Earth has existed?

- The Earth is approximately 4.5 billion years old


- Climate on Earth has always varied/changed and has never remained constant


- Only 52 million years ago (52 million out of 4.5 billion years), crocodiles swam in the Arctic Ocean and palm trees populated what was then the very warm region that is now the Arctic/sub-Arctic region of Alaska: https://news.nationalgeographic.com/2016/05/160523-climate-change-study-eight-degrees/


- The last major ice age ended only approx. 11,700 years ago. Sea levels rose more than 400 feet between 20,000 and 6,000 years ago and have remained largely constant since that time: https://wattsupwiththat.com/2010/12/01/sea-level-rise-jumpy-after-last-ice-age/


- Measures of climate and weather prior to recorded human history rely on secondhand observations (e.g., rock/ice/soil samples, etc.) and extrapolations.


- Climate is influenced by many variables that cannot be easily represented within climate models, including things like natural variations in the Earth's orbit around the Sun, changes in the orientation of the Earth's axis of rotation (aka the "wobble" of the Earth's axis), plate tectonics, unexpected variations in the amount of energy being produced by the Sun (including the Solar Wind), radiation from cosmic rays that continually bombard the atmosphere, and more:
https://www.bgs.ac.uk/discoveringGeology/climateChange/general/causes.html
https://www.sciencealert.com/cosmic-rays-could-influence-cloud-cover-on-earth


Allegations that unethical statistical practices have been used to manipulate climate data include the following:


- Critics have argued that the well-known "Hockey Stick" graph relied on a flawed data normalization method, and have alleged that the scientist behind it refused to share his data with other researchers:

- http://scienceandpublicpolicy.org/wp-content/uploads/2010/07/ad_hoc_report.pdf

- https://www.powerlineblog.com/archives/2019/08/michael-mann-refuses-to-produce-data-loses-case.php

- https://wattsupwiththat.com/2018/04/30/20-years-later-the-hockey-stick-graph-behind-waves-of-climate-alarmism-is-still-in-dispute/


- British scientists were accused of scientific fraud for attempting to suppress data that could cast doubt on a key 1990 study on the effect of cities on warming: https://www.theguardian.com/environment/2010/feb/01/leaked-emails-climate-jones-chinese


- Critics have claimed that NOAA temperature data is severely flawed: https://wattsupwiththat.com/2017/07/06/bombshell-study-temperature-adjustments-account-for-nearly-all-of-the-warming-in-government-climate-data/


__So how should an ethical statistician approach/engage with climate data?__


### Example 3. "Disparate Impact"

What is "disparate impact"?

- Disparate Impact is a form of indirect and often unintentional discrimination whereby certain criteria disproportionately favor certain groups over other groups. Disparate impact occurs when what appears to be a neutral policy, rule, or practice has a disproportionate negative impact on members of a protected class (e.g., in the USA that includes minorities, the disabled, etc.).

For example:

- A strength requirement might screen out disproportionate numbers of female applicants for a job.

- Requiring all applicants for promotion to receive a certain score on a standardized test could adversely affect candidates of color.


However, almost every selection methodology used by employers produces a degree of disparate impact because each methodology disproportionately excludes members of a protected group. Basic selection criteria such as background checks, credit checks, work experience, pre-employment tests, and minimum educational requirements can all lead to disparate impact if you subscribe to that theory.


So how is disparate impact proven? __Using statistical analysis__

- The statistical analyses used are often simplistic and frequently fail to consider the effects of possible confounding variables.


- Recent examples include NYC Schools' decision to severely limit the discipline of unruly students by arguing that school suspensions had a disparate impact on minority students: https://cei.org/blog/manhattan-institute-study-nyc-schools-shows-harm-flawed-legal-standards


- Another example: Does banning criminals from apartment rentals mean that minorities are suffering from the effects of disparate impact?: https://www.povertylaw.org/article/housing-the-key-to-successful-reentry-for-people-with-criminal-records/


What do these examples have in common? Should the statistical analyses have been required to consider confounding variables?


- For example, in the case of the students, are minorities actually being unfairly singled out for suspension or is there some other confounding variable that contributes to their being suspended at substantially higher rates than non-minorities?


- What about the criminals? Are minorities being unfairly singled out for exclusion from apartment rentals? Is it legitimate for landlords and residents to want to prohibit people convicted of violent crimes from living in the same building or development? What other confounding variables should be considered when determining whether minorities are being unfairly singled out in this situation?


- Are the statistical methods being used to prove disparate impact in these examples adhering to the principles of ethics in statistics?
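
One way to check whether an aggregate disparity might be explained by a confounding variable is to compare selection rates within each level of that variable. The sketch below uses made-up counts with hypothetical labels (groups `A` / `B`, strata `X` / `Y`), deliberately chosen so that the overall gap disappears after stratifying. With real data the gap may persist, shrink, or grow, and deciding which variables are legitimate to condition on is itself a judgment call.

```python
import pandas as pd

# Made-up counts for illustration only: two groups, two strata of a
# candidate confounding variable.
records = pd.DataFrame({
    "group":    ["A", "A", "B", "B"],
    "stratum":  ["X", "Y", "X", "Y"],
    "applied":  [80, 20, 20, 80],
    "selected": [72, 6, 18, 24],
})


def selection_rates(frame: pd.DataFrame, by) -> pd.Series:
    """Selected / applied, aggregated over the given grouping column(s)."""
    totals = frame.groupby(by)[["applied", "selected"]].sum()
    return totals["selected"] / totals["applied"]


overall = selection_rates(records, "group")                  # ignores the stratum
stratified = selection_rates(records, ["stratum", "group"])  # within each stratum

print("Overall selection rate by group:")
print(overall)
print("\nSelection rate by stratum and group:")
print(stratified)
```

In this constructed example group `A` is selected far more often overall, yet both groups have identical rates within each stratum because the groups are distributed very differently across strata. Reporting both views, rather than only the one that supports a preferred conclusion, is consistent with the ethical guidelines discussed above.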

### What can we do as data analytics practitioners to ensure we are adhering to ethical statistics principles?

- Always keep an open mind when approaching any data you are working with: __DO NOT__ engage with any data if you intend to apply preconceived ideas as to what the data should tell you. __Let the data speak for itself.__


- Always double check your assumptions to try to identify any unintentional biases you may have introduced to your analysis.


- Always report any findings, whether positive or negative, as even-handedly as possible.


- Never manufacture or covertly manipulate data to achieve a certain result.


- Never produce misleading graphics as part of your analysis.
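
A truncated y-axis is one common way a graphic becomes misleading. As a rough sketch, the helper below (a hypothetical function, not from any plotting library) compares how large a difference between two bars looks when the axis starts at zero versus when it starts just below the smaller value. The values 50 and 52 are made up.

```python
def apparent_ratio(a: float, b: float, axis_min: float = 0.0) -> float:
    """Ratio of the drawn bar heights for values a and b when the
    y-axis starts at axis_min instead of zero."""
    if axis_min >= min(a, b):
        raise ValueError("axis_min must be below both values")
    return (b - axis_min) / (a - axis_min)


# Hypothetical values: the second bar is only 4% larger than the first.
honest = apparent_ratio(50, 52)                   # axis starts at 0
truncated = apparent_ratio(50, 52, axis_min=49)   # axis starts at 49

print(f"axis from 0:  second bar looks {honest:.2f}x as tall")
print(f"axis from 49: second bar looks {truncated:.2f}x as tall")
```

With the axis starting at 49, a 4% difference is drawn as a bar three times as tall. If you must truncate an axis, label it clearly so readers are not left to infer the scale from bar heights alone.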
