# Learning Unit 4 - Bias

## Reading Part - Prepare for exercise

**✏️ Task 4.1**

1. *Download "A Survey on Bias and Fairness in Machine Learning" https://dl.acm.org/doi/10.1145/3457607. (You can access it via Eduroam, or via VPN if you are working from home.)*
2. *You do NOT have to read it, but you need to have the paper for the exercise.*

---

## Programming Part - Get to know datasets

In this learning unit, we will explore the diabetes dataset you already worked with in previous learning units.

The dataset is called *“CDC Diabetes Health Indicators”*. You can find more information about it in the UCI ML Library [here](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators). In the Resources section, we have also provided a file called `diabetes_012_health_indicators_BRFSS2015.csv`, which is one of the three CSV files in the dataset.

We will investigate possible biases in this dataset and then train a diabetes prediction model on it.

First, let's take another look at the attributes of our dataset. In the last learning unit, we examined the distributions of the attributes and their meanings. This time, we want to take a closer look at the potential problems that can arise from including certain features.

**✏️ Task 4.2** 
*Perform an exploratory data analysis on the dataset. Think about all the things that you would need to know about this data in order to make confident statements about it and to be able to continue to work with it. Make sure to at least:*
- *look at the features and try to understand them,*
- *look at the distributions of the following features: sex, age, education, income, diabetes_012, HighBP, HighChol.*

*Write down your code in the Python cell, and your key findings about the sample in the Markdown cell below.*

If you are feeling lost about the exploratory data analysis, take a look at the linked article in the resources section below.
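As one possible starting point for the distribution part of Task 4.2, the sketch below tabulates absolute and relative frequencies per column. The helper name `summarize_distributions` is hypothetical, the small frame is made up purely to show the output shape, and the column names in the commented usage (`Sex`, `Age`, `Education`, `Income`) are assumptions about the CSV header that you should check against `df.columns`.

```python
import pandas as pd


def summarize_distributions(df: pd.DataFrame, columns: list[str]) -> dict[str, pd.DataFrame]:
    """Return absolute and relative frequencies for each requested column."""
    summaries = {}
    for col in columns:
        # value_counts() ignores missing values by default; sort by category value.
        counts = df[col].value_counts().sort_index()
        summaries[col] = pd.DataFrame({
            "count": counts,
            "share": (counts / len(df)).round(3),
        })
    return summaries


# Tiny made-up frame, only to show the shape of the result (NOT real data).
toy = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 1, 1],
    "Diabetes_012": [0, 0, 2, 0, 1, 0],
})
for name, table in summarize_distributions(toy, ["Sex", "Diabetes_012"]).items():
    print(name)
    print(table, end="\n\n")

# With the course file, assuming these column names exist in the CSV:
# df = pd.read_csv("diabetes_012_health_indicators_BRFSS2015.csv")
# cols = ["Sex", "Age", "Education", "Income", "Diabetes_012", "HighBP", "HighChol"]
# summaries = summarize_distributions(df, cols)
# print(df.isna().sum())        # missing values per column
# print(df.duplicated().sum())  # number of fully duplicated rows
```

Relative shares are often more informative than raw counts here, because they show directly whether a group or a class is under-represented in the sample.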

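```python
import pandas as pd

# Load the provided CSV file (one of the three CSV files in the dataset).
df = pd.read_csv("diabetes_012_health_indicators_BRFSS2015.csv")
print(df.shape)  # (number of rows, number of columns)

# Other quick first looks that were tried during exploration:
# df.head()
# df.nunique(axis=0)
# df.describe()
# df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
# df.Diabetes_012.unique()

# For every column, print the number of distinct values and the values themselves.
for col in df.columns:
    unique = df[col].unique()
    print(f"{col} ({len(unique)}): {unique}")
```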

(253680, 22)
Diabetes_012 (3): [0. 2. 1.]
HighBP (2): [1. 0.]
HighChol (2): [1. 0.]
CholCheck (2): [1. 0.]
BMI (84): [40. 25. 28. 27. 24. 30. 34. 26. 33. 21. 23. 22. 38. 32. 37. 31. 29. 20.
 35. 45. 39. 19. 47. 18. 36. 43. 55. 49. 42. 17. 16. 41. 44. 50. 59. 48.
 52. 46. 54. 57. 53. 14. 15. 51. 58. 63. 61. 56. 74. 62. 64. 66. 73. 85.
 60. 67. 65. 70. 82. 79. 92. 68. 72. 88. 96. 13. 81. 71. 75. 12. 77. 69.
 76. 87. 89. 84. 95. 98. 91. 86. 83. 80. 90. 78.]
Smoker (2): [1. 0.]
Stroke (2): [0. 1.]
HeartDiseaseorAttack (2): [0. 1.]
PhysActivity (2): [0. 1.]
Fruits (2): [0. 1.]
Veggies (2): [1. 0.]
HvyAlcoholConsump (2): [0. 1.]
AnyHealthcare (2): [1. 0.]
NoDocbcCost (2): [0. 1.]
GenHlth (5): [5. 3. 2. 4. 1.]
MentHlth (31): [18.  0. 30.  3.  5. 15. 10.  6. 20.  2. 25.  1.  4.  7.  8. 21. 14. 26.
 29. 16. 28. 11. 12. 24. 17. 13. 27. 19. 22.  9. 23.]
PhysHlth (31): [15.  0. 30.  2. 14. 28.  7. 20.  3. 10.  1.  5. 17.  4. 19.  6. 12. 25.
 27. 21. 22.  8. 29. 24.  9. 16. 18. 23. 13. 26. 11.]
Dif

---

In the lecture, you learned about *"protected attributes"*. Protected attributes refer to sensitive or personal characteristics that are legally or ethically protected from being used as a basis for discrimination in data science applications. The *CDC Diabetes Health Indicators* dataset contains some of these attributes. It is important to understand and be able to identify the difference between protected and unprotected attributes.

**✏️ Task 4.3**
1. *What are the protected attributes of the dataset?*
2. *Explain why it makes sense to protect these attributes. Identify any problems that may arise from including them.*
3. *Is using the unprotected attributes unproblematic?*

*Please write your answers in the markdown cell below.*
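To back up your answers to Task 4.3 with numbers, it can help to compare the outcome rate across the groups of a protected attribute and to check whether unprotected features are associated with a protected one (possible proxies). The following is a minimal sketch under assumptions: both helper names are hypothetical, the frame is made up for illustration, and `Sex` is assumed to be the column name used in the CSV.

```python
import pandas as pd


def outcome_rate_by_group(df, group_col, outcome_col, positive_value):
    """Share of rows with the given outcome value within each group."""
    is_positive = df[outcome_col] == positive_value
    return is_positive.groupby(df[group_col]).mean()


def proxy_candidates(df, protected_col, candidate_cols):
    """Absolute Pearson correlation of each candidate with the protected column,
    sorted from strongest to weakest association."""
    corr = df[candidate_cols].corrwith(df[protected_col]).abs()
    return corr.sort_values(ascending=False)


# Made-up illustration data (NOT taken from the course dataset).
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 0, 1, 1, 1, 1],
    "Smoker": [0, 0, 0, 1, 1, 1, 1, 0],
    "Fruits": [1, 0, 1, 0, 1, 0, 1, 0],
    "Diabetes_012": [0, 0, 2, 0, 2, 2, 0, 1],
})
print(outcome_rate_by_group(toy, "Sex", "Diabetes_012", positive_value=2))
print(proxy_candidates(toy, "Sex", ["Smoker", "Fruits"]))
```

Pearson correlation is only a rough first check for binary and ordinal features. A feature with a low correlation can still act as a proxy in combination with others, so treat the ranking as a hint, not as proof that a feature is unproblematic.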

---

Next, we will train a first version of a diabetes type prediction system. You will compare this system to the one you create in the next assignment to evaluate the impact of bias mitigation and fairness techniques.

This task is intended to be open-ended. You are expected to use your prior data science knowledge. Think about the type of problem you are trying to solve and what model might be a good fit. If you are not sure what to do, take a look at Assignment 1 for a model training example and consider looking at other resources.

**✏️ Task 4.4** 
1. *Train a model that predicts the type of diabetes (the `Diabetes_012` feature).*
2. *Try at least one other type of model and compare the results with the model from step 1 (e.g., RandomForestClassifier, KNeighborsClassifier, LogisticRegression).*
3. *Choose the best-performing model and optimize it further. Try to avoid [common pitfalls](https://www.gyata.ai/machine-learning/machine-learning-optimization/) of machine learning optimization.* <br>
Hint: Save the model and your predictions; you will need them again in the next learning unit!

*Please write your answers in the markdown cell below.*
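If you are unsure how to structure the comparison in Task 4.4, the sketch below shows one possible setup with scikit-learn: a single stratified train/test split, two candidate models, and two metrics that do not reward simply predicting the majority class. It is not the expected solution. The helper name `compare_models`, the hyperparameters, the synthetic stand-in data, and the file name in the comments are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, random_state=0):
    """Fit two candidate classifiers on one stratified split and
    score both on the same held-out part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    candidates = {
        # Scaling lives inside the pipeline so it is fitted on training data only.
        "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "forest": RandomForestClassifier(n_estimators=100, random_state=random_state),
    }
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = {
            "balanced_accuracy": balanced_accuracy_score(y_test, pred),
            "macro_f1": f1_score(y_test, pred, average="macro"),
        }
    return candidates, scores


# Synthetic stand-in data (NOT the course data): three imbalanced classes.
rng = np.random.default_rng(0)
y_toy = rng.choice([0, 1, 2], size=600, p=[0.7, 0.1, 0.2])
X_toy = rng.normal(size=(600, 4)) + y_toy[:, None]
models, scores = compare_models(X_toy, y_toy)
for name, result in scores.items():
    print(name, result)

# With the course file, assuming `df` has been loaded from the CSV:
# X = df.drop(columns=["Diabetes_012"])
# y = df["Diabetes_012"].astype(int)
# models, scores = compare_models(X, y)
# import joblib
# joblib.dump(models["forest"], "model_v1.joblib")  # hypothetical file name
```

Whether balanced accuracy and macro F1 are the right metrics depends on what your exploratory analysis shows about the class distribution. If one class dominates, plain accuracy can look good even for a model that never predicts the rare classes.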

---

## 📝 Feedback
We are interested in your feedback in order to improve this course. We will read all of your feedback and evaluate it. What you share may have a direct impact on the rest of the course or future iterations of it.

Write down your feedback on the lecture, the exercises, or the assignments in the Markdown cell below. Furthermore, please note the approximate time it took you to complete the assignment. You may also write about your insights, what you found interesting, or questions that you have.