# Module 4

# Section 1 - Model building

Machine learning starts with data. 

**TASK:** Read in the dataset by running the following cell:

In [1]:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)

          id  user-age    account-class  account-age active-session  \
0          0        39        State-gov           13             No   
1          1        50         Personal           13            Yes   
2          2        38  Private-company            9            Yes   
3          3        53  Private-company            7            Yes   
4          4        28  Private-company           13            Yes   
...      ...       ...              ...          ...            ...   
32556  32556        27  Private-company           12            Yes   
32557  32557        40  Private-company            9            Yes   
32558  32558        58  Private-company            9             No   
32559  32559        22  Private-company            9             No   
32560  32560        52         Personal            9            Yes   

      backup-email-provider    race      sex  minutes-spent-on-service  \
0                     other   White     Male                       NaN   

## Data overview

The data you just read in relates to **detecting fraudulent login attempts**. It has the following columns:

* **id**: sequential ID numbers for the data rows

* **user-age**: the age (in years) of the user

* **account-class**: categories specifying whether the account is a *personal* account or related to various organizations that assign accounts with your provider

* **account-age**: how long (in months) the account has been active

* **active-session**: whether the user is signing in from a device that has previously had an active login session, as identified by their cookies

* **backup-email-provider**: the service (e.g., gmail) through which their backup email is registered

* **race**: the race of the user

* **sex**: the sex of the user

* **minutes-spent-on-service**: the total number of minutes the user has spent logged into the service

* **time-of-day**: the hour of the day (0 - 23) during which the login attempt is being made

* **login-country**: the country from which the login attempt is being made

* **label**: a binary label indicating the ground truth of whether the login attempt was deemed **Benign** or **Fraudulent**


## Data preparation

First, you'll need to prepare your data. Explore the data, verify that it is in a reasonable form, and decide which variables you do or do not want to include in your classifier.
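As a starting point for exploration, the sketch below shows a few pandas calls that are often useful for checking types, missing values, and category frequencies. It runs on a small hypothetical `toy` frame that only mimics a few columns of the dataset; the values are made up for illustration, and you would apply the same calls to `data` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few columns of the dataset (not real data).
toy = pd.DataFrame({
    "user-age": [39, 50, 38, 53],
    "account-class": ["State-gov", "Personal", "Private-company", "Private-company"],
    "minutes-spent-on-service": [np.nan, 120.0, 45.5, np.nan],
})

# Column types: non-numeric columns are candidates for dummy-coding.
print(toy.dtypes)

# Number of missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Frequency of each category, counting missing values as their own entry.
class_counts = toy["account-class"].value_counts(dropna=False)
print(class_counts)

# Summary statistics for numeric columns, useful for spotting implausible values.
print(toy.describe())
```

Columns with many missing values, such as the `NaN` visible in `minutes-spent-on-service` in the output above, need an explicit decision (drop, impute, or exclude) before they can be used in most models.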

In addition to making sure that the numeric data is properly encoded, note that there are several categorical variables in this dataset. For example, ``account-class`` contains information about the type of account. If we encode this variable naively (e.g., Personal=0, Private-company=1, etc.), certain kinds of models (e.g., logistic regressions) will treat this coding as a statement that there is an *ordering* of these categories. The way to avoid this is through **dummy-coding**, or creating columns with binary values to represent the different categories. Luckily, there is already a pandas function that does this for you: [get_dummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).
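To make the difference concrete, here is a minimal sketch that contrasts naive integer coding with dummy-coding. The `toy` frame is hypothetical and only borrows category names that appear in the output above.

```python
import pandas as pd

# Hypothetical toy column using category names seen in the dataset.
toy = pd.DataFrame(
    {"account-class": ["State-gov", "Personal", "Private-company", "Personal"]}
)

# Naive integer coding: categories are sorted alphabetically and numbered,
# which implies Personal (0) < Private-company (1) < State-gov (2).
naive_codes = toy["account-class"].astype("category").cat.codes
print(naive_codes.tolist())

# Dummy-coding: one binary column per category, with no implied ordering.
dummies = pd.get_dummies(toy["account-class"], prefix="account-class", dtype=int)
print(dummies)
```

Each row of `dummies` contains exactly one 1, so a model can learn a separate effect for every category rather than a single slope across arbitrary integer codes.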

**TASK:** Use the following cell to prepare your data:

In [2]:
# modify this code

X = pd.get_dummies(pd.DataFrame(data, columns=['account-age', 'active-session']))
print(X)

       account-age  active-session_No  active-session_Yes
0               13                  1                   0
1               13                  0                   1
2                9                  0                   1
3                7                  0                   1
4               13                  0                   1
...            ...                ...                 ...
32556           12                  0                   1
32557            9                  0                   1
32558            9                  1                   0
32559            9                  1                   0
32560            9                  0                   1

[32561 rows x 3 columns]


This is technically already enough processing to build a model.

**TASK:** Ignore the things we've said about being careful about data and model building. Run the code below to create sample models that try to predict whether or not a login attempt is fraudulent. Then, go back and edit this code and your pre-processing code above to remove features (variables) that are potentially problematic to use in prediction and fix any errors in how data is represented. Iterate on your model.

**WEIJIA, INSERT YOUR MODEL-BUILDING CODE HERE**
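Until the course's model-building code is filled in, the following is one possible sketch of what such a cell could look like. It is not the course's code: it assumes scikit-learn is installed, uses a logistic regression as an arbitrary example classifier, and trains on synthetic stand-in data with a made-up labelling rule so that it runs on its own. With the real dataset you would build `X` from `data` as in the cell above and derive `y` from the `label` column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (hypothetical values, not real data).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "account-age": rng.integers(1, 17, size=n),
    "active-session": rng.choice(["Yes", "No"], size=n),
})
# Made-up rule: logins without a prior session are fraudulent more often.
p_fraud = np.where(toy["active-session"] == "No", 0.6, 0.1)
toy["label"] = np.where(rng.random(n) < p_fraud, "Fraudulent", "Benign")

# Features are dummy-coded; the target is 1 for Fraudulent and 0 for Benign.
X = pd.get_dummies(toy[["account-age", "active-session"]], dtype=int)
y = (toy["label"] == "Fraudulent").astype(int)

# Hold out a quarter of the rows so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

Reporting accuracy on a held-out split rather than on the training rows gives a more honest estimate of how the model would behave on new login attempts.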

**TASK:** Be prepared to explain to the group what modeling approach you used, variables you included in your model, as well as how you pre-processed them. Be prepared to report some notion of your model's accuracy.

# Section 2 - Fairness

Now that you've gotten a sense of how models are built, let's look at fairness. 

**TASK:** Take the model(s) you've built and try computing different notions of their performance (accuracy, false positive rate, false negative rate, outcomes) for different sub-populations, referring to the demographic variables. What do you find?

In [None]:
**WEIJIA, INSERT YOUR MODEL-BUILDING CODE HERE**
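One way to approach the per-group comparison is sketched below using pandas only. The `results` table is hypothetical: it stands in for a frame holding a demographic column, the true labels, and your model's predictions on held-out data (1 = Fraudulent, 0 = Benign). The helper name `group_metrics` and the metric names are illustrative choices, not part of the course materials.

```python
import pandas as pd

# Hypothetical evaluation table (made-up values): 1 = Fraudulent, 0 = Benign.
results = pd.DataFrame({
    "sex":    ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"],
    "y_true": [0, 0, 1, 1, 0, 0, 1, 1],
    "y_pred": [0, 1, 1, 1, 0, 0, 0, 1],
})

def group_metrics(group: pd.DataFrame) -> pd.Series:
    """Compute accuracy, error rates, and flag rate for one sub-population."""
    tp = int(((group["y_true"] == 1) & (group["y_pred"] == 1)).sum())
    tn = int(((group["y_true"] == 0) & (group["y_pred"] == 0)).sum())
    fp = int(((group["y_true"] == 0) & (group["y_pred"] == 1)).sum())
    fn = int(((group["y_true"] == 1) & (group["y_pred"] == 0)).sum())
    return pd.Series({
        "accuracy": (tp + tn) / len(group),
        # Share of benign logins wrongly flagged as fraudulent.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
        # Share of fraudulent logins the model missed.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else float("nan"),
        # Outcome rate: share of all logins flagged as fraudulent.
        "flagged_rate": (tp + fp) / len(group),
    })

# One row of metrics per sub-population.
by_group = pd.DataFrame(
    {name: group_metrics(group) for name, group in results.groupby("sex")}
).T
print(by_group)
```

In this toy table both groups have the same accuracy (0.75), yet one group bears all the false positives and the other all the false negatives, which is why accuracy alone can hide disparities between sub-populations.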

# Section 3 - Explainability

Finally, think about how you will explain your model. What do you need to communicate? Google has a page about the [model cards approach](https://modelcards.withgoogle.com/about), including some [examples](https://modelcards.withgoogle.com/model-reports). You won't have nearly enough time to create a full version of a model card or anything similar, but be prepared for the following tasks:

**TASK:** Prepare a short explanation *for your boss* about what the model does and why you are confident in deploying it. Think about the many aspects of the model about which you would want to be confident before deploying it. Describe for your boss how the model works, what the model does, and why they should trust you.

**TASK:** Prepare a short explanation *for your customers* who are going to be trying to log into your service and whose login attempts will be classified by the model.