# 02: Using a logistic regression model for classification on census data

This project used a binary logistic regression model in BigQuery ML to predict the income range of respondents in the US Census dataset. The dataset [`census_adult_income`](https://pantheon.corp.google.com/bigquery?p=bigquery-public-data&d=census_bureau_usa&page=dataset) has about 32,561 rows of data.

## Objective 
1. Create a logistic regression model
1. Evaluate the logistic regression model
1. Make the income group predictions using the logistic regression model

## Key Concepts
1. Logistic regression 
1. Explainable AI
1. ML.EVALUATE
1. ML.PREDICT

## Steps
1. Create the dataset 
1. Use the SELECT statement to examine the data 
1. Use the CREATE VIEW statement to compile your training data
1. Use the CREATE MODEL statement to create your logistic regression model. 
1. Use the ML.EVALUATE function to evaluate the model
1. Use the ML.PREDICT function to predict the income bracket for a given set of census participants
1. Use the ML.EXPLAIN_PREDICT function to explain prediction results with Explainable AI methods
1. Use the ML.GLOBAL_EXPLAIN function to find out which features are the most important for determining the income bracket


## Step 1: Create the Dataset and Dataset Table




The dataset was retrieved from the BigQuery marketplace of public datasets: [US census data](https://cloud.google.com/bigquery/public-data).
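
The view and model in the later steps live in a dataset named `02_bqml_logistic_reg_census_income_prediction`. A minimal sketch of creating it with DDL follows. The `US` location is an assumption, chosen so that the dataset sits alongside the public source table; adjust it to your project.

```sql
-- Create the dataset that will hold the view and the model.
-- Assumption: location 'US', to match the public source data.
CREATE SCHEMA IF NOT EXISTS `02_bqml_logistic_reg_census_income_prediction`
OPTIONS (location = 'US');
```

The same dataset can also be created from the BigQuery console; the DDL form is shown only because it is easy to rerun.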


## Step 2: Use the SELECT statement to examine the data
Next, the dataset was examined to identify which columns to use as training data for the logistic regression model.

```sql
SELECT
 age,
 workclass,
 marital_status,
 education_num,
 occupation,
 hours_per_week,
 income_bracket
FROM
 `bigquery-public-data.ml_datasets.census_adult_income`
LIMIT
 100;

```
The query results show that the `income_bracket` column in the `census_adult_income` table takes only one of two values: `<=50K` or `>50K`.

![01](assets/01.png "01")


Examining the table also shows that the `education` and `education_num` columns in `census_adult_income` express the same data in different formats, so only `education_num` is used in the steps that follow.
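
Both observations can be checked directly against the table. The queries below are a sketch and not part of the original steps; the `row_count` alias is illustrative.

```sql
-- Count respondents in each income bracket.
SELECT
 income_bracket,
 COUNT(*) AS row_count
FROM
 `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY
 income_bracket;

-- List every education / education_num pairing. If the two columns encode
-- the same information, each education value maps to exactly one education_num.
SELECT
 education,
 education_num,
 COUNT(*) AS row_count
FROM
 `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY
 education,
 education_num
ORDER BY
 education_num;
```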

## Step 3: Use the CREATE VIEW statement to compile the training data

The next step was to create a view that compiles the training data by selecting the columns used to train the logistic regression model. The census respondent income prediction is based on the following attributes:

```sql
-- Selects the respondent attributes used as features and labels each row as training, evaluation, or prediction data.
CREATE OR REPLACE VIEW
 `02_bqml_logistic_reg_census_income_prediction.input_view` AS
SELECT
 age,
 workclass,
 marital_status,
 education_num,
 occupation,
 hours_per_week,
 income_bracket,
 CASE
   WHEN MOD(functional_weight, 10) < 8 THEN 'training'
   WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
   WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
 END AS dataframe
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
```

The query creates a view containing these columns so that they can be used for training and prediction later. The `dataframe` column uses the last digit of `functional_weight`, which is otherwise excluded from the features, to label roughly 80% of the rows for training and to reserve the remaining rows for evaluation and prediction.
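
The 80% figure holds only if the last digit of `functional_weight` is spread evenly, so it is worth checking the actual split. A minimal sketch, with illustrative aliases `row_count` and `pct_of_rows`:

```sql
-- Count rows per split and express each count as a share of all rows.
SELECT
 dataframe,
 COUNT(*) AS row_count,
 ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct_of_rows
FROM
 `02_bqml_logistic_reg_census_income_prediction.input_view`
GROUP BY
 dataframe
ORDER BY
 dataframe;
```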


## Step 4: Use the CREATE MODEL Statement to create your logistic regression model

Next, we used the CREATE MODEL statement with `model_type='LOGISTIC_REG'` to train the new logistic regression model on the training rows of the view from the previous step.

```sql
CREATE OR REPLACE MODEL
 `02_bqml_logistic_reg_census_income_prediction.census_model`
OPTIONS
 ( model_type='LOGISTIC_REG',
   auto_class_weights=TRUE, -- Balances the class labels in the training data
   enable_global_explain=TRUE, -- Required by ML.GLOBAL_EXPLAIN in Step 8
   input_label_cols=['income_bracket'] -- The label column the model predicts
 ) AS
SELECT -- Queries the view from Step 3, keeping only the training rows
 * EXCEPT(dataframe)
FROM
 `02_bqml_logistic_reg_census_income_prediction.input_view`
WHERE
 dataframe = 'training'

```
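
Once training finishes, the per-iteration loss can be inspected with `ML.TRAINING_INFO`. This is an optional check that is not part of the original steps:

```sql
-- Show how the loss changed across training iterations.
SELECT
 iteration,
 loss,
 eval_loss,
 learning_rate,
 duration_ms
FROM
 ML.TRAINING_INFO(MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`)
ORDER BY
 iteration;
```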

## Step 5: Use the ML.EVALUATE function to evaluate the model data

Then, the ML.EVALUATE function was used to compute statistics about model performance on the rows labeled `evaluation`.

```sql
SELECT
 *
FROM
 ML.EVALUATE (MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`,
   (
   SELECT
     *
   FROM
     `02_bqml_logistic_reg_census_income_prediction.input_view`
   WHERE
     dataframe = 'evaluation'
   )
 )
```
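
For a logistic regression model, `ML.EVALUATE` reports metrics such as precision, recall, accuracy, F1 score, log loss, and ROC AUC. To see the underlying counts of correct and incorrect predictions, `ML.CONFUSION_MATRIX` can be run on the same rows. The sketch below assumes a classification threshold of 0.5; that value is illustrative, not one taken from this project.

```sql
-- Confusion matrix on the evaluation split.
-- Assumption: 0.5 is used as the classification threshold.
SELECT
 *
FROM
 ML.CONFUSION_MATRIX (MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`,
   (
   SELECT
     *
   FROM
     `02_bqml_logistic_reg_census_income_prediction.input_view`
   WHERE
     dataframe = 'evaluation'
   ),
   STRUCT(0.5 AS threshold)
 );
```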

## Step 6: Use the ML.PREDICT function to predict the income bracket for a given set of census participants

Next, the ML.PREDICT function was used to predict the income bracket for the census participants in the rows labeled `prediction`.

```sql
SELECT
 *
FROM
 ML.PREDICT (MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`,
   (
   SELECT
     *
   FROM
     `02_bqml_logistic_reg_census_income_prediction.input_view`
   WHERE
     dataframe = 'prediction'
    )
 )
```
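
`ML.PREDICT` returns the input columns together with `predicted_income_bracket` and `predicted_income_bracket_probs`, an array holding one probability per label. The sketch below places the predicted label next to the actual one and extracts the probability of the predicted label. The aliases `actual_income_bracket` and `predicted_prob` and the `LIMIT` are illustrative.

```sql
-- Compare predicted and actual labels, with the probability of the predicted label.
SELECT
 income_bracket AS actual_income_bracket,
 predicted_income_bracket,
 (
   SELECT
     p.prob
   FROM
     UNNEST(predicted_income_bracket_probs) AS p
   WHERE
     p.label = predicted_income_bracket
 ) AS predicted_prob
FROM
 ML.PREDICT (MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`,
   (
   SELECT
     *
   FROM
     `02_bqml_logistic_reg_census_income_prediction.input_view`
   WHERE
     dataframe = 'prediction'
   )
 )
LIMIT
 20;
```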


## Step 7: Use the ML.EXPLAIN_PREDICT function to explain prediction results with Explainable AI methods

To understand why the model is generating these prediction results, you can use the ML.EXPLAIN_PREDICT function.

```sql
-- Use the ML.EXPLAIN_PREDICT function to explain prediction results with explainable AI Methods.

SELECT
*
FROM
ML.EXPLAIN_PREDICT(MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`,
 (
 SELECT
   *
 FROM
   `02_bqml_logistic_reg_census_income_prediction.input_view`
 WHERE
   dataframe = 'evaluation'),
 STRUCT(3 as top_k_features))
```
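
The attributions come back as an array column, `top_feature_attributions`, with one `feature` / `attribution` pair per entry. To read them as ordinary rows, the array can be flattened with `UNNEST`. A sketch, where the alias `attr` and the `LIMIT` are illustrative:

```sql
-- Flatten the top feature attributions into one row per (prediction, feature).
SELECT
 predicted_income_bracket,
 attr.feature,
 attr.attribution
FROM
 ML.EXPLAIN_PREDICT(MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`,
   (
   SELECT
     *
   FROM
     `02_bqml_logistic_reg_census_income_prediction.input_view`
   WHERE
     dataframe = 'evaluation'),
   STRUCT(3 AS top_k_features)),
 UNNEST(top_feature_attributions) AS attr
LIMIT
 30;
```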


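## Step 8: Use the ML.GLOBAL_EXPLAIN function to globally explain the model

Finally, the ML.GLOBAL_EXPLAIN function shows which features are the most important for determining the income bracket across the whole model. It only works on a model that was trained with `enable_global_explain=TRUE`.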

```sql
/** 8. Use the ML.GLOBAL_EXPLAIN function to know which features are the most important to determine the income bracket. **/

SELECT
*
FROM
 ML.GLOBAL_EXPLAIN(MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`);

```
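
To rank the features, the output can be sorted by its `attribution` column. The second query is a sketch that requests per-class attributions through the `class_level_explain` option. Both assume the model was trained with `enable_global_explain=TRUE`.

```sql
-- Rank features by their overall importance to the model.
SELECT
 feature,
 attribution
FROM
 ML.GLOBAL_EXPLAIN(MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`)
ORDER BY
 attribution DESC;

-- Break the attributions down by income bracket class.
SELECT
 *
FROM
 ML.GLOBAL_EXPLAIN(MODEL `02_bqml_logistic_reg_census_income_prediction.census_model`,
   STRUCT(TRUE AS class_level_explain))
ORDER BY
 attribution DESC;
```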