## Connect BigQuery to VS Code

Before writing any queries, I need to connect BigQuery to VS Code.

I want to do this because VS Code is a great all-in-one IDE, and by consolidating the tools I use there, I can keep all of my project work in one place.

The setup takes several steps, including installing the Google Cloud CLI (`gcloud`), creating a service account, and assigning the correct permissions to that account.

Now that my BigQuery account is connected to VS Code, I can get started with my queries.
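
As a quick sanity check on the service account step, the sketch below inspects a downloaded JSON key file. It assumes authentication uses a key file referenced by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which is only one possible setup; the helper names `key_problems` and `load_key` are hypothetical.

```python
import json

# Fields found in a service account key file downloaded from Google Cloud.
REQUIRED_FIELDS = ("type", "project_id", "client_email", "private_key")


def key_problems(key):
    """Return a list of problems with a parsed key file (empty list if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    if key.get("type") != "service_account":
        problems.append("type is not 'service_account'")
    return problems


def load_key(path):
    """Read and parse the JSON key file at `path`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `key_problems(load_key(path))` with the path stored in `GOOGLE_APPLICATION_CREDENTIALS` should return an empty list for a well-formed key. It does not verify that the account has the right BigQuery permissions.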

### Install packages
Install the packages and dependencies needed to query BigQuery, then load the BigQuery extension.

In [None]:
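## BigQuery client library (needed by %load_ext google.cloud.bigquery) and the BigQuery Storage API client
!pip install google-cloud-bigquery google-cloud-bigquery-storage --quiet
## query results are returned as pandas DataFrames; pyarrow is required for that conversion
!pip install pyarrow --quiet
## for the progress bar
!pip install tqdm --quiet
## Jupyter notebook UI widgets
!pip install ipywidgets --quiet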
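
Because the installs run with `--quiet`, a failed install is easy to miss. A minimal sketch for confirming that the modules can be found before loading the extension is shown here; the helper name `missing_modules` is hypothetical, and the module list assumes the usual import names for these packages.

```python
import importlib.util

# Import names (not pip names) of the modules this notebook relies on.
NOTEBOOK_MODULES = [
    "google.cloud.bigquery",
    "google.cloud.bigquery_storage",
    "pyarrow",
    "tqdm",
    "ipywidgets",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package (e.g. `google`) raises instead of returning None.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Running `missing_modules(NOTEBOOK_MODULES)` in a notebook cell should return an empty list when everything is installed.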


In [None]:
## Load the IPython extension from google.cloud.bigquery to enable the %%bigquery cell magic in my notebook.

%load_ext google.cloud.bigquery

# Data Cleaning and Validation
## Loading data and reviewing attributes and record types

- Create a new dataset in BigQuery and upload the table to query.
- Load the whole dataset and its details to get a better understanding of the data as a whole.
- Use `SELECT *` to preview the rows, `COUNT(DISTINCT unique_id)` to count the total number of customers (`unique_id` is added later in this notebook), and `INFORMATION_SCHEMA.COLUMNS` to review the column names and data types.

In [None]:
%%bigquery
SELECT
    *
FROM `nomadic-ocean-395807.churn_rate.customer_data`
LIMIT 20;



In [None]:
%%bigquery
SELECT
    COUNT(DISTINCT unique_id) AS sample_size
FROM `nomadic-ocean-395807.churn_rate.customer_data`;


In [None]:
%%bigquery
SELECT
    *
FROM `nomadic-ocean-395807.churn_rate.INFORMATION_SCHEMA.COLUMNS`
WHERE TABLE_NAME = 'customer_data';

### Data Type and column names
The data type is correct for each attribute, and the column names are logical.

## Duplicates 
Because no UUID is provided, duplicates can be identified by checking across multiple attributes. It is highly unlikely that two different accounts have exactly the same area code and usage.

In [None]:
%%bigquery
SELECT 
    COUNT(*) AS duplicates
FROM `nomadic-ocean-395807.churn_rate.customer_data`
GROUP BY 
    area_code,
    total_day_minutes,
    total_eve_minutes,
    total_night_minutes
HAVING COUNT(*) > 1;


No rows share the same values for `area_code`, `total_day_minutes`, `total_eve_minutes`, and `total_night_minutes`, so it is highly unlikely that there are duplicate accounts.
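
The same composite-key idea can be sketched in plain Python, for example on a small sample pulled into memory. The helper name `duplicate_groups` and the sample rows are made up for illustration; only the four key columns come from the query above.

```python
from collections import Counter

# Columns that together act as a stand-in for a unique key.
KEY_COLUMNS = ("area_code", "total_day_minutes", "total_eve_minutes", "total_night_minutes")


def duplicate_groups(rows, key_columns=KEY_COLUMNS):
    """Return {key tuple: count} for key combinations that appear more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: the first two share every key column.
sample_rows = [
    {"area_code": "415", "total_day_minutes": 180.2, "total_eve_minutes": 200.1, "total_night_minutes": 150.0},
    {"area_code": "415", "total_day_minutes": 180.2, "total_eve_minutes": 200.1, "total_night_minutes": 150.0},
    {"area_code": "408", "total_day_minutes": 120.5, "total_eve_minutes": 210.3, "total_night_minutes": 190.4},
]
print(duplicate_groups(sample_rows))
```

An empty dictionary corresponds to the SQL query returning no rows.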

## Attribute length

### 'state' attribute length = 2
The state column should contain only 2-character values. The next query checks this.

In [None]:
%%bigquery
SELECT 
    LENGTH(state) AS state_value_length
FROM `nomadic-ocean-395807.churn_rate.customer_data`
GROUP BY LENGTH(state);


### 'area_code' attribute length = 3
The area_code column should contain a 3-digit code. The next query checks this by listing the distinct value lengths.

In [None]:
%%bigquery
SELECT
    LENGTH(area_code) AS area_code_value_length
FROM `nomadic-ocean-395807.churn_rate.customer_data`
GROUP BY LENGTH(area_code);
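
Grouping by `LENGTH(...)` returns one row per distinct length, so a clean column yields a single row. A rough Python equivalent, with the hypothetical helper `value_lengths` and made-up sample values, looks like this:

```python
from collections import Counter


def value_lengths(values):
    """Return {length: number of values with that length}."""
    return dict(Counter(len(value) for value in values))


# Made-up samples: a clean state column and an area code column still carrying its prefix.
print(value_lengths(["OH", "NJ", "KS"]))
print(value_lengths(["area_code_415", "area_code_408"]))
```

A value such as `area_code_415` has length 13 rather than 3, so this check only reports 3 for `area_code` once the prefix has been removed.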

## NULL values

### Select NULL values

In [None]:
%%bigquery
SELECT 
    *
FROM `nomadic-ocean-395807.churn_rate.customer_data`
WHERE 
    state IS NULL OR
    account_length IS NULL OR
    area_code IS NULL OR
    international_plan IS NULL OR
    voice_mail_plan IS NULL OR
    number_vmail_messages IS NULL OR
    total_day_minutes IS NULL OR
    total_day_calls IS NULL OR
    total_day_charge IS NULL OR
    total_eve_minutes IS NULL OR
    total_eve_calls IS NULL OR
    total_eve_charge IS NULL OR
    total_night_minutes IS NULL OR
    total_night_calls IS NULL OR
    total_night_charge IS NULL OR
    total_intl_minutes IS NULL OR
    total_intl_calls IS NULL OR
    total_intl_charge IS NULL OR
    number_customer_service_calls IS NULL OR
    churn IS NULL;


### Delete IS NULL
One row returned NULL for every value. It is safe to say this row is an error and should be deleted.

In [None]:
%%bigquery
DELETE FROM `nomadic-ocean-395807.churn_rate.customer_data`
WHERE 
    state IS NULL OR
    account_length IS NULL OR
    area_code IS NULL OR
    international_plan IS NULL OR
    voice_mail_plan IS NULL OR
    number_vmail_messages IS NULL OR
    total_day_minutes IS NULL OR
    total_day_calls IS NULL OR
    total_day_charge IS NULL OR
    total_eve_minutes IS NULL OR
    total_eve_calls IS NULL OR
    total_eve_charge IS NULL OR
    total_night_minutes IS NULL OR
    total_night_calls IS NULL OR
    total_night_charge IS NULL OR
    total_intl_minutes IS NULL OR
    total_intl_calls IS NULL OR
    total_intl_charge IS NULL OR
    number_customer_service_calls IS NULL OR
    churn IS NULL;
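
Note that the `OR` conditions remove any row with at least one NULL, not only rows that are entirely NULL. Here that makes no difference because the only matching row was NULL throughout, but the distinction can be made explicit. A minimal sketch, assuming rows are available as dictionaries and using the hypothetical helper `split_by_nulls`:

```python
def split_by_nulls(rows):
    """Split rows into (complete, partially null, entirely null)."""
    complete, partial, empty = [], [], []
    for row in rows:
        values = list(row.values())
        if all(value is None for value in values):
            empty.append(row)      # safe to treat as an import error
        elif any(value is None for value in values):
            partial.append(row)    # worth a closer look before deleting
        else:
            complete.append(row)
    return complete, partial, empty
```

If `partial` is ever non-empty, those rows may deserve a review instead of a blanket delete.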


## Outliers

### Max Outliers

Check for maximum values of columns to ensure no outliers.
Start by checking the MAX account_length.

In [None]:
%%bigquery
SELECT 
    MAX(account_length) AS max_account_length 
FROM `nomadic-ocean-395807.churn_rate.customer_data`;


#### Very high MAX account length
The value is 243 (months), which is about 20 years.

Include the `churn` column to see whether the data keeps counting months after churn. It doesn't appear to, as the account with 243 months has not churned.

Check the top 10 values to see whether it is an outlier or part of the norm.

In [None]:
%%bigquery
SELECT 
    account_length,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY account_length DESC
LIMIT 10;

#### High Max account_length not an outlier
The high account_length doesn't appear to be an outlier; this company has some very longstanding customers.

In a real-world scenario, I would double-check this with the customer support manager.
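
The "top values" comparison can be sketched in Python as follows. The helper names are hypothetical and the sample list is made up; only the maximum of 243 months comes from the query result above.

```python
import heapq


def top_values(values, n=10):
    """Return the n largest values, highest first (like ORDER BY ... DESC LIMIT n)."""
    return heapq.nlargest(n, values)


def gap_to_runner_up(values):
    """Difference between the maximum and the second-highest value."""
    first, second = heapq.nlargest(2, values)
    return first - second


# Made-up account lengths in months; only 243 is taken from the notebook.
sample_lengths = [243, 232, 225, 224, 221, 130, 101, 87]
print(top_values(sample_lengths, 3))
print(gap_to_runner_up(sample_lengths))
print(243 / 12)  # months converted to years
```

A small gap between the maximum and the values just below it suggests the maximum is part of the normal range and not a data error.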

#### Comparing MAX values
Check the highest values for each of the remaining numeric variables.

In [None]:
%%bigquery
SELECT
    number_customer_service_calls,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY number_customer_service_calls DESC
LIMIT 10;

In [None]:
%%bigquery
SELECT
    total_intl_charge,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_intl_charge DESC
LIMIT 10;

In [None]:
%%bigquery
SELECT
    total_intl_calls,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_intl_calls DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_intl_minutes,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_intl_minutes DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_night_charge,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_night_charge DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_night_calls,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_night_calls DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_night_minutes,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_night_minutes DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_eve_charge,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_eve_charge DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_eve_calls,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_eve_calls DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_eve_minutes,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_eve_minutes DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_day_charge,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_day_charge DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_day_calls,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_day_calls DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    total_day_minutes,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_day_minutes DESC
LIMIT 10;


In [None]:
%%bigquery
SELECT
    number_vmail_messages,
    churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY number_vmail_messages DESC
LIMIT 10;


### Minimum values
Check that there are no negative values in any of the INT64 or FLOAT64 columns.

In [None]:
%%bigquery
SELECT 
    churn,
    SUM(IF (account_length < 0, 1, 0)) AS account_length,
    SUM(IF (number_vmail_messages < 0, 1, 0)) AS number_vmail_messages,
    SUM(IF (total_day_minutes < 0, 1, 0)) AS total_day_minutes,
    SUM(IF (total_day_calls < 0, 1, 0)) AS total_day_calls,
    SUM(IF (total_day_charge < 0, 1, 0)) AS total_day_charge,
    SUM(IF (total_eve_minutes < 0, 1, 0)) AS total_eve_minutes,
    SUM(IF (total_eve_calls < 0, 1, 0)) AS total_eve_calls,
    SUM(IF (total_eve_charge < 0, 1, 0)) AS total_eve_charge,
    SUM(IF (total_night_minutes < 0, 1, 0)) AS total_night_minutes,
    SUM(IF (total_night_calls < 0, 1, 0)) AS total_night_calls,
    SUM(IF (total_night_charge < 0, 1, 0)) AS total_night_charge,
    SUM(IF (total_intl_minutes < 0, 1, 0)) AS total_intl_minutes,
    SUM(IF (total_intl_calls < 0, 1, 0)) AS total_intl_calls,
    SUM(IF (total_intl_charge < 0, 1, 0)) AS total_intl_charge,
    SUM(IF (number_customer_service_calls < 0, 1, 0)) AS number_customer_service_calls
FROM `nomadic-ocean-395807.churn_rate.customer_data`
GROUP BY churn;
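
Each `SUM(IF(col < 0, 1, 0))` counts the rows where that column is negative, so every output value should be 0. A rough in-memory equivalent, using the hypothetical helper `negative_counts`, is:

```python
def negative_counts(rows, columns):
    """Return {column: number of rows where that column is below zero}."""
    return {col: sum(1 for row in rows if row[col] < 0) for col in columns}


# Made-up rows; the second one has a negative charge on purpose.
sample_rows = [
    {"total_day_calls": 110, "total_day_charge": 45.07},
    {"total_day_calls": 95, "total_day_charge": -1.50},
]
print(negative_counts(sample_rows, ["total_day_calls", "total_day_charge"]))
```

Unlike the SQL query, this sketch does not split the counts by `churn`.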


## Clean area_code
The `area_code` field has the prefix `area_code_` before the 3-digit area code. For cleaner analysis, remove the prefix and keep only the 3-digit area code.

In [None]:
%%bigquery
UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
SET 
  area_code = RIGHT(area_code, LENGTH(area_code)-10)
WHERE area_code LIKE 'area_code_%';
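
The `-10` in the update is the length of the `area_code_` prefix. The sketch below mirrors the same transformation in Python (the helper name `strip_area_code_prefix` is hypothetical). One difference worth knowing: `_` in a SQL `LIKE` pattern matches any single character, so the SQL filter is slightly broader than the exact `startswith` test used here.

```python
PREFIX = "area_code_"


def strip_area_code_prefix(value):
    """Remove the 'area_code_' prefix if present; leave other values unchanged."""
    if value.startswith(PREFIX):
        return value[len(PREFIX):]
    return value


print(len(PREFIX))
print(strip_area_code_prefix("area_code_415"))
```

Values that were already cleaned pass through unchanged, so applying the function twice is harmless.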



## Consistency check Voicemail plan
Check that accounts without the voicemail plan have 0 voicemail messages.

In [None]:
%%bigquery
SELECT 
    churn,
    SUM(IF(number_vmail_messages > 0, 1, 0)) AS voicemail_count
FROM `nomadic-ocean-395807.churn_rate.customer_data`
WHERE voice_mail_plan = FALSE
GROUP BY churn;
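
The rule being tested is "no voicemail plan implies zero voicemail messages". A minimal Python sketch of the same rule, with the hypothetical helper `inconsistent_voicemail_rows` and made-up rows, returns the offending rows instead of a count:

```python
def inconsistent_voicemail_rows(rows):
    """Return rows that have voicemail messages but no voicemail plan."""
    return [
        row for row in rows
        if not row["voice_mail_plan"] and row["number_vmail_messages"] > 0
    ]


# Made-up rows; the last one violates the rule on purpose.
sample_rows = [
    {"voice_mail_plan": True, "number_vmail_messages": 25},
    {"voice_mail_plan": False, "number_vmail_messages": 0},
    {"voice_mail_plan": False, "number_vmail_messages": 3},
]
print(inconsistent_voicemail_rows(sample_rows))
```

An empty result matches a `voicemail_count` of 0 in the query.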


## Adding a column 'churn_binary'
Create a churn column with binary values for ease of calculation in the analysis

In [None]:
%%bigquery
ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
ADD COLUMN churn_binary INT;


### Update 'churn_binary'
Update column churn_binary based on the churn column.

In [None]:
%%bigquery
UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
SET churn_binary =
          CASE 
               WHEN churn IS false THEN 0
               ELSE 1
          END
WHERE churn_binary IS NULL;
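
One detail of the `CASE` expression is that everything other than `FALSE` falls through to `ELSE 1`, including NULL. Since rows with NULL values were deleted earlier this should not matter here, but the behavior is easy to see in a small Python mirror (the function name `churn_to_binary` is hypothetical):

```python
def churn_to_binary(churn):
    """Mirror of the CASE expression: False maps to 0, anything else maps to 1."""
    return 0 if churn is False else 1


print(churn_to_binary(False))
print(churn_to_binary(True))
print(churn_to_binary(None))  # a missing value would be counted as churned
```

If missing churn values were possible, an explicit `WHEN churn IS TRUE THEN 1` branch would keep them as NULL instead.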


### Select test 'churn_binary'
Run a select to check the churn_binary is entered correctly

In [None]:
%%bigquery
SELECT 
    churn_binary
FROM `nomadic-ocean-395807.churn_rate.customer_data`


## Adding a column 'unique_id'
Each row corresponds to one account. Most account data includes an account ID that uniquely identifies each account. This dataset is missing one, so it needs to be created.

In [None]:
%%bigquery
ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
ADD COLUMN unique_id STRING;


### Update 'unique_id'
Update column unique_id with a uuid for identification of the account

In [None]:
%%bigquery
UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
SET unique_id = generate_uuid()
WHERE unique_id IS NULL;
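
For comparison, the same "fill in only where missing" pattern can be sketched with Python's standard `uuid` module. This is an illustration of the idea, not what BigQuery runs internally; the helper name `assign_unique_ids` is hypothetical.

```python
import uuid


def assign_unique_ids(rows):
    """Give each row a random UUID string unless it already has a unique_id."""
    for row in rows:
        if row.get("unique_id") is None:
            row["unique_id"] = str(uuid.uuid4())
    return rows


rows = assign_unique_ids([{"state": "OH"}, {"state": "NJ", "unique_id": "existing-id"}])
print(rows)
```

As with the `WHERE unique_id IS NULL` filter, rerunning the step leaves existing IDs untouched.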


### Select test 'unique_id'
Run a select to check the unique_id is entered correctly

In [None]:
%%bigquery
SELECT
    unique_id
FROM `nomadic-ocean-395807.churn_rate.customer_data`;


## Adding a column 'total_charges'
This will help streamline high level analysis

In [None]:
%%bigquery
ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
ADD COLUMN total_charges FLOAT64;


### Update 'total_charges'
Add the values as the SUM of the charges on the account

In [None]:
%%bigquery
UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
SET total_charges = total_day_charge + total_eve_charge + total_night_charge + total_intl_charge
WHERE total_charges IS NULL;
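
The total is a row-wise sum of the four charge columns. A small Python sketch with made-up charge values is shown here; the helper name `total_charges` is hypothetical, and the result is rounded because floating-point sums can carry tiny representation errors.

```python
CHARGE_COLUMNS = (
    "total_day_charge",
    "total_eve_charge",
    "total_night_charge",
    "total_intl_charge",
)


def total_charges(row):
    """Sum the four charge columns of one account, rounded to cents."""
    return round(sum(row[col] for col in CHARGE_COLUMNS), 2)


# Made-up charges for a single account.
sample_row = {
    "total_day_charge": 45.07,
    "total_eve_charge": 16.78,
    "total_night_charge": 11.01,
    "total_intl_charge": 2.70,
}
print(total_charges(sample_row))
```

The SQL update stores the unrounded FLOAT64 sum, so small differences in the last decimal places are expected when comparing the two.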


### Select test 'total_charges'
Run a select to check the charges are entered correctly

In [None]:
%%bigquery
SELECT 
    total_charges
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_charges;


## Adding a column 'account_length_group_years'
This is to segment the accounts based on account length. This will help to identify which accounts are at risk of churn.

In [None]:
%%bigquery
ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
ADD COLUMN account_length_group_years STRING;


### Update 'account_length_group_years'
Update column account_length_group_years by segmenting the account length. Five equal segments of 48 months (4 years) each were chosen to keep the representation of each segment simple and accurate.

In [None]:
%%bigquery
UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
SET account_length_group_years =
            CASE 
                WHEN account_length > 192 THEN '16-20'
                WHEN account_length > 144 THEN '12-16'
                WHEN account_length > 96 THEN '8-12'
                WHEN account_length > 48 THEN '4-8'
                ELSE '0-4' 
            END
WHERE account_length_group_years IS NULL;
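
The boundary behavior of the `CASE` expression is easier to check in a small Python mirror (the function name `account_length_group` is hypothetical). Because the comparisons use `>`, an account at exactly 48 months lands in `'0-4'`, and anything above 192 months, including the 243-month maximum, lands in `'16-20'`.

```python
def account_length_group(months):
    """Map an account length in months to its 4-year group label."""
    if months > 192:
        return "16-20"
    if months > 144:
        return "12-16"
    if months > 96:
        return "8-12"
    if months > 48:
        return "4-8"
    return "0-4"


for months in (48, 49, 192, 243):
    print(months, account_length_group(months))
```

Note that these labels sort alphabetically as strings, so `'12-16'` comes before `'4-8'` in an `ORDER BY`.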


### Select test 'account_length_group_years'
Run a select to check the account length groups are entered correctly

In [None]:
%%bigquery
SELECT 
    account_length_group_years
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY account_length_group_years;


## Adding a column 'yrr_group'
This is to segment the accounts based on the total_charges. This will help to identify which accounts are at risk of churn.

In [None]:
%%bigquery
ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
ADD COLUMN yrr_group STRING;


### Update 'yrr_group'
Update column yrr_group by segmenting total_charges. Five equal segments of $20 each were chosen to keep the representation of each segment simple and accurate.

In [None]:
%%bigquery
UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
SET yrr_group =
            CASE 
                WHEN total_charges >= 80 THEN '$80.00 - $99.99'
                WHEN total_charges >= 60 THEN '$60.00 - $79.99'
                WHEN total_charges >= 40 THEN '$40.00 - $59.99'
                WHEN total_charges >= 20 THEN '$20.00 - $39.99'
                ELSE '$00.00 - $19.99'
            END
WHERE yrr_group IS NULL;
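
A Python mirror of this `CASE` expression makes the band edges explicit (the function name `charge_group` is hypothetical). One thing it shows is that the top band has no upper bound in the logic: a total of 100 or more would still be labeled `'$80.00 - $99.99'`, so it is worth confirming that the maximum `total_charges` is below 100.

```python
def charge_group(total):
    """Map a total charge to its $20-wide group label."""
    if total >= 80:
        return "$80.00 - $99.99"
    if total >= 60:
        return "$60.00 - $79.99"
    if total >= 40:
        return "$40.00 - $59.99"
    if total >= 20:
        return "$20.00 - $39.99"
    return "$00.00 - $19.99"


for total in (19.99, 20, 79.99, 80, 120):
    print(total, charge_group(total))
```

The zero-padded `'$00.00'` label keeps the groups in the right order when sorted as strings.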


### Select test 'yrr_group'
Run a select to check the total_charges groups are entered correctly

In [None]:
%%bigquery
SELECT 
    yrr_group
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY yrr_group;
