## Connect BigQuery to VS Code

Before writing any queries, I need to connect BigQuery to VS Code. VS Code is a great all-in-one IDE, and by consolidating my tools there I can keep all of my project work in one place.
This took several steps, including installing the Google Cloud CLI (gcloud), creating a service account, and assigning the correct permissions to the account. Now that my BigQuery account is connected to VS Code, I can get started with my queries.
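
As a rough sketch of a pre-flight check for this setup: one common way the BigQuery client library finds a service account key is the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The helper name `credentials_ready` is hypothetical and not part of the original setup. It only confirms that the variable is set and points at an existing file, so it will report `False` if authentication was configured some other way (for example through the gcloud CLI).

```python
import os
from pathlib import Path


def credentials_ready(env=None):
    """Return True if GOOGLE_APPLICATION_CREDENTIALS points to an existing file.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path:
        return False
    return Path(key_path).is_file()


print("service account key found:", credentials_ready())
```

Running this before creating the client gives a clearer failure than a credentials error raised in the middle of a query.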

# Data Cleaning and Validation
## Loading data and reviewing attributes and record types

- Create a new dataset in BigQuery and upload the table to query.
- Load the whole dataset and its column details for review, to get a better understanding of the data as a whole.
- Use `SELECT *`, the `INFORMATION_SCHEMA.COLUMNS` view, and `COUNT(DISTINCT unique_id)` to review the rows, the column metadata, and the total number of customers.

In [8]:


## Import BigQuery into VS Code

from google.cloud import bigquery
client = bigquery.Client()

## To use BigQuery in VS Code, I import the package and create the client.
## I use `def` to define the `query_and_display` function. Each query is written as a multiline string, query_and_display("""..."""), and passed to the function as its argument.

def query_and_display(sql):
    # Run the SQL in BigQuery and return the result as a pandas DataFrame.
    return client.query(sql).to_dataframe()

In [9]:
query_and_display("""
    SELECT
        COUNT(DISTINCT unique_id) AS sample_size
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
""")

Unnamed: 0,sample_size
0,4250


In [10]:
query_and_display("""
    SELECT
        *
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    LIMIT 20
""")


Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,...,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn,churn_binary,unique_id,total_charges
0,OK,17,408,False,False,0,180.4,121,30.67,228.5,...,99,8.98,14.8,5,4.0,0,False,0,fff4abf1-3c47-46c9-868f-a40f527113d0,63.07
1,CO,119,408,False,False,0,124.3,68,21.13,207.1,...,93,7.08,14.8,1,4.0,0,False,0,03d09aae-5ddf-4741-9b62-e764ce8292f2,49.81
2,NM,20,408,False,False,0,131.5,64,22.36,161.8,...,106,7.14,14.8,5,4.0,0,False,0,496ffed9-d5f6-4b1b-aa89-27e55d139813,47.25
3,SC,1,408,False,False,0,123.8,113,21.05,236.2,...,81,3.29,3.7,2,1.0,0,False,0,2f53a7a7-af51-47b6-b554-4f45fcefd9d0,45.42
4,WY,113,408,False,False,0,158.4,107,26.93,142.2,...,108,7.86,14.9,2,4.02,0,False,0,14ce4c10-c0e9-4abb-bf61-5d9a73a28867,50.9
5,MA,97,408,False,False,0,217.6,81,36.99,320.5,...,110,6.78,4.2,3,1.13,0,False,0,a79032ee-d712-4701-9508-6930aa67b441,72.14
6,MS,76,408,False,False,0,193.0,82,32.81,200.8,...,79,6.12,14.4,3,3.89,0,False,0,bd90f48d-ecc5-43e6-8963-73514ffa6253,59.89
7,RI,39,408,False,False,0,93.3,83,15.86,199.6,...,104,9.28,6.5,4,1.76,0,False,0,c23bee7e-e37f-4590-92bd-fdf15d821cfd,43.87
8,MN,113,408,False,False,0,158.9,137,27.01,242.8,...,97,11.15,6.5,4,1.76,0,False,0,0fe33653-543d-4cac-ab33-95680451e0c8,60.56
9,NH,123,408,False,False,0,224.0,99,38.08,210.7,...,75,10.44,2.1,5,0.57,0,False,0,5013dc26-2a40-426f-ad92-21eebd36e8ff,67.0


In [11]:
query_and_display("""
    SELECT
        *
    FROM nomadic-ocean-395807.churn_rate.INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'customer_data'
""")

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,is_nullable,data_type,is_generated,generation_expression,is_stored,is_hidden,is_updatable,is_system_defined,is_partitioning_column,clustering_ordinal_position,collation_name,column_default,rounding_mode
0,nomadic-ocean-395807,churn_rate,customer_data,state,1,YES,STRING,NEVER,,,NO,,NO,NO,,,,
1,nomadic-ocean-395807,churn_rate,customer_data,account_length,2,YES,INT64,NEVER,,,NO,,NO,NO,,,,
2,nomadic-ocean-395807,churn_rate,customer_data,area_code,3,YES,STRING,NEVER,,,NO,,NO,NO,,,,
3,nomadic-ocean-395807,churn_rate,customer_data,international_plan,4,YES,BOOL,NEVER,,,NO,,NO,NO,,,,
4,nomadic-ocean-395807,churn_rate,customer_data,voice_mail_plan,5,YES,BOOL,NEVER,,,NO,,NO,NO,,,,
5,nomadic-ocean-395807,churn_rate,customer_data,number_vmail_messages,6,YES,INT64,NEVER,,,NO,,NO,NO,,,,
6,nomadic-ocean-395807,churn_rate,customer_data,total_day_minutes,7,YES,FLOAT64,NEVER,,,NO,,NO,NO,,,,
7,nomadic-ocean-395807,churn_rate,customer_data,total_day_calls,8,YES,INT64,NEVER,,,NO,,NO,NO,,,,
8,nomadic-ocean-395807,churn_rate,customer_data,total_day_charge,9,YES,FLOAT64,NEVER,,,NO,,NO,NO,,,,
9,nomadic-ocean-395807,churn_rate,customer_data,total_eve_minutes,10,YES,FLOAT64,NEVER,,,NO,,NO,NO,,,,


### Data types and column names
The data type is correct for each attribute, and the column names are logical.

## Duplicates
Because the dataset provides no unique identifier, duplicates can be identified by comparing several attributes together. It is highly unlikely that two accounts have exactly the same area code and usage.

In [13]:
query_and_display("""
    SELECT 
        COUNT(*) AS duplicates
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    GROUP BY 
        area_code,
        total_day_minutes,
        total_eve_minutes,
        total_night_minutes
    HAVING COUNT(*) > 1
""")

Unnamed: 0,duplicates


No rows share the same values for area_code, total_day_minutes, total_eve_minutes, and total_night_minutes, so it is highly unlikely that there are duplicate accounts.
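
To show what the `GROUP BY ... HAVING COUNT(*) > 1` pattern returns when duplicates do exist, here is a minimal sketch that runs the same query shape locally. It uses Python's built-in `sqlite3` as a stand-in for BigQuery (an assumption for illustration only), and the three rows are made up, not taken from the dataset.

```python
import sqlite3

# Toy rows (made up): the first two share all four key columns.
rows = [
    ("408", 180.4, 228.5, 201.2),
    ("408", 180.4, 228.5, 201.2),
    ("415", 124.3, 207.1, 157.6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_data ("
    "area_code TEXT, total_day_minutes REAL, "
    "total_eve_minutes REAL, total_night_minutes REAL)"
)
conn.executemany("INSERT INTO customer_data VALUES (?, ?, ?, ?)", rows)

# Same grouping as the BigQuery cell, with the key columns shown as well.
duplicate_groups = conn.execute(
    """
    SELECT area_code, total_day_minutes, total_eve_minutes,
           total_night_minutes, COUNT(*) AS duplicates
    FROM customer_data
    GROUP BY area_code, total_day_minutes, total_eve_minutes, total_night_minutes
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicate_groups)
```

Each returned row is one group of suspected duplicates together with its size, so an empty result, as in the notebook, means every combination of the four columns is unique.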

## Attribute length

### 'state' attribute length = 2
The state column should contain only 2-character values; the next query checks that.

In [14]:
query_and_display("""
    SELECT 
        LENGTH(state) AS state_value_length
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    GROUP BY LENGTH(state)
""")

Unnamed: 0,state_value_length
0,2


### 'area_code' attribute length = 3
The area_code column should contain only 3-digit values; the next query checks that.

In [15]:
query_and_display("""
    SELECT
        LENGTH(area_code) AS area_code_value_length
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    GROUP BY LENGTH(area_code)
""")

Unnamed: 0,area_code_value_length
0,3


## NULL values

### Select NULL values

In [17]:
query_and_display("""
    SELECT 
        *
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    WHERE 
        state IS NULL OR
        account_length IS NULL OR
        area_code IS NULL OR
        international_plan IS NULL OR
        voice_mail_plan IS NULL OR
        number_vmail_messages IS NULL OR
        total_day_minutes IS NULL OR
        total_day_calls IS NULL OR
        total_day_charge IS NULL OR
        total_eve_minutes IS NULL OR
        total_eve_calls IS NULL OR
        total_eve_charge IS NULL OR
        total_night_minutes IS NULL OR
        total_night_calls IS NULL OR
        total_night_charge IS NULL OR
        total_intl_minutes IS NULL OR
        total_intl_calls IS NULL OR
        total_intl_charge IS NULL OR
        number_customer_service_calls IS NULL OR
        churn IS NULL
""")

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,...,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn,churn_binary,unique_id,total_charges


### Delete rows with NULL values
One row returned NULL for all values. It is safe to say this is an error, so the row should be deleted.

In [29]:
query_and_display("""
    DELETE FROM `nomadic-ocean-395807.churn_rate.customer_data`
    WHERE 
        state IS NULL OR
        account_length IS NULL OR
        area_code IS NULL OR
        international_plan IS NULL OR
        voice_mail_plan IS NULL OR
        number_vmail_messages IS NULL OR
        total_day_minutes IS NULL OR
        total_day_calls IS NULL OR
        total_day_charge IS NULL OR
        total_eve_minutes IS NULL OR
        total_eve_calls IS NULL OR
        total_eve_charge IS NULL OR
        total_night_minutes IS NULL OR
        total_night_calls IS NULL OR
        total_night_charge IS NULL OR
        total_intl_minutes IS NULL OR
        total_intl_calls IS NULL OR
        total_intl_charge IS NULL OR
        number_customer_service_calls IS NULL OR
        churn IS NULL
""")
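
The SELECT and DELETE above repeat the same 20-column predicate, so the two can drift apart if one is edited. A minimal sketch, assuming a hypothetical helper `null_filter`, builds the predicate once from a column list copied from those cells. Note that `OR` matches rows with at least one NULL, while the optional `require_all=True` switches to `AND` for rows where every listed column is NULL.

```python
TABLE = "`nomadic-ocean-395807.churn_rate.customer_data`"

# Columns copied from the WHERE clauses above.
COLUMNS = [
    "state", "account_length", "area_code", "international_plan",
    "voice_mail_plan", "number_vmail_messages",
    "total_day_minutes", "total_day_calls", "total_day_charge",
    "total_eve_minutes", "total_eve_calls", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls", "churn",
]


def null_filter(columns, require_all=False):
    """Return a predicate matching rows with any (or all) NULL columns."""
    if not columns:
        raise ValueError("at least one column is required")
    joiner = " AND " if require_all else " OR "
    return joiner.join(f"{col} IS NULL" for col in columns)


# Build the predicate once and reuse it in both statements.
any_null = null_filter(COLUMNS)
select_sql = f"SELECT * FROM {TABLE} WHERE {any_null}"
delete_sql = f"DELETE FROM {TABLE} WHERE {any_null}"
```

In the notebook, `select_sql` could be passed to `query_and_display` to review the rows first, and `delete_sql` run only after that review.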

## Outliers

### Max Outliers

Check for max values of columns to ensure nothing is extreme due to an input error. 
Start by checking the MAX account_length.

In [19]:
query_and_display("""
    SELECT 
        MAX(account_length) AS max_account_length 
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
""")

Unnamed: 0,max_account_length
0,243


#### Very high MAX account length
The value is 243 (months), which is about 20 years.
Include the churn column to see whether the data keeps counting months after churn. It doesn't appear to, as the account with 243 months has not churned.
Check the top 10 values to see whether it is an outlier or part of the norm.

In [None]:
query_and_display("""
    SELECT 
        account_length,
        churn
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    ORDER BY account_length DESC
    LIMIT 10
""")

   account_length  churn
0             243  False
1             232  False
2             232  False
3             225   True
4             225  False
5             224  False
6             224   True
7             222   True
8             222  False
9             221  False


#### High MAX account_length not an outlier
The high account_length doesn't appear to be an outlier; this company has some very longstanding customers.
In a real-world scenario, I would double-check this with the customer support manager.

#### Comparing MAX values
Check the highest values for each of the usage variables (calls, minutes, charges, and voicemail messages).

In [20]:
query_and_display("""
SELECT number_customer_service_calls, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY number_customer_service_calls DESC
LIMIT 10;
""")

Unnamed: 0,number_customer_service_calls,churn
0,9,True
1,9,True
2,8,False
3,8,True
4,7,False
5,7,False
6,7,False
7,7,True
8,7,True
9,7,False


In [28]:
query_and_display("""
SELECT total_intl_charge, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_intl_charge DESC
LIMIT 10;
""")

Unnamed: 0,total_intl_charge,churn
0,5.4,True
1,5.32,False
2,5.32,False
3,5.21,False
4,5.18,True
5,5.1,False
6,5.0,False
7,4.97,False
8,4.94,True
9,4.91,False


In [27]:
query_and_display("""
SELECT total_intl_calls, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_intl_calls DESC
LIMIT 10;
""")

Unnamed: 0,total_intl_calls,churn
0,20,True
1,19,False
2,18,False
3,18,False
4,18,False
5,18,True
6,17,False
7,16,False
8,16,False
9,16,False


In [26]:
query_and_display("""
SELECT total_intl_minutes, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_intl_minutes DESC
LIMIT 10;
""")

Unnamed: 0,total_intl_minutes,churn
0,20.0,True
1,19.7,False
2,19.7,False
3,19.3,False
4,19.2,True
5,18.9,False
6,18.5,False
7,18.4,False
8,18.3,True
9,18.2,False


In [None]:
query_and_display("""
SELECT total_night_charge, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_night_charge DESC
LIMIT 10;
""")

In [None]:
query_and_display("""
SELECT total_night_calls, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_night_calls DESC
LIMIT 10;
""")

In [None]:
query_and_display("""
SELECT total_night_minutes, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_night_minutes DESC
LIMIT 10;
""")

In [None]:
query_and_display("""
SELECT total_eve_charge, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_eve_charge DESC
LIMIT 10;
""")

In [None]:
query_and_display("""
SELECT total_eve_calls, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_eve_calls DESC
LIMIT 10;
""")

In [None]:
query_and_display("""
SELECT total_eve_minutes, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_eve_minutes DESC
LIMIT 10;
""")

In [None]:
query_and_display("""
SELECT total_day_charge, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_day_charge DESC
LIMIT 10;
""")

In [21]:
query_and_display("""
SELECT total_day_calls, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_day_calls DESC
LIMIT 10;
""")

Unnamed: 0,total_day_calls,churn
0,165,True
1,160,False
2,160,True
3,158,False
4,158,False
5,157,False
6,157,False
7,156,True
8,156,True
9,156,True


In [None]:
query_and_display("""
SELECT total_day_minutes, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY total_day_minutes DESC
LIMIT 10;
""")

In [None]:
query_and_display("""
SELECT number_vmail_messages, churn
FROM `nomadic-ocean-395807.churn_rate.customer_data`
ORDER BY number_vmail_messages DESC
LIMIT 10;
""")

   number_vmail_messages  churn
0                     52  False
1                     50  False
2                     50  False
3                     49  False
4                     49  False
5                     49  False
6                     48   True
7                     48   True
8                     48  False
9                     48  False
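
The cells above all follow one template: select a column with `churn`, sort descending, and keep 10 rows. As a sketch of how that repetition could be generated instead of typed out, the hypothetical helper `top_n_sql` below builds the query text for each column. The column list is copied from the cells above.

```python
TABLE = "`nomadic-ocean-395807.churn_rate.customer_data`"

# Numeric columns checked one by one in the cells above.
NUMERIC_COLUMNS = [
    "number_customer_service_calls",
    "total_intl_charge", "total_intl_calls", "total_intl_minutes",
    "total_night_charge", "total_night_calls", "total_night_minutes",
    "total_eve_charge", "total_eve_calls", "total_eve_minutes",
    "total_day_charge", "total_day_calls", "total_day_minutes",
    "number_vmail_messages",
]


def top_n_sql(column, n=10, table=TABLE):
    """Build a query returning the n largest values of `column` with churn."""
    return (
        f"SELECT {column}, churn\n"
        f"FROM {table}\n"
        f"ORDER BY {column} DESC\n"
        f"LIMIT {int(n)}"
    )


# One query string per column, keyed by column name.
queries = {col: top_n_sql(col) for col in NUMERIC_COLUMNS}
print(queries["total_day_calls"])
```

In the notebook, looping over `queries` and passing each string to `query_and_display` would reproduce the same checks in a single cell.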


### Minimum values
Check that there are no negative values in any of the INT64 or FLOAT64 columns.

In [25]:
query_and_display("""
    SELECT 
        churn,
        SUM(IF (account_length < 0, 1, 0)) AS account_length,
        SUM(IF (number_vmail_messages < 0, 1, 0)) AS number_vmail_messages,
        SUM(IF (total_day_minutes < 0, 1, 0)) AS total_day_minutes,
        SUM(IF (total_day_calls < 0, 1, 0)) AS total_day_calls,
        SUM(IF (total_day_charge < 0, 1, 0)) AS total_day_charge,
        SUM(IF (total_eve_minutes < 0, 1, 0)) AS total_eve_minutes,
        SUM(IF (total_eve_calls < 0, 1, 0)) AS total_eve_calls,
        SUM(IF (total_eve_charge < 0, 1, 0)) AS total_eve_charge,
        SUM(IF (total_night_minutes < 0, 1, 0)) AS total_night_minutes,
        SUM(IF (total_night_calls < 0, 1, 0)) AS total_night_calls,
        SUM(IF (total_night_charge < 0, 1, 0)) AS total_night_charge,
        SUM(IF (total_intl_minutes < 0, 1, 0)) AS total_intl_minutes,
        SUM(IF (total_intl_calls < 0, 1, 0)) AS total_intl_calls,
        SUM(IF (total_intl_charge < 0, 1, 0)) AS total_intl_charge,
        SUM(IF (number_customer_service_calls < 0, 1, 0)) AS number_customer_service_calls
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    GROUP BY churn
""")

Unnamed: 0,churn,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls
0,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,True,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
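
The same check can be written more compactly with BigQuery's `COUNTIF` aggregate, which counts the rows where a condition is true and should give the same counts as `SUM(IF(..., 1, 0))` here. The sketch below only builds the query text; the helper name `negative_count_sql` is hypothetical, and only three of the columns are listed to keep the example short.

```python
def negative_count_sql(columns, table):
    """Build a query counting negative values per column, split by churn."""
    checks = ",\n    ".join(f"COUNTIF({col} < 0) AS {col}" for col in columns)
    return f"SELECT\n    churn,\n    {checks}\nFROM {table}\nGROUP BY churn"


sql = negative_count_sql(
    ["account_length", "total_day_minutes", "total_day_calls"],
    "`nomadic-ocean-395807.churn_rate.customer_data`",
)
print(sql)
```

Passing the full list of numeric columns would produce one `COUNTIF` per column, matching the layout of the result above.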


## Clean area_code
The area_code field has the prefix `area_code_` before the 3-digit area code. For cleaner analysis, remove the prefix and keep only the 3-digit area code.

In [22]:
query_and_display("""
  UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
  SET 
    area_code = RIGHT(area_code, LENGTH(area_code)-10)
  -- STARTS_WITH avoids LIKE treating '_' as a single-character wildcard
  WHERE STARTS_WITH(area_code, 'area_code_');
""")
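
The `- 10` in the UPDATE comes from the length of the prefix: `area_code_` is 10 characters, so keeping the rightmost `LENGTH(area_code) - 10` characters drops exactly the prefix. A small Python sketch of the same arithmetic, using a hypothetical helper and a sample value in the prefixed format:

```python
PREFIX = "area_code_"


def strip_area_code_prefix(value):
    """Mirror RIGHT(area_code, LENGTH(area_code) - 10) for prefixed values."""
    if value.startswith(PREFIX):
        return value[len(PREFIX):]
    return value  # already clean, like the rows the WHERE clause skips


print(len(PREFIX))                              # 10
print(strip_area_code_prefix("area_code_408"))  # 408
```

The guard on the prefix also makes the operation safe to repeat, which is why rerunning the UPDATE does not truncate values that are already clean.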


## Consistency check: voicemail plan
Check that accounts without the voicemail plan have 0 voicemail messages.

In [24]:
query_and_display("""
    SELECT 
        churn,
        SUM(IF(number_vmail_messages > 0, 1, 0)) voicemail_count,
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    GROUP BY 
        churn,
        voice_mail_plan
    HAVING voice_mail_plan = FALSE
""")

Unnamed: 0,churn,voicemail_count
0,False,0
1,True,0


## Adding a column 'churn_binary'
Create a churn column with binary values for ease of calculation in the analysis.

In [None]:
query_and_display("""
    ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
    ADD COLUMN churn_binary INT;
""")

BadRequest: 400 Column already exists: churn_binary at [3:12]

Location: US
Job ID: 3b8186be-7e48-41fe-9bd7-de71ea93133a
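
The 400 error above is what BigQuery returns when the cell is rerun after the column has already been created. BigQuery's DDL accepts `ADD COLUMN IF NOT EXISTS`, which would make the cell safe to rerun. The sketch below only builds the statement text; the helper name `add_column_sql` is hypothetical.

```python
TABLE = "`nomadic-ocean-395807.churn_rate.customer_data`"


def add_column_sql(table, column, column_type):
    """Build an ALTER TABLE statement that is a no-op if the column exists."""
    return f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {column} {column_type}"


print(add_column_sql(TABLE, "churn_binary", "INT64"))
```

The same pattern would apply to any other `ADD COLUMN` statement that might be executed more than once.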


### Update 'churn_binary'
Update column churn_binary based on the churn column.

In [None]:
query_and_display("""
     UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
     SET churn_binary =
               CASE 
                    WHEN churn IS false THEN 0
                    ELSE 1
               END
     WHERE churn_binary IS NULL
""")

Empty DataFrame
Columns: []
Index: []


### Select test 'churn_binary'
Run a select to check the churn_binary is entered correctly

In [None]:
query_and_display("""
    SELECT 
        churn_binary
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
""")

      churn_binary
0                0
1                0
2                0
3                0
4                0
...            ...
4245             1
4246             1
4247             1
4248             1
4249             1

[4250 rows x 1 columns]


## Adding a column 'unique_id'
Each row corresponds to one account. Most account data would include an account ID that uniquely identifies each account; this dataset has none, so one needs to be created.

In [None]:
query_and_display("""
    ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
    ADD COLUMN unique_id STRING;
""")

BadRequest: 400 Column already exists: unique_id at [3:12]

Location: US
Job ID: 033a1b05-5406-4495-aeb9-64c8f19e06ef


### Update 'unique_id'
Update the unique_id column with a UUID that identifies each account.

In [None]:
query_and_display("""
    UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
    SET unique_id = generate_uuid()
    WHERE unique_id IS NULL
""")

Empty DataFrame
Columns: []
Index: []


### Select test 'unique_id'
Run a select to check the unique_id is entered correctly

In [None]:
query_and_display("""
    SELECT
        unique_id
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
""")

                                 unique_id
0     531743ae-40ab-44ca-b560-962c649e0edb
1     cb236153-a41e-4901-957e-7c756a82805b
2     8b5f377b-b170-4f52-b9b8-30bde6673953
3     9573f255-bad4-4f45-8597-98165409624d
4     20098f96-a5b7-4011-9f52-34ae96c8f93c
...                                    ...
4245  26e7516f-cddb-4324-a71c-4580548001d6
4246  70ecf14b-1ff5-44b1-b8cc-85e5a8a4ae3d
4247  c0c697d1-1a0d-4314-9816-bfcc91d90503
4248  2f37029a-20aa-4cf4-a620-dc0060b0a773
4249  164c5345-4eb0-4e30-b687-184b78975ff4

[4250 rows x 1 columns]
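
Scrolling through 4,250 IDs does not show whether they are all distinct. `GENERATE_UUID()` returns a random (version 4) UUID as a 36-character lowercase string, and Python's `uuid.uuid4()` produces values of the same shape, so the uniqueness check can be sketched locally. The helper `assign_ids` is hypothetical; comparing the list length with the set length is the Python analogue of comparing `COUNT(*)` with `COUNT(DISTINCT unique_id)`.

```python
import uuid


def assign_ids(n):
    """Generate n random version-4 UUID strings, one per account."""
    return [str(uuid.uuid4()) for _ in range(n)]


ids = assign_ids(4250)

# Distinct count equals row count when no ID was assigned twice.
print(len(ids), len(set(ids)))
```

In the notebook, this is the check the first query already performs: `COUNT(DISTINCT unique_id)` returning 4250 matches the 4250 rows in the table.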


## Adding a column 'total_charges'
A single total_charges column per account will help streamline high-level analysis.

In [None]:
query_and_display("""
    ALTER TABLE `nomadic-ocean-395807.churn_rate.customer_data`
    ADD COLUMN total_charges FLOAT64;
""")

BadRequest: 400 Column already exists: total_charges at [3:12]

Location: US
Job ID: 700b2651-ba30-45bd-8a2a-5af6ba0ae24e


### Update 'total_charges'
Set the value to the sum of the four charge columns on the account.

In [None]:
query_and_display("""
    UPDATE `nomadic-ocean-395807.churn_rate.customer_data`
    SET total_charges = total_day_charge + total_eve_charge + total_night_charge + total_intl_charge
    WHERE total_charges IS NULL
""")

Empty DataFrame
Columns: []
Index: []


### Select test 'total_charges'
Run a select to check the charges are entered correctly

In [None]:
query_and_display("""
    SELECT 
        total_charges
    FROM `nomadic-ocean-395807.churn_rate.customer_data`
    ORDER BY total_charges
""")

      total_charges
0             22.93
1             23.25
2             25.87
3             27.02
4             27.08
...             ...
4245          91.40
4246          92.29
4247          92.96
4248          93.39
4249          96.15

[4250 rows x 1 columns]
