
=================================================

Milestone 3

Name  : Nakia Melvana

Batch : FTDS-026-RMT

Objectives: This program validates the Netflix user base dataset using Great Expectations.

=================================================



# Install the Great Expectations Package

In [1]:
# Install the library

!pip install -q great-expectations

# Import Library

In [2]:
from great_expectations.data_context import FileDataContext

# Instantiate Data Context

In [3]:
# Create a data context
context = FileDataContext.create(project_root_dir='./')




**Insight:** The Data Context provides the configuration and methods used by all supporting GX components. The parameter `project_root_dir='./'` means the Data Context is created in the current working directory.

# Connect to a `Datasource`

In [4]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'milestone3.csv'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'netflix_userbase'
path_to_data = 'P2M3_Nakia_Melvana_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

**Insight:**
- Datasource: gives a standard way to access and work with data from different source systems.
- Data Asset: a group of records within a Datasource, usually named after the underlying data system and sliced to match specific requirements.

# Create an Expectation Suite

In [5]:
# Create an expectation suite
expectation_suite_name = 'expectation-netflix-'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using the expectation suite above
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()




| Unnamed: 0 | user_id | subscription_type | monthly_revenue | join_date | last_payment_date | country | age | gender | device | plan_duration |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Basic | 10 | 2022-01-15 | 2023-06-10 | United States | 28 | Male | Smartphone | 1 |
| 1 | 2 | Premium | 15 | 2021-09-05 | 2023-06-22 | Canada | 35 | Female | Tablet | 1 |
| 2 | 3 | Standard | 12 | 2023-02-28 | 2023-06-27 | United Kingdom | 42 | Male | Smart TV | 1 |
| 3 | 4 | Standard | 12 | 2022-07-10 | 2023-06-26 | Australia | 51 | Female | Laptop | 1 |
| 4 | 5 | Basic | 10 | 2023-05-01 | 2023-06-28 | Germany | 33 | Male | Smartphone | 1 |


**Insight:** An Expectation Suite is a collection of testable statements about the data. It groups individual expectations so that together they describe what valid data should look like.

## Expectations

In [6]:
# Expectation 1 : Column `user_id` must be unique

validator.expect_column_values_to_be_unique('user_id')




{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "user_id",
      "batch_id": "milestone3.csv-netflix_userbase"
    },
    "meta": {}
  },
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Insight:** The expectation test for the "user_id" column was successful. The expectation was to ensure that all values in the "user_id" column are unique, and indeed, all 2500 elements in the column meet this criterion. Therefore, the dataset complies with the expectation that each user ID is unique.
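As a rough illustration of what the uniqueness expectation asserts, the plain-Python sketch below counts entries whose value occurs more than once. It is not how Great Expectations computes its result; the function name `count_non_unique` and the sample IDs are made up for this example.

```python
from collections import Counter


def count_non_unique(values):
    """Return how many entries share their value with at least one other entry."""
    counts = Counter(values)
    return sum(n for n in counts.values() if n > 1)


sample_ids = [1, 2, 3, 4, 5]          # hypothetical user_id values, all distinct
duplicated_ids = [1, 2, 2, 3, 3, 3]   # hypothetical values with repeats

print(count_non_unique(sample_ids))      # 0
print(count_non_unique(duplicated_ids))  # 5
```

A count of zero corresponds to the `"unexpected_count": 0` reported above.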

In [7]:
# Expectation 2 : Column `age` must be between 18 and 100 (inclusive)

validator.expect_column_values_to_be_between(
    column='age', min_value=18, max_value=100
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "age",
      "min_value": 18,
      "max_value": 100,
      "batch_id": "milestone3.csv-netflix_userbase"
    },
    "meta": {}
  },
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Insight:** The expectation test for the "age" column was successful. The expectation checks that every value in the "age" column lies between 18 and 100, inclusive, because a Netflix user is assumed to be at least 18 years old. All 2500 values fall within this range, so the dataset complies with the expectation.
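The bounds passed to `expect_column_values_to_be_between` are inclusive by default, so an age of exactly 18 or exactly 100 passes. The sketch below mimics that rule in plain Python; the helper `out_of_range` and the sample ages are hypothetical and only illustrate the comparison.

```python
def out_of_range(values, min_value, max_value):
    """Return the values that fall outside the closed interval [min_value, max_value]."""
    return [v for v in values if not (min_value <= v <= max_value)]


ages = [17, 18, 28, 100, 101]  # hypothetical ages, including both boundary values

print(out_of_range(ages, 18, 100))  # [17, 101]
```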

In [8]:
# Expectation 3 : Column `subscription_type` must contain one of the following 3 things :
# Basic, Premium, Standard

validator.expect_column_values_to_be_in_set('subscription_type', ['Basic', 'Premium', 'Standard'])

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "subscription_type",
      "value_set": [
        "Basic",
        "Premium",
        "Standard"
      ],
      "batch_id": "milestone3.csv-netflix_userbase"
    },
    "meta": {}
  },
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Insight:** The expectation test for the "subscription_type" column was successful. The expectation checks that every value in the "subscription_type" column is one of three types (Basic, Premium, or Standard) because, based on the data, Netflix offers only these three subscription types. Therefore, the dataset complies with the expectation.
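Set membership is an exact string comparison, so a differently cased or misspelled label would count as unexpected. The following plain-Python sketch shows the idea; `unexpected_values` and the sample labels (including "Gold") are invented for illustration and do not come from the dataset.

```python
ALLOWED = {"Basic", "Premium", "Standard"}


def unexpected_values(values, allowed):
    """Return the distinct values that are not members of the allowed set."""
    return sorted({v for v in values if v not in allowed})


# Hypothetical labels: "premium" differs only in case, "Gold" is not a known plan.
labels = ["Basic", "premium", "Standard", "Gold"]

print(unexpected_values(labels, ALLOWED))  # ['Gold', 'premium']
```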

In [18]:
# Expectation 4 : Column `monthly_revenue` must be of type 'int64' or 'float'
validator.expect_column_values_to_be_in_type_list('monthly_revenue', ['int64', 'float'])




{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_type_list",
    "kwargs": {
      "column": "monthly_revenue",
      "type_list": [
        "int64",
        "float"
      ],
      "batch_id": "milestone3.csv-netflix_userbase"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Insight:** The expectation test for the "monthly_revenue" column was successful. The expectation checks that the column's data type is one of the two listed numeric types, `int64` or `float`, and not a string type. The observed type is `int64`, so the dataset complies with the expectation.

In [10]:
# Expectation 5 : Column `plan_duration` must not contain missing values

validator.expect_column_values_to_not_be_null('plan_duration')

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "plan_duration",
      "batch_id": "milestone3.csv-netflix_userbase"
    },
    "meta": {}
  },
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Insight:** The expectation test for the "plan_duration" column was successful. The expectation checks that the "plan_duration" column has no missing values, and none of the 2500 rows is null.
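To make "missing value" concrete: in a CSV file, a missing entry usually shows up as an empty field, which pandas then loads as a null. The sketch below finds such fields with the standard `csv` module. The two-row extract is hypothetical and is not taken from `P2M3_Nakia_Melvana_data_clean.csv`.

```python
import csv
import io

# Hypothetical extract: the second row has an empty plan_duration field.
raw = "user_id,plan_duration\n1,1\n2,\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Collect the user_id of every row whose plan_duration is empty.
missing = [row["user_id"] for row in rows if row["plan_duration"] == ""]

print(missing)  # ['2']
```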

In [11]:
# Expectation 6 : Column `plan_duration` must exist to record each user's Netflix plan

validator.expect_column_to_exist(column='plan_duration')

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "plan_duration",
      "batch_id": "milestone3.csv-netflix_userbase"
    },
    "meta": {}
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Insight:** The expectation test for the "plan_duration" column was successful. The expectation checks that the "plan_duration" column exists, because it is needed to confirm that every user has the correct subscription period. Therefore, the dataset complies with the expectation that the column exists.

In [13]:
# Expectation 7 : The minimum value of column `monthly_revenue` must be between USD 0 and USD 10

validator.expect_column_min_to_be_between('monthly_revenue', 0, 10)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_min_to_be_between",
    "kwargs": {
      "column": "monthly_revenue",
      "min_value": 0,
      "max_value": 10,
      "batch_id": "milestone3.csv-netflix_userbase"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 10
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Insight:** The expectation test for the "monthly_revenue" column was successful. The expectation checks that the smallest value in the "monthly_revenue" column lies between USD 0 and USD 10, inclusive, which matters because this column represents revenue for the company. The observed minimum is USD 10, so the dataset complies with the expectation.
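Note that a range of 0 to 10 accepts any column minimum in that interval, not only USD 10. The plain-Python sketch below shows the difference between that range and the tighter range 10 to 10, which would require the minimum to be exactly USD 10. The helper `column_min_in_range` and the second sample list are hypothetical.

```python
def column_min_in_range(values, min_value, max_value):
    """Return (passed, observed_min) for an inclusive range check on the column minimum."""
    observed = min(values)
    return min_value <= observed <= max_value, observed


revenue = [10, 12, 15]       # the three revenue values visible in the preview above
lower_revenue = [3, 12, 15]  # hypothetical column with a minimum below USD 10

print(column_min_in_range(revenue, 0, 10))         # (True, 10)
print(column_min_in_range(lower_revenue, 0, 10))   # (True, 3)
print(column_min_in_range(lower_revenue, 10, 10))  # (False, 3)
```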

In [14]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)

**Insight:** Set the parameter `discard_failed_expectations` to `False` when saving the suite. Both successful and failed expectations are then stored, instead of the failed ones being discarded.
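The effect of this flag can be pictured with a simplified model. The sketch below is not Great Expectations code: `suite_after_save` is a hypothetical helper, and the failed result is invented, since every expectation in this notebook passed.

```python
def suite_after_save(results, discard_failed_expectations):
    """Return the expectation types that would remain in the saved suite."""
    if discard_failed_expectations:
        return [r["expectation_type"] for r in results if r["success"]]
    return [r["expectation_type"] for r in results]


results = [
    {"expectation_type": "expect_column_values_to_be_unique", "success": True},
    {"expectation_type": "expect_column_to_exist", "success": False},  # hypothetical failure
]

print(suite_after_save(results, discard_failed_expectations=True))
print(suite_after_save(results, discard_failed_expectations=False))
```

With the flag set to `False`, the failed expectation stays in the suite and is checked again on the next validation run.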

# Checkpoint

In [15]:
# Create a checkpoint
checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [16]:
# Run a checkpoint

checkpoint_result = checkpoint_1.run()




**Insight:** Checkpoints offer a convenient way to package the validation of a batch (or batches) of data against one or more expectation suites, along with the actions to be taken afterward.
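One way to think about a checkpoint run is as a roll-up of the individual expectation results. The sketch below is a simplified model that assumes the overall outcome is the conjunction of the individual results. It does not use the actual result object returned by `checkpoint_1.run()`, and the `summarize` helper and the failing entry are hypothetical.

```python
def summarize(results):
    """Roll individual expectation results up into one overall outcome."""
    failed = [r["expectation_type"] for r in results if not r["success"]]
    return {"success": not failed, "evaluated": len(results), "failed": failed}


results = [
    {"expectation_type": "expect_column_values_to_be_unique", "success": True},
    {"expectation_type": "expect_column_values_to_be_between", "success": True},
    {"expectation_type": "expect_column_min_to_be_between", "success": False},  # hypothetical failure
]

print(summarize(results))
```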

# Data Docs

In [17]:
# Build data docs

context.build_data_docs()

{'local_site': 'file:///content/gx/uncommitted/data_docs/local_site/index.html'}

**Insight:** Data Docs translate Expectations, Validation Results, and other metadata into human-readable documentation.
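The dictionary returned by `build_data_docs()` maps a site name to a `file://` URL. A minimal sketch, using only the standard library, of turning that URL into a local path that can be opened or copied elsewhere; the dictionary literal is copied from the output above, and the variable names are illustrative.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname

# Copied from the build_data_docs() output shown above.
docs = {'local_site': 'file:///content/gx/uncommitted/data_docs/local_site/index.html'}

# Strip the file:// scheme and convert the URL path to a filesystem path.
index_path = url2pathname(urlparse(docs['local_site']).path)

print(index_path)
```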