# Getting Started with GX Core

We will learn how to:

- connect to data
- create expectations
- validate data
- review validation results

First, let's import the core library of Great Expectations.

The `great_expectations` module is the root of the GX Core library.
It contains shortcuts and convenience methods for starting a GX project in a Python session.

In [1]:
import great_expectations as gx
import pandas as pd

## Importing the data

For this lab, we will be using a sample of [NYC taxi trip record data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

> Yellow and green taxi trip records include fields capturing *pick-up and drop-off dates/times*, *pick-up and drop-off locations*, *trip distances*, *itemized fares*, *rate types*, *payment types*, and driver-reported *passenger counts*. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).

Let's download and read the sample data into a Pandas DataFrame.

In [2]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
df.head()

| Unnamed: 0 | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | rate_code_id | store_and_fwd_flag | pickup_location_id | dropoff_location_id | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2019-01-15 03:36:12 | 2019-01-15 03:42:19 | 1 | 1.0 | 1 | N | 230 | 48 | 1 | 6.5 | 0.5 | 0.5 | 1.95 | 0.0 | 0.3 | 9.75 | |
| 1 | 1 | 2019-01-25 18:20:32 | 2019-01-25 18:26:55 | 1 | 0.8 | 1 | N | 112 | 112 | 1 | 6.0 | 1.0 | 0.5 | 1.55 | 0.0 | 0.3 | 9.35 | 0.0 |
| 2 | 1 | 2019-01-05 06:47:31 | 2019-01-05 06:52:19 | 1 | 1.1 | 1 | N | 107 | 4 | 2 | 6.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 6.8 | |
| 3 | 1 | 2019-01-09 15:08:02 | 2019-01-09 15:20:17 | 1 | 2.5 | 1 | N | 143 | 158 | 1 | 11.0 | 0.0 | 0.5 | 3.0 | 0.0 | 0.3 | 14.8 | |
| 4 | 1 | 2019-01-25 18:49:51 | 2019-01-25 18:56:44 | 1 | 0.8 | 1 | N | 246 | 90 | 1 | 6.5 | 1.0 | 0.5 | 1.65 | 0.0 | 0.3 | 9.95 | 0.0 |


### Create a **Data Context**

A **Data Context** object serves as the entrypoint for interacting with GX components.

In [3]:
context = gx.get_context()

### Connect to data and create a **Batch**

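Let's define the four objects GX uses to reach the data:

- a **Data Source**: the connection to where the data lives (here, pandas)
- a **Data Asset**: a collection of records within the Data Source (here, a DataFrame)
- a **Batch Definition**: a description of how the Data Asset's records are split into Batches (here, the whole DataFrame)
- a **Batch**: the set of records actually retrieved for validation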

In [4]:
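# Register a pandas Data Source and a DataFrame Data Asset with the context.
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")

# Treat the whole DataFrame as one Batch; the DataFrame itself is supplied at runtime.
batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})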

### Create an **Expectation**

We expect that the contents of the column `passenger_count` consist of values ranging from 1 to 6.

In [5]:
expectation = gx.expectations.ExpectColumnValuesToBeBetween(column="passenger_count", min_value=1, max_value=6)
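To make the semantics concrete, here is a minimal plain-Python sketch of what a "values between" check computes. It is not GX's implementation: the function name `check_between` and the sample list are made up for illustration, and skipping missing values is an assumption of this sketch rather than a statement of GX's exact rules.

```python
from typing import Optional, Sequence


def check_between(
    values: Sequence[Optional[float]],
    min_value: float,
    max_value: float,
) -> dict:
    """Count values that fall outside the inclusive range [min_value, max_value].

    Illustrative only: missing values (None) are skipped, not counted as failures.
    """
    present = [v for v in values if v is not None]
    unexpected = [v for v in present if not (min_value <= v <= max_value)]
    unexpected_percent = 100.0 * len(unexpected) / len(present) if present else 0.0
    return {
        "success": len(unexpected) == 0,
        "element_count": len(values),
        "missing_count": len(values) - len(present),
        "unexpected_count": len(unexpected),
        "unexpected_percent": unexpected_percent,
    }


# Hypothetical passenger counts, not taken from the taxi sample.
sample = [1, 2, 6, 0, None, 7, 3]
summary = check_between(sample, min_value=1, max_value=6)
print(summary)
```

Both bounds are inclusive in this sketch, so the values 1 and 6 pass while 0 and 7 are counted as unexpected.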

### Validate the data batch

Using the expectation you just created, validate the data batch.

In [6]:
validation_result = batch.validate(expectation)

Calculating Metrics: 100%|██████████| 10/10 [00:00<00:00, 503.32it/s]


In [7]:
print(validation_result)

{
  "success": true,
  "expectation_config": {
    "type": "expect_column_values_to_be_between",
    "kwargs": {
      "batch_id": "pandas-pd dataframe asset",
      "column": "passenger_count",
      "min_value": 1.0,
      "max_value": 6.0
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_counts": [],
    "partial_unexpected_index_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
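The printed result is JSON-shaped, so its fields can be inspected programmatically. The sketch below parses an abridged copy of the output above with the standard library; `summarize` is a hypothetical helper name. On the live object, attributes such as `validation_result.success` and `validation_result.result` should expose the same information, though that is worth confirming against the GX version you have installed.

```python
import json

# Abridged copy of the validation result printed above.
raw = """
{
  "success": true,
  "expectation_config": {
    "type": "expect_column_values_to_be_between",
    "kwargs": {"column": "passenger_count", "min_value": 1.0, "max_value": 6.0}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "missing_count": 0
  }
}
"""


def summarize(result: dict) -> str:
    """Build a one-line, human-readable summary of a single validation result."""
    config = result["expectation_config"]
    stats = result["result"]
    status = "PASS" if result["success"] else "FAIL"
    return (
        f"{status}: {config['type']} on column '{config['kwargs']['column']}' "
        f"({stats['unexpected_count']} unexpected of {stats['element_count']} rows)"
    )


line = summarize(json.loads(raw))
print(line)
```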


### Create an **Expectation Suite**

An **Expectation Suite** groups several Expectations so that they can be validated together.

In [8]:
suite = context.suites.add(
    gx.core.expectation_suite.ExpectationSuite(
        name="Taxi trip expectations",
    )
)

In [9]:
# Expectation 1
suite.add_expectation(expectation)

# Expectation 2
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeBetween(column="fare_amount", min_value=0))

# Expectation 3
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="pickup_datetime"))

# Expectation 4
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeOfType(column="payment_type", type_="int"))

# Expectation 5
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="payment_type"))

# Expectation 6
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeInSet(column="payment_type", value_set=[1, 2, 3, 4]))

ExpectColumnValuesToBeInSet(id='378fd399-b1a1-4e0a-893b-a9b287f8e718', meta=None, notes=None, result_format=<ResultFormat.BASIC: 'BASIC'>, description=None, catch_exceptions=True, rendered_content=None, windows=None, batch_id=None, row_condition=None, condition_parser=None, column='payment_type', mostly=1.0, value_set=[1, 2, 3, 4])
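As a rough mental model, the six expectations above behave like a list of per-row checks applied to the same data. The following plain-Python sketch mimics them on the five rows shown by `df.head()` earlier, using only the standard library. The names `CHECKS`, `is_int`, and `run_suite` are made up, the integer check works on CSV text rather than on pandas dtypes, and the outcome on five rows says nothing about how the full sample behaves.

```python
import csv
import io

# The five rows displayed by df.head(), reduced to the columns the suite uses.
HEAD = """passenger_count,fare_amount,pickup_datetime,payment_type
1,6.5,2019-01-15 03:36:12,1
1,6.0,2019-01-25 18:20:32,1
1,6.0,2019-01-05 06:47:31,2
1,11.0,2019-01-09 15:08:02,1
1,6.5,2019-01-25 18:49:51,1
"""

rows = list(csv.DictReader(io.StringIO(HEAD)))


def is_int(text: str) -> bool:
    """Return True if the text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


# Each entry pairs a label with a per-row predicate (a stand-in for an Expectation).
CHECKS = [
    ("passenger_count between 1 and 6", lambda r: 1 <= int(r["passenger_count"]) <= 6),
    ("fare_amount >= 0", lambda r: float(r["fare_amount"]) >= 0),
    ("pickup_datetime not null", lambda r: r["pickup_datetime"] != ""),
    ("payment_type is an integer", lambda r: is_int(r["payment_type"])),
    ("payment_type not null", lambda r: r["payment_type"] != ""),
    ("payment_type in {1, 2, 3, 4}", lambda r: r["payment_type"] in {"1", "2", "3", "4"}),
]


def run_suite(records: list) -> dict:
    """Evaluate every check on every row; a check passes only if all rows pass."""
    return {label: all(check(r) for r in records) for label, check in CHECKS}


results = run_suite(rows)
for label, ok in results.items():
    print(f"{'PASS' if ok else 'FAIL'}  {label}")
```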

### Create a **Validation Definition**

A **Validation Definition** explicitly links the data to be validated (here, identified by its Batch Definition) to the Expectation Suite used to validate it.

In [10]:
validation_definition = context.validation_definitions.add(
    gx.core.validation_definition.ValidationDefinition(name="Validation definition", data=batch_definition, suite=suite)
)

### Create and run a **Checkpoint**

Create and run a Checkpoint to validate the data based on the supplied Validation Definition.

In [11]:
checkpoint = context.checkpoints.add(
    gx.checkpoint.checkpoint.Checkpoint(name="checkpoint", validation_definitions=[validation_definition])
)
checkpoint_result = checkpoint.run(batch_parameters={"dataframe": df})
print(checkpoint_result.describe())

Calculating Metrics: 100%|██████████| 37/37 [00:00<00:00, 654.16it/s]

{
    "success": false,
    "statistics": {
        "evaluated_validations": 1,
        "success_percent": 0.0,
        "successful_validations": 0,
        "unsuccessful_validations": 1
    },
    "validation_results": [
        {
            "success": false,
            "statistics": {
                "evaluated_expectations": 6,
                "successful_expectations": 5,
                "unsuccessful_expectations": 1,
                "success_percent": 83.33333333333334
            },
            "expectations": [
                {
                    "expectation_type": "expect_column_values_to_be_between",
                    "success": true,
                    "kwargs": {
                        "batch_id": "pandas-pd dataframe asset",
                        "column": "passenger_count",
                        "min_value": 1.0,
                        "max_value": 6.0
                    },
                    "result": {
                        "element_count": 10000,
    




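The statistics in the output above are simple ratios: 5 successful expectations out of 6 is roughly 83.33%, and 0 successful validations out of 1 is 0%. The sketch below recomputes them and shows one way to list failing expectations from a dictionary shaped like the `describe()` output. The `description` value is a hand-made toy with hypothetical columns, not the notebook's actual result, and both helper names are made up.

```python
def success_percent(successful: int, evaluated: int) -> float:
    """Percentage of successful items; returns 0.0 when nothing was evaluated
    (a choice made for this sketch)."""
    return 100.0 * successful / evaluated if evaluated else 0.0


def failed_expectations(description: dict) -> list:
    """Collect (expectation_type, column) pairs for every failed expectation."""
    failed = []
    for validation in description["validation_results"]:
        for item in validation["expectations"]:
            if not item["success"]:
                failed.append((item["expectation_type"], item["kwargs"].get("column")))
    return failed


# Toy input with hypothetical columns; NOT the notebook's actual output.
description = {
    "validation_results": [
        {
            "expectations": [
                {
                    "expectation_type": "expect_column_values_to_not_be_null",
                    "success": True,
                    "kwargs": {"column": "col_a"},
                },
                {
                    "expectation_type": "expect_column_values_to_be_between",
                    "success": False,
                    "kwargs": {"column": "col_b"},
                },
            ]
        }
    ]
}

print(round(success_percent(5, 6), 2))
print(failed_expectations(description))
```

Because `describe()` printed JSON above, `json.loads(checkpoint_result.describe())` is one plausible way to obtain such a dictionary, assuming the method returns a JSON string in your installed version.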
---

## References

[1] [Try GX Core](https://docs.greatexpectations.io/docs/core/introduction/try_gx)

[2] [Expectations Gallery](https://greatexpectations.io/expectations/)