# Tutorial 3
In tutorials 1 and 2, we saw how to use Great Expectations as a project framework to validate data. If you don't need
all the features it provides, you can call the validation methods directly on a dataframe.

In [11]:
import great_expectations as ge
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [12]:
file_path="../data/adult_with_duplicates.csv"


df = pd.read_csv(file_path)

# convert pandas dataframe to ge dataframe
df = ge.dataset.PandasDataset(df)
print(df.columns)

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')


In [13]:
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 139.0 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | -12.0 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 2 | NaN | emp-by-pengfei | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 3 | 39.5 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 4 | 39.0 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |


# Apply validation methods directly on the dataframe

The method below checks that the dataframe has exactly the expected column names, in the expected order. It is equivalent to the following expectation config:

```yaml
"expectations": [
    {
      "expectation_type": "expect_table_columns_to_match_ordered_list",
      "kwargs": {
        "column_list": [
          "age",
          "workclass",
          "fnlwgt",
          "education",
          "education-num",
          "marital-status",
          "occupation",
          "relationship",
          "race",
          "sex",
          "capital-gain",
          "capital-loss",
          "hours-per-week",
          "native-country",
          "income"
        ]
      },
      "meta": {}
    }
]
```

In [19]:
# expected column names, in the order they must appear
column_list = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income",
    # "toto",  # uncomment to make the check fail
]
df.expect_table_columns_to_match_ordered_list(column_list=column_list)

{
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": [
      "age",
      "workclass",
      "fnlwgt",
      "education",
      "education-num",
      "marital-status",
      "occupation",
      "relationship",
      "race",
      "sex",
      "capital-gain",
      "capital-loss",
      "hours-per-week",
      "native-country",
      "income"
    ]
  }
}
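To make the semantics of this check concrete, here is a minimal plain-Python sketch of an order-sensitive column comparison. The helper `columns_match_ordered_list` is hypothetical and only imitates the shape of the result shown above; it is not how Great Expectations implements the expectation.

```python
def columns_match_ordered_list(observed, expected):
    """Illustrative only: succeed when both lists hold the same names in the same order."""
    observed = list(observed)
    expected = list(expected)
    return {
        "success": observed == expected,
        "result": {"observed_value": observed},
    }


expected = ["age", "workclass", "fnlwgt"]

# Same names, same order
same = columns_match_ordered_list(["age", "workclass", "fnlwgt"], expected)
# Same names, different order
reordered = columns_match_ordered_list(["workclass", "age", "fnlwgt"], expected)
# An extra expected name that the data does not have (like "toto" above)
extra = columns_match_ordered_list(["age", "workclass", "fnlwgt"], expected + ["toto"])

print(same["success"], reordered["success"], extra["success"])
```

Only the first call succeeds: a reordered or extended list counts as a mismatch.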

The method below checks that every value in the `age` column is between 0 and 120. It is equivalent to the following expectation config:

```yaml
{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "age",
    "max_value": 120.0,
    "min_value": 0.0
  },
  "meta": {}
}
```

In [20]:
# the ge dataframe exposes the validation methods directly

df.expect_column_values_to_be_between(column='age', min_value=0, max_value=120)

{
  "success": false,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 32607,
    "missing_count": 1,
    "missing_percent": 0.0030668261416260316,
    "unexpected_count": 4,
    "unexpected_percent": 0.012267680794945715,
    "unexpected_percent_total": 0.012267304566504126,
    "unexpected_percent_nonmissing": 0.012267680794945715,
    "partial_unexpected_list": [
      139.0,
      -12.0,
      152.0,
      154.0
    ]
  }
}
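The percentage fields in this result can be reproduced from the three counts. The sketch below is an assumption inferred from the numbers printed above, not taken from the library's source; `result_percentages` is a hypothetical helper name.

```python
import math


def result_percentages(element_count, missing_count, unexpected_count):
    """Recompute the percentage fields from the raw counts (inferred formulas)."""
    nonmissing_count = element_count - missing_count
    return {
        # share of rows with no value at all
        "missing_percent": 100.0 * missing_count / element_count,
        # unexpected values relative to all rows
        "unexpected_percent_total": 100.0 * unexpected_count / element_count,
        # unexpected values relative to rows that actually hold a value
        "unexpected_percent_nonmissing": 100.0 * unexpected_count / nonmissing_count,
    }


# Counts copied from the age result above
pct = result_percentages(element_count=32607, missing_count=1, unexpected_count=4)

# Compare against the values reported in the output above
reported = {
    "missing_percent": 0.0030668261416260316,
    "unexpected_percent_total": 0.012267304566504126,
    "unexpected_percent_nonmissing": 0.012267680794945715,
}
for name, value in reported.items():
    print(name, math.isclose(pct[name], value, rel_tol=1e-6))
```

In this output, `unexpected_percent` carries the same value as `unexpected_percent_nonmissing`, so the missing row is excluded from its denominator.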

In [15]:
df.expect_column_values_to_not_be_null("age")

{
  "success": false,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 32607,
    "unexpected_count": 1,
    "unexpected_percent": 0.0030668261416260316,
    "unexpected_percent_total": 0.0030668261416260316,
    "partial_unexpected_list": []
  }
}
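The results so far share one shape: a top-level `success` flag and a `result` block with counts. As a rough sketch, they can be folded into a short pass/fail report. This assumes each result is available as a plain dict with the keys shown above; the `summarize` helper and the labels are hypothetical and not part of Great Expectations.

```python
def summarize(results):
    """Return one report line per check, given plain-dict results keyed by a label."""
    lines = []
    for label, res in results.items():
        status = "PASS" if res["success"] else "FAIL"
        # table-level checks carry no unexpected_count, so default to 0
        unexpected = res["result"].get("unexpected_count", 0)
        lines.append(f"{status} {label} (unexpected: {unexpected})")
    return lines


# Trimmed copies of the outputs shown above
results = {
    "columns match ordered list": {"success": True, "result": {"observed_value": ["age"]}},
    "age between 0 and 120": {"success": False, "result": {"unexpected_count": 4}},
    "age not null": {"success": False, "result": {"unexpected_count": 1}},
}

for line in summarize(results):
    print(line)
```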

In [16]:
values = (
    "Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
    "Local-gov", "State-gov", "Without-pay", "Never-worked",
)
df.expect_column_values_to_be_in_set("workclass", value_set=values)

{
  "success": false,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 32607,
    "missing_count": 1836,
    "missing_percent": 5.630692796025393,
    "unexpected_count": 1,
    "unexpected_percent": 0.003249813135744695,
    "unexpected_percent_total": 0.0030668261416260316,
    "unexpected_percent_nonmissing": 0.003249813135744695,
    "partial_unexpected_list": [
      "emp-by-pengfei"
    ]
  }
}
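Note that the 1836 missing `workclass` values are reported separately and are not counted as unexpected. The plain-Python sketch below imitates that behaviour on a tiny sample. It is an illustration under stated assumptions, not the library's implementation: `values_in_set` is a hypothetical helper, and `None` stands in for a missing value.

```python
def values_in_set(values, value_set):
    """Illustrative only: skip missing values, then flag anything outside the allowed set."""
    allowed = set(value_set)
    nonmissing = [v for v in values if v is not None]
    unexpected = [v for v in nonmissing if v not in allowed]
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(nonmissing),
            "unexpected_count": len(unexpected),
            # this sketch lists every unexpected value, with no truncation
            "partial_unexpected_list": unexpected,
        },
    }


sample = ["State-gov", None, "emp-by-pengfei", "Private"]
outcome = values_in_set(sample, {"State-gov", "Private"})
print(outcome["success"], outcome["result"])
```

If missing values should also fail validation, pair this check with a separate not-null check on the same column, as done for `age` above.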