---

Name : Ogi Hadicahyo

Objective : This program explores the quality and integrity of the data using Great Expectations. The goal is to ensure that the data meets the expectations and requirements for analysis, modeling, and decision making.

---

In [15]:
# Import Libraries
import pandas as pd

In [16]:
# Data Loading
data = pd.read_csv('P2M3_Ogi-Hadicahyo_data_clean.csv')

In [17]:
# Add a 1-based row number column named `index` (stored as strings)
data['index'] = (data.index + 1).astype(str)

In [19]:
# Save DataFrame to CSV file
data.to_csv('P2M3_Ogi-Hadicahyo_data_clean.csv', index=False)

In [2]:
# Install the library

!pip install -q great-expectations

In [20]:
# Create a data context

from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

In [21]:
# Give the Datasource a name. The name must be unique among Datasources.
datasource_name = 'P2M3_Ogi-Hadicahyo_data_raw.csv'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'Milestone 3'
path_to_data = 'P2M3_Ogi-Hadicahyo_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

In [22]:
# Create an expectation suite
expectation_suite_name = 'Milestone 3 Test'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

| | name | host_id | host_identity_verified | host_name | neighbourhood_group | neighbourhood | lat | long | instant_bookable | cancellation_policy | ... | price | service_fee | minimum_nights | number_of_reviews | last_review | reviews_per_month | review_rate_number | calculated_host_listings_count | availability_365 | index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Clean & quiet apt home by the park | 80014485718 | unconfirmed | Madaline | Brooklyn | Kensington | 40.64749 | -73.97237 | False | strict | ... | 966.0 | 193.0 | 10.0 | 9.0 | 2021-10-19 | 0.21 | 4.0 | 6.0 | 286.0 | 1 |
| 1 | Skylit Midtown Castle | 52335172823 | verified | Jenna | Manhattan | Midtown | 40.75362 | -73.98377 | False | moderate | ... | 142.0 | 28.0 | 30.0 | 45.0 | 2022-05-21 | 0.38 | 4.0 | 2.0 | 228.0 | 2 |
| 2 | Entire Apt: Spacious Studio/Loft by central park | 92037596077 | verified | Lyndon | Manhattan | East Harlem | 40.79851 | -73.94399 | False | moderate | ... | 204.0 | 41.0 | 10.0 | 9.0 | 2018-11-19 | 0.1 | 3.0 | 1.0 | 289.0 | 3 |
| 3 | Large Cozy 1 BR Apartment In Midtown East | 45498551794 | verified | Michelle | Manhattan | Murray Hill | 40.74767 | -73.975 | True | flexible | ... | 577.0 | 115.0 | 3.0 | 74.0 | 2019-06-22 | 0.59 | 3.0 | 1.0 | 374.0 | 4 |
| 4 | BlissArtsSpace! | 90821839709 | unconfirmed | Emma | Brooklyn | Bedford-Stuyvesant | 40.68688 | -73.95596 | False | moderate | ... | 1060.0 | 212.0 | 45.0 | 49.0 | 2017-10-05 | 0.4 | 5.0 | 1.0 | 219.0 | 5 |


### Expectation 1 : Column *index* must be unique

In [23]:
# Expectation 1 : Column `index` must be unique

validator.expect_column_values_to_be_unique('index')

{
  "success": true,
  "result": {
    "element_count": 84313,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The output shows that **Expectation 1 is met.** Every row must have a unique value in the index column, and the result reports **no unexpected (duplicate) values.** The column holds 84,313 elements with **no missing values**, so the share of unexpected values is 0.0% both of all elements and of the non-missing elements. Thus, **the index column satisfies the uniqueness requirement.**
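As a rough cross-check of how the result fields relate to each other, the sketch below recomputes a uniqueness summary in plain Python. The helper `uniqueness_summary` and the sample values are hypothetical, and counting every occurrence of a repeated value as unexpected is an assumption about how Great Expectations tallies duplicates.

```python
from collections import Counter


def uniqueness_summary(values):
    """Summarise a uniqueness check over a list of values (None means missing)."""
    non_missing = [v for v in values if v is not None]
    counts = Counter(non_missing)
    # Assumption: every occurrence of a repeated value counts as unexpected.
    unexpected = [v for v in non_missing if counts[v] > 1]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "missing_count": len(values) - len(non_missing),
        "unexpected_percent_nonmissing": (
            100.0 * len(unexpected) / len(non_missing) if non_missing else 0.0
        ),
    }


# Hypothetical samples: one clean, one with a duplicate and a missing value.
clean = uniqueness_summary(["1", "2", "3", "4", "5"])
dirty = uniqueness_summary(["1", "2", "2", "4", None])
```

On the clean sample every count is zero, which matches the shape of the result above. On the second sample the two occurrences of `"2"` are both flagged and the percentage is taken over the four non-missing values.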

### Expectation 2 : Column *review_rate_number* must be between 0 and 5

In [25]:
# Expectation 2 : Column `review_rate_number` must be between 0 and 5 (inclusive)

validator.expect_column_values_to_be_between(
    column='review_rate_number', min_value=0, max_value=5
)

{
  "success": true,
  "result": {
    "element_count": 84313,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The output shows that **the expectation is met.** No values in the review_rate_number column fall outside the specified range, and **the column has no missing values.** All 84,313 ratings therefore lie within the expected range **(0 to 5, inclusive)**. No exception was raised during validation, so nothing in this column needs correcting under this criterion.
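The bounds in this check are inclusive, so a rating of exactly 5.0 passes. The plain-Python sketch below illustrates the difference between an inclusive and an exclusive upper bound. The helper `values_between` is hypothetical; its `strict_max` flag is modelled on the parameter of the same name that `expect_column_values_to_be_between` accepts, and that correspondence should be treated as an assumption.

```python
def values_between(values, min_value, max_value, strict_max=False):
    """Return the values that fall outside the range (hypothetical helper)."""
    unexpected = []
    for v in values:
        too_low = v < min_value
        # With strict_max the upper bound itself is rejected as well.
        too_high = v >= max_value if strict_max else v > max_value
        if too_low or too_high:
            unexpected.append(v)
    return unexpected


# review_rate_number values of the five preview rows shown earlier
ratings = [4.0, 4.0, 3.0, 3.0, 5.0]

inclusive = values_between(ratings, 0, 5)                   # 5.0 is accepted
exclusive = values_between(ratings, 0, 5, strict_max=True)  # 5.0 is rejected
```

With the inclusive bound no value is flagged, while the exclusive variant flags the single 5.0 rating. A true "less than 5" rule would therefore fail on this dataset.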

### Expectation 3 : Column *cancellation_policy* must be one of "strict", "moderate", "flexible"

In [30]:
# Expectation 3 : Column `cancellation_policy` must be one of "strict", "moderate", "flexible"

validator.expect_column_values_to_be_in_set(
    'cancellation_policy', ["strict","moderate","flexible"])

{
  "success": true,
  "result": {
    "element_count": 84313,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The output shows that the cancellation_policy column **passed validation.** Every value belongs to the expected set of cancellation policies, **"strict", "moderate", and "flexible"**, and the column has no missing values. The true value in the success field of the output confirms this, so nothing in this column needs fixing under this criterion.
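To make the membership rule concrete, here is a minimal sketch in plain Python. The helper `not_in_set` is hypothetical, and the two extra values appended in the second call are invented to show what a failing case would list. Note that the comparison is an exact string match, so a differently cased label is treated as unexpected.

```python
ALLOWED = {"strict", "moderate", "flexible"}


def not_in_set(values, allowed):
    """Return the non-missing values that are not members of the allowed set."""
    return [v for v in values if v is not None and v not in allowed]


# cancellation_policy values of the five preview rows shown earlier
policies = ["strict", "moderate", "moderate", "flexible", "moderate"]

unexpected_clean = not_in_set(policies, ALLOWED)
# Invented extra values: a wrongly cased label and an unknown label.
unexpected_dirty = not_in_set(policies + ["Strict", "none"], ALLOWED)
```

The first call returns an empty list, corresponding to an empty `partial_unexpected_list`. The second returns the two invented labels.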

### Expectation 4 : Column *price* must be of type integer or float

In [31]:
# Expectation 4 : Column `price` must be of type integer or float

validator.expect_column_values_to_be_in_type_list('price', ['integer', 'float'])

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The output shows that **the price column passed validation.** The observed data type of the column is **float64**, which is one of the accepted types, and no exception was raised during validation.
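The same property can be inspected directly in pandas, where a column carries a single dtype rather than a type per value. The sketch below is only an illustration: the `sample` frame is hypothetical and holds the five prices from the preview rows rather than the full dataset.

```python
import pandas as pd

# Hypothetical frame holding the price values of the five preview rows.
sample = pd.DataFrame({"price": [966.0, 142.0, 204.0, 577.0, 1060.0]})

# The dtype name is what the expectation reports as observed_value.
observed_dtype = str(sample["price"].dtype)

# True for any integer or float dtype.
is_numeric = pd.api.types.is_numeric_dtype(sample["price"])

# Prices stored as text would not qualify as numeric.
text_prices = pd.Series(["966", "142"])
text_is_numeric = pd.api.types.is_numeric_dtype(text_prices)
```

Here `observed_dtype` is `"float64"`, matching the result above, and the text column is not numeric.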

### Expectation 5 : Column *host_identity_verified* must have a number of unique values between 2 and 4

In [33]:
# Expectation 5 : Column `host_identity_verified` must have a number of unique values between 2 and 4

validator.expect_column_unique_value_count_to_be_between('host_identity_verified', min_value=2, max_value=4)

{
  "success": true,
  "result": {
    "observed_value": 2
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The output shows that the host_identity_verified column **passed validation.** The expectation is that the number of unique values in the column lies **between 2 and 4.** The observed value is **2 unique values**, so **the number of unique values in the host_identity_verified column meets the specified criterion.**
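A minimal plain-Python sketch of this count-based check follows. The helper `unique_value_count_between` is hypothetical, and ignoring missing values when counting is an assumption rather than confirmed library behavior.

```python
def unique_value_count_between(values, min_value, max_value):
    """Check that the number of distinct values lies within inclusive bounds."""
    # Assumption: missing values (None) are not counted as a distinct value.
    observed = len({v for v in values if v is not None})
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }


# host_identity_verified values of the five preview rows shown earlier
hosts = ["unconfirmed", "verified", "verified", "verified", "unconfirmed"]
result = unique_value_count_between(hosts, 2, 4)
```

The lower bound of 2 guards against the column collapsing to a single label, while the upper bound of 4 would catch unexpected extra labels such as misspellings.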

### Expectation 6 : Column *reviews_per_month* must have a median value between 0.5 and 1

In [37]:
# Expectation 6 : Column `reviews_per_month` must have a median value between 0.5 and 1

validator.expect_column_median_to_be_between('reviews_per_month', min_value=0.5, max_value=1)

{
  "success": true,
  "result": {
    "observed_value": 0.74
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The output shows that the expectation expect_column_median_to_be_between **is met**. It requires the median of the reviews_per_month column to lie between a **minimum of 0.5 and a maximum of 1**. In the output, **the observed_value is 0.74.** Since **this value is within the expected range (0.5 to 1), success is true**, indicating that the expectation is met.

### Expectation 7 : Column *price* should have a mean value between 600 and 650

In [41]:
# Expectation 7 : Column `price` should have a mean value between 600 and 650

validator.expect_column_mean_to_be_between('price', min_value=600, max_value=650)

{
  "success": true,
  "result": {
    "observed_value": 626.0427099023875
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The output shows that the price column **passed validation.** The expectation is that the mean of the column lies **between 600 and 650.** The observed value is **about 626.04**, which is within that range, so **the mean of the price column meets the specified criterion.**
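Expectations 6 and 7 both compare a single column-level statistic against inclusive bounds. The sketch below reproduces that pattern in plain Python with the standard `statistics` module. The helper `statistic_between` is hypothetical, and the inputs are only the five preview rows, not the full 84,313-row dataset.

```python
from statistics import fmean, median


def statistic_between(values, statistic, min_value, max_value):
    """Apply a statistic to the values and test it against inclusive bounds."""
    observed = statistic(values)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }


# Values taken from the five preview rows shown earlier.
prices = [966.0, 142.0, 204.0, 577.0, 1060.0]
reviews_per_month = [0.21, 0.38, 0.1, 0.59, 0.4]

mean_check = statistic_between(prices, fmean, 600, 650)
median_check = statistic_between(reviews_per_month, median, 0.5, 1)
```

Both checks fail on these five rows alone, since their mean price and median reviews per month fall below the lower bounds. The bounds describe the column as a whole, so a small slice of the data is not expected to satisfy them.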