### Data Cleaning in Scraping: Building Data Quality Pipelines

Topics Covered
- What Is a Data Quality Pipeline?
- Why Data Quality Fails in Production
- Great Expectations (Deep Theory + Practice)
- Missing Data & Imputation Libraries
- Anomaly Detection Rules (Statistical + Business)

In [1]:
import pandas as pd
import numpy as np
import great_expectations as gx

In [2]:
df = pd.read_csv("cars24-car-price-cleaned-new.csv")
df.head()

,selling_price,km_driven,mileage,engine,max_power,age,make,model,Individual,Trustmark Dealer,Diesel,Electric,LPG,Petrol,Manual,5,>5
0,1.2,120000,19.7,796.0,46.3,11.0,MARUTI,ALTO STD,1,0,0,0,0,1,1,1,0
1,5.5,20000,18.9,1197.0,82.0,7.0,HYUNDAI,GRAND I10 ASTA,1,0,0,0,0,1,1,1,0
2,2.15,60000,17.0,1197.0,80.0,13.0,HYUNDAI,I20 ASTA,1,0,0,0,0,1,1,1,0
3,2.26,37000,20.92,998.0,67.1,11.0,MARUTI,ALTO K10 2010-2014 VXI,1,0,0,0,0,1,1,1,0
4,5.7,30000,22.77,1498.0,98.59,8.0,FORD,ECOSPORT 2015-2021 1.5 TDCI TITANIUM BSIV,0,0,1,0,0,0,1,1,0


In [3]:
context = gx.get_context()
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")

In [4]:
batch_definition = data_asset.add_batch_definition_whole_dataframe("subodh")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

In [5]:
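# NOTE: min_value > max_value, so no value can pass; the bounds look swapped
# (e.g. 10000 to 50000). Left as run so the recorded output below still matches.
exp = gx.expectations.ExpectColumnValuesToBeBetween(
    column="km_driven", min_value=50000, max_value=10000
)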
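
To make the range expectation concrete, here is a minimal pandas-only sketch of the same kind of check. It is not the Great Expectations implementation: `check_between` is a hypothetical helper, the bounds 10000 to 50000 assume the two values in the cell above were swapped, and the series holds only the five `km_driven` values visible in the `df.head()` output. Null handling may also differ from Great Expectations.

```python
import pandas as pd


def check_between(series: pd.Series, min_value: float, max_value: float) -> dict:
    """Hypothetical helper: pandas-only approximation of a range expectation."""
    if min_value > max_value:
        # An inverted range can never be satisfied, so fail loudly
        raise ValueError("min_value cannot be greater than max_value")
    inside = series.between(min_value, max_value)  # inclusive on both ends
    unexpected = series[~inside]
    return {
        "success": bool(inside.all()),
        "unexpected_count": int(len(unexpected)),
        "unexpected_values": unexpected.tolist(),
    }


# The five km_driven values shown in df.head(); bounds are an assumed correction
km_driven = pd.Series([120000, 20000, 60000, 37000, 30000])
report = check_between(km_driven, min_value=10000, max_value=50000)
print(report)
```

Calling `check_between(km_driven, 50000, 10000)` raises a `ValueError`, which makes an inverted range obvious before any data is inspected.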

In [6]:
result = batch.validate(exp)

print(result)

{
  "success": false,
  "expectation_config": {
    "type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "km_driven",
      "min_value": 50000.0,
      "max_value": 10000.0,
      "batch_id": "pandas-pd dataframe asset"
    },
    "meta": {},
    "severity": "critical"
  },
  "result": {},
  "meta": {},
  "exception_info": {
    "MetricConfigurationID(metric_name='column_values.between.condition', metric_domain_kwargs_id='d27ae0ce08579ea2b38454960a72da62', metric_value_kwargs_id='b998ae24ff8b8f7d31b7ea6ed19e3db8')": {
      "exception_traceback": "Traceback (most recent call last):\n  File \"c:\\Users\\HARSH\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\great_expectations\\execution_engine\\execution_engine.py\", line 577, in _process_direct_and_bundled_metric_computation_configurations\n    metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"c:\\Users\\HARSH\\AppData\\Local\\Programs\\Python\\Python3




In [7]:
# Missing Data and Imputation Libraries
import pandas as pd
import numpy as np

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

In [9]:
df = pd.read_csv("cars24-car-price-cleaned-new.csv")
print(df.isnull().sum())

selling_price       0
km_driven           0
mileage             0
engine              0
max_power           0
age                 0
make                0
model               0
Individual          0
Trustmark Dealer    0
Diesel              0
Electric            0
LPG                 0
Petrol              0
Manual              0
5                   0
>5                  0
dtype: int64
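
Since the counts above are all zero, the imputers in the following cells have nothing to fill on this file. To see what median imputation does when values are actually missing, here is a small sketch on a toy frame. The `train` and `test` frames and their values are illustrative assumptions, not rows taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with deliberately missing entries (illustrative values only)
train = pd.DataFrame(
    {"mileage": [19.7, np.nan, 17.0, 20.92], "engine": [796.0, 1197.0, np.nan, 998.0]}
)
test = pd.DataFrame({"mileage": [np.nan, 22.77], "engine": [1498.0, np.nan]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # medians are learned from train only
test_filled = imputer.transform(test)        # the same medians are reused on test

print(imputer.statistics_)  # per-column medians learned from the training frame
print(test_filled)
```

Fitting on the training split and only transforming the test split keeps test-set statistics out of the fill values.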


In [10]:
y = df["selling_price"]
X = df.drop(columns=["selling_price"])

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

num_cols = X_train.select_dtypes(include=["int64", "float64"]).columns
num_imputer_simple = SimpleImputer(strategy="median")

X_train_num_simple = num_imputer_simple.fit_transform(X_train[num_cols])
X_test_num_simple = num_imputer_simple.transform(X_test[num_cols])

In [12]:
cat_cols = X_train.select_dtypes(include=["object"]).columns
cat_imputer = SimpleImputer(strategy="most_frequent")

X_train_cat = cat_imputer.fit_transform(X_train[cat_cols])
X_test_cat = cat_imputer.transform(X_test[cat_cols])
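
When the imputed numeric and categorical blocks are recombined, stacking them into a single NumPy array forces one common dtype and discards the original row index. One possible alternative, sketched here on a toy frame, wraps each block in its own DataFrame and joins them with `pd.concat`. The names `X_demo`, `demo_num_cols`, and `demo_cat_cols` and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a non-default index to show that row labels survive
X_demo = pd.DataFrame(
    {"km_driven": [70000.0, np.nan, 50000.0], "make": ["MARUTI", np.nan, "MARUTI"]},
    index=[10, 11, 12],
).astype({"make": object})
demo_num_cols, demo_cat_cols = ["km_driven"], ["make"]

# Impute each block separately, then wrap it with the original index
num_part = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(X_demo[demo_num_cols]),
    columns=demo_num_cols,
    index=X_demo.index,
)
cat_part = pd.DataFrame(
    SimpleImputer(strategy="most_frequent").fit_transform(X_demo[demo_cat_cols]),
    columns=demo_cat_cols,
    index=X_demo.index,
)

# Column-wise concat keeps the numeric block numeric
combined = pd.concat([num_part, cat_part], axis=1)
print(combined)
print(combined.dtypes)
```

Keeping the numeric block as floats means later steps such as scaling or `describe()` work without casting columns back from `object`.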

In [14]:
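# np.hstack on numeric + string arrays yields object dtype for every column
X_train_simple = pd.DataFrame(
    np.hstack([X_train_num_simple, X_train_cat]),
    columns=list(num_cols) + list(cat_cols)
)

# Use a separate name so the imputed numeric test array is not overwritten
X_test_simple = pd.DataFrame(
    np.hstack([X_test_num_simple, X_test_cat]),
    columns=list(num_cols) + list(cat_cols)
)

X_train_simple, X_test_simple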

(      km_driven mileage  engine max_power   age Individual Trustmark Dealer  \
 0       70000.0   19.01  1461.0    108.45  10.0        1.0              0.0   
 1       35000.0    16.0  2179.0     140.0   7.0        1.0              0.0   
 2       50000.0    13.9  1598.0      92.0  18.0        0.0              0.0   
 3       17000.0    18.9  1197.0      82.0   4.0        0.0              0.0   
 4       34000.0   16.38  1999.0     177.0   7.0        0.0              0.0   
 ...         ...     ...     ...       ...   ...        ...              ...   
 13869   32000.0   22.74   796.0      47.3   8.0        0.0              0.0   
 13870    3944.0   20.51   998.0      67.0   5.0        0.0              1.0   
 13871  101931.0    19.3  1248.0      73.9   9.0        0.0              0.0   
 13872   70333.0    17.8  1497.0     117.3   7.0        0.0              0.0   
 13873   90000.0   12.05  2179.0     120.0   9.0        1.0              0.0   
 
       Diesel Electric  LPG Petrol Man