### Task 1: Measure Data Accuracy using a Trusted Source

**Description**: You have two datasets of product prices: `company_prices.csv` and
`trusted_prices.csv`. Check whether the prices in `company_prices.csv` match the prices in
`trusted_prices.csv`. Assume both files have a `product_id` and a `price` column.

In [1]:
# Write your code from here
import pandas as pd

# Step 1: Create sample data
company_data = {
    'product_id': [101, 102, 103, 104, 105],
    'price': [9.99, 19.99, 29.99, 39.99, 49.99]
}

trusted_data = {
    'product_id': [101, 102, 103, 104, 105],
    'price': [9.99, 18.99, 29.99, 39.00, 49.99]
}

# Convert to DataFrames
company_df = pd.DataFrame(company_data)
trusted_df = pd.DataFrame(trusted_data)

# Step 2: Merge datasets on product_id
merged_df = pd.merge(company_df, trusted_df, on='product_id', suffixes=('_company', '_trusted'))

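# Step 3: Check for price match
# Note: exact equality works here because both columns hold identical literals;
# prices produced by arithmetic may need a tolerance-based comparison instead.
merged_df['match'] = merged_df['price_company'] == merged_df['price_trusted']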

# Step 4: Calculate accuracy
total = len(merged_df)
matches = merged_df['match'].sum()
accuracy = (matches / total) * 100

print("Merged DataFrame:")
print(merged_df)
print(f"\n✅ Price Accuracy: {accuracy:.2f}%")

Merged DataFrame:
   product_id  price_company  price_trusted  match
0         101           9.99           9.99   True
1         102          19.99          18.99  False
2         103          29.99          29.99   True
3         104          39.99          39.00  False
4         105          49.99          49.99   True

✅ Price Accuracy: 60.00%
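
The cell above builds its data inline and uses an inner merge, so a product missing from either file would silently drop out of the accuracy figure. As a minimal sketch of how the same check might look when reading CSV input, the block below uses in-memory strings as hypothetical stand-ins for `company_prices.csv` and `trusted_prices.csv`, an outer merge with `indicator=True` to surface products found in only one file, and a half-cent tolerance. The sample rows and the tolerance value are assumptions, not part of the original exercise.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory stand-ins for company_prices.csv and trusted_prices.csv
company_csv = io.StringIO("product_id,price\n101,9.99\n102,19.99\n103,29.99\n106,5.00\n")
trusted_csv = io.StringIO("product_id,price\n101,9.99\n102,18.99\n103,29.99\n104,39.00\n")

company_df = pd.read_csv(company_csv)
trusted_df = pd.read_csv(trusted_csv)

# Outer merge keeps products that appear in only one file;
# the `_merge` column records where each row came from
merged_df = pd.merge(
    company_df,
    trusted_df,
    on="product_id",
    how="outer",
    suffixes=("_company", "_trusted"),
    indicator=True,
)

# Only products present in both files can be compared
both = merged_df[merged_df["_merge"] == "both"].copy()

# Assumed tolerance of half a cent guards against floating-point noise
both["match"] = np.isclose(
    both["price_company"], both["price_trusted"], rtol=0, atol=0.005
)

accuracy = both["match"].mean() * 100
unmatched_ids = merged_df.loc[merged_df["_merge"] != "both", "product_id"].tolist()

print(f"Accuracy on shared products: {accuracy:.2f}%")
print(f"Products present in only one file: {sorted(unmatched_ids)}")
```

Reporting the unmatched products separately keeps the accuracy percentage honest: it describes only the products that could actually be compared.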


### Task 2: Detect Incorrect Values

**Description**: In `company_prices.csv`, detect any negative prices, which are invalid values for a price.

In [2]:
# Write your code from here
import pandas as pd

# Simulated company_prices data
company_data = {
    'product_id': [201, 202, 203, 204, 205],
    'price': [15.99, -20.00, 35.50, -5.25, 10.00]  # some negative prices
}

# Convert to DataFrame
company_df = pd.DataFrame(company_data)

# Detect negative prices
invalid_prices_df = company_df[company_df['price'] < 0]

# Display the result
print("🔍 Products with Incorrect (Negative) Prices:")
print(invalid_prices_df)

# Optional: Count of invalid values
print(f"\n🚫 Total Incorrect Values Detected: {len(invalid_prices_df)}")


🔍 Products with Incorrect (Negative) Prices:
   product_id  price
1         202 -20.00
3         204  -5.25

🚫 Total Incorrect Values Detected: 2
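
Negative values are the only rule the task asks for, but the same filtering pattern extends to other kinds of incorrect entries. The sketch below is one possible extension under stated assumptions: prices arrive as strings (as they might from a messy CSV), unparseable entries are coerced to `NaN` with `pd.to_numeric(..., errors="coerce")`, and zero prices are treated as suspect. The sample data, the `price_num` and `issue` columns, and the zero-price rule are all hypothetical.

```python
import pandas as pd

# Hypothetical raw data: prices stored as strings, including one unparseable entry
raw_df = pd.DataFrame({
    "product_id": [201, 202, 203, 204, 205],
    "price": ["15.99", "-20.00", "N/A", "0", "10.00"],
})

# Coerce unparseable entries such as "N/A" to NaN instead of raising an error
raw_df["price_num"] = pd.to_numeric(raw_df["price"], errors="coerce")

# Label each row with the reason it fails validation (empty string means valid)
raw_df["issue"] = ""
raw_df.loc[raw_df["price_num"].isna(), "issue"] = "not numeric"
raw_df.loc[raw_df["price_num"] < 0, "issue"] = "negative"
raw_df.loc[raw_df["price_num"] == 0, "issue"] = "zero"  # assumption: zero is suspect

invalid_df = raw_df[raw_df["issue"] != ""]
print(invalid_df[["product_id", "price", "issue"]])
```

Keeping the reason alongside each flagged row makes it easier to decide later whether a value should be corrected, dropped, or sent back to the data owner.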


### Task 3: Check Missing Data Rates

**Description**: Calculate the percentage of missing values in `customer_data.csv`.

In [3]:
# Write your code from here
import pandas as pd
import numpy as np

# Step 1: Simulated customer data with missing values
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', None, 'David', 'Eva'],
    'email': ['alice@example.com', None, 'charlie@example.com', 'david@example.com', None],
    'age': [25, np.nan, 30, 45, np.nan]
}

customer_df = pd.DataFrame(data)

# Step 2: Calculate missing data rates
missing_percent = customer_df.isnull().mean() * 100

# Step 3: Display results
print("📊 Missing Data Rate (in %):")
print(missing_percent.round(2))

📊 Missing Data Rate (in %):
customer_id     0.0
name           20.0
email          40.0
age            40.0
dtype: float64
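
Per-column rates are one view of completeness. As a rough sketch, the same boolean mask from `isnull()` can also give the share of missing cells overall and the share of rows with at least one gap. The block reuses the simulated customer data from the cell above; the 30% flagging threshold is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

# Same simulated customer data as in the cell above
customer_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", None, "David", "Eva"],
    "email": ["alice@example.com", None, "charlie@example.com", "david@example.com", None],
    "age": [25, np.nan, 30, 45, np.nan],
})

missing_mask = customer_df.isnull()

# Share of all cells in the table that are missing
overall_rate = missing_mask.to_numpy().mean() * 100

# Share of rows that have at least one missing value
incomplete_row_rate = missing_mask.any(axis=1).mean() * 100

# Columns whose missing rate exceeds an assumed 30% threshold
threshold = 30.0
column_rates = missing_mask.mean() * 100
flagged_columns = column_rates[column_rates > threshold].index.tolist()

print(f"Overall missing rate: {overall_rate:.2f}%")
print(f"Rows with any missing value: {incomplete_row_rate:.2f}%")
print(f"Columns above {threshold:.0f}% missing: {flagged_columns}")
```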


### Task 4: Handling Partially Available Records

**Description**: In `customer_data.csv`, identify records with a missing "email" or "phone number" and decide whether to drop or fill them.

In [4]:
# Write your code from here
import pandas as pd
import numpy as np

# Step 1: Simulate customer_data with missing emails/phone numbers
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'email': ['alice@example.com', None, 'charlie@example.com', 'david@example.com', None],
    'phone_number': [None, '123-456-7890', '555-555-5555', None, '888-888-8888']
}

df = pd.DataFrame(data)

# Step 2: Identify records with missing email or phone number
partial_records = df[df['email'].isnull() | df['phone_number'].isnull()]

print("🔍 Partially Available Records (missing email or phone number):")
print(partial_records)

# Step 3: Decide how to handle — Option A: Drop, Option B: Fill

# ----- Option A: Drop records with missing email or phone -----
dropped_df = df.dropna(subset=['email', 'phone_number'])
print("\n🗑️ Records After Dropping Incomplete Ones:")
print(dropped_df)

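# ----- Option B: Fill missing values with placeholder -----
filled_df = df.copy()
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which is deprecated chained assignment in recent pandas versions
filled_df['email'] = filled_df['email'].fillna('no-email@domain.com')
filled_df['phone_number'] = filled_df['phone_number'].fillna('000-000-0000')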
print("\n🛠️ Records After Filling Missing Values:")
print(filled_df)

🔍 Partially Available Records (missing email or phone number):
   customer_id   name              email  phone_number
0            1  Alice  alice@example.com          None
1            2    Bob               None  123-456-7890
3            4  David  david@example.com          None
4            5    Eva               None  888-888-8888

🗑️ Records After Dropping Incomplete Ones:
   customer_id     name                email  phone_number
2            3  Charlie  charlie@example.com  555-555-5555

🛠️ Records After Filling Missing Values:
   customer_id     name                email  phone_number
0            1    Alice    alice@example.com  000-000-0000
1            2      Bob  no-email@domain.com  123-456-7890
2            3  Charlie  charlie@example.com  555-555-5555
3            4    David    david@example.com  000-000-0000
4            5      Eva  no-email@domain.com  888-888-8888
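
Options A and B apply one treatment to every incomplete record. One possible middle ground, sketched below, is to drop a record only when it has no contact information at all and otherwise keep it, flagging which fields were originally missing. The rule itself, the `*_missing` flag columns, the `"unknown"` marker, and the modified sample data (customer 4 has neither email nor phone) are assumptions for illustration, not part of the original exercise.

```python
import pandas as pd

# Hypothetical variant of the sample data: customer 4 has no contact details at all
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "email": ["alice@example.com", None, "charlie@example.com", None, None],
    "phone_number": [None, "123-456-7890", "555-555-5555", None, "888-888-8888"],
})

contact_cols = ["email", "phone_number"]

# Drop a record only if every contact field is missing
reachable_df = df.dropna(subset=contact_cols, how="all").copy()

# Record which fields were originally missing before filling them
for col in contact_cols:
    reachable_df[f"{col}_missing"] = reachable_df[col].isna()

# Fill with an explicit marker rather than a realistic-looking fake value
reachable_df[contact_cols] = reachable_df[contact_cols].fillna("unknown")

print(reachable_df)
```

An explicit marker such as `"unknown"` is harder to mistake for real data than a placeholder like `000-000-0000`, and the flag columns preserve the information that a value was absent.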
