# Day 1, Practice 1: Customers

In [2]:
import pandas as pd
df = pd.read_csv('customer_contacts.csv')

# Task 1: Inspect the data

In [8]:
# Question 1: How many rows and columns?
print(f"The DataFrame has {df.shape[0]} rows, and {df.shape[1]} columns.")
df.shape  # Returns (rows, columns) as a tuple

The DataFrame has 200 rows, and 6 columns.


(200, 6)

In [9]:
# Question 2: What are the data types?
print("The data types of each feature are as follows:")
df.dtypes

The data types of each feature are as follows:


customer_id     int64
first_name     object
last_name      object
email          object
phone          object
company        object
dtype: object

In [19]:
# Question 3: How many missing values per column?
df.info()
print('\n',"="*50,"\n")
for column in df.columns:
    print(f"{column} has {df[column].isnull().sum()} missing values.")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   customer_id  200 non-null    int64 
 1   first_name   200 non-null    object
 2   last_name    200 non-null    object
 3   email        186 non-null    object
 4   phone        164 non-null    object
 5   company      200 non-null    object
dtypes: int64(1), object(5)
memory usage: 9.5+ KB


customer_id has 0 missing values.
first_name has 0 missing values.
last_name has 0 missing values.
email has 14 missing values.
phone has 36 missing values.
company has 0 missing values.


In [20]:
# Question 4: Print first 10 rows.

print("Printing first 10 rows:\n")
df.head(10)

Printing first 10 rows:



Unnamed: 0,customer_id,first_name,last_name,email,phone,company
0,1,WILLIAM,Johnson,william.johnson@yahoo.com,(892) 958-9935,Apple Inc.
1,2,James,Brown,james.brown@business.net,484.206.3615,INTEL CORP
2,3,Linda,garcia,linda.garcia@gmail.com,,Salesforce.com
3,4,Richard,Jones,richard.jones@gmail.com,5733667065,Microsoft Corp
4,5,Michael,rodriguez,michael.rodriguez@company.com,8559044598,Intel Corp
5,6,David,Johnson,david.johnson@business.net,871.711.7482,Amazon.com Inc
6,7,ROBERT,Martinez,robert.martinez@yahoo.com,248-312-3504,Netflix Inc
7,8,James,Jones,james.jones@gmail.com,7499685371,Oracle Corporation
8,9,David,Rodriguez,david.rodriguez@gmail.com,854-719-4258,Cisco Systems
9,10,MARY,Garcia,,(571) 514-4923,Adobe Systems


# TASK 1: Inspection Summary

## Dataset Overview
- **Size:** 200 rows Ã— 6 columns
- **Data Types:** 1 int (customer_id), 5 object (text fields)

## Data Quality Issues Found

### 1. Missing Values
- Email: 14 missing (7%)
- Phone: 36 missing (18%)

### 2. Name Inconsistencies
- First names: Mix of UPPERCASE, Title Case
- Last names: Mix of lowercase, Title Case, UPPERCASE

### 3. Phone Format Chaos
- Format 1: (XXX) XXX-XXXX
- Format 2: XXX.XXX.XXXX
- Format 3: XXXXXXXXXX (no formatting)
- Plus 18% completely missing

### 4. Company Name Variations
- Same company, different cases: "INTEL CORP" vs "Intel Corp"
- Will need standardization

### 5. Email Issues (to investigate)
- Some might have extra whitespace
- Case inconsistencies (some ALL CAPS)

# Task 2: Handle missing emails

In [28]:
# Step 1 & 2: How many records have missing email?/What % is that?

missing_emails = df['email'].isnull().sum()
missing_email_pct = (missing_emails/len(df)) * 100

print(f"Missing emails: {missing_emails} ({missing_email_pct:.1f}%)")

Missing emails: 14 (7.0%)


In [31]:
# Step 3: Decide: drop rows, fill with placeholder, or keep?
# DECISION: Keep as NaN (don't drop, don't fill with fake data)
# They may have phone numbers as contact, and still valuable customer data, can work around missing value

# Step 4: Implement Decision
# Add a flag column for easy filtering later
df['has_email'] = df['email'].notna()

# Step 5: Verify
print(f"\nRecords with email: {df['has_email'].sum()}")
print(f"Records without email: {(~df['has_email']).sum()}")


Records with email: 186
Records without email: 14


# TASK 2: Handle Missing Emails - Summary

## Analysis
- **Missing emails:** 14 out of 200 (7.0%)
- **Records with email:** 186 (93.0%)

## Decision Made
**KEEP AS NaN** (Do not drop rows, do not fill with fake data)

### Reasoning:
1. **Customer records are valuable** - Can't delete customers just because one field is missing
2. **Alternative contact methods exist** - They likely have phone numbers
3. **Data integrity** - NaN clearly indicates "we don't have this" vs fake placeholder
4. **Prevents errors** - No risk of accidentally emailing fake addresses
5. **Easy to handle** - Can filter with `df[df['email'].notna()]` when needed

## Implementation
- Added `has_email` flag column for easy filtering
- Kept original email column as-is (with NaN for missing)
- Can now easily query: "Show me all customers WITH email" or "Show customers WITHOUT email"

## Verification
âœ… 186 records have email  
âœ… 14 records flagged as missing email  
âœ… No data loss - all 200 customer records retained

# Task 3: Clean phone numbers

In [35]:
# Step 1: Print unique phone number formats you see
print("Sample phone numbers (first 30 records):")
print(df[['customer_id', 'phone']].head(30))


Sample phone numbers (first 30 records):
    customer_id           phone
0             1  (892) 958-9935
1             2    484.206.3615
2             3             NaN
3             4      5733667065
4             5      8559044598
5             6    871.711.7482
6             7    248-312-3504
7             8      7499685371
8             9    854-719-4258
9            10  (571) 514-4923
10           11    740-821-7932
11           12             NaN
12           13    335-940-8744
13           14      3002627596
14           15    753-215-2528
15           16    769-877-8973
16           17             NaN
17           18      5234674346
18           19    418-718-5345
19           20    506-879-2697
20           21      2522947939
21           22             NaN
22           23             NaN
23           24      8859525065
24           25  (608) 536-5564
25           26             NaN
26           27      7519034228
27           28    767.330.4143
28           29    870.529.8618

In [36]:
# Step 2: Standardize all to format: (XXX) XXX-XXXX (doing format XXXXXXXXXX instead)
# But first, regex practice

import re  # Regular expressions library

# Test on one phone number first
test_phone = "(892) 958-9935"

# This regex means: "find all digits (0-9)"
digits_only = re.sub(r'[^0-9]', '', test_phone)
print(f"Original: {test_phone}")
print(f"Digits only: {digits_only}")

Original: (892) 958-9935
Digits only: 8929589935


In [38]:
# Starting with our digits_only from before
digits_only = "8929589935"

# Strict slicing to break into parts
area_code = digits_only[0:3]   # First three digits: "892"
prefix = digits_only[3:6]      # Next three digits: "958"
line = digits_only[6:10]       # Last 4 digits: "9935"

# Combine with formatting
formatted = f"({area_code}) {prefix}-{line}"

print(f"Digits: {digits_only}")
print(f"Formatted: {formatted}")

Digits: 8929589935
Formatted: (892) 958-9935


In [42]:
def clean_phone_number(phone):
    """
    Standardize phone number to (XXX) XXX-XXXX format.

    Args:
        phone: Phone number in any format (or NaN).

    Returns:
        Formatted phone as (XXX) XXX-XXXX, or NaN if missing
    """
    # Handle missing values
    if pd.isna(phone):
        return None    # Keep as NaN

    # Extract only digits
    digits_only = re.sub(r'[^0-9]', '', phone)

    # Check if we have exactly 10 digits
    if len(digits_only) != 10:
        return None    # Invalild phone number

    # Format as (XXX) XXX-XXXX
    area_code = digits_only[0:3]
    prefix = digits_only[3:6]
    line = digits_only[6:10]

    formatted = f"({area_code}) {prefix}-{line}"

    return formatted

# Test on a few examples
test_phones = [
    "(892) 958-9935",
    "484.206.3615",
    "5733667065",
    "248-312-3504",
    None   # Test NaN
]

print("Testing phone cleaning function:")
for phone in test_phones:
    cleaned = clean_phone_number(phone)
    print(f"{phone} â†’ {cleaned}")
        

Testing phone cleaning function:
(892) 958-9935 â†’ (892) 958-9935
484.206.3615 â†’ (484) 206-3615
5733667065 â†’ (573) 366-7065
248-312-3504 â†’ (248) 312-3504
None â†’ None


In [45]:
# Apply function to entire phone column
print("Cleaning all phone numbers...")
df['phone_clean'] = df['phone'].apply(clean_phone_number)

# Verify it worked
print(f"\nBefore and after comparison (first 20 rows):")
print(df[['customer_id', 'phone', 'phone_clean']].head(20))

Cleaning all phone numbers...

Before and after comparison (first 20 rows):
    customer_id           phone     phone_clean
0             1  (892) 958-9935  (892) 958-9935
1             2    484.206.3615  (484) 206-3615
2             3             NaN            None
3             4      5733667065  (573) 366-7065
4             5      8559044598  (855) 904-4598
5             6    871.711.7482  (871) 711-7482
6             7    248-312-3504  (248) 312-3504
7             8      7499685371  (749) 968-5371
8             9    854-719-4258  (854) 719-4258
9            10  (571) 514-4923  (571) 514-4923
10           11    740-821-7932  (740) 821-7932
11           12             NaN            None
12           13    335-940-8744  (335) 940-8744
13           14      3002627596  (300) 262-7596
14           15    753-215-2528  (753) 215-2528
15           16    769-877-8973  (769) 877-8973
16           17             NaN            None
17           18      5234674346  (523) 467-4346
18          

In [48]:
# Count how many were successfully cleaned
print(f"\nOriginal non-null phones: {df['phone'].notna().sum()}")
print(f"Cleaned non-null phones: {df['phone_clean'].notna().sum()}")
print(f"Missing phones (kept as NaN): {df['phone_clean'].isnull().sum()}")


Original non-null phones: 164
Cleaned non-null phones: 164
Missing phones (kept as NaN): 36


# TASK 3: Clean Phone Numbers - Summary

## Phone Format Analysis
Found 5 different formats in the data:
1. `(XXX) XXX-XXXX` - parentheses with dash
2. `XXX.XXX.XXXX` - dots
3. `XXXXXXXXXX` - no formatting (10 digits)
4. `XXX-XXX-XXXX` - dashes only
5. `NaN` - missing values

## Standardization Strategy
**Target format:** `(XXX) XXX-XXXX`

### Cleaning Process:
1. Extract only digits using regex: `re.sub(r'[^0-9]', '', phone)`
2. Validate exactly 10 digits
3. Split into parts using string slicing:
   - Area code: `digits[0:3]`
   - Prefix: `digits[3:6]`
   - Line: `digits[6:10]`
4. Format as: `f"({area_code}) {prefix}-{line}"`

## Results
- **Original non-null phones:** 164 (82%)
- **Successfully cleaned:** 164 (100% of non-null)
- **Missing phones:** 36 (18%) - kept as NaN

## Decision: Missing Values
**KEPT AS NaN** - Same reasoning as email field. Customer records remain valuable even without phone numbers.

## New Skills Learned
âœ… Regular expressions (regex) for pattern matching  
âœ… String slicing for extracting substrings  
âœ… `.apply()` function to transform entire column  
âœ… Creating reusable cleaning functions

# Task 4: Standardize names

In [54]:
# Step 1: Remove extra whitespace (leading, trailing, multiple spaces)

print("Cleaning whitespace from names...")

df['first_name'] = df['first_name'].str.strip()
df['last_name'] = df['last_name'].str.strip()

# Verify it worked
print("\nVerify it worked:")
print(df[['first_name', 'last_name']].head(20))

Cleaning whitespace from names...

Verify it worked:
   first_name  last_name
0     WILLIAM    Johnson
1       James      Brown
2       Linda     garcia
3     Richard      Jones
4     Michael  rodriguez
5       David    Johnson
6      ROBERT   Martinez
7       James      Jones
8       David  Rodriguez
9        MARY     Garcia
10    Richard  Rodriguez
11    WILLIAM     Wilson
12       John     garcia
13     Robert     Miller
14       Lisa      Brown
15     ROBERT      Jones
16       John   Martinez
17       Mary   Martinez
18    Richard    Johnson
19   Jennifer      Jones


In [57]:
# Step 2: Standardize case(Title case for names)

print("Standardizing capitalization for names...")
df['first_name'] = df['first_name'].str.title()
df['last_name'] = df['last_name'].str.title()

# Verify it worked
print("\nVerify it worked:")
print(df[['customer_id', 'first_name', 'last_name']].head(20))

Standardizing capitalization for names...

Verify it worked:
    customer_id first_name  last_name
0             1    William    Johnson
1             2      James      Brown
2             3      Linda     Garcia
3             4    Richard      Jones
4             5    Michael  Rodriguez
5             6      David    Johnson
6             7     Robert   Martinez
7             8      James      Jones
8             9      David  Rodriguez
9            10       Mary     Garcia
10           11    Richard  Rodriguez
11           12    William     Wilson
12           13       John     Garcia
13           14     Robert     Miller
14           15       Lisa      Brown
15           16     Robert      Jones
16           17       John   Martinez
17           18       Mary   Martinez
18           19    Richard    Johnson
19           20   Jennifer      Jones


In [61]:
# Step 3: Check for any weird characters
print("Checking for any weird characters in names...")

# Find any names with numbers or special characters
weird_first = df[df['first_name'].str.contains(r'[^a-zA-z\s]', na=False)]
weird_last = df[df['last_name'].str.contains(r'[^a-zA-Z\s]', na=False)]

print(f"\nFirst names with weird characters: {len(weird_first)}")
if len(weird_first) > 0:
    print(weird_first[['customer_id', 'first_name']])
    
print(f"Last names with weird characters: {len(weird_last)}")
if len(weird_last) > 0:
    print(weird_last[['customer_id', 'last_name']])

if len(weird_first) == 0 and len(weird_last) == 0:
    print(f"\nNo weird characters found - all names are clean!")

Checking for any weird characters in names...

First names with weird characters: 0
Last names with weird characters: 0

No weird characters found - all names are clean!


# TASK 4: Standardize Names - Summary

## Issues Found
- **Whitespace:** Extra leading/trailing spaces in names
- **Capitalization:** Mix of UPPERCASE, lowercase, and Title Case
- **Weird characters:** None found (validation passed)

## Cleaning Steps

### 1. Remove Whitespace
```python
df['first_name'] = df['first_name'].str.strip()
df['last_name'] = df['last_name'].str.strip()
```
- Removes leading and trailing spaces
- Does NOT remove spaces within names (e.g., "Mary Jane" stays intact)

### 2. Standardize Capitalization
```python
df['first_name'] = df['first_name'].str.title()
df['last_name'] = df['last_name'].str.title()
```
- Converts all names to Title Case (first letter capitalized)
- WILLIAM â†’ William, garcia â†’ Garcia

### 3. Validate - No Special Characters
- Checked for numbers, symbols, or other non-letter characters
- âœ… All names contain only letters and spaces

## Results
âœ… All 200 names now standardized  
âœ… Consistent Title Case formatting  
âœ… No extra whitespace  
âœ… No invalid characters

# Task 5: Create Summary Report

In [62]:
print("="*60)
print("DATA CLEANING SUMMARY - Customer Contacts")
print("="*60)

print("\nðŸ“Š DATASET OVERVIEW")
print(f"Total records: {len(df)}")
print(f"Total columns: {len(df.columns)}")

print("\nðŸ“§ EMAIL CLEANING")
print(f"Missing emails: {df['email'].isna().sum()} ({df['email'].isna().sum()/len(df)*100:.1f}%)")
print(f"Records with email: {df['has_email'].sum()}")
print(f"Decision: Kept as NaN (no fake data)")

print("\nðŸ“ž PHONE CLEANING")
print(f"Original formats: 5 different formats found")
print(f"Standardized to: (XXX) XXX-XXXX")
print(f"Successfully cleaned: {df['phone_clean'].notna().sum()}")
print(f"Missing phones: {df['phone_clean'].isna().sum()} ({df['phone_clean'].isna().sum()/len(df)*100:.1f}%)")

print("\nðŸ‘¤ NAME CLEANING")
print(f"Whitespace removed: Yes")
print(f"Standardized case: Title Case")
print(f"Invalid characters: None found")

print("\nâœ… FINAL DATA QUALITY")
print(f"Clean records ready for use: {len(df)}")
print(f"Data quality score: High")

print("\n" + "="*60)

DATA CLEANING SUMMARY - Customer Contacts

ðŸ“Š DATASET OVERVIEW
Total records: 200
Total columns: 8

ðŸ“§ EMAIL CLEANING
Missing emails: 14 (7.0%)
Records with email: 186
Decision: Kept as NaN (no fake data)

ðŸ“ž PHONE CLEANING
Original formats: 5 different formats found
Standardized to: (XXX) XXX-XXXX
Successfully cleaned: 164
Missing phones: 36 (18.0%)

ðŸ‘¤ NAME CLEANING
Whitespace removed: Yes
Standardized case: Title Case
Invalid characters: None found

âœ… FINAL DATA QUALITY
Clean records ready for use: 200
Data quality score: High

