# Type Conversion & Date/Time Formats

**Module 4: Data Cleaning & Transformation**

## Learning Objectives
- Master data type conversions in pandas
- Parse and manipulate date/time data effectively
- Handle common type conversion problems
- Work with different date formats and timezones

## Business Context
> "Wrong data types cause wrong analysis. A '123' string cannot be summed with a 456 integer!"

Data often arrives in the wrong format: numbers stored as strings, dates stored as text, categories stored as generic objects. Fixing these types is essential for accurate analysis.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)

print("✓ Libraries loaded successfully")

---
## 1. Understanding Data Types in Pandas

### Common Data Types

| Pandas dtype | Python type | Description | Use case |
|-------------|-------------|-------------|----------|
| `int64` | int | Integer numbers | IDs, counts, ages |
| `float64` | float | Decimal numbers | Prices, percentages, measurements |
| `object` | str | Text/strings | Names, descriptions, mixed data |
| `bool` | bool | True/False | Flags, binary categories |
| `datetime64` | datetime | Date and time | Timestamps, dates |
| `category` | - | Categorical data | Departments, status, ratings |

### 🎯 Why Data Types Matter
- **Performance**: Proper types use less memory
- **Functionality**: Can't calculate mean of strings
- **Accuracy**: Prevents silent errors in analysis
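
The "silent error" case can be sketched in a few lines: summing a column of numeric strings concatenates them instead of adding them. The series name `amount` is made up for illustration, and the `object` dtype is set explicitly so the sketch mirrors the string columns shown in the table above.

```python
import pandas as pd

# Hypothetical column: numbers that arrived as text (object dtype)
amounts = pd.Series(['123', '456'], name='amount', dtype=object)

# Summing strings concatenates them, and no error is raised
text_total = amounts.sum()

# After conversion the sum is arithmetic
numeric_total = amounts.astype(int).sum()

print(repr(text_total))
print(numeric_total)
```

Nothing warns you about the first result, which is why checking `dtypes` before aggregating is worth the extra line.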

In [None]:
# Create a messy dataset with type problems (common in real data)
messy_data = pd.DataFrame({
    'employee_id': ['001', '002', '003', '004', '005'],  # IDs as strings
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'salary': ['50000', '60,000', '$55,000', '70000', 'N/A'],  # Mixed formats
    'hire_date': ['2020-01-15', '01/20/2019', 'March 5, 2021', '2018-07-01', '15-06-2022'],  # Mixed date formats
    'is_manager': ['Yes', 'No', 'yes', 'NO', 'True'],  # Inconsistent booleans
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],  # Should be categorical
    'performance_score': ['4.5', '3.8', 'excellent', '4.2', '3.9'],  # Mixed
    'age': [28, '35', 42, '29', 31]  # Mixed int/string
})

print("Messy Dataset:")
print(messy_data)
print("\nData Types:")
print(messy_data.dtypes)

---
## 2. Converting Numeric Types

### 2.1 Basic Conversion with `astype()`

In [None]:
# Simple conversion when data is clean
df = pd.DataFrame({
    'numbers_as_strings': ['10', '20', '30', '40'],
    'floats_as_strings': ['1.5', '2.5', '3.5', '4.5']
})

print("Original types:")
print(df.dtypes)

# Convert using astype()
df['numbers_as_int'] = df['numbers_as_strings'].astype(int)
df['floats_as_float'] = df['floats_as_strings'].astype(float)

print("\nAfter conversion:")
print(df.dtypes)
print("\nData:")
print(df)

### 2.2 Handling Messy Numbers with `pd.to_numeric()`

When the data contains errors or special characters, `astype()` raises a `ValueError`. `pd.to_numeric()` with `errors='coerce'` converts what it can and turns everything else into `NaN`.

In [None]:
# Messy salary data
salary_data = pd.Series(['50000', '60,000', '$55,000', '70000', 'N/A', '', '45.5k'])

print("Original salary data:")
print(salary_data)

# This would fail: salary_data.astype(float)

# Using pd.to_numeric with errors='coerce' (converts errors to NaN)
# Note: '60,000', '$55,000' and '45.5k' are not cleaned, they become NaN too
salary_clean = pd.to_numeric(salary_data, errors='coerce')
print("\nWith errors='coerce' (errors become NaN):")
print(salary_clean)
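
Because `errors='coerce'` hides failures, it helps to list which raw values were lost. The following is a minimal sketch of that check; the variable names `raw`, `converted` and `failed_values` are illustrative and not part of the original notebook.

```python
import pandas as pd

raw = pd.Series(['50000', '60,000', '$55,000', '70000', 'N/A', '', '45.5k'])
converted = pd.to_numeric(raw, errors='coerce')

# Values that were present in the raw data but became NaN during conversion
failed_mask = converted.isna() & raw.notna()
failed_values = raw[failed_mask]

print(f"{failed_mask.sum()} of {len(raw)} values could not be converted:")
print(failed_values.tolist())
```

Reviewing this list tells you whether the losses are genuine missing values (`'N/A'`) or recoverable ones (`'60,000'`) that deserve a cleaning step.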

In [None]:
def clean_currency(value):
    """
    Clean currency strings and convert to float.
    Handles: $, commas, k/K suffix, empty strings
    """
    if pd.isna(value) or value == '' or value == 'N/A':
        return np.nan
    
    # Convert to string
    value = str(value)
    
    # Remove $ and commas
    value = value.replace('$', '').replace(',', '')
    
    # Handle k/K suffix (thousands)
    if value.lower().endswith('k'):
        value = float(value[:-1]) * 1000
    else:
        value = float(value)
    
    return value

# Apply the cleaning function
salary_clean = salary_data.apply(clean_currency)
print("Cleaned salary data:")
print(salary_clean)
print(f"\nSum: ${salary_clean.sum():,.0f}")
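
For larger columns, a row-by-row `apply` can be slow. One possible alternative, sketched here under the same assumptions as `clean_currency` (only `$`, commas and a `k`/`K` suffix need handling), uses vectorized string methods. The name `salary_vectorized` is illustrative.

```python
import pandas as pd

raw = pd.Series(['50000', '60,000', '$55,000', '70000', 'N/A', '', '45.5k'])

# Strip whitespace, then remove '$' and thousands separators
text = raw.str.strip().str.replace(r'[$,]', '', regex=True)

# Remember which entries carry a k/K suffix, then drop the suffix
is_thousands = text.str.lower().str.endswith('k', na=False)
text = text.str.replace(r'[kK]$', '', regex=True)

# Anything still non-numeric ('N/A', '') becomes NaN
values = pd.to_numeric(text, errors='coerce')

# Multiply only the rows that had the suffix
salary_vectorized = values.where(~is_thousands, values * 1000)

print(salary_vectorized)
```

The result should match the `apply` version for this sample; unlike `clean_currency`, unexpected text ends up as `NaN` rather than raising an exception.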

### 2.3 Converting to Integer with Missing Values

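Standard `int64` can't hold NaN, so a numeric column with missing values falls back to `float64`. Use the nullable integer type `Int64` to keep whole numbers and missing values together.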

In [None]:
# Problem: Converting to int when there are NaN values
data_with_nan = pd.Series(['1', '2', 'NA', '4', '5'])

# Convert to numeric (creates float because of NaN)
as_numeric = pd.to_numeric(data_with_nan, errors='coerce')
print("As numeric (float due to NaN):")
print(as_numeric)
print(f"Type: {as_numeric.dtype}")

# Use nullable integer type 'Int64' (capital I!)
as_nullable_int = as_numeric.astype('Int64')
print("\nAs nullable integer (Int64):")
print(as_nullable_int)
print(f"Type: {as_nullable_int.dtype}")
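
To see why the nullable type is needed, this short sketch attempts the plain NumPy cast first. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 4.0])

try:
    values.astype('int64')
    outcome = 'converted'
except ValueError as exc:
    # pandas refuses to cast NaN to a NumPy integer
    outcome = f'failed: {type(exc).__name__}'

# The nullable type keeps the gap as <NA> and the rest as integers
nullable = values.astype('Int64')

print(outcome)
print(nullable)
```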

---
## 3. Working with Dates and Times

### 3.1 Converting Strings to Datetime

In [None]:
# Various date formats you'll encounter
date_formats = pd.DataFrame({
    'date_string': [
        '2024-01-15',          # ISO format
        '01/15/2024',          # US format
        '15/01/2024',          # European format
        'January 15, 2024',    # Full text
        '15-Jan-2024',         # Abbreviated
        '2024/01/15 14:30:00'  # With time
    ]
})

print("Date strings:")
print(date_formats)

In [None]:
# pd.to_datetime() is smart - it can parse many formats automatically
simple_dates = ['2024-01-15', '2024-02-20', '2024-03-25']

# Automatic parsing
parsed = pd.to_datetime(simple_dates)
print("Automatically parsed:")
print(parsed)
print(f"Type: {parsed.dtype}")

In [None]:
# Specifying format explicitly (faster and more reliable)

# Format codes:
# %Y = 4-digit year, %y = 2-digit year
# %m = month (01-12), %d = day (01-31)
# %H = hour (00-23), %M = minute, %S = second
# %B = full month name, %b = abbreviated month name

us_dates = ['01/15/2024', '02/20/2024', '03/25/2024']
eu_dates = ['15/01/2024', '20/02/2024', '25/03/2024']

# US format: MM/DD/YYYY
us_parsed = pd.to_datetime(us_dates, format='%m/%d/%Y')
print("US format (MM/DD/YYYY):")
print(us_parsed)

# European format: DD/MM/YYYY
eu_parsed = pd.to_datetime(eu_dates, format='%d/%m/%Y')
print("\nEuropean format (DD/MM/YYYY):")
print(eu_parsed)

# Full text format
text_dates = ['January 15, 2024', 'February 20, 2024']
text_parsed = pd.to_datetime(text_dates, format='%B %d, %Y')
print("\nText format:")
print(text_parsed)
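
The reason explicit formats matter is that some strings are valid under both conventions. A small sketch with one ambiguous string and one string that only fits the European layout:

```python
import pandas as pd

ambiguous = '05-01-2024'

as_us = pd.to_datetime(ambiguous, format='%m-%d-%Y')  # read as month-day-year
as_eu = pd.to_datetime(ambiguous, format='%d-%m-%Y')  # read as day-month-year

print(as_us.date(), as_eu.date())

# A day above 12 cannot be a month, so the wrong format is rejected loudly
try:
    pd.to_datetime('15/01/2024', format='%m/%d/%Y')
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(f"Mismatch detected: {mismatch_detected}")
```

Only dates with a day above 12 expose the wrong format; the rest parse without complaint, which is why the source system's convention should be confirmed rather than guessed.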

In [None]:
# Handling errors in date parsing
mixed_dates = ['2024-01-15', 'not a date', '2024-03-20', '']

# errors='coerce' converts invalid dates to NaT (Not a Time)
safe_parsed = pd.to_datetime(mixed_dates, errors='coerce')
print("With errors='coerce':")
print(safe_parsed)
print(f"\nNaT = Not a Time (like NaN for dates)")
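
When a single column mixes several known formats, one approach is to try each format in turn and keep the first successful parse per row. The helper `parse_with_formats` below is a hypothetical sketch, not a pandas function, and it assumes English month names.

```python
import pandas as pd

def parse_with_formats(values, formats):
    """Try each format in order; keep the first successful parse per row."""
    values = pd.Series(values)
    result = pd.to_datetime(values, format=formats[0], errors='coerce')
    for fmt in formats[1:]:
        parsed = pd.to_datetime(values, format=fmt, errors='coerce')
        # Fill only the rows that are still NaT
        result = result.fillna(parsed)
    return result

hire_dates = ['2020-01-15', '01/20/2019', 'March 5, 2021', '2018-07-01', '15-06-2022']
formats = ['%Y-%m-%d', '%m/%d/%Y', '%B %d, %Y', '%d-%m-%Y']

parsed_dates = parse_with_formats(hire_dates, formats)
print(parsed_dates)
```

The order of `formats` matters for ambiguous strings: the first format that matches wins, so list the convention you trust most first.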

### 3.2 Extracting Date Components

In [None]:
# Create a DataFrame with dates
df = pd.DataFrame({
    'order_date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'sales': np.random.randint(100, 1000, 10)
})

# Extract various components using .dt accessor
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['day'] = df['order_date'].dt.day
df['day_name'] = df['order_date'].dt.day_name()
df['day_of_week'] = df['order_date'].dt.dayofweek  # 0=Monday
df['week_of_year'] = df['order_date'].dt.isocalendar().week
df['quarter'] = df['order_date'].dt.quarter
df['is_weekend'] = df['order_date'].dt.dayofweek >= 5

print("Date components extracted:")
print(df)

### 3.3 Date Calculations

In [None]:
# Calculate differences between dates
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'hire_date': pd.to_datetime(['2020-03-15', '2019-07-01', '2021-11-20', '2018-01-10']),
    'birth_date': pd.to_datetime(['1995-06-20', '1988-03-15', '1992-09-08', '1985-12-25'])
})

# Current date
today = pd.Timestamp.today()
print(f"Today: {today.date()}")

# Calculate tenure (days since hire)
employees['tenure_days'] = (today - employees['hire_date']).dt.days
employees['tenure_years'] = employees['tenure_days'] / 365.25

# Calculate age
employees['age'] = ((today - employees['birth_date']).dt.days / 365.25).astype(int)

print("\nEmployee data with calculated fields:")
print(employees)
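
Dividing days by 365.25 is an approximation and can be off by one on or near a birthday. A sketch of an exact alternative compares calendar fields instead; the reference date `as_of` is fixed here only to make the result reproducible.

```python
import pandas as pd

birth_dates = pd.Series(pd.to_datetime(['2001-03-01', '1995-06-20']))
as_of = pd.Timestamp('2023-03-01')  # fixed reference date for reproducibility

# Approximation used above: elapsed days divided by the average year length
age_approx = ((as_of - birth_dates).dt.days / 365.25).astype(int)

# Exact: year difference, minus one if this year's birthday hasn't happened yet
had_birthday = (
    (birth_dates.dt.month < as_of.month)
    | ((birth_dates.dt.month == as_of.month) & (birth_dates.dt.day <= as_of.day))
)
age_exact = as_of.year - birth_dates.dt.year - (~had_birthday).astype(int)

print(pd.DataFrame({'approx': age_approx, 'exact': age_exact}))
```

The first person turns 22 on the reference date, but the approximation still reports 21 because the elapsed period contains fewer leap days than 365.25 assumes.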

In [None]:
# Adding/subtracting time
base_date = pd.Timestamp('2024-01-15')

print(f"Base date: {base_date.date()}")
print(f"+ 30 days: {(base_date + pd.Timedelta(days=30)).date()}")
print(f"+ 2 weeks: {(base_date + pd.Timedelta(weeks=2)).date()}")
print(f"+ 3 months: {(base_date + pd.DateOffset(months=3)).date()}")
print(f"+ 1 year: {(base_date + pd.DateOffset(years=1)).date()}")

# Business days
print(f"\n+ 10 business days: {(base_date + pd.offsets.BDay(10)).date()}")
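
`pd.Timedelta` and `pd.DateOffset` behave differently at month boundaries: the first adds a fixed duration, the second moves by calendar units. A short sketch using a month-end date:

```python
import pandas as pd

month_end = pd.Timestamp('2024-01-31')

# Fixed duration: exactly 30 days later
plus_30_days = month_end + pd.Timedelta(days=30)

# Calendar month: lands on the last valid day of February
plus_1_month = month_end + pd.DateOffset(months=1)

print(f"+ 30 days:  {plus_30_days.date()}")
print(f"+ 1 month:  {plus_1_month.date()}")
```

For billing cycles or renewals that follow the calendar, `DateOffset` is usually the closer match; for elapsed-time rules such as "within 30 days", `Timedelta` is.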

### 3.4 Formatting Dates for Output

In [None]:
# Convert datetime to string with specific format
date = pd.Timestamp('2024-03-15 14:30:45')

print("Date formatting examples:")
print(f"Default: {date}")
print(f"ISO format: {date.strftime('%Y-%m-%d')}")
print(f"US format: {date.strftime('%m/%d/%Y')}")
print(f"European format: {date.strftime('%d/%m/%Y')}")
print(f"Full text: {date.strftime('%B %d, %Y')}")
print(f"With time: {date.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Day and date: {date.strftime('%A, %B %d, %Y')}")
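
The same format codes work on a whole column through the `.dt` accessor. This sketch uses a made-up `orders` frame and assumes an English locale for the month abbreviations.

```python
import pandas as pd

orders = pd.DataFrame({
    'order_date': pd.to_datetime(['2024-03-15 14:30:45', '2024-12-01 09:05:00'])
})

# .dt.strftime returns strings, so keep the original datetime column for calculations
orders['order_date_label'] = orders['order_date'].dt.strftime('%d %b %Y')

print(orders)
print(orders.dtypes)
```

Formatting is best left to the last step before display or export, since the formatted column no longer supports date arithmetic or the `.dt` accessor.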

---
## 4. Categorical Data Type

The categorical data type is efficient for columns with a limited number of unique values, because each distinct label is stored once and every row holds only a small integer code.

In [None]:
# Create sample data
n = 100000
df = pd.DataFrame({
    'department': np.random.choice(['Sales', 'IT', 'HR', 'Marketing'], n),
    'status': np.random.choice(['Active', 'Inactive', 'On Leave'], n),
    'performance': np.random.choice(['Excellent', 'Good', 'Average', 'Poor'], n)
})

print("Original data types:")
print(df.dtypes)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

In [None]:
# Convert to categorical
df_cat = df.copy()
df_cat['department'] = df_cat['department'].astype('category')
df_cat['status'] = df_cat['status'].astype('category')
df_cat['performance'] = df_cat['performance'].astype('category')

print("After converting to categorical:")
print(df_cat.dtypes)
print(f"\nMemory usage: {df_cat.memory_usage(deep=True).sum() / 1024:.2f} KB")

# Calculate memory savings
original_mem = df.memory_usage(deep=True).sum()
cat_mem = df_cat.memory_usage(deep=True).sum()
savings = (original_mem - cat_mem) / original_mem * 100
print(f"\n💾 Memory savings: {savings:.1f}%")
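
The savings come from how a categorical column is stored: a lookup table of unique labels plus one integer code per row. A small sketch that exposes both parts:

```python
import pandas as pd

dept = pd.Series(['Sales', 'IT', 'Sales', 'HR', 'IT']).astype('category')

categories = dept.cat.categories.tolist()  # lookup table of unique labels
codes = dept.cat.codes.tolist()            # one small integer per row

print(f"Categories: {categories}")
print(f"Codes:      {codes}")
print(f"Code dtype: {dept.cat.codes.dtype}")
```

The benefit shrinks as the number of unique values approaches the number of rows, so columns such as names or free-text comments are usually poor candidates.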

In [None]:
# Ordered categories (useful for performance ratings, etc.)
from pandas.api.types import CategoricalDtype

# Define order
perf_order = CategoricalDtype(
    categories=['Poor', 'Average', 'Good', 'Excellent'], 
    ordered=True
)

df_cat['performance'] = df_cat['performance'].astype(perf_order)

# Now we can use comparison operators
print("Employees with performance >= Good:")
print(df_cat[df_cat['performance'] >= 'Good'].head())

print(f"\nCategories in order: {df_cat['performance'].cat.categories.tolist()}")
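
Ordering also affects sorting and aggregation, and values outside the declared categories become missing. The following sketch uses a small illustrative series rather than the 100,000-row frame above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

rating_type = CategoricalDtype(['Poor', 'Average', 'Good', 'Excellent'], ordered=True)
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Average'], dtype=rating_type)

# Sorting follows the declared order, not the alphabet
sorted_ratings = ratings.sort_values().tolist()
alphabetical = sorted(ratings.astype(str))
best = ratings.max()

# A label that is not in the category list (note the lowercase) turns into NaN
unknown = pd.Series(['good']).astype(rating_type)

print(sorted_ratings)
print(alphabetical)
print(best)
print(unknown.isna().tolist())
```

Because unknown labels are dropped silently, it is worth normalizing case and spelling before casting to a fixed category list.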

---
## 5. Boolean Conversions

Real data often has inconsistent boolean representations.

In [None]:
# Various boolean representations
bool_data = pd.Series(['Yes', 'No', 'yes', 'NO', 'True', 'False', 
                       'TRUE', 'false', '1', '0', 'Y', 'N', 'T', 'F'])

print("Various boolean representations:")
print(bool_data.values)

In [None]:
def convert_to_bool(value):
    """
    Convert various string representations to boolean.
    """
    if pd.isna(value):
        return np.nan
    
    # Convert to lowercase string
    value = str(value).lower().strip()
    
    true_values = ['yes', 'true', '1', 'y', 't', 'on', 'active']
    false_values = ['no', 'false', '0', 'n', 'f', 'off', 'inactive']
    
    if value in true_values:
        return True
    elif value in false_values:
        return False
    else:
        return np.nan

bool_converted = bool_data.apply(convert_to_bool)
print("Converted to boolean:")
print(pd.DataFrame({'original': bool_data, 'converted': bool_converted}))
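
Mixing `True`, `False` and `np.nan` leaves the column as `object` dtype. One alternative, sketched here with a shortened mapping, is to map normalized strings and cast to pandas' nullable `boolean` dtype. The names `mapping` and `flags` are illustrative.

```python
import pandas as pd

raw = pd.Series(['Yes', 'No', ' yes ', 'NO', 'True', 'maybe', None])

mapping = {
    'yes': True, 'true': True, '1': True, 'y': True, 't': True,
    'no': False, 'false': False, '0': False, 'n': False, 'f': False,
}

# Normalize, look up, and keep unknown or missing entries as <NA>
flags = raw.str.strip().str.lower().map(mapping).astype('boolean')

print(flags)
print(f"True count: {flags.sum()}")
```

With the `boolean` dtype, `sum()` and `mean()` skip the missing entries and the column stays typed, so later filters such as `flags.fillna(False)` behave predictably.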

---
## 6. Putting It All Together: Complete Type Cleanup

In [None]:
# Remember our messy data from the beginning
print("Original messy data:")
print(messy_data)
print("\nOriginal types:")
print(messy_data.dtypes)

In [None]:
# Clean the entire dataset
df_clean = messy_data.copy()

# 1. employee_id: Keep as string (IDs should be strings)
#    Already correct

# 2. salary: Clean and convert to numeric
df_clean['salary'] = df_clean['salary'].apply(clean_currency)

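# 3. hire_date: Convert to datetime
#    The column mixes several formats. pandas 2.0+ infers a single format from
#    the first value and coerces non-matching rows to NaT unless format='mixed'
#    is passed (format='mixed' requires pandas 2.0 or newer).
df_clean['hire_date'] = pd.to_datetime(df_clean['hire_date'], format='mixed', errors='coerce')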

# 4. is_manager: Convert to boolean
df_clean['is_manager'] = df_clean['is_manager'].apply(convert_to_bool)

# 5. department: Convert to category
df_clean['department'] = df_clean['department'].astype('category')

# 6. performance_score: Convert to numeric
df_clean['performance_score'] = pd.to_numeric(df_clean['performance_score'], errors='coerce')

# 7. age: Convert to integer
df_clean['age'] = pd.to_numeric(df_clean['age'], errors='coerce').astype('Int64')

print("Cleaned data:")
print(df_clean)
print("\nCleaned types:")
print(df_clean.dtypes)
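
After a cleanup like this, a quick schema check can confirm that every column ended up with the intended type. The sketch below uses a small stand-in frame and a hypothetical `checks` dictionary; the `hire_date` column is deliberately left as text to show a failure.

```python
import pandas as pd
from pandas.api import types as ptypes

clean = pd.DataFrame({
    'salary': [50000.0, 60000.0],
    'age': pd.array([28, 35], dtype='Int64'),
    'department': pd.Categorical(['Sales', 'IT']),
    'hire_date': ['2020-01-15', '2019-01-20'],  # deliberately left unconverted
})

# One predicate per column describing the type we expect
checks = {
    'salary': ptypes.is_float_dtype,
    'age': ptypes.is_integer_dtype,
    'department': lambda s: isinstance(s.dtype, pd.CategoricalDtype),
    'hire_date': ptypes.is_datetime64_any_dtype,
}

failed = [col for col, check in checks.items() if not check(clean[col])]
print(f"Columns with unexpected types: {failed}")
```

Keeping the expected types in one place also documents the conversions, which makes the pipeline easier to rerun on the next data delivery.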

In [None]:
# Now we can do proper analysis!
print("=== Analysis After Type Conversion ===")

print(f"\nAverage salary: ${df_clean['salary'].mean():,.2f}")
print(f"Average age: {df_clean['age'].mean():.1f}")
print(f"Number of managers: {df_clean['is_manager'].sum()}")

print("\nEmployee tenure (days from hire):")
today = pd.Timestamp.today()
df_clean['tenure_days'] = (today - df_clean['hire_date']).dt.days
print(df_clean[['name', 'hire_date', 'tenure_days']])

---
## 7. Practical Exercises

### Exercise 1: Clean the Sales Data

In [None]:
# Messy sales data
sales = pd.DataFrame({
    'order_id': ['ORD001', 'ORD002', 'ORD003', 'ORD004', 'ORD005'],
    'order_date': ['2024-01-15', '15/02/2024', 'March 10, 2024', '2024/04/20', '05-01-2024'],
    'amount': ['$1,234.56', '2345.67', '€999.99', '4,567', 'FREE'],
    'quantity': ['10', '5', 'twenty', '15', '8'],
    'is_gift': ['Y', 'N', 'Yes', 'no', 'FALSE'],
    'status': ['Completed', 'Pending', 'Shipped', 'Completed', 'Pending']
})

print("Messy sales data:")
print(sales)
print("\nCurrent types:")
print(sales.dtypes)

In [None]:
# TODO: Clean the sales data
# 1. Convert order_date to datetime
# 2. Clean and convert amount to float (handle different currencies, commas)
# 3. Convert quantity to integer (handle text)
# 4. Convert is_gift to boolean
# 5. Convert status to category


### Exercise 2: Date Calculations

In [None]:
# Customer subscription data
subscriptions = pd.DataFrame({
    'customer_id': [101, 102, 103, 104, 105],
    'start_date': ['2023-01-15', '2022-06-01', '2023-09-20', '2021-03-10', '2024-01-01'],
    'plan': ['Monthly', 'Annual', 'Monthly', 'Annual', 'Monthly']
})

print("Subscription data:")
print(subscriptions)

In [None]:
# TODO: 
# 1. Convert start_date to datetime
# 2. Calculate subscription_days (days since start)
# 3. Calculate subscription_months
# 4. Calculate renewal_date (Monthly = start + 1 month, Annual = start + 1 year)
# 5. Add is_due_for_renewal (True if renewal_date is within 30 days)


### Exercise 3: Optimize Data Types for Large Dataset

In [None]:
# Simulate a large dataset
np.random.seed(42)
n = 100000

large_df = pd.DataFrame({
    'transaction_id': range(n),
    'customer_type': np.random.choice(['Regular', 'Premium', 'VIP'], n),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Books', 'Home'], n),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'Cash', 'PayPal'], n),
    'amount': np.random.uniform(10, 1000, n),
    'quantity': np.random.randint(1, 10, n),
    'is_returned': np.random.choice([True, False], n, p=[0.05, 0.95])
})

print(f"Original memory usage: {large_df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")
print("\nData types:")
print(large_df.dtypes)

In [None]:
# TODO: Optimize the data types to reduce memory usage
# Hints:
# - Use category for string columns with limited unique values
# - Use smaller int types (int32, int16) where appropriate
# - Use float32 instead of float64 if precision allows


---
## 8. Key Takeaways

### ✅ Best Practices

1. **Check data types first** - `df.dtypes` and `df.info()`
2. **Use `pd.to_numeric()` and `pd.to_datetime()`** with `errors='coerce'` for safe conversion, then count the resulting NaN/NaT values so bad inputs don't disappear silently
3. **Specify date formats explicitly** for reliability
4. **Use categorical types** for columns with few unique values
5. **Document your conversions** for reproducibility

### ⚠️ Common Pitfalls

1. Assuming data types are correct without checking
2. Using `astype()` on messy data (raises `ValueError` on the first bad value)
3. Confusing date formats (MM/DD vs DD/MM)
4. Forgetting that NumPy integer columns can't hold NaN (use the nullable `Int64`)

### 📋 Conversion Quick Reference

```python
# Numbers
pd.to_numeric(series, errors='coerce')

# Dates
pd.to_datetime(series, format='%Y-%m-%d', errors='coerce')

# Categories
series.astype('category')

# Nullable integers
series.astype('Int64')

# Extract date parts
df['date'].dt.year
df['date'].dt.month
df['date'].dt.day
df['date'].dt.day_name()
```