# Data Types and Conversions in Data Science

This notebook explores various data types in Python, NumPy, and Pandas, along with conversion techniques essential for data science workflows. We'll examine how to efficiently work with different data types, convert between them, and optimize memory usage.

## 1. Import Required Libraries

Let's start by importing the essential libraries we'll need for data type operations.

In [None]:
# Import core Python libraries
import sys
import math
import datetime

# Import data science libraries
import numpy as np
import pandas as pd

# Import utilities for memory analysis and plotting
from sys import getsizeof
import matplotlib.pyplot as plt

# Print versions for reproducibility
print(f"Python version: {sys.version}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 2. Understanding Python Data Types

Python has several built-in data types that form the foundation for data science operations. Let's explore each of these core data types.

### 2.1 Numeric Types: int, float

In [None]:
# Integer examples
x = 42
y = 1000000000000  # Python integers have arbitrary precision

# Float examples
a = 3.14159
b = 2.71828
c = 1.0e-10  # Scientific notation

# Print types and values
print(f"x = {x}, type: {type(x)}")
print(f"y = {y}, type: {type(y)}")
print(f"a = {a}, type: {type(a)}")
print(f"b = {b}, type: {type(b)}")
print(f"c = {c}, type: {type(c)}")

# Numeric operations
print(f"\nOperations:")
print(f"x + a = {x + a}, type: {type(x + a)}")  # Note: int + float = float
print(f"a / x = {a / x}, type: {type(a / x)}")
print(f"x // 5 = {x // 5}, type: {type(x // 5)}")  # Floor division
print(f"x % 5 = {x % 5}, type: {type(x % 5)}")  # Modulo
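
Floats are stored in binary, so some decimal values cannot be represented exactly. The short sketch below shows why exact equality checks on floats are fragile, and two common alternatives: a tolerance-based comparison with math.isclose and exact base-10 arithmetic with decimal.Decimal. The variable names are illustrative only.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary representation
print(f"0.1 + 0.2 == 0.3: {total == 0.3}")

# Compare with a tolerance instead
print(f"math.isclose(0.1 + 0.2, 0.3): {math.isclose(total, 0.3)}")

# Decimal values built from strings keep exact base-10 digits
exact = Decimal("0.1") + Decimal("0.2")
print(f"Decimal sum equals Decimal('0.3'): {exact == Decimal('0.3')}")
```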

### 2.2 Boolean Type: bool

In [None]:
# Boolean examples
t = True
f = False

# Boolean from comparisons
is_greater = x > a
is_equal = 10 == 10.0  # True even though types are different

print(f"t = {t}, type: {type(t)}")
print(f"f = {f}, type: {type(f)}")
print(f"{x} > {a} = {is_greater}, type: {type(is_greater)}")
print(f"10 == 10.0: {is_equal}, type: {type(is_equal)}")

# Numeric values of booleans (used in calculations)
print(f"\nBoolean as numbers:")
print(f"True + True = {True + True}")
print(f"int(True) = {int(True)}")
print(f"int(False) = {int(False)}")

# Boolean operations
print(f"\nBoolean operations:")
print(f"True and False = {True and False}")
print(f"True or False = {True or False}")
print(f"not True = {not True}")

### 2.3 String Type: str

In [None]:
# String examples
s1 = "Hello, World!"
s2 = 'Data Science'
s3 = """Multi-line
string example"""

print(f"s1 = {s1}, type: {type(s1)}")
print(f"s2 = {s2}, type: {type(s2)}")
print(f"s3 = {s3}, type: {type(s3)}")

# String operations
print(f"\nString operations:")
print(f"Length of s1: {len(s1)}")
print(f"s1 uppercase: {s1.upper()}")
print(f"s2 lowercase: {s2.lower()}")
print(f"Split s1: {s1.split(',')}")
print(f"Join strings: {'_'.join(['a', 'b', 'c'])}")
print(f"Replace in s1: {s1.replace('Hello', 'Hi')}")

# String formatting
name = "Python"
version = 3.9
print(f"String formatting: {name} version {version}")
print("Old-style formatting: %s version %.1f" % (name, version))
print("Format method: {} version {}".format(name, version))

### 2.4 Container Types: list, tuple, dict, set

In [None]:
# List - mutable ordered collection
my_list = [1, 2, 3, 'a', 'b', True]
print(f"my_list = {my_list}, type: {type(my_list)}")
print(f"First element: {my_list[0]}")
my_list[0] = 100  # Lists are mutable
print(f"After modification: {my_list}")
print(f"List slice: {my_list[2:5]}")

# Tuple - immutable ordered collection
my_tuple = (1, 2, 3, 'a', 'b', True)
print(f"\nmy_tuple = {my_tuple}, type: {type(my_tuple)}")
print(f"Second element: {my_tuple[1]}")
# my_tuple[0] = 100  # This would raise an error as tuples are immutable

# Dictionary - key-value pairs
my_dict = {'name': 'John', 'age': 30, 'city': 'New York'}
print(f"\nmy_dict = {my_dict}, type: {type(my_dict)}")
print(f"Access by key: my_dict['name'] = {my_dict['name']}")
my_dict['age'] = 31  # Modify value
my_dict['country'] = 'USA'  # Add new key-value pair
print(f"After modification: {my_dict}")
print(f"Keys: {list(my_dict.keys())}")
print(f"Values: {list(my_dict.values())}")
print(f"Items: {list(my_dict.items())}")

# Set - unordered collection of unique elements
my_set = {1, 2, 3, 2, 1, 3, 4, 5}  # Duplicates are automatically removed
print(f"\nmy_set = {my_set}, type: {type(my_set)}")
my_set.add(6)
print(f"After adding 6: {my_set}")
my_set.remove(3)
print(f"After removing 3: {my_set}")

# Set operations
set_a = {1, 2, 3, 4}
set_b = {3, 4, 5, 6}
print(f"\nSet operations:")
print(f"set_a = {set_a}, set_b = {set_b}")
print(f"Union: {set_a | set_b}")
print(f"Intersection: {set_a & set_b}")
print(f"Difference (set_a - set_b): {set_a - set_b}")
print(f"Symmetric difference: {set_a ^ set_b}")

## 3. NumPy Data Types

NumPy adds fixed-width numeric types (for example int8 or float32) that are stored in contiguous, typed buffers. This makes arrays more memory efficient than Python lists of objects and better suited for scientific computing.

In [None]:
# Create arrays with different NumPy data types
np_int8 = np.array([1, 2, 3], dtype=np.int8)
np_int32 = np.array([1, 2, 3], dtype=np.int32)
np_int64 = np.array([1, 2, 3], dtype=np.int64)
np_float32 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
np_float64 = np.array([1.0, 2.0, 3.0], dtype=np.float64)
np_bool = np.array([True, False, True], dtype=np.bool_)

# Print data types and memory usage
print(f"NumPy int8 array: {np_int8}, dtype: {np_int8.dtype}, itemsize: {np_int8.itemsize} bytes")
print(f"NumPy int32 array: {np_int32}, dtype: {np_int32.dtype}, itemsize: {np_int32.itemsize} bytes")
print(f"NumPy int64 array: {np_int64}, dtype: {np_int64.dtype}, itemsize: {np_int64.itemsize} bytes")
print(f"NumPy float32 array: {np_float32}, dtype: {np_float32.dtype}, itemsize: {np_float32.itemsize} bytes")
print(f"NumPy float64 array: {np_float64}, dtype: {np_float64.dtype}, itemsize: {np_float64.itemsize} bytes")
print(f"NumPy bool array: {np_bool}, dtype: {np_bool.dtype}, itemsize: {np_bool.itemsize} bytes")

# Data type ranges
print(f"\nNumPy data type ranges:")
print(f"np.int8 range: {np.iinfo(np.int8).min} to {np.iinfo(np.int8).max}")
print(f"np.uint8 range: {np.iinfo(np.uint8).min} to {np.iinfo(np.uint8).max}")
print(f"np.int32 range: {np.iinfo(np.int32).min} to {np.iinfo(np.int32).max}")
print(f"np.int64 range: {np.iinfo(np.int64).min} to {np.iinfo(np.int64).max}")
print(f"np.float32 range: {np.finfo(np.float32).min} to {np.finfo(np.float32).max}")
print(f"np.float32 precision: {np.finfo(np.float32).precision} digits")
print(f"np.float64 range: {np.finfo(np.float64).min} to {np.finfo(np.float64).max}")
print(f"np.float64 precision: {np.finfo(np.float64).precision} digits")
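
The ranges above matter because fixed-width NumPy integers wrap around silently in array arithmetic, unlike Python integers, which grow as needed. Here is a minimal sketch of that behavior, together with np.can_cast, which reports whether one dtype can hold every value of another. The array names are made up for this example.

```python
import numpy as np

a = np.array([127], dtype=np.int8)  # 127 is the int8 maximum
b = np.array([1], dtype=np.int8)

# int8 + int8 stays int8, so the result wraps around to the minimum
wrapped = a + b
print(f"int8 127 + 1 = {wrapped}, dtype: {wrapped.dtype}")

# Widening one operand first gives the arithmetic result we expect
widened = a.astype(np.int16) + b
print(f"int16 127 + 1 = {widened}, dtype: {widened.dtype}")

# np.can_cast checks whether a conversion between dtypes is always lossless
can_narrow = np.can_cast(np.int16, np.int8)
can_widen = np.can_cast(np.int8, np.int16)
print(f"int16 -> int8 safe: {can_narrow}, int8 -> int16 safe: {can_widen}")
```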

### 3.1 Special NumPy Data Types

In [None]:
# Complex numbers
complex_array = np.array([1+2j, 3+4j, 5+6j], dtype=np.complex128)
print(f"Complex array: {complex_array}, dtype: {complex_array.dtype}")
print(f"Real part: {complex_array.real}")
print(f"Imaginary part: {complex_array.imag}")

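# Byte-string and Unicode string data types in NumPy
# (np.bytes_ and np.str_ replace the np.string_ and np.unicode_ aliases removed in NumPy 2.0)
str_array = np.array(['apple', 'banana', 'cherry'], dtype=np.bytes_)
unicode_array = np.array(['apple', 'banana', 'cherry'], dtype=np.str_)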
print(f"\nString array: {str_array}, dtype: {str_array.dtype}")
print(f"Unicode array: {unicode_array}, dtype: {unicode_array.dtype}")

# Datetime data types
dates = np.array(['2021-01-01', '2021-01-15', '2021-02-01'], dtype='datetime64')
print(f"\nDatetime array: {dates}, dtype: {dates.dtype}")
print(f"Date difference: {dates[1] - dates[0]}")

# Structured arrays (similar to records/structs in other languages)
person_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
people = np.array([('Alice', 25, 55.0), ('Bob', 30, 85.5), ('Charlie', 35, 68.0)], dtype=person_type)
print(f"\nStructured array:")
print(people)
print(f"Names: {people['name']}")
print(f"Ages: {people['age']}")
print(f"Average weight: {np.mean(people['weight'])}")
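
A structured array maps naturally onto a table, so it can be handed to Pandas directly. The sketch below rebuilds a small version of the array above and passes it to pd.DataFrame; each field becomes a column. The name people_df is introduced only for this example.

```python
import numpy as np
import pandas as pd

person_type = np.dtype([('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
people = np.array([('Alice', 25, 55.0), ('Bob', 30, 85.5)], dtype=person_type)

# Each field of the structured dtype becomes a DataFrame column
people_df = pd.DataFrame(people)
print(people_df)
print(people_df.dtypes)
```

The numeric fields keep their NumPy widths (int32 and float32 here), which is worth checking before mixing the result with default 64-bit columns.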

## 4. Pandas Data Types

Pandas extends Python and NumPy data types to efficiently handle tabular and time-series data. Let's examine the key Pandas data types and their properties.

In [None]:
# Create a Pandas DataFrame with different data types
data = {
    'integer_col': [1, 2, 3, 4, 5],
    'float_col': [1.1, 2.2, 3.3, 4.4, 5.5],
    'string_col': ['a', 'b', 'c', 'd', 'e'],
    'bool_col': [True, False, True, False, True],
    'datetime_col': pd.date_range('2021-01-01', periods=5),
    'category_col': pd.Categorical(['small', 'medium', 'large', 'small', 'medium']),
    'object_col': [{'a': 1}, [1, 2], (3, 4), {5, 6}, 'text']
}

df = pd.DataFrame(data)

# Display DataFrame and data types
print(df)
print("\nData types in DataFrame:")
print(df.dtypes)

# Info method provides a summary of DataFrame's data types and memory usage
print("\nDataFrame info:")
df.info()

# Detailed introspection of selected columns
print("\nDetailed exploration of columns:")
for col in ['integer_col', 'float_col', 'datetime_col', 'category_col']:
    print(f"\nColumn: {col}")
    print(f"Type: {type(df[col])}")
    print(f"Pandas dtype: {df[col].dtype}")
    print(f"Sample value type: {type(df[col].iloc[0])}")

### 4.1 Pandas Specific Data Types

In [None]:
# Categorical data type
# Useful for columns with a limited set of possible values
color_series = pd.Series(['red', 'green', 'blue', 'red', 'blue', 'red', 'green'])
color_cat = color_series.astype('category')

print(f"Original series: {color_series.dtype}")
print(f"Categorical series: {color_cat.dtype}")
print(f"Categories: {color_cat.cat.categories}")
print(f"Codes: {color_cat.cat.codes}")
print(f"Memory usage original: {color_series.memory_usage(deep=True)} bytes")
print(f"Memory usage categorical: {color_cat.memory_usage(deep=True)} bytes")

# Nullable (pandas extension) data types
# These provide more consistent handling of missing values
df_nullable = pd.DataFrame({
    'Int64': pd.Series([1, 2, None, 4], dtype='Int64'),
    'Float64': pd.Series([1.1, 2.2, None, 4.4], dtype='Float64'),
    'boolean': pd.Series([True, False, None, True], dtype='boolean'),
    'string': pd.Series(['a', 'b', None, 'd'], dtype='string')
})

print("\nDataFrame with nullable/extension types:")
print(df_nullable)
print(df_nullable.dtypes)

# Date and time data types
date_series = pd.Series(pd.date_range('2021-01-01', periods=5))
time_series = pd.Series(pd.date_range('00:00:00', periods=5, freq='h'))  # 'h' replaces the 'H' alias deprecated in pandas 2.2
timedelta_series = pd.Series([pd.Timedelta(days=1), 
                              pd.Timedelta(hours=2), 
                              pd.Timedelta(minutes=3)])

print("\nDate series:")
print(date_series)
print(f"dtype: {date_series.dtype}")

print("\nTime series:")
print(time_series)
print(f"dtype: {time_series.dtype}")

print("\nTimedelta series:")
print(timedelta_series)
print(f"dtype: {timedelta_series.dtype}")
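
To see why the nullable types are useful, compare what happens to an integer column with a missing value under the default inference and under the Int64 extension type. This is a small sketch; the series names are illustrative.

```python
import pandas as pd

# Default inference: the missing value forces the whole column to float64
default_series = pd.Series([1, 2, None])
print(f"Default dtype: {default_series.dtype}")

# Nullable integer type: values stay integers and the gap is stored as <NA>
nullable_series = pd.Series([1, 2, None], dtype='Int64')
print(f"Nullable dtype: {nullable_series.dtype}")
print(nullable_series)

# Aggregations skip the missing value by default
print(f"Sum: {nullable_series.sum()}")
```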

## 5. Type Checking and Identification

Let's explore the different methods to check and identify data types in Python, NumPy, and Pandas.

In [None]:
# Python built-in type checking
values = [42, 3.14, "hello", True, [1, 2, 3], {'a': 1}, (1, 2), {1, 2, 3}]

print("Python built-in type checking:")
for val in values:
    print(f"Value: {val}, Type: {type(val)}")

# Using isinstance() for type checking
print("\nType checking with isinstance():")
print(f"42 is an integer: {isinstance(42, int)}")
print(f"3.14 is a float: {isinstance(3.14, float)}")
print(f"3.14 is a number: {isinstance(3.14, (int, float))}")  # Check against multiple types
print(f"'hello' is a string: {isinstance('hello', str)}")
print(f"[1, 2, 3] is a list: {isinstance([1, 2, 3], list)}")

# NumPy type checking
numpy_values = [
    np.array([1, 2, 3]),
    np.array([1.1, 2.2, 3.3]),
    np.array(['a', 'b', 'c']),
    np.array([True, False, True])
]

print("\nNumPy type checking:")
for arr in numpy_values:
    print(f"Array: {arr}, dtype: {arr.dtype}")
    print(f"  Is integer: {np.issubdtype(arr.dtype, np.integer)}")
    print(f"  Is floating: {np.issubdtype(arr.dtype, np.floating)}")
    print(f"  Is string: {np.issubdtype(arr.dtype, np.character)}")
    print(f"  Is boolean: {np.issubdtype(arr.dtype, np.bool_)}")

# Pandas type checking
print("\nPandas type checking:")
for col in df.columns:
    series = df[col]
    print(f"Column: {col}, dtype: {series.dtype}")
    print(f"  Is numeric: {pd.api.types.is_numeric_dtype(series)}")
    print(f"  Is integer: {pd.api.types.is_integer_dtype(series)}")
    print(f"  Is float: {pd.api.types.is_float_dtype(series)}")
    print(f"  Is string: {pd.api.types.is_string_dtype(series)}")
    print(f"  Is boolean: {pd.api.types.is_bool_dtype(series)}")
    print(f"  Is categorical: {isinstance(series.dtype, pd.CategoricalDtype)}")  # is_categorical_dtype is deprecated since pandas 2.1
    print(f"  Is datetime: {pd.api.types.is_datetime64_any_dtype(series)}")
    print(f"  Is object: {pd.api.types.is_object_dtype(series)}")
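
Checking columns one at a time is useful for inspection, but when the goal is to pick out all columns of a given kind, select_dtypes is usually shorter. A minimal sketch on a small made-up frame (the column names are assumptions for this example):

```python
import pandas as pd

sample = pd.DataFrame({
    'count': [1, 2, 3],
    'price': [1.5, 2.5, 3.5],
    'flag': [True, False, True],
    'when': pd.date_range('2021-01-01', periods=3),
})

# 'number' covers integer and float columns, but not booleans
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
bool_cols = sample.select_dtypes(include='bool').columns.tolist()
datetime_cols = sample.select_dtypes(include='datetime').columns.tolist()

print(f"Numeric columns: {numeric_cols}")
print(f"Boolean columns: {bool_cols}")
print(f"Datetime columns: {datetime_cols}")
```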

## 6. Type Conversion in Python

Let's explore how to convert between different Python data types.

In [None]:
# Explicit conversion between numeric types
num_int = 123
num_float = 123.45
num_str = "456"
num_bool = True

print("Numeric conversions:")
print(f"int to float: {float(num_int)}, type: {type(float(num_int))}")
print(f"float to int: {int(num_float)}, type: {type(int(num_float))}")  # Truncates, doesn't round
print(f"str to int: {int(num_str)}, type: {type(int(num_str))}")
print(f"str to float: {float(num_str)}, type: {type(float(num_str))}")
print(f"bool to int: {int(num_bool)}, type: {type(int(num_bool))}")
print(f"int to bool: {bool(num_int)}, type: {type(bool(num_int))}")  # 0 is False, any other number is True
print(f"Zero to bool: {bool(0)}, type: {type(bool(0))}")

# String conversions
value_int = 123
value_float = 123.45
value_bool = False
value_list = [1, 2, 3]
value_tuple = (4, 5, 6)
value_dict = {'a': 1, 'b': 2}
value_set = {7, 8, 9}

print("\nString conversions:")
print(f"int to str: {str(value_int)}, type: {type(str(value_int))}")
print(f"float to str: {str(value_float)}, type: {type(str(value_float))}")
print(f"bool to str: {str(value_bool)}, type: {type(str(value_bool))}")
print(f"list to str: {str(value_list)}, type: {type(str(value_list))}")
print(f"tuple to str: {str(value_tuple)}, type: {type(str(value_tuple))}")
print(f"dict to str: {str(value_dict)}, type: {type(str(value_dict))}")
print(f"set to str: {str(value_set)}, type: {type(str(value_set))}")

# Container type conversions
sample_str = "hello"
sample_list = [1, 2, 3, 2, 1]
sample_tuple = (4, 5, 6, 5, 4)
sample_dict = {'a': 1, 'b': 2, 'c': 3}
sample_set = {7, 8, 9}

print("\nContainer type conversions:")
print(f"str to list: {list(sample_str)}, type: {type(list(sample_str))}")
print(f"list to tuple: {tuple(sample_list)}, type: {type(tuple(sample_list))}")
print(f"tuple to list: {list(sample_tuple)}, type: {type(list(sample_tuple))}")
print(f"list to set (removes duplicates): {set(sample_list)}, type: {type(set(sample_list))}")
print(f"dict to list (gets keys): {list(sample_dict)}, type: {type(list(sample_dict))}")
print(f"dict to set (gets keys): {set(sample_dict)}, type: {type(set(sample_dict))}")
print(f"dict items to list: {list(sample_dict.items())}, type: {type(list(sample_dict.items()))}")

# Dictionary creation from sequences
keys = ['name', 'age', 'city']
values = ['Alice', 30, 'New York']
print("\nCreate dict from sequences:")
print(f"Using zip: {dict(zip(keys, values))}")
print(f"Using list comprehension: {dict([(keys[i], values[i]) for i in range(len(keys))])}")

### 6.1 Implicit Type Conversion (Type Coercion)

In [None]:
# Implicit type conversions in Python
print("Implicit type conversions:")
print(f"int + float: {10 + 3.5}, type: {type(10 + 3.5)}")  # int is implicitly converted to float
print(f"int * bool: {10 * True}, type: {type(10 * True)}")  # bool is implicitly converted to int
print(f"str + str: {'hello' + ' world'}, type: {type('hello' + ' world')}")

# Python doesn't do implicit conversion in many cases
try:
    result = "5" + 10  # This will raise TypeError
except TypeError as e:
    print(f"\nError when mixing incompatible types: {e}")

# Common patterns for safe conversion
value = "5"
number = 10
print(f"\nSafe conversions:")
print(f"str + int (safe): {value + str(number)}")
print(f"str to int + int: {int(value) + number}")

# Converting to numeric values with error handling
strings = ["123", "123.45", "hello", "42x", ""]

print("\nSafe numeric conversion with error handling:")
for s in strings:
    try:
        int_value = int(s)
        print(f"'{s}' converted to int: {int_value}")
    except ValueError:
        try:
            float_value = float(s)
            print(f"'{s}' converted to float: {float_value}")
        except ValueError:
            print(f"'{s}' cannot be converted to numeric type")

# Using string methods to check before conversion
print("\nValidation before conversion:")
for s in strings:
    if s.isdigit():
        print(f"'{s}' is a valid integer string")
    elif s.replace(".", "", 1).isdigit() and s.count(".") <= 1:
        print(f"'{s}' is a valid float string")
    else:
        print(f"'{s}' is not a valid numeric string")
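
The try/except pattern above can be wrapped in a small reusable helper. The function below is a sketch, and to_number is a hypothetical name, not a standard library function. Note that float() also accepts strings such as "nan" and "inf", so stricter validation may be needed for untrusted input.

```python
def to_number(text):
    """Return text parsed as int, then as float, or None if both fail."""
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            continue  # Try the next, more permissive type
    return None

for s in ["123", "123.45", "-7", "1e3", "hello", "42x", ""]:
    print(f"{s!r} -> {to_number(s)!r}")
```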

## 7. Type Conversion in NumPy

Let's explore various ways to convert between NumPy data types and their implications.

In [None]:
# Create a NumPy array
arr_float = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
print(f"Original array: {arr_float}, dtype: {arr_float.dtype}")

# Convert to different types using astype()
arr_int = arr_float.astype(np.int32)
arr_str = arr_float.astype(np.bytes_)  # np.string_ was removed in NumPy 2.0
arr_bool = arr_float.astype(np.bool_)

print("Conversions:")
print(f"To int32: {arr_int}, dtype: {arr_int.dtype}")  # Note: truncates decimal portions
print(f"To string: {arr_str}, dtype: {arr_str.dtype}")
print(f"To boolean: {arr_bool}, dtype: {arr_bool.dtype}")  # Non-zero values become True

# Converting data types with potential data loss
arr_large = np.array([300, 400, 500], dtype=np.int16)  # int16 max is 32767
try:
    arr_small = arr_large.astype(np.int8)  # int8 max is 127
    print(f"\nConversion with overflow: {arr_small}")  # Will wrap around due to overflow
except Exception as e:
    print(f"Error during conversion: {e}")

# Converting between signed and unsigned
arr_signed = np.array([-5, 0, 5], dtype=np.int8)
arr_unsigned = arr_signed.astype(np.uint8)
print(f"\nSigned array: {arr_signed}, dtype: {arr_signed.dtype}")
print(f"Unsigned array: {arr_unsigned}, dtype: {arr_unsigned.dtype}")  # Negative values wrap around

# Precision loss in float conversions
arr_double = np.array([1.123456789, 2.123456789], dtype=np.float64)
arr_single = arr_double.astype(np.float32)
arr_back = arr_single.astype(np.float64)  # Going back doesn't restore precision

print(f"\nDouble precision: {arr_double}, dtype: {arr_double.dtype}")
print(f"Single precision: {arr_single}, dtype: {arr_single.dtype}")
print(f"Back to double: {arr_back}, dtype: {arr_back.dtype}")
print(f"Original vs converted back (equal): {np.array_equal(arr_double, arr_back)}")

# Creating arrays with mixed types
mixed_list = [1, 2.5, "3", True]
arr_mixed = np.array(mixed_list)  # Everything is promoted to a Unicode string dtype (e.g. <U32), not object
print(f"\nMixed type list to array: {arr_mixed}, dtype: {arr_mixed.dtype}")

# Multiple ways to convert types
arr = np.array([1.5, 2.5, 3.5])
print(f"\nOriginal: {arr}, dtype: {arr.dtype}")
print(f"Using astype(): {arr.astype(int)}, dtype: {arr.astype(int).dtype}")
print(f"Using array constructor: {np.array(arr, dtype=int)}, dtype: {np.array(arr, dtype=int).dtype}")
print(f"Using a scalar type as a converter: {np.int32(arr)}, dtype: {np.int32(arr).dtype}")
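
Because astype() wraps out-of-range integers without raising, it can help to check the range first. The helper below is one possible sketch (checked_astype is a hypothetical name); it compares the array's minimum and maximum against np.iinfo before casting. The last lines show the built-in alternative: casting='safe' refuses any narrowing conversion, regardless of the actual values.

```python
import numpy as np

def checked_astype(arr, dtype):
    """Cast an integer array to an integer dtype, raising if a value would not fit."""
    info = np.iinfo(dtype)
    if arr.size and (int(arr.min()) < info.min or int(arr.max()) > info.max):
        raise OverflowError(
            f"values outside {info.min}..{info.max} for {np.dtype(dtype).name}"
        )
    return arr.astype(dtype)

small = checked_astype(np.array([1, 2, 3], dtype=np.int16), np.int8)
print(f"Fits: {small}, dtype: {small.dtype}")

try:
    checked_astype(np.array([300, 400, 500], dtype=np.int16), np.int8)
except OverflowError as e:
    print(f"Rejected: {e}")

# casting='safe' rejects int16 -> int8 even when every value would fit
try:
    np.array([1, 2, 3], dtype=np.int16).astype(np.int8, casting='safe')
except TypeError as e:
    print(f"casting='safe' refused: {e}")
```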

### 7.1 NumPy Type Conversion Functions

In [None]:
# Other NumPy type conversion functions
arr = np.array([1.5, 2.6, 3.7, 4.8, 5.9])

print("NumPy conversion functions:")
print(f"np.floor(): {np.floor(arr)}, dtype: {np.floor(arr).dtype}")  # Round down to nearest integer
print(f"np.ceil(): {np.ceil(arr)}, dtype: {np.ceil(arr).dtype}")  # Round up to nearest integer
print(f"np.round(): {np.round(arr)}, dtype: {np.round(arr).dtype}")  # Round to nearest integer (exact halves go to the even neighbor)
print(f"np.trunc(): {np.trunc(arr)}, dtype: {np.trunc(arr).dtype}")  # Truncate decimal part

# Converting from strings to numbers
str_arr = np.array(['1.5', '2.5', '3.5'])
print(f"\nString array: {str_arr}, dtype: {str_arr.dtype}")
print(f"Converted to float: {str_arr.astype(float)}, dtype: {str_arr.astype(float).dtype}")

# Create a structured array and extract fields with different types
structured = np.array([('Alice', 25, 55.0), ('Bob', 30, 85.5)], 
                      dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')])
print(f"\nStructured array:\n{structured}")
print(f"Names array: {structured['name']}, dtype: {structured['name'].dtype}")
print(f"Ages array: {structured['age']}, dtype: {structured['age'].dtype}")
print(f"Weights array: {structured['weight']}, dtype: {structured['weight'].dtype}")

# View casting - reinterpret bits as a different type without copying
uint8_arr = np.array([72, 101, 108, 108, 111], dtype=np.uint8)  # ASCII for "Hello"
print(f"\nUint8 array (ASCII codes): {uint8_arr}")
char_arr = uint8_arr.view('S1')  # View as 1-byte characters
print(f"As characters: {char_arr}")
back_to_uint8 = char_arr.view(np.uint8)  # View back as uint8
print(f"Back to uint8: {back_to_uint8}")
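
One detail of the rounding functions at the top of this cell is easy to miss: np.round() rounds exact halves to the nearest even value, not always upward. The sketch below contrasts that with a simple "round half up" built from np.floor, and shows that the result stays a float until it is cast explicitly. The variable names are illustrative.

```python
import numpy as np

ties = np.array([0.5, 1.5, 2.5, 3.5])

# Round-half-to-even: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2, 3.5 -> 4
half_even = np.round(ties)
print(f"np.round():           {half_even}")

# Round-half-up for non-negative values
half_up = np.floor(ties + 0.5)
print(f"np.floor(x + 0.5):    {half_up}")

# Rounding keeps the float dtype; cast explicitly to get integers
as_int = half_even.astype(np.int64)
print(f"After astype(int64):  {as_int}, dtype: {as_int.dtype}")
```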

## 8. Type Conversion in Pandas

Let's explore various ways to convert data types in Pandas DataFrames and Series.

In [None]:
# Create a sample DataFrame with mixed data types
data = {
    'A': ['1', '2', '3', '4', '5'],                   # strings
    'B': [1.1, 2.2, 3.3, 4.4, 5.5],                   # floats
    'C': ['True', 'False', 'True', 'False', 'True'],  # string booleans
    'D': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],  # date strings
    'E': ['a', 'b', 'c', 'd', 'e']                    # categories
}

df = pd.DataFrame(data)
print("Original DataFrame with inferred types:")
print(df.dtypes)

# Basic type conversion with astype()
df['A_int'] = df['A'].astype(int)
df['A_float'] = df['A'].astype(float)
df['B_int'] = df['B'].astype(int)
df['C_bool'] = df['C'].astype(bool)  # Pitfall: every non-empty string is truthy, so 'False' also becomes True
df['E_cat'] = df['E'].astype('category')

print("\nAfter basic type conversions:")
print(df.dtypes)
print(df)

# Convert string boolean values properly
df['C_bool_correct'] = df['C'].map({'True': True, 'False': False})

# Convert date strings to datetime using to_datetime
df['D_datetime'] = pd.to_datetime(df['D'])

print("\nAfter advanced type conversions:")
print(df.dtypes)
print(df[['C', 'C_bool', 'C_bool_correct', 'D', 'D_datetime']])
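
When several columns need converting at once, astype() also accepts a dictionary that maps column names to target dtypes. This is a minimal sketch on a small made-up frame; it returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

raw = pd.DataFrame({
    'A': ['1', '2', '3'],   # numeric strings
    'B': [1.1, 2.2, 3.3],   # floats
    'E': ['a', 'b', 'a'],   # repeated labels
})

# One call converts all three columns; unlisted columns would be left as they are
typed = raw.astype({'A': 'int64', 'B': 'float32', 'E': 'category'})

print(typed.dtypes)
```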

### 8.1 Pandas Conversion Functions

In [None]:
# Create a DataFrame with diverse data to convert
data = {
    'int_col': ['1', '2', '3', '4', '5'],
    'float_col': ['1.1', '2.2', '3.3', '4.4', '5.5'],
    'mixed_numeric': ['1', '2.2', '3', '4.4', '5'],
    'bool_col': ['True', 'False', 'YES', 'no', '1'],
    'date_col': ['2021-01-01', '01/02/2021', '2021-03-01', '04/01/21', '2021-05-01'],
    'categorical': ['small', 'medium', 'large', 'medium', 'small']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df.dtypes)  # All columns are object (string) type
print(df)

# Using specialized Pandas conversion functions
print("\n1. Using pd.to_numeric():")
# Convert to numeric types with error handling
for col in ['int_col', 'float_col', 'mixed_numeric']:
    # Different error handling options:
    # - 'raise': default, raises if a value can't be converted
    # - 'coerce': convert unparseable values to NaN
    # - 'ignore': return the input unchanged (deprecated since pandas 2.2)
    df[f"{col}_num"] = pd.to_numeric(df[col], errors='coerce')

print(df[['int_col', 'int_col_num', 'float_col', 'float_col_num', 'mixed_numeric', 'mixed_numeric_num']].head())
print(df.dtypes)

print("\n2. Using pd.to_datetime():")
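# Convert to datetime with and without an explicit format
# Without a format, pandas >= 2.0 infers one from the first value, so rows written
# differently become NaT under errors='coerce' (format='mixed' parses each row separately)
df['date_std'] = pd.to_datetime(df['date_col'], errors='coerce')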
df['date_fmt'] = pd.to_datetime(df['date_col'], format='%Y-%m-%d', errors='coerce')  # Specific format

print(df[['date_col', 'date_std', 'date_fmt']].head())
print(df.dtypes)

print("\n3. Using pd.to_timedelta():")
# Convert strings to time intervals
timedelta_data = ['1 days', '2 hours', '3 minutes', '4 seconds', '5 days 6 hours']
df['timedelta_col'] = timedelta_data
df['timedelta'] = pd.to_timedelta(df['timedelta_col'])

print(df[['timedelta_col', 'timedelta']].head())
print(df.dtypes)

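print("\n4. Converting to boolean:")
# Convert various spellings of boolean data
# True values: 'True', 'true', 't', 'yes', 'y', '1'
# False values: 'False', 'false', 'f', 'no', 'n', '0'
# astype('boolean') raises TypeError on arbitrary strings, so map known spellings first;
# unrecognized values become <NA> instead of silently turning into False
bool_map = {'true': True, 't': True, 'yes': True, 'y': True, '1': True,
            'false': False, 'f': False, 'no': False, 'n': False, '0': False}
df['bool_std'] = df['bool_col'].str.lower().map(bool_map).astype('boolean')
df['bool_custom'] = df['bool_col'].map(lambda x: x.lower() in ['true', 'yes', 'y', '1', 't'])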

print(df[['bool_col', 'bool_std', 'bool_custom']].head())
print(df.dtypes)

print("\n5. Converting to categorical:")
# Convert to categorical type
df['cat'] = df['categorical'].astype('category')
# With explicit category order
df['cat_ordered'] = pd.Categorical(df['categorical'], 
                                   categories=['small', 'medium', 'large'], 
                                   ordered=True)

print(df[['categorical', 'cat', 'cat_ordered']].head())
print(f"Categories: {df['cat'].cat.categories}")
print(f"Ordered categories: {df['cat_ordered'].cat.categories}")
print(f"Is ordered: {df['cat_ordered'].cat.ordered}")
print(df.dtypes)
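
With errors='coerce', bad values turn into NaN without any warning, so it is worth checking which inputs failed. One way to do that, sketched below on a made-up series, is to look for entries that are missing after conversion but were present before.

```python
import pandas as pd

raw = pd.Series(['1', '2.5', 'abc', None, '4x'])
converted = pd.to_numeric(raw, errors='coerce')

# Missing after conversion but present in the input means the parse failed
failed = raw[converted.isna() & raw.notna()]

print("Converted values:")
print(converted)
print(f"\nValues that could not be parsed: {failed.tolist()}")
```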

### 8.2 Converting DataFrame Types Efficiently

In [None]:
# Generate a larger DataFrame for demonstrating batch conversions
import numpy as np
np.random.seed(42)
n_rows = 100000

large_df = pd.DataFrame({
    'id': range(n_rows),
    'int_col': np.random.randint(0, 100, size=n_rows),
    'float_col': np.random.rand(n_rows),
    'small_int': np.random.randint(0, 10, size=n_rows),
    'tiny_int': np.random.randint(0, 5, size=n_rows),
    'bool_col': np.random.choice([True, False], size=n_rows),
    'str_col': np.random.choice(['a', 'b', 'c', 'd', 'e'], size=n_rows),
    'date_col': pd.date_range('2021-01-01', periods=n_rows, freq='min')  # 100,000 daily steps would run past Timestamp.max (year 2262)
})

# Check initial memory usage
print(f"Initial DataFrame memory usage:")
large_df.info(memory_usage='deep')  # info() prints directly and returns None
initial_memory = large_df.memory_usage(deep=True).sum()
print(f"Total memory usage: {initial_memory / 1e6:.2f} MB")

# Convert individual columns to more efficient types
print("\nOptimizing memory usage...")

# 1. Integer downcasting
large_df['int_col_opt'] = pd.to_numeric(large_df['int_col'], downcast='integer')
large_df['small_int_opt'] = pd.to_numeric(large_df['small_int'], downcast='integer')
large_df['tiny_int_opt'] = pd.to_numeric(large_df['tiny_int'], downcast='integer')

# 2. Float downcasting
large_df['float_col_opt'] = pd.to_numeric(large_df['float_col'], downcast='float')

# 3. Convert strings to categorical
large_df['str_col_cat'] = large_df['str_col'].astype('category')

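# Check memory usage after optimization
# The *_opt columns were added next to the originals, so the frame as a whole grew;
# compare the optimized columns against the columns they would replace.
print("\nDataFrame memory usage after adding optimized columns:")
large_df.info(memory_usage='deep')  # info() prints directly and returns None
source_cols = ['int_col', 'small_int', 'tiny_int', 'float_col', 'str_col']
optimized_cols = ['int_col_opt', 'small_int_opt', 'tiny_int_opt', 'float_col_opt', 'str_col_cat']
source_memory = large_df[source_cols].memory_usage(deep=True, index=False).sum()
optimized_memory = large_df[optimized_cols].memory_usage(deep=True, index=False).sum()
print(f"Source columns: {source_memory / 1e6:.2f} MB")
print(f"Optimized columns: {optimized_memory / 1e6:.2f} MB")
print(f"Memory reduction: {(1 - optimized_memory / source_memory) * 100:.2f}%")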

# Memory usage for individual columns
print("\nMemory usage comparison for each column type:")
for col in ['int_col', 'int_col_opt', 'small_int', 'small_int_opt', 
            'tiny_int', 'tiny_int_opt', 'float_col', 'float_col_opt', 
            'str_col', 'str_col_cat']:
    mem_usage = large_df[col].memory_usage(deep=True) / 1e6
    print(f"{col}: {mem_usage:.2f} MB, dtype: {large_df[col].dtype}")

# Batch conversion function
def optimize_dtypes(df):
    result = df.copy()
    
    # Optimize integers
    int_cols = df.select_dtypes(include=['int64']).columns
    for col in int_cols:
        result[col] = pd.to_numeric(result[col], downcast='integer')
    
    # Optimize floats
    float_cols = df.select_dtypes(include=['float64']).columns
    for col in float_cols:
        result[col] = pd.to_numeric(result[col], downcast='float')
    
    # Convert strings with few unique values to categorical
    obj_cols = df.select_dtypes(include=['object']).columns
    for col in obj_cols:
        num_unique_values = len(df[col].unique())
        num_total_values = len(df[col])
        if num_unique_values / num_total_values < 0.5:  # If less than 50% are unique
            result[col] = result[col].astype('category')
    
    return result

# Create a new copy of the original DataFrame to optimize
df_original = large_df[['id', 'int_col', 'float_col', 'small_int', 'tiny_int', 'bool_col', 'str_col', 'date_col']]
df_optimized = optimize_dtypes(df_original)

# Compare memory usage
original_mem = df_original.memory_usage(deep=True).sum() / 1e6
optimized_mem = df_optimized.memory_usage(deep=True).sum() / 1e6

print(f"\nBatch optimization results:")
print(f"Original memory: {original_mem:.2f} MB")
print(f"Optimized memory: {optimized_mem:.2f} MB")
print(f"Memory reduction: {(1 - optimized_mem / original_mem) * 100:.2f}%")
print("\nOptimized dtypes:")
print(df_optimized.dtypes)
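
Optimizing after loading means the data has already been held in memory once at full width. When the schema is known in advance, the target types can be declared while reading instead. The sketch below uses an in-memory CSV string in place of a real file; the column names and values are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = (
    "id,size,price,date\n"
    "1,small,1.5,2021-01-01\n"
    "2,large,2.5,2021-01-02\n"
    "3,small,3.5,2021-01-03\n"
)

typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'id': 'int32', 'size': 'category', 'price': 'float32'},
    parse_dates=['date'],
)

print(typed.dtypes)
```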

## 9. Memory Usage and Efficiency

Let's analyze how different data types impact memory usage and computational efficiency.

In [None]:
# Comparing memory usage across different Python data structures
import sys

# Memory usage of Python primitive types
print("Memory usage of Python primitive types:")
print(f"int (0): {sys.getsizeof(0)} bytes")
print(f"int (1000): {sys.getsizeof(1000)} bytes")
print(f"int (10^20): {sys.getsizeof(10**20)} bytes")  # Python integers have arbitrary precision
print(f"float: {sys.getsizeof(0.0)} bytes")
print(f"bool: {sys.getsizeof(True)} bytes")
print(f"str (empty): {sys.getsizeof('')} bytes")
print(f"str (10 chars): {sys.getsizeof('a' * 10)} bytes")
print(f"str (100 chars): {sys.getsizeof('a' * 100)} bytes")

# Memory usage of Python containers
print("\nMemory usage of empty containers:")
print(f"list: {sys.getsizeof([])} bytes")
print(f"tuple: {sys.getsizeof(())} bytes")
print(f"dict: {sys.getsizeof({})} bytes")
print(f"set: {sys.getsizeof(set())} bytes")

# Memory usage of containers with elements
print("\nMemory usage of containers with elements:")
print(f"list (10 ints): {sys.getsizeof([0] * 10)} bytes")
print(f"tuple (10 ints): {sys.getsizeof((0,) * 10)} bytes")
print(f"dict (10 items): {sys.getsizeof({i: i for i in range(10)})} bytes")
print(f"set (10 ints): {sys.getsizeof(set(range(10)))} bytes")

# NumPy array memory usage
print("\nNumPy array memory usage:")
for dtype in [np.int8, np.int16, np.int32, np.int64, 
              np.float16, np.float32, np.float64]:
    arr = np.zeros(1000, dtype=dtype)
    print(f"1000 zeros as {dtype.__name__}: {arr.nbytes} bytes")

# Pandas Series memory usage
print("\nPandas Series memory usage (1000 elements):")
for dtype in ['int8', 'int16', 'int32', 'int64', 
              'float32', 'float64', 'object', 'category']:
    if dtype == 'category':
        # For categorical, create data with few unique values
        data = pd.Series(['A', 'B', 'C'] * 333 + ['D'])
        s = data.astype('category')
    elif dtype == 'object':
        # For object, use strings
        s = pd.Series(['string'] * 1000)
    else:
        # For numeric types, use zeros
        s = pd.Series(np.zeros(1000, dtype=dtype))
    
    memory = s.memory_usage(deep=True)
    print(f"Series with dtype {s.dtype}: {memory} bytes")

# Visual comparison of memory usage
print("\nVisualizing memory usage of different NumPy types:")
dtypes = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
memory_usages = [np.zeros(1000, dtype=dtype).nbytes for dtype in dtypes]

plt.figure(figsize=(10, 6))
plt.bar(dtypes, memory_usages)
plt.title('Memory Usage of 1000 zeros in NumPy Arrays')
plt.xlabel('Data Type')
plt.ylabel('Memory Usage (Bytes)')
plt.show()
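
sys.getsizeof() on a list only counts the list's own pointer array, not the integer objects it refers to. The rough comparison below adds the element sizes to get a fairer estimate and sets it against a NumPy array of the same values. The list total is approximate, since it treats every element as a separate object, and exact byte counts vary by Python version.

```python
import sys
import numpy as np

values = list(range(1000))

# List container plus each int object it points to (approximate)
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# The array stores raw 8-byte integers in one contiguous buffer
arr = np.array(values, dtype=np.int64)

print(f"Python list of 1000 ints: about {list_bytes} bytes")
print(f"NumPy int64 array:        {arr.nbytes} bytes of data")
```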

### 9.1 Performance Comparison

In [None]:
# Performance comparison of different data types
import time

# Helper function to measure execution time
def measure_time(func, *args, **kwargs):
    # perf_counter() is a monotonic, high-resolution clock intended for timing code
    start = time.perf_counter()
    result = func(*args, **kwargs)
    end = time.perf_counter()
    return result, end - start

# Create arrays of different types
size = 10_000_000
print(f"Creating arrays with {size:,} elements...")

# NumPy arrays with different dtypes
arrays = {
    'float64': np.random.random(size),
    'float32': np.random.random(size).astype(np.float32),
    # Request each integer dtype explicitly; randint's default integer width can vary by platform
    'int64': np.random.randint(0, 100, size, dtype=np.int64),
    'int32': np.random.randint(0, 100, size, dtype=np.int32),
    'int16': np.random.randint(0, 100, size, dtype=np.int16),
    'int8': np.random.randint(0, 100, size, dtype=np.int8),
}

# Performance test for addition
print("\nPerformance comparison for array addition:")
for name, arr in arrays.items():
    result, duration = measure_time(lambda: arr + arr)
    print(f"{name}: {duration:.6f} seconds")

# Performance test for multiplication
print("\nPerformance comparison for array multiplication:")
for name, arr in arrays.items():
    result, duration = measure_time(lambda: arr * 2)
    print(f"{name}: {duration:.6f} seconds")

# Performance test for mathematical functions
print("\nPerformance comparison for trigonometric operations:")
for name, arr in {k: v for k, v in arrays.items() if 'float' in k}.items():
    result, duration = measure_time(lambda: np.sin(arr))
    print(f"sin on {name}: {duration:.6f} seconds")

# Performance test for comparison operations
print("\nPerformance comparison for comparison operations:")
for name, arr in arrays.items():
    result, duration = measure_time(lambda: arr > arr.mean())
    print(f"{name}: {duration:.6f} seconds")

# Create Pandas DataFrames with different dtypes
print("\nCreating DataFrames with different data types...")
df_size = 1_000_000
df_float64 = pd.DataFrame({'value': np.random.random(df_size)})
df_float32 = pd.DataFrame({'value': np.random.random(df_size).astype(np.float32)})
df_int64 = pd.DataFrame({'value': np.random.randint(0, 100, df_size, dtype=np.int64)})
df_int32 = pd.DataFrame({'value': np.random.randint(0, 100, df_size, dtype=np.int32)})
df_category = pd.DataFrame({'value': pd.Categorical(np.random.choice(['A', 'B', 'C', 'D', 'E'], df_size))})
df_object = pd.DataFrame({'value': np.random.choice(['A', 'B', 'C', 'D', 'E'], df_size)})

# Performance test for filtering in Pandas
print("\nPerformance comparison for DataFrame filtering:")
dataframes = {
    'float64': df_float64,
    'float32': df_float32,
    'int64': df_int64,
    'int32': df_int32,
    'category': df_category,
    'object': df_object
}

for name, df in dataframes.items():
    if name in ['category', 'object']:
        result, duration = measure_time(lambda: df[df['value'] == 'A'])
    else:
        result, duration = measure_time(lambda: df[df['value'] > 50])
    print(f"{name}: {duration:.6f} seconds")

# Performance test for groupby operations in Pandas
print("\nPerformance comparison for DataFrame groupby:")
# Create DataFrames with 1M rows and 10 groups
df_size = 1_000_000
groups = 10
df_int_group = pd.DataFrame({
    'group': np.random.randint(0, groups, df_size),
    'value': np.random.random(df_size)
})
df_obj_group = pd.DataFrame({
    'group': [f'G{i}' for i in np.random.randint(0, groups, df_size)],
    'value': np.random.random(df_size)
})
df_cat_group = df_obj_group.copy()
df_cat_group['group'] = df_cat_group['group'].astype('category')

groupby_dfs = {
    'int_group': df_int_group,
    'obj_group': df_obj_group,
    'cat_group': df_cat_group
}

for name, df in groupby_dfs.items():
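    # observed=True groups only by categories that actually occur (no effect on non-categorical keys)
    result, duration = measure_time(lambda: df.groupby('group', observed=True)['value'].mean())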
    print(f"{name}: {duration:.6f} seconds")
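
A single timed run is sensitive to caching, memory allocation, and whatever else the machine is doing, so the numbers above can vary noticeably between executions. One common way to get steadier figures is to repeat each measurement and keep the best run. The sketch below uses the standard library's `timeit.repeat` for that; the helper name `best_time`, the array size, and the repeat counts are assumptions chosen to keep the example quick.

```python
import timeit
import numpy as np

def best_time(func, repeat=5, number=3):
    """Return the best average per-call time in seconds across several repeats."""
    # Each repeat calls func `number` times and reports the total for that batch
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number

rng = np.random.default_rng(42)
sample = rng.random(1_000_000)
candidates = {
    "float64": sample,
    "float32": sample.astype(np.float32),
}

# Bind each array as a default argument so every lambda keeps its own array
results = {
    name: best_time(lambda a=arr: a + a)
    for name, arr in candidates.items()
}

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per addition")
```

Taking the minimum rather than the mean is a deliberate choice here: slow runs are usually caused by interference from outside the code being measured, so the fastest run is the closest estimate of the operation's own cost.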

## 10. Handling Data Type Errors

Let's explore common data type conversion errors and how to handle them effectively.

In [None]:
# Common type conversion errors and handling strategies

# 1. ValueError: when the value cannot be converted to the target type
print("1. Handling ValueError in type conversions:")
strings_to_convert = ['123', '12.3', 'hello', '42x', '']

for s in strings_to_convert:
    # Simple try-except approach
    try:
        int_value = int(s)
        print(f"  Successfully converted '{s}' to int: {int_value}")
    except ValueError:
        print(f"  Cannot convert '{s}' to int")

print("\nCascading conversion attempts:")
for s in strings_to_convert:
    try:
        # Try int first
        value = int(s)
        print(f"  '{s}' converted to int: {value}")
    except ValueError:
        try:
            # If int fails, try float
            value = float(s)
            print(f"  '{s}' converted to float: {value}")
        except ValueError:
            # If both fail, keep as string
            print(f"  '{s}' kept as string")

# 2. Overflow: when the value is too large for the target type
print("\n2. Handling overflow in NumPy conversions:")
large_values = np.array([300, 500, 127, 128, 255, 256])
print(f"  Original values: {large_values}")

# With error handling
# Note: astype() between integer types wraps silently, and np.errstate
# does not affect it, so check the range explicitly before casting
int8_info = np.iinfo(np.int8)
try:
    if large_values.min() < int8_info.min or large_values.max() > int8_info.max:
        raise OverflowError("values fall outside the int8 range")
    int8_values = large_values.astype(np.int8)
    print(f"  Converted to int8: {int8_values}")
except OverflowError as e:
    print(f"  Overflow error occurred during conversion: {e}")

# Without error handling (astype wraps around silently)
int8_values = large_values.astype(np.int8)
print(f"  Converted to int8 (with wrap-around): {int8_values}")

# Safe conversion with clipping
print("\n  Safe conversion with clipping:")
int8_min, int8_max = np.iinfo(np.int8).min, np.iinfo(np.int8).max
clipped_values = np.clip(large_values, int8_min, int8_max).astype(np.int8)
print(f"  Clipped to int8 range: {clipped_values}")

# 3. Handling NA values in Pandas
print("\n3. Handling NA values in Pandas conversions:")
df = pd.DataFrame({
    'A': ['1', '2', None, '4'],
    'B': ['1.1', '2.2', 'NaN', '4.4'],
    'C': ['2021-01-01', None, 'not-a-date', '2021-01-04']
})
print(df)

print("\nDefault behavior (may raise errors):")
try:
    df['A_int'] = df['A'].astype(int)
except (ValueError, TypeError) as e:
    # A None entry raises TypeError; an unparseable string raises ValueError
    print(f"  Error: {e}")

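print("\nUsing pd.to_numeric with errors='coerce':")
# Coercion yields float64 because NaN is a float; cast to the nullable
# 'Int64' dtype to keep integers alongside missing values
df['A_int'] = pd.to_numeric(df['A'], errors='coerce').astype('Int64')
df['B_float'] = pd.to_numeric(df['B'], errors='coerce')
print(df)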

print("\nUsing pd.to_datetime with errors='coerce':")
df['C_date'] = pd.to_datetime(df['C'], errors='coerce')
print(df)

# 4. Custom conversion function with validation
print("\n4. Custom conversion function with validation:")

def safe_convert(value, target_type, default=None):
    """Safely convert a value to target_type, returning default if conversion fails."""
    try:
        if target_type == int:
            # For int, go through float to handle '3.0'; note this truncates '456.7' to 456
            return int(float(value)) if value is not None and value != '' else default
        elif target_type == float:
            return float(value) if value is not None and value != '' else default
        elif target_type == bool:
            if isinstance(value, str):
                return value.lower() in ('true', 'yes', 'y', '1', 't') if value else default
            return bool(value) if value is not None else default
        elif target_type == pd.Timestamp:
            return pd.to_datetime(value) if value is not None and value != '' else default
        else:
            return target_type(value) if value is not None and value != '' else default
    # OverflowError covers int(float('inf')), which the other two do not catch
    except (ValueError, TypeError, OverflowError):
        return default

# Test the safe_convert function
test_values = [
    '123', '456.7', 'True', 'yes', 'no', 'false',
    '2021-01-01', '01/01/2021', 'not-a-date', None, ''
]

print("Testing safe conversion function:")
for value in test_values:
    print(f"  Value: {value!r}")
    print(f"    To int: {safe_convert(value, int, default='INVALID')}")
    print(f"    To float: {safe_convert(value, float, default='INVALID')}")
    print(f"    To bool: {safe_convert(value, bool, default='INVALID')}")
    print(f"    To datetime: {safe_convert(value, pd.Timestamp, default='INVALID')}")

# 5. Apply custom conversion to a DataFrame
print("\n5. Applying safe conversion to a DataFrame:")
messy_data = {
    'id': ['1', '2', '3', '4', '5'],
    'value': ['10', '20.5', 'thirty', '40', None],
    'ratio': ['0.5', '60%', '0.75', '90%', 'unknown'],
    'flag': ['1', 'yes', 'false', 'no', None],
    'date': ['2021-01-01', '01/15/2021', '2021-03-01', 'invalid', '']
}

messy_df = pd.DataFrame(messy_data)
print("Original messy DataFrame:")
print(messy_df)

# Apply safe conversions
print("\nAfter safe conversions:")
cleaned_df = messy_df.copy()
cleaned_df['id'] = messy_df['id'].apply(lambda x: safe_convert(x, int))
cleaned_df['value'] = messy_df['value'].apply(lambda x: safe_convert(x, float))

# Custom parsing for percentages
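def parse_percentage(x):
    if isinstance(x, str) and x.strip().endswith('%'):
        # Reuse safe_convert so a malformed percentage returns None instead of raising
        value = safe_convert(x.strip().rstrip('%'), float)
        return value / 100 if value is not None else None
    return safe_convert(x, float)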

cleaned_df['ratio'] = messy_df['ratio'].apply(parse_percentage)
cleaned_df['flag'] = messy_df['flag'].apply(lambda x: safe_convert(x, bool))
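# Parse each date on its own so mixed formats do not share one inferred format
cleaned_df['date'] = pd.to_datetime(
    messy_df['date'].apply(lambda x: safe_convert(x, pd.Timestamp, default=pd.NaT))
)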

print(cleaned_df)
print(cleaned_df.dtypes)
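
Coercing with `errors='coerce'` keeps a pipeline running, but it also hides which inputs were rejected. A small audit step can recover that information by comparing missing values before and after conversion. The sketch below is one way to do it; the helper name `coerce_with_report` and the sample values are hypothetical. It assumes that any value present in the input but missing afterwards was a failed conversion, so a literal `'NaN'` string would also be reported.

```python
import pandas as pd

def coerce_with_report(series):
    """Convert to numeric with errors='coerce' and return the rejected inputs too."""
    converted = pd.to_numeric(series, errors="coerce")
    # Rejected = present in the input, but missing after conversion
    failed_mask = converted.isna() & series.notna()
    return converted, series[failed_mask]

raw = pd.Series(["10", "20.5", "thirty", "40", None])
numbers, rejected = coerce_with_report(raw)

print("Converted values:")
print(numbers)
print("\nInputs that could not be converted:")
print(rejected)
```

Logging or inspecting the rejected values before discarding them makes it easier to spot systematic problems, such as a column that uses words or units where numbers were expected.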

## Conclusion

In this notebook, we've surveyed data types and conversion techniques in Python, NumPy, and Pandas. Key takeaways include:

1. **Python's core data types** are flexible and form the foundation for data science operations, but they are less memory- and compute-efficient than the array types in specialized libraries.

2. **NumPy data types** offer memory-efficient storage and faster computation through vectorized operations. Choosing appropriate dtypes can significantly improve performance.

3. **Pandas data types** extend Python and NumPy types with specialized capabilities for tabular data, including categoricals, date/time support, and extension arrays.

4. **Type conversion** is a common operation that requires careful attention to avoid data loss, precision issues, or errors.

5. **Memory usage** varies significantly across different data types, and optimizing types can dramatically reduce memory consumption for large datasets.

6. **Error handling** is essential for robust data processing workflows, especially when dealing with real-world data that contains inconsistencies.

Understanding data types and conversion techniques is fundamental to efficient and effective data analysis and machine learning workflows.