## Data Type Constraints

Data type constraints define the permissible formats and values for each variable in a dataset. These constraints are not merely a matter of programming formality; they are essential to ensure data integrity, enable efficient computation, support robust statistical inference, and prevent logical or semantic errors during analysis.

Understanding and rigorously applying data type constraints is foundational in data science and engineering, as it directly impacts:

- **How data are interpreted and processed**
- **What operations are allowed**
- **How errors are detected and handled**
- **Storage efficiency and computational speed**

### Principal Categories of Data Types

#### 1. **Text Data (Strings)**
- **Python Type:** `str`
- **Examples:** Names, addresses, free-text fields
- **Constraint:** Arbitrary Unicode/textual data, not inherently orderable or arithmetically manipulable

#### 2. **Integers**
- **Python Type:** `int`
- **Examples:** Counts, IDs, discrete numeric features (number of transactions)
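- **Constraint:** Must be whole numbers. Python's built-in `int` has arbitrary precision, but fixed-width types such as the NumPy/pandas `int64` are bounded by their bit width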
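
As a rough illustration of the bounds mentioned above, the sketch below compares a plain Python `int` with the limits NumPy reports for fixed-width integer types. The variable names are only illustrative.

```python
import numpy as np

# A built-in Python int grows as needed, so this value is represented exactly.
big = 2 ** 100

# Fixed-width integer types have hard limits that NumPy can report.
int64_max = np.iinfo(np.int64).max
int8_max = np.iinfo(np.int8).max

# The large Python int would not fit in an int64 column.
fits_in_int64 = big <= int64_max

print(int64_max, int8_max, fits_in_int64)
```

Choosing the smallest width that safely covers the expected range saves memory, at the cost of having to check the range before casting.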

#### 3. **Floating Point Numbers (Decimals)**
- **Python Type:** `float`
- **Examples:** Measurements, continuous variables, financial data
- **Constraint:** Support for fractional values, susceptible to floating-point precision issues
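
The precision issue can be seen with a two-line experiment. The sketch below uses only the standard library; comparing with a tolerance (`math.isclose`) or switching to `decimal.Decimal` are two common workarounds, not the only ones.

```python
import math
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
total = 0.1 + 0.2
exact_match = (total == 0.3)             # exact comparison fails
close_match = math.isclose(total, 0.3)   # tolerance-based comparison succeeds

# Decimal arithmetic on string inputs stays exact, which suits currency amounts.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_match = (decimal_total == Decimal("0.30"))

print(exact_match, close_match, decimal_match)
```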

#### 4. **Booleans**
- **Python Type:** `bool`
- **Examples:** Binary attributes, flags, logical features
- **Constraint:** Only two possible values: `True` or `False`

#### 5. **Dates and Times**
- **Python Type:** `datetime` (stored as a `datetime64` dtype in pandas)
- **Examples:** Timestamps, dates of transactions; durations use the companion `timedelta` type
- **Constraint:** Must conform to valid date/time representations, including time zones and formats
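
A minimal sketch of checking this constraint with pandas, assuming dates arrive as `YYYY-MM-DD` strings (the sample values are made up). Passing `errors="coerce"` turns anything that is not a valid date into `NaT`, so invalid rows can be counted or inspected rather than stopping the import.

```python
import pandas as pd

raw = pd.Series(["2021-03-01", "2021-02-30", "not a date"])

# Invalid calendar dates and unparseable text both become NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
invalid_mask = parsed.isna()

# Attach an explicit time zone so later comparisons are unambiguous.
parsed_utc = parsed.dt.tz_localize("UTC")

print(parsed_utc)
print("invalid rows:", int(invalid_mask.sum()))
```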

#### 6. **Categories (Categorical Data)**
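- **Python Type:** `category` (a pandas dtype; plain Python has no built-in categorical type)
- **Examples:** Gender, marital status, blood type, country codes
- **Constraint:** Limited to a finite set of possible values (labels); can be nominal or ordinal; usually reduces memory use and speeds up operations when there are few distinct values relative to the number of rows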
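
The sketch below illustrates both points on a made-up column of size labels: the memory difference between plain strings and `category`, and how an ordered categorical makes comparisons meaningful. The labels and their order are assumptions chosen for the example.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"] * 1000)

# Compare memory use of plain strings with the categorical representation.
bytes_as_text = sizes.memory_usage(deep=True)
bytes_as_category = sizes.astype("category").memory_usage(deep=True)

# An ordinal variable: declare the order explicitly.
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
ordered_sizes = sizes.astype(size_dtype)

# Ordered categoricals support comparisons and min/max.
n_above_small = int((ordered_sizes > "small").sum())
largest = ordered_sizes.max()

print(bytes_as_text, bytes_as_category, n_above_small, largest)
```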


### Why Enforce Data Type Constraints?

- **Error Prevention:** Invalid data can be flagged early (e.g., attempting arithmetic on strings, or parsing nonsense dates)
- **Logical Clarity:** Clear constraints support unambiguous operations (e.g., cannot average string fields)
- **Efficiency:** Memory and CPU are used more effectively (categorical vs. object; integer vs. float)
- **Reproducibility:** Ensures consistent processing and interpretation across systems and analysts
- **Statistical Correctness:** Summary statistics, regression, and ML models require correct data types for valid results


### Setting and Checking Data Types in Python/Pandas

Pandas, as a Python data analysis library, is strongly type-aware and provides flexible tools for setting, checking, and converting data types.

```python
import pandas as pd

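# Read CSV while enforcing data types.
# Date columns are handled by parse_dates: read_csv rejects 'datetime64[ns]'
# in the dtype mapping. Use 'Int64' instead of 'int' if 'age' can be missing.
dtypes = {
    'age': 'int',
    'name': 'str',
    'is_member': 'bool',
    'membership_type': 'category'
}
df = pd.read_csv('data.csv', dtype=dtypes, parse_dates=['signup_date'])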

# Checking data types
print(df.dtypes)
```

* **Changing data types after reading:**
  Use `.astype()` for type conversion, `pd.to_datetime()` for dates, and `.astype('category')` for categorical data.
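
A small sketch of these three conversions on a toy frame whose columns were all read as text. The column names mirror the `read_csv` example above, and the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["34", "27", "45"],
    "signup_date": ["2020-01-15", "2021-06-30", "2019-11-02"],
    "membership_type": ["basic", "premium", "basic"],
})

# Text to integer
df["age"] = df["age"].astype("int64")

# Text to datetime (pd.to_datetime is a top-level function, not a Series method)
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d")

# Text to categorical
df["membership_type"] = df["membership_type"].astype("category")

print(df.dtypes)
```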

### Data Type Inference and Automatic Conversion

* **Pandas' automatic type inference:**
  On reading, pandas tries to infer column types, but this can be error-prone with mixed or ambiguous data (e.g., `'001'` read as `1`, or dates in unusual formats).
* **Explicit type setting is generally more reliable than inference.**
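
The leading-zero problem is easy to reproduce. In this sketch the CSV text and the `zip_code` column are hypothetical; the same file is read once with inference and once with an explicit string dtype.

```python
import io

import pandas as pd

csv_text = "zip_code,amount\n00123,10\n02139,20\n"

# With inference, the codes are parsed as integers and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With an explicit dtype, the codes stay intact as text.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": "string"})

print(inferred["zip_code"].tolist())
print(explicit["zip_code"].tolist())
```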


### Numeric vs. Categorical — A Crucial Distinction

Some columns may contain numbers but should be treated as categories (e.g., codes, labels, ordinal groups). Conversely, a categorical-looking field may be encoded as integers and require conversion for correct analysis.

* **Numeric coded categorical:** Should use `category` not `int`, as statistical summaries (mean, std) are not meaningful.
* **Categorical to numeric:** Sometimes category labels need to be mapped to numbers (label encoding, one-hot encoding).
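
Both directions can be sketched with pandas alone. The example assumes a numeric coding of 1, 2 and 3 for three rider types; the label names are illustrative, and `cat.codes` and `pd.get_dummies` stand in for label encoding and one-hot encoding respectively.

```python
import pandas as pd

codes = pd.Series([1, 2, 3, 3, 1], name="user_type")

# Numeric codes to categorical: map codes to labels, then store as category.
labels = {1: "free", 2: "pay_per_ride", 3: "subscriber"}
user_type = codes.map(labels).astype("category")

# Categorical to numeric, option 1: label encoding via the category codes.
label_encoded = user_type.cat.codes

# Categorical to numeric, option 2: one-hot encoding.
one_hot = pd.get_dummies(user_type, prefix="user_type")

print(user_type.cat.categories.tolist())
print(label_encoded.tolist())
print(one_hot.columns.tolist())
```

Label encoding implies an order among the categories, so one-hot encoding is often the safer choice for nominal variables in models that treat inputs as numbers.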

### Enforcing Data Type Constraints 

* **Validate data types upon data import** (e.g., `df.dtypes`, `df.info()`)
* **Convert and coerce data as required** (`.astype()`, `pd.to_datetime()`)
* **Assert data type expectations** using Python's `assert` statement for robust code:

  ```python
  # 'int' maps to a platform-dependent width, so check for any integer dtype
  assert pd.api.types.is_integer_dtype(df['age'])
  ```
* **Handle non-conforming values** (e.g., parse errors, missing values) through coercion or pre-cleaning.
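
One way to handle non-conforming values is to coerce them to missing and then decide what to do with the gaps. The sketch below uses `pd.to_numeric` with `errors="coerce"` on an invented `age` column and keeps track of which entries were genuinely invalid rather than merely missing.

```python
import pandas as pd

raw_age = pd.Series(["34", "27", "unknown", None])

# Unparseable entries become NaN instead of raising an error.
age = pd.to_numeric(raw_age, errors="coerce")

# Rows that had a value but failed to parse (as opposed to being missing already).
invalid = age.isna() & raw_age.notna()

# The nullable integer dtype keeps whole numbers while still allowing missing values.
age_int = age.astype("Int64")

print(age_int)
print("invalid entries:", int(invalid.sum()))
```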


### Data Type Constraints in Python

| Data type  | Example                       | Python / pandas type      |
| ---------- | ----------------------------- | ------------------------- |
| Text       | First name, address           | `str`                     |
| Integer    | Customer count, quantity sold | `int`                     |
| Decimal    | Temperature, exchange rate    | `float`                   |
| Binary     | Yes/No, is\_active            | `bool`                    |
| Dates      | Ship date, signup date        | `datetime`                |
| Categories | Gender, marital status        | `category` (pandas dtype) |




In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import polars as pl
import seaborn as sns

### Numeric data or ... ?
The `ride_sharing` dataset contains bicycle ride-sharing data from San Francisco: the start and end stations, the trip duration, and some user information for a bike-sharing service.

The `user_type` column records what kind of rider took the trip. In its numeric coding it takes the following values:

- `1` for free riders
- `2` for pay per ride
- `3` for monthly subscribers

In the copy of the data loaded below, `user_type` holds text labels (for example `Subscriber`), so pandas reads it as `object`; either way it is a categorical variable, not a quantity.

```python
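# Generate random dates between 2017-01-25 and 2020-01-17 for a ride_date column.
# Requires ride_sharing to be loaded first (see the read_csv cell that follows).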
start_date = pd.to_datetime('2017-01-25')
end_date = pd.to_datetime('2020-01-17')

# Calculate the number of days between start and end dates
date_range = (end_date - start_date).days

# Generate random dates for each row
random_days = np.random.randint(0, date_range + 1, size=len(ride_sharing))
ride_sharing["ride_date"] = start_date + pd.to_timedelta(random_days, unit='D')
```

In [15]:
ride_sharing = pd.read_csv("data/ride_sharing_new.csv")
ride_sharing.head()

| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender | tire_sizes | ride_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 12 minutes | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | Subscriber | 1959 | Male | 27.0 | 2018-09-22 |
| 1 | 1 | 1 | 24 minutes | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | Subscriber | 1965 | Male | 26.0 | 2019-06-15 |
| 2 | 2 | 2 | 8 minutes | 67 | San Francisco Caltrain Station 2 (Townsend St... | 23 | The Embarcadero at Steuart St | 3652 | Subscriber | 1993 | Male | 26.0 | 2019-01-01 |
| 3 | 3 | 3 | 4 minutes | 16 | Steuart St at Market St | 28 | The Embarcadero at Bryant St | 1883 | Subscriber | 1979 | Male | 29.0 | 2018-10-17 |
| 4 | 4 | 4 | 11 minutes | 22 | Howard St at Beale St | 350 | 8th St at Brannan St | 4626 | Subscriber | 1994 | Male | 27.0 | 2017-02-03 |


In [4]:
# Print summary statistics of user_type column
ride_sharing["user_type"].describe()

count          25760
unique             2
top       Subscriber
freq           23209
Name: user_type, dtype: object

In [5]:
# Convert user_type into categorical by assigning it the 'category' data type and store it in the user_type_cat column.
ride_sharing["user_type_cat"] = ride_sharing["user_type"].astype("category")

# Write an assert statement confirming the change
assert ride_sharing["user_type_cat"].dtype == "category"

In [6]:
# Print new summary statistics
ride_sharing["user_type_cat"].describe()

count          25760
unique             2
top       Subscriber
freq           23209
Name: user_type_cat, dtype: object

Another common data type problem is importing values that should be numeric as strings. Arithmetic then misbehaves: adding two such values concatenates them (`"12" + "24"` gives `"1224"`), and other operations either raise an error or repeat the string instead of producing a number.
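
The effect is easy to reproduce on a small made-up column of durations stored as text. Summing the strings concatenates them, while summing after conversion gives the expected total.

```python
import pandas as pd

# Durations stored as text, as they would be after a careless import.
durations = pd.Series(["12", "24", "8"], dtype="object")

# Summing strings joins them end to end.
text_sum = durations.sum()

# Converting first gives the arithmetic total.
numeric_sum = int(durations.astype("int64").sum())

print(repr(text_sum), numeric_sum)
```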

In [7]:
# Remove the "minutes" suffix and any surrounding whitespace; store the result in duration_trim.
# (.str.strip("minutes") strips any of the characters m, i, n, u, t, e, s, not the word itself.)
ride_sharing["duration_trim"] = ride_sharing["duration"].str.replace("minutes", "", regex=False).str.strip()

# Convert the trimmed strings to integers
ride_sharing["duration_time"] = ride_sharing["duration_trim"].astype("int")

# Write an assert statement making sure of conversion
assert ride_sharing["duration_time"].dtype == "int"
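
When trimming a unit suffix such as `minutes`, it helps to know that `str.strip("minutes")` treats its argument as a set of characters to remove from both ends, not as a literal word. The sketch below, on invented sample values, shows the leftover whitespace and one alternative, extracting the digits with a regular expression.

```python
import pandas as pd

duration = pd.Series(["12 minutes", "24 minutes", "8 minutes"])

# strip() removes the characters m, i, n, u, t, e, s from both ends,
# which leaves the space that preceded the word.
stripped = duration.str.strip("minutes")

# Alternative: pull out the leading run of digits and convert it.
extracted = duration.str.extract(r"(\d+)", expand=False).astype("int64")

print(stripped.tolist())
print(extracted.tolist())
```

The character-set behaviour is harmless for values like these, but it can eat into the data itself when a value begins or ends with one of the stripped letters.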

In [8]:
# Calculate and print the average ride duration (in minutes)
print(np.mean(ride_sharing["duration_time"]))

11.389052795031056


In [9]:
ride_sharing = ride_sharing.drop(["duration", "duration_trim"], axis=1)
ride_sharing.head()

| Unnamed: 0.1 | Unnamed: 0 | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender | tire_sizes | user_type_cat | duration_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | Subscriber | 1959 | Male | 27.0 | Subscriber | 12 |
| 1 | 1 | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | Subscriber | 1965 | Male | 26.0 | Subscriber | 24 |
| 2 | 2 | 67 | San Francisco Caltrain Station 2 (Townsend St... | 23 | The Embarcadero at Steuart St | 3652 | Subscriber | 1993 | Male | 26.0 | Subscriber | 8 |
| 3 | 3 | 16 | Steuart St at Market St | 28 | The Embarcadero at Bryant St | 1883 | Subscriber | 1979 | Male | 29.0 | Subscriber | 4 |
| 4 | 4 | 22 | Howard St at Beale St | 350 | 8th St at Brannan St | 4626 | Subscriber | 1994 | Male | 27.0 | Subscriber | 11 |


Bicycle tire sizes can be 26″, 27″, or 29″, so `tire_sizes` is conceptually a categorical variable, although `read_csv` loads it here as a float (`27.0`). In an effort to cut maintenance costs, the ride-sharing provider decided to cap the tire size at 27″.

In [10]:
# Convert the tire_sizes column to 'int' so it can be compared numerically.
ride_sharing["tire_sizes"] = ride_sharing["tire_sizes"].astype("int")

# Use .loc[] to set all values of tire_sizes above 27 to 27.
ride_sharing.loc[ride_sharing["tire_sizes"] > 27, "tire_sizes"] = 27

# Reconvert tire_sizes back to categorical
ride_sharing["tire_sizes"] = ride_sharing["tire_sizes"].astype("category")

# Print tire size description
print(ride_sharing["tire_sizes"].describe())

count     25760
unique        2
top          27
freq      13274
Name: tire_sizes, dtype: int64
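
The round trip through `int` matters because an unordered categorical does not support `>` comparisons. The sketch below shows this on a few made-up tire sizes and uses `clip(upper=27)` as a compact alternative to the `.loc[]` assignment above; it is a variation on the same idea, not the notebook's own code.

```python
import pandas as pd

tire_sizes = pd.Series([26, 27, 29, 27, 29]).astype("category")

# Ordering comparisons on an unordered categorical raise a TypeError.
try:
    tire_sizes > 27
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Convert to integers, cap at 27, then convert back to categorical.
capped = tire_sizes.astype("int64").clip(upper=27).astype("category")

print(comparison_failed)
print(capped.cat.categories.tolist())
```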
