# Data Exploration and Summary

## 4. Data Type Summary

In [12]:
import pandas as pd
import numpy as np

Understanding and managing **data types** is crucial because:

* It affects **memory usage**, **performance**, and **accuracy** of computations.
* Some methods work only with **specific dtypes** (e.g., `.mean()` is meant for numeric columns).
* It helps with **type conversions** when preparing data for modeling or visualization.
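As a rough illustration of the memory point, the sketch below stores the same values with two integer dtypes and compares the bytes used. The data is made up for this example and is not part of the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one million small integers (0-99).
values = np.arange(1_000_000) % 100

as_int64 = pd.Series(values, dtype="int64")
as_int8 = pd.Series(values, dtype="int8")   # 0-99 fits comfortably in int8

# index=False counts only the bytes used by the values themselves.
bytes_int64 = as_int64.memory_usage(index=False)
bytes_int8 = as_int8.memory_usage(index=False)

print(bytes_int64, bytes_int8)
```

The smaller dtype is only safe because every value fits in its range; choosing a dtype that is too narrow would corrupt the data rather than save memory.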

We'll cover:

1. `df.dtypes`
2. `df.select_dtypes(include=..., exclude=...)`
3. `df.astype()` – Type conversion
4. Type inference methods


### 1. `dtypes`

Shows the data type of each column in a DataFrame.

In [2]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Income': [50000.0, 60000.5, 70000.8],
    'Is_Employed': [True, False, True],
    'Join_Date': pd.to_datetime(['2021-01-01', '2021-02-15', '2021-03-30'])
})

print(df.dtypes)

Name                   object
Age                     int64
Income                float64
Is_Employed              bool
Join_Date      datetime64[ns]
dtype: object
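Because `df.dtypes` is itself a Series indexed by column name, it can be indexed and filtered like any other Series. A minimal sketch, using a small stand-in DataFrame rather than the one above:

```python
import pandas as pd

# Small stand-in DataFrame for illustration.
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "Income": [50000.0, 60000.5],
})

# Look up the dtype of a single column.
print(df_demo.dtypes["Income"])

# Filter the dtypes Series to get the names of all float64 columns.
float_cols = df_demo.dtypes[df_demo.dtypes == "float64"].index.tolist()
print(float_cols)
```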


### 2. `df.select_dtypes(include=..., exclude=...)`
Used to **filter columns** based on their data types.

In [3]:
df

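|   | Name    | Age | Income  | Is_Employed | Join_Date  |
| - | ------- | --- | ------- | ----------- | ---------- |
| 0 | Alice   | 25  | 50000.0 | True        | 2021-01-01 |
| 1 | Bob     | 30  | 60000.5 | False       | 2021-02-15 |
| 2 | Charlie | 35  | 70000.8 | True        | 2021-03-30 |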


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Name         3 non-null      object        
 1   Age          3 non-null      int64         
 2   Income       3 non-null      float64       
 3   Is_Employed  3 non-null      bool          
 4   Join_Date    3 non-null      datetime64[ns]
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 227.0+ bytes


In [5]:
# Select only numeric columns
df.select_dtypes(include='number')

|   | Age | Income  |
| - | --- | ------- |
| 0 | 25  | 50000.0 |
| 1 | 30  | 60000.5 |
| 2 | 35  | 70000.8 |


In [6]:
# Select only object (string-like) columns
df.select_dtypes(include='object')

|   | Name    |
| - | ------- |
| 0 | Alice   |
| 1 | Bob     |
| 2 | Charlie |


In [7]:
# Select only bool columns
df.select_dtypes(include='bool')

|   | Is_Employed |
| - | ----------- |
| 0 | True        |
| 1 | False       |
| 2 | True        |


In [8]:
# Select all except datetime
df.select_dtypes(exclude='datetime')

|   | Name    | Age | Income  | Is_Employed |
| - | ------- | --- | ------- | ----------- |
| 0 | Alice   | 25  | 50000.0 | True        |
| 1 | Bob     | 30  | 60000.5 | False       |
| 2 | Charlie | 35  | 70000.8 | True        |


In [9]:
# You can pass a list of types:
df.select_dtypes(include=['number', 'bool'])

|   | Age | Income  | Is_Employed |
| - | --- | ------- | ----------- |
| 0 | 25  | 50000.0 | True        |
| 1 | 30  | 60000.5 | False       |
| 2 | 35  | 70000.8 | True        |


Common dtype strings:

| Category         | String       |
| ---------------- | ------------ |
| Numeric types    | `'number'`   |
| Integer only     | `'int'`      |
| Floating-point   | `'float'`    |
| Boolean          | `'bool'`     |
| Object (strings) | `'object'`   |
| Datetime         | `'datetime'` |
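Besides strings, `select_dtypes` also accepts NumPy type classes, which can be handy for matching every integer or float width at once. The sketch below uses a made-up DataFrame with mixed widths; the column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with integer and float columns of different widths.
mixed = pd.DataFrame({
    "small": pd.Series([1, 2, 3], dtype="int8"),
    "big": pd.Series([1, 2, 3], dtype="int64"),
    "ratio": pd.Series([0.1, 0.2, 0.3], dtype="float32"),
    "flag": [True, False, True],
})

# np.integer / np.floating match all integer / float widths.
integer_cols = mixed.select_dtypes(include=np.integer).columns.tolist()
float_cols = mixed.select_dtypes(include=np.floating).columns.tolist()

# 'number' covers integers and floats, but not booleans.
numeric_cols = mixed.select_dtypes(include="number").columns.tolist()

print(integer_cols, float_cols, numeric_cols)
```

Taking `.columns.tolist()` from the result is a common way to get just the column names for later use.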

### 3. `df.astype()` – Changing Data Types
Used to explicitly convert a column to a specified data type.

In [14]:
# Convert Age to float
df['Age'] = df['Age'].astype(float)
df.Age.dtype

dtype('float64')

In [15]:
# Convert Age to float16 (half precision)
df['Age'] = df['Age'].astype(np.float16)
df.Age.dtype

dtype('float16')
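Small float types save memory but have a limited range and precision. As a hedged illustration (using the income values from the example DataFrame in a fresh Series), the sketch below shows what happens when values exceed what `float16` can hold:

```python
import numpy as np
import pandas as pd

income = pd.Series([50000.0, 60000.5, 70000.8])

# Largest finite value float16 can represent.
float16_max = float(np.finfo(np.float16).max)
print(float16_max)

# Values above that limit overflow to inf; others lose precision.
# errstate only silences NumPy's overflow warning for this cast.
with np.errstate(over="ignore"):
    income16 = income.astype(np.float16)

print(income16)
```

Ages fit easily in `float16`, but for columns such as incomes a wider type like `float32` or `float64` is usually the safer choice.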

In [16]:
# Convert Income to int (truncates the fractional part; it does not round)
df['Income'] = df['Income'].astype(int)
df['Income']

0    50000
1    60000
2    70000
Name: Income, dtype: int32
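Casting floats with `astype(int)` drops the fractional part, which differs from rounding, especially for negative numbers. A minimal sketch with made-up values:

```python
import pandas as pd

values = pd.Series([2.7, -2.7, 60000.5])

# astype drops the fractional part (truncation toward zero).
truncated = values.astype("int64")

# Rounding first gives the nearest integer instead.
rounded = values.round().astype("int64")

print(truncated.tolist())
print(rounded.tolist())
```

If nearest-integer behaviour is what you want, calling `.round()` before the cast is one option.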

In [17]:
# Convert bool to int
df.Is_Employed = df.Is_Employed.astype(int)
df.Is_Employed

0    1
1    0
2    1
Name: Is_Employed, dtype: int32

In [18]:
# Convert date to string
df['Join_Date'] = df['Join_Date'].astype(str)
df['Join_Date']

0    2021-01-01
1    2021-02-15
2    2021-03-30
Name: Join_Date, dtype: object

In [19]:
# Convert multiple columns using a dictionary
df = df.astype({'Age': 'int32', 'Income': 'float32'})
df

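|   | Name    | Age | Income  | Is_Employed | Join_Date  |
| - | ------- | --- | ------- | ----------- | ---------- |
| 0 | Alice   | 25  | 50000.0 | 1           | 2021-01-01 |
| 1 | Bob     | 30  | 60000.0 | 0           | 2021-02-15 |
| 2 | Charlie | 35  | 70000.0 | 1           | 2021-03-30 |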


In [20]:
# You can even convert to `category` type to reduce memory usage:
df['Name'] = df['Name'].astype('category')
df.Name

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: category
Categories (3, object): ['Alice', 'Bob', 'Charlie']

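**What is the `category` type?**

A `category` column stores each distinct value once and represents every row as a small integer code that points to it. This reduces memory when a column has few distinct values relative to its number of rows.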
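To see what a categorical column holds internally, the sketch below builds a hypothetical column with three distinct city names repeated many times, then inspects its categories, its integer codes, and its memory footprint.

```python
import pandas as pd

# Hypothetical column: three distinct values repeated 100,000 times each.
cities = pd.Series(["Paris", "Tokyo", "Lima"] * 100_000, dtype="object")
as_category = cities.astype("category")

# The distinct values are stored once (sorted by default).
categories = as_category.cat.categories.tolist()
print(categories)

# Each row holds only a small integer code pointing into the categories.
first_codes = as_category.cat.codes.head(3).tolist()
print(first_codes)

# deep=True includes the memory used by the Python string objects.
object_bytes = cities.memory_usage(deep=True, index=False)
category_bytes = as_category.memory_usage(deep=True, index=False)
print(object_bytes, category_bytes)
```

The saving shrinks as the number of distinct values approaches the number of rows, so `category` is most useful for low-cardinality columns.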

### 4. Inference & Conversion Utilities

These functions help **infer or convert** data types intelligently:

#### a. `pd.to_numeric()`

Used to safely convert strings/numbers to numeric (int/float) types.

In [24]:
pd.to_numeric(pd.Series(['100', '200.5', '300', None]), errors='coerce')

0    100.0
1    200.5
2    300.0
3      NaN
dtype: float64

`errors='coerce'` replaces any value that cannot be parsed with `NaN` instead of raising an error.
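A short sketch of `errors='coerce'` on input that contains a genuinely unparseable string; the sample values are made up for illustration.

```python
import pandas as pd

raw = pd.Series(["100", "200.5", "abc", ""])

# Unparseable entries become NaN instead of raising an error.
parsed = pd.to_numeric(raw, errors="coerce")
print(parsed)

# Counting NaN afterwards shows how many values were lost in conversion.
n_failed = int(parsed.isna().sum())
print(n_failed)
```

Checking the NaN count after coercion is a simple way to notice when more values failed to parse than expected.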

#### b. `pd.to_datetime()`

Parses strings to `datetime64`.

In [26]:
pd.to_datetime(['2023-01-01', '2023-02-10'])

DatetimeIndex(['2023-01-01', '2023-02-10'], dtype='datetime64[ns]', freq=None)
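When dates are not in ISO format, passing an explicit `format` avoids ambiguity between day and month. The sketch below assumes day-first strings (an assumption for this example) and then uses the `.dt` accessor on the parsed result.

```python
import pandas as pd

# Hypothetical day/month/year strings.
raw_dates = pd.Series(["15/02/2023", "30/03/2023"])

# An explicit format tells pandas exactly how to read each string.
dates = pd.to_datetime(raw_dates, format="%d/%m/%Y")
print(dates)

# Once parsed, date parts are available through the .dt accessor.
months = dates.dt.month.tolist()
print(months)
```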

#### c. `pd.to_timedelta()`
Converts string/integer time formats to `timedelta` objects.

In [27]:
pd.to_timedelta(['1 days', '5 days'])

TimedeltaIndex(['1 days', '5 days'], dtype='timedelta64[ns]', freq=None)
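For integer input, the `unit` argument says what each number means. A minimal sketch, with a made-up start time, that converts seconds to timedeltas and adds them to a timestamp:

```python
import pandas as pd

# Interpret the integers as seconds.
durations = pd.to_timedelta([90, 3600], unit="s")
print(durations)

# Timedeltas can be added directly to a timestamp.
start = pd.Timestamp("2023-01-01")
ends = start + durations
print(ends)
```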

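### Caveat with the `to_*` conversions

These conversions only succeed when the values are already in a form pandas can parse. By default (`errors='raise'`) an unparseable value raises an error, and with `errors='coerce'` it becomes `NaN` (or `NaT` for dates and timedeltas). The functions also return a new object, so the column is unchanged until the result is assigned back to it.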
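The behaviour with unparseable data can be sketched as follows. The DataFrame and its `amount` column are hypothetical and used only for this illustration.

```python
import pandas as pd

df_raw = pd.DataFrame({"amount": ["10", "20", "oops"]})

# With the default errors='raise', a bad value stops the conversion.
try:
    pd.to_numeric(df_raw["amount"])
    raised = False
except ValueError:
    raised = True
print(raised)

# Calling the function without assigning the result leaves the column as text.
pd.to_numeric(df_raw["amount"], errors="coerce")
still_text = not pd.api.types.is_numeric_dtype(df_raw["amount"])
print(still_text)

# Assigning the result back is what actually changes the column.
df_raw["amount"] = pd.to_numeric(df_raw["amount"], errors="coerce")
print(df_raw["amount"])
```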

### ✅ Summary Table

| Task                 | Method                          |
| -------------------- | ------------------------------- |
| View data types      | `df.dtypes`                     |
| Select by type       | `df.select_dtypes(include=...)` |
| Convert column types | `df.astype()`                   |
| Parse numeric safely | `pd.to_numeric()`               |
| Parse dates          | `pd.to_datetime()`              |
| Convert to timedelta | `pd.to_timedelta()`             |

---

### 🔍 Real-Life Scenarios

| Scenario                                | Solution                              |
| --------------------------------------- | ------------------------------------- |
| Convert string dates to datetime        | `pd.to_datetime(df['Date'])`          |
| Convert “Yes”/“No” to 1/0               | `df['Flag'].map({'Yes': 1, 'No': 0})` |
| Optimize memory for large string column | `astype('category')`                  |
| Convert price columns with \$/commas    | `str.replace()` + `to_numeric()`      |
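Two of the scenarios above can be sketched end to end. The column names and values here are invented for the example, and the regular expression assumes prices only contain `$` and `,` as extra characters.

```python
import pandas as pd

sales = pd.DataFrame({
    "Flag": ["Yes", "No", "Yes"],
    "Price": ["$1,200.50", "$99", "$3,000"],
})

# Map "Yes"/"No" to 1/0 (any other value would become NaN).
sales["Flag"] = sales["Flag"].map({"Yes": 1, "No": 0})

# Strip "$" and "," with a regex, then parse what remains as numbers.
cleaned = sales["Price"].str.replace(r"[$,]", "", regex=True)
sales["Price"] = pd.to_numeric(cleaned)

print(sales)
```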
