# Handling Mixed Variables

Real-world datasets often contain **mixed types of variables**, such as:
- Numerical variables (continuous or discrete)  
- Categorical variables (ordinal or nominal)  
- Binary or Boolean variables  

Handling each type correctly is essential for building effective and reliable machine learning models.

---

## Strategies for Handling Mixed Variables

### Separate Columns by Type

The first step is to identify and separate features based on their data types:
- Numerical columns  
- Categorical columns  
- Binary columns  

This ensures that appropriate preprocessing techniques are applied to each feature.
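
As a minimal sketch of this step, the snippet below groups columns by dtype with pandas `select_dtypes`. The toy frame and its column names (`age`, `fare`, `embarked`, `is_member`) are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "age": [22, 35, 58],
    "fare": [7.25, 71.28, 8.05],
    "embarked": ["S", "C", "S"],
    "is_member": [True, False, True],
})

# Boolean columns are selected on their own; "number" does not include bool.
binary_cols = toy.select_dtypes(include="bool").columns.tolist()
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything that is neither numeric nor boolean is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numerical_cols, categorical_cols, binary_cols)
```

Dtype-based selection is only a starting point: an integer column holding category codes, for example, would still need to be moved to the categorical group by hand.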

---

### Numerical Variables

Common preprocessing techniques for numerical features include:
- **Standardization** to rescale features to zero mean and unit variance  
- **Normalization (min-max scaling)** to rescale values to the range [0, 1]  
- **Binning / Discretization** to convert continuous values into categories  

The choice depends on the model and the data distribution.
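
The three techniques above could be sketched with scikit-learn as follows. The single-column array is made up for illustration, and the choice of two uniform-width bins is an arbitrary assumption.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

# Hypothetical single numerical feature, shaped (n_samples, 1).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# Min-max normalization: map the observed minimum to 0 and maximum to 1.
normalized = MinMaxScaler().fit_transform(X)

# Binning: two equal-width bins, returned as integer-like bin codes.
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
binned = binner.fit_transform(X)
```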

---

### Categorical Variables

Preprocessing techniques depend on whether the categories have an order:

- **Ordinal Encoding** for ordered categories (e.g., Low < Medium < High)  
- **One-Hot Encoding** for nominal categories with no inherent order  
- **Label Encoding** primarily for target variables or binary features  

Choosing the wrong encoding method can negatively impact model performance.
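
A small sketch of the first two encoders, assuming a hypothetical ordered column `priority` and an unordered column `color`. The category order passed to `OrdinalEncoder` is stated explicitly, since the default would sort the labels alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data; both column names are illustrative only.
cats = pd.DataFrame({
    "priority": ["Low", "High", "Medium", "Low"],
    "color": ["red", "blue", "green", "red"],
})

# Ordinal encoding with an explicit order: Low < Medium < High.
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
priority_encoded = ordinal.fit_transform(cats[["priority"]])

# One-hot encoding: one indicator column per distinct color.
onehot = OneHotEncoder(handle_unknown="ignore")
color_encoded = onehot.fit_transform(cats[["color"]]).toarray()
```

Setting `handle_unknown="ignore"` makes the encoder emit an all-zero row for categories that were not seen during fitting, rather than raising an error.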

---

### Mixed Preprocessing Approaches

When working with mixed feature types:
- Use **ColumnTransformer** to apply different transformations to different columns  
- Combine transformations using a **Pipeline**  

This approach:
- Ensures consistent preprocessing  
- Helps prevent data leakage, because the transformers are fit on the training data only  
- Improves reproducibility and maintainability  
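
One way to put this together is sketched below. The data, the column names, and the choice of `LogisticRegression` as the final estimator are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical features and target.
X = pd.DataFrame({
    "age": [22, 35, 58, 41, 19, 63, 30, 47],
    "fare": [7.3, 71.3, 8.1, 53.1, 8.5, 26.6, 13.0, 30.0],
    "embarked": ["S", "C", "S", "Q", "S", "C", "Q", "S"],
})
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Route each group of columns to its own transformer.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
])

# Chain preprocessing and the model so they are always fit together.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaler statistics and encoder categories are learned from X_train only.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Because the split happens before `fit`, nothing about the test rows influences the scaling statistics or the learned categories.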

---

### Special Cases

- **Binary variables** are typically encoded as 0 and 1  
- **Skewed numerical features** may benefit from power transformations or log transformations  
- **Outliers** are usually best handled before scaling, because the mean and variance used for standardization are sensitive to extreme values; the right treatment still depends on the use case  
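
The last two points can be illustrated with a short sketch. The `fare` values are invented, and the 1.5 × IQR clipping rule is just one common convention, assumed here for the example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed feature with one extreme value.
fare = np.array([5.0, 7.0, 8.0, 10.0, 12.0, 15.0, 500.0])

# Log transformation; log1p computes log(1 + x), so zeros are safe.
log_fare = np.log1p(fare)

# Power transformation (Yeo-Johnson also accepts zero and negative values).
pt = PowerTransformer(method="yeo-johnson")
power_fare = pt.fit_transform(fare.reshape(-1, 1)).ravel()

# Outlier handling by clipping to the 1.5 * IQR fences.
q1, q3 = np.percentile(fare, [25, 75])
iqr = q3 - q1
clipped = np.clip(fare, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
```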

---

## Summary

- Always identify feature types before preprocessing  
- Apply preprocessing techniques appropriate to each variable type  
- Use `ColumnTransformer` and `Pipeline` for clean, consistent workflows  
- Proper handling of mixed variables improves model performance and reliability  


In [1]:
%%capture
!pip install numpy pandas matplotlib scikit-learn seaborn


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [11]:
df=pd.read_csv('synthetic_titanic.csv')
df.sample(5)

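| Unnamed: 0 | PassengerId | Cabin | Ticket | Number | Survived |
|---|---|---|---|---|---|
| 82 | 83 | F33 | 347082 | A | 1 |
| 26 | 27 | A10 | A/5 21171 | A | 1 |
| 70 | 71 | E45 | 113803 | B | 0 |
| 3 | 4 | G6 | STON/O2. 3101282 | B | 1 |
| 16 | 17 | D56 | 347082 | 5 | 0 |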


### Number

In [12]:
# Convert numeric values; non-numeric entries become NaN.
# The NaNs make this a float column (5 becomes 5.0).
df['Number_numerical'] = pd.to_numeric(df['Number'], errors='coerce')

# Keep the non-numeric values in a separate categorical column
df['Number_categorical'] = np.where(df['Number_numerical'].isnull(), df['Number'], None)

df.head()

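| Unnamed: 0 | PassengerId | Cabin | Ticket | Number | Survived | Number_numerical | Number_categorical |
|---|---|---|---|---|---|---|---|
| 0 | 1 | G6 | STON/O2. 3101282 | 5 | 0 | 5.0 | |
| 1 | 2 | D56 | STON/O2. 3101282 | A | 1 | | A |
| 2 | 3 | F33 | 347082 | 3 | 0 | 3.0 | |
| 3 | 4 | G6 | STON/O2. 3101282 | B | 1 | | B |
| 4 | 5 | B22 | STON/O2. 3101282 | A | 0 | | A |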
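
If whole-number output is preferred over floats, one possible variation is to cast the numeric part to pandas' nullable `Int64` dtype. This is a sketch on a small hypothetical Series that mirrors the `Number` column, not a change to the cell above.

```python
import pandas as pd

# Hypothetical values mirroring the Number column.
number = pd.Series(["5", "A", "3", "B", "A"])

# Nullable Int64 keeps integers and represents missing entries as <NA>.
numeric_part = pd.to_numeric(number, errors="coerce").astype("Int64")

# Keep the original value only where the numeric conversion failed.
categorical_part = number.where(numeric_part.isna())
```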


### Cabin

In [14]:
# Capture the numeric part (the raw string avoids an invalid-escape warning for \d).
# The extracted values are still strings at this point.
df['Cabin_num'] = df['Cabin'].str.extract(r'(\d+)', expand=False)

# Capture the first letter
df['Cabin_cat'] = df['Cabin'].str[0]

df.head()


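| Unnamed: 0 | PassengerId | Cabin | Ticket | Number | Survived | Number_numerical | Number_categorical | Cabin_num | Cabin_cat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | G6 | STON/O2. 3101282 | 5 | 0 | 5.0 | | 6 | G |
| 1 | 2 | D56 | STON/O2. 3101282 | A | 1 | | A | 56 | D |
| 2 | 3 | F33 | 347082 | 3 | 0 | 3.0 | | 33 | F |
| 3 | 4 | G6 | STON/O2. 3101282 | B | 1 | | B | 6 | G |
| 4 | 5 | B22 | STON/O2. 3101282 | A | 0 | | A | 22 | B |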


### Ticket

In [16]:
# Extract the last whitespace-separated token of Ticket as a number
df['Ticket_num'] = df['Ticket'].str.split().str[-1]
df['Ticket_num'] = pd.to_numeric(df['Ticket_num'], errors='coerce', downcast='integer')

# Extract the first token as a category; purely numeric tickets have no prefix, so they become NaN
df['Ticket_cat'] = df['Ticket'].str.split().str[0]
df['Ticket_cat'] = np.where(df['Ticket_cat'].str.isdigit(), np.nan, df['Ticket_cat'])

df.head()

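| Unnamed: 0 | PassengerId | Cabin | Ticket | Number | Survived | Number_numerical | Number_categorical | Cabin_num | Cabin_cat | Ticket_num | Ticket_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | G6 | STON/O2. 3101282 | 5 | 0 | 5.0 | | 6 | G | 3101282 | STON/O2. |
| 1 | 2 | D56 | STON/O2. 3101282 | A | 1 | | A | 56 | D | 3101282 | STON/O2. |
| 2 | 3 | F33 | 347082 | 3 | 0 | 3.0 | | 33 | F | 347082 | |
| 3 | 4 | G6 | STON/O2. 3101282 | B | 1 | | B | 6 | G | 3101282 | STON/O2. |
| 4 | 5 | B22 | STON/O2. 3101282 | A | 0 | | A | 22 | B | 3101282 | STON/O2. |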
