# Mixed Variable Analysis — Titanic Dataset

This notebook demonstrates how to handle **mixed-type variables** — columns that contain a blend of numeric and categorical values within the same field (e.g. `'3'`, `'6'`, `'A'`).

Such columns appear frequently in real-world datasets and require special treatment before they can be fed into a machine learning model. The general strategy is to **split** each mixed column into two clean columns: one purely numerical and one purely categorical.

### Dataset
A custom subset of the Titanic dataset containing three mixed-type columns:

| Column | Description | Example values |
|--------|-------------|----------------|
| `number` | Companion count (mostly numeric, some alphabetic) | `'3'`, `'A'` |
| `Cabin`  | Deck letter + cabin number | `'C85'`, `'B42'` |
| `Ticket` | Prefix + serial number | `'A/5 21171'`, `'113803'` |

### Workflow
1. Import libraries  
2. Load & inspect data  
3. Analyse the `number` column  
4. Analyse the `Cabin` column  
5. Analyse the `Ticket` column  

### Import Libraries

In [None]:
import numpy as np        # numerical operations
import pandas as pd       # dataframe manipulation
import seaborn as sns     # statistical visualisation
import matplotlib.pyplot as plt  # plot rendering

### Load & Inspect the Data

We load a pre-processed Titanic CSV that already isolates the three mixed-type columns alongside the `Survived` target. A quick look confirms 891 rows and the presence of NaN values in `Cabin`.

In [None]:
df = pd.read_csv('titanic.csv')
df  # display the full dataframe (truncated by Jupyter)
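
Beyond eyeballing the displayed frame, the shape and the missing-value counts can be checked directly. The following is a minimal sketch that assumes a small hand-made frame, `toy`, standing in for `titanic.csv`; its values are illustrative and not taken from the file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic.csv (illustrative values only)
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'number': ['3', 'A', '1', '6'],
    'Cabin': [np.nan, 'C85', np.nan, 'B42'],
    'Ticket': ['A/5 21171', 'PC 17599', '113803', 'STON/O2. 3101282'],
})

print(toy.shape)             # (rows, columns)
missing = toy.isna().sum()   # number of missing values per column
print(missing)

# The mixed columns are stored as text, so pandas does not treat them as numeric
print(pd.api.types.is_numeric_dtype(toy['number']))
```

On the real data the same three calls on `df` should reproduce the row count and the `Cabin` gaps mentioned above.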

### The `number` Column

The `number` column represents the number of companions a passenger was travelling with. Most values are numeric strings (`'1'`–`'6'`) but at least one entry is alphabetic (`'A'`). This makes the column **object dtype** and prevents direct arithmetic.

In [None]:
# Inspect all unique values to understand the mix
df['number'].unique()
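
To see exactly which entries block arithmetic, one option is to flag the values that are not made up purely of digits. The sketch below assumes a small illustrative Series named `number` in place of `df['number']`; the values are invented for the example.

```python
import pandas as pd

# Illustrative stand-in for df['number'] (values are assumptions)
number = pd.Series(['1', '3', 'A', '6', '2'])

# str.isdigit() is True only for strings made up entirely of digits
is_numeric_text = number.str.isdigit()

# Entries that would fail a direct numeric conversion
non_numeric = number[~is_numeric_text]
print(non_numeric.tolist())
```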

### Visualise Value Distribution

A bar plot lets us see how common each value is and whether non-numeric values (`'A'`) appear frequently enough to matter.

In [None]:
# Count each distinct value, then plot value (x) against frequency (y)
counts = df['number'].value_counts()
sns.barplot(x=counts.index, y=counts.values)
plt.title('Distribution of Companion Count Values')
plt.xlabel('Companion Count Value')
plt.ylabel('Frequency')
plt.show()

### Extract the Numerical Part

`pd.to_numeric(..., errors='coerce')` attempts to convert every value to a number. Values that cannot be converted (like `'A'`) become `NaN`, effectively isolating the numeric signal in a new column.

In [None]:
# Coerce non-numeric values to NaN. Because NaN is a float, the column stays a float dtype;
# downcast='integer' only takes effect when no value had to be coerced.
df['number_numerical'] = pd.to_numeric(df['number'], errors='coerce', downcast='integer')
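
A minimal sketch of the coercion on a toy Series (illustrative values, not the notebook's data) shows why the result is a float column: `NaN` has no integer representation, so `downcast='integer'` leaves the dtype unchanged whenever a value was coerced.

```python
import pandas as pd

raw = pd.Series(['1', '3', 'A', '6'])  # illustrative values

# 'A' cannot be parsed, so it becomes NaN and the result stays a float dtype
with_nan = pd.to_numeric(raw, errors='coerce', downcast='integer')

# With no coerced values, the downcast to a small integer dtype succeeds
digits_only = raw[raw.str.isdigit()]
clean = pd.to_numeric(digits_only, errors='coerce', downcast='integer')

print(with_nan.dtype, clean.dtype)
```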

### Extract the Categorical Part

We use `np.where` to keep the original string value **only** where the numerical conversion produced `NaN` (i.e. the entry was truly non-numeric). Every other row is set to `NaN`.

In [None]:
# Retain the original string only where number_numerical is NaN (non-numeric entries)
df['number_Categorical'] = np.where(
    df['number_numerical'].isnull(),  # condition: conversion failed → was non-numeric
    df['number'],                     # true  → keep original string (e.g. 'A')
    np.nan                            # false → numeric values go to number_numerical
)
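
One way to sanity-check the split is to confirm that every row lands in exactly one of the two derived columns. The sketch below rebuilds the split on a toy frame; `toy` is a hypothetical stand-in for `df`, and its values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'number': ['1', 'A', '3', '6']})  # illustrative values

toy['number_numerical'] = pd.to_numeric(toy['number'], errors='coerce')
toy['number_Categorical'] = np.where(
    toy['number_numerical'].isnull(), toy['number'], np.nan
)

# Count how many of the two derived columns are filled in each row
filled = toy[['number_numerical', 'number_Categorical']].notna().sum(axis=1)
print(filled.tolist())
```

A count of 1 in every row means no value was lost or duplicated by the split. Rows whose original `number` is itself missing would show 0.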

**Checkpoint** — verify both new columns side-by-side:

In [None]:
df.head()

### The `Cabin` Column

Cabin identifiers follow the pattern **`<Letter><Number>`** (e.g. `C85`, `B42`). Many passengers have `NaN` cabins (deck assignment unknown). We split each value into:
- `Cabin_Categorical` → deck letter (e.g. `'C'`)
- `Cabin_Numerical`   → cabin number (e.g. `85`)


In [None]:
# Preview all unique cabin values to understand the pattern
df['Cabin'].unique()

### 4.1 Extract the Cabin Number

In [None]:
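# Regex \d+ captures the first run of digits in the string
# NaN cabins stay NaN; expand=False makes str.extract return a Series
cabin_digits = df['Cabin'].str.extract(r'(\d+)', expand=False)
# The captured digits are still strings, so convert them to numbers
df['Cabin_Numerical'] = pd.to_numeric(cabin_digits, errors='coerce')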
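
The sketch below applies the same pattern to a few illustrative cabin strings. The multi-cabin value `'C23 C25 C27'` is an assumption about what the column may contain (entries of this shape occur in the public Titanic data); it is included to show that `str.extract` keeps only the first match.

```python
import numpy as np
import pandas as pd

cabin = pd.Series(['C85', np.nan, 'B42', 'C23 C25 C27'])  # illustrative values

# expand=False returns a Series; the raw string avoids escape-sequence warnings
digits = cabin.str.extract(r'(\d+)', expand=False)

# Extracted digits are text, so convert them; missing cabins stay NaN
cabin_number = pd.to_numeric(digits, errors='coerce')
print(cabin_number.tolist())
```

If the later cabin numbers in a multi-cabin entry matter for the analysis, `str.findall` would be one way to collect all of them instead.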

### 4.2 Extract the Deck Letter

In [None]:
# The deck letter is always the first character of the cabin string
# str[0] returns NaN for missing cabin values — no extra handling needed
df['Cabin_Categorical'] = df['Cabin'].str[0]
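
As a quick illustration of the deck extraction, the sketch below tallies deck letters on a toy Series (illustrative values) and keeps the missing cabins visible in the count.

```python
import numpy as np
import pandas as pd

cabin = pd.Series(['C85', np.nan, 'B42', 'C123'])  # illustrative values

deck = cabin.str[0]  # first character; NaN stays NaN

# dropna=False keeps the missing cabins in the tally
deck_counts = deck.value_counts(dropna=False)
print(deck_counts)
```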

**Checkpoint** — verify Cabin split:

In [None]:
df.head()

## 5. The `Ticket` Column

Ticket values can be:
- Pure numbers: `'113803'`
- Prefix + number: `'A/5 21171'`, `'PC 17599'`, `'STON/O2. 3101282'`

We split on whitespace and extract:
- `Ticket_Categorical` → the prefix/label (if any)
- `Ticket_Numerical`   → the trailing serial number


### 5.1 Extract the Ticket Prefix (Categorical Part)

In [None]:
# Split on whitespace and take the FIRST token (.str methods pass NaN through safely)
df['Ticket_Categorical'] = df['Ticket'].str.split().str[0]

# If the first token is purely digits, the ticket has no prefix → set NaN
df['Ticket_Categorical'] = np.where(
    df['Ticket_Categorical'].str.isdigit(),  # condition: first token is a number
    np.nan,                                   # true  → no prefix exists
    df['Ticket_Categorical']                  # false → keep the alphabetic prefix
)
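
An equivalent formulation, sketched here on illustrative ticket strings, uses the pandas `Series.where` method in place of `np.where`. It is shown for comparison only and is not the notebook's own code.

```python
import pandas as pd

# Illustrative ticket strings
ticket = pd.Series(['A/5 21171', '113803', 'PC 17599', 'STON/O2. 3101282'])

first_token = ticket.str.split().str[0]

# Series.where keeps values where the condition is True and inserts NaN elsewhere
prefix = first_token.where(~first_token.str.isdigit())
print(prefix.tolist())
```

Because `Series.where` returns a pandas Series, the index of the original column is preserved without an extra assignment step.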

**Checkpoint** — verify Ticket categorical extraction:

In [None]:
df.head()

### 5.2 Extract the Ticket Serial Number (Numerical Part)

In [None]:
# Split on whitespace and take the LAST token (normally the serial number)
df['Ticket_Numerical'] = df['Ticket'].str.split().str[-1]

# Convert to numeric; non-convertible edge cases become NaN
df['Ticket_Numerical'] = pd.to_numeric(
    df['Ticket_Numerical'],
    errors='coerce',      # coerce any remaining non-numeric values to NaN
    downcast='integer'    # smallest integer dtype that fits; no effect if any NaN remains
)
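
The sketch below walks through the same steps on illustrative tickets. The `'LINE'` entry is an assumed edge case (tickets of this form occur in the public Titanic data) with no serial number at all. The final `astype('Int64')` step is an optional addition, not part of the notebook, for cases where an integer column that tolerates missing values is preferred.

```python
import pandas as pd

ticket = pd.Series(['A/5 21171', '113803', 'LINE'])  # illustrative values

last_token = ticket.str.split().str[-1]

# 'LINE' contains no digits, so it is coerced to NaN
serial = pd.to_numeric(last_token, errors='coerce')

# Optional: the nullable Int64 dtype stores integers alongside missing values
serial_int = serial.astype('Int64')
print(serial_int.tolist())
```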

## 6. Final Dataset Overview

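The original three mixed columns have each been decomposed into a numeric column and a categorical column. The resulting dataframe now contains **10 columns**: the `Survived` target, the three original mixed columns, and the six derived columns. The derived columns are the ones to carry forward into feature engineering; the categorical ones still need encoding before model training.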

> **Note:** The original mixed columns (`number`, `Cabin`, `Ticket`) are retained for reference. Drop them before training a model.

In [None]:
df.head()
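
Dropping the original mixed columns before training, as the note above recommends, can be sketched as follows. The frame `toy` and the name `model_df` are hypothetical; only the `number` columns are included to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy frame with the original and derived columns for `number` only (illustrative)
toy = pd.DataFrame({
    'Survived': [0, 1],
    'number': ['3', 'A'],
    'number_numerical': [3.0, np.nan],
    'number_Categorical': [np.nan, 'A'],
})

mixed_cols = ['number', 'Cabin', 'Ticket']

# errors='ignore' skips any listed column that is not present in the frame
model_df = toy.drop(columns=mixed_cols, errors='ignore')
print(model_df.columns.tolist())
```

`drop` returns a new frame, so the original columns remain available in `toy` for reference.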

### Summary Table

| Original Column | Numerical Derivative | Categorical Derivative |
|-----------------|----------------------|------------------------|
| `number` | `number_numerical` | `number_Categorical` |
| `Cabin`  | `Cabin_Numerical`  | `Cabin_Categorical` |
| `Ticket` | `Ticket_Numerical` | `Ticket_Categorical` |

---
*Notebook prepared as a demonstration of mixed-variable decomposition on the Titanic dataset.*