# String Operations for Text Data Cleaning with the Titanic Dataset

This notebook demonstrates how to clean and standardize textual data using Pandas string methods:
- `.str.lower()`, `.str.upper()`
- `.str.strip()`, `.str.replace()`, `.str.contains()`
- `.str.extract()`

We will apply them to the Titanic dataset, which contains messy and inconsistent text fields such as names, cabin numbers, and tickets.


## Step 1: Import Libraries and Load Titanic Dataset

In [1]:
import pandas as pd

# Load dataset
df = pd.read_csv('titanic_data.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",MALE,22.0,1.0,0.0,A/5 21171,7.25,,s
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",FEMALE,38.0,1.0,0.0,PC 17599,71.2833,C85,c
2,3,1,3,"Heikkinen, Miss. Laina",FEMALE,26.0,0.0,0.0,STON/O2. 3101282,7.925,,s
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",FEMALE,35.0,1.0,0.0,113803,53.1,C123,s
4,5,0,3,"Allen, Mr. William Henry",MALE,35.0,0.0,0.0,373450,8.05,,s


## Step 2: Preview Columns with Text

In [2]:
df[['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']].head()

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
0,"Braund, Mr. Owen Harris",MALE,A/5 21171,,s
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",FEMALE,PC 17599,C85,c
2,"Heikkinen, Miss. Laina",FEMALE,STON/O2. 3101282,,s
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",FEMALE,113803,C123,s
4,"Allen, Mr. William Henry",MALE,373450,,s
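Beyond eyeballing the first rows, counting distinct values per column is one way to spot inconsistent casing and missing entries. The sketch below uses a small hand-made frame (`sample` is hypothetical and only modeled on the preview above, not loaded from `titanic_data.csv`):

```python
import pandas as pd

# Hypothetical sample modeled on the preview above; None marks a missing value
sample = pd.DataFrame({
    "Sex": ["MALE", "FEMALE", "MALE", None],
    "Embarked": ["s", "c", "s", None],
})

# dropna=False keeps missing values in the tally so they are not overlooked
for column in ["Sex", "Embarked"]:
    counts = sample[column].value_counts(dropna=False)
    print(counts, end="\n\n")
```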


## Step 3: Use `.str.lower()` and `.str.upper()`

In [5]:
# Standardize to lowercase
df['Sex_lower'] = df['Sex'].str.lower()

# Standardize to uppercase
df['Embarked_upper'] = df['Embarked'].str.upper()

df[['Sex', 'Sex_lower', 'Embarked', 'Embarked_upper']].tail()


Unnamed: 0,Sex,Sex_lower,Embarked,Embarked_upper
292,MALE,male,c,C
293,FEMALE,female,s,S
294,MALE,male,s,S
295,MALE,male,c,C
296,,,,
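The last row above shows that missing values stay missing after case conversion. A minimal sketch of the same behavior, assuming a hypothetical series with mixed casing (these values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical values with mixed casing and one missing entry
sex = pd.Series(["MALE", "female", "Male", None])

# Lowercasing collapses the casing variants; the missing entry stays missing
normalized = sex.str.lower()

print(normalized.nunique())     # number of distinct non-missing values
print(normalized.isna().sum())  # number of missing values
```

Because the `.str` methods skip missing values rather than raising, case normalization can safely run before any decision about how to fill gaps.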


## Step 4: Use `.str.strip()` and `.str.replace()`

In [6]:
# Simulate a column with extra spaces; astype(str) turns missing values into the literal string 'nan'
df['Fake_Embarked'] = ' ' + df['Embarked'].astype(str) + ' '

# Strip the padding, then replace the literal 'nan' with 'Unknown'
df['Embarked_cleaned'] = df['Fake_Embarked'].str.strip().str.replace('nan', 'Unknown', case=False)

df[['Fake_Embarked', 'Embarked_cleaned']].tail()


Unnamed: 0,Fake_Embarked,Embarked_cleaned
292,c,c
293,s,s
294,s,s
295,c,c
296,,Unknown
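Replacing the substring `'nan'` works here because the port codes are single letters, but it would also rewrite any value that merely contains those characters. As an alternative sketch (not the approach used in this notebook), missing values can be left as real missing values and filled with `.fillna()`. The series below are hypothetical:

```python
import pandas as pd

# Hypothetical padded values; None stands in for a missing port
padded = pd.Series([" c ", " s ", None])

# strip() leaves missing values missing, so fillna() can label them directly
cleaned = padded.str.strip().fillna("Unknown")
print(cleaned.tolist())

# Substring replacement also rewrites values that merely contain 'nan'
risky = pd.Series(["Penang"]).str.replace("nan", "Unknown")
print(risky.tolist())
```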


## Step 5: Use `.str.contains()` to Detect Patterns

In [12]:
# Detect if name contains title 'Mrs'
df['Is_Mrs'] = df['Name'].str.contains('Mrs', case=False)

# Flag rows whose Cabin value contains a letter; missing values count as False
df['Has_Cabin_Info'] = df['Cabin'].str.contains('[A-Za-z]', na=False)

df[['Name', 'Is_Mrs', 'Cabin', 'Has_Cabin_Info']].head()

Unnamed: 0,Name,Is_Mrs,Cabin,Has_Cabin_Info
0,"Braund, Mr. Owen Harris",False,,False
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",True,C85,True
2,"Heikkinen, Miss. Laina",False,,False
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",True,C123,True
4,"Allen, Mr. William Henry",False,,False
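A plain `'Mrs'` pattern matches those three letters anywhere in the name, and a missing name yields a missing result unless `na=` is set. One possible tightening, sketched on a hypothetical series, anchors the pattern to a word boundary and the trailing period:

```python
import pandas as pd

# Hypothetical names in the dataset's "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    None,
])

# The pattern is treated as a regex by default: \b is a word boundary and
# \. matches a literal period. na=False maps missing names to False.
is_mrs = names.str.contains(r"\bMrs\.", na=False)
print(is_mrs.tolist())
```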


## Step 6: Extract Title from Name

In [14]:
# Extract the title with a regex: the run of letters between a space and a period
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(df['Title'])
df['Title'] = df['Title'].str.strip()

print(df[['Name', 'Title']].drop_duplicates().head(10))

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
       ... 
292      Mr
293    Miss
294      Mr
295      Mr
296     NaN
Name: Title, Length: 297, dtype: object
                                                Name   Title
0                            Braund, Mr. Owen Harris      Mr
1  Cumings, Mrs. John Bradley (Florence Briggs Th...     Mrs
2                             Heikkinen, Miss. Laina    Miss
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)     Mrs
4                           Allen, Mr. William Henry      Mr
5                                   Moran, Mr. James      Mr
6                            McCarthy, Mr. Timothy J      Mr
7                     Palsson, Master. Gosta Leonard  Master
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)     Mrs
9                Nasser, Mrs. Nicholas (Adele Achem)     Mrs
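Once titles are extracted, counting them is a quick check that the pattern behaves as intended. The sketch below repeats the extraction on a hypothetical handful of names (copied from the output above rather than loaded from the file) and uses `expand=False` so the result is a Series:

```python
import pandas as pd

# A few names in the same layout as the output above
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Allen, Mr. William Henry",
])

# Capture the letters that sit between a space and a period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Tally how often each title occurs
title_counts = titles.value_counts()
print(title_counts)
```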


## ✅ Summary


We covered:
- Case normalization with `.str.lower()` and `.str.upper()`
- Whitespace removal with `.str.strip()`
- Replacements with `.str.replace()`
- Pattern detection with `.str.contains()`
- Feature extraction from text (titles from names)

You can now apply these techniques to clean and engineer meaningful features from textual data!

