# Working with Strings and Dates in Data Wrangling

---

## Strings

A string is an **ordered sequence of characters**, so it can be indexed and sliced much like a list or array.

In [2]:
# Strings are an ordered sequence (list) of characters
a = 'Pinky and the Brain'
print(type(a))

<class 'str'>


In [3]:
# Pulls out the first element of the string
# Remember indexes start at 0!
b = a[0]
print(b)

P


In [4]:
# Negative indexes count from the end: -1 is the last character, -5 the fifth from last
print(a[-5])

B


In [5]:
# The number of characters in the string
# Equivalent to nchar() in R
len(a)

19

In [6]:
# First 5 elements
x = a[:5]
print(x)

Pinky


In [7]:
# Subsetting by index
y = a[14:]
print(y)

Brain


In [8]:
# Last 5 characters
z = a[-5:]
print(z)

Brain


In [9]:
# Characters at indexes 6, 7, and 8 (the stop index 9 is excluded)
w = a[6:9]
print(w)

and
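
The slices above all follow the general form `text[start:stop:step]`, where `stop` is excluded and any part may be omitted. A minimal sketch of the pattern, using an illustrative variable named `title` so it does not overwrite `a`:

```python
title = 'Pinky and the Brain'

first_word = title[:5]         # start defaults to 0
every_other = title[::2]       # a step of 2 keeps every second character
reversed_title = title[::-1]   # a step of -1 walks the string backwards
past_the_end = title[14:100]   # slices stop quietly at the end of the string

print(first_word)
print(every_other)
print(reversed_title)
print(past_the_end)
```

Unlike indexing a single position, slicing past the end of a string does not raise an error.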


---

## Concatenating Strings

In [10]:
# Concatenating Strings (like str_c() in R)
s = x + y
print(s)

s = x + ' ' + y
print(s)

PinkyBrain
Pinky Brain


---

## String Operations

In [11]:
# Check whether a substring (one or more characters) is in a string
s = 'Hello'
print('e' in s)
print('hi' not in s)

True
True


In [12]:
# Repeating strings (similar to rep() + paste() in R)
a = 'Ya'
print(5 * a + ' ' + s)

YaYaYaYaYa Hello


---

## Useful String Methods

Some helpful methods include:

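- `s.find(t)`: index of the first occurrence of `t` in `s`, or `-1` if `t` is not found
- `s.index(t)`: like `find()`, but raises a `ValueError` if `t` is not found
- `s.split([delim])`: split the string into a list of substrings (on whitespace if no delimiter is given)
- `s.isdigit()`: `True` if the string is non-empty and every character is a digit
- `s.isalpha()`: `True` if the string is non-empty and every character is a letter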

Even more: [Python String Methods Reference](https://www.w3schools.com/python/python_ref_string.asp)
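
The difference between `find()` and `index()` only shows up when the substring is missing. A small sketch of the methods listed above, with illustrative variable names:

```python
record = 'dnaN,DNA polymerase III'

# find() returns a position, or -1 when the substring is missing
found_at = record.find('DNA')      # the search is case sensitive
missing = record.find('RNA')
print(found_at, missing)

# index() raises ValueError instead of returning -1
try:
    record.index('RNA')
except ValueError as err:
    print('index() failed:', err)

# split() with no argument splits on any run of whitespace
words = 'DNA   polymerase III'.split()
print(words)

# isdigit() and isalpha() test every character in the string
print('2022'.isdigit(), '20.22'.isdigit())
print('Brain'.isalpha(), 'the Brain'.isalpha())
```

Because `find()` signals failure with `-1`, which is also a valid index, `index()` is often the safer choice when a missing substring should stop the program.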

In [13]:
# Stripping and case conversion
s = "      Pinky and the Brain       "

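print(s.strip())                    # Remove leading and trailing whitespace
print(s.lower())                    # Lowercase (whitespace is kept)
print(s.upper().strip())            # Uppercase, then strip
print(s.strip().replace(" ", "_"))  # Strip, then replace spaces with underscores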

Pinky and the Brain
      pinky and the brain       
PINKY AND THE BRAIN
Pinky_and_the_Brain


In [14]:
# Finding index positions
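print(s.find("i"))           # First "i"; the leading spaces count toward the index
print(s.strip().index("i"))  # First "i" after stripping whitespace
print(s.rfind("i"))          # Last "i", searching from the right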

7
1
23


---

## String Immutability

Strings are immutable: you cannot change a character in place.

In [15]:
s = 'Hello World'
print(s[1])

# You cannot do this:
# s[1] = 'a'  # TypeError

e


To change a character, build a new string from slices of the old one:

In [16]:
print(s[0] + 'a' + s[2:])

Hallo World
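
When several characters need to change, one common alternative is to convert the string to a list (which is mutable), edit the list, and join it back together. A minimal sketch, using an illustrative variable named `greeting`:

```python
greeting = 'Hello World'

# Direct assignment fails because strings are immutable
try:
    greeting[1] = 'a'
except TypeError as err:
    print('TypeError:', err)

# Lists are mutable, so edit a list of characters and join it back together
chars = list(greeting)
chars[1] = 'a'
rebuilt = ''.join(chars)

print(rebuilt)
print(greeting)   # the original string is unchanged
```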


---

## Replace

In [17]:
print(s.replace("Hello", "Hallo"))
s = s.replace("Hello", "Hallo")
print(s)

Hallo World
Hallo World


---

## String Conversion and Escape Codes

In [18]:
x = 42
print(type(x))
x = str(x)
print(type(x))

<class 'int'>
<class 'str'>


Special characters:
```
\n : new line
\t : tab
\\ : backslash
```

In [19]:
filename = 'C:\\mypath\\myfile.txt'
print(filename)

C:\mypath\myfile.txt
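
Doubling every backslash gets tedious for Windows paths. A raw string (prefix `r`) tells Python to leave backslashes alone. The sketch below compares the two spellings and shows `\t` and `\n` in action; the variable names are illustrative.

```python
escaped = 'C:\\mypath\\myfile.txt'
raw = r'C:\mypath\myfile.txt'
same_path = escaped == raw
print(same_path)

# \t inserts a tab and \n starts a new line
print('gene\tproduct\ndnaN\tDNA polymerase III')

# An escape code is a single character; in a raw string it stays two characters
lengths = (len('\n'), len(r'\n'))
print(lengths)
```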


---

## Split

In [20]:
line = 'dnaN,DNA polymerase III'
row = line.split(',')
print(row)

['dnaN', 'DNA polymerase III']


---

# Dates

Working with dates and times is essential for time-based joins and summaries.  
See resources:
- https://strftime.org/
- [Datacamp Cheat Sheet: Dates & Times](https://www.datacamp.com/cheat-sheet/working-with-dates-and-times-in-python-cheat-sheet)

In [21]:
import pandas as pd
import numpy as np
import datetime as dt

df = pd.DataFrame({
    'date': ['04/03/2022', '05/03/2022', '06/03/2022'],
    'date2': ['03/04/22', '03/05/22', '03/06/22'],
    'patients': [16, 19, 11]
})
df

         date     date2  patients
0  04/03/2022  03/04/22        16
1  05/03/2022  03/05/22        19
2  06/03/2022  03/06/22        11


---

## Converting to Datetime

In [None]:
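# By default pandas reads an ambiguous date such as 04/03/2022 as month/day/year.
# Pass format= or dayfirst=True if your data follows a different convention.
df['date'] = pd.to_datetime(df['date'])
df['date2'] = pd.to_datetime(df['date2'])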
print(df.dtypes)
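
Strings like `04/03/2022` are ambiguous, so it can help to state the format explicitly. The sketch below parses the same strings under two formats and uses `errors='coerce'` to turn unparseable values into `NaT`. The Series `raw_dates` and its `'not a date'` entry are made up for illustration.

```python
import pandas as pd

raw_dates = pd.Series(['04/03/2022', '05/03/2022', 'not a date'])

# The same text gives different dates depending on the stated format
month_first = pd.to_datetime(raw_dates.iloc[:2], format='%m/%d/%Y')
day_first = pd.to_datetime(raw_dates.iloc[:2], format='%d/%m/%Y')
print(month_first.dt.month.tolist())
print(day_first.dt.month.tolist())

# errors='coerce' replaces values that do not match the format with NaT
coerced = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')
print(coerced.isna().tolist())
```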

---

## Date Math

In [None]:
# Subtracting two timestamps returns a Timedelta (a length of time)
print(df['date'][1] - df['date'][0])
print(df['date'][2] - df['date'][1])
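
Date math also works on a whole column at once. The sketch below uses a stand-alone Series named `visits` (an assumption: it holds the three dates as they would be read month-first) to compute the gaps between consecutive rows and to shift every date by two weeks.

```python
import pandas as pd

visits = pd.Series(pd.to_datetime(['2022-04-03', '2022-05-03', '2022-06-03']))

# diff() subtracts each date from the one after it; the first row has no predecessor
gaps = visits.diff()
gap_days = gaps.dropna().dt.days.tolist()
print(gap_days)

# Adding a Timedelta shifts every date by the same amount
follow_up = visits + pd.Timedelta(days=14)
print(follow_up.dt.strftime('%Y-%m-%d').tolist())
```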

---

## Formatting Dates

In [None]:
print(df['date'].dt.strftime('%m/%d/%Y'))
print(df['date'].dt.strftime('%b %d, %Y'))
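
The same format codes work outside pandas. As a sketch with the standard-library `datetime` module, `strptime()` parses text into a date and `strftime()` formats a date back into text. The month and weekday names produced by `%b` and `%A` depend on the system locale.

```python
import datetime as dt

# strptime: string -> datetime, using the format the text is written in
parsed = dt.datetime.strptime('04/03/2022', '%m/%d/%Y')

# strftime: datetime -> string, in whatever format you need
iso_text = parsed.strftime('%Y-%m-%d')
print(iso_text)
print(parsed.strftime('%b %d, %Y'))   # abbreviated month name
print(parsed.strftime('%A'))          # full weekday name
```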

---

## Integer to Date

In [None]:
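# Treat the integer as a count of days since the chosen origin.
# Use print() so both results are shown, not just the last one.
print(pd.to_datetime(14667, unit='D', origin='unix'))        # days since 1970-01-01
print(pd.to_datetime(14667, unit='D', origin='1900-01-01'))  # days since 1900-01-01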
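
An integer date is just a day count added to an origin. As a cross-check with only the standard library (variable names are illustrative), the same arithmetic can be written out by hand:

```python
import datetime as dt

days = 14667

# origin + number of days = calendar date
unix_based = dt.date(1970, 1, 1) + dt.timedelta(days=days)
y1900_based = dt.date(1900, 1, 1) + dt.timedelta(days=days)

print(unix_based)
print(y1900_based)

# The two answers differ by exactly the gap between the two origins
print((unix_based - y1900_based).days)
```

Choosing the wrong origin shifts every date by decades, so it is worth checking a few converted values against dates you already know.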

---

## Getting Date Information

In [None]:
print(df['date'].dt.year)
print(df['date'].dt.month)
print(df['date'].dt.dayofweek)  # Monday=0, ..., Sunday=6

---

## Creating Date Ranges

In [None]:
# One entry per day; both the start and end dates are included
pd.date_range(start='2022-04-13', end='2022-05-05')
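
`pd.date_range()` is not limited to daily steps. The sketch below shows the `periods` and `freq` arguments as two common variations; the particular dates are only for illustration.

```python
import pandas as pd

# Daily range with both endpoints included
daily = pd.date_range(start='2022-04-13', end='2022-05-05')
print(len(daily))

# A fixed number of dates, spaced 7 days apart
weekly = pd.date_range(start='2022-04-13', periods=4, freq='7D')
print(weekly.strftime('%Y-%m-%d').tolist())

# Business days only (Monday through Friday)
weekdays = pd.date_range(start='2022-04-13', end='2022-04-19', freq='B')
print(weekdays.strftime('%Y-%m-%d').tolist())
```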

---

# Combining Strings and Dates

In [None]:
df['label'] = "Patients on " + df['date'].dt.strftime("%b %d, %Y")
df

**On Your Own:**
- Create a new date label column using the format “Month Year”
- Concatenate it with the number of patients to create readable summaries.

---

# Real Dataset Example: CTA Stops

In [None]:
cta_stops = pd.read_csv("../data/CTA_stops.csv")

# Clean string fields
cta_stops["STATION_NAME"] = cta_stops["STATION_NAME"].str.strip().str.title()

# Extract first letter of direction
cta_stops["direction_short"] = cta_stops["DIRECTION_ID"].str[0]

cta_stops.head()

**Discussion:**  
- Why is it important to clean strings before joining datasets?  
- How could inconsistent capitalization affect a merge?
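
As a starting point for the discussion, here is a toy example with two made-up tables. The station names, lines, and ride counts are hypothetical and do not come from the CTA file. A merge matches keys exactly, so stray spaces or different capitalization leave rows unmatched.

```python
import pandas as pd

# Hypothetical tables: the same stations, written inconsistently
stops = pd.DataFrame({
    'STATION_NAME': ['  clark/lake', 'BELMONT ', 'Howard'],
    'line': ['Blue', 'Red', 'Red'],
})
riders = pd.DataFrame({
    'STATION_NAME': ['Clark/Lake', 'Belmont', 'Howard'],
    'rides': [100, 200, 300],
})

# Before cleaning: only the exact match ('Howard') finds a partner
messy = stops.merge(riders, on='STATION_NAME', how='left')
unmatched_before = int(messy['rides'].isna().sum())
print(unmatched_before)

# After cleaning: every station matches
stops['STATION_NAME'] = stops['STATION_NAME'].str.strip().str.title()
clean = stops.merge(riders, on='STATION_NAME', how='left')
unmatched_after = int(clean['rides'].isna().sum())
print(unmatched_after)
```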

---

# Babynames Example

In [None]:
babynames = pd.read_csv("https://raw.githubusercontent.com/gjm112/DSCI401/main/data/babynames.csv")
babynames.head()

In [None]:
# Create first letter variable
babynames["first_letter"] = babynames["name"].str[0].str.upper()

# Group and summarize
letter_trends = (
    babynames.groupby(["year", "first_letter"])["n"]
    .sum()
    .reset_index()
)

# Pivot to see letters as columns
letter_pivot = letter_trends.pivot(index="year", columns="first_letter", values="n")
letter_pivot.head()
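
To see what the group-and-pivot steps do, it can help to run them on a table small enough to check by hand. The toy data below is made up. It only assumes the same column names (`year`, `name`, `n`) used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001, 2001],
    'name': ['Ava', 'alex', 'Mia', 'Ava', 'Max'],
    'n': [10, 5, 7, 12, 3],
})

# First letter, upper-cased so 'alex' and 'Ava' land in the same group
toy['first_letter'] = toy['name'].str[0].str.upper()

# Long format: one row per (year, first_letter)
totals = toy.groupby(['year', 'first_letter'])['n'].sum().reset_index()
print(totals)

# Wide format: one row per year, one column per letter
wide = totals.pivot(index='year', columns='first_letter', values='n')
print(wide)
```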

**Visualization:**

In [None]:
import matplotlib.pyplot as plt

top_letters = ["A", "J", "M", "S"]
for letter in top_letters:
    # Filter once per letter, then plot that letter's yearly totals
    subset = letter_trends[letter_trends["first_letter"] == letter]
    plt.plot(subset["year"], subset["n"], label=letter)

plt.legend()
plt.xlabel("Year")
plt.ylabel("Total Births")
plt.title("Most Popular Starting Letters Over Time")
plt.show()

---

# Joining with Dates

In [None]:
weather = pd.DataFrame({
    "date": pd.date_range("2022-04-01", "2022-04-03"),
    "temp": [60, 62, 58]
})

merged = pd.merge(df, weather, on="date", how="left")
merged

**Question:**  
What happens if `df['date']` has not been converted to a datetime type before joining?
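
One way to explore the question is to try it on a one-row example. In the sketch below the tables are made up for illustration. Recent pandas versions are expected to refuse to merge a text column with a datetime column and raise a `ValueError`, but the exact behavior may depend on your pandas version, so the code records the outcome instead of assuming it.

```python
import pandas as pd

patients = pd.DataFrame({'date': ['04/03/2022'], 'patients': [16]})
weather = pd.DataFrame({'date': pd.to_datetime(['2022-04-03']), 'temp': [58]})

# Attempt the merge while 'date' is still text on the left side
try:
    pd.merge(patients, weather, on='date', how='left')
    outcome = 'merged without error'
except ValueError as err:
    outcome = 'ValueError'
    print(err)
print(outcome)

# After converting, the keys have the same type and the rows line up
patients['date'] = pd.to_datetime(patients['date'], format='%m/%d/%Y')
fixed = pd.merge(patients, weather, on='date', how='left')
print(fixed['temp'].tolist())
```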

---

# Practice Summary

1. Clean string columns (remove spaces, fix case)
2. Convert date columns with `pd.to_datetime()`
3. Use `.dt` methods for date components
4. Join datasets after cleaning formats
5. Pivot between long and wide formats to tidy or reshape data
6. Visualize trends over time using cleaned data

---

**On Your Own:**
- Explore babynames to find the most common *first letters* per decade.  
- Visualize the trend over time.  
- Bonus: Add a line for your own first initial!