## Identifying Data Quality Issues

Data cleaning is a core part of data science and is often estimated to take 60–80% of project time. Before fixing data, you must first identify what’s wrong using a structured approach.

### Why Data Becomes “Dirty”

Data issues arise from:

- Human error
- System glitches
- Data integration problems
- Real-world inconsistencies

In [1]:
# imports 
import pandas as pd 

In [2]:
# load data 
df_titanic = pd.read_csv("titanic.csv")


In [3]:
# show head 
df_titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Unnamed: 4,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# check shape
df_titanic.shape 

(893, 13)

In [5]:
# check info
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Unnamed: 4   893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [6]:
df_titanic["Age"]

0      220.0
1       38.0
2       26.0
3       35.0
4       35.0
       ...  
888      NaN
889     26.0
890     32.0
891     32.0
892     27.0
Name: Age, Length: 893, dtype: float64

### The Main Categories of Data Quality Issues

#### 1. Missing Values

Empty cells, `NaN`, `NULL`, or placeholder values such as `"N/A"` or `"999"`.

**Problem:**  
Missing values can break calculations, distort averages, and introduce bias into analysis if not handled properly.
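Placeholder strings and sentinel numbers are not counted by `isna()`, so they have to be converted first. The following is a minimal sketch, assuming a hypothetical frame `raw` in which `"N/A"` and `999` stand for missing entries; the frame and the sentinel choices are illustrative and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "N/A" and 999 are placeholders, not real values
raw = pd.DataFrame({
    "city": ["Boston", "N/A", "Austin"],
    "age": [34, 999, 28],
})

# isna() finds nothing yet, because the placeholders look like ordinary data
missing_before = int(raw.isna().sum().sum())

# Convert each placeholder to NaN so pandas treats it as missing
cleaned = raw.copy()
cleaned["city"] = cleaned["city"].replace("N/A", np.nan)
cleaned["age"] = cleaned["age"].replace(999, np.nan)

missing_after = cleaned.isna().sum()
print(missing_before)
print(missing_after)
```

When the placeholders are known before loading, `pd.read_csv` also accepts an `na_values` argument that performs the same conversion at read time.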



In [7]:
# how to check for missing values 
df_titanic.isna().sum()


Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Unnamed: 4       0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          689
Embarked         2
dtype: int64

In [8]:
# isnull() is an alias of isna(), so the counts are identical
df_titanic.isnull().sum()

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Unnamed: 4       0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          689
Embarked         2
dtype: int64

#### 2. Duplicate Records

Exact duplicates or near-duplicate records with slight variations.

**Problem:**  
Duplicates inflate counts, distort summary statistics, and can lead to double-counting in business metrics.
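Exact duplicates and duplicates in a key field are checked slightly differently. The sketch below uses a small hypothetical `passengers` frame (the column names are assumptions for illustration) to show `duplicated()` on whole rows, on a key column via `subset`, and removal with `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical data: the second and third rows are identical
passengers = pd.DataFrame({
    "passenger_id": [1, 2, 2, 3],
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "fare": [7.25, 8.05, 8.05, 13.0],
})

# Count rows that repeat an earlier row in every column
exact_dupes = int(passengers.duplicated().sum())

# Show every row whose key value occurs more than once (keep=False marks all copies)
key_dupes = passengers[passengers.duplicated(subset=["passenger_id"], keep=False)]

# Keep the first copy of each duplicated row and renumber the index
deduped = passengers.drop_duplicates().reset_index(drop=True)

print(exact_dupes)
print(key_dupes)
print(len(deduped))
```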


In [9]:
# example check for duplicates 
df_titanic.duplicated().sum()

np.int64(1)

#### 3. Inconsistent Formatting

Different formats used for the same type of data, such as:
- Multiple date formats
- Inconsistent name casing
- Different phone number formats

**Problem:**  
Prevents accurate grouping, sorting, filtering, and matching of records.
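A quick way to see the effect of inconsistent casing and stray whitespace is to count distinct values before and after normalizing. This is a sketch on a hypothetical `cities` Series; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical column: the same city written three different ways
cities = pd.Series(["Boston", "boston ", "BOSTON", "Austin"])

# Each variant counts as its own category before cleaning
distinct_before = cities.nunique()

# Trim whitespace and lowercase so equal values compare equal
normalized = cities.str.strip().str.lower()
distinct_after = normalized.nunique()

print(distinct_before, distinct_after)
print(normalized.value_counts())
```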


In [10]:
# example: check for inconsistent formats
df_titanic["Pclass"].value_counts()

Pclass
3    470
1    201
2    173
?     49
Name: count, dtype: int64

In [11]:
df_titanic["Sex"].value_counts()

Sex
male      579
female    314
Name: count, dtype: int64

In [12]:
df_titanic["Cabin"].value_counts()

Cabin
G6             4
C23 C25 C27    4
B96 B98        4
F33            3
E101           3
              ..
E17            1
A24            1
C50            1
B42            1
C148           1
Name: count, Length: 147, dtype: int64

#### 4. Invalid Data Types

- Numbers stored as text  
- Dates stored as strings  
- Mixed units (e.g., `"25 years"`)

**Problem:**  
Prevents mathematical operations, proper sorting, and accurate analysis.
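Numbers stored as text can usually be converted with `pd.to_numeric`. The sketch below assumes a hypothetical class column where `"?"` marks an unknown entry, similar in spirit to the `Pclass` column shown in this notebook; the values themselves are illustrative.

```python
import pandas as pd

# Hypothetical text column: "?" cannot be parsed as a number
pclass_raw = pd.Series(["3", "1", "?", "2"])

# errors="coerce" turns anything unparseable into NaN instead of raising
pclass_num = pd.to_numeric(pclass_raw, errors="coerce")

# Count how many entries failed to convert
unparsed = int(pclass_num.isna().sum())

print(pclass_num)
print(unparsed)
```

Counting the `NaN` values after coercion is a simple way to find out how many entries need manual attention.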



In [13]:
# example 
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Unnamed: 4   893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [14]:
df_titanic["Pclass"].value_counts()

Pclass
3    470
1    201
2    173
?     49
Name: count, dtype: int64

#### 5. Structural Issues

- Poor column names (e.g., `"Unnamed: 3"`, `"Col_A"`)  
- Spaces or special characters in column names  
- Inconsistent naming conventions  

**Problem:**  
Makes code harder to write, maintain, and debug.

In [15]:
# example
df_titanic.head()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Unnamed: 4,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [16]:
df_titanic.columns

Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Unnamed: 4', 'Sex',
       'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')

#### 6. Outliers and Impossible Values

Examples:
- Negative ages  
- Unrealistic salaries  
- Future birth dates  

**Problem:**  
Skews statistical analysis and leads to misleading conclusions.
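Impossible values can be flagged with an explicit range check. The sketch below assumes that ages between 0 and 120 are plausible; that bound is an assumption chosen for illustration, not a rule from the dataset.

```python
import pandas as pd

# Hypothetical ages, including one far too large and one negative
ages = pd.Series([22.0, 38.0, 220.0, -4.0, 0.42])

# between() is inclusive on both ends by default
plausible = ages.between(0, 120)

# Keep only the values that fall outside the assumed range
suspicious = ages[~plausible]

print(suspicious)
```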


In [17]:
# example 
df_titanic.describe()

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Fare
count,893.0,893.0,893.0,716.0,893.0,893.0,893.0
mean,445.989922,446.992161,0.382979,29.975098,0.521837,0.380739,32.155318
std,257.913891,257.917707,0.486386,16.153539,1.101784,0.805355,49.64858
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0
25%,223.0,224.0,0.0,20.375,0.0,0.0,7.8958
50%,446.0,447.0,0.0,28.0,0.0,0.0,14.4542
75%,669.0,670.0,1.0,38.0,1.0,0.0,31.0
max,890.0,891.0,1.0,220.0,8.0,6.0,512.3292


In [18]:
df_titanic["Fare"].value_counts()

Fare
8.0500     43
13.0000    43
7.8958     38
7.7500     35
26.0000    31
           ..
13.8583     1
50.4958     1
5.0000      1
9.8458      1
10.5167     1
Name: count, Length: 248, dtype: int64

### The 5-Step Data Quality Assessment Framework

| Step | Check        | Method                           | Red Flag                 |
| ---- | ------------ | -------------------------------- | ------------------------ |
| 1    | Completeness | `df.info()`, `df.isnull().sum()` | >10% missing             |
| 2    | Uniqueness   | `df.duplicated().sum()`          | Duplicates in key fields |
| 3    | Validity     | `df.describe()`                  | Impossible values        |
| 4    | Consistency  | `df[col].value_counts()`         | Multiple formats         |
| 5    | Data Types   | `df.dtypes`                      | Numeric stored as object |
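The five checks in the table can be bundled into one helper. The function name `quality_report` and the `sample` frame are hypothetical; this is a sketch of how the checks might be collected, not a standard pandas utility.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Run the five checks from the table and collect the results."""
    text_cols = df.select_dtypes(exclude="number").columns
    return {
        # 1. Completeness: share of missing values per column, in percent
        "missing_pct": (df.isna().mean() * 100).round(1).to_dict(),
        # 2. Uniqueness: number of fully duplicated rows
        "duplicate_rows": int(df.duplicated().sum()),
        # 3. Validity: summary statistics for the numeric columns
        "numeric_summary": df.describe(),
        # 4. Consistency: category counts for the non-numeric columns
        "value_counts": {c: df[c].value_counts().to_dict() for c in text_cols},
        # 5. Data types: dtype of each column as a string
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Hypothetical sample with a missing age, a repeated row, and mixed casing
sample = pd.DataFrame({
    "age": [25.0, None, 25.0],
    "city": ["NYC", "nyc", "NYC"],
})
report = quality_report(sample)
print(report["missing_pct"])
print(report["duplicate_rows"])
print(report["value_counts"])
```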


## Handling Missing Values

Missing values are unavoidable in real-world datasets. The goal is not to eliminate them blindly, but to handle them strategically based on context.

There is no universal solution. Always ask: **Why is this data missing?**


### Types of Missing Data

#### 1. Missing Completely at Random (MCAR)

No identifiable pattern in the missingness.

**Example:**  
Survey responses skipped accidentally.

**Strategy:**  
- Safe to drop rows if missing data is small (e.g., <5%)  
- Or fill using averages



#### 2. Missing at Random (MAR)

Missing values are related to other observed variables.

**Example:**  
Older users less likely to provide email addresses.

**Strategy:**  
- Fill using group averages  
- Use related columns to inform imputation
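Filling with group averages can be sketched with `groupby(...).transform(...)`. The `people` frame and its column names are hypothetical; the idea is that each missing age is replaced by the mean age of the same passenger class.

```python
import pandas as pd

# Hypothetical data: one missing age in each class
people = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "age": [40.0, None, 20.0, 24.0, None],
})

# transform() returns one group mean per row, aligned with the original index
group_mean = people.groupby("pclass")["age"].transform("mean")

# Fill each gap with the mean of its own group
people["age_filled"] = people["age"].fillna(group_mean)

print(people)
```

Using `"median"` in place of `"mean"` follows the same pattern and is less sensitive to outliers.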



#### 3. Missing Not at Random (MNAR)

Missingness itself carries meaning.

**Example:**  
High earners refuse to report salary.

**Strategy:**  
- Requires domain knowledge  
- May require modeling the missingness
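When the missingness itself may carry meaning, one simple first step is to record it in a flag column before any filling happens. This is a sketch on a hypothetical `survey` frame and is not a full model of the missingness.

```python
import pandas as pd

# Hypothetical survey: two respondents did not report a salary
survey = pd.DataFrame({"salary": [52000.0, None, 61000.0, None]})

# Record which rows were missing before any value is filled in
survey["salary_missing"] = survey["salary"].isna()

n_missing = int(survey["salary_missing"].sum())
print(survey)
print(n_missing)
```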



#### Method 1: Dropping Missing Values with `.dropna()`

Removes rows or columns containing missing values.

```python
# Drop rows with ANY missing values
df_clean = df.dropna()

# Drop rows where ALL values are missing
df_clean = df.dropna(how='all')

# Drop rows missing specific columns
df_clean = df.dropna(subset=['Email', 'Phone'])

# Drop columns with missing values
df_clean = df.dropna(axis=1)

# Keep rows with at least 3 non-null values
df_clean = df.dropna(thresh=3)
```


In [19]:
df_titanic.isna().sum()

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Unnamed: 4       0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          689
Embarked         2
dtype: int64

In [20]:
# example
df_titanic_clean = df_titanic.dropna()
df_titanic_clean.info()

<class 'pandas.DataFrame'>
Index: 183 entries, 1 to 889
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   183 non-null    int64  
 1   PassengerId  183 non-null    int64  
 2   Survived     183 non-null    int64  
 3   Pclass       183 non-null    str    
 4   Unnamed: 4   183 non-null    str    
 5   Sex          183 non-null    str    
 6   Age          183 non-null    float64
 7   SibSp        183 non-null    int64  
 8   Parch        183 non-null    int64  
 9   Ticket       183 non-null    str    
 10  Fare         183 non-null    float64
 11  Cabin        183 non-null    str    
 12  Embarked     183 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 20.0 KB


In [21]:
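# drop() returns a new DataFrame; df_titanic itself is unchanged
df_titanic_no_cabin = df_titanic.drop("Cabin", axis=1)
df_titanic.info()  # still lists all 13 columns, including Cabin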

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Unnamed: 4   893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [22]:
df_titanic_no_cabin_clean = df_titanic_no_cabin.dropna()
df_titanic_no_cabin_clean.info()

<class 'pandas.DataFrame'>
Index: 714 entries, 0 to 892
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   714 non-null    int64  
 1   PassengerId  714 non-null    int64  
 2   Survived     714 non-null    int64  
 3   Pclass       714 non-null    str    
 4   Unnamed: 4   714 non-null    str    
 5   Sex          714 non-null    str    
 6   Age          714 non-null    float64
 7   SibSp        714 non-null    int64  
 8   Parch        714 non-null    int64  
 9   Ticket       714 non-null    str    
 10  Fare         714 non-null    float64
 11  Embarked     714 non-null    str    
dtypes: float64(2), int64(5), str(5)
memory usage: 72.5 KB


#### Method 2: Filling Missing Values with `.fillna()`

Replaces missing values while preserving rows.

```python
# Fill with a specific value
df['Status'] = df['Status'].fillna('Unknown')

# Fill with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill with median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Fill with mode
df['City'] = df['City'].fillna(df['City'].mode()[0])

# Forward fill (the method= argument of fillna is deprecated since pandas 2.1)
df['Price'] = df['Price'].ffill()

# Backward fill
df['Price'] = df['Price'].bfill()

# Different strategies per column
df = df.fillna({
    'Age': 0,
    'City': 'Unknown',
    'Score': df['Score'].mean()
})
```


In [23]:
# examples
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Unnamed: 4   893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          716 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


In [24]:
df_titanic["Age"] = df_titanic["Age"].fillna(df_titanic["Age"].median())

In [25]:
mid = df_titanic["Age"].median()
mid

np.float64(28.0)

In [26]:
df_titanic["Age"] = df_titanic["Age"].fillna(mid)  # pass the median value, not the .median method

In [27]:

df_titanic

Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Unnamed: 4,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.2500,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...
888,888,889,0,?,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.4500,,S
889,889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
890,890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q
891,890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q


In [28]:
df_titanic.info()

<class 'pandas.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   893 non-null    int64  
 1   PassengerId  893 non-null    int64  
 2   Survived     893 non-null    int64  
 3   Pclass       893 non-null    str    
 4   Unnamed: 4   893 non-null    str    
 5   Sex          893 non-null    str    
 6   Age          893 non-null    float64
 7   SibSp        893 non-null    int64  
 8   Parch        893 non-null    int64  
 9   Ticket       893 non-null    str    
 10  Fare         893 non-null    float64
 11  Cabin        204 non-null    str    
 12  Embarked     891 non-null    str    
dtypes: float64(2), int64(5), str(6)
memory usage: 90.8 KB


#### Choosing the Right Fill Strategy
**Numeric Data**

- Mean: Normally distributed data without outliers
- Median: Skewed data or presence of outliers
- 0 or -1: When missing indicates absence

**Categorical Data**

- Mode: Most frequent category
- "Unknown": When missing has meaning
- Forward/Backward fill: Time-series data only
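The mean-versus-median guidance can be checked on a small example. The `ages` Series below is hypothetical and contains one extreme value of 220, similar to the suspicious age seen earlier in this notebook.

```python
import pandas as pd

# Hypothetical ages with one outlier (220) and one missing value
ages = pd.Series([22.0, 26.0, 35.0, 38.0, 220.0, None])

# The outlier pulls the mean far above the typical age
mean_fill = ages.mean()

# The median is unaffected by how extreme the outlier is
median_fill = ages.median()

# Fill the gap with the more robust statistic
filled = ages.fillna(median_fill)

print(mean_fill, median_fill)
print(filled)
```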

| Situation              | Recommended Action | Reason         |
| ---------------------- | ------------------ | -------------- |
| <5% missing            | Drop rows          | Minimal impact |
| 5–15% missing          | Fill strategically | Preserve data  |
| >30% missing column    | Drop column        | Too unreliable |
| Critical field missing | Drop rows          | Cannot proceed |
| Optional field missing | Fill placeholder   | Keep record    |
| Time series data       | Forward/back fill  | Maintain order |
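The "drop column" row of the table can be sketched as a small helper. The function name `drop_sparse_columns` and the `sample` frame are hypothetical, and the 30% cut-off is simply the threshold quoted in the table, not a universal rule.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most max_missing."""
    missing_share = df.isna().mean()
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]

# Hypothetical sample data
sample = pd.DataFrame({
    "age": [22.0, None, 35.0, 40.0],     # 25% missing -> kept
    "cabin": [None, None, None, "C85"],  # 75% missing -> dropped
    "fare": [7.25, 71.28, 8.05, 53.1],   # 0% missing -> kept
})

trimmed = drop_sparse_columns(sample)
print(list(trimmed.columns))
```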


In [29]:
# example
mode = df_titanic["Embarked"].mode()

type(mode)

pandas.Series

In [30]:
print(mode)

0    S
Name: Embarked, dtype: str


### Column Renaming and Standardization

Clean and consistent column names improve readability, reduce errors, and make code easier to maintain. Poor column names create confusion and require extra handling in code.


#### Why Column Names Matter

Messy column names often include:
- Spaces
- Special characters
- Inconsistent casing
- Unclear wording

Example of problematic names:
- "First Name!"
- "Total Sales ($)"
- "E-mail Address"
- "Unnamed: 7"
- "Customer's Age (Years)"

These cause:
- Syntax issues
- Hard-to-read code
- Inconsistent references

Clean versions:
- `first_name`
- `total_sales`
- `email`
- `purchase_count`
- `customer_age`

Benefits:
- Easy to type
- No special characters
- Consistent structure
- Cleaner code (`df.first_name` instead of `df['First Name']`)


### Golden Rules of Column Naming

1. Use lowercase  
2. Replace spaces with underscores  
3. Remove special characters  
4. Keep names descriptive but concise  
5. Use consistent naming patterns (e.g., snake_case)
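The first three rules can be applied to every column at once through the `.str` accessor on `df.columns`. This is a minimal sketch on a hypothetical empty frame; the regular expression `\W+` (one or more non-word characters) is one possible choice, and rules 4 and 5 still need human judgment.

```python
import pandas as pd

# Hypothetical frame with messy column names and no rows
df = pd.DataFrame(columns=["First Name!", "Total Sales ($)", "E-mail Address"])

df.columns = (
    df.columns
    .str.strip()                           # remove surrounding whitespace
    .str.lower()                           # rule 1: lowercase
    .str.replace(r"\W+", "_", regex=True)  # rules 2-3: spaces and symbols become "_"
    .str.strip("_")                        # drop leading/trailing underscores
)

clean_names = list(df.columns)
print(clean_names)
```

The automatic result for the third column is `e_mail_address`; shortening it to `email` is a manual choice that would be made with `.rename()`.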


### Renaming Columns with `.rename()`

#### Rename Specific Columns

```python
df = df.rename(columns={
    'First Name': 'first_name',
    'E-mail': 'email',
    'Total Sales ($)': 'total_sales'
})
```


In [31]:
df_titanic.columns

Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Unnamed: 4', 'Sex',
       'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')

In [32]:
# example
df_titanic = df_titanic.rename(columns={
    "Unnamed: 0":"OriginalIndex",
    "Unnamed: 4": "Name",
    "Pclass":"PClass"
})


In [33]:
df_titanic.columns

Index(['OriginalIndex', 'PassengerId', 'Survived', 'PClass', 'Name', 'Sex',
       'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='str')

## Column Selection Overview

Selecting columns lets you focus on only the data you need, improving code readability, speed, and clarity.

This is especially useful in large datasets (20–100+ columns) where only a subset is needed for analysis.

### Single Column Selection

| Method | Syntax         | Notes |
|--------|----------------|-------|
| Dot notation | `df.column` | Cleaner, works if column has no spaces/special chars and doesn't start with a number |
| Bracket notation | `df['column']` | Flexible, works with spaces, special chars, or variable column names |

**Return type:**
- `df['age']` → **Series**  
- `df[['age']]` → **DataFrame**


In [34]:
df_titanic.head()

Unnamed: 0,OriginalIndex,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S
1,1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [35]:
# example
# method 1
df_titanic.Age


0      220.0
1       38.0
2       26.0
3       35.0
4       35.0
       ...  
888     28.0
889     26.0
890     32.0
891     32.0
892     27.0
Name: Age, Length: 893, dtype: float64

In [36]:
df_titanic["Age"]

0      220.0
1       38.0
2       26.0
3       35.0
4       35.0
       ...  
888     28.0
889     26.0
890     32.0
891     32.0
892     27.0
Name: Age, Length: 893, dtype: float64

In [37]:
df_titanic[["Age"]]

Unnamed: 0,Age
0,220.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
888,28.0
889,26.0
890,32.0
891,32.0



### Multiple Column Selection

```python
df[['col1', 'col2']]
important_cols = ['customer_id', 'total_sales']
df[important_cols]
```

- Always use double brackets.
- Can reorder columns or store lists in variables.
- Returns a DataFrame.


In [38]:
# example
new_df = df_titanic[["Age","Sex","Survived"]]
new_df

Unnamed: 0,Age,Sex,Survived
0,220.0,male,0
1,38.0,female,1
2,26.0,female,1
3,35.0,female,1
4,35.0,male,0
...,...,...,...
888,28.0,female,0
889,26.0,male,1
890,32.0,male,0
891,32.0,male,0


### Advanced Column Selection
| Technique       | Example                                                     | Use Case                                |
| --------------- | ----------------------------------------------------------- | --------------------------------------- |
| Exclude columns | `df.drop(columns=['email', 'phone'])`                       | Remove unnecessary columns              |
| By data type    | `df.select_dtypes(include=['int64','float64'])`             | Select numeric columns for calculations |
| Pattern-based   | `df.filter(like='sales')` <br> `df.filter(regex='^total_')` | Select columns by substring or regex    |


In [39]:
# example
df_titanic = df_titanic.drop(columns=["OriginalIndex"])
df_titanic.head()

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [40]:
df_titanic.select_dtypes(include=["str"])

Unnamed: 0,PClass,Name,Sex,Ticket,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,3,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,3,"Allen, Mr. William Henry",male,373450,,S
...,...,...,...,...,...,...
888,?,"Johnston, Miss. Catherine Helen ""Carrie""",female,W./C. 6607,,S
889,1,"Behr, Mr. Karl Howell",male,111369,C148,C
890,3,"Dooley, Mr. Patrick",male,370376,,Q
891,3,"Dooley, Mr. Patrick",male,370376,,Q


In [41]:
df_titanic.filter(like="P")

Unnamed: 0,PassengerId,PClass,Parch
0,1,3,0
1,2,1,0
2,3,3,0
3,4,1,0
4,5,3,0
...,...,...,...
888,889,?,2
889,890,1,0
890,891,3,0
891,891,3,0


### Pandas Boolean Indexing & Filtering

#### Basics of Boolean Indexing
- Boolean: **True/False** value for each row based on a condition.
- Basic filtering pattern:
```python
df[df['age'] > 30]          # Rows where age > 30
condition = df['age'] > 30
df[condition]               # Same as above
```


In [42]:
# example
df_titanic_old = df_titanic[df_titanic["Age"]>30]
df_titanic_old["Age"].min()

np.float64(30.5)

In [43]:
# example: keep only male passengers
df_titanic_men = df_titanic[df_titanic["Sex"]=="male"]
df_titanic_men

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,28.0,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q
891,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.7500,,Q


In [44]:
df_titanic_men["Fare"].mean()

np.float64(25.47156563039724)

### Combining Conditions

#### AND

In [45]:
# example
# wrap each condition in parentheses and join with & (Python's `and` raises an error on Series)
df_titanic_old_men = df_titanic_clean[(df_titanic_clean["Age"] > 30) & (df_titanic_clean["Sex"] == "male")]

#### OR

In [46]:
# example

#### NOT

In [47]:
# example
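The three operators can be sketched together on a small frame. The `people` frame is hypothetical; the point is that `&`, `|`, and `~` work element-wise on boolean Series and that each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical passengers
people = pd.DataFrame({
    "age": [22, 38, 54, 2],
    "sex": ["male", "female", "male", "male"],
})

# AND: both conditions must hold
older_men = people[(people["age"] > 30) & (people["sex"] == "male")]

# OR: at least one condition must hold
older_or_female = people[(people["age"] > 30) | (people["sex"] == "female")]

# NOT: invert a condition with ~
not_male = people[~(people["sex"] == "male")]

print(len(older_men), len(older_or_female), len(not_male))
```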

### Text Filtering

Use `.str.contains()`, `.str.startswith()`, and `.str.endswith()` to filter on text.

For case-insensitive matching, pass `case=False` to `.str.contains()`.
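Case-insensitive matching and missing names can be handled in the same call. The sketch below uses a hypothetical `names` Series; `na=False` makes missing entries count as non-matches so the result can be used directly as a filter.

```python
import pandas as pd

# Hypothetical names with inconsistent casing and one missing entry
names = pd.Series([
    "Braund, Mr. Owen",
    "Cumings, MRS. John",
    None,
    "Heikkinen, Miss. Laina",
])

# case=False ignores upper/lower case; na=False treats missing names as no match
is_mrs = names.str.contains("mrs", case=False, na=False)

# startswith() also accepts na=False for the same reason
starts_with_h = names.str.startswith("H", na=False)

print(names[is_mrs])
print(names[starts_with_h])
```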

In [48]:
# example
df_titanic[df_titanic["Name"].str.contains("Mrs")]

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S


In [49]:
df_titanic[df_titanic["Name"].str.startswith("A")]

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
25,26,1,?,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...",female,38.0,1,5,347077,31.3875,,S
40,41,0,3,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",female,40.0,1,0,7546,9.475,,S
49,50,0,3,"Arnold-Franchi, Mrs. Josef (Josefine Franchi)",female,18.0,1,0,349237,17.8,,S
68,69,1,3,"Andersson, Miss. Erna Alexandra",female,17.0,4,2,3101281,7.925,,S
91,92,0,3,"Andreasson, Mr. Paul Edvin",male,20.0,0,0,347466,7.8542,,S
114,115,0,3,"Attalah, Miss. Malake",female,17.0,0,0,2627,14.4583,,C
119,120,0,?,"Andersson, Miss. Ellis Anna Maria",female,2.0,4,2,347082,31.275,,S
144,145,0,2,"Andrew, Mr. Edgardo Samuel",male,18.0,0,0,231945,11.5,,S


### Filtering Against a List

Use `.isin()` to check multiple values:

In [50]:
# example
df_titanic[df_titanic["Age"].isin([20,30,40])]

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.05,,S
30,31,0,1,"Uruchurtu, Don. Manuel E",male,40.0,0,0,PC 17601,27.7208,,C
40,41,0,3,"Ahlin, Mrs. Johan (Johanna Persdotter Larsson)",female,40.0,1,0,7546,9.475,,S
79,80,1,3,"Dowdell, Miss. Elizabeth",female,30.0,0,0,364516,12.475,,S
91,92,0,3,"Andreasson, Mr. Paul Edvin",male,20.0,0,0,347466,7.8542,,S
113,114,0,3,"Jussila, Miss. Katriina",female,20.0,1,0,4136,9.825,,S
131,132,0,3,"Coelho, Mr. Domingos Fernandeo",male,20.0,0,0,SOTON/O.Q. 3101307,7.05,,S
157,158,0,3,"Corn, Mr. Harry",male,30.0,0,0,SOTON/OQ 392090,8.05,,S
161,162,1,2,"Watt, Mrs. James (Elizabeth ""Bessie"" Inglis Mi...",female,40.0,0,0,C.A. 33595,15.75,,S
178,179,0,2,"Hale, Mr. Reginald",male,30.0,0,0,250653,13.0,,S


### Range Filtering

Use `.between()` for cleaner range checks (both bounds are inclusive by default):

In [51]:
# example
df_titanic[df_titanic["Age"].between(18,31)]

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
5,6,0,3,"Moran, Mr. James",male,28.0,0,0,330877,8.4583,,Q
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
17,18,1,2,"Williams, Mr. Charles Eugene",male,28.0,0,0,244373,13.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,?,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Pandas `.query()` Method Cheat Sheet

### What is `.query()`?
- Cleaner, SQL-like way to filter rows.
- Write conditions as **strings** instead of messy brackets.
- Uses **and/or/not** instead of `&/|/~`.
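- Parentheses around individual comparisons are usually not required; add them when mixing `and` with `or` to make the grouping explicit.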


In [52]:
# example

### Basic Syntax

```python
df.query("condition_as_string")
# Examples:
older_customers = df.query("age > 30")
ny_customers = df.query("city == 'New York'")
filtered = df.query("age > 30 and city == 'Boston'")
coasts = df.query("city == 'NYC' or city == 'LA'")
```

In [53]:
# example

df_titanic.query("Age > 30 and Survived == 1 or SibSp == 1")

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
...,...,...,...,...,...,...,...,...,...,...,...,...
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C


### Combining Conditions

```python
df.query("state == 'CA' and total_sales > 500")
```

In [54]:
# example and

In [55]:
# example or

In [56]:
# example not
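The `and`, `or`, and `not` keywords can be sketched on a small frame. The `people` frame and its column names are hypothetical. Because `and` binds more tightly than `or`, the last query adds parentheses to make the intended grouping explicit.

```python
import pandas as pd

# Hypothetical passengers
people = pd.DataFrame({
    "age": [22, 38, 54, 2],
    "sex": ["male", "female", "male", "male"],
    "survived": [0, 1, 0, 1],
})

# and: both conditions must hold
older_men = people.query("age > 30 and sex == 'male'")

# or: at least one condition must hold
older_or_female = people.query("age > 30 or sex == 'female'")

# not: invert a condition
not_male = people.query("not (sex == 'male')")

# Mixed: parentheses show which conditions belong together
mixed = people.query("(age > 30 and survived == 1) or age < 5")

print(len(older_men), len(older_or_female), len(not_male), len(mixed))
```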

### Using Python Variables

Reference Python variables with `@`:

```python
min_age = 25
max_age = 40
target_city = 'Boston'

result = df.query("age >= @min_age and age <= @max_age")
city_filter = df.query("city == @target_city")
```

In [57]:
# example
age = 30
sib_sp = 1

df_titanic.query("Age > @age and Survived == 1 or SibSp == @sib_sp")

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
...,...,...,...,...,...,...,...,...,...,...,...,...
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0000,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C


### Operator Examples

| Operator | Meaning          | Example                             |
| -------- | ---------------- | ----------------------------------- |
| `>`      | Greater than     | `df.query("age > 30")`              |
| `<`      | Less than        | `df.query("price < 100")`           |
| `>=`     | Greater or equal | `df.query("score >= 75")`           |
| `<=`     | Less or equal    | `df.query("inventory <= 10")`       |
| `==`     | Equal            | `df.query("city == 'Boston'")`      |
| `!=`     | Not equal        | `df.query("status != 'cancelled'")` |


### Pandas `.iloc[]` (Position-Based Selection)

#### What is `.iloc[]`?
- Select rows and columns **by integer position**.
- Row 0 = first row, row 1 = second row… Python uses **0-based indexing**.
- Useful for: sampling, previewing, systematic extraction, train/test splits.


#### Selecting Rows

##### Single Row
```python
first_row = df.iloc[0]       # Row 0
second_row = df.iloc[1]      # Row 1
last_row = df.iloc[-1]       # Last row
```

In [58]:
# example
df_titanic.iloc[-1]

PassengerId                      887
Survived                           0
PClass                             2
Name           Montvila, Rev. Juozas
Sex                             male
Age                             27.0
SibSp                              0
Parch                              0
Ticket                        211536
Fare                            13.0
Cabin                            NaN
Embarked                           S
Name: 892, dtype: object

##### Multiple Rows (Slicing)
```python
df.iloc[start:stop:step]
```

In [59]:
# example start stop
df_titanic.iloc[5:11]

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,28.0,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S


In [60]:
# example: every second row from position 0 to 4, and every second column starting at position 3
df_titanic.iloc[0:5:2, 3::2]

Unnamed: 0,Name,Age,Parch,Fare,Embarked
0,"Braund, Mr. Owen Harris",220.0,0,7.25,S
2,"Heikkinen, Miss. Laina",26.0,0,7.925,S
4,"Allen, Mr. William Henry",35.0,0,8.05,S


#### Selecting Rows and Columns
```python
    df.iloc[rows, columns]
```
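
As a minimal sketch of the `rows, columns` form, the block below uses a small made-up DataFrame (the column names and values are assumptions, not part of the course data). Each selector can be a single position, a slice, or a list of positions.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the syntax
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dev"],
    "age": [31, 45, 27, 52],
    "city": ["Boston", "Austin", "Boston", "Denver"],
})

single_value = df.iloc[0, 1]        # row 0, column 1
top_left = df.iloc[0:2, 0:2]        # rows 0-1, columns 0-1 (stops are excluded)
picked = df.iloc[[0, 3], [0, 2]]    # rows 0 and 3, columns 0 and 2
last_column = df.iloc[:, -1]        # every row, last column

print(single_value)
print(picked)
```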

In [61]:
# example

#### `.iloc[]` vs `.loc[]`
| Method    | Selection Type         | Notes                                                |
| --------- | ---------------------- | ---------------------------------------------------- |
| `.iloc[]` | Integer positions      | Stop is EXCLUDED, position-based, good for samples   |
| `.loc[]`  | Labels / Boolean masks | Stop is INCLUDED, can filter by labels or conditions |
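
The difference in how the stop value is treated is easiest to see on a DataFrame whose index is not the default 0, 1, 2. The sketch below uses made-up letter labels (an assumption for illustration) so that positions and labels cannot be confused.

```python
import pandas as pd

# Hypothetical data with string labels as the index
scores = pd.DataFrame({"score": [90, 80, 70, 60]}, index=["a", "b", "c", "d"])

by_position = scores.iloc[0:2]                     # positions 0 and 1; stop excluded
by_label = scores.loc["a":"c"]                     # labels a, b and c; stop included
by_condition = scores.loc[scores["score"] >= 80]   # Boolean mask

print(by_position.index.tolist())
print(by_label.index.tolist())
print(by_condition.index.tolist())
```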


In [62]:
df_demo = pd.read_csv("demo_data.csv")
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight
0,1001,VIP,Laura,Johnson,31,2,106,$105,109,57,1916,2274,1298,1965-10-09,196,104
1,1002,new,Laura,Johnson,44,2,35,$143,134,70,585,1340,920,1981-08-24,174,95
2,1003,VIP,Sarah,Johnson,186,4,43,$104,35,65,1747,2238,1880,1997-05-23,182,58
3,1004,VIP,Sarah,Brown,92,2,51,50,77,58,683,994,625,1979-01-03,171,65
4,1005,Member,Sarah,Johnson,43,1,16,161,38,50,770,1827,970,1961-01-12,170,87


In [63]:
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   OrderID        15 non-null     int64
 1   CustomerType   15 non-null     str  
 2   FirstName      15 non-null     str  
 3   LastName       15 non-null     str  
 4   Price          15 non-null     int64
 5   Quantity       15 non-null     int64
 6   Cost           15 non-null     int64
 7   OriginalPrice  15 non-null     str  
 8   SalePrice      15 non-null     int64
 9   Score          15 non-null     int64
 10  Target         15 non-null     int64
 11  Sales          15 non-null     int64
 12  LastYearSales  15 non-null     int64
 13  Birthdate      15 non-null     str  
 14  Height         15 non-null     int64
 15  Weight         15 non-null     int64
dtypes: int64(11), str(5)
memory usage: 2.0 KB


In [64]:
df_demo.describe()

Unnamed: 0,OrderID,Price,Quantity,Cost,SalePrice,Score,Target,Sales,LastYearSales,Height,Weight
count,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0,15.0
mean,1008.0,94.066667,2.333333,53.466667,102.933333,70.333333,1251.266667,1348.466667,1197.933333,179.266667,88.666667
std,4.472136,51.091608,1.112697,34.022821,61.622661,12.436505,447.34494,662.209922,542.003752,14.230082,17.572977
min,1001.0,21.0,1.0,14.0,22.0,50.0,585.0,402.0,414.0,153.0,58.0
25%,1004.5,43.5,1.5,26.0,39.0,61.0,871.0,809.5,740.5,170.5,74.5
50%,1008.0,94.0,2.0,47.0,92.0,70.0,1226.0,1340.0,1172.0,182.0,92.0
75%,1011.5,121.5,3.0,76.0,168.0,77.0,1719.5,1765.5,1682.5,192.0,99.5
max,1015.0,186.0,4.0,113.0,187.0,93.0,1916.0,2334.0,1951.0,198.0,119.0


### Data Transformation Fundamentals 

Raw data is messy. Before analysis, you must transform it.

Wrong data types cause:
- ❌ Calculation errors
- ❌ Incorrect results
- ❌ Memory waste
- ❌ Broken time-series analysis


#### Common Data Type Issues

| Problem                | Symptom                    | Solution             |
| ---------------------- | -------------------------- | -------------------- |
| Numbers stored as text | Can't calculate sum/mean   | Convert to int/float |
| Dates stored as text   | Can't sort chronologically | Convert to datetime  |
| Float when int needed  | Unnecessary decimals       | Convert to int       |
| Repeated text values   | High memory usage          | Convert to category  |
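
The four fixes in the table can be sketched on a small made-up DataFrame. All column names and values below are assumptions chosen for illustration; `errors="coerce"` is used so that an unparseable amount becomes `NaN` instead of stopping the conversion.

```python
import pandas as pd

# Hypothetical raw data showing each problem from the table
raw = pd.DataFrame({
    "amount": ["10", "20", "n/a"],                        # numbers stored as text
    "joined": ["2024-03-01", "2023-12-15", "2024-01-20"], # dates stored as text
    "qty": [1.0, 2.0, 3.0],                               # float when int is needed
    "tier": ["gold", "silver", "gold"],                   # repeated text values
})

clean = raw.assign(
    amount=pd.to_numeric(raw["amount"], errors="coerce"),
    joined=pd.to_datetime(raw["joined"], format="%Y-%m-%d"),
    qty=raw["qty"].astype("int64"),
    tier=raw["tier"].astype("category"),
)

total = clean["amount"].sum()      # NaN is skipped by default
earliest = clean["joined"].min()

print(clean.dtypes)
print(total, earliest)
```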


#### Memory Impact

| Type          | Memory Per Value                   |
| ------------- | ---------------------------------- |
| int64         | 8 bytes                            |
| int32         | 4 bytes                            |
| int16         | 2 bytes                            |
| object (text) | Variable (large)                   |
| category      | Very efficient for repeated values |
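
These sizes can be checked with `Series.memory_usage()`. The sketch below builds two made-up Series (the values are assumptions) and compares the bytes used before and after changing the type; `deep=True` asks pandas to count the text itself rather than only the references to it.

```python
import pandas as pd

# 1,000 integers: 8 bytes each as int64, 2 bytes each as int16
ids = pd.Series(range(1000), dtype="int64")
bytes_int64 = ids.memory_usage(index=False)
bytes_int16 = ids.astype("int16").memory_usage(index=False)

# 3,000 text values drawn from only three distinct labels
labels = pd.Series(["VIP", "Member", "New"] * 1000)
bytes_text = labels.memory_usage(index=False, deep=True)
bytes_category = labels.astype("category").memory_usage(index=False, deep=True)

print(bytes_int64, bytes_int16)
print(bytes_text, bytes_category)
```

The exact text and category figures depend on the pandas version and string storage, so only the direction of the difference should be relied on.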


### The .astype() Method

```python
df['column'] = df['column'].astype(new_type)

# Multiple columns
df = df.astype({'Age': int, 'Salary': float})
```

In [65]:
# example: Weight peaks at 119 (see describe() above), so it fits in int8 (-128 to 127)
df_demo["Weight"] = df_demo["Weight"].astype("int8")

In [66]:
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   OrderID        15 non-null     int64
 1   CustomerType   15 non-null     str  
 2   FirstName      15 non-null     str  
 3   LastName       15 non-null     str  
 4   Price          15 non-null     int64
 5   Quantity       15 non-null     int64
 6   Cost           15 non-null     int64
 7   OriginalPrice  15 non-null     str  
 8   SalePrice      15 non-null     int64
 9   Score          15 non-null     int64
 10  Target         15 non-null     int64
 11  Sales          15 non-null     int64
 12  LastYearSales  15 non-null     int64
 13  Birthdate      15 non-null     str  
 14  Height         15 non-null     int64
 15  Weight         15 non-null     int8 
dtypes: int64(10), int8(1), str(5)
memory usage: 1.9 KB
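
Downcasting `Weight` to `int8` is safe here only because every value fits between -128 and 127 (the `describe()` output above reports a maximum of 119). A minimal sketch of checking that before converting, using sample values in the same range; the variable names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

weights = pd.Series([104, 95, 58, 65, 87, 119])   # sample values, int64 by default

# Look up the representable range of int8 instead of hard-coding it
limits = np.iinfo(np.int8)
fits_int8 = bool(weights.between(limits.min, limits.max).all())

# Convert only when every value fits
compact = weights.astype("int8") if fits_int8 else weights

# Alternative: let pandas choose the smallest integer type that holds the data
automatic = pd.to_numeric(weights, downcast="integer")

# A Series with a value outside the int8 range fails the same check
too_big = pd.Series([100, 300])
fits_too_big = bool(too_big.between(limits.min, limits.max).all())

print(limits.min, limits.max, fits_int8, fits_too_big)
```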


In [67]:
df_demo["Price"] = df_demo["Price"].astype("str")
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   OrderID        15 non-null     int64
 1   CustomerType   15 non-null     str  
 2   FirstName      15 non-null     str  
 3   LastName       15 non-null     str  
 4   Price          15 non-null     str  
 5   Quantity       15 non-null     int64
 6   Cost           15 non-null     int64
 7   OriginalPrice  15 non-null     str  
 8   SalePrice      15 non-null     int64
 9   Score          15 non-null     int64
 10  Target         15 non-null     int64
 11  Sales          15 non-null     int64
 12  LastYearSales  15 non-null     int64
 13  Birthdate      15 non-null     str  
 14  Height         15 non-null     int64
 15  Weight         15 non-null     int8 
dtypes: int64(9), int8(1), str(6)
memory usage: 1.9 KB


In [68]:
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight
0,1001,VIP,Laura,Johnson,31,2,106,$105,109,57,1916,2274,1298,1965-10-09,196,104
1,1002,new,Laura,Johnson,44,2,35,$143,134,70,585,1340,920,1981-08-24,174,95
2,1003,VIP,Sarah,Johnson,186,4,43,$104,35,65,1747,2238,1880,1997-05-23,182,58
3,1004,VIP,Sarah,Brown,92,2,51,50,77,58,683,994,625,1979-01-03,171,65
4,1005,Member,Sarah,Johnson,43,1,16,161,38,50,770,1827,970,1961-01-12,170,87


In [69]:
df_demo["Price"] * df_demo["Quantity"]

0             3131
1             4444
2     186186186186
3             9292
4               43
5           949494
6               21
7     105105105105
8           919191
9             3535
10    123123123123
11             120
12       115115115
13          139139
14             172
dtype: str

In [70]:
df_demo["Price"] = df_demo["Price"].astype("int")

In [71]:
df_demo["SalePrice"] = df_demo["Price"] * df_demo["Quantity"]
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight
0,1001,VIP,Laura,Johnson,31,2,106,$105,62,57,1916,2274,1298,1965-10-09,196,104
1,1002,new,Laura,Johnson,44,2,35,$143,88,70,585,1340,920,1981-08-24,174,95
2,1003,VIP,Sarah,Johnson,186,4,43,$104,744,65,1747,2238,1880,1997-05-23,182,58
3,1004,VIP,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87


In [72]:
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   OrderID        15 non-null     int64
 1   CustomerType   15 non-null     str  
 2   FirstName      15 non-null     str  
 3   LastName       15 non-null     str  
 4   Price          15 non-null     int64
 5   Quantity       15 non-null     int64
 6   Cost           15 non-null     int64
 7   OriginalPrice  15 non-null     str  
 8   SalePrice      15 non-null     int64
 9   Score          15 non-null     int64
 10  Target         15 non-null     int64
 11  Sales          15 non-null     int64
 12  LastYearSales  15 non-null     int64
 13  Birthdate      15 non-null     str  
 14  Height         15 non-null     int64
 15  Weight         15 non-null     int8 
dtypes: int64(10), int8(1), str(5)
memory usage: 1.9 KB


#### String Manipulation Basics


Real-world text data is messy:
- Extra spaces
- Inconsistent casing
- Mixed formatting
- Multiple values in one column

The `.str` accessor gives you vectorized string tools to clean entire columns at once.


#### Understanding the `.str` Accessor

##### 🔑 What is `.str`?

`.str` allows you to apply Python string methods to an entire pandas column.

```python
# ❌ This fails: a Series has no .upper() method (raises AttributeError)
df['Name'].upper()

# ✅ This works
df['Name'].str.upper()
```

In [73]:
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight
0,1001,VIP,Laura,Johnson,31,2,106,$105,62,57,1916,2274,1298,1965-10-09,196,104
1,1002,new,Laura,Johnson,44,2,35,$143,88,70,585,1340,920,1981-08-24,174,95
2,1003,VIP,Sarah,Johnson,186,4,43,$104,744,65,1747,2238,1880,1997-05-23,182,58
3,1004,VIP,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87


In [74]:
df_demo["CustomerType"].value_counts()

CustomerType
VIP       6
Member    3
New       3
new       1
member    1
 New      1
Name: count, dtype: int64

In [75]:
# example: title-case every value (note that "VIP" becomes "Vip")
df_demo["CustomerType"] = df_demo["CustomerType"].str.title()

In [76]:
df_demo["CustomerType"].value_counts()

CustomerType
Vip       6
New       4
Member    4
 New      1
Name: count, dtype: int64

#### Changing Text Case

Standardizing case prevents comparison errors.
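
A short sketch of the kind of comparison error meant here, using a made-up Series of city names (an assumption for illustration): an exact match misses values that differ only in case, while comparing on a lower-cased copy finds all of them.

```python
import pandas as pd

cities = pd.Series(["Boston", "boston", "BOSTON", "Austin"])

# Exact comparison is case-sensitive
naive_matches = int((cities == "boston").sum())

# Normalize the case first, then compare
normalized_matches = int((cities.str.lower() == "boston").sum())

print(naive_matches, normalized_matches)
```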

In [77]:
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight
0,1001,Vip,Laura,Johnson,31,2,106,$105,62,57,1916,2274,1298,1965-10-09,196,104
1,1002,New,Laura,Johnson,44,2,35,$143,88,70,585,1340,920,1981-08-24,174,95
2,1003,Vip,Sarah,Johnson,186,4,43,$104,744,65,1747,2238,1880,1997-05-23,182,58
3,1004,Vip,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87


In [78]:
# example
# assignment: change FirstName and LastName to title case

#### Stripping Whitespace

Invisible spaces break joins and comparisons.

In [79]:
# example
df_demo["CustomerType"] = df_demo["CustomerType"].str.strip()
df_demo["CustomerType"].value_counts()

CustomerType
Vip       6
New       5
Member    4
Name: count, dtype: int64
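
The two cleaning steps used so far, `.str.title()` and `.str.strip()`, can also be chained into one expression. The sketch below applies them to a made-up Series that mimics the messy `CustomerType` values (the sample values are assumptions, not the actual column).

```python
import pandas as pd

customer_types = pd.Series(["VIP", "new", " New", "member", "Member "])

# Strip surrounding whitespace first, then standardize the case
clean = customer_types.str.strip().str.title()
counts = clean.value_counts()

print(clean.tolist())
print(counts)
```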

#### Replacing and Removing Text

In [80]:
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   OrderID        15 non-null     int64
 1   CustomerType   15 non-null     str  
 2   FirstName      15 non-null     str  
 3   LastName       15 non-null     str  
 4   Price          15 non-null     int64
 5   Quantity       15 non-null     int64
 6   Cost           15 non-null     int64
 7   OriginalPrice  15 non-null     str  
 8   SalePrice      15 non-null     int64
 9   Score          15 non-null     int64
 10  Target         15 non-null     int64
 11  Sales          15 non-null     int64
 12  LastYearSales  15 non-null     int64
 13  Birthdate      15 non-null     str  
 14  Height         15 non-null     int64
 15  Weight         15 non-null     int8 
dtypes: int64(10), int8(1), str(5)
memory usage: 1.9 KB


In [81]:
# Converting at this point would raise a ValueError, because some values still contain "$":
# df_demo["OriginalPrice"] = df_demo["OriginalPrice"].astype("int")

In [82]:
# example: remove the literal "$" (regex=False treats it as plain text, not as a regex anchor)
df_demo["OriginalPrice"] = df_demo["OriginalPrice"].str.replace("$", "", regex=False)

In [83]:
df_demo["OriginalPrice"] = df_demo["OriginalPrice"].astype("int")
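
The same replace-then-convert pattern extends to other symbols. The sketch below works on a made-up price Series that also contains a thousands separator (an assumption; the `OriginalPrice` column shown above only has `$`). Passing `regex=False` keeps `$` from being read as a regular-expression anchor.

```python
import pandas as pd

prices = pd.Series(["$105", "$143", "50", "$1,200"])

cleaned = (
    prices
    .str.replace("$", "", regex=False)   # drop the currency symbol
    .str.replace(",", "", regex=False)   # drop the thousands separator
    .astype("int64")                     # now every value is a plain number
)

print(cleaned.tolist())
```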

In [94]:
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   OrderID        15 non-null     int64
 1   CustomerType   15 non-null     str  
 2   FirstName      15 non-null     str  
 3   LastName       15 non-null     str  
 4   Price          15 non-null     int64
 5   Quantity       15 non-null     int64
 6   Cost           15 non-null     int64
 7   OriginalPrice  15 non-null     int64
 8   SalePrice      15 non-null     int64
 9   Score          15 non-null     int64
 10  Target         15 non-null     int64
 11  Sales          15 non-null     int64
 12  LastYearSales  15 non-null     int64
 13  Birthdate      15 non-null     str  
 14  Height         15 non-null     int64
 15  Weight         15 non-null     int8 
dtypes: int64(11), int8(1), str(4)
memory usage: 1.9 KB


#### Splitting Strings into Columns

A single column often packs several pieces of information together, such as the surname and the given names in `Name`.

In [95]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# example: split "Surname, Rest" at the first comma (the second part keeps its leading space)
df_titanic[["SirName", "Middle"]] = df_titanic["Name"].str.split(",", n=1, expand=True)
df_titanic.head()

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SirName,Middle
0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry
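
A possible next step, sketched on three names copied from the output above: limit the split to the first comma with `n=1`, strip the leading space left on the second part, and pull out the title (the word before the first period) with `.str.extract()`. The names `surname`, `rest`, and `title` are illustrative assumptions.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Johnson, Master. Harold Theodor",
])

# n=1 guarantees exactly two parts even if a name contained another comma
parts = names.str.split(",", n=1, expand=True)
surname = parts[0].str.strip()
rest = parts[1].str.strip()

# Capture the letters that come before the first period, e.g. "Mr"
title = rest.str.extract(r"^([A-Za-z]+)\.", expand=False)

print(surname.tolist())
print(title.tolist())
```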


#### Date and Time Handling



##### Working with Temporal Data

Dates stored as text cannot:
- Be grouped by month
- Be subtracted to calculate duration
- Be filtered by time period
- Be used for time-series analysis

First rule of date handling:

✔ Convert text → datetime


#### Converting Text to Datetime

##### `pd.to_datetime()`

`pd.to_datetime()` is the primary tool for date conversion.

```python
df['OrderDate'] = pd.to_datetime(df['OrderDate'])

# Safe conversion: values that cannot be parsed become NaT instead of raising an error
df['OrderDate'] = pd.to_datetime(df['OrderDate'], errors='coerce')
```
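
To see what `errors='coerce'` does, the sketch below parses a made-up Series containing one valid date, one impossible date, and one non-date string (all assumptions for illustration). Counting the resulting `NaT` values is a quick way to measure how much of a column failed to parse.

```python
import pandas as pd

raw = pd.Series(["2024-01-15", "2024-02-30", "not a date"])

# Invalid entries become NaT (the datetime equivalent of NaN)
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
failed = int(parsed.isna().sum())

print(parsed)
print(failed)
```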


#### Specifying the Format When It Is Not Detected

```python
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%Y-%m-%d %H:%M:%S')
```
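
An explicit format also removes ambiguity. In the sketch below, the same made-up string is read as 3 April or as 4 March depending on which format is given, which is why day-first and month-first data should not be left to guessing.

```python
import pandas as pd

raw = pd.Series(["03/04/2024"])

day_first = pd.to_datetime(raw, format="%d/%m/%Y")     # 3 April 2024
month_first = pd.to_datetime(raw, format="%m/%d/%Y")   # 4 March 2024

print(day_first[0], month_first[0])
```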

In [99]:
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight
0,1001,Vip,Laura,Johnson,31,2,106,105,62,57,1916,2274,1298,1965-10-09,196,104
1,1002,New,Laura,Johnson,44,2,35,143,88,70,585,1340,920,1981-08-24,174,95
2,1003,Vip,Sarah,Johnson,186,4,43,104,744,65,1747,2238,1880,1997-05-23,182,58
3,1004,Vip,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87


In [100]:
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   OrderID        15 non-null     int64
 1   CustomerType   15 non-null     str  
 2   FirstName      15 non-null     str  
 3   LastName       15 non-null     str  
 4   Price          15 non-null     int64
 5   Quantity       15 non-null     int64
 6   Cost           15 non-null     int64
 7   OriginalPrice  15 non-null     int64
 8   SalePrice      15 non-null     int64
 9   Score          15 non-null     int64
 10  Target         15 non-null     int64
 11  Sales          15 non-null     int64
 12  LastYearSales  15 non-null     int64
 13  Birthdate      15 non-null     str  
 14  Height         15 non-null     int64
 15  Weight         15 non-null     int8 
dtypes: int64(11), int8(1), str(4)
memory usage: 1.9 KB


In [101]:
# example
df_demo["Birthdate"] = pd.to_datetime(df_demo["Birthdate"])
df_demo.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   OrderID        15 non-null     int64         
 1   CustomerType   15 non-null     str           
 2   FirstName      15 non-null     str           
 3   LastName       15 non-null     str           
 4   Price          15 non-null     int64         
 5   Quantity       15 non-null     int64         
 6   Cost           15 non-null     int64         
 7   OriginalPrice  15 non-null     int64         
 8   SalePrice      15 non-null     int64         
 9   Score          15 non-null     int64         
 10  Target         15 non-null     int64         
 11  Sales          15 non-null     int64         
 12  LastYearSales  15 non-null     int64         
 13  Birthdate      15 non-null     datetime64[us]
 14  Height         15 non-null     int64         
 15  Weight         15 non-null     int8 

#### Extracting Date Components
##### The `.dt` Accessor

Once converted, use `.dt` to extract parts.

```python
df['Year'] = df['OrderDate'].dt.year
df['Month'] = df['OrderDate'].dt.month
df['Day'] = df['OrderDate'].dt.day
df['Weekday'] = df['OrderDate'].dt.dayofweek   # Monday = 0, Sunday = 6
```
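
Because `.dt.dayofweek` counts from Monday = 0, weekends are the values 5 and 6. A small sketch with two made-up dates (1 January 2024 was a Monday, 6 January 2024 a Saturday):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2024-01-01", "2024-01-06"]))

weekday_number = dates.dt.dayofweek   # Monday = 0 ... Sunday = 6
is_weekend = weekday_number >= 5      # Saturday or Sunday

print(weekday_number.tolist())
print(is_weekend.tolist())
```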

In [104]:
# example
df_demo["Birthdate"].dt.month

0     10
1      8
2      5
3      1
4      1
5      2
6      5
7     11
8      5
9      7
10     9
11     2
12     7
13    11
14     5
Name: Birthdate, dtype: int32

#### Named Components
```python
df['DayName'] = df['OrderDate'].dt.day_name()
df['MonthName'] = df['OrderDate'].dt.month_name()
```


In [105]:
# example
df_demo["BirthMonth"] = df_demo["Birthdate"].dt.month_name()
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight,BirthMonth
0,1001,Vip,Laura,Johnson,31,2,106,105,62,57,1916,2274,1298,1965-10-09,196,104,October
1,1002,New,Laura,Johnson,44,2,35,143,88,70,585,1340,920,1981-08-24,174,95,August
2,1003,Vip,Sarah,Johnson,186,4,43,104,744,65,1747,2238,1880,1997-05-23,182,58,May
3,1004,Vip,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65,January
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87,January


#### Time Components
```python

df['Hour'] = df['Timestamp'].dt.hour
df['Minute'] = df['Timestamp'].dt.minute
df['Second'] = df['Timestamp'].dt.second
```

In [107]:
# example
df_demo["Birthdate"].dt.hour

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
Name: Birthdate, dtype: int32

## Feature Engineering

Raw data rarely gives you everything you need.

You have sales and costs, but need **profit**.  
You have prices and quantities, but need **totals**.  
You have birthdates, but need **ages**.

Creating calculated columns transforms existing data into actionable insights.
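
As one sketch of the birthdate-to-age case, the block below computes age in whole years relative to a fixed reference date. The reference date and the variable names are assumptions chosen so the result is reproducible; in practice the reference would usually be today's date.

```python
import pandas as pd

birthdates = pd.to_datetime(pd.Series(["1965-10-09", "1981-08-24"]))
as_of = pd.Timestamp("2024-01-01")   # assumed reference date

# Has this year's birthday already happened on the reference date?
had_birthday = (birthdates.dt.month < as_of.month) | (
    (birthdates.dt.month == as_of.month) & (birthdates.dt.day <= as_of.day)
)

# Year difference, minus one if the birthday is still to come
age = as_of.year - birthdates.dt.year - (~had_birthday).astype(int)

print(age.tolist())
```

Subtracting years and then adjusting for the birthday avoids the rounding problems of dividing a day count by 365.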

### Arithmetic Calculated Columns  

#### Basic Math Operations

The simplest calculated columns use arithmetic operations.

```python
import numpy as np

# Calculate total revenue
df['Revenue'] = df['Price'] * df['Quantity']

# Calculate profit
df['Profit'] = df['Revenue'] - df['Cost']

# Calculate profit margin percentage
df['Margin%'] = (df['Profit'] / df['Revenue']) * 100

# Calculate discount amount
df['DiscountAmount'] = df['OriginalPrice'] - df['SalePrice']

# Calculate average
df['AverageScore'] = (df['Score1'] + df['Score2'] + df['Score3']) / 3

# Safe version: treat zero revenue as missing so the division yields NaN instead of inf
df['Margin'] = df['Profit'] / df['Revenue'].replace(0, np.nan)
```

In [109]:
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight,BirthMonth
0,1001,Vip,Laura,Johnson,31,2,106,105,62,57,1916,2274,1298,1965-10-09,196,104,October
1,1002,New,Laura,Johnson,44,2,35,143,88,70,585,1340,920,1981-08-24,174,95,August
2,1003,Vip,Sarah,Johnson,186,4,43,104,744,65,1747,2238,1880,1997-05-23,182,58,May
3,1004,Vip,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65,January
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87,January


In [110]:
# example
df_demo["full_name"] = df_demo["FirstName"]+" "+df_demo["LastName"]
df_demo.head()


Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight,BirthMonth,full_name
0,1001,Vip,Laura,Johnson,31,2,106,105,62,57,1916,2274,1298,1965-10-09,196,104,October,Laura Johnson
1,1002,New,Laura,Johnson,44,2,35,143,88,70,585,1340,920,1981-08-24,174,95,August,Laura Johnson
2,1003,Vip,Sarah,Johnson,186,4,43,104,744,65,1747,2238,1880,1997-05-23,182,58,May,Sarah Johnson
3,1004,Vip,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65,January,Sarah Brown
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87,January,Sarah Johnson
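
Returning to the margin calculation from the start of this section, the sketch below shows why the "safe version" matters. It uses a made-up `orders` DataFrame (an assumption for illustration) in which one row has zero revenue: plain division produces an infinite value, while replacing zero with `NaN` first yields a missing value that later aggregations skip.

```python
import numpy as np
import pandas as pd

# Hypothetical orders; the second row has zero revenue
orders = pd.DataFrame({
    "Revenue": [200.0, 0.0, 50.0],
    "Cost": [150.0, 10.0, 20.0],
})
orders["Profit"] = orders["Revenue"] - orders["Cost"]

naive_margin = orders["Profit"] / orders["Revenue"]                    # -10 / 0 gives -inf
safe_margin = orders["Profit"] / orders["Revenue"].replace(0, np.nan)  # gives NaN instead

print(naive_margin.tolist())
print(safe_margin.tolist())
```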


### Conditional Columns with `np.where()`
#### Creating If-Then Logic

`np.where()` works like an Excel IF statement.

```python
import numpy as np

# Syntax: np.where(condition, value_if_true, value_if_false)
df['Status'] = np.where(df['Score'] >= 60, 'Pass', 'Fail')
```

In [111]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SirName,Middle
0,1,0,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry


In [113]:
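# example: map 1/0 to True/False
# (the comparison alone, df_titanic["Survived"] == 1, yields the same booleans)
import numpy as np

df_titanic["Survived"] = np.where(df_titanic["Survived"] == 1, True, False)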
df_titanic.head()

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SirName,Middle
0,1,False,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry


In [114]:
# Rows with a missing Age fail the comparison and are therefore labeled "young"
df_titanic["AgeGroup"] = np.where(df_titanic["Age"] > 45, "old", "young")
df_titanic.head()

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SirName,Middle,AgeGroup
0,1,False,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,old
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),young
2,3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,young
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),young
4,5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,young


#### Multiple Conditions (Nested `np.where()`)

```python

# Performance tiers
df['Tier'] = np.where(df['Score'] >= 90, 'Excellent',
              np.where(df['Score'] >= 75, 'Good',
              np.where(df['Score'] >= 60, 'Satisfactory',
                       'Needs Improvement')))

```

In [115]:
# example
df_titanic["AgeGroup"] = np.where(df_titanic["Age"] > 45,"Old",
                                  np.where(df_titanic["Age"]>35,"Adult","young"))
df_titanic.head()

Unnamed: 0,PassengerId,Survived,PClass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SirName,Middle,AgeGroup
0,1,False,3,"Braund, Mr. Owen Harris",male,220.0,1,0,A/5 21171,7.25,,S,Braund,Mr. Owen Harris,Old
1,2,True,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs. John Bradley (Florence Briggs Thayer),Adult
2,3,True,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss. Laina,young
3,4,True,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs. Jacques Heath (Lily May Peel),young
4,5,False,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr. William Henry,young
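
Nested `np.where()` calls become hard to read beyond two or three levels, and any row with a missing `Age` silently falls through to the last label because comparisons with `NaN` are `False`. One alternative, not used in the notebook above, is NumPy's `np.select()`, which takes a list of conditions and a matching list of labels and uses the first condition that is true. The sketch below applies it to a made-up age Series with an explicit "Unknown" bucket; the label names are assumptions.

```python
import numpy as np
import pandas as pd

ages = pd.Series([220.0, 38.0, 26.0, np.nan, 50.0])

# Conditions are checked in order; the first match wins
conditions = [
    ages.isna().to_numpy(),
    (ages > 45).to_numpy(),
    (ages > 35).to_numpy(),
]
labels = ["Unknown", "Old", "Adult"]

age_group = pd.Series(
    np.select(conditions, labels, default="Young"),
    index=ages.index,
)

print(age_group.tolist())
```

The age of 220 is still labeled "Old" here; an outlier like that is a data quality issue to fix before grouping, not something the labeling step can repair.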


#### Using `.apply()` for Custom Functions
##### When to Use `.apply()`

Use `.apply()` when logic is too complex for vectorized operations.

In [116]:
df_demo.head()

Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight,BirthMonth,full_name
0,1001,Vip,Laura,Johnson,31,2,106,105,62,57,1916,2274,1298,1965-10-09,196,104,October,Laura Johnson
1,1002,New,Laura,Johnson,44,2,35,143,88,70,585,1340,920,1981-08-24,174,95,August,Laura Johnson
2,1003,Vip,Sarah,Johnson,186,4,43,104,744,65,1747,2238,1880,1997-05-23,182,58,May,Sarah Johnson
3,1004,Vip,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65,January,Sarah Brown
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87,January,Sarah Johnson


In [117]:
# lambda example
df_demo["cost_dollar"] = df_demo["Cost"].apply(lambda x: x/130)
df_demo.head()


Unnamed: 0,OrderID,CustomerType,FirstName,LastName,Price,Quantity,Cost,OriginalPrice,SalePrice,Score,Target,Sales,LastYearSales,Birthdate,Height,Weight,BirthMonth,full_name,cost_dollar
0,1001,Vip,Laura,Johnson,31,2,106,105,62,57,1916,2274,1298,1965-10-09,196,104,October,Laura Johnson,0.815385
1,1002,New,Laura,Johnson,44,2,35,143,88,70,585,1340,920,1981-08-24,174,95,August,Laura Johnson,0.269231
2,1003,Vip,Sarah,Johnson,186,4,43,104,744,65,1747,2238,1880,1997-05-23,182,58,May,Sarah Johnson,0.330769
3,1004,Vip,Sarah,Brown,92,2,51,50,184,58,683,994,625,1979-01-03,171,65,January,Sarah Brown,0.392308
4,1005,Member,Sarah,Johnson,43,1,16,161,43,50,770,1827,970,1961-01-12,170,87,January,Sarah Johnson,0.123077


In [118]:
def convert_to_dollars(x):
    """Divide a cost by 130, the same conversion as the lambda example above."""
    return x / 130

In [119]:
# function example 
df_demo["Cost"].apply(convert_to_dollars)

0     0.815385
1     0.269231
2     0.330769
3     0.392308
4     0.123077
5     0.323077
6     0.376923
7     0.130769
8     0.869231
9     0.361538
10    0.700000
11    0.769231
12    0.469231
13    0.130769
14    0.107692
Name: Cost, dtype: float64
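
For simple arithmetic like the division above, a vectorized expression gives the same result without calling a Python function per value. `.apply()` is more useful when the logic needs branching across several columns, in which case the function receives a whole row via `axis=1`. The sketch below shows both; the `describe_order` rule and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# 1) Simple arithmetic: apply and vectorized division agree
costs = pd.Series([106, 35, 43])
via_apply = costs.apply(lambda x: x / 130)
vectorized = costs / 130

# 2) Row-wise logic that combines two columns (hypothetical business rule)
def describe_order(row):
    if row["Quantity"] >= 3 and row["Price"] > 100:
        return "bulk premium"
    if row["Quantity"] >= 3:
        return "bulk"
    return "standard"

orders = pd.DataFrame({"Price": [186, 31, 20], "Quantity": [4, 2, 3]})
orders["OrderKind"] = orders.apply(describe_order, axis=1)

print(np.allclose(via_apply, vectorized))
print(orders)
```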