## Working With Text Data

## Table of Contents

<ul>
    <li><a href="#1">1. Introduction</a></li>
    <li><a href="#2">2. Common String Methods - .lower(), .upper(), .title() and .len()</a></li>
    <li><a href="#3">3. The str.replace() Method</a></li>
    <li><a href="#4">4. Filtering with String Methods - .contains(), .startswith(), .endswith()</a></li>
    <li><a href="#5">5. More String Methods - .strip(), .lstrip(), .rstrip()</a></li>
    <li><a href="#6">6. String Methods on Index and Column Labels</a></li>
    <li><a href="#7">7. Split Strings by Characters with .str.split() and .str.get()</a></li>
    <li><a href="#8">8. More Practice with Splits</a></li>
    <li><a href="#9">9. The expand and n Parameters of the .str.split() Method</a></li>
</ul>

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # Show all results without print
from IPython.display import display, HTML # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:78% !important; }</style>"))

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.__version__


'1.5.2'

<a id='1'></a>
### 1. Introduction

In [2]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [3]:
chicago["Department"].nunique()

35

In [4]:
chicago["Department"].count()

32062
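Only 35 distinct labels are spread over 32,062 rows, which is why the `category` conversion pays off. The sketch below illustrates the saving on a small made-up Series (the `departments` data is a hypothetical stand-in, not the real file); exact byte counts will vary by pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the Department column: few distinct labels, many rows.
departments = pd.Series(["POLICE", "FIRE", "WATER MGMNT", "POLICE"] * 2500)

# deep=True counts the memory held by the Python strings themselves.
as_strings = departments.memory_usage(deep=True)
as_category = departments.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings:,} bytes")
print(f"category: {as_category:,} bytes")
```

A categorical column stores each label once plus a small integer code per row, so the saving grows with the number of repeated values.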

<a id='2'></a>
### 2. Common String Methods - `.lower()`, `.upper()`, `.title()` and `.len()`

In [5]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.tail(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
32059,"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
32060,"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00
32061,"ZYSKOWSKI, DARIUSZ",CHIEF DATA BASE ANALYST,DoIT,$113664.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [6]:
# Regular Python
"HELLO WORLD".lower()
"Hello World".lower()
"hello world".lower()

print("-" * 80)

"HELLO WORLD".upper()
"Hello World".upper()
"hello world".upper()

print("-" * 80)

"HELLO WORLD".title()
"Hello World".title()
"hello world".title()

print("-" * 80)

len("HELLO WORLD") #spaces count

'hello world'

'hello world'

'hello world'

--------------------------------------------------------------------------------


'HELLO WORLD'

'HELLO WORLD'

'HELLO WORLD'

--------------------------------------------------------------------------------


'Hello World'

'Hello World'

'Hello World'

--------------------------------------------------------------------------------


11

In [7]:
chicago["Name"] = chicago["Name"].str.lower()

chicago.head(n=3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"aaron, elvia j",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"aaron, jeffery m",POLICE OFFICER,POLICE,$84450.00
2,"aaron, karina",POLICE OFFICER,POLICE,$84450.00


In [8]:
chicago["Name"] = chicago["Name"].str.lower().str.upper()

chicago.head(n=3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


In [9]:
chicago["Name"] = chicago["Name"].str.title()

chicago.head(n=3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
2,"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00


In [10]:
chicago["Position Title"] = chicago["Position Title"].str.title()

chicago.head(n=3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"Aaron, Elvia J",Water Rate Taker,WATER MGMNT,$90744.00
1,"Aaron, Jeffery M",Police Officer,POLICE,$84450.00
2,"Aaron, Karina",Police Officer,POLICE,$84450.00


In [11]:
len(chicago["Department"])

32062

In [12]:
chicago["Department"].str.len() # Gives length of each record

0        11
1         6
2         6
3        16
4        11
         ..
32057    16
32058     6
32059     6
32060     6
32061     4
Name: Department, Length: 32062, dtype: int64
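As a compact recap of this section, here is a minimal sketch on a three-element Series. The sample values are made up for illustration, and the `NaN` entry is included only to show that the `.str` methods pass missing values through rather than raising.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, including one missing value.
names = pd.Series(["AARON, ELVIA J", "aaron, karina", np.nan])

lowered = names.str.lower()   # element-wise str.lower()
titled = names.str.title()    # capitalise the first letter of each word
lengths = names.str.len()     # characters per element, spaces included

print(titled.tolist())
print(lengths.tolist())
```

Note the difference from cell 11: `len(series)` counts rows, while `series.str.len()` measures each string.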

<a id='3'></a>
### 3. The `str.replace()` Method

In [13]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [14]:
# Regular Python
"Hello World".replace("l", "!!!") #orig, replacement

'He!!!!!!o Wor!!!d'

In [15]:
chicago["Department"] = chicago["Department"].str.replace("MGMNT", "MANAGEMENT")

chicago.head(n=3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MANAGEMENT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


In [16]:
# Convert Salary to float
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].str.replace("$", "", regex=False) # literal "$", not the regex end-of-string anchor
chicago["Employee Annual Salary"] = chicago["Employee Annual Salary"].astype(float)

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MANAGEMENT,90744.0
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,84450.0
2,"AARON, KARINA",POLICE OFFICER,POLICE,84450.0


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Name                    32062 non-null  object 
 1   Position Title          32062 non-null  object 
 2   Department              32062 non-null  object 
 3   Employee Annual Salary  32062 non-null  float64
dtypes: float64(1), object(3)
memory usage: 1.2+ MB
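The same clean-up can be written as one chained expression. This is a sketch on made-up values; the third entry, with a thousands separator, is a hypothetical case that does not appear in the output above and is there only to show how further literal replacements chain on.

```python
import pandas as pd

# Hypothetical salary strings; the last one adds a thousands separator.
salaries = pd.Series(["$90744.00", "$84450.00", "$1,250.50"])

cleaned = (
    salaries
    .str.replace("$", "", regex=False)  # treat "$" as a plain character
    .str.replace(",", "", regex=False)  # drop thousands separators
    .astype(float)
)

print(cleaned.tolist())
```

Passing `regex=False` makes the intent explicit and avoids depending on the default, which differs between pandas versions.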


In [17]:
chicago["Employee Annual Salary"].sum()

2571506375.36

In [18]:
chicago["Employee Annual Salary"].mean()

80204.17863389682

In [19]:
chicago["Employee Annual Salary"].std()

25098.329867510587

In [20]:
chicago["Employee Annual Salary"].nlargest(n=10)
print("-" * 80)
chicago.loc[chicago["Employee Annual Salary"].nlargest(n=10).index] # .loc, because nlargest returns index labels

8184     300000.0
7954     216210.0
25532    202728.0
8924     197736.0
8042     197724.0
19208    195000.0
3706     187680.0
18556    187680.0
29466    187680.0
13754    185364.0
Name: Employee Annual Salary, dtype: float64

--------------------------------------------------------------------------------


Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
8184,"EVANS, GINGER S",COMMISSIONER OF AVIATION,AVIATION,300000.0
7954,"EMANUEL, RAHM",MAYOR,MAYOR'S OFFICE,216210.0
25532,"SANTIAGO, JOSE A",FIRE COMMISSIONER,FIRE,202728.0
8924,"FORD II, RICHARD C",FIRST DEPUTY FIRE COMMISSIONER,FIRE,197736.0
8042,"ESCALANTE, JOHN J",FIRST DEPUTY SUPERINTENDENT,POLICE,197724.0
19208,"MITCHELL, EILEEN M",CHIEF OF STAFF,MAYOR'S OFFICE,195000.0
3706,"CALLAHAN, MICHAEL E",DEPUTY FIRE COMMISSIONER,FIRE,187680.0
18556,"MC NICHOLAS, JOHN",DEPUTY FIRE COMMISSIONER,FIRE,187680.0
29466,"VASQUEZ, ANTHONY P",DEPUTY FIRE COMMISSIONER,FIRE,187680.0
13754,"JOHNSON, EDDIE T",CHIEF,POLICE,185364.0


<a id='4'></a>
### 4. Filtering with String Methods - `.contains()`, `.startswith()`, `.endswith()`

In [21]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [22]:
chicago["Position Title"].str.lower()
print("-" * 80)
chicago["Position Title"].str.lower().str.contains("water")

0                      water rate taker
1                        police officer
2                        police officer
3              chief contract expediter
4                     civil engineer iv
                      ...              
32057    frm of machinists - automotive
32058                    police officer
32059                    police officer
32060                    police officer
32061           chief data base analyst
Name: Position Title, Length: 32062, dtype: object

--------------------------------------------------------------------------------


0         True
1        False
2        False
3        False
4        False
         ...  
32057    False
32058    False
32059    False
32060    False
32061    False
Name: Position Title, Length: 32062, dtype: bool

In [23]:
filter1 = chicago["Position Title"].str.lower().str.contains("water")
chicago[filter1]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
554,"ALUISE, VINCENT G",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00
685,"ANDERSON, ANDREW J",DISTRICT SUPERINTENDENT OF WATER DISTRIBUTION,WATER MGMNT,$109272.00
702,"ANDERSON, DONALD",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00
...,...,...,...,...
29669,"VERMA, ANUPAM",MANAGING ENGINEER - WATER MANAGEMENT,WATER MGMNT,$111192.00
30239,"WASHINGTON, JOSEPH",WATER CHEMIST III,WATER MGMNT,$89676.00
30544,"WEST, THOMAS R",GEN SUPT OF WATER MANAGEMENT,WATER MGMNT,$115704.00
30991,"WILLIAMS, MATTHEW",FOREMAN OF WATER PIPE CONSTRUCTION,WATER MGMNT,$102440.00


In [24]:
chicago[chicago["Position Title"].str.lower().str.startswith("water")]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
671,"ANDER, PERRY A",WATER CHEMIST II,WATER MGMNT,$82044.00
1054,"ASHLEY, KARMA T",WATER CHEMIST II,WATER MGMNT,$82044.00
1079,"ATKINS, JOANNA M",WATER CHEMIST II,WATER MGMNT,$82044.00
1181,"AZEEM, MOHAMMED A",WATER CHEMIST II,WATER MGMNT,$53172.00
...,...,...,...,...
28574,"THREATT, DENISE R",WATER QUALITY INSPECTOR,WATER MGMNT,$62004.00
28602,"TIGNOR, DARRYL B",WATER RATE TAKER,WATER MGMNT,$78948.00
28955,"TRAVIS COOK, LESLIE R",WATER RATE TAKER,WATER MGMNT,$78948.00
29584,"VELAZQUEZ, JOHN",WATER RATE TAKER,WATER MGMNT,$78948.00


In [25]:
chicago[chicago["Position Title"].str.lower().str.endswith("ist")]

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
184,"AFROZ, NAYYAR",PSYCHIATRIST,HEALTH,$99840.00
308,"ALARCON, LUIS J",LOAN PROCESSING SPECIALIST,COMMUNITY DEVELOPMENT,$81948.00
422,"ALLAIN, CAROLYN",SENIOR TELECOMMUNICATIONS SPECIALIST,DoIT,$89880.00
472,"ALLEN, ROBERT",MACHINIST,WATER MGMNT,$94328.00
705,"ANDERSON, EDWARD M",SR PROCUREMENT SPECIALIST,PROCUREMENT,$91476.00
...,...,...,...,...
31667,"YODER, TERESA G",ARCHIVAL SPECIALIST,PUBLIC LIBRARY,$74304.00
31688,"YOUNGBLOOM, LAURENCE G",CRIMES SURVEILLANCE SPECIALIST,OEMC,$19676.80
31717,"YOUNG, KIMBERLY M",SR PROCUREMENT SPECIALIST,PROCUREMENT,$68556.00
31837,"ZAPATA, HUGO",SR PROCUREMENT SPECIALIST,PROCUREMENT,$87324.00
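The filters in this section lower-case the column first. As an alternative sketch, `.str.contains()` also accepts `case=False`, and `na=False` keeps the mask strictly boolean when values are missing. The `titles` Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical titles with mixed case and one missing value.
titles = pd.Series([
    "WATER RATE TAKER",
    "Foreman of Water Pipe Construction",
    "POLICE OFFICER",
    np.nan,
])

# case=False ignores case; na=False maps missing values to False.
has_water = titles.str.contains("water", case=False, na=False)
starts_with_water = titles.str.lower().str.startswith("water", na=False)

print(titles[has_water].tolist())
print(titles[starts_with_water].tolist())
```

Without `na=False`, a missing value yields a missing entry in the mask, and indexing a DataFrame with such a mask can raise an error.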


<a id='5'></a>
### 5. More String Methods - `.strip()`, `.lstrip()`, `.rstrip()`

In [26]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [27]:
#Regular Python
"     Hello World     ".strip()
"     Hello World     ".lstrip()
"     Hello World     ".rstrip()

'Hello World'

'Hello World     '

'     Hello World'

In [28]:
chicago["Name"].str.lstrip()
print("-" * 80)
chicago["Name"].str.rstrip()
print("-" * 80)
chicago["Name"].str.strip()
print("-" * 80)
chicago["Name"].str.lstrip().str.rstrip()

0            AARON,  ELVIA J
1          AARON,  JEFFERY M
2             AARON,  KARINA
3        AARON,  KIMBERLEI R
4        ABAD JR,  VICENTE M
                ...         
32057    ZYGADLO,  MICHAEL J
32058     ZYGOWICZ,  PETER J
32059      ZYMANTAS,  MARK E
32060    ZYRKOWSKI,  CARLO E
32061    ZYSKOWSKI,  DARIUSZ
Name: Name, Length: 32062, dtype: object

--------------------------------------------------------------------------------


0            AARON,  ELVIA J
1          AARON,  JEFFERY M
2             AARON,  KARINA
3        AARON,  KIMBERLEI R
4        ABAD JR,  VICENTE M
                ...         
32057    ZYGADLO,  MICHAEL J
32058     ZYGOWICZ,  PETER J
32059      ZYMANTAS,  MARK E
32060    ZYRKOWSKI,  CARLO E
32061    ZYSKOWSKI,  DARIUSZ
Name: Name, Length: 32062, dtype: object

--------------------------------------------------------------------------------


0            AARON,  ELVIA J
1          AARON,  JEFFERY M
2             AARON,  KARINA
3        AARON,  KIMBERLEI R
4        ABAD JR,  VICENTE M
                ...         
32057    ZYGADLO,  MICHAEL J
32058     ZYGOWICZ,  PETER J
32059      ZYMANTAS,  MARK E
32060    ZYRKOWSKI,  CARLO E
32061    ZYSKOWSKI,  DARIUSZ
Name: Name, Length: 32062, dtype: object

--------------------------------------------------------------------------------


0            AARON,  ELVIA J
1          AARON,  JEFFERY M
2             AARON,  KARINA
3        AARON,  KIMBERLEI R
4        ABAD JR,  VICENTE M
                ...         
32057    ZYGADLO,  MICHAEL J
32058     ZYGOWICZ,  PETER J
32059      ZYMANTAS,  MARK E
32060    ZYRKOWSKI,  CARLO E
32061    ZYSKOWSKI,  DARIUSZ
Name: Name, Length: 32062, dtype: object

In [29]:
chicago["Name"] = chicago["Name"].str.lstrip().str.rstrip()

In [30]:
chicago["Position Title"] = chicago["Position Title"].str.strip()
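The strip methods only touch the ends of each string, so the double space after the comma in the outputs above survives them. One possible follow-up, sketched here on made-up values, is to collapse internal runs of whitespace with a regular-expression replace.

```python
import pandas as pd

# Hypothetical names with stray outer spaces and a double space after the comma.
names = pd.Series(["  AARON,  ELVIA J ", "ABAD JR,  VICENTE M"])

trimmed = names.str.strip()                                 # outer whitespace only
collapsed = trimmed.str.replace(r"\s+", " ", regex=True)    # inner runs become one space

print(trimmed.tolist())
print(collapsed.tolist())
```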

<a id='6'></a>
### 6. String Methods on Index and Column Labels

In [31]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [32]:
chicago.set_index("Name", inplace=True)
chicago.tail(n=3)

Unnamed: 0_level_0,Position Title,Department,Employee Annual Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"ZYMANTAS, MARK E",POLICE OFFICER,POLICE,$84450.00
"ZYRKOWSKI, CARLO E",POLICE OFFICER,POLICE,$87384.00
"ZYSKOWSKI, DARIUSZ",CHIEF DATA BASE ANALYST,DoIT,$113664.00


In [33]:
chicago.index

Index(['AARON,  ELVIA J', 'AARON,  JEFFERY M', 'AARON,  KARINA',
       'AARON,  KIMBERLEI R', 'ABAD JR,  VICENTE M', 'ABARCA,  ANABEL',
       'ABARCA,  EMMANUEL', 'ABASCAL,  REECE E', 'ABBASI,  CHRISTOPHER',
       'ABBATACOLA,  ROBERT J',
       ...
       'ZWIT,  JEFFREY J', 'ZWOLFER,  MATTHEW W', 'ZYCH,  MATEUSZ',
       'ZYDEK,  BRYAN', 'ZYGADLO,  JOHN P', 'ZYGADLO,  MICHAEL J',
       'ZYGOWICZ,  PETER J', 'ZYMANTAS,  MARK E', 'ZYRKOWSKI,  CARLO E',
       'ZYSKOWSKI,  DARIUSZ'],
      dtype='object', name='Name', length=32062)

In [34]:
chicago.index.str.strip().str.title()

Index(['Aaron,  Elvia J', 'Aaron,  Jeffery M', 'Aaron,  Karina',
       'Aaron,  Kimberlei R', 'Abad Jr,  Vicente M', 'Abarca,  Anabel',
       'Abarca,  Emmanuel', 'Abascal,  Reece E', 'Abbasi,  Christopher',
       'Abbatacola,  Robert J',
       ...
       'Zwit,  Jeffrey J', 'Zwolfer,  Matthew W', 'Zych,  Mateusz',
       'Zydek,  Bryan', 'Zygadlo,  John P', 'Zygadlo,  Michael J',
       'Zygowicz,  Peter J', 'Zymantas,  Mark E', 'Zyrkowski,  Carlo E',
       'Zyskowski,  Dariusz'],
      dtype='object', name='Name', length=32062)

In [35]:
chicago.index = chicago.index.str.strip().str.title()
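The `.str` accessor works on any `Index` of strings, which includes column labels as well as row labels. A minimal sketch on a tiny made-up frame (the stray spaces and the snake_case renaming are illustrative assumptions, not part of the dataset):

```python
import pandas as pd

# Hypothetical frame with untidy row and column labels.
df = pd.DataFrame(
    {"Position Title": ["POLICE OFFICER", "MAYOR"],
     " Department ": ["POLICE", "MAYOR'S OFFICE"]},
    index=pd.Index([" AARON, KARINA", "EMANUEL, RAHM "], name="Name"),
)

# Row labels: trim and title-case.
df.index = df.index.str.strip().str.title()

# Column labels: trim, lower-case, and replace spaces with underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(df.index.tolist())
print(df.columns.tolist())
```

Each call returns a new `Index`, so the result has to be assigned back to `df.index` or `df.columns` to take effect.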

In [36]:
chicago.head(n=3)

Unnamed: 0_level_0,Position Title,Department,Employee Annual Salary
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00


In [37]:
chicago.columns

chicago.columns.str.upper()

chicago.columns = chicago.columns.str.upper()

chicago.head(n=3)

Index(['Position Title', 'Department', 'Employee Annual Salary'], dtype='object')

Index(['POSITION TITLE', 'DEPARTMENT', 'EMPLOYEE ANNUAL SALARY'], dtype='object')

Unnamed: 0_level_0,POSITION TITLE,DEPARTMENT,EMPLOYEE ANNUAL SALARY
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"Aaron, Elvia J",WATER RATE TAKER,WATER MGMNT,$90744.00
"Aaron, Jeffery M",POLICE OFFICER,POLICE,$84450.00
"Aaron, Karina",POLICE OFFICER,POLICE,$84450.00


<a id='7'></a>
### 7. Split Strings by Characters with `.str.split()` and `.str.get()`

In [38]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [39]:
#Regular Python
"Hi, my name is Siddharth".split() # Default splits on any run of whitespace
"Hi, my name is Siddharth".split(" ") # Splits on every single space character

['Hi,', 'my', 'name', 'is', 'Siddharth']

['Hi,', 'my', 'name', 'is', 'Siddharth']

In [40]:
chicago["Name"].str.split(",")

0            [AARON,   ELVIA J]
1          [AARON,   JEFFERY M]
2             [AARON,   KARINA]
3        [AARON,   KIMBERLEI R]
4        [ABAD JR,   VICENTE M]
                  ...          
32057    [ZYGADLO,   MICHAEL J]
32058     [ZYGOWICZ,   PETER J]
32059      [ZYMANTAS,   MARK E]
32060    [ZYRKOWSKI,   CARLO E]
32061    [ZYSKOWSKI,   DARIUSZ]
Name: Name, Length: 32062, dtype: object

In [41]:
chicago["Name"].str.split(",").str.get(0)

0            AARON
1            AARON
2            AARON
3            AARON
4          ABAD JR
           ...    
32057      ZYGADLO
32058     ZYGOWICZ
32059     ZYMANTAS
32060    ZYRKOWSKI
32061    ZYSKOWSKI
Name: Name, Length: 32062, dtype: object

In [42]:
chicago["Name"].str.split(",").str.get(1).str.title()

0              Elvia J
1            Jeffery M
2               Karina
3          Kimberlei R
4            Vicente M
             ...      
32057        Michael J
32058          Peter J
32059           Mark E
32060          Carlo E
32061          Dariusz
Name: Name, Length: 32062, dtype: object

In [43]:
chicago["Name"].str.split(",").str.get(0).str.title().value_counts()

Williams     293
Johnson      244
Smith        241
Brown        185
Jones        183
            ... 
Horkavy        1
Horn           1
Horne Jr       1
Horner         1
Zyskowski      1
Name: Name, Length: 13829, dtype: int64

In [44]:
chicago["Position Title"].str.split(" ").str.get(0).str.title().value_counts()

Police             10856
Firefighter-Emt     1509
Sergeant            1186
Pool                 918
Firefighter          810
                   ...  
Dentist                1
Assoc                  1
Telephone              1
Mayor                  1
Prepress               1
Name: Position Title, Length: 320, dtype: int64
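A small sketch of the split-then-pick pattern used above, on made-up names. The single-word entry is a hypothetical edge case, included to show that `.str.get()` returns a missing value instead of raising when a piece does not exist.

```python
import pandas as pd

# Hypothetical names; the last one has no comma.
names = pd.Series(["AARON, ELVIA J", "ABAD JR, VICENTE M", "MADONNA"])

parts = names.str.split(",")   # each element becomes a list of pieces
last = parts.str.get(0)        # first piece of every list
rest = parts.str.get(1)        # missing where the list has only one piece
same_last = parts.str[0]       # indexing shorthand for .str.get(0)

print(last.tolist())
print(rest.tolist())
```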

<a id='8'></a>
### 8. More Practice with Splits

In [45]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [46]:
chicago["Name"].str.split(",").str.get(0).str.title().value_counts().head(n=3)

Williams    293
Johnson     244
Smith       241
Name: Name, dtype: int64

In [47]:
chicago["Name"].str.split(",").str.get(1).str.split(" ")

0            [, , ELVIA, J]
1          [, , JEFFERY, M]
2              [, , KARINA]
3        [, , KIMBERLEI, R]
4          [, , VICENTE, M]
                ...        
32057      [, , MICHAEL, J]
32058        [, , PETER, J]
32059         [, , MARK, E]
32060        [, , CARLO, E]
32061         [, , DARIUSZ]
Name: Name, Length: 32062, dtype: object

In [48]:
chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0).str.title()

0            Elvia
1          Jeffery
2           Karina
3        Kimberlei
4          Vicente
           ...    
32057      Michael
32058        Peter
32059         Mark
32060        Carlo
32061      Dariusz
Name: Name, Length: 32062, dtype: object

In [49]:
chicago["Name"].str.split(",").str.get(1).str.strip().str.split(" ").str.get(0).str.title().value_counts().head(n=5)

Michael    1153
John        899
James       676
Robert      622
Joseph      537
Name: Name, dtype: int64
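Cell 47 shows why the `.str.strip()` step matters: splitting on a literal `" "` leaves empty strings wherever spaces lead the text. As an alternative sketch, calling `.str.split()` with no separator splits on runs of whitespace and skips those empty pieces, so the explicit strip is not needed. The sample names are made up.

```python
import pandas as pd

# Hypothetical names with a double space after the comma, as in the outputs above.
names = pd.Series(["AARON,  ELVIA J", "AARON,  JEFFERY M", "ZYCH,  ELVIA"])

first_names = (
    names
    .str.split(",").str.get(1)   # everything after the comma
    .str.split().str.get(0)      # split on whitespace runs, keep the first word
    .str.title()
)

print(first_names.value_counts())
```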

<a id='9'></a>
### 9. The `expand` and `n` Parameters of the `.str.split()` Method

In [50]:
chicago = pd.read_csv(filepath_or_buffer = 'chicago.csv') # can use index_col to set a column as index
chicago = chicago.dropna(how="all") # drop rows in which every value is null
chicago["Department"] = chicago["Department"].astype("category") # Convert Department column to category to save space

chicago.head(n=3)
print("-" * 80)
chicago.info()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00


--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32062 entries, 0 to 32061
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32062 non-null  object  
 1   Position Title          32062 non-null  object  
 2   Department              32062 non-null  category
 3   Employee Annual Salary  32062 non-null  object  
dtypes: category(1), object(3)
memory usage: 1.0+ MB


In [51]:
chicago["Name"].str.split(",", expand=True) #returns a dataframe

Unnamed: 0,0,1
0,AARON,ELVIA J
1,AARON,JEFFERY M
2,AARON,KARINA
3,AARON,KIMBERLEI R
4,ABAD JR,VICENTE M
...,...,...
32057,ZYGADLO,MICHAEL J
32058,ZYGOWICZ,PETER J
32059,ZYMANTAS,MARK E
32060,ZYRKOWSKI,CARLO E


In [52]:
chicago[["Last Name", "First Name"]] = chicago["Name"].str.split(",", expand=True) # the surname comes before the comma
chicago.head(n=3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA


In [53]:
chicago["Position Title"].str.split(" ", expand=True)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,WATER,RATE,TAKER,,,,,,
1,POLICE,OFFICER,,,,,,,
2,POLICE,OFFICER,,,,,,,
3,CHIEF,CONTRACT,EXPEDITER,,,,,,
4,CIVIL,ENGINEER,IV,,,,,,
...,...,...,...,...,...,...,...,...,...
32057,FRM,OF,MACHINISTS,-,AUTOMOTIVE,,,,
32058,POLICE,OFFICER,,,,,,,
32059,POLICE,OFFICER,,,,,,,
32060,POLICE,OFFICER,,,,,,,


In [54]:
chicago["Position Title"].str.split(" ", expand=True, n=1).head(n=3) #n is max number of splits to make

Unnamed: 0,0,1
0,WATER,RATE TAKER
1,POLICE,OFFICER
2,POLICE,OFFICER


In [55]:
chicago["Position Title"].str.split(" ", expand=True, n=2).head(n=3) #n is max number of splits to make

Unnamed: 0,0,1,2
0,WATER,RATE,TAKER
1,POLICE,OFFICER,
2,POLICE,OFFICER,


In [56]:
chicago[["First Title Word", "Remaining Words"]] = chicago["Position Title"].str.split(" ", expand=True, n=1)
chicago.head(n=3)

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary,Last Name,First Name,First Title Word,Remaining Words
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$90744.00,AARON,ELVIA J,WATER,RATE TAKER
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$84450.00,AARON,JEFFERY M,POLICE,OFFICER
2,"AARON, KARINA",POLICE OFFICER,POLICE,$84450.00,AARON,KARINA,POLICE,OFFICER
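To see how `expand` and `n` interact, here is a minimal sketch on three made-up titles. The one-word title is a hypothetical edge case, included to show that rows with fewer pieces are padded with missing values.

```python
import pandas as pd

# Hypothetical titles of different lengths.
titles = pd.Series(["WATER RATE TAKER", "POLICE OFFICER", "MAYOR"])

# expand=True returns a DataFrame with one column per piece.
wide = titles.str.split(" ", expand=True)

# n=1 allows at most one split, so there are at most two columns.
two = titles.str.split(" ", n=1, expand=True)

print(wide.shape)
print(two)
```

Capping `n` is what makes the two-column assignment in cell 56 safe: without it, the number of columns depends on the longest title in the data.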
