- To remove whitespace from string data,  
  I mainly use two string methods (replace and strip), depending on the situation.
- Let's look at how the two differ  
  and why whitespace removal is necessary.

In [3]:
import pandas as pd

In [4]:
df = pd.read_excel("data/sample_data_whitespace.xlsx")
print(df.shape)
df

(5, 2)


Unnamed: 0,name,location
0,Bruce Baker,USA
1,Calliope Collins,USA
2,Emma Evans,HKG
3,Linda Lewis,HKG
4,Peter Parker,NZL


- If the replace function is used to remove spaces from the "name" column of the sample data,  
  every space in the string is removed, including the one between the first and last name.
- The strip function, on the other hand,  
  removes only the whitespace at the left and right ends of the string;  
  internal spaces (such as the space between first and last names) are kept.

In [5]:
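# Remove every space character, including the one between first and last name
df['name_replace'] = df['name'].str.replace(" ", "", regex=False)
# Remove only leading and trailing whitespace (missing values pass through unchanged)
df['name_strip'] = df['name'].str.strip()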

In [6]:
print(df.shape)
df

(5, 4)


Unnamed: 0,name,location,name_replace,name_strip
0,Bruce Baker,USA,BruceBaker,Bruce Baker
1,Calliope Collins,USA,CalliopeCollins,Calliope Collins
2,Emma Evans,HKG,EmmaEvans,Emma Evans
3,Linda Lewis,HKG,LindaLewis,Linda Lewis
4,Peter Parker,NZL,PeterParker,Peter Parker
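
Spaces at the ends of a value are hard to see in a rendered table, so the difference is easier to check on plain Python strings with `repr`. The sketch below is only an illustration: the padded values are made-up examples, not rows read from the sample file.

```python
# Made-up example values (assumption): one padded with spaces, one with a tab and newline.
raw = " Bruce Baker "
raw_tab = "\tBruce Baker\n"

removed_all = raw.replace(" ", "")  # drops every space, including the inner one
trimmed = raw.strip()               # drops only leading/trailing whitespace

print(repr(raw))          # ' Bruce Baker '
print(repr(removed_all))  # 'BruceBaker'
print(repr(trimmed))      # 'Bruce Baker'

# strip() with no argument also trims tabs and newlines,
# while replace(" ", "") only touches the space character.
print(repr(raw_tab.strip()))           # 'Bruce Baker'
print(repr(raw_tab.replace(" ", "")))  # '\tBruceBaker\n'
```

In short, strip is usually the safer choice when the inner spacing of a value carries meaning, as it does for names.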


- Let's look at why whitespace removal is necessary in the data cleaning step.
- Suppose you want to combine the "name" and "location" columns into one column.
- "name_loc_1" is the "name" column merged with the "location" column without removing any spaces first.  
  In row 0 there is a trailing space after "Bruce Baker", which is hard to spot in the "name" column but shows up in the merged value ("Bruce Baker _USA").
- "name_loc_2" and "name_loc_3" are merged after removing whitespace,  
  so every row follows the same format, with no stray space before the underscore.

In [7]:
df['name_loc_1'] = df['name'] + '_' + df['location']
df['name_loc_2'] = df['name_replace'] + '_' + df['location']
df['name_loc_3'] = df['name_strip'] + '_' + df['location']

In [8]:
print(df.shape)
df

(5, 7)


Unnamed: 0,name,location,name_replace,name_strip,name_loc_1,name_loc_2,name_loc_3
0,Bruce Baker,USA,BruceBaker,Bruce Baker,Bruce Baker _USA,BruceBaker_USA,Bruce Baker_USA
1,Calliope Collins,USA,CalliopeCollins,Calliope Collins,Calliope Collins_USA,CalliopeCollins_USA,Calliope Collins_USA
2,Emma Evans,HKG,EmmaEvans,Emma Evans,Emma Evans_HKG,EmmaEvans_HKG,Emma Evans_HKG
3,Linda Lewis,HKG,LindaLewis,Linda Lewis,Linda Lewis_HKG,LindaLewis_HKG,Linda Lewis_HKG
4,Peter Parker,NZL,PeterParker,Peter Parker,Peter Parker_NZL,PeterParker_NZL,Peter Parker_NZL
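
Before merging, it can help to find which rows carry hidden whitespace, and to see what it breaks. The following is a minimal sketch on a small hand-built frame; `sample` and its values are assumptions that stand in for the Excel data, with a trailing space placed on row 0 as described above.

```python
import pandas as pd

# Hypothetical stand-in for the sample data; row 0 carries a trailing space.
sample = pd.DataFrame({
    "name": ["Bruce Baker ", "Calliope Collins", "Emma Evans"],
    "location": ["USA", "USA", "HKG"],
})

# True where stripping changes the value, i.e. the value has edge whitespace.
has_edge_space = sample["name"] != sample["name"].str.strip()
print(sample.loc[has_edge_space, "name"].map(repr))

# An exact-match filter silently misses the padded value until it is cleaned.
matches_raw = (sample["name"] == "Bruce Baker").sum()
matches_clean = (sample["name"].str.strip() == "Bruce Baker").sum()
print(matches_raw, matches_clean)  # 0 1
```

The same mismatch can affect any operation that compares strings exactly, such as filtering, grouping, or joining on the column.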


In [10]:
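def bg_color(value, color):
    # Return a CSS background-color declaration for the styled cell
    return f'background-color: {color}'

# Highlight row 0 of the three merged columns to compare the results.
# Note: pandas 2.1+ deprecates Styler.applymap in favor of Styler.map.
df.style.applymap(bg_color, color='#ff9090', subset=pd.IndexSlice[0, ['name_loc_1', 'name_loc_2', 'name_loc_3']])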

Unnamed: 0,name,location,name_replace,name_strip,name_loc_1,name_loc_2,name_loc_3
0,Bruce Baker,USA,BruceBaker,Bruce Baker,Bruce Baker _USA,BruceBaker_USA,Bruce Baker_USA
1,Calliope Collins,USA,CalliopeCollins,Calliope Collins,Calliope Collins_USA,CalliopeCollins_USA,Calliope Collins_USA
2,Emma Evans,HKG,EmmaEvans,Emma Evans,Emma Evans_HKG,EmmaEvans_HKG,Emma Evans_HKG
3,Linda Lewis,HKG,LindaLewis,Linda Lewis,Linda Lewis_HKG,LindaLewis_HKG,Linda Lewis_HKG
4,Peter Parker,NZL,PeterParker,Peter Parker,Peter Parker_NZL,PeterParker_NZL,Peter Parker_NZL
