# '96 Sonics: 3 Regex Methods to Split Names

## Table of Contents
1. Introduction
2. Install & Import Packages
3. Scrape and Display Logo
4. Scrape Roster and Convert to Dataframe
5. **Method 1**: .replace()
6. **Method 2**: splitname function, .apply()
7. **Method 3**: .extract(), dictionarize

## 1. Introduction

Today, we'll walk through 3 simple regex methods to split names into first and last names in a dataframe. We'll work with roster data for the '96 Seattle SuperSonics, one of my all-time favorite teams. Led by Payton, Kemp, Schrempf and coach George Karl, they reached the Finals that year, losing to the Bulls in 6. 

We'll use BeautifulSoup to scrape roster data from the team's Basketball Reference page (https://www.basketball-reference.com/teams/SEA/1996.html) and convert it to a dataframe. Once we have our players, we'll split player names into first and last names using 3 regex methods, walking through the regex logic and methodology in each.

This simple exercise could be useful for anyone working with string name fields in, for example, customer, applicant, or patient data where first and last names are combined. A similar logic can be used for any string fields that need to be separated (e.g., countries and their capitals). Let's dive in. 

## 2. Install & Import Packages 

In [41]:
import pandas as pd
import numpy as np

# Web scraping using BeautifulSoup and converting to pandas dataframe
import requests 
import urllib.request 
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
from urllib.request import urlopen
from bs4 import BeautifulSoup
!pip install lxml # Install lxml parser as it's faster than the built-in html parser

# Displaying images
from IPython.display import Image
from IPython.display import HTML



## 3. Scrape and Display Logo

In [42]:
# Load the logo from its image url using the Image class we imported from IPython.display
Image(url="https://d2p3bygnnzw9w3.cloudfront.net/req/202008171/tlogo/bbr/SEA-1996.png", width=180, height=90)

## 4. Scrape Roster and Convert to Dataframe

In [43]:
# Specify url and get html from page
url = "https://www.basketball-reference.com/teams/SEA/1996.html"
html = urlopen(url)

In [44]:
# Create BeautifulSoup object using lxml parser we imported
soup = BeautifulSoup(html, 'lxml')
type(soup)

bs4.BeautifulSoup

In [45]:
# Print title of the page; we see that it's the 1995-96 Seattle SuperSonics Roster and Stats page
title = soup.title
print(title)

<title>1995-96 Seattle SuperSonics Roster and Stats | Basketball-Reference.com</title>


In [46]:
# extracting the raw table inside that webpage
table = soup.find_all('table')

In [47]:
# Scrape just the table for Sonics roster, which is the 1st table and convert it into a dataframe
sonics = pd.read_html(str(table[0]), index_col=None, header=0)[0]
sonics

Unnamed: 0,No.,Player,Pos,Ht,Wt,Birth Date,Unnamed: 6,Exp,College
0,2,Vincent Askew,SG,6-6,210,"February 28, 1966",us,6,Memphis
1,34,Frank Brickowski,C,6-9,240,"August 14, 1959",us,10,Penn State
2,1,Sherell Ford,SF,6-7,210,"August 26, 1972",us,R,University of Illinois at Chicago
3,33,Hersey Hawkins,SG,6-3,190,"September 29, 1966",us,7,Bradley
4,50,Ervin Johnson,C,6-11,245,"December 21, 1967",us,2,New Orleans
5,40,Shawn Kemp,PF,6-10,230,"November 26, 1969",us,6,Trinity Valley CC
6,10,Nate McMillan,PG,6-5,195,"August 3, 1964",us,9,NC State
7,20,Gary Payton,PG,6-4,180,"July 23, 1968",us,5,Oregon State
8,14,Sam Perkins,PF,6-9,235,"June 14, 1961",us,11,UNC
9,55,Steve Scheffler,C,6-9,250,"September 3, 1967",us,5,Purdue


In [48]:
# Keep only the 'Player' column as that's the only one we'll need here
keep=['Player']
sonics = sonics[keep]
sonics

Unnamed: 0,Player
0,Vincent Askew
1,Frank Brickowski
2,Sherell Ford
3,Hersey Hawkins
4,Ervin Johnson
5,Shawn Kemp
6,Nate McMillan
7,Gary Payton
8,Sam Perkins
9,Steve Scheffler


In [49]:
# Check dataframe info: we have 13 players, all of data type object
sonics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 1 columns):
Player    13 non-null object
dtypes: object(1)
memory usage: 184.0+ bytes


## 5. Method 1: .replace()

Create columns for first and last names and populate each with the full player name. For first names, we'll replace the first space and every character after it with an empty string. For last names, we'll replace every character up to and including the last space with an empty string.
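To see what the two patterns do before applying them to a whole column, here is a minimal sketch on single strings using Python's built-in `re` module. The helper name `split_with_replace` and the names "Mary Ann Smith" and "Cher" are hypothetical, chosen only to show how the patterns behave on multi-part and single-word names.

```python
import re

def split_with_replace(full_name):
    """Mimic Method 1 on one string (hypothetical helper, not part of the notebook)."""
    # [ ].* matches the first space and everything after it
    first = re.sub(r"[ ].*", "", full_name)
    # .*[ ] is greedy, so it matches everything up to and including the LAST space
    last = re.sub(r".*[ ]", "", full_name)
    return first, last

print(split_with_replace("Gary Payton"))     # ('Gary', 'Payton')
print(split_with_replace("Mary Ann Smith"))  # ('Mary', 'Smith')
print(split_with_replace("Cher"))            # ('Cher', 'Cher')
```

Note the last case: when a name contains no space, neither pattern matches, so both columns end up holding the full name.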

In [50]:
# Create two new columns and apply a regex to the contents of the "Player" column.

# Create 'First' column as a copy of the 'Player' column
sonics['First']=sonics['Player']

# Replace the first space and all characters after it with an empty string
# [ ].*: [ ] means a space, . means any single character, * means zero or more times
sonics['First']=sonics['First'].replace("[ ].*", "", regex=True)

# Create 'Last' column as a copy of the 'Player' column
sonics['Last']=sonics['Player']

# Replace all characters up to and including the last space with an empty string
# .*[ ]: . means any single character, * means zero or more times (greedy, so it runs to the last space), [ ] means a space
sonics["Last"]=sonics["Last"].replace(".*[ ]", "", regex=True)

# Taking a look, we see the names split into first and last name columns
sonics

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,Player,First,Last
0,Vincent Askew,Vincent,Askew
1,Frank Brickowski,Frank,Brickowski
2,Sherell Ford,Sherell,Ford
3,Hersey Hawkins,Hersey,Hawkins
4,Ervin Johnson,Ervin,Johnson
5,Shawn Kemp,Shawn,Kemp
6,Nate McMillan,Nate,McMillan
7,Gary Payton,Gary,Payton
8,Sam Perkins,Sam,Perkins
9,Steve Scheffler,Steve,Scheffler
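
The "A value is trying to be set on a copy of a slice" messages in the output above are pandas' SettingWithCopyWarning. They appear because `sonics` was created by selecting columns from another dataframe and then had new columns assigned to it. One common way to avoid the warning is to take an explicit copy when selecting the column. The sketch below assumes a small hypothetical `roster` dataframe standing in for the scraped table.

```python
import pandas as pd

# Hypothetical stand-in for the scraped roster table
roster = pd.DataFrame({"No.": [20, 40], "Player": ["Gary Payton", "Shawn Kemp"]})

# .copy() returns an independent dataframe, so adding columns to it
# is no longer treated as writing to a slice of `roster`
players = roster[["Player"]].copy()

# Same regexes as Method 1, applied directly to the 'Player' column
players["First"] = players["Player"].replace(r"[ ].*", "", regex=True)
players["Last"] = players["Player"].replace(r".*[ ]", "", regex=True)

print(players)
```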


## 6. Method 2: splitname function, .apply()

We'll define a function, splitname, that takes a single argument, row: a Series holding one row of the dataframe, indexed by column name. For each row, we split the player name on the space (" ") and store the first piece ([0]) as a new 'First' entry in the series. We do the same for 'Last', but store the last piece ([-1]). Lastly, we call .apply() on the dataframe with axis='columns' so the function runs once per row (.apply automatically assembles the returned series back into a dataframe).

In [51]:
# Delete 'First' and 'Last' columns so we have only our original 'Player' column
del(sonics['First'], sonics['Last'])
sonics.head()

Unnamed: 0,Player
0,Vincent Askew
1,Frank Brickowski
2,Sherell Ford
3,Hersey Hawkins
4,Ervin Johnson


In [52]:
# Define splitname function that splits string into two pieces on single row of data
def splitname(row):
    row['First']=row['Player'].split(" ")[0] # Extract first name and create new entry in series
    row['Last']=row['Player'].split(" ")[-1] # Extract last name and create new entry in series
    return row

# Apply splitname function to each row of the dataframe
sonics = sonics.apply(splitname, axis='columns')

# Taking a look, we see the names split into first and last name columns
sonics

Unnamed: 0,Player,First,Last
0,Vincent Askew,Vincent,Askew
1,Frank Brickowski,Frank,Brickowski
2,Sherell Ford,Sherell,Ford
3,Hersey Hawkins,Hersey,Hawkins
4,Ervin Johnson,Ervin,Johnson
5,Shawn Kemp,Shawn,Kemp
6,Nate McMillan,Nate,McMillan
7,Gary Payton,Gary,Payton
8,Sam Perkins,Sam,Perkins
9,Steve Scheffler,Steve,Scheffler
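
The same split can also be written without a row-by-row function, using the `.str` accessor on the column. This is only a sketch of an equivalent approach, not the method used above, and the two-row `players` dataframe is a hypothetical sample.

```python
import pandas as pd

# Hypothetical two-row sample in place of the full roster
players = pd.DataFrame({"Player": ["Gary Payton", "Nate McMillan"]})

# Split every name on the space; each entry becomes a list of name pieces
parts = players["Player"].str.split(" ")

# .str[0] and .str[-1] index into each list, like [0] and [-1] in splitname
players["First"] = parts.str[0]
players["Last"] = parts.str[-1]

print(players)
```

For a roster of this size the difference is negligible, but the column-wise form avoids calling a Python function once per row.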


## 7. Method 3: .extract(), dictionarize

The .extract() method is available through the .str accessor of a Series. It takes a regex containing the groups we want to capture and returns each captured group as a column.

In [53]:
# Delete 'First' and 'Last' columns so we have only our original 'Player' column
del(sonics['First'], sonics['Last'])
sonics.head()

Unnamed: 0,Player
0,Vincent Askew
1,Frank Brickowski
2,Sherell Ford
3,Hersey Hawkins
4,Ervin Johnson


In [54]:
# Define regex pattern (as a raw string so the backslashes reach the regex engine unchanged)
# (^[\w]*): () for 1st group, ^ signifies start of string, [\w] means any word character, * means zero or more times
# (?:.* ): () for 2nd group, ?: means non-capturing, . means any character, * means zero or more times, followed by a literal space
# ([\w\-]*$): () for 3rd group, [\w\-] means any word character or hyphen (for hyphenated last names), * means zero or more times, $ signifies end of string

pattern = r"(^[\w]*)(?:.* )([\w\-]*$)"

# Extract pattern from Player names series and output as columns
sonics['Player'].str.extract(pattern)

Unnamed: 0,0,1
0,Vincent,Askew
1,Frank,Brickowski
2,Sherell,Ford
3,Hersey,Hawkins
4,Ervin,Johnson
5,Shawn,Kemp
6,Nate,McMillan
7,Gary,Payton
8,Sam,Perkins
9,Steve,Scheffler


In [55]:
# We can dictionarize (use named groups) to get columns labeled First and Last (instead of the 0 and 1 column headings above)
# (?P<First>^[\w]*): () for 1st group, ?P<First> means dictionary label 'First', ^ signifies start of string, [\w] means any word character, * means zero or more times
# (?:.* ): () for 2nd group, ?: means non-capturing, . means any character, * means zero or more times, followed by a literal space
# (?P<Last>[\w\-]*$): () for 3rd group, ?P<Last> means dictionary label 'Last', [\w\-] means any word character or hyphen (for hyphenated last names), * means zero or more times, $ signifies end of string

pattern=r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w\-]*$)"

# Now call extract
names=sonics['Player'].str.extract(pattern)
names

Unnamed: 0,First,Last
0,Vincent,Askew
1,Frank,Brickowski
2,Sherell,Ford
3,Hersey,Hawkins
4,Ervin,Johnson
5,Shawn,Kemp
6,Nate,McMillan
7,Gary,Payton
8,Sam,Perkins
9,Steve,Scheffler
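
To make the named groups more concrete, here is a small sketch that runs the same pattern on single strings with Python's built-in `re` module, where each named group becomes a dictionary key. The helper `extract_names` and the names "Mary Ann Smith-Jones" and "Cher" are hypothetical and only illustrate the hyphen and no-space cases.

```python
import re

# Same named-group pattern as above, compiled once
pattern = re.compile(r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w\-]*$)")

def extract_names(full_name):
    """Return {'First': ..., 'Last': ...} or None if the pattern does not match."""
    match = pattern.search(full_name)
    return match.groupdict() if match else None

print(extract_names("Gary Payton"))           # {'First': 'Gary', 'Last': 'Payton'}
print(extract_names("Mary Ann Smith-Jones"))  # {'First': 'Mary', 'Last': 'Smith-Jones'}
print(extract_names("Cher"))                  # None
```

The pattern requires at least one space, so a single-word name does not match at all; in a dataframe, .str.extract() reports such rows as missing values.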


In [56]:
# Add these first and last names to our sonics dataframe 
sonics['First']=names['First']
sonics['Last']=names['Last']
sonics

Unnamed: 0,Player,First,Last
0,Vincent Askew,Vincent,Askew
1,Frank Brickowski,Frank,Brickowski
2,Sherell Ford,Sherell,Ford
3,Hersey Hawkins,Hersey,Hawkins
4,Ervin Johnson,Ervin,Johnson
5,Shawn Kemp,Shawn,Kemp
6,Nate McMillan,Nate,McMillan
7,Gary Payton,Gary,Payton
8,Sam Perkins,Sam,Perkins
9,Steve Scheffler,Steve,Scheffler
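
Because .str.extract() returns a dataframe that shares the index of the original column, the extraction and the column assignment can also be combined with .join(). This is a sketch of that shortcut on a hypothetical two-row `players` dataframe, not a replacement for the steps above.

```python
import pandas as pd

# Hypothetical two-row sample in place of the full roster
players = pd.DataFrame({"Player": ["Shawn Kemp", "Sam Perkins"]})

pattern = r"(?P<First>^[\w]*)(?:.* )(?P<Last>[\w\-]*$)"

# extract() yields 'First' and 'Last' columns; join() aligns them on the index
players = players.join(players["Player"].str.extract(pattern))

print(players)
```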
