# Identifying Retired Players Likely for Hall of Fame Consideration
## Objective
To create a list of retired players who are strong candidates for Hall of Fame (HOF) consideration, based on their career performance and accolades. These players will serve as a critical part of the dataset, bridging the gap between clear HOF inductees and non-HOF players.

## Criteria for Inclusion
The selection of players is based on several key factors that align with typical HOF considerations:

1. Accolades
   - All-Star appearances (multiple selections signal consistent excellence).
   - Championships and Finals MVPs (team success often boosts a player’s HOF case).
   - Individual awards (e.g., Defensive Player of the Year, Rookie of the Year).
   - All-NBA Team selections (recognition as one of the best at their position).
2. Career Longevity
   - Players with long, consistent careers are more likely to be considered.
   - Those who achieved high peak performance but had shorter careers (e.g., Penny Hardaway) are also included.
3. Statistical Impact
   - Players with strong per-game averages or cumulative career stats (e.g., points, rebounds, assists, and advanced metrics).
   - Dominance in specific aspects of the game (e.g., elite defense, scoring, playmaking).
4. Role and Context
   - Players who were key contributors to successful teams, even if they weren’t the primary stars (e.g., Shawn Marion, Rasheed Wallace).
   - International impact or contributions to basketball outside the NBA (e.g., Manu Ginóbili).
5. Exclusions
   - Players who were primarily role players without consistent individual accolades.
   - Players who were active but didn’t leave a significant statistical or historical impact.

## Categorization of Players
To ensure a well-rounded dataset, players are grouped by position (guards, forwards, centers), since expectations and performance metrics vary by role.
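
As a minimal sketch of this grouping, the position abbreviations used by the scraping code in this notebook (`PG`, `SG`, `SF`, `PF`, `C`) could be collapsed into the three groups. The `POSITION_GROUPS` mapping and the `position_group` helper are hypothetical names that do not appear in the original code, and returning `'unknown'` for unmapped or missing values is an assumption.

```python
# Hypothetical mapping from position abbreviation to positional group
POSITION_GROUPS = {
    'PG': 'guard',
    'SG': 'guard',
    'SF': 'forward',
    'PF': 'forward',
    'C': 'center',
}

def position_group(position):
    """Return 'guard', 'forward', or 'center'; 'unknown' if unmapped or missing."""
    if position is None:
        return 'unknown'
    # Normalize whitespace and case before the lookup
    return POSITION_GROUPS.get(position.strip().upper(), 'unknown')
```

Keeping the five-way abbreviation in the dataset and deriving the group from it leaves both levels of detail available when training.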

Examples:
- Carmelo Anthony
- Joe Johnson
- Shawn Marion
- Dwight Howard

These players are not yet in the Hall of Fame but share many traits with those who have been inducted:

- Borderline Candidates: Their career accolades and stats make them comparable to existing HOF players, which helps differentiate the nuances of borderline cases.
- Contrast with Non-HOF Players: Including these players provides balance to the dataset, showcasing individuals who narrowly missed induction or are still waiting for eligibility.
- Predictive Challenges: These players test the model’s ability to distinguish between players with HOF potential and those who fall just short.

## Outcome
The resulting dataset will:

- Include ~100 retired borderline candidates with strong cases for HOF consideration.
- Ensure a balanced mix of accolades, career longevity, and statistical impact.
- Help the model learn patterns that separate borderline HOF players from clear inductees and non-HOF players.

## Next Steps
1. Scrape or collect player stats for the identified individuals.
2. Normalize stats and accolades to account for career length.
3. Include contextual features (e.g., team success, peak performance).
4. Integrate this data with HOF and non-HOF players for model training.
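
The normalization step could look something like the sketch below, which divides counting stats and accolades by career length to get per-season rates. The keys match the `player_data` dictionary built in the scraping cells of this notebook. The `normalize_by_career_length` helper, the `_per_season` suffix, and the choice of which fields to normalize are assumptions made for illustration, not the method used in the original analysis.

```python
# Counting fields to express per season; this selection is an assumption
COUNT_FIELDS = ['Games', 'Win Shares', 'All-Stars', 'All-NBA', 'All-Defense']

def normalize_by_career_length(player, fields=COUNT_FIELDS):
    """Return a copy of the player dict with '<field>_per_season' entries added."""
    seasons = player.get('Career Length') or 0
    normalized = dict(player)
    for field in fields:
        value = player.get(field, 0)
        # Fall back to 0.0 when career length is missing or zero
        normalized[f'{field}_per_season'] = round(value / seasons, 3) if seasons > 0 else 0.0
    return normalized

# Example row using values from the first output table in this notebook
carmelo = {
    'Name': 'Carmelo Anthony', 'Games': 1260, 'Career Length': 19,
    'Win Shares': 108.5, 'All-Stars': 10, 'All-NBA': 6, 'All-Defense': 0,
}
carmelo_rates = normalize_by_career_length(carmelo)
```

Per-game averages such as PPG are already rate stats, so they are left untouched here.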

In [1]:
#Import libraries 
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

In [2]:
#Set display to max columns
pd.set_option('display.max_columns', None)

In [3]:
#Scrape players likely to be considered for the hall of fame

#Paste player links
player_urls = [
    "https://www.basketball-reference.com/players/a/anthoca01.html", #Carmelo Anthony
    "https://www.basketball-reference.com/players/m/mariosh01.html", #Shawn Marion
    "https://www.basketball-reference.com/players/w/wallara01.html", #Rasheed Wallace
    "https://www.basketball-reference.com/players/h/howardw01.html", #Dwight Howard
    "https://www.basketball-reference.com/players/a/arenagi01.html", #Gilbert Arenas
    "https://www.basketball-reference.com/players/m/millean02.html", #Andre Miller
    "https://www.basketball-reference.com/players/h/hardaan01.html", #Penny Hardaway
    "https://www.basketball-reference.com/players/r/rosede01.html", #Derrick Rose
    "https://www.basketball-reference.com/players/l/lewisfr02.html", #Freddie Lewis
    "https://www.basketball-reference.com/players/s/stoudam01.html", #Amar'e Stoudemire
    "https://www.basketball-reference.com/players/b/brandel01.html", #Elton Brand
    "https://www.basketball-reference.com/players/j/jamisan01.html", #Antawn Jamison
    "https://www.basketball-reference.com/players/s/smithjo03.html", #Josh Smith
    "https://www.basketball-reference.com/players/j/jeffeal01.html", #Al Jefferson
    "https://www.basketball-reference.com/players/c/chandty01.html", #Tyson Chandler
    "https://www.basketball-reference.com/players/m/millebr01.html", #Brad Miller
    "https://www.basketball-reference.com/players/l/laimbbi01.html", #Bill Laimbeer
    "https://www.basketball-reference.com/players/k/kempsh01.html", #Shawn Kemp
    "https://www.basketball-reference.com/players/j/johnsjo02.html", #Joe Johnson
    "https://www.basketball-reference.com/players/c/chambto01.html", #Tom Chambers
    "https://www.basketball-reference.com/players/j/johnske02.html", #Kevin Johnson
    "https://www.basketball-reference.com/players/g/greenac01.html", #A.C. Green
    "https://www.basketball-reference.com/players/g/grantho01.html", #Horace Grant
    "https://www.basketball-reference.com/players/h/horryro01.html", #Robert Horry
    "https://www.basketball-reference.com/players/r/ricegl01.html", #Glen Rice
    "https://www.basketball-reference.com/players/r/roybr01.html", #Brandon Roy
    "https://www.basketball-reference.com/players/b/bridgbi01.html", #Bill Bridges
    "https://www.basketball-reference.com/players/z/zasloma01.html", #Max Zaslofsky
    "https://www.basketball-reference.com/players/b/blackro01.html" #Rolando Blackman
]

#Headers for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

#Initialize a list to store all players' data
all_players_data = []

#Player position abbreviations
position_map = {
    'Center': 'C',
    'Power Forward': 'PF',
    'Small Forward': 'SF',
    'Shooting Guard': 'SG',
    'Point Guard': 'PG'
}

#Function to clean and extract player position
def get_position(soup):
    position = None  #Default position if not found
    p_elems = soup.find_all('p')

    for p_elem in p_elems:
        if 'Position:' in p_elem.get_text():
            position_text = p_elem.get_text(separator=" ").split("Position:")[1].strip()
            position_text = position_text.replace('▪', '').strip()
            if 'Shoots:' in position_text:
                position_text = position_text.split('Shoots:')[0].strip()
            position_text = " ".join(position_text.split())
            position_text = position_text.replace(' and ', ', ')
            positions = position_text.split(',')
            primary_position = positions[0].strip()
            position = position_map.get(primary_position, primary_position)
            break
    return position



#Function to safely extract MVP count
def get_mvp_count(soup):
    mvp_count = 0
    mvp_elem = soup.find_all('li', {'class': 'poptip'}, string=lambda s: s and 'MVP' in s and 'Finals' not in s and 'AS' not in s and 'MBWA NBA' not in s and 'USBWA MVP' not in s)
    for elem in mvp_elem:
        text = elem.text.strip()
        if 'x' in text:
            mvp_count += int(text.split('x')[0])  #For players with multiple MVPs, split on 'x' and extract the first element
        else:
            mvp_count += 1  #For players with only 1 MVP
    return mvp_count

#Function to safely extract Scoring Championships count
def get_scoring_champ_count(soup):
    scoring_champ_count = 0
    scoring_champ_elem = soup.find_all('li', {'data-tip': lambda x: x and 'NBA Scoring Champ' in x})
    for elem in scoring_champ_elem:
        text = elem.text.strip()
        if 'x' in text:
            scoring_champ_count += int(text.split('x')[0])  # For players with multiple scoring championships, split on 'x'
        else:
            scoring_champ_count += 1  # For players with just 1 scoring championship
    return scoring_champ_count

#Function to safely extract NBA Championships count
def get_chips_count(soup):
    chips_count = 0
    chip_elem = soup.find_all('li', class_='', string=lambda s: s and ('NBA Champ' in s or 'ABA Champ' in s or 'BAA Champ' in s))
    for elem in chip_elem:
        text = elem.text.strip()
        if 'x' in text:
            chips_count += int(text.split('x')[0])  #For players with multiple championships, split on 'x'
        else:
            chips_count += 1  #For players with 1 championship
    return chips_count

#Loop over each player URL to scrape their data
for url in player_urls:
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        try:
            #Get player name
            player_name = soup.find('h1').find('span').text.strip()

            #Get player position
            position = get_position(soup)

            #Get player career length
            career_length_elem = soup.find('strong', string=lambda s: s and 'Career Length:' in s)
            career_length = int(career_length_elem.next_sibling.strip().split()[0])

            #Safeguard function for extracting stats
            def safe_find(tag, text):
                element = soup.find('span', {'data-tip': text})
                if element:
                    return element.find_next('p').find_next('p').text.strip()
                return 0.0

            games = int(safe_find('Games', 'Games'))
            ppg = float(safe_find('Points', 'Points'))
            rpg = float(safe_find('Total Rebounds', 'Total Rebounds'))
            apg = float(safe_find('Assists', 'Assists'))

            #Extract PER
            per_elem = soup.find('span', {'data-tip': lambda x: x and 'Player Efficiency Rating' in x})
            per = float(per_elem.find_next('p').find_next('p').text.strip() if per_elem else 0.0)

            #Extract Field Goal Percentage and Free Throw Percentage
            fg_pct = float(safe_find('Field Goal Percentage', 'Field Goal Percentage'))
            ft_pct = float(safe_find('Free Throw Percentage', 'Free Throw Percentage'))

            #Extract Win Shares
            win_shares_elem = soup.find('span', {'data-tip': lambda x: x and 'Win Shares' in x})
            win_shares = float(win_shares_elem.next_sibling.find_next('p').text.strip() if win_shares_elem else 0.0)
            
            #Extract awards counts
            mvp_count = get_mvp_count(soup)
            scoring_champ_count = get_scoring_champ_count(soup)
            chips_count = get_chips_count(soup)

            #Extract All-Stars, All-NBA, All-Defense, and other honors
            all_stars = int(soup.find('li', {'class': 'all_star'}).find('a').text.strip().split('x')[0] if soup.find('li', {'class': 'all_star'}) else 0)
            
            #Extract All-NBA, All-ABA, or All-BAA awards
            def award_count(soup, award_name, tag='a', attributes=None):
                #Find the award element
                elem = soup.find(tag, attributes, string=lambda s: s and award_name in s)
                if elem:
                    if 'x' in elem.text:
                        return int(elem.text.strip().split('x')[0])
                    else:
                        return 1
                else:
                    return 0
            all_nba = award_count(soup, 'All-NBA', tag='li', attributes={'class':  ""})
            all_aba = award_count(soup, 'All-ABA')
            all_baa = award_count(soup, 'All-BAA')
            all_nba_total = all_nba + all_aba + all_baa
            
            #Count All-Defensive selections: "Nx All-Defensive" gives N, a single entry gives 1, none gives 0
            all_defense_count = award_count(soup, 'All-Defensive', tag='li', attributes={'class': 'poptip'})
            all_rookie = 1 if soup.find('li', {'data-tip': lambda s: s and 'All-Rookie' in s}) else 0
            roy = 1 if soup.find('li', {'class': 'poptip'}, string=lambda s: s and 'ROY' in s and 'MBWA NBA' not in s) else 0
            dpoy_count = sum([int(text.split('x')[0]) if 'x' in text else 1 for text in [elem.text.strip() for elem in soup.find_all('li', class_='poptip', string=lambda s: s and 'Def. POY' in s)]])
            
            #Players aren't in the hall of fame so set to 0
            hof = 0

            #Store the data
            player_data = {
                'Name': player_name,
                'Position': position,
                'Games': games,
                'Career Length': career_length,
                'PPG': ppg,
                'RPG': rpg,
                'APG': apg,
                'PER': per,
                'FG%': fg_pct,
                'FT%': ft_pct,
                'Win Shares': win_shares,
                'All-Stars': all_stars,
                'All-NBA': all_nba_total,
                'All-Defense': all_defense_count,
                'All-Rookie Team': all_rookie,
                'MVPs': mvp_count,
                'Chips': chips_count,
                'ROY': roy,
                'DPOYs': dpoy_count,
                'Scoring Champ': scoring_champ_count,
                'HOF': hof
            }
            all_players_data.append(player_data)

        except Exception as e:
            print(f"Error scraping data for {url}: {e}")

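    else:
        #Report failed requests instead of skipping them silently
        print(f"Request for {url} failed with status {response.status_code}")

    time.sleep(1)  #Be polite with requests, avoid overwhelming the server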

#Create a DataFrame from the collected data
df_one = pd.DataFrame(all_players_data)

In [4]:
df_one

Unnamed: 0,Name,Position,Games,Career Length,PPG,RPG,APG,PER,FG%,FT%,Win Shares,All-Stars,All-NBA,All-Defense,All-Rookie Team,MVPs,Chips,ROY,DPOYs,Scoring Champ,HOF
0,Carmelo Anthony,SF,1260,19,22.5,6.2,2.7,19.5,44.7,81.4,108.5,10,6,0,1,0,0,0,0,1,0
1,Shawn Marion,SF,1163,16,15.2,8.7,1.9,18.8,48.4,81.0,124.9,4,2,0,1,0,1,0,0,0,0
2,Rasheed Wallace,PF,1109,16,14.4,6.7,1.8,17.0,46.7,72.1,105.1,4,0,0,1,0,1,0,0,0,0
3,Dwight Howard,C,1242,18,15.7,11.8,1.3,21.3,58.7,56.7,141.7,8,8,5,1,0,1,0,3,0,0
4,Gilbert Arenas,PG,552,11,20.7,3.9,5.3,19.6,42.1,80.3,51.3,3,3,0,0,0,0,0,0,0,0
5,Andre Miller,PG,1304,17,12.5,3.7,6.5,17.4,46.1,80.7,100.8,0,0,0,1,0,0,0,0,0,0
6,Anfernee Hardaway,SG,704,14,15.2,4.5,5.0,17.4,45.8,77.4,61.9,4,3,0,1,0,0,0,0,0,0
7,Derrick Rose,PG,723,15,17.4,3.2,5.2,18.0,45.6,83.1,44.6,3,1,0,1,1,0,1,0,0,0
8,Freddie Lewis,PG,750,11,16.0,3.7,4.0,15.0,43.2,81.7,53.8,3,0,0,0,0,3,0,0,0,0
9,Amar'e Stoudemire,PF,846,14,18.9,7.8,1.2,21.8,53.7,76.1,92.5,6,5,0,1,0,0,1,0,0,0
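
The award helpers in the scraping cell all repeat the same step: turn text such as `3x Def. POY` into the integer 3, and treat an entry with no multiplier as 1. A minimal sketch of a shared parser follows. `parse_award_count` is a hypothetical name, the sample strings are illustrative, and matching the multiplier with a regular expression (instead of checking for any `x` in the text) is a swapped-in approach, chosen so that an award name that happens to contain the letter x is not misread.

```python
import re

# Matches a leading multiplier such as "3x" at the start of the award text
_MULTIPLIER = re.compile(r'^\s*(\d+)x\b')

def parse_award_count(text):
    """Return the leading 'Nx' multiplier as an int, 1 if absent, 0 for empty text."""
    if not text or not text.strip():
        return 0
    match = _MULTIPLIER.match(text)
    return int(match.group(1)) if match else 1
```

With a helper like this, each award extractor would only need to locate its elements and sum `parse_award_count(elem.text)` over them.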


In [5]:
#Next batch of players

#Paste player links
player_urls = [
    "https://www.basketball-reference.com/players/a/allento01.html", #Tony Allen
    "https://www.basketball-reference.com/players/r/roberal01.html", #Alvin Robertson
    "https://www.basketball-reference.com/players/b/bridgbi01.html", #Bill Bridges
    "https://www.basketball-reference.com/players/s/schrede01.html", #Detlef Schrempf
    "https://www.basketball-reference.com/players/a/aldrila01.html", #LaMarcus Aldridge
    "https://www.basketball-reference.com/players/j/johnsma01.html", #Marques Johnson
    "https://www.basketball-reference.com/players/o/onealje01.html", #Jermaine O'Neal
    "https://www.basketball-reference.com/players/c/calvima01.html", #Mack Calvin
    "https://www.basketball-reference.com/players/g/gasolma01.html", #Marc Gasol
    "https://www.basketball-reference.com/players/w/willibu01.html", #Buck Williams
    "https://www.basketball-reference.com/players/p/portete01.html", #Terry Porter
    "https://www.basketball-reference.com/players/n/nancela01.html", #Larry Nance Sr.
    "https://www.basketball-reference.com/players/h/hornaje01.html", #Jeff Hornacek
    "https://www.basketball-reference.com/players/t/thorpot01.html", #Otis Thorpe
    "https://www.basketball-reference.com/players/p/perkisa01.html", #Sam Perkins
    "https://www.basketball-reference.com/players/j/jonesed02.html", #Eddie Jones
    "https://www.basketball-reference.com/players/c/cummite01.html", #Terry Cummings
    "https://www.basketball-reference.com/players/l/lewisra02.html", #Rashard Lewis
    "https://www.basketball-reference.com/players/e/eatonma01.html", #Mark Eaton
    "https://www.basketball-reference.com/players/s/silasja01.html", #James Silas
    "https://www.basketball-reference.com/players/l/lucasma01.html", #Maurice Lucas
    "https://www.basketball-reference.com/players/p/pricema01.html", #Mark Price
    "https://www.basketball-reference.com/players/j/jonesji01.html", #Jimmy Jones
    "https://www.basketball-reference.com/players/n/noahjo01.html", #Joakim Noah
    "https://www.basketball-reference.com/players/c/couside01.html", #DeMarcus Cousins
    "https://www.basketball-reference.com/players/f/freemdo01.html", #Donnie Freeman
    "https://www.basketball-reference.com/players/w/willigu01.html", #Gus Williams
    "https://www.basketball-reference.com/players/i/iguodan01.html", #Andre Iguodala
    "https://www.basketball-reference.com/players/b/boonero01.html", #Ron Boone
    "https://www.basketball-reference.com/players/j/jabalwa01.html", #Warren Jabali
    "https://www.basketball-reference.com/players/j/jonesla01.html", #Larry Jones
    "https://www.basketball-reference.com/players/l/lovebo01.html", #Bob Love
    "https://www.basketball-reference.com/players/n/netolbo01.html", #Bob Netolicky
    "https://www.basketball-reference.com/players/s/sprewla01.html", #Latrell Sprewell
    "https://www.basketball-reference.com/players/w/willide01.html", #Deron Williams
    "https://www.basketball-reference.com/players/a/artesro01.html", #Metta World Peace (Ron Artest)
    "https://www.basketball-reference.com/players/b/bakervi01.html" #Vin Baker
]

#Headers for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

#Initialize a list to store all players' data
all_players_data = []

#Player position abbreviations
position_map = {
    'Center': 'C',
    'Power Forward': 'PF',
    'Small Forward': 'SF',
    'Shooting Guard': 'SG',
    'Point Guard': 'PG'
}

#Function to clean and extract player position
def get_position(soup):
    position = None  #Default position if not found
    p_elems = soup.find_all('p')

    for p_elem in p_elems:
        if 'Position:' in p_elem.get_text():
            position_text = p_elem.get_text(separator=" ").split("Position:")[1].strip()
            position_text = position_text.replace('▪', '').strip()
            if 'Shoots:' in position_text:
                position_text = position_text.split('Shoots:')[0].strip()
            position_text = " ".join(position_text.split())
            position_text = position_text.replace(' and ', ', ')
            positions = position_text.split(',')
            primary_position = positions[0].strip()
            position = position_map.get(primary_position, primary_position)
            break
    return position



#Function to safely extract MVP count
def get_mvp_count(soup):
    mvp_count = 0
    mvp_elem = soup.find_all('li', {'class': 'poptip'}, string=lambda s: s and 'MVP' in s and 'Finals' not in s and 'AS' not in s and 'MBWA NBA' not in s and 'USBWA MVP' not in s)
    for elem in mvp_elem:
        text = elem.text.strip()
        if 'x' in text:
            mvp_count += int(text.split('x')[0])  #For players with multiple MVPs, split on 'x' and extract the first element
        else:
            mvp_count += 1  #For players with only 1 MVP
    return mvp_count

#Function to safely extract Scoring Championships count
def get_scoring_champ_count(soup):
    scoring_champ_count = 0
    scoring_champ_elem = soup.find_all('li', {'data-tip': lambda x: x and 'NBA Scoring Champ' in x})
    for elem in scoring_champ_elem:
        text = elem.text.strip()
        if 'x' in text:
            scoring_champ_count += int(text.split('x')[0])  # For players with multiple scoring championships, split on 'x'
        else:
            scoring_champ_count += 1  # For players with just 1 scoring championship
    return scoring_champ_count

#Function to safely extract NBA Championships count
def get_chips_count(soup):
    chips_count = 0
    chip_elem = soup.find_all('li', class_='', string=lambda s: s and ('NBA Champ' in s or 'ABA Champ' in s or 'BAA Champ' in s))
    for elem in chip_elem:
        text = elem.text.strip()
        if 'x' in text:
            chips_count += int(text.split('x')[0])  #For players with multiple championships, split on 'x'
        else:
            chips_count += 1  #For players with 1 championship
    return chips_count

#Loop over each player URL to scrape their data
for url in player_urls:
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        try:
            #Get player name
            player_name = soup.find('h1').find('span').text.strip()

            #Get player position
            position = get_position(soup)

            #Get player career length
            career_length_elem = soup.find('strong', string=lambda s: s and 'Career Length:' in s)
            career_length = int(career_length_elem.next_sibling.strip().split()[0])

            #Safeguard function for extracting stats
            def safe_find(tag, text):
                element = soup.find('span', {'data-tip': text})
                if element:
                    return element.find_next('p').find_next('p').text.strip()
                return 0.0

            games = int(safe_find('Games', 'Games'))
            ppg = float(safe_find('Points', 'Points'))
            rpg = float(safe_find('Total Rebounds', 'Total Rebounds'))
            apg = float(safe_find('Assists', 'Assists'))

            #Extract PER
            per_elem = soup.find('span', {'data-tip': lambda x: x and 'Player Efficiency Rating' in x})
            per = float(per_elem.find_next('p').find_next('p').text.strip() if per_elem else 0.0)

            #Extract Field Goal Percentage and Free Throw Percentage
            fg_pct = float(safe_find('Field Goal Percentage', 'Field Goal Percentage'))
            ft_pct = float(safe_find('Free Throw Percentage', 'Free Throw Percentage'))

            #Extract Win Shares
            win_shares_elem = soup.find('span', {'data-tip': lambda x: x and 'Win Shares' in x})
            win_shares = float(win_shares_elem.next_sibling.find_next('p').text.strip() if win_shares_elem else 0.0)
            
            #Extract awards counts
            mvp_count = get_mvp_count(soup)
            scoring_champ_count = get_scoring_champ_count(soup)
            chips_count = get_chips_count(soup)

            #Extract All-Stars, All-NBA, All-Defense, and other honors
            all_stars = int(soup.find('li', {'class': 'all_star'}).find('a').text.strip().split('x')[0] if soup.find('li', {'class': 'all_star'}) else 0)
            
            #Extract All-NBA, All-ABA, or All-BAA awards
            def award_count(soup, award_name, tag='a', attributes=None):
                #Find the award element
                elem = soup.find(tag, attributes, string=lambda s: s and award_name in s)
                if elem:
                    if 'x' in elem.text:
                        return int(elem.text.strip().split('x')[0])
                    else:
                        return 1
                else:
                    return 0
            all_nba = award_count(soup, 'All-NBA', tag='li', attributes={'class':  ""})
            all_aba = award_count(soup, 'All-ABA')
            all_baa = award_count(soup, 'All-BAA')
            all_nba_total = all_nba + all_aba + all_baa
            
            #Count All-Defensive selections: "Nx All-Defensive" gives N, a single entry gives 1, none gives 0
            all_defense_count = award_count(soup, 'All-Defensive', tag='li', attributes={'class': 'poptip'})
            all_rookie = 1 if soup.find('li', {'data-tip': lambda s: s and 'All-Rookie' in s}) else 0
            roy = 1 if soup.find('li', {'class': 'poptip'}, string=lambda s: s and 'ROY' in s and 'MBWA NBA' not in s) else 0
            dpoy_count = sum([int(text.split('x')[0]) if 'x' in text else 1 for text in [elem.text.strip() for elem in soup.find_all('li', class_='poptip', string=lambda s: s and 'Def. POY' in s)]])
            
            #Players aren't in the hall of fame so set to 0
            hof = 0

            #Store the data
            player_data = {
                'Name': player_name,
                'Position': position,
                'Games': games,
                'Career Length': career_length,
                'PPG': ppg,
                'RPG': rpg,
                'APG': apg,
                'PER': per,
                'FG%': fg_pct,
                'FT%': ft_pct,
                'Win Shares': win_shares,
                'All-Stars': all_stars,
                'All-NBA': all_nba_total,
                'All-Defense': all_defense_count,
                'All-Rookie Team': all_rookie,
                'MVPs': mvp_count,
                'Chips': chips_count,
                'ROY': roy,
                'DPOYs': dpoy_count,
                'Scoring Champ': scoring_champ_count,
                'HOF': hof
            }
            all_players_data.append(player_data)

        except Exception as e:
            print(f"Error scraping data for {url}: {e}")

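    else:
        #Report failed requests instead of skipping them silently
        print(f"Request for {url} failed with status {response.status_code}")

    time.sleep(1)  #Be polite with requests, avoid overwhelming the server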

#Create a DataFrame from the collected data
df_two = pd.DataFrame(all_players_data)
df_two

Unnamed: 0,Name,Position,Games,Career Length,PPG,RPG,APG,PER,FG%,FT%,Win Shares,All-Stars,All-NBA,All-Defense,All-Rookie Team,MVPs,Chips,ROY,DPOYs,Scoring Champ,HOF
0,Tony Allen,SG,820,14,8.1,3.5,1.3,14.2,47.5,70.9,38.7,0,0,6,0,0,1,0,0,0,0
1,Alvin Robertson,SG,779,10,14.0,5.2,5.0,17.0,47.7,74.3,52.1,4,1,6,0,0,0,0,1,0,0
2,Bill Bridges,PF,926,13,11.9,11.9,2.8,14.5,44.2,69.3,59.9,3,0,2,0,0,1,0,0,0,0
3,Detlef Schrempf,SF,1136,16,13.9,6.2,3.4,17.2,49.1,80.3,109.5,3,1,0,0,0,0,0,0,0,0
4,LaMarcus Aldridge,PF,1076,16,19.1,8.1,1.9,20.7,49.3,81.3,115.7,7,5,0,1,0,0,0,0,0,0
5,Marques Johnson,SF,691,11,20.1,7.0,3.6,20.1,51.8,73.9,79.8,5,3,0,1,0,0,0,0,0,0
6,Jermaine O'Neal,C,1011,18,13.2,7.2,1.4,17.9,46.7,71.5,66.0,6,3,0,0,0,0,0,0,0,0
7,Mack Calvin,PG,755,11,16.1,2.5,4.8,17.4,44.7,86.3,60.4,5,4,0,1,0,0,0,0,0,0
8,Marc Gasol,C,891,13,14.0,7.4,3.4,18.0,48.1,77.6,85.3,3,2,1,1,0,1,0,1,0,0
9,Buck Williams,PF,1307,17,12.8,10.0,1.3,15.3,54.9,66.4,120.1,3,1,4,1,0,0,1,0,0,0
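
Because the URL lists are pasted in batches, the same player can be scraped twice: Bill Bridges appears in both the first and second lists, and Freddie Lewis in both the first and third. One possible guard, sketched here with a hypothetical `dedupe_urls` helper, is to drop repeated URLs before scraping while preserving their order.

```python
def dedupe_urls(*batches):
    """Merge URL batches into one list, keeping the first occurrence of each URL."""
    merged = [url for batch in batches for url in batch]
    # dict preserves insertion order, so the original ordering is kept
    return list(dict.fromkeys(merged))

# Shortened example batches (URLs taken from the lists in this notebook)
batch_one = [
    "https://www.basketball-reference.com/players/a/anthoca01.html",
    "https://www.basketball-reference.com/players/b/bridgbi01.html",
]
batch_two = [
    "https://www.basketball-reference.com/players/a/allento01.html",
    "https://www.basketball-reference.com/players/b/bridgbi01.html",
]
unique_urls = dedupe_urls(batch_one, batch_two)
```

If the batches have already been scraped, dropping duplicate rows by `Name` after concatenating the DataFrames would likely have a similar effect.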


In [6]:
#Next batch of players

#Paste player links
player_urls = [
    "https://www.basketball-reference.com/players/f/foustla01.html", #Larry Foust
    "https://www.basketball-reference.com/players/l/lewisfr02.html", #Freddie Lewis
    "https://www.basketball-reference.com/players/w/wisewi01.html", #Willie Wise
    "https://www.basketball-reference.com/players/b/birdsot01.html", #Otis Birdsong
    "https://www.basketball-reference.com/players/c/cambyma01.html", #Marcus Camby
    "https://www.basketball-reference.com/players/c/cheniph01.html", #Phil Chenier
    "https://www.basketball-reference.com/players/d/daughbr01.html", #Brad Daugherty
    "https://www.basketball-reference.com/players/v/vanlino01.html", #Norm Van Lier
    "https://www.basketball-reference.com/players/s/stojape01.html", #Peja Stojakovic
    "https://www.basketball-reference.com/players/s/stricro02.html", #Rod Strickland
    "https://www.basketball-reference.com/players/k/kenonla01.html", #Larry Kenon
    "https://www.basketball-reference.com/players/m/marbust01.html", #Stephon Marbury
    "https://www.basketball-reference.com/players/w/walljo01.html", #John Wall
    "https://www.basketball-reference.com/players/g/griffbl01.html", #Blake Griffin
    "https://www.basketball-reference.com/players/c/cartwbi01.html", #Bill Cartwright
    "https://www.basketball-reference.com/players/h/hamilri01.html", #Richard Hamilton
    "https://www.basketball-reference.com/players/a/aguirma01.html", #Mark Aguirre
    "https://www.basketball-reference.com/players/b/blaylmo01.html", #Mookie Blaylock
    "https://www.basketball-reference.com/players/m/mannida01.html", #Danny Manning
    "https://www.basketball-reference.com/players/f/finlemi01.html", #Michael Finley
    "https://www.basketball-reference.com/players/b/bowenbr01.html", #Bruce Bowen
    "https://www.basketball-reference.com/players/o/odomla01.html", #Lamar Odom
    "https://www.basketball-reference.com/players/s/scottby01.html", #Byron Scott
    "https://www.basketball-reference.com/players/e/elliose01.html", #Sean Elliott
    "https://www.basketball-reference.com/players/p/princta01.html", #Tayshaun Prince
    "https://www.basketball-reference.com/players/m/maxwece01.html", #Cedric Maxwell
    "https://www.basketball-reference.com/players/j/johnsvi01.html", #Vinnie Johnson
    "https://www.basketball-reference.com/players/b/brownpj01.html", #P.J. Brown
    "https://www.basketball-reference.com/players/e/ellisda01.html", #Dale Ellis
    "https://www.basketball-reference.com/players/h/harpede01.html", #Derek Harper
    "https://www.basketball-reference.com/players/s/smithst01.html", #Steve Smith
    "https://www.basketball-reference.com/players/b/boozeca01.html", #Carlos Boozer
    "https://www.basketball-reference.com/players/v/vandeki01.html", #Kiki Vandeweghe
    "https://www.basketball-reference.com/players/v/vanardi01.html", #Dick Van Arsdale 
    "https://www.basketball-reference.com/players/j/johnsla02.html", #Larry Johnson
    "https://www.basketball-reference.com/players/d/drewjo01.html", #John Drew
    "https://www.basketball-reference.com/players/t/theusre01.html", #Reggie Theus
    "https://www.basketball-reference.com/players/c/colemde01.html", #Derrick Coleman
    "https://www.basketball-reference.com/players/m/mullije01.html", #Jeff Mullins
    "https://www.basketball-reference.com/players/m/mitchmi01.html", #Mike Mitchell
    "https://www.basketball-reference.com/players/s/stackje01.html", #Jerry Stackhouse
]

#Headers for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

#Initialize a list to store all players' data
all_players_data = []

#Player position abbreviations
position_map = {
    'Center': 'C',
    'Power Forward': 'PF',
    'Small Forward': 'SF',
    'Shooting Guard': 'SG',
    'Point Guard': 'PG'
}

#Function to clean and extract player position
def get_position(soup):
    position = None  #Default position if not found
    p_elems = soup.find_all('p')

    for p_elem in p_elems:
        if 'Position:' in p_elem.get_text():
            position_text = p_elem.get_text(separator=" ").split("Position:")[1].strip()
            position_text = position_text.replace('▪', '').strip()
            if 'Shoots:' in position_text:
                position_text = position_text.split('Shoots:')[0].strip()
            position_text = " ".join(position_text.split())
            position_text = position_text.replace(' and ', ', ')
            positions = position_text.split(',')
            primary_position = positions[0].strip()
            position = position_map.get(primary_position, primary_position)
            break
    return position



#Function to safely extract MVP count
def get_mvp_count(soup):
    mvp_count = 0
    mvp_elem = soup.find_all('li', {'class': 'poptip'}, string=lambda s: s and 'MVP' in s and 'Finals' not in s and 'AS' not in s and 'MBWA NBA' not in s and 'USBWA MVP' not in s)
    for elem in mvp_elem:
        text = elem.text.strip()
        if 'x' in text:
            mvp_count += int(text.split('x')[0])  #For players with multiple MVPs, split on 'x' and extract the first element
        else:
            mvp_count += 1  #For players with only 1 MVP
    return mvp_count

#Function to safely extract Scoring Championships count
def get_scoring_champ_count(soup):
    scoring_champ_count = 0
    scoring_champ_elem = soup.find_all('li', {'data-tip': lambda x: x and 'NBA Scoring Champ' in x})
    for elem in scoring_champ_elem:
        text = elem.text.strip()
        if 'x' in text:
            scoring_champ_count += int(text.split('x')[0])  # For players with multiple scoring championships, split on 'x'
        else:
            scoring_champ_count += 1  # For players with just 1 scoring championship
    return scoring_champ_count

#Function to safely extract NBA Championships count
def get_chips_count(soup):
    chips_count = 0
    chip_elem = soup.find_all('li', class_='', string=lambda s: s and ('NBA Champ' in s or 'ABA Champ' in s or 'BAA Champ' in s))
    for elem in chip_elem:
        text = elem.text.strip()
        if 'x' in text:
            chips_count += int(text.split('x')[0])  #For players with multiple championships, split on 'x'
        else:
            chips_count += 1  #For players with 1 championship
    return chips_count

#Loop over each player URL to scrape their data
for url in player_urls:
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        try:
            #Get player name
            player_name = soup.find('h1').find('span').text.strip()

            #Get player position
            position = get_position(soup)

            #Get player career length
            career_length_elem = soup.find('strong', string=lambda s: s and 'Career Length:' in s)
            career_length = int(career_length_elem.next_sibling.strip().split()[0])

            #Safeguard function for extracting stats
            def safe_find(tag, text):
                element = soup.find('span', {'data-tip': text})
                if element:
                    return element.find_next('p').find_next('p').text.strip()
                return 0.0

            games = int(safe_find('Games', 'Games'))
            ppg = float(safe_find('Points', 'Points'))
            rpg = float(safe_find('Total Rebounds', 'Total Rebounds'))
            apg = float(safe_find('Assists', 'Assists'))

            #Extract PER
            per_elem = soup.find('span', {'data-tip': lambda x: x and 'Player Efficiency Rating' in x})
            per = float(per_elem.find_next('p').find_next('p').text.strip() if per_elem else 0.0)

            #Extract Field Goal Percentage and Free Throw Percentage
            fg_pct = float(safe_find('Field Goal Percentage', 'Field Goal Percentage'))
            ft_pct = float(safe_find('Free Throw Percentage', 'Free Throw Percentage'))

            #Extract Win Shares
            win_shares_elem = soup.find('span', {'data-tip': lambda x: x and 'Win Shares' in x})
            win_shares = float(win_shares_elem.next_sibling.find_next('p').text.strip() if win_shares_elem else 0.0)
            
            #Extract awards counts
            mvp_count = get_mvp_count(soup)
            scoring_champ_count = get_scoring_champ_count(soup)
            chips_count = get_chips_count(soup)

            #Extract All-Stars, All-NBA, All-Defense, and other honors
            all_stars = int(soup.find('li', {'class': 'all_star'}).find('a').text.strip().split('x')[0] if soup.find('li', {'class': 'all_star'}) else 0)
            
            #Extract All-NBA, All-ABA, or All-BAA awards
            def award_count(soup, award_name, tag='a', attributes=None):
                #Find the award element
                elem = soup.find(tag, attributes, string=lambda s: s and award_name in s)
                if elem:
                    if 'x' in elem.text:
                        return int(elem.text.strip().split('x')[0])
                    else:
                        return 1
                else:
                    return 0
            all_nba = award_count(soup, 'All-NBA', tag='li', attributes={'class':  ""})
            all_aba = award_count(soup, 'All-ABA')
            all_baa = award_count(soup, 'All-BAA')
            all_nba_total = all_nba + all_aba + all_baa
            
            #Count All-Defensive selections ("Nx All-Defensive" -> N, a single selection -> 1)
            all_defense_count = award_count(soup, 'All-Defensive', tag='li', attributes={'class': 'poptip'})
            all_rookie = 1 if soup.find('li', {'data-tip': lambda s: s and 'All-Rookie' in s}) else 0
            roy = 1 if soup.find('li', {'class': 'poptip'}, string=lambda s: s and 'ROY' in s and 'MBWA NBA' not in s) else 0
            dpoy_count = sum([int(text.split('x')[0]) if 'x' in text else 1 for text in [elem.text.strip() for elem in soup.find_all('li', class_='poptip', string=lambda s: s and 'Def. POY' in s)]])
            
            #These players aren't in the Hall of Fame, so set HOF to 0
            hof = 0

            #Store the data
            player_data = {
                'Name': player_name,
                'Position': position,
                'Games': games,
                'Career Length': career_length,
                'PPG': ppg,
                'RPG': rpg,
                'APG': apg,
                'PER': per,
                'FG%': fg_pct,
                'FT%': ft_pct,
                'Win Shares': win_shares,
                'All-Stars': all_stars,
                'All-NBA': all_nba_total,
                'All-Defense': all_defense_count,
                'All-Rookie Team': all_rookie,
                'MVPs': mvp_count,
                'Chips': chips_count,
                'ROY': roy,
                'DPOYs': dpoy_count,
                'Scoring Champ': scoring_champ_count,
                'HOF': hof
            }
            all_players_data.append(player_data)

        except Exception as e:
            print(f"Error scraping data for {url}: {e}")

        time.sleep(1)  #Be polite with requests, avoid overwhelming the server

#Create a DataFrame from the collected data
df_three = pd.DataFrame(all_players_data)
df_three

Error scraping data for https://www.basketball-reference.com/players/w/walljo01.html: 'NoneType' object has no attribute 'next_sibling'


Unnamed: 0,Name,Position,Games,Career Length,PPG,RPG,APG,PER,FG%,FT%,Win Shares,All-Stars,All-NBA,All-Defense,All-Rookie Team,MVPs,Chips,ROY,DPOYs,Scoring Champ,HOF
0,Larry Foust,C,817,12,13.7,9.8,1.7,19.8,40.5,74.1,79.2,8,2,0,0,0,0,0,0,0,0
1,Freddie Lewis,PG,750,11,16.0,3.7,4.0,15.0,43.2,81.7,53.8,3,0,0,0,0,3,0,0,0,0
2,Willie Wise,SF,552,9,17.6,8.3,2.9,18.1,47.5,72.4,48.7,3,2,2,1,0,1,0,0,0,0
3,Otis Birdsong,SG,696,12,18.0,3.0,3.2,16.5,50.6,65.5,48.2,4,1,0,0,0,0,0,0,0,0
4,Marcus Camby,C,973,17,9.5,9.8,1.9,17.8,46.6,67.0,81.6,0,0,4,1,0,0,0,1,0,0
5,Phil Chenier,SG,578,10,17.2,3.6,3.0,15.3,44.4,80.6,39.3,3,1,0,1,0,1,0,0,0,0
6,Brad Daugherty,C,548,8,19.0,9.5,3.7,18.9,53.2,74.7,65.2,5,1,0,1,0,0,0,0,0,0
7,Norm Van Lier,PG,746,10,11.8,4.8,7.0,14.0,41.4,78.0,47.8,3,1,8,0,0,0,0,0,0,0
8,Peja Stojaković,SF,804,13,17.0,4.7,1.8,17.1,45.0,89.5,82.6,3,1,0,0,0,1,0,0,0,0
9,Rod Strickland,PG,1094,17,13.2,3.7,7.3,18.0,45.4,72.1,85.8,0,1,0,1,0,0,0,0,0,0
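
The counting helpers in the cell above (`get_mvp_count`, `get_scoring_champ_count`, `get_chips_count`, `award_count`) all repeat one step: split the label on `x` when it carries a multiplier. The following is a minimal sketch of a shared parser, assuming labels look like "2x MVP" or "NBA Champ". The name `parse_award_count` and the sample labels are hypothetical and not part of the original notebook. Anchoring the match to a leading number avoids misreading a label that merely contains the letter "x".

```python
import re

# Matches a leading multiplier such as "2x" or "11x" at the start of a label.
_MULTIPLIER = re.compile(r"^\s*(\d+)x")

def parse_award_count(label):
    """Return the count encoded in an award label.

    "3x NBA Champ" -> 3; a label with no leading multiplier counts as 1.
    """
    match = _MULTIPLIER.match(label)
    return int(match.group(1)) if match else 1

# Hypothetical labels, for illustration only.
sample_labels = ["2x MVP", "NBA Champ", "11x All Star", "Sixth Man"]
counts = {label: parse_award_count(label) for label in sample_labels}
print(counts)
```

With a plain `'x' in text` check, a label such as "Sixth Man" would reach `int("Si")` and raise a `ValueError`; the anchored pattern treats it as a single award instead.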


In [9]:
#Concat together dataframes
df = pd.concat([df_one, df_two, df_three], ignore_index=True)

#Display head
df.head()

Unnamed: 0,Name,Position,Games,Career Length,PPG,RPG,APG,PER,FG%,FT%,Win Shares,All-Stars,All-NBA,All-Defense,All-Rookie Team,MVPs,Chips,ROY,DPOYs,Scoring Champ,HOF
0,Carmelo Anthony,SF,1260,19,22.5,6.2,2.7,19.5,44.7,81.4,108.5,10,6,0,1,0,0,0,0,1,0
1,Shawn Marion,SF,1163,16,15.2,8.7,1.9,18.8,48.4,81.0,124.9,4,2,0,1,0,1,0,0,0,0
2,Rasheed Wallace,PF,1109,16,14.4,6.7,1.8,17.0,46.7,72.1,105.1,4,0,0,1,0,1,0,0,0,0
3,Dwight Howard,C,1242,18,15.7,11.8,1.3,21.3,58.7,56.7,141.7,8,8,5,1,0,1,0,3,0,0
4,Gilbert Arenas,PG,552,11,20.7,3.9,5.3,19.6,42.1,80.3,51.3,3,3,0,0,0,0,0,0,0,0


In [10]:
#There was an error scraping data for John Wall, so I'll need to redo it
url = "https://www.basketball-reference.com/players/w/walljo01.html"

#Headers for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    try:
        #Get player name
        name = soup.find('h1').find('span').text.strip()
        
        #Get position
        position_find = soup.find('strong', string=lambda s: s and 'Position:' in s).next_sibling.text.strip().split()[0:2]
        position_join = " ".join(position_find)
        #Map to the abbreviation; fall back to the raw text so position is always defined
        position = position_map.get(position_join, position_join)

        #Get career length
        career_length = int(soup.find('strong', string=lambda s: s and 'Experience' in s).next_sibling.text.strip().split()[0])
        #Get stats
        ppg = float(soup.find('span', {'data-tip': 'Points'}).find_next('p').find_next('p').text.strip())
        games = int(soup.find('span',{'data-tip': 'Games'}).find_next('p').find_next('p').text.strip())
        rpg = float(soup.find('span', {'data-tip': 'Total Rebounds'}).find_next('p').find_next('p').text.strip())
        apg = float(soup.find('span', {'data-tip': 'Assists'}).find_next('p').find_next('p').text.strip())
        fg = float(soup.find('span', {'data-tip': 'Field Goal Percentage'}).find_next('p').find_next('p').text.strip())
        ft = float(soup.find('span', {'data-tip': 'Free Throw Percentage'}).find_next('p').find_next('p').text.strip())
        per = float(soup.find('span', {'data-tip': lambda x: x and 'Player Efficiency Rating' in x}).find_next('p').find_next('p').text.strip())
        win_shares = float(soup.find('span', {'data-tip': lambda x: x and 'Win Shares' in x}).find_next('p').find_next('p').text.strip())

        #Extract awards (the All-Star, All-NBA and All-Defensive lookups assume the award is listed on the page)
        all_stars = int(soup.find('li', {'class': 'all_star'}).text.strip().split('x')[0])
        all_nba_find = soup.find('a', string= lambda s: s and 'All-NBA' in s).text
        if 'x' in all_nba_find:
            all_nba = int(all_nba_find.strip().split('x')[0])
        else:
            all_nba = 1
        all_defense_find = soup.find('a', string= lambda s: s and 'All-Defensive' in s).text
        if 'x' in all_defense_find:
            all_defense = int(all_defense_find.strip().split('x')[0])
        else:
            all_defense = 1
        #A matching link means the player made an All-Rookie team; no link means 0
        all_rookie = 1 if soup.find('a', string=lambda s: s and 'All-Rookie' in s) else 0
        mvps = 0
        chips = 0 
        roy = 0
        dpoys = 0
        scoring_champ = 0
        hof = 0
        player_info = {
                'Name': name,
                'Position': position,
                'Games': games,
                'Career Length': career_length,
                'PPG': ppg,
                'RPG': rpg,
                'APG': apg,
                'PER': per,
                'FG%': fg,
                'FT%': ft,
                'Win Shares': win_shares,
                'All-Stars': all_stars,
                'All-NBA': all_nba,
                'All-Defense': all_defense,
                'All-Rookie Team': all_rookie,
                'MVPs': mvps,
                'Chips': chips,
                'ROY': roy,
                'DPOYs': dpoys,
                'Scoring Champ': scoring_champ,
                'HOF': hof
        }
    except Exception as e:
        print(f'Error scraping data for {url}: {e}')

john_wall = pd.DataFrame(player_info,index=[0])
john_wall

Unnamed: 0,Name,Position,Games,Career Length,PPG,RPG,APG,PER,FG%,FT%,Win Shares,All-Stars,All-NBA,All-Defense,All-Rookie Team,MVPs,Chips,ROY,DPOYs,Scoring Champ,HOF
0,John Wall,PG,647,11,18.7,4.2,8.9,19.0,43.0,77.6,44.5,5,1,1,1,0,0,0,0,0,0
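
The retry above succeeds because it looks for an "Experience" label, whereas the main loop only looked for "Career Length:". The following is a minimal sketch of a lookup that accepts either label, assuming the page text contains a phrase like "Career Length: 17 years" or "Experience: 11 years". The function name `career_length_from_text` and the sample strings are hypothetical, and other page layouts are not handled.

```python
import re

# Assumption: the years follow one of these two labels in the page text.
_YEARS = re.compile(r"(?:Career Length|Experience):\s*(\d+)")

def career_length_from_text(page_text):
    """Return the career length in years, or None if neither label is found."""
    match = _YEARS.search(page_text)
    return int(match.group(1)) if match else None

# Hypothetical snippets of flattened page text, for illustration only.
print(career_length_from_text("Career Length: 17 years"))  # 17
print(career_length_from_text("Experience: 11 years"))     # 11
print(career_length_from_text("no such field"))            # None
```

Inside the scraping loop this could be called as `career_length_from_text(soup.get_text(" "))`, which returns `None` instead of raising when the field is missing.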


In [11]:
#Add John Wall to dataframe
df = pd.concat([df, john_wall], ignore_index=True)

#Display tail
df.tail()

Unnamed: 0,Name,Position,Games,Career Length,PPG,RPG,APG,PER,FG%,FT%,Win Shares,All-Stars,All-NBA,All-Defense,All-Rookie Team,MVPs,Chips,ROY,DPOYs,Scoring Champ,HOF
102,Derrick Coleman,PF,781,15,16.5,9.3,2.5,18.0,44.7,76.9,64.3,1,2,0,1,0,0,1,0,0,0
103,Jeff Mullins,SG,804,12,16.2,4.3,3.8,16.3,46.3,81.4,62.8,3,0,0,0,0,1,0,0,0,0
104,Mike Mitchell,SF,759,11,19.8,5.6,1.3,16.7,49.3,77.9,50.2,1,0,0,0,0,0,0,0,0,0
105,Jerry Stackhouse,SG,970,18,16.9,3.2,3.3,16.5,40.9,82.2,52.4,2,0,0,1,0,0,0,0,0,0
106,John Wall,PG,647,11,18.7,4.2,8.9,19.0,43.0,77.6,44.5,5,1,1,1,0,0,0,0,0,0


In [17]:
#Check for duplicates 
duplicates = df.duplicated()
duplicates.value_counts()

False    105
True       2
Name: count, dtype: int64

In [19]:
#Find duplicate rows
duplicate_rows = df[duplicates]
duplicate_rows

Unnamed: 0,Name,Position,Games,Career Length,PPG,RPG,APG,PER,FG%,FT%,Win Shares,All-Stars,All-NBA,All-Defense,All-Rookie Team,MVPs,Chips,ROY,DPOYs,Scoring Champ,HOF
31,Bill Bridges,PF,926,13,11.9,11.9,2.8,14.5,44.2,69.3,59.9,3,0,2,0,0,1,0,0,0,0
67,Freddie Lewis,PG,750,11,16.0,3.7,4.0,15.0,43.2,81.7,53.8,3,0,0,0,0,3,0,0,0,0


#### This seems to be a mistake, since these rows don't look like duplicates. I can check using SQL towards the end of my data scraping process.
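
By default, `DataFrame.duplicated()` marks only the second and later copies of a fully identical row, so the earlier copy of each flagged row does not appear in the output above. The following is a minimal sketch, on a toy DataFrame with made-up rows, of how `keep=False` could show every copy side by side before deciding whether a flag is a mistake. The variable names here are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the combined DataFrame; the rows are made up.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player A", "Player C"],
    "Games": [817, 750, 817, 552],
})

# Default keep='first': only the later copy of a repeated row is flagged.
later_copies = toy[toy.duplicated()]

# keep=False flags every member of a duplicate group, so the earlier
# occurrence is listed next to the later one.
all_copies = toy[toy.duplicated(keep=False)].sort_values("Name", kind="stable")

# If the repeats are real, drop them and keep the first occurrence.
deduped = toy.drop_duplicates(ignore_index=True)

print(later_copies)
print(all_copies)
print(len(deduped))
```

Applied to `df`, the same `keep=False` filter would show whether an identical row exists earlier in the combined data, for example because the same player URL was scraped in two batches.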

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             107 non-null    object 
 1   Position         107 non-null    object 
 2   Games            107 non-null    int64  
 3   Career Length    107 non-null    int64  
 4   PPG              107 non-null    float64
 5   RPG              107 non-null    float64
 6   APG              107 non-null    float64
 7   PER              107 non-null    float64
 8   FG%              107 non-null    float64
 9   FT%              107 non-null    float64
 10  Win Shares       107 non-null    float64
 11  All-Stars        107 non-null    int64  
 12  All-NBA          107 non-null    int64  
 13  All-Defense      107 non-null    int64  
 14  All-Rookie Team  107 non-null    int64  
 15  MVPs             107 non-null    int64  
 16  Chips            107 non-null    int64  
 17  ROY             

In [27]:
#Save to csv file
df.to_csv('Borderline HOF Players.csv', index=False)