# Outlier or Caitlin Clark? A Data Science Project
## Part 1 - Project Setup and Data Acquisition

This notebook contains the code for the first part of this data science project - data acquisition. Section headings have been included for convenience and the full writeup is available [on my website](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/).

In summary, there will be a notebook (and post) for each part of the process - from initial steps like data acquisition, preprocessing, and cleaning to more advanced steps like feature engineering, machine learning, and creating visualizations. The dataset used in this project contains individual basketball player statistics (such as total points scored and blocks made) for the 2023-2024 NCAA women’s basketball season.

# Getting Started
Full requirements and environment setup instructions are detailed in the [full blog post](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/).

## Import Packages

In [1]:
import pandas as pd
import requests
import json
import os
import numpy as np
import openpyxl  # Excel engine used by DataFrame.to_excel

# Identifying Datasets
1. Player Information Dataset
   - This dataset will include player information such as name, height, position, team name, and class. 
   - To obtain this data, we'll navigate to the [NCAA website's basketball statistics section](https://web1.ncaa.org/stats/StatsSrv/rankings?doWhat=archive&sportCode=WBB). From there, we'll select the desired season, division, and reporting week, choose "All Statistics for Individual," and click "Show report CSV" to generate and download the dataset in CSV format.

2. Player Statistics Dataset
   - This dataset will include player statistics and individual results for the latest season, including points scored, field goals made, blocks, steals, assists, etc.
   - To obtain this dataset, we'll be making multiple API requests to the [Yahoo Sports API](https://sports.yahoo.com/ncaaw/stats/individual/?selectedTable=0&leagueStructure=ncaaw.struct.div.1&sortStatId=FREE_THROWS_MADE). Each request will pull the top players for a given statistic (such as points, blocks, assists, etc.) and then those results will be combined into one dataset.

# Acquiring Player Information Data
Follow the steps in the [full blog post](https://www.pineconedata.com/2024-04-11-basketball-data-acquisition/) if you would like to obtain this data yourself. Otherwise, import the dataset. 

In [3]:
player_info = pd.read_csv("2024-04-07 NCAA WBB-Div1 Player Info.csv", encoding='latin1')
player_info.head()

|   | Player | Team | Class | Height | Position |
|---|---|---|---|---|---|
| 0 | Kiara Jackson | UNLV (Mountain West) | Jr. | 5-7 | G |
| 1 | Raven Johnson | South Carolina (SEC) | So. | 5-8 | G |
| 2 | Gina Marxen | Montana (Big Sky) | Sr. | 5-8 | G |
| 3 | McKenna Hofschild | Colorado St. (Mountain West) | Sr. | 5-2 | G |
| 4 | Kaylah Ivey | Boston College (ACC) | Jr. | 5-8 | G |


In [4]:
player_info.shape

(1009, 5)

In [14]:
player_info.columns

Index(['Player', 'Team', 'Class', 'Height', 'Position'], dtype='object')

In [5]:
player_info.to_excel('player_info.xlsx', index=False)

# Acquiring Player Statistics Data

## Write a Function to Request Data

In [6]:
def get_data_for_stat(stat_name, season='2023', league='ncaaw', count='500'):
    """
    Retrieve basketball player statistics for a specified statistical category (stat_name) for a given season and league.

    Parameters:
    - stat_name (str): Specifies the statistical category for which data is requested.
    - season (str): Specifies the season for which data is requested (default is '2023').
    - league (str): Specifies the league for which data is requested (default is 'ncaaw' for NCAA Women's Basketball).
    - count (str): Specifies the maximum number of data entries to retrieve (default is '500').

    Returns:
    - dict: JSON response containing player statistics for the specified statistical category.
    """
    url = "https://graphite-secure.sports.yahoo.com/v1/query/shangrila/seasonStatsBasketballTotal"
    params = {
        'lang': 'en-US',
        'region': 'US',
        'tz': 'America/New_York',
        'ysp_redesign': '1',
        'ysp_platform': 'desktop',
        'season': season,
        'league': league,
        'leagueStructure': f'{league}.struct.div.1',
        'count': count,
        'sortStatId': stat_name,
        'positionIds': '',
        'qualified': 'FALSE'
    }
    try:
        # Send GET request with the query parameters; the timeout keeps a stalled connection from hanging the notebook
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()  # Raise exception for 4xx and 5xx status codes
        return response.json()
    except requests.exceptions.RequestException as e:
        # Handle any errors encountered during the API request
        print(f'Error requesting data for {stat_name}: {e}')
        return None
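
To inspect the request without contacting the API, the query string can be assembled by hand. The following is a minimal sketch using only the standard library; `build_stat_url` is a hypothetical helper that is not part of the original notebook, and it includes only a subset of the parameters shown above. The URL that `requests` produces should be equivalent, though the exact escaping may differ slightly.

```python
from urllib.parse import urlencode

# Endpoint copied from get_data_for_stat above
BASE_URL = "https://graphite-secure.sports.yahoo.com/v1/query/shangrila/seasonStatsBasketballTotal"

def build_stat_url(stat_name, season='2023', league='ncaaw', count='500'):
    """Hypothetical helper: return the URL a GET request with these parameters would target."""
    params = {
        'season': season,
        'league': league,
        'leagueStructure': f'{league}.struct.div.1',
        'count': count,
        'sortStatId': stat_name,
        'qualified': 'FALSE',
    }
    # urlencode joins key=value pairs with '&' and escapes special characters
    return f'{BASE_URL}?{urlencode(params)}'

url = build_stat_url('BLOCKS')
print(url)
```

Pasting the printed URL into a browser is a quick way to check that a statistic name is accepted before writing more code around it.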

In [7]:
example_response = get_data_for_stat('POINTS')

In [8]:
example_response.keys()

dict_keys(['data', 'extensions'])
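
When a response is deeply nested, printing an outline of its keys can be quicker than indexing one level at a time. The sketch below is one possible approach, not part of the original notebook: `summarize` is a hypothetical helper, and `sample` is a hand-written stand-in that only mimics the key names seen in this notebook (the contents of `extensions` are unknown, so it is left empty).

```python
def summarize(obj, depth=0, max_depth=3):
    """Return indented lines describing the keys and value types nested inside obj."""
    lines = []
    if depth > max_depth:
        return lines
    indent = '  ' * depth
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f'{indent}{key}: {type(value).__name__}')
            lines.extend(summarize(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Only inspect the first element, assuming list entries share one shape
        lines.append(f'{indent}[0]: {type(obj[0]).__name__}')
        lines.extend(summarize(obj[0], depth + 1, max_depth))
    return lines

# Hand-written stand-in with the same key names as the real response
sample = {
    'data': {
        'statTypes': [{'statId': 'GAMES', 'displayName': 'Games'}],
        'leagues': [{'leaders': [{'player': {'displayName': 'Example Player'}, 'stats': []}]}],
    },
    'extensions': {},
}
summary = summarize(sample)
print('\n'.join(summary))
```

Calling `summarize(example_response)` on the real response would give the same kind of outline, limited to `max_depth` levels.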

In [9]:
example_response['data']['statTypes']

[{'statId': 'GAMES',
  'displayName': 'Games',
  'abbreviation': 'G',
  'sortOrder': 'DESCENDING'},
 {'statId': 'MINUTES_PLAYED',
  'displayName': 'Minutes Played',
  'abbreviation': 'Min',
  'sortOrder': 'DESCENDING'},
 {'statId': 'FIELD_GOALS_MADE',
  'displayName': 'Field Goals Made',
  'abbreviation': 'FGM',
  'sortOrder': 'DESCENDING'},
 {'statId': 'FIELD_GOAL_ATTEMPTS',
  'displayName': 'Field Goal Attempts',
  'abbreviation': 'FGA',
  'sortOrder': 'DESCENDING'},
 {'statId': 'FIELD_GOAL_PERCENTAGE',
  'displayName': 'Field Goal Percentage',
  'abbreviation': 'FG%',
  'sortOrder': 'DESCENDING'},
 {'statId': 'THREE_POINTS_MADE',
  'displayName': 'Three-Points Made',
  'abbreviation': '3PM',
  'sortOrder': 'DESCENDING'},
 {'statId': 'THREE_POINT_ATTEMPTS',
  'displayName': 'Three-Point Attempts',
  'abbreviation': '3PA',
  'sortOrder': 'DESCENDING'},
 {'statId': 'THREE_POINT_PERCENTAGE',
  'displayName': 'Three-Point Percentage',
  'abbreviation': '3P%',
  'sortOrder': 'DESCENDING'}

In [10]:
example_response['data']['leagues'][0]['leaders'][0]

{'player': {'displayName': 'Caitlin Clark',
  'playerId': 'ncaaw.p.64550',
  'team': {'displayName': 'Iowa',
   'abbreviation': 'IOWA',
   'teamLogo': {'url': 'https://s.yimg.com/iu/api/res/1.2/c1eT0fjpIOp9jIlg5xiq0w--~B/YXBwaWQ9c2hhcmVkO2ZpPWZpbGw7cT0xMDA7aD0xMjg7dz0xMjg-/https://s.yimg.com/cv/apiv2/default/ncaab/20181211/500x500/Iowa.png',
    'height': 128,
    'width': 128}},
  'alias': {'url': 'https://sports.yahoo.com/ncaaw/players/64550/'},
  'playerCutout': None},
 'stats': [{'statId': 'GAMES', 'value': '32'},
  {'statId': 'MINUTES_PLAYED', 'value': '1088'},
  {'statId': 'FIELD_GOALS_MADE', 'value': '332'},
  {'statId': 'FIELD_GOAL_ATTEMPTS', 'value': '719'},
  {'statId': 'FIELD_GOAL_PERCENTAGE', 'value': '46.2'},
  {'statId': 'THREE_POINTS_MADE', 'value': '168'},
  {'statId': 'THREE_POINT_ATTEMPTS', 'value': '437'},
  {'statId': 'THREE_POINT_PERCENTAGE', 'value': '38.4'},
  {'statId': 'FREE_THROWS_MADE', 'value': '188'},
  {'statId': 'FREE_THROW_ATTEMPTS', 'value': '219'},
  {

## Write a Function to Format Data

In [11]:
def format_response_data(response_data):
    """
    Process and format the JSON response data obtained from the Yahoo Sports API into a pandas DataFrame.

    Parameters:
    - response_data (dict): JSON response data obtained from the Yahoo Sports API, containing player statistics.

    Returns:
    - DataFrame: Pandas DataFrame containing formatted player statistics.
    """
    if not response_data:
        return None
    try:
        # Extract relevant data from the JSON response
        response_data = response_data['data']['leagues'][0]['leaders']
        data = []
        for item in response_data:
            # Extract player details
            player_details = {
                'PLAYER_NAME': item['player']['displayName'],
                'PLAYER_ID': item['player']['playerId'],
                'TEAM_NAME': item['player']['team']['displayName']
            }
            # Extract player statistics
            player_stats = {stat['statId']: stat['value'] for stat in item['stats']}
            # Combine player details and statistics into a single dictionary
            player_row = {**player_details, **player_stats}
            data.append(player_row)
        # Convert the list of dictionaries into a pandas DataFrame
        return pd.DataFrame(data)
    except (KeyError, IndexError, TypeError) as e:
        # Handle any errors encountered during data formatting
        print(f'Error formatting response data: {e}')
        return None
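
The flattening step inside the loop can be followed on a single record. This is a small illustration with a hand-written entry shaped like one element of `leaders`; the player name, ID, team, and values are placeholders rather than real data.

```python
# Hand-written record in the same shape as one entry of 'leaders'
item = {
    'player': {
        'displayName': 'Example Player',
        'playerId': 'ncaaw.p.0',
        'team': {'displayName': 'Example U'},
    },
    'stats': [
        {'statId': 'GAMES', 'value': '30'},
        {'statId': 'POINTS', 'value': '450'},
    ],
}

# Pull the identifying fields out of the nested 'player' object
player_details = {
    'PLAYER_NAME': item['player']['displayName'],
    'PLAYER_ID': item['player']['playerId'],
    'TEAM_NAME': item['player']['team']['displayName'],
}
# Turn the list of {'statId', 'value'} pairs into one flat mapping
player_stats = {stat['statId']: stat['value'] for stat in item['stats']}
# Merge both dictionaries into a single row
player_row = {**player_details, **player_stats}
print(player_row)
```

Each statistic becomes its own column, and the values stay as strings exactly as the API returned them.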

In [12]:
example_dataframe = format_response_data(example_response)
example_dataframe.head()

|   | PLAYER_NAME | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOAL_ATTEMPTS | FIELD_GOAL_PERCENTAGE | THREE_POINTS_MADE | THREE_POINT_ATTEMPTS | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caitlin Clark | ncaaw.p.64550 | Iowa | 32 | 1088 | 332 | 719 | 46.2 | 168 | 437 | ... | 85.8 | 10 | 224 | 234 | 282 | 151 | 55 | 17 | 61 | 1020 |
| 1 | JuJu Watkins | ncaaw.p.112021 | Indiana | 29 | 989 | 270 | 656 | 41.2 | 58 | 176 | ... | 84.6 | 52 | 161 | 213 | 96 | 120 | 72 | 45 | 78 | 801 |
| 2 | Hannah Hidalgo | ncaaw.p.112250 | Notre Dame | 31 | 1104 | 255 | 560 | 45.5 | 45 | 132 | ... | 78.3 | 25 | 175 | 200 | 170 | 109 | 145 | 3 | 86 | 725 |
| 3 | Lucy Olsen | ncaaw.p.67706 | Iowa | 30 | 1087 | 268 | 612 | 43.8 | 47 | 158 | ... | 80.9 | 30 | 114 | 144 | 115 | 73 | 57 | 18 | 72 | 697 |
| 4 | Ta'Niya Latson | ncaaw.p.70600 | Florida St. | 32 | 991 | 249 | 566 | 44.0 | 27 | 98 | ... | 85.2 | 17 | 118 | 135 | 128 | 98 | 50 | 13 | 53 | 680 |


In [13]:
example_dataframe.columns

Index(['PLAYER_NAME', 'PLAYER_ID', 'TEAM_NAME', 'GAMES', 'MINUTES_PLAYED',
       'FIELD_GOALS_MADE', 'FIELD_GOAL_ATTEMPTS', 'FIELD_GOAL_PERCENTAGE',
       'THREE_POINTS_MADE', 'THREE_POINT_ATTEMPTS', 'THREE_POINT_PERCENTAGE',
       'FREE_THROWS_MADE', 'FREE_THROW_ATTEMPTS', 'FREE_THROW_PERCENTAGE',
       'OFFENSIVE_REBOUNDS', 'DEFENSIVE_REBOUNDS', 'TOTAL_REBOUNDS', 'ASSISTS',
       'TURNOVERS', 'STEALS', 'BLOCKS', 'FOULS', 'POINTS'],
      dtype='object')

In [15]:
example_dataframe.shape

(500, 23)
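
Because the API returns every value as a string (for example `'32'`), the columns of this DataFrame hold text rather than numbers. If numeric columns are needed at this stage, one possible conversion is sketched below on a toy DataFrame with placeholder names and values; the choice of `id_columns` is an assumption for the illustration.

```python
import pandas as pd

# Toy frame with string-valued stats, mirroring how the API delivers values
df = pd.DataFrame({
    'PLAYER_NAME': ['Player A', 'Player B'],
    'GAMES': ['32', '29'],
    'FIELD_GOAL_PERCENTAGE': ['46.2', '41.2'],
})

# Columns that identify the player and should stay as text
id_columns = ['PLAYER_NAME']
stat_columns = [c for c in df.columns if c not in id_columns]

# Convert each stat column; values that cannot be parsed become NaN
df[stat_columns] = df[stat_columns].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)
```

Using `errors='coerce'` trades silent NaN values for robustness, so it is worth checking for missing values afterwards.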

## Write a Function to Request and Format Data

In [16]:
def get_and_format_data_for_stat(stat_name, season='2023', league='ncaaw'):
    """
    Retrieve basketball player statistics for a specified statistical category (stat_name) for a given season and league
    and format the data into a pandas DataFrame.

    Parameters:
    - stat_name (str): Specifies the statistical category for which data is requested.
    - season (str): Specifies the season for which data is requested (default is '2023').
    - league (str): Specifies the league for which data is requested (default is 'ncaaw' for NCAA Women's Basketball).

    Returns:
    - DataFrame: Pandas DataFrame containing formatted player statistics for the specified statistical category.
    """
    # Retrieve player statistics for the specified statistical category, season, and league
    response_data = get_data_for_stat(stat_name, season, league)
    # Format the retrieved data into a pandas DataFrame
    return format_response_data(response_data)

## Request the Data for Each Statistic

In [17]:
# Get and format data for each stat
points_top_players = get_and_format_data_for_stat('POINTS')
assists_top_players = get_and_format_data_for_stat('ASSISTS')
rebounds_top_players = get_and_format_data_for_stat('TOTAL_REBOUNDS')
blocks_top_players = get_and_format_data_for_stat('BLOCKS')
steals_top_players = get_and_format_data_for_stat('STEALS')
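
The five calls above could also be written as a loop that skips any request that failed. The sketch below is one way to do that, not the notebook's own approach: `fetch_leaders` is a hypothetical helper, and `fake_fetch` stands in for `get_and_format_data_for_stat` so the example runs without network access.

```python
def fetch_leaders(stat_names, fetch):
    """Call fetch(stat_name) for each name and keep only the successful results."""
    results = {}
    for stat_name in stat_names:
        result = fetch(stat_name)
        if result is not None:  # the notebook's functions return None on failure
            results[stat_name] = result
    return results

# Stand-in for get_and_format_data_for_stat; pretends the BLOCKS request failed
def fake_fetch(stat_name):
    return None if stat_name == 'BLOCKS' else [f'{stat_name} leader']

stat_names = ['POINTS', 'ASSISTS', 'TOTAL_REBOUNDS', 'BLOCKS', 'STEALS']
leaders = fetch_leaders(stat_names, fake_fetch)
print(list(leaders))
```

In the notebook, passing `get_and_format_data_for_stat` as `fetch` would give a dictionary of DataFrames keyed by statistic name.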

## Combine Each Statistic into One Dataset

In [19]:
# Combine the leaders for each stat into one df
player_stats = pd.concat([points_top_players, assists_top_players, rebounds_top_players,
                         blocks_top_players, steals_top_players], ignore_index=True).drop_duplicates()
player_stats.head()

|   | PLAYER_NAME | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | FIELD_GOAL_ATTEMPTS | FIELD_GOAL_PERCENTAGE | THREE_POINTS_MADE | THREE_POINT_ATTEMPTS | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caitlin Clark | ncaaw.p.64550 | Iowa | 32 | 1088 | 332 | 719 | 46.2 | 168 | 437 | ... | 85.8 | 10 | 224 | 234 | 282 | 151 | 55 | 17 | 61 | 1020 |
| 1 | JuJu Watkins | ncaaw.p.112021 | Indiana | 29 | 989 | 270 | 656 | 41.2 | 58 | 176 | ... | 84.6 | 52 | 161 | 213 | 96 | 120 | 72 | 45 | 78 | 801 |
| 2 | Hannah Hidalgo | ncaaw.p.112250 | Notre Dame | 31 | 1104 | 255 | 560 | 45.5 | 45 | 132 | ... | 78.3 | 25 | 175 | 200 | 170 | 109 | 145 | 3 | 86 | 725 |
| 3 | Lucy Olsen | ncaaw.p.67706 | Iowa | 30 | 1087 | 268 | 612 | 43.8 | 47 | 158 | ... | 80.9 | 30 | 114 | 144 | 115 | 73 | 57 | 18 | 72 | 697 |
| 4 | Ta'Niya Latson | ncaaw.p.70600 | Florida St. | 32 | 991 | 249 | 566 | 44.0 | 27 | 98 | ... | 85.2 | 17 | 118 | 135 | 128 | 98 | 50 | 13 | 53 | 680 |


In [20]:
player_stats.shape

(1392, 23)
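
Five requests of up to 500 rows each can return as many as 2,500 rows, so the 1,392 rows here reflect players who appear on more than one leaderboard being collapsed by `drop_duplicates()`. The toy example below, with placeholder players and values, illustrates that behavior and shows a stricter alternative that deduplicates on `PLAYER_ID` alone; the alternative is a suggestion, not what the notebook does.

```python
import pandas as pd

# Two toy leaderboards; 'Player A' appears on both with identical values
points = pd.DataFrame({
    'PLAYER_NAME': ['Player A', 'Player B'],
    'PLAYER_ID': ['p.1', 'p.2'],
    'POINTS': ['900', '800'],
    'ASSISTS': ['200', '50'],
})
assists = pd.DataFrame({
    'PLAYER_NAME': ['Player A', 'Player C'],
    'PLAYER_ID': ['p.1', 'p.3'],
    'POINTS': ['900', '300'],
    'ASSISTS': ['200', '180'],
})

combined = pd.concat([points, assists], ignore_index=True)
# Drops rows that match on every column (the notebook's approach)
deduped = combined.drop_duplicates()
# Drops rows that share a PLAYER_ID, even if some other column differs
by_id = combined.drop_duplicates(subset='PLAYER_ID')
print(len(combined), len(deduped), len(by_id))
```

The two approaches agree whenever the API returns identical rows for the same player across requests, which is the case in this toy data.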

In [21]:
player_stats.to_excel('player_stats.xlsx', index=False)

# Combine Player Information and Statistics Datasets

In [22]:
player_info.rename(columns={"Player": "PLAYER_NAME"}, inplace=True)
player_info.head()

|   | PLAYER_NAME | Team | Class | Height | Position |
|---|---|---|---|---|---|
| 0 | Kiara Jackson | UNLV (Mountain West) | Jr. | 5-7 | G |
| 1 | Raven Johnson | South Carolina (SEC) | So. | 5-8 | G |
| 2 | Gina Marxen | Montana (Big Sky) | Sr. | 5-8 | G |
| 3 | McKenna Hofschild | Colorado St. (Mountain West) | Sr. | 5-2 | G |
| 4 | Kaylah Ivey | Boston College (ACC) | Jr. | 5-8 | G |


In [23]:
player_data = pd.merge(player_info, player_stats, on=['PLAYER_NAME'], how='inner')
player_data.head()

|   | PLAYER_NAME | Team | Class | Height | Position | PLAYER_ID | TEAM_NAME | GAMES | MINUTES_PLAYED | FIELD_GOALS_MADE | ... | FREE_THROW_PERCENTAGE | OFFENSIVE_REBOUNDS | DEFENSIVE_REBOUNDS | TOTAL_REBOUNDS | ASSISTS | TURNOVERS | STEALS | BLOCKS | FOULS | POINTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kiara Jackson | UNLV (Mountain West) | Jr. | 5-7 | G | ncaaw.p.67149 | UNLV | 29 | 895 | 128 | ... | 75.0 | 27 | 102 | 129 | 135 | 42 | 31 | 5 | 47 | 323 |
| 1 | Raven Johnson | South Carolina (SEC) | So. | 5-8 | G | ncaaw.p.67515 | South Carolina | 30 | 823 | 98 | ... | 64.3 | 33 | 128 | 161 | 148 | 53 | 60 | 5 | 34 | 243 |
| 2 | Gina Marxen | Montana (Big Sky) | Sr. | 5-8 | G | ncaaw.p.57909 | Montana | 29 | 778 | 88 | ... | 72.4 | 6 | 54 | 60 | 111 | 38 | 16 | 2 | 26 | 297 |
| 3 | McKenna Hofschild | Colorado St. (Mountain West) | Sr. | 5-2 | G | ncaaw.p.60402 | Colorado St. | 29 | 1046 | 231 | ... | 83.5 | 6 | 109 | 115 | 211 | 71 | 36 | 4 | 34 | 654 |
| 4 | Kaylah Ivey | Boston College (ACC) | Jr. | 5-8 | G | ncaaw.p.64531 | Boston Coll. | 33 | 995 | 47 | ... | 60.7 | 12 | 45 | 57 | 186 | 64 | 36 | 1 | 48 | 143 |


In [24]:
player_data.shape

(900, 27)
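
An inner join silently drops any row whose `PLAYER_NAME` appears in only one of the two datasets, for instance when a name is spelled differently in each source. A sketch of one way to see how many rows fall into each group is shown below, using an outer join with pandas' `indicator=True` option on toy data with placeholder names; it is a diagnostic suggestion, not a step from the original notebook.

```python
import pandas as pd

# Toy datasets: 'Player D' has no stats, 'Player C' has no info row
info = pd.DataFrame({
    'PLAYER_NAME': ['Player A', 'Player B', 'Player D'],
    'Height': ['5-7', '6-0', '5-10'],
})
stats = pd.DataFrame({
    'PLAYER_NAME': ['Player A', 'Player B', 'Player C'],
    'POINTS': ['900', '800', '300'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
check = pd.merge(info, stats, on='PLAYER_NAME', how='outer', indicator=True)
counts = check['_merge'].value_counts()
print(counts)
```

Rows marked `left_only` or `right_only` are the ones an inner join would discard. Note that the merged row count alone cannot confirm how many were dropped, since duplicate names on either side can also add rows.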

In [25]:
player_data.to_excel('player_data_raw.xlsx', index=False)