## CSC 442 Group 6
### Authors: Harsh Patel, Minji Kang, Aaron Chao
Description:

In [13]:
# importing pandas for dataframes, cleaning, wrangling, and analysis
import pandas as pd

# Read in the two datasets
players = pd.read_csv("https://raw.githubusercontent.com/hpatel-27/DataScience_CourseProject/refs/heads/main/datasets/Players.csv", index_col=0)
stats = pd.read_csv("https://raw.githubusercontent.com/hpatel-27/DataScience_CourseProject/refs/heads/main/datasets/Seasons_Stats.csv", index_col=0)

### Dataset Dictionaries:

### Players.csv:
- Index: Unique integer row label starting at 0 (read in as the DataFrame index)

- 'Player': Player name (first and last)

- 'height': Player height (cm)

- 'weight': Player weight (kg)

- 'collage': College the player attended (the column name is misspelled in the source file; renamed to 'college' during cleaning)

- 'born': Player birth year

- 'birth_city': Player birth city

- 'birth_state': Player birth state

### Seasons_Stats.csv:
- 'Year' - Year that the season occurred. Since the NBA season is split over two calendar years, the year given is the last year for that season.

- 'Player' - Player name (First and Last)

- 'Pos' - Player position(s)

- 'Age' - Player Age on Feb 1 of the given season

- 'Tm' - Team of the player

- 'G' - Games the player played in during the season

- 'GS' - Games in which the player was among the starting five during the season; (available since the 1982 season)

- 'MP' - Minutes played across the season; (available since the 1951-52 season)

- 'PER' - Player Efficiency Rating; (available since the 1951-52 season); "The PER sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance."

- 'TS%' - True Shooting Percentage; PTS / (2 * TSA), where TSA (true shooting attempts) = FGA + 0.44 * FTA. Measure of shooting efficiency that takes into account field goals, 3-point field goals, and free throws.

- '3PAr' - 3-Point Attempt Rate; (3PA / FGA); Share of a player's field goal attempts taken from 3-point range, which also indicates how the player is used offensively.

- 'FTr' - Free Throw Rate; (FTA / FGA); Free throw attempts per field goal attempt.

- 'ORB%' - Offensive Rebound Percentage; (available since the 1970-71 season in the NBA); 100 * (ORB * (Tm MP / 5)) / (MP * (Tm ORB + Opp DRB)).
Estimate of the percentage of available offensive rebounds a player grabbed while he was on the floor

- 'DRB%' - Defensive Rebound Percentage; (available since the 1970-71 season in the NBA); 100 * (DRB * (Tm MP / 5)) / (MP * (Tm DRB + Opp ORB)).
Estimate of the percentage of available defensive rebounds a player grabbed while he was on the floor.

- 'TRB%' - Total Rebound Percentage; (available since the 1970-71 season in the NBA); 100 * (TRB * (Tm MP / 5)) / (MP * (Tm TRB + Opp TRB)).
Estimate of the percentage of available rebounds a player grabbed while he was on the floor.

- 'AST%' - Assist Percentage; (available since the 1964-65 season in the NBA); 100 * AST / (((MP / (Tm MP / 5)) * Tm FG) - FG).
Assist percentage is an estimate of the percentage of teammate field goals a player assisted while he was on the floor.

- 'STL%' - Steal Percentage; (available since the 1973-74 season in the NBA); 100 * (STL * (Tm MP / 5)) / (MP * Opp Poss).
Estimate of the percentage of opponent possessions that end with a steal by the player while he was on the floor.

- 'BLK%' - Block Percentage; (available since the 1973-74 season in the NBA); 100 * (BLK * (Tm MP / 5)) / (MP * (Opp FGA - Opp 3PA)).
Estimate of the percentage of opponent two-point field goal attempts blocked by the player while he was on the floor.

- 'TOV%' - Turnover Percentage; (available since the 1977-78 season in the NBA); 100 * TOV / (FGA + 0.44 * FTA + TOV).
Estimate of turnovers per 100 plays.

- 'USG%' - Usage Percentage; (available since the 1977-78 season in the NBA); 100 * ((FGA + 0.44 * FTA + TOV) * (Tm MP / 5)) / (MP * (Tm FGA + 0.44 * Tm FTA + Tm TOV)). Estimate of the percentage of team plays used by a player while he was on the floor.

- 'blanl' - Empty separator column (the name appears to be a typo for "blank" in the source file); probably included to make the data easier to read

- 'OWS' - Offensive Win Shares; (marginal offense) / (marginal points per win);
Estimate of the portion of a team's wins attributable to a player's offensive performance.

- 'DWS' - Defensive Win Shares; (marginal defense) / (marginal points per win);
Estimate of the portion of a team's wins attributable to a player's defensive performance.

- 'WS' - Win Shares; OWS + DWS; Estimate of the number of wins contributed by the player

- 'WS/48' - Win Shares Per 48 Minutes; (available since the 1951-52 season in the NBA).
Estimate of the number of wins contributed by the player per 48 minutes (league average is approximately 0.100).

- 'blank2' - Second empty separator column; probably included to make the data easier to read

- 'OBPM' - Offensive Box +/-; Estimates how much a player improves their team's offensive performance

- 'DBPM' - Defensive Box +/-; Estimates how much a player improves their team's defensive performance

- 'BPM' - Box +/-; (available since the 1973-74 season in the NBA);
Estimate of the points per 100 possessions that a player contributed above a league-average player, translated to an average team.

- 'VORP' - Value Over Replacement; (available since the 1973-74 season in the NBA);
Estimate of the points per 100 TEAM possessions that a player contributed above a replacement-level (-2.0) player, translated to an average team and prorated to an 82-game season. Multiply by 2.70 to convert to wins over replacement.

- 'FG' - Field Goals Made; (includes both 2-point field goals and 3-point field goals)

- 'FGA' - Field Goals Attempted; (includes both 2-point field goal attempts and 3-point field goal attempts)

- 'FG%' - Field Goal Percentage; FG / FGA.

- '3P' - 3-Pointers Made; (available since the 1979-80 season in the NBA)

- '3PA' - 3-Pointers Attempted; (available since the 1979-80 season in the NBA)

- '3P%' - 3-Point Percentage; (available since the 1979-80 season in the NBA); 3P / 3PA.

- '2P' - 2-Point Field Goals Made

- '2PA' - 2-Point Field Goals Attempted

- '2P%' - 2-Point Field Goal Percentage; 2P / 2PA.

- 'eFG%' - Effective Field Goal Percentage; (FG + 0.5 * 3P) / FGA. Adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.

- 'FT' - Free Throws Made

- 'FTA' - Free Throws Attempted

- 'FT%' - Free Throw Percentage; FT / FTA.

- 'ORB' - Offensive Rebounds; (available since the 1973-74 season in the NBA)

- 'DRB' - Defensive Rebounds; (available since the 1973-74 season in the NBA)

- 'TRB' - Total Rebounds Made; (available since the 1950-51 season)

- 'AST' - Assists Made

- 'STL' - Steals Made; (available since the 1973-74 season in the NBA)

- 'BLK' - Blocks Made; (available since the 1973-74 season in the NBA)

- 'TOV' - Turnovers Made; (available since the 1977-78 season in the NBA)

- 'PF' - Personal Fouls By Player

- 'PTS' - Points Scored
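
Several of the rate statistics in the dictionary above are simple functions of the counting columns. As a rough illustration, the sketch below recomputes 'TS%', 'eFG%', and 'TOV%' from box-score totals. The two rows in `toy` are made-up numbers, not values from the dataset, and the 0.44 free-throw weight in true shooting attempts is the conventional definition, assumed here to match the one used to build the file.

```python
import pandas as pd

# Hypothetical season totals for two made-up players (not from the dataset).
toy = pd.DataFrame(
    {
        "FG": [400, 250],
        "FGA": [800, 600],
        "3P": [100, 0],
        "FTA": [200, 100],
        "TOV": [112, 56],
        "PTS": [1100, 580],
    },
    index=["Player A", "Player B"],
)

# True shooting attempts: FGA + 0.44 * FTA (assumed conventional weight).
tsa = toy["FGA"] + 0.44 * toy["FTA"]

derived = pd.DataFrame(
    {
        # TS% = PTS / (2 * TSA)
        "TS%": toy["PTS"] / (2 * tsa),
        # eFG% = (FG + 0.5 * 3P) / FGA
        "eFG%": (toy["FG"] + 0.5 * toy["3P"]) / toy["FGA"],
        # TOV% = 100 * TOV / (FGA + 0.44 * FTA + TOV)
        "TOV%": 100 * toy["TOV"] / (tsa + toy["TOV"]),
    }
)
print(derived.round(3))
```

Note that 'TS%' and 'eFG%' are stored as fractions (for example 0.368), while 'TOV%' is on a 0 to 100 scale, so the columns are not directly comparable without rescaling.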



In [14]:
# Look at the first five rows of Players.csv
players.head()

Unnamed: 0,Player,height,weight,collage,born,birth_city,birth_state
0,Curly Armstrong,180.0,77.0,Indiana University,1918.0,,
1,Cliff Barker,188.0,83.0,University of Kentucky,1921.0,Yorktown,Indiana
2,Leo Barnhorst,193.0,86.0,University of Notre Dame,1924.0,,
3,Ed Bartels,196.0,88.0,North Carolina State University,1925.0,,
4,Ralph Beard,178.0,79.0,University of Kentucky,1927.0,Hardinsburg,Kentucky


In [15]:
# Looking at column names of players dataframe
players.columns

Index(['Player', 'height', 'weight', 'collage', 'born', 'birth_city',
       'birth_state'],
      dtype='object')

In [16]:
# Checking players data types
players.dtypes

Unnamed: 0,0
Player,object
height,float64
weight,float64
collage,object
born,float64
birth_city,object
birth_state,object


In [17]:
# Check row labels
players.index

Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
       ...
       3912, 3913, 3914, 3915, 3916, 3917, 3918, 3919, 3920, 3921],
      dtype='int64', length=3922)

In [22]:
# Unit of observation: one row per player, so count the distinct player names
len(players["Player"].unique())

3922

In [5]:
# Look at the first five rows of Seasons_Stats.csv
stats.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,0.705,,,,176.0,,,,217.0,458.0
1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,...,0.708,,,,109.0,,,,99.0,279.0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,0.698,,,,140.0,,,,192.0,438.0
3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,...,0.559,,,,20.0,,,,29.0,63.0
4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,...,0.548,,,,20.0,,,,27.0,59.0


In [6]:
# Looking at column names of stats dataframe
stats.columns

Index(['Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'PER', 'TS%',
       '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%',
       'USG%', 'blanl', 'OWS', 'DWS', 'WS', 'WS/48', 'blank2', 'OBPM', 'DBPM',
       'BPM', 'VORP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA',
       '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL',
       'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

In [7]:
# Checking stats data types
stats.dtypes

Unnamed: 0,0
Year,float64
Player,object
Pos,object
Age,float64
Tm,object
G,float64
GS,float64
MP,float64
PER,float64
TS%,float64


In [23]:
# Check row labels
stats.index

Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
       ...
       24681, 24682, 24683, 24684, 24685, 24686, 24687, 24688, 24689, 24690],
      dtype='int64', length=24691)

In [24]:
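# Count the distinct players; rows repeat a player across seasons and teams
# (e.g. Ed Bartels has two rows for 1950 in the head() output above)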
len(stats["Player"].unique())

3922
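
The 24,691 rows in `stats` cover only 3,922 distinct player names, so a single row is not a player. One way to probe the actual row grain is to count distinct key combinations. The sketch below does this on a toy frame that mirrors the pattern visible in the `stats.head()` output (Ed Bartels has a 'TOT' row and a 'DNN' row for 1950); `toy_stats` and the key choices are illustrative assumptions, not a verified description of the full file.

```python
import pandas as pd

# Toy rows mirroring the head() output: one player appears twice in 1950,
# once with Tm == "TOT" (apparently a combined row) and once per team.
toy_stats = pd.DataFrame(
    {
        "Year": [1950, 1950, 1950, 1950],
        "Player": ["Curly Armstrong", "Cliff Barker", "Ed Bartels", "Ed Bartels"],
        "Tm": ["FTW", "INO", "TOT", "DNN"],
    }
)

n_rows = len(toy_stats)
n_players = toy_stats["Player"].nunique()
n_player_years = len(toy_stats.drop_duplicates(["Year", "Player"]))
n_player_year_teams = len(toy_stats.drop_duplicates(["Year", "Player", "Tm"]))

# A candidate key describes the row grain only if its distinct count
# equals the number of rows.
print(n_rows, n_players, n_player_years, n_player_year_teams)
```

On the toy data only the (Year, Player, Tm) combination matches the row count. Running the same counts on `stats` would show whether that holds for the real file.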

### Cleaning Players.csv:

In [8]:
# Remove the following columns from the Players.csv dataset: [born, birth_city, birth_state]
playersDropped = players.drop(columns=["born", "birth_city", "birth_state"])
# Fix the misspelled column name "collage" by renaming it to "college"
playersDropped = playersDropped.rename(columns={'collage': 'college'})

# Check the updates
playersDropped.head()

Unnamed: 0,Player,height,weight,college
0,Curly Armstrong,180.0,77.0,Indiana University
1,Cliff Barker,188.0,83.0,University of Kentucky
2,Leo Barnhorst,193.0,86.0,University of Notre Dame
3,Ed Bartels,196.0,88.0,North Carolina State University
4,Ralph Beard,178.0,79.0,University of Kentucky


### Cleaning Seasons_Stats.csv:


In [9]:
# To keep this readable, I split the drop columns across multiple lines

# Drop the following columns:
# ['Team', 'Games Started', '3 Point Attempt Rate', 'Free Throw Attempt Rate', 'Offensive Rebound %', 'Defensive Rebound %', 'Total Rebound %',
# 'Assist %', 'Steal %', 'Block %', 'Turnover %', 'blanl']
statsDropped = stats.drop(columns=['Tm', 'GS', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'blanl'])

# Drop the following columns:
# ['Offensive Win Shares', 'Defensive Win Shares', 'Win Shares Per 48 Minutes', 'blank2', 'Offensive Box +/-', 'Defensive Box +/-', 'Field Goals',
# 'Field Goals Attempted', '3 Pointers', '3 Pointers Attempted', '2 Pointers', '2 Pointers Attempted', 'Free Throws', 'Free Throws Attempted']
statsDropped = statsDropped.drop(columns=['OWS', 'DWS', 'WS/48', 'blank2', 'OBPM', 'DBPM', 'FG', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA'])

# Drop the following columns: ['Offensive Rebounds', 'Defensive Rebounds', 'Personal Fouls']
statsDropped = statsDropped.drop(columns=['ORB', 'DRB', 'PF'])

statsDropped.head()

Unnamed: 0,Year,Player,Pos,Age,G,MP,PER,TS%,USG%,WS,...,3P%,2P%,eFG%,FT%,TRB,AST,STL,BLK,TOV,PTS
0,1950.0,Curly Armstrong,G-F,31.0,63.0,,,0.368,,3.5,...,,0.279,0.279,0.705,,176.0,,,,458.0
1,1950.0,Cliff Barker,SG,29.0,49.0,,,0.435,,2.2,...,,0.372,0.372,0.708,,109.0,,,,279.0
2,1950.0,Leo Barnhorst,SF,25.0,67.0,,,0.394,,3.6,...,,0.349,0.349,0.698,,140.0,,,,438.0
3,1950.0,Ed Bartels,F,24.0,15.0,,,0.312,,-0.6,...,,0.256,0.256,0.559,,20.0,,,,63.0
4,1950.0,Ed Bartels,F,24.0,13.0,,,0.308,,-0.6,...,,0.256,0.256,0.548,,20.0,,,,59.0
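
The three `drop` calls above remove 12, 14, and 3 columns respectively, 29 in total, which takes the frame from 52 columns to 23. The same result could be obtained with a single `drop` over concatenated lists, which keeps the thematic grouping while avoiding repeated reassignment. The sketch below shows the pattern on a toy frame; the column subset and the list names `team_cols` and `foul_cols` are hypothetical.

```python
import pandas as pd

# Toy frame with a small, hypothetical subset of the Seasons_Stats.csv columns.
toy = pd.DataFrame(
    [[1950, "Curly Armstrong", "FTW", 3.5, 217, 458]],
    columns=["Year", "Player", "Tm", "WS", "PF", "PTS"],
)

# Group the columns to drop by theme, then remove them in one call.
team_cols = ["Tm"]
foul_cols = ["PF"]
to_drop = team_cols + foul_cols

trimmed = toy.drop(columns=to_drop)
print(list(trimmed.columns))
```

Because `drop` raises a `KeyError` for a label that is not present, a misspelled column name in any of the lists fails immediately rather than being silently skipped.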


In [10]:
# Convert the Year column from float64 to pandas' nullable integer dtype (Int64)
statsDropped['Year'] = statsDropped['Year'].astype('Int64')
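
The capitalized `'Int64'` is pandas' nullable integer dtype, which differs from NumPy's `int64` in that it can hold missing values. The sketch below illustrates the difference on a small hypothetical Series. It assumes 'Year' may contain missing entries, which the `float64` dtype reported earlier is consistent with but does not prove.

```python
import numpy as np
import pandas as pd

# Hypothetical Year values, including one missing entry.
years = pd.Series([1950.0, 1951.0, np.nan])

# Nullable integer dtype: whole numbers become integers, the gap becomes <NA>.
nullable = years.astype("Int64")
print(nullable.dtype, nullable.isna().sum())

# NumPy's int64 cannot represent a missing value, so this conversion raises.
try:
    years.astype("int64")
except (ValueError, TypeError) as err:
    print("plain int64 failed:", type(err).__name__)
```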

In [11]:
print(f'Shape of players.csv before dropping columns: {players.shape}')
print(f'Shape of players.csv after dropping columns: {playersDropped.shape}')
print(f'Shape of stats.csv before dropping columns: {stats.shape}')
print(f'Shape of stats.csv after dropping columns: {statsDropped.shape}')

Shape of players.csv before dropping columns: (3922, 7)
Shape of players.csv after dropping columns: (3922, 4)
Shape of stats.csv before dropping columns: (24691, 52)
Shape of stats.csv after dropping columns: (24691, 23)
