## Data Source
The data used in this assignment was downloaded from [basketball-reference.com](https://www.basketball-reference.com).

**Files used:**
- nba_salary.csv
- advanced_player_stats.csv
- nba_player_stats.csv

## 1. Import libraries

In [3]:
import pandas as pd
import numpy as np

## 2. NBA Salary Data
### Load and Preview

In [6]:
salary_df = pd.read_csv('nba_salary.csv')
salary_df.head()

| | Rk | Player | Tm | 2024-25 | 2025-26 | 2026-27 | 2027-28 | 2028-29 | 2029-30 | Guaranteed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Stephen Curry | GSW | $55,761,216 | $59,606,817 | $62,587,158 | | | | $177,955,191 |
| 1 | 2 | Joel Embiid | PHI | $51,415,938 | $55,224,526 | $57,985,752 | $62,624,612 | $67,263,472 | | $227,250,828 |
| 2 | 3 | Nikola Jokić | DEN | $51,415,938 | $55,224,526 | $59,033,114 | $62,841,702 | | | $165,673,578 |
| 3 | 4 | Kevin Durant | PHO | $51,179,021 | $54,708,609 | | | | | $105,887,630 |
| 4 | 5 | Bradley Beal | PHO | $50,203,930 | $53,666,270 | $57,128,610 | | | | $103,870,200 |


#### Check missing values

In [9]:
salary_df.isnull().sum()

Rk              0
Player          0
Tm              0
2024-25         0
2025-26       179
2026-27       324
2027-28       438
2028-29       526
2029-30       555
Guaranteed      1
dtype: int64
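
The raw counts above are easier to compare as a share of rows. A minimal sketch, using a small hypothetical frame in place of `salary_df` (the sample values are made up for illustration, not taken from the file):

```python
import pandas as pd

# Hypothetical stand-in for salary_df; values are illustrative only.
sample = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "2024-25": ["$10", "$20", "$30", "$40"],
    "2025-26": ["$11", None, "$31", None],
    "2026-27": [None, None, "$32", None],
})

# isnull().sum() counts missing values; isnull().mean() gives the fraction.
missing_summary = pd.DataFrame({
    "missing": sample.isnull().sum(),
    "pct_missing": (sample.isnull().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

A missing future-season salary may simply mean the player has no contract for that season. That is an assumption worth confirming against the source before filling or dropping anything.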

#### Check for duplicates

In [12]:
salary_df.duplicated().sum()

0

#### Check data types

In [15]:
salary_df.dtypes

Rk             int64
Player        object
Tm            object
2024-25       object
2025-26       object
2026-27       object
2027-28       object
2028-29       object
2029-30       object
Guaranteed    object
dtype: object

#### (If needed) Example of fixing a data type or trimming whitespace
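*(Not applied at this stage; shown to demonstrate the process. The season columns are stored as text, e.g. "$55,761,216", so they would need a conversion like this before any numeric analysis.)*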

In [18]:
# Example only: convert one season's salary column from text to numeric
# salary_df['2024-25'] = salary_df['2024-25'].replace(r'[\$,]', '', regex=True).astype(float)
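
A runnable version of the same idea, sketched on two rows copied from the preview above rather than on the full file. The `money_cols` list is an assumption about which columns to convert, and `pd.to_numeric` is used here instead of `astype(float)` so that missing seasons stay as `NaN`:

```python
import pandas as pd

# Two rows copied from the preview; only two season columns are included.
sample = pd.DataFrame({
    "Player": ["Stephen Curry", "Kevin Durant"],
    "2024-25": ["$55,761,216", "$51,179,021"],
    "2026-27": ["$62,587,158", None],
})

money_cols = ["2024-25", "2026-27"]  # assumed list of columns to convert

for col in money_cols:
    # Strip "$" and "," from the text, then convert to a numeric dtype.
    cleaned = sample[col].str.replace(r"[$,]", "", regex=True)
    sample[col] = pd.to_numeric(cleaned, errors="coerce")

print(sample.dtypes)
print(sample)
```

Note that `errors="coerce"` silently turns any unparseable text into `NaN`, so it is worth comparing the missing-value counts before and after the conversion.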

#### Save checked salary data

In [21]:
salary_df.to_csv('nba_salary_checked.csv', index=False)

## 3. Advanced Player Stats Data
### Load and Preview

In [24]:
adv_stats_df = pd.read_csv('advanced_player_stats.csv')
adv_stats_df.head()

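| | Rk | Player | Age | Team | Pos | G | GS | MP | PER | TS% | ... | USG% | OWS | DWS | WS | WS/48 | OBPM | DBPM | BPM | VORP | Awards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Mikal Bridges | 28 | NYK | SF | 82 | 82 | 3036 | 14.0 | 0.585 | ... | 19.6 | 3.7 | 2.0 | 5.7 | 0.09 | 0.4 | -0.9 | -0.5 | 1.2 | |
| 1 | 2 | Josh Hart | 29 | NYK | SG | 77 | 77 | 2897 | 16.5 | 0.611 | ... | 15.3 | 5.4 | 3.8 | 9.2 | 0.153 | 1.1 | 1.8 | 2.8 | 3.6 | |
| 2 | 3 | Anthony Edwards | 23 | MIN | SG | 79 | 79 | 2871 | 20.1 | 0.595 | ... | 31.4 | 4.6 | 3.8 | 8.4 | 0.14 | 4.4 | 0.0 | 4.3 | 4.6 | |
| 3 | 4 | Devin Booker | 28 | PHO | SG | 75 | 75 | 2795 | 19.3 | 0.589 | ... | 29.3 | 6.1 | 0.3 | 6.4 | 0.111 | 2.8 | -2.4 | 0.4 | 1.7 | |
| 4 | 5 | James Harden | 35 | LAC | PG | 79 | 79 | 2789 | 20.0 | 0.582 | ... | 29.6 | 4.0 | 4.3 | 8.3 | 0.143 | 3.5 | 0.8 | 4.3 | 4.4 | |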


#### Check missing values

In [27]:
adv_stats_df.isnull().sum()

Rk          0
Player      0
Age         0
Team        0
Pos         0
G           0
GS          0
MP          0
PER         0
TS%         4
3PAr        4
FTr         4
ORB%        1
DRB%        1
TRB%        1
AST%        1
STL%        1
BLK%        1
TOV%        3
USG%        1
OWS         1
DWS         1
WS          1
WS/48       1
OBPM        1
DBPM        1
BPM         1
VORP        1
Awards    735
dtype: int64

#### Check for duplicates

In [30]:
adv_stats_df.duplicated().sum()

0

#### Check data types

In [33]:
adv_stats_df.dtypes

Rk          int64
Player     object
Age         int64
Team       object
Pos        object
G           int64
GS          int64
MP          int64
PER       float64
TS%       float64
3PAr      float64
FTr       float64
ORB%      float64
DRB%      float64
TRB%      float64
AST%      float64
STL%      float64
BLK%      float64
TOV%      float64
USG%      float64
OWS       float64
DWS       float64
WS        float64
WS/48     float64
OBPM      float64
DBPM      float64
BPM       float64
VORP      float64
Awards    float64
dtype: object
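
`Awards` having dtype `float64` together with 735 missing values suggests the column may hold no values at all in this file, since any text entry would normally make it `object`. That is an inference from the outputs, not something checked here. A quick way to confirm it, sketched on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adv_stats_df; values are illustrative only.
sample = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "BPM": [1.5, np.nan, -0.3],
    "Awards": [np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing.
empty_cols = sample.columns[sample.isna().all()].tolist()
print(empty_cols)
```

If the same check on `adv_stats_df` lists `Awards`, the column carries no information in this file and could be taken from the player stats file instead, where it is stored as text.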

#### (If needed) Example of fixing a data type or trimming whitespace
*(Not applied here; shown only to demonstrate the process.)*

In [36]:
# Example only: Convert a stat column to float if needed
# adv_stats_df['BPM'] = adv_stats_df['BPM'].astype(float)

#### Save checked advanced player stats data

In [39]:
adv_stats_df.to_csv('advanced_player_stats_checked.csv', index=False)

## 4. NBA Player Stats Data
### Load and Preview

In [42]:
player_stats_df = pd.read_csv('nba_player_stats.csv')
player_stats_df.head()

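| | Rk | Player | Age | Team | Pos | G | GS | MP | FG | FGA | ... | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | Awards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | DeMar DeRozan | 34.0 | CHI | SF | 79.0 | 79.0 | 2989.0 | 7.8 | 16.3 | ... | 0.5 | 3.6 | 4.1 | 5.1 | 1.1 | 0.5 | 1.6 | 1.9 | 22.8 | CPOY-2 |
| 1 | 2.0 | Domantas Sabonis | 27.0 | SAC | C | 82.0 | 82.0 | 2928.0 | 7.8 | 13.1 | ... | 3.6 | 10.2 | 13.8 | 8.3 | 0.9 | 0.6 | 3.3 | 3.1 | 19.6 | MVP-8,DPOY-10,NBA3 |
| 2 | 3.0 | Coby White | 23.0 | CHI | PG | 79.0 | 78.0 | 2881.0 | 6.8 | 15.1 | ... | 0.5 | 3.9 | 4.5 | 5.1 | 0.7 | 0.2 | 2.1 | 2.3 | 18.9 | |
| 3 | 4.0 | Mikal Bridges | 27.0 | BRK | SF | 82.0 | 82.0 | 2854.0 | 7.1 | 16.3 | ... | 0.9 | 3.8 | 4.7 | 3.8 | 1.0 | 0.4 | 2.1 | 1.5 | 20.3 | |
| 4 | 5.0 | Paolo Banchero | 21.0 | ORL | PF | 80.0 | 80.0 | 2799.0 | 8.2 | 18.1 | ... | 1.1 | 6.1 | 7.1 | 5.5 | 0.9 | 0.6 | 3.2 | 2.0 | 23.2 | AS |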


#### Check missing values

In [45]:
player_stats_df.isnull().sum()

Rk         11
Player     11
Age        11
Team       11
Pos        11
G          11
GS         11
MP         11
FG         11
FGA        11
FG%        19
3P         11
3PA        11
3P%        57
2P         11
2PA        11
2P%        24
eFG%       19
FT         11
FTA        11
FT%        70
ORB        11
DRB        11
TRB        11
AST        11
STL        11
BLK        11
TOV        11
PF         11
PTS        11
Awards    689
dtype: int64
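
Every column, including `Player`, is missing at least 11 values, which would be consistent with 11 completely empty rows in the file. That is an inference from the counts, not something verified here. A sketch of how it could be checked, on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for player_stats_df; values are illustrative only.
sample = pd.DataFrame({
    "Player": ["A", None, "B", None],
    "PTS": [22.8, np.nan, 19.6, np.nan],
    "FT%": [0.85, np.nan, np.nan, np.nan],
})

# Rows in which every column is missing.
fully_empty = sample.isna().all(axis=1)
print(fully_empty.sum())

# how="all" drops only fully empty rows; rows with a single gap are kept.
trimmed = sample.dropna(how="all")
print(trimmed.isnull().sum())
```

After removing fully empty rows, the remaining gaps (for example a missing `FT%`) can be judged on their own. On basketball-reference a blank percentage usually goes with zero attempts, but that too should be confirmed in the data.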

#### Check for duplicates

In [48]:
player_stats_df.duplicated().sum()

10
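
`duplicated()` treats missing values as equal to each other, so a group of fully empty rows is reported as duplicates of the first one. With 11 values missing in every column, 10 flagged rows is what 11 empty rows would produce, though that link is an assumption until the flagged rows are inspected. A sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for player_stats_df: three fully empty rows.
sample = pd.DataFrame({
    "Player": ["A", None, "B", None, None],
    "PTS": [22.8, np.nan, 19.6, np.nan, np.nan],
})

# The first empty row is not flagged; the later two are.
n_flagged = sample.duplicated().sum()
print(n_flagged)

# keep=False shows every member of each duplicate group for inspection.
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# One possible cleanup: drop empty rows first, then any true duplicates.
deduped = sample.dropna(how="all").drop_duplicates()
print(len(deduped))
```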

#### Check data types

In [51]:
player_stats_df.dtypes

Rk        float64
Player     object
Age       float64
Team       object
Pos        object
G         float64
GS        float64
MP        float64
FG        float64
FGA       float64
FG%       float64
3P        float64
3PA       float64
3P%       float64
2P        float64
2PA       float64
2P%       float64
eFG%      float64
FT        float64
FTA       float64
FT%       float64
ORB       float64
DRB       float64
TRB       float64
AST       float64
STL       float64
BLK       float64
TOV       float64
PF        float64
PTS       float64
Awards     object
dtype: object
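
Count columns such as `Rk`, `Age`, and `G` appear as `float64` here, whereas `Age` and `G` are `int64` in the advanced stats file. Missing values force an integer column to float when a CSV is read. One option, sketched on a hypothetical frame, is pandas' nullable `Int64` dtype; the `count_cols` list is an assumption about which columns hold whole numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for player_stats_df; values are illustrative only.
sample = pd.DataFrame({
    "Player": ["A", "B", None],
    "Age": [34.0, 27.0, np.nan],
    "G": [79.0, 82.0, np.nan],
})

count_cols = ["Age", "G"]  # assumed whole-number columns

# "Int64" (capital I) stores integers and still allows missing values.
sample = sample.astype({col: "Int64" for col in count_cols})
print(sample.dtypes)
print(sample)
```

The cast raises an error if a column contains a non-integer value such as `27.5`, which makes it a useful check as well as a conversion.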

#### (If needed) Example of fixing a data type or trimming whitespace
*(Not applied here; shown only to demonstrate the process.)*

In [54]:
# Example only: Convert a stat column to float if needed
# player_stats_df['PTS'] = player_stats_df['PTS'].astype(float)

#### Save checked player stats data

In [57]:
player_stats_df.to_csv('nba_player_stats_checked.csv', index=False)

## Summary
- For each file, I loaded the data and displayed the first few rows to get an overview.
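- I checked for missing values in all columns, and all three files have some. In the salary file, the future-season columns (2025-26 through 2029-30) are missing for 179 to 555 rows and Guaranteed is missing for 1 row. In the advanced stats file, the columns from TS% through VORP are each missing 1 to 4 values and Awards is missing for 735 rows. In the player stats file, every column is missing at least 11 values, the percentage columns are missing more (up to 70 for FT%), and Awards is missing for 689 rows. No values were filled or dropped at this stage.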
- I checked for duplicate rows in each dataset. The salary and advanced stats files had none; the player stats file had 10 duplicate rows, which have not been removed yet.
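- I reviewed the data types of each column. Most are appropriate, with three exceptions: the salary columns are stored as text (object) because of the "$" and comma formatting; count columns in the player stats file (such as Rk, Age, and G) are float64 rather than integer; and Awards in the advanced stats file is float64. I included an example of how to fix a type, but did not apply it at this stage.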
- No advanced cleaning, exclusion, or transformation was done at this stage. The main focus was to check and verify the structure and quality of the raw data files.
- I did not address outliers at this point, since the focus was on checking the raw data. I plan to look at them in a later step of the project.
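
For the planned outlier review, one common starting point is the 1.5 × IQR rule. The sketch below only illustrates that rule on made-up numbers; the column name and the 1.5 multiplier are assumptions, not decisions made in this notebook:

```python
import pandas as pd

# Made-up points-per-game values; not taken from the data files.
pts = pd.Series([2.1, 5.4, 8.0, 9.3, 10.2, 11.5, 12.0, 33.9], name="PTS")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = pts.quantile(0.25), pts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not removed.
outliers = pts[(pts < lower) | (pts > upper)]
print(outliers)
```

In NBA data a flagged value is often a genuine star performance or salary rather than an error, so flagged rows are better reviewed than dropped automatically.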

Ready to discuss these steps, missing values, and outlier plans in my next mentor call.