# **Exploratory Data Analysis in Python using pandas**

In this Jupyter notebook, I will show you how to perform exploratory data analysis on web-scraped NBA player stats.

## **Web scraping data using pandas**

The following block of code will retrieve the "2021-22 NBA Player Stats: Per Game" data from http://www.basketball-reference.com/.

In [None]:
import pandas as pd

# Retrieve HTML table data
url = 'https://www.basketball-reference.com/leagues/NBA_2022_per_game.html'
html = pd.read_html(url, header = 0)
df2022 = html[0]

In [None]:
df2022

Check the "Age" column. Do we need to do anything?

In [None]:
df2022.Age.value_counts()

In [None]:
# Data cleaning if needed
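
Scraped HTML tables sometimes repeat their header row inside the body. If `value_counts()` shows the literal string `Age` among the ages, those rows need to be dropped before any numeric work. The following is a minimal sketch on a small hypothetical sample (the player names and values are made up, not taken from the scraped table):

```python
import pandas as pd

# Hypothetical sample mimicking a scraped table in which the header row
# reappears among the player rows and every column is read as text.
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player", "Player C"],
    "Age": ["24", "31", "Age", "27"],
    "PTS": ["27.1", "9.8", "PTS", "15.4"],
})

# Keep only genuine player rows, then convert the text columns to numbers.
cleaned = raw[raw["Age"] != "Age"]
cleaned = cleaned.assign(
    Age=pd.to_numeric(cleaned["Age"]),
    PTS=pd.to_numeric(cleaned["PTS"]),
)
print(cleaned.dtypes)
```

The same filter applied to `df2022` would produce the cleaned dataframe used in the rest of the notebook, assuming the repeated headers are the only problem.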

## **Acronyms**


Acronym | Description
---|---
Rk | Rank
Pos | Position
Age | Player's age on February 1 of the season
Tm | Team
G | Games
GS | Games Started
MP | Minutes Played Per Game
FG | Field Goals Per Game
FGA | Field Goal Attempts Per Game
FG% | Field Goal Percentage
3P | 3-Point Field Goals Per Game
3PA | 3-Point Field Goal Attempts Per Game
3P% | FG% on 3-Pt FGAs.
2P | 2-Point Field Goals Per Game
2PA | 2-Point Field Goal Attempts Per Game
2P% | FG% on 2-Pt FGAs.
eFG% | Effective Field Goal Percentage *(Note: This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.)*
FT | Free Throws Per Game
FTA | Free Throw Attempts Per Game
FT% | Free Throw Percentage
ORB | Offensive Rebounds Per Game
DRB | Defensive Rebounds Per Game
TRB | Total Rebounds Per Game
AST | Assists Per Game
STL | Steals Per Game
BLK | Blocks Per Game
TOV | Turnovers Per Game
PF | Personal Fouls Per Game
PTS | Points Per Game

## **Data cleaning**

### Data dimension
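
The number of rows and columns is available through the `shape` attribute. A minimal sketch, using a small hypothetical dataframe named `df` as a stand-in for the cleaned table:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
    "PTS": [27.1, 9.8, 15.4],
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows x {n_cols} columns")
```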

### Dataframe contents
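
A quick look at the first rows and a column summary is usually enough to confirm that the table loaded as expected. This sketch uses a small hypothetical `df`, not the scraped data:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
    "PTS": [27.1, 9.8, 15.4],
})

first_rows = df.head(2)  # first two rows; head() defaults to five
print(first_rows)
df.info()                # column names, non-null counts and dtypes
```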

### Check for missing values
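
One common way to count missing values is `isnull().sum()`, which gives a per-column total. The sample below is hypothetical; its missing `3P%` value imitates a player with no 3-point attempts:

```python
import pandas as pd

# Hypothetical sample: Player B has no 3-point attempts, so 3P% is missing.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "3PA": [6.2, 0.0, 3.1],
    "3P%": [0.381, None, 0.352],
})

missing_per_column = df.isnull().sum()          # missing values in each column
total_missing = int(missing_per_column.sum())   # missing values overall
print(missing_per_column)
```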

### Replace missing values with 0 
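
`fillna(0)` replaces every missing value with 0. Whether 0 is the right fill value depends on the column: for a percentage column it treats "no attempts" as 0%, which is an assumption rather than a measured value. A sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample with one missing percentage.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "3PA": [6.2, 0.0, 3.1],
    "3P%": [0.381, None, 0.352],
})

df = df.fillna(0)  # returns a new dataframe with NaN replaced by 0
print(df)
```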

## **Exploratory Data Analysis**

### Display the dataframe

In [None]:
df

### Overview of the data type of each column in the dataframe
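
The `dtypes` attribute lists the data type of every column. A minimal sketch with a hypothetical `df`:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "G": [68, 74, 55],
    "PTS": [27.1, 9.8, 15.4],
})

column_types = df.dtypes  # Series indexed by column name
print(column_types)
```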

### Show specific data types in dataframe
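
`select_dtypes` returns only the columns that match (or do not match) a given type. This sketch splits a hypothetical `df` into numeric and non-numeric columns:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
    "G": [68, 74, 55],
    "PTS": [27.1, 9.8, 15.4],
})

numeric_cols = df.select_dtypes(include=["number"])  # integer and float columns
text_cols = df.select_dtypes(exclude=["number"])     # everything else
print(numeric_cols.columns.tolist(), text_cols.columns.tolist())
```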

## **QUESTIONS**

### **Conditional Selection**

Exploratory data analysis relies on being able to select subsets of the data for analysis or comparison.

**Which player scored the most Points (PTS) Per Game?**
Here, we will return the entire row.
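
One way to do this is a boolean mask that compares each player's `PTS` with the column maximum. The sample data below is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Tm": ["AAA", "BBB", "CCC"],
    "PTS": [27.1, 9.8, 15.4],
})

# Rows where PTS equals the maximum; ties would return several rows.
top_scorer = df[df["PTS"] == df["PTS"].max()]
print(top_scorer)
```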

We will return specific column values.

Next, which team does the player play for?

Which position does the player play?

How many games did the player play in the season?
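
The same mask can be combined with `.loc` to pull out individual columns for the top scorer. A sketch on hypothetical data (team codes and values are made up):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
    "Tm": ["AAA", "BBB", "CCC"],
    "G": [68, 74, 55],
    "PTS": [27.1, 9.8, 15.4],
})

is_top = df["PTS"] == df["PTS"].max()  # True only for the top scorer

team = df.loc[is_top, "Tm"]       # which team?
position = df.loc[is_top, "Pos"]  # which position?
games = df.loc[is_top, "G"]       # how many games?
print(team.iloc[0], position.iloc[0], games.iloc[0])
```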

**Which players scored more than 20 Points (PTS) Per Game?**
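
A threshold filter is another boolean mask. This sketch keeps the rows of a hypothetical `df` with more than 20 points per game:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "PTS": [27.1, 9.8, 15.4, 22.3],
})

over_20 = df[df["PTS"] > 20]  # strictly more than 20 points per game
print(over_20)
```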

**Which player had the highest 3-Point Field Goals Per Game (3P)?**

**Which player had the highest Assists Per Game (AST)?**
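
As an alternative to the mask approach, `idxmax()` returns the index label of the largest value, which `.loc` can then look up. Note that `3P` is not a valid Python identifier, so it needs bracket access (`df["3P"]`) rather than attribute access. The data is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "3P": [3.4, 0.1, 4.2],
    "AST": [9.1, 2.3, 4.0],
})

# idxmax() gives the row label of the maximum; on ties it returns the first.
best_3p = df.loc[df["3P"].idxmax(), "Player"]
best_ast = df.loc[df["AST"].idxmax(), "Player"]
print(best_3p, best_ast)
```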

### **GroupBy() function**

**Which Los Angeles Lakers player scored the most Points (PTS) Per Game?**
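
A possible approach is to group by team, take one team's group with `get_group`, and then look for the maximum inside it. The sketch assumes the Lakers appear as `LAL` in the `Tm` column; check the actual values in your table. The players and numbers are hypothetical:

```python
import pandas as pd

# Hypothetical sample; "LAL" as the Lakers' team code is an assumption.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Tm": ["LAL", "LAL", "BOS", "BOS"],
    "PTS": [18.5, 25.2, 27.1, 9.8],
})

lakers = df.groupby("Tm").get_group("LAL")  # rows for one team only
lakers_top = lakers[lakers["PTS"] == lakers["PTS"].max()]
print(lakers_top)
```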

**Of the 5 positions, which position scores the most points?**

We first group players by their positions.

We will now keep only the 5 traditional positions (players listed with combo positions are removed from the analysis).

Now, let's take a look at the descriptive statistics.
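
These three steps can be sketched as follows: filter to the five traditional positions, group by `Pos`, and call `describe()` on `PTS`. The sample is hypothetical and includes one combo position (`SG-SF`) to show the filter at work:

```python
import pandas as pd

# Hypothetical sample with one combo-position player.
df = pd.DataFrame({
    "Player": [f"Player {c}" for c in "ABCDEFG"],
    "Pos": ["C", "C", "PF", "SF", "PG", "SG", "SG-SF"],
    "PTS": [10.0, 20.0, 12.0, 15.0, 22.0, 18.0, 9.0],
})

positions = ["C", "PF", "SF", "PG", "SG"]
traditional = df[df["Pos"].isin(positions)]  # drops the combo position

# count, mean, std, min, quartiles and max of PTS for each position
summary = traditional.groupby("Pos")["PTS"].describe()
print(summary)

# Position with the highest mean points per game in this sample
top_position = summary["mean"].idxmax()
```

Comparing means is only one reading of "scores the most points"; the median (the `50%` column) is less sensitive to a few very high scorers.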

### **Histograms**

We'll also try to answer this question by showing some histogram plots. So, to make it a bit easier, let's create a subset dataframe.
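
The plotting cells below refer to a dataframe named `PTS`. One plausible way to build it, assuming it holds just the `Pos` and `PTS` columns for the five traditional positions, is shown here on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Pos": ["PG", "C", "SG-SF", "SF"],
    "PTS": [27.1, 9.8, 15.4, 12.0],
})

positions = ["C", "PF", "SF", "PG", "SG"]

# Keep the traditional positions and only the two columns needed for plotting.
PTS = df.loc[df["Pos"].isin(positions), ["Pos", "PTS"]]
print(PTS)
```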

#### **pandas built-in visualization**
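
pandas can draw one histogram per group through the `by` argument of `DataFrame.hist`. This sketch uses a hypothetical `PTS` dataframe with two positions; the `layout` and `figsize` values are arbitrary choices:

```python
import pandas as pd

# Hypothetical subset with the two columns used for plotting.
PTS = pd.DataFrame({
    "Pos": ["C", "C", "C", "PG", "PG", "PG"],
    "PTS": [8.5, 12.0, 17.3, 11.2, 19.8, 25.6],
})

# One histogram of PTS per position, laid out side by side.
axes = PTS.hist(column="PTS", by="Pos", layout=(1, 2), figsize=(8, 3))
```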

#### **Seaborn data visualization**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.FacetGrid(PTS, col="Pos")
g.map(plt.hist, "PTS");

### **Box plots**

#### **Box plot of points scored (PTS) grouped by Position**

##### **pandas built-in visualization**
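
`DataFrame.boxplot` accepts the same kind of grouping through `by`. A minimal sketch on a hypothetical `PTS` dataframe:

```python
import pandas as pd

# Hypothetical subset with the two columns used for plotting.
PTS = pd.DataFrame({
    "Pos": ["C", "C", "C", "PG", "PG", "PG"],
    "PTS": [8.5, 12.0, 17.3, 11.2, 19.8, 25.6],
})

# One box per position on a shared axis.
ax = PTS.boxplot(column="PTS", by="Pos")
```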

##### **Seaborn data visualization**

In [None]:
import seaborn as sns

sns.boxplot(x = 'Pos', y = 'PTS', data = PTS) 

In [None]:
sns.boxplot(x = 'Pos', y = 'PTS', data = PTS) 
sns.stripplot(x = 'Pos', y = 'PTS', data = PTS,
              jitter=True, 
              marker='o',
              alpha=0.8, 
              color="black")

### **Heat map**

#### Compute the correlation matrix
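
`DataFrame.corr()` computes pairwise Pearson correlations by default. Restricting the frame to numeric columns first avoids errors from text columns such as `Player`. The data is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "MP": [35.1, 18.4, 28.0, 31.2],
    "AST": [9.1, 2.3, 4.0, 5.5],
    "PTS": [27.1, 9.8, 15.4, 22.3],
})

# Correlation matrix of the numeric columns only.
corr = df.select_dtypes(include=["number"]).corr()
print(corr)
```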

#### Make the heat map
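
In its simplest form, `sns.heatmap` takes the correlation matrix and nothing else. A sketch on hypothetical data:

```python
import pandas as pd
import seaborn as sns

# Hypothetical numeric sample.
df = pd.DataFrame({
    "MP": [35.1, 18.4, 28.0, 31.2],
    "AST": [9.1, 2.3, 4.0, 5.5],
    "PTS": [27.1, 9.8, 15.4, 22.3],
})

corr = df.corr()
ax = sns.heatmap(corr)  # draws on the current matplotlib axes
```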

#### Adjust figure size of heat map

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, square=True)

#### Mask diagonal half of heat map (Diagonal correlation matrix)

In [None]:
# https://seaborn.pydata.org/generated/seaborn.heatmap.html

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Boolean mask that hides the upper triangle, including the diagonal
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, mask=mask, vmax=1, square=True)

### **Scatter Plot**

In [None]:
df

#### Select columns if they have numerical data types

#### Select the first 5 columns (by index number)
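
Both selections can be sketched together: `select_dtypes` keeps the numeric columns, and `iloc` then takes the first five by position. The variable names `number` and `first_five` are illustrative, and the data is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned per-game table.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Pos": ["PG", "C", "SF"],
    "Age": [24, 31, 27],
    "G": [68, 74, 55],
    "STL": [1.2, 0.5, 0.9],
    "BLK": [0.3, 1.8, 0.6],
    "AST": [9.1, 2.3, 4.0],
    "PTS": [27.1, 9.8, 15.4],
})

number = df.select_dtypes(include=["number"])  # numeric columns only
first_five = number.iloc[:, :5]                # all rows, first five columns
print(first_five.columns.tolist())
```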

#### Select specific columns (by column names)

In [None]:
selections = ['Age', 'G', 'STL', 'BLK', 'AST', 'PTS']

#### Make scatter plot grid

##### 5 columns
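
One way to draw a grid of pairwise scatter plots is seaborn's `PairGrid`, mapping `plt.scatter` onto every cell. This sketch uses five hypothetical numeric columns; it is one possible approach, not necessarily the one originally intended for this cell:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric sample with five columns.
number = pd.DataFrame({
    "Age": [24, 31, 27, 22, 29],
    "G": [68, 74, 55, 80, 61],
    "STL": [1.2, 0.5, 0.9, 1.5, 0.7],
    "BLK": [0.3, 1.8, 0.6, 0.2, 1.1],
    "AST": [9.1, 2.3, 4.0, 6.2, 3.3],
})

g = sns.PairGrid(number)  # one subplot for every pair of columns
g.map(plt.scatter)        # scatter plot in each cell
```

Passing a dataframe with all numeric columns instead of a five-column subset produces the full grid, at the cost of a much larger and slower figure.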

##### All columns