# M1L3 NumPy Part 2 Data Challenge: Summary Statistics & Outliers

## Scenario

You are a data analyst working for a basketball team. You have been given a dataset containing player stats, and your job is to analyze the data. Each row in the dataset represents a player, and the columns represent different stats:
- **Points Scored**
- **Rebounds**
- **Assists**

You will:
1. Calculate summary statistics (mean, median, standard deviation, etc.).
2. Add new data to the existing dataset.
3. Detect outliers using the IQR method.

### Step 1: Import NumPy

In [1]:
import numpy as np

### Step 2: Run the cell below for Player Stats

Here is the dataset of player stats. Each row represents a player, and the columns represent points, rebounds, and assists.

In [4]:
# Player stats array -- run without changing the data
player_stats = np.array([
    [25, 10, 5],
    [18, 7, 8],
    [30, 12, 4],
    [22, 9, 6],
    [42, 5, 15]
])

# Print the dataset
print(player_stats)

[[25 10  5]
 [18  7  8]
 [30 12  4]
 [22  9  6]
 [42  5 15]]


### Step 3: Calculate Summary Statistics

Task:
- Calculate the mean for each stat (points, rebounds, assists).
- Calculate the median for each stat.
- Calculate the standard deviation for each stat.

*This expands on your last data challenge: instead of calculating summary statistics for one column, you will calculate them for all columns.*

**Hint: You can pass the `axis=0` argument to calculate a statistic for every column at once. For example: `np.mean(data, axis=0)`**
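To see what `axis=0` does before applying it to the player data, here is a minimal sketch on a made-up array. The name `demo` and its values are hypothetical and only illustrate the direction of the calculation.

```python
import numpy as np

# Hypothetical toy array: 2 rows (players) x 3 columns (stats)
demo = np.array([
    [10, 4, 2],
    [20, 6, 4],
])

# axis=0 collapses the rows, leaving one value per column (per stat)
col_means = np.mean(demo, axis=0)

# axis=1 collapses the columns, leaving one value per row (per player)
row_means = np.mean(demo, axis=1)

print("Per-column means:", col_means, "shape:", col_means.shape)
print("Per-row means:", row_means, "shape:", row_means.shape)
```

For this challenge you want one value per stat, which is why the hint uses `axis=0`.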

In [7]:
# axis=0 computes each statistic down the rows, giving one value per column
mean_stats = np.mean(player_stats, axis=0)
print("Mean stats:", mean_stats)

median_stats = np.median(player_stats, axis=0)
print("Median stats:", median_stats)

std_stats = np.std(player_stats, axis=0)
print("Standard deviation stats:", std_stats)

Mean stats: [27.4  8.6  7.6]
Median stats: [25.  9.  6.]
Standard deviation stats: [8.28492607 2.41660919 3.92937654]


**What can you conclude from the result you got above?  What are some main takeaways?  Add a markdown cell below this one and type up an answer.**  

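Points scored varies far more across players (standard deviation of about 8.3) than rebounds (about 2.4) or assists (about 3.9). For points and assists, the mean is also higher than the median (27.4 vs. 25 and 7.6 vs. 6), which suggests that a few high values are pulling the averages up.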
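One detail worth knowing when reading these numbers: by default `np.std` divides by `n` (the population standard deviation). If you treat the five players as a sample, you may prefer dividing by `n - 1`, which `ddof=1` does. The sketch below uses the points column from the original five players; the variable names are illustrative.

```python
import numpy as np

# Points column from the original five players
points = np.array([25, 18, 30, 22, 42])

# Default: ddof=0, divides by n (population standard deviation)
pop_std = np.std(points)

# ddof=1 divides by n - 1 (sample standard deviation)
sample_std = np.std(points, ddof=1)

print("Population std (ddof=0):", pop_std)
print("Sample std (ddof=1):", sample_std)
```

With so few rows the two values differ noticeably, so it helps to say which one you are reporting.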

### Step 4: Add New Data to the Dataset

Task:
A new player has joined the team. Add their stats to the existing dataset:

- Points: 20
- Rebounds: 8
- Assists: 6


In [8]:
# New player's stats: points, rebounds, assists
new_player = [20, 8, 6]
player_stats = np.vstack([player_stats, new_player])
print("Updated player stats:")
print(player_stats)

Updated player stats:
[[25 10  5]
 [18  7  8]
 [30 12  4]
 [22  9  6]
 [42  5 15]
 [20  8  6]]
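A short sketch of how `np.vstack` behaves, using a smaller stand-in array (`stats` and `new_row` are illustrative names). It shows that stacking returns a new array rather than changing the original, and that `np.append` with `axis=0` is one possible alternative if the new row is wrapped in a list so it is 2-D.

```python
import numpy as np

stats = np.array([
    [25, 10, 5],
    [18, 7, 8],
])
new_row = [20, 8, 6]

# vstack returns a new array; `stats` itself is not modified
stacked = np.vstack([stats, new_row])

# np.append with axis=0 needs the new row to be 2-D, hence [new_row]
appended = np.append(stats, [new_row], axis=0)

print("Original shape:", stats.shape)
print("Stacked shape:", stacked.shape)
print("Same result:", np.array_equal(stacked, appended))
```

This is why the cell above assigns the result back to `player_stats`.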


### Step 5: Detect Outliers Using the IQR Method

Task:

- Calculate the IQR for points scored.
- Determine the lower bound and upper bound for outliers.
- Identify any outliers in the points column (**there may not be any outliers!**).

In [15]:
# Work with the points column (column 0)
points = player_stats[:, 0]

# Quartiles and interquartile range
Q1 = np.percentile(points, 25)
Q3 = np.percentile(points, 75)
IQR = Q3 - Q1

# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = points[(points < lower_bound) | (points > upper_bound)]
print("Outliers in points:", outliers)

Outliers in points: [42]
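To see why 42 is flagged, it can help to print the intermediate values. This sketch repeats the calculation on the six points values from the updated dataset and assumes NumPy's default (linear interpolation) percentile method; the lowercase variable names are illustrative.

```python
import numpy as np

# Points column after adding the new player
points = np.array([25, 18, 30, 22, 42, 20])

# Both quartiles in one call (default linear interpolation)
q1, q3 = np.percentile(points, [25, 75])
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Lower bound:", lower, "Upper bound:", upper)
print("Outliers:", points[(points < lower) | (points > upper)])
```

The upper bound works out to 41.125, so 42 is only just outside the range.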


## Above and Beyond (OPTIONAL): Detect Outliers Using Standard Deviation

Task:
1. Calculate the mean and standard deviation for points scored.
2. Determine the lower bound and upper bound for outliers (3 standard deviations away from the mean).
3. Identify any outliers in the points column (**there may not be any outliers!**).

In [29]:
# Mean and (population) standard deviation of the points column
points = player_stats[:, 0]
mean_points = np.mean(points)
std_points = np.std(points)

# Values more than 3 standard deviations from the mean count as outliers
lower_bound_std = mean_points - 3 * std_points
upper_bound_std = mean_points + 3 * std_points

outliers_std = points[(points < lower_bound_std) | (points > upper_bound_std)]
print("Outliers in points (using standard deviation):", outliers_std)

Outliers in points (using standard deviation): []
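The two methods disagree about 42, and z-scores help explain why. The sketch below computes how many standard deviations each value sits from the mean. It also prints `sqrt(n - 1)`, which, by a known result (Samuelson's inequality), is the largest z-score any single value can reach when the population standard deviation is used. Treat the bound as background context rather than part of the challenge; the variable names are illustrative.

```python
import numpy as np

# Points column after adding the new player
points = np.array([25, 18, 30, 22, 42, 20])

# z-score: distance from the mean in units of (population) standard deviation
z_scores = (points - points.mean()) / points.std()
max_abs_z = np.abs(z_scores).max()

# Largest possible |z| for a dataset of this size (ddof=0)
n = points.size
z_ceiling = np.sqrt(n - 1)

print("z-scores:", np.round(z_scores, 2))
print("Largest |z|:", round(float(max_abs_z), 2))
print("Ceiling for n =", n, ":", round(float(z_ceiling), 2))
```

Here 42 is a little under 2 standard deviations from the mean, and with only six values no point could reach 3, so the 3-standard-deviation rule cannot flag anything in a dataset this small.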
