
# **Homework 4: Exploratory Data Analysis**
---

### **Description**
In this homework notebook, you will hone your data manipulation and exploration skills using Python's pandas library. Using an NBA dataset that includes various metrics for basketball players, you will perform a range of tasks, from basic operations to more advanced data manipulation techniques.

<br>

### **Lab Structure**
* **Part 1**: [Reading Datasets](#p1)

* **Part 2**: [Exploring Datasets](#p2)

* **Part 3**: [Most Common and Unique Values](#p3)

* **Part 4**: [Working with Rows and Columns](#p4)

* **Part 5**: [Exploring the Mean, Median, and Sum](#p5)


<br>

### **Learning Objectives**
By the end of this lab, you will be able to:
* Recognize what pandas is and why we're using it.
* Recognize what a DataFrame object is.
* Recognize basic DataFrame commands.


<br>


### **Resources**
* [Python Basics Cheat Sheet](https://docs.google.com/document/d/1bMqW8SKR6xC0-d1f0hb-DnYPJ0CyszjiwPCovAl9TLc/edit)

* [EDA with pandas Cheat Sheet](https://docs.google.com/document/d/1xnKJsii1AsRH2t22XtrAh7FzSFGqAR0hAmW4oLYM4MI/edit)

<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
!pip install scikit-learn
import pandas as pd
from sklearn import datasets



<a name="p1"></a>

---
## **Part 1: Reading Datasets**
---

#### **Problem #1.1**
This dataset contains historical statistics for NBA (National Basketball Association) players, sourced from Basketball-Reference.com. The data includes a wide range of metrics from basic statistics like games played and minutes played to more advanced statistics like player efficiency ratings. While the dataset is rich and detailed, we are only focusing on a subset of the available columns to introduce you to the basics of data exploration and manipulation.

<br>

**Even if you're not familiar with basketball, understanding the data columns should still be relatively straightforward. Here's what each column we're using means:**

- `player_id`: A unique ID assigned by Basketball-Reference.com to each player.

- `name_common`: The name of the basketball player.

- `year_id`: The NBA season, identified by the year in which it ends. For example, the 2019-2020 NBA season is represented as "2020".

- `age`: The age of the player as of February 1 of that season.

- `team_id`: The abbreviation for the team that the player played for during that season. Each NBA team has a unique abbreviation, like 'LAL' for the Los Angeles Lakers.

- `G`: Games Played - The number of games the player participated in during that season.

- `Min`: Minutes Played - The total number of minutes the player was on the court during the season.

- `MPG`: Minutes Per Game - This is the average number of minutes the player was on the court per game during the season. It's calculated as Min divided by G.

- `FT%`: Free Throw Percentage - This is the percentage of free throws the player made successfully. A free throw is an opportunity given to a player to score one point, unopposed, from a position 15 feet from the basket. It's calculated as Free Throws Made divided by Free Throws Attempted and stored on a 0-100 scale (for example, 82.9 means 82.9%).
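
The relationship between `G`, `Min`, and `MPG` described above can be checked on a few rows. The sketch below uses three rows copied from the dataset preview that appears once the data is loaded; the `MPG_check` column name is made up for this illustration.

```python
import pandas as pd

# Three rows copied from the dataset preview in this notebook.
sample = pd.DataFrame({
    'name_common': ['Trae Young', 'Kevin Huerter', 'John Collins'],
    'G': [81, 75, 61],
    'Min': [2503, 2048, 1829],
    'MPG': [30.9, 27.3, 30.0],
})

# Recompute minutes per game and round to one decimal place,
# which is the precision used in the MPG column.
sample['MPG_check'] = (sample['Min'] / sample['G']).round(1)
print(sample[['name_common', 'MPG', 'MPG_check']])
```

For these rows, the recomputed values match the `MPG` column after rounding.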

<br>

**Run the code cell below to load the data.**

In [None]:
url = 'https://raw.githubusercontent.com/fivethirtyeight/nba-player-advanced-metrics/master/nba-data-historical.csv'
nba_df = pd.read_csv(url)
nba_df = nba_df[['player_id', 'name_common', 'year_id', 'age', 'team_id', 'G', 'Min', 'MPG', 'FT%']]
nba_df = nba_df.dropna()
nba_df

| | player_id | name_common | year_id | age | team_id | G | Min | MPG | FT% |
|---|---|---|---|---|---|---|---|---|---|
| 808 | youngtr01 | Trae Young | 2019 | 20 | ATL | 81 | 2503 | 30.9 | 82.9 |
| 809 | huertke01 | Kevin Huerter | 2019 | 20 | ATL | 75 | 2048 | 27.3 | 73.2 |
| 810 | bembrde01 | DeAndre' Bembry | 2019 | 24 | ATL | 82 | 1931 | 23.5 | 64.0 |
| 811 | collijo01 | John Collins | 2019 | 21 | ATL | 61 | 1829 | 30.0 | 76.3 |
| 812 | bazemke01 | Kent Bazemore | 2019 | 29 | ATL | 67 | 1643 | 24.5 | 72.6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28163 | weissbo01 | Bob Weiss | 1977 | 34 | WSB | 62 | 768 | 12.4 | 78.4 |
| 28164 | riordmi01 | Mike Riordan | 1977 | 31 | WSB | 49 | 289 | 5.9 | 73.3 |
| 28165 | weathni01 | Nick Weatherspoon | 1977 | 26 | WSB | 11 | 152 | 13.8 | 62.5 |
| 28166 | pacejo01 | Joe Pace | 1977 | 23 | WSB | 30 | 119 | 4.0 | 55.2 |
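
The loading cell does two things after `pd.read_csv`: it keeps a subset of columns and then calls `dropna()`. Here is a minimal sketch of those two steps on a tiny frame. The row named `Hypothetical Player` and the `extra_metric` column are invented for illustration and are not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Two rows from the preview above plus one invented row with a missing FT%.
raw = pd.DataFrame({
    'name_common': ['Hypothetical Player', 'Trae Young', 'Joe Pace'],
    'G': [5, 81, 30],
    'FT%': [np.nan, 82.9, 55.2],
    'extra_metric': [3.0, 1.0, 2.0],  # invented column that is not selected
})

subset = raw[['name_common', 'G', 'FT%']]  # keep only the columns of interest
clean = subset.dropna()                    # drop every row that contains a NaN

print(clean)
print(clean.index.tolist())  # [1, 2]: the surviving rows keep their original labels
```

Because `dropna()` keeps the original index labels, a cleaned DataFrame does not necessarily start at 0, which may be why the preview above starts at index 808.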


<a name="p2"></a>

---
## **Part 2: Exploring Datasets**
---

#### **Problem #2.1**

How many players are included in this dataset?

In [None]:
nba_df.shape[0]

19489
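
Note that `shape[0]` counts rows. Since each row also carries a `year_id` and a `team_id`, the same player may appear in more than one row, in which case the number of rows is larger than the number of distinct players. A small sketch with invented player IDs shows the difference:

```python
import pandas as pd

# Invented player-season rows: 'playera01' appears in two seasons.
seasons = pd.DataFrame({
    'player_id': ['playera01', 'playera01', 'playerb01'],
    'year_id': [2018, 2019, 2019],
})

n_rows = seasons.shape[0]                    # number of rows
n_players = seasons['player_id'].nunique()   # number of distinct players

print(n_rows, n_players)  # 3 2
```

On the real data, `nba_df['player_id'].nunique()` would give the distinct-player count.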

#### **Problem #2.2**

How many columns are in this DataFrame?

In [None]:
nba_df.shape[1]

9

#### **Problem #2.3**
How many columns contain numerical data?

In [None]:
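# 'number' matches both integer and float columns;
# include=['int'] would skip the float columns MPG and FT%.
nba_df.select_dtypes(include='number').shape[1]

6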
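
When counting numerical columns, it matters which dtypes you ask `select_dtypes` for. The sketch below, built from two preview rows, compares `include='int'` with `include='number'`; the variable names are just for this example.

```python
import pandas as pd

sample = pd.DataFrame({
    'player_id': ['youngtr01', 'pacejo01'],  # text (object) column
    'age': [20, 23],                         # integer column
    'G': [81, 30],                           # integer column
    'MPG': [30.9, 4.0],                      # float column
    'FT%': [82.9, 55.2],                     # float column
})

int_cols = sample.select_dtypes(include='int').columns.tolist()
num_cols = sample.select_dtypes(include='number').columns.tolist()

print(int_cols)  # integer columns only
print(num_cols)  # integer and float columns
print(sample.dtypes)
```

Printing `dtypes` is a quick way to confirm which columns pandas treats as numbers.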

<a name="p3"></a>

---
## **Part 3: Most Common and Unique Values**
---

#### **Problem #3.1**

How many different NBA teams (`team_id`) are included in the dataset?

In [None]:
nba_df['team_id'].nunique()

42
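
`nunique()` only returns a count. If you also want to see which values occur, `unique()` and `value_counts()` are useful companions. A minimal sketch on a short, invented list of team abbreviations:

```python
import pandas as pd

teams = pd.Series(['ATL', 'ATL', 'WSB', 'ATL'], name='team_id')

n_teams = teams.nunique()            # how many distinct values
team_list = teams.unique().tolist()  # which values, in order of first appearance
team_counts = teams.value_counts()   # how often each value occurs

print(n_teams)      # 2
print(team_list)    # ['ATL', 'WSB']
print(team_counts)
```

On the real data, the count can be higher than the number of current NBA teams, presumably because abbreviations change when franchises move or are renamed (the preview includes `WSB`, for example).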

#### **Problem #3.2**

What is the most common `age` among all players in the dataset?

In [None]:
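# mode() returns the most frequent value(s); [0] takes the first.
# For comparison, nba_df['age'].value_counts().index[1] is the second most common age (24).
nba_df['age'].mode()[0]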
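
Both `mode()` and `value_counts()` rank values by frequency, but the position you index matters, and a notebook cell only displays its last expression. A sketch on invented ages:

```python
import pandas as pd

ages = pd.Series([24, 23, 24, 25, 23, 24], name='age')

most_common = ages.mode()[0]       # most frequent value
counts = ages.value_counts()       # sorted from most to least frequent
first = counts.index[0]            # same as the mode
second = counts.index[1]           # second most frequent value

print(most_common, first, second)  # 24 24 23
```

So `value_counts().index[0]` agrees with `mode()[0]`, while `index[1]` answers a different question.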

<a name="p4"></a>

---
## **Part 4: Working with Rows and Columns**
---




#### **Problem #4.1**

Complete the code below to output players above the age of 35.

In [None]:
# "Above the age of 35" is a strict comparison, so use > rather than >=.
# For comparison, nba_df[nba_df['age'] >= 35].shape[0] is 804.
older_players = nba_df[nba_df['age'] > 35]
older_players
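
"Above 35" and "35 or older" select different rows, so it is worth checking the comparison operator. A sketch with three invented players:

```python
import pandas as pd

players = pd.DataFrame({
    'name_common': ['Player A', 'Player B', 'Player C'],
    'age': [34, 35, 36],
})

above_35 = players[players['age'] > 35]       # strictly older than 35
at_least_35 = players[players['age'] >= 35]   # 35 or older

print(above_35.shape[0], at_least_35.shape[0])  # 1 2
```

The expression inside the brackets is a boolean Series; pandas keeps only the rows where it is `True`.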


#### **Problem #4.2**

Extract the following columns: `player_id`, `age`, `FT%`

In [None]:
selected_columns = nba_df[['player_id', 'age', 'FT%']]
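
The double brackets matter here: a list of column names returns a DataFrame, while a single name in single brackets returns a Series. A small sketch using two preview rows:

```python
import pandas as pd

sample = pd.DataFrame({
    'player_id': ['youngtr01', 'pacejo01'],
    'age': [20, 23],
    'FT%': [82.9, 55.2],
    'team_id': ['ATL', 'WSB'],
})

one_col = sample['age']                          # Series
one_col_df = sample[['age']]                     # DataFrame with one column
selected = sample[['player_id', 'age', 'FT%']]   # DataFrame with three columns

print(type(one_col).__name__, type(one_col_df).__name__, selected.shape)
# Series DataFrame (2, 3)
```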

#### **Problem #4.3**
Identify players with a Free-Throw Percentage (`FT%`) greater than 90%.

In [None]:
nba_df[nba_df['FT%'] > 90].shape[0]

967
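
A row count tells you how many rows pass the filter, not who the players are. One way to list them is to combine the row filter with a column selection in `.loc`. The sketch below uses invented players and values:

```python
import pandas as pd

shooters = pd.DataFrame({
    'name_common': ['Player A', 'Player A', 'Player B', 'Player C'],
    'year_id': [2018, 2019, 2019, 2019],
    'FT%': [91.2, 93.0, 88.5, 90.0],
})

# Rows where FT% is strictly greater than 90, restricted to three columns.
high_ft = shooters.loc[shooters['FT%'] > 90, ['name_common', 'year_id', 'FT%']]

# The same player can qualify in several seasons, so de-duplicate the names.
names = high_ft['name_common'].unique().tolist()

print(high_ft)
print(names)  # ['Player A']
```

Note that a value of exactly 90.0 does not pass a `> 90` filter.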

<a name="p5"></a>

---
## **Part 5: Exploring the Mean, Median, and Sum**
---


#### **Problem #5.1**

What is the average age of the players in the dataset?

In [None]:
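# The question asks for the average, so mean() must be the last expression for the notebook to display it.
# For comparison, nba_df['age'].median() is 26.0.
nba_df['age'].mean()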
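
The mean and the median are both "typical" values, but they can differ when a few values are far from the rest. A sketch on invented ages with one much older player:

```python
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 39], name='age')

mean_age = ages.mean()      # sum of the values divided by how many there are
median_age = ages.median()  # middle value after sorting

print(mean_age, median_age)  # 25.0 22.0
```

The single 39-year-old pulls the mean up, while the median stays at the middle value.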

#### **Problem #5.2**

What is the median value for the Minutes Per Game (`MPG`) across all players?

In [None]:
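# The question asks for the median, so median() must be the last expression for the notebook to display it.
# For comparison, nba_df['MPG'].mean() is about 20.157.
nba_df['MPG'].median()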
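
If you want to see the mean and the median side by side, `describe()` reports both: the `50%` row is the median. The sketch below uses four `MPG` values from the preview; with an even number of values, the median is the average of the two middle ones.

```python
import pandas as pd

mpg = pd.Series([30.9, 12.4, 4.0, 13.8], name='MPG')

summary = mpg.describe()   # count, mean, std, min, quartiles, max
median_mpg = mpg.median()  # average of 12.4 and 13.8, roughly 13.1
mean_mpg = mpg.mean()      # roughly 15.275

print(summary)
print(median_mpg, mean_mpg)
```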

#### **Problem #5.3**

Calculate the sum of minutes played (`Min`) for all players in the dataset.

In [None]:
nba_df['Min'].sum()

22847833
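
`sum()` adds up every value in a column. Combined with `groupby`, the same call gives one total per group, which is a common next step in exploration. A sketch on four preview rows:

```python
import pandas as pd

sample = pd.DataFrame({
    'name_common': ['Trae Young', 'Kevin Huerter', 'Bob Weiss', 'Mike Riordan'],
    'team_id': ['ATL', 'ATL', 'WSB', 'WSB'],
    'Min': [2503, 2048, 768, 289],
})

total_min = sample['Min'].sum()                   # one total for the whole column
by_team = sample.groupby('team_id')['Min'].sum()  # one total per team

print(total_min)  # 5608
print(by_team)
```

The per-team totals add up to the overall total, which makes a handy sanity check.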

---

# End of Notebook

© 2023 The Coding School, All rights reserved