## Pandas for the NBA Lover
## Reading Data and Data Types

Thus far, we've focused on how to manipulate DataFrames in order to select or query data. All the data we have worked with has already been in the correct format, so we could use it without any cleaning or processing. This tutorial focuses on data that is not so tidy: in particular, data that lives on the internet as HTML tables on sites such as basketball-reference.com and ESPN.com. When we are done, we will be able to augment our existing dataset with data from the internet and use that data in our analysis.

In many ways, this is the most important and relevant tutorial section in the entire series, because it most closely mirrors the type of analysis you will be doing on your own.

---

#### The `read_html` function

It's not a stretch to say that `read_html` is the part of the pandas library that has saved me the most time when analyzing NBA data. This function makes scraping data from the web a breeze: it accepts a URL as input and parses every HTML table on that page into a list of DataFrames.
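To see what `read_html` hands back without hitting the network, here is a minimal sketch that parses a small, made-up HTML snippet instead of a URL. The players, teams, and numbers are placeholders rather than real data, and like any `read_html` call it assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; all names and numbers are placeholders.
html = """
<table>
  <thead><tr><th>Player</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>10.5</td></tr>
    <tr><td>Player B</td><td>7.0</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Team</th><th>W</th></tr></thead>
  <tbody>
    <tr><td>Team X</td><td>30</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
print(len(tables))  # 2

# Pick a table by its position in the list.
players = tables[0]
print(list(players.columns))  # ['Player', 'PTS']
```

The same indexing applies to a real page: the list holds the tables in the order they appear, so a page with several tables may take a little trial and error to find the right one.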

In [15]:
import pandas as pd

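# formatting option: show at most 10 rows when displaying a DataFrame
pd.set_option('display.max_rows', 10)

url = 'https://www.basketball-reference.com/leagues/NBA_2019_per_game.html'
# read_html returns a list with one DataFrame per HTML table on the page
data = pd.read_html(url)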

This particular URL contains only one table, so to see our data, all we need to do is grab the first DataFrame:

In [10]:
df = data[0]
df

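| | Rk | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PS/G |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Alex Abrines | SG | 25 | OKC | 29 | 2 | 19.8 | 1.9 | 5.3 | ... | .923 | 0.2 | 1.4 | 1.6 | 0.7 | 0.6 | 0.2 | 0.5 | 1.8 | 5.6 |
| 1 | 2 | Quincy Acy | PF | 28 | PHO | 10 | 0 | 12.3 | 0.4 | 1.8 | ... | .700 | 0.3 | 2.2 | 2.5 | 0.8 | 0.1 | 0.4 | 0.4 | 2.4 | 1.7 |
| 2 | 3 | Jaylen Adams | PG | 22 | ATL | 9 | 0 | 4.8 | 0.3 | 1.1 | ... | NaN | 0.0 | 0.3 | 0.3 | 0.7 | 0.3 | 0.1 | 0.1 | 0.4 | 0.9 |
| 3 | 4 | Steven Adams | C | 25 | OKC | 48 | 48 | 34.1 | 6.5 | 10.6 | ... | .556 | 4.7 | 5.3 | 10.0 | 1.8 | 1.5 | 0.8 | 1.6 | 2.6 | 15.4 |
| 4 | 5 | Bam Adebayo | C | 21 | MIA | 48 | 6 | 22.0 | 3.0 | 5.4 | ... | .735 | 2.1 | 4.6 | 6.7 | 2.0 | 0.8 | 0.8 | 1.4 | 2.5 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 555 | 489 | Thaddeus Young | PF | 30 | IND | 49 | 49 | 30.0 | 5.5 | 10.3 | ... | .600 | 2.3 | 3.9 | 6.2 | 1.8 | 1.5 | 0.5 | 1.2 | 2.4 | 12.7 |
| 556 | 490 | Trae Young | PG | 20 | ATL | 49 | 49 | 29.9 | 5.7 | 14.1 | ... | .807 | 0.7 | 2.5 | 3.2 | 7.3 | 0.9 | 0.2 | 4.0 | 1.7 | 16.4 |
| 557 | 491 | Cody Zeller | C | 26 | CHO | 35 | 35 | 24.5 | 3.7 | 6.6 | ... | .842 | 2.1 | 4.2 | 6.2 | 2.1 | 0.7 | 0.8 | 1.3 | 3.2 | 9.3 |
| 558 | 492 | Ante Zizic | C | 22 | CLE | 35 | 8 | 15.5 | 2.6 | 4.9 | ... | .726 | 1.5 | 3.2 | 4.8 | 0.7 | 0.2 | 0.4 | 0.8 | 1.7 | 6.7 |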


Wasn't that easy? But before we can fully begin to work with this data, we need to do two things. First, we need to remove any rows that hold no valid data: here, the header rows that are repeated inside the table body, which we can spot because their `Rk` cell contains the literal string `'Rk'`. Next, we need to make sure that our data is correctly typed, so that our calculations come out right.

In [20]:
# keep only real player rows; repeated header rows have 'Rk' in the Rk column
df = df[df['Rk'] != 'Rk']
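To make the filter concrete, here is a small sketch on a hand-built DataFrame. The rows are made up, and the `reset_index` call is an optional extra that the cell above does not use.

```python
import pandas as pd

# Made-up rows; the middle row mimics a header repeated inside the table body.
raw = pd.DataFrame(
    {
        'Rk': ['1', 'Rk', '2'],
        'Player': ['Player A', 'Player', 'Player B'],
        'PTS': ['10.5', 'PTS', '7.0'],
    }
)

# Boolean mask: True wherever the row is really a header.
is_header = raw['Rk'] == 'Rk'
print(is_header.sum())  # 1

# Keep the other rows; reset_index is optional and just renumbers them.
clean = raw[~is_header].reset_index(drop=True)
print(clean)
```

Filtering with `raw[~is_header]` is equivalent to the `!=` comparison used above. Building the mask separately just makes it easy to count how many rows are about to be dropped.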

### NaN values

Some cells have no data. For example, Clint Capela has yet to attempt a three-pointer this season, so there is no three-point percentage to report for him:

In [25]:
df[df['Player'] == 'Clint Capela']['3P%']

97    NaN
Name: 3P%, dtype: object
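The `dtype: object` in that output is a hint that the column is not yet numeric. As a rough sketch of how missing values and typing interact, the toy example below converts a string column with `pd.to_numeric` and then counts the NaN entries. The players and percentages are invented, and `dtype=object` is set explicitly so the toy column starts out like the scraped one.

```python
import numpy as np
import pandas as pd

# Invented data: percentages stored as strings, with one missing value.
toy = pd.DataFrame(
    {
        'Player': ['Player A', 'Player B', 'Player C'],
        '3P%': ['.350', np.nan, '.400'],
    },
    dtype=object,
)
print(toy['3P%'].dtype)  # object

# Convert the strings to floats; the missing entry stays NaN.
toy['3P%'] = pd.to_numeric(toy['3P%'])
print(toy['3P%'].dtype)  # float64

# Count the missing values.
n_missing = toy['3P%'].isna().sum()
print(n_missing)  # 1

# Aggregations such as mean() skip NaN by default.
avg = toy['3P%'].mean()
print(avg)
```

Because NaN is itself a float, a column with missing values converts cleanly to `float64`, and the mean here is taken over the two players who do have a percentage.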