# Tutorial 4: Pandas Basics I

## 4.1 Agenda
This tutorial covers the basic functions of the pandas package and the core Series/DataFrame methods, with practice exercises for each.

A pandas Series or DataFrame is designed for efficient handling of heterogeneous data, the category most financial data falls into. NumPy ndarrays, by contrast, are designed for homogeneous data.

## 4.2 Importing a CSV file into pandas

`DataFrame` is the pandas class for 2D tabular data. It consists of the data itself, an index (row labels) and column names. \
Both the index and the column names are customizable; by default each is a `pd.RangeIndex` (0, 1, 2, 3...). \
In the following example, a DataFrame is created by loading a table in CSV format.

In [None]:
#Download NBA data in csv format from a website. The csv file is now saved to the working directory.
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")

import pandas as pd
nba = pd.read_csv("nba_all_elo.csv")
type(nba) #A DataFrame object is now created by loading the csv file.

Download ready.


pandas.core.frame.DataFrame

In [3]:
print(nba.head()) #display top 5 rows of the loaded data

   gameorder       game_id lg_id  _iscopy  year_id  date_game  seasongame  \
0          1  194611010TRH   NBA        0     1947  11/1/1946           1   
1          1  194611010TRH   NBA        1     1947  11/1/1946           1   
2          2  194611020CHS   NBA        0     1947  11/2/1946           1   
3          2  194611020CHS   NBA        1     1947  11/2/1946           2   
4          3  194611020DTF   NBA        0     1947  11/2/1946           1   

   is_playoffs team_id  fran_id  ...  win_equiv  opp_id  opp_fran  opp_pts  \
0            0     TRH  Huskies  ...  40.294830     NYK    Knicks       68   
1            0     NYK   Knicks  ...  41.705170     TRH   Huskies       66   
2            0     CHS    Stags  ...  42.012257     NYK    Knicks       47   
3            0     NYK   Knicks  ...  40.692783     CHS     Stags       63   
4            0     DTF  Falcons  ...  38.864048     WSC  Capitols       50   

   opp_elo_i  opp_elo_n  game_location  game_result  forecast notes 

- The first row in the csv file is automatically loaded as column names. 
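
The header behaviour can be checked without downloading anything. The sketch below is a minimal illustration that assumes a tiny made-up CSV held in memory (`csv_text` is hypothetical, not part of the NBA file); it shows the default header handling and what changes with `header=None`.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "team,pts\nTRH,66\nNYK,68\n"

# Default: the first line becomes the column names, rows get a RangeIndex
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.columns.tolist())   # ['team', 'pts']
print(demo.index)              # RangeIndex(start=0, stop=2, step=1)

# header=None treats the first line as data; `names` supplies the column names
raw = pd.read_csv(io.StringIO(csv_text), header=None, names=["a", "b"])
print(raw.shape)               # (3, 2)
```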

Alternatively, a `DataFrame` can be created from a `dict` whose values are `list` or `pd.Series` objects. \
Each key becomes a column name and its value becomes the data of that column.

In [4]:
#Create a 1D pd.Series, where index can be customized
city_revenues = pd.Series([4200, 8000, 6500], index=["Amsterdam", "Toronto", "Tokyo"])
print(city_revenues)

Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64


In [5]:
#create a pd.Series from a dictionary
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})
print(city_employee_count)

Amsterdam    5
Tokyo        8
dtype: int64


In [6]:
#Create a DataFrame by combining the two Series objects
city_data = pd.DataFrame({"revenue": city_revenues, "employee_count": city_employee_count})
print(city_data)

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
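
The `NaN` and the `5.0` in the output above come from index alignment: when a DataFrame is built from Series, rows are matched by label, and a label missing from one Series is filled with `NaN`, which forces that column to a float dtype. The following sketch reproduces this with made-up values (the names `revenue`, `staff` and `from_lists` are only for illustration).

```python
import pandas as pd

# Dict of lists: keys become column names, rows get a default RangeIndex
from_lists = pd.DataFrame({"city": ["Amsterdam", "Toronto"], "revenue": [4200, 8000]})
print(from_lists.index.tolist())        # [0, 1]

# Dict of Series: rows are aligned on the index labels
revenue = pd.Series([4200, 8000], index=["Amsterdam", "Toronto"])
staff = pd.Series({"Amsterdam": 5})     # no entry for Toronto
combined = pd.DataFrame({"revenue": revenue, "staff": staff})

print(combined["staff"].isna().sum())   # 1 (Toronto has no staff value)
print(combined["revenue"].dtype)        # int64
print(combined["staff"].dtype)          # float64, because NaN is a float
```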


## 4.3 DataFrame Indexing and slicing
A DataFrame has two indexers, `.loc` and `.iloc`. Both are used with square brackets rather than called like methods. \
`.loc` selects by label: it takes index labels and column names. \
`.iloc` selects by integer position: it takes row and column positions (similar to indexing of ndarrays).

In [7]:
nba.loc[3, 'pts']

np.int64(47)

In [8]:
nba.iloc[2, 5]

'11/2/1946'
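
With the default `RangeIndex`, labels and positions coincide, which can hide the difference between the two indexers. The sketch below uses small made-up frames with non-default row labels to separate the two.

```python
import pandas as pd

scores = pd.DataFrame(
    {"pts": [66, 68, 63], "opp_pts": [68, 66, 47]},
    index=["a", "b", "c"],            # hypothetical row labels
)
print(scores.loc["b", "pts"])         # label-based: 68
print(scores.iloc[1, 0])              # position-based: 68

# Integer labels that are not in positional order
shuffled = pd.DataFrame({"pts": [66, 68, 63]}, index=[2, 0, 1])
print(shuffled.loc[0, "pts"])         # 68: the row labelled 0
print(shuffled.iloc[0, 0])            # 66: the first row by position
```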

- `DataFrame[column_name]` selects one column from the DataFrame as a `pd.Series`.\
Alternatively, `DataFrame.column_name` also works when the column name is a valid Python identifier, since the column is then exposed as an attribute of the DataFrame.

In [9]:
nba['team_id']

0         TRH
1         NYK
2         CHS
3         NYK
4         DTF
         ... 
126309    CLE
126310    GSW
126311    CLE
126312    CLE
126313    GSW
Name: team_id, Length: 126314, dtype: object

In [10]:
nba.team_id

0         TRH
1         NYK
2         CHS
3         NYK
4         DTF
         ... 
126309    CLE
126310    GSW
126311    CLE
126312    CLE
126313    GSW
Name: team_id, Length: 126314, dtype: object
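
Attribute access is convenient but has limits. As a small sketch with made-up column names, a column called `count` is shadowed by the `DataFrame.count` method, and a name containing a space cannot be written as an attribute at all; square brackets work in both cases.

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3], "win rate": [0.5, 0.6, 0.7]})

print(callable(df.count))        # True: df.count is the method, not the column
print(df["count"].tolist())      # [1, 2, 3]
print(df["win rate"].tolist())   # [0.5, 0.6, 0.7]; `df.win rate` is a syntax error
```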

- Multiple columns can be selected by passing a list of column names inside the square brackets, which returns a `pd.DataFrame`:

In [11]:
df2 = nba[['team_id', 'date_game', 'pts']]
print(df2)

       team_id  date_game  pts
0          TRH  11/1/1946   66
1          NYK  11/1/1946   68
2          CHS  11/2/1946   63
3          NYK  11/2/1946   47
4          DTF  11/2/1946   33
...        ...        ...  ...
126309     CLE  6/11/2015   82
126310     GSW  6/14/2015  104
126311     CLE  6/14/2015   91
126312     CLE  6/16/2015   97
126313     GSW  6/16/2015  105

[126314 rows x 3 columns]


- Getting a slice of 'pts' column from the first 20 rows using `.loc` and `.iloc` methods:

In [12]:
pts_head1 = nba.iloc[:20,10]
pts_head2 = nba.loc[:19,'pts']
print(pts_head1)
print(pts_head1.equals(pts_head2))

0     66
1     68
2     63
3     47
4     33
5     50
6     53
7     59
8     51
9     56
10    60
11    71
12    56
13    71
14    55
15    57
16    53
17    49
18    75
19    81
Name: pts, dtype: int64
True
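
The two slices above use different end points (`:20` and `:19`) because `.iloc` excludes the end position, like ordinary Python slicing, while `.loc` includes the end label. A minimal sketch on a made-up Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])    # default RangeIndex 0..4

by_label = s.loc[:2]       # labels 0, 1, 2: the end label is included
by_position = s.iloc[:3]   # positions 0, 1, 2: the end position is excluded
print(by_label.equals(by_position))       # True
print(len(s.loc[:2]), len(s.iloc[:2]))    # 3 2
```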


## 4.3 Filtering of DataFrames
Rows can be filtered by placing a boolean condition inside the square brackets; only rows where the condition is `True` are kept.
- In the following example, we choose all matches played by the Cleveland Cavaliers (`team_id : "CLE"`)

In [13]:
nba_data_CLE = nba[nba.team_id == 'CLE']
print(len(nba_data_CLE)) #There are 3810 rows

3810


- Conditions inside the square brackets can be joined by `&` (intersection) or `|` (union). Each condition should be enclosed in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `>`.
- In the following example, we choose all matches that the main team wins (`game_result : "W"`) while scoring above 110 points (`pts : >110`)

In [14]:
nba_victory_110 = nba[(nba.game_result == 'W') & (nba.pts > 110)]

In [15]:
print(nba_victory_110)
print(len(nba_victory_110)) 

        gameorder       game_id lg_id  _iscopy  year_id   date_game  \
781           391  194712060PRO   NBA        1     1948   12/6/1947   
1241          621  194811240LAL   NBA        0     1949  11/24/1948   
1448          725  194901040PRO   NBA        1     1949    1/4/1949   
1727          864  194902260BLB   NBA        1     1949   2/26/1949   
2047         1024  194911200LAL   NBA        0     1950  11/20/1949   
...           ...           ...   ...      ...      ...         ...   
126283      63142  201505170HOU   NBA        0     2015   5/17/2015   
126292      63147  201505230HOU   NBA        1     2015   5/23/2015   
126294      63148  201505240CLE   NBA        0     2015   5/24/2015   
126297      63149  201505250HOU   NBA        0     2015   5/25/2015   
126298      63150  201505260CLE   NBA        0     2015   5/26/2015   

        seasongame  is_playoffs team_id    fran_id  ...  win_equiv  opp_id  \
781             11            0     NYK     Knicks  ...  42.514290   
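
To see what the conditions evaluate to, it can help to inspect the boolean mask on a small frame. The sketch below uses four made-up rows (not taken from the NBA file) and also shows `|` and the negation operator `~`.

```python
import pandas as pd

games = pd.DataFrame({
    "team_id": ["CLE", "GSW", "CLE", "NYK"],
    "game_result": ["W", "L", "L", "W"],
    "pts": [112, 98, 120, 101],
})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (games.game_result == "W") & (games.pts > 110)
print(mask.tolist())              # [True, False, False, False]

# | keeps rows that satisfy at least one condition
either = games[(games.game_result == "W") | (games.pts > 110)]
print(len(either))                # 3

# ~ negates a condition
not_cle = games[~(games.team_id == "CLE")]
print(not_cle.team_id.tolist())   # ['GSW', 'NYK']
```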

## 4.4 Other useful DataFrame methods
- Evaluate the mean and standard deviation of `forecast` column, representing the forecasted chance of victory. 

In [16]:
nba_forecast_mean = nba['forecast'].mean()
print("Average forecasted win chance:", nba_forecast_mean)
nba_forecast_std = nba['forecast'].std()
print("Standard deviation of forecasted win chance:", nba_forecast_std)

Average forecasted win chance: 0.5000000000270357
Standard deviation of forecasted win chance: 0.21525223981658986
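
One detail worth knowing: `.std()` in pandas computes the sample standard deviation (`ddof=1`) by default, whereas `numpy.std` defaults to the population version (`ddof=0`). The sketch below checks this on three made-up values and shows `.agg()` as one way to request several statistics at once.

```python
import pandas as pd

forecast = pd.Series([0.25, 0.5, 0.75])    # made-up values

print(forecast.mean())                     # 0.5
print(forecast.std())                      # 0.25 (sample std, ddof=1)
print(forecast.std(ddof=0))                # population std, smaller than 0.25

# Several statistics in one call; the result is a Series indexed by statistic name
summary = forecast.agg(["mean", "std"])
print(summary)
```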


- `DataFrame.apply()` applies a function along an axis of the DataFrame. With `axis=1` the function receives one row at a time; with the default `axis=0` it receives one column at a time. \
Simple one-line functions can be written anonymously with the `lambda` keyword and passed directly to `.apply()`.
- Compute the difference between the main team's points and the opponent's points. If the difference is not positive, return `np.nan`.

In [None]:
import numpy as np
# Use np.nan: the np.NaN alias was removed in NumPy 2.0 and raises AttributeError there.
nba_pts_diff = nba.apply(lambda row: row.pts - row.opp_pts if row.pts > row.opp_pts else np.nan, axis=1)
print(nba_pts_diff)

- The same difference can be computed as a vectorized operation on whole columns, which avoids calling a Python function for every row. Afterwards, keep only the entries with a positive value:

In [None]:
nba_pts_diff2 = nba.pts - nba.opp_pts
print(nba_pts_diff2)

0         -2
1          2
2         16
3        -16
4        -17
          ..
126309   -21
126310    13
126311   -13
126312    -8
126313     8
Length: 126314, dtype: int64


In [None]:
nba_pts_diff3 = nba_pts_diff2[nba_pts_diff2 > 0]
print(nba_pts_diff3)

1          2
2         16
5         17
7          6
9          5
          ..
126304     2
126307     5
126308    21
126310    13
126313     8
Length: 63157, dtype: int64
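
Note that boolean filtering drops the non-positive rows (63157 rows remain out of 126314), whereas the `.apply()` version keeps every row and marks them as `NaN`. If the full-length result is wanted in vectorized form, `Series.where` is one option. A sketch on made-up values:

```python
import pandas as pd

pts = pd.Series([66, 68, 63, 47])
opp_pts = pd.Series([68, 66, 47, 63])

diff = pts - opp_pts
# where() keeps the original length; entries failing the condition become NaN
masked = diff.where(diff > 0)
print(masked.tolist())                        # [nan, 2.0, 16.0, nan]

# Dropping the NaN entries recovers the boolean-filter result
print(masked.dropna().astype(int).tolist())   # [2, 16]
```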


## 4.5 Exercises
1. From the `nba` DataFrame, evaluate the average forecasted win chance (`forecast`) for rows where `_iscopy` is 0 and `game_result` equals "W".

In [None]:
forecast_winning = nba.forecast[(nba._iscopy == 0) & (nba.game_result == "W")]
avg_forecast_winning = forecast_winning.mean()
print(avg_forecast_winning)

0.6715915002016739


2. From the `nba` DataFrame, find the row in which the main team scores the most points (maximum `pts`), then report the following fields from this row:\
`fran_id`, `opp_fran`, `pts`, `opp_pts`, `date_game`

In [None]:
row_with_max_pts = nba.pts.idxmax()
print(row_with_max_pts)

50094


In [None]:
nba.loc[50094, ["fran_id","opp_fran","pts","opp_pts","date_game"]]

fran_id         Pistons
opp_fran        Nuggets
pts                 186
opp_pts             184
date_game    12/13/1983
Name: 50094, dtype: object

In [None]:
nba.loc[nba.pts.idxmax(), ["fran_id","opp_fran","pts","opp_pts","date_game"]]

fran_id         Pistons
opp_fran        Nuggets
pts                 186
opp_pts             184
date_game    12/13/1983
Name: 50094, dtype: object
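
A caveat for this exercise: `idxmax()` returns an index label (which is why it is passed to `.loc`, not `.iloc`), and when several rows share the maximum it returns only the first one. The sketch below, on a made-up frame with a deliberate tie, shows how all tied rows could be listed with a boolean filter instead.

```python
import pandas as pd

# Toy frame with non-default labels and a tie for the maximum
games = pd.DataFrame(
    {"fran_id": ["Pistons", "Nuggets", "Knicks"], "pts": [186, 184, 186]},
    index=[10, 11, 12],
)

first_max = games.pts.idxmax()
print(first_max)                          # 10: label of the first maximum
print(games.loc[first_max, "fran_id"])    # Pistons

# All rows that reach the maximum
tied = games[games.pts == games.pts.max()]
print(tied.index.tolist())                # [10, 12]
```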