# Feature Engineering

This notebook details the feature engineering techniques used in this project and provides code samples for each technique. All methods discussed in this notebook are applied to the data in the `get_feature_engineered_df()` function found in `data/load_data.py`.

In [1]:
import pandas as pd

example_df = pd.read_csv("../data/season_averages/2019_stats.csv")

## 0. Convert all values from number-like strings to floats/integers

Before any feature engineering, note that the scraped values are all stored as strings, because they are taken from the HTML source and placed directly into the DataFrame. Feature engineering on the unformatted DataFrame would therefore be futile. The conversion is relatively simple with a `lambda` function and `DataFrame.apply()`. This step is necessary only when feature engineering happens before the DataFrame is exported to a CSV file: once the data has been written out and read back with `pd.read_csv`, pandas infers numeric types on load, so the conversion appears to happen automatically. An example DataFrame is used as a proof of concept.

In [2]:
data = {
    "Name":["Peter", "Paul", "Trisha", "Joseph"],
    "Age":["20", "21", "19", "18"],
    "Grade":[90.3, "91", "89.2", 95]
}

df = pd.DataFrame(data)

In [3]:
display(df, df.dtypes)

|   | Name   | Age | Grade |
|---|--------|-----|-------|
| 0 | Peter  | 20  | 90.3  |
| 1 | Paul   | 21  | 91.0  |
| 2 | Trisha | 19  | 89.2  |
| 3 | Joseph | 18  | 95.0  |


Name     object
Age      object
Grade    object
dtype: object
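
The remark about CSV export can be checked with a quick round trip. The sketch below rebuilds the example data, writes it to an in-memory buffer (an assumption made here to avoid touching the disk; the project itself works with real files), and reads it back. The type inference happens in `pd.read_csv`, not in `to_csv`.

```python
import io

import pandas as pd

# Same example data as above: numbers stored mostly as strings.
raw = pd.DataFrame({
    "Name": ["Peter", "Paul", "Trisha", "Joseph"],
    "Age": ["20", "21", "19", "18"],
    "Grade": [90.3, "91", "89.2", 95],
})

# Write to an in-memory CSV buffer, then read it back.
buffer = io.StringIO()
raw.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)

# read_csv infers a numeric dtype for every column that parses cleanly.
print(round_tripped.dtypes)
```

After the round trip, `Age` and `Grade` come back as numeric columns, while `Name` stays text.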

### 0.1 Implementation

The first thing to do is convert all values to numeric values using `pd.to_numeric`:

In [4]:
df_to_numeric = df.apply(lambda column: pd.to_numeric(column, errors="coerce"))
display(df_to_numeric, df_to_numeric.dtypes)

|   | Name | Age | Grade |
|---|------|-----|-------|
| 0 | NaN  | 20  | 90.3  |
| 1 | NaN  | 21  | 91.0  |
| 2 | NaN  | 19  | 89.2  |
| 3 | NaN  | 18  | 95.0  |


Name     float64
Age        int64
Grade    float64
dtype: object

This gets us part of the way there: both `Age` and `Grade` were converted to their expected data types (`int64` and `float64`, respectively), even though `Grade` held a mix of strings and numbers. However, as can be seen, the `Name` column was also converted to `float64`, with every value replaced by `NaN`. This is the effect of `errors="coerce"`, which is needed so that `pd.to_numeric` replaces strings it cannot parse with `NaN` instead of raising an error. Ideally, these values should be kept the same; there is no need to convert *non*-number-like strings to numeric values.
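
To make the role of `errors="coerce"` concrete, here is a small sketch on a hypothetical Series (`mixed` is not part of the project data). With the default `errors="raise"`, `pd.to_numeric` raises a `ValueError` on a string it cannot parse; with `errors="coerce"`, it replaces that value with `NaN`.

```python
import pandas as pd

mixed = pd.Series(["1", "2.5", "abc"])  # hypothetical values for illustration

# Default behaviour (errors="raise"): the unparseable string raises.
try:
    pd.to_numeric(mixed)
    raised = False
except ValueError:
    raised = True

# errors="coerce": unparseable strings become NaN instead of raising.
coerced = pd.to_numeric(mixed, errors="coerce")

print(raised)
print(coerced)
```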

We will use `fillna` to fill the `NaN` values with their original value in the following function:

In [5]:
def convert_col_types(column):
    """
    Convert the number-like strings in a season averages column to floats or integers.

    Values that are not number-like keep their original string value at the same index.

    :param column: An unformatted Series representing one column of an NBA season average statistics DataFrame.
    :return: A Series with all number-like values converted to integers or floats, and all other strings unchanged.
    """
    # Convert number-like strings to floats or integers; everything else becomes NaN
    to_return_col = pd.to_numeric(column, errors="coerce")
    # Fill the NaN values with the corresponding values from the original column
    to_return_col = to_return_col.fillna(column)

    return to_return_col

In [6]:
df_final = df.apply(lambda column: convert_col_types(column))
display(df_final, df_final.dtypes)

|   | Name   | Age | Grade |
|---|--------|-----|-------|
| 0 | Peter  | 20  | 90.3  |
| 1 | Paul   | 21  | 91.0  |
| 2 | Trisha | 19  | 89.2  |
| 3 | Joseph | 18  | 95.0  |


Name      object
Age        int64
Grade    float64
dtype: object

And, simple as that, the desired result is achieved: the `Age` and `Grade` columns are converted to `int64` and `float64`, respectively, while `Name` retains its original data type. Feature engineering may proceed.
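
One consequence worth knowing: the conversion works per value, not per column. In a column that mixes number-like strings with other strings, the number-like entries become numbers while the rest stay strings. The sketch below repeats the function so that it runs on its own; the `Minutes` column and its `"DNP"` placeholder are hypothetical.

```python
import pandas as pd


def convert_col_types(column):
    # Same logic as the function above, repeated so this sketch is self-contained.
    converted = pd.to_numeric(column, errors="coerce")
    return converted.fillna(column)


# Hypothetical column: two number-like strings and one placeholder.
minutes = pd.Series(["30.4", "12", "DNP"], name="Minutes")
result = convert_col_types(minutes)

print(result.tolist())
```

A column like this cannot be used directly in arithmetic, so any such placeholders would need separate handling before feature engineering.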

## 1. Adding ranks to MVP recipients

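### 1.1 Implementation

Adding ranks to the MVP vote recipients is as simple as using the pandas `Series.rank` method. Passing `ascending=False` gives the player with the most voting points rank 1, and `method="min"` gives tied players the same (lowest) rank: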

In [7]:
example_df["rank"] = example_df.points_won.rank(method="min", ascending=False)
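
To see what `method="min"` and `ascending=False` do, consider a toy vote tally (hypothetical numbers, not taken from the 2019 data). The highest total gets rank 1, tied totals share the lowest rank in their group, and the rank after a tie is skipped.

```python
import pandas as pd

points_won = pd.Series([962, 753, 753, 168, 0])  # hypothetical vote totals

ranks = points_won.rank(method="min", ascending=False)
print(ranks.tolist())  # [1.0, 2.0, 2.0, 4.0, 5.0]
```

Note that the player with zero points still receives a rank (5.0) at this stage.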

Another important and desired trait of this field is for players who did not receive votes to have `NaN` as their rank. This way, the MVP vote recipients can be easily accessed by filtering out the rows with `NaN` in the `rank` field.

In [8]:
example_df.loc[example_df.points_won == 0, "rank"] = float("nan")

In [9]:
example_df.loc[~example_df["rank"].isna()]

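|     | id | player | pos | age | team_id | g | gs | mp_per_g | fg_per_g | fga_per_g | ... | usg_pct | vorp | ws | ws_per_48 | leader_pts_per_g | leader_ast_per_g | leader_trb_per_g | leader_blk_per_g | leader_stl_per_g | rank |
|-----|----|--------|-----|-----|---------|---|----|----------|----------|-----------|-----|---------|------|----|-----------|------------------|------------------|------------------|------------------|------------------|------|
| 12 | antetgi01 | Giannis Antetokounmpo | PF | 25 | MIL | 63 | 63 | 30.4 | 10.9 | 19.7 | ... | 37.5 | 6.6 | 11.1 | 0.279 | 0 | 0 | 0 | 0 | 0 | 1.0 |
| 81 | butleji01 | Jimmy Butler | SF | 30 | MIA | 58 | 58 | 33.8 | 5.9 | 13.1 | ... | 25.1 | 3.7 | 9.0 | 0.221 | 0 | 0 | 0 | 0 | 0 | 11.0 |
| 122 | davisan02 | Anthony Davis | PF | 26 | LAL | 62 | 62 | 34.4 | 8.9 | 17.7 | ... | 29.3 | 5.4 | 11.1 | 0.25 | 0 | 0 | 0 | 0 | 0 | 6.0 |
| 133 | doncilu01 | Luka Dončić | PG | 20 | DAL | 61 | 61 | 33.6 | 9.5 | 20.6 | ... | 36.8 | 5.4 | 8.8 | 0.207 | 0 | 0 | 0 | 0 | 0 | 4.0 |
| 197 | hardeja01 | James Harden | SG | 30 | HOU | 68 | 68 | 36.5 | 9.9 | 22.3 | ... | 36.3 | 7.3 | 13.1 | 0.254 | 1 | 0 | 0 | 0 | 0 | 3.0 |
| 249 | jamesle01 | LeBron James | PG | 35 | LAL | 67 | 67 | 34.6 | 9.6 | 19.4 | ... | 31.5 | 6.1 | 9.8 | 0.204 | 0 | 1 | 0 | 0 | 0 | 2.0 |
| 260 | jokicni01 | Nikola Jokić | C | 24 | DEN | 73 | 73 | 32.0 | 7.7 | 14.7 | ... | 26.6 | 5.5 | 9.8 | 0.202 | 0 | 0 | 0 | 0 | 0 | 9.0 |
| 293 | leonaka01 | Kawhi Leonard | SF | 28 | LAC | 57 | 57 | 32.4 | 9.3 | 19.9 | ... | 33.0 | 5.1 | 8.7 | 0.226 | 0 | 0 | 0 | 0 | 0 | 5.0 |
| 296 | lillada01 | Damian Lillard | PG | 29 | POR | 66 | 66 | 37.5 | 9.5 | 20.4 | ... | 30.3 | 5.9 | 11.6 | 0.225 | 0 | 0 | 0 | 0 | 0 | 8.0 |
| 395 | paulch01 | Chris Paul | PG | 34 | OKC | 70 | 70 | 31.5 | 6.2 | 12.7 | ... | 23.3 | 3.5 | 8.9 | 0.193 | 0 | 0 | 0 | 0 | 0 | 7.0 |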
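
The two steps above (rank everyone, then blank out the zero-vote players) can also be folded into one expression. This is an alternative sketch, not necessarily what `get_feature_engineered_df()` does: it masks zero totals with `Series.where` before ranking and relies on `Series.rank` leaving `NaN` inputs as `NaN` (its default `na_option="keep"`). It assumes `points_won` is never negative, and the toy values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],   # hypothetical players
    "points_won": [962, 753, 753, 0, 0],   # hypothetical vote totals
})

# Two-step version, as in the cells above.
two_step = toy["points_won"].rank(method="min", ascending=False)
two_step[toy["points_won"] == 0] = float("nan")

# One-step version: mask zeros first, then rank what is left.
one_step = (
    toy["points_won"]
    .where(toy["points_won"] > 0)
    .rank(method="min", ascending=False)
)

toy["rank"] = one_step
recipients = toy[toy["rank"].notna()]  # only players who received votes
print(recipients)
```

Both versions agree here because, with `ascending=False`, zero-vote players always sit at the bottom of the ranking, so removing them does not change anyone else's rank.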
