# Pandas - Feature Extraction

Create a dataframe that simulates data.

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

Use the current list of 30 NBA teams below to create matchups so that every team is paired with every other team.

In [None]:
teams = ['Atlanta Hawks',
         'Boston Celtics',
         'Brooklyn Nets',
         'Charlotte Hornets',
         'Chicago Bulls',
         'Cleveland Cavaliers',
         'Dallas Mavericks',
         'Denver Nuggets',
         'Detroit Pistons',
         'Golden State Warriors',
         'Houston Rockets',
         'Indiana Pacers',
         'Los Angeles Clippers',
         'Los Angeles Lakers',
         'Memphis Grizzlies',
         'Miami Heat',
         'Milwaukee Bucks',
         'Minnesota Timberwolves',
         'New Orleans Pelicans',
         'New York Knicks',
         'Oklahoma City Thunder',
         'Orlando Magic',
         'Philadelphia 76ers',
         'Phoenix Suns',
         'Portland Trail Blazers',
         'Sacramento Kings',
         'San Antonio Spurs',
         'Toronto Raptors',
         'Utah Jazz',
         'Washington Wizards']

Use a nested for loop to build a list of matchups. Each matchup is a dictionary containing the team and its opponent, so every pairing appears twice, once from each team's perspective (30 × 29 = 870 rows).

```python
matchups = [{'team': 'Atlanta Hawks', 'opponent': 'Boston Celtics'}, {'team': 'Atlanta Hawks', 'opponent': 'Brooklyn Nets'}, ...]
```

NOTE: Make sure no team plays itself.

In [None]:
matchups = []
for team in teams:
    for opponent in teams:
        if team == opponent:
            continue
        game = {}
        game['team'] = team
        game['opponent'] = opponent
        matchups.append(game)
        
#         print('team',team)
#         print('opponent', opponent)
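
The nested loop above can also be sketched with `itertools.permutations`, which yields every ordered pair of distinct items and so skips self-matchups automatically. The three-team `sample_teams` list is a made-up stand-in for the full list, used only to keep the output small.

```python
from itertools import permutations

# hypothetical short list standing in for the 30 teams
sample_teams = ['Atlanta Hawks', 'Boston Celtics', 'Brooklyn Nets']

# permutations(..., 2) yields ordered pairs of two different teams
sample_matchups = [{'team': team, 'opponent': opponent}
                   for team, opponent in permutations(sample_teams, 2)]

print(len(sample_matchups))  # 3 teams * 2 opponents each = 6
```

Because the pairs are ordered, each pairing shows up twice, just as it does with the nested loop.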

In [None]:
team_score, opponent_score = np.random.choice(range(95,115), size=2, replace=False)
location = np.random.choice(['H', 'A'])

# print(team_score)
# print(opponent_score)

Now iterate through matchups and create these data points for each game:

1. The team's score
2. The opponent's score
3. Whether or not the game was home or away.

Use numpy to randomly generate these values.

Matchups will look like this when done:

```python
matchups = [
    {
        'opponent': 'Boston Celtics',
        'opponent_score': 93,
        'team': 'Atlanta Hawks',
        'team_score': 104,
        'location': 'H'
    },
    ...
]
```

In [None]:
for game in matchups:
    team_score, opponent_score = np.random.choice(range(95, 115), size=2, replace=False)
    location = np.random.choice(['H', 'A'])
    game['team_score'] = team_score
    game['opponent_score'] = opponent_score
    game['location'] = location
    # print(team_score)
    # print(opponent_score)
matchups[0]
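
Randomly generated values change on every run. A minimal sketch of one way to make them repeatable, assuming NumPy's seeded `default_rng` generator is acceptable in place of `np.random.choice`; the seed value and the `sample_game` dictionary are illustrative only.

```python
import numpy as np

# seeded generator, so rerunning the cell gives the same values
rng = np.random.default_rng(seed=42)

# hypothetical single matchup
sample_game = {'team': 'Atlanta Hawks', 'opponent': 'Boston Celtics'}

# draw two different scores from 95..114; replace=False rules out a tie
team_score, opponent_score = rng.choice(np.arange(95, 115), size=2, replace=False)
sample_game['team_score'] = int(team_score)
sample_game['opponent_score'] = int(opponent_score)
sample_game['location'] = str(rng.choice(['H', 'A']))

print(sample_game)
```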

Use the list of dictionaries to create a pandas dataframe.

In [None]:
df = pd.DataFrame(matchups)
df.head()

In [None]:
df.describe()

In [None]:
df['location'].value_counts()

In [None]:
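# numeric_only=True skips the string columns, which newer pandas versions refuse to correlate
df.corr(numeric_only=True)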

In [None]:
df.plot('opponent_score', 'team_score', kind='scatter')

In [None]:
# show null values

df.isnull().sum()

# Feature extraction in pandas

## Broadcasting

Broadcasting applies a mathematical operation to every element of a vector without having to write a for loop.

Use broadcasting to double each value in the `team_score` column:

In [None]:
df[['team_score']] * 2
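
To make the comparison with a for loop concrete, here is a small sketch on a made-up `scores` Series (not the notebook's data). Both approaches give the same values; broadcasting just says it in one expression.

```python
import pandas as pd

# hypothetical scores, for illustration only
scores = pd.Series([100, 95, 110])

# explicit loop: visit each element and double it
doubled_loop = pd.Series([score * 2 for score in scores])

# broadcasting: the scalar 2 is applied to every element at once
doubled_broadcast = scores * 2

print(doubled_broadcast.tolist())  # [200, 190, 220]
```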

In [1]:
# single brackets return a pandas Series instead of a dataframe
df['team_score'] * 2

Create new features using broadcasting.

Create a `win` column which will be `True` or `False`, depending on whether or not the team's score is higher than their opponent's.

In [None]:
# create column for win column
df['win'] = df['team_score'] > df['opponent_score']
df.head()

Machine learning models generally expect numbers rather than booleans, so change the `win` column's datatype to `int` (1s and 0s):

In [None]:
# further efficiency
# turn True / False (booleans) into integers

df['win'] = df['win'].astype(int)
df.head()


Use broadcasting to create a new column called `spread` (the point spread), which is the difference between the team's score and their opponent's.

e.g. if the team's score is 90 and their opponent's is 99, then the point spread is -9.

In [None]:
# spread = team score minus opponent score
# broadcasting also works across multiple vectors (columns)


df['spread'] = df['team_score'] - df['opponent_score']
df.head()
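
The two broadcast features are related: a positive spread means a win. The sketch below uses a made-up two-game `games` frame, with the first row taken from the 90 vs. 99 example above, and derives `win` from `spread` instead of from the raw scores.

```python
import pandas as pd

# hypothetical games; the first row is the 90 vs. 99 example
games = pd.DataFrame({'team_score': [90, 104],
                      'opponent_score': [99, 93]})

# column minus column: the subtraction is applied row by row
games['spread'] = games['team_score'] - games['opponent_score']

# a positive spread is a win; astype(int) turns True/False into 1/0
games['win'] = (games['spread'] > 0).astype(int)

print(games['spread'].tolist())  # [-9, 11]
print(games['win'].tolist())     # [0, 1]
```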




# Mapping

[Basketball Reference](http://www.basketball-reference.com/) is a site for NBA statistics. Each team has a unique slug that is used in its URLs.

For example, the Atlanta Hawks' URL is http://www.basketball-reference.com/teams/ATL/, which means their slug is ATL.

Below is a dictionary that **maps** each team to its slug (hence the name of this section). We'll use this dictionary to add a couple of columns to our data frame.

In [None]:
# dictionary containing team slugs

slug_dict = {'Atlanta Hawks':'ATL', 'Brooklyn Nets':'BRK', 'Boston Celtics':'BOS', 'Charlotte Hornets':'CHO', 'Chicago Bulls':'CHI', 'Cleveland Cavaliers':'CLE', 'Dallas Mavericks':'DAL', 'Denver Nuggets':'DEN', 'Detroit Pistons':'DET', 'Golden State Warriors':'GSW', 'Houston Rockets':'HOU', 'Indiana Pacers':'IND', 'Los Angeles Clippers':'LAC', 'Los Angeles Lakers':'LAL', 'Memphis Grizzlies':'MEM', 'Miami Heat':'MIA', 'Milwaukee Bucks':'MIL', 'Minnesota Timberwolves':'MIN', 'New Orleans Pelicans':'NOP', 'New York Knicks':'NYK', 'Oklahoma City Thunder':'OKC', 'Orlando Magic':'ORL', 'Philadelphia 76ers':'PHI', 'Phoenix Suns':'PHO', 'Portland Trail Blazers':'POR', 'Sacramento Kings':'SAC', 'San Antonio Spurs':'SAS', 'Toronto Raptors':'TOR', 'Utah Jazz':'UTA', 'Washington Wizards':'WAS'}
slug_dict

Use pandas' `map` method along with our dictionary to create a `'team_slug'` column:

In [None]:
# use map to create the team_slug column from a dictionary (slug_dict)
# each value in the column is looked up as a key in the dictionary
df['team_slug'] = df['team'].map(slug_dict)
df.head()
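
One behaviour of `map` worth knowing: a value with no matching key in the dictionary becomes `NaN` rather than raising an error. A minimal sketch with a deliberately incomplete, made-up dictionary (`sample_slugs`) and a team name that is not in it:

```python
import pandas as pd

# hypothetical, deliberately incomplete dictionary
sample_slugs = {'Atlanta Hawks': 'ATL', 'Boston Celtics': 'BOS'}

# the last name has no entry in sample_slugs
names = pd.Series(['Atlanta Hawks', 'Boston Celtics', 'Seattle SuperSonics'])

mapped = names.map(sample_slugs)

# count the values that could not be mapped
print(mapped.isnull().sum())  # 1
```

Checking `isnull().sum()` after a `map` is a quick way to catch misspelled or missing keys.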

## Mapping practice

Using `slug_dict`, create a new column for the opponent's slug:

In [None]:
# reverse keys and values

{v:k for k, v in slug_dict.items()}
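
A reversed dictionary like the one above can be passed to `map` to go from slug back to team name. The sketch below assumes a two-entry stand-in dictionary (`sample_slugs`) rather than the full `slug_dict`.

```python
import pandas as pd

# hypothetical two-entry stand-in for slug_dict
sample_slugs = {'Atlanta Hawks': 'ATL', 'Boston Celtics': 'BOS'}

# swap keys and values: slug -> team name
names_by_slug = {slug: name for name, slug in sample_slugs.items()}

slugs = pd.Series(['BOS', 'ATL'])
print(slugs.map(names_by_slug).tolist())  # ['Boston Celtics', 'Atlanta Hawks']
```

Reversing only works cleanly when the values are unique, as team slugs are.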

In [None]:
# create opponent SLUG from slug_dict

df['opponent_slug'] = df['opponent'].map(slug_dict)
df.head()



In [None]:
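# re-arrange columns by selecting them in the desired order;
# list every column to keep, since unlisted columns are dropped

df = df[['team', 'team_slug', 'team_score', 'opponent', 'opponent_slug',
         'opponent_score', 'location', 'win', 'spread']]
df.head()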


# Apply
Use functions to transform our data, in two steps:

1. Create the function that transforms the data
2. Use the `apply` method to run this function across the data frame.

Convert the slug columns into the full URL for each team/opponent.

Create a function that accepts a slug and returns the full Basketball Reference URL for that slug:

In [None]:
# when a column has only a few distinct values (gender, for example: m=1, f=2),
# .map can encode them; here it would be done on location

In [None]:
# use map on binary values
# turn H (home) and A (away) into 0,1 integers.
df['encoded_location'] = df['location'].map({'A': 0, 'H': 1})
df.head()

In [None]:
# when a transformation needs more than a dictionary lookup,
# write a function and run it with the .apply method

# http://www.basketball-reference.com/teams/ATL/

df.head()

In [None]:
# first, call the function once on a single slug

def create_url_from_slug(slug):
    return 'http://www.basketball-reference.com/teams/{}/'.format(slug)

create_url_from_slug('ATL')

In [None]:
# use function on every team using .apply

df['team_slug'].apply(create_url_from_slug)

Use this function to create a `team_url` column that holds the full URL for each `team_slug`:

In [None]:
# store the result in a new column
# .apply loops over each cell in the column and calls the function on it
# pass the function itself, without parentheses or arguments

df['team_url'] = df['team_slug'].apply(create_url_from_slug)
df.head()

Now do the same for `opponent_slug`:

In [None]:
# create opponent url column

df['opponent_url'] = df['opponent_slug'].apply(create_url_from_slug)
df.head()
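
For a transformation as simple as wrapping a string, `apply` is not the only option. The sketch below builds the same URLs two ways on a made-up `slugs` Series: once with `apply` and a lambda, once with plain string concatenation broadcast across the Series. `base_url` is a name introduced here for illustration.

```python
import pandas as pd

base_url = 'http://www.basketball-reference.com/teams/'

# hypothetical slugs, for illustration only
slugs = pd.Series(['ATL', 'BOS'])

# apply: the lambda runs once per element
urls_apply = slugs.apply(lambda slug: '{}{}/'.format(base_url, slug))

# broadcasting: string concatenation works element-wise on a Series of strings
urls_concat = base_url + slugs + '/'

print(urls_concat.tolist())
```

Both give the same strings; `apply` becomes the better fit once the logic needs conditionals or more than one step.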

### Cleanup

Because the columns now hold the full URL and not just the slug, it makes sense to give them names that reflect this. The cells above already store the URLs in new `team_url` and `opponent_url` columns.

In [None]:
# .apply always takes a function (a named function or a lambda)
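
If the slug columns had instead been overwritten in place with the URLs, `DataFrame.rename` could bring the names in line with the contents. A minimal sketch on a made-up one-row `urls` frame, assuming that overwrite scenario:

```python
import pandas as pd

# hypothetical frame where the slug columns were overwritten with full URLs
urls = pd.DataFrame({
    'team_slug': ['http://www.basketball-reference.com/teams/ATL/'],
    'opponent_slug': ['http://www.basketball-reference.com/teams/BOS/'],
})

# rename takes an {old_name: new_name} dictionary and returns a new dataframe
urls = urls.rename(columns={'team_slug': 'team_url',
                            'opponent_slug': 'opponent_url'})

print(list(urls.columns))  # ['team_url', 'opponent_url']
```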

Not every win in the NBA is the same (home vs. away). Some basketball statistics, such as the [Rating Percentage Index](https://en.wikipedia.org/wiki/Rating_Percentage_Index#Basketball_formula), account for this by reducing the value of a home win and increasing the value of an away win.

Create a new column called `'adjusted_win'`, which will be 0.6 wins if the team won at home and 1.4 wins if it won on the road.

Use pandas' `apply` method to create this new column.

First, create a function that accepts an individual row as a parameter.
- If the game was at home, multiply the win column by 0.6
- If the game was played on the road, multiply the win column by 1.4
- NOTE: If the win column is zero, then the result will be zero

In [None]:
# each row must contain the 'location' and 'win' columns
# axis=1 passes one row at a time to the function
# (the default, axis=0, would pass one column at a time)

def adjusted_win(row):
    if row['location'] == 'H':
        return row['win'] * 0.6
    else:
        return row['win'] * 1.4
    
df['adjusted_win'] = df.apply(adjusted_win, axis=1)
df.head()

Use the `apply` method along with our function to create the `adjusted_win` column.

Note: `apply` needs `axis=1` when the function works on multiple columns of each row.
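
A row-wise `apply` is easy to read but calls the function once per row. As an alternative sketch, not the approach used in this notebook, the same weights can be chosen for the whole column at once with `np.where`. The four-row `games` frame is made up to cover each home/away and win/loss combination.

```python
import numpy as np
import pandas as pd

# hypothetical games covering every location/win combination
games = pd.DataFrame({'location': ['H', 'A', 'H', 'A'],
                      'win': [1, 1, 0, 0]})

# pick 0.6 where the game was at home, 1.4 otherwise
weights = np.where(games['location'] == 'H', 0.6, 1.4)

# a loss (win == 0) stays 0 whatever the weight
games['adjusted_win'] = games['win'] * weights

print(games['adjusted_win'].tolist())  # [0.6, 1.4, 0.0, 0.0]
```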

# Dummies (AKA One Hot Encoding)

To incorporate the game's location in our machine learning model, we need numerical values, but the column holds strings.

Pandas has a function for converting categorical data into numerical data. Use `get_dummies` to create numerical columns from the `location` column:

In [None]:
# if we have a categorical column that we want to use as a feature,
# encoding it as 1, 2, 3 would imply an order (1 < 2 < 3) that does not exist
# use dummies instead: one 0-or-1 column per category
# pandas can do this with get_dummies


# a single column at a time
pd.get_dummies(df['location'])
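
Depending on the pandas version, `get_dummies` may return boolean columns rather than 0/1 integers. A small sketch, assuming integer output is wanted, that passes `dtype=int` explicitly; the three-value `locations` Series is made up for illustration.

```python
import pandas as pd

# hypothetical location values
locations = pd.Series(['H', 'A', 'H'], name='location')

# one column per category; dtype=int forces 0/1 instead of True/False
dummies = pd.get_dummies(locations, dtype=int)

print(list(dummies.columns))  # ['A', 'H']
print(dummies['H'].tolist())  # [1, 0, 1]
```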

In [None]:
# add multiple columns to dataframe
# replaces location

new_df = pd.get_dummies(df, columns=['location'])
new_df.head()

Create dummy columns from the `team` and `opponent` columns:

In [None]:
# first method

new_df = pd.get_dummies(df, columns=['team', 'opponent'])
new_df.head()

In [None]:
# dummy only the columns we care about; the result can be concatenated to the original dataframe

new_df = pd.get_dummies(df[['team', 'opponent']])
new_df.head()
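
The concatenation step mentioned in the comment above is not shown in the notebook. One way it might look, sketched on a made-up two-row `games` frame, is `pd.concat` with `axis=1`, which places the dummy columns next to the original ones.

```python
import pandas as pd

# hypothetical two-game frame
games = pd.DataFrame({'team': ['Atlanta Hawks', 'Boston Celtics'],
                      'opponent': ['Boston Celtics', 'Atlanta Hawks'],
                      'team_score': [104, 93]})

# dummy only the two categorical columns; names get a 'team_' / 'opponent_' prefix
dummies = pd.get_dummies(games[['team', 'opponent']], dtype=int)

# axis=1 joins side by side, matching rows on the index
combined = pd.concat([games, dummies], axis=1)

print(combined.shape)  # (2, 7)
```

Unlike `pd.get_dummies(df, columns=[...])`, this keeps the original `team` and `opponent` columns alongside their dummies.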

In [None]:
df[df.columns[:10]]