# Side comment: Programming style

- A programming language really is like a language: there are many ways to say the same thing

- Always think about good programming style

- There is often a better way to do the same thing

- Better how? E.g., easier to read, more concise, more efficient (computationally), etc.

- Practice makes you better

For example, there are guides and articles such as these:
- http://docs.python-guide.org/en/latest/writing/style/#short-ways-to-manipulate-lists
- https://google.github.io/styleguide/pyguide.html?showone=List_Comprehensions#List_Comprehensions
- https://google.github.io/styleguide/pyguide.html?showone=Naming#Naming
- https://www.python.org/dev/peps/pep-0008/
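
As a small illustration of "a better way to do the same thing", here is a minimal sketch (the variable names are made up for this example) that builds the same list twice: once with an explicit loop and once with a list comprehension, the idiom the guides above recommend for simple cases.

```python
# Two ways to square the even numbers in a list.
numbers = [1, 2, 3, 4, 5, 6]

# Explicit loop: correct, but takes four lines to say one thing.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n ** 2)

# List comprehension: same result, stated in a single expression.
squares_comp = [n ** 2 for n in numbers if n % 2 == 0]

print(squares_loop == squares_comp)
```

Which version is "better" depends on the reader and the context; for long or nested conditions the explicit loop is often easier to read.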

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


The easter egg above prints the Zen of Python (PEP 20): https://www.python.org/dev/peps/pep-0020/.

- https://www.quora.com/What-do-different-aphorisms-in-The-Zen-of-Python-mean
- 20th aphorism?: https://www.reddit.com/r/Python/comments/3cjhlo/this_disobeys_the_zen_of_python/

# Pandas

- The pandas package provides data frames for Python, much like the data frames available in R

- Extensive set of functions ([Chapter 3 in PDSH](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html) using Colab)

- [Pandas official website](https://pandas.pydata.org)

- [Latest stable release documentation](http://pandas.pydata.org/pandas-docs/stable/api.html).

- Use the documentation version that matches your installed pandas version

In [2]:
import pandas as pd

pd.__version__

'0.24.2'

## Practice with NBA data

In [3]:
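def get_nba_data(endpt, params, return_url=False):
    """Download one stats.nba.com result set as a DataFrame (IPython only).

    endpt: endpoint name; see
        https://github.com/seemethere/nba_py/wiki/stats.nba.com-Endpoint-Documentation
    params: dictionary of query parameters, e.g., {'LeagueID':'00'}
    return_url: if True, return the (shell-quoted) URL without downloading
    """
    from pandas import DataFrame
    from urllib.parse import urlencode
    import json

    # both strings carry literal double quotes so that they survive the shell call below
    useragent = "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9\""
    dataurl = "\"" + "http://stats.nba.com/stats/" + endpt + "?" + urlencode(params) + "\""

    # for debugging: just return the url
    if return_url:
        return dataurl

    # `!` is IPython shell syntax; this line is not valid in a plain Python script
    jsonstr = !wget -q -O - --user-agent={useragent} {dataurl}

    data = json.loads(jsonstr[0])

    # the first result set holds the column headers and the data rows
    h = data['resultSets'][0]['headers']
    d = data['resultSets'][0]['rowSet']

    return DataFrame(d, columns=h)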
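
The function above relies on IPython's `!` shell escape and on `wget`, so it only runs inside a notebook. The following is a rough sketch of the same URL construction using only the standard library. The names `build_url` and `fetch_json` are hypothetical helpers introduced here, not part of the original notebook, and the download step is defined but not called because the server may reject or throttle requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://stats.nba.com/stats/"  # same base URL as in get_nba_data


def build_url(endpt, params):
    """Hypothetical helper: join endpoint and URL-encoded query parameters."""
    return BASE_URL + endpt + "?" + urlencode(params)


def fetch_json(url, useragent="Mozilla/5.0", timeout=10):
    """Hypothetical helper: download and parse JSON (not called in this sketch)."""
    request = Request(url, headers={"User-Agent": useragent})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_url("commonTeamYears", {"LeagueID": "00"})
print(url)
```

Because no shell is involved, the URL does not need the extra literal double quotes that `get_nba_data` adds.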

In [4]:
## get all teams
params = {'LeagueID':'00'}
# teams = get_nba_data('commonTeamYears', params)  # if NBA does not cooperate
teams = pd.read_pickle('data/commonTeamYears.pkl').dropna()

## get all players
params = {'LeagueID':'00', 'Season': '2017-18', 'IsOnlyCurrentSeason': '0'}
# players = get_nba_data('commonallplayers', params) # if NBA does not cooperate
players = pd.read_pickle('data/commonallplayers.pkl').dropna()

## Pandas Series 

The `Series` section of the API reference is here: http://pandas.pydata.org/pandas-docs/stable/api.html#series. The methods listed there are called by placing a dot after a `Series` object.

### Data frames are made of Series
Selecting different parts of a pandas data frame returns different object types:

In [5]:
print("data frame object   :", type(teams))
print("data multirow object:", type(teams.iloc[0:3]))
print("data row object     :", type(teams.iloc[0]))
print("data column object  :", type(teams.ABBREVIATION))

data frame object   : <class 'pandas.core.frame.DataFrame'>
data multirow object: <class 'pandas.core.frame.DataFrame'>
data row object     : <class 'pandas.core.series.Series'>
data column object  : <class 'pandas.core.series.Series'>


- A single row or a single column of a pandas data frame is a `Series` object

- In R, a row would be a smaller data frame

- Methods for `Series` and `DataFrame` are different

- There are categories of functions that are applicable to certain object types:

- Pandas general functions: http://pandas.pydata.org/pandas-docs/stable/api.html#general-functions   
    e.g., [`pandas.melt()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html#pandas-melt) takes a `DataFrame` as input.

- Series methods: http://pandas.pydata.org/pandas-docs/stable/api.html#series

- DataFrame methods: http://pandas.pydata.org/pandas-docs/stable/api.html#dataframe
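
To make the distinction between a general function and a method concrete, here is a minimal sketch using `pandas.melt()` on a made-up table (the column names and values are invented for illustration and are not NBA data). The function takes a `DataFrame` and returns a `DataFrame`, while each column of the result is a `Series`.

```python
import pandas as pd

# Made-up wide table: one row per team, one column per measurement.
wide = pd.DataFrame({
    "team": ["AAA", "BBB"],
    "x": [1, 2],
    "y": [3, 4],
})

# General function: called as pd.melt(...), with the data frame as input.
long = pd.melt(wide, id_vars="team", var_name="variable", value_name="value")

print(long)
print(type(long))            # the result is a DataFrame
print(type(long["value"]))   # a single column is a Series
```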

### `Series` 

- [http://pandas.pydata.org/pandas-docs/stable/api.html#series](http://pandas.pydata.org/pandas-docs/stable/api.html#series)

In [6]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [7]:
pd.Series([1, 2, 3], index=['a', 'b', 'c'])

a    1
b    2
c    3
dtype: int64

In [8]:
teams.head()

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION
0,0,1610612737,1949,2018,ATL
1,0,1610612738,1946,2018,BOS
2,0,1610612739,1970,2018,CLE
3,0,1610612740,2002,2018,NOP
4,0,1610612741,1966,2018,CHI


In [9]:
abbr = teams.ABBREVIATION.copy()

#### `Series` to `list`

- `Series` can be converted to `list`

In [10]:
abbr.head().to_list()

['ATL', 'BOS', 'CLE', 'NOP', 'CHI']

- `list` can be converted to `Series`

In [11]:
tmp = abbr.head().to_list()
pd.Series(tmp)

0    ATL
1    BOS
2    CLE
3    NOP
4    CHI
dtype: object

#### `Series` to `dict`

- `Series` can be converted to `dict`

In [12]:
abbr.head().to_dict()

{0: 'ATL', 1: 'BOS', 2: 'CLE', 3: 'NOP', 4: 'CHI'}

- `dict` can be converted to `Series`

In [13]:
tmp = abbr.head().to_dict()
pd.Series(tmp)

0    ATL
1    BOS
2    CLE
3    NOP
4    CHI
dtype: object
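
One difference between the two round trips is easy to miss when the index is simply 0, 1, 2, ...: a `list` drops the index, while a `dict` keeps it as keys. A minimal sketch with a made-up index:

```python
import pandas as pd

# Made-up Series with a non-default index.
s = pd.Series(["ATL", "BOS", "CLE"], index=[10, 20, 30])

from_list = pd.Series(s.to_list())   # index is lost and reset to 0, 1, 2
from_dict = pd.Series(s.to_dict())   # dict keys become the index again

print(from_list.index.tolist())
print(from_dict.index.tolist())
```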

#### Other `Series` methods

- More `Series` methods: [Chapter 3 in PDSH](https://jakevdp.github.io/PythonDataScienceHandbook/03.00-introduction-to-pandas.html). 

- `Series` documentation is here: http://pandas.pydata.org/pandas-docs/stable/api.html#series

- Reading documentation is a critical skill (hint for the midterm)

In [14]:
abbr.unique()

array(['ATL', 'BOS', 'CLE', 'NOP', 'CHI', 'DAL', 'DEN', 'GSW', 'HOU',
       'LAC', 'LAL', 'MIA', 'MIN', 'BKN', 'NYK', 'ORL', 'IND', 'PHI',
       'PHX', 'POR', 'SAC', 'SAS', 'OKC', 'MIL', 'UTA', 'MEM', 'WAS',
       'DET', 'CHA', 'TOR'], dtype=object)

- Example: the [`str`](http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling) accessor allows string functions to be applied to each value
- Search for patterns: e.g., search for team abbreviations that end with `S`:

In [15]:
abbr.head().str.contains('S$') # $ marks end of string

0    False
1     True
2    False
3    False
4    False
Name: ABBREVIATION, dtype: bool

__Exercise__: how would you use this to pick out team names that end with S? Can you use the resulting boolean `Series`?

In [16]:
# indx = ...
# abbr...

__Exercise__: what does the `dir()` function do?

In [17]:
# dir(abbr)

## Data Frames


- The following two ways of accessing a column are equivalent

In [18]:
temp = teams.copy()

print(temp['MIN_YEAR'].head()) # can call columns whose name is in string variable
print(temp.MIN_YEAR.head())    # easier to read

0    1949
1    1946
2    1970
3    2002
4    1966
Name: MIN_YEAR, dtype: object
0    1949
1    1946
2    1970
3    2002
4    1966
Name: MIN_YEAR, dtype: object


In [19]:
all(temp['MIN_YEAR'] == temp.MIN_YEAR) # checking all elements are equal

True

### Creating columns

- Dot notation cannot be used to create a new column

In [20]:
temp['new_column_1'] = temp.MAX_YEAR
temp.new_column_2 = temp.MAX_YEAR    # does not create a column; only sets an attribute on temp
temp.head()

  


Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,new_column_1
0,0,1610612737,1949,2018,ATL,2018
1,0,1610612738,1946,2018,BOS,2018
2,0,1610612739,1970,2018,CLE,2018
3,0,1610612740,2002,2018,NOP,2018
4,0,1610612741,1966,2018,CHI,2018


- An existing column can be set with dot notation

In [21]:
temp.LEAGUE_ID = 'ZZ'
temp.head()

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,new_column_1
0,ZZ,1610612737,1949,2018,ATL,2018
1,ZZ,1610612738,1946,2018,BOS,2018
2,ZZ,1610612739,1970,2018,CLE,2018
3,ZZ,1610612740,2002,2018,NOP,2018
4,ZZ,1610612741,1966,2018,CHI,2018
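
The two behaviours can be checked on a tiny made-up data frame. This is a sketch, not the notebook's data: bracket notation creates a column, dot notation on a new name only sets a Python attribute (pandas issues a warning, silenced here), and dot notation on an existing column name updates that column.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df["b"] = df["a"] * 2                  # bracket notation creates a real column

with warnings.catch_warnings():
    warnings.simplefilter("ignore")    # pandas warns about this pattern
    df.c = df["a"] * 3                 # only sets an attribute; no column "c"

print(list(df.columns))
print("c" in df.columns)

df.a = 0                               # "a" already exists, so the column is updated
print(df["a"].tolist())
```

Using bracket notation for every assignment avoids having to remember which case applies.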


### Data Frame, Series, dtype

In R, data frame columns have their own data types: e.g., `factor`, `integer`, `numeric`, etc. In pandas, data frame columns are *all* `Series`, each with its own dtype. When column types are not specified and pandas cannot infer a more specific type, a column gets dtype `object`, as with the columns here:

In [22]:
print(teams.ABBREVIATION.dtype)

object


In [23]:
teams.ABBREVIATION = teams.ABBREVIATION.astype('category')
teams.TEAM_ID      = teams.TEAM_ID.astype('category')
teams.MIN_YEAR     = teams.MIN_YEAR.astype('int')
teams.MAX_YEAR     = teams.MAX_YEAR.astype('int')
teams.head().ABBREVIATION

0    ATL
1    BOS
2    CLE
3    NOP
4    CHI
Name: ABBREVIATION, dtype: category
Categories (30, object): [ATL, BKN, BOS, CHA, ..., SAS, TOR, UTA, WAS]

Note that `object` is a catch-all dtype: a row that mixes values of different types is also reported as dtype `object`

In [24]:
print("type:", type(teams.iloc[0]))
print("object:", teams.iloc[0])

type: <class 'pandas.core.series.Series'>
object: LEAGUE_ID               00
TEAM_ID         1610612737
MIN_YEAR              1949
MAX_YEAR              2018
ABBREVIATION           ATL
Name: 0, dtype: object
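
The same conversions can be tried on a small made-up data frame. This sketch assumes the years arrive as strings, as they appear to in the NBA data; the column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "abbr": ["ATL", "BOS", "CLE"],
    "min_year": ["1949", "1946", "1970"],   # years stored as strings
})

# Convert each column to a more specific dtype.
df["abbr"] = df["abbr"].astype("category")
df["min_year"] = df["min_year"].astype("int")

print(df.dtypes)

# A row mixes a category value and an integer, so its dtype falls back to object.
row = df.iloc[0]
print(row.dtype)
```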


### Condition based slicing

Subset just the current teams

In [25]:
teams

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION
0,0,1610612737,1949,2018,ATL
1,0,1610612738,1946,2018,BOS
2,0,1610612739,1970,2018,CLE
3,0,1610612740,2002,2018,NOP
4,0,1610612741,1966,2018,CHI
5,0,1610612742,1980,2018,DAL
6,0,1610612743,1976,2018,DEN
7,0,1610612744,1946,2018,GSW
8,0,1610612745,1967,2018,HOU
9,0,1610612746,1970,2018,LAC


In [26]:
teams['TEAM_AGE'] = teams.MAX_YEAR - teams.MIN_YEAR
teams.loc[teams.TEAM_AGE >= 50,'AGE_GROUP'] = 'OLD'
teams.loc[teams.TEAM_AGE <  50,'AGE_GROUP'] = 'YOUNG'

teams_clean = teams.copy() ## make a copy for later
teams

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,TEAM_AGE,AGE_GROUP
0,0,1610612737,1949,2018,ATL,69,OLD
1,0,1610612738,1946,2018,BOS,72,OLD
2,0,1610612739,1970,2018,CLE,48,YOUNG
3,0,1610612740,2002,2018,NOP,16,YOUNG
4,0,1610612741,1966,2018,CHI,52,OLD
5,0,1610612742,1980,2018,DAL,38,YOUNG
6,0,1610612743,1976,2018,DEN,42,YOUNG
7,0,1610612744,1946,2018,GSW,72,OLD
8,0,1610612745,1967,2018,HOU,51,OLD
9,0,1610612746,1970,2018,LAC,48,YOUNG
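
The `.loc[condition, column] = value` pattern used above can be sketched on made-up data as follows. The `numpy.where` line is an alternative that is not used in the notebook; it is shown only for comparison.

```python
import numpy as np
import pandas as pd

# Made-up ages, chosen to include the boundary value 50.
df = pd.DataFrame({"team": ["A", "B", "C"], "age": [72, 48, 50]})

# Row selector is a boolean Series; column selector is a (new) column label.
df.loc[df["age"] >= 50, "group"] = "OLD"
df.loc[df["age"] < 50, "group"] = "YOUNG"

# Alternative: one vectorized if/else over the whole column.
df["group_np"] = np.where(df["age"] >= 50, "OLD", "YOUNG")

print(df)
```

With `.loc`, any row matched by neither condition would be left as a missing value, which can be a useful signal that a case was overlooked.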


Subset just the players in current teams:

In [27]:
players.head(2)

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,FROM_YEAR,TO_YEAR,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CODE,GAMES_PLAYED_FLAG
0,76001,"Abdelnaby, Alaa",Alaa Abdelnaby,0,1990,1994,HISTADD_alaa_abdelnaby,0,,,,,Y
1,76002,"Abdul-Aziz, Zaid",Zaid Abdul-Aziz,0,1968,1977,HISTADD_zaid_abdul-aziz,0,,,,,Y


In [28]:
teams.head(2)

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,TEAM_AGE,AGE_GROUP
0,0,1610612737,1949,2018,ATL,69,OLD
1,0,1610612738,1946,2018,BOS,72,OLD


In [29]:
players = players[players.TEAM_ID.isin(teams.TEAM_ID)]
players.tail()

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,FROM_YEAR,TO_YEAR,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CODE,GAMES_PLAYED_FLAG
4373,203469,"Zeller, Cody",Cody Zeller,1,2013,2018,cody_zeller,1610612766,Charlotte,Hornets,CHA,hornets,Y
4378,203092,"Zeller, Tyler",Tyler Zeller,1,2012,2018,tyler_zeller,1610612749,Milwaukee,Bucks,MIL,bucks,Y
4385,1627835,"Zipser, Paul",Paul Zipser,1,2016,2017,paul_zipser,1610612741,Chicago,Bulls,CHI,bulls,Y
4386,1627790,"Zizic, Ante",Ante Zizic,1,2017,2018,ante_zizic,1610612739,Cleveland,Cavaliers,CLE,cavaliers,Y
4389,1627826,"Zubac, Ivica",Ivica Zubac,1,2016,2018,ivica_zubac,1610612747,Los Angeles,Lakers,LAL,lakers,Y
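
The filter above works in two steps: `isin()` builds a boolean `Series`, and indexing with that `Series` keeps only the rows marked `True`. A minimal sketch on made-up tables (the `_demo` names are invented here):

```python
import pandas as pd

teams_demo = pd.DataFrame({"TEAM_ID": [1, 2]})
players_demo = pd.DataFrame({
    "name": ["p1", "p2", "p3"],
    "TEAM_ID": [1, 0, 2],       # 0 stands for a player without a current team
})

# Step 1: True where the player's TEAM_ID appears among the teams.
mask = players_demo["TEAM_ID"].isin(teams_demo["TEAM_ID"])

# Step 2: keep only the rows where the mask is True.
current = players_demo[mask]

print(mask.tolist())
print(current)
```

Note that the surviving rows keep their original index labels, which is why the index in the output above has gaps.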


List players grouped by teams:

In [30]:
players.groupby('TEAM_CODE')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f06930ebb38>

The object above is an iterable. You can iterate over it to see the _views_: each iteration yields a group key together with the data frame of rows that belong to that group.

In [31]:
for t, p in players.groupby('TEAM_NAME'):
    print("***", t)
    print('; '.join(p.DISPLAY_LAST_COMMA_FIRST.values), '\n')

*** 76ers
Anderson, Justin; Bayless, Jerryd; Belinelli, Marco; Covington, Robert; Embiid, Joel; Fultz, Markelle; Holmes, Richaun; Ilyasova, Ersan; Jackson, Demetrius; Johnson, Amir; Korkmaz, Furkan; Luwawu-Cabarrot, Timothe; McConnell, T.J.; Redick, JJ; Saric, Dario; Simmons, Ben 

*** Bucks
Antetokounmpo, Giannis; Bledsoe, Eric; Brogdon, Malcolm; Brown, Sterling; Dellavedova, Matthew; Henson, John; Jennings, Brandon; Maker, Thon; Middleton, Khris; Muhammad, Shabazz; Munford, Xavier; Parker, Jabari; Plumlee, Marshall; Snell, Tony; Terry, Jason; Wilson, D.J.; Zeller, Tyler 

*** Bulls
Arcidiacono, Ryan; Asik, Omer; Blakeney, Antonio; Dunn, Kris; Eddie, Jarell; Felicio, Cristiano; Grant, Jerian; Holiday, Justin; Kilpatrick, Sean; LaVine, Zach; Lopez, Robin; Markkanen, Lauri; Nwaba, David; Payne, Cameron; Portis, Bobby; Valentine, Denzel; Vonleh, Noah; Zipser, Paul 

*** Cavaliers
Calderon, Jose; Clarkson, Jordan; Green, Jeff; Hill, George; Hood, Rodney; James, LeBron; Korver, Kyle; Love,
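
What each iteration of a `groupby` object yields can be seen on a made-up data frame. In this sketch the loop collects the player names for each group key into a plain dictionary:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["Bulls", "Bucks", "Bulls"],
    "player": ["p1", "p2", "p3"],
})

groups = {}
for name, sub in df.groupby("team"):
    # name is the group key; sub is a DataFrame holding that group's rows
    print(name, type(sub).__name__, sub.shape)
    groups[name] = sub["player"].to_list()

print(groups)
```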

### Merging data frames

First, we can create a table of unique rows that pairs each team abbreviation with its team code (name)

In [32]:
team_names = players[['TEAM_ABBREVIATION', 'TEAM_CODE']].drop_duplicates()#.set_index('TEAM_ABBREVIATION')
team_names.head()

Unnamed: 0,TEAM_ABBREVIATION,TEAM_CODE
9,OKC,thunder
14,BKN,nets
23,MIA,heat
27,ORL,magic
32,NOP,pelicans


We have team codes (names) as a new column.

In [33]:
teams_clean.head()

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,TEAM_AGE,AGE_GROUP
0,0,1610612737,1949,2018,ATL,69,OLD
1,0,1610612738,1946,2018,BOS,72,OLD
2,0,1610612739,1970,2018,CLE,48,YOUNG
3,0,1610612740,2002,2018,NOP,16,YOUNG
4,0,1610612741,1966,2018,CHI,52,OLD


In [34]:
teams = pd.merge(teams_clean, team_names, left_on='ABBREVIATION', right_on='TEAM_ABBREVIATION')
teams.tail()

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,TEAM_AGE,AGE_GROUP,TEAM_ABBREVIATION,TEAM_CODE
25,0,1610612763,1995,2018,MEM,23,YOUNG,MEM,grizzlies
26,0,1610612764,1961,2018,WAS,57,OLD,WAS,wizards
27,0,1610612765,1948,2018,DET,70,OLD,DET,pistons
28,0,1610612766,1988,2018,CHA,30,YOUNG,CHA,hornets
29,0,1610612761,1995,2018,TOR,23,YOUNG,TOR,raptors
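
When the key columns have different names in the two tables, `left_on` and `right_on` name them separately, as in the merge above. The sketch below uses made-up rows (the `XXX` team is invented) to show what the default inner join does with a key that has no match, and how `how='left'` keeps it instead.

```python
import pandas as pd

left = pd.DataFrame({
    "ABBREVIATION": ["ATL", "BOS", "XXX"],   # "XXX" has no match on the right
    "MIN_YEAR": [1949, 1946, 2000],
})
right = pd.DataFrame({
    "TEAM_ABBREVIATION": ["BOS", "ATL"],
    "TEAM_CODE": ["celtics", "hawks"],
})

# Default (inner) join: only keys present in both tables survive.
inner = pd.merge(left, right,
                 left_on="ABBREVIATION", right_on="TEAM_ABBREVIATION")

# Left join: every row of `left` is kept; unmatched rows get missing values.
kept = pd.merge(left, right, how="left",
                left_on="ABBREVIATION", right_on="TEAM_ABBREVIATION")

print(inner)
print(kept)
```

Both key columns appear in the result, which is why the merged table above contains `ABBREVIATION` and `TEAM_ABBREVIATION` side by side.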


We can apply a `str` method:

In [35]:
teams.TEAM_CODE = teams.TEAM_CODE.str.capitalize() # returns values so needs to be reassigned
teams.sort_values('ABBREVIATION', inplace=True)    # modifies object
teams.head()

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,TEAM_AGE,AGE_GROUP,TEAM_ABBREVIATION,TEAM_CODE
0,0,1610612737,1949,2018,ATL,69,OLD,ATL,Hawks
13,0,1610612751,1976,2018,BKN,42,YOUNG,BKN,Nets
1,0,1610612738,1946,2018,BOS,72,OLD,BOS,Celtics
28,0,1610612766,1988,2018,CHA,30,YOUNG,CHA,Hornets
4,0,1610612741,1966,2018,CHI,52,OLD,CHI,Bulls


In [36]:
players.head()

Unnamed: 0,PERSON_ID,DISPLAY_LAST_COMMA_FIRST,DISPLAY_FIRST_LAST,ROSTERSTATUS,FROM_YEAR,TO_YEAR,PLAYERCODE,TEAM_ID,TEAM_CITY,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CODE,GAMES_PLAYED_FLAG
9,203518,"Abrines, Alex",Alex Abrines,1,2016,2018,alex_abrines,1610612760,Oklahoma City,Thunder,OKC,thunder,Y
14,203112,"Acy, Quincy",Quincy Acy,1,2012,2018,quincy_acy,1610612751,Brooklyn,Nets,BKN,nets,Y
21,203500,"Adams, Steven",Steven Adams,1,2013,2018,steven_adams,1610612760,Oklahoma City,Thunder,OKC,thunder,Y
23,1628389,"Adebayo, Bam",Bam Adebayo,1,2017,2018,bam_adebayo,1610612748,Miami,Heat,MIA,heat,Y
27,201167,"Afflalo, Arron",Arron Afflalo,1,2007,2017,arron_afflalo,1610612753,Orlando,Magic,ORL,magic,Y


### Indexing

There are many different ways to index `Series` and `DataFrames` in pandas: https://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing.

- `.loc` is primarily for labels and booleans: e.g., row and column index labels, results of comparison operators, etc.
- `.iloc` is primarily for integer positions: i.e., as you would index a matrix

In [37]:
temp = teams.head(7).tail()
temp

Unnamed: 0,LEAGUE_ID,TEAM_ID,MIN_YEAR,MAX_YEAR,ABBREVIATION,TEAM_AGE,AGE_GROUP,TEAM_ABBREVIATION,TEAM_CODE
1,0,1610612738,1946,2018,BOS,72,OLD,BOS,Celtics
28,0,1610612766,1988,2018,CHA,30,YOUNG,CHA,Hornets
4,0,1610612741,1966,2018,CHI,52,OLD,CHI,Bulls
2,0,1610612739,1970,2018,CLE,48,YOUNG,CLE,Cavaliers
5,0,1610612742,1980,2018,DAL,38,YOUNG,DAL,Mavericks


In [38]:
print('*** indexing with .iloc:\n', temp.iloc[2])
print('\n*** indexing with .loc :\n', temp.loc[2])

*** indexing with .iloc:
 LEAGUE_ID                    00
TEAM_ID              1610612741
MIN_YEAR                   1966
MAX_YEAR                   2018
ABBREVIATION                CHI
TEAM_AGE                     52
AGE_GROUP                   OLD
TEAM_ABBREVIATION           CHI
TEAM_CODE                 Bulls
Name: 4, dtype: object

*** indexing with .loc :
 LEAGUE_ID                    00
TEAM_ID              1610612739
MIN_YEAR                   1970
MAX_YEAR                   2018
ABBREVIATION                CLE
TEAM_AGE                     48
AGE_GROUP                 YOUNG
TEAM_ABBREVIATION           CLE
TEAM_CODE             Cavaliers
Name: 2, dtype: object
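
The same contrast can be reproduced on a short `Series` whose index labels are out of order, mirroring the sorted data frame above. The values here are typed in by hand for the sketch.

```python
import pandas as pd

# Index labels deliberately do not match positions.
s = pd.Series(["BOS", "CHA", "CHI", "CLE"], index=[1, 28, 4, 2])

by_position = s.iloc[2]   # third element, counting from 0
by_label = s.loc[2]       # element whose index label is 2

print(by_position)
print(by_label)
```

After sorting or filtering, labels and positions generally disagree, so it matters which accessor is used.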


### Pandas (often) shows you views

Recall that Python variables are often just different names for (_views_ of) the same object in memory. Assignment does not copy anything; the following shows that both names refer to the same object:

In [39]:
temp = teams
print(id(temp) == id(teams))

True


So, if you change one, you see the change in the other:

In [40]:
s1 = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
s2 = s1
print("id of s1:", id(s1))
print("id of s2:", id(s2))
print("s1 is s2:", s1 is s2)

id of s1: 139666045043432
id of s2: 139666045043432
s1 is s2: True


In [41]:
s1.iloc[0] = 10000  # .iloc makes the positional access explicit (the index labels are strings)

print("s1 changed:", s1.iloc[0])
print("s2 also   :", s2.iloc[0])

s1 changed: 10000.0
s2 also   : 10000.0


An object needs to be **copied** explicitly in order to make an independent duplicate:

In [42]:
abbr = teams.ABBREVIATION.copy()
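
The difference between a second name and a copy can be checked directly. In this minimal sketch (variable names made up), `alias` follows every change to `s1`, while `dup` does not.

```python
import pandas as pd

s1 = pd.Series([0.25, 0.5], index=["a", "b"])

alias = s1         # a second name for the same object
dup = s1.copy()    # an independent duplicate

s1.loc["a"] = 100.0

print(alias is s1, alias.loc["a"])   # same object, so the change is visible
print(dup is s1, dup.loc["a"])       # separate object, so the old value remains
```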