# Challenge 1B.
In Lesson 1, we got through some big downloads and a lot of introductory programming concepts. Now, we're going to put those skills to use.

# The 2016 Golden State Warriors

<img src='https://i2.wp.com/ramblingeveron.com/wp-content/uploads/2016/03/warriors.jpg?w=650'>

In [1]:
# IMPORTING THE LIBRARY FOR DATAFRAMES
import pandas as pd

# DO NOT WORRY ABOUT THE FOLLOWING CODE FOR NOW
gsw_2016 = pd.read_csv('../00-data/gsw-2016.csv')
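# regex=True / regex=False ARE SPELLED OUT BECAUSE PANDAS 2.0 CHANGED THE DEFAULT OF str.replace TO regex=False
gsw_2016['Player'] = gsw_2016['Player'].str.replace('[A-z]{7}0[0-9]', '', regex=True).str.replace('\\', '', regex=False)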

We have a dataframe here called `gsw_2016` which we imported from a CSV file on our local drive. We are going to call the `.head()` method on it to show us the first 5 rows!

In [2]:
gsw_2016.head()

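| | Rk | Player | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Klay Thompson | 26 | 78 | 78 | 2649 | 644 | 1376 | 0.468 | 268 | ... | 0.853 | 49 | 236 | 285 | 160 | 66 | 40 | 128 | 139 | 1742 |
| 1 | 2 | Stephen Curry | 28 | 79 | 79 | 2638 | 675 | 1443 | 0.468 | 324 | ... | 0.898 | 61 | 292 | 353 | 524 | 142 | 17 | 239 | 183 | 1999 |
| 2 | 3 | Draymond Green | 26 | 76 | 76 | 2471 | 272 | 650 | 0.418 | 81 | ... | 0.709 | 98 | 501 | 599 | 533 | 154 | 106 | 184 | 217 | 776 |
| 3 | 4 | Kevin Durant | 28 | 62 | 62 | 2070 | 551 | 1026 | 0.537 | 117 | ... | 0.875 | 39 | 474 | 513 | 300 | 66 | 99 | 138 | 117 | 1555 |
| 4 | 5 | Andre Iguodala | 33 | 76 | 0 | 1998 | 219 | 415 | 0.528 | 64 | ... | 0.706 | 51 | 253 | 304 | 261 | 76 | 39 | 58 | 97 | 574 |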


We are going to call the `.tail()` method on it to show us the last 5 rows!

In [3]:
gsw_2016.tail()

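| | Rk | Player | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 13 | Kevon Looney | 20 | 53 | 4 | 447 | 56 | 107 | 0.523 | 2 | ... | 0.618 | 44 | 80 | 124 | 29 | 15 | 17 | 17 | 64 | 135 |
| 13 | 14 | Matt Barnes | 36 | 20 | 5 | 410 | 38 | 90 | 0.422 | 18 | ... | 0.87 | 15 | 76 | 91 | 45 | 12 | 9 | 24 | 47 | 114 |
| 14 | 15 | Anderson Varejão | 34 | 14 | 1 | 92 | 5 | 14 | 0.357 | 0 | ... | 0.727 | 12 | 15 | 27 | 10 | 3 | 3 | 8 | 16 | 18 |
| 15 | 16 | Damian Jones | 21 | 10 | 0 | 85 | 8 | 16 | 0.5 | 0 | ... | 0.3 | 9 | 14 | 23 | 0 | 1 | 4 | 6 | 15 | 19 |
| 16 | 17 | Briante Weber | 24 | 7 | 0 | 46 | 5 | 14 | 0.357 | 0 | ... | 0.667 | 0 | 4 | 4 | 5 | 3 | 1 | 3 | 4 | 12 |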


We can also find out what all the columns are by using `.columns`.

In [4]:
gsw_2016.columns

Index(['Rk', 'Player', 'Age', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA',
       '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB',
       'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'],
      dtype='object')

And we can know how big our table is by using `.shape`.

In [5]:
gsw_2016.shape

(17, 28)
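
For readers without the CSV at hand, here is a minimal sketch of the same inspection calls on a small hand-built dataframe. The name `mini` and the choice of four columns are illustrative assumptions; the values are copied from the first rows printed above.

```python
import pandas as pd

# A tiny stand-in for gsw_2016, typed in from the rows shown above.
# (`mini` is a hypothetical name, not part of the lesson's data files.)
mini = pd.DataFrame({
    'Player': ['Klay Thompson', 'Stephen Curry', 'Draymond Green',
               'Kevin Durant', 'Andre Iguodala'],
    'Age': [26, 28, 26, 28, 33],
    'G': [78, 79, 76, 62, 76],
    'GS': [78, 79, 76, 62, 0],
})

first_two = mini.head(2)      # .head(n) returns the first n rows (default 5)
last_two = mini.tail(2)       # .tail(n) returns the last n rows (default 5)
n_rows, n_cols = mini.shape   # .shape is a (rows, columns) tuple

print(first_two)
print(last_two)
print(list(mini.columns))
print(n_rows, n_cols)
```

Passing a number to `.head()` or `.tail()` is handy when five rows is more (or less) than you want to look at.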

Now, this already looks like a cool spreadsheet type of thing. But before we get to working with all that... we need to solidify how to do math in Python. Start by importing `numpy as np`.

In [6]:
import numpy as np

Let's focus on these two wonderful players for now.

In [7]:
our_focus = gsw_2016[gsw_2016['Player'].isin(['Klay Thompson', 'Stephen Curry'])]
our_focus

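| | Rk | Player | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Klay Thompson | 26 | 78 | 78 | 2649 | 644 | 1376 | 0.468 | 268 | ... | 0.853 | 49 | 236 | 285 | 160 | 66 | 40 | 128 | 139 | 1742 |
| 1 | 2 | Stephen Curry | 28 | 79 | 79 | 2638 | 675 | 1443 | 0.468 | 324 | ... | 0.898 | 61 | 292 | 353 | 524 | 142 | 17 | 239 | 183 | 1999 |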


Let's just look at their ages in the object `their_ages`.

In [8]:
their_ages = our_focus['Age']
their_ages

0    26
1    28
Name: Age, dtype: int64

Use indexing to calculate the average (mean) of `their_ages`.

In [9]:
(their_ages[0] + their_ages[1]) / 2

27.0

Now use `sum` and `len` to get the average (mean).

In [10]:
sum(their_ages) / len(their_ages)

27.0

Finally, use the function `np.mean()` to get the mean.

In [11]:
np.mean(their_ages)

27.0

Now clearly, you could have worked out their average in your head just by looking at the two numbers. Let's go ahead and get the mean of the whole team's ages. Use `gsw_2016['Age']` and calculate the mean.

First, use `sum()` and `len()`.

In [12]:
sum(gsw_2016['Age']) / len(gsw_2016['Age'])

27.88235294117647

Now, use numpy.

In [13]:
np.mean(gsw_2016['Age'])

27.88235294117647

A more advanced way to do this is to use `.apply('mean')` on the column itself.

In [14]:
gsw_2016['Age'].apply('mean')

27.88235294117647
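
For reference, pandas columns also carry a built-in `.mean()` method, which is the form you will most often see in practice. The sketch below uses only the five ages printed by `.head()`, typed in by hand, so its result differs from the team-wide mean above; `ages` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Ages of the first five players, copied from the .head() output.
ages = pd.Series([26, 28, 26, 28, 33], name='Age')

mean_builtin = sum(ages) / len(ages)   # plain Python functions
mean_numpy = np.mean(ages)             # numpy function
mean_method = ages.mean()              # pandas method on the column

print(mean_builtin, mean_numpy, mean_method)
```

All three give the same number; the method form simply saves you from repeating the column name.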

As you can see, there are many different ways to do the same thing in Python. You might ask which one to use when. This is intuition you will build as you continue working with objects in Python.

Let's do some background research on where these numbers came from. Visit <a href='https://www.basketball-reference.com/teams/GSW/2017.html'>Basketball Reference</a> (the site files the 2016-17 season under 2017). Scroll down to the table titled "Totals". This is the table that we are working with. Read through the associated *Glossary* next to the table. Your next task is to calculate the fraction of games that each player started out of the games they played. That is, we want to calculate

$$
\text{Fraction Started} = \frac{\text{Number of games started}}{\text{Number of games played}}
$$

using Python functions.

Try dividing the associated columns by each other to calculate "Fraction Started".

In [15]:
gsw_2016['GS'] / gsw_2016['G']

0     1.000000
1     1.000000
2     1.000000
3     1.000000
4     0.000000
5     0.039474
6     1.000000
7     0.000000
8     0.281690
9     0.000000
10    0.129870
11    0.038462
12    0.075472
13    0.250000
14    0.071429
15    0.000000
16    0.000000
dtype: float64
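
Dividing one column by another works element by element: row 0 of `GS` is divided by row 0 of `G`, and so on down the table. Here is a minimal sketch using the five rows shown by `.head()`, with the values typed in by hand and illustrative variable names.

```python
import pandas as pd

# G and GS for the first five players, copied from the .head() output.
games = pd.Series([78, 79, 76, 62, 76], name='G')
games_started = pd.Series([78, 79, 76, 62, 0], name='GS')

# Vectorized: one expression divides every row at once.
fraction = games_started / games

# The same calculation spelled out as a loop, for comparison.
fraction_loop = [gs / g for gs, g in zip(games_started, games)]

print(fraction.tolist())
print(fraction_loop)
```

The vectorized form is shorter and is the usual choice with pandas; the loop is only there to show what happens row by row.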

Assign this division to a new column in the dataframe. You can do this by assigning `gsw_2016['fraction_started']` to the division you wrote above.

In [16]:
gsw_2016['fraction_started'] = gsw_2016['GS'] / gsw_2016['G']
gsw_2016.head()

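| | Rk | Player | Age | G | GS | MP | FG | FGA | FG% | 3P | ... | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | fraction_started |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Klay Thompson | 26 | 78 | 78 | 2649 | 644 | 1376 | 0.468 | 268 | ... | 49 | 236 | 285 | 160 | 66 | 40 | 128 | 139 | 1742 | 1.0 |
| 1 | 2 | Stephen Curry | 28 | 79 | 79 | 2638 | 675 | 1443 | 0.468 | 324 | ... | 61 | 292 | 353 | 524 | 142 | 17 | 239 | 183 | 1999 | 1.0 |
| 2 | 3 | Draymond Green | 26 | 76 | 76 | 2471 | 272 | 650 | 0.418 | 81 | ... | 98 | 501 | 599 | 533 | 154 | 106 | 184 | 217 | 776 | 1.0 |
| 3 | 4 | Kevin Durant | 28 | 62 | 62 | 2070 | 551 | 1026 | 0.537 | 117 | ... | 39 | 474 | 513 | 300 | 66 | 99 | 138 | 117 | 1555 | 1.0 |
| 4 | 5 | Andre Iguodala | 33 | 76 | 0 | 1998 | 219 | 415 | 0.528 | 64 | ... | 51 | 253 | 304 | 261 | 76 | 39 | 58 | 97 | 574 | 0.0 |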


What is the minimum `fraction_started`? Use `min()` on the column to find out.

In [17]:
fraction_started = gsw_2016['fraction_started']
min(fraction_started)

0.0

In [18]:
min(gsw_2016['fraction_started'])

0.0

Which player had the minimum `fraction_started`? Use `np.where()` to get the row indices of the players who had the smallest `fraction_started`. Put a logical statement inside `np.where()` to find *where* `fraction_started` is equal to the minimum. `np.where()` returns a tuple that contains the array of indices, so save the result to a variable `min_ix` and then pull out the array by reassigning `min_ix = min_ix[0]`.

In [19]:
min_ix = np.where(fraction_started == min(fraction_started))
min_ix = min_ix[0]
min_ix

array([ 4,  7,  9, 15, 16])
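
To see why the `[0]` step is needed, here is a small sketch on a plain numpy array. The values are the `fraction_started` numbers printed earlier, typed in by hand; the variable names are illustrative.

```python
import numpy as np

# fraction_started values as printed above (rounded to six decimals).
fractions = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.039474, 1.0, 0.0, 0.281690,
                      0.0, 0.129870, 0.038462, 0.075472, 0.25, 0.071429,
                      0.0, 0.0])

is_min = fractions == fractions.min()   # boolean array, True where the minimum occurs
result = np.where(is_min)               # a tuple holding one index array per dimension

print(type(result), len(result))

min_positions = result[0]               # unpack the only array in the tuple
print(min_positions)
```

Because the data is one-dimensional, the tuple has exactly one element, and `[0]` retrieves it.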

Once you have the indices, use them on the `gsw_2016['Player']` column.

In [20]:
gsw_2016['Player'][min_ix]

4     Andre Iguodala
7          Ian Clark
9         David West
15      Damian Jones
16     Briante Weber
Name: Player, dtype: object

Who are these players? Were they great? Okay? Not so great?

Now, do the same for the players that had the maximum `fraction_started`! Name the indices `max_ix`. (Hint: Use `max()`.)

In [21]:
max_ix = np.where(fraction_started == max(fraction_started))
max_ix = max_ix[0]
gsw_2016['Player'][max_ix]

0     Klay Thompson
1     Stephen Curry
2    Draymond Green
3      Kevin Durant
6     Zaza Pachulia
Name: Player, dtype: object

And who were these players? Were they great? Okay? Not so great? Hm?

Now let's compare the stats between the different players. We are going to start thinking in terms of a user input program. Imagine that a user is going to input the indices of the players that we identified above into a program. Then, the program will use an if/else statement to compare the average age, average points, and average assists of these two groups of players. We will output the averages to the user in an orderly way and explain which group has the higher average for each metric. There are many ways to output this information to the user.

First, let's just quickly take a look at the dataframe to see what `Age`, `PTS`, and `AST` look like for these groups of players. In the following chunks, I am selecting certain columns and using a method called `.iloc` which lets me select the parts of the dataframe I want by index number.

In [22]:
gsw_2016[['Player', 'Age', 'PTS', 'AST']].iloc[min_ix]

| | Player | Age | PTS | AST |
|---|---|---|---|---|
| 4 | Andre Iguodala | 33 | 574 | 261 |
| 7 | Ian Clark | 25 | 527 | 90 |
| 9 | David West | 36 | 316 | 151 |
| 15 | Damian Jones | 21 | 19 | 0 |
| 16 | Briante Weber | 24 | 12 | 5 |


In [23]:
gsw_2016[['Player', 'Age', 'PTS', 'AST']].iloc[max_ix]

| | Player | Age | PTS | AST |
|---|---|---|---|---|
| 0 | Klay Thompson | 26 | 1742 | 160 |
| 1 | Stephen Curry | 28 | 1999 | 524 |
| 2 | Draymond Green | 26 | 776 | 533 |
| 3 | Kevin Durant | 28 | 1555 | 300 |
| 6 | Zaza Pachulia | 32 | 426 | 132 |
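
A note on `.iloc`: it selects by position (0, 1, 2, ... counted from the top), whereas `.loc` selects by index label. In `gsw_2016` the labels happen to equal the positions, so both pick the same rows. The sketch below uses a hypothetical three-row dataframe with letter labels to show the difference.

```python
import pandas as pd

# Hypothetical example: the index labels are letters, not 0, 1, 2.
df = pd.DataFrame(
    {'Player': ['Klay Thompson', 'Stephen Curry', 'Draymond Green'],
     'Age': [26, 28, 26]},
    index=['a', 'b', 'c'],
)

by_position = df.iloc[[0, 2]]    # first and third rows, by position
by_label = df.loc[['a', 'c']]    # the rows labelled 'a' and 'c'

print(by_position)
print(by_position.equals(by_label))
```

If the dataframe were sorted or filtered so that labels and positions no longer matched, the two approaches would return different rows.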


Next, calculate the following for both groups of players.

- Mean age
- Mean points
- Mean assists

Print them out as a *tuple*. A tuple is similar to a list. That's all you need to know for now. You can do this by typing

`mean_age, mean_points, mean_assists`,

assuming you have named the means you calculated in that way. You can also skip naming the means and print them out as a tuple directly. Give it a try.

In [24]:
# THE STARTERS
np.mean(gsw_2016['Age'][max_ix]), np.mean(gsw_2016['PTS'][max_ix]), np.mean(gsw_2016['AST'][max_ix])

(28.0, 1299.6, 329.8)

In [25]:
# THE BENCH
np.mean(gsw_2016['Age'][min_ix]), np.mean(gsw_2016['PTS'][min_ix]), np.mean(gsw_2016['AST'][min_ix])

(27.8, 289.6, 101.4)
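
For comparison, here is a short sketch of the named version, using the starters' values from the table above typed in by hand. Writing several values separated by commas creates a tuple, and the same comma syntax can unpack it again. The names `summary`, `a`, `p`, and `s` are illustrative.

```python
import numpy as np

# Age, PTS, and AST for the five starters, copied from the table above.
ages = [26, 28, 26, 28, 32]
points = [1742, 1999, 776, 1555, 426]
assists = [160, 524, 533, 300, 132]

mean_age = np.mean(ages)
mean_points = np.mean(points)
mean_assists = np.mean(assists)

summary = mean_age, mean_points, mean_assists   # commas build a tuple
print(type(summary))
print(summary)

a, p, s = summary                               # unpack the tuple into three names
```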

Now, we want to start formalizing what a program could look like. For now, we are going to set `group_1` and `group_2` to be the indices of the players we identified above.

In [26]:
group_1 = min_ix
group_2 = max_ix

Now, write some `print()` statements to tell the user who is in each group. You will need to use the `gsw_2016['Player']` column and convert it into a string separated by a comma and space `', '`. Here are two useful examples. Pay close attention to the data types.

**Option 1: Use the `.join()` method**  
We can use `.join()` on a list.  

`
names = ["Shaquille O'Neal", 'Kobe Bryant', 'Pau Gasol', 'Derek Fisher', 'Karl Malone']
' and '.join(names)
`

**Option 2: Use the `.str.cat()` method**  
We can use `.str.cat()` on a pandas *series*. Don't worry too much about what a *series* is. Just know that it is quite like a numpy array or list. Just another name for it in another universe.  

`
names = ['Michael Jordan', 'Steve Kerr', 'Scottie Pippen', 'Dennis Rodman', 'Toni Kukoc']
names = pd.Series(names)
names.str.cat(sep=' and ')
`
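
To make the data types concrete, the sketch below runs both options on the same three names and checks that each one returns the same plain Python string. The separator `', '` matches what the task asks for; the variable names are illustrative.

```python
import pandas as pd

names_list = ['Klay Thompson', 'Stephen Curry', 'Draymond Green']
names_series = pd.Series(names_list)

# Option 1: a string method that accepts any iterable of strings.
joined_from_list = ', '.join(names_list)

# Option 2: a pandas string method called on the series itself.
joined_from_series = names_series.str.cat(sep=', ')

print(type(names_list), type(names_series))
print(joined_from_list)
print(joined_from_list == joined_from_series)
```

Note that `.join()` is called on the separator, while `.str.cat()` is called on the series and takes the separator as an argument.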

In [27]:
names_str = ', '.join(gsw_2016['Player'][min_ix])
print(names_str)

Andre Iguodala, Ian Clark, David West, Damian Jones, Briante Weber


In [28]:
names_str = gsw_2016['Player'][min_ix].str.cat(sep=', ')
print(names_str)

Andre Iguodala, Ian Clark, David West, Damian Jones, Briante Weber


In [29]:
# YOU CAN USE THE FIRST WAY
print('Group 1 includes: ', ', '.join(gsw_2016['Player'][group_1]))

# OR THE SECOND WAY; IT DOES NOT MATTER
print('Group 2 includes: ', gsw_2016['Player'][group_2].str.cat(sep=', '))

# OR ANOTHER WAY IF YOU FOUND ANOTHER WAY!

Group 1 includes:  Andre Iguodala, Ian Clark, David West, Damian Jones, Briante Weber
Group 2 includes:  Klay Thompson, Stephen Curry, Draymond Green, Kevin Durant, Zaza Pachulia


Now that we have confirmed with the user which players are in Group 1 and Group 2, we want to give them a summary of the averages. Please write several if-else statements that will print out information similar to this. You may choose to reorder how you type out this information. Simply think of what the user might want.

`
Group 1 has a lower average age than Group 2 since  27.8 < 28.0 .
Group 1 has lower average points than Group 2 since  289.6 < 1299.6 .
Group 1 has lower average assists than Group 2 since  101.4 < 329.8 .
`

In [30]:
# AGE
if np.mean(gsw_2016['Age'][group_1]) < np.mean(gsw_2016['Age'][group_2]):
    print('Group 1 has a lower average age than Group 2 since ',
          np.mean(gsw_2016['Age'][group_1]),
          '<',
          np.mean(gsw_2016['Age'][group_2]),
          '.')
else:
    print('Group 1 has a higher average age than Group 2 since ',
          np.mean(gsw_2016['Age'][group_1]),
          '>',
          np.mean(gsw_2016['Age'][group_2]),
          '.')

# PTS
if np.mean(gsw_2016['PTS'][group_1]) < np.mean(gsw_2016['PTS'][group_2]):
    print('Group 1 has lower average points than Group 2 since ',
          np.mean(gsw_2016['PTS'][group_1]),
          '<',
          np.mean(gsw_2016['PTS'][group_2]),
          '.')
else:
    print('Group 1 has higher average points than Group 2 since ',
          np.mean(gsw_2016['PTS'][group_1]),
          '>',
          np.mean(gsw_2016['PTS'][group_2]),
          '.')

# AST
if np.mean(gsw_2016['AST'][group_1]) < np.mean(gsw_2016['AST'][group_2]):
    print('Group 1 has lower average assists than Group 2 since ',
          np.mean(gsw_2016['AST'][group_1]),
          '<',
          np.mean(gsw_2016['AST'][group_2]),
          '.')
else:
    print('Group 1 has higher average assists than Group 2 since ',
          np.mean(gsw_2016['AST'][group_1]),
          '>',
          np.mean(gsw_2016['AST'][group_2]),
          '.')

Group 1 has a lower average age than Group 2 since  27.8 < 28.0 .
Group 1 has lower average points than Group 2 since  289.6 < 1299.6 .
Group 1 has lower average assists than Group 2 since  101.4 < 329.8 .


Now, put everything that we just did into one chunk! Set groups based on indices, print out who is in the groups, and print out the average information. That is, copy and paste the last few chunks and put them together so it'll run all together.

In [31]:
group_1 = min_ix
group_2 = max_ix

# YOU CAN USE THE FIRST WAY
print('Group 1 includes: ', ', '.join(gsw_2016['Player'][group_1]))

# OR THE SECOND WAY; IT DOES NOT MATTER
print('Group 2 includes: ', gsw_2016['Player'][group_2].str.cat(sep=', '))

# OR ANOTHER WAY IF YOU FOUND ANOTHER WAY!
print('\n')

# AGE
if np.mean(gsw_2016['Age'][group_1]) < np.mean(gsw_2016['Age'][group_2]):
    print('Group 1 has a lower average age than Group 2 since ',
          np.mean(gsw_2016['Age'][group_1]),
          '<',
          np.mean(gsw_2016['Age'][group_2]),
          '.')
else:
    print('Group 1 has a higher average age than Group 2 since ',
          np.mean(gsw_2016['Age'][group_1]),
          '>',
          np.mean(gsw_2016['Age'][group_2]),
          '.')

# PTS
if np.mean(gsw_2016['PTS'][group_1]) < np.mean(gsw_2016['PTS'][group_2]):
    print('Group 1 has lower average points than Group 2 since ',
          np.mean(gsw_2016['PTS'][group_1]),
          '<',
          np.mean(gsw_2016['PTS'][group_2]),
          '.')
else:
    print('Group 1 has higher average points than Group 2 since ',
          np.mean(gsw_2016['PTS'][group_1]),
          '>',
          np.mean(gsw_2016['PTS'][group_2]),
          '.')

# AST
if np.mean(gsw_2016['AST'][group_1]) < np.mean(gsw_2016['AST'][group_2]):
    print('Group 1 has lower average assists than Group 2 since ',
          np.mean(gsw_2016['AST'][group_1]),
          '<',
          np.mean(gsw_2016['AST'][group_2]),
          '.')
else:
    print('Group 1 has higher average assists than Group 2 since ',
          np.mean(gsw_2016['AST'][group_1]),
          '>',
          np.mean(gsw_2016['AST'][group_2]),
          '.')

Group 1 includes:  Andre Iguodala, Ian Clark, David West, Damian Jones, Briante Weber
Group 2 includes:  Klay Thompson, Stephen Curry, Draymond Green, Kevin Durant, Zaza Pachulia


Group 1 has a lower average age than Group 2 since  27.8 < 28.0 .
Group 1 has lower average points than Group 2 since  289.6 < 1299.6 .
Group 1 has lower average assists than Group 2 since  101.4 < 329.8 .


Finally, change the indices given to the groups. Choose some indices of players like

`
group_1 = [2,3,4,5]
group_2 = [10,11,12,1]
`

and run the whole thing again! We have now seen the beginnings of a program and the power that comes with it. I think that's pretty awesome.

In [32]:
group_1 = [2,3,4,5]
group_2 = [10,11,12,1]

# YOU CAN USE THE FIRST WAY
print('Group 1 includes: ', ', '.join(gsw_2016['Player'][group_1]))

# OR THE SECOND WAY; IT DOES NOT MATTER
print('Group 2 includes: ', gsw_2016['Player'][group_2].str.cat(sep=', '))

# OR ANOTHER WAY IF YOU FOUND ANOTHER WAY!
print('\n')

# AGE
if np.mean(gsw_2016['Age'][group_1]) < np.mean(gsw_2016['Age'][group_2]):
    print('Group 1 has a lower average age than Group 2 since ',
          np.mean(gsw_2016['Age'][group_1]),
          '<',
          np.mean(gsw_2016['Age'][group_2]),
          '.')
else:
    print('Group 1 has a higher average age than Group 2 since ',
          np.mean(gsw_2016['Age'][group_1]),
          '>',
          np.mean(gsw_2016['Age'][group_2]),
          '.')

# PTS
if np.mean(gsw_2016['PTS'][group_1]) < np.mean(gsw_2016['PTS'][group_2]):
    print('Group 1 has lower average points than Group 2 since ',
          np.mean(gsw_2016['PTS'][group_1]),
          '<',
          np.mean(gsw_2016['PTS'][group_2]),
          '.')
else:
    print('Group 1 has higher average points than Group 2 since ',
          np.mean(gsw_2016['PTS'][group_1]),
          '>',
          np.mean(gsw_2016['PTS'][group_2]),
          '.')

# AST
if np.mean(gsw_2016['AST'][group_1]) < np.mean(gsw_2016['AST'][group_2]):
    print('Group 1 has lower average assists than Group 2 since ',
          np.mean(gsw_2016['AST'][group_1]),
          '<',
          np.mean(gsw_2016['AST'][group_2]),
          '.')
else:
    print('Group 1 has higher average assists than Group 2 since ',
          np.mean(gsw_2016['AST'][group_1]),
          '>',
          np.mean(gsw_2016['AST'][group_2]),
          '.')

Group 1 includes:  Draymond Green, Kevin Durant, Andre Iguodala, Shaun Livingston
Group 2 includes:  JaVale McGee, James Michael McAdoo, Kevon Looney, Stephen Curry


Group 1 has a higher average age than Group 2 since  29.5 > 25.25 .
Group 1 has higher average points than Group 2 since  823.5 > 688.25 .
Group 1 has higher average assists than Group 2 since  308.25 > 147.0 .
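
One possible next step, sketched here as an assumption rather than as part of the challenge, is to wrap the repeated if/else blocks in a function so that a new pair of groups only has to be typed once. The names `describe` and `compare_groups` and the small hand-typed dataframe `team` are hypothetical; the values come from the first five rows shown earlier.

```python
import pandas as pd

# Hand-typed subset of gsw_2016 (rows 0-4 as printed by .head()).
team = pd.DataFrame({
    'Player': ['Klay Thompson', 'Stephen Curry', 'Draymond Green',
               'Kevin Durant', 'Andre Iguodala'],
    'Age': [26, 28, 26, 28, 33],
    'PTS': [1742, 1999, 776, 1555, 574],
    'AST': [160, 524, 533, 300, 261],
})


def describe(label, mean_1, mean_2):
    """Return one sentence comparing the two group means."""
    if mean_1 < mean_2:
        return f'Group 1 has lower average {label} than Group 2 since {mean_1} < {mean_2}.'
    if mean_1 > mean_2:
        return f'Group 1 has higher average {label} than Group 2 since {mean_1} > {mean_2}.'
    return f'Group 1 and Group 2 have the same average {label}: {mean_1}.'


def compare_groups(df, group_1, group_2):
    """Build the report lines for two lists of row positions."""
    lines = []
    for number, group in enumerate([group_1, group_2], start=1):
        lines.append(f'Group {number} includes: ' + ', '.join(df['Player'].iloc[group]))
    for column, label in [('Age', 'age'), ('PTS', 'points'), ('AST', 'assists')]:
        mean_1 = float(df[column].iloc[group_1].mean())
        mean_2 = float(df[column].iloc[group_2].mean())
        lines.append(describe(label, mean_1, mean_2))
    return lines


for line in compare_groups(team, [0, 1], [2, 3, 4]):
    print(line)
```

Because the wording is chosen inside `describe`, the "lower" and "higher" labels cannot drift out of sync with the comparison sign, which is an easy mistake to make when the blocks are copied and pasted.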


# Reflection
Type out a reflection on what you learned and what you're still confused about below!

~ * Write your reflection here * ~