Classifying by One Variable 
Data scientists often need to classify individuals into groups according to shared features, and then identify some characteristics of the groups. For example, when working with Galton's data on heights, we saw that it was useful to classify families according to the parents' midparent heights, and then find the average height of the children in each group.

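This section is about classifying individuals into categories that are not numerical. We begin by recalling the basic use of group. The code cells in this section use pandas, where the corresponding operations are value_counts and groupby.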



Counting the Number in Each Category 
The group method with a single argument counts the number of rows for each category in a column. The result contains one row per unique value in the grouped column.

Here is a small table of data on ice cream cones. The group method can be used to list the distinct flavors and provide the counts of each flavor.

In [2]:
import pandas as pd
import numpy as np

In [3]:
cones = pd.DataFrame({
    'Flavor': ('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'),
    'Price': (3.55, 4.75, 6.55, 5.25, 5.25)
})
cones

Unnamed: 0,Flavor,Price
0,strawberry,3.55
1,chocolate,4.75
2,chocolate,6.55
3,strawberry,5.25
4,chocolate,5.25


In [8]:
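# Count the rows for each flavor. Naming the index and the counts explicitly
# gives the same column labels regardless of the pandas version.
p = cones['Flavor'].value_counts().rename_axis('Flavor').reset_index(name='count')
p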

Unnamed: 0,Flavor,count
0,chocolate,3
1,strawberry,2


Notice that this can all be worked out from just the Flavor column. The Price column has not been used.

But what if we wanted the total price of the cones of each different flavor? That's where the second argument of group comes in.



Finding a Characteristic of Each Category 
The optional second argument of group names the function that will be used to aggregate values in other columns for all of those rows. For instance, sum will sum up the prices in all rows that match each category. This result also contains one row per unique value in the grouped column, but it has the same number of columns as the original table.

To find the total price of each flavor, we call group again, with Flavor as its first argument as before. But this time there is a second argument: the function name sum.

In [11]:
cones.groupby('Flavor').agg('sum').reset_index()


Unnamed: 0,Flavor,Price
0,chocolate,16.55
1,strawberry,8.8


To create this new table, group has calculated the sum of the Price entries in all the rows corresponding to each distinct flavor. The prices in the three chocolate rows add up to $16.55 (you can assume that price is being measured in dollars). The prices in the two strawberry rows have a total of $8.80.

In the table above, the column of sums keeps the label Price, which is the label of the column being summed. To make clear that it holds sums, it can be renamed, for example to Price sum, by appending the word sum to the original label.

Because group finds the sum of all columns other than the one with the categories, there is no need to specify that it has to sum the prices.
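When a table has more columns than this one, it can still help to name the column being summed. The following is a minimal sketch, assuming the same cones table as above; the name price_totals is introduced here only for illustration.

```python
import pandas as pd

cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
    'Price': [3.55, 4.75, 6.55, 5.25, 5.25],
})

# Select the Price column before aggregating, so that only it is summed.
price_totals = cones.groupby('Flavor')['Price'].sum().reset_index()
print(price_totals)
```

Selecting the column first leaves the result unchanged for this table, and it keeps the code correct if more columns are added later.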

In [16]:
# Count the rows in each group and label the column of counts
cones.groupby('Flavor').agg('size').reset_index(name='count')

Unnamed: 0,Flavor,count
0,chocolate,3
1,strawberry,2


In [29]:
p= cones[cones.Flavor=='chocolate']['Price']
p

1    4.75
2    6.55
4    5.25
Name: Price, dtype: float64

This is what group is doing for each distinct value in Flavor.

In [36]:
# For each distinct value in `Flavor`, access all the rows
# and create an array of `Price`

cones_choc = np.array(cones[cones.Flavor == 'chocolate']['Price'])
cones_strawb = np.array(cones[cones.Flavor == 'strawberry']['Price'])

# Sum each array

price1 = np.sum(cones_choc)
price2 = np.sum(cones_strawb)

# Display the arrays and their sums in a table

table = pd.DataFrame({
    'Flavor': ('chocolate', 'strawberry'),
    'Array of All the Prices': (cones_choc,cones_strawb),
    'Sum of the Array':(price1, price2)
})
table

Unnamed: 0,Flavor,Array of All the Prices,Sum of the Array
0,chocolate,"[4.75, 6.55, 5.25]",16.55
1,strawberry,"[3.55, 5.25]",8.8
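The two flavors were filtered by hand above. As a hedged sketch of the same idea for any number of categories, the loop below iterates over the groups that pandas forms and collects one array of prices per flavor; the dictionaries arrays and sums are names introduced here for illustration.

```python
import pandas as pd

cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
    'Price': [3.55, 4.75, 6.55, 5.25, 5.25],
})

arrays = {}  # flavor -> array of all the prices for that flavor
sums = {}    # flavor -> sum of that array

# Iterating over a grouped column yields (category, values) pairs.
for flavor, prices in cones.groupby('Flavor')['Price']:
    arrays[flavor] = prices.to_numpy()
    sums[flavor] = arrays[flavor].sum()

for flavor in arrays:
    print(flavor, arrays[flavor], sums[flavor])
```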


You can replace sum by any other functions that work on arrays. For example, you could use max to find the largest price in each category:

In [37]:
cones_choc = np.array(cones[cones.Flavor == 'chocolate']['Price'])
cones_strawb = np.array(cones[cones.Flavor == 'strawberry']['Price'])

# Find the largest price in each array
maximum1 = np.max(cones_choc)
maximum2 = np.max(cones_strawb)

# Display the maximum prices in a table
table = pd.DataFrame({
    'Flavor': ('chocolate', 'strawberry'),
    'Price max': (maximum1, maximum2)
})
table

Unnamed: 0,Flavor,Price max
0,chocolate,6.55
1,strawberry,5.25
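The table of maxima above was assembled by hand. A minimal sketch of how the same table might be produced in one step with groupby, assuming the same cones table; price_max is a name introduced here for illustration.

```python
import pandas as pd

cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
    'Price': [3.55, 4.75, 6.55, 5.25, 5.25],
})

# Take the largest price within each flavor and label the result 'Price max'.
price_max = (
    cones.groupby('Flavor')['Price']
    .max()
    .rename('Price max')
    .reset_index()
)
print(price_max)
```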


Once again, group creates arrays of the prices in each Flavor category. But now it finds the max of each array:

In [38]:
table = pd.DataFrame({
    'Flavor': ('chocolate', 'strawberry'),
    'Array of All the Prices': (cones_choc,cones_strawb),
    'Price max': (maximum1,maximum2)
   
})
table

Unnamed: 0,Flavor,Array of All the Prices,Price max
0,chocolate,"[4.75, 6.55, 5.25]",6.55
1,strawberry,"[3.55, 5.25]",5.25


Indeed, the original call to group with just one argument has the same effect as using len as the function and then cleaning up the table.

In [40]:
cones_choc =np.array((cones[cones.Flavor=='chocolate']['Price']))
cones_strawb = np.array((cones[cones.Flavor=='strawberry']['Price']))
leg1=len(cones_choc)
leg2=len(cones_strawb)
# Display the arrays in a table


table = pd.DataFrame({
    'Flavor': ('chocolate', 'strawberry'),
     'Array of All the Prices': (cones_choc,cones_strawb),
    'Length of the Array': (leg1,leg2)
   
})
table

Unnamed: 0,Flavor,Array of All the Prices,Length of the Array
0,chocolate,"[4.75, 6.55, 5.25]",3
1,strawberry,"[3.55, 5.25]",2
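To check the claim about len, the sketch below aggregates the prices with len and compares the result with the group sizes. It assumes the same cones table; counts_via_len and counts_via_size are names introduced here for illustration.

```python
import pandas as pd

cones = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate'],
    'Price': [3.55, 4.75, 6.55, 5.25, 5.25],
})

# Use len as the aggregating function: the length of each group's prices.
counts_via_len = cones.groupby('Flavor')['Price'].agg(len)

# Count the rows in each group directly.
counts_via_size = cones.groupby('Flavor').size()

print(counts_via_len)
print(counts_via_size)
```

Both results hold one count per flavor, so counting is a special case of aggregating.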


In [43]:
nba1=pd.read_csv("nba_salaries.csv")
nba = nba1.rename(columns={
    "'15-'16 SALARY": 'SALARY'
})
nba

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
0,Paul Millsap,PF,Atlanta Hawks,18.671659
1,Al Horford,C,Atlanta Hawks,12.000000
2,Tiago Splitter,C,Atlanta Hawks,9.756250
3,Jeff Teague,PG,Atlanta Hawks,8.000000
4,Kyle Korver,SG,Atlanta Hawks,5.746479
...,...,...,...,...
412,Gary Neal,PG,Washington Wizards,2.139000
413,DeJuan Blair,C,Washington Wizards,2.000000
414,Kelly Oubre Jr.,SF,Washington Wizards,1.920240
415,Garrett Temple,SG,Washington Wizards,1.100602


1. How much money did each team pay for its players' salaries?

The only columns involved are TEAM and SALARY. We have to group the rows by TEAM and then sum the salaries of the groups.

In [46]:
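# Sum only the SALARY column within each team; the text columns
# (PLAYER, POSITION) are not meaningful to add up.
teams_and_money = nba.groupby('TEAM')['SALARY'].sum().reset_index()
teams_and_money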

Unnamed: 0,TEAM,SALARY
0,Atlanta Hawks,69.573103
1,Boston Celtics,50.285499
2,Brooklyn Nets,57.306976
3,Charlotte Hornets,84.102397
4,Chicago Bulls,78.82089
5,Cleveland Cavaliers,102.312412
6,Dallas Mavericks,65.762559
7,Denver Nuggets,62.429404
8,Detroit Pistons,42.21176
9,Golden State Warriors,94.085137


2. How many NBA players were there in each of the five positions?

We have to classify by POSITION, and count. This can be done with just one argument to group:

In [52]:
# Count the rows in each position, then keep only the PLAYER counts
pst = nba.groupby('POSITION').agg('count').reset_index()
pst1 = pst.drop(columns=['TEAM', 'SALARY'])
pst1

Unnamed: 0,POSITION,PLAYER
0,C,69
1,PF,85
2,PG,85
3,SF,82
4,SG,96


3. What was the average salary of the players at each of the five positions?

This time, we have to group by POSITION and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries.

In [71]:
positions_and_money = nba[['POSITION', 'SALARY']]
q=positions_and_money.groupby('POSITION').agg('mean').rename(columns={
    'SALARY': 'SALARY mean'
}).reset_index()
q

Unnamed: 0,POSITION,SALARY mean
0,C,6.082913
1,PF,4.951344
2,PG,5.165487
3,SF,5.532675
4,SG,3.988195


Center was the most highly paid position, at an average of over 6 million dollars.
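One way to read off the highest-paid position without scanning the table by eye is to sort by the mean salary. This is a small sketch that re-enters the five means shown above rather than reading nba_salaries.csv; ranked and top_position are names introduced here for illustration.

```python
import pandas as pd

# Mean salaries by position, copied from the table above (millions of dollars).
q = pd.DataFrame({
    'POSITION': ['C', 'PF', 'PG', 'SF', 'SG'],
    'SALARY mean': [6.082913, 4.951344, 5.165487, 5.532675, 3.988195],
})

# Sort from the highest mean salary to the lowest.
ranked = q.sort_values('SALARY mean', ascending=False).reset_index(drop=True)
top_position = ranked.loc[0, 'POSITION']
print(ranked)
print(top_position)
```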

If we had not selected the two columns as our first step, group would not attempt to "average" the categorical columns in nba. (It is impossible to average two strings like "Atlanta Hawks" and "Boston Celtics".) It performs arithmetic only on numerical columns and leaves the rest blank.
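In pandas, restricting the average to numerical columns can be requested explicitly with the numeric_only argument. The sketch below assumes a five-row sample copied from the first rows of the nba table above, so that it runs without the CSV file; sample and means are names introduced here for illustration. Depending on the pandas version, leaving out numeric_only may either drop the text columns or raise an error, so passing it explicitly is the safer choice.

```python
import pandas as pd

# First five rows of the nba table shown earlier in this section.
sample = pd.DataFrame({
    'PLAYER': ['Paul Millsap', 'Al Horford', 'Tiago Splitter',
               'Jeff Teague', 'Kyle Korver'],
    'POSITION': ['PF', 'C', 'C', 'PG', 'SG'],
    'TEAM': ['Atlanta Hawks'] * 5,
    'SALARY': [18.671659, 12.000000, 9.756250, 8.000000, 5.746479],
})

# Average only the numerical columns within each position.
means = sample.groupby('POSITION').mean(numeric_only=True).reset_index()
print(means)
```

The text columns PLAYER and TEAM do not appear in the result; only SALARY is averaged.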

In [65]:
positions_and_money

Unnamed: 0,POSITION,SALARY
0,PF,18.671659
1,C,12.000000
2,C,9.756250
3,PG,8.000000
4,SG,5.746479
...,...,...
412,PG,2.139000
413,C,2.000000
414,SF,1.920240
415,SG,1.100602


In [72]:
postion=q['POSITION']
mean=q['SALARY mean']