# Demo 2.5 *groupby()*: Aggregating On a *Single* Column 


- **Demonstrates:**   
  - Changing Data Type: ***astype()***    
  - Aggregating on a Single Column:  ***groupby()***  
  - Converting a pandas Series to a Dataframe  
  - Moving the Index Column into the Dataframe  


 
- Data file:  **Cars.csv** 


In [1]:
import pandas as pd

### Read the Data File into a *pandas* Dataframe  

In [2]:
df = pd.read_csv('Data/Cars.csv')

print(df.shape)
df.head(2)

(428, 13)


Unnamed: 0,Vehicle_Make,Vehicle_Model,Vehicle_Type,Manufacturing_Origin,MPG_City,MPG_Hwy,MSRP,Invoice,Weight,Wheelbase,DriveTrain,EngineSize,Horsepower
0,Acura,MDX,SUV,Asia,17,23,36945,33337,4451,106,All,3.5,265
1,Acura,RSX Type S 2dr,Sedan,Asia,24,31,23820,21761,2778,101,Front,2.0,200


# Change Data Types as Needed  
- To do numeric calculations on a column, pandas must recognize it as a numeric type (for example int64 or float64) rather than object. 
- A column that could contain decimal values should be a float rather than an integer.
- Otherwise, calculations may raise errors or produce unexpected results.  
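
As a minimal sketch of why the data type matters, the toy data below (hypothetical values, not taken from Cars.csv) includes a column that was read in as text. `pd.to_numeric()` with `errors="coerce"` is one way to handle text that may contain bad entries, while `astype(float)` is enough for a column that is already clean:

```python
import pandas as pd

# Hypothetical toy data (not from Cars.csv): prices were read in as strings
toy = pd.DataFrame({"Price": ["100", "250", "n/a"]})

# errors="coerce" turns entries that cannot be parsed into NaN
toy["Price_num"] = pd.to_numeric(toy["Price"], errors="coerce")

# astype(float) converts a column that is already clean (here: integers)
counts = pd.Series([1, 2, 3])
counts_float = counts.astype(float)

print(toy.dtypes)
print(toy["Price_num"].mean())   # NaN values are skipped by default
print(counts_float.dtype)
```

In this demo the columns are already clean integers, so `astype(float)` is all that is needed.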


In [3]:
# Data types 'Before' 
df.dtypes

Vehicle_Make             object
Vehicle_Model            object
Vehicle_Type             object
Manufacturing_Origin     object
MPG_City                  int64
MPG_Hwy                   int64
MSRP                      int64
Invoice                   int64
Weight                    int64
Wheelbase                 int64
DriveTrain               object
EngineSize              float64
Horsepower                int64
dtype: object

In [4]:
# Convert MSRP, Invoice, MPG_City, MPG_Hwy to floats
df['MSRP'] = df['MSRP'].astype(float)
df['Invoice'] = df['Invoice'].astype(float)

df['MPG_City'] = df['MPG_City'].astype(float)
df['MPG_Hwy'] = df['MPG_Hwy'].astype(float)

In [5]:
# Data types 'After' 
df.dtypes

Vehicle_Make             object
Vehicle_Model            object
Vehicle_Type             object
Manufacturing_Origin     object
MPG_City                float64
MPG_Hwy                 float64
MSRP                    float64
Invoice                 float64
Weight                    int64
Wheelbase                 int64
DriveTrain               object
EngineSize              float64
Horsepower                int64
dtype: object

# Question:  What is the Average City MPG By Vehicle Type?  
- Categorical Variable to Group On:  **Vehicle_Type**  
- Continuous Variable We're Interested In:  **MPG_City** 
- Aggregation Function:  **mean** 


- **Gotcha:**  
  - If we select only a single continuous variable/column, groupby() will create a pandas **Series** rather than a Dataframe  
  - A Series is similar to a Dataframe, but I think Dataframes are easier to work with and more familiar to you, so we're going to convert the Series to a Dataframe.
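
The gotcha can be seen on a small example. This is a sketch using hypothetical toy data in place of Cars.csv: selecting the column with single brackets gives a Series, while selecting it with double brackets (a list of one column) gives a Dataframe.

```python
import pandas as pd

# Hypothetical toy data standing in for Cars.csv
toy = pd.DataFrame({
    "Vehicle_Type": ["SUV", "SUV", "Sedan", "Sedan"],
    "MPG_City": [16.0, 18.0, 20.0, 24.0],
})

# Single brackets select one column, so the result is a Series
as_series = toy.groupby("Vehicle_Type")["MPG_City"].mean()

# Double brackets select a list of columns, so the result is a DataFrame
as_frame = toy.groupby("Vehicle_Type")[["MPG_City"]].mean()

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

In both cases the group labels (Vehicle_Type) end up in the index rather than in a regular column.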


In [6]:
# Optional:  Display the unique values of the column we want to Group on
df['Vehicle_Type'].unique()

array(['SUV', 'Sedan', 'Sports', 'Wagon', 'Truck', 'Hybrid'], dtype=object)

In [7]:
df.head(2)

Unnamed: 0,Vehicle_Make,Vehicle_Model,Vehicle_Type,Manufacturing_Origin,MPG_City,MPG_Hwy,MSRP,Invoice,Weight,Wheelbase,DriveTrain,EngineSize,Horsepower
0,Acura,MDX,SUV,Asia,17.0,23.0,36945.0,33337.0,4451,106,All,3.5,265
1,Acura,RSX Type S 2dr,Sedan,Asia,24.0,31.0,23820.0,21761.0,2778,101,Front,2.0,200


# Aggregate on a *Single* Column:  *Vehicle_Type*    


In [8]:
ser = df.groupby("Vehicle_Type")['MPG_City'].mean()

ser

Vehicle_Type
Hybrid    55.000000
SUV       16.100000
Sedan     21.083969
Sports    18.408163
Truck     16.500000
Wagon     21.100000
Name: MPG_City, dtype: float64

# Convert the pandas ***Series*** to a Dataframe

In [9]:
# First, check that it is a pandas Series
type(ser)

pandas.core.series.Series

In [10]:
# If it is, convert it to a Dataframe
df = ser.to_frame()

print(df.shape)
df.head()

(6, 1)


Vehicle_Type,MPG_City
Hybrid,55.0
SUV,16.1
Sedan,21.083969
Sports,18.408163
Truck,16.5
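
The new column takes its label from the name of the Series. As a small sketch, `to_frame()` also accepts an optional `name` argument if a different label is wanted. The Series below is a hypothetical stand-in for the groupby result, and `Avg_MPG_City` is a label chosen only for illustration:

```python
import pandas as pd

# Hypothetical Series standing in for the groupby result
ser_demo = pd.Series(
    [55.0, 16.5],
    index=pd.Index(["Hybrid", "Truck"], name="Vehicle_Type"),
    name="MPG_City",
)

# Default: the column takes the Series name
default_frame = ser_demo.to_frame()

# Optional: pass name= to choose a different column label
renamed_frame = ser_demo.to_frame(name="Avg_MPG_City")

print(list(default_frame.columns))
print(list(renamed_frame.columns))
```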


# Move the Index Column into the Dataframe  
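- After groupby(), **Vehicle_Type** is the Index. reset_index() moves it into the Dataframe as a regular column.  
- Since Vehicle_Type is then no longer the Index, pandas creates a new default index with values 0, 1, 2, etc.  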

In [11]:
df.reset_index(inplace=True)

print(df.shape)
df.head()

(6, 2)


Unnamed: 0,Vehicle_Type,MPG_City
0,Hybrid,55.0
1,SUV,16.1
2,Sedan,21.083969
3,Sports,18.408163
4,Truck,16.5
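
The convert-then-reset steps can also be written more compactly. The sketch below uses hypothetical toy data in place of Cars.csv and shows two alternatives: passing `as_index=False` to `groupby()`, and chaining `reset_index()` directly onto the aggregated Series.

```python
import pandas as pd

# Hypothetical toy data standing in for Cars.csv
toy = pd.DataFrame({
    "Vehicle_Type": ["SUV", "SUV", "Sedan", "Sedan"],
    "MPG_City": [16.0, 18.0, 20.0, 24.0],
})

# as_index=False keeps the grouping column as a regular column
one_step = toy.groupby("Vehicle_Type", as_index=False)["MPG_City"].mean()

# Chained form: aggregate to a Series, then move the index into a column
chained = toy.groupby("Vehicle_Type")["MPG_City"].mean().reset_index()

print(one_step)
print(chained)
```

Both forms should give a Dataframe with Vehicle_Type and MPG_City as columns and a default 0, 1, 2, ... index. The step-by-step version in this demo makes it easier to inspect the intermediate Series.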
