# Dataframes Exercises

Several of the following exercises require datasets from the pydataset library. (If the import below raises an error, install the pydataset package with pip.)

In [1]:
from pydataset import data
import math

All the datasets loaded from the pydataset library will be pandas dataframes.

## 1. Copy the code from the lesson to create a dataframe full of student grades.

In [2]:
import pandas as pd
import numpy as np

np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})
type(df)

pandas.core.frame.DataFrame

In [224]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   name             12 non-null     object
 1   math             12 non-null     int64 
 2   english          12 non-null     int64 
 3   reading          12 non-null     int64 
 4   passing_english  12 non-null     bool  
dtypes: bool(1), int64(3), object(1)
memory usage: 524.0+ bytes


- a. Create a column named passing_english that indicates whether each student has a passing grade in english.

In [230]:
df = df.assign(passing_english = df.english >= 70)
df

Unnamed: 0,name,math,english,reading,passing_english
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
9,Richard,69,80,94,True


- b. Sort the english grades by the passing_english column. How are duplicates handled?

In [231]:
df.sort_values(by='passing_english')

Unnamed: 0,name,math,english,reading,passing_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
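On the duplicates question: every row with the same passing_english value is a tie, and in the output above the tied rows appear in their original index order. The default sort algorithm (quicksort) is not documented as stable, so that order is best treated as incidental. A minimal sketch, assuming a small hypothetical sample rather than the full grades dataframe, that requests a stable sort explicitly with kind='mergesort':

```python
import pandas as pd

# Hypothetical four-row sample shaped like the grades dataframe.
sample = pd.DataFrame({
    'name': ['Sally', 'Thomas', 'Jane', 'Marie'],
    'english': [85, 64, 79, 63],
})
sample = sample.assign(passing_english=sample.english >= 70)

# mergesort is a stable algorithm: rows that tie on passing_english
# keep their original relative order.
stable = sample.sort_values(by='passing_english', kind='mergesort')
print(stable)
```

With a stable sort the failing students come first (False sorts before True), and within each group the original row order is preserved.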


- c. Sort the english grades first by passing_english and then by student name. All the students that are failing english should be first, and within the students that are failing english they should be ordered alphabetically. The same should be true for the students passing english. (Hint: you can pass a list to the .sort_values method)

In [6]:
df.sort_values(by=['passing_english', 'name'])

Unnamed: 0,name,math,english,reading,passing_english
11,Alan,92,62,72,False
8,Albert,92,62,87,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True


- d. Sort the english grades first by passing_english, and then by the actual english grade, similar to how we did in the last step.

In [7]:
df.sort_values(by=['passing_english', 'english'])

Unnamed: 0,name,math,english,reading,passing_english
8,Albert,92,62,87,False
11,Alan,92,62,72,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
2,Suzie,94,74,95,True
5,John,79,76,93,True
1,Jane,88,79,67,True
9,Richard,69,80,94,True
0,Sally,62,85,80,True
4,Ada,77,92,98,True
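When sorting by several columns, the ascending argument also accepts a list, one entry per column. The sketch below uses a small hypothetical sample (not the full grades dataframe) to put failing students first while listing the highest english grade first within each group:

```python
import pandas as pd

# Hypothetical four-row sample shaped like the grades dataframe.
sample = pd.DataFrame({
    'name': ['Sally', 'Thomas', 'Jane', 'Marie'],
    'english': [85, 64, 79, 63],
})
sample = sample.assign(passing_english=sample.english >= 70)

# passing_english ascending (False first), english descending within each group.
ranked = sample.sort_values(by=['passing_english', 'english'],
                            ascending=[True, False])
print(ranked)
```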


- e. Calculate each student's overall grade and add it as a column on the dataframe. The overall grade is the average of the math, english, and reading grades.

In [240]:
# assign returns a new dataframe, so reassign to keep the new column
df = df.assign(overall_grade = (df.math + df.english + df.reading)/3)
df

Unnamed: 0,name,math,english,reading,passing_english,overall_grade
0,Sally,62,85,80,True,75.666667
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,87.666667
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0
5,John,79,76,93,True,82.666667
6,Thomas,82,64,81,False,75.666667
7,Marie,93,63,90,False,82.0
8,Albert,92,62,87,False,80.333333
9,Richard,69,80,94,True,81.0
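An alternative to adding the columns by hand is a row-wise mean over the subject columns, which scales better if more subjects are added later. A minimal sketch, assuming a two-row sample copied from the output above and a hypothetical subjects list:

```python
import pandas as pd

# Two rows taken from the grades output above.
grades = pd.DataFrame({
    'name': ['Sally', 'Jane'],
    'math': [62, 88],
    'english': [85, 79],
    'reading': [80, 67],
})

# axis=1 averages across the columns of each row.
subjects = ['math', 'english', 'reading']
grades = grades.assign(overall_grade=grades[subjects].mean(axis=1))
print(grades)
```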


## 2. Load the mpg dataset. Read the documentation for the dataset and use it for the following questions:

In [248]:
mpg = data('mpg')


In [10]:
data('mpg', show_doc=True)

mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




- How many rows and columns are there?

In [11]:
mpg.shape

(234, 11)

- What are the data types of each column?

In [141]:
mpg.dtypes

manufacturer           object
model                  object
displ                 float64
year                    int64
cyl                     int64
trans                  object
drv                    object
city                    int64
highway                 int64
fl                     object
class                  object
mileage_difference      int64
average_mileage       float64
dtype: object

- Summarize the dataframe with .info and .describe

In [13]:
mpg.info(), mpg.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


(None,
             displ         year         cyl         cty         hwy
 count  234.000000   234.000000  234.000000  234.000000  234.000000
 mean     3.471795  2003.500000    5.888889   16.858974   23.440171
 std      1.291959     4.509646    1.611534    4.255946    5.954643
 min      1.600000  1999.000000    4.000000    9.000000   12.000000
 25%      2.400000  1999.000000    4.000000   14.000000   18.000000
 50%      3.300000  2003.500000    6.000000   17.000000   24.000000
 75%      4.600000  2008.000000    8.000000   19.000000   27.000000
 max      7.000000  2008.000000    8.000000   35.000000   44.000000)
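The None in the tuple above appears because .info() prints its report and returns nothing, while .describe() returns a dataframe. A small sketch of the difference, assuming a hypothetical two-row frame and capturing the .info() report in a buffer instead of printing it:

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the mpg data.
cars = pd.DataFrame({'cty': [18, 21], 'hwy': [29, 29]})

# .info() writes its report to a buffer (stdout by default) and returns None.
buffer = io.StringIO()
result = cars.info(buf=buffer)
report = buffer.getvalue()

# .describe() returns a new dataframe of summary statistics.
summary = cars.describe()
print(report)
print(summary)
```

Calling the two methods in separate cells (or printing them separately) avoids the tuple display.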

- Rename the cty column to city.

In [249]:
mpg.rename(columns={'cty' : 'city'}, inplace=True)
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


- Rename the hwy column to highway.

In [250]:
mpg.rename(columns={'hwy' : 'highway'}, inplace=True)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
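Both renames can also be done in a single call, and leaving out inplace=True returns a new dataframe instead of modifying the original. A minimal sketch on a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame with the original mpg column names.
cars = pd.DataFrame({'model': ['a4'], 'cty': [18], 'hwy': [29]})

# One mapping can rename several columns; the original frame is untouched.
renamed = cars.rename(columns={'cty': 'city', 'hwy': 'highway'})
print(renamed.columns.tolist())
```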


- Do any cars have better city mileage than highway mileage?

In [251]:
mpg[mpg.city > mpg.highway]

# No cars have better city mileage than hwy mileage

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
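An empty result answers the question, but the boolean mask can also be reduced to a direct yes/no or a count. A short sketch, assuming a three-row sample copied from the mpg output above:

```python
import pandas as pd

# Three rows of city/highway mileage taken from the mpg output above.
cars = pd.DataFrame({'city': [18, 21, 16], 'highway': [29, 29, 26]})

better_in_city = cars.city > cars.highway

# .any() is True if at least one row satisfies the condition;
# .sum() counts the rows that do (True counts as 1).
any_better = bool(better_in_city.any())
how_many = int(better_in_city.sum())
print(any_better, how_many)
```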


- Create a column named mileage_difference. This column should contain the difference between highway and city mileage for each car.

In [252]:
mpg = mpg.assign(mileage_difference = mpg.highway - mpg.city)
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10
...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8


- Which car (or cars) has the highest mileage difference?

In [253]:
mpg[mpg.mileage_difference == mpg.mileage_difference.max()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12


In [254]:
#also can use nlargest:
mpg.nlargest(1, 'mileage_difference', keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
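One caution when looking for a maximum: .idxmax() returns only the first matching index label, so it silently drops ties, whereas nlargest with keep='all' returns every tied row. A sketch using three rows copied from the mpg output above:

```python
import pandas as pd

# Three rows taken from the mpg output above, keeping their index labels.
cars = pd.DataFrame({
    'model': ['a4', 'civic', 'new beetle'],
    'city': [18, 24, 29],
    'highway': [29, 36, 41],
}, index=[1, 107, 223])
cars = cars.assign(mileage_difference=cars.highway - cars.city)

# idxmax gives the label of the first maximum only.
first_only = cars.loc[cars.mileage_difference.idxmax()]

# keep='all' returns every row tied for the top value.
all_ties = cars.nlargest(1, 'mileage_difference', keep='all')
print(first_only['model'])
print(all_ties)
```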


- Which compact class car has the lowest highway mileage? The best?

In [257]:
#compact car with the lowest highway mileage
compact_cars = mpg[mpg['class'] == 'compact'] 
compact_cars[compact_cars['highway'] == compact_cars['highway'].min()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
220,volkswagen,jetta,2.8,1999,6,auto(l4),f,16,23,r,compact,7


In [258]:
#compact car with the best highway mileage
# compare against the best mileage among compact cars, not across all classes
compact_cars[compact_cars['highway'] == compact_cars['highway'].max()]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
213,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,11
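When looking for the best car within a class, the maximum should come from the filtered subset. Comparing against the maximum of the whole dataframe only works when the overall best car happens to belong to that class. A sketch with hypothetical values (not the real mpg data) where the two approaches disagree:

```python
import pandas as pd

# Hypothetical values: the overall best highway mileage is a subcompact.
cars = pd.DataFrame({
    'model': ['jetta', 'a4', 'civic'],
    'class': ['compact', 'compact', 'subcompact'],
    'highway': [29, 31, 36],
})

# Filter first, then take the maximum within the subset.
compact = cars[cars['class'] == 'compact']
best_compact = compact[compact.highway == compact.highway.max()]

# Using the global maximum returns nothing here, because no compact reaches 36.
global_max_match = cars[(cars['class'] == 'compact')
                        & (cars.highway == cars.highway.max())]
print(best_compact)
print(len(global_max_match))
```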


- Create a column named average_mileage that is the mean of the city and highway mileage.

In [259]:
mpg = mpg.assign(average_mileage = ((mpg.highway + mpg.city) / 2))
mpg 

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference,average_mileage
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9,23.5
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8,25.0
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10,21.0
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8,22.0


- Which dodge car has the best average mileage? The worst?

In [260]:
#dodge car with the best average mileage
dodge_cars = mpg[mpg.manufacturer == 'dodge'].copy()

dodge_cars[dodge_cars.average_mileage == dodge_cars.average_mileage.max()]


Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference,average_mileage
38,dodge,caravan 2wd,2.4,1999,4,auto(l3),f,18,24,r,minivan,6,21.0


In [263]:
#dodge cars with the worst average mileage
dodge_cars.nsmallest(1, 'average_mileage', keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference,average_mileage
55,dodge,dakota pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
60,dodge,durango 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv,3,10.5
66,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
70,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup,3,10.5


## 3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:

In [132]:
Mammals = data('Mammals')

In [134]:
data('Mammals', show_doc=True)

Mammals

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Garland(1983) Data on Running Speed of Mammals

### Description

Observations on the maximal running speed of mammal species and their body
mass.

### Usage

    data(Mammals)

### Format

A data frame with 107 observations on the following 4 variables.

weight

Body mass in Kg for "typical adult sizes"

speed

Maximal running speed (fastest sprint velocity on record)

hoppers

logical variable indicating animals that ambulate by hopping, e.g. kangaroos

specials

logical variable indicating special animals with "lifestyles in which speed
does not figure as an important factor": Hippopotamus, raccoon (Procyon),
badger (Meles), coati (Nasua), skunk (Mephitis), man (Homo), porcupine
(Erithizon), oppossum (didelphis), and sloth (Bradypus)

### Details

Used by Chappell (1989) and Koenker, Ng and Portnoy (1994) to illustrate the
fitting of piecewise linear curves.

### Source

Garland, T. (

- How many rows and columns are there?

In [136]:
Mammals.shape

(107, 4)

- What are the data types?

In [140]:
Mammals.dtypes

weight      float64
speed       float64
hoppers        bool
specials       bool
dtype: object

- Summarize the dataframe with .info and .describe

In [142]:
Mammals.describe(), Mammals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 1 to 107
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   weight    107 non-null    float64
 1   speed     107 non-null    float64
 2   hoppers   107 non-null    bool   
 3   specials  107 non-null    bool   
dtypes: bool(2), float64(2)
memory usage: 2.7 KB


(            weight       speed
 count   107.000000  107.000000
 mean    278.688178   46.208411
 std     839.608269   26.716778
 min       0.016000    1.600000
 25%       1.700000   22.500000
 50%      34.000000   48.000000
 75%     142.500000   65.000000
 max    6000.000000  110.000000,
 None)

- What is the weight of the fastest animal?

In [197]:
Mammals.weight[Mammals.speed == Mammals.speed.max()]

53    55.0
Name: weight, dtype: float64

- What is the overall percentage of specials?

In [172]:
len(Mammals[(Mammals.specials)])/len(Mammals) * 100

9.345794392523365
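Because True counts as 1 and False as 0, the mean of a boolean column is the proportion of True values, which gives a shorter route to the same percentage. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical frame: one special animal out of four.
animals = pd.DataFrame({'specials': [True, False, False, False]})

# The mean of a boolean column is the share of True values.
pct_specials = animals.specials.mean() * 100
print(pct_specials)
```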

- How many animals are hoppers that are above the median speed? What percentage is this?

In [223]:
#the number of animals that are hoppers and above the median speed: 
len(Mammals[(Mammals.hoppers) & (Mammals.speed > Mammals.speed.median())])

7

In [196]:
#percentage of mammals that are hoppers and above the median speed:
len(Mammals[(Mammals.hoppers) & (Mammals.speed > Mammals.speed.median())])/len(Mammals) * 100

6.5420560747663545
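The count and the percentage can both be read off a single boolean mask, which avoids repeating the filter expression. A sketch with hypothetical speeds and hopper flags (not the real Mammals data):

```python
import pandas as pd

# Hypothetical five-animal frame; the median speed is 30.0.
animals = pd.DataFrame({
    'speed': [10.0, 20.0, 30.0, 40.0, 50.0],
    'hoppers': [False, True, False, True, True],
})

# Build the mask once, then reuse it.
fast_hoppers = animals.hoppers & (animals.speed > animals.speed.median())

count = int(fast_hoppers.sum())       # number of True rows
percent = fast_hoppers.mean() * 100   # share of True rows as a percentage
print(count, percent)
```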