For several of the following exercises, you'll need to load datasets using the pydataset library. (If you get an error when running the import below, use pip to install the pydataset package.)

In [5]:
import numpy as np # needed for the random grades generated below
import pandas as pd
from pydataset import data

When the instructions say to load a dataset, pass the name of the dataset as a string to the `data` function. You can also view the documentation for the dataset by passing `show_doc=True`; in that case the function prints the documentation instead of returning the dataframe.

In [3]:
data('mpg', show_doc=True) # view the documentation for the dataset
mpg = data('mpg') # load the dataset and store it in a variable

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




All the datasets loaded from the pydataset library will be pandas dataframes.

1. Copy the code from the lesson to create a dataframe full of student grades.

In [6]:
np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

In [7]:
# a. Create a column named passing_english that indicates whether
#     each student has a passing grade in english.
df['passing_english'] = df['english'] >= 70
df

Unnamed: 0,name,math,english,reading,passing_english
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
9,Richard,69,80,94,True


In [8]:
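# b. Sort the english grades by the passing_english column. How are duplicates handled?
df.sort_values('passing_english')
# Duplicates (ties) come out in their original index order here. That order is only
# guaranteed when a stable sort is requested, e.g. kind='stable'.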

Unnamed: 0,name,math,english,reading,passing_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
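The tie order in the output above matches the original index order, but pandas only promises that when a stable algorithm is requested. A minimal sketch of that, assuming a small made-up frame (the `grades` name and its values are illustrative, not the lesson data):

```python
import pandas as pd

# Hypothetical mini gradebook; values are illustrative only.
grades = pd.DataFrame({
    'name': ['Sally', 'Thomas', 'Jane', 'Marie'],
    'passing_english': [True, False, True, False],
})

# kind='stable' guarantees that rows with equal sort keys
# keep their original relative order.
stable = grades.sort_values('passing_english', kind='stable')
print(stable)
```

With a stable sort, Thomas stays ahead of Marie and Sally stays ahead of Jane, because that is the order they had before sorting.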


In [10]:
# c. Sort the english grades first by passing_english and then by student name.
#     All the students that are failing english should be first, and within the students 
#     that are failing english they should be ordered alphabetically. 
#     The same should be true for the students passing english.
#     (Hint: you can pass a list to the .sort_values method)
df.sort_values(['passing_english', 'name'])

Unnamed: 0,name,math,english,reading,passing_english
11,Alan,92,62,72,False
8,Albert,92,62,87,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True


In [11]:
# d. Sort the english grades first by passing_english, and then by the actual
#     english grade, similar to how we did in the last step.
df.sort_values(['passing_english', 'english'])

Unnamed: 0,name,math,english,reading,passing_english
8,Albert,92,62,87,False
11,Alan,92,62,72,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
2,Suzie,94,74,95,True
5,John,79,76,93,True
1,Jane,88,79,67,True
9,Richard,69,80,94,True
0,Sally,62,85,80,True
4,Ada,77,92,98,True
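When sorting on a list of columns, each column can also get its own direction. The sketch below is one way to do that, assuming a small made-up frame (`grades` and its values are hypothetical): failing students still come first, but within each group the highest english grade is listed first.

```python
import pandas as pd

# Hypothetical mini gradebook; values are illustrative only.
grades = pd.DataFrame({
    'name': ['Sally', 'Thomas', 'Jane', 'Marie'],
    'english': [85, 64, 79, 63],
})
grades['passing_english'] = grades['english'] >= 70

# One ascending flag per sort column:
# passing_english ascending (False before True), english descending.
ordered = grades.sort_values(['passing_english', 'english'],
                             ascending=[True, False])
print(ordered)
```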


In [18]:
# Scratch work: unrounded mean of the three subject grades
(df['reading'] + df['math'] + df['english']) / 3

0     75.666667
1     78.000000
2     87.666667
3     94.000000
4     89.000000
5     82.666667
6     75.666667
7     82.000000
8     80.333333
9     81.000000
10    94.666667
11    75.333333
dtype: float64

In [23]:
# e. Calculate each student's overall grade and add it as a column on the dataframe.
#     The overall grade is the average of the math, english, and reading grades.
df['overall'] = df[['math', 'english', 'reading']].mean(axis=1).round(2)
df

Unnamed: 0,name,math,english,reading,passing_english,overall
0,Sally,62,85,80,True,75.67
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,87.67
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0
5,John,79,76,93,True,82.67
6,Thomas,82,64,81,False,75.67
7,Marie,93,63,90,False,82.0
8,Albert,92,62,87,False,80.33
9,Richard,69,80,94,True,81.0


2. Load the mpg dataset. Read the documentation for the dataset and use it for the following questions:

In [24]:
mpg_df = data('mpg')

In [26]:
# - How many rows and columns are there?
mpg_df.shape

(234, 11)

In [27]:
# - What are the data types of each column?
mpg_df.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

In [28]:
# - Summarize the dataframe with .info and .describe
mpg_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [29]:
mpg_df.describe()

Unnamed: 0,displ,year,cyl,cty,hwy
count,234.0,234.0,234.0,234.0,234.0
mean,3.471795,2003.5,5.888889,16.858974,23.440171
std,1.291959,4.509646,1.611534,4.255946,5.954643
min,1.6,1999.0,4.0,9.0,12.0
25%,2.4,1999.0,4.0,14.0,18.0
50%,3.3,2003.5,6.0,17.0,24.0
75%,4.6,2008.0,8.0,19.0,27.0
max,7.0,2008.0,8.0,35.0,44.0


In [66]:
# - Rename the cty column to city.
# - Rename the hwy column to highway.
mpg_df = mpg_df.rename(columns={'cty': 'city', 'hwy':'highway'})
mpg_df.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


In [44]:
# - Do any cars have better city mileage than highway mileage? *No*
(mpg_df['city'] > mpg_df['highway']).unique()

array([False])
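An alternative to inspecting `.unique()` is to reduce the boolean Series directly with `.any()` and `.sum()`. A minimal sketch, assuming a tiny made-up frame (`cars` and its values are hypothetical, not the mpg data):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of mileage data.
cars = pd.DataFrame({'city': [18, 21, 20], 'highway': [29, 29, 31]})

better_in_city = cars['city'] > cars['highway']

# .any() answers "is there at least one?", .sum() counts the True values.
has_any = bool(better_in_city.any())
count = int(better_in_city.sum())
print(has_any, count)
```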

In [65]:
# - Create a column named mileage_difference. This column should contain
#    the difference between highway and city mileage for each car.
mpg_df['mileage_difference'] = mpg_df['highway'] - mpg_df['city']
mpg_df.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10


In [54]:
# - Which car (or cars) has the highest mileage difference?
# mpg_df.sort_values('mileage_difference', ascending = False)
max_mil_diff = mpg_df['mileage_difference'].max()
max_mpg_df = mpg_df[mpg_df['mileage_difference'] == max_mil_diff]
max_mpg_df

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
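Comparing against the maximum, as done above, keeps every tied row, whereas `idxmax` returns only the label of the first maximum. The sketch below illustrates the difference on a made-up frame (`cars`, its labels, and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data with a tie for the largest mileage_difference.
cars = pd.DataFrame(
    {'model': ['a', 'b', 'c'], 'mileage_difference': [12, 8, 12]},
    index=[1, 2, 3],
)

# idxmax gives the index label of the first occurrence of the maximum only.
first_label = cars['mileage_difference'].idxmax()

# A boolean mask against the maximum keeps all tied rows.
all_max = cars[cars['mileage_difference'] == cars['mileage_difference'].max()]
print(first_label)
print(all_max)
```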


In [58]:
# Scratch check: best highway mileage among compact cars
mpg_df.loc[mpg_df['class'] == 'compact', 'highway'].max()

44

In [60]:
# - Which compact class car has the lowest highway mileage? The best?
df_compact = mpg_df[mpg_df['class'] == 'compact']
df_compact[df_compact['highway'] == df_compact['highway'].min()] # lowest


Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
220,volkswagen,jetta,2.8,1999,6,auto(l4),f,16,23,r,compact,7


In [61]:
df_compact[df_compact['highway'] == df_compact['highway'].max()] # best

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
213,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,11


In [64]:
# - Create a column named average_mileage that is the mean of the city and highway mileage.
mpg_df['average_mileage'] = (mpg_df['city'] + mpg_df['highway']) / 2
mpg_df.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference,average_mileage
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0


In [67]:
# - Which dodge car has the best average mileage? The worst?
dodge_df = mpg_df[mpg_df['manufacturer'] == 'dodge'] # restrict to dodge cars first
dodge_df[dodge_df['average_mileage'] == dodge_df['average_mileage'].max()] # best


In [68]:
dodge_df[dodge_df['average_mileage'] == dodge_df['average_mileage'].min()] # worst

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference,average_mileage
55,dodge,dakota pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
60,dodge,durango 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv,3,10.5
66,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
70,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup,3,10.5


3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:



In [None]:
# - How many rows and columns are there?


In [None]:
# - What are the data types?


In [None]:
# - Summarize the dataframe with .info and .describe


In [None]:
# - What is the weight of the fastest animal?


In [None]:
# - What is the overall percentage of specials?


In [None]:
# - How many animals are hoppers that are above the median speed? What percentage is this?
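
The last three questions can be approached with `idxmax`, the mean of a boolean column, and a combined boolean mask. The sketch below is not a solution on the real data: it uses a made-up `animals` frame whose column names (`weight`, `speed`, `hoppers`, `specials`) are assumed to mirror the Mammals dataset, so check them against the documentation first.

```python
import pandas as pd

# Hypothetical stand-in for the Mammals data; column names are an assumption
# and the values are illustrative only.
animals = pd.DataFrame({
    'weight': [6000.0, 55.0, 0.1, 30.0],
    'speed': [35.0, 110.0, 20.0, 65.0],
    'hoppers': [False, False, True, True],
    'specials': [True, False, False, False],
})

# Weight of the fastest animal: find the row label of the top speed, then look up weight.
fastest_weight = animals.loc[animals['speed'].idxmax(), 'weight']

# The mean of a boolean column is the fraction of True values.
pct_specials = animals['specials'].mean() * 100

# Hoppers that are faster than the median speed, as a count and a percentage of all rows.
fast_hoppers = animals[animals['hoppers'] & (animals['speed'] > animals['speed'].median())]
pct_fast_hoppers = len(fast_hoppers) / len(animals) * 100

print(fastest_weight, pct_specials, len(fast_hoppers), pct_fast_hoppers)
```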

**Bonus**

https://github.com/guipsamora/pandas_exercises