## Exercises: Explore the dataset

In [98]:
import pandas as pd
import seaborn as sns
taxis = sns.load_dataset("taxis")

**Explore the "taxis" dataset to answer the following questions:**

**Q1:** How many rows and columns are in the dataset?

In [99]:
taxis.shape

(6433, 14)

**Q2:** What datatype is the most common in the set?

In [100]:
# object is the most common datatype, with six occurrences

taxis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   object        
 9   payment          6389 non-null   object        
 10  pickup_zone      6407 non-null   object        
 11  dropoff_zone     6388 non-null   object        
 12  pickup_borough   6407 non-null   object        
 13  dropoff_borough  6388 non-null   object        
dtypes: datetime64[ns](2), float64(5), int64(
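
Instead of counting dtypes by eye in the `info()` output, the tally can be computed. This is a minimal sketch on a small made-up frame called `toy` (an assumption, not the taxis data), so it runs without downloading anything:

```python
import pandas as pd

# Hypothetical stand-in for taxis: 1 int, 2 float and 3 object columns.
toy = pd.DataFrame({
    "passengers": [1, 2, 1],
    "fare": [7.0, 5.5, 12.0],
    "tip": [1.0, 0.0, 2.5],
    "color": pd.Series(["yellow", "green", "yellow"], dtype=object),
    "payment": pd.Series(["cash", "credit card", "cash"], dtype=object),
    "zone": pd.Series(["SoHo", "Astoria", "SoHo"], dtype=object),
})

dtype_counts = toy.dtypes.value_counts()   # number of columns per dtype
most_common = dtype_counts.idxmax()        # dtype with the most columns
print(dtype_counts)
print(most_common, dtype_counts.max())
```

The same two lines should work on `taxis` itself.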

**Q3:** What is the average number of passengers in a taxi?

In [101]:
round(taxis['passengers'].mean(), 2)

1.54

**Q4:** What is the most common number of passengers in a taxi?

In [102]:
taxis['passengers'].mode()

0    1
Name: passengers, dtype: int64

**Q5:** What is the most common payment method?

In [103]:
taxis['payment'].mode()

0    credit card
Name: payment, dtype: object
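
`mode()` returns a Series rather than a single value because several values can be tied for most common. A small sketch with a made-up `payment` column (hypothetical data) shows the tie case and how to pull out a scalar:

```python
import pandas as pd

# Hypothetical payment column with a tie and one missing value.
payment = pd.Series(
    ["cash", "credit card", "cash", "credit card", None], dtype=object
)

modes = payment.mode()        # one entry per tied value; missing values are ignored
first_mode = modes.iloc[0]    # scalar; tied values come back sorted
print(list(modes), first_mode)
```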

**Q6:** Which of the categorical features has the most categories?

In [104]:
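# no column has the pandas category dtype, but the six object columns (color, payment, the zones and the boroughs) hold categorical data; their number of categories can be compared with nunique()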
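
One way to approach this question is to treat the object columns as categorical features and count their distinct values. The sketch below uses a made-up frame `toy` (an assumption, not the taxis data):

```python
import pandas as pd

# Hypothetical frame: one numeric column and two string (object) columns.
toy = pd.DataFrame({
    "fare": [7.0, 5.5, 12.0, 9.0],
    "color": pd.Series(["yellow", "green", "yellow", "yellow"], dtype=object),
    "zone": pd.Series(["A", "B", "C", None], dtype=object),
})

# Count distinct non-missing values in every object column.
n_categories = toy.select_dtypes(include="object").nunique()
print(n_categories.sort_values(ascending=False))
print(n_categories.idxmax())   # column with the most categories
```

Applied to `taxis`, the same expression would rank the six object columns by their number of categories.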

**Q7:** What percentage of cars in the set are yellow?

In [105]:
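# about 84.7 % of the taxis are yellow (the result is a fraction; multiply by 100 for a percentage)

len(taxis.query('color == "yellow"')) / len(taxis)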

0.8473496036064044
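
The question asks for a percentage, while the cell above returns a fraction. As a sketch on a made-up `color` column (hypothetical data), the share of every color can be computed in one step with `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical color column: three yellow taxis and one green one.
color = pd.Series(["yellow", "yellow", "green", "yellow"])

share = color.value_counts(normalize=True) * 100   # percent per color
yellow_pct = (color == "yellow").mean() * 100      # same figure for one color
print(share.round(1))
print(yellow_pct)
```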

**Q8:** Which dropoff borough is most common? Which one is least common?

In [106]:
# Manhattan is the most common, Staten Island is the least common

taxis['dropoff_borough'].value_counts()

dropoff_borough
Manhattan        5206
Queens            542
Brooklyn          501
Bronx             137
Staten Island       2
Name: count, dtype: int64

**Q9:** Which column has the most missing values? How many?

In [107]:
# dropoff_zone and dropoff_borough both have 45 missing values

taxis.isna().sum()

pickup              0
dropoff             0
passengers          0
distance            0
fare                0
tip                 0
tolls               0
total               0
color               0
payment            44
pickup_zone        26
dropoff_zone       45
pickup_borough     26
dropoff_borough    45
dtype: int64
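
Because two columns are tied here, `idxmax()` alone would report only one of them. A minimal sketch on a made-up frame `toy` (hypothetical data) that keeps every column sharing the maximum:

```python
import pandas as pd

# Hypothetical frame in which two columns have the same number of gaps.
toy = pd.DataFrame({
    "fare": [7.0, 5.5, 12.0, 9.0],
    "payment": ["cash", None, "cash", "cash"],
    "dropoff_zone": ["A", None, None, "B"],
    "dropoff_borough": ["X", None, None, "Y"],
})

missing = toy.isna().sum()                         # missing values per column
most_missing = missing[missing == missing.max()]   # keeps ties
print(most_missing)
```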

### Memory usage
``` taxis.info(memory_usage="deep") ``` gives you the total memory usage of the dataframe.

``` taxis.memory_usage(deep=True) ``` gives you the memory usage of each column in bytes.

**Answer the following questions:**

**Q10:** What is the total memory usage of the dataframe?

In [108]:
# 2.9 MB

taxis.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   object        
 9   payment          6389 non-null   object        
 10  pickup_zone      6407 non-null   object        
 11  dropoff_zone     6388 non-null   object        
 12  pickup_borough   6407 non-null   object        
 13  dropoff_borough  6388 non-null   object        
dtypes: datetime64[ns](2), float64(5), int64(

**Q11:** Which column takes up the most memory? How many kilobytes?

In [109]:
# pickup_zone uses the most memory: 469,744 bytes, about 470 kB

taxis.memory_usage(deep=True)

Index                 132
pickup              51464
dropoff             51464
passengers          51464
distance            51464
fare                51464
tip                 51464
tolls               51464
total               51464
color              404297
payment            423176
pickup_zone        469744
dropoff_zone       469466
pickup_borough     420944
dropoff_borough    420381
dtype: int64
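
The largest column can also be picked out programmatically rather than read off the listing. This sketch uses a made-up frame `toy` (hypothetical data) and passes `index=False` so the `Index` entry does not take part in the comparison:

```python
import pandas as pd

# Hypothetical frame: a small integer column and a string (object) column.
toy = pd.DataFrame({
    "passengers": [1, 2, 1, 3],
    "zone": pd.Series(
        ["Upper East Side", "Midtown", "Upper East Side", "Harlem"], dtype=object
    ),
})

per_column = toy.memory_usage(deep=True, index=False)   # bytes per column
largest = per_column.idxmax()
print(largest, per_column.max() / 1000, "kB")           # 1 kB = 1000 bytes
```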

**Q12:** Why do the numeric columns all take up exactly 51464 bytes?

In [83]:
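# int64 and float64 are fixed-width types: every value occupies 8 bytes (64 bits), and a missing value (NaN) in a float64 column still occupies 8 bytes. The size therefore depends only on the number of rows: 6,433 entries * 8 bytes = 51,464 bytes. The datetime64[ns] columns are 8 bytes wide as well, which is why they report the same figure.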
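
The arithmetic can be checked on a stand-in column. The sketch below assumes only the row count 6433 from `taxis.shape`; the values themselves are made up:

```python
import numpy as np
import pandas as pd

n_rows = 6433                              # row count from taxis.shape
values = pd.Series(np.zeros(n_rows))       # float64 stand-in column
values.iloc[:10] = np.nan                  # missing values still take 8 bytes each

bytes_per_value = values.dtype.itemsize    # 8 for float64
column_bytes = values.memory_usage(index=False)
print(bytes_per_value, column_bytes)       # n_rows * bytes_per_value
```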

**Q13:** What is the total memory usage after converting all *object* columns to *category*?

In [110]:
cat_dict = {column: 'category' for column in taxis.columns if taxis[column].dtype == 'object'}

taxis = taxis.astype(cat_dict)

taxis.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   category      
 9   payment          6389 non-null   category      
 10  pickup_zone      6407 non-null   category      
 11  dropoff_zone     6388 non-null   category      
 12  pickup_borough   6407 non-null   category      
 13  dropoff_borough  6388 non-null   category      
dtypes: category(6), datetime64[ns](2), float
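
The saving comes from storing each distinct string once and replacing the column values with small integer codes. A minimal sketch that measures the effect on a made-up frame `toy` (hypothetical data, not the taxis figures):

```python
import pandas as pd

# Hypothetical frame: 1000 rows with a heavily repeated string column.
toy = pd.DataFrame({
    "fare": [7.0, 5.5, 12.0, 9.0] * 250,
    "color": pd.Series(["yellow", "green", "yellow", "yellow"] * 250, dtype=object),
})
before = toy.memory_usage(deep=True).sum()

# Convert every object column to category, leaving other columns alone.
object_cols = toy.select_dtypes(include="object").columns
converted = toy.astype({col: "category" for col in object_cols})
after = converted.memory_usage(deep=True).sum()
print(before, after)
```

The gain shrinks as the number of distinct values approaches the number of rows.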

**Q14:** ... and after also converting *float64* to *float32*?

In [111]:
float_dict = {column: 'float32' for column in taxis.columns if taxis[column].dtype == 'float64'}

taxis = taxis.astype(float_dict)

taxis.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float32       
 4   fare             6433 non-null   float32       
 5   tip              6433 non-null   float32       
 6   tolls            6433 non-null   float32       
 7   total            6433 non-null   float32       
 8   color            6433 non-null   category      
 9   payment          6389 non-null   category      
 10  pickup_zone      6407 non-null   category      
 11  dropoff_zone     6388 non-null   category      
 12  pickup_borough   6407 non-null   category      
 13  dropoff_borough  6388 non-null   category      
dtypes: category(6), datetime64[ns](2), float
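
Halving the width of a float column also reduces its precision, so it is worth checking how much rounding error the conversion introduces. A sketch on a made-up `fare` column (hypothetical values):

```python
import pandas as pd

fare = pd.Series([7.0, 5.5, 12.3, 9.99])   # float64, 8 bytes per value
fare32 = fare.astype("float32")            # 4 bytes per value

bytes64 = fare.memory_usage(index=False)
bytes32 = fare32.memory_usage(index=False)
# Largest absolute difference introduced by the narrower type.
max_error = (fare - fare32.astype("float64")).abs().max()
print(bytes64, bytes32, max_error)
```

For amounts in dollars and distances in miles an error of this size is probably acceptable, but that is a judgment call for each dataset.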

**Q15:** What is the smallest datatype we can convert passengers to? What is the total memory usage after converting passengers to the new type?

In [112]:
# int8 (1 byte per value) is the smallest integer type; the total memory usage is then 324.4 KB

taxis = taxis.astype({'passengers': 'int8'})

taxis.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int8          
 3   distance         6433 non-null   float32       
 4   fare             6433 non-null   float32       
 5   tip              6433 non-null   float32       
 6   tolls            6433 non-null   float32       
 7   total            6433 non-null   float32       
 8   color            6433 non-null   category      
 9   payment          6389 non-null   category      
 10  pickup_zone      6407 non-null   category      
 11  dropoff_zone     6388 non-null   category      
 12  pickup_borough   6407 non-null   category      
 13  dropoff_borough  6388 non-null   category      
dtypes: category(6), datetime64[ns](2), float
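
Before downcasting, it helps to confirm that every value fits in the target type. This sketch uses a made-up `passengers` column (hypothetical counts); `pd.to_numeric` with `downcast="integer"` picks the smallest signed integer type that holds the data:

```python
import numpy as np
import pandas as pd

passengers = pd.Series([0, 1, 1, 2, 6])    # hypothetical passenger counts

limits = np.iinfo(np.int8)                 # representable range of int8
fits = passengers.between(limits.min, limits.max).all()
small = pd.to_numeric(passengers, downcast="integer")
print(limits.min, limits.max, fits, small.dtype)
```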

**Q16:** What percentage of the original data size is the new dataset after converting all the types as above?

In [114]:
# 324.4 KB / 2.9 MB ≈ 11%
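
The ratio can be reproduced from the unrounded byte counts. This sketch copies the per-column figures from the `memory_usage(deep=True)` output in Q11 and takes the 324.4 KB from the answer to Q15; it assumes pandas reports sizes in units of 1024 bytes:

```python
# Per-column byte counts from Q11: Index, eight 8-byte columns, six object columns.
column_bytes = [132] + [51464] * 8 + [404297, 423176, 469744, 469466, 420944, 420381]
original_bytes = sum(column_bytes)

reduced_kb = 324.4                            # total reported after the Q15 conversion
original_mb = original_bytes / 1024**2        # same unit convention as info()
percent = reduced_kb * 1024 / original_bytes * 100
print(original_bytes, round(original_mb, 1), round(percent, 1))
```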