## Exercises: Explore the dataset

In [1]:
import pandas as pd
import seaborn as sns
taxis = sns.load_dataset("taxis")

**Explore the "taxis" dataset to answer the following questions:**

**Q1:** How many rows and columns are in the dataset?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Rows:</b> 6433
&nbsp;&nbsp;&nbsp;<b>Columns:</b> 14
</details>

In [2]:
taxis.shape

(6433, 14)

**Q2:** What datatype is the most common in the set?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;object (6 columns)
</details>

In [6]:
# object is the most common datatype, with six occurrences

taxis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   object        
 9   payment          6389 non-null   object        
 10  pickup_zone      6407 non-null   object        
 11  dropoff_zone     6388 non-null   object        
 12  pickup_borough   6407 non-null   object        
 13  dropoff_borough  6388 non-null   object        
dtypes: datetime64[ns](2), float64(5), int64(1), object(6)

**Q3:** What is the average number of passengers in a taxi?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;1.54
</details>

In [11]:
round(taxis['passengers'].mean(), 2)

1.54

**Q4:** What is the most common number of passengers in a taxi?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;1
</details>

In [15]:
taxis['passengers'].mode()

0    1
Name: passengers, dtype: int64

**Q5:** What is the most common payment method?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;credit card
</details>

In [16]:
taxis['payment'].mode()

0    credit card
Name: payment, dtype: object

**Q6:** Which of the categorical features has the most categories?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;dropoff_zone (203 categories)
</details>

In [18]:
# No column uses the category dtype yet; the categorical features are stored as object
taxis.describe(include=['object', 'category'])

| | color | payment | pickup_zone | dropoff_zone | pickup_borough | dropoff_borough |
|---|---|---|---|---|---|---|
| count | 6433 | 6389 | 6407 | 6388 | 6407 | 6388 |
| unique | 2 | 2 | 194 | 203 | 4 | 5 |
| top | yellow | credit card | Midtown Center | Upper East Side North | Manhattan | Manhattan |
| freq | 5451 | 4577 | 230 | 245 | 5268 | 5206 |


**Q7:** What percentage of cars in the set are yellow?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;84.7%
</details>

In [20]:
len(taxis.query('color == "yellow"')) / len(taxis)

0.8473496036064044

**Q8:** Which dropoff borough is most common? Which one is least common?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Most common:</b> Manhattan (5206)<br>
&nbsp;&nbsp;&nbsp;<b>Least common:</b> Staten Island (2)<br>
</details>

In [22]:
# Manhattan is the most common, Staten Island is the least common

taxis['dropoff_borough'].value_counts()

dropoff_borough
Manhattan        5206
Queens            542
Brooklyn          501
Bronx             137
Staten Island       2
Name: count, dtype: int64

**Q9:** Which column has the most missing values? How many?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;<i>dropoff_zone</i> and <i>dropoff_borough</i> both have 45 missing values.
</details>

In [25]:
# dropoff_zone and dropoff_borough both have 45 missing values

taxis.isna().sum()

pickup              0
dropoff             0
passengers          0
distance            0
fare                0
tip                 0
tolls               0
total               0
color               0
payment            44
pickup_zone        26
dropoff_zone       45
pickup_borough     26
dropoff_borough    45
dtype: int64

### Memory usage
``` taxis.info(memory_usage="deep") ``` gives you the total memory usage of the dataframe.

``` taxis.memory_usage(deep=True) ``` gives you the memory usage of each column, in bytes.
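
As a rough illustration of the two calls above, here is a minimal sketch on a small made-up dataframe, so it runs without downloading anything. The frame `df` and its values are invented stand-ins, not the taxis data, and the byte counts it prints will differ from the real dataset.

```python
import pandas as pd

# Made-up stand-in for the taxis data (hypothetical values, explicit dtypes).
df = pd.DataFrame({
    "passengers": pd.Series([1, 2, 1, 6], dtype="int64"),
    "fare": pd.Series([7.0, 5.0, 7.5, 27.0], dtype="float64"),
    "color": pd.Series(["yellow", "yellow", "green", "yellow"], dtype=object),
})

# Bytes per column; deep=True also counts the Python strings in object columns.
per_column = df.memory_usage(deep=True)
total_bytes = per_column.sum()

print(per_column)
print(f"total: {total_bytes / 1024:.1f} KB")

# The column that takes up the most memory (ignoring the index entry).
largest = per_column.drop("Index").idxmax()
print(largest)

# info() prints a summary that ends with the total memory usage.
df.info(memory_usage="deep")
```

Note that the result of `memory_usage` includes an extra `Index` entry unless `index=False` is passed.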

**Answer the following questions:**

**Q10:** What is the total memory usage of the dataframe?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;2.9 MB
</details>

**Q11:** Which column takes up the most memory? How many kilobytes?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;pickup_zone (470 KB)
</details>

**Q12:** Why do the numeric columns all take up exactly 51464 bytes?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;They all use 64-bit datatypes. 64 bits = 8 bytes, and 6433 entries * 8 bytes = 51464 bytes.
</details>
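
The arithmetic in the answer can be checked directly. The sketch below uses a tiny made-up frame (three invented rows standing in for the 6433 real ones), since the size of a fixed-width column depends only on the row count and the item size of its dtype.

```python
import numpy as np
import pandas as pd

# Made-up data: only the dtypes matter here, not the values.
df = pd.DataFrame({
    "passengers": pd.Series([1, 2, 6], dtype="int64"),
    "fare": pd.Series([7.0, 5.0, 27.0], dtype="float64"),
})

usage = df.memory_usage(index=False)
for column in df.columns:
    itemsize = df[column].dtype.itemsize  # bytes per value: 8 for 64-bit dtypes
    assert usage[column] == len(df) * itemsize

# The same calculation for the full dataset: 6433 rows of 8 bytes each.
bytes_per_numeric_column = 6433 * np.dtype("int64").itemsize
print(bytes_per_numeric_column)  # 51464
```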

**Q13:** What is the total memory usage after converting all *object* columns to *category*?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;494.0 KB
</details>
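
One possible way to do the conversion is sketched below on a made-up frame (the data and the name `converted` are illustrative assumptions, so the sizes will not match the answer above). It selects the *object* columns and casts them all to *category* in a single `astype` call.

```python
import pandas as pd

# Made-up stand-in for the taxis data; object dtype mirrors the original columns.
df = pd.DataFrame({
    "fare": pd.Series([7.0, 5.0, 7.5, 27.0] * 250, dtype="float64"),
    "color": pd.Series(["yellow", "yellow", "green", "yellow"] * 250, dtype=object),
    "payment": pd.Series(["credit card", "cash", "credit card", None] * 250, dtype=object),
})

before = df.memory_usage(deep=True).sum()

# Cast every object column to category; other columns are left untouched.
object_columns = df.select_dtypes(include="object").columns
converted = df.astype({column: "category" for column in object_columns})

after = converted.memory_usage(deep=True).sum()
print(f"before: {before / 1024:.1f} KB, after: {after / 1024:.1f} KB")
```

The saving comes from storing each distinct string once and keeping only small integer codes per row, so it is largest for columns with few unique values. Missing values stay missing after the conversion.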

**Q14:** ... and after also converting *float64* to *float32*?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;368.4 KB
</details>
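
The float conversion can be sketched the same way. The frame below is again made up for illustration; the point is only that each *float32* value takes 4 bytes instead of 8, at the cost of roughly 7 significant digits of precision.

```python
import pandas as pd

# Made-up values with the same dtypes as the corresponding taxis columns.
df = pd.DataFrame({
    "distance": pd.Series([1.6, 0.79, 1.37, 7.7], dtype="float64"),
    "fare": pd.Series([7.0, 5.0, 7.5, 27.0], dtype="float64"),
    "passengers": pd.Series([1, 1, 1, 3], dtype="int64"),
})

# Cast only the float64 columns to float32.
float_columns = df.select_dtypes(include="float64").columns
downcast = df.astype({column: "float32" for column in float_columns})

print(df.memory_usage(index=False))
print(downcast.memory_usage(index=False))
```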

**Q15:** What is the smallest datatype we can convert passengers to? What is the total memory usage after converting passengers to the new type?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;The maximum number of passengers in the dataset is 6,<br> 
&nbsp;&nbsp;&nbsp;and therefore the values easily fit into the <i>int8</i> type (8 bit integer).<br>
<br>
&nbsp;&nbsp;&nbsp;New size: 324.4 KB
</details>
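
Before casting to a smaller integer type, it is worth confirming that the observed values fit. A minimal sketch, using a made-up `passengers` series rather than the real column:

```python
import numpy as np
import pandas as pd

# Made-up passenger counts in the same range as the dataset (maximum 6).
passengers = pd.Series([1, 1, 3, 6, 0, 2], dtype="int64", name="passengers")

# int8 covers -128..127, so check the observed range before casting.
limits = np.iinfo(np.int8)
fits = limits.min <= passengers.min() and passengers.max() <= limits.max

small = passengers.astype("int8") if fits else passengers

print(passengers.memory_usage(index=False))  # 8 bytes per value
print(small.memory_usage(index=False))       # 1 byte per value
```

Casting values that fall outside the target range would silently wrap around or raise, depending on the versions involved, which is why the range check comes first.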

**Q16:** What percentage of the original data size is the dataset after converting all the types as above?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;11.0 %
</details>

### Final note:
Just to be clear: if we want to limit memory usage by specifying datatypes with a smaller memory footprint, it makes more sense to do so when loading the dataset into pandas than to change the types afterwards (as in the exercises above).

The most common ways to load data into pandas (like pd.read_csv, pd.read_json etc.) provide optional parameters for setting the datatype as the files are read into pandas dataframes.
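
As an illustration, the sketch below passes a `dtype` mapping to `pd.read_csv`. The CSV text is made up and read from an in-memory buffer so the example is self-contained; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = """passengers,fare,color
1,7.0,yellow
3,27.0,green
1,5.0,yellow
"""

# Column name -> datatype, applied while the data is parsed.
dtypes = {"passengers": "int8", "fare": "float32", "color": "category"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(df.dtypes)
```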

Also, note that this is really only a concern when working with huge datasets. For smaller datasets, like the one in the exercises above, it doesn't really matter, and optimizing might just be unnecessary work. The exercises above simply serve as examples to better understand data types and their memory footprints.