# Conversion methods

* Converting data to a more suitable type has two benefits:
    - More manipulation methods become available
    - Less memory is used

In [1]:
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)
city_mpg = df.city08

## Automatic conversion
- Let's convert a Series using `convert_dtypes()`
- It converts to the best possible dtype that supports `pd.NA`

In [2]:
city_mpg.convert_dtypes()

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int64
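The benefit of a `pd.NA`-aware dtype is easier to see when values are missing. The following is a minimal sketch on hypothetical toy data (not the vehicles dataset): a `float64` Series holding whole numbers and one missing value is converted to the nullable `Int64` type.

```python
import pandas as pd

# Hypothetical toy data: whole numbers stored as floats because of a missing value.
raw = pd.Series([19.0, 9.0, None, 23.0])

converted = raw.convert_dtypes()

print(raw.dtype)               # float64, the missing value is stored as NaN
print(converted.dtype)         # Int64, the missing value is stored as pd.NA
print(converted.isna().sum())  # 1
```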

- We can also convert to a specific data type by passing it to `astype`
- Choose the type carefully. It should:
    - Be able to hold all the values
    - Use as little memory as practical

In [3]:
city_mpg.astype('Int16')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: Int16

In [4]:
city_mpg.astype('Int8')

TypeError: ignored

In [5]:
import numpy as np

In [7]:
np.iinfo('uint8')

iinfo(min=0, max=255, dtype=uint8)

In [8]:
np.finfo('float64')

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)

In [9]:
np.finfo('float16')

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)
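The failed `Int8` cast above is most likely a range problem: `city08` contains values up to 150, while a signed 8-bit integer stops at 127. As a rough sketch, the range check can be done with `np.iinfo` before casting. The helper `fits` and the toy values are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; 150 mirrors the largest city08 value.
s = pd.Series([6, 19, 23, 150])

def fits(series, dtype):
    """Return True if every value lies within the integer range of `dtype`."""
    info = np.iinfo(dtype)
    return bool(info.min <= series.min() and series.max() <= info.max)

print(fits(s, 'int8'))    # False: int8 stops at 127
print(fits(s, 'uint8'))   # True: uint8 covers 0 to 255
print(fits(s, 'int16'))   # True

# pd.to_numeric can pick the smallest signed integer type that fits.
small = pd.to_numeric(s, downcast='integer')
print(small.dtype)        # int16
```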

## Memory usage
- `nbytes`: the number of bytes used by the data of a Series
- `memory_usage`: the memory used by a Series, or by each column of a DataFrame
    - Pass `deep=True` to include the memory used by the Python objects in the Series
    - It also includes the memory used by the index

In [11]:
city_mpg.nbytes

329152

In [12]:
city_mpg.astype('Int16').nbytes

123432

In [16]:
df.memory_usage(deep=True)

Index             128
barrels08      329152
barrelsA08     329152
charge120      329152
charge240      329152
               ...   
modifiedOn    3497240
startStop     1562048
phevCity       329152
phevHwy        329152
phevComb       329152
Length: 84, dtype: int64

In [17]:
make = df.make

In [20]:
print(f"""make.nbytes: {make.nbytes}
make.memory_usage(): {make.memory_usage()}
make.memory_usage(deep=True): {make.memory_usage(deep=True)}""")

make.nbytes: 329152
make.memory_usage(): 329280
make.memory_usage(deep=True): 2606395
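The three numbers differ because each one counts something different. Here is a small sketch on a hypothetical object-dtype Series (the toy labels are an assumption, and the exact byte counts depend on the platform and Python version, so none are shown).

```python
import pandas as pd

# Hypothetical toy Series of Python strings, forced to object dtype.
s = pd.Series(['Ford', 'Toyota', 'Ford', 'BMW'], dtype='object')

data_bytes = s.nbytes                     # only the array of pointers to the strings
no_index = s.memory_usage(index=False)    # same as nbytes for this Series
with_index = s.memory_usage()             # pointer array plus the index
deep = s.memory_usage(deep=True)          # also counts the Python string objects

print(data_bytes, no_index, with_index, deep)
```

For object columns such as `make`, only the `deep=True` figure reflects the real footprint, which is why it is so much larger than the other two.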


- Saving memory using the `category` type

In [22]:
(make
 .astype('category')
 .memory_usage(deep=True)
 )

95888
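The saving comes from storing each unique label once and keeping a small integer code per row. A minimal sketch, assuming a hypothetical Series with three repeated labels:

```python
import pandas as pd

# Hypothetical toy data: three distinct labels repeated many times.
labels = pd.Series(['Ford', 'Toyota', 'BMW'] * 10_000, dtype='object')
as_cat = labels.astype('category')

before = labels.memory_usage(deep=True)
after = as_cat.memory_usage(deep=True)

print(as_cat.cat.codes.dtype)       # int8: one small integer code per row
print(list(as_cat.cat.categories))  # the unique labels, stored once
print(f"{before} -> {after} bytes ({1 - after / before:.0%} smaller)")
```

The fewer unique values a column has relative to its length, the larger the saving tends to be.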

## Strings and category types
- The categorical type can save a lot of memory
- String functionality is kept

In [51]:
city_mpg.astype(str)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: object

In [50]:
city_mpg.astype('category')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6, 7, 8, 9, ..., 137, 138, 140, 150]
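To illustrate that string functionality survives the conversion, here is a short sketch on hypothetical text data (a stand-in for a column such as `make`). It assumes the categories are strings, which is what makes the `.str` accessor available.

```python
import pandas as pd

# Hypothetical toy data standing in for a text column.
makes = pd.Series(['Ford', 'Toyota', 'Ford', 'BMW']).astype('category')

upper = makes.str.upper()              # string methods still work via .str
starts_f = makes.str.startswith('F')

print(upper.tolist())     # ['FORD', 'TOYOTA', 'FORD', 'BMW']
print(starts_f.tolist())  # [True, False, True, False]
print(upper.dtype)        # no longer categorical; cast again if needed
```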

## Ordered Categories
- To create ordered categories, you need to define your own CategoricalDtype:

In [25]:
values = pd.Series(sorted(set(city_mpg)))
city_type = pd.CategoricalDtype(categories=values, ordered=True)

In [26]:
city_mpg.astype(city_type)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6 < 7 < 8 < 9 ... 137 < 138 < 140 < 150]
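An ordered categorical sorts and compares by category order rather than alphabetically. A minimal sketch with hypothetical size labels (not from the vehicles dataset):

```python
import pandas as pd

# Hypothetical toy data with a natural order.
sizes = pd.Series(['small', 'large', 'medium', 'small'])
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'],
                                ordered=True)
ordered = sizes.astype(size_type)

print(ordered.sort_values().tolist())  # ['small', 'small', 'medium', 'large']
print((ordered > 'small').tolist())    # [False, True, True, False]
print(ordered.max())                   # large
```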

## Converting to other types

- `to_numpy` and `values`: NumPy array of the values
- `to_list`: returns a Python list
- `to_frame`: converts the Series to a DataFrame

In [28]:
city_mpg.to_frame()

       city08
0          19
1           9
2          23
3          10
4          17
...       ...
41139      19
41140      20
41141      18
41142      18
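The conversions listed in this section can be compared side by side. This is a small sketch on a hypothetical three-value Series that borrows the `city08` name:

```python
import pandas as pd

# Hypothetical toy data.
s = pd.Series([19, 9, 23], name='city08')

arr = s.to_numpy()    # NumPy array
lst = s.to_list()     # list of Python ints
frame = s.to_frame()  # one-column DataFrame named after the Series

print(type(arr).__name__)                   # ndarray
print(lst)                                  # [19, 9, 23]
print(frame.columns.tolist(), frame.shape)  # ['city08'] (3, 1)
```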


## Exercises
1. Convert a numeric column to a smaller type.

In [30]:
my_series = pd.Series([0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
my_series.nbytes

88

In [31]:
my_series.dtype

dtype('int64')

In [34]:
my_series.astype('Int8')

0      0
1      1
2      1
3      2
4      3
5      5
6      8
7     13
8     21
9     34
10    55
dtype: Int8

In [37]:
my_series.astype('Int8').nbytes

22

2. Calculate the memory savings by converting to smaller numeric types.

In [36]:
print(f"Saved memory: {my_series.nbytes - my_series.astype('Int8').nbytes}")

Saved memory: 66
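The `Int8` result is 22 bytes rather than 11 because the nullable type keeps a one-byte mask next to each value. As a sketch, the same Series can be compared against the plain NumPy `int8` type. The explicit `dtype='int64'` is an addition here so the starting size is the same on every platform.

```python
import pandas as pd

my_series = pd.Series([0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55], dtype='int64')

nullable = my_series.astype('Int8')  # pandas nullable integer: values plus mask
plain = my_series.astype('int8')     # NumPy integer: values only

print(my_series.nbytes, nullable.nbytes, plain.nbytes)  # 88 22 11
```

The plain `int8` is smaller, but it cannot represent missing values, so the choice depends on whether the column may contain them.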


3. Convert a column to a string type and to a categorical type.

In [52]:
city_mpg.astype(str)

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: object

In [54]:
city_mpg.astype('category')

0        19
1         9
2        23
3        10
4        17
         ..
41139    19
41140    20
41141    18
41142    18
41143    16
Name: city08, Length: 41144, dtype: category
Categories (105, int64): [6, 7, 8, 9, ..., 137, 138, 140, 150]

4. Calculate the memory savings by converting to a categorical type.

In [55]:
print(f"Saved memory: {city_mpg.nbytes - city_mpg.astype('category').nbytes}")

Saved memory: 287168
