In [1]:
import pandas as pd
import numpy as np

![](qtn.PNG)

In [2]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [3]:
drinks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.1+ KB


`object` usually means the column holds strings. However, a pandas Series can also hold Python lists, Python dictionaries, or any other arbitrary Python object. In each case pandas stores only a reference to the object and reports the dtype as `object`.
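As a small illustration of the point above (the Series names `lists` and `dicts` are made up for this sketch and are not part of the drinks dataset), a Series of lists and a Series of dictionaries both report the `object` dtype:

```python
import pandas as pd

# Hypothetical example data: each element is an arbitrary Python object.
lists = pd.Series([[1, 2], [3], []])
dicts = pd.Series([{'a': 1}, {'b': 2}])

print(lists.dtype)      # object
print(dicts.dtype)      # object
print(type(lists[0]))   # the stored element is still a plain Python list
```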

Observe that __memory usage: 9.1+ KB__ means the data uses at least 9.1 KB of memory. The `+` is there because `object` columns hold references to other Python objects, and to keep `info()` fast pandas counts only the space taken by those references, not by the objects they point to. The true figure can therefore be larger, depending on the size of the objects.
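The difference between the two ways of counting can be sketched on a tiny Series. The three country names are just sample values, and `dtype=object` is set explicitly as an assumption so the behaviour matches the notebook on any pandas version:

```python
import pandas as pd

# Sample values only; dtype=object is forced to mirror the notebook's columns.
s = pd.Series(['Afghanistan', 'Albania', 'Algeria'], dtype=object)

# Shallow: counts only the pointer-sized references held by the Series.
shallow = s.memory_usage(index=False)
# Deep: also inspects each referenced string object and adds its size.
deep = s.memory_usage(index=False, deep=True)

print(shallow, deep)
```

The deep figure is larger because it includes the strings themselves, which is the extra memory the `+` hints at.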

In [4]:
drinks.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       193 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 30.4 KB


In [5]:
## Memory usage of drinks for each column
drinks.memory_usage(deep = True)

Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                       12332
dtype: int64

In [6]:
# Total memory usage
drinks.memory_usage(deep = True).sum()

31176

So the object columns can take up a lot of space.  

To make the DataFrame more space efficient:
- Store the strings as integers, since integers are more space efficient than strings.

In [7]:
drinks['continent'].unique()

array(['Asia', 'Europe', 'Africa', 'North America', 'South America',
       'Oceania'], dtype=object)

In [9]:
sorted(drinks['continent'].unique())

['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']

In [8]:
drinks['continent'].head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: object

We need to map each string to an integer, using something like a lookup table.

For example:

0 - Asia  
1 - Europe
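A hand-rolled version of such a lookup table might look like the sketch below. Sorting the labels before numbering them is an assumption made here (any consistent numbering would do), and the five sample values mirror the first rows of the `continent` column:

```python
import pandas as pd

# First five continent values from the dataset, used as sample data.
continents = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Africa'])

# Build the lookup table by hand: one integer per distinct string.
lookup = {name: code for code, name in enumerate(sorted(continents.unique()))}

# Replace every string by its integer code.
codes = continents.map(lookup)

print(lookup)          # {'Africa': 0, 'Asia': 1, 'Europe': 2}
print(codes.tolist())  # [1, 2, 0, 2, 0]
```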

***

We do not need to do this separately, because pandas has a __category__ data type that does exactly the same thing.

So we convert the object column to the category type.

In [10]:
drinks['continent'] = drinks['continent'].astype('category')

In [11]:
drinks.dtypes

country                           object
beer_servings                      int64
spirit_servings                    int64
wine_servings                      int64
total_litres_of_pure_alcohol     float64
continent                       category
dtype: object

In [12]:
## it still looks the same but under the hood it stores these strings as integers
drinks['continent'].head()

0      Asia
1    Europe
2    Africa
3    Europe
4    Africa
Name: continent, dtype: category
Categories (6, object): [Africa, Asia, Europe, North America, Oceania, South America]

In [13]:
# to see integer assigned to each string
drinks.continent.cat.codes.head()

0    1
1    2
2    0
3    2
4    0
dtype: int8
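To see which string each code stands for, the codes can be paired with `cat.categories`. This is a minimal sketch on a five-row sample rather than the full `drinks` DataFrame; the variable names are illustrative:

```python
import pandas as pd

# Five sample values, converted to the category dtype.
continent = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Africa']).astype('category')

# The position of each label in cat.categories is its integer code.
lookup = dict(enumerate(continent.cat.categories))
print(lookup)  # {0: 'Africa', 1: 'Asia', 2: 'Europe'}

# Going back from codes to labels reproduces the original values.
restored = continent.cat.categories.take(continent.cat.codes.to_numpy())
print(list(restored))  # ['Asia', 'Europe', 'Africa', 'Europe', 'Africa']
```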

We did this to reduce memory usage, so let's check:

In [14]:
drinks.memory_usage(deep = True)

Index                              80
country                         12588
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

In [15]:
drinks.memory_usage(deep = True).sum()

19588

The number of bytes dropped because the column now stores small integers that point into a lookup table of 6 strings. Each string is stored only once, in the lookup table, and every row stores just an integer code, which is much more space efficient.
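The two parts of that storage (integer codes plus a small table of labels) can be inspected directly. The sketch below uses synthetic data, three labels repeated 1000 times, as an assumption in place of the real column:

```python
import pandas as pd

# Synthetic column: 5000 rows but only 3 distinct strings.
s = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Africa'] * 1000, dtype=object)
c = s.astype('category')

print(c.cat.codes.dtype)      # int8: one byte per row
print(c.cat.codes.nbytes)     # 5000
print(len(c.cat.categories))  # 3 strings stored once

# Deep memory usage before and after the conversion.
before = s.memory_usage(index=False, deep=True)
after = c.memory_usage(index=False, deep=True)
print(before, after)
```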

Applying the same conversion to the `country` column:

In [16]:
drinks['country'] = drinks['country'].astype('category')
drinks.memory_usage(deep = True)

Index                              80
country                         18094
beer_servings                    1544
spirit_servings                  1544
wine_servings                    1544
total_litres_of_pure_alcohol     1544
continent                         744
dtype: int64

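Observe that for `country` the size increased (from 12588 to 18094 bytes). All 193 country names are distinct, so we create 193 categories: every string still has to be stored in the lookup table, and the 193 integer codes are extra, so the storage space increases.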


So use `category` only when the column has few distinct values relative to its number of rows.
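One way to check this before converting is to measure both representations. The helper `category_saves_memory` below is a hypothetical name introduced for this sketch, not a pandas function, and the two Series are synthetic stand-ins for a low-cardinality and an all-unique column:

```python
import pandas as pd

def category_saves_memory(s):
    """Return True if converting s to category lowers its deep memory usage."""
    before = s.memory_usage(index=False, deep=True)
    after = s.astype('category').memory_usage(index=False, deep=True)
    return after < before

# Few distinct values repeated many times (like continent).
few = pd.Series(['Asia', 'Europe', 'Africa'] * 100, dtype=object)
# Every value unique (like country).
many = pd.Series([f'country_{i}' for i in range(193)], dtype=object)

# Share of distinct values, and whether the conversion pays off.
print(few.nunique() / len(few), category_saves_memory(few))
print(many.nunique() / len(many), category_saves_memory(many))
```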

Besides lowering memory usage, it can also speed up operations such as group by.
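A rough way to try this out is sketched below on synthetic data (the column names are borrowed from the dataset, the values are random). It only confirms that both dtypes give the same group means; any speed difference depends on the data and the pandas version, so timing is left as a commented suggestion:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data: the same values stored once as object and once as category.
df_obj = pd.DataFrame({
    'continent': pd.Series(rng.choice(['Africa', 'Asia', 'Europe'], size=n), dtype=object),
    'beer_servings': rng.integers(0, 400, size=n),
})
df_cat = df_obj.assign(continent=df_obj['continent'].astype('category'))

mean_obj = df_obj.groupby('continent')['beer_servings'].mean()
# observed=True restricts the result to categories that actually occur.
mean_cat = df_cat.groupby('continent', observed=True)['beer_servings'].mean()

print(np.allclose(mean_obj.to_numpy(), mean_cat.to_numpy()))

# In a notebook, the two versions can be timed with:
# %timeit df_obj.groupby('continent')['beer_servings'].mean()
# %timeit df_cat.groupby('continent', observed=True)['beer_servings'].mean()
```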

In [20]:
df = pd.DataFrame({'ID': [100, 101, 102, 103], 'quality': ['good', 'very good', 'good', 'excellent']})
df

Unnamed: 0,ID,quality
0,100,good
1,101,very good
2,102,good
3,103,excellent


In [21]:
df.sort_values('quality')

Unnamed: 0,ID,quality
3,103,excellent
0,100,good
2,102,good
1,101,very good


Now we will use the category data type __to define an ordered category, from lowest to highest.__

In [22]:
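# Note: passing `categories`/`ordered` to astype() is deprecated (it emits the warning shown below)
# and fails on newer pandas versions; cell In [25] shows the pd.Categorical alternative.
df['quality'] = df['quality'].astype('category', categories = ['good', 'very good', 'excellent'], ordered = True)
df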

  exec(code_obj, self.user_global_ns, self.user_ns)


Unnamed: 0,ID,quality
0,100,good
1,101,very good
2,102,good
3,103,excellent


In [23]:
df['quality']

0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]

To avoid the warning, build the column with `pd.Categorical` instead:
- [Categorical pandas documentation](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.Categorical.html#pandas.Categorical)

In [24]:
df = pd.DataFrame({'ID': [100, 101, 102, 103], 'quality': ['good', 'very good', 'good', 'excellent']})
df['quality']

0         good
1    very good
2         good
3    excellent
Name: quality, dtype: object

In [25]:
df['quality'] =  pd.Categorical(df['quality'], categories = ['good', 'very good', 'excellent'], ordered = True)
df['quality']

0         good
1    very good
2         good
3    excellent
Name: quality, dtype: category
Categories (3, object): [good < very good < excellent]
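Another option, not used in the notebook itself, is to describe the ordered type once with `CategoricalDtype` and pass it to `astype`. The name `quality_type` is chosen for this sketch:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'ID': [100, 101, 102, 103],
                   'quality': ['good', 'very good', 'good', 'excellent']})

# Describe the ordered categories once, then reuse the dtype object.
quality_type = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df['quality'].astype(quality_type)

print(df['quality'].cat.ordered)         # True
print(df['quality'].cat.codes.tolist())  # [0, 1, 0, 2]
```

Keeping the dtype in a variable makes it easy to apply the same ordering to several columns or DataFrames.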

In [26]:
# so sorting will result in the logical order as defined by us instead of alphabetical order
df.sort_values('quality')

Unnamed: 0,ID,quality
0,100,good
2,102,good
1,101,very good
3,103,excellent


We can also use logical conditions with an ordered category, so comparison operators such as `>` follow the category order we defined rather than alphabetical order.

In [27]:
# To get all rows where quality is better than good
df.loc[df['quality'] > 'good', :]

Unnamed: 0,ID,quality
1,101,very good
3,103,excellent


In [28]:
# To get all rows where quality is not equal to very good
df.loc[df['quality'] != 'very good', :]

Unnamed: 0,ID,quality
0,100,good
2,102,good
3,103,excellent
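The ordering also drives `min()` and `max()`, and it is what makes `>` meaningful at all. The sketch below rebuilds the `quality` values as a standalone Series and shows that an unordered copy rejects the same comparison:

```python
import pandas as pd

quality = pd.Series(pd.Categorical(['good', 'very good', 'good', 'excellent'],
                                   categories=['good', 'very good', 'excellent'],
                                   ordered=True))

# min/max follow the declared order, not alphabetical order.
print(quality.min(), quality.max())       # good excellent
print((quality >= 'very good').tolist())  # [False, True, False, True]

# Without an order, only == and != are allowed.
unordered = quality.cat.as_unordered()
raised = False
try:
    unordered > 'good'
except TypeError as err:
    raised = True
    print('TypeError:', err)
```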
