# Week 9 Pandas Part 3

This week we introduce another data structure in Python: dictionaries. Dictionaries store key:value relationships and can make manipulating dataframes more efficient. We will use dictionaries to load data with specific data types, run advanced aggregations, rename columns, and create new columns.

## Dictionaries

Dictionaries are a helpful way to store information when one element (the key) is attached to another specific element (the value). If you were to look up apple (the key) in Webster's dictionary, you would find its definition, the round fruit of a tree of the rose family (the value).

In [1]:
import pandas as pd
import numpy as np

# dictionary syntax
# {key:value}

my_dict = {'apple':['the round fruit of a tree of the rose family']}

In [2]:
# we can call the value of any given key of the dictionary

my_dict['apple']

['the round fruit of a tree of the rose family']

In [3]:
# we can't look up values by position like we would with a list;
# 0 is treated as a key, and this dictionary has no such key

my_dict[0]

KeyError: 0
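
When we are not sure a key exists, a few built-in tools avoid the KeyError. The following is a minimal sketch; the dictionary name and the pear entry are made up for illustration.

```python
# a small dictionary for illustration (hypothetical entries)
definitions = {'apple': 'the round fruit of a tree of the rose family'}

# `in` checks whether a key exists
has_apple = 'apple' in definitions
has_pear = 'pear' in definitions

# .get() returns a default value instead of raising a KeyError
pear_definition = definitions.get('pear', 'no definition found')

# assigning to a new key adds an entry to the dictionary
definitions['pear'] = 'a sweet fruit that narrows toward the stalk'

# .items() lets us loop over key/value pairs
for word, meaning in definitions.items():
    print(word, '->', meaning)
```
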

In [4]:
# dictionaries can store many different types of structures as values

my_dict = {
    
    'my_string':'round',
    'my_int':1,
    'my_list':[1,2,3,4],
    'my_tuple':(3,5),
    'my_df':pd.DataFrame(),
}


In [5]:
# only immutable (hashable) objects can be used as keys (e.g. strings, integers, tuples)

my_dict = {
    
    1:143,
    'one':5653,
    (9,3):34534
}
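
To see why the key type matters, we can try a mutable object such as a list as a key. This is a small sketch with made-up values; Python rejects the list because it cannot be hashed.

```python
# tuples are immutable, so they are hashable and work as keys
coords = {(9, 3): 'tuple key'}

# lists are mutable, so Python refuses to use them as keys
try:
    bad = {[9, 3]: 'list key'}
except TypeError as err:
    error_message = str(err)
    print(error_message)
```
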

## Loading in Columns as Specific Data Types

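To save memory, we can load our data with specific data types that take up less space, as long as the smaller type can still hold every value in the column. The dtype argument of read_csv accepts a dictionary that maps column names to data types.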

In [6]:
cereal_data = pd.read_csv('data/cereal.csv')

cereal_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
name        77 non-null object
mfr         77 non-null object
type        77 non-null object
calories    77 non-null int64
protein     77 non-null int64
fat         77 non-null int64
sodium      77 non-null int64
fiber       77 non-null float64
carbo       77 non-null float64
sugars      77 non-null int64
potass      77 non-null int64
vitamins    77 non-null int64
shelf       77 non-null int64
weight      77 non-null float64
cups        77 non-null float64
rating      77 non-null float64
dtypes: float64(5), int64(8), object(3)
memory usage: 9.7+ KB


In [7]:
# int8 holds whole numbers from -128 to 127, so only columns that stay in that range can use it
# calories, sodium and potass go above 127 in this dataset, so they get int16 instead

dtype_dict = {
    'calories':'int16',
    'protein':'int8',
    'fat':'int8',
    'sodium':'int16',
    'sugars':'int8',
    'potass':'int16',
    'vitamins':'int8',
    'shelf':'int8',
    'fiber':'float16',
    'carbo':'float16'
}

cereal_small = pd.read_csv('data/cereal.csv', dtype = dtype_dict)

cereal_small.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
name        77 non-null object
mfr         77 non-null object
type        77 non-null object
calories    77 non-null int16
protein     77 non-null int8
fat         77 non-null int8
sodium      77 non-null int16
fiber       77 non-null float16
carbo       77 non-null float16
sugars      77 non-null int8
potass      77 non-null int16
vitamins    77 non-null int8
shelf       77 non-null int8
weight      77 non-null float64
cups        77 non-null float64
rating      77 non-null float64
dtypes: float16(2), float64(3), int16(3), int8(5), object(3)
memory usage: 4.8+ KB
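
Narrow integer types only hold a limited range of values, so it can help to check each column's minimum and maximum before choosing a type. The sketch below uses a small made-up dataframe rather than the cereal file, and the names df, fits_int8 and safe_dtypes are just for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for a couple of numeric columns
df = pd.DataFrame({
    'fat': [1, 5, 0],
    'sodium': [130, 15, 260],
})

# the range an 8-bit signed integer can hold
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)

# True where every value in the column fits inside int8
fits_int8 = (df.min() >= int8_info.min) & (df.max() <= int8_info.max)
print(fits_int8)

# build a dtype dictionary from the check and apply it with astype
safe_dtypes = {col: ('int8' if ok else 'int16') for col, ok in fits_int8.items()}
df_small = df.astype(safe_dtypes)
print(df_small.dtypes)
```

The same kind of dictionary can be passed to the dtype argument of read_csv once we know which type each column needs.
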


## Advanced Group by

A groupby that applies different operations to different columns is a situation where dictionaries come in handy.

In [8]:
cereal_data.head()

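| | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |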


In [13]:
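# what if we want the sum of calories and the mean of fat?
# passing a list applies every function to every column, which is more than we asked for
# (name and type hold text, so we drop them before taking the mean)

cereal_data.drop(columns=['name', 'type']).groupby('mfr').agg(['sum', 'mean'])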

| mfr | calories sum | calories mean | protein sum | protein mean | fat sum | fat mean | sodium sum | sodium mean | fiber sum | fiber mean | ... | vitamins sum | vitamins mean | shelf sum | shelf mean | weight sum | weight mean | cups sum | cups mean | rating sum | rating mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 100 | 100.0 | 4 | 4.0 | 1 | 1.0 | 0 | 0.0 | 0.0 | 0.0 | ... | 25 | 25.0 | 2 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 54.850917 | 54.850917 |
| G | 2450 | 111.363636 | 51 | 2.318182 | 30 | 1.363636 | 4410 | 200.454545 | 28.0 | 1.272727 | ... | 775 | 35.227273 | 47 | 2.136364 | 23.08 | 1.049091 | 19.25 | 0.875 | 758.688737 | 34.485852 |
| K | 2500 | 108.695652 | 61 | 2.652174 | 14 | 0.608696 | 4020 | 174.782609 | 63.0 | 2.73913 | ... | 800 | 34.782609 | 54 | 2.347826 | 24.79 | 1.077826 | 18.31 | 0.796087 | 1012.884634 | 44.038462 |
| N | 520 | 86.666667 | 17 | 2.833333 | 1 | 0.166667 | 225 | 37.5 | 24.0 | 4.0 | ... | 50 | 8.333333 | 10 | 1.666667 | 5.83 | 0.971667 | 4.67 | 0.778333 | 407.811403 | 67.968567 |
| P | 980 | 108.888889 | 22 | 2.444444 | 8 | 0.888889 | 1315 | 146.111111 | 25.0 | 2.777778 | ... | 225 | 25.0 | 22 | 2.444444 | 9.58 | 1.064444 | 6.43 | 0.714444 | 375.351697 | 41.705744 |
| Q | 760 | 95.0 | 21 | 2.625 | 14 | 1.75 | 740 | 92.5 | 10.7 | 1.3375 | ... | 100 | 12.5 | 19 | 2.375 | 7.0 | 0.875 | 6.59 | 0.82375 | 343.327919 | 42.91599 |
| R | 920 | 115.0 | 20 | 2.5 | 10 | 1.25 | 1585 | 198.125 | 15.0 | 1.875 | ... | 200 | 25.0 | 16 | 2.0 | 8.0 | 1.0 | 6.97 | 0.87125 | 332.343977 | 41.542997 |


In [14]:
# map each column to the aggregation we want for it
agg_dict = {
    'calories':'sum',
    'protein':'median',
    'fiber':'count'
}

cereal_grp = cereal_data.groupby('mfr').agg(agg_dict).reset_index()

cereal_grp

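| | mfr | calories | protein | fiber |
|---|---|---|---|---|
| 0 | A | 100 | 4.0 | 1 |
| 1 | G | 2450 | 2.0 | 22 |
| 2 | K | 2500 | 3.0 | 23 |
| 3 | N | 520 | 3.0 | 6 |
| 4 | P | 980 | 3.0 | 9 |
| 5 | Q | 760 | 2.5 | 8 |
| 6 | R | 920 | 2.0 | 8 |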
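
As an alternative to a dictionary, pandas 0.25 and later also support named aggregation, where each keyword gives the output column name and a (column, function) pair. The sketch below uses a small made-up dataframe, so the names toy and summary are only illustrative.

```python
import pandas as pd

# hypothetical stand-in for the cereal data
toy = pd.DataFrame({
    'mfr': ['K', 'K', 'G'],
    'calories': [70, 50, 110],
    'protein': [4, 4, 2],
})

# each keyword is the new column name; the tuple is (source column, function)
summary = (
    toy.groupby('mfr')
       .agg(calories_sum=('calories', 'sum'),
            protein_median=('protein', 'median'))
       .reset_index()
)
print(summary)
```

This form chooses the aggregation and the output name in one step. The dictionary form keeps the original column names.
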


## Renaming columns

Now that we have our aggregated data, we might want to rename the columns so we don't confuse them with the columns of our original dataset. We can use another dictionary to do this!

In [15]:
rename_dict = {
    
    'calories':'calories_sum',
    'protein':'protein_median',
    'fiber':'mfr_item_count'
}

cereal_grp = cereal_grp.rename(columns = rename_dict)

cereal_grp

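| | mfr | calories_sum | protein_median | mfr_item_count |
|---|---|---|---|---|
| 0 | A | 100 | 4.0 | 1 |
| 1 | G | 2450 | 2.0 | 22 |
| 2 | K | 2500 | 3.0 | 23 |
| 3 | N | 520 | 3.0 | 6 |
| 4 | P | 980 | 3.0 | 9 |
| 5 | Q | 760 | 2.5 | 8 |
| 6 | R | 920 | 2.0 | 8 |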
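
One thing to watch for when renaming with a dictionary: by default, keys that match no column are skipped without any warning, so a typo can go unnoticed. The sketch below uses a made-up dataframe and a deliberately misspelled key to show the default behavior and the errors='raise' option (available in pandas 0.25 and later).

```python
import pandas as pd

# hypothetical one-row dataframe for illustration
toy = pd.DataFrame({'calories': [100], 'fiber': [1]})

# 'protien' is a deliberate typo; by default it is skipped silently
renamed = toy.rename(columns={'calories': 'calories_sum', 'protien': 'protein_median'})
print(list(renamed.columns))

# errors='raise' turns a key with no matching column into a KeyError
try:
    toy.rename(columns={'protien': 'protein_median'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
```
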


## Mapping values to create columns

A dictionary can also translate the values in a column. The map method looks up each value in the column as a key and returns the matching dictionary value, which we can store in a new column.

In [16]:
mfr_dict = {
    
    'A':'American Home Food Products',
    'G':'General Mills',
    'K':'Kelloggs',
    'N':'Nabisco',
    'P':'Post',
    'Q':'Quaker',
    'R':'Ralston Purina'
}

cereal_grp['mfr_full'] = cereal_grp['mfr'].map(mfr_dict)

cereal_grp

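| | mfr | calories_sum | protein_median | mfr_item_count | mfr_full |
|---|---|---|---|---|---|
| 0 | A | 100 | 4.0 | 1 | American Home Food Products |
| 1 | G | 2450 | 2.0 | 22 | General Mills |
| 2 | K | 2500 | 3.0 | 23 | Kelloggs |
| 3 | N | 520 | 3.0 | 6 | Nabisco |
| 4 | P | 980 | 3.0 | 9 | Post |
| 5 | Q | 760 | 2.5 | 8 | Quaker |
| 6 | R | 920 | 2.0 | 8 | Ralston Purina |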
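
When a value has no matching key in the dictionary, map returns a missing value (NaN) for that row instead of raising an error. The sketch below uses a made-up series with an invented code 'X' to show this, along with one possible way of filling the gap.

```python
import pandas as pd

# 'X' is a made-up code with no entry in the dictionary
codes = pd.Series(['K', 'G', 'X'])
names = {'K': 'Kelloggs', 'G': 'General Mills'}

# unmatched values become NaN
mapped = codes.map(names)
print(mapped)

# fillna supplies a label for anything the dictionary did not cover
filled = codes.map(names).fillna('Unknown')
print(filled)
```
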
