# The apply and the lambda functions

On Monday, 2.13, we learned how to apply a function to every row of a pandas dataframe and save the output to a new column. One way to do this is the apply function. In some cases, we also have to pass a lambda function to apply. In this notebook I will try to clarify these two techniques and when you need to use one versus the other.

First, read in our Children's Literature dataset, and do some of the pre-processing steps we did in the tutorial. I'll do this in one chunk of code.

In [1]:
#import necessary modules
import pandas
import nltk
import string

#read in our data
df = pandas.read_csv("../data/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
#drop missing texts
df = df.dropna(subset=['text'])
#split the text into a list
df['text_split']=df['text'].str.split()

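If we want to do something to the entire value of each cell in a column, we can use the apply function. It calls the function we give it once per row, passing in that row's cell value, and collects the results.

For example, if we want to take the length of the list we just created (the 'text_split' column), we can apply the len function.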

In [2]:
df['word_count'] = df['text_split'].apply(len)
df['word_count']

0       96493
1      100603
2       85132
3       92822
4       48251
5       47458
6       22213
7       81524
8       62437
9       36261
10      36121
11      65607
12      26878
13      44150
14     111797
15      94611
16     111717
17      58938
18     109877
19     104108
20      96284
21     114023
22      52361
23     134448
24     118203
25      63804
26      96622
27      76560
28      70129
29      72278
        ...  
102     98097
103     61837
104     35763
105     97099
106     55877
107     56009
108     74173
109     93133
110     11846
111     25056
112     30390
113     66809
114    123786
115    102634
116    117043
117     76185
118     91105
119     86289
120     31944
121    103197
122     68589
123     76593
124     56018
125     53942
126    108275
127     70005
128     56490
129     96425
130    120707
131    123614
Name: word_count, dtype: int64
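
The same step is easier to check by eye on a tiny dataframe. The sketch below uses made-up strings in place of the novels (the `toy` dataframe and its contents are my own stand-ins, not part of the dataset), and should run on its own.

```python
import pandas

# Hypothetical toy data standing in for the 'text' column of the real dataset
toy = pandas.DataFrame({'text': ['the cat sat', 'a dog', 'one two three four']})

# Split each string on whitespace, giving one list of words per row
toy['text_split'] = toy['text'].str.split()

# .apply calls len once per cell, passing that row's list as the argument
toy['word_count'] = toy['text_split'].apply(len)

print(toy['word_count'].tolist())  # [3, 2, 4]
```

Note that we pass `len` itself, without parentheses. We are handing the function to apply, not calling it ourselves.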

We can also apply nltk functions, as long as the function takes the entire value of each cell as its input. So we can, for example, tokenize the title column.

In [3]:
df['title_token'] = df['title'].apply(nltk.word_tokenize)
df['title_token']

0                           [A, Dog, with, a, Bad, Name]
1                                  [A, Final, Reckoning]
2      [A, House, Party, ,, Don, Gesualdo, ,, and, A,...
3                               [A, Houseful, of, Girls]
4                             [A, Little, Country, Girl]
5                                      [A, Round, Dozen]
6                                  [A, Sailor, 's, Lass]
7                                  [A, World, of, Girls]
8                                [Adrift, in, the, Wild]
9                               [Adventures, in, Africa]
10                           [Adventures, in, Australia]
11                                         [All, Adrift]
12                             [Battles, With, the, Sea]
13                    [Bimbi, :, Stories, for, Children]
14     [Blown, To, Bits, ;, Or, The, Lonely, Man, of,...
15            [Blue, Lights, Hot, Work, In, the, Soudan]
16                             [Bonnie, Prince, Charlie]
17                             

A lambda function is not a separate way of applying things. It is a short, unnamed function that we write inside the call to apply, and it allows us to do more if needed. We can re-do what we did above using a lambda function.

In [4]:
#apply the len function using .apply
df['word_count'] = df['text_split'].apply(len)
#apply the len function using lambda. This line does the same thing as the .apply(len) line above
df['word_count_lambda'] = df['text_split'].apply(lambda x: len(x))

#apply the nltk.word_tokenize function using .apply
df['title_token'] = df['title'].apply(nltk.word_tokenize)
#do the same using lambda. The next line does the same as the .apply(nltk.word_tokenize) line above.
df['title_token_lambda'] = df['title'].apply(lambda x: nltk.word_tokenize(x))

df[['word_count', 'word_count_lambda','title_token', 'title_token_lambda']]

| | word_count | word_count_lambda | title_token | title_token_lambda |
|---|---|---|---|---|
| 0 | 96493 | 96493 | [A, Dog, with, a, Bad, Name] | [A, Dog, with, a, Bad, Name] |
| 1 | 100603 | 100603 | [A, Final, Reckoning] | [A, Final, Reckoning] |
| 2 | 85132 | 85132 | [A, House, Party, ,, Don, Gesualdo, ,, and, A,... | [A, House, Party, ,, Don, Gesualdo, ,, and, A,... |
| 3 | 92822 | 92822 | [A, Houseful, of, Girls] | [A, Houseful, of, Girls] |
| 4 | 48251 | 48251 | [A, Little, Country, Girl] | [A, Little, Country, Girl] |
| 5 | 47458 | 47458 | [A, Round, Dozen] | [A, Round, Dozen] |
| 6 | 22213 | 22213 | [A, Sailor, 's, Lass] | [A, Sailor, 's, Lass] |
| 7 | 81524 | 81524 | [A, World, of, Girls] | [A, World, of, Girls] |
| 8 | 62437 | 62437 | [Adrift, in, the, Wild] | [Adrift, in, the, Wild] |
| 9 | 36261 | 36261 | [Adventures, in, Africa] | [Adventures, in, Africa] |
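
It may help to see that a lambda is only shorthand for a small named function. The sketch below compares three ways of passing "take the length" to apply on a made-up dataframe. The names `toy` and `count_words` are my own, chosen for illustration.

```python
import pandas

# Hypothetical toy column of word lists
toy = pandas.DataFrame({'text_split': [['a', 'b'], ['c'], ['d', 'e', 'f']]})

# A named function that takes one cell value and returns its length
def count_words(x):
    return len(x)

by_builtin = toy['text_split'].apply(len)                # pass the built-in directly
by_named = toy['text_split'].apply(count_words)          # pass a named function
by_lambda = toy['text_split'].apply(lambda x: len(x))    # pass an unnamed function

# All three produce the same column
print(by_builtin.equals(by_named) and by_named.equals(by_lambda))  # True
```

In each case apply receives something it can call with one cell value. The lambda just saves us from writing a separate `def`.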


Sometimes we can't pass an existing function to apply on its own; we must also write a lambda function. This is the case if the column contains a list, and we want to loop through the list. For example, if we want to remove punctuation from our title tokens, we can do this using list comprehension. If we try to put the list comprehension directly inside apply, we get an error.

In [5]:
df['title_token_clean'] = df['title_token'].apply([word for word in df['title_token'] if word not in list(string.punctuation)])
df['title_token_clean']

TypeError: 'list' object is not callable

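We got a TypeError: 'list' object is not callable. Python evaluates the list comprehension first, so apply is handed a finished list rather than a function it can call on each row.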

If we try to indicate each cell's list by a variable, for example the variable 'x', we get another error, NameError: name 'x' is not defined, because nothing has defined x by the time the list comprehension runs.

In [6]:
df['title_token_clean'] = df['title_token'].apply([word for word in x if word not in list(string.punctuation)])
df['title_token_clean']

NameError: name 'x' is not defined

To turn the list comprehension into something callable, and to define the variable that stands for each cell's list, we can write a lambda function.
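
The difference between the two can be checked with Python's built-in `callable`. This is a small sketch on one made-up token list, without pandas. The names `tokens`, `result` and `strip_punct` are my own.

```python
import string

# One hypothetical cell value: a tokenized title
tokens = ['Bimbi', ':', 'Stories', 'for', 'Children']

# A list comprehension runs immediately and produces a plain list
result = [word for word in tokens if word not in list(string.punctuation)]
print(callable(result))       # False: a list cannot be called like a function

# Wrapping the same expression in a lambda produces a function instead.
# Nothing runs until the function is called with a value for x.
strip_punct = lambda x: [word for word in x if word not in list(string.punctuation)]
print(callable(strip_punct))  # True
print(strip_punct(tokens))    # ['Bimbi', 'Stories', 'for', 'Children']
```

This is also why the NameError goes away: `lambda x:` is what defines x, and it only gets a value when the function is called on a cell.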

In [7]:
df['title_token_clean'] = df['title_token'].apply(lambda x: [word for word in x if word not in list(string.punctuation)])
df['title_token_clean']

0                           [A, Dog, with, a, Bad, Name]
1                                  [A, Final, Reckoning]
2      [A, House, Party, Don, Gesualdo, and, A, Rainy...
3                               [A, Houseful, of, Girls]
4                             [A, Little, Country, Girl]
5                                      [A, Round, Dozen]
6                                  [A, Sailor, 's, Lass]
7                                  [A, World, of, Girls]
8                                [Adrift, in, the, Wild]
9                               [Adventures, in, Africa]
10                           [Adventures, in, Australia]
11                                         [All, Adrift]
12                             [Battles, With, the, Sea]
13                       [Bimbi, Stories, for, Children]
14     [Blown, To, Bits, Or, The, Lonely, Man, of, Ra...
15            [Blue, Lights, Hot, Work, In, the, Soudan]
16                             [Bonnie, Prince, Charlie]
17                             

The lambda function, or nameless function, allows us to name its input. In the case above, the variable 'x' stands for the title_token list as a whole (the cell for one row), and the variable 'word' in the list comprehension stands for each element of that list. We then apply this lambda function to every row in our dataframe using the .apply function. The combination of apply and lambda allows us to do some really powerful things.
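
If the lambda starts to feel crowded, the same logic can be written as a named function and passed to apply. This is a sketch on made-up titles. The helper name `remove_punctuation` and the `PUNCTUATION` set are my own choices, not something from the tutorial.

```python
import string
import pandas

# Hypothetical toy column of tokenized titles
toy = pandas.DataFrame({'title_token': [
    ['Bimbi', ':', 'Stories', 'for', 'Children'],
    ['A', 'Sailor', "'s", 'Lass'],
]})

# Build the punctuation collection once, rather than once per word
PUNCTUATION = set(string.punctuation)

def remove_punctuation(tokens):
    # 'tokens' plays the role of x in the lambda: one row's list of tokens
    return [word for word in tokens if word not in PUNCTUATION]

toy['title_token_clean'] = toy['title_token'].apply(remove_punctuation)
print(toy['title_token_clean'].tolist())
# [['Bimbi', 'Stories', 'for', 'Children'], ['A', 'Sailor', "'s", 'Lass']]
```

As in the output above, the token 's is kept. It is not itself a single punctuation character, so it does not match anything in string.punctuation.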

Pandas is continuing to add functions so we won't always need to use lambda, but in some cases we still need it.
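
One example of a built-in alternative is the `.str.len()` method, which, as far as I know, works on cells that hold lists as well as strings. The sketch below compares it with apply on a made-up column (`toy` is my own stand-in for the real dataframe).

```python
import pandas

# Hypothetical toy column of word lists
toy = pandas.DataFrame({'text_split': [['a', 'b'], ['c'], ['d', 'e', 'f']]})

# Word count with apply, as in the cells above
with_apply = toy['text_split'].apply(len)

# Word count with the built-in string accessor, no apply or lambda needed
with_str = toy['text_split'].str.len()

print(with_apply.tolist())  # [2, 1, 3]
print(with_str.tolist())    # [2, 1, 3]
```

There is no built-in like this for the punctuation filter, so that step still needs apply with a lambda or a named function.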