# Agenda: Strings in Pandas

1. How does Pandas store strings?
2. How to work with strings using `.str` (the "string accessor")
3. A lot of string methods that we might want to use
4. String methods on non-strings
5. Regular expressions and the string methods
6. Memory usage
7. Extension types
8. PyArrow

# Pandas and strings

We know that normally, Pandas stores its data in NumPy arrays.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30])
s

0    10
1    20
2    30
dtype: int64

In [4]:
s.values   # get the NumPy array back from behind the Pandas series

array([10, 20, 30])

In [5]:
s.values.dtype

dtype('int64')

In [6]:
s = Series([10.5, 20.5, 30.5])
s

0    10.5
1    20.5
2    30.5
dtype: float64

In [7]:
s.values.dtype

dtype('float64')

In [8]:
# let's say that I want to create a NumPy array of strings. Can I?

a = np.array('hello out there'.split())
a

array(['hello', 'out', 'there'], dtype='<U5')

When NumPy stores strings, it does so in C-style memory. An array (including a NumPy array), by definition, has elements that are
all of the same type. In this case, all of the elements of `a` are of type `<U5`, which means: a Unicode string of at most 5 characters.

In [9]:
a[0] = 'goodbye'

In [10]:
a

array(['goodb', 'out', 'there'], dtype='<U5')

For this reason, and many others, we don't want to use NumPy's strings. They are fragile, in that we can accidentally
remove characters, and they don't have the flexibility that we know and love from Python strings.
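
To make the fixed-width behavior concrete, here is a small sketch. The variable names are illustrative and not part of the session above. It relies on NumPy's `U` dtype using 4 bytes per character.

```python
import numpy as np

words = np.array(['hello', 'out', 'there'])

# The width is chosen from the longest string at creation time.
# NumPy's 'U' dtype uses 4 bytes per character.
width_in_chars = words.dtype.itemsize // 4

# Assigning a longer string silently truncates it to that width
words[0] = 'goodbye'
truncated = str(words[0])

# A new array that includes the longer string gets a wider dtype
wider = np.array(['goodbye', 'out', 'there'])
wider_width = wider.dtype.itemsize // 4

print(width_in_chars, truncated, wider_width)
```

The only way to fit the longer word is to build a new array, which is one reason this storage is awkward for text that changes.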

The Pandas solution to this problem is: Use Python strings!

In [11]:
s = Series('hello out there'.split())
s

0    hello
1      out
2    there
dtype: object

A dtype of `object` means: NumPy has a pointer ("reference") to a Python object. Most of the time,
if you see a dtype of `object`, it's because we're working with strings.

The good news is that we can retrieve these strings and run Python methods on them. Better yet,
the size of a string isn't constrained by the memory that NumPy allocated.
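
As a rough illustration of what `object` storage means, the sketch below passes `dtype=object` explicitly. That is an assumption made here so the example behaves the same across versions, since newer Pandas releases may choose a dedicated string dtype by default. The names `mixed` and `mixed_types` are made up for the example.

```python
import pandas as pd

# dtype=object is requested explicitly (see the note above)
s = pd.Series(['hello', 'out', 'there'], dtype=object)

# Every element is an ordinary Python str
all_python_strings = all(isinstance(word, str) for word in s)

# Replacing an element with a longer string does not truncate it
s.loc[0] = 'a much longer string than before'
longest = int(s.str.len().max())

# object can hold any Python object, not only strings
mixed = pd.Series(['hello', 10, 2.5])
mixed_types = [type(value).__name__ for value in mixed]

print(all_python_strings, longest, mixed.dtype, mixed_types)
```

So `object` usually signals strings, but it is worth checking when a column might contain a mix of types.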



In [12]:
s.loc[0] = 'goodbye'

In [13]:
s

0    goodbye
1        out
2      there
dtype: object

In [15]:
type(s.loc[0])

str

Now that we know that we have Python strings... how can we invoke methods on them? Should we just use a `for` loop
and iterate over each string, doing something to it?

In [16]:
for one_word in s:
    print(one_word.upper())

GOODBYE
OUT
THERE


# Don't do this (for several reasons)

1. Using a `for` loop on a series is almost always a bad idea. Each iteration retrieves the value from NumPy and creates a Python object based on it.
2. We lose the index, which might be important in this particular series.

So how should we work with our strings?

Answer: Via the `.str` accessor. Meaning: If we apply `.str` to our series, we get back an object on which we can run
methods. It will then effectively run the method in a vectorized fashion, faster (we can hope) than would be the case
using a `for` loop.
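
Here is a minimal sketch of what `.str` saves us from writing by hand. The series and the names `upper` and `manual` are invented for illustration. It includes a missing value, because that is where a naive loop would fail.

```python
import pandas as pd

s = pd.Series(['hello', None, 'there'], index=list('abc'))

# Via the accessor: the index is kept and the missing value passes through
upper = s.str.upper()

# By hand: we have to skip non-strings and restore the index ourselves
manual = pd.Series(
    [word.upper() if isinstance(word, str) else word for word in s],
    index=s.index,
)

print(upper)
print(manual)
```

A plain `for` loop calling `.upper()` on every element would raise an `AttributeError` at the missing value.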

In [17]:
s.str

<pandas.core.strings.accessor.StringMethods at 0x1569628a0>

In [18]:
s.str.len()  # this will invoke len() on each string, giving us back a new series

0    7
1    3
2    5
dtype: int64

In [19]:
s = Series('hello out there'.split(),
          index=list('abc'))
s

a    hello
b      out
c    there
dtype: object

In [20]:
s.str.len()  # this will invoke len() on each string, giving us back a new series WITH THE original's index!

a    5
b    3
c    5
dtype: int64

# Exercise: Longer-than-average words

1. Create a series of 10 strings (words), each of a different length.
2. Calculate the mean length of a word in the series.
3. Find all of the words that are longer than the mean.

In [21]:
s = Series('this is a fantastic example of strings in action everywhere'.split())
s

0          this
1            is
2             a
3     fantastic
4       example
5            of
6       strings
7            in
8        action
9    everywhere
dtype: object

In [22]:
s.str.len()

0     4
1     2
2     1
3     9
4     7
5     2
6     7
7     2
8     6
9    10
dtype: int64

In [23]:
s.str.len().mean()

5.0

In [24]:
s.str.len().describe()

count    10.000000
mean      5.000000
std       3.231787
min       1.000000
25%       2.000000
50%       5.000000
75%       7.000000
max      10.000000
dtype: float64

In [26]:
# which words are longer than 5 characters?
s.str.len() > 5

0    False
1    False
2    False
3     True
4     True
5    False
6     True
7    False
8     True
9     True
dtype: bool

In [31]:
# which words are longer than the mean of s's words' lengths?
s.loc[
    s.str.len() > s.str.len().mean()
]

3     fantastic
4       example
6       strings
8        action
9    everywhere
dtype: object

In [34]:
# what if our strings contain leading/trailing whitespace?

s = Series(['    abcd   ', 'efgh    ', '    ijkl    ', ' abcd ', ' ijkl      ', '       ijkl  ', '     abcd '])
s

0          abcd   
1         efgh    
2         ijkl    
3            abcd 
4       ijkl      
5           ijkl  
6            abcd 
dtype: object

In [35]:
s.str.len()

0    11
1     8
2    12
3     6
4    11
5    13
6    10
dtype: int64

In [36]:
s.value_counts()   # how often does each string appear in our series?

    abcd         1
efgh             1
    ijkl         1
 abcd            1
 ijkl            1
       ijkl      1
     abcd        1
Name: count, dtype: int64

In [42]:
# I can run another string method, s.str.strip()

s.str.strip().value_counts()

abcd    3
ijkl    3
efgh    1
Name: count, dtype: int64

In [43]:
s

0          abcd   
1         efgh    
2         ijkl    
3            abcd 
4       ijkl      
5           ijkl  
6            abcd 
dtype: object

# Some useful methods for cleaning our data:

- `.str.strip()` -- no argument means "remove all whitespace from the front and back of the string," but we can pass a string to it, in which case all of its characters are stripped from the front and back
- `.str.lower()` -- return a lowercase version
- `.str.upper()` -- return an uppercase version
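
A short sketch of the difference between stripping whitespace and stripping a set of characters. The sample strings are invented. `.str.lstrip()`, shown here as well, works on the front of the string only.

```python
import pandas as pd

s = pd.Series(['--abcd--', '**ABCD!!', '  abcd\n'])

no_whitespace = s.str.strip()       # only whitespace is removed
no_symbols = s.str.strip('-*!')     # any of these characters, from both ends
left_only = s.str.lstrip('-*')      # front of the string only

# Chaining the steps normalizes all three values to the same string
normalized = s.str.strip().str.strip('-*!').str.lower()

print(normalized.value_counts())
```

Note that passing an argument replaces the default: `.str.strip('-*!')` no longer removes whitespace, which is why the last value is unchanged in `no_symbols`.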

In [44]:
s = Series(['    abcd   ', 'efgh    ', '    iJkL    ', ' ABCD ', ' Ijkl      ', '       ijkL  ', '     abcd '])


In [46]:
s.str.strip().value_counts()

abcd    2
efgh    1
iJkL    1
ABCD    1
Ijkl    1
ijkL    1
Name: count, dtype: int64

In [48]:
(
    s
    .str.strip()     # remove leading/trailing whitespace
    .str.lower()     # get back a new series of strings, all lowercase, based on output from .str.strip()
    .value_counts()  # runs on the result from str.lower
)

abcd    3
ijkl    3
efgh    1
Name: count, dtype: int64

# Exercise: Alice words

1. Create a series based on the *words* in Alice in Wonderland. (You can use `read` on this file, without worrying about the length.)
2. Find the 10 most common words in the book, without any transformation.
3. Find the 10 most common words in the book, removing punctuation (`string.punctuation`) from the front and back of words.
4. Find the 10 most common words in the book, lowercasing the words *and* removing punctuation.

In [49]:
s = Series(open('alice-in-wonderland.txt').read().split())

In [50]:
len(s)

12763

In [51]:
s.loc[:15]

0             The
1         Project
2       Gutenberg
3           EBook
4              of
5           Alice
6              in
7     Wonderland,
8              by
9           Lewis
10        Carroll
11           This
12          eBook
13             is
14            for
15            the
dtype: object

In [52]:
s.value_counts()

the            732
and            362
a              321
to             311
of             300
              ... 
present--at      1
morning,         1
then."           1
sternly.         1
newsletter       1
Name: count, Length: 3408, dtype: int64

In [54]:
# 3. Find the 10 most common words in the book, removing punctuation (`string.punctuation`) from the front and back of words.

import string

(
    s
    .str.strip(string.punctuation)
    .value_counts()
    .head(10)
)


the      735
and      384
a        322
to       320
of       303
in       214
she      201
was      167
Alice    166
it       164
Name: count, dtype: int64

In [55]:
# 4. Find the 10 most common words in the book, lowercasing the words *and* removing punctuation.

import string

(
    s
    .str.strip(string.punctuation)
    .str.lower()
    .value_counts()
    .head(10)
)

the    807
and    404
a      328
to     327
of     318
she    237
in     227
it     183
you    171
was    168
Name: count, dtype: int64

In [57]:
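# now let's ignore any word whose length is < 5.
# careful: the mask below is computed from the original s (punctuation included),
# not from the stripped, lowercased words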

import string

(
    s
    .str.strip(string.punctuation)
    .str.lower()
    .loc[s.str.len() >= 5]
    .value_counts()
    .head(10)
)

alice           168
project          87
little           59
gutenberg-tm     56
about            41
other            34
herself          33
there            32
works            32
rabbit           31
Name: count, dtype: int64

In [58]:
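# again, but with lambda! The lambda receives the stripped, lowercased series,
# so the length test runs on the cleaned words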

import string

(
    s
    .str.strip(string.punctuation)
    .str.lower()
    .loc[lambda s_ : s_.str.len() >= 5]
    .value_counts()
    .head(10)
)

alice           168
project          87
little           59
gutenberg-tm     56
about            41
other            34
herself          33
works            32
there            32
rabbit           31
Name: count, dtype: int64
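
The difference between the two filters above is easy to miss, because on this text the top 10 words come out the same. The sketch below uses a tiny invented series where the results clearly differ. The names `from_original` and `from_cleaned` are just labels for this example.

```python
import string
import pandas as pd

s = pd.Series(['"Oh,"', 'dear!', 'Alice', 'cat...'])
cleaned = s.str.strip(string.punctuation)

# Mask computed from the original series: punctuation still counts toward the length
from_original = cleaned.loc[s.str.len() >= 5]

# Mask computed by the lambda from the cleaned series it receives
from_cleaned = cleaned.loc[lambda s_: s_.str.len() >= 5]

print(from_original.tolist())
print(from_cleaned.tolist())
```

In a method chain, the lambda form is usually what we want, since it sees the result of the previous step rather than the variable we started with.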

In [61]:
s.str.strip(string.punctuation).value_counts().head(15)

the      735
and      384
a        322
to       320
of       303
in       214
she      201
was      167
Alice    166
it       164
said     143
you      141
I        127
with     112
her      107
Name: count, dtype: int64

In [67]:
ignore_words = 'and the if an a i to of in she was it you with her or as at all on be is'.split()

import string

(
    s
    .str.strip(string.punctuation)
    .str.lower()
    # .loc[lambda s_ : s_.str.len() >= 5]
    .loc[lambda s_: ~ s_.isin(ignore_words)]
    .value_counts()
    .head(10)
)

alice           168
said            144
that            111
this             98
project          87
for              74
had              65
not              62
little           59
gutenberg-tm     56
Name: count, dtype: int64

# Turning strings into integers

Normally, we can use `astype(int)` on a series of strings to get back a new series of integers. But what if one or more of those
strings aren't valid ints?

In [68]:
s = Series('10 20 30'.split())
s

0    10
1    20
2    30
dtype: object

In [69]:
s.astype(int)

0    10
1    20
2    30
dtype: int64

In [70]:
# let's try something that doesn't work...

s = Series('10 20 hello 30'.split())
s

0       10
1       20
2    hello
3       30
dtype: object

In [71]:
s.astype(int)

ValueError: invalid literal for int() with base 10: 'hello'

In [73]:
# let's remove any value from the series that cannot be turned into an int
# an easy way to do that is with the str.isdigit method

(
    s
    .loc[s.str.isdigit()]
    .astype(int)
)

0    10
1    20
3    30
dtype: int64

In [74]:
# using a lambda expression

(
    s
    .loc[lambda s_: s_.str.isdigit()]
    .astype(int)
)

0    10
1    20
3    30
dtype: int64
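
One limitation of `str.isdigit` is that it rejects strings such as `'-30'`, which `int` would accept. As a possible alternative (not used in the session above), `pd.to_numeric` with `errors='coerce'` turns anything it can't parse into `NaN`, which we can then drop. The sample data is invented.

```python
import pandas as pd

s = pd.Series(['10', '20', 'hello', '-30'])

# isdigit() rejects '-30' because of the minus sign
digit_mask = s.str.isdigit()

# errors='coerce' replaces unparseable values with NaN instead of raising
numbers = pd.to_numeric(s, errors='coerce')

# Drop the NaN values, then convert what remains to int
cleaned = numbers.dropna().astype(int)

print(cleaned)
```

As with the `isdigit` approach, the surviving values keep their original index labels.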

# Examining the text

Normally, when we're using Python, we can use a few operators and methods to explore our strings. These are all available in Pandas, too, but sometimes with different names:

- `index` and `rindex`
- `startswith` and `endswith`
- `contains`, a method version of the `in` operator for regular strings
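
A quick sketch of these methods on a small invented series. It uses `find` rather than `index`, because `.str.index` raises a `ValueError` if the substring is missing from any element, while `.str.find` returns -1.

```python
import pandas as pd

s = pd.Series('hello to everyone out there'.split())

starts_with_t = s.str.startswith('t')
ends_with_e = s.str.endswith('e')

# find() returns -1 where the substring is absent
position_of_e = s.str.find('e')

# contains() treats its argument as a regular expression unless regex=False
has_dot = pd.Series(['a.b', 'ab']).str.contains('.', regex=False)

print(position_of_e.tolist(), has_dot.tolist())
```

The `regex=False` argument matters for characters such as `.` that have a special meaning in regular expressions.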

In [76]:
s = Series('hello to everyone out there in television land'.split())
s

0         hello
1            to
2      everyone
3           out
4         there
5            in
6    television
7          land
dtype: object

In [79]:
s.loc[s.str.contains('e')]

0         hello
2      everyone
4         there
6    television
dtype: object

In [80]:
# what if I want to find words that contain a or e?

s.loc[s.str.contains('e') | s.str.contains('a')]

0         hello
2      everyone
4         there
6    television
7          land
dtype: object

In [None]:
# but ... there's a better way: a regular expression character class

s.loc[s.str.contains('[ae]')]