# Agenda: Strings in Pandas

1. How does Pandas store strings?
2. How to work with strings using `.str` (the "string accessor")
3. A lot of string methods that we might want to use
4. String methods on non-strings
5. Regular expressions and the string methods
6. Memory usage
7. Extension types
8. PyArrow

# Pandas and strings

Normally, Pandas stores its data in NumPy arrays.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30])
s

0    10
1    20
2    30
dtype: int64

In [4]:
s.values   # get the NumPy array back from behind the Pandas series

array([10, 20, 30])

In [5]:
s.values.dtype

dtype('int64')

In [6]:
s = Series([10.5, 20.5, 30.5])
s

0    10.5
1    20.5
2    30.5
dtype: float64

In [7]:
s.values.dtype

dtype('float64')

In [8]:
# let's say that I want to create a NumPy array of strings. Can I?

a = np.array('hello out there'.split())
a

array(['hello', 'out', 'there'], dtype='<U5')

When NumPy stores strings, it does so in fixed-width, C-style memory. An array (including a NumPy array), by definition, has elements that are
all of the same type. In this case, all of the elements of `a` are of type `<U5`, which means: Unicode strings of up to 5 characters each. (The `<` indicates little-endian byte order.)
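As a small sketch of what that dtype implies, we can inspect the array's `dtype` attributes. The second array, `b`, is a made-up example that isn't part of the session above; it is only there to show that NumPy picks the width from the longest string it sees when the array is created.

```python
import numpy as np

a = np.array('hello out there'.split())

# 'U' is the kind code for fixed-width Unicode strings
print(a.dtype.kind)

# NumPy uses 4 bytes per character, so 5 characters need 20 bytes per element
print(a.dtype.itemsize)

# hypothetical second array: the width comes from the longest string
b = np.array(['hello', 'goodbye'])
print(b.dtype.itemsize // 4)   # number of characters per element
```

Every element gets the same number of bytes, no matter how short the string actually is.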

In [9]:
a[0] = 'goodbye'

In [10]:
a

array(['goodb', 'out', 'there'], dtype='<U5')

For this reason, and many others, we don't want to use NumPy's strings. They are fragile, in that we can accidentally
remove characters, and they don't have the flexibility that we know and love from Python strings.

The Pandas solution to this problem is: Use Python strings!

In [11]:
s = Series('hello out there'.split())
s

0    hello
1      out
2    there
dtype: object

A dtype of `object` means: Each element of the NumPy array is a pointer ("reference") to a Python object. Most of the time,
if you see a dtype of `object`, it's because we're working with strings.

The good news is that we can retrieve these strings and run Python methods on them. More good news: The
size of each string isn't constrained by the memory allocated by NumPy.
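Here is a minimal sketch of what that looks like from Python. It passes `dtype=object` explicitly, which the session above does not do; that is an assumption made here because newer versions of Pandas may infer a dedicated string dtype rather than `object`.

```python
import pandas as pd

# dtype=object is set explicitly (an assumption for this sketch), so that
# the result doesn't depend on which dtype Pandas infers for strings
s = pd.Series('hello out there'.split(), dtype=object)

# the underlying NumPy array holds references to Python objects
print(s.to_numpy().dtype)

# each element we retrieve is a regular Python string...
first = s.iloc[0]
print(type(first))

# ...so any Python string method is available on it
print(first.upper())
```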



In [12]:
s.loc[0] = 'goodbye'

In [13]:
s

0    goodbye
1        out
2      there
dtype: object

In [15]:
type(s.loc[0])

str

Now that we know that we have Python strings... how can we invoke methods on them? Should we just use a `for` loop
and iterate over each string, doing something to it?

In [16]:
for one_word in s:
    print(one_word.upper())

GOODBYE
OUT
THERE


# Don't do this (for several reasons)

1. Using a `for` loop on a series is almost always a bad idea. The loop runs in Python, one element at a time, rather than in vectorized code. With numeric dtypes, each iteration also has to create a Python object from the value stored in NumPy.
2. We lose the index, which might be important in this particular series.

So how should we work with our strings?

Answer: Via the `.str` accessor. Meaning: If we apply `.str` to our series, we get back an object on which we can run
string methods. Each method is then effectively run on every element in a vectorized fashion, faster (we can hope) than
would be the case using a `for` loop.
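To make the contrast concrete, here is a small sketch that runs the same operation both ways. The series and its `a`/`b`/`c` index are chosen just for illustration, and the names `looped` and `upper` are made up for this example.

```python
import pandas as pd

s = pd.Series('hello out there'.split(), index=list('abc'))

# Python-level loop: it works, but the result is a plain list with no index
looped = [one_word.upper() for one_word in s]
print(looped)

# .str accessor: we get back a new series that keeps the original index
upper = s.str.upper()
print(upper)
```

Both approaches produce the same strings, but only the `.str` version returns a series that still lines up with the original index.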

In [17]:
s.str

<pandas.core.strings.accessor.StringMethods at 0x1569628a0>

In [18]:
s.str.len()  # this will invoke len() on each string, giving us back a new series

0    7
1    3
2    5
dtype: int64

In [19]:
s = Series('hello out there'.split(),
          index=list('abc'))
s

a    hello
b      out
c    there
dtype: object

In [20]:
s.str.len()  # this will invoke len() on each string, giving us back a new series WITH THE original's index!

a    5
b    3
c    5
dtype: int64

# Exercise: Longer-than-average words

1. Create a series of 10 strings (words), each of a different length.
2. Calculate the mean length of a word in the series.
3. Find all of the words that are longer than the mean.