# Lab 06

This week, we'll continue with the tutorial on using the HTRC Extracted Features Dataset through Python. Last week was the preparation; this week is the fun stuff!

## Pandas and the Extracted Features Dataset, continued

In [None]:
import pandas as pd

### Method Chaining

In Pandas, you may find yourself combining a number of DataFrame methods in a row. When the output of each step is a DataFrame, you don't have to save each step to a variable: you can 'chain' the commands. So, instead of transforming a DataFrame called `original` step by step:

```python
df1 = original.do_something()
df2 = df1.do_something_else()
df3 = df2.do_more()
```
you can get the same result in one expression:

```python
df3 = original.do_something().do_something_else().do_more()
```

You may see the benefit and the downside of method chaining above.

The benefit: you're not saving intermediate DataFrames to variables. `df1` and `df2` were only necessary to get you to `df3`, so why even save them?

The downside is less readability: yuck! This is fine for short chains, but for longer ones you still want the line breaks. That way, when you return to your code in the future, you can make sense of it (and so I can read it when marking!).

To format chained methods better, you can wrap the whole expression in parentheses, which tells Python that the current line of code isn't done until the closing parenthesis. Note that the assignment (`df3 =`) stays outside the parentheses:

```python
df3 = (original.do_something()
               .do_something_else()
               .do_more()
)
```

Much prettier. This style will be useful once things get more complex. Remember that you're not forced to use chaining: saving intermediate variables is fine, and can be helpful if you find a bug somewhere in the chain. However, you'll see it occasionally in example code, so it is good to understand what is happening.
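
To see chaining with real Pandas methods, here is a minimal sketch on a small made-up DataFrame. The data and the variable names (`step1`, `step2`, `stepwise`, `chained`) are purely illustrative and are not part of the lab volume. It runs the same three steps both ways and checks that the results match.

```python
import pandas as pd

# Toy data standing in for a token list; the values are invented for illustration.
original = pd.DataFrame({
    "token": ["the", "Tom", "and", "Tom", "fence"],
    "page": [1, 1, 2, 2, 2],
    "count": [4, 2, 3, 5, 1],
})

# Step-by-step version: each intermediate DataFrame gets its own variable.
step1 = original.sort_values("count", ascending=False)
step2 = step1.head(3)
stepwise = step2.reset_index(drop=True)

# Chained version: the parentheses let the expression span several lines.
chained = (
    original.sort_values("count", ascending=False)
            .head(3)
            .reset_index(drop=True)
)

print(chained)
print(chained.equals(stepwise))  # True: both routes give the same DataFrame
```

If a chain misbehaves, one way to debug it is to comment out the later lines inside the parentheses and inspect the result one step at a time.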

### Slicing

Following from last week's reading on [Text Mining in Python through the HTRC Feature Reader](http://programminghistorian.org/lessons/text-mining-with-extracted-features), we'll be continuing from the 'Slicing DataFrames' section to the end.

First, let's reload the volume from the last lab task.

In [None]:
from htrc_features import FeatureReader
fr = FeatureReader('../data/mdp.49015002392919.json.bz2')
vol = fr.first()
vol

<htrc_features.feature_reader.Volume at 0x26fb0ebb358>

**Q1**: Fill in the blanks to produce the output shown in the image below:

```
(vol.tokenlist(pages=**BLANK1**, pos=**BLANK2**, case=False)
    .loc[("body", slice(None), "**BLANK3**"),]
    .sort_values("count", ascending=**BLANK4**)
    .head(**BLANK5**)
)
```

![](../images/lab6-output.png)

_Multiple Choice_
1. True, False
2. True, False
3. slice(None), "body", "RB", "NNP"
4. True, False
5. 3, 5, 7

**Q2**: What is the code to get the token frequencies for page 39 of the book? You'll start with `tl = vol.tokenlist()`, what's next?

**Q3**: How would you get the five most frequent words tagged as a proper noun or a plural proper noun? Since the question doesn't involve page-level counts, you'll want to start with `tl = vol.tokenlist(pages=False)`.

### Grouping

**Q4**: What does the following code do?

```python
tl = vol.tokenlist()
tl.groupby(level='page').count().sort_values('count', ascending=False)
```

How does it differ from the following?

```python
tl = vol.tokenlist()
tl.groupby(level='page').sum().sort_values('count', ascending=False)
```

**Q5** (2pts): Set a new variable to `vol.tokenlist().reset_index()`.

**a)** What did `reset_index` do?
**b)** How would you run the summing code from above (i.e. the second example in Q4)?

**Q6**: Using the DataFrame from Q5, how would you select the rows with counts for the word `Tom`? Remember from the reading that 'slicing' is something done only on indexes - you learned to select based on a column value last week.

**Q7**: Using the result from Q6, figure out how to plot the counts of 'Tom' by page. The plot method for DataFrames takes `x` and `y` arguments. Share the code to produce this:

![](../images/lab6-sawyer-plot.png)

### Pandas Series

Where a Pandas DataFrame object is like a spreadsheet, with rows and columns, a Pandas Series object is like just one column: it is a sequence of just one value at a time. You can think of it as a supercharged list.

To pull out a single column of a DataFrame as a Series, use square brackets to reference the column by name. Using the DataFrame from Q5, where the index has been reset to columns, here's an example:

In [None]:
token_series = tl['token']

# Show five random items from the series
token_series.sample(5)

31992         of
35613      Least
16341       them
4477        bear
15935    finally
Name: token, dtype: object

If you want to add a series to a DataFrame as a column, you can do the same in reverse:

In [None]:
tl['new_column'] = token_series
tl.sample(5)

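| | page | section | token | pos | count | new_column |
|---|---|---|---|---|---|---|
| 11498 | 89 | body | then | RB | 2 | then |
| 32915 | 221 | body | No | UH | 2 | No |
| 22816 | 158 | body | sat | VBD | 1 | sat |
| 45905 | 297 | body | toward | IN | 1 | toward |
| 11934 | 93 | body | 73 | CD | 1 | 73 |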


Tada!
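
If you'd like to poke at this without loading the volume, here is a small sketch on made-up rows shaped like the reset-index token list. The `toy` DataFrame and the `token_copy` column name are assumptions for illustration only. It shows that single brackets with one column name give a Series, while a list of names gives a DataFrame.

```python
import pandas as pd

# Invented rows in the same shape as the reset-index token list.
toy = pd.DataFrame({
    "page": [3, 3, 7],
    "token": ["DEMCO", "M", "LEATHER"],
    "count": [1, 1, 1],
})

tokens = toy["token"]                   # one column name -> Series
print(type(tokens).__name__)            # Series
print(type(toy[["token"]]).__name__)    # list of names -> DataFrame

# Assigning a Series to a new column name adds it, matched up by index.
toy["token_copy"] = tokens
print(toy.columns.tolist())
```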

A Series has a couple of useful features. For example, you can apply a function against each item with `apply`. If we wanted to get the length of every string (like we would manually do with `len('string')`), it's possible in this way:

In [None]:
token_series.apply(len).head()

0    1
1    1
2    1
3    4
4    6
Name: token, dtype: int64

Is it clear what happened there? `apply` took the function we gave it, `len`, and for each value in the Series applied `len(value)`.

If this was a list instead of a Series, the equivalent would be `[len(string) for string in list_of_strings]`.
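
To make that equivalence concrete, here is a short sketch comparing the two on a handful of tokens borrowed from the sample output earlier. The variable names are illustrative. It also shows `str.len()`, a built-in string method that should give the same lengths for a Series of strings.

```python
import pandas as pd

# A few tokens taken from the sample output earlier in this lab.
words = ["of", "Least", "them", "bear", "finally"]
word_series = pd.Series(words)

# Series version: apply calls len(value) for each value.
lengths_from_apply = word_series.apply(len)

# Plain-list version: the same thing as a list comprehension.
lengths_from_list = [len(word) for word in words]

print(lengths_from_apply.tolist())     # [2, 5, 4, 4, 7]
print(lengths_from_list)               # [2, 5, 4, 4, 7]
print(word_series.str.len().tolist())  # same lengths via the string accessor
```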

Just to be more clear, I'll add it as a column:

In [None]:
tl['token_length'] = token_series.apply(len)
tl.sample(5)

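| | page | section | token | pos | count | new_column | token_length |
|---|---|---|---|---|---|---|---|
| 42327 | 276 | body | pockets—yet | NN | 1 | pockets—yet | 11 |
| 19755 | 140 | body | at | IN | 1 | at | 2 |
| 4554 | 46 | body | other | JJ | 1 | other | 5 |
| 13279 | 100 | body | ten | CD | 1 | ten | 3 |
| 12667 | 97 | body | carefully | RB | 1 | carefully | 9 |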


Looks right!

Another useful method of a Series is `value_counts`, which simply counts how often each value occurs:

In [None]:
token_series.value_counts().head()

that    502
"       485
's      364
.       297
the     296
Name: token, dtype: int64
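
One thing to keep in mind when reading that output: `value_counts` counts how many rows hold each value, so on the token list it is not the same as adding up the `count` column. The sketch below uses made-up rows (the `toy` DataFrame is an assumption, not the lab data) to show the difference.

```python
import pandas as pd

# Invented rows: "that" is on three rows, "the" is on one row with a large count.
toy = pd.DataFrame({
    "page": [1, 2, 3, 3],
    "token": ["that", "that", "that", "the"],
    "count": [1, 2, 1, 10],
})

row_counts = toy["token"].value_counts()
print(row_counts["that"])  # 3, because "that" appears on three rows
print(row_counts["the"])   # 1, even though its count column says 10
```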

Finally, for a Series that specifically has strings, there are string methods. Try `token_series.str.<TAB>` to see the autofill of what is possible.

Going back to our ALTA filtering for `isalpha()`, we can quickly do the same here:

In [None]:
is_alpha_matches = token_series.str.isalpha()
is_alpha_matches.head(10)

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
Name: token, dtype: bool

We saw in Lab 5 that supplying a set of True or False values to a DataFrame allows us to select rows. Let's try it with the above Series:

In [None]:
tl.head(10)

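| | page | section | token | pos | count | new_column | token_length |
|---|---|---|---|---|---|---|---|
| 0 | 3 | body | . | . | 1 | . | 1 |
| 1 | 3 | body | 0 | CD | 1 | 0 | 1 |
| 2 | 3 | body | 1 | CD | 1 | 1 | 1 |
| 3 | 3 | body | 2003 | CD | 1 | 2003 | 4 |
| 4 | 3 | body | 38-297 | CD | 1 | 38-297 | 6 |
| 5 | 3 | body | 4 | CD | 1 | 4 | 1 |
| 6 | 3 | body | DEMCO | NNP | 1 | DEMCO | 5 |
| 7 | 3 | body | M | NNP | 1 | M | 1 |
| 8 | 7 | body | LEATHER | NNP | 1 | LEATHER | 7 |
| 9 | 7 | body | LIMP | NNP | 1 | LIMP | 4 |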


In [None]:
tl[is_alpha_matches].head()

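| | page | section | token | pos | count | new_column | token_length |
|---|---|---|---|---|---|---|---|
| 6 | 3 | body | DEMCO | NNP | 1 | DEMCO | 5 |
| 7 | 3 | body | M | NNP | 1 | M | 1 |
| 8 | 7 | body | LEATHER | NNP | 1 | LEATHER | 7 |
| 9 | 7 | body | LIMP | NNP | 1 | LIMP | 4 |
| 10 | 7 | body | MARK | NNP | 1 | MARK | 4 |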


It worked! Of the top ten rows, the only ones that are selected are solely alphabetical. Remember that `is_alpha_matches` is simply `tl['token'].str.isalpha()`, which could have been used directly for selection.
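
Here is the same selection pattern as a self-contained sketch on a few made-up rows (the `toy` DataFrame and the names `mask`, `alpha_rows`, and `non_alpha_rows` are assumptions for illustration). It also shows that putting `~` in front of a boolean Series flips it, which selects the rows that did not match.

```python
import pandas as pd

# Invented rows mixing punctuation, numbers, and words.
toy = pd.DataFrame({
    "token": [".", "2003", "38-297", "DEMCO", "LIMP"],
    "pos": [".", "CD", "CD", "NNP", "NNP"],
})

# One True/False value per row.
mask = toy["token"].str.isalpha()
print(mask.tolist())  # [False, False, False, True, True]

alpha_rows = toy[mask]        # rows where the mask is True
non_alpha_rows = toy[~mask]   # ~ inverts the mask

print(alpha_rows["token"].tolist())      # ['DEMCO', 'LIMP']
print(non_alpha_rows["token"].tolist())  # ['.', '2003', '38-297']
```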

Finally, one more string method, `lower()`:

In [None]:
tl['lowercase'] = token_series.str.lower()
tl.sample(5)

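| | page | section | token | pos | count | new_column | token_length | lowercase |
|---|---|---|---|---|---|---|---|---|
| 44538 | 289 | body | time | NN | 1 | time | 4 | time |
| 20868 | 146 | body | top | NN | 1 | top | 3 | top |
| 16134 | 118 | body | matter | NN | 1 | matter | 6 | matter |
| 42279 | 276 | body | healing | NN | 1 | healing | 7 | healing |
| 23931 | 165 | body | effort | NN | 1 | effort | 6 | effort |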
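
The sample above happens to show only tokens that were already lowercase, so here is a small sketch on a made-up Series where the case differs. The values and variable names are illustrative. After lowercasing, spellings that differ only by case are treated as the same value.

```python
import pandas as pd

# Invented tokens that differ only in capitalisation.
tokens = pd.Series(["The", "the", "THE", "Tom", "tom"])

# Before lowercasing, every spelling is a distinct value.
print(tokens.nunique())  # 5

# After lowercasing, the variants collapse into two values.
lowered = tokens.str.lower()
print(lowered.value_counts().to_dict())  # {'the': 3, 'tom': 2}
```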


**Q8**: How is `token_series.str.istitle()` different from `token_series.str.isupper()`?

**Q9**: Which of the following options returns the tokens that have a hyphen in them?

 1. `tl[tl['token'].str.has('-')]`
 2. `tl[tl['token'].str.contains('-')]`
 3. `tl[tl['token'].contains('-')]`
 4. `tl[tl['token'] == '-']`
 5. None of the above