# Lesson 2.2: Scraping Gutenberg: Filtering Queries

So far, none of the work we have done on the Gutenberg catalog has actually gotten us any texts. Instead, we did the following:
- Downloaded a CSV file
- Converted it to a dataframe
- Cleaned the dataframe
- Exported as `.pickle` file

We did all this because the dataframe will be used to retrieve actual texts from Gutenberg, and we want to be sure everything is clean before we work with it further.

In this lesson you will learn how to filter the dataframe for the specific values you want by running queries. Since we don't want to download all of Gutenberg, we will create a subset of the data before we do the actual web scraping.

#### Load libraries

Libraries need to be reloaded for each new notebook.

In [7]:
import pandas as pd

## 1 Import `pg_catalog_clean.pickle`


You will need to import the `.pickle` file and load it into a dataframe.

In [10]:
df_pg_catalog = pd.read_pickle('pg_catalog_clean.pickle')

In [11]:
df_pg_catalog

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,,Jefferson,Thomas,1743,1826
1,2,Text,1972-12-01,The United States Bill of Rights The Ten Orig...,en,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,,United States,,,
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,,Kennedy,John F. (John Fitzgerald),1917,1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address Given November 1...,en,Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,,Lincoln,Abraham,1809,1865
4,5,Text,1975-12-01,The United States Constitution,en,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,,United States,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74289,74455,Text,2024-09-21,La survivante,fr,,,,,Balde,Jean,1885,1938
74290,74456,Text,2024-09-21,Lord Lister No. 0029: Het Indische raadsel,nl,,,,"Blankensee, Theo von, 1881-1928",Matull,Kurt,1872,1920
74291,74457,Text,2024-09-21,Voimakasta väkeä,fi,,,,,Malmberg,Aino,1865,1933
74292,74458,Text,2024-09-22,Victoria,en,,,,"Chater, Arthur G. [Translator]",Hamsun,Knut,1859,1952


## 2 Exploring dataframe content

### 2.1 unique()

Before you start filtering a dataframe, you might want to know what's actually inside it. It doesn't make a whole lot of sense to start filtering for values if you don't know what values are possible. Every column of a pandas dataframe has a built-in `.unique()` method that allows you to view all of its unique values. Let's start by looking at `type`.

In [15]:
df_pg_catalog['type'].unique()

<StringArray>
['Text', 'Dataset', 'StillImage', 'MovingImage', 'Collection', 'Sound',
 'Image']
Length: 7, dtype: string

We can see that the `type` column distinguishes 7 different types of material. We are only interested in `Text`.
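If we also want to know how many rows carry each value, `value_counts()` is a natural companion to `unique()`. The sketch below uses a small made-up dataframe (`df_toy` is a hypothetical stand-in, not the real catalog), so the numbers say nothing about Gutenberg itself.

```python
import pandas as pd

# Toy stand-in for the catalog; the values are invented for illustration.
df_toy = pd.DataFrame({"type": ["Text", "Text", "Sound", "Text", "Dataset"]})

# unique() lists each distinct value once, in order of first appearance.
print(df_toy["type"].unique())

# value_counts() additionally reports how many rows carry each value.
type_counts = df_toy["type"].value_counts()
print(type_counts)
```

Running the same two calls on `df_pg_catalog['type']` would show how rare the non-text types are before we filter them out.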

### 2.2 Dot Notation

We can try to see how many languages there are as well. To speed up our typing, we can use a little trick called "dot notation". As long as a column name is a valid Python identifier (no spaces or special characters, and it does not start with a digit) and does not clash with an existing dataframe attribute or method, we do not have to write the brackets and quotes around the column name. We can simply put the name after the dataframe, separated by a dot. So `df_pg_catalog.language` will give us the `language` column.

In [19]:
df_pg_catalog.language

0        en
1        en
2        en
3        en
4        en
         ..
74289    fr
74290    nl
74291    fi
74292    en
74293    en
Name: language, Length: 74294, dtype: string
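To see where dot notation stops working, here is a minimal sketch on a made-up dataframe. The column names `first name` and `count` are hypothetical and chosen only to trigger the two typical problems: a space in the name, and a name that is already a dataframe method.

```python
import pandas as pd

# Toy dataframe; the column names are invented to illustrate the edge cases.
df_toy = pd.DataFrame({
    "language": ["en", "fr"],
    "first name": ["Thomas", "Jean"],  # contains a space
    "count": [1, 2],                   # same name as the DataFrame.count method
})

# For a simple name, dot and bracket notation return the same column.
same = df_toy.language.equals(df_toy["language"])

# Bracket notation always works, even for awkward names.
first_names = df_toy["first name"]
counts = df_toy["count"]

# With dot notation, df_toy.count is the method, not our column.
is_method = callable(df_toy.count)
print(same, is_method)
```

When in doubt, bracket notation is the safe choice.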

If we want to know the unique values we can simply add `.unique()` after `language`.


In [21]:
df_pg_catalog.language.unique()

<StringArray>
[         'en',          'la',          'es',      'de; en',      'de; la',
          'fr',          'it',      'en; fr',          'ja',          'de',
 ...
     'en; brx',      'de; eo', 'es; fr; myn',      'en; ga',      'fi; sv',
  'en; la; el',      'af; nl',      'bo; en',      'la; pt',     'en; hai']
Length: 118, dtype: string

This is pretty interesting. There are 118 different values here. Note that these are not 118 separate languages: multilingual works are recorded as combinations such as `en; fr`. Now we know we can limit our search results by eliminating everything that is not a text and everything that is not in English (`en`). Filtering down will be important because some columns may have a very large number of unique values.

In [23]:
df_pg_catalog.subjects.unique()

<StringArray>
['United States -- History -- Revolution, 1775-1783 -- Sources; United States. Declaration of Independence',
 'Civil rights -- United States -- Sources; United States. Constitution. 1st-10th Amendments',
 'United States -- Foreign relations -- 1961-1963; Presidents -- United States -- Inaugural addresses',
 ...

There are over 40,000 unique strings in subjects. Figuring out what we want to look at is going to be tricky.
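When a column has this many values, printing them all is not very helpful. A quick way to get just the number is `nunique()`. The sketch below runs on a made-up column (`df_toy` is hypothetical) and also shows one difference to `unique()`: by default `nunique()` does not count missing values.

```python
import pandas as pd

# Toy column with one repeated value and one missing value.
df_toy = pd.DataFrame(
    {"subjects": ["Virginia -- Fiction", "Civil rights", None, "Civil rights"]}
)

# nunique() returns only the count and skips missing values by default.
n_subjects = df_toy["subjects"].nunique()

# unique() returns the values themselves, including the missing one.
n_listed = len(df_toy["subjects"].unique())

print(n_subjects, n_listed)
```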

## 3 Filtering the dataframe

### 3.1 Simple Filtering with `==` (Equal to)

Filtering data in pandas is pretty intuitive. You access the column and then retrieve the value(s) that are interesting to you. For strings, the simplest operator is **`==` (Equal to)**. This:
  - Checks if values in a column are equal to a certain value.
  - Returns **True** where the condition is met.

For example: 

```python
df_pg_catalog.language=='en'
```

This gives us a column of `True` and `False` values, one for each row.
    

In [28]:
df_pg_catalog.language=='en'

0         True
1         True
2         True
3         True
4         True
         ...  
74289    False
74290    False
74291    False
74292     True
74293     True
Name: language, Length: 74294, dtype: boolean

This is a bit confusing because we have now lost our table. In order to get it back, we have to wrap `df_pg_catalog.language=='en'` in the dataframe:

```python
df_pg_catalog[df_pg_catalog.language=='en']
```
This will return the dataframe, but only the rows where `language=='en'` is `True`.

In [30]:
df_pg_catalog[df_pg_catalog.language=='en']

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,,Jefferson,Thomas,1743,1826
1,2,Text,1972-12-01,The United States Bill of Rights The Ten Orig...,en,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,,United States,,,
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,,Kennedy,John F. (John Fitzgerald),1917,1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address Given November 1...,en,Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,,Lincoln,Abraham,1809,1865
4,5,Text,1975-12-01,The United States Constitution,en,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,,United States,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74281,74447,Text,2024-09-20,Discourses of Brigham Young,en,Latter Day Saint churches -- Doctrines,BX,,"Widtsoe, John Andreas, 1872-1952 [Editor]",Young,Brigham,1801,1877
74286,74452,Text,2024-09-20,Humbug,en,,,,,Delafield,E. M.,1890,1943
74288,74454,Text,2024-09-21,Report on the Indian schools of Manitoba and t...,en,,,,,Bryce,P. H. (Peter Henderson),1853,1932
74292,74458,Text,2024-09-22,Victoria,en,,,,"Chater, Arthur G. [Translator]",Hamsun,Knut,1859,1952


We see that the number of rows is now down to 59485 from 74294. This means we've eliminated about 15k records.
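The row counts can also be checked in code rather than read off the table. A minimal sketch on a made-up `language` column (the values are invented): because `True` counts as 1, summing the mask gives the number of matching rows, and `len()` of the filtered dataframe gives the same number.

```python
import pandas as pd

# Toy stand-in for the language column.
df_toy = pd.DataFrame({"language": ["en", "fr", "en; fr", "en", "nl"]})

# The comparison produces one True/False value per row.
mask = df_toy.language == "en"

n_english = mask.sum()      # True counts as 1, so this is 2
n_rows = len(df_toy[mask])  # the filtered dataframe also has 2 rows

print(n_english, n_rows)
```

Note that the row with `en; fr` is not counted: `==` only matches values that are exactly `en`.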

We can do the same thing for type. Since we only want `Text` we can filter for that.


In [33]:
df_pg_catalog[df_pg_catalog.type=='Text']

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,,Jefferson,Thomas,1743,1826
1,2,Text,1972-12-01,The United States Bill of Rights The Ten Orig...,en,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,,United States,,,
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,,Kennedy,John F. (John Fitzgerald),1917,1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address Given November 1...,en,Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,,Lincoln,Abraham,1809,1865
4,5,Text,1975-12-01,The United States Constitution,en,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,,United States,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74289,74455,Text,2024-09-21,La survivante,fr,,,,,Balde,Jean,1885,1938
74290,74456,Text,2024-09-21,Lord Lister No. 0029: Het Indische raadsel,nl,,,,"Blankensee, Theo von, 1881-1928",Matull,Kurt,1872,1920
74291,74457,Text,2024-09-21,Voimakasta väkeä,fi,,,,,Malmberg,Aino,1865,1933
74292,74458,Text,2024-09-22,Victoria,en,,,,"Chater, Arthur G. [Translator]",Hamsun,Knut,1859,1952


Something strange happened. We are back to 73k rows, meaning that the number has gone back up. The reason is that filtering does not change `df_pg_catalog` itself: we did not save the result of the last query and are therefore starting from scratch. We have to apply both conditions and save the result when we do so.
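The following sketch, on a small made-up dataframe, shows why the number went back up: a filter returns a new, smaller dataframe and leaves the original untouched. The names `df_toy` and `english_only` are hypothetical.

```python
import pandas as pd

# Toy dataframe with three rows.
df_toy = pd.DataFrame({
    "language": ["en", "fr", "en"],
    "type": ["Text", "Text", "Sound"],
})

# Filtering returns a new dataframe; we keep it by assigning it to a name.
english_only = df_toy[df_toy.language == "en"]

# The original still has all of its rows.
print(len(df_toy), len(english_only))
```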

### 3.2 Combining Queries with `&`


We want to save both queries and eliminate things that are not in English and things that are not texts. In theory, we could do the following:

```python
df_pg_catalog_english = df_pg_catalog[df_pg_catalog.language=='en']
df_pg_catalog_english_texts = df_pg_catalog_english[df_pg_catalog_english.type=='Text']
```

This first saves one copy of English works and then another copy of English texts.

In [37]:
df_pg_catalog_english = df_pg_catalog[df_pg_catalog.language=='en']
df_pg_catalog_english_texts = df_pg_catalog_english[df_pg_catalog_english.type=='Text']
df_pg_catalog_english_texts

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,,Jefferson,Thomas,1743,1826
1,2,Text,1972-12-01,The United States Bill of Rights The Ten Orig...,en,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,,United States,,,
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,,Kennedy,John F. (John Fitzgerald),1917,1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address Given November 1...,en,Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,,Lincoln,Abraham,1809,1865
4,5,Text,1975-12-01,The United States Constitution,en,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,,United States,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74281,74447,Text,2024-09-20,Discourses of Brigham Young,en,Latter Day Saint churches -- Doctrines,BX,,"Widtsoe, John Andreas, 1872-1952 [Editor]",Young,Brigham,1801,1877
74286,74452,Text,2024-09-20,Humbug,en,,,,,Delafield,E. M.,1890,1943
74288,74454,Text,2024-09-21,Report on the Indian schools of Manitoba and t...,en,,,,,Bryce,P. H. (Peter Henderson),1853,1932
74292,74458,Text,2024-09-22,Victoria,en,,,,"Chater, Arthur G. [Translator]",Hamsun,Knut,1859,1952


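This gets us the result we want, but it's pretty cumbersome because we have two separate queries. We can instead combine the conditions in a single query using the `&` (logical AND) operator. Each condition needs its own parentheses, because Python evaluates `&` before `==`.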

- **`&` (Logical AND)**
  - Combines two conditions.
  - Both conditions must be **True** for the result to be **True**.
  

```python
df_pg_catalog[(df_pg_catalog.language=='en') & 
                (df_pg_catalog.type=='Text')
]
```

We are returning a dataframe `df_pg_catalog[]` where both `(df_pg_catalog.language=='en')` AND  `(df_pg_catalog.type=='Text')` are true.
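One way to keep combined queries readable is to store each condition in a variable first and then combine the variables. This is only a sketch on made-up data; the names `is_english` and `is_text` are hypothetical and not part of the lesson's notebook.

```python
import pandas as pd

# Toy stand-in for the catalog.
df_toy = pd.DataFrame({
    "language": ["en", "fr", "en", "en"],
    "type": ["Text", "Text", "Sound", "Text"],
})

# Each condition is a column of True/False values.
is_english = df_toy.language == "en"
is_text = df_toy.type == "Text"

# & keeps only the rows where both conditions are True.
english_texts = df_toy[is_english & is_text]
print(english_texts)
```

Because each comparison is already finished when it is stored, no extra parentheses are needed when combining the variables.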


In [39]:
df_pg_catalog[(df_pg_catalog.language=='en') & 
                (df_pg_catalog.type=='Text')
]


Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,,Jefferson,Thomas,1743,1826
1,2,Text,1972-12-01,The United States Bill of Rights The Ten Orig...,en,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,,United States,,,
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,,Kennedy,John F. (John Fitzgerald),1917,1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address Given November 1...,en,Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,,Lincoln,Abraham,1809,1865
4,5,Text,1975-12-01,The United States Constitution,en,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,,United States,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74281,74447,Text,2024-09-20,Discourses of Brigham Young,en,Latter Day Saint churches -- Doctrines,BX,,"Widtsoe, John Andreas, 1872-1952 [Editor]",Young,Brigham,1801,1877
74286,74452,Text,2024-09-20,Humbug,en,,,,,Delafield,E. M.,1890,1943
74288,74454,Text,2024-09-21,Report on the Indian schools of Manitoba and t...,en,,,,,Bryce,P. H. (Peter Henderson),1853,1932
74292,74458,Text,2024-09-22,Victoria,en,,,,"Chater, Arthur G. [Translator]",Hamsun,Knut,1859,1952


### 3.3 Looking for keywords with `str.contains()`

So far we have looked at columns that have a fairly regular set of values: 7 media types and 118 language values. An exact match with `==` will not work for `subjects`, however, because there are too many different strings.

In [42]:
df_pg_catalog[df_pg_catalog.subjects=="Virginia"]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death


There is no text whose subject is exactly `Virginia`. Nevertheless, there are many texts that have the word `Virginia` somewhere in their subject string. This is where the method `str.contains` is extremely useful. It will look through each string and find any match for the word we are looking for. We can even make the search case-insensitive and ignore `na` (missing) values.

```python
df_pg_catalog.subjects.str.contains('Virginia', case=False, na=False)
```

`case` controls whether the search is case sensitive (the default is `True`). Since it doesn't matter to us whether Virginia is capitalized or not, we set it to `False`. `na=False` means that the method returns `False` for any record whose `subjects` value is missing (`na`).
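The effect of the two parameters is easiest to see side by side. A minimal sketch on a made-up Series (the subject strings are invented for illustration):

```python
import pandas as pd

# Toy subjects, including a lower-case spelling and a missing value.
subjects = pd.Series([
    "Virginia -- Fiction",
    "West virginia -- History",
    "Civil rights",
    None,
])

# An exact match finds nothing, because no subject is only "Virginia".
exact = subjects == "Virginia"

# Default case=True: only the capitalized spelling matches.
case_sensitive = subjects.str.contains("Virginia", na=False)

# case=False: both spellings match; na=False turns the missing value into False.
case_insensitive = subjects.str.contains("Virginia", case=False, na=False)

print(exact.sum(), case_sensitive.sum(), case_insensitive.sum())
```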

In [44]:
df_pg_catalog[df_pg_catalog.subjects.str.contains('Virginia', case=False, na=False)]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
5,6,Text,1976-12-01,Give Me Liberty or Give Me Death,en,"Speeches, addresses, etc., American; United St...",E201,American Revolutionary War; Browsing: History ...,,Henry,Patrick,1736,1799
72,73,Text,1993-07-01,The Red Badge of Courage: An Episode of the Am...,en,Historical fiction; War stories; United States...,PS,US Civil War; Historical Fiction; Best Books E...,,Crane,Stephen,1871,1900
351,370,Text,1995-12-01,The Fortunes and Misfortunes of the Famous Mol...,en,London (England) -- Fiction; Picaresque litera...,PR,Browsing: Culture/Civilization/Society; Browsi...,,Defoe,Daniel,,
444,463,Text,1996-03-01,The Red Badge of Courage: An Episode of the Am...,en,Historical fiction; War stories; United States...,PS,"US Civil War; Historical Fiction; Bestsellers,...",,Crane,Stephen,1871,1900
2351,2384,Text,2000-11-01,The Deliverance: A Romance of the Virginia Tob...,en,Virginia -- Fiction; Tobacco farmers -- Fiction,PS,"Bestsellers, American, 1895-1923; Browsing: Cu...",,Glasgow,Ellen Anderson Gholson,1873,1945
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69266,69369,Text,2022-11-17,Deep channel,en,Villages -- Fiction; Man-woman relationships -...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Montague,Margaret Prescott,1878,1955
69868,70010,Text,2023-02-10,The shadow between them;,en,Orphans -- Fiction; Young women -- Fiction; Di...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Miller,Alex. McVeigh,1850,1937
70166,70331,Text,2023-03-21,Educational laws of Virginia,en,African Americans -- Education -- Virginia; Do...,LC,Browsing: Culture/Civilization/Society; Browsi...,,Douglass,Margaret Crittenden,,
70718,70883,Text,2023-05-30,Doctor Hathern's daughters,en,Sisters -- Fiction; Domestic fiction; Mate sel...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Holmes,Mary Jane,1825,1907


This table is much more manageable! Only 177 results.

### 3.4 Eliminating results with ~ NOT

The result from the query for texts where the subject contains `Virginia` is a lot more manageable, but it also includes texts that may not be interesting. For example, it contains *The Red Badge of Courage* by Stephen Crane, which is fiction. We may remember that fiction gave Moretti a lot of headaches, so perhaps we simply eliminate it. If we want to exclude fiction from our search, we can use the `~` operator to indicate NOT. That is, return the row as long as the subject does **not** contain `fiction`.
```python
df_pg_catalog[~df_pg_catalog.subjects.str.contains('fiction', case=False, na=False)]
```
We simply put the `~` to indicate **not**.
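On a small made-up Series the effect of `~` is easy to check: every `True` becomes `False` and vice versa. The sketch also shows a side effect worth knowing about, namely that rows with a missing subject end up being kept, because `na=False` marks them as "does not contain fiction".

```python
import pandas as pd

# Toy subjects with one missing value.
subjects = pd.Series(["Virginia -- Fiction", "Virginia -- History", None])

# True where the subject mentions fiction; the missing value becomes False.
has_fiction = subjects.str.contains("fiction", case=False, na=False)

# ~ flips every value, so the missing row is now True and would be kept.
not_fiction = ~has_fiction

print(list(has_fiction), list(not_fiction))
```

This is why records with an empty `subjects` field still appear in the result of the negated query.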

In [48]:
df_pg_catalog[~df_pg_catalog.subjects.str.contains('fiction', case=False, na=False)]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,,Jefferson,Thomas,1743,1826
1,2,Text,1972-12-01,The United States Bill of Rights The Ten Orig...,en,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,,United States,,,
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,,Kennedy,John F. (John Fitzgerald),1917,1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address Given November 1...,en,Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,,Lincoln,Abraham,1809,1865
4,5,Text,1975-12-01,The United States Constitution,en,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,,United States,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74289,74455,Text,2024-09-21,La survivante,fr,,,,,Balde,Jean,1885,1938
74290,74456,Text,2024-09-21,Lord Lister No. 0029: Het Indische raadsel,nl,,,,"Blankensee, Theo von, 1881-1928",Matull,Kurt,1872,1920
74291,74457,Text,2024-09-21,Voimakasta väkeä,fi,,,,,Malmberg,Aino,1865,1933
74292,74458,Text,2024-09-22,Victoria,en,,,,"Chater, Arthur G. [Translator]",Hamsun,Knut,1859,1952


Again, we are back to more results because we did not combine the queries.

In [50]:
df_pg_catalog[(df_pg_catalog.subjects.str.contains('Virginia', case=False, na=False)) &
                (~df_pg_catalog.subjects.str.contains('fiction', case=False, na=False))
]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
5,6,Text,1976-12-01,Give Me Liberty or Give Me Death,en,"Speeches, addresses, etc., American; United St...",E201,American Revolutionary War; Browsing: History ...,,Henry,Patrick,1736,1799
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,1936
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown Who Escaped from ...,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,,
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,,
65081,65160,Text,2021-04-25,The Discoveries of John Lederer In three sever...,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,,
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,,


This eliminated another 70 texts. Great!

### 3.5 Broadening a search with `|` **or** and `.*` **Wildcard**

We notice in the results above that there are still texts that might not be that interesting. For example, *Give Me Liberty or Give Me Death* is a speech. We also want to eliminate this from the results. Intuitively, we could add another condition:

```python
~df_pg_catalog.subjects.str.contains('speeches', case=False, na=False)
```

This creates two problems.
1. We are generating another line of code that can become cumbersome to read.
2. Since we are looking specifically for subjects that contain `speeches`, the filter would miss any subject that contains only `speech`.

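We can fix the first problem by using the `|` (**or**) operator. Note that the `|` goes inside the search string, because `str.contains` treats the string as a regular expression by default.

```python
~df_pg_catalog.subjects.str.contains('fiction|speeches', case=False, na=False)
```

Now we actually look for `fiction` or `speeches` in the subject.

We can fix the second problem by searching for the stem `speech` followed by the wildcard `.*`, as in `speech.*`. In a regular expression, `.` stands for any character and `*` means "zero or more times", so this matches `speech` and all possible variants, including `speeches`. Because `str.contains` already matches partial strings, the shorter pattern `speech` would find the same rows; the wildcard mainly makes the intent explicit.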
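A short sketch on made-up subject strings shows both points at once: the `|` inside the pattern means "or", and a search for the stem `speech` also finds `Speeches`, with or without the trailing `.*`.

```python
import pandas as pd

# Toy subjects, invented for illustration.
subjects = pd.Series([
    "Speeches, addresses, etc., American",
    "Freedom of speech",
    "Virginia -- Fiction",
    "Virginia -- History",
])

# One pattern, two alternatives: matches "fiction" OR "speech".
either = subjects.str.contains("fiction|speech", case=False, na=False)

# Adding .* after the stem selects exactly the same rows.
with_wildcard = subjects.str.contains("fiction|speech.*", case=False, na=False)

print(list(either))
print(either.equals(with_wildcard))
```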



In [89]:
df_pg_catalog[(df_pg_catalog.subjects.str.contains('Virginia', case=False, na=False)) &
                (~df_pg_catalog.subjects.str.contains('fiction|speech.*', case=False, na=False))
]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,1936
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621
4721,4762,Text,2003-12-01,Civil Government of Virginia A Text-book for ...,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown Who Escaped from ...,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,,
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,,
65081,65160,Text,2021-04-25,The Discoveries of John Lederer In three sever...,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,,
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,,


This eliminated two more results that were speeches.

### 3.6 Combining it all

Now that we have the logic in place, we can combine everything into one long query:

```python
df_pg_catalog[
    (df_pg_catalog.language == 'en') &
    (df_pg_catalog.type == 'Text') &
    (df_pg_catalog.subjects.str.contains('Virginia', case=False, na=False)) & 
    (~df_pg_catalog.subjects.str.contains('fiction|speech.*', case=False, na=False)) 
] 
```

In [86]:
df_pg_catalog[
    (df_pg_catalog.language == 'en') & 
    (df_pg_catalog.type == 'Text') &
    (df_pg_catalog.subjects.str.contains('Virginia', case=False, na=False)) & 
    (~df_pg_catalog.subjects.str.contains('fiction|speech.*', case=False, na=False)) 
] 

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,1936
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621
4721,4762,Text,2003-12-01,Civil Government of Virginia A Text-book for ...,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown Who Escaped from ...,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,,
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,,
65081,65160,Text,2021-04-25,The Discoveries of John Lederer In three sever...,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,,
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,,


Great, that gave us 96 results. We can now save this into a new dataframe.

## 4 Saving to a new Dataframe with `.copy()`

So far none of the queries we made were permanent. All we did was write out the query and test the result. As you get more experienced, you will likely not separate this out into different test queries, but will run one query all at once.

We can save the filtered dataframe in two ways: by plain assignment or as an explicit **deep copy**.

When we only use the `=` operator, the query result is stored under a new name, `df_virginia_history`, but pandas does not guarantee that it is fully independent of `df_pg_catalog`. Depending on the operation and the pandas version, this can have unexpected results: modifying the new dataframe may trigger a `SettingWithCopyWarning`, or a change to one dataframe may show up in the other. We can prevent this by using the method `.copy()` at the end of the query chain to make a **deep copy**. This is an entirely separate dataframe.

In [84]:
df_virginia_history = df_pg_catalog[
    (df_pg_catalog.language == 'en') & 
    (df_pg_catalog.type == 'Text') &
    (df_pg_catalog.subjects.str.contains('Virginia', case=False, na=False)) & 
    (~df_pg_catalog.subjects.str.contains('fiction|speech.*', case=False, na=False)) 
].copy()
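To convince ourselves that the copy really is independent, we can modify it and check that the original stays the same. This is a sketch on a made-up dataframe; `df_toy` and `df_english` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the catalog.
df_toy = pd.DataFrame({
    "language": ["en", "fr", "en"],
    "title": ["humbug", "la survivante", "victoria"],
})

# Filter and make an explicit, independent copy.
df_english = df_toy[df_toy.language == "en"].copy()

# Changing the copy...
df_english["title"] = df_english["title"].str.upper()

# ...leaves the original untouched.
print(list(df_english["title"]))
print(list(df_toy["title"]))
```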

In [45]:
df_virginia_history

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,1936
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621
4721,4762,Text,2003-12-01,Civil Government of Virginia A Text-book for ...,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown Who Escaped from ...,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,,
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,,
65081,65160,Text,2021-04-25,The Discoveries of John Lederer In three sever...,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,,
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,,


We can use the `.to_pickle()` method to export this so that we do not have to do all this again.

In [47]:
df_virginia_history.to_pickle('virginia_history.pickle')
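It can be reassuring to load the file again and confirm that nothing was lost. The sketch below does this round trip with a made-up dataframe in a temporary folder, so it does not touch `virginia_history.pickle`; the file name `toy.pickle` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for df_virginia_history.
df_toy = pd.DataFrame({"text_id": [1, 2], "title": ["first", "second"]})

# Write to a temporary folder and read the file straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pickle")
    df_toy.to_pickle(path)
    df_loaded = pd.read_pickle(path)

# equals() checks that values, columns and index all match.
round_trip_ok = df_loaded.equals(df_toy)
print(round_trip_ok)
```

In the next notebook, `pd.read_pickle('virginia_history.pickle')` would load the filtered catalog in the same way.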