# Lesson 2.2: Scraping Gutenberg: Filtering Queries

So far, none of the work we have done on the Gutenberg catalog has actually gotten us any texts. Instead we did the following:
- Downloaded a CSV file
- Converted it to a dataframe
- Cleaned the dataframe
- Exported as `.pickle` file

We did all this because the dataframe will be used to retrieve actual texts from Gutenberg, and we want to be sure everything is clean before we modify it.

In this lesson you will learn how to filter the dataframe for the specific values you want by running queries. Since we don't want to download all of Gutenberg, we will create a subset of the data before we do the actual web scraping.

## Load libraries

Libraries need to be imported again in each new notebook.

In [25]:
import pandas as pd

## Import `pg_catalog_clean.pickle`


You will need to load the `.pickle` file and assign it to a dataframe.

In [26]:
df_pg_catalog = pd.read_pickle('pg_catalog_clean.pickle')

In [27]:
df_pg_catalog

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
0,1,Text,1971-12-01,The Declaration of Independence of the United ...,en,"United States -- History -- Revolution, 1775-1...",E201; JK,Politics; American Revolutionary War; United S...,,Jefferson,Thomas,1743,1826
1,2,Text,1972-12-01,The United States Bill of Rights,en,Civil rights -- United States -- Sources; Unit...,JK; KF,Politics; American Revolutionary War; United S...,,United States,,,
2,3,Text,1973-11-01,John F. Kennedy's Inaugural Address,en,United States -- Foreign relations -- 1961-196...,E838,Browsing: History - American; Browsing: Politics,,Kennedy,John F. (John Fitzgerald),1917,1963
3,4,Text,1973-11-01,Lincoln's Gettysburg Address,en,Consecration of cemeteries -- Pennsylvania -- ...,E456,US Civil War; Browsing: History - American; Br...,,Lincoln,Abraham,1809,1865
4,5,Text,1975-12-01,The United States Constitution,en,United States -- Politics and government -- 17...,JK; KF,United States; Politics; American Revolutionar...,,United States,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
74250,74416,Text,2024-09-14,Proteus,en,,,,,Lee,Vernon,1856,1935
74251,74417,Text,2024-09-14,The kaleidoscope,en,,,,,Brewster,David,1781,1868
74252,74418,Text,2024-09-15,Their island home,en,,,,"Murphy, Henry Cruse, 1810-1882 [Illustrator]; ...",Verne,Jules,1828,"1905; Murphy, Henry Cruse, 1810-1882 [Illustra..."
74253,74419,Text,2024-09-15,Reuben Sachs; a sketch,en,,,,,Levy,Amy,1861,1889


## Filtering the dataframe

Filtering data in pandas is pretty intuitive. Essentially you select the column you want to filter and then use the method `.str.contains()` to search for the value you want to retrieve.

```python
df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False)
```
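`case` controls whether the search is case sensitive. Since it doesn't matter whether Virginia is capitalized, we set this to `False`. `na=False` treats rows with a missing `subjects` value as non-matches instead of returning a missing value for them.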
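To see what `case` and `na` do, here is a minimal sketch on a tiny made-up Series. The values are hypothetical stand-ins for the `subjects` column, not rows from the catalog.

```python
import pandas as pd

# Hypothetical toy data standing in for the `subjects` column.
subjects = pd.Series([
    "Virginia -- Fiction",
    "west virginia -- History",
    "Ohio -- History",
    None,  # a missing subject
])

# case=False ignores capitalization; na=False counts missing values as non-matches.
mask_insensitive = subjects.str.contains("Virginia", case=False, na=False)

# case=True (the default) only matches the exact capitalization.
mask_sensitive = subjects.str.contains("Virginia", case=True, na=False)

print(mask_insensitive.tolist())
print(mask_sensitive.tolist())
```

The lowercase `west virginia` row is only matched when `case=False`, and the missing value is never matched.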

In [28]:
df_pg_catalog['subjects'][df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False)]

5        Speeches, addresses, etc., American; United St...
72       Historical fiction; War stories; United States...
351      London (England) -- Fiction; Picaresque litera...
444      Historical fiction; War stories; United States...
2351       Virginia -- Fiction; Tobacco farmers -- Fiction
                               ...                        
69266    Villages -- Fiction; Man-woman relationships -...
69868    Orphans -- Fiction; Young women -- Fiction; Di...
70166    African Americans -- Education -- Virginia; Do...
70718    Sisters -- Fiction; Domestic fiction; Mate sel...
72725    Women dancers -- France -- Fiction; Businessme...
Name: subjects, Length: 177, dtype: string
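The expression inside the square brackets is a boolean mask: one `True` or `False` per row. A small sketch, using a hypothetical three-row dataframe, shows how the mask both selects rows and lets you count matches without printing them.

```python
import pandas as pd

# Hypothetical toy dataframe; titles and subjects are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "subjects": ["Virginia -- Fiction", "Ohio -- History", "Virginia -- History"],
})

# One True/False value per row.
mask = toy["subjects"].str.contains("Virginia", case=False, na=False)

# True counts as 1, so summing the mask counts the matching rows.
n_matches = int(mask.sum())

# Passing the mask to the dataframe keeps only the rows where it is True.
matching_rows = toy[mask]

print(n_matches)
print(matching_rows)
```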

In [29]:
df_pg_catalog[df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False)]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
5,6,Text,1976-12-01,Give Me Liberty or Give Me Death,en,"Speeches, addresses, etc., American; United St...",E201,American Revolutionary War; Browsing: History ...,,Henry,Patrick,1736,1799
72,73,Text,1993-07-01,The Red Badge of Courage: An Episode of the Am...,en,Historical fiction; War stories; United States...,PS,US Civil War; Historical Fiction; Best Books E...,,Crane,Stephen,1871,1900
351,370,Text,1995-12-01,The Fortunes and Misfortunes of the Famous Mol...,en,London (England) -- Fiction; Picaresque litera...,PR,Browsing: Culture/Civilization/Society; Browsi...,,Defoe,Daniel,1661?,1731
444,463,Text,1996-03-01,The Red Badge of Courage: An Episode of the Am...,en,Historical fiction; War stories; United States...,PS,"US Civil War; Historical Fiction; Bestsellers,...",,Crane,Stephen,1871,1900
2351,2384,Text,2000-11-01,The Deliverance: A Romance of the Virginia Tob...,en,Virginia -- Fiction; Tobacco farmers -- Fiction,PS,"Bestsellers, American, 1895-1923; Browsing: Cu...",,Glasgow,Ellen Anderson Gholson,1873,1945
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69266,69369,Text,2022-11-17,Deep channel,en,Villages -- Fiction; Man-woman relationships -...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Montague,Margaret Prescott,1878,1955
69868,70010,Text,2023-02-10,The shadow between them;,en,Orphans -- Fiction; Young women -- Fiction; Di...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Miller,Alex. McVeigh,"Mrs., 1850",1937
70166,70331,Text,2023-03-21,Educational laws of Virginia,en,African Americans -- Education -- Virginia; Do...,LC,Browsing: Culture/Civilization/Society; Browsi...,,Douglass,Margaret Crittenden,1822,
70718,70883,Text,2023-05-30,Doctor Hathern's daughters,en,Sisters -- Fiction; Domestic fiction; Mate sel...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Holmes,Mary Jane,1825,1907


This resulted in 177 records. Recall that the Gutenberg catalog has roughly 74,000 records, so we have reduced it quite a bit.

We can try to narrow this down further by excluding all works that are not in English (`en`):

```python
df_pg_catalog[df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False) & (df_pg_catalog['language'] == 'en')]
```

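Instead of checking whether the string contains `en`, we are now asking whether it equals `en` exactly. The comparison is wrapped in parentheses because `&` binds more tightly than `==` in Python.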
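The difference between "contains" and "equals" can be sketched on toy data. The rows below are hypothetical, and the `enm` language code is included only to illustrate a value that contains `en` without being equal to it.

```python
import pandas as pd

# Hypothetical toy dataframe; values are made up for illustration.
toy = pd.DataFrame({
    "subjects": ["Virginia -- History", "Virginia -- Fiction",
                 "Virginia -- Maps", "Ohio -- History"],
    "language": ["en", "fr", "enm", "en"],
})

is_virginia = toy["subjects"].str.contains("Virginia", case=False, na=False)

# Exact comparison: only rows whose language is precisely "en".
is_english_exact = toy["language"] == "en"

# Substring search: also matches "enm", which is probably not what we want.
is_english_loose = toy["language"].str.contains("en", na=False)

# Naming each mask first avoids having to think about operator precedence.
english_virginia = toy[is_virginia & is_english_exact]

print(is_english_exact.tolist())
print(is_english_loose.tolist())
print(english_virginia)
```

Storing each condition in its own variable is optional, but it can make longer queries easier to read than one long bracketed expression.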

In [30]:
df_pg_catalog[df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False) & (df_pg_catalog['language'] == 'en')]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
5,6,Text,1976-12-01,Give Me Liberty or Give Me Death,en,"Speeches, addresses, etc., American; United St...",E201,American Revolutionary War; Browsing: History ...,,Henry,Patrick,1736,1799
72,73,Text,1993-07-01,The Red Badge of Courage: An Episode of the Am...,en,Historical fiction; War stories; United States...,PS,US Civil War; Historical Fiction; Best Books E...,,Crane,Stephen,1871,1900
351,370,Text,1995-12-01,The Fortunes and Misfortunes of the Famous Mol...,en,London (England) -- Fiction; Picaresque litera...,PR,Browsing: Culture/Civilization/Society; Browsi...,,Defoe,Daniel,1661?,1731
444,463,Text,1996-03-01,The Red Badge of Courage: An Episode of the Am...,en,Historical fiction; War stories; United States...,PS,"US Civil War; Historical Fiction; Bestsellers,...",,Crane,Stephen,1871,1900
2351,2384,Text,2000-11-01,The Deliverance: A Romance of the Virginia Tob...,en,Virginia -- Fiction; Tobacco farmers -- Fiction,PS,"Bestsellers, American, 1895-1923; Browsing: Cu...",,Glasgow,Ellen Anderson Gholson,1873,1945
...,...,...,...,...,...,...,...,...,...,...,...,...,...
69266,69369,Text,2022-11-17,Deep channel,en,Villages -- Fiction; Man-woman relationships -...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Montague,Margaret Prescott,1878,1955
69868,70010,Text,2023-02-10,The shadow between them;,en,Orphans -- Fiction; Young women -- Fiction; Di...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Miller,Alex. McVeigh,"Mrs., 1850",1937
70166,70331,Text,2023-03-21,Educational laws of Virginia,en,African Americans -- Education -- Virginia; Do...,LC,Browsing: Culture/Civilization/Society; Browsi...,,Douglass,Margaret Crittenden,1822,
70718,70883,Text,2023-05-30,Doctor Hathern's daughters,en,Sisters -- Fiction; Domestic fiction; Mate sel...,PS,Browsing: Culture/Civilization/Society; Browsi...,,Holmes,Mary Jane,1825,1907


This gives us 176 records, which is still too many.

One of the issues Moretti ran into is the ambiguity of fiction. We can perhaps avoid that ambiguity by excluding those types of books from our corpus. We can do this by adding another condition that looks for `fiction` and putting a `~` in front of it to negate the result, i.e. **do not** include anything that has the word fiction in it.
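As a minimal sketch of what `~` does, the following flips a boolean mask built from a hypothetical toy Series.

```python
import pandas as pd

# Hypothetical toy data standing in for the `subjects` column.
subjects = pd.Series([
    "Virginia -- Fiction",
    "Virginia -- History",
    "Historical fiction; Virginia",
    None,  # a missing subject
])

# True wherever the word "fiction" appears, regardless of capitalization.
has_fiction = subjects.str.contains("fiction", case=False, na=False)

# ~ flips every value: True becomes False and False becomes True.
not_fiction = ~has_fiction

print(has_fiction.tolist())
print(not_fiction.tolist())
```

Note that the row with a missing subject ends up as `True` after negation, because `na=False` first marked it as "does not contain fiction". In the full query this does not matter, since the `Virginia` condition already drops rows with no subject.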


In [31]:
df_pg_catalog[
    df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False) & 
    (df_pg_catalog['language'] == 'en') & 
    ~df_pg_catalog['subjects'].str.contains('fiction', case=False, na=False)
]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
5,6,Text,1976-12-01,Give Me Liberty or Give Me Death,en,"Speeches, addresses, etc., American; United St...",E201,American Revolutionary War; Browsing: History ...,,Henry,Patrick,1736,1799
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,"1936; Johnson, Allen, 1870-1931 [Editor]"
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,1816?,"1897; Stearns, Charles (Abolitionist) [Contrib..."
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,1771?,"1839?; Hurnard, Robert [Author of introduction..."
65081,65160,Text,2021-04-25,The Discoveries of John Lederer,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,1644,"; Talbot, William, Sir, -1691 [Translator]"
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,"Jr. [Editor]; Pitkin, Thomas M., 1901",1988 [Editor]


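This is a bit better since we are now down to 99 texts. We can get even more fine-grained by excluding multiple words. There are also speeches in the data set, and we can exclude these without writing a whole new line of code. Because `.str.contains()` treats its pattern as a regular expression by default, we can use the `|` ("or") operator to exclude anything matching `fiction` or `speech`. Note that `*` is not a wildcard in a regular expression: it means "zero or more of the preceding character", so `speech*` matches `speec` followed by any number of `h`s. Since `.str.contains()` looks for the pattern anywhere in the string, plain `speech` would already catch both `speech` and `speeches`; the trailing `*` in the pattern below is harmless but not required.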
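The behavior of `|` and `*` inside `.str.contains()` can be checked on toy data. This is a sketch with hypothetical values; it relies on `.str.contains()` interpreting the pattern as a regular expression, which is its default.

```python
import pandas as pd

# Hypothetical toy data standing in for the `subjects` column.
subjects = pd.Series([
    "Speeches, addresses, etc., American",
    "Virginia -- Fiction",
    "Virginia -- History",
])

# `|` means "or": match rows containing either alternative.
with_star = subjects.str.contains("fiction|speech*", case=False, na=False)
without_star = subjects.str.contains("fiction|speech", case=False, na=False)

# Both patterns flag the same rows here, because "speech" is a substring of "speeches".
same_result = with_star.equals(without_star)

# `*` applies to the preceding "h" only, so even "speec" (no "h") matches.
star_matches_speec = bool(pd.Series(["speec"]).str.contains("speech*").iloc[0])

print(with_star.tolist())
print(same_result, star_matches_speec)
```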

In [32]:
df_pg_catalog[
    df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False) & 
    (df_pg_catalog['language'] == 'en') & 
    ~df_pg_catalog['subjects'].str.contains('fiction|speech*', case=False, na=False)
]

Unnamed: 0,text_id,type,issued,title,language,subjects,locc,bookshelves,second_author,last_name,first_name,birth,death
2637,2674,Text,2001-06-01,The Complete Writings of Charles Dudley Warner...,en,Autobiographies; Virginia -- Description and t...,PS,Browsing: Biographies; Browsing: Literature; B...,,Warner,Charles Dudley,1829,1900
2858,2898,Text,2001-11-01,Pioneers of the Old South: A Chronicle of Engl...,en,"Southern States -- History -- Colonial period,...",E151; F206,United States; Children's History; Browsing: H...,"Johnson, Allen, 1870-1931 [Editor]",Johnston,Mary,1870,"1936; Johnson, Allen, 1870-1931 [Editor]"
3085,3126,Text,2004-10-10,On Horseback,en,California -- Description and travel; Virginia...,F206,Browsing: History - American; Browsing: Travel...,,Warner,Charles Dudley,1829,1900
4206,4247,Text,2003-07-01,A Briefe and True Report of the New Found Land...,en,Indians of North America -- North Carolina; Ro...,F206,Browsing: History - American; Browsing: Histor...,,Harriot,Thomas,1560,1621
4721,4762,Text,2003-12-01,Civil Government of Virginia,en,Virginia -- Politics and government,JK,Browsing: History - American; Browsing: Politi...,,Fox,William Fayette,1836,1909
...,...,...,...,...,...,...,...,...,...,...,...,...,...
64913,64992,Text,2021-04-05,Narrative of Henry Box Brown,en,African American abolitionists -- Biography; B...,E300,Browsing: Biographies; Browsing: Culture/Civil...,"Stearns, Charles (Abolitionist) [Contributor]",Brown,Henry Box,1816?,"1897; Stearns, Charles (Abolitionist) [Contrib..."
64948,65027,Text,2021-04-08,"A narrative of some remarkable incidents, in t...",en,Fugitive slaves -- United States -- Biography;...,E300; HT,Browsing: Biographies; Browsing: Culture/Civil...,"Hurnard, Robert [Author of introduction, etc.]",Bayley,Solomon,1771?,"1839?; Hurnard, Robert [Author of introduction..."
65081,65160,Text,2021-04-25,The Discoveries of John Lederer,en,Indians of North America -- North Carolina -- ...,F206,Browsing: History - American; Browsing: Travel...,"Talbot, William, Sir, -1691 [Translator]",Lederer,John,1644,"; Talbot, William, Sir, -1691 [Translator]"
67666,67745,Text,2022-03-31,Yorktown: Climax of the Revolution,en,"Virginia -- History -- Revolution, 1775-1783; ...",E201,Browsing: History - American; Browsing: Histor...,"Pitkin, Thomas M., 1901-1988 [Editor]",Hatch,Charles E.,"Jr. [Editor]; Pitkin, Thomas M., 1901",1988 [Editor]


We can save all this as a new dataframe.

In [33]:
df_virginia_history = df_pg_catalog[
    df_pg_catalog['subjects'].str.contains('Virginia', case=False, na=False) & 
    (df_pg_catalog['language'] == 'en') & 
    ~df_pg_catalog['subjects'].str.contains('fiction|speech*', case=False, na=False)
]

We can use the `.to_pickle()` method to export this.

In [34]:
df_virginia_history.to_pickle('virginia_history.pickle')
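If you want to confirm that an export worked, one option is to read the file back and compare it with the original. The sketch below does this with a hypothetical toy dataframe and a temporary file name (`toy_subset.pickle`), so it does not touch the real `virginia_history.pickle`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy dataframe standing in for the filtered subset.
toy = pd.DataFrame({
    "text_id": [101, 102],
    "title": ["First toy title", "Second toy title"],
})

# Write to a temporary directory so nothing is left behind afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_subset.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# .equals() checks that values, index, and dtypes all survived the round trip.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```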