# Extracting Structured Data from a Text File

The text file breaks each record down into labeled fields. That's great! However, all the fields sit together in one text file, and I'd prefer to have them in a format that I can analyze more easily.

There are several ways to extract the data I want. I'm going to use Python's built-in string and list operations (splitting and slicing) to do it.

## Opening the File

Please see the instructions on downloading a file from MLA.  

When I downloaded the file, I named it "1950_readable.txt." You should name yours whatever you like, but a .txt file works well for these operations. 

In the cell below, I open the file, read it, and split the contents into a list in which each line is an item. I've called it metadata_list and will continue to refer to it throughout this document.

In [72]:
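# Open the downloaded file (assumes it is saved as UTF-8, so accented characters read correctly)
infile = open('1950_readable.txt', encoding='utf-8')
# Read the whole file into one string
metadata = infile.read()
# Split the string into a list with one item per line
metadata_list = metadata.split("\n")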

## Getting Data

The records are very reliably structured, with labels that identify the various fields. If I want the information in a field without its label, I can use Python's string slicing to get it. The script below finds every line that includes the text "Period:" and prints the line from index 8 onward. Python counts characters from 0, and the label plus the space after it fills indexes 0 through 7, so the value begins at index 8 (the ninth character).

In [73]:
for line in metadata_list:
    if "Period:" in line:
        print(line[8:])

1800-1899
1800-1899
1800-1899
1800-1899
1900-1999
1800-1899
1800-1899
1800-1899
1900-1999
1800-1899
1800-1899
1900-1999
1900-1999
1800-1899
1800-1899
1800-1899
1800-1899
1700-1799
1800-1899
1800-1899
1800-1899
1800-1899
1800-1899
1800-1899
1800-1899
1800-1899
1800-1899
1800-1899
1600-1699
1900-1999
1800-1899
1800-1899
1800-1899
1800-1899
1800-1899
1900-1999
1700-1799
1800-1899
1800-1899
1800-1899
1800-1899
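
To see why the slice starts at 8, it can help to try it on a single line. The sketch below uses a made-up line (`sample_line` is an illustration in the same shape as the records, not a line copied from the file) and assumes the label is followed by a single space.

```python
# A made-up line in the same shape as the records (assumption for illustration).
sample_line = "Period: 1800-1899"

# "Period:" is 7 characters; with the space after it, the label fills indexes 0 through 7.
label = "Period: "
print(len(label))        # 8

# Slicing from index 8 skips the label and keeps the value.
value = sample_line[8:]
print(value)             # 1800-1899
```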


This also works for the other fields. 

I got a little tired of counting the characters, so I found a character-counting tool online. The one I used is at http://www.charactercountonline.com/, but I'm sure many others exist.

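Remember, Python starts counting at 0, so the index of a character is always one less than its position when you count from 1. That is why the slice starts at the number of characters in the label, including the space after the colon: a 24-character label fills indexes 0 through 23, and the value begins at index 24.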
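
One way to avoid counting by hand is to let Python measure the label with `len()`. The helper below is a minimal sketch, not part of the original notebook: `extract_field` and `sample_lines` are hypothetical names, and the sample lines are made up to stand in for `metadata_list`. Unlike the cells in this document, it locates the label within the line first, so it would also work if a label did not start at the very beginning of a line.

```python
def extract_field(lines, label):
    """Return the text that follows `label` on every line that contains it."""
    values = []
    for line in lines:
        if label in line:
            # Find where the label starts, then skip past it.
            start = line.index(label) + len(label)
            # strip() removes the space left between the label and the value.
            values.append(line[start:].strip())
    return values


# Made-up sample lines standing in for metadata_list.
sample_lines = [
    "Period: 1800-1899",
    "Primary Subject Author: Melville, Herman (1819-1891)",
    "Classification: novel",
]

print(extract_field(sample_lines, "Primary Subject Author:"))
```

With the real data, the call would be `extract_field(metadata_list, "Period:")`, and the same function could be reused for each of the other labels.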

In [74]:
for line in metadata_list:
    if "Primary Subject Author: " in line:
        print(line[24:])

Lowell, James Russell (1819-1891)
Ingraham, Joseph Holt (1809-1860)
Irving, Washington (1783-1859)
Melville, Herman (1819-1891)
James, Henry, Jr. (1843-1916)
Emerson, Ralph Waldo (1803-1882)
James, Henry, Jr. (1843-1916)
role of Flower, Benjamin Orange (1858-1918)
role of Harte, Bret (1836-1902)
Eliot, T. S. (1888-1965)
Philippe, Charles-Louis (1874-1909)
Cranch, Christopher Pearse (1813-1892)
Melville, Herman (1819-1891)
Emerson, Ralph Waldo (1803-1882)
Franklin, Benjamin (1706-1790)
Dwight, John Sullivan (1813-1893)
Longfellow, Henry Wadsworth (1807-1882)
Holmes, Oliver Wendell (1809-1894)
treatment in Clemens, Samuel (1835-1910)
Clemens, Samuel (1835-1910)
Poe, Edgar Allan (1809-1849)
Melville, Herman (1819-1891)
Melville, Herman (1819-1891)
Melville, Herman (1819-1891)
Cotton, John (1584-1652)
Melville, Herman (1819-1891)
Whittier, John Greenleaf (1807-1892)
Whitman, Walt (1819-1892)
Hawthorne, Nathaniel (1804-1864)
Melville, Herman (1819-1891)
Cather, Willa (1873-1947)
Dwight, Tim

In [75]:
for line in metadata_list:
    if "Primary Subject Work:" in line:
        print(line[22:])

Astoria (1836)
The Ambassadors (1903)
The Arena (1889-1909)
Overland Monthly
'Preludes'; 'Rhapsody on a Windy Night'
Bubu de Montparnasse; Marie Donadieu
The Song of Hiawatha (1855)
Kalevala
The Autocrat of the Breakfast Table (1858)
Roughing It (1872)
Billy Budd
The Confidence-Man (1857)
'The Gentle Boy'
Pierre (1852)
Remarks on American Literature (1830)
La Démocratie en Amérique (1835-40)


In [76]:
classification_list_1950 = []
for line in metadata_list:
    if "Classification: " in line:
        print(line[16:])
        classification_list_1950.append(line[16:])

poetry; and prose
prose
novel
poetry
novel
poetry; and prose
prose
translation
poetry
prose
prose
novel
novel
novel
prose; manuscript notes; source study
short story
novel
criticism
prose
prose; manuscript notes


## Multiple Values in a Field

This can happen in any of these fields. In the classification output above, some entries hold multiple values separated by semicolons (for instance: prose; manuscript notes).
When you're working with the data in a CSV file, you want each value on its own. So, how can you separate them?

In [77]:
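separated_list_1950 = []
for line in classification_list_1950:
    # Split each entry on the semicolon and keep the result for later use
    values = line.split(";")
    print(values)
    separated_list_1950.append(values)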

['poetry', ' and prose']
['prose']
['novel']
['poetry']
['novel']
['poetry', ' and prose']
['prose']
['translation']
['poetry']
['prose']
['prose']
['novel']
['novel']
['novel']
['prose', ' manuscript notes', ' source study']
['short story']
['novel']
['criticism']
['prose']
['prose', ' manuscript notes']
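
Notice that every value after the first keeps the space that followed the semicolon (' and prose', ' manuscript notes'). A minimal sketch of one way to clean that up is to call `strip()` on each piece. The names `sample_classifications` and `separated` are hypothetical, and the sample holds just three entries copied from the output above.

```python
# Three entries copied from the classification output above.
sample_classifications = [
    "poetry; and prose",
    "novel",
    "prose; manuscript notes; source study",
]

separated = []
for entry in sample_classifications:
    # Split on the semicolon, then strip the leftover space from each piece.
    pieces = [piece.strip() for piece in entry.split(";")]
    separated.append(pieces)

print(separated)
```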


In [78]:
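# This doesn't work as intended: list.index() returns the position of the
# first item equal to `line`, so when the label lines are identical it
# prints the line after the first label every time.
for line in metadata_list:
    if "Subject Terms" in line:
        label_line = metadata_list.index(line)
        print(metadata_list[label_line + 1])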

sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
sources in classical literature
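
The repeated output above is what you would expect if every "Subject Terms" label line is identical: `list.index()` returns the position of the first matching item, so the loop keeps landing on the first record. A possible fix, sketched below, is to track the position with `enumerate()` instead. The sample lines are made up for illustration (the second subject term is invented), and the sketch assumes the value sits on the line directly after the label.

```python
# Made-up lines: each label sits on its own line, with the value on the next line.
sample_lines = [
    "Subject Terms:",
    "sources in classical literature",
    "Subject Terms:",
    "treatment of the frontier",
]

subject_terms = []
for position, line in enumerate(sample_lines):
    # enumerate() gives the real position of this line, not the first match.
    # Check that a following line exists before reading it.
    if "Subject Terms" in line and position + 1 < len(sample_lines):
        subject_terms.append(sample_lines[position + 1])

print(subject_terms)
```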


Always remember to close your files when you're done!

In [79]:
infile.close()
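
An alternative that makes it impossible to forget is Python's `with` statement, which closes the file automatically when the block ends. The function below is a minimal sketch: `read_lines` is a hypothetical name, and the UTF-8 encoding is an assumption about how the file was saved.

```python
def read_lines(path):
    """Read a text file and return its lines as a list."""
    # The with statement closes the file automatically, even if an error occurs.
    # encoding="utf-8" is an assumption about how the file was saved.
    with open(path, encoding="utf-8") as infile:
        return infile.read().split("\n")
```

With this in place, `metadata_list = read_lines('1950_readable.txt')` would replace the open, read, split, and close steps.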

## Next Steps

I have three files, so I repeated these steps for each of them.
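
Rather than copying the cells for each file, the steps could be wrapped in functions and run in a loop. This is only a sketch under assumptions: the function names are hypothetical, the files are assumed to be UTF-8, and the label is assumed to start at the beginning of the line, as in the cells above.

```python
def classifications_from_file(path):
    """Return the Classification values found in one downloaded file."""
    with open(path, encoding="utf-8") as infile:
        lines = infile.read().split("\n")
    label = "Classification: "
    # Keep the text after the label, as in the cells above.
    return [line[len(label):] for line in lines if label in line]


def classifications_by_file(paths):
    """Map each file name to the list of classifications it contains."""
    return {path: classifications_from_file(path) for path in paths}
```

Calling `classifications_by_file(['1950_readable.txt'])` with the other file names added to the list would give one dictionary with a list of classifications per file.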