# More Lists & Loops, Plus Modules

[Download relevant files here](https://melaniewalsh.org/More-Lists-Loops.zip)

Last lesson, we learned how to make, manipulate, and iterate through lists, an important Python collection type. But we weren't actually working with a real CSV file, and we weren't doing a very comprehensive analysis of the data. In this lesson, we're going to work with a real CSV file and try to answer some analytical questions about the Bellevue Almshouse data, such as:

- What is the most common "disease" and the least common "disease"?
- What is the most common "profession" and the least common "profession"?
- What is the gender breakdown of those admitted to the Bellevue Almshouse?

We're going to answer these questions by practicing more with lists and loops while also introducing the csv module and the collections module.

In [141]:
import pandas as pd
almshouse_filepath = '../data/bellevue_almshouse_modified.csv'
almshouse_pd_data = pd.read_csv(almshouse_filepath)
almshouse_pd_data.head(5)

Unnamed: 0,date_in,first_name,last_name,age,disease,profession,gender,children
0,1847-04-17,Mary,Gallagher,28.0,recent emigrant,married,f,Child Alana 10 days
1,1847-04-08,John,Sanin (?),19.0,recent emigrant,laborer,m,Catherine 2 mo
2,1847-04-17,Anthony,Clark,60.0,recent emigrant,laborer,m,Charles Riley afed 10 days
3,1847-04-08,Lawrence,Feeney,32.0,recent emigrant,laborer,m,Child
4,1847-04-13,Henry,Joyce,21.0,recent emigrant,,m,Child 1 mo


# Reading in a CSV File

The [csv module](https://docs.python.org/3/library/csv.html) allows you to read and write tabular data in CSV (comma-separated values) format, one of the most common formats for spreadsheets. (Soon we're going to talk about the Python library pandas, which we used in the first cell above and which is an even more powerful and convenient way of working with tabular data.)

In [60]:
import csv

To use the csv module, you first have to import it, as above. Then, to read in a CSV file, as below, open the file with `with open()`, give the open file a name with `as` (here, `csv_object`), and pass that file object to the `csv.reader()` function. The `delimiter` argument tells `csv.reader()` which character separates the values in each row. The default is a comma (,), but some CSV files are separated by tabs (\t) instead, so it's typically good to specify.

In [115]:
almshouse_filepath = '../data/bellevue_almshouse_modified.csv'

with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
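To see what the `delimiter` argument changes, here is a minimal sketch that reads tab-separated data. The sample text and the names `tsv_text`, `tsv_reader`, and `tsv_rows` are made up for illustration, and `io.StringIO` stands in for an open file so the example runs without any file on disk.

```python
import csv
import io

# Hypothetical tab-separated sample standing in for a real .tsv file
tsv_text = "first_name\tlast_name\tdisease\nMary\tGallagher\trecent emigrant\n"

# io.StringIO lets csv.reader treat the string like an open file
tsv_reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")

# Turning the reader into a list gives one list of strings per row
tsv_rows = list(tsv_reader)
print(tsv_rows)
```

With `delimiter=','` instead, each row of this sample would come back as a single string, because there are no commas to split on.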

In [107]:
almshouse_data

<_csv.reader at 0x1a1789aa50>

The `csv.reader()` function will create a "reader object." To actually get at the data in there, we'll need to iterate through it in some way. Each row in the reader object is a list of strings, so if we iterate through every row in the dataset, we will get 9,000+ lists (!). It's helpful to remember what each row in the dataset represents and name your variables accordingly. For the Bellevue Almshouse dataset, each row represents a person.

In [110]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    for person in almshouse_data:
        print(person)

['date_in', 'first_name', 'last_name', 'age', 'disease', 'profession', 'gender', 'children']
['1847-04-17', 'Mary', 'Gallagher', '28', 'recent emigrant', 'married', 'f', 'Child Alana 10 days']
['1847-04-08', 'John', 'Sanin (?)', '19', 'recent emigrant', 'laborer', 'm', 'Catherine 2 mo']
['1847-04-17', 'Anthony', 'Clark', '60', 'recent emigrant', 'laborer', 'm', 'Charles Riley afed 10 days']
['1847-04-08', 'Lawrence', 'Feeney', '32', 'recent emigrant', 'laborer', 'm', 'Child']
['1847-04-13', 'Henry', 'Joyce', '21', 'recent emigrant', '', 'm', 'Child 1 mo']
['1847-04-14', 'Bridget', 'Hart', '20', 'recent emigrant', 'spinster', 'f', 'Child']
['1847-04-14', 'Mary', 'Green', '40', 'recent emigrant', 'spinster', 'f', 'And child 2 months']
['1847-04-19', 'Daniel', 'Loftus', '27', 'destitution', 'laborer', 'm', '']
['1847-04-10', 'James', 'Day', '35', 'recent emigrant', 'laborer', 'm', '']
['1847-04-10', 'Margaret', 'Farrell', '30', 'recent emigrant', 'widow', 'f', '']
['1847-04-10', 'Bridget'

**Pro tip!** If you have a really long output, you can "Enable Scrolling for Outputs" by right-clicking and selecting that option.

If we wanted to answer our first question — *What is the most common "disease" and the least common "disease"?* — how might we isolate only the names of the diseases so we can count them? Think back to how [we indexed a list](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python/Lists-Loops.html#Index) in the last lesson...

In [113]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    for person in almshouse_data:
        print(person[0])

date_in
1847-04-17
1847-04-08
1847-04-17
1847-04-08
1847-04-13
1847-04-14
1847-04-14
1847-04-19
1847-04-10
1847-04-10
1847-04-10
1847-04-10
1847-04-07
1847-04-07
1847-04-07
1847-04-17
1847-04-09
1847-04-09
1847-04-12
1847-04-12
1847-04-13
1847-04-09
1847-04-17
1847-04-09
1847-04-15
1847-04-14
1847-04-06
1847-04-05
1847-04-12
1847-04-17
1847-04-15
1847-04-06
1847-04-06
1847-04-06
1847-04-06
1847-04-06
1847-04-12
1847-04-12
1847-04-16
1847-04-12
1847-04-12
1847-04-12
1847-04-12
1847-04-12
1847-04-12
1847-04-12
1847-04-12
1847-04-14
1847-04-15
1847-04-17
1847-04-08
1847-04-10
1847-04-10
1847-04-10
1847-04-13
1847-04-17
1847-04-17
1847-04-16
1847-04-03
1847-04-03
1847-04-12
1847-04-12
1847-04-19
1847-04-19
1847-04-19
1847-04-19
1847-04-19
1847-04-19
1847-04-15
1847-04-13
1847-04-17
1847-04-16
1847-04-09
1847-04-06
1847-04-19
1847-04-19
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-17
1847-04-12
1847-04-10
1847-04-10
18

In [114]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    for person in almshouse_data:
        print(person[4])

disease
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
destitution
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
typhus
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigrant
recent emigra
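Counting columns by hand to find index `4` can be error-prone. As an alternative sketch, `csv.DictReader` (also part of the csv module) uses the first row as column names, so each person becomes a dictionary that you can index by name. The two sample rows below are copied from the output above, and `sample_csv` and `sample_people` are hypothetical names used only for this example.

```python
import csv
import io

# Small sample that mirrors the column names and two rows shown above
sample_csv = (
    "date_in,first_name,last_name,age,disease,profession,gender,children\n"
    "1847-04-17,Mary,Gallagher,28,recent emigrant,married,f,Child Alana 10 days\n"
    "1847-04-19,Daniel,Loftus,27,destitution,laborer,m,\n"
)

# DictReader treats the first row as keys, so the header is not returned as a row
sample_people = list(csv.DictReader(io.StringIO(sample_csv)))

# Access a value by column name instead of by position
print(sample_people[0]["disease"])
print(sample_people[1]["disease"])
```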

## Build a List With a `For` Loop

Great! We figured out how to isolate the diseases. But to count them, we want to get them in a data collection of their own, like a list. How would we put this data into a list? Let's make an empty list and then append each disease from each row into the list.

❌ ❌ ❌ **Not Correct**

In [123]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    for person in almshouse_data:
        diseases = []
        diseases.append(person[4])

In [124]:
diseases

['destitution']

Wait, that's not quite right. We only got a list with a single value. What's going on?

The problem is that the list building is happening *inside* the `for` loop. This means that, `for` every person/row, `diseases` is reset to an empty list before a single value is appended to it. "destitution" is the very last disease in the dataset, so we're only getting the very last value. To keep building on a list, we need to put the empty list *outside* of the `for` loop and then keep adding to it.

In [125]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    diseases = []
    for person in almshouse_data:   
        diseases.append(person[4])

In [126]:
diseases

['disease',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'destitution',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'typhus',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 

## Measure Length of List

To measure the length of a list, use the `len()` function.

In [127]:
len(diseases)

9585
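Note that this count includes the header row, since `'disease'` is the first item in the list. Here is a minimal sketch of one way to leave the header out, using the built-in `next()` function to pull the first row off the reader before looping. The sample data and the names `sample_csv`, `sample_reader`, and `sample_diseases` are hypothetical stand-ins for the real file.

```python
import csv
import io

# Small sample that mirrors the column names and two rows of the dataset
sample_csv = (
    "date_in,first_name,last_name,age,disease,profession,gender,children\n"
    "1847-04-17,Mary,Gallagher,28,recent emigrant,married,f,Child Alana 10 days\n"
    "1847-04-19,Daniel,Loftus,27,destitution,laborer,m,\n"
)

sample_reader = csv.reader(io.StringIO(sample_csv), delimiter=",")

# next() consumes the first row, so the loop below starts at the first person
header = next(sample_reader)

sample_diseases = []
for person in sample_reader:
    sample_diseases.append(person[4])

print(header[4])
print(len(sample_diseases))
```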

## Count Items In a List or Collection

The Counter tool from the collections module is extremely useful. It can help you count all kinds of things. To use it, you first need to `import` the Counter `from` collections.

In [128]:
from collections import Counter

To count something, you simply need to insert it inside the `Counter()` function, like so:

In [145]:
Counter(diseases)

Counter({'disease': 1,
         'recent emigrant': 1974,
         'destitution': 841,
         'typhus': 46,
         'pregnant': 134,
         'sickness': 2706,
         'illegible': 47,
         'fever': 192,
         'debility': 11,
         'dysentery': 6,
         'intemperance': 71,
         'sore': 79,
         'insane': 138,
         'diarrhea': 6,
         'lame': 15,
         'cut': 1,
         'rickets': 1,
         'fits': 2,
         'dropsy': 8,
         'measles': 3,
         '': 3087,
         'injuries': 31,
         'burn': 3,
         'vagrant': 17,
         'broken bone': 5,
         'ascites': 1,
         'bronchitis': 9,
         'erysipelas': 6,
         'ophthalmia': 19,
         'scarletina': 2,
         'phthisis': 8,
         'deaf': 1,
         'blind': 9,
         'ulcers': 26,
         'scrofula': 2,
         'tuberculosis': 2,
         'pneumonia': 2,
         'congested head': 1,
         'eczema': 1,
         'rheumatism': 11,
         'bruise': 1,
    

This will give you a Counter, which is a special kind of dictionary that pairs every disease in the dataset with the number of times it appears. To sort this Counter dictionary based on the most common items, you can use the `.most_common()` method.

In [146]:
disease_tally = Counter(diseases)
disease_tally.most_common()

[('', 3087),
 ('sickness', 2706),
 ('recent emigrant', 1974),
 ('destitution', 841),
 ('fever', 192),
 ('insane', 138),
 ('pregnant', 134),
 ('sore', 79),
 ('intemperance', 71),
 ('illegible', 47),
 ('typhus', 46),
 ('injuries', 31),
 ('ulcers', 26),
 ('ophthalmia', 19),
 ('vagrant', 17),
 ('lame', 15),
 ('debility', 11),
 ('rheumatism', 11),
 ('bronchitis', 9),
 ('blind', 9),
 ('dropsy', 8),
 ('phthisis', 8),
 ('syphilis', 7),
 ('old age', 7),
 ('dysentery', 6),
 ('diarrhea', 6),
 ('erysipelas', 6),
 ('broken bone', 5),
 ('cripple', 5),
 ('measles', 3),
 ('burn', 3),
 ('drunkenness', 3),
 ('fits', 2),
 ('scarletina', 2),
 ('scrofula', 2),
 ('tuberculosis', 2),
 ('pneumonia', 2),
 ('delusion dreams', 2),
 ('abandonment', 2),
 ('piles', 2),
 ('jaundice', 2),
 ('sprain', 2),
 ('disease', 1),
 ('cut', 1),
 ('rickets', 1),
 ('ascites', 1),
 ('deaf', 1),
 ('congested head', 1),
 ('eczema', 1),
 ('bruise', 1),
 ('contusion', 1),
 ('severed limb', 1),
 ('poorly', 1),
 ('disabled', 1),
 ('blee

You can also select a certain number of the most common items by placing a number inside the `.most_common()` method.

In [147]:
disease_tally.most_common(10)

[('', 3087),
 ('sickness', 2706),
 ('recent emigrant', 1974),
 ('destitution', 841),
 ('fever', 192),
 ('insane', 138),
 ('pregnant', 134),
 ('sore', 79),
 ('intemperance', 71),
 ('illegible', 47)]

You can also select a certain number of the *least* common items by extracting a slice from the end of the list, like so:

In [149]:
disease_tally.most_common()[-10:]

[('paralysis', 1),
 ('abscess', 1),
 ('neuralgia', 1),
 ('hypochondria', 1),
 ('sunburn', 1),
 ('horrors', 1),
 ('from trial', 1),
 ('ungovernable', 1),
 ('smallpox', 1),
 ('asthma', 1)]

In [150]:
disease_tally.most_common()[-3:]

[('ungovernable', 1), ('smallpox', 1), ('asthma', 1)]
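The tallies above also count blank entries (`''`) and the column name `'disease'`. If you want to count only the values that were actually recorded, one possible approach is to skip those entries while building the list. This is a sketch on a short, made-up list (`sample_diseases`), not on the full dataset.

```python
from collections import Counter

# Hypothetical short list standing in for the full `diseases` list
sample_diseases = ["disease", "sickness", "", "typhus", "sickness", "", ""]

recorded_diseases = []
for disease in sample_diseases:
    # Skip the column name and any blank entries
    if disease != "" and disease != "disease":
        recorded_diseases.append(disease)

recorded_tally = Counter(recorded_diseases)

# The blank string no longer tops the list
print(recorded_tally.most_common(1))

# A Counter works like a dictionary, so you can look up a single count by key
print(recorded_tally["typhus"])
```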

# Your Turn!

By using the same methods, find the 10 most common professions and the 10 least common professions in the Bellevue Almshouse dataset.

Build a list called `professions` by using a `for` loop and the `.append()` method.

In [None]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    #Your Code Here
    for person in almshouse_data:   
        #Your Code Here

In [154]:
professions

['profession',
 'married',
 'laborer',
 'laborer',
 'laborer',
 '',
 'spinster',
 'spinster',
 'laborer',
 'laborer',
 'widow',
 'married',
 '',
 'laborer',
 'laborer',
 'laborer',
 '',
 'married',
 'laborer',
 'married',
 '',
 'laborer',
 'laborer',
 'spinster',
 '',
 'married',
 'spinster',
 'laborer',
 'spinster',
 'shoemaker',
 'painter',
 'spinster',
 '',
 'married',
 'laborer',
 '',
 '',
 'laborer',
 'spinster',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'laborer',
 'laborer',
 'married',
 'laborer',
 'widow',
 '',
 '',
 '',
 'mason',
 'married',
 'spinster',
 'laborer',
 'laborer',
 'married',
 'tailor',
 '',
 '',
 'married',
 'mason',
 '',
 '',
 'laborer',
 'laborer',
 'laborer',
 'farmer',
 'married',
 'laborer',
 'laborer',
 'laborer',
 '',
 '',
 'spinster',
 'baker',
 'laborer',
 'widow',
 '',
 'married',
 '',
 '',
 '',
 'painter',
 'married',
 '',
 '',
 'spinster',
 'spinster',
 'spinster',
 'laborer',
 '',
 'married',
 '',
 '',
 '',
 'spinster',
 'married',
 'laborer',


Count the list `professions` with the Counter tool then display the top 10 most common values.

In [156]:
from collections import Counter

professions_tally = #Your Code Here
#Your Code Here

[('laborer', 3108),
 ('married', 1584),
 ('spinster', 1521),
 ('widow', 1053),
 ('', 1019),
 ('shoemaker', 158),
 ('tailor', 116),
 ('blacksmith', 104),
 ('mason', 98),
 ('weaver', 66)]

Display the 10 least common values.

In [157]:
#Your Code Here

[('auctioneer', 1),
 ('moulder', 1),
 ('builder', 1),
 ('jeweller', 1),
 ('wood sawyer', 1),
 ('gw anderson per e witherell', 1),
 ('drayman', 1),
 ('groom', 1),
 ('rectifier', 1),
 ('superintendent', 1)]

Now find out how many men vs. women are included in the Bellevue Almshouse data. Build a list called `gender` with a `for` loop and the `.append()` method.

In [None]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    #Your Code Here
    for person in almshouse_data:   
        #Your Code Here

In [160]:
gender

['gender',
 'f',
 'm',
 'm',
 'm',
 'm',
 'f',
 'f',
 'm',
 'm',
 'f',
 'f',
 'm',
 'm',
 'm',
 'm',
 'm',
 'f',
 'm',
 'f',
 'm',
 'm',
 'm',
 'f',
 'm',
 'f',
 'f',
 'm',
 'f',
 'm',
 'm',
 'f',
 'm',
 'f',
 'm',
 'f',
 'm',
 'm',
 'f',
 'm',
 'f',
 'f',
 'f',
 'f',
 'f',
 'm',
 'm',
 'm',
 'm',
 'm',
 'f',
 'm',
 'f',
 'm',
 'f',
 'm',
 'm',
 'f',
 'f',
 'm',
 'm',
 'f',
 'm',
 'm',
 'f',
 'f',
 'm',
 'f',
 'f',
 'm',
 'm',
 'm',
 'm',
 'f',
 'm',
 'm',
 'm',
 'f',
 'm',
 'f',
 'm',
 'm',
 'f',
 'f',
 'f',
 'f',
 'f',
 'm',
 'm',
 'f',
 'f',
 'f',
 'f',
 'f',
 'f',
 'm',
 'm',
 'f',
 'm',
 'm',
 'm',
 'f',
 'f',
 'm',
 'm',
 'm',
 'm',
 'm',
 'm',
 'f',
 'f',
 'f',
 'f',
 'm',
 'f',
 'm',
 'f',
 'f',
 'f',
 'm',
 'f',
 'm',
 'm',
 'f',
 'f',
 'f',
 'm',
 'm',
 'f',
 'm',
 'm',
 'm',
 'f',
 'm',
 'f',
 'f',
 'm',
 'f',
 'm',
 'm',
 'f',
 'm',
 'm',
 'f',
 'm',
 'f',
 'f',
 'm',
 'f',
 'f',
 'f',
 'f',
 'm',
 'f',
 'f',
 'f',
 'm',
 'f',
 'f',
 'm',
 'm',
 'm',
 'm',
 'm',
 'm',
 'f',

Count the values in the gender column with the Counter tool and then display the results.

In [162]:
from collections import Counter

gender_tally = #Your Code Here
gender_tally

Counter({'gender': 1, 'f': 4621, 'm': 4958, '?': 2, 'h': 1, 'g': 2})
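Raw counts become easier to compare as shares of the total. The sketch below copies the counts from the output above, leaves out the `'gender'` header entry, and divides each count by the total. The names `sample_gender_tally` and `total_people` are hypothetical and used only for this illustration.

```python
from collections import Counter

# Counts copied from the output above, minus the 'gender' header entry
sample_gender_tally = Counter({"f": 4621, "m": 4958, "?": 2, "h": 1, "g": 2})

# Add up every count to get the number of people
total_people = sum(sample_gender_tally.values())

# Loop through the (value, count) pairs from most to least common
for gender_code, count in sample_gender_tally.most_common():
    share = count / total_people
    print(f"{gender_code}: {count} ({share:.1%})")
```

The values `'?'`, `'h'`, and `'g'` are kept here as they appear in the data. How to treat them is an interpretive decision for the analysis.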

# List Comprehensions

There's a slightly easier and more compact way to build a list with a `for` loop called a "list comprehension." Instead of creating an empty list, you can build the `for` loop inside of a list.

In [None]:
diseases = [person[4] for person in almshouse_data]

In [176]:
with open(almshouse_filepath) as csv_object:
    almshouse_data = csv.reader(csv_object, delimiter=',')
    
    diseases = [person[4] for person in almshouse_data]

In [177]:
diseases

['disease',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'destitution',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'typhus',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 'recent emigrant',
 

Remember our Python script for counting words in a text file? Though you probably didn't recognize it at the time, this code contains a list comprehension. Can you spot it?

In [59]:
import re
from collections import Counter
from nltk.corpus import stopwords

def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

filepath_of_text = "../texts/literature/The-Yellow-Wallpaper.txt"
nltk_stop_words = stopwords.words("english")
number_of_desired_words = 40

full_text = open(filepath_of_text, encoding="utf-8").read()

all_the_words = split_into_words(full_text)
meaningful_words = [word for word in all_the_words if word not in nltk_stop_words]
meaningful_words_tally = Counter(meaningful_words)
most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_desired_words)

print(most_frequent_meaningful_words)

[('john', 45), ('one', 33), ('said', 30), ('would', 27), ('get', 24), ('see', 24), ('room', 24), ('pattern', 24), ('paper', 23), ('like', 21), ('little', 20), ('much', 16), ('good', 16), ('think', 16), ('well', 15), ('know', 15), ('go', 15), ('really', 14), ('thing', 14), ('wallpaper', 13), ('night', 13), ('long', 12), ('course', 12), ('things', 12), ('take', 12), ('always', 12), ('could', 12), ('jennie', 12), ('great', 11), ('says', 11), ('feel', 11), ('even', 11), ('used', 11), ('dear', 11), ('time', 11), ('enough', 11), ('away', 11), ('want', 11), ('never', 10), ('must', 10)]


This is the list comprehension:

In [None]:
meaningful_words = [word for word in all_the_words if word not in nltk_stop_words]

which is exactly the same as

In [None]:
meaningful_words = []
for word in all_the_words:
    if word not in nltk_stop_words:
        meaningful_words.append(word)

In [None]:
new_list = []
for item in collection:
    if item in items_we_want:
        new_list.append(item)

In [None]:
new_list = [item for item in collection if item in items_we_want]
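To check that the two general forms above really do the same thing, here is a small runnable sketch. The values in `collection` and `items_we_want` are made up for illustration, as are the names `wanted_from_loop` and `wanted_from_comprehension`.

```python
# Hypothetical example values for the general pattern
collection = ["laborer", "", "spinster", "mason", "", "laborer"]
items_we_want = ["laborer", "mason"]

# Version 1: empty list plus a for loop and .append()
wanted_from_loop = []
for item in collection:
    if item in items_we_want:
        wanted_from_loop.append(item)

# Version 2: the same logic written as a list comprehension
wanted_from_comprehension = [item for item in collection if item in items_we_want]

print(wanted_from_loop)
print(wanted_from_loop == wanted_from_comprehension)
```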