This assignment will give you practice with regular expressions (and other Python skills) by processing data on cumulative covid-19 case counts and deaths in every county/parish in the US. These data are compiled and updated regularly by the NY Times and made available on GitHub: https://github.com/nytimes/covid-19-data. I downloaded the data on March 30th and have included that file (covidData_3.30.20.txt) in this week's repository.

Each cell in this Jupyter notebook includes a comment that indicates what I want you to accomplish with Python code. To submit your assignment, please create your own folder (with your name) inside the assignments folder for this week. In this folder, please put the completed notebook and the new files you created. Please submit these via a pull request to the class GitHub page, as with previous weeks.

If you have *any questions at all*, please let me know. I will be available for office hours via Zoom at our regularly scheduled class times. I'll also respond to all emails as quickly as possible.

This assignment will be due by 5 PM next Monday, April 6th.

In [45]:
# Load the re module
import re

In [46]:
# Create a file object to read from the covid data file
# This file contains cumulative case counts and deaths from
# covid-19 for every county in the US. It was current as of
# March 30th.

inData = open("covidData_3.30.20.txt",'r')

In [47]:
# Find all parishes from Louisiana in this file.
# You should make a list that contains only the parish
# names. This list can be created with a single command.
# Be careful to include ALL the parishes in your search.

# Capture the county field of every Louisiana row. [^,\n]+ matches one
# whole comma-separated field, so multi-word names such as
# "St. John the Baptist" are kept intact.
counties = re.findall(r",([^,\n]+),Louisiana",inData.read())
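
A capture group lets `re.findall` return only the parish field of each matching row. The sketch below runs on a few made-up rows in the same column layout as the data file (date, county, state, fips, cases, deaths); `sample_text` and the values in it are illustrative assumptions, not real records.

```python
import re

# Made-up rows in the data file's column layout (illustrative only)
sample_text = (
    "2020-03-09,Jefferson,Louisiana,22051,1,0\n"
    "2020-03-10,Harris,Texas,48201,2,0\n"
    "2020-03-10,St. Tammany,Louisiana,22103,1,0\n"
    "2020-03-10,East Baton Rouge,Louisiana,22033,1,0\n"
)

# [^,\n]+ matches one whole field, so spaces and periods in names are kept.
# The parentheses make findall return only the captured parish name.
parishes = re.findall(r",([^,\n]+),Louisiana", sample_text)
print(parishes)
```

Because only the captured group is returned, no joining, splitting, or slicing is needed afterwards.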

In [48]:
# Close the covid data file object

inData.close()

In [49]:
# Print out the Louisiana search results

print(counties)

['Jefferson', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'St. Charles', 'St. Tammany', 'Terrebonne', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Ascension', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Ascension', 'Bossier', 'Caddo', 'East Baton Rouge', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Washington', 'Ascension', 'Bossier', 'Caddo', 'East Baton Rouge', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Unknown', 'Washington', 'Ascension', 'As

In [50]:
# Create a dictionary that has the Louisiana parish 
# names as keys and the number of times they appear 
# in the data file as values.

# Hint: You can test if a dictionary does or does not
# contain a certain key using:
# myKey in myDict --or-- myKey not in myDict

laDict = {}
# If parish was in the dictionary, add 1 to the value
# Else, add parish to dictionary and set value to 1
for i in counties:
    if i in laDict:
        laDict[i] += 1
    else:
        laDict[i] = 1
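
The same tally can be built with `collections.Counter` from the standard library, which counts the items of any iterable. This is a minimal sketch on a made-up list; `sample_parishes` is a hypothetical stand-in for the search results.

```python
from collections import Counter

# Made-up list of parish names standing in for the search results
sample_parishes = ["Jefferson", "Orleans", "Jefferson", "Caddo", "Jefferson"]

# Counter behaves like a dictionary that maps each item to its count
parish_counts = Counter(sample_parishes)

print(parish_counts["Jefferson"])
print(parish_counts.most_common(1))
```

A `Counter` returns 0 for a key it has never seen, so looking up a parish with no entries does not raise a `KeyError`.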

In [51]:
# Using your dictionary, figure out which of these parishes has
# the most entries in the file. The one with the most entries
# had the earliest observed covid case.
# Jefferson, Orleans, East Baton Rouge, Plaquemines

entryList = ['Jefferson','Orleans','East Baton Rouge','Plaquemines']
maxVal = 0
# If the value for key is higher than the last, set it equal
# to the max value to find the highest value
for i in entryList:
    if laDict[i] > maxVal:
        maxVal = laDict[i]
        maxKey = i
print("Parish with the highest entries: " + maxKey)

Parish with the highest entries: Jefferson
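
The comparison loop can also be written with the built-in `max` and a `key` function. This is a sketch under assumptions: the counts in `sample_counts` are made up, and `dict.get` with a default of 0 is used so that a parish missing from the dictionary does not raise a `KeyError`.

```python
# Made-up entry counts (illustrative only, not taken from the data file)
sample_counts = {"Jefferson": 22, "Orleans": 21, "East Baton Rouge": 13}
candidates = ["Jefferson", "Orleans", "East Baton Rouge", "Plaquemines"]

# max() compares the candidates by the value the key function returns;
# a parish absent from the dictionary is treated as having 0 entries
top_parish = max(candidates, key=lambda name: sample_counts.get(name, 0))
print("Parish with the most entries: " + top_parish)
```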


In [52]:
# Figure out how many parishes are in the dictionary. How does 
# this compare to the total number of parishes in the state?

print(len(laDict))

# There are 64 parishes in total, meaning that nearly all 
# of the parishes have had a confirmed case! Note that the
# count includes the 'Unknown' entry, which is not a parish.

60


In [53]:
# Now, reopen the data file and use a regex search to find the first instance 
# where a Louisiana parish recorded a covid-19 death. This will be the first 
# line in the file that includes Louisiana and has a death count >0. After 
# your search, print the date and the parish. Be sure to also close the input file.
# Example output: ZZZZ parish reported the first covid-19 death in Louisiana on 2020-ZZ-ZZ

inData = open("covidData_3.30.20.txt",'r')

# Find LA cases
laMatches = re.findall(r"\w.+,Louisiana.+",inData.read())

# If the last number (deaths) is greater than 0, collect the date and
# parish. [^,]+ keeps multi-word parish names intact. Break once found
for i in laMatches:
    if int(re.search(r"\d+$",i).group()) > 0:
        info = re.search(r"^\d{4}-\d{2}-\d{2},[^,]+",i).group()
        break
print(info + " Parish: First COVID-19 death in Louisiana")

inData.close()

2020-03-14,Orleans Parish: First COVID-19 death in Louisiana
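
The search can also be done one line at a time, which stops reading as soon as the first death is found and makes it easy to print the sentence format given in the example output. The sketch below uses `io.StringIO` as a stand-in for the open data file; the rows and case counts in `sample_file` are made up for illustration, and the fips field is assumed to be possibly empty.

```python
import io
import re

# Made-up rows standing in for the data file (illustrative only)
sample_file = io.StringIO(
    "2020-03-13,Orleans,Louisiana,22071,20,0\n"
    "2020-03-14,Jefferson,Louisiana,22051,11,0\n"
    "2020-03-14,Orleans,Louisiana,22071,53,1\n"
    "2020-03-15,Orleans,Louisiana,22071,75,2\n"
)

# Groups: 1 = date, 2 = parish, 3 = deaths. \d* allows an empty fips field.
la_row = re.compile(r"^(\d{4}-\d{2}-\d{2}),([^,]+),Louisiana,\d*,\d+,(\d+)$")

result = None
for line in sample_file:
    match = la_row.search(line.strip())
    # Stop at the first Louisiana row whose death count is above zero
    if match and int(match.group(3)) > 0:
        result = (match.group(2) + " parish reported the first covid-19 "
                  "death in Louisiana on " + match.group(1))
        break

print(result)
```

Leaving `result` as `None` when nothing matches avoids a `NameError` at the final `print`.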


In [54]:
# Now we need to create a new file that only has records from Louisiana. 
# Extract the lines that pertain to Louisiana and write them to a new file.

outData = open("outputDataLA.txt", 'w')

for i in laMatches:
    outData.write(i + "\n")
    
outData.close()

In [55]:
# Lastly, let's imagine that we want to do further downstream analyses with
# the Louisiana data, but our downstream analyses require some formatting changes.
# Read in the lines from your newly created file with only Louisiana records, reformat them, 
# and create a separate file to hold the newly formatted information. Specifically, the 
# dates in the file should be reformatted from looking like this (2020-MM-DD) to looking 
# like this (MM.DD.2020). Also, the fips codes (in the 4th column) should be removed, and
# Louisiana should be abbreviated as LA. So, this file should end up with entries that look
# like this: 03.25.2020,Morehouse,LA,1,0

outData = open("formattedLA.txt",'w')

# Rearrange date based on character placement
# Capture the whole parish field (the text between the first comma
# and ",Louisiana") so multi-word names are not truncated
# Get the number of cases and deaths, without including fips
# codes to remove that column
# Write to file
for i in laMatches:
    date = i[5:7] + "." + i[8:10] + "." + i[0:4]
    
    parish = re.search(r",([^,]+),Louisiana",i).group(1)
    
    numbers = re.search(r"\d+,\d+$",i).group()
    
    outData.write(date + "," + parish + ",LA," + numbers + "\n")

outData.close()
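
The reformatting can also be expressed as a single `re.sub` call with backreferences, where each `\N` in the replacement string refers to the Nth captured group. This is a minimal sketch on made-up rows; `sample_lines` is hypothetical, and the pattern assumes the six-column layout date, county, state, fips, cases, deaths.

```python
import re

# Made-up Louisiana rows in the data file's column layout (illustrative only)
sample_lines = [
    "2020-03-25,Morehouse,Louisiana,22067,1,0",
    "2020-03-25,St. Tammany,Louisiana,22103,60,1",
]

# Groups: 1 = year, 2 = month, 3 = day, 4 = parish, 5 = cases, 6 = deaths.
# The fips field (\d*) is matched but not captured, so it is dropped.
row = re.compile(r"^(\d{4})-(\d{2})-(\d{2}),([^,]+),Louisiana,\d*,(\d+),(\d+)$")

# Reorder the date parts and abbreviate the state in the replacement string
formatted = [row.sub(r"\2.\3.\1,\4,LA,\5,\6", line) for line in sample_lines]

for line in formatted:
    print(line)
```

A line that does not match the pattern is returned unchanged by `sub`, so malformed rows are easy to spot in the output.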