This assignment will give you practice with regular expressions (and other Python skills) by processing data on cumulative covid-19 case counts and deaths in every county/parish in the US. These data are compiled and updated regularly by the NY Times and made available on GitHub: https://github.com/nytimes/covid-19-data. I downloaded the data on March 30th and have included that file (covidData_3.30.20.txt) in this week's repository.

Each cell in this Jupyter notebook includes a comment that indicates what I want you to accomplish with Python code. To submit your assignment, please create your own folder (with your name) inside the assignments folder for this week. In this folder, please put the completed notebook and the new files you created. Please submit these via a pull request to the class GitHub page, as with previous weeks.

If you have *any questions at all*, please let me know. I will be available for office hours via Zoom at our regularly scheduled class times. I'll also respond to all emails as quickly as possible.

This assignment will be due by 5 PM next Monday, April 6th.

In [1]:
# Load the re module

import re

In [2]:
# Create a file object to read from the covid data file
# This file contains cumulative case counts and deaths from 
# covid-19 for every county in the US. It was current as of
# March 30th.

#create file object to read from
covidFile = open("covidData_3.30.20.txt",'r')

In [3]:
# Find all parishes from Louisiana in this file.
# You should make a list that contains only the parish
# names. This list can be created with a single command.
# Be careful to include ALL the parishes in your search.

#parishes = re.findall(r",(.*),Louisiana",covidFile.read())
#print(parishes)

listParishes = re.findall(r",(.*),Louisiana",covidFile.read())


In [4]:
# Close the covid data file object

covidFile.close()

In [5]:
# Print out the Louisiana search results

print(listParishes)

['Jefferson', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'St. Charles', 'St. Tammany', 'Terrebonne', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Ascension', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Ascension', 'Bossier', 'Caddo', 'East Baton Rouge', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Washington', 'Ascension', 'Bossier', 'Caddo', 'East Baton Rouge', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Unknown', 'Washington', 'Ascension', 'As
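The search in cell 3 works because `.` never matches a newline, so the greedy `.*` stays inside one line. A stricter pattern makes that explicit by matching one comma-separated field at a time. The following is a minimal sketch, assuming the column order date,county,state,fips,cases,deaths; the rows in `sample` are made up for illustration and are not copied from the data file.

```python
import re

# Made-up sample rows in the assumed column order
# (date,county,state,fips,cases,deaths); values are illustrative only.
sample = (
    "2020-03-09,Jefferson,Louisiana,22051,1,0\n"
    "2020-03-10,Jefferson,Louisiana,22051,1,0\n"
    "2020-03-10,Orleans,Louisiana,22071,5,0\n"
    "2020-03-10,King,Washington,53033,190,22\n"
)

# With re.MULTILINE, ^ anchors each match at the start of a line.
# [^,\n]* matches exactly one field: it cannot cross a comma or a newline.
parishes = re.findall(r"^[^,\n]*,([^,\n]*),Louisiana,", sample, flags=re.MULTILINE)
print(parishes)  # ['Jefferson', 'Jefferson', 'Orleans']
```

Because the capture group cannot contain a comma, the pattern cannot accidentally swallow the date or a neighboring field.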

In [6]:
# Create a dictionary that has the Louisiana parish 
# names as keys and the number of times they appear 
# in the data file as values.

# Hint: You can test if a dictionary does or does not
# contain a certain key using:
# myKey in myDict --or-- myKey not in myDict

#Count the number of times each parish appears in the list.
#Counter is a dict subclass that maps each name to its count, so save its output as the dictionary
from collections import Counter
covidDict = Counter(listParishes)

#Test:
#covidDict["Jefferson"]
#Success!
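The hint in this cell points at building the dictionary by hand with a membership test. As a sketch of that approach, the loop below counts names with `not in` and then checks the result against `Counter`. The list `sampleParishes` is a hypothetical stand-in for `listParishes`, not the real search results.

```python
from collections import Counter

# Hypothetical stand-in for listParishes; not the real search results.
sampleParishes = ["Jefferson", "Orleans", "Jefferson", "Caddo", "Jefferson"]

countDict = {}
for parish in sampleParishes:
    # First time we see this parish: start its count at zero
    if parish not in countDict:
        countDict[parish] = 0
    countDict[parish] += 1

print(countDict)  # {'Jefferson': 3, 'Orleans': 1, 'Caddo': 1}

# Counter is a dict subclass, so the two compare equal when the counts agree
print(countDict == Counter(sampleParishes))  # True
```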

In [7]:
# Using your dictionary, figure out which of these parishes has
# the most entries in the file. The one with the most entries
# had the earliest observed covid case.
# Jefferson, Orleans, East Baton Rouge, Plaquemines

mostCases = max(covidDict, key=covidDict.get)
print(mostCases)

Jefferson
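The call above takes the maximum over every key in the dictionary, which answers the question only if the overall winner happens to be one of the four listed parishes. A sketch that restricts the comparison to the candidates is shown below. The counts are invented to show the difference and are not taken from the data file.

```python
from collections import Counter

# Invented counts, chosen only to show why restricting the candidates matters
counts = Counter({"Jefferson": 22, "Orleans": 21, "East Baton Rouge": 13, "Caddo": 30})

candidates = ["Jefferson", "Orleans", "East Baton Rouge", "Plaquemines"]

# A Counter returns 0 for a missing key, so a candidate with no
# entries (here Plaquemines) is handled without a KeyError.
earliest = max(candidates, key=lambda parish: counts[parish])
print(earliest)  # Jefferson

# Taking the max over all keys would pick a parish outside the candidate list
print(max(counts, key=counts.get))  # Caddo
```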


In [8]:
# Figure out how many parishes are in the dictionary. How does 
# this compare to the total number of parishes in the state?

#Count number of entries in dictionary
len(covidDict)

#There are 64 parishes in Louisiana. The dictionary has 60 keys, but one of them
#is 'Unknown', so 59 named parishes have covid-19 cases in the file

60
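The printed search results include the name 'Unknown', so the key count is not quite the same as the number of parishes with cases. One way to separate the two is sketched below, using a small hypothetical dictionary in place of `covidDict`.

```python
# Hypothetical stand-in for covidDict; the counts are invented
sampleDict = {"Jefferson": 22, "Orleans": 21, "Unknown": 5, "Caddo": 17}

# Drop the placeholder name before counting actual parishes
namedParishes = set(sampleDict) - {"Unknown"}

print(len(sampleDict))     # 4 keys in total
print(len(namedParishes))  # 3 named parishes
```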

In [9]:
# Now, reopen the data file and use a regex search to find the first instance 
# where a Louisiana parish recorded a covid-19 death. This will be the first 
# line in the file that includes Louisiana and has a death count >0. After 
# your search, print the date and the parish. Be sure to also close the input file.
# Example output: ZZZZ parish reported the first covid-19 death in Louisiana on 2020-ZZ-ZZ

#open file
covidFile = open("covidData_3.30.20.txt",'r')
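#regex search for first LA death, and single out the date and parish to print later
#[1-9] matches a death count that starts with any nonzero digit
#([1+] is a character class that only matches the character 1 or a literal +)
firstDeath = re.search(r"(.*),(.*),Louisiana,.*,.*,[1-9]",covidFile.read())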
#print(firstDeath)

#close file
covidFile.close()

#print results in a readable manner
print(firstDeath[2] +" parish reported the first death related to covid-19 in Louisiana on " + firstDeath[1])


Orleans parish reported the first death related to covid-19 in Louisiana on 2020-03-14
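Numbered groups such as `firstDeath[1]` are easy to mix up. As an alternative sketch, named groups let the print statement say what it is using. The rows in `sample` are made up for illustration, and the column order date,county,state,fips,cases,deaths is assumed.

```python
import re

# Made-up sample rows (date,county,state,fips,cases,deaths); illustrative only
sample = (
    "2020-03-13,Orleans,Louisiana,22071,26,0\n"
    "2020-03-14,Jefferson,Louisiana,22051,11,0\n"
    "2020-03-14,Orleans,Louisiana,22071,53,1\n"
    "2020-03-15,Orleans,Louisiana,22071,75,2\n"
)

# One field per [^,\n]*; the deaths field must start with a nonzero digit
pattern = re.compile(
    r"^(?P<date>[^,\n]*),(?P<parish>[^,\n]*),Louisiana,"
    r"[^,\n]*,[^,\n]*,(?P<deaths>[1-9]\d*)$",
    re.MULTILINE,
)

# search() returns the first match, or None if no line qualifies
match = pattern.search(sample)
if match:
    message = (
        f"{match['parish']} parish reported the first covid-19 death "
        f"in Louisiana on {match['date']}"
    )
    print(message)
```

Checking `match` before indexing it avoids a `TypeError` when the search finds nothing.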


In [10]:
# Now we need to create a new file that only has records from Louisiana. 
# Extract the lines that pertain to Louisiana and write them to a new file.

#Open covid file and new file for LA data
covidFile = open("covidData_3.30.20.txt")
covidLA = open("covidLa.txt",'w')


#Find every line in covidFile that contains Louisiana
dataLA = re.findall(r".*Louisiana.*",covidFile.read())

#Close covidFile
covidFile.close()

#Test
#print(dataLA)

#Convert list to string
dataLAstr = '\n'.join(dataLA)

#Append dataLA to covidLA file
covidLA.write(dataLAstr)

#Close covidLA file
covidLA.close()
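The same extraction can also be done one line at a time, which avoids reading the whole file into memory and lets `with` close both files automatically. This is a sketch under assumptions: it writes made-up rows to a temporary directory so that it runs on its own, and the file names `sample.txt` and `sampleLa.txt` are hypothetical.

```python
import os
import tempfile

# Made-up rows (date,county,state,fips,cases,deaths); illustrative only
rows = [
    "date,county,state,fips,cases,deaths\n",
    "2020-03-09,Jefferson,Louisiana,22051,1,0\n",
    "2020-03-10,King,Washington,53033,190,22\n",
    "2020-03-10,Orleans,Louisiana,22071,5,0\n",
]

with tempfile.TemporaryDirectory() as tmpDir:
    srcPath = os.path.join(tmpDir, "sample.txt")    # hypothetical input file
    dstPath = os.path.join(tmpDir, "sampleLa.txt")  # hypothetical output file

    with open(srcPath, "w") as srcFile:
        srcFile.writelines(rows)

    # Copy only the lines whose state field is Louisiana
    with open(srcPath) as inFile, open(dstPath, "w") as outFile:
        for line in inFile:
            if ",Louisiana," in line:
                outFile.write(line)

    with open(dstPath) as resultFile:
        kept = resultFile.read()

print(kept)
```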

In [59]:
# Lastly, let's imagine that we want to do further downstream analyses with
# the Louisiana data, but our downstream analyses require some formatting changes.
# Read in the lines from your newly created file with only Louisiana records, reformat them, 
# and create a separate file to hold the newly formatted information. Specifically, the 
# dates in the file should be reformatted from looking like this (2020-MM-DD) to looking 
# like this (MM.DD.2020). Also, the fips codes (in the 4th column) should be removed, and
# Louisiana should be abbreviated as LA. So, this file should end up with entries that look
# like this: 03.25.2020,Morehouse,LA,1,0

import re

#Open LA data file
covidLA = open("covidLa.txt",'r')

#Reformat dates from YYYY-MM-DD to MM.DD.YYYY
newDates = re.sub(r"(\d{4})-(\d{2})-(\d{2}),",r"\2.\3.\1,",covidLA.read())
#print(newDates)

#Abbreviate to LA
abbrev = re.sub(r",Louisiana,",r",LA,",newDates)
#print(abbrev)

#Remove fips codes (the field right after the state)
rmFips = re.sub(r",LA,[^,\n]*,",r",LA,",abbrev)
#print(rmFips)

#Create new file for reformatted data and write everything once,
#so the file never has to be reopened while a write is still pending
rfCovidLA = open("rfCovidLa.txt",'w')
rfCovidLA.write(rmFips)

#Close files
rfCovidLA.close()
covidLA.close()
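The three reformatting steps can also be combined into a single substitution. The sketch below applies one pattern to a sample line built from the target entry given in the instructions; the fips value in it is a placeholder for illustration, and the column order date,county,state,fips,cases,deaths is assumed.

```python
import re

# Sample line built from the example entry in the instructions;
# the fips value is a placeholder for illustration.
line = "2020-03-25,Morehouse,Louisiana,22067,1,0"

# Groups 1-3 are year, month, day; group 4 is the parish name.
# The fips field after the state is matched but not captured, so it is dropped.
pattern = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2}),([^,\n]*),Louisiana,[^,\n]*,",
    re.MULTILINE,
)

reformatted = pattern.sub(r"\2.\3.\1,\4,LA,", line)
print(reformatted)  # 03.25.2020,Morehouse,LA,1,0
```

A single pass keeps the intermediate strings out of the way, at the cost of a longer pattern that is harder to debug than three small ones.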