This assignment will give you practice with regular expressions (and other Python skills) by processing data on cumulative covid-19 case counts and deaths in every county/parish in the US. These data are compiled and updated regularly by the NY Times and made available on GitHub: https://github.com/nytimes/covid-19-data. I downloaded the data on March 30th and have included that file (covidData_3.30.20.txt) in this week's repository.

Each cell in this Jupyter notebook includes a comment that indicates what I want you to accomplish with Python code. To submit your assignment, please create your own folder (with your name) inside the assignments folder for this week. In this folder, please put the completed notebook and the new files you created. Please submit these via a pull request to the class GitHub page, as with previous weeks.

If you have *any questions at all*, please let me know. I will be available for office hours via Zoom at our regularly scheduled class times. I'll also respond to all emails as quickly as possible.

This assignment will be due by 5 PM next Monday, April 6th.

In [33]:
# Load the re module
import re

In [34]:
# Create a file object to read from the covid data file
# This file contains cumulative case counts and deaths from
# covid-19 for every county in the US. It was current as of
# March 30th.

data = open("covidData_3.30.20.txt","r")

In [35]:
# Find all parishes from Louisiana in this file.
# You should make a list that contains only the parish
# names. This list can be created with a single command.
# Be careful to include ALL the parishes in your search.

# Match the text between a comma (lookbehind) and ",Louisiana" (lookahead),
# which is the parish name in the second column of every Louisiana row.
parishes = re.findall(r"(?<=,)(.*)(?=,Louisiana)", data.read())
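
The lookbehind and lookahead used in this cell can be tried on a few rows without opening the data file. This is a minimal sketch, assuming the column order date,county,state,fips,cases,deaths; the sample rows and the name `names` are invented for illustration, and the pattern swaps `.*` for `[^,\n]+` so that a match can never span more than one field.

```python
import re

# Hypothetical sample rows (values invented for illustration), assuming the
# column order date,county,state,fips,cases,deaths.
sample = (
    "2020-03-09,Jefferson,Louisiana,22051,1,0\n"
    "2020-03-10,Harris,Texas,48201,2,0\n"
    "2020-03-10,St. John the Baptist,Louisiana,22095,1,0\n"
)

# (?<=,)          : the match must be preceded by a comma (not captured)
# [^,\n]+         : one or more characters that are neither a comma nor a newline
# (?=,Louisiana,) : the match must be followed by ",Louisiana," (not captured)
names = re.findall(r"(?<=,)[^,\n]+(?=,Louisiana,)", sample)
print(names)
```

Because neither assertion consumes characters, `re.findall` returns only the parish names, with no commas or state attached.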

In [36]:
# Close the covid data file object

data.close()

In [37]:
# Print out the Louisiana search results

print(parishes)

['Jefferson', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'Caddo', 'Jefferson', 'Orleans', 'St. Charles', 'St. Tammany', 'Terrebonne', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Ascension', 'Bossier', 'Caddo', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Ascension', 'Bossier', 'Caddo', 'East Baton Rouge', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Washington', 'Ascension', 'Bossier', 'Caddo', 'East Baton Rouge', 'Jefferson', 'Lafourche', 'Orleans', 'St. Bernard', 'St. Charles', 'St. John the Baptist', 'St. Tammany', 'Terrebonne', 'Unknown', 'Washington', 'Ascension', 'As

In [38]:
# Create a dictionary that has the Louisiana parish 
# names as keys and the number of times they appear 
# in the data file as values.

# Hint: You can test if a dictionary does or does not
# contain a certain key using:
# myKey in myDict --or-- myKey not in myDict

#create an empty dictionary
freq = {}
for x in parishes: #iterate through the list of parishes
    if x in freq:
        freq[x] += 1 #parish is already in the dictionary, so increment its count
    else:
        freq[x] = 1 #first time this parish is seen, so add it with a count of 1
for key, value in freq.items(): #loop over each parish (key) and its count (value)
    print("%s : % d" % (key, value)) #print each pair as "parish : count"

Jefferson :  21
Orleans :  20
Caddo :  19
St. Charles :  17
St. Tammany :  17
Terrebonne :  17
Bossier :  16
Lafourche :  16
St. John the Baptist :  16
St. Bernard :  15
Ascension :  14
East Baton Rouge :  13
Washington :  13
Unknown :  12
Assumption :  11
Calcasieu :  11
Iberia :  11
Iberville :  11
Lafayette :  11
Livingston :  11
Plaquemines :  11
St. James :  11
St. Landry :  11
Webster :  11
West Baton Rouge :  11
Catahoula :  10
De Soto :  10
Rapides :  10
Tangipahoa :  10
Avoyelles :  9
Beauregard :  9
Bienville :  9
Claiborne :  9
Evangeline :  9
Ouachita :  9
St. Mary :  9
Acadia :  8
Allen :  7
Grant :  7
Lincoln :  7
Natchitoches :  7
Richland :  7
St. Martin :  7
Vernon :  6
Jackson :  5
Jefferson Davis :  5
Morehouse :  5
Union :  5
Winn :  5
East Feliciana :  4
LaSalle :  4
Madison :  4
Pointe Coupee :  4
Vermilion :  4
West Feliciana :  3
East Carroll :  2
Franklin :  2
Caldwell :  1
Red River :  1
Sabine :  1
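
The same tally can also be built with `collections.Counter` from the standard library. This is only an alternative sketch, not the approach the assignment asks for; `parishes_demo` is a hypothetical stand-in for the `parishes` list.

```python
from collections import Counter

# Hypothetical stand-in for the `parishes` list built earlier.
parishes_demo = ["Jefferson", "Orleans", "Jefferson", "Caddo", "Orleans", "Jefferson"]

# Counter is a dict subclass that counts how often each item appears.
counts = Counter(parishes_demo)

# most_common() yields (name, count) pairs from most to least frequent.
for name, n in counts.most_common():
    print(f"{name} : {n}")
```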


In [39]:
# Using your dictionary, figure out which of these parishes has
# the most entries in the file. The one with the most entries
# had the earliest observed covid case.
# Jefferson, Orleans, East Baton Rouge, Plaquemines

freq = dict(sorted(freq.items(), key=lambda x: x[1], reverse=True)) #sort by count in descending order and keep the result (sorted() returns a new list and does not change freq)
print(freq)

{'Jefferson': 21, 'Orleans': 20, 'Caddo': 19, 'St. Charles': 17, 'St. Tammany': 17, 'Terrebonne': 17, 'Bossier': 16, 'Lafourche': 16, 'St. John the Baptist': 16, 'St. Bernard': 15, 'Ascension': 14, 'East Baton Rouge': 13, 'Washington': 13, 'Unknown': 12, 'Assumption': 11, 'Calcasieu': 11, 'Iberia': 11, 'Iberville': 11, 'Lafayette': 11, 'Livingston': 11, 'Plaquemines': 11, 'St. James': 11, 'St. Landry': 11, 'Webster': 11, 'West Baton Rouge': 11, 'Catahoula': 10, 'De Soto': 10, 'Rapides': 10, 'Tangipahoa': 10, 'Avoyelles': 9, 'Beauregard': 9, 'Bienville': 9, 'Claiborne': 9, 'Evangeline': 9, 'Ouachita': 9, 'St. Mary': 9, 'Acadia': 8, 'Allen': 7, 'Grant': 7, 'Lincoln': 7, 'Natchitoches': 7, 'Richland': 7, 'St. Martin': 7, 'Vernon': 6, 'Jackson': 5, 'Jefferson Davis': 5, 'Morehouse': 5, 'Union': 5, 'Winn': 5, 'East Feliciana': 4, 'LaSalle': 4, 'Madison': 4, 'Pointe Coupee': 4, 'Vermilion': 4, 'West Feliciana': 3, 'East Carroll': 2, 'Franklin': 2, 'Caldwell': 1, 'Red River': 1, 'Sabine': 1}
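
To answer the question for just the four named parishes, one option is to take the maximum over those keys. A minimal sketch, assuming the counts printed above; `freq_demo`, `candidates` and `earliest` are illustrative names.

```python
# Counts for the four candidate parishes, copied from the printed dictionary.
freq_demo = {"Jefferson": 21, "Orleans": 20, "East Baton Rouge": 13, "Plaquemines": 11}
candidates = ["Jefferson", "Orleans", "East Baton Rouge", "Plaquemines"]

# max() with a key function returns the candidate with the largest count;
# .get(p, 0) treats a parish missing from the dictionary as having no entries.
earliest = max(candidates, key=lambda p: freq_demo.get(p, 0))
print(earliest, freq_demo[earliest])
```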


In [40]:
# Figure out how many parishes are in the dictionary. How does 
# this compare to the total number of parishes in the state?
print("There are a total of", len(freq), "parishes affected.") #dictionary keys are unique, so the length is the number of distinct names (note that this count includes the "Unknown" placeholder)
print("There are a total of 64 parishes in Louisiana, so...")
tot = len(freq)/64 #fraction of the state's parishes that appear in the data
print(tot*100, "% of parishes in Louisiana are affected by Covid19.")

There are a total of 60 parishes affected.
There are a total of 64 parishes in Louisiana, so...
93.75 % of parishes in Louisiana are affected by Covid19.
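
Since the dictionary also holds an "Unknown" entry, the comparison with the 64 parishes may be slightly overstated. A small sketch of how that entry could be left out, using a hypothetical three-entry dictionary rather than the real one:

```python
# Hypothetical small dictionary standing in for `freq`.
freq_demo = {"Jefferson": 21, "Orleans": 20, "Unknown": 12}
TOTAL_PARISHES = 64  # total number of Louisiana parishes, as stated in the cell above

# Keep only real parish names; "Unknown" is a placeholder, not a parish.
named = [p for p in freq_demo if p != "Unknown"]
pct = len(named) / TOTAL_PARISHES * 100
print(f"{len(named)} of {TOTAL_PARISHES} parishes ({pct:.2f}%) have at least one entry.")
```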


In [56]:
# Now, reopen the data file and use a regex search to find the first instance 
# where a Louisiana parish recorded a covid-19 death. This will be the first 
# line in the file that includes Louisiana and has a death count >0. After 
# your search, print the date and the parish. Be sure to also close the input file.
# Example output: ZZZZ parish reported the first covid-19 death in Louisiana on 2020-ZZ-ZZ
import re
data2 = open("covidData_3.30.20.txt","r")
#find the first line that contains Louisiana followed by two numeric fields (the fips code and the case count) and then a death count that starts with a nonzero digit
death = re.search(r"(.*)(,Louisiana)(,[0-9]\d*){2}(,[1-9]\d*)", data2.read()).group(0)
parishData=death.split(",") #split up the data according to commas so can grab info as separate items
print(parishData[1], "parish reported the first covid-19 death in Louisiana on ", parishData[0])
data2.close() #close the file


Orleans parish reported the first covid-19 death in Louisiana on  2020-03-14
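
A variant of this search with named groups avoids the later `split(",")` and anchors the pattern to whole lines. This is a sketch under assumptions: the sample rows are invented, the fips field is allowed to be empty (`\d*`), and `re.MULTILINE` is used so that `^` and `$` match at each line boundary.

```python
import re

# Hypothetical sample rows (counts invented for illustration).
sample = (
    "2020-03-13,Orleans,Louisiana,22071,10,0\n"
    "2020-03-14,King,Washington,53033,300,30\n"
    "2020-03-14,Orleans,Louisiana,22071,20,1\n"
    "2020-03-15,Jefferson,Louisiana,22051,15,2\n"
)

# One full row: date, parish, the literal state, fips (possibly empty),
# cases, and a death count whose first digit is 1-9 (so it is greater than 0).
pattern = re.compile(
    r"^(?P<date>[^,\n]+),(?P<parish>[^,\n]+),Louisiana,\d*,\d+,(?P<deaths>[1-9]\d*)$",
    re.MULTILINE,
)

m = pattern.search(sample)  # search() stops at the first matching line
if m:
    print(f"{m['parish']} parish reported the first covid-19 death in Louisiana on {m['date']}")
```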


In [75]:
# Now we need to create a new file that only has records from Louisiana. 
# Extract the lines that pertain to Louisiana and write them to a new file.
data = open("covidData_3.30.20.txt","r")
LA = re.findall(r"(.*)(Louisiana)(.*)", data.read()) #identify all the Louisiana cases
data.close()
LAFile= open ("LAFile.txt","w") #creates the new file
for element in LA:
    LAFile.write(''.join(element)) #join the three captured groups of each match back into the original line
    LAFile.write('\n') #creates a new line following each string
LAFile.close()
print ("Your file has been created")


Your file has been created
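
The same extraction can be written without a regex by reading line by line inside `with` blocks, which close the files automatically. A minimal sketch, assuming the state is always the third comma-separated field; the file names and rows are hypothetical and live in a temporary directory so the example does not touch the real data.

```python
import os
import tempfile

# Hypothetical sample rows (values invented for illustration).
sample_rows = [
    "2020-03-09,Jefferson,Louisiana,22051,1,0\n",
    "2020-03-10,Harris,Texas,48201,2,0\n",
    "2020-03-10,Orleans,Louisiana,22071,3,0\n",
]

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_covid.txt")
    dst = os.path.join(tmp, "sample_LA.txt")

    # Write the sample input file.
    with open(src, "w") as f:
        f.writelines(sample_rows)

    # Copy only the rows whose third field is exactly "Louisiana".
    with open(src) as infile, open(dst, "w") as outfile:
        for line in infile:
            if line.split(",")[2:3] == ["Louisiana"]:
                outfile.write(line)

    # Read the result back to check it.
    with open(dst) as f:
        kept = f.read().splitlines()

print(kept)
```

Comparing the state column directly, rather than searching for the word anywhere in the line, guards against a county elsewhere that happens to contain "Louisiana" in its name.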


In [80]:
# Lastly, let's imagine that we want to do further downstream analyses with
# the Louisiana data, but our downstream analyses require some formatting changes.
# Read in the lines from your newly created file with only Louisiana records, reformat them, 
# and create a separate file to hold the newly formatted information. Specifically, the 
# dates in the file should be reformatted from looking like this (2020-MM-DD) to looking 
# like this (MM.DD.2020). Also, the fips codes (in the 4th column) should be removed, and
# Louisiana should be abbreviated as LA. So, this file should end up with entries that look
# like this: 03.25.2020,Morehouse,LA,1,0

LAdata = open("LAFile.txt","r") #open the newly created file
newLAData= open("NewLAData.txt","w") #create new file to write
for line in LAdata: #iterate through the newly created file
    date, county, state, fips, cases, deaths=line.split(",") #divide each line into 6 parts using the comma as a delimiter (deaths keeps the trailing newline)
    state = "LA" #replace the state variable with LA in all cases
    year, month, day = date.split("-") #separate the date into 3 parts
    date=month+"."+day+"."+year #reformat the date as MM.DD.YYYY
    newLAData.write (date+","+county+","+state+","+cases+","+deaths) #write the reformatted line to the new file, leaving out fips
print("Your new file has been created.")
#close the relevant files
LAdata.close()
newLAData.close()

Your new file has been created.
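
Since the assignment is about regular expressions, the same reformatting could also be done in a single `re.sub` call with backreferences. This is a sketch on one row; the fips value in the sample line is illustrative, and the expected result is the example entry given in the cell's instructions.

```python
import re

# One sample row; the fips value is illustrative.
line = "2020-03-25,Morehouse,Louisiana,22067,1,0"

# Groups: 1=year, 2=month, 3=day, 4=parish, 5=cases, 6=deaths.
# The fips field (\d*) is matched but not captured, so it is dropped.
pattern = r"^(\d{4})-(\d{2})-(\d{2}),([^,]+),Louisiana,\d*,(\d+),(\d+)$"

# Rebuild the row as MM.DD.YYYY,parish,LA,cases,deaths.
reformatted = re.sub(pattern, r"\2.\3.\1,\4,LA,\5,\6", line)
print(reformatted)
```

When reading from a file, the trailing newline would need to be stripped first (for example with `line.rstrip("\n")`) and written back after the substitution.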
