jcchurch edited this page Jan 3, 2012 · 5 revisions

Homework #1

Reading Data From a File


In this assignment, you should download the file "file_reading/Tax_Year_2007_County_Income_Data.csv". This data was pulled from data.gov, a taxpayer-funded repository of data. You don't need to pull it from data.gov yourself. I've put a copy in the "file_reading" subdirectory of this site.

This file describes the tax returns filed in each county in the United States. Write a Python program to report the following, where a county's average gross income is its gross income divided by its number of returns:

  • The highest average gross income by county
  • The lowest average gross income by county
  • The average, across all counties, of the average gross income

A few pointers:

  • The data must be cleaned. The dollar values throughout the file carry dollar signs. Use your Python program to remove them, so that the input file itself does not need to be edited.
  • The data contains total returns by state as well as by county, which means there are 50 additional records in the file that are not needed. Each of these has a "County Code" of 0. Ignore these records.
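The two pointers above can be sketched as a small helper. This is only one possible approach, not a required design: the function name `clean_record` is made up for illustration, the column positions (1 for the county code, 4 for the number of returns, 6 for the gross income) are the ones used in the instructor's solution further down this page, and the sample rows in the comments are invented rather than taken from the real file.

```python
def clean_record(line):
    """Return (county_code, returns, gross_income) for a county row,
    or None for a state-level total row (County Code of 0).

    Assumes a comma-separated line with the county code in column 1,
    the number of returns in column 4 and the gross income in column 6.
    """
    # Strip the newline, drop every dollar sign, then split into fields
    parts = line.strip().replace("$", "").split(",")
    county_code = int(parts[1])
    if county_code == 0:
        return None  # state total, not a county
    return county_code, float(parts[4]), float(parts[6])


# Invented sample rows, for illustration only:
#   clean_record("1,0,AL,Alabama,100,250,$5000\n")       gives None
#   clean_record("1,1,AL,Example County,10,25,$400\n")   gives (1, 10.0, 400.0)
```

Returning None for the state rows keeps the "ignore these records" decision in one place, so the loop that reads the file only has to skip None values.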

To open a file:

Your text file of data must be in the same directory as your script. This is important.

#!/usr/bin/env python

filename = "Tax_Year_2007_County_Income_Data.csv"

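# open() works in both Python 2 and Python 3; the with block closes the file for you
with open(filename) as f:
    for line in f:
        line = line.strip()  # Remove the trailing newline
        print(line)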

That's all there is to it. You need to build up your code from here.
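As one possible next step, the loop can be extended to skip the header row and split each line into fields. The sketch below uses the standard library's `csv` module instead of splitting on commas by hand, which is a swapped-in technique and not something the assignment requires. The function name `read_rows` is made up for illustration, and it assumes the first line of the file is a header row.

```python
import csv


def read_rows(lines):
    """Yield each data row as a list of strings, skipping the header row.

    `lines` can be an open file or any iterable of text lines.
    Assumes the first line is a header.
    """
    reader = csv.reader(lines)
    next(reader, None)  # discard the header row (does nothing if input is empty)
    for row in reader:
        yield row
```

In Python 3 this could be used as `with open(filename, newline="") as f:` followed by `for row in read_rows(f):`. Unlike a plain `split(",")`, `csv.reader` keeps a quoted field such as `"x,y"` together as a single value.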

Due Date: Tuesday, January 3rd by 8 AM.

The Instructor's Solution

I realize now that there was an incomplete record in the text file for Kalawao County, HI. Even I missed it, which shows that data cleaning is a genuinely difficult step that everyone must do. Still, ignoring that, here is my solution.
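One way to guard against a record like that is to treat any row that cannot be parsed as unusable and skip it. The sketch below is an assumption about what "incomplete" means here (empty, non-numeric, or missing fields, or a return count of zero), since this page does not say exactly how the Kalawao County record is broken. The function name `average_income` is made up, and the column positions are the ones used in the solution.

```python
def average_income(parts):
    """Return gross income per return for one row, or None if the row is unusable.

    `parts` is a list of field strings. Assumes the county code is in
    column 1, the number of returns in column 4 and the gross income
    in column 6.
    """
    try:
        county_code = int(parts[1])
        returns = float(parts[4].replace("$", ""))
        gross = float(parts[6].replace("$", ""))
    except (IndexError, ValueError):
        return None  # a column is missing, empty, or not a number
    if county_code == 0 or returns == 0:
        return None  # state total, or nothing to divide by
    return gross / returns
```

Rows for which this returns None would simply be left out of the list of averages, so a single bad record no longer stops the program.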

#!/usr/bin/env python

filename = "Tax_Year_2007_County_Income_Data.csv"
grossIncome = []  # Average gross income per return, one entry per county
i = 0
with open(filename) as f:
    for line in f:
        if i > 0:  # Skip the header row
            line = line.strip() # Remove new lines
            line = line.replace("$", "") # Remove dollar signs
            parts = line.split(",")
            countyCode = int(parts[1])
            thisCountyReturns = float(parts[4])
            thisGrossIncome = float(parts[6])

            # A county code of 0 marks a state total, so skip it
            if countyCode > 0:
                grossIncome.append( thisGrossIncome / thisCountyReturns )
        i += 1

grossIncome = sorted(grossIncome)
sumGrossIncome = sum(grossIncome)
avgGrossIncome = sumGrossIncome / float(len(grossIncome))

print("Poorest County Gross Income: %s" % grossIncome[0])
print("Richest County Gross Income: %s" % grossIncome[-1])
print("Average Gross Income: %s" % avgGrossIncome)
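The solution above sorts the list to find the two extremes. As a small alternative sketch, the same three numbers can be computed with the built-in `min`, `max` and `sum` functions, with no sorting needed. The function name `summarize` is made up for illustration.

```python
def summarize(values):
    """Return (lowest, highest, mean) for a non-empty list of numbers."""
    lowest = min(values)
    highest = max(values)
    mean = sum(values) / float(len(values))  # float() keeps Python 2 from truncating
    return lowest, highest, mean
```

Calling `summarize(grossIncome)` after the reading loop would give the poorest, richest and average values in that order.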