This data set records information on every historical member of the U.S. Congress. Here's a preview of the dataset:

last_name|first_name|birthday|gender|type|state|party
---|---|---|---|---|---|---
Bassett|Richard|1745-04-02|M|sen|DE|Anti-Administration
Bland|Theodorick|1742-03-21||rep|VA|
Burke|Aedanus|1743-06-16||rep|SC|
Carroll|Daniel|1730-07-22|M|rep|MD|

data cleaning: find unexpected values by extracting unique values from the list.
missing data, non intuitive values, recuring themes

You can use one of the following strategies to address missing data:

Remove any rows that contain missing data.
Populate the empty fields with a specified value.
Populate the empty fields with a calculated value.
Use analysis techniques that work with missing data: try except
# Data preparation

In [8]:
import csv
legislators = list(csv.reader(open('legislators.csv','r')))

# This class will check each column of the data and print out the unique value sets. 
# This will help us identify empty or wrong values in the dataset.

class Cleandata:
    def __init__(self, data):    #create an instance of the class and remove the header
        self.data = data[1:]         
        
    def birthint(self):
        for row in self.data:      #convert the birthday column data into integer, then add to the rows.Empty values are converted to 0.
            if row[2] == '':
                row[2] = "0-0-0"   # note: here is 0-0-0 instead of 0.
            num = row[2].split('-')
            for item in num:
                try:
                    intbirth = int(item)
                except:
                    intbirth = 0
                row.append(intbirth)
        return self.data
                
                   
    def columnset(self):       # check set of unique value in each column for abnormal data
        for num in range(0,len(self.data[0])):
            setlist = []
            for row in self.data:
                try:
                    setlist.append(row[num])
                except IndexError as e:
                    print(e)
            setlist = set(setlist)
            print(setlist)
        return None
        
# from the results of legis.columnset we can see the following columns need adjustment:
# the gender column has empty values. Because majority of the people in the dataset are male, we will replace the empty values with 'M'.
# the party column has empty values. We will use 'None' to replace them.
# the year/month/day columns have '0' values. We will replace them with the previous values in the same columns.
# run legis.columnset again, this time run legis.adjust first, see if the empty values are replced.

    def adjust(self):
        last_year = 1
        last_month = 1
        last_day = 1
        for i in self.data:
            if i[3] != 'F' and i[3] != 'M':
                i[3] = 'M'
            if i[6] =='':
                i[6]='NA'
            if i[7] == 0:
                i[7] = last_year
            if i[8] == 0:
                i[8] = last_month
            if i[9] == 0:
                i[9] = last_day    
            last_year = i[7]
            last_month = i[8]
            last_day = i[9]
        return self.data

    def top_name(self, gender):
        namedict = {}
        for row in self.data:
            if row[7]>1940 and row[3] == gender:
                if row[1] in namedict:
                    namedict[row[1]] +=1
                else: namedict[row[1]] = 1    
        maxnum = None
        top_names = []
        for name, num in namedict.items():
            if maxnum is None or num > maxnum:
                maxnum = num
                top_names.append(name)
        return top_names, maxnum
            
            
    
legis= Cleandata(legislators) # create a class instance first
legis.birthint()
legis.adjust()
legis.columnset() # This is how to correctly print a class method by invoking the method.


{'Singiser', 'Goss', 'Karth', 'Irvin', 'Wiggins', 'Seybert', 'Ellwood', 'Maher', 'Inglis', 'Jenckes', 'Oxley', 'Longnecker', 'DeWine', 'Ganly', 'Colquitt', 'Eager', 'Coke', 'Jontz', 'Merrill', 'Gallagher', 'Flack', 'Frye', 'MacLafferty', 'Noland', 'Daly', 'McHale', 'Buffett', 'Hoffman', 'Akers', 'Combest', 'Murphey', 'Cooke', 'Leavitt', 'Hopkins', 'Adrain', 'Maginnis', 'Compton', 'Bailey', 'Bowersock', 'Keller', 'Milnor', 'Clover', 'De Veyra', 'Dennison', 'Guffey', 'Redlin', 'Snook', 'Sloan', 'McCoy', 'Erlenborn', 'Sevier', 'Hinshaw', 'Denver', 'Laporte', 'Chrysler', 'Almon', 'Biddle', 'Alexander', 'Wells', 'Codd', 'Dunn', 'Hanly', 'Treloar', 'Farley', 'Ellett', 'Hutchison', 'Dickinson', 'Riddleberger', 'Darden', 'Pettengill', 'Kremer', 'Cazayoux', 'Crosby', 'Kilgore', 'Mahany', 'Strode', 'Neville', 'Lively', 'Borah', 'Preyer', 'Law', 'Shuster', 'McFall', 'Landes', 'Millward', 'Averett', 'Pfeifer', 'Rice', 'Quarles', 'SoulÃ©', 'Learned', 'Paddock', 'Alston', 'Crowe', 'Randell', 'Trimbl

# Now we will analyze the data to answer the below questions:

-  What are the most common male and female names among the legislators after 1940? 



In [11]:
print(legis.top_name('M'))
print(legis.top_name('F'))    

['Alan', 'Michael', 'David', 'James', 'John']
['Enid', 'Karen']
