# Data Manipulation with `Python` Exercises

Welcome to one of your first exercise notebooks. 
So what should you expect from these notebooks? 
Well, we will be touching on the concepts and code that we ran through in the previous labs and practices, 
except the majority of the coding will be done by you now. 
The questions that we ask of you will be very familiar, although the output might throw a few more errors. 
**Some of these issues we have not seen yet and this is meant to challenge you.** 
Learning to resolve new issues and building your problem-solving vocabulary for internet research are critical to your development as a data scientist.

In these notebooks, we will ask you to write and execute your own code for questions that will look similar to what we have learned in the Labs and Practices. However, Exercises will often be a bit more challenging in that 1) you may be working with a new data set with which you will have to familiarize yourself, and 2) you will be asked to write code for problems you have not yet seen.


## Read in the Data

We will be using a different data set for this exercise. It lists every member of the U.S. Congress from January 1947 to February 2014, along with some information about each of them.

Go ahead and read in the `congress-terms.csv` file in the `all_datasets/` directory. Pay particular attention to the encoding. Run the following cell...

In [3]:
# Try this with and without the encoding argument ('Latin-1' is an alias for 'ISO-8859-1')

import pandas as pd

with open('/dsa/data/all_datasets/congress-terms.csv', 'r', encoding='Latin-1') as file:
    data = file.read()

    data_lists = data.split("\n")

    list_of_lists = []
    for line in data_lists:
        row = line.split(',')
        list_of_lists.append(row)

    # print the first 11 lists (the header plus 10 rows) to get an idea of what the data looks like
    for row in list_of_lists[0:11]:
        print(' ,'.join(row))

congress ,chamber ,bioguide ,firstname ,middlename ,lastname ,suffix ,birthday ,state ,party ,incumbent ,termstart ,age
80 ,house ,M000112 ,Joseph ,Jefferson ,Mansfield , ,1861-02-09 ,TX ,D ,Yes ,1/3/47 ,85.9
80 ,house ,D000448 ,Robert ,Lee ,Doughton , ,1863-11-07 ,NC ,D ,Yes ,1/3/47 ,83.2
80 ,house ,S000001 ,Adolph ,Joachim ,Sabath , ,1866-04-04 ,IL ,D ,Yes ,1/3/47 ,80.7
80 ,house ,E000023 ,Charles ,Aubrey ,Eaton , ,1868-03-29 ,NJ ,R ,Yes ,1/3/47 ,78.8
80 ,house ,L000296 ,William , ,Lewis , ,1868-09-22 ,KY ,R ,No ,1/3/47 ,78.3
80 ,house ,G000017 ,James ,A. ,Gallagher , ,1869-01-16 ,PA ,R ,No ,1/3/47 ,78
80 ,house ,W000265 ,Richard ,Joseph ,Welch , ,1869-02-13 ,CA ,R ,Yes ,1/3/47 ,77.9
80 ,house ,B000565 ,Sol , ,Bloom , ,1870-03-09 ,NY ,D ,Yes ,1/3/47 ,76.8
80 ,house ,H000943 ,Merlin , ,Hull , ,1870-12-18 ,WI ,R ,Yes ,1/3/47 ,76
80 ,house ,G000169 ,Charles ,Laceille ,Gifford , ,1871-03-15 ,MA ,R ,Yes ,1/3/47 ,75.8


**Question 1**: You will notice something a little bit different about reading in this file, particularly the `encoding` parameter. Do a bit of research on what encoding is. What happens when you remove this parameter altogether? Do your best to describe any errors being thrown.
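
As a rough illustration of the kind of error Question 1 is probing, the sketch below encodes a made-up name (an assumption, not taken from the data set) as Latin-1 and then tries to decode the bytes with two codecs. It assumes the default encoding on your system is UTF-8, which is common but platform dependent.

```python
# A made-up name with a non-ASCII character, stored as Latin-1 (ISO-8859-1) bytes.
raw = "José".encode("latin-1")

# Decoding with the matching codec recovers the original text.
decoded = raw.decode("latin-1")
print(decoded)

# Decoding the same bytes as UTF-8 fails: the byte for "é" in Latin-1
# is not a complete, valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
    print(error_name, "-", exc)
```

Opening the CSV without `encoding` would likely fail in the same way at the first byte that is not valid UTF-8.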

**Question 2**: In the `list_of_lists` variable, the last item of each list is the `age` of the member of congress. This is currently a string. Without using any packages, create a subset that contains all of the values for `age` stored as floats.

In [5]:
# Execute your code for question 2 here
# -------------------------------------

# Age is the 13th column, which is index 12 because Python counts from 0
# We use a loop to convert the element at index 12 to a float, starting after the header row (row 0)

list_of_agefloat = []

# Here, we append the coerced value of age to the new list
for row in list_of_lists[1:]:
    list_of_agefloat.append(float(row[12]))
    
# Print out values of first 10 items and their types
for num in list_of_agefloat[0:10]:
    print(num)
    print(type(num))


85.9
<class 'float'>
83.2
<class 'float'>
80.7
<class 'float'>
78.8
<class 'float'>
78.3
<class 'float'>
78.0
<class 'float'>
77.9
<class 'float'>
76.8
<class 'float'>
76.0
<class 'float'>
75.8
<class 'float'>
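
The same conversion can also be written as a list comprehension. The sketch below is one possible variant, run on a small hand-made sample (two rows copied from the output above plus a blank line) rather than the real file. The length check is an added assumption: it skips short rows, such as the one a trailing newline would produce.

```python
# Hand-made sample: header, two data rows, and a trailing blank line.
lines = [
    "congress,chamber,bioguide,firstname,middlename,lastname,suffix,"
    "birthday,state,party,incumbent,termstart,age",
    "80,house,M000112,Joseph,Jefferson,Mansfield,,1861-02-09,TX,D,Yes,1/3/47,85.9",
    "80,house,G000017,James,A.,Gallagher,,1869-01-16,PA,R,No,1/3/47,78",
    "",
]

rows = [line.split(",") for line in lines]

# Skip the header (rows[0]) and any row too short to have an age at index 12.
ages = [float(row[12]) for row in rows[1:] if len(row) > 12]
print(ages)
```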


**Question 3**: Now go ahead and read in the file with `pandas` and save the data frame to a variable called `df`.

In [6]:
# Execute your code for question 3 here
# -------------------------------------

# Pandas was imported above

with open('/dsa/data/all_datasets/congress-terms.csv', 'r', encoding='Latin-1') as file:
    df = pd.read_csv(file)
    

# Check data was read into df 

df.head(10)

Unnamed: 0,congress,chamber,bioguide,firstname,middlename,lastname,suffix,birthday,state,party,incumbent,termstart,age
0,80,house,M000112,Joseph,Jefferson,Mansfield,,1861-02-09,TX,D,Yes,1/3/47,85.9
1,80,house,D000448,Robert,Lee,Doughton,,1863-11-07,NC,D,Yes,1/3/47,83.2
2,80,house,S000001,Adolph,Joachim,Sabath,,1866-04-04,IL,D,Yes,1/3/47,80.7
3,80,house,E000023,Charles,Aubrey,Eaton,,1868-03-29,NJ,R,Yes,1/3/47,78.8
4,80,house,L000296,William,,Lewis,,1868-09-22,KY,R,No,1/3/47,78.3
5,80,house,G000017,James,A.,Gallagher,,1869-01-16,PA,R,No,1/3/47,78.0
6,80,house,W000265,Richard,Joseph,Welch,,1869-02-13,CA,R,Yes,1/3/47,77.9
7,80,house,B000565,Sol,,Bloom,,1870-03-09,NY,D,Yes,1/3/47,76.8
8,80,house,H000943,Merlin,,Hull,,1870-12-18,WI,R,Yes,1/3/47,76.0
9,80,house,G000169,Charles,Laceille,Gifford,,1871-03-15,MA,R,Yes,1/3/47,75.8
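
`pd.read_csv` also accepts an `encoding` argument directly, so the `with open(...)` wrapper is optional. The sketch below shows this on a tiny in-memory CSV with made-up values (an assumption, used so the example runs without the course data). With the real file you would pass the path instead of the `BytesIO` object.

```python
import io

import pandas as pd

# Made-up CSV content, encoded as Latin-1 to mimic the course file's encoding.
csv_bytes = "firstname,lastname,age\nJosé,Example,50.5\n".encode("latin-1")

# read_csv decodes the bytes itself when given the encoding.
df_demo = pd.read_csv(io.BytesIO(csv_bytes), encoding="latin-1")
print(df_demo)
```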


**Question 4**: Find a method to print the column headers of the data frame `df`.

In [10]:
# Execute your code for question 4 here
# -------------------------------------

# The `columns` attribute (not a method) holds the column names of the data frame.
# A loop can also print them one per line, as in the next cell.
df.columns

Index(['congress', 'chamber', 'bioguide', 'firstname', 'middlename',
       'lastname', 'suffix', 'birthday', 'state', 'party', 'incumbent',
       'termstart', 'age'],
      dtype='object')

In [12]:
print("Columns in the Congress Data Frame\n")
for name in df.columns:
    print(name)

Columns in the Congress Data Frame

congress
chamber
bioguide
firstname
middlename
lastname
suffix
birthday
state
party
incumbent
termstart
age
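
If a plain Python list of names is more convenient than an `Index` object, `tolist()` is one option. A minimal sketch on a small made-up data frame (the values are assumptions for illustration only):

```python
import pandas as pd

# Small stand-in data frame with three of the column names used above.
demo = pd.DataFrame({"congress": [80], "chamber": ["house"], "age": [85.9]})

# Convert the Index of column labels into an ordinary list.
column_names = demo.columns.tolist()
print(column_names)
```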


**Question 5**: Congresses are numbered. Notice that there is a column devoted to the Congress number. This column is conveniently called `congress`. Create a subset of the data frame containing only the 80th congress and call this subset `congress80`. 

In [14]:
# Execute your code for question 5 here
# -------------------------------------

congress80 = df[df['congress'] == 80]

# Print the tail to confirm that the last rows listed are still in congress 80
congress80.tail(10)

Unnamed: 0,congress,chamber,bioguide,firstname,middlename,lastname,suffix,birthday,state,party,incumbent,termstart,age
545,80,senate,W000518,John,James,Williams,,5/17/04,DE,R,No,1/3/47,42.6
546,80,senate,E000018,James,Oliver,Eastland,,11/28/04,MS,D,Yes,1/3/47,42.1
547,80,senate,F000401,James,William,Fulbright,,4/9/05,AR,D,Yes,1/3/47,41.7
548,80,senate,M000053,Warren,Grant,Magnuson,,4/12/05,WA,D,Yes,1/3/47,41.7
549,80,senate,B000099,Joseph,Hurst,Ball,,11/3/05,MN,R,Yes,1/3/47,41.2
550,80,senate,C000021,Harry,Pulliam,Cain,,1/10/06,WA,R,No,1/3/47,41.0
551,80,senate,K000292,William,Fife,Knowland,,6/26/08,CA,R,Yes,1/3/47,38.5
552,80,senate,J000093,William,Ezra,Jenner,,7/21/08,IN,R,No,1/3/47,38.5
553,80,senate,M000315,Joseph,Raymond,McCarthy,,11/14/08,WI,R,No,1/3/47,38.1
554,80,senate,L000428,Russell,Billiu,Long,,11/3/18,LA,D,Yes,1/3/47,28.2
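
The filter in the cell above works in two steps: the comparison builds a boolean Series (a mask), and indexing with that mask keeps only the rows where it is `True`. A sketch of those steps on a small made-up data frame (values are assumptions for illustration):

```python
import pandas as pd

# Stand-in data: two rows from congress 80 and one from congress 81.
demo = pd.DataFrame(
    {"congress": [80, 80, 81], "chamber": ["house", "senate", "house"]}
)

# Step 1: element-wise comparison gives one True/False per row.
mask = demo["congress"] == 80

# Step 2: indexing with the mask keeps only the True rows.
subset = demo[mask]
print(mask.tolist())
print(subset)
```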


**Question 6**: Now, from this `congress80` subset, use a method that will count the rows for House members and then again for Senate members.

In [18]:
# Execute your code for question 6 here
# -------------------------------------

print("Number of Senate: {}".format(len(congress80[congress80['chamber'] == 'senate'])))

print("Number of House: {}".format(len(congress80[congress80['chamber'] == 'house'])))

print("Total members: {}".format(len(congress80)))

Number of Senate: 102
Number of House: 453
Total members: 555
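
Another way to get both counts in one call is `value_counts()`, which tallies each distinct value in a column. A minimal sketch on made-up data (not the real `congress80` subset):

```python
import pandas as pd

# Stand-in for the chamber column: two House rows and one Senate row.
demo = pd.DataFrame({"chamber": ["house", "house", "senate"]})

# value_counts returns a Series indexed by the distinct values.
counts = demo["chamber"].value_counts()
print(counts)
```

On the real subset, `congress80['chamber'].value_counts()` should report the same totals as the filtered `len(...)` calls above.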


# Save your notebook, then `File > Close and Halt`