### Work with a partner to complete the tasks below and submit your results via a pull request on GitHub by the beginning of tutorial next Friday.

To begin this week, one of the partners should fork the TA's Exercise 5 GitHub repo and give the other partner collaborator access. Clone the forked repo so that you have the required files. Be sure to commit regularly to show how you arrived at your solutions and to demonstrate coding effort by both partners.


## Tutorial

In this short tutorial, we will work with two tabular datasets to extend a couple of concepts from the Patient Data exercise from Software Carpentry.

&nbsp;

First, we will load a simple data table using `numpy.loadtxt()`. In the Patient Data exercise, we saw how we can use square brackets `[]` to index or subset data. We can also use the results of logic tests to index our data. This is really useful when we want to pull a subset out of a large dataset based on characteristics of the data itself.

In [4]:
import numpy
data=numpy.loadtxt(fname="test.dat",delimiter=" ")
data

array([[  1.,   5.,   9.,  13.,  17.],
       [  2.,   6.,  10.,  14.,  18.],
       [  3.,   7.,  11.,  15.,  19.],
       [  4.,   8.,  12.,  16.,  20.]])

In [3]:
# we can test for equality using double equal signs
data[:,0]==1

array([ True, False, False, False], dtype=bool)

In [5]:
# we can test for greater than or less than as well
data[:,0]>2

array([False, False,  True,  True], dtype=bool)

In [6]:
# the logical values returned by a logic test can be used just like numbers to index a data structure
data[data[:,0]>2,:]

array([[  3.,   7.,  11.,  15.,  19.],
       [  4.,   8.,  12.,  16.,  20.]])
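
Logic tests can also be combined before they are used as an index. The sketch below is an illustration rather than part of the original exercise: it rebuilds the same 4x5 table in memory (so it does not depend on `test.dat`) and uses the element-wise operators `&` (and) and `|` (or). The names `both`, `either`, and `subset` are made up for this example.

```python
import numpy

# Rebuild the 4x5 table shown above so the example does not depend on test.dat
data = numpy.arange(1, 21, dtype=float).reshape(5, 4).T

# Rows whose first column is greater than 1 AND whose last column is less than 20
both = (data[:, 0] > 1) & (data[:, 4] < 20)
subset = data[both, :]
print(subset)

# Rows whose first column equals 1 OR equals 4
either = (data[:, 0] == 1) | (data[:, 0] == 4)
print(data[either, :])
```

Each test needs its own parentheses, because `&` and `|` are evaluated before comparisons such as `>` and `==`.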

Note that the data in `test.dat` are all integers. If we had a mixture of data types, this would cause problems for `numpy.loadtxt()`. Luckily, another `Python` package (`pandas`) contains a `DataFrame` structure that can handle multiple data types in the same data structure, like what we saw in `wages.csv` from our exercise last week.

In [14]:
import pandas
wages=pandas.read_csv("wages.csv")
wages.shape

(3294, 4)

In [15]:
wages.head(n=5)

   gender  yearsExperience  yearsSchool      wage
0  female                9           13  6.315296
1  female               12           12   5.47977
2  female               11           11   3.64217
3  female                9           14  4.593337
4  female                8           14  2.418157
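
To see how a `DataFrame` keeps a separate type for each column, here is a small sketch. It assumes the file layout implied by the output above and reads three of those rows from an in-memory string instead of `wages.csv`; the names `sample` and `wages_sample` are hypothetical.

```python
import io
import pandas

# A few rows in the same layout as the wages.head() output above, held in memory
# (an assumption for illustration; the real data live in wages.csv)
sample = io.StringIO(
    "gender,yearsExperience,yearsSchool,wage\n"
    "female,9,13,6.315296\n"
    "female,12,12,5.47977\n"
    "female,11,11,3.64217\n"
)
wages_sample = pandas.read_csv(sample)

# Each column gets its own type: text for gender, integers and floats for the rest
print(wages_sample.dtypes)
```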


We can index portions of a `pandas` DataFrame using square brackets as well. However, this works a bit differently than it does with a `numpy` array.

In [19]:
# we can access a column from a pandas DataFrame directly using .column_name
genders=wages.gender
print(genders.shape)
genders.head(n=5)

(3294,)


0    female
1    female
2    female
3    female
4    female
Name: gender, dtype: object

In [20]:
# we can also use square brackets to access specific rows within a column
fiveGenders=wages.gender[0:5]
print(fiveGenders.shape)
fiveGenders

(5,)


0    female
1    female
2    female
3    female
4    female
Name: gender, dtype: object

In [21]:
# we can also use position-based (numeric) indexing, like with numpy arrays, but this requires the .iloc indexer
fiveGenders2=wages.iloc[0:5,0]
print(fiveGenders2.shape)
fiveGenders2

(5,)


0    female
1    female
2    female
3    female
4    female
Name: gender, dtype: object
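
In `wages`, the row labels happen to match the row positions, which hides the difference between position-based and label-based indexing. The sketch below uses a small made-up table (the name `people` and its values are invented for illustration) with a non-default index to contrast `.iloc` with its label-based counterpart `.loc`.

```python
import pandas

# Small hypothetical table with a non-default index, so labels and positions differ
people = pandas.DataFrame(
    {"gender": ["female", "male", "female"], "wage": [6.3, 5.5, 3.6]},
    index=[10, 20, 30],
)

# .iloc uses integer positions: first two rows, first column
by_position = people.iloc[0:2, 0]

# .loc uses index labels and column names; label slices include both endpoints
by_label = people.loc[10:20, "gender"]

# Both selections pick the same two rows of the gender column
print(by_position.equals(by_label))
```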

We can use the logic-test based indexing with `pandas` too!

In [26]:
females=wages[wages.gender=="female"]
print(females.shape)

(1569, 4)


In [27]:
print(females.gender.unique())

['female']
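
As with `numpy`, several logic tests can be combined into one index. This is a minimal sketch on an invented miniature table (the name `toy` and its values are assumptions, not data from `wages.csv`).

```python
import pandas

# Hypothetical miniature of the wages table (values invented for illustration)
toy = pandas.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "yearsExperience": [9, 12, 11, 3],
})

# Parentheses are required around each test because & binds tighter than == and >
experienced_females = toy[(toy.gender == "female") & (toy.yearsExperience > 9)]
print(experienced_females)
```

The rows that pass the test keep their original index labels, which makes it easy to trace them back to the full table.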


&nbsp;


## Challenge

Work with your partner to develop a Python script that accomplishes the three tasks below. These tasks will require the file "wages.csv", which you should have in your local directory since you cloned the repo you forked from the TA.

&nbsp;

1. Write a file containing the unique gender-yearsExperience combinations contained in the file "wages.csv". The file you create should contain gender in the first column and yearsExperience in a second column with a space separating the two columns. The rows should be sorted first by gender and then by yearsExperience, but remember to keep the pairings in a given row intact. Don't worry about column names in the output file.

2. Return the following information to the console when the script is executed: the gender, yearsExperience, and wage for the highest earner; the gender, yearsExperience, and wage for the lowest earner; and the number of females in the top ten earners in this data set. Be sure to indicate which output is which when returning them to the console.

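3. Return one more piece of information to the console: the effect of graduating college (12 vs. 16 years of school) on the minimum wage for earners in this dataset, that is, how the minimum wage among earners with 16 years of school compares with the minimum among earners with 12 years.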

&nbsp;

Devise a plan for splitting up the work and generating the required code. Do this in parallel, not sequentially. Don't forget to check and edit each other's code. Remember to frequently `add`-`commit` locally and `push`-`pull` to GitHub to avoid conflicts. Also, remember that, thanks to GitHub, you don't have to be in the same place at the same time to work on this collaboratively!

**Turning in your assignment via GitHub**

Once you have committed all changes to your local Git repos and pushed all of those commits to the forked repo on GitHub, you can "turn in" your assignment using a `pull request`. This can be done from the GitHub repo website. When viewing the forked repo, select "Pull requests" in the upper middle of the screen, then click the green "New pull request" button in the upper right. You'll then see a screen with a history of commits for you and your collaborator; select the green "Create pull request" button. In the text box next to your user icon near the top of the page, remove whatever text is there and add "owner's last name - collaborator's last name submission", substituting your own last names. For example, if Ann Raiho and I worked on the project together, the text would read "jones-raiho submission". Then click the green "Create pull request" button. **Only one of you will need to create a pull request.**