# Automating data wrangling

Sometimes we require a "one off" solution to a unique data analysis problem. In this situation, we write code to do a particular analysis on a particular data set. Then, if the analysis is part of a publication, we make the code and data publicly available and... we're done.

Often, however, we require a ***reusable*** solution that operates on data of a given format even though some of the particulars, such as sample size or variable names, might change. In this case, we want our code to be "dynamic" in the sense that it should be able to handle any anticipated changes to the details of the input data.

Here, we'll tackle the same problem as last time – reformatting a data set from a cumbersome format into a more useful and "tidy" format.

### Learning goals:

* write reusable code for a data wrangling problem
* create a function to make the code handy to use

## Import pandas and look at the data from last time

In [21]:
import pandas as pd

Read in the data from last time.

In [22]:
my_input_data = pd.read_csv('datasets/018DataFile.csv')

Take a peek to remind ourselves of the data format.

In [23]:
my_input_data.head()

Unnamed: 0,Male Mutant,Female Mutant,Male Wild Type,Female Wild Type
0,12.333785,11.6807,20.497758,24.589659
1,11.675152,10.242694,19.525014,24.978385
2,12.029059,10.26465,20.492631,23.644814
3,12.12643,9.230632,19.928137,24.352657
4,10.307197,10.336082,19.084682,23.223305


In this data set, there are two "independent variables", sex and genotype of laboratory rats, and one "dependent variable", response time. The data are formatted such that each column contains the data from a unique combination of the two independent variables, *i.e.* a "cell" of the experimental design. Like this:

|     | male | female |
| --- | --- | --- |
| **mutant** | mm | fm |
| **wildtype** | mw | fw |

This format might seem to make sense, but it's actually not very flexible. For analysis purposes, it's generally better to have data in a format that obeys a couple of rules:

* *each row should correspond to a single observation (measurement)*
* *each column should correspond to a single variable*

Data in this format are also referred to as "tidy".

So in this case, our goal is to take the above data and put it into a format like this:

| response time | sex | genotype |
| ---| --- | --- |
| rt value | male or female | wild or not |

Once the data are in this format, we can easily use our tools to do things like compare wild to mutant, or compare wild to mutant only in females, etc.

Last time, we stacked the reaction time values into a single column using pandas functions. This relied on us knowing and "hard coding" the column names ("Male Mutant", etc.). If we're going to automate things, we want our code to be agnostic about these. One way would be to read the column names into variables and work with those...
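
For illustration only, reading the column names into variables might look something like the sketch below. The one-row `toy` frame is made up and borrows only the column names shown by `head()`; `column_names` and `pairs` are hypothetical names, and the rest of this notebook takes a different route.

```python
import pandas as pd

# Synthetic one-row frame that only borrows the column names shown by head().
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0]],
    columns=["Male Mutant", "Female Mutant", "Male Wild Type", "Female Wild Type"],
)

column_names = list(toy.columns)   # the names, as ordinary Python strings

# Split each name at the first space: "Male Wild Type" -> ("Male", "Wild Type")
pairs = [tuple(name.split(" ", 1)) for name in column_names]

for sex, genotype in pairs:
    print(sex, "|", genotype)
```

This only works if every column name follows the same "sex, space, genotype" pattern, which is an assumption about the file rather than something the code can guarantee.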

Here, we are going to introduce a new package called `numpy` (we'll be covering `numpy` in much more detail later). The `numpy` package allows us to represent tables of numbers as numbers only; there are no column names or row indexes to worry about. You can think of numpy arrays as a stepping stone between a Python list-of-lists, in which the tabular nature of data is implied but not explicit, and a pandas `DataFrame`, in which the tabular nature is explicit, but columns have names and the rows have indexes. A numpy array sits in the middle; it has explicit rows and columns, but there are no names or indexes to bother about. Arrays in `numpy` are just the data and only the data, but arranged in rows and columns.
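
As a rough sketch of that "stepping stone" idea, here are the same made-up numbers in all three forms. The names `rows`, `arr`, and `df` and the column labels `"a"` and `"b"` are just for illustration.

```python
import numpy as np
import pandas as pd

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # list of lists: a table only by convention

arr = np.array(rows)                          # explicit rows and columns, no labels
print(arr.shape)                              # (3, 2)

df = pd.DataFrame(arr, columns=["a", "b"])    # same numbers plus column names and a row index
print(df.to_numpy().shape)                    # labels stripped again: (3, 2)
```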

So let's try using numpy!

In [24]:
import numpy as np

Pandas dataframes know how to convert themselves to numpy arrays. They have a `to_numpy()` method that will pull *just the numbers* out of our dataframe, ignoring the column labels and row indexes.

In [25]:
raw_data = my_input_data.to_numpy() 

Let's take a look!

In [26]:
raw_data

array([[12.33378457, 11.68069959, 20.49775794, 24.58965924],
       [11.6751524 , 10.24269437, 19.52501417, 24.97838494],
       [12.0290592 , 10.26465025, 20.49263055, 23.64481397],
       [12.1264299 ,  9.23063235, 19.9281375 , 24.35265655],
       [10.30719715, 10.33608152, 19.08468218, 23.22330529],
       [12.2544078 , 10.07605573, 20.71048269, 24.43905903],
       [12.54308949,  8.2865646 , 19.19128027, 24.81529791],
       [11.61341333,  8.2950312 , 21.71823849, 23.87161107],
       [13.03543524, 10.61335271, 20.10639852, 25.32321127],
       [12.58005895,  9.33971663, 18.62673355, 25.49125457],
       [12.14134418, 10.4446447 , 21.48322922, 23.59188335],
       [10.76797804, 11.73504873, 20.22945171, 23.22910822],
       [12.18229646,  9.86065481, 18.45275337, 24.89268429],
       [13.25883451, 10.49711074, 19.86921999, 22.87740504],
       [12.10180863, 10.27518936, 20.24940003, 24.08539592],
       [10.14111384, 10.19483253, 18.86189578, 24.88682079],
       [10.51735514,  9.

## Get some useful information from the original data

So far so good! Now we are going to put the data into the format we want. To automate this, we are going to get 

* the number of observations in each group (which is the number of rows), and 
* the number of groups (which is the number of columns)

and store them in variables.

In [27]:
obs_per_grp, grps = raw_data.shape
print("We have ", obs_per_grp, " observations per group and ", grps, " groups.")

We have  20  observations per group and  4  groups.


Now we'll calculate the total number of observations, which is also how long we want our new data frame to be.

In [28]:
new_length = obs_per_grp*grps
print("We have ", new_length, " total observations.")

We have  80  total observations.


$\color{blue}{\text{Complete the following exercise.}}$

  - Use the cell below and explain in your own words why we used Numpy Arrays in the previous cells. What was our final goal? Why did we dump the data into a Numpy Array?

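We use numpy because we want just the data values, arranged in rows and columns, without column names or row indexes to work around. Our final goal is to rearrange those values into a single column for the new tidy data frame.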

## Build our response time (dependent variable) column

We could now play legos "by hand", stacking the columns of our numpy array on top of each other to make a new array (and we already know how to do that). 

Or we could take advantage of the fact that one of the things numpy arrays know how to do – one of the methods they have – is to change their shape. So we'll take our `obs_per_grp` by `grps` array and `numpy.reshape()` it into a `new_length` by 1 array.

What this command does (effectively) is read out the data values from the original array one-by-one, and places them in the cells of a new array of a shape you specify. The only catch is that the total number of cells in the new array has to be the same as in the old array – in other words, each and every data value has to have one and only one place to go in the new array. Which makes sense.
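
To see the "same number of cells" rule in action, here is a small sketch on a made-up 2 by 3 array (the names `small` and `tall` are hypothetical, not part of the notebook's data):

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])        # 2 rows x 3 columns = 6 cells

tall = np.reshape(small, (6, 1))     # 6 x 1 also has 6 cells, so this is allowed
print(tall.shape)                    # (6, 1)

try:
    np.reshape(small, (4, 1))        # 4 cells cannot hold 6 values
except ValueError as err:
    print("reshape refused:", err)
```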

In [29]:
values_col = np.reshape(raw_data, (new_length, 1))

We called it `values_col` because it will eventually become the values column of our new pandas data frame.

Let's see if that worked:

In [30]:
values_col

array([[12.33378457],
       [11.68069959],
       [20.49775794],
       [24.58965924],
       [11.6751524 ],
       [10.24269437],
       [19.52501417],
       [24.97838494],
       [12.0290592 ],
       [10.26465025],
       [20.49263055],
       [23.64481397],
       [12.1264299 ],
       [ 9.23063235],
       [19.9281375 ],
       [24.35265655],
       [10.30719715],
       [10.33608152],
       [19.08468218],
       [23.22330529],
       [12.2544078 ],
       [10.07605573],
       [20.71048269],
       [24.43905903],
       [12.54308949],
       [ 8.2865646 ],
       [19.19128027],
       [24.81529791],
       [11.61341333],
       [ 8.2950312 ],
       [21.71823849],
       [23.87161107],
       [13.03543524],
       [10.61335271],
       [20.10639852],
       [25.32321127],
       [12.58005895],
       [ 9.33971663],
       [18.62673355],
       [25.49125457],
       [12.14134418],
       [10.4446447 ],
       [21.48322922],
       [23.59188335],
       [10.76797804],
       [11

Nice! But let's make absolutely sure that worked. What we want is for the columns of the original data to be stacked on top of one another. Is that what we have?

Nope, it's not right. What happened is that the values got read out *left to right, top to bottom* (or row-wise) and placed into the new array one-by-one. But what we want is for the values to be read *top to bottom, left to right* (or columnwise). We can make this happen with the `order=` argument of `numpy.reshape()`.

In [31]:
values_col = np.reshape(raw_data, (new_length, 1), order = 'F')

Let's make sure that worked:

In [32]:
values_col

array([[12.33378457],
       [11.6751524 ],
       [12.0290592 ],
       [12.1264299 ],
       [10.30719715],
       [12.2544078 ],
       [12.54308949],
       [11.61341333],
       [13.03543524],
       [12.58005895],
       [12.14134418],
       [10.76797804],
       [12.18229646],
       [13.25883451],
       [12.10180863],
       [10.14111384],
       [10.51735514],
       [13.94314857],
       [10.86396375],
       [11.35527842],
       [11.68069959],
       [10.24269437],
       [10.26465025],
       [ 9.23063235],
       [10.33608152],
       [10.07605573],
       [ 8.2865646 ],
       [ 8.2950312 ],
       [10.61335271],
       [ 9.33971663],
       [10.4446447 ],
       [11.73504873],
       [ 9.86065481],
       [10.49711074],
       [10.27518936],
       [10.19483253],
       [ 9.81661295],
       [ 8.47491819],
       [10.83815446],
       [11.50135182],
       [20.49775794],
       [19.52501417],
       [20.49263055],
       [19.9281375 ],
       [19.08468218],
       [20

**Yay!** It did!

**Geek trivia**: Two of Ye Olde Major Programming Languages are **C** (used mainly by programmers) and **Fortran** (used mainly by scientists). C (the language used to write Python) uses row-wise indexing, whereas Fortran uses columnwise indexing. That's why "F" is used to specify columnwise indexing above: the "F" is for "Fortran".
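
The difference between the two read-out orders is easiest to see on a tiny made-up array. This is only a sketch; `small`, `row_wise`, and `col_wise` are names invented for the example.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

row_wise = np.reshape(small, (6, 1), order="C")   # default: read along each row first
col_wise = np.reshape(small, (6, 1), order="F")   # read down each column first

print(row_wise[:, 0])   # [1 2 3 4 5 6]
print(col_wise[:, 0])   # [1 4 2 5 3 6]
```

With `order="F"`, the values from each original column stay next to each other, which is the "stacked columns" layout we are after.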

Minor annoying thing: (there is always at least one that pops up in any coding task, amirite?) `values_col` is an (80x1) 2-dimensional numpy array but, when we go to build our new data frame, we'll need it to be an 80-long (80,) 1-dimensional array. 

This actually comes up so often that `numpy` has a `squeeze()` function to squeeze the dimension of length one into nothingness. It turns (n, 1) things into (n,) things.

Let's check the shape of our new array:

In [33]:
values_col.shape

(80, 1)

Now let's squeeze the (unneeded and unwanted) column dimension into oblivion:

In [34]:
values_col = np.squeeze(values_col)

And check the shape again:

In [35]:
values_col.shape

(80,)
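
The same squeeze can be seen on a toy array that is small enough to print in full. This is a sketch with made-up numbers; `column` and `flat` are hypothetical names.

```python
import numpy as np

column = np.array([[1.5], [2.5], [3.5]])   # shape (3, 1): three rows, one column
flat = np.squeeze(column)                  # shape (3,): the length-one axis is gone

print(column.shape, "->", flat.shape)      # (3, 1) -> (3,)
print(flat)                                # [1.5 2.5 3.5]
```

Note that `np.squeeze()` removes *every* axis of length one unless you pass an `axis=` argument, so it is worth checking the shape afterwards, as we did above.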

Okay, that worked, now onto...

$\color{blue}{\text{Complete the following exercise.}}$

  - Use the next cell to explain what happened to the numpy array after the squeeze operation:

With shape (80, 1), the array has two dimensions: 80 rows, plus a column dimension of length one that we don't want or need. The squeeze function gets rid of that length-one dimension, leaving a 1-dimensional array of shape (80,).

  - Type code below demonstrating how you could explore the help for the attribute `.shape` to see what it does:

In [36]:
raw_data.shape

#the shape function produces (number of rows, number of columns)

(20, 4)

   - Use the cell below to explain the use of the method `.reshape()`:

The .reshape() method reads the values out of an array and places them into a new array of the shape we specify. Without order = 'F', the values are read out row by row. With order = 'F', they are read out column by column, so the values from each original column stay together.

## Building the independent variable columns

What we require is that the levels of our two independent variables repeat themselves in the right order down their respective columns. We could certainly type this in by hand, but that would be really annoying to change if we required new labels later on or something. 

We could also use `for()` loops; they are designed for exactly such repetitive tasks after all. That might look something like this:

In [37]:
gen_var = list()                     # create a python list 
for i in range(new_length) :         # loop through all observations
    if i < new_length/2 :            # for the first half, ...
        gen_var.append("wildtype")   # set to wildtype
    else :                           # otherwise...
        gen_var.append("mutant")     # set to mutant

In [38]:
print(gen_var)

['wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'wildtype', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant', 'mutant']


We'd have to get a little bit more fancy with our `if...` to create the sex variable, but that'd be the idea.
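
One possible way to make the `if...` "more fancy" for the sex variable is sketched below. It uses toy sizes and hypothetical names (`n_per_block`, `n_blocks`, `sex_var`) so it doesn't touch the notebook's own variables; the idea is to work out which column-sized block a row falls in and alternate on that.

```python
n_per_block = 3                      # toy value; the notebook's data has 20 per group
n_blocks = 4                         # one block per original column
total = n_per_block * n_blocks

sex_var = []
for i in range(total):
    block = i // n_per_block         # which block this row falls in: 0, 1, 2 or 3
    if block % 2 == 0:               # even-numbered blocks ...
        sex_var.append("male")       # ... are male
    else:                            # odd-numbered blocks ...
        sex_var.append("female")     # ... are female

print(sex_var)
```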

But pandas provides easy ways to repeat and stack things (numpy does too), so let's try those. The two we'll use are

* `pandas.Series.repeat()` 
* `pandas.concat()`

Note: When you see `pandas.Series.somefunction()` or `pandas.DataFrame.somefunction()` in the documentation, that means that all Series or DataFrames know how to do `somefunction()`. So if you had a Series named `Phred`, you would say `Phred.somefunction()` to use `somefunction()`.
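
As a quick sketch of that calling convention, here is `pandas.Series.repeat()` used on a tiny made-up Series named `phred`. It also shows why an index clean-up step is usually needed afterwards; `repeated` and `clean` are names invented for the example.

```python
import pandas as pd

phred = pd.Series(["a", "b"])

repeated = phred.repeat(3)                # each element three times: a a a b b b
print(list(repeated.index))               # the original labels come along: [0, 0, 0, 1, 1, 1]

clean = repeated.reset_index(drop=True)   # replace them with a fresh 0..n-1 index
print(list(clean.index))                  # [0, 1, 2, 3, 4, 5]
```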

$\color{blue}{\text{Complete the following exercise.}}$

  - Use the cell below to explain what the variable `new_length` contain:

new_length contains the number of observations we have in the column

   - Use the cell below to explain the reason why we use `new_length/2` in combination with the `if, else`:

new_length/2 splits the data in half, which is how the wildtypes and mutants are split in the data. The first half of the data is wild type, and the second half is mutant. If i is 40 or more, we're in the second half of the data, which should be labelled mutant.

### Make the genetic strain variable

In the way we have formatted the data, genetic strain is the "outer" variable, in that it only changes once as we go down the data set: all the wildtypes are on top, and all mutants are on the bottom. The sex variable is the "inner" variable, because it changes once within each value of strain, so it needs to change three times as we go down the data set.

This is arbitrary and has nothing to do with the experimental design; we could have formatted the data such that the roles were reversed.

What we will do is 

* make a short series containing the two levels of our variable
* repeat each value to make the long series 
* deal with annoying index values (there's always something...)

In [39]:
strain = pd.Series(['wildtype', 'mutant'])  # make the short series

strain = strain.repeat(2*obs_per_grp)       # repeat each over two cells' worth of data

strain = strain.reset_index(drop=True)  # reset the series's index value


$\color{blue}{\text{Complete the following exercise.}}$

  - Use the cell below to explain what the variable `strain` is and what it contains:

The variable strain is a pandas Series holding the strain label for each observation. Essentially it's the same thing we built with the `for` loop, but now as a pandas Series instead of a Python list. The first 40 entries are wildtype, and the last 40 entries are mutant.

Let's see if that worked:

In [40]:
print(strain)

0     wildtype
1     wildtype
2     wildtype
3     wildtype
4     wildtype
        ...   
75      mutant
76      mutant
77      mutant
78      mutant
79      mutant
Length: 80, dtype: object


$\color{blue}{\text{Complete the following exercise.}}$

  - Use the cell below to explain why `mutants` appear at the bottom of the previous `Pandas Series`, who decided that order?

We decided that order when we created the pandas Series: 'wildtype' is at index 0 and 'mutant' is at index 1, and `repeat()` keeps that order.

### Make the sex variable

As the sex variable is the inner variable, we need it to have `['male'..., 'female'...]` within each outer block of genotype. So what we'll do is make one block of `['male'..., 'female'...]` and then just stack two copies of that to make our variable. So the steps are

* make a short series containing the two levels of our variable (just like above)
* repeat it (just like above)
* stack two copies on top of each other (dropping the annoying indexes in the process)

In [41]:
sexes = pd.Series(['male', 'female'])             # make the short series
sexes = sexes.repeat(obs_per_grp)                 # repeat each over one cell's worth of data
sexes = pd.concat([sexes]*2, ignore_index=True)   # stack or "concatenate" two copies

In [42]:
print(sexes)

0       male
1       male
2       male
3       male
4       male
       ...  
75    female
76    female
77    female
78    female
79    female
Length: 80, dtype: object
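
To see what `[sexes]*2` and `ignore_index=True` are each doing, here is a sketch on a shortened block (2 repeats instead of 20). The names `block`, `kept`, and `fresh` are made up for the example.

```python
import pandas as pd

block = pd.Series(["male", "female"]).repeat(2)     # male male female female

kept = pd.concat([block] * 2)                       # [block, block]; old index labels kept
fresh = pd.concat([block] * 2, ignore_index=True)   # same values, new 0..7 index

print(list(kept.index))    # [0, 0, 1, 1, 0, 0, 1, 1]
print(list(fresh.index))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(list(fresh))
```

Multiplying a Python list by 2 just makes a list containing the same Series twice, and `pd.concat()` then stacks the entries of that list end to end.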


$\color{blue}{\text{Complete the following exercise.}}$

  - Use the cell below to explain in your own words what happened in the previous cell:

We first create a pandas Series with 'male' at index 0 and 'female' at index 1. Then we repeat each value 20 times, so the first 20 entries are male and the next 20 are female. Then we stack two copies of that block to get our final length of 80, so the data run male, female, male, female in blocks of 20. Finally, `ignore_index=True` discards the repeated 0 and 1 index values and gives the series a fresh index that simply counts the rows in order.

  - Use the cell below to show your code to create a pandas series called `unicorns` comprising 20 mystical equines, half of which are `white` and half `pearl-white` in color (well ... what do you want, they are unicorns):

In [43]:
unicorns = pd.Series(['white', 'pearl-white'])
unicorns = unicorns.repeat(10)
unicorns = unicorns.reset_index(drop=True)
print(unicorns)

0           white
1           white
2           white
3           white
4           white
5           white
6           white
7           white
8           white
9           white
10    pearl-white
11    pearl-white
12    pearl-white
13    pearl-white
14    pearl-white
15    pearl-white
16    pearl-white
17    pearl-white
18    pearl-white
19    pearl-white
dtype: object


  - Use the cell below to show your code to create a pandas series called `three_trees` comprising 30 trees, 1/3 of which are `Live Oaks`, 1/3 `White Oaks` and 1/3 `Red Oaks`:

In [44]:
three_trees = pd.Series(['Live Oaks', 'White Oaks', 'Red Oaks'])
three_trees = three_trees.repeat(10)
three_trees = three_trees.reset_index(drop=True)
print(three_trees)

0      Live Oaks
1      Live Oaks
2      Live Oaks
3      Live Oaks
4      Live Oaks
5      Live Oaks
6      Live Oaks
7      Live Oaks
8      Live Oaks
9      Live Oaks
10    White Oaks
11    White Oaks
12    White Oaks
13    White Oaks
14    White Oaks
15    White Oaks
16    White Oaks
17    White Oaks
18    White Oaks
19    White Oaks
20      Red Oaks
21      Red Oaks
22      Red Oaks
23      Red Oaks
24      Red Oaks
25      Red Oaks
26      Red Oaks
27      Red Oaks
28      Red Oaks
29      Red Oaks
dtype: object


### Build our new data frame!

Data frames are created in pandas by handing it data it can make sense of. There are various ways to accomplish this, and one handy one is to hand it data in a "column label 1 : data 1, column label 2 : data 2, ..." format. 

We can accomplish this with a python "dictionary" (remember those?). A python `dict` associates a label (the "word") with a value or set of values or whatever (the "definition"). They are very useful, so let's take a look at a simple example before we use one to build our data frame. You create a dictionary using curly braces, and then use colons to bind each word or `key` with its definition or `value`. Commas separate each key-value pair.

In [45]:
myData = {"name": "Larry", "rank": "full", "years": 30, "bikes": 5, "motorcycles": 2, "teslas": 1}

In [46]:
myData["name"]

'Larry'

In [47]:
myData["bikes"]

5

$\color{blue}{\text{Complete the following exercise.}}$

  - Use the cell below to build a `dict()` describing a student, with a name, a student ID, a GPA and a major. Make up all the values, but use the labels as described here:

In [48]:
stu_dict = {"name": "Phoebe", "student ID": 45, "GPA": 4.0, "major": 'Psychology'}

So a dictionary associates a label with data values. **Perfect!**

Time to build our data frame!

In [50]:
my_tidy_data = pd.DataFrame(      # invoke creation
    {                             # start the dictionary with a {
        "RTs": values_col,        # assign each variable to a label
        "sex": sexes,
        "strain": strain
    }                             # end the dictionary with a }
)                                 # end of creation

Note that the formatting above is just to make the columns we're creating more obvious and human-readable. This will work too:

In [51]:
my_tidy_data = pd.DataFrame({"RTs": values_col, "sex": sexes, "strain": strain})

It's just not as pretty.

Let's look at our creation!

In [52]:
my_tidy_data

Unnamed: 0,RTs,sex,strain
0,12.333785,male,wildtype
1,11.675152,male,wildtype
2,12.029059,male,wildtype
3,12.126430,male,wildtype
4,10.307197,male,wildtype
...,...,...,...
75,24.886821,female,mutant
76,24.475663,female,mutant
77,21.935896,female,mutant
78,23.852748,female,mutant


Yay! We win!

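**Important point:** Crucially, *the above code doesn't rely on us knowing much about the input data ahead of time*. As long as it's a pandas data frame of numerical values with one column per cell of the 2 by 2 design, in the column order our labels assume, the code will run no matter how many observations each group has. It's automatic.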
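
Because the label columns are built by hand, it is worth checking that they line up with the stacked values. One possible sanity check, sketched here on synthetic numbers that borrow only the column names shown by `head()` earlier, is to carry each value's original column name along with it. The names `toy`, `stacked`, `source_col`, and `check` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the input file: real column names, made-up numbers.
toy = pd.DataFrame({
    "Male Mutant": [1.0, 2.0],
    "Female Mutant": [3.0, 4.0],
    "Male Wild Type": [5.0, 6.0],
    "Female Wild Type": [7.0, 8.0],
})

n_obs, n_grps = toy.shape
stacked = np.reshape(toy.to_numpy(), n_obs * n_grps, order="F")   # stack the columns

# Repeat each column name once per observation, in the same stacking order.
source_col = pd.Series(toy.columns).repeat(n_obs).reset_index(drop=True)

check = pd.DataFrame({"value": stacked, "source_column": source_col})
print(check)
```

With the column order shown by `head()`, the two mutant columns are stacked first and the two wild type columns last. Comparing a column like `source_column` against hand-built `sex` and `strain` labels shows whether the order of the levels (for example `['wildtype', 'mutant']` versus `['mutant', 'wildtype']`) matches the column order of the file.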

## Look at new data with more observations with same code

We'll make this code self-contained, so it can be run without running anything above. We'll also add comments, so that future-us can read the code more easily without having to wade through the notebook text above.

In [53]:
my_input_data = pd.read_csv('datasets/018DataFile.csv')  # read the data

raw_data = my_input_data.to_numpy()                      # convert to numpy array

obs, grps = raw_data.shape                               # get the number of rows and columns

Check the size of the new data real quick:

In [54]:
print("We have ", obs, " observations per group and ", grps, " groups.")

We have  20  observations per group and  4  groups.


And now run the "meat" of the code:

In [55]:
new_length = obs*grps                                    # compute total number of observations

values_col = np.reshape(raw_data, (new_length, 1), 
                        order = 'F')                     # reshape the array
values_col = np.squeeze(values_col)                      # squeeze to make 1D

# construct the inner grouping variable
sexes = pd.Series(['male', 'female'])                    # define the levels
sexes = sexes.repeat(obs)                                # make one cycle of the levels
sexes = pd.concat([sexes]*2, ignore_index=True)          # and repeat the cycle, ditching the indexes

# construct the outer grouping variable
strain = pd.Series(['wildtype', 'mutant'])               # define the levels
strain = strain.repeat(2*obs)                            # make the one cycle
strain = strain.reset_index(drop=True)                   # drop the pesky index

# construct the data frame
my_new_tidy_data = pd.DataFrame(
    {
        "RTs": values_col,                               # make a column named RTs and put the values in
        "sex": sexes,                                    # ditto for sex
        "strain": strain                                 # and for genetic strain
    }    
)

In [56]:
my_new_tidy_data

Unnamed: 0,RTs,sex,strain
0,12.333785,male,wildtype
1,11.675152,male,wildtype
2,12.029059,male,wildtype
3,12.126430,male,wildtype
4,10.307197,male,wildtype
...,...,...,...
75,24.886821,female,mutant
76,24.475663,female,mutant
77,21.935896,female,mutant
78,23.852748,female,mutant


**Success!**

## Making the code even more functional

Now we have a chunk of code that seems handy and re-usable. How could we make it even more handy?

If we make it into a ***function***, then we can run the whole entire thing just by typing one command – no copying, no pasting, fewer ways to make mistakes.

### Defining a function
Since we already have all the code, we can literally just indent it and throw a `def...` in front of it!

In [57]:
def tidyMyData() :
    import pandas as pd
    import numpy as np

    my_input_data = pd.read_csv('datasets/018DataFile.csv')  # read the data

    raw_data = my_input_data.to_numpy()                      # convert to numpy array

    obs, grps = raw_data.shape                               # get the number of rows and columns

    new_length = obs*grps                                    # compute total number of observations

    values_col = np.reshape(raw_data, (new_length, 1), 
                            order = 'F')                     # reshape the array
    values_col = np.squeeze(values_col)                      # squeeze to make 1D

    # construct the inner grouping variable
    sexes = pd.Series(['male', 'female'])                    # define the levels
    sexes = sexes.repeat(obs)                                # make one cycle of the levels
    sexes = pd.concat([sexes]*2, ignore_index=True)     # and repeat the cycle, ditching the indexes

    # construct the outer grouping variable
    strain = pd.Series(['wildtype', 'mutant'])               # define the levels
    strain = strain.repeat(2*obs)                            # make the one cycle
    strain = strain.reset_index(drop=True)                   # drop the pesky index

    # construct the data frame
    my_new_tidy_data = pd.DataFrame(
        {
            "RTs": values_col,                               # make a column named RTs and put the values in
            "sex": sexes,                                    # ditto for sex
            "strain": strain                                 # and for genetic strain
        }    
    )
    
    return my_new_tidy_data

In [58]:
datFromFun = tidyMyData()

In [59]:
datFromFun

Unnamed: 0,RTs,sex,strain
0,12.333785,male,wildtype
1,11.675152,male,wildtype
2,12.029059,male,wildtype
3,12.126430,male,wildtype
4,10.307197,male,wildtype
...,...,...,...
75,24.886821,female,mutant
76,24.475663,female,mutant
77,21.935896,female,mutant
78,23.852748,female,mutant


### Defining a function with an argument
A common (very common) scenario in data analysis is wanting to run the same code – like the code we just wrote – on different files. So one really nice addition to this function would be to add the ability for the user to specify a filename to tell the function which data file to read.

This is actually fairly straightforward. All we have to do is add an **argument** to our function, and then replace the hardcoded filename in the function with the **variable** created by the function argument.

In [60]:
def tidyMyData(filename) :
    import pandas as pd
    import numpy as np

    my_input_data = pd.read_csv(filename)  # read the data

    raw_data = my_input_data.to_numpy()                      # convert to numpy array

    obs, grps = raw_data.shape                               # get the number of rows and columns

    new_length = obs*grps                                    # compute total number of observations

    values_col = np.reshape(raw_data, (new_length, 1), 
                            order = 'F')                     # reshape the array
    values_col = np.squeeze(values_col)                      # squeeze to make 1D

    # construct the inner grouping variable
    sexes = pd.Series(['male', 'female'])                    # define the levels
    sexes = sexes.repeat(obs)                                # make one cycle of the levels
    sexes = pd.concat([sexes]*2, ignore_index=True)     # and repeat the cycle, ditching the indexes

    # construct the outer grouping variable
    strain = pd.Series(['wildtype', 'mutant'])               # define the levels
    strain = strain.repeat(2*obs)                            # make the one cycle
    strain = strain.reset_index(drop=True)                   # drop the pesky index

    # construct the data frame
    my_new_tidy_data = pd.DataFrame(
        {
            "RTs": values_col,                               # make a column named RTs and put the values in
            "sex": sexes,                                    # ditto for sex
            "strain": strain                                 # and for genetic strain
        }    
    )
    
    return my_new_tidy_data

Now we can call the function and specify whatever data files exist. Let's try it with "datasets/018DataFile2.csv"!

In [61]:
newDataFromFun = tidyMyData("datasets/018DataFile2.csv")

In [62]:
newDataFromFun

Unnamed: 0,RTs,sex,strain
0,12.577226,male,wildtype
1,12.778183,male,wildtype
2,13.389130,male,wildtype
3,12.747877,male,wildtype
4,13.615121,male,wildtype
...,...,...,...
163,24.539374,female,mutant
164,23.877924,female,mutant
165,23.161896,female,mutant
166,24.426455,female,mutant


### Adding help
It's always a good idea to **heavily comment your code!** 

When writing functions, it's also a good idea to add a documentation string, called a `docstring`, to your function. This way people can get help on your function with the `help()` function. Like `help(tidyMyData)`.

In [63]:
def tidyMyData(filename) :
    '''
    tidyMyData() Takes one-column-per-cell rat reaction time data as input.
    Returns tidy one-column-per-variable data.
    User specifies a filename string.
    '''
    
    import pandas as pd
    import numpy as np

    my_input_data = pd.read_csv(filename)  # read the data

    raw_data = my_input_data.to_numpy()                      # convert to numpy array

    obs, grps = raw_data.shape                               # get the number of rows and columns

    new_length = obs*grps                                    # compute total number of observations

    values_col = np.reshape(raw_data, (new_length, 1), 
                            order = 'F')                     # reshape the array
    values_col = np.squeeze(values_col)                      # squeeze to make 1D

    # construct the inner grouping variable
    sexes = pd.Series(['male', 'female'])                    # define the levels
    sexes = sexes.repeat(obs)                                # make one cycle of the levels
    sexes = pd.concat([sexes]*2, ignore_index=True)     # and repeat the cycle, ditching the indexes

    # construct the outer grouping variable
    strain = pd.Series(['wildtype', 'mutant'])               # define the levels
    strain = strain.repeat(2*obs)                            # make the one cycle
    strain = strain.reset_index(drop=True)                   # drop the pesky index

    # construct the data frame
    my_new_tidy_data = pd.DataFrame(
        {
            "RTs": values_col,                               # make a column named RTs and put the values in
            "sex": sexes,                                    # ditto for sex
            "strain": strain                                 # and for genetic strain
        }    
    )
    
    return my_new_tidy_data

In [65]:
help(tidyMyData)

Help on function tidyMyData in module __main__:

tidyMyData(filename)
    tidyMyData() Takes one-column-per-cell rat reaction time data as input.
    Returns tidy one-column-per-variable data.
    User specifies a filename string.



$\color{blue}{\text{Complete the following exercise.}}$


- Use the cell below to show how you would modify the previous function so as to make it even more flexible. Let the user specify the output column headers to be whatever they want.

More specifically, how would you allow passing in the three labels, `sex`, `RTs` and `strain`, instead of having them 'hard coded' inside the code? This means that instead of using fixed labels such as `sex`, `RTs` and `strain`, we will want to pass a parameter for each one of the labels and use the parameters in the function. For example, instead of `sex`, `RTs` and `strain` we might want to pass others, say `s`, `ReactionTime` and `type`, or any other combination of three labels: always three, but they can change every time we call the function.

You would do this with arguments (obviously). But you could do it with multiple arguments, so users would call it like:

`tidyMyData("datasets/018DataFile2.csv", "Times", "Gender", "Genotype")`

or you could do it with one additional argument, so the user would call it by either:

`tidyMyData("datasets/018DataFile2.csv", ["Times", "Gender", "Genotype"])`

or

`colNames = ["Times", "Gender", "Genotype"]`

`tidyMyData("datasets/018DataFile2.csv", colNames)`

Pro tip: The function would probably be most handy if there were *default* values for the column names, so that the user could just type something like

`myTidyData = tidyMyData("datasets/018DataFile2.csv")`

if they didn't want to specify custom column headers.
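
Default values are set right in the `def` line. Here is a minimal sketch of the idea using a toy function, `describe_columns()`, which is hypothetical and only reports which headers it would use rather than tidying any data:

```python
def describe_columns(filename, col_names=("Times", "Gender", "Genotype")):
    """Toy stand-in: report which column headers would be used for a file."""
    values_name, sex_name, strain_name = col_names   # expects exactly three names
    return {"file": filename, "headers": [values_name, sex_name, strain_name]}

# No headers given: the defaults from the def line are used.
print(describe_columns("datasets/018DataFile2.csv"))

# Headers given: they replace the defaults for this call only.
print(describe_columns("datasets/018DataFile2.csv", ["RTs", "sex", "strain"]))
```

The default here is a tuple rather than a list, which is a common habit for default arguments because a tuple can't be changed by accident between calls.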

In [66]:
def tidyMyData(filename, col_name1="Times", col_name2="Gender", col_name3="Genotype") :
    '''
    tidyMyData() Takes one-column-per-cell rat reaction time data as input.
    Returns tidy one-column-per-variable data.
    User specifies a filename string and, optionally, the three output
    column headers (for the values, sex and strain columns, in that order).
    '''
    
    import pandas as pd
    import numpy as np

    my_input_data = pd.read_csv(filename)  # read the data

    raw_data = my_input_data.to_numpy()                      # convert to numpy array

    obs, grps = raw_data.shape                               # get the number of rows and columns

    new_length = obs*grps                                    # compute total number of observations

    values_col = np.reshape(raw_data, (new_length, 1), 
                            order = 'F')                     # reshape the array
    values_col = np.squeeze(values_col)                      # squeeze to make 1D
    
    colNames = [col_name1, col_name2, col_name3]             # collect the output column headers

    # construct the inner grouping variable
    sexes = pd.Series(['male', 'female'])                    # define the levels
    sexes = sexes.repeat(obs)                                # make one cycle of the levels
    sexes = pd.concat([sexes]*2, ignore_index=True)          # and repeat the cycle, ditching the indexes

    # construct the outer grouping variable
    strain = pd.Series(['wildtype', 'mutant'])               # define the levels
    strain = strain.repeat(2*obs)                            # make the one cycle
    strain = strain.reset_index(drop=True)                   # drop the pesky index

    # construct the data frame
    my_new_tidy_data = pd.DataFrame(
        {
            colNames[0]: values_col,                         # values column, named by col_name1
            colNames[1]: sexes,                              # sex column, named by col_name2
            colNames[2]: strain                              # strain column, named by col_name3
        }    
    )
    
    return my_new_tidy_data