# Working With Multiple Data Files Using Copilot

In this lesson we will revisit some of the material covered at the end of the introductory chapter on Python. Specifically, we will read multiple data files, manipulate the data using pandas, and derive some basic information from the data.

The instructions are very high-level, because we want you to practice writing Copilot prompts yourself, rather than following step-by-step guidance from us. The version of this lesson in the online textbook, however, shows the worked example with prompts and Copilot-generated code.

## Instructions

Your task is to perform the operations below, using Python code generated by a combination of your own writing and GitHub Copilot's suggestions. For your own learning, you are best off starting each step by trying to write the code yourself. If it works, great – you've learned something in Python! Congratulations. But the tasks below intentionally ask you to do things that we haven't taught in this course yet. That's because we want you to learn how to use Copilot to help you out when you don't know how to do something.

If you try writing your own code and it doesn't work, then you can try again, or use Copilot to help you out. Copilot is meant to speed up your coding, so in practice it's preferable to use it to help you out when you get stuck, rather than spending too much time trying to figure out how to do something yourself. On the other hand, once you have the solution from Copilot, it's important to try to understand what it is doing. You may have to look up some of the functions it's using, if you're not familiar with them.

If you don't understand what the code you generated is doing, you risk errors. Indeed, in the tutorial below we will see an example of Copilot generating erroneous code. Even if the code works as desired, it is important for you to be able to explain correctly what you did. Also, look for ways to check that the results you get are correct (using code).



### The Data and What to Do With It

There are three data files in the `data` directory: `s1.csv`, `s2.csv`, and `s3.csv`. Each file contains the reaction times (RTs) from 10 trials of a relatively simple task in which participants had to indicate which direction a briefly-presented arrow was pointing. The RTs are in seconds (s). Each file contains the RTs from a different participant. In each file there are three columns. You can determine what the columns are by looking at the first row (header) of each file.

Generate code that reads in the three files and combines them in a pandas DataFrame. Include error-checking code that verifies you loaded the correct number of trials (30) and the correct number of columns (3). If the number of trials or columns is incorrect, print an error message.

Save the dataframe to a file called `all_data.csv`, in the `data` folder.

Once you have the DataFrame, calculate:
- the mean RT for each participant
- the mean RT across all participants
- the 95% confidence intervals (CIs) for the mean RT for each participant
- the 95% CIs for the mean RT across all participants

Finally, print a table that includes all of the results: with columns for mean RT, lower 95% CI, and upper 95% CI, and with one row for each participant and the bottom row showing the mean and CIs across participants.

### How to Approach This Exercise

A critical skill to develop in coding is **problem decomposition**, or **programmatic thinking** — in other words, breaking down a task into smaller and smaller components, so that you can write the code to perform each step in the logical sequence. The instructions above are written in a sequential way, so you should be able to identify each individual step that  you need to take.

It is good practice, when working in Jupyter notebooks, to write the code for each step in a separate cell. This makes it easier to test each step, and to go back and change things if you need to. It also makes it easier to see what you've done, and understand what the code is doing. As well, it's sometimes helpful to put Markdown cells in between code cells, to provide longer explanations than might be appropriate for a comment in the code itself. You can also make notes about things you might want to change later, or interpretations of the output of the code.

### Using the Copilot Assistant

As noted above, we encourage you to try writing the code for each step of the instructions above. Coding is a *procedural* skill that you only learn by doing, and the more you learn the better you will be at solving the bugs that Copilot-generated code will inevitably have. But, if you get stuck, or encounter an instruction that you haven't yet learned how to perform in Python, then by all means use Copilot. Just like writing your own code, it is best if you use a separate notebook cell for each Copilot prompt. That way, if you need to edit the code, you can do so without having to re-run all of the code that Copilot generated. It's also easier to debug because you can see the output of each cell and know exactly which line generated the error.

As you will see, it takes some trial and error – and critical thinking – to generate prompts that produce the code you want. But, it's a good way to learn how to use Copilot effectively. And if you do it right, you will learn a lot about Python along the way.

<div class="alert alert-info">
Copilot does not generate the same code every time you type the same prompt. This is the nature of the type of generative AI that Copilot uses. So, when working through this lesson, you may get different outputs than what is shown in the lesson. Try to use the processes explained here for interpreting error messages and Copilot's descriptions and suggested code. It's much more important to get familiar with engaging in the *process* of debugging, through the thought processes explained here, than to worry about getting different suggestions from Copilot.
</div>

---
## Using Copilot to Perform the Tasks Above

The cells below all reflect prompts written on the basis of the instructions above, and code generated entirely by Copilot. While your experience may be different, in writing this lesson we only had to type the first 2-line prompt, and Copilot generated not only the code but the other prompts/comments. 

However, you have to get used to the flow of working with Copilot. After typing the first two-line prompt below, if you only hit `Enter` once, you probably won't get any code generated. Copilot encourages a good coding style, and so it will only generate code if you hit `Enter` twice, so that there's an empty line between your prompt and the code. And in fact, after I typed the first two lines of the prompt and hit `Enter`, Copilot added a third suggested line to the prompt after I hit `Enter` the first time, which was to `write the dataframe to a csv file`. This is pretty cool, since that was also part of the instructions above! Is Copilot a mind-reader? No, it's just a very good predictor of what you might want to do next, based on the prompt you've written. It's a pretty common task in data science to read in a set of individual data files, concatenate them into a single DataFrame, and then save that DataFrame to a file for later use. So, Copilot is just doing what it's trained to do, which is to predict what you might want to do next, based on the prompt you've written.

Anyway, after I accepted the third line of the prompt below, I had to hit `Enter` twice to get the first line of code (`import pandas as pd`) generated.

In [2]:
# read in three files from the data folder, whose names start with "s" and end in "csv"
# concatenate them into one dataframe

import pandas as pd
import glob

# use glob to get all the files that start with "s" and end with "csv"
# glob returns a list of file names
filenames = glob.glob("data/s*.csv")
print(filenames)

# read in the files and concatenate them into one dataframe
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))
df = pd.concat(dataframes, ignore_index=True)
df

# write the dataframe to a csv file
df.to_csv("data/concatenated.csv", index=False)

['data/s2.csv', 'data/s3.csv', 'data/s1.csv']


The above appears to have worked fine. Next we'll do some checks to confirm that, as described in the instructions.

<div class="alert alert-info">
<b>Learning Point</b><p>
The code above uses a <code>for</code> loop to iterate over the files and read them in. We could do the same thing in one line of code, using a <em>list comprehension</em>:</p>

```python
df = pd.concat([pd.read_csv(f) for f in filenames], ignore_index=True)
```

This demonstrates that Copilot can generate working code, but not always the most concise code.
</div>
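
One detail worth noticing in the printed list of file names above is that `glob` returned the files in an arbitrary order (`s2`, `s3`, `s1`). A minimal sketch of one way to make the order predictable is to wrap the call in `sorted()`. To keep the example self-contained, it writes three tiny stand-in files to a temporary folder; the folder, the file contents, and the RT values are made up for illustration and are not the lesson's data.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with three small stand-in files (hypothetical values)
tmp_dir = tempfile.mkdtemp()
for pid in ["s2", "s3", "s1"]:
    demo = pd.DataFrame({"participantID": [pid] * 2, "trial": [1, 2], "RT": [0.40, 0.42]})
    demo.to_csv(os.path.join(tmp_dir, f"{pid}.csv"), index=False)

# sorted() puts the file names in alphabetical order on every run
filenames = sorted(glob.glob(os.path.join(tmp_dir, "s*.csv")))

# Read and concatenate in one step using a list comprehension
demo_df = pd.concat([pd.read_csv(f) for f in filenames], ignore_index=True)
print(demo_df["participantID"].unique())
```

Sorting does not change any of the statistics computed later, but it does mean that the rows of the combined DataFrame appear in the same participant order each time the notebook is run.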

For the moment, we'll stick with the code Copilot generated, and check that it is accurate. We'll start by checking the number of columns in the DataFrame:

In [6]:
# print the first 5 rows of the dataframe
df.head()

| | participantID | trial | RT |
|---|---|---|---|
| 0 | s2 | 1 | 0.433094 |
| 1 | s2 | 2 | 0.392526 |
| 2 | s2 | 3 | 0.396831 |
| 3 | s2 | 4 | 0.417988 |
| 4 | s2 | 5 | 0.37181 |


We see that the DataFrame has 3 columns, as expected. Next, we'll check that we get the expected number of rows and columns:

In [9]:
# Check that the dataframe has 30 rows and 3 columns
df.shape

(30, 3)

We also were instructed to raise an error message if the number of rows or columns is incorrect:

In [5]:
# raise an error if the dataframe does not have 30 rows and 3 columns
assert df.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"

### Test Your Test
The `assert` statement above generates no output. This is a Good Thing, because it means that the number of rows and columns is correct. However, it's a bit dangerous to assume no news is good news, because no news could also mean your code is not working. So, we should test that the error message is generated if we change the expected number of rows or columns.

We don't want to actually remove any rows from the DataFrame, but we can use slicing to create a view of the DataFrame that has fewer rows:

In [5]:
# create a slice of df that contains 29 rows, 
# then raise an error if the dataframe does not have 30 rows and 3 columns
df_slice = df.iloc[0:29]
assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"

AssertionError: The dataframe does not have 30 rows and 3 columns

Now we get an `AssertionError`, which is a Good Thing because it confirms that our error-checking code is working. We can do the same thing to test the error-checking code for the number of columns:

In [6]:
# create a slice of df that contains 2 columns, 
# then raise an error if the dataframe does not have 30 rows and 3 columns
df_slice = df.iloc[:, 0:2]
assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"


AssertionError: The dataframe does not have 30 rows and 3 columns

So yes, our code will throw errors if the number of rows or columns is incorrect. 
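
The instructions asked for a printed error message, whereas `assert` stops execution with an `AssertionError`. As an alternative sketch, the check could be wrapped in a small function that prints a message and reports whether the shape was correct. The function name `check_shape` and the stand-in DataFrame are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd


def check_shape(data, n_rows=30, n_cols=3):
    """Print an error message if data does not have the expected shape."""
    rows, cols = data.shape
    ok = True
    if rows != n_rows:
        print(f"Error: expected {n_rows} rows but found {rows}")
        ok = False
    if cols != n_cols:
        print(f"Error: expected {n_cols} columns but found {cols}")
        ok = False
    return ok


# Stand-in DataFrame with the same shape as the lesson's data (values are made up)
demo_df = pd.DataFrame({
    "participantID": ["s1"] * 10 + ["s2"] * 10 + ["s3"] * 10,
    "trial": list(range(1, 11)) * 3,
    "RT": [0.4] * 30,
})

print(check_shape(demo_df))               # correct shape
print(check_shape(demo_df.iloc[0:29]))    # one row short
print(check_shape(demo_df.iloc[:, 0:2]))  # one column short
```

Unlike the `assert` version, this one lets the rest of the notebook keep running and says which of the two dimensions was wrong.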

### Calculating the Mean RT for Each Participant... and Our First Bug
Our next instruction is to calculate the mean RT for each participant. Let's prompt Copilot to do that:

In [10]:
# calculate mean rt for each subject
df.groupby('subject').mean()

KeyError: 'subject'

---
## Debugging Copilot-Generated Code

Typically, when you get a long, scary error message like the one above, you can ignore a lot of what is in the middle. The most important parts are the last line, which tells you what the error is, and the first lines, which usually indicate what line in the code you tried to run caused the error. What's in between is the *stack trace*, which is a list of all the functions that were called in the process of trying to run the code. But most of the time, the error is a result of the code you wrote (the first lines), not the code in the underlying Python functions that your code called (the middle lines).

In this case, we see a `KeyError: 'subject'` at the bottom of the error message. Recall that Python dictionaries are sets of **key**-value pairs. A `KeyError` means that the code asked for a key that does not exist.

You can think of a pandas DataFrame as a dictionary in which the column names are keys, and the values in that column are its values. This is a common way that pandas functions refer to column names and their values. So the error message above indicates that the code tried to access a column called `subject` in the DataFrame, but that column doesn't exist. We can see at the top of the error message that the code that caused the error is: `---> 2 df.groupby('subject').mean()`. This means it's line 2 of the cell above that caused the error. 
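
To make the dictionary analogy concrete, here is a small sketch showing that a plain dictionary and a DataFrame raise the same kind of error when asked for a name they don't contain. The data values and variable names are made up for illustration.

```python
import pandas as pd

# A plain dictionary: asking for a missing key raises KeyError
rt_by_column = {"trial": [1, 2], "RT": [0.43, 0.39]}
try:
    rt_by_column["subject"]
except KeyError as err:
    dict_error = err
    print("dictionary KeyError:", err)

# A DataFrame behaves the same way when a column name is missing
demo_df = pd.DataFrame({"participantID": ["s1", "s1"], "trial": [1, 2], "RT": [0.43, 0.39]})
try:
    demo_df.groupby("subject").mean()
except KeyError as err:
    frame_error = err
    print("DataFrame KeyError:", err)
```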

As a novice coder, however, you might not know or remember that pandas DataFrames represent column names and data as dictionary key-value pairs. If you don't understand what an error means, you can use Copilot to help you!

### Using Copilot to Explain Errors and Suggest Fixes

Somewhat counterintuitively, even though Copilot can generate code with bugs, it can also provide feedback on errors in the code it generated. 

Copilot has a few features that can help you diagnose and fix your code. If you right-click in a code cell, you will see `Copilot` within the menu that pops up. Under that heading there are a number of options, including `Explain This` and `Fix This`. Some work better than others, but we will look at each of them in turn.


### Copilot `Explain This`

A great place to start in diagnosing errors is `Explain This`. Intuitively, you might go straight for `Fix This`. However, it's probably instructive to see what Copilot thinks your code is doing, and if that is what *you* think your code is doing. Copilot's explanation might give you insight into the error. Below is the output from selecting `Explain This` for the line of code that caused the error above:

> This code is designed to calculate the mean reaction time (RT) for each subject in a dataframe. The code uses the groupby() method to group the data by the 'subject' column and then calculates the mean for each group using the mean() method.
>
> The resulting dataframe contains the mean RT for each subject. The index of the dataframe is the unique values in the 'subject' column, and the columns are the other numerical columns in the original dataframe.
>
> This code is useful when working with data that has multiple subjects and you want to calculate summary statistics for each subject. By using the groupby() method, you can easily calculate summary statistics for each group without having to manually subset the data.
>
> One possible way to improve this code would be to specify the column to calculate the mean for, rather than calculating the mean for all numerical columns. This would make the code more specific and reduce the risk of errors. Additionally, the code could be modified to calculate other summary statistics, such as the standard deviation or median, by using the appropriate method instead of the mean() method.

First off, this shows how Copilot can be a great learning tool. It's giving us a clear explanation of what the code is doing, and teaching us a bit about how the `.groupby()` method works. It also suggests some improvements to the code, which are actually good suggestions.

None of this explanation tells us directly why the code generates an error, but as a novice coder this probably provided you with greater understanding of what the code is actually doing. This will be helpful as we try to figure out what is going wrong. 

The first paragraph of Copilot's explanation *does* provide a subtle clue to us about the error, though. It notes that the code is grouping the data by the `subject` column. The error message also referenced `subject`. 

### Copilot `Fix This`

If you select `Fix This` from the Copilot menu, you will see what's called a **diff**, as shown below. A diff is a way of showing the difference between two pieces of code. The code on the left (with a pink background) is the code you wrote, and the code on the right (with a green background) is the code that Copilot suggests. In this case, the suggested "fix" would not fix the problem. 

![](images/copilot_fix_this.png)


The suggested code is doing two things differently from the original code. Firstly, it's suggesting that you assign the output of the command to a variable, and then print the variable. Assigning output to a variable and then printing it is in general good coding style, because we may want to use that variable again later in our code. But that won't fix the error. 

The other difference – which is subtle to a novice coder – is that the new code is referencing the column `Subject` rather than `subject`. Recall that Python is **case-sensitive**, meaning that `Subject` and `subject` are not the same thing. Copilot seems to be guessing that the error might be caused by a common mistake, which is to refer to a variable using the wrong case.

In this particular situation, however, the error is not caused by a case-sensitivity error. The error is caused by the fact that the column `subject` doesn't exist in the DataFrame.

So, you can click the `Discard` button to reject Copilot's suggestion, and the `Escape` key to exit the diff view.


### ![](images/chat_icon.png)  Copilot `Chat` 

Another really useful option is `Chat`, which you can open by clicking the `Chat` icon (shown above) in the Activity Bar (left side of VS Code). This allows you to have a conversation with Copilot about what you want to do. If you've used [ChatGPT](https://chat.openai.com/) before, this will be familiar to you as it's essentially the same type of natural-language interface.

There are different approaches to using Copilot Chat to help fix an error. One is to click in the cell that generates the error, then click in the Copilot Chat input box, and type:

```
the selected cell generates an error. how can i fix this error?
``` 

In response to this prompt for the present error, I got the following response:

> If you are getting an error with the selected cell, there could be a few reasons why. Here are some possible solutions:
>
>1. Check that the 'subject' column is present in the dataframe. You can do this by printing out the dataframe using the `print()` function and checking that the 'subject' column is present.
>
>2. Check that the 'subject' column is spelled correctly. If the column is misspelled, you will get a "KeyError" when trying to group by that column.
>
>3. Check that the 'subject' column is of the correct data type. If the 'subject' column is not a string or categorical data type, you may need to convert it using the `astype()` method.
>
>4. Check that the dataframe contains numerical columns. If the dataframe only contains non-numerical columns, you will get a "DataError" when trying to calculate the mean.
>
>5. Check that the dataframe does not contain missing values. If the dataframe contains missing values, you may need to remove them using the `dropna()` method before calculating the mean.
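
The first two suggestions in Copilot's list can each be checked with a line or two of code. The sketch below assumes a stand-in DataFrame with the same column names as the lesson's data (the values are made up). The variable `wanted` is hypothetical and holds the name that the failing code asked for.

```python
import pandas as pd

# Stand-in DataFrame with the lesson's column names (values are made up)
demo_df = pd.DataFrame({"participantID": ["s1", "s2"], "trial": [1, 1], "RT": [0.43, 0.39]})

wanted = "subject"

# Suggestion 1: is the column present at all?
is_present = wanted in demo_df.columns
print(f"'{wanted}' present: {is_present}")

# Suggestion 2: is it a spelling or capitalization problem?
# Compare names after converting both sides to lowercase.
close_matches = [c for c in demo_df.columns if c.lower() == wanted.lower()]
print("case-insensitive matches:", close_matches)

# Listing the actual names shows what the code should refer to
print(list(demo_df.columns))
```

An empty list of case-insensitive matches suggests the problem is not just capitalization: the column has a different name altogether.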

## Finding the Error with Copilot's Help

These are actually really good responses, and they are provided in a logical order to go through in debugging. We could work through them in order, but — spoiler alert — the first suggestion actually leads to a solution. 

### Check that the 'subject' Column is Present in the DataFrame
The first thing to do is to check that the column exists, and is spelled correctly. So let's start with that. Do you remember how to check the names of the columns in a pandas DataFrame? If not, you can write a Copilot prompt to help you out. In the cell below, write a prompt that will print the names of the columns in the DataFrame. Then run the cell, and see what happens.

In [1]:
# print the column names of the dataframe
df.columns

Index(['participantID', 'trial', 'RT'], dtype='object')

Another option is to look at the first few rows of the DataFrame, which includes the column names:

In [5]:
# print the first few rows of the dataframe
df.head()

| | participantID | trial | RT |
|---|---|---|---|
| 0 | s2 | 1 | 0.433094 |
| 1 | s2 | 2 | 0.392526 |
| 2 | s2 | 3 | 0.396831 |
| 3 | s2 | 4 | 0.417988 |
| 4 | s2 | 5 | 0.37181 |


A third option, when using Jupyter notebooks with VS Code, is to click on the `Variables` button in the toolbar at the top of the notebook window. This will pop up a variable viewer in a sub-window below your notebook. You can click on the variable names to see their values. For DataFrames, it actually shows a list of the columns in the window, and you can double-click on the variable name to see the contents of the DataFrame in another window, the **Data Viewer**. This view is similar to a spreadsheet. In fact, you can directly edit values in the Data Viewer. *You should never directly edit values like this*, however. Any steps you do manually are not documented in your code, and are not reproducible.

The screenshot below shows the variables and Data Viewer for the current context.

![](images/inspectors.png)

### The Solution

Using any of the three approaches above, when we look at the column names, we see that they are `participantID`, `trial`, and `RT`. The code that generated the error was trying to access a column called `subject`, which doesn't exist. It should be `participantID`. So we need to change the code to access the correct column name. Or, engineer our prompt to do so:

In [9]:
# calculate mean rt for each participantID
df.groupby('participantID').mean()


| participantID | trial | RT |
|---|---|---|
| s1 | 5.5 | 0.389548 |
| s2 | 5.5 | 0.444785 |
| s3 | 5.5 | 0.446009 |


This looks good. However, the code is providing means for both numeric columns in the DataFrame (`trial` and `RT`), not just for `RT`. (Sometimes the same prompt does select only `RT`, but here we'll explore what to do when it doesn't.) We can add to our prompt to tell it not to include `trial` in the output:

In [None]:
# calculate mean rt for each subject. Do not show the mean for trial 
df.groupby('participantID').mean().drop('trial', axis=1)

The above generated code does what we want. However, from the perspectives of coding style and efficiency, it's not optimal. Python executes this **chained command** from left to right. So, it first computes the mean for each column in the DataFrame, and then drops the column `trial`.

It seems unnecessary to compute the mean for `trial` and then drop it. This isn't really Copilot's fault — we did explicitly tell it not to show the mean for trial, but it's not smart enough to know that we don't want to compute it in the first place; it seems to have interpreted our prompt as a literal sequence of commands.

We can modify the prompt in a way that generates more efficient code, by being specific about the column that we do want, rather than what we don't want:

In [10]:
# calculate mean for each participantID using the RT column
df.groupby('participantID')['RT'].mean()

participantID
s1    0.389548
s2    0.444785
s3    0.446009
Name: RT, dtype: float64

By way of showing how sensitive Copilot is to the structure of your prompt, a slightly different (and arguably more logical) phrasing of the prompt above generates the less-efficient code:

In [8]:
# calculate mean RT for each participantID
df.groupby('participantID').mean().drop('trial', axis=1)


| participantID | RT |
|---|---|
| s1 | 0.389548 |
| s2 | 0.444785 |
| s3 | 0.446009 |


<div class="alert alert-info">
One thing you may notice is that the result of the last command above is nicely formatted when it is displayed, whereas the one before it is in a more "raw" format. This is not really important here, but it's worth understanding why the difference occurs. When the last line of a cell evaluates to a pandas DataFrame, Jupyter displays it as a nicely formatted table. However, when it evaluates to a pandas Series (which is a single column), it is displayed in a more detailed but less "pretty" way.

In the output immediately above, the code created a DataFrame with two columns (`trial` and `RT`) and then dropped the `trial` column, but as such it remained a DataFrame and so was nicely formatted.

In contrast, the output of using the `mean()` method on a single column (`RT`) in the cell above that is a Series.  
<p><p>
We'll worry about the formatting later, but it's good to understand why it happens because the distinction between DataFrames and Series often causes confusion and errors if it's not understood.
</div>
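
The difference between the two kinds of result can be seen directly by checking the type of each one. In this sketch the DataFrame is a small stand-in with made-up values. Selecting the column with double square brackets (`[["RT"]]`) is one way to keep the result as a DataFrame.

```python
import pandas as pd

# Stand-in data with the lesson's column names (values are made up)
demo_df = pd.DataFrame({
    "participantID": ["s1", "s1", "s2", "s2"],
    "trial": [1, 2, 1, 2],
    "RT": [0.40, 0.42, 0.44, 0.46],
})

# Single brackets select one column, so the result is a Series
as_series = demo_df.groupby("participantID")["RT"].mean()
print(type(as_series))

# Double brackets select a list of columns, so the result stays a DataFrame
as_frame = demo_df.groupby("participantID")[["RT"]].mean()
print(type(as_frame))

# The numbers are identical either way; only the container differs
print(as_series.equals(as_frame["RT"]))
```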

---
## Continue Coding

Having fixed that error, we can now move on with the steps in the instructions. The next step was to calculate the mean RT across subjects. Do that below:

In [11]:
# calculate the mean rt across all subjects
df['rt'].mean()

KeyError: 'rt'

As you see above, when I tried this, I got another `KeyError`, this time for `rt`. This is because the code is trying to access a column called `rt`, but the column name is actually `RT`. Again, let's start by trying to fix this ourselves, and then use Copilot if we get stuck.

The code above selects the column in `df` called `rt` and computes its mean. However, if you look back to the listing of column names that we generated earlier, you'll see that the column name is actually `RT`. Remember that Python is **case sensitive**, meaning that capitalization matters: `RT` and `rt` are not the same. So we need to change the code to access the correct column name:

In [12]:
df['RT'].mean()

0.4267806816333334

Having fixed that error, let's move on to the next step, which is to calculate the 95% confidence intervals for the mean RT for each participant:

In [13]:
# calculate 95% confidence intervals for each subject
df.groupby('subject').mean().apply(lambda x: x.sem() * 1.96, axis=1)

KeyError: 'subject'

Again we get `KeyError: 'subject'`, which is the same as the first error above. So again, Copilot is labelling the column incorrectly. We know that we can fix this by replacing `subject` with `participantID`, but it's a bit annoying that Copilot keeps making this error. On the other hand, many errors are actually human errors. We are prompting Copilot to do something for each "subject", which then requires Copilot to know that by "subject" we really mean the column called `participantID`. Maybe we could engineer our prompt to be better instead, by referring to the column name directly. Let's try that:

In [14]:
# calculate 95% confidence intervals for each participantID
df.groupby('participantID').mean().apply(lambda x: x.sem() * 1.96, axis=1)


participantID
s1    5.008243
s2    4.954111
s3    4.952911
dtype: float64

## Pink Elephants

This is a frightening example of a Copilot **hallucination**. The above code generates a table of numbers, one for each participantID. You might think this is a plausible result. However, there are a couple of problems. Firstly, confidence intervals (CIs) are typically reported as a pair of values. The 95% CIs reflect the *range* within which it is 95% likely that the true mean lies, so we need both an upper *and* a lower limit of the CI. These are typically the same distance above and below the mean (e.g., if the mean is 2, the 95% CIs might be 1.5 and 2.5). So given a conceptual understanding of what CIs are, we recognize that we need to modify the code to generate the upper and lower values of the CIs.

Secondly, if you look back at the mean RT for each participant, they were in the range of .390 – .446 s. Again, confidence intervals reflect the range of possible true values for the mean, given the sample data. So for our confidence intervals to have values around 5 s seems implausible, given the mean RTs are < 500 ms. We would expect the CIs to be perhaps 100-200 ms. So it seems like Copilot is calculating the CIs incorrectly.

This highlights the critical importance of understanding not only what your code is doing, but what you are asking it to do. You should always check the output of your code to make sure it makes sense, and also make sure you understand what you are supposed to be doing with the code.
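
It can help to work out where the implausible numbers came from. The sketch below reproduces the first participant's value by hand. It assumes, based on the output shown earlier in the lesson, that `groupby('participantID').mean()` returned a `trial` mean of 5.5 and an `RT` mean of 0.389548 for `s1`. With `axis=1`, `apply` passes each *row* of that table of means to the function, so `sem()` is computed across those two unrelated numbers rather than across the ten trials.

```python
import pandas as pd

# The row of per-participant means for s1, taken from the output shown earlier
s1_means = pd.Series({"trial": 5.5, "RT": 0.389548})

# This is what the generated code computed for that row:
# the standard error of two unrelated numbers, multiplied by 1.96
wrong_ci = s1_means.sem() * 1.96
print(round(wrong_ci, 6))
```

The result agrees with the value of roughly 5.008 reported for `s1` above, which confirms that the generated code mixed the trial numbers into the calculation and never looked at the variability of the individual RTs.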

## Copilot Chat to the Rescue

Since Copilot Chat was helpful before, let's try using it to help us figure out what is going on. If we click on the `Chat` icon in the Activity Bar, we can type in a question to Copilot. In this case, I asked Copilot the following:

> the highlighted code is supposed to compute the 95% confidence intervals of mean RT. However, the output values are incorrect. Can you please suggest code that correctly calculates the 95% confidence intervals, and provides both the upper and lower CIs in a table

Copilot provided the answer below:

>> To correctly calculate the 95% confidence intervals for the mean RT and provide both the upper and lower CIs in a table, you can modify the existing code as follows:

```python
import numpy as np

# calculate mean and standard error of the mean for each participant
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()

# calculate 95% confidence intervals for the mean RT
ci_rt = sem_rt * 1.96

# calculate upper and lower CIs
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt

# create a table of the upper and lower CIs
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})

# print the table of CIs
print(ci_table)
```

>> In this code, the `groupby()` method is used to group the data by the 'participantID' column, and the `mean()` and `sem()` methods are used to calculate the mean and standard error of the mean for the 'RT' column for each group. The resulting series contains the mean and SEM for each unique value in the 'participantID' column.
>> 
>> The `ci_rt` variable calculates the 95% confidence intervals for the mean RT using the formula `SEM * 1.96`, where `SEM` is the standard error of the mean.
>> 
>> The `upper_ci` and `lower_ci` variables calculate the upper and lower CIs by adding and subtracting the `ci_rt` value from the `mean_rt` value, respectively.
>> 
>> The `ci_table` variable creates a table of the upper and lower CIs using the `pd.DataFrame()` function.
>> 
>> Finally, the `print()` function is used to print the table of CIs.
>> 
>> This code should give you the desired output of a table containing the upper and lower CIs for the mean RT for each participant.

If you move your cursor over the code that Copilot generated, you will see a `Copy` button appear. Clicking this will copy the code to your clipboard, so you can paste it into your notebook. We'll run this code below:

In [15]:
import numpy as np

# calculate mean and standard error of the mean for each participant
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()

# calculate 95% confidence intervals for the mean RT
ci_rt = sem_rt * 1.96

# calculate upper and lower CIs
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt

# create a table of the upper and lower CIs
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})

# print the table of CIs
print(ci_table)

               Upper CI  Lower CI
participantID                    
s1             0.430626  0.348470
s2             0.496197  0.393373
s3             0.515907  0.376111


OK, this is great progress. We now have our output in a table with both upper and lower CIs, and we can see that the CIs are in the range we would expect. It wouldn't hurt to cross-check the formula that Copilot is using against a textbook or online resource.

### Cross-Check Copilot's Code with Other Sources

There's an old saying, "once bitten, twice shy". We've already seen that Copilot can generate code that is incorrect. So, it's a good idea to check the code that Copilot generates against other sources. In this case, we can look up the formula for calculating the 95% CI for the mean, and compare it to the code that Copilot generated.

If we do a Web search for "confidence interval formula", we will find that the results across all of the top hits indicate that the formula for calculating the 95% CI for the mean is:

```python
ci_upper = mean + 1.96 * (std / sqrt(n))
ci_lower = mean - 1.96 * (std / sqrt(n))
```

where `mean` is the mean of the data, `std` is the standard deviation of the data, and `n` is the number of observations.

However, Copilot's code multiplied 1.96 by `sem_rt`, which is the *standard error* of the mean (if you didn't know that, you could find out via a Web search or by asking Copilot Chat). If you remember (or look up) the formula for standard error, it is the standard deviation of the data divided by the square root of the number of observations, which is exactly the `std / sqrt(n)` term in the formula above. So, Copilot's code is correct: it computes the same quantity as the formula we found online, just using pandas' built-in `.sem()` method rather than writing out the division. Now we can be confident that the code is doing what we want it to do.
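In keeping with the advice to check results using code, here is a minimal sketch of how you might confirm that the two formulas agree. The reaction times are made-up values for illustration, not the lesson's data, and the variable names are our own:

```python
import numpy as np
import pandas as pd

# Hypothetical reaction times (in seconds), made up for this illustration
rt = pd.Series([0.41, 0.38, 0.45, 0.52, 0.36, 0.47])

# Standard error as pandas computes it
sem_pandas = rt.sem()

# Standard error written out: standard deviation divided by sqrt(n)
sem_manual = rt.std() / np.sqrt(rt.count())

# The two should agree, apart from floating-point noise
formulas_agree = bool(np.isclose(sem_pandas, sem_manual))
print(sem_pandas, sem_manual, formulas_agree)
```

Both `.std()` and `.sem()` in pandas use the sample standard deviation (dividing by n - 1) by default, which is why the two values match.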

### 95% CIs across all participants

Our next instruction is to compute 95% CIs across all participants. Given our troubles above, we may not have much confidence that it will work the first time. But let's start by copying the phrasing that Copilot Chat used when describing its answer to the previous question, adding "across all participantID" to the prompt:

In [30]:
# calculate the 95% confidence intervals for the mean RT across all participantID
# and provide both the upper and lower CIs in a table
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()
ci_rt = sem_rt * 1.96
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})
print(ci_table)

               Upper CI  Lower CI
participantID                    
s1             0.430626  0.348470
s2             0.496197  0.393373
s3             0.515907  0.376111


Unfortunately Copilot doesn't seem to have understood "across all participantID", and it gave us CIs for each participant separately. So, let's try again, but this time we'll be more explicit about what we want:

In [33]:
# calculate the mean RT across all participantID
# then compute the 95% confidence intervals for the mean RT across all participantID
# and provide both the upper and lower CIs in a table
mean_rt = df['RT'].mean()
sem_rt = df['RT'].sem()
ci_rt = sem_rt * 1.96
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci}, index=['RT'])
print(ci_table) 


    Upper CI  Lower CI
RT  0.459111   0.39445


This looks better — there's only one pair of CIs, for the whole group — and if we examine the code, we see that it's using the same formula as the (correct) code that Copilot generated for the previous step. 

The prompt above is an example of how **programmatic thinking** is required even when using an AI coding assistant. In order to get the desired result, we had to break down the problem into smaller chunks. In this case, we told it to compute the mean across all participantID, and *then* compute the 95% CIs.
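The difference between the two results comes down to whether `.groupby()` is applied before the calculation. The sketch below uses a small made-up DataFrame (`df_demo` is a hypothetical name, and the values are not the lesson's data) to show the two cases side by side:

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'participantID': ['s1', 's1', 's1', 's2', 's2', 's2'],
    'RT': [0.41, 0.38, 0.45, 0.52, 0.36, 0.47],
})

# With groupby: one mean per participant, returned as a pandas Series
per_participant = df_demo.groupby('participantID')['RT'].mean()

# Without groupby: a single mean over every row in the column
overall = df_demo['RT'].mean()

print(per_participant)
print(overall)
```

Note that the ungrouped version pools all trials together. That equals the average of the per-participant means only when every participant contributes the same number of trials.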

Another prompt that also works is below; note that the output is formatted differently, but the values are the same:

In [18]:
# calculate the upper and lower 95% confidence interval limits for RT across participantid
df['RT'].mean() + df['RT'].sem() * 1.96, df['RT'].mean() - df['RT'].sem() * 1.96

(0.45911120518780957, 0.3944501580788572)

This code illustrates a feature of Python that we haven't seen before: using a comma to separate two expressions on a single line. Python evaluates both expressions and packs the results into a `tuple`, which is an immutable, ordered collection of values. It's not a very common way of writing code, but as you can see in this example, it can be useful. A tuple is a good type to use when you have a pair of values that belong together, and the immutable nature of tuples means that you can't accidentally change one of the values. Furthermore, the ordering means you can access the values by their position in the tuple, which is useful in this case because the code above always puts the upper CI first and the lower CI second.
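As a brief sketch of how a tuple like this behaves, the example below builds one from two placeholder numbers (hypothetical values standing in for the CI limits), unpacks it by position, and shows what happens if you try to change one of its values:

```python
# Hypothetical values standing in for the upper and lower CI limits
ci = 0.459, 0.394          # the comma builds the tuple; parentheses are optional

print(type(ci))            # <class 'tuple'>

# Access by position, or unpack both values at once
upper = ci[0]
upper, lower = ci
print(upper, lower)

# Tuples are immutable, so assigning to an element raises a TypeError
error_message = None
try:
    ci[0] = 0.5
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

Unpacking (`upper, lower = ci`) is often more readable than indexing, because it gives each value a meaningful name.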

---

## Formatting the output nicely


The final instruction was, "Finally, show a table that includes all of the results: with columns for mean RT, lower 95% CI, and upper 95% CI, and with one row for each participant and the bottom row showing the mean and CIs across participants."

In [None]:
# show a table that has columns for mean RT, lower 95% CI, and upper 95% CI, 
# each rounded to 2 decimal places,
# with one row for each participantID, 
# and the final row for the mean across all participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
print(ci_table)


               Mean RT  Lower CI  Upper CI
participantID                             
s1                0.39    0.3508    0.4292
s2                0.44    0.3812    0.4988
s3                0.45    0.3716    0.5284


This is a decent first pass, but the CIs are shown with 4, not 2, decimal places (the mean and SEM were rounded, but the CIs computed from them were not), and the table is missing the mean across participants. These are two different problems, so let's try to solve them one at a time. We'll start with rounding. Our prompt above was pretty specific in asking Copilot to round each of the stated values to 2 decimal places, so maybe instead of prompt engineering we can ask Chat for help. Let's try this prompt in Copilot Chat: `modify the code in the selected cell so that the CIs are rounded to 2 decimal places`. The result is:


> To modify the code to round the CIs to 2 decimal places, you can change the `round()` method to round to 2 decimal places instead of the default 4 decimal places. Here's the modified code:

```python
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
print(ci_table)
```

> In this code, the `round()` method is used to round the mean RT and SEM values to 2 decimal places, and the `round()` method is also used to round the upper and lower CI values to 2 decimal places.
>
> The `upper_ci` and `lower_ci` variables calculate the upper and lower CIs by adding and subtracting the `ci_rt` value from the `mean_rt` value, respectively. The `round()` method is used to round the resulting values to 2 decimal places.
> 
> Finally, the `ci_table` variable creates a table of the mean RT, lower CI, and upper CI values using the `pd.DataFrame()` function, and the `print()` function is used to print the table.
> 
> This code should give you the desired output of a table containing the mean RT, lower CI, and upper CI values rounded to 2 decimal places.

And indeed, as you can see below, it works!

In [27]:
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
print(ci_table)

               Mean RT  Lower CI  Upper CI
participantID                             
s1                0.39      0.35      0.43
s2                0.44      0.38      0.50
s3                0.45      0.37      0.53
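One caveat is worth checking here. The cell above rounds the mean and SEM *before* computing the CIs, so rounding error carries through into the limits. Compare the row for `s3` with the unrounded limits printed earlier in this lesson (0.515907 and 0.376111), which would round to 0.52 and 0.38 rather than 0.53 and 0.37. A common way to avoid this is to calculate at full precision and round only once, when building the final table. The sketch below shows both approaches on made-up data (`df_demo` and its values are hypothetical, not the lesson's data):

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'participantID': ['s1'] * 4 + ['s2'] * 4,
    'RT': [0.412, 0.387, 0.455, 0.391, 0.502, 0.447, 0.468, 0.523],
})

grouped = df_demo.groupby('participantID')['RT']
mean_rt = grouped.mean()
sem_rt = grouped.sem()

# Approach 1: keep full precision, round once at the very end
round_late = pd.DataFrame({
    'Mean RT': mean_rt,
    'Lower CI': mean_rt - 1.96 * sem_rt,
    'Upper CI': mean_rt + 1.96 * sem_rt,
}).round(2)

# Approach 2: round the mean and SEM first, as in the cell above
mean_2dp = mean_rt.round(2)
sem_2dp = sem_rt.round(2)
round_early = pd.DataFrame({
    'Mean RT': mean_2dp,
    'Lower CI': mean_2dp - 1.96 * sem_2dp,
    'Upper CI': mean_2dp + 1.96 * sem_2dp,
}).round(2)

print(round_late)
print(round_early)
```

Whether the two tables differ depends on the data, but only the first is guaranteed to show the correctly rounded limits.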


Now let's address the problem of the missing mean across participants. Based on what we learned earlier in the course, this might not be surprising. While we can use `.groupby()` to split the data by participantID, or simply `.mean()` to get the average across participantIDs, there is no way of generating both split and averaged data using the same command. 

Instead, we could try creating the across-subject data as a separate table, and then merging that table with the by-participant table. In writing the prompt below I also tried a slightly different way of asking for rounding to 2 decimal places, which turns out to be effective:

In [28]:
# show a table with columns for mean RT, lower 95% CI, and upper 95% CI,
# averaged across all participant ID
# round all values in the table to 2 decimal places
mean_rt = df['RT'].mean().round(2)
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])
print(ci_table)


    Mean RT  Lower CI  Upper CI
RT     0.43      0.39      0.47


Now we want to merge the two tables. In the cell below, I only had to type the first one-line prompt, and Copilot generated all the subsequent code and prompts/comments. However, I had to hit `Enter` twice after the first prompt, and after some other lines (where you see blank lines below) to get all the code. I could tell when Copilot had finished because it started regenerating the same prompts and code it had already generated.

Unfortunately when we try to run the cell below we get an error:

In [29]:
# merge the two tables that are the output of each of the two code cells above

# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})

# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])

# merge the two tables
ci_table.append(ci_table2)

AttributeError: 'DataFrame' object has no attribute 'append'

The error message is that `'DataFrame' object has no attribute 'append'`. This means that the code is trying to append the across-subject table to the by-participant table, but that the `append()` method doesn't exist for DataFrames. 

If you do a Web search for "pandas dataframe append", you may find this error puzzling, because there are many descriptions of how to use this method, including in older versions of the official pandas documentation. However, what's not immediately obvious is that the `.append()` method in pandas was **deprecated** in [November, 2021](https://github.com/pandas-dev/pandas/pull/44539), and was then removed entirely in pandas 2.0. "Deprecated" in a programming language means that a feature/function/method/etc. has been marked as obsolete and is slated for removal, typically because it has been replaced by a newer feature, or because it was redundant with another way of doing the same thing. Since Copilot is trained on a vast amount of code, this will inevitably include **legacy code** that no longer works with current versions of a library.
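If you suspect that an error like this comes from a version mismatch, one quick check is to look at which version of pandas you have installed and whether the method exists in it. This is a minimal sketch; the variable names are our own:

```python
import pandas as pd

# Which version of pandas is installed?
print(pd.__version__)

# The first number in the version string is the major version
major_version = int(pd.__version__.split('.')[0])

# Does the DataFrame class still have an .append() method?
has_append = hasattr(pd.DataFrame, 'append')
print(has_append)
```

On pandas 2.0 or later, `has_append` should be `False`, which matches the `AttributeError` above.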

We'll ask Copilot Chat for help again, this time telling it what the error message is, and asking how to fix it:

`the selected cell generates an error: "'DataFrame' object has no attribute 'append'". how can i fix this error?`

The response is:

> The error message "'DataFrame' object has no attribute 'append'" suggests that the `append()` method is not available for the `DataFrame` object in the code.
> 
> To fix this error, you can use the `concat()` function instead of the `append()` method to concatenate the two dataframes vertically. Here's the modified code:

```python
# merge the two tables that are the output of each of the two code cells above

# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})

# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])

# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])

# print the merged table
print(ci_table)
```

> In this code, the `concat()` function is used to concatenate the two dataframes `ci_table` and `ci_table2` vertically. The resulting dataframe contains the rows of both dataframes stacked on top of each other.
> 
> Finally, the `print()` function is used to print the merged table.
> 
> This code should fix the error and give you the desired output of a merged table containing the mean RT, lower CI, and upper CI values for each participant, as well as the mean RT, lower CI, and upper CI values across all participants.

When we try this, it works. The only lingering issue is that the last row is labelled `RT`, which does not tell us the values represent the mean across all participants:

In [30]:
# merge the two tables that are the output of each of the two code cells above

# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})

# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])

# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])

# print the merged table
print(ci_table)

    Mean RT  Lower CI  Upper CI
s1     0.39      0.35      0.43
s2     0.44      0.38      0.50
s3     0.45      0.37      0.53
RT     0.43      0.39      0.47


We can ask Chat to do this for us: `modify the code in the selected cell so that the last row in the final dataframe is labelled "Average"`

> To modify the code to label the last row in the final dataframe as "Average", you can change the index label of the last row in the `ci_table2` dataframe to "Average" before concatenating the two dataframes. Here's the modified code:

```python
# merge the two tables that are the output of each of the two code cells above

# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})

# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['Average'])

# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])

# print the merged table
print(ci_table)
```

> In this code, the `index` parameter of the `pd.DataFrame()` function is used to set the index label of the last row in the `ci_table2` dataframe to "Average".
> 
> The `concat()` function is then used to concatenate the two dataframes `ci_table` and `ci_table2` vertically. The resulting dataframe contains the rows of both dataframes stacked on top of each other.
> 
> Finally, the `print()` function is used to print the merged table.
> 
> This code should give you the desired output of a merged table containing the mean RT, lower CI, and upper CI values for each participant, as well as the mean RT, lower CI, and upper CI values across all participants, with the last row labelled as "Average".

In [32]:
# merge the two tables that are the output of each of the two code cells above

# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})

# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)

# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['Average'])

# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])

# print the merged table
print(ci_table)

         Mean RT  Lower CI  Upper CI
s1          0.39      0.35      0.43
s2          0.44      0.38      0.50
s3          0.45      0.37      0.53
Average     0.43      0.39      0.47
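The final cell works, but it writes out the same calculation twice: once for each participant and once for the whole data set. As a sketch of one way to avoid that repetition, the example below wraps the calculation in a small helper function. The names `ci_summary` and `df_demo` are hypothetical, and the data are made up for illustration. This version also rounds only once, at the end, so on real data its values could differ slightly from a table built by rounding the mean and SEM first.

```python
import pandas as pd

def ci_summary(rt, z=1.96):
    """Return the mean and 95% CI limits for a Series of RTs (full precision)."""
    mean = rt.mean()
    margin = z * rt.sem()
    return pd.Series({
        'Mean RT': mean,
        'Lower CI': mean - margin,
        'Upper CI': mean + margin,
    })

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'participantID': ['s1'] * 4 + ['s2'] * 4,
    'RT': [0.412, 0.387, 0.455, 0.391, 0.502, 0.447, 0.468, 0.523],
})

# One summary per participant, then one for all rows together
rows = {pid: ci_summary(rt) for pid, rt in df_demo.groupby('participantID')['RT']}
rows['Average'] = ci_summary(df_demo['RT'])

# Each dictionary entry becomes a column, so transpose to get one row per entry
summary = pd.DataFrame(rows).T.round(2)
print(summary)
```

Because the formula lives in one place, a change (for example, a different multiplier than 1.96) only needs to be made once.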


## Conclusion

We've now completed the task of calculating the mean RT and 95% CIs for each participant, and for all participants. We've also learned a lot about how to use Copilot to help us write code, and how to use Copilot Chat to help us understand errors and fix them.

As you've seen, the process of using AI to generate code isn't always smooth, nor are the results always accurate. It's critical to assess the outputs you see carefully, and to understand what you are asking Copilot to do, as well as what the code it generates is doing. In this lesson, for the sake of expediency and length, we did not provide detailed explanations of what the code was doing every step of the way. However, you've seen that Copilot can be useful at explaining code for you. 

There were also many points in this lesson where, as an experienced coder, I knew what the problem was and how to fix it. However, I deliberately did not fix the problems myself, because I recognize that as a novice coder you may not know how to fix the problems. Hopefully this lesson has shown you how to use Copilot effectively, but also underscored the value of actually knowing Python and being able to read and debug code yourself.

The best approach to learning to use Python for data science is to write code yourself when you can, because this helps build and reinforce your knowledge (remember, coding is a procedural skill that you only learn by doing, and you lose the knowledge if you don't use it). But when you encounter a problem you've never solved before, or are getting errors in your code, Copilot can be a huge help to deepen your knowledge and solve problems. As we've seen, Copilot Chat can be much more useful in this regard than simply writing prompts in code cells of your notebook, because it gives you both code and detailed explanations of what that code is doing. 

### Copilot in Later Chapters of this Book

Throughout most of the rest of this book, we focus on teaching you to write code yourself, rather than using Copilot. We believe — and feel this lesson has demonstrated — that learning to write code yourself is the best way to learn to code. However, whether you're taking this course for credit and have assignments that Copilot may be useful in helping with, or simply using this book to learn how to apply Python to your own data, we recognize that Copilot can be a useful tool. We hope this chapter has given you a good introduction to how to use Copilot effectively, and that you will continue to use it on your coding journey.