# Data Aggregation
In this tutorial, I will walk through a very simple example of aggregating data into a CSV file in Python. The aim is to take a folder of text files and combine them into a single CSV file containing all of the responses. Have a think about how you would do this manually, as it will guide you in deciding the best way to do it in Python!

## Data Description
I ran a very simple experimental survey that asked people two questions:

```
Questions
1. Describe how you felt when you were interrupted.
2. Describe what (if any) strategies you used when you were interrupted?
```



## 1: Listing Files in Folders

_Note that this is why we don't use spaces in filenames._

Resources:  
http://www.google.com/search?q=how+to+open+all+files+in+folder+python  
https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python


In [3]:
# Method 1: os.listdir()
import os

data_dir = "data/"

for file in os.listdir(data_dir):
    if file.endswith(".txt"):
        # print(file)
        full_file_location = os.path.join(data_dir, file)
        print(full_file_location)

data/EXP2part3.txt
data/EXP2part4.txt
data/EXP2part1.txt
data/EXP2part2.txt
data/EXP2part6.txt
data/EXP2part5.txt


In [4]:
# Method 2: glob
import glob

for file in glob.glob("./data/*.txt"):
    print(file)

./data/EXP2part3.txt
./data/EXP2part4.txt
./data/EXP2part1.txt
./data/EXP2part2.txt
./data/EXP2part6.txt
./data/EXP2part5.txt
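
A third option, which the rest of this tutorial does not use, is the standard library's `pathlib` module. The sketch below is only an illustration: the helper name `list_text_files` and the throwaway demo folder are my own assumptions, so swap in your real `data` folder when you try it.

```python
from pathlib import Path
import tempfile


def list_text_files(data_dir):
    """Return the .txt files in data_dir as a sorted list of Path objects."""
    return sorted(Path(data_dir).glob("*.txt"))


# Demo on a throwaway folder so the sketch runs anywhere;
# in the tutorial you would call list_text_files("data") instead.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["EXP2part2.txt", "EXP2part1.txt", "notes.md"]:
        (Path(tmp) / name).write_text("placeholder", encoding="utf-8")

    found_names = [path.name for path in list_text_files(tmp)]
    print(found_names)
```

Wrapping the result in `sorted()` gives a predictable order, and `path.name` plays the same role as `os.path.basename()`.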


## 2: Reading Content of Files
At this point we have told Python to print the filenames (so we know it can find them), so now we need to tell it to do something with them.

http://www.google.com/search?q=how+to+get+filename+python  
https://stackoverflow.com/questions/678236/how-to-get-the-filename-without-the-extension-from-a-path-in-python


In [5]:
import glob
import os

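for file in glob.glob("./data/*.txt"):

    # Print the filename without the folder
    print(os.path.basename(file))

    # Open the file; the with block closes it again when we are done
    with open(file, 'r') as opened_file:
        # Read the file into a list of lines
        read_file = opened_file.readlines()

    # Each line still ends in "\n", so print() shows a blank line after it
    for line in read_file:
        print(line)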


EXP2part3.txt
1. Knew there would be collisions happening but couldn't do anything. Felt like I needed to prepare.

2. Moved the mouse toward the expected collisions. Thought of what key to press for the special hand offs.

EXP2part4.txt
1. Interruption is not necessary because It does not stops collision.

2. I just relaxed my eyes when interrupted by blank screen.

EXP2part1.txt
1. A little frustrated as I was concentrating on the simulation.

2. Went over again and again in my head what I needed to do once I returned to the simulation.

EXP2part2.txt
1. Not much different, I anticipated that there would be more tasks to do in the filled distraction though.

2. I made a mental note of where the conflict in the main page was and tried to keep repeating which key to press for the departing plane

EXP2part6.txt
1. Felt a pressure to remember the incomplete tasks from the primary stimulation.

2. Tried to remember the incomplete tasks by saying them in my head

EXP2part5.txt
1. Annoyed, 

## 3: Saving Contents of Files (Temporarily)
**Awesome**, so at this point we can read the contents of each file. We can't manipulate this data yet, though, because we only print it and then throw it away. So how can we tell Python we want to store these values and refer to them later?

In [10]:
import glob
import os

full_results_list = []

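for file in sorted(glob.glob("./data/*.txt")):
    # The with block closes the file for us once it has been read
    with open(file, 'r') as opened_file:
        read_file = opened_file.readlines()
    participant_id = os.path.basename(file)

    # Reset for each file so a missing answer is not filled in from the previous participant
    response_q1 = response_q2 = None
    for line in read_file:
        # startswith() avoids matching a "1." that appears later in a sentence
        if line.startswith("1."):
            response_q1 = line
        if line.startswith("2."):
            response_q2 = line
    results = [participant_id, response_q1, response_q2]
    full_results_list.append(results)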
        
print(full_results_list[0])

['EXP2part1.txt', '1. A little frustrated as I was concentrating on the simulation.\n', '2. Went over again and again in my head what I needed to do once I returned to the simulation.\n']


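However, we have a problem: this code doesn't scale very well. Imagine if we had 100 or 200 items; we don't want to keep manually adding an `if` for every question number from "1." to "59." and beyond. So what do we do? Instead of checking for each number, we can treat every line in the file as one response. Also, the order in which the files were listed earlier seemed a little random, which is why we wrap `glob.glob()` in `sorted()`!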


In [11]:
import glob
import os

full_results_list = []

for file in sorted(glob.glob("./data/*.txt")):
    with open(file, 'r') as opened_file:
        read_file = opened_file.readlines()
    results = [os.path.basename(file)]

    # Notice here it's treating each new line as a new result!
    # Obviously you need to know what your data looks like to know if this is safe!
    for line in read_file:
        results.append(line)
    full_results_list.append(results)

for item in full_results_list:
    print(item)
## We can also apply formatting at this point to inspect what's going on
#     print(item[0])
#     print(item[1])
#     print(item[2])


['EXP2part1.txt', '1. A little frustrated as I was concentrating on the simulation.\n', '2. Went over again and again in my head what I needed to do once I returned to the simulation.\n']
['EXP2part2.txt', '1. Not much different, I anticipated that there would be more tasks to do in the filled distraction though.\n', '2. I made a mental note of where the conflict in the main page was and tried to keep repeating which key to press for the departing plane\n']
['EXP2part3.txt', "1. Knew there would be collisions happening but couldn't do anything. Felt like I needed to prepare.\n", '2. Moved the mouse toward the expected collisions. Thought of what key to press for the special hand offs.\n']
['EXP2part4.txt', '1. Interruption is not necessary because It does not stops collision.\n', '2. I just relaxed my eyes when interrupted by blank screen.\n']
['EXP2part5.txt', '1. Annoyed, it distracted me from the game.\n', '2. Remembered where the trouble was so I could fix it straight away when the

## 4: Cleaning the Data
Before we save this data in its final state, we want to fix up some of the problems it still has. Firstly, do you notice that every response ends with a newline, shown as `\n`? Also, we still have the question numbers in the responses! Finally, notice the participant names are still the filenames! Let's fix that.

Let's get to it!  
Google "Python remove trailing newline"  
Google "Python remove first characters from string"  

In [12]:
import glob
import os

full_results_list = []

for file in sorted(glob.glob("./data/*.txt")):
    with open(file, 'r') as opened_file:
        read_file = opened_file.readlines()
    # os.path.basename() drops the folder, and os.path.splitext() splits off the ".txt" extension
    results = [os.path.splitext(os.path.basename(file))[0].strip()]

    for line in read_file:
        # The slice [2:] keeps everything from the third character onwards, dropping "1." or "2."
        # strip() then removes leading and trailing whitespace, including the "\n"
        line = line[2:].strip()
        results.append(line)

    full_results_list.append(results)
print(full_results_list)

[['EXP2part1', 'A little frustrated as I was concentrating on the simulation.', 'Went over again and again in my head what I needed to do once I returned to the simulation.'], ['EXP2part2', 'Not much different, I anticipated that there would be more tasks to do in the filled distraction though.', 'I made a mental note of where the conflict in the main page was and tried to keep repeating which key to press for the departing plane'], ['EXP2part3', "Knew there would be collisions happening but couldn't do anything. Felt like I needed to prepare.", 'Moved the mouse toward the expected collisions. Thought of what key to press for the special hand offs.'], ['EXP2part4', 'Interruption is not necessary because It does not stops collision.', 'I just relaxed my eyes when interrupted by blank screen.'], ['EXP2part5', 'Annoyed, it distracted me from the game.', 'Remembered where the trouble was so I could fix it straight away when the interruption ended.'], ['EXP2part6', 'Felt a pressure to remem
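
One caveat with `line[2:]` is that it assumes every question number is a single digit followed by a full stop. If a survey had ten or more questions, a regular expression would be a safer way to drop the prefix. The following is a minimal sketch of that idea; the helper name `clean_response` and the twelfth-question example are hypothetical and not part of this data set.

```python
import re


def clean_response(line):
    """Remove a leading question number such as '1.' or '12.' and surrounding whitespace."""
    # ^\s*   optional leading whitespace
    # \d+\.  one or more digits followed by a full stop
    # \s*    any spaces after the number
    return re.sub(r"^\s*\d+\.\s*", "", line).strip()


print(clean_response("1. A little frustrated as I was concentrating on the simulation.\n"))
print(clean_response("12. A hypothetical twelfth answer\n"))
```

Lines that do not start with a number are left alone apart from the whitespace stripping, so the helper is safe to apply to every line.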

Let's just do a few final checks of the data before we decide it's ready! First of all we will check each individual response stored in the list!

In [None]:
# Participant 1
# full_results_list[0]

# Participant 3 - Question 1
# full_results_list[2][1]

# Just to test the functionality
# print("Participant", full_results_list[0][0], "answered:", full_results_list[0][1])
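
Checking rows one at a time works for six participants, but a small automated check is handy for larger data sets. Here is a rough sketch, assuming every participant should have exactly one ID and two responses; the function name `find_incomplete_rows` and the incomplete sample row are made up for illustration.

```python
def find_incomplete_rows(results_list, expected_length=3):
    """Return the participant IDs whose row does not have the expected number of items."""
    return [row[0] for row in results_list if len(row) != expected_length]


sample_results = [
    ["EXP2part1",
     "A little frustrated as I was concentrating on the simulation.",
     "Went over again and again in my head what I needed to do once I returned to the simulation."],
    ["EXP2partX", "Only one answer"],  # hypothetical incomplete row
]

print(find_incomplete_rows(sample_results))
```

In the notebook you would pass `full_results_list` instead of `sample_results`; an empty list back means every row has the expected shape.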

## 5: Saving Permanently to a CSV File
We are now at the stage where our results are coming out cleanly, as we anticipated, so we are ready to write them to a final CSV file!

http://www.google.com/search?q=how+do+I+write+to+a+csv+in+python

In [13]:
import glob
import os
import csv

full_results_list = []

for file in sorted(glob.glob("./data/*.txt")):
    with open(file, 'r') as opened_file:
        read_file = opened_file.readlines()
    results = [os.path.splitext(os.path.basename(file))[0]]
    for line in read_file:
        line = line[2:].strip()
        results.append(line)
    full_results_list.append(results)

# newline='' is what the csv module docs recommend, so no blank rows appear on Windows
with open('final_results.csv', 'w', newline='') as csvfile:
    my_writer = csv.writer(csvfile)
    my_writer.writerows(full_results_list)
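
The file written above has no header row, so whoever opens it has to guess what each column means. A possible extension, sketched here under my own assumptions, is to write a header first and then read the file back to confirm it round-trips. The column names are hypothetical, and the sketch writes to a temporary folder with one sample row so it can run without the real data.

```python
import csv
import os
import tempfile

header = ["participant", "response_q1", "response_q2"]  # hypothetical column names
sample_results = [
    ["EXP2part1",
     "A little frustrated as I was concentrating on the simulation.",
     "Went over again and again in my head what I needed to do once I returned to the simulation."],
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "final_results.csv")

    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as csvfile:
        my_writer = csv.writer(csvfile)
        my_writer.writerow(header)           # one header row
        my_writer.writerows(sample_results)  # then one row per participant

    # Read the file back to confirm that what we wrote is what we get
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        rows_read = list(csv.reader(csvfile))

print(rows_read[0])
print(len(rows_read))
```

Commas inside a response are not a problem, because `csv.writer` quotes any field that contains one.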
        
    


# Fin