# 4. Working with Data in Python

![](images/logo.png)

Welcome to the third lab session!

This lab includes some exercises about the material covered in session 4 concerning files and pandas. 

> **NOTE**: If you are running this lab on your own machine (i.e. not on our JupyterHub server), you need to make sure pandas is installed. To check, try running the cell below:

In [1]:
import pandas as pd

If nothing much happens then you're good to go. However, if you get an `ImportError`, this means pandas has not been installed. To install pandas, open your command prompt and type 

```
pip install pandas
```

and hit enter. Once the install has finished, you should be good to go. Come back and retry running the cell above. 
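
If you would rather check without triggering an error, the sketch below is one possible approach. It only uses the standard library to ask whether a package called `pandas` can be found; the variable name `pandas_available` is just an illustrative choice.

```python
import importlib.util

# find_spec returns None when the package cannot be located.
pandas_available = importlib.util.find_spec('pandas') is not None

if pandas_available:
    print('pandas is installed')
else:
    print('pandas is not installed; run `pip install pandas` first')
```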

With that done, let's jump in!



# Exercise 1: Reading in `.txt` files

## a) `with` statements

If you switch to the Jupyter file explorer tab, you'll notice a folder named `dreams`. If you take a look inside, you'll see five files called `dream1.txt`, `dream2.txt`, ..., `dream5.txt`. These are short write-ups of dreams people have had, taken from the website [DreamBank](http://www.dreambank.net/), which has a collection of over 20,000 dreams recorded from many different people. 

Each file is made up of 3 lines. The first gives the name of the dreamer, the second gives a date, and the third has the dream content. 

Below is some code that reads in `dream1.txt`. 

In the cell below that, rewrite the code so that it does the exact same thing, but uses the `with` statement. 

In [2]:
text_file = open('dreams/dream1.txt')

contents = text_file.read()

print(contents)

text_file.close()

Name: Alta
Date: 19/01/1986
At work (?) was on the elevator with an older woman who wasn't sure what floor she wanted so she pushed extra buttons - I'd already gone up when I meant to go down to 2, and wasn't thrilled at having to stop so much. So I got off (she was surprised I took exception) and found myself outside, sort of, and now I have to get in a car that goes somewhere down some streets like in Oak town - the car was like an old '50s sort of thing with fins and they were coming off and needed to be stuck back down, took a while - arrived at a place that was an "elevator stop", nice old tree-lined street, the door didn't face the street - it was ivy-covered stone, very nice. Inside it was cave-like and open, but there was the elevator door. There was somebody there as well as 2 cats, and I made some comment about them having kittens (like it would be a good idea) and the guy said they were both males or something and I felt the same sort of cuteness about the exchange you usual

In [3]:
with open('dreams/dream1.txt') as text_file:
    
    contents = text_file.read()
    
    print(contents)

Name: Alta
Date: 19/01/1986
At work (?) was on the elevator with an older woman who wasn't sure what floor she wanted so she pushed extra buttons - I'd already gone up when I meant to go down to 2, and wasn't thrilled at having to stop so much. So I got off (she was surprised I took exception) and found myself outside, sort of, and now I have to get in a car that goes somewhere down some streets like in Oak town - the car was like an old '50s sort of thing with fins and they were coming off and needed to be stuck back down, took a while - arrived at a place that was an "elevator stop", nice old tree-lined street, the door didn't face the street - it was ivy-covered stone, very nice. Inside it was cave-like and open, but there was the elevator door. There was somebody there as well as 2 cats, and I made some comment about them having kittens (like it would be a good idea) and the guy said they were both males or something and I felt the same sort of cuteness about the exchange you usual
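
One reason to prefer `with` is that the file is closed for you as soon as the indented block ends. The sketch below illustrates this on a throwaway file (the name `demo.txt` and its contents are made up for the example, so it does not depend on the `dreams` folder).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file for this demo

    with open(demo_path, 'w') as demo_file:
        demo_file.write('Name: Example\n')

    with open(demo_path) as demo_file:
        first_line = demo_file.readline()
        closed_inside = demo_file.closed   # the file is still open inside the block

    closed_after = demo_file.closed        # the file is closed once the block ends

print(first_line.strip())
print(closed_inside, closed_after)
```

This should print `Name: Example` followed by `False True`, with no call to `.close()` needed.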

## b) `readlines`

Copy and paste your code from above. Replace `.read()` with `.readlines()`. This returns a list in which each element is one line of the file.

Remember, sometimes it may look like text is going over multiple lines when really it's on just one. 

Using list indexing, select the 3rd element from this list (remember, list indexing starts from 0) so that you extract only the text, not the name or date. Print this text. 

In [9]:
with open('dreams/dream1.txt') as text_file:
    
    contents = text_file.readlines()[2]
    
print(contents)

At work (?) was on the elevator with an older woman who wasn't sure what floor she wanted so she pushed extra buttons - I'd already gone up when I meant to go down to 2, and wasn't thrilled at having to stop so much. So I got off (she was surprised I took exception) and found myself outside, sort of, and now I have to get in a car that goes somewhere down some streets like in Oak town - the car was like an old '50s sort of thing with fins and they were coming off and needed to be stuck back down, took a while - arrived at a place that was an "elevator stop", nice old tree-lined street, the door didn't face the street - it was ivy-covered stone, very nice. Inside it was cave-like and open, but there was the elevator door. There was somebody there as well as 2 cats, and I made some comment about them having kittens (like it would be a good idea) and the guy said they were both males or something and I felt the same sort of cuteness about the exchange you usually do talking about kitties.
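
The point about long lines can be checked directly. The sketch below writes a made-up three-line file (not one of the dream files) whose last line is 1000 characters long, and shows that `.readlines()` still returns exactly three elements.

```python
import os
import tempfile

long_text = 'word ' * 200  # one long line containing no newline characters

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file for this demo

    with open(demo_path, 'w') as demo_file:
        demo_file.write('Name: Example\nDate: 01/01/2000\n' + long_text + '\n')

    with open(demo_path) as demo_file:
        lines = demo_file.readlines()

print(len(lines))     # 3
print(len(lines[2]))  # 1001: the 1000 characters of text plus the trailing '\n'
```

However the text is wrapped on screen, only a `\n` character starts a new element in the list.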

## c) Extracting the text from multiple files

Complete the code below to print the text from all five files. 

In [8]:
import os
os.listdir('dreams')

['dream3.txt', 'dream1.txt', 'dream2.txt', 'dream5.txt', 'dream4.txt']

In [10]:
import os

for file_name in os.listdir('dreams'):
     
    path = 'dreams/' + file_name
    
    with open(path) as text_file:
        
        print(text_file.readlines()[2])

A tall many-storied building stands near a road. A construction crew is looking for a structural problem. I go into the building to the basement and I see a twisted place in a girder, a long beam. I come out and tell the foreman the building is about to collapse. We start running away and hiding in the bushes. I am next to the road and realize I must stop traffic because the building is going to fall on the road. I run onto the road yelling at the pedestrians to stop! Some ignore me and break past me, walking into the dangerous area. I see the building and actually try to visualize it falling because it isn't and I've got all these people stopped and they are annoyed at me. It finally collapses far away from the road. A person glares at me for hindering them. I shrug. Better safe than sorry.
At work (?) was on the elevator with an older woman who wasn't sure what floor she wanted so she pushed extra buttons - I'd already gone up when I meant to go down to 2, and wasn't thrilled at havi
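
Notice that `os.listdir` returned the files in no particular order. If the order matters, one option is to wrap the call in `sorted()`. The sketch below shows the idea on a temporary folder with three made-up files standing in for the `dreams` folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the stand-in files in a deliberately jumbled order.
    for number in [3, 1, 2]:
        path = os.path.join(tmp_dir, 'dream' + str(number) + '.txt')
        with open(path, 'w') as text_file:
            text_file.write('Name: Example\nDate: 01/01/2000\nDream number ' + str(number) + '\n')

    texts = []
    for file_name in sorted(os.listdir(tmp_dir)):
        with open(os.path.join(tmp_dir, file_name)) as text_file:
            texts.append(text_file.readlines()[2].strip())

print(texts)  # ['Dream number 1', 'Dream number 2', 'Dream number 3']
```

Bear in mind that `sorted()` compares the names as text, so a file called `dream10.txt` would come before `dream2.txt`.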

# Exercise 2: Writing files

Below you will see a function being imported from a file called `random_sentence.py`. (This is another use of `import` - to bring in functions written in external files). Try running the function `generate_random_sentence()` a few times to see what it does. 

In [11]:
from random_sentence import generate_random_sentence

random_sentence = generate_random_sentence()

print(random_sentence)

The splendid pug really loved the industrious duck


## a) Writing many lines 1

By using the function `generate_random_sentence()`, complete the code below to write 100 random sentences to a file called `random_sentences1.txt`, using the `.write()` method. Check that this has worked by opening the file from the Jupyter file explorer.

Remember to use the `\n` character to signify a new line. 

In [58]:
n_sentences = 100

with open('random_sentences1.txt', 'w') as my_file:
    
    for i in range(n_sentences):
        
        my_file.write(generate_random_sentence() + '\n')

## b) Writing many lines 2

Do the same thing again, but this time use a list of sentences with the `.writelines()` method. 


In [60]:
n_sentences = 100

lines = []

for i in range(n_sentences):
    
    lines.append(generate_random_sentence() + '\n')
    
    
with open('random_sentences2.txt', 'w') as my_file:
    
    my_file.writelines(lines)
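
Despite its name, `.writelines()` does not add line breaks for you, which is why `'\n'` is appended to each sentence above. The sketch below compares the two cases using three made-up sentences and a temporary file.

```python
import os
import tempfile

sentences = ['first sentence', 'second sentence', 'third sentence']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file for this demo

    # Without '\n', everything ends up on a single line.
    with open(path, 'w') as my_file:
        my_file.writelines(sentences)
    with open(path) as my_file:
        without_newlines = my_file.readlines()

    # Adding '\n' to each sentence gives one line per sentence.
    with open(path, 'w') as my_file:
        my_file.writelines([sentence + '\n' for sentence in sentences])
    with open(path) as my_file:
        with_newlines = my_file.readlines()

print(len(without_newlines), len(with_newlines))  # 1 3
```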

# Exercise 3: Pandas

## a) Creating a DataFrame from a list of lists

Use the following data to create a pandas DataFrame. Store it in a variable called `df`. 

In [12]:
import pandas as pd


column_names = ['name', 'calories', 'protein', 'fat', 'sodium', 'fiber']

data = [['100% Bran', 70, 4, 1, 130, 10],
        ['100% Natural Bran', 120, 3, 5, 15, 2],
        ['All-Bran', 70,  4, 1, 260, 9],
        ['All-Bran with Extra Fiber', 50,  4, 0, 140, 14],
        ['Almond Delight', 110, 2, 2, 200, 1]]

df = pd.DataFrame(data, columns=column_names)
df


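|   | name | calories | protein | fat | sodium | fiber |
|---|---|---|---|---|---|---|
| 0 | 100% Bran | 70 | 4 | 1 | 130 | 10 |
| 1 | 100% Natural Bran | 120 | 3 | 5 | 15 | 2 |
| 2 | All-Bran | 70 | 4 | 1 | 260 | 9 |
| 3 | All-Bran with Extra Fiber | 50 | 4 | 0 | 140 | 14 |
| 4 | Almond Delight | 110 | 2 | 2 | 200 | 1 |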


## b) Column information

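Use the method `df.mean()` to find the mean of each numeric column. Do the same for `.min()` and `.max()`. Because the `name` column holds strings, recent versions of pandas (2.0 and later) raise a `TypeError` on a plain `df.mean()`, so pass `numeric_only=True`. 


In [13]:
df.mean(numeric_only=True)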

calories     84.0
protein       3.4
fat           1.8
sodium      149.0
fiber         7.2
dtype: float64

In [14]:
df.min()

name        100% Bran
calories           50
protein             2
fat                 0
sodium             15
fiber               1
dtype: object

In [15]:
df.max()

name        Almond Delight
calories               120
protein                  4
fat                      5
sodium                 260
fiber                   14
dtype: object
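
The `name` entries in the `.min()` and `.max()` results come from comparing the strings alphabetically, where digits sort before letters. The sketch below illustrates this with three of the cereal names from part a); the variable name `names` is just for this example.

```python
import pandas as pd

names = pd.Series(['Almond Delight', '100% Bran', 'All-Bran'])

print(names.min())    # 100% Bran
print(names.max())    # Almond Delight
print(sorted(names))  # ['100% Bran', 'All-Bran', 'Almond Delight']
```

So for a text column, "minimum" and "maximum" simply mean first and last in sorted order, and say nothing about the numbers in the same row.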

## c) Loading data from a csv file

Below is some code that opens the full dataset from a file. Edit it so that the `'name'` column becomes the index column. 


In [16]:
df = pd.read_csv('cereal.csv', index_col='name')
df

| name | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | rating |
|---|---|---|---|---|---|---|---|---|---|---|
| All-Bran with Extra Fiber | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 93.704912 |
| All-Bran | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 59.425505 |
| 100% Bran | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 68.402973 |
| Post Nat. Raisin Bran | 120 | 3 | 1 | 200 | 6.0 | 11.0 | 14 | 260 | 25 | 37.840594 |
| Raisin Bran | 120 | 3 | 1 | 210 | 5.0 | 14.0 | 12 | 240 | 25 | 39.259197 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Trix | 110 | 1 | 1 | 140 | 0.0 | 13.0 | 12 | 25 | 25 | 27.753301 |
| Corn Pops | 110 | 1 | 0 | 90 | 1.0 | 13.0 | 12 | 20 | 25 | 35.782791 |
| Puffed Rice | 50 | 1 | 0 | 0 | 0.0 | 13.0 | 0 | 15 | 0 | 60.756112 |
| Almond Delight | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | 0 | 25 | 34.384843 |
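
To see what `index_col` changes, the sketch below reads the same small piece of CSV text twice, once without and once with the argument. The text is a made-up two-row stand-in for `cereal.csv`, read through `io.StringIO` so that no file is needed.

```python
import io
import pandas as pd

csv_text = 'name,calories,sugars\nCheerios,110,1\nTrix,110,12\n'

default_index = pd.read_csv(io.StringIO(csv_text))
named_index = pd.read_csv(io.StringIO(csv_text), index_col='name')

print(list(default_index.index))  # [0, 1]
print(list(named_index.index))    # ['Cheerios', 'Trix']
```

With `index_col='name'`, the rows are labelled by cereal name instead of by position, which is what makes label-based lookups with `.loc[]` possible.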


## d) Getting row information

Use `.loc[]` to look up all the information about `'Cheerios'`.

In [17]:
df.loc['Cheerios']

calories    110.000000
protein       6.000000
fat           2.000000
sodium      290.000000
fiber         2.000000
carbo        17.000000
sugars        1.000000
potass      105.000000
vitamins     25.000000
rating       50.764999
Name: Cheerios, dtype: float64
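
`.loc[]` can also take a column label, or a list of row labels. The sketch below shows a few variations on a small made-up DataFrame called `toy`, which copies two rows and two columns from the cereal data.

```python
import pandas as pd

toy = pd.DataFrame(
    {'calories': [110, 50], 'sugars': [1, 0]},
    index=['Cheerios', 'Puffed Rice'],
)

row = toy.loc['Cheerios']                                   # a whole row, as a Series
value = toy.loc['Cheerios', 'sugars']                       # a single cell
subset = toy.loc[['Cheerios', 'Puffed Rice'], 'calories']   # several rows, one column

print(value)            # 1
print(subset.tolist())  # [110, 50]
```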

## e) Adding a healthy or unhealthy label

We are now going to add a new column called `'healthy'` which will contain the values `True` or `False` for each cereal type. If the calories are greater than 120 or the sugar is greater than 10, we will mark it as unhealthy. Otherwise we will mark it as healthy. 

The code below creates a list where each element is `True` or `False`, which specifies the health status of the corresponding cereal. Add this as a new column to the dataframe. 

In [18]:
healthy = []

for cereal in df.index:
    
    row = df.loc[cereal]
    
    if row['calories'] > 120 or row['sugars'] > 10:
       
        healthy.append(False)
        
    else:
        
        healthy.append(True)
        
print(healthy)

[True, True, True, False, False, False, True, False, True, False, True, True, True, False, True, True, False, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False, False, True, True, True, True, False, True, False, True, True, False, False, True, False, True, True, False, True, False, True, True, False, False, False, False, True, True, True]


In [19]:
df['healthy'] = healthy
df

| name | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | rating | healthy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| All-Bran with Extra Fiber | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 93.704912 | True |
| All-Bran | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 59.425505 | True |
| 100% Bran | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 68.402973 | True |
| Post Nat. Raisin Bran | 120 | 3 | 1 | 200 | 6.0 | 11.0 | 14 | 260 | 25 | 37.840594 | False |
| Raisin Bran | 120 | 3 | 1 | 210 | 5.0 | 14.0 | 12 | 240 | 25 | 39.259197 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Trix | 110 | 1 | 1 | 140 | 0.0 | 13.0 | 12 | 25 | 25 | 27.753301 | False |
| Corn Pops | 110 | 1 | 0 | 90 | 1.0 | 13.0 | 12 | 20 | 25 | 35.782791 | False |
| Puffed Rice | 50 | 1 | 0 | 0 | 0.0 | 13.0 | 0 | 15 | 0 | 60.756112 | True |
| Almond Delight | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | 0 | 25 | 34.384843 | True |
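
The loop in this exercise works, but pandas can also apply the same rule to whole columns at once. The sketch below is one possible alternative, shown on a small made-up DataFrame called `toy` that copies three rows from the cereal data; `|` means "or" and `~` means "not" when working with columns of `True`/`False` values.

```python
import pandas as pd

toy = pd.DataFrame(
    {'calories': [50, 120, 110], 'sugars': [0, 14, 8]},
    index=['All-Bran with Extra Fiber', 'Post Nat. Raisin Bran', 'Almond Delight'],
)

# Same rule as the loop: unhealthy if calories > 120 or sugars > 10.
unhealthy = (toy['calories'] > 120) | (toy['sugars'] > 10)
toy['healthy'] = ~unhealthy

print(toy['healthy'].tolist())  # [True, False, True]
```

The brackets around each comparison are needed, because `|` would otherwise be applied before `>`.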
