# 4. Working with Data in Python

![](images/logo.png)

Welcome to the third lab session!

This lab includes some exercises about the material covered in session 4 concerning files and pandas. 

> **NOTE**: If you are running this lab on your own machine (i.e. not on our JupyterHub server), you need to make sure pandas is installed. To check, try running the cell below:

In [None]:
import pandas as pd

If nothing much happens then you're good to go. However, if you get an `ImportError`, this means pandas has not been installed. To install pandas, open your command prompt and type 

```
pip install pandas
```

and hit enter. Once the install has finished, you should be good to go. Come back and retry running the cell above. 
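
If you would rather check without triggering an error, the standard library can report whether a package is importable. A minimal sketch (the variable name `pandas_available` is only illustrative):

```python
import importlib.util

# find_spec returns None when the package cannot be found on this machine
pandas_available = importlib.util.find_spec("pandas") is not None

if pandas_available:
    print("pandas is installed")
else:
    print("pandas is missing: run 'pip install pandas' and try again")
```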

With that done, let's jump in!



# Exercise 1: Reading in `.txt` files

## a) `with` statements

If you switch to the Jupyter file explorer tab, you'll notice a folder named `dreams`. If you take a look inside, you'll see 5 files called `dream1.txt`, `dream2.txt`, ..., `dream5.txt`. These are short write-ups of dreams people have had, taken from the website [DreamBank](http://www.dreambank.net/), which has a collection of over 20,000 dreams recorded from many different people. 

Each file is made up of 3 lines. The first gives the name of the dreamer, the second gives a date, and the third has the dream content. 
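
To make the three-line layout and the `with` pattern concrete, here is a minimal sketch that uses a throwaway file rather than the real dream files. The file name `example_dream.txt` and its contents are invented for illustration.

```python
import os
import tempfile

# Build a stand-in file with the same three-line layout (invented content)
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example_dream.txt')

with open(path, 'w') as example_file:
    example_file.write('Alex\n2001-05-17\nI was flying over a city made of glass.\n')

# The file is closed automatically when the indented block ends
with open(path) as example_file:
    contents = example_file.read()

print(contents)
print(example_file.closed)  # True
```

Because the file closes itself at the end of the block, there is no `.close()` call to forget.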

Below is some code that reads in `dream1.txt`. 

In the cell below that, rewrite the code so that it does the exact same thing, but uses the `with` statement. 

In [None]:
text_file = open('dreams/dream1.txt')

contents = text_file.read()

print(contents)

text_file.close()

In [None]:
# your code here

## b) `readlines`

Copy and paste your code from above. Replace `.read()` with `.readlines()`. This will create a list, where each element is one line of the file.

Remember, sometimes it may look like text is going over multiple lines when really it's on just one. 

Using list indexing, select the 3rd element from this list (remember, list indexing starts from 0) so that you extract only the text, not the name or date. Print this text. 
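
As a sketch of what `.readlines()` returns, the example below uses an invented three-line file (not one of the real dream files). Each element keeps its trailing `\n`, which `.strip()` removes.

```python
import os
import tempfile

# Hypothetical stand-in file, created in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'example_dream.txt')

with open(path, 'w') as f:
    f.write('Alex\n2001-05-17\nI was flying over a city made of glass.\n')

with open(path) as f:
    lines = f.readlines()

print(len(lines))        # 3
print(repr(lines[0]))    # 'Alex\n'
print(lines[1].strip())  # 2001-05-17
```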

In [None]:
# your code here

## c) Extracting the text from multiple files

Complete the code below to print the text from all five files. 

In [None]:
import os

for file_name in os.listdir('dreams'):
     
    path = 'dreams/' + file_name
    
    print(path)

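One detail worth knowing for this exercise: `os.listdir` returns names in arbitrary order, so wrap it in `sorted()` if the order matters. Here is a minimal sketch on a throwaway directory (the `note*.txt` files are invented for illustration), which also builds paths with `os.path.join` rather than string concatenation:

```python
import os
import tempfile

folder = tempfile.mkdtemp()

# Create three small stand-in files, deliberately out of order
for i in [3, 1, 2]:
    with open(os.path.join(folder, 'note' + str(i) + '.txt'), 'w') as f:
        f.write('This is note number ' + str(i) + '\n')

texts = []

# sorted() makes the processing order predictable
for file_name in sorted(os.listdir(folder)):
    path = os.path.join(folder, file_name)
    with open(path) as f:
        texts.append(f.read().strip())

print(texts)
```
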
# Exercise 2: Writing files

Below you will see a function being imported from a file called `random_sentence.py`. (This is another use of `import` - to bring in functions written in external files). Try running the function `generate_random_sentence()` a few times to see what it does. 

In [None]:
from random_sentence import generate_random_sentence

random_sentence = generate_random_sentence()

print(random_sentence)

## a) Writing many lines 1

Using the function `generate_random_sentence()`, complete the code below to write 100 random sentences to a file called `random_sentences1.txt` with the `.write()` method. Check this has worked by opening the file in the Jupyter launcher.

Remember to use the `\n` character to signify a new line. 
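
To see why the `\n` matters, here is a small sketch with invented text. `.write()` adds nothing of its own, so without a `\n` consecutive writes land on the same line. The file name `demo.txt` is hypothetical and is created in a temporary folder.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as demo_file:
    demo_file.write('first')       # no newline written here
    demo_file.write('second\n')    # ends up on the same line as 'first'
    demo_file.write('third\n')

with open(path) as demo_file:
    demo_lines = demo_file.readlines()

print(demo_lines)  # ['firstsecond\n', 'third\n']
```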

In [None]:
n_sentences = 100

with open('random_sentences1.txt', 'w') as my_file:
    
    for i in range(n_sentences):
        
        # your code here

## b) Writing many lines 2

Do the same thing again, but this time use a list of sentences with the `.writelines()` method. 
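
Despite its name, `.writelines()` does not add line endings. It simply writes each string in the list one after another, so each string needs its own `'\n'`. A minimal sketch with invented strings and a hypothetical `demo.txt` in a temporary folder:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

words = ['alpha\n', 'beta\n', 'gamma\n']

with open(path, 'w') as demo_file:
    demo_file.writelines(words)    # writes the strings back to back

with open(path) as demo_file:
    round_trip = demo_file.readlines()

print(round_trip == words)  # True
```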


In [None]:
n_sentences = 100

lines = []

for i in range(n_sentences):
    
    lines.append(generate_random_sentence() + '\n')
    
print(lines)

# your code here

# Exercise 3: Pandas

## a) Creating a DataFrame from a list of lists

Use the following data to create a pandas DataFrame. Store it in a variable called `df`. 

In [None]:
import pandas as pd


column_names = ['name', 'calories', 'protein', 'fat', 'sodium', 'fiber']

data = [['100% Bran', 70, 4, 1, 130, 10],
        ['100% Natural Bran', 120, 3, 5, 15, 2],
        ['All-Bran', 70,  4, 1, 260, 9],
        ['All-Bran with Extra Fiber', 50,  4, 0, 140, 14],
        ['Almond Delight', 110, 2, 2, 200, 1]]

# your code here

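For reference, the general pattern passes the rows as the first argument and the labels through `columns`. A sketch with invented fruit data (not part of the cereal dataset):

```python
import pandas as pd

fruit_columns = ['fruit', 'weight_g', 'price']

# Each inner list is one row (values are made up)
fruit_rows = [['apple', 180, 0.40],
              ['banana', 120, 0.25],
              ['cherry', 8, 0.05]]

fruit_df = pd.DataFrame(fruit_rows, columns=fruit_columns)

print(fruit_df.shape)  # (3, 3)
print(fruit_df)
```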

## b) Column information

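Use the method `df.mean()` to find the mean of each column. Do the same for `.min()` and `.max()`. If `.mean()` raises a `TypeError` because of the text in the `name` column (as it does in pandas 2.0 and later), pass `numeric_only=True`.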
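
Here is a sketch of these calls on a small invented table. It passes `numeric_only=True` to `.mean()` so that the text column is skipped, which recent pandas versions require when a column holds strings.

```python
import pandas as pd

# Invented data: one text column and two numeric columns
scores = pd.DataFrame([['Ann', 70, 1.5],
                       ['Bob', 90, 2.5]],
                      columns=['student', 'points', 'hours'])

means = scores.mean(numeric_only=True)   # skips the text column
print(means['points'])   # 80.0
print(means['hours'])    # 2.0

# .min() and .max() also work on text, using alphabetical order
print(scores.min()['student'])   # Ann
print(scores.max()['points'])    # 90
```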


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


## c) Loading data from a csv file

Below is some code that opens the full dataset from a file. Edit it so that the `'name'` column becomes the index column. 


In [None]:
df = pd.read_csv('cereal.csv')

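As a sketch of how an index column can be chosen at load time, the example below reads an invented CSV from an in-memory string through `io.StringIO`, so no file is needed. The `index_col` parameter of `pd.read_csv` names the column to use for the row labels. Calling `.set_index(...)` after loading is an alternative.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = 'item,count,weight_kg\nbolt,40,0.5\nwasher,100,0.1\n'

items = pd.read_csv(io.StringIO(csv_text), index_col='item')

print(list(items.index))    # ['bolt', 'washer']
print(list(items.columns))  # ['count', 'weight_kg']
```
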
## d) Getting row information

Use `.loc[]` to find out all information about `'Cheerios'`.
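
As a reminder of the pattern, `.loc[]` selects by label, so it expects a value from the index. A sketch on an invented table:

```python
import pandas as pd

# Invented data with text labels as the index
pets = pd.DataFrame({'legs': [4, 2], 'can_fly': [False, True]},
                    index=['cat', 'parrot'])

row = pets.loc['parrot']             # one whole row, returned as a Series
print(row['legs'])                   # 2

print(pets.loc['cat', 'can_fly'])    # False (row label, then column label)
```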

In [None]:
# your code here

## e) Adding a healthy or unhealthy label

We are now going to add a new column called `'healthy'` which will contain the value `True` or `False` for each cereal. If the calories are greater than 120 or the sugars are greater than 10, we will mark the cereal as unhealthy. Otherwise we will mark it as healthy. 

The code below creates a list where each element is `True` or `False`, which specifies the health status of the corresponding cereal. Add this as a new column to the dataframe. 

In [None]:
healthy = []

for cereal in df.index:
    
    row = df.loc[cereal]
    
    if row['calories'] > 120 or row['sugars'] > 10:
       
        healthy.append(False)
        
    else:
        
        healthy.append(True)
        
print(healthy)

In [None]:
# your code here

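Assigning a list to a new column name adds the column, provided the list has one entry per row. The sketch below shows this on an invented table, along with a vectorised comparison that builds the same labels without a loop. The table, column names and thresholds are illustrative only.

```python
import pandas as pd

# Invented data, not taken from the cereal dataset
snacks = pd.DataFrame({'calories': [90, 150, 110], 'sugars': [3, 6, 12]},
                      index=['rice cake', 'granola bar', 'fruit roll'])

# A list with one entry per row becomes a new column
light = [True, False, False]
snacks['light'] = light

# Same labels from a boolean expression over whole columns
snacks['light_vectorised'] = ~((snacks['calories'] > 120) | (snacks['sugars'] > 10))

print(snacks['light'].tolist())                                 # [True, False, False]
print((snacks['light'] == snacks['light_vectorised']).all())    # True
```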