# Reading data from a file into a DataFrame

<div class="alert alert-warning">

**In this Notebook you will learn how to read datasets stored in files into a DataFrame for analysis.**
    
</div>

In the last few Notebooks we have used lists of numbers or strings to store data. For example, we had a list of blood groups stored in the variable called `blood_groups`:

```python
blood_groups = ['A+', 'O+', 'A+', 'O+', 'A+', 'O-', 'A+', ... etc.]
```

This is not how data are normally stored, though. Instead, datasets are stored in files. A common file format for storing data is Microsoft Excel. You open Excel files using the Microsoft Office program called Excel. Once a file is open you can see your data, manipulate it and plot it in graphs.

But that is only one of many ways in which your data can be saved in a file, opened, analysed and plotted. We will now look at an alternative way using Python. 

Which method you choose to analyse your data is up to you. You may prefer the Excel way of doing things, or you may prefer the Python way of doing things. They both have their pros and cons, and for basic data analysis and plotting it doesn't really matter which one you use.

## CSV data files


As well as Excel files, there are other types of files that can store your data. One of the most common is called "comma-separated values" or "csv" for short. You can tell if a file is a csv file because its name ends with the extension ".csv". Similarly, an Excel file has the extension ".xlsx".

CSV files are called human-readable. This is because you can open them in any text editor to see their contents and edit them, for example, to correct mistakes or change the name of a variable. This is different from, say, Excel files, which are encoded in a non-human-readable format. Their contents cannot be read in a text editor.
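
To see what "human-readable" means in practice, here is a minimal sketch that writes a tiny csv file and then reads it back as plain text. It is an illustration only: the file name `example.csv` and the temporary folder are assumptions, and the four masses are the first values of the salmon dataset used later in this Notebook.

```python
import tempfile
from pathlib import Path

# Hypothetical file name and location, used only for this illustration
folder = Path(tempfile.mkdtemp())
csv_path = folder / "example.csv"

# A csv file is just text: a header line followed by one value per line
csv_path.write_text("mass\n3.09\n2.91\n3.06\n2.69\n")

# Reading the file back as plain text shows what a text editor would show
raw_text = csv_path.read_text()
print(raw_text)
```

The printed text is exactly what was written to the file, with no hidden encoding in between.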

<div class="alert alert-info">

If you go to the "Variation1/Self-study Notebooks" tab in your browser (the one that contains all the self-study Notebooks) you'll see a folder called **Datasets**. Click on it and you'll see a lot of csv data files which you'll be using in this course. You can click on any of them and a new tab will open with the contents of that file. Try it.
    
</div>

There are several ways to open a csv file in Python. In this course we will use a module called **pandas**. 

We'll use a simple dataset of body masses of Alaskan sockeye salmon to demonstrate how to do this. 

## Body mass of a sample of Alaskan sockeye salmon

<div>
<img src="attachment:salmon.jpg" width='70%' title="Bureau of Land Management CC BY 2.0"/>
    
</div>

In the file [Datasets/alaskan salmon.csv](Datasets/alaskan%20salmon.csv) are the body masses (in kg) of 228 female sockeye salmon sampled from Pick Creek in Alaska. 

The first few lines of `alaskan salmon.csv` look like this:
```csv
mass
3.09
2.91
3.06
2.69
... and so on
```

The first line is called a **header**. This describes what has been measured. In this case it is `'mass'`. Each line thereafter contains the mass (in kg) of an individual salmon.

## Pandas: A data analysis module


Pandas is widely used in data science. It provides lots of functions for reading in, analysing, manipulating and describing data. The official pandas website is [here](http://pandas.pydata.org). This website provides a lot more information on the use of this library than can be covered in these workshops. 

To use pandas we must include the following code once in each Notebook.

```python
import pandas as pd
```

This imports the pandas module and gives it the shorthand name `pd`.

## Reading in the dataset

The first two things we have to do are
1. read in the Alaskan salmon data from the csv file
2. call the dataset something sensible.  

We do this using the pandas function `read_csv()` like so:

```python
salmon_masses = pd.read_csv('Datasets/alaskan salmon.csv')
```
<div class="alert alert-info">

Run the following code cell to import pandas, read in `alaskan salmon.csv` and call the dataset `salmon_masses`.
    
There is no output, but the data will be read in and saved in memory.
</div>

In [None]:
import pandas as pd

salmon_masses = pd.read_csv('Datasets/alaskan salmon.csv')
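
If the Datasets folder is not to hand, `read_csv()` can also be tried on a few lines of text. The following is a minimal sketch rather than part of the course material: it assumes the first lines of the salmon file are typed in as a string, and the names `csv_text` and `preview` are hypothetical.

```python
import io
import pandas as pd

# The first lines of the salmon file, typed in as text for illustration
csv_text = "mass\n3.09\n2.91\n3.06\n2.69\n"

# io.StringIO lets pandas read the text as if it were a file
preview = pd.read_csv(io.StringIO(csv_text))

# The header line becomes the column name; the other lines become rows
print(list(preview.columns))
print(len(preview))
```

By default `read_csv()` treats the first line as the header, which is why `mass` ends up as the column name and not as a data value.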

## File not found error

If you type in the wrong name for a file then Python will complain with a `FileNotFoundError` message. 

For example, if you forget to include the "Datasets" folder in the filename like so

```python
salmon_masses = pd.read_csv('alaskan salmon.csv')
```
Python will print the following error message, followed by many more lines of detail:

```python
FileNotFoundError
Input In [1], in <cell line: 1>()
----> 1 salmon_masses = pd.read_csv('alaskan salmon.csv')
```

The arrow points to the line of code on which the error occurred.

Unfortunately, Python error messages can be long and verbose. The important thing to remember is that the main error and the line on which it occurred are at the top of the error message.
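
If you would rather see a short hint than a long error message, one option is to catch the error yourself. This is only a sketch and is not needed for this course: the file name `no such file.csv` is hypothetical and is assumed not to exist, and `try`/`except` is a Python feature not covered in these Notebooks.

```python
import pandas as pd

# Hypothetical file name that is assumed not to exist in the current folder
wrong_name = 'no such file.csv'

try:
    data = pd.read_csv(wrong_name)
    message = 'File was read successfully'
except FileNotFoundError:
    # pandas could not find the file, so prepare a short hint instead
    message = 'Could not find ' + wrong_name + '. Check the folder and file name.'

print(message)
```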

<div class="alert alert-info">

Run the following code cell to see the error. 
</div>

In [None]:
salmon_masses = pd.read_csv('alaskan salmon.csv')

<div class="alert alert-success">

If you want to clear the output of a code cell, for example to get rid of these long error messages, click on **Cell > Current Outputs > Clear** in the toolbar above.
</div>

## Printing the dataset

Notice that when you ran the above code to read in the dataset nothing appeared to happen. This is in contrast to Excel: when we open an Excel file the data are immediately displayed in a spreadsheet.

When we use pandas to read in a dataset, the data are stored in memory, but they are not immediately displayed. To see the data we have to print it like so:

```python
print(salmon_masses)
```

<div class="alert alert-info">

Try this in the code cell below.
</div>

In [None]:
print(salmon_masses)

There are two columns. The first column is an index, or row number, for each salmon and can be ignored. The second column contains the masses in kilograms of all salmon.

As this is a very long dataset, pandas has only printed the first and last five values; the middle values (from index 5 to index 222) have been omitted.

Also notice that the numbers of rows and columns are printed at the bottom, in this case 228 rows and 1 column (the index is not counted as a column).
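
Two other ways of inspecting a DataFrame are sketched below: `head()` returns just the first few rows, and `shape` gives the number of rows and columns. The sketch assumes a four-row sample (the first four masses from the salmon file) typed in as text, and the names `sample` and `first_rows` are hypothetical.

```python
import io
import pandas as pd

# First four masses from the salmon file, typed in as text for illustration
csv_text = "mass\n3.09\n2.91\n3.06\n2.69\n"
sample = pd.read_csv(io.StringIO(csv_text))

# head(2) returns only the first two rows
first_rows = sample.head(2)
print(first_rows)

# shape is a (rows, columns) pair; the index is not counted as a column
print(sample.shape)
```

The same two calls work on `salmon_masses` once it has been read in.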

## Get the number of values in the dataset: The sample size

The number of values in a dataset is also called the **sample size**. It is usually represented by lowercase *n*.

We can use the `count()` method to get the number of (non-missing) values in a dataset:

```python
n = salmon_masses['mass'].count()
```

<div class="alert alert-info">

Try this in the following code cell.
</div>

In [None]:
n = salmon_masses['mass'].count()

print(n)
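
One detail worth knowing is that `count()` skips missing values, whereas Python's built-in `len()` counts every row. The sketch below illustrates the difference using made-up data: the column `fish` and the missing mass in the second row are assumptions for this example and are not part of the salmon dataset.

```python
import io
import pandas as pd

# Hypothetical data: four fish, one of which has no recorded mass
csv_text = "fish,mass\n1,3.09\n2,\n3,3.06\n4,2.69\n"
sample = pd.read_csv(io.StringIO(csv_text))

# count() ignores the missing value
n_values = sample['mass'].count()

# len() counts every row, including the one with the missing mass
n_rows = len(sample)

print(n_values)
print(n_rows)
```

When a dataset has no missing values, the two numbers are the same.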

## A pandas dataset is called a DataFrame

Pandas (and other statistical software) calls a dataset a **DataFrame**. Excel calls a dataset a "sheet". The variable called `salmon_masses` has the type DataFrame. You can see this if you run the code

```python
print( type(salmon_masses) )
```

<div class="alert alert-info">

Try this in the following code cell.
</div>

In [None]:
print( type(salmon_masses) )
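
As a small extension, the sketch below compares the type of a whole dataset with the type of a single column selected using square brackets, as in `salmon_masses['mass']`. It assumes a four-row sample typed in as text, and the name `sample` is hypothetical.

```python
import io
import pandas as pd

# First four masses from the salmon file, typed in as text for illustration
csv_text = "mass\n3.09\n2.91\n3.06\n2.69\n"
sample = pd.read_csv(io.StringIO(csv_text))

# The whole dataset is a DataFrame
print(type(sample))

# A single column selected by name is a pandas Series
print(type(sample['mass']))
```

A Series is the pandas name for one column of values, which is why methods such as `count()` can be applied to `salmon_masses['mass']`.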

## Exercise Notebook

[Reading data from files](../Exercise%20Notebooks/2.6%20-%20Reading%20data%20from%20files.ipynb)