(c) Kevin Dunn, All content is under Creative Commons Attribution CC-BY 4.0 and all code is under BSD-3 clause.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice! 

# Loading data from files into Python: CSV files

This is a part of a series showing how to import data:
1. CSV files, or text files
2. Excel files
3. MATLAB files
4. Reading data from a remote web-based location
5. Reading real-time sensor data (data stream)

## Background

CSV stands for Comma-Separated Values. It is a table of data, in a **plain text file**, where the numbers are separated by commas.

In the figure below, on the left is what the raw CSV file looks like if you open it in a regular text editor. On the right is what that file looks like after loading it into a spreadsheet.

Raw file | CSV imported into a spreadsheet
-        | - 
![Raw CSV file](images/reading-loading-data/CSV-raw.png) | ![CSV file is loaded](images/reading-loading-data/CSV-table.png)




### Try it now

1. Download the file: https://openmv.net/file/batch-yield-and-purity.csv (also [available locally](datasets/batch-yield-and-purity.csv))
2. Open the file in a text editor. ([Notepad++ is recommended on Windows](https://notepad-plus-plus.org/), or use your built-in text editor on Mac/Linux/other operating systems.)
3. Verify that the file is nothing but pure text: you can read it. Text files are human readable, whereas a binary file, such as an Excel file, would look only like cryptic characters.




### How is a CSV file processed by Python?
* When the file is imported, it is read from top to bottom, line-by-line.
* Within each line we will read from left-to-right. Every time a comma is encountered, the importer will assume it is the start of a new column, and place the following number into that column.
* It continues until it reaches the end of the line of text.

**Terminology:** the comma symbol is used to separate one number from the next. In this case, the comma is also known as a *delimiter*. You could equally devise a file format that uses some other symbol as the delimiter, for example the 'tab' character or the semi-colon.

You will see the word *delimiter* used below in the code. Now you know what it means.
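
The reading steps described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how NumPy or Pandas actually implement it. The function name `parse_csv_text`, the column names, and the values are made up for this example.

```python
# A made-up, three-line CSV snippet (not the real dataset), held in a string.
raw_text = "yield,purity\n71.5,88.2\n69.0,90.1\n"


def parse_csv_text(text, delimiter=","):
    """Split the text into rows, then split each row at the delimiter."""
    table = []
    for line in text.splitlines():        # top to bottom, line by line
        fields = line.split(delimiter)    # left to right, cut at each delimiter
        table.append(fields)
    return table


rows = parse_csv_text(raw_text)
header, data = rows[0], rows[1:]

# Every field is still text at this point; convert the data rows to numbers.
numbers = [[float(value) for value in row] for row in data]
print(header)
print(numbers)

# The same function handles another delimiter, for example the semi-colon.
print(parse_csv_text("1;2;3", delimiter=";"))
```

Notice that the header row has to be set aside before converting to numbers, which is the same reason the code further down skips the first row.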




### Problems and pitfalls
* A comma is often used as the decimal separator in European countries, as in "72,6" for the number 72.6, and this is problematic. Since importing reads from left to right, "72,6" will produce the value 72 in one column, and the value 6 in the next column.
* CSV files must therefore be created using a period (or full-stop) as the decimal separator for floating point numbers (i.e. non-integer numbers).
* The CSV format is not economical for large datasets: every number is stored as text, which takes more space and is slower to load than a binary format.
* The file does not state how many rows or columns it contains: you only find out once the whole file has been read.
* Using text headings for either the columns [called column headers] or for the rows [called row headers] can lead to confusion. You will see how we can handle this below: we skip them.
* Editing data in a CSV file is not simple: you cannot easily delete an entire column, for example. Manipulating data, such as calculating the average of a column, is also not possible within the file itself. Contrast this to Excel files, described elsewhere.
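
To make the decimal-comma pitfall concrete, here is a small sketch. The first part shows how "72,6" is cut into two fields. The second part shows one possible workaround, assuming you cannot control how the file was created and that it uses a semi-colon between values: the Pandas function `read_csv` accepts `sep` and `decimal` inputs. The data and column names are made up.

```python
import io

import pandas as pd

# Pitfall: with a comma as the delimiter, "72,6" is cut into two fields.
print("71,72,6,73".split(","))

# Possible workaround (made-up data): values separated by ';' and a comma
# used as the decimal separator.
european_text = "yield;purity\n72,6;88,1\n70,2;90,4\n"
table = pd.read_csv(io.StringIO(european_text), sep=";", decimal=",")
print(table)
```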


### Advantages
* The CSV format is future-proof: since everything inside is just a plain-old text file, it will always be openable, no matter which software you use in the future.
* You can quickly change the CSV file to update/add/remove parts of it using your text editor. No special software is required.
* We can handle missing data: put a ``NaN`` between two commas. For example: `` 71,NaN,73 `` indicates a value is missing between the ``71`` and the ``73``.
* The CSV file is a good data-interchange format. By that we mean: most data manipulation software can export to CSV, and once you have that as an intermediate file, you can re-import it into other software, including Python. Use it, for example, to export a database to CSV.
* Because of its simplicity, it is widely supported in most data manipulation software. But its simplicity is also its downfall for more complex data (e.g. data with more than 2 dimensions, such as digital images).
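
As a quick illustration of the missing-data point in the list above, the sketch below reads the single row ``71,NaN,73`` from an in-memory string instead of a file. The use of `io.StringIO` is only a convenience for this example.

```python
import io

import numpy as np

# A made-up single row with the middle value marked as missing.
row_text = "71,NaN,73"
values = np.loadtxt(io.StringIO(row_text), delimiter=",")

print(values)
print(np.isnan(values))    # True where the value is missing
print(np.nanmean(values))  # average that ignores the missing value
```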





### Using only Numpy to read the CSV file

In [None]:
# Run this code to ensure that the values inside ``yield_purity_np`` match what 
# you see in the figure above.
import numpy as np

# Create a file object, referring to the CSV file. You might need to adjust
# the part between quotation marks to point to the actual CSV file location
# on your computer.
csvfile = open("datasets/batch-yield-and-purity.csv", "rt")

# Now we read the above file object
# Notice the function input 'delimiter': we indicate what delimiter was used.
# We also instruct it to skip the first row; that was just a header row (non-numeric)
yield_purity_np = np.loadtxt(csvfile, delimiter=",", skiprows=1)  
print(yield_purity_np)

Now break the code a bit. Try these:
* Change ``skiprows=0`` and rerun the code: what error message do you get? Does it make sense?
* Add ``help(np.loadtxt)`` to the code, or visit the [help for the function](https://www.numpy.org/devdocs/reference/generated/numpy.loadtxt.html). What would you change in the above code if each row has a header too?
* Edit the text file and remove a value, replacing it with ``NaN`` instead. Now try to read the file. Does it show as ``nan`` in the output?
* Try it again, this time, putting nothing: just two commas next to each other. Did it work?
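
If the variant with two adjacent commas gives trouble, one alternative worth knowing about is `np.genfromtxt`, which is designed to cope with missing entries. The sketch below is a hedged illustration on a made-up in-memory table, not on the real dataset. Note that its input for skipping the header is called `skip_header`, not `skiprows`.

```python
import io

import numpy as np

# Made-up table: one value marked NaN, one left empty between two commas.
text = "a,b,c\n71,NaN,73\n72,,85\n"
data = np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=1)

print(data)
print(np.isnan(data).sum())  # how many entries are missing
```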

### Using the CSV library to read, and load into Numpy

The [CSV library](https://docs.python.org/3/library/csv.html) has many flexible options. We will just use the basics here, but look at that link to see the different options for importing a variety of file types.

In [None]:
import csv
import numpy as np

csvfile = open("datasets/batch-yield-and-purity.csv", "rt")

# Creates a special Python object type called an iterator. The built-in
# function ``next(...)`` advances an iterator by one step.
# You can repeatedly call the ``next(...)`` function and you will get the 
# next element from the iterator, until no more elements are left over.
reader = csv.reader(csvfile, delimiter=",")

# Use the ``next(...)`` function to call the iterator once, thereby skipping 
# the first row of text. Iterators in this context give you one line of the CSV file.
next(reader)

# Now use the ``list`` function. The ``list`` function is greedy: it will 
# keep calling ``next(...)`` internally, until there are no more elements.
# So instead of writing a for-loop, we can simply say:
raw_list = list(reader)
# print(raw_list)    # uncomment this to see what the intermediate data looks like

# Lastly, convert the Python list of lists (of strings) to an array of floats.
result = np.array(raw_list).astype("float")
print(result)
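
To see what the iterator does at each step, here is a small sketch that uses made-up CSV text held in memory (via `io.StringIO`) instead of the real file. It also shows that every field arrives as a text string, which is why the conversion to ``float`` is needed in the code above.

```python
import csv
import io

# Made-up CSV text standing in for a file on disk.
fake_file = io.StringIO("yield,purity\n71.5,88.2\n69.0,90.1\n")
reader = csv.reader(fake_file, delimiter=",")

header = next(reader)       # first call: the header row
first_row = next(reader)    # second call: the first row of data
remaining = list(reader)    # greedy: everything that is left over

print(header)
print(first_row)
print(remaining)

# The iterator is now exhausted; a default value avoids a StopIteration error.
print(next(reader, "no more rows"))
```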



### Using Pandas to load a CSV file: simple and quick

The Pandas function ``read_csv`` has a lot of flexibility and [smart processing](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) built-in to make reading CSV files easy with headers, missing values, and other settings.

It is a Swiss Army knife function: very versatile, but you need to know how to use it.

It also produces an object ``yield_purity_pd`` in the code below, which has many methods. You can call the ``.head()`` method, as shown, but also try some others. Some suggestions are given below.

In [None]:
import pandas as pd
yield_purity_pd = pd.read_csv('datasets/batch-yield-and-purity.csv')

# Show only the first few lines of the data table
yield_purity_pd.head()  
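
The object that ``read_csv`` returns can be inspected in several ways. The sketch below uses a small made-up table (hypothetical values, not the real dataset) to show a few attributes and methods, including ``.to_numpy()``, which gives an array comparable to the NumPy results earlier.

```python
import io

import pandas as pd

# Made-up table held in memory; the column names come from the header row.
text = "yield,purity\n71.5,88.2\n69.0,90.1\n74.3,87.5\n"
small_table = pd.read_csv(io.StringIO(text))

print(small_table.shape)          # (number of rows, number of columns)
print(list(small_table.columns))  # column headers taken from the first line
print(small_table.dtypes)         # data type detected for each column

# Convert to a plain NumPy array if that is what the rest of your code needs.
as_array = small_table.to_numpy()
print(as_array)
```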

### Exercises to end off:

1. Edit the text file to create some missing values, like this, using ``,,`` or ``NaN`` or ``N/A``. 
![CSV with missing values](images/reading-loading-data/CSV-with-missing.png)
1. Try the Pandas code and verify that it needs no modification to handle those variations. This is the advantage of using a library like Pandas: plenty of flexibility and options.
1. Now try running the Numpy code and the CSV/Numpy code above on your CSV file with missing values. It will not work so easily; the code will likely crash.

1. Try some of the other methods on the Pandas object above. Add these lines below the existing code:
    * ``yield_purity_pd.plot()``
    * ``yield_purity_pd.mean()`` to calculate the average. The average of what is being shown?
    * ``yield_purity_pd.std()`` to calculate the standard deviation. The standard deviation of what is shown?

## Enrichment

Curious to learn more about importing what are called "*flat files*" into Python? Follow the first part (free) of this course on DataCamp: https://www.datacamp.com/courses/importing-data-in-python-part-1/

