(c) Kevin Dunn, All content is under Creative Commons Attribution CC-BY 4.0 and all code is under BSD-3 clause.
Please reuse, remix, revise, and reshare this content in any way, keeping this notice! 

# Loading data from files into Python: Excel files

This is a part of a series showing how to import data:
1. CSV files, or text files
2. Excel files
3. MATLAB files
4. Reading data from a remote web-based location
5. Reading real-time sensor data (data stream)

## Background

Excel files are a proprietary file format specified by Microsoft for their spreadsheet software. There are two variants: XLS (older format) and XLSX (newer).

An Excel file can have one or more sheets (tabs). It is desirable to have a single instruction that can read an Excel file, specifying which tab to load.
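As a minimal sketch of such a single instruction, Pandas offers ``pd.read_excel``, which takes a ``sheet_name`` argument. The workbook, sheet names and values below are made up so that the example can run on its own. It also assumes the ``openpyxl`` package is installed, which Pandas relies on for XLSX files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The sheet names and values are invented for illustration only.
workbook_path = Path(tempfile.mkdtemp()) / "example.xlsx"
with pd.ExcelWriter(workbook_path) as writer:
    pd.DataFrame({"year": [2019, 2020], "value": [1.5, 2.5]}).to_excel(
        writer, sheet_name="first", index=False
    )
    pd.DataFrame({"code": ["a", "b", "c"]}).to_excel(
        writer, sheet_name="second", index=False
    )

# One instruction: read the file and name the tab to load.
second_sheet = pd.read_excel(workbook_path, sheet_name="second")
print(second_sheet)
```

The code further below reaches the same result in two steps, with ``pd.ExcelFile`` and ``.parse()``, which is handy when you first want to list the available sheets.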

**Terminology:** the word *parse* is used in the descriptions below. To *parse* a file means to process it, or analyze it. It often also implies returning the input in a different form.

We will use code below to *parse* an Excel file. Now you know what it means.


### Problems and pitfalls
* The Excel file format, though publicly specified, is controlled by a corporation. Though unlikely, files from 10 years ago might not be easily readable 30 years from now. Compare this to flat text files, which have always been readable, and almost certainly will be in the future. Point being: do not rely on Excel files for long-term archives.
* Excel files do not just contain data: they may contain all sorts of formatting (e.g. cell borders, headers, colours, graphs), which are not imported into Python. What is visually obvious as a heading in an opened Excel file is not necessarily so when reading the file into Python.
* Reading and processing an Excel file might cost you the price of a software license. In practice, most Excel files can be opened with full functionality by some [alternative software](https://www.google.nl/search?q=alternative+to+Excel).
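One possible way to act on the archiving point is to export every sheet to a flat CSV file. The sketch below is only an illustration: the workbook, the sheet names ``trial_1`` and ``trial_2``, and the values are invented, and ``openpyxl`` is assumed to be installed. Note that only the cell values survive; formatting and graphs are not carried over.

```python
import tempfile
from pathlib import Path

import pandas as pd

work_dir = Path(tempfile.mkdtemp())
workbook_path = work_dir / "archive_me.xlsx"

# Made-up workbook with two sheets, purely for illustration.
with pd.ExcelWriter(workbook_path) as writer:
    pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]}).to_excel(
        writer, sheet_name="trial_1", index=False
    )
    pd.DataFrame({"x": [5, 6], "y": [7.0, 8.0]}).to_excel(
        writer, sheet_name="trial_2", index=False
    )

# sheet_name=None loads every sheet into a dictionary: {sheet name: DataFrame}
all_sheets = pd.read_excel(workbook_path, sheet_name=None)

# Write each sheet to its own plain-text CSV file
csv_paths = []
for sheet_name, frame in all_sheets.items():
    csv_path = work_dir / f"{sheet_name}.csv"
    frame.to_csv(csv_path, index=False)
    csv_paths.append(csv_path)

# The archive copy is ordinary text, readable in any editor
print(csv_paths[0].read_text())
```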


### Advantages
* The data in an Excel file can be graphically manipulated. Completely removing a column from a spreadsheet is no problem, for example.
* Additional calculations can be added to the spreadsheet, and then read into Python for further processing.





### Getting set up for this project

1. Download a file of CO2 emissions from http://mkt.tableau.com/Public/Datasets/World_Bank_CO2.xlsx
2. This file is also [available from GitHub](https://github.com/kgdunn/digital-skills-module5/blob/master/Notebooks/datasets/World_Bank_CO2.xlsx)
3. Store it in a known place on your computer if you are running this notebook on your own laptop or desktop.
4. Open the file in Microsoft Excel, or Open Office or Google Docs, to get an idea of what data is contained inside it, and which sheets are available.





### Using Pandas to read your file


In [None]:
import pandas as pd


# Assign the location of the spreadsheet: on your local computer, or on the internet.
# Change the file location in the next line, if needed, to point to the actual file
excel_file = 'datasets/blender-efficiency.xlsx'


# or, if you want to get it directly from the internet (this line overrides the one above):
excel_file = 'http://openmv.net/blender-efficiency.xlsx'


# Load the spreadsheet into a variable
xl = pd.ExcelFile(excel_file)

# What can you do with this variable?
dir(xl)   # remember the ``dir`` command? It shows the attributes and methods available for an object

# There are two that we will use:
#    .sheet_names    
#    .parse(...)

# Show the names of the sheets:
print(xl.sheet_names)


# Load (parse) as a dataframe a specific sheet. But how should we use the .parse() method?
help(xl.parse)

data = xl.parse('blender-efficiency')
data




Now extend the code a bit. Try these:


* Add ``skiprows=0`` to the ``xl.parse(...)`` call.
* Add ``help(np.loadtxt)`` to the code, or visit the [help for the function](https://www.numpy.org/devdocs/reference/generated/numpy.loadtxt.html). What would you change in the above code if each row has a header too?
* Edit the text file and remove a value, replacing it with ``NaN`` instead. Now try to read the file. Does it show as ``nan`` in the output?
* Try it again, this time leaving the value empty: just two commas next to each other. Did it work?
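To see what ``skiprows`` does when it is not zero, here is a small sketch. It assumes a made-up sheet, called ``data``, where a title line sits above the real column names; the file and its values are invented for illustration, and ``openpyxl`` is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

workbook_path = Path(tempfile.mkdtemp()) / "with_title.xlsx"

# Row 1 is a title line, row 2 holds the real column names, then the data.
rows = [
    ["Made-up measurements", None],
    ["time", "reading"],
    [0, 10.0],
    [1, 12.5],
]
pd.DataFrame(rows).to_excel(
    workbook_path, sheet_name="data", index=False, header=False
)

with pd.ExcelFile(workbook_path) as xl:
    # Default: the first row (the title line) is used for the column names
    as_is = xl.parse("data")
    # Skip the title line, so the second row supplies the column names
    skipped = xl.parse("data", skiprows=1)

print(as_is.columns.tolist())
print(skipped.columns.tolist())
print(skipped)
```

Compare the two lists of column names: what is visually obvious as a title in Excel is just another row to Pandas, unless you tell it otherwise.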

### Exercises to end off:

1. Edit the text file to create some missing values, like this, using ``,,`` or ``NaN`` or ``N/A``. 
![CSV with missing values](images/reading-loading-data/CSV-with-missing.png)
1. Try the Pandas code and verify that it needs no modification to handle those variations. This is the advantage of using a library like Pandas: plenty of flexibility and options.
1. Now try running the Numpy code and the CSV/Numpy code above on your CSV file with missing values. It will not work so easily; the code will likely crash.

1. Try some of the other methods on the Pandas object above. Add these lines below the existing code
    * ``yield_purity_pd.plot()``
    * ``yield_purity_pd.mean()`` to calculate the average. The average of what is being shown?
    * ``yield_purity_pd.std()`` to calculate the standard deviation. The standard deviation of what is shown?
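If you do not have the CSV file at hand, the sketch below mimics these exercises with a small made-up table held in a string. The column names ``yield``, ``purity`` and ``batch`` and all the values are assumptions for illustration; only the variable name ``yield_purity_pd`` is taken from the exercise.

```python
import io

import pandas as pd

# Made-up CSV text with three spellings of "missing":
# an empty field (two commas next to each other), NaN and N/A.
csv_text = """yield,purity,batch
80.1,,1
NaN,95.2,2
78.4,N/A,3
81.5,96.8,4
"""

# No extra options are given: Pandas treats all three as missing by default
yield_purity_pd = pd.read_csv(io.StringIO(csv_text))

print(yield_purity_pd)
print(yield_purity_pd.isna().sum())   # number of missing values per column
print(yield_purity_pd.mean())         # average of each column
print(yield_purity_pd.std())          # standard deviation of each column
```

Both ``.mean()`` and ``.std()`` work column by column and leave out the missing values by default.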

## Enrichment

Curious to learn more about importing from and writing to Excel files via Python? Follow this community tutorial on DataCamp: 
https://www.datacamp.com/community/tutorials/python-excel-tutorial