# Other data file formats

In this Notebook, you will learn how to work with a variety of other file formats. Details for some file formats are left deliberately sparse. If you find yourself spending a lot of time working with such file formats, feel free to add additional notes to this Notebook, or create a new Notebook to record the recipes you find useful.

## Spreadsheet files (Excel XLS and XLSX files)

Although spreadsheet files are one of the most widely used file formats for sharing data, we have relegated them to this Notebook because we want you to get into the habit of using other file formats to publish and request data yourself.  

Part 7 of the module looks at some of the weaknesses of spreadsheets for analysing and managing data.

As one of the most widely used spreadsheet applications, the file formats used by Microsoft Excel by default are the ones most commonly encountered. Excel spreadsheet files can be recognised from the file extensions `.xls` and `.xlsx`.

You can open a file from a spreadsheet into a *pandas* DataFrame using the `read_excel()` function.

To start with, load in the *pandas* package:

In [None]:
import pandas as pd

We can import a spreadsheet directly into *pandas* using the `pd.read_excel()` function. Setting the `sheet_name` parameter to `None` loads all the sheets as a `dict` of dataframes, keyed by sheet name.

In [None]:
# The following spreadsheet is taken from the Greater London Authority, London DataStore.
#                     https://londondatastore-upload.s3.amazonaws.com/tfl-buses-type.xls
#                     [retrieved 20/07/15]

# Set the sheet_name parameter to None to load in all the sheets as a dict of dataframes
xl = pd.read_excel('data/tfl-buses-type.xls',  sheet_name=None)


xl

We can identify the sheets that have been loaded in as the `dict` keys:

In [None]:
xl.keys()

Preview the first few rows of the `Data` sheet:

In [None]:
xl['Data'][:3]

Alternatively, we can read in a single sheet by name:

In [None]:
data = pd.read_excel('data/tfl-buses-type.xls', sheet_name='Data')

data[:3]

By inspecting this data, or by opening the spreadsheet using a spreadsheet application or the OpenRefine tool (which is introduced in Part 2 of the module), we can check to see how many of the first few rows are metadata or blank rows. We can discount a certain number of lines at the top of the sheet using the `skiprows` parameter, or we can specify the spreadsheet row number of the header row explicitly and ignore the rows preceding that one. We can also define which columns we wish to import.  
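As a rough sketch of those options, the cell below passes `skiprows` and `usecols` to `read_excel()`. The particular values (`skiprows=1`, `usecols='A:C'`) are placeholders rather than the correct settings for this sheet, so adjust them once you have inspected the data. The file check simply means the cell reports a missing file instead of raising an error.

```python
from pathlib import Path

import pandas as pd

xls_path = Path('data/tfl-buses-type.xls')

# Placeholder settings (assumptions, not checked against the sheet):
# skip the first row, then read spreadsheet columns A to C only.
read_options = {'sheet_name': 'Data', 'skiprows': 1, 'usecols': 'A:C'}

if xls_path.exists():
    trimmed = pd.read_excel(xls_path, **read_options)
    print(trimmed.head(3))
else:
    print(f'{xls_path} not found')
```

To name the header row explicitly instead, pass `header=n`, where `n` is the zero-based index of the row holding the column names.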

The `NaN`s sometimes indicate that cells are empty, or contain formulas or other non-value data. In the cells under those containing 'Single deck' and 'Double deck', and alongside the description in the final row, the `NaN`s are there because the cells have been merged into a single cell that spans several rows or columns.

(For more information, see the documentation for the [*pandas* read_excel method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html).)
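The `NaN`s left behind by merged cells can often be filled by copying the last seen value down the column with the *pandas* `ffill()` method. The small dataframe below is made-up data that imitates the merged-cell pattern. It is not taken from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Made-up data: each bus type label appears once, followed by NaN,
# as happens when a label cell is merged across several rows.
merged = pd.DataFrame({
    'Bus type': ['Single deck', np.nan, 'Double deck', np.nan],
    'Year': [2013, 2014, 2013, 2014],
})

# Forward fill: replace each NaN with the most recent non-missing value above it.
merged['Bus type'] = merged['Bus type'].ffill()

merged
```

Forward filling is only appropriate where a `NaN` really does mean "same as the cell above", so check the original sheet before applying it.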

### *xlrd*

The `xlrd` package reads data from Excel spreadsheet files and gives lower level access to their contents than *pandas* provides. It only reads files (it does not write them), and from version 2.0 onwards it supports only the older `.xls` format, not `.xlsx`.

For more details see: http://xlrd.readthedocs.io/en/latest/

In [None]:
import xlrd

workbook = xlrd.open_workbook('data/tfl-buses-type.xls')
# The library also allows us to preview the sheet names.
print(workbook.sheet_names())

In [None]:
# By manual inspection of the originally previewed sheet, we can use 
# xlrd to read the metadata from the metadata cell.
# Note that row and column indices are zero-based integers,
# and also note that some cells span multiple rows.
sheet = workbook.sheet_by_name('Data')
sheet.cell_value(rowx=14, colx=0)

## XML Files

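Importing XML data into a *pandas* `DataFrame` can be a little trickier than importing JSON. Recent versions of *pandas* (1.3 and later) provide a `read_xml()` function, but it is designed for fairly flat documents, and older versions have no built-in XML import at all.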

Instead, you need to load in a file, parse it using a third party parser such as `lxml`, and then handle the mapping to the `DataFrame` yourself.
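A minimal sketch of that workflow is shown below. It uses the standard library's `xml.etree.ElementTree` in place of `lxml` (whose `lxml.etree` module offers a largely compatible interface), and the XML document, element names and column names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Made-up XML document, for illustration only.
xml_doc = """
<carparks>
    <carpark id="1"><name>Quay Road</name><spaces>120</spaces></carpark>
    <carpark id="2"><name>Sea Street</name><spaces>85</spaces></carpark>
</carparks>
"""

root = ET.fromstring(xml_doc)

# Map each <carpark> element to one row, choosing the columns ourselves.
records = []
for carpark in root.findall('carpark'):
    records.append({
        'id': int(carpark.get('id')),               # value of the id attribute
        'name': carpark.findtext('name'),           # text of the <name> child
        'spaces': int(carpark.findtext('spaces')),  # text of the <spaces> child
    })

df_xml = pd.DataFrame(records)
df_xml
```

The mapping step is where the work lies: you decide which elements and attributes become columns, and how to convert their text to numbers or dates.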

Alternatively, use OpenRefine to parse the elements of the XML document that you are interested in and then save the data out again as a tabular CSV document which is a little easier to import.

We will try to limit our use of XML-based datasets in this module, preferring instead CSV formats for tabular data and JSON for more elaborately structured datasets. You will, however, work with a particular style of XML later in the module when you look at Linked Data and the semantic web.

One thing worth bearing in mind is that popular XML-based formats often have dedicated Python libraries that make it easier to parse them, and to read and write files that use the format. For example, the KML format that is used to transport geographical data (points, lines, boundaries) can be parsed using the `fastkml` library.

##  Working with KML Files

We can load in data from a KML file (a file format for geographic data sets) and then render it onto a map quite easily.

For example, the data directory contains a file, `CarParks.kml`, that lists car park locations on the Isle of Wight.

In [None]:
!ls data

The `fastkml` package provides various tools for parsing KML files and manipulating related data structures:

In [None]:
from fastkml import kml
k = kml.KML()

We need to open the file as a bytestream and let the `lxml` parser used by the `fastkml` package identify the encoding itself:

In [None]:
# Read the raw bytes; the with block closes the file afterwards
with open("data/CarParks.kml", 'rb') as f:
    doc = f.read()
k.from_string(doc)

An alternative approach is to open the file with a UTF-8 encoding to get a Unicode string, then throw away the first line, which declares the encoding to be UTF-8. (The `.from_string()` function simply expects a KML document without the XML encoding declaration.)

In [None]:
!head -n 3 data/CarParks.kml
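# Start from a fresh KML object so the features are not loaded twice
k = kml.KML()
with open("data/CarParks.kml", encoding='utf-8') as f:
    # Lines keep their own newline characters, so join them without a separator
    lines = ''.join(f.readlines()[1:])
k.from_string(lines)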

We can parse the locations of the car park placemarks from the file:

In [None]:
placemarks = []

for feature in k.features():
    for placemark in feature.features():
        placemarks.append((placemark.name, placemark.geometry.y, placemark.geometry.x))

placemarks[:3]
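The tuples above take the latitude from `geometry.y` and the longitude from `geometry.x` because KML stores each coordinate as `longitude,latitude[,altitude]`. As a minimal sketch of what the parser is extracting, the cell below reads a made-up, single-placemark KML document using the standard library's `xml.etree.ElementTree` rather than `fastkml`. The placemark name and coordinates are illustrative and are not taken from `CarParks.kml`.

```python
import xml.etree.ElementTree as ET

# KML elements live in this XML namespace.
KML_NS = {'kml': 'http://www.opengis.net/kml/2.2'}

# Made-up KML document, for illustration only.
sample_kml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example car park</name>
      <Point><coordinates>-1.2667,50.68,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

# Parse from bytes so that the parser reads the encoding declaration itself.
root = ET.fromstring(sample_kml.encode('utf-8'))

sample_rows = []
for pm in root.iterfind('.//kml:Placemark', KML_NS):
    name = pm.findtext('kml:name', namespaces=KML_NS)
    coords = pm.findtext('.//kml:coordinates', namespaces=KML_NS)
    # Coordinate order in KML is longitude first, then latitude.
    lon, lat = (float(value) for value in coords.strip().split(',')[:2])
    sample_rows.append((name, lat, lon))

sample_rows
```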

We can then create a simple `DataFrame` from these values:

In [None]:
df_placemarks = pd.DataFrame(placemarks, columns=['Name', 'Latitude', 'Longitude'])

df_placemarks.head()

Let's quickly map the markers to show how the parser has pulled out the placemark information. The `folium` package provides a set of tools for creating interactive maps, and adding markers to them, quite straightforwardly. (You will meet `folium` again in more detail in Part 5 of the module.)

NOTE: `folium` uses an external tileset to render the map background. You therefore need an internet connection when the map is being displayed. Cached tile data may be used if you are offline, but some tiles will be missing if you change scale by zooming.

In [None]:
import folium

If we know the latitude and longitude at the centre of the map we want to display, we can set it directly:

In [None]:
carpark_map = folium.Map(location=[50.68, -1.2667], width=960, height=500, zoom_start=11)

One of the built-in methods of a *pandas* dataframe is `mean()`, which calculates the mean value(s) of one or more numeric columns.

We can use this method to calculate the mean latitude and longitude of the points we wish to plot directly from the dataframe:

In [None]:
lat_mean, lon_mean = df_placemarks[['Latitude', 'Longitude']].mean()
lat_mean, lon_mean

To place markers on a map, we can create a simple function that places a single marker given a latitude and longitude, and then apply that to each row of the dataframe:

In [None]:
def add_marker(row):
    """Add a marker to a map."""
    folium.CircleMarker(location=(row['Latitude'], row['Longitude']),
                        popup=row['Name'],
                        radius=20,
                        fill_color='blue',
                        fill_opacity=0.2
                   ).add_to(carpark_map)


#Apply the add_marker() function to each row (axis=1) of the dataframe
df_placemarks.apply(add_marker, axis=1)

carpark_map

Finally, we save the map as an HTML file. (The HTML file can then be opened as a standalone file, outside of the Jupyter notebook context, directly from your browser.)

In [None]:
carpark_map.save('data/IOWcarparlocations.html')

## YAML

*pandas* does not support YAML imports directly, but it is possible to use a library such as `PyYAML` to load in a YAML file and convert it to a Python `dict` that can then be transformed to a *pandas* `DataFrame`.

WARNING: The `yaml.load()` and `yaml.load_all()` functions should not be used to parse arbitrary content from untrusted sources. These functions are capable of creating arbitrary Python objects, including ones that execute code. The `yaml.safe_load()` and `yaml.safe_load_all()` functions limit that ability to objects that cannot generate executable code.
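The difference can be seen with a short sketch. The string below is a contrived YAML document whose tag asks the loader to call a Python function. The harmless `os.getcwd` was chosen purely for illustration. `yaml.safe_load()` refuses to construct it and raises an error instead.

```python
import yaml

# A contrived document that asks the loader to build its value by calling
# a Python function (an illustrative, harmless choice of function).
untrusted = "!!python/object/apply:os.getcwd []"

try:
    yaml.safe_load(untrusted)
    outcome = 'loaded'
except yaml.YAMLError as err:
    # safe_load() does not recognise Python-specific tags, so it raises an error.
    outcome = 'rejected'
    print(type(err).__name__)

outcome
```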

As with XML, we will tend *not* to focus on the use of YAML, preferring instead JSON and CSV representations.

The `yaml` package is one of many packages that can be used to open and parse YAML  files.

`yaml.load()` and `yaml.safe_load()` will both accept a single document string, and parse it to generate a Python `dict`:

In [None]:
import yaml

document = """
image:
    width: 800
    height: 600
    title:  View from 15th Floor
    thumbnail:
        url: http://www.example.com/image/481989943
        height: 125
        width:  100
        animated : false
    IDs:
        - 116
        - 943
        - 234
        - 38793
"""
parsedYAML = yaml.safe_load(document)
parsedYAML
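One possible way of getting from a nested `dict` like this to a `DataFrame` is the *pandas* `json_normalize()` function, which flattens nested keys into dotted column names. In the sketch below the `dict` is written out by hand, matching the YAML document above, so that the cell can be run on its own. The variable names are illustrative.

```python
import pandas as pd

# The structure that yaml.safe_load() produces for the YAML document above,
# written out by hand so that this cell can be run on its own.
parsed = {
    'image': {
        'width': 800,
        'height': 600,
        'title': 'View from 15th Floor',
        'thumbnail': {
            'url': 'http://www.example.com/image/481989943',
            'height': 125,
            'width': 100,
            'animated': False,
        },
        'IDs': [116, 943, 234, 38793],
    }
}

# Flatten the nested dict into a single-row dataframe;
# nested keys become column names such as 'thumbnail.url'.
flat = pd.json_normalize(parsed['image'])

# A list of values maps more naturally onto a column of its own.
ids = pd.DataFrame({'ID': parsed['image']['IDs']})

flat
```

How best to shape the result depends on the data: deeply nested or irregular YAML may need to be split across several dataframes.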

The `yaml.load()` and `yaml.safe_load()` functions will also accept an open file object (a stream), read its contents, and parse them into a Python `dict`:

In [None]:
# The with block closes the file once it has been parsed
with open('data/document.yaml', 'r') as stream:
    parsed_file = yaml.safe_load(stream)

parsed_file

We can also serialise a `dict` to a YAML string using the `yaml.dump()` function:

In [None]:
yaml.dump(parsedYAML)

If you are interested in exploring Python's handling of YAML further, the `PyYAML` library documentation can be found at  http://pyyaml.org/wiki/PyYAMLDocumentation.

## Summary
In this Notebook you have seen how to:
1. read .xls and .xlsx spreadsheet files
2. handle XML files
3. read KML files and plot the map data using `folium`
4. parse YAML data and load it into a Python dict.


## What next?

That completes the coverage of data file formats for this module; we will make extensive use of CSV and JSON formats in the module and may introduce others as we work through different tools and techniques.

Return to the module materials now.