# Data Carpentry with `Python`

Up until this point, reading in our data has been relatively simple. Learning a programming language takes some effort, but our files have been organized so that we could start manipulating the data as soon as we read it in. More often than not, data will not arrive in this pre-packaged, ready-to-use format. That's where data carpentry comes in.

Data carpentry, sometimes referred to as data cleaning, is the first step after acquiring a dataset, because data are rarely in a format that is ready for analysis. This will become more evident as you start exploring your own datasets to answer a question or predict an outcome.

So, what does "messy" data look like? Take a look at the image below...

![title](../images/messy2.png)

This is the data we will be working with today.

In [1]:
# import the required packages

# pandas relies on the xlrd package to read .xls files, so it
# must be installed. https://pypi.python.org/pypi/xlrd

import pandas as pd
import xlrd

### The Data

This is only a partial dataset on several [species of desert rodents](http://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm). The files that we have used before have all been `csv`s, but `Python` can handle a lot of different types of file formats. The file we will be working with today is in `.xls` format, otherwise known as an Excel file. But when our data looks like it does above, how is `pandas` going to interpret this? Let's find out...

In [2]:
# just view what the data would look like when read by pandas

pd.read_excel('/dsa/data/all_datasets/messy_survey.xls') 

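|   | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 |
|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | 2013 Field Season | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | Species: DM | NaN | NaN | NaN |
| 5 | NaN | NaN | Date Collected | Plot | Sex | Weight |
| 6 | NaN | NaN | 2013-07-16 00:00:00 | 2 | F | NaN |
| 7 | NaN | NaN | 2013-07-16 00:00:00 | 7 | M | 33g |
| 8 | NaN | NaN | 2013-07-16 00:00:00 | 3 | M | NaN |
| 9 | NaN | NaN | 2013-07-16 00:00:00 | 1 | M | NaN |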


Calling this "messy" is an understatement. 

Remember, we are essentially dealing with three different tables in the exact same file. 
You will notice all of the `NaN` values. 
`NaN` stands for "Not a Number"; `pandas` uses it as the default marker for cells that don't contain data. 
We will touch more on how to handle these within a dataset in a bit, but for now, we want to get rid of all of those columns and rows that are entirely `NaN`s. 
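
As a small preview of that cleanup, the sketch below drops rows and columns that are entirely `NaN` using `DataFrame.dropna` with `how="all"`. The `raw` table is a hand-built, abbreviated stand-in for the sheet (an assumption, so the example runs without the Excel file), and its integer column labels differ from the `Unnamed: N` labels that `read_excel` produced.

```python
import pandas as pd

nan = float("nan")

# Hand-built stand-in for a few rows of the messy sheet (not read from the .xls file)
raw = pd.DataFrame(
    [
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, "2013 Field Season", nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, "Species: DM", nan, nan, nan],
        [nan, nan, "Date Collected", "Plot", "Sex", "Weight"],
        [nan, nan, "2013-07-16 00:00:00", 2, "F", nan],
        [nan, nan, "2013-07-16 00:00:00", 7, "M", "33g"],
    ]
)

# how="all" removes a row or column only when every value in it is missing,
# so rows with a single missing Weight are kept
trimmed = raw.dropna(axis="columns", how="all").dropna(axis="index", how="all")

print(trimmed)
```

Note that the title and species heading rows survive this step because they contain text; they need separate handling.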

Besides removing the irrelevant columns and rows, we also need to translate the species table heading into column data.
Our desired data structure would look approximately like this:

![final](../images/dc_cleaned_almost.png)
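
One possible way to turn a `Species: XX` heading into column data is sketched here: mark the heading rows, pull out the code after `Species:`, and fill it down over the rows that follow. This is only an illustration under assumptions; the `rows` table is hand-built, its column names `first` and `plot` are hypothetical, and the `DO` record is made up to show a second table.

```python
import pandas as pd

# Hand-built stand-in: two stacked tables, each introduced by a "Species: XX" row.
# The DO record is invented for illustration only.
rows = pd.DataFrame(
    {
        "first": [
            "Species: DM", "Date Collected", "2013-07-16", "2013-07-16",
            "Species: DO", "Date Collected", "2013-07-18",
        ],
        "plot": [None, "Plot", 2, 7, None, "Plot", 3],
    }
)

first = rows["first"].astype(str)
is_species = first.str.startswith("Species:")   # heading rows
is_header = first.eq("Date Collected")          # repeated column-name rows

# Keep the text only on heading rows, strip the "Species:" prefix,
# then forward-fill so each data row inherits the heading above it
species = (
    first.where(is_species)
    .str.replace("Species:", "", regex=False)
    .str.strip()
    .ffill()
)

# Drop the heading and column-name rows, and attach the new column
tidy = rows.loc[~is_species & ~is_header].assign(species=species)

print(tidy)
```

The forward fill plays the same role as dragging a cell's corner down a column in a spreadsheet.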

# The Manual Process

Before we write code to clean these files, we will walk through cleaning one manually in Microsoft Excel.

![starting](../images/dc_p1.png)

## Remove Irrelevant Columns

#### First we need to select some columns and delete them:

![delete columns](../images/dc_delete_colmns.png)

## Add a Species Column

#### We need to get the **Species: XX** value moved into a column:

![species column](../images/dc_species_column.png)

#### Copy/Type the value into a cell

![add dm once](../images/dc_add_species_dm_one.png)

#### Fill it down with a corner drag

![fill dm](../images/dc_fill_species_DM.png)



## Repeat with DO

![fill do](../images/dc_species_fill_DO.png)

## Repeat with DS

![fill ds](../images/dc_fill_species_DS.png)


## Remove Irrelevant Rows

#### Remove the extra top rows

![dc_delete_rows_menu](../images/dc_delete_rows_menu.png)

#### Remove the rows between tables

![dc_delete_rows_2](../images/dc_delete_rows_2.png)

## Finally, almost clean!

![dc_cleaned_almost](../images/dc_cleaned_almost.png)

## Seems easy enough

## <span style="background:yellow">... but wait</span>

### You just inherited 100 of these files

### or **1000**!


This is a key aspect of data carpentry!
Files that are not well structured for ingestion into computational environments must be cleaned up first.
If you have a single file, it can easily be managed with familiar "tools" like Excel or a text editor.
However, if you need to automate the process for hundreds, thousands, or millions of files... you need the right tool!
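
To give a sense of what that automation might look like, here is a minimal sketch that applies one cleanup step to every matching file in a folder and stacks the results. The function names `drop_empty` and `clean_folder`, the `source_file` column, and the `reader` parameter are all hypothetical choices for this illustration, and the cleanup shown is only the "drop empty rows and columns" step.

```python
from pathlib import Path

import pandas as pd


def drop_empty(df):
    """Drop columns, then rows, that contain no data at all."""
    return df.dropna(axis="columns", how="all").dropna(axis="index", how="all")


def clean_folder(folder, pattern="*.xls", reader=pd.read_excel):
    """Read each matching file in `folder`, clean it, and stack the results.

    `reader` is any function that takes a path plus header=None and
    returns a DataFrame (pd.read_excel by default).
    """
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        cleaned = drop_empty(reader(path, header=None))
        # Record which file each row came from
        frames.append(cleaned.assign(source_file=path.name))
    if not frames:
        return pd.DataFrame()  # nothing matched the pattern
    return pd.concat(frames, ignore_index=True)
```

Calling `clean_folder` on a directory of spreadsheets would then replace the repeated point-and-click steps with one function call, however many files there are.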

## Please continue to the next data carpentry notebook to see this file cleaned with `Python`

   * [Data Carpentry with `Python`, part 2](./data_carpentry_python2.ipynb)

# Save your notebook, then `File > Close and Halt`