# Data Carpentry with `Python`, part 2

Recall our messy data from part one and all the manual steps we needed to clean things up.

![title](../images/messy2.png)

### The Data

This is only a partial dataset on several [species of desert rodents](http://esapubs.org/archive/ecol/E090/118/Portal_rodent_metadata.htm). 
The files that we have used before have all been `csv`s, but `Python` can handle a lot more types of file formats. 
The file we will be working with today is in `.xls` format, otherwise known as an Excel file. 

To clean up this data, we are going to use Pandas and the `xlrd` (Excel Reader) library.
  * External API Link: https://pypi.python.org/pypi/xlrd
  
  

In [2]:
# import the required packages

# xlrd is a package for developers to extract data from
# Excel spreadsheets. https://pypi.python.org/pypi/xlrd

import pandas as pd 
import xlrd 

#### Read the File with Pandas

One thing to note here: we have not yet used the `header` parameter. 
Normally, when a data frame object is created, the first row in the file is interpreted as the header row, 
and often this is okay. 
But in its current state, our data is in no shape to have a header row yet. 
Plus, the first row in the dataset for this Excel file is an empty line.
With `header = None`, `pandas` does not take column names from the file and simply numbers the columns 0, 1, 2, and so on. 
If it does read a header row whose cells are empty, it names those columns **`Unnamed: N`** instead.
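
As a small illustration of what `header = None` changes, the sketch below reads the same text twice. It uses `pd.read_csv` on an in-memory string, an assumption made only so the example runs without the Excel file; `pd.read_excel` accepts the same `header` parameter. The toy data and variable names are hypothetical.

```python
import io
import pandas as pd

# Toy data (hypothetical): one header line followed by two data lines
raw = "date,plot\n2013-07-16,2\n2013-07-16,7\n"

# Default behaviour: the first line becomes the column names
with_header = pd.read_csv(io.StringIO(raw))

# header=None: every line is treated as data, columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(list(with_header.columns), with_header.shape)
print(list(no_header.columns), no_header.shape)
```

With `header=None` the frame gains one extra row, because the line that would have been the header is kept as data.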

In [4]:
file = pd.read_excel('/dsa/data/all_datasets/messy_survey.xls', header = None)

# now it is saved to the object `file`

file

Unnamed: 0,0,1,2,3,4,5
0,,,,,,
1,,,,,,
2,,,2013 Field Season,,,
3,,,,,,
4,,,,,,
5,,,Species: DM,,,
6,,,Date Collected,Plot,Sex,Weight
7,,,2013-07-16 00:00:00,2,F,
8,,,2013-07-16 00:00:00,7,M,33g
9,,,2013-07-16 00:00:00,3,M,


#### Remember, we are essentially dealing with three different tables in the exact same file. 

You will notice all of the `NaN` values. 
This stands for "Not a Number" and is the default value for those cells that don't contain data.

Recall our steps for file clean up:
  1. Remove _no data_ columns
  1. Add species column, and fill down
  1. Remove non-header rows
  1. Remove rows between sub-tables to create the unified table.




# Step 1:  Remove 'no data' columns

The _no data_ columns are the first two columns of our data.
We can remove them by selecting only the columns that contain data.
Column and row indexes start at 0, and with a little `pandas` knowledge we can select a range of columns.

This is done with the `.iloc[ rows , columns ]` syntax.

**Example**
```
file.iloc[ 0: , 2: ]
```
In this case, we are asking for all rows because it is "`0:`", and columns numbered 2 to the end with "`2:`"
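
To see the slicing rules in isolation, here is a minimal sketch on a tiny, made-up data frame (the name `toy` and its values are hypothetical, chosen to mimic two empty leading columns).

```python
import pandas as pd

# Toy frame (hypothetical): two empty columns followed by two data columns
toy = pd.DataFrame([
    [None, None, "a", 1],
    [None, None, "b", 2],
    [None, None, "c", 3],
])

# All rows ("0:"), columns from position 2 to the end ("2:")
data_columns = toy.iloc[0:, 2:]

# Rows from position 1 to the end ("1:"), all columns ("0:")
later_rows = toy.iloc[1:, 0:]

print(data_columns)
print(later_rows)
```

Note that the selected pieces keep their original labels: `data_columns` still has columns labeled 2 and 3, and `later_rows` still starts at row label 1.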


**Note:** the `.head()` method below shows only the first few rows of the Pandas data frame (five by default, or the number you pass in).


In [5]:
file.iloc[ 0:, 2: ].head(10)

Unnamed: 0,2,3,4,5
0,,,,
1,,,,
2,2013 Field Season,,,
3,,,,
4,,,,
5,Species: DM,,,
6,Date Collected,Plot,Sex,Weight
7,2013-07-16 00:00:00,2,F,
8,2013-07-16 00:00:00,7,M,33g
9,2013-07-16 00:00:00,3,M,


**Example**
```
file.iloc[ 5: , 0: ]
```
In this case, we are asking for all columns because it is "`0:`", and rows numbered 5 to the end with "`5:`"


In [6]:
file.iloc[ 5: , 0: ].head()

Unnamed: 0,0,1,2,3,4,5
5,,,Species: DM,,,
6,,,Date Collected,Plot,Sex,Weight
7,,,2013-07-16 00:00:00,2,F,
8,,,2013-07-16 00:00:00,7,M,33g
9,,,2013-07-16 00:00:00,3,M,


#### Resetting Indexes after selection
The `.reset_index(drop = True)` method on the Pandas data frame renumbers the rows starting from 0; `drop = True` discards the old index instead of keeping it as a new column.
Notice that in the output above, the first row listed is indexed as **5**.

And below, we can _fix_ it to be back to 0-based indexing for the rows.
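
The effect of the `drop` argument is easy to check on a tiny example. This is a minimal sketch with a made-up frame (`toy` is hypothetical), not part of the survey data.

```python
import pandas as pd

toy = pd.DataFrame({"value": ["a", "b", "c", "d"]})

# Selecting rows 2 onward keeps the original row labels 2 and 3
tail = toy.iloc[2:]

# drop=True throws the old labels away and renumbers from 0
renumbered = tail.reset_index(drop=True)

# Without drop=True the old labels are kept as a new column named "index"
kept = tail.reset_index()

print(list(tail.index))
print(list(renumbered.index))
print(list(kept.columns))
```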

In [7]:
file.iloc[ 5: , 0: ].reset_index(drop = True).head()

Unnamed: 0,0,1,2,3,4,5
0,,,Species: DM,,,
1,,,Date Collected,Plot,Sex,Weight
2,,,2013-07-16 00:00:00,2,F,
3,,,2013-07-16 00:00:00,7,M,33g
4,,,2013-07-16 00:00:00,3,M,


Wait, weren't we trying to complete step 1?
**Notice that most of Step 3 was just accomplished!**

## Combining step 1 & partial-step 3 of our process!

To keep our source `file` data as-is, we will cut the columns and create a new variable with the result.
We are going to call this modified data frame **`messy`**.


In [8]:
# Create new variable
messy = file.iloc[5:,2:].reset_index(drop = True)
messy.head(20)  # optional: number of rows to show in head

Unnamed: 0,2,3,4,5
0,Species: DM,,,
1,Date Collected,Plot,Sex,Weight
2,2013-07-16 00:00:00,2,F,
3,2013-07-16 00:00:00,7,M,33g
4,2013-07-16 00:00:00,3,M,
5,2013-07-16 00:00:00,1,M,
6,2013-07-18 00:00:00,3,M,40g
7,2013-07-18 00:00:00,7,M,48g
8,2013-07-18 00:00:00,4,F,29g
9,2013-07-18 00:00:00,4,F,46g


# Step 2: Add species column, and fill down

So this next section may be a little complicated, particularly if you are new to programming, 
but we will take this step-by-step.

First we will define the species cell values, which we saw are:
  * `Species: DM`
  * `Species: DO`
  * `Species: DS`

We will put these into a list as the names of the sub tables.

In [9]:
table_names = ["Species: DM", "Species: DO", "Species: DS"]
table_names

['Species: DM', 'Species: DO', 'Species: DS']

As you learn programming, you learn tricks and patterns to solve certain types of problems.
The need to fill a value down to a point then reset to a new value is a common pattern of tasks.

Notice above that the column labeled 2 (column labels are not reset the way row indexes are) holds the sub-table names we put into our list. Let's investigate that column.

In [10]:
messy[2]

0             Species: DM
1          Date Collected
2     2013-07-16 00:00:00
3     2013-07-16 00:00:00
4     2013-07-16 00:00:00
5     2013-07-16 00:00:00
6     2013-07-18 00:00:00
7     2013-07-18 00:00:00
8     2013-07-18 00:00:00
9     2013-07-18 00:00:00
10    2013-07-18 00:00:00
11    2013-07-18 00:00:00
12    2013-07-18 00:00:00
13    2013-07-18 00:00:00
14    2013-07-18 00:00:00
15    2013-07-18 00:00:00
16                    NaN
17            Species: DO
18         Date Collected
19    2013-08-19 00:00:00
20    2013-10-17 00:00:00
21    2013-10-17 00:00:00
22    2013-10-17 00:00:00
23    2013-10-17 00:00:00
24    2013-10-18 00:00:00
25    2013-11-12 00:00:00
26    2013-11-12 00:00:00
27    2013-11-14 00:00:00
28    2013-12-10 00:00:00
29    2013-12-10 00:00:00
30    2013-12-11 00:00:00
31                    NaN
32            Species: DS
33         Date Collected
34    2013-11-12 00:00:00
35    2013-11-12 00:00:00
36    2013-11-12 00:00:00
37    2013-11-12 00:00:00
38    2013-1

We can use the `.isin()` method of a Pandas Series (a single column of a data frame) to test if each value is in a list of values.
Recall our list of values:
```
['Species: DM', 'Species: DO', 'Species: DS']
```

This gives us a column of True or False values, one per row, based on whether the value in that row appears in the list.
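
Two details of `.isin()` are worth seeing on a toy column: the match must be exact, and missing values count as "not in the list". The sketch below uses made-up values (`column` and `names` are hypothetical).

```python
import numpy as np
import pandas as pd

names = ["Species: DM", "Species: DO"]

# Toy column: two exact matches, a header cell, a missing value, a near miss
column = pd.Series(["Species: DM", "Date Collected", np.nan, "Species: DO", "species: dm"])

flags = column.isin(names)
print(flags)
```

Only the first and fourth values are flagged as `True`; the lowercase `"species: dm"` and the `NaN` both come back `False`.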

In [11]:
messy[2].isin(table_names)

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17     True
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32     True
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
Name: 2, dtype: bool

#### So... 

What can we do with this list of values?
Well, as it happens, Python treats booleans as numbers in arithmetic, which works to our advantage:
  * `True` = 1
  * `False` = 0
  
This allows us to process this `True/False` column with another function, _cumulative summation_, `.cumsum()`.

So, adding True and False is `1 + 0`, and if we keep accumulating down the column, the running total increases by 1 every time we hit `True`.

A couple of examples.

In [12]:
pd.DataFrame([True, False, True]).cumsum()

Unnamed: 0,0
0,1
1,1
2,2


In [13]:
pd.DataFrame([True, False, True, False, True, True, True, False, False]).cumsum()

Unnamed: 0,0
0,1
1,1
2,2
3,2
4,3
5,4
6,5
7,5
8,5


#### Therefore... 

We can use this technique to apply labels, where group 1 is all the `Species: DM` rows, 2 is `Species: DO`, and 3 is `Species: DS`. This works because of the shape of our data. If you recall, `Species: DM` is a "table" within the Excel file, then `Species: DO` comes next, and finally `Species: DS`.

In [14]:
# Capture this processing (label creation) as a new variable, groups
groups = messy[2].isin(table_names).cumsum()
print(groups)

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    2
18    2
19    2
20    2
21    2
22    2
23    2
24    2
25    2
26    2
27    2
28    2
29    2
30    2
31    2
32    3
33    3
34    3
35    3
36    3
37    3
38    3
39    3
40    3
41    3
42    3
43    3
44    3
Name: 2, dtype: int64


We now have a list of values we can use to label the rows.
We can use the Pandas data frame `.groupby()` function to partition a data frame into sets of rows that share the same label.

In this case, the rows that align to the number 1 will be one group, 2 another, and 3 a third group of rows.

The result is a new type of object, a data frame GroupBy:
```
pandas.core.groupby.generic.DataFrameGroupBy
```
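
Grouping by a separate column of labels, rather than by a column name, can be sketched on a toy frame. The names `toy` and `labels` are hypothetical; the labels play the role of the `cumsum()` result.

```python
import pandas as pd

toy = pd.DataFrame({"value": ["a", "b", "c", "d", "e"]})

# One label per row, aligned by row index
labels = pd.Series([1, 1, 2, 2, 2])

grouped = toy.groupby(labels)

# Number of rows that landed in each group
sizes = grouped.size()
print(sizes)

# Iterating yields (label, sub-frame) pairs
for key, rows in grouped:
    print(key, list(rows["value"]))
```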


In [15]:
messy.groupby(groups)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f80ea5f7358>

This code will expand back out the data frame Group By object to show you the effects.

In [16]:
# groupby yields (key, group) pairs, one per label:
# k     g
# 1 -> rows 0 to 16
# 2 -> rows 17 to 31
# 3 -> rows 32 to 44
for k,g in messy.groupby(groups):
    print("k = {}, g = {}".format(k,g))

k = 1, g =                       2     3    4       5
0           Species: DM   NaN  NaN     NaN
1        Date Collected  Plot  Sex  Weight
2   2013-07-16 00:00:00     2    F     NaN
3   2013-07-16 00:00:00     7    M     33g
4   2013-07-16 00:00:00     3    M     NaN
5   2013-07-16 00:00:00     1    M     NaN
6   2013-07-18 00:00:00     3    M     40g
7   2013-07-18 00:00:00     7    M     48g
8   2013-07-18 00:00:00     4    F     29g
9   2013-07-18 00:00:00     4    F     46g
10  2013-07-18 00:00:00     7    M     36g
11  2013-07-18 00:00:00     7    F     35g
12  2013-07-18 00:00:00     8    F     22g
13  2013-07-18 00:00:00     7    F     42g
14  2013-07-18 00:00:00     4    F     41g
15  2013-07-18 00:00:00     6    F     37g
16                  NaN   NaN  NaN     NaN
k = 2, g =                       2     3    4       5
17          Species: DO   NaN  NaN     NaN
18       Date Collected  Plot  Sex  Weight
19  2013-08-19 00:00:00     8    F      52
20  2013-10-17 00:00:00     3   

#### Notice that ... 
We have now set up the data into 3 sub-tables, `k` = 1, 2, and 3.

So, let's use some more appropriate variable names:
  * `species` as the group label produced by the `cumsum()` function.
  * `rows` as the collection of rows that have the same label.


In [17]:
for species,rows in messy.groupby(groups):
    print("species = {}, rows = {}".format(species,rows))

species = 1, rows =                       2     3    4       5
0           Species: DM   NaN  NaN     NaN
1        Date Collected  Plot  Sex  Weight
2   2013-07-16 00:00:00     2    F     NaN
3   2013-07-16 00:00:00     7    M     33g
4   2013-07-16 00:00:00     3    M     NaN
5   2013-07-16 00:00:00     1    M     NaN
6   2013-07-18 00:00:00     3    M     40g
7   2013-07-18 00:00:00     7    M     48g
8   2013-07-18 00:00:00     4    F     29g
9   2013-07-18 00:00:00     4    F     46g
10  2013-07-18 00:00:00     7    M     36g
11  2013-07-18 00:00:00     7    F     35g
12  2013-07-18 00:00:00     8    F     22g
13  2013-07-18 00:00:00     7    F     42g
14  2013-07-18 00:00:00     4    F     41g
15  2013-07-18 00:00:00     6    F     37g
16                  NaN   NaN  NaN     NaN
species = 2, rows =                       2     3    4       5
17          Species: DO   NaN  NaN     NaN
18       Date Collected  Plot  Sex  Weight
19  2013-08-19 00:00:00     8    F      52
20  2013-10-17

#### Looking at the rows... 
The cell in the first row and first column of each sub-table holds the species name.
We can access that cell with the Pandas data frame `.iloc[row,col]`.

In [18]:
for species,rows in messy.groupby(groups):
    print(rows.iloc[0,0])

Species: DM
Species: DO
Species: DS


Similarly, we can select all the rows from position 1 to the end by passing a single slice to `.iloc`, as in `.iloc[1:]`.

#### NOTE: This completes step 3 by removing the Species name row.  

In [19]:
for species,rows in messy.groupby(groups):
    print(rows.iloc[1:])

                      2     3    4       5
1        Date Collected  Plot  Sex  Weight
2   2013-07-16 00:00:00     2    F     NaN
3   2013-07-16 00:00:00     7    M     33g
4   2013-07-16 00:00:00     3    M     NaN
5   2013-07-16 00:00:00     1    M     NaN
6   2013-07-18 00:00:00     3    M     40g
7   2013-07-18 00:00:00     7    M     48g
8   2013-07-18 00:00:00     4    F     29g
9   2013-07-18 00:00:00     4    F     46g
10  2013-07-18 00:00:00     7    M     36g
11  2013-07-18 00:00:00     7    F     35g
12  2013-07-18 00:00:00     8    F     22g
13  2013-07-18 00:00:00     7    F     42g
14  2013-07-18 00:00:00     4    F     41g
15  2013-07-18 00:00:00     6    F     37g
16                  NaN   NaN  NaN     NaN
                      2     3    4       5
18       Date Collected  Plot  Sex  Weight
19  2013-08-19 00:00:00     8    F      52
20  2013-10-17 00:00:00     3    F      33
21  2013-10-17 00:00:00     3    F      50
22  2013-10-17 00:00:00    17    F      48
23  2013-10

Based on these two manipulations, we can create a `Key->Value` data structure, called a **dictionary**

In [20]:
tables = {} # an empty dictionary variable named tables

for species,rows in messy.groupby(groups):  # For each species and its rows:
    tables[rows.iloc[0,0]] = rows.iloc[1:]  # store into the tables variable
   
print(tables)

{'Species: DM':                       2     3    4       5
1        Date Collected  Plot  Sex  Weight
2   2013-07-16 00:00:00     2    F     NaN
3   2013-07-16 00:00:00     7    M     33g
4   2013-07-16 00:00:00     3    M     NaN
5   2013-07-16 00:00:00     1    M     NaN
6   2013-07-18 00:00:00     3    M     40g
7   2013-07-18 00:00:00     7    M     48g
8   2013-07-18 00:00:00     4    F     29g
9   2013-07-18 00:00:00     4    F     46g
10  2013-07-18 00:00:00     7    M     36g
11  2013-07-18 00:00:00     7    F     35g
12  2013-07-18 00:00:00     8    F     22g
13  2013-07-18 00:00:00     7    F     42g
14  2013-07-18 00:00:00     4    F     41g
15  2013-07-18 00:00:00     6    F     37g
16                  NaN   NaN  NaN     NaN, 'Species: DO':                       2     3    4       5
18       Date Collected  Plot  Sex  Weight
19  2013-08-19 00:00:00     8    F      52
20  2013-10-17 00:00:00     3    F      33
21  2013-10-17 00:00:00     3    F      50
22  2013-10-17 00:00:0

#### In Python, there is always an alternative way to accomplish a task:

In [21]:
# This creates the same dictionary as above, using a "dictionary comprehension" (aka fancy Python)
tables = {rows.iloc[0,0]: rows.iloc[1:] for species,rows in messy.groupby(groups)}

############
# t={v.iloc[0,0]: v.iloc[1:] for k,v in m.groupby(g)}
############


print(tables)

{'Species: DM':                       2     3    4       5
1        Date Collected  Plot  Sex  Weight
2   2013-07-16 00:00:00     2    F     NaN
3   2013-07-16 00:00:00     7    M     33g
4   2013-07-16 00:00:00     3    M     NaN
5   2013-07-16 00:00:00     1    M     NaN
6   2013-07-18 00:00:00     3    M     40g
7   2013-07-18 00:00:00     7    M     48g
8   2013-07-18 00:00:00     4    F     29g
9   2013-07-18 00:00:00     4    F     46g
10  2013-07-18 00:00:00     7    M     36g
11  2013-07-18 00:00:00     7    F     35g
12  2013-07-18 00:00:00     8    F     22g
13  2013-07-18 00:00:00     7    F     42g
14  2013-07-18 00:00:00     4    F     41g
15  2013-07-18 00:00:00     6    F     37g
16                  NaN   NaN  NaN     NaN, 'Species: DO':                       2     3    4       5
18       Date Collected  Plot  Sex  Weight
19  2013-08-19 00:00:00     8    F      52
20  2013-10-17 00:00:00     3    F      33
21  2013-10-17 00:00:00     3    F      50
22  2013-10-17 00:00:0

#### What do these tables look like now?

Let's take a pause in our cleaning process, and print out what these separated tables look like now. We will create a for loop that iterates over each table, prints out its name and then the values.

In [22]:
list(tables)

['Species: DM', 'Species: DO', 'Species: DS']

In [23]:
for species, tab in tables.items():  # pull out each Key->Value pair in the tables variable
    print("table:", species)
    print(tab)
    print()

table: Species: DM
                      2     3    4       5
1        Date Collected  Plot  Sex  Weight
2   2013-07-16 00:00:00     2    F     NaN
3   2013-07-16 00:00:00     7    M     33g
4   2013-07-16 00:00:00     3    M     NaN
5   2013-07-16 00:00:00     1    M     NaN
6   2013-07-18 00:00:00     3    M     40g
7   2013-07-18 00:00:00     7    M     48g
8   2013-07-18 00:00:00     4    F     29g
9   2013-07-18 00:00:00     4    F     46g
10  2013-07-18 00:00:00     7    M     36g
11  2013-07-18 00:00:00     7    F     35g
12  2013-07-18 00:00:00     8    F     22g
13  2013-07-18 00:00:00     7    F     42g
14  2013-07-18 00:00:00     4    F     41g
15  2013-07-18 00:00:00     6    F     37g
16                  NaN   NaN  NaN     NaN

table: Species: DO
                      2     3    4       5
18       Date Collected  Plot  Sex  Weight
19  2013-08-19 00:00:00     8    F      52
20  2013-10-17 00:00:00     3    F      33
21  2013-10-17 00:00:00     3    F      50
22  2013-10-17 

Remember that each table is now a dictionary entry, so in the `for` loop we specify **`for species, tab in tables.items():`**. 

`species` stands for key, 
`tab` for value. 

This is how you iterate over dictionaries. 

We then print the `species` value first, then the `tab` variable (the data within the table). 
The last `print` statement just creates a space between one table and the next.
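
The same iteration pattern works on any dictionary. Here is a minimal sketch with a small, made-up dictionary (`weights` and its values are hypothetical):

```python
# Keys are species abbreviations, values are lists of weights
weights = {"DM": [33, 40], "DO": [52]}

# .items() hands back one (key, value) pair per loop pass,
# in the order the entries were inserted
for species, values in weights.items():
    print("table:", species)
    print(values)
    print()

pairs = list(weights.items())
```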

### Creating the dataframe

Now is the time to put the pieces back together. We really want to create a single data frame and create a new variable specifying the species. Take a look at how we use a for loop below to iterate over `tables.items()` again.

In [24]:
dfs = [] # an empty list where we will store our separate dataframes

for species, tab in tables.items(): # iterate over the key/value pairs in the tables dictionary
    
    single_frame = tab.copy() # copy each table so the one stored in the dictionary is left unchanged
    
    single_frame['species'] = species # create a new column called species and fill it with the name of the data frame
    
    single_frame = single_frame.reset_index(drop=True).iloc[1:] # reset index and remove first row (header row)
    
    single_frame.columns = ['date','plot','sex','weight','species'] # rename the columns
    
    dfs.append(single_frame) # add to the list of separate data frames

df = pd.concat(dfs).reset_index(drop = True) # join the dataframes together into one data frame
df # return the complete data frame    

Unnamed: 0,date,plot,sex,weight,species
0,2013-07-16 00:00:00,2.0,F,,Species: DM
1,2013-07-16 00:00:00,7.0,M,33g,Species: DM
2,2013-07-16 00:00:00,3.0,M,,Species: DM
3,2013-07-16 00:00:00,1.0,M,,Species: DM
4,2013-07-18 00:00:00,3.0,M,40g,Species: DM
5,2013-07-18 00:00:00,7.0,M,48g,Species: DM
6,2013-07-18 00:00:00,4.0,F,29g,Species: DM
7,2013-07-18 00:00:00,4.0,F,46g,Species: DM
8,2013-07-18 00:00:00,7.0,M,36g,Species: DM
9,2013-07-18 00:00:00,7.0,F,35g,Species: DM


There are a lot of steps in this for loop. For that reason, each line of code is commented to describe what each piece is doing. 

We are now in pretty good shape, but there are still some things we need to do before we can consider this tidy. One thing that should be noticeable right off the bat is that there are rows of data that are almost completely `NaN` values, save for the `species` column. Why is that? 

Let's take a look at the picture again...

![title](../images/messy2.png)

Notice that between each table there is a blank row. That is the very reason for these lines. Essentially, these are residual from our initial reading in of the file and they have remained throughout the process. Well, now it is time to get rid of them.

For this, we will use the Pandas `.dropna()` function to remove rows that have NaN values, then reset the row indexing as we did previously.

**Example** 
```
df.dropna().reset_index(drop = True)
```
However, if we drop every row that contains any `NaN`, we will also lose the Species DM rows whose weight is blank.
Instead, we need to limit the `NaN` check to the `date` column.
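
The difference between the two calls shows up clearly on a three-row toy frame. This is a sketch with made-up values (`toy` is hypothetical): one complete row, one row missing only its weight, and one leftover blank row.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date":    ["2013-07-16", "2013-07-16", np.nan],
    "weight":  [np.nan, "33g", np.nan],
    "species": ["Species: DM", "Species: DM", "Species: DM"],
})

# Drops every row that has a NaN in ANY column
drop_any = toy.dropna()

# Drops only rows whose date is NaN
drop_blank_rows = toy.dropna(subset=["date"])

print(len(drop_any), len(drop_blank_rows))
```

The plain `dropna()` keeps a single row, while the `subset` version keeps the row with the missing weight as well.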

In [25]:
# Re-assign the result of removing NaN date rows and re-index the rows of the data frame
df = df.dropna(subset = ['date']).reset_index(drop = True)
df

Unnamed: 0,date,plot,sex,weight,species
0,2013-07-16 00:00:00,2,F,,Species: DM
1,2013-07-16 00:00:00,7,M,33g,Species: DM
2,2013-07-16 00:00:00,3,M,,Species: DM
3,2013-07-16 00:00:00,1,M,,Species: DM
4,2013-07-18 00:00:00,3,M,40g,Species: DM
5,2013-07-18 00:00:00,7,M,48g,Species: DM
6,2013-07-18 00:00:00,4,F,29g,Species: DM
7,2013-07-18 00:00:00,4,F,46g,Species: DM
8,2013-07-18 00:00:00,7,M,36g,Species: DM
9,2013-07-18 00:00:00,7,F,35g,Species: DM


# Column Clean Up

We have achieved most of our carpentry goal; now we just need to reformat the species column.
Having "Species:" repeated for every single value is unnecessary. 
We are going to go ahead and remove this segment, leaving only the species abbreviations.

Again, there are several ways to do this, but the method below takes on a familiar format that we have seen before. 

Note: The code below relies on 
  1. String `.split()` to convert a string into a list of strings
  1. List `.pop()` to remove and return the last item in the list.
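
The two string and list operations can be tried on a single value first. The sketch below also shows a vectorized alternative using the pandas `.str` accessor; that alternative is not the method used in this notebook, only another way to reach the same result.

```python
import pandas as pd

label = "Species: DM"

parts = label.split(" ")    # splits on the space into two strings
abbreviation = parts.pop()  # removes and returns the LAST item

print(abbreviation, parts)

# Alternative (not the notebook's method): do the same for a whole column at once
column = pd.Series(["Species: DM", "Species: DO", "Species: DS"])
short = column.str.split(" ").str[-1]
print(list(short))
```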

In [26]:
abbrv = []  # Start an empty list

# for every row in the species column
for i in df.species:
    # Append to the list
    abbrv.append(
        i.split(" ").pop() # split "Species: DX" into two elements and take the last one (the abbreviation)
    )
    
# The above for loop produced a list of Species, make that list the new species column
df['species'] = abbrv

df.head(20)

Unnamed: 0,date,plot,sex,weight,species
0,2013-07-16 00:00:00,2,F,,DM
1,2013-07-16 00:00:00,7,M,33g,DM
2,2013-07-16 00:00:00,3,M,,DM
3,2013-07-16 00:00:00,1,M,,DM
4,2013-07-18 00:00:00,3,M,40g,DM
5,2013-07-18 00:00:00,7,M,48g,DM
6,2013-07-18 00:00:00,4,F,29g,DM
7,2013-07-18 00:00:00,4,F,46g,DM
8,2013-07-18 00:00:00,7,M,36g,DM
9,2013-07-18 00:00:00,7,F,35g,DM


Let's take a look at the `df['weight']` column. 
This is a variable that we would like to find some stats about, but in its current state we can't do that. 

Take a look at the column again. 
Some of the values have a "g" added to the number indicating grams. 
It is this "g" that we want to remove so that we can start running some stats on it. 
Below is one way to do that.

In [27]:
# for this we need to import the numpy package (numerical python)
import numpy as np

# Start an empty list
nums = []
# for each value in the column
for i in df.weight:
    # if the value i pulled from the column is missing (NaN)
    if pd.isnull(i):
        val = np.nan # set the temporary variable to Not a Number
    else:
        # Otherwise, join all the digits into a single string in the temporary val variable
        val = ''.join(c for c in str(i) if c.isdigit())
    # Add the variable to the list, either as NaN or as a string of digits.
    nums.append(val)
    
# Assign this list of cleaned weight values to the weight column
df['weight'] = nums

df.head(20)

Unnamed: 0,date,plot,sex,weight,species
0,2013-07-16 00:00:00,2,F,,DM
1,2013-07-16 00:00:00,7,M,33.0,DM
2,2013-07-16 00:00:00,3,M,,DM
3,2013-07-16 00:00:00,1,M,,DM
4,2013-07-18 00:00:00,3,M,40.0,DM
5,2013-07-18 00:00:00,7,M,48.0,DM
6,2013-07-18 00:00:00,4,F,29.0,DM
7,2013-07-18 00:00:00,4,F,46.0,DM
8,2013-07-18 00:00:00,7,M,36.0,DM
9,2013-07-18 00:00:00,7,F,35.0,DM


We see that the "g" characters are removed.

The method is similar to how we modified the `species` column, except that we added a conditional. We need the conditional because a `NaN` value contains no digits: without the check it would turn into an empty string, which cannot be converted to a number later, and we do not want to lose the missing-value marker. So we first check whether the value is `NaN`. If it is, we keep it as `NaN`; otherwise, we keep only the digit characters and join them back together.
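
For comparison, here is a sketch of a vectorized alternative that avoids the explicit loop. It is not the approach used in this notebook; it assumes every weight is either missing or a whole number optionally followed by a unit, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy weights: a missing value, strings with a unit, and a plain number
weights = pd.Series([np.nan, "33g", 52, "40g"], dtype=object)

# Turn everything into text, then pull out the first run of digits;
# rows with no digits (the missing value) come back as NaN
digits = weights.astype(str).str.extract(r"(\d+)", expand=False)

# Convert the digit strings to numbers, leaving NaN where nothing matched
numeric = pd.to_numeric(digits)
print(numeric)
```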

#### But there is one problem... 
Our column still holds strings, and we cannot compute a numerical mean on strings. 

Therefore, we will convert it to a float type using the pandas `.astype()` method.

In [28]:
df['weight'] = df['weight'].astype(float)

In [29]:
df.weight

0       NaN
1      33.0
2       NaN
3       NaN
4      40.0
5      48.0
6      29.0
7      46.0
8      36.0
9      35.0
10     22.0
11     42.0
12     41.0
13     37.0
14     52.0
15     33.0
16     50.0
17     48.0
18     31.0
19     41.0
20     44.0
21     48.0
22     39.0
23     40.0
24     45.0
25     41.0
26    117.0
27    121.0
28    115.0
29    120.0
30    118.0
31    126.0
32    132.0
33    113.0
34    122.0
35    107.0
36    115.0
Name: weight, dtype: float64

Now what to do about those pesky `NaN`s? Well, we could remove rows with those values, or we can fill them in. Below we fill them in with the mean weight, that is, the average over the whole dataset.

But let's take a look at the column again. We will copy the data frame to a frame called `cleaned`.

In [30]:
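cleaned = df.copy()  # .copy() makes an independent frame; plain `cleaned = df` would only add a second name for df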
cleaned['weight']

0       NaN
1      33.0
2       NaN
3       NaN
4      40.0
5      48.0
6      29.0
7      46.0
8      36.0
9      35.0
10     22.0
11     42.0
12     41.0
13     37.0
14     52.0
15     33.0
16     50.0
17     48.0
18     31.0
19     41.0
20     44.0
21     48.0
22     39.0
23     40.0
24     45.0
25     41.0
26    117.0
27    121.0
28    115.0
29    120.0
30    118.0
31    126.0
32    132.0
33    113.0
34    122.0
35    107.0
36    115.0
Name: weight, dtype: float64

In [31]:
cleaned['weight'] = cleaned['weight'].fillna(cleaned['weight'].mean())

And that is how you would fill in the `NaN` values with the mean weight. 
This allows us to keep the information in the other columns, which may be useful, 
without changing the overall mean of the weight column. 
Take a look at our final frame below.

**To think about:** Is there a better way to fill in these `NaN`s?
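
One possible answer, offered as a sketch and not as the only reasonable choice: since the species differ a lot in weight, each missing value could be filled with the mean of its own species. The toy frame below is hypothetical; `groupby(...).transform("mean")` returns one group mean per row, aligned with the original rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["DM", "DM", "DM", "DS", "DS", "DS"],
    "weight":  [30.0, np.nan, 40.0, 120.0, 110.0, np.nan],
})

# Mean weight of each row's own species, repeated for every row of that species
species_mean = toy.groupby("species")["weight"].transform("mean")

# Fill each gap with the mean of the matching species
toy["weight_filled"] = toy["weight"].fillna(species_mean)
print(toy)
```

Here the missing DM weight becomes 35.0 and the missing DS weight becomes 115.0, where a single overall mean would have given both the same value.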



In [32]:
cleaned

Unnamed: 0,date,plot,sex,weight,species
0,2013-07-16 00:00:00,2,F,65.5,DM
1,2013-07-16 00:00:00,7,M,33.0,DM
2,2013-07-16 00:00:00,3,M,65.5,DM
3,2013-07-16 00:00:00,1,M,65.5,DM
4,2013-07-18 00:00:00,3,M,40.0,DM
5,2013-07-18 00:00:00,7,M,48.0,DM
6,2013-07-18 00:00:00,4,F,29.0,DM
7,2013-07-18 00:00:00,4,F,46.0,DM
8,2013-07-18 00:00:00,7,M,36.0,DM
9,2013-07-18 00:00:00,7,F,35.0,DM


## Data Carpentry is sometimes tedious to construct as a process for data cleaning

However, remember that the alternative was to clean hundreds or thousands of those files by hand (with Excel or a text editor).

By developing a data transformation and cleaning script, we can ingest any number of these files for further analytical inspection.
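
As a rough sketch of what such a script could look like, the steps from this notebook can be gathered into one function. The function name `tidy_survey` is hypothetical, and the sketch assumes the frame has already been trimmed so that its first row holds a species label and it has exactly four columns, as `messy` does above. It uses a digit-extraction shortcut for the weights instead of the loop shown earlier.

```python
import numpy as np
import pandas as pd

def tidy_survey(raw, table_names):
    """Hypothetical helper: stack labeled sub-tables into one tidy frame."""
    # Label each row with the number of the sub-table it belongs to
    groups = raw[raw.columns[0]].isin(table_names).cumsum()
    frames = []
    for _, rows in raw.groupby(groups):
        name = rows.iloc[0, 0]        # e.g. "Species: DM"
        body = rows.iloc[2:].copy()   # skip the name row and the header row
        body.columns = ["date", "plot", "sex", "weight"]
        body["species"] = name.split(" ").pop()
        frames.append(body)
    # Join the pieces and drop the blank rows left between sub-tables
    tidy = pd.concat(frames).dropna(subset=["date"]).reset_index(drop=True)
    # Keep only the digits of each weight and convert them to numbers
    digits = tidy["weight"].astype(str).str.extract(r"(\d+)", expand=False)
    tidy["weight"] = pd.to_numeric(digits)
    return tidy

# Toy input (hypothetical) shaped like the trimmed `messy` frame
raw = pd.DataFrame([
    ["Species: DM", np.nan, np.nan, np.nan],
    ["Date Collected", "Plot", "Sex", "Weight"],
    ["2013-07-16", 2, "F", np.nan],
    ["2013-07-16", 7, "M", "33g"],
    [np.nan, np.nan, np.nan, np.nan],
    ["Species: DO", np.nan, np.nan, np.nan],
    ["Date Collected", "Plot", "Sex", "Weight"],
    ["2013-08-19", 8, "F", 52],
])

tidy = tidy_survey(raw, ["Species: DM", "Species: DO"])
print(tidy)
```

Calling the same function once per file, for example inside a loop over file paths, is what would make the effort of writing the script pay off.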

# Save your notebook, then `File > Close and Halt`