### Cleaning, tidying and re-arranging your data

It is very rare for a dataset to be in the exact format you'd like, with just the data you want and no errors.  Much of the work in data analysis is producing a tidy dataset ready to work with.  It is often said that 80% of the time in data analysis is spent on cleaning and preparing the data (Dasu and Johnson 2003).  In this week's classes you will learn about:  

Adding/removing columns  
Combining columns  
Filtering data - by rows, columns  
Transforming data - operations over series  
Tidy data
Changing form of data - melt, pivot  


In Class 3 you will learn about tidy data and more advanced ways of manipulating data into useful forms.
In Class 4 you'll practise these processes on a dataset of your choice.  

In [1]:
# Analysis modules
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

### Introduction - basic changes

Read in one of last week's dataframes: count.csv in the Datasets folder.

The index is not doing anything useful here.  We can read in with the field names as the index by adding:  

    index_col = 0
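
A minimal sketch of what `index_col` changes. The three-row table is invented so the example runs without the course files; only the field names come from this notebook, and the real count.csv will differ.

```python
import io
import pandas as pd

# Invented stand-in for count.csv (assumption: the real file has other values)
csv_text = """Field,Sheep,Goats,Barley,Oats
Heol-y-bryn,12,3,400,250
Lan-y-mor,40,0,150,90
Ffos_fawr,5,8,620,310
"""

# Default: pandas adds a 0, 1, 2... index and keeps Field as an ordinary column
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (Field) becomes the index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.index.tolist())
print(indexed_df.index.tolist())
```

With the field names as the index, rows can later be selected, dropped or re-ordered by name.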

We'll add two new columns - one of Soil types, one of Drainage. Here they are as lists.

In [None]:
["Sand","Loam","Loam","Clay","Clay","Loam","Sand","Sand","Clay","Clay"]

In [None]:
["Good", "OK", "Poor", "Poor", "Poor", "Good", "Good", "OK", "OK","Poor"]

Make the new columns from the list like this:  
    
    df["COLUMN_NAME"] = [LIST]
    
Check the columns have been added correctly.

Here's one way to drop a column:  

    df = df.drop(columns=[COLUMN_NAMES])
    
Drop the column of Goats.

To drop a row, give its index label (or its row number, if the dataframe still has the default numeric index):
    
    df = df.drop(ROW_NUMBER)
    
Drop the data on Heol-y-bryn field
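
As a sketch of the three operations together, here is a small made-up dataframe (the names `df_demo`, `Field_A` and so on are hypothetical and not part of the course data).

```python
import pandas as pd

# Invented example data, not the course dataset
df_demo = pd.DataFrame(
    {"Sheep": [12, 40, 5], "Goats": [3, 0, 8]},
    index=["Field_A", "Field_B", "Field_C"],
)

# Add a column from a list: one value per row, in row order
df_demo["Soil"] = ["Sand", "Clay", "Loam"]

# Drop a column by name
df_demo = df_demo.drop(columns=["Goats"])

# Drop a row by its index label
df_demo = df_demo.drop("Field_B")

print(df_demo)
```

Note that `drop` returns a new dataframe, which is why the result is assigned back to the same name.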

### Rearranging data  
Pandas does not care what order your data is in, but sometimes you want particular columns or rows at the left or top.  Here's how to do this.  
Make a list of your columns:  

    cols = df.columns.tolist()

Re-arrange by hand or by Python list handling, for example: 

    cols_new = cols[-1:] + cols[:-1]

Apply the new list order to the dataframe like this:  

    df = df[cols_new]

To do the same with rows re-order by the index.  Start by making a list from the index:  

    fields = df.index.values.tolist()
    
Take the last two fields and make them the first two using Python list handling:  

    fields_new = fields[-2:] + fields[:-2]
        
Use the new list to order the rows:  

    df2 = df.reindex(fields_new)
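
The column and row steps above can be run end to end on a toy dataframe. This is only a sketch: `df_demo` and its values are invented for illustration.

```python
import pandas as pd

# Invented example data
df_demo = pd.DataFrame(
    {"Sheep": [12, 40, 5], "Barley": [400, 150, 620], "Oats": [250, 90, 310]},
    index=["Field_A", "Field_B", "Field_C"],
)

# Columns: move the last column to the front
cols = df_demo.columns.tolist()
cols_new = cols[-1:] + cols[:-1]
df_demo = df_demo[cols_new]

# Rows: move the last two rows to the top
fields = df_demo.index.tolist()
fields_new = fields[-2:] + fields[:-2]
df_demo = df_demo.reindex(fields_new)

print(df_demo)
```

Each row keeps its own values; only the display order changes.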

Re-arrange the dataframe to have both rows and columns alphabetically sorted.
Use the python list sorting method


### Sorting data

Sorting is easy - just specify column(s) and direction.  

    df.sort_values(by=['col1', 'col2'], ascending=False)
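
A short sketch on invented data showing single-column and two-column sorts. `sort_values` returns a sorted copy and leaves the original dataframe untouched, and `ascending` can take one direction per column.

```python
import pandas as pd

# Invented example data
df_demo = pd.DataFrame(
    {"Sheep": [12, 40, 12], "Oats": [250, 90, 310]},
    index=["Field_A", "Field_B", "Field_C"],
)

# One column, smallest first (the default direction)
by_sheep = df_demo.sort_values(by="Sheep")

# Two columns: Sheep ascending, ties broken by Oats descending
by_both = df_demo.sort_values(by=["Sheep", "Oats"], ascending=[True, False])

print(by_both)
```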

Sort by Sheep

Sort by Sheep and Oats

Sort in descending order by Barley

### Transforming data   
It's very straightforward to make a new column from an existing one.  
You can treat numerical columns like numbers:

    df["More_sheep"] = df["Sheep"]*50   
    df["Per_sheep"] = df["Barley"]/df["Sheep"]  
    df["Stupid"] = df["Field"]*df["Sheep"]  
    
NumPy allows you to do fancier operations:

    df["Logged_oats"] = np.log(df["Oats"])

and others.....
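
A minimal sketch of column arithmetic on invented numbers. It also shows that dividing two integer columns produces a float column.

```python
import numpy as np
import pandas as pd

# Invented example data
df_demo = pd.DataFrame({"Sheep": [10, 20], "Barley": [100, 400]})

# Arithmetic is applied to every row at once
df_demo["More_sheep"] = df_demo["Sheep"] * 50
df_demo["Per_sheep"] = df_demo["Barley"] / df_demo["Sheep"]

# NumPy functions also work on whole columns
df_demo["Logged_barley"] = np.log(df_demo["Barley"])

print(df_demo.dtypes)
```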

Make a column "Cereals" of the counts for Barley and Oats

And "Half_Cereals" by dividing this by two

Check to see the type of each column now.   

    df.info()

The new column "Half_Cereals", the product of a division, is a float.

##### Working with text
You can also operate on columns of text in string format.  For example:

    df["binomial"] = df["genus"].astype(str) + "_" + df["species"].astype(str)

    df["protein"] = df["gene"].str.upper()

Make a new column "Field_type" of Drainage and Soil

Make another, "Oat_Drainage", of Drainage and Oats

This fails as we have "Oats" coded as an integer. 

We could recode 'Oats' in the dataframe as text:

    df = df.astype({"Oats": str})
    
But that might mess up later work.  Better to simply tell pandas to treat it as a string in the concatenation:

    df["Oats"].astype(str)
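
A sketch of on-the-fly conversion, using an invented table with hypothetical `genus`, `species` and `n_seen` columns so as not to give away the exercise. The numeric column is converted only inside the expression and stays numeric in the dataframe.

```python
import pandas as pd

# Invented example data; column names are hypothetical
df_demo = pd.DataFrame(
    {"genus": ["Ovis", "Capra"], "species": ["aries", "hircus"], "n_seen": [12, 3]}
)

# Text + text concatenates row by row
df_demo["binomial"] = df_demo["genus"] + "_" + df_demo["species"]

# Text + integer would raise an error, so convert inside the expression
df_demo["label"] = df_demo["genus"] + "_" + df_demo["n_seen"].astype(str)

print(df_demo)
```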

Tidy up by dropping the new columns Half_Cereals, Field_type and Oat_Drainage

### Lambda functions

What if you want to do something a bit fancier?  Lambda functions give you a lot of flexibility.  

We want to put in a column showing the profitability of each field based on the yield of each crop/flock.  
We can define a lambda function and apply it across the dataframe  

    df = df.assign(Profit=lambda x: (x['Sheep'] *10 +  x['Barley'] *5 +  x['Oats']*3))

We can use lambda to make changes in specific rows in a column and not others.  
Maybe fields Lan-y-mor and Ffos_fawr flooded and produced no yield.   
We can replace their profits with 0.  
  

    df['Profit'] = df.apply(lambda row: 0 if row.name in ("Lan-y-mor", "Ffos_fawr") else row['Profit'], axis = 1)  
    
   
axis = 1 tells apply to run the function once per row (working across the columns).  The default, axis = 0, would run it once per column.
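
Both steps can be tried on a small invented table. The field names are from this notebook but the counts are made up, and the prices (10, 5, 3) are the ones used in the formula above.

```python
import pandas as pd

# Invented counts for three fields
df_demo = pd.DataFrame(
    {"Sheep": [10, 20, 0], "Barley": [100, 0, 50], "Oats": [0, 10, 20]},
    index=["Lan-y-mor", "Heol-y-bryn", "Ffos_fawr"],
)

# assign: the lambda receives the whole dataframe and returns a new column
df_demo = df_demo.assign(
    Profit=lambda x: x["Sheep"] * 10 + x["Barley"] * 5 + x["Oats"] * 3
)
profit_before = df_demo["Profit"].tolist()

# apply with axis=1: the lambda receives one row at a time;
# row.name is that row's index label (the field name)
flooded = ("Lan-y-mor", "Ffos_fawr")
df_demo["Profit"] = df_demo.apply(
    lambda row: 0 if row.name in flooded else row["Profit"], axis=1
)

print(df_demo)
```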

Replace current Drainage values with "Good" if Barley yield is greater than 500

A useful nugget here is in logging columns.  Taking the log of 0 gives negative infinity (-inf), so it is useful to skip 0 values when logging.  You can do it this way:  

    df['log_profit']=df['Profit'].apply(lambda x: 0 if x ==0 else np.log10(x))
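
A small sketch on an invented Series comparing the plain log with the guarded version. `np.errstate` is used here only to silence NumPy's divide-by-zero warning for the demonstration.

```python
import numpy as np
import pandas as pd

# Invented profit values, including a zero
profit = pd.Series([0, 10, 1000], name="Profit")

# Plain log10: the zero becomes -inf
with np.errstate(divide="ignore"):
    raw = np.log10(profit)

# Guarded version: zeros are passed through as 0
safe = profit.apply(lambda x: 0 if x == 0 else np.log10(x))

print(raw.tolist())
print(safe.tolist())
```

One caveat with this approach: a profit of exactly 1 also logs to 0, so the two cases can no longer be told apart in the logged column.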

You can also define your own function to apply.
For example, 

In [69]:
def which_use(Sheep, Cereals):
    if Sheep > 30:
        return 'Livestock'
    if Cereals > 1000:
        return 'Arable'
    else:
        return 'Mixed'

This takes in two values ("Sheep" and "Cereals"), works through an if statement, and outputs a new value: "Livestock", "Arable" or "Mixed".  The checks run in order, so a field with more than 30 sheep is "Livestock" whatever its cereal count.  We apply it like this:  

    df['Best_use'] = df.apply(lambda row: which_use(row['Sheep'], row['Cereals']), axis=1)
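
To see the function at work, here is a self-contained sketch with invented counts chosen to hit each branch. The function is repeated so the block runs on its own.

```python
import pandas as pd

def which_use(sheep, cereals):
    # Checks run top to bottom; the first match wins
    if sheep > 30:
        return "Livestock"
    if cereals > 1000:
        return "Arable"
    return "Mixed"

# Invented example data
df_demo = pd.DataFrame(
    {"Sheep": [40, 5, 10, 50], "Cereals": [200, 1500, 300, 2000]}
)

# axis=1 passes one row at a time to the lambda
df_demo["Best_use"] = df_demo.apply(
    lambda row: which_use(row["Sheep"], row["Cereals"]), axis=1
)

print(df_demo)
```

The last row has both many sheep and a large cereal count, and comes out as "Livestock" because that check is made first.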

Make a new column called "By_Sheep" of Profit divided by Sheep for the Mixed use fields  
Base it on the command used to remove profits from the flooded fields.  

### Subsetting

How do we make a new dataframe of just the Clay Field data?  
We can run over the Soil column to create a list of "True" and "False" for each row depending on whether the Soil matches "Clay".  We then use this filter to subset the dataframe  

    clay_df = df[df["Soil"]=="Clay"]  
    
"==" is python for an exact match

We can use any combination of arithmetical or boolean [true or false] statements  
Make a dataframe of the Fields with more than 10 sheep in

and another for non-clay soils. 
"!=" is python for is not equal to

We can combine filters with "and" (&) or "or" (|).  Put each condition in its own brackets.  

    Clay_and_Ovid = df[(df["Sheep"] > 10) & (df["Soil"] == "Clay")]
    Clay_OR_Ovid = df[(df["Sheep"] > 10) | (df["Soil"] == "Clay")]

We can select by multiple text matches by presenting a list of strings to match  

    Light_soil = df[df["Soil"].isin(["Loam", "Sand"])]
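
A sketch of these filters on a four-row invented table, showing the True/False mask explicitly before it is used. The brackets around each condition matter because `&` and `|` bind more tightly than `>` and `==`.

```python
import pandas as pd

# Invented example data
df_demo = pd.DataFrame(
    {"Sheep": [5, 20, 30, 2], "Soil": ["Clay", "Clay", "Sand", "Loam"]},
    index=["Field_A", "Field_B", "Field_C", "Field_D"],
)

# The comparison gives one True/False per row
mask = df_demo["Soil"] == "Clay"
clay = df_demo[mask]

# Both conditions must hold
both = df_demo[(df_demo["Sheep"] > 10) & (df_demo["Soil"] == "Clay")]

# At least one condition must hold
either = df_demo[(df_demo["Sheep"] > 10) | (df_demo["Soil"] == "Clay")]

# Match any value in a list
light = df_demo[df_demo["Soil"].isin(["Loam", "Sand"])]

print(mask.tolist())
```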

We can select by partial text matches.  We need to specify that the field contents are to be treated as a string (with .str) and can then use a whole range of string operations like:

    .str.contains()
    .str.startswith()
    .str.endswith()
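
A brief sketch of the three methods on an invented Soil column. Matching is case-sensitive by default; `contains` accepts `case=False` to relax that. The same `.str` methods are available on a text index via `df.index.str`.

```python
import pandas as pd

# Invented example data
df_demo = pd.DataFrame(
    {"Soil": ["Sand", "Loam", "Clay"]},
    index=["Field_A", "Field_B", "Field_C"],
)

starts_s = df_demo[df_demo["Soil"].str.startswith("S")]
ends_m = df_demo[df_demo["Soil"].str.endswith("m")]

# Lower-case "l" only: "Loam" has a capital L so it does not match
has_l = df_demo[df_demo["Soil"].str.contains("l")]

# Ignoring case, "Loam" matches as well
has_l_any_case = df_demo[df_demo["Soil"].str.contains("l", case=False)]

print(has_l_any_case)
```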
    
Pick out the rows with "y" in the Field name

Make a dataframe containing cases where the Barley yield is greater than the oat yield

### Transposing

We can easily transpose the whole data set to give a dataframe arranged with columns of individual fields and rows of values.  

    Fields = df.T 

Transpose it back again and make Field a column instead of an index (.reset_index())

## Melting, casting, splitting

### Tidy data  
Melting, casting and splitting are your main tools for re-arranging the data into a tidy format.

Infection_TPM.csv is a typical way you might be presented with some data. This is transcripts per million reads for 6 samples for each of 7 genes.  Read it in from the Datasets folder.



We can easily plot the gene expression levels for a tissue using barplot  

    sns.barplot(x="Gene_ID", y="Control_Flower", data=df)

We could instead view the data by gene for each of the 6 samples.  
Transpose the dataframe (.T), 
Make new column names from the first row  
Drop the first row  
Plot NLR_1 expression by tissue


### Splitting 

But what if our focus is on infected and non-infected?  We have to do some juggling to get the dataframe in the right shape to plot useful graphs.  
We need to make columns for Tissue and for Disease state from the index which combines both pieces of information.  We can split the text string on "_" like this:

    df2.index.str.split('_')
    
We turn this into two sequences, one per part, by unpacking the list into zip():

    zip(*df2.index.str.split('_').tolist())
    
We can then specify the two new columns:  

    df2['Disease'], df2['Tissue'] = zip(*df2.index.str.split('_').tolist())
    
We need to drop and re-set the index:  

    df2 = df2.reset_index(drop=True)
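
The splitting steps can be followed on a tiny invented table. The sample labels are assumptions modelled on "Control_Flower"; "Infected" and "Leaf" are guesses at the other labels, and the TPM numbers are made up.

```python
import pandas as pd

# Invented data: index labels combine disease state and tissue
df2_demo = pd.DataFrame(
    {"NLR_1": [1.0, 2.0, 3.0, 4.0]},
    index=["Control_Flower", "Infected_Flower", "Control_Leaf", "Infected_Leaf"],
)

# Each label becomes a two-item list: [disease, tissue]
parts = df2_demo.index.str.split("_").tolist()

# zip(*...) regroups the pairs into one sequence per part
disease, tissue = zip(*parts)
df2_demo["Disease"] = list(disease)
df2_demo["Tissue"] = list(tissue)

# The combined labels are no longer needed, so replace them with 0, 1, 2...
df2_demo = df2_demo.reset_index(drop=True)

print(df2_demo)
```

This only works cleanly when every label contains exactly one "_"; labels with more would split into more than two parts.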
   

Now plot a barplot of NLR_1 expression by Tissues using hue = "Disease" to show the differences

### Melting

This dataframe has 42 values:  
    
Gene, Disease states, and Tissues are variables  

    Tissues - three possible levels  
    Disease_state - 2 possible levels  
    Genes - 7 possible levels  
    
Samples of plant organs in different disease states are observations  
TPM counts are values 

To make this dataset tidy we need to arrange it with the Genes, Disease and Tissue as column headers and the TPM values in rows.  Melt will allow us to do this, going from wide format to long format.  

In [None]:
To get a list of gene names to save having to type them.  

    names = df.Gene_ID.tolist() 

Use pd.melt() on df2.  
The identity variables (id_vars) should be Tissue and Disease  
The value variables (value_vars) should be the gene names  

The name of the column of variable will be 'Gene_ID'  
The name of the column of values will be 'TPM'  
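
As a sketch of what melt does, here is a two-gene, two-sample invented table going from wide to long. "NLR_2" and all TPM numbers are made up for illustration.

```python
import pandas as pd

# Invented wide table: one row per sample, one column per gene
wide_df = pd.DataFrame(
    {
        "Disease": ["Control", "Infected"],
        "Tissue": ["Leaf", "Leaf"],
        "NLR_1": [1.0, 2.0],
        "NLR_2": [3.0, 4.0],
    }
)
names = ["NLR_1", "NLR_2"]

# Each gene column is stacked into rows; id_vars are repeated alongside
long_df = pd.melt(
    wide_df,
    id_vars=["Tissue", "Disease"],
    value_vars=names,
    var_name="Gene_ID",
    value_name="TPM",
)

print(long_df)
```

The long table has one row per sample and gene combination, so 2 samples by 2 genes gives 4 rows.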

Now it's easy to plot all the data by tissue using barplot

    sns.barplot(x="Tissue", y="TPM", hue="Disease", data=df3)

Or break it down by gene  

    sns.catplot(x="Gene_ID", y="TPM", hue="Disease", data=df3, kind="bar")
    
or by tissue. 

    sns.catplot(x="Disease", y="TPM", hue="Tissue", data=df3, kind="bar")

or as individual plots by Organ type

    sns.catplot(x="Gene_ID", y="TPM", hue="Disease", data=df3, col='Tissue', kind="bar")

### Casting

Casting is the opposite of melting.  Put the dataframe back into wide form using pd.pivot  

index should be ['Disease', 'Tissue']
columns should be 'Gene_ID'
values should be 'TPM'

Notice we now have a multi-level index.  We will find out more about these in week 5.  For the moment we will reset them to columns using  

    df.reset_index()
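
A sketch of the round trip back to wide form on an invented long table ("NLR_2" and the TPM numbers are made up). Passing a list as `index` to `pd.pivot` assumes a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

# Invented long table: one row per sample and gene
long_df = pd.DataFrame(
    {
        "Disease": ["Control", "Control", "Infected", "Infected"],
        "Tissue": ["Leaf", "Leaf", "Leaf", "Leaf"],
        "Gene_ID": ["NLR_1", "NLR_2", "NLR_1", "NLR_2"],
        "TPM": [1.0, 3.0, 2.0, 4.0],
    }
)

# Each Gene_ID becomes a column; Disease and Tissue form a two-level index
wide_df = pd.pivot(
    long_df, index=["Disease", "Tissue"], columns="Gene_ID", values="TPM"
)

# reset_index returns a new dataframe with the index levels as columns
flat_df = wide_df.reset_index()

print(flat_df)
```

`pd.pivot` raises an error if any index and column combination appears more than once, since it would not know which value to keep.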