## Syncing your Fork

GitHub documentation: [Syncing a fork](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork)

---

# Pandas Demonstration

In this demonstration we will look at the first part of a common pattern in bioinformatics:
filenames from experimental data (such as FASTA or, in this case, LCMS) encode the
experimental design information.

We shall look into workflows to process the larger data components in module 3 of this course.

In [4]:
import numpy as np
import pandas as pd

# from numpy.random import default_rng
# rng = default_rng()

import matplotlib.pyplot as plt
%matplotlib inline

Here I copy a tidy set of data from the pandas documentation.

In [5]:
import datetime

df = pd.DataFrame({
        "A": ["one", "one", "two", "three"] * 6,
        "B": ["A", "B", "C"] * 8,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
        "D": np.random.randn(24),
        "E": np.random.randn(24),
        "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)] + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
})

df

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | one | A | foo | 1.018762 | 0.759163 | 2013-01-01 |
| 1 | one | B | foo | -0.012206 | -0.399411 | 2013-02-01 |
| 2 | two | C | foo | -2.147779 | 1.353184 | 2013-03-01 |
| 3 | three | A | bar | 1.648603 | -1.543284 | 2013-04-01 |
| 4 | one | B | bar | -0.853453 | -1.580179 | 2013-05-01 |
| 5 | one | C | bar | 1.719571 | -2.611538 | 2013-06-01 |
| 6 | two | A | foo | -0.394588 | 0.528993 | 2013-07-01 |
| 7 | three | B | foo | 0.547158 | 0.422124 | 2013-08-01 |
| 8 | one | C | foo | -1.100636 | 1.273432 | 2013-09-01 |
| 9 | one | A | bar | 2.145902 | -1.062229 | 2013-10-01 |


It is not uncommon to be given an Excel sheet that ends up looking something like the
result of the following call to `pd.pivot_table()`.

In [6]:
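# Restrict the pivot to the numeric columns so the datetime column "F" is not aggregated.
non_tidy_df = pd.pivot_table(df, index=["A", "B"], columns=["C"], values=["D", "E"])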
non_tidy_df

| A | B | D (C = bar) | D (C = foo) | E (C = bar) | E (C = foo) |
|---|---|---|---|---|---|
| one | A | 0.615964 | 1.36769 | -0.884646 | 0.216438 |
| one | B | 0.509176 | 0.365633 | -0.100594 | 0.344357 |
| one | C | 1.176053 | -0.213125 | -0.903662 | 0.958654 |
| three | A | 1.195924 | NaN | -1.329102 | NaN |
| three | B | NaN | 1.027812 | NaN | 0.011491 |
| three | C | -0.365436 | NaN | 1.527464 | NaN |
| two | A | NaN | -0.216253 | NaN | 0.473601 |
| two | B | -0.50814 | NaN | -1.043292 | NaN |
| two | C | NaN | -1.747747 | NaN | 1.01619 |
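
To make the shape of such a table easier to see, here is a minimal sketch on a tiny, made-up data frame (the values and the name `long_df` are assumptions for illustration, not part of the course data). It shows two things worth noticing: the columns become a two-level index, and combinations that never occur in the data turn into empty (NaN) cells.

```python
import pandas as pd

# Hypothetical long-format measurements; the values are made up for illustration.
long_df = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "bar", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0],
    "E": [10.0, 20.0, 30.0, 50.0],
})

# Spread the categories of "C" across the columns; repeated combinations are averaged.
wide = pd.pivot_table(long_df, index=["A"], columns=["C"], values=["D", "E"])

# The columns now form a two-level MultiIndex: (value column, category of "C").
n_levels = wide.columns.nlevels

# ("two", "foo") occurred twice, so the cell holds the mean of 3.0 and 5.0.
two_foo_mean = wide.loc["two", ("D", "foo")]

# ("two", "bar") never occurred, so those cells are empty.
n_missing = int(wide.isna().sum().sum())

print(wide)
```

The default aggregation in `pd.pivot_table` is the mean, which is why duplicated combinations collapse to a single number here.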


One way back to a tidy layout is to stack the column level `C` into the row index:

In [7]:
non_tidy_df.stack()

| A | B | C | D | E |
|---|---|---|---|---|
| one | A | bar | 0.615964 | -0.884646 |
| one | A | foo | 1.36769 | 0.216438 |
| one | B | bar | 0.509176 | -0.100594 |
| one | B | foo | 0.365633 | 0.344357 |
| one | C | bar | 1.176053 | -0.903662 |
| one | C | foo | -0.213125 | 0.958654 |
| three | A | bar | 1.195924 | -1.329102 |
| three | B | foo | 1.027812 | 0.011491 |
| three | C | bar | -0.365436 | 1.527464 |
| two | A | foo | -0.216253 | 0.473601 |
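
The stacked result still keeps `A`, `B` and `C` in the index. As a rough sketch on a small made-up frame (again, the data and names are assumptions), following `stack()` with `reset_index()` turns the index levels back into ordinary columns. The explicit `dropna(how="all")` is there because whether `stack()` itself drops all-empty rows has varied between pandas versions.

```python
import pandas as pd

# Hypothetical long-format measurements; the values are made up for illustration.
long_df = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "bar", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0],
    "E": [10.0, 20.0, 30.0, 50.0],
})

# Build a wide, non-tidy table similar in shape to the one above.
wide = pd.pivot_table(long_df, index=["A"], columns=["C"], values=["D", "E"])

# Move the innermost column level ("C") into the row index, drop rows that are
# entirely empty, then turn the index levels back into regular columns.
tidy = wide.stack().dropna(how="all").reset_index()

print(tidy)
```

Each row of `tidy` is now one observed combination of `A` and `C`, with `D` and `E` as plain columns.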


## Wrangling Data

Read our data file!

In [4]:
%%bash
ls data

filenames.txt


In [5]:
%%bash
# less data/filenames.txt

In pure Python we can read a file line by line, like this:

In [12]:
with open('data/filenames.txt', 'r') as file:
    print(file.readline())

02042021/Blank-r001.d
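
The cell above reads only the first line. A minimal sketch of reading every line follows; it writes two paths copied from this notebook's listing into a temporary file (an assumption made so the example does not depend on `data/filenames.txt` being present) and then iterates over the file object.

```python
import tempfile
from pathlib import Path

# Two paths copied from the listing in this notebook.
sample = "02042021/Blank-r001.d\n03232021/ME_S3C_R3_60.d\n"

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "filenames.txt"
    file_path.write_text(sample)

    paths = []
    with open(file_path, "r") as file:
        for line in file:  # a file object yields one line per iteration
            paths.append(line.rstrip("\n"))  # drop only the trailing newline

print(paths)
```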



pandas provides convenience functions for reading many common file formats. Which format you
interact with most often will depend on the source of your data.

I would hazard a guess that you will see plain text or .csv (comma-separated values) files most often,
and that is what we will examine here.
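
As a small sketch of one such convenience function, `pd.read_csv` accepts a path or any file-like object. Here an in-memory `io.StringIO` stands in for a file (an assumption made to keep the example self-contained), and the two lines are paths copied from this notebook's listing.

```python
import io

import pandas as pd

# Two paths copied from the listing in this notebook, one per line.
text = "02042021/Blank-r001.d\n03232021/ME_S3C_R3_60.d\n"

# header=None says the file has no header row; names supplies the column label.
frame = pd.read_csv(io.StringIO(text), header=None, names=["path"])

print(frame)
```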

I have copied the filenames from a current project and provided them here. We shall try to prepare
the sample annotation data from these names and the provided specification.

### Provided Specification

```
[Grape Variety]_[Smoke Event]_[Replicate]_[Treatment]
```

This means our output should be a data frame with (at least) those four columns.
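
One possible way to turn the specification into columns is a regular expression with named groups passed to `Series.str.extract`. This is only a sketch: the pattern and the column names `grape`, `smoke`, `rep` and `treat` are my own reading of the specification, and it assumes exactly four underscore-separated parts, which may not hold for every file in the project.

```python
import pandas as pd

# Names taken from the listing in this notebook, with the ".d" suffix already removed.
filenames = pd.Series(["ME_S3C_R3_60", "ME_S3C_R3_CO", "Blank-r001"])

# One named group per field of the specification, separated by underscores.
pattern = (
    r"^(?P<grape>[^_]+)_(?P<smoke>[^_]+)"
    r"_(?P<rep>[^_]+)_(?P<treat>[^_]+)$"
)

# Each named group becomes a column; names that do not match give a row of NaN.
parsed = filenames.str.extract(pattern)

print(parsed)
```

Names that do not follow the specification, such as blanks and controls, come back as all-NaN rows, which makes them easy to find and drop later.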

In [65]:
data = pd.read_csv("data/filenames.txt", header=None, names=["path"])

# Expand around dates.
data = data.path.str.split("/", expand=True)
data.columns = ['date', 'filename']

# Expand around specification.
# Remove the ".d" suffix. str.rstrip(".d") would strip any run of trailing "." or "d" characters.
data['filename'] = data['filename'].str.replace(r"\.d$", "", regex=True)
filename_split = data['filename'].str.split("_", expand=True)

data = data.merge(filename_split, left_index=True, right_index=True)
data.columns = ['date', 'filename', 'grape', 'smoke', 'rep', 'treat', 'unknown']

# data.dropna(subset=['smoke', 'rep', 'treat'], how='all')


to_drop = data[['smoke', 'rep', 'treat']].isna().all(axis=1)
data = data.loc[~to_drop]
data['grape'].unique()
# data['filename'].str.slice(0, 2) == "ME"

array(['ME', 'NE', 'CS', 'MEC3C'], dtype=object)
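
The boolean mask used to remove rows and the commented-out `dropna` call should select the same rows. A minimal check on a made-up frame (the values, including `"S1"`, are assumptions for illustration):

```python
import pandas as pd

# Hypothetical annotation rows: a full sample, a blank with no fields, a partial row.
data = pd.DataFrame({
    "grape": ["ME", "Blank-r001", "CS"],
    "smoke": ["S3C", None, "S1"],
    "rep": ["R3", None, None],
    "treat": ["60", None, None],
})

# Approach 1: build a mask of rows where all three fields are missing, then invert it.
to_drop = data[["smoke", "rep", "treat"]].isna().all(axis=1)
masked = data.loc[~to_drop]

# Approach 2: let dropna do the same test in one call.
dropped = data.dropna(subset=["smoke", "rep", "treat"], how="all")

print(masked.equals(dropped))
```

The partial row is kept by both approaches because `how="all"` only removes rows in which every listed column is missing.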

| | 0 | 1 |
|---|---|---|
| 0 | 02042021 | Blank-r001.d |
| 1 | 02042021 | Blank-r002.d |
| 2 | 02042021 | cONTROL 2.d |
| 3 | 02042021 | Control-r001.d |
| 4 | 02042021 | Control-r002.d |
| ... | ... | ... |
| 573 | 03232021 | ME_S3C_R3_60.d |
| 574 | 03232021 | ME_S3C_R3_90.d |
| 575 | 03232021 | ME_S3C_R3_CO.d |
| 576 | 03232021 | Water blank-r001.d |
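
A note on removing the `.d` extension: `str.rstrip` treats its argument as a set of characters to strip, not as a suffix. The sketch below compares it with an anchored regular expression; `"Blend.d"` is a hypothetical filename chosen to show the difference.

```python
import pandas as pd

# The first name is from the listing above; "Blend.d" is hypothetical.
names = pd.Series(["ME_S3C_R3_60.d", "Blend.d"])

# rstrip(".d") removes every trailing "." or "d", so the final "d" of "Blend" is lost too.
stripped = names.str.rstrip(".d")

# An anchored pattern removes only a literal ".d" at the end of the string.
suffix_removed = names.str.replace(r"\.d$", "", regex=True)

print(stripped.tolist())
print(suffix_removed.tolist())
```

Recent pandas versions also offer `Series.str.removesuffix(".d")`, which reads more directly if it is available in your environment.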
