# General Introduction by Stefan

## Data acquisition
We downloaded curated sequences for CD-box and HACA-box RNAs from various species from the SNOPY database (http://snoopy.med.miyazaki-u.ac.jp/).
You can find the sequences in the files "SNOPY_CDBOX_curated.fasta" and "SNOPY_HACABOX_curated.fasta". FASTA is a common data format for DNA, RNA and protein sequence data.
If working in JupyterLab you can easily inspect the .fasta files with the text editor by double-clicking the file in the "File Browser" pane on the left.
What do you notice? How could this be relevant later on?

## Data cleansing
Some species may have multiple copies of these RNAs, and closely related species might have highly similar sequences. Many identical or near-identical copies of a sequence can keep our models from generalizing well.
#### CD-HIT (http://weizhongli-lab.org/cd-hit/)
We will use cd-hit-est to cluster highly similar sequences and create sets of representative sequences for both classes. <br>
Verify that cd-hit-est is installed and find out more about it by running "cd-hit-est -h". <br> You can run shell commands directly from Jupyter code cells by prefixing the command with a "!".

In [None]:
!cd-hit-est -h

#### CD-HIT parameters
Below are the parameters for our clustering runs. Run the cell please.

In [None]:
seq_identity = 0.9 # (-c)
word_size = 8 # recommended for 0.9 identity (-n)
threads = 0 # use all available CPUs (-T)
desc_len = 0 # keep description up until first white space (-d)

### Cluster CD-box sequences
Start by clustering the CD-box sequences. <br>
You can use variables in shell commands by enclosing them in curly braces, i.e. like this: {variable}. <br>
You could define variables for the input file (-i) and output file (-o) parameters as well. <br>
Now run cd-hit-est using all of the defined parameters.

In [None]:
cd_fasta_in = "SNOPY_CDBOX_curated.fasta" # (-i)
cd_clustered = "SNOPY_CDBOX_clustered.fasta" # (-o)

!cd-hit-est -i {cd_fasta_in} -o {cd_clustered} -c {seq_identity} -n {word_size} -T {threads} -d {desc_len}
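
The `{variable}` interpolation only works inside a notebook. As a minimal sketch, the same call can be assembled as an argument list in plain Python. The helper name `build_cdhit_cmd` is hypothetical, and only the flags already used above (-i, -o, -c, -n, -T, -d) are assumed.

```python
import shlex

def build_cdhit_cmd(fasta_in, fasta_out, identity=0.9, word_size=8, threads=0, desc_len=0):
    """Assemble the cd-hit-est call as an argument list (hypothetical helper)."""
    return [
        "cd-hit-est",
        "-i", fasta_in,
        "-o", fasta_out,
        "-c", str(identity),
        "-n", str(word_size),
        "-T", str(threads),
        "-d", str(desc_len),
    ]

cmd = build_cdhit_cmd("SNOPY_CDBOX_curated.fasta", "SNOPY_CDBOX_clustered.fasta")
print(shlex.join(cmd))  # the command line as it would be typed in a shell
# To actually run it (requires cd-hit-est on the PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids quoting problems if a file name ever contains spaces.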

### Cluster HACA-box sequences
Repeat the steps for the HACA-box sequences

In [None]:
haca_fasta_in = "SNOPY_HACABOX_curated.fasta"
haca_clustered = "SNOPY_HACABOX_clustered.fasta"

!cd-hit-est -i {haca_fasta_in} -o {haca_clustered} -c {seq_identity} -n {word_size} -T {threads} -d {desc_len}

# Data inspection

## Read in the sequences
We are now ready to read in the two sets of representative sequences. <br>
The output of cd-hit-est are .fasta files again. We can use the "parse" function from the SeqIO module from Biopython (https://biopython.org/wiki/SeqIO) to read sequences and identifiers from the fasta file.
(Since we are only dealing with two files following exactly the same format you could also easily roll your own fasta reader.) <br>
The pandas library is a powerful friend when handling data (https://pandas-docs.github.io/pandas-docs-travis/index.html).<br>
Create two pandas DataFrames (for each class) with the identifier as index and one column named "Seq" for the sequence. <br>
(You could e.g. put all sequences into a dictionary, which can then be read into a pandas DataFrame.)<br>
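
If you want to roll your own reader, the following is one possible minimal sketch. The function name `read_fasta` and the toy records are made up for illustration. It keeps the header up to the first whitespace as identifier and joins multi-line sequences.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)  # emit the previous record
            # keep the header up to the first whitespace as identifier
            parts = line[1:].split(None, 1)
            identifier, chunks = (parts[0] if parts else ""), []
        elif identifier is not None:
            chunks.append(line.upper())  # sequence lines, normalised to upper case
    if identifier is not None:
        yield identifier, "".join(chunks)  # emit the last record

# made-up example records
example = """>seq1 some description
acgt
ACGT
>seq2
ggcc
""".splitlines()

records = dict(read_fasta(example))
print(records)
```

For a real file you could pass the open file handle instead of the list of lines, e.g. `dict(read_fasta(open(path)))`.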


In [None]:
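# read the sequences into two dicts (identifier -> upper-case sequence)
from Bio import SeqIO # to read fasta

dict_cd = {record.id: str(record.seq).upper() for record in SeqIO.parse(cd_clustered, "fasta")}
dict_haca = {record.id: str(record.seq).upper() for record in SeqIO.parse(haca_clustered, "fasta")}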

In [None]:
# create two DataFrames from the dicts
import pandas as pd

df_cd = pd.DataFrame.from_dict(dict_cd, orient="index", columns=["Seq"])
df_haca = pd.DataFrame.from_dict(dict_haca, orient="index", columns=["Seq"])

In [None]:
df_cd

Add a column "Label" to both DataFrames containing the respective class label "CD-box" or "HACA-box"

In [None]:
df_cd["Label"] = "CD-box"
df_cd

In [None]:
df_haca["Label"] = "HACA-box"
df_haca

Combine both DataFrames to create our complete data set. (Save the resulting DataFrame to a csv file)

In [None]:
df_all = pd.concat([df_cd, df_haca]) # stack both classes into one DataFrame
df_all

Let's add some features to generate a first impression of our data. <br>
Add a column "Length" containing the length of the sequence

In [None]:
df_all["Length"] = df_all.Seq.map(len)

Two commonly used features for DNA sequences are the "GC content" and the "ATGC ratio".<br>
The GC content is the percentage of "G"s or "C"s in the whole sequence. <br>
The ATGC ratio is the ratio of "A"s and "T"s to "G"s and "C"s. <br>
Create two columns "GC_content" and "ATGC_ratio" containing the respective feature.
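
As a small worked example of the two definitions above, here is a sketch with two helper functions. The names `gc_content` and `atgc_ratio` and the toy sequence are made up for illustration, and a DNA alphabet (T, not U) is assumed.

```python
def gc_content(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def atgc_ratio(seq):
    """Ratio of (A + T) to (G + C); NaN if the sequence has no G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return at / gc if gc else float("nan")

toy = "AATTGC"  # made-up sequence: 4 A/T bases and 2 G/C bases
print(gc_content(toy))  # 2 of 6 bases are G or C
print(atgc_ratio(toy))  # 4 A/T bases divided by 2 G/C bases
```

Functions like these can be passed directly to `Seq.map()`.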

In [None]:
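# assumes a DNA alphabet (T instead of U), as in the description above
df_all["GC_content"] = df_all.Seq.map(lambda s: 100 * (s.count("G") + s.count("C")) / len(s))
df_all["ATGC_ratio"] = df_all.Seq.map(lambda s: (s.count("A") + s.count("T")) / (s.count("G") + s.count("C")))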

Generate a first overview of the data using the DataFrame's describe() method

In [None]:
df_all.describe()

We are also interested in the differences between our classes.<br>
<b>Generate a class-wise description using groupby() and describe()</b>

In [None]:
df_all.groupby("Label").describe()

To get a visual impression of the distribution of the features in our data we can use the pairplot() function from the seaborn visualization library (https://seaborn.pydata.org/). <br>
Generate a pair plot for the DataFrame. What is easily visible using the plot?

In [None]:
import seaborn as sns # Seaborn visualization library (for pairs plot)
sns.pairplot(df_all)

Again we are also interested in the differences between the two classes.<br>
<b>Generate a pair plot colored by class (Label) (https://seaborn.pydata.org/generated/seaborn.pairplot.html)</b>

In [None]:
sns.pairplot(df_all, hue = 'Label')

Let's save the DataFrame to a csv in case we need it later

In [None]:
df_all.to_csv("df_ALL.csv")

We already noticed above (using describe) that we have a few more CD-box sequences than HACA-box sequences. <br>
To balance our dataset we can use the "resample" function from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html).<br>
<b>Downsample the larger class to the size of the smaller class and create a new balanced DataFrame.</b>

In [None]:
from sklearn.utils import resample
rnd_seed=42

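# We are going to remove the randomly selected sequences
counts = df_all.Label.value_counts()
df_cd_ds = resample(df_all[df_all.Label == "CD-box"],
                    replace=False, # sample without replacement
                    n_samples=int(counts["CD-box"] - counts["HACA-box"]), # class size difference
                    random_state=rnd_seed) # fix seed for reproducible results

df_balanced = df_all.drop(df_cd_ds.index) # drop the selected sequences
df_balanced.Label.value_counts()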
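
The idea behind the downsampling step can be tried on a toy DataFrame first. This is only a sketch: the labels "A" and "B" and the row names are made up. It picks the surplus rows of the larger class at random and drops them by index.

```python
import pandas as pd
from sklearn.utils import resample

# toy data: 5 rows of class "A" and 3 rows of class "B" (made-up labels)
toy = pd.DataFrame(
    {"Label": ["A"] * 5 + ["B"] * 3},
    index=[f"seq{i}" for i in range(8)],
)

counts = toy.Label.value_counts()
n_surplus = int(counts["A"] - counts["B"])  # class size difference

# randomly pick the surplus rows of the larger class ...
to_drop = resample(
    toy[toy.Label == "A"],
    replace=False,         # sample without replacement
    n_samples=n_surplus,
    random_state=42,       # fixed seed for reproducible results
)

# ... and drop them by index
toy_balanced = toy.drop(to_drop.index)
print(toy_balanced.Label.value_counts())
```

Note that dropping by index only works as intended if the index values are unique.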

The sampling procedure should leave us with a representative sample, but let us check that we didn't end up with a skewed sample anyway.<br>
<b>Use groupby(), describe() and element-wise subtraction to analyse the differences between the balanced and the original data.</b>

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

df_balanced.groupby("Label").describe() - df_all.groupby("Label").describe()

In [None]:
pd.reset_option("display")

Let's save the balanced DataFrame to a csv in case we need it later

In [None]:
df_balanced.to_csv("df_balanced.csv")

## A first simple classifier
We already noticed in the pair plot that the sequence length distributions of the two classes seem to be quite different. Can we train a simple classifier with only the features we already constructed? The performance of this classifier can then serve as a baseline for our upcoming more complex models (i.e. CNNs).<br>
Wait! First things first. We need to split the data into a set used for training and a set used for testing.<br>
<b>Use sklearn's train_test_split to create sets of training and test data and corresponding sets of labels. Use an 80/20 split.</b>

In [None]:
from sklearn.model_selection import train_test_split
rnd_seed=42

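xTrain, xTest, yTrain, yTest = train_test_split(
    df_balanced[["Length", "GC_content", "ATGC_ratio"]], # features
    df_balanced["Label"], # labels
    test_size=0.2, # 80/20 split
    random_state=rnd_seed) # fix seed for reproducible results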

In [None]:
print(yTrain.value_counts())
print(yTest.value_counts())
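
If the class counts printed above differ more than you would like, `train_test_split` also accepts a `stratify` argument that keeps the class proportions equal in both parts. Whether this is needed here is a judgment call. The sketch below uses made-up toy data.

```python
from sklearn.model_selection import train_test_split

# toy data: 10 samples per class with one made-up feature
X = [[i] for i in range(20)]
y = ["CD-box"] * 10 + ["HACA-box"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 80/20 split
    random_state=42,   # fixed seed for reproducible results
    stratify=y,        # keep class proportions equal in both parts
)

print(len(y_train), len(y_test))
print(y_test.count("CD-box"), y_test.count("HACA-box"))
```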

Now we are going to train a "Naive Bayes" classifier using our features (https://scikit-learn.org/stable/modules/naive_bayes.html) <br>
<b>Fit a GaussianNB classifier to the training dataset and generate predictions for the test dataset. Does it make sense to include all of the features?</b>

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB() # create classifier
gnb.fit(xTrain[["Length", "GC_content"]], yTrain) # train using only length and GC
yPred = gnb.predict(xTest[["Length", "GC_content"]]) # generate predictions

yPred
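
To get a feeling for what a Gaussian Naive Bayes classifier does, here is a minimal pure-Python sketch for a single feature. It is not scikit-learn's implementation (which, among other things, adds a small smoothing term to the variances). The function names and the toy lengths are made up.

```python
import math
from statistics import fmean, pvariance

def fit_gnb(values, labels):
    """Estimate prior, mean and variance of one feature for every class."""
    params = {}
    for cls in sorted(set(labels)):
        xs = [v for v, l in zip(values, labels) if l == cls]
        params[cls] = (len(xs) / len(values), fmean(xs), pvariance(xs))
    return params

def predict_gnb(params, x):
    """Return the class with the highest log posterior for the value x."""
    def log_posterior(prior, mean, var):
        # log of the normal density plus log of the class prior
        log_likelihood = -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return math.log(prior) + log_likelihood
    return max(params, key=lambda cls: log_posterior(*params[cls]))

# made-up sequence lengths, not taken from the SNOPY data
lengths = [70, 80, 90, 120, 130, 140]
labels = ["CD-box"] * 3 + ["HACA-box"] * 3

model = fit_gnb(lengths, labels)
print(predict_gnb(model, 85), predict_gnb(model, 125))
```

With more than one feature, the per-feature log likelihoods are simply added up. This is the "naive" independence assumption.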

To assess the quality of our prediction scikit-learn provides us with many different metrics. (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)<br>
<b>Use sklearn.metrics to print out the accuracy_score and the matthews_corrcoef for the generated predictions.</b> (Are these good choices for our problem?)

In [None]:
from sklearn import metrics

print("Accuracy:", metrics.accuracy_score(yTest, yPred))
print("MCC:", metrics.matthews_corrcoef(yTest, yPred))
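
Both scores can be computed by hand from the four counts of a binary confusion matrix, which helps when judging whether they suit a problem. The sketch below uses made-up counts for a deliberately imbalanced case. The helper names are hypothetical.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# made-up counts: 90 positives, 10 negatives, and a classifier that
# predicts "positive" for every single sample
tp, fn, tn, fp = 90, 0, 0, 10
print("Accuracy:", accuracy(tp, tn, fp, fn))
print("MCC:", mcc(tp, tn, fp, fn))
```

In this toy case the accuracy looks good although the classifier has learned nothing, while the MCC is zero. On a balanced data set the two scores disagree far less.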

Scikit-learn also provides a classification report which includes commonly used metrics (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report)<br>
<b>Print out the classification report for the predictions</b>

In [None]:
from sklearn.metrics import classification_report

print(classification_report(yTest, yPred))

Another common way of looking at the confusion of a classifier is the confusion matrix.<br>
<b>Use scikit-learn to create a confusion matrix for the predictions and print it</b>

In [None]:
from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(yTest, yPred)
conf_mat
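
In scikit-learn's confusion matrix the rows are the true labels and the columns are the predicted labels. The per-class precision and recall from the classification report can be read off such a matrix, as the following sketch shows. The counts are made up and the helper name is hypothetical.

```python
def per_class_metrics(conf_mat, labels):
    """Precision, recall and F1 per class from a confusion matrix with
    true labels in rows and predicted labels in columns."""
    report = {}
    for k, label in enumerate(labels):
        tp = conf_mat[k][k]
        predicted = sum(row[k] for row in conf_mat)  # column sum
        actual = sum(conf_mat[k])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# made-up counts, not results from this notebook
toy_conf_mat = [[8, 2],
                [1, 9]]
toy_report = per_class_metrics(toy_conf_mat, ["CD-box", "HACA-box"])
for label, scores in toy_report.items():
    print(label, scores)
```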

Not that pretty, is it? We can use the seaborn.heatmap and matplotlib to create a matrix that is a bit more appealing to the eye.<br>
<b>Adjust the code below to generate a pretty confusion matrix</b>

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# confusion_matrix sorts the labels, so rows and columns follow this order
classes = ["CD-box", "HACA-box"]
df_confmat = pd.DataFrame(conf_mat, columns=classes, index=classes)

ax = plt.subplot()

sns.heatmap(df_confmat, annot=True, fmt="d", ax=ax, cmap="Blues") # annot=True to annotate cells, fmt="d" for integer counts

ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels') 
ax.set_title('Confusion Matrix')

plt.show()

Now that we have established a (maybe crude) baseline for our classification problem, let's build a simple CNN in the next notebook, "02_first_CNN", and find out whether it improves our scores.