# SGPS BioStatistics
## Exercise 2: Getting publicly available data

#### Gene Expression data from Gene Expression Omnibus
+ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123336

#### Specifically from this paper
1.	LaRocca D, Barns S, Hicks SD, Brindle A et al. Comparison of serum and saliva miRNAs for identification and characterization of mTBI in adult mixed martial arts fighters. *PLoS One* 2019;14(1):e0207785.


In [None]:
# Import the packages we need
import pandas as pd
import numpy as np
import GEOparse

In [None]:
# Load the count matrix from the local gzipped CSV file
human_expression = pd.read_csv('GSE123336_MMA_CountMatrix.csv.gz', index_col=0, compression='gzip')

In [None]:
# Round to two decimals; unlike string formatting, this keeps the values numeric
human_expression = human_expression.round(2)
human_expression

In [None]:
# Download the series directly from the Gene Expression Omnibus; this route is only for the genome-wide expression data
gse = GEOparse.get_GEO(geo="GSE123336", destdir=".")

In [None]:
gse.gsms

In [None]:
gse.gsms['GSM3500956'].metadata

In [None]:
# Collect the per-sample metadata, keyed by each sample's description
meta = {}
for key, gsm in gse.gsms.items():
    print(key)
    samp = gsm.metadata['description'][0]
    characteristics = {}
    for item in gsm.metadata['characteristics_ch1']:
        # Each entry looks like "name: value"; split on the first ': ' only
        name, value = item.split(': ', 1)
        characteristics[name] = value

    # Samples that share a description overwrite earlier entries
    meta[samp] = characteristics

# One row per sample, one column per characteristic
metadata = pd.DataFrame(meta).T
metadata.to_csv('GSE123336_metadata.csv')

In [None]:
metadata
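
The `characteristics_ch1` field is a list of `name: value` strings, and the loop above turns each list into one dictionary per sample. The parsing step can be tried offline with a minimal sketch like the one below. The sample names, characteristic names, and values are invented for illustration and are not taken from the GEO record; `parse_characteristics` is a hypothetical helper, not part of GEOparse.

```python
import pandas as pd

# Hypothetical stand-in for the per-sample 'characteristics_ch1' lists.
# All names and values here are invented for illustration.
toy_samples = {
    "sample_A": ["tissue: saliva", "group: control"],
    "sample_B": ["tissue: serum", "group: fighter"],
}

def parse_characteristics(items):
    """Turn a list of 'name: value' strings into a dict, splitting on the first ': ' only."""
    return dict(item.split(": ", 1) for item in items)

# Build one row per sample and one column per characteristic.
toy_meta = pd.DataFrame(
    {name: parse_characteristics(items) for name, items in toy_samples.items()}
).T
print(toy_meta)
```

Splitting on the first `': '` only keeps a value intact if it happens to contain a colon itself.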

### How about getting rat data for mTBI?
#### Gene Expression data from Gene Expression Omnibus
+ https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE159011

#### Specifically from this paper
1.	Das Gupta S, Ciszek R, Heiskanen M, Lapinlampi N et al. Plasma miR-9-3p and miR-136-3p as Potential Novel Diagnostic Biomarkers for Experimental and Human Mild Traumatic Brain Injury. *Int J Mol Sci* 2021 Feb 4;22(4).


In [None]:
rat_expression = pd.read_csv("GSE159011_Raw_counts_matrix.txt", index_col=0, sep='\t')

In [None]:
rat_expression

In [None]:
rat_expression.sum()

In [None]:
# Normalization to counts per million (CPM)
# Step 1: Calculate the sum of each column
column_sums = rat_expression.sum()

# Step 2: Determine the normalization factor to scale each column to sum to 1 million
normalization_factor = 1_000_000 / column_sums

# Step 3: Multiply each column by its respective normalization factor
normalized_df = rat_expression.multiply(normalization_factor, axis=1)

# Now each column in `normalized_df` should sum to 1 million
normalized_df
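
The three steps above amount to counts-per-million (CPM) scaling. As a sanity check, the same arithmetic can be run on a small toy matrix where the answer is easy to verify by hand. The sketch below uses invented counts and hypothetical row and column names, not values from the rat dataset.

```python
import numpy as np
import pandas as pd

# Toy count matrix: rows are features, columns are samples (values are invented).
toy_counts = pd.DataFrame(
    {"s1": [10, 30, 60], "s2": [5, 5, 90]},
    index=["miR-a", "miR-b", "miR-c"],
)

# Scale each column so that it sums to one million.
toy_cpm = toy_counts.multiply(1_000_000 / toy_counts.sum(), axis=1)

# Every column total should now be (numerically) one million.
assert np.allclose(toy_cpm.sum(), 1_000_000)
print(toy_cpm)
```

One caveat under this approach: a sample whose counts sum to zero would produce NaN values after scaling, so it may be worth inspecting the column sums before normalizing.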