
# Data cleaning and feature engineering

---

<br>

## Datasets

##### Training data:

- 29k rows x 10 columns
- 'Beta' is outcome: methylated (1) or not (0)
    - 30% are methylated 
- Chromosomes 1-10 only

##### Test data:
- 20,611 rows, no outcome labels
- Chromosomes 11-22 only


## Features of the data 

- `id`: unique identifiers 
  - not useful for training, but useful for data handling  
  
  
- `chromosome, position` (exact CpG site): 
  - unclear how to use these directly, since the test set comes from different chromosomes  
  - don't want the classifier to memorize positions (number of unique values = n)  
  - could derive other features, such as island size or the placement of the CpG within the island  
    
    
- `island`: chr + position range of the island  
  - again, very specific to each site, which could lead to overfitting  
  - some missing data (4k)  
  
- `refgene`: UCSC_RefGene_Group
  - 1089 unique values
  - contains a list of tags about functional elements:
      - TSS200, TSS1500
      - 1stExon
      - Body
      - 5'UTR, 3'UTR
  - each site can have multiple tags, including repeats of the same tag
  - maybe split into counts for each tag: columns [TSS200, TSS1500, exon1, body, utr5, ...]
  - counts: Body: 14797, TSS200: 10935, 5'UTR: 9117, TSS1500: 7634, 1stExon: 6928, 3'UTR: 866, NA's: 4243
                           

- `feature`: lots of missing data; looks like a combination of 2 different factors
  - group: Promoter_Associated, Gene_Associated, NonGene_Associated, or Unclassified
  - cell-type specificity: each group appears with or without the `_Cell_type_specific` suffix
  - NA (~12k)
  
  
- `relation_to_island`:
  - 5 levels: Island: 18269, S_Shore: 2107, N_Shelf: 529, N_Shore: 2378, S_Shelf: 434; NA's: 5348
  
  
- `Fwd_seq` and `seq`:
  - relationship between the two: `seq` contains the `Fwd_seq` region near its center (see "Encoding sequences")
  - `Fwd_seq` has the [CG] site marked; all are 124 characters long (60 bp + `[CG]` + 60 bp)
  - `seq` is 2 kbp of sequence
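
The notes above suggest deriving island size and the placement of the CpG within the island. A minimal sketch of that idea on toy rows, assuming the `island` strings look like `chr1:2004858-2005346`; the column names `isl_size` and `rel_pos` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy rows; `island` and `position` follow the column names used in this notebook.
toy = pd.DataFrame({
    "island": ["chr1:2004858-2005346", None],
    "position": [2005000.0, 1234.0],
})

# Pull start and end coordinates out of strings like "chr1:2004858-2005346".
bounds = toy["island"].str.extract(r":(\d+)-(\d+)").astype("float64")

# Hypothetical derived features: island length in bp, and where the CpG sits
# relative to the island (0 = island start, 1 = island end).
toy["isl_size"] = bounds[1] - bounds[0]
toy["rel_pos"] = (toy["position"] - bounds[0]) / toy["isl_size"]

print(toy[["isl_size", "rel_pos"]])
```

Rows without an island come out as NaN and would still need a missing-value strategy. Sites in shores or shelves lie outside the island itself, so `rel_pos` can fall below 0 or above 1.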

  
----

## Nominal data
  
  <br>

In [11]:
# refgene has lists of tags
# print(df['refgene'].unique()[:10])

<StringArray>
[                                'Body;Body',
                                    'TSS200',
                               "5'UTR;5'UTR",
                                      'Body',
                           '1stExon;1stExon',
                        "5'UTR;TSS200;5'UTR",
                                        <NA>,
                       'Body;Body;Body;Body',
 "5'UTR;1stExon;1stExon;5'UTR;1stExon;5'UTR",
                           'TSS1500;TSS1500']
Length: 10, dtype: string
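
Since each `refgene` value is a `;`-separated list that may repeat tags, one way to turn it into count columns is to split and tally. This is only an illustrative sketch in plain Python; `count_tags` and `TAGS` are hypothetical names, not part of the notebook's code.

```python
from collections import Counter

# Tags observed in UCSC_RefGene_Group according to the notes above.
TAGS = ["TSS200", "TSS1500", "Body", "5'UTR", "3'UTR", "1stExon"]

def count_tags(refgene):
    """Return one count per tag for a ';'-separated string (missing -> all zeros)."""
    counts = Counter(refgene.split(";")) if isinstance(refgene, str) else Counter()
    return {tag: counts[tag] for tag in TAGS}

print(count_tags("5'UTR;1stExon;1stExon;5'UTR;1stExon;5'UTR"))
print(count_tags(None))
```

Splitting on `;` compares whole tags, so a tag can never be counted as a substring of another tag.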


In [12]:
# feature has 2 categories: classes (celltype_specific or not)
# and (promoter / gene / non-gene / unclassified)
# print(train['feature'].unique())

[nan 'Promoter_Associated' 'Unclassified' 'Gene_Associated'
 'Unclassified_Cell_type_specific'
 'Promoter_Associated_Cell_type_specific' 'NonGene_Associated'
 'Gene_Associated_Cell_type_specific'
 'NonGene_Associated_Cell_type_specific']
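
The two factors hidden in `feature` can be separated by stripping the `_Cell_type_specific` suffix and keeping the remainder as the group. A minimal sketch, assuming missing values arrive as `None` or NaN; `split_feature` is a hypothetical helper.

```python
SUFFIX = "_Cell_type_specific"

def split_feature(feature):
    """Split a Regulatory_Feature_Group value into (group, cell_type_specific flag)."""
    if not isinstance(feature, str):          # None or NaN: no annotation
        return None, 0
    if feature.endswith(SUFFIX):
        return feature[: -len(SUFFIX)], 1
    return feature, 0

for value in ["Gene_Associated", "NonGene_Associated_Cell_type_specific", float("nan")]:
    print(value, "->", split_feature(value))
```

Comparing the whole group name matters here, because `Gene_Associated` is a substring of `NonGene_Associated`.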


In [13]:
# relation to island is one of 5 levels or unknown
print(list(train['rel_to_island'].unique()), '\n')


['Island', nan, 'S_Shore', 'N_Shelf', 'N_Shore', 'S_Shelf'] 



---

## Encoding sequences

There are shorter regions (60 bp upstream and downstream of the CpG site) and longer sequences (2 kbp).

The shorter sequence always has 60 bp upstream, then "[CG]", then 60 bp downstream.
```
print(df['short_l'][1])
print(set(len(x) for x in df['short_l']), set(len(x) for x in df['short_r']))
> GGACCACACTGCCATGGCAACAGCGTGCCTCTGCGTCCTCCATCCGGGCCTCTCTAACTA
> {60} {60}

```
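
The `short_l` and `short_r` columns used above are not constructed in this section. One way they might be derived from the marked forward sequence is sketched here on a synthetic string; `split_at_cpg` is a hypothetical helper, and the toy sequence is made up.

```python
import re

def split_at_cpg(fwd_seq):
    """Return the (upstream, downstream) flanks of a sequence with the site marked as [CG]."""
    upstream, site, downstream = re.split(r"\[|\]", fwd_seq)
    if site != "CG":
        raise ValueError(f"expected a marked CG site, got {site!r}")
    return upstream, downstream

# Synthetic example: 60 bp, the marked site, then another 60 bp.
toy = "ACGT" * 15 + "[CG]" + "TTGA" * 15
short_l, short_r = split_at_cpg(toy)
print(len(short_l), len(short_r))
```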

The longer sequence contains the shorter one near its center, always starting at the same position (index 939, 0-based).
```
import re

for x in range(10):
    # strip the brackets around the CpG site, then locate the short sequence in the long one
    pos = df['seq'][x].find(re.sub(r'[\[\]]', '', df['fwd_seq'][x]))
    print(pos)
```
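
To make the offset arithmetic concrete, here is the same lookup on a synthetic pair of sequences built so that the short region starts at index 939 of a 2000 bp string. The sequences are invented for illustration only.

```python
import re

left, right = "G" * 60, "A" * 60
fwd_seq = left + "[CG]" + right                          # 124 characters, site marked
seq = "T" * 939 + left + "CG" + right + "T" * 939        # 2000 bp, short region embedded

plain = re.sub(r"[\[\]]", "", fwd_seq)                   # drop the brackets: 122 bp
offset = seq.find(plain)                                 # start of the short region
cpg_index = offset + 60                                  # index of the C in the CpG

print(len(seq), offset, cpg_index, seq[cpg_index:cpg_index + 2])
# → 2000 939 999 CG
```

With a start at index 939 and 60 bp of upstream flank, the C of the CpG sits at index 999, leaving the CG dinucleotide at the midpoint of a 2000 bp sequence.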

----

In [34]:
test.info()

Unnamed: 0,position,island,refgene,rel_to_island,feature,fwd_seq,seq
0,93862594.0,chr11:93861560-93862773,1stExon,Island,,AGCCCACGGAGCCCAAGTTCAAGGGGCTGCGACTGGAGCTGGCTGT...,CACACTCACTACCGTTTCCGCGCCACCCTCTCACGCGGAGCTCCTG...
1,17756435.0,chr11:17756056-17758286,TSS1500;TSS1500,Island,,GCCCGGGCGGCAGCAGCGTCGCGGCGGCGGCGGCAGCGGCCGCTCC...,CCACCCCCAGGACCAACTGTGAAGATGGAGGAATCACAGCAAGCAG...
2,66024941.0,chr11:66024910-66026253,TSS200;TSS1500;5'UTR;TSS1500;1stExon,Island,Promoter_Associated,TGTACTCCAGGAAGCCCAACCTTCTCCCTGCGCCTCAGTTTCTCCC...,CTCAGCTACTGGAGAGGTTGAGGCAGGAGAATTGCTTGAACCCTGG...
3,69634240.0,chr11:69632033-69634710,TSS200,Island,,GCTCTGAAAGGTCCCCTACCCGCCCCTCCCCCGCCCCCTCCCCCAG...,CGAACCCCAGCGCGCGCTCTTCTCCCGGACGTGTAGGTTGAGGAGG...
4,85955689.0,chr11:85955808-85956517,TSS200;TSS200,N_Shore,Promoter_Associated,TTGCATTTCTATCTTCAAGGAAGAATTAGGTTATGAATAGTTCCGT...,ACTGATGTCCTGGTTATAAAGTTATCATGCAGAAATAATTCAAGTA...
...,...,...,...,...,...,...,...
20606,50220531.0,chr22:50220118-50220822,,Island,Promoter_Associated,GGGGTCCCGCTGCTACGGATTCTCAGAACTCCCGCGTCCGCTCACA...,ATGTTGAAACTAGATTGCAATTACCTACAGAAAGGAATAGGAGAGA...
20607,50239170.0,chr22:50241735-50243918,,N_Shelf,,TTTACGGAGACGTCTTCATGTAGGCATATGGCGCATTAGGAACCGC...,ATACCTAGAGGATCTGCAAAGGAACAAAAATGATCATGCTCCCCTG...
20608,47054067.0,chr22:47054031-47054274,Body,Island,,CACCCCTGCCCCAGCATCCCAGAGTCGGGGTCGCGTGGACATGAGC...,TGGGGAGCGCCTGGGCAGGGTCTTGTCAGTGGCCCAGGCCTCTGAG...
20609,21386885.0,chr22:21386493-21387000,TSS200,Island,Unclassified_Cell_type_specific,GCTGCCGCAGCCGCTGCCTCCGCTCTGAGCACTGAGCCCGCCCAGT...,GAGCACAGCAGGGCCAGCCACCTCCTTGGCCACGGCACCTGTGAGC...


In [41]:
# make_dummies(test)
# make_kmer_freq(test)
# make_relative_positions(test)
# make_one_hot_seq(test)

Unnamed: 0,A0,C0,G0,T0,A1,C1,G1,T1,A2,C2,...,G117,T117,A118,C118,G118,T118,A119,C119,G119,T119
0,1,0,0,0,0,0,1,0,0,1,...,0,1,0,1,0,0,0,0,1,0
1,0,0,1,0,0,1,0,0,0,1,...,0,0,0,1,0,0,0,0,0,1
2,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,0,0,1,0,0,1,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
4,0,0,0,1,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20606,0,0,1,0,0,0,1,0,0,0,...,1,0,0,0,0,1,0,1,0,0
20607,0,0,0,1,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
20608,0,1,0,0,1,0,0,0,0,1,...,0,1,0,0,1,0,0,1,0,0
20609,0,0,1,0,0,1,0,0,0,0,...,1,0,1,0,0,0,0,0,0,1


In [42]:
X_test = make_all_features(test)

X_test.to_csv("data/test_X.csv")
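
`make_all_features` is called above but is not defined in this section. A plausible sketch of such a wrapper is shown below, assuming it simply runs each feature builder and joins the results column-wise. The `builders` argument and the toy builders are assumptions made for illustration; the notebook's own version takes only the data frame.

```python
import pandas as pd

def make_all_features(df, builders):
    """Run each feature builder on a copy of df and join the results column-wise."""
    parts = [build(df.copy()).reset_index(drop=True) for build in builders]
    return pd.concat(parts, axis=1)

# Toy builders standing in for make_relative_positions, make_dummies, and so on.
def doubled(df):
    return pd.DataFrame({"a_x2": df["a"] * 2})

def lengths(df):
    return pd.DataFrame({"seq_len": df["seq"].str.len()})

toy = pd.DataFrame({"a": [1, 2], "seq": ["ACGT", "AC"]})
X = make_all_features(toy, [doubled, lengths])
print(X)
```

Resetting the index on each part keeps the rows aligned even if a builder returns a frame with a different index.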

---

## Feature engineering code

In [4]:
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


def make_dummies(df):
    """Make various dummies for (relation to island), (refgene group), (regulatory features)"""
    dfx = df.copy()
    
    # get dummies for "Relation_to_UCSC_CpG_Island": 5 levels
    dfx = pd.get_dummies(dfx, columns =['rel_to_island'], prefix_sep = '', prefix = '')
    
    # pull terms from 'UCSC_RefGene_Group' lists into columns of counts
    for term in ["TSS200", "TSS1500", "Body", "5'UTR", "3'UTR", "1stExon"]:
        dfx[term] = dfx["refgene"].str.count(term)
        dfx[term] = dfx[term].fillna(0).astype('int32')
    
    # create 2 sets of dummies from 'feature' (Regulatory_Feature_Group)
    dfx["cell_type_specific"] = dfx['feature'].str.count("_Cell_type_specific").fillna(0).astype('int32')
    for term in ["Gene_Associated", "NonGene_Associated", "Promoter_Associated", "Unclassified"]:
        # anchor at the start so "Gene_Associated" does not also match "NonGene_Associated"
        dfx[term] = dfx['feature'].str.count("^" + term).fillna(0).astype('int32')
    
    dfx = dfx.drop(columns = ['position', 'island', 'refgene', 'feature', 'fwd_seq', 'seq'])
    return(dfx)

def make_relative_positions(df):
    """ The 'island' column has position info, eg: "chr1:2004858-2005346"
        We want to get the position of the CpG site relative to the start and stop of the island
        Many rows don't have island data so add a dummy to indicate whether it exists
    """
    dfx = df.copy()
    # dummy variable for whether has 'island' or NA
    dfx['has_island'] = np.where(dfx['island'].isna(), 0, 1)   
        
    # position of CpG relative to nearby island start position (lots of missing values though)
    dfx['isl_start'] = dfx['island'].str.extract(r':(\d+)', expand=False).astype('float64')
    dfx['dist_start'] = dfx['isl_start'] - dfx['position']
    dfx['dist_start'] = dfx['dist_start'].fillna(0)
    
    # same for distance to end of island
    dfx['isl_end'] = dfx['island'].str.extract(r'-(\d+)', expand=False).astype('float64')
    dfx['dist_end'] = dfx['isl_end'] - dfx['position']
    dfx['dist_end'] = dfx['dist_end'].fillna(0)
    
    # return distance columns
    return(dfx[['has_island', 'dist_start', 'dist_end']])


## with help from: https://www.kaggle.com/thomasnelson/working-with-dna-sequence-data-for-ml
def make_kmer_freq(df):
    """returns vectorized kmer frequency features of 2kbp region ('seq') as dataframe"""
    
    def get_kmers(dna, k=6):
        """creates list of kmers from dna seq"""
        dna = dna.upper()
        kmers = [dna[x:x+k] for x in range(len(dna)+1-k)]
        kmers = ' '.join(kmers)
        return(kmers)
    
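    # build one space-separated "sentence" of k-mers per row
    mers = df.apply(lambda x: get_kmers(x['seq'], 6), axis = 1)
    tfidf = TfidfVectorizer() 
    X = tfidf.fit_transform(mers)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    kmers = tfidf.get_feature_names_out()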
    kmer_df = pd.DataFrame(X.toarray(), columns=kmers)
    return(kmer_df)


def make_one_hot_seq(df):
    """
    Splits the region around CpG site in up + downstream
    Applies one-hot encoding to each sequence and returns 480 column df
    """
    
    def one_hot_encode_dna(dna):
        """ One-hot encode a single DNA sequence: 
        Requires creating two encoders: LabelEncoder to get from string to numeric, then OneHotEncoder
        Converts DNA to numeric then to one-hot matrix with shape: len(dna)*4
        """
        # create label encoder for DNA symbols (sorted classes: A=0, C=1, G=2, N=3, T=4)
        label_encoder = LabelEncoder() 
        label_encoder.fit(np.array(list('ACGTN')))
        
        # create one-hot encoder with fixed categories for A, C, G, T so that every
        # sequence yields exactly 4 columns; N (code 3) is ignored and becomes an all-zero row
        # (`sparse_output` was named `sparse` before scikit-learn 1.2)
        onehot_encoder = OneHotEncoder(categories=[[0, 1, 2, 4]], handle_unknown='ignore',
                                       sparse_output=False, dtype=int)
        
        # dna to numeric array
        dna = re.sub('[^ACGT]', 'N', dna.upper())
        dna = np.array(list(dna))
        dna_int = label_encoder.transform(dna) 
        dna_int = dna_int.reshape(len(dna_int), 1)
        
        # convert to one-hot
        dna_onehot = onehot_encoder.fit_transform(dna_int)
        return(dna_onehot)
    
    dfx = df.copy()
    # split the upstream and downstream seq around '[CG]'; rejoin the two halves
    dfx['fwd_seq_x'] = dfx['fwd_seq'].str.split(r'\[|\]', expand = True).apply(lambda x: x[0] + x[2], axis=1)
    
    # apply one_hot_encoding to sequence and flatten matrix into vector
    X1 = dfx.apply(lambda x: one_hot_encode_dna(x['fwd_seq_x']).flatten(), axis=1)
    
    # stack vectors into data frame with 480 columns (A0, C0, G0, T0, ..., T119), since [120 bp * 4 bases] are encoded.
    X1 = pd.DataFrame(np.column_stack(list(zip(*X1))), 
                      columns = list(x+str(y) for y in range(120) for x in 'ACGT'))
    
    return(X1)
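
A per-base one-hot layout only lines up across rows when every sequence is encoded against the same fixed alphabet, whatever bases it happens to contain. The sketch below shows that idea with plain NumPy, independent of the scikit-learn encoders used above; `one_hot` and `BASES` are hypothetical names, and mapping non-ACGT symbols to all-zero rows is one possible convention, not the only one.

```python
import numpy as np

BASES = "ACGT"

def one_hot(dna):
    """Encode a DNA string as a (len(dna), 4) 0/1 matrix; non-ACGT symbols give all-zero rows."""
    # Index of each symbol in BASES, or -1 for anything else (for example N).
    codes = np.array([BASES.find(b) for b in dna.upper()], dtype=int)
    out = np.zeros((len(codes), len(BASES)), dtype=int)
    known = codes >= 0
    out[np.flatnonzero(known), codes[known]] = 1
    return out

print(one_hot("AANT"))
print(one_hot("AAAA").flatten().shape)
```

Because the column order is fixed by `BASES`, a 120 bp flank always flattens to 480 values in the order A0, C0, G0, T0, A1, and so on.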



In [17]:
## Main ######

train = pd.read_csv('data/train.csv')

# give the variables shorter names
train = train.rename(columns={"Id": "id", 
                              "CHR": "chromosome", 
                              "MAPINFO": "position", 
                              "UCSC_CpG_Islands_Name": "island",  
                              "UCSC_RefGene_Group":"refgene",
                              "Relation_to_UCSC_CpG_Island": "rel_to_island",
                              "Regulatory_Feature_Group": "feature", 
                              "Forward_Sequence":"fwd_seq",
                              "Beta": "outcome"})


# change categorical variables dtypes
for col in ["rel_to_island", "outcome"]:
    train[col] = train[col].astype("category")

# change string variables dtypes
for col in ["fwd_seq", "seq", "refgene"]:
    train[col] = train[col].astype("str")

# position as float    
train['position'] = train['position'].astype('float64')
del(col)

# remove and save outcomes
Y = train['outcome']
Y.to_csv('data/train_Y.csv')

# drop unnecessary columns
train = train.drop(columns=['id', 'chromosome', 'outcome'])
print(train.info())

# make all the features and save data
print("Making training features ....")
position = make_relative_positions(train.copy())
print("Position df")
print(position.info())

dummies = make_dummies(train.copy())
print("Dummies df")
print(dummies.info())

one_hot = make_one_hot_seq(train.copy())
print("One hot 120bp sequence df")
print(one_hot.info())

kmers = make_kmer_freq(train.copy()) 
print("kmers df before removing kmers with n's")
print(kmers.shape)
kmers = kmers.loc[:,~kmers.columns.str.contains('n', case=False)] 
print("kmers df after removing kmers with n's")
print(kmers.shape)
kmers.info()

train_X = pd.concat([position, dummies, one_hot, kmers], axis=1)
print(train_X.info())

print("Saving training features ....")
train_X.to_csv("data/train_X.csv")

# del(train, train_X)
print("Done with training data")

In [21]:
####
print("\n\n\nStarting on test data...\n\n")
## Same thing for test data
test = pd.read_csv('data/test.csv')

test = test.rename(columns={"Id": "id", 
                            "CHR": "chromosome", 
                            "MAPINFO": "position",
                            "UCSC_CpG_Islands_Name": "island",  
                            "UCSC_RefGene_Group":"refgene",
                            "Relation_to_UCSC_CpG_Island": "rel_to_island",
                            "Regulatory_Feature_Group": "feature", 
                            "Forward_Sequence":"fwd_seq"})

# change categorical variables dtypes
test['position'] = test['position'].astype('float64')
test["rel_to_island"] = test["rel_to_island"].astype("category")

for col in ["fwd_seq", "seq", "refgene"]:
    test[col] = test[col].astype("str")

del(col)

# drop unneeded data
test = test.drop(columns=['id', 'chromosome'])

# make features
print("Making testing features ....")
# make all the features and save data
print("Making training features ....")
test_position = make_relative_positions(test.copy())
print("test_position df")
print(test_position.info())

test_dummies = make_dummies(test.copy())
print("test_dummies df")
print(test_dummies.info())

test_one_hot = make_one_hot_seq(test.copy())
print("test_one_hot (120bp sequence) df")
print(test_one_hot.info())

test_kmers = make_kmer_freq(test.copy()) 
print("test_kmers df before removing kmers with n's")
print(test_kmers.shape)
test_kmers = test_kmers.loc[:,~test_kmers.columns.str.contains('n', case=False)] 
print("test_kmers df after removing kmers with n's")
print(test_kmers.shape)
test_kmers.info()

test_X = pd.concat([test_position, test_dummies, test_one_hot, test_kmers], axis=1)
print(test_X.info())

# write to file
print("Saving testing features ...")
test_X.to_csv("data/test_X.csv")
# del(test, test_X)
print("Done")

Making testing features ....
Making training features ....
test_position df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20611 entries, 0 to 20610
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   has_island  20611 non-null  int64  
 1   dist_start  20611 non-null  float64
 2   dist_end    20611 non-null  float64
dtypes: float64(2), int64(1)
memory usage: 483.2 KB
None
test_dummies df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20611 entries, 0 to 20610
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Island               20611 non-null  uint8
 1   N_Shelf              20611 non-null  uint8
 2   N_Shore              20611 non-null  uint8
 3   S_Shelf              20611 non-null  uint8
 4   S_Shore              20611 non-null  uint8
 5   TSS200               20611 non-null  int32
 6   TSS1500              20611 non-null  int