In [2]:
#imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy.stats
import sklearn.preprocessing
import sklearn.model_selection
import sklearn.neighbors
import duckdb
import unittest

In [14]:
df = pd.read_csv("/Users/loganroberts/Learn2Therm/ValidProt/data/Sample.csv")

# Software Component Five: s5.0_relation.py

**Params:** 

**Inputs:** Pandas Dataframe containing Pfam return data. Includes quantitative features (ID, some metric of percent similarity) and a string of the amino acid sequence.

**Outputs:** Quantitative functional similarity metric.

**Metrics:**

**Packages:** pandas, numpy, scipy, seaborn, fuzzywuzzy, unittest

***
***

**Subcomponent 1**: Test for pandas dataframe input (**ALREADY TESTED WITH CODE**)

**Use case**: User takes data from component 4 (where data is processed into pandas dataframe) and wants to pass it into relationship component.

```

def check_input_type(dataframe):
    """
    Tests that input data is a pandas dataframe with an assert statement.
    Output should pass unless the assert statement fails.
    """
    assert "pandas.core.frame.DataFrame" in str(type(dataframe))

```

In [12]:
#code

def check_input_type(dataframe):
    # isinstance is more robust than matching on the type's string representation
    assert isinstance(dataframe, pd.DataFrame), 'Not a pandas dataframe!'

**Test**: Passing a non-dataframe input (a list) should raise an AssertionError.

In [13]:
#test code

import unittest

#unit tests - function 1 
class TestInputType(unittest.TestCase):
    
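    def test_input_type(self): 
        """
        Tests that non-dataframe input raises an AssertionError.
        
        """
        # assertRaises fails the test if no AssertionError is raised.
        # (A try/except around assertTrue(False) would swallow its own failure.)
        with self.assertRaises(AssertionError):
            check_input_type([4,3])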
        
suite = unittest.TestLoader().loadTestsFromTestCase(TestInputType)
_ = unittest.TextTestRunner().run(suite)

.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK


***
***

**Subcomponent 2**: Checks that input data is cleaned properly (does it have all of the features we need, and are the features we don't need removed?).

**Use case**: Input data does not include local E value, which we need as an input to our model.

(**NOT TESTED WITH CODE**)
```
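def check_input_strings(dataframe):
    # 'badstring' and 'goodstring' are placeholders for real column names

    # Drop a column we do not need, if it is present
    if 'badstring' in dataframe.columns:
        dataframe = dataframe.drop(columns='badstring')

    # Fail loudly if a column we do need is absent
    if 'goodstring' not in dataframe.columns:
        raise KeyError("Dataframe missing required column 'goodstring'")

    return dataframe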
```

In [22]:
#CODE 

def clean_input_columns(dataframe):
    """
    We want to clean certain columns out of the Pfam dataframe. For now, let's ensure that the dataframe 
    is missing 'Unnamed: 0', 'meso_seq', 'thermo_seq', and 'prot_pair_index'.
    
    Input: Pandas dataframe (from Pfam)
    Output: Updated dataframe.
    """
    
    for title in ['Unnamed: 0', 'meso_seq', 'thermo_seq', 'prot_pair_index']:
        # Drop each unwanted column only if it is present
        if title in dataframe:
            dataframe = dataframe.drop(columns=title)
    
    return dataframe

In [40]:
#CODE 

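def check_input_columns(dataframe):
    """
    Checks that the dataframe contains every column the model needs.
    Raises a KeyError naming the first missing column.

    Input: Pandas dataframe
    Output: Pandas dataframe (unchanged)
    """
    for title in ['meso_ogt', 'thermo_ogt', 'scaled_local_symmetric_percent_id',
                  'local_E_value', 'scaled_local_query_percent_id', 'local_gap_compressed_percent_id']:

        if title not in dataframe:
            raise KeyError('Dataframe missing required column: {}'.format(title))

    return dataframe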

**Test**: 

1) 
```
import unittest

class TestMissingStrings(unittest.TestCase):

    def test_missing_strings(self):
        # A dataframe without the required column should raise KeyError
        dataframe = pd.DataFrame({'other_column': [1]})
        with self.assertRaises(KeyError):
            check_input_strings(dataframe)
```
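
The two column helpers above (`clean_input_columns` and `check_input_columns`) could be tested along the same lines. The cell below is only a sketch: the toy dataframes and their values are made up for illustration, and compact copies of the two helpers are included so the cell runs on its own. The second test mirrors the use case above (input data without a local E value).

```python
import unittest
import pandas as pd

REQUIRED = ['meso_ogt', 'thermo_ogt', 'scaled_local_symmetric_percent_id',
            'local_E_value', 'scaled_local_query_percent_id',
            'local_gap_compressed_percent_id']
UNWANTED = ['Unnamed: 0', 'meso_seq', 'thermo_seq', 'prot_pair_index']

def clean_input_columns(dataframe):
    # Compact copy of the helper above: drop unwanted columns if present
    return dataframe.drop(columns=[c for c in UNWANTED if c in dataframe])

def check_input_columns(dataframe):
    # Compact copy of the helper above: raise KeyError on a missing column
    for title in REQUIRED:
        if title not in dataframe:
            raise KeyError(title)
    return dataframe

class TestInputColumns(unittest.TestCase):

    def test_unwanted_columns_removed(self):
        df = pd.DataFrame({'meso_seq': ['MKV'], 'local_E_value': [1e-5]})
        cleaned = clean_input_columns(df)
        self.assertNotIn('meso_seq', cleaned.columns)
        self.assertIn('local_E_value', cleaned.columns)

    def test_missing_required_column_raises(self):
        # Build a complete dataframe, then remove the local E value
        df = pd.DataFrame({name: [1.0] for name in REQUIRED})
        df = df.drop(columns='local_E_value')
        with self.assertRaises(KeyError):
            check_input_columns(df)

    def test_complete_dataframe_passes(self):
        df = pd.DataFrame({name: [1.0] for name in REQUIRED})
        self.assertIs(check_input_columns(df), df)

suite = unittest.TestLoader().loadTestsFromTestCase(TestInputColumns)
result = unittest.TextTestRunner().run(suite)
```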

In [33]:
#CODE

def check_input_NANs(dataframe):
    """
    Checks for NaN values in input dataframe. Removes rows with NaN values present.

    Input: Pandas dataframe
    Output: Pandas dataframe

    """
    has_nan = dataframe.isna().any().any()
    nan_rows = dataframe[dataframe.isna().any(axis=1)]

    if has_nan:
        print('Dataframe has {} rows with NaN values!'.format(len(nan_rows)))
    else:
        print("DataFrame does not have any NaN values.")

    #Drop rows with NaN's
    dataframe = dataframe.dropna()
    print('Dataframe now has {} rows.'.format(len(dataframe)))

    return dataframe

**Test**: 1)
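
One possible test for `check_input_NANs`, sketched under assumptions: the toy data is made up, and the function is reduced to a compact copy (print statements omitted) so the cell runs on its own.

```python
import unittest
import numpy as np
import pandas as pd

def check_input_NANs(dataframe):
    # Compact copy of the function above: drop every row containing a NaN
    return dataframe.dropna()

class TestInputNANs(unittest.TestCase):

    def test_rows_with_nan_are_dropped(self):
        # Made-up data: the second row is missing its E value
        df = pd.DataFrame({'meso_ogt': [30.0, 35.0, 28.0],
                           'local_E_value': [1e-5, np.nan, 1e-8]})
        cleaned = check_input_NANs(df)
        self.assertEqual(len(cleaned), 2)
        self.assertFalse(cleaned.isna().any().any())

    def test_clean_dataframe_unchanged(self):
        df = pd.DataFrame({'meso_ogt': [30.0, 35.0]})
        self.assertTrue(check_input_NANs(df).equals(df))

suite = unittest.TestLoader().loadTestsFromTestCase(TestInputNANs)
result = unittest.TextTestRunner().run(suite)
```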

In [39]:
#(**NOT TESTED WITH CODE**)

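def verify_protein_pair(dataframe):
    """
    Checks that input data has two protein sequences. Will need to generalize this function to other data sets 
    to simply make sure two sequences are entered. Code below is for our protein database.
    """
    # Assumes the sequences are stored in the 'meso_seq' and 'thermo_seq' columns
    assert 'meso_seq' in dataframe, 'Dataframe missing mesophilic sequence!'
    assert 'thermo_seq' in dataframe, 'Dataframe missing thermophilic sequence!'

    # Every row needs both sequences, so the non-null counts must match
    sequence1 = dataframe['meso_seq'].dropna()
    sequence2 = dataframe['thermo_seq'].dropna()

    if len(sequence1) != len(sequence2):
        raise ValueError('Each protein pair needs both sequences!')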

**Test**: 

1) 
```
import unittest

class TestProteinPairs(unittest.TestCase):

    def test_protein_pair(self):
        # A pair with one sequence missing should raise ValueError
        dataframe = pd.DataFrame({'meso_seq': ['MKV'], 'thermo_seq': [None]})
        with self.assertRaises(ValueError):
            verify_protein_pair(dataframe)
```

***
***

**Subcomponent 3**: Train the model with sample data. (**NOT TESTED WITH CODE**)

**Use case**:

```
def train_model(dataframe):
    """
    Split data into dev and test (0.8/0.2 for now).
    Train model (KNN or linear regression for now).
    Output: print('Training successful!')
    """
    import scipy, numpy
```

**Test**: 

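1) assert abs(len(dev_data) - 0.8*len(dataframe)) < 1 (the split size is rounded to a whole number of rows, so an exact float comparison can fail)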
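
A minimal sketch of how `train_model` might look with scikit-learn, assuming a KNN regressor and a 0.8/0.2 split. The `target` argument, the extra keyword arguments, and the synthetic columns `feature_a`, `feature_b`, and `target` are all hypothetical; this spec does not yet say which column the model predicts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def train_model(dataframe, target, test_size=0.2, n_neighbors=5, random_state=0):
    """Split into dev/test sets and fit a KNN regressor on the dev set."""
    dev_data, test_data = train_test_split(
        dataframe, test_size=test_size, random_state=random_state)
    model = KNeighborsRegressor(n_neighbors=n_neighbors)
    model.fit(dev_data.drop(columns=target), dev_data[target])
    print('Training successful!')
    return model, dev_data, test_data

# Synthetic stand-in data (hypothetical column names, not Pfam output)
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'feature_a': rng.normal(size=50),
    'feature_b': rng.normal(size=50),
})
sample['target'] = 2.0 * sample['feature_a'] + rng.normal(scale=0.1, size=50)

model, dev_data, test_data = train_model(sample, target='target')

# Split-size check that tolerates rounding of the test set size
assert abs(len(dev_data) - 0.8 * len(sample)) < 1
```

Returning the dev and test sets alongside the model keeps the split-size check possible from outside the function.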

***
***

**Subcomponent 4**: Test the model with sample data. (**NOT TESTED WITH CODE**)

**Use case**:

```
def test_model(dataframe):
    """
    Runs data through model (linear regression or KNN?).
    Output: Returns model_score, confusion matrix, MSE
    """
```

**Test**: 

***
***

**Subcomponent 5**: Run confidence test on model output. (**NOT TESTED WITH CODE**)

**Use case:**

```
def check_model_confidence(model_score, ci_data):
    """
    Runs a statistical test on model output and compares it to sample.
    Output: Returns a confidence score along with the model score
    """
```

**Test**: 
1) Run confidence test on some data for which we know the confidence score, and assert that the score is correct using numpy.isclose().
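
The test above can be sketched with a case where the answer is known in advance. The statistical test is not yet chosen in this spec, so the one-sample t-test below, and the use of its p-value as the "confidence score", are assumptions made only for illustration. The reference scores are made up.

```python
import numpy as np
import scipy.stats

def check_model_confidence(model_score, ci_data):
    """
    One-sample t-test of ci_data against model_score (an assumed choice of test).
    Returns the p-value as the confidence score, along with the model score.
    """
    result = scipy.stats.ttest_1samp(ci_data, popmean=model_score)
    return float(result.pvalue), model_score

# Known case: the sample mean equals the model score, so t = 0 and p = 1
reference_scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
confidence, score = check_model_confidence(3.0, reference_scores)

assert np.isclose(confidence, 1.0)
```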

***
***

**Subcomponent 6**: Calculate a 'functionality' metric that is the ultimate output of component five. This will factor in information from multiple software tools, not just Pfam. This will be built during spring quarter. (**NOT TESTED WITH CODE**)

**Use case:** We need to test that our protein pairs have a near maximal functionality score! This can be used as a basis for eventual user input scores.

```
def calculate_functionality(model_score, dataframe):
    """
    Runs user input data through some mathematical manipulation of their model score and input data.
    Output: returns a functionality score, print statement categorizing functionality score
    """
```

**Test**: 

***
***

# Plan Outline

1. Get data from component 4. This should already be in a pandas dataframe (data prep is included in C4)
2. Clean the data to prepare it for model training and testing
3. Train and test the model, return scores, MSE, and any other necessary indicator of model performance
4. Run confidence test on model output to determine quality of output
5. Input new user data and return a functionality score for the input protein pair