# Daily Log to SQL

This notebook helps clean up data from the daily logs and insert it into the Limblab MySQL database. To use it, you will need the database password, and you must either be connected to the VPN or be running remotely on Shrek or Donkey.


## Required Dependencies:

- sqlalchemy
- pymysql
- numpy
- pandas


### Linux specific
You'll need to run <code> sudo apt install libmysqlclient mysql-client-core </code>

### macOS specific
You'll need to run <code> brew install mysql </code>

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import getpass

dbName = "staging_db"
userName = "limblab"
# Prompt for the password instead of hard-coding it, so it never goes up to git.
# getpass hides the typed characters; a plain input() call would echo them.
password = getpass.getpass('mySQL limblab ')

mySQL limblab ········


### Make sure to update the filename and monkey name below:

Run either the Google Sheet cell **or** the Excel cell, not both.

In [2]:
monkeyName = "Crackle"
ccmID = "18E2"

#### For a Google Sheet

In [3]:
# Using a Google Sheet
sheetName = "DailyLog"
# file_id is the part of the sheet URL between "/d/" and the next "/"
file_id = "1BI5a4PnZRNB4o2v8I6Sjg9MaLDjywiS5ZYsRcX80v4g"
googleURL = f"https://docs.google.com/spreadsheets/d/{file_id}/export?gid=506541297&format=csv&sheet={sheetName}"

print(googleURL)

log = pd.read_csv(googleURL)

https://docs.google.com/spreadsheets/d/1BI5a4PnZRNB4o2v8I6Sjg9MaLDjywiS5ZYsRcX80v4g/export?gid=506541297&format=csv&sheet=DailyLog


#### For an Excel file

You can use forward slashes in the path even on Windows. Otherwise, double every backslash (write `\\` for each `\`), because Python treats a single backslash inside a string as the start of an escape sequence.

In [None]:
# Using an Excel file
sheetName = "DailyLog"
fileName = "C:/Users/17204/Downloads/Rocket.xlsx"
log = pd.read_excel(fileName, sheet_name=sheetName)
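As an illustration of the backslash note above, here is a small sketch (not part of the original workflow) showing three equivalent ways to write the same Windows path. It uses only the standard library, and the path is the example one from the cell above.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are taken literally, so no doubling is needed.
raw_path = r"C:\Users\17204\Downloads\Rocket.xlsx"

# Regular string: every backslash has to be doubled.
escaped_path = "C:\\Users\\17204\\Downloads\\Rocket.xlsx"

# Forward-slash form, derived from the raw string.
forward_path = PureWindowsPath(raw_path).as_posix()

print(raw_path == escaped_path)
print(forward_path)
```

Any of the three can be passed to `pd.read_excel`. Note that the same path written with single backslashes in a regular (non-raw) string would not even compile, because `\U` starts a Unicode escape.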


### Let's inspect the logs

Most likely we will only need to remove dates that have no useful information filled in, but double-check that nothing unexpected is going on.

In [4]:
log.info()
log.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 886 entries, 0 to 885
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Unnamed: 0                    886 non-null    object 
 1   Date                          886 non-null    object 
 2   Weight                        221 non-null    float64
 3   Start time                    213 non-null    object 
 4   End time                      208 non-null    object 
 5   H2O (lab)                     243 non-null    float64
 6   H20 (bottle)                  381 non-null    float64
 7   H2O (total)                   481 non-null    float64
 8   Avg H2O intake                475 non-null    float64
 9   Required Daily                221 non-null    float64
 10  Required Average              221 non-null    float64
 11  Pulse size (reg, jackpot, %)  63 non-null     object 
 12  Reward                        142 non-null    float64
 13  Abort

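| | Unnamed: 0 | Date | Weight | Start time | End time | H2O (lab) | H20 (bottle) | H2O (total) | Avg H2O intake | Required Daily | ... | Fail | Incompl | Time doing task | Lab no. | Person running | Behavioral Notes | Health Notes | Cleaned | Other Notes | Unnamed: 23 |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | Mon | 7/30/18 | | | | | 500.0 | 500.0 | | | ... | | | | | | | | | | |
| 1 | Tue | 7/31/18 | | | | | 500.0 | 500.0 | | | ... | | | | | | | | | | |
| 2 | Wed | 8/1/18 | | | | | 500.0 | 500.0 | | | ... | | | | | | | | | | |
| 3 | Thu | 8/2/18 | | | | | 500.0 | 500.0 | | | ... | | | | | | | | | | |
| 4 | Fri | 8/3/18 | | | | | 500.0 | 500.0 | | | ... | | | | | | | | | | |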


### Remove unneeded fields

The fields for the daily logs are:

| Field | | Datatype |
| :-: | :-: | :-: |
| **rec_date** | | date |
| **monkey_id** | | varchar(10) |
| **weight** | | int |
| **start_time** | | time |
| **end_time** | | time |
| **h2o_lab** | | int |
| **h2o_home** | | int |
| **treats** | | varchar(40) |
| **lab_num** | | varchar(10) |
| **num_reward** | | int |
| **num_abort** | | int |
| **num_fail** | | int |
| **num_incomplete** | | int |
| **behavior_notes** | | varchar(1000) |
| **behavior_quality** | | enum: 'bad','ok','good' |
| **health_notes** | | varchar(1000) |
| **cleaned** | | bool/tinyint(1) |
| **other_notes** | | varchar(1000) |
| **day_key** | | int |
| **experiment** | | varchar(1000) |
| **experimentor** | | varchar(50) |


Drop any fields that don't align with these, then rename the remaining columns to match the database field names.

In [5]:
# list of columns. You will need to change these to match the current dataframe columns
dropCols = ['Unnamed: 0', 'H2O (total)', 'Required Daily', 'Avg H2O intake', 'Required Average', 'Time doing task']

log.drop(columns = dropCols, inplace=True)

# rename remaining columns to match the database names
# should be a dictionary of {old_name:new_name}
renameCols = {'Date':'rec_date',
             'Weight':'weight',
             'Start time':'start_time',
             'End time':'end_time',
             'H2O (lab)': 'h2o_lab',
             'H20 (bottle)': 'h2o_home',
             'Supplementary Treats':'treats',
             'Lab no.':'lab_num',
             'Reward':'num_reward',
             'Abort':'num_abort',
             'Fail':'num_fail',
             'Incompl':'num_incomplete',
             'Behavioral Notes':'behavior_notes',
             'Health Notes':'health_notes',
             'Cleaned':'cleaned',
             'Other Notes':'other_notes',
             'Person running':'experimentor',
             'Unnamed: 23':'experiment'}
log.rename(columns = renameCols, inplace=True)
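After dropping and renaming, it can help to compare the resulting column names against the field table above. The sketch below is one possible check, not part of the original notebook: `compare_columns` and `DB_FIELDS` are hypothetical names, and the field list is simply copied from the table. The example columns are made up to show what a leftover spreadsheet column looks like.

```python
# Field names copied from the daily-log table above.
DB_FIELDS = [
    "rec_date", "monkey_id", "weight", "start_time", "end_time",
    "h2o_lab", "h2o_home", "treats", "lab_num", "num_reward",
    "num_abort", "num_fail", "num_incomplete", "behavior_notes",
    "behavior_quality", "health_notes", "cleaned", "other_notes",
    "day_key", "experiment", "experimentor",
]


def compare_columns(actual, expected):
    """Return (extra, missing): names only in `actual`, names only in `expected`."""
    actual_set, expected_set = set(actual), set(expected)
    extra = sorted(actual_set - expected_set)
    missing = sorted(expected_set - actual_set)
    return extra, missing


# Made-up example: two valid columns plus one leftover spreadsheet column.
example_columns = ["rec_date", "weight", "Pulse size (reg, jackpot, %)"]
extra, missing = compare_columns(example_columns, DB_FIELDS)
print("not in the database:", extra)
print("not in the dataframe:", missing)
```

In the notebook this would be called as `compare_columns(log.columns, DB_FIELDS)`. Columns reported as extra probably need to be added to `dropCols` or `renameCols`. Columns reported as missing may be fine if they are filled in later (for example `monkey_id`) or are optional in the database, which this sketch cannot know.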



### Remove invalid days

We don't want entries from days when we didn't record. To that end, we remove any row that has neither a start time nor a lab water amount (h2o_lab). This is a boolean AND: a row is dropped only if both fields are missing, so a row with either one filled in is kept just to be safe.

In [6]:
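# Rows where both start_time and h2o_lab are missing.
# Index labels are used (not positions) so this still works if the index is not 0..n-1.
dropRows = log.index[log[['start_time', 'h2o_lab']].isnull().all(axis=1)]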

log.drop(index = dropRows, inplace=True)

log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 7 to 871
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   rec_date                      244 non-null    object 
 1   weight                        212 non-null    float64
 2   start_time                    213 non-null    object 
 3   end_time                      208 non-null    object 
 4   h2o_lab                       243 non-null    float64
 5   h2o_home                      144 non-null    float64
 6   Pulse size (reg, jackpot, %)  63 non-null     object 
 7   num_reward                    142 non-null    float64
 8   num_abort                     102 non-null    float64
 9   num_fail                      90 non-null     float64
 10  num_incomplete                89 non-null     float64
 11  lab_num                       122 non-null    float64
 12  experimentor                  180 non-null    object 
 13  behav
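To see which rows the filter above removes, here is a small sketch on a made-up frame (the values and index labels are invented for illustration). It uses a boolean mask, so the result does not depend on the index being a default 0..n-1 range.

```python
import pandas as pd

# Made-up data: labels 10-12 stand in for arbitrary row labels.
toy = pd.DataFrame(
    {
        "start_time": ["9:00", None, None],
        "h2o_lab": [150.0, 120.0, None],
    },
    index=[10, 11, 12],
)

# True only where BOTH columns are missing.
both_missing = toy[["start_time", "h2o_lab"]].isnull().all(axis=1)

kept = toy[~both_missing]
print(kept)
```

Row 11 is kept even though its start time is missing, because it still has a lab water value. Only row 12, with both fields empty, is removed.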

### Change datatypes according to what is needed

Convert each column to the datatype listed in the field table above.

In [7]:
log['rec_date'] = pd.to_datetime(log['rec_date'])
log['rec_date'] # it's good to do some sanity checking to make sure these worked alright

7     2018-08-06
8     2018-08-07
9     2018-08-08
11    2018-08-10
14    2018-08-13
         ...    
864   2020-12-10
868   2020-12-14
869   2020-12-15
870   2020-12-16
871   2020-12-17
Name: rec_date, Length: 244, dtype: datetime64[ns]
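As one way to do the sanity check mentioned above, the sketch below passes an explicit format and `errors="coerce"` so that unparseable entries become `NaT` and can be listed. It assumes the log uses month/day/two-digit-year dates, as in the rows shown earlier. The sample values are made up.

```python
import pandas as pd

# Made-up sample in the same M/D/YY style as the log, plus one bad entry.
sample = pd.Series(["7/30/18", "8/1/18", "not a date"])

# An explicit format avoids day/month guessing; coerce turns failures into NaT.
parsed = pd.to_datetime(sample, format="%m/%d/%y", errors="coerce")

# Original strings that failed to parse, for manual inspection.
bad_rows = sample[parsed.isna()]
print(parsed)
print(bad_rows)
```

If `bad_rows` is empty for the real `rec_date` column, every date was parsed with the stated format.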

In [8]:
# add the monkeyID
log['monkey_id'] = ccmID

In [9]:
log['cleaned'].value_counts()

x      53
Yes    11
X      11
No      7
yes     6
Name: cleaned, dtype: int64

In [10]:
# Clean up the 'cleaned' column: any of the "yes" markers becomes True;
# everything else ('No' and blank cells) becomes False.
YESs = ['Yes','yes','X','x']
log['cleaned'] = log['cleaned'].isin(YESs)
log['cleaned'].value_counts(dropna=False)

False    163
True      81
Name: cleaned, dtype: int64
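The cell above maps both 'No' and blank cells to False. If it matters whether a day was explicitly marked 'No' or simply left empty, an alternative is sketched below. It is an assumption that this distinction is wanted and that the database column allows missing values. The sample data is made up.

```python
import pandas as pd

# Made-up sample covering the spellings seen in the log, a blank, and stray spaces.
raw = pd.Series(["x", "Yes", "X", "No", "yes", None, " yes "])

# Normalize case and whitespace so one small mapping covers every spelling.
normalized = raw.str.strip().str.lower()

# Unmapped or blank entries stay missing instead of silently becoming False.
cleaned = normalized.map({"yes": True, "x": True, "no": False})
print(cleaned)
```

With this approach, any unexpected spelling also shows up as missing, which makes it easy to spot with `cleaned.isna()`.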

In [11]:
# Use the nullable integer dtype (Int64) so missing values survive the cast
intCols = ['h2o_lab', 'h2o_home', 'num_reward', 'num_abort', 'num_fail', 'num_incomplete']
log = log.astype({col: pd.Int64Dtype() for col in intCols})
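The casts above use `pd.Int64Dtype()` rather than plain `int` because these columns contain missing values. A minimal sketch of the difference, on made-up data:

```python
import pandas as pd

# Made-up counts with one missing day.
counts = pd.Series([12.0, None, 7.0])

# A plain int cast cannot represent the missing value and raises.
try:
    counts.astype(int)
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable Int64 dtype keeps the gap as <NA>.
nullable = counts.astype(pd.Int64Dtype())
print(plain_cast_failed)
print(nullable)
```

Keeping the gaps as missing values, rather than filling them with 0, should let them reach the database as NULL instead of a misleading zero.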

In [12]:
log.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 7 to 871
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   rec_date                      244 non-null    datetime64[ns]
 1   weight                        212 non-null    float64       
 2   start_time                    213 non-null    object        
 3   end_time                      208 non-null    object        
 4   h2o_lab                       243 non-null    Int64         
 5   h2o_home                      144 non-null    Int64         
 6   Pulse size (reg, jackpot, %)  63 non-null     object        
 7   num_reward                    142 non-null    Int64         
 8   num_abort                     102 non-null    Int64         
 9   num_fail                      90 non-null     Int64         
 10  num_incomplete                89 non-null     Int64         
 11  lab_num                       12

In [13]:
log['health_notes'].value_counts()

Good.                                                                                                         4
Looks good                                                                                                    4
Same as before. Neck line slightly open. Applied TAB to area                                                  3
Head has a little goop again. Will clean again                                                                2
Head had a little goop, cleaned with betadine                                                                 2
Looking good                                                                                                  2
Sedated for sensory mapping. Seems like he’s been trying to pick at his back again. Put jacket on tighter     1
All looks good, he may be scratching at the other side of his head for some reason                            1
Back looks pretty good                                                                                  

### Export to MySQL database

You will need to create an SSH tunnel using:

<code>ssh -N -L 3306:localhost:3306 {Username}@{hostname}</code>
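Before creating the engine, it can be useful to confirm that something is listening on the forwarded port. The helper below is a hypothetical convenience, not part of the original notebook. It only checks that a TCP connection to the local port succeeds, not that MySQL is actually behind it.

```python
import socket


def port_is_open(host="127.0.0.1", port=3306, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("tunnel reachable:", port_is_open())
```

If this reports False, check that the `ssh` command above is still running in another terminal.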

In [14]:
# Connect through the SSH tunnel on local port 3306
engine = create_engine(f"mysql+pymysql://{userName}:{password}@127.0.0.1:3306/{dbName}")

# Uncomment to append the cleaned log to the 'days' table
# log.to_sql('days', engine, index=False, if_exists="append")
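One caveat with building the connection string by hand: if the password contains characters such as `@`, `/`, or `:`, the URL will be parsed incorrectly unless they are percent-encoded. The sketch below shows one way to handle that with the standard library. `build_mysql_url` is a hypothetical helper and the example password is made up.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="127.0.0.1", port=3306, database="staging_db"):
    """Build a mysql+pymysql URL with the user and password percent-encoded."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up password containing characters that would otherwise break the URL.
example_url = build_mysql_url("limblab", "p@ss/word:1")
print(example_url)
```

In the notebook, the result could be passed straight to the engine, for example `create_engine(build_mysql_url(userName, password, database=dbName))`.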