# Main Script for Final Project

### T. Martz-Oberlander, 2015-11-10
### Change in pitch of a pipe organ from CO2

   This script looks for mathematical relationships between changes in CO2 concentration and changes in the pitch of a pipe organ. It loads and cleans the data, organizes new dataframes, creates figures, and performs statistical tests on the relationship between CO2 concentration and the frequency of sound from a note played on a pipe organ.
   
   Here I pursue data analysis route 1 (as mentioned in my notebook.md file), which compares one pitch dataframe with one dataframe of environmental characteristics taken at one sensor location. The two dataframes are matched by the time at which the data were recorded.

In [11]:
# I import useful libraries (with functions) so I can visualize my data
# I use Pandas because this dataset has string column titles, and I like the readable commands and the polished visual output that Pandas offers

import pandas as pd
import matplotlib.pyplot as plt
import re
import numpy as np

%matplotlib inline

#I want to be able to scroll through this notebook easily, so I limit how many rows of each dataframe are displayed
pd.set_option('display.max_rows', 10)

### Uploading data into Python

   First I upload my data sets. I am working with two: one for pitch measurements and another for environmental characteristics (CO2 (ppm), temperature (deg C), and relative humidity (RH) (%) measurements). My data come from environmental sensing logger devices in the "Choir Division" section of the organ console.

In [32]:
#I import the environmental characteristics data file

#comment by nick: I changed the path the data is loaded from, making it compatible with cloned copies of your project
env_choir_div=pd.read_table('../Data/CO2May.csv', sep=',')

#comment by nick: here I am reassigning column names to remove blank spaces and units
#the names go in a single flat list (a nested list can create multi-level column labels in newer pandas versions)
env_choir_div.columns=['test','Date_time','temp','RH','CO2_1_ppm','CO2_2_ppm']

#I display my dataframe
env_choir_div

Unnamed: 0,test,Date_time,temp,RH,CO2_1_ppm,CO2_2_ppm
0,1,04/17/10 11:00:00 AM,20.650,35.046,452.4,689.9
1,2,04/17/10 11:02:00 AM,20.579,35.105,450.5,677.0
2,3,04/17/10 11:04:00 AM,20.507,35.229,450.5,663.6
3,4,04/17/10 11:06:00 AM,20.460,35.291,448.7,652.0
4,5,04/17/10 11:08:00 AM,20.412,35.352,442.0,641.0
...,...,...,...,...,...,...
10853,10854,2005-02-10 12:46,21.581,44.604,501.2,483.5
10854,10855,2005-02-10 12:48,21.581,44.604,504.3,482.9
10855,10856,2005-02-10 12:50,21.581,44.604,503.7,482.3
10856,10857,2005-02-10 12:52,21.604,44.575,503.1,481.7


In [35]:
#comment by nick: changing your date-time variable to actual time values
env_choir_div['Date_time']= pd.to_datetime(env_choir_div['Date_time'])

#print the new table and the type of data. 
print(env_choir_div)

env_choir_div.dtypes

        test           Date_time    temp      RH  CO2_1_ppm  CO2_2_ppm
0          1 2010-04-17 11:00:00  20.650  35.046      452.4      689.9
1          2 2010-04-17 11:02:00  20.579  35.105      450.5      677.0
2          3 2010-04-17 11:04:00  20.507  35.229      450.5      663.6
3          4 2010-04-17 11:06:00  20.460  35.291      448.7      652.0
4          5 2010-04-17 11:08:00  20.412  35.352      442.0      641.0
...      ...                 ...     ...     ...        ...        ...
10853  10854 2005-02-10 12:46:00  21.581  44.604      501.2      483.5
10854  10855 2005-02-10 12:48:00  21.581  44.604      504.3      482.9
10855  10856 2005-02-10 12:50:00  21.581  44.604      503.7      482.3
10856  10857 2005-02-10 12:52:00  21.604  44.575      503.1      481.7
10857  10858 2005-02-10 12:54:00  21.604  44.575      498.8      480.5

[10858 rows x 6 columns]


test                  int64
Date_time    datetime64[ns]
temp                float64
RH                  float64
CO2_1_ppm           float64
CO2_2_ppm           float64
dtype: object

comment by nick: now the computer understands that Date_time represents the date and time. The same should be done for all dataframes that have a date-time variable. Note that the type of data is now called datetime64, not object.
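The same conversion can be sketched on any text column of dates. The block below is a minimal illustration, not the project's actual data: `sample` and its values are hypothetical, and `errors='coerce'` is an assumed choice that turns unparseable entries into missing values (`NaT`) instead of stopping with an error.

```python
import pandas as pd

# Hypothetical sample standing in for a dataframe with a text 'time' column;
# the last entry is deliberately not a date.
sample = pd.DataFrame({
    'time': ['2010-04-17 09:54', '2010-04-17 10:35', 'not a date'],
    'note': ['c5', 'c4', 'c4'],
})

# errors='coerce' turns anything that cannot be parsed into NaT (missing)
# instead of raising an error.
sample['time'] = pd.to_datetime(sample['time'], errors='coerce')

print(sample.dtypes)                # 'time' should now be a datetime64 type
print(sample['time'].isna().sum())  # number of entries that failed to parse
```

Counting the `NaT` values afterwards is a quick way to spot rows whose date did not parse, such as rows where the columns are shifted.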

Now, I will upload the pitch data so I can compare change in pitch of certain notes and change in environmental characteristics.

In [14]:
#I import the pitch data file

#comment by nick: I changed the path the data is loaded from, making it compatible with cloned copies of your project
#comment by nick: I changed how the file is read, from a table to a csv file
pitch = pd.read_csv('../Data/pitches.csv', sep=',')

#I display my dataframe
pitch

Unnamed: 0,time,div,note,freq1,freq2,freq3,freq4,freq5,freq6,freq7,freq8,freq9
0,2010-04-13 8:37,pedal,c3,131.17,131.20,131.18,131.11,131.17,131.14,131.21,,
1,2010-04-13 8:37,pedal,c4,262.08,262.12,262.09,262.05,262.07,262.10,262.08,,
2,2010-04-13 8:40,swell,c3,131.42,131.47,131.45,131.47,131.50,131.47,131.45,,
3,2010-04-13 8:40,swell,c4,262.9,262.87,262.84,262.85,262.90,262.87,262.88,,
4,2010-04-13 8:42,great,c4,262.04,262.05,262.01,262.03,261.97,261.98,261.99,,
...,...,...,...,...,...,...,...,...,...,...,...,...
52,2010-04-17 10:35,pedal,c4,261.95,261.95,262.02,262.00,261.97,262.01,261.95,261.97,
53,2010-04-17 10:37,great,c4,261.69,261.69,261.68,261.71,261.74,261.66,261.68,261.69,261.67
54,,2010-04-17 9:54,choir,c5,523.73,523.61,523.66,523.77,523.63,523.65,523.69,
55,,2010-04-17 10:35,pedal,c4,261.95,261.95,262.02,262.00,261.97,262.01,261.95,261.97


## Munging data for plotting and stats comparison--Pitch data

### Using regular expressions to find matching dated data points for comparison

To make a meaningful comparison between pitch and CO2, I need to format my two data files. First, for the pitch.csv file, I select the data that correspond to the environmental data file: frequency data collected on 2010-04-17 in the "choir division".

I can make a regular expression to select these rows of pitch/sound frequency data.

comment by nick 

I am a little confused about exactly which variables you would like to select. Specific column names would help me pinpoint your direction.

In [None]:
#First, let's work with the pitch. I want to select the "choir" values in the "div[ision]" column.

#Then, I can select the data from 2010-04-17 only, which is the date that can be matched with the temp, RH, and CO2 measurements in the other data file.


In [18]:
pitch['note']

0        c3
1        c4
2        c3
3        c4
4        c4
      ...  
52       c4
53       c4
54    choir
55    pedal
56    great
Name: note, dtype: object

In [None]:
import re

#I import the file and split it into lines at the '\n' new line separators
with open('../Data/pitches.csv', 'r') as f:
    lines = f.read().strip().split('\n')
# comment by nick: I changed the path for grabbing your data, so it works on all computers that have a clone of your project
# comment by nick: checked that lines was working
lines
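Because `lines` is a plain list of strings, a regular expression can be applied to it one line at a time. The sketch below assumes each line begins with or contains the date text. `sample_lines` is a hypothetical, shortened stand-in for the file contents.

```python
import re

# Hypothetical lines shaped like rows of pitches.csv (values shortened).
sample_lines = [
    'time,div,note,freq1',
    '2010-04-13 8:37,pedal,c3,131.17',
    '2010-04-17 9:54,choir,c5,523.73',
    '2010-04-17 10:35,pedal,c4,261.95',
]

# re.search() works on one string at a time, so it is applied to each line
# inside a list comprehension.
date_pattern = re.compile(r'2010-04-17')
matching_lines = [line for line in sample_lines if date_pattern.search(line)]

print(matching_lines)
```

This keeps whole lines as text. Working with the dataframe instead keeps the columns separate, which is usually easier for later calculations.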


In [None]:
#search for '2010' in the 'time' column of the pitch dataframe
'2010' in pitch['time'][0] #pitch['time'][0] selects one item, the first value in the 'time' column

In [None]:
#I want every row whose 'time' value falls on 2010-04-17.
#I first tried the boolean statement ('2010-04-17' in pitch['time']), but that causes problems: `in` on a Series checks the index labels rather than the values, and it gives a single True/False instead of one per row.
#I also tried RegEx's, but William said those are for a string/list; in a dataframe you should use a search function instead.
#re.search('2010-04-17', pitch) #looking for these date values

#.str.contains() tests each 'time' value for the date part and returns one boolean per row (na=False treats missing times as non-matches)
pitch.loc[pitch['time'].str.contains('2010-04-17', na=False), 'time']


In [None]:
#I can then make a new dataframe with 2010-04-17 data only

data_17 = pitch[pitch['time'].str.contains('2010-04-17', na=False)]

I then need to select notes from the "choir" cells in the "div" column of pitch, because my CO2 readings come from the choir division area in the chapel and so are spatially comparable.
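Selecting by date and by division can be combined into one step with boolean masks. The following is a minimal sketch, assuming the `time` column still holds text. `pitch_sample` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows shaped like the pitch dataframe (only a few columns).
pitch_sample = pd.DataFrame({
    'time': ['2010-04-13 8:40', '2010-04-17 9:54', '2010-04-17 10:35', None],
    'div': ['swell', 'choir', 'pedal', 'choir'],
    'note': ['c3', 'c5', 'c4', 'c5'],
})

# One boolean per row: does the time text start with the target date?
# na=False counts missing times as non-matches.
on_date = pitch_sample['time'].str.startswith('2010-04-17', na=False)

# One boolean per row: is the division 'choir'?
in_choir = pitch_sample['div'] == 'choir'

# & keeps rows where both conditions hold; .copy() gives an independent frame.
choir_on_date = pitch_sample[on_date & in_choir].copy()
print(choir_on_date)
```

Keeping the two masks as separate named variables makes it easy to check each condition on its own before combining them.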

### Making a useful/comparable pitch value with mean of all pitch frequencies

To make a comparison between pitch and CO2, I need to find one pitch value for each time sample. I will do this by averaging the pitch data points in each row of my "pitches.csv" file.

In [22]:
#I use element-wise mathematics between dataframe cells

pitch['pitch_average'] = pitch.mean(columns='freq1' 'freq2' 'freq3' 'freq4' 'freq5' 'freq6' 'freq7' 'freq8' 'freq9')


#pitch[[['freq1', 'freq2', 'freq3', 'freq4', 'freq5', pitch_average']]]

freq2            315.185789
freq3            315.180702
freq4            315.190702
freq5            318.466071
freq6            318.110185
freq7            321.090816
freq8            300.771765
freq9            250.025455
pitch_average           NaN
dtype: float64

In [20]:
pitch('freq1', 'freq2',  pitch_average')

SyntaxError: EOL while scanning string literal (<ipython-input-20-2af93edf2621>, line 1)

In [24]:
#I want to find out why the mean pitch values I calculated are NaNs, so I check the type of data in 'pitch average'
#np.dtype('pitch_average')

#comment by nick: variable_that_holds_dataframe.dtypes will show you what type of data you have.
pitch.dtypes
#how can I check the data type?

time              object
div               object
note              object
freq1             object
freq2            float64
                  ...   
freq6            float64
freq7            float64
freq8            float64
freq9            float64
pitch_average    float64
dtype: object
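The listing above shows `freq1` stored as `object` (text) rather than `float64`, which suggests at least one non-numeric entry in that column. One possible way to get a single mean per row is sketched below. It assumes that non-numeric entries can be treated as missing. `pitch_sample` is hypothetical and uses only three frequency columns for brevity.

```python
import pandas as pd

# Hypothetical rows: freq1 holds text, as an object column would,
# and the second row has a stray non-numeric entry.
pitch_sample = pd.DataFrame({
    'freq1': ['131.17', 'choir'],
    'freq2': [131.20, 523.73],
    'freq3': [131.18, 523.61],
})
freq_cols = ['freq1', 'freq2', 'freq3']

# Convert every frequency column to numbers; text that cannot be
# converted becomes NaN instead of raising an error.
numeric = pitch_sample[freq_cols].apply(pd.to_numeric, errors='coerce')

# axis=1 averages across the columns of each row (one mean per row).
# NaN entries are skipped by default.
pitch_sample['pitch_average'] = numeric.mean(axis=1)
print(pitch_sample['pitch_average'])
```

Passing a list of column names selects the columns to average, and `axis=1` is what switches the mean from "down each column" to "across each row".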

## Munging data for plotting and stats comparison--Environmental data

As I did for pitch.csv, I need to select the rows of my choir_division.csv file for data logged on 2010-04-17. I will use a similar RegEx to do this.

In [25]:
#call in choir_division.csv with line separation
lines = open('../Data/Choir_Division_May.csv', 'r').read().strip().split('\n' )

comment by nick 

Why are you importing this data as a line-separated file?

In [36]:
env_choir_div.dtypes

test                  int64
Date_time    datetime64[ns]
temp                float64
RH                  float64
CO2_1_ppm           float64
CO2_2_ppm           float64
dtype: object

In [37]:
#search for lines that contain the given pattern "2010-04-17"

re.search('2010-04-17', env_choir_div)

#is my data not in the proper format? I tried using "env_choir_div" instead of "lines" but the same error message is returned

TypeError: expected string or buffer
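The `TypeError` arises because `re.search()` expects a string, and a dataframe is not one. Since `Date_time` is already a `datetime64` column, the date can be compared directly instead. The sketch below is one possible approach. `env_sample` is a small stand-in, and its first row is a hypothetical value added for illustration.

```python
import pandas as pd

# Small stand-in for env_choir_div; the first row is hypothetical.
env_sample = pd.DataFrame({
    'Date_time': pd.to_datetime([
        '2010-04-16 23:58:00',
        '2010-04-17 11:00:00',
        '2010-04-17 11:02:00',
    ]),
    'CO2_1_ppm': [440.0, 452.4, 450.5],
})

target_day = pd.Timestamp('2010-04-17')

# .dt.normalize() resets each timestamp to midnight, so only the
# calendar date takes part in the comparison.
on_day = env_sample['Date_time'].dt.normalize() == target_day

env_on_day = env_sample[on_day]
print(env_on_day)
```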

comment by nick.
In all, I think your problems lie with the organization of your data and with having the right type of data present when calling for it. I managed to change some of your data to the right form, but it will need more playing with to get it all done.
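Once both tables carry real datetime values, matching pitch rows to environmental readings by time could be sketched with `pd.merge_asof`. This is an assumed approach, not one the notebook itself uses. Both sample frames and all their values are hypothetical, and the two-minute tolerance is an assumption based on the logger's two-minute sampling interval.

```python
import pandas as pd

# Hypothetical pitch samples with one averaged frequency per time.
pitch_sample = pd.DataFrame({
    'time': pd.to_datetime(['2010-04-17 09:55', '2010-04-17 10:37']),
    'pitch_average': [523.68, 261.69],
})

# Hypothetical logger readings taken every two minutes.
env_sample = pd.DataFrame({
    'Date_time': pd.to_datetime([
        '2010-04-17 09:54', '2010-04-17 09:56',
        '2010-04-17 10:34', '2010-04-17 10:36',
    ]),
    'CO2_1_ppm': [450.0, 451.0, 460.0, 461.0],
})

# merge_asof needs both frames sorted by their time keys. For each pitch
# row, direction='backward' takes the latest reading at or before that
# time, and tolerance drops matches more than two minutes old.
merged = pd.merge_asof(
    pitch_sample.sort_values('time'),
    env_sample.sort_values('Date_time'),
    left_on='time',
    right_on='Date_time',
    direction='backward',
    tolerance=pd.Timedelta('2min'),
)
print(merged)
```

The result has one row per pitch sample, with the matched logger time and CO2 value alongside, which is the shape needed for a pitch-versus-CO2 plot or correlation.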