<h1>Module 3: Python IO</h1>

<h2>Part A: File Input/Output</h2>
<p>Data available at
http://scrippsco2.ucsd.edu/data/atmospheric_co2/sampling_stations</p>

<h2>#1 I/O with Python Built-In Functions</h2>
<p>There are several different ways to read files in Python. <em>Caveat emptor</em>: the less input/code you provide, the more careful you need to be in evaluating your results.</p>
<p>Here are some helpful resources on using the "Built-In" open/read/write/close functions: <br> 
https://docs.python.org/3/library/functions.html,<br> https://docs.python.org/3/tutorial/inputoutput.html,<br> http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python.</p>

<h3>Reading In Files</h3>

In [1]:
# Different ways to read in files

#Open the file
file = open("mlo_station_data_set/weekly_in_situ_co2_mlo.csv","r")

# Read each line of the file; alternative methods are list(file) and file.readlines()
for line in file:
    print(line, end='')

# Alternatives (the loop above has already consumed the file, so call file.seek(0) first to rewind):
#print(list(file))
#print(file.readlines())


# Close the file. Closing the file after reading and/or writing is a programming best practice.
file.close()
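
<p>As a minimal sketch of the three reading approaches mentioned above, the cell below uses a <code>with</code> block so the file is closed automatically. The sample file, its comment line and its path are assumptions made for illustration; the two data rows are copied from the output shown later in this module.</p>

```python
import os
import tempfile

# Hypothetical sample file standing in for the CO2 CSV.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.csv")
with open(path, "w") as f:
    f.write('"comment line"\n1958-04-05,317.31\n1958-04-12,317.69\n')

# 1) Iterate line by line (memory friendly for large files).
with open(path, "r") as f:
    lines_iter = [line for line in f]

# 2) Read all lines at once into a list.
with open(path, "r") as f:
    lines_all = f.readlines()

# 3) Read the whole file as a single string.
with open(path, "r") as f:
    text = f.read()

print(lines_iter == lines_all)  # both approaches give the same list of lines
print(f.closed)                 # the with block closed the file for us
```

<p>Each <code>with</code> block reopens the file, so every approach starts reading from the first line.</p>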





In [2]:
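# Converting dates to numeric times for plotting is non-trivial. Times are measured relative to a
# reference date, such as the Unix epoch (January 1, 1970) or a date near the beginning of your data.
# Here we use seconds since January 1, 1958, which keeps the integers reasonably sized.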

from datetime import datetime

with open("mlo_station_data_set/weekly_in_situ_co2_mlo.csv","r") as file:
    raw_data = file.readlines()

data = []

for line in raw_data:
    if line[0] != '"':
        line = line.rstrip('\n').split(',')
        t = [int(x) for x in line[0].split('-')]
        line[0] = int((datetime(t[0],t[1],t[2])-datetime(1958,1,1)).total_seconds())
        line[1] = float(line[1])
        data.append(line)

# No file.close() is needed here: the with block closes the file automatically.
print(data)
# You can convert Python lists to ndarrays with np.array()
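
<p>The date conversion above can also be wrapped in small functions, which makes it easier to test on a single line. This is only a sketch: the function names are hypothetical, and it assumes every data line has exactly two fields in the form <code>YYYY-MM-DD,value</code>.</p>

```python
from datetime import datetime

REFERENCE = datetime(1958, 1, 1)  # same reference date as the cell above


def seconds_since_reference(date_string, reference=REFERENCE):
    """Convert a 'YYYY-MM-DD' string to whole seconds since `reference`."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return int((parsed - reference).total_seconds())


def parse_line(line):
    """Turn 'YYYY-MM-DD,value' into [seconds, float] (assumes two fields)."""
    date_string, value = line.strip().split(",")
    return [seconds_since_reference(date_string), float(value)]


# Two rows copied from the pandas output shown later in this module.
rows = [parse_line(text) for text in ["1958-04-05,317.31\n", "1958-04-12,317.69\n"]]
print(rows)
```

<p>April 5, 1958 is 94 days after the reference date, so the first time value is 94 * 86400 seconds.</p>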



<h3>Writing Files</h3>

In [3]:
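# Write each row of data as one comma-separated line
with open('weekly_co2_builtin-ex.txt', 'w') as file:
    for line in data:
        # str(line) includes the list brackets, so strip them before writing
        file.write(str(line).lstrip('[').rstrip(']')+"\n")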
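
<p>An alternative to stripping the brackets from <code>str(line)</code> is the standard library <code>csv</code> module. The sketch below is one possible approach, not the method used in this module; the output path is a temporary location chosen for illustration, and the two rows reuse values from the data set.</p>

```python
import csv
import os
import tempfile

# Two rows in the [seconds, co2] layout built in the reading example.
data = [[8121600, 317.31], [8726400, 317.69]]

# Hypothetical output location so no existing file is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "weekly_co2_csv-ex.txt")

# newline="" lets the csv module control line endings itself.
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)

# Read the file back and restore the numeric types.
with open(out_path, "r", newline="") as f:
    read_back = [[int(sec), float(co2)] for sec, co2 in csv.reader(f)]

print(read_back == data)
```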

<h2>#2 I/O with Numerical Python (NumPy) Functions</h2>
<p>Here is a direct link to NumPy I/O functions https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.io.html</p>
<h3>Reading in Files as NumPy Arrays</h3>

In [4]:
# 2. Numerical Python (NumPy) Functions  (np.loadtxt() is for when there is no missing data.)
import io
from datetime import datetime

import numpy as np

# Replace '-' with ',' so that YYYY-MM-DD splits into separate Year, Month and Day columns
with open('mlo_station_data_set/weekly_in_situ_co2_mlo.csv', 'rb') as f:
    s = io.BytesIO(f.read().replace(b'-', b','))

# Everything after a '"' is treated as a comment, so the quoted header lines are skipped
file = np.genfromtxt(s, delimiter=',', comments='"',
                     dtype=[('Year','i4'),('Month','i2'),('Day','i2'),('CO2','f8')])

# Convert Year, Month and Day to seconds from January 1, 1958
data = np.zeros([np.size(file),2])

for line in range(np.size(file)):
    line_1 = (datetime(file['Year'][line],file['Month'][line],file['Day'][line])-datetime(1958,1,1)).total_seconds()
    line_2 = file['CO2'][line]
    data[line] = [line_1,line_2]

print(np.shape(data))
    






(3026, 2)
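
<p>The row-by-row loop above can also be written with NumPy's <code>datetime64</code> type, which converts the whole date column at once. This is a sketch under stated assumptions: the in-memory sample text stands in for the real file, the field names <code>date</code> and <code>co2</code> are made up, and <code>encoding='utf-8'</code> is passed so the date column is read as text.</p>

```python
import io

import numpy as np

# Hypothetical sample with one quoted header line and two rows from the data set.
sample = io.StringIO(
    '"Date","CO2"\n'
    '1958-04-05,317.31\n'
    '1958-04-12,317.69\n'
)

# Read the date as a 10-character string and the CO2 value as a float.
raw = np.genfromtxt(sample, delimiter=',', comments='"',
                    dtype=[('date', 'U10'), ('co2', 'f8')], encoding='utf-8')

# Convert all date strings at once, then express them as seconds since 1958-01-01.
dates = raw['date'].astype('datetime64[D]')
seconds = (dates - np.datetime64('1958-01-01')) / np.timedelta64(1, 's')

data = np.column_stack([seconds, raw['co2']])
print(data.shape)
```

<p>Keeping the date as a single string column avoids replacing every <code>-</code> in the file, which would also alter any negative numbers.</p>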


<h3>Writing Files with NumPy</h3>
<p>You can save files either as a binary file in NumPy format (extension .npy) or as a text file. Your choice will depend on the task at hand. Here we will work with text files.</p>

In [5]:
import numpy as np
fname = 'weekly_co2_ppm_numpy-ex'
np.savetxt(fname, data, fmt='%.5e', delimiter=',', newline='\n') 
# Other options: header='', footer='', comments='# '
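
<p>To illustrate the trade-off between the two formats, the sketch below writes the same array as text and as <code>.npy</code> and reads both back. The file names and the large second time value are made up for illustration; the point is that <code>fmt='%.5e'</code> keeps only six significant digits, while the binary format preserves the values exactly.</p>

```python
import os
import tempfile

import numpy as np

# The second time value is invented to show rounding in the text format.
data = np.array([[8121600.0, 317.31], [1234567890.0, 317.69]])
tmp_dir = tempfile.mkdtemp()

# Text round trip: human readable, but rounded by the format string.
txt_path = os.path.join(tmp_dir, "example.txt")
np.savetxt(txt_path, data, fmt="%.5e", delimiter=",")
from_text = np.loadtxt(txt_path, delimiter=",")

# Binary round trip: not human readable, but exact.
npy_path = os.path.join(tmp_dir, "example.npy")
np.save(npy_path, data)
from_binary = np.load(npy_path)

print(np.array_equal(from_binary, data))  # binary values match exactly
print(from_text[1, 0])                    # text value rounded to 1.23457e+09
```

<p>If the time column needs full precision in a text file, a wider format such as <code>fmt='%.10e'</code> is one option.</p>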

In [3]:
test = "Donald Trump has great hair!"

print(test[7:13]+test[23:])

Trump hair!


<h2>#3 I/O with Pandas (Pan[el] Da[ta])</h2>
<p><a href="https://pandas.pydata.org/index.html">Pandas</a> is a powerful Python package for data analysis that greatly simplifies I/O handling. Wes McKinney started developing the package in 2008, with Chang She joining in 2012, followed by additional core contributors. For more on I/O visit: https://pandas.pydata.org/pandas-docs/stable/io.html </p>

In [1]:
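# 3. Using pandas to create Data Structures 

import pandas as pd
filename = 'mlo_station_data_set/weekly_in_situ_co2_mlo.csv'
# Note: by default read_csv uses the first uncommented line as the header, so that row
# is replaced when the columns are renamed below; pass header=None to keep it as data.
data = pd.read_csv(filename, comment='"')
data.columns = ['dates', 'co2_ppm']
data['dates'] = pd.to_datetime(data['dates'])
data.index = data['dates'] # A critical step for using resampling
print(data)
# Input is EASY, but computation is easier! Group data by year and compute the mean.
# Other functions: count(), sum(), median(), cumsum(), etc.
# 'A' is the year-end alias; pandas 2.2 and later deprecate it in favor of 'YE'.
df2 = data.resample('A').mean()
df2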

                dates  co2_ppm
dates                         
1958-04-05 1958-04-05   317.31
1958-04-12 1958-04-12   317.69
1958-04-19 1958-04-19   317.58
1958-04-26 1958-04-26   316.48
1958-05-03 1958-05-03   316.95
1958-05-17 1958-05-17   317.56
1958-05-24 1958-05-24   317.99
1958-07-05 1958-07-05   315.85
1958-07-12 1958-07-12   315.85
1958-07-19 1958-07-19   315.46
1958-07-26 1958-07-26   315.59
1958-08-02 1958-08-02   315.64
1958-08-09 1958-08-09   315.10
1958-08-16 1958-08-16   315.09
1958-08-30 1958-08-30   314.14
1958-09-06 1958-09-06   313.54
1958-11-08 1958-11-08   313.05
1958-11-15 1958-11-15   313.26
1958-11-22 1958-11-22   313.57
1958-11-29 1958-11-29   314.01
1958-12-06 1958-12-06   314.56
1958-12-13 1958-12-13   314.41
1958-12-20 1958-12-20   314.77
1958-12-27 1958-12-27   315.21
1959-01-03 1959-01-03   315.24
1959-01-10 1959-01-10   315.50
1959-01-17 1959-01-17   315.69
1959-01-24 1959-01-24   315.86
1959-01-31 1959-01-31   315.42
1959-02-14 1959-02-14   316.94
...     

|  dates  |  co2_ppm  |
|---------|:---------:|
| 1958-12-31 | 315.444167 |
| 1959-12-31 | 315.945417 |
| 1960-12-31 | 316.898868 |
| 1961-12-31 | 317.634038 |
| 1962-12-31 | 318.597708 |
| 1963-12-31 | 318.953673 |
| 1964-12-31 | 318.617097 |
| 1965-12-31 | 320.033462 |
| 1966-12-31 | 321.363061 |
| 1967-12-31 | 322.1688 |
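
<p>The sketch below shows one way to read the same kind of file while keeping the first data row, and an annual mean computed by grouping on the index year, which does not depend on a frequency alias. The in-memory sample and its comment line are assumptions; the three rows are copied from the output above.</p>

```python
import io

import pandas as pd

# Hypothetical sample: one quoted comment line followed by three data rows.
csv_text = (
    '"comment line"\n'
    '1958-04-05,317.31\n'
    '1958-04-12,317.69\n'
    '1959-01-03,315.24\n'
)

# With the default header handling, the first data row becomes the column names.
with_default_header = pd.read_csv(io.StringIO(csv_text), comment='"')

# header=None plus names= keeps every data row and labels the columns directly.
df = pd.read_csv(io.StringIO(csv_text), comment='"', header=None,
                 names=['dates', 'co2_ppm'], parse_dates=['dates'],
                 index_col='dates')

# Group by calendar year of the index and average the CO2 column.
annual = df.groupby(df.index.year)['co2_ppm'].mean()

print(len(with_default_header), len(df))
print(annual)
```

<p>Here the grouped result is indexed by the integer year rather than by a year-end date, which is a design choice of this sketch.</p>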


<h3>Pandas Grouping Options</h3>
<p>There are many options for grouping. You can learn more about them in <a href="http://pandas.pydata.org/pandas-docs/stable/timeseries.html">Pandas's timeseries docs</a>. They are also listed below for your convenience.</p>

|  Value  |  Description  |
|---------|:-------------:|
|B | business day frequency | 
|C | custom business day frequency (experimental) |
|D | calendar day frequency |
|W | weekly frequency |
|M | month end frequency |
|BM | business month end frequency |
|CBM | custom business month end frequency |
|MS | month start frequency |
|BMS | business month start frequency |
|CBMS| custom business month start frequency |
|Q | quarter end frequency |
|BQ | business quarter end frequency |
|QS | quarter start frequency |
|BQS | business quarter start frequency |
|A | year end frequency |
|BA | business year end frequency |
|AS | year start frequency |
|BAS | business year start frequency |
|BH | business hour frequency |
|H | hourly frequency |
|T | minutely frequency |
|S | secondly frequency |
|L | milliseconds |
|U | microseconds |
|N | nanoseconds |
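
<p>The same aliases can be passed to <code>pd.date_range</code> to see which timestamps a frequency produces. This is a small sketch using only <code>MS</code> and <code>D</code>, because some of the other aliases in the table have been renamed in recent pandas releases; the start date is arbitrary.</p>

```python
import pandas as pd

# Three month-start timestamps beginning at the start date.
month_starts = pd.date_range("1958-01-01", periods=3, freq="MS")

# Three consecutive calendar days.
days = pd.date_range("1958-01-01", periods=3, freq="D")

print(list(month_starts.strftime("%Y-%m-%d")))
print(list(days.strftime("%Y-%m-%d")))
```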


<h3>Pandas Resampling Method Options</h3>
<p>There are many methods you can apply to resampled pandas dataframes. A non-comprehensive list is given below for your convenience.</p>

|  Method  |   Description   |
|----------|:---------------:|
| bfill | Backward fill |
| count | Count of values |
| ffill | Forward fill |
| first | First valid data value |
| last | Last valid data value |
| max | Maximum data value |
| mean | Mean of values in time range |
| median | Median of values in time range |
| min | Minimum data value |
| nunique | Number of unique values |
| ohlc | Opening value, highest value, lowest value, closing value |
| pad | Same as forward fill |
| std | Standard deviation of values |
| sum | Sum of values |
| var | Variance of values |
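
<p>Several of the methods in the table can be applied in one call with <code>agg</code>. The sketch below uses synthetic daily values (0 to 59) invented for illustration, resampled to month-start bins.</p>

```python
import numpy as np
import pandas as pd

# Synthetic series: 60 daily values starting January 1, 2000.
index = pd.date_range("2000-01-01", periods=60, freq="D")
series = pd.Series(np.arange(60, dtype=float), index=index)

# One row per month, one column per aggregation method.
summary = series.resample("MS").agg(["min", "max", "mean", "count"])

print(summary)
```

<p>January contributes 31 values and February 2000 the remaining 29, so the two monthly means are 15 and 45.</p>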

<h3>Writing Files with Pandas</h3>
Here is the documentation for writing a CSV file. Locate the documentation for writing a JSON file on your own. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
    

In [16]:
# Be careful when you write out files: do not overwrite your original data.
df2.to_csv('mlo_station_data_set/average_annual_co2.csv')

# Signature as documented for an older pandas release (parameters vary between versions):
#pandas.DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=False, date_format=None, doublequote=True, escapechar=None, decimal='.')
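
<p>One way to act on the warning about overwriting data is to check for an existing file before writing. The helper <code>safe_to_csv</code> below is hypothetical, not part of pandas, and the output path is a temporary location; the two values are the first two annual means from the output above.</p>

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"co2_ppm": [315.444167, 315.945417]},
                  index=pd.Index([1958, 1959], name="year"))


def safe_to_csv(frame, path):
    """Hypothetical helper: write `frame` to `path` only if no file exists there."""
    if os.path.exists(path):
        raise FileExistsError(f"Refusing to overwrite {path}")
    frame.to_csv(path)


out_path = os.path.join(tempfile.mkdtemp(), "average_annual_co2.csv")
safe_to_csv(df, out_path)

# A second write to the same path is refused instead of replacing the file.
try:
    safe_to_csv(df, out_path)
    overwritten = True
except FileExistsError:
    overwritten = False

# Read the file back, restoring the index column.
round_trip = pd.read_csv(out_path, index_col="year")
print(overwritten)
print(round_trip)
```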



In [19]:
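import pandas as pd
fname = 'geneds_json.txt'
data = pd.read_json(fname)
# A course can list several requirements; expand them so there is one row per (course, requirement) pair
s = data.apply(lambda x: pd.Series(x['GenedRequirement']),axis=1).stack().reset_index(level=1, drop=True)
print(s)
s.name = 'GenedRequirement'
# Replace the original column with the expanded one, then average the columns for each requirement
new = data.drop('GenedRequirement', axis=1).join(s)
new.groupby(['GenedRequirement']).mean()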

0      Natural Science & Technology
1      Social & Behavioral Sciences
2      Natural Science & Technology
3            Quantitative Reasoning
4            Quantitative Reasoning
5             Humanities & the Arts
6      Social & Behavioral Sciences
7            Quantitative Reasoning
8      Natural Science & Technology
8            Quantitative Reasoning
9      Natural Science & Technology
9            Quantitative Reasoning
10     Natural Science & Technology
11     Natural Science & Technology
12     Social & Behavioral Sciences
13           Quantitative Reasoning
14           Quantitative Reasoning
15     Natural Science & Technology
16     Natural Science & Technology
17           Quantitative Reasoning
18     Natural Science & Technology
19     Social & Behavioral Sciences
20     Natural Science & Technology
20           Quantitative Reasoning
21     Natural Science & Technology
21           Quantitative Reasoning
22           Quantitative Reasoning
23           Quantitative Re

|  GenedRequirement  |  Total  |  avg_gpa  |  compre  |  pct_As  |
|--------------------|:-------:|:---------:|:--------:|:--------:|
| Advanced Composition | 245.058824 | 3.422309 | 0.64007 | 0.424564 |
| Humanities & the Arts | 224.790323 | 3.35714 | 0.613929 | 0.388573 |
| Natural Science & Technology | 1355.483871 | 3.169332 | 0.588273 | 0.384213 |
| Non-Western Culture | 216.828571 | 3.377348 | 0.632288 | 0.420239 |
| Quantitative Reasoning | 1254.686567 | 3.087739 | 0.554821 | 0.337707 |
| Social & Behavioral Sciences | 599.099099 | 3.333429 | 0.630614 | 0.427871 |
| Western/Comparative Culture | 292.858586 | 3.365834 | 0.625333 | 0.409208 |
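
<p>In newer pandas releases (0.25 and later), <code>DataFrame.explode</code> offers a shorter route to the same one-row-per-requirement layout. The sketch below is an alternative to the <code>apply</code>/<code>stack</code> approach above, not the method used in this module; the two courses and their GPA values are invented for illustration.</p>

```python
import pandas as pd

# Hypothetical courses: the second one satisfies two requirements.
courses = pd.DataFrame({
    "GenedRequirement": [
        ["Quantitative Reasoning"],
        ["Natural Science & Technology", "Quantitative Reasoning"],
    ],
    "avg_gpa": [3.0, 3.5],
})

# explode() repeats each row once per element of the list column.
exploded = courses.explode("GenedRequirement")

# Average GPA for each requirement.
means = exploded.groupby("GenedRequirement")["avg_gpa"].mean()

print(exploded)
print(means)
```

<p>As in the output above, a course with two requirements appears twice with the same index label after expansion.</p>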
