# Python3 - Workshop #7
Opening and Working with data files

## Is NASA a waste of money?
Watch: https://www.youtube.com/watch?v=lARpY0nIQx0

![USA Budget](https://preview.ibb.co/gpyPkq/nasa-budget.png)

## Importing Data from Google Drive (sheets)

In [0]:
!pip install --upgrade -q gspread

In [15]:
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open('Accelerometer-Data').sheet1

# get_all_values gives a list of rows.
excelData = worksheet.get_all_values()
#print(excelData)

# Convert to a DataFrame and render.
#import pandas as pd
#pd.DataFrame.from_records(excelData)

dataDescription = excelData[3]

print(dataDescription)


# save the data rows (everything after the description row) in a list
myData = []

for row in excelData[4:]:
  myData.append(row)
  
print(myData)



['Time (s)', 'X (m/s2)', 'Y (m/s2)', 'Z (m/s2)', 'R (m/s2)', 'Theta (deg)', 'Phi (deg)']
[['0', '-0.484118', '0.513916', '0.720085', '0.276648', '-15.829235', '-22.688705'], ['0.0050354', '-0.46213', '0.522461', '0.805504', '0.359601', '-16.010052', '-23.41658'], ['0.0100708', '-0.508318', '0.520508', '0.965843', '0.523661', '-15.829675', '-22.885605'], ['0.0151062', '-0.407183', '0.553207', '1.307976', '0.855371', '-16.587551', '-26.009308'], ['0.0201416', '-0.141146', '0.539978', '1.54911', '1.076567', '-18.21817', '-29.730301'], ['0.025177', '-0.099108', '0.389602', '1.616341', '1.14023', '-18.594118', '-14.960175'], ['0.0302124', '-0.339831', '0.179947', '1.635246', '1.176869', '-17.066833', '0.316132'], ['0.0352478', '-0.366244', '0.250503', '1.513832', '1.05669', '-16.950735', '-5.016785'], ['0.0402832', '-0.226275', '0.475006', '1.227089', '0.759759', '-17.68107', '-22.374985'], ['0.0453186', '-0.175524', '0.730591', '0.807167', '0.343342', '-17.293148', '-44.370621'], ['0.05035
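
Every value comes back from the sheet as a string, so the rows need converting before you can do any math with them. Here is a minimal sketch of that step, assuming the layout shown above (one description row followed by numeric rows). The names `header`, `rows`, `numericData`, `zIndex` and `zValues` are made up for this example, and the two sample rows are copied from the output above.

```python
# Sample copied from the output above; in the notebook these would be
# dataDescription and myData.
header = ['Time (s)', 'X (m/s2)', 'Y (m/s2)', 'Z (m/s2)', 'R (m/s2)', 'Theta (deg)', 'Phi (deg)']
rows = [
  ['0', '-0.484118', '0.513916', '0.720085', '0.276648', '-15.829235', '-22.688705'],
  ['0.0050354', '-0.46213', '0.522461', '0.805504', '0.359601', '-16.010052', '-23.41658'],
]

# Convert every cell from text to a float
numericData = [[float(cell) for cell in row] for row in rows]

# Pull out one column by looking up its position in the header
zIndex = header.index('Z (m/s2)')
zValues = [row[zIndex] for row in numericData]

print(zValues)
```

Looking the column up by its header text, instead of hard-coding the position, keeps the code working if the columns are ever reordered in the sheet.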

## Opening CSV and TXT in Ubuntu/Noob's Raspberry Pi



In [0]:
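import csv

# newline='' lets the csv module handle line endings itself
with open('class-data.csv', newline='') as csvfile:
  # csv.reader is lazy, so read all rows into a list
  # while the file is still open
  myData = list(csv.reader(csvfile))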

In [0]:
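import csv

# csv.reader assumes comma-separated values;
# pass delimiter='\t' if the text file is tab-separated
with open('class-data.txt', newline='') as csvfile:
  # read all rows into a list while the file is still open
  myData = list(csv.reader(csvfile))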
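
To see what `csv.reader` gives you without needing the class files, here is a small self-contained sketch. It writes a made-up file (`example-data.csv` is a hypothetical name, with values borrowed from the accelerometer output above) into a temporary folder and reads it back.

```python
import csv
import os
import tempfile

# Hypothetical sample file, written to a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example-data.csv')

with open(path, 'w', newline='') as csvfile:
  writer = csv.writer(csvfile)
  writer.writerow(['Time (s)', 'X (m/s2)'])
  writer.writerow([0, -0.484118])
  writer.writerow([0.0050354, -0.46213])

# Read the rows back; list() pulls them in while the file is still open
with open(path, newline='') as csvfile:
  rows = list(csv.reader(csvfile))

# The first row is the header, the rest is data stored as text
header = rows[0]
data = [[float(cell) for cell in row] for row in rows[1:]]

print(header)
print(data)
```

Each row is a list of strings, so the numbers still have to be converted with `float()` before you can calculate with them.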

## Opening HDF5 Files
H5py lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

*   **Info**: https://www.h5py.org/
*   **API**: http://docs.h5py.org/en/stable/

I highly recommend you read the book: http://a.co/d/bD9dsvP

In [0]:
"""
  Method 1
"""

import h5py

# copy paste the file name
fileName = 'merra-data.hdf5'

# the with block closes the file automatically
with h5py.File(fileName, 'r') as myFile:
  print(myFile.name)


"""
  Method 2
"""  
  
f = h5py.File(fileName, 'r')

print(f['2010/12/31/BCSMASS'])

f.close()
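
The slicing described at the top of this section can be tried on a tiny file. This is only a sketch: the file name `example.hdf5` and the dataset path `2010/12/31/example` are made up for illustration, and it assumes `h5py` and `numpy` are installed.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this example
path = os.path.join(tempfile.mkdtemp(), 'example.hdf5')

# Write a small 2-D array under a folder-like path (year/month/day/name)
with h5py.File(path, 'w') as f:
  f['2010/12/31/example'] = np.arange(20.0).reshape(4, 5)

# Reopen read-only and slice the dataset like a NumPy array
with h5py.File(path, 'r') as f:
  dataset = f['2010/12/31/example']
  shape = dataset.shape
  firstRow = dataset[0, :]
  corner = dataset[1:3, 0:2]

print(shape)
print(firstRow)
print(corner)
```

Slicing a dataset returns an ordinary NumPy array holding just the part you asked for, which is why `firstRow` and `corner` can still be used after the file is closed.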

## Opening NetCDF4 Files
NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

*   **Info**: https://www.unidata.ucar.edu/software/netcdf/docs/
*   **API**: http://unidata.github.io/netcdf4-python/#section1




In [0]:
"""
  I don't know how to get netCDF4 to work on colab.research.google.com.
  You need to install it on a Linux machine. Please read the installation notes
    on the class Raspberry Pi.
  
  This is an example of how to open netCDF4 files in Python.
"""

import netCDF4

# copy paste the file name 
fileName = 'g4.timeAvgMap.M2TMNXAER_5_12_4_BCSMASS.20170101-20170131.180W_90S_180E_90N.nc'

# open the file as READ ONLY
# read the API - http://unidata.github.io/netcdf4-python/#section1
satelliteData = netCDF4.Dataset(fileName, 'r') 

print(satelliteData.data_model)
print(satelliteData.variables)

satelliteData.close()

**Output**

```
NETCDF4_CLASSIC
OrderedDict([('M2TMNXAER_5_12_4_BCSMASS', <class 'netCDF4._netCDF4.Variable'>
float32 M2TMNXAER_5_12_4_BCSMASS(lat, lon)
    _FillValue: 1e+15
    fmissing_value: 1e+15
    fullnamepath: /BCSMASS
    missing_value: 1e+15
    origname: BCSMASS
    vmax: 1e+15
    vmin: -1e+15
    standard_name: m2tmnxaer_5_12_4_bcsmass
    quantity_type: Black Carbon
    product_short_name: M2TMNXAER
    product_version: 5.12.4
    long_name: Black Carbon Surface Mass Concentration
    units: kg m-3
    cell_methods: time: mean
    latitude_resolution: 0.5
    longitude_resolution: 0.625
    coordinates: lat lon
unlimited dimensions: 
current shape = (361, 576)
filling on), ('lat', <class 'netCDF4._netCDF4.Variable'>
float64 lat(lat)
    units: degrees_north
    vmax: 1e+15
    vmin: -1e+15
    origname: lat
    fullnamepath: /lat
    standard_name: latitude
    bounds: lat_bnds
unlimited dimensions: 
current shape = (361,)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('lat_bnds', <class 'netCDF4._netCDF4.Variable'>
float64 lat_bnds(lat, latv)
    units: degrees_north
unlimited dimensions: 
current shape = (361, 2)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('lon', <class 'netCDF4._netCDF4.Variable'>
float64 lon(lon)
    units: degrees_east
    vmax: 1e+15
    vmin: -1e+15
    origname: lon
    fullnamepath: /lon
    standard_name: longitude
    bounds: lon_bnds
unlimited dimensions: 
current shape = (576,)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('lon_bnds', <class 'netCDF4._netCDF4.Variable'>
float64 lon_bnds(lon, lonv)
    units: degrees_east
unlimited dimensions: 
current shape = (576, 2)
filling on, default _FillValue of 9.969209968386869e+36 used
)])



```



## NASA Database Files
NASA files are organized in the following way:
![NASA Files](https://disc.gsfc.nasa.gov/api/tools/Hydrology%2520Data%2520Rods/file)





## Opening NASA Files

As an alternative to Python, you can use Giovanni: https://giovanni.gsfc.nasa.gov/giovanni/

In Python, the data looks like a multidimensional list, and it can take some code to process it.
I recommend you just copy and paste my code and change the variables.
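
To get a feel for what "multidimensional list" means here, the sketch below builds a tiny made-up grid and works with it the same way the script does. All the numbers are invented for the example; the real grid in the NetCDF output above is 361 latitudes by 576 longitudes.

```python
import numpy as np

# Made-up coordinates and values; a real MERRA-2 grid is 361 x 576
lats = np.array([-90.0, 0.0, 90.0])
lons = np.array([-180.0, -90.0, 0.0, 90.0])
grid = np.arange(12.0).reshape(len(lats), len(lons))  # grid[lat index, lon index]

# Look up the grid cell closest to a location (10 N, 100 W)
latIndex = np.abs(lats - 10.0).argmin()
lonIndex = np.abs(lons - (-100.0)).argmin()
value = grid[latIndex, lonIndex]

# Flatten the grid to one long list, with a latitude and longitude per value
flatValues = grid.ravel()
flatLats = lats.repeat(len(lons))
flatLons = np.tile(lons, len(lats))

print(value)
print(flatLats[5], flatLons[5], flatValues[5])
```

`repeat` and `tile` are what keep the coordinates lined up with the flattened values: each latitude is repeated once per longitude, and the whole longitude list is repeated once per latitude.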

In [0]:
"""
  This script handles daily MERRA-2 data
  Author: Edgard Parra
  Date: Summer 2017
  
  This script calculates PM2.5 around the world for the past 37 years.
"""

#import necessary modules
import re
import sys 
import h5py
import numpy as np
import netCDF4 as nc4

# open the output file; 'w' creates it, or overwrites it if it already exists
f = h5py.File('merra-data.hdf5', 'w')

# open the text file that lists the nc4 files to process, one file name per line
try:
	fileList=open('myfiles.txt','r')
except OSError:
	print('Did not find a text file containing file names (perhaps name does not match)')
	sys.exit()

for FILE_NAME in fileList:
	FILE_NAME = FILE_NAME.strip()
	
	# pull the date stamp (YYYYMMDD) out of the file name
	res = re.findall(r'aer_Nx.(\d+)',FILE_NAME)
	if not res:
		print('Could not find a date in the file name. Skipping...')
		continue
	
	# Find the year of the file
	year = res[0][:4]
	
	# Find the month of the year
	month = res[0][4:6]
	
	# Find the day of the file
	day = res[0][-2:]
	
	# read in the data 
	merraData = nc4.Dataset(FILE_NAME, 'r')
	
	# get the data variable names
	variables = set(merraData.variables)
	
	# print(variables)
	
	# declare the desired variables 
	desiredVariables = set({'BCSMASS','DUSMASS25','OCSMASS','SO4SMASS','SSSMASS25'})
	
	# create list of variables found
	desiredVariables = set([x.lower() for x in desiredVariables])
	
	var1 = variables.intersection(desiredVariables)
	
	desiredVariables = set([x.upper() for x in desiredVariables])
	
	var2 = variables.intersection(desiredVariables)
	
	fileVars = list(var1.union(var2))
	if len(fileVars)==0:
		print('This file contains none of the selected SDS. Skipping...')			
		continue
	print('Saving the following SDS from current file: \n')
	for i, x in enumerate(fileVars):
		print('(' + str(i) + ')',x)
	
	#extract lat and lon info. These are just vectors in the dataset so they're repeated to accommodate the data array  
	lats = merraData.variables['lat'][:]
	lons = merraData.variables['lon'][:]
	totalLon = np.tile(lons,len(lats))
	totalLat = lats.repeat(len(lons))
	
	#create a matrix the same size as the lat/lon datasets to save everything 
	output=np.zeros((totalLat.shape[0],len(fileVars)+2))
	output[:,0]=totalLat
	output[:,1]=totalLon
	
	#can't combine string and floats in an array, so a list of titles is made 
	tempOutput=[]
	tempOutput.append('Latitude')
	tempOutput.append('Longitude')
	index=2
	for SDS_NAME in fileVars:
		try:
			#read merra data as a vector 
			data = merraData.variables[SDS_NAME][:]
			#print(len(data.shape))
		except:
			print('There is an issue with your MERRA file (might be the wrong MERRA file type). Skipping...')
			continue
		if len(data.shape) == 4:
			level = data.shape[1]-1
			data = data[0,level,:,:]
		elif len(data.shape) == 3:
			level = data.shape[0]-1
			data = data[level,:,:]
		
		# convert from grid to list
		data=data.ravel()
		
		# convert units from kg to ug (1 kg = 1,000,000,000 ug)
		data = np.multiply(data,1000000000)
		
		filePath = year+'/'+month+'/'+day+'/'+SDS_NAME
		
		# store the merraData into arrays in hdf5 format
		f[filePath] = data
		f.flush()
		print('Saved ' + str(filePath) + ' successfully! \n')
	
	# the data is stored in the hdf5 file now; read it back to calculate PM2.5
	
	print('Calculating PM2.5 \n')	
	BCSMASS = f[year+'/'+month+'/'+day+'/BCSMASS']
	DUSMASS25= f[year+'/'+month+'/'+day+'/DUSMASS25']
	OCSMASS = f[year+'/'+month+'/'+day+'/OCSMASS']
	SO4SMASS = f[year+'/'+month+'/'+day+'/SO4SMASS']
	SSSMASS25 = f[year+'/'+month+'/'+day+'/SSSMASS25']

	# solve for PM2.5 = [DUST] + [SS] + [BC] + 1.4[OC] + 1.375[SO4]
	OCSMASS = np.multiply(1.4,OCSMASS)
	SO4SMASS = np.multiply(1.375, SO4SMASS)

	PM25 = np.add(DUSMASS25,SSSMASS25)
	PM25 = np.add(PM25,BCSMASS)
	
	PM25 = np.add(PM25,OCSMASS)
	PM25 = np.add(PM25,SO4SMASS)

	# save our results to the hdf5 file
	pm25filePath = year+'/'+month+'/'+day+'/pm25'
	f[pm25filePath] = PM25
	f.flush()
	print('Saved ' + str(pm25filePath) + ' successfully! \n')

print('\nSaving GPS locations')
# store gps data on arrays
f['gpsData/latitude'] = totalLat
f['gpsData/longitude'] = totalLon

# gps for plotting
f['gpsData/lats'] = lats
f['gpsData/lons'] = lons

print('\nAll valid variables have been saved successfully. \n')

# close the files
fileList.close()
f.flush()
f.close()
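
The PM2.5 step in the script is easier to check on a couple of numbers. Here is the same formula, PM2.5 = [DUST] + [SS] + [BC] + 1.4[OC] + 1.375[SO4], as a small sketch. The function name `pm25` and the concentrations are made up for this example; they are not real MERRA-2 values.

```python
import numpy as np

def pm25(dust, seaSalt, blackCarbon, organicCarbon, sulfate):
  """PM2.5 = [DUST] + [SS] + [BC] + 1.4[OC] + 1.375[SO4]"""
  return dust + seaSalt + blackCarbon + 1.4 * organicCarbon + 1.375 * sulfate

# Made-up concentrations in ug/m3 for two grid cells
dust = np.array([1.0, 2.0])
seaSalt = np.array([0.5, 0.5])
blackCarbon = np.array([0.25, 0.0])
organicCarbon = np.array([1.0, 0.0])
sulfate = np.array([2.0, 4.0])

result = pm25(dust, seaSalt, blackCarbon, organicCarbon, sulfate)
print(result)
```

Because the inputs are NumPy arrays, the formula is applied to every grid cell at once. The first cell works out to 1 + 0.5 + 0.25 + 1.4 + 2.75, about 5.9, and the second to 2 + 0.5 + 5.5 = 8.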