# Important data analysis methods and tips for working with data arrays

- lists
- 1-dimensional or 2-dimensional arrays
- Pandas package support for tables that include dates or text (non-numerical data)


In [1]:
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import ttest_ind


## The next code cell generates a DataFrame

In [2]:
# create an example data table
# 
sample1=np.array([10.0,10.7,11.5,9.6,10.2,np.nan,10.9,10.3,np.nan,10.6])
sample2=np.array([9.5,9.0,10.2,-9999.9,9.8,9.0,10.0,9.5,-9999.9,10.5])

n=len(sample1) # sample 2 has same length here in this example

# create a list with dates as strings
dlist= []
i=0
timestep=dt.timedelta(days=1)
date=dt.datetime(2021,4,24)
while i<n:
    date=date+timestep
    dlist.append(date.strftime("%Y-%m-%d"))
    i=i+1

# put the list with date strings and the numerical data into a DataFrame and save it as a CSV file

df=pd.DataFrame({"time":dlist,"tmin NYC":sample1,"tmin ALB":sample2})
# note: to_csv also writes the row index as a first, unnamed column
df.to_csv("testdata.csv")
df

Unnamed: 0,time,tmin NYC,tmin ALB
0,2021-04-25,10.0,9.5
1,2021-04-26,10.7,9.0
2,2021-04-27,11.5,10.2
3,2021-04-28,9.6,-9999.9
4,2021-04-29,10.2,9.8
5,2021-04-30,,9.0
6,2021-05-01,10.9,10.0
7,2021-05-02,10.3,9.5
8,2021-05-03,,-9999.9
9,2021-05-04,10.6,10.5
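
The list of date strings built with the while loop above can also be generated with a single pandas call. This is a minimal sketch, assuming the same first date (2021-04-25) and ten daily steps as in the example; the name `dlist_alt` is only illustrative.

```python
import pandas as pd

# ten consecutive days starting 2021-04-25, formatted as "YYYY-MM-DD" strings
dlist_alt = (
    pd.date_range(start="2021-04-25", periods=10, freq="D")
    .strftime("%Y-%m-%d")
    .tolist()
)

print(dlist_alt[0], dlist_alt[-1])
```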


## Here we load the data from the local file
(you will find the file _testdata.csv_ in your folder)


In [3]:
# Note: reading the file with numpy loadtxt does not work that well!
# It requires more effort if you want to read the whole data table,
# but we can at least get the columns with the temperature data.

#np.loadtxt("testdata.csv",delimiter=',',skiprows=1,usecols=(3,))
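
If you want to stay with numpy, one possible option is `np.genfromtxt`, which fills empty fields with nan for float data. The sketch below uses a small in-memory stand-in for the file (an assumption, so that the example runs without _testdata.csv_); the names `csv_text` and `temps` are only illustrative.

```python
import io
import numpy as np

# a few rows in the same column layout as testdata.csv (index, time, tmin NYC, tmin ALB)
csv_text = """,time,tmin NYC,tmin ALB
0,2021-04-25,10.0,9.5
1,2021-04-26,10.7,9.0
5,2021-04-30,,9.0
"""

# read only the two temperature columns; the empty field is returned as nan
temps = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1, usecols=(2, 3))
print(temps)
```

The date column is left out here because it cannot be converted to floating-point numbers.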

In [4]:
# loading the data with pandas is the recommended way
data=pd.read_csv("testdata.csv")
data.columns

Index(['Unnamed: 0', 'time', 'tmin NYC', 'tmin ALB'], dtype='object')

In [5]:
# convert date strings into datetime objects
# see https://datatofish.com/strings-to-datetime-pandas/
thelp=data['time']

thelp2=pd.to_datetime(thelp,format="%Y-%m-%d")
data['time']=thelp2
data.columns
tmin1=data['tmin NYC'].values
np.nanmean(tmin1)
tmin2=data['tmin ALB'].values

# replace dummy values with np.nan values

tmin2=np.where(tmin2>-9990.0,tmin2,np.nan)
tmin2

array([ 9.5,  9. , 10.2,  nan,  9.8,  9. , 10. ,  9.5,  nan, 10.5])
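
The date conversion and the replacement of dummy values can also be requested while reading the file. The following is a sketch under the assumption that the file has the layout shown above; it uses an in-memory copy of three rows so that it runs on its own, and the names `csv_text` and `data_alt` are only illustrative.

```python
import io
import pandas as pd

csv_text = """,time,tmin NYC,tmin ALB
0,2021-04-25,10.0,9.5
3,2021-04-28,9.6,-9999.9
5,2021-04-30,,9.0
"""

data_alt = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,            # use the first (unnamed) column as the row index
    parse_dates=["time"],   # convert the date strings while reading
    na_values=["-9999.9"],  # treat the dummy value as missing
)

print(data_alt.dtypes)
print(data_alt["tmin ALB"].values)
```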

## Apply the t-test to the data (NaN values omitted)

In [None]:
tvalue, pvalue=ttest_ind(tmin1,tmin2,equal_var=False,nan_policy='omit')
print(tvalue,pvalue)
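
To see what `ttest_ind` computes with `equal_var=False` (Welch's t-test), the t statistic can be reproduced by hand. This is a sketch for illustration only; the arrays `a` and `b` repeat the example values with the dummy values already replaced by nan, and all variable names are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind

a = np.array([10.0, 10.7, 11.5, 9.6, 10.2, np.nan, 10.9, 10.3, np.nan, 10.6])
b = np.array([9.5, 9.0, 10.2, np.nan, 9.8, 9.0, 10.0, 9.5, np.nan, 10.5])

# sample sizes after dropping the nan values
na = np.count_nonzero(~np.isnan(a))
nb = np.count_nonzero(~np.isnan(b))

# Welch's t statistic: difference of the means divided by the combined standard error
se = np.sqrt(np.nanvar(a, ddof=1) / na + np.nanvar(b, ddof=1) / nb)
t_manual = (np.nanmean(a) - np.nanmean(b)) / se

t_scipy, p_scipy = ttest_ind(a, b, equal_var=False, nan_policy="omit")
print(t_manual, t_scipy)
```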


### How can we get a block of rows and columns from the DataFrame back into 2-d numpy array?

There are several options. I will show one that is consistent with the row and column selection by index that we use with numpy arrays.

Remember: DataFrames add an extra layer of functionality around our numerical data arrays.
One feature that DataFrame objects share with numpy arrays is the shape attribute.
Another is the iloc attribute, which selects rows and columns by integer position.
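
A minimal sketch of this correspondence, using a small made-up array rather than the temperature data (the names `arr`, `frame`, `block_np` and `block_df` are only illustrative):

```python
import numpy as np
import pandas as pd

arr = np.arange(12.0).reshape(4, 3)        # 4 rows, 3 columns
frame = pd.DataFrame(arr, columns=["a", "b", "c"])

print(frame.shape)                         # (rows, columns), like arr.shape

block_np = arr[1:3, 0:2]                   # rows 1-2, columns 0-1 of the array
block_df = frame.iloc[1:3, 0:2].values     # the same selection on the DataFrame

print(np.array_equal(block_np, block_df))
```

As with numpy slices, the end position of an iloc slice is not included.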


In [None]:
print("type of variable data: ")
print(type(data))

print("shape of the data array stored in the DataFrame:")
print(data.shape)

In [None]:
print("Select the fourth column (python index 3) from the DataFrame: ")
result=data.iloc[:,3].values
print(type(result),result.shape)
for value in result:
    print(value)

In [None]:
print("Select the rows at index positions 2-6 and the columns at index positions 2-3")
x=data.iloc[2:7,2:4].values

print(type(x),x.shape)


## We have selected the data we wanted from the DataFrame and converted the resulting data into a numpy array
The variable x now is a 2-dimensional numpy array and we can work with the numpy array as we have done earlier in the notebook examples (see for example the world topography data we worked with).

### Application of statistical calculations column by column:

Many numpy functions allow you to specify the keyword _axis_ in the function call.
I explain the use of this keyword with the 2-dimensional data array _x_.

The task is to find the minimum value in each column.

We use the function _np.nanmin()_.

In [None]:
# The keyword axis=0 tells the function to reduce along the first dimension (rows).
# In other words, the function is applied to each data column individually.
# As a result we get a 1-dimensional array whose size matches the number of columns in x.

xmin0=np.nanmin(x,axis=0) # for each column, find the minimum value

print("column by column application of the minimum function:")
print(xmin0)
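
Note that _x_ was taken from the DataFrame, where the dummy value -9999.9 in the tmin ALB column has not been replaced (only the separate array tmin2 was cleaned). The sketch below rebuilds the same block of values by hand so that it runs on its own, and masks the dummy values before taking the column minima. The names `x_block` and `x_clean` are only illustrative.

```python
import numpy as np

# rows 2-6 of the columns tmin NYC and tmin ALB from the example table
x_block = np.array([[11.5, 10.2],
                    [9.6, -9999.9],
                    [10.2, 9.8],
                    [np.nan, 9.0],
                    [10.9, 10.0]])

# without masking, the dummy value is returned as the minimum of the second column
print(np.nanmin(x_block, axis=0))

# replace the dummy values with nan first, then take the column minima
x_clean = np.where(x_block > -9990.0, x_block, np.nan)
print(np.nanmin(x_clean, axis=0))
```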

### Try axis=1: what size will the returned array be?


In [None]:
xmin1=np.nanmin(x,axis=1)
print("row by row application of the minimum function:")
print(xmin1)
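
A quick way to check the answer is to compare the shapes of the two results. This is a small sketch with a made-up array of 5 rows and 2 columns; the name `x_demo` is only illustrative.

```python
import numpy as np

x_demo = np.array([[11.5, 10.2],
                   [9.6, 9.8],
                   [10.2, 9.0],
                   [np.nan, 9.0],
                   [10.9, 10.0]])   # 5 rows, 2 columns

print(np.nanmin(x_demo, axis=0).shape)  # one value per column
print(np.nanmin(x_demo, axis=1).shape)  # one value per row
```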