# **Computational Methods**
## **Introduction to Data Sets**

Written by Niv Keren, nivkeren1@mail.tau.ac.il 

Based on the [Python Numpy Tutorial](https://cs231n.github.io/python-numpy-tutorial/) by Justin Johnson,
from the Stanford CS231n class.

*Computational Methods* class: 0341-2300

2020/Semester I; Tuesdays 14:00-16:00

FACULTY OF EXACT SCIENCES | GEOPHYSICS & PLANETARY SCIENCES  
Tel Aviv University

---

## **Pandas**
Pandas is a library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. [(from Wikipedia)](https://en.wikipedia.org/wiki/Pandas_(software))

* DataFrame object for data manipulation with integrated indexing.
* Tools for reading and writing data between in-memory data structures and different file formats.
* Data alignment and integrated handling of missing data.
* Reshaping and pivoting of data sets.
* Label-based slicing, fancy indexing, and subsetting of large data sets.
* Data structure column insertion and deletion.
* Group by engine allowing split-apply-combine operations on data sets.
* Data set merging and joining.
* Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
* Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
* Data filtration.

[**Pandas Documentation**](https://pandas.pydata.org/pandas-docs/stable/)
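
Two of the features listed above, missing-data handling and split-apply-combine grouping, can be sketched on a tiny table. The station names and temperature values below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; names and values are made up for this sketch.
df = pd.DataFrame({
    "station": ["a", "a", "b", "b"],
    "temp": [20.0, np.nan, 15.0, 17.0],
})

# Missing data: NaN values are skipped by default when aggregating.
mean_temp = df["temp"].mean()

# Split-apply-combine: split rows by station, apply mean, combine into a Series.
per_station = df.groupby("station")["temp"].mean()
print(per_station)

# Replace the missing value with the overall mean (one possible choice among many).
filled = df["temp"].fillna(mean_temp)
print(filled)
```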

In [1]:
import numpy as np
import pandas as pd

### **Series**
Creating a **Series** by passing a `list` of values, letting pandas create a default integer index:

In [6]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
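
As a small follow-up sketch, a Series can also be given explicit labels instead of the default integer index, and missing values can be counted or dropped. The labels `a`, `b`, `c` and the name `labeled` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Same Series as above: the NaN entry is why the dtype is float64.
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# A Series with an explicit (hypothetical) label index.
labeled = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])
print(labeled["b"])      # access a value by its label

# Count and drop missing values.
n_missing = s.isna().sum()
clean = s.dropna()
print(n_missing, len(clean))
```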

### **DataFrame**
Creating a **DataFrame** by passing a `NumPy array`, with a datetime index and labeled columns:

In [7]:
dates = pd.date_range('20130101', periods=6)

dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [8]:
pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

| | A | B | C | D |
|---|---|---|---|---|
| 2013-01-01 | 0.710661 | 0.986793 | -2.829491 | 0.948989 |
| 2013-01-02 | -0.004203 | 0.008974 | -0.647333 | 0.020984 |
| 2013-01-03 | -0.142661 | 0.681989 | 0.586639 | 0.539854 |
| 2013-01-04 | 1.131416 | 1.420107 | -0.993988 | 1.382051 |
| 2013-01-05 | -1.303661 | 0.607672 | -0.185377 | 1.611197 |
| 2013-01-06 | -0.162297 | 1.692326 | 0.238663 | 1.189658 |


In [9]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |


In [12]:
df2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [13]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
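
Because each column carries its own dtype, columns can be selected or converted by type. The sketch below rebuilds `df2` so it runs on its own; the names `numeric` and `d_as_float` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild df2 as in the cell above so this sketch is self-contained.
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

# Keep only the numeric columns (float and integer dtypes).
numeric = df2.select_dtypes(include='number')
print(list(numeric.columns))

# A categorical column stores its distinct labels once.
print(list(df2['E'].cat.categories))

# Convert one column to another dtype; df2 itself is left unchanged.
d_as_float = df2['D'].astype('float64')
print(d_as_float.dtype)
```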

## **Data Sets or Datasets**
A data set (or dataset) is a collection of data.  
A few examples:
* **Tabular data -** a data set that corresponds to one or more database tables, where every column represents a particular variable and each row corresponds to a given record.
* **Image data -** a collection of images. Can also be represented in a table.
* **Time-stamped data -** a data set with a time ordering that defines the sequence in which each data point was either captured (event time) or collected (processed time).
* **Spatial data -** a data set whose objects have spatial attributes, such as positions or areas, as well as other types of attributes.
---
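
As a minimal sketch of time-stamped data, a Series indexed by dates can be sliced by date labels, resampled to a coarser frequency, and smoothed with a moving window. The values 1.0 to 6.0 are placeholders, not real measurements.

```python
import pandas as pd

# Six daily time stamps with placeholder values.
dates = pd.date_range('20130101', periods=6)
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Label-based slicing by date includes both endpoints.
subset = ts['2013-01-02':'2013-01-03']

# Frequency conversion: sum the values in consecutive two-day bins.
two_day_sum = ts.resample('2D').sum()

# Moving window statistic: mean over the current and two previous days.
rolling_mean = ts.rolling(window=3).mean()

print(subset)
print(two_day_sum)
print(rolling_mean)
```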
### **Tabular Data**
Two very common formats to save tabular data are:
1. **CSV files -** *comma-separated values*. A delimited plain-text file that stores tabular data (numbers and text). Each line of the file is a data record, and each record consists of one or more fields separated by commas.
2. **JSON files -** *JavaScript Object Notation*. An open-standard file format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays.

Open CSV and JSON files as a DataFrame:
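
A minimal sketch of reading both formats with `pd.read_csv` and `pd.read_json`. To keep the example self-contained, the file contents are held in strings and wrapped in `io.StringIO`; with real files, a path would be passed instead. The column names and values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content: a header line followed by one record per line.
csv_text = "name,depth_km,magnitude\nev1,10.0,4.5\nev2,33.0,5.1\n"

# The same records as a JSON array of attribute-value objects.
json_text = (
    '[{"name": "ev1", "depth_km": 10.0, "magnitude": 4.5},'
    ' {"name": "ev2", "depth_km": 33.0, "magnitude": 5.1}]'
)

# StringIO makes a string behave like an open file.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Both calls return a DataFrame with one row per record and one column per field.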

In [None]:
# TODO: continue the pandas tutorial