# Pandas

---
# Pandas Objects and Basic Creation

https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html

| Name | Dimensions | Description  |
| ------:| -----------:|----------|
| ```pd.Series``` | 1 | 1D labeled homogeneously-typed array |
| ```pd.DataFrame```  | 2| General 2D labeled, size-mutable tabular structure |
| ```pd.Panel``` | 3|  General 3D labeled, also size-mutable array (deprecated in pandas 0.20 and removed in 0.25) |
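
As a minimal sketch of the first two rows of the table, the snippet below builds one Series and one DataFrame. The labels and values are invented for illustration and are not taken from the later examples.

```python
import pandas as pd

# Illustrative values only (assumed for this sketch).
s = pd.Series([1.5, 2.5, 3.5], index=['x', 'y', 'z'])
print(s.ndim, s.dtype)   # one dimension, a single dtype for all values

# A DataFrame can hold a different dtype in each column.
frame = pd.DataFrame({'num': [1, 2, 3], 'txt': ['a', 'b', 'c']},
                     index=['x', 'y', 'z'])
print(frame.ndim)        # two dimensions
print(frame.dtypes)      # one dtype per column
```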

In [1]:
import numpy as np
import pandas as pd

---
# Pandas DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. We can think of it as an R data frame/tibble, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

---
## DataFrame Creation

- We can create a DataFrame from:
    - Dict of 1D ndarrays, lists, dicts, or Series
    - 2D numpy array
    - A list of dictionaries
    - A Series
    - Another DataFrame
``` python
df = pd.DataFrame(data, index = index, columns = columns)
```
- ```index``` / ```columns``` are lists of the row/column labels. If we pass an index and/or columns, we guarantee the index and/or columns of the resulting DataFrame.
- If we do not pass them, the labels are constructed from the input by "common sense" rules (for example, dictionary keys become column names).
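
The cells that follow cover dictionaries of Series and of lists. As a rough sketch of the remaining input types in the list above, the snippet below uses made-up values; the variable names (`from_array`, `from_records`, and so on) are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# 2D NumPy array: rows get default integer labels unless an index is supplied.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])

# List of dictionaries: each dict is one row; missing keys become NaN.
from_records = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3}])

# A named Series becomes a single-column DataFrame.
from_series = pd.DataFrame(pd.Series([10, 20], name='z'))

# Another DataFrame: copy=True avoids sharing data with the original.
from_frame = pd.DataFrame(from_array, copy=True)

print(from_array, from_records, from_series, from_frame, sep='\n\n')
```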

In [2]:
# Create a dictionary of series
d = {'one': pd.Series([1,2,3], index  = ['a', 'b', 'c']), 
     'two': pd.Series(list(range(4)), index = ['a','b', 'c', 'd'])}
# Columns are dictionary keys, indices and values obtained from series
df = pd.DataFrame(d)
# Series 'one' has no label 'd', so that entry of column 'one' is filled with NaN
print(df)

   one  two
a  1.0    0
b  2.0    1
c  3.0    2
d  NaN    3


In [3]:
# Create a dictionary of lists
d = {'one' : [1., 2., 3., 4], 'two' : [4., 3., 2., 1.], 'three' : [3., 2., 4., 1.]}
df = pd.DataFrame(d)
print(df)

   one  two  three
0  1.0  4.0    3.0
1  2.0  3.0    2.0
2  3.0  2.0    4.0
3  4.0  1.0    1.0


In [4]:
# Adding indices
df = pd.DataFrame(d, index = ['a', 'b', 'c', 'd'])
print(df)

   one  two  three
a  1.0  4.0    3.0
b  2.0  3.0    2.0
c  3.0  2.0    4.0
d  4.0  1.0    1.0


In [5]:
# Only certain columns
df = pd.DataFrame(d, columns = ['one', 'three'])

end_string = '\n' + '-'*50 + '\n'
print(df.shape, end = end_string)

print(df)

(4, 2)
--------------------------------------------------
   one  three
0  1.0    3.0
1  2.0    2.0
2  3.0    4.0
3  4.0    1.0


---
## Modifying DataFrame : Introduction to concept

In [6]:
# Create a new column as the element-wise product of two existing columns
df['one*three'] =  df['one'] * df['three']
print(df)

   one  three  one*three
0  1.0    3.0        3.0
1  2.0    2.0        4.0
2  3.0    4.0       12.0
3  4.0    1.0        4.0


In [7]:
# Insert a column at position 1; values align on the index, so rows 0 and 1 become NaN
df.insert(1, 'bar', df['one'][2:])
print(df)

   one  bar  three  one*three
0  1.0  NaN    3.0        3.0
1  2.0  NaN    2.0        4.0
2  3.0  3.0    4.0       12.0
3  4.0  4.0    1.0        4.0
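
The NaN values in `bar` come from index alignment: the sliced Series only carries the labels 2 and 3. The sketch below reproduces the effect on a smaller, made-up frame. The column name `shifted` is hypothetical, and `.iloc[2:]` is used here as an explicit positional equivalent of the `[2:]` slice above.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., 4.]})

tail = df['one'].iloc[2:]       # positional slice; keeps index labels 2 and 3
print(tail.index.tolist())

# Assignment aligns on index labels, so rows 0 and 1 have no match and get NaN.
df['bar'] = tail
print(df)

# A plain list skips alignment; its length must then match the frame.
df['shifted'] = tail.tolist() + [0., 0.]
print(df)
```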


In [8]:
# Delete a column: pop removes it in place and returns it as a Series
three = df.pop('three')
print(df)

   one  bar  one*three
0  1.0  NaN        3.0
1  2.0  NaN        4.0
2  3.0  3.0       12.0
3  4.0  4.0        4.0
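
`pop` is one of several ways to remove a column. A small sketch comparing it with `del` and `drop`, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2.], 'bar': [3., 4.], 'three': [5., 6.]})

popped = df.pop('three')             # removes in place, returns the column as a Series
del df['bar']                        # removes in place, returns nothing
smaller = df.drop(columns=['one'])   # returns a new frame; df itself is unchanged

print(type(popped).__name__)
print(df.columns.tolist())
print(smaller.columns.tolist())
```

Because `drop` returns a new object, its result has to be assigned (as in `data = data.drop(...)`) for the change to persist.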


---
## Modifying DataFrame : Exercise

Given the following DataFrame `data`, we make the following modifications:

- Switch the labels of columns 'B' and 'C'
- Create a new column 'A*D', which equals the product of 'A' and 'D'
- Delete columns 'A' and 'D'
- Create a boolean column 'flag', which indicates whether column 'B' > column 'A*D'
- Delete rows where 'flag' is False
- Delete the 'flag' column

The resulting DataFrame is:
```
                   C         B       A*D
2000-01-02  0.877420  0.503913 -0.276999
2000-01-03 -0.321768 -0.057829 -1.193274
2000-01-04 -0.721817 -0.814920 -1.181251
2000-01-05  1.799571 -0.166019 -1.039480
```


In [9]:
pd.options.display.max_rows = 10
dates = pd.date_range('1/1/2000', periods=8)
np.random.seed(206)
data = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C','D'])
data

                   A         B         C         D
2000-01-01 -0.562224 -2.704382  0.100133 -0.556712
2000-01-02  0.282239  0.877420  0.503913 -0.981433
2000-01-03 -1.634617 -0.321768 -0.057829  0.730002
2000-01-04 -1.132797 -0.721817 -0.814920  1.042774
2000-01-05  0.612227  1.799571 -0.166019 -1.697868
2000-01-06 -0.595997  1.913933 -0.877715 -1.030056
2000-01-07 -1.582156 -0.277063  0.255720 -2.215322
2000-01-08 -1.006581  0.560446 -1.501818  0.628199


In [10]:
# Switch the labels of columns 'B' and 'C'
data.columns = ['A', 'C', 'B', 'D']

# Creating a new column 'A*D', which equals the product of 'A' and 'D'
data['A*D'] = data['A'] * data['D']

# Delete columns 'A' and 'D'
data = data.drop(['A', 'D'], axis=1)

# Creating a boolean column 'flag', which indicates whether column 'B'> column 'A*D'
data['flag'] = data['B'] > data['A*D']

# Delete rows where 'flag' is False
data = data.drop(data[~data['flag']].index)

# Delete the 'flag' column
data = data.drop('flag', axis=1)
print(data)

                   C         B       A*D
2000-01-02  0.877420  0.503913 -0.276999
2000-01-03 -0.321768 -0.057829 -1.193274
2000-01-04 -0.721817 -0.814920 -1.181251
2000-01-05  1.799571 -0.166019 -1.039480
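
The same steps can also be written as a single method chain. This is only one possible alternative, sketched under the assumption that `data` is rebuilt with the same seed as above. It filters the rows directly, so the temporary `flag` column is never created.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
np.random.seed(206)
data = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

result = (
    data
    .rename(columns={'B': 'C', 'C': 'B'})          # swap the two labels in one pass
    .assign(**{'A*D': lambda d: d['A'] * d['D']})  # 'A*D' is not a valid keyword, hence the dict
    .drop(columns=['A', 'D'])
    .loc[lambda d: d['B'] > d['A*D']]              # keep rows where the condition holds
)
print(result)
```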


---
## Indexing and Selection : Concept

https://pandas.pydata.org/docs/user_guide/indexing.html

| Operation  | Syntax       | Result | 
|----|----------------------| ---------------------------|
| Select Column | df[col]   |    Series                      |
| Select Row by Label | df.loc[label] | Series  |
| Select Row by Integer Location | df.iloc[idx] |      Series                    |
| Slice rows | df[5:10]        |                        DataFrame  | 
| Select rows by boolean | df[mask]   | DataFrame        |
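
The result types in the table can be checked directly. A minimal sketch on a made-up three-row frame (the labels `r0` to `r2` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['r0', 'r1', 'r2'])

print(type(df['A']).__name__)          # select column
print(type(df.loc['r1']).__name__)     # select row by label
print(type(df.iloc[1]).__name__)       # select row by integer location
print(type(df[0:2]).__name__)          # slice rows
print(type(df[df['A'] > 1]).__name__)  # select rows by boolean mask
```

Passing a list of column names, as in `df[['A']]`, returns a DataFrame rather than a Series, even for a single column.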

In [11]:
dates = pd.date_range('1/1/2000', periods=8)
np.random.seed(206)
data = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C','D'])
data

                   A         B         C         D
2000-01-01 -0.562224 -2.704382  0.100133 -0.556712
2000-01-02  0.282239  0.877420  0.503913 -0.981433
2000-01-03 -1.634617 -0.321768 -0.057829  0.730002
2000-01-04 -1.132797 -0.721817 -0.814920  1.042774
2000-01-05  0.612227  1.799571 -0.166019 -1.697868
2000-01-06 -0.595997  1.913933 -0.877715 -1.030056
2000-01-07 -1.582156 -0.277063  0.255720 -2.215322
2000-01-08 -1.006581  0.560446 -1.501818  0.628199


In [12]:
# Indexing/Slicing columns

print(data[['A', 'B']], end = end_string)
print(data.loc[:,['A', 'B']], end = end_string)
print(data.loc[:,'A':'B'], end = end_string)
print(data.iloc[:, [0, 1]], end = end_string)
print(data.iloc[:, range(2)], end = end_string)

                   A         B
2000-01-01 -0.562224 -2.704382
2000-01-02  0.282239  0.877420
2000-01-03 -1.634617 -0.321768
2000-01-04 -1.132797 -0.721817
2000-01-05  0.612227  1.799571
2000-01-06 -0.595997  1.913933
2000-01-07 -1.582156 -0.277063
2000-01-08 -1.006581  0.560446
--------------------------------------------------
                   A         B
2000-01-01 -0.562224 -2.704382
2000-01-02  0.282239  0.877420
2000-01-03 -1.634617 -0.321768
2000-01-04 -1.132797 -0.721817
2000-01-05  0.612227  1.799571
2000-01-06 -0.595997  1.913933
2000-01-07 -1.582156 -0.277063
2000-01-08 -1.006581  0.560446
--------------------------------------------------
                   A         B
2000-01-01 -0.562224 -2.704382
2000-01-02  0.282239  0.877420
2000-01-03 -1.634617 -0.321768
2000-01-04 -1.132797 -0.721817
2000-01-05  0.612227  1.799571
2000-01-06 -0.595997  1.913933
2000-01-07 -1.582156 -0.277063
2000-01-08 -1.006581  0.560446
--------------------------------------------------
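
One detail behind these equivalent selections: a label slice with `.loc` includes its stop label, while a positional slice with `.iloc` excludes its stop position. A small sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])

by_label = df.loc[:, 'A':'B']     # label slice: includes the stop label 'B'
by_position = df.iloc[:, 0:2]     # positional slice: excludes position 2

print(by_label.columns.tolist())
print(by_position.columns.tolist())
print(by_label.equals(by_position))
```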
          

In [13]:
# Indexing/Slicing rows

print(data['2000-01-01': '2000-01-04'], end = end_string)
print(data.loc['2000-01-01': '2000-01-04'], end = end_string)
print(data.iloc[0:4], end = end_string)
print(data.iloc[range(4)], end = end_string)
print(data[data.index < '2000-01-05'], end = end_string)

                   A         B         C         D
2000-01-01 -0.562224 -2.704382  0.100133 -0.556712
2000-01-02  0.282239  0.877420  0.503913 -0.981433
2000-01-03 -1.634617 -0.321768 -0.057829  0.730002
2000-01-04 -1.132797 -0.721817 -0.814920  1.042774
--------------------------------------------------
                   A         B         C         D
2000-01-01 -0.562224 -2.704382  0.100133 -0.556712
2000-01-02  0.282239  0.877420  0.503913 -0.981433
2000-01-03 -1.634617 -0.321768 -0.057829  0.730002
2000-01-04 -1.132797 -0.721817 -0.814920  1.042774
--------------------------------------------------
                   A         B         C         D
2000-01-01 -0.562224 -2.704382  0.100133 -0.556712
2000-01-02  0.282239  0.877420  0.503913 -0.981433
2000-01-03 -1.634617 -0.321768 -0.057829  0.730002
2000-01-04 -1.132797 -0.721817 -0.814920  1.042774
--------------------------------------------------
                   A         B         C         D
2000-01-01 -0.562224 -2.704382 

---
## &diams; Indexing and Selection : Exercise

Given the following DataFrame `data`,
we provide 10 different ways to slice out the following sub-DataFrame.

```
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
```


In [14]:
dates = pd.date_range('1/1/2000', periods=8)
np.random.seed(206)
data = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C','D'])
data

                   A         B         C         D
2000-01-01 -0.562224 -2.704382  0.100133 -0.556712
2000-01-02  0.282239  0.877420  0.503913 -0.981433
2000-01-03 -1.634617 -0.321768 -0.057829  0.730002
2000-01-04 -1.132797 -0.721817 -0.814920  1.042774
2000-01-05  0.612227  1.799571 -0.166019 -1.697868
2000-01-06 -0.595997  1.913933 -0.877715 -1.030056
2000-01-07 -1.582156 -0.277063  0.255720 -2.215322
2000-01-08 -1.006581  0.560446 -1.501818  0.628199


In [15]:
# ANSWER
# Ten ways to extract the same rows and columns from the DataFrame `data`.
# .loc selects by label (label slices include both endpoints); .iloc selects by integer position.
# Each line selects the rows labeled '2000-01-07' and '2000-01-08' and the columns 'C' and 'D'.
print(data.loc['2000-01-07':'2000-01-08', ['C', 'D']])
print(data.loc[data.index[6:8], ['C', 'D']])
print(data.iloc[[6,7], [2,3]])
print(data.loc[data.index[[6,7]], ['C', 'D']])
print(data[['C', 'D']].loc[data.index[6:8]])
print(data.loc[:, ['C', 'D']].loc[data.index[6:8]])
print(data.loc[data.index[6:8], :][['C', 'D']])
print(data[['C', 'D']].iloc[6:8])
print(data.iloc[6:8, [2,3]])
print(data[['C', 'D']].iloc[-2:])

                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
                   C         D
2000-01-07  0.255720 -2.215322
2000-01-08 -1.501818  0.628199
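
Rather than comparing the printed output by eye, the selections can be checked programmatically with `DataFrame.equals`. The sketch below assumes `data` is rebuilt with the same seed as above; the list name `candidates` is hypothetical. It covers three of the answers plus two further variants (a boolean row mask and `tail`).

```python
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
np.random.seed(206)
data = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

candidates = [
    data.loc['2000-01-07':'2000-01-08', ['C', 'D']],
    data.iloc[6:8, [2, 3]],
    data[['C', 'D']].iloc[-2:],
    data.loc[data.index >= '2000-01-07', 'C':'D'],   # boolean row mask, label column slice
    data.tail(2)[['C', 'D']],                        # last two rows, then the two columns
]

reference = candidates[0]
print(all(reference.equals(c) for c in candidates))
```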
