# Introduction to pandas
Adapted from "10 minutes to pandas":
https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min

See also the cheatsheet:
https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

In [None]:
import numpy as np
import pandas as pd

pandas provides two handy data structures:
- Series
- DataFrame

which store 1-dimensional and 2-dimensional labelled arrays, respectively.
A NumPy array has one dtype for the entire array, while a pandas DataFrame has one dtype per column.
This means a data frame can store a different type of object in each column,
e.g., integers, reals, booleans, strings, dates.
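
A minimal sketch of the dtype difference, using made-up data (the names `arr` and `mixed` are only illustrative and are not used elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype: mixing ints and strings forces a common one.
arr = np.array([[1, "a"], [2, "b"]])
print(arr.dtype)  # a Unicode string dtype

# A DataFrame keeps a separate dtype for every column.
mixed = pd.DataFrame({
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
    "flag": [True, False, True],
    "when": pd.to_datetime(["2019-10-01", "2019-10-02", "2019-10-03"]),
})
print(mixed.dtypes)
```

The `dtypes` attribute lists one entry per column, which is a quick way to check how pandas interpreted your data.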

## Object creation
Creating a Series by passing a list of values, letting pandas create 
a default integer index:

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
print("xxx")
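
As a small illustration of what the default index looks like and how to replace it, the sketch below builds three toy Series (all names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Default index: consecutive integers starting at 0.
s_default = pd.Series([1, 3, 5])

# A single NaN turns the values into floats, because NaN is a float.
s_nan = pd.Series([1, 3, np.nan])

# An explicit index replaces the default integer labels.
s_labeled = pd.Series([1, 3, 5], index=list("abc"))
print(s_labeled["b"])
```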

Creating a DataFrame by passing a NumPy array,
with a datetime index and labeled columns:

In [None]:
dates = pd.date_range('20191001',periods=16)
dates

In [None]:
df = pd.DataFrame(np.random.rand(16,4),index=dates,columns=list('ABCD'))
df
# ?np.random.randint
# np.random.randint(0,high=10,size=(3,4))

If you don't like the column names, you can pass your own list of strings:

In [None]:
df = pd.DataFrame(np.random.rand(16,4),index=dates,columns=['A','B','Cyder','D'])
df

or rename only some of the columns:

In [None]:
df=df.rename(columns={'Cyder':'buba'})
df

Creating a DataFrame by passing a dictionary of objects
that can be converted to a series-like structure:

In [None]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

### Manipulate

The data frame can be sorted in several ways, e.g.:
- by row or column labels
- by the values in a selected row or column

In [None]:
# df2.sort_index(axis=1,ascending=True)
# df2.sort_values(by='E',ascending=False)
# df.sort_values(by='2019-10-12',axis=1,ascending=False)
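
The four sorting variants can be sketched on a tiny frame where the result is easy to verify by eye. The frame `toy` and its labels are made up for this illustration:

```python
import pandas as pd

toy = pd.DataFrame(
    {"b": [9, 2, 4], "a": [7, 5, 1]},
    index=["r2", "r0", "r1"],
)

by_row_labels = toy.sort_index()                          # rows in order r0, r1, r2
by_col_labels = toy.sort_index(axis=1)                    # columns in order a, b
by_col_values = toy.sort_values(by="b", ascending=False)  # rows ordered by column b
by_row_values = toy.sort_values(by="r1", axis=1)          # columns ordered by row r1
```

All four calls return a new data frame and leave `toy` itself untouched.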

Now read the documentation (or cheatsheet) and explain what happens in each of the following lines:

In [None]:
# df.T
# pd.melt(df)
# df2.pivot(columns='E')
# df.drop(columns=['A'])
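
If the output on the 16-row frame is hard to read, it may help to try some of these operations on a frame small enough to follow by hand first. The sketch below uses a hypothetical 2x2 frame called `small`:

```python
import pandas as pd

small = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])

transposed = small.T                    # row labels and column labels swap places
long_form = pd.melt(small)              # one row per (column name, value) pair
without_x = small.drop(columns=["x"])   # returns a new frame; `small` is unchanged

print(transposed)
print(long_form)
print(without_x)
```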

### Access data
There are many ways to access data frame entries.

Let's try different ways of selecting the first column:
- by column name, used as an attribute of the data frame object
- by column name, used as a key in square brackets
- using the .loc indexer (select all rows and the column labelled 'A')
- using the .iloc indexer (select all rows and the first column by position)

Uncomment each line below and check results:

In [None]:
# df.A
# df['A']
# df.loc[:,'A']
# df.iloc[:,0]
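
The sketch below checks that the four spellings agree, and shows one caveat of attribute access: it stops working when a column name clashes with an existing DataFrame attribute or method. The frame and its column name `count` are chosen purely to trigger that clash:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "count"])

# All four spellings return the same Series for column 'A'.
all_equal = (
    frame.A.equals(frame["A"])
    and frame["A"].equals(frame.loc[:, "A"])
    and frame["A"].equals(frame.iloc[:, 0])
)
print(all_equal)

# 'count' is also a DataFrame method, so the attribute is the method, not the column.
print(callable(frame.count))
print(frame["count"])  # square brackets always reach the column
```

Square brackets, `.loc` and `.iloc` work for any column name, so they are the safer choice in scripts.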

You can access a range of entries:
- by specifying start:end using row labels (the end label is included)
- by specifying start:end using integer row positions (the end position is excluded)
- by passing lists of positions (they don't need to be consecutive)
- by passing a list of boolean values (True = include data, False = exclude data)

Uncomment each line below separately to see the results:

In [None]:
# df[0:3]
# df['20191001':'20191003']
# df.iloc[0:3, :]
# df.iloc[[1,4,13],[0,2]]
# df.iloc[:,[True,True,False,False]]
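
One detail that is easy to miss: a positional slice excludes its end, while a label slice includes it. A minimal sketch on a made-up five-day frame:

```python
import numpy as np
import pandas as pd

days = pd.date_range("2019-10-01", periods=5)
frame = pd.DataFrame(np.arange(10).reshape(5, 2), index=days, columns=["A", "B"])

by_position = frame.iloc[0:2]             # positions 0 and 1: the end is excluded
by_label = frame.loc[days[0]:days[2]]     # three days: the end label is included
by_mask = frame.loc[:, [True, False]]     # boolean list keeps column A only

print(len(by_position), len(by_label))
```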

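You can look up column names using a regular expression
(provided you named them sensibly).

Let's find all the columns starting with an uppercase letter:

In [None]:
# '^' anchors the pattern at the start of the name;
# without it, an uppercase letter anywhere in the name would match
df.filter(regex='^[A-Z]')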
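
`DataFrame.filter` applies the pattern with `re.search`, so it matches anywhere in the label unless the pattern is anchored. The sketch below shows the difference on a one-row frame with made-up column names:

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3, 4]], columns=["A", "B", "buba", "dX"])

anywhere = frame.filter(regex="[A-Z]")    # an uppercase letter anywhere in the name
at_start = frame.filter(regex="^[A-Z]")   # an uppercase letter in first position

print(list(anywhere.columns))
print(list(at_start.columns))
```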

Multiple entries can be overwritten simultaneously.

Explain what will change after running the following lines:

In [None]:
df.loc[dates[0], 'D'] = 1.2
# .at is meant for a single row/column label pair, so a column slice goes through .loc
df.loc[dates[0], 'B':'buba'] = 0.3
df['A'] = 3
df
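
The division of labour between the indexers can be sketched on a small frame of zeros, where every change is easy to spot. The frame below is hypothetical and independent of `df`:

```python
import numpy as np
import pandas as pd

days = pd.date_range("2019-10-01", periods=3)
frame = pd.DataFrame(np.zeros((3, 3)), index=days, columns=["A", "B", "C"])

frame.at[days[0], "C"] = 1.2         # .at: exactly one cell, addressed by labels
frame.loc[days[0], "A":"B"] = 0.3    # .loc: label slice, both ends included
frame["A"] = 3                       # the whole column is overwritten with a scalar

print(frame)
```

Note that the last assignment wins for column A: the 0.3 written by `.loc` is replaced along with every other value in that column.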

### Select data

You can perform logical operations on multiple data frame entries at the same time:

In [None]:
df.B>0

Since we can access data with arrays of logical values, then...

Explain what happens here:

In [None]:
df[df.B>0.3]

or here (uncomment each line):

In [None]:
# df[df.B+df.D>=df.A*df.buba]
# df[df>.3]
# df[df>.3].sort_values(by='B',na_position='first')
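
A boolean Series filters rows, while a boolean DataFrame masks individual cells. The sketch below contrasts the two on made-up numbers, and also shows how to combine conditions with `&`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [0.1, 0.5, 0.9], "B": [0.8, 0.2, 0.6]})

rows = frame[frame.B > 0.3]                        # keeps the rows where B > 0.3
both = frame[(frame.A > 0.3) & (frame.B > 0.3)]    # parentheses are required around each test
cells = frame[frame > 0.3]                         # same shape, NaN where the test failed

print(rows)
print(both)
print(cells)
```

Use `&`, `|` and `~` rather than `and`, `or` and `not`, since the Python keywords do not work element-wise.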

If the data frame contains objects encoded as different types,
you can select each type separately.

For instance, let's take only categorical variables:

In [None]:
df2.select_dtypes(include='category')
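
`select_dtypes` also accepts broader groups such as `'number'`, and an `exclude` argument for the opposite selection. A minimal sketch with hypothetical column names:

```python
import pandas as pd

frame = pd.DataFrame({
    "n": [1, 2],
    "x": [0.5, 1.5],
    "label": pd.Categorical(["test", "train"]),
})

numeric = frame.select_dtypes(include="number")      # integer and float columns
non_numeric = frame.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```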

### Viewing data
If the data set is too big to display, you might want only a glimpse of a few rows:

In [None]:
# df.head()
# df.tail(3)

or get familiar with the column and row names:

In [None]:
# df.index
# df.columns

You might want to have a look at some summary statistics:

In [None]:
df.describe()

Or only at the statistics you are interested in:

In [None]:
# df.sum()
# df.count()
df.mean()
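
By default these reductions work column by column. The sketch below, on a made-up 2x2 frame, shows the `axis` argument for row-wise results and `agg` for picking several statistics at once:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 3.0], "B": [2.0, 6.0]})

col_means = frame.mean()               # one value per column
row_means = frame.mean(axis=1)         # one value per row
picked = frame.agg(["sum", "count"])   # chosen statistics, one row each

print(col_means)
print(row_means)
print(picked)
```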

Even if you haven't imported Matplotlib yourself (and you should have!),
you still have several options to plot the data through pandas, which relies on an installed Matplotlib behind the scenes:

In [None]:
# df.plot.box()
# df.plot.hist()
df.plot.scatter(x='B',y='D')