### 10 minutes to pandas
https://pandas.pydata.org/docs/user_guide/10min.html#minutes-to-pandas 

In [3]:
import numpy as np
import pandas as pd

### Basic data structures in pandas

Pandas provides two types of classes for handling data:

1. Series: a one-dimensional labeled array holding data of any type
    such as integers, strings, Python objects etc.

2. DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
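To make the relationship between the two classes concrete, the sketch below builds a small DataFrame and pulls one column out of it. The names `frame` and `column` and the sample values are illustrative assumptions, not part of the original guide.

```python
import pandas as pd

# A tiny two-column table; labels and values are made up for illustration.
frame = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Selecting a single column yields a one-dimensional Series.
column = frame["x"]

print(type(frame).__name__)   # DataFrame
print(type(column).__name__)  # Series

# The Series keeps the row labels (index) of the frame it came from.
print(column.index.equals(frame.index))  # True
```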


### Object creation
Creating a Series by passing a list of values, letting pandas create a default RangeIndex.

In [6]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
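The float64 dtype above is a consequence of the np.nan entry: NaN is a floating-point value, so the integers are stored as floats alongside it. The following sketch, with illustrative names `ints`, `with_nan` and `labeled`, contrasts that with a list of plain integers and also shows passing an explicit index instead of the default RangeIndex.

```python
import numpy as np
import pandas as pd

# Without missing values the integers keep an integer dtype.
ints = pd.Series([1, 3, 5])

# A single np.nan makes pandas store the whole Series as float64.
with_nan = pd.Series([1, 3, 5, np.nan])

# An explicit index replaces the default 0..n-1 RangeIndex.
labeled = pd.Series([1, 3, 5], index=["a", "b", "c"])

print(ints.dtype)
print(with_nan.dtype)  # float64
print(labeled["b"])    # 3
```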

Creating a DataFrame by passing a NumPy array with a datetime index using date_range() and labeled columns:

In [8]:
dates = pd.date_range("20130101", periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [66]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.382087,-1.87105,1.015998,0.370699
2013-01-02,0.198678,-1.344216,-0.391337,1.178942
2013-01-03,0.672602,1.352077,-0.435937,-0.169809
2013-01-04,-0.070704,0.776498,-0.267532,0.456509
2013-01-05,-0.177941,-1.135375,-0.430509,-0.073006
2013-01-06,-0.029363,1.339715,0.03114,1.215037
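Because np.random.randn draws new numbers on every call, re-running the cell above produces a different frame each time, which is why outputs in this guide do not all show the same values. One way to get repeatable values, sketched here as an assumption rather than something the original guide does, is a seeded NumPy Generator. The seed 0 and the names `df_a` and `df_b` are arbitrary.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)

def make_frame(seed):
    # A seeded Generator yields the same draws every time it is re-created.
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        rng.standard_normal((6, 4)), index=dates, columns=list("ABCD")
    )

df_a = make_frame(0)
df_b = make_frame(0)

# Same seed, same values.
print(df_a.equals(df_b))  # True
```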


Creating a DataFrame by passing a dictionary of objects where the keys are the column labels and the values are the column values.

In [11]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes:

In [13]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object
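Two things are at work in the dictionary constructor: scalar values such as 1.0 and "foo" are repeated for every row, and each column keeps its own dtype. A reduced sketch of the same idea follows; the frame `small` is an illustrative cut-down version of df2, and `select_dtypes` is used only to show one way to act on per-column dtypes.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "A": 1.0,                               # scalar, repeated for every row
        "D": np.array([3] * 4, dtype="int32"),  # array-like, fixes the length at 4
        "F": "foo",                             # scalar, repeated for every row
    }
)

print(len(small))  # 4

# Keep only the numeric columns, whatever their exact numeric dtype.
numeric = small.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'D']
```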

### Viewing data
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:

In [15]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-0.799797,-0.957632,0.034081,1.0545
2013-01-02,0.298638,-0.558066,-0.85244,-0.545475
2013-01-03,0.63541,0.601893,-0.596859,0.238593
2013-01-04,0.70716,1.727727,1.312219,1.283402
2013-01-05,0.71439,1.402132,0.720214,0.894386


In [16]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.70716,1.727727,1.312219,1.283402
2013-01-05,0.71439,1.402132,0.720214,0.894386
2013-01-06,-1.488932,-0.598389,-0.099698,0.902816


Display the DataFrame.index or DataFrame.columns:

In [18]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

Return a NumPy representation of the underlying data with DataFrame.to_numpy() without the index or column labels:

In [30]:
df.to_numpy()

array([[-0.79979653, -0.95763154,  0.03408109,  1.0544997 ],
       [ 0.29863755, -0.55806558, -0.85244009, -0.54547455],
       [ 0.63540967,  0.60189327, -0.59685865,  0.23859302],
       [ 0.70715951,  1.72772737,  1.31221874,  1.28340216],
       [ 0.7143897 ,  1.40213204,  0.72021381,  0.89438637],
       [-1.48893191, -0.59838889, -0.09969758,  0.90281584]])
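A NumPy array has a single dtype for the whole array, while a DataFrame has one dtype per column, so to_numpy() has to pick a dtype that can hold every column. The sketch below uses two small illustrative frames, `homogeneous` and `mixed`, to show the difference.

```python
import pandas as pd

# All columns are float64, so the array is float64 as well.
homogeneous = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Floats and strings have no common numeric dtype; the result falls back to object.
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

print(homogeneous.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)        # object
```

For frames with mixed column types the conversion therefore tends to be more expensive, since the data must be copied into the common dtype.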

DataFrame.describe() shows a quick statistical summary of your data:

In [33]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.011145,0.269611,0.086253,0.638037
std,0.933724,1.136989,0.81117,0.676168
min,-1.488932,-0.957632,-0.85244,-0.545475
25%,-0.525188,-0.588308,-0.472568,0.402541
50%,0.467024,0.021914,-0.032808,0.898601
75%,0.689222,1.202072,0.548681,1.016579
max,0.71439,1.727727,1.312219,1.283402


Transposing your data:

In [37]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-0.799797,0.298638,0.63541,0.70716,0.71439,-1.488932
B,-0.957632,-0.558066,0.601893,1.727727,1.402132,-0.598389
C,0.034081,-0.85244,-0.596859,1.312219,0.720214,-0.099698
D,1.0545,-0.545475,0.238593,1.283402,0.894386,0.902816


DataFrame.sort_index() sorts by an axis:

In [40]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,1.0545,0.034081,-0.957632,-0.799797
2013-01-02,-0.545475,-0.85244,-0.558066,0.298638
2013-01-03,0.238593,-0.596859,0.601893,0.63541
2013-01-04,1.283402,1.312219,1.727727,0.70716
2013-01-05,0.894386,0.720214,1.402132,0.71439
2013-01-06,0.902816,-0.099698,-0.598389,-1.488932


DataFrame.sort_values() sorts by values:

In [44]:
df.sort_values(by="B")

Unnamed: 0,A,B,C,D
2013-01-01,-0.799797,-0.957632,0.034081,1.0545
2013-01-06,-1.488932,-0.598389,-0.099698,0.902816
2013-01-02,0.298638,-0.558066,-0.85244,-0.545475
2013-01-03,0.63541,0.601893,-0.596859,0.238593
2013-01-05,0.71439,1.402132,0.720214,0.894386
2013-01-04,0.70716,1.727727,1.312219,1.283402
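Both sorting methods return a new object and leave the original untouched by default. A minimal sketch with a made-up frame `scores` shows ascending and descending order side by side.

```python
import pandas as pd

# Illustrative data; the row labels r1..r3 are invented for this example.
scores = pd.DataFrame({"B": [0.6, -0.9, 1.7]}, index=["r1", "r2", "r3"])

by_b = scores.sort_values(by="B")                        # smallest first
by_b_desc = scores.sort_values(by="B", ascending=False)  # largest first

print(list(by_b.index))       # ['r2', 'r1', 'r3']
print(list(by_b_desc.index))  # ['r3', 'r1', 'r2']

# The original frame keeps its row order.
print(list(scores.index))     # ['r1', 'r2', 'r3']
```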


### Selection
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods DataFrame.at[], DataFrame.iat[], DataFrame.loc[] and DataFrame.iloc[]. These are indexers, so they are used with square brackets rather than called with parentheses.

### Getitem([])
For a DataFrame, passing a single label selects a column and yields a Series equivalent to df.A:

In [48]:
df["A"]

2013-01-01   -0.799797
2013-01-02    0.298638
2013-01-03    0.635410
2013-01-04    0.707160
2013-01-05    0.714390
2013-01-06   -1.488932
Freq: D, Name: A, dtype: float64

For a DataFrame, passing a slice : selects matching rows:

In [51]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-0.799797,-0.957632,0.034081,1.0545
2013-01-02,0.298638,-0.558066,-0.85244,-0.545475
2013-01-03,0.63541,0.601893,-0.596859,0.238593


In [53]:
df["20130102":"20130104"]

Unnamed: 0,A,B,C,D
2013-01-02,0.298638,-0.558066,-0.85244,-0.545475
2013-01-03,0.63541,0.601893,-0.596859,0.238593
2013-01-04,0.70716,1.727727,1.312219,1.283402


### Selection by label
Selecting all rows (:) with the selected column labels:

In [58]:
df.loc[:, ["A", "B"]]

Unnamed: 0,A,B
2013-01-01,-0.799797,-0.957632
2013-01-02,0.298638,-0.558066
2013-01-03,0.63541,0.601893
2013-01-04,0.70716,1.727727
2013-01-05,0.71439,1.402132
2013-01-06,-1.488932,-0.598389


For label slicing, both endpoints are included:

In [61]:
df.loc["20130102":"20130104", ["A", "B"]]

Unnamed: 0,A,B
2013-01-02,0.298638,-0.558066
2013-01-03,0.63541,0.601893
2013-01-04,0.70716,1.727727
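The inclusive endpoint is the main difference from positional slicing, where the stop position is excluded. The sketch below, using an illustrative single-column frame, selects the same three rows both ways.

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)

# Label slice: 2013-01-04 is included.
by_label = frame.loc["20130102":"20130104"]

# Position slice: the stop position 4 is excluded, so rows 1, 2 and 3 are returned.
by_position = frame.iloc[1:4]

print(len(by_label))                 # 3
print(by_label.equals(by_position))  # True
print(len(frame.iloc[1:3]))          # 2
```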


Selecting a single row and column label returns a scalar:

In [64]:
df.loc[dates[0], "A"]

-0.7997965329245801

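For getting fast access to a scalar (equivalent to the prior method; the value below differs from the previous output only because df was re-created with new random values in cell In [66] between the two calls):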

In [69]:
df.at[dates[0], "A"]

0.3820872432863873
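On one and the same frame, .loc and .at return an identical scalar for a given row and column label. A small check, assuming a frame built from a seeded generator so the values are repeatable (the seed and the names are arbitrary):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
frame = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

via_loc = frame.loc[dates[0], "A"]  # general label-based lookup
via_at = frame.at[dates[0], "A"]    # scalar-only lookup

print(via_loc == via_at)  # True
```

The .at accessor only handles single-cell access, which is what allows it to skip the extra work .loc does to support slices, lists and boolean masks.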

### Selection by position
Select via the position of the passed integers:

In [72]:
df.iloc[3]

A   -0.070704
B    0.776498
C   -0.267532
D    0.456509
Name: 2013-01-04 00:00:00, dtype: float64

Integer slices act similarly to NumPy/Python:

In [75]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,-0.070704,0.776498
2013-01-05,-0.177941,-1.135375


Lists of integer position locations:

In [78]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2013-01-02,0.198678,-0.391337
2013-01-03,0.672602,-0.435937
2013-01-05,-0.177941,-0.430509


For slicing rows explicitly:

In [81]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,0.198678,-1.344216,-0.391337,1.178942
2013-01-03,0.672602,1.352077,-0.435937,-0.169809


For slicing columns explicitly:

In [84]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2013-01-01,-1.87105,1.015998
2013-01-02,-1.344216,-0.391337
2013-01-03,1.352077,-0.435937
2013-01-04,0.776498,-0.267532
2013-01-05,-1.135375,-0.430509
2013-01-06,1.339715,0.03114


For getting a value explicitly:

In [87]:
df.iloc[1, 1]

-1.3442164039218918

For getting fast access to a scalar (equivalent to the prior method):

In [90]:
df.iat[1, 1]

-1.3442164039218918

### Boolean indexing
Select rows where df.A is greater than 0.

In [93]:
df[df["A"] > 0]

Unnamed: 0,A,B,C,D
2013-01-01,0.382087,-1.87105,1.015998,0.370699
2013-01-02,0.198678,-1.344216,-0.391337,1.178942
2013-01-03,0.672602,1.352077,-0.435937,-0.169809


Selecting values from a DataFrame where a boolean condition is met:

In [103]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,0.382087,,1.015998,0.370699
2013-01-02,0.198678,,,1.178942
2013-01-03,0.672602,1.352077,,
2013-01-04,,0.776498,,0.456509
2013-01-05,,,,
2013-01-06,,1.339715,0.03114,1.215037
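Indexing with a boolean frame keeps the original shape and fills every position where the condition is False with NaN (shown as empty cells above). The sketch below, on a tiny illustrative frame, shows that this matches DataFrame.where() with the same condition.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

masked = frame[frame > 0]

# Same shape as the input; non-matching cells become NaN.
print(masked.shape)                            # (2, 2)
print(int(masked.isna().sum().sum()))          # 2
print(masked.equals(frame.where(frame > 0)))   # True
```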


Using the isin() method for filtering:

In [111]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]


In [113]:
df2[df2["E"].isin(["two", "four"])]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.672602,1.352077,-0.435937,-0.169809,two
2013-01-05,-0.177941,-1.135375,-0.430509,-0.073006,four
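isin() returns a boolean Series, so it can be stored, reused, and inverted with `~` to select the rows that are not in the list. A short sketch using only the "E" column from above (the names `labels` and `mask` are illustrative):

```python
import pandas as pd

labels = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]})

# True where the value is one of the listed labels.
mask = labels["E"].isin(["two", "four"])

print(list(labels[mask].index))   # [2, 4]

# ~ flips the mask, keeping every other row.
print(list(labels[~mask].index))  # [0, 1, 3, 5]
```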
