# <span style="color:blue;"> 10 minutes to pandas </span>
https://pandas.pydata.org/docs/user_guide/10min.html#minutes-to-pandas 

In [1]:
import numpy as np
import pandas as pd

## <span style="color:blue;"> Basic data structures in pandas </span>

Pandas provides two types of classes for handling data:

1. Series: a one-dimensional labeled array holding data of any type
    such as integers, strings, Python objects etc.

2. DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
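
As a minimal sketch of how the two classes relate, each column of a DataFrame is itself a Series. The names `prices` and `table` are illustrative assumptions, not part of the original guide:

```python
import pandas as pd

# A Series: one-dimensional, labeled values.
prices = pd.Series([1.5, 2.0, 3.25], index=["a", "b", "c"])

# A DataFrame: two-dimensional, with labeled rows and columns.
# The row labels are taken from the Series index.
table = pd.DataFrame({"price": prices, "qty": [10, 20, 30]})

print(prices.ndim, table.ndim)         # 1 2
print(type(table["price"]).__name__)   # Series
```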


## <span style="color:blue;"> Object creation </span>
Creating a Series by passing a list of values, letting pandas create a default RangeIndex.

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array with a datetime index using date_range() and labeled columns:

In [3]:
dates = pd.date_range("20130101", periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

                   A         B         C         D
2013-01-01 -0.189681  0.579396 -1.080761  0.502686
2013-01-02  0.099414 -0.503570 -1.083398 -2.935428
2013-01-03  0.848529  1.313483 -0.497439 -0.858925
2013-01-04 -0.102225 -0.806213 -1.029741  0.093731
2013-01-05 -0.197370  0.209055 -0.010483 -1.873776
2013-01-06  1.044062  1.030819 -0.593103 -0.151125


Creating a DataFrame by passing a dictionary of objects where the keys are the column labels and the values are the column values.

In [5]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo


The columns of the resulting DataFrame have different dtypes:

In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

## <span style="color:blue;"> Viewing data </span>
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:

In [7]:
df.head()

                   A         B         C         D
2013-01-01 -0.189681  0.579396 -1.080761  0.502686
2013-01-02  0.099414 -0.503570 -1.083398 -2.935428
2013-01-03  0.848529  1.313483 -0.497439 -0.858925
2013-01-04 -0.102225 -0.806213 -1.029741  0.093731
2013-01-05 -0.197370  0.209055 -0.010483 -1.873776


In [8]:
df.tail(3)

                   A         B         C         D
2013-01-04 -0.102225 -0.806213 -1.029741  0.093731
2013-01-05 -0.197370  0.209055 -0.010483 -1.873776
2013-01-06  1.044062  1.030819 -0.593103 -0.151125


Display the DataFrame.index or DataFrame.columns:

In [9]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
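
The cell above shows only the index. A minimal sketch of the companion attribute, DataFrame.columns, follows. It assumes fixed values in place of the random data so that the result is reproducible:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Fixed values (assumption for illustration) instead of np.random.randn.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# df.columns is an Index holding the column labels.
print(list(df.columns))   # ['A', 'B', 'C', 'D']
print(len(df.index))      # 6
```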

Return a NumPy representation of the underlying data with DataFrame.to_numpy() without the index or column labels:

In [10]:
df.to_numpy()

array([[-0.18968101,  0.57939627, -1.08076078,  0.50268621],
       [ 0.09941396, -0.50356962, -1.08339804, -2.93542801],
       [ 0.84852878,  1.31348314, -0.49743906, -0.85892511],
       [-0.10222531, -0.80621251, -1.0297415 ,  0.09373116],
       [-0.19737023,  0.2090549 , -0.01048318, -1.87377569],
       [ 1.04406243,  1.0308189 , -0.5931029 , -0.15112472]])

describe() shows a quick statistical summary of your data:

In [11]:
df.describe()

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.250455  0.303829 -0.715821 -0.870473
std    0.552986  0.838680  0.430602  1.311755
min   -0.197370 -0.806213 -1.083398 -2.935428
25%   -0.167817 -0.325413 -1.068006 -1.620063
50%   -0.001406  0.394226 -0.811422 -0.505025
75%    0.661250  0.917963 -0.521355  0.032517
max    1.044062  1.313483 -0.010483  0.502686


Transposing your data:

In [12]:
df.T

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.189681    0.099414    0.848529   -0.102225   -0.197370    1.044062
B    0.579396   -0.503570    1.313483   -0.806213    0.209055    1.030819
C   -1.080761   -1.083398   -0.497439   -1.029741   -0.010483   -0.593103
D    0.502686   -2.935428   -0.858925    0.093731   -1.873776   -0.151125


DataFrame.sort_index() sorts by an axis:

In [13]:
df.sort_index(axis=1, ascending=False)

                   D         C         B         A
2013-01-01  0.502686 -1.080761  0.579396 -0.189681
2013-01-02 -2.935428 -1.083398 -0.503570  0.099414
2013-01-03 -0.858925 -0.497439  1.313483  0.848529
2013-01-04  0.093731 -1.029741 -0.806213 -0.102225
2013-01-05 -1.873776 -0.010483  0.209055 -0.197370
2013-01-06 -0.151125 -0.593103  1.030819  1.044062


DataFrame.sort_values() sorts by values:

In [14]:
df.sort_values(by="B")

                   A         B         C         D
2013-01-04 -0.102225 -0.806213 -1.029741  0.093731
2013-01-02  0.099414 -0.503570 -1.083398 -2.935428
2013-01-05 -0.197370  0.209055 -0.010483 -1.873776
2013-01-01 -0.189681  0.579396 -1.080761  0.502686
2013-01-06  1.044062  1.030819 -0.593103 -0.151125
2013-01-03  0.848529  1.313483 -0.497439 -0.858925


## <span style="color:blue;"> Selection </span>
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access indexers DataFrame.at[], DataFrame.iat[], DataFrame.loc[] and DataFrame.iloc[]. These are used with square brackets rather than called like methods.

### <span style="color:blue;"> Getitem([ ]) </span>
For a DataFrame, passing a single label selects a column and yields a Series equivalent to df.A:

In [15]:
df["A"]

2013-01-01   -0.189681
2013-01-02    0.099414
2013-01-03    0.848529
2013-01-04   -0.102225
2013-01-05   -0.197370
2013-01-06    1.044062
Freq: D, Name: A, dtype: float64

Attribute access with df.A returns the same Series:

In [34]:
df.A

2013-01-01   -0.189681
2013-01-02    0.099414
2013-01-03    0.848529
2013-01-04   -0.102225
2013-01-05   -0.197370
2013-01-06    1.044062
Freq: D, Name: A, dtype: float64

For a DataFrame, passing a slice : selects matching rows:

In [17]:
df[0:3]

                   A         B         C         D
2013-01-01 -0.189681  0.579396 -1.080761  0.502686
2013-01-02  0.099414 -0.503570 -1.083398 -2.935428
2013-01-03  0.848529  1.313483 -0.497439 -0.858925


In [18]:
df["20130102":"20130104"]

                   A         B         C         D
2013-01-02  0.099414 -0.503570 -1.083398 -2.935428
2013-01-03  0.848529  1.313483 -0.497439 -0.858925
2013-01-04 -0.102225 -0.806213 -1.029741  0.093731


### Selection by label
Selecting all rows (:) with the selected column labels:

In [19]:
df.loc[:, ["A", "B"]]

                   A         B
2013-01-01 -0.189681  0.579396
2013-01-02  0.099414 -0.503570
2013-01-03  0.848529  1.313483
2013-01-04 -0.102225 -0.806213
2013-01-05 -0.197370  0.209055
2013-01-06  1.044062  1.030819
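
A single row can also be selected by passing one index label to .loc. The result is a Series whose index is the column labels. A minimal sketch, assuming fixed illustrative values rather than the random data used in this notebook:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Fixed values (assumption for illustration) so the result is reproducible.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

# One row label in, one Series out; its name is the row label.
row = df.loc[dates[0]]
print(row)
```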


For label slicing, both endpoints are included:

In [20]:
df.loc["20130102":"20130104", ["A", "B"]]

                   A         B
2013-01-02  0.099414 -0.503570
2013-01-03  0.848529  1.313483
2013-01-04 -0.102225 -0.806213


Selecting a single row and column label returns a scalar:

In [21]:
df.loc[dates[0], "A"]

-0.1896810130986254

For getting fast access to a scalar (equivalent to the prior method):

In [22]:
df.at[dates[0], "A"]

-0.1896810130986254

### Selection by position
Select via the position of the passed integers:

In [23]:
df.iloc[3]

A   -0.102225
B   -0.806213
C   -1.029741
D    0.093731
Name: 2013-01-04 00:00:00, dtype: float64

Integer slices act similarly to NumPy/Python:

In [24]:
df.iloc[3:5, 0:2]

                   A         B
2013-01-04 -0.102225 -0.806213
2013-01-05 -0.197370  0.209055


Lists of integer position locations:

In [25]:
df.iloc[[1, 2, 4], [0, 2]]

                   A         C
2013-01-02  0.099414 -1.083398
2013-01-03  0.848529 -0.497439
2013-01-05 -0.197370 -0.010483


For slicing rows explicitly:

In [26]:
df.iloc[1:3, :]

                   A         B         C         D
2013-01-02  0.099414 -0.503570 -1.083398 -2.935428
2013-01-03  0.848529  1.313483 -0.497439 -0.858925


For slicing columns explicitly:

In [27]:
df.iloc[:, 1:3]

                   B         C
2013-01-01  0.579396 -1.080761
2013-01-02 -0.503570 -1.083398
2013-01-03  1.313483 -0.497439
2013-01-04 -0.806213 -1.029741
2013-01-05  0.209055 -0.010483
2013-01-06  1.030819 -0.593103


For getting a value explicitly:

In [28]:
df.iloc[1, 1]

-0.5035696221679727

For getting fast access to a scalar (equivalent to the prior method):

In [29]:
df.iat[1, 1]

-0.5035696221679727

### <span style="color:blue;"> Boolean indexing </span>
Select rows where df.A is greater than 0:

In [30]:
df[df["A"] > 0]

                   A         B         C         D
2013-01-02  0.099414 -0.503570 -1.083398 -2.935428
2013-01-03  0.848529  1.313483 -0.497439 -0.858925
2013-01-06  1.044062  1.030819 -0.593103 -0.151125


Selecting values from a DataFrame where a boolean condition is met:

In [31]:
df[df > 0]

                   A         B    C         D
2013-01-01       NaN  0.579396  NaN  0.502686
2013-01-02  0.099414       NaN  NaN       NaN
2013-01-03  0.848529  1.313483  NaN       NaN
2013-01-04       NaN       NaN  NaN  0.093731
2013-01-05       NaN  0.209055  NaN       NaN
2013-01-06  1.044062  1.030819  NaN       NaN
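
Indexing a DataFrame with a boolean DataFrame keeps the original shape and fills the positions where the condition is False with NaN. A minimal sketch, assuming a small hand-written frame, that compares it with DataFrame.where():

```python
import pandas as pd

# Small illustrative frame (assumption; not the random data used above).
df = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]})

masked = df[df > 0]        # boolean-DataFrame indexing
same = df.where(df > 0)    # explicit where(): keep values where True, NaN elsewhere

# equals() treats NaN values in the same positions as equal.
print(masked.equals(same))  # True
```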


Using the isin() method for filtering:

In [32]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]


In [33]:
df2[df2["E"].isin(["two", "four"])]

                   A         B         C         D     E
2013-01-03  0.848529  1.313483 -0.497439 -0.858925   two
2013-01-05 -0.197370  0.209055 -0.010483 -1.873776  four


### <span style="color:blue;"> Summary </span>

1. .loc[]: Label-based indexing for both rows and columns. It supports slicing and includes both the start and end labels.

2. [ ]: Primarily used for selecting columns by name or performing boolean indexing. It also accepts a list of column names for multi-column selection, and a slice inside [ ] selects rows.
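
The difference in endpoint handling between label-based and position-based slicing can be sketched as follows. The frame and its labels are illustrative assumptions:

```python
import pandas as pd

# Illustrative frame with string row labels (assumption).
df = pd.DataFrame({"A": [10, 20, 30, 40]}, index=["w", "x", "y", "z"])

by_label = df.loc["x":"y"]    # label slice: both endpoints included
by_position = df.iloc[1:2]    # integer slice: end position excluded

print(list(by_label.index))      # ['x', 'y']
print(list(by_position.index))   # ['x']
```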