
# Object creation

In [None]:
import numpy as np
import pandas as pd
pd.__version__

In [None]:
#Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1,2,np.nan,4])
s

In [None]:
#Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range(start='20000101', periods=6)
dates


In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

In [None]:
#Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:
df2 = pd.DataFrame({'A':1,
                   'B':pd.Timestamp('20220202'),
                   'C':pd.Series(1, index=list(range(4)), dtype="float32"),
                   'D':np.array([3]*4,dtype='float32'),
                   'E':pd.Categorical(['test','train','test','train']),
                   'F':'foo'
                   })
df2

In [None]:
#The columns of the resulting DataFrame have different dtypes:
df2.dtypes

# Viewing data

In [None]:
#Here is how to view the top and bottom rows of the frame:
df.head()

In [None]:
df.tail(3)

In [None]:
#Display the index, columns:
df.index

In [None]:
df.columns

[DataFrame.to_numpy()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy) gives a NumPy representation of the underlying data. This can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call `DataFrame.to_numpy()`, pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being `object`, which requires casting every value to a Python object.
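
The dtype rule described above can be checked on two small frames. This is a minimal sketch; the names `floats` and `mixed` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# A frame whose columns all share one dtype converts to an array of that dtype.
floats = pd.DataFrame(np.zeros((3, 2)), columns=["A", "B"])
floats_arr = floats.to_numpy()

# A frame mixing floats and strings has no common numeric dtype,
# so the resulting array falls back to dtype object.
mixed = pd.DataFrame({"A": [1.0, 2.0], "B": ["x", "y"]})
mixed_arr = mixed.to_numpy()

print(floats_arr.dtype)
print(mixed_arr.dtype)
```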

In [None]:
#For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data:
df.to_numpy()

In [None]:
#For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:
df2.to_numpy()

In [None]:
#describe() shows a quick statistic summary of your data:
df.describe()

In [None]:
#Transposing your data:
df.T

In [None]:
#Sorting by an axis:
df.sort_index(axis=1,ascending=False)

In [None]:
#Sorting by values:
df.sort_values(by='B',ascending=False)

# Selection

## Getting

In [None]:
#Selecting a single column, which yields a Series, equivalent to df.A:
df['A']

In [None]:
#Selecting via [], which slices the rows:
df[0:3]

## Selection by label

In [None]:
#For getting a cross section using a label:
df.loc[dates[0]]

In [None]:
#Selecting on a multi-axis by label:
df.loc[:,['A','C']]

In [None]:
#Showing label slicing, both endpoints are included:
df.loc['2000-01-02':'2000-01-05',['A','C']]
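
Label slices with `.loc` include both endpoints, whereas position slices with `.iloc` follow the usual Python convention and exclude the stop position. The sketch below contrasts the two on a small frame; `frame`, `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

dates = pd.date_range("20000101", periods=6)
frame = pd.DataFrame({"A": range(6)}, index=dates)

# Label slice: both 2000-01-02 and 2000-01-04 are included (3 rows).
by_label = frame.loc["2000-01-02":"2000-01-04"]

# Position slice: the stop position 3 is excluded (2 rows).
by_position = frame.iloc[1:3]

print(len(by_label), len(by_position))
```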

In [None]:
#Reduction in the dimensions of the returned object:
df.loc['2000-01-02',['A','C']]

In [None]:
#For getting a scalar value:
df.loc[dates[0],'D']

In [None]:
#For getting fast access to a scalar (equivalent to the prior method):
df.at[dates[0],'D']

## Selection by position

In [None]:
#Select via the position of the passed integers:
df.iloc[3]

In [None]:
#By integer slices, acting similar to NumPy/Python:
df.iloc[0:2,1:3]

In [None]:
#By lists of integer position locations, similar to the NumPy/Python style:
df.iloc[[1,5],[0,2]]

In [None]:
#For slicing rows explicitly:
df.iloc[3:4,:]

In [None]:
#For slicing columns explicitly:
df.iloc[:,2:3]

In [None]:
#For getting a value explicitly:
df.iloc[2,2]

## Boolean indexing

In [None]:
#Using a single column’s values to select data:
df[df['A'] > 0.7]

In [None]:
#Selecting values from a DataFrame where a boolean condition is met:
df[df > 0.5]
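
Indexing a DataFrame with a boolean DataFrame keeps the original shape and replaces every entry that fails the condition with `NaN`, which matches what `DataFrame.where` does. A minimal sketch on an illustrative two-by-two frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [0.1, 0.9], "B": [0.7, 0.2]})

# Entries that are not greater than 0.5 become NaN; the shape is unchanged.
masked = frame[frame > 0.5]

# The same result expressed with where().
same_as_where = masked.equals(frame.where(frame > 0.5))

print(masked)
print(same_as_where)
```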

In [None]:
#Using the isin() method for filtering:
df2 = df.copy()
df2['E'] = ["one", "one", "two", "three", "four", "three"]
df2

In [None]:
df2[df2["E"].isin(["two", "four"])]

## Setting

In [None]:
#Setting a new column automatically aligns the data by the indexes:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20000101", periods=6))
df["F"] = s1
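
Because `s1` above shares its index with `df`, the alignment is not visible in the result. The following sketch uses a Series whose index is offset by one day, so the effect shows up; `frame` and `shifted` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]},
                     index=pd.date_range("20000101", periods=3))

# This Series starts one day later than the frame's index.
shifted = pd.Series([10, 20, 30], index=pd.date_range("20000102", periods=3))

# Values are matched by index label, not by position: the first row has no
# matching label and becomes NaN, and the value for 2000-01-04 is dropped.
frame["F"] = shifted

print(frame)
```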

In [None]:
df

In [None]:
#Setting values by label:
df.loc[dates[0],'A'] = 0.333333
df

In [None]:
#Setting values by position:
df.iat[0,3] = .444444
df

In [None]:
#Setting by assigning with a NumPy array:
df.loc[:,'D'] = np.array([5] * len(df))
df

In [None]:
#A where operation with setting:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

# Missing data

pandas primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations. See the Missing Data section.
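
As a small illustration of that default, reductions such as `mean()` skip `NaN` unless `skipna=False` is passed. The Series `values` below is an illustrative name.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# NaN is skipped by default, so the mean is taken over 1.0 and 3.0.
default_mean = values.mean()

# With skipna=False the missing value propagates into the result.
strict_mean = values.mean(skipna=False)

# count() reports only the non-missing entries.
non_missing = values.count()

print(default_mean, strict_mean, non_missing)
```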



In [None]:
#Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1

df1

In [None]:
#To drop any rows that have missing data:
df1.dropna(how="any")

In [None]:
#Filling missing data:
df1.fillna(5)

In [None]:
#To get the boolean mask where values are NaN:
df1.isna()

# Operations

## Stats

Operations in general exclude missing data.

In [None]:
df.mean()

In [None]:
#Same operation on the other axis:
df.mean(axis=1)

In [None]:
#Operating with objects that have different dimensionality and need alignment. 
#In addition, pandas automatically broadcasts along the specified dimension:

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s

In [None]:
df

In [None]:
df.sub(s,axis='index')
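
With `axis='index'`, the Series is aligned on the row labels and then subtracted from every column. Rows where the Series is `NaN` come out as `NaN` across the whole row. A minimal sketch with illustrative names `frame`, `offsets` and `result`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]},
                     index=["x", "y", "z"])
offsets = pd.Series([1.0, np.nan, 3.0], index=["x", "y", "z"])

# Each row label's offset is subtracted from every column in that row.
result = frame.sub(offsets, axis="index")

print(result)
```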

## Apply

In [None]:
#Applying functions to the data:
df.apply(np.cumsum)

In [None]:
df.apply(lambda x: x.max() - x.min(),axis=1)

## Histogramming

In [None]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

In [None]:
s.value_counts()

## String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
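
To make the regular-expression default concrete, the sketch below uses `str.contains`, which treats its pattern as a regex unless `regex=False` is passed. The example strings and variable names are illustrative.

```python
import numpy as np
import pandas as pd

codes = pd.Series(["a.b", "axb"])

# As a regex, "." matches any character, so both strings match.
as_regex = codes.str.contains(".")

# As a literal, "." matches only the string that contains a dot.
as_literal = codes.str.contains(".", regex=False)

# String methods pass missing values through unchanged.
upper = pd.Series(["x", np.nan]).str.upper()

print(as_regex.tolist(), as_literal.tolist())
print(upper)
```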

In [None]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

In [None]:
s.str.lower()

# Merge

## Concat

In [None]:
df = pd.DataFrame(np.random.randn(10, 4))
df

In [None]:
#break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

In [None]:
pd.concat(pieces)

**Note:**

Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
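
One way to follow this recommendation is to collect rows as plain dictionaries and call the constructor once at the end. The record fields `id` and `square` below are made up for illustration.

```python
import pandas as pd

# Collect the rows in an ordinary Python list first.
records = []
for i in range(3):
    records.append({"id": i, "square": i * i})

# Build the DataFrame in a single step from the finished list.
frame = pd.DataFrame(records)

print(frame)
```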

## Join

In [None]:
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
left

In [None]:
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
right

In [None]:
pd.merge(left, right, on="key")

In [None]:
#Another example, this time with unique keys, so each key matches exactly one row on each side:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
left

In [None]:
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
right

In [None]:
pd.merge(left, right, on="key")
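
The two merges above differ in how many rows they return. When a key is repeated on both sides, every left row pairs with every matching right row; with unique keys each key yields a single row. The sketch below counts the rows in both cases, using illustrative variable names.

```python
import pandas as pd

# Repeated key on both sides: 2 left rows x 2 right rows = 4 result rows.
dup_left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
dup_right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
many = pd.merge(dup_left, dup_right, on="key")

# Unique keys: one result row per key.
uniq_left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
uniq_right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
one_to_one = pd.merge(uniq_left, uniq_right, on="key")

print(len(many), len(one_to_one))
```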

# Grouping

By **“group by”** we are referring to a process involving one or more of the following steps:

 - **Splitting** the data into groups based on some criteria
 - **Applying** a function to each group independently
 - **Combining** the results into a data structure
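
The three steps can be spelled out by hand and compared with the built-in shortcut. This is only a sketch of the idea, not how pandas implements `groupby` internally; the column names `key` and `val` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})

# Split: iterating over a groupby yields (group name, sub-frame) pairs.
# Apply: sum the "val" column of each sub-frame independently.
sums = {name: group["val"].sum() for name, group in frame.groupby("key")}

# Combine: gather the per-group results into one Series.
manual = pd.Series(sums)

# The same computation in a single call.
builtin = frame.groupby("key")["val"].sum()

print(manual.tolist(), builtin.tolist())
```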

In [None]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

df

In [None]:
#Grouping by column A, selecting the numeric columns, and then applying the sum() function to the resulting groups:
df.groupby('A')[['C', 'D']].sum()

In [None]:
#Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
df.groupby(["A",'B']).sum()

# Reshaping

In [None]:
tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)
tuples

In [None]:
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df

In [None]:
df2 = df[:4]
df2

In [None]:
#The stack() method “compresses” a level in the DataFrame’s columns:
stacked = df2.stack()
stacked

In [None]:
#With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()

In [None]:
stacked.unstack(1)

In [None]:
stacked.unstack(0)
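
The relationship between `stack()` and `unstack()` can be checked as a round trip on a small frame. This is a sketch with illustrative names; note that some pandas versions emit a `FutureWarning` about the `stack()` implementation, which does not affect the result here.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]],
                                   names=["first", "second"])
frame = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=index,
                     columns=["A", "B"])

# stack() moves the column labels into a new innermost index level,
# producing a Series with one entry per (first, second, column) triple.
stacked = frame.stack()

# unstack() with no argument moves that innermost level back to columns.
restored = stacked.unstack()

print(stacked.shape, restored.shape)
```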

## Pivot tables

In [None]:
df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)
df

In [None]:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

In [None]:
pd.pivot(df,index=['A','B'],columns='C',values='D')
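
The two calls above agree on this data because every (A, B, C) combination occurs only once. They differ when combinations repeat: `pivot_table` aggregates duplicates (using the mean by default), while `pivot` only reshapes and raises a `ValueError`. A minimal sketch with made-up column names `row`, `col` and `val`:

```python
import pandas as pd

# The pair (r1, c1) appears twice.
frame = pd.DataFrame({
    "row": ["r1", "r1", "r2"],
    "col": ["c1", "c1", "c1"],
    "val": [1.0, 3.0, 5.0],
})

# pivot_table aggregates the duplicates; the default aggregation is the mean.
table = pd.pivot_table(frame, values="val", index="row", columns="col")

# pivot cannot decide which duplicate to keep and raises instead.
try:
    pd.pivot(frame, index="row", columns="col", values="val")
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(table)
print(pivot_failed)
```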