## Python Pandas Introduction
Pandas is an open source Python library created for efficient and meaningful exploratory data analysis. It is one of the most widely used Python libraries among data analysts, data engineers and, most importantly, data scientists.

Official documentation: https://pandas.pydata.org/docs/

Pandas is built around two major data structures:  
&#9658; <font color=green>DataFrame</font>, a two-dimensional, table-like structure consisting of rows and columns.  
&#9658; <font color=green>Series</font>, a one-dimensional structure analogous to a single column of a table.  
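
As a minimal sketch of how the two structures relate, the snippet below builds a Series and then a DataFrame that uses it as one of its columns. The variable names (`genres`, `df`) are illustrative only.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
genres = pd.Series(["Pop", "Chanson", "Jazz"], name="genre")

# A DataFrame: a table whose columns are Series sharing one index
df = pd.DataFrame({"name": ["Katy Perry", "Jacques Brel", "Vince Jones"],
                   "genre": genres})

print(type(df["genre"]))   # a single column comes back as a Series
print(df.shape)            # (number of rows, number of columns)
```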

Below are some of the most commonly performed data analysis activities using Pandas:
- Reading data into a DataFrame from different kinds of source systems such as CSV, JSON, Excel, database tables etc.
- Manipulating data by applying filter conditions, sorting, and adding/removing columns.
- Grouping and aggregating the data.
- Handling missing/bad records.
- Writing to different target formats such as JSON, CSV, Excel, RDBMS tables etc.
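
The activities listed above can be sketched end to end as follows. This is only an illustration under stated assumptions: the CSV text, the column names and the artist names are made up, and `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real source file
csv_text = """name,genre,sales
Artist A,Pop,120
Artist B,Jazz,
Artist C,Pop,80
"""

# 1. Read: build a DataFrame from CSV text
sales_df = pd.read_csv(io.StringIO(csv_text))

# 2. Handle missing records: replace the missing sales value with 0
sales_df["sales"] = sales_df["sales"].fillna(0)

# 3. Manipulate: filter rows, then sort them
pop_df = sales_df[sales_df["genre"] == "Pop"].sort_values("sales")

# 4. Group and aggregate: total sales per genre
totals = sales_df.groupby("genre")["sales"].sum()

# 5. Write: serialize to a JSON string (pass a path to write a file instead)
json_text = sales_df.to_json(orient="records")
```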

### 1. Creating Pandas DataFrame
A Pandas DataFrame can be created by reading from external storage such as database tables or files in various formats (CSV, JSON, text etc.). It can also be created from Python's built-in data structures such as lists and dicts (dictionaries).
The example below creates a Pandas DataFrame from a dict object.

In [35]:
# Import the pandas library as pd, the widely used alias for pandas
import pandas as pd
# Dict object of key value pair, called artists
artists = {"a id":[1001, 1002, 1003, 1004, 1005],
"NAMe" : ["Katy Perry","Jacques Brel","Tove Lo","KK", "Vince Jones"],
"DOB": ["1984-10-25","1929-04-08","1987-10-29","1968-08-23","1954-03-24",],
"Genre":["Pop", "Chanson","Alternative/Indie", "Rock", 'Jazz'],
"country": ["United States","Belgium", "Sweden", "India", "Australia"]
}

In [36]:
# Create the DataFrame artist_df by passing the dict object to the DataFrame constructor of the pandas library
artist_df = pd.DataFrame(artists)

In [37]:
artist_df.head(2) # First 2 rows; head() defaults to the first 5 rows when no argument is given

Unnamed: 0,a id,NAMe,DOB,Genre,country
0,1001,Katy Perry,1984-10-25,Pop,United States
1,1002,Jacques Brel,1929-04-08,Chanson,Belgium


In [38]:
artist_df.tail(2) # Last 2 rows; tail() defaults to the last 5 rows when no argument is given

Unnamed: 0,a id,NAMe,DOB,Genre,country
3,1004,KK,1968-08-23,Rock,India
4,1005,Vince Jones,1954-03-24,Jazz,Australia


In [39]:
# Get the column names using the columns attribute
artist_df.columns

Index(['a id', 'NAMe', 'DOB', 'Genre', 'country'], dtype='object')

In [40]:
# Getting the number of rows and columns of the DataFrame by calling the 'shape' attribute
artist_df.shape   # 5 rows 5 columns

(5, 5)

In [41]:
# info() prints a summary of the DataFrame:
# column names, non-null counts and data types.
# The object dtype generally indicates string data
artist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   a id     5 non-null      int64 
 1   NAMe     5 non-null      object
 2   DOB      5 non-null      object
 3   Genre    5 non-null      object
 4   country  5 non-null      object
dtypes: int64(1), object(4)
memory usage: 328.0+ bytes
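
Besides a dict of lists, a DataFrame can be built from other inputs mentioned above. The sketch below shows three of them with made-up variable names; `io.StringIO` is used here only so the CSV example runs without a file on disk.

```python
import io
import pandas as pd

# From a list of dicts: each dict is one row
rows = [
    {"a_id": 1001, "name": "Katy Perry"},
    {"a_id": 1002, "name": "Jacques Brel"},
]
from_rows = pd.DataFrame(rows)

# From a list of lists, with the column names supplied separately
from_lists = pd.DataFrame([[1001, "Katy Perry"], [1002, "Jacques Brel"]],
                          columns=["a_id", "name"])

# From CSV text; pd.read_csv also accepts a file path
csv_text = "a_id,name\n1001,Katy Perry\n1002,Jacques Brel\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# The first two approaches produce the same table
print(from_rows.equals(from_lists))
```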


### 2. Renaming the Columns
Let's rename the columns, as some of the column names do not follow naming standards or best practices. This can be done with the rename() function or with other techniques, as shown below.

In [42]:
# Creating a Dict object with old and new column name mapping
new_cols = {"NAMe": "name",
           "Genre": "genre"
           }
# Passing the Dict object containing the column name mapping to columns keyword arg
artist_df.rename(columns = new_cols, inplace=True)
artist_df.columns

Index(['a id', 'name', 'DOB', 'genre', 'country'], dtype='object')

In [43]:
# Changing to lower case by using list comprehension
artist_df.columns = [col.lower() for col in artist_df.columns]
artist_df.columns

Index(['a id', 'name', 'dob', 'genre', 'country'], dtype='object')

In [44]:
# Replacing space with underscore
artist_df.columns = artist_df.columns.str.replace(' ', '_')
artist_df.columns

Index(['a_id', 'name', 'dob', 'genre', 'country'], dtype='object')
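
The renaming steps above can also be combined into a single pass, since `rename()` accepts a function for its `columns` argument. The helper `clean_name` and the small `raw_df` below are hypothetical and exist only for this sketch.

```python
import pandas as pd

def clean_name(col):
    """Lower-case a column name and replace spaces with underscores."""
    return col.strip().lower().replace(" ", "_")

raw_df = pd.DataFrame({"a id": [1001], "NAMe": ["Katy Perry"], "DOB": ["1984-10-25"]})

# rename() returns a new DataFrame; raw_df keeps its original column names
clean_df = raw_df.rename(columns=clean_name)
print(list(clean_df.columns))
```

Keeping the cleaning rule in one function makes it easy to apply the same convention to every DataFrame in a project.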

### 3. Operations on Rows and Columns
Some of the common operations performed on a DataFrame's rows and columns are selection, addition and deletion.

- __Selecting Columns__

A column can be selected by passing its name; to select multiple columns, pass a list of column names.

In [45]:
# Selecting one Column
artist_df['name']

0      Katy Perry
1    Jacques Brel
2         Tove Lo
3              KK
4     Vince Jones
Name: name, dtype: object

In [46]:
# Getting the type of 'name' Field/Column, which is Series
type(artist_df['name'])

pandas.core.series.Series

In [47]:
# Selecting multiple Columns by passing a list of Column name
artist_df[['name', 'country']]

Unnamed: 0,name,country
0,Katy Perry,United States
1,Jacques Brel,Belgium
2,Tove Lo,Sweden
3,KK,India
4,Vince Jones,Australia


In [48]:
# When getting multiple columns, pandas returns that as DataFrame
type(artist_df[['name', 'country']])

pandas.core.frame.DataFrame
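
A short sketch of the distinction shown above: single brackets with one name give a Series, while a list (even with one name) gives a DataFrame. The small `df` here is illustrative, and the `KeyError` case is added as a hedge against misspelled names.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Katy Perry", "KK"],
                   "country": ["United States", "India"]})

as_series = df["name"]      # single brackets: a Series
as_frame = df[["name"]]     # list with one name: a one-column DataFrame

print(as_series.shape)  # one dimension
print(as_frame.shape)   # two dimensions

# Asking for a column that does not exist raises KeyError
try:
    df["genre"]
except KeyError as err:
    print("missing column:", err)
```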

- __Selecting rows__

Pandas provides the iloc and loc indexers to fetch specific rows from a DataFrame. Both are used with square brackets rather than being called like functions.  
&#9755; iloc[] - use this to fetch rows by integer position (0-based)  
&#9755; loc[] - use this to fetch rows by label

In [49]:
# Single row at position 0
artist_df.iloc[0]

a_id                1001
name          Katy Perry
dob           1984-10-25
genre                Pop
country    United States
Name: 0, dtype: object

In [50]:
# Multiple rows at positions 0 and 1
artist_df.iloc[[0, 1]]

Unnamed: 0,a_id,name,dob,genre,country
0,1001,Katy Perry,1984-10-25,Pop,United States
1,1002,Jacques Brel,1929-04-08,Chanson,Belgium


In [51]:
# Specific Columns from specific Rows
# Pass list of indexes for Rows
artist_df.iloc[[2,3], 0] # First argument=row index, second argument=col index

2    1003
3    1004
Name: a_id, dtype: int64

In [52]:
# Pass list of indexes for Columns as well to get multiple Columns
artist_df.iloc[[2,3], [0, 1]] # First argument=row index, second argument=col index

Unnamed: 0,a_id,name
2,1003,Tove Lo
3,1004,KK
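
Positional selection also accepts slices and negative positions, in the same way as Python lists. The following is a minimal sketch on a reduced copy of the artist data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a_id": [1001, 1002, 1003, 1004, 1005],
                   "name": ["Katy Perry", "Jacques Brel", "Tove Lo", "KK", "Vince Jones"]})

middle = df.iloc[1:3]        # positions 1 and 2; the stop position is excluded
last_row = df.iloc[-1]       # negative positions count from the end
first_col = df.iloc[:, 0]    # every row, first column only
```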


### 4. Index column in DataFrame
Let's understand the index of a Pandas DataFrame before using loc.
Every DataFrame created in pandas has an index. It provides an identifier (label) for each row in the DataFrame. It can be compared to the primary key of a database table, or more specifically to a surrogate key when it is created implicitly.
The index can be explicitly set to an existing column of the DataFrame, preferably one with distinct values.

In [53]:
# Use the index attribute to get the details of the index, which is implicitly set by the DataFrame API
artist_df.index

RangeIndex(start=0, stop=5, step=1)

In [54]:
type(artist_df.index)

pandas.core.indexes.range.RangeIndex

In [55]:
# Setting index field to different column in the DataFrame
artist_df.set_index('a_id') # Does not change artist_df itself;
                                # it returns a new DataFrame with a_id as the index

Unnamed: 0_level_0,name,dob,genre,country
a_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1001,Katy Perry,1984-10-25,Pop,United States
1002,Jacques Brel,1929-04-08,Chanson,Belgium
1003,Tove Lo,1987-10-29,Alternative/Indie,Sweden
1004,KK,1968-08-23,Rock,India
1005,Vince Jones,1954-03-24,Jazz,Australia


In [56]:
artist_df.index # It's still of type RangeIndex

RangeIndex(start=0, stop=5, step=1)

In [57]:
# Passing inplace=True sets the index on artist_df itself.
artist_df.set_index('a_id', inplace=True)

In [58]:
artist_df.index # Below output shows the original DataFrame is now updated in place.

Int64Index([1001, 1002, 1003, 1004, 1005], dtype='int64', name='a_id')

In [59]:
# reset_index() moves the index back into a regular column.
artist_df.reset_index(inplace=False)  # Returns a new DataFrame; artist_df keeps a_id as its index. Pass inplace=True to change artist_df itself.

Unnamed: 0,a_id,name,dob,genre,country
0,1001,Katy Perry,1984-10-25,Pop,United States
1,1002,Jacques Brel,1929-04-08,Chanson,Belgium
2,1003,Tove Lo,1987-10-29,Alternative/Indie,Sweden
3,1004,KK,1968-08-23,Rock,India
4,1005,Vince Jones,1954-03-24,Jazz,Australia
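
As an alternative to `inplace=True`, the result of `set_index()` can be assigned to a name. The sketch below also uses the `verify_integrity` argument to check the "distinct values" recommendation; the data deliberately contains a duplicate `a_id`, and the names `indexed`, `rejected` and `restored` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a_id": [1001, 1002, 1002],
                   "name": ["Katy Perry", "Jacques Brel", "Tove Lo"]})

# Without inplace=True, keep the result by assigning it to a name
indexed = df.set_index("a_id")
print(indexed.index.is_unique)   # pandas allows duplicate index values by default

# verify_integrity=True makes pandas reject duplicate index values
try:
    df.set_index("a_id", verify_integrity=True)
    rejected = False
except ValueError:
    rejected = True

# reset_index() turns the index back into an ordinary column
restored = indexed.reset_index()
```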


In [60]:
# Using the loc indexer
# The index values are the row labels. Since a_id is set as the index, a_id values are used as labels.
artist_df.loc[1002]

name       Jacques Brel
dob          1929-04-08
genre           Chanson
country         Belgium
Name: 1002, dtype: object

In [61]:
artist_df.loc[[1003,1002], ['name', 'dob']] # First argument=row label, second argument=col label

Unnamed: 0_level_0,name,dob
a_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1003,Tove Lo,1987-10-29
1002,Jacques Brel,1929-04-08
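
Label-based selection also supports slices and boolean conditions. A minimal sketch, assuming a small DataFrame indexed by `a_id` as above; note that a label slice includes both endpoints, unlike a positional slice.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Katy Perry", "Jacques Brel", "Tove Lo", "KK"],
     "genre": ["Pop", "Chanson", "Alternative/Indie", "Rock"]},
    index=pd.Index([1001, 1002, 1003, 1004], name="a_id"),
)

# Label slices include both endpoints, unlike position slices in iloc
by_label = df.loc[1002:1003]

# A boolean Series selects the rows where the condition is True
pop_names = df.loc[df["genre"] == "Pop", "name"]
```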


- __Adding Columns__  
A new column can be added by assigning a list of values to a new column name, or by using the insert() function to place the column at a specific position.

In [62]:
# Create the occupation list
occupation = ["singer", "singer", "singer","singer","musician"]
# Add the occupation column by assigning the occupation list.
# The length of the list must match the number of rows; otherwise pandas raises a ValueError.
artist_df['occupation'] = occupation

In [63]:
# Create the popular_album list
popular_album = ["Teenage Dream","N° 5","Lady Wood","Humsafar","Trustworthy Little Sweethearts"]
# Insert the popular_album column at column position 3 (0-based)
artist_df.insert(3, 'popular_album', popular_album)
artist_df.head(3)

Unnamed: 0_level_0,name,dob,genre,popular_album,country,occupation
a_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1001,Katy Perry,1984-10-25,Pop,Teenage Dream,United States,singer
1002,Jacques Brel,1929-04-08,Chanson,N° 5,Belgium,singer
1003,Tove Lo,1987-10-29,Alternative/Indie,Lady Wood,Sweden,singer
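
Two related points can be sketched briefly: `assign()` adds a column without modifying the original DataFrame, and a list of the wrong length is rejected with a `ValueError`. The DataFrame and variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Katy Perry", "Jacques Brel", "Tove Lo"]})

# assign() returns a new DataFrame with the extra column; df is unchanged
with_job = df.assign(occupation="singer")   # a scalar is repeated for every row

# A list whose length differs from the number of rows is rejected
try:
    df["occupation"] = ["singer", "singer"]   # 2 values for 3 rows
    length_error = False
except ValueError:
    length_error = True
```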


- __Adding Rows__

In [64]:
# Created record Dict Object to be inserted as Row
record = {
    "name":"Taylor Swift",
    "DOB":"1989-12-13",
    "genre":"Pop",
    "popular_album":"Speak Now",
    "country":"United States",
    "occupation":"singer-songwriter"
}
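# Assign the record to the desired index label.
# Note: the key "DOB" does not match the lower-case column name "dob",
# so the dob value is left missing for this row (see the output below).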
artist_df.loc[1006] = record
artist_df.tail(2)

Unnamed: 0_level_0,name,dob,genre,popular_album,country,occupation
a_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1005,Vince Jones,1954-03-24,Jazz,Trustworthy Little Sweethearts,Australia,musician
1006,Taylor Swift,,Pop,Speak Now,United States,singer-songwriter
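
Rows can also be added with `pd.concat()`, which stacks DataFrames and returns a new one. The sketch below, on a reduced two-column table, also shows what a key with the wrong case does in that approach; the row labelled 1007 and the name "X" are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Vince Jones"], "dob": ["1954-03-24"]},
                  index=pd.Index([1005], name="a_id"))

# Keys must match the column names exactly (including case)
record = {"name": "Taylor Swift", "dob": "1989-12-13"}
new_row = pd.DataFrame([record], index=pd.Index([1006], name="a_id"))

# concat() returns a new DataFrame with the rows stacked
combined = pd.concat([df, new_row])

# A key that matches no existing column ("DOB") becomes a new column instead,
# and the cells that cannot be filled are left missing
bad_row = pd.DataFrame([{"name": "X", "DOB": "2000-01-01"}], index=[1007])
mismatched = pd.concat([df, bad_row])
```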


- __Deleting Columns__  
Columns can be deleted with the del keyword, as with a dict, or with the drop() function.

In [65]:
# Deleting occupation column
del artist_df['occupation']
artist_df.head(2)

Unnamed: 0_level_0,name,dob,genre,popular_album,country
a_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1001,Katy Perry,1984-10-25,Pop,Teenage Dream,United States
1002,Jacques Brel,1929-04-08,Chanson,N° 5,Belgium


In [66]:
# Drop the popular_album column in place (inplace=True modifies artist_df itself).
# The columns keyword arg also accepts a list of column names to delete multiple columns
artist_df.drop(columns='popular_album', inplace=True)
artist_df.head(2)

Unnamed: 0_level_0,name,dob,genre,country
a_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1001,Katy Perry,1984-10-25,Pop,United States
1002,Jacques Brel,1929-04-08,Chanson,Belgium
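
A few variations on column deletion, sketched on a one-row illustrative DataFrame: dropping several columns at once, tolerating names that are not present via `errors="ignore"`, and `pop()`, which removes a column and returns it.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Katy Perry"], "genre": ["Pop"],
                   "country": ["United States"], "occupation": ["singer"]})

# Drop several columns at once; the result is a new DataFrame
trimmed = df.drop(columns=["genre", "occupation"])

# errors="ignore" skips names that are not present instead of raising KeyError
same = trimmed.drop(columns=["popular_album"], errors="ignore")

# pop() removes a column in place and returns it as a Series
country = df.pop("country")
```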


- __Deleting Rows__  
Rows can also be deleted with the drop() function.

In [67]:
# Deleting row with specific label
artist_df.drop(index = 1002, inplace=True)
artist_df

Unnamed: 0_level_0,name,dob,genre,country
a_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1001,Katy Perry,1984-10-25,Pop,United States
1003,Tove Lo,1987-10-29,Alternative/Indie,Sweden
1004,KK,1968-08-23,Rock,India
1005,Vince Jones,1954-03-24,Jazz,Australia
1006,Taylor Swift,,Pop,United States
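
Row deletion has similar variations. This sketch, using an illustrative DataFrame indexed by `a_id`, drops several labels at once, removes rows by condition through filtering, and shows that dropping a label that does not exist raises a `KeyError`.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Katy Perry", "Tove Lo", "KK", "Vince Jones"],
                   "genre": ["Pop", "Alternative/Indie", "Rock", "Jazz"]},
                  index=pd.Index([1001, 1003, 1004, 1005], name="a_id"))

# Drop several rows by label; without inplace=True the original is untouched
fewer = df.drop(index=[1001, 1005])

# Filtering with a condition keeps only the rows that match
not_rock = df[df["genre"] != "Rock"]

# Dropping a label that does not exist raises KeyError
try:
    df.drop(index=9999)
    missing_label = False
except KeyError:
    missing_label = True
```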
