# Introduction to Pandas in Python

Pandas is an open-source library built mainly for working with relational or labeled data easily and intuitively. It provides data structures and operations for manipulating numerical tables and time series, and it is built on top of the NumPy library. Pandas is fast and offers high performance and productivity for its users.

## Advantages
Fast and efficient for manipulating and analyzing data.

Data from different file objects can be loaded.

Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.

Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.

Data set merging and joining.

Flexible reshaping and pivoting of data sets.

Provides time-series functionality.

Powerful group by functionality for performing split-apply-combine operations on data sets.
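As a rough illustration of the split-apply-combine idea from the list above, the sketch below groups a small table by one column and sums another. The sales table and its column names are invented for this example.

```python
import pandas as pd

# Hypothetical sales records, invented for this illustration
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [100, 250, 150, 50, 200],
})

# Split the rows by region, apply sum() to each group,
# and combine the results into one Series indexed by region
totals = sales.groupby("region")["amount"].sum()
print(totals)
```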

## Downloading and Installing Pandas

Pandas can be installed in several ways on Windows and on Linux. The options are listed below:

Windows
Python Pandas can be installed on Windows in two ways:

Using pip
Using Anaconda

Install Pandas using pip

pip is a package management system used to install and manage software packages and libraries written in Python. The packages are stored in a large online repository called the Python Package Index (PyPI).

Pandas can be installed with pip using the following command (the leading ! runs it as a shell command from a notebook cell; omit it in a terminal):

In [1]:
!pip install pandas



## Getting Started
After pandas has been installed, you need to import the library. The module is conventionally imported under the alias pd:

In [2]:
import pandas as pd

Pandas provides two main data structures for manipulating data:

Series

DataFrame

### Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.


A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

### Creating a Series
In the real world, a Pandas Series is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A Series can also be created from a list, a dictionary, a scalar value, and so on.

Example: 

In [3]:
import pandas as pd  
import numpy as np 
  
  
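# Creating an empty series; an explicit dtype avoids the warning  
# that some pandas versions emit for a bare pd.Series()  
ser = pd.Series(dtype="float64")  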
    
print(ser)  
  
# simple array  
data = np.array(['l', 'a', 'n', 'd', 'm','a', 'r', 'k'])  
    
ser = pd.Series(data)  
print(ser) 


labels = ['a','b','c','d','e','f','g','h']
pd.Series(data=data,index=labels)

Series([], dtype: float64)
0    l
1    a
2    n
3    d
4    m
5    a
6    r
7    k
dtype: object




a    l
b    a
c    n
d    d
e    m
f    a
g    r
h    k
dtype: object

In [4]:
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
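The examples above build a Series from an array. As a small sketch of the other two sources mentioned earlier, a dictionary and a scalar value, the following may help; the fruit prices and labels are made up for illustration.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
prices = pd.Series({"apple": 30, "banana": 10, "cherry": 45})

# From a scalar: the value is repeated once per index label
flags = pd.Series(0, index=["x", "y", "z"])

print(prices["banana"])  # label-based access
print(prices.iloc[0])    # position-based access
print(flags)
```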

### DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); in other words, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.

![image.png](attachment:image.png)

### Creating a DataFrame

In the real world, a Pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from a list, a dictionary, a list of dictionaries, and so on.

Example:

In [5]:
import pandas as pd  
    
# Calling DataFrame constructor  
df = pd.DataFrame()  
print(df) 
  
# list of strings  
lst = ['Landmark', 'Group', 'Office', 'is',   
            'in', 'Bangalore']  
    
# Calling DataFrame constructor on list  
df = pd.DataFrame(lst)  
df 

Empty DataFrame
Columns: []
Index: []


           0
0   Landmark
1      Group
2     Office
3         is
4         in
5  Bangalore


Creating DataFrame from dict of ndarray/lists: 


To create a DataFrame from a dict of ndarrays or lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.


In [6]:
# Create a DataFrame from a dict of lists.
# No index is passed, so the default range(n) index is used.
 
import pandas as pd
 
# initialise the data as a dict of lists
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)
df

    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18


    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
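The index rule described above, and the list-of-dictionaries source mentioned earlier, can be sketched as follows. The row labels r1 to r4 and the variable names are assumptions chosen for this example.

```python
import pandas as pd

data = {"Name": ["Tom", "nick", "krish", "jack"], "Age": [20, 21, 19, 18]}

# Custom row labels: the index needs exactly one entry per row
df_labeled = pd.DataFrame(data, index=["r1", "r2", "r3", "r4"])
print(df_labeled)

# An index of the wrong length is rejected with a ValueError
raised = False
try:
    pd.DataFrame(data, index=["r1", "r2"])
except ValueError as err:
    raised = True
    print("ValueError:", err)

# List of dictionaries: each dict is one row; missing keys become NaN
records = [{"Name": "Tom", "Age": 20}, {"Name": "nick"}]
df_records = pd.DataFrame(records)
print(df_records)
```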


## Indexing and Selecting Data with Pandas

Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.

Since a dataframe is two-dimensional, we can perform basic operations on rows/columns like selecting, deleting, adding, and renaming.

Column Selection: In order to select a column in a Pandas DataFrame, we access it by its column name; passing a list of names selects several columns.

In [7]:
# Import pandas package
import pandas as pd
 
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
data
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
 
# select two columns
print(df[['Name', 'Qualification']])

     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd


Column Addition:
In order to add a column to a Pandas DataFrame, we can declare a new list and assign it as a column of an existing DataFrame.

In [8]:
# Declare a list that is to be converted into a column 
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna'] 
score = [40,56,23,98]
  
# Using 'Address' as the column name 
# and equating it to the list 
df['Address'] = address 
df['Score'] = score
# Observe the result 
df 

     Name  Age Qualification    Address  Score
0     Jai   27           Msc      Delhi     40
1  Princi   24            MA  Bangalore     56
2  Gaurav   22           MCA    Chennai     23
3    Anuj   32           Phd      Patna     98


Column Deletion:

In order to delete a column from a Pandas DataFrame, we can use the drop() method. Columns are deleted by passing their column names together with axis = 1.

In [9]:
# dropping passed columns 
df.drop(["Address", "Score"], axis = 1, inplace = True) 
  
# display 
print(df)

     Name  Age Qualification
0     Jai   27           Msc
1  Princi   24            MA
2  Gaurav   22           MCA
3    Anuj   32           Phd


Row Selection: 

Pandas provides dedicated indexers to retrieve rows from a DataFrame. 

DataFrame.loc[] retrieves rows by index label. 

Rows can also be selected by integer position with DataFrame.iloc[].

In [10]:
#index by name
df.set_index("Name",inplace=True)
df
# retrieving row by loc method
first = df.loc["Jai"]
second = df.loc["Anuj"]
 
 
print(first, "\n\n\n", second)

Age               27
Qualification    Msc
Name: Jai, dtype: object 


 Age               32
Qualification    Phd
Name: Anuj, dtype: object


In [11]:
# Using iloc[] function
df.iloc[2,:]

Age               22
Qualification    MCA
Name: Gaurav, dtype: object
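One difference between the two indexers shows up when slicing: a label slice with loc[] includes both endpoints, while a position slice with iloc[] excludes the stop position. The sketch below rebuilds a small frame like the one above so that it runs on its own; df_demo is a name chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Age": [27, 24, 22, 32], "Qualification": ["Msc", "MA", "MCA", "Phd"]},
    index=["Jai", "Princi", "Gaurav", "Anuj"],
)

# Label slice: both endpoints are included
by_label = df_demo.loc["Jai":"Gaurav"]

# Position slice: the stop position is excluded, as with Python lists
by_position = df_demo.iloc[0:2]

print(by_label)
print(by_position)
```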

Row Addition:

In order to add a row to a Pandas DataFrame, we can concatenate the old DataFrame with a new one.

In [12]:
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd'],
        'Address':['Delhi','Gurgaon','Haryana','Bihar'] }

# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
new_row = pd.DataFrame({'Name':'Navin', 'Age':33, 'Address':'Noida', 
                        'Qualification':'Btech'},index =[0]) 
# simply concatenate both dataframes 
df = pd.concat([new_row, df],sort=True).reset_index(drop = True) 
df

   Address  Age    Name Qualification
0    Noida   33   Navin         Btech
1    Delhi   27     Jai           Msc
2  Gurgaon   24  Princi            MA
3  Haryana   22  Gaurav           MCA
4    Bihar   32    Anuj           Phd


Row Deletion:

In order to delete a row from a Pandas DataFrame, we can use the drop() method. Rows are deleted by passing their index labels.

In [13]:
# dropping passed values 
df.drop([0], inplace = True) 
  
# display 
df

   Address  Age    Name Qualification
1    Delhi   27     Jai           Msc
2  Gurgaon   24  Princi            MA
3  Haryana   22  Gaurav           MCA
4    Bihar   32    Anuj           Phd


While analyzing real datasets, which are often very large, we might need to get the column names in order to perform certain operations.

Let’s discuss how to get column names in Pandas dataframe.

Method #1: Simply iterating over columns

In [14]:
# iterating the columns 
for col in df.columns: 
    print(col) 

Address
Age
Name
Qualification


Method #2: Using columns with dataframe object

In [15]:
# list(df) gives the same result 
list(df.columns) 


['Address', 'Age', 'Name', 'Qualification']

Method #3: Using sorted() function

Python's built-in sorted() function will return the list of columns sorted in alphabetical order.

In [16]:
# using sorted() method 
sorted(df) 

['Address', 'Age', 'Name', 'Qualification']

How to get rows/index names in Pandas dataframe

In [17]:
# iterate the indices and print each one 
for row in df.index: 
    print(row, end= " ") 


1 2 3 4 

In [18]:
print(df)
# list the row index labels 
print(list(df.index)) 

# OR
print(list(df.index.values))

   Address  Age    Name Qualification
1    Delhi   27     Jai           Msc
2  Gurgaon   24  Princi            MA
3  Haryana   22  Gaurav           MCA
4    Bihar   32    Anuj           Phd
[1, 2, 3, 4]
[1, 2, 3, 4]


Pandas Dataframe type has two attributes called ‘columns’ and ‘index’ which can be used to change the column names as well as the row indexes.

In [19]:
# first import the libraries 
import pandas as pd 
   
# Create a dataFrame using dictionary 
df=pd.DataFrame({"Name":['Tom','Nick','John','Peter'], 
                 "Age":[15,26,17,28]}) 
  
# Creates a dataFrame with 
# 2 columns and 4 rows 
df 

    Name  Age
0    Tom   15
1   Nick   26
2   John   17
3  Peter   28


Method #1: Changing the column name and row index using df.columns and df.index attribute.
In order to change the column names, we assign a Python list containing the new names: df.columns = ['First_col', 'Second_col', 'Third_col', .....].
In order to change the row indexes, we likewise assign a Python list: df.index = ['row1', 'row2', 'row3', ......].

In [20]:
# Change the column names 
df.columns =['Col_1', 'Col_2'] 
  
# Change the row indexes 
df.index = ['Row_1', 'Row_2', 'Row_3', 'Row_4'] 
  
# printing the data frame 
df

       Col_1  Col_2
Row_1    Tom     15
Row_2   Nick     26
Row_3   John     17
Row_4  Peter     28


Method #2: Using rename() function with dictionary to change a single column
Let's change the first column name from "Col_1" to "Mod_col" using the rename() function.


In [21]:
df = df.rename(columns = {"Col_1":"Mod_col"}) 
  
df 

      Mod_col  Col_2
Row_1     Tom     15
Row_2    Nick     26
Row_3    John     17
Row_4   Peter     28


In [22]:
# We can change multiple column names by  
# passing a dictionary of old names and  
# new names, to the rename() function. 
df = df.rename({"Mod_col":"Col_1","Col_2":"B"}, axis='columns') 
  
df 

       Col_1   B
Row_1    Tom  15
Row_2   Nick  26
Row_3   John  17
Row_4  Peter  28


Method #3: Using Lambda Function to rename the columns.
A lambda function is a small anonymous function which can take any number of arguments but can only have one expression. Using a lambda function we can modify all of the column names at once. Let’s add ‘x’ at the end of each column name using a lambda function.

In [23]:
df = df.rename(columns=lambda x: x+'x') 
  
# this will modify all the column names 
df 

      Col_1x  Bx
Row_1    Tom  15
Row_2   Nick  26
Row_3   John  17
Row_4  Peter  28


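Method #4 : Using values attribute to rename the columns.
We can assign through the values attribute of df.columns at the position of the column whose name we want to change. Note that this writes into the array behind an Index that is meant to be immutable, so it may not behave reliably across pandas versions; rename() or assigning a new list to df.columns are the safer options.

In [None]:
df.columns.values[0] = 'Name'
df.columns.values[1] = 'Student_Age'
# this will modify the names of the first two columns 
df 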

In [None]:
# Let’s change the row index using the Lambda function.
# To change the row indexes 
df = pd.DataFrame({"A":['Tom','Nick','John','Peter'], 
                   "B":[25,16,27,18]}) 
  
# this will increase the row index value by 10 for each row 
df = df.rename(index = lambda x: x + 10) 
  
df 

In [None]:
df = df.rename(index = lambda x: x + 5, 
               columns = lambda x: x +'x') 
   
# increase all the row index label by value 5 
# append a value 'x' at the end of each column name.  
df 

Get unique values from a column in Pandas DataFrame

In [None]:
# create a dictionary with five fields each 
data = { 
    'A':['A1', 'A2', 'A3', 'A4', 'A5'],  
    'B':['B1', 'B2', 'B3', 'B4', 'B4'],  
    'C':['C1', 'C2', 'C3', 'C3', 'C3'],  
    'D':['D1', 'D2', 'D2', 'D2', 'D2'],  
    'E':['E1', 'E1', 'E1', 'E1', 'E1'] } 
  
# Convert the dictionary into DataFrame  
dframe = pd.DataFrame(data) 
  
# Get the unique values of 'B' column 
dframe.B.unique() 

In [None]:
# Get the unique values of 'E' column 
dframe.E.unique() 

In [None]:
# Count the distinct values of 'E' column 
dframe.E.nunique()
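Alongside unique() and nunique(), value_counts() reports how often each value occurs. The short sketch below uses the same values as column 'B' above, rebuilt as a standalone Series so it can run on its own.

```python
import pandas as pd

col = pd.Series(["B1", "B2", "B3", "B4", "B4"], name="B")

print(col.unique())        # distinct values, in order of first appearance
print(col.nunique())       # number of distinct values
print(col.value_counts())  # occurrences of each value
```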

## Reading a file in Pandas

In [None]:
# importing pandas module  
import pandas as pd  
    
# making data frame  
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")  
  
df.head(15) 

In [None]:
# five smallest values in column Salary 
df.nsmallest(5, ['Salary']) 

In [None]:
# five largest values in column age 
df.nlargest(5, ['Age'])

In [None]:
# Ten largest values in column Weight 
df.nlargest(10, ['Weight']) 

Apply uppercase to a column in Pandas dataframe

In [None]:
df['Name'] = df['Name'].str.upper() 
  
df.head()

In [None]:
# removing null values to avoid errors   
df.dropna(inplace = True)   
  
# Applying upper() method on 'College' column 
df['College'].apply(lambda x: x.upper()).head()
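The dropna() call above guards the lambda against missing values. As a sketch of the alternative, the .str accessor leaves missing values as NaN instead of raising, so rows do not have to be dropped first. The college names here are placeholders, not taken from the nba.csv file.

```python
import numpy as np
import pandas as pd

colleges = pd.Series(["Texas", np.nan, "Boston University"])

# .str methods skip missing values and leave them as NaN
upper = colleges.str.upper()
print(upper)

# A plain lambda would fail on NaN (a float has no .upper()),
# so this version only transforms actual strings
upper_apply = colleges.apply(lambda x: x.upper() if isinstance(x, str) else x)
print(upper_apply)
```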

Selecting some rows and some columns:

In order to select two rows and three columns, we pass the two row labels and the three column names to loc[] as separate lists, like this:

In [None]:
# making data frame from csv file. This is another way of reading a CSV file  
data = pd.read_csv("nba.csv", index_col ="Name") 

# retrieving two rows and three columns by loc method 
first = data.loc[["Avery Bradley", "R.J. Hunter"], 
                   ["Team", "Number", "Position"]] 

print(first)

Selecting all of the rows and some columns:

In order to select all of the rows and some columns, we use a single colon [:] to select all rows and a list of the columns we want, like this:

In [None]:
# retrieving all rows and some columns by loc method 
first = data.loc[:, ["Team", "Number", "Position"]] 
first  

In [None]:
# retrieving rows by iloc method  
row2 = data.iloc[3]  
row2  

In [None]:
# retrieving two rows and two columns by iloc method  
row2 = data.iloc [[3, 4], [1, 2]] 
 
print(row2) 

In [None]:
# retrieving all rows and some columns by iloc method  
row2 = data.iloc [:, [1, 2]] 
row2

## View basic statistical details

Pandas describe() is used to view some basic statistical details like percentiles, mean, std etc. of a DataFrame or a Series of numeric values. When this method is applied to a Series of strings, it returns a different output, which is shown in the examples below.

In [None]:
# percentile list 
perc =[.20, .40, .60, .80] 
  
# list of dtypes to include 
include =['object', 'float', 'int'] 
  
# calling describe method 
desc = df.describe(percentiles = perc, include = include) 
desc

In [None]:
df['College'].describe()
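To make the difference between the two kinds of summary concrete without needing the CSV file, here is a minimal sketch on two hand-made Series. The ages and team names are invented for the example.

```python
import pandas as pd

ages = pd.Series([27, 24, 22, 32])
teams = pd.Series(["Celtics", "Nets", "Celtics", "Knicks"])

# Numeric data: count, mean, std, min, quartiles, max
num_summary = ages.describe()

# String data: count, unique, top (most frequent value), freq
text_summary = teams.describe()

print(num_summary)
print(text_summary)
```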

Pandas cut() function is used to separate the array elements into different bins. The cut function is mainly used to perform statistical analysis on scalar data.

In [None]:
import numpy as np 
   
df = pd.DataFrame({'number': np.random.randint(1, 100, 20)}) 
df['bins'] = pd.cut(x=df['number'], bins=[1, 20, 40, 60,  
                                          80, 100]) 
print(df) 
  

# We can check the frequency of each bin 
print(df['bins'].value_counts())
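By default the intervals produced by cut() are open on the left and closed on the right, so a value equal to the lowest edge falls outside every bin and becomes NaN. The sketch below uses fixed numbers instead of random ones to show this, together with include_lowest and custom labels. The label strings are an illustrative choice.

```python
import pandas as pd

scores = pd.Series([1, 15, 20, 21, 59, 100])

# include_lowest=True closes the first interval on the left,
# so the value 1 lands in the first bin instead of becoming NaN
binned = pd.cut(
    scores,
    bins=[1, 20, 40, 60, 80, 100],
    labels=["1-20", "21-40", "41-60", "61-80", "81-100"],
    include_lowest=True,
)

# Frequency of each bin, listed in bin order
counts = binned.value_counts(sort=False)
print(counts)
```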

In [None]:
# read by default 1st sheet of an excel file 
dataframe1 = pd.read_excel('SampleWork.xlsx') 
  
print(dataframe1) 

Code #2 : Reading Specific Sheets using 'sheet_name' of read_excel() method.

In [None]:

# read the 1st sheet of an excel file (sheet_name is a zero-based position) 
dataframe2 = pd.read_excel('SampleWork.xlsx', sheet_name = 0) 
  
print(dataframe2) 

In [None]:

# read the sheet named 'Data' of an excel file 
dataframe2 = pd.read_excel('SampleWork.xlsx', sheet_name = 'Data') 
  
print(dataframe2) 

In [None]:
require_cols = [0, 3] 
  
# only read specific columns from an excel file 
required_df = pd.read_excel('SampleWork.xlsx', usecols = require_cols) 
  
print(required_df) 

Skip starting rows when Reading an Excel File using 'skiprows' parameter of read_excel() method.

In [None]:
# read 2nd sheet of an excel file after 
# skipping starting two rows  
df = pd.read_excel('SampleWork.xlsx', sheet_name = 1, skiprows = 2) 
  
print(df)

Set the header to any row and start reading from that row, using 'header' parameter of the read_excel() method.

In [None]:
# header = 0 uses the first row as the header (the default); 
# header = 2 would use the 3rd row instead 
df = pd.read_excel('SampleWork.xlsx', sheet_name = 1, header = 0) 
  
print(df) 

Reading Multiple Excel Sheets using 'sheet_name' parameter of the read_excel()method.

In [None]:
# read both 1st and 2nd sheet; the result is a dict of 
# DataFrames keyed by sheet position 
sheets = pd.read_excel('SampleWork.xlsx',  sheet_name =[0, 1]) 
  
print(sheets)

## Saving a Pandas Dataframe as a CSV

In [None]:
df.to_csv('file1.csv')
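By default to_csv() also writes the row index as the first column. The following sketch writes to an in-memory buffer rather than a file, so nothing touches the disk, and passes index=False to leave the index out. The frame and the buffer name are assumptions for the example.

```python
import io

import pandas as pd

df_out = pd.DataFrame({"Name": ["Tom", "Nick"], "Age": [15, 26]})

# Write to an in-memory text buffer instead of a file, omitting the row index
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read the text back to check the round trip
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)
```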

## Iterating over rows and columns in Pandas DataFrame

In a Pandas DataFrame we can iterate over elements in two ways:

Iterating over rows

Iterating over columns

In [None]:
   
# dictionary of lists (named data to avoid shadowing the built-in dict) 
data = {'name':["aparna", "pankaj", "sudhir", "Geeku"], 
        'degree': ["MBA", "BCA", "M.Tech", "MBA"], 
        'score':[90, 40, 80, 98]} 
  
# creating a dataframe from a dictionary  
df = pd.DataFrame(data) 
  
# iterating over rows using iterrows() function;  
# each step yields the index label and the row as a Series  
for i, j in df.iterrows(): 
    print(i, j) 
    

Iteration over rows using itertuples()

In [None]:
# using a itertuples()  
for i in df.itertuples(): 
    print(i) 


Now we iterate through columns. In order to do so, we first create a list of the DataFrame's columns and then iterate through that list.

In [None]:
# creating a list of dataframe columns 
columns = list(df) 
  
for i in columns: 
  
    # printing the third element of the column 
    print (df[i][2]) 
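As a complement to the loops above, the sketch below shows items(), which yields each column name together with its values, and a column-wise operation that avoids an explicit row loop altogether. The frame and the bonus_score column are invented for the illustration.

```python
import pandas as pd

df_iter = pd.DataFrame({"name": ["aparna", "pankaj", "sudhir"],
                        "score": [90, 40, 80]})

# items() yields (column name, column as a Series) pairs
for col_name, col_values in df_iter.items():
    print(col_name, list(col_values))

# Row loops are often avoidable: column arithmetic applies to every row at once
df_iter["bonus_score"] = df_iter["score"] + 5
print(df_iter)
```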

## Sorting using Pandas 

In [None]:
# making data frame from csv file 
data = pd.read_csv("nba.csv") 
  
# sorting data frame by name 
data.sort_values("Name", axis = 0, ascending = True, 
                 inplace = True, na_position ='last') 
  
# display 
data.head()

Example #2: Changing position of Null values

In the given data, there are many null values in different columns, which are placed last by default. In this example, the DataFrame is sorted by the Salary column and null values are kept at the top.

In [None]:

# making data frame from csv file 
data = pd.read_csv("nba.csv") 
  
# sorting data frame by Salary, with null values placed first 
data.sort_values("Salary", axis = 0, ascending = True, 
                 inplace = True, na_position ='first') 
  
# display 
data.head(30) 

In [None]:
# Example #1: Sorting by Name and Team
# In the following example, A data frame is made from the csv file and the data frame is sorted in ascending order of 
# Team and in every Team the Name is also sorted in Ascending order.
#making data frame from csv file 
data=pd.read_csv("nba.csv") 
  
#sorting data frame by Team and then By names 
data.sort_values(["Team", "Name"], axis=0, 
                 ascending=True, inplace=True) 
  
#display 
data 

Example #2: Passing list to Ascending Parameter

As shown in the above example, a DataFrame can be sorted with respect to multiple columns by passing a list to the ‘by’ parameter. We can also pass a list to the ‘ascending’ parameter to tell pandas how to sort each column.
The position of each Boolean in the ‘ascending’ list should match the position of the corresponding column name in the ‘by’ list.



In [None]:

#making data frame from csv file 
data=pd.read_csv("nba.csv") 
  
#sorting data frame by Team (ascending) and then by Name (descending) 
data.sort_values(["Team", "Name"], axis=0, 
                 ascending=[True,False], inplace=True) 
  
#display 
data 
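Because the examples above depend on the nba.csv file, here is a self-contained sketch of the same sorting options on a tiny hand-made table. The players, teams and salaries are invented, and only the column names mirror the CSV.

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "Team": ["Nets", "Celtics", "Nets", "Celtics"],
    "Name": ["Cole", "Avery", "Bea", "Dan"],
    "Salary": [5.0, np.nan, 7.5, 2.0],
})

# Team ascending and, within each team, Name descending
by_team = players.sort_values(["Team", "Name"], ascending=[True, False])

# Missing salaries first; without inplace=True a new frame is returned
by_salary = players.sort_values("Salary", na_position="first")

print(by_team)
print(by_salary)
```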

## Working With Text Data

In [None]:
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
   
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
   
# converting and overwriting values in column 
df["Name"]= df["Name"].str.lower()
 
df

Splitting and Replacing a Data

In order to split data, we use str.split(). Python's built-in split() returns a list of strings after breaking a string at the specified separator, but it can only be applied to an individual string. The Pandas str.split() method can be applied to a whole Series. The .str accessor has to be prefixed every time before calling this method to differentiate it from Python's built-in function; otherwise, it will throw an error.

In order to replace data, we use str.replace(). It works like Python's .replace() method, but on a whole Series. The .str prefix is needed here as well: Series.replace() without it also exists, but it replaces whole values rather than substrings.

In [None]:
# Define a dictionary containing employee data 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 
        'Age':[27, 24, 22, 32], 
        'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Knnuaj'], 
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']} 
 
# Convert the dictionary into DataFrame  
df = pd.DataFrame(data) 
    
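# dropping rows with null values to avoid errors 
df.dropna(inplace = True) 
    
# split each address at the first "a"; expand = True returns two columns, 
# so they are assigned to two new columns (the names are illustrative) 
df[["Address_head", "Address_tail"]] = df["Address"].str.split("a", n = 1, expand = True) 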
   
# df display 
print(df)
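The text above also describes str.replace(), which the example does not exercise. A minimal sketch, reusing the address values from the example and assuming the misspelled 'Knnuaj' is meant to read 'Kannauj' (the spelling used in the earlier dictionary):

```python
import pandas as pd

addresses = pd.Series(["Nagpur", "Kanpur", "Allahabad", "Knnuaj"])

# Literal substring replacement on every element; regex=False keeps
# the pattern from being interpreted as a regular expression
fixed = addresses.str.replace("Knnuaj", "Kannauj", regex=False)

# Without expand=True, str.split returns one list per element
parts = addresses.str.split("a", n=1)

print(fixed)
print(parts)
```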

This Pandas exercise will help learners get a better understanding of data analysis problems. The practice page consists of a large set of Pandas programs covering DataFrame/Series, handling rows/columns, grouping, and all sorts of frequently encountered problems.

https://www.geeksforgeeks.org/pandas-practice-excercises-questions-and-solutions/

# Great job!