What is pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. 
In particular, it offers data structures and operations for manipulating numerical tables and time series. 
It is free software released under the three-clause BSD license.

The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

**Pandas supports two main data structures:**

    a) DataFrame
    b) Series

**DataFrame**
A pandas DataFrame (as displayed in a Jupyter Notebook) looks like an ordinary table of rows and columns. Beneath the surface it has three components: the index, the columns, and the data (also known as the values).


To understand the DataFrame better, let's load the student_marks_demo data into a DataFrame and explore the methods and attributes available on it.

In [1]:
pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





### Import the Pandas package with pd as alias 

In [2]:
import pandas as pd

#### Use the read_csv function to read in the student_marks_demo dataset

read_csv is the pandas function that reads a CSV file and returns its contents as a DataFrame, ready for further operations.

In [3]:
# read a CSV file into Python and return a DataFrame as output
demo_DF = pd.read_csv('student_marks_demo.csv')
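
If the student_marks_demo.csv file is not at hand, the same step can be tried on an in-memory copy. The sketch below assumes the file has the four columns that appear in this notebook's output and uses only its first five rows; `csv_text` and `sample_df` are illustrative names, not part of the original dataset.

```python
import io

import pandas as pd

# Stand-in for student_marks_demo.csv: the first five rows shown in this notebook.
csv_text = """Student_ID,Name,Subject,Marks
101,Aarav,Math,78
102,Diya,Science,85
103,Rohan,Math,62
104,Sneha,English,90
105,Karan,Science,55
"""

# read_csv accepts a file path or any file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.shape)  # (number of rows, number of columns)
```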


#### Let's check the data type returned by read_csv() and confirm that the output is indeed a DataFrame

In [4]:
# Check the Data type of the output returned by pd.read_csv()

print(type(demo_DF))

<class 'pandas.DataFrame'>


#### Now that we have read the CSV file into a pandas DataFrame:
       a) We can check the first 5 observations using head()
       b) We can check the last 5 observations using tail()

In [5]:
# Show the content of first 5 observations in a Dataframe

demo_DF.head()

   Student_ID   Name  Subject  Marks
0         101  Aarav     Math     78
1         102   Diya  Science     85
2         103  Rohan     Math     62
3         104  Sneha  English     90
4         105  Karan  Science     55


In [6]:
# Show the content of last 5 observations in a Dataframe

demo_DF.tail()

    Student_ID    Name  Subject  Marks
10         111   Arjun  Science     91
11         112   Kavya     Math     76
12         113  Manish  English     40
13         114    Isha  Science     59
14         115  Suresh     Math     82


You will observe the following in the DataFrame content displayed above:

    a) Observation : Each row of data is called an observation
    b) Index : Provides a label for each row of the DataFrame
    c) Columns : Each column, also called an attribute or feature, holds values of a single data type
    d) Column Name : Each column is uniquely identified by its column name


Pandas uses NaN (not a number) to represent missing values. 
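
The demo data has no missing entries, so here is a minimal sketch of what NaN looks like in practice. The Series below is hypothetical (it is not part of student_marks_demo.csv); `marks_with_gap` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing entry (not taken from the demo dataset).
marks_with_gap = pd.Series([78, np.nan, 62], name="Marks")

print(marks_with_gap)
print(marks_with_gap.isna())  # True only where the value is missing
print(marks_with_gap.dtype)   # float64, because NaN is a floating-point value
```

Note that the integers were stored as floats: with the default NumPy-backed dtypes, an integer Series that contains NaN is held as float64.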

### Did you know?

You can control how many observations head() and tail() display by passing the number of rows as an argument.

In [7]:
# Show the content of first 3 observations in a Dataframe

demo_DF.head(3)

   Student_ID   Name  Subject  Marks
0         101  Aarav     Math     78
1         102   Diya  Science     85
2         103  Rohan     Math     62


### Accessing the 3 individual components of a DataFrame

    the index, 
    columns, 
    and data
    

Each of these components is itself a Python object with its own unique attributes and methods.

In [8]:
# Fetch all the Index values from the Dataframe.

demo_index = demo_DF.index

print("The values in the index column is : " , demo_index)

print('\n')

print("The Data Type of the Index is     : " , type(demo_index))

The values in the index column is :  RangeIndex(start=0, stop=15, step=1)


The Data Type of the Index is     :  <class 'pandas.RangeIndex'>


In [9]:
# Fetch all the column names from the DataFrame.

demo_Col_Names = demo_DF.columns

print("The Column Names are  : " , demo_Col_Names)

print('\n')

print("The Data Type of the Column Names is     : " , type(demo_Col_Names))

The Column Names are  :  Index(['Student_ID', 'Name', 'Subject', 'Marks'], dtype='str')


The Data Type of the Column Names is     :  <class 'pandas.Index'>


In [10]:
# Fetch all the Values from the Dataframe.

demo_Values = demo_DF.values

print("The Values in the Dataframe are  : " , demo_Values)

print('\n')

print("The Data Type of the Values is     : " , type(demo_Values))

The Values in the Dataframe are  :  [[101 'Aarav' 'Math' 78]
 [102 'Diya' 'Science' 85]
 [103 'Rohan' 'Math' 62]
 [104 'Sneha' 'English' 90]
 [105 'Karan' 'Science' 55]
 [106 'Pooja' 'Math' 48]
 [107 'Rahul' 'English' 72]
 [108 'Ananya' 'Science' 88]
 [109 'Vikram' 'Math' 35]
 [110 'Neha' 'English' 67]
 [111 'Arjun' 'Science' 91]
 [112 'Kavya' 'Math' 76]
 [113 'Manish' 'English' 40]
 [114 'Isha' 'Science' 59]
 [115 'Suresh' 'Math' 82]]


The Data Type of the Values is     :  <class 'numpy.ndarray'>


**How to check the structure of a DataFrame**

The structure of a DataFrame can be inspected using the info() method of the DataFrame object.

In [11]:
# Display the structure of a Dataframe

demo_DF.info()

<class 'pandas.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Student_ID  15 non-null     int64
 1   Name        15 non-null     str  
 2   Subject     15 non-null     str  
 3   Marks       15 non-null     int64
dtypes: int64(2), str(2)
memory usage: 612.0 bytes


In [12]:
# Just display the Data types of the columns of a Dataframe

demo_DF.dtypes

Student_ID    int64
Name            str
Subject         str
Marks         int64
dtype: object

**We can count how many columns have each data type by calling value_counts() on the dtypes Series**

In [13]:
# Display the number of columns of each data type in a DataFrame
demo_DF.dtypes.value_counts()

int64    2
str      2
Name: count, dtype: int64

**Series**

Time to explore the Series object in pandas.

A Series is a single column of data from a DataFrame. 
It is a single dimension of data, composed of just an index and the data.
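
To make the "index plus data" idea concrete, a Series can also be built directly rather than extracted from a DataFrame. This is a small sketch using the first three marks from the demo data, with the student IDs as index labels; the variable name `marks` is only illustrative.

```python
import pandas as pd

# Build a Series directly: the data plus an explicit index of labels.
marks = pd.Series([78, 85, 62], index=[101, 102, 103], name="Marks")

print(marks.index)     # the labels
print(marks.values)    # the underlying NumPy array
print(marks.loc[102])  # look up a value by its label
```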

In [14]:
#Extract the values in the column 'Student_ID' from the demo DataFrame


demo_DF['Student_ID']

0     101
1     102
2     103
3     104
4     105
5     106
6     107
7     108
8     109
9     110
10    111
11    112
12    113
13    114
14    115
Name: Student_ID, dtype: int64

In [15]:
#Extract the same column using attribute (dot) access instead of square brackets

demo_DF.Student_ID

0     101
1     102
2     103
3     104
4     105
5     106
6     107
7     108
8     109
9     110
10    111
11    112
12    113
13    114
14    115
Name: Student_ID, dtype: int64

**Now, let's check the data type of the values extracted from a single column of a DataFrame**

In [16]:
 #Display the Data type of a Column extracted from a DataFrame

print(type(demo_DF['Student_ID']))

<class 'pandas.Series'>


**Explore the methods available with Series**

Let's select two Series from the demo DataFrame.
Both the Student_ID column and the Marks column contain integers (dtype int64): Student_ID is an identifier, while Marks is a numeric measurement.

In [17]:
Student_ID = demo_DF['Student_ID']
Marks = demo_DF['Marks']

print(type(Student_ID))
print(type(Marks))

<class 'pandas.Series'>
<class 'pandas.Series'>


In [18]:
# Preview the first rows of the two Series created above.
print(Student_ID.head())

print('\n')  # Print an Empty Line 

print(Marks.head(3))

0    101
1    102
2    103
3    104
4    105
Name: Student_ID, dtype: int64


0    78
1    85
2    62
Name: Marks, dtype: int64


**value_counts() method for a Series**

This method returns how many times each unique value occurs in a Series. Every Student_ID is unique, so each count below is 1.

In [19]:
# Display the count of each unique value in a Series

print(Student_ID.value_counts())

Student_ID
101    1
102    1
103    1
104    1
105    1
106    1
107    1
108    1
109    1
110    1
111    1
112    1
113    1
114    1
115    1
Name: count, dtype: int64
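
value_counts() is more informative on a column with repeated values. As a sketch, the Series below reproduces the 15 Subject values listed in the DataFrame values earlier in this notebook; `subject` and `counts` are illustrative names.

```python
import pandas as pd

# The Subject column of the demo data, in row order.
subject = pd.Series(
    ["Math", "Science", "Math", "English", "Science",
     "Math", "English", "Science", "Math", "English",
     "Science", "Math", "English", "Science", "Math"],
    name="Subject",
)

counts = subject.value_counts()  # occurrences of each distinct value, most frequent first
print(counts)
print(subject.nunique())         # number of distinct values
```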


**Using the describe() method to get a statistical summary of a numeric Series**


In [48]:
# Display the Statistical Summary of a Series of Numerical Data Type
print(Marks.describe())

count    15.000000
mean     68.533333
std      18.192882
min      35.000000
25%      57.000000
50%      72.000000
75%      83.500000
max      91.000000
Name: Marks, dtype: float64


In [46]:
# Display the Statistical Summary of a Series of Numerical Data Type

print(Student_ID.describe())

count     15.000000
mean     108.000000
std        4.472136
min      101.000000
25%      104.500000
50%      108.000000
75%      111.500000
max      115.000000
Name: Student_ID, dtype: float64


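**describe() returns a different summary for string / categorical data**

For a non-numeric Series, describe() reports count, unique, top, and freq instead of the statistics above. Student_ID is stored as int64, so the cell below still returns the numeric summary.

In [55]:
# Display the summary of Student_ID again; it is numeric, so the output is unchanged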

print(Student_ID.describe())

count     15.000000
mean     108.000000
std        4.472136
min      101.000000
25%      104.500000
50%      108.000000
75%      111.500000
max      115.000000
Name: Student_ID, dtype: float64
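
To see the non-numeric form of the summary, describe() needs a string column. A minimal sketch, assuming the Subject values listed earlier in this notebook (the names `subject` and `summary` are illustrative):

```python
import pandas as pd

# The Subject column of the demo data, in row order.
subject = pd.Series(
    ["Math", "Science", "Math", "English", "Science",
     "Math", "English", "Science", "Math", "English",
     "Science", "Math", "English", "Science", "Math"],
    name="Subject",
)

# For string data, describe() returns count, unique, top and freq.
summary = subject.describe()
print(summary)
```

Here `top` is the most frequent value and `freq` is how often it occurs.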


**Checking whether any value in a Series is missing (NaN)**

We can check whether each value in a Series is missing by using the isnull() method.

In [22]:
# Check if any value in a Series is Missing 

Student_ID.isnull()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
Name: Student_ID, dtype: bool

Once we have identified whether a Series has missing values, we can address them in 2 ways:

    a) Replace the missing values with a constant or with the output of an expression
    b) Drop the elements with missing values from the Series

In [23]:
# Replace the Missing Values with 0 

Student_ID_filled = Student_ID.fillna(0)

print("Count of Non Missing Values before applying fillna() :" ,Student_ID.count())
print("Count of Non Missing Values after applying fillna()  :" ,Student_ID_filled.count())

Count of Non Missing Values before applying fillna() : 15
Count of Non Missing Values after applying fillna()  : 15


In [24]:
# Drop the elements with missing values

Student_ID_NA_Dropped = Student_ID.dropna()

print("Count of elements before applying dropna() :" ,Student_ID.size)
print("Count of elements after applying dropna()  :" ,Student_ID_NA_Dropped.size)

Count of elements before applying dropna() : 15
Count of elements after applying dropna()  : 15
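
Because Student_ID has no missing values, the counts above do not change. The sketch below uses a hypothetical Series with two gaps (not taken from the demo dataset) so that the effect of both options is visible, including filling with the output of an expression (here, the mean).

```python
import numpy as np
import pandas as pd

# Hypothetical marks with two missing entries.
marks_with_gaps = pd.Series([78, np.nan, 62, np.nan, 55], name="Marks")

filled_zero = marks_with_gaps.fillna(0)                       # option a) constant
filled_mean = marks_with_gaps.fillna(marks_with_gaps.mean())  # option a) expression
dropped = marks_with_gaps.dropna()                            # option b) drop

print(marks_with_gaps.count(), filled_zero.count())  # non-missing counts before and after
print(marks_with_gaps.size, dropped.size)            # element counts before and after
print(filled_mean.tolist())
```

Both fillna() and dropna() return a new Series and leave the original unchanged.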


**Applying operations on a Series**

Operations applied to a Series are vectorized, meaning that an operation applied to a Series is applied to each element of the Series.

In [25]:
# Applying the multiplication operator to a Series containing integers
# Resulting in each value being multiplied by 2

Student_ID * 2

0     202
1     204
2     206
3     208
4     210
5     212
6     214
7     216
8     218
9     220
10    222
11    224
12    226
13    228
14    230
Name: Student_ID, dtype: int64

In [26]:
# Applying Division operator with the Series containing Numeric Data 
# Resulting in each value being divided by 100

Marks / 100

0     0.78
1     0.85
2     0.62
3     0.90
4     0.55
5     0.48
6     0.72
7     0.88
8     0.35
9     0.67
10    0.91
11    0.76
12    0.40
13    0.59
14    0.82
Name: Marks, dtype: float64
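
Vectorization is not limited to multiplication and division. As a small sketch on the first five marks from the demo data, addition and comparison also act element by element; the pass mark of 60 is an assumption made only for this illustration.

```python
import pandas as pd

marks = pd.Series([78, 85, 62, 90, 55], name="Marks")

bonus = marks + 5      # adds 5 to every element
passed = marks >= 60   # element-wise comparison gives a boolean Series

print(bonus.tolist())
print(passed.tolist())
```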

### **Chaining Series methods together**

In Python, every variable is an object, and all objects have attributes and methods that refer to or return more objects. The sequential invocation of methods using the dot notation is referred to as method chaining. Pandas is a library that lends itself well to method chaining, as many Series and DataFrame methods return more Series and DataFrames, upon which more methods can be called. 

In [27]:
# Example of Method Chaining 

Student_ID.value_counts().head(3)

Student_ID
101    1
102    1
103    1
Name: count, dtype: int64

A common way to count the number of missing values is to chain the sum method after isnull. isnull() returns a boolean Series, and sum() counts each True as 1.

In [28]:
# Find the total number of Missing values in a SERIES 

Marks.isnull().sum()

np.int64(0)
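
The result above is 0 because Marks has no missing values. A sketch of the same chain on a hypothetical Series with gaps (not from the demo dataset); chaining mean() instead of sum() gives the share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with two missing entries.
marks_with_gaps = pd.Series([78, np.nan, 62, np.nan, 55], name="Marks")

n_missing = marks_with_gaps.isnull().sum()       # True counts as 1
share_missing = marks_with_gaps.isnull().mean()  # fraction of True values

print(n_missing)
print(share_missing)
```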

## Index of a DataFrame 

The index of a DataFrame provides a label for each of the rows. 
If no index is explicitly provided upon DataFrame creation, then by default, a RangeIndex is created with labels as integers from 0 to n-1, where n is the number of rows.

There are 2 ways to use a column as an explicit index label:
          
      a) Defining one of the columns as the index during the read_csv() call
      b) Setting the index to a column after read_csv()

In [29]:
# Specify the Index Column during the read_csv() step 

demo_DF2 = pd.read_csv('student_marks_demo.csv', index_col='Student_ID')

print(demo_DF2.head())

             Name  Subject  Marks
Student_ID                       
101         Aarav     Math     78
102          Diya  Science     85
103         Rohan     Math     62
104         Sneha  English     90
105         Karan  Science     55


In [56]:
# Change the Index Column after the read_csv() step
demo_DF3 = demo_DF.set_index('Student_ID')

print(demo_DF3.head())

             Name  Subject  Marks
Student_ID                       
101         Aarav     Math     78
102          Diya  Science     85
103         Rohan     Math     62
104         Sneha  English     90
105         Karan  Science     55


Conversely, the reset_index method turns the index back into a column. Here it makes Student_ID a column again and reverts the index to a RangeIndex. reset_index always places the restored column first in the DataFrame, so the columns may not be in their original order:

In [31]:
demo_DF4 = demo_DF3.reset_index()

demo_DF4.head()

   Student_ID   Name  Subject  Marks
0         101  Aarav     Math     78
1         102   Diya  Science     85
2         103  Rohan     Math     62
3         104  Sneha  English     90
4         105  Karan  Science     55
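
In the example above Student_ID was already the first column, so the order looks unchanged. The sketch below indexes by Name instead to show the reordering; the three-row frame is an illustrative subset of the demo data, and `by_name` and `restored` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_ID": [101, 102, 103],
    "Name": ["Aarav", "Diya", "Rohan"],
    "Marks": [78, 85, 62],
})

by_name = df.set_index("Name")    # Name becomes the index
restored = by_name.reset_index()  # Name returns as the first column

print(list(df.columns))
print(list(restored.columns))
```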


### **Adding or Dropping Columns from a DataFrame**

**Adding new column to a DataFrame**

In [32]:
# Count & Display the number of Columns in a Dataframe
demo_DF.columns.size

4

In [33]:
# Add a New column to a DataFrame

demo_DF['Tot_Marks'] = demo_DF.Marks + demo_DF.Marks + demo_DF.Marks

In [34]:
# Count & Display the number of Columns in a Dataframe after adding a new column

demo_DF.columns.size

5
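
The heading also mentions dropping columns. A minimal sketch using the drop() method with its columns parameter, on an illustrative three-row subset of the demo data (`trimmed` is a hypothetical name):

```python
import pandas as pd

df = pd.DataFrame({
    "Student_ID": [101, 102, 103],
    "Marks": [78, 85, 62],
})

# Add a column, as in the cell above (Tot_Marks is three times Marks).
df["Tot_Marks"] = df["Marks"] + df["Marks"] + df["Marks"]

# drop() returns a new DataFrame; df itself keeps the column.
trimmed = df.drop(columns=["Tot_Marks"])

print(df.columns.size)
print(trimmed.columns.size)
```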

**DataFrame Operations**

In the earlier section of this script, we extracted a single column from a DataFrame, which resulted in a Series. Now let's extract multiple columns from the DataFrame; passing a list of column names returns a DataFrame.




In [35]:
demo_values = demo_DF[['Student_ID', 'Marks']]

demo_values.head()


   Student_ID  Marks
0         101     78
1         102     85
2         103     62
3         104     90
4         105     55


Use the select_dtypes method to select columns by data type. First, check how many columns of each type the DataFrame now has:

In [36]:
demo_DF.dtypes.value_counts()

int64    3
str      2
Name: count, dtype: int64

In [37]:
# Extract only those columns which are of Integer Data Type

demo_DF.select_dtypes(include=['int']).head()

   Student_ID  Marks  Tot_Marks
0         101     78        234
1         102     85        255
2         103     62        186
3         104     90        270
4         105     55        165


In [38]:
# Extract only those columns which are of Numeric Data Type

demo_DF.select_dtypes(include=['number']).head()

   Student_ID  Marks  Tot_Marks
0         101     78        234
1         102     85        255
2         103     62        186
3         104     90        270
4         105     55        165


Extracting columns whose name contains a specific string

Let's extract only those columns that have 'Subject' in the column name

In [39]:
# Extract only those columns that have 'Subject' in the column name
demo_DF.filter(like='Subject').head()

   Subject
0     Math
1  Science
2     Math
3  English
4  Science


Extracting columns whose name contains a specific string

Let's extract only those columns that have 'Marks' in the column name

In [40]:
# Extract only those columns that have 'Marks' in the column name

demo_DF.filter(like='Marks').head()

   Marks  Tot_Marks
0     78        234
1     85        255
2     62        186
3     90        270
4     55        165


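Extracting columns whose name contains a specific string

Let's try to extract only those columns that have 'name' in the column name. The match is case-sensitive, so the lowercase 'name' does not match the 'Name' column and no columns are returned.

In [41]:
# Extract only those columns that have 'name' in the column name (case-sensitive, so nothing matches)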

demo_DF.filter(like='name').head()

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
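
When the exact capitalisation of a column name is not known, one option is the regex parameter of filter() with an inline case-insensitive flag. This is a sketch on a one-row frame with the demo column names; the pattern `(?i)name` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_ID": [101],
    "Name": ["Aarav"],
    "Subject": ["Math"],
    "Marks": [78],
})

no_match = df.filter(like="name")            # case-sensitive substring: nothing matches
exact_case = df.filter(like="Name")          # matches the 'Name' column
ignore_case = df.filter(regex="(?i)name")    # (?i) makes the regular expression case-insensitive

print(list(no_match.columns))
print(list(exact_case.columns))
print(list(ignore_case.columns))
```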


Methods & Attributes used for DataFrame Objects 

Attributes

    a) shape : returns the number of rows and columns in a DataFrame as a tuple
    b) size  : returns the total number of elements in a DataFrame

In [42]:
# get the shape & size of a DataFrame

print(demo_DF.shape)

print(demo_DF.size)

(15, 5)
75
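
The two attributes are related: size is the number of rows multiplied by the number of columns (15 × 5 = 75 above). A short sketch on an illustrative three-row subset of the demo data:

```python
import pandas as pd

df = pd.DataFrame({
    "Student_ID": [101, 102, 103],
    "Marks": [78, 85, 62],
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

print(n_rows, n_cols)
print(df.size)             # rows * columns
print(len(df))             # len() gives the number of rows
```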


Methods available with DataFrame object -

    a) count()
    b) describe()

In [43]:
# count method is used to find the number of non-missing values for each column.

print(demo_DF.count())

Student_ID    15
Name          15
Subject       15
Marks         15
Tot_Marks     15
dtype: int64


In [44]:
# Find the Statistical Summary of each column in the DataFrame

print(demo_DF.describe())

       Student_ID      Marks   Tot_Marks
count   15.000000  15.000000   15.000000
mean   108.000000  68.533333  205.600000
std      4.472136  18.192882   54.578646
min    101.000000  35.000000  105.000000
25%    104.500000  57.000000  171.000000
50%    108.000000  72.000000  216.000000
75%    111.500000  83.500000  250.500000
max    115.000000  91.000000  273.000000
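
By default, describe() on a DataFrame with mixed column types summarises only the numeric columns, which is why Name and Subject are absent above. The sketch below uses the include parameter of describe() to cover every column; the three-row frame is an illustrative subset of the demo data.

```python
import pandas as pd

df = pd.DataFrame({
    "Student_ID": [101, 102, 103],
    "Subject": ["Math", "Science", "Math"],
    "Marks": [78, 85, 62],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # every column; statistics that do not apply show as NaN

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(full_summary)
```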
