# Pandas

- Pandas is an open-source Python library for working with relational (tabular) and labeled data easily and intuitively.
- It provides data structures and operations for manipulating numerical data and time series.
- It is built on top of the NumPy library.
- Many of its operations are vectorized through NumPy, which keeps common data manipulation tasks fast and concise.

**Pandas provides two main data structures for manipulating data:**

**1. Series**

**2. DataFrame**

In [1]:
# Installing Pandas Library

!pip install pandas



In [2]:
# Importing Pandas library

import pandas as pd

In [3]:
# To hide warnings in the notebook

import warnings

warnings.filterwarnings('ignore')

---
# 1) Series
- A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.).

**Ways to create pandas Series:**

- Method 1: Create a list, then convert it into a Series
- Method 2: Create a dictionary, then convert it into a Series
- Method 3: pd.Series()

**Method 1: Create a list, then convert it into a Series**

In [4]:
# creating list

lst = [10,20,30,40,50,60,70]
lst

[10, 20, 30, 40, 50, 60, 70]

In [5]:
type(lst)

list

In [6]:
# Converting list into pandas series data structure

a = pd.Series(lst)
a

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [7]:
# checking

type(a)

pandas.core.series.Series

**Method 2: Create a dictionary, then convert it into a Series**

In [8]:
# creating dictionary

b = {1:10, 2:20, 3:30, 4:40, 5:50}
b

{1: 10, 2: 20, 3: 30, 4: 40, 5: 50}

In [9]:
# checking

type(b)

dict

In [10]:
# Converting the dictionary into a pandas Series (the keys become the index labels)

c = pd.Series(b)
c

1    10
2    20
3    30
4    40
5    50
dtype: int64

In [11]:
# Checking

type(c)

pandas.core.series.Series

**Method 3: pd.Series()**

One-dimensional ndarray with axis labels (including time series).

**Syntax:** pd.Series(data=None, index=None, dtype: 'Dtype | None' = None, name=None, copy: 'bool' = False, fastpath: 'bool' = False)

In [12]:
# creating pandas Series data structure

d = pd.Series(data = ['python', 'pune', 100], index = ['a', 'b', 'c'], name =  'Series')
d

a    python
b      pune
c       100
Name: Series, dtype: object

In [13]:
# Accessing elements from a series using index labels

d['b']

'pune'

In [14]:
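# Extract 'pune' by position
# Use .iloc for positions; plain d[1] on a label-indexed Series is deprecated in recent pandas versions

d.iloc[1]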

'pune'
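
The two lookups above can be written with explicit indexers so that it is always clear whether a label or a position is meant. The following is a minimal sketch, assuming a Series shaped like `d`; the variable names are illustrative only.

```python
import pandas as pd

# Same data as the Series d above, rebuilt so the example is self-contained
s = pd.Series(["python", "pune", 100], index=["a", "b", "c"], name="Series")

by_label = s.loc["b"]       # label-based lookup
by_position = s.iloc[1]     # position-based lookup (positions start at 0)

label_slice = s.loc["a":"b"]    # label slices include the end label
position_slice = s.iloc[0:1]    # position slices exclude the stop position

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Both lookups return the same element here; the slices differ in length because only the label slice includes its end point.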

---
# 2) DataFrame
- A pandas DataFrame is a two-dimensional data structure, like a 2D array or a table with rows and columns.

**Syntax:** pd.DataFrame(data=None, index: 'Axes | None' = None, columns: 'Axes | None' = None, dtype: 'Dtype | None' = None, copy: 'bool | None' = None)
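
As a small illustration of the `data`, `index` and `columns` parameters in the signature above, the sketch below builds a frame with custom row labels. The labels `r1` and `r2` are made up for this example.

```python
import pandas as pd

rows = [["Python", 98], ["ML", 92]]

# index sets the row labels, columns sets the column labels
frame = pd.DataFrame(data=rows, index=["r1", "r2"], columns=["Subject", "Marks"])

print(frame)
print(frame.loc["r2", "Marks"])
```

When `index` is omitted, pandas falls back to a default integer index starting at 0, which is what the examples in this section use.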

**Ways to create pandas DataFrame.**
1. By creating a nested list and a list of column names
2. By creating a dictionary

**1. By creating a nested list and a list of column names**

In [15]:
e = [['Python', 98, 'Pune'], ['ML', 92, 'Mumbai']]

coln = ['Subject', 'Marks', 'Location']

In [16]:
# Creating DataFrame

f = pd.DataFrame(data = e, columns = coln)
f

| | Subject | Marks | Location |
|---|---|---|---|
| 0 | Python | 98 | Pune |
| 1 | ML | 92 | Mumbai |


**2. By creating a dictionary**

In [17]:
g = pd.DataFrame(
    {'Subject': ['Python', 'ML', 'DL', 'AI'],
     'Marks': [98, 99, 97, 100],
     'Location': ['Pune', 'Mumbai', 'Hyderabad', 'Bengaluru']})
g

| | Subject | Marks | Location |
|---|---|---|---|
| 0 | Python | 98 | Pune |
| 1 | ML | 99 | Mumbai |
| 2 | DL | 97 | Hyderabad |
| 3 | AI | 100 | Bengaluru |


In [18]:
# Checking

type(g)

pandas.core.frame.DataFrame

In [19]:
# Check number of rows and columns in dataframe: (variable.shape)

g.shape

(4, 3)

In [20]:
# Check data types of each column: (variable.dtypes)

g.dtypes

Subject     object
Marks        int64
Location    object
dtype: object

In [21]:
# Number of dimensions of the dataframe

g.ndim

2

In [22]:
# Extracting columns from dataframe: Approach 1
# Syntax: variable.column_name (works only when the name is a valid Python identifier)

g.Marks

0     98
1     99
2     97
3    100
Name: Marks, dtype: int64

In [23]:
# Extracting columns from dataframe: Approach 2 (preferred)
# Syntax: variable['column name']
# Bracket access also works for names that contain spaces or clash with DataFrame attributes

g['Subject']

0    Python
1        ML
2        DL
3        AI
Name: Subject, dtype: object
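
The reason for preferring bracket access can be seen in a small sketch. The frame below is hypothetical and is chosen so that one column name contains a space and another collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: "Total Marks" has a space, "count" clashes with DataFrame.count
frame = pd.DataFrame(
    {"Subject": ["Python", "ML"], "Total Marks": [98, 99], "count": [1, 2]}
)

total = frame["Total Marks"]   # brackets accept any column name
counts = frame["count"]        # returns the column as a Series
attr = frame.count             # NOT the column: this is the DataFrame.count method

print(type(counts).__name__)
print(callable(attr))
```

`frame.Total Marks` would not even parse as Python, and `frame.count` silently returns a method rather than the data, so brackets are the safer habit.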

In [24]:
# Descriptive statistics (by default only numeric columns are summarized)

g.describe()

| | Marks |
|---|---|
| count | 4.0 |
| mean | 98.5 |
| std | 1.290994 |
| min | 97.0 |
| 25% | 97.75 |
| 50% | 98.5 |
| 75% | 99.25 |
| max | 100.0 |


In [25]:
# Descriptive statistics for categorical (object) columns

g.describe(include = object)

| | Subject | Location |
|---|---|---|
| count | 4 | 4 |
| unique | 4 | 4 |
| top | Python | Pune |
| freq | 1 | 1 |


In [26]:
# Descriptive statistics for all columns

g.describe(include='all')

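| | Subject | Marks | Location |
|---|---|---|---|
| count | 4 | 4.0 | 4 |
| unique | 4 | NaN | 4 |
| top | Python | NaN | Pune |
| freq | 1 | NaN | 1 |
| mean | NaN | 98.5 | NaN |
| std | NaN | 1.290994 | NaN |
| min | NaN | 97.0 | NaN |
| 25% | NaN | 97.75 | NaN |
| 50% | NaN | 98.5 | NaN |
| 75% | NaN | 99.25 | NaN |
| max | NaN | 100.0 | NaN |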


In [27]:
# Getting information about the entire dataset

g.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Subject   4 non-null      object
 1   Marks     4 non-null      int64 
 2   Location  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes


In [28]:
# Checking element-wise whether null values are present (True marks a null)

g.isnull()

| | Subject | Marks | Location |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |


In [29]:
# isna() gives the same result; isnull() is an alias of isna()

g.isna()

| | Subject | Marks | Location |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |


In [30]:
# Checking count of null values

g.isnull().sum()

Subject     0
Marks       0
Location    0
dtype: int64
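
Because `g` has no missing values, every count above is 0. The sketch below uses a small made-up frame that does contain gaps, to show what the same calls report when nulls are present.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
frame = pd.DataFrame(
    {"Subject": ["Python", "ML", None], "Marks": [98.0, np.nan, 97.0]}
)

null_counts = frame.isna().sum()              # nulls per column
has_any_null = frame.isna().any().any()       # True if any cell is null
total_nulls = int(frame.isna().sum().sum())   # nulls in the whole frame

print(null_counts)
print(has_any_null, total_nulls)
```

Both `None` and `np.nan` are treated as missing by `isna()`.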

---
# Importing External Dataset 

**1) To import a .csv file:** variable = pd.read_csv(r'dataset'). The r prefix makes the path a raw string, so backslashes in a Windows path are not interpreted as escape characters.
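
Since a file path is specific to one machine, the following sketch reads the same kind of data from an in-memory string instead, which `pd.read_csv` also accepts. The two data rows are copied from the Salaries dataset shown in this section; the path `C:\data\new\Salaries.csv` is hypothetical and only illustrates the raw-string prefix.

```python
import io

import pandas as pd

# Two rows in the same layout as Salaries.csv, held in memory instead of on disk
csv_text = (
    "rank,discipline,phd,service,gender,salary\n"
    "Prof,B,56,49,Male,186960\n"
    "Prof,A,12,6,Male,93000\n"
)
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)

# Hypothetical Windows path: with the r prefix, "\n" in "\new" stays a
# backslash followed by the letter n instead of becoming a newline
raw_path = r"C:\data\new\Salaries.csv"
print(raw_path)
```

Forward slashes in the path are an alternative that avoids the escaping issue altogether.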

In [31]:
# Loading the csv file

df = pd.read_csv(r"C:\Users\sures\3D Objects\Data Science\1) Offline Python for Data Science (Aishwarya Mate)\2) GitHub Aishwarya Mam\3) Datasets-main\Salaries.csv")
df

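| | rank | discipline | phd | service | gender | salary |
|---|---|---|---|---|---|---|
| 0 | Prof | B | 56 | 49 | Male | 186960 |
| 1 | Prof | A | 12 | 6 | Male | 93000 |
| 2 | Prof | A | 23 | 20 | Male | 110515 |
| 3 | Prof | A | 40 | 31 | Male | 131205 |
| 4 | Prof | B | 20 | 18 | Male | 104800 |
| ... | ... | ... | ... | ... | ... | ... |
| 73 | Prof | B | 18 | 10 | Female | 105450 |
| 74 | AssocProf | B | 19 | 6 | Female | 104542 |
| 75 | Prof | B | 17 | 17 | Female | 124312 |
| 76 | Prof | A | 28 | 14 | Female | 109954 |
| 77 | Prof | A | 23 | 15 | Female | 109646 |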


In [32]:
# First 5 records

df.head()

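| | rank | discipline | phd | service | gender | salary |
|---|---|---|---|---|---|---|
| 0 | Prof | B | 56 | 49 | Male | 186960 |
| 1 | Prof | A | 12 | 6 | Male | 93000 |
| 2 | Prof | A | 23 | 20 | Male | 110515 |
| 3 | Prof | A | 40 | 31 | Male | 131205 |
| 4 | Prof | B | 20 | 18 | Male | 104800 |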


In [33]:
# First 10 records

df.head(10)

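| | rank | discipline | phd | service | gender | salary |
|---|---|---|---|---|---|---|
| 0 | Prof | B | 56 | 49 | Male | 186960 |
| 1 | Prof | A | 12 | 6 | Male | 93000 |
| 2 | Prof | A | 23 | 20 | Male | 110515 |
| 3 | Prof | A | 40 | 31 | Male | 131205 |
| 4 | Prof | B | 20 | 18 | Male | 104800 |
| 5 | Prof | A | 20 | 20 | Male | 122400 |
| 6 | AssocProf | A | 20 | 17 | Male | 81285 |
| 7 | Prof | A | 18 | 18 | Male | 126300 |
| 8 | Prof | A | 29 | 19 | Male | 94350 |
| 9 | Prof | A | 51 | 51 | Male | 57800 |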


In [34]:
# Last 5 records

df.tail()

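| | rank | discipline | phd | service | gender | salary |
|---|---|---|---|---|---|---|
| 73 | Prof | B | 18 | 10 | Female | 105450 |
| 74 | AssocProf | B | 19 | 6 | Female | 104542 |
| 75 | Prof | B | 17 | 17 | Female | 124312 |
| 76 | Prof | A | 28 | 14 | Female | 109954 |
| 77 | Prof | A | 23 | 15 | Female | 109646 |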


In [35]:
# Last 10 records

df.tail(10)

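| | rank | discipline | phd | service | gender | salary |
|---|---|---|---|---|---|---|
| 68 | AsstProf | A | 4 | 2 | Female | 77500 |
| 69 | Prof | A | 28 | 7 | Female | 116450 |
| 70 | AsstProf | A | 8 | 3 | Female | 78500 |
| 71 | AssocProf | B | 12 | 9 | Female | 71065 |
| 72 | Prof | B | 24 | 15 | Female | 161101 |
| 73 | Prof | B | 18 | 10 | Female | 105450 |
| 74 | AssocProf | B | 19 | 6 | Female | 104542 |
| 75 | Prof | B | 17 | 17 | Female | 124312 |
| 76 | Prof | A | 28 | 14 | Female | 109954 |
| 77 | Prof | A | 23 | 15 | Female | 109646 |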


In [36]:
# Displaying all the records (removes the row limit on printed output)

pd.set_option('display.max_rows', None)

# To cap the display at a fixed number of rows (e.g. 5000), pass that number instead of None
# If there are many columns: pd.set_option('display.max_columns', None)
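
`pd.set_option` changes the display limit for the rest of the session. If the full output is needed only once, a temporary override with `pd.option_context` is one alternative. The sketch below uses a made-up 100-row frame and is not tied to the Salaries data.

```python
import pandas as pd

# Hypothetical frame, long enough to be truncated under the default settings
frame = pd.DataFrame({"n": range(100)})

default_max_rows = pd.get_option("display.max_rows")

# The override applies only inside the with-block
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")
    full_text = repr(frame)

after = pd.get_option("display.max_rows")

print(inside, after)
print(len(full_text.splitlines()))
```

A permanent change made with `pd.set_option` can be undone with `pd.reset_option('display.max_rows')`.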

In [37]:
df

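| | rank | discipline | phd | service | gender | salary |
|---|---|---|---|---|---|---|
| 0 | Prof | B | 56 | 49 | Male | 186960 |
| 1 | Prof | A | 12 | 6 | Male | 93000 |
| 2 | Prof | A | 23 | 20 | Male | 110515 |
| 3 | Prof | A | 40 | 31 | Male | 131205 |
| 4 | Prof | B | 20 | 18 | Male | 104800 |
| 5 | Prof | A | 20 | 20 | Male | 122400 |
| 6 | AssocProf | A | 20 | 17 | Male | 81285 |
| 7 | Prof | A | 18 | 18 | Male | 126300 |
| 8 | Prof | A | 29 | 19 | Male | 94350 |
| 9 | Prof | A | 51 | 51 | Male | 57800 |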


In [38]:
# Describe the data

df.describe()

| | phd | service | salary |
|---|---|---|---|
| count | 78.0 | 78.0 | 78.0 |
| mean | 19.705128 | 15.051282 | 108023.782051 |
| std | 12.498425 | 12.139768 | 28293.661022 |
| min | 1.0 | 0.0 | 57800.0 |
| 25% | 10.25 | 5.25 | 88612.5 |
| 50% | 18.5 | 14.5 | 104671.0 |
| 75% | 27.75 | 20.75 | 126774.75 |
| max | 56.0 | 51.0 | 186960.0 |


In [39]:
# Describe all columns and transpose the result (.T), so each column becomes a row

df.describe(include='all').T

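| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rank | 78.0 | 3.0 | Prof | 46.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| discipline | 78.0 | 2.0 | B | 42.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| phd | 78.0 | NaN | NaN | NaN | 19.705128 | 12.498425 | 1.0 | 10.25 | 18.5 | 27.75 | 56.0 |
| service | 78.0 | NaN | NaN | NaN | 15.051282 | 12.139768 | 0.0 | 5.25 | 14.5 | 20.75 | 51.0 |
| gender | 78.0 | 2.0 | Male | 39.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| salary | 78.0 | NaN | NaN | NaN | 108023.782051 | 28293.661022 | 57800.0 | 88612.5 | 104671.0 | 126774.75 | 186960.0 |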


In [40]:
# Extracting only column names

df.columns

Index(['rank', 'discipline', 'phd', 'service', 'gender', 'salary'], dtype='object')

In [41]:
# Extract two columns by passing a list of column names

df[['rank', 'salary']]

| | rank | salary |
|---|---|---|
| 0 | Prof | 186960 |
| 1 | Prof | 93000 |
| 2 | Prof | 110515 |
| 3 | Prof | 131205 |
| 4 | Prof | 104800 |
| 5 | Prof | 122400 |
| 6 | AssocProf | 81285 |
| 7 | Prof | 126300 |
| 8 | Prof | 94350 |
| 9 | Prof | 57800 |


In [42]:
# Extract records from the discipline column, from row 18 to row 25

# Operations like this are done with the "loc" and "iloc" indexers

**1) loc indexer** (label-based selection of particular data)

Note: The end label is included in the slice.

**Syntax:** variable.loc[start:end, 'column name']

In [43]:
df.loc[18:25, 'discipline']

18    A
19    A
20    B
21    A
22    A
23    A
24    A
25    B
Name: discipline, dtype: object

In [44]:
df.loc[18:25, 'discipline':'service']

| | discipline | phd | service |
|---|---|---|---|
| 18 | A | 19 | 7 |
| 19 | A | 29 | 27 |
| 20 | B | 4 | 4 |
| 21 | A | 33 | 30 |
| 22 | A | 4 | 2 |
| 23 | A | 2 | 0 |
| 24 | A | 30 | 23 |
| 25 | B | 35 | 31 |


**2) iloc indexer** (position-based selection)

Note: The stop position is excluded from the slice.

**Syntax:** variable.iloc[row_start:row_stop, column_start:column_stop]

In [45]:
df.iloc[30:40, 1:3]

| | discipline | phd |
|---|---|---|
| 30 | B | 9 |
| 31 | B | 22 |
| 32 | A | 27 |
| 33 | B | 18 |
| 34 | B | 12 |
| 35 | B | 28 |
| 36 | B | 45 |
| 37 | A | 20 |
| 38 | B | 4 |
| 39 | B | 18 |
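
With the default integer index, labels and positions happen to coincide, which can hide the difference between the two indexers. The sketch below uses a small made-up frame, first with the default index and then with custom labels, to show the end-point rule and the label versus position distinction side by side.

```python
import pandas as pd

# Hypothetical six-row frame with the default index 0..5
frame = pd.DataFrame({"discipline": list("ABABAB"), "phd": [19, 29, 4, 33, 4, 2]})

loc_rows = frame.loc[1:3]     # labels 1, 2, 3 (end label included)
iloc_rows = frame.iloc[1:3]   # positions 1, 2 (stop position excluded)
print(len(loc_rows), len(iloc_rows))

# Same data with non-default labels: loc follows labels, iloc follows positions
relabeled = frame.copy()
relabeled.index = [10, 20, 30, 40, 50, 60]

print(list(relabeled.loc[20:40].index))
print(list(relabeled.iloc[1:3].index))
```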


---
**value_counts**

Shows each unique value with its count (how many times that value appears), sorted from most to least frequent.

**Syntax:** variable['column name'].value_counts()

In [46]:
# Value counts

df['rank'].value_counts()

Prof         46
AsstProf     19
AssocProf    13
Name: rank, dtype: int64

In [47]:
df['gender'].value_counts()

Male      39
Female    39
Name: gender, dtype: int64
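
When proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a made-up Series of ranks:

```python
import pandas as pd

# Hypothetical sample of ranks, not the full Salaries column
ranks = pd.Series(
    ["Prof", "Prof", "AsstProf", "Prof", "AssocProf", "AsstProf"], name="rank"
)

counts = ranks.value_counts()                 # absolute frequencies
shares = ranks.value_counts(normalize=True)   # relative frequencies, summing to 1

print(counts)
print(shares)
```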

---
**unique values**

Returns the unique values in a particular column.

**Syntax:** variable['column name'].unique()

In [48]:
df['gender'].unique()

array(['Male', 'Female'], dtype=object)

---
**unique value count**

Counts the unique values in a particular column.

**Syntax:** variable['column name'].nunique()

In [49]:
df['rank'].nunique()

3
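
One detail worth knowing is how the two methods treat missing values: `unique()` keeps NaN as a value, while `nunique()` ignores it unless `dropna=False` is passed. A short sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
genders = pd.Series(["Male", "Female", np.nan, "Male"])

distinct = genders.unique()                      # includes NaN
n_distinct = genders.nunique()                   # NaN is not counted
n_with_nan = genders.nunique(dropna=False)       # NaN is counted

print(distinct)
print(n_distinct, n_with_nan)
```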

---
**Renaming column names**

In [50]:
df = df.rename(columns = {'rank':'Rank'})
df

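| | Rank | discipline | phd | service | gender | salary |
|---|---|---|---|---|---|---|
| 0 | Prof | B | 56 | 49 | Male | 186960 |
| 1 | Prof | A | 12 | 6 | Male | 93000 |
| 2 | Prof | A | 23 | 20 | Male | 110515 |
| 3 | Prof | A | 40 | 31 | Male | 131205 |
| 4 | Prof | B | 20 | 18 | Male | 104800 |
| 5 | Prof | A | 20 | 20 | Male | 122400 |
| 6 | AssocProf | A | 20 | 17 | Male | 81285 |
| 7 | Prof | A | 18 | 18 | Male | 126300 |
| 8 | Prof | A | 29 | 19 | Male | 94350 |
| 9 | Prof | A | 51 | 51 | Male | 57800 |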


In [51]:
# inplace=True modifies df directly instead of returning a new DataFrame

df.rename(columns={'discipline':'Discipline','phd':'PHD'}, inplace=True)
df

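| | Rank | Discipline | PHD | service | gender | salary |
|---|---|---|---|---|---|---|
| 0 | Prof | B | 56 | 49 | Male | 186960 |
| 1 | Prof | A | 12 | 6 | Male | 93000 |
| 2 | Prof | A | 23 | 20 | Male | 110515 |
| 3 | Prof | A | 40 | 31 | Male | 131205 |
| 4 | Prof | B | 20 | 18 | Male | 104800 |
| 5 | Prof | A | 20 | 20 | Male | 122400 |
| 6 | AssocProf | A | 20 | 17 | Male | 81285 |
| 7 | Prof | A | 18 | 18 | Male | 126300 |
| 8 | Prof | A | 29 | 19 | Male | 94350 |
| 9 | Prof | A | 51 | 51 | Male | 57800 |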
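
The two renaming styles above differ in whether the original object changes. The sketch below, on a made-up one-row frame, shows that `rename` without `inplace` leaves the original untouched until the result is assigned back.

```python
import pandas as pd

# Hypothetical frame with two of the Salaries column names
frame = pd.DataFrame({"rank": ["Prof"], "phd": [56]})

# Without inplace, rename returns a new DataFrame; frame itself is unchanged
renamed = frame.rename(columns={"rank": "Rank"})
columns_before = list(frame.columns)

# Assigning the result back is equivalent to using inplace=True
frame = frame.rename(columns={"phd": "PHD"})

print(list(renamed.columns))
print(columns_before, list(frame.columns))
```

Keys in the mapping that do not match any column are ignored by default, so a misspelled name fails silently; passing `errors='raise'` turns that into an error.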
