Pandas is an open-source Python library used for data manipulation and analysis. It provides data structures and functions for performing efficient operations on data, and it is well suited to tabular data such as spreadsheets or SQL tables. Pandas is built on top of the NumPy library and relies on many of its features. It is widely used in data science because it works well with other important libraries, such as:
* Matplotlib for plotting graphs
* SciPy for statistical analysis
* Scikit-learn for machine learning algorithms

Here are some of the tasks we can do with Pandas:

* Data Cleaning, Merging and Joining: Clean and combine data from multiple sources, handling inconsistencies and duplicates.
* Handling Missing Data: Manage missing values (NaN) in both floating and non-floating point data.
* Column Insertion and Deletion: Easily add, remove or modify columns in a DataFrame.
* Group By Operations: Use "split-apply-combine" to group and analyze data.
* Data Visualization: Create visualizations with Matplotlib and Seaborn, which integrate with Pandas.
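
As a brief illustration of the group-by item in the list above, the sketch below applies "split-apply-combine" to a small made-up table. The column names `team` and `points` and all values are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only
scores = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 20, 5, 15, 40],
})

# Split the rows by team, apply mean() to each group,
# and combine the results into a single Series indexed by team
mean_points = scores.groupby("team")["points"].mean()
print(mean_points)
```

Here the result has one entry per team, so the grouping column becomes the index of the output.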

### 1. Pandas Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

A Series is often created by loading data from existing storage such as a SQL database, a CSV file or an Excel file. It can also be created from lists, dictionaries, scalar values, etc.

Example: Creating a Series using the Pandas library.

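In [1]:
import pandas as pd
import numpy as np

# Create an empty Series (dtype given explicitly to avoid version-dependent defaults)
ser = pd.Series(dtype=object)
print("Pandas Series: ", ser)

# Create a Series from a NumPy array; the default index is 0, 1, 2, ...
data = np.array(['g', 'e', 'e', 'k', 's'])

ser = pd.Series(data)
print("Pandas Series:\n", ser)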

Pandas Series:  Series([], dtype: object)
Pandas Series:
 0    g
1    e
2    e
3    k
4    s
dtype: object

### 2. Pandas DataFrame
A Pandas DataFrame is a two-dimensional labeled data structure with rows and columns, similar to a table. It can be created from lists, dictionaries and other data sources.

Example: Creating a DataFrame using the Pandas library.


In [10]:
df = pd.DataFrame() 
print(df)

lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks'] 
  
df = pd.DataFrame(lst) 
df

Empty DataFrame
Columns: []
Index: []


|   | 0 |
|---|---|
| 0 | Geeks |
| 1 | For |
| 2 | Geeks |
| 3 | is |
| 4 | portal |
| 5 | for |
| 6 | Geeks |


A Pandas Series is a one-dimensional labeled array that can hold data of any type (integer, float, string, Python objects, etc.). It is similar to a column in an Excel spreadsheet or a database table. In this article we will study the Pandas Series, a useful one-dimensional data structure in Python.

Key Features of Pandas Series:

* Supports integer-based and label-based indexing.
* Can hold values of different types; mixed values are stored with the generic object dtype.
* Offers a variety of built-in methods for data manipulation and analysis.

### Creating a Pandas Series
A Pandas Series can be created from different data structures such as lists, NumPy arrays, dictionaries or scalar values.

In [12]:
data = [1, 2, 3, 4]
 
ser = pd.Series(data)
ser

0    1
1    2
2    3
3    4
dtype: int64
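
The cell above builds a Series from a list. As a minimal sketch of the other sources mentioned in this section, the example below builds one Series from a dictionary and one from a scalar value. The labels and values are made up for illustration.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
ser_from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a scalar: the value is repeated for every label in the given index
ser_from_scalar = pd.Series(5, index=["x", "y", "z"])

print(ser_from_dict)
print(ser_from_scalar)
```

When a scalar is used, the `index` argument determines how many elements the Series has.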

### Accessing Elements of a Series
In Pandas we can access the elements of a Series in two ways:

* Position-based Indexing - We use numerical positions, similar to lists in Python.
* Label-based Indexing - We use the custom index labels assigned to the elements.

#### 1. Position-based Indexing
To access a Series element by position, we refer to its position number, which must be an integer. With the default integer index, the index operator [] accepts this number directly. To access multiple elements from a Series, we use a slice operation.

In [15]:
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data)

print(ser[:6])

0    g
1    e
2    e
3    k
4    s
5    f
dtype: object


#### 2. Label-based Indexing
To access an element by label, we pass its index label to the index operator []. A Series is like a fixed-size dictionary in that we can get and set values by index label. Let's see an example:

In [17]:
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17,18,19,20,21,22])

print(ser[18])

g
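
To illustrate the dictionary-like behaviour described above, here is a minimal sketch that gets and sets values by label. The Series `prices`, its string labels and its values are hypothetical and serve only as an example.

```python
import pandas as pd

# Illustrative Series with string labels (values made up for the example)
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])

# Get a value by its label, as with a dictionary key
banana_price = prices["banana"]

# Set a value by its label
prices["banana"] = 2.5

# Membership tests check the index labels, not the values
has_cherry = "cherry" in prices

print(banana_price, prices["banana"], has_cherry)
```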


### Indexing and Selecting Data in Series
Indexing in Pandas means selecting particular data from a Series. It can mean selecting all of the data or only some of it. Indexing is also known as subset selection. We can use .iloc[] for position-based selection and .loc[] for label-based selection.

#### 1. Indexing a Series using .loc[]
The .loc[] indexer selects data by referring to the explicit index labels. It selects data differently from the plain indexing operator and can select subsets of data. The examples below use the nba.csv dataset.

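In [31]:
# Load the dataset and take the 'Name' column as a Series
nba_data = pd.read_csv("nba.csv")
ser = pd.Series(nba_data['Name'])
data = ser.head(10)
data

0    Avery Bradley
1      Jae Crowder
2     John Holland
3      R.J. Hunter
4    Jonas Jerebko
5     Amir Johnson
6    Jordan Mickey
7     Kelly Olynyk
8     Terry Rozier
9     Marcus Smart
Name: Name, dtype: object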

In [21]:
data.loc[3:6]

3      R.J. Hunter
4    Jonas Jerebko
5     Amir Johnson
6    Jordan Mickey
Name: Name, dtype: object

#### 2. Indexing a Series using .iloc[]
The .iloc[] indexer retrieves data by position, so we specify the integer positions of the data we want. It is very similar to .loc[] but uses only integer locations to make its selections. Note that an .iloc[] slice excludes its end position, whereas a .loc[] slice includes its end label.

In [22]:
data.iloc[3:6]

3      R.J. Hunter
4    Jonas Jerebko
5     Amir Johnson
Name: Name, dtype: object
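
The two outputs above differ by one row even though both slices are written as 3:6. The sketch below reproduces that difference on a small made-up Series, so it can be run without the nba.csv file.

```python
import pandas as pd

# Small Series with the default integer index 0..5
letters = pd.Series(list("abcdef"))

by_label = letters.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = letters.iloc[1:3]  # positions 1 and 2 (end position excluded)

print(by_label.tolist())     # ['b', 'c', 'd']
print(by_position.tolist())  # ['b', 'c']
```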

### Binary Operations on Pandas Series
Pandas allows us to perform binary operations on Series, such as addition, subtraction, multiplication and division. These operations can be performed using methods like .add(), .sub(), .mul() and .div().

Example: Performing Binary Operations

In [23]:
ser1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
ser2 = pd.Series([4, 5, 6], index=['A', 'B', 'C'])

df_sum = ser1.add(ser2)
print(df_sum)

A    5
B    7
C    9
dtype: int64
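
In the example above both Series share the same labels. As a hedged sketch of what happens when they do not, the code below adds two Series with partly different labels, first with the `+` operator and then with `.add()` and its `fill_value` parameter. The values are made up for illustration.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
ser2 = pd.Series([10, 20, 30], index=["B", "C", "D"])

# The + operator aligns on labels; labels found in only one Series give NaN
plain_sum = ser1 + ser2

# add() with fill_value treats a missing label as 0 instead of producing NaN
filled_sum = ser1.add(ser2, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Because binary operations align on labels rather than positions, the method form is useful whenever the two indexes may not match exactly.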


#### Common Binary Operations

The table below lists common binary methods, along with a few related methods that operate on a single Series (sum(), prod(), mean() and abs()).

| Function | Description |
|----------|-------------|
| `sub()`  | Subtracts another Series, or a list-like object of the same length, from the caller Series |
| `mul()`  | Multiplies the caller Series by another Series or a list-like object of the same length |
| `div()`  | Divides the caller Series by another Series or a list-like object of the same length |
| `sum()`  | Returns the sum of the values for the requested axis |
| `prod()` | Returns the product of the values for the requested axis |
| `mean()` | Returns the mean of the values for the requested axis |
| `pow()`  | Raises each element of the caller Series to the power given by the corresponding element of the passed Series |
| `abs()`  | Returns the absolute numeric value of each element in a Series/DataFrame |
| `cov()`  | Computes the covariance of two Series |

### Conversion Operation on Series
Conversion operations transform the data type or the container of a Series, which is useful for keeping data types consistent. Pandas provides several functions for this, such as .astype() and .tolist().

In [24]:
ser = pd.Series([1, 2, 3, 4])
ser = ser.astype(float)
print(ser)

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
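
Beyond the `.astype(float)` call above, the following minimal sketch shows two other conversions: `.tolist()`, which is mentioned in this section, and `pd.to_numeric()`, which is an addition not covered in the text. The input strings are made up for illustration.

```python
import pandas as pd

ser = pd.Series(["1", "2", "3"])

# Convert numeric strings to integers
as_int = ser.astype(int)

# Convert the Series to a plain Python list
as_list = as_int.tolist()

# pd.to_numeric with errors="coerce" turns unparseable entries into NaN
mixed = pd.Series(["1", "two", "3"])
as_numeric = pd.to_numeric(mixed, errors="coerce")

print(as_list)
print(as_numeric)
```

Note that `.astype(int)` would raise an error on the `mixed` Series, which is why the coercing variant is shown for that case.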


A Pandas DataFrame is a two-dimensional table-like structure in Python where data is arranged in rows and columns. It’s one of the most commonly used tools for handling data and makes it easy to organize, analyze and manipulate data. It can store different types of data such as numbers, text and dates across its columns. The main parts of a DataFrame are:

* Data: Actual values in the table.
* Rows: Labels that identify each row.
* Columns: Labels that define each data category.

In this article, we’ll look at the key components of a DataFrame and how to work with it to make data analysis easier and more efficient.

### Creating a Pandas DataFrame
Pandas allows us to create a DataFrame from many data sources. We can create DataFrames directly from Python objects like lists and dictionaries or by reading data from external files like CSV, Excel or SQL databases.

#### 1. Creating DataFrame using a List
If we have a simple list of data, we can easily create a DataFrame by passing that list to the pd.DataFrame() function.

In [26]:
lst = ['Geeks', 'For', 'Geeks', 'is', 
            'portal', 'for', 'Geeks']

df = pd.DataFrame(lst)
df

|   | 0 |
|---|---|
| 0 | Geeks |
| 1 | For |
| 2 | Geeks |
| 3 | is |
| 4 | portal |
| 5 | for |
| 6 | Geeks |


#### 2. Creating DataFrame from dict of ndarray/lists
We can create a DataFrame from a dictionary where the keys are column names and the values are lists or arrays.

* All arrays/lists must have the same length.
* If an index is provided, it must match the length of the arrays.
* If no index is provided, Pandas will use a default range index (0, 1, 2, …).

In [27]:
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}
 
df = pd.DataFrame(data)
 
df

|   | Name | Age |
|---|------|-----|
| 0 | Tom | 20 |
| 1 | nick | 21 |
| 2 | krish | 19 |
| 3 | jack | 18 |
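
The rules listed above about lengths and indexes can be checked with a short sketch. It reuses the same data, adds a custom index (the labels `rank1` to `rank4` are made up for illustration) and shows that lists of unequal length are rejected.

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack'],
        'Age': [20, 21, 19, 18]}

# The index list must have one label per row (four here)
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

# Lists of unequal length raise a ValueError
try:
    pd.DataFrame({'Name': ['Tom', 'nick'], 'Age': [20]})
except ValueError as err:
    print("ValueError:", err)
```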


### Working With Rows and Columns in Pandas DataFrame
We can perform basic operations on rows/columns like selecting, deleting, adding and renaming.

#### 1. Column Selection
To select columns in a Pandas DataFrame, we access them by their column names: a single name returns a Series, and a list of names returns a DataFrame.

In [29]:
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
 
df = pd.DataFrame(data)
df[['Name', 'Qualification']]

|   | Name | Qualification |
|---|------|---------------|
| 0 | Jai | Msc |
| 1 | Princi | MA |
| 2 | Gaurav | MCA |
| 3 | Anuj | Phd |
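
This section also mentions adding, deleting and renaming columns. A minimal sketch of those three operations, using part of the same sample data, could look like this; the new column name `City` is an assumption chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
                   'Age': [27, 24, 22, 32]})

# Add a column by assigning a list with one value per row
df['Address'] = ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj']

# Rename a column; rename() returns a new DataFrame by default
df = df.rename(columns={'Address': 'City'})

# Delete a column; drop() also returns a new DataFrame
df = df.drop(columns=['Age'])

print(df.columns.tolist())  # ['Name', 'City']
```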


#### 2. Row Selection
Pandas provides dedicated indexers for selecting rows from a DataFrame. The DataFrame.loc[] indexer is used for label-based selection.

The examples below use the nba.csv dataset.

In [46]:
data=pd.read_csv("nba.csv",index_col="Name")

first=data.loc["Avery Bradley"]
second=data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)

Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object 


 Team        Boston Celtics
Number                28.0
Position                SG
Age                   22.0
Height                 6-5
Weight               185.0
College      Georgia State
Salary           1148640.0
Name: R.J. Hunter, dtype: object


### Indexing and Selecting Data in Pandas DataFrame
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. It allows us to access subsets of data such as:

* Selecting all rows and some columns.
* Selecting some rows and all columns.
* Selecting a specific subset of rows and columns.

Indexing is also known as subset selection.

#### 1. Indexing a DataFrame using indexing operator []
The indexing operator [] is the basic way to select data in Pandas. We can use this operator to access columns from a DataFrame. This method allows us to retrieve one or more columns. The .loc and .iloc indexers also use the indexing operator to make selections.

In order to select a single column, we simply put the name of the column in-between the brackets.

In [48]:
age=data['Age']
age.head()

Name
Avery Bradley    25.0
Jae Crowder      25.0
John Holland     27.0
R.J. Hunter      22.0
Jonas Jerebko    29.0
Name: Age, dtype: float64

#### 2. Indexing a DataFrame using .loc[ ]
The .loc[] indexer selects data by label: it uses the row and column labels to access specific data points. It is versatile because it can select rows and columns simultaneously based on their labels.

To select a single row, we pass a single row label to .loc[].

In [49]:
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
 
print(first, "\n\n\n", second)

Team        Boston Celtics
Number                 0.0
Position                PG
Age                   25.0
Height                 6-2
Weight               180.0
College              Texas
Salary           7730337.0
Name: Avery Bradley, dtype: object 


 Team        Boston Celtics
Number                28.0
Position                SG
Age                   22.0
Height                 6-5
Weight               185.0
College      Georgia State
Salary           1148640.0
Name: R.J. Hunter, dtype: object
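
The example above selects whole rows. To show how .loc[] can select rows and columns at the same time, here is a minimal sketch. So that it runs without the nba.csv file, it builds a small stand-in DataFrame from three rows copied from the outputs in this article; only the `Team`, `Age` and `Salary` columns are included.

```python
import pandas as pd

# Three rows copied from the outputs above, standing in for nba.csv
data = pd.DataFrame(
    {"Team": ["Boston Celtics"] * 3,
     "Age": [25.0, 22.0, 29.0],
     "Salary": [7730337.0, 1148640.0, 5000000.0]},
    index=pd.Index(["Avery Bradley", "R.J. Hunter", "Jonas Jerebko"], name="Name"),
)

# Several rows and several columns, all selected by label
subset = data.loc[["Avery Bradley", "R.J. Hunter"], ["Age", "Salary"]]

# A single cell: row label first, then column label
hunter_salary = data.loc["R.J. Hunter", "Salary"]

# A boolean condition also works as the row selector
older = data.loc[data["Age"] > 24, "Age"]

print(subset)
print(hunter_salary)
print(older)
```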


#### 3. Indexing a DataFrame using .iloc[ ] 
The .iloc[] indexer selects data based on integer position. Unlike .loc[] (which uses labels), .iloc[] requires us to specify row and column positions as integers (0-based indexing).

To select a single row, we pass a single integer to .iloc[]. To select a range of rows, we pass a slice, as in the example below.

In [53]:
# Select the rows at positions 3 through 12 (the end position 13 is excluded)
rows = data.iloc[3:13]

rows

| Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|------|------|--------|----------|-----|--------|--------|---------|--------|
| R.J. Hunter | Boston Celtics | 28.0 | SG | 22.0 | 6-5 | 185.0 | Georgia State | 1148640.0 |
| Jonas Jerebko | Boston Celtics | 8.0 | PF | 29.0 | 6-10 | 231.0 | NaN | 5000000.0 |
| Amir Johnson | Boston Celtics | 90.0 | PF | 29.0 | 6-9 | 240.0 | NaN | 12000000.0 |
| Jordan Mickey | Boston Celtics | 55.0 | PF | 21.0 | 6-8 | 235.0 | LSU | 1170960.0 |
| Kelly Olynyk | Boston Celtics | 41.0 | C | 25.0 | 7-0 | 238.0 | Gonzaga | 2165160.0 |
| Terry Rozier | Boston Celtics | 12.0 | PG | 22.0 | 6-2 | 190.0 | Louisville | 1824360.0 |
| Marcus Smart | Boston Celtics | 36.0 | PG | 22.0 | 6-4 | 220.0 | Oklahoma State | 3431040.0 |
| Jared Sullinger | Boston Celtics | 7.0 | C | 24.0 | 6-9 | 260.0 | Ohio State | 2569260.0 |
| Isaiah Thomas | Boston Celtics | 4.0 | PG | 27.0 | 5-9 | 185.0 | Washington | 6912869.0 |
| Evan Turner | Boston Celtics | 11.0 | SG | 27.0 | 6-7 | 220.0 | Ohio State | 3425510.0 |
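
Besides row slices, .iloc[] accepts a single integer, a pair of row and column positions, or lists of positions. The sketch below shows these forms on a small made-up DataFrame (the column names `a`, `b`, `c` and all values are illustrative), so it does not depend on nba.csv.

```python
import pandas as pd

# Small illustrative DataFrame (values made up for this sketch)
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [10, 20, 30, 40],
                   "c": [100, 200, 300, 400]})

single_row = df.iloc[2]        # third row, returned as a Series
cell = df.iloc[1, 2]           # second row, third column
block = df.iloc[0:2, [0, 2]]   # first two rows, first and third columns
last_row = df.iloc[-1]         # negative positions count from the end

print(single_row.tolist())  # [3, 30, 300]
print(cell)                 # 200
print(block)
```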


### Working with Missing Data
Missing data occurs when no information is available for one or more items or for an entire row or column. In Pandas, missing numeric data is typically represented as NaN (Not a Number); None and NaT (for datetimes) are also treated as missing. Missing data is a common problem in real-world datasets, where data is often incomplete. Pandas provides several methods to handle it effectively:

#### 1. Checking for Missing Values using isnull() and notnull()
To check for missing values (NaN) we can use two useful functions:

1. isnull(): It returns True for NaN (missing) values and False otherwise.
2. notnull(): It returns the opposite, True for non-missing values and False for NaN values.

In [57]:
# 'scores' is used as the name to avoid shadowing Python's built-in dict
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

df = pd.DataFrame(scores)
print(df)
print(df.isnull())

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0
   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
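
The Boolean table above is usually a starting point rather than the final result. As a minimal sketch on the same scores, the code below counts the missing values in each column and uses notnull() to keep only the rows where one column is present.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# Count missing values per column (True counts as 1 when summed)
missing_per_column = df.isnull().sum()

# Keep only the rows where 'First Score' is present
has_first = df[df['First Score'].notnull()]

print(missing_per_column)
print(has_first)
```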


#### 2. Filling Missing Values using fillna(), replace() and interpolate()
To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions, which replace NaN values with other values. fillna() and replace() substitute values that we supply, whereas interpolate() estimates the missing values using an interpolation technique rather than a hard-coded value. The example below fills every missing value with the mean of all the scores.

In [62]:
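scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(scores)

# Mean of all non-missing values in the table (a single number)
overall_mean = np.nanmean(df.to_numpy())

# Replace every NaN with that mean, rounded to two decimals
df.fillna(round(overall_mean, 2))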

|   | First Score | Second Score | Third Score |
|---|-------------|--------------|-------------|
| 0 | 100.0 | 30.0 | 70.44 |
| 1 | 90.0 | 45.0 | 40.0 |
| 2 | 70.44 | 56.0 | 80.0 |
| 3 | 95.0 | 70.44 | 98.0 |
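
The cell above demonstrates fillna() only. A minimal sketch of the other two functions named in this section, on the same scores, might look like this; the placeholder value -1 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# replace(): swap every NaN for a fixed placeholder value (-1 is arbitrary)
replaced = df.replace(np.nan, -1)

# interpolate(): estimate each NaN from neighbouring values in the same
# column; the default linear method leaves a leading NaN untouched
interpolated = df.interpolate()

print(replaced)
print(interpolated)
```

With the default settings, the gap in 'First Score' is filled with 92.5, the midpoint of its neighbours 90 and 95, while the first value of 'Third Score' stays NaN because there is no earlier value to interpolate from.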


#### 3. Dropping Missing Values using dropna()
If we want to remove rows or columns with missing data, we can use the dropna() method. It is flexible: depending on its arguments, it drops rows or columns. By default, it drops every row that contains at least one missing value.

In [67]:
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score': [52, 40, 80, 98],
          'Fourth Score': [np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(scores)

# Drop every row that has at least one missing value
df.dropna()

|   | First Score | Second Score | Third Score | Fourth Score |
|---|-------------|--------------|-------------|--------------|
| 3 | 95.0 | 56.0 | 98 | 65.0 |
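
The configuration options mentioned above can be sketched as follows, reusing the same scores. The parameters `axis`, `how` and `thresh` belong to dropna(); the threshold of three is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, 40, 80, 98],
                   'Fourth Score': [np.nan, np.nan, np.nan, 65]})

# Drop columns (axis=1) that contain any missing value
complete_columns = df.dropna(axis=1)

# Drop a row only if every value in it is missing
not_all_missing = df.dropna(how='all')

# Keep rows with at least three non-missing values
mostly_complete = df.dropna(thresh=3)

print(complete_columns.columns.tolist())  # ['Third Score']
print(len(not_all_missing))               # 4
print(mostly_complete.index.tolist())     # [0, 3]
```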


### Iterating Over Rows and Columns
Iteration refers to the process of accessing each item one at a time. In Pandas, it means iterating through rows or columns in a DataFrame to access or manipulate the data. We can iterate over rows and columns to extract values or perform operations on each item.

#### 1. Iterating Over Rows
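Pandas offers several ways to iterate over a DataFrame. Three common methods are:

1. items(): iterates over columns as (column name, Series) pairs. It was formerly called iteritems(), which was removed in pandas 2.0.
2. iterrows(): iterates over rows as (index, Series) pairs.
3. itertuples(): iterates over rows as namedtuples.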

In [69]:
students = {'name': ["aparna", "pankaj", "sudhir", "Geeku"],
            'degree': ["MBA", "BCA", "M.Tech", "MBA"],
            'score': [90, 40, 80, 98]}

df = pd.DataFrame(students)

# iterrows() yields (index label, row as a Series) pairs
for index, row in df.iterrows():
    print(index, row)
    print()

0 name      aparna
degree       MBA
score         90
Name: 0, dtype: object

1 name      pankaj
degree       BCA
score         40
Name: 1, dtype: object

2 name      sudhir
degree    M.Tech
score         80
Name: 2, dtype: object

3 name      Geeku
degree      MBA
score        98
Name: 3, dtype: object



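#### 2. Iterating Over Columns
Iterating over a DataFrame directly yields its column labels, which we can use to access each column. The example below prints the third element (index label 2) of every column.

In [76]:
for col in df:
    print(df[col][2])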

sudhir
M.Tech
80
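
The cells above use iterrows() and plain column iteration. As a minimal sketch of the two other styles, the code below uses itertuples() for rows and items() for columns on the same student data. It uses items() because that is the current name for column-wise iteration; iteritems() was removed in pandas 2.0.

```python
import pandas as pd

df = pd.DataFrame({'name': ["aparna", "pankaj", "sudhir", "Geeku"],
                   'degree': ["MBA", "BCA", "M.Tech", "MBA"],
                   'score': [90, 40, 80, 98]})

# itertuples(): each row is a namedtuple; fields are read as attributes
high_scorers = [row.name for row in df.itertuples(index=False) if row.score >= 90]

# items(): each step yields a (column name, Series) pair
column_lengths = {label: len(column) for label, column in df.items()}

print(high_scorers)    # ['aparna', 'Geeku']
print(column_lengths)  # {'name': 4, 'degree': 4, 'score': 4}
```

itertuples() is generally faster than iterrows() because it does not build a Series for every row.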


The table below summarizes commonly used DataFrame and Series attributes and methods:

| Attribute / Method | Description |
|------------------|-------------|
| `index`          | Attribute that returns the index (row labels) of the DataFrame |
| `insert`         | Inserts a column into a DataFrame at a specified position |
| `add`            | Returns the element-wise addition of the DataFrame and other (binary operator add) |
| `sub`            | Returns the element-wise subtraction of other from the DataFrame (binary operator sub) |
| `mul`            | Returns the element-wise multiplication of the DataFrame and other (binary operator mul) |
| `div`            | Returns the element-wise floating division of the DataFrame by other (binary operator truediv) |
| `unique`         | Returns the unique values in a Series |
| `nunique`        | Returns the number of unique values (per column for a DataFrame) |
| `value_counts`   | Counts the number of times each unique value occurs within the Series |
| `columns`        | Attribute that returns the column labels of the DataFrame |
| `axes`           | Attribute that returns a list representing the axes of the DataFrame |
| `isnull`         | Returns a Boolean mask that is True where values are missing; often used to extract rows with null values |
| `notnull`        | Returns a Boolean mask that is True where values are not missing; often used to extract rows with non-null values |
| `isin`           | Returns a Boolean mask showing whether each value is contained in a predefined collection; often used to filter rows |
| `dtypes`         | Attribute that returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns |
| `astype`         | Converts the data in a Series or DataFrame to a specified data type |
| `values`         | Attribute that returns a NumPy representation of the DataFrame (only the values, labels removed) |
| `sort_values`    | Sorts a DataFrame in ascending or descending order by the passed column(s) |
| `sort_index`     | Sorts a DataFrame by its index labels |
| `loc[]`          | Indexer that retrieves rows and columns by label |
| `iloc[]`         | Indexer that retrieves rows and columns by integer position |
| `ix[]`           | Indexer that retrieved rows by either label or position; deprecated and removed in pandas 1.0, use `loc[]` or `iloc[]` instead |
| `rename`         | Changes the index labels or column names of a DataFrame |
| `drop`           | Deletes rows or columns from a DataFrame |
| `pop`            | Removes a single column from a DataFrame and returns it as a Series |
| `sample`         | Pulls out a random sample of rows or columns from a DataFrame |
| `nsmallest`      | Pulls out the rows with the smallest values in a column |
| `nlargest`       | Pulls out the rows with the largest values in a column |
| `shape`          | Attribute that returns a tuple representing the dimensionality of the DataFrame |
| `ndim`           | Attribute that returns an int representing the number of axes/array dimensions (1 if Series, 2 if DataFrame) |
| `dropna`         | Drops rows or columns with null values in different ways |
| `fillna`         | Replaces NaN values with a value provided by the user |
| `rank`           | Ranks the values in a Series or DataFrame in order |
| `query`          | Alternate string-based syntax for extracting a subset from a DataFrame |
| `copy`           | Creates an independent copy of a pandas object |
| `duplicated`     | Returns a Boolean Series marking duplicate rows, which can be used to extract them |
| `drop_duplicates`| Removes duplicate rows from a DataFrame |
| `set_index`      | Sets the DataFrame index (row labels) using one or more existing columns |
| `reset_index`    | Resets the index of a DataFrame to the default integer index |
| `where`          | Keeps values where one or more conditions are satisfied. By default, entries not satisfying the condition are replaced with NaN |
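
To show how a few of the methods in the table fit together, here is a minimal sketch that combines drop_duplicates(), sort_values(), nlargest(), query() and value_counts(). It reuses the student data from the iteration examples, with one repeated row added on purpose to illustrate duplicate removal.

```python
import pandas as pd

# Student data from the earlier examples, plus one deliberately repeated row
df = pd.DataFrame({'name': ["aparna", "pankaj", "sudhir", "Geeku", "pankaj"],
                   'degree': ["MBA", "BCA", "M.Tech", "MBA", "BCA"],
                   'score': [90, 40, 80, 98, 40]})

unique_rows = df.drop_duplicates()                             # removes the repeated row
by_score = unique_rows.sort_values('score', ascending=False)   # highest score first
top_two = unique_rows.nlargest(2, 'score')                     # two best scores
mba_only = unique_rows.query("degree == 'MBA'")                # string-based filter
degree_counts = unique_rows['degree'].value_counts()           # occurrences per degree

print(by_score['name'].tolist())  # ['Geeku', 'aparna', 'sudhir', 'pankaj']
print(top_two['name'].tolist())   # ['Geeku', 'aparna']
print(mba_only['name'].tolist())  # ['aparna', 'Geeku']
print(degree_counts['MBA'])       # 2
```

Each of these calls returns a new object and leaves `df` unchanged, which is why the intermediate results are stored in separate variables.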
