# **PANDAS**
---  
<img src = 'https://assets.rbl.ms/629688/980x.jpg' style = "height: 400px;width:600px;"/>

## Introduction  
Pandas is an open-source library built for working with relational or labeled data easily and intuitively. It provides data structures and operations for manipulating numerical tables and time series, and it is built on top of the NumPy library. Pandas is fast for most in-memory data work and helps users stay productive.  

## What is Pandas?  

- Pandas is a Python library used for working with data sets.

- It has functions for analyzing, cleaning, exploring, and manipulating data.

- The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.  

## Why Use Pandas?  

- Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

- Pandas can clean messy data sets and make them readable and relevant.

- Relevant data is very important in data science.  

## Advantages  
- Fast and efficient for manipulating and analyzing data. 
- Data can be loaded from different file formats, such as CSV files, Excel files, and SQL databases. 
- Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data. 
- Size mutability: columns can be inserted into and deleted from DataFrames and higher dimensional objects. 
- Data set merging and joining. 
- Flexible reshaping and pivoting of data sets 
- Provides time-series functionality. 
- Powerful group by functionality for performing split-apply-combine operations on data sets. 
---
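
Three of the advantages listed above (merging, group by, and pivoting) can be sketched on a toy example. The table names, column names, and values below are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy tables; names and values are made up for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Delhi", "Pune"],
})

# Merge: attach each customer's city to their orders.
merged = orders.merge(customers, on="customer_id", how="left")

# Group by: split the rows by city, sum the amounts, combine into one Series.
total_by_city = merged.groupby("city")["amount"].sum()

# Pivot: one row per city, one column per customer.
pivot = merged.pivot_table(index="city", columns="customer_id",
                           values="amount", aggfunc="sum")

print(total_by_city)
print(pivot)
```

Cells in the pivot with no matching orders are shown as NaN, which is how Pandas marks missing data.
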



## Install Pandas    
Use the command: `pip install pandas`  

__Pandas provides two main data structures for manipulating data:__

- Series 
- DataFrame  

__Series:__  

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index.  

$ \color{blue}{\text{In the real world, a Pandas Series is usually created by loading data from existing storage, such as a SQL database, a CSV file, or an Excel file.}}$  
A Pandas Series can also be created from a list, a dictionary, a scalar value, and so on.
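
The three ways of creating a Series mentioned above can be sketched as follows. The variable names and values are illustrative only.

```python
import pandas as pd

# From a list: pandas assigns the default integer index 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a scalar: the value is repeated for every label in the index.
from_scalar = pd.Series(5, index=["x", "y", "z"])

print(from_list)
print(from_dict)
print(from_scalar)
```
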


In [1]:
import pandas as pd
import numpy as np

In [7]:
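# A NumPy array has a single dtype, so these mixed values are all converted to strings
s = np.array([1,2,3,'sai', np.nan])
print(type(s))
print(s)
# Wrap the array in a Series; the elements stay strings (dtype: object)
s = pd.Series(s)
print(type(s))
print(s)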

<class 'numpy.ndarray'>
['1' '2' '3' 'sai' 'nan']
<class 'pandas.core.series.Series'>
0      1
1      2
2      3
3    sai
4    nan
dtype: object
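
In the output above, the last element is the text `'nan'`, not a missing value, because the NumPy array converted every element to a string before the Series was built. A minimal sketch of the difference, assuming the goal is to keep the original types, passes the list to `pd.Series` directly:

```python
import numpy as np
import pandas as pd

mixed = [1, 2, 3, "sai", np.nan]

# Through a NumPy array: every element is converted to a string first,
# so the last entry is the text 'nan' and is not treated as missing.
via_array = pd.Series(np.array(mixed))

# Directly from the list: each element keeps its own Python type,
# so the last entry is a real NaN.
direct = pd.Series(mixed)

print(via_array.isna().sum())
print(direct.isna().sum())
```
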


### DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can hold different types.

In [14]:
data = np.random.randn(10,3)
index = np.arange(0,10)
pd.DataFrame(data,index,columns = ['a','b','c'])

|  | a | b | c |
|---|---|---|---|
| 0 | -0.23024 | 0.321891 | 0.427451 |
| 1 | 0.833057 | -0.565531 | -0.662137 |
| 2 | 0.514412 | 1.433143 | 1.053655 |
| 3 | -0.923316 | 1.397798 | 0.613851 |
| 4 | 1.061436 | -1.960011 | 1.509203 |
| 5 | -1.097202 | -2.156628 | 1.079989 |
| 6 | 0.324354 | -0.665589 | 1.050158 |
| 7 | 0.817193 | -0.225649 | -0.107562 |
| 8 | 1.047717 | 0.39202 | 2.15144 |
| 9 | -0.247353 | -0.24005 | -1.51504 |


In [10]:
# The start date must be a string; an unquoted 1/1/2021 is evaluated as a division
dates = pd.date_range(start = '1/1/2021', periods = 10)
dates

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10'],
              dtype='datetime64[ns]', freq='D')

In [15]:
pd.DataFrame(np.random.randint(5,size = (10,4)), index = dates, columns = ['A','B','C','D'])

|  | A | B | C | D |
|---|---|---|---|---|
| 2021-01-01 | 2 | 0 | 3 | 3 |
| 2021-01-02 | 2 | 3 | 3 | 1 |
| 2021-01-03 | 0 | 0 | 2 | 4 |
| 2021-01-04 | 1 | 3 | 2 | 2 |
| 2021-01-05 | 4 | 4 | 0 | 3 |
| 2021-01-06 | 2 | 4 | 4 | 0 |
| 2021-01-07 | 1 | 3 | 2 | 1 |
| 2021-01-08 | 4 | 1 | 4 | 0 |
| 2021-01-09 | 3 | 3 | 2 | 4 |
| 2021-01-10 | 0 | 3 | 2 | 3 |
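
A frame indexed by dates supports slicing by date label and resampling into coarser time bins. The sketch below uses a made-up `value` column and a 5-day bin size chosen only for illustration.

```python
import pandas as pd

# Ten consecutive days; the start date is passed as a string.
idx = pd.date_range(start="2021-01-01", periods=10, freq="D")
ts = pd.DataFrame({"value": range(10)}, index=idx)

# Label-based slicing by date includes both endpoints.
first_three = ts.loc["2021-01-01":"2021-01-03"]

# Resample into 5-day bins and sum the values in each bin.
five_day_totals = ts.resample("5D").sum()

print(first_three)
print(five_day_totals)
```
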


In [26]:
df  = pd.DataFrame({'A' : [1,2,np.nan,4],
                    'B' : pd.Timestamp('20200301'),
                    'C' : pd.Series(1, index = list(range(4)), dtype = 'float32'),
                    'D' : np.array([6]*4),
                    'E' : pd.Categorical(['True','False','True','False']),
                    'F' : 'skill assure'
})
df

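|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure |
| 1 | 2.0 | 2020-03-01 | 1.0 | 6 | False | skill assure |
| 2 | NaN | 2020-03-01 | 1.0 | 6 | True | skill assure |
| 3 | 4.0 | 2020-03-01 | 1.0 | 6 | False | skill assure |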


In [27]:
df.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       3 non-null      float64       
 1   B       4 non-null      datetime64[ns]
 2   C       4 non-null      float32       
 3   D       4 non-null      int32         
 4   E       4 non-null      category      
 5   F       4 non-null      object        
dtypes: category(1), datetime64[ns](1), float32(1), float64(1), int32(1), object(1)
memory usage: 288.0+ bytes


In [29]:
df.head(1)

|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure |


In [30]:
df.tail(2)

|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 2 | NaN | 2020-03-01 | 1.0 | 6 | True | skill assure |
| 3 | 4.0 | 2020-03-01 | 1.0 | 6 | False | skill assure |


In [31]:
df.isnull()

|  | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | True | False | False | False | False | False |
| 3 | False | False | False | False | False | False |


In [32]:
df.isnull().sum()

A    1
B    0
C    0
D    0
E    0
F    0
dtype: int64
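
Once missing values have been counted, they are usually either filled or dropped. The outputs further down show `1.0` in row 2 of column `A`, which is consistent with a fill like the one sketched here. The exact step used in this notebook is not shown, so the fill value of 1 is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-alone copy of column 'A' from the frame above.
a = pd.Series([1, 2, np.nan, 4], name="A")

# Option 1: replace missing values with a constant (1 is an assumption here).
filled = a.fillna(1)

# Option 2: drop the entries that are missing.
dropped = a.dropna()

print(filled.tolist())
print(dropped.tolist())
```
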

In [36]:
df.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [37]:
df.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [38]:
df.values

array([[1.0, Timestamp('2020-03-01 00:00:00'), 1.0, 6, 'True',
        'skill assure'],
       [2.0, Timestamp('2020-03-01 00:00:00'), 1.0, 6, 'False',
        'skill assure'],
       [1.0, Timestamp('2020-03-01 00:00:00'), 1.0, 6, 'True',
        'skill assure'],
       [4.0, Timestamp('2020-03-01 00:00:00'), 1.0, 6, 'False',
        'skill assure']], dtype=object)

In [39]:
df.ndim

2

In [40]:
df.axes

[Int64Index([0, 1, 2, 3], dtype='int64'),
 Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')]

In [41]:
df.shape

(4, 6)

In [43]:
df.describe()

|  | A | C | D |
|---|---|---|---|
| count | 4.0 | 4.0 | 4.0 |
| mean | 2.0 | 1.0 | 6.0 |
| std | 1.414214 | 0.0 | 0.0 |
| min | 1.0 | 1.0 | 6.0 |
| 25% | 1.0 | 1.0 | 6.0 |
| 50% | 1.5 | 1.0 | 6.0 |
| 75% | 2.5 | 1.0 | 6.0 |
| max | 4.0 | 1.0 | 6.0 |


In [52]:
df['A']

0    1.0
1    2.0
2    1.0
3    4.0
Name: A, dtype: float64

In [45]:
df['G']  = [23,34,52,23]
df

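|  | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure | 23 |
| 1 | 2.0 | 2020-03-01 | 1.0 | 6 | False | skill assure | 34 |
| 2 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure | 52 |
| 3 | 4.0 | 2020-03-01 | 1.0 | 6 | False | skill assure | 23 |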


In [54]:
df.loc[0,['A','B']]

A                    1.0
B    2020-03-01 00:00:00
Name: 0, dtype: object

In [46]:
df.iloc[0]

A                    1.0
B    2020-03-01 00:00:00
C                    1.0
D                      6
E                   True
F           skill assure
G                     23
Name: 0, dtype: object
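
With the default integer index, `loc` and `iloc` can look interchangeable. A small sketch with string labels, using a made-up `scores` frame, shows the difference: `loc` selects by label and `iloc` selects by integer position.

```python
import pandas as pd

# Hypothetical frame with string labels so that labels and positions differ.
scores = pd.DataFrame(
    {"math": [90, 75, 60], "art": [70, 85, 95]},
    index=["ann", "bob", "cy"],
)

# loc selects by label: row 'bob', column 'art'.
by_label = scores.loc["bob", "art"]

# iloc selects by integer position: second row, second column.
by_position = scores.iloc[1, 1]

# Label slices include the end label; position slices exclude the end position.
label_slice = scores.loc["ann":"bob"]   # two rows
position_slice = scores.iloc[0:1]       # one row

print(by_label, by_position)
print(label_slice)
print(position_slice)
```
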

In [50]:
df.sort_index(axis = 0, ascending = False)

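|  | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 3 | 4.0 | 2020-03-01 | 1.0 | 6 | False | skill assure | 23 |
| 2 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure | 52 |
| 1 | 2.0 | 2020-03-01 | 1.0 | 6 | False | skill assure | 34 |
| 0 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure | 23 |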


In [55]:
# Filter the rows and pick column A in a single .loc call
df.loc[df['G'] > 20, 'A']

0    1.0
1    2.0
2    1.0
3    4.0
Name: A, dtype: float64
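
Filters can also combine several conditions. The sketch below rebuilds only columns `A` and `G` in a stand-alone frame (the name `df_demo` is illustrative) and applies two conditions at once.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0, 1.0, 4.0], "G": [23, 34, 52, 23]})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df_demo["G"] > 20) & (df_demo["A"] > 1)

# Select the matching rows and one column in a single .loc call.
selected = df_demo.loc[mask, "A"]
print(selected)
```
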

In [51]:
df.sort_values(by = 'G')

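|  | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure | 23 |
| 3 | 4.0 | 2020-03-01 | 1.0 | 6 | False | skill assure | 23 |
| 1 | 2.0 | 2020-03-01 | 1.0 | 6 | False | skill assure | 34 |
| 2 | 1.0 | 2020-03-01 | 1.0 | 6 | True | skill assure | 52 |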
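
`sort_values` also accepts several columns and a sort direction per column. A minimal sketch on a stand-alone frame with the same `A` and `G` values (the name `df_demo` is illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0, 1.0, 4.0], "G": [23, 34, 52, 23]})

# Sort by G descending, then break ties on A ascending.
ordered = df_demo.sort_values(by=["G", "A"], ascending=[False, True])
print(ordered)
```
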


# Real World Implementations  
---  
## Recommendation Systems  

Most of us have used Spotify or Netflix and been impressed by the recommendations these services provide. Such systems are typically built with machine learning, often deep learning, and preparing the data for them is one of the most important applications of Pandas. Many of these models are written in Python, where Pandas is one of the main libraries for handling data. A recommendation system depends on learning from large volumes of data, and Pandas is well suited to managing it. Functions such as `groupby` and `map` help considerably in building these systems.  
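
As a rough illustration of how group by and mapping show up in this kind of work, the sketch below ranks titles by average rating. It is a toy popularity ranking, not a real recommender, and the users, titles, genres, and ratings are all made up.

```python
import pandas as pd

# Hypothetical ratings; users, titles, and scores are made up.
ratings = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "title": ["Film A", "Film B", "Film A", "Film C", "Film B"],
    "rating": [5, 3, 4, 2, 5],
})
genres = {"Film A": "drama", "Film B": "comedy", "Film C": "drama"}

# map: look up a genre for each title.
ratings["genre"] = ratings["title"].map(genres)

# groupby: average rating per title, highest first.
top_titles = (
    ratings.groupby("title")["rating"].mean().sort_values(ascending=False)
)
print(top_titles)
```
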
  
## Data Science  

Pandas and data science are almost synonymous. Most of the examples above are themselves data science tasks. Data science is a broad umbrella that covers anything dealing with the analysis of data, so almost all applications of Pandas fall within its scope. Pandas is mainly used for processing the data, which is why doing data science in Python without Pandas is very difficult.