# **Basic Theory for Best Understanding**

In [None]:
'''Pandas is an open-source Python library that provides high-performance data manipulation 
and analysis tools built on its powerful data structures. The name Pandas is derived from 
"panel data", an econometrics term for multidimensional data.
In 2008, developer Wes McKinney started developing Pandas because he needed a 
high-performance, flexible tool for data analysis. 
Before Pandas, Python was used mainly for data munging and preparation and contributed 
little to data analysis itself. Pandas solved this problem. Using Pandas, we can 
accomplish five typical steps in the processing and analysis of data, regardless of its 
origin: load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of academic and commercial fields, 
including finance, economics, statistics, and analytics.'''

In [None]:
'''Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing, and subsetting of large data sets.
- Columns can be inserted into or deleted from a data structure.
- Group by data for aggregation and transformations.
- High-performance merging and joining of data.
- Time series functionality.'''

# **Environment Setup**

In [None]:
'''The standard Python distribution does not come bundled with the Pandas module. A lightweight 
way to get it is to install Pandas with pip, the popular Python package installer.'''

In [None]:
#pip install pandas

In [None]:
"For Ubuntu users"

In [None]:
#sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

# **Introduction to Data Structures**

In [None]:
'''Pandas deals with the following three data structures:
- Series
- DataFrame
- Panel
These data structures are built on top of NumPy arrays, which means they are fast.'''

# **Dimension & Description**

In [None]:
'''The best way to think of these data structures is that a higher-dimensional data structure 
is a container of its lower-dimensional data structure. For example, a DataFrame is a 
container of Series, and a Panel is a container of DataFrames.'''

In [None]:
'''For example, with tabular data (DataFrame) it is more semantically helpful to think of 
the index (the rows) and the columns rather than axis 0 and axis 1.'''
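
As a rough illustration of the point above, the sketch below builds a small DataFrame (the data and variable names are made up for this example) and refers to the rows and columns by name instead of by axis number.

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook.
df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]}, index=['a', 'b', 'c'])

row_labels = list(df.index)       # labels along axis 0 (the rows)
column_labels = list(df.columns)  # labels along axis 1 (the columns)

# axis='index' is the same as axis=0: collapse the rows, one total per column.
col_totals = df.sum(axis='index')
# axis='columns' is the same as axis=1: collapse the columns, one total per row.
row_totals = df.sum(axis='columns')

print(row_labels, column_labels)
print(col_totals)
print(row_totals)
```

Spelling out `axis='index'` or `axis='columns'` tends to read more clearly than the bare numbers 0 and 1.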


# **Mutability**

In [None]:
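'''All Pandas data structures are value mutable (their contents can be changed), and all 
except Series are size mutable. Series is size immutable. 
Note: DataFrame is widely used and is one of the most important data structures. Panel is 
rarely used; it was deprecated and then removed in pandas 0.25, so the Panel material here 
applies only to older versions.'''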
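
A minimal sketch of the two kinds of mutability described above, using made-up data: changing a value in a Series keeps its length, while inserting a column changes the shape of a DataFrame.

```python
import pandas as pd

# Value mutability: overwrite one element; the Series keeps its length.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s['b'] = 20

# Size mutability: insert a new column into an existing DataFrame.
df = pd.DataFrame({'one': [1, 2, 3]})
df['two'] = [4, 5, 6]

print(s)
print(df.shape)
```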


# **Series**

In [None]:
# Series is a one-dimensional array-like structure with homogeneous data. For example, a 
# series can hold a collection of integers such as 10, 23, 56.


'''Key Points
- Homogeneous data
- Size immutable
- Values of data mutable'''

# **DataFrame**

In [None]:
# DataFrame is a two-dimensional array with heterogeneous data

# **Panel**

In [None]:
'''Panel is a three-dimensional data structure with heterogeneous data. A panel is hard to 
represent graphically, but it can be illustrated as a 
container of DataFrames.'''

'''Key Points
- Heterogeneous data
- Size mutable
- Data mutable'''


# **Start Coding**

# **Series**


In [2]:
#pandas.Series(data, index, dtype, copy)
                                            #data takes various forms like ndarray, list, constants.
                                            #index values must be hashable and the same length as data; they need not be unique.
                                            #Defaults to np.arange(n) if no index is passed.
                                            #dtype is the data type. If None, the data type is inferred.
                                            #copy: whether to copy the input data. Default False.
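
The constructor parameters listed above can be tried out with a short sketch like the following. The values and labels are invented for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# Explicit index and dtype: the integers are stored as floats.
s = pd.Series(data=values, index=['x', 'y', 'z'], dtype='float64')

# No index passed: labels default to 0 .. n-1.
default_index = pd.Series(values).index

print(s)
print(list(default_index))
```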

# **Create an Empty Series**

In [4]:
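#import the pandas library and alias it as pd
import pandas as pd
s = pd.Series(dtype='float64')  #pass dtype explicitly; the default dtype of an empty Series differs across pandas versions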
print(s)

Series([], dtype: float64)




# **Create a Series from ndarray**


In [6]:
#import the pandas library and alias it as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
s                 #We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3

0    a
1    b
2    c
3    d
dtype: object

In [8]:
#import the pandas library and alias it as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s= pd.Series(data,index=[100,101,102,103])
s

100    a
101    b
102    c
103    d
dtype: object

# **Create a Series from dict**


In [None]:
'''A dict can be passed as input. If no index is specified, the dictionary keys are used 
to construct the index (in insertion order on current pandas versions; older versions 
sorted them). If an index is passed, the values in data corresponding to the labels in 
the index are pulled out.'''

In [9]:
#import the pandas library and alias it as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s= pd.Series(data)
s

a    0.0
b    1.0
c    2.0
dtype: float64

In [11]:
#import the pandas library and alias it as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
s          #Index order is preserved and the missing element is filled with NaN (Not a Number).

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

# **Create a Series from Scalar**

In [13]:
#import the pandas library and alias it as pd
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
s

0    5
1    5
2    5
3    5
dtype: int64

# **Accessing Data from Series with Position**

In [15]:
import pandas as pd
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
#retrieve the first element by position (.iloc is explicit; s[0] on a labelled index is deprecated in newer pandas)
print(s.iloc[0])

1


In [16]:
import pandas as pd
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
#retrieve the first three elements
print(s[:3])


a    1
b    2
c    3
dtype: int64


# **Retrieve Data Using Label (Index)**

In [17]:
import pandas as pd
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
#retrieve a single element
print(s['a'])

1


In [18]:
import pandas as pd
s=pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])
#retrieve multiple elements using a list of index label values
print(s[['a','c','d']])


a    1
c    3
d    4
dtype: int64
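
The position-based and label-based lookups above can also be written with the explicit `.iloc` and `.loc` indexers. The sketch below reuses the same Series. The label `'z'` is a made-up missing key used to show `get()` with a fallback value.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]             # by position
by_label = s.loc['a']         # by label
pos_slice = s.iloc[1:3]       # positions 1 and 2; the end is excluded
label_slice = s.loc['b':'d']  # labels b through d; the end is included
missing = s.get('z', default=-1)  # no KeyError for an absent label

print(first, by_label, missing)
print(pos_slice)
print(label_slice)
```

Note the difference between the two slices: positional slices exclude the end point, while label slices include it.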


# **DataFrame**

In [None]:
#pandas.DataFrame(data, index, columns, dtype, copy)      #data takes various forms like ndarray, Series, map, lists, dict, constants, and also another DataFrame.
                                                         #index: row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
                                                         #columns: column labels. Optional; defaults to np.arange(n) if no column labels are passed.
                                                         #dtype: data type of each column.
                                                         #copy: whether to copy the input data. Default False.
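
To make the parameter list above concrete, here is a small sketch that passes `data`, `index`, `columns`, and `dtype` together. The array and the labels are invented for this example.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)  # 3 rows, 2 columns: [[0, 1], [2, 3], [4, 5]]

# Explicit row labels, column labels, and dtype.
df = pd.DataFrame(data=arr, index=['r1', 'r2', 'r3'], columns=['a', 'b'], dtype='float64')

# Without index/columns, both default to 0 .. n-1.
plain = pd.DataFrame(arr)

print(df)
print(list(plain.index), list(plain.columns))
```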



# **Create DataFrame**

# **Create an Empty DataFrame**

In [19]:
#import the pandas library and alias it as pd
import pandas as pd
df = pd.DataFrame()
df

# **Create a DataFrame from Lists**


In [20]:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
df

   0
0  1
1  2
2  3
3  4
4  5


In [21]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]] 
df = pd.DataFrame(data,columns=['Name','Age']) 
df

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


In [22]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]] 
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})  #astype converts only the Age column to floating point; dtype=float in the constructor applies to every column, including Name.
df



     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0


# **Create a DataFrame from Dict of ndarrays/Lists**

In [None]:
'''All the ndarrays must be of the same length. If an index is passed, then the length of the 
index must equal the length of the arrays.
If no index is passed, then by default the index will be range(n), where n is the array length.'''
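
The length rule above can be checked with a short sketch. The `rank1` to `rank4` labels are hypothetical, and the second call deliberately passes an index of the wrong length to show the resulting error.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age': [28, 34, 29, 42]}

# Index length (4) matches the length of the lists.
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

# Index length (2) does not match, so pandas refuses to build the frame.
try:
    pd.DataFrame(data, index=['rank1', 'rank2'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```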

In [23]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
 'Age':[28,34,29,42]}
df= pd.DataFrame(data)
df

    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42


# **Create a DataFrame from List of Dicts**

In [24]:
import pandas as pd
data = [{'a': 1, 'b': 2},
         {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
df

   a   b     c
0  1   2   NaN
1  5  10  20.0


In [25]:
import pandas as pd
data = [{'a': 1, 'b': 2}, 
 {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
df

        a   b     c
first   1   2   NaN
second  5  10  20.0


In [28]:
import pandas as pd
data = [{'a': 1, 'b': 2}, 
 {'a': 5, 'b': 10, 'c': 20}]
#With two column indices, values same as dictionary keys 
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices with one index with other name 
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
  
'''Observe that df2 is created with a column label (b1) that is not a dictionary 
key, so that column is filled with NaN. df1, in contrast, is created with column labels 
that match the dictionary keys, so no NaN values appear.'''

        a   b
first   1   2
second  5  10
        a  b1
first   1 NaN
second  5 NaN


# **Create a DataFrame from Dict of Series**

In [29]:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


# **Column Selection**


In [30]:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df['one'])


a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64


# **Column Addition**

In [32]:
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Add a new column to an existing DataFrame by assigning a Series to a new column label
print("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)
print ("Adding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
df

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:


   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN


# **Column Deletion**


In [33]:
# Using the previous DataFrame, we will delete a column
# using the del statement
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
 'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print ("Our dataframe is:")
print(df)
# using the del statement
print ("Deleting the first column using DEL function:")
del df['one']
print(df)
# using the pop method
print("Deleting another column using POP function:")
df.pop('two')
print(df)

Our dataframe is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN


# **Row Selection, Addition, and Deletion**

In [34]:
#Rows can be selected by passing a row label to the loc indexer.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df.loc['b'])


one    2.0
two    2.0
Name: b, dtype: float64


# **Selection by integer location**

In [35]:
#Rows can be selected by passing an integer location to the iloc indexer.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df.iloc[2])


one    3.0
two    3.0
Name: c, dtype: float64


# **Slice Rows**

In [36]:
#Multiple rows can be selected using the slice operator ':'.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df[2:4])

   one  two
c  3.0    3
d  NaN    4
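
The same kind of slice can be written with the explicit indexers. The sketch below, which reuses the DataFrame from the cell above, contrasts a positional slice with `iloc` against a label slice with `loc`.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

by_position = df.iloc[2:4]   # rows at positions 2 and 3 (end excluded): c, d
by_label = df.loc['b':'c']   # rows labelled b through c (end included): b, c

print(by_position)
print(by_label)
```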


# **Addition of Rows**


In [37]:
# Add new rows to a DataFrame with pd.concat, which places the rows of df2 after those of df.
# (DataFrame.append did this in older pandas versions; it was removed in pandas 2.0.)

import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a','b'])
df = pd.concat([df, df2])
print (df)

   a  b
0  1  2
1  3  4
0  5  6
1  7  8
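
The output above keeps the original row labels, so 0 and 1 each appear twice. If fresh labels are preferred, one option is `pd.concat` with `ignore_index=True`, sketched here with the same two frames (the name `combined` is just for this example).

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the old labels and renumbers the rows 0 .. n-1.
combined = pd.concat([df, df2], ignore_index=True)
print(combined)
```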


# **Deletion of Rows**

In [38]:
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a','b'])
df = pd.concat([df, df2])  #DataFrame.append was removed in pandas 2.0
# Drop rows with label 0; two rows carry that label, so both are removed
df = df.drop(0)
print (df)

   a  b
1  3  4
1  7  8
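
Two rows disappeared above because the label 0 was duplicated. As a sketch of the alternative, the example below renumbers the rows first so that `drop(0)` removes a single row, and also shows dropping a column by name. The variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# Unique labels 0..3, so each label identifies exactly one row.
combined = pd.concat([df, df2], ignore_index=True)

without_first = combined.drop(0)         # removes only the row labelled 0
without_b = combined.drop(columns='b')   # removes the column named b

print(without_first)
print(without_b)
```

`drop` returns a new object, so `combined` itself is left unchanged.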


# **Basic Functionality**

In [45]:
import pandas as pd
import numpy as np
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
s

0    0.636170
1    0.337554
2    0.619479
3   -1.513822
dtype: float64

# **axes**


In [46]:
#Returns a list containing the row axis labels of the series.

import pandas as pd
import numpy as np
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The axes are:")
print (s.axes)

The axes are:
[RangeIndex(start=0, stop=4, step=1)]


# **empty**

In [47]:
import pandas as pd
import numpy as np
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("Is the Object empty?")
print (s.empty)

Is the Object empty?
False


# **ndim**


In [49]:
#Returns the number of dimensions of the object. By definition, a Series is a 1D data structure, so it returns 1.

import pandas as pd
import numpy as np
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print(s)
print ("The dimensions of the object:")
print(s.ndim)

0   -0.496281
1   -0.689762
2   -0.157701
3    1.621578
dtype: float64
The dimensions of the object:
1


# **size**

In [50]:
import pandas as pd
import numpy as np
#Create a series with 2 random numbers
s = pd.Series(np.random.randn(2))
print (s)
print ("The size of the object:")
print (s.size)

0    0.860418
1    0.355710
dtype: float64
The size of the object:
2


# **Try Yourself these functions**

In [None]:
# values         (an attribute, used without parentheses)
# head() and tail()
# shape          (an attribute, used without parentheses)
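
As a starting point for the exercise above, here is a minimal sketch on a made-up Series. Note that `values` and `shape` are attributes, while `head` and `tail` are methods.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

raw = s.values        # the underlying data as a NumPy array
first_two = s.head(2) # first 2 rows (head() alone returns the first 5)
last_two = s.tail(2)  # last 2 rows
dims = s.shape        # a tuple; one entry for a Series

print(raw)
print(first_two)
print(last_two)
print(dims)
```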


# **T (Transpose)**

In [51]:
import pandas as pd
import numpy as np
# Create a Dictionary of series
d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
# Create a DataFrame
df = pd.DataFrame(d)
print ("The transpose of the data series is:")
print(df.T)

The transpose of the data series is:
           0      1      2     3      4      5     6
Name     Tom  James  Ricky   Vin  Steve  Smith  Jack
Age       25     26     25    23     30     29    23
Rating  4.23   3.24   3.98  2.56    3.2    4.6   3.8


#  **Descriptive Statistics**

In [None]:
# sum(), 
# mean(),
# median(),
# std()
# mode() 
# min() 
# max()
# abs() 
# prod()     Product of Values
# cumsum()   Cumulative Sum
# cumprod()  Cumulative Product

# **sum()**

In [52]:
#Returns the sum of the values for the requested axis. By default, axis is index (axis=0).

import pandas as pd
import numpy as np
#Create a Dictionary of series
d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print (df.sum())

Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object


# **mean()**

In [53]:
#Returns the average value

import pandas as pd
import numpy as np
#Create a Dictionary of series
d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print (df.mean(numeric_only=True))  #numeric_only=True skips the string column Name


Age       31.833333
Rating     3.743333
dtype: float64




# **std()**


In [54]:
#Returns the Bessel-corrected (sample) standard deviation of the numerical columns.

import pandas as pd
import numpy as np
#Create a Dictionary of series
d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print (df.std(numeric_only=True))  #numeric_only=True skips the string column Name

Age       9.232682
Rating    0.661628
dtype: float64




# **Try these as your Practice**

In [None]:
# mode() 
# min() 
# max()
# abs() 
# prod()     Product of Values
# cumsum()   Cumulative Sum
# cumprod()  Cumulative Product
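
For the practice list above, the following sketch runs each function on a small made-up Series so the results are easy to check by hand.

```python
import pandas as pd

s = pd.Series([3, -1, 2, 2])

smallest = s.min()          # -1
largest = s.max()           # 3
most_common = s.mode()      # Series holding the most frequent value(s): 2
absolute = s.abs()          # 3, 1, 2, 2
product = s.prod()          # 3 * -1 * 2 * 2 = -12
running_sum = s.cumsum()    # 3, 2, 4, 6
running_prod = s.cumprod()  # 3, -3, -6, -12

print(smallest, largest, product)
print(most_common.tolist(), absolute.tolist())
print(running_sum.tolist(), running_prod.tolist())
```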

# **Summarizing Data**

In [55]:
#The describe() function computes a summary of statistics pertaining to the DataFrame columns.


import pandas as pd
import numpy as np
#Create a Dictionary of series
d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print (df.describe())


             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000


In [56]:
#The include argument of describe() selects which columns to summarize:
#  object - Summarizes string columns
#  number - Summarizes numeric columns
#  all - Summarizes all columns together (pass it as the string 'all', not as a list)


import pandas as pd
import numpy as np
#Create a Dictionary of series
d={'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print (df.describe(include=['object']))


       Name
count    12
unique   12
top     Tom
freq      1
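
The other two `include` options can be tried the same way. The sketch below uses a shortened, three-row version of the data so the summaries stay small.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, 3.24, 3.98]})

# Numeric columns only (this matches the default behaviour for this frame).
summary_num = df.describe(include=['number'])

# Every column: 'all' is passed as a plain string, not inside a list.
summary_all = df.describe(include='all')

print(summary_num)
print(summary_all)
```

In the `include='all'` result, statistics that do not apply to a column (for example `mean` for `Name`) show up as NaN.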


# **Function Application**

In [None]:
'''To apply your own or another library's functions to Pandas objects, you should be aware 
of three important methods, described below. The appropriate 
method to use depends on whether your function expects to operate on an 
entire DataFrame, row- or column-wise, or element-wise.
- Table-wise function application: pipe()
- Row- or column-wise function application: apply()
- Element-wise function application: applymap()'''
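
A minimal sketch of the three styles listed above, on made-up data. The helper `add_offset` is hypothetical. For the element-wise case the sketch maps over each column with `Series.map` instead of calling `applymap()`, because newer pandas releases (2.1 and later) deprecate `applymap()` in favour of `DataFrame.map()`, and `Series.map` works in both old and new versions.

```python
import pandas as pd

def add_offset(frame, offset):
    """Hypothetical table-wise helper: add a constant to every cell."""
    return frame + offset

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# Table-wise: pipe() hands the whole DataFrame to the function.
shifted = df.pipe(add_offset, 2)

# Column-wise (default axis=0): the function receives one column at a time.
col_ranges = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives one row at a time.
row_sums = df.apply(lambda row: row.sum(), axis=1)

# Element-wise: apply a scalar function to every value, column by column.
doubled = df.apply(lambda col: col.map(lambda v: v * 2))

print(shifted)
print(col_ranges)
print(row_sums)
print(doubled)
```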


# **Table-wise Function Application**

In [None]:
# This notebook ends here for now. It will be updated from time to time, and more advanced content will be available soon. Thank you.