# PANDAS

The name pandas is derived from the term "panel data".

Pandas is built on top of the NumPy library.

Pandas offers fast data analysis, data cleaning, and data preparation, and it has built-in visualization features.

• Data structures with labeled axes supporting automatic or explicit data alignment;
this prevents common errors resulting from misaligned data and from working with
differently indexed data coming from different sources

• Integrated time series functionality

• The same data structures handle both time series data and non–time series data

• Arithmetic operations and reductions that preserve metadata

• Flexible handling of missing data

• Merge and other relational operations found in popular databases (SQL-based,
for example)
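The first and fifth points in the list above (alignment on labels and handling of missing data) can be sketched with two small Series. The names `q1`, `q2`, and the store labels are made up for illustration; only `pd.Series`, the `+` operator, and `Series.add` with `fill_value` are pandas API.

```python
import pandas as pd

# Two hypothetical Series whose index labels only partly overlap.
q1 = pd.Series([100, 200, 300], index=["north", "south", "east"])
q2 = pd.Series([10, 20, 30], index=["south", "east", "west"])

# Addition aligns on the index labels, not on position.
# A label present in only one operand yields NaN in the result.
total = q1 + q2

# add() with fill_value treats a label missing from one side as 0 instead.
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```

Because "north" and "west" each appear in only one Series, they come out as NaN in `total` but keep their single value in `total_filled`.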

Pandas also provides data frames similar to those found in R.

The following topics will be discussed:

1. Series
2. Data Frames
3. Missing Data
4. Group By
5. Merging, Joining and Concatenation
6. Operations
7. Data Input and Output



# Pandas Series

There is a significant difference between NumPy arrays and Series.
A Series can have axis labels, while an array cannot.
A Series is built on top of the NumPy array object, but unlike an array it can be indexed by label.

In [1]:
import pandas as pd
import numpy as np

In [2]:
labels = ['a','b','c']
my_data = [10,20,30]

arr = np.array(my_data)

d = {'a':10, 'b':20,'c':30}

In [3]:
pd.Series(data = my_data)

0    10
1    20
2    30
dtype: int64

In [4]:
# Label the data by passing an index.

pd.Series(data=my_data, index=labels)

# The output below shows the data with its labels.

a    10
b    20
c    30
dtype: int64

In [5]:
# Another way to create a Series is to pass a NumPy array.

pd.Series(arr, labels)

a    10
b    20
c    30
dtype: int32

In [6]:
# We can even pass a dictionary to Series().

# pandas automatically takes the dictionary keys and uses them as the labels for the corresponding values.

pd.Series(d)

a    10
b    20
c    30
dtype: int64

In [7]:
# A Series can hold a variety of object types, not only numbers.
# Any Python object can be a data point; such a Series has dtype object.

pd.Series(labels)


0    a
1    b
2    c
dtype: object

In [8]:
# A Series can even hold built-in functions.

pd.Series([sum, print, len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

# ---------------------------------------------------------------------------------------------------------------

In [9]:
ser1 = pd.Series([1,2,3,4],['USA','Germany','USSR','Japan'])

In [10]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [11]:
ser2 = pd.Series([1,2,5,4],['USA','Germany','Italy','Japan'])
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [12]:
# Grabbing an element from a Series works like a Python dictionary lookup.

ser1['USA']

1
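The dictionary-like lookup above can be extended a little. The sketch below rebuilds `ser1` so that it runs on its own; the variable names on the left-hand side are illustrative, and the calls used (`[]`, `.iloc`, `in`, `.get`) are standard Series operations.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ["USA", "Germany", "USSR", "Japan"])

by_label = ser1["Germany"]           # lookup by index label
by_position = ser1.iloc[1]           # lookup by integer position
has_italy = "Italy" in ser1          # membership is tested against the index labels
fallback = ser1.get("Italy", 0)      # get() returns a default instead of raising KeyError
several = ser1[["USA", "Japan"]]     # a list of labels returns a smaller Series

print(by_label, by_position, has_italy, fallback)
print(several)
```

As with a dictionary, `ser1["Italy"]` would raise a `KeyError`, which is why `get()` is handy when a label may be absent.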

# --------------------------------------------------------------------------------------------------------


# Pandas Data Frame

In [16]:
from numpy.random import randn

In [18]:
np.random.seed(101)

In [19]:
# Create a DataFrame called df. Note that df is just a variable name.
# The pandas DataFrame constructor takes these leading arguments:

# DataFrame(data, index, columns)

df = pd.DataFrame(randn(5,4),['a','b','c','d','e'], ['w','x','y','z'])

In [20]:
df

          w         x         y         z
a  2.706850  0.628133  0.907969  0.503826
b  0.651118 -0.319318 -0.848077  0.605965
c -2.018168  0.740122  0.528813 -0.589001
d  0.188695 -0.758872 -0.933237  0.955057
e  0.190794  1.978757  2.605967  0.683509


In [25]:
# Select a single column from the DataFrame.

df['w']

a    2.706850
b    0.651118
c   -2.018168
d    0.188695
e    0.190794
Name: w, dtype: float64

In [26]:
# The following shows the type of a single column, i.e. a Series in this case.

type(df['w'])

pandas.core.series.Series

In [28]:
# Another way to get a column is attribute access.

# Avoid this: if the column name matches a DataFrame method or attribute, that method or attribute is returned instead of the column.

df.w

a    2.706850
b    0.651118
c   -2.018168
d    0.188695
e    0.190794
Name: w, dtype: float64
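The warning about attribute access can be made concrete with a small, hypothetical frame whose column names happen to collide with existing DataFrame attributes (`count` and `shape`). The frame `clash` and its contents are invented for this sketch.

```python
import pandas as pd

# Hypothetical frame: both column names clash with DataFrame attributes.
clash = pd.DataFrame({"count": [1, 2, 3], "shape": ["round", "square", "flat"]})

bracket_count = clash["count"]   # bracket access always returns the column
attr_count = clash.count         # attribute access returns the DataFrame.count method
attr_shape = clash.shape         # attribute access returns the (rows, columns) tuple

print(type(bracket_count))
print(callable(attr_count))
print(attr_shape)
```

Bracket notation never has this ambiguity, which is a reasonable argument for preferring it in shared code.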

When we ask for a single column, we get a Series.

When we ask for multiple columns, a DataFrame with those columns is returned.

Illustrated below:

In [30]:
# To get multiple columns from the dataframe

df[['w','z']]



          w         z
a  2.706850  0.503826
b  0.651118  0.605965
c -2.018168 -0.589001
d  0.188695  0.955057
e  0.190794  0.683509


In [33]:
# Creating a new column

# To create a new column, assign to it as if it already existed.

df['new'] = df['w'] + df['z']

df

          w         x         y         z       new
a  2.706850  0.628133  0.907969  0.503826  3.210676
b  0.651118 -0.319318 -0.848077  0.605965  1.257083
c -2.018168  0.740122  0.528813 -0.589001 -2.607169
d  0.188695 -0.758872 -0.933237  0.955057  1.143752
e  0.190794  1.978757  2.605967  0.683509  0.874303
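As a possible alternative to assigning into `df['new']`, the same column can be built with `DataFrame.assign`, which returns a new frame and leaves the original untouched. This is a sketch rather than the notebook's own approach; the name `with_new` is made up, and the frame is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), ["a", "b", "c", "d", "e"], ["w", "x", "y", "z"])

# assign() returns a copy with the extra column; df itself is not modified.
with_new = df.assign(new=df["w"] + df["z"])

print(with_new.columns.tolist())
print(df.columns.tolist())
```

Which style to use is largely a matter of taste: direct assignment changes `df` immediately, while `assign` fits method chaining.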


In [34]:
# To remove a particular column, use the drop() method.

# df.drop('column name', axis=1)


df.drop('new',axis = 1)

          w         x         y         z
a  2.706850  0.628133  0.907969  0.503826
b  0.651118 -0.319318 -0.848077  0.605965
c -2.018168  0.740122  0.528813 -0.589001
d  0.188695 -0.758872 -0.933237  0.955057
e  0.190794  1.978757  2.605967  0.683509


In [35]:
df

          w         x         y         z       new
a  2.706850  0.628133  0.907969  0.503826  3.210676
b  0.651118 -0.319318 -0.848077  0.605965  1.257083
c -2.018168  0.740122  0.528813 -0.589001 -2.607169
d  0.188695 -0.758872 -0.933237  0.955057  1.143752
e  0.190794  1.978757  2.605967  0.683509  0.874303


In [36]:
# Above we can see that even though we called df.drop(), the column new is still in place,
# i.e. it has not been deleted.

# This is because the inplace argument was not set to True.

df.drop('new', axis = 1, inplace = True)

In [38]:
df

# Now we can see below that it has been dropped

          w         x         y         z
a  2.706850  0.628133  0.907969  0.503826
b  0.651118 -0.319318 -0.848077  0.605965
c -2.018168  0.740122  0.528813 -0.589001
d  0.188695 -0.758872 -0.933237  0.955057
e  0.190794  1.978757  2.605967  0.683509


By default, drop() returns a new DataFrame and leaves the original untouched; pandas requires inplace=True before it modifies df itself, so that we do not erroneously delete data.
When we specify inplace=True, the change to df is permanent.
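The difference between the two calls can be checked directly. The sketch below rebuilds the frame so it runs standalone; `dropped`, `still_there`, `returned`, and `gone` are illustrative names. It also shows the `columns=` keyword of `drop`, an equivalent spelling of `axis=1`.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), ["a", "b", "c", "d", "e"], ["w", "x", "y", "z"])
df["new"] = df["w"] + df["z"]

# Without inplace, drop() returns a new frame and df keeps the column.
dropped = df.drop(columns="new")      # same as df.drop("new", axis=1)
still_there = "new" in df.columns

# With inplace=True, df itself is modified and the call returns None.
returned = df.drop("new", axis=1, inplace=True)
gone = "new" not in df.columns

print(still_there, returned, gone)
```

Assigning the result back, as in `df = df.drop(columns="new")`, has the same end effect as `inplace=True` and makes the mutation explicit.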

In [39]:
# We can use df.drop() to drop rows.

df.drop('e', axis = 0)

          w         x         y         z
a  2.706850  0.628133  0.907969  0.503826
b  0.651118 -0.319318 -0.848077  0.605965
c -2.018168  0.740122  0.528813 -0.589001
d  0.188695 -0.758872 -0.933237  0.955057


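Why are columns axis=1 and rows axis=0?
A DataFrame is essentially a labeled index on top of a NumPy array, so it follows NumPy's axis convention: in the shape tuple, rows are at position 0 and columns are at position 1.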
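A small sketch of the axis convention, using a made-up 2x3 frame called `small` so the sums are easy to check by hand. `sum(axis=...)` and `to_numpy()` are standard pandas methods; everything else is illustrative.

```python
import pandas as pd

small = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["r0", "r1"],
    columns=["c0", "c1", "c2"],
)

n_rows, n_cols = small.shape        # axis 0 has 2 entries (rows), axis 1 has 3 (columns)

down_rows = small.sum(axis=0)       # collapses axis 0: one total per column
across_cols = small.sum(axis=1)     # collapses axis 1: one total per row

# The underlying NumPy array follows the same convention.
numpy_down = small.to_numpy().sum(axis=0)

print(n_rows, n_cols)
print(down_rows)
print(across_cols)
```

The same numbering explains `drop`: `axis=0` looks for the label among the rows, `axis=1` among the columns.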


In [41]:
df.shape

# shape is a tuple: 5 rows and 4 columns


(5, 4)

In [42]:
# Selecting Rows
df

          w         x         y         z
a  2.706850  0.628133  0.907969  0.503826
b  0.651118 -0.319318 -0.848077  0.605965
c -2.018168  0.740122  0.528813 -0.589001
d  0.188695 -0.758872 -0.933237  0.955057
e  0.190794  1.978757  2.605967  0.683509


In [46]:
# There are two ways to select rows in a DataFrame.

# 1:
# loc: select by index label ("loc" stands for location)

df.loc['a']

# This returns a Series. Rows are Series, just as columns are.

w    2.706850
x    0.628133
y    0.907969
z    0.503826
Name: a, dtype: float64

In [48]:
# Second method
# This method uses the integer index position.

# Use the iloc indexer for this (square brackets, not a function call).

df.iloc[2]


w   -2.018168
x    0.740122
y    0.528813
z   -0.589001
Name: c, dtype: float64
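One practical difference between the two indexers shows up when slicing rows: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as ordinary Python slices do. The sketch rebuilds the frame so it is self-contained; `by_label` and `by_position` are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), ["a", "b", "c", "d", "e"], ["w", "x", "y", "z"])

by_label = df.loc["b":"d"]      # label slice: rows b, c and d (end label included)
by_position = df.iloc[1:3]      # position slice: rows at positions 1 and 2 (end excluded)

print(by_label.index.tolist())
print(by_position.index.tolist())
```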

# ----------------------------------------------------------------------------------------------------------

Selecting subsets of rows and columns

In [49]:
# Select a single value by specifying row b and column y.

# Syntax: loc[row, column]

df.loc['b','y']

-0.8480769834036315

In [51]:
# We can also pass lists of row labels and column labels.

df.loc[ ['a','b'],['w','y'] ]

          w         y
a  2.706850  0.907969
b  0.651118 -0.848077
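The same subsets can also be selected by position with `iloc`, using the `[row, column]` form. This is a sketch that rebuilds the frame so it runs on its own; `cell`, `block`, and `same` are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), ["a", "b", "c", "d", "e"], ["w", "x", "y", "z"])

# Row b is at position 1 and column y is at position 2.
cell = df.iloc[1, 2]

# Rows a, b are positions 0, 1; columns w, y are positions 0, 2.
block = df.iloc[[0, 1], [0, 2]]

# The positional selection matches the label-based one.
same = block.equals(df.loc[["a", "b"], ["w", "y"]])

print(cell)
print(block)
print(same)
```

Label-based selection with `loc` tends to be more robust when rows or columns may be reordered; `iloc` is convenient when only the position is known.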
