
# Pandas

In [0]:
import numpy as np
import pandas as pd

## The Pandas Series Object

A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:

In [0]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:

In [0]:
data.values

array([ 0.25,  0.5 ,  0.75,  1.  ])

The ``index`` is an array-like object of type ``pd.Index``; for a ``Series`` built without an explicit index, it is a ``RangeIndex``:

In [0]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [0]:
data[1]

0.5

In [0]:
data[1:3]

1    0.50
2    0.75
dtype: float64

As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates.
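
One place this flexibility shows up is the index, which does not have to be a run of integers. The following is a minimal sketch, assuming string labels ``'a'`` through ``'d'`` that are chosen here for illustration and do not appear in the examples above:

```python
import pandas as pd

# Hypothetical labels for illustration; any hashable values can serve as an index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Look up a single element by its label
b_value = data['b']

# Label-based slices include the end label, unlike positional slices
sub = data['a':'c']

print(b_value)
print(sub)
```

Note that ``data['a':'c']`` returns three elements, whereas the positional slice ``data[1:3]`` shown earlier returns two.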

## The Pandas DataFrame Object


**Create an Empty DataFrame**

In [0]:
# Import the pandas library under the alias pd
import pandas as pd
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


**Create a DataFrame from a List**

In [0]:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

   0
0  1
1  2
2  3
3  4
4  5


**Create a DataFrame from a List of Lists**

In [0]:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
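
A ``DataFrame`` can also be built column by column rather than row by row. The sketch below reuses the same names and ages, arranged as a dictionary of lists; the variable name ``ages`` is introduced here only for illustration:

```python
import pandas as pd

# Same records as above, organized by column instead of by row
data = {'Name': ['Alex', 'Bob', 'Clarke'], 'Age': [10, 12, 13]}
df = pd.DataFrame(data)

# Selecting a single column returns a Series
ages = df['Age']

print(df)
print(ages.tolist())
```

With this form the dictionary keys become the column names, so the ``columns`` argument is not needed.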


## Head and Tail

In [0]:
import pandas as pd
import numpy as np

# Create a Series of 4 random numbers
s = pd.Series(np.random.randn(4))
print("The original series is:")
print(s)

print("The first two rows of the data series:")
print(s.head(2))

# Create a second Series of 4 random numbers
s = pd.Series(np.random.randn(4))
print("The original series is:")
print(s)

print("The last two rows of the data series:")
print(s.tail(2))

The original series is:
0    0.116896
1   -0.308206
2    0.775001
3    2.722417
dtype: float64
The first two rows of the data series:
0    0.116896
1   -0.308206
dtype: float64
The original series is:
0    0.280888
1    0.610335
2   -0.462607
3   -1.980918
dtype: float64
The last two rows of the data series:
2   -0.462607
3   -1.980918
dtype: float64
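
Because the values above are random, they change on every run. The following sketch, which assumes an arbitrary seed of 0 and a longer series of 10 values, makes the results repeatable and shows that ``head()`` and ``tail()`` return 5 rows when called without an argument:

```python
import numpy as np
import pandas as pd

# A fixed seed (0 is an arbitrary choice) makes the random values repeatable
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(10))

first_default = s.head()   # no argument: the first 5 rows
last_three = s.tail(3)     # the last 3 rows

print(first_default)
print(last_three)
```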


## Shape

In [0]:
import pandas as pd
import numpy as np
 
# Create a dictionary of Series
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

# Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The shape of the object is:")
print(df.shape)

Our object is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The shape of the object is:
(7, 3)
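
Since ``shape`` is a plain tuple of (rows, columns), it can be unpacked directly. A minimal sketch on a shortened, three-row version of the table; the names ``n_rows`` and ``n_cols`` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                   'Age': [25, 26, 25],
                   'Rating': [4.23, 3.24, 3.98]})

# Unpack the (rows, columns) tuple
n_rows, n_cols = df.shape

print(n_rows, n_cols)
print(len(df))    # number of rows
print(df.size)    # total number of cells (rows * columns)
```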


## The ``values`` Attribute

In [0]:
import pandas as pd
import numpy as np
 
# Create a dictionary of Series
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8])}

# Create a DataFrame
df = pd.DataFrame(d)
print("Our object is:")
print(df)
print("The actual data in our data frame is:")
print(df.values)

Our object is:
   Age   Name  Rating
0   25    Tom    4.23
1   26  James    3.24
2   25  Ricky    3.98
3   23    Vin    2.56
4   30  Steve    3.20
5   29  Smith    4.60
6   23   Jack    3.80
The actual data in our data frame is:
[[25 'Tom' 4.23]
 [26 'James' 3.24]
 [25 'Ricky' 3.98]
 [23 'Vin' 2.56]
 [30 'Steve' 3.2]
 [29 'Smith' 4.6]
 [23 'Jack' 3.8]]
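
When columns hold different types, as they do here, the array returned by ``values`` has to use a dtype general enough for all of them. The sketch below, on a shortened two-row table, compares the full frame with a numeric-only selection:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James'],
                   'Age': [25, 26],
                   'Rating': [4.23, 3.24]})

mixed = df.values                         # strings and numbers together
numeric = df[['Age', 'Rating']].values    # numeric columns only

print(mixed.dtype)
print(numeric.dtype)
```

Selecting only the numeric columns first keeps the result as a numeric array that works with NumPy arithmetic.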


## Sorting

In [0]:
import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)

       col2      col1
9 -0.248807  0.468988
8  1.047795 -0.239473
7 -0.985816  0.373464
6  0.255020 -1.442820
5  0.576216  0.421228
4  1.228995  1.174762
3 -1.463602 -1.853996
2 -1.666039  0.153949
1  0.822848  1.850178
0 -0.574736 -1.317724
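
Sorting by index label is one option; rows can also be ordered by the values in a column, and columns can be ordered by name. A minimal sketch, assuming a small hand-written frame in place of the random data so the result is easy to check by eye:

```python
import pandas as pd

# Small hand-written frame (illustrative values, not the random data above)
unsorted_df = pd.DataFrame({'col2': [3, 1, 2], 'col1': [30, 10, 20]},
                           index=[2, 0, 1])

# Rows ordered by index label, ascending
by_label = unsorted_df.sort_index()

# Rows ordered by the values in col1, largest first
by_value = unsorted_df.sort_values(by='col1', ascending=False)

# Columns ordered by name
by_column = unsorted_df.sort_index(axis=1)

print(by_label)
print(by_value)
print(by_column)
```

Each call returns a new object and leaves ``unsorted_df`` unchanged.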


## Missing Values

In [0]:
import pandas as pd
import numpy as np
 
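# Build a 5x3 frame of random values with a sparse set of row labels
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindexing adds rows 'b', 'd' and 'g', which are filled with NaN
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True where the value in column 'one' is missing
print(df['one'].isnull())

# True where the value in column 'one' is present
print(df['one'].notnull())

# sum() skips NaN values by default
print(df['one'].sum())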



a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
-4.622776173264095
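
The boolean masks above can be summed to count missing entries, and the NaN-skipping behaviour of ``sum()`` can be switched off. A minimal sketch, assuming a small hand-written frame so the numbers are easy to verify:

```python
import numpy as np
import pandas as pd

# Illustrative frame with known gaps
df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, np.nan, 6.0]},
                  index=['a', 'b', 'c'])

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isnull().sum()

total_skipping_nan = df['one'].sum()               # 1.0 + 3.0 = 4.0
total_with_nan = df['one'].sum(skipna=False)       # NaN propagates

print(missing_per_column)
print(total_skipping_nan)
print(total_with_nan)
```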


## Handling Missing Values

In [0]:
import pandas as pd
import numpy as np
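df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
# Row 'b' is new, so it is filled with NaN; row 'e' is dropped
df = df.reindex(['a', 'b', 'c'])
print(df)
print("NaN replaced with '0':")
print(df.fillna(0))

# dropna() removes every row that contains at least one NaN
print(df.dropna())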

        one       two     three
a -0.665472  0.380839  0.461021
b       NaN       NaN       NaN
c  0.489712  0.543741 -0.913995
NaN replaced with '0':
        one       two     three
a -0.665472  0.380839  0.461021
b  0.000000  0.000000  0.000000
c  0.489712  0.543741 -0.913995
        one       two     three
a -0.665472  0.380839  0.461021
c  0.489712  0.543741 -0.913995
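
Filling with 0 and dropping every incomplete row are the two simplest choices. The sketch below shows two common variations, filling each column with its own mean and dropping only rows that are entirely missing; the frame is hand-written for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 'b' is fully missing, row 'c' is partly missing
df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [4.0, np.nan, np.nan]},
                  index=['a', 'b', 'c'])

# Fill each column's gaps with that column's mean (NaN is ignored by mean())
filled_mean = df.fillna(df.mean())

# Drop a row only if every value in it is NaN
dropped_all = df.dropna(how='all')

# Default behaviour: drop a row if any value in it is NaN
dropped_any = df.dropna()

print(filled_mean)
print(dropped_all)
print(dropped_any)
```

Which option is appropriate depends on the data; filling with the mean keeps every row but changes the distribution of the column.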
