# **Pandas**

Let's get started with pandas.

### Installing pandas

In [1]:
!pip install pandas



## Importing dependencies

In [2]:
#@title EXERCISE: Importing pandas
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")



# **Pandas Series**


---
A pandas Series works like a one-dimensional labelled array and can hold any type of data, such as int, float or string. You can create a Series from an array, a list or even a dictionary.
* Command - pd.Series()


In [6]:
labels = ['a','b','c']
my_data = [10,20,30] 
arr = np.array(my_data) #our array
dic = {'a':10 , 'b':20, 'c':30} #our dictionary

In [7]:
pd.Series(arr,labels) #by using our array

a    10
b    20
c    30
dtype: int64

In [8]:
ser = pd.Series(dic) #by using our dictionary
print(ser)

a    10
b    20
c    30
dtype: int64


To retrieve a value from a pandas Series, we pass the index label of the value that we wish to fetch.

In [9]:
print(ser['a']) #we used the index a which has the value of 10

10
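
As a small sketch of the lookup described above: besides plain square brackets, a Series also offers `.loc` for explicit label-based access and `.iloc` for explicit position-based access. The variable names below are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same data as the dictionary-based Series above (names here are illustrative)
prices = pd.Series({'a': 10, 'b': 20, 'c': 30})

by_label = prices.loc['a']      # explicit label-based lookup
by_position = prices.iloc[0]    # explicit position-based lookup (first element)
several = prices[['a', 'c']]    # a list of labels returns a smaller Series

print(by_label, by_position)
print(several)
```

Using `.loc` and `.iloc` makes it clear to the reader whether a lookup is by label or by position, which helps when the index itself consists of integers.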


### EXERCISE: Generate the same `Series` using only the initial `lists` and extract the data at label `c`.

In [11]:
# write code in the line below
s = pd.Series(my_data, labels)
s['c']

30

We can also add two pandas Series in the following way:

In [16]:
ser1 = pd.Series([1,2,3,4],['a','b','c','d'])
ser2 = pd.Series([1,2,5,4],['a','b','e','d'])
ser3 = ser1 + ser2
print(ser3)

a    2.0
b    4.0
c    NaN
d    8.0
e    NaN
dtype: float64


We never put any NaN values into either series, yet the result contains them. Addition aligns the two series on their index labels: 'c' exists only in the first series (value 3) and 'e' exists only in the second series (value 5). Since neither label has a partner in the other series, the sum for those labels is NaN by default. The NaN values are also why the result has dtype float64 even though both inputs held integers.
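
If NaN is not the result you want for unmatched labels, one option is the `Series.add` method with its `fill_value` argument. The sketch below reuses the two series from above; treating a missing label as 0 is an assumption that only makes sense when "absent" really means "zero" in your data.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['a', 'b', 'c', 'd'])
ser2 = pd.Series([1, 2, 5, 4], ['a', 'b', 'e', 'd'])

# A label missing from one Series is treated as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```

With this call, 'c' keeps its value from the first series and 'e' keeps its value from the second.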

# **Data Frames**

---
A DataFrame is a two-dimensional structure that stores data in tabular format and can be changed and edited as needed. Just like a Series, a DataFrame can be built from arrays, lists and dictionaries.
* Command - pd.DataFrame(data, index, columns), where index holds the row labels and columns holds the column labels

In [32]:
from numpy.random import randn
df = pd.DataFrame(randn(4,4),['A','B','C','D'],['W','X','Y','Z'])
df

          W         X         Y         Z
A  0.723056 -0.445546 -0.652065 -0.030567
B  0.844453  0.208513 -0.176878  0.556395
C -1.193186  1.306642  0.844274 -1.591508
D -0.229251 -0.272999 -3.005975 -0.972506


As we can see, we have our very own DataFrame. We used A, B, C, D as the row labels and W, X, Y, Z as the column labels. The randn function fills the DataFrame with random numbers drawn from the standard normal distribution, and its arguments set the shape: 4 rows and 4 columns. We can also grab individual rows and columns when we need to analyse them separately. A single column that we grab is itself a Series.

In [33]:
print(df['W']) #grabbing a single column W

A    0.723056
B    0.844453
C   -1.193186
D   -0.229251
Name: W, dtype: float64


In [34]:
print(df[['W','X']]) #grabbing more than one column by passing in a list of column names

          W         X
A  0.723056 -0.445546
B  0.844453  0.208513
C -1.193186  1.306642
D -0.229251 -0.272999
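
The following sketch checks the claim that a single column is a Series while a list of columns gives a DataFrame. It uses a seeded NumPy generator so the values are reproducible; the seed and the variable names are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random values are reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((4, 4)),
                     index=['A', 'B', 'C', 'D'],
                     columns=['W', 'X', 'Y', 'Z'])

one_column = frame['W']          # single brackets -> Series
two_columns = frame[['W', 'X']]  # list of names -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(frame.shape)                 # (4, 4)

# A DataFrame can also be built from a dictionary of column name -> values
from_dict = pd.DataFrame({'W': [1, 2], 'X': [3, 4]}, index=['A', 'B'])
print(from_dict)
```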


### EXERCISE: Create a DataFrame with random integers and extract the First and Last column from it.

In [35]:
# write code in the line below
import numpy as np
data_frame = pd.DataFrame(np.random.randint(0, 50, (3, 3)))
data_frame[[0, 2]]

    0   2
0  18  41
1   6  15
2  15   4


We can change an existing DataFrame by adding or dropping columns. Data sets often contain columns that are not useful for your model, and dropping them can simplify the model and may improve its accuracy. Let's take a look at how we can add and delete columns in our DataFrame.

Command - 
* df['column_name'] = values - To add a new column, assign values to a column name that does not exist yet.
* df.drop('column_name',axis = 1,inplace = True) - To drop a column from a DataFrame we can use this method. The axis argument says what to drop: axis = 1 drops a column and axis = 0 drops a row. Without inplace = True, drop() returns a new DataFrame and leaves the original unchanged.

In [36]:
df['new'] = df['X'] + df['Y']
df.head()

          W         X         Y         Z       new
A  0.723056 -0.445546 -0.652065 -0.030567 -1.097611
B  0.844453  0.208513 -0.176878  0.556395  0.031636
C -1.193186  1.306642  0.844274 -1.591508  2.150916
D -0.229251 -0.272999 -3.005975 -0.972506 -3.278974


In [37]:
df.drop('new',axis = 1,inplace = True)

In [38]:
df.head()

          W         X         Y         Z
A  0.723056 -0.445546 -0.652065 -0.030567
B  0.844453  0.208513 -0.176878  0.556395
C -1.193186  1.306642  0.844274 -1.591508
D -0.229251 -0.272999 -3.005975 -0.972506


As you can see, we added a new column to our data frame called 'new'. Then we dropped it using the .drop() method and when we call the head on our Data Frame, we can see that the column does not exist any more in the Data Frame.
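
For comparison, here is a minimal sketch of what happens when `inplace = True` is left out: `drop` hands back a new DataFrame and the original keeps all of its columns. The small `scores` frame is made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({'X': [1, 2], 'Y': [3, 4]}, index=['A', 'B'])
scores['new'] = scores['X'] + scores['Y']   # add a column

trimmed = scores.drop('new', axis=1)        # returns a copy without 'new'
print(list(scores.columns))   # ['X', 'Y', 'new'] -> original unchanged
print(list(trimmed.columns))  # ['X', 'Y']

# columns= is an equivalent, more explicit spelling of axis=1
same = scores.drop(columns=['new'])
print(same.equals(trimmed))   # True
```

Assigning the result to a new name, as done here, keeps the original available for later cells.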

### EXERCISE: Remove the third row and show the DataFrame created in the previous exercise.

In [42]:
# write code in the line below
df_new = df.drop('C', axis=0, inplace=False)
df_new

          W         X         Y         Z
A  0.723056 -0.445546 -0.652065 -0.030567
B  0.844453  0.208513 -0.176878  0.556395
D -0.229251 -0.272999 -3.005975 -0.972506


**The `inplace` argument is set to `False` here because the third row is still needed by the code that follows.**

# **Selecting Rows and Columns**

---
To analyse your data you will often need to select specific rows and columns. This is where '.loc' and '.iloc' come into the picture: they let you grab a row either by its label or by its integer position. To grab a column, just pass the column name to your DataFrame object.

Command -
* df.loc['row_name'] - Fetches the row by its row label.
* df.iloc[x] - Fetches the row by its integer position (x), counting from 0.
* df[[column_name]] - Fetches the column by its column name. Be careful with the two square brackets: they return a one-column DataFrame, whereas single brackets return a Series.



In [40]:
df.loc['C'] #using .loc[] with a row label

W   -1.193186
X    1.306642
Y    0.844274
Z   -1.591508
Name: C, dtype: float64

In [24]:
df.iloc[2] #using .iloc[]; position 2 is the third row

W    0.928126
X    0.101014
Y    1.191366
Z   -0.442132
Name: D, dtype: float64

In [25]:
df[['X']] #fetching the column

          X
A  2.149702
B -0.670875
D  0.101014


In [26]:
df.loc['B','Y'] #we can pass the row name and column name in .loc to get a specific value

-0.30031442492452853

In [27]:
df.loc[['A','B'],['W','Y']] #we passed the columns that we need and the rows that we need and we got the info that we needed

          W         Y
A  0.420934 -1.019098
B  0.440920 -0.300314
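
One detail worth seeing in code is how slices behave with the two indexers: a `.loc` slice includes both endpoints, while an `.iloc` slice excludes the stop position, as ordinary Python slices do. The sketch below uses a small made-up frame called `table` and also shows the single versus double bracket difference.

```python
import pandas as pd

table = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6], 'Y': [7, 8, 9]},
                     index=['A', 'B', 'C'])

by_label = table.loc['A':'B', 'W':'X']   # label slices include both endpoints
by_position = table.iloc[0:2, 0:2]       # position slices exclude the stop
print(by_label.equals(by_position))      # True

as_series = table['X']      # single brackets -> Series
as_frame = table[['X']]     # double brackets -> one-column DataFrame
print(type(as_series).__name__, type(as_frame).__name__)
```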


### EXERCISE: Use the DataFrame created and formatted in the previous Exercises and extract the 2nd and 3rd rows from its 1st three columns.

In [41]:
# write code in the line below
df.iloc[1:3, :3] # rows at positions 1 and 2 (2nd and 3rd), first three columns

          W         X         Y
B  0.844453  0.208513 -0.176878
C -1.193186  1.306642  0.844274


# **Working with Missing data**

---

Real-world data sets almost always contain missing values; it is rare to get data without any. In this section we will learn how to deal with missing values.

In [43]:
d = {'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]} 
df1 = pd.DataFrame(d)
print(df1)

     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3



As we can see, we have a new data set with NaN values in it, and we will perform all our operations on it. When only a few values are missing, the affected rows can often be discarded. For this purpose we use the '.dropna()' method, which by default drops every row that contains at least one NaN value.

In [44]:
df1.dropna() #dropped all the rows with NaN values

     A    B  C
0  1.0  5.0  1
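
Before dropping anything it can help to count how much is missing, and `dropna` has a few options beyond its default. The sketch below rebuilds the same `df1` and shows three calls that exist in pandas: `isna().sum()`, `dropna(axis=1)` and `dropna(thresh=...)`. The result names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

# Number of missing values in each column
missing_per_column = df1.isna().sum()
print(missing_per_column)

# Drop columns (instead of rows) that contain any NaN: only 'C' survives
complete_columns = df1.dropna(axis=1)
print(complete_columns)

# Keep rows that have at least 2 non-NaN values: rows 0 and 1 survive
mostly_complete = df1.dropna(thresh=2)
print(mostly_complete)
```

The `thresh` option is a middle ground between keeping everything and dropping every row that has any gap.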


Sometimes too many values are missing to simply discard them. In that case we have to fill them in, and we use the .fillna() method to do so.

In [45]:
print(df1.fillna(value = 0)) #filling up all the NaN values with 0

     A    B  C
0  1.0  5.0  1
1  2.0  0.0  2
2  0.0  0.0  3
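
A constant such as 0 is not always a sensible replacement. As one possible alternative, `fillna` also accepts a Series, in which case each column is filled with the value stored under its own name. The sketch below passes the column means; whether the mean is an appropriate fill depends on the data and is an assumption here.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]})

column_means = df1.mean()            # NaN values are skipped when averaging
filled = df1.fillna(column_means)    # each column gets its own mean
print(filled)
```

Here the gap in column A is filled with 1.5 and the gaps in column B with 5.0, while column C has nothing to fill.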


### EXERCISE: Create a DataFrame with NaN values in it and fill them up using the mean of the 2nd Column.

In [77]:
# write code in the line below
d1 = {'A': [np.nan, 1, 6], 'B': [1., 2., 3.], 'C': [np.nan, 3, np.nan]}
df2 = pd.DataFrame(d1)
df2.fillna(value = df2['B'].mean()) # the mean of a single column is already a scalar



     A    B    C
0  2.0  1.0  2.0
1  1.0  2.0  3.0
2  6.0  3.0  2.0
