<H1><B>PANDAS</B></H1>
Welcome to the lesson on Pandas. Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas adds two data structures to Python, namely <b>Pandas Series</b> and <b>Pandas DataFrame</b>. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

## **Why Use Pandas?**
The recent success of machine learning algorithms is partly due to the huge amounts of data that we have available to train our algorithms on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important.

Large datasets often have missing values, outliers, incorrect values, and so on. Data with a lot of missing or bad values will not allow your machine learning algorithms to perform well.

This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, as well as being flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:

- Allows the use of labels for rows and columns
- Can calculate rolling statistics on time series data
- Easy handling of NaN values
- Can load data of different formats into DataFrames
- Can join and merge different datasets together
- Integrates with NumPy and Matplotlib

For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.

## **PANDAS SERIES:**
A Pandas series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want. Another big difference between Pandas Series and NumPy ndarrays is that Pandas Series can hold data of different data types.




In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [1]:
# We first have to import pandas using import statement in python

import pandas as pd  

Let's create a pandas series.  For that we just write <B>pd.Series()</b>

```
item = pd.Series(data, index)    # General form: the resulting Series is stored in a variable named item
```



So while creating series we can pass arguments for <u>data</u> and <u>index</u>

In [4]:
item = pd.Series(data=[15, 5, 'No'], index=['chocolates', 'chips', 'milk'])
item

chocolates    15
chips          5
milk          No
dtype: object

As we can see, the Series is displayed with the index labels in the first column and the data in the second column.
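
As a minimal sketch of what happens when the index argument is left out (the variable name `prices` and its values are made up for illustration), pandas falls back to default integer labels starting at 0, and labels can still be assigned afterwards:

```python
import pandas as pd

# Hypothetical data; no index argument is passed
prices = pd.Series([15, 5, 8])

# The labels default to 0, 1, 2
default_labels = prices.index.tolist()
print(default_labels)

# Labels can still be assigned after creation
prices.index = ['chocolates', 'chips', 'milk']
print(prices['chips'])
```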

<br><br>
## **Attributes of Pandas Series:**

Let's look at some of the attributes of a Pandas Series that help us understand it

In [5]:
item.shape   # gives us the  sizes of each dimension of the data

(3,)

In [6]:
item.ndim   # gives us the number of dimensions of the data

1

In [5]:
item.size   # gives us the total number of items in the array 

3

In [7]:
item.index   # gives us the index labels of the series

Index(['chocolates', 'chips', 'milk'], dtype='object')

In [8]:
item.values    #gives us the data of the series

array([15, 5, 'No'], dtype=object)

If you are dealing with a very large Pandas Series and are not sure whether an index label exists, you can check by using the ```in``` operator

In [8]:
# We check whether bananas is a food item (an index label) in item
x = 'bananas' in item

# We check whether chips is a food item (an index label) in item
y = 'chips' in item

# We print the results
print('Is bananas an index label in item:', x)
print('Is chips an index label in item:', y)

Is bananas an index label in item: False
Is chips an index label in item: True
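
One point that is easy to miss: on a Series, the membership test looks at the index labels, not at the stored values. The sketch below uses a hypothetical all-numeric Series named `stock` to show the difference, along with two ways of searching the values instead.

```python
import pandas as pd

# Hypothetical all-numeric Series used only for this illustration
stock = pd.Series([15, 5, 8], index=['chocolates', 'chips', 'milk'])

label_found = 'chips' in stock        # True: 'chips' is an index label
value_as_label = 15 in stock          # False: 15 is a value, not a label
value_found = 15 in stock.values      # True: searches the underlying data
value_found_isin = bool(stock.isin([15]).any())  # same check using Series.isin

print(label_found, value_as_label, value_found, value_found_isin)
```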


## **Accessing and Deleting Elements in Pandas Series:**
Now let's look at how we can access or modify elements in a Pandas Series. One great advantage of Pandas Series is that it allows us to access data in many different ways. Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays. Since we can use numerical indices, we can use both positive and negative integers to access data from the beginning or from the end of the Series, respectively. 

In [9]:
# Access one element by its index label
item['chocolates']

15

In [10]:
item[['chocolates', 'milk']]   #a list of indices is passed 

chocolates    15
milk          No
dtype: object

In [10]:
item[0]  #first element

15

In [12]:
item[-1]  #last element

'No'

In [11]:
item[[0,1]]    # list of numerical indices

chocolates    15
chips          5
dtype: object

To remove any ambiguity about whether we are referring to an index label or a numerical index, Pandas Series have two attributes, <b>loc</b> and <b>iloc</b>.

* The attribute loc stands for location and is used to explicitly state that we are using a labeled index.
* The attribute iloc stands for integer location and is used to explicitly state that we are using a numerical index.

In [14]:
item.loc[['milk', 'chips']]

milk     No
chips     5
dtype: object

In [15]:
item.iloc[[1,2]]

chips     5
milk     No
dtype: object
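
The ambiguity matters most when the index labels are themselves integers. The following sketch uses a hypothetical Series named `letters` whose integer labels are deliberately out of order, so that a label and a position with the same number point at different elements. Newer pandas releases also warn about plain-bracket positional access such as `item[0]` on a Series with non-integer labels, so `.iloc` is the safer choice for positions.

```python
import pandas as pd

# Hypothetical Series whose labels are integers in a shuffled order
letters = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = letters.loc[0]      # the element whose label is 0
by_position = letters.iloc[0]  # the element in the first position

print(by_label, by_position)
```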

In [16]:
item #let's see the series again

chocolates    15
chips          5
milk          No
dtype: object

Pandas Series are also mutable like NumPy ndarrays, which means we can change the elements of a Pandas Series after it has been created.

In [17]:
# we can change the data by reassigning the value
item['milk'] = 'Yes'
item

chocolates     15
chips           5
milk          Yes
dtype: object

We can see that our series is now modified
<br><br>

## **Deleting values from Pandas Series:**

We can also delete items from a Pandas Series by using the ```.drop()``` method. The ```Series.drop(label)``` method removes the given label from the given Series.

In [14]:
item.drop('chocolates')

chips     5
milk     No
dtype: object

In [15]:
item

chocolates    15
chips          5
milk          No
dtype: object

As we can see, the chocolates item is no longer present in the series returned by the method.

><B>Note: </B>However, this drops elements out of place: the code above only returned a modified copy and did not change the original series.

In [19]:
item   # THE CHOCOLATES ITEM MUST BE PRESENT HERE

chocolates     15
chips           5
milk          Yes
dtype: object

If we want to modify the original series, we can use either of two methods:
* assign the returned series back to the original variable

```
item = item.drop('chocolates')
```

* use the inplace argument and set it to True

```
item.drop('chocolates', inplace=True)
```
Let's try the second one :p

In [12]:
item.drop('chocolates', inplace= True)
# item

In [13]:
item

chips     5
milk     No
dtype: object

As you see the item is dropped from the original series.
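
As a small sketch of two further options of `drop` (the Series `snacks` is made up for illustration): a list of labels removes several items at once, and `errors='ignore'` skips labels that do not exist instead of raising a `KeyError`.

```python
import pandas as pd

# Hypothetical Series for illustration
snacks = pd.Series([15, 5, 8], index=['chocolates', 'chips', 'milk'])

# Drop several labels at once; the original Series is left untouched
remaining = snacks.drop(['chocolates', 'milk'])

# A missing label normally raises KeyError; errors='ignore' skips it instead
unchanged = snacks.drop('bananas', errors='ignore')

print(remaining.index.tolist())
print(len(snacks), len(unchanged))
```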

<br>

##  **Arithmetic operations on Pandas Series:**
Just like with NumPy ndarrays, we can perform element-wise arithmetic operations on Pandas Series.
We will look at arithmetic operations between a Pandas Series and single numbers.<br>
Let's first make a new series

In [14]:
sweets = pd.Series(data=[10, 5, 7], index=['candies', 'donuts', 'ladoos'])
sweets

candies    10
donuts      5
ladoos      7
dtype: int64

We can now modify the data in sweets by performing basic arithmetic operations. Let's see some examples

In [19]:
sweets+2

candies    12
donuts      7
ladoos      9
dtype: int64

In [11]:
sweets-2

candies    8
donuts     3
ladoos     5
dtype: int64

In [12]:
sweets*2

candies    20
donuts     10
ladoos     14
dtype: int64

In [13]:
sweets/2

candies    5.0
donuts     2.5
ladoos     3.5
dtype: float64

In [20]:
# WE CAN ALSO APPLY MATHEMATICAL FUNCTIONS FROM NUMPY SUCH AS SQUARE ROOT
import numpy as np

np.sqrt(sweets)

candies    3.162278
donuts     2.236068
ladoos     2.645751
dtype: float64

Other NumPy functions, such as power and exponential, are applied element-wise to the sweets series in the same way. Let's see some examples

In [21]:
np.power(sweets, 4)

candies    10000
donuts       625
ladoos      2401
dtype: int64

In [22]:
np.exp(sweets)

candies    22026.465795
donuts       148.413159
ladoos      1096.633158
dtype: float64

In [16]:
sweets['sri'] = 'to explain'

In [17]:
sweets

candies            10
donuts              5
ladoos              7
sri        to explain
dtype: object

We can also apply arithmetic operations on specific elements

In [16]:
sweets['candies']-4

6

In [17]:
sweets

candies    10
donuts      5
ladoos      7
dtype: int64

In [14]:
sweets[['donuts', 'ladoos']]*3

donuts    15
ladoos    21
dtype: int64
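
Note that expressions like the ones above return new values and leave `sweets` itself unchanged. A minimal sketch of how the result could be stored back, assuming we want to update only the selected labels, is to assign through `.loc`:

```python
import pandas as pd

sweets = pd.Series(data=[10, 5, 7], index=['candies', 'donuts', 'ladoos'])

# The expression alone returns a new Series and leaves sweets unchanged
tripled = sweets[['donuts', 'ladoos']] * 3

# Assigning the result back through .loc stores it in the original Series
sweets.loc[['donuts', 'ladoos']] = tripled

print(sweets)
```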

><b>Note:</b> We can also apply arithmetic operations on a Pandas Series of mixed data types, provided the operations are defined for all the data types in the series.

* Let's use the item series we made earlier 

In [15]:
item = pd.Series(data=[15, 5, 'No'], index=['chocolates', 'chips', 'milk'])
item

chocolates    15
chips          5
milk          No
dtype: object

In [25]:
item * 2  #SINCE MULTIPLICATION OPERATION IS DEFINED FOR BOTH STRINGS AND INTEGERS, THE CODE DOESN'T RETURN ANY ERROR

chocolates      30
chips           10
milk          NoNo
dtype: object

In [26]:
item/2 #SINCE THE DIVISION OPERATION IS DEFINED ONLY FOR NUMBERS AND NOT FOR STRING, THE CODE WILL RETURN AN ERROR

TypeError: unsupported operand type(s) for /: 'str' and 'int'
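
One possible way around this error, shown here only as a sketch, is to convert the series to numbers first with `pd.to_numeric`. With `errors='coerce'`, entries that cannot be parsed become NaN, so the division works on the remaining values. Whether turning 'No' into NaN is acceptable depends on what the data means.

```python
import pandas as pd

item = pd.Series(data=[15, 5, 'No'], index=['chocolates', 'chips', 'milk'])

# errors='coerce' turns values that cannot be parsed as numbers into NaN
numeric_item = pd.to_numeric(item, errors='coerce')

# Division now works; the non-numeric entry stays NaN
halved = numeric_item / 2
print(halved)
```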

<br><br><hr>
<H3><B>2. PANDAS DATAFRAME</B></H3>

A DataFrame is a two-dimensional object with labeled rows and columns that can hold values of different data types.

* We can create a dataframe manually or by loading data from a file.

<br>
Let's first create a dataframe manually:<br>

*  First let's create a dictionary of pandas series and pass it into pandas dataframe

In [18]:
item = {'Shaurya': pd.Series([250, 15, 70, 100], index=['watch', 'toys', 'glasses', 'shirt']),
        'Ashish': pd.Series([120, 50, 90], index=['pants', 'books', 'toys' ])}

# item is a dictionary for two people containing some items and the cost of the item 

In [21]:
# WE CAN CREATE A DATAFRAME BY PASSING THE DICTIONARY TO THE DataFrame FUNCTION

cart = pd.DataFrame(item)
cart

|  | Shaurya | Ashish |
|---|---|---|
| books | NaN | 50.0 |
| glasses | 70.0 | NaN |
| pants | NaN | 120.0 |
| shirt | 100.0 | NaN |
| toys | 15.0 | 90.0 |
| watch | 250.0 | NaN |


* Make sure to capitalize the **D** and **F** when calling the DataFrame function.
* The dataframe is displayed in tabular form.
* The row labels for the dataframe are built from the union of the index labels we provided in the series, and the column labels are taken from the keys of the dictionary.
* The dataframe has NaN values because Shaurya has no books or pants in the dictionary we provided; similarly, the Ashish column has NaN values for the items Ashish does not have.
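
The NaN entries described in the last point can be counted directly. The sketch below rebuilds the same `cart` so that it runs on its own, then uses `isna()` to mark the missing cells and `sum()` to count them per column.

```python
import pandas as pd

item = {'Shaurya': pd.Series([250, 15, 70, 100], index=['watch', 'toys', 'glasses', 'shirt']),
        'Ashish': pd.Series([120, 50, 90], index=['pants', 'books', 'toys'])}
cart = pd.DataFrame(item)

# Boolean mask that is True wherever a value is missing
missing_mask = cart.isna()

# Number of missing entries in each column
missing_per_column = missing_mask.sum()
print(missing_per_column)
```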

In [20]:
# IN THE ABOVE EXAMPLE WE PROVIDED A DICTIONARY OF SERIES THAT CLEARLY DEFINED THE INDEX LABELS. HOWEVER, IF WE DON'T PROVIDE THE INDEX LABELS,
# THEN THE DATAFRAME WILL USE NUMERICAL INDEX VALUES
# LET'S CREATE THE SAME DICTIONARY WITHOUT THE INDEX LABELS
new_item = {'Shaurya': pd.Series([250, 15, 70, 100]),
        'Ashish': pd.Series([120, 50, 90])}


# NOW MAKE THE DATAFRAME USING THE NEW DICTIONARIES 
df = pd.DataFrame(new_item)
df

|  | Shaurya | Ashish |
|---|---|---|
| 0 | 250 | 120.0 |
| 1 | 15 | 50.0 |
| 2 | 70 | 90.0 |
| 3 | 100 | NaN |


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Shaurya  4 non-null      int64  
 1   Ashish   3 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 192.0 bytes


The dataframe uses the numerical indices.

<br>
<hr>
<h4><b>2.a Attributes</b></h4>

Like we did in pandas series, we can also extract information from pandas dataframe using some attributes.

In [22]:
cart.index

Index(['books', 'glasses', 'pants', 'shirt', 'toys', 'watch'], dtype='object')

In [23]:
cart.columns

Index(['Shaurya', 'Ashish'], dtype='object')

In [25]:
cart.values

array([[ nan,  50.],
       [ 70.,  nan],
       [ nan, 120.],
       [100.,  nan],
       [ 15.,  90.],
       [250.,  nan]])

In [26]:
cart.shape

(6, 2)

In [27]:
cart.ndim

2

In [28]:
cart.size

12

><b>NOTE:</B> While creating the cart dataframe, we passed the whole dictionary to the DataFrame function. However, there might be cases when we are only interested in a specific subset of the data. Pandas lets us select which data we want to put into the DataFrame with the keywords **columns** and **index**.

In [29]:
ashish_cart = pd.DataFrame(item, columns=['Ashish'])
ashish_cart

|  | Ashish |
|---|---|
| pants | 120 |
| books | 50 |
| toys | 90 |


In [30]:
selected_item = pd.DataFrame(item, index=['pants', 'toys'])
selected_item

|  | Shaurya | Ashish |
|---|---|---|
| pants | NaN | 120 |
| toys | 15.0 | 90 |


In [47]:
ashish_selected_item = pd.DataFrame(item, columns=['Ashish'], index=['pants', 'toys'])
ashish_selected_item

|  | Ashish |
|---|---|
| pants | 120 |
| toys | 90 |


We can also create a dataframe from a dictionary of lists or arrays. The procedure is the same as before: we start by creating the dictionary and then pass it into the DataFrame function. In this case, however, all the lists or arrays in the dictionary must be of the same length.

In [26]:
# Here's the dictionary of the integers and the floats.
data = {'Integers':[1,2,3],
         'Floats':[1.1, 2.2, 3.3]}

df = pd.DataFrame(data, index=['label1', 'label2', 'label3'])    ## IF WE DON'T PASS THE INDICES, DATAFRAME WILL AUTOMATICALLY USE NUMERICAL INDICES
df

|  | Integers | Floats |
|---|---|---|
| label1 | 1 | 1.1 |
| label2 | 2 | 2.2 |
| label3 | 3 | 3.3 |
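
To illustrate the same-length requirement, here is a minimal sketch with a hypothetical dictionary named `uneven` whose lists differ in length. pandas raises a `ValueError` rather than padding the shorter list.

```python
import pandas as pd

# Hypothetical dictionary whose lists have different lengths
uneven = {'Integers': [1, 2, 3],
          'Floats': [1.1, 2.2]}

try:
    pd.DataFrame(uneven)
    built = True
except ValueError as err:
    # pandas refuses to build the frame because the columns do not line up
    built = False
    print('Could not build DataFrame:', err)
```

If the columns really do have different lengths, wrapping each list in a `pd.Series` (as in the earlier dictionary-of-Series example) lets pandas fill the gaps with NaN instead.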


In [43]:
# Creating DataFrame using a list of python dictionaries 
ListOfDict = [{'apple':20, 'banana':15, 'orange':30},{'apple':10, 'tomato':17, 'grapes': 35}]

df = pd.DataFrame(ListOfDict)
df

|  | apple | banana | orange | tomato | grapes |
|---|---|---|---|---|---|
| 0 | 20 | 15.0 | 30.0 | NaN | NaN |
| 1 | 10 | NaN | NaN | 17.0 | 35.0 |


DataFrame used the numerical row indices.
If we want to assign the row indices some values, we can use:


In [44]:
df.index = ['personA', 'personB']
df  

|  | apple | banana | orange | tomato | grapes |
|---|---|---|---|---|---|
| personA | 20 | 15.0 | 30.0 | NaN | NaN |
| personB | 10 | NaN | NaN | 17.0 | 35.0 |


**Accessing the values in a DataFrame**

In [45]:
df['apple']     ##accessing a column

personA    20
personB    10
Name: apple, dtype: int64

In [46]:
df[['banana','tomato']]     ##accessing by passing a list of columns

|  | banana | tomato |
|---|---|---|
| personA | 15.0 | NaN |
| personB | NaN | 17.0 |


In [47]:
df.loc[['personA']]    ##accessing a row

|  | apple | banana | orange | tomato | grapes |
|---|---|---|---|---|---|
| personA | 20 | 15.0 | 30.0 | NaN | NaN |


In [48]:
df['grapes']['personB']    ##accessing a specific value

35.0

><b>NOTE:</B> When accessing a specific element this way, the column label always comes first, followed by the row label.
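
The column-first order applies to chained square brackets. With the `loc` and `at` accessors the order is reversed: the row label comes first and the column label second. The sketch below rebuilds the same data so that it runs on its own and reads the same cell in all three ways.

```python
import pandas as pd

# Same data as the df used above, rebuilt here so the snippet runs on its own
df = pd.DataFrame([{'apple': 20, 'banana': 15, 'orange': 30},
                   {'apple': 10, 'tomato': 17, 'grapes': 35}],
                  index=['personA', 'personB'])

chained = df['grapes']['personB']       # column first, then row
with_loc = df.loc['personB', 'grapes']  # row first, then column
with_at = df.at['personB', 'grapes']    # scalar access, row first

print(chained, with_loc, with_at)
```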

In [49]:
## IF WE WANT TO ADD A NEW COLUMN TO OUR DATAFRAME,  WE CAN ADD LIKE THIS:

df['corn'] = [5, 7]
df

|  | apple | banana | orange | tomato | grapes | corn |
|---|---|---|---|---|---|---|
| personA | 20 | 15.0 | 30.0 | NaN | NaN | 5 |
| personB | 10 | NaN | NaN | 17.0 | 35.0 | 7 |


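We can also add new columns using arithmetic operations on the other columns of our DataFrame.<br>
For example, we can add a new column vegies by adding up the values of all the other columns. Because every row has at least one missing (NaN) value and NaN propagates through addition, every entry of the new column is NaN.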


In [50]:
df['vegies'] = df['tomato'] + df['corn'] + df['apple'] + df['banana'] + df['orange'] + df['grapes']
df

|  | apple | banana | orange | tomato | grapes | corn | vegies |
|---|---|---|---|---|---|---|---|
| personA | 20 | 15.0 | 30.0 | NaN | NaN | 5 | NaN |
| personB | 10 | NaN | NaN | 17.0 | 35.0 | 7 | NaN |
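
The vegies column is all NaN because the `+` operator propagates missing values. If a total that ignores missing items is wanted instead, one option (a sketch, not the approach used above) is `DataFrame.sum` with `axis=1`, which skips NaN by default:

```python
import pandas as pd

df = pd.DataFrame([{'apple': 20, 'banana': 15, 'orange': 30},
                   {'apple': 10, 'tomato': 17, 'grapes': 35}],
                  index=['personA', 'personB'])
df['corn'] = [5, 7]

# The + operator propagates NaN, so a missing item makes the result NaN
with_plus = df['tomato'] + df['corn']

# DataFrame.sum skips NaN by default (skipna=True), giving a usable row total
row_total = df.sum(axis=1)
print(row_total)
```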


>**NOTE:** IF YOU WANT TO ADD THE VALUES FOR ONE MORE PERSON, YOU NEED TO ADD A NEW ROW. WE FIRST HAVE TO CREATE A NEW DATAFRAME WITH THAT ROW AND THEN APPEND IT TO THE ORIGINAL DATAFRAME



In [51]:
new_person = [{'apple':15, 'banana': 17, 'corn': 3, 'orange':5}]
new_df = pd.DataFrame(new_person, index=['personC'])
new_df

|  | apple | banana | corn | orange |
|---|---|---|---|---|
| personC | 15 | 17 | 3 | 5 |


In [52]:
## WE CAN NOW ADD THE NEW ROW TO THE ORIGINAL DATAFRAME
## DataFrame.append WAS REMOVED IN PANDAS 2.0, SO WE USE pd.concat INSTEAD

df = pd.concat([df, new_df])
df


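|  | apple | banana | orange | tomato | grapes | corn | vegies |
|---|---|---|---|---|---|---|---|
| personA | 20 | 15.0 | 30.0 | NaN | NaN | 5 | NaN |
| personB | 10 | NaN | NaN | 17.0 | 35.0 | 7 | NaN |
| personC | 15 | 17.0 | 5.0 | NaN | NaN | 3 | NaN |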


In [53]:
df = df.set_index('apple')
df

| apple | banana | orange | tomato | grapes | corn | vegies |
|---|---|---|---|---|---|---|
| 20 | 15.0 | 30.0 | NaN | NaN | 5 | NaN |
| 10 | NaN | NaN | 17.0 | 35.0 | 7 | NaN |
| 15 | 17.0 | 5.0 | NaN | NaN | 3 | NaN |


In [54]:
df = df.reset_index()

In [55]:
df

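|  | apple | banana | orange | tomato | grapes | corn | vegies |
|---|---|---|---|---|---|---|---|
| 0 | 20 | 15.0 | 30.0 | NaN | NaN | 5 | NaN |
| 1 | 10 | NaN | NaN | 17.0 | 35.0 | 7 | NaN |
| 2 | 15 | 17.0 | 5.0 | NaN | NaN | 3 | NaN |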


It is also possible to insert columns in the DataFrame using **insert method** to **any location we want**.
The insert method allows us to specify:
* the location,
* label, and
* the data 

of the column that we want to add.

In [56]:
df.insert(5, 'ginger', [7,11,13])
# THE FIRST ARGUMENT IS LOCATION OF THE COLUMN
# THE SECOND ARGUMENT IS THE LABEL FOR THE COLUMN
# AND THE LAST ONE IS THE DATA FOR THE COLUMN
df

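|  | apple | banana | orange | tomato | grapes | ginger | corn | vegies |
|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 15.0 | 30.0 | NaN | NaN | 7 | 5 | NaN |
| 1 | 10 | NaN | NaN | 17.0 | 35.0 | 11 | 7 | NaN |
| 2 | 15 | 17.0 | 5.0 | NaN | NaN | 13 | 3 | NaN |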


We can also delete the columns and rows using two methods:
* pop
* drop

> **NOTE:** The pop method allows us to delete columns while the drop method allows us to delete both rows and columns by using the axis keyword.

In [57]:
df.pop('ginger')
df

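|  | apple | banana | orange | tomato | grapes | corn | vegies |
|---|---|---|---|---|---|---|---|
| 0 | 20 | 15.0 | 30.0 | NaN | NaN | 5 | NaN |
| 1 | 10 | NaN | NaN | 17.0 | 35.0 | 7 | NaN |
| 2 | 15 | 17.0 | 5.0 | NaN | NaN | 3 | NaN |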
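
Besides deleting the column, `pop` also returns it as a Series, which is handy when the removed data is still needed. A minimal sketch with a hypothetical frame named `basket`:

```python
import pandas as pd

# Hypothetical small frame for illustration
basket = pd.DataFrame({'apple': [20, 10], 'ginger': [7, 11]},
                      index=['personA', 'personB'])

# pop removes the column in place and returns it as a Series
removed = basket.pop('ginger')

print(removed.tolist())
print(basket.columns.tolist())
```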


In [60]:
df.drop(['grapes', 'orange'], axis=1,inplace=True)

In [61]:
df

|  | apple | banana | tomato | corn | vegies |
|---|---|---|---|---|---|
| 0 | 20 | 15.0 | NaN | 5 | NaN |
| 1 | 10 | NaN | 17.0 | 7 | NaN |
| 2 | 15 | 17.0 | NaN | 3 | NaN |


In [77]:
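# EQUIVALENT TO THE IN-PLACE CALL ABOVE; RUN ONE OR THE OTHER, BECAUSE DROPPING THE SAME COLUMNS TWICE RAISES A KeyError
df = df.drop(['grapes', 'orange'], axis=1)   # AXIS = 1 MEANS WE ARE REMOVING COLUMNS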
df

|  | apple | banana | tomato | corn | vegies |
|---|---|---|---|---|---|
| 0 | 20 | 15.0 | NaN | 5 | NaN |
| 1 | 10 | NaN | 17.0 | 7 | NaN |
| 2 | 15 | 17.0 | NaN | 3 | NaN |


In [65]:
df.index = ['personA','personB','personC']

In [66]:
df

|  | apple | pineapple | Tomato | corn | vegies |
|---|---|---|---|---|---|
| personA | 20 | 15.0 | NaN | 5 | NaN |
| personB | 10 | NaN | 17.0 | 7 | NaN |
| personC | 15 | 17.0 | NaN | 3 | NaN |


In [80]:
df = df.drop(['personB', 'personC'], axis=0)   # AXIS = 0 MEANS WE ARE REMOVING ROWS
df

|  | apple | banana | tomato | corn | vegies |
|---|---|---|---|---|---|
| personA | 20 | 15.0 | NaN | 5 | NaN |


To change the name of any row or any column,  we use the rename method



In [62]:
df  = df.rename(columns={'banana':'pineapple','tomato':'Tomato'})
df

|  | apple | pineapple | Tomato | corn | vegies |
|---|---|---|---|---|---|
| 0 | 20 | 15.0 | NaN | 5 | NaN |
| 1 | 10 | NaN | 17.0 | 7 | NaN |
| 2 | 15 | 17.0 | NaN | 3 | NaN |


Now let's change the name of the row

In [67]:
df  = df.rename(index={'personA':'A'})
df

|  | apple | pineapple | Tomato | corn | vegies |
|---|---|---|---|---|---|
| A | 20 | 15.0 | NaN | 5 | NaN |
| personB | 10 | NaN | 17.0 | 7 | NaN |
| personC | 15 | 17.0 | NaN | 3 | NaN |


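We can also move the row labels back into a regular column with the reset_index method

In [69]:
df = df.reset_index()
df

|  | index | apple | pineapple | Tomato | corn | vegies |
|---|---|---|---|---|---|---|
| 0 | A | 20 | 15.0 | NaN | 5 | NaN |
| 1 | personB | 10 | NaN | 17.0 | 7 | NaN |
| 2 | personC | 15 | 17.0 | NaN | 3 | NaN |


Now the old row labels are stored in a column named index, and the rows use the default numerical index again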



<br>
<hr>
<h4><b>2.b Cleaning Data</b></h4>

Before using the data for making predictions or for analysis, we first need to clean it. Cleaning data means detecting and correcting bad data.
