## Pandas

Pandas is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas adds two data structures to Python, namely __Pandas Series__ and __Pandas DataFrame__. These data structures let us work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

In the following lessons you will learn:

* How to import Pandas
* How to create Pandas Series and DataFrames using various methods
* How to access and change elements in Series and DataFrames
* How to perform arithmetic operations on Series
* How to load data into a DataFrame
* How to deal with Not a Number (NaN) values
  
The following lessons assume that you are already familiar with NumPy and have gone over the previous NumPy lessons. Therefore, to avoid being repetitive we will omit a lot of details already given in the NumPy lessons. Consequently, if you haven't seen the NumPy lessons we suggest you go over them first.

In [2]:
!conda list pandas

# packages in environment at /Users/mekalathuruchenchaiah/Desktop/PROGRAMMING/Projects/GPU-pytorch/env:
#
# Name                    Version                   Build  Channel
pandas                    2.0.3            py38h46d7db6_0  


### Why pandas is so important 

Pandas is built on top of NumPy, which makes it fast and efficient. In addition, Pandas:

* Allows the use of labels for rows and columns
* Can calculate rolling statistics on time series data
* Makes it easy to handle NaN values
* Can load data of different formats into DataFrames
* Can join and merge different datasets
* Integrates with NumPy and Matplotlib
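
A few of the features listed above can be sketched on a tiny Series. The readings and dates below are hypothetical values invented for illustration; the calls used (`isna`, `fillna`, `rolling`) are standard Pandas methods, but this is only a minimal sketch, not a full treatment of time series handling.

```python
import pandas as pd

# Hypothetical daily readings, invented for illustration; one value is missing.
readings = pd.Series(
    [1.0, 2.0, None, 8.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# Labels: look up a value by its date label instead of its position.
first_day = readings.loc[pd.Timestamp("2024-01-01")]

# NaN handling: count the missing entries, then fill them with 0.
n_missing = readings.isna().sum()
filled = readings.fillna(0.0)

# Rolling statistics: mean over a sliding window of 2 consecutive days.
# The first entry is NaN because its window is incomplete.
rolling_mean = filled.rolling(window=2).mean()

print(first_day, n_missing)
print(rolling_mean)
```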

#### Pandas Series

A Pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings, and can optionally carry axis labels.

__Difference between NumPy ndarrays and Pandas Series__

* One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want.
* Another big difference between Pandas Series and NumPy ndarrays is that a single Pandas Series can hold elements of different data types. Such a Series is stored with dtype `object`.
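
The second difference can be seen by passing the same mixed list to both libraries. This is a small sketch; the variable names are illustrative only. Note that a NumPy array can also hold mixed values if `dtype=object` is requested explicitly, so the contrast here is about the default behavior.

```python
import numpy as np
import pandas as pd

mixed = [30, 4, "YES", "NO"]

# NumPy picks one common dtype for the whole array, so the integers become strings.
arr = np.array(mixed)
print(arr.dtype.kind)  # 'U' means a Unicode string dtype

# A Series falls back to dtype `object` and keeps each Python value as it was.
ser = pd.Series(mixed, index=["eggs", "apples", "milk", "bread"])
print(ser.dtype)
print(type(ser["eggs"]), type(ser["milk"]))
```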

Let's begin by creating a Pandas Series. You can create a Pandas Series with the command `pd.Series(data, index)`, where `index` is a list of index labels. Let's use a Pandas Series to store a grocery list. We will use the food items as index labels and the quantity we need to buy of each item as our data.



In [5]:
import pandas as pd 

groceries = pd.Series(data=[30, 4, 'YES', 'NO'], index = ['eggs', 'apples', 'milk', 'bread'])

print(groceries) # each element has a named index label, a benefit of a Pandas Series over a NumPy array

eggs       30
apples      4
milk      YES
bread      NO
dtype: object


In [6]:
# Example 2 - Print attributes - shape, ndim, and size

print("Groceries has a shape : ", groceries.shape)
print("Groceries has a dimension  : ", groceries.ndim)
print("Groceries has total of", groceries.size, 'elements')

Groceries has a shape :  (4,)
Groceries has a dimension  :  1
Groceries has total of 4 elements


In [7]:
# Example 3 - Print attributes - values, and index

print("The data in groceries is : ", groceries.values)
print("The index of groceries is : ", groceries.index )

The data in groceries is :  [30 4 'YES' 'NO']
The index of groceries is :  Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')


In [8]:
# Example 4 - Check if an index is available in the given Series

x = 'bananas' in groceries
y = 'bread' in groceries

print(f"Is bananas in the groceries? {x}")
print(f"Is bread in the groceries? {y}")

Is bananas in the groceries? False
Is bread in the groceries? True
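
One detail worth spelling out: the `in` operator on a Series tests the index labels, not the stored values. The sketch below rebuilds the grocery list from above and uses `isin` as one possible way to test the values instead.

```python
import pandas as pd

groceries = pd.Series([30, 4, "YES", "NO"], index=["eggs", "apples", "milk", "bread"])

# `in` looks at the index labels only.
label_hit = "bread" in groceries       # 'bread' is a label
value_as_label = "YES" in groceries    # 'YES' is a value, not a label

# isin() compares against the values and returns a Boolean Series.
value_hit = groceries.isin(["YES"]).any()

print(label_hit, value_as_label, value_hit)
```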


In [9]:
print(groceries)

eggs       30
apples      4
milk      YES
bread      NO
dtype: object


In [11]:
# A Pandas Series lets us access its data in several ways

print(groceries['eggs']) # access a single element by its index label


30


In [12]:
groceries[['milk', 'bread']] # access several elements at once with a list of labels

milk     YES
bread     NO
dtype: object

In [13]:
groceries[0] # access by integer position; newer pandas versions deprecate this form in favor of groceries.iloc[0]

30

In [14]:
groceries[-1] # access the last element by position (groceries.iloc[-1] is the explicit form)


'NO'

In [15]:
groceries[[0, 1]]

eggs      30
apples     4
dtype: object

In [16]:
groceries.loc[['eggs', 'apples']] # .loc selects by index label


eggs      30
apples     4
dtype: object

In [17]:
groceries.iloc[[2, 3]] # .iloc selects by integer position

milk     YES
bread     NO
dtype: object
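
The distinction between `.loc` and `.iloc` matters most when the index labels are themselves integers. The following sketch uses a hypothetical Series, invented for illustration, whose integer labels are in a shuffled order.

```python
import pandas as pd

# Hypothetical Series whose labels are integers in a shuffled order.
letters = pd.Series(["a", "b", "c"], index=[2, 0, 1])

by_label = letters.loc[0]      # the element whose label is 0
by_position = letters.iloc[0]  # the first element, whatever its label

print(by_label, by_position)   # prints: b a
```

Using `.loc` and `.iloc` explicitly removes any ambiguity about whether an integer is meant as a label or as a position.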

In [18]:
print(groceries)

eggs       30
apples      4
milk      YES
bread      NO
dtype: object


In [22]:
# Series are mutable, so we can reassign a value through its label
groceries['eggs'] = 2
groceries  # the value for 'eggs' has changed from 30 to 2

eggs        2
apples      4
milk      YES
bread      NO
dtype: object

In [24]:
groceries.drop('apples') # returns a new Series without 'apples'; the original is unchanged

eggs       2
milk     YES
bread     NO
dtype: object

In [25]:
# Checking again: 'apples' still exists, because drop() did not modify the original Series
# To delete the entry from the original Series, pass the parameter inplace=True
groceries

eggs        2
apples      4
milk      YES
bread      NO
dtype: object

In [26]:
groceries.drop('apples', inplace=True)




In [28]:
groceries # 'apples' has been deleted from the original Series

eggs       2
milk     YES
bread     NO
dtype: object
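
As a complement to `inplace=True`, the sketch below shows two related patterns: reassigning the result of `drop`, and the `errors="ignore"` option for labels that may already be gone. The variable names are illustrative only.

```python
import pandas as pd

groceries = pd.Series([2, 4, "YES", "NO"], index=["eggs", "apples", "milk", "bread"])

# drop() returns a new Series; the original still has all four entries.
without_apples = groceries.drop("apples")
original_size = groceries.size

# Reassigning the name is a common alternative to inplace=True.
groceries = groceries.drop("apples")

# Dropping a label that no longer exists raises KeyError by default;
# errors="ignore" makes the call return the data unchanged instead.
still_same = groceries.drop("apples", errors="ignore")

print(without_apples.index.tolist(), original_size)
```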

### Arithmetic Operations on Pandas Series

We can also perform element-wise arithmetic operations on a Series, just as we did with NumPy arrays.

In [30]:
import pandas as pd 

fruits = pd.Series([10, 15, 20], ['apples', 'oranges', 'bananas'])

fruits

apples     10
oranges    15
bananas    20
dtype: int64

In [32]:
# Element-wise basic arithmetic operations
fruits + 2


apples     12
oranges    17
bananas    22
dtype: int64

In [33]:
fruits - 2


apples      8
oranges    13
bananas    18
dtype: int64

In [34]:
fruits * 2

apples     20
oranges    30
bananas    40
dtype: int64

In [35]:
fruits / 2

apples      5.0
oranges     7.5
bananas    10.0
dtype: float64
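
The examples above combine a Series with a single number. When two Series are combined, Pandas matches the elements by index label rather than by position. The sketch below uses two small hypothetical Series, invented for illustration, to show this behavior.

```python
import pandas as pd

# Hypothetical stock and delivery counts; the labels only partly overlap.
stock = pd.Series([10, 15], index=["apples", "oranges"])
delivery = pd.Series([1, 2], index=["oranges", "bananas"])

# Values are matched by label; a label missing on either side gives NaN.
total = stock + delivery

# add() with fill_value treats a missing label as 0 instead.
total_filled = stock.add(delivery, fill_value=0)

print(total)
print(total_filled)
```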

In [37]:
# Import NumPy so we can apply its mathematical functions to a Series
import numpy as np 

fruits


apples     10
oranges    15
bananas    20
dtype: int64

In [39]:
np.sqrt(fruits) # Square roots of the fruits

apples     3.162278
oranges    3.872983
bananas    4.472136
dtype: float64

In [40]:
np.exp(fruits)

apples     2.202647e+04
oranges    3.269017e+06
bananas    4.851652e+08
dtype: float64

In [41]:
np.power(fruits, 2)

apples     100
oranges    225
bananas    400
dtype: int64
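
Note that the NumPy functions above return a Pandas Series, not a bare array, and the index labels are carried over. A short sketch that checks this, and compares `np.power` with the `**` operator:

```python
import numpy as np
import pandas as pd

fruits = pd.Series([10, 15, 20], index=["apples", "oranges", "bananas"])

# The result of a NumPy function is still a Series with the same labels.
roots = np.sqrt(fruits)
print(type(roots).__name__, roots.index.tolist())

# The ** operator gives the same values as np.power(fruits, 2).
squares = fruits ** 2
print(squares.equals(np.power(fruits, 2)))
```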

In [42]:
fruits

apples     10
oranges    15
bananas    20
dtype: int64

In [43]:
fruits['bananas'] + 2

22

In [44]:
fruits[0] - 2 

8

In [45]:
fruits[['apples', 'oranges']] * 2

apples     20
oranges    30
dtype: int64

In [46]:
fruits.loc[['apples', 'oranges']]/2

apples     5.0
oranges    7.5
dtype: float64

In [49]:
# Now let's recreate the groceries Series, which mixes numbers and strings

groceries  = pd.Series([30, 60, 'Yes', 'No'], ['eggs', 'apples', 'milk', 'bread'])
groceries

eggs       30
apples     60
milk      Yes
bread      No
dtype: object

In [51]:
groceries * 2 # the strings are repeated rather than multiplied numerically, so be careful with mixed data

eggs          60
apples       120
milk      YesYes
bread       NoNo
dtype: object
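
One possible way to guard against this is to convert the Series to numbers first, so that non-numeric entries become NaN instead of being repeated. The sketch below uses `pd.to_numeric` with `errors="coerce"`; whether NaN is the right outcome for the strings depends on the use case.

```python
import pandas as pd

groceries = pd.Series([30, 60, "Yes", "No"], index=["eggs", "apples", "milk", "bread"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
quantities = pd.to_numeric(groceries, errors="coerce")

# Arithmetic now only affects the real quantities; NaN stays NaN.
doubled = quantities * 2

# Keep only the entries that were numeric to begin with.
numeric_only = doubled.dropna()

print(numeric_only)
```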

### Manipulate a Series

In [54]:

import pandas as pd

# DO NOT CHANGE THE VARIABLE NAMES

# Given a list representing a few planets
planets = ['Earth', 'Saturn', 'Venus', 'Mars', 'Jupiter']

# Given another list representing the distance of each of these planets from the Sun
# The distance from the Sun is in units of 10^6 km
distance_from_sun = [149.6, 1433.5, 108.2, 227.9, 778.6]

In [55]:
# TO DO: Create a Pandas Series "dist_planets" using the lists above, representing the distance of the planet from the Sun.
# Use the `distance_from_sun` as your data, and `planets` as your index.
dist_planets = pd.Series(distance_from_sun, planets)
dist_planets

Earth       149.6
Saturn     1433.5
Venus       108.2
Mars        227.9
Jupiter     778.6
dtype: float64

In [56]:
# TO DO: Calculate the time (minutes) it takes light from the Sun to reach each planet. 
# You can do this by dividing each planet's distance from the Sun by the speed of light.
# Use the speed of light, c = 18, since light travels 18 x 10^6 km/minute.
time_light = dist_planets / 18
time_light

Earth       8.311111
Saturn     79.638889
Venus       6.011111
Mars       12.661111
Jupiter    43.255556
dtype: float64

In [58]:
# TO DO: Use Boolean indexing to select only those planets for which sunlight takes less
# than 40 minutes to reach them.
close_planets = time_light[time_light < 40.0]
close_planets

Earth     8.311111
Venus     6.011111
Mars     12.661111
dtype: float64
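
Boolean indexing works in two steps: the comparison produces a Boolean Series (a mask), and indexing with that mask keeps only the entries marked `True`. The sketch below makes the mask explicit and also combines two conditions with `&`. The lower bound of 7 minutes is an arbitrary value chosen for illustration.

```python
import pandas as pd

planets = ["Earth", "Saturn", "Venus", "Mars", "Jupiter"]
distance_from_sun = [149.6, 1433.5, 108.2, 227.9, 778.6]  # in units of 10^6 km

time_light = pd.Series(distance_from_sun, index=planets) / 18

# Step 1: the comparison yields one True/False value per planet.
mask = time_light < 40
print(mask)

# Step 2: indexing with the mask keeps only the True entries.
close_planets = time_light[mask]

# Conditions are combined with & (and) or | (or); each needs its own parentheses.
mid_range = time_light[(time_light > 7) & (time_light < 40)]
print(mid_range)
```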

In [60]:
import pandas as pd

distance_from_sun = [149.6, 1433.5, 108.2, 227.9, 778.6]

planets = ['Earth','Saturn', 'Venus', 'Mars', 'Jupiter']

dist_planets = pd.Series(data = distance_from_sun, index = planets)

time_light = dist_planets / 18

close_planets = time_light[time_light < 40]
close_planets

Earth     8.311111
Venus     6.011111
Mars     12.661111
dtype: float64