# Pandas

## Introduction to Pandas

**Pandas** is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas incorporates two additional data structures into Python, namely **Pandas Series** and **Pandas DataFrame**. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

In these lessons, you will learn:

- How to import Pandas
- How to create Pandas Series and DataFrames using various methods
- How to access and change elements in Series and DataFrames
- How to perform arithmetic operations on Series
- How to load data into a DataFrame
- How to deal with Not a Number (NaN) values

## Why Use Pandas?

The recent success of machine learning algorithms is partly due to the huge amounts of data available to train them on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important. Large datasets often don’t come ready to be fed into your learning algorithms. More often than not, they have missing values, outliers, incorrect values, and so on. Data with a lot of missing or bad values will not allow your machine learning algorithms to perform well. Therefore, one very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis. This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, and are flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:

- Allows the use of labels for rows and columns
- Can calculate rolling statistics on time series data
- Easy handling of NaN values
- Is able to load data of different formats into DataFrames
- Can join and merge different datasets together
- It integrates with NumPy and Matplotlib

For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.



-------------

# Creating Pandas Series
- pd.Series(data, index), where index is a list of index labels
    - series.shape
    - series.size
    - series.ndim
    - series.index
    - series.values

A Pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want. Another big difference is that a single Pandas Series can hold data of different data types.

In [1]:
import pandas as pd

Unlike NumPy ndarrays, a Pandas Series can hold data of different types, and its index labels can be strings.

In [2]:
groceries = pd.Series(data=[30, 6, "Yes", "No"], index=["egg", "apple", "milk", "bread"])
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [3]:
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

print()

print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements

The data in Groceries is: [30 6 'Yes' 'No']
The index of Groceries is: Index(['egg', 'apple', 'milk', 'bread'], dtype='object')


We can check whether a specific label is present in the index by using the in keyword.

In [4]:
"banana" in groceries.index

False

In [5]:
"apple" in groceries.index

True

In [6]:
# We check whether bananas is a food item (an index) in Groceries
x = 'bananas' in groceries

# We check whether bread is a food item (an index) in Groceries
y = 'bread' in groceries

# We print the results
print('Is bananas an index label in Groceries:', x)
print('Is bread an index label in Groceries:', y)

Is bananas an index label in Groceries: False
Is bread an index label in Groceries: True
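
One point the checks above leave implicit is that `in` applied to a Series tests the index labels, not the stored data. The following is a minimal sketch of that behavior using the same groceries data; the variable names `label_found`, `value_as_label`, and `value_found` are introduced here only for illustration.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# `in` on the Series itself looks at the index labels, not the data
label_found = "egg" in groceries       # "egg" is an index label
value_as_label = 30 in groceries       # 30 is data, not an index label

# To test the stored data, look inside the values explicitly
value_found = 30 in groceries.tolist()

print(label_found, value_as_label, value_found)
```

Converting to a plain list with `.tolist()` is just one simple way to search the data; it is used here because it behaves predictably for mixed data types.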


-----------------

## Accessing and Deleting Elements in Pandas Series

One great advantage of Pandas Series is that it allows us to access data in many different ways. Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays. Since we can use numerical indices, we can use both positive and negative integers to access data from the beginning or from the end of the Series, respectively. Because elements can be accessed in various ways, Pandas Series have two attributes, **.loc** and **.iloc**, that remove any ambiguity as to whether we are referring to an index label or a numerical index. The attribute .loc stands for location and is used to explicitly state that we are using a labeled index. Similarly, the attribute .iloc stands for integer location and is used to explicitly state that we are using a numerical index.
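
The ambiguity is easiest to see when the index labels are themselves integers. The sketch below uses a hypothetical Series named `stock` (not part of the original notebook) whose labels are in a different order from the positions, so the two accessors return different elements.

```python
import pandas as pd

# Hypothetical Series whose index labels are integers in reverse order
stock = pd.Series(data=[5, 7, 9], index=[2, 1, 0])

by_label = stock.loc[0]      # element whose label is 0 -> 9
by_position = stock.iloc[0]  # element in the first position -> 5

print(by_label, by_position)
```

Using .loc and .iloc makes the intent explicit, regardless of what kind of labels the Series has.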

### Accessing Pandas Series by Index Label

In [7]:
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [8]:
# We use a single index label
groceries["egg"]

30

In [9]:
#via list of index labels
# we can access multiple index labels
groceries[["milk", "bread"]]

milk     Yes
bread     No
dtype: object

### Accessing Pandas Series by Numerical Indices
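This works the same way as indexing NumPy ndarrays. Note that recent Pandas releases deprecate integer positions inside plain square brackets when the index labels are not integers; .iloc is the explicit alternative.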

In [25]:
# We use a single numerical index
groceries[0]

30

In [26]:
# We use a negative numerical index
print('Do we need bread:\n', groceries[[-1]]) 

Do we need bread:
 bread    No
dtype: object


In [11]:
groceries[2:4]

milk     Yes
bread     No
dtype: object

In [12]:
# we use multiple numerical indices
groceries[[2,3]]

milk     Yes
bread     No
dtype: object

## How to differentiate between label index and numerical index?
- **series.loc[]** : label index, label location
- **series.iloc[]**: numerical index, integer location

In [13]:
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [14]:
# label location
# we use loc to access multiple index labels
groceries.loc[["egg", "apple"]]

egg      30
apple     6
dtype: object

In [15]:
#integer location
# we use iloc to access multiple numerical indices
groceries.iloc[[0,1]]

egg      30
apple     6
dtype: object

## Changing Pandas Series
Pandas Series are mutable, like NumPy ndarrays, so we can change their elements after creation.

In [16]:
groceries

egg       30
apple      6
milk     Yes
bread     No
dtype: object

In [17]:
groceries.iloc[[0]] = 32
groceries

egg       32
apple      6
milk     Yes
bread     No
dtype: object

In [18]:
groceries.loc[["egg"]] = 33
groceries

egg       33
apple      6
milk     Yes
bread     No
dtype: object

In [19]:
groceries["egg"] = 35
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

## Deleting elements from Pandas Series
- series.drop(label, inplace=False): returns a modified copy of the Series and leaves the original Series unchanged (this is the default)
- series.drop(label, inplace=True): modifies the original Series in place

By default, the .drop() method returns a new Series. We can delete items from a Pandas Series in place by setting the keyword inplace to True in the .drop() method.

In [20]:
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

In [21]:
new = groceries.drop("bread")
new

egg       35
apple      6
milk     Yes
dtype: object

In [22]:
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

In [23]:
#now original series got changed
groceries.drop("bread", inplace=True)
groceries

egg       35
apple      6
milk     Yes
dtype: object
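
As a small extension of the examples above, .drop() also accepts a list of labels, and by default it raises a KeyError for a label that is not in the index. The sketch below illustrates both points; the names `remaining` and `missing_label_error` are introduced here for illustration only.

```python
import pandas as pd

groceries = pd.Series(data=[35, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# Passing a list of labels drops several elements at once;
# the original Series is untouched because inplace defaults to False
remaining = groceries.drop(["milk", "bread"])

# Dropping a label that does not exist raises a KeyError by default
try:
    groceries.drop("banana")
    missing_label_error = False
except KeyError:
    missing_label_error = True

print(remaining)
print(len(groceries), missing_label_error)
```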

-------------------

## Arithmetic Operations on Pandas Series

Just like with NumPy ndarrays, we can perform element-wise arithmetic operations on Pandas Series. 
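
Element-wise arithmetic also works between two Series. The sketch below is only an illustration with two hypothetical Series, `store_a` and `store_b`, that share the same index labels; elements with the same label are combined.

```python
import pandas as pd

# Hypothetical stock counts for two stores, sharing the same index labels
store_a = pd.Series([10, 6, 3], index=["apples", "oranges", "bananas"])
store_b = pd.Series([1, 2, 3], index=["apples", "oranges", "bananas"])

# Adding two Series combines the elements that share an index label
total = store_a + store_b
print(total)
```

The cells that follow focus on the simpler case of combining a Series with a single number.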


In [1]:
import pandas as pd

In [5]:
fruits = pd.Series([10,6,3], ["apples", "oranges", "bananas"])
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [6]:
fruits + 2

apples     12
oranges     8
bananas     5
dtype: int64

In [9]:
fruits - 2

apples     8
oranges    4
bananas    1
dtype: int64

In [10]:
fruits / 2

apples     5.0
oranges    3.0
bananas    1.5
dtype: float64

In [11]:
fruits * 2

apples     20
oranges    12
bananas     6
dtype: int64

In [40]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols
print()
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print()
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print()
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
print()
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
print()

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

fruits + 2:
 apples     12
oranges     8
bananas     5
dtype: int64

fruits - 2:
 apples     8
oranges    4
bananas    1
dtype: int64

fruits * 2:
 apples     20
oranges    12
bananas     6
dtype: int64

fruits / 2:
 apples     5.0
oranges    3.0
bananas    1.5
dtype: float64



#### We can also use mathematical functions from NumPy to operate on Pandas Series
- np.sqrt(pandas series)
- np.exp(pandas series)
- np.power(pandas series, power of x)

In [25]:
import numpy as np

In [26]:
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [27]:
np.sqrt(fruits)

apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

In [28]:
np.exp(fruits)

apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

In [23]:
np.power(fruits, 2)

apples     100
oranges     36
bananas      9
dtype: int64

In [41]:
# We print fruits for reference
print('Original grocery list of fruits:\n', fruits)

# We apply different mathematical functions to all elements of fruits
print()
print('EXP(X) = \n', np.exp(fruits))
print() 
print('SQRT(X) =\n', np.sqrt(fruits))
print()
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2

Original grocery list of fruits:
 apples     10
oranges     6
bananas     3
dtype: int64

EXP(X) = 
 apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

SQRT(X) =
 apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

POW(X,2) =
 apples     100
oranges     36
bananas      9
dtype: int64


------

### Using loc and iloc on Pandas Series

In [33]:
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [32]:
fruits["bananas"] + 2

5

In [34]:
fruits[["oranges", "apples"]] + 10

oranges    16
apples     20
dtype: int64

In [35]:
fruits.loc["apples"] * 2

20

In [36]:
fruits.iloc[2]*2

6

In [42]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)
print()

# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
print()

# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
print()

# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
print()

# We divide apples and oranges by 2
print('We half the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

Amount of bananas + 2 =  5

Amount of apples - 2 =  8

We double the amount of apples and oranges:
 apples     20
oranges    12
dtype: int64

We half the amount of apples and oranges:
 apples     5.0
oranges    3.0
dtype: float64


--------

### Arithmetic operations on Pandas Series with mixed data types

In [37]:
groceries = pd.Series(data=[30,6,"Yes","No"], index=["eggs", "apples", "milk", "bread"])
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

If we multiply by 2, Pandas doubles the data of each item, including the strings. Pandas can do this because the multiplication operation * is defined for both numbers and strings.

In [38]:
groceries * 2

eggs          60
apples        12
milk      YesYes
bread       NoNo
dtype: object

If you apply an operation that is valid for numbers but not for strings, such as /, you will get an error. So when you have mixed data types in your Pandas Series, make sure the arithmetic operations are valid for all the data types of your elements.

In [39]:
groceries / 2

TypeError: unsupported operand type(s) for /: 'str' and 'int'
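
One possible way around this error, shown here only as a sketch and not as part of the original notebook, is to convert the entries to numbers first. `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed into NaN, after which the numeric entries can be divided. The names `numeric` and `halved` are introduced for illustration.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["eggs", "apples", "milk", "bread"])

# Convert entries to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(groceries, errors="coerce")

# Drop the NaN entries, then divide the remaining numeric values
halved = numeric.dropna() / 2
print(halved)
```

Whether discarding the non-numeric entries is appropriate depends on the data; in many cases it is better to keep numbers and strings in separate Series from the start.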

----------

## Manipulate a Series (Exercises)

In [58]:
import pandas as pd

# Create a Pandas Series that contains the distance of some planets from the Sun.
# Use the name of the planets as the index to your Pandas Series, and the distance
# from the Sun as your data. The distance from the Sun is in units of 10^6 km

distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]

planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']

# Create a Pandas Series using the above data, with the name of the planets as
# the index and the distance from the Sun as your data.
dist_planets = pd.Series(data=distance_from_sun, index=planets) 
print(dist_planets)
print()
# Calculate the number of minutes it takes sunlight to reach each planet. You can
# do this by dividing the distance from the Sun for each planet by the speed of light.
# Since in the data above the distance from the Sun is in units of 10^6 km, you can
# use a value for the speed of light of c = 18, since light travels 18 x 10^6 km/minute.

time_light = dist_planets / 18
print(time_light)
print()

# Use Boolean indexing to select only those planets for which sunlight takes less
# than 40 minutes to reach them.
close_planets = time_light[time_light < 40]
print(close_planets)

Earth       149.6
Saturn     1433.5
Mars        227.9
Venus       108.2
Jupiter     778.6
dtype: float64

Earth       8.311111
Saturn     79.638889
Mars       12.661111
Venus       6.011111
Jupiter    43.255556
dtype: float64

Earth     8.311111
Mars     12.661111
Venus     6.011111
dtype: float64


----------

# Creating Pandas DataFrames

Pandas DataFrames are two-dimensional data structures with labeled rows and columns that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a spreadsheet. We can create Pandas DataFrames manually or by loading data from a file.

In [102]:
import pandas as pd

In [103]:
items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice":pd.Series(data=[40, 110, 500, 45], index=["book", "glasses", "bike", "pants"])
}

type(items)

dict

In [104]:
shopping_carts = pd.DataFrame(items)
type(shopping_carts)

pandas.core.frame.DataFrame

DataFrames are displayed in tabular form, much like an Excel spreadsheet, with the row and column labels in bold, as the output of the next cell shows. Notice that the row labels of the DataFrame are built from the union of the index labels of the two Pandas Series we used to construct the dictionary, and that the column labels are taken from the keys of the dictionary. In the output shown here, the columns keep the order given in the dictionary (Bob, then Alice), while the row labels appear in sorted order. Some older versions of Pandas arranged the columns alphabetically instead.

The last thing to point out is that some NaN values appear in the DataFrame. NaN stands for Not a Number, and it is Pandas' way of indicating that it doesn't have a value for that particular row and column. For example, the Alice column has NaN in the watch row. You can see why by looking at the dictionary we created at the beginning: it has no watch entry for Alice. So whenever a DataFrame is created, if a particular column doesn't have a value for a particular row index, Pandas puts a NaN there. If we were to feed this data into a machine learning algorithm, we would have to remove these NaN values first. A later lesson covers how to deal with NaN values and clean our data. For now, we will leave these values in our DataFrame.

In [105]:
shopping_carts

|         | Bob   | Alice |
|---------|-------|-------|
| bike    | 245.0 | 500.0 |
| book    | NaN   | 40.0  |
| glasses | NaN   | 110.0 |
| pants   | 25.0  | 45.0  |
| watch   | 55.0  | NaN   |
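
To see where the missing values are without reading the table by eye, one option is to count them per column. This is only a brief sketch; the name `missing_per_column` is introduced for illustration.

```python
import pandas as pd

items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice": pd.Series(data=[40, 110, 500, 45],
                       index=["book", "glasses", "bike", "pants"]),
}
shopping_carts = pd.DataFrame(items)

# isna() marks each missing entry with True; sum() counts them per column
missing_per_column = shopping_carts.isna().sum()
print(missing_per_column)
```

Bob has no book or glasses and Alice has no watch, so the counts are 2 and 1.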


#### without index labels

In [106]:
# Pandas Series without indexes
#Pandas indexes the rows of the DataFrame starting from 0, just like NumPy indexes ndarrays.

data =  {
    "Bob": pd.Series(data=[245, 25, 55]),
    "Alice":pd.Series(data=[40, 110, 500, 45])
}

items = pd.DataFrame(data)
items


|   | Bob   | Alice |
|---|-------|-------|
| 0 | 245.0 | 40    |
| 1 | 25.0  | 110   |
| 2 | 55.0  | 500   |
| 3 | NaN   | 45    |


### Like Pandas Series, a DataFrame can provide the following info
- df.index
- df.values
- df.columns
- df.shape
- df.size
- df.ndim

In [107]:
shopping_carts

|         | Bob   | Alice |
|---------|-------|-------|
| bike    | 245.0 | 500.0 |
| book    | NaN   | 40.0  |
| glasses | NaN   | 110.0 |
| pants   | 25.0  | 45.0  |
| watch   | 55.0  | NaN   |


In [108]:
shopping_carts.index

Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

In [109]:
shopping_carts.values

array([[245., 500.],
       [ nan,  40.],
       [ nan, 110.],
       [ 25.,  45.],
       [ 55.,  nan]])

In [110]:
shopping_carts.columns

Index(['Bob', 'Alice'], dtype='object')

In [111]:
shopping_carts.shape

(5, 2)

In [71]:
shopping_carts.size

10

In [72]:
shopping_carts.ndim

2

In [112]:

# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
 [[245. 500.]
 [ nan  40.]
 [ nan 110.]
 [ 25.  45.]
 [ 55.  nan]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Bob', 'Alice'], dtype='object')


In [113]:
items

|   | Bob   | Alice |
|---|-------|-------|
| 0 | 245.0 | 40    |
| 1 | 25.0  | 110   |
| 2 | 55.0  | 500   |
| 3 | NaN   | 45    |


### Create a DataFrame with Specific Columns or Row Labels Only

There might be cases when you are only interested in a subset of the data. Pandas allows us to select which data we want to put into our DataFrame by means of the keywords **columns** and **index**.

In [75]:
items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice":pd.Series(data=[40, 110, 500, 45], index=["book", "glasses", "bike", "pants"])
}

In [76]:
items

{'Bob': bike     245
 pants     25
 watch     55
 dtype: int64,
 'Alice': book        40
 glasses    110
 bike       500
 pants       45
 dtype: int64}

#### by using columns keyword (for Columns)

In [79]:
Bob_shopping_cart = pd.DataFrame(items, columns=["Bob"])
Bob_shopping_cart

|       | Bob |
|-------|-----|
| bike  | 245 |
| pants | 25  |
| watch | 55  |


#### by using index keyword (for row labels)

In [81]:
#a DataFrame that only has selected items for both Alice and Bob
selected_shopping_carts = pd.DataFrame(items, index=["pants", "book"])
selected_shopping_carts

|       | Bob  | Alice |
|-------|------|-------|
| pants | 25.0 | 45    |
| book  | NaN  | 40    |


#### specific labels with specific columns (Columns + Labels)

In [84]:
#DataFrame that only has selected items for Alice
alice_selected_shopping_carts = pd.DataFrame(items, columns=["Alice"], index=["glasses", "bike"])
alice_selected_shopping_carts

|         | Alice |
|---------|-------|
| glasses | 110   |
| bike    | 500   |


-------

## Creating a DataFrame from a Dictionary of Lists (Arrays)
#### Float / Integers Data Frame

All the lists (arrays) in the dictionary must be of the same length.
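
The sketch below shows what happens when this requirement is not met. The dictionary `bad_data` is a hypothetical variation of the data used in this section, with one list deliberately shortened; Pandas raises a ValueError rather than padding the shorter list.

```python
import pandas as pd

# Hypothetical data where the two lists have different lengths
bad_data = {
    "Integers": [1, 2, 3, 4],
    "Floats": [4.6, 2.8, 5.1],   # one element short
}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as err:
    raised = True
    print("Could not build the DataFrame:", err)
```

This differs from a dictionary of Pandas Series, where Series of different lengths are aligned by index label and the gaps are filled with NaN.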

In [88]:
data = {
    "Integers": [1,2,3,4],
    "Floats": [4.6, 2.8, 5.1, 3.2]
}

In [92]:
# Pandas automatically creates numeric row labels by default
df = pd.DataFrame(data)
df

|   | Integers | Floats |
|---|----------|--------|
| 0 | 1        | 4.6    |
| 1 | 2        | 2.8    |
| 2 | 3        | 5.1    |
| 3 | 4        | 3.2    |


In [93]:
df = pd.DataFrame(data, index=["label1", "label2", "label3", "label4"])
df

|        | Integers | Floats |
|--------|----------|--------|
| label1 | 1        | 4.6    |
| label2 | 2        | 2.8    |
| label3 | 3        | 5.1    |
| label4 | 4        | 3.2    |


#### DataFrame from a list of dictionaries

In [94]:
items = [
    {
        "bikes": 20,
        "pants": 30,
        "watches": 35},
    {
        "watches": 10,
        "glasses": 50,
        "bikes": 15,
        "pants": 5
    }
]

In [98]:
store_items = pd.DataFrame(items)
store_items

|   | bikes | pants | watches | glasses |
|---|-------|-------|---------|---------|
| 0 | 20    | 30    | 35      | NaN     |
| 1 | 15    | 5     | 10      | 50.0    |


In [99]:
new_store_items = pd.DataFrame(items, index=["Store1", "Store2"])
new_store_items

|        | bikes | pants | watches | glasses |
|--------|-------|-------|---------|---------|
| Store1 | 20    | 30    | 35      | NaN     |
| Store2 | 15    | 5     | 10      | 50.0    |


---------------

## Accessing Elements in Pandas DataFrames