# Pandas

## Introduction to Pandas

**Pandas** is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas adds two data structures to Python, namely **Pandas Series** and **Pandas DataFrame**. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

In these lessons, you will learn:

- How to import Pandas
- How to create Pandas Series and DataFrames using various methods
- How to access and change elements in Series and DataFrames
- How to perform arithmetic operations on Series
- How to load data into a DataFrame
- How to deal with Not a Number (NaN) values

## Why Use Pandas?

The recent success of machine learning algorithms is partly due to the huge amounts of data that we have available to train our algorithms on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important. Large datasets often don't come ready to be fed into your learning algorithms. More often than not, they have missing values, outliers, incorrect values, and so on. Having data with a lot of missing or bad values, for example, is not going to allow your machine learning algorithms to perform well. Therefore, one very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis. This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, as well as being flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:

- Allows the use of labels for rows and columns
- Can calculate rolling statistics on time series data
- Easy handling of NaN values
- Is able to load data of different formats into DataFrames
- Can join and merge different datasets together
- It integrates with NumPy and Matplotlib

For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.



-------------

# Creating Pandas Series
- pd.Series(data, index), where index is a list of index labels
    - series.shape
    - series.size
    - series.ndim
    - series.index
    - series.values

A Pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want. Another big difference between Pandas Series and NumPy ndarrays is that Pandas Series can hold data of different data types.

In [1]:
import pandas as pd

Unlike a NumPy ndarray, a Pandas Series can hold data of different types, and its index labels can also be strings.

In [2]:
groceries = pd.Series(data=[30, 6, "Yes", "No"], index=["egg", "apple", "milk", "bread"])
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [3]:
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

print()

print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements

The data in Groceries is: [30 6 'Yes' 'No']
The index of Groceries is: Index(['egg', 'apple', 'milk', 'bread'], dtype='object')


To check whether a specific label is included in the index:

In [4]:
"banana" in groceries.index

False

In [5]:
"apple" in groceries.index

True

In [6]:
# We check whether bananas is a food item (an index) in Groceries
x = 'bananas' in groceries

# We check whether bread is a food item (an index) in Groceries
y = 'bread' in groceries

# We print the results
print('Is bananas an index label in Groceries:', x)
print('Is bread an index label in Groceries:', y)

Is bananas an index label in Groceries: False
Is bread an index label in Groceries: True
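The membership checks above test index labels, not the stored data. A minimal sketch of the difference, reusing the groceries Series from this section (the variable names are only for illustration):

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# `in` on a Series looks at the index labels, not at the stored values
label_found = "egg" in groceries      # "egg" is a label
value_as_label = 30 in groceries      # 30 is a value, not a label

# To test the values, look in .values (a NumPy array) instead
value_found = 30 in groceries.values

print(label_found, value_as_label, value_found)
```

So a value that is present in the data is still reported as missing when it is tested against the Series itself.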


-----------------

## Accessing and Deleting Elements in Pandas Series

One great advantage of Pandas Series is that it allows us to access data in many different ways. Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays. Since we can use numerical indices, we can use both positive and negative integers to access data from the beginning or from the end of the Series, respectively. Because elements can be accessed in several ways, Pandas Series have two attributes, **.loc** and **.iloc**, that remove any ambiguity as to whether we are referring to an index label or a numerical index. The attribute .loc stands for location and is used to explicitly state that we are using a labeled index. Similarly, the attribute .iloc stands for integer location and is used to explicitly state that we are using a numerical index.
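The ambiguity is easiest to see when the labels are themselves integers. The following is a small sketch with a hypothetical Series (not part of the lesson data) whose labels run in reverse order:

```python
import pandas as pd

# Hypothetical Series whose labels are integers in reverse order
s = pd.Series(data=["a", "b", "c"], index=[2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, whatever its label is

print(by_label, by_position)
```

Here the label 0 and the position 0 refer to different elements, which is why stating .loc or .iloc explicitly is safer.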

### Accessing Pandas Series by Index Label

In [7]:
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [8]:
# We use a single index label
groceries["egg"]

30

In [9]:
# We access multiple elements using a list of index labels
groceries[["milk", "bread"]]

milk     Yes
bread     No
dtype: object

### Accessing Pandas Series by Numerical Indices
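This works the same way as in NumPy. Note that on a Series with string labels, recent Pandas versions deprecate plain integer positions inside [ ] (such as groceries[0]) and recommend .iloc instead.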

In [10]:
# We use a single numerical index
groceries[0]

30

In [11]:
# We use a negative numerical index
print('Do we need bread:\n', groceries[[-1]]) 

Do we need bread:
 bread    No
dtype: object


In [12]:
groceries[2:4]

milk     Yes
bread     No
dtype: object

In [13]:
# we use multiple numerical indices
groceries[[2,3]]

milk     Yes
bread     No
dtype: object

## How to differentiate between label index and numerical index?
- **series.loc[]** : label index, label location
- **series.iloc[]**: numerical index, integer location

In [14]:
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [15]:
# label location
# We use loc to access multiple index labels
groceries.loc[["egg", "apple"]]

egg      30
apple     6
dtype: object

In [16]:
# integer location
# We use iloc to access multiple numerical indices
groceries.iloc[[0,1]]

egg      30
apple     6
dtype: object

## Changing Pandas Series
Pandas Series are mutable, like NumPy ndarrays, so we can change their elements after creation.

In [17]:
groceries

egg       30
apple      6
milk     Yes
bread     No
dtype: object

In [18]:
groceries.iloc[[0]] = 32
groceries

egg       32
apple      6
milk     Yes
bread     No
dtype: object

In [19]:
groceries.loc[["egg"]] = 33
groceries

egg       33
apple      6
milk     Yes
bread     No
dtype: object

In [20]:
groceries["egg"] = 35
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

## Deleting elements from Pandas Series
- series.drop(label index, inplace=False) : returns the modified Series and leaves the original Series unchanged
- series.drop(label index, inplace=True) : modifies the original Series

By default, .drop() works out of place. We can delete items from a Pandas Series in place by setting the keyword inplace to True in the .drop() method.

In [21]:
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

In [22]:
new = groceries.drop("bread")
new

egg       35
apple      6
milk     Yes
dtype: object

In [23]:
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

In [24]:
# now the original Series is changed
groceries.drop("bread", inplace=True)
groceries

egg       35
apple      6
milk     Yes
dtype: object
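As a small extension of the examples above, .drop() also accepts a list of labels. The sketch below assumes the groceries Series as it looked before the in-place drop; the label "bananas" is a made-up label that is not in the index:

```python
import pandas as pd

groceries = pd.Series(data=[35, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# Passing a list drops several labels at once and returns a new Series
fewer = groceries.drop(["milk", "bread"])

# errors="ignore" skips labels that are not present instead of raising KeyError
same = groceries.drop("bananas", errors="ignore")

print(list(fewer.index))
print(len(same), len(groceries))
```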

-------------------

## Arithmetic Operations on Pandas Series

Just like with NumPy ndarrays, we can perform element-wise arithmetic operations on Pandas Series. 


In [25]:
import pandas as pd

In [26]:
fruits = pd.Series([10,6,3], ["apples", "oranges", "bananas"])
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [27]:
fruits + 2

apples     12
oranges     8
bananas     5
dtype: int64

In [28]:
fruits - 2

apples     8
oranges    4
bananas    1
dtype: int64

In [29]:
fruits / 2

apples     5.0
oranges    3.0
bananas    1.5
dtype: float64

In [30]:
fruits * 2

apples     20
oranges    12
bananas     6
dtype: int64

In [31]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)

# We perform basic element-wise operations using arithmetic symbols
print()
print('fruits + 2:\n', fruits + 2) # We add 2 to each item in fruits
print()
print('fruits - 2:\n', fruits - 2) # We subtract 2 from each item in fruits
print()
print('fruits * 2:\n', fruits * 2) # We multiply each item in fruits by 2 
print()
print('fruits / 2:\n', fruits / 2) # We divide each item in fruits by 2
print()

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

fruits + 2:
 apples     12
oranges     8
bananas     5
dtype: int64

fruits - 2:
 apples     8
oranges    4
bananas    1
dtype: int64

fruits * 2:
 apples     20
oranges    12
bananas     6
dtype: int64

fruits / 2:
 apples     5.0
oranges    3.0
bananas    1.5
dtype: float64



#### We can use mathematical functions from NumPy to operate on Pandas Series
- np.sqrt(pandas series)
- np.exp(pandas series)
- np.power(pandas series, power of x)

In [32]:
import numpy as np

In [33]:
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [34]:
np.sqrt(fruits)

apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

In [35]:
np.exp(fruits)

apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

In [36]:
np.power(fruits, 2)

apples     100
oranges     36
bananas      9
dtype: int64

In [37]:
# We print fruits for reference
print('Original grocery list of fruits:\n', fruits)

# We apply different mathematical functions to all elements of fruits
print()
print('EXP(X) = \n', np.exp(fruits))
print() 
print('SQRT(X) =\n', np.sqrt(fruits))
print()
print('POW(X,2) =\n',np.power(fruits,2)) # We raise all elements of fruits to the power of 2

Original grocery list of fruits:
 apples     10
oranges     6
bananas     3
dtype: int64

EXP(X) = 
 apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

SQRT(X) =
 apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

POW(X,2) =
 apples     100
oranges     36
bananas      9
dtype: int64


------

### Using loc and iloc on Pandas Series

In [38]:
fruits

apples     10
oranges     6
bananas     3
dtype: int64

In [39]:
fruits["bananas"] + 2

5

In [40]:
fruits[["oranges", "apples"]] + 10

oranges    16
apples     20
dtype: int64

In [41]:
fruits.loc["apples"] * 2

20

In [42]:
fruits.iloc[2]*2

6

In [43]:
# We print fruits for reference
print('Original grocery list of fruits:\n ', fruits)
print()

# We add 2 only to the bananas
print('Amount of bananas + 2 = ', fruits['bananas'] + 2)
print()

# We subtract 2 from apples
print('Amount of apples - 2 = ', fruits.iloc[0] - 2)
print()

# We multiply apples and oranges by 2
print('We double the amount of apples and oranges:\n', fruits[['apples', 'oranges']] * 2)
print()

# We divide apples and oranges by 2
print('We half the amount of apples and oranges:\n', fruits.loc[['apples', 'oranges']] / 2)

Original grocery list of fruits:
  apples     10
oranges     6
bananas     3
dtype: int64

Amount of bananas + 2 =  5

Amount of apples - 2 =  8

We double the amount of apples and oranges:
 apples     20
oranges    12
dtype: int64

We half the amount of apples and oranges:
 apples     5.0
oranges    3.0
dtype: float64


--------

### Arithmetic operations on Pandas Series with mixed data types

In [44]:
groceries = pd.Series(data=[30,6,"Yes","No"], index=["eggs", "apples", "milk", "bread"])
groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

If we multiply by 2, Pandas doubles the data of each item, including the strings. Pandas can do this because the multiplication operation * is defined both for numbers and strings.

In [45]:
groceries * 2

eggs          60
apples        12
milk      YesYes
bread       NoNo
dtype: object

If you apply an operation that is valid for numbers but not for strings, such as /, you will get an error. So when you have mixed data types in your Pandas Series, make sure the arithmetic operations are valid on all the data types of your elements.

In [46]:
groceries / 2

TypeError: unsupported operand type(s) for /: 'str' and 'int'
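One possible way around this error, sketched here under the assumption that only the numeric entries need to be divided, is to select those entries by label first:

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["eggs", "apples", "milk", "bread"])

# Dividing the whole Series fails because "Yes" / 2 is not defined
try:
    groceries / 2
    failed = False
except TypeError:
    failed = True

# Restricting the operation to the numeric entries works
halves = groceries.loc[["eggs", "apples"]] / 2

print(failed)
print(halves)
```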

----------

## Manipulate a Series (Exercises)

In [47]:
import pandas as pd

# Create a Pandas Series that contains the distance of some planets from the Sun.
# Use the name of the planets as the index to your Pandas Series, and the distance
# from the Sun as your data. The distance from the Sun is in units of 10^6 km

distance_from_sun = [149.6, 1433.5, 227.9, 108.2, 778.6]

planets = ['Earth','Saturn', 'Mars','Venus', 'Jupiter']

# Create a Pandas Series using the above data, with the name of the planets as
# the index and the distance from the Sun as your data.
dist_planets = pd.Series(data=distance_from_sun, index=planets) 
print(dist_planets)
print()
# Calculate the number of minutes it takes sunlight to reach each planet. You can
# do this by dividing the distance from the Sun for each planet by the speed of light.
# Since in the data above the distance from the Sun is in units of 10^6 km, you can
# use a value for the speed of light of c = 18, since light travels 18 x 10^6 km/minute.

time_light = dist_planets / 18
print(time_light)
print()

# Use Boolean indexing to select only those planets for which sunlight takes less
# than 40 minutes to reach them.
close_planets = time_light[time_light < 40]
print(close_planets)

Earth       149.6
Saturn     1433.5
Mars        227.9
Venus       108.2
Jupiter     778.6
dtype: float64

Earth       8.311111
Saturn     79.638889
Mars       12.661111
Venus       6.011111
Jupiter    43.255556
dtype: float64

Earth     8.311111
Mars     12.661111
Venus     6.011111
dtype: float64
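The Boolean indexing used in the exercise can be broken into two steps. This sketch uses a shortened, rounded copy of the travel times above, so the numbers are illustrative only:

```python
import pandas as pd

# Rounded travel times in minutes for three of the planets
time_light = pd.Series(data=[8.3, 79.6, 12.7],
                       index=["Earth", "Saturn", "Mars"])

# The comparison returns a Boolean Series with the same index
mask = time_light < 40

# Indexing with the mask keeps only the entries where it is True
close_planets = time_light[mask]

print(mask)
print(close_planets)
```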


----------

# Creating Pandas DataFrames

Pandas DataFrames are two-dimensional data structures with labeled rows and columns, that can hold many data types. If you are familiar with Excel, you can think of Pandas DataFrames as being similar to a spreadsheet. We can create Pandas DataFrames manually or by loading data from a file. 

In [48]:
import pandas as pd

In [49]:
items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice":pd.Series(data=[40, 110, 500, 45], index=["book", "glasses", "bike", "pants"])
}

type(items)

dict

In [50]:
shopping_carts = pd.DataFrame(items)
type(shopping_carts)

pandas.core.frame.DataFrame

DataFrames are displayed in tabular form, much like an Excel spreadsheet, with the labels of rows and columns in bold. Notice that the row labels of the DataFrame are built from the union of the index labels of the two Pandas Series we used to construct the dictionary, and that the column labels are taken from the keys of the dictionary. In the output below, the columns appear in the order in which the keys were given (Bob, then Alice); older versions of Pandas arranged them alphabetically instead.

The last thing we want to point out is that some NaN values appear in the DataFrame. NaN stands for Not a Number, and is Pandas' way of indicating that it doesn't have a value for that particular row and column index. For example, the column for Alice has NaN in the watch row. You can see why by looking at the dictionary we created at the beginning: it has no item for Alice labeled watch. So whenever a DataFrame is created, if a particular column doesn't have a value for a particular row index, Pandas puts a NaN value there. If we were to feed this data into a machine learning algorithm, we would have to remove these NaN values first. In a later lesson we will learn how to deal with NaN values and clean our data. For now, we will leave these values in our DataFrame.

In [51]:
shopping_carts

| | Bob | Alice |
|---|---|---|
| bike | 245.0 | 500.0 |
| book | NaN | 40.0 |
| glasses | NaN | 110.0 |
| pants | 25.0 | 45.0 |
| watch | 55.0 | NaN |


#### without index labels

In [52]:
# Pandas Series without index labels
# Pandas indexes the rows of the DataFrame starting from 0, just like NumPy indexes ndarrays.

data =  {
    "Bob": pd.Series(data=[245, 25, 55]),
    "Alice":pd.Series(data=[40, 110, 500, 45])
}

items = pd.DataFrame(data)
items


| | Bob | Alice |
|---|---|---|
| 0 | 245.0 | 40 |
| 1 | 25.0 | 110 |
| 2 | 55.0 | 500 |
| 3 | NaN | 45 |


### Like a Pandas Series, a DataFrame provides the following info
- df.index
- df.values
- df.columns
- df.shape
- df.size
- df.ndim

In [53]:
shopping_carts

| | Bob | Alice |
|---|---|---|
| bike | 245.0 | 500.0 |
| book | NaN | 40.0 |
| glasses | NaN | 110.0 |
| pants | 25.0 | 45.0 |
| watch | 55.0 | NaN |


In [54]:
shopping_carts.index

Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

In [55]:
shopping_carts.values

array([[245., 500.],
       [ nan,  40.],
       [ nan, 110.],
       [ 25.,  45.],
       [ 55.,  nan]])

In [56]:
shopping_carts.columns

Index(['Bob', 'Alice'], dtype='object')

In [57]:
shopping_carts.shape

(5, 2)

In [58]:
shopping_carts.size

10

In [59]:
shopping_carts.ndim

2

In [60]:

# We print some information about shopping_carts
print('shopping_carts has shape:', shopping_carts.shape)
print('shopping_carts has dimension:', shopping_carts.ndim)
print('shopping_carts has a total of:', shopping_carts.size, 'elements')
print()
print('The data in shopping_carts is:\n', shopping_carts.values)
print()
print('The row index in shopping_carts is:', shopping_carts.index)
print()
print('The column index in shopping_carts is:', shopping_carts.columns)

shopping_carts has shape: (5, 2)
shopping_carts has dimension: 2
shopping_carts has a total of: 10 elements

The data in shopping_carts is:
 [[245. 500.]
 [ nan  40.]
 [ nan 110.]
 [ 25.  45.]
 [ 55.  nan]]

The row index in shopping_carts is: Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: Index(['Bob', 'Alice'], dtype='object')


In [61]:
items

| | Bob | Alice |
|---|---|---|
| 0 | 245.0 | 40 |
| 1 | 25.0 | 110 |
| 2 | 55.0 | 500 |
| 3 | NaN | 45 |


### Create DF with specific Columns or  Labels only

There might be cases when you are only interested in a subset of the data. Pandas allows us to select which data we want to put into our DataFrame by means of the keywords **columns** and **index**.

In [62]:
items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice":pd.Series(data=[40, 110, 500, 45], index=["book", "glasses", "bike", "pants"])
}

In [63]:
items

{'Bob': bike     245
 pants     25
 watch     55
 dtype: int64,
 'Alice': book        40
 glasses    110
 bike       500
 pants       45
 dtype: int64}

#### by using columns keyword (for Columns)

In [64]:
Bob_shopping_cart = pd.DataFrame(items, columns=["Bob"])
Bob_shopping_cart

| | Bob |
|---|---|
| bike | 245 |
| pants | 25 |
| watch | 55 |


#### by using index keyword (for Label)

In [65]:
#a DataFrame that only has selected items for both Alice and Bob
selected_shopping_carts = pd.DataFrame(items, index=["pants", "book"])
selected_shopping_carts

| | Bob | Alice |
|---|---|---|
| pants | 25.0 | 45 |
| book | NaN | 40 |


#### specific label with specific column (Columns + Labels)

In [66]:
#DataFrame that only has selected items for Alice
alice_selected_shopping_carts = pd.DataFrame(items, columns=["Alice"], index=["glasses", "bike"])
alice_selected_shopping_carts

| | Alice |
|---|---|
| glasses | 110 |
| bike | 500 |


-------

## Creating a DataFrame from a dictionary of lists (arrays)
#### Float / Integers Data Frame

All the lists (arrays) in the dictionary must be of the same length.

In [67]:
data = {
    "Integers": [1,2,3,4],
    "Floats": [4.6, 2.8, 5.1, 3.2]
}

In [68]:
# Pandas automatically creates numeric row labels by default
df = pd.DataFrame(data)
df

| | Integers | Floats |
|---|---|---|
| 0 | 1 | 4.6 |
| 1 | 2 | 2.8 |
| 2 | 3 | 5.1 |
| 3 | 4 | 3.2 |


In [69]:
df = pd.DataFrame(data, index=["label1", "label2", "label3", "label4"])
df

| | Integers | Floats |
|---|---|---|
| label1 | 1 | 4.6 |
| label2 | 2 | 2.8 |
| label3 | 3 | 5.1 |
| label4 | 4 | 3.2 |
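The same-length requirement for a dictionary of lists can be checked with a deliberately broken example. This is a sketch with hypothetical data (`bad_data` is not part of the lesson); it relies on Pandas raising a ValueError in this situation:

```python
import pandas as pd

# Hypothetical data in which "Floats" is one element short
bad_data = {
    "Integers": [1, 2, 3, 4],
    "Floats": [4.6, 2.8, 5.1],
}

try:
    pd.DataFrame(bad_data)
    built = True
except ValueError as err:
    built = False
    print("Could not build the DataFrame:", err)
```

This differs from a dictionary of Pandas Series, where shorter Series are padded with NaN as shown earlier.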

## Creating a DataFrame from a list of dictionaries


In [70]:
items = [
    {
        "bikes": 20,
        "pants": 30,
        "watches": 35},
    {
        "watches": 10,
        "glasses": 50,
        "bikes": 15,
        "pants": 5
    }
]

In [71]:
store_items = pd.DataFrame(items)
store_items

| | bikes | pants | watches | glasses |
|---|---|---|---|---|
| 0 | 20 | 30 | 35 | NaN |
| 1 | 15 | 5 | 10 | 50.0 |


In [72]:
new_store_items = pd.DataFrame(items, index=["Store1", "Store2"])
new_store_items

| | bikes | pants | watches | glasses |
|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN |
| Store2 | 15 | 5 | 10 | 50.0 |


---------------

## Accessing Elements in Pandas DataFrames

In [84]:
import pandas as pd

In [85]:
items = [
    {
        "bikes":20,
        "pants":30,
        "watches":35
    },
    {
        "watches":10,
        "glasses":50,
        "bikes":15,
        "pants":5
    }
]

In [87]:
store_items = pd.DataFrame(data=items, index=["Store1", "Store2"])

In [88]:
store_items

| | bikes | pants | watches | glasses |
|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN |
| Store2 | 15 | 5 | 10 | 50.0 |


#### Summary

In [90]:
# We print the store_items DataFrame
print(store_items)

# We access rows, columns and elements using labels
print()
print('How many bikes are in each store:\n', store_items[['bikes']])
print()
print('How many bikes and pants are in each store:\n', store_items[['bikes', 'pants']])
print()
print('What items are in Store 1:\n', store_items.loc[["Store1"]])
print()
print('How many bikes are in Store 2:', store_items['bikes']['Store2'])

        bikes  pants  watches  glasses
Store1     20     30       35      NaN
Store2     15      5       10     50.0

How many bikes are in each store:
         bikes
Store1     20
Store2     15

How many bikes and pants are in each store:
         bikes  pants
Store1     20     30
Store2     15      5

What items are in Store 1:
         bikes  pants  watches  glasses
Store1     20     30       35      NaN

How many bikes are in Store 2: 15


#### accessing Single Column

In [77]:
store_items[["bikes"]]

| | bikes |
|---|---|
| Store1 | 20 |
| Store2 | 15 |


#### accessing Multiple Columns

In [78]:
store_items[["bikes", "pants"]]

| | bikes | pants |
|---|---|---|
| Store1 | 20 | 30 |
| Store2 | 15 | 5 |


#### accessing Specific Row

In [79]:
#using loc
store_items.loc[["Store1"]]

| | bikes | pants | watches | glasses |
|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN |


In [80]:
#using iloc
store_items.iloc[[0]]

| | bikes | pants | watches | glasses |
|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN |


#### accessing Specific Column, Specific Row
Note: the column label comes first.

It is important to know that when accessing individual elements in a DataFrame this way, as we did in the last example of the summary above, the labels must be provided with the column label first, i.e. in the form **dataframe[column][row]**.

In [81]:
store_items["bikes"]["Store1"]

20
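For comparison, the same element can also be reached with a single .loc or .iloc lookup, where the row comes first. This is a sketch that rebuilds the two-store DataFrame from above; the variable names `chained`, `single` and `positional` are only for illustration:

```python
import pandas as pd

items = [
    {"bikes": 20, "pants": 30, "watches": 35},
    {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5},
]
store_items = pd.DataFrame(data=items, index=["Store1", "Store2"])

# Column label first, then row label (two chained lookups)
chained = store_items["bikes"]["Store1"]

# Row label first, then column label, in a single .loc lookup
single = store_items.loc["Store1", "bikes"]

# .iloc does the same with integer positions (row 0, column 0)
positional = store_items.iloc[0, 0]

print(chained, single, positional)
```

The single-lookup form is usually the safer choice when assigning values, since chained lookups may operate on a temporary copy.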

## Modifying elements

### add New Columns & Values

In [92]:
store_items["shirts"] = [15,2]
store_items

| | bikes | pants | watches | glasses | shirts |
|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN | 15 |
| Store2 | 15 | 5 | 10 | 50.0 | 2 |


#### We can also add new columns to our DataFrame by using arithmetic operations between other columns in our DataFrame.

In [93]:
store_items["suits"] = store_items["pants"] + store_items["shirts"]
store_items

| | bikes | pants | watches | glasses | shirts | suits |
|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN | 15 | 45 |
| Store2 | 15 | 5 | 10 | 50.0 | 2 | 7 |


### Add New Row

#### Suppose now, that you opened a new store and you need to add the number of items in stock of that new store into your DataFrame. We can do this by adding a new row to the store_items Dataframe. 

In [94]:
new_items = [
    {
        "bikes": 20,
        "pants":30,
        "watches":35,
        "glasses":4
    }
]

new_store = pd.DataFrame(data=new_items, index=["Store3"])
new_store

| | bikes | pants | watches | glasses |
|---|---|---|---|---|
| Store3 | 20 | 30 | 35 | 4 |


#### append new rows

In [95]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result
store_items = pd.concat([store_items, new_store])
store_items

| | bikes | pants | watches | glasses | shirts | suits |
|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN | 15.0 | 45.0 |
| Store2 | 15 | 5 | 10 | 50.0 | 2.0 | 7.0 |
| Store3 | 20 | 30 | 35 | 4.0 | NaN | NaN |


#### add New columns by using existing column values

Suppose that you want to stock stores 2 and 3 with new watches and you want the quantity of the new watches to be the same as the watches already in stock for those stores.

In [96]:
store_items["new_watches"] = store_items["watches"][1:]
store_items

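| | bikes | pants | watches | glasses | shirts | suits | new_watches |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN | 15.0 | 45.0 | NaN |
| Store2 | 15 | 5 | 10 | 50.0 | 2.0 | 7.0 | 10.0 |
| Store3 | 20 | 30 | 35 | 4.0 | NaN | NaN | 35.0 |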


## Insert new columns at Specific Location
- dataframe.insert(loc, label, data) : loc is the integer position at which the new column is inserted

Add a new column named shoes right before the shirts column. Since shirts has numerical index 4, we use this value as loc.

In [97]:
store_items.insert(4, "shoes", [8,5,10])
store_items

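| | bikes | pants | watches | glasses | shoes | shirts | suits | new_watches |
|---|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN | 8 | 15.0 | 45.0 | NaN |
| Store2 | 15 | 5 | 10 | 50.0 | 5 | 2.0 | 7.0 | 10.0 |
| Store3 | 20 | 30 | 35 | 4.0 | 10 | NaN | NaN | 35.0 |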
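Rather than counting columns by hand, the insert position can be looked up from a column label. The sketch below uses a smaller, hypothetical version of store_items and assumes the new column should go right before an existing one:

```python
import pandas as pd

store_items = pd.DataFrame(
    {"bikes": [20, 15], "pants": [30, 5], "shirts": [15, 2]},
    index=["Store1", "Store2"],
)

# Look up the integer position of the column we want to insert before
position = store_items.columns.get_loc("shirts")

# Insert "shoes" at that position; insert() modifies the DataFrame in place
store_items.insert(position, "shoes", [8, 5])

print(list(store_items.columns))
```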


## Deleting Columns and Rows
- df.pop(column label)
- df.drop(column or row label list, axis=1/0)

In [98]:
#deleting columns
store_items.pop("new_watches")
store_items

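| | bikes | pants | watches | glasses | shoes | shirts | suits |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | NaN | 8 | 15.0 | 45.0 |
| Store2 | 15 | 5 | 10 | 50.0 | 5 | 2.0 | 7.0 |
| Store3 | 20 | 30 | 35 | 4.0 | 10 | NaN | NaN |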


In [99]:
# dropping multiple columns
store_items = store_items.drop(["watches", "suits"], axis=1)
store_items

| | bikes | pants | glasses | shoes | shirts |
|---|---|---|---|---|---|
| Store1 | 20 | 30 | NaN | 8 | 15.0 |
| Store2 | 15 | 5 | 50.0 | 5 | 2.0 |
| Store3 | 20 | 30 | 4.0 | 10 | NaN |


In [100]:
# dropping multiple rows
store_items = store_items.drop(["Store1", "Store2"], axis=0)
store_items

| | bikes | pants | glasses | shoes | shirts |
|---|---|---|---|---|---|
| Store3 | 20 | 30 | 4.0 | 10 | NaN |
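The two deletion methods behave differently: .pop() changes the DataFrame in place and hands back the removed column, while .drop() returns a new DataFrame. A short sketch on a hypothetical three-column DataFrame:

```python
import pandas as pd

store_items = pd.DataFrame(
    {"bikes": [20, 15], "pants": [30, 5], "watches": [35, 10]},
    index=["Store1", "Store2"],
)

# pop() removes the column in place and returns it as a Series
watches = store_items.pop("watches")

# drop() returns a new DataFrame; the original keeps its columns
without_pants = store_items.drop(columns=["pants"])

print(list(watches))
print(list(store_items.columns))
print(list(without_pants.columns))
```

Writing `columns=[...]` is an alternative spelling of passing the labels together with `axis=1`.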


## Changing / Renaming Row(index) & Column Name
- df.rename(columns={"old column name": "new column name"})
- df.rename(index={"old row name": "new row name"})

In [101]:
store_items = store_items.rename(columns = {"bikes": "hats"})
store_items

| | hats | pants | glasses | shoes | shirts |
|---|---|---|---|---|---|
| Store3 | 20 | 30 | 4.0 | 10 | NaN |


In [102]:
store_items = store_items.rename(index={"Store3": "Last Store"})
store_items

| | hats | pants | glasses | shoes | shirts |
|---|---|---|---|---|---|
| Last Store | 20 | 30 | 4.0 | 10 | NaN |


### Using column as index
- df.set_index(column name)

In [103]:
store_items = store_items.set_index("pants")
store_items

| pants | hats | glasses | shoes | shirts |
|---|---|---|---|---|
| 30 | 20 | 4.0 | 10 | NaN |


---------------------------

# Dealing with NaN

Before we can begin training our learning algorithms with large datasets, we usually need to clean the data first. This means we need to have a method for detecting and correcting errors in our data. While any given dataset can have many types of bad data, such as outliers or incorrect values, the type we encounter most often is missing values. As we saw earlier, Pandas assigns NaN values to missing data. So we need to detect and deal with NaN values.

In [123]:
import pandas as pd

In [124]:
items = [
    {"bikes":20, "pants": 30, "watches":35, "shirts":15, "shoes":8, "suits":45},
    {"watches": 10, "glasses": 50, "bikes": 15, "pants": 5, "shirts": 2, "shoes":5, "suits": 7},
    {"bikes":20, "pants":30, "watches":35, "glasses":4, "shoes":10}
]

In [125]:
store_items = pd.DataFrame(data=items, index=["Store1", "Store2", "Store3"])
store_items

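| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | 15.0 | 8 | 45.0 | NaN |
| Store2 | 15 | 5 | 10 | 2.0 | 5 | 7.0 | 50.0 |
| Store3 | 20 | 30 | 35 | NaN | 10 | NaN | 4.0 |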


### Counting NaN values / Missing Values

The .isnull() method returns a Boolean DataFrame of the same size as store_items, with True marking the elements that are NaN and False marking the elements that are not.

In [126]:
x = store_items.isnull()
x

| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store1 | False | False | False | False | False | False | True |
| Store2 | False | False | False | False | False | False | False |
| Store3 | False | False | False | True | False | True | False |


In Pandas, logical True values have numerical value 1 and logical False values have numerical value 0. Therefore, we can count the number of NaN values by counting the number of logical True values. In order to count the total number of logical True values we use the .sum() method twice. We have to use it twice because the first sum returns a Pandas Series with the number of logical True values in each column, and the second sum adds up that Series.

In [127]:
x = store_items.isnull().sum()
x

bikes      0
pants      0
watches    0
shirts     1
shoes      0
suits      1
glasses    1
dtype: int64

### Total number of NaN in whole dataset

In [128]:
# now we get TOTAL Missing values in the whole dataset
x = store_items.isnull().sum().sum()
x

3
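The same counting idea also works per row by passing axis=1 to the first .sum(). This sketch uses a reduced, hypothetical version of the store data with the same pattern of missing values:

```python
import pandas as pd

items = [
    {"bikes": 20, "shirts": 15, "suits": 45},
    {"bikes": 15, "shirts": 2, "suits": 7, "glasses": 50},
    {"bikes": 20, "glasses": 4},
]
store_items = pd.DataFrame(data=items, index=["Store1", "Store2", "Store3"])

per_column = store_items.isnull().sum()        # one count per column
per_row = store_items.isnull().sum(axis=1)     # one count per row
total = int(store_items.isnull().sum().sum())  # single number for the whole table

print(per_row)
print(total)
```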

### Glance at data values situation
- dataframe.count()
- shows how many non-NaN values each column has

Instead of counting the number of NaN values, we can also do the opposite and count the number of non-NaN values.

In [129]:
store_items.count()

bikes      3
pants      3
watches    3
shirts     2
shoes      3
suits      2
glasses    2
dtype: int64

## Once we have found missing values, we can handle them in two ways
- remove the missing values
- replace the missing values

## 1) Remove NaN values
- dataframe.dropna(axis=0/1) : returns a new df
- dataframe.dropna(axis=0/1, inplace=True) : modifies the original df

Notice that the .dropna() method eliminates (drops) the rows or columns with NaN values out of place. This means that the original DataFrame is not modified. You can always remove the desired rows or columns in place by setting the keyword inplace = True inside the dropna() function.

In [130]:
# drop rows that contain NaN values
store_items.dropna(axis=0)

| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store2 | 15 | 5 | 10 | 2.0 | 5 | 7.0 | 50.0 |


In [131]:
# drop columns that contain NaN values
store_items.dropna(axis=1)

| | bikes | pants | watches | shoes |
|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | 8 |
| Store2 | 15 | 5 | 10 | 5 |
| Store3 | 20 | 30 | 35 | 10 |


In [132]:
store_items

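| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | 15.0 | 8 | 45.0 | NaN |
| Store2 | 15 | 5 | 10 | 2.0 | 5 | 7.0 | 50.0 |
| Store3 | 20 | 30 | 35 | NaN | 10 | NaN | 4.0 |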
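Dropping every row or column that contains a NaN can discard a lot of data. As a sketch of two gentler options, .dropna() also accepts the `subset` and `how` keywords; the DataFrame here is a reduced, hypothetical version of store_items:

```python
import numpy as np
import pandas as pd

store_items = pd.DataFrame(
    {
        "bikes": [20, 15, 20],
        "shirts": [15.0, 2.0, np.nan],
        "glasses": [np.nan, 50.0, 4.0],
    },
    index=["Store1", "Store2", "Store3"],
)

# Drop rows only when a chosen column is missing
has_shirts = store_items.dropna(axis=0, subset=["shirts"])

# Drop rows only when every value in the row is missing (none here)
not_empty = store_items.dropna(axis=0, how="all")

print(list(has_shirts.index))
print(list(not_empty.index))
```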


## 2) Replacing NaN Values
- dataframe.fillna(value to replace)
- dataframe.fillna(method="ffill", axis=0/1) : forwards filling

Now, instead of eliminating NaN values, we can replace them with suitable values. We could choose for example to replace all NaN values with the value 0. We can do this by using the .fillna() method as shown below.

In [134]:
# replace NaN values with 0
store_items.fillna(0)

| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | 15.0 | 8 | 45.0 | 0.0 |
| Store2 | 15 | 5 | 10 | 2.0 | 5 | 7.0 | 50.0 |
| Store3 | 20 | 30 | 35 | 0.0 | 10 | 0.0 | 4.0 |
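Zero is only one possible replacement value. As an illustration of another choice (an assumption for this sketch, not something the lesson prescribes), each NaN can be replaced by the mean of its column by passing a Series to .fillna():

```python
import numpy as np
import pandas as pd

store_items = pd.DataFrame(
    {"shirts": [15.0, 2.0, np.nan], "glasses": [np.nan, 50.0, 4.0]},
    index=["Store1", "Store2", "Store3"],
)

# mean() skips NaN by default and returns one value per column
column_means = store_items.mean()

# Passing a Series to fillna() fills each column with its matching value
filled = store_items.fillna(column_means)

print(filled)
```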


### Forwards Filling Rows

We can also use the .fillna() method to replace NaN values with previous values in the DataFrame. This is known as forward filling. When replacing NaN values with forward filling, we can use previous values taken from columns or rows. The .fillna(method = 'ffill', axis) call uses the forward filling (ffill) method to replace NaN values with the previous known value along the given axis. Recent Pandas versions deprecate the method keyword of .fillna() in favor of the equivalent .ffill(axis) method.

In [135]:
store_items

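| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | 15.0 | 8 | 45.0 | NaN |
| Store2 | 15 | 5 | 10 | 2.0 | 5 | 7.0 | 50.0 |
| Store3 | 20 | 30 | 35 | NaN | 10 | NaN | 4.0 |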


In [136]:
# forward filling down the rows (each NaN takes the value from the row above)
store_items.fillna(method="ffill", axis=0)

| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | 15.0 | 8 | 45.0 | NaN |
| Store2 | 15 | 5 | 10 | 2.0 | 5 | 7.0 | 50.0 |
| Store3 | 20 | 30 | 35 | 2.0 | 10 | 7.0 | 4.0 |


### Forwards Filling Columns

In [137]:
store_items

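| | bikes | pants | watches | shirts | shoes | suits | glasses |
|---|---|---|---|---|---|---|---|
| Store1 | 20 | 30 | 35 | 15.0 | 8 | 45.0 | NaN |
| Store2 | 15 | 5 | 10 | 2.0 | 5 | 7.0 | 50.0 |
| Store3 | 20 | 30 | 35 | NaN | 10 | NaN | 4.0 |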


In [138]:
store_items.fillna(method="ffill", axis=1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
Store2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
Store3,20.0,30.0,35.0,35.0,10.0,10.0,4.0
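The two forward-filling directions can be compared side by side with the dedicated .ffill() method, which newer pandas versions recommend over fillna(method="ffill"). This is a sketch on a hand-built copy of store_items taken from the output above.

```python
import numpy as np
import pandas as pd

# Hand-built copy of store_items, matching the values printed above
store_items = pd.DataFrame(
    {
        "bikes": [20, 15, 20],
        "pants": [30, 5, 30],
        "watches": [35, 10, 35],
        "shirts": [15.0, 2.0, np.nan],
        "shoes": [8, 5, 10],
        "suits": [45.0, 7.0, np.nan],
        "glasses": [np.nan, 50.0, 4.0],
    },
    index=["Store1", "Store2", "Store3"],
)

# axis=0: fill down each column, using the row above
filled_down = store_items.ffill(axis=0)

# axis=1: fill across each row, using the column to the left
filled_across = store_items.ffill(axis=1)

print(filled_down)
print(filled_across)
```

Store1's glasses stays NaN when filling down, because there is no row above it to copy from.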


### Backwards Filling

Similarly, you can replace NaN values with the values that come after them in the DataFrame. This is known as backward filling. The .fillna(method = 'backfill', axis) call replaces each NaN with the next known value along the given axis. A NaN with no later value along the axis stays NaN. In recent pandas versions the equivalent .bfill(axis) method is preferred.

In [139]:
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20,30,35,15.0,8,45.0,
Store2,15,5,10,2.0,5,7.0,50.0
Store3,20,30,35,,10,,4.0


In [140]:
# backward fill up each column (axis=0): each NaN takes the value from the row below
store_items.fillna(method="backfill", axis=0)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20,30,35,15.0,8,45.0,50.0
Store2,15,5,10,2.0,5,7.0,50.0
Store3,20,30,35,,10,,4.0


In [141]:
# backward fill across each row (axis=1): each NaN takes the value from the column to its right
store_items.fillna(method="backfill", axis=1)

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20.0,30.0,35.0,15.0,8.0,45.0,
Store2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
Store3,20.0,30.0,35.0,10.0,10.0,4.0,4.0
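Neither direction alone fills every gap in this DataFrame, since a backward fill cannot reach the last row and a forward fill cannot reach the first. One possible approach, sketched here on a hand-built copy of store_items, is to chain the two. Whether this is sensible depends on the data, so treat it as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hand-built copy of store_items, matching the values printed above
store_items = pd.DataFrame(
    {
        "bikes": [20, 15, 20],
        "pants": [30, 5, 30],
        "watches": [35, 10, 35],
        "shirts": [15.0, 2.0, np.nan],
        "shoes": [8, 5, 10],
        "suits": [45.0, 7.0, np.nan],
        "glasses": [np.nan, 50.0, 4.0],
    },
    index=["Store1", "Store2", "Store3"],
)

# Step 1: backward fill down the columns (fixes Store1's glasses)
step1 = store_items.bfill(axis=0)

# Step 2: forward fill what is still missing (fixes Store3's shirts and suits)
fully_filled = step1.ffill(axis=0)

print(fully_filled)
```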


## Interpolate
- linear

Linear interpolation ignores the index and treats the values as **equally spaced**.

We can also choose to replace NaN values by using different interpolation methods. For example, the .interpolate(method = 'linear', axis) method will use **linear** interpolation to replace NaN values using the values along the given axis. 

In [142]:
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20,30,35,15.0,8,45.0,
Store2,15,5,10,2.0,5,7.0,50.0
Store3,20,30,35,,10,,4.0


In [147]:
# Replace NaN values by linear interpolation down each column (axis=0),
# i.e. using the values in the rows above and below.
store_items.interpolate(method='linear', axis=0)

# How the NaN values are handled along each column:
# 1) A NaN between two known values is placed on the straight line between them
#    (for a single gap, this is the mean of the cells above and below).
# 2) A NaN with no known value above it (a leading NaN) is left unchanged by default.
# 3) A NaN with no known value below it (a trailing NaN) is filled with the last known value.

# Store1's glasses has no cell above it, so it stays NaN.
# Store3's shirts and suits have no cell below them, so they take Store2's values (2.0 and 7.0).

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20,30,35,15.0,8,45.0,
Store2,15,5,10,2.0,5,7.0,50.0
Store3,20,30,35,2.0,10,7.0,4.0


In [146]:
store_items

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20,30,35,15.0,8,45.0,
Store2,15,5,10,2.0,5,7.0,50.0
Store3,20,30,35,,10,,4.0


In [149]:
# Replace NaN values by linear interpolation across each row (axis=1),
# i.e. using the values in the columns to the left and right.
store_items.interpolate(method="linear", axis=1)

# Store1's glasses is the last column and has no value to its right, so it is
# filled with the last known value in the row (suits, 45.0).
# Store3's shirts lies between watches and shoes, so it becomes (35 + 10) / 2 = 22.5.
# Likewise, Store3's suits lies between shoes and glasses: (10 + 4) / 2 = 7.0.

Unnamed: 0,bikes,pants,watches,shirts,shoes,suits,glasses
Store1,20.0,30.0,35.0,15.0,8.0,45.0,45.0
Store2,15.0,5.0,10.0,2.0,5.0,7.0,50.0
Store3,20.0,30.0,35.0,22.5,10.0,7.0,4.0
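The behaviour at the edges is easier to see on a single Series. The sketch below uses made-up values with a leading NaN, a two-cell gap, and a trailing NaN, and relies on the default settings of .interpolate().

```python
import numpy as np
import pandas as pd

# Made-up values: leading NaN, a gap of two cells, and a trailing NaN
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

interpolated = s.interpolate(method="linear")
print(interpolated)

# The gap between 10 and 40 is split into equal steps of (40 - 10) / 3 = 10,
# giving 20 and 30. The trailing NaN repeats the last known value (40),
# and the leading NaN is left as NaN.
```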



_____________________


# Manipulate a DataFrame (Exercise)

In [175]:
import pandas as pd
import numpy as np

# Since we will be working with ratings, we will set the display precision of our
# dataframes to one decimal place.
pd.set_option('display.precision', 1)

# Create a Pandas DataFrame that contains the ratings some users have given to a
# series of books. The ratings given are in the range from 1 to 5, with 5 being
# the best score. The names of the books, the authors, and the ratings of each user
# are given below:

books = pd.Series(data = ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet', 'The Time Machine', 'Alice in Wonderland' ])
authors = pd.Series(data = ['Charles Dickens', 'John Steinbeck', 'William Shakespeare', 'H. G. Wells', 'Lewis Carroll' ])

user_1 = pd.Series(data = [3.2, np.nan ,2.5])
user_2 = pd.Series(data = [5., 1.3, 4.0, 3.8])
user_3 = pd.Series(data = [2.0, 2.3, np.nan, 4])
user_4 = pd.Series(data = [4, 3.5, 4, 5, 4.2])

# An np.nan value means that the user has not yet rated that book.
# Use the data above to create a Pandas DataFrame that has the following column
# labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3', 'User 4'. Let Pandas
# automatically assign numerical row indices to the DataFrame. 

# Create a dictionary with the data given above
dat = {
    "Author": authors,
    "Book Title": books,
    "User 1": user_1,
    "User 2": user_2,
    "User 3": user_3,
    "User 4": user_4,
}

# Use the dictionary to create a Pandas DataFrame
book_ratings = pd.DataFrame(dat)

# If you created the dictionary correctly you should have a Pandas DataFrame
# that has column labels: 'Author', 'Book Title', 'User 1', 'User 2', 'User 3',
# 'User 4' and row indices 0 through 4.
book_ratings

# Now replace all the NaN values in your DataFrame with the average rating in
# each column. Replace the NaN values in place. HINT: you can use the fillna()
# function with the keyword inplace = True, to do this. Write your code below:
print("Total Missing Values: ", book_ratings.isnull().sum().sum())

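# numeric_only=True skips the text columns ('Author', 'Book Title'), which newer
# pandas versions would otherwise refuse to average
book_ratings.fillna(book_ratings.mean(numeric_only=True), inplace=True)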
book_ratings

Total Missing Values:  6


Unnamed: 0,Author,Book Title,User 1,User 2,User 3,User 4
0,Charles Dickens,Great Expectations,3.2,5.0,2.0,4.0
1,John Steinbeck,Of Mice and Men,2.9,1.3,2.3,3.5
2,William Shakespeare,Romeo and Juliet,2.5,4.0,2.8,4.0
3,H. G. Wells,The Time Machine,2.9,3.8,4.0,5.0
4,Lewis Carroll,Alice in Wonderland,2.9,3.5,2.8,4.2


In [177]:
# Pick all the books that received a rating of 5 from at least one user
best_rated = book_ratings[(book_ratings == 5).any(axis = 1)]['Book Title'].values
best_rated

array(['Great Expectations', 'The Time Machine'], dtype=object)
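The selection above works in two steps: a comparison that produces a table of True/False values, and .any(axis=1), which collapses each row to a single True/False. A minimal sketch with made-up book titles and ratings (not the exercise data) shows the intermediate mask. Restricting the comparison to the rating columns is an optional variation.

```python
import pandas as pd

# Made-up ratings, only for illustration
ratings = pd.DataFrame(
    {
        "Book Title": ["Book A", "Book B", "Book C"],
        "User 1": [5.0, 3.0, 4.5],
        "User 2": [2.0, 4.0, 5.0],
    }
)

rating_cols = ["User 1", "User 2"]

# True for every row where at least one user gave a 5
has_five = (ratings[rating_cols] == 5).any(axis=1)
print(has_five)

# Use the mask to pick the matching titles
top_books = ratings.loc[has_five, "Book Title"].tolist()
print(top_books)
```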

----------------

# Loading Data into a pandas DataFrame

In machine learning you will most likely use datasets from many sources to train your learning algorithms. Pandas allows us to load data of different formats into DataFrames. One of the most popular formats used to store datasets is CSV. CSV stands for Comma Separated Values and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the pd.read_csv() function.

In [185]:
import pandas as pd

In [207]:
sti_stock = pd.read_csv('ES3.SI.csv')

In [208]:
sti_stock

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2008-01-10,3.7,3.7,3.6,3.6,2.9,29000.0
1,2008-01-11,3.6,3.7,3.6,3.6,2.9,27000.0
2,2008-01-14,3.6,3.6,3.5,3.5,2.8,79000.0
3,2008-01-15,3.5,3.5,3.4,3.4,2.7,171000.0
4,2008-01-16,3.4,3.4,3.2,3.3,2.6,173000.0
...,...,...,...,...,...,...,...
3139,2020-07-13,,,,,,
3140,2020-07-14,,,,,,
3141,2020-07-15,,,,,,
3142,2020-07-16,,,,,,
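If the ES3.SI.csv file is not at hand, pd.read_csv() can be tried on an in-memory string instead. The sketch below uses a few made-up rows in the same general layout. The parse_dates argument is an optional extra that converts the Date column to datetime values rather than leaving it as text.

```python
import io

import pandas as pd

# Made-up sample in CSV form; an empty field between commas becomes NaN
csv_text = """Date,Open,Close,Volume
2008-01-10,3.7,3.6,29000
2008-01-11,3.6,,27000
2008-01-14,3.6,3.5,79000
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sample)
print(sample.dtypes)
```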


In [209]:
sti_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2008-01-10,3.7,3.7,3.6,3.6,2.9,29000.0
1,2008-01-11,3.6,3.7,3.6,3.6,2.9,27000.0
2,2008-01-14,3.6,3.6,3.5,3.5,2.8,79000.0
3,2008-01-15,3.5,3.5,3.4,3.4,2.7,171000.0
4,2008-01-16,3.4,3.4,3.2,3.3,2.6,173000.0


We can also pass a number N to .head(N) or .tail(N) to display the first or last N rows of data, respectively. Without an argument, both methods show 5 rows.

In [210]:
sti_stock.tail(10)

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
3134,2020-07-06,,,,,,
3135,2020-07-07,,,,,,
3136,2020-07-08,,,,,,
3137,2020-07-09,,,,,,
3138,2020-07-10,,,,,,
3139,2020-07-13,,,,,,
3140,2020-07-14,,,,,,
3141,2020-07-15,,,,,,
3142,2020-07-16,,,,,,
3143,2020-07-17,,,,,,


Let's do a quick check to see whether we have any NaN values in our dataset. To do this, we will use the .isnull() method followed by the .any() method to check whether any of the columns contain NaN values.

In [211]:
sti_stock.isnull().any()

Date         False
Open          True
High          True
Low           True
Close         True
Adj Close     True
Volume        True
dtype: bool

In [212]:
sti_stock.isnull().sum().sum()

108
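The same .isnull() building block can answer a few related questions. As a sketch on a small made-up DataFrame: one .sum() counts NaN values per column, a second .sum() gives the overall total, and .mean() gives the fraction of missing values in each column.

```python
import numpy as np
import pandas as pd

# Made-up prices with a few missing entries
prices = pd.DataFrame(
    {
        "Open": [3.7, np.nan, 3.5, np.nan],
        "Close": [3.6, 3.6, 3.5, np.nan],
    }
)

nan_per_column = prices.isnull().sum()     # NaN count in each column
nan_total = int(nan_per_column.sum())      # NaN count in the whole DataFrame
nan_share = prices.isnull().mean()         # fraction of NaN values per column

print(nan_per_column)
print(nan_total)
print(nan_share)
```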

In [216]:
# Drop every row that contains at least one NaN value (inplace=True modifies sti_stock itself)
sti_stock.dropna(axis=0, how="any", inplace=True)

In [217]:
sti_stock

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2008-01-10,3.7,3.7,3.6,3.6,2.9,2.9e+04
1,2008-01-11,3.6,3.7,3.6,3.6,2.9,2.7e+04
2,2008-01-14,3.6,3.6,3.5,3.5,2.8,7.9e+04
3,2008-01-15,3.5,3.5,3.4,3.4,2.7,1.7e+05
4,2008-01-16,3.4,3.4,3.2,3.3,2.6,1.7e+05
...,...,...,...,...,...,...,...
3128,2020-06-26,2.6,2.7,2.6,2.7,2.7,2.0e+06
3129,2020-06-29,2.6,2.7,2.6,2.6,2.6,2.2e+06
3130,2020-06-30,2.7,2.7,2.6,2.6,2.6,3.2e+06
3131,2020-07-01,2.7,2.7,2.7,2.7,2.7,2.3e+06
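Dropping a row for a single missing cell can be stricter than needed. A short sketch on made-up data compares how="any" with two alternatives that .dropna() also supports: how="all", which removes only rows that are entirely NaN, and subset, which looks at selected columns only.

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is entirely NaN, row 2 is missing only Close
prices = pd.DataFrame(
    {
        "Open": [3.7, np.nan, 3.5],
        "Close": [3.6, np.nan, np.nan],
    }
)

any_dropped = prices.dropna(axis=0, how="any")      # keeps only complete rows
all_dropped = prices.dropna(axis=0, how="all")      # drops only fully empty rows
open_required = prices.dropna(subset=["Open"])      # only Open must be present

print(any_dropped)
print(all_dropped)
print(open_required)
```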


### Describe the statistics summary

When dealing with large datasets, it is often useful to get statistical information from them. Pandas provides the .describe() method to get descriptive statistics on each column of the DataFrame.

In [218]:
sti_stock.describe()

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
count,3126.0,3126.0,3126.0,3126.0,3126.0,3100.0
mean,3.0,3.1,3.0,3.0,2.6,450000.0
std,0.4,0.4,0.4,0.4,0.4,970000.0
min,1.5,1.5,1.5,1.5,1.2,0.0
25%,2.9,2.9,2.9,2.9,2.4,110000.0
50%,3.1,3.1,3.1,3.1,2.6,230000.0
75%,3.3,3.3,3.3,3.3,2.9,440000.0
max,3.7,3.7,3.6,3.7,3.4,19000000.0


We can also get descriptive statistics on a single column of our DataFrame:

In [219]:
sti_stock['Close'].describe()

count    3126.0
mean        3.0
std         0.4
min         1.5
25%         2.9
50%         3.1
75%         3.3
max         3.7
Name: Close, dtype: float64

### Min

In [223]:
sti_stock['High'].min()

1.51

### Max

In [221]:
sti_stock.max()

Date         2020-07-02
Open                  4
High                  4
Low                   4
Close                 4
Adj Close             3
Volume            2e+07
dtype: object

### Mean

In [222]:
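# numeric_only=True skips the text Date column, which newer pandas versions cannot average
sti_stock.mean(numeric_only=True)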

Open              3.0
High              3.1
Low               3.0
Close             3.0
Adj Close         2.6
Volume       447817.8
dtype: float64

### Correlation

Another important statistical measure is correlation, which tells us how strongly the values in two columns move together. We can use the .corr() method to get the pairwise correlation between the numeric columns. By default it computes the Pearson correlation coefficient, which ranges from -1 to 1.

In [224]:
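# numeric_only=True (available in newer pandas versions) skips the text Date column
sti_stock.corr(numeric_only=True)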

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume
Open,1.0,1.0,1.0,1.0,0.88,-0.1
High,1.0,1.0,1.0,1.0,0.88,-0.1
Low,1.0,1.0,1.0,1.0,0.88,-0.11
Close,1.0,1.0,1.0,1.0,0.88,-0.11
Adj Close,0.9,0.88,0.9,0.9,1.0,0.046
Volume,-0.1,-0.1,-0.1,-0.1,0.046,1.0
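To see what the numbers in a correlation matrix mean, it helps to try .corr() on data where the answer is known in advance. In this sketch with made-up values, Close is always twice Open (correlation 1) and Volume falls as Open rises (correlation -1). select_dtypes("number") is used here as one way to leave out the text column.

```python
import pandas as pd

# Made-up data with obvious linear relationships
prices = pd.DataFrame(
    {
        "Date": ["2008-01-10", "2008-01-11", "2008-01-14", "2008-01-15"],
        "Open": [1.0, 2.0, 3.0, 4.0],
        "Close": [2.0, 4.0, 6.0, 8.0],
        "Volume": [40.0, 30.0, 20.0, 10.0],
    }
)

# Full correlation matrix over the numeric columns only
corr_matrix = prices.select_dtypes("number").corr()
print(corr_matrix)

# Correlation between just two columns
open_vs_volume = prices["Open"].corr(prices["Volume"])
print(open_vs_volume)
```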


--------

## Dummy Company Info Analysis

In [246]:
import pandas as pd

In [247]:
company_info = pd.read_csv("Dummy Company.csv")

In [248]:
company_info.head()

Unnamed: 0,Year,Name,Department,Age,Salary
0,1990,Alice,HR,25,50000
1,1990,Bob,RD,30,48000
2,1990,Charlie,Admin,45,55000
3,1991,Alice,HR,26,52000
4,1991,Bob,RD,31,50000


In [249]:
company_info.describe()

Unnamed: 0,Year,Age,Salary
count,9.0,9.0,9.0
mean,1991.0,32.2,54333.3
std,0.9,7.9,5147.8
min,1990.0,25.0,48000.0
25%,1990.0,27.0,50000.0
50%,1991.0,30.0,52000.0
75%,1992.0,32.0,60000.0
max,1992.0,46.0,62000.0


## Group By
The .groupby() method allows us to group data in different ways.

In [250]:
# total salary paid in each year
company_info.groupby(['Year'])['Salary'].sum()

Year
1990    153000
1991    162000
1992    174000
Name: Salary, dtype: int64

In [251]:
# average salary in each year
company_info.groupby(["Year"])["Salary"].mean()

Year
1990    51000
1991    54000
1992    58000
Name: Salary, dtype: int64

In [254]:
# total salary per employee
# the total salary each employee received in all the years they worked for the company
company_info.groupby(["Name"])["Salary"].sum()

Name
Alice      162000
Bob        150000
Charlie    177000
Name: Salary, dtype: int64

Next, let's look at the salary distribution per department per year. In this case we group the data by Year and by Department using the .groupby() method, and then add up the salaries for each department.

In [253]:
# Salary distribution by Department per each year
company_info.groupby(["Year", "Department"])["Salary"].sum()

Year  Department
1990  Admin          55000
      HR             50000
      RD             48000
1991  Admin          60000
      HR             52000
      RD             50000
1992  Admin         122000
      RD             52000
Name: Salary, dtype: int64
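Several aggregations can be computed in one pass with .agg(), and a two-level result can be turned into a Year by Department table with .unstack(). The sketch below uses a stand-in table whose names, departments, and salaries were chosen to be consistent with the outputs shown above. The actual contents of "Dummy Company.csv" may differ, so the rows are an assumption.

```python
import pandas as pd

# Stand-in data, chosen to be consistent with the outputs above (assumption)
company_info = pd.DataFrame(
    {
        "Year": [1990, 1990, 1990, 1991, 1991, 1991, 1992, 1992, 1992],
        "Name": ["Alice", "Bob", "Charlie"] * 3,
        "Department": ["HR", "RD", "Admin", "HR", "RD", "Admin",
                       "Admin", "RD", "Admin"],
        "Salary": [50000, 48000, 55000, 52000, 50000, 60000,
                   60000, 52000, 62000],
    }
)

# Total and average salary per year in a single call
yearly = company_info.groupby("Year")["Salary"].agg(["sum", "mean"])
print(yearly)

# Salary per department per year as a table; missing combinations become 0
by_department = (
    company_info.groupby(["Year", "Department"])["Salary"]
    .sum()
    .unstack(fill_value=0)
)
print(by_department)
```

The unstacked table makes it easy to spot that HR has no salary entries in 1992.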