# Pandas Tutorial Part 2

In this lesson we will cover the following topics:

* Lambda Functions
* Boolean Indexing
* Reading & Writing data
* DataFrame Manipulation
* Statistics on data

In [None]:
!pip install matplotlib

In [6]:
# we start with the imports as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

# and ipython definition:
%matplotlib inline

In [7]:
# Show the result of every expression in a cell, not only the last one. This lets us condense each trick into a single cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Lambda Function
A lambda function is a small anonymous function.

A lambda function can take any number of arguments, but can only have one expression.

### Syntax

**lambda** arguments : expression

The expression is executed and the result is returned

In [8]:
# Add 10 to argument a, and return the result:

x = lambda a : a + 10 # x is a variable holding a function, it is an executable
print(x(5))

15


In [9]:
# Multiply argument a with argument b and return the result:

x = lambda a, b : a * b
print(x(5, 6))

30
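
A lambda is most useful when a short function is passed as an argument to another function. The sketch below (all names are illustrative and not part of the original notebook) shows that a lambda behaves like a small `def`, and uses one as a sort key:

```python
# A lambda and an equivalent named function (illustrative names)
add_ten_lambda = lambda a: a + 10

def add_ten_def(a):
    return a + 10

# Lambdas are typically passed inline, for example as a sort key
words = ["pandas", "io", "numpy"]
by_length = sorted(words, key=lambda w: len(w))

print(add_ten_lambda(5), add_ten_def(5))
print(by_length)
```

Later in this lesson the same idea is used with `Series.apply`, where the lambda is called once per value.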


## Boolean Indexing

Boolean indexing selects subsets of a DataFrame based on the actual values of the data rather than on row/column labels or integer locations. A boolean vector (one True/False value per row) acts as the filter.

We can filter data in two main ways:

* Accessing a DataFrame with a boolean index
* Masking data based on column value

**Accessing a DataFrame with a boolean index**

To access a DataFrame with a boolean index, we first create a DataFrame whose index contains boolean values (True or False). For example:

In [10]:
data = {'name':["Yair", "Ben", "Shir", "Natan"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
  
df = pd.DataFrame(data, index = [True, False, True, False])
  
df

Unnamed: 0,name,degree,score
True,Yair,MBA,90
False,Ben,BCA,40
True,Shir,M.Tech,80
False,Natan,MBA,98


To access a DataFrame with a boolean index, pass a boolean value (True or False) to the **.loc[]** indexer.

**.iloc[]** does not work here: it selects by integer position, not by index label, so it does not accept True or False as a label.

In [11]:
print(df.loc[True])
print()
print(df.loc[False])

      name  degree  score
True  Yair     MBA     90
True  Shir  M.Tech     80

        name degree  score
False    Ben    BCA     40
False  Natan    MBA     98


**Masking data based on column value**

We can filter a DataFrame based on the values in a column by applying a condition with comparison operators such as **==, >, <, <=, >=**. Applying one of these operators to a column produces a Series of True and False values, which can then be used as a mask.

In [13]:
data = {'name':["Yair", "Ben", "Shir", "Natan"],
        'degree': ["BCA", "BCA", "M.Tech", "BCA"],
        'score':[90, 40, 80, 98]}
 
# creating a dataframe
df = pd.DataFrame(data)

df

Unnamed: 0,name,degree,score
0,Yair,BCA,90
1,Ben,BCA,40
2,Shir,M.Tech,80
3,Natan,BCA,98


In [14]:
# In order to get the Boolean result
print(df['degree'] == 'BCA')
print()
# In order to actually filter the DataFrame
print(df[df['degree'] == 'BCA'])

0     True
1     True
2    False
3     True
Name: degree, dtype: bool

    name degree  score
0   Yair    BCA     90
1    Ben    BCA     40
3  Natan    BCA     98


In [15]:
# In order to get the Boolean result
print(df['score'] >= 90)
print()
# In order to actually filter the DataFrame
print(df[df['score'] >= 90])

0     True
1    False
2    False
3     True
Name: score, dtype: bool

    name degree  score
0   Yair    BCA     90
3  Natan    BCA     98


To combine multiple conditions, use "&" for "and" and "|" for "or". Each condition must be wrapped in parentheses.

In [16]:
# In order to get the Boolean result
print((df['score'] >= 90) & ((df['degree'] == 'BCA')))
print()
# In order to actually filter the DataFrame
print(df[(df['score'] >= 90) & ((df['degree'] == 'BCA'))])

0     True
1    False
2    False
3     True
dtype: bool

    name degree  score
0   Yair    BCA     90
3  Natan    BCA     98


In [17]:
# In order to get the Boolean result
print((df['score'] >= 90) | ((df['degree'] == 'BCA')))
print()
# In order to actually filter the DataFrame
print(df[(df['score'] >= 90) | ((df['degree'] == 'BCA'))])

0     True
1     True
2    False
3     True
dtype: bool

    name degree  score
0   Yair    BCA     90
1    Ben    BCA     40
3  Natan    BCA     98
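
Two further mask operations are often useful alongside "&" and "|": negation with "~" and membership tests with `isin()`. The following is a minimal sketch that rebuilds the same small DataFrame used above; the variable names `not_bca`, `selected` and `both` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ["Yair", "Ben", "Shir", "Natan"],
    'degree': ["BCA", "BCA", "M.Tech", "BCA"],
    'score': [90, 40, 80, 98],
})

# "~" negates a mask: rows whose degree is NOT 'BCA'
not_bca = df[~(df['degree'] == 'BCA')]

# isin() tests membership in a list of values
selected = df[df['name'].isin(['Ben', 'Shir'])]

# Parentheses are required: "&" binds tighter than ">=" and "=="
both = df[(df['score'] >= 80) & (df['degree'] == 'BCA')]

print(not_bca)
print(selected)
print(both)
```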


## Exercise 1

1. Filter "**df_a**" based on "Accessing a DataFrame with a boolean index" to get **True** index only
2. Filter "**df_b**" based on "Masking data based on column value" to get **score** **under** **85**
3. Filter "**df_b**" based on "Masking data based on column value" to get **score** **under** **85** **and** **M.Tech** **degree**
4. Filter "**df_b**" based on "Masking data based on column value" to get **score** **equal to or above** **55** **or** **BCA** **degree**

In [18]:
data_a = {'name':["Yair", "Ben", "Shir", "Natan"],
        'degree': ["MBA", "BCA", "M.Tech", "MBA"],
        'score':[90, 40, 80, 98]}
df_a = pd.DataFrame(data_a, index = [True, False, True, False])
  

data_b = {'name':["Yair", "Ben", "Shir", "Natan"],
        'degree': ["BCA", "BCA", "M.Tech", "BCA"],
        'score':[90, 40, 80, 98]}
df_b = pd.DataFrame(data_b)

In [25]:
df_a.loc[True]
df_b[df_b['score'] < 85]
df_b[(df_b['score'] < 85) & (df_b['degree'] == 'M.Tech')]
df_b[(df_b['score'] >= 55) | (df_b['degree'] == 'BCA')]

Unnamed: 0,name,degree,score
True,Yair,MBA,90
True,Shir,M.Tech,80


Unnamed: 0,name,degree,score
1,Ben,BCA,40
2,Shir,M.Tech,80


Unnamed: 0,name,degree,score
2,Shir,M.Tech,80


Unnamed: 0,name,degree,score
0,Yair,BCA,90
1,Ben,BCA,40
2,Shir,M.Tech,80
3,Natan,BCA,98


## Getting Data In/Out
When working with DataFrames, we usually want to create a DataFrame from existing data saved in local or remote storage and, after manipulating it, save the result back to local or remote storage.

### CSV

**.to_csv()** Write object to a comma-separated values (csv) file.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [27]:
data = {
    'name': ['Ori', 'Yarin', 'Shir'],
    'grades': [100, 100, 100]
}

df = pd.DataFrame(data)
df.to_csv('./Pandas_Tutorial_Part_2.csv') # change this location

**.read_csv()** Read a comma-separated values (csv) file into DataFrame.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [32]:
df = pd.read_csv('./Pandas_Tutorial_Part_2.csv') # change this location
df

Unnamed: 0.1,Unnamed: 0,name,grades
0,0,Ori,100
1,1,Yarin,100
2,2,Shir,100
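
The extra unnamed column in the result above appears because `to_csv` writes the row index by default and `read_csv` then reads it back as an ordinary column. The sketch below shows two common ways to avoid this. It uses an in-memory `io.StringIO` buffer as a stand-in for a file path, which is an assumption made only to keep the example self-contained:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Ori', 'Yarin', 'Shir'], 'grades': [100, 100, 100]})

# Option 1: do not write the index at all
buffer = io.StringIO()          # in-memory stand-in for a file path
df.to_csv(buffer, index=False)
buffer.seek(0)
no_index = pd.read_csv(buffer)

# Option 2: write the index, then tell read_csv which column holds it
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)

print(no_index)
print(with_index)
```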


In [33]:
# Sometimes we would like to read a csv from remote storage. To do so, we pass
# the URL where the data is stored instead of a local path.

df = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv")

df

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA
...,...,...
189,Paraguay,SOUTH AMERICA
190,Peru,SOUTH AMERICA
191,Suriname,SOUTH AMERICA
192,Uruguay,SOUTH AMERICA


### Excel

**.to_excel()** To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html

**.read_excel()** Read an Excel file into a pandas DataFrame.

Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

### HDF5

**.to_hdf()** Write the contained data to an HDF5 file using HDFStore.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html

**.read_hdf()** read data from an HDF5 file using HDFStore.

https://pandas.pydata.org/docs/reference/api/pandas.read_hdf.html

## Sorting 

### Sorting Index
We use '**sort_index**' to sort the DataFrame by its index values.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html

In [34]:
df = pd.DataFrame([10, 2, 35, 14, 25],
                  index=[100, 29, 234, 1, 150],
                  columns=['A'])
df

Unnamed: 0,A
100,10
29,2
234,35
1,14
150,25


In [35]:
df.sort_index()

Unnamed: 0,A
1,14
29,2
100,10
150,25
234,35


In [36]:
# By default, it sorts in ascending order. To sort in descending order, use ascending=False
df.sort_index(ascending=False)

Unnamed: 0,A
234,35
150,25
100,10
29,2
1,14
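
Note that `sort_index` returns a new DataFrame and leaves the original untouched, and that it can also order the column labels. A small sketch with an illustrative two-column frame (not taken from the cells above):

```python
import pandas as pd

df = pd.DataFrame({'b': [10, 2, 35], 'a': [1, 2, 3]}, index=[100, 29, 234])

sorted_rows = df.sort_index()          # new DataFrame, rows ordered by label
sorted_cols = df.sort_index(axis=1)    # order the column labels instead

# df itself is unchanged unless the result is assigned back
print(list(df.index))
print(list(sorted_rows.index))
print(list(sorted_cols.columns))
```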


### Sorting Values

We use '**sort_values**' to sort the DataFrame by the values of specific **columns**

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

In [37]:
df = pd.DataFrame({
    'name': ['aaron', 'aaron', 'moshe', 'guy', 'ben', 'yarin', 'sharon'],
    'math_grade': [53, 22, 78, 96, 84, 66, 77],
    'CS_grade': [72, 75, 40, 45, 91, 95, 63]
})
df

Unnamed: 0,name,math_grade,CS_grade
0,aaron,53,72
1,aaron,22,75
2,moshe,78,40
3,guy,96,45
4,ben,84,91
5,yarin,66,95
6,sharon,77,63


In [38]:
# Sort by 'name' column
df.sort_values(by=['name']) # ascending=True

Unnamed: 0,name,math_grade,CS_grade
0,aaron,53,72
1,aaron,22,75
4,ben,84,91
3,guy,96,45
2,moshe,78,40
6,sharon,77,63
5,yarin,66,95


In [39]:
# Sort by multiple columns:
# by the first column, then by the second to break ties

df.sort_values(by=['name', 'math_grade'])

Unnamed: 0,name,math_grade,CS_grade
1,aaron,22,75
0,aaron,53,72
4,ben,84,91
3,guy,96,45
2,moshe,78,40
6,sharon,77,63
5,yarin,66,95


In [40]:
# Sort Descending

df.sort_values(by='CS_grade', ascending=False)

Unnamed: 0,name,math_grade,CS_grade
5,yarin,66,95
4,ben,84,91
1,aaron,22,75
0,aaron,53,72
6,sharon,77,63
3,guy,96,45
2,moshe,78,40
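
When sorting by several columns, `ascending` may also be a list with one flag per column. The sketch below uses a shortened version of the grades table; `reset_index(drop=True)` is an optional extra step that renumbers the rows after sorting:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['aaron', 'aaron', 'moshe', 'guy'],
    'math_grade': [53, 22, 78, 96],
})

# name ascending, and within equal names math_grade descending
result = df.sort_values(by=['name', 'math_grade'], ascending=[True, False])

# reset_index(drop=True) renumbers the rows after sorting
result = result.reset_index(drop=True)
print(result)
```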


## Exercise 2 
1. Download and read the csv from "https://media.geeksforgeeks.org/wp-content/uploads/nba.csv"
1. Slice the DataFrame until the **15th** row
1. Sort the DataFrame's **Index** in a **descending** order
1. Sort the DataFrame using "**Number**" column in **ascending** order
1. Sort the DataFrame using "**Name**" and "**Weight**"
1. Save DataFrame to a local location (of your choice)

In [45]:
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/nba.csv')
df = df.iloc[:15] # keep the first 15 rows
df.sort_index(ascending=False)
df.sort_values(by='Number')
df.sort_values(by=['Name', 'Weight'])

df.to_csv('./Pandas_Tutorial_Part_2_2.csv')

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
14,Tyler Zeller,Boston Celtics,44.0,C,26.0,7-0,253.0,North Carolina,2616975.0
13,James Young,Boston Celtics,13.0,SG,20.0,6-6,215.0,Kentucky,1749840.0
12,Evan Turner,Boston Celtics,11.0,SG,27.0,6-7,220.0,Ohio State,3425510.0
11,Isaiah Thomas,Boston Celtics,4.0,PG,27.0,5-9,185.0,Washington,6912869.0
10,Jared Sullinger,Boston Celtics,7.0,C,24.0,6-9,260.0,Ohio State,2569260.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7-0,238.0,Gonzaga,2165160.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
11,Isaiah Thomas,Boston Celtics,4.0,PG,27.0,5-9,185.0,Washington,6912869.0
10,Jared Sullinger,Boston Celtics,7.0,C,24.0,6-9,260.0,Ohio State,2569260.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
12,Evan Turner,Boston Celtics,11.0,SG,27.0,6-7,220.0,Ohio State,3425510.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6-2,190.0,Louisville,1824360.0
13,James Young,Boston Celtics,13.0,SG,20.0,6-6,215.0,Kentucky,1749840.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6-4,220.0,Oklahoma State,3431040.0


Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
5,Amir Johnson,Boston Celtics,90.0,PF,29.0,6-9,240.0,,12000000.0
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
12,Evan Turner,Boston Celtics,11.0,SG,27.0,6-7,220.0,Ohio State,3425510.0
11,Isaiah Thomas,Boston Celtics,4.0,PG,27.0,5-9,185.0,Washington,6912869.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
13,James Young,Boston Celtics,13.0,SG,20.0,6-6,215.0,Kentucky,1749840.0
10,Jared Sullinger,Boston Celtics,7.0,C,24.0,6-9,260.0,Ohio State,2569260.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6-8,235.0,LSU,1170960.0


## Setting and Editing data

Sometimes we would like to change a single value or a few values, row-wise or column-wise.

In [48]:
data = {
    'day': ['Sunday', 'Monday', 'Tuesday'],
    'liters': [10, 15, 8]
}

df = pd.DataFrame(data)
df

Unnamed: 0,day,liters
0,Sunday,10
1,Monday,15
2,Tuesday,8


Change full column values

In [47]:
new_liters_values = pd.Series([12, 55, 8])
df['liters'] = new_liters_values

df

Unnamed: 0,day,liters
0,Sunday,12
1,Monday,55
2,Tuesday,8


In [50]:
# Indexes must be aligned, otherwise we will get NaNs (np.nan)

new_liters_values = pd.Series([12, 55, 8], index=[0, 2, 10])
df['liters'] = new_liters_values

df

Unnamed: 0,day,liters
0,Sunday,12.0
1,Monday,
2,Tuesday,55.0
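
The NaN above comes from label alignment: pandas matches the Series to the DataFrame by index label, not by position. If positional assignment is what is wanted, one option is to pass the raw values instead of the Series. This is a minimal sketch; the names `aligned` and `positional` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'day': ['Sunday', 'Monday', 'Tuesday'], 'liters': [10, 15, 8]})
new_values = pd.Series([12, 55, 8], index=[0, 2, 10])

# Series assignment aligns on index labels: label 1 has no match -> NaN
aligned = df.copy()
aligned['liters'] = new_values

# A plain array has no labels, so values are assigned by position
positional = df.copy()
positional['liters'] = new_values.to_numpy()

print(aligned)
print(positional)
```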


Setting value for a subset of values

In [51]:
df.loc[[0, 1], 'liters'] = 1 # set 'liters' to 1 in the first and second rows
df

Unnamed: 0,day,liters
0,Sunday,1.0
1,Monday,1.0
2,Tuesday,55.0


Setting value of a specific cell, by position

In [52]:
df.iloc[-1, 1] = 2 # last row, second column; -1 is a position, not a label, so .loc needs the label: df.loc[2, 'liters'] = 2
df

Unnamed: 0,day,liters
0,Sunday,1.0
1,Monday,1.0
2,Tuesday,2.0


Setting values using NumPy Array

In [53]:
# Reminder: in order to get dimensions of DataFrame - use df.shape -> (rows num, columns num)

df.loc[:, 'liters'] = np.array([42] * df.shape[0])
df 

Unnamed: 0,day,liters
0,Sunday,42
1,Monday,42
2,Tuesday,42


Setting values using a function


In [54]:
df['liters'] = df['liters'].apply(lambda x: x + 10) # adds 10 to each value in 'liters' column
df

Unnamed: 0,day,liters
0,Sunday,52
1,Monday,52
2,Tuesday,52


Setting values based on a condition

In [55]:
data = {
    'day': ['Sunday', 'Monday', 'Tuesday'],
    'liters': [10, 15, 8]
}

df = pd.DataFrame(data)
df

Unnamed: 0,day,liters
0,Sunday,10
1,Monday,15
2,Tuesday,8


In [56]:
df_2 = df.copy()
df_2

Unnamed: 0,day,liters
0,Sunday,10
1,Monday,15
2,Tuesday,8


In [57]:
# Caution: a row mask without a column label assigns 0 to every column of the matching rows
df_2[df_2['liters'] % 2 == 0] = 0
df_2

Unnamed: 0,day,liters
0,0,0
1,Monday,15
2,0,0


In [58]:
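# Chained indexing assigns through an intermediate object and triggers SettingWithCopyWarning;
# prefer a single call: df_2.loc[df_2['liters'] % 2 == 0, 'liters'] = 0
df_2['liters'][df_2['liters'] % 2 == 0] = 0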
df_2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,day,liters
0,0,0
1,Monday,15
2,0,0
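
To change only one column for the rows that match a condition, a single `.loc` call with a row mask and a column label is the usual approach. A minimal sketch on a fresh copy of the same data:

```python
import pandas as pd

df = pd.DataFrame({'day': ['Sunday', 'Monday', 'Tuesday'], 'liters': [10, 15, 8]})

mask = df['liters'] % 2 == 0          # True where liters is even

# Row mask and column label in a single .loc call: only 'liters' changes
df.loc[mask, 'liters'] = 0
print(df)
```

Because the row selection and the column selection happen in one indexing operation, the 'day' column is left untouched and no intermediate object is assigned to.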


## Exercise 3

1. Download and read the csv from "https://media.geeksforgeeks.org/wp-content/uploads/nba.csv"
1. Slice the DataFrame until the **15th** row
1. Change **all** values in '**Height**' column to **180**
1. Change the value of the **first and second** players' '**Number**' to **100**
1. Change the '**Age**' of players playing in '**Position**' '**PF**' to **50**
1. Increase the "**Salary**" of all players by **10** using **apply** with a **lambda function**

In [71]:
df = pd.read_csv('https://media.geeksforgeeks.org/wp-content/uploads/nba.csv')
df = df.iloc[:15].copy() # copy so that the assignments below modify an independent DataFrame
df['Height'] = 180
df.loc[[0,1], 'Number'] = 100
df.loc[df['Position'] == 'PF', 'Age'] = 50
df['Salary'] = df['Salary'].apply(lambda x: x + 10)
df

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,100.0,PG,25.0,180,180.0,Texas,7730347.0
1,Jae Crowder,Boston Celtics,100.0,SF,25.0,180,235.0,Marquette,6796127.0
2,John Holland,Boston Celtics,30.0,SG,27.0,180,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,180,185.0,Georgia State,1148650.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,50.0,180,231.0,,5000010.0
5,Amir Johnson,Boston Celtics,90.0,PF,50.0,180,240.0,,12000010.0
6,Jordan Mickey,Boston Celtics,55.0,PF,50.0,180,235.0,LSU,1170970.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,180,238.0,Gonzaga,2165170.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,180,190.0,Louisville,1824370.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,180,220.0,Oklahoma State,3431050.0


## Mathematical and Statistical Operations

Just like plain Python objects and NumPy arrays, pandas supports a variety of mathematical and statistical operations on Series and DataFrames.

In [72]:
s = pd.Series([10, 20, -10])
s

0    10
1    20
2   -10
dtype: int64

In [73]:
data = {
    'a': [10, 15, 20],
    'b': [3, 9, -16],
    'c': [-1, -10, -7]
}

df = pd.DataFrame(data)
df

Unnamed: 0,a,b,c
0,10,3,-1
1,15,9,-10
2,20,-16,-7


In [74]:
# Absolute values

print(s.abs())
print()
print(df.abs())

0    10
1    20
2    10
dtype: int64

    a   b   c
0  10   3   1
1  15   9  10
2  20  16   7


In [75]:
# Add a scalar: the operator and the method version return the same result.

print(df + 1)
print(df.add(1))
print()

print(df.add([1, 2, 3])) # print(df.add([1, 2, 3], axis=1))
print(df.add([1, 2, 3], axis=0))

    a   b  c
0  11   4  0
1  16  10 -9
2  21 -15 -6
    a   b  c
0  11   4  0
1  16  10 -9
2  21 -15 -6

    a   b  c
0  11   5  2
1  16  11 -7
2  21 -14 -4
    a   b  c
0  11   4  0
1  17  11 -8
2  23 -13 -4
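
The `axis` argument decides how a list is paired with the DataFrame: along the columns (the default) each list element belongs to one column, while along the index each element belongs to one row. The sketch below repeats the addition above with the axis spelled out by name; `per_column` and `per_row` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 15, 20], 'b': [3, 9, -16], 'c': [-1, -10, -7]})

# axis='columns' (the default, same as axis=1): one list element per column
per_column = df.add([1, 2, 3], axis='columns')

# axis='index' (same as axis=0): one list element per row
per_row = df.add([1, 2, 3], axis='index')

print(per_column)
print(per_row)
```

The same rule applies to `sub`, `mul`, `div` and `pow`.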


In [76]:
# Subtract a scalar or a list (along an axis), with the operator and the method version.

print(df - 1)
print(df.sub(1))
print()

print(df.sub([1, 2, 3])) # print(df.sub([1, 2, 3], axis=1))
print(df.sub([1, 2, 3], axis=0))

    a   b   c
0   9   2  -2
1  14   8 -11
2  19 -17  -8
    a   b   c
0   9   2  -2
1  14   8 -11
2  19 -17  -8

    a   b   c
0   9   1  -4
1  14   7 -13
2  19 -18 -10
    a   b   c
0   9   2  -2
1  13   7 -12
2  17 -19 -10


In [77]:
# Multiply by a scalar or a list (along an axis), with the operator and the method version.

print(df * 2)
print(df.mul(2))
print(df.mul(2, fill_value=0)) # fill_value replaces NaN values in the DataFrame before multiplying
print()

print(df.mul([1, 2, 3])) # print(df.mul([1, 2, 3], axis=1))
print(df.mul([1, 2, 3], axis=0))

    a   b   c
0  20   6  -2
1  30  18 -20
2  40 -32 -14
    a   b   c
0  20   6  -2
1  30  18 -20
2  40 -32 -14
    a   b   c
0  20   6  -2
1  30  18 -20
2  40 -32 -14

    a   b   c
0  10   6  -3
1  15  18 -30
2  20 -32 -21
    a   b   c
0  10   3  -1
1  30  18 -20
2  60 -48 -21


In [78]:
# Divide by constant

print(df / 2)
print(df.div(2))
print(df.div(2, fill_value=0)) # fill_value replaces NaN values in the DataFrame before dividing
print()

print(df.div([1, 2, 3])) # print(df.div([1, 2, 3], axis=1))
print(df.div([1, 2, 3], axis=0))

      a    b    c
0   5.0  1.5 -0.5
1   7.5  4.5 -5.0
2  10.0 -8.0 -3.5
      a    b    c
0   5.0  1.5 -0.5
1   7.5  4.5 -5.0
2  10.0 -8.0 -3.5
      a    b    c
0   5.0  1.5 -0.5
1   7.5  4.5 -5.0
2  10.0 -8.0 -3.5

      a    b         c
0  10.0  1.5 -0.333333
1  15.0  4.5 -3.333333
2  20.0 -8.0 -2.333333
           a         b         c
0  10.000000  3.000000 -1.000000
1   7.500000  4.500000 -5.000000
2   6.666667 -5.333333 -2.333333


In [79]:
# Raise each value to a power.

print(df ** 2)
print(df.pow(2))
print(df.pow(2, fill_value=0)) # in case of NaN values in the DataFrame
print()

print(df.pow([1, 2, 3])) # print(df.pow([1, 2, 3], axis=1))
print(df.pow([1, 2, 3], axis=0))

     a    b    c
0  100    9    1
1  225   81  100
2  400  256   49
     a    b    c
0  100    9    1
1  225   81  100
2  400  256   49
     a    b    c
0  100    9    1
1  225   81  100
2  400  256   49

    a    b     c
0  10    9    -1
1  15   81 -1000
2  20  256  -343
      a     b    c
0    10     3   -1
1   225    81  100
2  8000 -4096 -343


In [80]:
# Calculate any root: df.pow(1 / N)
# Note: negative values produce NaN

# Square root
print(df.pow(1 / 2))
# Cube root
print(df.pow(1 / 3))

          a         b   c
0  3.162278  1.732051 NaN
1  3.872983  3.000000 NaN
2  4.472136       NaN NaN
          a         b   c
0  2.154435  1.442250 NaN
1  2.466212  2.080084 NaN
2  2.714418       NaN NaN
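
The NaN entries above appear because a fractional power of a negative number has no real floating-point result. For cube roots specifically, NumPy's `np.cbrt` returns the real root for negative inputs as well. A small sketch with illustrative values chosen to be perfect cubes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [8, 27], 'c': [-8, -27]})

# pow(1 / 3) returns NaN for negative entries
via_pow = df.pow(1 / 3)

# np.cbrt computes the real cube root, including for negative numbers
via_cbrt = np.cbrt(df)

print(via_pow)
print(via_cbrt)
```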


In [81]:
# Sum

print(s.sum())
print()

print(df.sum()) # df.sum(axis=0)
print()
print(df.sum(axis=1))

20

a    45
b    -4
c   -18
dtype: int64

0    12
1    14
2    -3
dtype: int64


In [82]:
# Calculate the variance
# Normalized by N-1 by default. This can be changed using the ddof argument
# The divisor used in calculations is N - ddof, where N represents the number of elements.

print(df.var())
print(df.var(ddof=0))

a     25.000000
b    170.333333
c     21.000000
dtype: float64
a     16.666667
b    113.555556
c     14.000000
dtype: float64


In [83]:
# Calculate the standard deviation
# Normalized by N-1 by default. This can be changed using the ddof argument
# The divisor used in calculations is N - ddof, where N represents the number of elements.

print(df.std())
print(df.std(ddof=0))

a     5.000000
b    13.051181
c     4.582576
dtype: float64
a     4.082483
b    10.656245
c     3.741657
dtype: float64
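
To make the role of `ddof` concrete, the variance of column 'a' can be recomputed by hand: the sum of squared deviations from the mean is divided by N - ddof. The variable names below are illustrative:

```python
import pandas as pd

s = pd.Series([10, 15, 20])
n = len(s)
squared_dev = ((s - s.mean()) ** 2).sum()   # sum of squared deviations

sample_var = squared_dev / (n - 1)      # ddof=1, the pandas default
population_var = squared_dev / n        # ddof=0

print(sample_var, s.var())
print(population_var, s.var(ddof=0))
```

The standard deviation is the square root of the corresponding variance, so the same divisor applies to `std`.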


In [84]:
# Return the minimum of the values over the requested axis.
print(df.min())
print(df.min(axis=1))
print()

# Return the index of the minimum of the values over the requested axis.
print(df.idxmin())
print(df.idxmin(axis=1))

a    10
b   -16
c   -10
dtype: int64
0    -1
1   -10
2   -16
dtype: int64

a    0
b    2
c    1
dtype: int64
0    c
1    c
2    b
dtype: object


In [85]:
# Return the maximum of the values over the requested axis.
print(df.max())
print(df.max(axis=1))
print()

# Return the index of the maximum of the values over the requested axis.
print(df.idxmax())
print(df.idxmax(axis=1))

a    20
b     9
c    -1
dtype: int64
0    10
1    15
2    20
dtype: int64

a    2
b    1
c    0
dtype: int64
0    a
1    a
2    a
dtype: object


In [None]:
# Operations between Series and DataFrames

In [86]:
s

0    10
1    20
2   -10
dtype: int64

In [87]:
df

Unnamed: 0,a,b,c
0,10,3,-1
1,15,9,-10
2,20,-16,-7


In [88]:
# Series-Series operation

s.mul(s)

0    100
1    400
2    100
dtype: int64

In [90]:
# Series-DataFrame operation

df.mul(s, axis=0)

Unnamed: 0,a,b,c
0,100,30,-10
1,300,180,-200
2,-200,160,70


In [91]:
# DataFrame-DataFrame operation

df.mul(df)

Unnamed: 0,a,b,c
0,100,9,1
1,225,81,100
2,400,256,49
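
In a Series-DataFrame operation, the Series is matched to the DataFrame by label. By default the Series index is matched against the column labels; with `axis=0` it is matched against the row labels. The sketch below shows both cases; `col_weights` and `row_weights` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 15, 20], 'b': [3, 9, -16], 'c': [-1, -10, -7]})

# Series labelled like the columns: the default axis pairs it with a, b, c
col_weights = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
by_column = df.mul(col_weights)

# Series labelled like the rows: axis=0 pairs it with row labels 0, 1, 2
row_weights = pd.Series([10, 20, -10])
by_row = df.mul(row_weights, axis=0)

print(by_column)
print(by_row)
```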


## Merge Dataframes

### Concat

pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

**pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)**

Concatenating pandas objects together with concat():

In [92]:
data_1 = {
    'name': ['Yair', 'Moshe'],
    'grade': [100, 100]
}

data_2 = {
    'name': ['Ben', 'Shir'],
    'grade': [85, 94]
}

df_1 = pd.DataFrame(data_1)
df_2 = pd.DataFrame(data_2)

print(df_1)
print(df_2)

    name  grade
0   Yair    100
1  Moshe    100
   name  grade
0   Ben     85
1  Shir     94


In [93]:
# Combine 2 DataFrames

pd.concat([df_1, df_2])

Unnamed: 0,name,grade
0,Yair,100
1,Moshe,100
0,Ben,85
1,Shir,94


In [94]:
# Clear the existing index and reset it in the result by setting the ignore_index option to True.

pd.concat([df_1, df_2], ignore_index=True)

Unnamed: 0,name,grade
0,Yair,100
1,Moshe,100
2,Ben,85
3,Shir,94
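
Two other `concat` parameters from the signature above are worth a quick look: `axis=1` places the frames side by side, and `keys` labels each source frame. In this sketch the key labels 'class_a' and 'class_b' are hypothetical names chosen for illustration:

```python
import pandas as pd

df_1 = pd.DataFrame({'name': ['Yair', 'Moshe'], 'grade': [100, 100]})
df_2 = pd.DataFrame({'name': ['Ben', 'Shir'], 'grade': [85, 94]})

# axis=1 places the frames side by side, aligned on the row index
side_by_side = pd.concat([df_1, df_2], axis=1)

# keys= labels each source frame in an outer index level
labelled = pd.concat([df_1, df_2], keys=['class_a', 'class_b'])

print(side_by_side)
print(labelled)
```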


### Merge
SQL-style merges. See [Database-style joining](http://pandas.pydata.org/pandas-docs/stable/merging.html#merging-join)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

**DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)**

In [95]:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

print(df1)
print()
print(df2)

  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5

  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8


Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

In [96]:
df1.merge(df2, left_on='lkey', right_on='rkey')


Unnamed: 0,lkey,value_x,rkey,value_y
0,foo,1,foo,5
1,foo,1,foo,8
2,foo,5,foo,5
3,foo,5,foo,8
4,bar,2,bar,6
5,baz,3,baz,7


Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.



In [97]:
df1.merge(df2, left_on='lkey', right_on='rkey',
          suffixes=('_left', '_right'))

Unnamed: 0,lkey,value_left,rkey,value_right
0,foo,1,foo,5
1,foo,1,foo,8
2,foo,5,foo,5
3,foo,5,foo,8
4,bar,2,bar,6
5,baz,3,baz,7
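
The `how` parameter controls which keys survive the merge, and `indicator=True` adds a `_merge` column that records where each row came from. The following sketch uses two small illustrative frames that share a 'key' column; they are assumptions for this example, not the `df1`/`df2` defined above:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'value': [1, 2, 3]})
right = pd.DataFrame({'key': ['foo', 'bar', 'qux'], 'value': [5, 6, 7]})

inner = left.merge(right, on='key')                   # keys present in both frames
left_join = left.merge(right, on='key', how='left')   # every key from the left frame
outer = left.merge(right, on='key', how='outer', indicator=True)  # all keys

print(inner)
print(left_join)
print(outer)
```

Rows without a match on one side receive NaN in that side's columns.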


### Append

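Append rows to a DataFrame. **DataFrame.append** was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, use **pd.concat** instead.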

http://pandas.pydata.org/pandas-docs/stable/merging.html#merging-concatenation

**DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)**

In [98]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,-1.221954,0.310123,0.474508,-0.259488
1,-0.973505,0.870196,-1.405098,1.34384
2,-1.509555,0.375862,0.721538,-0.977737
3,-1.523562,-0.802582,0.060863,1.35102
4,-1.462458,0.538927,1.632286,0.946544
5,-0.211631,-1.213286,1.079478,0.394429
6,-0.608691,1.326229,0.526437,-1.139497
7,1.771969,1.528414,0.932571,0.248426


In [99]:
s = df.iloc[3]
s

A   -1.523562
B   -0.802582
C    0.060863
D    1.351020
Name: 3, dtype: float64

In [100]:
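# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
# On pandas < 2.0 the original call was: df.append(s, ignore_index=True)
pd.concat([df, s.to_frame().T], ignore_index=True)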

Unnamed: 0,A,B,C,D
0,-1.221954,0.310123,0.474508,-0.259488
1,-0.973505,0.870196,-1.405098,1.34384
2,-1.509555,0.375862,0.721538,-0.977737
3,-1.523562,-0.802582,0.060863,1.35102
4,-1.462458,0.538927,1.632286,0.946544
5,-0.211631,-1.213286,1.079478,0.394429
6,-0.608691,1.326229,0.526437,-1.139497
7,1.771969,1.528414,0.932571,0.248426
8,-1.523562,-0.802582,0.060863,1.35102


General note: calling **append** repeatedly (for example inside a loop) copies the data and builds a new DataFrame on every call, so adding rows one by one is slow. Collecting the pieces and calling **concat** **once** avoids the repeated copying, which makes it **faster**.

* **Concat** gives the flexibility to join along either axis (rows or columns)
* **Append** is a specific case of concat (axis=0, join='outer')
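
A common pattern that follows from this note is to collect the pieces in a Python list and concatenate once at the end. The rows below are hypothetical data used only for illustration:

```python
import pandas as pd

# Hypothetical rows arriving one at a time
rows = [{'name': 'Yair', 'grade': 100},
        {'name': 'Ben', 'grade': 85},
        {'name': 'Shir', 'grade': 94}]

# Collect small frames in a list, then concatenate once at the end
pieces = [pd.DataFrame([row]) for row in rows]
combined = pd.concat(pieces, ignore_index=True)

print(combined)
```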

## Statistics on the data

https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html

In [101]:
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
df

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


**describe()** shows a quick statistical summary of the numeric columns in your data

In [102]:
df.describe()

Unnamed: 0,Number,Age,Weight,Salary
count,457.0,457.0,457.0,446.0
mean,17.678337,26.938731,221.522976,4842684.0
std,15.96609,4.404016,26.368343,5229238.0
min,0.0,19.0,161.0,30888.0
25%,5.0,24.0,200.0,1044792.0
50%,13.0,26.0,220.0,2839073.0
75%,25.0,30.0,240.0,6500000.0
max,99.0,40.0,307.0,25000000.0
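
When only a few statistics are needed, `agg` accepts a dictionary that maps each column to the statistics to compute for it. The sketch below uses a small in-memory stand-in for the NBA data; the values are illustrative and not taken from the real file:

```python
import pandas as pd

# Small in-memory stand-in for the NBA data (values are illustrative)
df = pd.DataFrame({
    'Position': ['PG', 'PF', 'PG', 'C'],
    'Age': [25, 29, 22, 26],
    'Weight': [180.0, 231.0, 190.0, 256.0],
})

# Choose specific statistics per column with agg()
summary = df.agg({'Age': ['min', 'max'], 'Weight': ['mean']})
print(summary)
```

Cells for which a statistic was not requested (for example the mean of 'Age') are filled with NaN in the result.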


Computing a single descriptive statistic

In [103]:
df.mean(numeric_only=True) # numeric_only=True skips text columns such as Name and Team

Number    1.767834e+01
Age       2.693873e+01
Weight    2.215230e+02
Salary    4.842684e+06
dtype: float64

The same operation along the rows axis

In [104]:
df.mean(axis=1, numeric_only=True)

0      1.932636e+06
1      1.699119e+06
2      8.733333e+01
3      2.872188e+05
4      1.250067e+06
           ...     
453    6.083925e+05
454    2.250570e+05
455    7.250758e+05
456    2.368892e+05
457             NaN
Length: 458, dtype: float64