# DataFrame Calculations

## Introduction
There are cases where we might want to augment our Pandas DataFrame with calculated columns. Pandas enables us to perform these calculations and easily store them in a new column.

## Working with Constants 

We can add new calculated columns using an existing column and a constant value.

Recall our animals dataset. We will use this dataset and create a new column that converts the body weight in pounds to kilograms.

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Read animals 
animals = pd.read_csv('https://raw.githubusercontent.com/loukjsmalbil/datasets_ws/master/animals.csv')
animals.head()

In [None]:
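# Multiply animals['bodywt'] by 0.45359237 (pounds to kilograms) and store the result in the 'brainwtkg' column.
# Note: despite its name, this column holds body weight in kilograms.
animals['brainwtkg'] = animals['bodywt'] * 0.45359237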

animals.head()

Note that we used the head method to display the first 5 rows of the DataFrame. We do this to confirm that the new column was added as expected.
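
The same pattern can be tried on a tiny hand-made DataFrame. The sketch below is illustrative only: the `toy` data, the `LB_TO_KG` constant name and the `bodywt_kg` column are made up for this example and are not part of the animals dataset.

```python
import pandas as pd

# Made-up data; only the 'bodywt' column name mirrors the animals dataset.
toy = pd.DataFrame({"name": ["cat", "dog"], "bodywt": [10.0, 50.0]})

LB_TO_KG = 0.45359237  # kilograms per pound

# The constant is applied to every row of the column at once.
toy["bodywt_kg"] = toy["bodywt"] * LB_TO_KG

print(toy)
```

No loop is needed: multiplying a column by a single number applies the operation element by element.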

## Combining Two (Or More) Columns

We can perform calculations using a combination of two or more columns. We write an expression that refers to the columns in the DataFrame and assign the result to a new column.

For example, we can compute the ratio of body weight to brain weight for all animals in our data and assign this value to a new column.

In [None]:
# Compute the ratio animals['bodywt'] / animals['brainwt'] and store it as animals['wtratio']

animals['wtratio'] = animals['bodywt'] / animals['brainwt']
animals.head()

## Conditional Calculations

It is possible to perform more complex calculations. For example, you may have noticed that we used division in the previous example without checking whether the denominator is zero. Pandas does not raise an error in that case; it silently stores inf (or NaN for 0/0), which can distort later calculations. Therefore, we can introduce a condition in our assignment: if the brain weight is zero, the ratio is set to zero; otherwise, the computed ratio is stored in the new column.

We can create such conditional columns using the where function in numpy. We pass 3 arguments to the function: the condition, the value to use where the condition is true, and the value to use where the condition is false. The cell below applies the same idea to label each row as 'High' or 'Low'.

In [None]:
# Make a new column with 'High' and 'Low' values using np.where: 'High' where the 'brainwtkg' column (body weight in kg) is above 5, 'Low' otherwise

animals['Low/High'] = np.where(animals['brainwtkg'] > 5, 'High', 'Low')

animals

In [None]:
# With only a condition, np.where returns the row positions where the condition is True
np.where(animals['brainwtkg'] > 5)
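
The zero-denominator check described in the text above could be sketched as follows. This is a minimal illustration on made-up data (`toy`, `safe_brain` and `zero_positions` are hypothetical names), not the lesson's own solution. It assumes that a ratio of zero is the desired fallback, as stated in the text.

```python
import numpy as np
import pandas as pd

# Made-up data: the second animal has a brain weight of zero.
toy = pd.DataFrame({"bodywt": [10.0, 50.0, 3.0], "brainwt": [2.0, 0.0, 1.5]})

# np.where computes both branches for every row, so swap zeros for NaN first
# to keep the division itself free of zero denominators.
safe_brain = toy["brainwt"].replace(0, np.nan)

# Arguments: condition, value where True, value where False.
toy["wtratio"] = np.where(toy["brainwt"] == 0, 0.0, toy["bodywt"] / safe_brain)

# With a condition only, np.where returns a tuple of index arrays
# holding the positions where the condition is True.
zero_positions = np.where(toy["brainwt"].to_numpy() == 0)[0]

print(toy)
print(zero_positions)
```

Whether zero is a sensible fallback depends on the analysis; leaving the value as NaN is a common alternative because it keeps the missing ratio visible.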

## Calculations Using Functions 
As we learned in a previous lesson, Pandas DataFrames have 3 components: rows, columns and data. The rows and columns are also called axes. Axis zero is the row axis and axis one is the column axis. Passing axis=0 to a function such as sum collapses the rows and returns one value per column, while passing axis=1 collapses the columns and returns one value per row.

Let's say we want to add up the weight columns for each animal in the animals DataFrame. We can do this by selecting those columns and calling the sum function with axis=1 as an argument.

In [None]:
animals = pd.read_csv('https://raw.githubusercontent.com/loukjsmalbil/datasets_ws/master/animals.csv')

animals.head()

In [None]:
# Sum the brainwt and bodywt values row by row (axis=1)
animals['sum'] = animals[['brainwt', 'bodywt']].sum(axis=1)
animals

In [None]:
# Alternatively, simply add the two columns and store the result in the column 'Total'
animals['Total'] = animals['brainwt'] + animals['bodywt']
animals.head()
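
To make the difference between the two axes concrete, here is a small sketch on made-up numbers. The `toy` DataFrame and the variable names are assumptions for illustration; only the column names mirror the animals data.

```python
import pandas as pd

toy = pd.DataFrame({"brainwt": [1.0, 2.0], "bodywt": [10.0, 20.0]})

# axis=0 collapses the rows: one sum per column.
col_sums = toy.sum(axis=0)

# axis=1 collapses the columns: one sum per row.
row_sums = toy.sum(axis=1)

print(col_sums)
print(row_sums)
```

Here `col_sums` is indexed by column name and `row_sums` by row label, which is why only the axis=1 result can be stored directly as a new column.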

## Other Useful Functionalities

Apart from these basic operations, Pandas also offers useful functionality that makes calculations more efficient and effective. As we saw in the previous lecture, one example is grouping the data and aggregating each group.

In [None]:
# Read csv cars
cars = pd.read_csv('cars.csv')
cars.head()

In [None]:
# Group by 'Make' and compute the mean of 'Fuel Barrels/Year' for each make
cars[['Make', 'Fuel Barrels/Year']].groupby('Make').mean()

# Do the same for 'Cylinders' (a notebook cell displays only its last expression)
cars[['Make', 'Cylinders']].groupby('Make').mean()
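
Both means can also be obtained in a single call by selecting several columns after grouping. The sketch below uses a made-up `cars_toy` DataFrame, since the contents of cars.csv are not shown here; only the column names are taken from the cells above.

```python
import pandas as pd

# Made-up rows that reuse the column names from the cars data.
cars_toy = pd.DataFrame({
    "Make": ["Acura", "Acura", "BMW"],
    "Cylinders": [4, 6, 6],
    "Fuel Barrels/Year": [12.0, 14.0, 16.0],
})

# Group once, then average both numeric columns per make.
means = cars_toy.groupby("Make")[["Cylinders", "Fuel Barrels/Year"]].mean()

print(means)
```

The result has one row per make and one column per averaged variable.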

## Isin

The isin method checks, for each value in a column, whether it appears in a given list. It returns a boolean mask that we can use to select the matching rows as a subset.

In [None]:
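# Keep only the rows whose Year is 2004 (Year holds numbers, so we pass 2004 rather than the string '2004')
cars_years = cars[cars['Year'].isin([2004])]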
cars_years.head()

In [None]:
# Select the cars whose Make is 'Acura'
cars[cars['Make'].isin(['Acura'])]
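
isin is most useful with several values at once, and the mask can be negated with `~` to exclude rows instead. A minimal sketch on made-up data (the `toy` DataFrame and variable names are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Make": ["Acura", "BMW", "Audi", "Acura"],
    "Year": [2004, 2014, 2004, 2010],
})

# Boolean mask: True where Year is one of the listed values.
mask = toy["Year"].isin([2004, 2014])
subset = toy[mask]

# ~ inverts the mask, keeping every make except Acura.
others = toy[~toy["Make"].isin(["Acura"])]

print(subset)
print(others)
```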

## Iloc (and Loc)

We have already discussed .loc, which selects rows and columns by their labels (or by a boolean condition on the values). We can also use .iloc. The name is short for integer location: it selects rows and columns by their integer position rather than by label, which keeps our code generic when the labels are unknown or may change.

In [None]:
cars.head()

In [None]:
# Use iloc to select the row at integer position 18000, with all of its columns
cars.iloc[18000,:]
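
The difference between label-based and position-based selection is easiest to see when the index labels are not 0, 1, 2. The sketch below uses a made-up DataFrame with the labels 10, 20 and 30 (all names are assumptions for illustration).

```python
import pandas as pd

toy = pd.DataFrame(
    {"Make": ["Acura", "BMW", "Audi"], "Year": [2004, 2014, 2010]},
    index=[10, 20, 30],
)

# .loc uses labels: the row labelled 20, the column named 'Make'.
by_label = toy.loc[20, "Make"]

# .iloc uses positions: second row, first column.
by_position = toy.iloc[1, 0]

# Position slices exclude the end point; label slices include it.
first_two = toy.iloc[:2, :]
label_slice = toy.loc[10:20]

print(by_label, by_position)
```

Both lookups return the same cell here, but only because label 20 happens to sit at position 1.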

## Sort Values

The sort_values method, as the name suggests, sorts a DataFrame by the values in one or more columns.

In [None]:
# Sort by 'Drivetrain' in ascending order using the sort_values method
cars.sort_values(by='Drivetrain', ascending=True)
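
sort_values also accepts a list of columns, with a separate sort direction for each. A minimal sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Make": ["BMW", "Acura", "BMW", "Acura"],
    "Year": [2010, 2004, 2014, 2012],
})

# Sort by Make (A to Z), then by Year (newest first) within each make.
sorted_toy = toy.sort_values(by=["Make", "Year"], ascending=[True, False])

# sort_values returns a new DataFrame and leaves toy unchanged.
# reset_index(drop=True) renumbers the rows after sorting.
sorted_toy = sorted_toy.reset_index(drop=True)

print(sorted_toy)
```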

## Count

Another useful method is .count, which returns the number of non-null values. Combined with groupby, it tells us how many entries each group contains.

In [None]:
# Use count with groupby('Year') to see how many non-null Model entries there are per year
# (count includes repeated models; nunique() would count distinct models instead)
cars[['Model', 'Year']].groupby('Year').count()

In [None]:
# Use value_counts to count how many cars there are for each Make
cars['Make'].value_counts()
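
count, nunique and value_counts answer slightly different questions. The sketch below contrasts them on made-up data that contains a repeated model and a missing value (all names are assumptions for illustration).

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2004, 2004, 2004, 2005],
    "Model": ["A", "A", "B", None],
})

# count: non-null entries per group, repeats included.
per_year_count = toy.groupby("Year")["Model"].count()

# nunique: distinct non-null entries per group.
per_year_unique = toy.groupby("Year")["Model"].nunique()

# value_counts: how often each value occurs in the whole column.
model_counts = toy["Model"].value_counts()

print(per_year_count)
print(per_year_unique)
print(model_counts)
```

In 2004 there are three entries but only two distinct models, and the missing model in 2005 is not counted by either method.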

## Info() and isnull()

The info method gives us a concise summary of the DataFrame: how many rows and columns there are, the data type of each column, and how many non-null values each column holds.

In [None]:
# Info 
cars.info()

Another very useful tool is the .isnull() method, which flags the missing values (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like arrays).

In [None]:
# Read housing df
housing = pd.read_csv('https://raw.githubusercontent.com/loukjsmalbil/datasets_ws/master/housing_prices.csv')
housing

In [None]:
# Check for null values using isnull() and value_counts() for Alley
housing['Alley'].isnull().value_counts()

In [None]:
# isna() is an alias of isnull(), so this gives the same result for Alley
housing['Alley'].isna().value_counts()
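
To check every column at once, the boolean result of isnull can be summed, because True counts as 1. A minimal sketch on made-up data (the `toy` DataFrame only borrows the `Alley` column name; `LotArea` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Alley": ["Grvl", None, None, "Pave"],
    "LotArea": [8450, 9600, np.nan, 11250],
})

# Number of missing values per column.
missing_per_column = toy.isnull().sum()

# Share of missing values per column (the mean of a boolean column).
missing_share = toy.isnull().mean()

print(missing_per_column)
print(missing_share)
```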

## Datetime

Sometimes we want to convert numeric values, such as a year, to datetime objects.

In [None]:
# Reload the cars data and display the first rows
cars = pd.read_csv('cars.csv')
cars.head()

In [None]:
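# Convert Year to datetime using pd.to_datetime; format='%Y' makes pandas read each value as a four-digit year
# (without a format, plain integers are interpreted as nanoseconds since 1970-01-01)
cars['Year'] = pd.to_datetime(cars['Year'].astype(str), format='%Y')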
cars.dtypes
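
When a year is stored as a plain integer, pd.to_datetime needs to be told how to read it. The sketch below, on made-up data, compares a conversion with an explicit `format='%Y'` to a conversion without one (the `toy` DataFrame and variable names are assumptions for illustration).

```python
import pandas as pd

toy = pd.DataFrame({"Year": [2004, 2014]})

# With an explicit format, each value is parsed as a four-digit year
# and becomes January 1st of that year.
parsed = pd.to_datetime(toy["Year"].astype(str), format="%Y")

# Without a format, integers are read as nanoseconds since 1970-01-01,
# so every value lands in 1970.
naive = pd.to_datetime(toy["Year"])

# The .dt accessor exposes the parts of a datetime column.
print(parsed.dt.year.tolist())
print(naive.dt.year.tolist())
```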

## Correlation

We can also compute the pairwise correlation coefficients between the numeric columns of a DataFrame using the corr method.

In [None]:
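# Compute the correlation matrix of the animals data using Pearson's method.
# numeric_only=True (pandas 1.5+) skips any non-numeric columns, which would otherwise raise an error in pandas 2.
animals_corr = animals.corr(method='pearson', numeric_only=True)

# Other supported methods: 'spearman', 'kendall'
animals_corr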

In [None]:
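# Correlation matrix of the numeric housing columns (numeric_only=True skips the text columns)
housing_corr = housing.corr(numeric_only=True)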
housing_corr

In [None]:
# Sort the variables by their correlation with housing_corr['SalePrice'], strongest positive first
housing_corr['SalePrice'].sort_values(ascending = False)
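
A correlation matrix is easier to read on data where the answer is known in advance. In the made-up example below, `y` rises exactly with `x` and `z` falls exactly with `x`. The sketch assumes pandas 1.5 or later for the `numeric_only` argument; all names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # perfectly increasing with x
    "z": [4.0, 3.0, 2.0, 1.0],   # perfectly decreasing with x
    "label": ["a", "b", "c", "d"],
})

# numeric_only=True leaves the text column out of the matrix.
corr = toy.corr(method="pearson", numeric_only=True)

# Rank all variables by their correlation with x.
ranking = corr["x"].sort_values(ascending=False)

print(corr)
print(ranking)
```

The coefficient for `x` and `y` is 1, and for `x` and `z` it is -1, up to floating-point rounding. Every column also correlates perfectly with itself, so the variable being examined always appears at the top of such a ranking.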

## Unique Method

Lastly, we can obtain all unique values in a dataframe column. 

In [None]:
# Select all unique car manufacturers
cars['Make'].unique()
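
unique returns the distinct values in order of first appearance, while nunique returns only how many there are. A minimal sketch on a made-up Series (the `makes` data is an assumption for illustration):

```python
import pandas as pd

makes = pd.Series(["BMW", "Acura", "BMW", "Audi"])

# Distinct values, in the order they first appear.
distinct = list(makes.unique())

# Number of distinct values.
n_distinct = makes.nunique()

print(distinct)
print(n_distinct)
```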

## Summary 

In this lesson we learned different ways to create calculated columns. We computed a new column by combining existing data with a constant. We also computed a calculated column from two existing columns, and used a conditional function to create a calculated column. Finally, we applied a function across several columns at once, producing one result per row, by passing axis=1.