# Python and Data Analysis 3 - Calculations with Data

**Goal:** The goal of this project is to learn to calculate information from measured data.

**Description:** Data that is given to us is often considered *measured*: it is a real-world observation that has been recorded and put into a DataFrame. To turn this into information, we need to be able to use the data in calculations. This workshop covers how to create *calculated columns*, and how to calculate *summary statistics*.

## 3A: Calculated Columns

A calculated column is a column that is added to a DataFrame based on existing columns. In the following DataFrame, we have price information for Amazon's stock. It contains the columns `date`, `open`, `high`, `low`, `close`, and `volume`. We can create a calculated column in two ways.

In [None]:
import pandas as pd
amzn = pd.read_csv('AMZN.csv')
print(amzn.head())

### Calculations Across Two Columns

Frequently, we want to carry out mathematical operations between two or more columns. The syntax is quite intuitive. For example, we might want to keep track of the daily gains/losses. This can be calculated by subtracting the opening price from the close price: `close - open`. A positive result is a gain and a negative result is a loss.

In [None]:
amzn['daily_change'] = amzn['close'] - amzn['open']
print(amzn.head())

A similar syntax can be used for other operations, including multiplication (`*`), division (`/`), addition (`+`) and exponents (`**`). As another example, perhaps we want an approximation of the value of trades executed on a particular day, found by multiplying an average price by the volume. To estimate the average price, we will use `(open + high + low + close) / 4`. The final formula is `((open + high + low + close) / 4) * volume`.

In [None]:
amzn = amzn.drop(columns=['daily_change']) # Removes the 'daily_change' column we created earlier
amzn['daily_value'] = ((amzn['open'] + amzn['high'] + amzn['low'] + amzn['close']) / 4) * amzn['volume']
print(amzn.head())
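
To see how column arithmetic behaves row by row, here is a minimal sketch on a tiny hand-made DataFrame. The prices and volumes are invented for illustration and are not taken from `AMZN.csv`; the name `average_price` is likewise just a label chosen for this example.

```python
import pandas as pd

# Made-up values for illustration only (not real AMZN data)
prices = pd.DataFrame({
    'open':   [10.0, 12.0, 11.0],
    'high':   [13.0, 12.5, 12.0],
    'low':    [9.0, 11.5, 10.0],
    'close':  [12.0, 12.0, 11.0],
    'volume': [100, 200, 300],
})

# Operations are element-wise: row 0 is combined with row 0, row 1 with row 1, ...
prices['average_price'] = (prices['open'] + prices['high'] + prices['low'] + prices['close']) / 4
prices['daily_value'] = prices['average_price'] * prices['volume']
print(prices[['average_price', 'daily_value']])
```

Storing the intermediate `average_price` column is optional, but it can make a long formula easier to check.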

**Challenges**:
 - Calculate the difference between the `high` and `low` columns
 - Return the higher value between the `open` and `close` columns (hint: look into the `max` function)
 - Return the `close` price as a percentage of the original `close` price (useful when comparing the growth of different stocks)

In [None]:
amzn['difference_high_and_low'] = amzn['high'] - amzn['low']
# print(amzn.head())
# Note that we have to do the max across the two columns via axis=1 instead of
# across all rows
higher_value = amzn[['open', 'close']].max(axis=1)
print(higher_value)

# Express every close price as a percentage of the first (original) close price
close_pct = amzn['close'] / amzn['close'].iloc[0] * 100
print(close_pct)

### Operations on a Column

There is another way to create a calculated column, and it can easily be used for other operations too. The `apply` function takes a column or an entire DataFrame and applies a function to it: for a column, the function is called on each item; for a DataFrame, it is called on each column (or on each row with `axis=1`). This is convenient when the operation we want to perform for each item is quite complex. Below, we have a function `change_date` that takes a date in the form `YYYY-MM-DD` as a string, and outputs it in the form `Month Day, Year`.

In [None]:
amzn = pd.read_csv('AMZN.csv')
def change_date(original_date):
    year = original_date[0:4]    # Get the first four characters in the string
    month = original_date[5:7]  # Get the month from the string
    months = ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"]
    month_name = months[int(month) - 1] # e.g. int('01') - 1 == 0 giving January
    day = original_date[-2:]    # Get the last two characters in the string

    return month_name + " " + day + ", " + year

print(change_date(amzn['date'][0]))

Because this operation is quite complex, we created a new function for it, and now just need to `apply` `change_date` to our `date` column.

In [None]:
amzn['date'] = amzn['date'].apply(change_date)
print(amzn.head())

Ultimately, `apply` allows us to carry out more complex operations on a column, and *abstract* their functionality into helper functions.
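
When a helper needs more than one column, `apply` can also be called on the DataFrame itself with `axis=1`, so the function receives one row at a time. The following is a small sketch of that idea; the `direction` helper and the prices are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Made-up values for illustration only
prices = pd.DataFrame({'open': [10.0, 12.0, 11.0], 'close': [12.0, 11.5, 11.0]})

def direction(row):
    # `row` is a Series holding one row; its columns are accessed by name
    if row['close'] > row['open']:
        return 'up'
    elif row['close'] < row['open']:
        return 'down'
    return 'flat'

# axis=1 means "call the function once per row"
prices['direction'] = prices.apply(direction, axis=1)
print(prices)
```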

**Challenge:** Create a new `volume_estimate` column by applying a function to the `volume` column which replaces values greater than 10000000 with `'high'` and everything else with `'low'`.

In [None]:
amzn = pd.read_csv('AMZN.csv')

def estimate_volume(volume):
    # Label a single volume value as 'high' or 'low'
    return 'high' if volume > 10000000 else 'low'

amzn['volume_estimate'] = amzn['volume'].apply(estimate_volume)
print(amzn)
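
For a simple two-way threshold like this one, a vectorized alternative to `apply` is `numpy.where`, which picks between two values for every row at once. This is a sketch of that swapped-in technique on made-up volumes, not the approach the challenge asks for.

```python
import numpy as np
import pandas as pd

# Made-up volumes for illustration only
volumes = pd.DataFrame({'volume': [2_500_000, 10_000_000, 15_000_000]})

# Where the condition is True use 'high', otherwise 'low'
volumes['volume_estimate'] = np.where(volumes['volume'] > 10_000_000, 'high', 'low')
print(volumes)
```

Note that a volume of exactly 10000000 is labelled `'low'`, because the comparison is strictly greater than.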

## 3B: Summary Statistics

Previously, we carried out operations to fill each row in a new or existing column with a calculated value. Now we turn our attention to *summary statistics*. These aggregate the values across multiple rows within the same column into a single result. There are many different types of summary statistics, but common ones are:
 - `size`: Counts the number of rows in the given column
 - `count`: Counts the number of rows, excluding NaNs, in the given column
 - `sum`: Calculates the sum of the values in the given column
 - `min` and `max`: Calculates the minimum or maximum value in the given column
 - `mean`, `median`, and `mode`: Calculates the mean (average), median (middle value), or mode (most frequent value) of the given column
 - `std`: Calculates the standard deviation in the given column
 - `describe`: Many statistics at once
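
The differences between some of these statistics are easiest to see on a small example. The sketch below uses a made-up Series containing one missing value.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; one entry is missing (NaN)
values = pd.Series([1.0, 2.0, 2.0, np.nan, 10.0])

print(values.size)             # 5: every row, including the NaN
print(values.count())          # 4: the NaN is excluded
print(values.sum())            # 15.0: NaN is skipped by default
print(values.mean())           # 3.75: pulled upwards by the 10.0
print(values.median())         # 2.0: the middle of the non-missing values
print(values.mode().tolist())  # [2.0]: mode returns a Series, since ties are possible
```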
 
Let's look at a few examples:

Get the mean close price in the `amzn` DataFrame.

In [None]:
amzn = pd.read_csv('AMZN.csv')
mean_close = amzn['close'].mean()
print(f"Mean Close Price: {mean_close:.2f}")

Find the highest and lowest close price.

In [None]:
highest_close = amzn['close'].max()
print("Highest Close Price: " + str(highest_close))

lowest_close = amzn['close'].min()
print("Lowest Close Price: " + str(lowest_close))

Find the median value for both the `high` and `low` columns. We can calculate summary statistics on more than one column by passing a list of columns.

In [None]:
median_value = amzn[['high', 'low']].median()
print(median_value)
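
Several statistics can also be requested for several columns in one call with `agg`, which accepts a list of statistic names. A minimal sketch on made-up prices:

```python
import pandas as pd

# Made-up values for illustration only
prices = pd.DataFrame({'high': [13.0, 12.5, 12.0], 'low': [9.0, 11.5, 10.0]})

# One row per statistic, one column per selected column
summary = prices[['high', 'low']].agg(['min', 'median', 'max'])
print(summary)
```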

Describe the close prices.

In [None]:
described = amzn['close'].describe()
print(described)

**Challenge:** Find the greatest difference between the `high` and `low` prices on any single day. Think about the steps you need to perform, and whether the given DataFrame contains all the information we need.

In [None]:
# First calculate the high-low difference for each day, then take the largest one
greatest_difference = (amzn['high'] - amzn['low']).max()
print(greatest_difference)
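
Knowing the size of the largest gap is often only half the answer; we may also want to know on which day it happened. One way to sketch this is with `idxmax`, which returns the row label of the maximum. The dates and prices below are made up for illustration.

```python
import pandas as pd

# Made-up values for illustration only
prices = pd.DataFrame({
    'date': ['2018-01-02', '2018-01-03', '2018-01-04'],
    'high': [13.0, 12.5, 12.0],
    'low':  [9.0, 11.5, 10.0],
})

daily_range = prices['high'] - prices['low']  # difference for each day
widest_row = daily_range.idxmax()             # row label of the largest difference
print(prices.loc[widest_row, 'date'], daily_range.max())
```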

### Summary Statistics by Group

If we have data from multiple categories in the same DataFrame, we can split it into groups and then calculate the summary statistics for each group. Let's look at the DataFrame we created before, with stock prices for Microsoft, Apple, Amazon, and Google.

In [None]:
stock_names = ['MSFT', 'AAPL', 'AMZN', 'GOOG']

stock_dfs = []
for stock_name in stock_names:
    stock_df = pd.read_csv(f'{stock_name}.csv')
    stock_df['name'] = stock_name  # Record which stock each row belongs to
    stock_dfs.append(stock_df)
df = pd.concat(stock_dfs, ignore_index=True)  # Stack all four DataFrames into one
print(df)

Combining our knowledge of the `groupby` function with our knowledge of summary statistics, we can do the following:

In [None]:
stocks = df.groupby('name')
for stock in stocks.groups.keys():
    stock_df = stocks.get_group(stock)
    avg_vol = stock_df['volume'].mean()
    print(stock + " Avg Trading Volume: " + str(avg_vol))

Even more simply, we can do `grouped-object['name-of-col'].summary-statistic()`. 

In [None]:
print(stocks['volume'].mean())
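
The same pattern extends to several statistics at once by calling `agg` on the grouped column. The sketch below uses a made-up DataFrame with two hypothetical tickers, `AAA` and `BBB`, rather than the real stock files.

```python
import pandas as pd

# Made-up values for illustration only; 'AAA' and 'BBB' are hypothetical tickers
trades = pd.DataFrame({
    'name':   ['AAA', 'AAA', 'BBB', 'BBB'],
    'volume': [100, 300, 50, 150],
})

# One row per group, one column per requested statistic
volume_stats = trades.groupby('name')['volume'].agg(['mean', 'max'])
print(volume_stats)
```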

The key takeaway is that Pandas allows us to easily calculate columns, operate on existing columns, and create summary statistics for columns and groups.

**Challenge:** Find the greatest average close price across all four stocks and print out the stock along with its average close.

In [None]:
avg_close = stocks['close'].mean()
greatest_stock = avg_close.idxmax()
print(greatest_stock, avg_close[greatest_stock])