## Comparing Histograms

We will start by looking at data from the Big 5 European Soccer Leagues for the 1995/96 to 2019/20 seasons. 

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In [None]:
# Load data
goals = Table.read_table('big5.csv').select(3, 9, 6, 10)
goals

In [None]:
# Relabel the columns
goals = goals.relabeled(1, 'Home').relabeled(3, 'Away')
goals

In [None]:
# Look at the distribution of goals scored by the home team. What is the unit? What is the bin spacing?
goals.hist('Home')

In [None]:
#What happens when we specify the bin widths?
my_bins = np.arange(0, 10, 1)
goals.hist('Home', unit = 'goals', bins = my_bins)
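
When a `unit` is given, the vertical axis reads "percent per goal": each bar's height is the percent of matches in that bin divided by the bin width, so the bar areas add up to 100 percent. A minimal NumPy sketch of that calculation follows. The goal counts are made up for illustration and are not taken from `big5.csv`.

```python
import numpy as np

# Hypothetical goal counts for ten matches (illustration only)
home_goals = np.array([0, 1, 1, 2, 2, 2, 3, 1, 0, 4])
my_bins = np.arange(0, 10, 1)

# Count the matches that fall in each bin, then convert to percent per unit
counts, edges = np.histogram(home_goals, bins=my_bins)
widths = np.diff(edges)
percent_per_goal = 100 * counts / counts.sum() / widths

# Bar area = height * width; the areas add up to 100 percent
total_area = (percent_per_goal * widths).sum()
print(percent_per_goal)
print(total_area)
```

Because every bin here has width 1, the height of each bar equals the percent of matches in that bin. With wider bins the two would differ.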

In [None]:
# Is there a home team advantage?
goals.hist([1,3], unit = 'goals', bins = my_bins)

In [None]:
climate = Table.read_table('NOAA.csv').relabeled(5, '1948').relabeled(6, '2018')
climate

In [None]:
climate.hist(5, 6, unit = 'days above 90 degrees F', bins = np.arange(0, 150, 10))

In [None]:
climate = climate.with_column('Difference', (climate.column('2018') - climate.column('1948')))
climate

In [None]:
climate.hist('Difference')

In [None]:
len(climate.where(7, are.above_or_equal_to(1)).column(7))/len(climate.column(7))
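
Assuming column 7 is the `Difference` column added above, the cell computes the proportion of rows whose difference is at least 1. The same idea can be sketched in plain NumPy, since the mean of a Boolean array is the fraction of `True` values. The numbers below are invented for illustration and are not from `NOAA.csv`.

```python
import numpy as np

# Hypothetical days-above-90 counts for five cities (illustration only)
days_1948 = np.array([10, 42, 7, 30, 55])
days_2018 = np.array([15, 40, 7, 38, 61])

difference = days_2018 - days_1948   # element-wise subtraction

# Fraction of cities with at least one more hot day in 2018 than in 1948
proportion = np.mean(difference >= 1)
print(proportion)
```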

## Defining Functions

Let's look at this simple function. Its name is `data_range`, and the body shows what it returns. What type of input does this function take, and what is the output? In other words, what is `values`?

In [None]:
def data_range(values):
    return max(values) - min(values)

In [None]:
values = make_array(1, 2, 3, 4)
data_range(values)

In [None]:
# Let's do a few more
def triple(x):
    return 3*x
triple(values)

In [None]:
x = 'triple'
triple(x)
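
The `*` operator means different things for different types, which is why `triple` accepts a number, an array, or a string. A short sketch of the three cases, using `np.array` in place of `make_array` so it runs without the `datascience` library:

```python
import numpy as np

def triple(x):
    return 3 * x

number_result = triple(4)                      # arithmetic
array_result = triple(np.array([1, 2, 3, 4]))  # element-wise multiplication
string_result = triple('triple')               # string repetition

print(number_result)
print(array_result)
print(string_result)
```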

Let's write a function that takes an array and computes the percent of the total for each value in the array.

In [None]:
counts = make_array(1, 2, 3, 4)
total = sum(counts)
np.round(counts/total * 100, 2) 

In [None]:
# How to write a function.  Step 1, write the body...

In [None]:
# Step 2: wrap the body in a def statement, name the arguments it needs, and specify what it returns.
def percents(counts, decimal_places = 2):
    '''This function takes an array of values, converts those values 
    to percents out of the total, and returns an array of those percents'''
    total = sum(counts)
    return np.round(counts/total * 100, decimal_places) 


In [None]:
counts = make_array(4, 8, 3, 4)
percents(counts)

In [None]:
# Does this function change our original array? 
counts

In [None]:
# What happens when we pass a number to this function? 
help(percents)
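
As a hedged sketch of the question above: Python's built-in `sum` needs something it can iterate over, so passing a single number to `percents` should raise a `TypeError`. The function is repeated here so the example runs on its own, with `np.array` standing in for `make_array`. It also shows how to override the default `decimal_places`.

```python
import numpy as np

def percents(counts, decimal_places=2):
    '''Convert an array of counts to percents of the total.'''
    total = sum(counts)
    return np.round(counts / total * 100, decimal_places)

counts = np.array([4, 8, 3, 4])
one_place = percents(counts, 1)   # override the default number of decimal places

try:
    percents(5)                   # an int is not iterable, so sum() fails
    error_name = None
except TypeError as err:
    error_name = type(err).__name__

print(one_place)
print(error_name)
```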

## Apply

In [None]:
# define a function called "cut off at a billion"
def cut_off_at_a_billion(x):
    return min(x, 1e9)

In [None]:
# Now use apply to run this function on a column of a table we have already seen:
top = Table.read_table('top_movies_2017.csv')

In [None]:
top = top.with_column('cutoff', top.apply(cut_off_at_a_billion, 3))
top
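
Conceptually, `apply` calls the function once per row on the value in the chosen column and collects the results in an array. This matters here because the built-in `min` compares its two arguments as single values, so it cannot be called on a whole array at once. A rough equivalent without the `datascience` library, using made-up gross figures rather than `top_movies_2017.csv`:

```python
import numpy as np

def cut_off_at_a_billion(x):
    return min(x, 1e9)

# Hypothetical adjusted gross values in dollars (illustration only)
gross = np.array([1.8e9, 7.5e8, 1.2e9, 9.99e8])

# Call the function on each value, one at a time, and collect the results
cutoff = np.array([cut_off_at_a_billion(g) for g in gross])
print(cutoff)
```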

## Prediction Example

In [None]:
height = Table.read_table('galton.csv').select(1, 2, 7).relabeled(2, 'child')
height

In [None]:
# visualize with a scatterplot - setting the child's height to be the horizontal variable
height.scatter(2)

In [None]:
# Compute the average height of both parents
height = height.with_column(
    'parent average', (height.column('father')+height.column('mother'))/2)
height

In [None]:
height.scatter('parent average', 'child')
plots.plot([67.5, 67.5], [50, 85], color='red', lw=2)
plots.plot([68.5, 68.5], [50, 85], color='red', lw=2)


In [None]:
close_to_68 = height.where('parent average', are.between(67.5, 68.5))
close_to_68.column(2).mean()

In [None]:
def predict_child(pa):
    close_to_pa = height.where('parent average', are.between(pa - 0.5, pa + 0.5))
    return close_to_pa.column(2).mean()

In [None]:
predict_child(63)
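
The prediction is a local average: keep the rows whose parent average lies within half an inch of the input, then average the child heights in those rows. The sketch below mirrors that logic with NumPy Boolean indexing. The heights are invented for illustration and are not from `galton.csv`. Like `are.between`, the window includes its lower end and excludes its upper end.

```python
import numpy as np

# Hypothetical heights in inches (illustration only)
parent_average = np.array([66.0, 67.6, 68.0, 68.4, 68.5, 70.0])
child = np.array([65.0, 67.0, 69.0, 68.0, 71.0, 72.0])

def predict_child(pa):
    # Keep children whose parent average is in [pa - 0.5, pa + 0.5)
    close = (parent_average >= pa - 0.5) & (parent_average < pa + 0.5)
    return child[close].mean()

prediction = predict_child(68)
print(prediction)
```

If no rows fall inside the window, the mean of an empty selection is `nan`, so predictions near the edges of the data deserve a second look.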

In [None]:
height.with_column('prediction', height.apply(predict_child, 3)).select(2, 3, 4).scatter('parent average')

## Groups

In [None]:
# load cones.csv
all_cones = Table.read_table('cones.csv')
cones = all_cones.drop('Color').exclude(5)
cones

In [None]:
# group by 'Flavor' and average the remaining columns within each flavor
cones.group('Flavor', np.average)

In [None]:
# add a second argument to see what else group can do.

In [None]:
nba = Table.read_table('nba_salaries.csv').relabeled(4, 'SALARY')
nba

In [None]:
# let's focus on teams and how much they paid in salaries... can I just use group as is?
nba.group()

In [None]:
nba.group('TEAM')

It looks like grouping by team returns the number of players, so what can we do if we want to know how much each team paid their players?

In [None]:
nba.group('POSITION')

In [None]:
nba.select('TEAM', 'SALARY').group('TEAM', sum)

In [None]:
nba.select('TEAM', 'SALARY').group('TEAM', sum).sort(1, descending = True)

In [None]:
nba.select('POSITION', 'SALARY').group('POSITION', np.average)
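
With a second argument, `group` gathers the values for each distinct label and then calls the function on each collection. A plain-Python sketch of that idea follows. The team names and salaries are invented for illustration and are not the contents of `nba_salaries.csv`.

```python
# Hypothetical (team, salary in millions) rows, for illustration only
rows = [
    ('Hawks', 10.0), ('Celtics', 5.5), ('Hawks', 3.5),
    ('Celtics', 7.0), ('Nets', 4.0),
]

# Step 1: collect the salaries that belong to each team
salaries_by_team = {}
for team, salary in rows:
    salaries_by_team.setdefault(team, []).append(salary)

# Step 2: apply the aggregating function (here, sum) to each collection
totals = {team: sum(values) for team, values in salaries_by_team.items()}

# Step 3: sort teams from highest to lowest total, like sort(1, descending = True)
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Swapping `sum` for another function that takes a collection and returns one value, such as an average, changes the summary without changing the grouping step.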