# COMP-240 Homework 5

This homework assignment focuses on data transformation and data grouping to produce aggregate results. The only libraries required are NumPy and pandas.

In [None]:
import numpy as np
import pandas as pd

## Exercise 1

### Design a function that, given a person's height (in m) and weight (in kg), returns the person's BMI score using the following equation:

$$
BMI = \frac{weight}{height^2}
$$
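
As a quick check of the formula, here is the arithmetic for the example values used in the code cell that follows (57 kg and 1.55 m):

$$
BMI = \frac{57}{1.55^2} = \frac{57}{2.4025} \approx 23.73
$$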

In [None]:
def bmi(weight, height):
    #your code goes here
    try:
        return weight / (height ** 2)
    except ZeroDivisionError:
        return 'Height cannot be 0'

bmi(57, 1.55) #example input and output 23.73
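
One caveat about the `try`/`except` guard: `ZeroDivisionError` is only raised for plain Python numbers. The sketch below uses made-up values and a hypothetical helper name, `bmi_sketch`, to show what the same formula does when it receives a pandas Series that contains a zero height.

```python
import numpy as np
import pandas as pd

def bmi_sketch(weight, height):
    # Same formula as the exercise, without the guard (hypothetical helper)
    return weight / (height ** 2)

# Made-up values for illustration only
weights = pd.Series([57.0, 60.0])
heights = pd.Series([1.55, 0.0])

scores = bmi_sketch(weights, heights)

# No exception is raised: the zero height produces inf instead
has_inf = bool(np.isinf(scores).any())
print(scores.tolist(), has_inf)
```

If zero heights can occur in a column, checking for `inf` after the computation is one way to catch them.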

### Design another function that, given the BMI, returns the BMI classification for that person according to the following interpretation:

0 < bmi < 18.5    Underweight

18.5 <= bmi < 25  Normal

25 <= bmi < 30    Overweight

30 <= bmi         Obese 

In [None]:
def bmi_classification(bmi):
    #your code goes here
    if bmi <= 0: # the interpretation above starts at 0 < bmi, so 0 is invalid too
        return 'BMI must be greater than 0.'
    elif bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'
    
bmi_classification(23.73) # example input

### Using the `map` function and a lambda, classify the given BMI scores with the function you designed previously

In [None]:
bmi_scores = [18.1, -3., 23.73, 34.32, 28.7, 22.8] # example input
#your code goes here
result = map(lambda x : bmi_classification(x), bmi_scores)
print(*result)
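
Note that `map` returns a lazy iterator, so `result` is empty once `print(*result)` has consumed it. A minimal sketch of that behavior, using a hypothetical two-label stand-in (`classify_sketch`) for the real classifier:

```python
def classify_sketch(score):
    # Hypothetical stand-in for bmi_classification, reduced to two labels
    return 'Normal' if 18.5 <= score < 25 else 'Other'

scores = [18.1, 23.73, 34.32]
labels = map(lambda s: classify_sketch(s), scores)

first_pass = list(labels)   # consumes the iterator
second_pass = list(labels)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

Wrapping the call in `list(...)` right away keeps the classifications available for later cells.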

### From the new list, remove the values that do not have a valid interpretation

In [None]:
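#your code goes here
valid_labels = {'Underweight', 'Normal', 'Overweight', 'Obese'}
# Redo the mapping: the map iterator from the previous cell is exhausted after printing
labels = map(lambda x: bmi_classification(x), bmi_scores)
bmi_labels_valid = list(filter(lambda label: label in valid_labels, labels))
print(*bmi_labels_valid)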

### Download the following dataset capturing the height and weight of 500 people into a pandas DataFrame

The dataset, in `csv` format, is available at the following URL:

`https://storage.googleapis.com/comp240-stores/bmi.csv`

In [None]:
#your code goes here
url = 'https://storage.googleapis.com/comp240-stores/bmi.csv'
df = pd.read_csv(url)
df.head()

If you look closely, you will see that the height is given in cm. 

### Add a new attribute to the DataFrame named `HeightM` that captures for each person their height in meters

In [None]:
#your code goes here
df['HeightM'] = df['Height']/100
df.head()

### Add two new attributes to the DataFrame that capture for each person their bmi and the respective classification

In [None]:
#your code goes here
df['bmi'] = bmi(df['Weight'], df['HeightM'])
df['bmi_type'] = df['bmi'].apply(bmi_classification)
df.head()

Example of how your DataFrame should look in the end:

|   | Gender | Height | Weight | HeightM | bmi       | bmi_type   |
|---|--------|--------|--------|---------|-----------|------------|
| 0 | Male   | 174    | 96     | 1.74    | 31.708284 | Obese      |
| 1 | Male   | 189    | 87     | 1.89    | 24.355421 | Normal     |
| 2 | Female | 185    | 110    | 1.85    | 32.140248 | Obese      |
| 3 | Female | 195    | 104    | 1.95    | 27.350427 | Overweight |
| 4 | Male   | 149    | 61     | 1.49    | 27.476240 | Overweight |

## Exercise 2

In this exercise you will download and process a dataset capturing player salaries from a recent NBA season. 

As your first task, download the dataset in `csv` format from the following URL:

`https://storage.googleapis.com/comp240-stores/nba_salaries.csv`

Afterwards, answer the 5 questions that follow.

In [None]:
#your code goes here
df = pd.read_csv('https://storage.googleapis.com/comp240-stores/nba_salaries.csv')
df.head()

### How much money did each team pay for its players’ salaries?

In [None]:
# your code goes here
salaries_per_team = df.groupby('TEAM')['SALARY'].sum()
salaries_per_team

### Which teams are the top-5 and bottom-5 spenders?

In [None]:
# your code goes here
top_n = 5

print(f'\tBOTTOM {top_n} SPENDERS:\n{salaries_per_team.sort_values(ascending=True)[:top_n]}')
print(f'\tTOP {top_n} SPENDERS:\n{salaries_per_team.sort_values(ascending=False)[:top_n]}')
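
The slicing approach sorts the whole Series twice. As an alternative sketch, `Series.nlargest` and `Series.nsmallest` return the top and bottom entries directly. The team codes and amounts below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up payroll totals; codes and amounts are illustrative only
payroll = pd.Series({'AAA': 120.0, 'BBB': 95.5, 'CCC': 140.2, 'DDD': 88.0})

top_2 = payroll.nlargest(2)      # highest totals first
bottom_2 = payroll.nsmallest(2)  # lowest totals first

print(top_2)
print(bottom_2)
```

With the real data, the same calls on `salaries_per_team` with `top_n` should give the two rankings.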

### How many NBA players were there in each of the five positions?

In [None]:
# your code goes here
players_by_position = df.groupby('POSITION').size()
players_by_position

### What was the average salary of the players at each of the five positions?

In [None]:
# your code goes here
avg_position_salary = df[['POSITION', 'SALARY']].groupby('POSITION').mean()
avg_position_salary

### Depict the mean, median, std, min and max SALARY for each of the five positions

In [None]:
# your code goes here
df[['POSITION', 'SALARY']].groupby('POSITION').agg(['mean', 'median', 'std', 'min', 'max'])
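
Passing a list to `agg` yields two-level column labels. If flat, self-describing column names are preferred, named aggregation is one option. The sketch below runs on a made-up four-row frame, and the output column names are hypothetical choices.

```python
import pandas as pd

# Made-up salaries for illustration only
toy = pd.DataFrame({
    'POSITION': ['C', 'C', 'PG', 'PG'],
    'SALARY': [10.0, 20.0, 5.0, 7.0],
})

# Each keyword becomes an output column; each value names the aggregation
summary = toy.groupby('POSITION')['SALARY'].agg(
    mean_salary='mean',
    median_salary='median',
    std_salary='std',
    min_salary='min',
    max_salary='max',
)
print(summary)
```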

## Exercise 3

In this exercise you will download and process a dataset capturing detailed information about the passengers who boarded the Titanic.

As your first task, download the dataset in `csv` format from the following URL:

`https://storage.googleapis.com/comp240-stores/titanic.csv`

Afterwards, answer the questions that follow.

In [None]:
#your code goes here
df = pd.read_csv('https://storage.googleapis.com/comp240-stores/titanic.csv')
df.head()

### How many female and how many male passengers survived?

In [None]:
#your code goes here
df[df['Survived'] == 1][['Survived', 'Sex']].groupby('Sex').count()

### Answer the same question, but this time return, for each sex, its percentage of the total number of passengers who survived.

In [None]:
#your code goes here
total_passengers_survived = df[df['Survived'] == 1]['Survived'].count()
df2 = df[df['Survived'] == 1][['Survived', 'Sex']].groupby('Sex').count()
df2['Percentage'] = df2['Survived'] / total_passengers_survived * 100
df2
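
A more compact route to the same shares is `value_counts(normalize=True)` applied to the survivors only. This is a sketch on a made-up five-row frame, assuming the same `Sex` and `Survived` column names as the dataset.

```python
import pandas as pd

# Made-up passengers for illustration only
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'female'],
    'Survived': [1, 1, 1, 0, 0],
})

survivors = toy[toy['Survived'] == 1]

# normalize=True returns fractions of the survivor total instead of counts
share = (survivors['Sex'].value_counts(normalize=True) * 100).round(2)
print(share)
```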

### What percentage of the female passengers survived, and what is the corresponding percentage for the male passengers?

In [None]:
#your code goes here
df3 = df[df['Survived'] == 1][['Survived', 'Sex']].groupby('Sex').count()
df3['Total Count'] = df.groupby('Sex')['Survived'].count() # select the column so a Series, not a DataFrame, is assigned
df3['Percentage'] = df3['Survived'] / df3['Total Count'] * 100
df3
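
Because the rate is survivors divided by group size, it equals the mean of `Survived` within each group, provided the column is encoded as 0/1 (an assumption worth verifying on the real data). A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up passengers; Survived is assumed to be encoded as 0/1
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'female'],
    'Survived': [1, 1, 1, 0, 0],
})

# Mean of a 0/1 column is the fraction of ones in each group
rate = (toy.groupby('Sex')['Survived'].mean() * 100).round(2)
print(rate)
```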

### Answer the above question again, this time breaking the survival rate down by class (`Pclass` attribute) as well.

In [None]:
#your code goes here
df_temp = df[['Survived', 'Sex', 'Pclass']]
df3 = pd.DataFrame()
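# Select the 'Survived' column explicitly so each assignment receives a Series
df3['Total Count'] = df_temp.groupby(['Sex', 'Pclass'])['Survived'].count()
df3['Survived'] = df_temp[df_temp['Survived'] == 1].groupby(['Sex', 'Pclass'])['Survived'].count()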
df3['%Survived'] = (df3['Survived'] / df3['Total Count'] * 100).round(2)
# df3['%Survived3'] = (df_temp.groupby(['Sex', 'Pclass']).mean() * 100).round(2)
df3

### Achieve the same result using a pivot table

In [None]:
# your code goes here
table = pd.pivot_table(df, index=['Sex', 'Pclass'], values='Survived', aggfunc=['count', lambda x: (x == 1).sum(), 'mean'], margins=True, margins_name='Total')
table.columns = ['Total Count', 'Survived','%Survived']
table['%Survived'] = (table['%Survived'] * 100).round(2)
table
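
If a wide layout is easier to read, `pivot_table` can also spread the classes across columns through its `columns` argument. The sketch below uses made-up rows and again assumes `Survived` is encoded as 0/1.

```python
import pandas as pd

# Made-up passengers for illustration only
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# One row per sex, one column per class, each cell a survival rate in percent
wide = pd.pivot_table(toy, index='Sex', columns='Pclass',
                      values='Survived', aggfunc='mean') * 100
print(wide)
```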

### Create a new attribute `age_type` that labels each passenger as one of: (i) `infant` (age 0-3); (ii) `child` (age 4-12); (iii) `teenager` (age 13-17); (iv) `youngster` (age 18-33); (v) `adult` (age 34-59); or (vi) `senior` (age 60 and above)

In [None]:
#your code goes here
def age_type(age):
    # result = np.nan would also work, but those rows would then be excluded in the next exercise
    result = 'Missing Age' # a string label for NaN ages keeps their total count and survivor count visible
    if age < 0:
        result = 'Age cannot be lower than 0'
    elif age < 4: # exclusive upper bounds, so fractional ages such as 3.5 or 59.5 are still classified
        result = 'infant'
    elif age < 13:
        result = 'child'
    elif age < 18:
        result = 'teenager'
    elif age < 34:
        result = 'youngster'
    elif age < 60:
        result = 'adult'
    elif age >= 60: # NaN fails every comparison and keeps the default label
        result = 'senior'
    return result

df['Age_type'] = df['Age'].apply(age_type)
df
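
As a vectorized alternative to `apply`, the same brackets can be expressed with `pd.cut`. This is only a sketch on made-up ages: the bin edges mirror the ranges in the task statement (with left-closed intervals so fractional ages land in a bracket), and the `Missing Age` label is added by hand because `pd.cut` leaves missing values as NaN.

```python
import numpy as np
import pandas as pd

# Made-up ages, including fractional and missing values
ages = pd.Series([0.42, 3.0, 12.0, 17.0, 28.5, 59.5, 60.0, np.nan])

bins = [0, 4, 13, 18, 34, 60, np.inf]
labels = ['infant', 'child', 'teenager', 'youngster', 'adult', 'senior']

# right=False makes each interval [low, high), e.g. [0, 4) for infants
age_labels = pd.cut(ages, bins=bins, labels=labels, right=False)

# Register the extra category before filling the missing entries with it
age_labels = age_labels.cat.add_categories('Missing Age').fillna('Missing Age')
print(age_labels.tolist())
```

Negative ages fall outside every bin, so in this sketch they would also end up labelled `Missing Age` rather than flagged separately.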

### Redesign the pivot table previously requested to now present the survivor rates for each sex and age type

In [None]:
# your code goes here
table = pd.pivot_table(df, index=['Sex', 'Age_type'], values='Survived', aggfunc=['count', lambda x: (x == 1).sum(), 'mean'], margins=True, margins_name='Total')
table.columns = ['Total Count', 'Survived','%Survived']
table['%Survived'] = (table['%Survived'] * 100).round(2)
table