# Challenge 2

In this challenge we will walk you through how to solve a problem from the previous [Subsetting and Descriptive Stats lab](../../lab-subsetting-and-descriptive-stats/your-code/main.ipynb). You'll see the thought process a professional would follow to tackle the problem. Try to understand that process and apply it in the next challenge.

## Import all libraries that are necessary

In [2]:
import numpy as np
import pandas as pd

## Import and overview data

First import `Employee.csv` from the "subsetting" lab folder and print the head to get an overview of the data:

In [3]:
employee = pd.read_csv("../../lab-subsetting-and-descriptive-stats/your-code/Employee.csv")

employee.head()

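|   | Name | Department | Education | Gender | Title | Years | Salary |
|---|------|------------|-----------|--------|-------|-------|--------|
| 0 | Jose | IT | Bachelor | M | analyst | 1 | 35 |
| 1 | Maria | IT | Master | F | analyst | 2 | 30 |
| 2 | David | HR | Master | M | analyst | 2 | 30 |
| 3 | Sonia | HR | Bachelor | F | analyst | 4 | 35 |
| 4 | Samuel | Sales | Master | M | associate | 3 | 55 |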


Printing the head is not a useless routine. You should really look at the data set and understand what it contains. No data analyst can successfully analyze data without an in-depth understanding of what each column is about. As we progress in this course, the data sets become increasingly complex, which requires you to inspect the data at the beginning and then as needed throughout the problem-solving process.

One question in the previous lab is:

**Find the minimum, mean, and maximum of all numeric columns for each Department.**

We will walk you through how to solve this question using the workflow discussed in the [Data Analysis Iteration video](https://www.youtube.com/watch?v=xOomNicqbkk).

## Main Problem - Setting Expectations

We want to break down the problem into several sub problems:

**Sub Problem 1 - How to extract numeric columns from a data set?**

**Sub Problem 2 - How to calculate minimum, mean, and maximum?**

**Sub Problem 3 - How to perform calculations for each Department?**

If we figure out each of the sub problems above, we have found the solution for our main problem.

## Main Problem - Collecting Information

This step is the problem-solving process of the main problem in which we will solve each of the three sub problems. The successful solution of all sub problems will lead us to the solution of the main problem.

### Sub Problem 1

#### Setting Expectations

**Define problem: How to extract numeric columns from a data set?**

#### Collecting Information

This was already covered in the lesson by using `dtypes`. So let's print out the data type of every column to see which ones are numeric:

In [None]:
# enter your code here


You should have seen:
    
```
Name          object
Department    object
Education     object
Gender        object
Title         object
Years          int64
Salary         int64
dtype: object
```
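As a minimal sketch of the `dtypes` check, here is the same idea on a small made-up frame. The `toy` data and the variable name `column_types` are assumptions for illustration and are not part of the lab's data set.

```python
import pandas as pd

# Hypothetical toy data, not the lab's Employee.csv
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Department": ["IT", "HR", "IT"],
    "Years": [1, 4, 2],
    "Salary": [35, 30, 55],
})

# dtypes is an attribute (no parentheses); it holds one dtype per column
column_types = toy.dtypes
print(column_types)
```

Columns holding whole numbers show an integer dtype, while text columns show a non-numeric dtype, which is how you can spot the numeric ones.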

#### Reacting to Data

You found `Years` and `Salary` are the numeric columns we need to extract. So we can potentially use `employee[["Years", "Salary"]]` to extract these columns:

In [None]:
employee[["Years", "Salary"]]

But instead of hardcoding the column names in the solution, a better approach is to define a Python function that dynamically returns all numeric columns. You will be able to reuse this function in your future work. Also, if the data set is huge and contains hundreds of numeric columns, selecting them manually is impractical.

#### Revising Expectations

**Define new problem: How to *dynamically* extract numeric columns from a data set?**

#### Collecting Information

This was not covered in the lesson. So we need to [google the answer](https://www.google.com/search?q=pandas+dataframe+get+all+numeric+columns).
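One approach that such a search commonly turns up is `DataFrame.select_dtypes`. The sketch below shows it on made-up data; the helper name `numeric_columns_sketch` and the `toy` frame are assumptions for illustration, and other answers you find may work equally well.

```python
import pandas as pd

def numeric_columns_sketch(df):
    """Return only the numeric columns of df (hypothetical helper name)."""
    # include="number" selects every integer and float column
    return df.select_dtypes(include="number")

# Hypothetical toy data, not the lab's Employee.csv
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Department": ["IT", "HR", "IT"],
    "Years": [1, 4, 2],
    "Salary": [35, 30, 55],
})

numeric_part = numeric_columns_sketch(toy)
print(list(numeric_part.columns))
```

Because the selection is based on dtype rather than on column names, the same helper keeps working when columns are added or renamed.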

After finding the answer, write the function below.

In [37]:
def get_numeric_cols(df):
    # write your code below. Return all numeric columns of the dataframe
    return

#### Reacting to Data

Now test your function:

In [None]:
get_numeric_cols(employee)

You should have seen:

```
   Years  Salary
0      1      35
1      2      30
2      2      30
3      4      35
4      3      55
5      2      55
6      8      70
7      7      60
8      8      70
```



Yes, this is exactly what we want!

Now we move to the next sub problem.

### Sub Problem 2

#### Setting Expectations

**Define problem: How to calculate minimum, mean, and maximum?**

#### Collecting Information

That's easy. Review the *Descriptive Statistics With Pandas* lesson and we find that Pandas dataframes already provide functions to calculate minimum, mean, and maximum. We'll build on the solution we found in sub problem 1 and apply these functions to the numeric columns:

In [None]:
numeric_cols = get_numeric_cols(employee)

print('PRINTING MIN:')

# enter your code here


print('\n---\n')
print('PRINTING MEAN:')

# enter your code here


print('\n---\n')
print('PRINTING MAX:')

# enter your code here


If the output is as expected, we move to the next sub problem.
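For reference, the descriptive-statistics methods mentioned above can be sketched on a small made-up frame. The `toy_numeric` data and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical numeric-only toy data, not the lab's Employee.csv
toy_numeric = pd.DataFrame({
    "Years": [1, 4, 2],
    "Salary": [35, 30, 55],
})

# Each method returns a Series with one value per column
col_min = toy_numeric.min()
col_mean = toy_numeric.mean()
col_max = toy_numeric.max()

print(col_min)
print(col_mean)
print(col_max)
```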

### Sub Problem 3

#### Setting Expectations

**Define problem: How to perform calculations for each Department?**

#### Collecting Information

What we need to do is first group the data by Department, then perform the calculation on each group. This is covered in the *Data Aggregations and Summarization* lesson. Assign the grouped data to a new variable called `employee_by_department`.
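The grouping step can be sketched on made-up data as follows. The `toy` frame and the names `by_department` and `group_sizes` are assumptions for illustration; the lab itself asks for the variable `employee_by_department`.

```python
import pandas as pd

# Hypothetical toy data, not the lab's Employee.csv
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Department": ["IT", "HR", "IT"],
    "Years": [1, 4, 2],
    "Salary": [35, 30, 55],
})

# groupby does not compute anything yet; it only records how rows are split
by_department = toy.groupby("Department")

# size() counts the rows in each group
group_sizes = by_department.size()
print(group_sizes)

# Iterating yields (group key, sub-dataframe) pairs
for department, rows in by_department:
    print(department)
    print(rows)
```

Printing the grouped object itself only shows its type and memory address, so inspecting sizes or iterating over the groups is usually more informative.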

In [40]:
# enter your code here


#### Reacting to Data

Print out the grouped data and try to calculate mean for the grouped data. Check if you obtain the desired results.

In [None]:
print(employee_by_department)

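print(employee_by_department.mean(numeric_only=True))

From the outputs above, you probably noticed Pandas skips non-numeric columns when you call the `mean()` method with `numeric_only=True`. Older Pandas versions did this by default; since Pandas 2.0, calling `mean()` on a group that contains text columns without this argument typically raises a `TypeError`. This means we may not need the solution for sub problem 1. That's ok, let's move on.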
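A minimal sketch of a grouped mean restricted to numeric columns, on made-up data. The `toy` frame and the name `dept_mean` are assumptions for illustration; passing `numeric_only=True` states the intent explicitly instead of relying on the default.

```python
import pandas as pd

# Hypothetical toy data, not the lab's Employee.csv
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Department": ["IT", "HR", "IT"],
    "Years": [1, 4, 2],
    "Salary": [35, 30, 55],
})

# numeric_only=True drops text columns such as Name before averaging
dept_mean = toy.groupby("Department").mean(numeric_only=True)
print(dept_mean)
```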

Assuming everything is fine, we are now ready to combine the solutions of all sub problems in order to solve the main problem.

## Main Problem - Reacting to Data / Revising Expectations

It turns out Pandas can restrict calculations to numeric columns even if the data set contains non-numeric fields. We can choose to revise our solution because it is no longer necessary to obtain the numeric columns (Sub Problem 1) ourselves. In this case we simply combine the solutions for Sub Problems 2 and 3. Write your code below:

In [None]:
print('PRINTING DEPARTMENT MIN:')
# enter your code here to print department MIN


print('\n---\n')
print('PRINTING DEPARTMENT MEAN:')
# enter your code here to print department MEAN


print('\n---\n')
print('PRINTING DEPARTMENT MAX:')
# enter your code here to print department MAX


Alternatively, we can choose to stick to our original solution that combines all 3 sub problems. We may want to do this because it gives us more control over what we do with the data. What if the goal is more complex than performing MIN, MEAN, and MAX? What if we need to apply a custom function which cannot automatically select numeric columns? It is worth figuring out how to do this.

Write your code below, using one line of code each to calculate MIN, then MEAN, then MAX.

*Hint: use `apply` and `lambda`*
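The hint can be sketched on made-up data as follows. This is one possible shape under stated assumptions, not the lab's reference solution: the `toy` frame and all variable names are hypothetical, and `select_dtypes` stands in for whatever numeric-column helper you wrote in Sub Problem 1.

```python
import pandas as pd

# Hypothetical toy data, not the lab's Employee.csv
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Department": ["IT", "HR", "IT"],
    "Years": [1, 4, 2],
    "Salary": [35, 30, 55],
})

# Keep only numeric columns, then group them by the Department values
numeric_part = toy.select_dtypes(include="number")
grouped = numeric_part.groupby(toy["Department"])

# apply runs the lambda once per group; each call receives a sub-dataframe
dept_min = grouped.apply(lambda g: g.min())
dept_mean = grouped.apply(lambda g: g.mean())
dept_max = grouped.apply(lambda g: g.max())

print(dept_min)
print(dept_mean)
print(dept_max)
```

The lambda body can be swapped for any custom function that takes a dataframe, which is the extra control this approach buys over calling `min()`, `mean()`, and `max()` directly.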

In [43]:
print('PRINTING DEPARTMENT MIN:')
# enter your code here to print department MIN


print('\n---\n')
print('PRINTING DEPARTMENT MEAN:')
# enter your code here to print department MEAN


print('\n---\n')
print('PRINTING DEPARTMENT MAX:')
# enter your code here to print department MAX


Test your code. You should see output similar to the following:

```
PRINTING DEPARTMENT MIN:

Years      1
Salary    30
dtype: int64

---

PRINTING DEPARTMENT MEAN:

Years      4.111111
Salary    48.888889
dtype: float64

---

PRINTING DEPARTMENT MAX:

Years      8
Salary    70
dtype: int64
```

If you don't see the correct output, check your code and revise.