# DSC 80: Homework 03

### Due Date: Monday, Jan 28 12:00PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the homework problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `hw03.py` file, which is imported into the current notebook.

Homeworks and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).


**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks present the questions to you and give you a place to present your results for later review.
- The notebooks for *HW assignments* are not graded (only the `.py` file is).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the HW! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `hw03.py` (much like we do in the notebook).
- Always document your code!

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import hw03 as hw

---

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import glob
import os

# The other side of the Gradescope

In this question you will help me clean a dataset, `grades.csv`, that I downloaded from Gradescope. I removed `Name`, `ID`, `Email`, and a few irrelevant columns, but the rest of the data was left unchanged. Each subproblem will ask you to perform different manipulations on this table. When you write functions, do not forget about the `DRY` principle: Don't Repeat Yourself. Use helper functions instead of copy/pasting. Think about efficiency as well, not just correctness. 

**Useful reminder**

You can write a function that takes an unknown number of arguments, similar to Java's `main(String[] args)`, where you can pass any number of arguments on the command line. 

For example, suppose you want to write a function that multiplies the given numbers, but you do not know how many numbers will be given. In this case you can use a `*` before the name of the formal parameter:

```
def mult_them(*nums):
    product = 1
    for n in nums:
        product *= n
    print(product)

>>> mult_them(1,2,3)
6
>>> mult_them(1,2,3,4)
24
```
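
The same `*` also works in the other direction: placed before a list at the call site, it unpacks the list into separate arguments. The sketch below is illustrative only; `product_of` is a made-up name, and it returns its result instead of printing so the value can be reused.

```python
def product_of(*nums):
    """Return the product of any number of numeric arguments."""
    result = 1
    for n in nums:
        result *= n
    return result


values = [1, 2, 3, 4]
# The * unpacks the list, so this is the same as product_of(1, 2, 3, 4).
total = product_of(*values)
# With no arguments the loop never runs and the result stays at 1.
empty = product_of()
```

Inside the function, `nums` is an ordinary tuple, so anything that works on a tuple (looping, `len`, indexing) works on it.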


**Question 1:**

Write a function `major_drop` that drops the following columns:
* Every column about the `Lab`
* Every column about the `Quiz`
* Every column about the `Project`
* Every column that has both `Midterm` and `Lateness` in its name
* Every column that has both `Final` and `Lateness` in its name
* The `Total Lateness` column
* Every exam (midterm and final) column that contains max points. 

and return a new dataframe without these columns.
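
One possible way to drop columns by name, sketched on a toy table: build a boolean mask over the column names and keep only the columns that do not match. The column names and the helper `drop_matching` are invented for illustration and are not part of the starter code; check the real column names in `grades.csv` before adapting this.

```python
import re

import pandas as pd

# Toy table; these column names are assumptions, not the real ones.
toy = pd.DataFrame({
    'HW 01': [10, 9],
    'HW 01 - Lateness (H:M:S)': ['00:00:00', '01:00:00'],
    'Lab 01': [5, 4],
    'Quiz 01 - Max Points': [10, 10],
})


def drop_matching(df, *keywords):
    """Return a new dataframe without columns whose name contains any keyword."""
    # re.escape keeps keywords literal even if they contain regex characters.
    pattern = '|'.join(re.escape(k) for k in keywords)
    keep = ~df.columns.str.contains(pattern)
    return df.loc[:, keep]


trimmed = drop_matching(toy, 'Lab', 'Quiz')
```

Because the helper takes `*keywords`, the same function can be reused for every rule in the list instead of copy/pasting one `drop` call per rule.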

**Question 2:**

* If you inspect the table carefully you will see that there are a lot of `NaN`s. Replace them with the appropriate values. 

* Look at the columns for the `Midterm`. There are three columns that correspond to three different versions. You need to merge these columns into one, create a new column `Midterm Grades` with the new grades and drop the old ones. 

* Repeat the same steps for columns about the `Final exam`.

You should write a function `merged_exams` that takes in a dataframe  and returns the updated one.
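
A minimal sketch of merging version columns, under two assumptions that you should verify against the data: each student has a score in at most one version column, and a missing score should count as 0. The column names here are made up.

```python
import numpy as np
import pandas as pd

# Toy table: two exam versions, each student took at most one (assumption).
toy = pd.DataFrame({
    'Exam (A)': [40.0, np.nan, np.nan],
    'Exam (B)': [np.nan, 35.0, np.nan],
    'Other': [1, 2, 3],
})

version_cols = ['Exam (A)', 'Exam (B)']
# drop returns a new dataframe, so toy itself is left untouched.
merged = toy.drop(columns=version_cols)
# With NaN replaced by 0, the row sum is the one score the student has.
merged['Exam Grades'] = toy[version_cols].fillna(0).sum(axis=1)
```

If a student could have scores in more than one version, the row sum would double count, so that case would need a different rule.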

# Gradescope: how late, how early?

You are provided with ten files named `hwX_time.csv` that record, for each submission, when the homework was submitted and how late it was.

**Question 3:**

Your next modification will involve the lateness of each homework. The late submission rule was:
* if an assignment is submitted within 24 hours after the deadline, the penalty is 20%;
* if an assignment is submitted more than 24 hours after the deadline, the penalty is 50%.


If you look closer you will see that some lateness values are far larger than the allowed time frames. These indicate that there was a regrade request and the code was resubmitted; they do not count as late submissions. 

* Write a function `read_all(dirName)` that reads all files from the given directory and returns a list of dataframes. 

* Then write a function `extract_and_create` that takes in a list of dataframes with times for 10 homeworks and creates a new dataframe with two columns: `Penalty_20` and `Penalty_50`. Each item in the column represents the total number of the corresponding late submissions. 

For example, if a student submitted 15 hours late for two assignments, 37 hours late for five assignments and 120 hours late for one assignment, then `Penalty_20` and `Penalty_50` will contain 2 and 5, respectively. Thus, regrades should not affect the penalties.

*Note: In reality we do take care of the case when an assignment was late AND resubmitted for a regrade request.*

**Useful hints:**
1. You may find it useful to go back to lecture 1 and understand the code that reads in multiple files from a directory. The `os` module is imported for you. 
2. You can replace `NaN` with the `00:00:00` time stamp since it does not affect the answer.
3. There are two ways to deal with times: either split them by `:`, convert each item to a number and go from there; or use the `pd.to_timedelta` function to work with times. 
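
As a sketch of the second approach in hint 3, `pd.to_timedelta` turns `H:M:S` strings into timedeltas that can be compared directly. The lateness values below are made up, and the example deliberately does not separate out regrades: the cutoff for what counts as a regrade is left for you to decide from the data.

```python
import pandas as pd

# Hypothetical lateness strings in H:M:S format, one per assignment.
lateness = pd.Series(['00:00:00', '15:00:00', '37:30:00', '120:00:00'])
late = pd.to_timedelta(lateness)

one_day = pd.Timedelta(hours=24)

# Late, but by no more than 24 hours.
within_day = ((late > pd.Timedelta(0)) & (late <= one_day)).sum()
# Late by more than 24 hours (this still includes any regrade resubmissions).
over_day = (late > one_day).sum()
```

Hour counts above 24 are handled for you: `'120:00:00'` becomes a timedelta of 5 days.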


# LinkedIn Survey

**Question 4:**

Three friends decide to send out a survey to 1000 of their LinkedIn connections, asking them for their favorite animal. Each friend also records some other data from their connections' pages (company, job, first name, and summary/slogan). Collect all the data contained in `linkedin1.csv, linkedin2.csv, linkedin3.csv` into a single dataframe and compute some initial summary statistics:

* Create a function `compute_stats` that takes in 3 file handles like `open('linkedin*.csv')` and returns a list containing the most common first name, job held, slogan, and favorite animal (in that order). If there are ties for the most common value, give the value with the "largest size" (as defined by the python string ordering).
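
The tie-breaking rule can be sketched on toy data as follows. `pd.read_csv` accepts an open file handle, so in-memory `io.StringIO` objects stand in for the real files here; the column name `first_name` is an assumption about the CSV layout.

```python
import io

import pandas as pd

# In-memory stand-ins for open file handles (contents are made up).
handles = [
    io.StringIO('first_name\nAnn\nBob\n'),
    io.StringIO('first_name\nBob\nAnn\nCy\n'),
]
combined = pd.concat([pd.read_csv(h) for h in handles], ignore_index=True)

counts = combined['first_name'].value_counts()
# Keep every value that shares the top count, then break the tie with
# Python's string ordering ('Bob' > 'Ann').
tied = counts[counts == counts.max()].index
most_common = max(tied)
```

Taking `counts.index[0]` alone would not be enough, since the order of tied values in `value_counts` is not the string ordering the question asks for.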


*Note:* Your code will be tested on samples of the three dataframes given above. Don't overly generalize your code.

**Useful function:**
Sometimes it is useful to combine two lists into one, which can be done quickly with the built-in function `zip`. It takes in iterables, aggregates their elements into tuples, and returns an iterator of these tuples:

In [None]:
x = [1, 2, 3]
y = [4, 5, 6]
zipped = zip(x, y)
zipped

Then depending on the data you have aggregated, you can use `list` or `dict` constructors:

In [None]:
lst = list(zipped)
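
Two details are worth seeing in a small example: `dict` turns zipped pairs into a lookup table, and a `zip` object is an iterator, so it can only be consumed once. The variable names below are illustrative.

```python
keys = ['a', 'b', 'c']
values = [1, 2, 3]

# dict() treats each (key, value) tuple as one entry.
lookup = dict(zip(keys, values))

pairs = zip(keys, values)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing left, so this is an empty list
```

If you need the pairs more than once, store the result of `list(zip(...))` rather than the `zip` object itself.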

# Salaries Grouping

### `groupby.transform`: transforming data by groups.

* The `apply` and `aggregate` methods on groupby expect a function that takes a dataframe (corresponding to a group) and returns a number. The output dataframe is indexed by the values of the groupby column(s) and the columns consist of the values of the function passed into the method.
* The `transform` method, on the other hand, expects a function that takes a dataframe (corresponding to a group) and returns a transformed dataframe of the same size. The `transform` method then combines these dataframes into a dataframe of the same shape as the original, full dataframe being grouped. This is useful if you'd like to scale columns, where that scaling depends on the group.

For example, each element in `df` below is scaled by the range of the group to which it belongs:

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two'],
                   'C' : [1, 5, 5, 2, 5, 5],
                   'D' : [2.0, 5., 8., 1., 2., 9.]})
df

In [None]:
grouped = df.groupby('A')
# Select the numeric columns: the string column 'B' has no range to divide by.
grouped[['C', 'D']].transform(lambda x: x/(x.max() - x.min()))
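
To make the difference in output shape concrete, here is a small sketch on made-up data that compares `agg` and `transform` on the same groupby.

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 10.0, 20.0],
})
grouped = toy.groupby('group')['value']

per_group = grouped.agg('mean')       # one entry per group
per_row = grouped.transform('mean')   # one entry per original row

# Because per_row lines up with toy's index, it can be used row by row.
centered = toy['value'] - per_row
```

Here `per_group` has 2 entries (indexed by `'a'` and `'b'`), while `per_row` has 4 entries and keeps the original index.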

### Transforming SD employee salaries


Recall the dataset of salaries for city of San Diego employees from [lecture 1](../../lecture/01/Lecture\ 01\ Introduction.ipynb). In our investigation of whether women make total salaries similar to the general population of city employees, we came to the following conclusions:
* Women's total salary is on average significantly lower than that of city employees as a whole.
* Much of this difference is due to different gender proportions in different job types. For example, firefighters make a lot more than librarians.

The natural follow-up question is whether the same difference is present *when we control for job type*. To approach this, we would like to
1. Define different job types (`Job Title` is a messy field).
2. *Standardize* salaries within each job type.
3. Perform a hypothesis test as in lecture 1, with the standardized data.

In [None]:
salaries = pd.read_csv('san-diego-2017.csv')
salaries.head()

**Question 5:**

* Create a function `job_word_distribution` that takes in a series like `salaries['Job Title']` and returns a series of counts of how many times each word occurred in the column. Assume that words are delimited by whitespace. *Note:* do *not* use loops!
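
One loop-free way to count words, sketched on a toy series: split each string into a list of words, flatten the lists with `explode`, and count. This assumes a pandas version that provides `Series.explode`; the titles are made up.

```python
import pandas as pd

titles = pd.Series(['Police Officer', 'Fire Captain', 'Police Sergeant'])

# str.split() with no argument splits on any run of whitespace.
words = titles.str.split().explode()
word_counts = words.value_counts()
```

`word_counts` is a series indexed by word, with the most frequent word first.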

Look at the distribution of words in `salaries['Job Title']` -- which ones would make reasonable labels for classifying most Job Titles into a few Job Types? Compare to the list given below. When we cover text processing, we will refine our task here.

Assume we care about the following job types:
```
job_types = ['Police', 'Fire', 'Libr', 'Rec', 'Grounds', 'Lifeguard', 'Water', 'Equip', 'Utility', 'Clerical', 'Administrative', 'Sanitation', 'Principal', 'Public', 'Dispatcher']
```

A job title belongs to a job type above if any of the above strings (in `job_types`) is a substring of a given Job Title. What proportion of Job Titles have a corresponding job type? (Verify for yourself!) If there isn't a matching job type, then set the job type of that job to be `Other`.
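
A possible sketch of the substring rule, using a shortened list of job types and made-up titles. Note one assumption: when a title contains more than one of the strings, `str.extract` returns whichever appears first in the title, which is only one of several reasonable ways to resolve such cases.

```python
import pandas as pd

titles = pd.Series(['Police Officer', 'Librarian II', 'Fire Captain', 'City Attorney'])
toy_types = ['Police', 'Fire', 'Libr']   # shortened list for illustration

# One capture group that matches any of the job-type strings.
pattern = '(' + '|'.join(toy_types) + ')'
job_type = titles.str.extract(pattern, expand=False).fillna('Other')

# Proportion of titles that matched some job type.
matched = (job_type != 'Other').mean()
```

The match is case sensitive, so it is worth checking how the titles in the real data are capitalized.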

Now we are ready to analyze salaries by job-type and standardize them. 
* First, create a function `describe_salaries_by_job_type` that takes in a dataframe like `salaries` and outputs a dataframe of descriptive statistics of `Total Pay` by `Job Type` (using the method `.describe`).
* Then create a function `std_salaries_by_job_type` that takes in a dataframe like `salaries` and outputs a dataframe with 
    - the same rows as the input,
    - four columns given by `['Job Type', 'Base Pay', 'Overtime Pay', 'Total Pay']`,
    - where each of the `Pay` columns is *standardized by Job Type* -- that is, each row is put into the standard units for the job type it belongs to. For a review of standard units, see the [DSC 10 Textbook](https://www.inferentialthinking.com/chapters/15/1/Correlation).
    - *Hint*: use the `groupby` method `transform`.
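
As an illustration of the hint, the sketch below converts one column to standard units within each group. It uses the population standard deviation (`ddof=0`), which is an assumption chosen to match `np.std`; confirm which convention the doctests expect. The data and the helper name `standard_units` are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Job Type': ['A', 'A', 'A', 'B', 'B'],
    'Total Pay': [10.0, 20.0, 30.0, 5.0, 15.0],
})


def standard_units(col):
    """Subtract the mean and divide by the population SD (assumption: ddof=0)."""
    return (col - col.mean()) / col.std(ddof=0)


# transform applies the function per group and keeps the original row order.
toy_std = toy.groupby('Job Type')['Total Pay'].transform(standard_units)
```

After this step, each group has mean 0 and standard deviation 1, so values are comparable across job types.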

In [None]:
# A look at the normalized salaries:

# pd.plotting.scatter_matrix(hw.std_salaries_by_job_type(salaries));

**Question (OPTIONAL)**

Perform the hypothesis test from lecture 1 on these standardized salaries to answer the question of "do women earn fair pay, when controlling for job type?"

### Salary Percentiles

**Question 6**

Since the total pay of city employees has a *long tail* (i.e. a small number of people make much more than the rest), the results from calculating with means tend to be skewed. That is, *most* people make very little compared to a few well-paid employees. In this case, if you care about what the "typical" employee earns, it makes sense to bin salaries into percentiles and work with those.

* Create a function `bucket_total_pay` that takes in a series like `salaries['Total Pay']` and outputs an array containing the decile of `Total Pay` each employee lies in (deciles are labeled 1-10).
* Create a function `mean_salary_per_decile` which takes in a dataframe like `salaries` and outputs a series, indexed by decile, of the mean total pay of each decile.
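
One way to compute deciles is `pd.qcut`, which cuts a series into equal-sized quantile bins. The sketch below uses made-up pay values; with `labels=False` the bins are numbered from 0, so 1 is added to get labels 1-10.

```python
import pandas as pd

# 20 made-up pay values: 1000, 2000, ..., 20000.
pay = pd.Series(range(1, 21)) * 1000.0

# labels=False returns bin numbers 0-9; shift them to 1-10.
deciles = pd.qcut(pay, 10, labels=False) + 1

# Grouping by the decile labels gives the mean pay within each decile.
mean_by_decile = pay.groupby(deciles).mean()
```

With heavily repeated values (for example many zeros), the quantile edges can coincide and `pd.qcut` raises an error unless told how to handle duplicates, so check how the real data behaves.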

Another interesting (optional) exercise is to redo the analysis of salaries from lecture 1, as well as the analysis above, using salary deciles instead of mean salaries.

This is a very typical workflow for data scientists -- constantly refine your features and statistics, and re-run your analyses with those. For this reason, developing your analyses like software is highly productive.

# Robo-calls and Marketing

Given in `phones.csv` is a synthetically generated list of people's names and their phone numbers.

In [None]:
phones = pd.read_csv('phones.csv')
phones.head()

### Preparing the dataset

**Question 7**

Suppose you have an upstart robo-dialing service advertising [your IRS scam](https://nypost.com/2018/12/14/these-were-the-most-common-types-of-robocalls-in-2018/); the dataframe `phones` contains the information of 1000 people you need your software to call. Your software needs a dataframe as input satisfying the following conditions:

1. The columns should be `id`, `first_name`, `last_name`, `phone`, in that order.
2. The `phone` should be a 10-digit number stored as a *string*.
3. Additionally, the `phone` column should contain:
    - the `cell_phone` number if it exists,
    - otherwise the `home_phone` if it exists, 
    - otherwise `work_phone` if it exists;
    - otherwise `NaN`.


Create a function `robo_table` that takes in a dataframe like `phones` and outputs the table described above. Your function should not change the `phones` table.

*Hint #1:* Read [lecture 3 on `NaN` and Data Types](https://github.com/ucsd-ets/dsc80-wi19/blob/master/lecture/03/Lecture%2003%20Messy%20Data.ipynb).

*Hint #2:* `fillna` will be useful in creating the phone column
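
The fallback order can be sketched by chaining `fillna`, each call filling only the gaps left by the previous column. The toy numbers are made up, and the example assumes the phone columns were read as floats (which is what pandas does with numeric columns that contain `NaN`); check the actual dtypes in `phones`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'cell_phone': [np.nan, 8585550101.0, np.nan],
    'home_phone': [6195550102.0, 6195550103.0, np.nan],
    'work_phone': [np.nan, np.nan, np.nan],
})

# Prefer cell, then home, then work; rows with none stay NaN.
phone = toy['cell_phone'].fillna(toy['home_phone']).fillna(toy['work_phone'])

# Convert to strings without the trailing '.0', leaving NaN as NaN.
phone_str = phone.map(lambda p: str(int(p)), na_action='ignore')
```

`fillna` returns a new series each time, so the original table is not modified.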

*Note:* robo-dialing is both illegal and a public nuisance. However, the situation described in this problem is core to most marketing by legal companies large and small. A company's practices can be a nuisance, while still being legal, and there are companies everywhere on this spectrum. Data scientists have a hand in all of these.

### Targeting an age group

**Question 8**

Next, you would like to target your calls toward the 40-49 age group. Since you don't have age columns, you need to make a best guess at people's ages. To do this, use the *names* dataset from the Social Security Administration in the `names` directory. 

1. Create a function `read_names` which takes in a directory path (containing files `yob****.txt`), and outputs a dataframe with four columns (`year,first_name,sex,number`). The column `year` denotes the year of birth of the recorded name/sex/number row, given in the file name.
2. Assign the 'best guess' for the age of a given first name as the year in the *names* dataset which contains the most occurrences of that name. Create a function `age_guess` which takes in a dataframe as in part 1, and outputs a `pd.Series`, indexed by name, of the most likely age of each name.
3. Lastly, select the rows of `phones` that are most likely between the ages 40-49 (inclusive). Create a function `get_age_group` which takes in a dataframe like `phones` and a series like the output of `age_guess`, and outputs a dataframe consisting of the rows of `phones` who are most likely between the ages of 40-49.
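
Two pieces of these steps can be sketched on toy data: pulling the year out of a file name, and finding the year in which each name was most common. The table below is made up, and summing counts across `sex` before taking the maximum is an assumption about how "most occurrences" should be counted.

```python
import os

import pandas as pd

# The year sits at a fixed position in names like 'yob1980.txt'.
year_from_file = int(os.path.basename('names/yob1980.txt')[3:7])

toy_names = pd.DataFrame({
    'year': [1970, 1980, 1970, 1980],
    'first_name': ['Ann', 'Ann', 'Bob', 'Bob'],
    'sex': ['F', 'F', 'M', 'M'],
    'number': [50, 20, 10, 30],
})

# Total count per (name, year), then one column per year.
totals = toy_names.groupby(['first_name', 'year'])['number'].sum()
by_year = totals.unstack('year', fill_value=0)

# idxmax along the columns gives the year with the largest count per name.
peak_year = by_year.idxmax(axis=1)
```

`peak_year` is indexed by name, so it can be looked up directly against the `first_name` column of a table like `phones`.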

*Hint:* Look at the example in the first lecture!

# Congratulations, you're done with the homework!

### Now, run your doctests and upload `hw03.py` to Gradescope.

