In [1]:
import pandas as pd
import re

# Review

We can load some data to play with. 

Below is a series of challenges that (mostly) use this data and cover some of the big concepts from the course so far. None of them is tightly specified: you need to look at the data, decide on a goal, then build code to reach it. There are multiple reasonable interpretations, and what you build is correct... until it isn't. Right now these questions only ask for something simple, so many interpretations are reasonable. If we continued to build this into a larger project, we might need to change things as more is asked of our code.

In [14]:
df = pd.read_csv("../data/Salary_Survey.csv")
df.head()

Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,...,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education
0,6/7/2017 11:33:27,Oracle,L3,Product Manager,127000,"Redwood City, CA",1.5,1.5,,107000.0,...,0,0,0,0,0,0,0,0,,
1,6/10/2017 17:11:29,eBay,SE 2,Software Engineer,100000,"San Francisco, CA",5.0,3.0,,0.0,...,0,0,0,0,0,0,0,0,,
2,6/11/2017 14:53:57,Amazon,L7,Product Manager,310000,"Seattle, WA",8.0,0.0,,155000.0,...,0,0,0,0,0,0,0,0,,
3,6/17/2017 0:23:14,Apple,M1,Software Engineering Manager,372000,"Sunnyvale, CA",7.0,5.0,,157000.0,...,0,0,0,0,0,0,0,0,,
4,6/20/2017 10:58:51,Microsoft,60,Software Engineer,157000,"Mountain View, CA",5.0,3.0,,0.0,...,0,0,0,0,0,0,0,0,,


In [6]:
df.describe(include="all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
timestamp,62642.0,62561.0,2/25/2020 13:25:07,3.0,,,,,,,
company,62637.0,1631.0,Amazon,8126.0,,,,,,,
level,62523.0,2923.0,L4,5014.0,,,,,,,
title,62642.0,15.0,Software Engineer,41231.0,,,,,,,
totalyearlycompensation,62642.0,,,,216300.373647,138033.746377,10000.0,135000.0,188000.0,264000.0,4980000.0
location,62642.0,1050.0,"Seattle, WA",8701.0,,,,,,,
yearsofexperience,62642.0,,,,7.204135,5.840375,0.0,3.0,6.0,10.0,69.0
yearsatcompany,62642.0,,,,2.702093,3.263656,0.0,0.0,2.0,4.0,69.0
tag,61788.0,3058.0,Full Stack,11382.0,,,,,,,
basesalary,62642.0,,,,136687.281297,61369.278057,0.0,108000.0,140000.0,170000.0,1659870.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62642 entries, 0 to 62641
Data columns (total 29 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   timestamp                62642 non-null  object 
 1   company                  62637 non-null  object 
 2   level                    62523 non-null  object 
 3   title                    62642 non-null  object 
 4   totalyearlycompensation  62642 non-null  int64  
 5   location                 62642 non-null  object 
 6   yearsofexperience        62642 non-null  float64
 7   yearsatcompany           62642 non-null  float64
 8   tag                      61788 non-null  object 
 9   basesalary               62642 non-null  float64
 10  stockgrantvalue          62642 non-null  float64
 11  bonus                    62642 non-null  float64
 12  gender                   43102 non-null  object 
 13  otherdetails             40137 non-null  object 
 14  cityid                

## Loops, Control Logic, Data Structures

<ol>
<li> There are NaN values in several columns. Replace them with something better, chosen based on the other data. There are several ways to do this, both in how you apply the change to every row and in how you swap the values. Try a couple of ways.</li>
<li> Write a function to clean at least one column of data, and use apply (or an equivalent that isn't a loop) to process that column. Dealer's choice on what to clean; I split City and State into separate columns. </li>
<li> There is a section of columns that look like true/false values. Convert them to boolean values. Then write a function that takes those dataframe columns as input (with the booleans) and returns an array of 1/0 values. (Yes, this involves redundant conversions.) </li>
<li> Find every timestamp that falls exactly on the hour (ignoring seconds), so 01:00, 02:00, etc. </li>
<li> (This may be harder to do completely. The simple cases shouldn't be that bad, but handling every edge case can take a while, so don't dwell on it too much in class.) Write a function to process the "other details" column into something useful. There are lots of things you could do; one that makes sense is to take the academic degree info in that column and use it to update the "education" column. </li>
</ol>
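For the on-the-hour challenge in the list above, one possible approach is a regular expression applied to each timestamp string. This is only a sketch: the sample values are made up to mirror the `M/D/YYYY H:MM:SS` layout seen in `df.head()`, and the names `ON_THE_HOUR` and `is_on_the_hour` are hypothetical.

```python
import re
import pandas as pd

# Made-up timestamps in the same "M/D/YYYY H:MM:SS" layout as the survey data.
sample = pd.Series([
    "6/7/2017 11:33:27",
    "6/17/2017 0:00:14",
    "2/25/2020 13:00:07",
    "2/25/2020 13:25:00",
])

# Whitespace, an hour of 1-2 digits, minutes exactly "00", then any seconds.
ON_THE_HOUR = re.compile(r"\s\d{1,2}:00:\d{2}$")

def is_on_the_hour(timestamp):
    """Return True when the minutes part of the timestamp is 00."""
    return bool(ON_THE_HOUR.search(str(timestamp)))

# apply builds a boolean mask, which then filters the Series.
on_hour = sample[sample.apply(is_on_the_hour)]
print(on_hour.tolist())
```

Parsing the column with `pd.to_datetime` and checking the minute would be another reasonable interpretation; the regex version is shown because `re` is already imported.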

In [24]:
def replaceNaN(input_text):
    # apply passes one cell at a time, so check that single value
    if pd.isna(input_text):
        return 'None'
    return input_text

In [26]:
results = df["Race"].apply(replaceNaN)
results

0        None
1        None
2        None
3        None
4        None
         ... 
62637    None
62638    None
62639    None
62640    None
62641    None
Name: Race, Length: 62642, dtype: object
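The first challenge asks for a couple of different ways to fill missing values. A minimal sketch on a toy dataframe follows; the values are invented, and the fill choices (most common value, an "Unknown" placeholder, the median) are assumptions about what "something better" could mean, not the required answer.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; all values are made up.
toy = pd.DataFrame({
    "company": ["Amazon", None, "Amazon", "Apple"],
    "gender": ["Male", np.nan, "Female", np.nan],
    "yearsatcompany": [1.0, np.nan, 3.0, 5.0],
})

# Way 1: fillna with a value derived from the column (most common company).
most_common = toy["company"].mode()[0]
toy["company"] = toy["company"].fillna(most_common)

# Way 2: fillna with an explicit placeholder category.
toy["gender"] = toy["gender"].fillna("Unknown")

# Way 3: apply a function cell by cell (slower, but flexible).
median_years = toy["yearsatcompany"].median()

def fill_years(value):
    """Swap a missing value for the column median."""
    return median_years if pd.isna(value) else value

toy["yearsatcompany"] = toy["yearsatcompany"].apply(fill_years)
print(toy)
```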

In [33]:
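# I made this one, you can do this or other things. 
def cityStateSplitter(input_text):
    # Split on the first comma only; State is None when there is no comma.
    parts = [part.strip() for part in str(input_text).split(",", 1)]
    state = parts[1] if len(parts) > 1 else None
    return pd.Series({"City": parts[0], "State": state})

In [34]:
# Example of using mine: apply returns one City/State pair per row
df[["City", "State"]] = df["location"].apply(cityStateSplitter)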

## Objects and Classes

<ol>
<li> Create classes for both Employees and Jobs. Populate each with the relevant attributes from the data. </li>
    <ul>
    <li> Override or create some basic useful stuff - str, repr, len, etc. </li>
    <li> You'll need to split the data into the class it "belongs" to - this is the same idea as database normalization. </li>
    </ul>
<li> Build code to take in the arguments using a kwargs argument if there are a lot of attributes. </li>
<li> Create the logic needed to sort Jobs by total compensation. </li>
<li> Create some error checking using try/except blocks to ensure that the data that goes into at least a few of the attributes is valid. </li>
<li> Create a function that takes in a text input and a bunch of job/employee objects, and finds matching records. Good use of regex. (This is a bit open ended: we want to be able to search for text matching a pattern and get the row back. Dynamic inputs make that a bit more complex than I thought when I first wrote it.)</li>
</ol>
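The search challenge at the end of the list above could be sketched as follows. `JobRecord`, `find_matches`, and the `fields` parameter are hypothetical names used only for illustration, and the records are a few made-up rows; your own Job and Employee classes would take their place.

```python
import re
from dataclasses import dataclass

@dataclass
class JobRecord:
    # Hypothetical stand-in for a Job object.
    company: str
    title: str
    location: str

def find_matches(pattern, records, fields=("company", "title", "location")):
    """Return records where any of the chosen attributes matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for record in records:
        for field_name in fields:
            value = getattr(record, field_name, None)
            if value is not None and regex.search(str(value)):
                matches.append(record)
                break  # one hit per record is enough
    return matches

jobs = [
    JobRecord("Oracle", "Product Manager", "Redwood City, CA"),
    JobRecord("eBay", "Software Engineer", "San Francisco, CA"),
    JobRecord("Apple", "Software Engineering Manager", "Sunnyvale, CA"),
]

hits = find_matches(r"engineer(ing)?\s+manager", jobs)
print([job.company for job in hits])
```

Passing the attribute names in as an argument is one way to handle the "dynamic inputs" problem: the caller decides which fields are searched.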

In [None]:
class Employee:
    def __init__(self) -> None:
        pass

class Job:
    def __init__(self, **kwargs) -> None:
        pass

def process_data_load(df):
    pass
    # It might be useful to have a function that takes a dataframe and generates the job and employee objects
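
One way the stubs above might be filled in is sketched below, as a self-contained example on a toy dataframe. The attribute names, the validation rules, and the choice to skip invalid rows are all assumptions; the exercise leaves those decisions to you.

```python
from functools import total_ordering
import pandas as pd

@total_ordering
class Job:
    """Sketch of a Job built from keyword arguments."""

    def __init__(self, **kwargs) -> None:
        self.company = kwargs.get("company", "Unknown")
        self.title = kwargs.get("title", "Unknown")
        # Validate compensation: it must convert to a non-negative number.
        try:
            self.total_comp = float(kwargs.get("totalyearlycompensation", 0))
        except (TypeError, ValueError):
            raise ValueError("totalyearlycompensation must be numeric") from None
        if self.total_comp < 0:
            raise ValueError("totalyearlycompensation cannot be negative")

    def __repr__(self) -> str:
        return f"Job({self.company!r}, {self.title!r}, {self.total_comp:.0f})"

    # __eq__ and __lt__ are enough for total_ordering to supply the rest.
    def __eq__(self, other) -> bool:
        if not isinstance(other, Job):
            return NotImplemented
        return self.total_comp == other.total_comp

    def __lt__(self, other) -> bool:
        if not isinstance(other, Job):
            return NotImplemented
        return self.total_comp < other.total_comp

def process_data_load(df):
    """Build one Job per row, skipping rows that fail validation."""
    jobs = []
    for row in df.to_dict(orient="records"):
        try:
            jobs.append(Job(**row))
        except ValueError:
            continue
    return jobs

# Toy data; the last row is deliberately invalid.
toy = pd.DataFrame({
    "company": ["Oracle", "Amazon", "eBay"],
    "title": ["Product Manager", "Product Manager", "Software Engineer"],
    "totalyearlycompensation": [127000, 310000, "not a number"],
})
jobs = sorted(process_data_load(toy))
print(jobs)
```

Defining the comparison on the class means `sorted`, `min`, and `max` work on Job objects directly; passing `key=lambda job: job.total_comp` to `sorted` would be a lighter alternative.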

## Inheritance

<b>Not using the imported data.</b>

<ol>
<li> Create a few classes:</li>
    <ul>
    <li> A pet class, that contains things relating to a pet such as name, age, owner, etc...</li>
    <li> A dog class, that inherits from the pet class, and adds things like breed, favorite toy, etc...</li>
    <li> A cat class, that inherits from the pet class, and adds things like favorite food, etc...</li>
    </ul>
<li> Override the key methods for each class such as str(), eq(), etc... Exactly how to do this is ambiguous, you'll need to pick. </li>
<li> Add a method to the Pet class called "prepare for vet visit" that lists out the steps to get ready for a vet visit. For a pet object, this is pretty generic. For each subclass of pets, this can be specific - e.g. a cat might get put in a kennel, a dog might get a leash, etc... We should be able to loop over a bunch of pets and ask them all to get ready for the vet, and each should do its own thing depending on what it is. </li>
<li> Create a class called MyPets, that is a container that holds a bunch of pets and ties them to their owner. This is effectively a data structure that holds pets. </li>
    <ul>
    <li> This class should implement basic data structure stuff like str(), len(), etc... </li>
    <li> Make sure that you can iterate through pets that are stored in the class. </li>
    </ul>
</ol>
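
One possible reading of the class hierarchy described above is sketched here. The attribute names, the equality rule, and the wording of the vet steps are assumptions made for illustration; other choices are just as valid.

```python
class Pet:
    def __init__(self, name, age, owner):
        self.name = name
        self.age = age
        self.owner = owner

    def __str__(self):
        return f"{self.name} ({type(self).__name__}, age {self.age}, owner {self.owner})"

    def __eq__(self, other):
        # One possible choice: same type, name, and owner means the same pet.
        if not isinstance(other, Pet):
            return NotImplemented
        return (type(self), self.name, self.owner) == (type(other), other.name, other.owner)

    def prepare_for_vet_visit(self):
        # Generic steps shared by every pet.
        return ["Find vaccination records", "Book the appointment"]

class Dog(Pet):
    def __init__(self, name, age, owner, breed, favorite_toy=None):
        super().__init__(name, age, owner)
        self.breed = breed
        self.favorite_toy = favorite_toy

    def prepare_for_vet_visit(self):
        # Reuse the generic steps, then add the dog-specific one.
        return super().prepare_for_vet_visit() + ["Clip on the leash"]

class Cat(Pet):
    def __init__(self, name, age, owner, favorite_food=None):
        super().__init__(name, age, owner)
        self.favorite_food = favorite_food

    def prepare_for_vet_visit(self):
        return super().prepare_for_vet_visit() + ["Coax into the kennel"]

pets = [Dog("Rex", 4, "Sam", "Beagle"), Cat("Miso", 2, "Sam"), Pet("Bubbles", 1, "Sam")]
for pet in pets:
    # Each object runs its own version of the method.
    print(pet, "->", pet.prepare_for_vet_visit()[-1])
```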

To test it and use it:
<ul>
<li> Create some pets, add them to a MyPets object. Make some dogs, some cats, and some others that have no subclass and are just generic pets. </li>
<li> Enumerate through the MyPets object and print out the pets. </li>
<li> Sort the pets, based on whatever you defined for the comparison logic. </li>
<li> Send a pet over to a friend, and remove it from your MyPets object. </li>
</ul>

This description is deliberately a bit vague: you'll need to make decisions on how to build things, exactly what you need, and where things should live. The exercise is to translate a simplified understanding of what pets are, what we know about them, and how we structure owning them into the model in our code. There is no specific correct answer; as long as this does what we need, such as keeping an inventory of our fur babies, it is good enough. 
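
As a rough illustration of the container part, the sketch below uses a minimal stand-in `Pet` dataclass (sorted by age, then name) rather than the full hierarchy. The method names `add` and `send_to` are hypothetical, and the sort order is just one choice among many.

```python
from dataclasses import dataclass

@dataclass(order=True)
class Pet:
    # Minimal stand-in; comparisons follow field order (age first, then name).
    age: int
    name: str

class MyPets:
    def __init__(self, owner):
        self.owner = owner
        self._pets = []

    def add(self, pet):
        self._pets.append(pet)

    def send_to(self, pet, other):
        """Remove a pet from this collection and add it to another MyPets."""
        self._pets.remove(pet)  # raises ValueError if the pet is not here
        other.add(pet)

    def __len__(self):
        return len(self._pets)

    def __iter__(self):
        # Delegating to the list's iterator makes for-loops and sorted() work.
        return iter(self._pets)

    def __str__(self):
        names = ", ".join(pet.name for pet in self._pets)
        return f"{self.owner}'s pets: {names}"

mine = MyPets("Sam")
friend = MyPets("Alex")
for pet in (Pet(4, "Rex"), Pet(2, "Miso"), Pet(1, "Bubbles")):
    mine.add(pet)

for position, pet in enumerate(mine):
    print(position, pet)

sorted_names = [pet.name for pet in sorted(mine)]
mine.send_to(Pet(2, "Miso"), friend)
print(mine, "|", friend)
```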

## Assignment #2 Bonus - Verifying Output

For the next assignment you'll need to produce a specific output that will be evaluated against expected results programmatically. This is a simple example of that: below is the definition of a function that takes in a dataframe and two column names (in this case the total compensation and the base salary). The function returns a series of the differences between the two columns, computed as the total compensation minus the base salary, and also writes that series to a CSV file.

Underneath that is a function that evaluates accuracy of the first function. It pulls in the correct answers from some other file, and compares the results of the function that we write to that predefined set of answers. This is a simple example of how we can test our code to make sure that it is doing what we expect it to do. For assignment #2 I'm going to ask you to produce specific results like this, and your output needs to be in the exact format that we are asking for. You'll get a few things that you'll need to use:
<ul>
<li> A specified output format that you must produce </li>
<li> A sample set of test data that you can use to verify that your code is working correctly </li>
<li> A sample set of expected results that you can use to verify that your code is working correctly </li>
<li> The tester function - this is like the one below, it just checks if your answer matches expectations. </li>
</ul>

So you'll need to ensure that your code works with the sample tests, i.e. run yours with the test data, put the outputs through the checker, and see if it passes. I'll rerun your code with different input data and correct answers, and that output will determine the score. If yours works with the test data, you're doing things correctly and will be able to handle the longer data sets used for evaluation. If it works, then it is 100% (on that part). This is strict and pedantic, but it is also very realistic. The number 1 thing that our code needs to do is produce the correct answers, so that's what we'll check. 

In [10]:
# this function should take a dataframe and two columns to compare.
# it returns a series of the difference between the two columns
# i.e. total compensation - base salary
def produceDifferences(dat, total="totalyearlycompensation", base="basesalary", output_file="difference.csv"):
    total_comp = dat[total]
    base_sal = dat[base]
    tmp = total_comp - base_sal
    tmp.to_csv(output_file)
    return tmp
    

In [11]:
# Call the function
diffs = produceDifferences(df)

#### Testing

This part doesn't change except for the input file. Whatever your function above produces is checked against the file. 

In assignment 2 you'll get something similar to this, a testing function that takes your results and a set of answers, and scores them. You need to supply the "diffs" in this case, the rest is provided. 

In [12]:
def verifyDifferences(diffs, truth):
    """This function will compare the differences that we have calculated to the truth values.

    Args:
        diffs (Series): The input differences that we are testing. 
        truth (Series): The true values that we are comparing to. 

    Returns:
        list: The positions where diffs and truth disagree (empty if all match).
    """
    errors = []
    for i, diff in enumerate(diffs):
        if diff != truth[i]:
            print("Error at index", i)
            errors.append(i)
    return errors

In [13]:
truths = pd.read_csv("diffs.csv")["0"]
submissions = pd.read_csv("difference.csv")["0"]

# Run the check once and reuse the result for both the listing and the score
errors = verifyDifferences(submissions, truths)
print(errors)
print("Score:", (1 - len(errors)/len(truths))*100, "%")

[]
Score: 100.0 %
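
To try the whole produce, save, reload, and score loop without the course files, a self-contained sketch with made-up numbers is shown below. The names `produce_differences` and `count_errors` are hypothetical stand-ins for the functions above, and the temporary folder is only there so the example cleans up after itself.

```python
import os
import tempfile
import pandas as pd

def produce_differences(dat, total, base, output_file):
    """Write total - base to a CSV file and return it as a Series."""
    diffs = dat[total] - dat[base]
    diffs.to_csv(output_file)
    return diffs

def count_errors(submission, truth):
    """Return the positions where the submission disagrees with the truth."""
    return [i for i, (got, want) in enumerate(zip(submission, truth)) if got != want]

# Made-up sample data and the answers we expect for it.
toy = pd.DataFrame({
    "totalyearlycompensation": [127000, 100000, 310000],
    "basesalary": [107000.0, 0.0, 155000.0],
})
expected = pd.Series([20000.0, 100000.0, 155000.0])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "difference.csv")
    produce_differences(toy, "totalyearlycompensation", "basesalary", path)
    # An unnamed Series is written under the column header "0".
    submission = pd.read_csv(path)["0"]

errors = count_errors(submission, expected)
score = (1 - len(errors) / len(expected)) * 100
print(errors, score)
```

Reading the file back, rather than scoring the in-memory Series, is the point of the exercise: it checks the saved format as well as the numbers.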
