![logo.png](logo.png)


## States Title Data Science Challenge 

### Purpose

We would like you to use a Jupyter (Python) notebook to work with a slice of our data. You'll get a sense of the type of questions that we deal with at States Title, and we'll get a sense of your data science approach.

### Background

States Title's platform disrupts the nationwide market for title insurance by estimating title defect risk for any given property through the use of data science instead of manual searches by title examiners. We combine data streams that contain two types of information:

1. **Title defects:** The dollar value of each defect, measured via a past slow title search that we have obtained from partner title agencies

2. **Property information:** Hundreds of property-level attributes

The platform cleans, processes, and combines this data to train and test our machine learning models, which quantify the risk of any given property. That estimated risk is then compared to a carefully chosen threshold, and each property either passes or fails.

We have provided you with a sliced-down subset of our own datasets to simplify things. You may make the following (drastically) simplifying assumptions:

- When a property passes (is accepted), it'll be underwritten instantaneously, netting us a $\$800$ flat, one-time fee per policy regardless of its real estate value

- When a property fails, it is rejected and generates no revenue

- If we accept a policy that turns out to have defects, we suffer a loss to the tune of $100 \%$ of the value of the defects.

- Therefore, our profit is $\$800$ times the number of policies we accept, minus the sum of the values of all the defects we fail to catch (i.e., the total amount of dollars in false negatives).
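The profit rule in the assumptions above can be written out as a short function. This is a minimal sketch, not part of the provided materials: the name `profit` and its parameters are hypothetical, and the only inputs taken from the text are the $\$800$ fee and the 100% loss on defects of accepted properties.

```python
def profit(lien_totals, accepted, fee=800.0):
    """Profit under the simplifying assumptions stated above.

    lien_totals: dollar value of defects per property (0 means no liens).
    accepted:    booleans, True where the property passed and was underwritten.
    """
    if len(lien_totals) != len(accepted):
        raise ValueError("lien_totals and accepted must have the same length")
    # Each accepted policy earns the flat fee.
    revenue = fee * sum(1 for a in accepted if a)
    # Defects on accepted properties are lost in full; rejected ones cost nothing.
    losses = sum(lien for lien, a in zip(lien_totals, accepted) if a)
    return revenue - losses


# Toy example with made-up numbers: three accepted, one of them with a $2,500 lien.
liens = [0.0, 0.0, 2500.0, 0.0]
accept = [True, True, True, False]
example_profit = profit(liens, accept)  # 3 * 800 - 2500
```

A single missed lien can wipe out the fees from several clean policies, which is why the ordering of the risk score matters more than its absolute scale.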


### Objectives 

1. Write Python code that allows you to stand up a nationwide title insurance company:

  a. It should read the files `default_notices.csv`, `train_property_data.csv`, and `test_property_data.csv`, described below.

  b. It should append a new column, `risk`, to the `test_property_data.csv` file, which represents your prediction of overall title risk for the property. This column should behave in such a way that properties with lower risk are predicted to be more profitable than properties with higher risk.

  c. You are at complete freedom to set the method for measuring risk, and the column itself can contain any real-valued number that satisfies part (b).
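The mechanics of part (b) can be sketched with the standard library alone. Everything below is illustrative: the in-memory CSV stands in for `test_property_data.csv`, `year_built` is a hypothetical column (only `house_id` is documented), and the scoring rule is a placeholder rather than a suggested model. In a notebook, the same step would more likely be done with a pandas DataFrame.

```python
import csv
import io


def append_risk_column(rows, score_fn):
    """Return new row dicts, each with a 'risk' key set to score_fn(row)."""
    return [{**row, "risk": score_fn(row)} for row in rows]


# Hypothetical stand-in for test_property_data.csv.
sample_csv = "house_id,year_built\n1,1950\n2,2010\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Placeholder score (an assumption, not a real model): older homes score higher.
scored = append_risk_column(rows, lambda r: 2024 - int(r["year_built"]))

# Write the rows back out with the new column appended last.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(scored[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(scored)
result_csv = out.getvalue()
```

Any real-valued score works here as long as lower values correspond to properties expected to be more profitable, as part (b) requires.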

### Notes

1. Within the IPython (Jupyter) notebook, make a rough but justified profit-maximizing recommendation for the value of the risk that should be used as a decision threshold during production, such that you instant-underwrite all properties with a lower risk than that threshold. It would be helpful to have a graph or two explaining your thinking.

2. Spend minimal time on formatting/fonts as they are not important.
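One way to approach the threshold recommendation in note 1 is to sort labeled properties by predicted risk and track cumulative profit as more of them are accepted. The sketch below assumes you already have risk scores for properties with known `lien_total` (for example, a held-out part of the training file); the function name `best_threshold` and the toy numbers are hypothetical.

```python
import math


def best_threshold(risks, lien_totals, fee=800.0):
    """Return (threshold, profit) maximizing profit when accepting risk < threshold."""
    if len(risks) != len(lien_totals):
        raise ValueError("risks and lien_totals must have the same length")
    # Visit properties from lowest to highest predicted risk.
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    best_t, best_profit = -math.inf, 0.0  # baseline: accept nothing
    running = 0.0
    for rank, i in enumerate(order):
        # Accepting this property earns the fee and loses its lien value.
        running += fee - lien_totals[i]
        # A cut is only possible where the next risk is strictly larger (ties stay together).
        nxt = risks[order[rank + 1]] if rank + 1 < len(order) else math.inf
        if nxt > risks[i] and running > best_profit:
            best_t, best_profit = nxt, running
    return best_t, best_profit


# Toy data with made-up values: the riskiest property carries a large lien.
toy_risks = [0.1, 0.2, 0.3, 0.9]
toy_liens = [0.0, 0.0, 500.0, 5000.0]
threshold, expected_profit = best_threshold(toy_risks, toy_liens)
```

Plotting the running profit against the risk score gives the kind of graph the note asks for. A threshold tuned on the same rows used to fit the model will tend to look better than it is, so a held-out split is the safer basis for the recommendation.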

### Data Description

Along with this description, there are CSV files containing information on 50,000 properties. Handle abnormal values as you see fit. Below is additional information on the dataset, including details for a few fields that may not be self-explanatory.

- `house_id`: A unique property identifier column, present in all of the data files.

- `train_property_data.csv`: General information about each property, including the location, size, and age of the home. This file includes the target column, `lien_total`, which is the sum of the dollar value of all liens (defects) against each property. A value of $\$0$ indicates the property was free of the liens that we are attempting to predict.

- `title_check_date`: The date when the `lien_total` target was measured (i.e., title was checked).

- `county_fips`: Unique county identifier code

- A value of `"yes"` in a categorical column means "available but of unknown type."

- `test_property_data.csv`: Contains different properties in the same format as `train_property_data.csv`, with the target column, `lien_total`, removed. You must append a `risk` column to this file as described above.

- `default_notices.csv`: The history of every notice of default (NOD) issued to each test and train property. A NOD indicates a bill payment that is more than 90 days late.

- `event_type`: Distinguishes whether the notice of default was issued or rescinded.
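The notice-of-default history can be collapsed into one feature per property before joining it to the property data on `house_id`. The sketch below is only an illustration: the literal values `"issued"` and `"rescinded"` are assumptions about how `event_type` is encoded, and the function name is hypothetical, so check the actual file before relying on either.

```python
from collections import defaultdict


def net_open_notices(events):
    """Count issued minus rescinded notices per house_id.

    events: iterable of (house_id, event_type) pairs.
    The strings "issued" and "rescinded" are assumed encodings, not confirmed ones.
    """
    counts = defaultdict(int)
    for house_id, event_type in events:
        if event_type == "issued":
            counts[house_id] += 1
        elif event_type == "rescinded":
            counts[house_id] -= 1
    return dict(counts)


# Made-up events: house 1 had a notice rescinded, house 2 has two outstanding.
sample_events = [(1, "issued"), (1, "rescinded"), (2, "issued"), (2, "issued")]
nod_feature = net_open_notices(sample_events)
```

Properties that never appear in `default_notices.csv` will be missing from the result, so a join should fill them with zero rather than drop them.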

In [None]:
!git clone --branch states_title_1 https://github.com/interviewquery/takehomes.git
%cd takehomes/states_title_1
!if [[ $(ls *.zip) ]]; then unzip *.zip; fi
!ls

In [None]:
# Write your code here