# Welcome

This notebook contains a set of programming questions that test your ability to wrangle data in Python: ingesting, manipulating, and transforming different types of data. Modeling and machine learning are not part of this test, but anyone familiar with data science will have worked with these methods as part of data cleaning, data analysis, and validation.

This is a timed 35-minute test. Not getting through all the questions does not mean you are doing poorly. You are welcome to Google for the relevant APIs and documentation.

**Try to group your ideas cleanly into functions where possible.** Write code that is clean, simple, and to the point. Do not be afraid to add the occasional flourish to show us your style.

Good luck and have fun!

In [6]:
# This is to get you in the zone.
import this

# And these are some pieces to use when you need a hint (we all do sometimes!)
from cryptography.fernet import Fernet
key = b'qroFon14Mk22FqYltB9zv-IopBa2bs0LC45CWOOaOyE='
f = Fernet(key)

def decrypt_hint(encrypted_hint):
    return f.decrypt(encrypted_hint)
    
def encrypt_hint(hint):
    return f.encrypt(hint)
    

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## Question 1 - Theme: Beginner Data Manipulation

Given a small data set, `./data/Q1-Resume-Metadata.csv`, containing high-level details about resumes, figure out how many documents belong to each `job_type`.

Then calculate the average length, in words, of the `document` values in each `job_type`.
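
One possible shape for an answer, shown as a minimal sketch rather than the expected solution. It assumes the file has the `job_type` and `document` columns named above and that "length in words" means a whitespace split; the toy rows are made up for illustration and are not taken from the real file.

```python
import pandas as pd


def summarize_by_job_type(df: pd.DataFrame) -> pd.DataFrame:
    """Return the document count and mean word count for each job_type."""
    # Whitespace split is an assumption about what "length in words" means.
    word_counts = df["document"].fillna("").str.split().str.len()
    return (
        df.assign(word_count=word_counts)
        .groupby("job_type")["word_count"]
        .agg(document_count="size", mean_word_count="mean")
    )


# Hypothetical stand-in rows; the real data would come from pd.read_csv(Q1_FILE).
toy = pd.DataFrame(
    {
        "job_type": ["engineer", "engineer", "nurse"],
        "document": [
            "built data pipelines",
            "wrote python code daily",
            "cared for patients",
        ],
    }
)
summary = summarize_by_job_type(toy)
print(summary)
```

Keeping both statistics in one grouped result makes it easy to spot a `job_type` with very few documents, where the average is less trustworthy.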


In [1]:
# Question 1 Answer Space

import pandas as pd

Q1_FILE = './data/Q1-Resume-Metadata.csv'

## Question 2 - Theme: Beginner NLP Preprocessing

Using the data file from Q1, `./data/Q1-Resume-Metadata.csv`, load the data into a dataframe with one row per sentence.

Create the following columns:

`document_id` - A unique identifier for that document. We didn't provide these, so you can use any system as long as it's unique across documents.

`sentence_index` - An integer which determines the order of each sentence within the document.

`sentence_text` - The actual text of the sentence.

`sentence_tokens` - Tokenize the sentence and store the results in a list.
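
The four columns could be assembled along the lines below. This is a rough sketch under stated assumptions: the regex sentence splitter and the `\w+` tokenizer are deliberately naive stand-ins (a library tokenizer would handle abbreviations and punctuation better), the running integer used as `document_id` is just one valid scheme, and the two toy documents are invented.

```python
import re

import pandas as pd

# Naive rules, assumed for illustration only.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
TOKEN_PATTERN = re.compile(r"\w+")
COLUMNS = ["document_id", "sentence_index", "sentence_text", "sentence_tokens"]


def to_sentence_frame(documents) -> pd.DataFrame:
    """Explode an iterable of document strings into one row per sentence."""
    rows = []
    for document_id, text in enumerate(documents):
        sentences = [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]
        for sentence_index, sentence_text in enumerate(sentences):
            rows.append(
                {
                    "document_id": document_id,
                    "sentence_index": sentence_index,
                    "sentence_text": sentence_text,
                    "sentence_tokens": TOKEN_PATTERN.findall(sentence_text.lower()),
                }
            )
    return pd.DataFrame(rows, columns=COLUMNS)


# Hypothetical documents standing in for the real `document` column.
sentence_df = to_sentence_frame(
    ["I build models. I clean data!", "Nursing is hard work."]
)
print(sentence_df)
```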

In [11]:
# Question 2 Answer Space

import pandas as pd

Q2_FILE = 'Q2-Job-Descriptions-Data.txt'

#### Need A Hint on Question 2? Try This Out

In [10]:
decrypt_hint(b'gAAAAABehRX2LPOi70Bj9hRpT2Dtfa3KGjaYI8RQuU_ETM8nWKljfxVjwbQg3AkYZ3sSJ_tWxvhnr34g1XngcKrmkp-Oijfncb9w0P7KOah_NzlqPGOTLRys_qwODQtYrVOViVJ-t5qn2ng9iF2RNWE9hocM-DXodgtNPCcB-QItGMwZs8JfQwepEifY4VobsvDWX0ThcVIX')

## Question 3 - Theme: Classifier Validation

You've built a classifier that scores a document's category. Your results are in `./data/Q3-Sentence-Classifier-Results.pickle`. Review your results and select the subset of documents that you fear are not being classified strongly enough. Write a function that returns the IDs of these lower-confidence documents so we can take some action with them.

**The definition of "classified strongly enough" is arbitrary and open to interpretation. Maybe your function also returns a bit about the documents and why they were selected?**
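
Since the definition is open, here is one hedged interpretation, not a reference answer. The layout of the results file is not described here, so the sketch assumes a hypothetical table with one row per document ID and one score column per category (at least two categories). A document is flagged when its best score is low or when its top two scores are nearly tied; both thresholds are arbitrary.

```python
import numpy as np
import pandas as pd


def find_low_confidence(scores: pd.DataFrame, min_top=0.5, min_margin=0.2) -> pd.DataFrame:
    """Return flagged document IDs (as the index) with the reason for each flag.

    `scores` is assumed to be indexed by document ID with one column per
    category; this layout is hypothetical.
    """
    ordered = np.sort(scores.to_numpy(dtype=float), axis=1)
    top = ordered[:, -1]
    margin = top - ordered[:, -2]
    report = pd.DataFrame({"top_score": top, "margin": margin}, index=scores.index)
    # The first matching condition wins, so a weak top score takes priority.
    report["reason"] = np.select(
        [top < min_top, margin < min_margin],
        ["top score below threshold", "top two categories too close"],
        default="",
    )
    return report[report["reason"] != ""]


# Invented scores for three documents and three categories.
toy_scores = pd.DataFrame(
    [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.52, 0.45, 0.03]],
    index=["doc_a", "doc_b", "doc_c"],
    columns=["cat_1", "cat_2", "cat_3"],
)
low_confidence = find_low_confidence(toy_scores)
print(low_confidence)
```

Returning the reason alongside the ID lets whoever reviews the flagged documents see at a glance whether the classifier was unsure overall or torn between two categories.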

In [16]:
# Question 3 Answer Space

import pandas as pd

Q3_FILE = './data/Q3-Sentence-Classifier-Results.pickle'
q3_df = pd.read_pickle(Q3_FILE)

In [21]:
decrypt_hint(b'gAAAAABehR-AiWE3JS4eoTSycCw-9odIdt6vUBPLzXZXpa3BPyBRB-FSLG-9bKo-FVRRFhxOOC654jKRqqwPLyCJqQNB22yTKU1mG9V1t6eGj-4CfZjeMUWIxjf3ulHvzJc4skMGtK5H5IAgs4VfFc9UCyxBdw4aIqROMbsVLDJMMwbSzg29VDnyy9XdxbVb4cUrHcJG51HxS4tGbr-H1u5e-Q6WNb8nK7D_O3iBDeFJsbBq0V7YCWgCFJ7omL5Z0FULLk0jP6SrMwU0nc-ODsGoVNiciXjKqg_5_qhc8IjXf3j4CLJuXyI=')

## Bonus Question 1 - Theme: Basic Probability & Visualization

Generate two random arrays of numbers, one drawn from a normal distribution and the other from a uniform distribution. Sort each array and plot a simple 2-D line chart. Familiar shape, huh? Try to make the normal distribution exponential and plot it again. Now reverse the exponential distribution and plot once more.
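
A minimal sketch of the first steps, under loudly stated assumptions: the sample size, seed, and distribution parameters are arbitrary, "make it exponential" is read as applying `np.exp` to each sample, and "reverse" is read as undoing that with `np.log`. Other readings of the prompt are just as defensible.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only
N_SAMPLES = 1_000  # hypothetical sample size

normal_samples = np.sort(rng.normal(loc=0.0, scale=1.0, size=N_SAMPLES))
uniform_samples = np.sort(rng.uniform(low=-3.0, high=3.0, size=N_SAMPLES))

# Assumption: "exponential" means exp() applied element-wise.
exp_samples = np.exp(normal_samples)
# Assumption: "reverse" means inverting that transform with log().
recovered_samples = np.log(exp_samples)


def plot_sorted(series_by_label):
    """Draw one line per sorted array on a shared axis and return the Axes."""
    fig, ax = plt.subplots()
    for label, values in series_by_label.items():
        ax.plot(values, label=label)
    ax.set_xlabel("rank after sorting")
    ax.set_ylabel("value")
    ax.legend()
    return ax


ax = plot_sorted({"normal": normal_samples, "uniform": uniform_samples})
```

The same helper can be reused for the later plots, for example `plot_sorted({"exp(normal)": exp_samples})`.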

In [3]:
# Bonus Question 1 Answer Space

import numpy as np
import matplotlib.pyplot as plt


#### Need A Hint on Bonus Question 1?

In [14]:
# Stuck on generating distributions?
decrypt_hint(b'gAAAAABePXItvyLA-Ou-Udm3lxZsxgLjFNkOMt2cFFvlhEGF2soYwpFYYdJlUiHPsDCrdtHzPmW47uWIbyYcn39M34A-cx45j3DqewKqCE3Sw-D8wzZNfjuXIbkwkYT_K3vnyyXCqgt1dQKvDL3bTI2quodzJJVYrpzgRvfsK6qQP-7ac-K7y7c=')

In [16]:
# Stuck on getting a nice plot?
decrypt_hint(b'gAAAAABePXJJ4IHj4CuN9Vv0e-rr9M41ozEtmwhaopjHTrN8QQMEENrq0Sf_0CZmCT5GZp7xG0UYLZ1bgcKBl42-Wu6D8pEKlAR1QFiq597Dn1VZx7QGE_WHVwl3clXiStTllVPsx6Bv')

In [19]:
# Stuck on exponential and normal transformations?
decrypt_hint(b'gAAAAABePXKlkxtT8LDZ3Y2XuoLiya-qBgI05i6TQAjTWkExyjYmgn0ir-vDv8zgQRrIWQmKbmpotOYlvtx0u2JD6CPTZ8qDTVX67qJlrkyKbUgynN6cpdq264B9fWCFDI1U7oUgnsY9104blxrfWOiarAVtWR3OF4w-9p8w3JbRxx1jurUXbO2JaW7K3LeS02RTYgQ0kU2T')

## Bonus Question 2 - Theme: Intermediate NLP Preprocessing 

In Question 2 you generated a dataframe that had text and then tokenized each sentence. Using the `gensim` snippet below and the docs, apply pre-trained embeddings to your sentences, resulting in an embedding space for each document.

- [Gensim Word2Vec Docs](https://radimrehurek.com/gensim/models/word2vec.html)
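
One common way to turn token lists into vectors is to average the vectors of the known tokens. The sketch below illustrates only that averaging step and deliberately uses a plain dictionary of invented two-dimensional vectors instead of loading the GloVe file. It assumes the vector store supports `token in vectors` and `vectors[token]`; the helper name `embed_tokens` and the zero-vector fallback are illustrative choices, and averaging is just one of several reasonable pooling strategies.

```python
import numpy as np


def embed_tokens(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Invented 2-d vectors standing in for real pre-trained embeddings.
toy_vectors = {
    "data": np.array([1.0, 0.0]),
    "science": np.array([0.0, 1.0]),
}

sentence_vector = embed_tokens(["data", "science", "rocks"], toy_vectors, dim=2)
unknown_vector = embed_tokens(["rocks"], toy_vectors, dim=2)
print(sentence_vector, unknown_vector)
```

Applied to the `sentence_tokens` column from Question 2, this would give one vector per sentence, and the sentence vectors of a document could then be stacked or averaged again to represent the whole document.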

In [3]:
# Bonus Question 2 Answer Space

from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

GLOVE_PATH = './tools-for-interviewee/glove.6B.50d.txt.zip'

#### Need A Hint on Bonus Question 2?

In [25]:
decrypt_hint(b'gAAAAABehR_rPxhIjBukAygVOGRcdJCqesdudadYBTW2IiEDbSU-W64xXdzcQZexYqaHqasKSqafofjj4-iaBNAYJwAekZkkWwde0kSY7QKP6QQbDGy_YEcL0h2K1hSulQtxzbASzoUNaHhAV_r_S7yEokAYpwEmW2KRhcqubjQh54OgKaufxVkloHoiWWS7pG6lzB8SY7MAYvdJm0ilZlVAX2aoImAEHu2WIhCUVCtkaEEclOg-Rar1ZQepnq1fAdSB7mWIbIjc')