# Coding Challenges

There are three problems below; the mandatory tools for each are listed in parentheses. Complete two of the three problems. No points are awarded for completing a third.
    
1. Data Manipulation (SparkSQL)
2. Data Visualization (Vue, React)
3. Data Models (Traditional, Neural Network)
    
<br>

### Instructions

* Copy this Jupyter notebook and use it as your turn-in report
  - there may be multiple notebooks, if a different kernel is used for different tasks
  - this does not apply to the Data Visualization task, which should be a zip file
* The notebook should, at a minimum:
  - describe the work you've done using markdown (styled as any professional report)
  - investigate the data and state your assumptions
  - state results and future work
  - be idempotent: clean up after yourself (e.g., drop your tables)
* The notebook is scored on:
  - good problem-solving workflow
  - quality of code
  - explanation of operations performed beneath code
  - reproducibility and documentation
  - professional, interesting, and witty text

<br>
<br>
<br>
<br>
<hr>
<br>
<br>
<br>
<br>

# 1. Data Manipulation (SparkSQL)

Use Apache Spark to import the data: `./Data/classification/DM-classification.json`


While you MUST use Apache Spark to answer this Data Manipulation question, ANY language binding (Scala, PySpark, sparklyr) can be used.

### Configure your environment

Please explain this configuration, and why you chose it.

### ETL

Load the data into a Spark SQL DataFrame with the following columns: 'content', 'label', 'size', 'usage', 'effect', 'date'. Create a temporary table for querying.

Define the schema explicitly with `StructType` and `StructField`, using an appropriate type for each column.

### Process

Group the table by 'size', and sort based on 'date'.  Then, create a new column that is the difference between 'date' in consecutive records (within groups).

### Save

Save the results as a single file.

<br>
<br>
<br>
<br>
<hr>
<br>
<br>
<br>
<br>

# 2. Data Visualization (Vue, React)

Design and build a User Interface to visualize the data: `./Data/classification/DV-classification.json`


Please be creative in your design, but meet the following requirements:

* make it deployable with Docker: `docker run`
* use either Vue, React, or a complementary framework (e.g., Nuxt, Next)
* use a styling library of your choice
* add some basic interactivity to explore the data, such as [dc.js](https://dc-js.github.io/dc.js/)
* explain how this would be deployed, including the necessary tools and hardware

Submit this solution as a zipped directory, and include instructions for running it.

![example interface](./Data/classification/image-DimChart.png)

<br>
<br>
<br>
<br>
<hr>
<br>
<br>
<br>
<br>

# 3. Data Models (Traditional, Neural Network)

__Note:__ this task is language- and framework-agnostic.

## Models: Traditional Methods

This section uses traditional analytic methods to describe and model the data. It must NOT use neural-network-based methods.

### Import data and summarize

Import the classification data, `./Data/classification/classification.txt`, into two columns: 'content', 'labels'. Review and summarize the data, and note your observations.

### Preprocessing

Create a preprocessing function that does the following:

* lowercases text
* replaces punctuation, digits, and all special characters except the underscore ('_') with a single space

Preprocess the 'content' column and store the result in a new column called 'prep_content'.
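A minimal sketch of such a function, using only the Python standard library. It assumes that each run of unwanted characters collapses into one space (the wording could also mean one space per character), and the name `preprocess` is hypothetical.

```python
import re

# Runs of non-word characters (punctuation, whitespace, symbols) or digits.
# The underscore counts as a word character, so it is left untouched.
_NON_TEXT = re.compile(r"[\W\d]+")


def preprocess(text: str) -> str:
    """Lowercase the text and replace each unwanted run with a single space."""
    return _NON_TEXT.sub(" ", text.lower())


print(preprocess("Hello, World! 123 foo_bar"))  # hello world foo_bar
```

Leading and trailing spaces are not stripped here; whether to strip them is a choice left to the implementer.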

### Dataset dev(train, validate)-test split

Create `dev_train`, `dev_validate`, and `test` sets, with `dev_validate` and `test` being the same size.
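One way to guarantee equal-sized validation and test sets is to shuffle once and slice. The sketch below is standard-library Python; `split_dataset`, the 20% holdout fraction, and the seed are illustrative assumptions, and it does not stratify by label.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    rows: Sequence[T], holdout_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T], List[T]]:
    """Return (dev_train, dev_validate, test) with equal-sized validate and test sets."""
    if not 0 < holdout_fraction < 0.5:
        raise ValueError("holdout_fraction must be between 0 and 0.5")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    test = shuffled[:n_holdout]
    dev_validate = shuffled[n_holdout:2 * n_holdout]
    dev_train = shuffled[2 * n_holdout:]
    return dev_train, dev_validate, test
```

With 100 rows and the default fraction, this yields 60 training rows and 20 rows each for validation and test.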

### Named Entity Recognition

Count how many organization ('ORG') entities appear in the `dev_train` dataset. Use a text-processing framework that includes an NER classifier.

### ML pipeline with logistic regression

Build an ML pipeline that performs the following:

* vectorizes text
* selects top 100 features with chi2
* trains a logistic regression classifier

### Pipeline inspection

Review the pipeline and print the 100 features that the chi2 selector chose along with each feature's score.

### Model 

Plot the learning curve.

### Discuss future work

Make notes on the learning curve and discuss how it affects your future work in this modeling process.

## Models: Neural Network Methods

This section uses neural-network-based methods, including word embeddings. You MUST use a neural network (NN) framework (PyTorch, DeepLearning4J, MXNet, Keras, TensorFlow, etc.) for these tasks.

While you will not be judged on the accuracy of the model, the architecture should be appropriate for the situation.

### Configure environment

Prepare your environment for your NN framework of choice, and make sure your results can be reproduced.

Also, explain your platform and how different hardware might provide improvements to your code.

### Re-ETL and dataset split

Reformat the original data for ingestion with your NN framework.  Be sure to explain how the process relates to your framework.

Be sure to re-split the data as well.

### Word embeddings

Apply word embeddings and explain their purpose relative to the traditional methods used above.

Explain the underlying concepts of word embeddings with an example.
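To indicate the kind of example meant here, the toy sketch below compares words by the cosine similarity of their vectors. The three-dimensional vectors are invented by hand purely for illustration; real embeddings are learned from data and have far more dimensions.

```python
import math
from typing import Dict, List

# Hand-made vectors (an assumption for illustration, not learned embeddings).
toy_embeddings: Dict[str, List[float]] = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king~queen: {royal:.3f}, king~apple: {fruit:.3f}")
```

Related words end up close together in the vector space, which a bag-of-words representation cannot express because every word is its own independent dimension.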

### Model architecture

Determine an architecture for your neural network classifier.  Explain why you chose the architecture, then implement it in your chosen framework.

Be sure to initialize the model, and print the number of parameters.
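The printed parameter count can be sanity-checked by hand. The sketch below does the arithmetic for a hypothetical architecture (an embedding layer, mean pooling, and one linear output layer); the architecture and the name `count_parameters` are assumptions for illustration only.

```python
def count_parameters(vocab_size: int, embed_dim: int, num_classes: int) -> int:
    """Trainable parameters of an embedding layer followed by one linear layer."""
    # One embed_dim-sized vector per vocabulary entry.
    embedding = vocab_size * embed_dim
    # Weight matrix plus one bias per output class; mean pooling adds none.
    linear = embed_dim * num_classes + num_classes
    return embedding + linear


print(count_parameters(vocab_size=10_000, embed_dim=50, num_classes=4))  # 500204
```

Most frameworks report this number directly, so a mismatch with the hand calculation usually points to a layer that was configured differently than intended.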

### Model training routine

What decisions do you have to make when training the model? Be sure to cover the following:

* loss function
* scoring criteria
* optimization

Explain why these were chosen and implement them in code.

### Train and evaluate

Train the model and evaluate the results.

Print the training and validation scores you chose after each iteration or epoch.

### Spark integration

Explain how you would train and apply this model on a large dataset that requires a Spark cluster to process.