In [None]:
https://www.kaggle.com/thumbsnail/coming-from-google-s-machine-learning-crash-course

<a id="top"></a>
**I.  [Getting Started](#gettingstarted)**
1. [Introduction](#intro)
2. [Kaggle vs Colab (Google's Colaboratory)](#kaggle)
3. [Setup - Importing Packages and Data Files](#setup)
 - [Importing Data from Other Data Sources](#otherdata)
4. [Some Quirks in the Data:  Training Data Split over Multiple Files, Multiple Rows for One Entry, Missing Data](#quirks)
5. [Pandas Pain](#pandaspain)
  - [NaNs in the Data](#nans)
  - [Condensing Multiple Rows into a Single Row](#multiplerows)
  - [Combining Two DataSets](#combining)
  - [Changing Only Certain "Cells" (that is, Update Column Values for Only Certain Rows)](#changecells)
  - [Editing Lots of String Data (Removing Punctuation, Setting to Lowercase)](#stringdata)
      - **[NEW:  Looking for Misspelled Words](#misspelledwords)**
  - [Handling Date Information](#dateinfo)

**II.  [Data Visualization](#datavisual)**
1.  [Bar Chart](#barchart)
2. [Scatter Plot](#scatterplot)
3. [Line Plot](#lineplot)
4. [Stacked-Bars Barchart](#stackedbars)
5. [Pie Chart](#piechart)
6. [More Line Plots to Look at Dates (Month, Week, Weekday)](#datedata)
7.  [Violin Plots (Using Spelling/Typing Error Data)](#violin)

**III.  [Applying the Crash Course to this Project to Create a Linear Classifier (Does NOT Use Text-Heavy Features)](#linearclassifier)**
1. [Randomizing the Data](#random)
2. [Creating Training and Validation Sets](#sets)
3. [Choosing Good Features (Location of DESIRED_FEATURES const)](#desiredfeatures)
4. [Setting Up for Training](#trainingsetup)
5. [Regularization](#regular)
6. [Training](#lineartraining)
7. [ROC and AUC](#rocandauc)
8. [Experiment Results](#linearpretextexps)

**IV. [Dealing with Text-Heavy Features:  Using the Crash Course to Bypass TensorFlow Pain](#bruteforcebaby)**
1. [Converting from Pandas to TFRecord](#tfrecordrescue)
 - [Handle the Test Dataset](#testrecord)
2. [Loading TFRecord into TensorFlow](#recordtotf)
3. [Setting up for Training (with Text-Heavy Features)](#traintextheavy)
4. [Training](#lineartexttraining)
5. [AUC](#lineartextauc)
6. [Experiment Results](#lineartextexps)

**V. [Applying the Crash Course to Create a DNNClassifier](#dnnclassifier)**
1. [Setting Up for Training](#dnntrainingsetup)
2. [Training](#dnntraining)
3. [AUC](#dnnauc)
4. [Experiment Results](#dnnpretextexps)

**VI. [Submitting Predictions](#submit)**
1. [Making Predictions on the Test Dataset](#predictions)
2. [Convert to .csv](#csv)
3. [Submit to Kaggle](#submitfile)

**I. Getting Started**
<a id="gettingstarted"></a>

**1. Introduction**
<a id="intro"></a>

If you're like me, perhaps your only machine learning experience is from Google's [Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/next-steps).  This is a brief guide to help you transition from the nice and tidy Programming Exercises of that course to this Kaggle competition, which has some issues not encountered in those exercises.  In short, I hope it saves you some time and spares you some hassle.  I'll also be documenting my own attempt to tackle this machine learning problem.

I originally built a Linear Classifier and DNN Classifier using a lot of the code from the Programming Exercises.  For the features, I only incorporated a few features (basically, any whose data did NOT consist of a bunch of sentences like the essays and the descriptions).  But these models didn't do much better than a random guesser.  So the next step was to incorporate those text-heavy features into the model.

However, when I attempted to do so, I couldn't get the code to work the way I wanted.  I ended up completely changing how I load the data (I couldn't find another solution) so that TensorFlow would treat those strings as individual words to compare against a vocabulary_list, rather than as full sentences (which makes a comparison against a vocabulary_list of individual words useless).  I've kept the old code for the few-feature models in case you want to see how to get those models up and running.  Be aware, though, that mixed in with that code may be code that only really applies to the newer model that looks at text data as individual words.

**2.  Kaggle vs Colab (Google's Colaboratory)**
<a id="kaggle"></a>

The Programming Exercises in the Machine Learning Crash Course were done in Google's Colaboratory environment.  So my first question was "How do I get the data from this Kaggle competition into Colab?"  While you can likely do that using...

* [External data: Drive, Sheets, and Cloud Storage](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=c2W5A2px3doP)
* and/or [Kaggle API](https://github.com/Kaggle/kaggle-api)

...it's unnecessary.  The Kaggle environment is effectively the same, making it easier just to get started right here.

So, to use Kaggle and get your programming environment set up, you'll need:

i. A [Kaggle account](https://www.kaggle.com/account/login) (which requires an email address and a phone number for verification)

ii. To start [a new kernel](https://www.kaggle.com/kernels)

iii. To then choose Notebook (if you want your environment to pretty much have the same feel as Colab)

iv. To click on the Data tab at the top of your Notebook and then click "Add Data Source" (searching for the DonorsChoose competition and then agreeing to its rules)

That's pretty much it!  You should now have access to the DonorsChoose files in your Notebook and can begin working with them in your code.

**3. Setup - Importing Packages and Data Files**
<a id="setup"></a>

This should be similar to what you saw in the Programming Exercises.  Your Notebook should load up with a handy note that the data files have this file path:  '../input/filename.csv', which we'll use to load the csv files into a Pandas DataFrame in the code below: 

In [1]:
# Packages
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.python.data import Dataset
import sklearn.metrics as metrics
import os # to access data files (found in the "../input/" directory)
import re  #regular expressions

# More Packages from https://colab.research.google.com/notebooks/mlcc/sparsity_and_l1_regularization.ipynb?hl=en#scrollTo=pb7rSrLKIjnS
import math
from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt

# Data Files  NOTE:  My file paths here may look different since I'm using files from multiple data sources 
training_dataset = pd.read_csv('../input/donorschoose-application-screening/train.csv', sep=',')
resources_dataset = pd.read_csv('../input/donorschoose-application-screening/resources.csv', sep=',')
test_dataset = pd.read_csv('../input/donorschoose-application-screening/test.csv', sep=',')
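
If you're not sure what the exact file paths are in your own Notebook (the folder layout under '../input' can differ from mine), a small helper like the one below can list what's there.  This is just a sketch; `list_data_files` is a name I made up for illustration, not something provided by Kaggle:

```python
import os

def list_data_files(root, extension=".csv"):
    """Return sorted paths (relative to root) of files ending with extension."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extension):
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, root))
    return sorted(found)

# In a Kaggle Notebook, the data sources are expected under "../input".
if os.path.isdir("../input"):
    for path in list_data_files("../input"):
        print(path)
```

Whatever it prints can then be appended to '../input/' in the pd.read_csv() calls.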

*i.  Importing Data from Other Data Sources*
<a id="otherdata"></a>

As I continued to work on this project, I decided that I wanted to check for misspelled words in the data as an additional feature.  Long story short, I came to the conclusion that it would be great to have an actual dictionary of English words to compare against.  The following additional data sources are an attempt to construct such a dictionary (though they aren't sufficiently complete):

 - [English Word Frequency](https://www.kaggle.com/rtatman/english-word-frequency)
 - [Unix words](https://www.kaggle.com/nltkdata/unix-words) (not many unique words compared to above)
 - [Brown Corpus](https://www.kaggle.com/nltkdata/brown-corpus) (also didn't add much, so I opted not to actually use this one)

If you do want to add additional data sources to your Notebook/Kernel, it's the exact same process as described earlier via the "Data" tab at the top of the Notebook.  The only difference is that adding a 2nd data source *alters the file paths of where those data files are stored in your Notebook*, meaning there are now subfolders under the 'input' folder.

In [2]:
#English Word Frequency
dictionary_dataset = pd.read_csv('../input/english-word-frequency/unigram_freq.csv', sep=',')

#Unix Words
#https://stackoverflow.com/questions/3277503/in-python-how-do-i-read-a-file-line-by-line-into-a-list
with open('../input/unix-words/words/en') as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content] 

unix_set = set(content)

#Brown Corpus
USING_BROWN = False  #Though Brown has "1 million words," most of them are NOT unique.  So this wasn't that helpful
if USING_BROWN:
    #https://www.kaggle.com/alvations/testing-1000-files-datasets-from-nltk
    #^Code to get data into pandas dataframe:
    from nltk.corpus import (LazyCorpusLoader, CategorizedTaggedCorpusReader)

    import nltk
    # Removing the original path
    if '/usr/share/nltk_data' in nltk.data.path:
        nltk.data.path.remove('/usr/share/nltk_data')
    nltk.data.path.append('../input/brown-corpus/brown')
    nltk.data.path

    brown = LazyCorpusLoader('brown', CategorizedTaggedCorpusReader, r'c[a-z]\d\d',
                             cat_file='cats.txt', tagset='brown', encoding="ascii",
                            nltk_data_subdir='brown-corpus/brown')

    list_of_lists_of_tuples = brown.tagged_sents()

    list_sents = []
    for list_of_tuples in list_of_lists_of_tuples:
        sentence = " ".join([tupl[0] for tupl in list_of_tuples])
        list_sents.append(sentence)

    list_sents

    pd_brown = pd.DataFrame({'sentence': list_sents})
    
    #some text formatting:
    pd_brown['sentence'] = pd_brown['sentence'].replace(r'(\-)|(/)|(\.\.\.\.)|(\.\.\.)', ' ', regex=True)
    pd_brown['sentence'] = pd_brown['sentence'].replace('["#$%&\'()*+,.:!;<>?@\\^_`{|}~]', '', regex=True)
    pd_brown['sentence'] = pd_brown['sentence'].replace('(   )|(  )', ' ', regex=True)
    pd_brown['sentence'] = pd_brown['sentence'].str.lower()
    pd_brown[0:10]

**4. Some Quirks in the Data**
<a id="quirks"></a>

i.  In the Programming Exercises, all your feature training data would come from a single csv file and get loaded into a single Pandas DataFrame.  However, in this project, part of the training data is in train.csv and part is in resources.csv, the common link being the same 'id' for a given entry in both files.  Thus, we're going to need some way to combine the two data sets.

ii.  Additionally, resources.csv has the added twist that *not all of an id's information is in the same row*.  Instead, some project ids request multiple resources, which therefore span multiple rows.  For example, submission p069063:

In [None]:
resources_dataset[0:9]

So, you might be thinking about how to group some (or all) of that information together as a single-line entry.

iii.  If you play around with resources.csv, you may eventually discover that some of the description data is actually missing.  Some of the cells are filled with NaN, which you may want to handle.  For example, p194324:

In [None]:
resources_dataset[37602:37612]

**5. Pandas Pain**
<a id="pandaspain"></a>

In the Programming Exercises from the course, there really weren't that many different functions used from Pandas.  Pandas is cool because it can do a million different complex things.  However, Pandas is also frustrating for the same reason, especially when you just want to do something that seems trivial and yet have to comb through documentation or StackOverflow to figure out how to do that.  To spare you from (some of) that searching, let's work through the quirks listed above in reverse order.

*iii.  NaNs in the data*
<a id="nans"></a>

A handy Pandas function to know for a DataFrame is:  **.isnull().any()**, which alerts you to missing values in any of your columns:


In [None]:
resources_dataset.isnull().any()

(Note:  If you wanted to locate one of the rows and see the missing data for yourself, you could use the **.loc[]** indexer.  It takes square brackets, not parentheses.)

In [None]:
resources_dataset.loc[resources_dataset["description"].isnull()]

Now, let's go ahead and replace those NaNs so that they don't cause any trouble if we were to, say, try to join together all the description strings for a project id into a single string.  To do so, we'll use the function **.fillna({column_name : replacement_value})**:

In [3]:
resources_dataset = resources_dataset.fillna({'description' : 'no_detail'})
resources_dataset.isnull().any()
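
Here's a tiny, self-contained sketch of the same idea on a made-up DataFrame (the `toy` data is invented for illustration), in case you want to see what these calls return without loading the real files:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "description": ["pencils", np.nan, "markers"],
    "quantity": [3, 1, 2],
})

# Which columns contain at least one missing value?
has_missing = toy.isnull().any()
# How many missing values does each column contain?
missing_counts = toy.isnull().sum()

# Fill only the 'description' column; other columns are left untouched.
filled = toy.fillna({"description": "no_detail"})
print(missing_counts)
print(filled)
```

Passing a dictionary to .fillna() is what limits the replacement to specific columns; passing a single value instead would fill every column.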

*ii. Condensing multiple rows into one single row*
<a id="multiplerows"></a>

In order to do this, we'll need the **.groupby()** function for grouping and the **.agg()** function in order to perform an operation on the cells that are being condensed.  But first, let's create a new feature that we'll likely want to consider:  the cost of a given request.  The resources.csv file contains a request's quantity and its price.  If we multiply those two values together, we'll end up with the cost for that request.  This is something done in the Programming Exercises, so it should likely look familiar:

In [4]:
# Make a cost column (quantity * price)
resources_dataset['cost'] = resources_dataset['quantity'] * resources_dataset['price']
resources_dataset[0:9]

Now, let's use the **.groupby()** function and the **.agg()** function to condense, for example, all those p069063 rows into a single line such that we just have one p069063 entry with a total cost that has summed up all the costs of the individual requests.  In this example, I'm only interested in the 'id' and the 'cost' columns, so the final DataFrame will only have those two columns:

In [5]:
#create a total_cost column

grouped_ids = resources_dataset.groupby(['id'], as_index=False)
resources_condensed = grouped_ids.agg({'cost' : 'sum'}).rename(columns={'cost' : 'total_cost'})
#resources_condensed.loc[resources_condensed['id'] == 'p069063']
resources_condensed[69060:69065]

Cool!  Let's step through the code.  We pass to **.groupby()** the column that contains the values that repeat in multiple rows.  We also pass in as_index as False so that 'id' (such as p069061, p069062, etc.) doesn't get used as the index.  Instead, each row will have a number index like normal (such as 69060, 69061, etc.)

The **.agg()** function is then called so that we can perform an operation on the cost values that all belong to the same id.  (In this case, we want to add them all up together.)  We pass in a dictionary of the form *{ column_name : operation_we_want_to_perform }*.  Lastly, I just renamed the column to show that it's the final, total cost for all of the requests made by a given id.
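
To see the grouping in isolation, here's a minimal sketch on invented data (the ids and costs below are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["p2", "p1", "p1", "p2", "p3"],
    "cost": [5.0, 10.0, 2.5, 1.0, 4.0],
})

# as_index=False keeps 'id' as an ordinary column with a 0..n-1 row index.
condensed = (
    toy.groupby("id", as_index=False)
       .agg({"cost": "sum"})
       .rename(columns={"cost": "total_cost"})
)
print(condensed)
```

Each id ends up on exactly one row, and by default the groups come back sorted by id.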

Now, if you wanted to, you could combine the other columns as well by passing a dictionary with multiple entries into **.agg()** (one entry per column).  I'm going to go ahead and do that just in case I decide to do anything with the description data later on:

In [6]:
#Join together all of the columns
grouped_ids = resources_dataset.groupby(['id'], as_index=False)
all_condensed = grouped_ids.agg({'description' : lambda x: ' '.join(x), #there's no simple single keyword option like 'sum'
                                 'quantity' : 'sum',
                                 'cost' : 'sum'}).rename(columns={'description' : 'full_description',
                                                                  'quantity' : 'total_quantity',
                                                                  'cost' : 'total_cost'})
all_condensed[69060:69065]

Trying to look at the full string of a single cell (to check that this actually worked) is a bit of a pain:

In [None]:
#Show that p069063's full_description actually contains the joined text:
entry = all_condensed.loc[all_condensed['id'] == 'p069063'].reset_index()
entry.loc[0, 'full_description']
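
If that feels clunky, there are a couple of shorter ways to pull one cell's full string.  The sketch below uses made-up data, and the second option assumes each id appears only once (which should be true after the groupby):

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["p1", "p2"],
    "full_description": ["pencils no_detail", "markers"],
})

# Option 1: boolean mask plus column name, then take the first match by position.
text_a = toy.loc[toy["id"] == "p2", "full_description"].iloc[0]

# Option 2: index by 'id' and use .at for a single scalar lookup (assumes unique ids).
text_b = toy.set_index("id").at["p2", "full_description"]

print(text_a)
```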

*i. Merging together two DataFrames*
<a id="combining"></a>

Okay, we finally have the desired data from resources.csv organized!  We now want to combine this DataFrame with our other training DataFrame.  To do so, we'll use the **.merge()** function.  This function takes as arguments the DataFrame we want as our "left columns" followed by the DataFrame we want to tack on as the "right columns".  We also specify the common link between the two DataFrames (in this case, 'id') via *on='id'*:

In [7]:
combined_training_dataset = pd.merge(training_dataset, all_condensed, on='id')
combined_training_dataset[0:5]

Woo hoo!  All that's left to do is the same kind of merging but with resources.csv and test.csv:

In [8]:
combined_test_dataset = pd.merge(test_dataset, all_condensed, on='id')
combined_test_dataset[0:5]
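
One thing worth knowing:  pd.merge() does an inner join by default, so any row whose id is missing from the other DataFrame silently disappears.  Here's a small sketch with invented data showing the difference between the default and a left join (the `applications` and `resources` frames are hypothetical stand-ins):

```python
import pandas as pd

applications = pd.DataFrame({"id": ["p1", "p2", "p3"], "title": ["A", "B", "C"]})
resources = pd.DataFrame({"id": ["p1", "p2"], "total_cost": [12.5, 6.0]})

# Default is an inner join: p3 has no match in `resources`, so it is dropped.
inner = pd.merge(applications, resources, on="id")

# A left join keeps every application; unmatched rows get NaN in the right-hand columns.
# validate raises a MergeError if 'id' is unexpectedly duplicated on either side.
left = pd.merge(applications, resources, on="id", how="left", validate="one_to_one")

print(len(inner), len(left))
```

Comparing the row counts before and after a merge is a quick sanity check that nothing was lost.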

I believe the data is now organized in a manner similar to what we encountered in the Programming Exercises.  Don't forget to also check for and handle other NaNs in the datasets:

In [9]:
combined_training_dataset['teacher_prefix'] = combined_training_dataset['teacher_prefix'].fillna('none')
combined_test_dataset['teacher_prefix'] = combined_test_dataset['teacher_prefix'].fillna('none')

combined_training_dataset = combined_training_dataset.fillna('')
combined_test_dataset = combined_test_dataset.fillna('')

*iv. Changing Only Certain "Cells" (that is, Update Column Values for Only Certain Rows)*
<a id="changecells"></a>

I'm using "cells" because I think it's easier to think in terms of spreadsheets.  The data description for this competition notes that applications before a certain date required 4 essays.  It turns out that old essays 1 and 2 cover the same topics as current essay 1, and old essays 3 and 4 cover the same topics as current essay 2.  So what I'd like to do is combine essays 1 and 2 together into the project_essay_1 column and combine essays 3 and 4 together into the project_essay_2 column **ONLY for those "old application" rows**.  The "new application" rows should be left alone.

*(Note:  I don't entirely know if it matters that the two essays are treated as two separate features.  However, in experiments, my models had a slightly better AUC when they were left as separate features.  I also tried smashing ALL of the text features together out of curiosity, but that did result in a worse AUC.)*

The "cell selection" method in Pandas that I'll be using is .loc[row_indexer, col_indexer].  To select the "old application" rows, I'm going to ask Pandas to find all the rows that currently DO have text in project_essay_4.  That will be the row_indexer.  The col_indexer will be the column where I want to leave the two combined essays.

In [10]:
combined_training_dataset.loc[combined_training_dataset['project_essay_4'] != '', 'project_essay_1'] = combined_training_dataset['project_essay_1'] + ' ' + combined_training_dataset['project_essay_2']
combined_training_dataset.loc[combined_training_dataset['project_essay_4'] != '', 'project_essay_2'] = combined_training_dataset['project_essay_3'] + ' ' + combined_training_dataset['project_essay_4']

#and do the same for the test data:
combined_test_dataset.loc[combined_test_dataset['project_essay_4'] != '', 'project_essay_1'] = combined_test_dataset['project_essay_1'] + ' ' + combined_test_dataset['project_essay_2']
combined_test_dataset.loc[combined_test_dataset['project_essay_4'] != '', 'project_essay_2'] = combined_test_dataset['project_essay_3'] + ' ' + combined_test_dataset['project_essay_4']

combined_training_dataset[16:20]

To see if the above worked, look at the whole string in an "old application"'s cell:

In [None]:
combined_training_dataset.loc[18, 'project_essay_2']
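
Here's the same "update only certain rows" pattern boiled down to a toy DataFrame (the column names are shortened and the data is made up).  Note that the order of the two assignments matters, since the first one needs the original essay_2 text:

```python
import pandas as pd

toy = pd.DataFrame({
    "essay_1": ["a", "b"],
    "essay_2": ["c", "d"],
    "essay_3": ["e", ""],
    "essay_4": ["f", ""],
})

# Compute the mask once, before any column is modified.
is_old = toy["essay_4"] != ""

# The right-hand side covers every row, but .loc aligns on the index
# and only writes to the masked rows.
toy.loc[is_old, "essay_1"] = toy["essay_1"] + " " + toy["essay_2"]
toy.loc[is_old, "essay_2"] = toy["essay_3"] + " " + toy["essay_4"]

print(toy[["essay_1", "essay_2"]])
```

The first row (an "old application") gets the combined text, while the second row is left alone.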

*v. Editing Lots of String Data (Removing Punctuation, Setting to Lowercase)*
<a id="stringdata"></a>

This pertains only to the newer models where I'm incorporating the text-heavy features.  To make it so that the vocabulary_lists can match the words in the text sentences, it's best to strip out punctuation marks and lowercase all the letters.

Additionally, this now attempts to better format the text such that it's easier to identify misspelled words and typos.  For comparing words to an English dictionary, I now strip out URLs and Twitter hashtags that obviously aren't in the dictionary.  For typos (like not spacing after a comma or quotation mark), I replace those with an equal sign ( = ) and then tally them up later on.

The 'check the dictionary' code takes a while to run, so I've left the old code here if you're not interested in looking for misspellings.  (In my experiments, incorporating the misspellings data hasn't yielded any better AUC.)  And if you want to skip ahead, here's [the next section](#dateinfo).

*Old, adequate code to format text* :

In [12]:
USING_DICTIONARY_CODE = True
if not USING_DICTIONARY_CODE:
    #OLD Text preprocessing
    the_essays = [
        'project_essay_1',
        'project_essay_2',
        'full_description',  #not an essay... but has the same problems
        'project_resource_summary'  # probably has the same issues
    ]

    def tidy_essays(dataset):
        for col_name in the_essays:
            #strip out the irritating new line stuff:
            dataset[col_name] = dataset[col_name].replace(r'(\\r)|(\\n)', ' ', regex=True)
            #get everything down to just one space (hopefully):
            dataset[col_name] = dataset[col_name].replace(r'  ', ' ', regex=True)
            dataset[col_name] = dataset[col_name].replace(r'  ', ' ', regex=True)
            dataset[col_name] = dataset[col_name].replace(r'  ', ' ', regex=True)

    #do both training and test datasets:
    tidy_essays(combined_training_dataset)
    tidy_essays(combined_test_dataset)
    
    #OLD finish cleaning up all the text
    text_columns = [
        'project_title',
        'project_essay_1',
        'project_essay_2',
        'project_resource_summary',
        'project_subject_categories',
        'project_subject_subcategories',
        'full_description'
    ]

    punc_pattern = '["#$%&\'()*+,.:;<=>?@\\\\^_`{|}~]'

    def text_edits(dataset):
        for col_name in text_columns:
            #lowercase all
            dataset[col_name] = dataset[col_name].str.lower()
            #treat !'s as separate words just in case the model picks up on something:
            dataset[col_name] = dataset[col_name].replace(r'!', ' !', regex=True)
            #don't remove hyphens and smash words together; keep the words separate
            #same (hopefully) for /
            dataset[col_name] = dataset[col_name].replace(r'(\-)|(/)', ' ', regex=True)  #ADDED so hopefully works
            #replace the rest of the punctuation:
            dataset[col_name] = dataset[col_name].replace(punc_pattern, '', regex=True)

    #do both training and test datasets:
    text_edits(combined_training_dataset)
    text_edits(combined_test_dataset)

    display.display(combined_training_dataset[16:22])  #a bare expression inside an if block isn't shown automatically

*New code to find misspellings and typos.*
<a id="misspelledwords"></a>

Note:  This first cell takes a long time to run (probably 3 minutes on just the training dataset).

In [13]:
#USING_DICTIONARY_CODE located in cell above
if USING_DICTIONARY_CODE:
    text_columns = [
        'project_title',
        'project_essay_1',
        'project_essay_2',
        'project_resource_summary',
        'project_subject_categories',
        'project_subject_subcategories',
        'full_description'
    ]

    def tidy_text(dataset):
        for col_name in text_columns:
            #new idea:  handle weird utf8 symbol characters like smiley faces
            #  replace with ?, then strip out that question mark and replace it with a space:
            #first, remove actual ?'s and replace with no space '' (this is because some URLs contain ?'s):
            dataset[col_name] = dataset[col_name].replace(r'\?', '', regex=True)
            dataset[col_name] = dataset[col_name].str.encode('ascii', 'replace')  # replaces with a ? mark
            dataset[col_name] = dataset[col_name].str.decode('utf8') #turn back to normal string
                #new ?'s will be removed in a later regex
            #new idea:  handle acronyms since they're not going to show up in dictionaries:
                #is the actual acronym important?  I'm assuming not.  But perhaps you should work on a copy of the data
            dataset[col_name] = dataset[col_name].replace('([A-Z][.])+', '', regex=True)
            #0th - lowercase everything:
            dataset[col_name] = dataset[col_name].str.lower()
            #1st - strip out the irritating new lines that literally became \\r\\n in the text:
            #replace with a space so that words aren't unfairly smashed together and count as an error
            dataset[col_name] = dataset[col_name].replace('(\\\\r)|(\\\\n)', ' ', regex=True)
            #2nd - remove all the website links
            # .com AND .org AND .gov AND .edu AND .net
            dataset[col_name] = dataset[col_name].replace(r'[^ ]*\.com[^ ]*', '', regex=True)
            dataset[col_name] = dataset[col_name].replace(r'[^ ]*\.org[^ ]*', '', regex=True)
            dataset[col_name] = dataset[col_name].replace(r'[^ ]*\.gov[^ ]*', '', regex=True)
            dataset[col_name] = dataset[col_name].replace(r'[^ ]*\.edu[^ ]*', '', regex=True)
            dataset[col_name] = dataset[col_name].replace(r'[^ ]*\.net[^ ]*', '', regex=True)
            #2ndb - also remove tweet hashtags:
            dataset[col_name] = dataset[col_name].replace('#[a-z][^ ]*', '', regex=True)
            #3rd - remove other \\ affecting things like " marks
            dataset[col_name] = dataset[col_name].replace('\\\\', '', regex=True)
            #4th - don't remove hyphens/slashes and smash words together; instead, keep the words separate
            #same with ellipsis
            dataset[col_name] = dataset[col_name].replace(r'(\-)|(/)|(\.\.\.\.)|(\.\.\.)', ' ', regex=True)
            #5th - some people didn't space correctly.  Replace those specific instances involving punctuation marks
            #with a = sign to better handle that particular typo later on when looking for misspelled words
            #(first, remove any equal signs already in the text just in case)
            dataset[col_name] = dataset[col_name].replace('=', '', regex=True)
            dataset[col_name] = dataset[col_name].replace(r'(?<=\w)[,.;!"](?=\w)', '=', regex=True)
            #6th - treat !'s as separate words in case the model picks up on something related to too many exclamatory remarks:
            dataset[col_name] = dataset[col_name].replace('!', ' ! ', regex=True)
            #7th remove the remaining punctuation:
            dataset[col_name] = dataset[col_name].replace('["#$%&\'()*+,.:;[\]<>?@\\^_`{|}~]', '', regex=True)
            #8th - make everything have a separation of just one space:
            dataset[col_name] = dataset[col_name].replace(r' {2,}', ' ', regex=True)
            #and no space at the end:
            dataset[col_name] = dataset[col_name].str.strip()

        return

    tidy_text(combined_training_dataset)
    tidy_text(combined_test_dataset)

Here's an example of an essay with two cases of an '=' sign inserted where someone forgot to space after punctuation:

In [None]:
combined_training_dataset['project_essay_2'].iloc[3]

Now, count up these "typing errors" in each text column.  I assume a reader would be more forgiving of these, but I track them anyway.  Then, fix them (replace the = sign with a space) so that they don't show up in the vocabulary list for that column.

In [14]:
if USING_DICTIONARY_CODE:
    COLUMNS_TO_CHECK = [
        'project_title',
        'project_essay_1',
        'project_essay_2',
        'project_resource_summary'
    ]

    def total_typing_errors(df):
        new_col_names = []
        for col in COLUMNS_TO_CHECK:
            new_col_name = col + '_typing_errors'
            new_col_name = new_col_name[8:]  #drop 'project_' from the name to make it shorter
            df[new_col_name] = df[col].str.count('=')
            new_col_names.append(new_col_name)

        df['total_typing_errors'] = df[new_col_names].sum(axis=1)

        return

    total_typing_errors(combined_training_dataset)
    total_typing_errors(combined_test_dataset)
    
    #remove those = signs now:
    for col_name in text_columns:
        combined_training_dataset[col_name] = combined_training_dataset[col_name].replace('=', ' ', regex=True)
        combined_test_dataset[col_name] = combined_test_dataset[col_name].replace('=', ' ', regex=True)

Here's the full row for the example from earlier to check that the columns and values are correct:

In [None]:
combined_training_dataset[3:4]

And here's to check that the '=' errors have been removed:

In [None]:
combined_training_dataset['project_essay_2'].iloc[3]
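
To try the mark-count-clean idea from the cells above in isolation, here's a short sketch on two invented sentences.  It uses the same lookbehind/lookahead pattern, but with the Series .str methods instead of DataFrame .replace(), which is just my choice for the example:

```python
import pandas as pd

essays = pd.Series([
    "we need books,pencils and paper.thank you",
    "nothing wrong here, thanks.",
])

# Lookbehind/lookahead: a punctuation mark with a word character directly on both sides.
marked = essays.str.replace(r'(?<=\w)[,.;!"](?=\w)', "=", regex=True)

# Tally the markers per row...
typing_errors = marked.str.count("=")
# ...then turn each marker into a space so the words are separated.
cleaned = marked.str.replace("=", " ", regex=False)

print(typing_errors.tolist())
print(cleaned[0])
```

One caveat:  since \w also matches digits, a decimal number like 3.5 would get flagged by this pattern too, so the counts are only a rough signal.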

Now that the text is fully formatted, grab the unique words from each column to later compare to a dictionary, looking for potential misspellings.

In [16]:
if USING_DICTIONARY_CODE:
    join_cols = [
        'project_title',
        'project_essay_1',
        'project_essay_2',
        'project_resource_summary',
        'full_description'
    ]
    #first, join the rows of the test dataset with the training dataset to get a full look at all the words in the data
    all_dataset = pd.concat([combined_training_dataset[join_cols], combined_test_dataset[join_cols]])
    
    def get_unique_word_sets(df):
        new_set = set()
        df.str.split().apply(new_set.update)

        return new_set

    title_set = get_unique_word_sets(all_dataset['project_title'])  
    essay_1_set = get_unique_word_sets(all_dataset['project_essay_1'])  
    essay_2_set = get_unique_word_sets(all_dataset['project_essay_2']) 
    summary_set = get_unique_word_sets(all_dataset['project_resource_summary'])
    description_set = get_unique_word_sets(all_dataset['full_description'])
    print(len(title_set))
    print(len(essay_1_set))
    print(len(essay_2_set))
    print(len(summary_set))
    print(len(description_set))

Turn the English Word Frequency data into a set:

In [19]:
if USING_DICTIONARY_CODE:
    #Contains 333,332 words
    #And '!' as acceptable word
    dictionary_set = set(dictionary_dataset['word'])
    dictionary_set.add('!')
    print(len(dictionary_set))

From each column's unique words, subtract out the ones that do exist in the dictionary (using both the English Word Frequency data and the Unix data):

In [20]:
if USING_DICTIONARY_CODE:
    title_set = title_set - dictionary_set - unix_set  
    essay_1_set = essay_1_set - dictionary_set - unix_set
    essay_2_set = essay_2_set - dictionary_set - unix_set
    summary_set = summary_set - dictionary_set - unix_set
    description_set = description_set - dictionary_set - unix_set
    print(len(title_set))
    print(len(essay_1_set))
    print(len(essay_2_set))
    print(len(summary_set))
    print(len(description_set))
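
Here's the "collect unique words, then subtract the dictionary" idea on a toy scale.  The sentences and the `known_words` set are invented stand-ins for the real columns and dictionaries:

```python
import pandas as pd

texts = pd.Series(["we need new bookz", "new books for our clasroom"])
known_words = {"we", "need", "new", "books", "for", "our", "classroom"}  # stand-in dictionary

# Split each row on whitespace and fold all the tokens into one set.
unique_words = set()
for tokens in texts.str.split():
    unique_words.update(tokens)

# Set difference leaves only the words the dictionary doesn't recognize.
unknown_words = unique_words - known_words
print(sorted(unknown_words))
```

Whatever survives the subtraction is only a *candidate* misspelling; it could just as easily be a real word that the dictionary is missing.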

Most of the columns reference product names that aren't going to be in a dictionary.  Thus, I'll work under the assumption that full_description is mostly product descriptions (hopefully copied-and-pasted from a product page that has been proofread and doesn't have many misspellings itself) and subtract its leftover words from the other columns' sets:

In [21]:
if USING_DICTIONARY_CODE:
    title_set = title_set - description_set
    essay_1_set = essay_1_set - description_set
    essay_2_set = essay_2_set - description_set
    summary_set = summary_set - description_set
    print(len(title_set))
    print(len(essay_1_set))
    print(len(essay_2_set))
    print(len(summary_set))

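Lastly, I'm going to remove from the sets any word that isn't purely letters (the code checks .isalpha(), so anything containing a number or other non-letter character gets dropped).  More than likely, that's some sort of product name or acceptable abbreviation (like 3rd):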

In [22]:
if USING_DICTIONARY_CODE:
    def make_set_alpha(word_set):
        #keep only the words made up entirely of letters
        return {word for word in word_set if word.isalpha()}

    title_set = make_set_alpha(title_set)
    essay_1_set = make_set_alpha(essay_1_set)
    essay_2_set = make_set_alpha(essay_2_set)
    summary_set = make_set_alpha(summary_set)
    print(len(title_set))
    print(len(essay_1_set))
    print(len(essay_2_set))
    print(len(summary_set))
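Here's a quick sketch of what an `isalpha()` filter keeps and drops, using made-up words (the `keep_alpha_only` name is just for illustration).  One thing worth knowing: `str.isalpha()` rejects *any* non-letter character, so words containing hyphens or apostrophes get dropped too, not just the ones with digits.

```python
def keep_alpha_only(word_set):
    """Return only the words made up entirely of letters."""
    return {word for word in word_set if word.isalpha()}

# Hypothetical sample: two digit-containing words, one hyphenated word, two typos.
sample = {"3rd", "ipad2", "teh", "well-known", "studnets"}
print(sorted(keep_alpha_only(sample)))
```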

Now, unfortunately, the dictionaries used in making these sets are not complete.  Thus, the above sets are still a mix of real words, product names, and misspelled words, so this won't be very accurate.  However, I'll still use them to count up the possibly misspelled words in each column:

In [23]:
if USING_DICTIONARY_CODE:
    #for each column is a matching set of 'misspelled' words:
    COLS_AND_SETS = {
        'project_title' : title_set,
        'project_essay_1' : essay_1_set,
        'project_essay_2' : essay_2_set,
        'project_resource_summary' : summary_set
    }
    
    def count_spelling_errors(string, word_set):
        list_words = string.split()
        error_count = 0
        for word in list_words:
            if word in word_set:
                error_count += 1

        return error_count

    def total_spelling_errors(df):
        new_col_names = []
        for col, word_set in COLS_AND_SETS.items():
            new_col_name = col + '_spelling_errors'
            new_col_name = new_col_name[8:]  #drop 'project_' from the name to make it shorter
            df[new_col_name] = df[col].apply(count_spelling_errors, args=(word_set,))
            new_col_names.append(new_col_name)

        df['total_spelling_errors'] = df[new_col_names].sum(axis=1)

        return
        
    total_spelling_errors(combined_training_dataset)
    total_spelling_errors(combined_test_dataset)
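To make the counting concrete, here's a small sketch on a made-up sentence (the function name and the flagged words are hypothetical).  Every occurrence counts, so a repeated unknown word is counted each time it shows up:

```python
def count_flagged_words(text, flagged_words):
    """Count how many whitespace-separated tokens appear in flagged_words."""
    return sum(1 for word in text.split() if word in flagged_words)

flagged = {"recieve", "studnets"}
essay = "my studnets will recieve books and my studnets will read"
print(count_flagged_words(essay, flagged))
```

That also means one unusual product name repeated throughout an essay can inflate the count quite a bit.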

Let's make sure all the columns got added correctly:

In [25]:
if USING_DICTIONARY_CODE:
    display(combined_training_dataset.loc[combined_training_dataset['total_spelling_errors'] > 12])  #display() since a bare expression inside an if-block isn't shown

And let's look at my favorite essay in that bunch:

In [26]:
combined_training_dataset['project_essay_2'].loc[110575]

Nice.  And... just to make sure that randomly smashing the keyboard is *bad* for an application:

In [None]:
combined_training_dataset['project_is_approved'].loc[110575]

Okay!  But, sadly, since my dictionaries aren't complete, here's an example of one with a bunch of "errors" that aren't really errors:

In [None]:
combined_training_dataset['project_essay_2'].loc[172786]

That's full of website names (that weren't written as URLs) as well as product names that didn't show up in full_description.

*vi. Handling Date Information*
<a id="dateinfo"></a>

After working more on this project, I've decided to incorporate the project_submitted_datetime as a feature.  Based on some data visualization later on, I've decided to split this information by week of the years this dataset spans.  I don't know if there's some preferred method for doing this, but I like the idea of converting these values to an integer like 'last two digits of year' * 100 + week (for example, 16 * 100 + 25 = 1625).  Since they're numbers, I originally thought about bucketizing them.  However, I wonder if treating them as strings and thus as a vocabulary_list would be a better representation.  I'm going with the latter.

In [28]:
#https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-from-pandas-datetime-column-python
#https://stackoverflow.com/questions/17950374/converting-a-column-within-pandas-dataframe-from-int-to-string
#First, get the project_submitted_datetime column into a format that Pandas can work with
combined_training_dataset['project_submitted_datetime'] = pd.to_datetime(combined_training_dataset['project_submitted_datetime'])
combined_training_dataset['year_week'] = combined_training_dataset['project_submitted_datetime'].map(lambda x: 100*(x.year - 2000) + x.week).apply(str)

combined_test_dataset['project_submitted_datetime'] = pd.to_datetime(combined_test_dataset['project_submitted_datetime'])
combined_test_dataset['year_week'] = combined_test_dataset['project_submitted_datetime'].map(lambda x: 100*(x.year - 2000) + x.week).apply(str)
combined_training_dataset[0:5]
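One thing to watch with this encoding: as far as I know, the week number is the ISO week, and around New Year's the ISO week can belong to the neighboring year.  Here's a sketch using only the standard library (both helper names are hypothetical; since `pd.Timestamp` subclasses `datetime`, the same calls should work on the column's values):

```python
from datetime import date

def year_week_calendar(d):
    """Label as in the cell above: calendar year paired with the ISO week number."""
    return str(100 * (d.year - 2000) + d.isocalendar()[1])

def year_week_iso(d):
    """Variant that pairs the ISO week with its own ISO year."""
    iso_year, iso_week, _ = d.isocalendar()
    return str(100 * (iso_year - 2000) + iso_week)

new_years_day = date(2017, 1, 1)  # a Sunday, still in ISO week 52 of 2016
print(year_week_calendar(new_years_day), year_week_iso(new_years_day))
```

Since the labels end up as vocabulary strings, a stray label like this probably just becomes a rare category, so it may not matter much in practice.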

**II.   Data Visualization**
<a id="datavisual"></a>

Colaboratory has a [Welcome notebook](https://colab.research.google.com/notebooks/welcome.ipynb#scrollTo=yv2XIwi5hQ_g) with a small section on Visualization, but a more detailed [Charts notebook](https://colab.research.google.com/notebooks/charts.ipynb) with several Matplotlib examples can be found in the "For more information" section of that Welcome site.

Much of the visualization in the Programming Exercises was already coded in, so.... I want to try making some of these charts for myself!

You can find much nicer and more interesting charts in other kernels, so definitely check those out.  This is now just me messing around with code.

**1.  Bar Chart** - *Number of Approved/Not-Approved Applications Based on teacher_prefix*
<a id="barchart"></a>

In [None]:
import matplotlib.pyplot as plt

In [None]:
#Prefixes and Approved vs Unapproved Application
grouped_prefixes = combined_training_dataset.groupby(["teacher_prefix", "project_is_approved"])
grouped_prefixes = grouped_prefixes.agg({'teacher_prefix' : 'count'}).rename(columns={'teacher_prefix' : 'count'})
grouped_prefixes

In [None]:
arr_counts = grouped_prefixes["count"].tolist()
#split arr_counts into two separate lists:
y_no = []
y_yes = []
for i in range(0, len(arr_counts) - 1):
    if i % 2 == 0:
        y_no.append(arr_counts[i])
    else:
        y_yes.append(arr_counts[i])
y_yes.append(arr_counts[i+1])
y_no.append(0)
x_labels = ['Dr', 'Mr.', 'Mrs', 'Ms', 'Teacher', 'None']

#Make a multiple bar graph
#via https://stackoverflow.com/questions/14270391/python-matplotlib-multiple-bars
N = len(x_labels)
ind = np.arange(N)  # the x locations for the groups
width = 0.4      # the width of the bars
fig = plt.figure()
ax = fig.add_subplot(111)

rects1 = ax.bar(ind, y_yes, width, color='#7CFC00')
rects2 = ax.bar(ind+width, y_no, width, color='#DC143C')

ax.set_ylabel('Submissions Count')
ax.set_xticks(ind + width / 2)  # center each label between its pair of bars
ax.set_xticklabels(x_labels)
ax.legend( (rects1[0], rects2[0]), ('Approved', 'NOT Approved') )

def autolabel(rects):
    for rect in rects:
        h = rect.get_height()
        ax.text(rect.get_x()+rect.get_width()/2., 1.05*h, '%d'%int(h),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

plt.title('Submissions Approved or NOT Approved Based on Name Prefix')
plt.show()

And what are those percentages?

In [None]:
#Chance to get approved by prefix
print('Chance to get approved by prefix')
arr_total_applications = []
for i in range(0, len(y_yes)):
    arr_total_applications.append(y_yes[i] + y_no[i])
    print(x_labels[i] + ': ' + "%f" % (y_yes[i] / arr_total_applications[i]))
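The manual even/odd splitting above depends on every prefix having both a "no" and a "yes" row.  A possible alternative, sketched here on made-up toy data (the `toy` DataFrame is hypothetical), is to let Pandas pivot the counts with `unstack`, which can fill in a 0 for any missing combination:

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use combined_training_dataset.
toy = pd.DataFrame({
    "teacher_prefix": ["Mrs.", "Mrs.", "Ms.", "Ms.", "Ms.", "None"],
    "project_is_approved": [1, 0, 1, 1, 0, 1],
})

# One row per prefix, one column per outcome; missing combinations become 0.
counts = (toy.groupby(["teacher_prefix", "project_is_approved"])
             .size()
             .unstack(fill_value=0))
approval_rate = counts[1] / (counts[0] + counts[1])
print(counts)
print(approval_rate)
```

With this layout, `counts[1]` and `counts[0]` could stand in for `y_yes` and `y_no` without the special-case appends.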

So, for immediate acceptance, just don't submit a prefix on your application, obviously!  #percentagesneverlie

**2. Scatter Plot** - *Acceptance Rate vs. Request Cost within a Binned Range*
<a id="scatterplot"></a>

Curious to see if the acceptance rate differs as the cost of the request gets higher.  There are too many different values for the cost (100, 100.01, 100.02, etc.), so I'm also going to try binning with Pandas to group those costs into ranges.

In [None]:
bins = [0, 150, 200, 250, 300, 350, 400, 500, 600, 750, 1000, 1000000]
combined_training_dataset['binned_costs'] = pd.cut(combined_training_dataset['total_cost'], bins)

grouped_costs = combined_training_dataset.groupby(["binned_costs", "project_is_approved"])
grouped_costs = grouped_costs.agg({'binned_costs' : 'count'}).rename(columns={'binned_costs' : 'count'})

arr_percent_yes = []
for i in range(0, (len(bins) - 1) * 2):
    if i % 2 == 0:
        k = i // 2
        label = 'no'
        no_count = grouped_costs["count"].values[i]
        print("%d+ - %s: %d" % (bins[k], label, no_count))
    else:
        label = 'yes'
        yes_count = grouped_costs["count"].values[i]
        percent_yes = yes_count / (no_count + yes_count)
        arr_percent_yes.append(percent_yes)
        print("%d+ - %s: %d, percent: %f" % (bins[k], label, yes_count, percent_yes))

In [None]:
arr_x_labels = []
for i in range(0, len(bins)-2):
    arr_x_labels.append("%d+" % (bins[i]))
arr_x_labels.append("_1000+") #add the _ so that the plot doesn't alphabetize the numbers
 
plt.scatter(arr_x_labels, arr_percent_yes)
plt.xlabel("Cost of Request")
plt.ylabel("Yes Acceptance Rate")
plt.show()

I just sorta eyeballed the bin ranges.  Still, that downward trend seems sensible as I would expect a higher percentage of cheaper requests to get approved over much pricier ones.

But I'm curious to try out a more official way (using quantiles to split the data into equal group sizes):

In [None]:
#function code via https://colab.research.google.com/notebooks/mlcc/sparsity_and_l1_regularization.ipynb?hl=en#scrollTo=bLzK72jkNJPf
def quantile_based_buckets(feature_values, num_buckets):
    quantiles = feature_values.quantile(
        [(i+1.)/(num_buckets + 1.) for i in range(num_buckets)])
    return [quantiles[q] for q in quantiles.keys()]

quantile_bins = quantile_based_buckets(combined_training_dataset["total_cost"], 12)
#^But why does the DataFrame later show NaN for the lowest range (from 0 to 133)?
#...Looks like Pandas requires you to specify the lower/upper limits if you pass in a list to pd.cut?
quantile_bins.insert(0, 0)
quantile_bins.append(1000000)

combined_training_dataset['quantile_costs'] = pd.cut(combined_training_dataset['total_cost'], quantile_bins)

grouped_costs = combined_training_dataset.groupby(["quantile_costs", "project_is_approved"])
grouped_costs = grouped_costs.agg({'quantile_costs' : 'count'}).rename(columns={'quantile_costs' : 'qcount'})

arr_percent_yes = []
for i in range(0, (len(quantile_bins) - 1) * 2):
    if i % 2 == 0:
        k = i // 2
        label = 'no'
        no_count = grouped_costs["qcount"].values[i]
        print("%d+ - %s: %d" % (quantile_bins[k], label, no_count))
    else:
        label = 'yes'
        yes_count = grouped_costs["qcount"].values[i]
        percent_yes = yes_count / (no_count + yes_count)
        arr_percent_yes.append(percent_yes)
        print("%d+ - %s: %d, percent: %f" % (quantile_bins[k], label, yes_count, percent_yes))
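On the NaN question in the comments above: as far as I can tell, `pd.cut` with an explicit list of edges only labels values that fall between the first and last edge, so anything outside comes back as NaN.  Here's a small sketch on made-up costs, along with `pd.qcut`, which picks quantile edges on its own:

```python
import pandas as pd

costs = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])  # made-up costs

# Edges cover only 25..65, so values outside that range come back as NaN.
inner_only = pd.cut(costs, [25, 45, 65])
print(inner_only.isna().tolist())

# Adding outer edges (as the cell above does) gives every value a bin.
with_outer = pd.cut(costs, [0, 25, 45, 65, 1000000])
print(with_outer.isna().any())

# pd.qcut chooses the quantile edges itself and includes the extremes.
quartiles = pd.qcut(costs, 4)
print(quartiles.value_counts().tolist())
```

Whether `pd.qcut` is a drop-in replacement here depends on the data; heavily repeated values can produce duplicate edges, which it reports as an error by default.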



In [None]:
arr_x_labels = []
for i in range(0, len(quantile_bins)-2):
    arr_x_labels.append("%d+" % (quantile_bins[i]))
arr_x_labels.append("_%d+" % (quantile_bins[-2])) #label the top bin with its real lower edge; keep the _ so that the plot doesn't alphabetize the numbers
 
plt.scatter(arr_x_labels, arr_percent_yes)
plt.xlabel("Cost of Request")
plt.ylabel("Yes Acceptance Rate")
plt.show()

**3. Line Plot** - *Chance of Acceptance Based on Number of Previous Submissions from 0 to 100*
<a id="lineplot"></a>

Out of curiosity, I'm also interested in odds based on previous submissions.

In [None]:
grouped_previous_submissions = combined_training_dataset.groupby(["teacher_number_of_previously_posted_projects", "project_is_approved"])
grouped_previous_submissions = grouped_previous_submissions.agg({'teacher_number_of_previously_posted_projects' : 'count'}).rename(columns={'teacher_number_of_previously_posted_projects' : 'count'})
grouped_previous_submissions[0:202]

In [None]:
arr_percents_yes = []
for i in range(0,202):
    if i % 2 == 0:
        k = i // 2
        label = 'no'
        no_count = grouped_previous_submissions["count"].values[i]
        print("%d - %s: %d" % (k, label, no_count))
    else:
        label = 'yes'
        yes_count = grouped_previous_submissions["count"].values[i]
        percent_yes = yes_count / (no_count + yes_count)
        arr_percents_yes.append(percent_yes)
        print("%d - %s: %d, percent: %f" % (k, label, yes_count, percent_yes))
    

In [None]:
x_num_previous = np.arange(0, 101)
#y is arr_percents_yes

plt.plot(x_num_previous, arr_percents_yes)

plt.xlabel("Number of Previous Submissions")
plt.ylabel("Chance of Accepted Submission")
plt.title("Chance of Accepted Submission Based on Previous Number of Submissions")
plt.show()

An expected general upward trend that tapers off (Note: higher x values [say, 30+] have far fewer entries, especially the much higher x values).  Also interesting that the chance of acceptance for a new applicant is quite high (~82%).

**4. Stacked-Bars Bar Chart** - *Comparing the Top 5 vs Bottom 5 States in Terms of Acceptance Rate*
<a id="stackedbars"></a>

And why not look at the states, too?  (Also going to try getting the *groupby* to act like a normal DataFrame this time.)

In [None]:
grouped_states = combined_training_dataset.groupby(["school_state", "project_is_approved"])
#a way to eliminate the multi-index?:
#https://stackoverflow.com/questions/39778686/pandas-reset-index-after-groupby-value-counts
grouped_states = grouped_states.size().rename('count').reset_index()

arr_percents = []
for i in range(0,102):
    if i % 2 == 0:
        no_count = grouped_states["count"].values[i]
    else:
        yes_count = grouped_states["count"].values[i]
        total_count = no_count + yes_count
        percent_no = no_count / total_count
        percent_yes = yes_count / total_count
        arr_percents.append(percent_no)
        arr_percents.append(percent_yes)
        
grouped_states['chances'] = pd.Series(arr_percents, index=grouped_states.index)
grouped_states


In [None]:
#Find the lowest approval rates:
grouped_states.loc[(grouped_states['project_is_approved'] == 1) & (grouped_states['chances'] < .83)]

In [None]:
#And the highest ones:
grouped_states.loc[(grouped_states['project_is_approved'] == 1) & (grouped_states['chances'] > .868)]
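Since project_is_approved is just 0 or 1, its mean within a group is that group's approval rate.  So another way to find the extremes, sketched on hypothetical toy data, is a plain `groupby(...).mean()` followed by `nsmallest`/`nlargest` instead of hand-picked thresholds:

```python
import pandas as pd

# Hypothetical toy data standing in for combined_training_dataset.
toy = pd.DataFrame({
    "school_state": ["DE", "DE", "DC", "DC", "TX", "TX", "TX", "WY"],
    "project_is_approved": [1, 1, 0, 1, 1, 0, 0, 1],
})

# The target is 0/1, so its mean within a group is that group's approval rate.
rate_by_state = toy.groupby("school_state")["project_is_approved"].mean()

print(rate_by_state.nsmallest(2))  # lowest approval rates
print(rate_by_state.nlargest(2))   # highest approval rates
```

On the real data, `nsmallest(5)` and `nlargest(5)` would give the bottom and top five states directly.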

In [None]:
#Make a stacked bar chart of Top 5 vs Bottom 5 acceptance rates because...  I just wanna see what it looks like...

idxes = ['1 DE/DC', '2 WY/TX', '3 OH/NM', '4 CT/FL', '5 WA/MT']
lowest = [.812639, .815670, .822052, .824500, .828125]
highest = [.891341, .875706, .871467, .871294, .868050]

#apparently the 2nd bar just paints over the first, so the 2nd must be smaller or it gets hidden
plt.bar(idxes, highest, label="DE, WY, OH, CT, WA", color='#87CEFA')
plt.bar(idxes, lowest, label="DC, TX, NM, FL, MT", color='#B22222')

plt.plot()

#make the scale more useful
#https://stackoverflow.com/questions/3777861/setting-y-axis-limit-in-matplotlib
axes = plt.gca()
axes.set_ylim([.75, .92])

plt.title('Submission Acceptance Rates for Top 5 States vs Bottom 5 States')
plt.legend()
plt.xlabel('State')
plt.ylabel("Chance of Accepted Submission")
plt.show()

**5. Pie Chart** - *Percentage of Applications Accepted and NOT Accepted According to Grade*
<a id="piechart"></a>

In [None]:
grouped_grades = combined_training_dataset.groupby(["project_grade_category", "project_is_approved"])
#a way to eliminate the multi-index?:
#https://stackoverflow.com/questions/39778686/pandas-reset-index-after-groupby-value-counts
grouped_grades = grouped_grades.size().rename('count').reset_index()
grouped_grades

In [None]:
arr_no = []
arr_yes = []
arr_labels = []
arr_all = []
for i in range(0, len(grouped_grades["project_grade_category"].values)):
    if i % 2 == 0:
        arr_labels.append("%s - No" % (grouped_grades["project_grade_category"].values[i]))
        arr_no.append(grouped_grades["count"][i])
    else:
        arr_labels.append("%s - Yes" % (grouped_grades["project_grade_category"].values[i]))
        arr_yes.append(grouped_grades["count"][i])
    arr_all.append(grouped_grades["count"][i])

colors = ['#DC143C', '#7CFC00']
plt.pie(arr_all, labels=arr_labels, colors=colors,
        startangle=50,
        explode = (.3, .3, .3, .3, .3, .3, .3, .3),
        autopct = '%1.2f%%',
        shadow=True)

plt.axis('equal')
plt.title('Pie Chart Example')
plt.show()


Would be cool to split the pie into grouped pieces (for example, Grades 3-5 Yes and No exploding off as a combined piece), but [that looks nontrivial](https://stackoverflow.com/questions/20549016/explode-multiple-slices-of-pie-together-in-matplotlib/20556088).

**6. More Line Plots to Look at Dates (Month, Week, Weekday)**
<a id="datedata"></a>

I'm curious if it looks like the date the application was submitted may have any impact on its approval.  If so, this could be a good feature to incorporate.

This requires a couple of new Pandas functions.  The first is pd.to_datetime to get the "project_submitted_datetime" column into a format that Pandas can work with.  To group by month, the groupby function is going to take a pd.Grouper() in place of a typical column name's string.

In [None]:
#combined_training_dataset['project_submitted_datetime'] = pd.to_datetime(combined_training_dataset['project_submitted_datetime'])
  #^done in an earlier step when preprocessing the data
#try grouping by month:
#https://stackoverflow.com/questions/44908383/how-can-i-group-by-month-from-a-date-field-using-python-pandas

grouped_months = combined_training_dataset.groupby([pd.Grouper(key='project_submitted_datetime', freq='1M'), "project_is_approved"])
grouped_months = grouped_months.size().rename('count').reset_index()

arr_percents_month = []
for i in range(0,26):
    if i % 2 == 0:
        no_count = grouped_months["count"].values[i]
    else:
        yes_count = grouped_months["count"].values[i]
        total_count = no_count + yes_count
        percent_no = no_count / total_count
        percent_yes = yes_count / total_count
        arr_percents_month.append(percent_no)
        arr_percents_month.append(percent_yes)
        
grouped_months['chances'] = pd.Series(arr_percents_month, index=grouped_months.index)
grouped_months

And there's definitely some variation, so let's graph that:

In [None]:
x_num_previous_m = np.arange(0, 13)

arr_percents_m_yes = []
for i in range(0,26):
    if not (i % 2 == 0):
        arr_percents_m_yes.append(arr_percents_month[i])


plt.plot(x_num_previous_m, arr_percents_m_yes)

plt.xlabel("Month")
plt.ylabel("Chance of Accepted Submission")
plt.title("Chance of Accepted Submission Based on Month")
plt.show()

So it might be useful to bucketize the data by month.  But would weekly look any different?

In [None]:
#only difference here is freq='1W' instead of '1M'
grouped_weeks = combined_training_dataset.groupby([pd.Grouper(key='project_submitted_datetime', freq='1W'), "project_is_approved"])
grouped_weeks = grouped_weeks.size().rename('count').reset_index()

arr_percents_w = []
for i in range(0,106):
    if i % 2 == 0:
        no_count = grouped_weeks["count"].values[i]
    else:
        yes_count = grouped_weeks["count"].values[i]
        total_count = no_count + yes_count
        percent_no = no_count / total_count
        percent_yes = yes_count / total_count
        arr_percents_w.append(percent_no)
        arr_percents_w.append(percent_yes)
        
grouped_weeks['chances'] = pd.Series(arr_percents_w, index=grouped_weeks.index)
grouped_weeks

And graphing that:

In [None]:
x_num_previous_w = np.arange(0, 53)

arr_percents_yes_w = []
for i in range(0,106):
    if not (i % 2 == 0):
        arr_percents_yes_w.append(arr_percents_w[i])


plt.plot(x_num_previous_w, arr_percents_yes_w)

plt.xlabel("Week")
plt.ylabel("Chance of Accepted Submission")
plt.title("Chance of Accepted Submission Based on Week")
plt.show()

Perhaps monthly would do, but I'm going to go with separating by week since that one stretch has some pretty wild swings.

I also checked to see if the day of the week an application was submitted mattered, but the odds seemed pretty much the same for each weekday.  The code for that is in the hidden snippet if you're curious to see how it's implemented.

In [None]:
#going to also try by day of the week:
#https://stackoverflow.com/questions/13740672/in-pandas-how-can-i-groupby-weekday-for-a-datetime-column
combined_training_dataset['weekday'] = combined_training_dataset['project_submitted_datetime'].dt.weekday

grouped_weekday = combined_training_dataset.groupby(["weekday", "project_is_approved"])
#grouped_dates = grouped_dates.agg({'project_submitted_datetime' : 'count'})
grouped_weekday = grouped_weekday.size().rename('count').reset_index()

arr_percents_weekday = []
for i in range(0,14):
    if i % 2 == 0:
        no_count = grouped_weekday["count"].values[i]
    else:
        yes_count = grouped_weekday["count"].values[i]
        total_count = no_count + yes_count
        percent_no = no_count / total_count
        percent_yes = yes_count / total_count
        arr_percents_weekday.append(percent_no)
        arr_percents_weekday.append(percent_yes)
        
grouped_weekday['chances'] = pd.Series(arr_percents_weekday, index=grouped_weekday.index)
grouped_weekday

**7. Violin Plots (Using Spelling/Typing Error Data)**
<a id="violin"></a>

In light of the new 'spelling' error data from earlier, I wanted to see if that had any effect at all on application approval.

In [None]:
if USING_DICTIONARY_CODE:
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    #convert data back to int:
    pd_plot = combined_training_dataset[['total_spelling_errors', 'total_typing_errors', 'project_is_approved']].copy()
    pd_plot['total_spelling_errors'] = pd.to_numeric(pd_plot['total_spelling_errors'])
    pd_plot['total_typing_errors'] = pd.to_numeric(pd_plot['total_typing_errors'])
    
    fig, axes = plt.subplots()
    # plot violin. 'project_is_approved' is the x axis, 
    # 'total_spelling_errors' is the y axis, data is pd_plot. ax - is axes instance
    sns.violinplot(x='project_is_approved', y='total_spelling_errors', data=pd_plot, ax=axes)
    axes.set_title('Spelling Errors vs Approval')

    axes.yaxis.grid(True)
    axes.set_xlabel('project_is_approved')
    axes.set_ylabel('total_spelling_errors')

    plt.show()

But, as with everything else, it seems to not make a difference.

In [None]:
if USING_DICTIONARY_CODE:
    fig, axes = plt.subplots()
    # plot violin. 'project_is_approved' is the x axis, 
    # 'total_typing_errors' is the y axis, data is pd_plot. ax - is axes instance
    sns.violinplot(x='project_is_approved', y='total_typing_errors', data=pd_plot, ax=axes)
    axes.set_title('Typing Errors vs Approval')

    axes.yaxis.grid(True)
    axes.set_xlabel('project_is_approved')
    axes.set_ylabel('total_typing_errors')

    plt.show()

**III.  Applying the Crash Course to this Project to Create a Linear Classifier (Does NOT Use Text-Heavy Features)**
<a id="linearclassifier"></a>

I was looking over my course notes, and the code for just about everything I want to incorporate into a model that uses a linear classifier can be found in the [Programming Exercises from Chapter 13 - Regularization: Sparsity](https://colab.research.google.com/notebooks/mlcc/sparsity_and_l1_regularization.ipynb?hl=en):

- Randomizing the data; stressed in Chapter 6
- Splitting the data into training and validation sets; stressed in Chapter 5
- Choosing good features (binning/bucketizing, scaling, handling outliers); stressed in Chapter 8
- Setting up everything related to training (feature columns, input functions, training, predictions); from throughout the course
- Regularization; from Chapters 10, 11, and 13 (L1 Regularization)

The only thing missing is the code for determining the ROC and AUC, which can be found in the [Programming Exercises from Chapter 12 - Classification](https://colab.research.google.com/notebooks/mlcc/logistic_regression.ipynb?hl=en).

Thus, I'll be working off the code from Chapter 13, and I'll try to keep it organized in a similar flow so that it looks familiar.

Now, why the note about not using text-heavy features?  Well, first I wanted to start small and just use features that would be easier to work with (total_cost, school_state, etc.).  Then, once I did want to incorporate features like the project_essays, I couldn't get it to work!  See [Section IV](#bruteforcebaby) for how I had to change the way I stored the text-heavy data and loaded it into TensorFlow.

**1. Randomizing the Data**
<a id="random"></a>

The data here looks like it's already been randomized, but just to be safe, I'll go ahead and make sure:

In [29]:
# Data sets are:
# combined_training_dataset
# combined_test_dataset

combined_training_dataset = combined_training_dataset.reindex(
    np.random.permutation(combined_training_dataset.index))

**2. Creating Training and Validation Sets**
<a id="sets"></a>

(Note:  I'll be skipping the "preprocessing features functions" from those exercises since that stuff was basically handled in the Getting Started section.)

In [30]:
USING_OLD_MODELS = False

#combined_training_dataset.shape  -->  182,080 examples
#split 80% training, 20% validation --> 145,664 training, 36,416 validation
HEAD = 145664
TAIL = 36416
TARGET_STR = "project_is_approved"

training_examples = combined_training_dataset.head(HEAD)
validation_examples = combined_training_dataset.tail(TAIL)

if USING_OLD_MODELS:  
    #new models = don't separate examples from targets here
    #old models require targets to be separate:
    training_targets = combined_training_dataset[[TARGET_STR]].head(HEAD)
    validation_targets = combined_training_dataset[[TARGET_STR]].tail(TAIL)
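The HEAD and TAIL values above are hard-coded for 182,080 examples.  If the row count ever changes, a small helper like this one (hypothetical name, integer arithmetic to dodge floating-point rounding) could compute them instead:

```python
def split_sizes(num_examples, train_percent=80):
    """Return (training_count, validation_count) using integer arithmetic."""
    head_count = num_examples * train_percent // 100
    return head_count, num_examples - head_count

head_count, tail_count = split_sizes(182080)
print(head_count, tail_count)
```

In the notebook, the argument would presumably be `len(combined_training_dataset)`.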
    

**3. Choosing Good Features - (DESIRED_FEATURES const)**
<a id="desiredfeatures"></a>

This is where my code will differ some from the Exercises.  I plan to experiment with the different features to see which I really want to include.  However, doing that in the Exercises can involve editing or commenting out a lot of code.  So instead I just want to make a list of "desired features"  and then make that the only thing I'll have to update.

(Note:  If you want to play around with this Notebook, one option is to include ALL the features on your first run.  That way, all the data is saved (later on) into a TFRecord from the start.  Then, the only thing you'll have to change is this DESIRED_FEATURES list to experiment with different features.)

[Take this link to quickly jump back down to training a Linear Classifier with heavy-text Features.](#lineartexttraining)

[Or this link to jump back down to training the DNN Classifier.](#dnntraining)

In [55]:
# GLOBAL CONSTANTS
#USING_OLD_MODELS = False  #(located in cell above)

# ***  DESIRED_FEATURES  ***
#To make it easier to swap features in and out (and even create and toss in new ones), many functions throughout
#the rest of this Notebook only require you to update this DESIRED_FEATURES list.
# DESIRED_FEATURES is a list whose elements are a tuple of the form:
#       (feature_name, feature_type, num_bucket), so (str, str, int)
# feature_type - types of features include:
WORDS_STRING = 'words_string' #for strings that should be looked at as individual words
WHOLE_STRING = 'whole_string' # for strings that should NOT be looked at as individual words
BUCKET_INT = 'bucket_int' # ints that will be bucketized
BUCKET_FLOAT = 'bucket_float' # floats that will be bucketized
NUM_INT = 'num_int' #treat the int as a normal, individual number
NUM_FLOAT = 'num_float' # treat the float as a normal, individual number

# num_bucket is meant only for the bucket_int and bucket_float features; how many buckets to split the data into
# alternatively, you can set num_bucket equal to an array of your choosing for the buckets, like:  [1.0, 2.0, 5.0]
ERROR_BUCKETS = [0, 1, 2, 4, 7, 10, 20, 50]

if USING_OLD_MODELS:
    DESIRED_FEATURES = [
            #'id', not used, not handled
            #'teacher_id', not used, not handled
    ('teacher_prefix', WHOLE_STRING, None),
    ('school_state', WHOLE_STRING, None),
            #'project_submitted_datetime', not used
    #('project_grade_category', WHOLE_STRING, None),
            #('project_subject_categories', WORDS_STRING, None), not handled in old model
            #('project_subject_subcategories', WORDS_STRING), not handled in old model
            #('project_title', WORDS_STRING), not handled in old model
            #('project_essay_1', WORDS_STRING), not handled in old model
            #('project_essay_2', WORDS_STRING, None), not handled in old model
            #'project_essay_3', not used, not handled
            #'project_essay_4', not used, not handled
            #('project_resource_summary', WORDS_STRING), not handled in old model
    ('teacher_number_of_previously_posted_projects', BUCKET_INT, 6),
            #('full_description', WORDS_STRING), not handled in old model
    #('total_quantity', BUCKET_INT, 10),
    #('total_cost', BUCKET_FLOAT, 12),
    #('year_week', WHOLE_STRING, None),  #effectively takes the place of project_submitted_datetime
    ('project_is_approved', NUM_INT, None) #target
]
else:
    DESIRED_FEATURES = [
            #'id', not used, not handled
            #'teacher_id', not used, not handled
    #('teacher_prefix', WHOLE_STRING, None),
    #('school_state', WHOLE_STRING, None),
            #'project_submitted_datetime', not used, not handled
    #('project_grade_category', WHOLE_STRING, None),
    #('project_subject_categories', WORDS_STRING, None), #because some belong to multiple categories
    #('project_subject_subcategories', WORDS_STRING, None),
    #('project_title', WORDS_STRING, None),
    ('project_essay_1', WORDS_STRING, None),
    ('project_essay_2', WORDS_STRING, None),
            #'project_essay_3', not used, not handled
            #'project_essay_4', not used, not handled
    #('project_resource_summary', WORDS_STRING, None),
    ('teacher_number_of_previously_posted_projects', BUCKET_INT, 6),
    ('full_description', WORDS_STRING, None),
    #('total_quantity', BUCKET_INT, 10),
    ('total_cost', BUCKET_FLOAT, 12),
    #('year_week', WHOLE_STRING, None),  #effectively takes the place of project_submitted_datetime
#POSSIBLE NEW FEATURES:  (ints are now WHOLE_STRING since treated like vocabulary list)
    #('comma_count',  WHOLE_STRING, 6),
    #('hyphen_count', WHOLE_STRING, 2),
    #('sentence_count', WHOLE_STRING, 6),
    #('avg_sentence_length', BUCKET_FLOAT, 5),
    #('plural_pronouns', WHOLE_STRING, 5),
    #('wrong_article', WHOLE_STRING, 1),  #the error ones = don't have enough to make buckets
    #('cap_errors', WHOLE_STRING, 1),
    #('singular_pronouns', WHOLE_STRING, 5),
    #('repeat_words', WHOLE_STRING, 1),
    #('avg_word_size', BUCKET_FLOAT, 6),
    #('num_charged_words', NUM_INT, 1),
#NEW ERROR-RELATED FEATURES:
    #('title_typing_errors', BUCKET_INT, ERROR_BUCKETS),
    ('title_spelling_errors', BUCKET_INT, ERROR_BUCKETS),
    #('essay_1_typing_errors', BUCKET_INT, ERROR_BUCKETS),
    ('essay_1_spelling_errors', BUCKET_INT, ERROR_BUCKETS),
    #('essay_2_typing_errors', BUCKET_INT, ERROR_BUCKETS),
    ('essay_2_spelling_errors', BUCKET_INT, ERROR_BUCKETS),
    #('resource_summary_typing_errors', BUCKET_INT, ERROR_BUCKETS),
    #('resource_summary_spelling_errors', BUCKET_INT, ERROR_BUCKETS),
    ('total_typing_errors', BUCKET_INT, ERROR_BUCKETS),
    ('total_spelling_errors', BUCKET_INT, ERROR_BUCKETS),
    ('project_is_approved', NUM_INT, None) #target
]


#some functions (like the ones saving to a TFRecord) just want the string column names, so grab those:
DESIRED_COLUMNS = [str_name for str_name, str_type, dontcare in DESIRED_FEATURES]

STR_TARGET = 'project_is_approved' #target, which needs to be singled out and kept out of some functions

# construct_feature_columns (shown later) is designed to work with either the Linear Classifier
# or the DNN Classifier depending on the following constant
# That is, if training the DNN Classifier, turn categorical_column_with_vocabulary_list into embeddings:
IS_DNN_CLASSIFIER = False
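To show how the (name, type, num_bucket) tuples get used, here's a sketch with a shortened, hypothetical feature list.  It pulls out the column names and then separates the target from the input features:

```python
# A shortened, hypothetical feature list in the same (name, type, num_bucket) form.
features = [
    ("school_state", "whole_string", None),
    ("total_cost", "bucket_float", 12),
    ("project_is_approved", "num_int", None),  # target
]
target = "project_is_approved"

# All column names, in list order.
all_columns = [name for name, _type, _buckets in features]

# Only the input features, with the target singled out.
input_columns = [name for name in all_columns if name != target]
print(all_columns)
print(input_columns)
```

Swapping a feature in or out then only means editing the one list.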

*Bucketizing*

I'm not currently planning on using any continuous numeric data (though the code does handle that type of data and feature column).  Instead, I'll be bucketizing them, and here's the function to create quantile-based buckets:  

In [32]:
def get_quantile_based_buckets(feature_values, num_buckets):
    """
    Args:
        feature_values:  Pandas DataFrame (one column)
        num_buckets:  how many buckets
        
    Returns:
        An array of the quantiles
    """
    quantiles = feature_values.quantile(
        [(i+1.)/(num_buckets + 1.) for i in range(num_buckets)])
    return [quantiles[q] for q in quantiles.keys()]
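It can help to see which fractions the function above actually asks Pandas for.  This sketch (the helper name is hypothetical) reproduces just the list comprehension:

```python
def quantile_fractions(num_buckets):
    """Fractions passed to .quantile() by the function above."""
    return [(i + 1.0) / (num_buckets + 1.0) for i in range(num_buckets)]

print(quantile_fractions(3))
```

So num_buckets is really the number of *boundaries* returned; three boundaries at the 25%, 50%, and 75% marks split the data into four ranges.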

I'd recommend looking at the quantile buckets that get created.  If it's not in an ascending order with *unique* numbers at each position, TensorFlow can't handle it and training crashes.

In [35]:
for str_name, str_type, num_bucket in DESIRED_FEATURES:
    print(str_name)
    if (str_type == BUCKET_INT) or (str_type == BUCKET_FLOAT):
        if not isinstance(num_bucket, list):
            print(get_quantile_based_buckets(training_examples[str_name], num_bucket))
        else:
            print(num_bucket)
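
As a rough illustration of the two points above, here is a minimal sketch in plain Python (no pandas or TensorFlow needed).  The helper names `quantile_fractions`, `is_strictly_increasing`, and `dedupe_boundaries` are made up for this example and aren't part of the notebook.  The first mirrors the fractions that get_quantile_based_buckets asks pandas for; the other two show one possible way to spot and repair repeated boundaries, assuming that simply dropping duplicates is acceptable for your feature.

```python
def quantile_fractions(num_buckets):
    """Fractions requested from .quantile(), e.g. 3 -> [0.25, 0.5, 0.75]."""
    return [(i + 1.0) / (num_buckets + 1.0) for i in range(num_buckets)]


def is_strictly_increasing(boundaries):
    """True only if every boundary is larger than the one before it."""
    return all(a < b for a, b in zip(boundaries, boundaries[1:]))


def dedupe_boundaries(boundaries):
    """Hypothetical repair: drop repeated boundaries and sort what is left."""
    return sorted(set(boundaries))


# A skewed feature (lots of zeros) can yield repeated quantile values.
# These numbers are invented for illustration:
raw_boundaries = [0.0, 0.0, 1.0, 3.0]
print(quantile_fractions(3))
print(is_strictly_increasing(raw_boundaries))
print(dedupe_boundaries(raw_boundaries))
```

Keep in mind that removing duplicates leaves you with fewer buckets than you asked for, so hand-picked boundaries may still be the better choice for heavily skewed columns.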

**4. Setting Up for Training**
<a id="trainingsetup"></a>

First up is the general input function to create more specific input functions for training vs validation vs prediction:

In [None]:
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """
    Args:
        features: pandas DataFrame of features
        targets: pandas DataFrame of targets
        batch_size: Size of batches to be passed to the model
        shuffle: True or False. Whether to shuffle the data.
        num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
        Tuple of (features, labels) for next data batch
    """
    
    # Grab only the features specified in DESIRED_FEATURES:
    selected_features_data = features[DESIRED_COLUMNS]
    
    # Convert pandas data into a dict of np arrays.
    features = {key:np.array(value) for key,value in dict(selected_features_data).items()}                                            

    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels

The new spelling/typing error data isn't distributed in a way that lends itself well to bucketizing with the above function.  Thus, I'm going to turn those values all into strings (and treat them as categorical columns with vocabulary_lists).

Next up is the function to make Feature Columns.  But before that, vocabulary_lists need to be created for each categorical column:

In [36]:
def vocab_lists_for_each_feature():
    vocab_for_feature = {}
    for str_name, str_type, dont_care in DESIRED_FEATURES:
        #features to treat as full strings:
        if (str_type == WHOLE_STRING):
            vocab_for_feature[str_name] = training_examples[str_name].unique().tolist()
        #string features that should be split into individual words
        elif (str_type == WORDS_STRING):
            new_set = set()
            training_examples[str_name].str.split().apply(new_set.update)
            vocab_for_feature[str_name] = sorted(new_set)  #sorted so the word order is the same on every run
        #everything else is a number that doesn't need a vocabulary list
                  
    return vocab_for_feature

vocab_lists_for_features = vocab_lists_for_each_feature()

And now the feature columns can be constructed.  This contains code for making bucketized numeric columns as well as categorical columns with vocabulary list.  If the value of global IS_DNN_CLASSIFIER is true, it then converts the categorical columns into embedding columns.

In [37]:
def construct_feature_columns():
    """Construct the TensorFlow Feature Columns.

    Returns:
        A set of feature columns
    """
    
    arr_num_columns = []
    arr_vocabulary_columns = []
    #for use in determining embedding dimensions for each embedding_column:
    arr_dimensions = []
    
    for str_name, str_type, num_bucket in DESIRED_FEATURES:
        # don't create a feature column for the target:
        if (str_name == STR_TARGET):
            continue
        # create normal numeric_columns for normal numbers
        elif (str_type == NUM_INT) or (str_type == NUM_FLOAT):
            arr_num_columns.append(tf.feature_column.numeric_column(str_name))
        # create bucketized columns for bucket numbers:
        elif (str_type == BUCKET_INT) or (str_type == BUCKET_FLOAT):
            #print(str_name)  for debugging irritating training error when it doesn't like the values comprising the buckets
            if not isinstance(num_bucket, list):  #then get the buckets
                bucket_column = tf.feature_column.bucketized_column(
                    tf.feature_column.numeric_column(str_name),
                    boundaries=get_quantile_based_buckets(training_examples[str_name], num_bucket))
            else:
                bucket_column = tf.feature_column.bucketized_column(
                    tf.feature_column.numeric_column(str_name), boundaries=num_bucket)
            arr_num_columns.append(bucket_column)
        # create categorical vocab columns for strings:
        elif (str_type == WHOLE_STRING) or (str_type == WORDS_STRING):
            arr_vocabulary_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(key=str_name, vocabulary_list=vocab_lists_for_features[str_name]))
            #for DNN Classifier, also calculate the number of dimensions for embedding:
            if IS_DNN_CLASSIFIER:
                fourth_root = int(math.sqrt(math.sqrt(len(vocab_lists_for_features[str_name]))))
                if (fourth_root == 0):
                    fourth_root += 1
                arr_dimensions.append(fourth_root)

    #turn vocab_columns into embedding_columns
    #arr_vocab and arr_dimensions will match by index
    end = len(arr_dimensions)
    arr_embedding_columns = []
    if IS_DNN_CLASSIFIER:
        for i in range(0, end):
            arr_embedding_columns.append(tf.feature_column.embedding_column(arr_vocabulary_columns[i], dimension=arr_dimensions[i]))
        arr_vocabulary_columns = arr_embedding_columns
        
    feature_columns = set(arr_num_columns + arr_vocabulary_columns)

    return feature_columns

Then we have the function to train the model.  (Note:  This is where **5. Regularization** gets implemented.)
<a id="regular"></a>

**An important note about the train_model function:**
In the Colab Programming Exercises, they split the training into "periods" so that you can observe changes in loss as the training progresses.  However, as far as I can tell, that makes it *really* slow here in Kaggle.  So I've skipped applying periods in this function.  (All the periods do is split up the number of steps.  For example, if you have 10 periods and 1000 steps, then using periods will train the model for 100 steps, stop, start up again, train for 100 more, stop, start up again, etc.  Using no periods means that all 1000 steps happen after a single start, which appears to be much, much faster with the same end result.  But since there are no periods, you won't get a visual graph of the training/validation loss changes over time.)
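
To make the arithmetic in that note concrete, here is a small sketch in plain Python.  The function names `steps_per_period` and `period_schedule` are hypothetical helpers for illustration only; they just list how many steps each separate training call would receive.

```python
def steps_per_period(total_steps, periods):
    """Number of training steps run in each period (integer division)."""
    return total_steps // periods


def period_schedule(total_steps, periods):
    """List of step counts, one entry per separate train() call."""
    return [steps_per_period(total_steps, periods)] * periods


print(period_schedule(1000, 10))  # ten separate calls of 100 steps each
print(period_schedule(1000, 1))   # one call of 1000 steps (what this notebook does)
```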

In [None]:
def train_linear_classifier_model(
    learning_rate,
    regularization_strength,
    steps,
    batch_size,
    feature_columns,
    training_examples,
    training_targets,
    validation_examples,
    validation_targets,
    periods=1):
    """Trains a linear classifier model.

    In addition to training, this function also prints the training and
    validation log loss after each period.  (The loss plot from the
    Programming Exercises is commented out since only one period is used.)

    Args:
        learning_rate: A `float`, the learning rate.
        regularization_strength: A `float` that indicates the strength of the L1
            regularization. A value of `0.0` means no regularization.
        steps: A non-zero `int`, the number of training steps run in each period
            (the steps are NOT split across periods here). A training step
            consists of a forward and backward pass using a single batch.
        batch_size: A non-zero `int`, the batch size.
        feature_columns: A `set` specifying the input feature columns to use.
        training_examples: A `DataFrame` containing one or more columns to use as input features for training.
        training_targets: A `DataFrame` containing exactly one column to use as target for training.
        validation_examples: A `DataFrame` containing one or more columns to use as input features for validation.
        validation_targets: A `DataFrame` containing exactly one column to use as target for validation.
        periods: An integer, the number of times to train the model.  #Programming Exercises had periods = 7

    Returns:
        A `LinearClassifier` object trained on the training data.
    """

    #steps_per_period = steps / periods  SKIPPING PERIODS
    
    # Create a linear classifier object.
    my_optimizer = tf.train.FtrlOptimizer(learning_rate=learning_rate, l1_regularization_strength=regularization_strength)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
    linear_classifier = tf.estimator.LinearClassifier(
        feature_columns=feature_columns,
        optimizer=my_optimizer
    )
    
    # Create input function for training (dependent on batch_size passed in)
    training_input_fn = lambda: my_input_fn(training_examples, 
                      training_targets[STR_TARGET], 
                      batch_size=batch_size)
    
    # Train the model, but do so inside a loop so that we can periodically assess
    # loss metrics.
    print("Training model...")
    print("LogLoss (on validation data):")
    #training_log_losses = []  SKIPPING PERIODS
    #validation_log_losses = []
    for period in range (0, periods):
        # Train the model, starting from the prior state.
        linear_classifier.train(
        input_fn=training_input_fn,
        steps=steps
        )
        # Take a break and compute predictions.
        training_probabilities = linear_classifier.predict(input_fn=predict_training_input_fn)
        training_probabilities = np.array([item['probabilities'] for item in training_probabilities])

        validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)
        validation_probabilities = np.array([item['probabilities'] for item in validation_probabilities])

        # Compute training and validation loss.
        training_log_loss = metrics.log_loss(training_targets, training_probabilities)
        validation_log_loss = metrics.log_loss(validation_targets, validation_probabilities)
        # Occasionally print the current loss.
        print("  Training loss for period %02d: %0.2f" % (period, training_log_loss))
        print("  Validation loss for period %02d : %0.2f" % (period, validation_log_loss))
        # Add the loss metrics from this period to our list.  SKIPPING PERIODS
        #training_log_losses.append(training_log_loss)
        #validation_log_losses.append(validation_log_loss)
        
    print("Model training finished.")

    # Periods slow down Kaggle, so only do one; makes this graph pointless
    # Output a graph of loss metrics over periods.
    #plt.ylabel("LogLoss")
    #plt.xlabel("Periods")
    #plt.title("LogLoss vs. Periods")
    #plt.tight_layout()
    #plt.plot(training_log_losses, label="training")
    #plt.plot(validation_log_losses, label="validation")
    #plt.legend()

    return linear_classifier

**6. Training**
<a id="lineartraining"></a>

First, create the input functions used to predict.  (Unlike the Programming Exercises, I'm doing this outside of the train_model function since they need to be used later outside of that function.)

In [None]:
# Create input functions.
predict_training_input_fn = lambda: my_input_fn(training_examples, 
                              training_targets[STR_TARGET], 
                              num_epochs=1, 
                              shuffle=False)
predict_validation_input_fn = lambda: my_input_fn(validation_examples, 
                                validation_targets[STR_TARGET], 
                                num_epochs=1, 
                                shuffle=False)

And now it's time to train!

In [None]:
if USING_OLD_MODELS and not IS_DNN_CLASSIFIER:
    linear_classifier = train_linear_classifier_model(
        learning_rate=.1,
        regularization_strength=0.1,
        steps=1000,  
        batch_size=140,
        periods=1,
        feature_columns=construct_feature_columns(),
        training_examples=training_examples,
        training_targets=training_targets,
        validation_examples=validation_examples,
        validation_targets=validation_targets) 


**7. ROC and AUC**
<a id="rocandauc"></a>

The code for this is found in [Chapter 12's Programming Exercises](https://colab.research.google.com/notebooks/mlcc/logistic_regression.ipynb?hl=en).


In [None]:
if USING_OLD_MODELS and not IS_DNN_CLASSIFIER:
    #training_metrics = linear_classifier.evaluate(input_fn=predict_training_input_fn)
    validation_metrics = linear_classifier.evaluate(input_fn=predict_validation_input_fn)

    #print("AUC on the training set: %0.2f" % training_metrics['auc'])
    #print("Accuracy on the training set: %0.2f" % training_metrics['accuracy'])

    print("AUC on the validation set: %0.2f" % validation_metrics['auc'])
    print("Accuracy on the validation set: %0.2f" % validation_metrics['accuracy'])

In [None]:
if USING_OLD_MODELS and not IS_DNN_CLASSIFIER:
    validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)
    # Get just the probabilities for the positive class.
    validation_probabilities = np.array([item['probabilities'][1] for item in validation_probabilities])

    false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(
        validation_targets, validation_probabilities)
    plt.plot(false_positive_rate, true_positive_rate, label="our model")
    plt.plot([0, 1], [0, 1], label="random classifier")
    _ = plt.legend(loc=2)

**8. Experiment Results**
<a id="linearpretextexps"></a>

I tried a couple of combinations of the following features:  previous_submissions, total_cost, teacher_prefix, school_state, grade_category.  Even after adjusting various hyperparameters, the best AUC I got was only .57, which isn't much better than random (.5).  
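
For a feel of why an AUC near .5 means "about as good as guessing," here is a minimal plain-Python sketch.  It uses the pairwise interpretation of AUC: the fraction of (approved, not approved) pairs where the approved example gets the higher score.  The function name `pairwise_auc` and the toy labels and scores are invented for illustration; for real data you'd use something like `metrics.roc_auc_score` from scikit-learn instead of this quadratic loop.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
print(pairwise_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # same score for everyone
```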

**IV. Dealing with Text-Heavy Features:  Using the Crash Course to Bypass TensorFlow Pain**
<a id="bruteforcebaby"></a>

Since an AUC of .57 isn't very good, my next step was to start working text-heavy features like project_essays into the model.  And that's where I ran into huge problems.

In the style of this set of [Programming Exercises](https://colab.research.google.com/notebooks/mlcc/intro_to_sparse_data_and_embeddings.ipynb?hl=en) from Chapter 17 - Embeddings, I implemented (or at least thought I had) a giant vocabulary list for use against the feature "project_resource_summary" text.  However, it had absolutely no effect on AUC at all, which surprised me.

Taking a look at the training process, it looked like TensorFlow was treating a given example's data for that feature as a *full entire string* (for example:  "My students need 7 ipads and 8 ipod nanos and cookies.").  It was then taking that *full* string and comparing it with the individual words in the vocabulary list ("my", "students", "need", "cookies", etc.).  But since none of those full sentences exist in the vocabulary_list, it was basically doing *nothing*.  What I wanted it to do was look at each individual word in the full string and compare those to the words in the vocabulary list.
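
The mismatch is easy to reproduce without TensorFlow.  Here's a minimal sketch in plain Python; the vocabulary, the example sentence, and the helper name `vocab_matches` are all made up for illustration and only mimic the lookup, not TensorFlow's actual implementation.

```python
vocabulary = ["my", "students", "need", "cookies"]  # made-up vocabulary list
summary = "my students need 7 ipads and cookies"     # made-up feature value


def vocab_matches(tokens, vocabulary):
    """Return the tokens that appear in the vocabulary, in order."""
    vocab = set(vocabulary)
    return [token for token in tokens if token in vocab]


# Treating the feature as one whole string: a single token, which matches nothing.
print(vocab_matches([summary], vocabulary))
# Splitting into words first: each word is looked up separately.
print(vocab_matches(summary.split(), vocabulary))
```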

However, I was never able to get TensorFlow to do that with the previous code.  I'm sure there's a way, but nothing I tried worked, and I either didn't ask the right questions in web searches or just couldn't find anything.  Plus, I got tired of looking and frustrated.

So... what's my solution?  Well, the code in the Programming Exercises linked above has TensorFlow evaluating a full sentence as individual words.  Thus, whatever it was doing, I knew that code worked.  That Exercise, however, is based on *loading the data into TensorFlow in a completely different way,* by using something called a TFRecord.  So I decided to take the data that's currently in my Pandas Dataframe, [save it as a TFRecord](https://stackoverflow.com/questions/41402332/tensorflow-create-a-tfrecords-file-from-csv), and then use the code in the Programming Exercises to load that TFRecord data back into TensorFlow to have it actually do what I want.  It seems ridiculous... but it works!!!  

**1.  Converting from Pandas to TFRecord**
<a id="tfrecordrescue"></a>

The functions below make it much easier to add and remove features:  you simply edit the DESIRED_FEATURES list.  They then convert that feature data to a format that TFRecord can handle.  (One special note about strings:  a TFRecord's bytes_list can't store Python strings directly.  Instead, they need to be converted to byte strings first with .encode().)

In [38]:
#functions to write correct values to TFRecord
#depending on the column value, TFRecord either needs to store a bytes_list, int64_list, or float_list value
def make_bytes_list_append(col_index, value, example):
    example.features.feature[DESIRED_COLUMNS[col_index]].bytes_list.value.append(value.encode())

#for the data with actual sentences, split them into individual words
def make_bytes_list_extend(col_index, value, example):
    arr_strings = value.split(' ')
    arr_bstrings = list(map(lambda x: x.encode(), arr_strings))
    example.features.feature[DESIRED_COLUMNS[col_index]].bytes_list.value.extend(arr_bstrings)
    
def make_int64_list(col_index, value, example):
    example.features.feature[DESIRED_COLUMNS[col_index]].int64_list.value.append(value)

def make_float_list(col_index, value, example):
    example.features.feature[DESIRED_COLUMNS[col_index]].float_list.value.append(value)


#function to create a list of the right functions to call depending on the feature
def match_feature_with_tfrecord_function():
    arr_functions = []
    for str_name, str_type, dontcare in DESIRED_FEATURES:
        #features consisting of a full string = bytes, append
        if (str_type == WHOLE_STRING):
            arr_functions.append(make_bytes_list_append)
        #features consisting of text that should be broken down into words = bytes, extend
        elif (str_type == WORDS_STRING):
            arr_functions.append(make_bytes_list_extend)
        #features consisting of non-integer numbers = float
        elif (str_type == NUM_FLOAT) or (str_type == BUCKET_FLOAT):
            arr_functions.append(make_float_list)
        #features consisting of integers = int64
        elif (str_type == NUM_INT) or (str_type == BUCKET_INT):
            arr_functions.append(make_int64_list)
        
    return arr_functions

#loop through all the rows in a dataset, convert each individual cell to the right data type that TFRecord requires,
#take those tf.examples and save them into a TFRecord file
#NOTE:  REQUIRES arr_tfrecord_funcs = match_feature_with_tfrecord_function() to be called first
def save_rows_as_TFRecord(array_of_rows, tfrecord_file_name):
    with tf.python_io.TFRecordWriter(tfrecord_file_name) as writer:
        last_column = len(DESIRED_COLUMNS)
        for row in array_of_rows:
            example = tf.train.Example()
            for col_index in range(0, last_column):
                arr_tfrecord_funcs[col_index](col_index, row[col_index], example)  #for each feature, call the corresponding function

            writer.write(example.SerializeToString())
    return

#just a test function to make sure all the above works
def TEST_save_rows_as_TFRecord(array_of_rows, tfrecord_file_name):
    last_column = len(DESIRED_COLUMNS)
    for row in array_of_rows:
        example = tf.train.Example()
        for col_index in range(0, last_column):
            arr_tfrecord_funcs[col_index](col_index, row[col_index], example)  #for each feature, call the corresponding function
        print(example)
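
If the append-versus-extend distinction is unclear, here's a tiny sketch in plain Python of what ends up in the bytes_list in each case.  The helper names `to_bytes_whole` and `to_bytes_words` and the sample values are hypothetical; they only mirror the string handling in make_bytes_list_append and make_bytes_list_extend, without touching a tf.train.Example.

```python
def to_bytes_whole(value):
    """One byte string for the entire value (the 'append' case)."""
    return [value.encode()]


def to_bytes_words(value):
    """One byte string per space-separated word (the 'extend' case)."""
    return [word.encode() for word in value.split(' ')]


print(to_bytes_whole("Grades 3-5"))
print(to_bytes_words("my students need cookies"))
```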

Now test to make sure that all works as intended by looking at a couple of examples before writing to a TFRecord:

In [39]:
#testing new functions:
arr_tfrecord_funcs = match_feature_with_tfrecord_function()
testdata = training_examples[DESIRED_COLUMNS][0:2]

TEST_save_rows_as_TFRecord(testdata.values, 'test')

*i. Handle the Test Dataset*
<a id="testrecord"></a>

Before saving the data as TFRecords, there's one change to make to the *test* dataset.  That dataset doesn't have a 'project_is_approved' column, which would cause the upcoming code to fail.  So, to make things easy, I'm just going to add that column to the dataset and fill it with a bunch of zeroes since those numbers won't be used for anything anyways.

In [40]:
combined_test_dataset['project_is_approved'] = 0
combined_test_dataset[0:5]

Using all the previous functions, now actually save the data sets as TFRecords.  This step takes **a LONG time**.  However, using this TFRecord method makes the actual training later on go blazing fast!

Additionally, this eats up **A LOT of disk space**, especially if you include all the features.

*On that note, if you want to play around with a lot of features and/or experiment with different hyperparameters, I'd recommend commenting out the last line that saves the test_dataset to free up some space.*

In [42]:
#Files to save:
TRAINING_TFRECORD = "training.tfrecords" #for training_examples
VALIDATION_TFRECORD = "validation.tfrecords" #for validation_examples
TEST_TFRECORD = "test.tfrecords" #for combined_test_dataset

arr_tfrecord_funcs = match_feature_with_tfrecord_function() #grab the conversion functions to call for each cell
#make the TFRecord for training data:
save_rows_as_TFRecord(training_examples[DESIRED_COLUMNS].values, TRAINING_TFRECORD)
#make the TFRecord for validation data:
save_rows_as_TFRecord(validation_examples[DESIRED_COLUMNS].values, VALIDATION_TFRECORD)
#make the TFRecord for test data:
save_rows_as_TFRecord(combined_test_dataset[DESIRED_COLUMNS].values, TEST_TFRECORD)

**2. Loading TFRecord into TensorFlow**
<a id="recordtotf"></a>

The following code is taken straight from the [Exercise](https://colab.research.google.com/notebooks/mlcc/intro_to_sparse_data_and_embeddings.ipynb) but adapted to fit the data for this project and, once again, to make it easier to add/remove features by only editing the DESIRED_FEATURES constant from earlier:

In [43]:
def parse_function(record):
    """Extracts features and labels from a single serialized record of a TFRecord file.

    Args:
        record: a serialized tf.Example (one entry of the TFRecordDataset, not a file name)
    Returns:
        A `tuple` `(features, labels)`:
        features: A dict of tensors representing the features
        labels: A tensor with the corresponding labels.
    """
    features_to_parse = {}
    for str_name, str_type, dontcare in DESIRED_FEATURES:
        #features consisting of a full string that wasn't split apart
        if (str_type == WHOLE_STRING):
            features_to_parse[str_name] = tf.FixedLenFeature(shape=[1], dtype=tf.string)
        #features consisting of a string split apart into pieces of varying lengths.  Thus, need the VarLenFeature:
        elif (str_type == WORDS_STRING):
            features_to_parse[str_name] = tf.VarLenFeature(dtype=tf.string)
        #features consisting of non-integer numbers = float
        elif (str_type == NUM_FLOAT) or (str_type == BUCKET_FLOAT):
            features_to_parse[str_name] = tf.FixedLenFeature(shape=[1], dtype=tf.float32)
        #features consisting of integers = int64
        elif (str_type == NUM_INT) or (str_type == BUCKET_INT):
            features_to_parse[str_name] = tf.FixedLenFeature(shape=[1], dtype=tf.int64)

    parsed_features = tf.parse_single_example(record, features_to_parse)
    
    final_features = {}
    for str_name, str_type, dontcare in DESIRED_FEATURES:
        #don't include target in features; handle later as label
        if (str_name == STR_TARGET):
            continue
        ##DO NOT FORGET .values for the VarLenFeature strings (doing so creates a cryptic error that was hard to figure out)
        elif (str_type == WORDS_STRING):
            final_features[str_name] = parsed_features[str_name].values
        #everything else as normal
        else:
            final_features[str_name] = parsed_features[str_name]
    
    #grab the targets:
    labels = parsed_features[STR_TARGET]
    
    return final_features, labels

I now check to see that this is working and that the data is in the correct format (i.e., separated strings).  By the way, if you happen to look at the output in the Programming Exercises, it won't have the byte strings displayed like b'teaching'.  Instead, it will have odd numerical representations like '\x0323teaching'.  That difference initially made me worry that my TFRecord attempt here had failed.  However, the Colab Programming environment (at least for me) uses Python 2, which apparently displays byte strings in that way.  Python 3 here on Kaggle just uses the b'value' approach.  So the data is indeed formatted correctly.

In [44]:
ds = tf.data.TFRecordDataset(TRAINING_TFRECORD)
ds = ds.map(parse_function)

n = ds.make_one_shot_iterator().get_next()
sess = tf.Session()
sess.run(n)

**3. Setting up Training (with Text-Heavy Features)**
<a id="traintextheavy"></a>

This code for the input_fn is taken straight from the Programming Exercise.  And the upcoming code for training is, too.  However, I've adapted them both to better fit the style of previous Exercises (so that you only have to change hyperparameters in one location).

In [45]:
# Create an input_fn that parses the tf.Examples from the given files,
# and split them into features and targets.
def tfrecord_input_fn(input_filenames, batch_size=100, num_epochs=None, shuffle=True):
    # Create a dataset and map features and labels.
    ds = tf.data.TFRecordDataset(input_filenames)
    ds = ds.map(parse_function)

    if shuffle:
        ds = ds.shuffle(10000)

    # Our feature data is variable-length, so we pad and batch
    # each field of the dataset structure to whatever size is necessary.
    ds = ds.padded_batch(batch_size, ds.output_shapes)

    ds = ds.repeat(num_epochs)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
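
To picture what "pad and batch" means for the variable-length word lists, here's a minimal plain-Python sketch.  The helper name `pad_batch` and the sample words are invented; it assumes the shorter rows get filled with empty byte strings, which is my understanding of the default padding for string data but isn't something this notebook verifies.

```python
def pad_batch(rows, pad_value=b""):
    """Pad every row in the batch to the length of the longest row."""
    longest = max(len(row) for row in rows)
    return [row + [pad_value] * (longest - len(row)) for row in rows]


batch = [
    [b"my", b"students", b"need", b"cookies"],
    [b"ipads", b"please"],
]
for row in pad_batch(batch):
    print(row)
```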

In [46]:
def train_linear_classifier_text(learning_rate=.05, steps=2000, batch_size=100):
    my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

    feature_columns = construct_feature_columns()

    classifier = tf.estimator.LinearClassifier(
      feature_columns=feature_columns,
      optimizer=my_optimizer,
    )

    classifier.train(
      input_fn=lambda: tfrecord_input_fn(TRAINING_TFRECORD, batch_size=batch_size),
      steps=steps)

    predict_training_input_fn = lambda: tfrecord_input_fn(TRAINING_TFRECORD,
                                  batch_size=batch_size,
                                  num_epochs=1, 
                                  shuffle=False)
    predict_validation_input_fn = lambda: tfrecord_input_fn(VALIDATION_TFRECORD, 
                                    batch_size=batch_size,
                                    num_epochs=1, 
                                    shuffle=False)
    
    return classifier, [predict_training_input_fn, predict_validation_input_fn]
                    #return these last two so you can call them later without editing hyperparameters

**4. Training**
<a id="lineartexttraining"></a>

And now the model can train while correctly incorporating the heavy-text features!

[Take this link to quickly edit DESIRED_FEATURES.](#desiredfeatures)

In [56]:
if not USING_OLD_MODELS and not IS_DNN_CLASSIFIER:
    classifier, arr_funcs = train_linear_classifier_text(learning_rate=.01,
                                                        steps=1500,
                                                        batch_size=100)

**5. AUC**
<a id="lineartextauc"></a>

In [57]:
if not USING_OLD_MODELS and not IS_DNN_CLASSIFIER:
    #evaluation_metrics = classifier.evaluate(input_fn=arr_funcs[0])

    #print("Training set metrics:")
    #for m in evaluation_metrics:
    #    print(m, evaluation_metrics[m])
    #print("---")

    evaluation_metrics = classifier.evaluate(input_fn=arr_funcs[1])

    print("Validation set metrics:")
    for m in evaluation_metrics:
        print(m, evaluation_metrics[m])
    print("---")

**6. Experiment Results**
<a id="lineartextexps"></a>

WOOHOO!!!!!  YESSSSSSSS, IT WORKED!!!  Ahem, I was just excited to finally not get .56 or .57 AUC as a result and, more importantly, to finally have the heavy-text features being analyzed as desired.

i.  For the first experiment, I  incorporated "project_resource_summary" as the only heavy-text feature.  With various hyperparameters (steps=500/100, and learning_rate=.1/.01), this got the AUC up to around .64.

ii.  For the next experiment, I added project_title and could get to around .66 or .67 AUC, so not too much difference.

iii.  Up next was incorporating the two essays as separate features with separate vocab_lists, which got the AUC up to around .74 or so.  Quite a jump!

iv. Tried out some different feature combinations with more features (like 'year_week' and 'full_description'), but AUC seems to max around .76.  I also tossed in every feature at one point to see what would happen, but it didn't have much of an effect.  The likely best hyperparameters are:  steps=2000, learning_rate=.05, batch_size=100.  And sufficient features are likely:  school_state (?), the essays, previous submissions, full_description, total_cost

v.  Made some updates and maybe even fixes.  Changed 'year_week' from buckets to a vocabulary_list.  Used quantized buckets for previous_submissions.  Additionally, all of the above experiments' vocabulary_lists / buckets were based on the full combined_training_dataset.  For this experiment, they were based on *only training_examples* in case using the full dataset would influence the validation data.  Used all features and the best hyperparameters from above and got AUC = .768.

**V. Applying the Crash Course to Create a DNNClassifier**
<a id="dnnclassifier"></a>

The great news is that now, since the data is already being loaded into TensorFlow via the TFRecord that the [Programming Exercise](https://colab.research.google.com/notebooks/mlcc/intro_to_sparse_data_and_embeddings.ipynb) used, it's simple to keep following that Exercise and switch over to a DNN Classifier.

But first, I want to highlight a couple of points mentioned in either that Exercise or the chapters leading up to it:

 - [Chapter 15 (Training Neural Nets)](https://developers.google.com/machine-learning/crash-course/training-neural-networks/best-practices) talked about using **dropout regularization** for DNNs.  A dropout of 0 means no regularization; 1 means maximum regularization, and nothing gets learned.
 - It also mentioned that the **AdamOptimizer** might be a better/more efficient option for a non-convex NN, over the AdagradOptimizer.
 - Also, in general, this chapter was about some important "gotchas" that can arise when training and how to possibly handle those with hyperparameters
 - [Chapter 17 (Embeddings)](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture) talked about the number of dimensions to use in embeddings.  ("Higher dimensions are good because it allows us to tease apart more distinctions and therefore you can learn better relationships.  On the downside, more dimensions means a higher chance of overfitting and leads to slower training and the need for more data.")  And they gave the following empirical rule of thumb:  *the number of dimensions should be the **4th-root of the size of your vocabulary***.
 - The Exercise also discussed how using an **embedding_column** when setting up the feature columns is more computationally efficient (which is what I've set up the construct_feature_columns function to do for the DNN).
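
As a quick worked example of the 4th-root rule of thumb from the list above, here's a small sketch in plain Python.  The function name `embedding_dimension` is made up for illustration; it follows the same approach as construct_feature_columns (two square roots, rounded down, with a floor of 1), and the vocabulary sizes are arbitrary.

```python
import math


def embedding_dimension(vocab_size):
    """4th root of the vocabulary size, rounded down, but never below 1."""
    return max(1, int(math.sqrt(math.sqrt(vocab_size))))


for size in (5, 81, 10000):
    print(size, embedding_dimension(size))
```

So even a vocabulary of ten thousand words only calls for about ten embedding dimensions under this rule.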

**1. Setting Up for Training**
<a id="dnntrainingsetup"></a>

All that's really needed now (besides how construct_feature_columns has already been set up for the embedding columns) is a slightly different training function:

In [49]:
def train_dnn_classifier_text(learning_rate=.05, steps=2000, batch_size=100, hidden_units=[20,20], dropout=.1):
    # Adam instead of Adagrad, per the Crash Course's suggestion for non-convex NNs.
    #my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
    my_optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    # Clip gradients at a norm of 5.0 to guard against exploding gradients.
    my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

    feature_columns = construct_feature_columns()

    ## WHAT'S CHANGED: a DNNClassifier (with hidden_units and dropout) is used here ##
    classifier = tf.estimator.DNNClassifier(
      feature_columns=feature_columns,
      hidden_units=hidden_units,  #how many hidden units?:  https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
      optimizer=my_optimizer,
      dropout=dropout
    )

    classifier.train(
      input_fn=lambda: tfrecord_input_fn(TRAINING_TFRECORD, batch_size=batch_size),
      steps=steps)

    predict_training_input_fn = lambda: tfrecord_input_fn(TRAINING_TFRECORD,
                                  batch_size=batch_size,
                                  num_epochs=1, 
                                  shuffle=False)
    predict_validation_input_fn = lambda: tfrecord_input_fn(VALIDATION_TFRECORD, 
                                    batch_size=batch_size,
                                    num_epochs=1, 
                                    shuffle=False)
    
    return classifier, [predict_training_input_fn, predict_validation_input_fn]
                    #return these last two so you can call them later without editing hyperparameters
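
As an aside on the `clip_gradients_by_norm(my_optimizer, 5.0)` line:  as far as I know, that wrapper rescales all the gradients together whenever their combined (global) norm exceeds the limit.  The following is only a plain-Python sketch of that idea, not TensorFlow's implementation; the function name and the toy gradient values are made up for illustration:

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Scale a list of gradient vectors so their global norm is <= clip_norm.

    Illustrative sketch only: `gradients` is a list of lists of floats.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= clip_norm:
        # Small enough already: leave the values untouched.
        return [list(grad) for grad in gradients], global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


# Toy gradients with a global norm of 13 (sqrt(3^2 + 4^2 + 12^2)).
toy_gradients = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(toy_gradients, clip_norm=5.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```

The direction of the update is preserved; only its overall size is capped, which is what keeps a single huge gradient from wrecking the weights.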

**2. Training**
<a id="dnntraining"></a>

Training then uses a slightly different call:

[Take this link to quickly edit DESIRED_FEATURES.](#desiredfeatures)

In [50]:
if not USING_OLD_MODELS and IS_DNN_CLASSIFIER:
    dnn_classifier, dnn_arr_funcs = train_dnn_classifier_text(learning_rate=.005,
                                                        steps=1500,
                                                        batch_size=100,
                                                        hidden_units=[6],
                                                        dropout=.1)

**3. AUC**
<a id="dnnauc"></a>

In [51]:
if not USING_OLD_MODELS and IS_DNN_CLASSIFIER:
    #evaluation_metrics = dnn_classifier.evaluate(input_fn=dnn_arr_funcs[0])

    #print("Training set metrics:")
    #for m in evaluation_metrics:
    #    print(m, evaluation_metrics[m])
    #print("---")

    evaluation_metrics = dnn_classifier.evaluate(input_fn=dnn_arr_funcs[1])

    print("Validation set metrics:")
    for m in evaluation_metrics:
        print(m, evaluation_metrics[m])
    print("---")

**4. Experiment Results**
<a id="dnnpretextexps"></a>

And... after using various hyperparameters and tweaks, the AUC here hasn't been much different from the Linear Classifier, with the best AUC being around .76.  (Though, some high learning_rates and hidden units did lead to an AUC of .50!  Yikes!)
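
As a reminder of why an AUC of .50 is so bad:  AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so .50 is what a model that ignores its inputs would get.  Below is a minimal sketch of that pairwise definition.  It is not how the estimator's `evaluate()` computes its reported AUC (which, as I understand it, is an approximation), and the labels and scores are made up:

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.  Illustrative O(n^2) version;
    fine for toy data, too slow for a real validation set.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [1, 1, 1, 0, 0]
print(pairwise_auc(toy_labels, [0.9, 0.8, 0.7, 0.3, 0.2]))  # every positive outranks every negative
print(pairwise_auc(toy_labels, [0.5, 0.5, 0.5, 0.5, 0.5]))  # same score for everything
```

The first call gives 1.0 and the second gives 0.5, which is why a .50 result suggests the model has collapsed to predicting the same thing for every entry.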

The best of the best AUC was .769, and the model used these features:  state, categories, subcategories, the essays, previous subs, full_desc, total_quantity, total_cost, year_week.  The hyperparameters were:  learning_rate = .005, hidden_units = [5], steps=1500, batch_size=100.

However, by using far fewer features (only:  the essays, previous subs, full_desc, total_cost), the best AUC was .764.  Thus, those are likely sufficient features.  Hyperparameters were:  learning_rate = .005, hidden_units = [2], steps=1500, batch_size=100, dropout=.1.

*Other Types of Experiments*

I wasn't sure if joining multiple text features together as one feature vs keeping them as separate features would make a difference or not.  Smashing ALL the text features together into a single feature indeed had a negative effect on the models.  However, combining only the two essays into a single big essay had only a slightly negative effect (if any at all since the slight difference in AUC may have just been due to randomness).

Since most applications get accepted, I was curious if using a vocabulary_list based on only the *rejected* entries would do anything.  It turns out that this had only a minimal negative impact as well.  (However, using only the words unique to the rejected set, that is, the rejected set's vocabulary minus the accepted set's vocabulary, was a disaster, resulting in a .57 AUC.)
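
To make the two vocabulary variants concrete, here is a small sketch using Python sets.  The toy sentences and the `build_vocabulary` helper are made up for illustration, and splitting on whitespace is a simplifying assumption that stands in for the text cleaning done earlier in the notebook:

```python
def build_vocabulary(texts):
    """Collect the unique lowercase words across a list of strings."""
    vocabulary = set()
    for text in texts:
        vocabulary.update(text.lower().split())
    return vocabulary


# Toy stand-ins for the essay text of accepted and rejected entries.
accepted_texts = ["my students need books", "students need tablets for reading"]
rejected_texts = ["we need stuff", "my students need things"]

accepted_vocab = build_vocabulary(accepted_texts)
rejected_vocab = build_vocabulary(rejected_texts)

# Variant 1: every word that appears in a rejected entry.
print(sorted(rejected_vocab))

# Variant 2: only words that never appear in an accepted entry (set difference).
rejected_only_vocab = rejected_vocab - accepted_vocab
print(sorted(rejected_only_vocab))
```

Even in this toy case the second variant throws away the common words ("students", "need"), which hints at why it left the real model with so little to work with.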

**VI. Submitting Predictions**
<a id="submit"></a>

I've decided to go with the Linear Classifier model, using these features:  school_state, project_essay_1, project_essay_2, previous_submissions, full_desc, total_cost, year_week.  (Though I probably don't need to include school_state or year_week.)  And these were the hyperparameters used:  steps = 1500, batch_size = 100, learning_rate = .05.  This tends to yield an AUC a little above .76. 

And I'll submit one for the DNN Classifier model, too.  That will use the same features as the Linear Classifier.  And the hyperparameters are:  learning_rate = .005, hidden_units = [4], steps=1500, batch_size=100, dropout=.1.  This also tended to yield an AUC a little above .76.

I also ran an experiment.  The first two submissions were trained on 80% of the data and validated on the other 20%.  For the last two submissions, I tested what would happen if the models were trained on 100% of the data with no validation set (since the features/hyperparameters had effectively been validated through the previous experiments).  This resulted in a minimally better AUC.

Finally, I ran various tests of both model types using the new spelling/typing error features.  However, the increase in AUC was minimal:  the best AUC ever was .773, with *every single feature* included.  Thus, the new error features didn't add much and only increased the model size.

**1. Making Predictions on the Test Dataset**
<a id="predictions"></a>

Since the combined_test_dataset has been processed and saved (and loaded) in the same way as the other data via TFRecord, we just need to create an input function and call .predict() on the returned classifier:

In [52]:
#https://www.kaggle.com/skleinfeld/getting-started-with-the-donorschoose-data-set
predict_test_input_fn = lambda: tfrecord_input_fn(TEST_TFRECORD, 
                                    num_epochs=1, 
                                    shuffle=False)

if not USING_OLD_MODELS:
    if not IS_DNN_CLASSIFIER:
        predictions_generator = classifier.predict(input_fn=predict_test_input_fn)
    else:
        predictions_generator = dnn_classifier.predict(input_fn=predict_test_input_fn)
        
    predictions_list = list(predictions_generator)
    probabilities = [p["probabilities"][1] for p in predictions_list]
    print('Now have the probabilities.')

**2. Convert to .csv**
<a id="csv"></a>

This competition requires the submission to be a .csv file with entries of the form *id*, *probability prediction*.  I was originally worried that I would also have to save the 'id' value in the test dataset's TFRecord and then [load in all those values](https://stackoverflow.com/questions/37151895/tensorflow-read-all-examples-from-a-tfrecords-at-once/44879011#44879011) to complete this bizarre full circle:  .csv -> Pandas -> TFRecord -> Pandas -> .csv.  Fortunately, that's not the case.  Since the TFRecord file was made row-by-row from the Pandas data (and the test input function reads it with shuffle=False), we can just take the Pandas column as is:  its order matches that of the probabilities array.

Thus, we can just make a new Pandas dataframe using those two arrays:

In [53]:
submission_values = pd.DataFrame({'id': combined_test_dataset['id'], 'project_is_approved': probabilities})
submission_values[0:5]

And turn that into a .csv file:

In [None]:
submission_values.to_csv('linear-spelling.csv', index=False)
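
Before submitting, a quick sanity check of the file can catch ordering or range mistakes.  This is only a sketch using the standard library's csv module:  `check_submission` is a hypothetical helper, the toy ids are made up, and the two column names are taken from the dataframe built above:

```python
import csv
import io


def check_submission(csv_file, expected_ids):
    """Validate header, id order, and probability range of a submission file.

    Hypothetical helper for illustration.  Returns the number of data rows,
    or raises ValueError describing the first problem found.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != ["id", "project_is_approved"]:
        raise ValueError("unexpected header: {}".format(reader.fieldnames))
    rows = list(reader)
    if [row["id"] for row in rows] != list(expected_ids):
        raise ValueError("ids are missing, duplicated, or out of order")
    for row in rows:
        probability = float(row["project_is_approved"])
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability out of range for id {}".format(row["id"]))
    return len(rows)


# Toy file contents standing in for the real submission.
toy_csv = io.StringIO("id,project_is_approved\np0001,0.91\np0002,0.42\n")
print(check_submission(toy_csv, ["p0001", "p0002"]))
```

With the real file, the same function could be called on `open('linear-spelling.csv', newline='')` with `list(combined_test_dataset['id'])` as the expected ids.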

**3. Submit to Kaggle**
<a id="submitfile"></a>

Then click the "Commit & Run" button.  Once that's finished, on your kernel's "homepage" is an "Output" tab where you can click on the "Submit to Competition" button to submit your .csv file.