My Predictive Analytics project for General Assembly Data Science course using NaNoWriMo data

Predicting NaNoWriMo Winners

Objective

Every year in November, writers all around the world participate in National Novel Writing Month (NaNoWriMo) and try to write 50,000 words of a novel within 30 days. They track their word count progress on the NaNoWriMo website where they may also donate to the writing cause, join 'Regions' for writing camaraderie, and display the summary of their novel in progress. Those writers who write 50,000 words before the end of November are declared 'Winners'.

My goal is to create a machine learning model that can predict whether a participating writer will be a NaNoWriMo 'winner' using data from the site.

Motivation

I love writing and I enjoy participating in NaNoWriMo. This idea stems from another personal project: creating my own Word Count Tracker to track how much I write over time, similar to the cumulative word count graph displayed on each writer's novel profile every NaNoWriMo.

I wanted to take it a step further and also visualize the aggregate word count progress of a region and the whole site.

Imgur Imgur

Visualizing writing progress can motivate one to write more and reach one's writing goals! I hope this predictive model may help other writers and future NaNoWriMo participants improve their writing strategies, continue to write, and finish their novels.

The Data

The data I will use to construct this model is user data and novel data from the website. This includes usernames, novel titles, word count, and 'Winner' labels.

NaNoWriMo vocabulary

Some NaNoWriMo vocabulary to understand:

Writer - A NaNoWriMo.org user who is participating in a current NaNoWriMo contest.

Win - When a writer reaches the 50,000 word count goal for their novel and validates this word count with the NaNoWriMo website.

Word Count/Word Count Submission - For a novel or a submission to that novel, the number of words recorded to have been written.

Submission - The act of updating the word count for a novel. During a contest, if there is no update for a novel on a given day, the word count submission for that novel is recorded as 0 and the total word count for the novel remains the same. NOTE: A writer may update the word count for their novel multiple times a day. The site will not record the updates until the end of the day. The aggregate of these updates is the submission.

Contest - A NaNoWriMo event. That is, when the NaNoWriMo site opens and writers may create a novel profile and begin writing and adding submissions.

Donation/Donor - If a user makes a monetary donation to the NaNoWriMo organization and their mission, they are marked as a 'donor' on the site. NaNoWriMo does not disclose the amount the user donated, just that they are a donor. NOTE: A user may donate without being a writer. But for the purposes of this project, those users don't exist in this data set :)

Municipal Liaison - Taken from the NaNoWriMo website: "Municipal Liaisons (MLs) are volunteers who add a vibrant, real-world aspect to NaNoWriMo festivities all over the world." These writers are particularly involved NaNoWriMo users :D

Sponsorship - Writers may have their novels sponsored, with the sponsor money going to further the NaNoWriMo mission.

Novel - A writer's 'entry' in the NaNoWriMo contest - the thing they commit to writing during the contest. NOTE: 'Novels' may not actually be novels. Writers may choose to write memoirs, non-fiction, movie scripts, etc.

Scraping NaNoWriMo Data

I created a script utilizing the site's Word Count API to get word count submission history.
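As a sketch, a minimal version of such a script might look like the following. The endpoint path and the XML element names (`wchistory`, `wcentry`, `wcdate`, `wc`) are assumptions for illustration, not confirmed details of the API:

```python
import urllib.request
import xml.etree.ElementTree as ET

# NOTE: this endpoint path is an assumption; the real Word Count API
# route and response schema may differ.
API_URL = "http://nanowrimo.org/wordcount_api/wchistory/{username}"

def fetch_history(username):
    """Download a writer's word count history as raw XML."""
    with urllib.request.urlopen(API_URL.format(username=username)) as resp:
        return resp.read().decode("utf-8")

def parse_history(xml_text):
    """Turn an API response into (date, wordcount) pairs.

    Assumes entries shaped like
    <wcentry><wcdate>...</wcdate><wc>...</wc></wcentry>.
    """
    root = ET.fromstring(xml_text)
    return [(entry.findtext("wcdate"), int(entry.findtext("wc") or 0))
            for entry in root.iter("wcentry")]

# e.g. parse_history(fetch_history("nicaless"))
```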

The trouble is, the NaNoWriMo API, as far as I know, only gets data from the most recent contest, in this case, November 2015. This was not enough to make much of an interesting model.

Other data I wanted to incorporate in the model include a user's past daily word count averages, number of novels started, novel synopses, and whether or not they've donated to the NaNoWriMo cause.

Luckily, all the data I wanted was available on the NaNoWriMo website, but I wasn't about to click through 500+ user profiles manually entering information into a spreadsheet to get all of it.

I used Kimono Labs to scrape most of the qualitative user data including usernames, whether they're a donor or even a volunteer Municipal Liaison for the site, whether their novels are sponsored, and the names of all their past novels. I was also able to get some quantitative data such as how long they've been a NaNoWriMo member, their lifetime word count, and which years they've participated.

Below is a snapshot of Kimono Labs' point-and-click interface capturing the data from a NaNoWriMo profile page.
Imgur

However, I wasn't able to get the word count data from past NaNoWriMos using Kimono Labs. That data is presented on each novel's stats page as a bar graph rendered by JavaScript. Kimono can't parse JavaScript.

Imgur

I researched a few different ways to parse JavaScript using Python, but then I realized I only needed a single line of the JavaScript code that stored the data points for the graph. I read the HTML document for each novel profile page as a regular text document and grabbed the line I needed.
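That line-grab approach can be sketched like this; the JavaScript variable name (`rawCamp` in the usage note) is a stand-in for whatever the stats page actually calls it:

```python
import json
import re

def extract_js_array(html, var_name):
    """Scan raw HTML line by line for 'var <var_name> = [...]' and
    return the array it holds as a Python list.

    The variable name on the real stats page is an assumption here.
    """
    pattern = r"var\s+%s\s*=\s*(\[.*?\])" % re.escape(var_name)
    for line in html.splitlines():
        match = re.search(pattern, line)
        if match:
            return json.loads(match.group(1))
    return None
```

Usage would look like `extract_js_array(page_text, "rawCamp")` after reading a novel page's HTML as plain text.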

I also wanted to extract novel synopses and excerpts, but I ran into some difficulties using Kimono to grab the large amount of text from each novel profile page. I decided it was time to switch tools.

Imgur

With Beautiful Soup it was really easy to navigate the HTML structure of the novel profile page, and to find the tags and attributes for the text data I needed.
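A minimal sketch of that kind of extraction follows; the tag and class names (`novel-synopsis`, `novel-excerpt`) are placeholders, not the page's real markup:

```python
from bs4 import BeautifulSoup

def scrape_novel_text(html):
    """Pull the synopsis and excerpt out of a novel profile page.

    The class names below are stand-ins; the real ones were found by
    inspecting the page's HTML structure.
    """
    soup = BeautifulSoup(html, "html.parser")
    synopsis = soup.find("div", class_="novel-synopsis")
    excerpt = soup.find("div", class_="novel-excerpt")
    return (synopsis.get_text(strip=True) if synopsis else "",
            excerpt.get_text(strip=True) if excerpt else "")
```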

With all the data I needed, the next step was to process and aggregate all the information for analysis.

Scraping Script Guide

The following is a description of each iPython script used to scrape data.

GetCurrentContestStats - Utilizes the NaNoWriMo API to get data from the most recent contest

ScrapeNovelSynopses - Uses Beautiful Soup to scrape each novel synopsis

ScrapeNovelSynopses - Uses Beautiful Soup to scrape each novel excerpt

ScrapeWCSubmissions - Parses the HTML file for a JavaScript variable that contains information about daily word count submissions for each novel

Raw Data Guide

| Data | Description | Source |
|---|---|---|
| User Names | A list of writers' usernames | Hand collected |
| Novel Pages | A list of novels by the selected writers | Kimono Labs API |
| Novel WC Info | Word count stats for each of the novels | The ScrapeWCSubmissions script |
| Novel Names, Urls, Dates | The novels with their respective NaNoWriMo page urls and the date they were entered into a NaNoWriMo contest | Kimono Labs API |
| Novel Meta Data | Contains more information about the novels | Kimono Labs API |
| Basic User Profile Information | A writer's username, their lifetime word count, and how long they have been a NaNoWriMo member | Kimono Labs API |
| Fact Sheets | Various information a writer could share about their age, occupation, location, hobbies, sponsorship, or role as a Municipal Liaison for NaNoWriMo | Kimono Labs API |
| Participation Information | The past years a writer has participated in NaNoWriMo and whether they were winners or donors in that year | Kimono Labs API |

Data Processing

After scraping all the data, the task at hand was to aggregate the information.

Extracting Numeric Data from Novel Text Data

I had the following information on each of the novels of each writer.

Novel Meta Data - Contains the name of the novel, the writer, the genre, the final word count, the daily average word count, and whether or not it was a winning novel

Novel Word Count Info - Basic statistics calculated for each novel

I merged these files on the novel name and also appended each novel's synopsis and excerpt to create a novel_data.csv file.

There is also a great deal of information in the text data for each novel - the genre, synopsis, and excerpt. I hypothesize that if a writer is well-prepared for NaNoWriMo, they will have a clear genre chosen for their novel, and their novel profile will have a well-written synopsis and excerpt - signs that their novel idea is fleshed out and they've done some planning before the contest starts.

From the text data, I extracted numeric data such as the number of words, unique words, paragraphs, and sentences in a synopsis and excerpt. I also calculated a reading score for the synopsis and excerpt, and classified the genre of each novel as standard (fits into the usual novel genres such as Fiction, Historical, Young Adult) or non-standard (the novel hasn't been given a genre yet, it's a more obscure genre, or a combination of different genres).
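These features can be computed roughly as follows. This is a from-scratch sketch: the counting rules and the vowel-group syllable heuristic feeding the Flesch Reading Ease formula are my assumptions, not necessarily what the project's scripts did:

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of vowels; every word has at least one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_features(text):
    """Numeric features for a synopsis or excerpt: word, unique-word,
    sentence, and paragraph counts plus a Flesch Reading Ease score
    (one way to compute a 'reading score')."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not words or not sentences:
        return {"num words": 0, "num uniques": 0, "num sentences": 0,
                "paragraphs": 0, "fk score": 0.0}
    syllables = sum(count_syllables(w) for w in words)
    fk = (206.835 - 1.015 * len(words) / len(sentences)
          - 84.6 * syllables / len(words))
    return {"num words": len(words),
            "num uniques": len(set(w.lower() for w in words)),
            "num sentences": len(sentences),
            "paragraphs": len(paragraphs),
            "fk score": round(fk, 2)}
```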

I then appended this data to the other novel data in another novel_features file.

Aggregating Writer Data

In addition to their novels, I had the following raw data about each writer:

Basic User Profile Data - A writer's username, their lifetime word count, and how long they have been a NaNoWriMo member

Fact Sheets - Various information a writer could share about their age, occupation, location, hobbies, sponsorship, or role as a Municipal Liaison for NaNoWriMo

Participation Data - The past years a writer has participated in NaNoWriMo and whether they were winners or donors in that year.

After a bit of cleaning, I merged the data in these files by writers' usernames.

Now, I needed to somehow aggregate the major novel data for each writer and merge it with the other writer data.

There were two different ways I aggregated the data. In one way I took typical averages of the novel word count statistics. In the other, I excluded novels created in the most current NaNoWriMo contest (November 2015). I wanted to use these novels as the target of my predictions. That is, I wanted to use the writers' past novels up to November 2014 to predict whether the novels of November 2015 would be 'winning novels' for the writer. Thus, there are two similarly named 'user_summary' files.

For the user_summary file, certain statistics (eg. Expected Final Word Count, Expected Daily Average) take into account data from NaNoWriMo November 2015. The other file with '_no2015' appended to the file name has the November 2015 information excluded from those statistics.
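The two aggregations can be sketched with pandas on toy data. The column names follow the data dictionary below, but the real scripts compute many more statistics:

```python
import pandas as pd

# Toy rows standing in for the merged novel data (novel_data.csv)
novels = pd.DataFrame({
    "Writer Name": ["a", "a", "b"],
    "Novel Date": ["November 2014", "November 2015", "November 2015"],
    "Final Word Count": [50200, 31000, 52000],
    "Daily Average": [1673, 1033, 1733],
})

def summarize(df, exclude_2015=False):
    """Average each writer's novel stats; exclude_2015=True mirrors the
    '_no2015' file by holding out the November 2015 novels."""
    if exclude_2015:
        df = df[df["Novel Date"] != "November 2015"]
    summary = df.groupby("Writer Name")[["Final Word Count", "Daily Average"]].mean()
    summary.columns = ["Expected Final Word Count", "Expected Daily Average"]
    return summary

user_summary = summarize(novels)
user_summary_no2015 = summarize(novels, exclude_2015=True)
```

Note that a writer whose only novel is from November 2015 drops out of the `_no2015` summary entirely, as there is no past data to average.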

Processing Script Guide

The following is a description of each iPython script used to clean and process the raw data.

FactSheetParser - Parses the raw Fact Sheets data

ParseMemberLength - Cleans member length data in the raw Basic User Profile Data

AppendParticipationData/AppendParticipationData_negate2015 - Two similar scripts that parse the raw Participation Data and append the results to other writer data (Basic Info, Fact Sheets)

AggregateNovelStatsData/AggregateNovelStatsData_negate2015 - Two similar scripts that aggregate novel word count statistics and append the results to other writer data (Basic Info, Fact Sheets)

AggregateFinalandDailyAvgs/AggregateFinalandDailyAvgs_negate2015 - Two similar scripts that aggregate the final word count and daily average of novels and append the results to other writer data (Basic Info, Fact Sheets)

CalculateTextFeaturesandReadingScore - Classifies a novel's genre as standard or nonstandard, and extracts the number of words, unique words, sentences, and paragraphs, and the reading score of novel synopses

CalculateReadingScoreExcerpts - Extracts the number of words, unique words, sentences, and paragraphs, and the reading score of novel excerpts

Data Dictionaries

Writers - About the Data

Contains basic profile information about each writer and their past NaNoWriMo statistics.

There are 501 rows and 41 columns.

The data may be found here.

Writers - Data Dictionary

Writer Name - The writer's NaNoWriMo username

Member Length - The number of years a writer has been a NaNoWriMo user

LifetimeWordCount - The total number of words a writer has written over all NaNoWriMo contests

url - The url to the writer's profile on NaNoWriMo.org

Age - The age of the writer

Birthday - The birthday of the writer

Favorite books or authors - The writer's recorded favorite books or authors

Favorite noveling music - The writer's favorite music to listen to while writing

Hobbies - The writer's recorded hobbies

Location - The location from where the writer is writing

Occupation - The writer's recorded occupation

Primary Role - If the writer is a "Municipal Liaison" for NaNoWriMo, it is recorded here

Sponsorship URL - If the writer's novel is sponsored, a sponsorship url is recorded here

Expected Final Word Count - The average of the final word count for all a writer's novels

Expected Daily Average - The average of the daily average word count for all a writer's novels

CURRENT WINNER - Indicates whether the writer is a winner of the "current" or "next" NaNoWriMo (November 2015)

Current Donor - Indicates whether the writer is a donor of the "current" or "next" NaNoWriMo (November 2015)

Wins - The number of past wins for a writer. Wins cannot be greater than Participated.

Donations - The number of past donations for a writer. Donations cannot be greater than Participated.

Participated - The number of past NaNoWriMo contests in which the writer was a participant

Consecutive Donor - The maximum number of consecutive contests for which the writer has donated

Consecutive Wins - The maximum number of consecutive contests for which the writer has won

Consecutive Part - The maximum number of consecutive contests for which the writer has participated

Part Years - A list of years for which the writer has participated in NaNoWriMo

Win Years - A list of years for which the writer has won

Donor Years - A list of years for which the writer has donated

Num Novels - The number of novels which a writer has entered into NaNoWriMo

Expected Num Submissions - The average, over all a writer's novels, of the number of word count submissions entered for a novel

Expected Avg Submission - The average, over all a writer's novels, of the average number of words entered in all word count submissions for a novel

Expected Min Submission - The average, over all a writer's novels, of the minimum number of words entered in all word count submissions for a novel

Expected Min Day - The average day (from 1-30), over all contests a writer participated, on which the writer entered the minimum number of words

Expected Max Submission - The average, over all a writer's novels, of the maximum number of words entered in all word count submissions for a novel

Expected Max Day - The average day (from 1-30), over all contests a writer participated, on which the writer entered the maximum number of words

Expected Std Submissions - The average, over all a writer's novels, of the standard deviation of the number of words entered for all word count submissions for a novel

Expected Consec Subs - The average, over all a writer's novels, of the number of consecutive submissions (at least 2 submissions in a row) entered for a novel

FW Total - For the current NaNoWriMo, the total word count of a novel in the first week of the contest

FW Sub - For the current NaNoWriMo, the number of word count submissions to a novel in the first week of the contest

FH Total - For the current NaNoWriMo, the total word count of a novel written in the first half of the contest

FH Sub - For the current NaNoWriMo, the number of word count submissions to a novel in the first half of the contest

SH Total - For the current NaNoWriMo, the total word count of a novel written in the second half of the contest

SH Sub - For the current NaNoWriMo, the number of word count submissions to a novel in the second half of the contest

Novels - About the Data

Contains basic profile information about each novel and its word count statistics.

There are 2122 rows and 9 columns.

The data may be found here.

Novels - Data Dictionary

Writer Name - The writer of the novel

Novel Name - The title of the novel

Genre - The genre of the novel

Final Word Count - The final recorded word count for the novel

Daily Average - The average recorded word count of the novel over the 30 day period of its contest

Winner - Indicates whether the novel is a winning novel (reached 50,000 words) during its contest

Synopses - The novel synopsis

url - The url of the novel's stats page

Novel Date - The date of the contest for which the novel was written

Excerpt - The novel excerpt

Novel Numeric Features - About the Data

Contains numeric data representing each novel's genre, synopsis, and excerpt.

There are 2122 rows and 23 columns.

The data may be found here.

Novel Numeric Features - Data Dictionary

Note: There are some columns that are duplicates from the novel_data file, so they will not be redefined here.

has genre - 0 if the novel has no given genre. 1 otherwise.

standard genre - 1 if the novel's given genre is one of the following "usual" genres: __ . 0 otherwise.

has_synopses - 0 if the novel has no synopsis. 1 otherwise.

num words - The number of words in a novel's synopsis.

num uniques - The number of unique words in a novel's synopsis.

num sentences - The number of sentences in a novel's synopsis.

paragraphs - The number of paragraphs in a novel's synopsis.

fk score - The Flesch-Kincaid score of the novel's synopsis.

has excerpt - 0 if the novel has no excerpt. 1 otherwise.

num words excerpt - The number of words in a novel's excerpt.

num uniques excerpt - The number of unique words in a novel's excerpt.

num sentences excerpt - The number of sentences in a novel's excerpt.

paragraphs excerpt - The number of paragraphs in a novel's excerpt.

fk score excerpt - The Flesch-Kincaid score of the novel's excerpt.

Exploring the Data

After I had constructed the data set, I proceeded with exploring the data with Python and matplotlib visualizations.

Exploring the Writer data

Writer Name Member Length LifetimeWordCount url Age Birthday Favorite books or authors Favorite noveling music Hobbies Location ... Expected Max Submission Expected Max Day Expected Std Submissions Expected Consec Subs FW Total FW Sub FH Total FH Sub SH Total SH Sub
0 Nicaless 2 50919 http://nanowrimo.org/participants/nicaless 24 December 20 Ursula Le Guin, J.K. Classical, Musicals Reading, Video Games, Blogging, Learning San Francisco, CA ... 24935.0 28.000000 6235.712933 12.000000 6689 6 12486 9 11743 3
1 Rachel B. Moore 10 478090 http://nanowrimo.org/participants/rachel-b-moore NaN NaN 2666, Unaccustomed Earth, Exit Music, Crazy Lo... Belle and Sebastian, Elliott Smith, PJ Harvey,... Reading, volunteering, knitting, listening to ... San Francisco ... 3809.0 9.000000 1002.295167 6.800000 16722 7 24086 14 26517 14
2 abookishbabe 1 0 http://nanowrimo.org/participants/abookishbabe NaN April 2 Colleen Hoover, Veronica Roth, Jennifer Niven,... Tori Kelley Reading (DUH), Day dreaming, Going to Disneyla... Sacramento, CA ... NaN NaN NaN NaN 28632 1 29299 2 0 0
3 alexabexis 11 475500 http://nanowrimo.org/participants/alexabexis NaN NaN NaN Three Goddesses playlist Florence + the Machin... drawing, reading, movies & TV shows, comics, p... New York City ... 2325.0 8.545455 570.626795 8.090909 25360 7 38034 12 40766 9
4 AllYellowFlowers 3 30428 http://nanowrimo.org/participants/AllYellowFlo... NaN NaN Lolita, Jesus' Son, Ask the the sound of the coffeemaker cryptozoology Allston ... 2054.5 4.500000 538.273315 21.000000 1800 5 5300 10 5700 9

Wins and Losses for NaNoWriMo 2015

There are 219 winners and 282 nonwinners out of 501 writers, roughly a 4:3 ratio of nonwinners to winners. At first glance, winning is almost a coin-toss at 44%. One has nearly a 50/50 chance of guessing correctly whether or not a writer is a NaNoWriMo winner.

Lifetime Word Count vs Member Length

Imgur

Few writers have written more than 1,000,000 words (or 20 NaNoWriMo winning novels) over the course of their NaNoWriMo lifetime. The density of nonwinners for NaNoWriMo 2015 decreases as Member Length increases, and a higher Lifetime Word Count indicates a higher likelihood of winning. It makes sense that the longer one has been writing (Member Length) and the more words one has written (Lifetime Word Count), the more likely one is to reach the NaNoWriMo writing goal.

Expected Avg Submission vs Expected Daily Average

Imgur

It almost looks like there are clusters. If Expected Daily Average >= Expected Avg Submission, a writer is more likely to win. It's worth noting that the minimum daily average needed to win a NaNoWriMo contest is about 1,666 words (50,000 words / 30 days).

Number of Wins vs Number of times participated

Imgur

It looks like there may be possible clusters here as well. Writers who have already had more than 5 wins are very likely to win again. Also, writers who have participated more than 5-10 times have better chances of winning as well.

Expected Daily Average vs Expected Num Submissions

Imgur

Many writers seem to cluster around an Expected Daily Average of 1500-2000. 1,666 is the minimum daily average to win a NaNoWriMo contest. The higher an Expected Daily Average, the more likely a writer is to win the upcoming contest.

Also interesting is how the density of nonwinners decreases as Expected Num Submissions increases, so higher Expected Num Submissions may also be indicative of winning.

Distribution of Word Count Submissions in the early weeks of a contest

I wanted to look retrospectively at the latest NaNoWriMo contest and see how winners can be predicted as early as the first week or two of a contest.

Imgur

Imgur

As expected, writers who submit more often in the early weeks are more likely to win.

Average First Week Submissions vs Expected Daily Average

Imgur

Additionally, writers whose daily average in the first week is equal to or greater than the Expected Daily Average of their past novels are more likely to win.

Does being a Municipal Liaison or having a sponsored novel have an effect on winning?

Imgur

Municipal Liaisons, whom I've flagged with a binary variable (1 if they are an ML, 0 otherwise), are a small fraction of the total NaNoWriMo writer population, but the majority of these MLs turn out to be winners at the end of the month.

Imgur

Likewise, very few writers have sponsors for their novel.

The ratio of winners to nonwinners for those with sponsors is 2:1. The ratio of winners to nonwinners for those who are Municipal Liaisons is almost 6:1. It definitely seems like one is more likely to win if they are a Municipal Liaison or if their novel is sponsored!

Exploring the Novel data

Writer Name Novel Name Genre Final Word Count Daily Average Winner Synopses url Novel Date Excerpt
0 Nicaless Novel: Lauren's Birthday Genre: Young Adult 24229 807 0 \n<p></p>\n http://nanowrimo.org/participants/nicaless/nov... November 2015 \n<p></p>\n
1 Nicaless Novel: A Mystery in the Kingdom of Aermon Genre: Fantasy 50919 1,697 1 \n<p>Hitoshi is appointed the youngest Judge a... http://nanowrimo.org/participants/nicaless/nov... November 2014 \n<p>This story, funnily enough, started out a...
2 Rachel B. Moore Novel: Finding Fortunato Genre: Literary 50603 1,686 1 \n<p>Sam and Anna Gold and their newly adoptiv... http://nanowrimo.org/participants/rachel-b-moo... November 2015 \n<p></p>\n
3 Rachel B. Moore Novel: The Residency Genre: Literary 50425 1,680 1 \n<p>It's every writer's dream - an all-expens... http://nanowrimo.org/participants/rachel-b-moo... November 2014 \n<p></p>\n
4 Rachel B. Moore Novel: The Jew From Fortunato Genre: Literary Fiction 41447 1,381 0 \n<p>20-something Andre Levinsky is a fish out... http://nanowrimo.org/participants/rachel-b-moo... November 2013 \n<p></p>\n

Overall Wins and Losses

The total number of novels in this sample is 2123: 1333 winners and 790 nonwinners, a 63/37 split. It's interesting that there are more winning novels than nonwinning novels, while there are fewer winning writers than nonwinning writers for the most recent NaNoWriMo. But this makes sense: writers who write more novels are more likely to have their novels reach the 50,000 word goal.

Text Features

Winner Novel Date has genre standard genre has_synopses num words num uniques num sentences paragraphs fk score has excerpt num words excerpt num uniques excerpt num sentences excerpt paragraphs excerpt fk score excerpt
0 0 November 2015 1 1 0 0 0 0 0 0.00 0 0 0 0 0 0.00
1 1 November 2014 1 1 1 44 42 3 1 65.73 1 132 96 13 7 78.25
2 1 November 2015 1 1 1 153 109 7 4 58.62 0 0 0 0 0 0.00
3 1 November 2014 1 1 1 59 51 4 3 65.73 0 0 0 0 0 0.00
4 0 November 2013 1 0 1 124 93 4 1 56.93 0 0 0 0 0 0.00

Imgur

The average length of a synopsis is 50 words, or about a few good sentences. This is likely skewed by the fact that 729 novels, more than a third, don't have a synopsis at all. There are few novels with synopses longer than 100 words, but as synopses get longer, it seems more likely that they belong to a winning novel.

Imgur

```python
from scipy.stats import ttest_ind

ttest_ind(winlose['fk score'].get_group(0), winlose['fk score'].get_group(1))
# Ttest_indResult(statistic=-1.4376558464994371, pvalue=0.1506792394358735)
```

The Flesch-Kincaid reading scores look approximately normally distributed for this sample of novel synopses, for both winners and nonwinners. In a t-test comparing the two groups, the resulting p-value is greater than 10%. This means I cannot reject the null hypothesis that winning and non-winning novels have equal average Flesch-Kincaid scores. The Flesch-Kincaid score of a novel synopsis is unlikely to be indicative of a winning novel.

Imgur

Plotting the reading score of synopses against synopsis length produces this garbled mess. It may be hard to predict winning novels with these features...

Logistic Regression

As the variable I want to predict is binary (1 if a writer is a winner, 0 otherwise), I decided to use logistic regression as my prediction model.

After extracting only the numerical columns from the writer data and replacing any NaN entries - entries belonging to new writers who don't have data from past NaNoWriMos - with 0, I applied a Standard Scaler to normalize the data. I then performed an 80/20 split on the data - 400 observations for training and 101 observations for testing.

Cross-Validation Score

I created 10 different folds of the training data to train, test, and cross-validate a Logistic Regression model. The average cross-validation score was .86. This is a promising indication that this model does very well in predicting a winning or non-winning outcome for a writer.
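On stand-in data, the preprocessing and cross-validation steps look roughly like this with scikit-learn (the real feature matrix has 501 writers and far more columns; the features and labels here are synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the numeric writer features
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(200, 5),
                 columns=["Member Length", "LifetimeWordCount", "Wins",
                          "Participated", "Expected Daily Average"])
y = (X["Expected Daily Average"] + 0.5 * rng.randn(200) > 0).astype(int)

# New writers have no history: fill NaNs with 0, then normalize.
X_scaled = StandardScaler().fit_transform(X.fillna(0))

# 80/20 train/test split, then 10-fold cross-validation on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=1)
model = LogisticRegression()
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
model.fit(X_train, y_train)
print("mean CV accuracy:", cv_scores.mean())
print("test accuracy:", model.score(X_test, y_test))
```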

Confusion Matrix and Classification Report

After cross-validating on just the training data, I re-trained the model on the entire training data set and then used the model to predict the outcomes for the writers in the test data set. Comparing the model's predictions with the actual outcomes, I obtained the following confusion matrix and classification report.

|  | Predicted Class 0 | Predicted Class 1 |
|---|---|---|
| Actual Class 0 | 46 | 9 |
| Actual Class 1 | 8 | 38 |

|  | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.85 | 0.84 | 0.84 | 55 |
| 1 | 0.81 | 0.83 | 0.82 | 46 |
| avg/total | 0.83 | 0.83 | 0.83 | 101 |

Nine non-winners were misclassified as winners and eight winners as non-winners. The Logistic Regression correctly identified the winners and nonwinners in the test data with about 83% accuracy, as illustrated by its precision, recall, and F1-scores.

ROC Curve

In plotting the ROC curve for the model, I found the area under the curve was about .9, pretty close to an ideal area of 1.

Imgur

It seems like it's a pretty good model!
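Computing the curve and its area with scikit-learn looks roughly like this, on toy scores standing in for the model's predicted probabilities on the test set:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy labels and predicted probabilities standing in for the test set
y_test = np.array([0, 0, 0, 1, 1, 1, 0, 1])
probs = np.array([0.1, 0.3, 0.6, 0.4, 0.8, 0.9, 0.2, 0.7])

# False positive rate and true positive rate at each score threshold
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
# plt.plot(fpr, tpr) would then draw the curve
```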

Visualize the results of the Logistic Regression with PCA

There are a lot of features in this data set, so I used Principal Component Analysis to decompose the data and easily visualize where the winners and non-winners fall on a two-dimensional plane.
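The projection onto two components can be sketched as follows (random stand-in features, not the real writer data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.randn(100, 30)  # stand-in for the scaled writer features

# Project the high-dimensional features onto the first two principal
# components so winners and non-winners can be plotted on a plane.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_train)
# plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train) would color by outcome
```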

Imgur

Above are the first and second principal components of the training data set, colored by the winners and nonwinners.

Imgur

Above is how the Logistic Regression splits the decomposed test data. Comparing it with the actual results of the test data below, the Logistic Regression did very well generalizing the data and sorting out the winners and nonwinners of NaNoWriMo.

Imgur

Pleased with the results of the Logistic Regression model, I then similarly trained a Decision Tree on the features to compare the two methods.

It also performed very well in predicting winners and nonwinners, achieving similar scores for cross-validation, precision, and recall.

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 46 | 9 |
| Actual 1 | 7 | 39 |

|  | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.87 | 0.84 | 0.85 | 55 |
| 1 | 0.81 | 0.85 | 0.83 | 46 |
| avg/total | 0.84 | 0.85 | 0.84 | 101 |

The Decision tree found the following features to be the most important.

|  | Feature | Importance |
|---|---|---|
| 23 | FH Total | 0.712847 |
| 1 | LifetimeWordCount | 0.057194 |
| 24 | FH Sub | 0.045414 |
| 14 | Expected Avg Submission | 0.039117 |
| 11 | Consecutive Part | 0.036241 |

FH Total - the total word count of a writer's novel submitted in the first half of the contest - is the most predictive feature of winning by a long shot, but this is a metric collected after the current contest has started. For next steps, I want to build a model with just the information I have from past contests.
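Feature importances come straight off a fitted tree. Here is a toy sketch in which the first feature is constructed to dominate, loosely mirroring how FH Total dominates above:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
features = ["FH Total", "LifetimeWordCount", "FH Sub"]
X = pd.DataFrame(rng.randn(200, 3), columns=features)
y = (X["FH Total"] > 0).astype(int)  # make the first feature dominant

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Rank features by their share of the tree's impurity reduction
ranked = (pd.Series(tree.feature_importances_, index=features)
            .sort_values(ascending=False))
```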

Using Fewer Features and Applying Other Models

I excluded the features relevant to the current contest - the word counts and submission counts recorded in the first week, first half, and second half. I then re-applied the Logistic Regression model.

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 48 | 7 |
| Actual 1 | 22 | 24 |

|  | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.69 | 0.87 | 0.77 | 55 |
| 1 | 0.77 | 0.52 | 0.62 | 46 |
| avg/total | 0.73 | 0.71 | 0.70 | 101 |

Imgur

The difference in accuracy between this model and the previous one, which included the current contest data, is about 10%.

I then compared the results against other models.

Naive Bayes

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 48 | 7 |
| Actual 1 | 26 | 20 |

|  | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.65 | 0.87 | 0.74 | 55 |
| 1 | 0.74 | 0.43 | 0.55 | 46 |
| avg/total | 0.69 | 0.67 | 0.65 | 101 |

Naive Bayes is not as accurate as Logistic Regression in this case.

SVM

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 49 | 6 |
| Actual 1 | 20 | 26 |

|  | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.71 | 0.89 | 0.79 | 55 |
| 1 | 0.81 | 0.57 | 0.67 | 46 |
| avg/total | 0.76 | 0.74 | 0.73 | 101 |

This Support Vector Machine does a little bit better than the Logistic Regression.

Decision Tree

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 42 | 13 |
| Actual 1 | 22 | 24 |

|  | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.66 | 0.76 | 0.71 | 55 |
| 1 | 0.85 | 0.52 | 0.58 | 46 |
| avg/total | 0.65 | 0.65 | 0.65 | 101 |

The Decision Tree did not do as well this time without data from the current contest.

|  | Feature | Importance |
|---|---|---|
| 3 | Expected Final Word Count | 0.356964 |
| 13 | Expected Num Submissions | 0.126865 |
| 1 | LifetimeWordCount | 0.124095 |
| 0 | Member Length | 0.070148 |
| 2 | Age | 0.067985 |

This time, the most important feature is Expected Final Word Count, or a writer's average final word count over all his or her past NaNoWriMos.

Random Forests

I trained the data on a Random Forest which yielded the following results.

|  | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 48 | 7 |
| Actual 1 | 18 | 28 |

|  | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.73 | 0.87 | 0.79 | 55 |
| 1 | 0.80 | 0.61 | 0.69 | 46 |
| avg/total | 0.76 | 0.75 | 0.75 | 101 |

Random Forests and Support Vector Machines do best in predicting winners and nonwinners when excluding data from the current contest.

While it's possible to use past data to predict the outcome of a contest with a good degree of accuracy, including data from the first few weeks after a contest starts improves accuracy greatly.

Interestingly, many actual winners were predicted not to win by this second model. Their past NaNoWriMo data did not show much promise that they would win again in the coming NaNoWriMo, but a strong showing in the first few weeks of the contest turned their end outcome around.

Other Experiments

Modeling Novel Data

I wanted to attempt to predict which novels will be winning novels based on what little I know about them: their genre, synopsis, and excerpt.

|   | Winner | Novel Date    | has genre | standard genre | has_synopses | num words | num uniques | num sentences | paragraphs | fk score | has excerpt | num words excerpt | num uniques excerpt | num sentences excerpt | paragraphs excerpt | fk score excerpt |
|---|--------|---------------|-----------|----------------|--------------|-----------|-------------|---------------|------------|----------|-------------|-------------------|---------------------|-----------------------|--------------------|------------------|
| 0 | 0      | November 2015 | 1         | 1              | 0            | 0         | 0           | 0             | 0          | 0.00     | 0           | 0                 | 0                   | 0                     | 0                  | 0.00             |
| 1 | 1      | November 2014 | 1         | 1              | 1            | 44        | 42          | 3             | 1          | 65.73    | 1           | 132               | 96                  | 13                    | 7                  | 78.25            |
| 2 | 1      | November 2015 | 1         | 1              | 1            | 153       | 109         | 7             | 4          | 58.62    | 0           | 0                 | 0                   | 0                     | 0                  | 0.00             |
| 3 | 1      | November 2014 | 1         | 1              | 1            | 59        | 51          | 4             | 3          | 65.73    | 0           | 0                 | 0                   | 0                     | 0                  | 0.00             |
| 4 | 0      | November 2013 | 1         | 0              | 1            | 124       | 93          | 4             | 1          | 56.93    | 0           | 0                 | 0                   | 0                     | 0                  | 0.00             |
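Columns like num words, num uniques, num sentences, and fk score might be computed roughly as below. This assumes the fk score is a Flesch Reading Ease value; the syllable counter here is a crude vowel-run heuristic, so the real pipeline's scores would differ:

```python
import re

def text_features(text):
    """Rough versions of the synopsis/excerpt features in the table above."""
    if not text:
        return {"num_words": 0, "num_uniques": 0, "num_sentences": 0, "fk_score": 0.0}
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Crude syllable estimate: runs of vowels per word, at least one per word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    # Flesch Reading Ease formula
    fk = 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))
    return {"num_words": len(words),
            "num_uniques": len(set(w.lower() for w in words)),
            "num_sentences": len(sentences),
            "fk_score": round(fk, 2)}

print(text_features("A young wizard discovers a hidden library. It changes everything."))
```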

However, the results for my Logistic Regression were lackluster.

Cross-Validation Score

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 1           | 158         |
| Actual 1 | 0           | 266         |

|           | Precision | Recall | F1-Score | Support |
|-----------|-----------|--------|----------|---------|
| 0         | 1.00      | 0.01   | 0.01     | 159     |
| 1         | 0.63      | 1.00   | 0.77     | 266     |
| avg/total | 0.77      | 0.63   | 0.49     | 425     |

The Logistic Regression did not do much better than guessing, and other models yielded similar results.

Maybe it just doesn't make sense to predict whether a novel wins based only on its synopsis or excerpt. Don't judge a book by its cover, I guess.

Clusters of Writers

I've tried classifying writers by whether or not they've "won" the next NaNoWriMo contest, but that sort of dampens the spirit of NaNoWriMo. It's not just about winning, after all. I wanted to see what other ways there are to cluster writers, using K Means.

Imgur

It looks like a k of 5 produces the best silhouette score, so the data is best fitted into 5 clusters.
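The sweep behind a plot like this might look as follows. The blob data here is a synthetic stand-in for the scaled writer features, so it peaks at a different k than the real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for the scaled writer features: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6)])

# Fit K Means for a range of k and keep the silhouette score for each
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```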

Imgur

Genre Recommendation

While I could not create a very accurate model for predicting whether or not a novel will win based on its synopsis or excerpt, I still wanted to do something interesting with all the novel data I had. So I started to build a simple recommendation system that, given a writer's NaNoWriMo username, suggests new genres for the writer to try based on the genres they've written in the past.

Here's a subset of the list of writers and all the genres they've written in past NaNoWriMos.

|   | Writer Name      | Genres                                            |
|---|------------------|---------------------------------------------------|
| 0 | Nicaless         | Fantasy, Young Adult                              |
| 1 | Rachel B. Moore  | Literary, Literary Fiction                        |
| 2 | abookishbabe     | Young Adult                                       |
| 3 | alexabexis       | Romance, Horror/Supernatural, Horror & Superna... |
| 4 | AllYellowFlowers | Literary, Literary Fiction                        |

Below is a function that calculates the Jaccard similarity (size of the intersection over size of the union) between two different lists of genres.

def jaccard(a, b):
    # Missing genre lists come through from pandas as NaN, which is a float
    if isinstance(a, float) or isinstance(b, float):
        return 0
    a = set(a.split(", "))
    b = set(b.split(", "))
    intersect = a.intersection(b)
    union = a.union(b)
    return float(len(intersect)) / len(union)

nicaless_genres = writer_genres['Genres'][writer_genres['Writer Name'] == "Nicaless"].values[0]
abookishbabe_genres = writer_genres['Genres'][writer_genres['Writer Name'] == "abookishbabe"].values[0]

jaccard(nicaless_genres, abookishbabe_genres)

0.5

The above score means that the Jaccard similarity between the two writers' genre sets is 0.5. In other words, half of all the genres written between the two writers are shared.

I then created a function called getSimilar that uses the jaccard function to calculate the similarity between a given writer's list of genres and all other writers' genres, and returns a set of suggested genres based on the ten closest writers.
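getSimilar might be implemented along these lines. The tiny writer_genres frame here is a stand-in for the real one (so the output differs from the real results below), and the jaccard function is repeated so the sketch is self-contained:

```python
import pandas as pd

def jaccard(a, b):
    # Missing genre lists come through from pandas as NaN, which is a float
    if isinstance(a, float) or isinstance(b, float):
        return 0
    a, b = set(a.split(", ")), set(b.split(", "))
    return float(len(a & b)) / len(a | b)

# Stand-in for the real writer_genres dataframe
writer_genres = pd.DataFrame({
    "Writer Name": ["Nicaless", "abookishbabe", "alexabexis"],
    "Genres": ["Fantasy, Young Adult", "Young Adult", "Romance, Horror/Supernatural"],
})

def getSimilar(name, top_n=10):
    """Suggest genres drawn from the writers whose genre sets are closest to name's."""
    own = writer_genres.loc[writer_genres["Writer Name"] == name, "Genres"].values[0]
    others = writer_genres[writer_genres["Writer Name"] != name].copy()
    others["score"] = others["Genres"].apply(lambda g: jaccard(own, g))
    top = others.nlargest(top_n, "score")
    # Pool the closest writers' genres and drop the ones this writer already has
    suggested = set(", ".join(top["Genres"]).split(", ")) - set(own.split(", "))
    print("I suggest you try writing for the following genres:")
    print(suggested)
    return suggested

getSimilar("Nicaless")
```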

getSimilar("Nicaless")
I suggest you try writing for the following genres:
{'Romance', 'Science Fiction', 'Young Adult & Youth', 'nan'}
getSimilar("Trillian Anderson")
I suggest you try writing for the following genres:
{'Fanfiction','Non-Fiction','Romance','Science Fiction','Steampunk','Thriller/Suspense','Young Adult','nan'}
getSimilar("AmberMeyer")
I suggest you try writing for the following genres:
{'Fantasy', 'Science Fiction', 'Young Adult'}
getSimilar("Brandon Sanderson")
I suggest you try writing for the following genres:
{'Romance', 'Science Fiction', 'Young Adult & Youth', 'nan'}

Cool! Looks like I have a lot in common with what Brandon Sanderson writes based on our recommendations!

Of course, this recommender only works for writers already in my list of writers and their known past-written genres, but I'm hoping it's a list that will continue to expand so that I can then evaluate the effectiveness of the recommender and make improvements.

Next Steps

I thoroughly enjoyed diving into the NaNoWriMo data and exploring this intersection between my two passions: data science and writing.

Some possible next steps for this project include:

  • Collecting more data to see how well the models perform in predicting outcomes for new writers not currently in my data set
  • Performing more feature engineering on the data points excluding the current contest data to see how I can boost model scores
  • Figuring out what are the defining features in the 5 unique clusters of writers discovered from KMeans Clustering
  • Predicting a final word count instead of just a binary win/lose outcome
  • Building out the genre recommender
  • Exploring what features might be better predictors of winning novels
    • Are novels in certain genres more likely to win than others?