Nick Videtti
<br>
IST-664 Natural Language Processing
<br>
Final Project
<br>
Winter 2023
<br>
<br>
This project does analysis on both the words and the sentences in the Brown corpus of NLTK.
<br>
This analysis will include corpus statistics, predicting category of words and sentences using NLTK Naive Bayes Classifier, and other experiments.
<br>
<br>
<br>
The first step is to pull the words and sentences from the Brown corpus and put them in a pandas DataFrame with their corresponding categories. We will remove non-alphabetic phrases from both the words and sentences, and will remove NLTK's English stopwords from the words.
<br>
Due to processing time, stopwords will not be removed in the sentences, but this issue will be addressed later on.

In [1]:
import nltk
import pandas
import re

#Create list of catgories to loop through
categories = nltk.corpus.brown.categories()

#Create empty dataframes to store categorized sentences/words
SENTENCE_CATEGORIES = pandas.DataFrame(columns = ['Sentence','Category'])
WORD_CATEGORIES = pandas.DataFrame(columns = ['Word','Category'])

#Loop through categories
for category in categories:
    #Create sentnece dataframe for each category and add it to SENTENCE_CATEGORIES
    sents_og = nltk.corpus.brown.sents(categories = categories[categories.index(category)])
    sents_category = pandas.DataFrame([['',category,'',sent_og] for sent_og in sents_og], columns = ['Sentence', 'Category', 'Readable Sentence', 'Original Sentence'])
    SENTENCE_CATEGORIES = pandas.concat([SENTENCE_CATEGORIES, sents_category])

    #Create word dataframe for each category and add it to WORD_CATEGORIES
    words = nltk.corpus.brown.words(categories = categories[categories.index(category)])
    #Remove non-alphabetic phrases from words and convert to lowercase
    words = [word.lower() for word in words if re.compile('^[a-z]+$').match(word.lower())]
    words_category = pandas.DataFrame([[word,category] for word in words], columns = ['Word', 'Category'])
    WORD_CATEGORIES = pandas.concat([WORD_CATEGORIES, words_category])

#Remove Stopwords from WORD_CATEGORIES
WORD_CATEGORIES = WORD_CATEGORIES[~WORD_CATEGORIES['Word'].isin(nltk.corpus.stopwords.words('english'))]

#Make columns for "Readable Sentence" and "Sentence"
SENTENCE_CATEGORIES['Sentence'] = [[word.lower() for word in sent if re.compile('^[aA-zZ]+$').match(word.lower())] for sent in SENTENCE_CATEGORIES['Original Sentence']]
SENTENCE_CATEGORIES['Readable Sentence'] = SENTENCE_CATEGORIES['Sentence'].str.join(' ')
#Fix "Original Sentence" column
SENTENCE_CATEGORIES['Original Sentence'] = SENTENCE_CATEGORIES['Original Sentence'].str.join(' ').str.strip()

Let's inspect the first 10 rows of the pandas DataFrame we created for the sentences. The main setence we will use is in list format, so another column was created for "Readable Sentence", which simply concatenates the Sentence list with spaces in between, and "Original Sentence", which shows the original sentence.

In [2]:
#Inspect first 10 rows of Sentence data
SENTENCE_CATEGORIES.iloc[:10]

Unnamed: 0,Sentence,Category,Readable Sentence,Original Sentence
0,"[dan, morgan, told, himself, he, would, forget...",adventure,dan morgan told himself he would forget ann tu...,Dan Morgan told himself he would forget Ann Tu...
1,"[he, was, well, rid, of, her]",adventure,he was well rid of her,He was well rid of her .
2,"[he, certainly, want, a, wife, who, was, fickl...",adventure,he certainly want a wife who was fickle as ann,He certainly didn't want a wife who was fickle...
3,"[if, he, had, married, her, have, been, asking...",adventure,if he had married her have been asking for tro...,"If he had married her , he'd have been asking ..."
4,"[but, all, of, this, was, rationalization]",adventure,but all of this was rationalization,But all of this was rationalization .
5,"[sometimes, he, woke, up, in, the, middle, of,...",adventure,sometimes he woke up in the middle of the nigh...,Sometimes he woke up in the middle of the nigh...
6,"[his, plans, and, dreams, had, revolved, aroun...",adventure,his plans and dreams had revolved around her s...,His plans and dreams had revolved around her s...
7,"[the, easiest, thing, would, be, to, sell, out...",adventure,the easiest thing would be to sell out to al b...,The easiest thing would be to sell out to Al B...
8,"[the, best, antidote, for, the, bitterness, an...",adventure,the best antidote for the bitterness and disap...,The best antidote for the bitterness and disap...
9,"[he, found, that, if, he, was, tired, enough, ...",adventure,he found that if he was tired enough at night ...,He found that if he was tired enough at night ...


Let's inspect the first 10 rows of the pandas DataFrame we created for the words.

In [3]:
#Inspect first 10 rows of Word data
WORD_CATEGORIES.iloc[:10]

Unnamed: 0,Word,Category
0,dan,adventure
1,morgan,adventure
2,told,adventure
5,would,adventure
6,forget,adventure
7,ann,adventure
8,turner,adventure
11,well,adventure
12,rid,adventure
16,certainly,adventure


Now it's time to create our word features for our text classifiers! Let's create those then take a look at the first few.

In [4]:
#Create word features
top_words = [word for (word, freq) in nltk.FreqDist([word for word in words]).most_common(1000)]
top_words_cats = WORD_CATEGORIES[WORD_CATEGORIES['Word'].isin(top_words)]
word_features = [({'Word': top_words_cats.iloc[row]['Word']}, top_words_cats.iloc[row]['Category']) for row in top_words_cats.index]

word_features[:20]

[({'Word': 'forget'}, 'adventure'),
 ({'Word': 'married'}, 'adventure'),
 ({'Word': 'asking'}, 'adventure'),
 ({'Word': 'could'}, 'adventure'),
 ({'Word': 'long'}, 'adventure'),
 ({'Word': 'went'}, 'adventure'),
 ({'Word': 'found'}, 'adventure'),
 ({'Word': 'less'}, 'adventure'),
 ({'Word': 'plenty'}, 'adventure'),
 ({'Word': 'could'}, 'adventure'),
 ({'Word': 'water'}, 'adventure'),
 ({'Word': 'much'}, 'adventure'),
 ({'Word': 'would'}, 'adventure'),
 ({'Word': 'give'}, 'adventure'),
 ({'Word': 'asleep'}, 'adventure'),
 ({'Word': 'happened'}, 'adventure'),
 ({'Word': 'last'}, 'adventure'),
 ({'Word': 'people'}, 'adventure'),
 ({'Word': 'could'}, 'adventure'),
 ({'Word': 'least'}, 'adventure')]

Now, it's time to create the sentence features! Due to the poor processing time of removing stopwords earlier, we will address the issue here. The features will include first word of sentences with category, last word of sentences with category, and sentence length with category. The first and last words will really be the first and last non-stopwords in the sentences, but the sentence lengths will include stopwords due to the aforementioned processing time issues.

In [5]:
#Create sentence features for first word of sentence and last word of sentence

#First words
first_words = []
for sent in SENTENCE_CATEGORIES['Sentence']:
    for word in sent:
        if word not in nltk.corpus.stopwords.words('english'):
            first_words.append(word)
            break       

top_first_words = [word for (word, freq) in nltk.FreqDist([word for word in first_words]).most_common(1000)]
top_first_words_cats = WORD_CATEGORIES[WORD_CATEGORIES['Word'].isin(top_first_words)]
first_word_features = [({'First Word': top_first_words_cats.iloc[row]['Word']}, top_first_words_cats.iloc[row]['Category']) for row in range(len(top_first_words_cats.index))]

#Last words
last_words = []
for sent in SENTENCE_CATEGORIES['Sentence']:
    for word in reversed(sent):
        if word not in nltk.corpus.stopwords.words('english'):
            last_words.append(word)
            break
        
top_last_words = [word for (word, freq) in nltk.FreqDist([word for word in last_words]).most_common(1000)]
top_last_words_cats = WORD_CATEGORIES[WORD_CATEGORIES['Word'].isin(top_last_words)]
last_word_features = [({'Last Word': top_last_words_cats.iloc[row]['Word']}, top_last_words_cats.iloc[row]['Category']) for row in range(len(top_last_words_cats.index))]

#Sentence lengths
sent_length_features = [({'Sentence Length': len(SENTENCE_CATEGORIES.iloc[row]['Sentence'])}, SENTENCE_CATEGORIES.iloc[row]['Category']) for row in range(len(SENTENCE_CATEGORIES.index))]


#Look at first 5 and last 5 features for all feature sets
for i in list(range(5)) + list(range(-5,0)): print(first_word_features[i]); print(last_word_features[i]); print(sent_length_features[i])

({'First Word': 'morgan'}, 'adventure')
({'Last Word': 'told'}, 'adventure')
({'Sentence Length': 9}, 'adventure')
({'First Word': 'told'}, 'adventure')
({'Last Word': 'would'}, 'adventure')
({'Sentence Length': 6}, 'adventure')
({'First Word': 'would'}, 'adventure')
({'Last Word': 'well'}, 'adventure')
({'Sentence Length': 10}, 'adventure')
({'First Word': 'forget'}, 'adventure')
({'Last Word': 'want'}, 'adventure')
({'Sentence Length': 10}, 'adventure')
({'First Word': 'well'}, 'adventure')
({'Last Word': 'wife'}, 'adventure')
({'Sentence Length': 6}, 'adventure')
({'First Word': 'face'}, 'science_fiction')
({'Last Word': 'could'}, 'science_fiction')
({'Sentence Length': 10}, 'science_fiction')
({'First Word': 'could'}, 'science_fiction')
({'Last Word': 'talk'}, 'science_fiction')
({'Sentence Length': 35}, 'science_fiction')
({'First Word': 'talk'}, 'science_fiction')
({'Last Word': 'write'}, 'science_fiction')
({'Sentence Length': 14}, 'science_fiction')
({'First Word': 'write'}, 's

We will use the features in our classifiers, but first let's put all of our results into a pandas DataFrame and display the first few rows!

In [6]:
#Create pandas DataFrame for features
featuresDF = pandas.DataFrame(columns = ['Feature Name', 'Feature Value', 'Category', 'Actual Feature'])

#Assign raw features to a column
featuresDF['Actual Feature'] = word_features + first_word_features + last_word_features + sent_length_features

#Calculate other columns
featuresDF['Feature Name'] = [featuresDF.iloc[row]['Actual Feature'][0].keys() for row in featuresDF.index]
featuresDF['Feature Name'] = featuresDF['Feature Name'].str.join('')
featuresDF['Feature Value'] = [featuresDF.iloc[row]['Actual Feature'][0].values() for row in featuresDF.index]
featuresDF['Feature Value'] = featuresDF['Feature Value'].str.join('')
featuresDF['Category'] = [featuresDF.iloc[row]['Actual Feature'][1] for row in featuresDF.index]

#Display first 10 rows
featuresDF[:10]

Unnamed: 0,Feature Name,Feature Value,Category,Actual Feature
0,Word,forget,adventure,"({'Word': 'forget'}, adventure)"
1,Word,married,adventure,"({'Word': 'married'}, adventure)"
2,Word,asking,adventure,"({'Word': 'asking'}, adventure)"
3,Word,could,adventure,"({'Word': 'could'}, adventure)"
4,Word,long,adventure,"({'Word': 'long'}, adventure)"
5,Word,went,adventure,"({'Word': 'went'}, adventure)"
6,Word,found,adventure,"({'Word': 'found'}, adventure)"
7,Word,less,adventure,"({'Word': 'less'}, adventure)"
8,Word,plenty,adventure,"({'Word': 'plenty'}, adventure)"
9,Word,could,adventure,"({'Word': 'could'}, adventure)"


Now it's time to put those features to use! We will now create Naive Bayes Classifiers using NLTK for word features, first word of sentence features, last word of sentence features, sentence length features, and all features. For each of these classifiers, we will also perform a five-fold, ten-fold, and 30-fold cross validation to test their accuracy

In [7]:
#NLTK Naive Bayes Classifiers
import random

#Create function for cross validation, default set to five-fold
def cross_val(features, folds = 5):
    partition_length = int(len(features) / int(folds))
    accuracy = []
    for fold in list(range(1, int(folds)+1)):
        train = features[ : partition_length*(fold-1)]
        train = train + features[partition_length*fold : ]
        test = features[partition_length*(fold-1) : partition_length*fold]
        classifier = nltk.NaiveBayesClassifier.train(train)
        accuracy.append(nltk.classify.accuracy(classifier, test))
    print('RESULTS OF ' + str(int(folds)) + '-FOLD CROSS VALIDATION:')
    print('Average Accuracy -', round((sum(accuracy) / len(accuracy))*100, 2), '%')
    print('Minimum Accuracy -', round(min(accuracy)*100, 2), '%')
    print('Maximum Accuracy -', round(max(accuracy)*100, 2), '%')


#Word Features
random.shuffle(word_features)
print('Word Features Classifier')
cross_val(word_features)
print()
cross_val(word_features,10)
print()
cross_val(word_features,30)
print('\n\n\n')

#First Word Features
random.shuffle(first_word_features)
print('First Word Features Classifier')
cross_val(first_word_features)
print()
cross_val(first_word_features,10)
print()
cross_val(first_word_features,30)
print('\n\n\n')

#Last Word Features
random.shuffle(last_word_features)
print('Last Word Features Classifier')
cross_val(last_word_features)
print()
cross_val(last_word_features,10)
print()
cross_val(last_word_features,30)
print('\n\n\n')

#Sentence Length Features
random.shuffle(sent_length_features)
print('Sentence Length Features Classifier')
cross_val(sent_length_features)
print()
cross_val(sent_length_features,10)
print()
cross_val(sent_length_features,30)
print('\n\n\n')

#ALL Features
all_features = word_features + first_word_features + last_word_features + sent_length_features
random.shuffle(all_features)
print('All Features Classifier')
cross_val(all_features)
print()
cross_val(all_features,10)
print()
cross_val(all_features,30)

Word Features Classifier
RESULTS OF 5-FOLD CROSS VALIDATION:
Average Accuracy - 36.37 %
Minimum Accuracy - 36.01 %
Maximum Accuracy - 36.65 %

RESULTS OF 10-FOLD CROSS VALIDATION:
Average Accuracy - 36.41 %
Minimum Accuracy - 35.86 %
Maximum Accuracy - 36.98 %

RESULTS OF 30-FOLD CROSS VALIDATION:
Average Accuracy - 36.46 %
Minimum Accuracy - 35.33 %
Maximum Accuracy - 37.69 %




First Word Features Classifier
RESULTS OF 5-FOLD CROSS VALIDATION:
Average Accuracy - 22.21 %
Minimum Accuracy - 21.81 %
Maximum Accuracy - 22.41 %

RESULTS OF 10-FOLD CROSS VALIDATION:
Average Accuracy - 22.16 %
Minimum Accuracy - 21.71 %
Maximum Accuracy - 22.92 %

RESULTS OF 30-FOLD CROSS VALIDATION:
Average Accuracy - 22.16 %
Minimum Accuracy - 21.2 %
Maximum Accuracy - 22.94 %




Last Word Features Classifier
RESULTS OF 5-FOLD CROSS VALIDATION:
Average Accuracy - 22.5 %
Minimum Accuracy - 22.35 %
Maximum Accuracy - 22.59 %

RESULTS OF 10-FOLD CROSS VALIDATION:
Average Accuracy - 22.5 %
Minimum Accuracy 

It appears that none of the feature sets have noteworthy accuracy. Perhaps we should be using using the feature sets to predict between two categories instead of all of the NLTK Brown corpus categories. We will reduce down to only the "news" and "humor" categories, then try this again.

In [8]:
#Filter feature sets to only news and humor categories
word_features = [(featdict,category) for (featdict,category) in word_features if category in ('news', 'humor')]
first_word_features = [(featdict,category) for (featdict,category) in first_word_features if category in ('news', 'humor')]
last_word_features = [(featdict,category) for (featdict,category) in last_word_features if category in ('news', 'humor')]
sent_length_features = [(featdict,category) for (featdict,category) in sent_length_features if category in ('news', 'humor')]


#Word Features
random.shuffle(word_features)
print('Word Features Classifier')
cross_val(word_features)
print()
cross_val(word_features,10)
print()
cross_val(word_features,30)
print('\n\n\n')

#First Word Features
random.shuffle(first_word_features)
print('First Word Features Classifier')
cross_val(first_word_features)
print()
cross_val(first_word_features,10)
print()
cross_val(first_word_features,30)
print('\n\n\n')

#Last Word Features
random.shuffle(last_word_features)
print('Last Word Features Classifier')
cross_val(last_word_features)
print()
cross_val(last_word_features,10)
print()
cross_val(last_word_features,30)
print('\n\n\n')

#Sentence Length Features
random.shuffle(sent_length_features)
print('Sentence Length Features Classifier')
cross_val(sent_length_features)
print()
cross_val(sent_length_features,10)
print()
cross_val(sent_length_features,30)
print('\n\n\n')

#ALL Features
all_features = word_features + first_word_features + last_word_features + sent_length_features
random.shuffle(all_features)
print('All Features Classifier')
cross_val(all_features)
print()
cross_val(all_features,10)
print()
cross_val(all_features,30)

Word Features Classifier
RESULTS OF 5-FOLD CROSS VALIDATION:
Average Accuracy - 66.9 %
Minimum Accuracy - 64.85 %
Maximum Accuracy - 68.48 %

RESULTS OF 10-FOLD CROSS VALIDATION:
Average Accuracy - 67.2 %
Minimum Accuracy - 61.64 %
Maximum Accuracy - 70.18 %

RESULTS OF 30-FOLD CROSS VALIDATION:
Average Accuracy - 67.5 %
Minimum Accuracy - 59.02 %
Maximum Accuracy - 79.23 %




First Word Features Classifier
RESULTS OF 5-FOLD CROSS VALIDATION:
Average Accuracy - 83.24 %
Minimum Accuracy - 82.32 %
Maximum Accuracy - 84.29 %

RESULTS OF 10-FOLD CROSS VALIDATION:
Average Accuracy - 83.22 %
Minimum Accuracy - 82.11 %
Maximum Accuracy - 84.68 %

RESULTS OF 30-FOLD CROSS VALIDATION:
Average Accuracy - 83.15 %
Minimum Accuracy - 79.53 %
Maximum Accuracy - 85.71 %




Last Word Features Classifier
RESULTS OF 5-FOLD CROSS VALIDATION:
Average Accuracy - 83.99 %
Minimum Accuracy - 83.75 %
Maximum Accuracy - 84.22 %

RESULTS OF 10-FOLD CROSS VALIDATION:
Average Accuracy - 84.11 %
Minimum Accuracy 

Much better! Let's take a look at the most informative features for each of the feature sets! We will stick to just a ten-fold cross validation for the sake of simplicity, and will show the top 5 most informative features for each of the 10 classifiers generated for each feature set.

In [9]:
#Word Features most informative features
print('Word Most Informative Features')
partition_length = int(len(word_features) / 10)
for fold in list(range(1, 11)):
    train = word_features[ : partition_length*(fold-1)]
    train = train + word_features[partition_length*fold : ]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print('ROUND', fold)
    classifier.show_most_informative_features(5)
    print()

Word Most Informative Features
ROUND 1
Most Informative Features
                    Word = 'looked'        humor : news   =     12.2 : 1.0
                    Word = 'heart'         humor : news   =      9.9 : 1.0
                    Word = 'eyes'          humor : news   =      8.7 : 1.0
                    Word = 'american'       news : humor  =      8.2 : 1.0
                    Word = 'mind'          humor : news   =      7.5 : 1.0

ROUND 2
Most Informative Features
                    Word = 'looked'        humor : news   =     13.7 : 1.0
                    Word = 'american'       news : humor  =      9.5 : 1.0
                    Word = 'eyes'          humor : news   =      8.9 : 1.0
                    Word = 'heart'         humor : news   =      8.9 : 1.0
                    Word = 'english'       humor : news   =      7.7 : 1.0

ROUND 3
Most Informative Features
                    Word = 'eyes'          humor : news   =     10.0 : 1.0
                    Word = 'heart'      

In [10]:
#First Word Features length most informative features
print('First Word Most Informative Features')
partition_length = int(len(first_word_features) / 10)
for fold in list(range(1, 11)):
    train = first_word_features[ : partition_length*(fold-1)]
    train = train + first_word_features[partition_length*fold : ]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print('ROUND', fold)
    classifier.show_most_informative_features(5)
    print()

First Word Most Informative Features
ROUND 1
Most Informative Features
              First Word = 'mouth'         humor : news   =     22.0 : 1.0
              First Word = 'nice'          humor : news   =     22.0 : 1.0
              First Word = 'president'      news : humor  =     18.8 : 1.0
              First Word = 'sat'           humor : news   =     18.5 : 1.0
              First Word = 'looked'        humor : news   =     17.0 : 1.0

ROUND 2
Most Informative Features
              First Word = 'president'      news : humor  =     20.5 : 1.0
              First Word = 'sat'           humor : news   =     18.3 : 1.0
              First Word = 'looked'        humor : news   =     16.8 : 1.0
              First Word = 'blanche'       humor : news   =     16.0 : 1.0
              First Word = 'horse'         humor : news   =     14.8 : 1.0

ROUND 3
Most Informative Features
              First Word = 'horse'         humor : news   =     24.8 : 1.0
              First Word = 'nice' 

In [11]:
#Last Word Features most informative features
print('Last Word Most Informative Features')
partition_length = int(len(last_word_features) / 10)
for fold in list(range(1, 11)):
    train = last_word_features[ : partition_length*(fold-1)]
    train = train + last_word_features[partition_length*fold : ]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print('ROUND', fold)
    classifier.show_most_informative_features(5)
    print()

Last Word Most Informative Features
ROUND 1
Most Informative Features
               Last Word = 'horse'         humor : news   =     22.9 : 1.0
               Last Word = 'shoes'         humor : news   =     19.9 : 1.0
               Last Word = 'sat'           humor : news   =     19.3 : 1.0
               Last Word = 'president'      news : humor  =     18.5 : 1.0
               Last Word = 'glass'         humor : news   =     16.8 : 1.0

ROUND 2
Most Informative Features
               Last Word = 'horse'         humor : news   =     22.9 : 1.0
               Last Word = 'shoes'         humor : news   =     19.8 : 1.0
               Last Word = 'sat'           humor : news   =     19.2 : 1.0
               Last Word = 'president'      news : humor  =     18.6 : 1.0
               Last Word = 'glass'         humor : news   =     16.8 : 1.0

ROUND 3
Most Informative Features
               Last Word = 'sat'           humor : news   =     19.3 : 1.0
               Last Word = 'preside

In [12]:
#Sentence length most informative features
print('Sentence Length Most Informative Features')
partition_length = int(len(sent_length_features) / 10)
for fold in list(range(1, 11)):
    train = sent_length_features[ : partition_length*(fold-1)]
    train = train + sent_length_features[partition_length*fold : ]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print('ROUND', fold)
    classifier.show_most_informative_features(5)
    print()

Sentence Length Most Informative Features
ROUND 1
Most Informative Features
         Sentence Length = 54              humor : news   =      7.1 : 1.0
         Sentence Length = 57              humor : news   =      4.2 : 1.0
         Sentence Length = 66              humor : news   =      4.2 : 1.0
         Sentence Length = 53              humor : news   =      4.2 : 1.0
         Sentence Length = 44              humor : news   =      2.5 : 1.0

ROUND 2
Most Informative Features
         Sentence Length = 53              humor : news   =      7.0 : 1.0
         Sentence Length = 54              humor : news   =      4.2 : 1.0
         Sentence Length = 59              humor : news   =      4.2 : 1.0
         Sentence Length = 60              humor : news   =      4.2 : 1.0
         Sentence Length = 66              humor : news   =      4.2 : 1.0

ROUND 3
Most Informative Features
         Sentence Length = 56              humor : news   =      4.3 : 1.0
         Sentence Length = 59

In [13]:
#ALL Features most informative features
print('All Most Informative Features')
partition_length = int(len(all_features) / 10)
for fold in list(range(1, 11)):
    train = all_features[ : partition_length*(fold-1)]
    train = train + all_features[partition_length*fold : ]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print('ROUND', fold)
    classifier.show_most_informative_features(5)
    print()

All Most Informative Features
ROUND 1
Most Informative Features
                    Word = 'looked'        humor : news   =     26.8 : 1.0
              First Word = 'horse'         humor : news   =     26.4 : 1.0
              First Word = 'president'      news : humor  =     21.2 : 1.0
                    Word = 'seemed'        humor : news   =     21.2 : 1.0
                    Word = 'eyes'          humor : news   =     21.2 : 1.0

ROUND 2
Most Informative Features
                    Word = 'looked'        humor : news   =     32.7 : 1.0
                    Word = 'eyes'          humor : news   =     24.2 : 1.0
                    Word = 'heart'         humor : news   =     21.3 : 1.0
              First Word = 'president'      news : humor  =     20.5 : 1.0
               Last Word = 'president'      news : humor  =     20.4 : 1.0

ROUND 3
Most Informative Features
                    Word = 'looked'        humor : news   =     26.7 : 1.0
               Last Word = 'president'   

While the Sentence Length Most Informatve Features are technically the "most informative features", they are hardly informative. These would be much more intuituve if we used sentence length ranges instead of discrete sentence lengths. Let's update the Sentence Length Features using sentence length ranges, then look at the first 10 and last 10 features and the top 10 most informative features.

In [14]:
#Check lengths of sentences
sent_lengths = [feature[0]['Sentence Length'] for feature in sent_length_features]
print('SENTENCE LENGTH ANALYSIS:')
print('Minimum Length:', min(sent_lengths))
print('Maximum Length:', max(sent_lengths))
print('Average Length:', sum(sent_lengths) / len(sent_lengths))

#Update sentence length feature set
for (featdict, category) in sent_length_features:
    if featdict['Sentence Length'] >= 0 and featdict['Sentence Length'] <= 5: featdict.update({'Sentence Length' : '0-5 Words'})
    elif featdict['Sentence Length'] >= 6 and featdict['Sentence Length'] <= 10: featdict.update({'Sentence Length' : '6-10 Words'})
    elif featdict['Sentence Length'] >= 10 and featdict['Sentence Length'] <= 15: featdict.update({'Sentence Length' : '11-15 Words'})
    elif featdict['Sentence Length'] >= 16 and featdict['Sentence Length'] <= 20: featdict.update({'Sentence Length' : '16-20 Words'})
    elif featdict['Sentence Length'] >= 21 and featdict['Sentence Length'] <= 30: featdict.update({'Sentence Length' : '21-30 Words'})
    elif featdict['Sentence Length'] >= 31 and featdict['Sentence Length'] <= 40: featdict.update({'Sentence Length' : '31-40 Words'})
    elif featdict['Sentence Length'] >= 41 and featdict['Sentence Length'] <= 50: featdict.update({'Sentence Length' : '41-50 Words'})
    elif featdict['Sentence Length'] >= 51 and featdict['Sentence Length'] <= 75: featdict.update({'Sentence Length' : '51-75 Words'})
    elif featdict['Sentence Length'] >= 76 and featdict['Sentence Length'] <= 100: featdict.update({'Sentence Length' : '76-100 Words'})
    elif featdict['Sentence Length'] >= 101: featdict.update({'Sentence Length' : 'More Than 100 Words'})

#Look at first 10 and last 10 updated features
print('\n\nFirst 10 and Last 10 Updated Sentence Length Features')
for i in list(range(5)) + list(range(-5,0)): print(sent_length_features[i])


SENTENCE LENGTH ANALYSIS:
Minimum Length: 0
Minimum Length: 91
Average Length: 18.043164200140943


First 10 and Last 10 Updated Sentence Length Features
({'Sentence Length': '16-20 Words'}, 'news')
({'Sentence Length': '21-30 Words'}, 'news')
({'Sentence Length': '16-20 Words'}, 'news')
({'Sentence Length': '0-5 Words'}, 'news')
({'Sentence Length': '21-30 Words'}, 'news')
({'Sentence Length': '0-5 Words'}, 'news')
({'Sentence Length': '31-40 Words'}, 'humor')
({'Sentence Length': '31-40 Words'}, 'news')
({'Sentence Length': '6-10 Words'}, 'humor')
({'Sentence Length': '21-30 Words'}, 'humor')


In [15]:
#Sentence length most informative features
print('Sentence Length Most Informative Features')
partition_length = int(len(sent_length_features) / 10)
for fold in list(range(1, 11)):
    train = sent_length_features[ : partition_length*(fold-1)]
    train = train + sent_length_features[partition_length*fold : ]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print('ROUND', fold)
    classifier.show_most_informative_features(10)
    print()

Sentence Length Most Informative Features
ROUND 1
Most Informative Features
         Sentence Length = '76-100 Words'  humor : news   =      4.4 : 1.0
         Sentence Length = '51-75 Words'   humor : news   =      2.4 : 1.0
         Sentence Length = '6-10 Words'    humor : news   =      1.6 : 1.0
         Sentence Length = '21-30 Words'    news : humor  =      1.3 : 1.0
         Sentence Length = '0-5 Words'     humor : news   =      1.2 : 1.0
         Sentence Length = '11-15 Words'    news : humor  =      1.2 : 1.0
         Sentence Length = '16-20 Words'    news : humor  =      1.2 : 1.0
         Sentence Length = '41-50 Words'   humor : news   =      1.1 : 1.0
         Sentence Length = '31-40 Words'    news : humor  =      1.0 : 1.0

ROUND 2
Most Informative Features
         Sentence Length = '76-100 Words'  humor : news   =      4.3 : 1.0
         Sentence Length = '51-75 Words'   humor : news   =      2.3 : 1.0
         Sentence Length = '6-10 Words'    humor : news   =     

We can also look at which sentence lengths correspond to which categories! Let's see how many sentences fall into each length range and category, then put the results into a pandas DataFrame.

In [16]:
#Create data frame
length_ranges_DF = pandas.DataFrame(columns = ['Sorter', 'Feature', 'Sentence Length', 'Category', 'Sentences', 'News', 'Humor'])
length_ranges_DF['Feature'] = sent_length_features
length_ranges_DF['Sentence Length'] = [length_ranges_DF.iloc[row]['Feature'][0]['Sentence Length'] for row in range(len(length_ranges_DF.index))]
length_ranges_DF['Category'] = [length_ranges_DF.iloc[row]['Feature'][1] for row in range(len(length_ranges_DF.index))]
length_ranges_DF['Sentences'] = [1 for row in range(len(length_ranges_DF.index))]

length_ranges_DF.loc[length_ranges_DF['Category'] == 'news', 'News'] = 1
length_ranges_DF.loc[length_ranges_DF['Category'] == 'humor', 'Humor'] = 1

#Sorter
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '0-5 Words', 'Sorter'] = 0
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '6-10 Words', 'Sorter'] = 1
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '10-15 Words', 'Sorter'] = 2
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '16-20 Words', 'Sorter'] = 3
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '21-30 Words', 'Sorter'] = 4
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '31-40 Words', 'Sorter'] = 5
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '41-50 Words', 'Sorter'] = 6
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '51-75 Words', 'Sorter'] = 7
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '76-100 Words', 'Sorter'] = 8
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == 'More Than 100 Words', 'Sorter'] = 9

#Drop Feature and Category Columns
length_ranges_DF = length_ranges_DF[['Sorter', 'Sentence Length', 'Sentences', 'News', 'Humor']]

#Aggregate
length_ranges_DF = length_ranges_DF.groupby(['Sorter', 'Sentence Length']).aggregate('count')

#Drop Sorter Column
length_ranges_DF.index = [index for (sorter,index) in length_ranges_DF.index]

#Print results
length_ranges_DF

Unnamed: 0,Sentences,News,Humor
0-5 Words,694,546,148
6-10 Words,854,624,230
16-20 Words,1030,864,166
21-30 Words,1330,1139,191
31-40 Words,565,461,104
41-50 Words,148,117,31
51-75 Words,42,28,14
76-100 Words,4,2,2


Now instead of the total values, lets look at the proportion of categories among Sentence Length and also add which category has the highest proportion.

In [17]:
#Convert to proportions
length_ranges_DF['News'] = round(length_ranges_DF['News'] / length_ranges_DF['Sentences'], 4)
length_ranges_DF['Humor'] = round(length_ranges_DF['Humor'] / length_ranges_DF['Sentences'], 4)

#Add Most Common Category column
length_ranges_DF.loc[length_ranges_DF['News'] == length_ranges_DF['Humor'], 'Most Frequent Category'] = 'Even Split'
length_ranges_DF.loc[length_ranges_DF['News'] > length_ranges_DF['Humor'], 'Most Frequent Category'] = 'News'
length_ranges_DF.loc[length_ranges_DF['News'] < length_ranges_DF['Humor'], 'Most Frequent Category'] = 'Humor'

#Print results
length_ranges_DF

Unnamed: 0,Sentences,News,Humor,Most Frequent Category
0-5 Words,694,0.7867,0.2133,News
6-10 Words,854,0.7307,0.2693,News
16-20 Words,1030,0.8388,0.1612,News
21-30 Words,1330,0.8564,0.1436,News
31-40 Words,565,0.8159,0.1841,News
41-50 Words,148,0.7905,0.2095,News
51-75 Words,42,0.6667,0.3333,News
76-100 Words,4,0.5,0.5,Even Split


This gives the idea that there are more News sentences overall than Humor Sentences, which we saw in the DataFrame with the whole numbers of sentences. The last experiment will be to compare the proportion of news and humor sentences based on an even number of news and humor sentences. We will find whichever category has less, take all of those sentences, then take the same amount of the other category.

In [18]:
#Number of news sentences and humor sentences
news_sents = [(featdict,category) for (featdict,category) in sent_length_features if category == 'news']
humor_sents = [(featdict,category) for (featdict,category) in sent_length_features if category == 'humor']

#Min between number of news sentences and number of humor sentences
samenumber = min(len(news_sents),len(humor_sents))

#Create new sentence features
sent_length_features = news_sents[:samenumber] + humor_sents[:samenumber]

#Create data frame
length_ranges_DF = pandas.DataFrame(columns = ['Sorter', 'Feature', 'Sentence Length', 'Category', 'Sentences', 'News', 'Humor'])
length_ranges_DF['Feature'] = sent_length_features
length_ranges_DF['Sentence Length'] = [length_ranges_DF.iloc[row]['Feature'][0]['Sentence Length'] for row in range(len(length_ranges_DF.index))]
length_ranges_DF['Category'] = [length_ranges_DF.iloc[row]['Feature'][1] for row in range(len(length_ranges_DF.index))]
length_ranges_DF['Sentences'] = [1 for row in range(len(length_ranges_DF.index))]

length_ranges_DF.loc[length_ranges_DF['Category'] == 'news', 'News'] = 1
length_ranges_DF.loc[length_ranges_DF['Category'] == 'humor', 'Humor'] = 1

#Sorter
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '0-5 Words', 'Sorter'] = 0
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '6-10 Words', 'Sorter'] = 1
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '10-15 Words', 'Sorter'] = 2
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '16-20 Words', 'Sorter'] = 3
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '21-30 Words', 'Sorter'] = 4
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '31-40 Words', 'Sorter'] = 5
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '41-50 Words', 'Sorter'] = 6
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '51-75 Words', 'Sorter'] = 7
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == '76-100 Words', 'Sorter'] = 8
length_ranges_DF.loc[length_ranges_DF['Sentence Length'] == 'More Than 100 Words', 'Sorter'] = 9

#Drop Feature and Category Columns
length_ranges_DF = length_ranges_DF[['Sorter', 'Sentence Length', 'Sentences', 'News', 'Humor']]

#Aggregate
length_ranges_DF = length_ranges_DF.groupby(['Sorter', 'Sentence Length']).aggregate('count')

#Drop Sorter Column
length_ranges_DF.index = [index for (sorter,index) in length_ranges_DF.index]

#Convert to proportions
length_ranges_DF['News'] = round(length_ranges_DF['News'] / length_ranges_DF['Sentences'], 4)
length_ranges_DF['Humor'] = round(length_ranges_DF['Humor'] / length_ranges_DF['Sentences'], 4)

#Add Most Common Category column
length_ranges_DF.loc[length_ranges_DF['News'] == length_ranges_DF['Humor'], 'Most Frequent Category'] = 'Even Split'
length_ranges_DF.loc[length_ranges_DF['News'] > length_ranges_DF['Humor'], 'Most Frequent Category'] = 'News'
length_ranges_DF.loc[length_ranges_DF['News'] < length_ranges_DF['Humor'], 'Most Frequent Category'] = 'Humor'

#Print results
length_ranges_DF

Unnamed: 0,Sentences,News,Humor,Most Frequent Category
0-5 Words,276,0.4638,0.5362,Humor
6-10 Words,379,0.3931,0.6069,Humor
16-20 Words,377,0.5597,0.4403,News
21-30 Words,440,0.5659,0.4341,News
31-40 Words,209,0.5024,0.4976,News
41-50 Words,55,0.4364,0.5636,Humor
51-75 Words,22,0.3636,0.6364,Humor
76-100 Words,2,0.0,1.0,Humor


Now that is some good stuff! We can see here that generally humor has either shorter or longer sentences, and news takes more of the middle. This makes sense since jokes are usually either "one-liners" or take a while to "set up", and news is much more predictable in terms of how long it will be, plus is usually much more formal and standardized than humor is.
<br>
Finally, the very very last step to take would be to create classifiers and test them against these results.

In [19]:
#Create test feature sets
test_features = [({'Sentence Length' : index}, length_ranges_DF.loc[index]['Most Frequent Category'].lower()) for index in length_ranges_DF.index]

#Create train features
train_features = [feature for feature in featuresDF['Actual Feature'] if list(feature[0].keys())[0] == 'Sentence Length' and feature[1] in ('news','humor')]
train_features

#Create classifier
print('Updated Sentence Length Classifier')
classifier = nltk.NaiveBayesClassifier.train(train_features)
accuracy = nltk.classify.accuracy(classifier, test_features)
print('Accuracy -', round(accuracy*100, 2), '%')

Updated Sentence Length Classifier
Accuracy - 37.5 %


Looks great!!! Not the best accuracy, but pretty impressive given so few rules to test against!
<br>
<br>
CONCLUSION:
<br>
While countless more testing and experiments could be done, we still learned a lot during this project! It is important to identify what features or attributes will be of interest as we unforunatley spent a decent portion of this project getting to the point where we settled on sentnece length ranges. This is likely just a byproduct of the fact that our classifiers used supervised machine learning. Ideally, all attributes would be stored in a data frame and unsupervised learning would be done to discover which are of best interest to be more efficient in this process.