One of our central concerns from the beginning has been getting the matter of *words* right. Stopword lists, sometimes extended beyond function words to include any word too common to help distinguish one text from another, have proven useful in many applications. Even so, we felt it was better to keep all the words in our initial processing of the corpus and to remove words only in later experiments, as they proved problematic -- read "mostly harmless" -- in a particular context.

In [None]:
# Is this still needed?
%matplotlib inline

In [1]:
import pandas
from nltk.tokenize import WhitespaceTokenizer
import numpy as np
import re

## Getting the Talks

In [29]:
# Load the CSV into a dataframe and then place all the texts into a list:

with open('../data/tedtalks2018.csv') as f:
    colnames = f.readline().strip().split(",")

print(colnames)

df = pandas.read_csv('tedtalks2018.csv', names=colnames)
print(df.shape)

texts = df.text.tolist()
labels = df.headline.tolist()

['rowID', 'Talk_ID', 'public_url', 'speaker_name', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text']
(2657, 12)


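The first item in this list is the column header -- `texts[0]` is `'text'` -- because passing `names=colnames` to `read_csv` makes pandas treat the header line as an ordinary row of data. So the first thing we are going to do is remove it: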

In [30]:
print(texts[0])

text


In [31]:
labels.pop(0)  # drop the header from labels too, so the two lists stay aligned
texts.pop(0)

'text'
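
Popping the header from `texts` works, but the header row is still in `df`, and any other list built from the dataframe (such as `labels`) needs the same treatment. A minimal sketch of an alternative, assuming the first line of the CSV is the header: let pandas consume it through its default `header=0` behavior instead of passing `names`. The `sample_csv` data and the `*_sample` names below are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV: a header line plus two rows.
sample_csv = io.StringIO(
    "headline,text\n"
    "First talk,Thank you so much.\n"
    "Second talk,(Music) (Applause)\n"
)

# With no `names` argument, read_csv uses the first line as column names,
# so the header never shows up as a row of data.
df_sample = pd.read_csv(sample_csv)

texts_sample = df_sample["text"].tolist()
labels_sample = df_sample["headline"].tolist()

print(df_sample.shape)
print(texts_sample[0])
```

Read this way, the two lists start out the same length and nothing has to be popped.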

Now we'll make sure that our first text is, in fact, a text and not something else:

In [32]:
print(texts[0][0:50])

  Thank you so much, Chris. And it's truly a great


## Tokenization

Now we need to determine the best way forward for tokenization. In other projects, I have used NLTK's `WhitespaceTokenizer`, so we will begin there and check the results.

In [33]:
# List of word counts for each talk 
counts = [len(WhitespaceTokenizer().tokenize(text)) for text in texts]
print("Of the {} talks, the shortest is {} words; the longest {}; and the average {}.".format(
    len(counts),min(counts), max(counts), int(np.mean(counts))))

Of the 2656 talks, the shortest is 2 words; the longest 9185; and the average 2045.


In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Quick Word Count Visualization
# =-=-=-=-=-=-=-=-=-=-= 

import matplotlib.pyplot as plt
# import pylab
# pylab.rcParams['figure.figsize'] = (12, 6)

# Plot of talk lengths in words, sorted from shortest to longest
length_sorted = sorted(counts)
plt.bar(range(len(length_sorted)), length_sorted)
plt.xlabel('TEDtalks')
plt.ylabel('Length in Words')
plt.title('How Long is a TEDtalk?')
plt.grid(True)
plt.show()

## When are words not words?: Parenthetical Considerations

In the quick examination of the word counts above, the shortest text is two words. A two-word talk? What is that?

In [34]:
for text in texts:
    if len(WhitespaceTokenizer().tokenize(text)) == 2:
        print(text)

  (Music)    (Applause)  


That's two parentheticals describing a musical performance. Let's repeat that code and see how big a text can get before we get spoken words.

In [35]:
# Incrementing 'less than' until I get spoken words.
for text in texts:
    if len(WhitespaceTokenizer().tokenize(text)) < 100:
        print(text)

  (Applause)    (Music)    (Applause)  
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  
  (Music)    (Applause)    (Music)    (Music) (Applause)    (Music) (Applause) (Applause)    Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)    Thank you. Thank you very much. (Applause)  
  (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)  
  (Mechanical noises)    (Music) (Applause)  
  (Music)    (Applause)  
  (Music)    (Music) (Applause)    (Applause)  
  (Guitar music starts)    (Music ends)    (Applause)    (Distorted guitar music starts)    (Music ends)    (Applause)    (Ambient/guitar music starts)    (Music ends)    (Applause)  
  (Guitar music starts)    (Cheers)    (Cheers)    (Music ends)  


I started with 10 words and incremented by 10 until I got to 100, then tried 150 and 200. From 10 words to 100 words, the results do not change that much: these are musical performances. Somewhere between 100 and 200 words, the lyrics of sung performances begin to produce a transcript. We will need to set a minimum length for the texts we treat, but we will also need to consider the role of parentheticals in our analysis.
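
A length threshold of this kind could be sketched as follows. The cutoff of 200 words and the names `MIN_WORDS` and `filter_by_length` are assumptions for illustration, not values settled on here, and `str.split()` is used as a close stand-in for whitespace tokenization.

```python
# Hypothetical cutoff; the right value depends on inspecting the corpus.
MIN_WORDS = 200


def filter_by_length(texts, labels, min_words=MIN_WORDS):
    """Keep only (label, text) pairs whose whitespace token count meets the cutoff."""
    return [
        (label, text)
        for label, text in zip(labels, texts)
        if len(text.split()) >= min_words
    ]


# Toy data: one performance transcript and one talk-length text.
sample_texts = ["(Music) (Applause)", "word " * 250]
sample_labels = ["A performance", "A spoken talk"]

kept = filter_by_length(sample_texts, sample_labels)
print([label for label, _ in kept])
```

Filtering labels and texts together keeps the two aligned, which matters once short transcripts start dropping out.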

Can we see a list of parentheticals as a set?

In [36]:
# First, let's see a list of parentheticals in a talk:
re.findall(r'\(([^)]+)', texts[0])

['Mock sob',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Applause',
 'Laughter',
 'Mock sob',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Applause',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Applause',
 'Applause',
 'Applause',
 'Laughter',
 'Applause']

In [37]:
# Getting a set is not a problem:
set(re.findall(r'\(([^)]+)', texts[0]))

{'Applause', 'Laughter', 'Mock sob'}

In [42]:
# Now to get a set for all the talks:
parentheticals = []
for text in texts:
    parentheticals.append(re.findall(r'\(([^)]+)', text))
    
# print(parentheticals) # Reveals that we have a list of lists

In [43]:
# A quick way to flatten a list of lists is to sum them onto an empty list:
parens_all = sum(parentheticals, [])
len(parens_all)

19570
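
`sum(parentheticals, [])` is fine at this scale, but it builds a fresh list at every addition, so its cost grows quadratically with the number of inner lists. A sketch of the usual linear-time alternative, using a toy `nested` list in place of the real `parentheticals`:

```python
from itertools import chain

# Toy stand-in for the `parentheticals` list of lists built above.
nested = [["Laughter", "Applause"], [], ["Music", "Applause"]]

# chain.from_iterable visits each inner list once, so the work grows
# linearly with the total number of items.
flat = list(chain.from_iterable(nested))

print(len(flat), flat)
```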

Okay, so we have 19,570 parentheticals. At the very least, we can deduct that number from our total word count -- if we can't determine a way to remove them from our texts using some regex-fu -- though, since some parentheticals run to several words, that number is only a lower bound on the words involved. In the meantime, let's take a look at the set of all the parentheticals:

In [45]:
print(parens_all[0:10])

['Mock sob', 'Laughter', 'Laughter', 'Laughter', 'Laughter', 'Applause', 'Laughter', 'Mock sob', 'Laughter', 'Laughter']


In [47]:
paren_set = set(parens_all)
print(len(paren_set), paren_set)

878 {'Trumpet', 'laughter', 'DW: Chauvinist.', '"10,000 missiles"', '"Sure"', 'Singing ends', 'Text: Never, ever think outside the box.', 'Audience: Yes!', 'Intel ad jingle', 'A', 'Audience: Armani.', 'Breaking frozen lettuce or celery', '"The world population is growing by 75 million people each year. That\'s almost the size of Germany. Today, we\'re nearing 7 billion people. At this rate, we\'ll reach 9 billion people by 2040. And we all need food. But how? How do we feed a growing world without destroying the planet? We already know climate change is a big problem. But it\'s not the only problem. We need to face \'the other inconvenient truth.\' A global crisis in agriculture. Population growth + meat consumption + dairy consumption + energy costs + bioenergy production = stress on natural resources. More than 40% of Earth\'s land has been cleared for agriculture. Global croplands cover 16 million km². That\'s almost the size of South America. Global pastures cover 30 million km². T

Oi, we've got stuff in parentheses that shouldn't be there: we'll have to hand inspect the texts above and see if we need to make any corrections. 
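
To narrow down the hand inspection, one option is to flag only the parentheticals that look too long to be stage directions and to record which talk each came from. This is a sketch under assumptions: the five-word cutoff and the helper name `long_parentheticals` are hypothetical, and length is only a rough proxy for "shouldn't be there".

```python
import re

PAREN_RE = re.compile(r"\(([^)]+)")  # same pattern as in the cells above


def long_parentheticals(texts, max_words=5):
    """Return (talk index, parenthetical) pairs longer than max_words words."""
    flagged = []
    for index, text in enumerate(texts):
        for match in PAREN_RE.findall(text):
            if len(match.split()) > max_words:
                flagged.append((index, match))
    return flagged


# Toy data: the first talk has only stage directions, the second has slide text.
sample = [
    "Hello. (Laughter) Thanks. (Applause)",
    "A slide. (Text: Never, ever think outside the box.) Next.",
]

for index, match in long_parentheticals(sample):
    print(index, match)
```

The talk index makes it possible to go back to `labels` and the original transcript for each flagged item.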

How large is our total word count relative to these ~20,000 parentheticals? A quick summing of the `counts` from above -- `sum(counts)` -- reveals our total word count is 5,432,831. Dividing the number of parentheticals by our total word count gives an impact of roughly 0.36% (a lower bound, since some parentheticals contain several words):

In [49]:
len(parens_all)/sum(counts)

0.00360217352610453
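
Because a parenthetical such as `(Guitar music starts)` counts as three whitespace tokens, a closer estimate would count the words inside the parentheses rather than the parentheticals themselves. The sketch below does that and also shows one possible piece of regex-fu for stripping them. The names `PAREN_SPAN`, `strip_parentheticals`, and `paren_word_share` are hypothetical. Unlike the pattern used above, this one requires a closing parenthesis, so an unclosed `(` is left untouched, and it would also strip parenthesized speech of the kind that turned up in the set.

```python
import re

# A whole "(...)" span, parentheses included.
PAREN_SPAN = re.compile(r"\([^)]*\)")


def strip_parentheticals(text):
    """Replace each parenthesized span with a space, then normalize whitespace."""
    return " ".join(PAREN_SPAN.sub(" ", text).split())


def paren_word_share(texts):
    """Fraction of whitespace tokens that fall inside parenthesized spans."""
    total = sum(len(text.split()) for text in texts)
    inside = sum(
        len(span.split()) for text in texts for span in PAREN_SPAN.findall(text)
    )
    return inside / total if total else 0.0


sample = ["Thank you. (Applause) Oh, sorry. (Guitar music starts)"]

print(strip_parentheticals(sample[0]))
print(paren_word_share(sample))
```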

## A Closer Look at the Tokens

Okay, now we need to look at the tokens involved, and we might as well get a frequency for each token while we are at it. 

In [None]:
counts = [len(WhitespaceTokenizer().tokenize(text)) for text in texts]
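
A per-token frequency table could be sketched with `collections.Counter`. The helper name `token_frequencies` and the choice to lowercase are assumptions made for illustration, and `str.split()` again stands in for whitespace tokenization, so punctuation stays attached to tokens.

```python
from collections import Counter


def token_frequencies(texts, lowercase=True):
    """Count whitespace-delimited tokens across all texts."""
    counter = Counter()
    for text in texts:
        tokens = text.split()
        if lowercase:
            # Assumption: case is not meaningful for this first pass.
            tokens = [token.lower() for token in tokens]
        counter.update(tokens)
    return counter


sample = ["Thank you so much", "thank you (Applause)"]
freqs = token_frequencies(sample)

print(freqs.most_common(2))
```

With the real corpus, `freqs.most_common(50)` would give a first look at which tokens dominate, parentheticals included.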