One of the central concerns of our work from the beginning has been getting the matter of *words* right. Stopword lists, sometimes extended beyond function words to include any word too common to help distinguish one text from another, have proven useful in many applications. Even so, we felt it was better to keep all the words in our initial processing of the corpus and to remove words only in later experiments, as they proved problematic -- read "mostly harmless" -- in a particular context.
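
A minimal sketch of that keep-everything-first approach, in plain Python: the full token list is stored untouched, and a stopword set is applied only when a given experiment calls for it. The names `tokens`, `experiment_stopwords`, and `filter_tokens` are hypothetical, and the tiny stopword set is made up for illustration; it is not the list used in this project.

```python
# Hypothetical illustration: keep every token, filter per experiment.
tokens = "thank you so much chris and it is truly a great honor".split()

# A made-up stopword set for one experiment (an assumption, not the project's list).
experiment_stopwords = {"so", "and", "it", "is", "a"}

def filter_tokens(tokens, stopwords):
    """Return the tokens that are not in the given stopword set."""
    return [t for t in tokens if t not in stopwords]

filtered = filter_tokens(tokens, experiment_stopwords)
print(len(tokens), len(filtered))
print(filtered)
```

Because `tokens` is never modified, a different experiment can apply a different stopword set to the same data.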

In [1]:
import pandas
from nltk.tokenize import WhitespaceTokenizer
import numpy as np
import re

## Getting the Talks

In [2]:
# Load the CSV into a dataframe and then place all the texts into a list:

with open('../data/tedtalks2018.csv') as f:
    colnames = f.readline().strip().split(",")

print(colnames)

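# Read from the same path as above. Because names= is given without header=0,
# pandas treats the header row as data, so it shows up as the first row.
df = pandas.read_csv('../data/tedtalks2018.csv', names=colnames)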
print(df.shape)

texts = df.text.tolist()

['rowID', 'Talk_ID', 'public_url', 'speaker_name', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text']
(2657, 12)


The first item in this list is the column header (`texts[0]` is `'text'`), because the header row was read in as data, and so the first thing we are going to do is remove it:

In [3]:
texts.pop(0)

'text'

Now we'll make sure that our first text is, in fact, a text and not something else:

In [4]:
print(texts[0][0:50])

  Thank you so much, Chris. And it's truly a great


## Tokenization

Now we need to determine the best way forward for tokenization. In other projects, I have used NLTK's `WhitespaceTokenizer`, so we will begin there and check the results.
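
As a rough illustration of what whitespace tokenization does, the sketch below uses Python's built-in `str.split()`, which splits on runs of whitespace much as `WhitespaceTokenizer` does. The sample sentence is invented. Note that punctuation stays attached to the words and that a parenthetical such as `(Laughter)` counts as a token.

```python
# Invented sample; str.split() stands in for NLTK's WhitespaceTokenizer here.
sample = "  Thank you so much, Chris. (Laughter)  It's an honor.  "

# Split on runs of whitespace, dropping leading and trailing empty strings.
tokens = sample.split()

print(tokens)
print(len(tokens))
```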

In [5]:
# List of word counts for each talk 
counts = [len(WhitespaceTokenizer().tokenize(text)) for text in texts]
print("Of the {} talks, the shortest is {} words; the longest {}; and the average {}.".format(
    len(counts),min(counts), max(counts), int(np.mean(counts))))

Of the 2656 talks, the shortest is 2 words; the longest 9185; and the average 2045.


A two-word talk? What is that?

In [6]:
for text in texts:
    if len(WhitespaceTokenizer().tokenize(text)) < 100:
        print(text)

  (Applause)    (Music)    (Applause)  
  Let's just get started here.    Okay, just a moment.    (Whirring)    All right. (Laughter) Oh, sorry.    (Music) (Beatboxing)    Thank you.    (Applause)  
  (Music)    (Applause)    (Music)    (Music) (Applause)    (Music) (Applause) (Applause)    Herbie Hancock: Thank you. Marcus Miller. (Applause) Harvey Mason. (Applause)    Thank you. Thank you very much. (Applause)  
  (Music)    (Applause)    (Music)    (Applause)  
  (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)    (Music)    (Applause)  
  (Mechanical noises)    (Music) (Applause)  
  (Music)    (Applause)  
  (Music)    (Music) (Applause)    (Applause)  
  (Guitar music starts)    (Music ends)    (Applause)    (Distorted guitar music starts)    (Music ends)    (Applause)    (Ambient/guitar music starts)    (Music ends)    (Applause)  
  (Guitar music starts)    (Cheers)    (Cheers)    (Music ends)  


Whether the cutoff is 10 words or 100, the results do not change much: these are musical performances. Somewhere between 100 and 200 words, we begin to see transcripts made up of the lyrics of sung performances. We will need to set a minimum length for the texts we treat, but we will also need to consider the role of parentheticals in our analysis.
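
One possible way to handle both concerns, sketched under stated assumptions: strip the parentheticals first, then keep only the talks whose remaining word count meets a minimum. The helpers `strip_parentheticals` and `long_enough`, the sample strings, and the threshold of 5 are all hypothetical; the right threshold for this corpus is left open and appears here only as a parameter.

```python
import re

# Matches a parenthetical such as "(Applause)" or "(Music ends)".
PAREN = re.compile(r"\([^)]*\)")

def strip_parentheticals(text):
    """Replace each parenthetical in a transcript with a space."""
    return PAREN.sub(" ", text)

def long_enough(text, min_words):
    """True if at least min_words words remain once parentheticals are removed."""
    return len(strip_parentheticals(text).split()) >= min_words

# Invented sample transcripts, not taken from the corpus.
samples = [
    "  (Music)    (Applause)  ",
    "  Thank you so much, Chris. (Laughter) It's truly a great honor.  ",
]

kept = [t for t in samples if long_enough(t, min_words=5)]
print(len(kept))
```

Counting words after stripping means a performance that consists only of `(Music)` and `(Applause)` has a length of zero, whatever threshold is chosen.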

Can we see a list of parentheticals as a set?

In [7]:
# First, let's see a list of parentheticals in a talk:
# A raw string keeps the backslash in the pattern from being read as a string escape.
re.findall(r'\(([^)]+)', texts[0])

['Mock sob',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Applause',
 'Laughter',
 'Mock sob',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Applause',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Laughter',
 'Applause',
 'Applause',
 'Applause',
 'Laughter',
 'Applause']

In [8]:
# Getting a set is not a problem:
set(re.findall(r'\(([^)]+)', texts[0]))

{'Applause', 'Laughter', 'Mock sob'}

In [9]:
# Now to collect the parentheticals for all the talks (one list per talk):
parentheticals = []
for text in texts:
    parentheticals.append(re.findall(r'\(([^)]+)', text))
    
print(len(parentheticals))

2656
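
The list above holds one entry per talk, so the per-talk lists still need to be flattened to get a single set for the whole corpus. A minimal sketch, using invented stand-in data in place of the real `parentheticals` list, with `collections.Counter` added to show how often each label occurs:

```python
from collections import Counter
from itertools import chain

# Stand-in for the per-talk lists built above (invented data).
parentheticals = [
    ["Laughter", "Applause", "Laughter"],
    ["Music", "Applause"],
    [],
]

# Flatten the list of lists into one list of labels.
all_labels = list(chain.from_iterable(parentheticals))

unique_labels = set(all_labels)      # each distinct parenthetical once
label_counts = Counter(all_labels)   # how often each one occurs

print(sorted(unique_labels))
print(label_counts.most_common(2))
```

The counts are likely to be more informative than the bare set, since they separate the ubiquitous labels from the rare ones.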


## Quick Word Count

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline
# Set the default figure size through pyplot rather than the pylab interface.
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# =-=-=-=-=-=-=-=-=-=-=
# Plot of Talk Lengths (in Words)
# =-=-=-=-=-=-=-=-=-=-= 

length_sorted = sorted(counts)
plt.bar(range(len(length_sorted)), length_sorted)
plt.xlabel('TEDtalks')
plt.ylabel('Length in Words')
plt.title('How Long is a TEDtalk?')
plt.grid(True)
plt.show()