## Option 2: TED Talks
 
Using a Kaggle dataset listing the characteristics (e.g., number of views, number of comments) for TED
talks, summarize the most common occupations of TED speakers. In addition, for the top 100 most
popular TED talks, what are the most common themes? Is there a relationship between speaker
occupation and popularity?

### Imports and Functions

In [70]:
import pandas as pd
import re
import collections
import sys
from string import punctuation

import warnings
warnings.filterwarnings('ignore')

In [44]:
# https://github.com/natashamathur/natasha/blob/master/common_word.py


more_stops = ['tedx', 'most','dont', 'want', 'im', 'not', '—', 'people', 'can', 'us','one', 'like', 'just', 'know', 'think', 'now',
             'cant', 'weve', 'shouldve', 'couldve', 'wont', 'isnt', 'your', 'youre', 'tedx']


def strip_punctuation(s):
    
    return ''.join(c for c in s if c not in punctuation)

def common_words(x, n=5, more_stops = more_stops, remove_stop_words = True):
    '''
    Takes in a string (or a non-string path to a text file) and returns a dict of
    word counts along with the n most commonly used words.
    '''

    if type(x) is not str:
        x = open(x, 'r', encoding = 'latin-1')
        x = x.read()

    x = strip_punctuation(x)
    lx = x.lower().split()
    
    STOP_WORDS = [ "a", "about", "above", "tedx", "ted x", "after", "again", "nan", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "im", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves",
                 ]

    
    if len(more_stops) > 0:
        STOP_WORDS = STOP_WORDS + more_stops
        
    if remove_stop_words:
        lx = [w for w in lx if w not in STOP_WORDS]
            
        lx = [w for w in lx if len(w) > 3]
    
    d = {}
    for word in lx:
        if word not in d.keys():
            d[word] = 1
        else:
            d[word] += 1

    most_common = sorted(d, key = d.get, reverse=True)[:n]
            
    #print("The " + str(n) + " most common words are: " + ( ", ".join( e for e in most_common)) + ".")
    #return most_common
    return d, most_common

STOP_WORDS = ['tedx', "a", "about", "above", "after", "again", "nan", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up","us", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves" ]

def get_top_words_col(column):
    # Isolate the most commonly used terms in each column
    text = ''
    for i, r in top_ts.iterrows():
        text = text + " " + r[column] + " "
        
    d, mc = common_words(text,10)
    for w in mc:
        print("{} ({})".format(w, d[w]))
        
    return text

### Reading and Cleaning Data

In [4]:
m = pd.read_csv('ted_main.csv')
ts = pd.read_csv('transcripts.csv')

In [5]:
def clean_tags(tags):
    tag = tags.split(",")
    bad_chars = "[]'' "
    tag = [x.strip(" '") for x in tag]
    tag = [x.strip("']'") for x in tag]
    tag = [x.strip("['") for x in tag]
    return ' '.join(tag)

m.tags = m.tags.apply(clean_tags,)

In [6]:
def get_name(url):
    return (' '.join(url.split("talks/")[1].split("_")[:2])).title()

ts['name'] = ts.url.apply(get_name,)

In [7]:
m.columns, m.shape

(Index(['comments', 'description', 'duration', 'event', 'film_date',
        'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
        'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
        'url', 'views'],
       dtype='object'), (2550, 17))

In [8]:
m.shape, ts.shape

((2550, 17), (2467, 3))

## Common Occupations

In [9]:
m.num_speaker.value_counts()

1    2492
2      49
3       5
4       3
5       1
Name: num_speaker, dtype: int64

In [10]:
m[m.num_speaker == 2].speaker_occupation.value_counts().head(3)

Singer/songwriter    3
Musician             3
Researcher           2
Name: speaker_occupation, dtype: int64

In [11]:
m.speaker_occupation.value_counts().head(10)

Writer          45
Designer        34
Artist          34
Journalist      33
Entrepreneur    31
Architect       30
Inventor        27
Psychologist    26
Photographer    25
Filmmaker       21
Name: speaker_occupation, dtype: int64

With 2,550 talks in the dataset as of September 2017 and no more than 45 speakers sharing the same occupation, the self-described occupations of TED speakers vary greatly. The vast majority of TED talks (2,492 of 2,550) are given by one person. Going by the occupation exactly as provided, the most common occupations are writer, designer, artist, journalist, and entrepreneur.

In [12]:
def split_occ(occ):
    # Raw string so the backslashes reach the regex engine unchanged
    occ = re.split(r'[\[\] /]', str(occ))
    occ = [x.strip(",") for x in occ]
    occ = [x.lower() for x in occ]
    return occ

In [13]:
m['split_occ'] = m.speaker_occupation.apply(split_occ,)
occs = []
for i,r in m.iterrows():
    occs.extend(r.split_occ)
filler = ['and', 'or','of', 'the', '']
occs = [x for x in occs if x not in filler]

counter = collections.Counter(occs)
counter.most_common(10)

[('activist', 143),
 ('designer', 113),
 ('artist', 107),
 ('writer', 105),
 ('entrepreneur', 99),
 ('author', 92),
 ('scientist', 88),
 ('researcher', 80),
 ('expert', 78),
 ('journalist', 65)]

In [14]:
# Copy so the column assignments below act on an independent frame
mx = m.dropna(subset=['speaker_occupation']).copy()
mx['speaker_occupation'] = mx.speaker_occupation.str.lower()
mx[mx['speaker_occupation'].str.contains("activist")].speaker_occupation.head()

3     activist for environmental justice
24                  playwright, activist
39                    musician, activist
46                    musician, activist
63                              activist
Name: speaker_occupation, dtype: object

Since the same profession can be described in different ways, I broke each given occupation down into individual words, removed common filler words (e.g., "and", "of"), and looked at the most common words left. The most common words were activist, designer, artist, writer, and entrepreneur.
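As a rough illustration of the word-level breakdown described above, the same count can be sketched with pandas string methods instead of an explicit loop. The toy series and the `descriptor_counts` helper are hypothetical stand-ins (not part of the original notebook), and the split pattern is an assumption about which separators matter.

```python
import pandas as pd

# Hypothetical stand-in for m.speaker_occupation
occupations = pd.Series([
    "Author/educator",
    "Musician, activist",
    "Activist for environmental justice",
    "Designer and inventor",
    None,
])

FILLER = {"and", "or", "of", "the", ""}

def descriptor_counts(series, filler=FILLER):
    """Count single-word descriptors across free-text occupation strings."""
    words = (
        series.dropna()
        .str.lower()
        .str.split(r"[\s/,;]+")  # assumed separators: whitespace, slash, comma, semicolon
        .explode()               # one row per word
    )
    return words[~words.isin(filler)].value_counts()

counts = descriptor_counts(occupations)
print(counts.head())
```

On the real data, this would be called as `descriptor_counts(m.speaker_occupation)`; the result may differ slightly from the loop-based count because the separators are handled differently.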

Many people have multiple descriptors. Speakers who are activists are commonly also writers or entrepreneurs. 
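One way to check which descriptors travel together is to count the words that share an occupation string with a given target word. The sketch below is illustrative only: `co_descriptors` is a hypothetical helper, and the sample lists mirror a few of the occupation strings shown in this section rather than the full dataset.

```python
from collections import Counter

def co_descriptors(split_occs, target):
    """Count descriptors that appear alongside `target` in the same occupation."""
    counts = Counter()
    for words in split_occs:
        if target in words:
            # set() so a word repeated within one occupation is counted once
            counts.update(w for w in set(words) if w != target)
    return counts

sample = [
    ["playwright", "activist"],
    ["musician", "activist"],
    ["musician", "activist"],
    ["activist"],
    ["author", "educator"],
]
activist_partners = co_descriptors(sample, "activist")
print(activist_partners.most_common(3))
```

On the notebook's data, this would be called as `co_descriptors(m.split_occ, "activist")`, which would show whether writers and entrepreneurs really are the most frequent companions.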

In [15]:
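# Drop 'writer' and 'entrepreneur' from each speaker's list of descriptors.
# Testing each whole list against these strings would never filter anything,
# so the filter is applied to the words inside each list.
mx['occ_lim'] = mx.split_occ.apply(lambda words: [w for w in words if w not in ('writer', 'entrepreneur')])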

In [16]:
occs = []
for i,r in mx.iterrows():
    occs.extend(r.occ_lim)
filler = ['and', 'or','of', 'the', '', 'writer', 'entrepreneur', 'activist', 'author', 'social']
occs = [x for x in occs if x not in filler]

counter = collections.Counter(occs)
counter.most_common(10)

[('designer', 113),
 ('artist', 107),
 ('scientist', 88),
 ('researcher', 80),
 ('expert', 78),
 ('journalist', 65),
 ('inventor', 58),
 ('educator', 57),
 ('biologist', 56),
 ('psychologist', 54)]

To see past these shared labels, I removed words that can apply to people across multiple industries (activist, entrepreneur, author, social, writer) and found that the most common remaining descriptors were designer, artist, scientist, researcher, and expert.

Overall, the most common descriptors are activist (across different areas), writer/author, and entrepreneur/inventor. Other common professions include designers, artists, and different types of scientists and researchers.

## Top 100 Most Popular Talks

In [17]:
m.columns

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'split_occ'],
      dtype='object')

#### What makes a talk popular? 

In [18]:
top_by_comments = m.sort_values(by='comments', ascending=False)
top_by_comments.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,split_occ
96,6404,Richard Dawkins urges all atheists to openly s...,1750,TED2002,1012608000,42,Richard Dawkins,Richard Dawkins: Militant atheism,1,1176689220,"[{'id': 3, 'name': 'Courageous', 'count': 3236...","[{'id': 86, 'hero': 'https://pe.tedcdn.com/ima...",Evolutionary biologist,God atheism culture religion science,Militant atheism,https://www.ted.com/talks/richard_dawkins_on_m...,4374792,"[evolutionary, biologist]"
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,children creativity culture dance education pa...,Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,"[author, educator]"
644,3356,"Questions of good and evil, right and wrong ar...",1386,TED2010,1265846400,39,Sam Harris,Sam Harris: Science can answer moral questions,1,1269249180,"[{'id': 8, 'name': 'Informative', 'count': 923...","[{'id': 666, 'hero': 'https://pe.tedcdn.com/im...","Neuroscientist, philosopher",culture evolutionary psychology global issues ...,Science can answer moral questions,https://www.ted.com/talks/sam_harris_science_c...,3433437,"[neuroscientist, philosopher]"
201,2877,Jill Bolte Taylor got a research opportunity f...,1099,TED2008,1204070400,49,Jill Bolte Taylor,Jill Bolte Taylor: My stroke of insight,1,1205284200,"[{'id': 22, 'name': 'Fascinating', 'count': 14...","[{'id': 184, 'hero': 'https://pe.tedcdn.com/im...",Neuroanatomist,biology brain consciousness global issues illn...,My stroke of insight,https://www.ted.com/talks/jill_bolte_taylor_s_...,21190883,[neuroanatomist]
1787,2673,Our consciousness is a fundamental aspect of o...,1117,TED2014,1395100800,33,David Chalmers,David Chalmers: How do you explain consciousness?,1,1405350484,"[{'id': 25, 'name': 'OK', 'count': 280}, {'id'...","[{'id': 1308, 'hero': 'https://pe.tedcdn.com/i...",Philosopher,brain consciousness neuroscience philosophy,How do you explain consciousness?,https://www.ted.com/talks/david_chalmers_how_d...,2162764,[philosopher]


One candidate measure is the number of comments, but that may just be a ranking of the most controversial talks. The talk with the most comments is Richard Dawkins' talk on militant atheism, a topic that is sure to draw plenty of dissenting opinions.

In [19]:
top_by_lang = m.sort_values(by='languages', ascending=False)
# Select the numeric column before averaging so non-numeric columns are not passed to mean()
top_by_lang.groupby('main_speaker')['languages'].mean().to_frame().sort_values(by='languages', ascending=False).head()

Unnamed: 0_level_0,languages
main_speaker,Unnamed: 1_level_1
Matt Cutts,72.0
Derek Sivers,64.0
Richard St. John,61.0
Adora Svitak,58.0
Arianna Huffington,57.0


How many languages a talk is translated into could be a sign of how popular it is. However, some speakers' talks seem to be translated into more languages regardless of how popular the talk was. This could be a function of where the event was held or where the speaker is originally from. 
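Whether translation count tracks popularity could be checked with a rank correlation between the two columns. The sketch below computes a Spearman-style coefficient as the Pearson correlation of the ranks, using only pandas. The `rank_correlation` helper is hypothetical, and the toy frame reuses the `languages` and `views` values from the five rows displayed earlier, which are far too few to support a conclusion.

```python
import pandas as pd

def rank_correlation(df, col_a, col_b):
    """Spearman-style correlation: Pearson correlation of the two columns' ranks."""
    return df[col_a].rank().corr(df[col_b].rank())

# Values copied from the five most-commented talks shown earlier
toy = pd.DataFrame({
    "languages": [42, 60, 39, 49, 33],
    "views": [4374792, 47227110, 3433437, 21190883, 2162764],
})
rho = rank_correlation(toy, "languages", "views")
print(round(rho, 3))
```

Running `rank_correlation(m, "languages", "views")` on the full dataset would give a more meaningful figure.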

In [20]:
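# NOTE: this sorts by 'comments', so top_100 holds the 100 most-commented talks
# (the outputs below reflect that); pass by='views' to rank by view count instead.
top_by_views = m.sort_values(by='comments', ascending=False)
top_100 = top_by_views.head(100).reset_index()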

In [21]:
top_100[['main_speaker','title']].head()

Unnamed: 0,main_speaker,title
0,Richard Dawkins,Militant atheism
1,Ken Robinson,Do schools kill creativity?
2,Sam Harris,Science can answer moral questions
3,Jill Bolte Taylor,My stroke of insight
4,David Chalmers,How do you explain consciousness?


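Finally, I decided to define the top 100 by number of views: popular TED talks get attention regardless of where they took place, and the number of comments on a video is partly a function of the number of views. Note, however, that the cell above sorts on comments, so the top 100 used in the rest of this analysis is the 100 most-commented talks. The two rankings differ: Richard Dawkins' talk has the most comments (6,404), but Ken Robinson's talk has far more views (47,227,110 versus 4,374,792).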
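Because views and comments can rank talks differently, it can help to measure how much the two definitions of "top" agree. The following is a minimal sketch under the assumption that each talk has a unique index label. `top_n_overlap` is a hypothetical helper, and the toy frame reuses the five rows displayed earlier.

```python
import pandas as pd

def top_n_overlap(df, n, col_a="views", col_b="comments"):
    """Return the index labels that are in the top n by both columns."""
    top_a = set(df.nlargest(n, col_a).index)
    top_b = set(df.nlargest(n, col_b).index)
    return top_a & top_b

toy = pd.DataFrame(
    {
        "comments": [6404, 4553, 3356, 2877, 2673],
        "views": [4374792, 47227110, 3433437, 21190883, 2162764],
    },
    index=[
        "Militant atheism",
        "Do schools kill creativity?",
        "Science can answer moral questions",
        "My stroke of insight",
        "How do you explain consciousness?",
    ],
)
shared = top_n_overlap(toy, n=2)
print(shared)
```

On the full data, `len(top_n_overlap(m, 100))` would show how many talks make both top-100 lists.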

## Common Themes

Many of the most popular talks were about psychological or social issues, particularly those that had global applications. 

I looked for common themes by looking at the most common words used in the tags, the description, and the transcript of each talk. This is the area I would most like to devote additional time and effort to, for example by building an NLTK model to isolate repeated themes.

Looking at the tags for each talk, the most common topics are those that touch on worldwide issues or inspirational topics. This is supported by the language used in the talk descriptions. The most popular talks are on accessible issues of general concern, such as cultural issues. Although scientists give many of the talks, heavily technical scientific topics are less prominent in this list.
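Because `clean_tags` joins tags with spaces, a multi-word tag such as "global issues" is later counted as two separate words. A possible alternative, sketched here, is to parse each raw tag string as a Python list and count whole tags. This assumes the raw `tags` column (before `clean_tags` is applied) holds strings shaped like `"['culture', 'global issues']"`. `count_tags` and the sample strings are hypothetical.

```python
import ast
from collections import Counter

def count_tags(raw_tag_strings):
    """Count whole tags from strings that look like Python list literals."""
    counts = Counter()
    for raw in raw_tag_strings:
        tags = ast.literal_eval(raw)  # assumed format: "['tag one', 'tag two']"
        counts.update(t.strip().lower() for t in tags)
    return counts

sample = [
    "['culture', 'global issues', 'science']",
    "['global issues', 'business']",
]
tag_counts = count_tags(sample)
print(tag_counts.most_common(2))
```

With this approach, "global issues" stays a single theme instead of contributing separately to "global" and "issues".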

In [41]:
t100 = top_100.copy()

top_ts = t100.merge(ts, how='left', on='url')
top_ts = top_ts[['description',  'event', 'film_date',
        'main_speaker', 'name_x', 'related_talks', 'speaker_occupation', 'tags', 'title',
        'url',  'split_occ', 'transcript']]

top_ts = top_ts.drop([84], axis = 0)

t100.shape, top_ts.shape

((100, 19), (100, 12))

In [60]:
top_ts.head(2)

Unnamed: 0,description,event,film_date,main_speaker,name_x,related_talks,speaker_occupation,tags,title,url,split_occ,transcript,transcript_c
0,Richard Dawkins urges all atheists to openly s...,TED2002,1012608000,Richard Dawkins,Richard Dawkins: Militant atheism,"[{'id': 86, 'hero': 'https://pe.tedcdn.com/ima...",Evolutionary biologist,God atheism culture religion science,Militant atheism,https://www.ted.com/talks/richard_dawkins_on_m...,"[evolutionary, biologist]","That splendid music, the coming-in music, ""The...",splendid music comingin music elephant march a...
1,Sir Ken Robinson makes an entertaining and pro...,TED2006,1140825600,Ken Robinson,Ken Robinson: Do schools kill creativity?,"[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,children creativity culture dance education pa...,Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,"[author, educator]",Good morning. How are you?(Laughter)It's been ...,good morning youlaughterits great hasnt ive bl...


#### By description 

In [46]:
by_desc = get_top_words_col("description")

talk (37)
world (17)
even (16)
says (15)
makes (11)
change (11)
powerful (10)
case (10)
shares (10)
life (9)


#### By tags 

In [47]:
by_tags = get_top_words_col("tags")

culture (45)
global (31)
issues (31)
business (27)
science (19)
brain (19)
psychology (17)
health (17)
technology (14)
change (14)


#### By title - not substantive

In [27]:
import unicodedata
def keep_chr(c):
    return (unicodedata.category(c).startswith('P') and \
                (c != "#" and c != "@" and c != "&"))
PUNCTUATION = " ".join([chr(i) for i in range(sys.maxunicode) if keep_chr(chr(i))])

In [28]:
def clean_text(title):
    title = str(title).lower()
    title = strip_punctuation(title)
    title = title.split(" ")
    title = [x for x in title if x not in STOP_WORDS]
    title = ' '.join([x for x in title if x not in PUNCTUATION])

    return title

In [29]:
top_ts['title_cleaned'] = top_ts.title.apply(clean_text,)

In [30]:
by_title = get_top_words_col("title_cleaned")

make (4)
world (4)
science (3)
power (3)
climate (3)
wrong (3)
teach (3)
need (3)
violence (3)
women (3)


#### By transcript

In [51]:
top_ts['transcript_c'] = top_ts.transcript.apply(clean_text,)

In [52]:
by_transcript= get_top_words_col("transcript_c")

going (648)
thats (590)
said (587)
really (516)
world (504)
will (465)
years (461)
time (454)
make (409)
things (407)


In [71]:
problem_words = [ "a", "about", "above", "tedx", "ted x", "after", "again", "nan", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "im", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves",
                 "new", "york", "say", "no", "thank"]

In [55]:
by_transcript = [w for w in by_transcript.split(' ') if w not in problem_words]

In [57]:
ngrams = {}
for i in range(len(by_transcript)-2):
    ng = ' '.join(by_transcript[i:i+3])
    if ng in ngrams.keys():
        ngrams[ng] = ngrams[ng] + 1
    else:
        ngrams[ng] = 1

In [58]:
mc = sorted(ngrams, key = ngrams.get, reverse=True)[:10]
for i, w in enumerate(mc):
    print("{} ({})".format(w, ngrams[w]))

youre not going (9)
10 years ago (8)
every single day (8)
people around world (7)
amount dark energy (7)
whats going happen (6)
every single one (6)
seven years old (6)
shape extra dimensions (6)
thats not going (5)
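The trigram count above can also be written more compactly with `collections.Counter` and `zip`. This is a sketch of the same sliding-window idea, not the notebook's exact code; `ngram_counts` and the sample sentence are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n=3):
    """Count every run of n consecutive tokens."""
    # zip over n shifted views of the token list yields each sliding window
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(w) for w in windows)

tokens = "every single day every single one every single day".split()
trigrams = ngram_counts(tokens)
print(trigrams.most_common(1))
```

Using `.split()` with no argument also drops the empty tokens that `split(' ')` produces when the text contains doubled spaces. Counting each transcript separately and summing the counters would avoid trigrams that span two different talks.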


## Speaker Occupation and Popularity

In [37]:
top_100.speaker_occupation.value_counts().head()

Writer                 5
Psychologist           5
Social psychologist    3
Journalist             3
Novelist               2
Name: speaker_occupation, dtype: int64

The most popular talks were given by speakers from a variety of professions.

In [38]:
occs = []
for i,r in top_100.iterrows():
    occs.extend(r.split_occ)
filler = ['and', 'or','of', 'the', '']
occs = [x for x in occs if x not in filler]

counter = collections.Counter(occs)
counter.most_common(10)

[('psychologist', 12),
 ('writer', 8),
 ('author', 6),
 ('social', 5),
 ('activist', 5),
 ('educator', 4),
 ('philosopher', 4),
 ('researcher', 4),
 ('designer', 4),
 ('expert', 3)]

While the list of most common professions overall included more scientists, the popular talks were given by people in more qualitative fields. 

In [39]:
occs = []
for i,r in top_100.iterrows():
    occs.extend(r.split_occ)
filler = ['and', 'or','of', 'the', '', 'writer', 'activist']
occs = [x for x in occs if x not in filler]

counter = collections.Counter(occs)
counter.most_common(10)

[('psychologist', 12),
 ('author', 6),
 ('social', 5),
 ('educator', 4),
 ('philosopher', 4),
 ('researcher', 4),
 ('designer', 4),
 ('expert', 3),
 ('research', 3),
 ('journalist', 3)]

Removing the ubiquitous "writer" and "activist" only makes that more apparent. 

In [40]:
top_100.groupby(['main_speaker', 'speaker_occupation']).size().to_frame().sort_values(by=[0], ascending=False).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
main_speaker,speaker_occupation,Unnamed: 2_level_1
Al Gore,Climate advocate,2
Ken Robinson,Author/educator,2
Brené Brown,Vulnerability researcher,2
Aaron Huey,Photographer,1
Melinda Gates,Philanthropist,1


Three people had two spots in the top 100: former Vice President Al Gore, educator Ken Robinson, and researcher Brené Brown.

A speaker's occupation does not seem to automatically determine whether their talk will be popular. For example, there are numerous authors and activists who give talks of varying levels of popularity. However, while speakers who describe themselves as scientists give many talks, "scientist" does not appear among the ten most common descriptors in the top 100.

The popularity of a talk likely depends on more factors in addition to the topic, such as how popular the speaker is, the quality of the presentation, timing in relation to current events, and marketing. 

In [69]:
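# NOTE: m is still in file order, so tail(2450) skips the first 100 rows of the file,
# not the 100 talks in top_100; top_by_views.iloc[100:] would select the remaining talks.
# The output below comes from the code as written.
others = m.tail(2450)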

occs = []
for i,r in others.iterrows():
    occs.extend(r.split_occ)
filler = ['and', 'or','of', 'the', '', 'writer', 'activist']

occs = [x for x in occs if x not in filler]

counter = collections.Counter(occs)
counter.most_common(10)

[('designer', 109),
 ('artist', 101),
 ('entrepreneur', 99),
 ('author', 87),
 ('scientist', 86),
 ('researcher', 78),
 ('expert', 75),
 ('journalist', 63),
 ('educator', 56),
 ('social', 54)]
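Raw counts are hard to compare between a group of 100 talks and a group of thousands, so one option is to compare shares instead. The sketch below divides each descriptor count by the number of talks in its group and reports the ratio. `share_ratio` is a hypothetical helper; the counts are copied from the outputs in this notebook (top 100 versus all 2,550 talks). Because they are word counts rather than talk counts, the shares are approximate.

```python
from collections import Counter

def share_ratio(top_counts, n_top, all_counts, n_all):
    """Map each descriptor to (share in top group, share overall, ratio of the two)."""
    out = {}
    for word, count in top_counts.items():
        top_share = count / n_top
        all_share = all_counts.get(word, 0) / n_all
        ratio = top_share / all_share if all_share else float("inf")
        out[word] = (top_share, all_share, ratio)
    return out

# Counts taken from the outputs shown in this notebook
top_counts = Counter({"psychologist": 12, "designer": 4})
all_counts = Counter({"psychologist": 54, "designer": 113})

ratios = share_ratio(top_counts, 100, all_counts, 2550)
for word, (top_share, all_share, ratio) in ratios.items():
    print(f"{word}: {top_share:.1%} of top 100 vs {all_share:.1%} overall (x{ratio:.1f})")
```

A ratio well above 1 suggests a descriptor is over-represented among the top talks, and a ratio below 1 suggests the opposite. With only 100 talks in the top group, small differences should not be over-read.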