# Capstone Project: The Persuasive Power of Words

*by Nee Bimin*

## Notebook 1: Data Cleaning

In this notebook, we will inspect the data on TED Talks and clean them before performing some exploratory visualisations in the next notebook.

# Problem Statement

Once we step out into the working world, we need to communicate to get our ideas across. The question is, how do we use words to persuade others to agree with our viewpoints? In this project, we will use TED Talks data set from Kaggle which contains ratings for persuasiveness of the talks. Classifiers - decision trees, random forest regressors, and linear regression will be deployed to find the key predictors of persuasiveness. 

This project aims to not only help us get an idea of words that persuade, but also words that dissuade to avoid using those words in meetings and presentations. To evaluate the models, we will compare it against the baseline accuracy, and also consider the f1-score, which can be understood as the weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.

## Executive Summary

* Data Cleaning
* Exploratory Data Analysis
* Model Training
* Conclusion and Business Recommendations

## Content
- [Data Reading and Exploration](#Data-Reading-and-Exploration)
- [Data Cleaning of Main Dataframe](#Data-Cleaning-of-Main-Dataframe)
    * [Converting Data Types](#Converting-Data_Types)
    * [Handling Missing Values](#Handling-Missing-Values)
    * [Ratings Column Cleaning and Preprocessing](#Ratings-Column-Cleaning-and-Preprocesing)
    * [Speaker Occupation Column Cleaning and Preprocessing](#Speaker-Occupation-Column-Cleaning-and-Preprocesing)
    * [Tags Column Preprocessing](#Tages-Column-Preprocessing)
- [Data Cleaning of Transcripts](#Data-Cleaning-of-Transcripts)
- [Data Export](#Data-Export)

In [1]:
# # Run line(s) to install spacy and/or pyLDAvis
# conda install spacy
# conda install pyLDAvis

In [2]:
# Import libraries

import numpy as np
import pandas as pd
import requests
import time
import re
import seaborn as sns
import matplotlib.pyplot as plt
import ast
from collections import defaultdict

import os
import sys

%matplotlib inline

## Data Reading and Exploration

Read in data and explore it before cleaning it for visualisation.

In [3]:
ted = pd.read_csv('../data/ted_main.csv')
ted.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


* We see that duration here is in seconds. We will convert it to minutes. 
* The film dates and published dates are in time stamps which we have to transform later.
* Comments, ratings and views can be explored to find out the popularity of the talks.
* Languages are the number of languages that each talk is translated to. 

In [4]:
# Explore ratings column
ted.ratings[0]

"[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

There are different types of ratings that give more insight as to why a talk is good or bad. Viewers can vote for the different types of ratings, unlike the typical ratings done on a scale. 

In [5]:
# Check out the tags column
ted.tags.head()

0    ['children', 'creativity', 'culture', 'dance',...
1    ['alternative energy', 'cars', 'climate change...
2    ['computers', 'entertainment', 'interface desi...
3    ['MacArthur grant', 'activism', 'business', 'c...
4    ['Africa', 'Asia', 'Google', 'demo', 'economic...
Name: tags, dtype: object

In [6]:
ted.tags.nunique()

2530

There are 2530 different tags used to categorise the talks. 

In [7]:
# Check data types
ted.dtypes

comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object

## Data Cleaning of Main Dataframe

### Converting Data Types
The dates will be transformed to datetime objects. The variable duration will be converted to minutes.

In [8]:
# Convert to datetime objects
ted['film_date'] = pd.to_datetime(ted['film_date'])
ted['published_date'] = pd.to_datetime(ted['published_date'])

In [9]:
# Convert duration to minutes
ted['duration'] = ted['duration']/60

In [10]:
ted.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,19.4,TED2006,1970-01-01 00:00:01.140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1970-01-01 00:00:01.151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,16.283333,TED2006,1970-01-01 00:00:01.140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1970-01-01 00:00:01.151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,21.433333,TED2006,1970-01-01 00:00:01.140739200,26,David Pogue,David Pogue: Simplicity sells,1,1970-01-01 00:00:01.151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",18.6,TED2006,1970-01-01 00:00:01.140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1970-01-01 00:00:01.151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,19.833333,TED2006,1970-01-01 00:00:01.140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1970-01-01 00:00:01.151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


In [11]:
# Check data types
ted.dtypes

comments                       int64
description                   object
duration                     float64
event                         object
film_date             datetime64[ns]
languages                      int64
main_speaker                  object
name                          object
num_speaker                    int64
published_date        datetime64[ns]
ratings                       object
related_talks                 object
speaker_occupation            object
tags                          object
title                         object
url                           object
views                          int64
dtype: object

### Handling Missing Values

In [12]:
# Check for null values
ted.isnull().sum()

comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    6
tags                  0
title                 0
url                   0
views                 0
dtype: int64

Only the speaker occupation column has missing values, which we will check below.

In [13]:
ted[ted.speaker_occupation.isnull()]

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
1113,145,"After a crisis, how can we tell if water is sa...",3.616667,TEDGlobal 2011,1970-01-01 00:00:01.310601600,38,Sonaar Luthra,Sonaar Luthra: Meet the Water Canary,1,1970-01-01 00:00:01.326731605,"[{'id': 10, 'name': 'Inspiring', 'count': 73},...","[{'id': 523, 'hero': 'https://pe.tedcdn.com/im...",,"['TED Fellows', 'design', 'global development'...",Meet the Water Canary,https://www.ted.com/talks/sonaar_luthra_meet_t...,353749
1192,122,"The Pirate Party fights for transparency, anon...",18.283333,TEDxObserver,1970-01-01 00:00:01.331424000,10,Rick Falkvinge,Rick Falkvinge: I am a pirate,1,1970-01-01 00:00:01.333289675,"[{'id': 8, 'name': 'Informative', 'count': 156...","[{'id': 1329, 'hero': 'https://pe.tedcdn.com/i...",,"['Internet', 'TEDx', 'global issues', 'politic...",I am a pirate,https://www.ted.com/talks/rick_falkvinge_i_am_...,181010
1220,257,"As you surf the Web, information is being coll...",6.65,TED2012,1970-01-01 00:00:01.330473600,32,Gary Kovacs,Gary Kovacs: Tracking our online trackers,1,1970-01-01 00:00:01.336057219,"[{'id': 23, 'name': 'Jaw-dropping', 'count': 9...","[{'id': 1370, 'hero': 'https://pe.tedcdn.com/i...",,"['Internet', 'advertising', 'business', 'priva...",Tracking our online trackers,https://www.ted.com/talks/gary_kovacs_tracking...,2098639
1656,140,"In this lovely talk, TED Fellow Ryan Holladay ...",6.483333,TED@BCG San Francisco,1970-01-01 00:00:01.383091200,33,Ryan Holladay,Ryan Holladay: To hear this music you have to ...,1,1970-01-01 00:00:01.389369735,"[{'id': 1, 'name': 'Beautiful', 'count': 211},...","[{'id': 1152, 'hero': 'https://pe.tedcdn.com/i...",,"['TED Fellows', 'entertainment', 'music', 'tec...",To hear this music you have to be there. Liter...,https://www.ted.com/talks/ryan_holladay_to_hea...,1284510
1911,48,What do you do with an outdated encyclopedia i...,6.1,TEDYouth 2014,1970-01-01 00:00:01.415059200,34,Brian Dettmer,Brian Dettmer: Old books reborn as art,1,1970-01-01 00:00:01.423238442,"[{'id': 1, 'name': 'Beautiful', 'count': 361},...","[{'id': 610, 'hero': 'https://pe.tedcdn.com/im...",,"['TEDYouth', 'art', 'books', 'creativity']",Old books reborn as art,https://www.ted.com/talks/brian_dettmer_old_bo...,1159937
1949,70,Photographer Boniface Mwangi wanted to protest...,7.333333,TEDGlobal 2014,1970-01-01 00:00:01.413763200,33,Boniface Mwangi,Boniface Mwangi: The day I stood up alone,1,1970-01-01 00:00:01.427989423,"[{'id': 3, 'name': 'Courageous', 'count': 614}...","[{'id': 1757, 'hero': 'https://pe.tedcdn.com/i...",,"['TED Fellows', 'activism', 'art', 'corruption...",The day I stood up alone,https://www.ted.com/talks/boniface_mwangi_boni...,1342431


In [14]:
# Impute missing values with empty string
ted.speaker_occupation.fillna('', inplace=True)

In [15]:
ted.isnull().sum()

comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    0
tags                  0
title                 0
url                   0
views                 0
dtype: int64

#### Add Talk ID column

There are no ID's for the talks, so we will create one to be used in modeling later.

In [16]:
ted['talk_id'] = range(1, len(ted)+1)

In [17]:
ted.columns

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views', 'talk_id'],
      dtype='object')

### Ratings Column Cleaning and Preprocessing

As we have seen earlier, the ratings column include categories for the talks. Exploring further, we find that, for each talk, there are a few dictionaries which in turn contain: 
* ID, which corresponds to the name of the rating for the talk
* Name, which is the name of the rating, which is consistent across all talks
* Count, which is the number of votes received for each type of rating within the talk itself

We will convert the names into variables and add the count into the names column in a separate dataframe called ratings.

In [18]:
# Check the data structure of ratings
ted.ratings[0]

"[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

Each observation is a string even thought it looks like dictionaries nested in a list. So we need to convert it into a list and dictionary before we can use the information within.

In [19]:
ratings = defaultdict(list)

# # Convert each row in the ratings column to list and dictionary
for index, row in ted.iterrows():
    rating = ast.literal_eval(row['ratings'])
    names = set()
    
    for item in rating:
        ratings[item['name']].append(item['count'])
        names.add(item['name'])

ratings = pd.DataFrame(ratings)
ratings.head()

Unnamed: 0,Funny,Beautiful,Ingenious,Courageous,Longwinded,Confusing,Informative,Fascinating,Unconvincing,Persuasive,Jaw-dropping,OK,Obnoxious,Inspiring
0,19645,4573,6073,3253,387,242,7346,10581,300,10704,4439,1174,209,24924
1,544,58,56,139,113,62,443,132,258,268,116,203,131,413
2,964,60,183,45,78,27,395,166,104,230,54,146,142,230
3,59,291,105,760,53,32,380,132,36,460,230,85,35,1070
4,1390,942,3202,318,110,72,5433,4606,67,2542,3736,248,61,2893


In [20]:
# Lowercase all columns and change hyphen to underscore
ratings.columns = [x.lower().replace('-','_') for x in ratings.columns]
ratings.columns

Index(['funny', 'beautiful', 'ingenious', 'courageous', 'longwinded',
       'confusing', 'informative', 'fascinating', 'unconvincing', 'persuasive',
       'jaw_dropping', 'ok', 'obnoxious', 'inspiring'],
      dtype='object')

For the purpose of EDA, we can make another column summing up the number of votes the talk received in total. 

Two other columns will be made for rating types that are considered positive and negative:
* Positive: 'funny', 'beautiful', 'ingenious', 'courageous', 'informative', 'fascinating', 'persuasive', 'jaw-dropping', 'inspiring'
* Negative: 'longwinded', 'confusing', 'unconvincing', 'obnoxious'


In [21]:
ratings['total'] = ratings.sum(axis=1)
# ratings = ratings.sort_values('total', ascending=False)

In [22]:
ratings['positive'] = (ratings.funny + ratings.beautiful + ratings.ingenious + ratings.courageous 
                       + ratings.informative + ratings.fascinating + ratings.persuasive + ratings.jaw_dropping 
                       + ratings.inspiring)
ratings['negative'] = ratings.longwinded + ratings.confusing + ratings.unconvincing + ratings.obnoxious

In [23]:
ratings.head()

Unnamed: 0,funny,beautiful,ingenious,courageous,longwinded,confusing,informative,fascinating,unconvincing,persuasive,jaw_dropping,ok,obnoxious,inspiring,total,positive,negative
0,19645,4573,6073,3253,387,242,7346,10581,300,10704,4439,1174,209,24924,93850,91538,1138
1,544,58,56,139,113,62,443,132,258,268,116,203,131,413,2936,2169,564
2,964,60,183,45,78,27,395,166,104,230,54,146,142,230,2824,2327,351
3,59,291,105,760,53,32,380,132,36,460,230,85,35,1070,3728,3487,156
4,1390,942,3202,318,110,72,5433,4606,67,2542,3736,248,61,2893,25620,25062,310


In [24]:
ratings.columns[1]

'beautiful'

In [25]:
# Add talk_id to the ratings dataframe
ratings['talk_id'] = ted['talk_id']
ratings.columns

Index(['funny', 'beautiful', 'ingenious', 'courageous', 'longwinded',
       'confusing', 'informative', 'fascinating', 'unconvincing', 'persuasive',
       'jaw_dropping', 'ok', 'obnoxious', 'inspiring', 'total', 'positive',
       'negative', 'talk_id'],
      dtype='object')

In [26]:
ratings.head()

Unnamed: 0,funny,beautiful,ingenious,courageous,longwinded,confusing,informative,fascinating,unconvincing,persuasive,jaw_dropping,ok,obnoxious,inspiring,total,positive,negative,talk_id
0,19645,4573,6073,3253,387,242,7346,10581,300,10704,4439,1174,209,24924,93850,91538,1138,1
1,544,58,56,139,113,62,443,132,258,268,116,203,131,413,2936,2169,564,2
2,964,60,183,45,78,27,395,166,104,230,54,146,142,230,2824,2327,351,3
3,59,291,105,760,53,32,380,132,36,460,230,85,35,1070,3728,3487,156,4
4,1390,942,3202,318,110,72,5433,4606,67,2542,3736,248,61,2893,25620,25062,310,5


In [27]:
# Select columns from ratings for modeling
ratings_select = ratings[['persuasive', 'inspiring', 'unconvincing', 'talk_id']]

# Merge the main TED Talks dataframe with the ratings dataframe
ted_model = ted.merge(ratings_select, how = 'left', on = ['talk_id'])

In [28]:
# To ensure that the persuasiveness is not skewed by the number of views,
# we divide the selected ratings columns by the number of views.

ted_model['norm_persuasive'] = ted_model['persuasive'] / ted_model['views']
ted_model['norm_inspiring'] = ted_model['inspiring'] / ted_model['views']
ted_model['norm_unconvincing'] = ted_model['unconvincing'] / ted_model['views']

In [29]:
ted_model.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,title,url,views,talk_id,persuasive,inspiring,unconvincing,norm_persuasive,norm_inspiring,norm_unconvincing
0,4553,Sir Ken Robinson makes an entertaining and pro...,19.4,TED2006,1970-01-01 00:00:01.140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1970-01-01 00:00:01.151367060,...,Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,1,10704,24924,300,0.000227,0.000528,6e-06
1,265,With the same humor and humanity he exuded in ...,16.283333,TED2006,1970-01-01 00:00:01.140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1970-01-01 00:00:01.151367060,...,Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2,268,413,258,8.4e-05,0.000129,8.1e-05
2,124,New York Times columnist David Pogue takes aim...,21.433333,TED2006,1970-01-01 00:00:01.140739200,26,David Pogue,David Pogue: Simplicity sells,1,1970-01-01 00:00:01.151367060,...,Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,3,230,230,104,0.000141,0.000141,6.4e-05
3,200,"In an emotionally charged talk, MacArthur-winn...",18.6,TED2006,1970-01-01 00:00:01.140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1970-01-01 00:00:01.151367060,...,Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,4,460,1070,36,0.000271,0.00063,2.1e-05
4,593,You've never seen data presented like this. Wi...,19.833333,TED2006,1970-01-01 00:00:01.140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1970-01-01 00:00:01.151440680,...,The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,5,2542,2893,67,0.000212,0.000241,6e-06


### Speaker Occupation Column Cleaning and Preprocessing

We saw earlier that there is a character '/' that should not be there. We need to check if there are other characters in other rows. 

In [30]:
# Visual check to see if the separation characters is because of the number of speakers or 
# because the speaker has two occupations
ted[['speaker_occupation', "num_speaker"]].head(10)

Unnamed: 0,speaker_occupation,num_speaker
0,Author/educator,1
1,Climate advocate,1
2,Technology columnist,1
3,Activist for environmental justice,1
4,Global health expert; data visionary,1
5,Life coach; expert in leadership psychology,1
6,"Actor, comedian, playwright",1
7,Architect,1
8,"Philosopher, cognitive scientist",1
9,"Pastor, author",1


We can see that these speakers have two or more occupations that are separated by '/', ',', ';'. There could be other characters that we did not observe from these first 10 observations. So we have to do a search for them.

In [31]:
prob_chars = re.compile(r'[=\+/&<>;\"\-\?%#$@\,\t\r\n]| and ')

problems_occupation = defaultdict(list)

for index, row in ted.iterrows():
    occupation = row['speaker_occupation']
    char = prob_chars.search(occupation)
    if char:
        chars = char.group()
        problems_occupation[chars].append(occupation)
        
print('Characters found in speaker occupation column:', problems_occupation.keys())

Characters found in speaker occupation column: dict_keys(['/', ';', ',', '-', ' and ', '+'])


In [32]:
# Checking the first twenty occupations that have '/'
problems_occupation['/'][:20]

['Author/educator',
 'Singer/songwriter',
 'Singer/songwriter',
 'Singer/songwriter',
 'Singer/songwriter',
 'Singer/songwriter',
 'Singer/songwriter',
 'Singer/songwriter',
 'Singer/songwriter',
 'Singer/songwriter',
 'Author/educator',
 'HIV/AIDS fighter',
 'Author/educator',
 '9/11 mothers',
 'Author/illustrator',
 'Author/educator',
 'Author/illustrator',
 'Director/choreographer, dancer']

For most of the occupations listed, the '/' character separates two different occupations with the exception of 'HIV/AIDS fighter' and '9/11 mothers'. These two should not be separated. The rest can be separated later.

In [33]:
# Checking the first twenty occupations that have ','
problems_occupation[';'][:20]

['Global health expert; data visionary',
 'Life coach; expert in leadership psychology',
 'Blogger; cofounder, Six Apart',
 'Psychologist; happiness expert',
 'Mathematician; statistician',
 'Primatologist; environmentalist',
 'Designer; creative director, Ideo',
 'Cellist; singer-songwriter',
 'Interaction designer; software developer',
 'Global health expert; data visionary',
 'Primatologist; environmentalist',
 'Psychologist; happiness expert',
 'Global health expert; data visionary',
 'Global health expert; data visionary',
 'Global health expert; data visionary',
 'Global health expert; data visionary',
 'Global health expert; data visionary',
 'Global health expert; data visionary',
 'Global health expert; data visionary',
 'Psychologist; happiness expert']

In [34]:
# Checking the first thirty occupations that have ','
problems_occupation[','][:30]

['Actor, comedian, playwright',
 'Philosopher, cognitive scientist',
 'Pastor, author',
 'Epidemiologist, philanthropist',
 'Pianist, composer',
 'inventor, engineer',
 'Humorist, web artist',
 'Anthropologist, expert on love',
 'Playwright, activist',
 'Founder, GrameenPhone',
 'Musician, activist',
 'Inventor, futurist',
 'Musician, activist',
 'Marketer, success analyst',
 'Performance poet, multimedia artist',
 'Physician, author',
 'Anthropologist, ethnobotanist',
 'Journalist, philosopher',
 'Actor, playwright, social critic',
 'Physicist, personal fab pioneer',
 'Biologist, Nobel laureate',
 'Singer, performance artist',
 'Science writer, innovation consultant, conservationist',
 'Biologist, genetics pioneer',
 'Biologist, biomechanics researcher',
 'Philosopher, cognitive scientist',
 'Performance poet, multimedia artist',
 'Computer scientist, entrepreneur and philanthropist',
 'Designer, educator',
 'Environmentalist, futurist']

Some of the occupations with comma separating them contain the title of the speaker and the company name e.g. 'Founder, GrameenPhone'. Splitting these would not make sense.

We also see that the separator 'and' can also be used to split the occupations.

In [35]:
# Checking the first twenty occupations that have '-'
problems_occupation['-'][:20]

['Co-founder, Architecture for Humanity',
 'Human-computer interface designer',
 'President-elect of Afghanistan',
 'Experimental audio-visual artist',
 'Assumption-busting economist',
 'Singer-songwriter',
 'Human-computer interaction researcher',
 'Singer-songwriter',
 'Close-up card magician',
 'World-builder',
 'Twitter co-founder',
 'Philosopher-comic',
 'Sustainable-business pioneer',
 'Experimental audio-visual artist',
 'Micro-sculptor',
 'Hip-hop artist',
 'Co-founder, Architecture for Humanity',
 'Anti-trafficking crusader',
 'X-ray visionary',
 'Anti-slavery activist']

The hyphens are not used to separate occupations so we can keep it as it is.

In [36]:
# Checking the first twenty occupations that have '+'
problems_occupation['+'][:20]

['Neuroscience PhD student + writer',
 'Satellite archaeologist + TED Prize winner',
 'Vagabond photojournalist + conceptual artist',
 'Architect + ecotourism specialist',
 'Science Historian + Writer',
 'Photographer + storyteller',
 'Comedian + Designer',
 'photographer + visual artist',
 'Vagabond photojournalist + conceptual artist',
 'Chaplain + author',
 'Mother + ALS Advocate',
 'Attorney + privacy advocate',
 'Graffiti artist + activist',
 'Entrepreneur + educator',
 'Satellite archaeologist + TED Prize winner',
 'Satellite archaeologist + TED Prize winner']

We can separate the occupations by '+' because they seem to be distinct occupations.

In [37]:
mult_occupation = re.compile(r'\/|\,|\;|\+| and ')
# end_issue = re.compile(r' \.\.\.')
occupations = defaultdict(list)
ignore_list = ['HIV/AIDS fighter','9/11 mothers','Founder, GrameenPhone',
                     'Co-founder, Architecture for Humanity','cofounder, Six Apart',
                     'CEO, Public Radio International (PRI)', 'Director of photography, National Geographic',
                     'Director of research, Samsung Research America', 'Founder, Transparency International',
                     'Founder, 4chan', 'Creator, The 99','COO, Facebook',
                     'Director, The Institute for Global Happiness', 'Executive chair, Ford Motor Co.',
                     'CEO, Kiva Systems','CEO, presentation designer', 'Founder, Doha Film Institute',
                     'Senior Editor, TIME Magazine', 'Artist, designer',
                     'Professor of Economics, University of Waterloo','Cofounder, Incredible Edible',
                     'COO, Mozilla Foundation', 'Psychologist, Disgust researcher', 'CEO, Team Rubicon',
                     'CEO, Lumos', 'COO, Unilever', 'COO, Facebook', 'Deputy director, NSA', 
                     'Research scientist, Google', 'Human Resources Manager, UPS', 
                     'Chief of the Community Partnership Division, Baltimore Police Department',
                     'Director of photography, Pixar', 'Campaign leader, Global Witness',
                     'CEO, Gates Foundation','Founder, Gravity','President, World Bank Group']

for index, row in ted.iterrows():
    occupation = row['speaker_occupation']
    problem_found = False
    if mult_occupation.search(occupation):
        problem_found = True
    if problem_found & (occupation not in ignore_list):
        occupation = re.split('\/|\,|\;|\+| and ', occupation)
        for item in occupation:
            occupations['talk_id'].append(row['talk_id'])
#             if end_issue.search(item):
#                 item = item.strip(' ...')
            occupations['speaker_occupation'].append(item.strip().lower())
    # All strings were converted to lowercase in order to avoid the same word in different formats.
    else:
        occupations['talk_id'].append(row['talk_id'])
        occupations['speaker_occupation'].append(occupation.lower())

occupations = pd.DataFrame(occupations)

In [38]:
occupations.head()

Unnamed: 0,talk_id,speaker_occupation
0,1,author
1,1,educator
2,2,climate advocate
3,3,technology columnist
4,4,activist for environmental justice


### Tags Data Preprocessing

In [39]:
tags = defaultdict(list)
for index, row in ted.iterrows():
    themes = ast.literal_eval(row['tags'])
    for item in themes:
        tags['talk_id'].append(row['talk_id'])
        tags['tags'].append(item)

tags = pd.DataFrame(tags)

print ('Number of tags: ', len(tags))
print ('Number of unique tags: ', len(tags['tags'].unique()))
tags.head()

Number of tags:  19154
Number of unique tags:  416


Unnamed: 0,talk_id,tags
0,1,children
1,1,creativity
2,1,culture
3,1,dance
4,1,education


## Data Cleaning of Transcripts

In [40]:
transcripts = pd.read_csv('../data/transcripts.csv')

In [41]:
transcripts.head()

Unnamed: 0,transcript,url
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...
3,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...


In [42]:
# Check one of the transcripts
transcripts.transcript[2]



There are things like background music and audience responses (e.g. Applause, Laughter), within parentheses, which should be removed so that we can focus on the speech itself.

In [43]:
# Create function to remove annotations
def remove_parenthetical(x):
    return re.sub("([\(\[]).*?([\)\]])", "\g<1>\g<2>", x)

transcripts['transcript'] = transcripts['transcript'].apply(remove_parenthetical)

In [44]:
# Check the same transcript to see if the annotations have been removed.
transcripts.transcript[2]



In [45]:
# Merge the main TED Talks dataframe with the transcripts dataframe
ted_model = ted_model.merge(transcripts, how = 'left', on = ['url'])

In [46]:
ted_model.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,url,views,talk_id,persuasive,inspiring,unconvincing,norm_persuasive,norm_inspiring,norm_unconvincing,transcript
0,4553,Sir Ken Robinson makes an entertaining and pro...,19.4,TED2006,1970-01-01 00:00:01.140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1970-01-01 00:00:01.151367060,...,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,1,10704,24924,300,0.000227,0.000528,6e-06,"Good morning. How are you?()It's been great, h..."
1,265,With the same humor and humanity he exuded in ...,16.283333,TED2006,1970-01-01 00:00:01.140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1970-01-01 00:00:01.151367060,...,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2,268,413,258,8.4e-05,0.000129,8.1e-05,"Thank you so much, Chris. And it's truly a gre..."
2,124,New York Times columnist David Pogue takes aim...,21.433333,TED2006,1970-01-01 00:00:01.140739200,26,David Pogue,David Pogue: Simplicity sells,1,1970-01-01 00:00:01.151367060,...,https://www.ted.com/talks/david_pogue_says_sim...,1636292,3,230,230,104,0.000141,0.000141,6.4e-05,"()Hello voice mail, my old friend.()I've calle..."
3,200,"In an emotionally charged talk, MacArthur-winn...",18.6,TED2006,1970-01-01 00:00:01.140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1970-01-01 00:00:01.151367060,...,https://www.ted.com/talks/majora_carter_s_tale...,1697550,4,460,1070,36,0.000271,0.00063,2.1e-05,If you're here today — and I'm very happy that...
4,593,You've never seen data presented like this. Wi...,19.833333,TED2006,1970-01-01 00:00:01.140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1970-01-01 00:00:01.151440680,...,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,5,2542,2893,67,0.000212,0.000241,6e-06,"About 10 years ago, I took on the task to teac..."


## Data Export

In [47]:
# Save to data folder
ted.to_csv('../data/ted_cleaned.csv', index=False)
ratings.to_csv('../data/ratings.csv', index=False)
occupations.to_csv('../data/occupations.csv', index=False)
tags.to_csv('../data/tags.csv', index=False)
transcripts.to_csv('../data/transcripts_cleaned.csv', index=False)
ted_model.to_csv('../data/ted_model.csv', index=False)