# Team 4 (Secati) - TED Talks

![](https://psmarketingimages.s3.amazonaws.com/blog/wp-content/uploads/2017/04/23103819/TED-Talks-for-Small-Business-and-Entrepreneurs.jpg)


## What is our Data about?

These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 21st, 2017. The TED main dataset contains information about all talks including number of views, number of comments, descriptions, speakers and titles. The TED transcripts dataset contains the transcripts for all talks available on TED.com.

### A lot goes into researching and creating a TED Talk.
Most TED Talks are edited, lightly but carefully. TED typically removes the first few sentences of warm-up chatter and excessive ums and uhs, but the edits won't distort the speaker's meaning.
It takes one of their pro video editors about a full day to edit an 18-minute TED Talk.
Almost every TED Talk hosted on TED.com has full subtitles and a snazzy clickable time-coded transcript.
While some of your favorite TED Talks were shot with multiple cameras — up to nine — others are filmed very simply. Next time you watch, count the different shots.

### Understanding the data:
- comments (int): The number of first level comments made on the talk
- description (str): A blurb of what the talk is about
- duration (int): The duration of the talk in seconds
- event (str): The TED/TEDx event where the talk took place
- film_date (int): The Unix timestamp of the filming
- languages (int): The number of languages in which the talk is available
- main_speaker (str): The first named speaker of the talk
- name (str): The official name of the TED Talk. Includes the title and the speaker.
- num_speaker (int): The number of speakers in the talk
- published_date (int): The Unix timestamp for the publication of the talk on TED.com
- ratings (str): A stringified list of dictionaries holding the various ratings given to the talk (inspiring, fascinating, jaw dropping, etc.)
- related_talks (str): A list of dictionaries of recommended talks to watch next
- speaker_occupation (str): The occupation of the main speaker
- tags (str): The themes associated with the talk
- title (str): The title of the talk
- url (str): The URL of the talk
- views (int): The number of views on the talk
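
Two of these fields need decoding before they can be analysed: the Unix timestamps and the stringified `ratings`. The following is a minimal sketch using only the standard library. The sample values are copied from the first row of the dataset preview shown in Step 1, with the `ratings` string shortened to the single entry visible there, and the helper name `parse_ratings` is hypothetical.

```python
import ast
from datetime import datetime, timezone

# Sample values from the first preview row; the ratings string is
# shortened to one entry for illustration.
film_date = 1140825600
ratings_raw = "[{'id': 7, 'name': 'Funny', 'count': 19645}]"


def parse_ratings(raw):
    """Turn the stringified ratings list into a {name: count} dict."""
    # literal_eval only evaluates Python literals, so it is safer than eval
    return {item['name']: item['count'] for item in ast.literal_eval(raw)}


# Interpret the timestamp as seconds since the Unix epoch, in UTC
filmed = datetime.fromtimestamp(film_date, tz=timezone.utc)
ratings = parse_ratings(ratings_raw)

print(filmed.date())
print(ratings)
```

On the full data frame the same idea would likely be applied column-wise, for example with `pd.to_datetime(df['film_date'], unit='s')` and `df['ratings'].apply(parse_ratings)`.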

## Audience


1. Media Head
2. TED-talk Organizers
3. Event Organizers


## Usage for the Audience
- Create/explore topics that have not been covered yet in TED Talks
- Understand how to increase the number of views, comments, and ratings of TED Talks
- Understand which place, time, duration, and event suit a TED Talk best
- Predict the views of future TED Talks (to understand how much should be invested in them)

## Expectations
+ Build a content recommendation for TED
  - Create a vector representation of each description
  - Create a similarity matrix for the vector representation created above
  - For each talk, based on some similarity metric, select the 4 most similar talks
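
The three steps above can be sketched as follows. This is only one possible reading of the plan: it assumes TF-IDF as the vector representation and cosine similarity as the metric, neither of which is fixed by the plan itself, and it is hand-rolled with the standard library rather than using a vectorizer from a machine learning library. The toy descriptions and the helper names (`tfidf_vectors`, `cosine`, `most_similar`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy descriptions standing in for the `description` column
descriptions = [
    "climate change and clean energy for the planet",
    "clean energy cars and the climate crisis",
    "how schools kill creativity in children",
    "creativity and education for children in schools",
    "music performance with a string quartet",
]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (plain TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(index, similarity, k=4):
    """Indices of the k talks most similar to talk `index`, excluding itself."""
    ranked = sorted(
        (j for j in range(len(similarity)) if j != index),
        key=lambda j: similarity[index][j],
        reverse=True,
    )
    return ranked[:k]


vectors = tfidf_vectors(descriptions)
# Full pairwise similarity matrix as a list of lists
similarity = [[cosine(a, b) for b in vectors] for a in vectors]
print(most_similar(0, similarity, k=2))
```

With 2,550 talks a dense pairwise matrix is still small enough to hold in memory, so the same structure should carry over to the real descriptions.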

+ View Prediction:

  We measure our results based on views, comments, and positive ratings.

  - Views: What gets people to hear an idea?
  - Positive Ratings: What makes people react positively to the idea?
  - Comments: What kinds of ideas produce discussions?
  - We can also apply one-hot encoding to the categorical attributes to prepare the data for training machine learning models, then print the dimensions of the final dataset.
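
As a rough illustration of the one-hot encoding step, the sketch below uses `pd.get_dummies` on a hypothetical toy frame. The choice of `event` as the categorical attribute is an assumption made for the example; the plan does not say which columns will be encoded.

```python
import pandas as pd

# Hypothetical toy frame; `event` stands in for any categorical attribute
toy = pd.DataFrame({
    'event': ['TED2006', 'TEDGlobal 2007', 'TED2006'],
    'duration': [1164, 299, 977],
})

# get_dummies replaces each listed column with one indicator column per category
encoded = pd.get_dummies(toy, columns=['event'])

print(encoded.columns.tolist())
print(encoded.shape)  # dimensions of the final dataset
```

A column with many distinct values, such as `speaker_occupation`, would produce one indicator column per value, so grouping rare categories first may be worth considering.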


+ Data analysis:
  - Which videos have the most/fewest views?
  - Which videos have the best/worst ratings?
  - How do the positive/negative rating indexes relate to the quality of the talks?
  - Which events have the most talks?
  - Which talks provoke the most online discussion?
  - What were the "best" events in TED history to attend?
  - Which occupations deliver the funniest TED talks on average?
  - Central tendencies of Views & Comments
  - Correlation between Views & Comments per talk
  - Correlation between Views & Languages per talk
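
The last three questions can be answered with a few pandas calls. The sketch below assumes Pearson correlation (the pandas default) and uses only the five preview rows shown in Step 1, so the resulting numbers are illustrative and say nothing about the full dataset.

```python
import pandas as pd

# Values copied from the five rows of the dataset preview
toy = pd.DataFrame({
    'views': [47227110, 3200520, 1636292, 1697550, 12005869],
    'comments': [4553, 265, 124, 200, 593],
    'languages': [60, 43, 26, 35, 48],
})

# Central tendency: mean and median side by side
central = toy[['views', 'comments']].agg(['mean', 'median'])

# Pearson correlation coefficients between pairs of columns
views_vs_comments = toy['views'].corr(toy['comments'])
views_vs_languages = toy['views'].corr(toy['languages'])

print(central)
print(views_vs_comments, views_vs_languages)
```

Because views and comments are heavily skewed, comparing mean and median, or passing `method='spearman'` to `corr`, may give a more robust picture than the Pearson coefficient alone.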


## Step 0: Setup the Environment

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime as dt
import regex as re
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")
pd.set_option('display.max_rows', 100)

In [119]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Step 1: Read Data

### Create the main pandas data frame

In [0]:
df = pd.read_csv('gdrive/My Drive/FTMLE - Tonga/Week_3/assignments/datasets/04-ted-talks/ted.csv')

### Overview

In [121]:
# Show a summary of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB


In [122]:
df.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


In [123]:
df.columns

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views'],
      dtype='object')

In [124]:
df.shape

(2550, 17)

## Step 2: Cleaning the data

In [125]:
# First, take a brief look at the data frame

df.describe()

Unnamed: 0,comments,duration,film_date,languages,num_speaker,published_date,views
count,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0
mean,191.562353,826.510196,1321928000.0,27.326275,1.028235,1343525000.0,1698297.0
std,282.315223,374.009138,119739100.0,9.563452,0.207705,94640090.0,2498479.0
min,2.0,135.0,74649600.0,0.0,1.0,1151367000.0,50443.0
25%,63.0,577.0,1257466000.0,23.0,1.0,1268463000.0,755792.8
50%,118.0,848.0,1333238000.0,28.0,1.0,1340935000.0,1124524.0
75%,221.75,1046.75,1412964000.0,33.0,1.0,1423432000.0,1700760.0
max,6404.0,5256.0,1503792000.0,72.0,5.0,1506092000.0,47227110.0


### Set Index

In [126]:
# An ID must be unique,
# so the number of unique values must equal the number of rows
# 2 methods:
# df['column'].nunique() == df['column'].count() OR
# df['column'].nunique() == df.shape[0]

# This data frame has no ID column, so we create a 1-based integer index to make row lookups easier.
df.index = np.arange(1, len(df) + 1)
df.index.name = 'ID'
df.head()

ID,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
1,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
2,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
3,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
4,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
5,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


### Check data duplication

In [127]:
df.nunique()

comments               559
description           2550
duration              1083
event                  355
film_date              735
languages               66
main_speaker          2156
name                  2550
num_speaker              5
published_date        2490
ratings               2550
related_talks         2550
speaker_occupation    1458
tags                  2530
title                 2550
url                   2550
views                 2550
dtype: int64

In [128]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2550 entries, 1 to 2550
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 358.6+ KB


In [129]:
# It seems that there are no duplicates at all; we can double-check just to be sure
df.drop_duplicates(subset=None, keep="first", inplace=False)

ID,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
1,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
2,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
3,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
4,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
5,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2546,17,"Between 2008 and 2016, the United States depor...",476,TED2017,1496707200,4,Duarte Geraldino,Duarte Geraldino: What we're missing in the de...,1,1505851216,"[{'id': 3, 'name': 'Courageous', 'count': 24},...","[{'id': 2596, 'hero': 'https://pe.tedcdn.com/i...",Journalist,"['TED Residency', 'United States', 'community'...",What we're missing in the debate about immigra...,https://www.ted.com/talks/duarte_geraldino_wha...,450430
2547,6,How can you study Mars without a spaceship? He...,290,TED2017,1492992000,3,Armando Azua-Bustos,Armando Azua-Bustos: The most Martian place on...,1,1505919737,"[{'id': 22, 'name': 'Fascinating', 'count': 32...","[{'id': 2491, 'hero': 'https://pe.tedcdn.com/i...",Astrobiologist,"['Mars', 'South America', 'TED Fellows', 'astr...",The most Martian place on Earth,https://www.ted.com/talks/armando_azua_bustos_...,417470
2548,10,Science fiction visions of the future show us ...,651,TED2017,1492992000,1,Radhika Nagpal,Radhika Nagpal: What intelligent machines can ...,1,1506006095,"[{'id': 1, 'name': 'Beautiful', 'count': 14}, ...","[{'id': 2346, 'hero': 'https://pe.tedcdn.com/i...",Robotics engineer,"['AI', 'ants', 'fish', 'future', 'innovation',...",What intelligent machines can learn from a sch...,https://www.ted.com/talks/radhika_nagpal_what_...,375647
2549,32,In an unmissable talk about race and politics ...,1100,TEDxMileHigh,1499472000,1,Theo E.J. Wilson,Theo E.J. Wilson: A black man goes undercover ...,1,1506024042,"[{'id': 11, 'name': 'Longwinded', 'count': 3},...","[{'id': 2512, 'hero': 'https://pe.tedcdn.com/i...",Public intellectual,"['Internet', 'TEDx', 'United States', 'communi...",A black man goes undercover in the alt-right,https://www.ted.com/talks/theo_e_j_wilson_a_bl...,419309
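
With `inplace=False`, `drop_duplicates` returns a de-duplicated copy, so the check above depends on comparing row counts by eye. A more direct check counts fully duplicated rows with `duplicated()`. The sketch below uses a hypothetical toy frame in which the third row repeats the first.

```python
import pandas as pd

# Hypothetical toy frame: the third row is an exact copy of the first
toy = pd.DataFrame({
    'title': ['Simplicity sells', 'Greening the ghetto', 'Simplicity sells'],
    'views': [1636292, 1697550, 1636292],
})

# duplicated() marks every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new frame and leaves `toy` unchanged
deduplicated = toy.drop_duplicates(keep='first')

print(n_duplicates, len(toy), len(deduplicated))
```

Applied to the real data, a result of `0` from `df.duplicated().sum()` would confirm that no row needs to be dropped.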


### Drop unnecessary columns

In [130]:
# The url, related_talks and name columns are not needed for our analysis
df.drop(columns=['url', 'name', 'related_talks'], inplace = True)
df.head()

ID,comments,description,duration,event,film_date,languages,main_speaker,num_speaker,published_date,ratings,speaker_occupation,tags,title,views
1,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,47227110
2,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,3200520
3,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,1636292
4,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,1697550
5,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,12005869


### Reorder for the sake of Readability

In [131]:
# For the reader's convenience (and my OCD)
df = df[['title', 'description', 'main_speaker', 'speaker_occupation', 'num_speaker', 'duration', 'event', 'film_date', 'published_date', 'views', 'comments', 'tags', 'languages', 'ratings']]
df.head()

ID,title,description,main_speaker,speaker_occupation,num_speaker,duration,event,film_date,published_date,views,comments,tags,languages,ratings
1,Do schools kill creativity?,Sir Ken Robinson makes an entertaining and pro...,Ken Robinson,Author/educator,1,1164,TED2006,1140825600,1151367060,47227110,4553,"['children', 'creativity', 'culture', 'dance',...",60,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {..."
2,Averting the climate crisis,With the same humor and humanity he exuded in ...,Al Gore,Climate advocate,1,977,TED2006,1140825600,1151367060,3200520,265,"['alternative energy', 'cars', 'climate change...",43,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i..."
3,Simplicity sells,New York Times columnist David Pogue takes aim...,David Pogue,Technology columnist,1,1286,TED2006,1140739200,1151367060,1636292,124,"['computers', 'entertainment', 'interface desi...",26,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i..."
4,Greening the ghetto,"In an emotionally charged talk, MacArthur-winn...",Majora Carter,Activist for environmental justice,1,1116,TED2006,1140912000,1151367060,1697550,200,"['MacArthur grant', 'activism', 'business', 'c...",35,"[{'id': 3, 'name': 'Courageous', 'count': 760}..."
5,The best stats you've ever seen,You've never seen data presented like this. Wi...,Hans Rosling,Global health expert; data visionary,1,1190,TED2006,1140566400,1151440680,12005869,593,"['Africa', 'Asia', 'Google', 'demo', 'economic...",48,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}..."


### Handle missing data




In [132]:
# count the number of missing values in each column
# expectation after reading the info : speaker_occupation(6)

df.isna().sum()

title                 0
description           0
main_speaker          0
speaker_occupation    6
num_speaker           0
duration              0
event                 0
film_date             0
published_date        0
views                 0
comments              0
tags                  0
languages             0
ratings               0
dtype: int64

In machine learning, we need to handle missing values. There are many types of missing values:

1. Standard Missing Values: These are missing values that Pandas can detect.
2. Non-Standard Missing Values: Sometimes missing values are recorded in other formats (for example a placeholder string) that pandas does not detect on its own.
3. Unexpected Missing Values: For example, if our feature is expected to be a string, but there’s a numeric type, then technically this is also a missing value.
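
The three cases above can be illustrated on a small, hypothetical CSV. The placeholder markers `unknown` and `--`, the column names, and the rule that a purely numeric occupation counts as missing are all assumptions made for this sketch, not properties of the TED dataset.

```python
import io
import pandas as pd

# Hypothetical CSV: 'unknown' and '--' are non-standard missing markers,
# '12' is an unexpected numeric entry, and row D has a standard empty field
raw = io.StringIO(
    "speaker,occupation,languages\n"
    "A,Physicist,28\n"
    "B,unknown,--\n"
    "C,12,33\n"
    "D,,41\n"
)

# Cases 1 and 2: empty fields are detected automatically,
# extra placeholder markers are declared through na_values
toy = pd.read_csv(raw, na_values=['unknown', '--'], dtype={'occupation': str})

# Case 3: treat occupations made up only of digits as missing
numeric_like = toy['occupation'].str.fullmatch(r'\d+', na=False)
toy.loc[numeric_like, 'occupation'] = pd.NA

print(toy.isna().sum())
```

In this notebook only the standard case occurs, which is why `isna()` alone finds the six missing occupations.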

In [133]:
# View those missing values of speaker_occupation
df[df['speaker_occupation'].isnull()]

ID,title,description,main_speaker,speaker_occupation,num_speaker,duration,event,film_date,published_date,views,comments,tags,languages,ratings
1114,Meet the Water Canary,"After a crisis, how can we tell if water is sa...",Sonaar Luthra,,1,217,TEDGlobal 2011,1310601600,1326731605,353749,145,"['TED Fellows', 'design', 'global development'...",38,"[{'id': 10, 'name': 'Inspiring', 'count': 73},..."
1193,I am a pirate,"The Pirate Party fights for transparency, anon...",Rick Falkvinge,,1,1097,TEDxObserver,1331424000,1333289675,181010,122,"['Internet', 'TEDx', 'global issues', 'politic...",10,"[{'id': 8, 'name': 'Informative', 'count': 156..."
1221,Tracking our online trackers,"As you surf the Web, information is being coll...",Gary Kovacs,,1,399,TED2012,1330473600,1336057219,2098639,257,"['Internet', 'advertising', 'business', 'priva...",32,"[{'id': 23, 'name': 'Jaw-dropping', 'count': 9..."
1657,To hear this music you have to be there. Liter...,"In this lovely talk, TED Fellow Ryan Holladay ...",Ryan Holladay,,1,389,TED@BCG San Francisco,1383091200,1389369735,1284510,140,"['TED Fellows', 'entertainment', 'music', 'tec...",33,"[{'id': 1, 'name': 'Beautiful', 'count': 211},..."
1912,Old books reborn as art,What do you do with an outdated encyclopedia i...,Brian Dettmer,,1,366,TEDYouth 2014,1415059200,1423238442,1159937,48,"['TEDYouth', 'art', 'books', 'creativity']",34,"[{'id': 1, 'name': 'Beautiful', 'count': 361},..."
1950,The day I stood up alone,Photographer Boniface Mwangi wanted to protest...,Boniface Mwangi,,1,440,TEDGlobal 2014,1413763200,1427989423,1342431,70,"['TED Fellows', 'activism', 'art', 'corruption...",33,"[{'id': 3, 'name': 'Courageous', 'count': 614}..."


In [134]:
# Fill in those missing values with a default 'Other' value
# (assign the result back instead of calling fillna in place on a column selection)
df['speaker_occupation'] = df['speaker_occupation'].fillna('Other')
df.isna().sum()

title                 0
description           0
main_speaker          0
speaker_occupation    0
num_speaker           0
duration              0
event                 0
film_date             0
published_date        0
views                 0
comments              0
tags                  0
languages             0
ratings               0
dtype: int64

In [135]:
# Check again if there is any negative value
df.describe()

Unnamed: 0,num_speaker,duration,film_date,published_date,views,comments,languages
count,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0
mean,1.028235,826.510196,1321928000.0,1343525000.0,1698297.0,191.562353,27.326275
std,0.207705,374.009138,119739100.0,94640090.0,2498479.0,282.315223,9.563452
min,1.0,135.0,74649600.0,1151367000.0,50443.0,2.0,0.0
25%,1.0,577.0,1257466000.0,1268463000.0,755792.8,63.0,23.0
50%,1.0,848.0,1333238000.0,1340935000.0,1124524.0,118.0,28.0
75%,1.0,1046.75,1412964000.0,1423432000.0,1700760.0,221.75,33.0
max,5.0,5256.0,1503792000.0,1506092000.0,47227110.0,6404.0,72.0


### Handle errors

In [136]:
df.sample(10)

ID,title,description,main_speaker,speaker_occupation,num_speaker,duration,event,film_date,published_date,views,comments,tags,languages,ratings
2360,Meet the inventor of the electronic spreadsheet,Dan Bricklin changed the world forever when he...,Dan Bricklin,Software pioneer,1,720,TEDxBeaconStreet,1479513600,1484149970,992523,18,"['TEDx', 'business', 'code', 'collaboration', ...",19,"[{'id': 8, 'name': 'Informative', 'count': 266..."
865,How to make work-life balance work,"Work-life balance, says Nigel Marsh, is too im...",Nigel Marsh,Author and marketer,1,605,TEDxSydney,1275004800,1297099320,3813685,283,"['TEDx', 'business', 'culture', 'life', 'motiv...",41,"[{'id': 7, 'name': 'Funny', 'count': 981}, {'i..."
1269,How to air-condition outdoor spaces,"During the hot summer months, watching an outd...",Wolfgang Kessling,Physicist,1,695,TEDxSummit,1334620800,1340377202,703243,61,"['TEDx', 'alternative energy', 'architecture',...",22,"[{'id': 21, 'name': 'Unconvincing', 'count': 4..."
2174,Two reasons companies fail -- and how to avoid...,Is it possible to run a company and reinvent i...,Knut Haanaes,Strategist,1,638,TED@BCG London,1435622400,1459436743,1743987,23,"['business', 'creativity', 'curiosity', 'goal-...",34,"[{'id': 3, 'name': 'Courageous', 'count': 63},..."
1628,Ecology from the air,What are our forests really made of? From the ...,Greg Asner,Airborne ecologist,1,830,TEDGlobal 2013,1370995200,1384876854,692579,177,"['Africa', 'Natural resources', 'animals', 'bi...",28,"[{'id': 1, 'name': 'Beautiful', 'count': 36}, ..."
2144,Meet the dazzling flying machines of the future,"When you hear the word ""drone,"" you probably t...",Raffaello D'Andrea,Autonomous systems pioneer,1,695,TED2016,1455494400,1455897802,2560359,61,"['beauty', 'creativity', 'demo', 'design', 'dr...",26,"[{'id': 8, 'name': 'Informative', 'count': 468..."
1547,The voice of the natural world,Bernie Krause has been recording wild soundsca...,Bernie Krause,Natural sounds expert,1,888,TEDGlobal 2013,1370995200,1373900562,1022050,186,"['animals', 'nature', 'sound']",28,"[{'id': 10, 'name': 'Inspiring', 'count': 262}..."
500,The music of a war child,"For five years, young Emmanuel Jal fought as a...",Emmanuel Jal,Hip-hop artist,1,1083,TEDGlobal 2009,1248307200,1249606800,776516,210,"['entertainment', 'global issues', 'live music...",24,"[{'id': 10, 'name': 'Inspiring', 'count': 550}..."
1006,Can we make things that make themselves?,MIT researcher Skylar Tibbits works on self-as...,Skylar Tibbits,Inventor,1,364,TED2011,1298505600,1314890942,983929,156,"['TED Fellows', 'design', 'technology']",34,"[{'id': 10, 'name': 'Inspiring', 'count': 141}..."
1308,Is life really that complex?,Can an algorithm forecast the site of the next...,Hannah Fry,Complexity theorist,1,602,TEDxUCL,1338681600,1344088637,353350,148,"['TEDx', 'algorithm', 'anthropology', 'behavio...",0,"[{'id': 24, 'name': 'Persuasive', 'count': 108..."


In [137]:
df[df['languages'] == 0]

ID,title,description,main_speaker,speaker_occupation,num_speaker,duration,event,film_date,published_date,views,comments,tags,languages,ratings
59,"A dance of ""Symbiosis""","Two Pilobolus dancers perform ""Symbiosis."" Doe...",Pilobolus,Dance company,1,825,TED2005,1109289600,1170979860,3051507,222,"['dance', 'entertainment', 'nature', 'performa...",0,"[{'id': 1, 'name': 'Beautiful', 'count': 1810}..."
116,"A string quartet plays ""Blue Room""",The avant-garde string quartet Ethel performs ...,Ethel,String quartet,1,214,TED2006,1138838400,1182184140,384641,27,"['cello', 'collaboration', 'culture', 'enterta...",0,"[{'id': 1, 'name': 'Beautiful', 'count': 216},..."
136,"""Woza""",After Vusi Mahlasela's 3-song set at TEDGlobal...,Vusi Mahlasela,"Musician, activist",1,299,TEDGlobal 2007,1181260800,1187695440,416603,36,"['Africa', 'entertainment', 'guitar', 'live mu...",0,"[{'id': 8, 'name': 'Informative', 'count': 4},..."
210,"""M'Bifo""","Rokia Traore sings the moving ""M'Bifo,"" accomp...",Rokia Traore,Singer-songwriter,1,419,TEDGlobal 2007,1181088000,1206580680,294936,67,"['Africa', 'entertainment', 'guitar', 'live mu...",0,"[{'id': 23, 'name': 'Jaw-dropping', 'count': 5..."
238,"""Kounandi""","Singer-songwriter Rokia Traore performs ""Kouna...",Rokia Traore,Singer-songwriter,1,386,TEDGlobal 2007,1181088000,1212627600,82488,43,"['Africa', 'guitar', 'live music', 'music', 's...",0,"[{'id': 22, 'name': 'Fascinating', 'count': 84..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1428,I think we all need a pep talk,"Kid President commands you to wake up, listen ...",Kid President,Inspirer,1,208,SoulPancake,1358985600,1359734822,828203,229,"['children', 'comedy', 'humor']",0,"[{'id': 10, 'name': 'Inspiring', 'count': 771}..."
1468,The technology of touch,"As we move through the world, we have an innat...",Katherine Kuchenbecker,Mechanical engineer,1,388,TEDYouth 2012,1353110400,1364569190,274986,183,"['TEDYouth', 'engineering', 'technology']",0,"[{'id': 9, 'name': 'Ingenious', 'count': 150},..."
1487,How much does a video weigh?,What color is a mirror? How much does a video ...,Michael Stevens,YouTube educator,1,441,TED-Ed,1362009600,1366815569,195899,126,"['Internet', 'TED-Ed', 'computers', 'humor', '...",0,"[{'id': 22, 'name': 'Fascinating', 'count': 17..."
2408,"""Turceasca""",Grammy-winning Silk Road Ensemble display thei...,Silk Road Ensemble,Musical explorers,1,389,TED2016,1455494400,1489759215,640734,5,"['art', 'live music', 'music', 'performance']",0,"[{'id': 1, 'name': 'Beautiful', 'count': 80}, ..."


We notice that some talks have a ```languages``` value of 0. This is acceptable because these talks are largely performances, so we keep them.

Next, we check the occupations of the speakers, since we noticed that some of them contain special characters.

In [138]:
df['speaker_occupation'].unique()

array(['Author/educator', 'Climate advocate', 'Technology columnist', ...,
       'Historian, philosopher', 'Astrobiologist', ' Robotics engineer'],
      dtype=object)

In [139]:
# Lowercase all jobs so that matching is case-insensitive
df['speaker_occupation'] = df['speaker_occupation'].str.lower()
df['speaker_occupation'].head()

ID
1                         author/educator
2                        climate advocate
3                    technology columnist
4      activist for environmental justice
5    global health expert; data visionary
Name: speaker_occupation, dtype: object

In [140]:
# Collect every character in speaker_occupation that is not a word character, whitespace or a comma.
lst = df['speaker_occupation'].tolist()
regex = r'[^\w\s\,]'
char_list = []
job_list = []
for job in lst:
  for char in re.findall(regex,job): 
    char_list.append(char)
    job_list.append(job)
set(char_list)

{"'", '(', ')', '+', '-', '.', '/', ';', '\xad', '’', '\ufeff'}
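Two members of this set are invisible when printed, which makes them easy to overlook. The following is a minimal sketch, using only the standard library's `unicodedata` module, of how one might name each flagged character. The `flagged` list is copied from the output above, and `describe` is a hypothetical helper name, not part of the original notebook.

```python
import unicodedata

# Characters flagged by the scan above (copied from its output).
flagged = ["'", "(", ")", "+", "-", ".", "/", ";", "\xad", "\u2019", "\ufeff"]

def describe(char):
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(char):04X}", unicodedata.name(char, "UNKNOWN")

for char in flagged:
    code_point, name = describe(char)
    print(code_point, name)
```

The soft hyphen (`\xad`) and the zero width no-break space (`\ufeff`) carry no meaning in a job title and can be deleted outright, whereas the visible characters need a case-by-case decision.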

In [141]:
# Find all the strings that contains those errors
# total 169 strings
job_list

['author/educator',
 'global health expert; data visionary',
 'life coach; expert in leadership psychology',
 'co-founder, architecture for humanity',
 'human-computer interface designer',
 'blogger; cofounder, six apart',
 'psychologist; happiness expert',
 'president-elect of afghanistan',
 'mathematician; statistician',
 'primatologist; environmentalist',
 'designer; creative director, ideo',
 'experimental audio-visual artist',
 'singer/songwriter',
 'cellist; singer-songwriter',
 'cellist; singer-songwriter',
 'singer/songwriter',
 'singer/songwriter',
 'singer/songwriter',
 'interaction designer; software developer',
 "general manager of microsoft's virtual earth",
 'global health expert; data visionary',
 'assumption-busting economist',
 'singer/songwriter',
 'singer-songwriter',
 'human-computer interaction researcher',
 'ceo, public radio international (pri)',
 'ceo, public radio international (pri)',
 'singer-songwriter',
 'singer/songwriter',
 'singer/songwriter',
 'close-up

However, we noticed that some of these special characters are legitimate in certain cases, such as '9/11 mothers', 'co-founder', etc.
Therefore we must analyze the strings case by case and pick out the specific ones that need converting.
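As an illustration of why a blanket replacement is not enough, here is a minimal sketch of a normaliser that turns the separators ';', ' + ', ' and ' and a letter-to-letter '/' into commas while leaving hyphens and '9/11' alone. The function name `normalize_occupation` and the exact pattern are assumptions made for this example. It would still split a title such as 'professor of art and design', which is why a manual review remains necessary.

```python
import re

# Separators that join two occupations; '/' only counts between two letters,
# so that '9/11 mothers' is left untouched.
SEPARATORS = r"\s*;\s*|\s+\+\s+|\s+and\s+|(?<=[a-z])/(?=[a-z])"

def normalize_occupation(job):
    """Return the job title with separators unified to ', ' (a sketch only)."""
    job = re.sub("[\ufeff\u00ad]", "", job)  # drop invisible characters
    job = re.sub(SEPARATORS, ", ", job)      # unify the separators
    return job.strip()

print(normalize_occupation("author/educator"))
print(normalize_occupation("9/11 mothers"))
print(normalize_occupation("ceo and co-founder, irelaunch"))
```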

In [142]:
wrong_job = ['author/educator',
'global health expert; data visionary',
'life coach; expert in leadership psychology',
'blogger; cofounder, six apart',
'psychologist; happiness expert',
'mathematician; statistician',
'primatologist; environmentalist',
'designer; creative director, ideo',
'singer/songwriter',
'cellist; singer-songwriter',
'interaction designer; software developer',
'singer-songwriter',
'ceo, public radio international (pri)',
'primatologist; environmentalist',
'psychologist; happiness expert',
'activist, singer-songwriter',
"berkeley bionics' ceo",
'executive chair, ford motor co.',
'neuroscience phd student + writer',
'satellite archaeologist + ted prize winner',
'author/illustrator',
'entrepreneur, animator, philanthropist ...',
'former u.s. representative and nasa astronaut; survivors',
'vagabond photojournalist + conceptual artist',
'architect + ecotourism specialist',
'science historian + writer',
'photographer + storyteller',
'comedian + designer',
'photographer + visual artist',
'vagabond photojournalist + conceptual artist',
'chaplain + author',
'mother + als advocate',
'attorney + privacy advocate',
'graffiti artist + activist',
'entrepreneur + educator',
'women’s rights activist and entrepreneur',
'director/choreographer, dancer',
'marine biologist, explorer-photographer',
'human-computer interaction researcher and designer',
'big data techno-\xadoptimist and internist',
'ceo and co-founder, irelaunch',
'writer, activist and legal analyst\ufeff',
'physician and men’s health advocate',
'satellite archaeologist + ted prize winner',
'bioelectronics innovator\ufeff',
'satellite archaeologist + ted prize winner',
"tv journalist, women's empowerment advocate"]
len(wrong_job)

47

In [143]:
# Replace all unwanted characters
right_job = [re.sub(r'\ufeff|\.\.\.', '', job) for job in wrong_job]
right_job = [re.sub(r'(\s\+|\sand|;)', ',', job) for job in right_job]
right_job = [re.sub(r'(\/|-)', ', ', job) for job in right_job]
right_job

['author, educator',
 'global health expert, data visionary',
 'life coach, expert in leadership psychology',
 'blogger, cofounder, six apart',
 'psychologist, happiness expert',
 'mathematician, statistician',
 'primatologist, environmentalist',
 'designer, creative director, ideo',
 'singer, songwriter',
 'cellist, singer, songwriter',
 'interaction designer, software developer',
 'singer, songwriter',
 'ceo, public radio international (pri)',
 'primatologist, environmentalist',
 'psychologist, happiness expert',
 'activist, singer, songwriter',
 "berkeley bionics' ceo",
 'executive chair, ford motor co.',
 'neuroscience phd student, writer',
 'satellite archaeologist, ted prize winner',
 'author, illustrator',
 'entrepreneur, animator, philanthropist ',
 'former u.s. representative, nasa astronaut, survivors',
 'vagabond photojournalist, conceptual artist',
 'architect, ecotourism specialist',
 'science historian, writer',
 'photographer, storyteller',
 'comedian, designer',
 'photograph

In [144]:
# Hard-code the cleaned result, correcting by hand the entries the regex handled poorly.
# The order must match wrong_job entry for entry.
right_job = ['author, educator',
 'global health expert, data visionary',
 'life coach, expert in leadership psychology',
 'blogger, cofounder, six apart',
 'psychologist, happiness expert',
 'mathematician, statistician',
 'primatologist, environmentalist',
 'designer, creative director, ideo',
 'singer, songwriter',
 'cellist, singer, songwriter',
 'interaction designer, software developer',
 'singer, songwriter',
 'ceo',
 'primatologist, environmentalist',
 'psychologist, happiness expert',
 'activist, singer, songwriter',
 'ceo',
 'executive chair',
 'neuroscience phd student, writer',
 'satellite archaeologist, ted prize winner',
 'author, illustrator',
 'entrepreneur, animator, philanthropist',
 'former u.s. representative, nasa astronaut, survivors',
 'vagabond photojournalist, conceptual artist',
 'architect, ecotourism specialist',
 'science historian, writer',
 'photographer, storyteller',
 'comedian, designer',
 'photographer, visual artist',
 'vagabond photojournalist, conceptual artist',
 'chaplain, author',
 'mother, als advocate',
 'attorney, privacy advocate',
 'graffiti artist, activist',
 'entrepreneur, educator',
 'women’s rights activist, entrepreneur',
 'director, choreographer, dancer',
 'marine biologist, explorer, photographer',
 'human-computer interaction researcher, designer',
 'big data techno-optimist, internist',
 'ceo, co-founder',
 'writer, activist, legal analyst',
 'physician, men’s health advocate',
 'satellite archaeologist, ted prize winner',
 'bioelectronics innovator',
 'satellite archaeologist, ted prize winner',
 "tv-journalist, women's empowerment advocate"]

# Apply the replacements pair by pair, assigning the result back to the column
for w, r in zip(wrong_job, right_job):
  df['speaker_occupation'] = df['speaker_occupation'].replace(w, r)
df['speaker_occupation'].unique()

array(['author, educator', 'climate advocate', 'technology columnist',
       ..., 'historian, philosopher', 'astrobiologist',
       ' robotics engineer'], dtype=object)

Lastly, we check the number of unique event names then list all of them.

In [145]:
df['event'].unique()

array(['TED2006', 'TED2004', 'TED2005', 'TEDGlobal 2005', 'TEDSalon 2006',
       'TED2003', 'TED2007', 'TED2002', 'TEDGlobal 2007',
       'TEDSalon 2007 Hot Science', 'Skoll World Forum 2007', 'TED2008',
       'TED1984', 'TED1990', 'DLD 2007', 'EG 2007', 'TED1998',
       'LIFT 2007', 'TED Prize Wish', 'TEDSalon 2009 Compassion',
       'Chautauqua Institution', 'Serious Play 2008', 'Taste3 2008',
       'TED2001', 'TED in the Field', 'TED2009', 'EG 2008',
       'Elizabeth G. Anderson School', 'TEDxUSC', 'TED@State',
       'TEDGlobal 2009', 'TEDxKC', 'TEDIndia 2009',
       'TEDSalon London 2009', 'Justice with Michael Sandel',
       'Business Innovation Factory', 'TEDxTC',
       'Carnegie Mellon University', 'Stanford University',
       'AORN Congress', 'University of California', 'TEDMED 2009',
       'Royal Institution', 'Bowery Poetry Club', 'TEDxSMU',
       'Harvard University', 'TEDxBoston 2009', 'TEDxBerlin', 'TED2010',
       'TEDxAmsterdam', 'World Science Festival', 

In [146]:
df['event'].value_counts()

TED2014                       84
TED2009                       83
TED2013                       77
TED2016                       77
TED2015                       75
                              ..
TEDxAtlanta                    1
TEDxPuget Sound                1
TEDxUCL                        1
TEDxLondonBusinessSchool       1
Carnegie Mellon University     1
Name: event, Length: 355, dtype: int64

The dataset has 355 unique event names, but many of them are similar enough to be grouped together.

Based on the main focus of each event, we break the event names down into the following 12 categories, each containing at least 5 talks:


1. TED19: TED conferences held in the 1900s (e.g. TED1984)
2. TED20: TED conferences held in the 2000s (e.g. TED2009)
3. TEDx: The TEDx program lets individuals, organizations and communities worldwide hold local, independent TED-like events. To date, more than 13,000 TEDx events have been held in 150 countries.
4. TEDGlobal: TEDGlobal is a conference that celebrates human ingenuity by exploring ideas, innovation and creativity from all around the world with different themes each year.
5. TEDSalon: TED Salons welcome an intimate audience for an afternoon or evening of highly-curated TED Talks revolving around a globally relevant theme. A condensed version of a TED flagship conference, they are distinct in their brevity, opportunities for conversation, and heightened interaction between the speaker and audience.
6. TEDWomen: TEDWomen is a three-day conference about the power of women and girls to be creators and change-makers.
7. TED@BCG: TED@BCG is a multi-year collaboration with Boston Consulting Group that has been held in Mumbai, Toronto, Milan, Paris, London, Berlin, Singapore, and San Francisco.
8. TED@: TED@ is a multi-year collaboration with different partners with touch points across the TED ecosystem.
9. TEDYouth: TEDYouth is a day-long event for middle and high school students, with live speakers, hands-on activities and great conversations. Scientists, designers, technologists, explorers, artists, performers (and more!) share short talks on what they do best, serving both as a source of knowledge and inspiration for youth around the globe.
10. TEDMED: TEDMED convenes and curates extraordinary people and ideas from all disciplines both inside and outside of medicine in pursuit of unexpected connections that accelerate innovation in health and medicine. Best known for their annual event, TEDMED is a year-round global community. 
11. TED: Other TED Talks, each focusing on a specific topic. These include talks filmed at TED's flagship conferences.
12. Other: These talks don’t come from TED or any of their partner conferences. These talks come from all over the Web.



In [0]:
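def replace_event_cat(event):
  """Return the first category prefix that the event name starts with, or None."""

  # Prefixes used as patterns to match the 355 events in the column event.
  # They are checked in order, so the more specific ones come first
  # ('TED@BCG' before 'TED@', and the catch-all 'TED' last).
  regex_list = ['TED19', 'TED20', 'TEDx', 'TEDGlobal', 'TEDSalon', 'TEDWomen', 'TED@BCG', 'TED@', 'TEDYouth', 'TEDMED', 'TED']

  # re.match anchors at the start of the string, so this is a prefix test.
  for reg in regex_list:
    if re.match(reg, event):
      return reg
  return None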

In [148]:
# Store the matched category in a new column event_category; events with no match get None.
df['event_category'] = df['event'].apply(replace_event_cat)

df['event_category'].unique()

array(['TED20', 'TEDGlobal', 'TEDSalon', None, 'TED19', 'TED', 'TEDx',
       'TED@', 'TEDMED', 'TEDWomen', 'TEDYouth', 'TED@BCG'], dtype=object)
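The order of the prefixes in `regex_list` matters, because the first match wins. A minimal standalone sketch of the same idea, using `str.startswith` instead of `re.match` and returning 'Other' directly; the names `PREFIXES` and `categorize` are assumptions for this example only.

```python
# Most specific prefixes first: 'TED@BCG' must be tested before 'TED@',
# and every other prefix before the catch-all 'TED'.
PREFIXES = ['TED19', 'TED20', 'TEDx', 'TEDGlobal', 'TEDSalon', 'TEDWomen',
            'TED@BCG', 'TED@', 'TEDYouth', 'TEDMED', 'TED']

def categorize(event):
    """Return the first prefix the event name starts with, else 'Other'."""
    for prefix in PREFIXES:
        if event.startswith(prefix):
            return prefix
    return 'Other'

for event in ['TED1984', 'TED2011', 'TED@State', 'TED Prize Wish', 'Stanford University']:
    print(event, '->', categorize(event))
```

If 'TED' were listed first, every TED-branded event would collapse into that single category.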

In [149]:
df.sample()

ID,title,description,main_speaker,speaker_occupation,num_speaker,duration,event,film_date,published_date,views,comments,tags,languages,ratings,event_category
955,Taking imagination seriously,Janet Echelman found her true voice as an arti...,Janet Echelman,artist,1,566,TED2011,1299110400,1307489760,1832930,2492,"['art', 'cities', 'culture', 'data', 'design',...",35,"[{'id': 23, 'name': 'Jaw-dropping', 'count': 3...",TED20


In [150]:
# Replace every None value with 'Other'
df['event_category'] = df['event_category'].fillna('Other')
df['event_category'].value_counts()

TED20        969
TEDx         471
TEDGlobal    463
TED          178
Other        111
TEDWomen      96
TEDSalon      79
TEDMED        68
TED@          60
TED@BCG       27
TEDYouth      19
TED19          9
Name: event_category, dtype: int64

### Handle Datetime columns : Reformatting for better analysis

In [151]:
df.sample()

ID,title,description,main_speaker,speaker_occupation,num_speaker,duration,event,film_date,published_date,views,comments,tags,languages,ratings,event_category
1384,Teen wonders play bluegrass,"Brothers Jonny, Robbie and Tommy Mizzone are T...",Sleepy Man Banjo Boys,bluegrass musicians,1,302,TED@New York,1339027200,1353514178,1798001,87,"['children', 'entertainment', 'live music', 'p...",33,"[{'id': 1, 'name': 'Beautiful', 'count': 285},...",TED@


This data stores ```film_date``` and ```published_date``` as Unix time, which describes a point in time as the number of seconds elapsed since the Unix epoch (00:00:00 UTC on 1 January 1970), not counting leap seconds. We therefore convert both columns into calendar dates (year, month and day).
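To make the conversion concrete, here is a small sketch using only the standard library. The two timestamps are the ```film_date``` and ```published_date``` of the TED2005 talk 'A dance of "Symbiosis"' shown earlier; the helper name `unix_to_utc` is an assumption for this example.

```python
from datetime import datetime, timezone

def unix_to_utc(seconds):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

film = unix_to_utc(1109289600)       # film_date of the talk
published = unix_to_utc(1170979860)  # published_date of the talk

print(film.date())            # 2005-02-25
print(published.isoformat())  # 2007-02-09T00:11:00+00:00
```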

In [152]:
df['film_date'] = pd.to_datetime(df['film_date'], unit='s')
df['film_date'].head(10)

ID
1    2006-02-25
2    2006-02-25
3    2006-02-24
4    2006-02-26
5    2006-02-22
6    2006-02-02
7    2006-02-24
8    2006-02-23
9    2006-02-02
10   2006-02-25
Name: film_date, dtype: datetime64[ns]

In [153]:
df['published_date'] = pd.to_datetime(df['published_date'],unit='s').dt.to_period('D')
df['published_date'].head(10)

ID
1     2006-06-27
2     2006-06-27
3     2006-06-27
4     2006-06-27
5     2006-06-27
6     2006-06-27
7     2006-07-10
8     2006-07-10
9     2006-07-18
10    2006-07-18
Name: published_date, dtype: period[D]

In [0]:
df.sample()

ID,title,description,main_speaker,speaker_occupation,num_speaker,duration,event,film_date,published_date,views,comments,tags,languages,ratings,event_category
854,Silicon-based comedy,"In this first-of-its-kind demo, Heather Knight...",Heather Knight,Roboticist,1,364,TEDWomen 2010,2010-12-08,2011-01-21,757335,114,"['AI', 'comedy', 'entertainment', 'humor', 'ro...",30,"[{'id': 9, 'name': 'Ingenious', 'count': 162},...",TEDWomen


### Handle Tags column

In [161]:
pd.set_option('display.max_rows', 100)
df['tags'].head()

ID
1    ['children', 'creativity', 'culture', 'dance',...
2    ['alternative energy', 'cars', 'climate change...
3    ['computers', 'entertainment', 'interface desi...
4    ['MacArthur grant', 'activism', 'business', 'c...
5    ['Africa', 'Asia', 'Google', 'demo', 'economic...
Name: tags, dtype: object

In [0]:
# Create another table with the different words extracted from the tags as columns, and count the frequency of those words.
# First, we need to split each tags list into separate rows.

In [0]:
# lst = df['tags'].astype(str)
# lst
# regex = ['"', " ", ","]
# dct ={}
# for i in lst:
#   for _ in str(i):
#     # if _ not in regex:
#       if _ not in dct:
#         dct[_]= 1
#       else:
#         dct[_] += 1
# dct
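The commented-out attempt above counts individual characters rather than whole tags. One possible way to count tags, sketched here with the standard library only, is to parse each stringified list with `ast.literal_eval` and feed it to a `Counter`. The three sample strings are tag lists that appear in full earlier in this notebook; `count_tags` is a hypothetical helper name.

```python
import ast
from collections import Counter

# Stringified tag lists in the same shape as the tags column.
sample_tags = [
    "['children', 'comedy', 'humor']",
    "['art', 'live music', 'music', 'performance']",
    "['TEDYouth', 'engineering', 'technology']",
]

def count_tags(tag_strings):
    """Parse each stringified list and count how often every tag occurs."""
    counts = Counter()
    for raw in tag_strings:
        counts.update(ast.literal_eval(raw))
    return counts

tag_counts = count_tags(sample_tags)
print(tag_counts.most_common(3))
```

On the full column the same function could be called as `count_tags(df['tags'])`, assuming every value is still a string.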

### Handle Ratings column

In [159]:
df['ratings'].head(1)

ID
1    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
Name: ratings, dtype: object

In [167]:
import ast

# Parse every row (not only the first five) into a Python list of dictionaries
df['ratings'] = df['ratings'].apply(lambda x: ast.literal_eval(str(x)))
df.loc[1,'ratings']

[{'count': 19645, 'id': 7, 'name': 'Funny'},
 {'count': 4573, 'id': 1, 'name': 'Beautiful'},
 {'count': 6073, 'id': 9, 'name': 'Ingenious'},
 {'count': 3253, 'id': 3, 'name': 'Courageous'},
 {'count': 387, 'id': 11, 'name': 'Longwinded'},
 {'count': 242, 'id': 2, 'name': 'Confusing'},
 {'count': 7346, 'id': 8, 'name': 'Informative'},
 {'count': 10581, 'id': 22, 'name': 'Fascinating'},
 {'count': 300, 'id': 21, 'name': 'Unconvincing'},
 {'count': 10704, 'id': 24, 'name': 'Persuasive'},
 {'count': 4439, 'id': 23, 'name': 'Jaw-dropping'},
 {'count': 1174, 'id': 25, 'name': 'OK'},
 {'count': 209, 'id': 26, 'name': 'Obnoxious'},
 {'count': 24924, 'id': 10, 'name': 'Inspiring'}]
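A note on parsing: `eval` executes arbitrary code, so for data read from a file `ast.literal_eval` is the safer choice, since it only accepts Python literals. The sketch below parses a shortened version of the first talk's ratings (two of its fourteen entries) and collapses it into a name-to-count dictionary; `parse_ratings` is a hypothetical helper name.

```python
import ast

# Two of the fourteen ratings of talk 1, in the stringified form stored in the CSV.
raw = "[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

def parse_ratings(raw_ratings):
    """Parse a stringified list of rating dicts into {name: count}."""
    return {r['name']: r['count'] for r in ast.literal_eval(raw_ratings)}

counts = parse_ratings(raw)
top = max(counts, key=counts.get)
print(counts, top)

# literal_eval refuses anything that is not a plain literal.
try:
    ast.literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True
```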

In [174]:
def rating_counts(ratings):
  # Collapse a list of {'id', 'name', 'count'} dictionaries into {name: count}
  rating_count = {}

  for rating in ratings:
    rating_count[rating['name']] = rating_count.get(rating['name'], 0) + rating['count']
  return rating_count

df['ratings'].apply(rating_counts)

In [166]:
df.loc[1,'ratings']

"[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

In [0]:
import ast

# Turn each stringified list of rating dictionaries into a Python list
df['ratings'] = df['ratings'].apply(lambda x: ast.literal_eval(str(x)))

counter = {'Funny':0, 'Beautiful':0, 'Ingenious':0, 'Courageous':0, 'Longwinded':0, 'Confusing':0, 'Informative':0, 'Fascinating':0, 'Unconvincing':0, 'Persuasive':0, 'Jaw-dropping':0, 'OK':0, 'Obnoxious':0, 'Inspiring':0}

# Sum the count of every rating across all talks.
# Iterating over the values avoids positional lookups on the ID index.
for ratings in df['ratings']:
    for rating in ratings:
        counter[rating['name']] += rating['count']
    
frequencies = list(counter.values())
descr = counter.keys()
descriptors = [x for _,x in sorted(zip(frequencies,counter.keys()), reverse=True)]
neg_descriptors = {"Confusing", "Unconvincing", "Longwinded", "Obnoxious", "OK"}
neg_indices  = [x for x in range (len(descriptors)) if descriptors[x] in neg_descriptors]
frequencies.sort(reverse=True)

## Exploratory Data Analysis