# Capstone Project Part 2: Data Set and the Goals

## Data Set



The dataset that I will be using for my Capstone Project comes from the Words of Persuasion: Text Predictors of Persuasive TED Talks project, which Owen Temple shared on the data.world website (for further information, please see https://data.world/owentemple/text-and-content-features-of-most-persuasive-ted-talks).

This dataset contains various information about the TED Talk videos available on the TED website, including the transcripts of the talks, ratings, number of views, and the occupation of the presenter. The base information was scraped by Owen Temple, and additional linguistic features of the talks were added to serve the goals of the original project. These linguistic features (positive emotion, negative emotion, anger, etc.) seem to have been created with the LIWC2015 (Linguistic Inquiry and Word Count) software, as the manual below is referenced in the data dictionary:

https://repositories.lib.utexas.edu/bitstream/handle/2152/31333/LIWC2015_LanguageManual.pdf

Currently the dataset consists of 2406 rows and 187 columns, but I'll be adding a gender feature and a few other features as needed in the next step of my project.

## Research Questions and Goals

Similar to the original project, I would like to investigate which linguistic features are indicative of the persuasive and inspiring ratings, while including another variable that is not yet in the dataset: gender. I think it would be interesting to see how the language used by male and female presenters affects these ratings and the number of views for each video.

My current research questions for the analyses are as follows:

1. Which linguistic features are more indicative of gender?
2. Which linguistic features are more indicative of persuasive and inspiring ratings?
3. Which non-linguistic features such as the topic, number of views, occupation of the presenter, etc. are more indicative of persuasive and inspiring ratings?
4. Is there a significant difference between the genders in terms of the number of views and the persuasive and inspiring ratings, as well as the more negative ratings such as obnoxious and unconvincing?
5. Can we predict the topics based on the talk script?
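Question 4 refers to an obnoxious rating, which does not appear as its own column; the counts for every rating name are stored in the `ratings` column as a stringified list of dictionaries. The sketch below shows one possible way to pull a single count out of such a string. The helper name `rating_count` and the sample numbers are made up for illustration and are not part of the original project.

```python
import ast

def rating_count(ratings_str, rating_name):
    """Return the count for one rating name from a stringified list of dicts."""
    ratings = ast.literal_eval(ratings_str)  # parses Python literals only, no code execution
    for rating in ratings:
        if rating['name'].lower() == rating_name.lower():
            return rating['count']
    return 0  # the rating name does not occur for this talk

# Illustrative string in the same shape as the `ratings` column (made-up counts)
sample = "[{'id': 7, 'name': 'Funny', 'count': 12}, {'id': 26, 'name': 'Obnoxious', 'count': 3}]"
obnoxious = rating_count(sample, 'Obnoxious')
print(obnoxious)
```

On the full data this could be applied with something like `df.ratings.map(lambda s: rating_count(s, 'Obnoxious'))`, assuming every row follows the format shown in the sample.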

## Methods

I'll be using a combination of supervised and unsupervised models, including classification and regression models, as well as t-tests, to answer the above questions. The results of the prediction and classification models will be reported with model accuracy scores, coefficients, and precision and recall scores, whereas I will report t statistics and p-values for the t-tests.
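As a rough illustration of the t-test part, the sketch below computes Welch's t statistic (the variant that does not assume equal variances) and its approximate degrees of freedom with the standard library only. The function name `welch_t` and the two small samples are hypothetical; choosing Welch's variant is an assumption here, not something the plan above specifies. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Made-up numbers standing in for a rating column split by gender
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [1.0, 2.0, 3.0, 4.0]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 2))
```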

## Challenges

One potential challenge that I can foresee at the moment is the creation of the gender variable, which may result in unbalanced classes. In that case, I'll use techniques such as oversampling to overcome this challenge.
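A minimal sketch of random oversampling with pandas is shown below, assuming the class label lives in a single column. The helper `oversample_minority` and the toy frame are hypothetical. Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the test set.

```python
import pandas as pd

def oversample_minority(frame, label_col, random_state=42):
    """Duplicate random rows of the smaller classes until all classes match the largest."""
    target = frame[label_col].value_counts().max()
    parts = []
    for _, group in frame.groupby(label_col):
        extra = target - len(group)
        if extra > 0:
            # draw the missing rows with replacement from this class only
            resampled = group.sample(n=extra, replace=True, random_state=random_state)
            group = pd.concat([group, resampled])
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)

# Toy frame with made-up values: 6 'male' rows and 2 'female' rows
toy = pd.DataFrame({'gender': ['male'] * 6 + ['female'] * 2, 'views': range(8)})
balanced = oversample_minority(toy, 'gender')
print(balanced['gender'].value_counts())
```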

In [287]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from pprint import pprint
import matplotlib.pyplot as plt
from pandas_summary import DataFrameSummary
import seaborn as sns
import numpy as np

plt.style.use('ggplot')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

### Data Set Exploration

In [288]:
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
df = pd.read_excel('all_with_liwc_segmented.xls')

In [289]:
df.head()

Unnamed: 0,index,comments,description,duration,event,film_date,languages,main_speaker,name,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,music,conversation,transcript,persuasive,inspiring,unconvincing,applause,laughter,norm_persuasive,norm_inspiring,norm_unconvincing,transcript_1sthalf,transcript_2ndhalf,transcript_1q,transcript_2q,transcript_3q,transcript_4q,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP,Moral,HarmVirtue,HarmVice,FairnessVirtue,FairnessVice,IngroupVirtue,IngroupVice,AuthorityVirtue,AuthorityVice,PurityVirtue,PurityVice,MoralityGeneral,affect_1h,posemo_1h,negemo_1h,anx_1h,anger_1h,sad_1h,affect_2h,posemo_2h,negemo_2h,anx_2h,anger_2h,sad_2h,affect_1q,posemo_1q,negemo_1q,anx_1q,anger_1q,sad_1q,affect_2q,posemo_2q,negemo_2q,anx_2q,anger_2q,sad_2q,affect_3q,posemo_3q,negemo_3q,anx_3q,anger_3q,sad_3q,affect_4q,posemo_4q,negemo_4q,anx_4q,anger_4q,sad_4q,posemo_change_h,negemo_change_h,affect_change_h,posemo_change_q,negemo_change_q,affect_change_q,published_year,Harm,Fairness,Purity,Ingroup,Authority
0,0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,0,0,"Good morning. How are you?()It's been great, h...",10704,24924,300,4,39,226.649482,527.747728,6.352284,"Good morning. How are you?()It's been great, h...","live in their heads. They live up there, and s...","Good morning. How are you?()It's been great, h...","boy said, ""I bring you myrrh."" And the third b...","they live in their heads. They live up there, ...","intelligence is, it's distinct. I'm doing a ne...",3119,38.67,90.77,30.78,56.66,15.07,14.84,91.5,59.7,18.98,12.28,2.5,2.56,2.6,2.76,1.86,6.7,7.57,13.02,12.02,5.13,6.7,2.28,20.94,3.3,1.54,2.12,1.67,1.83,3.78,2.66,1.03,0.35,0.1,0.1,16.45,0.45,0.03,2.47,1.31,11.99,2.82,1.76,2.08,2.15,1.64,3.56,2.79,0.8,1.76,0.13,1.67,0.74,0.45,0.03,0.35,8.27,3.33,1.12,2.98,0.83,0.45,6.25,12.98,1.15,13.95,2.44,7.6,3.94,3.27,1.44,0.42,0.1,0.13,0.0,0.22,0.0,0.0,0.06,0.06,0.1,26.42,7.18,7.69,0.13,0.35,1.25,0.06,0.48,2.24,4.26,2.76,0.0,0.99,0.0,0.0,0.0,0.0,0.03,0.0,0.41,0.0,0.0,0.03,0.51,4.29,2.82,1.47,0.58,0.06,0.13,3.28,2.5,0.58,0.13,0.13,0.06,3.85,3.59,0.26,0.0,0.0,0.0,4.74,2.05,2.69,1.15,0.13,0.26,3.97,2.95,0.64,0.13,0.26,0.0,2.57,2.06,0.51,0.13,0.0,0.13,-0.32,-0.89,0.57,-1.53,0.25,-1.78,2006,0.0,0.0,0.03,0.03,0.41
1,1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,0,0,"Thank you so much, Chris. And it's truly a gre...",268,413,258,3,21,83.736393,129.041531,80.611901,"Thank you so much, Chris. And it's truly a gre...",the speech by telling them the story of what h...,"Thank you so much, Chris. And it's truly a gre...",heard of phantom limb pain?()This was a rented...,the speech by telling them the story of what h...,extremely upset because one of the wire servic...,723,76.53,75.55,70.35,51.66,30.12,12.59,86.86,55.33,16.74,11.48,6.09,2.35,1.11,1.38,0.55,5.26,8.02,15.77,7.33,4.15,5.53,0.41,17.29,4.01,1.8,1.11,1.52,2.49,3.87,2.63,1.24,0.28,0.14,0.55,10.93,0.69,0.41,0.97,1.11,5.95,0.97,0.97,0.83,0.83,1.52,0.97,3.46,0.83,2.49,0.14,1.8,0.28,0.28,0.0,1.24,8.02,3.73,0.69,2.35,1.66,0.14,9.68,5.95,1.11,18.4,2.77,10.51,5.67,1.24,1.52,1.11,0.69,0.0,0.0,0.14,0.0,0.0,0.14,0.0,0.0,27.11,5.12,7.47,0.28,0.14,0.41,0.69,0.83,2.07,2.9,7.19,0.0,1.1,0.0,0.0,0.0,0.0,0.55,0.0,0.41,0.0,0.0,0.0,0.14,4.67,3.57,1.1,0.27,0.0,0.55,3.06,1.67,1.39,0.28,0.28,0.56,6.08,6.08,0.0,0.0,0.0,0.0,3.28,1.09,2.19,0.55,0.0,1.09,3.39,1.69,1.69,0.0,0.56,0.56,2.75,1.65,1.1,0.55,0.0,0.55,-1.9,0.29,-2.19,-4.43,1.1,-5.53,2006,0.0,0.0,0.0,0.55,0.41
2,2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,0,0,"()Hello voice mail, my old friend.()I've calle...",230,230,104,12,38,140.561709,140.561709,63.558338,"()Hello voice mail, my old friend.()I've calle...","one yet. It's mainly like, the United States, ...","()Hello voice mail, my old friend.()I've calle...","are you typing 11?"" He said, ""The message says...","one yet. It's mainly like, the United States, ...",Woz.()You try rhyming with garage!()Don't cry ...,3253,37.41,73.98,49.15,55.32,14.14,15.06,85.95,55.7,17.68,10.21,3.97,0.65,4.09,0.43,1.08,7.38,7.81,10.67,10.7,6.52,6.55,2.03,20.2,3.47,2.21,2.15,2.0,2.64,4.21,2.89,1.32,0.12,0.34,0.43,10.79,0.0,0.28,0.18,0.55,11.37,2.37,2.03,1.17,1.91,1.91,3.01,2.86,0.86,1.44,0.55,0.68,0.31,0.22,0.0,0.12,5.87,1.23,1.01,1.78,1.66,0.49,4.0,14.02,1.38,13.13,2.27,6.64,4.46,1.94,0.55,0.46,1.01,0.09,0.06,1.01,0.09,0.12,0.52,0.28,0.03,28.44,7.59,7.53,0.4,0.34,1.08,0.22,0.4,2.31,4.95,3.32,0.31,0.76,0.03,0.0,0.0,0.0,0.12,0.0,0.03,0.0,0.03,0.06,0.49,4.16,2.88,1.29,0.12,0.37,0.31,4.26,2.9,1.36,0.12,0.31,0.56,5.28,3.69,1.6,0.25,0.49,0.37,3.05,2.07,0.98,0.0,0.24,0.24,2.95,2.21,0.74,0.12,0.12,0.25,5.58,3.6,1.99,0.12,0.5,0.87,0.02,0.07,-0.05,-0.09,0.39,-0.48,2006,0.03,0.0,0.09,0.12,0.03
3,3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,0,0,If you're here today  and I'm very happy th...,460,1070,36,7,10,270.978764,630.320167,21.207034,If you're here today  and I'm very happy th...,Island Park. Right now we're separated by abou...,If you're here today  and I'm very happy th...,are striving for solutions that won't compromi...,Randall's Island Park. Right now we're separat...,projects include thousands of new parking spac...,3087,72.68,66.21,61.22,44.77,19.92,21.41,85.0,49.95,13.09,7.32,2.85,2.36,1.04,0.58,0.49,5.77,6.8,13.83,7.68,4.6,5.77,1.55,13.44,6.03,3.3,1.43,2.01,3.11,4.92,2.92,1.88,0.13,0.42,0.39,8.94,0.26,0.32,0.19,0.65,11.18,2.17,2.24,1.07,1.75,1.23,3.76,1.3,0.78,0.16,0.26,1.49,0.16,1.0,0.06,0.29,10.63,4.08,2.04,3.6,1.13,0.68,4.02,8.94,1.1,15.58,2.2,9.1,4.11,3.82,0.39,0.75,1.88,0.13,0.06,0.52,0.0,0.0,0.19,0.23,0.1,18.27,5.73,5.99,0.23,0.06,0.39,0.13,1.78,0.19,2.3,1.17,0.29,2.07,0.06,0.23,0.19,0.03,0.9,0.06,0.1,0.03,0.0,0.06,0.39,4.73,2.59,2.14,0.06,0.52,0.45,5.11,3.24,1.62,0.19,0.32,0.32,4.55,2.21,2.34,0.13,0.26,0.52,4.93,2.98,1.95,0.0,0.78,0.39,4.81,3.12,1.43,0.13,0.39,0.26,5.41,3.35,1.8,0.26,0.26,0.39,0.65,-0.52,1.17,1.14,-0.54,1.68,2006,0.29,0.22,0.06,0.96,0.13
4,4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,0,0,"About 10 years ago, I took on the task to teac...",2542,2893,67,3,10,211.72978,240.965481,5.580604,"About 10 years ago, I took on the task to teac...",of the population. Sierra Leone down there. Ma...,"About 10 years ago, I took on the task to teac...",the life expectancy of the African countries a...,of the population. Sierra Leone down there. Ma...,strategy as down here. The improvement of the ...,3163,53.92,70.14,54.36,45.52,14.64,15.4,86.63,56.66,14.13,7.52,1.93,2.43,1.36,0.16,1.64,6.61,8.09,12.33,9.9,6.99,7.56,0.98,15.81,5.09,3.67,1.9,2.12,2.69,2.85,1.96,0.89,0.16,0.16,0.03,8.06,0.54,0.06,0.06,0.19,9.77,1.2,1.42,1.61,2.34,0.95,3.67,1.96,1.49,0.35,0.03,1.26,0.06,1.07,0.13,0.06,9.29,3.29,1.01,3.45,1.74,0.28,3.0,12.36,1.01,16.63,2.5,11.0,3.51,2.56,0.6,0.6,1.9,0.06,0.32,0.35,0.0,0.19,0.0,0.19,0.0,19.29,7.37,6.64,0.35,0.28,0.7,0.0,0.35,0.51,2.06,0.89,0.16,1.42,0.0,0.0,0.16,0.0,0.98,0.06,0.0,0.0,0.0,0.03,0.19,2.15,1.39,0.76,0.19,0.13,0.0,3.54,2.53,1.01,0.13,0.19,0.06,2.03,1.52,0.51,0.25,0.0,0.0,2.28,1.26,1.01,0.13,0.25,0.0,2.81,1.53,1.28,0.13,0.26,0.13,4.25,3.5,0.75,0.12,0.12,0.0,1.14,0.25,0.89,1.98,0.24,1.74,2006,0.0,0.16,0.03,1.04,0.0


In [290]:
df.tail()

Unnamed: 0,index,comments,description,duration,event,film_date,languages,main_speaker,name,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,music,conversation,transcript,persuasive,inspiring,unconvincing,applause,laughter,norm_persuasive,norm_inspiring,norm_unconvincing,transcript_1sthalf,transcript_2ndhalf,transcript_1q,transcript_2q,transcript_3q,transcript_4q,WC,Analytic,Clout,Authentic,Tone,WPS,Sixltr,Dic,function,pronoun,ppron,i,we,you,shehe,they,ipron,article,prep,auxverb,adverb,conj,negate,verb,adj,compare,interrog,number,quant,affect,posemo,negemo,anx,anger,sad,social,family,friend,female,male,cogproc,insight,cause,discrep,tentat,certain,differ,percept,see,hear,feel,bio,body,health,sexual,ingest,drives,affiliation,achieve,power,reward,risk,focuspast,focuspresent,focusfuture,relativ,motion,space,time,work,leisure,home,money,relig,death,informal,swear,netspeak,assent,nonflu,filler,AllPunc,Period,Comma,Colon,SemiC,QMark,Exclam,Dash,Quote,Apostro,Parenth,OtherP,Moral,HarmVirtue,HarmVice,FairnessVirtue,FairnessVice,IngroupVirtue,IngroupVice,AuthorityVirtue,AuthorityVice,PurityVirtue,PurityVice,MoralityGeneral,affect_1h,posemo_1h,negemo_1h,anx_1h,anger_1h,sad_1h,affect_2h,posemo_2h,negemo_2h,anx_2h,anger_2h,sad_2h,affect_1q,posemo_1q,negemo_1q,anx_1q,anger_1q,sad_1q,affect_2q,posemo_2q,negemo_2q,anx_2q,anger_2q,sad_2q,affect_3q,posemo_3q,negemo_3q,anx_3q,anger_3q,sad_3q,affect_4q,posemo_4q,negemo_4q,anx_4q,anger_4q,sad_4q,posemo_change_h,negemo_change_h,affect_change_h,posemo_change_q,negemo_change_q,affect_change_q,published_year,Harm,Fairness,Purity,Ingroup,Authority
2623,2623,12,Anna Heringer: The warmth and wisdom of mud bu...,781,TED2017,1492992000,8,Anna Heringer,Anna Heringer: The warmth and wisdom of mud bu...,1506609881,"[{'id': 1, 'name': 'Beautiful', 'count': 75}, ...","[{'id': 2532, 'hero': 'https://pe.tedcdn.com/i...",Architect,"['architecture', 'design', 'engineering', 'env...","""There are a lot of resources given by nature ...",https://www.ted.com/talks/anna_heringer_the_wa...,870315,0,0,It was the end of October in the mountains in...,45,141,4,6,7,51.705417,162.010307,4.596037,It was the end of October in the mountains in...,"the evening, I used to go with the workers to ...",It was the end of October in the mountains in...,"with their names in Bengali the doors, and the...","the evening, I used to go with the workers to ...","elements, for prefabrication of rammed earth e...",1762,66.52,59.87,38.08,69.62,17.27,17.54,83.09,55.05,12.54,6.24,2.21,1.93,0.96,0.06,1.08,6.3,8.23,13.85,8.51,5.11,8.97,1.48,13.22,4.48,2.1,1.36,1.02,2.27,3.46,2.84,0.51,0.0,0.06,0.17,6.36,0.11,0.23,0.06,0.17,10.1,1.36,1.82,1.59,1.93,1.65,3.06,1.76,0.57,0.0,1.02,1.14,0.51,0.28,0.0,0.17,8.0,2.84,1.65,2.61,1.02,0.28,3.97,8.8,0.17,14.19,1.42,9.42,3.46,3.92,0.74,0.51,0.96,0.06,0.0,0.4,0.0,0.0,0.23,0.17,0.0,17.54,6.02,7.21,0.06,0.06,0.45,0.0,1.14,0.0,1.14,1.48,0.0,1.53,0.17,0.0,0.06,0.0,0.23,0.0,0.4,0.0,0.0,0.4,0.28,3.76,3.08,0.57,0.0,0.11,0.11,3.17,2.6,0.45,0.0,0.0,0.23,3.2,2.75,0.23,0.0,0.0,0.0,4.31,3.4,0.91,0.0,0.23,0.23,2.05,1.59,0.45,0.0,0.0,0.23,4.28,3.6,0.45,0.0,0.0,0.23,-0.48,-0.12,-0.36,0.85,0.22,0.63,2017,0.17,0.06,0.4,0.23,0.4
2624,2624,21,Julio Gil: Future tech will give you the benef...,667,TED@UPS,1500508800,14,Julio Gil,Julio Gil: Future tech will give you the benef...,1506524207,"[{'id': 10, 'name': 'Inspiring', 'count': 105}...","[{'id': 2609, 'hero': 'https://pe.tedcdn.com/i...",Logistics expert,"['augmented reality', 'business', 'cities', 'c...",Don't believe predictions that say the future ...,https://www.ted.com/talks/julio_gil_future_tec...,1079284,0,0,"Today, more than half of the world's populati...",49,105,37,1,8,45.400469,97.28672,34.281987,"Today, more than half of the world's populati...",next reason why people move to cities? Access ...,"Today, more than half of the world's populati...","opportunities, easier access to services and g...",next reason why people move to cities? Access ...,"population density, is not always the best for...",1569,65.86,77.59,53.62,78.35,13.88,16.89,88.08,53.6,13.0,7.65,1.66,1.66,2.8,0.0,1.53,5.35,8.67,12.49,9.24,5.1,7.07,1.53,16.06,5.93,3.63,0.96,2.17,2.93,4.02,3.44,0.57,0.0,0.0,0.25,9.69,0.13,0.25,0.06,0.0,11.98,2.1,2.17,1.85,2.29,1.08,3.38,1.02,0.64,0.13,0.19,1.66,0.13,1.08,0.0,0.25,7.2,3.12,1.47,1.72,1.47,0.25,1.47,13.38,1.91,16.25,2.87,8.03,5.42,3.12,1.02,0.83,1.66,0.0,0.0,0.45,0.0,0.0,0.06,0.25,0.06,18.36,6.63,7.07,0.38,0.0,0.57,0.0,0.57,0.13,1.85,1.15,0.0,0.64,0.0,0.0,0.06,0.0,0.25,0.0,0.13,0.0,0.0,0.0,0.19,4.2,3.44,0.76,0.0,0.0,0.38,3.83,3.44,0.38,0.0,0.0,0.13,4.34,3.06,1.28,0.0,0.0,0.51,4.07,3.82,0.25,0.0,0.0,0.25,3.04,2.53,0.51,0.0,0.0,0.0,4.63,4.37,0.26,0.0,0.0,0.26,0.0,-0.38,0.38,1.31,-1.02,2.33,2017,0.0,0.06,0.0,0.25,0.13
2625,2625,19,Nabila Alibhai: Why people of different faiths...,711,TEDGlobal 2017,1503792000,9,Nabila Alibhai,Nabila Alibhai: Why people of different faiths...,1506456019,"[{'id': 1, 'name': 'Beautiful', 'count': 76}, ...","[{'id': 2643, 'hero': 'https://pe.tedcdn.com/i...",Place-maker,"['activism', 'art', 'community', 'faith', 'pai...","Divisions along religious lines are deepening,...",https://www.ted.com/talks/nabila_alibhai_why_p...,937630,0,0,"We live in a time of fear, and our response t...",13,142,7,3,2,13.864744,151.445666,7.465631,"We live in a time of fear, and our response t...","religious establishments. For example, with Ca...","We live in a time of fear, and our response t...","grow in fear-driven ways. We see more walls, m...","within religious establishments. For example, ...",spread the word through their congregations an...,1598,69.1,93.33,39.07,55.25,16.65,22.03,88.3,53.88,14.14,8.32,1.31,5.07,0.56,0.25,1.13,5.82,6.7,14.96,6.63,3.69,9.95,0.75,12.2,4.32,2.88,1.56,1.31,2.38,6.57,4.07,2.5,1.56,0.5,0.19,13.7,0.19,0.38,0.0,0.25,8.89,2.63,1.63,1.44,1.38,0.63,2.25,2.69,1.5,0.38,0.56,1.13,0.44,0.5,0.0,0.0,13.02,8.7,1.31,2.5,0.69,0.69,4.82,5.82,1.25,14.83,1.81,8.95,4.44,2.38,0.81,1.13,0.56,3.44,0.13,0.13,0.0,0.0,0.0,0.13,0.0,14.89,5.32,5.57,0.19,0.13,0.69,0.0,0.56,0.88,0.88,0.63,0.06,3.25,0.31,0.19,0.12,0.12,1.75,0.25,0.19,0.0,0.31,0.0,0.0,7.47,3.99,3.49,2.12,0.87,0.37,5.66,4.15,1.51,1.01,0.13,0.0,7.21,1.99,5.22,2.99,1.0,0.75,7.75,6.0,1.75,1.25,0.75,0.0,5.76,3.51,2.26,1.5,0.0,0.0,5.54,4.79,0.76,0.5,0.25,0.0,0.16,-1.98,2.14,2.8,-4.46,7.26,2017,0.5,0.24,0.31,2.0,0.19
2626,2626,22,Mei Lin Neo: The fascinating secret lives of g...,327,TED2017,1492992000,18,Mei Lin Neo,Mei Lin Neo: The fascinating secret lives of g...,1506437715,"[{'id': 8, 'name': 'Informative', 'count': 143...","[{'id': 2823, 'hero': 'https://pe.tedcdn.com/i...",Marine biologist,"['TED Fellows', 'animals', 'biology', 'conserv...","When you think about the deep blue sea, you mi...",https://www.ted.com/talks/mei_lin_neo_the_fasc...,953526,0,0,"Back home, my friends call me nicknames, such...",45,17,12,2,4,47.19326,17.828565,12.584869,"Back home, my friends call me nicknames, such...",knowledge gaps on their ecology and behavior. ...,"Back home, my friends call me nicknames, such...","for those killer clam myths! Unfortunately, th...",knowledge gaps on their ecology and behavior. ...,"reef home, and just having them around keeps t...",701,82.5,59.58,57.35,52.53,14.91,17.26,78.03,47.93,11.55,6.99,3.42,0.86,0.29,0.14,2.28,4.56,6.85,15.12,6.13,4.85,6.7,0.14,11.98,8.56,3.57,0.57,1.71,1.85,3.71,2.57,1.14,0.43,0.43,0.29,6.7,0.43,0.29,0.43,0.29,7.56,2.0,1.28,1.0,1.57,0.71,1.85,1.85,1.0,0.29,0.57,2.57,1.28,1.0,0.0,0.57,5.28,1.57,0.57,2.43,0.43,0.86,3.14,8.56,0.0,15.98,1.71,10.98,3.57,1.28,0.14,0.43,0.43,0.0,0.43,0.29,0.0,0.14,0.14,0.14,0.0,18.4,6.42,5.99,0.14,0.0,0.29,0.29,1.14,2.0,0.43,1.71,0.0,1.57,0.57,0.29,0.0,0.0,0.14,0.14,0.43,0.0,0.0,0.0,0.0,2.84,1.7,1.14,0.28,0.57,0.28,4.58,3.44,1.15,0.57,0.29,0.29,1.69,1.13,0.56,0.0,0.0,0.56,4.0,2.29,1.71,0.57,1.14,0.0,2.3,1.72,0.57,0.0,0.0,0.57,6.86,5.14,1.71,1.14,0.57,0.0,1.74,0.01,1.73,4.01,1.15,2.86,2017,0.86,0.0,0.0,0.28,0.43
2627,2627,21,Anindya Kundu: The boost students need to over...,425,TED Residency,1496707200,15,Anindya Kundu,Anindya Kundu: The boost students need to over...,1506352216,"[{'id': 1, 'name': 'Beautiful', 'count': 102},...","[{'id': 1733, 'hero': 'https://pe.tedcdn.com/i...","Sociologist, educator, writer","['TED Residency', 'education', 'personal growt...",How can disadvantaged students succeed in scho...,https://www.ted.com/talks/anindya_kundu_the_bo...,1161546,0,0,"So, I teach college students about inequality...",56,290,5,1,5,48.211608,249.667254,4.304608,"So, I teach college students about inequality...",obstacles that they were facing and navigate t...,"So, I teach college students about inequality...","grit isn't enough, especially in education. So...",the obstacles that they were facing and naviga...,"tailored mentorship and opportunities, they we...",1152,55.21,80.0,12.09,39.75,20.95,21.18,89.32,55.3,16.49,11.37,2.95,0.61,0.26,5.82,1.74,5.12,5.38,15.36,7.55,5.82,9.11,0.43,15.89,4.25,2.43,2.17,1.82,1.91,3.91,2.34,1.56,0.43,0.17,0.61,13.8,0.61,0.09,2.43,4.08,8.94,2.17,0.87,0.87,2.34,0.61,2.86,2.08,0.78,0.43,0.61,0.95,0.17,0.35,0.17,0.26,10.68,2.08,2.69,4.51,2.43,0.26,6.16,8.59,0.95,12.67,2.08,5.99,4.69,7.99,0.26,0.52,0.69,0.0,0.0,0.17,0.0,0.0,0.09,0.09,0.0,17.1,5.21,7.2,0.0,0.0,0.0,0.0,1.13,0.69,1.82,1.04,0.0,1.48,0.0,0.26,0.0,0.0,0.52,0.0,0.43,0.0,0.0,0.0,0.26,4.33,2.42,1.9,0.52,0.35,0.87,3.48,2.26,1.22,0.35,0.0,0.35,4.53,3.48,1.05,0.35,0.0,1.05,4.14,1.38,2.76,0.69,0.69,0.69,4.14,2.41,1.72,0.0,0.0,0.69,2.81,2.11,0.7,0.7,0.0,0.0,-0.16,-0.68,0.52,-1.37,-0.35,-1.02,2017,0.26,0.0,0.0,0.52,0.43


In [291]:
df.columns

Index(['index', 'comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'published_date',
       ...
       'affect_change_h', 'posemo_change_q', 'negemo_change_q',
       'affect_change_q', 'published_year', 'Harm', 'Fairness', 'Purity',
       'Ingroup', 'Authority'],
      dtype='object', length=187)

In [292]:
df.shape

(2406, 187)

In [293]:
#Dropping the index column
df.drop(columns='index', inplace=True)

In [294]:
df.columns

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'published_date', 'ratings',
       ...
       'affect_change_h', 'posemo_change_q', 'negemo_change_q',
       'affect_change_q', 'published_year', 'Harm', 'Fairness', 'Purity',
       'Ingroup', 'Authority'],
      dtype='object', length=186)

In [295]:
#Checking for null values: collecting the names of the columns that contain at least one null
na_column_list = df.columns[df.isnull().any()].tolist()

In [296]:
na_column_list

[]

In [297]:
#checking the data types
df.dtypes

comments                int64
description            object
duration                int64
event                  object
film_date               int64
languages               int64
main_speaker           object
name                   object
published_date          int64
ratings                object
related_talks          object
speaker_occupation     object
tags                   object
title                  object
url                    object
views                   int64
music                   int64
conversation            int64
transcript             object
persuasive              int64
inspiring               int64
unconvincing            int64
applause                int64
laughter                int64
norm_persuasive       float64
norm_inspiring        float64
norm_unconvincing     float64
transcript_1sthalf     object
transcript_2ndhalf     object
transcript_1q          object
transcript_2q          object
transcript_3q          object
transcript_4q          object
WC        

### Feature Generation: Gender

In order to generate a gender variable, I'll be using the gender-guesser library, which takes a first name as input and returns the most likely matching gender.

In [298]:
!pip install gender-guesser



In [299]:
import gender_guesser.detector as gg
gender = gg.Detector(case_sensitive=False)

In [300]:
#first trial with my name :) it seems to recognize Turkish characters!
print(gender.get_gender(u'Özge'))

female


In [301]:
#creating a column for the first name
df['fname'] = df.main_speaker.str.split(' ').map(lambda x:x[0])

In [302]:
#creating gender column with the gender guesser
df['gender'] = df.fname.map(gender.get_gender)

Besides male and female, the gender-guesser library returns 4 other classes: mostly_male, mostly_female, andy and unknown. andy (androgynous) indicates an equal probability of the given name being male or female, whereas unknown is returned for names that weren't found in the database.

In [303]:
#viewing the results
df[['fname', 'gender', 'main_speaker']].head(20)

Unnamed: 0,fname,gender,main_speaker
0,Ken,male,Ken Robinson
1,Al,male,Al Gore
2,David,male,David Pogue
3,Majora,unknown,Majora Carter
4,Hans,male,Hans Rosling
5,Tony,male,Tony Robbins
6,Julia,female,Julia Sweeney
7,Joshua,male,Joshua Prince-Ramus
8,Dan,male,Dan Dennett
9,Rick,male,Rick Warren


In [304]:
df.gender.value_counts()

male             1346
female            620
unknown           262
mostly_male        89
mostly_female      55
andy               34
Name: gender, dtype: int64

For the rows that were returned as mostly_male and mostly_female, I'll proceed with these suggested genders.

In [305]:
df.gender = df.gender.map(lambda x:'male' if x=='mostly_male' else 'female' if x=='mostly_female' else x)

The gender guesser wasn't able to assign a gender to 296 speakers: 262 names were returned as unknown (not found in the database) and 34 as andy (androgynous). In the next step of my project, I'll check these names for typos or non-ASCII characters and use the genderize.io API to generate the gender for the ones that remain unclassified.
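One way the name check could start is sketched below: skipping a leading title before taking the first name, and producing an accent-free variant to retry when the accented form is not found. The `HONORIFICS` set and both helper names are hypothetical and not exhaustive; whether such cases actually occur in `main_speaker` would need to be verified against the data.

```python
import unicodedata

# Hypothetical, non-exhaustive set of titles to skip
HONORIFICS = {'sir', 'dr', 'dr.', 'prof', 'prof.', 'rev', 'rev.'}

def first_name(full_name):
    """Return the first token that is not a title, or '' if nothing is left."""
    for token in full_name.split():
        if token.lower() not in HONORIFICS:
            return token
    return ''

def to_ascii(name):
    """Strip diacritics so that an accent-free spelling can be looked up as a fallback."""
    decomposed = unicodedata.normalize('NFKD', name)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(first_name('Sir Ken Robinson'))
print(to_ascii('Özge'))
```

Stripping diacritics changes the name, so it is probably best used only as a second attempt after the original spelling has returned unknown.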

In [306]:
df.gender.value_counts()

male       1435
female      675
unknown     262
andy         34
Name: gender, dtype: int64

### Data Dictionary

As the data dictionary for the dataset was available on the data.world website, I wanted to scrape it from there, but wasn't able to do so because the page content is loaded with JavaScript. However, I downloaded the HTML page to my computer and was able to access and parse the HTML on my local machine.

In [24]:
#path to the locally saved copy of the data.world workspace page
html_path = 'owentemple_text-and-content-features-of-most-persuasive-ted-talks _ Workspace _ data.world.htm'
with open(html_path) as f:
    soup = BeautifulSoup(f, 'html.parser')

In [25]:
def scrape(soup):
    """Build a {column name: description} dictionary from the saved data dictionary page."""
    data_dict = {}
    #each row of the data dictionary holds one column name and its description
    for item in soup.find_all('div', {'class': 'FileItem__colRow___1-58V'}):
        k = item.find('div', {'class': 'FileItem__colName___1Rwgm'}).text
        v = item.find('span', {'class': 'Markdown__content___3thyu'}).text
        data_dict[k] = v
    return data_dict

In [26]:
data_dict = scrape(soup)

In [27]:
#adding the 2 newly generated columns to the dictionary
data_dict['fname'] = 'first name of the presenter'
data_dict['gender'] = 'gender of the presenters based on the first name'

#### Final dictionary

In [28]:
pprint(data_dict)

{'achieve': 'a Linguistic Inquiry and Word Count (LIWC) variable. Ratio of '
            'this category of words in the transcript. '
            'https://repositories.lib.utexas.edu/bitstream/handle/2152/31333/LIWC2015_LanguageManual.pdf',
 'adj': 'a Linguistic Inquiry and Word Count (LIWC) variable. Ratio of words '
        'in the verb category in the transcript. '
        'https://repositories.lib.utexas.edu/bitstream/handle/2152/31333/LIWC2015_LanguageManual.pdf',
 'adverb': 'a Linguistic Inquiry and Word Count (LIWC) variable.  Ratio of '
           'adverb words in the transcript '
           'https://repositories.lib.utexas.edu/bitstream/handle/2152/31333/LIWC2015_LanguageManual.pdf',
 'affect': 'a Linguistic Inquiry and Word Count (LIWC) variable. Ratio of the '
           'affect category of words in the transcript. '
           'https://repositories.lib.utexas.edu/bitstream/handle/2152/31333/LIWC2015_LanguageManual.pdf',
 'affect_1h': 'ratio of emotion words in the first hal

In [307]:
#index=False keeps the row index from being written as an extra 'Unnamed: 0' column
df.to_csv('part_2.csv', index=False)