Some exploratory data analysis of 2500+ TED talk transcripts

parkeraddison/ted-talks-analysis

ted-talks-analysis

Some exploratory data analysis of TED talk transcripts. I came up with this project so that I could work on:

  • Webscraping
  • Simple text analysis
  • NLTK library for Python
  • Principal component analysis
  • Data visualization and interactivity
  • LSTM neural net to generate original text from a corpus

Gameplan:

  • Scrape all (currently 2888) English transcripts of TED talks
  • Included data:
    • Title (and talk ID)
    • Speaker (and speaker ID)
    • # of views on TED
    • # of comments on TED
    • date published
    • TED tags (topics)
    • original language (specific word usage will be skewed by translation)
    • video length (to calculate rough wpm)
    • Event (e.g. TEDx vs TED Global)
    • Category (reader ratings, e.g. 'inspirational' or 'confusing')
  • Run some basic analysis (duration, length, common words, past/future tense, passive/active voice, etc.)
  • Break into categories (such as top ranked 'inspirational') and try to use PCA to find the specific distinguishing words
  • Feed the text into an LSTM and churn out a pseudo TED talk
  • Create a nice interactive visualization of the results
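
As a rough illustration of the "basic analysis" step, here is a minimal sketch of the words-per-minute and common-word calculations using only the Python standard library. The record layout (`title`, `duration_seconds`, `transcript`) and the function names are assumptions made for this example, not the schema or code used in this repo, and the simple regex tokenizer stands in for whatever NLTK tokenizer the project ends up using.

```python
import re
from collections import Counter

# Hypothetical talk record; field names are assumptions, not the repo's schema.
talk = {
    "title": "Example talk",
    "duration_seconds": 120,
    "transcript": "We choose to go. We choose to go because it is hard.",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def words_per_minute(transcript, duration_seconds):
    """Rough speaking rate: word count divided by video length in minutes."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return len(tokenize(transcript)) / (duration_seconds / 60)


def common_words(transcript, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokenize(transcript)).most_common(n)


if __name__ == "__main__":
    print(words_per_minute(talk["transcript"], talk["duration_seconds"]))
    print(common_words(talk["transcript"]))
```

In practice a stop-word filter would probably be needed before counting, since words like "the" and "to" dominate raw frequency counts in any transcript.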
