In [3]:
import pandas as pd
import numpy as np

## Abstract
TED talks have become a standard for how to deliver informational talks. All of these talks are available online in digital format for public viewing. We use the transcripts and website data for all TED talks given up to September 2017. Each video on the website comes with suggested similar videos and informative topic tags. We present an improvement to the suggested videos, and we consolidate a base of more useful tags, by using bisecting k-means and LDA-GS respectively.

# Introduction

## Motivation
We wish to create a labeling and clustering process for documents, videos, or talks based solely on their transcripts. Our secondary objective is for this process to require minimal or no human intervention. We will use the transcripts of all the TED talks published up to September 2017 (roughly 2400).

An automatic classification method that needs no human input would allow better suggestions for the online audience, would provide a **classification free of human bias**, and would generalize to other fields and problems. Such a method would give a machine the tools to sort through thousands of legal documents, health documents, speech transcripts, or even books, and would let the scholar or reader approach them seamlessly.

## TED Talks
Since 1984, TED has become an iconic conference in which experts from around the world present their ideas and analyses in the fields of technology, entertainment, and design (T.E.D.). Since 2006, the conference has made every talk public and free by publishing it on its website. Given its history and prestige, TED talks have become a standard of quality for delivering an informational talk to an audience.

# Data
## Description
We will describe this data more thoroughly as we develop the visualizations, but first we highlight some facts that help in understanding it. The most common speaker occupation is "Writer" with 45 occurrences, followed by "Designer" with 34. There are 1458 distinct occupations among the speakers. The average talk is 13.7 minutes long; the shortest is 2.25 minutes and the longest, given by the author of "The Hitchhiker's Guide to the Galaxy", is about 1.5 hours. The average TED talk has been translated into 27 languages, and 86 talks have no assigned language. These 86 talks are mainly musical presentations, and their mean number of views is significantly lower than average. The average number of views is 1.6 million, and the most-viewed talk, "Do schools kill creativity?", has been seen almost 50 million times.

An important variable to understand is "ratings". It is a categorical description of the talk, based on the opinions of the online audience. After watching the video, the viewer is asked: *"How would you describe this talk? Tell us by choosing up to three words. (If you choose just one, it will count three times.)"* The viewer is then offered 14 adjectives to describe the talk and can choose at most 3 of them. The same 14 options are offered to all viewers.

With regard to the video tags, TED assigns each video a number of tags that link the talk to different topics. The average video has 7.56 tags; some videos have over 30 tags and some have just one. The other important variable is the related talks: every video is linked to 6 other videos that are suggested for the viewer to watch. We propose to improve these two features by producing more specific tags that might be more useful to the viewer, and by testing whether the related talks are linked to the number of views the talks have.

## Variables

In [None]:
""" TEX
\begin{tabular}{llll}
\toprule
{} &         Variable &  Type &                              Description \\
\midrule
0  &           title: &   str &                    The title of the talk \\
1  &     description: &   str &        A blurb of what the talk is about \\
2  &    main\_speaker: &   str &      The first named speaker of the talk \\
3  &     speaker\_occ: &   str &       The occupation of the main speaker \\
4  &     num\_speaker: &   int &       The number of speakers in the talk \\
5  &        duration: &   int &      The duration of the talk in seconds \\
6  &       film\_date: &   int &     The filming date as a Unix timestamp \\
7  &  published\_date: &   int &  The online publication date as a Unix timestamp \\
8  &        comments: &   int &  The number of comments made on the talk \\
9  &       languages: &   int &   Number of languages available for talk \\
10 &         ratings: &  dict &    The various ratings given to the talk \\
11 &             url: &   url &                      The URL of the talk \\
12 &           views: &   int &          The number of views on the talk \\
13 &   related\_talks: &  dict &      List of dicts for the 6 related talks \\
14 &            tags: &  list &      The themes associated with the talk \\
\bottomrule
\end{tabular}
"""
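
Since `duration`, `film_date`, and `published_date` are stored as plain integers, they usually need converting before they can be read or plotted. The following is a minimal sketch using only the standard library. The helper name `describe_talk` and the input values are illustrative assumptions, not rows taken from the dataset.

```python
from datetime import datetime, timezone


def describe_talk(duration_s, published_ts):
    """Convert a duration in seconds and a Unix timestamp to readable values.

    Hypothetical helper for illustration; it is not part of the dataset tooling.
    """
    minutes = round(duration_s / 60, 2)
    # Interpret the timestamp as UTC so the result does not depend on the
    # local time zone of the machine running the notebook.
    published = datetime.fromtimestamp(published_ts, tz=timezone.utc)
    return minutes, published.date().isoformat()


# Made-up example values: an 822-second talk published at timestamp 1500000000.
example = describe_talk(822, 1500000000)
print(example)
```

Working in UTC is a design choice of this sketch; the dataset description does not state which time zone the timestamps refer to.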

# Results

# Analysis

# Conclusion

# Introduction

## Algorithm Approach:
The process is divided into two parts.

The first part is the labeling process, which is split into three steps. The first step applies Latent Dirichlet Allocation (LDA) with Gibbs sampling to find the prevalent topics in the whole corpus of TED talks. This gives us, for each topic, a list of words that describes it. The second step finds a label for each topic: given the list of words assigned to a topic, we implement an algorithm that assigns a macro concept encapsulating all of those words. For example, our LDA may output the following list: `['government', 'party', 'elections', 'voting', 'candidate']` 
Step two will label this as `Politics`. The third step applies these topic labels to the clusters produced by part two.
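
To make the first labeling step concrete, here is a small sketch of a collapsed Gibbs sampler for LDA in plain Python. It is not the implementation used in this project: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns one word-count Counter per topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # tokens assigned to topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


# Toy corpus (made up); real input would be the tokenized transcripts.
docs = [
    "government party elections voting candidate".split(),
    "elections candidate party government voting".split(),
    "galaxy stars planets telescope orbit".split(),
    "planets orbit stars galaxy telescope".split(),
]
topics = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topics):
    print(k, [w for w, _ in counts.most_common(5)])
```

The word lists printed per topic play the role of the lists described above, which step two then maps to a label such as `Politics`.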

The second part is twofold. First, we use a bisecting K-Means algorithm to determine the main clusters in the text. Second, we use cosine similarity to find related documents.
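
As a rough illustration of bisecting K-Means, the sketch below starts with all points in one cluster and repeatedly splits the cluster with the largest sum of squared errors (SSE) using a 2-means step. The choice of SSE as the splitting criterion, the deterministic seeding, and the toy 2-D points are assumptions of this sketch; in the actual pipeline the rows would be document vectors.

```python
import numpy as np


def two_means(X, n_iter=20):
    """Split the rows of X into two groups with plain k-means (k = 2)."""
    # Deterministic seeding: the point farthest from the mean, then the
    # point farthest from that first seed.
    first = X[((X - X.mean(axis=0)) ** 2).sum(axis=1).argmax()]
    second = X[((X - first) ** 2).sum(axis=1).argmax()]
    centers = np.stack([first, second])
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def sse(X):
    """Sum of squared distances from the rows of X to their mean."""
    return ((X - X.mean(axis=0)) ** 2).sum()


def bisecting_kmeans(X, n_clusters):
    """Repeatedly split the cluster with the largest SSE in two."""
    X = np.asarray(X, dtype=float)
    clusters = [np.arange(len(X))]  # each entry holds row indices
    while len(clusters) < n_clusters:
        worst = max(range(len(clusters)), key=lambda c: sse(X[clusters[c]]))
        idx = clusters.pop(worst)
        if sse(X[idx]) == 0:  # identical points: nothing left to split
            clusters.append(idx)
            break
        labels = two_means(X[idx])
        clusters += [idx[labels == 0], idx[labels == 1]]
    out = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        out[idx] = c
    return out


# Toy data: three well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [20, 0], [21, 0], [20, 1]]
labels = bisecting_kmeans(points, n_clusters=3)
print(labels)
```

Splitting the cluster with the largest SSE is one common choice; splitting the largest cluster by size is another, and the text does not say which variant is used here.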

1. Labeling
    1. LDA - Prevalent Topics
    2. Wiki, GloVe - Label the topics
    3.  - Label the clusters
2. Clustering
    1. Bisecting KMeans - Main Clusters
        1. Count Vectorizer
        2. sklearn.feature_extraction.text.CountVectorizer
    2. Cosine Similarity - Related Talks
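
The last item of the outline, finding related talks with cosine similarity on word counts, can be sketched as follows. To stay self-contained, this version uses `collections.Counter` as a stand-in for the count vectors that `CountVectorizer` would produce. The function names and the toy transcripts are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related(transcripts, query, top_n=2):
    """Rank the other transcripts by similarity to transcripts[query]."""
    vectors = {name: Counter(text.lower().split())
               for name, text in transcripts.items()}
    scores = {name: cosine_similarity(vectors[query], vec)
              for name, vec in vectors.items() if name != query}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up miniature "transcripts" keyed by a hypothetical talk id.
toy = {
    "a": "schools teach creativity and art",
    "b": "creativity in schools matters",
    "c": "galaxies and stars and planets",
}
print(related(toy, "a", top_n=1))
```

In this toy example the shared word "and" still gives talks `a` and `c` a nonzero score, which is why a real pipeline would typically remove stop words before counting.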

# TED Talks Data
## Data Description
### Main Dataset
The main dataset contains descriptive information for 2550 TED Talks.

### Transcript Dataset
The transcript dataset contains the transcripts for 2467 TED talks, three of which are duplicates. To keep the data homogeneous, we analyze only the talks found in both datasets. The 86 talks for which there is no transcript are the most recent ones. We therefore discarded the duplicated transcripts and the talks that have no transcript.
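
The cleaning described above (dropping duplicated transcripts and keeping only talks present in both datasets) could look roughly like this in pandas. The toy frames and the use of `url` as the join key are assumptions of this sketch; the text does not state which column links the two datasets.

```python
import pandas as pd

# Toy stand-ins for the two tables; the real ones would be read from file.
main = pd.DataFrame({
    "url": ["u1", "u2", "u3", "u4"],
    "title": ["A", "B", "C", "D"],
})
transcripts = pd.DataFrame({
    "url": ["u1", "u2", "u2", "u3"],
    "transcript": ["text a", "text b", "text b", "text c"],
})

# Drop repeated transcript rows, then keep only talks present in both tables.
transcripts = transcripts.drop_duplicates(subset="url")
talks = main.merge(transcripts, on="url", how="inner")
print(len(talks))
```

An inner join discards talk `u4`, which has no transcript, in the same way the most recent talks are discarded from the real data.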

##### Source:
The data was scraped from the official TED website and is available under the Creative Commons License. It was retrieved from the Kaggle featured datasets in October 2017.
