## TED: Data Cleaning

Assignment prompt:  
Create a short document (1-2 pages) in your github describing the data wrangling steps that you undertook to clean your capstone project data set. What kind of cleaning steps did you perform? How did you deal with missing values, if any? Were there outliers, and how did you decide to handle them? This document will eventually become part of your milestone report.

---

## Data Cleaning

There were two CSV files for this project. One contained all of the metadata and the other contained all of the transcripts. I imported them both into Pandas dataframes. The relevant metadata columns are described here:  

__metadata__  
url: Video URL  
name: Title of talk  
event: Event where the talk was recorded  
ratings: Counts for user-defined ratings (see below for list)  
duration: Video runtime  
num_speaker: Number of main speakers giving talk  
tags: One-word descriptions of the talk

__transcripts__  
url: Video URL   
transcript: Video transcript  

Columns: 9  
Rows: 2550

The __steps__ I took can be summarized as:

A. Duplicates  
B. Data Formatting Changes  
C. Non-TED Data Removal  
D. Missing Data  
E. Positive Rating Percent Creation  
F. Text Cleanup  
G. Text to GloVe Representation  

#### A. Duplicates

In both datasets, I looked for duplicates. There were two duplicates in the transcripts, so I deleted those entries. 
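
A minimal sketch of this check, assuming the transcripts live in a pandas dataframe with the `url` and `transcript` columns described above. The toy rows are made up for illustration and are not the project's data.

```python
import pandas as pd

# Toy stand-in for the transcripts CSV (illustrative rows only).
transcripts = pd.DataFrame({
    "url": ["a", "b", "b", "c"],
    "transcript": ["text a", "text b", "text b", "text c"],
})

# Count rows that exactly repeat an earlier row, then keep the first copy of each.
n_duplicates = int(transcripts.duplicated().sum())
transcripts = transcripts.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(transcripts))
```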

#### B. Data Formatting Changes

The column identifying each TED event contained 355 different entries. I summarized them based on the type of event into the following categories: 'Mission Blue', 'Other', 'Summit', 'TED Global', 'TED Live', 'TED Salon', 'TED Women', 'TED Yearly', 'TED Youth', 'TED@', 'TEDMED', 'TEDx'.  

Column created: __event_type__  
Column removed: __event__
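
One way such a mapping could look is sketched below. The prefix rules, the helper name `categorize_event`, and the sample event names are assumptions for illustration; the actual grouping of the 355 entries was done against the real event names and is likely more detailed.

```python
import pandas as pd

def categorize_event(event: str) -> str:
    """Map a raw event name to a coarse event type (illustrative rules only)."""
    # Check the specific prefixes before the bare "TED<year>" pattern.
    prefix_rules = [
        ("TEDx", "TEDx"),
        ("TEDMED", "TEDMED"),
        ("TED@", "TED@"),
        ("TEDGlobal", "TED Global"),
        ("TEDWomen", "TED Women"),
        ("TEDYouth", "TED Youth"),
        ("TEDSalon", "TED Salon"),
        ("Mission Blue", "Mission Blue"),
    ]
    for prefix, label in prefix_rules:
        if event.startswith(prefix):
            return label
    # Names such as "TED2009" are treated as the main annual conference.
    if event.startswith("TED") and event[3:].isdigit():
        return "TED Yearly"
    return "Other"

# Hypothetical event names, used only to exercise the function.
metadata = pd.DataFrame({"event": ["TED2009", "TEDxBoston", "TEDGlobal 2011", "Some Event"]})
metadata["event_type"] = metadata["event"].map(categorize_event)
metadata = metadata.drop(columns="event")
```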

#### C. Non-TED Data Removal

After summarizing the _event_ column, it became clear that there were non-TED talks in the dataset (e.g., university graduation speeches, talks from TED-like events, and TED-Ed videos, which are studio-produced animated shorts). I removed all videos from non-TED events, per the TED website's "past events" list, which brought the total number of videos down from 2550 to 2456.

Remaining rows: 2456

#### D. Missing Data

There were no missing datapoints in the __metadata__ dataset, but many transcripts were missing from the __transcript__ dataset. Most of these were from non-TED talks, so they were deleted. Of the remaining 47 missing transcripts, many were from musical or dance performances. Still others were from TEDx talks that either had been missed (unlikely) or had not had transcripts made before the data was pulled. I manually checked the URL for every missing transcript and imported the 11 that existed. Talks with no transcript were removed from the dataset.

Remaining rows: 2420
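
The missing transcripts can be located with a left join on `url`. This is a sketch on toy data, assuming both dataframes share the `url` column as listed above; the manual URL check happens between the two steps.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (illustrative rows only).
metadata = pd.DataFrame({"url": ["u1", "u2", "u3"], "name": ["Talk 1", "Talk 2", "Talk 3"]})
transcripts = pd.DataFrame({"url": ["u1", "u3"], "transcript": ["first text", "third text"]})

# A left join keeps every talk; talks without a transcript get NaN.
talks = metadata.merge(transcripts, on="url", how="left")
missing_urls = talks.loc[talks["transcript"].isna(), "url"].tolist()

# After checking the missing URLs by hand, drop whatever is still missing.
talks = talks.dropna(subset=["transcript"]).reset_index(drop=True)
```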

#### E. Positive Rating Percent Creation

The output variable for model assessment is a measure of the relative "success" of a video, defined as the percentage of _positive_ ratings. The _ratings_ column contained counts for every video's ratings in each of the following 14 categories, which have been separated into __positive__ and __negative__ groups:

___positive___: Beautiful, Courageous, Fascinating, Funny, Informative, Ingenious, Inspiring, Jaw-dropping, Persuasive  
___negative___: Confusing, Longwinded, Obnoxious, OK, Unconvincing  

"OK" is ambiguous, but is considered "negative" because it does not show up in TED's search options for "show me a video that is..." assumedly because nobody says, "I want to watch something that's just OK."

Column created:  
__pos_pct__: (float) positive ratings / (positive + negative ratings)
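
The calculation can be sketched as follows. It assumes each entry in _ratings_ has already been parsed into a dict mapping rating name to count; the raw storage format is not described here, and the helper name `positive_fraction` is hypothetical.

```python
import pandas as pd

POSITIVE = ["Beautiful", "Courageous", "Fascinating", "Funny", "Informative",
            "Ingenious", "Inspiring", "Jaw-dropping", "Persuasive"]
NEGATIVE = ["Confusing", "Longwinded", "Obnoxious", "OK", "Unconvincing"]

def positive_fraction(counts: dict) -> float:
    """Return positive / (positive + negative) for one talk's rating counts."""
    pos = sum(counts.get(name, 0) for name in POSITIVE)
    neg = sum(counts.get(name, 0) for name in NEGATIVE)
    total = pos + neg
    # A talk with no ratings at all has an undefined fraction.
    return pos / total if total else float("nan")

# Assumed format: one dict of counts per talk (illustrative numbers).
talks = pd.DataFrame({"ratings": [
    {"Funny": 30, "Inspiring": 50, "OK": 15, "Confusing": 5},
    {"Beautiful": 1, "Longwinded": 3},
]})
talks["pos_pct"] = talks["ratings"].map(positive_fraction)
```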

#### F. Text cleanup

In sum, text cleanup consisted of: separating joined-together words, removing interjections, removing lyrics, and creating counts of different text attributes.

The imported transcripts were collected in such a way that their newlines were removed. This caused the last word of every line to be joined with the first word of the next line: "...end of this line.The line begins...." Most interjections (e.g., "(Applause)") were given their own line in the transcripts, compounding the problem: "...end of this line.(Applause)The line begins...." To fix this, a space was placed after every terminal punctuation mark (period, question mark, exclamation point) and before and after every open parenthesis. 
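
A regex-based sketch of this repair is shown below. The exact patterns are assumptions: here a space is added after terminal punctuation that is glued to a letter or an opening parenthesis, before a glued opening parenthesis, and after a closing parenthesis glued to a letter. The function name `restore_spacing` is hypothetical.

```python
import re

def restore_spacing(text: str) -> str:
    """Re-insert spaces lost when the transcript's newlines were stripped."""
    # Space after ., ?, or ! when the next character is a letter or "(".
    text = re.sub(r"([.?!])(?=[A-Za-z(])", r"\1 ", text)
    # Space before "(" when it is glued to the preceding text.
    text = re.sub(r"(?<=\S)\(", " (", text)
    # Space after ")" when it is glued to the following word.
    text = re.sub(r"\)(?=[A-Za-z])", ") ", text)
    return text

fixed = restore_spacing("end of this line.(Applause)The line begins.")
print(fixed)
```

A rule this simple also splits abbreviations such as "U.S.A", so a real pass would probably need a few exceptions.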

The transcripts contained 820 distinct interjections. Many of them were not part of the spoken text, and those that showed up in at least 10 transcripts were removed. 
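
The document-frequency filter could be sketched like this. It assumes an interjection is any parenthesized span and that frequency is counted once per transcript; the function names and the toy transcripts are hypothetical.

```python
import re
from collections import Counter

# Any non-nested parenthesized span is treated as an interjection (an assumption).
PAREN = re.compile(r"\([^()]*\)")

def common_interjections(transcripts, min_docs=10):
    """Return interjections that appear in at least `min_docs` transcripts."""
    doc_freq = Counter()
    for text in transcripts:
        # set() so each interjection counts once per transcript.
        doc_freq.update(set(PAREN.findall(text)))
    return {tag for tag, n in doc_freq.items() if n >= min_docs}

def strip_interjections(text, to_remove):
    """Delete the listed interjections and collapse the leftover whitespace."""
    cleaned = PAREN.sub(lambda m: "" if m.group(0) in to_remove else m.group(0), text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

docs = ["Hi (Applause) there"] * 10 + ["A (rare aside) b"]
frequent = common_interjections(docs)
result = strip_interjections("Hi (Applause) there (rare aside)", frequent)
```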

Some talks consisted mostly of song lyrics, which were surrounded by "♫" symbols. All song lyrics were removed. Every transcript with fewer than 1000 remaining characters was then removed, because these consisted of singers or dancers introducing themselves or saying a brief word about the meaning of their song/dance. 
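
A sketch of the lyric removal and length filter, assuming each lyric passage is wrapped in a matching pair of "♫" symbols and that the transcripts sit in a `transcript` column. The constant `MIN_CHARS` and the function name are hypothetical.

```python
import re
import pandas as pd

LYRICS = re.compile(r"♫[^♫]*♫")  # one lyric passage between a pair of ♫ symbols
MIN_CHARS = 1000                 # transcripts shorter than this are dropped

def remove_lyrics(text: str) -> str:
    """Delete lyric passages and collapse the whitespace they leave behind."""
    return re.sub(r"\s{2,}", " ", LYRICS.sub(" ", text)).strip()

# Toy transcripts: one that is mostly lyrics, one long spoken talk.
talks = pd.DataFrame({"transcript": ["♫ la la la ♫ Thank you.", "x" * 1500]})
talks["transcript"] = talks["transcript"].map(remove_lyrics)
talks = talks[talks["transcript"].str.len() >= MIN_CHARS].reset_index(drop=True)
```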

Finally, because things like music, dance, and video have an impact that we cannot get through the transcript, getting a count of the number of times they are shown could be useful. Similarly, perhaps the number of times "(Laughter)" or "(Applause)" shows up could give us an indicator of how positively or negatively a talk was received. Columns were created for counts of laughter and applause, as well as boolean columns for music, dance, and video.

Words were not lemmatized, and punctuation and stopwords were not removed. These steps may be taken later, but the GloVe representation was trained on data that included these items.

Remaining rows: 2382  
Columns created:   
__has_music__: (bool) video contains music  
__has_dance__: (bool) video contains dancing  
__has_audio_video__: (bool) video contains non-music audio/video reference  
__num_chars__: (int) number of characters in transcript  
__laughter__: (int) transcript count for __(Laughter)__  
__applause__: (int) transcript count for __(Applause)__  
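
These columns can be derived with pandas string methods, as sketched below. The `(Laughter)` and `(Applause)` markers come from the text above; the `(Music` pattern used for `has_music` is an assumed marker, since the exact strings used to detect music, dance, and video are not listed here.

```python
import pandas as pd

# Toy transcripts (illustrative only).
talks = pd.DataFrame({"transcript": [
    "Hello. (Laughter) Thanks. (Applause) (Laughter)",
    "(Music) A quiet talk.",
]})
text = talks["transcript"]

# Integer counts of the audience-reaction markers.
talks["laughter"] = text.str.count(r"\(Laughter\)")
talks["applause"] = text.str.count(r"\(Applause\)")

# Boolean flag; "(Music" is an assumed marker, not confirmed by the source.
talks["has_music"] = text.str.contains(r"\(Music", regex=True)

# Transcript length in characters.
talks["num_chars"] = text.str.len()
```

These counts have to be taken while the markers are still present in the text.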

#### G. Text to GloVe Representation

Texts were parsed using spaCy, which provides a 300-dimension GloVe representation for each word, trained on Common Crawl data with 685k keys and 685k distinct vectors. Essentially, each token (word or punctuation mark) is given a 300-dimension vector to describe its relationship with all other words in all other contexts. The dimensions of these vectors are not ordered and do not have any interpretable meaning on their own. Every document was given its own vector by summing all of its tokens' vectors. All 300 values were added to the dataframe for each talk.

Columns created:  
__#(1-300)__: (float) transcript GloVe value for given dimension
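
The summation step can be illustrated without loading a spaCy model. The sketch below uses made-up 3-dimensional vectors in place of the 300-dimensional GloVe vectors, and treats tokens without a vector as zeros, which is an assumption. In spaCy itself, the per-token values would come from `token.vector`; as far as I know, `doc.vector` defaults to an average, so a sum has to be computed explicitly.

```python
import numpy as np

# Hypothetical 3-d embeddings standing in for the 300-d GloVe vectors.
toy_vectors = {
    "ideas": np.array([0.1, 0.3, -0.2]),
    "worth": np.array([0.0, -0.1, 0.4]),
    "spreading": np.array([0.5, 0.2, 0.1]),
}

def document_vector(tokens, vectors, dim):
    """Sum the vectors of all tokens; unknown tokens contribute zeros."""
    total = np.zeros(dim)
    for tok in tokens:
        total += vectors.get(tok, np.zeros(dim))
    return total

doc_vec = document_vector(["ideas", "worth", "spreading", "!"], toy_vectors, dim=3)
print(doc_vec)
```

Because the vectors are summed rather than averaged, longer transcripts tend to produce larger values in every dimension.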



## Data Visualization and Outlier Detection

After the data cleaning steps above, the dataset had the following properties:  

2382 rows x 331 columns  

Text Columns:  
__url__, __name__, __tags__, __transcript__

Categorical/Boolean Columns:  
__event_type__, __has_music__, __has_dance__, __has_audio_video__, __num_speaker__

Numerical Columns:  
__duration__, __pos_pct__, __laughter__, __applause__, __num_chars__, __#(1-300)__

For each continuous variable except the GloVe vectors, I created a histogram and a boxplot, identified outliers based on the IQR, and computed basic statistics such as min, max, variance, mean, and median. I also created a scatterplot for each pair of those variables and calculated their correlation using Spearman's rho. I repeated these steps for each level of the categorical columns and ran the Kruskal-Wallis H test (alpha = 0.05), followed by Mann-Whitney U tests to identify significant pairs. For __event_type__, where there were 12 groups to compare pairwise, I applied a Bonferroni correction to alpha.  
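
A few of these checks are sketched below on toy data. The 1.5 x IQR fence is the conventional rule and is an assumption here, since the multiplier is not stated; the numbers in the dataframe are made up. The Bonferroni arithmetic assumes all pairs among the 12 __event_type__ groups are compared.

```python
import math
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative values only, not the project's data.
talks = pd.DataFrame({
    "duration": [600, 720, 780, 840, 900, 960, 1020, 3600],
    "num_chars": [9000, 10500, 11800, 12500, 13900, 14100, 15500, 52000],
})

outlier_mask = iqr_outliers(talks["duration"])

# Spearman's rho: rank-based, so it suits the skewed distributions.
rho = talks[["duration", "num_chars"]].corr(method="spearman").loc["duration", "num_chars"]

# Bonferroni correction for all pairwise comparisons among 12 groups.
n_pairs = math.comb(12, 2)
alpha_adjusted = 0.05 / n_pairs
```

With 12 groups there are 66 pairs, so each pairwise test is held to an alpha of roughly 0.00076.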

Major takeaways from this process:

1. __duration__ and __num_chars__ are highly correlated (rho=0.9), so __num_chars__ was removed to avoid multicollinearity.  
2. None of the continuous distributions look remotely normal; they tend to skew heavily to one end of their range or the other.  
3. The distributions for __laughter__ and __applause__ contain enough zeros that their medians are 2 and 1, respectively. This indicates that transcribers were not consistent in including these kinds of audience interactions, and the columns should perhaps be excluded. They will be kept for now, but their use in prediction may be inconsistent.
4. No outliers were far-reaching enough to indicate mislabeled or miscalculated data. However, some observations had a very large __duration__. TED talks are supposed to last no more than 18 minutes, according to the official submission guidelines, yet some _TED Yearly_, _TED Global_, and _Other_ talks lasted upwards of 30 minutes and in some cases an entire hour. These were special, out-of-the-ordinary talks that are not representative of the whole; they were usually on controversial topics (e.g., "Militant Atheism") and given by renowned individuals (e.g., Al Gore). Because of this, all 38 talks with a duration of more than 25 minutes were removed. This cutoff was chosen to allow for a talk running over time and for intro/outro filler on either end of the actual video. Also, at 25 minutes, the number of outliers drops to 0 and the distribution looks bimodal with no significant tails. 
5. There are significant differences in __duration__ and __pos_pct__ among categorical variable groups (especially __event_type__), which is expected and suggests that these variables will be useful in creating a predictive model. These differences are present even after removing overly long talks.

Column removed: __num_chars__  
Remaining rows: 2344

Final shape: 2344 x 330
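
The final filtering step can be sketched as follows. It assumes __duration__ is stored in seconds, which is not stated in this document, and uses made-up rows; `MAX_SECONDS` is a hypothetical constant name.

```python
import pandas as pd

MAX_SECONDS = 25 * 60  # 25-minute cutoff, assuming duration is in seconds

# Illustrative rows only.
talks = pd.DataFrame({
    "duration": [540, 1080, 1500, 1501, 3600],
    "num_chars": [8000, 16000, 22000, 22100, 50000],
})

# Keep talks of at most 25 minutes, then drop the collinear column.
talks = talks[talks["duration"] <= MAX_SECONDS].drop(columns="num_chars")
talks = talks.reset_index(drop=True)

print(talks.shape)
```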