# TED Talk Transcripts

## Getting What I Want Out of the HTML

### Author
```
<h4 class='h12 talk-link__speaker'>Al Gore</h4>
```
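
A minimal sketch of pulling the speaker out of that `h4`, assuming BeautifulSoup is installed and using the built-in `html.parser` backend. The `html` string is just the snippet above, and the variable names are my own.

```python
from bs4 import BeautifulSoup

# The snippet from above, standing in for a full transcript page.
html = "<h4 class='h12 talk-link__speaker'>Al Gore</h4>"

page = BeautifulSoup(html, "html.parser")
# select_one returns the first match, or None if the class is missing.
speaker_tag = page.select_one("h4.talk-link__speaker")
speaker = speaker_tag.get_text(strip=True) if speaker_tag else ""
print(speaker)
```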

### Author and Title
```
<title>Al Gore: Averting the climate crisis | TED Talk Subtitles and Transcript | TED.com</title>
```
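
Since the `title` carries both pieces, a plain string split may be enough. This is a sketch that assumes every page follows the same `Author: Title | ...` pattern as the one above, which I have not checked across the whole collection.

```python
# The <title> text from above.
at = "Al Gore: Averting the climate crisis | TED Talk Subtitles and Transcript | TED.com"

# Everything before the first "|" is "Author: Title".
head = at.split("|", 1)[0]
# partition splits on the first colon only, so colons inside the talk title survive.
author, _, title = head.partition(":")
author, title = author.strip(), title.strip()
print(author, "/", title)
```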

### Date
```
<div class='meta'>
<span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span>
</span>
<span class='meta__row'>
Rated
<span class='meta__val'>
Funny, Informative
</span>
</span>
</div>
```
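
One way to get at the date, sketched under the assumption that the posting date is always the first `span.meta__val` on the page (the second one holds the ratings). Parsing it with `datetime.strptime` is optional, and `%b` depends on an English locale.

```python
from datetime import datetime
from bs4 import BeautifulSoup

# The markup from above.
html = """
<div class='meta'>
<span class='meta__item'>
Posted
<span class='meta__val'>
Jun 2006
</span>
</span>
<span class='meta__row'>
Rated
<span class='meta__val'>
Funny, Informative
</span>
</span>
</div>
"""

page = BeautifulSoup(html, "html.parser")
# select_one returns the first matching span, which is the posting date here.
date_text = page.select_one("span.meta__val").get_text(strip=True)
# "%b %Y" matches an abbreviated month name followed by a four-digit year.
posted = datetime.strptime(date_text, "%b %Y")
print(date_text, posted.year, posted.month)
```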

### Length

Length is probably best estimated by the last of the times:
```
<data class='talk-transcript__para__time'>
15:57
</data>
```
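
Turning that last timestamp into seconds is simple arithmetic. The sketch below uses a hypothetical helper, `to_seconds`, and a made-up list of timestamps that ends with the real `15:57` from above.

```python
def to_seconds(stamp):
    """Convert an 'MM:SS' (or 'H:MM:SS') timestamp to whole seconds."""
    seconds = 0
    for part in stamp.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

# Hypothetical timestamps, as they might come out of the <data> tags (newlines included).
stamps = ["\n00:12\n", "\n07:30\n", "\n15:57\n"]
length = to_seconds(stamps[-1])
print(length)  # 957
```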

### Text

It looks like the actual "talk" is contained in the div `<div class='talk-article__body talk-transcript__body'>`, and paragraphing is achieved with `<span class='talk-transcript__para__text'>`. (It would be nice, perhaps, to keep the paragraphing, but it is, above all, a transcription artifact.)
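
If the paragraphing does turn out to be worth keeping, something like the following might work. The `html` here is a hypothetical, heavily shortened stand-in for the real markup; only the two class names come from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the transcript markup.
html = """
<div class='talk-article__body talk-transcript__body'>
  <span class='talk-transcript__para__text'>First paragraph. (Laughter)</span>
  <span class='talk-transcript__para__text'>Second
  paragraph.</span>
</div>
"""

page = BeautifulSoup(html, "html.parser")
body = page.find("div", class_="talk-transcript__body")
# One list entry per transcript paragraph, with internal whitespace collapsed.
paragraphs = [
    " ".join(span.get_text().split())
    for span in body.find_all("span", class_="talk-transcript__para__text")
]
# Joining with newlines keeps the paragraph breaks in a single string.
text = "\n".join(paragraphs)
print(paragraphs)
```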


## CSV

I think I want all this in a CSV in order to have a data structure. If so, what I want to do: 

* Pull author either from `h4` or from `title` and place it in an *author* column.
* Pull title from `title` and place it in a *title* column.
* Pull date from `meta` and place it in a *date* column.
* Pull the length of the talk from the last timestamp in the transcript (see above) and place it in a *length* column.
* Pull the text from `talk-transcript__body` and place it in a *text* column.

I think I can do all this with **`BeautifulSoup`**, and I imagine I need to:

1. Read a file.
2. Grab these elements and place them in a list.
3. Write the list as a row in a CSV file.
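
The writing step on its own can be sketched with the standard `csv` module. The row below is a placeholder rather than real parsed output, and an in-memory buffer stands in for the output file so the sketch touches nothing on disk.

```python
import csv
import io

# Placeholder row; in practice each value would come from one parsed file.
rows = [
    ("Al Gore", "Averting the climate crisis", "Jun 2006", 957,
     "placeholder transcript text, with a comma"),
]

buffer = io.StringIO()  # stands in for open(path, "w", newline="")
writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL)
writer.writerow(("author", "title", "date", "length", "text"))  # header row
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With `QUOTE_MINIMAL`, only fields that contain the delimiter, a quote, or a line break get quoted, so the text column is safe even when the transcript contains commas.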

So, what I need to do first is come up with the **`BeautifulSoup`** that will do this.

In [45]:
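import csv
import glob
import re
from bs4 import BeautifulSoup

# Open the CSV once, before the loop, so each file adds a row instead of overwriting the last one.
with open('../outputs/tedtalkstest.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    for file in glob.glob("../test/*"):
        # Use a separate name for the parsed page so the class name is not rebound.
        with open(file) as f:
            page = BeautifulSoup(f.read(), "lxml")
        at = page.find("title").text
        author = at[0:at.find(':')].strip()
        title = at[at.find(":")+1 : at.find("|")].strip()
        date = re.sub('[^a-zA-Z0-9]', ' ', page.select_one("span.meta__val").text).strip()
        length_data = page.find_all('data', {'class': 'talk-transcript__para__time'})
        # The last timestamp of the form MM:SS approximates the talk's length.
        stamps = [x.get_text().strip() for x in length_data]
        (m, s) = [t for t in stamps if re.search(r"\d{2}:\d{2}", t)][-1].split(':')
        length = int(m) * 60 + int(s)
        # Drop parenthesized notes, then everything except letters, periods, and apostrophes.
        firstpass = re.sub(r'\([^)]*\)', '', page.find('div', class_='talk-transcript__body').text)
        text = re.sub(r"[^a-zA-Z.']", ' ', firstpass)
        writer.writerow((author, title, date, length, text))

As first written, this cell reused the name `soup` for the parsed page (`soup = soup(...)`), referred to an undefined `first` instead of `firstpass`, and reopened the CSV in `'w'` mode on every pass. After the first file, `soup` was a parsed document rather than the class, and calling a parsed document is shorthand for `find_all`, which returns a `ResultSet`. That is the source of the error recorded at the time:

AttributeError: 'ResultSet' object has no attribute 'find'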

In [53]:
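import glob
from bs4 import BeautifulSoup

for file in glob.glob('../test/*'):
    # Keep the class name and the parsed page separate.
    with open(file) as f:
        page = BeautifulSoup(f.read(), "lxml")
    at = page.title.string
    print(at)

Al Gore: Averting the climate crisis | TED Talk Subtitles and Transcript | TED.com

With `soup = soup(...)` in place of `page = BeautifulSoup(...)`, the first title printed and the second pass through the loop raised:

AttributeError: 'ResultSet' object has no attribute 'title'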

---

In [None]:
# To get rid of digits in a string (text is the transcript string built earlier):

# A generator expression passed to join
clean = ''.join(i for i in text if not i.isdigit())

## Features

* Text length divided by running time (for example, words per minute) would give us a sense of pacing. And if we can pull the popularity of the talks into the analysis, we might learn something about the "preferred" pace for TED talks.
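
A rough sketch of the pacing feature, assuming "text length" means whitespace-separated words and that the running time is already in seconds. The function name and the 2,400-word sample are made up for illustration; only the 957-second length (15:57) comes from the talk above.

```python
def words_per_minute(text, length_seconds):
    """Rough pacing: whitespace-separated words divided by running time in minutes."""
    if length_seconds <= 0:
        raise ValueError("length_seconds must be positive")
    return len(text.split()) / (length_seconds / 60)

# Hypothetical 2,400-word transcript paired with a 957-second running time.
sample_text = "word " * 2400
pace = words_per_minute(sample_text, 957)
print(round(pace, 1))
```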

---

## Operations

### Rename Files

I'm looking for this:

    <link href="http://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript" rel="canonical" />
    
And I want to name the file: `al_gore_on_averting_climate_crisis`.
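
Once the `href` is in hand, the file name is the path segment after `talks`. A small sketch with the standard library, assuming every canonical URL has the shape `/talks/<name>/transcript`; the helper name is my own.

```python
from urllib.parse import urlparse

def slug_from_canonical(href):
    """Return the path segment that follows 'talks' in a canonical TED URL."""
    parts = [p for p in urlparse(href).path.split("/") if p]
    # Expected shape: ["talks", "<name>", "transcript"]
    if len(parts) >= 2 and parts[0] == "talks":
        return parts[1]
    raise ValueError(f"unexpected canonical URL: {href}")

href = "http://www.ted.com/talks/al_gore_on_averting_climate_crisis/transcript"
print(slug_from_canonical(href))
```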

In [None]:
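# Early attempt at re-naming files
import os
from bs4 import BeautifulSoup

filepath = os.path.abspath('/Users/john/Code/ted/transcript?language=en')
with open(filepath) as f:
    page = BeautifulSoup(f.read(), "lxml")
# Match on the rel attribute; a bare string as the second argument is treated as a CSS class.
link = page.find('link', rel='canonical')

# os.rename(old_file_name, new_file_name)
print(link['href'])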

# Early attempt to get the date:
#date_spans = soup.find_all('span', {'class' : 'meta__val'})
#date = [x.get_text().strip("\n\r") for x in date_spans if re.search(r"(?s)[A-Z][a-z]{2}\s+\d{4}", x.get_text().strip("\n\r"))][0]