$\Large{\text{Web Scraping Ted Talks.}}$

See [Ted Talks Website here](https://www.ted.com/talks?event=tedx&sort=newest).


__*NOTE:*__ 
<br></br>
__The Ted talks are updated frequently. At the time this notebook was made, there were 17 total pages and the first 16 of those had 36 talks each (6 columns by 6 rows). Page 17 had fewer talks. The order and total number of talks on these pages will change with time.__ 


When you go to the [Ted Talks Website](https://www.ted.com/talks?event=tedx&sort=newest), you can see that the videos are laid out 6 columns across and 6 rows down, for 36 talks per page. If you scroll to the bottom of the site, you can see that there are 17 pages in total. The final page does not contain 36 talks; it has only 28. 

In this notebook I am going to show an example of web scraping. Using the first page of the 17, I will grab some information on the first video. At the end of this notebook, I will build some code that automatically crawls through all 17 pages and grabs the same data for each video on each page. The data are stored in an `OrderedDict` during the looping process and then loaded into a `pandas` dataframe at the end. I export this dataframe as a `.csv` file and examine the data in another Jupyter notebook.


$\Huge\color{blue}{\text{Screenshots of Ted Talks Website}}$

<p>
<img src="images/ted_talks.png" style="width: 2000px;"/>
 <em> </em>
</p>

$\huge\color{maroon}{\text{There are 17 pages}}$

<p>
<img src="images/ted_talks_pages.png" style="width: 2000px;"/>
 <em> </em>
</p>

$\large\color{maroon}{\text{As can be seen in this last photo, some of the talks have "Rated" keywords}}$
$\large\color{maroon}{\text{while others do not.}}$


## What Information do we want on each video?
1. Page Number. I give each page crawled a number ranging from 1-17. This is a quick way to ID the page. 
2. Link of the page crawled
3. Title of Talk
4. Speaker
5. Date Talk was Posted
6. Rated keywords, those unrated are left blank
7. Duration of Talk 
8. Link to talk

Item #2 is what we pass to `requests.get` and item #1 is the number we assign to each page as we iterate through all 17 pages. We are web scraping the information for items 3-8 on the list above.

We use the `Beautiful Soup` module to parse the HTML data.


[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."


$\large{\textbf{Since none of the talks on page 1 are rated, I will use page 2 in this example.}}$

<font size=4>
    
If we take a look at the source code for the first video on the first page, we see it begins with `<div class='col'>`, as do all the other videos on the page. We can choose any tag and class to tell Beautiful Soup where to begin scraping. We choose the `div` tag with the class `'m3'`: `<div class='m3'>`. Once this `div` ends with its corresponding `</div>`, Beautiful Soup stops reading for that result. 
**Make sure you choose one that captures *all* the information you need.**
    
    
</font>

```
<div class='col'>
<div class='m3'>
<div class='talk-link'>
<div class='media media--sm-v'>
<div class='media__image media__image--thumb talk-link__image'>
<a class=' ga-link' data-ga-context='talks' href='/talks/george_blair_west_3_ways_to_build_a_happy_marriage_and_avoid_divorce'>
<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" play="673" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/2ead0f81-93cc-4da1-80a5-e89b1ebdd386/GeorgeBlairWest_2017X-embed-new.jpg?quality=89&amp;w=320" crop="top" /><span class="thumb__aligner"></span></span></span><span class="thumb__duration">11:13</span></span>
</a>
</div>
<div class='media__message'>
<h4 class='h12 talk-link__speaker'>George Blair-West</h4>
<h4 class='h9 m5'>
<a class=' ga-link' data-ga-context='talks' href='/talks/george_blair_west_3_ways_to_build_a_happy_marriage_and_avoid_divorce'>
3 ways to build a happy marriage and avoid divorce
</a>
</h4>
<div class='meta'>
<span class='meta__item'>
Posted
<span class='meta__val'>
Jan 2019
</span>
</span>
</div>
</div>
</div>
</div>
```
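To illustrate how the choice of container affects what Beautiful Soup returns, here is a minimal sketch that parses a hand-written HTML string (a trimmed stand-in for two talk cards, not the live page). The string and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two talk cards; not fetched from ted.com.
html = """
<div class='col'><div class='m3'><div class='talk-link'>
  <div class='media media--sm-v'><h4 class='h12 talk-link__speaker'>Speaker A</h4></div>
</div></div></div>
<div class='col'><div class='m3'><div class='talk-link'>
  <div class='media media--sm-v'><h4 class='h12 talk-link__speaker'>Speaker B</h4></div>
</div></div></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Either container works here because both wrap the whole card.
outer = soup.find_all('div', attrs={'class': 'm3'})
inner = soup.find_all('div', attrs={'class': 'media media--sm-v'})

# Each result is searched on its own, so find() stays inside one card.
speakers = [card.find('h4').text for card in outer]
print(len(outer), len(inner), speakers)
```

Both filters return one result per card, so either one can serve as the starting point as long as it encloses every field you need.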

In [4]:
website = "https://www.ted.com/talks?event=tedx&page=2&sort=newest"

In [5]:
import requests
from bs4 import BeautifulSoup

In [6]:
r = requests.get(website)

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')

In [45]:
# find all the talk containers: every div whose class attribute is 'media media--sm-v'.
results = soup.find_all('div', attrs={'class':'media media--sm-v'}) 

In [46]:
len(results)  # 36

36

__We will use the third result (index 2) as an example since it is rated.__

In [71]:
res = results[2]

In [72]:
res

<div class="media media--sm-v">
<div class="media__image media__image--thumb talk-link__image">
<a class=" ga-link" data-ga-context="talks" href="/talks/christine_porath_why_being_nice_to_your_coworkers_is_good_for_business">
<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">15:24</span></span>
</a>
</div>
<div class="media__message">
<h4 class="h12 talk-link__speaker">Christine Porath</h4>
<h4 class="h9 m5">
<a class=" ga-link" data-ga-context="talks" href="/talks/christine_porath_why_being_nice_to_your_coworkers_is_good_for_business">
Why being respectful to your coworkers is good for business
</a>
</h4>
<div class="meta">
<span class="me

__I cleaned up this talk's HTML tags so that each tag is on its own line. This makes it easier to see where each tag opens and closes. Each opening tag has a closing tag. For example, the very first tag `<div class="media media--sm-v">` doesn't close until the very end of Part 2 (`</div>`), which is what we'd expect since this is the one we told `Beautiful Soup` to use for separating the talks.__
**The last `</div>` in Part 1 closes `<div class="media__image media__image--thumb talk-link__image">`.**


## Part 1

    <div class="media media--sm-v">
    <div class="media__image media__image--thumb talk-link__image">
    <a class=" ga-link" data-ga-context="talks" href="/talks/christine_porath_why_being_nice_to_your_coworkers_is_good_for_business">
    <span class="thumb thumb--video thumb--crop-top">
    <span class="thumb__sizer">
    <span class="thumb__tugger">
    <img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/>
    <span class="thumb__aligner">
    </span>
    </span>
    </span>
    <span class="thumb__duration">15:24
    </span>
    </span>
    </a>
    </div>


__First 'a' tag holds the link under href.__
<br></br>
`res.find('a')['href']`


__The fifth span tag holds the duration in mm:ss format, e.g. 15:24.__
<br></br>
`res.find(name='span', attrs={'class': 'thumb__duration'}).text`
<br></br>
`res.findAll('span')[4].text`
<br></br>
`res.findChildren(name='span', attrs={'class': 'thumb__duration'})[0].text`
<br></br>
`res.findChild(name='span', attrs={'class': 'thumb__duration'}).text`
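Since the duration arrives as text, it can be handy to convert it to seconds for later analysis. The helper below is a small sketch; the name `duration_to_seconds` is made up here, and support for an `h:mm:ss` form is an assumption, since only `mm:ss` values appear in this notebook.

```python
def duration_to_seconds(duration: str) -> int:
    """Convert 'mm:ss' (or, by assumption, 'h:mm:ss') text into total seconds."""
    seconds = 0
    for part in duration.strip().split(':'):
        # Each new field is 60 times finer than the one before it.
        seconds = seconds * 60 + int(part)
    return seconds

print(duration_to_seconds('15:24'))  # 924
```

The result for `15:24` equals the `play="924"` value on the `img` tag in Part 1, which suggests that attribute holds the duration in seconds.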




## Part 2

    <div class="media__message">
    <h4 class="h12 talk-link__speaker">Christine Porath
    </h4>
    <h4 class="h9 m5">
    <a class=" ga-link" data-ga-context="talks" href="/talks/christine_porath_why_being_nice_to_your_coworkers_is_good_for_business">
    Why being respectful to your coworkers is good for business
    </a>
    </h4>
    <div class="meta">
    <span class="meta__item">
    Posted
    <span class="meta__val">
    Oct 2018
    </span>
    </span>
    <span class="meta__row">
    Rated
    <span class="meta__val">
    Informative, Inspiring
    </span>
    </span>
    </div>
    </div>
    </div>

__Second 'a' tag also holds the link.__
<br></br>
`res.findAll('a')[1]['href']`
<br></br>
`res.findChildren('a')[1]['href']`
<br></br>
`res.find('a')` # won't work here because it only returns the first 'a' tag.
<br></br>
`res.findChild('a')` # won't work here because it only returns the first 'a' tag.


__Talk Title, same tag as link.__
<br></br>
`res.findChildren('a')[1].text`
<br></br>
`res.findAll('a')[1].text`

__Speaker name is in the first h4 tag.__
<br></br>
`res.find('h4').text`
<br></br>
`res.findAll('h4')[0].text` # not needed because it is the first h4 tag to occur.
<br></br>
`res.findChild('h4').text`
<br></br>
`res.findChildren('h4')[0].text` # not needed because it is the first h4 tag to occur.

__Date the talk was posted, and the rated keywords__
<br></br>
`len(res.findAll('div'))`  # 3
<br></br>
`len(res.findAll('span'))`  # 9
<br></br>
Either `findAll` or `findChildren` can be used with the same result. 
<br></br>

The first instinct is to use the `div` tag since there are fewer of them. However, two `span` classes are used solely for the date posted (`'meta__item'`) and the rated keywords (`'meta__row'`), and the first `'meta__val'` span holds the date itself. Therefore we can use 
<br></br>
`res.find(name='span', attrs={'class':'meta__val'}).text`
<br></br>
and
<br></br>
`res.find(name='span', attrs={'class':'meta__row'}).text`
<br></br>
<br></br>
**Don't forget that `'meta__row'` won't be in every Ted talk result because it represents the rated keywords. What can be done is to use:**

```

if res.find(name='span', attrs={'class':'meta__row'}):
    rated_keywords = res.find(name='span', attrs={'class':'meta__row'}).text.replace('\n','')
else:
    rated_keywords = ''
```
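The same check can be wrapped in a small helper so that `find` is called only once per talk. This is a sketch that assumes the `meta__row` / `meta__val` nesting shown in Part 2; the function name `rated_keywords_of` and the two HTML strings are made up for the example.

```python
from bs4 import BeautifulSoup

def rated_keywords_of(res):
    """Return the rated keywords for one talk, or '' when the talk is unrated."""
    row = res.find('span', attrs={'class': 'meta__row'})
    if row is None:          # unrated talk: no 'meta__row' span at all
        return ''
    val = row.find('span', attrs={'class': 'meta__val'})
    return val.get_text(strip=True) if val is not None else ''

# Hand-written stand-ins for a rated and an unrated talk.
rated = BeautifulSoup(
    "<div><span class='meta__row'>Rated <span class='meta__val'>\nInformative, Inspiring\n</span></span></div>",
    'html.parser')
unrated = BeautifulSoup(
    "<div><span class='meta__item'>Posted <span class='meta__val'>\nJan 2019\n</span></span></div>",
    'html.parser')

print(repr(rated_keywords_of(rated)))
print(repr(rated_keywords_of(unrated)))
```

Reading the inner `meta__val` span also avoids picking up the word "Rated" from the outer span.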


1. Page Number. I give each page crawled a number ranging from 1-17. This is a quick way to ID the page. 
2. Link of the page crawled: `website`
3. Title of Talk: `res.findAll('a')[1].text`
4. Speaker: `res.find('h4').text`
5. Date Talk was Posted: `res.find(name='span', attrs={'class':'meta__val'}).text.replace('\n', '')`
6. Rated keywords, those unrated are left blank: `res.find(name='span', attrs={'class':'meta__row'}).text.replace('\n','')`
7. Duration of Talk: `res.find(name='span', attrs={'class': 'thumb__duration'}).text`
8. Link to talk: `res.find('a')['href']`

In [199]:
print(res.findAll('a')[1].text)
print(res.find('h4').text)
print(res.find(name='span', attrs={'class':'meta__val'}).text.replace('\n', ''))
print(res.find(name='span', attrs={'class':'meta__row'}).text.replace('\n',''))
print(res.find(name='span', attrs={'class': 'thumb__duration'}).text)
print(res.find('a')['href'])


Why being respectful to your coworkers is good for business

Christine Porath
Oct 2018
RatedInformative, Inspiring
15:24
/talks/christine_porath_why_being_nice_to_your_coworkers_is_good_for_business
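The rated text printed above comes out as `RatedInformative, Inspiring` because the label and the keywords sit in nested spans. One possible way to tidy it afterwards is sketched below; `split_rated` is a made-up helper name, not part of the notebook.

```python
def split_rated(text):
    """Turn 'RatedInformative, Inspiring' into ['Informative', 'Inspiring']."""
    text = text.strip()
    if text.startswith('Rated'):
        text = text[len('Rated'):]   # drop the leading label
    return [word.strip() for word in text.split(',') if word.strip()]

print(split_rated('RatedInformative, Inspiring'))  # ['Informative', 'Inspiring']
print(split_rated(''))                             # []
```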


In [205]:


talk_title     = res.findAll('a')[1].text
talk_speaker   = res.find('h4').text
date_posted    = res.find(name='span', attrs={'class':'meta__val'}).text.replace('\n', '')
rated_keywords = res.find(name='span', attrs={'class':'meta__row'}).text.replace('\n','')
talk_duration  = res.find(name='span', attrs={'class': 'thumb__duration'}).text
talk_link      = res.find('a')['href']



'/talks/christine_porath_why_being_nice_to_your_coworkers_is_good_for_business'

__Using `.find` vs `.findAll` to find HTML tags with certain class attributes.__

```python
Signature: first_result.find(name=None, attrs={}, recursive=True, text=None, **kwargs)
Docstring:
Return only the first child of this Tag matching the given
criteria.

Signature: first_result.findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
Docstring:
Extracts a list of Tag objects that match the given
criteria.  You can specify the name of the Tag and any
attributes you want the Tag to have.
```
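The practical difference shows up most clearly when nothing matches. The sketch below uses a tiny hand-written snippet (not the TED page) and `find_all`, which is the newer spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Made-up snippet with two 'a' tags and no 'h4' tag.
html = "<div><a href='/talks/x'>thumb</a><a href='/talks/x'>Title</a></div>"
res = BeautifulSoup(html, 'html.parser').div

first = res.find('a')            # a single Tag
all_links = res.find_all('a')    # a list-like ResultSet
missing = res.find('h4')         # nothing matches, so this is None
none_found = res.find_all('h4')  # nothing matches, so this is empty

print(first.text, len(all_links), missing, len(none_found))
```

Because `find` returns `None` on a miss, chaining `.text` onto it raises an `AttributeError`, whereas an empty `find_all` result simply has length 0.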

In [15]:
res.find('span')

<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">15:24</span></span>

In [16]:
res.findAll('span')

[<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">15:24</span></span>,
 <span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span>,
 <span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><sp

In [17]:
res.findAll('span')[0]

<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">15:24</span></span>

In [18]:
res.findAll('span')[1]

<span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span>

In [19]:
res.findAll('span')[2]

<span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span>

In [20]:
res.findAll('span')[3]

<span class="thumb__aligner"></span>

In [21]:
res.findAll('span')[4]

<span class="thumb__duration">15:24</span>

In [22]:
res.findAll('span')[5]

<span class="meta__item">
Posted
<span class="meta__val">
Oct 2018
</span>
</span>

In [23]:
res.findAll('span')[6]  # THIS ONE HOLDS THE DATE

<span class="meta__val">
Oct 2018
</span>

In [27]:
res.findAll('span')[7]  # THIS ONE HOLDS THE RATED

<span class="meta__row">
Rated
<span class="meta__val">
Informative, Inspiring
</span>
</span>

In [28]:
res.findAll('span')[8]

<span class="meta__val">
Informative, Inspiring
</span>

In [31]:
res.find('span').text

'15:24'

In [32]:
res.find_all('span', attrs={'class':'meta_val'})  # 'meta_val' has one underscore; the real class is 'meta__val', so nothing matches

[]

In [14]:
res.find('a')

<a class=" ga-link" data-ga-context="talks" href="/talks/christine_porath_why_being_nice_to_your_coworkers_is_good_for_business">
<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="924" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/bf6f2115-e341-4c30-a94d-9a3815f0edd4/ChristinePorath_2018X-embed.jpg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">15:24</span></span>
</a>

__The cells below use `first_result`, which appears to be the first talk on page 1 (Amy Price Azano) from an earlier run of this notebook.__

In [9]:
first_result.findAll('a')

[<a class=" ga-link" data-ga-context="talks" href="/talks/amy_price_azano_the_ruralities_of_autism">
 <span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="751" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/b3f4021a-9423-4ea3-aee6-3d55f78093f2/Amy+Price+Azano+Set+1.jpeg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">12:31</span></span>
 </a>,
 <a class=" ga-link" data-ga-context="talks" href="/talks/amy_price_azano_the_ruralities_of_autism">
 The ruralities of autism
 </a>]

In [10]:
# two 'a' tags match.
len(first_result.findAll('a'))

2

In [11]:
# first one
first_result.findAll('a')[0]

<a class=" ga-link" data-ga-context="talks" href="/talks/amy_price_azano_the_ruralities_of_autism">
<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="751" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/b3f4021a-9423-4ea3-aee6-3d55f78093f2/Amy+Price+Azano+Set+1.jpeg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">12:31</span></span>
</a>

In [12]:
# second one 
first_result.findAll('a')[1]

<a class=" ga-link" data-ga-context="talks" href="/talks/amy_price_azano_the_ruralities_of_autism">
The ruralities of autism
</a>

In [13]:
first_result.findAll('a')[0]

<a class=" ga-link" data-ga-context="talks" href="/talks/amy_price_azano_the_ruralities_of_autism">
<span class="thumb thumb--video thumb--crop-top"><span class="thumb__sizer"><span class="thumb__tugger"><img alt="" class=" thumb__image" crop="top" play="751" src="https://pi.tedcdn.com/r/talkstar-photos.s3.amazonaws.com/uploads/b3f4021a-9423-4ea3-aee6-3d55f78093f2/Amy+Price+Azano+Set+1.jpeg?quality=89&amp;w=320"/><span class="thumb__aligner"></span></span></span><span class="thumb__duration">12:31</span></span>
</a>

In [14]:
first_result.findAll('a')[0].text  # duration

'\n12:31\n'

In [15]:
first_result.findAll('a')[0]['href']  # link

'/talks/amy_price_azano_the_ruralities_of_autism'

In [16]:
first_result.findAll('a')[1]

<a class=" ga-link" data-ga-context="talks" href="/talks/amy_price_azano_the_ruralities_of_autism">
The ruralities of autism
</a>

In [17]:
first_result.findAll('a')[1]['href']  # also link

'/talks/amy_price_azano_the_ruralities_of_autism'

In [18]:
first_result.findAll('a')[1].text  # title

'\nThe ruralities of autism\n'

In [19]:
# h4 tag has the speaker name
first_result.findAll('h4')[0]

<h4 class="h12 talk-link__speaker">Amy Price Azano</h4>

In [20]:
first_result.findAll('h4')[0].text  # speaker name

'Amy Price Azano'

In [21]:
first_result.findAll('span')[-1]  # posted date

<span class="meta__val">
Jan 2019
</span>

In [22]:
first_result.findAll('span')[-1].text  # posted date

'\nJan 2019\n'

In [23]:
title       = first_result.findAll('a')[1].text.replace('\n','')

speaker     = first_result.findAll('h4')[0].text

link        = first_result.findAll('a')[0]['href']

dateposted = first_result.findAll('span')[-1].text.replace('\n','')

duration    = first_result.findAll('a')[0].text.replace('\n','')  

# MORE DURATION OPTIONS:
# duration_m, duration_s = [float(i) for i in first_result.findAll('span')[0].text.split(':')]  # minutes, seconds
# duration_seconds = first_result.find('img')['play']
# duration_minutes = float(duration_seconds)/60
# duration = divmod(float(duration_seconds), 60) # returns tuple of (mins, sec)


In [24]:
for i in [title, speaker, link, dateposted, duration]:
    print(i)

The ruralities of autism
Amy Price Azano
/talks/amy_price_azano_the_ruralities_of_autism
Jan 2019
12:31


# Scraping all 17 pages and saving data

In [25]:
from collections import OrderedDict

out = OrderedDict()

out['page'] = []
out['page_source'] = []
out['title'] = []
out['speaker'] = []
out['date_posted'] = []
out['duration'] = []
out['link'] = []

    first page:  https://www.ted.com/talks?event=tedx&amp;sort=newest 
    second page: https://www.ted.com/talks?event=tedx&page=2&sort=newest 
    last page:   https://www.ted.com/talks?event=tedx&page=17&sort=newest 

The first page is different from the rest:
    `&amp;sort` versus `&page=2&sort` for the other pages.
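The page URLs listed above follow one pattern, so they could also be built from query parameters instead of string replacement. This is a sketch only: it assumes that `page=1` is accepted for the first page, which was not checked in this notebook, and `page_url` is a made-up helper name.

```python
from urllib.parse import urlencode

BASE = 'https://www.ted.com/talks'

def page_url(page: int) -> str:
    """Build the listing URL for one page (assumes page=1 is valid for the first page)."""
    params = {'event': 'tedx', 'page': page, 'sort': 'newest'}
    return BASE + '?' + urlencode(params)

urls = [page_url(p) for p in range(1, 18)]
print(urls[1])  # https://www.ted.com/talks?event=tedx&page=2&sort=newest
```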

In [26]:
# get all websites to scrape from.
website  = 'https://www.ted.com/talks?event=tedx&amp;sort=newest'
websites = [website]
for i in range(2, 18):
    # swap 'amp;' for 'page=<i>&' to build the URL of page i
    websites.append(website.replace('amp;', 'page=%i&' % i))
websites

['https://www.ted.com/talks?event=tedx&amp;sort=newest',
 'https://www.ted.com/talks?event=tedx&page=2&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=3&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=4&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=5&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=6&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=7&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=8&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=9&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=10&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=11&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=12&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=13&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=14&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=15&sort=newest',
 'https://www.ted.com/talks?event=tedx&page=16&sort=newest',
 'https://www.ted.com/talks?event=te

In [27]:

out = OrderedDict()
out['page'] = []
out['pagesource'] = []
out['title'] = []
out['speaker'] = []
out['dateposted'] = []
out['duration'] = []
out['link'] = []


for i_site, site in enumerate(websites):
    r       = requests.get(site)
    soup    = BeautifulSoup(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'class':'m3'}) 

    for res in results:
        title      = res.findAll('a')[1].text.replace('\n','')
        speaker    = res.findAll('h4')[0].text
        # the first 'meta__val' span is the posted date; on rated talks the last span holds the rated keywords instead
        dateposted = res.find('span', attrs={'class':'meta__val'}).text.replace('\n','')
        duration   = res.find('span').text 
        link       = res.findAll('a')[0]['href']
        link       = site.split('/talks')[0] + link # add 'https://www.ted.com' to beginning
        # APPEND TO DICTIONARY.
        out['page'].append(i_site+1)
        out['pagesource'].append(site)  # or r.url
        out['title'].append(title)
        out['speaker'].append(speaker)
        out['dateposted'].append(dateposted)
        out['duration'].append(duration)
        out['link'].append(link)
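The loop above fetches and parses in one place. One possible refactoring, sketched here, moves the parsing into a function so it can be tried on a small hand-written HTML string without contacting the site. The function name `parse_talks` and the sample HTML are made up for this example; the class names are the ones seen earlier in this notebook.

```python
from bs4 import BeautifulSoup

def parse_talks(html, page_number, page_url):
    """Return one dict per talk container found in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for res in soup.find_all('div', attrs={'class': 'm3'}):
        links = res.find_all('a')
        records.append({
            'page': page_number,
            'pagesource': page_url,
            'title': links[1].get_text(strip=True),
            'speaker': res.find('h4').get_text(strip=True),
            'dateposted': res.find('span', attrs={'class': 'meta__val'}).get_text(strip=True),
            'duration': res.find('span', attrs={'class': 'thumb__duration'}).get_text(strip=True),
            'link': 'https://www.ted.com' + links[0]['href'],
        })
    return records

# Hand-written stand-in for one talk card; not fetched from ted.com.
sample = """
<div class='m3'><div class='media media--sm-v'>
<a href='/talks/example_talk'><span class='thumb'><span class='thumb__duration'>12:31</span></span></a>
<h4 class='h12 talk-link__speaker'>Example Speaker</h4>
<h4 class='h9 m5'><a href='/talks/example_talk'>
An example title
</a></h4>
<div class='meta'><span class='meta__item'>Posted <span class='meta__val'>Jan 2019</span></span></div>
</div></div>
"""
records = parse_talks(sample, 1, 'https://www.ted.com/talks?event=tedx&sort=newest')
print(records[0])
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, so the rest of the notebook would not need to change.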
        

In [28]:
import pandas as pd

In [29]:
data = pd.DataFrame(out)

In [34]:
data.head()

| | page | pagesource | title | speaker | dateposted | duration | link |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://www.ted.com/talks?event=tedx&amp;sort=... | The ruralities of autism | Amy Price Azano | Jan 2019 | 12:31 | https://www.ted.com/talks/amy_price_azano_the_... |
| 1 | 1 | https://www.ted.com/talks?event=tedx&amp;sort=... | How stigma shaped modern medicine | Nathalia Holt | Jan 2019 | 15:30 | https://www.ted.com/talks/nathalia_holt_how_st... |
| 2 | 1 | https://www.ted.com/talks?event=tedx&amp;sort=... | 3 ways to build a happy marriage and avoid div... | George Blair-West | Jan 2019 | 11:13 | https://www.ted.com/talks/george_blair_west_3_... |
| 3 | 1 | https://www.ted.com/talks?event=tedx&amp;sort=... | A mother and son's photographic journey throug... | Tony Luciani | Jan 2019 | 13:32 | https://www.ted.com/talks/tony_luciani_a_mothe... |
| 4 | 1 | https://www.ted.com/talks?event=tedx&amp;sort=... | 5 ways to share math with kids | Dan Finkel | Jan 2019 | 14:41 | https://www.ted.com/talks/dan_finkel_5_ways_to... |


In [31]:
data.page.value_counts() # this is correct. the last page only has 28 videos.

9     36
8     36
2     36
3     36
4     36
5     36
6     36
7     36
1     36
16    36
10    36
11    36
12    36
13    36
14    36
15    36
17    28
Name: page, dtype: int64
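The counts above can be checked against the expected layout of 16 full pages plus a shorter last page. The sketch below rebuilds the page column by hand as a stand-in for `out['page']`, so it runs without scraping anything.

```python
from collections import Counter

# Stand-in for out['page']: 16 full pages of 36 talks plus 28 on page 17.
pages = [p for p in range(1, 17) for _ in range(36)] + [17] * 28

counts = Counter(pages)
full_pages = [p for p, n in counts.items() if n == 36]
short_pages = {p: n for p, n in counts.items() if n != 36}
total = sum(counts.values())

print(len(full_pages), short_pages, total)  # 16 {17: 28} 604
```

With the numbers seen at the time of this notebook, the dataframe should therefore hold 604 rows.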

In [32]:
data.to_csv('ted_talks.csv', index=False, encoding='utf-8')
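The `index=False` argument matters when the file is read back in another notebook. The sketch below uses a tiny made-up dataframe and in-memory buffers instead of a file to show what each choice produces.

```python
import io
import pandas as pd

# Tiny made-up frame standing in for the scraped data.
data = pd.DataFrame({'page': [1, 1], 'title': ['A', 'B']})

with_index = io.StringIO()
data.to_csv(with_index)                  # default: row labels are written too
without_index = io.StringIO()
data.to_csv(without_index, index=False)  # as used in this notebook

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)     # ['Unnamed: 0', 'page', 'title']
print(cols_without)  # ['page', 'title']
```

Without `index=False`, the row labels come back as an extra `Unnamed: 0` column.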