# Trends

In this notebook we re-import the texts and group them by year. The first time around we work with the combined dataset and look for overall trends, and then we follow that up by loading the `only` and `plus` datasets separately to see if there are any differences worth noting. Our goal is to see which words trend, both to learn about TED talks as a developing collection of events and, possibly, to compare the trends glimpsed here against trends from the BYU corpus or Google Trends.

In [1]:
import pandas as pd
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.externals import joblib

In [11]:
import re

## Loading the Data

Working with only the data from the release, we have two data sets, `TEDonly` and `TEDplus`, that we previously merged into `TEDall_speakers`, with an additional column indicating which data set a given talk is taken from. The first thing we will do is load `TEDall_speakers`.

In [2]:
df = pd.read_csv('../output/TEDall_speakers.csv')
df.shape

(1747, 27)

We need to determine a way to get our texts into the appropriate year bins. Let's see where dates are stored:

In [4]:
with open('../output/TEDall_speakers.csv') as f:
    columns = f.readline().strip().split(",")
print(columns)

['Set', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile']


It looks like we have two places to get a date: the event at which a talk occurred and the date on which the talk was published on the TED website. For now, let's see if we can find a way to work with the date on which the talk first occurred.

To do that, we are going to compile all the events into a set and then go through the set to see what our next steps will look like.

In [8]:
events = set(df.event.tolist())
print(len(events))
print(events)

58
{'TEDMED 2017', 'TEDGlobal 2009', 'TED2005', 'TED1984', 'TEDYouth 2015', 'TEDYouth 2012', 'TEDGlobal>NYC', 'TEDActive 2014', 'TED2008', 'TEDWomen 2016', 'TEDMED 2014', 'TEDGlobal>Geneva', 'TEDYouth 2013', 'TEDActive 2011', 'TEDGlobal 2012', 'TEDWomen 2013', 'TEDMED 2012', 'TED2011', 'TED2002', 'TEDGlobal 2007', 'TEDGlobal 2011', 'TED2003', 'TED2007', 'TED2009', 'TED2013', 'TEDGlobal>London', 'TED2004', 'TED2012', 'TEDMED 2010', 'TED1994', 'TEDWomen 2010', 'TEDMED 2016', 'TEDSummit', 'TEDWomen 2017', 'TEDMED 2015', 'TEDGlobal 2017', 'TEDMED 2013', 'TED1998', 'TEDGlobal 2005', 'TEDYouth 2014', 'TED2015', 'TED2014', 'TEDGlobal 2013', 'TEDWomen 2015', 'TEDGlobal 2014', 'TED2017', 'TED2016', 'TED1990', 'TEDMED 2009', 'TED Senior Fellows at TEDGlobal 2010', 'TED2010', 'TED2006', 'TED2001', 'TEDGlobal 2010', 'TEDGlobalLondon', 'TEDMED 2011', 'TEDYouth 2011', 'TEDActive 2015'}


The events with years in them look good: we should be able to get the dates out using regex. In fact, let's build that now so that we can then filter out the events with dates in their name and get a list of the events for which we will need to assign a year. 

Neither `datetime` nor `isdigit()` will work here: the former expects dates to be more systematic, and the latter expects the digits to be set apart from words. Regex it is.

In [14]:
years = re.findall(r'\d+', str(events))
print(years)

['2017', '2009', '2005', '1984', '2015', '2012', '2014', '2008', '2016', '2014', '2013', '2011', '2012', '2013', '2012', '2011', '2002', '2007', '2011', '2003', '2007', '2009', '2013', '2004', '2012', '2010', '1994', '2010', '2016', '2017', '2015', '2017', '2013', '1998', '2005', '2014', '2015', '2014', '2013', '2015', '2014', '2017', '2016', '1990', '2009', '2010', '2010', '2006', '2001', '2010', '2011', '2011', '2015']


In a moment, we will use this either to add a label that is only a year or to replace the event label with the year (in our working copy, of course, not in our original dataset).
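Running `re.findall` over `str(events)` returns one flat list, so the link between each year and its event is lost. One way to keep that link is to search each event name separately. The following is a minimal sketch using a few event names copied from the set above; the names `sample_events`, `YEAR_RE`, and `event_to_year` are illustrative and not part of the notebook.

```python
import re

# A handful of event names copied from the set printed above.
sample_events = [
    'TEDMED 2017',
    'TED1984',
    'TEDGlobal>NYC',
    'TED Senior Fellows at TEDGlobal 2010',
]

# Four consecutive digits; sufficient for the event names in this dataset.
YEAR_RE = re.compile(r'\d{4}')

event_to_year = {}
for event in sample_events:
    match = YEAR_RE.search(event)
    # Events with no year in their name are mapped to None for now.
    event_to_year[event] = int(match.group()) if match else None

print(event_to_year)
```

Keeping the result as a dictionary means the years can later be looked up by event name rather than by position.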

It is probably possible to find the strings without years in them programmatically, but with only 58 items, and only five of them lacking a year, we can pick them out by hand at a glance: `TEDGlobal>NYC`, `TEDGlobal>Geneva`, `TEDGlobal>London`, `TEDSummit`, `TEDGlobalLondon`.
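For a larger set, the same list could be produced programmatically. This is a small sketch, assuming that "has a year" means "contains four consecutive digits"; `sample_events` is an illustrative subset of the events above, and `no_year` is a hypothetical name.

```python
import re

# Illustrative subset of the event names printed above.
sample_events = {
    'TED2005', 'TEDWomen 2016', 'TEDYouth 2015',
    'TEDGlobal>NYC', 'TEDGlobal>Geneva', 'TEDGlobal>London',
    'TEDSummit', 'TEDGlobalLondon',
}

# Keep only the events whose name contains no four-digit run.
no_year = sorted(e for e in sample_events if not re.search(r'\d{4}', e))
print(no_year)
```

Run against the full `events` set, the same filter should return the five names listed above.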

With that list in hand, let's search the web and find the dates we need.
```
TEDGlobal>NYC: 2017
TEDGlobal>Geneva: 2015
TEDGlobal>London: 2015
TEDSummit: 2016
TEDGlobalLondon: 2015
```
With the years in hand, the next step is to do a **replace** on those strings.
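One possible shape for that step, sketched under assumptions: the hand-collected years go into a dictionary, and a helper checks that dictionary before falling back to the regex. The names `MANUAL_YEARS` and `event_year` are hypothetical, and this is only one of several ways to do the replacement.

```python
import re

# Years found by hand for the events whose names carry no year (listed above).
MANUAL_YEARS = {
    'TEDGlobal>NYC': 2017,
    'TEDGlobal>Geneva': 2015,
    'TEDGlobal>London': 2015,
    'TEDSummit': 2016,
    'TEDGlobalLondon': 2015,
}


def event_year(event):
    """Return the year for an event name, or None if it cannot be determined."""
    # Guard against missing values (e.g. NaN) in the event column.
    if not isinstance(event, str):
        return None
    # Hand-assigned years take priority over anything found in the name.
    if event in MANUAL_YEARS:
        return MANUAL_YEARS[event]
    match = re.search(r'\d{4}', event)
    return int(match.group()) if match else None


print(event_year('TEDSummit'))
print(event_year('TEDGlobal 2009'))
```

Assuming `df` is the DataFrame loaded earlier, something like `df['year'] = df['event'].map(event_year)` would then add a year column while leaving the original `event` labels untouched; the column name `year` is an assumption.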