<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-the-Data" data-toc-modified-id="Loading-the-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading the Data</a></span></li><li><span><a href="#Parsing-the-Data" data-toc-modified-id="Parsing-the-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Parsing the Data</a></span></li><li><span><a href="#Working-with-the-Years" data-toc-modified-id="Working-with-the-Years-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Working with the Years</a></span></li></ul></div>

# Trends

We are going to re-import the texts here, this time by year. First we will load the combined dataset and look for overall trends; then we will load the `only` and `plus` datasets separately to see if there are any differences worth noting. Our goal is to see which words trend, not only to learn about TED talks as a developing collection of events but also, possibly, to compare the trends glimpsed here against trends from the BYU corpus or Google Trends.

In [1]:
import pandas as pd
import re

In [None]:
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.externals import joblib

## Loading the Data

Working with only the data from the release, we have two data sets, `TEDonly` and `TEDplus`, that we have previously merged into `TEDall_speakers`, with an additional column indicating the data set from which a given talk is taken. The first thing we will do is load only those parts of `TEDall` with which we would like to work; that is, we do not need to load the entire data set. Once we have a set loaded into a dataframe, we can use `df.columns` to see its contents like this:

```python
df = pd.read_csv('../output/TEDall_speakers.csv')
df.columns

Index(['Set', 'Talk_ID', 'public_url', 'headline', 'description', 'event',
       'duration', 'published', 'tags', 'views', 'text', 'speaker_1',
       'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile',
       'speaker_2', 'speaker2_occupation', 'speaker2_introduction',
       'speaker2_profile', 'speaker_3', 'speaker3_occupation',
       'speaker3_introduction', 'speaker3_profile', 'speaker_4',
       'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile'],
      dtype='object')
```

We can do the same for its index:

```python
df.index

RangeIndex(start=0, stop=1747, step=1)
```

But to get the column names without loading the entire data set, and in so doing save on memory use, we need to do this the old-fashioned way:

In [10]:
with open('../output/TEDall_speakers.csv') as f:
    columns = f.readline().strip().split(",")
print(columns)

['Set', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile']
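Splitting the first line on commas works here because none of our column names contains a comma. If a header ever did contain a quoted comma, the standard library's `csv` module would parse it correctly. Here is a minimal sketch on an in-memory sample; the header line is made up for illustration and is not from our file:

```python
import csv
import io

# Hypothetical header line: the quoted name contains a comma, which a plain
# split(",") would break into two pieces.
sample = io.StringIO('Set,Talk_ID,"headline, short",text\nonly,1,Hello,Some words\n')

# csv.reader yields one parsed row at a time, so next() reads only the header.
header = next(csv.reader(sample))
print(header)  # ['Set', 'Talk_ID', 'headline, short', 'text']
```

With a real file, the same `next(csv.reader(f))` call would go inside the `with open(...)` block.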


Now we are going to load only the data we need.

In [13]:
df = pd.read_csv('../output/TEDall_speakers.csv', 
                    usecols=["public_url","event", "published", "text"])

In [14]:
df.shape

(1747, 4)

## Parsing the Data

It looks like we have two places to get a date: the event at which a talk occurred and the date on which the talk was published on the TED website. For now, let's see if we can find a way to work with the date on which the talk first occurred. The relationship between a talk and its time is complicated by the two dates sometimes being years apart, but that is something we can test later: when does a talk reveal any concurrence?

To get all the events we can create a list and then convert it to a set:
```python
events = set(df.event.tolist())
print(len(events), events)
```
Or we can do this the **pandas** way:

In [17]:
events = df.event.unique().tolist()
print(len(events), events)

58 ['TED2006', 'TED2004', 'TED2005', 'TED2003', 'TED2007', 'TED2002', 'TED2008', 'TED1984', 'TED1990', 'TED1998', 'TED2001', 'TED2009', 'TED2010', 'TED2011', 'TED1994', 'TED2012', 'TED2013', 'TED2014', 'TED2015', 'TED2016', 'TED2017', 'TEDGlobal 2005', 'TEDGlobal 2007', 'TEDGlobal 2009', 'TEDMED 2009', 'TEDGlobal 2010', 'TED Senior Fellows at TEDGlobal 2010', 'TEDWomen 2010', 'TEDMED 2010', 'TEDActive 2011', 'TEDGlobal 2011', 'TEDMED 2011', 'TEDYouth 2011', 'TEDMED 2012', 'TEDGlobal 2012', 'TEDYouth 2012', 'TEDMED 2013', 'TEDGlobal 2013', 'TEDWomen 2013', 'TEDYouth 2013', 'TEDActive 2014', 'TEDMED 2014', 'TEDGlobal 2014', 'TEDYouth 2014', 'TEDWomen 2015', 'TEDGlobalLondon', 'TEDGlobal>London', 'TEDYouth 2015', 'TEDGlobal>Geneva', 'TEDMED 2015', 'TEDActive 2015', 'TEDSummit', 'TEDWomen 2016', 'TEDMED 2016', 'TEDGlobal 2017', 'TEDGlobal>NYC', 'TEDWomen 2017', 'TEDMED 2017']
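One practical difference between the two approaches is ordering: `unique()` returns values in order of first appearance, while a `set` makes no promise about order. If we want the first-appearance order without pandas, `dict.fromkeys` is one way to get it. A small sketch on a made-up sample:

```python
# A small sample in the order the events might first appear in the data.
sample_events = ["TED2006", "TED2004", "TED2006", "TED2005", "TED2004"]

# dict keys are unique and keep insertion order (Python 3.7+), so this
# deduplicates while preserving first-appearance order.
ordered = list(dict.fromkeys(sample_events))
print(ordered)  # ['TED2006', 'TED2004', 'TED2005']

# A set holds the same members, but its iteration order is not guaranteed.
print(set(sample_events) == set(ordered))  # True
```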


The events with years in them look good: we should be able to get the dates out using regex. In fact, let's build that now so that we can then filter out the events with dates in their name and get a list of the events for which we will need to assign a year. (Neither `datetime` nor `isdigit()` will work here: the former expects dates to be more systematic and the latter expects dates to be set apart from words, so **regex** it is.)

In [5]:
years = re.findall(r'\d+', str(events))
print(years)

['2009', '2015', '2016', '2009', '2008', '2014', '1990', '1994', '2014', '1984', '2012', '2001', '2015', '2012', '2015', '2011', '2002', '2014', '2005', '2017', '2017', '2009', '2006', '2016', '2011', '2014', '2011', '2012', '2005', '2004', '2011', '2007', '2014', '2003', '2007', '2017', '2010', '2015', '2013', '2013', '2010', '2016', '1998', '2010', '2015', '2017', '2013', '2013', '2010', '2011', '2012', '2010', '2013']


In a moment, we will use this either to add a label that is only a year or to replace the label for a series with the year (in our working dataframe, of course, not in our original dataset).

It is probably possible to find the strings without years programmatically, but with only 58 items, and only five of them lacking a year at a glance, we can do this by hand: `TEDGlobal>NYC`, `TEDGlobal>Geneva`, `TEDGlobal>London`, `TEDSummit`, `TEDGlobalLondon`.
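For the record, the same regex can pick out the undated events for us. This is only a sketch on a handful of labels copied from the output above, and `undated` is a name introduced here for illustration:

```python
import re

# A handful of event labels copied from the list printed above.
sample_events = ["TED2006", "TEDGlobal 2005", "TEDGlobalLondon",
                 "TEDGlobal>London", "TEDGlobal>Geneva", "TEDSummit",
                 "TEDGlobal>NYC", "TEDMED 2017"]

# Keep every label in which the regex finds no run of digits.
undated = [e for e in sample_events if re.search(r"\d+", e) is None]
print(undated)
# ['TEDGlobalLondon', 'TEDGlobal>London', 'TEDGlobal>Geneva', 'TEDSummit', 'TEDGlobal>NYC']
```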

With that list in hand, let's search the web and find the dates we need.
```
TEDGlobal>NYC: 2017
TEDGlobal>Geneva: 2015
TEDGlobal>London: 2015
TEDSummit: 2016
TEDGlobalLondon: 2015
```
So now to do a **replace** on those strings.

To replace *in situ*:

```python
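# Assign the result back to the column; calling replace(inplace=True) on a
# column selection may not update df in recent versions of pandas.
df['event'] = df['event'].replace(
    to_replace=['ABC', 'AB'],
    value='A'
)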
```

To create a new column:

```python
import numpy as np  # np.where needs numpy; 'age' is a placeholder column
df['elderly'] = np.where(df['age']>=50, 'yes', 'no')
```

Ack. The above code works on numbers, but we are working with strings:

```python
search = []    
for values in df['col']:
    search.append(re.search(r'\d+', values).group())

df['col1'] = search
```

In [24]:
search = []
for event in df['event']:
    search.append(re.search(r'\d+', str(event)).group())
    
print(search[0:20])

AttributeError: 'NoneType' object has no attribute 'group'
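The error is consistent with the undated events: `re.search` returns `None` when it finds no match, and `None` has no `.group()` method. A guarded version of the loop, sketched here on three sample labels rather than the full column, records `None` for those events instead of failing:

```python
import re

# Three sample labels; the middle one has no digits in it.
sample_events = ["TED2006", "TEDSummit", "TEDGlobal 2005"]

search = []
for event in sample_events:
    match = re.search(r"\d+", str(event))
    # re.search returns None when there is no match, so test before .group().
    search.append(match.group() if match else None)

print(search)  # ['2006', None, '2005']
```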

I'm having a hard time doing this in place in the dataframe itself, so let's try splitting this column off as a list that we will work on and then adding it back as a new column with the year. So we are going from event to year.

Here's the pseudo-code as I imagine it at this moment:

```python
for event in events:
    if replace produces a four-digit number == True then done
    else replace these things as follows:
        TEDGlobal>NYC: 2017
        TEDGlobal>Geneva: 2015
        TEDGlobal>London: 2015
        TEDSummit: 2016
        TEDGlobalLondon: 2015
```
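One way the pseudo-code could be made runnable is as a small function that tries the regex first and falls back to a dictionary lookup. This is only a sketch: `event_to_year` is a hypothetical helper name, and the dictionary holds the five years we looked up above.

```python
import re

# Years for the events whose names carry no date (from the list above).
replacements = {"TEDGlobal>NYC": "2017",
                "TEDGlobal>Geneva": "2015",
                "TEDGlobal>London": "2015",
                "TEDSummit": "2016",
                "TEDGlobalLondon": "2015"}

def event_to_year(event):
    """Return the four-digit year for an event label, or None if unknown."""
    match = re.search(r"\d{4}", event)
    if match:
        return match.group()
    # No year in the name: fall back to the hand-built dictionary.
    return replacements.get(event)

sample_events = ["TED2006", "TEDGlobal 2005", "TEDSummit", "TEDGlobal>NYC"]
print([event_to_year(e) for e in sample_events])
# ['2006', '2005', '2016', '2017']
```

Because the function returns exactly one value per event, the resulting list always has the same length as the input.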

We need to start by re-populating the `events` list with all the values and not simply the unique values.

In [27]:
events = df.event.tolist()
type(events[0])

str

Let's start with the five replacements listed above; that way we can insert the dates, and those dates will be kept by the regex we use to filter out letters. To do this, we are going to use a simple dictionary holding the five events that need a date. A small test establishes the functionality, if not the efficiency, of the code:

In [50]:
test = ["TEDGlobalLondon", "TED2006", "TED 2007", "TEDGlobalLondon", "TEDSummit"]

replacements = {"TEDGlobal>NYC": "2017",
                "TEDGlobal>Geneva": "2015",
                "TEDGlobal>London": "2015",
                "TEDSummit": "2016",
                "TEDGlobalLondon": "2015"}

for key, value in replacements.items():
    test = [w.replace(key, value) for w in test]
print(test)

['2015', 'TED2006', 'TED 2007', '2015', '2016']


We can use the same dictionary, `replacements`, on our master list of events:

In [51]:
for key, value in replacements.items():
    events = [w.replace(key, value) for w in events]

Next up, we need to apply our regex to sweep out all the non-digit characters:

In [53]:
# Pull every run of digits out of the stringified list of events
events = re.findall(r'\d+', str(events))
uniques = set(events) # We're working with a list, 
                      # so we use set here instead of pandas' "unique"
print(uniques)

{'2007', '1994', '2011', '2002', '2001', '2009', '2006', '1984', '2016', '2015', '2014', '2005', '2008', '1990', '2013', '2010', '2017', '2003', '2004', '2012', '1998'}


In [54]:
print(events[0:50])

['2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2006', '2004', '2006', '2005', '2006', '2006', '2004', '2006', '2004', '2006', '2004', '2004', '2004', '2005', '2005', '2006', '2005', '2005', '2005', '2006', '2005', '2005', '2006', '2005', '2006', '2005', '2003', '2006', '2006', '2005', '2006', '2006', '2007', '2007', '2007', '2002']
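One caution about this step: running `re.findall` over `str(events)` returns one item per run of digits in the whole string, not one item per event. The result lines up with the dataframe only if every label contains exactly one run of digits, which appears to hold for our data once the five replacements are in. The labels below are hypothetical and chosen to show how the counts can match while the values drift out of alignment:

```python
import re

# Hypothetical labels: one has no digits and one has two runs of digits.
labels = ["TED2006", "TEDSummit", "TED2010 Day 2"]

flattened = re.findall(r"\d+", str(labels))
print(flattened)                      # ['2006', '2010', '2']
print(len(flattened) == len(labels))  # True, yet the values are misaligned

# Matching label by label keeps one result per event.
per_label = [re.findall(r"\d+", label) for label in labels]
print(per_label)                      # [['2006'], [], ['2010', '2']]
```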


Okay, it looks like we have the years, and so we'll add this back to our dataframe:

In [55]:
df['year'] = events
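For comparison, the same result can probably be had without leaving the dataframe, which keeps each year on the row it came from. The sketch below uses a small stand-in dataframe (`sample_df` is a name made up here) and assumes the same five hand-assigned years:

```python
import pandas as pd

# Small stand-in for the real dataframe; only the event column matters here.
sample_df = pd.DataFrame({"event": ["TED2006", "TEDGlobal 2005",
                                    "TEDSummit", "TEDGlobal>NYC"]})

replacements = {"TEDGlobal>NYC": "2017",
                "TEDGlobal>Geneva": "2015",
                "TEDGlobal>London": "2015",
                "TEDSummit": "2016",
                "TEDGlobalLondon": "2015"}

# Extract the first four-digit run per row (NaN where there is none), then
# fill the gaps by looking the event name up in the dictionary.
year = sample_df["event"].str.extract(r"(\d{4})", expand=False)
sample_df["year"] = year.fillna(sample_df["event"].map(replacements))
print(sample_df["year"].tolist())  # ['2006', '2005', '2016', '2017']
```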

In [59]:
df.to_csv('../output/TEDall_years.csv')

## Working with the Years