## Total Talks by Event

## Summary

In this notebook, we load the updated dataset with the year a talk was presented as an added feature. We group the texts of the talks by that year and then vectorize the terms into a TF-IDF matrix.

The TF-IDF score for the term t in the document d from the document set D is calculated as follows:

> tfidf(t, d, D) = tf(t, d) · idf(t, D)

where:

> tf(t, d) = log(1 + freq(t, d))

> idf(t, D) = log(N / count(d ∈ D : t ∈ d))

Here N is the number of documents in D, and the count is the number of documents that contain t.
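
As a minimal sketch of the formula as written above, the following computes the score by hand on a toy corpus. The function names and the three toy documents are illustrative assumptions, not part of the dataset. Note that `TfidfTransformer`, used later in this notebook with its default settings, applies a slightly different variant (raw counts for tf, a smoothed idf with 1 added, and L2 normalization), so its numbers will not match this sketch exactly.

```python
import math

def tf(term, doc_tokens):
    """Log-scaled term frequency: log(1 + freq(t, d))."""
    return math.log(1 + doc_tokens.count(term))

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing t).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus: each document is a list of tokens
corpus = [
    "the future of learning".split(),
    "the future of design".split(),
    "the art of learning design".split(),
]

print(tfidf("the", corpus[0], corpus))  # 0.0, because "the" is in every document
print(tfidf("art", corpus[2], corpus))  # log(2) * log(3), "art" is in one document only
```

A term that appears in every document gets an idf of zero under this definition, which is why common function words drop out.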

## Notes
* [Hedonometer](https://hedonometer.org/timeseries/en_all/)

## Imports and Data

In [1]:
# Imports
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [2]:
# Load the Data
df = pd.read_csv('../output/TEDall.csv')

# .shape is just to check to make sure everything loaded correctly
# `list()` is to remind us of column names
print(df.shape, list(df))

(1747, 30) ['Unnamed: 0', 'Set', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile', 'presented', 'TEDevent']


## Compiling the Years into Pseudo-Documents

We begin by reminding ourselves of the years involved and the number of talks per year available to us:

In [3]:
df['presented'].value_counts().sort_index(ascending=True)

1984      1
1990      1
1994      1
1998      6
2001      3
2002     28
2003     34
2004     31
2005     62
2006     43
2007     92
2008     56
2009    155
2010    162
2011    150
2012    146
2013    162
2014    154
2015    149
2016    145
2017    166
Name: presented, dtype: int64

It looks like there's not much point in including the first five years on record: together they total only 12 talks, fewer than half of the 28 talks for 2002 alone. We therefore start with 2002, concatenating all the texts of the talks into a pandas series with the years as its index. In pandas you can [concatenate strings][] based on some other criterion: here we are *grouping by* the year a talk was given, which is `presented` in our dataset. (We use the `all_years` variable initially so that we can call the edited series simply `years`.)

[concatenate strings]: https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby
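
To see what the group-and-join step does, here is a small sketch on a made-up three-row frame. The `toy` dataframe and its contents are assumptions for illustration only, and the sketch joins with a space where the notebook's own cell joins with a comma.

```python
import pandas as pd

# Hypothetical miniature of the talks dataframe
toy = pd.DataFrame({
    "presented": [2002, 2002, 2003],
    "text": ["First talk.", "Second talk.", "Third talk."],
})

# One string per year: all texts sharing a year are joined together
by_year = toy.groupby("presented")["text"].apply(" ".join)
print(by_year)
```

The result is a series indexed by year, with one concatenated string per year, which is the shape the vectorizer needs.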

In [4]:
all_years = df.groupby('presented')['text'].apply(','.join)
all_years.head(25)

presented
1984      In this rather long sort of marathon present...
1990      I'm going to go right into the slides. And a...
1994      Because I usually take the role of trying to...
1998      As a clergyman, you can imagine how out of p...
2001      So I understand that this meeting was planne...
2002      What I want to talk about is, as background,...
2003      You know, one of the intense pleasures of tr...
2004      (Music)    (Music ends)    (Applause)    Tha...
2005      My name is Lovegrove. I only know nine Loveg...
2006      Thank you so much, Chris. And it's truly a g...
2007      I have all my life wondered what "mind-boggl...
2008      Roy Gould: Less than a year from now, the wo...
2009      I wrote a letter last week talking about the...
2010      Sadly, in the next 18 minutes when I do our ...
2011      Ten years ago exactly, I was in Afghanistan....
2012      Let me begin with four words that will provi...
2013      What is going to be the future of learning? ...
2014

In [5]:
years = all_years.drop([1984, 1990, 1994, 1998, 2001])
years.head(16)

presented
2002      What I want to talk about is, as background,...
2003      You know, one of the intense pleasures of tr...
2004      (Music)    (Music ends)    (Applause)    Tha...
2005      My name is Lovegrove. I only know nine Loveg...
2006      Thank you so much, Chris. And it's truly a g...
2007      I have all my life wondered what "mind-boggl...
2008      Roy Gould: Less than a year from now, the wo...
2009      I wrote a letter last week talking about the...
2010      Sadly, in the next 18 minutes when I do our ...
2011      Ten years ago exactly, I was in Afghanistan....
2012      Let me begin with four words that will provi...
2013      What is going to be the future of learning? ...
2014      Chris Anderson: The rights of citizens, the ...
2015      We are built out of very small stuff, and we...
2016      So a while ago, I tried an experiment. For o...
2017      Gayle King: Have a seat, Serena Williams, or...
Name: text, dtype: object

`years.values` returns an array of all the values. `years.values[0]` gives you the first item in the array/list, and as you count up the index, you walk through the collected talks for each year. 

Somewhat counterintuitively, iterating over the series directly in a `for` loop yields the values, with no explicit call to `.values`:

```python
for item in years:
    print(item[0:20])
```

But this explicit version, which here also includes the index, is a bit clearer:

In [6]:
for index, value in years.items():  # .iteritems() was removed in pandas 2.0
    print(f'{index}: {value[0:20]}')

2002:   What I want to tal
2003:   You know, one of t
2004:   (Music)    (Music 
2005:   My name is Lovegro
2006:   Thank you so much,
2007:   I have all my life
2008:   Roy Gould: Less th
2009:   I wrote a letter l
2010:   Sadly, in the next
2011:   Ten years ago exac
2012:   Let me begin with 
2013:   What is going to b
2014:   Chris Anderson: Th
2015:   We are built out o
2016:   So a while ago, I 
2017:   Gayle King: Have a


## Vectorizing the Year Pseudo-Documents

Our next step is to take the text for each year and run it through **sklearn**'s TF-IDF functionality.

Before vectorizing, we define the list of parentheticals, such as "(Laughter)" and "(Applause)", and the function that cleans them out of the texts:

In [7]:
# Raw strings keep the escaped parentheses from being read as string escapes
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]

def clean_parens(text):
    # Lowercase once up front: a custom preprocessor replaces
    # CountVectorizer's built-in lowercasing step
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text)
    return new_text
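
As a quick check of what this kind of cleaning does, here is a sketch that uses a shortened pattern list and a single compiled alternation instead of a loop. The names `sample_patterns`, `paren_rgx`, and `clean_parens_compiled`, and the sample sentence, are assumptions for illustration; this is an alternative formulation, not the function defined in the cell above.

```python
import re

# Shortened, illustrative pattern list (the full list is in the cell above)
sample_patterns = [r"\(laughter\)", r"\(applause\)", r"\(music\)"]

# One compiled alternation, so each text is scanned once
paren_rgx = re.compile("|".join(sample_patterns), flags=re.IGNORECASE)

def clean_parens_compiled(text):
    # Replace each parenthetical with a space, then lowercase the rest
    return paren_rgx.sub(" ", text).lower()

sample = "Thank you. (Applause) It's truly a great honor. (Laughter)"
cleaned = clean_parens_compiled(sample)
print(cleaned.split())
```

With twenty patterns and sixteen very long pseudo-documents, a single pass per text may be somewhat faster than twenty passes, though for a corpus of this size either approach should be adequate.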

Then we instantiate our term frequency vectorizer and turn it loose on our pseudo-documents:

In [8]:
vec = CountVectorizer(preprocessor=clean_parens, min_df=2)
word_count_vector = vec.fit_transform(years)

When we check `word_count_vector.shape` to see if the output is close to what we expected, we find a matrix that is 16 x 28530: 16 yearly pseudo-documents by 28,530 terms. That looks good, so now we will transform those term frequencies into TF-IDF scores.
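
The effect of `min_df=2` on the number of columns is easier to see at small scale. The sketch below uses three made-up documents (an assumption, not the TED data) and keeps only the terms that occur in at least two of them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
toy_docs = [
    "ideas worth spreading",
    "ideas about design",
    "design and technology",
]

# min_df=2 keeps only terms that occur in at least two documents
toy_vec = CountVectorizer(min_df=2)
toy_counts = toy_vec.fit_transform(toy_docs)

print(toy_counts.shape)             # (documents, retained terms)
print(sorted(toy_vec.vocabulary_))  # the retained terms
```

Only "design" and "ideas" survive here, so the matrix has three rows and two columns. In the notebook the same setting drops any term that appears in only one year.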

In [9]:
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

To make it easier to inspect our results, we put the IDF weights into a pandas dataframe, indexed by term:

In [10]:
df_idf = pd.DataFrame(tfidf_transformer.idf_, 
                      index=vec.get_feature_names_out(),  # get_feature_names() before scikit-learn 1.0
                      columns=["idf_weights"])

In [11]:
df_idf.sort_values(by=['idf_weights'])

           idf_weights
island        1.000000
building      1.000000
buildings     1.000000
built         1.000000
strong        1.000000
...                ...
freshen       2.734601
frescoes      2.734601
cocky         2.734601
silvery       2.734601
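
The two extremes in this table can be reproduced by hand. With `smooth_idf=True`, scikit-learn documents the weight as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below plugs in n = 16 (our yearly pseudo-documents); the helper name `smooth_idf` is an illustrative assumption.

```python
import math

n_docs = 16  # pseudo-documents, one per year from 2002 to 2017

def smooth_idf(doc_freq, n=n_docs):
    # scikit-learn's documented formula when smooth_idf=True
    return math.log((1 + n) / (1 + doc_freq)) + 1

print(round(smooth_idf(16), 6))  # term present in all 16 years -> 1.0
print(round(smooth_idf(2), 6))   # term present in exactly 2 years -> 2.734601
```

So a weight of 1.0 marks a term used in every year, and 2.734601 marks a term used in exactly two years, the lowest document frequency that `min_df=2` allows.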


In [12]:
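# tf-idf scores
tf_idf_vector = tfidf_transformer.transform(word_count_vector)

# Establish feature names
# (use get_feature_names() on scikit-learn versions before 1.0)
feature_names = vec.get_feature_names_out()

# Get the tf-idf vector for the first pseudo-document (2002)
first_document_vector = tf_idf_vector[0]

# Show the scores, highest first; the name df_tfidf avoids
# overwriting the talks dataframe `df` loaded earlier
df_tfidf = pd.DataFrame(first_document_vector.T.toarray(), index=feature_names, columns=["tfidf"])
df_tfidf.sort_values(by=["tfidf"], ascending=False)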

             tfidf
the       0.516398
and       0.401365
to        0.320451
of        0.280900
that      0.252489
...            ...
garments  0.000000
garment   0.000000
garlic    0.000000
gardner   0.000000


In [13]:
for year in years

SyntaxError: invalid syntax (<ipython-input-13-7c824be19cc7>, line 1)
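
The cell above was left unfinished, so what the loop was meant to do is not recorded. One plausible continuation, offered only as a sketch, is to list the highest-weighted terms for each year. The example below runs on three made-up pseudo-documents; `toy_years`, `toy_scores`, and `top_terms` are hypothetical names, and it uses the default vectorizer settings, not the notebook's `clean_parens` preprocessor and `min_df=2`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-in for the `years` series: one string per year
toy_years = pd.Series(
    {
        2002: "robots robots design ideas",
        2003: "design ideas climate climate",
        2004: "ideas climate robots oceans",
    }
)

toy_vec = CountVectorizer()
toy_tfidf = TfidfTransformer().fit_transform(toy_vec.fit_transform(toy_years))

# Order the terms by their column index in the matrix
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
toy_scores = pd.DataFrame(toy_tfidf.toarray(), index=toy_years.index, columns=terms)

# For each year (row), keep the two highest-scoring terms
top_terms = {}
for year, row in toy_scores.iterrows():
    top_terms[year] = row.nlargest(2).index.tolist()
    print(year, top_terms[year])
```

Putting all years into one dataframe, with years as rows and terms as columns, avoids repeating the single-document inspection from the previous cell sixteen times.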