# Into the Matrix

## Notes
* [Hedonometer](https://hedonometer.org/timeseries/en_all/)

## Imports and Data

In [1]:
# Imports
import pandas as pd, re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [2]:
# Load the Data
dfAll = pd.read_csv('../output/TEDall.csv')

We use `shape` below just to make sure the dataset loaded as expected and then `list()` to remind us of our columns: we are looking for the column that distinguishes between the main TED events and the various extra events. We are going to focus on the main events for the time being.

In [3]:
print(dfAll.shape, list(dfAll))

(1747, 30) ['Unnamed: 0', 'Set', 'Talk_ID', 'public_url', 'headline', 'description', 'event', 'duration', 'published', 'tags', 'views', 'text', 'speaker_1', 'speaker1_occupation', 'speaker1_introduction', 'speaker1_profile', 'speaker_2', 'speaker2_occupation', 'speaker2_introduction', 'speaker2_profile', 'speaker_3', 'speaker3_occupation', 'speaker3_introduction', 'speaker3_profile', 'speaker_4', 'speaker4_occupation', 'speaker4_introduction', 'speaker4_profile', 'presented', 'TEDevent']


Let's make sure we remember the terms used:

In [5]:
print(set(dfAll.Set.tolist()))

{'plus', 'only'}


We want to filter the dataset so that we only work with the main TED event:

In [6]:
main = dfAll[dfAll['Set']=='only']
main.shape

(992, 30)

Let's examine the number of talks for each of the years:

In [7]:
main['presented'].value_counts().sort_index(ascending=True)

1984     1
1990     1
1994     1
1998     6
2001     3
2002    28
2003    34
2004    31
2005    36
2006    43
2007    68
2008    56
2009    81
2010    68
2011    70
2012    65
2013    76
2014    84
2015    75
2016    75
2017    90
Name: presented, dtype: int64

It looks like there's not much point in including the first five years on record: together they total only 12 talks, fewer than half of the 28 given in 2002 alone.

The easiest way to proceed is:

1. Concatenate all the texts of the talks into one big pseudo-document for each year
2. Drop the first five years

We start by concatenating all the texts of the talks into a pandas series with the years as index. In pandas you can [concatenate strings][] based on some other criterion: here we are *grouping by* the year a talk was given, which is `presented` in our dataset. (We use the `all_years` variable initially so that we can call the edited series simply `years`.)

[concatenate strings]: https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby
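
A minimal sketch of the group-and-join idea on made-up data may help. The dataframe `toy` and its contents are hypothetical, and the sketch joins with a space where the notebook cell joins with a comma:

```python
import pandas as pd

# Hypothetical toy data standing in for the TED dataframe
toy = pd.DataFrame({
    "presented": [2002, 2002, 2003],
    "text": ["first talk", "second talk", "third talk"],
})

# One string per year: every talk from that year joined end to end
per_year = toy.groupby("presented")["text"].apply(" ".join)
print(per_year)
```

The result is a series indexed by year, which is the same shape as `all_years`.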

In [8]:
# Concatenate all the texts of the talks into one big pseudo-document for each year
all_years = main.groupby(['presented'])['text'].apply(lambda x: ','.join(x))

In [9]:
# Drop the first five years
years = all_years.drop([1984, 1990, 1994, 1998, 2001])
years.head(16)

presented
2002      What I want to talk about is, as background,...
2003      You know, one of the intense pleasures of tr...
2004      (Music)    (Music ends)    (Applause)    Tha...
2005      My name is Lovegrove. I only know nine Loveg...
2006      Thank you so much, Chris. And it's truly a g...
2007      I have all my life wondered what "mind-boggl...
2008      Roy Gould: Less than a year from now, the wo...
2009      I wrote a letter last week talking about the...
2010      Sadly, in the next 18 minutes when I do our ...
2011      Ten years ago exactly, I was in Afghanistan....
2012      Let me begin with four words that will provi...
2013      What is going to be the future of learning? ...
2014      Chris Anderson: The rights of citizens, the ...
2015      We are built out of very small stuff, and we...
2016      So a while ago, I tried an experiment. For o...
2017      Gayle King: Have a seat, Serena Williams, or...
Name: text, dtype: object
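
Listing the years by hand works here, but the same cut can also be expressed as a condition on the index. A minimal sketch on a made-up series (the name `toy_years` and its values are placeholders, not the real transcripts):

```python
import pandas as pd

# Placeholder series with years as the index, like all_years
toy_years = pd.Series(
    ["a", "b", "c", "d"],
    index=pd.Index([1998, 2001, 2002, 2003], name="presented"),
)

# Keep only the years from 2002 onward
recent = toy_years[toy_years.index >= 2002]
print(recent)
```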

`years.values` returns an array of all the values. `years.values[0]` gives you the first item in the array/list, and as you count up the index, you walk through the pseudo-document for each year. 

Not entirely intuitively, you can get the values without explicitly calling them in a `for` loop:

```python
for item in years:
    print(item[0:20])
```

But this explicit version, which here also includes the index, is a bit clearer:

In [10]:
for index, value in years.items():  # iteritems() was removed in pandas 2.0
    print(f'{index}: {value[0:20]}')

2002:   What I want to tal
2003:   You know, one of t
2004:   (Music)    (Music 
2005:   My name is Lovegro
2006:   Thank you so much,
2007:   I have all my life
2008:   Roy Gould: Less th
2009:   I wrote a letter l
2010:   Sadly, in the next
2011:   Ten years ago exac
2012:   Let me begin with 
2013:   What is going to b
2014:   Chris Anderson: Th
2015:   We are built out o
2016:   So a while ago, I 
2017:   Gayle King: Have a


## Cleaning the Year Pseudo-Documents

As a test case for later work, and without being terribly important for the current experiment, we are going to clean our texts using two functions: one to remove speakers and one to remove parentheticals.

In the cells below we build our two lists, `speakers` and `parentheticals`, then define one function for each list and a third function that combines them.

Our first step is to create the two lists of strings we want removed:

* `parentheticals` is from previous experiments
* `speakers` is probably unpythonic in its expression but it works

In [11]:
# Raw strings keep the backslashes intact for the regex engine
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]
len(parentheticals)

20

In [12]:
speakers = dfAll.speaker_1.tolist() + dfAll.speaker_2.tolist() + dfAll.speaker_3.tolist() + dfAll.speaker_4.tolist()
print(speakers[0:10])

['Al Gore', 'David Pogue', 'Majora Carter', 'Ken Robinson', 'Hans Rosling', 'Tony Robbins', 'Joshua Prince-Ramus', 'Julia Sweeney', 'Rick Warren', 'Dan Dennett']


In [13]:
def remove_parens(text):
    new_text = text
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text.lower(), flags=re.IGNORECASE)
    return new_text

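def remove_speaker_names(text):
    temp_text = text
    for name in speakers:
        # Skip non-string entries such as NaN (talks with fewer speakers)
        if not isinstance(name, str):
            continue
        # Escape the name so that characters like "." are matched literally
        temp_text = re.sub(re.escape(name), ' ', temp_text)
    return temp_text

def clean_text(text):
    # Remove speaker names first, while the text still has its original case
    cleaned = remove_parens(remove_speaker_names(text))
    return cleaned

An earlier version of `remove_speaker_names` kept throwing a `TypeError`. The likely cause is that the `speaker_2`, `speaker_3`, and `speaker_4` columns hold `NaN` for talks with fewer speakers, and `re.sub` accepts only a string or a compiled pattern. The version above skips non-string entries and escapes each name before matching.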
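
To see how the two cleaning steps behave together, here is a minimal sketch on a toy string. The lists `toy_parentheticals` and `toy_speakers` are shortened, hypothetical stand-ins for the notebook's lists, and the `NaN` entry is an assumption about what an empty speaker column contributes:

```python
import re

# Shortened, hypothetical stand-ins for the notebook's lists
toy_parentheticals = [r"\(laughter\)", r"\(applause\)"]
toy_speakers = ["Roy Gould", float("nan")]  # NaN mimics an empty speaker column

def strip_parentheticals(text):
    for pattern in toy_parentheticals:
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return text

def strip_speakers(text):
    for name in toy_speakers:
        # re.sub raises a TypeError if the pattern is a float, so skip those
        if isinstance(name, str):
            text = re.sub(re.escape(name), " ", text)
    return text

sample = "Roy Gould: Less than a year from now (Laughter) we begin."
cleaned = strip_parentheticals(strip_speakers(sample))
print(cleaned)
```

The speaker label and the parenthetical are both replaced by spaces, so the cleaned string keeps some extra whitespace.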

## A Quick Word Count for Each Year

Before we go any further, let's just get a quick word count for each of our years:

In [30]:
dfYears = years.to_frame()
dfYears.head()

presented,text
2002,"What I want to talk about is, as background,..."
2003,"You know, one of the intense pleasures of tr..."
2004,(Music) (Music ends) (Applause) Tha...
2005,My name is Lovegrove. I only know nine Loveg...
2006,"Thank you so much, Chris. And it's truly a g..."


In [31]:
dfYears = dfYears.apply(lambda x: x.astype(str).str.lower())

In [32]:
dfYears = dfYears.replace(r'[^\w\s\+]', '', regex = True)

In [34]:
dfYears['word_count'] = dfYears.text.apply(lambda x: len(str(x).split(' ')))

In [35]:
dfYears.head(16)

presented,text,word_count
2002,what i want to talk about is as background i...,82116
2003,you know one of the intense pleasures of tra...,93898
2004,music music ends applause thank you...,89746
2005,my name is lovegrove i only know nine lovegr...,110717
2006,thank you so much chris and its truly a grea...,118346
2007,i have all my life wondered what mindbogglin...,149144
2008,roy gould less than a year from now the worl...,125901
2009,i wrote a letter last week talking about the...,138005
2010,sadly in the next 18 minutes when i do our c...,136148
2011,ten years ago exactly i was in afghanistan i...,131579
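
One caveat about these counts: `split(' ')` treats every single space as a separator, so runs of spaces (visible in the transcripts around the parentheticals) add empty strings to the total. A minimal sketch of the difference on a made-up string; the numbers in the table above are left as they were computed:

```python
# Made-up string with runs of four spaces, as seen around parentheticals
sample = "music" + " " * 4 + "music ends" + " " * 4 + "applause"

naive = len(sample.split(' '))  # empty strings between repeated spaces are counted
robust = len(sample.split())    # splits on any run of whitespace

print(naive, robust)
```

Calling `split()` with no argument is the more conservative count if the exact totals matter later.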


## Vectorizing

Then we instantiate our term frequency vectorizer and turn it loose on our pseudo-documents.

Note that in creating a list from the `years` series we return to the texts with their punctuation intact, which our `remove_parens` function needs in order to do its job. The difference amounts to almost a thousand words in the resulting matrix.

In [47]:
# items() replaces iteritems(), which was removed in pandas 2.0
texts = [ value for index, value in years.items() ]
year_labels = [ index for index, value in years.items() ]

# This just checks our results
print(len(texts), texts[0][0:50], year_labels[0:5])

16   What I want to talk about is, as background, is  [2002, 2003, 2004, 2005, 2006]


In [43]:
vec = CountVectorizer(preprocessor = remove_parens, min_df = 2)
word_count_vector=vec.fit_transform(texts)
word_count_vector.shape

(16, 21723)
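
Two settings in the cell above are easy to overlook: passing a `preprocessor` replaces scikit-learn's default lowercasing step, and `min_df = 2` drops any term that appears in fewer than two documents. A minimal sketch on three made-up documents, with a hypothetical `drop_laughter` function standing in for `remove_parens`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, not taken from the TED transcripts
docs = [
    "Data matters (Laughter) to everyone",
    "data matters to some",
    "Music matters",
]

def drop_laughter(text):
    # A custom preprocessor replaces the default lowercasing step,
    # so lowercasing has to happen here if it is wanted
    return text.lower().replace("(laughter)", " ")

vec_demo = CountVectorizer(preprocessor=drop_laughter, min_df=2)
counts = vec_demo.fit_transform(docs)
vocab = sorted(vec_demo.vocabulary_)
print(vocab, counts.shape)
```

Only the terms found in at least two of the three documents survive, which is why terms unique to a single year never reach the matrix above.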

In [46]:
X = vec.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
term_matrix = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
term_matrix.shape

(16, 21723)

In [51]:
term_matrix['year'] = year_labels
term_matrix.set_index('year', inplace = True)

In [52]:
word_df = term_matrix.transpose()
word_df.head()

year,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
00,0,0,0,0,0,2,0,1,0,0,1,0,0,0,0,0
000,43,54,61,66,62,81,100,73,87,67,52,112,65,74,80,99
000th,0,0,0,1,0,1,0,0,0,0,0,2,1,0,1,0
01,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0
02,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0
