# Into the Matrix

If you're just here for the term-year matrix, skip the build section below and go straight to the [Analysis](#Analysis) section, where you can load the results of the build from a CSV.

See: [Hedonometer](https://hedonometer.org/timeseries/en_all/).

## Build

In this first section we (1) filter out all but the TED main talks, (2) group those talks by year, and then (3) count terms for each year. The process is in the following subsections:

1. In [Imports and Data](#Imports-and-data) we load the data, filter for the main TED talks, and then by way of inspection, count the number of talks available for each year. As it turns out, the first few years do not have many talks -- the first three years have only one talk each -- and so we drop those years subsequently.
2. In [Create "Texts" for Each Year](#Create-texts-for-each-year), we deploy **pandas**' `groupby` method to create a series with each year as the index and all the texts for that year as the value.
3. In [Clean the Texts](#Clean-the-texts), we attempt to remove a number of elements that do not belong in the texts proper, such as speaker names and parentheticals. This step is not yet complete.
4. We then do [A Quick Word Count for Each Year](#A-quick-word-count-for-each-year) just to check what our numbers are looking like.
5. Finally, we [Vectorize the Pseudo-Texts](#Vectorize-the-pseudo-texts), creating a dataframe with the years as rows and the terms as columns. We then transpose it so that the terms are the rows and the years are the columns, making it easier, we hope, to "see" trends.

### Imports and Data

In [1]:
# Imports
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

After loading the data, we print the distinct values of the `Set` column, which distinguishes between the main TED events and the various extra events. We are going to focus on the main events for the time being.

In [2]:
# Load the Data
dfAll = pd.read_csv('../output/TEDall.csv')

# Remind ourselves what the terms are to distinguish
# between TED main talks and all the other talks
print(set(dfAll.Set.tolist()))

{'only', 'plus'}


In [3]:
# Filter the dataframe to just the TED main talks:
main = dfAll[dfAll['Set']=='only']
# main.shape
main['presented'].value_counts().sort_index(ascending=True)

1984     1
1990     1
1994     1
1998     6
2001     3
2002    28
2003    34
2004    31
2005    36
2006    43
2007    68
2008    56
2009    81
2010    68
2011    70
2012    65
2013    76
2014    84
2015    75
2016    75
2017    90
Name: presented, dtype: int64

### Create "Texts" for Each Year

It looks like there's not much point in including the first five years on record: together they account for only 12 talks, fewer than half of the 28 for 2002 alone.

The easiest way to proceed is:

1. Concatenate all the texts of the talks into one big pseudo-document for each year
2. Drop the first five years

We start by concatenating all the texts of the talks into a pandas series with the years as index. In pandas you can [concatenate strings][] within groups defined by some other criterion: here we are *grouping by* the year a talk was given, which is `presented` in our dataset. (We use the `all_years` variable initially so that we can call the edited series simply `years`.)

[concatenate strings]: https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby
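
As a minimal sketch of what this grouping does, here is the same pattern on a toy dataframe. The column names mirror ours, but the rows and the `toy` names are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the talks dataframe (rows are made up for illustration)
toy = pd.DataFrame({
    "presented": [2002, 2002, 2003],
    "text": ["first talk", "second talk", "third talk"],
})

# One pseudo-document per year: join every text that shares a year
toy_years = toy.groupby("presented")["text"].apply(",".join)

print(toy_years.loc[2002])
print(toy_years.loc[2003])
```

The result is a series indexed by year, so dropping sparse years afterwards is just a matter of dropping index labels.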

In [4]:
# Concatenate all the texts of the talks into one big pseudo-document for each year
all_years = main.groupby(['presented'])['text'].apply(lambda x: ','.join(x))

# Drop the first five years
years = all_years.drop([1984, 1990, 1994, 1998, 2001])

In [5]:
for index, value in years.items():
    print(f'{index}: {value[0:60]}')

2002:   What I want to talk about is, as background, is the idea t
2003:   You know, one of the intense pleasures of travel and one o
2004:   (Music)    (Music ends)    (Applause)    Thank you!    (Ap
2005:   My name is Lovegrove. I only know nine Lovegroves, two of 
2006:   Thank you so much, Chris. And it's truly a great honor to 
2007:   I have all my life wondered what "mind-boggling" meant. Af
2008:   Roy Gould: Less than a year from now, the world is going t
2009:   I wrote a letter last week talking about the work of the f
2010:   Sadly, in the next 18 minutes when I do our chat, four Ame
2011:   Ten years ago exactly, I was in Afghanistan. I was coverin
2012:   Let me begin with four words that will provide the context
2013:   What is going to be the future of learning?    I do have a
2014:   Chris Anderson: The rights of citizens, the future of the 
2015:   We are built out of very small stuff, and we are embedded 
2016:   So a while ago, I tried an experiment. For one year, I

### Clean the Texts

As a test case for later work, and not critical to the current experiment, we are going to clean our texts using two functions: one to remove speaker names and one to remove parentheticals. A third function combines the two.

Our first step is to create the two lists of strings we want removed:

* `parentheticals` is from previous experiments
* `speakers` is probably unpythonic in its expression but it works

In [None]:
# Raw strings keep the regex escapes (\( and \)) from being read as string escapes
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]
len(parentheticals)

In [None]:
speakers = dfAll.speaker_1.tolist() + dfAll.speaker_2.tolist() + dfAll.speaker_3.tolist() + dfAll.speaker_4.tolist()
print(speakers[0:10])

In [None]:
def remove_parens(text):
    new_text = text
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text.lower(), flags=re.IGNORECASE)
    return new_text

def remove_speaker_names(text):
    temp_text = text
    for rgx_match in speakers:
        temp_text = re.sub(rgx_match, ' ', temp_text)
    return temp_text

def clean_text(text):
    the_text = text
    cleaned = remove_parens(remove_speaker_names(the_text))
    return cleaned

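`remove_speaker_names` keeps throwing a `TypeError`. The likely cause is that the `speaker_2` through `speaker_4` columns are empty for most talks, so the `speakers` list contains `NaN` values, which `re.sub` cannot use as patterns.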
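
One possible workaround, sketched here on the assumption that the error comes from missing values in the speaker columns, is to drop those values and escape each name before building a pattern. The toy dataframe, the name "Jane Doe", and the `remove_speaker_names_sketch` function are hypothetical and stand in for the real data and function.

```python
import re
import pandas as pd

# Toy stand-in for the speaker columns; None marks a talk with no second speaker
toy = pd.DataFrame({
    "speaker_1": ["Roy Gould", "Chris Anderson"],
    "speaker_2": ["Jane Doe", None],
})

# Stack the columns into one series, drop missing values, and de-duplicate
names = pd.concat([toy["speaker_1"], toy["speaker_2"]]).dropna().unique()

# Escape each name so characters such as "." are matched literally,
# then build one alternation pattern instead of looping over names
name_pattern = re.compile("|".join(re.escape(name) for name in names))

def remove_speaker_names_sketch(text):
    """Replace any known speaker name with a single space."""
    return name_pattern.sub(" ", text)

cleaned = remove_speaker_names_sketch("Roy Gould: Less than a year from now")
print(cleaned)
```

Compiling a single pattern also avoids running one `re.sub` pass per speaker over each year's pseudo-document.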

### A Quick Word Count for Each Year

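Before we go any further, let's just get a quick word count for each of our years. The cell below converts the series to a dataframe, lowercases the texts, strips punctuation, and then counts the words in each year's pseudo-document.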

In [None]:
# Convert our series to a dataframe to make it easier to work in place:
dfYears = years.to_frame()

# Lowercase our texts
dfYears = dfYears.apply(lambda x: x.astype(str).str.lower())

# Remove everything that isn't a word character, whitespace, or a plus sign
dfYears = dfYears.replace(r'[^\w\s\+]', '', regex = True)

# Split on runs of whitespace and then count the length of the resulting list
dfYears['word_count'] = dfYears.text.apply(lambda x: len(str(x).split()))

# See the results
dfYears.head(16)
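
When counting words by splitting, the choice of separator matters, because the transcripts contain runs of spaces (see the previews printed earlier). The following sketch uses an invented fragment to show the difference between splitting on a single space and splitting on any whitespace.

```python
# A made-up transcript fragment with runs of spaces like those in the previews
sample = "  (Music)    Thank you!    "

# Splitting on a single space yields an empty string for every extra space
naive_count = len(sample.split(" "))

# Splitting with no argument collapses runs of whitespace into one separator
token_count = len(sample.split())

print(naive_count, token_count)
```

The first count grows with every stray space, while the second reflects only the tokens that are actually there.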

### Vectorize the Pseudo-Texts

Then we instantiate our term frequency vectorizer and turn it loose on our pseudo-documents.

Note that in creating a list from the `years` series we are returning to the texts with punctuation, which we need for our `remove_parens` function to do its job. The difference is almost a thousand words in the resulting matrix.

In [None]:
# CountVectorizer expects a list, so we create a list
texts = [ value for index, value in years.items() ]

# We are going to bring our years back to the resulting term matrix below, 
# so while we are creating lists from our series, let's grab those years
# (And yes you can create two lists from one list comprehension, but don't.)
year_labels = [ index for index, value in years.items() ]

# This just checks our results
print(len(texts), texts[0][0:50], year_labels[0:5])

In [None]:
# The usual incantation (minus the desired speaker removal for now):
vec = CountVectorizer(preprocessor = remove_parens, min_df = 1, max_df = 1.0)
word_count_vector=vec.fit_transform(texts)
word_count_vector.shape

Some notes on how changes to the vectorizer parameters affect the number of distinct terms (columns) in the matrix:
```
min_df = 0, max_df = n/a   ==> 39118
min_df = 1, max_df = 1.0   ==> 39118
min_df = 2, max_df = n/a   ==> 21723
min_df = 2, max_df = 1.00  ==> 21723 (This makes sense, but I was just checking.)
min_df = 2, max_df = 0.99  ==> 19844 ("global" and "climate" seem to disappear?!)
min_df = 2, max_df = 0.95  ==> 19844
min_df = 2, max_df = 0.90  ==> 19158
```
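
A likely explanation for the drop at `max_df = 0.99` is that a float `max_df` removes every term whose document frequency exceeds that proportion. With only 16 yearly pseudo-documents (assuming 2002 through 2017), any term that occurs in every year has a document frequency of 1.0 and is cut, which would account for "global" and "climate" if they appear in all years. The sketch below shows the effect on a toy corpus; the documents are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy "years"; "climate" occurs in every one of them
toy_docs = [
    "climate models and oceans",
    "climate policy and energy",
    "climate art and music",
]

# max_df=1.0 keeps terms found in up to 100% of documents
keep_all = CountVectorizer(max_df=1.0).fit(toy_docs)

# max_df=0.99 drops terms found in more than 99% of documents,
# which with three documents means terms found in all three
drop_common = CountVectorizer(max_df=0.99).fit(toy_docs)

print("climate" in keep_all.vocabulary_)
print("climate" in drop_common.vocabulary_)
print(sorted(drop_common.vocabulary_))
```

On this reading, `max_df = 0.99` and `max_df = 0.95` give the same count because both thresholds fall between 15 and 16 documents, so only terms present in all 16 years are removed.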

In [None]:
# Create a dataframe from the resulting array
X = vec.fit_transform(texts)
# get_feature_names_out() requires scikit-learn >= 1.0
term_matrix = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
term_matrix.shape

In [None]:
term_matrix['year'] = year_labels
term_matrix.set_index('year', inplace = True)

In [None]:
term_df = term_matrix.transpose()
term_df.reset_index(inplace=True)
term_df = term_df.rename(columns={'index': 'term'})
term_df.head()
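
To make the reshaping concrete, here is a small sketch of the same transpose, reset, and rename sequence on a toy year-by-term matrix. The counts and the `toy_` names are invented for illustration.

```python
import pandas as pd

# Toy year-by-term counts (values invented for illustration)
toy_matrix = pd.DataFrame(
    {"climate": [3, 5], "music": [2, 0]},
    index=pd.Index([2002, 2003], name="year"),
)

# Transpose so terms are rows and years are columns,
# then move the terms out of the index into a "term" column
toy_terms = (
    toy_matrix.transpose()
    .reset_index()
    .rename(columns={"index": "term"})
)

print(toy_terms)
```

Each row now traces one term across the years, which is the shape we want for spotting trends.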

In [None]:
# Let's save this dataframe 
# ==> Commented out so re-running notebook doesn't result in new file
# term_df.to_csv('../output/YTM_min1-max100.csv')