# Into the Matrix

## Build

In this first section we (1) filter out all but the TED main talks, (2) group those talks by year, and then (3) count terms for each year. The process is in the following subsections:

1. In [Imports and Data](#Imports-and-data) we load the data, filter for the main TED talks, and then by way of inspection, count the number of talks available for each year. As it turns out, the first few years do not have many talks -- the first three years have only one talk each -- and so we drop those years subsequently.
2. In [Create "Texts" for Each Year](#Create-texts-for-each-year), we deploy **pandas**' `groupby` method to create a series with each year as the index and all the texts for that year as the value.
3. In [Clean the Texts](#Clean-the-texts), we attempt to remove elements that do not belong to the texts proper, such as speaker names and parentheticals. This step is not yet complete.
4. We then do [A Quick Word Count for Each Year](#A-quick-word-count-for-each-year) just to check what our numbers are looking like.
5. Finally, we [Vectorize the Pseudo-Texts](#Vectorize-the-pseudo-texts), creating a dataframe with the years as rows and the terms as columns. We then transpose it so that the terms are the rows and the years the columns, making it easier, we hope, to "see" trends.

### Imports and Data

In [1]:
# Imports
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

After loading the data, we print the distinct values of the `Set` column, which is the column that distinguishes the main TED events from the various extra events. We are going to focus on the main events for the time being.

In [2]:
# Load the Data
dfAll = pd.read_csv('../output/TEDall.csv')

# Remind ourselves what the terms are to distinguish
# between TED main talks and all the other talks
print(set(dfAll.Set.tolist()))

{'plus', 'only'}


In [3]:
# Filter the dataframe to just the TED main talks:
main = dfAll[dfAll['Set']=='only']
# main.shape
main['presented'].value_counts().sort_index(ascending=True)

1984     1
1990     1
1994     1
1998     6
2001     3
2002    28
2003    34
2004    31
2005    36
2006    43
2007    68
2008    56
2009    81
2010    68
2011    70
2012    65
2013    76
2014    84
2015    75
2016    75
2017    90
Name: presented, dtype: int64

### Create "Texts" for Each Year

It looks like there's not much point in including the first five years on record: together they total 12 talks, fewer than half of the 28 talks for 2002 alone.

The easiest way to proceed is:

1. Concatenate all the texts of the talks into one big pseudo-document for each year
2. Drop the first five years

We start by concatenating all the texts of the talks into a pandas series with the years as index. In pandas you can [concatenate strings][] within groups defined by another column: here we are *grouping by* the year a talk was given, which is `presented` in our dataset. (We call the initial series `all_years` so that the edited series can simply be `years`.)

[concatenate strings]: https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby
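
As a minimal sketch of the grouping step, here is the same pattern on a toy dataframe. The column names follow the notebook, but the rows are invented, and the sketch joins with a space rather than a comma, which is an assumption about the separator and not what the cell below does.

```python
import pandas as pd

# Toy stand-in for the TED dataframe; the rows are invented for illustration.
toy = pd.DataFrame({
    "presented": [2002, 2002, 2003],
    "text": ["first talk", "second talk", "third talk"],
})

# One string per year: every talk text for that year joined with a space.
per_year = toy.groupby("presented")["text"].apply(" ".join)

print(per_year.loc[2002])
print(per_year.index.tolist())
```

The result is a series indexed by year, which is the shape `all_years` has before the early years are dropped.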

In [4]:
# Concatenate all the texts of the talks into one big pseudo-document for each year
all_years = main.groupby(['presented'])['text'].apply(lambda x: ','.join(x))

# Drop the first five years
years = all_years.drop([1984, 1990, 1994, 1998, 2001])

In [5]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for index, value in years.items():
    print(f'{index}: {value[0:60]}')

2002:   What I want to talk about is, as background, is the idea t
2003:   You know, one of the intense pleasures of travel and one o
2004:   (Music)    (Music ends)    (Applause)    Thank you!    (Ap
2005:   My name is Lovegrove. I only know nine Lovegroves, two of 
2006:   Thank you so much, Chris. And it's truly a great honor to 
2007:   I have all my life wondered what "mind-boggling" meant. Af
2008:   Roy Gould: Less than a year from now, the world is going t
2009:   I wrote a letter last week talking about the work of the f
2010:   Sadly, in the next 18 minutes when I do our chat, four Ame
2011:   Ten years ago exactly, I was in Afghanistan. I was coverin
2012:   Let me begin with four words that will provide the context
2013:   What is going to be the future of learning?    I do have a
2014:   Chris Anderson: The rights of citizens, the future of the 
2015:   We are built out of very small stuff, and we are embedded 
2016:   So a while ago, I tried an experiment. For one year, I

### Clean the Texts

As a test case for later work, and without being terribly important for the current experiment, we are going to clean our texts using two functions: one to remove speakers and one to remove parentheticals.

In the cell below we create our two lists, speakers and parentheticals, and then create two separate functions and then a function to combine them. 

Our first step is to create the two lists of strings we want removed:

* `parentheticals` is from previous experiments
* `speakers` is probably unpythonic in its expression but it works

In [6]:
# Raw strings keep the escaped parentheses from triggering
# invalid-escape warnings in recent Python versions
parentheticals = [ r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)", 
                  r"\(laughs\)", r"\(applause ends\)", r"\(audio\)", r"\(singing\)", 
                  r"\(music ends\)", r"\(cheers\)", r"\(cheering\)", r"\(recording\)", 
                  r"\(beatboxing\)", r"\(audience\)", r"\(guitar strum\)", 
                  r"\(clicks metronome\)", r"\(sighs\)", r"\(guitar\)", r"\(marimba sounds\)", 
                  r"\(drum sounds\)" ]
len(parentheticals)

20

In [7]:
speakers = dfAll.speaker_1.tolist() + dfAll.speaker_2.tolist() + dfAll.speaker_3.tolist() + dfAll.speaker_4.tolist()
print(speakers[0:10])

['Al Gore', 'David Pogue', 'Majora Carter', 'Ken Robinson', 'Hans Rosling', 'Tony Robbins', 'Joshua Prince-Ramus', 'Julia Sweeney', 'Rick Warren', 'Dan Dennett']


In [8]:
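def remove_parens(text):
    # Lowercase once, then blank out each parenthetical marker
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text)
    return new_text

def remove_speaker_names(text):
    temp_text = text
    for name in speakers:
        # Skip missing values (NaN floats) from the empty speaker columns
        if isinstance(name, str):
            # Escape the name so it is matched literally
            temp_text = re.sub(re.escape(name), ' ', temp_text)
    return temp_text

def clean_text(text):
    # Remove speaker names first, while the original capitalization is intact
    return remove_parens(remove_speaker_names(text))

`remove_speaker_names` kept throwing a `TypeError`, most likely because talks without a second, third, or fourth speaker leave `NaN` floats in the `speakers` list, and `re.sub` cannot use a float as a pattern. The version above skips non-string entries and escapes each name. Until it is checked against the full data, the vectorizer below still uses `remove_parens` alone.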
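
One possible way to make speaker removal more robust is sketched below on made-up data. It assumes that missing speakers show up as `NaN` floats and that speaker turns are marked as `Name:` in the transcripts, as in the 2008 and 2014 previews above. The names `raw_speakers` and `strip_speaker_labels` are hypothetical and not part of the notebook.

```python
import re

# Hypothetical speaker list; float("nan") stands in for the missing values
# pandas produces when a talk has no second, third, or fourth speaker.
raw_speakers = ["Al Gore", "Roy Gould", float("nan"), "Chris Anderson"]

# Keep only real strings and escape them so names are matched literally.
names = [re.escape(s) for s in raw_speakers if isinstance(s, str)]

# One compiled alternation: a name followed by the colon that marks a turn.
speaker_rgx = re.compile(r"(?:" + "|".join(names) + r"):")

def strip_speaker_labels(text):
    """Replace 'Name:' speaker labels with a single space."""
    return speaker_rgx.sub(" ", text)

sample = "Roy Gould: Less than a year from now, Chris Anderson: thank you."
cleaned_sample = strip_speaker_labels(sample)
print(cleaned_sample)
```

Matching only `Name:` labels is a deliberate narrowing: it leaves a name alone when a talk merely mentions that person, at the cost of missing any label that is formatted differently.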

### A Quick Word Count for Each Year

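Before we go any further, let's get a quick word count for each of our years. In the cell below we convert the series to a dataframe, lowercase the texts, strip the punctuation, and count the space-separated tokens in each year's pseudo-document.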

In [9]:
# Convert our series to a dataframe to make it easier to work in place:
dfYears = years.to_frame()

# Lowercase our texts
dfYears = dfYears.apply(lambda x: x.astype(str).str.lower())

# Remove everything that isn't a word character, whitespace, or a plus sign
dfYears = dfYears.replace(r'[^\w\s\+]', '', regex = True)

# Split on spaces and then count the length of the resulting list
dfYears['word_count'] = dfYears.text.apply(lambda x: len(str(x).split(' ')))

# See the results
dfYears.head(16)

presented,text,word_count
2002,what i want to talk about is as background i...,82116
2003,you know one of the intense pleasures of tra...,93898
2004,music music ends applause thank you...,89746
2005,my name is lovegrove i only know nine lovegr...,110717
2006,thank you so much chris and its truly a grea...,118346
2007,i have all my life wondered what mindbogglin...,149144
2008,roy gould less than a year from now the worl...,125901
2009,i wrote a letter last week talking about the...,138005
2010,sadly in the next 18 minutes when i do our c...,136148
2011,ten years ago exactly i was in afghanistan i...,131579
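
A caveat on these counts, sketched on an invented string: `split(' ')` cuts at every single space, so runs of spaces like the ones visible in the transcript previews produce empty strings that are counted as words. The figures above may therefore overstate the true totals somewhat. Calling `split()` with no argument is one alternative.

```python
text = "  what i want   to talk about  "

# split(' ') cuts at every single space, so runs of spaces yield empty strings
naive = text.split(' ')

# split() with no argument collapses any run of whitespace and trims the ends
tokens = text.split()

print(len(naive), len(tokens))
print(tokens)
```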


### Vectorize the Pseudo-Texts

Then we instantiate our term-count vectorizer and turn it loose on our pseudo-documents. Note that in creating a list from the `years` series we return to the texts with their punctuation intact, which `remove_parens` needs in order to do its job. The difference is almost a thousand words in the resulting matrix.
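
Why the punctuation matters can be shown with a small sketch. The sentence is invented, the pattern is one entry in the style of the `parentheticals` list, and the stripping regex is the one from the word-count cell.

```python
import re

# A single pattern in the style of the `parentheticals` list
pattern = r"\(laughter\)"

raw = "And then it fell over. (Laughter) Anyway."

# Same punctuation-stripping regex as the word-count cell
stripped = re.sub(r"[^\w\s\+]", "", raw.lower())

# With the parentheses intact, the marker is removed...
from_raw = re.sub(pattern, " ", raw, flags=re.IGNORECASE)

# ...but once punctuation is gone, "laughter" survives as an ordinary word
from_stripped = re.sub(pattern, " ", stripped, flags=re.IGNORECASE)

print(from_raw)
print(from_stripped)
```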

In [10]:
# CountVectorizer expects a list, so we create a list
# (Series.items() replaces iteritems(), which was removed in pandas 2.0)
texts = [ value for index, value in years.items() ]

# We are going to bring our years back to the resulting term matrix below, 
# so while we are creating lists from our series, let's grab those years
# (And yes you can create two lists from one list comprehension, but don't.)
year_labels = [ index for index, value in years.items() ]

# This just checks our results
print(len(texts), texts[0][0:50], year_labels[0:5])

16   What I want to talk about is, as background, is  [2002, 2003, 2004, 2005, 2006]


In [11]:
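# The usual incantation (minus the desired speaker removal for now).
# stop_words is converted to a list because recent scikit-learn
# versions reject a set for this parameter:
vec = CountVectorizer(preprocessor = remove_parens, 
                      stop_words = list(stop_words),
                      min_df = 2)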
word_count_vector=vec.fit_transform(texts)
word_count_vector.shape

(16, 21582)

Some notes on how changes to the vectorizer parameters affect the overall word count:
```
min_df = 0, max_df = 1.00  ==> 38974
min_df = 1, max_df = 1.00  ==> 38974
min_df = 2, max_df = 1.00  ==> 21582 
min_df = 2, max_df = 0.90  ==> 19153
```
Note that `max_df` defaults to `1.0`, meaning all documents, which is why it does not appear in the code cell above.

We are only dealing with 16 "texts" here, one for each year, so each year accounts for 6.25% of the documents: every 6.25-point step in a fractional threshold adds or drops one year. It might be interesting to find the words that occur in only one year. The parameter for that should be `max_df = 1`: scikit-learn treats an integer as an absolute document count and a float as a proportion, so the integer `1` is distinct from the default `1.0`. We will try momentarily.
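
The integer-versus-float distinction can be checked on a toy corpus before running it on the real one. This is a sketch with three invented documents, where "talk" appears in all of them and every other term in exactly one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny invented "years"
docs = ["talk mycelium", "talk telomeres", "talk fonio"]

# Integer 1: keep terms found in at most one document
only_once = CountVectorizer(max_df=1).fit(docs)

# Float 1.0 (the default): keep terms found in up to 100% of documents
everything = CountVectorizer(max_df=1.0).fit(docs)

# Integer 2: keep terms found in at least two documents
shared = CountVectorizer(min_df=2).fit(docs)

print(sorted(only_once.vocabulary_))
print(sorted(everything.vocabulary_))
print(sorted(shared.vocabulary_))
```

The first vocabulary contains only the one-document terms, the second contains all four terms, and the third contains only "talk".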

In [12]:
# Create a dataframe from the resulting array
# (get_feature_names_out replaces get_feature_names in scikit-learn 1.0 and later)
X = vec.fit_transform(texts)
term_matrix = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
term_matrix.shape

(16, 21582)

In [13]:
# Replace the numbered index with the list of years
term_matrix['year'] = year_labels
term_matrix.set_index('year', inplace = True)

# Transpose the dataframe so that terms are rows and years columns
term_df = term_matrix.transpose()
term_df.reset_index(inplace=True)
term_df = term_df.rename(columns={'index': 'term'})
term_df.head()

year,term,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,00,0,0,0,0,0,2,0,1,0,0,1,0,0,0,0,0
1,000,43,54,61,66,62,81,100,73,87,67,52,112,65,74,80,99
2,000th,0,0,0,1,0,1,0,0,0,0,0,2,1,0,1,0
3,01,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0
4,02,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0


In [14]:
# Let's save this dataframe 
# ==> Commented out so re-running notebook doesn't result in new file
# term_df.to_csv('../output/YTM_min1-max100.csv')

## One Year Wonders

In [15]:
# max_df = 1 (an integer) keeps only terms that occur in a single year
one_year_vec = CountVectorizer(preprocessor = remove_parens, 
                      stop_words = list(stop_words),
                      max_df = 1)
one_year = one_year_vec.fit_transform(texts)
one_year.shape

(16, 17392)

In [16]:
# Convert our sklearn array into a pandas dataframe
wonders = pd.DataFrame(one_year.toarray(), columns=one_year_vec.get_feature_names_out())
wonders['year'] = year_labels
wonders.set_index('year', inplace = True)

# Transpose the dataframe and make sure our index is named
wonders = wonders.transpose()
wonders.index.name = 'term'

# Add a column that sums the row counts...
wonders['sum'] = wonders.sum(axis=1)

# ...so we can sort by words with the greatest frequency
wonders.sort_values(by='sum', ascending=False, inplace = True)

## Beginning to work on code to highlight cell in which 
## non-zero value is located to make it easier to find the year
#
# def highlight_max(s):
#     '''
#     highlight the maximum in a Series yellow.
#     '''
#     is_max = s == s.max()
#     return ['background-color: yellow' if v else '' for v in is_max]

wonders.head(40)
# wonders.style.apply(highlight_max)

term,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,sum
bf,75,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,75
gk,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,46,46
telomeres,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,40,40
abed,0,0,0,0,0,0,0,0,0,0,0,39,0,0,0,0,39
indus,0,0,0,0,0,0,0,0,0,33,0,0,0,0,0,0,33
fonio,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,32,32
teszler,0,0,0,0,0,30,0,0,0,0,0,0,0,0,0,0,30
ems,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,29,29
mycelium,0,0,0,0,0,0,29,0,0,0,0,0,0,0,0,0,29
edi,0,0,0,0,0,0,0,0,0,0,0,0,27,0,0,0,27
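
As an alternative to highlighting cells, the year for each term can be pulled into its own column. The sketch below uses an invented miniature of the `wonders` table and assumes, as holds for `max_df = 1`, that each term is non-zero in exactly one year.

```python
import pandas as pd

# Invented miniature of the `wonders` table: terms as rows, years as columns
toy = pd.DataFrame(
    {2002: [75, 0, 0], 2008: [0, 29, 0], 2017: [0, 0, 46]},
    index=pd.Index(["bf", "mycelium", "gk"], name="term"),
)
toy["sum"] = toy.sum(axis=1)

# Each term is non-zero in exactly one year, so the column holding the row
# maximum (ignoring the 'sum' column) is that year.
toy["year"] = toy.drop(columns="sum").idxmax(axis=1)

print(toy[["sum", "year"]].sort_values("sum", ascending=False))
```

`idxmax(axis=1)` returns the column label of the largest value in each row, so the same line should work on the full table as long as the `sum` column is excluded first.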
