# Trends

We are going to re-import the texts here, organized by year. First we will load the combined dataset and look for overall trends; then we will load the `only` and `plus` datasets separately to see whether there are any differences worth noting. Our goal is to see which words trend, not only to learn about TED talks as a developing collection of events but also, possibly, to compare the trends glimpsed here against trends from the BYU corpus or Google Trends.

<h1><span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports-and-Data-Load" data-toc-modified-id="Imports-and-Data-Load-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports and Data Load</a></span></li><li><span><a href="#Working-with-the-Years" data-toc-modified-id="Working-with-the-Years-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Working with the Years</a></span><ul class="toc-item"><li><span><a href="#KK-notes:" data-toc-modified-id="KK-notes:-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>KK notes:</a></span></li></ul></li><li><span><a href="#Finding-Words-with-Limited-Spans-of-Usage" data-toc-modified-id="Finding-Words-with-Limited-Spans-of-Usage-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Finding Words with Limited Spans of Usage</a></span></li><li><span><a href="#Trends" data-toc-modified-id="Trends-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Trends</a></span></li></ul></div>

## Imports and Data Load

In [1]:
import re, pandas as pd, matplotlib.pyplot as plt, numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
%matplotlib inline

For whatever reason, changing the figure size only ever works for me after I have created the first graph. For that reason, the setting sits in a separate cell so that I can run it later:

In [3]:
plt.rcParams["figure.figsize"] = (20,10)

In [5]:
df = pd.read_csv('../output/TEDall_years.csv', index_col=0)
df.shape

(1747, 5)

In [4]:
df.head()

Unnamed: 0,public_url,event,published,text,year
0,https://www.ted.com/talks/al_gore_on_averting_...,TED2006,6/27/06,"Thank you so much, Chris. And it's truly a g...",2006
1,https://www.ted.com/talks/david_pogue_says_sim...,TED2006,6/27/06,"(Music: ""The Sound of Silence,"" Simon & Garf...",2006
2,https://www.ted.com/talks/majora_carter_s_tale...,TED2006,6/27/06,If you're here today — and I'm very happy th...,2006
3,https://www.ted.com/talks/ken_robinson_says_sc...,TED2006,6/27/06,Good morning. How are you? (Laughter) ...,2006
4,https://www.ted.com/talks/hans_rosling_shows_t...,TED2006,6/27/06,"About 10 years ago, I took on the task to te...",2006


In [6]:
df.dtypes

public_url    object
event         object
published     object
text          object
year           int64
dtype: object

## Working with the Years

Okay, now the data analysis begins: we sort the talks into year bins, count terms within each bin, and then determine the best way to find out which words, if any, show notable dynamism.

We also have to decide how to define dynamism. Google Trends has a formula for word frequency:

```
                   count of word(1)
                  ------------------
             count of all words in a year
```

But that is a tool where one searches for a term and gets a graph back. I think we want to find some way to flag algorithmically the terms that show particular kinds of dynamics.

Our hypothesis here is that TED events will likely have some topicality, and so we will see single-event dynamism, but we also probably want to find words that rise and fall over two years or more.
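One hedged way to make "flagging" concrete: compute each word's yearly relative frequency with the ratio above, then flag words whose peak year stands well above their own average. The toy counts, the helper names `relative_frequency` and `flag_spikes`, and the threshold of 2.0 are all illustrative assumptions, not values drawn from the TED data.

```python
# Toy yearly counts (hypothetical numbers, for illustration only).
counts = {
    2002: {"climate": 2, "people": 50, "web": 8},
    2003: {"climate": 3, "people": 55, "web": 2},
    2004: {"climate": 30, "people": 45, "web": 5},
}

def relative_frequency(counts):
    """Divide each word's count by the total word count for its year."""
    freqs = {}
    for year, words in counts.items():
        total = sum(words.values())
        freqs[year] = {word: n / total for word, n in words.items()}
    return freqs

def flag_spikes(freqs, threshold=2.0):
    """Flag words whose peak yearly frequency exceeds threshold times their mean."""
    vocabulary = {word for words in freqs.values() for word in words}
    flagged = []
    for word in sorted(vocabulary):
        series = [freqs[year].get(word, 0.0) for year in sorted(freqs)]
        mean = sum(series) / len(series)
        if mean > 0 and max(series) > threshold * mean:
            flagged.append(word)
    return flagged

freqs = relative_frequency(counts)
flagged = flag_spikes(freqs)
print(flagged)
```

A peak-over-mean rule like this would only catch single-year spikes; words that rise and fall over several years would need a different test.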

We can make a quick check to see how many talks we have for each year. As `df.groupby('year').size()` reveals, the first year for which we have a substantial number of talks is 2002. We can probably safely start our analysis there.  

In [None]:
df.groupby('year').size()

We can also, somewhat gratuitously, visualize this:

In [None]:
df.groupby('year').size().plot()

Our next step is to filter by year. A year with only a handful of talks (1998, for example, has 6) is a good place to begin building our code; the cell below uses 2001:

In [None]:
year_2001 = df.loc[df['year'] == 2001]
year_2001.head()

What we want to do is count all the words in a given year and then be able to compare across years. So, what we want is something like this:

1. count all the words in a year
2. divide each word's count by the year's total count to get its frequency for the year[^1]
3. create a pandas dataframe (?) that looks like this:

```
| word  | 2002  | 2003  | 2004  |  ...  |
|-------|-------|-------|-------|-------|
| other | 0.007 | 0.007 |0.0008 | 0.001 |
| stuff | 0.001 | 0.002 | 0.002 | 0.003 |
```

So, do we create 16 lists, each with all the words for a year, and then write the data back into a dataframe, or is there a way to do this within **pandas**? (I'm trying to grok "the pandas way" as much as possible.)
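One possible "pandas way" through the three steps above, offered as a minimal sketch rather than the approach this notebook settles on: group the talks by year, count words per year with `collections.Counter`, and let the `DataFrame` constructor align the counts into a table. The toy frame `toy` and the naive whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the talks dataframe (hypothetical texts).
toy = pd.DataFrame({
    "year": [2002, 2002, 2003],
    "text": ["other stuff other", "stuff", "other other other stuff"],
})

# Step 1: one Counter of word counts per year (naive whitespace tokens).
counts_by_year = {
    year: Counter(" ".join(group["text"]).lower().split())
    for year, group in toy.groupby("year")
}

# Steps 2 and 3: words as rows, years as columns,
# with each count divided by its year's total.
counts = pd.DataFrame(counts_by_year).fillna(0)
freq_table = counts / counts.sum(axis=0)
print(freq_table)
```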

[^1]: This is how Google Trends does it. See: Younes, N., & Reips, U. D. (2019). Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms. PloS one, 14(3), e0213554. doi:10.1371/journal.pone.0213554.

The best way to do this, I think, is to feed a function or a for loop a list of years, filter the dataframe by year, and create a giant list of words for each year. We need both the year and a name for that year, so we are going to need to get the years we want and then create an object that pairs the year with the label for the bag of words we are going to create. 

The code below is a little fussy, but it filters out the years we don't want and returns the rest as a sorted list. The dictionary that pairs each year with its bag of words (BoW) comes after that.

In [None]:
# At the heart of this is a lambda function, which allows us to apply a filter 
# the list and sorted functions add a lot of parentheses.
years = sorted(list(filter(lambda year: year > 2001, df.year.unique())))
print(years)

I think what we need to do is create empty dictionaries for each year and then fill those. 

Two levels:

1. Level 1: years to dictionaries.   
   keys = years   
   values = each year's dictionary   
2. Level 2: words to counts.   
   keys = words   
   values = counts   
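The two-level structure can be sketched with `collections.Counter`, which does the counting for us. The texts in `texts_by_year` are made up, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Hypothetical one-string-per-year texts standing in for the joined talks.
texts_by_year = {
    2002: "ideas worth spreading ideas",
    2003: "worth watching",
}

# Level 1: year -> dictionary; level 2: word -> count.
year_words = {
    year: dict(Counter(text.lower().split()))
    for year, text in texts_by_year.items()
}

print(year_words[2002]["ideas"])
```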

### KK notes:
    
https://stackoverflow.com/questions/12453580/concatenate-item-in-list-to-strings

https://codeburst.io/python-basics-11-word-count-filter-out-punctuation-dictionary-manipulation-and-sorting-lists-3f6c55420855

```python
year_words = {}

for year in years[:1]:  # first year only; years[0] is a single int and cannot be iterated
    yeartalks = df.loc[df['year'] == year].text.tolist()
    onetalk = " ".join(yeartalks)
    
    # Make dictionary called word_counts. Choose from codeburst
    
    #year_words[year] = word_counts
    year_words[year] = yeartalks[0]
```

In [None]:
def onetext(label):
    """Join all the talks from one year into a single string."""
    texts = df.loc[df['year'] == label].text.tolist()
    return " ".join(texts)

In [None]:
yeartexts = {year: onetext(year) for year in years}

With all of our texts now in a dictionary, with the year as the key and that year's talks in one long string as the value, we need to create a term frequency matrix with the years as rows and the words as columns. I know we can feed a list of texts into **scikit-learn** and it will build us an array, but we also want to associate those rows with our years...

This seems like a step back, but until I can discover how to send a dictionary to the vectorizer and get named rows, we are going to create two lists:

In [None]:
y, t = zip(*yeartexts.items())

In [None]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer()

# fit the model to the data 
vecs = vectorizer.fit_transform(t)

# check our work
vecs.shape

With the vectors established, we convert them to a **numpy** array and then send that array to a **pandas** dataframe. To get human-readable indices and column headers, we use the `years` list for the index and build a list of terms with `get_feature_names()` for the column headers.

In [None]:
# from sklearn vector thingamabob to numpy array
vec_array = vecs.toarray()

# terms for column headers
# (on scikit-learn >= 1.0, call get_feature_names_out() instead)
feature_names = vectorizer.get_feature_names()

# from array to dataframe
trends = pd.DataFrame(data = vec_array, index = years, columns = feature_names)

# check our work
trends.head(16)

Before we go any further, we will save this to disk:

In [None]:
trends.to_csv('../output/tf_trends.csv', sep=',') # This is a 2.1 MB file.

## Finding Words with Limited Spans of Usage

Or, **words that occur only once, in a small cluster, or sporadically**.

Before we get to the trends across the years, for which we will sum each year and then divide each word's count by that sum, let's just take a look to see if there are words that occur only in one year.

In [4]:
# Re-load the trends dataframe
trends = pd.read_csv('../output/tf_trends.csv', index_col = 0)
trends.head()

Unnamed: 0,00,000,000000004,0000001,000001,00001,000042,0001,00046,000th,...,ālep,čapek,ōfunato,ʾan,ʾilla,ʾilāha,อย,อยman,อร,送你葱
2002,0,43,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2003,0,54,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2004,0,61,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2005,0,138,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2006,0,62,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
cols = trends.columns.tolist()
len(cols)

50096

In [6]:
print(cols[10000:10010])

['consented', 'consenting', 'consequence', 'consequences', 'consequential', 'consequently', 'conservancies', 'conservancy', 'conservation', 'conservationist']


There is a [SO question][] for the solution below.

[SO question]: https://stackoverflow.com/questions/57630072/find-columns-with-only-one-non-zero-value-in-pandas/57630162#57630162

In [None]:
# assumes `df` is the term-count frame with `year` as a regular column, as in the SO answer
df.melt('year').loc[lambda x: x['value'] != 0].groupby('variable')['value'].apply(tuple)
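For a frame shaped like `trends`, with years in the index and words as columns, a hedged alternative is to skip the reshaping and simply count the non-zero years per word. The frame `toy_trends` and its counts are hypothetical; this is a sketch of the idea, not the Stack Overflow solution itself.

```python
import pandas as pd

# Toy term-count matrix: years as rows, words as columns (hypothetical counts).
toy_trends = pd.DataFrame(
    {"people": [5, 7, 6], "segway": [0, 3, 0], "blog": [0, 2, 4]},
    index=[2002, 2003, 2004],
)

# Number of years in which each word has a non-zero count.
years_used = (toy_trends != 0).sum(axis=0)

# Words used in exactly one year, mapped to that year.
single_year_words = toy_trends.loc[:, years_used == 1].idxmax(axis=0)
print(single_year_words)
```

Changing the comparison to `years_used <= 3`, say, would widen the net from single-year words to words with short spans of usage.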

In [None]:
sums = trends.sum(axis=1)
print(type(sums), "\n", sums)

## Mimicking Google Trends

First, we need to sum the words for a given year and then divide each word's count by that sum to determine its normalized frequency for the year.
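On a years-by-words count matrix, that normalization is a row-wise division. A minimal sketch, assuming a toy frame `toy_trends` with made-up counts:

```python
import pandas as pd

# Toy term-count matrix: years as rows, words as columns (hypothetical counts).
toy_trends = pd.DataFrame(
    {"people": [6, 3], "web": [2, 1], "blog": [2, 0]},
    index=[2002, 2003],
)

# Total words counted in each year (one value per row).
year_totals = toy_trends.sum(axis=1)

# Divide every row by its own total; axis=0 aligns the totals with the row index.
normalized = toy_trends.div(year_totals, axis=0)
print(normalized)
```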

We begin by recreating the one text per year upon which we will base our word counts. We are not simply using the `trends` dataframe (above) for the moment because we want to see how we can tweak the counts, perhaps by cutting off words that occur only once and thus won't be a compelling addition to an examination of trends. There are also a lot of numbers in the raw text, and I don't know how useful they are, though their presence does suggest that a separate examination is called for.

In [None]:
df = pd.read_csv('../output/TEDall_years.csv', index_col=0)

Please note that the version of `onetext` below has been changed to replace every character other than a letter or an apostrophe with a space.

In [36]:
def onetext(label):
    """Join one year's talks into a single string of letters and apostrophes."""
    texts = df.loc[df['year'] == label].text.tolist()
    joined = " ".join(texts)
    return re.sub(r"[^a-zA-Z']", " ", joined)

In [37]:
years = sorted(list(filter(lambda year: year > 2001, df.year.unique())))

yeartexts = {year: onetext(year) for year in years}

y, t = zip(*yeartexts.items())

Next we will create a list of tuples pairing each year with the length of its text. (Because each `text` is a single string, `len(text)` measures characters rather than words.)

In [53]:
wordcounts = [len(text) for text in t]
yearcounts = list(zip(years, wordcounts))
print(yearcounts)

[(2002, 429343), (2003, 499867), (2004, 469364), (2005, 1042616), (2006, 624273), (2007, 1113749), (2008, 671601), (2009, 1656804), (2010, 1830130), (2011, 1596795), (2012, 1492180), (2013, 1671871), (2014, 1551879), (2015, 1594325), (2016, 1573638), (2017, 1718390)]
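Note that `len()` applied to a string returns its number of characters, so the totals printed above are character counts. A minimal sketch of the difference, using a made-up sentence rather than the TED texts:

```python
# A made-up sentence with a double space, like the ones re.sub leaves behind.
text = "Thank you so much  Chris"

char_count = len(text)          # characters, including spaces
word_count = len(text.split())  # whitespace-separated tokens

print(char_count, word_count)
```

Swapping `len(text.split())` into the list comprehension would give per-year word totals suitable as the denominator in the frequency ratio.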


Next we will vectorize the words for all the years. I went back and forth a number of times on the value to provide for `min_df`. Because our "documents" are analytical fictions made up of all the talks for a given year, not individual talks, we do need to see whether there are words that occur in only one year. So we'll set `min_df = 1`, but that nearly doubles our vocabulary to 48959 words; `min_df = 2` yields 28013.

Perhaps what we can do later is throw away words whose total count falls below a certain level, such as a word that occurs only once.
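A hedged sketch of that later filtering step, assuming a years-by-words count frame like `trends`; the frame `toy_trends` and the cutoff `min_total` are hypothetical.

```python
import pandas as pd

# Toy term-count matrix: years as rows, words as columns (hypothetical counts).
toy_trends = pd.DataFrame(
    {"people": [5, 7], "segway": [0, 1], "blog": [2, 0]},
    index=[2002, 2003],
)

min_total = 2  # hypothetical cutoff: keep words seen at least twice overall

# Total count of each word across all years.
word_totals = toy_trends.sum(axis=0)

# A boolean mask over the columns keeps only the words that meet the cutoff.
filtered = toy_trends.loc[:, word_totals >= min_total]
print(filtered.columns.tolist())
```

Unlike `min_df`, which counts the number of year-documents a word appears in, this cutoff works on the word's total count, so a word used twice in a single year survives.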

In [57]:
# If you want to pass options, pass them here:
vectorizer = CountVectorizer(max_df = 1.0, 
                             min_df = 1)

In [58]:
# fit the model to the data 
vecs = vectorizer.fit_transform(t)

# check our work
vecs.shape

(16, 48959)

In [41]:
features = vectorizer.get_feature_names()

In [43]:
print(features[0:50])

['aa', 'aaaaaah', 'aaaah', 'aah', 'aaron', 'ab', 'ababa', 'abacha', 'aback', 'abandon', 'abandoned', 'abandoning', 'abandonment', 'abate', 'abaya', 'abbey', 'abbreviated', 'abbreviation', 'abby', 'abc', 'abdomen', 'abdomens', 'abdominal', 'abducted', 'abduction', 'abdul', 'abdullah', 'abe', 'abhors', 'abide', 'abiding', 'abidjan', 'abilities', 'ability', 'abject', 'ablaze', 'able', 'abnormal', 'abnormalities', 'abnormality', 'aboard', 'abode', 'abolish', 'abolished', 'abolishing', 'abolition', 'abomination', 'aboriginal', 'abort', 'aborted']


What does the original "bookworm" code look like?