# Yearly Trends

## Analysis

With the steps above complete, we have a matrix with one row per word and one column per year, so a word's usage can be read from left to right as time moves forward.

In [2]:
import pandas as pd

In [5]:
# Load the Data
df = pd.read_csv('../output/yearly-counts-min1-max100.csv', index_col = 'term')

# The 'Unnamed: 0' column is a vestigial index, let's drop it:
df.drop(columns = ['Unnamed: 0'], inplace=True)
# df.set_index('term', inplace=True)

# Check shape and list columns:
print(df.shape, list(df))

(39117, 16) ['2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017']


In [6]:
df.head()

| term | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0 | 43 | 54 | 61 | 66 | 62 | 81 | 100 | 73 | 87 | 67 | 52 | 112 | 65 | 74 | 80 | 99 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 42 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


In [7]:
# One term:
df.loc['nuclear']

2002    10
2003     1
2004     0
2005     4
2006     3
2007    11
2008    73
2009     9
2010    78
2011     7
2012    14
2013    22
2014    12
2015     8
2016     4
2017     2
Name: nuclear, dtype: int64

In [8]:
# Multiple terms:
terms = ['nuclear', 'global', 'climate']

# And this is the pandas way
df.loc[terms]

| term | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nuclear | 10 | 1 | 0 | 4 | 3 | 11 | 73 | 9 | 78 | 7 | 14 | 22 | 12 | 8 | 4 | 2 |
| global | 12 | 11 | 2 | 12 | 40 | 26 | 29 | 36 | 32 | 20 | 26 | 16 | 33 | 39 | 102 | 41 |
| climate | 1 | 10 | 1 | 14 | 9 | 19 | 17 | 26 | 17 | 4 | 28 | 23 | 23 | 14 | 37 | 95 |


### Normalizing by Year

In the next series of cells, we normalize the counts by year: we sum each year column to get the total number of words for that year, then divide each term's count in that column by the total. The result is each term's relative frequency within its year.

In [9]:
# a quick check of the sums involved
df.sum(axis = 0, skipna = True)

2002     74207
2003     86156
2004     81009
2005    100605
2006    106638
2007    135084
2008    115760
2009    127834
2010    124127
2011    121030
2012    100771
2013    128804
2014    147841
2015    138406
2016    130735
2017    164538
dtype: int64

A note: if the dataframe has a column you want to ignore in column-wise calculations, first build a list of the columns you do want, then restrict the calculation to those columns. The first few times I worked with this dataframe, I had difficulty setting the index to `term`, so this was the workaround I developed:
```python
years = list(df)[1:]
df[years] = df[years] / df[years].sum()
```
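
As a minimal sketch of that workaround on made-up data, assume the terms sit in an ordinary first column named `term` rather than in the index. The `toy` frame and its values are hypothetical and only illustrate the slicing:

```python
import pandas as pd

# Hypothetical toy frame: 'term' is a regular column, not the index
toy = pd.DataFrame({
    "term": ["alpha", "beta", "gamma"],
    "2002": [1, 3, 6],
    "2003": [2, 2, 4],
})

# Skip the first column ('term') and keep only the year columns
years = list(toy)[1:]

# Divide each year column by its own total; 'term' is left untouched
toy[years] = toy[years] / toy[years].sum()

print(toy)
```

Slicing with `[1:]` assumes the column to ignore comes first; if it sits elsewhere, a list comprehension that filters on the column name would be the safer choice.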

In [11]:
# divide each cell in a column by that column's total
df = df / df.sum()
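
The division works because `df.sum()` returns a Series indexed by the year labels, and pandas aligns that Series with the dataframe's columns before dividing. The sketch below uses only the 2002 and 2003 counts for the three sample terms, so its totals (and therefore its shares) are not those of the full dataset. The names `counts` and `normalized` are illustrative. It checks that every normalized column sums to 1:

```python
import numpy as np
import pandas as pd

# Three-term excerpt of the raw counts; totals here cover only these rows
counts = pd.DataFrame(
    {"2002": [10, 12, 1], "2003": [1, 11, 10]},
    index=pd.Index(["nuclear", "global", "climate"], name="term"),
)

# counts.sum() is a Series keyed by year; division aligns it on the columns
normalized = counts / counts.sum()

# Every column of relative frequencies should add up to 1
print(np.allclose(normalized.sum(), 1.0))
```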

In [19]:
df.loc['nuclear']

2002    0.000135
2003    0.000012
2004    0.000000
2005    0.000040
2006    0.000028
2007    0.000081
2008    0.000631
2009    0.000070
2010    0.000628
2011    0.000058
2012    0.000139
2013    0.000171
2014    0.000081
2015    0.000058
2016    0.000031
2017    0.000012
Name: nuclear, dtype: float64
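
As a quick check on these values, the relative frequency is simply the raw count divided by that year's total. A minimal sketch using numbers copied from the outputs above (the dictionary names are illustrative, not part of the notebook):

```python
# Raw counts for 'nuclear' and the yearly word totals, for three sample years
nuclear_counts = {"2002": 10, "2008": 73, "2010": 78}
year_totals = {"2002": 74207, "2008": 115760, "2010": 124127}

# Relative frequency = count for the term / total words in that year
relative = {
    year: nuclear_counts[year] / year_totals[year] for year in nuclear_counts
}

for year, value in relative.items():
    print(year, f"{value:.6f}")
# 2002 0.000135
# 2008 0.000631
# 2010 0.000628
```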

In [17]:
# and here are the three sample terms, now with frequencies normalized by year
df.loc[terms]

| term | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nuclear | 0.000135 | 0.000012 | 0.000000 | 0.000040 | 0.000028 | 0.000081 | 0.000631 | 0.000070 | 0.000628 | 0.000058 | 0.000139 | 0.000171 | 0.000081 | 0.000058 | 0.000031 | 0.000012 |
| global | 0.000162 | 0.000128 | 0.000025 | 0.000119 | 0.000375 | 0.000192 | 0.000251 | 0.000282 | 0.000258 | 0.000165 | 0.000258 | 0.000124 | 0.000223 | 0.000282 | 0.000780 | 0.000249 |
| climate | 0.000013 | 0.000116 | 0.000012 | 0.000139 | 0.000084 | 0.000141 | 0.000147 | 0.000203 | 0.000137 | 0.000033 | 0.000278 | 0.000179 | 0.000156 | 0.000101 | 0.000283 | 0.000577 |


In [14]:
# ==> Commented out so re-running notebook doesn't result in new file
# df.to_csv('../output/yearly-counts-min1-max100.csv')