# UK Research and the Sustainable Development Goals

This tutorial will explore the relationships between the United Nations Sustainable Development Goals (SDGs) and UK publicly funded research projects.

The tutorial consists of two segments:
1. Constructing a classifier to tag documents with SDG labels using supervised machine learning
2. Applying the classifier to UK research projects from Gateway to Research and performing analysis

## Preamble

In [None]:
%load_ext autoreload
%autoreload 2
# install im_tutorial package
!pip install git+https://github.com/nestauk/im_tutorials.git
!pip install annoy

In [None]:
# useful Python tools
from itertools import chain, combinations
from collections import Counter

# matplotlib for static plots
import matplotlib.pyplot as plt
import matplotlib
# networkx for networks
import networkx as nx
# numpy for mathematical functions
import numpy as np
# pandas for handling tabular data
import pandas as pd
# seaborn for pretty statistical plots
import seaborn as sns

pd.set_option('display.max_columns', 99)

# basic bokeh imports for an interactive scatter plot or line chart
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Circle, Line

# NB: If using Google Colab, this function must be run at 
# the end of any cell that you want to display a bokeh plot.
# If using Jupyter, then this line need only appear once at
# the start of the notebook.
output_notebook()

from im_tutorials.data import *

## Load Data

In [None]:
df_sdg = sdg.sdg_web_articles()

In [None]:
print(df_sdg.shape)

In [None]:
df_sdg.head()

In [None]:
sdg_definitions = {
     1: '1. No Poverty',
     2: '2. Zero Hunger',
     3: '3. Good Health & Well-being',
     4: '4. Quality Education',
     5: '5. Gender Equality',
     6: '6. Clean Water & Sanitation',
     7: '7. Affordable & Clean Energy',
     8: '8. Decent Work & Economic Growth',
     9: '9. Industry, Innovation & Infrastructure',
     10: '10.  Reduced Inequalities',
     11: '11.  Sustainable Cities & Communities',
     12: '12.  Responsible Consumption & Production',
     13: '13.  Climate Action',
     14: '14.  Life Below Water',
     15: '15.  Life on Land',
     16: '16.  Peace, Justice & Strong Institutions',
     17: '17.  Partnerships for the Goals'
}

In [None]:
sdg_names = list(sdg_definitions.values())

## Data Quality and Cleaning

### SDG Goals

In [None]:
df_sdg['n_goals'] = [len(x) for x in df_sdg['sdg_goals']]

fig, ax = plt.subplots()
df_sdg['n_goals'].value_counts().plot.bar(ax=ax)
ax.set_title('Number of SDGs per Article')
ax.set_xlabel('N Goals')
ax.set_ylabel('Frequency');

In [None]:
df_sdg = df_sdg[(df_sdg['n_goals'] > 0) & (df_sdg['n_goals'] < 4)]

In [None]:
sdg_counts = pd.Series(chain(*df_sdg['sdg_goals'])).map(sdg_definitions).value_counts()

fig, ax = plt.subplots()
sdg_counts.plot.barh(ax=ax)
ax.set_title('Frequency of Goals')
ax.set_xlabel('Frequency')
ax.set_ylabel('Goal');

In [None]:
df_sdg = df_sdg[[17 not in x for x in df_sdg['sdg_goals']]]

In [None]:
sdg_counts = pd.Series(chain(*df_sdg['sdg_goals'])).map(sdg_definitions).value_counts()

In [None]:
fig, ax = plt.subplots()
sdg_counts.plot.barh(ax=ax)
ax.set_title('Frequency of Goals')
ax.set_xlabel('Frequency')
ax.set_ylabel('Goal');

### Text

We need to make sure that there is enough text in each article to provide a rich enough source of information for each SDG. We will have a look at the distribution of text lengths.

In [None]:
fig, ax = plt.subplots()
ax.hist(df_sdg['text'].str.len(), bins=100)
ax.set_title('Text Length')
ax.set_xlabel('N Characters')
ax.set_ylabel('Frequency');

Let's drop any texts that aren't at least the length of an old school tweet (clearly the minimum number of characters required to convey any meaningful chunk of information in the 21st Century) and any duplicate texts.

In [None]:
df_sdg = df_sdg[df_sdg['text'].str.len() > 140]
df_sdg = df_sdg.drop_duplicates('text')
df_sdg = df_sdg.drop('index', axis=1)
df_sdg = df_sdg.reset_index()

In [None]:
df_sdg.shape

# SDG Classifier

## Text Preprocessing

### Tokenisation

Typically, for computers to process human language, it needs to be broken down into components, e.g. sentences, syllables, or words.

In the case of this work, we are going to analyse text at the word level. In natural language processing, the components below the sentence level are called **tokens**. The process of breaking a piece of text into tokens is called **tokenisation**. A token could be a word, number, email address or punctuation mark, depending on the exact tokenisation method used.

For example, tokenising the sentence `'The dog chased the cat.'` might give `['The', 'dog', 'chased', 'the', 'cat', '.']`.

In this case we will apply some extra processing during the tokenisation phase. We will

1. Tokenise each document at the word level.
2. Remove punctuation.
3. Remove **stop words**, such as `the`, `and`, `to` etc.
4. Apply lower case to all tokens.
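
The four steps above can be sketched in plain Python. This is only an illustration and not the `tokenize_document` function used below: the regular expression, the tiny `STOP_WORDS` set and the `simple_tokenize` name are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop word list (real lists are much longer)
STOP_WORDS = {'the', 'and', 'to', 'a', 'of'}

def simple_tokenize(text):
    """Lower-case the text, split it into word tokens, drop punctuation and stop words."""
    # \w+ keeps runs of letters, digits and underscores, so punctuation is discarded
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(simple_tokenize('The dog chased the cat.'))
# ['dog', 'chased', 'cat']
```

Note that a rule this simple would also split email addresses and hyphenated words, which is one reason dedicated tokenisers exist.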


❓What are some potential challenges with tokenization?

In [None]:
from im_tutorials.features.text_preprocessing import *

In [None]:
tokenized = [list(chain(*tokenize_document(document))) for document in df_sdg['text'].values]

In [None]:
doc_id = 0
n_tokens_print = 10

print(f'Original text of document {doc_id}:')
print(df_sdg['text'].values[doc_id], '\n')

print(f'First {n_tokens_print} tokens in document {doc_id}:')
print(tokenized[doc_id][:n_tokens_print])

### Lemmatization

In many languages, words have a root that can be modified with prefixes, suffixes or other changes.

We can use **lemmatization** to try to extract the root of most words.

❓Why is this useful?

❓What could go wrong?

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
# the lemmatizer needs the WordNet corpus to be available locally
nltk.download('wordnet')

In [None]:
wnl = WordNetLemmatizer()
lemmatized = [[wnl.lemmatize(t) for t in b] for b in tokenized]

In [None]:
doc_id = 0
n_tokens_print = 10

print(f'First {n_tokens_print} tokens in document {doc_id}:')
print(tokenized[doc_id][:n_tokens_print], '\n')

print(f'First {n_tokens_print} lemmas in document {doc_id}:')
print(lemmatized[doc_id][:n_tokens_print])

### Term Frequencies

As well as stop words and punctuation, there may be other words that we want to remove, which are unique to our corpus. Often these are the tokens which appear very often and therefore convey little distinguishing information about each document.

Let's count up all of the tokens in our processed corpus and see which are the most common.

In [None]:
lemma_counts = Counter(chain(*lemmatized))
lemma_counts.most_common(50)

In [None]:
removes = ['development', 'country', 'report', 'also', 'action', 'sdg', 'meeting', 'policy', 'including', 'support',
          'implementation', 'national', 'new', 'conference', 'government', 'agreement', 'sdgs', 'goal', 'state',
          'agenda', 'organization', 'target', 'need', 'system', 'session', 'programme', 'management', 'party',
          'event', 'sector', 'process']

In [None]:
min_frequency = 5

df_sdg['clean_texts'] = [' '.join([t for t in doc if (t not in removes) and (lemma_counts[t] >= min_frequency)])
                     for doc in lemmatized]


In [None]:
doc_id = 0
print(df_sdg['clean_texts'][doc_id])

### From Natural to Machine Language

Once we have preprocessed our text, we can apply various NLP techniques to further process, analyse, summarise the text, extract information from it, or use it as features in a later analysis.

#### Bag of Words

In general, when dealing with text, we need to somehow convert it into numeric data that can be processed and analysed using mathematics. A very simple example would be to count the number of times each token appears in a document. For example, if we have the sentence `'I like really cute cats, but all cats are cute really.'`, after pre-processing and tokenisation, we could generate a vector of word counts where each position represents the count of one token:

```
vector      token
[1,         i
 1,         like
 2,         really
 2,         cute
 2,         cats
 1,         but
 1,         all
 1,]        are
```

This method is called the **bag of words** approach, and in this case we can determine that the document is about really cute cats. But in real life, with many documents, things are not always so straightforward.
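
As a small sketch, the count vector for the example sentence can be reproduced with `collections.Counter`. The regular expression and the hand-written `vocabulary` list are assumptions for illustration; the `CountVectorizer` used below builds its own vocabulary.

```python
import re
from collections import Counter

sentence = 'I like really cute cats, but all cats are cute really.'
tokens = re.findall(r'\w+', sentence.lower())
counts = Counter(tokens)

# Fix a vocabulary order so that every document maps to a vector with the same layout
vocabulary = ['i', 'like', 'really', 'cute', 'cats', 'but', 'all', 'are']
vector = [counts[token] for token in vocabulary]
print(vector)
# [1, 1, 2, 2, 2, 1, 1, 1]
```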

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
count_vectorizer = CountVectorizer()
bow_vecs = count_vectorizer.fit_transform(df_sdg['clean_texts'])

In [None]:
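# get_feature_names() was removed in scikit-learn 1.2; older releases (< 1.0) only have get_feature_names()
vocab = np.array(count_vectorizer.get_feature_names_out())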

In [None]:
doc_id = 0
n_top_terms = 10

def get_vec_counts(bow, idx):
    # densify only the requested row rather than the whole sparse matrix
    return bow[idx].toarray()[0]

def get_top_terms(bow, doc_id):
    # relies on the globals `vocab` and `n_top_terms` defined above
    vec_counts = get_vec_counts(bow, doc_id)
    topn = np.argsort(vec_counts)[::-1][:n_top_terms]
    top_counts = vec_counts[topn]
    top_terms = vocab[topn]
    return top_terms, top_counts

top_terms, top_counts = get_top_terms(bow_vecs, doc_id)

for term, count in zip(top_terms, top_counts):
    print(count, term)

❓What are some potential limitations of bag of words?

#### Tf-Idf

An improvement on the simple bag of words is to weight each token by its importance, or how much information it carries. One way to do this is to weight the count of each word in a document by the inverse of its frequency across _all_ documents. This is called **term frequency-inverse document frequency** or **tf-idf**.

By doing this, a reasonably common word like `'height'` would probably be weighted lower than a less common, but more specific term such as `'altitude'`. Even if we have a document where height is mentioned more frequently than altitude, tf-idf can help us to identify that the document is referring to height in the context of altitude, rather than for example the height of a person.
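
To make the weighting concrete, here is a minimal sketch using the textbook formula `count * log(N / document_frequency)` on a made-up three-document corpus. The corpus and the `tfidf` helper are invented for illustration. scikit-learn's `TfidfTransformer` uses a smoothed idf and L2-normalises each row by default, so its numbers will differ, but the ranking idea is the same.

```python
import math
from collections import Counter

# Toy corpus (invented): 'height' appears in every document, 'altitude' in only one
docs = [
    ['height', 'altitude', 'height', 'plane'],
    ['height', 'person'],
    ['height', 'building'],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight each term count by the log of the inverse document frequency."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

scores = tfidf(docs[0])
print(scores)
```

In the first document `'height'` occurs twice but scores zero, because it appears in every document and so does not distinguish this one, while `'altitude'` scores `log(3)`.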

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
tfidf = TfidfTransformer()
tfidf_vecs = tfidf.fit_transform(bow_vecs)

In [None]:
doc_id = 0
n_top_terms = 10

top_terms, top_counts = get_top_terms(tfidf_vecs, doc_id)

print('Score Term')
for term, count in zip(top_terms, top_counts):
    print(f'{count:.3f}', term)

We can now see that terms that are much more specific are weighted relatively higher than those which convey higher level and more generic information.

### Visualising Docs

Now that we have our documents described as vectors, we can visualise them!

To do this, we will need to project the high-dimensional document vectors into a low-dimensional space, in this case 2D.

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

SVD takes our sparse tf-idf vectors and compresses them into a lower-dimensional space.

In [None]:
svd = TruncatedSVD(n_components=30)
svd_vecs = svd.fit_transform(tfidf_vecs)

TSNE does a further projection into 2-dimensional space. It tries to strike a balance between retaining local and global structure, making it good for visualisation.

In [None]:
tsne = TSNE(n_components=2)
tsne_vecs = tsne.fit_transform(svd_vecs)

In [None]:
# positions of the articles tagged with exactly one goal
# (taking .index of the boolean mask would return every row)
single_goals = np.flatnonzero(df_sdg['n_goals'].values == 1)

In [None]:
tsne_vecs_single = tsne_vecs[single_goals]
goal_labels_single = [g[0] for g in df_sdg['sdg_goals'].iloc[single_goals]]
titles_single = df_sdg['title'].iloc[single_goals].values

In [None]:
from bokeh.models import HoverTool
from bokeh.palettes import Category20_16

In [None]:
colors = [Category20_16[g-1] for g in goal_labels_single]

cds = ColumnDataSource(data={
    'tsne_0': tsne_vecs_single[:, 0],
    'tsne_1': tsne_vecs_single[:, 1],
    'color': colors,
    'goal': [sdg_definitions[g] for g in goal_labels_single],
    'title': titles_single
})

p = figure(width=900, title='TSNE Plot of Single SDG Article Vectors')

hover = HoverTool(tooltips=[('Goal', '@goal'), ('Title', '@title')])

# legend_field replaces the legend= keyword, which was removed in Bokeh 3.0
p.circle(source=cds, x='tsne_0', y='tsne_1', color='color', line_width=0, legend_field='goal', radius=0.4, alpha=0.9)
p.add_tools(hover)

show(p)
# output_notebook()

❓Does this visualisation have implications for our classification problem?

## Classification Model

Now that we have preprocessed our text and seen how to vectorise it, it is time to build and train our model.

In this case, our **features** are the vectors created from the documents and our **target labels** are the SDG goals.

### Training Labels

This is a **multi-class** **multi-label** classification problem. This means that we have more than two possible labels - one for each of the SDGs - and each document can have more than one label assigned to it.

To train a model that can deal with this situation, we first need to transform our labels into binary features.
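
As a rough sketch of what this transformation does, each document's list of goals becomes a row of 0/1 indicators with one column per class. The `binarize_labels` helper below is a hypothetical stand-in written for illustration; the notebook itself uses scikit-learn's `MultiLabelBinarizer`.

```python
def binarize_labels(label_lists, classes):
    """Turn lists of labels into rows of 0/1 indicators, one column per class."""
    column = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return rows

classes = [1, 2, 3, 4]
encoded = binarize_labels([[1], [2, 4], [1, 3, 4]], classes)
print(encoded)
# [[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 1]]
```

Each column can then be treated as its own binary classification target.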

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
sdg_classes = list(range(1, 17))
mlb = MultiLabelBinarizer(classes=sdg_classes)

In [None]:
sdg_goals_mlb = pd.DataFrame(mlb.fit_transform(df_sdg['sdg_goals']), columns=mlb.classes_)

In [None]:
sdg_goals_mlb.head()

We can count up the goals again just to make sure that we've done the right thing.

In [None]:
sdg_goals_mlb.sum()

Looks ok!

Now we can also look to see whether there are any patterns between the SDGs.

In [None]:
fig, ax = plt.subplots()
sns.heatmap(sdg_goals_mlb.corr(),ax=ax)
ax.set_title('Correlation Between SDG Labels');

### Train Test Split

To estimate the accuracy of our model, we need to evaluate it on data that is also labelled, but that it has not seen during training.

As we do not have a separate test dataset for this, we will hold back some of the original data with a train-test split.

In [None]:
from sklearn.model_selection import train_test_split

We are going to hold back 20% of our data for the test set.
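
Conceptually, the split shuffles the row indices and cuts them at the chosen proportion. The sketch below uses only the standard library; `simple_train_test_split` is a hypothetical helper, and scikit-learn's `train_test_split` may round the test size slightly differently.

```python
import random

def simple_train_test_split(items, test_size=0.2, seed=0):
    """Shuffle the indices, then hold back a fraction of the items for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

train, test = simple_train_test_split(list(range(10)), test_size=0.2)
print(len(train), len(test))
# 8 2
```

Fixing the seed makes the split reproducible, which is also what passing `random_state` to `train_test_split` would do.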

In [None]:
tfidf_vecs_train, tfidf_vecs_test, sdg_labels_train, sdg_labels_test = train_test_split(
    tfidf_vecs, sdg_goals_mlb, test_size=0.2
)

print('Training set length:', tfidf_vecs_train.shape[0])
print('Test set length:', tfidf_vecs_test.shape[0])

### A Quick Example

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
from sklearn.linear_model import LogisticRegression

knc = KNeighborsClassifier()
knc.fit(tfidf_vecs_train, sdg_labels_train)

In [None]:
preds = knc.predict(tfidf_vecs_test)

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(sdg_labels_test, preds, target_names=sdg_names[:-1]))

The scores are not great, but this model uses default settings and has not been tuned.

### The Real Thing

Here we are going to train separate models for each SDG that will each be tuned with different parameters.
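
Because each per-SDG model is a binary classifier in which positive examples are usually much rarer than negatives, the pipeline defined below balances the classes by random undersampling. The following is a minimal sketch of that idea, assuming the sampling ratio means "minority count divided by majority count after resampling"; the `undersample` helper is hypothetical and is not the imbalanced-learn implementation.

```python
import random

def undersample(labels, ratio, seed=0):
    """Return sorted indices to keep so that n_minority / n_majority is roughly `ratio`."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted([positives, negatives], key=len)
    # keep every minority example and only as many majority examples as the ratio allows
    n_keep = min(len(majority), int(len(minority) / ratio))
    kept_majority = random.Random(seed).sample(majority, n_keep)
    return sorted(minority + kept_majority)

labels = [1] * 10 + [0] * 90
kept = undersample(labels, ratio=0.5)
print(len(kept))
# 30
```

Here 10 positives and a ratio of 0.5 leave 20 negatives, so the classifier trains on 30 examples instead of 100.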

In [None]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from imblearn.metrics import classification_report_imbalanced

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
pipe = make_pipeline_imb(
    RandomUnderSampler(), 
    LogisticRegression(solver='lbfgs', fit_intercept=False)
)

C = np.logspace(-1, 2, 10)
strats = [0.5, 0.6, 0.7, 0.8]

params = {
    'randomundersampler__sampling_strategy': strats,
    'logisticregression__C': C,
}

grid = GridSearchCV(pipe, param_grid=params, cv=KFold(n_splits=3, shuffle=True))

We need to apply the same vectorisation steps as before. We are also going to use SVD to create dense vectors.

In [None]:
tfidfv = TfidfVectorizer(ngram_range=(1,2))
tfidf_vecs = tfidfv.fit_transform(df_sdg['clean_texts'])

svdv = TruncatedSVD(n_components=300)
svd_vecs = svdv.fit_transform(tfidf_vecs)

svd_vecs_train, svd_vecs_test, sdg_labels_train, sdg_labels_test = train_test_split(
    svd_vecs, sdg_goals_mlb, test_size=0.2
)

In [None]:
classifiers = {}
preds = {}

for i in range(1, 17):
    print(sdg_definitions[i])
    grid.fit(svd_vecs_train, sdg_labels_train[i])
    best = grid.best_estimator_
    classifiers[i] = best
    preds[i] = best.predict(svd_vecs_test)
    print(classification_report_imbalanced(sdg_labels_test[i], preds[i]))
    print('\n')

❓What might be limiting the accuracy of these classifiers?

## Gateway to Research Projects

Gateway to Research is a database of UK-funded research projects across all disciplines. This sample of the database contains titles, abstracts, research categories and the start year for each project.

We are going to apply our SDG classifier to the abstracts and then have a look at some possible analysis methods.

In [None]:
import ast

In [None]:
df_gtr = datasets.gateway_to_research_projects()

In [None]:
df_gtr.head()

In [None]:
df_gtr.shape

In [None]:
fig, ax = plt.subplots()
ax.hist(df_gtr['abstract_texts'].str.len(), bins=100)
ax.set_title('Abstract Length Histogram')
ax.set_xlabel('N Characters')
ax.set_ylabel('Frequency');

In [None]:
df_gtr['abstract_texts'].value_counts()[:3]

In [None]:
df_gtr.groupby('start_year')['project_id'].count()

In [None]:
df_gtr = df_gtr[(df_gtr['start_year'] > 2005) & (df_gtr['start_year'] < 2018)]
text_drop = df_gtr['abstract_texts'].value_counts().index[0]
df_gtr = df_gtr[~pd.isnull(df_gtr['abstract_texts'])]
df_gtr = df_gtr[df_gtr['abstract_texts'].str.len() > 140]
df_gtr = df_gtr[df_gtr['abstract_texts'] != text_drop]
df_gtr = df_gtr.sort_values('start_year')
df_gtr = df_gtr.reset_index(drop=True)

In [None]:
df_gtr.shape

### Apply Text Preprocessing

In [None]:
tokenized_gtr = [list(chain(*tokenize_document(document))) for document in df_gtr['abstract_texts'].values]
lemmatized_gtr = [[wnl.lemmatize(t) for t in b] for b in tokenized_gtr]
df_gtr['clean_texts'] = [' '.join(t) for t in lemmatized_gtr]

### Apply Model

In [None]:
tfidf_gtr = tfidfv.transform(df_gtr['clean_texts'])
svd_gtr = svdv.transform(tfidf_gtr)
sdgs_gtr = {}
for i, clf in classifiers.items():
    sdgs_gtr[i] = clf.predict(svd_gtr)

### Explore Results

In [None]:
df_gtr_sdgs = pd.DataFrame(sdgs_gtr)
df_gtr_sdgs.columns = [sdg_definitions[i] for i in df_gtr_sdgs.columns]
df_gtr_sdgs.sum(axis=0)

In [None]:
topic_count = Counter(chain(*df_gtr['research_topics']))
subject_count = Counter(chain(*df_gtr['research_subjects']))
print('N Topics:', len(topic_count))
print('N Subjects:', len(subject_count))

In [None]:
rs = sorted(set(chain(*df_gtr['research_subjects'])))
mlb_subjects = MultiLabelBinarizer(classes=rs)
subjects_mlb_df = mlb_subjects.fit_transform(df_gtr['research_subjects'])
subjects_mlb_df = pd.DataFrame(subjects_mlb_df, columns=rs)

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap((df_gtr_sdgs.groupby(df_gtr['funder_name']).sum() / 
 df_gtr_sdgs.groupby(df_gtr['funder_name']).sum().sum() * 100).T, ax=ax,
           annot=True, fmt='.1f')
ax.set_title('Percentage Projects by Funder for Each SDG')
ax.set_xlabel('Funder Name')
ax.set_ylabel('Goal');

### Correlation With Research Subjects

In [None]:
from scipy.stats import pearsonr

In [None]:
sdg_subj_corrs = np.zeros((subjects_mlb_df.shape[1], len(sdg_names) -1))

for i, subj in enumerate(subjects_mlb_df.columns):
    for j, sdg in enumerate(df_gtr_sdgs.columns):
        corr = pearsonr(subjects_mlb_df[subj], df_gtr_sdgs[sdg])[0]
        sdg_subj_corrs[i, j] = corr
        
sdg_subj_corrs_df = pd.DataFrame(sdg_subj_corrs,
                                 columns=df_gtr_sdgs.columns,
                                 index=subjects_mlb_df.columns)

In [None]:
fig, ax = plt.subplots(figsize=(20, 6))
sns.heatmap(sdg_subj_corrs_df.T, ax=ax, cmap='viridis')
ax.set_title('Pearson R Correlation Coefficient between SDGs and Research Subjects')
# the frame is transposed, so subjects run along x and SDGs along y
ax.set_xlabel('Research Subject')
ax.set_ylabel('SDG');
plt.tight_layout();

### SDGs Over Time

In [None]:
rolling_window = 3
sdgs_time_df = (
    (df_gtr_sdgs.groupby(df_gtr['start_year']).sum().loc[2006:2017].divide( 
     df_gtr.groupby('start_year')['project_id'].count().loc[2006:2017].values, axis=0) * 100)
     .rolling(rolling_window).mean())

In [None]:
p2 = figure(title='Percentage of GtR Projects by SDG over Time', width=900, height=500)

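for i, col in enumerate(sdgs_time_df.columns):
    # i starts at 0, so index the palette directly (i-1 would wrap round to the last colour)
    color = Category20_16[i]
    p2.line(
        x=sdgs_time_df.index.values, 
        y=sdgs_time_df[col],
        color=color,
        line_width=3,
        alpha=0.8,
        # legend_label replaces the legend= keyword, which was removed in Bokeh 3.0
        legend_label=col,
        muted_color=color,
        muted_alpha=0.3
    )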

p2.legend.location = "top_left"
p2.legend.click_policy="mute"
p2.legend.label_text_font_size = '6pt'

show(p2)

❓What are the caveats behind this plot?

## Changes in Term Correlation

With our projects classified, we can home in on individual SDGs.

Here's a simple example where we look at which new terms are most strongly associated each year with SDG 7, Affordable & Clean Energy.

In [None]:
from sklearn.feature_selection import SelectKBest

In [None]:
ngram_range = (1, 1)

n_terms_year = 50
target = sdg_names[6]  # '7. Affordable & Clean Energy'

found = set()
for year, group in df_gtr.groupby('start_year'):
    print('===', year, '===')
    tfidf_corr = TfidfVectorizer(ngram_range=ngram_range)
    tfidf_corr_vecs = tfidf_corr.fit_transform(group['clean_texts'])
    
    ids = group.index.values
    sdg_labels = df_gtr_sdgs[target].values[ids]
    # only the per-term scores are used below, so keep every feature
    skb = SelectKBest(k='all')
    skb.fit(tfidf_corr_vecs, sdg_labels)
    vocab_corr = {v: k for k, v in tfidf_corr.vocabulary_.items()}
    # terms with undefined scores (NaN) are treated as scoring zero
    top_term_ids = np.argsort(np.nan_to_num(skb.scores_))[::-1][:n_terms_year]
    top_terms = [vocab_corr[v] for v in top_term_ids]
    not_found_before = [t for t in top_terms if t not in found]
    found.update(top_terms)
    print(not_found_before)

❓How could we improve on this method to find the most emerging terms associated with the SDG?

#### An Aside on N Grams

In some cases, words appear together more often than we would expect by chance. This happens with commonly used phrases or names of entities, for example `general relativity`. It can be useful to identify these cases in our text so that the machine can treat them as carrying different information from the same words appearing separately. Tokens made up of multiple words are called **n grams**. N grams containing two words are **bigrams**, n grams containing three words are **trigrams** and so on.

For example, in a corpus of text, we might have the sentence, `'I travelled from York to New York to find a new life.'`. After tokenisation and finding bigrams, we might end up with `['i', 'travelled', 'from', 'york', 'to', 'new_york', 'to', 'find', 'a', 'new', 'life', '.']`.
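
A minimal sketch of the idea: `bigrams` lists every adjacent pair of tokens, and `merge_phrases` joins the pairs we have decided are phrases. Both helpers are hypothetical, and the phrase set is supplied by hand here; in practice it would be learned from co-occurrence statistics, for example with the gensim `Phrases` model imported later in this notebook.

```python
def bigrams(tokens):
    """Return every pair of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))

def merge_phrases(tokens, phrases):
    """Join adjacent token pairs that appear in `phrases` with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(f'{tokens[i]}_{tokens[i + 1]}')
            i += 2  # skip the second half of the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ['i', 'travelled', 'from', 'york', 'to', 'new', 'york',
          'to', 'find', 'a', 'new', 'life', '.']
print(bigrams(tokens)[:3])
# [('i', 'travelled'), ('travelled', 'from'), ('from', 'york')]
merged = merge_phrases(tokens, {('new', 'york')})
print(merged)
# ['i', 'travelled', 'from', 'york', 'to', 'new_york', 'to', 'find', 'a', 'new', 'life', '.']
```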

## Topics for Renewable Energy

#### Topic Modelling

When we have thousands of documents, it can be too many for a single person to read and understand in a reasonable space of time. A useful first step is often to be able to understand what the main themes are within the documents we have. Bag of words or tf-idf are useful processing methods, but they still require us to inspect each document individually or group them and identify topics manually. 

Luckily, there are automated methods of finding the groups of tokens that describe broad themes within a set of documents, which are referred to as **topic modelling**.

In this case, we are going to use **Latent Semantic Indexing** or **LSI**.

In [None]:
df_sdg_7 = df_gtr[(df_gtr_sdgs[sdg_names[6]] == 1)]

In [None]:
from gensim.corpora import Dictionary
from gensim.models.phrases import Phraser, Phrases
from gensim.models.lsimodel import LsiModel
from gensim.sklearn_api.lsimodel import LsiTransformer

In [None]:
sdg_7_tokenised = [list(chain(*tokenize_document(document))) for document in df_sdg_7['clean_texts'].values]

In [None]:
dictionary = Dictionary(sdg_7_tokenised)
dictionary.filter_extremes(no_above=0.2)

In [None]:
bow_sdg_7 = [dictionary.doc2bow(d) for d in sdg_7_tokenised]

In [None]:
num_topics = 300
lsi = LsiModel(bow_sdg_7, id2word=dictionary, num_topics=num_topics)

In [None]:
n_topics_print = 10

for topic_id in range(0, num_topics, int(num_topics/n_topics_print)):
    print('Topic', topic_id)
    print(lsi.print_topic(topic_id), '\n')

❓Do these topics reveal anything to you about the challenges of using machine learning methods to explore research topics?

In [None]:
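from gensim.matutils import corpus2dense

# lsi_vecs is not defined earlier; assumed here to be the dense document-by-topic matrix from the LSI model
lsi_vecs = corpus2dense(lsi[bow_sdg_7], num_terms=num_topics).T
svd_7 = TruncatedSVD(n_components=30)
svd_7_vecs = svd_7.fit_transform(lsi_vecs)
tsne_7 = TSNE(n_components=2)
tsne_7_vecs = tsne_7.fit_transform(svd_7_vecs)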

In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
# scikit-learn >= 1.2 calls this parameter metric; older releases use affinity='cosine'
agc = AgglomerativeClustering(n_clusters=20, metric='cosine', linkage='complete')
agcs = agc.fit_predict(svd_7_vecs)

In [None]:
def make_topic_terms(model, num_topic_terms):
    topic_terms = []
    for i in range(model.num_topics):
        topic_terms.append([t[0] for t 
                   in model.show_topic(i)[:num_topic_terms]])        
    return np.array(topic_terms)

def make_topic_names(topic_vectors, topic_terms, num_topics=None):
    topic_names = []
    for vector in topic_vectors:
        topic_ids = np.argsort(vector)[::-1][:num_topics]
        name = ', '.join([c for c in chain(*topic_terms[topic_ids])])
        topic_names.append(name)
    return topic_names

topic_terms = make_topic_terms(lsi, 2)

topic_names = make_topic_names(lsi_vecs, topic_terms, num_topics=3)

In [None]:
titles = [x[:50] for x in df_gtr.iloc[df_sdg_7.index.values]['clean_texts'].values]

cmap = matplotlib.cm.hsv
norm = matplotlib.colors.Normalize(vmin=np.min(agcs), vmax=np.max(agcs))
colors = [matplotlib.colors.to_hex(cmap(norm(i))) for i in agcs]

cds = ColumnDataSource(data={'tsne_0': tsne_7_vecs[:, 0],
                             'tsne_1': tsne_7_vecs[:, 1],
                             'name': titles,
                             'color': colors,
                             'cluster': agcs})

p = figure(width=900)
hover = HoverTool(tooltips=[("Topic", "@name"), ("Cluster", "@cluster")])
p.circle(source=cds, x='tsne_0', y='tsne_1', fill_color='color', line_color='color', 
         fill_alpha=0.5, line_alpha=0.5, radius=.5)
p.add_tools(hover)

show(p)

# output_notebook()

### Topic Networks

#### Communities of Projects

Our topic-modelled projects can now be projected into a vector space, where we can find nearest neighbours using cosine distance.
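
For intuition, here is a brute-force sketch of cosine-based neighbour search on a few made-up 2D vectors. The helper names are hypothetical. The `angular_distance` function follows the definition Annoy documents for its `'angular'` metric, `sqrt(2 * (1 - cosine_similarity))`, which ranges from 0 (same direction) to 2 (opposite direction). Annoy itself builds an approximate index rather than comparing every pair.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_distance(u, v):
    # clamp at 0 to guard against tiny negative values caused by rounding
    return math.sqrt(max(0.0, 2 * (1 - cosine_similarity(u, v))))

def nearest_neighbours(vectors, idx, n):
    """Indices of the n vectors closest to vectors[idx], excluding idx itself."""
    others = [j for j in range(len(vectors)) if j != idx]
    return sorted(others, key=lambda j: angular_distance(vectors[idx], vectors[j]))[:n]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbours = nearest_neighbours(vectors, 0, 2)
print(neighbours)
# [1, 2]
```

On this scale, orthogonal vectors sit at a distance of about 1.41, which helps when choosing a distance threshold for drawing edges between projects.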

In [None]:
df_sdg_7 = df_sdg_7.reset_index(drop=True)

In [None]:
from annoy import AnnoyIndex
from collections import defaultdict

In [None]:
annoy_indices = {}
for year, group in df_sdg_7.groupby('start_year'):
    ids = group.index.values

    vecs = svd_7_vecs[ids]
    t = AnnoyIndex(svd_7.n_components, 'angular')  # Length of item vector that will be indexed
    for idx, vec in zip(ids, vecs):
        t.add_item(idx, vec)
    t.build(500)
    annoy_indices[year] = t

In [None]:
years = df_sdg_7['start_year'].unique()

In [None]:
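# maximum angular distance at which two projects are joined by an edge
max_dist = 0.8

project_edges = defaultdict(list)

# group by the column name (not a one-element list) so that year is a plain value
for year, group in df_sdg_7.groupby('start_year'):
    edges_year = []
    ids = group.index.values
    annoy_index = annoy_indices[year]
    for idx in ids:
        for neighbour_idx in annoy_index.get_nns_by_item(idx, 30):
            if neighbour_idx == idx:
                continue
            else:
                dist = annoy_index.get_distance(neighbour_idx, idx)
                if dist < max_dist:
                    # the 'dist' attribute stores 1 - distance, so closer projects get larger weights
                    edges_year.append((idx, neighbour_idx, {'dist': 1 - dist}))
    project_edges[year].extend(edges_year)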

In [None]:
import networkx as nx

In [None]:
g_p = nx.Graph()
g_p.add_edges_from(project_edges[2007])

g_p_node_pos = nx.spring_layout(g_p, seed=101, weight='dist')
nx.draw(g_p, pos=g_p_node_pos, node_size=15, node_color='C0');

In [None]:
import community

In [None]:
communities = community.best_partition(g_p, resolution=0.3, weight='dist')

In [None]:
# look colours up per node so they follow the graph's node order
nx.draw(g_p, pos=g_p_node_pos, node_size=15, node_color=[communities[n] for n in g_p.nodes], cmap=matplotlib.cm.hsv)

In [None]:
resolution = 0.3

project_communities = {}
community_labels = {}
project_graphs = {}
for year, edge_list in project_edges.items():
    g = nx.Graph()
    g.add_edges_from(edge_list)
    project_graphs[year] = g
    
    communities = community.best_partition(g, resolution=resolution, weight='dist')
    print(f'N Communities at {year}:', len(set(communities.values())))
    
    community_ids = defaultdict(list)
    for proj, c in communities.items():
        community_ids[c].append(proj)
    project_communities[year] = community_ids

In [None]:
svd_communities = {}

for year, communities_year in project_communities.items():
    svd_communities_year = []
    for community_id, docs in communities_year.items():
        mean_vec = np.mean(svd_7_vecs[docs], axis=0)
        # scale to unit length; dividing by the maximum could flip the sign if it is negative
        mean_vec = mean_vec / np.linalg.norm(mean_vec)
        svd_communities_year.append(mean_vec)
    svd_communities[year] = svd_communities_year

In [None]:
from scipy.spatial.distance import cosine

In [None]:
similarity_thresh = 0.5

agg_edges = []
max_parents = 1

for i, year in enumerate(sorted(years)):
    if i > 0:
        past_year = year - 1
        past_vecs = svd_communities[past_year]
        current_vecs = svd_communities[year]
        for idx, vec in enumerate(current_vecs):
            similarities = [1 - cosine(vec, c_past) for c_past in past_vecs]
            sim_max_ids = np.argsort(similarities)[::-1][:max_parents]
            for sim_max_idx in sim_max_ids:
                edge = (f'{year}_{idx}', f'{past_year}_{sim_max_idx}', {'weight': similarities[sim_max_idx]})
                # append inside the loop so every parent is kept when max_parents > 1
                agg_edges.append(edge)

In [None]:
nodes = []
for year, communities in project_communities.items():
    for idx, _ in enumerate(communities):
        nodes.append(f'{year}_{idx}')

In [None]:
plt.hist([e[2]['weight'] for e in agg_edges], bins=50);

In [None]:
h = nx.DiGraph()
h.add_nodes_from(nodes)
h.add_edges_from(agg_edges)

In [None]:
pos_x = np.array([int(d.split('_')[0]) for d in h.nodes])
pos_x = pos_x - np.max(pos_x)

tsne_agg = TSNE(n_components=1)
svd_df = pd.DataFrame(np.array(list(chain(*svd_communities.values()))))
pos_y = tsne_agg.fit_transform(svd_df)

pos_y = pos_y - np.min(pos_y) 
pos_y = pos_y / np.max(pos_y)

pos = {}
for node, x, y in zip(h.nodes, pos_x, pos_y):
    pos[node] = (x, y[0])

In [None]:
weights = np.array([1 / h.get_edge_data(e[0], e[1])['weight'] for e in h.edges])
weights = weights / np.max(weights)

In [None]:
from sklearn.cluster import KMeans

In [None]:
n_clusters = int(np.round(np.mean([len(c) for c in svd_communities.values()])))

km = KMeans(n_clusters=n_clusters)
km.fit(list(chain(*svd_communities.values())))
colors = km.labels_
cmap_nodes = matplotlib.cm.hsv

In [None]:
# matplotlib.cm.get_cmap was removed in Matplotlib 3.9; plt.get_cmap works across versions
cmap = plt.get_cmap('inferno')
fig, ax = plt.subplots(figsize=(15, 7.5))
nx.draw(h, pos=pos, ax=ax, node_size=50, edge_color=weights, edge_cmap=cmap, width=2, node_color=colors, cmap=cmap_nodes)

In [None]:
lsi_vecs.shape

In [None]:
top_n = 3

lsi_years = {}

for year, group in df_sdg_7.groupby('start_year'):
    ids = group.index.values
    # absolute topic weights, since LSI loadings can be negative
    lsi_year = np.abs(lsi_vecs[ids])
    lsi_mean = np.mean(lsi_year, axis=0)
    lsi_years[year] = lsi_mean

In [None]:
lsi_df = pd.DataFrame(lsi_years).T.rolling(3).mean()

In [None]:
# megs = lsi_df.multiply(df_sdg_7.groupby('start_year')['project_id'].count(), axis=0)

In [None]:
n_topics = 10

fig, axs = plt.subplots(nrows=n_topics, figsize=(6, 1.2 * n_topics))

for i, ax in enumerate(axs):
    ax.plot(lsi_df[i], linewidth=2)
    title = ' '.join([c[0] for c in lsi.show_topic(i)][:5])
    ax.set_title(title)

plt.tight_layout()

### Topic Centrality

In [None]:
co_edges = []
for i, vec in enumerate(lsi_vecs):
    top_topics = np.argsort(vec)[::-1][:top_n]
    for combo in combinations(top_topics, 2):
        co_edges.append(tuple(sorted(combo)))
co_edges = list(set(co_edges))

g_topic_co = nx.Graph()
g_topic_co.add_edges_from(co_edges)

# take the centrality values; list() on the dict alone would return the node ids
b = np.array(list(nx.betweenness_centrality(g_topic_co).values()))
d = np.array(list(nx.degree_centrality(g_topic_co).values())) * 100

cmap = matplotlib.cm.inferno
norm = matplotlib.colors.Normalize(vmin=np.min(b), vmax=np.max(b))
colors = [cmap(norm(i)) for i in b]
nx.draw(g_topic_co, node_size=d, node_color=colors, edge_color='gray')