# Science in Culture
This research project analyzes the perception of science in culture, using NLP techniques such as word embeddings (word2vec) and sentiment analysis.

## Reference Code
1. https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
2. https://github.com/RaRe-Technologies/gensim/blob/ba1ce894a5192fc493a865c535202695bb3c0424/docs/notebooks/Word2Vec_FastText_Comparison.ipynb

## Other references
1. Cámara, M., & A., J. (2012). Political dimensions of scientific culture: Highlights from the Ibero-American survey on the social perception of science and scientific culture. Public Understanding of Science, 21(3), 369–384. https://doi.org/10.1177/0963662510373871
2. Jones, M. (2014). Cultural Characters and Climate Change: How Heroes Shape Our Perception of Climate Science. Social Science Quarterly, 95(1), 1-39.
3. Ruest, Nick, 2017, "#climatemarch tweets April 19-May 3, 2017", https://doi.org/10.5683/SP/KZZVZW, Scholars Portal Dataverse, V1
4. Ruest, Nick, 2017, "#MarchForScience tweets April 12-26, 2017", https://doi.org/10.5683/SP/7BC9V1, Scholars Portal Dataverse, V1
5. Francis, W. N., & Kucera, H. Brown corpus (NLTK id: brown; size: 3314357). http://www.nltk.org/nltk_data/. License: may be used for non-commercial purposes.
6. Salehan, Kim, & Lee. (2018). Are there any relationships between technology and cultural values? A country-level trend study of the association between information communication technology and cultural values. Information & Management, 55(6), 725-745.
7. Vishwanath, A., & Chen, H. (2008). Personal communication technologies as an extension of the self: A cross‐cultural comparison of people's associations with technology and their symbolic proximity with others. Journal of the American Society for Information Science and Technology, 59(11), 1761-1775.



## Imports

In [2]:
import gensim  # needed by the word2vec2tensor helper further down
import nltk
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import numpy as np
from smart_open import smart_open
import os

## Training the models
The data for this project is not included in this repository because of restrictions on redistributing Twitter data. Two sources are used: the widely available Brown corpus, and tweets carrying the #ClimateMarch and #MarchForScience hashtags. Roughly two million tweets were hydrated using twarc.

In [2]:
# Only train if you don't want to use the pretrained models
params = {
    'alpha': 0.05,
    'size': 100,
    'window': 5,
    'iter': 5,
    'min_count': 5,
    'sample': 1e-4,
    'sg': 1,
    'hs': 0,
    'negative': 5
}
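
The keyword names in `params` follow the gensim 3.x API. gensim 4.0 renamed `size` to `vector_size` and `iter` to `epochs`, so the dictionary needs translating before it can be passed to a newer `Word2Vec`. The helper below is a minimal sketch of that translation; `to_gensim4_params` and `GENSIM4_RENAMES` are hypothetical names (not part of gensim), and it assumes these two renames are the only ones that affect this dictionary.

```python
# Hypothetical helper: maps gensim 3.x Word2Vec keyword names to their
# gensim 4.x equivalents. Only the two renames relevant to `params` are covered.
GENSIM4_RENAMES = {'size': 'vector_size', 'iter': 'epochs'}


def to_gensim4_params(old_params):
    """Return a copy of old_params with gensim 3.x keys renamed for gensim 4.x."""
    return {GENSIM4_RENAMES.get(key, key): value for key, value in old_params.items()}


# A subset of the parameters used in this notebook
example_params = {'alpha': 0.05, 'size': 100, 'window': 5, 'iter': 5, 'min_count': 5}
print(to_gensim4_params(example_params))
```

With gensim 3.x, which this notebook was written against, `params` can be used unchanged.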

In [3]:
brown_model = Word2Vec(Text8Corpus('data/brown_corp.txt'), **params)

In [4]:
climate_model = Word2Vec(Text8Corpus('data/cleaned/climate_tweets_cleaned.txt'), **params)

In [5]:
mfs_model = Word2Vec(Text8Corpus('data/cleaned/MarchForScience_tweets_cleaned.txt'), **params)

## Saving and loading the models

In [4]:
# File paths -> required by saving/loading methods
brown_path = "models/brown_model.bin"
brown_vec = "models/vectors/brown_vec.kv"
brown_name = "models/vectors/brown"

climate_path = "models/climate_model.bin"
climate_vec = "models/vectors/climate_vec.kv"
climate_name = "models/vectors/climate"

mfs_path = "models/mfs_model.bin"
mfs_vec = "models/vectors/mfs_vec.kv"
mfs_name = "models/vectors/mfs"

In [6]:
# Saving models
brown_model.save(brown_path)
climate_model.save(climate_path)
mfs_model.save(mfs_path)

In [309]:
# Edited from gensim/scripts
def word2vec2tensor(word2vec_model_path, tensor_filename, binary=False):
    """Convert saved gensim KeyedVectors into two TSV files for Embedding Projector.

    "tensor_filename"_tensor.tsv contains the word vectors and
    "tensor_filename"_metadata.tsv contains the words, one per line.

    Parameters
    ----------
    word2vec_model_path : str
        Path to a file written by KeyedVectors.save().
    tensor_filename : str
        Prefix for output files.
    binary : bool, optional
        Unused in this edited version; kept from the original gensim script.
    """
    model = gensim.models.KeyedVectors.load(word2vec_model_path, mmap='r')
    outfiletsv = tensor_filename + '_tensor.tsv'
    outfiletsvmeta = tensor_filename + '_metadata.tsv'

    with smart_open(outfiletsv, 'wb') as file_vector, smart_open(outfiletsvmeta, 'wb') as file_metadata:
        for word in model.index2word:
            file_metadata.write(gensim.utils.to_utf8(word) + gensim.utils.to_utf8('\n'))
            vector_row = '\t'.join(str(x) for x in model[word])
            file_vector.write(gensim.utils.to_utf8(vector_row) + gensim.utils.to_utf8('\n'))

    print("2D tensor file saved to %s" % outfiletsv)
    print("Tensor metadata file saved to %s" % outfiletsvmeta)

In [311]:
# Saving vectors
brown_model.wv.save(brown_vec)
word2vec2tensor(brown_vec, brown_name)

climate_model.wv.save(climate_vec)
word2vec2tensor(climate_vec, climate_name)

mfs_model.wv.save(mfs_vec)
word2vec2tensor(mfs_vec, mfs_name)

2D tensor file saved to models/vectors/brown_tensor.tsv
Tensor metadata file saved to models/vectors/brown_metadata.tsv
2D tensor file saved to models/vectors/climate_tensor.tsv
Tensor metadata file saved to models/vectors/climate_metadata.tsv
2D tensor file saved to models/vectors/mfs_tensor.tsv
Tensor metadata file saved to models/vectors/mfs_metadata.tsv
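
The two TSV files written above have a simple layout: one tab-separated vector per line in the tensor file, and the matching word on the same line number of the metadata file. The sketch below reproduces that layout with in-memory buffers and made-up toy vectors, so it runs without gensim; `write_projector_tsv` is a hypothetical name used only for illustration.

```python
import io


def write_projector_tsv(vectors, tensor_file, metadata_file):
    """Write one vector per line (tab-separated) and the matching word per line."""
    for word, vector in vectors.items():
        metadata_file.write(word + '\n')
        tensor_file.write('\t'.join(str(x) for x in vector) + '\n')


# Toy vectors, invented for illustration only
toy_vectors = {'science': [0.1, 0.2], 'fear': [0.3, 0.4]}
tensor_buf, meta_buf = io.StringIO(), io.StringIO()
write_projector_tsv(toy_vectors, tensor_buf, meta_buf)
print(tensor_buf.getvalue())
print(meta_buf.getvalue())
```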


In [5]:
# Loading models
brown_model = Word2Vec.load(brown_path)
climate_model = Word2Vec.load(climate_path)
mfs_model = Word2Vec.load(mfs_path)

## Evaluating the models
Let's do some basic tests of cosine similarity.

In [22]:
brown_model.wv.most_similar(positive=['science', 'fear'], topn=10)

[('dwell', 0.9033710956573486),
 ('humanity', 0.8990538716316223),
 ('divine', 0.8985900282859802),
 ('obedience', 0.8958655595779419),
 ('virtue', 0.8896442651748657),
 ('invention', 0.885993242263794),
 ('doctrine', 0.8842350244522095),
 ('judgments', 0.8829445838928223),
 ('Utopian', 0.8806315064430237),
 ('profound', 0.8802723288536072)]

In [24]:
climate_model.wv.most_similar(positive=['science', 'fear'], topn=10)

[('Libs', 0.7004995346069336),
 ('oceans', 0.6837355494499207),
 ('might', 0.6310517191886902),
 ('Hiding', 0.6045321226119995),
 ('lab.', 0.5913923978805542),
 ('peacefully', 0.5868384838104248),
 ('backing', 0.5676073431968689),
 ('Muslims', 0.5611160397529602),
 ('NY', 0.5599805116653442),
 ('magically', 0.5589848756790161)]

In [55]:
mfs_model.wv.most_similar(positive=['science', 'fear'], topn=10)

[('MC…', 0.6706839799880981),
 ('enterprise', 0.6391789317131042),
 ('ignorance…', 0.6318367123603821),
 ('triumph', 0.5998080968856812),
 ('open,', 0.5971928834915161),
 ('stay!', 0.5955246090888977),
 ('facets', 0.5903943777084351),
 ('protesting,', 0.5866920948028564),
 ('represses', 0.5861929655075073),
 ('less"', 0.5806410312652588)]
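
To make the numbers above easier to read: as I understand it, `most_similar` with several positive words averages the unit-normalized query vectors and ranks every other word by cosine similarity to that average. The sketch below illustrates the idea in plain Python with invented 2-D vectors; it is not gensim's implementation, and the function names are my own.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(vectors, positive, topn=10):
    """Rank words by cosine similarity to the mean of the unit-normalized query vectors."""
    query_units = [unit(vectors[w]) for w in positive]
    dim = len(query_units[0])
    mean = [sum(u[i] for u in query_units) / len(query_units) for i in range(dim)]
    # Query words themselves are excluded from the ranking
    scores = [(w, cosine(mean, v)) for w, v in vectors.items() if w not in positive]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 2-D vectors, invented purely for illustration
toy = {
    'science': [1.0, 0.0],
    'fear': [0.0, 1.0],
    'doctrine': [1.0, 1.0],
    'garbage': [-1.0, 0.0],
}
print(most_similar(toy, ['science', 'fear'], topn=2))
```

In the toy data, 'doctrine' points in the same direction as the average of 'science' and 'fear', so it ranks first.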

In [49]:
brown_model.predict_output_word(['fear', 'science'])

[('science', 0.0006998557),
 ('philosophy', 0.0006726175),
 ('fear', 0.00059489015),
 ('mind', 0.0005511252),
 ('religion', 0.0005273468),
 ('pure', 0.00047090003),
 ('importance', 0.00046973577),
 ('feelings', 0.00046481914),
 ('poems', 0.00046039498),
 ('poetic', 0.0004552021)]

In [48]:
climate_model.predict_output_word(['fear', 'science'])

[('science', 0.014418488),
 ('rise.', 0.0102302795),
 ('might', 0.006852584),
 ('oceans', 0.0046408046),
 ('Muslims', 0.0039983448),
 ('declaring', 0.0037104057),
 ("We'll", 0.0035391354),
 ('fear', 0.0023272592),
 ('But', 0.0021787095),
 ('inaugurati…', 0.0019829252)]

In [54]:
mfs_model.predict_output_word(['science', 'fear'])

[('fear', 0.022886122),
 ('ignorance…', 0.002145881),
 ('less"', 0.0015034123),
 ('Saturday', 0.0010085657),
 ('benefit', 0.0010045078),
 ('truth.', 0.00093975145),
 ('ignorance,', 0.00078472577),
 ('society', 0.0007797721),
 ('role', 0.0007566012),
 ('knowledge', 0.00072946213)]

## Analysis
Let's analyze the models to see what we can discern about the cultural differences between the three corpora they were trained on.

### Visualization

Visualizing word vectors is done with Embedding Projector, a tool from the team at TensorFlow. Gephi is also used to generate network visualizations, and finally Scattertext is used for visualizing sentiment. The relevant figures can be found in the images folder.

Here is the link for the embedding projection of climate data: https://projector.tensorflow.org/?config=https://raw.githubusercontent.com/ndalton12/nlp-english-project/master/json/climate_config.json

MFS data: https://projector.tensorflow.org/?config=https://raw.githubusercontent.com/ndalton12/nlp-english-project/master/json/mfs_config.json

Brown data: https://projector.tensorflow.org/?config=https://raw.githubusercontent.com/ndalton12/nlp-english-project/master/json/brown_config.json
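
Each link above passes a JSON config file to Embedding Projector. The config files themselves are not reproduced in this notebook, so the sketch below only shows the layout I believe the projector expects (an `embeddings` list with `tensorName`, `tensorShape`, `tensorPath` and `metadataPath`). The URLs and the vocabulary size are placeholders, and `projector_config` is a hypothetical helper name.

```python
import json


def projector_config(name, n_words, dim, tensor_url, metadata_url):
    """Build an Embedding Projector config dict for a single embedding."""
    return {
        'embeddings': [{
            'tensorName': name,
            'tensorShape': [n_words, dim],
            'tensorPath': tensor_url,
            'metadataPath': metadata_url,
        }]
    }


# Placeholder URLs and vocabulary size; substitute the raw URLs of the hosted TSV files
config = projector_config(
    name='climate',
    n_words=1000,
    dim=100,
    tensor_url='https://example.com/climate_tensor.tsv',
    metadata_url='https://example.com/climate_metadata.tsv',
)
print(json.dumps(config, indent=2))
```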

### KMeans Clustering

In [9]:
from sklearn import cluster

# Index through .wv; indexing the model object directly is deprecated
X_b = brown_model.wv[brown_model.wv.vocab]
X_c = climate_model.wv[climate_model.wv.vocab]
X_m = mfs_model.wv[mfs_model.wv.vocab]
NUM_CLUSTERS = 100



In [10]:
kmeans_b = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans_c = cluster.KMeans(n_clusters=NUM_CLUSTERS)
kmeans_m = cluster.KMeans(n_clusters=NUM_CLUSTERS)

In [11]:
kmeans_b.fit(X_b)
kmeans_c.fit(X_c)
kmeans_m.fit(X_m)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=100, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [41]:
labels_b = kmeans_b.labels_
labels_c = kmeans_c.labels_
labels_m = kmeans_m.labels_

In [37]:
# From Douglas Duhaime
# https://douglasduhaime.com/posts/clustering-semantic-vectors.html
from numbers import Number  # used by autovivify_list.__add__ and __sub__

class autovivify_list(dict):
  '''A pickleable version of collections.defaultdict'''
  def __missing__(self, key):
    '''Given a missing key, set initial value to an empty list'''
    value = self[key] = []
    return value

  def __add__(self, x):
    '''Override addition for numeric types when self is empty'''
    if not self and isinstance(x, Number):
      return x
    raise ValueError

  def __sub__(self, x):
    '''Also provide subtraction method'''
    if not self and isinstance(x, Number):
      return -1 * x
    raise ValueError
    
def find_word_clusters(labels_array, cluster_labels):
  cluster_to_words = autovivify_list()
  for c, i in enumerate(cluster_labels):
    cluster_to_words[ i ].append( labels_array[c] )
  return cluster_to_words

In [42]:
clusters_b = find_word_clusters(list(brown_model.wv.vocab.keys()), labels_b)
clusters_c = find_word_clusters(list(climate_model.wv.vocab.keys()), labels_c)
clusters_m = find_word_clusters(list(mfs_model.wv.vocab.keys()), labels_m)
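
The grouping step relies on the vocabulary keys being listed in the same order as the rows passed to KMeans, which holds here because both come from iterating the same `vocab` dictionary. For readers who do not need the pickleable `autovivify_list`, the same grouping can be sketched with `collections.defaultdict`; `group_words_by_cluster` is a hypothetical name and the words and labels are toy data.

```python
from collections import defaultdict


def group_words_by_cluster(words, cluster_labels):
    """Map each cluster id to the list of words assigned to it."""
    cluster_to_words = defaultdict(list)
    for word, label in zip(words, cluster_labels):
        cluster_to_words[label].append(word)
    # Convert to a plain dict so missing keys raise instead of silently appearing
    return dict(cluster_to_words)


# Toy words and labels, invented for illustration
print(group_words_by_cluster(['funds', 'garbage', 'services', 'pile'], [0, 1, 0, 1]))
```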

In [50]:
for i in range(10):
    print("brown: ", clusters_b[i][0:10])
    print("climate: ", clusters_c[i][0:10])
    print("mfs: ",clusters_m[i][0:10])
    print()

brown:  ['purchasing', 'departments', 'personnel', 'provide', 'funds', 'services', 'assistance', 'program', 'management', 'medical']
climate:  ['leave', 'behind', 'left', 'where', 'tons', 'cover', 'pile', 'garbage', 'remember', 'signs.']
mfs:  ['Together', 'Side', 'FACT:', 'exonerated', 'post-conviction', 'DNA.', 'stop?', 'court!', 'soon,', 'builders.']

brown:  ['many', 'other', 'among', 'servants', 'were', 'friendly', 'begin', 'most', 'While', 'sessions']
climate:  ['ALREADY', 'US.', "Solar's", 'putting', 'trending', 'top', 'Why?', 'topic', 'Miami', 'vs']
mfs:  ["don't", "That's", 'were', 'just', 'trying', 'know', 'disposing', 'shit', 'him', 'if']

brown:  ['produced', 'number', 'items', ':', 'employed', '71', 'according', 'each', 'names', 'test']
climate:  ['you', 'your', 'gonna', 'Which', 'How', 'or', 'Are', 'much', 'D.C.?', 'during']
mfs:  ['United', 'States', 'even', 'advancements,', 'live."', 'Lawrence', 'Krauss,', 'theoretical', 'physicist', "They're"]

brown:  ['Bar', 'Berry',

### Predictive text
This part uses the ULMFiT transfer learning model. It needs more compute and time to be done properly; only the climate language model is trained below, for a single epoch.

In [1]:
from fastai import *
from fastai.text import * 

In [189]:
path = 'data/cleaned/'

In [115]:
data_climate = TextLMDataBunch.from_csv(path, 'climate_csv.csv')

In [None]:
data_brown = TextLMDataBunch.from_csv(path, 'brown_csv.csv')

In [313]:
data_mfs = TextLMDataBunch.from_csv(path, 'mfs_csv.csv')

In [116]:
learn = language_model_learner(data_climate, pretrained_model=URLs.WT103, drop_mult=0.5)

In [117]:
learn.fit_one_cycle(1, 1e-2)

Total time: 3:22:57
epoch  train_loss  valid_loss  accuracy
1      1.991627    2.713907    0.520528  (3:22:57)



In [266]:
learn.predict("Technology is", n_words=5)

Total time: 00:00



'Technology is not interested in danger ,'

In [188]:
learn.save_encoder('climate_enc')

### Similarity querying (top_n = 20)
I am going to choose some words here that I think will be interesting to look at and graph in Gephi. This selection is obviously biased toward what I find interesting, but that kind of judgment is part of any research in the social sciences.

In [55]:
TOP_N = 20 # this value is more or less arbitrary, just picking a small number so the visualization isn't too crowded

words = ['science', 'fear', 'climate', 'liberty', 'justice', 'culture', 'money', 'environment', 'freedom', 'people', 'change', 
         'scientist', 'politics', 'hero', 'story', 'truth', 'lies', 'good', 'bad']

In [81]:
top_dict_b = {}
top_dict_c = {}
top_dict_m = {}

for word in words:
    lst_b = [x[0] for x in brown_model.wv.most_similar(positive=[word], topn=TOP_N)]
    lst_c = [x[0] for x in climate_model.wv.most_similar(positive=[word], topn=TOP_N)]
    lst_m = [x[0] for x in mfs_model.wv.most_similar(positive=[word], topn=TOP_N)]
    top_dict_b[word] = lst_b
    top_dict_c[word] = lst_c
    top_dict_m[word] = lst_m

### Graphing with networkx and Gephi
Using the queries above, we can create a graph and visualize it. View images/graphs to see the visualizations.

In [88]:
import networkx as nx
import matplotlib.pyplot as plt

In [95]:
g_b = nx.Graph(top_dict_b)
g_c = nx.Graph(top_dict_c)
g_m = nx.Graph(top_dict_m)

In [96]:
nx.write_gexf(g_b, "graphs/brown.gexf")
nx.write_gexf(g_c, "graphs/climate.gexf")
nx.write_gexf(g_m, "graphs/mfs.gexf")
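
Passing a dictionary of lists to `nx.Graph` treats each key as a node and each list as that node's neighbors, so every query word is linked to its top matches. The plain-Python sketch below shows which undirected edges that implies for a toy dictionary; `adjacency_to_edges` is a hypothetical helper and does not use networkx.

```python
def adjacency_to_edges(adjacency):
    """Return the set of undirected edges implied by a dict of word -> neighbor list."""
    edges = set()
    for word, neighbors in adjacency.items():
        for neighbor in neighbors:
            # frozenset makes (a, b) and (b, a) the same undirected edge
            edges.add(frozenset((word, neighbor)))
    return edges


# Toy query results, invented for illustration
toy_top = {'science': ['truth', 'fear'], 'fear': ['science', 'lies']}
edges = adjacency_to_edges(toy_top)
print(sorted(sorted(edge) for edge in edges))
```

Note that 'science' and 'fear' list each other, yet only one edge joins them, which is why the exported graphs are undirected.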

### Sentiment analysis

In [114]:
'''
VADER tool:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for 
Sentiment Analysis of Social Media Text. Eighth International Conference on 
Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
'''
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [99]:
import pandas as pd

In [119]:
brown_file = pd.read_csv('data/cleaned/brown_csv.csv')
climate_file = pd.read_csv('data/cleaned/climate_csv.csv')
mfs_file = pd.read_csv('data/cleaned/mfs_csv.csv')

In [121]:
for i, line in enumerate(brown_file['text']):
    print(line)
    ss = sid.polarity_scores(line)
    
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]))
    
    if i > 10:
        break

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place 
nan
compound: 0.2023, 
neg: 0.081, 
neu: 0.809, 
pos: 0.11, 
 The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted 
nan
compound: 0.7579, 
neg: 0.0, 
neu: 0.854, 
pos: 0.146, 
 The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr
nan
compound: 0.7506, 
neg: 0.045, 
neu: 0.775, 
pos: 0.18, 
 `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' 
nan
compound

I make a judgement call here: a text is labeled positive or negative only if its pos or neg score exceeds 0.1, with the larger of the two deciding the label and ties going to negative; otherwise it is labeled neutral. The threshold is there because the neutral score tends to dominate VADER's output.
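
The thresholding rule described above can be sketched as a small standalone function. `label_from_scores` is a hypothetical name; it takes a dictionary shaped like the output of `polarity_scores` and needs no NLTK install to run.

```python
def label_from_scores(scores, threshold=0.1):
    """Turn VADER-style polarity scores into 'pos', 'neg', or 'neu'."""
    if scores['neg'] > threshold or scores['pos'] > threshold:
        # Ties go to 'neg', matching the >= comparison used in the labeling loops
        return 'neg' if scores['neg'] >= scores['pos'] else 'pos'
    return 'neu'


# Scores copied from the first Brown sentence printed above
print(label_from_scores({'neg': 0.081, 'neu': 0.809, 'pos': 0.11, 'compound': 0.2023}))
```

Using one function for all three corpora would also remove the need to repeat the same branching in each loop.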

In [126]:
for i, line in enumerate(brown_file['text']):
    ss = sid.polarity_scores(line)
        
    if ss['neg'] > 0.1 or ss['pos'] > 0.1:
        if ss['neg'] >= ss['pos']:
            result = 'neg'
        else:
            result = 'pos'
    else:
        result = 'neu'
        
    if i % 5000 == 0:
        print(i)
    
    
    brown_file.loc[i, 'label'] = result  # single .loc call avoids chained assignment
    
print('done brown')
    
for i, line in enumerate(climate_file['text']):
    ss = sid.polarity_scores(line)
    
    if ss['neg'] > 0.1 or ss['pos'] > 0.1:
        if ss['neg'] >= ss['pos']:
            result = 'neg'
        else:
            result = 'pos'
    else:
        result = 'neu'
    
    if i % 10000 == 0:
        print(i)
    
    climate_file.loc[i, 'label'] = result  # single .loc call avoids chained assignment
    
print('done climate')

for i, line in enumerate(mfs_file['text']):
    ss = sid.polarity_scores(line)
    
    if ss['neg'] > 0.1 or ss['pos'] > 0.1:
        if ss['neg'] >= ss['pos']:
            result = 'neg'
        else:
            result = 'pos'
    else:
        result = 'neu'
        
    if i % 10000 == 0:
        print(i)
    
    
    mfs_file.loc[i, 'label'] = result  # single .loc call avoids chained assignment
    
print('done mfs')

0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
done brown
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
done climate
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
210000
220000
done mfs


In [127]:
brown_file.head()

Unnamed: 0,label,text
0,pos,The Fulton County Grand Jury said Friday an in...
1,pos,The jury further said in term-end presentment...
2,pos,The September-October term jury had been char...
3,pos,`` Only a relative handful of such reports wa...
4,neg,The jury said it did find that many of Georgi...


In [128]:
climate_file.head()

Unnamed: 0,label,text
0,pos,"March for our right to clean air, water, land ..."
1,pos,Melania Trump a priority under new US immigra...
2,pos,USA Should Leave B'cuz It’s a Business Deal’ ...
3,neu,Our government is being run by climate denier...
4,pos,"Tomorrow, I march for climate justice and to ..."


In [129]:
mfs_file.head()

Unnamed: 0,label,text
0,neg,"Evansville, IN Trump would cut med research, d..."
1,pos,is fighting science. I guarantee science will...
2,pos,That's a bummer. is fighting science. I guara...
3,neu,Authorities estimated that more than people w...
4,pos,This is what democracy looks like. President ...


In [132]:
brown_file.to_csv('data/csv/brown_sentiment.csv',index=False)

In [133]:
climate_file.to_csv('data/csv/climate_sentiment.csv',index=False)

In [134]:
mfs_file.to_csv('data/csv/mfs_sentiment.csv',index=False)

In [147]:
# use pd.read_csv to load back in

### Scattertext
Using scattertext to plot sentiment analysis trends.

In [136]:
import scattertext as st
import spacy

In [138]:
# run "python3 -m spacy download en" first
nlp = spacy.load('en')

In [146]:
# This block took a few hours on a 2.3 GHz CPU
brown_corpus = st.CorpusFromPandas(brown_file, category_col='label', text_col='text', nlp=nlp).build()
climate_corpus = st.CorpusFromPandas(climate_file, category_col='label', text_col='text', nlp=nlp).build()
mfs_corpus = st.CorpusFromPandas(mfs_file, category_col='label', text_col='text', nlp=nlp).build()

In [172]:
# To save scattertext corpora because they take forever to build
import pickle

In [173]:
with open('models/scattertext/brown.pkl', 'wb') as output:
    pickle.dump(brown_corpus, output, pickle.HIGHEST_PROTOCOL)

In [174]:
with open('models/scattertext/climate.pkl', 'wb') as output:
    pickle.dump(climate_corpus, output, pickle.HIGHEST_PROTOCOL)

In [175]:
with open('models/scattertext/mfs.pkl', 'wb') as output:
    pickle.dump(mfs_corpus, output, pickle.HIGHEST_PROTOCOL)

In [180]:
# To load
with open('models/scattertext/brown.pkl', 'rb') as input_f:
    brown_corpus = pickle.load(input_f)
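
The save and load cells above repeat the same pattern for each corpus. A pair of small helpers, sketched below with hypothetical names `save_pickle` and `load_pickle`, would cover all three; the example round-trips a stand-in dictionary through a temporary directory rather than a real scattertext corpus.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Pickle obj to path with the highest available protocol."""
    with open(path, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Load a pickled object from path."""
    with open(path, 'rb') as input_f:
        return pickle.load(input_f)


# Round-trip a stand-in object through a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.pkl')
    save_pickle({'corpus': 'placeholder'}, demo_path)
    restored = load_pickle(demo_path)
print(restored)
```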

In [143]:
from pprint import pprint

In [181]:
print(list(brown_corpus.get_scaled_f_scores_vs_background().index[:10]))

['af', 'khrushchev', 'anode', 'negroes', 'prokofieff', 'negro', 'kohnstamm', 'helva', 'mrs', 'matsuo']


In [150]:
print(list(climate_corpus.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'marching', 'climate', 'retweet', 'streets', 'ppl', 'fight', 'badlyyyy', 'dc', 'protesting']


In [151]:
print(list(mfs_corpus.get_scaled_f_scores_vs_background().index[:10]))

['trump', 'marching', 'scientists', 'protest', 'retweet', 'science', 'lobbyists', 'sword', 'marches', 'phds']


In [156]:
term_freq_df_b = brown_corpus.get_term_freq_df()
term_freq_df_c = climate_corpus.get_term_freq_df()
term_freq_df_m = mfs_corpus.get_term_freq_df()

In [157]:
term_freq_df_b['neg score'] = brown_corpus.get_scaled_f_scores('neg')
term_freq_df_c['neg score'] = climate_corpus.get_scaled_f_scores('neg')
term_freq_df_m['neg score'] = mfs_corpus.get_scaled_f_scores('neg')

In [158]:
pprint(list(term_freq_df_b.sort_values(by='neg score', ascending=False).index[:10]))
pprint(list(term_freq_df_c.sort_values(by='neg score', ascending=False).index[:10]))
pprint(list(term_freq_df_m.sort_values(by='neg score', ascending=False).index[:10]))

['hell',
 'dead',
 'enemy',
 'death',
 'pain',
 'died',
 'bad',
 'killed',
 'war',
 'murder']
['treason',
 'racism',
 'polluted',
 'lies',
 'fear ny',
 'might rise',
 'oceans might',
 'fear in',
 'declaring war',
 'islamophobia']
['and no',
 'a funny',
 'funny sign',
 'are fighting',
 'war in',
 'marched to',
 'think of',
 'fighting to',
 'w/ them',
 'wonder scientists']


In [160]:
term_freq_df_m['pos score'] = mfs_corpus.get_scaled_f_scores('pos')

In [163]:
pprint(list(term_freq_df_m.sort_values(by='pos score', ascending=False).index[:30]))

['was chicago',
 'show donald',
 'retweet to',
 'well this',
 'cherish',
 'cherish it',
 'those today',
 'embrace it',
 'today who',
 'a shout',
 'who the',
 'embrace',
 'chicago today',
 'good is',
 'amazing signs',
 'how good',
 'signs up',
 'grateful',
 'many amazing',
 'up here',
 'in how',
 'who support',
 'grateful to',
 'today all',
 'support scien',
 'so yes',
 'relations so',
 'scien',
 'arts of',
 'useful arts']


In [164]:
html = st.produce_scattertext_explorer(brown_corpus, category='neg', 
                                       category_name='Negative', 
                                       not_category_name='Pos or neutral', 
                                       width_in_pixels=1000)

In [165]:
open("./scattertext-html/test.html", 'wb').write(html.encode('utf-8'))

10034407

In [167]:
html_b = st.word_similarity_explorer_gensim(brown_corpus,
                                       category='neg',
                                       category_name='Negative',
                                       not_category_name='Pos/Neu',
                                       target_term='science',
                                       minimum_term_frequency=5,
                                       pmi_threshold_coefficient=4,
                                       width_in_pixels=1000,
                                       word2vec=brown_model,
                                       max_p_val=0.05,
                                       save_svg_button=True)



In [169]:
open('./scattertext-html/brown_gensim_similarity.html', 'wb').write(html_b.encode('utf-8'))

11182897

In [176]:
html_c = st.word_similarity_explorer_gensim(climate_corpus,
                                       category='neg',
                                       category_name='Negative',
                                       not_category_name='Pos/Neu',
                                       target_term='science',
                                       minimum_term_frequency=5,
                                       pmi_threshold_coefficient=4,
                                       width_in_pixels=1000,
                                       word2vec=climate_model,
                                       max_p_val=0.05,
                                       save_svg_button=True)



In [177]:
open('./scattertext-html/climate_gensim_similarity.html', 'wb').write(html_c.encode('utf-8'))

43715579

In [178]:
html_m = st.word_similarity_explorer_gensim(mfs_corpus,
                                       category='neg',
                                       category_name='Negative',
                                       not_category_name='Pos/Neu',
                                       target_term='science',
                                       minimum_term_frequency=5,
                                       pmi_threshold_coefficient=4,
                                       width_in_pixels=1000,
                                       word2vec=mfs_model,
                                       max_p_val=0.05,
                                       save_svg_button=True)



In [179]:
open('./scattertext-html/mfs_gensim_similarity.html', 'wb').write(html_m.encode('utf-8'))

69292241