The files 'tumblr.txt' and 'words.txt' are stored in the session's runtime and are deleted when the runtime terminates. Both files are attached on Canvas for reference.

The words under consideration include all the tags we use for scraping in 'Final-Scraping-Sinha.ipynb' and some additional words for our reference. This list was compiled manually, drawing on our knowledge of word meanings so that we can sanity-check the results of our analysis (e.g., 'alt' and 'alternative' should yield similar results).

This is the code used to analyse the semantics of the words under consideration. For the purpose of comparison, we use three corpora from different time periods and contexts. The first two descriptions are taken from [https://www.nltk.org/book/ch02.html](https://www.nltk.org/book/ch02.html):


*   Brown corpus: established text samples of American English (1961)
*   Webtext corpus: less formal text samples including content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews
*   Tumblr corpus (tumblr.txt): data scraped from Tumblr for this project, containing over 70,000 posts tagged with the words under consideration



In [96]:
# Import NLTK
import nltk
nltk.download('punkt')
from nltk.corpus import brown, webtext
nltk.download('brown')
nltk.download('webtext')

# Import other useful libraries
import re
from gensim.models import Word2Vec
from collections import Counter

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


In [7]:
# Reads the contents of the file 'tumblr.txt'
with open('tumblr.txt') as f:
  text = f.read()

# Splits the text into posts
tumblr_sents = text.split('\n')

In [12]:
# The following code is adapted from Practicum 3
# Takes in a list of tokens and returns a list of cleaned tokens
def clean_text(tokens):
  pattern = re.compile("[a-zA-Z0-9']")
  clean_tokens = [tok.lower() for tok in tokens if pattern.match(tok)]
  return clean_tokens
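
As a quick check of what `clean_text` keeps and drops, here is a minimal sketch on a made-up post. The example post and the whitespace split are assumptions for illustration only; they are not taken from 'tumblr.txt'. Note that `pattern.match` only tests the first character of each token, so hashtags, emojis, and punctuation-only tokens are dropped.

```python
import re

def clean_text(tokens):
    # Same logic as the cell above: keep tokens whose first character is a
    # letter, digit, or apostrophe, and convert them to lower case.
    pattern = re.compile("[a-zA-Z0-9']")
    return [tok.lower() for tok in tokens if pattern.match(tok)]

# Hypothetical post, split on whitespace purely for illustration.
example_tokens = "OMG this is SO funny 😂 !!! #lol".split()
cleaned = clean_text(example_tokens)
print(cleaned)

# The function expects a list of tokens. A raw string is iterated one
# character at a time, so the result is a list of single characters.
print(clean_text("Hi!"))
```

Because of the second behaviour, a post should be tokenised before it is passed to `clean_text`.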

In [11]:
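# Tokenises each post, then cleans the tokens to remove emojis and punctuation
# and convert words to lower case. clean_text expects a list of tokens, so each
# post (a string) is tokenised first; otherwise it would be iterated character
# by character.
tumblr_sents = [clean_text(nltk.word_tokenize(post)) for post in tumblr_sents]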

In [21]:
# Creates and trains models using a neural network to learn word association for
# each of the corpora under consideration.
tumblr = Word2Vec(tumblr_sents)
web_sents = webtext.sents()
web = Word2Vec(web_sents)
brown_sents = brown.sents()
brown_model = Word2Vec(brown_sents)
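
Each trained model maps every word in its vocabulary to a vector, and gensim's `most_similar` ranks other words by the cosine similarity between those vectors. The sketch below illustrates that ranking with made-up three-dimensional vectors; the vectors and the helper `toy_most_similar` are hypothetical and stand in for the real model, whose vectors have 100 dimensions by default.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration only; not output from any trained model.
toy_vectors = {
    "lol": [0.9, 0.1, 0.0],
    "haha": [0.8, 0.2, 0.1],
    "wine": [0.0, 0.1, 0.9],
}

def toy_most_similar(word, vectors, topn=2):
    # Score every other word against `word` and return the top matches.
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = toy_most_similar("lol", toy_vectors)
print(ranked)
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated words score close to 0.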

In [108]:
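# Vocabulary of each of the models. Each vocabulary is taken from the trained
# model (not the raw text) so that every word that passes the membership checks
# below has a vector; sets make those checks fast.
tumblr_vocab = set(tumblr.wv.index_to_key)
webtext_vocab = set(web.wv.index_to_key)
brown_vocab = set(brown_model.wv.index_to_key)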
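
One point worth keeping in mind when building these vocabularies: `Word2Vec` discards words that occur fewer than `min_count` times (5 by default), so a vocabulary counted from the raw text can contain words the model has no vector for. The following sketch shows the difference on made-up tokenised posts; the posts and the lowered threshold are assumptions chosen so the effect is visible on tiny data.

```python
from collections import Counter

# Hypothetical tokenised posts, for illustration only.
posts = [
    ["lol", "that", "is", "funny"],
    ["lol", "lol", "smh"],
    ["lol", "funny", "lol"],
]
min_count = 2  # gensim's default is 5; lowered here for the toy data

# Count every token across all posts.
counts = Counter(tok for post in posts for tok in post)

raw_vocab = set(counts)  # every word seen in the text
model_like_vocab = {tok for tok, n in counts.items() if n >= min_count}

# Words present in the raw text but absent from the model-style vocabulary.
print(sorted(raw_vocab - model_like_vocab))
```

Looking up any of the printed words in a model trained with this threshold would raise a `KeyError`, which is why the membership check should use the model's own vocabulary.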

In [109]:
# Reads in the contents of the file 'words.txt'
with open('words.txt') as f:
  words = f.readlines()

# Iterates through 'words' and prints the 10 most similar words to each of them
# in each of the corpora under consideration.
for word in words:

  word_strip = word.strip()

  print(word_strip)
  print('-------------------------------------------------------')

  # Ensures that the word is in the Tumblr vocabulary.
  if word_strip in tumblr_vocab:
    print('TUMBLR:')
    tumblr_sim = tumblr.wv.most_similar(word_strip, topn=10)
    for i in tumblr_sim:
      print(i)

  # Ensures that the word is in the webtext vocabulary.
  if word_strip in webtext_vocab:
    print('WEBTEXT:')
    webtext_sim = web.wv.most_similar(word_strip, topn=10)
    for i in webtext_sim:
      print(i)
  
  # Ensures that the word is in the Brown corpus vocabulary.
  if word_strip in brown_vocab:
    print('BROWN:')
    brown_sim = brown_model.wv.most_similar(word_strip, topn=10)
    for i in brown_sim:
      print(i)
      
  print('-------------------------------------------------------')

smh
-------------------------------------------------------
TUMBLR:
('lmao', 0.6884260773658752)
('cuz', 0.6883441805839539)
('abt', 0.6754273772239685)
('sheesh', 0.6723950505256653)
('cause', 0.651140570640564)
('shook', 0.6496242880821228)
('literally', 0.6445848345756531)
('seriously', 0.6441153287887573)
('honestly', 0.6428632736206055)
('always', 0.6423439383506775)
-------------------------------------------------------
lol
-------------------------------------------------------
TUMBLR:
('haha', 0.9380366802215576)
('lmao', 0.89799565076828)
('funny', 0.897441565990448)
('omg', 0.893879771232605)
('jokes', 0.8830115795135498)
('wtf', 0.8740629553794861)
('hilarious', 0.8264898657798767)
('texts', 0.803926408290863)
('texting', 0.7992066144943237)
('humor', 0.7978721857070923)
-------------------------------------------------------
slay
-------------------------------------------------------
TUMBLR:
('girlboss', 0.820033609867096)
('girlblog', 0.8009769916534424)
('gatekeep', 0.7