# No More Silence -- Word2Vec of Documents by Year

The following code runs Spark's implementation of Word2Vec on the No More Silence documents, grouped by year, on Information Commons.

By convention, during preprocessing, we filtered out all documents spanning more than 3 years. For each remaining document (spanning 3 or fewer years), every sentence is mapped to each year the document spans.
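The preprocessing code is not part of this notebook, so the following is only a minimal sketch of the convention described above. The function name `expand_by_year`, the `max_span` parameter, and the choice to count distinct years are assumptions made for illustration.

```python
def expand_by_year(doc_years, sentences, max_span=3):
    """Map every sentence to each year a document spans.

    Hypothetical helper: returns an empty list when the document
    covers more than `max_span` distinct years (assumed meaning of "span").
    """
    years = sorted(set(doc_years))
    if len(years) > max_span:
        # Document is too broad in time, so it is dropped entirely
        return []
    # One (year, sentence) pair per year the document spans
    return [(year, sentence) for year in years for sentence in sentences]


pairs = expand_by_year([1987, 1988], ["first sentence", "second sentence"])
dropped = expand_by_year([1985, 1986, 1987, 1988], ["too broad"])
```

Under this sketch, a two-sentence document spanning two years yields four pairs, while a document spanning four years yields none.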

In [1]:
# Load modules
from pyspark.sql.types import Row
from pyspark.ml.feature import Word2Vec

from json import dumps
from time import time

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
1,application_1564501043604_0006,pyspark3,idle,Link,Link,✔


SparkSession available as 'spark'.


### Load Data

Loads data from my AWS S3 bucket

In [2]:
# Read in data from S3
raw = sc.textFile("s3://bchsi-spark02/home/millsh/sents.txt")

### Transform Raw Data

The input file is formatted as follows: each line contains the year(s) to which a sentence belongs, followed by the sentence itself.
The year(s) and the sentence are delimited by a tab ("\t"), the words of a sentence by a space (" "), and the years by a dash ("-").

In [3]:
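# Split each line into (list of years, list of words)
raw_split = raw.map(
    lambda line: (
        line.split("\t")[0].split("-"),  # years, e.g. ["1987", "1988"]
        line.split("\t")[1].split(" ")   # words of the sentence
    )
)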
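To make the line format concrete, here is the same parsing logic applied to a single line in plain Python, without Spark. The sample line and the name `parse_line` are made up for illustration and do not come from the data set.

```python
def parse_line(line):
    """Split one input line into (list of year strings, list of words)."""
    fields = line.split("\t")
    return fields[0].split("-"), fields[1].split(" ")


# Hypothetical line: two years, a tab, then a space-delimited sentence
sample = "1987-1988\tthe committee met in march"
years, words = parse_line(sample)
```

Note that the years stay as strings, which is why the filter in the next cell compares against `str(year)`.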

### Our Word2Vec Function

Function to run Spark's Word2Vec for a given year.

In [4]:
def w2v(year):
    # Filter by year, and select words
    sents_filtered = raw_split.filter(lambda row: str(year) in row[0]) \
                     .map(lambda row: Row(row[1]))
    
    # Create Spark DF of words
    df = spark.createDataFrame(sents_filtered, ["text"])
    
    # Run Word2Vec
    word2Vec = Word2Vec(vectorSize=128, minCount=3, maxIter=50,
                        inputCol="text", outputCol="result")
    model = word2Vec.fit(df)

    # Return dictionary of embeddings (keys are words, values are word vectors as lists)
    return { text : [ e for e in vector ] for text, vector in model.getVectors().collect() }
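The dictionary returned by `w2v` can be queried directly in plain Python. The sketch below ranks words by cosine similarity to a target word; `cosine`, `most_similar`, and the toy two-dimensional vectors are illustrative assumptions, not part of the original notebook.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(embeddings, word, topn=5):
    """Return the `topn` (word, similarity) pairs closest to `word`."""
    target = embeddings[word]
    scores = [
        (other, cosine(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy embeddings standing in for the output of w2v(year)
toy = {"clinic": [1.0, 0.0], "hospital": [0.9, 0.1], "parade": [0.0, 1.0]}
neighbors = most_similar(toy, "clinic", topn=2)
```

With real output, the call would look like `most_similar(results[1987], some_word)`, where `some_word` is any word that survived the `minCount=3` cutoff for that year.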

### Run Word2Vec by Year

In [5]:
# Relevant years (bulk of our data)
years = range(1982,1996)
results = {}

for year in years:
    t0 = time()
    
    # Run Word2Vec, and store embeddings in dictionary indexed by year
    results[year] = w2v(year)

    # Print processing time in sec
    print(year, time() - t0)

1982 59.433645248413086
1983 14.445246458053589
1984 140.23983120918274
1985 208.82717609405518
1986 176.01878094673157
1987 204.8292965888977
1988 257.15314650535583
1989 184.97558736801147
1990 432.4536256790161
1991 255.13430619239807
1992 289.4339325428009
1993 324.9008867740631
1994 218.08603239059448
1995 83.61162328720093

In [6]:
# Save Results in JSON format
with open("/tmp/w2vRes128-100.json", "w+") as ofile:
    ofile.write(dumps(results))

272200019
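One detail to keep in mind when reading the file back: JSON object keys are always strings, so the integer years used as keys in `results` come back as `"1982"`, `"1983"`, and so on. The sketch below shows the round trip on a tiny stand-in dictionary; `toy_results` is a made-up example, not real output.

```python
from json import dumps, loads

# Tiny stand-in for `results`: {year: {word: vector}}
toy_results = {1982: {"word": [0.1, 0.2]}}

# Serializing and parsing turns the integer year keys into strings
restored = loads(dumps(toy_results))

# Convert the year keys back to integers after loading
by_year = {int(year): vectors for year, vectors in restored.items()}
```

The same conversion applies when loading the saved file with `json.load`.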