# MSDS 610 - Final Project - Jeremy Beard - 20220626

In this final project for the MSDS 610 Data Engineering course, I take data queried from Stack Overflow (obtained through Stack Exchange) and build an inverted index from it using PySpark, among other tools. I found the task challenging, and the freedom the assignment allowed made choosing the best way to build the inverted index a thought-provoking exercise.

In the end, I used a combination of PySpark and a bit of brute-force Python to get the job done. I then exported the resulting index to a CSV file, and voila! There we have it.

In [1]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(appName="MSDS610FINAL")

To start, I created a SparkContext and loaded the data queried from Stack Overflow (QueryResults2.csv). The file holds 50,000 rows, each containing only the ID and the tags of a post from the year 2020.

In [2]:
text_file = sc.textFile("file:///Users/jerem/OneDrive/Documents/School/_REGIS/2022-05_Summer/MSDS610/Week78-project/QueryResults2.csv")

from pyspark.rdd import RDD
isinstance(text_file, RDD)

True

In [3]:
text_file.take(5)

['id,tags',
 '"59794725","<r><function><for-loop><apply>"',
 '"59794726",""',
 '"59794727","<python><variables><while-loop>"',
 '"59794728","<python><environment><jupyter-lab>"']

In [4]:
next_list = text_file.map(lambda x: x.split(','))
next_list.take(5)

[['id', 'tags'],
 ['"59794725"', '"<r><function><for-loop><apply>"'],
 ['"59794726"', '""'],
 ['"59794727"', '"<python><variables><while-loop>"'],
 ['"59794728"', '"<python><environment><jupyter-lab>"']]
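
Because every field in the file is quoted, splitting on ',' leaves the quote characters inside each value, as the output above shows. As a minimal sketch of one alternative (not what this notebook does), Python's standard csv module parses the same lines and removes the quotes. The sample lines are copied from the take(5) output; the names lines, rows, header, and records are only for illustration.

```python
import csv

# Sample lines copied from the take(5) output above
lines = [
    'id,tags',
    '"59794725","<r><function><for-loop><apply>"',
    '"59794726",""',
]

# csv.reader accepts any iterable of strings and handles the quoting
rows = list(csv.reader(lines))
header, records = rows[0], rows[1:]

print(header)
print(records)
```

The same reader could be applied per partition in Spark, but for a file of this size reading it through spark.read.csv is likely the simpler route.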

The trickiest part of this project, in my opinion, was converting the tags from one long unfiltered string into an organized list of tags for each post. I used a combination of strip() and split() to build this list, stored in tags_filted (an intentional typo, don't worry :) ).

The cell below was not used much afterwards, but it provided the basis for the same strip() and split() filtering used later. See below.

In [5]:
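def read_tags_raw(tag):
    # Drop the CSV quotes and the outer angle brackets, then split on '><'
    return tag.strip('"').strip('<>').split('><')

# map() is lazy: this only defines the transformation and returns a new RDD;
# nothing runs until an action such as take() or collect() is called.
# row[1] is the tags field of each [id, tags] pair.
tags_filted = next_list.map(lambda row: read_tags_raw(row[1]))
tags_filted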

PythonRDD[4] at RDD at PythonRDD.scala:53
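
The same tag extraction can also be sketched with a regular expression instead of strip() and split(). This is an alternative technique, not the method used in this notebook, and extract_tags and TAG_PATTERN are hypothetical names. The pattern ignores the surrounding CSV quotes and yields an empty list for posts without tags.

```python
import re

# One tag is any run of characters between '<' and '>'
TAG_PATTERN = re.compile(r'<([^<>]+)>')

def extract_tags(raw):
    """Return every tag name found between '<' and '>' in raw."""
    return TAG_PATTERN.findall(raw)

# Sample values copied from the take(5) output above
print(extract_tags('"<r><function><for-loop><apply>"'))
print(extract_tags('""'))
```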

In [6]:
next_list.take(5)

[['id', 'tags'],
 ['"59794725"', '"<r><function><for-loop><apply>"'],
 ['"59794726"', '""'],
 ['"59794727"', '"<python><variables><while-loop>"'],
 ['"59794728"', '"<python><environment><jupyter-lab>"']]

I then began to experiment with PySpark DataFrames. I found them easier to work with because I have more experience with DataFrames in general. I first loaded the data again with spark.read.csv and then converted the result to a pandas DataFrame.

In [7]:
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName(
    'Read CSV File into DataFrame').getOrCreate()
 
authors = spark.read.csv("file:///Users/jerem/OneDrive/Documents/School/_REGIS/2022-05_Summer/MSDS610/Week78-project/QueryResults2.csv", sep=',',
                         inferSchema=True, header=True)
 
df = authors.toPandas()
df.head()

Unnamed: 0,id,tags
0,59794725,<r><function><for-loop><apply>
1,59794726,
2,59794727,<python><variables><while-loop>
3,59794728,<python><environment><jupyter-lab>
4,59794729,<linux-kernel><linux-device-driver><embedded-l...


In [8]:
print(len(df['tags']))

50000


The cell below is an important one. The lists splittags and ids are used later to build the inverted index and the final PySpark output. Each row's tags (currently one long string) are passed to a function that splits the string into a short list of tags, and that short list is appended to splittags. Rows with no tags are skipped, which is why only 21,248 of the 50,000 rows remain.

In essence, after the cell below we are left with a list of IDs and a parallel list of tag lists.

In [9]:
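def split_tags(tag):
    # '<a><b><c>' -> ['a', 'b', 'c']
    return tag.strip('>').strip('<').split('><')

splittags = []  # one list of tags per post
ids = []        # post IDs, aligned index-for-index with splittags
for post_id, lin in zip(df['id'], df['tags']):
    if lin is not None:  # skip rows with no tags
        splittags.append(split_tags(lin))
        ids.append(post_id)

print(len(splittags))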

21248


In [10]:
print(len(ids))

21248


The cell below is another key milestone. It walks through every tag and builds a list of unique tags. For each unique tag, it scans all the posts to find which IDs carry that tag. The result is two key lists: final_taglist, a flat list of unique tags, and final_idlist, a list of lists of IDs, where entry k holds the IDs for tag k.

These two lists are the essence of the inverted index. Now we just need to cross some t's and dot some i's :)

In [11]:
final_taglist = []  # unique tags, in order of first appearance
final_idlist = []   # final_idlist[k] holds the IDs of posts tagged final_taglist[k]

def returnIdsWithTag(tg):
    """Return the IDs of every post whose tag list contains tg."""
    return [post_id for post_id, tags in zip(ids, splittags) if tg in tags]


for tagline in splittags:  # for each post's list of tags
    for tag in tagline:  # for each individual tag
        if tag not in final_taglist:  # new tag: record it and collect its IDs once
            final_taglist.append(tag)
            final_idlist.append(returnIdsWithTag(tag))
            
                    

The cells below are just checking to make sure the output is correct.

In [12]:
print("len_tags: {0}, len_ids: {1}".format(len(final_taglist), len(final_idlist)))

len_tags: 8818, len_ids: 8818
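
The tag-to-IDs lookup above rescans every post for each new tag. As a hedged sketch of an alternative, a dictionary can build the same mapping in one pass over the posts. The function name build_index and the three sample posts (taken from the first rows of the data) are only for illustration; this is not the code that produced the results in this notebook.

```python
from collections import defaultdict

def build_index(ids, splittags):
    """Map each tag to the list of post IDs that carry it, in one pass."""
    index = defaultdict(list)
    for post_id, tags in zip(ids, splittags):
        # dict.fromkeys drops repeated tags within a post but keeps their order
        for tag in dict.fromkeys(tags):
            index[tag].append(post_id)
    return index

# Small sample in the same shape as the ids and splittags lists
sample_ids = [59794725, 59794727, 59794728]
sample_tags = [
    ['r', 'function', 'for-loop', 'apply'],
    ['python', 'variables', 'while-loop'],
    ['python', 'environment', 'jupyter-lab'],
]

index = build_index(sample_ids, sample_tags)
print(index['python'])
```

The keys of the dictionary play the role of final_taglist and the values the role of final_idlist, and the work grows with the total number of tags rather than with tags times posts.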


In [13]:
print("len_tag1: {0}, len_id1: {1}".format(len(final_taglist[0]), len(final_idlist[0]) ))

len_tag1: 1, len_id1: 608


In [14]:
print("len_tag2: {0}, len_id2: {1}".format(len(final_taglist[1]), len(final_idlist[1]) ))

len_tag2: 8, len_id2: 125


In [15]:
print("len_tag3: {0}, len_idlist3: {1}".format(len(final_taglist[2]), len(final_idlist[2]) ))

len_tag3: 8, len_idlist3: 82


Sorry for providing all this messy output below. I just wanted to make sure everything looked okay!

In [16]:
for tag, id_list in zip(final_taglist, final_idlist):
    print("tag: {0}, ids: {1}".format(tag, id_list))

tag: r, ids: [59794725, 59794738, 59795246, 59794719, 59794724, 59794836, 59795421, 59795108, 59795573, 60102469, 60102606, 59795787, 59795790, 59795816, 59795843, 59795461, 60102414, 60102761, 60102932, 60102934, 60102966, 60103414, 60102567, 60102595, 60103010, 60103019, 60103147, 60103225, 60103740, 60103760, 60102742, 60103246, 60103247, 60103288, 60103475, 60103998, 60104049, 60103531, 60103536, 60103617, 60103656, 60103687, 60294046, 60294152, 60294211, 60521978, 60522038, 60294335, 60522130, 60522189, 60522238, 60522368, 60522703, 60522739, 60522107, 60522443, 60103938, 60294397, 60294408, 60294434, 60522578, 60522651, 60523184, 60523253, 60964576, 60523104, 60965008, 60965056, 60965115, 60965393, 60965485, 60965497, 60965311, 60965322, 60964967, 60965814, 60965994, 60966051, 60966056, 60966462, 60966524, 60966900, 60966592, 60967221, 60966976, 60967157, 60966807, 60967407, 60967408, 60967351, 60967982, 60968004, 60968036, 60968066, 60968145, 60967831, 60967894, 60967939, 614732

tag: compiler-construction, ids: [61473477, 59560902, 59573311]
tag: ll, ids: [61473477]
tag: thread-dump, ids: [61473479]
tag: syntax-error, ids: [61473489, 61474995, 64978292, 59550274, 59557409, 59568974, 59577120]
tag: findall, ids: [61473489, 64976182]
tag: python-re, ids: [61473489, 61475116, 61796809, 62591863, 65438464, 65438868, 65439746]
tag: powerapps, ids: [61473114, 64974493, 64978123, 59557099, 59565239, 59575132]
tag: wcf, ids: [61473115, 61474736, 62590465, 62590234, 59564109, 59564224, 59568335, 59575650]
tag: netnamedpipebinding, ids: [61473115]
tag: button, ids: [61473116, 61474286, 61797389, 62591079, 62590073, 62912578, 64974208, 64979975, 64979533, 64980557, 64980376, 64982530, 65440635, 59549422, 59552532, 59553418, 59554080, 59556990, 59557510, 59559272, 59559371, 59561073, 59561602, 59562922, 59564478, 59564492, 59566675, 59567298, 59568444, 59568778, 59572308, 59572131, 59572822, 59574766, 59579385, 59580679]
tag: joystick, ids: [61473116]
tag: frontend, ids: 

tag: mutex, ids: [64982540, 59553289]
tag: dialogflow-cx, ids: [64982552]
tag: kommunicate, ids: [64982552]
tag: dartfmt, ids: [64982567, 64982577, 64982578]
tag: jsp-tags, ids: [64982572]
tag: setter, ids: [64982580, 59561613, 59566368]
tag: libvlcsharp, ids: [64982591]
tag: laravel-cashier, ids: [65437246, 59557358, 59568533]
tag: builder, ids: [65437307]
tag: smartcard, ids: [65437342]
tag: emv, ids: [65437342]
tag: epson, ids: [65437343]
tag: duktape, ids: [65437351]
tag: clean-architecture, ids: [65437354, 65439323, 59551753, 59576331]
tag: pki, ids: [65437377]
tag: scheduler, ids: [65437395, 59566676]
tag: user-defined-types, ids: [65438295]
tag: android-immersive, ids: [65438311]
tag: scintilla, ids: [65438326]
tag: lexilla, ids: [65438326]
tag: cgminer, ids: [65438329]
tag: borrow-checker, ids: [65438330]
tag: hexo, ids: [65437733, 59560983]
tag: pygame-clock, ids: [65437734]
tag: underline, ids: [65437742]
tag: hook, ids: [65437762, 59555068, 59570571, 59579988]
tag: use-ref, 

tag: fluxlang, ids: [59571767]
tag: taocp, ids: [59571774]
tag: device, ids: [59571784]
tag: mido, ids: [59571840]
tag: heidisql, ids: [59571292]
tag: scipy-optimize, ids: [59571293]
tag: mamp, ids: [59571297]
tag: audio-streaming, ids: [59571299]
tag: python-sounddevice, ids: [59571299]
tag: taylor-series, ids: [59571320]
tag: grails-orm, ids: [59571335]
tag: grails-3.0, ids: [59571335]
tag: grails3, ids: [59571335]
tag: image-capture, ids: [59571384]
tag: raw-types, ids: [59571392]
tag: object-type, ids: [59571392]
tag: nrules, ids: [59571983]
tag: django-rest-knox, ids: [59571985]
tag: exploit, ids: [59571990]
tag: address-sanitizer, ids: [59571990]
tag: stdev, ids: [59572005]
tag: windows-server-container, ids: [59572011]
tag: numberformatexception, ids: [59572069]
tag: context-free-grammar, ids: [59572076]
tag: context-free-language, ids: [59572076]
tag: navigationlink, ids: [59572085]
tag: ml-agent, ids: [59571466]
tag: solver, ids: [59571467]
tag: cllocationmanager, ids: [595714

Now that we (essentially) have the final output, I convert each list of IDs to one long space-separated string, so that creating the last PySpark object and exporting it to CSV is a piece of cake.

In [17]:
def getIdListString(list_ids):
    # Each ID is followed by a space, so the result keeps a trailing space
    return "".join(str(item) + " " for item in list_ids)

final_idstrings = [getIdListString(idlist) for idlist in final_idlist]

In [18]:
final_idstrings[0]

'59794725 59794738 59795246 59794719 59794724 59794836 59795421 59795108 59795573 60102469 60102606 59795787 59795790 59795816 59795843 59795461 60102414 60102761 60102932 60102934 60102966 60103414 60102567 60102595 60103010 60103019 60103147 60103225 60103740 60103760 60102742 60103246 60103247 60103288 60103475 60103998 60104049 60103531 60103536 60103617 60103656 60103687 60294046 60294152 60294211 60521978 60522038 60294335 60522130 60522189 60522238 60522368 60522703 60522739 60522107 60522443 60103938 60294397 60294408 60294434 60522578 60522651 60523184 60523253 60964576 60523104 60965008 60965056 60965115 60965393 60965485 60965497 60965311 60965322 60964967 60965814 60965994 60966051 60966056 60966462 60966524 60966900 60966592 60967221 60966976 60967157 60966807 60967407 60967408 60967351 60967982 60968004 60968036 60968066 60968145 60967831 60967894 60967939 61473223 60967524 61473127 61473616 61473656 61473451 61474168 61474517 61474076 61474932 61474947 61474948 61475253 

In [19]:
#now we have final_taglist, final_idlist, and final_idstrings
#let's convert back to pyspark formatting
final_list = list(zip(final_taglist, final_idstrings))
spark = SparkSession.builder.appName('SparkFinalProject').getOrCreate()
columns = ["tag", "ids"]
rdd = spark.sparkContext.parallelize(final_list)
df_final = spark.createDataFrame(rdd).toDF(*columns)

In [20]:
df_final.show()

+-------------------+--------------------+
|                tag|                 ids|
+-------------------+--------------------+
|                  r|59794725 59794738...|
|           function|59794725 59794799...|
|           for-loop|59794725 60102741...|
|              apply|59794725 60103247...|
|             python|59794727 59794728...|
|          variables|59794727 60102907...|
|         while-loop|59794727 59795181...|
|        environment|  59794728 60965534 |
|        jupyter-lab|59794728 65440082...|
|       linux-kernel|59794729 59795028...|
|linux-device-driver|59794729 62591116...|
|     embedded-linux|59794729 62190967...|
|               ruby|59794730 60102818...|
|              logic|59794730 59794582...|
|              macos|59794731 59794559...|
|          directory|59794731 61475579...|
|           fsevents|           59794731 |
|            ansible|59794732 60102519...|
|        ansible-awx|  59794732 59579657 |
|                 c#|59794737 59794755...|
+----------

In [21]:
type(df_final)

pyspark.sql.dataframe.DataFrame

And finally, exporting the inverted index to CSV!

In [24]:
df_final.toPandas().to_csv('inverted_index_final.csv')
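
For comparison, the same two-column layout can be sketched with only the standard csv module. This is an illustration under assumptions, not the export used above: write_index_csv, sample_index, and buffer are hypothetical names, and the output goes to an in-memory buffer rather than to inverted_index_final.csv. Note that pandas to_csv also writes the row index as an extra first column unless index=False is passed.

```python
import csv
import io

def write_index_csv(index, out):
    """Write {tag: [ids]} as rows of (tag, space-separated ids)."""
    writer = csv.writer(out)
    writer.writerow(["tag", "ids"])
    for tag, post_ids in index.items():
        writer.writerow([tag, " ".join(str(i) for i in post_ids)])

# Tiny sample index; the IDs are taken from the first rows of the data
sample_index = {"python": [59794727, 59794728], "r": [59794725]}

buffer = io.StringIO()
write_index_csv(sample_index, buffer)
print(buffer.getvalue())
```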

Thank you. This final project was both difficult and rewarding, as I learned the value of an inverted index and what goes into creating one. Overall, it would have been better to build the inverted index purely with PySpark operations, but the freedom of the project was both a benefit and a downside! At this scale, I found my method just as good as any other.
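
As a rough sketch of what a purely PySpark version might look like, the index can be expressed with the RDD operations flatMap, groupByKey, and mapValues. This was not run as part of the project; parse_tags, tag_id_pairs, and build_inverted_index are hypothetical names, and the code assumes each line has the quoted id,tags layout shown in the take(5) output.

```python
def parse_tags(raw):
    """Turn '"<python><variables>"' into ['python', 'variables']; '""' gives []."""
    inner = raw.strip('"').strip('<>')
    return inner.split('><') if inner else []

def tag_id_pairs(line):
    """Map one CSV line to a list of (tag, id) pairs, one per tag."""
    post_id, raw_tags = line.split(',', 1)
    return [(tag, int(post_id.strip('"'))) for tag in parse_tags(raw_tags)]

def build_inverted_index(lines_rdd):
    """lines_rdd: an RDD of CSV lines, such as the one returned by sc.textFile."""
    header = lines_rdd.first()
    return (lines_rdd
            .filter(lambda line: line != header)  # drop the 'id,tags' header row
            .flatMap(tag_id_pairs)                # one (tag, id) record per tag
            .groupByKey()                         # gather all IDs for each tag
            .mapValues(sorted))                   # (tag, sorted list of IDs)

# The helpers can be checked without Spark:
print(tag_id_pairs('"59794727","<python><variables><while-loop>"'))
```

Calling build_inverted_index(text_file).take(5) would then trigger the computation on the cluster instead of in driver-side Python loops.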

Please let me know if you have any questions, thanks again!

All the best,
Jeremy