This is an inverted index of PostID, Tags, Scores, and ClosedDates queried from StackExchange, run through PySpark, report written in Jupyter Notebooks via GCP Dataproc, VMInstances, and Google Storage Bucket 😁🍸.
- Update 3/3/2022 - I initially chose GCP GCloud but there were many bugs. I decided to shift my work over to AWS EMR.
- Update 3/4/2022 - Decided I did not want to deal with the cloud today. Switched to local via hortonworks and docker.
- Update 3/5/2022 - This wonderful youtuber set me on the right path with GCP
- - Watch this video by Codible for help Connecting to a Hadoop Cluster on Google Dataproc with Jupyter Notebook: Click Here
- - Watch this video by Codible for help Using PySpark on Dataproc Hadoop Cluster to process large CSV file: Click Here
First, I went to data.stackexchange.com and clicked Compose Query.
I typed in the following sql-like code into the textbox, ran and downloaded the query as a csv.
select Id, Tags, Score, ClosedDate
from Posts
where CreationDate like '%2021%';
After setting up this GitHub repository I committed QueryResults.csv here.
Then, I set up a Google Cloud Platform Dataproc with 1 Master Node and 2 Worker Nodes. I loaded it with an Ubuntu 1.5 image, alloting 50 GB to each node. I included with Jupyter and Anaconda for this project. I also created a new Google Storage bucket in order to save my jupyter-notebook and inverted indexes for downloading later.
After my VMInstances provisioned I opened the SSH to my Worker Node and the Terminal from JupyterLab. From the terminal I downloaded the csv to my VMInstance:
wget https://github.com/kscully-dotcom/StackExchangeInvertedIndex/raw/main/QueryResults.csv
and saved it to my Google Storage Bucket:
gsutil cp QueryResults.csv gs://usr-1-google-storage-bucket-path/data
Then I started a new jupyter-lab project and used the PySpark terminal pre-built into my VMInstance. Here are some useful links that helped me throughout this project:
- PySpark Spark SQL Documentation
- PySpark Spark Core Documentation
- PySpark RDD Documentation
- PySpark reduceByKey Documentation
- PySpark withColumn Documentation
- Pyspark regexp_replace Documentation
- PySpark map Documentation
- StackOverflow Explanation of PySpark Reduce<
- MingChen0919 Provides Useful Tips on PySpark SparkContext and Shifting between RDDs and Dataframes
- Hadoop Output Formats
- Saving Pyspark RDD as a CSV File
This is not always recommended if the dataset is very large and will put too much weight on your CPU. In my case the InvertedIndex CSV files were not astronomically large. I would still recommend parsing the information needed from your output file before saving a TextFile to your computer system. Here is an easy one-liner to save your RDD as a CSV.
This exercise was a great way to introduce myself (a new data scientist) to the Data Architecture of large amounts of scalable data and how to parse through and extract the answers I -or a client- would need. I scraped web data in one very simple way via StackExhange. A similar approach would have been to use Google's BigQuery. For other examples I may use an API or the web's html. I also used Google Cloud Platform's Dataproc and Google Storage Bucket to organize this project. Another approach would have been to use AWS EMR or Locally hosting using a Virtual Machine. I decided to use Google Cloud Platform because I had already been familiar with it's tools and User Interface. AWS EMR seemed very clear and straightforward but needed more setup. I do recommend using a cloud service because of the capacity and demand of the data transactions on a local CPU. For example, I tried to set up a local Virtual Machine to runPyspark and the Virtual Machine alone was recommended to allot 10GB of my disk space to the virtual machine. This was not a safe amount of storage to take away from my CPU and 8GB was enough to shut down my Virtual Machine after installing Anaconda and Hadoop. Keep in mind I have 16GB of RAM (not a lot but will be upgrading soon) and 23GB of Virtual Memory on a custom built PC with no pre-installed bloatware.
The purpose of this exercise was to get familiar with the Apache Hadoop and Spark Data Engineering Tools. One of the first exercises many Data Scientists will do with MapReduce is a simple WordCount exercise. This InvertedIndex exercise is an extension off of the MapReduce WordCount exercise.
We learned how to setup Hadoop Clusters and run MapReduce transactions on RDD's with PySpark. We also learned how to organize a datalake. We practiced webscraping. We learned the expense on a local computer's memory of an organized datalake and running transactions and why it is important to utilize a portion of the computer's disk space to run these programs.
The code can be cleaner, and with more time it will be. We can use our InvertedIndexes to plot data visualizations and answer data questions. The Hadoop cluster organization can also be improved.
When we work for a large corporation, extract large amounts of data, have constantly scaling and updating data (i.e., weather data, financial transactions, energy data, etc.) we won't have the human or machine capacity to hold and organize it all. That's why we need ways to reduce and minimize the data across nodes, develop data lakes, and have them work efficiently enough that we can query that data efficiently and without expensive transactions to the system. Having development tools like Spark, Docker, Hadoop, Cassandra, etc. allows the data scientists, engineers, and business executives the ability to run programs in the background that organizes the data, collects the data for analysis, and to quickly query data to answer impactful business questions for clients and stakeholders. These tools allow an inexpensive and efficient solution for all of these necessary-evil problems.