## Natural Language Processing Demo

---

### Importing Modules & Starting Spark Session

---

In [None]:
!pip install spark-nlp==4.1.0

In [1]:
import pandas as pd
import sparknlp
from pyspark.sql import SparkSession
from importlib.metadata import version  # standard library since Python 3.8

In [3]:
# Note: this pins the exact Spark NLP artifact that Spark downloads from Maven:
# https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp_2.12/4.1.0

spark = SparkSession \
    .builder \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0") \
    .getOrCreate()

22/09/25 14:15:57 WARN Utils: Your hostname, rambino-AERO-15-XD resolves to a loopback address: 127.0.1.1; using 192.168.0.234 instead (on interface wlp48s0)
22/09/25 14:15:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/rambino/.local/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/rambino/.ivy2/cache
The jars for the packages stored in: /home/rambino/.ivy2/jars
com.johnsnowlabs.nlp#spark-nlp_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9ec51d7b-92da-426d-888a-370b3051e7f2;1.0
	confs: [default]
	found com.johnsnowlabs.nlp#spark-nlp_2.12;4.1.0 in central
	found com.typesafe#config;1.4.2 in central
	found org.rocksdb#rocksdbjni;6.29.5 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.828 in central
	found com.github.universal-automata#liblevenshtein;3.0.0 in central
	found com.google.code.findbugs#annotations;3.0.1 in central
	found net.jcip#jcip-annotations;1.0 in central
	found com.google.code.findbugs#jsr305;3.0.1 in central
	found com.google.protobuf#protobuf-java-util;3.0.0-beta-3 in central
	found com.google.protobuf#protobuf-java;3.0.0-beta-3 in central
	found com.google.code.gson#gson;2.3 in central
	found it.unimi.dsi#fastutil;7.0.12 in central
	found org.projectlombok#l

22/09/25 14:15:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


---

### Loading in local data

---

In [5]:
df = spark.read.json('./reddit-worldnews.json')
df.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- approved_at_utc: string (nullable = true)
 |    |-- approved_by: string (nullable = true)
 |    |-- archived: boolean (nullable = true)
 |    |-- author: string (nullable = true)
 |    |-- author_flair_background_color: string (nullable = true)
 |    |-- author_flair_css_class: string (nullable = true)
 |    |-- author_flair_richtext: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- author_flair_template_id: string (nullable = true)
 |    |-- author_flair_text: string (nullable = true)
 |    |-- author_flair_text_color: string (nullable = true)
 |    |-- author_flair_type: string (nullable = true)
 |    |-- author_fullname: string (nullable = true)
 |    |-- author_patreon_flair: boolean (nullable = true)
 |    |-- banned_at_utc: string (nullable = true)
 |    |-- banned_by: string (nullable = true)
 |    |-- can_gild: boolean (nullable = true)
 |    |-- can_mod_post: boolean (nullable = true)
 |  

In [9]:
# Let's say we want only 'title' and 'author' from this data set.
# Widen the pandas column display so that long titles are not truncated:
pd.set_option('display.max_colwidth', 800)

df.select('data.title','data.author') \
    .limit(5) \
    .toPandas()

| | title | author |
|---|---|---|
| 0 | Microsoft Corp said it has discovered hacking targeting democratic institutions, think tanks, and non-profit organizations in Europe. | jaykirsch |
| 1 | Deutsche Bank reportedly planned to extend the dates of $340 million in loans to Trump Organization to avoid a potential nightmare of chasing a sitting president for cash | canuck_burger |
| 2 | Iranian "morality police" were forced to fire warning shots when a crowd intervened to prevent them from arresting two women for not wearing a hijab. The incident occurred in Tehran's northeastern Narmak neighbourhood on Friday night, and ended with a mob tearing the door off a police vehicle. | honolulu_oahu_mod |
| 3 | Trump administration 'pushing Saudi nuclear deal' which could benefit company linked to Jared Kushner - Senior Trump administration officials pushed a project to share nuclear power technology with Saudi Arabia over the objections of ethics officials, according to a congressional report | madam1 |
| 4 | NASA Happily Reports the Earth is Greener, With More Trees Than 20 Years Ago–and It's Thanks to China, India | purplexxx |


---

### Getting wordcounts (naive approach to NLP)

This approach splits each title on whitespace and counts how often each resulting word (1-gram) occurs.

---

In [15]:
import pyspark.sql.functions as F

df_wordcount = df \
    .select(
        F.explode(F.split("data.title","\\s+")).alias("word")
    ) \
    .groupBy("word") \
    .count() \
    .orderBy(F.desc("count"))

df_wordcount \
    .limit(10) \
    .toPandas()

# This works, but as we can see, the top results are filler words that aren't very informative. Let's refine.

| | word | count |
|---|---|---|
| 0 | to | 58 |
| 1 | the | 46 |
| 2 | of | 42 |
| 3 | in | 41 |
| 4 | a | 25 |
| 5 | for | 20 |
| 6 | and | 19 |
| 7 | from | 12 |
| 8 | on | 11 |
| 9 | with | 10 |
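
One simple refinement is to drop filler words before counting. The sketch below shows the idea in plain Python rather than Spark so that it runs without a cluster. The `STOP_WORDS` set is a hypothetical, deliberately tiny list seeded with the ten words from the output above, and `count_informative_words` is an illustrative helper, not something from the notebook or from Spark NLP:

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only: the ten filler words
# from the word-count output above. A real list would be far longer.
STOP_WORDS = {"a", "and", "for", "from", "in", "of", "on", "the", "to", "with"}


def count_informative_words(titles):
    """Count lower-cased words across titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        # Same whitespace split as the Spark query above.
        for raw in re.split(r"\s+", title.strip()):
            # Trim surrounding punctuation and normalise case before counting.
            word = raw.strip(".,;:!?\"'()").lower()
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return counts


sample_titles = [
    "NASA reports the Earth is greener",
    "The Earth, seen from orbit",
]
print(count_informative_words(sample_titles).most_common(3))
```

In Spark itself, the same filtering could likely be expressed with a `filter` on the exploded `word` column, or with a dedicated stop-word transformer.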


---

### Using pre-trained ML model to analyse data

**NOTE:** the Udacity content fails at this point in the course:
1. The spark-nlp package used by Udacity is version 1.7.3, far older than the current version (4.1.0 as of Sept 2022). The 'BasicPipeline' model imported below no longer appears to exist in the newer version.
2. In its place, John Snow Labs offers a wide variety of pre-trained models that can be loaded into your code: https://nlp.johnsnowlabs.com/models?edition=Spark+NLP+4.1&language=en&sort=downloads&type=model

If I want to use Spark NLP in the future, I should consult the resources above.
However, since NLP is not strictly necessary for a data engineer, I will skip this for now.

---

In [16]:
from com.johnsnowlabs.nlp.pretrained.pipeline.en import BasicPipeline as b

ModuleNotFoundError: No module named 'com.johnsnowlabs.nlp.pretrained.pipeline'
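
For later reference, a rough sketch of what loading a pre-trained pipeline might look like in Spark NLP 4.x, using the `PretrainedPipeline` class from the `sparknlp.pretrained` Python module. It has not been run as part of this notebook. The pipeline name `explain_document_ml` is an assumption and should be checked against the models page linked above, and the wrapper function `annotate_with_pretrained` is hypothetical:

```python
def annotate_with_pretrained(text, pipeline_name="explain_document_ml", lang="en"):
    """Annotate one string with a pre-trained Spark NLP pipeline.

    Imports are deferred so that defining this function does not require
    Spark NLP to be installed.
    """
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    # Create (or reuse) a Spark session with the Spark NLP jar attached.
    sparknlp.start()

    # Downloads the named pipeline on first use; the name is an assumption.
    pipeline = PretrainedPipeline(pipeline_name, lang=lang)

    # annotate() returns a dict keyed by the pipeline's output columns.
    return pipeline.annotate(text)
```

Calling `annotate_with_pretrained("NASA reports the Earth is greener")` would need network access to download the model the first time.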