## SparkNLP

I've run into trouble bringing Python libraries such as nltk and Language_tool into Spark jobs. One way around this could be to use SparkNLP.

```
# Install first from a shell, not from Python: py -m pip install spark-nlp

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .master("local[*]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0") \
    .getOrCreate()

```
^^ The config above requests 4 GB each for the driver and the executor, but we can experiment with lower memory.

## Spelling mistakes

Below is an attempt to gather spelling mistakes, since that was the job that uses Language_tool and was giving me problems.

In [None]:
from pyspark.ml import Pipeline
from sparknlp.annotator import NorvigSweetingApproach, Tokenizer
from sparknlp.base import DocumentAssembler, Finisher

# Create a sample dataframe (each row must be a one-element tuple, hence the trailing commas)
data = [("This is a sampl text with spelling mistaks.",), ("It is hard to spel evrythng correctly.",), ("Let's see if SparkNLP can help us find the errors.",)]
df = spark.createDataFrame(data, ["text"])

# Define the pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Use SparkNLP's Tokenizer; pyspark.ml's RegexTokenizer does not produce SparkNLP annotations
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# The dictionary path must point to a word list that exists on your machine
spellChecker = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell_checked") \
    .setDictionary("src/main/resources/spell/NorvigSweeting/big.txt")

finisher = Finisher() \
    .setInputCols(["spell_checked"]) \
    .setOutputCols(["spell_checked"]) \
    .setOutputAsArray(True)

pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    spellChecker,
    finisher
])

# Fit the pipeline on the dataframe
model = pipeline.fit(df)

# Apply the pipeline on the dataframe
result = model.transform(df)

# View the result
result.show(truncate=False)
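
The pipeline above returns only the corrected tokens, so gathering the actual mistakes means comparing them against the original tokens. Here is a minimal sketch of that comparison in plain Python, assuming both token lists have been collected for a row (for instance by also passing `token` through the `Finisher`). `find_corrections` is a hypothetical helper, not part of SparkNLP, and the sample lists are hand-written rather than real output:

```python
def find_corrections(tokens, corrected):
    """Return (original, suggestion) pairs where the spell checker changed a token.

    Assumes the two lists are aligned one-to-one, i.e. the spell checker
    emitted exactly one output token per input token.
    """
    if len(tokens) != len(corrected):
        raise ValueError("token lists must have the same length")
    return [(orig, fixed) for orig, fixed in zip(tokens, corrected) if orig != fixed]


# Hand-written example data, not real SparkNLP output
tokens = ["This", "is", "a", "sampl", "text"]
corrected = ["This", "is", "a", "sample", "text"]
print(find_corrections(tokens, corrected))  # [('sampl', 'sample')]
```

Counting the pairs per row would then give a simple "number of spelling mistakes" metric for the report.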

## Flesch Readability
The code below also tackles the Flesch reading score, which has been added to the report job, though I don't think it works correctly as of now.

In [None]:
from pyspark.ml import Pipeline
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from sparknlp.annotator import *
from sparknlp.base import *

# Create a sample dataframe
data = [("This is a sample sentence.",), ("It contains some words and punctuation.",), ("Let's see what the Flesch score is.",), ("It should be around 70-80 for normal text.",)]
df = spark.createDataFrame(data, ["text"])

# Define the pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

# LemmatizerModel takes only the token column as input
lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

finisher = Finisher() \
    .setInputCols(["lemma"]) \
    .setOutputCols(["lemma"]) \
    .setOutputAsArray(True)

# Define a UDF to calculate the Flesch reading score.
# SyllableCountingModel does not appear to exist in SparkNLP, and a Spark session
# cannot be used inside a UDF, so syllables are estimated with a rough
# vowel-group heuristic instead.
import re

def count_syllables(word):
    # Each run of consecutive vowels counts as one syllable; at least one per word
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_score(text):
    words = re.findall(r"[A-Za-z']+", text or "")
    if not words:
        return None
    # Treat runs of ., ! or ? as sentence ends; assume at least one sentence
    num_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    num_syllables = sum(count_syllables(word) for word in words)
    return 206.835 - 1.015 * (len(words) / num_sentences) - 84.6 * (num_syllables / len(words))

flesch_score_udf = udf(flesch_score, FloatType())

# Create the pipeline
pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    posTagger,
    lemmatizer,
    finisher
])

# Fit the pipeline on the dataframe
model = pipeline.fit(df)

# Apply the pipeline on the dataframe
result = model.transform(df)

# Add the Flesch score column, computed from the raw text rather than the lemma array
result = result.withColumn("flesch_score", flesch_score_udf(result["text"]))

# View the result
result.show(truncate=False)
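
As a sanity check on the formula used in the cell above, here is a small worked example in plain Python that computes the score from raw counts. `flesch_reading_ease` is a hypothetical helper name, and the syllable count for the sample sentence was done by hand, so treat this as a sketch for checking the arithmetic rather than a reference implementation:

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Flesch reading ease from raw counts (higher means easier to read)."""
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# "This is a sample sentence." -> 5 words, 1 sentence, 7 syllables (counted by hand)
score = flesch_reading_ease(num_words=5, num_sentences=1, num_syllables=7)
print(round(score, 2))  # 83.32
```

Comparing a hand-computed value like this against the `flesch_score` column for the same sentence is a quick way to see whether the job's numbers are plausible.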

Other annotators worth checking out that may be useful:

 - DependencyParserModel - analyzes the grammatical structure of a sentence
 - ChunkerModel - groups related words in a sentence
 - SentenceDetectorDLModel - detects sentence boundaries to split text into individual sentences, e.g. to prepare data for named entity recognition and sentiment analysis
 - NERDLModel - identifies named entities
 - SentimentDLModel - classifies the sentiment of a sentence as positive, negative or neutral
 - MultiClassifierDLModel - multi-label classification: assigns one or more categories (topics, genres, or other label sets) to a piece of text

### The MultiClassifierDLModel

Running a few example titles through it:
```
texts = [
    'Asynchronous Web Scraping With Python AIOHTTP',
    'Automating Excel with Python Video Overview - Mouse Vs Python',
    'How to Monitor Python Functions on AWS Lambda with Sentry'
]
```

has produced the following:

```
+---+--------------------------------------------------------------------------------------+
|id |result                                                                                |
+---+--------------------------------------------------------------------------------------+
|0  |[[chunk, 0, 5, PRODUCT], [chunk, 7, 21, TECHNOLOGY]]                                  |
|1  |[[chunk, 0, 8, PRODUCT], [chunk, 18, 25, TECHNOLOGY], [chunk, 35, 42, PERSON]]        |
|2  |[[chunk, 13, 18, PRODUCT], [chunk, 28, 32, TECHNOLOGY], [chunk, 42, 53, ORGANIZATION]]|
+---+--------------------------------------------------------------------------------------+

```

This could be useful.

```
Unfortunately, there is no pre-trained model in SparkNLP that can specifically analyze the reputation of a blog article. However, there are some NLP techniques that can be used to approach this problem.

For example, you could use a pre-trained named entity recognition (NER) model to identify entities such as organizations, people, and locations in the text. This could help you determine if the article is discussing reputable companies or individuals in the field of software engineering.

Additionally, you could use a pre-trained sentiment analysis model to determine the overall sentiment of the article, which could provide some insight into whether the article is positive or negative towards the topic.

Finally, you could also use pre-trained models for topic classification to identify the main topics discussed in the article. This could help you determine if the article is discussing relevant and current topics in the field of software engineering.

While these techniques do not directly address the reputation of the author or the website, they can still provide some useful information for evaluating the credibility of a blog article in the field of software engineering.
```
