# NLP in PySpark's MLlib

Natural Language Processing (NLP) is a popular area of data science today, handy for tasks like **chat bots**, movie or **product review analysis** and especially **tweet classification**. In this notebook, we will cover the **classification aspect of NLP** and go over the features that Spark has for cleaning and preparing your data for analysis. We will also touch on how to use **ML Pipelines** for a few of our data processing steps to streamline our code (and possibly make it run a bit faster).

As we learned in the NLP concept review lectures, the text you process must first be cleaned, tokenized and vectorized. Essentially, we need to convert our text into a vector of numbers. But how do we do that? Spark has a variety of built-in functions that make all of these tasks straightforward. We will cover all of them here!
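To make "a vector of numbers" concrete before bringing in Spark, here is a minimal plain-Python sketch of the clean, tokenize, vectorize sequence on two made-up sentences. The sentences, the `clean_and_tokenize` helper and the sorted vocabulary order are illustrative assumptions, not part of the Kickstarter data or of Spark's API.

```python
import re
from collections import Counter

# Two made-up documents (hypothetical, for illustration only)
docs = ["A new board game!", "New game, new rules."]

def clean_and_tokenize(text):
    """Lower-case, keep letters and spaces only, then split on whitespace."""
    text = re.sub(r"[^a-z ]+", "", text.lower())
    return text.split()

tokens = [clean_and_tokenize(d) for d in docs]

# Vocabulary: every distinct word, sorted so the column order is reproducible
vocab = sorted({word for doc in tokens for word in doc})

# Count vector: one slot per vocabulary word, holding that word's count in the document
vectors = [[Counter(doc)[word] for word in vocab] for doc in tokens]

print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each document becomes a list of five counts, one per vocabulary word. The Spark steps in the rest of this notebook do the same kind of thing at scale.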

### Agenda

    1. Review Data (quality check)
    2. Clean up the data (remove punctuation, special characters, etc.)
    3. Tokenize text data
    4. Remove Stopwords
    5. Zero index our label column
    6. Create an ML Pipeline (to streamline steps 3-5)
    7. Vectorize Text column
         - Count Vectors
         - TF-IDF: Term Frequency-Inverse Document Frequency, a statistical measure of how important a word is to a document in a collection (or corpus) of documents
         - Word2Vec
    8. Train and Evaluate Model (classification)
    9. View Predictions

In [4]:
# First let's create our PySpark instance
# import findspark
# findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
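# May take a while locally
spark = SparkSession.builder.appName("NLP").getOrCreate()

# Note: this counts the entries in the executor memory status map
# (executors/block managers, including the driver), not CPU cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")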
spark


You are working with 1 core(s)


**Import Dependencies**

In [1]:
from pyspark.ml.feature import * #CountVectorizer,StringIndexer, RegexTokenizer,StopWordsRemover
from pyspark.sql.functions import * #col, udf,regexp_replace,isnull
from pyspark.sql.types import * #StringType,IntegerType
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# For pipeline development
from pyspark.ml import Pipeline 

## Read in Dataset

#### Kickstarter Dataset

##### What is Kickstarter?
"Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform, focused on creativity and merchandising. The company's stated mission is to "help bring creative projects to life". Kickstarter, has reportedly received more than $1.9 billion in pledges from 9.4 million backers to fund 257,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.

People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges. This model traces its roots to subscription model of arts patronage, where artists would go directly to their audiences to fund their work" ~ Wikipedia

So, what if you could predict whether a project will manage to get funded by its backers?

#### Content

The dataset contains the blurbs (short descriptions) of 215,513 projects run during 2017, all written in English and each labeled "successful" or "failed" according to whether or not it got the money. From those texts you can train language models for the descriptions, and even embeddings specific to this domain.

**Source:** https://www.kaggle.com/oscarvilla/kickstarter-nlp

In [11]:
# pip install pyarrow

In [5]:
path ="../Datasets/"
# import pyspark.pandas as ps

# CSV
# df = ps.read.csv(path+'kickstarter.csv',inferSchema=True,header=True, index_col=0)
df = spark.read.csv(path+'kickstarter.csv',inferSchema=True,header=True)

                                                                                

In [7]:
#df.limit(4).toPandas()
df.show(truncate=False)

+---+-------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+
|_c0|blurb                                                                                                                                |state                               |
+---+-------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+
|1  |Using their own character, users go on educational quests around a virtual world leveling up subject-oriented skills (ie Physics).   |failed                              |
|2  |MicroFly is a quadcopter packed with WiFi, 6 sensors, and 3 processors for ultimate stability -- and fits in the palm of your hand.  |successful                          |
|3  |A small indie press, run as a collective for authors who want to self-publish, and a sexy, smart , hilarious n

25/11/25 08:45:50 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , blurb, state
 Schema: _c0, blurb, state
Expected: _c0 but found: 
CSV file: file:///home/redsteam/Desktop/bear/ML/big-data/codes/Machine-Learning/Datasets/kickstarter.csv


In [6]:
# Let's read a few full blurbs
df.show(4, truncate=False)

+---+-----------------------------------------------------------------------------------------------------------------------------------+----------+
|_c0|blurb                                                                                                                              |state     |
+---+-----------------------------------------------------------------------------------------------------------------------------------+----------+
|1  |Using their own character, users go on educational quests around a virtual world leveling up subject-oriented skills (ie Physics). |failed    |
|2  |MicroFly is a quadcopter packed with WiFi, 6 sensors, and 3 processors for ultimate stability -- and fits in the palm of your hand.|successful|
|3  |A small indie press, run as a collective for authors who want to self-publish, and a sexy, smart , hilarious novel!                |failed    |
|4  |Zylor is a new baby cosplayer! Back this kickstarter to help fund new cosplay photoshoots to share hi



We can see from the output above that the blurb text contains a good bit of punctuation and special characters. We'll need to clean that up. 

In [8]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- blurb: string (nullable = true)
 |-- state: string (nullable = true)



**How many rows are in the df?**

In [9]:
df.count()

                                                                                

223627

## How many null values do we have?

Let's use our handy dandy function!

In [10]:
from pyspark.sql.functions import *

def null_value_calc(df):
    null_columns_counts = []
    numRows = df.count()
    for k in df.columns:
        nullRows = df.where(col(k).isNull()).count()
        if(nullRows > 0):
            temp = k,nullRows,(nullRows/numRows)*100
            null_columns_counts.append(temp)
    return(null_columns_counts)

null_columns_calc_list = null_value_calc(df)
spark.createDataFrame(null_columns_calc_list, ['Column_Name', 'Null_Values_Count','Null_Value_Percent']).show()

                                                                                

+-----------+-----------------+------------------+
|Column_Name|Null_Values_Count|Null_Value_Percent|
+-----------+-----------------+------------------+
|      blurb|             1488|0.6653937136392296|
|      state|            13157| 5.883457722010312|
+-----------+-----------------+------------------+



Not too bad! Less than 1% for the blurb column and about 6% for the state column. Unfortunately though, we need each row to contain a value in both of these columns to conduct our analysis, so let's see how many rows that actually affects.

In [13]:
# Of course you will want to know how many rows would be affected before you actually drop them
og_len = df.count()
drop_len = df.na.drop().count()
print("Total Rows that contain at least one null value:",og_len-drop_len)
print("Percentage of Rows that contain at least one null value:", (og_len-drop_len)/og_len*100)



Total Rows that contain at least one null value: 0
Percentage of Rows that contain at least one null value: 0.0


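So dropping all rows that have at least one null value would impact just under 6% of our dataframe (223,627 rows before versus 210,470 after, a difference of 13,157). The zeros printed above are most likely a leftover from running this cell after the drop in the next cell had already been applied (note the execution counts, In [13] versus In [12]). I can live with 6%, so I'll go ahead and drop them.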

In [12]:
# Drop the null values
# It's only about 6% so that's okay
df = df.dropna()

In [14]:
# New df row count
df.count()



210470

If you want to make this notebook run faster, you can slice the df like this...

In [18]:
# Slice the dataframe down to make this notebook run faster
df = df.limit(400)
print('Sliced row count:',df.count())

Sliced row count: 400




## Quality Assurance Check (QA)

Let's make sure our dependent variable column is clean before we go any further. This is an important step in our analysis.

In [17]:
# Quick data quality check on the state column....
# This is going to be our category column so it's important
df.groupBy("state").count().orderBy(col("count").desc()).show(20,truncate=False)



+---------------------------------------------------------------------------+-----+
|state                                                                      |count|
+---------------------------------------------------------------------------+-----+
|failed                                                                     |240  |
|successful                                                                 |156  |
| best experienced with Oculus Rift."                                       |1    |
| mixing pixel art graphics with deep storytelling and action combat system"|1    |
| Adventure with RPG elements."                                             |1    |
| smart                                                                     |1    |
+---------------------------------------------------------------------------+-----+



We can see from the query above that we have some invalid data in the label (state) column. Let's delete those.
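The stray values look like the tails of blurbs: they start with a space and end with a closing quote. That suggests, though nothing in this notebook confirms it, that some quoted blurbs were split at an embedded comma when the CSV was parsed. The sketch below reproduces that effect on a made-up row using Python's `csv` module rather than Spark's reader; the row itself is hypothetical.

```python
import csv
import io

# A made-up row in the same id, blurb, state layout (hypothetical example)
raw_line = '1,"A VR game, best experienced with Oculus Rift.",failed'

# Naive split: the comma inside the quoted blurb is treated as a column separator
naive_fields = raw_line.split(",")

# Quote-aware parse: the quoted blurb stays in one piece
parsed_fields = next(csv.reader(io.StringIO(raw_line)))

print(repr(naive_fields[2]))   # what would end up in the third column
print(repr(parsed_fields[2]))  # the real label
```

If this is indeed the cause, the CSV reader options that control quoting and multi-line fields (`quote`, `escape`, `multiLine`) would be the place to look. Filtering on the two valid labels, as done next, is the simpler fix for this notebook.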

In [19]:
df = df.filter("state IN('successful','failed')")
# Make sure it worked
df.groupBy("state").count().orderBy(col("count").desc()).show(truncate=False)



+----------+-----+
|state     |count|
+----------+-----+
|failed    |240  |
|successful|156  |
+----------+-----+



In [21]:
# Let's check the quality of the blurbs
df.select("blurb").show(10,False)

+-----------------------------------------------------------------------------------------------------------------------------------+
|blurb                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------+
|Using their own character, users go on educational quests around a virtual world leveling up subject-oriented skills (ie Physics). |
|MicroFly is a quadcopter packed with WiFi, 6 sensors, and 3 processors for ultimate stability -- and fits in the palm of your hand.|
|A small indie press, run as a collective for authors who want to self-publish, and a sexy, smart , hilarious novel!                |
|Zylor is a new baby cosplayer! Back this kickstarter to help fund new cosplay photoshoots to share his cuteness with the world!    |
|Hatoful Boyfriend meet Skeletons! A comedy Dating Sim that pu



We see some punctuation, mixed casing and some slashes, which might make parsing problematic. Let's clean this up a bit!

## Clean the blurb column

Keep in mind that you can/should do all of this in one call...
But we will show each individually for the purpose of learning.
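As a way to check the patterns on a single string before running them on the whole dataframe, here is a plain-Python sketch that chains the same four steps the cells below apply one at a time. The `clean_blurb` helper and most of the sample sentence are made up for illustration; only the "(Legend of Zelda/Fable Inspired)" fragment comes from the data.

```python
import re

def clean_blurb(text):
    """Apply the notebook's cleaning steps to one string, in the same order."""
    # 1. Slashes and parentheses become spaces
    text = text.translate(str.maketrans("/()", "   "))
    # 2. Drop everything that is not an ASCII letter or a space
    text = re.sub(r"[^A-Za-z ]+", "", text)
    # 3. Collapse runs of spaces into a single space
    text = re.sub(r" +", " ", text)
    # 4. Lower-case
    return text.lower()

sample = "Action RPG (Legend of Zelda/Fable Inspired) -- 3 worlds!"
print(clean_blurb(sample))
```

Note that step 2 also removes hyphens and digits, so "self-publish" becomes "selfpublish" and "6 sensors" loses its number. Whether that is acceptable depends on the analysis.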

In [20]:
# Replace slashes and parentheses with spaces
# You can test your script on line 7 of the df "(Legend of Zelda/Fable Inspired)"
df = df.withColumn("blurb",translate(col("blurb"), "/", " ")) \
        .withColumn("blurb",translate(col("blurb"), "(", " ")) \
        .withColumn("blurb",translate(col("blurb"), ")", " "))
df.select("blurb").show(7,False)



+-----------------------------------------------------------------------------------------------------------------------------------+
|blurb                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------+
|Using their own character, users go on educational quests around a virtual world leveling up subject-oriented skills  ie Physics . |
|MicroFly is a quadcopter packed with WiFi, 6 sensors, and 3 processors for ultimate stability -- and fits in the palm of your hand.|
|A small indie press, run as a collective for authors who want to self-publish, and a sexy, smart , hilarious novel!                |
|Zylor is a new baby cosplayer! Back this kickstarter to help fund new cosplay photoshoots to share his cuteness with the world!    |
|Hatoful Boyfriend meet Skeletons! A comedy Dating Sim that pu

In [23]:
# Remove anything that is not a letter or a space
df = df.withColumn("blurb",regexp_replace(col('blurb'), '[^A-Za-z ]+', ''))
df.select("blurb").show(10,False)

+-------------------------------------------------------------------------------------------------------------------------------+
|blurb                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------+
|Using their own character users go on educational quests around a virtual world leveling up subjectoriented skills  ie Physics |
|MicroFly is a quadcopter packed with WiFi  sensors and  processors for ultimate stability  and fits in the palm of your hand   |
|A small indie press run as a collective for authors who want to selfpublish and a sexy smart  hilarious novel                  |
|Zylor is a new baby cosplayer Back this kickstarter to help fund new cosplay photoshoots to share his cuteness with the world  |
|Hatoful Boyfriend meet Skeletons A comedy Dating Sim that puts you into a high school ful



In [21]:
# Remove multiple spaces
df = df.withColumn("blurb",regexp_replace(col('blurb'), ' +', ' '))
df.select("blurb").show(4,False)



+-----------------------------------------------------------------------------------------------------------------------------------+
|blurb                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------+
|Using their own character, users go on educational quests around a virtual world leveling up subject-oriented skills ie Physics .  |
|MicroFly is a quadcopter packed with WiFi, 6 sensors, and 3 processors for ultimate stability -- and fits in the palm of your hand.|
|A small indie press, run as a collective for authors who want to self-publish, and a sexy, smart , hilarious novel!                |
|Zylor is a new baby cosplayer! Back this kickstarter to help fund new cosplay photoshoots to share his cuteness with the world!    |
+-------------------------------------------------------------

In [22]:
# Lower case everything
df = df.withColumn("blurb",lower(col('blurb')))
df.select("blurb").show(4,False)


+-----------------------------------------------------------------------------------------------------------------------------------+
|blurb                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------+
|using their own character, users go on educational quests around a virtual world leveling up subject-oriented skills ie physics .  |
|microfly is a quadcopter packed with wifi, 6 sensors, and 3 processors for ultimate stability -- and fits in the palm of your hand.|
|a small indie press, run as a collective for authors who want to self-publish, and a sexy, smart , hilarious novel!                |
|zylor is a new baby cosplayer! back this kickstarter to help fund new cosplay photoshoots to share his cuteness with the world!    |
+-------------------------------------------------------------

                                                                                

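Take a pause here and go look at your Spark UI. You'll notice that only the `.show()` calls (listed as "showString" jobs), as opposed to each of the data manipulation calls, are creating jobs. This is because of Spark's lazy evaluation: transformations are only recorded, and nothing is computed until an action asks for a result.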

In [23]:
spark

So when you want to speed up your notebook, those `.show()` calls are some of the ones you can take out.
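Lazy evaluation can be illustrated in plain Python with generators. This is only an analogy (Spark builds and optimizes a query plan, which generators do not), and the function names and sample rows are made up. The point is that chaining the steps does no work; consuming the result does.

```python
log = []

def lower_all(rows):
    for row in rows:
        log.append(f"lower({row!r})")  # records when the work actually happens
        yield row.lower()

def strip_digits(rows):
    for row in rows:
        log.append(f"strip_digits({row!r})")
        yield "".join(ch for ch in row if not ch.isdigit())

rows = ["MicroFly 6", "Zylor 3"]

# "Transformations": building the chain runs nothing yet
plan = strip_digits(lower_all(rows))
work_before_action = len(log)

# "Action": consuming the generator triggers all the queued work
result = list(plan)
work_after_action = len(log)

print(work_before_action, work_after_action, result)
```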

## Prep Data for NLP 

Alright, so here is where our analysis turns from basic text cleaning to actually turning our text into numbers (the backbone of NLP). These next several steps in our analysis are specific to NLP.

### Split text into words (Tokenizing)

What is Tokenization: <span style="color:blue;">the process of breaking down a stream of text into smaller, meaningful units called tokens</span>

You'll see a new column is added to our dataframe that we call "words". This column contains an array of strings as opposed to just a string (current data type of the blurb column).
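The splitting behaviour can be tried out on a single string with Python's `re` module. The sketch below assumes the tokenizer's defaults as I understand them: the pattern marks the gaps between tokens, text is lower-cased, and tokens shorter than one character are dropped. `regex_tokenize` is a hypothetical helper, not Spark code, and `re.ASCII` is passed so that `\W` matches the ASCII-only default of Java regular expressions, which Spark uses.

```python
import re

def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Split on every match of `pattern`, lower-case, and drop too-short tokens."""
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    return [p for p in pieces if len(p) >= min_token_length]

print(regex_tokenize("zylor is a new baby cosplayer! back this kickstarter"))
```

Because `\W` matches one non-word character at a time, "cosplayer! back" produces an empty piece between "!" and the space. The length filter is what removes it.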

In [24]:
regex_tokenizer = RegexTokenizer(inputCol="blurb", outputCol="words", pattern="\\W")
raw_words = regex_tokenizer.transform(df)
raw_words.show(2,False)
raw_words.printSchema()



+---+-----------------------------------------------------------------------------------------------------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|_c0|blurb                                                                                                                              |state     |words                                                                                                                                                |
+---+-----------------------------------------------------------------------------------------------------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |using their own character, users go on educational quests around a virtual world leveling up subje

### Removing Stopwords

List of stopwords: "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself

**Recall from the content review lecture**
Recall that "stopwords" are any words that we feel would "distract" our model from performing its best. This list can be customized, but for now, we will just use the default list.
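As a rough picture of what the remover does, here is a plain-Python sketch. It assumes case-insensitive matching (which I believe is the remover's default) and uses only the first ten default stop words printed in this notebook. The `remove_stop_words` helper, the token list and the extra word "new" are made up for illustration.

```python
# First ten entries of the default English list, as printed in this notebook
stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your"]

def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that are not in the stop word list."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]

tokens = ["we", "fund", "Our", "new", "cosplay", "photoshoots"]
print(remove_stop_words(tokens, stop_words))

# Customizing the list: add a domain-specific word (a hypothetical choice)
print(remove_stop_words(tokens, stop_words + ["new"]))
```

In Spark, a customized list would be supplied through the remover's `stopWords` parameter.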

In [25]:
# from pyspark.ml.feature import StopWordsRemover

# Define a list of stop words or use default list
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
stopwords = remover.getStopWords() 

# Display default list
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

In [27]:
words_df = remover.transform(raw_words)
words_df.show(1,False)

+---+---------------------------------------------------------------------------------------------------------------------------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
|_c0|blurb                                                                                                                            |state |words                                                                                                                                              |filtered                                                                                                                    |
+---+---------------------------------------------------------------------------------------------------------------------------------+------+--------------------------



### Now we need to encode state column to a column of indices

Remember that MLlib requires our dependent variable to be not only a numeric data type, but also zero indexed. We can use Spark's handy built-in StringIndexer to accomplish this, just like we did in the classification lectures.
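Which label gets index 0 is not arbitrary. To my understanding, StringIndexer by default orders labels by descending frequency, so the most common label becomes 0.0 (ties are broken alphabetically). The plain-Python sketch below mimics that ordering; `fit_label_index` is a hypothetical helper, and the class counts are the ones from the groupBy output earlier in this notebook.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each distinct label to an index: most frequent first, ties alphabetical."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Class counts taken from the sliced dataframe: 240 failed, 156 successful
states = ["failed"] * 240 + ["successful"] * 156
mapping = fit_label_index(states)
print(mapping)

indexed = [mapping[s] for s in ["failed", "successful", "failed"]]
print(indexed)
```

With these counts "failed" maps to 0.0 and "successful" to 1.0. On a different sample where successes outnumber failures, the mapping would flip, which is worth remembering when reading predictions later.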

In [28]:
indexer = StringIndexer(inputCol="state", outputCol="label")
feature_data = indexer.fit(words_df).transform(words_df)
feature_data.show(5)
feature_data.printSchema()



+---+--------------------+----------+--------------------+--------------------+-----+
|_c0|               blurb|     state|               words|            filtered|label|
+---+--------------------+----------+--------------------+--------------------+-----+
|  1|using their own c...|    failed|[using, their, ow...|[using, character...|  0.0|
|  2|microfly is a qua...|successful|[microfly, is, a,...|[microfly, quadco...|  1.0|
|  3|a small indie pre...|    failed|[a, small, indie,...|[small, indie, pr...|  0.0|
|  4|zylor is a new ba...|    failed|[zylor, is, a, ne...|[zylor, new, baby...|  0.0|
|  5|hatoful boyfriend...|    failed|[hatoful, boyfrie...|[hatoful, boyfrie...|  0.0|
+---+--------------------+----------+--------------------+--------------------+-----+
only showing top 5 rows
root
 |-- _c0: string (nullable = true)
 |-- blurb: string (nullable = true)
 |-- state: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |--

## Create an ML Pipeline

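We could also create an ML Pipeline to accomplish the previous three steps in a more streamlined fashion. A Pipeline chains a sequence of stages, each of which is either a transformer (applied with ".transform()") or an estimator (which must first be fitted with ".fit()"). Calling ".fit()" on the Pipeline fits every estimator stage in order and returns a fitted PipelineModel that you can then call ".transform()" on. Here the tokenizer and the stop word remover are transformers, and the StringIndexer is the only estimator.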
<br>

Notice in the script below that we reduced our .transform calls from 3 to 1. So the benefit here is not necessarily speed but slightly less, better organized code (always nice). This feature can be especially useful when you get to the point where you want to move your model into production. You can save this pipeline and call on it whenever you need to prep new text.
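The fit/transform contract can be sketched in a few lines of plain Python. This is a toy model of the idea, not Spark's implementation: all class names here are hypothetical, and rows are plain strings rather than dataframe columns.

```python
class Lowercase:
    """Transformer: needs no fitting, only transform()."""
    def transform(self, rows):
        return [r.lower() for r in rows]

class VocabIndexer:
    """Estimator: fit() learns a vocabulary and returns a fitted transformer."""
    def fit(self, rows):
        vocab = {w: i for i, w in enumerate(sorted(set(rows)))}
        return VocabIndexerModel(vocab)

class VocabIndexerModel:
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, rows):
        return [self.vocab[r] for r in rows]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> fitted model
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

model = ToyPipeline([Lowercase(), VocabIndexer()]).fit(["Failed", "successful", "FAILED"])
print(model.transform(["SUCCESSFUL", "failed"]))
```

The fitted model remembers what was learned during `fit`, so new text is prepared with exactly the same mapping. That is the property that makes a saved pipeline useful in production.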

In [None]:
# from pyspark.sql.functions import *
# from pyspark.ml.feature import StopWordsRemover

In [33]:
######################## BEFORE #############################
# Tokenize
regex_tokenizer = RegexTokenizer(inputCol="blurb", outputCol="words", pattern="\\W") # r"\W" is equivalent; a plain "\W" also works but triggers an invalid-escape warning in recent Python versions
raw_words = regex_tokenizer.transform(df)

# Remove Stop words
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
words_df = remover.transform(raw_words)

# Zero Index Label Column
indexer = StringIndexer(inputCol="state", outputCol="label")
feature_data = indexer.fit(words_df).transform(words_df)

feature_data.show(10,False)



+---+-----------------------------------------------------------------------------------------------------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+-----+
|_c0|blurb                                                                                                                              |state     |words                                                                                                                                                |filtered                                                                                                                    |label|
+---+-----------------------------------------------------------------------------------------------------------------------------------+---



In [34]:
################# AFTER ##################

# Tokenize
regex_tokenizer = RegexTokenizer(inputCol="blurb", outputCol="words", pattern="\\W")
# raw_words = regex_tokenizer.transform(df)

# Remove Stop words
remover = StopWordsRemover(inputCol=regex_tokenizer.getOutputCol(), outputCol="filtered")
# words_df = remover.transform(raw_words)

# Zero Index Label Column
indexer = StringIndexer(inputCol="state", outputCol="label")
# feature_data = indexer.fit(words_df).transform(words_df)

# Create the Pipeline
pipeline = Pipeline(stages=[regex_tokenizer,remover,indexer])
data_prep_pl = pipeline.fit(df)
# print(type(data_prep_pl))
# print(" ")
# Now call on the Pipeline to get our final df
feature_data = data_prep_pl.transform(df)
feature_data.show(1,False)



+---+---------------------------------------------------------------------------------------------------------------------------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+-----+
|_c0|blurb                                                                                                                            |state |words                                                                                                                                              |filtered                                                                                                                    |label|
+---+---------------------------------------------------------------------------------------------------------------------------------+------+--------------

Now take a look at the Spark UI again. You'll see the last two "countbyvalue" job IDs, one for each version of this preparation step. If you compare how long each of those jobs took, you will see that the second one ran slightly faster. Since we do not have much data here, it only saved us about 0.2 seconds, but that could translate to a couple of minutes on a much larger DataFrame.

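To make the three preparation stages less abstract, here is a minimal pure-Python sketch of what they do conceptually: lowercase and split on non-word characters, drop stop words, and give the most frequent label index 0.0. The toy `rows`, the tiny `STOP_WORDS` set and the helper names are assumptions for illustration only; this is not Spark code and it does not reproduce Spark's full stop word list.

```python
import re
from collections import Counter

# Hypothetical toy rows standing in for the (blurb, state) columns.
rows = [
    ("A small indie press, run as a collective!", "failed"),
    ("MicroFly is a quadcopter packed with WiFi", "successful"),
    ("A comedy dating sim", "failed"),
]

# Tiny illustrative stop word list (Spark's default English list is much longer).
STOP_WORDS = {"a", "as", "is", "with", "for"}

def tokenize(text):
    # Lowercase, split on non-word characters, drop empty tokens.
    return [t for t in re.split(r"\W", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def fit_label_index(labels):
    # The most frequent label gets index 0.0, the next one 1.0, and so on.
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# "fit": only the label indexer has to learn something from the data.
label_index = fit_label_index(state for _, state in rows)

# "transform": run every row through the stages in order.
prepared = [
    (remove_stop_words(tokenize(blurb)), label_index[state])
    for blurb, state in rows
]
for filtered, label in prepared:
    print(label, filtered)
```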
## Converting text into vectors

We will test out the following three vectorizers:

1. Count Vectors
2. TF-IDF
3. Word2Vec

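Count vectors and hashed term frequencies both count terms per document, but they pick the slot for each term differently. The sketch below shows the idea in plain Python, under a few assumptions: the toy `docs` are made up, the vocabulary is sorted alphabetically for simplicity (Spark's `CountVectorizer` orders it differently), and `zlib.crc32` is only a stand-in for the MurmurHash3 function that Spark uses.

```python
import zlib

# Hypothetical tokenized documents.
docs = [
    ["indie", "press", "run", "collective"],
    ["quadcopter", "wifi", "sensors", "wifi"],
]

# Count vectors: learn a vocabulary first, then count each term in its own slot.
vocab = sorted({term for doc in docs for term in doc})
term_to_slot = {term: i for i, term in enumerate(vocab)}

def count_vector(doc):
    vec = [0.0] * len(vocab)
    for term in doc:
        vec[term_to_slot[term]] += 1.0
    return vec

# Hashing TF: no vocabulary; the slot is hash(term) modulo a fixed size,
# so different terms can collide in the same slot.
NUM_FEATURES = 4

def hashing_tf(doc, num_features=NUM_FEATURES):
    vec = [0.0] * num_features
    for term in doc:
        vec[zlib.crc32(term.encode("utf-8")) % num_features] += 1.0
    return vec

for doc in docs:
    print(count_vector(doc), hashing_tf(doc))
```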
In [35]:
# Count Vector (CountVectorizer and HashingTF both produce term-frequency vectors;
# CountVectorizer learns a vocabulary, HashingTF hashes terms into a fixed number of slots)
# cv = CountVectorizer(inputCol="filtered", outputCol="features")
# model = cv.fit(feature_data)
# countVectorizer_features = model.transform(feature_data)

# Hashing TF
hashingTF = HashingTF(inputCol="filtered", outputCol="rawfeatures", numFeatures=20)
HTFfeaturizedData = hashingTF.transform(feature_data)

# TF-IDF
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(HTFfeaturizedData)
TFIDFfeaturizedData = idfModel.transform(HTFfeaturizedData)
TFIDFfeaturizedData.name = 'TFIDFfeaturizedData'

#rename the HTF features to features to be consistent
HTFfeaturizedData = HTFfeaturizedData.withColumnRenamed("rawfeatures","features")
HTFfeaturizedData.name = 'HTFfeaturizedData' #We will use later for printing

                                                                                
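The IDF step reweights each hashed slot by how rare it is across documents. Below is a minimal pure-Python sketch of that reweighting, assuming the smoothed formula idf = ln((m + 1) / (df + 1)) from the Spark MLlib documentation, where m is the number of documents and df is the number of documents in which the slot is non-zero. The `tf_rows` values are made up for illustration.

```python
import math

# Hypothetical term-frequency vectors (one row per document, one slot per hashed feature).
tf_rows = [
    [3.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [2.0, 0.0, 0.0],
]

num_docs = len(tf_rows)
num_slots = len(tf_rows[0])

# Document frequency: in how many documents is each slot non-zero?
doc_freq = [sum(1 for row in tf_rows if row[j] > 0) for j in range(num_slots)]

# Smoothed inverse document frequency (natural log).
idf = [math.log((num_docs + 1) / (df + 1)) for df in doc_freq]

# TF-IDF is the term frequency multiplied by the slot's IDF weight.
tfidf_rows = [[tf * w for tf, w in zip(row, idf)] for row in tf_rows]

for row in tfidf_rows:
    print(row)
```

A slot that is non-zero in every document gets a weight of zero, which is worth keeping in mind when `numFeatures` is as small as 20 and many words share a slot.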

In [39]:
HTFfeaturizedData.limit(5).toPandas()



|   | _c0 | blurb | state | words | filtered | label | features |
|---|---|---|---|---|---|---|---|
| 0 | 1 | using their own character, users go on educati... | failed | [using, their, own, character, users, go, on, ... | [using, character, users, go, educational, que... | 0.0 | (3.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, ... |
| 1 | 2 | microfly is a quadcopter packed with wifi, 6 s... | successful | [microfly, is, a, quadcopter, packed, with, wi... | [microfly, quadcopter, packed, wifi, 6, sensor... | 1.0 | (1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 2.0, 2.0, 1.0, ... |
| 2 | 3 | a small indie press, run as a collective for a... | failed | [a, small, indie, press, run, as, a, collectiv... | [small, indie, press, run, collective, authors... | 0.0 | (3.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ... |
| 3 | 4 | zylor is a new baby cosplayer! back this kicks... | failed | [zylor, is, a, new, baby, cosplayer, back, thi... | [zylor, new, baby, cosplayer, back, kickstarte... | 0.0 | (2.0, 0.0, 0.0, 1.0, 1.0, 3.0, 0.0, 2.0, 2.0, ... |
| 4 | 5 | hatoful boyfriend meet skeletons! a comedy dat... | failed | [hatoful, boyfriend, meet, skeletons, a, comed... | [hatoful, boyfriend, meet, skeletons, comedy, ... | 0.0 | (2.0, 0.0, 1.0, 0.0, 1.0, 3.0, 0.0, 0.0, 0.0, ... |


In [40]:
# Word2Vec
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="filtered", outputCol="features")
model = word2Vec.fit(feature_data)

W2VfeaturizedData = model.transform(feature_data)
# W2VfeaturizedData.show(1,False)

# Word2Vec vectors typically contain negative values, so we rescale them here so that we can use the Naive Bayes classifier
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(W2VfeaturizedData)

# rescale each feature to range [min, max].
scaled_data = scalerModel.transform(W2VfeaturizedData)
W2VfeaturizedData = scaled_data.select('state','blurb','label','scaledFeatures')
W2VfeaturizedData = W2VfeaturizedData.withColumnRenamed('scaledFeatures','features')

W2VfeaturizedData.name = 'W2VfeaturizedData' # We will need this to print later


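Naive Bayes needs non-negative feature values, which is why the Word2Vec vectors are rescaled before use. The sketch below shows the min-max rescaling idea in plain Python, assuming the default target range of [0, 1]; the `vectors` values and the `min_max_scale` helper are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical 3-dimensional document vectors, like Word2Vec output with vectorSize=3.
vectors = [
    [-0.20, 0.10, 0.05],
    [0.30, -0.40, 0.05],
    [0.05, 0.00, 0.05],
]

def min_max_scale(rows, lo=0.0, hi=1.0):
    # Compute the minimum and maximum of every column (feature).
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        new_row = []
        for value, c_min, c_max in zip(row, mins, maxs):
            if c_max == c_min:
                # Constant column: map to the middle of the target range.
                new_row.append(0.5 * (lo + hi))
            else:
                new_row.append((value - c_min) / (c_max - c_min) * (hi - lo) + lo)
        scaled.append(new_row)
    return scaled

scaled = min_max_scale(vectors)
for row in scaled:
    print(row)
```

After rescaling, every value lies between 0 and 1, so the negative entries are gone while the ordering within each column is preserved.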

## Train and Evaluate your model

From here on out, it is straight-up classification, so we can use our trusty function. I'll just copy and paste it in here.

In [41]:
def ClassTrainEval(classifier,features,classes,train,test):

    def FindMtype(classifier):
        # Instantiate Model
        M = classifier
        # Learn what it is
        Mtype = type(M).__name__
        
        return Mtype
    
    Mtype = FindMtype(classifier)
    

    def IntanceFitModel(Mtype,classifier,classes,features,train):
        
        if Mtype == "OneVsRest":
            # instantiate the base classifier.
            lr = LogisticRegression()
            # instantiate the One Vs Rest Classifier.
            OVRclassifier = OneVsRest(classifier=lr)
#             fitModel = OVRclassifier.fit(train)
            # Add parameters of your choice here:
            paramGrid = ParamGridBuilder() \
                .addGrid(lr.regParam, [0.1, 0.01]) \
                .build()
            #Cross Validator requires the following parameters:
            crossval = CrossValidator(estimator=OVRclassifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 is best practice
            # Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
        if Mtype == "MultilayerPerceptronClassifier":
            # specify layers for the neural network:
            # input layer of size features, two intermediate of features+1 and same size as features
            # and output of size number of classes
            # Note: crossvalidator cannot be used here
            features_count = len(features[0][0])
            layers = [features_count, features_count+1, features_count, classes]
            MPC_classifier = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
            fitModel = MPC_classifier.fit(train)
            return fitModel
        if Mtype in("LinearSVC","GBTClassifier") and classes != 2: # These classifiers currently only accept binary classification
            print(Mtype," could not be used because PySpark currently only accepts binary classification data for this algorithm")
            return
        if Mtype in("LogisticRegression","NaiveBayes","RandomForestClassifier","GBTClassifier","LinearSVC","DecisionTreeClassifier"):
  
            # Add parameters of your choice here:
            if Mtype in("LogisticRegression"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .addGrid(classifier.maxIter, [10, 15,20])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("NaiveBayes"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.smoothing, [0.0, 0.2, 0.4, 0.6]) \
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("RandomForestClassifier"):
                paramGrid = (ParamGridBuilder() \
                               .addGrid(classifier.maxDepth, [2, 5, 10])
#                                .addGrid(classifier.maxBins, [5, 10, 20])
#                                .addGrid(classifier.numTrees, [5, 20, 50])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("GBTClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
#                              .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .addGrid(classifier.maxIter, [10, 15,50,100])
                             .build())
                
            # Add parameters of your choice here:
            if Mtype in("LinearSVC"):
                paramGrid = (ParamGridBuilder() \
                             .addGrid(classifier.maxIter, [10, 15]) \
                             .addGrid(classifier.regParam, [0.1, 0.01]) \
                             .build())
            
            # Add parameters of your choice here:
            if Mtype in("DecisionTreeClassifier"):
                paramGrid = (ParamGridBuilder() \
#                              .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30]) \
                             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100]) \
                             .build())
            
            #Cross Validator requires all of the following parameters:
            crossval = CrossValidator(estimator=classifier,
                                      estimatorParamMaps=paramGrid,
                                      evaluator=MulticlassClassificationEvaluator(),
                                      numFolds=2) # 3 + is best practice
            # Fit Model: Run cross-validation, and choose the best set of parameters.
            fitModel = crossval.fit(train)
            return fitModel
    
    fitModel = IntanceFitModel(Mtype,classifier,classes,features,train)
    
    # Print feature selection metrics
    if fitModel is not None:
        
        if Mtype in("OneVsRest"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype + '\033[0m')
            # Extract list of binary models
            models = BestModel.models
            for model in models:
                print('\033[1m' + 'Intercept: '+ '\033[0m',model.intercept,'\033[1m' + '\nCoefficients:'+ '\033[0m',model.coefficients)

        if Mtype == "MultilayerPerceptronClassifier":
            print("")
            print('\033[1m' + Mtype," Weights"+ '\033[0m')
            print('\033[1m' + "Model Weights: "+ '\033[0m',fitModel.weights.size)
            print("")

        if Mtype in("DecisionTreeClassifier", "GBTClassifier","RandomForestClassifier"):
            # FEATURE IMPORTANCES
            # Estimate of the importance of each feature.
            # Each feature's importance is the average of its importance across all trees
            # in the ensemble. The importance vector is normalized to sum to 1.
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Feature Importances"+ '\033[0m')
            print("(Scores add up to 1)")
            print("Lowest score is the least important")
            print(" ")
            print(BestModel.featureImportances)
            
            if Mtype in("DecisionTreeClassifier"):
                global DT_featureimportances
                DT_featureimportances = BestModel.featureImportances.toArray()
                global DT_BestModel
                DT_BestModel = BestModel
            if Mtype in("GBTClassifier"):
                global GBT_featureimportances
                GBT_featureimportances = BestModel.featureImportances.toArray()
                global GBT_BestModel
                GBT_BestModel = BestModel
            if Mtype in("RandomForestClassifier"):
                global RF_featureimportances
                RF_featureimportances = BestModel.featureImportances.toArray()
                global RF_BestModel
                RF_BestModel = BestModel

        if Mtype in("LogisticRegression"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficient Matrix"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficientMatrix))
            print("Intercept: " + str(BestModel.interceptVector))
            global LR_coefficients
            LR_coefficients = BestModel.coefficientMatrix.toArray()
            global LR_BestModel
            LR_BestModel = BestModel

        if Mtype in("LinearSVC"):
            # Get Best Model
            BestModel = fitModel.bestModel
            print(" ")
            print('\033[1m' + Mtype," Coefficients"+ '\033[0m')
            print("You should compares these relative to eachother")
            print("Coefficients: \n" + str(BestModel.coefficients))
            global LSVC_coefficients
            LSVC_coefficients = BestModel.coefficients.toArray()
            global LSVC_BestModel
            LSVC_BestModel = BestModel
        
   
    # Set the column names to match the external results dataframe that we will join with later:
    columns = ['Classifier', 'Result']
    
    if Mtype in("LinearSVC","GBTClassifier") and classes != 2:
        Mtype = [Mtype] # make this a list
        score = ["N/A"]
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
    else:
        predictions = fitModel.transform(test)
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy") # predictionCol="prediction",
        accuracy = (MC_evaluator.evaluate(predictions))*100
        Mtype = [Mtype] # make this a list
        score = [str(accuracy)] #make this a string and convert to a list
        result = spark.createDataFrame(zip(Mtype,score), schema=columns)
        result = result.withColumn('Result',result.Result.substr(0, 5))
        
    return result
    # Feature importances and coefficients are not returned; they are stored in the global variables set above

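Each `CrossValidator` call in the function tries every parameter combination on every fold, so the cost grows quickly with the size of the grid. The sketch below counts the model fits in plain Python. The `grid` values and the round-robin `k_fold_indices` helper are assumptions for illustration (Spark assigns rows to folds randomly), and the final refit of the best parameters on the full training data is included in the count.

```python
from itertools import product

# Hypothetical grid, shaped like the ParamGridBuilder calls above.
grid = {"maxIter": [10, 15], "regParam": [0.1, 0.01]}
num_folds = 2

# Every combination of grid values becomes one candidate parameter map.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

def k_fold_indices(n_rows, k):
    # Deterministic round-robin folds for illustration only.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

fits = 0
for params in param_maps:
    for training, validation in k_fold_indices(n_rows=10, k=num_folds):
        fits += 1  # one model fit on `training`, scored on `validation`

# The best parameter map is then refit once on all of the training rows.
total_fits = fits + 1
print(len(param_maps), "parameter maps,", total_fits, "model fits")
```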
Declare the algorithms you want to test, along with a list of the feature DataFrames we created above.

In [37]:
# from pyspark.ml.classification import *
# from pyspark.ml.evaluation import *
# from pyspark.sql import functions
# from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Comment out Naive Bayes if your data still contains negative values
classifiers = [
                LogisticRegression()
                ,OneVsRest()
               ,LinearSVC()
               ,NaiveBayes()
               ,RandomForestClassifier()
               ,GBTClassifier()
               ,DecisionTreeClassifier()
               ,MultilayerPerceptronClassifier()
              ] 

featureDF_list = [HTFfeaturizedData,TFIDFfeaturizedData,W2VfeaturizedData]

Loop through all feature types (hashingTF, TFIDF and Word2Vec)

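The loop splits each feature DataFrame 70/30 with a fixed seed. As a rough illustration, and assuming each row is assigned to a side independently (which is why the resulting sizes are approximate rather than exact), the idea can be sketched in plain Python. The `random_split` helper is hypothetical and is not the PySpark method.

```python
import random

def random_split(rows, train_weight=0.7, seed=11):
    # Each row is assigned independently, so the split is only approximately 70/30.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows)

# The same seed always reproduces the same split.
print(len(train), len(test))
```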
In [39]:
from pyspark.sql.functions import countDistinct

In [None]:
for featureDF in featureDF_list:
    print('\033[1m' + featureDF.name," Results:"+ '\033[0m')
    train, test = featureDF.randomSplit([0.7, 0.3],seed = 11)
    features = featureDF.select(['features']).collect()
    # Count the distinct labels so the function knows whether this is a binary or a multiclass problem
    class_count = featureDF.select(countDistinct("label")).collect()
    classes = class_count[0][0]

    #set up your results table
    columns = ['Classifier', 'Result']
    vals = [("Place Holder","N/A")]
    results = spark.createDataFrame(vals, columns)

    for classifier in classifiers:
        new_result = ClassTrainEval(classifier,features,classes,train,test)
        results = results.union(new_result)
    results = results.where("Classifier!='Place Holder'")
    results.show(truncate=False)

HTFfeaturizedData  Results:



 
LogisticRegression  Coefficient Matrix
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-0.05517463,  0.23481073, -0.07094839,  0.34735178,  0.16765349,
               0.04584908, -0.00686083, -0.04391622,  0.00670871,  0.03470061,
              -0.25120455, -0.09360895,  0.0171562 , -0.38372562, -0.09206775,
              -0.36270194,  0.14506969,  0.13073395,  0.12362967, -0.2393998 ]])
Intercept: [-0.08323505657283753]



 
OneVsRest
Intercept:  0.10858610766653626
Coefficients: [0.05292493584366355,-0.22178815433139276,0.0633479381034587,-0.33018343172440384,-0.15786159619728587,-0.046858380212716634,0.00366909440465003,0.04111193467145429,-0.01234516598460722,-0.033478813658173276,0.23614903919748523,0.08731693084601048,-0.012785580014162537,0.36133967982807425,0.08696541086135677,0.34409869484903255,-0.1362841189408522,-0.1249292527567007,-0.11276363352708488,0.22689462203760194]
Intercept:  -0.10858610766653731
Coefficients: [-0.05292493584366339,0.2217881543313929,-0.06334793810345866,0.33018343172440395,0.15786159619728596,0.04685838021271656,-0.0036690944046497984,-0.041111934671454076,0.012345165984607293,0.03347881365817347,-0.23614903919748526,-0.08731693084601026,0.012785580014162502,-0.3613396798280742,-0.08696541086135687,-0.34409869484903266,0.13628411894085243,0.12492925275670079,0.11276363352708517,-0.22689462203760194]



 
LinearSVC  Coefficients
You should compares these relative to eachother
Coefficients: 
[0.03461382046362729,0.2758712142116449,-0.051482416075606635,0.4206405641593104,0.1587429750403592,0.028963764029055736,-0.04838560069860725,-0.15509465061400057,0.008312755126511353,-0.028371415514025198,-0.2741246074780288,-0.07054209068097864,0.05080748838353236,-0.3387532405147863,-0.05179287818402555,-0.3635271183401684,0.13062282554508448,0.222285663445786,0.16026019387394866,-0.21877814391863806]



 
RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.059635953775238795,0.02581824600033026,0.03310481123726922,0.05791121481442624,0.0366222589046033,0.052388981496590895,0.06419572068991806,0.09355852060440585,0.04994636239390528,0.044877125305896645,0.047067750639984286,0.08354755274301315,0.06765519564115358,0.05058253120149651,0.02688467599267564,0.06744371206506729,0.043279535315217374,0.03372581366432093,0.030801050990205643,0.030952986524281013])



 
GBTClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.056729755065052616,0.04926887635188957,0.04456618186214464,0.04424828865491271,0.05620813131310391,0.05122014178249525,0.03254917242636707,0.02826629112514666,0.05701555436671242,0.05122044424726035,0.045869809139343645,0.05464203108253165,0.11483895391139602,0.03754582105799004,0.04524638542869983,0.05363898934253462,0.02675258753968943,0.057385014368710484,0.05140309064325561,0.041384480290763276])



 
DecisionTreeClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,2,3,4,5,6,7,8,9,10,11,12,13,15,17,19],[0.0476876383277113,0.027566858269252655,0.0780905962392416,0.08618321341526569,0.02289006639730143,0.042969968393148476,0.042138531322304905,0.09258278969786106,0.06564063158049677,0.07415601943702996,0.0515026493939282,0.04226858851774412,0.07158561924139231,0.09051595465555712,0.11116182697241517,0.05305904813934906])




MultilayerPerceptronClassifier  Weights
Model Weights:  923

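The weight count of 923 reported for the multilayer perceptron can be checked by hand from the `layers` list built inside the function. The sketch below does that arithmetic, assuming 20 input features (the `numFeatures` value passed to `HashingTF`) and 2 classes; `mlp_weight_count` is a hypothetical helper, not part of PySpark.

```python
def mlp_weight_count(layers):
    # Each pair of consecutive layers contributes (inputs + 1 bias) * outputs weights.
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

features_count = 20  # assumed: numFeatures used for HashingTF
classes = 2          # assumed: "failed" and "successful"
layers = [features_count, features_count + 1, features_count, classes]

# (20 + 1) * 21 + (21 + 1) * 20 + (20 + 1) * 2
print(mlp_weight_count(layers))
```

The three terms are 441, 440 and 42, which add up to the 923 weights printed for this model.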




+------------------------------+------+
|Classifier                    |Result|
+------------------------------+------+
|LogisticRegression            |64.22 |
|OneVsRest                     |64.22 |
|LinearSVC                     |61.46 |
|NaiveBayes                    |64.22 |
|RandomForestClassifier        |58.71 |
|GBTClassifier                 |55.04 |
|DecisionTreeClassifier        |61.46 |
|MultilayerPerceptronClassifier|57.79 |
+------------------------------+------+

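The Result column is the accuracy expressed as a percentage and cut down to five characters. A rough sketch of that calculation in plain Python, using made-up (label, prediction) pairs that are not taken from this dataset:

```python
# Hypothetical (label, prediction) pairs standing in for the predictions DataFrame.
pairs = [(0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

# Accuracy is the share of rows where the prediction matches the label.
correct = sum(1 for label, prediction in pairs if label == prediction)
accuracy = correct * 100 / len(pairs)

# Keep only the first five characters of the string form (truncation, not rounding).
result = str(accuracy)[:5]
print(result)
```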
TFIDFfeaturizedData  Results:



 
LogisticRegression  Coefficient Matrix
You should compares these relative to eachother
Coefficients: 
DenseMatrix([[-0.07989012,  0.22069604, -0.08080767,  0.51787101,  0.21195683,
               0.04942282, -0.00724493, -0.04865891,  0.00878586,  0.04148598,
              -0.2821933 , -0.10230127,  0.02022808, -0.64506386, -0.12765095,
              -0.65920125,  0.15963434,  0.15629773,  0.12702101, -0.28423074]])
Intercept: [-0.08323505657283214]



 
OneVsRest
Intercept:  0.10858610766653548
Coefficients: [0.0766326749045762,-0.20845626023544145,0.07215102665517456,-0.4922745116992831,-0.19957737236960016,-0.05051079153917079,0.0038745064871014167,0.045551779340062236,-0.01616747245680029,-0.040025275522009844,0.2652805346374384,0.09542499427156811,-0.015074884477900802,0.6074318587301761,0.12057661296500408,0.6253903418174785,-0.14996671200858055,-0.14935797347541332,-0.11585689841130095,0.26938379401669016]
Intercept:  -0.10858610766653759
Coefficients: [-0.07663267490457588,0.20845626023544161,-0.07215102665517437,0.49227451169928343,0.19957737236960013,0.0505107915391708,-0.0038745064871011487,-0.04555177934006211,0.01616747245680064,0.04002527552201012,-0.2652805346374381,-0.09542499427156784,0.015074884477900678,-0.607431858730176,-0.1205766129650042,-0.6253903418174779,0.14996671200858105,0.14935797347541346,0.11585689841130115,-0.26938379401669]



 
LinearSVC  Coefficients
You should compare these relative to each other
Coefficients: 
[0.05011909052908364,0.25928833663155704,-0.05863662316013196,0.6271381554216586,0.20069166031423047,0.03122136617655296,-0.05109443996629275,-0.17184395135017633,0.010886547796568285,-0.03391917451116285,-0.3079407931371495,-0.07709247833055712,0.059904753413269646,-0.569462813036953,-0.07181027221379858,-0.6607009913200178,0.14373703856696993,0.2657514992870212,0.16465635613426519,-0.25974739254491913]


 
RandomForestClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(20,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19],[0.059635953775238795,0.02581824600033026,0.03310481123726922,0.05791121481442624,0.0366222589046033,0.052388981496590895,0.06419572068991806,0.09355852060440585,0.04994636239390528,0.044877125305896645,0.047067750639984286,0.08354755274301315,0.06765519564115358,0.05058253120149651,0.02688467599267564,0.06744371206506729,0.043279535315217374,0.03372581366432093,0.030801050990205643,0.030952986524281013])



KeyboardInterrupt: 


It looks like the Decision Tree classifier with W2VfeaturizedData was our best-performing feature set/classifier combination. Let's go with that, create our final model, and play around with the test dataframe.

In [49]:
classifier = DecisionTreeClassifier()
featureDF = W2VfeaturizedData

train, test = featureDF.randomSplit([0.7, 0.3],seed = 11)
features = featureDF.select(['features']).collect()

# Count the distinct labels so the evaluation can be set up as binary or multiclass; collect() returns the count inside a list of Row objects
class_count = featureDF.select(countDistinct("label")).collect()
classes = class_count[0][0]

# Running this again will generate all the objects needed to play around with the test data
ClassTrainEval(classifier,features,classes,train,test)

 
DecisionTreeClassifier  Feature Importances
(Scores add up to 1)
Lowest score is the least important
 
(3,[0,1,2],[0.2808481209180027,0.16992749361121173,0.5492243854707857])


DataFrame[Classifier: string, Result: string]
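
The importances printed above use Spark's sparse-vector notation, `(size, [indices], [values])`. As a small plain-Python sketch (the variable names are illustrative, and the numbers are simply copied from the output above), the values can be ranked and checked against the "scores add up to 1" note:

```python
import math

# Values copied from the DecisionTreeClassifier output above:
# (3,[0,1,2],[0.2808..., 0.1699..., 0.5492...])
size = 3
indices = [0, 1, 2]
values = [0.2808481209180027, 0.16992749361121173, 0.5492243854707857]

# Pair each feature index with its importance and sort, highest first
ranked = sorted(zip(indices, values), key=lambda pair: pair[1], reverse=True)

for idx, score in ranked:
    print(f"feature {idx}: {score:.4f}")

# The importances are normalized, so they should sum to 1 (up to float rounding)
total = sum(values)
print("sum of importances:", round(total, 6))
assert math.isclose(total, 1.0, rel_tol=1e-9)
```

Here feature 2 carries a little over half of the total importance and feature 1 the least, which is all the "lowest score is the least important" note is saying.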

Let's see some results!

In [50]:
# Score the held-out test set with the model fitted by ClassTrainEval
predictions = DT_BestModel.transform(test)

# Show a few blurbs for each predicted class (0 and 1 are the indexed labels)
print("Predicted Failures:")
predictions.select("state","blurb").filter("prediction=0").orderBy(predictions["prediction"].desc()).show(3,False)
print(" ")
print("Predicted Success:")
predictions.select("state","blurb").filter("prediction=1").orderBy(predictions["prediction"].desc()).show(3,False)

Predicted Failures:
+------+-------------------------------------------------------------------------------------------------------------------------------+
|state |blurb                                                                                                                          |
+------+-------------------------------------------------------------------------------------------------------------------------------+
|failed|a book about abuse and how to remove ones self from an abusive situation                                                       |
|failed|a collection of essays about life as an alien artistponderer in the iconic city of angels                                      |
|failed|a complete game and a great learning opportunity for aspiring web game developers learn how to build it and have fun playing it|
+------+-------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows

## What could be next?

Once we have our model and the fitted vectorizers, the sky is really the limit! For starters, we could do any of the following:

1. Let a user enter their own "blurb" and return a prediction of whether the campaign would succeed
2. If we had a time variable, show the most popular words over time
3. Provide this algorithm to Kickstarter for prescreening so they can prioritize entries
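
The second idea can be prototyped without Spark before scaling it up. The sketch below is plain Python on made-up data: the `year` field, the sample blurbs, the tiny stopword set, and the helper name `top_words_by_year` are all hypothetical, since the dataset used here has no time column. It counts the most frequent non-stopword tokens per year.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (year, blurb). The real dataset has no time column.
records = [
    (2015, "a board game about space pirates"),
    (2015, "a card game for the whole family"),
    (2016, "a documentary film about street art"),
    (2016, "a short film about a street musician"),
]

# Tiny illustrative stopword list (Spark's StopWordsRemover ships a fuller one)
stopwords = {"a", "about", "for", "the"}

def top_words_by_year(rows, n=2):
    """Return {year: [(word, count), ...]} with the n most common words per year."""
    counts = defaultdict(Counter)
    for year, blurb in rows:
        # Lowercase, split on whitespace, and drop stopwords
        tokens = [t for t in blurb.lower().split() if t not in stopwords]
        counts[year].update(tokens)
    return {year: counter.most_common(n) for year, counter in counts.items()}

result = top_words_by_year(records)
for year in sorted(result):
    print(year, result[year])
```

In Spark, the same idea would likely reuse the tokenizing and stopword-removal steps from earlier in this notebook, followed by a grouped count per time bucket.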