# Screencast Code

The following code is the same code used in the "Numeric Features" screencast. Run each code cell to see how it works.

In [13]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, VectorAssembler, Normalizer, StandardScaler
from pyspark.sql.functions import avg, col, concat, count, desc, explode, lit, min, max, split, stddev, udf
from pyspark.sql.types import IntegerType

import re

In [2]:
# create a SparkSession: note this step was left out of the screencast
spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .getOrCreate()

# Read in the Data Set

In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)

In [5]:
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

In [6]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php')

# Tokenization

Tokenization splits strings into separate words. Spark has a [Tokenizer](https://spark.apache.org/docs/latest/ml-features.html#tokenizer) class as well as RegexTokenizer, which allows for more control over the tokenization process.

In [7]:
# split the body text into separate words

regexTokenizer = RegexTokenizer(inputCol="Body", outputCol="words", pattern="\\W")
df = regexTokenizer.transform(df)
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which
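
The words list in the output above can be approximated in plain Python, which may help show what the "\\W" pattern does. This is only a sketch, not Spark's implementation: regex_tokenize is a hypothetical helper, and it assumes RegexTokenizer is running with its default settings (the pattern matches the gaps between tokens, text is lowercased, and tokens shorter than one character are dropped). Python's re module with the ASCII flag is used here as a stand-in for the Java regex engine that Spark uses.

```python
import re


def regex_tokenize(text, pattern=r"\W", min_token_length=1):
    """Lowercase the text, split it on the pattern, and drop short tokens.

    Hypothetical helper that mimics RegexTokenizer's default behaviour.
    """
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    # Splitting on every non-word character leaves empty strings behind
    # (for example between "<" and "p"), so filter them out.
    return [piece for piece in pieces if len(piece) >= min_token_length]


sample = "<p>I'd like to check if an uploaded file is an image file</p>"
tokens = regex_tokenize(sample)
print(tokens)
print(len(tokens))
```

Note that the HTML tag names ("p") and the pieces of contractions ("i", "d") survive as tokens, which is why they also appear in the Spark output.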

In [8]:
# count the number of words in each question body

body_length = udf(lambda x: len(x), IntegerType())
df = df.withColumn("BodyLength", body_length(df.words))

QUESTION 1 OF 5
Select the question with Id = 1112. How many words does its body contain (check the BodyLength column)?

In [11]:
df1 = df.where(df.Id == 1112)
df1.head()

Row(Body='<p>I submitted my iPhone application to iTunesConnect.Now it is in "Waiting for review". I want to release it only when i decide.. But am not able to see the option to set release date as "Automatically after success review"  or "release date will be set by Developer"(I mean Version Release Control option) . Somebody please help me ..Thanks in advance..</p>\n', Id=1112, Tags='iphone app-store itunes itunesconnect', Title='iPhone app release date option in iTunes Connect', oneTag='iphone', words=['p', 'i', 'submitted', 'my', 'iphone', 'application', 'to', 'itunesconnect', 'now', 'it', 'is', 'in', 'waiting', 'for', 'review', 'i', 'want', 'to', 'release', 'it', 'only', 'when', 'i', 'decide', 'but', 'am', 'not', 'able', 'to', 'see', 'the', 'option', 'to', 'set', 'release', 'date', 'as', 'automatically', 'after', 'success', 'review', 'or', 'release', 'date', 'will', 'be', 'set', 'by', 'developer', 'i', 'mean', 'version', 'release', 'control', 'option', 'somebody', 'please', 'help'

QUESTION 2 OF 5
Create a new column that concatenates the question title and body. Apply the same functions we used before to compute the number of words in this combined column. What's the value in this new column for Id = 5123?

In [14]:
df = df.withColumn("Desc", concat(col("Title"), lit(' '), col("Body")))

regexTokenizer = RegexTokenizer(inputCol="Desc", outputCol="words2", pattern="\\W")
df = regexTokenizer.transform(df)
df = df.withColumn("DescLength", body_length(df.words2))

df.where(df.Id == 5123).collect()

[Row(Body="<p>Here's an interesting experiment with using Git. Think of Github's ‘pages’ feature: I write a program in one branch (e.g. <code>master</code>), and a documentation website is kept in another, entirely unrelated branch (e.g. <code>gh-pages</code>).</p>\n\n<p>I can generate documentation in HTML format from the code in my <code>master</code>-branch, but I want to publish this as part of my documentation website in the <code>gh-pages</code> branch.</p>\n\n<p>How could I intelligently generate my docs from my code in <code>master</code>, move it to my <code>gh-pages</code> branch and commit the changes there? Should I use a post-commit hook or something? Would this be a good idea, or is it utterly foolish?</p>\n", Id=5123, Tags='git branch', Title='Git branch experiment', oneTag='git', words=['p', 'here', 's', 'an', 'interesting', 'experiment', 'with', 'using', 'git', 'think', 'of', 'github', 's', 'pages', 'feature', 'i', 'write', 'a', 'program', 'in', 'one', 'branch', 'e', '

In [15]:
# count the number of paragraphs and links in each question body

number_of_paragraphs = udf(lambda x: len(re.findall("</p>", x)), IntegerType())
number_of_links = udf(lambda x: len(re.findall("</a>", x)), IntegerType())

In [16]:
df = df.withColumn("NumParagraphs", number_of_paragraphs(df.Body))
df = df.withColumn("NumLinks", number_of_links(df.Body))

In [17]:
df.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic
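
The two UDFs above count closing tags in the raw HTML of each body. The sketch below runs the same re.findall logic on a single made-up string, outside Spark, so the counts are easy to check by hand. The helper name count_closing_tags and the sample body are assumptions for illustration only.

```python
import re


def count_closing_tags(html, tag):
    """Count how many times the closing tag (e.g. '</p>') occurs in html."""
    return len(re.findall("</{}>".format(re.escape(tag)), html))


# Made-up body with two paragraphs and one link.
body = '<p>See <a href="https://example.com">this page</a>.</p>\n\n<p>Thanks!</p>\n'

num_paragraphs = count_closing_tags(body, "p")
num_links = count_closing_tags(body, "a")
print(num_paragraphs, num_links)
```

Counting the closing tag rather than the opening one sidesteps attributes such as href, since a closing tag always has the same text.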

# VectorAssembler

Combine the body length, number of paragraphs, and number of links columns into a single vector column.

In [18]:
assembler = VectorAssembler(inputCols=["BodyLength", "NumParagraphs", "NumLinks"], outputCol="NumFeatures")
df = assembler.transform(df)

In [19]:
df.head()

Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'which

# Normalize the Vectors
Rescale each row's feature vector to have unit norm. Normalizer uses the L2 norm by default (p=2.0).

In [15]:
scaler = Normalizer(inputCol="NumFeatures", outputCol="ScaledNumFeatures")
df = scaler.transform(df)

In [16]:
df.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic
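
To make "unit norm" concrete, here is a minimal plain-Python sketch of L2 normalization applied to one row's vector. It assumes the default p=2.0, and l2_normalize is a hypothetical helper rather than part of the Spark API. Returning a zero vector unchanged is a choice made for this sketch.

```python
import math


def l2_normalize(vector):
    """Divide every entry by the vector's Euclidean (L2) norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # Avoid dividing by zero; leave an all-zero vector as it is.
        return list(vector)
    return [x / norm for x in vector]


three_features = l2_normalize([3.0, 4.0, 0.0])
one_feature = l2_normalize([46.0])
print(three_features)  # [0.6, 0.8, 0.0]
print(one_feature)     # [1.0]
```

Normalization works row by row, so it never looks at other rows. One consequence is that a vector with a single positive entry always normalizes to [1.0], whatever the original value was.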

QUESTION 3 OF 5
Using the Normalizer method, what's the normalized value for question Id = 512?

In [20]:
assembler2 = VectorAssembler(inputCols=["DescLength"], outputCol="DescVec")
df = assembler2.transform(df)

scaler_q1 = Normalizer(inputCol="DescVec", outputCol="DescVecNormalizer")
df = scaler_q1.transform(df)

df.where(df.Id == 512).collect()

[Row(Body="<p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code that HotSpot is using after it's been running for a while?</p>\n", Id=512, Tags='java optimization hotspot', Title='How can I see the code that HotSpot generates after optimizing?', oneTag='java', words=['p', 'i', 'd', 'like', 'to', 'have', 'a', 'better', 'understanding', 'of', 'what', 'optimizations', 'hotspot', 'might', 'generate', 'for', 'my', 'java', 'code', 'at', 'run', 'time', 'p', 'p', 'is', 'there', 'a', 'way', 'to', 'see', 'the', 'optimized', 'code', 'that', 'hotspot', 'is', 'using', 'after', 'it', 's', 'been', 'running', 'for', 'a', 'while', 'p'], BodyLength=46, Desc="How can I see the code that HotSpot generates after optimizing? <p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code 

# Scale the Vectors
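Scale each feature to unit standard deviation. StandardScaler defaults to withMean=False, so the cell below does not center the features at mean 0; pass withMean=True to do that as well.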

In [17]:
scaler2 = StandardScaler(inputCol="NumFeatures", outputCol="ScaledNumFeatures2", withStd=True)
scalerModel = scaler2.fit(df)
df = scalerModel.transform(df)

In [18]:
df.head(2)

[Row(Body="<p>I'd like to check if an uploaded file is an image file (e.g png, jpg, jpeg, gif, bmp) or another file. The problem is that I'm using Uploadify to upload the files, which changes the mime type and gives a 'text/octal' or something as the mime type, no matter which file type you upload.</p>\n\n<p>Is there a way to check if the uploaded file is an image apart from checking the file extension using PHP?</p>\n", Id=1, Tags='php image-processing file-upload upload mime-types', Title='How to check if an uploaded file is an image without mime type?', oneTag='php', words=['p', 'i', 'd', 'like', 'to', 'check', 'if', 'an', 'uploaded', 'file', 'is', 'an', 'image', 'file', 'e', 'g', 'png', 'jpg', 'jpeg', 'gif', 'bmp', 'or', 'another', 'file', 'the', 'problem', 'is', 'that', 'i', 'm', 'using', 'uploadify', 'to', 'upload', 'the', 'files', 'which', 'changes', 'the', 'mime', 'type', 'and', 'gives', 'a', 'text', 'octal', 'or', 'something', 'as', 'the', 'mime', 'type', 'no', 'matter', 'whic
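
Unlike Normalizer, StandardScaler works column by column, which is why it needs a fit step to compute statistics over the whole DataFrame. The sketch below shows the arithmetic for a single feature column in plain Python. standard_scale is a hypothetical helper; it assumes the sample standard deviation (dividing by n - 1), and mapping a zero-variance column to 0.0 is a choice made for this sketch.

```python
import statistics


def standard_scale(values, with_mean=False, with_std=True):
    """Scale one feature column, optionally centering it first."""
    mean = statistics.mean(values) if with_mean else 0.0
    std = statistics.stdev(values) if with_std else 1.0
    if std == 0.0:
        # A constant column carries no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0]  # mean 4.0, sample standard deviation 2.0

std_only = standard_scale(column)                     # [1.0, 2.0, 3.0]
centered = standard_scale(column, with_mean=True)     # [-1.0, 0.0, 1.0]
untouched = standard_scale(column, with_std=False)    # [2.0, 4.0, 6.0]
print(std_only, centered, untouched)
```

With both flags switched off the values come back unchanged, so at least one of them has to be enabled for the scaler to have any effect.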

QUESTION 4 OF 5
Using the StandardScaler method (scaling both the mean and the standard deviation), what's the normalized value for question Id = 512?

In [21]:
scaler_q2 = StandardScaler(inputCol="DescVec", outputCol="DescVecStandardScaler4", withMean=True, withStd=True)
scalerModel_q2 = scaler_q2.fit(df)
df = scalerModel_q2.transform(df)

df.where(df.Id == 512).collect()

[Row(Body="<p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code that HotSpot is using after it's been running for a while?</p>\n", Id=512, Tags='java optimization hotspot', Title='How can I see the code that HotSpot generates after optimizing?', oneTag='java', words=['p', 'i', 'd', 'like', 'to', 'have', 'a', 'better', 'understanding', 'of', 'what', 'optimizations', 'hotspot', 'might', 'generate', 'for', 'my', 'java', 'code', 'at', 'run', 'time', 'p', 'p', 'is', 'there', 'a', 'way', 'to', 'see', 'the', 'optimized', 'code', 'that', 'hotspot', 'is', 'using', 'after', 'it', 's', 'been', 'running', 'for', 'a', 'while', 'p'], BodyLength=46, Desc="How can I see the code that HotSpot generates after optimizing? <p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code 

QUESTION 5 OF 5
Using the MinMaxScaler method, what's the normalized value for question Id = 512?

In [22]:
from pyspark.ml.feature import MinMaxScaler
scaler_q3 = MinMaxScaler(inputCol="DescVec", outputCol="DescVecMinMaxScaler")
scalerModel_q3 = scaler_q3.fit(df)
df = scalerModel_q3.transform(df)

df.where(df.Id == 512).collect()

[Row(Body="<p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code that HotSpot is using after it's been running for a while?</p>\n", Id=512, Tags='java optimization hotspot', Title='How can I see the code that HotSpot generates after optimizing?', oneTag='java', words=['p', 'i', 'd', 'like', 'to', 'have', 'a', 'better', 'understanding', 'of', 'what', 'optimizations', 'hotspot', 'might', 'generate', 'for', 'my', 'java', 'code', 'at', 'run', 'time', 'p', 'p', 'is', 'there', 'a', 'way', 'to', 'see', 'the', 'optimized', 'code', 'that', 'hotspot', 'is', 'using', 'after', 'it', 's', 'been', 'running', 'for', 'a', 'while', 'p'], BodyLength=46, Desc="How can I see the code that HotSpot generates after optimizing? <p>I'd like to have a better understanding of what optimizations HotSpot might generate for my Java code at run time. </p>\n\n<p>Is there a way to see the optimized code 
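
As a rough illustration of min-max scaling, the sketch below rescales one feature column to the range [0, 1] in plain Python. min_max_scale is a hypothetical helper, and the default output range of 0.0 to 1.0 is assumed. The builtins module is used explicitly because the first cell of this notebook imports min and max from pyspark.sql.functions, which shadows Python's built-in functions of the same name.

```python
import builtins


def min_max_scale(values, out_min=0.0, out_max=1.0):
    """Linearly rescale one feature column to [out_min, out_max]."""
    lowest = builtins.min(values)
    highest = builtins.max(values)
    if highest == lowest:
        # A constant column has no range; place it at the midpoint.
        return [0.5 * (out_max + out_min) for _ in values]
    scale = (out_max - out_min) / (highest - lowest)
    return [(v - lowest) * scale + out_min for v in values]


scaled = min_max_scale([10.0, 20.0, 50.0])
print(scaled)  # [0.0, 0.25, 1.0]
```

Like StandardScaler, this depends on statistics of the whole column (its minimum and maximum), so a single extreme value compresses every other row toward one end of the range.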