![x](https://zdnet4.cbsistatic.com/hub/i/r/2017/12/17/e9b8f576-8c65-4308-93fa-55ee47cdd7ef/resize/370xauto/30f614c5879a8589a22e57b3108195f3/databricks-logo.png)

&copy; 2019 Databricks, Inc. All rights reserved.<br/>

# Getting access to our data

### Databricks File System - DBFS
Databricks File System (DBFS) is a distributed file system installed on Azure Databricks clusters. Files in DBFS persist to Azure Blob storage, so you won’t lose data even after you terminate a cluster.

You can access files in DBFS using the Databricks CLI, DBFS API, Databricks Utilities, Spark APIs, and local file APIs.

On your local computer, you access DBFS using the Databricks CLI or the DBFS API. On a Spark cluster, you access DBFS using Databricks Utilities, Spark APIs, or local file APIs.
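
The access methods above address the same files with different path conventions. As a minimal sketch (the helper name `to_local_path` is hypothetical, and it assumes the standard `/dbfs` mount that cluster nodes expose to local file APIs), the following converts a `dbfs:/` path, as used by Databricks Utilities and Spark APIs, into the path a local file API would use:

```python
def to_local_path(dbfs_path):
    """Convert a DBFS path (e.g. 'dbfs:/mnt/data/x.csv') to its local '/dbfs' form.

    Assumes the standard '/dbfs' mount on cluster nodes; the function name is illustrative.
    """
    prefix = "dbfs:"
    # Strip the scheme if present; plain paths such as '/mnt/data' are treated as DBFS paths too.
    path = dbfs_path[len(prefix):] if dbfs_path.startswith(prefix) else dbfs_path
    # Normalize to exactly one leading slash before adding the '/dbfs' root.
    return "/dbfs/" + path.lstrip("/")


print(to_local_path("dbfs:/mnt/databricks-workshop-datasets/"))
```

With this convention, a file listed as `dbfs:/mnt/...` by `%fs ls` could be opened with Python's built-in `open()` under `/dbfs/mnt/...`.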

DBFS allows you to mount storage containers so that, once mounted, you can seamlessly access the data without supplying credentials each time.

**Databricks Mount Points:**
- Connect to our Azure Storage Account - https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-storage.html
- Connect to our Azure Data Lake - https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html

### 1. Mounting Blob Storage

Next, let's connect to the read-only Blob store that holds the data needed in this course. We can easily mount data in blob stores to Azure Databricks for fast and scalable data storage.

*Note:* You need a running cluster to execute this code.

### Cluster Setup

Please ensure you have a cluster with the following configuration:

Cluster Mode: Standard  
Databricks Runtime: 5.4 ML  
NO autoscaling  
Standard VMs (DS3 v2)  
1 worker node

**IMPORTANT** If you are using a shared workspace, please be careful whenever you write files or create tables. These are shared across your instance, so add a prefix or suffix to your table and file names when writing, and use the same modified names whenever you read them back.
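
One way to apply the naming advice above consistently is to derive every table or file name from a single helper. This is only a sketch: the function name `personal_name` and the example user are hypothetical, and any scheme that keeps names unique per participant would do.

```python
import re


def personal_name(base, user):
    """Return `base` prefixed with a sanitized form of `user` (both names are illustrative)."""
    # Keep only letters, digits and underscores so the result is safe in table names and paths.
    safe_user = re.sub(r"[^0-9a-zA-Z_]", "_", user).lower()
    return "{}_{}".format(safe_user, base)


print(personal_name("malware_train", "Jane.Doe"))
```

Using the same helper for both writes and reads avoids picking up another participant's table by accident.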

In [7]:
dbutils.fs.help()

In [8]:
# If you have run this training before, unmount first so that you can re-mount
try:
  dbutils.fs.unmount("/mnt/databricks-workshop-datasets") # Use this to unmount as needed
except Exception:
  # Unmounting a path that is not mounted raises an error, which is safe to ignore here
  print("{} already unmounted".format("/mnt/databricks-workshop-datasets"))

In [9]:
# This only needs to be run once, globally. Once we have mounted the storage account, there is no need to do it again (unless you unmount).
# These credentials DO NOT have write access

STORAGE_ACCOUNT = "channelsapublicprodblob"
CONTAINER = "channelsa-datasets"
MOUNT_POINT = "/mnt/databricks-workshop-datasets"
SAS_KEY = "?sv=2018-03-28&ss=b&srt=sco&sp=rwlac&se=2022-07-07T02:01:53Z&st=2019-03-24T19:01:53Z&spr=https&sig=o6rvr92oZH4nzdn7r4gR%2Bxv%2Fj%2BkOgv5BhXIfbTJYM%2Bg%3D"

#Define strings to be passed to the mount function
source_str = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)
conf_key = "fs.azure.sas.{container}.{storage_acct}.blob.core.windows.net".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)

#Run the mount function using template in the documentation
try:
  dbutils.fs.mount(
    source = source_str,
    mount_point = MOUNT_POINT,
    extra_configs = {conf_key: SAS_KEY}
  )
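except Exception as e:
  # The usual cause is an existing mount, but report the underlying error in case it is something else
  print("ERROR: could not mount {} (it may already be mounted; run the previous cells to unmount first): {}".format(MOUNT_POINT, e))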
  
#If needed to unmount, use this:
#try:
#  dbutils.fs.unmount(MOUNT_POINT) # Use this to unmount as needed
#except:
#  print("{} already unmounted".format(MOUNT_POINT))
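
Instead of relying on an exception to detect an existing mount, the list of current mounts can be checked first. The sketch below assumes the entries returned by `dbutils.fs.mounts()` expose a `mountPoint` attribute. The helper name `is_mounted` and the stand-in data are hypothetical, so that the example also runs outside Databricks.

```python
from collections import namedtuple


def is_mounted(mount_point, mounts):
    """Return True if `mount_point` appears in `mounts`.

    `mounts` is expected to be the list returned by dbutils.fs.mounts(),
    whose entries expose a `mountPoint` attribute.
    """
    target = mount_point.rstrip("/")
    return any(m.mountPoint.rstrip("/") == target for m in mounts)


# Stand-in data so the sketch runs outside Databricks; in a notebook, pass dbutils.fs.mounts().
FakeMount = namedtuple("FakeMount", ["mountPoint", "source"])
example_mounts = [FakeMount("/mnt/databricks-workshop-datasets", "wasbs://example")]

print(is_mounted("/mnt/databricks-workshop-datasets", example_mounts))
print(is_mounted("/mnt/databricks-workshop-exercises", example_mounts))
```

In the notebook, a guard such as `if not is_mounted(MOUNT_POINT, dbutils.fs.mounts()):` placed before `dbutils.fs.mount(...)` would skip the call when the mount already exists.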

We can use **%fs** to issue filesystem commands such as **ls** to browse through our folder.

In [11]:
%fs ls /mnt/databricks-workshop-datasets/

path,name,size
dbfs:/mnt/databricks-workshop-datasets/Contoso-retail/,Contoso-retail/,0
dbfs:/mnt/databricks-workshop-datasets/Demo-datasets/,Demo-datasets/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/,End-to-End-ML-Lifecycle/,0


In [12]:
%fs ls /mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/

path,name,size
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/malware_sample_sub.csv,malware_sample_sub.csv,290570393
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/malware_test.csv,malware_test.csv,3795687226
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/malware_train.csv,malware_train.csv,4384966482
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_stream_merged/,workshop_stream_merged/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_stream_merged_sample/,workshop_stream_merged_sample/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/,workshop_train.csv/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_clean_delta/,workshop_train_clean_delta/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_clean_delta_sample/,workshop_train_clean_delta_sample/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_enrich.csv/,workshop_train_enrich.csv/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_enrich_sample.csv/,workshop_train_enrich_sample.csv/,0


In [13]:
%fs ls /mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/

path,name,size
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/_SUCCESS,_SUCCESS,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/_committed_8681468303182107569,_committed_8681468303182107569,1824
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/_started_8681468303182107569,_started_8681468303182107569,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00000-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8483-1-c000.csv,part-00000-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8483-1-c000.csv,115657107
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00001-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8484-1-c000.csv,part-00001-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8484-1-c000.csv,115659562
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00002-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8485-1-c000.csv,part-00002-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8485-1-c000.csv,115645620
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00003-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8486-1-c000.csv,part-00003-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8486-1-c000.csv,115665123
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00004-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8487-1-c000.csv,part-00004-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8487-1-c000.csv,115662100
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00005-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8488-1-c000.csv,part-00005-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8488-1-c000.csv,115664837
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00006-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8489-1-c000.csv,part-00006-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8489-1-c000.csv,115651064
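
As the listing shows, `workshop_train.csv` is a directory rather than a single file: it holds several `part-*` data files alongside marker files such as `_SUCCESS`. A listing like this can be summarized with a small helper. This is a sketch only: `summarize_output_dir` is a hypothetical name, the entries are stand-ins, and it assumes each entry exposes `name` and `size` attributes, as the results of `dbutils.fs.ls(...)` do.

```python
from collections import namedtuple


def summarize_output_dir(entries):
    """Split a directory listing into data part files and marker files.

    `entries` is any iterable of objects with `name` and `size` attributes,
    such as the result of dbutils.fs.ls(path).
    """
    entries = list(entries)
    parts = [e for e in entries if e.name.startswith("part-")]
    markers = [e.name for e in entries if not e.name.startswith("part-")]
    return {
        "part_files": len(parts),
        "data_bytes": sum(e.size for e in parts),
        "markers": markers,
    }


# Stand-in entries with shortened names; sizes are taken from the first two part files above.
Entry = namedtuple("Entry", ["name", "size"])
listing = [
    Entry("_SUCCESS", 0),
    Entry("part-00000-example.csv", 115657107),
    Entry("part-00001-example.csv", 115659562),
]
print(summarize_output_dir(listing))
```

When the directory path is passed to `spark.read.csv(...)`, Spark reads all of the part files as one dataset.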


If, by any chance, you cannot write to the local FileStore, you can use the mounted blob below.

In [15]:
STORAGE_ACCOUNT = "channelsapublicexercises"
CONTAINER = "exercise-container"
MOUNT_POINT = "/mnt/databricks-workshop-exercises"
SAS_KEY = "?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2021-11-27T03:35:07Z&st=2019-03-24T19:35:07Z&spr=https&sig=w%2Fp9iG2FGDlgNT716Kt3ZFnWQuUGlaxz3Bu4yVAVEwo%3D"

#Define strings to be passed to the mount function
source_str = "wasbs://{container}@{storage_acct}.blob.core.windows.net/".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)
conf_key = "fs.azure.sas.{container}.{storage_acct}.blob.core.windows.net".format(container=CONTAINER, storage_acct=STORAGE_ACCOUNT)

#Run the mount function using template in the documentation
try:
  dbutils.fs.mount(
    source = source_str,
    mount_point = MOUNT_POINT,
    extra_configs = {conf_key: SAS_KEY}
  )
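except Exception as e:
  # The usual cause is an existing mount, but report the underlying error in case it is something else
  print("ERROR: could not mount {} (it may already be mounted; paste the code below in a cell above and run it to unmount first): {}".format(MOUNT_POINT, e))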
  
#If needed to unmount, use this:
#try:
#  dbutils.fs.unmount(MOUNT_POINT) # Use this to unmount as needed
#except:
#  print("{} already unmounted".format(MOUNT_POINT))
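
The `SAS_KEY` values used in the mount cells are URL query strings, and their `se` field carries the token's expiry time. Because access through a mount can fail once a token has expired, it can be useful to check that field. The sketch below uses only the Python standard library. The helper names and the token are hypothetical (the signature is a placeholder), and it assumes the timestamp format shown in the keys above.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_expiry(sas_token):
    """Return the expiry time (the `se` field) of a SAS token as an aware UTC datetime."""
    fields = parse_qs(sas_token.lstrip("?"))
    expiry = datetime.strptime(fields["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return expiry.replace(tzinfo=timezone.utc)


def is_expired(sas_token, now=None):
    """Return True if the token's expiry time is not after `now` (default: current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return sas_expiry(sas_token) <= now


# Hypothetical token with a placeholder signature, for illustration only.
token = "?sv=2018-03-28&sp=rl&se=2021-11-27T03:35:07Z&st=2019-03-24T19:35:07Z&spr=https&sig=PLACEHOLDER"
print(sas_expiry(token).isoformat())
print(is_expired(token, now=datetime(2020, 1, 1, tzinfo=timezone.utc)))
```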

### 2. Reading the training file

**Technical Accomplishments:**
- Read data from CSV using PySpark
- Read data from CSV using SQL
- Define schemas
- Switch languages

In [17]:
%fs ls /mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/

path,name,size
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/malware_sample_sub.csv,malware_sample_sub.csv,290570393
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/malware_test.csv,malware_test.csv,3795687226
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/malware_train.csv,malware_train.csv,4384966482
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_stream_merged/,workshop_stream_merged/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_stream_merged_sample/,workshop_stream_merged_sample/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/,workshop_train.csv/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_clean_delta/,workshop_train_clean_delta/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_clean_delta_sample/,workshop_train_clean_delta_sample/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_enrich.csv/,workshop_train_enrich.csv/,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_enrich_sample.csv/,workshop_train_enrich_sample.csv/,0


### The Data Source
For this exercise, we will be using a few different files:
  - workshop_train_sample.csv
  - workshop_test_sample.csv 

The goal of this workshop is ultimately to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The telemetry data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft's endpoint protection solution, Windows Defender.

Each row in this dataset corresponds to a machine, uniquely identified by a `MachineIdentifier`. `HasDetections` is the ground truth and indicates that malware was detected on the machine.

There are other files in our folder which relate either to the initial Kaggle dataset (which we've subsampled) or to different checkpoints that we've prepared for the workshop at hand.

We can use **%fs head ...** to view the first few lines of the file.

In [20]:
%fs ls /mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/

path,name,size
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/_SUCCESS,_SUCCESS,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/_committed_8681468303182107569,_committed_8681468303182107569,1824
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/_started_8681468303182107569,_started_8681468303182107569,0
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00000-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8483-1-c000.csv,part-00000-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8483-1-c000.csv,115657107
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00001-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8484-1-c000.csv,part-00001-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8484-1-c000.csv,115659562
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00002-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8485-1-c000.csv,part-00002-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8485-1-c000.csv,115645620
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00003-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8486-1-c000.csv,part-00003-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8486-1-c000.csv,115665123
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00004-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8487-1-c000.csv,part-00004-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8487-1-c000.csv,115662100
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00005-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8488-1-c000.csv,part-00005-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8488-1-c000.csv,115664837
dbfs:/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00006-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8489-1-c000.csv,part-00006-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8489-1-c000.csv,115651064


In [21]:
%fs head /mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train.csv/part-00000-tid-8681468303182107569-814dbf30-b1b0-483f-bfb0-9164155c9bf8-8483-1-c000.csv

Let's start with the bare minimum: specifying that the file we want to read is delimited, and where it is located.
The default delimiter for `spark.read.csv()` is a comma, but we can change it by setting the `delimiter` option.

For the purpose of the workshop, we will read samples of the files so as not to use up too many resources from our Azure Pass. If you're using your own, bigger clusters, feel free to use the full data files (just remove `_sample` from the name).

In [23]:
csvFile = "/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_sample.csv"

df = (spark.read                        # The DataFrameReader
   .option("header", "true")       # Use first line of all files as header
   .option("inferSchema", "true")  # Automatically infer data types
   .csv(csvFile)                   # Creates a DataFrame from CSV after reading in the file
)
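
The `delimiter` option mentioned above can be set in the same chain of `option(...)` calls. As a minimal sketch, the reader can be wrapped in a function that takes the delimiter as a parameter. The function name `read_delimited` is hypothetical, and `spark` is assumed to be the SparkSession that Databricks notebooks provide.

```python
def read_delimited(spark, path, delimiter=","):
    """Read a delimited text file with a header row into a DataFrame.

    `spark` is the SparkSession available in Databricks notebooks;
    the function name is illustrative.
    """
    return (spark.read
            .option("header", "true")        # First line holds the column names
            .option("inferSchema", "true")   # Let Spark scan the data to infer column types
            .option("delimiter", delimiter)  # Field separator; a comma is the default
            .csv(path))
```

For a tab-separated file, the call would look like `read_delimited(spark, some_path, delimiter="\t")`, where `some_path` stands for a file of your own.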

In [24]:
display(df)

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,platfrm,prcsr,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,serviceDate,recentIncident
08001e93f7bdc641010343d64fe4020c,win8defender,1.1.15200.1,4.18.1807.18075,1.275.383.0,0,7.0,0,,53447.0,1.0,1.0,1,21,122796.0,,39,34,windows10,x64,10.0.0.0,16299,256,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Pro,1.0,0,,0.0,117.0,ExistsNotSet,1.0,1.0,0.0,3.0,1,112017,1520315836
088e53a33c98e2787eca4383d2799a78,win8defender,1.1.15200.1,4.10.209.0,1.275.213.0,0,7.0,0,,46901.0,2.0,2.0,1,93,13354.0,,119,64,windows8,x64,6.3.0.0,9600,768,windows8.1,9600.19101.amd64fre.winblue_ltsb_escrow.180718-1800,Home,1.0,0,,0.0,333.0,RequireAdmin,1.0,1.0,0.0,8.0,1,82018,1525440628
0656a968ce18723ff3d3e9887f0abd77,win8defender,1.1.15100.1,4.18.1807.18075,1.273.806.0,0,7.0,0,,53447.0,1.0,1.0,1,29,143155.0,18.0,35,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Education,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,0.0,10.0,1,42016,1490438726
09cfba6e56fd3d20538cab0d1bf99806,win8defender,1.1.15100.1,4.11.15063.0,1.273.861.0,0,7.0,0,,53447.0,1.0,1.0,1,152,7470.0,27.0,184,69,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,105.0,,1.0,1.0,0.0,1.0,1,52016,1511126815
057907d6a93702b77464a637de412ce4,win8defender,1.1.15200.1,4.16.17656.18052,1.275.363.0,0,7.0,0,,53447.0,1.0,1.0,1,201,66673.0,27.0,267,251,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,ExistsNotSet,1.0,1.0,0.0,11.0,1,112016,1514421618
097c675b62c517169d4deacb2abddd74,win8defender,1.1.15100.1,4.16.17656.18052,1.273.1264.0,0,7.0,0,,47238.0,2.0,1.0,1,60,86819.0,,240,233,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,108.0,,1.0,1.0,0.0,15.0,0,42017,1456582749
0a47676955d453c1eb4b4f0a4304ac53,win8defender,1.1.14700.5,4.12.17007.18022,1.265.206.0,0,7.0,0,,23796.0,2.0,1.0,1,110,1931.0,18.0,211,182,windows10,x86,10.0.0.0,16299,256,rs3,16299.15.x86fre.rs3_release.170928-1534,Pro,0.0,0,,0.0,117.0,RequireAdmin,1.0,1.0,0.0,3.0,0,12018,1517315890
06029134a6671e6790350bcb9bf2434b,win8defender,1.1.15200.1,4.18.1807.18075,1.275.884.0,0,7.0,0,,53447.0,1.0,1.0,1,205,75528.0,,274,253,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,0.0,3.0,1,12017,1459684098
09fcf4cc35a6a3974d3a243b98bbfce7,win8defender,1.1.15200.1,4.18.1807.18075,1.275.699.0,0,7.0,0,,53447.0,1.0,1.0,1,60,15034.0,27.0,240,233,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Home,1.0,0,,0.0,117.0,,1.0,1.0,0.0,15.0,1,62018,1460681618
0a249c0acc1e9c7ec374fb36be998078,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1369.0,0,7.0,0,,,,,1,51,40629.0,,98,103,windows2016,x64,10.0.0.0,14393,272,rs1,14393.2068.amd64fre.rs1_release.180209-1727,Invalid,,0,,0.0,98.0,Off,0.0,1.0,0.0,6.0,1,102017,1474110609


Alternatively, we can accomplish the same result using SQL.

In [26]:
%sql

DROP TABLE IF EXISTS malware_train; 
CREATE OR REPLACE TEMPORARY VIEW malware_train
USING CSV
OPTIONS (path "/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_sample.csv/", header "true", inferSchema "true")
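
The same statement can also be issued from a Python cell through `spark.sql(...)`, which is handy when the view name or path is held in a variable. The sketch below only builds the statement text. The helper name `csv_view_sql` is hypothetical, and it does no escaping, so it is suitable only for trusted values such as the workshop paths.

```python
def csv_view_sql(view_name, path):
    """Build a CREATE OR REPLACE TEMPORARY VIEW statement over a CSV path."""
    return (
        "CREATE OR REPLACE TEMPORARY VIEW {view} "
        "USING CSV "
        'OPTIONS (path "{path}", header "true", inferSchema "true")'
    ).format(view=view_name, path=path)


statement = csv_view_sql(
    "malware_train",
    "/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_sample.csv/",
)
print(statement)
# In a notebook, the statement would then be run with: spark.sql(statement)
```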

In [27]:
%sql

SELECT * FROM malware_train

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,platfrm,prcsr,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,serviceDate,recentIncident
08001e93f7bdc641010343d64fe4020c,win8defender,1.1.15200.1,4.18.1807.18075,1.275.383.0,0,7.0,0,,53447.0,1.0,1.0,1,21,122796.0,,39,34,windows10,x64,10.0.0.0,16299,256,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Pro,1.0,0,,0.0,117.0,ExistsNotSet,1.0,1.0,0.0,3.0,1,112017,1520315836
088e53a33c98e2787eca4383d2799a78,win8defender,1.1.15200.1,4.10.209.0,1.275.213.0,0,7.0,0,,46901.0,2.0,2.0,1,93,13354.0,,119,64,windows8,x64,6.3.0.0,9600,768,windows8.1,9600.19101.amd64fre.winblue_ltsb_escrow.180718-1800,Home,1.0,0,,0.0,333.0,RequireAdmin,1.0,1.0,0.0,8.0,1,82018,1525440628
0656a968ce18723ff3d3e9887f0abd77,win8defender,1.1.15100.1,4.18.1807.18075,1.273.806.0,0,7.0,0,,53447.0,1.0,1.0,1,29,143155.0,18.0,35,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Education,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,0.0,10.0,1,42016,1490438726
09cfba6e56fd3d20538cab0d1bf99806,win8defender,1.1.15100.1,4.11.15063.0,1.273.861.0,0,7.0,0,,53447.0,1.0,1.0,1,152,7470.0,27.0,184,69,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,105.0,,1.0,1.0,0.0,1.0,1,52016,1511126815
057907d6a93702b77464a637de412ce4,win8defender,1.1.15200.1,4.16.17656.18052,1.275.363.0,0,7.0,0,,53447.0,1.0,1.0,1,201,66673.0,27.0,267,251,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,ExistsNotSet,1.0,1.0,0.0,11.0,1,112016,1514421618
097c675b62c517169d4deacb2abddd74,win8defender,1.1.15100.1,4.16.17656.18052,1.273.1264.0,0,7.0,0,,47238.0,2.0,1.0,1,60,86819.0,,240,233,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,108.0,,1.0,1.0,0.0,15.0,0,42017,1456582749
0a47676955d453c1eb4b4f0a4304ac53,win8defender,1.1.14700.5,4.12.17007.18022,1.265.206.0,0,7.0,0,,23796.0,2.0,1.0,1,110,1931.0,18.0,211,182,windows10,x86,10.0.0.0,16299,256,rs3,16299.15.x86fre.rs3_release.170928-1534,Pro,0.0,0,,0.0,117.0,RequireAdmin,1.0,1.0,0.0,3.0,0,12018,1517315890
06029134a6671e6790350bcb9bf2434b,win8defender,1.1.15200.1,4.18.1807.18075,1.275.884.0,0,7.0,0,,53447.0,1.0,1.0,1,205,75528.0,,274,253,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,0.0,3.0,1,12017,1459684098
09fcf4cc35a6a3974d3a243b98bbfce7,win8defender,1.1.15200.1,4.18.1807.18075,1.275.699.0,0,7.0,0,,53447.0,1.0,1.0,1,60,15034.0,27.0,240,233,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Home,1.0,0,,0.0,117.0,,1.0,1.0,0.0,15.0,1,62018,1460681618
0a249c0acc1e9c7ec374fb36be998078,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1369.0,0,7.0,0,,,,,1,51,40629.0,,98,103,windows2016,x64,10.0.0.0,14393,272,rs1,14393.2068.amd64fre.rs1_release.180209-1727,Invalid,,0,,0.0,98.0,Off,0.0,1.0,0.0,6.0,1,102017,1474110609


The view above is only temporary: it belongs to the current Spark session and disappears when that session ends, for example on cluster restart. However, we also have the option of creating a permanent table.

In [29]:
%sql

DROP TABLE IF EXISTS malware_train_permanent; 
CREATE TABLE malware_train_permanent
USING CSV
OPTIONS (path "/mnt/databricks-workshop-datasets/End-to-End-ML-Lifecycle/workshop_train_sample.csv/", header "true", inferSchema "true")

This time we are going to read the same file.

The difference is that we define the schema beforehand, so Spark does not need to run an extra job over the data to infer the column types.

In [31]:
# What schema did Spark infer?
df.schema

#or alternatively
#df.printSchema()

In [32]:
# Import only the types used in the schema definition below
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

csvSchema = StructType([
  StructField("MachineIdentifier", StringType(), True),
  StructField("ProductName", StringType(), True),
  StructField("EngineVersion", StringType(), True),
  StructField("AppVersion", StringType(), True),
  StructField("AvSigVersion", StringType(), True),
  StructField("IsBeta", IntegerType(), True),
  StructField("RtpStateBitfield", IntegerType(), True),
  StructField("IsSxsPassiveMode", IntegerType(), True),
  StructField("DefaultBrowsersIdentifier", IntegerType(), True),
  StructField("AVProductStatesIdentifier", IntegerType(), True),
  StructField("AVProductsInstalled", IntegerType(), True),
  StructField("AVProductsEnabled", IntegerType(), True),
  StructField("HasTpm", IntegerType(), True),
  StructField("CountryIdentifier", IntegerType(), True),
  StructField("CityIdentifier", IntegerType(), True),
  StructField("OrganizationIdentifier", IntegerType(), True),
  StructField("GeoNameIdentifier", IntegerType(), True),
  StructField("LocaleEnglishNameIdentifier", IntegerType(), True),
  StructField("platfrm", StringType(), True),
  StructField("prcsr", StringType(), True),
  StructField("OsVer", StringType(), True),
  StructField("OsBuild", IntegerType(), True),
  StructField("OsSuite", IntegerType(), True),
  StructField("OsPlatformSubRelease", StringType(), True),
  StructField("OsBuildLab", StringType(), True),
  StructField("SkuEdition", StringType(), True),
  StructField("IsProtected", IntegerType(), True),
  StructField("AutoSampleOptIn", IntegerType(), True),
  StructField("PuaMode", StringType(), True),
  StructField("SMode", IntegerType(), True),
  StructField("IeVerIdentifier", IntegerType(), True),
  StructField("SmartScreen", StringType(), True),
  StructField("Firewall", IntegerType(), True),
  StructField("UacLuaenable", IntegerType(), True),
  StructField("Wdft_IsGamer", IntegerType(), True),
  StructField("Wdft_RegionIdentifier", IntegerType(), True),
  StructField("HasDetections", IntegerType(), True),
  StructField("serviceDate", IntegerType(), True),
  StructField("recentIncident", IntegerType(), True)])
  
# Read in our data using the schema defined above.
# We can specify the schema, or rather the `StructType`, with the `schema(..)` command:

malwareDF = (spark.read                   # The DataFrameReader
  .option('header', 'true')   # Ignore line #1 - it's a header
  .schema(csvSchema)          # Use the specified schema
  .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
)
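
A long `StructType` definition can also be written more compactly as a DDL-formatted string, assuming a Spark version whose `DataFrameReader.schema(...)` accepts one. The sketch below builds such a string in plain Python for the first few columns. The helper name `to_ddl` is hypothetical.

```python
def to_ddl(columns):
    """Join (name, type) pairs into a DDL schema string such as 'a STRING, b INT'."""
    return ", ".join("{} {}".format(name, sql_type) for name, sql_type in columns)


# First few columns of the workshop schema, written as (name, SQL type) pairs.
ddl = to_ddl([
    ("MachineIdentifier", "STRING"),
    ("ProductName", "STRING"),
    ("IsBeta", "INT"),
])
print(ddl)
```

Under that assumption, the resulting string would be passed as `.schema(ddl)` in place of the `StructType` object.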

In [33]:
display(malwareDF)

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,platfrm,prcsr,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,serviceDate,recentIncident
08001e93f7bdc641010343d64fe4020c,win8defender,1.1.15200.1,4.18.1807.18075,1.275.383.0,0,7.0,0,,53447.0,1.0,1.0,1,21,122796.0,,39,34,windows10,x64,10.0.0.0,16299,256,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Pro,1.0,0,,0.0,117.0,ExistsNotSet,1.0,1.0,0.0,3.0,1,112017,1520315836
088e53a33c98e2787eca4383d2799a78,win8defender,1.1.15200.1,4.10.209.0,1.275.213.0,0,7.0,0,,46901.0,2.0,2.0,1,93,13354.0,,119,64,windows8,x64,6.3.0.0,9600,768,windows8.1,9600.19101.amd64fre.winblue_ltsb_escrow.180718-1800,Home,1.0,0,,0.0,333.0,RequireAdmin,1.0,1.0,0.0,8.0,1,82018,1525440628
0656a968ce18723ff3d3e9887f0abd77,win8defender,1.1.15100.1,4.18.1807.18075,1.273.806.0,0,7.0,0,,53447.0,1.0,1.0,1,29,143155.0,18.0,35,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Education,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,0.0,10.0,1,42016,1490438726
09cfba6e56fd3d20538cab0d1bf99806,win8defender,1.1.15100.1,4.11.15063.0,1.273.861.0,0,7.0,0,,53447.0,1.0,1.0,1,152,7470.0,27.0,184,69,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,105.0,,1.0,1.0,0.0,1.0,1,52016,1511126815
057907d6a93702b77464a637de412ce4,win8defender,1.1.15200.1,4.16.17656.18052,1.275.363.0,0,7.0,0,,53447.0,1.0,1.0,1,201,66673.0,27.0,267,251,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,ExistsNotSet,1.0,1.0,0.0,11.0,1,112016,1514421618
097c675b62c517169d4deacb2abddd74,win8defender,1.1.15100.1,4.16.17656.18052,1.273.1264.0,0,7.0,0,,47238.0,2.0,1.0,1,60,86819.0,,240,233,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,108.0,,1.0,1.0,0.0,15.0,0,42017,1456582749
0a47676955d453c1eb4b4f0a4304ac53,win8defender,1.1.14700.5,4.12.17007.18022,1.265.206.0,0,7.0,0,,23796.0,2.0,1.0,1,110,1931.0,18.0,211,182,windows10,x86,10.0.0.0,16299,256,rs3,16299.15.x86fre.rs3_release.170928-1534,Pro,0.0,0,,0.0,117.0,RequireAdmin,1.0,1.0,0.0,3.0,0,12018,1517315890
06029134a6671e6790350bcb9bf2434b,win8defender,1.1.15200.1,4.18.1807.18075,1.275.884.0,0,7.0,0,,53447.0,1.0,1.0,1,205,75528.0,,274,253,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,0.0,3.0,1,12017,1459684098
09fcf4cc35a6a3974d3a243b98bbfce7,win8defender,1.1.15200.1,4.18.1807.18075,1.275.699.0,0,7.0,0,,53447.0,1.0,1.0,1,60,15034.0,27.0,240,233,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Home,1.0,0,,0.0,117.0,,1.0,1.0,0.0,15.0,1,62018,1460681618
0a249c0acc1e9c7ec374fb36be998078,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1369.0,0,7.0,0,,,,,1,51,40629.0,,98,103,windows2016,x64,10.0.0.0,14393,272,rs1,14393.2068.amd64fre.rs1_release.180209-1727,Invalid,,0,,0.0,98.0,Off,0.0,1.0,0.0,6.0,1,102017,1474110609
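
The sample rows shown in this document contain values such as `7.0` in columns that the schema declares as `IntegerType` (for example `RtpStateBitfield`). How Spark treats such values depends on the reader mode and Spark version, so it is worth verifying the parsed result on your own cluster. The sketch below is a quick, Spark-free way to spot affected columns in a raw CSV row. The helper name `non_integer_values` is hypothetical.

```python
def non_integer_values(header, row, int_columns):
    """Return {column: value} for declared-integer columns whose text is not a plain integer.

    Empty strings are skipped because they represent missing values.
    """
    problems = {}
    for name, value in zip(header, row):
        if name in int_columns and value != "":
            try:
                int(value)
            except ValueError:
                problems[name] = value
    return problems


# Four columns and their values taken from the first sample row above.
header = ["IsBeta", "RtpStateBitfield", "IsSxsPassiveMode", "DefaultBrowsersIdentifier"]
row = ["0", "7.0", "0", ""]
print(non_integer_values(header, row, set(header)))
```

If columns are flagged, declaring them as `DoubleType` in the schema is one option to consider.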


With our DataFrame created, we can now create a temporary view and then view the data via SQL.

In [35]:
#Create a view called malware_view
malwareDF.createOrReplaceTempView("malware_view")

In [36]:
%sql

SELECT * FROM malware_view

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,platfrm,prcsr,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,serviceDate,recentIncident
08001e93f7bdc641010343d64fe4020c,win8defender,1.1.15200.1,4.18.1807.18075,1.275.383.0,0,7.0,0,,53447.0,1.0,1.0,1,21,122796.0,,39,34,windows10,x64,10.0.0.0,16299,256,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Pro,1.0,0,,0.0,117.0,ExistsNotSet,1.0,1.0,0.0,3.0,1,112017,1520315836
088e53a33c98e2787eca4383d2799a78,win8defender,1.1.15200.1,4.10.209.0,1.275.213.0,0,7.0,0,,46901.0,2.0,2.0,1,93,13354.0,,119,64,windows8,x64,6.3.0.0,9600,768,windows8.1,9600.19101.amd64fre.winblue_ltsb_escrow.180718-1800,Home,1.0,0,,0.0,333.0,RequireAdmin,1.0,1.0,0.0,8.0,1,82018,1525440628
0656a968ce18723ff3d3e9887f0abd77,win8defender,1.1.15100.1,4.18.1807.18075,1.273.806.0,0,7.0,0,,53447.0,1.0,1.0,1,29,143155.0,18.0,35,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Education,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,0.0,10.0,1,42016,1490438726
09cfba6e56fd3d20538cab0d1bf99806,win8defender,1.1.15100.1,4.11.15063.0,1.273.861.0,0,7.0,0,,53447.0,1.0,1.0,1,152,7470.0,27.0,184,69,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,105.0,,1.0,1.0,0.0,1.0,1,52016,1511126815
057907d6a93702b77464a637de412ce4,win8defender,1.1.15200.1,4.16.17656.18052,1.275.363.0,0,7.0,0,,53447.0,1.0,1.0,1,201,66673.0,27.0,267,251,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,ExistsNotSet,1.0,1.0,0.0,11.0,1,112016,1514421618
097c675b62c517169d4deacb2abddd74,win8defender,1.1.15100.1,4.16.17656.18052,1.273.1264.0,0,7.0,0,,47238.0,2.0,1.0,1,60,86819.0,,240,233,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,108.0,,1.0,1.0,0.0,15.0,0,42017,1456582749
0a47676955d453c1eb4b4f0a4304ac53,win8defender,1.1.14700.5,4.12.17007.18022,1.265.206.0,0,7.0,0,,23796.0,2.0,1.0,1,110,1931.0,18.0,211,182,windows10,x86,10.0.0.0,16299,256,rs3,16299.15.x86fre.rs3_release.170928-1534,Pro,0.0,0,,0.0,117.0,RequireAdmin,1.0,1.0,0.0,3.0,0,12018,1517315890
06029134a6671e6790350bcb9bf2434b,win8defender,1.1.15200.1,4.18.1807.18075,1.275.884.0,0,7.0,0,,53447.0,1.0,1.0,1,205,75528.0,,274,253,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,0.0,3.0,1,12017,1459684098
09fcf4cc35a6a3974d3a243b98bbfce7,win8defender,1.1.15200.1,4.18.1807.18075,1.275.699.0,0,7.0,0,,53447.0,1.0,1.0,1,60,15034.0,27.0,240,233,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Home,1.0,0,,0.0,117.0,,1.0,1.0,0.0,15.0,1,62018,1460681618
0a249c0acc1e9c7ec374fb36be998078,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1369.0,0,7.0,0,,,,,1,51,40629.0,,98,103,windows2016,x64,10.0.0.0,14393,272,rs1,14393.2068.amd64fre.rs1_release.180209-1727,Invalid,,0,,0.0,98.0,Off,0.0,1.0,0.0,6.0,1,102017,1474110609


### 3. Summary of data

In [38]:
malwareDF.count()

In [39]:
%sql

SELECT COUNT(*) FROM malware_view

count(1)
88340


#### Columns
What do we have in terms of columns in our dataset?
Columns whose description is unavailable, or whose name is self-explanatory, are marked "NA".

**MachineIdentifier** - Individual machine ID  
**ProductName** - Defender state information e.g. win8defender  
**EngineVersion** - Defender state information e.g. 1.1.12603.0  
**AppVersion** - Defender state information e.g. 4.9.10586.0  
**AvSigVersion** - Defender state information e.g. 1.217.1014.0  
**IsBeta** - Defender state information e.g. false  
**RtpStateBitfield** - NA  
**IsSxsPassiveMode** - NA  
**DefaultBrowsersIdentifier** - ID for the machine's default browser  
**AVProductStatesIdentifier** - ID for the specific configuration of a user's antivirus software  
**AVProductsInstalled** - NA  
**AVProductsEnabled** - NA  
**HasTpm** - True if the machine has a TPM (Trusted Platform Module)  
**CountryIdentifier** - ID for the country the machine is located in  
**CityIdentifier** - ID for the city the machine is located in  
**OrganizationIdentifier** - ID for the organization the machine belongs to. The organization ID is mapped to both specific companies and broad industries  
**GeoNameIdentifier** - ID for the geographic region a machine is located in  
**LocaleEnglishNameIdentifier** - English name of the locale ID of the current user  
**Platform** - Platform name, calculated from OS-related properties and the processor property  
**Processor** - Processor architecture of the installed operating system  
**OsVer** - Version of the current operating system  
**OsBuild** - Build of the current operating system  
**OsSuite** - Product suite mask for the current operating system.  
**OsPlatformSubRelease** - Returns the OS Platform sub-release (Windows Vista, Windows 7, Windows 8, TH1, TH2)  
**OsBuildLab** - Build lab that generated the current OS. Example: 9600.17630.amd64fre.winblue_r7.150109-2022  
**SkuEdition** - The goal of this feature is to use the Product Type defined in the MSDN to map to a 'SKU-Edition' name that is useful in population reporting. The valid Product Types are defined in %sdxroot%\data\windowseditions.xml. This API has been used since Vista and Server 2008, so there are many Product Types that do not apply to Windows 10. The 'SKU-Edition' is a string value that is in one of three classes of results. The design must handle each class.  
**IsProtected** - Indicates whether a machine is protected. This is a calculated field derived from the Spynet Report's AV Products field. Returns: a. TRUE if there is at least one active and up-to-date antivirus product running on this machine. b. FALSE if there is no active AV product on this machine, or if the AV is active but is not receiving the latest updates. c. null if there are no antivirus products in the report.  
**AutoSampleOptIn** - This is the SubmitSamplesConsent value passed in from the service, available on CAMP 9+  
**PuaMode** - Pua Enabled mode from the service  
**SMode** - This field is set to true when the device is known to be in 'S Mode', as in, Windows 10 S mode, where only Microsoft Store apps can be installed  
**IeVerIdentifier** - NA  
**SmartScreen** - This is the SmartScreen enabled string value from the registry. It is obtained by checking, in order, HKLM\SOFTWARE\Policies\Microsoft\Windows\System\SmartScreenEnabled and HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\SmartScreenEnabled. If the value exists but is blank, the value "ExistsNotSet" is sent in telemetry.  
**Firewall** - This attribute is true (1) for Windows 8.1 and above if Windows Firewall is enabled, as reported by the service.  
**UacLuaenable** - This attribute reports whether the "administrator in Admin Approval Mode" user type is disabled or enabled in UAC. The value reported is obtained by reading the regkey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\EnableLUA.  
**Wdft_IsGamer** - Indicates whether the device is a gamer device or not based on its hardware combination.  
**Wdft_RegionIdentifier** - NA  
**serviceDate** - The date when this device was last serviced  
**recentIncident** - The unix timestamp of the most recent incident
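
Because recentIncident is described as a Unix timestamp, it can be turned into a readable date. The snippet below is a minimal plain-Python sketch, not part of the original notebook. It assumes the value is expressed in seconds and interprets it in UTC; `incident_time` is a hypothetical helper name, and the input is the recentIncident value of the first sample row shown in this notebook.

```python
from datetime import datetime, timezone


def incident_time(unix_seconds):
    """Convert a Unix timestamp (assumed to be in seconds) to a UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


# recentIncident of the first sample row (machine 08001e93...)
ts = incident_time(1520315836)
print(ts.isoformat())
```

Inside Spark, the `from_unixtime` function in `pyspark.sql.functions` performs a comparable conversion on a whole column.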

#### Transformations and Actions

In [42]:
#Let's create a new dataframe with the first 3 rows of our initial one
limitedMalwareDF = malwareDF.limit(3)

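As you can see, no Spark job ran for the cell above. This is because limit() is a Transformation: Spark evaluates transformations lazily, recording them in the query plan without executing anything.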

In [44]:
#Let's then show our new dataframe
limitedMalwareDF.show(100)

Now a Spark job is triggered. This is because show() is an Action: actions force Spark to execute the pending transformations and return a result.
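
The difference between describing work and running it can be illustrated without a cluster. The following is a plain-Python analogy using generators, not Spark code and not how Spark is implemented. `scan_source` and `rows_read` are hypothetical names introduced only for this sketch: `islice` plays the role of limit() and `list` plays the role of an action.

```python
from itertools import islice

rows_read = []  # records every row the "source" actually produces


def scan_source(n):
    """Hypothetical data source: yields n rows and logs each read."""
    for i in range(n):
        rows_read.append(i)
        yield {"id": i}


# "Transformation": islice only describes the work; nothing is read yet.
limited = islice(scan_source(1000), 3)
reads_before_action = len(rows_read)

# "Action": materializing the result forces the pipeline to run.
result = list(limited)
reads_after_action = len(rows_read)

print(reads_before_action, reads_after_action, result)
```

No rows are read until the result is materialized, and then only as many as the limit requires.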

#### Visualization

We can use the display function to browse through our data, and we can set up multiple plots that others can review later.

In [47]:
display(malwareDF)

MachineIdentifier,ProductName,EngineVersion,AppVersion,AvSigVersion,IsBeta,RtpStateBitfield,IsSxsPassiveMode,DefaultBrowsersIdentifier,AVProductStatesIdentifier,AVProductsInstalled,AVProductsEnabled,HasTpm,CountryIdentifier,CityIdentifier,OrganizationIdentifier,GeoNameIdentifier,LocaleEnglishNameIdentifier,platfrm,prcsr,OsVer,OsBuild,OsSuite,OsPlatformSubRelease,OsBuildLab,SkuEdition,IsProtected,AutoSampleOptIn,PuaMode,SMode,IeVerIdentifier,SmartScreen,Firewall,UacLuaenable,Wdft_IsGamer,Wdft_RegionIdentifier,HasDetections,serviceDate,recentIncident
08001e93f7bdc641010343d64fe4020c,win8defender,1.1.15200.1,4.18.1807.18075,1.275.383.0,0,7.0,0,,53447.0,1.0,1.0,1,21,122796.0,,39,34,windows10,x64,10.0.0.0,16299,256,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Pro,1.0,0,,0.0,117.0,ExistsNotSet,1.0,1.0,0.0,3.0,1,112017,1520315836
088e53a33c98e2787eca4383d2799a78,win8defender,1.1.15200.1,4.10.209.0,1.275.213.0,0,7.0,0,,46901.0,2.0,2.0,1,93,13354.0,,119,64,windows8,x64,6.3.0.0,9600,768,windows8.1,9600.19101.amd64fre.winblue_ltsb_escrow.180718-1800,Home,1.0,0,,0.0,333.0,RequireAdmin,1.0,1.0,0.0,8.0,1,82018,1525440628
0656a968ce18723ff3d3e9887f0abd77,win8defender,1.1.15100.1,4.18.1807.18075,1.273.806.0,0,7.0,0,,53447.0,1.0,1.0,1,29,143155.0,18.0,35,171,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Education,1.0,0,,0.0,137.0,RequireAdmin,1.0,1.0,0.0,10.0,1,42016,1490438726
09cfba6e56fd3d20538cab0d1bf99806,win8defender,1.1.15100.1,4.11.15063.0,1.273.861.0,0,7.0,0,,53447.0,1.0,1.0,1,152,7470.0,27.0,184,69,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,105.0,,1.0,1.0,0.0,1.0,1,52016,1511126815
057907d6a93702b77464a637de412ce4,win8defender,1.1.15200.1,4.16.17656.18052,1.275.363.0,0,7.0,0,,53447.0,1.0,1.0,1,201,66673.0,27.0,267,251,windows10,x64,10.0.0.0,17134,768,rs4,17134.1.amd64fre.rs4_release.180410-1804,Home,1.0,0,,0.0,137.0,ExistsNotSet,1.0,1.0,0.0,11.0,1,112016,1514421618
097c675b62c517169d4deacb2abddd74,win8defender,1.1.15100.1,4.16.17656.18052,1.273.1264.0,0,7.0,0,,47238.0,2.0,1.0,1,60,86819.0,,240,233,windows10,x64,10.0.0.0,15063,768,rs2,15063.0.amd64fre.rs2_release.170317-1834,Home,1.0,0,,0.0,108.0,,1.0,1.0,0.0,15.0,0,42017,1456582749
0a47676955d453c1eb4b4f0a4304ac53,win8defender,1.1.14700.5,4.12.17007.18022,1.265.206.0,0,7.0,0,,23796.0,2.0,1.0,1,110,1931.0,18.0,211,182,windows10,x86,10.0.0.0,16299,256,rs3,16299.15.x86fre.rs3_release.170928-1534,Pro,0.0,0,,0.0,117.0,RequireAdmin,1.0,1.0,0.0,3.0,0,12018,1517315890
06029134a6671e6790350bcb9bf2434b,win8defender,1.1.15200.1,4.18.1807.18075,1.275.884.0,0,7.0,0,,53447.0,1.0,1.0,1,205,75528.0,,274,253,windows10,x64,10.0.0.0,17134,256,rs4,17134.1.amd64fre.rs4_release.180410-1804,Pro,1.0,0,,0.0,137.0,,1.0,1.0,0.0,3.0,1,12017,1459684098
09fcf4cc35a6a3974d3a243b98bbfce7,win8defender,1.1.15200.1,4.18.1807.18075,1.275.699.0,0,7.0,0,,53447.0,1.0,1.0,1,60,15034.0,27.0,240,233,windows10,x64,10.0.0.0,16299,768,rs3,16299.431.amd64fre.rs3_release_svc_escrow.180502-1908,Home,1.0,0,,0.0,117.0,,1.0,1.0,0.0,15.0,1,62018,1460681618
0a249c0acc1e9c7ec374fb36be998078,win8defender,1.1.15100.1,4.18.1807.18075,1.273.1369.0,0,7.0,0,,,,,1,51,40629.0,,98,103,windows2016,x64,10.0.0.0,14393,272,rs1,14393.2068.amd64fre.rs1_release.180209-1727,Invalid,,0,,0.0,98.0,Off,0.0,1.0,0.0,6.0,1,102017,1474110609


In [48]:
display(malwareDF)


In [49]:
display(malwareDF)


## Next Step

[Cleaning Data]($./1-03 Cleaning Data)

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>