diff --git a/samples/features/sql-big-data-cluster/spark/README.md b/samples/features/sql-big-data-cluster/spark/README.md index 571bf29aaf..199ab595a4 100644 --- a/samples/features/sql-big-data-cluster/spark/README.md +++ b/samples/features/sql-big-data-cluster/spark/README.md @@ -1,29 +1,27 @@ # SQL Server big data clusters -The new built-in notebooks in Azure Data Studio enables data scientists and data engineers to run Python, R, Scala, or Spark SQL code against the cluster. +SQL Server Big Data Clusters bundle Spark and HDFS together with SQL Server. Azure Data Studio provides built-in notebooks that enable data scientists and data engineers to run Spark notebooks and jobs in Python, R, or Scala against the Big Data Cluster. This folder contains sample notebooks that show how to use Spark in a SQL Server Big Data Cluster. -## Instructions to open a notebook from Azure Data Studio and execute the commands +## Folder contents -1. Connect to the SQL Server Master instance in a big data cluster +[PySpark Hello World](dataloading/hello_PySpark.ipynb) -1. Right-click on the server name, select **Manage**, switch to **SQL Server Big Data Cluster** tab, and use open Notebook. +[Scala Hello World](dataloading/hello_Scala.ipynb) -1. Open the notebook in Azure Data Studio, wait for the “Kernel” and the target context (“Attach to”) to be populated. +[SparkR Hello World](dataloading/hello_sparkR.ipynb) -1. Run each cell in the Notebook sequentially. +[Data Loading - Transforming CSV to Parquet](dataloading/transform-csv-files.ipynb/) -## __[data-loading](data-loading/)__ +[Data Transfer - Spark to SQL using Spark JDBC connector](data-virtualization/spark_to_sql_jdbc.ipynb/) -This folder contains samples that show how to load data using Spark and query them using SQL statements.
+[Data Transfer - Spark to SQL using MSSQL Spark connector](spark_to_sql/mssql_spark_connector.ipynb/) -[data-loading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/) - -This samnple notebook shows how to transform CSV files in HDFS to parquet files. -[dataloading/spark-sql.ipynb](dataloading/spark-sql.ipynb/) +## Instructions on how to run in Azure Data Studio -This samnple notebook shows how to query hive tables created from Spark. +1. Open the sample notebook you want to run, for example [data-loading/transform-csv-files.ipynb](dataloading/transform-csv-files.ipynb/). -## __[data-virtualization](data-virtualization/)__ +2. From Azure Data Studio, connect to the SQL Server master instance in the big data cluster. +3. Right-click the server name, select **Manage**, switch to the **SQL Server Big Data Cluster** tab, and open the notebook in Azure Data Studio. Wait for the “Kernel” and the target context (“Attach to”) to be populated. If required, set the relevant “Kernel” (e.g. **PySpark3**); **Attach to** must be the IP address of your big data cluster endpoint. -This folder contains samples that show how to integrate Spark with other data sources. +4. Run each cell in the notebook sequentially. diff --git a/samples/features/sql-big-data-cluster/spark/data-virtualization/spark_to_sql_jdbc.ipynb b/samples/features/sql-big-data-cluster/spark/data-virtualization/spark_to_sql_jdbc.ipynb index 0531f80d7a..d574d79298 --- a/samples/features/sql-big-data-cluster/spark/data-virtualization/spark_to_sql_jdbc.ipynb +++ b/samples/features/sql-big-data-cluster/spark/data-virtualization/spark_to_sql_jdbc.ipynb @@ -19,7 +19,7 @@ "cells": [ { "cell_type": "markdown", - "source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large scale ETL in Spark and writing the processed data to SQLServer. The following samples shows \r\n- reading a HDFS file, \r\n- some basic processing on it and \r\n- then processed data to SQL Server table.\r\n\r\nNeed a database precreated in SQL for this sample.
Here we are using database name \"MyTestDatabase\" that can be created using SQL statements below.\r\n\r\n``` sql\r\nCreate DATABASE MyTestDatabase\r\nGO \r\n``` \r\n ", + "source": "# Read and write from Spark to SQL\r\nA typical big data scenario is large-scale ETL in Spark, after which the processed data is written out to SQL Server for access by LOB applications. This sample shows how to write to SQL Server from Spark. The main steps in the sample are:\r\n- Reading an HDFS file, \r\n- Doing some basic processing on it, and \r\n- Writing the processed data to a SQL Server table using JDBC.\r\n\r\nPrerequisites: \r\n- The sample uses a SQL database named \"MyTestDatabase\". Create this before you run this sample. The database can be created as follows:\r\n ``` sql\r\n Create DATABASE MyTestDatabase\r\n GO \r\n ``` \r\n- Download [AdultCensusIncome.csv]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ) to your local machine. Create an HDFS folder named spark_data and upload the file there. \r\n\r\n \r\n ", "metadata": {} }, { @@ -28,12 +28,30 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", - "text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9|
Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows" + "text": "Starting Spark application\n", + "output_type": "stream" + }, + { + "data": { + "text/plain": "", + "text/html": "\n
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
2application_1554755839506_0003pyspark3idleLinkLink
" + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "text": "SparkSession available as 'spark'.\n", + "output_type": "stream" + }, + { + "name": "stdout", + "text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education-num| marital-status| occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows", + "output_type": "stream" } ], - "execution_count": 8 + "execution_count": 3 }, { "cell_type": "code", @@ -41,25 +59,25 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", - "text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| 
marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows" + "text": "+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| 
Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows", + "output_type": "stream" } ], - "execution_count": 9 + "execution_count": 4 }, { "cell_type": "code", - "source": "#Write from Spark to SQL table using JDBC\r\nprint(\"Use build in JDBC connector to write to SQLServer master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://mssql-master-pool-0.service-master-pool\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\nc = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"****\"\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n df.write \\\r\n .format(\"jdbc\") \\\r\n .mode(\"overwrite\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password)\\\r\n .save()\r\nexcept ValueError as error :\r\n print(\"JDBC Write failed\", error)\r\n\r\nprint(\"JDBC Write done \")\r\n\r\n\r\n", + "source": "#Write from Spark to SQL table using JDBC\r\nprint(\"Use build in JDBC connector to write to SQLServer master instance in Big data \")\r\n\r\nservername = \"jdbc:sqlserver://master-0.master-svc\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\ndbtable = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"****\" # Please specify password here\r\n\r\nprint(\"url is \", url)\r\n\r\ntry:\r\n df.write \\\r\n .format(\"jdbc\") \\\r\n .mode(\"overwrite\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\",
user) \\\r\n .option(\"password\", password)\\\r\n .save()\r\nexcept ValueError as error :\r\n print(\"JDBC Write failed\", error)\r\n\r\nprint(\"JDBC Write done \")\r\n\r\n\r\n", "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", - "text": "Use build in JDBC connector to write to SQLServer master instance in Big data \nurl is jdbc:sqlserver://mssql-master-pool-0.service-master-pool;databaseName=MyTestDatabase;\nJDBC Write done" + "text": "Use build in JDBC connector to write to SQLServer master instance in Big data \nurl is jdbc:sqlserver://master-0.master-svc;databaseName=MyTestDatabase;\nJDBC Write done", + "output_type": "stream" } ], - "execution_count": 10 + "execution_count": 9 }, { "cell_type": "code", @@ -67,12 +85,12 @@ "metadata": {}, "outputs": [ { - "output_type": "stream", "name": "stdout", - "text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| 
Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows" + "text": "read data from SQL server table \n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n|age| workclass|fnlwgt|education|education_num| marital_status| occupation| relationship| race| sex|capital_gain|capital_loss|hours_per_week|native_country|income|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\n| 39| State-gov| 77516|Bachelors| 13| Never-married| Adm-clerical|Not-in-family|White| Male| 2174| 0| 40| United-States| <=50K|\n| 50|Self-emp-not-inc| 83311|Bachelors| 13|Married-civ-spouse| Exec-managerial| Husband|White| Male| 0| 0| 13| United-States| <=50K|\n| 38| Private|215646| HS-grad| 9| Divorced|Handlers-cleaners|Not-in-family|White| Male| 0| 0| 40| United-States| <=50K|\n| 53| Private|234721| 11th| 7|Married-civ-spouse|Handlers-cleaners| Husband|Black| Male| 0| 0| 40| United-States| <=50K|\n| 28| Private|338409|Bachelors| 13|Married-civ-spouse| Prof-specialty| Wife|Black|Female| 0| 0| 40| Cuba| <=50K|\n+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+\nonly showing top 5 rows", + "output_type": "stream" } ], - "execution_count": 13 + "execution_count": 11 } ] } \ No newline at end of file diff --git a/samples/features/sql-big-data-cluster/spark/spark_to_sql/mssql_spark_connector.ipynb b/samples/features/sql-big-data-cluster/spark/spark_to_sql/mssql_spark_connector.ipynb new file mode 100644 index 0000000000..def8a13a1c 
--- /dev/null +++ b/samples/features/sql-big-data-cluster/spark/spark_to_sql/mssql_spark_connector.ipynb @@ -0,0 +1,81 @@ +{ + "metadata": { + "kernelspec": { + "name": "pyspark3kernel", + "display_name": "PySpark3" + }, + "language_info": { + "name": "pyspark3", + "mimetype": "text/x-python", + "codemirror_mode": { + "name": "python", + "version": 3 + }, + "pygments_lexer": "python3" + } + }, + "nbformat_minor": 2, + "nbformat": 4, + "cells": [ + { + "cell_type": "markdown", + "source": "# Read and write from Spark to SQL using the MSSQL JDBC Connector\r\nIn a typical big data scenario, a key usage pattern is high-volume, high-velocity, and high-variety data processing in Spark, followed by batch or streaming writes to SQL for access by LOB applications. These usage patterns greatly benefit from a connector that utilizes key SQL optimizations and provides an efficient and reliable write to a SQL Server Big Data Cluster or SQL DB. \r\n\r\nThe MSSQL JDBC connector, referenced by the name com.microsoft.sqlserver.jdbc.spark, uses [SQL Server bulk copy APIs](https://docs.microsoft.com/en-us/sql/connect/jdbc/using-bulk-copy-with-the-jdbc-driver?view=sql-server-2017#sqlserverbulkcopyoptions) to implement an efficient write to SQL Server. The connector is based on the Spark data source APIs and provides a familiar JDBC interface for access.\r\n\r\nThe following sample shows how to use the MSSQL JDBC connector for writing to and reading from a SQL source. In this sample we'll\r\n- Read a file from HDFS and do some basic processing,\r\n- Write the dataframe to a SQL Server table using the MSSQL connector, and\r\n- Read the table back using the MSSQL connector.\r\n\r\nPrerequisites: \r\n- The sample uses a SQL database named \"MyTestDatabase\". Create this before you run this sample. The database can be created as follows:\r\n ``` sql\r\n Create DATABASE MyTestDatabase\r\n GO \r\n ``` \r\n- Download [AdultCensusIncome.csv]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ) to your local machine. Create an HDFS folder named spark_data and upload the file there. \r\n- Configure the Spark session to use the MSSQL connector jar. The jar can be found at /jar/spark-mssql-connector-assembly-1.0.0.jar after deployment of the Big Data Cluster.\r\n\r\n``` sh\r\n %%configure -f\r\n {\"conf\": {\"spark.jars\": \"/jar/spark-mssql-connector-assembly-1.0.0.jar\"}}\r\n```\r\n\r\n \r\n ", + "metadata": {} + }, + { + "cell_type": "markdown", + "source": "# Configure the notebook to use the MSSQL Spark connector\r\nThis step will be removed in subsequent CTPs. As of CTP 2.5 this step is required to point the Spark session to the relevant jar.\r\n ", + "metadata": {} + }, + { + "cell_type": "code", + "source": "%%configure -f\r\n{\"conf\": {\"spark.jars\": \"/jar/spark-mssql-connector-assembly-1.0.0.jar\"}}\r\n\r\n\r\n\r\n", + "metadata": {}, + "outputs": [], + "execution_count": 4 + }, + { + "cell_type": "markdown", + "source": "# Read data into a data frame\r\nIn this step we read the data into a data frame and do some basic cleanup steps. \r\n\r\n", + "metadata": {} + }, + { + "cell_type": "code", + "source": "#Read a file and then write it to the SQL table\r\ndatafile = \"/spark_data/AdultCensusIncome.csv\"\r\ndf = spark.read.format('csv').options(header='true', inferSchema='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true').load(datafile)\r\ndf.show(5)\r\n", + "metadata": {}, + "outputs": [], + "execution_count": 6 + }, + { + "cell_type": "code", + "source": "\r\n#Process this data. Very simple data cleanup steps.
Replacing \"-\" with \"_\" in column names\r\ncolumns_new = [col.replace(\"-\", \"_\") for col in df.columns]\r\ndf = df.toDF(*columns_new)\r\ndf.show(5)\r\n\r\n", + "metadata": {}, + "outputs": [], + "execution_count": 8 + }, + { + "cell_type": "markdown", + "source": "# Write dataframe to SQL using MSSQL Spark Connector", + "metadata": {} + }, + { + "cell_type": "code", + "source": "#Write from Spark to SQL table using MSSQL Spark Connector\r\nprint(\"Use MSSQL connector to write to master SQL instance \")\r\n\r\nservername = \"jdbc:sqlserver://master-0.master-svc\"\r\ndbname = \"MyTestDatabase\"\r\nurl = servername + \";\" + \"databaseName=\" + dbname + \";\"\r\n\r\ndbtable = \"dbo.AdultCensus\"\r\nuser = \"sa\"\r\npassword = \"****\" # Please specify password here\r\n\r\n\r\ntry:\r\n df.write \\\r\n .format(\"com.microsoft.sqlserver.jdbc.spark\") \\\r\n .mode(\"overwrite\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password)\\\r\n .save()\r\nexcept ValueError as error :\r\n print(\"MSSQL Connector write failed\", error)\r\n\r\nprint(\"MSSQL Connector write succeeded \")\r\n\r\n\r\n", + "metadata": {}, + "outputs": [], + "execution_count": 10 + }, + { + "cell_type": "markdown", + "source": "# Read SQL Table using MSSQL Spark connector\r\nThe following code uses the connector to read the table. To confirm the write, you can also check the table directly using SQL.", + "metadata": {} + }, + { + "cell_type": "code", + "source": "#Read from SQL table using MSSQL Connector\r\nprint(\"read data from SQL server table \")\r\njdbcDF = spark.read \\\r\n .format(\"com.microsoft.sqlserver.jdbc.spark\") \\\r\n .option(\"url\", url) \\\r\n .option(\"dbtable\", dbtable) \\\r\n .option(\"user\", user) \\\r\n .option(\"password\", password) \\\r\n .load()\r\n\r\njdbcDF.show(5)", + "metadata": {}, + "outputs": [], + "execution_count": 11 + } + ] +} \ No newline at end of file diff --git a/samples/features/sql-big-data-cluster/spark/sparkml/train_score_export_ml_models_with_spark.ipynb b/samples/features/sql-big-data-cluster/spark/sparkml/train_score_export_ml_models_with_spark.ipynb index 26567e6d8c..df4bd26895 100644 --- a/samples/features/sql-big-data-cluster/spark/sparkml/train_score_export_ml_models_with_spark.ipynb +++ b/samples/features/sql-big-data-cluster/spark/sparkml/train_score_export_ml_models_with_spark.ipynb @@ -24,7 +24,7 @@ }, { "cell_type": "markdown", - "source": "## Step 1 - Explore your data\r\n### Load the data\r\nFor this example we'll use **AdultCensusIncome** data from [here]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ). From your Azure Data Studio connect to the HDFS/Spark gateway and create a directory called spark_ml under HDFS. \r\nDownload [AdultCensusIncome.csv]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ) to your local machine and upload to HDFS.Upload AdultCensusIncome.csv to the folder we created.\r\n\r\n### Exploratory Analysis\r\n- Baisc exploration on the data\r\n- Labels & Features\r\n1. **Label** - This refers to predicted value. This is represented as a column in the data. Label is **income** \r\n2. **Features** - This refers to the characteristics that are used to predict.
**age** and **hours_per_week**\r\n\r\nNote : In reality features are chosen by applying some correlations techniques to understand what best characterize the Label we are predicting.\r\n\r\n### The Model we will build\r\nIn AdultCensusIncome.csv contains several columsn like Income range, age, hours-per-week, education, occupation etc. We'll build a model that can predict income range would be >50K or <50K.\r\n", + "source": "## Step 1 - Explore your data\r\n### Load the data\r\nFor this example we'll use **AdultCensusIncome** data from [here]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ). From Azure Data Studio, connect to the HDFS/Spark gateway and create a directory called spark_data under HDFS. \r\nDownload [AdultCensusIncome.csv]( https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv ) to your local machine and upload it to the spark_data folder we created in HDFS.\r\n\r\n### Exploratory Analysis\r\n- Basic exploration of the data\r\n- Labels & Features\r\n1. **Label** - This refers to the predicted value, represented as a column in the data. The label is **income**. \r\n2. **Features** - These are the characteristics used to predict: **age** and **hours_per_week**.\r\n\r\nNote: In reality, features are chosen by applying correlation techniques to understand what best characterizes the label we are predicting.\r\n\r\n### The Model we will build\r\nAdultCensusIncome.csv contains several columns like income range, age, hours-per-week, education, occupation, etc. We'll build a model that predicts whether the income range is >50K or <=50K.\r\n", "metadata": {} }, {
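The notebooks in this change share a small amount of pre-Spark helper logic: building the JDBC connection URL and replacing "-" with "_" in column names so they are valid SQL identifiers. The sketch below isolates that logic as plain Python so it can be checked without a cluster. `make_jdbc_url` and `clean_column_names` are hypothetical helper names introduced here for illustration; the server name, database, and table are the sample's placeholders; and the commented `df.write` call requires a live Spark session with the connector jar configured.

```python
# Minimal sketch of the helper logic the sample cells use (assumptions noted above).

def make_jdbc_url(servername: str, dbname: str) -> str:
    # Build a SQL Server JDBC URL exactly as the sample cells concatenate it.
    return servername + ";" + "databaseName=" + dbname + ";"

def clean_column_names(columns):
    # Replace "-" with "_" so CSV headers become valid SQL column names.
    return [col.replace("-", "_") for col in columns]

url = make_jdbc_url("jdbc:sqlserver://master-0.master-svc", "MyTestDatabase")
print(url)  # jdbc:sqlserver://master-0.master-svc;databaseName=MyTestDatabase;

cols = clean_column_names(["education-num", "marital-status", "age"])
print(cols)  # ['education_num', 'marital_status', 'age']

# With a live Spark session, the notebooks then apply the same logic:
# df = df.toDF(*clean_column_names(df.columns))
# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#     .mode("overwrite").option("url", url).option("dbtable", "dbo.AdultCensus") \
#     .option("user", user).option("password", password).save()
```

Keeping URL construction and column cleanup in small helpers like this makes the two notebooks (JDBC and MSSQL connector) trivially consistent, since only the `format(...)` string differs between them.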