
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Reading Data - Text Files

**Technical Accomplishments:**
- Reading data from a simple text file

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "../Includes/Classroom-Setup"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading from Fixed-Width Text File

We can read in just about any file in which each record is delimited only by a new line, just as we saw with the CSV and JSON (or rather, JSON-Lines) formats.

To accomplish this, we can use `DataFrameReader.text(..)`, which returns a `DataFrame` with a single column named **value** of type **string**.

The difference is that we now need to take responsibility for parsing out the data in each "column" ourselves.

Common use cases include fixed-width files and Apache HTTP access logs. A fixed-width file calls for a sequence of substring operations, while an access log is better handled with a sequence of regular expressions that extract the value of each column. In either case, additional transformations are required, which we will go into later.
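For the access-log case, a minimal sketch might look like the following. It assumes the logs follow Apache's Common Log Format; the pattern, the column names in `LOG_COLUMNS`, and both helper functions are illustrative and not part of this course's setup. The plain-Python helper shows the per-line logic, and the second function applies the same pattern to the **value** column with `regexp_extract`.

```python
import re

# Assumed pattern for Apache's Common Log Format; real logs may use a different layout.
COMMON_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)$'

# Hypothetical column names, listed in capture-group order.
LOG_COLUMNS = ["host", "ident", "user", "timestamp",
               "method", "path", "protocol", "status", "bytes"]

def parse_log_line(line):
    """Parse one log line in plain Python; return None if the line does not match."""
    match = re.match(COMMON_LOG_PATTERN, line)
    if match is None:
        return None
    return dict(zip(LOG_COLUMNS, match.groups()))

def extract_log_columns(df):
    """Split the `value` column of a DataFrame from spark.read.text(..) into one column per field."""
    # Imported here so that parse_log_line can be tried without Spark installed.
    from pyspark.sql.functions import col, regexp_extract
    return df.select(*[
        regexp_extract(col("value"), COMMON_LOG_PATTERN, i + 1).alias(name)  # group indexes start at 1
        for i, name in enumerate(LOG_COLUMNS)
    ])
```

In a notebook, `extract_log_columns(spark.read.text(path))` would be called with whatever log path applies. Every extracted column is still a string, so numeric fields such as the status code would need a cast in a later step.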

For this example, we are going to create a `DataFrame` from the full text of the book *The Adventures of Tom Sawyer* by Mark Twain.

In [0]:
%fs ls /mnt/training/tom-sawyer/tom.txt

In [0]:
%fs head /mnt/training/tom-sawyer/tom.txt

In [0]:
textFile = "/mnt/training/tom-sawyer/tom.txt"

textDF = (spark.read        # The DataFrameReader
          .text(textFile)   # Creates a DataFrame from raw text after reading in the file
)

textDF.printSchema()

With the `DataFrame` created, we can view the data: one record for each line in the text file.

In [0]:
display(textDF)

As simple as this example is, it is also the basis for loading more complex text files, such as fixed-width text files.

We will see later exactly how to do this, but for each line that is read in, it is simply a matter of a few more transformations (such as taking substrings) to convert each line into something more meaningful.
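As a rough sketch of what those substring transformations might look like, the example below uses a made-up three-column layout. `LAYOUT`, `parse_fixed_width`, and `split_fixed_width` are hypothetical names, and the offsets do not describe any file used in this course. The first helper shows the slicing on a single line in plain Python, and the second applies the same layout to the **value** column with `substring`.

```python
# Hypothetical layout: (column name, 0-based start offset, width). Not taken from the course data.
LAYOUT = [("id", 0, 5), ("name", 5, 10), ("state", 15, 2)]

def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of trimmed string values."""
    return {name: line[start:start + width].strip()
            for name, start, width in layout}

def split_fixed_width(df, layout=LAYOUT):
    """Split the `value` column of a DataFrame from spark.read.text(..) into one column per field."""
    # Imported here so that parse_fixed_width can be tried without Spark installed.
    from pyspark.sql.functions import col, substring, trim
    return df.select(*[
        # substring(..) positions are 1-based, hence start + 1
        trim(substring(col("value"), start + 1, width)).alias(name)
        for name, start, width in layout
    ])
```

Keeping the layout in one list means the plain-Python check and the Spark version cannot drift apart. The only difference to watch for is the 0-based slice versus the 1-based `substring` position.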

Let's take a look at some of the other details of the `DataFrame` we just created, for comparison's sake.

In [0]:
print("Partitions: " + str(textDF.rdd.getNumPartitions()))
printRecordsPerPartition(textDF)
print("-"*80)

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "../Includes/Classroom-Cleanup"

## Next Steps

* [Reading Data #1 - CSV]($./Reading Data 1 - CSV)
* [Reading Data #2 - Parquet]($./Reading Data 2 - Parquet)
* [Reading Data #3 - Tables]($./Reading Data 3 - Tables)
* [Reading Data #4 - JSON]($./Reading Data 4 - JSON)
* Reading Data #5 - Text
* [Reading Data #6 - JDBC]($./Reading Data 6 - JDBC)
* [Reading Data #7 - Summary]($./Reading Data 7 - Summary)

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>