### Overview of `flatMap`

##### map()
The `map()` transformation applies a given function to each element of the RDD and returns a new RDD containing the results of these function applications. Each input element is transformed into exactly one output element.

**Characteristics:**
* One-to-One Transformation: Each element in the input RDD is mapped to one element in the output RDD.
* Output RDD Size: The output RDD has the same number of elements as the input RDD.

##### flatMap()
The `flatMap()` transformation applies a given function to each element of the RDD and returns a new RDD by flattening the results. The function returns an iterable (typically a list) holding zero or more output elements for each input element, and `flatMap()` concatenates these iterables into a single RDD.

**Characteristics:**
* One-to-Many Transformation: Each element in the input RDD can be mapped to zero or more elements in the output RDD.
* Output RDD Size: The output RDD can have more elements than the input RDD, fewer, or the same number.
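
The one-to-one versus zero-or-more distinction can be sketched in plain Python, without Spark. The helpers `local_map` and `local_flat_map` are hypothetical stand-ins written for this illustration; they are not PySpark APIs and only mimic the semantics on an in-memory list.

```python
from itertools import chain

def local_map(func, items):
    """Plain-Python analogue of RDD.map: exactly one output per input."""
    return [func(item) for item in items]

def local_flat_map(func, items):
    """Plain-Python analogue of RDD.flatMap: concatenate every result."""
    return list(chain.from_iterable(func(item) for item in items))

def split_words(sentence):
    # str.split() with no arguments returns [] for an empty string
    return sentence.split()

sentences = ["hello world", "", "apache spark rdd"]

mapped = local_map(split_words, sentences)
flattened = local_flat_map(split_words, sentences)

print(len(mapped))   # always equals len(sentences)
print(flattened)     # the empty sentence contributes no elements
```

Here the empty sentence still produces one (empty) list under `local_map`, but contributes nothing to the flattened result.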

##### Examples
Suppose we have an RDD:
```python
rdd = sc.parallelize(["hello world", "apache spark"])
```

Using `map()`:
```python
mapped_rdd = rdd.map(lambda sentence: sentence.split(" "))
print(mapped_rdd.collect())
```

Output:
```python
[['hello', 'world'], ['apache', 'spark']]
```


Using `flatMap()`:
```python
flat_mapped_rdd = rdd.flatMap(lambda sentence: sentence.split(" "))
print(flat_mapped_rdd.collect())
```

Output:
```python
['hello', 'world', 'apache', 'spark']
```


### Setup and Initialization

In [None]:
import re
from pyspark import SparkConf, SparkContext

def normalizeWords(text):
    # Lowercase the text, then split it on runs of non-word characters
    return re.compile(r'\W+', re.UNICODE).split(text.lower())


##### 1. Importing Libraries:
* `re`: Python's regular expression library, used here to split text into words.
* `SparkConf` and `SparkContext` from `pyspark`: Used to configure Spark and create the Spark context.

##### 2. Defining the Normalization Function:
* `normalizeWords(text)`: Takes a string (`text`) as input and returns a list of lowercase words.
* `re.compile(r'\W+', re.UNICODE).split(text.lower())`:
  * `re.compile(r'\W+', re.UNICODE)`: Compiles a regular expression pattern that matches one or more non-word characters (anything other than a letter, digit, or underscore). In Python 3, `str` patterns match Unicode by default, so the `re.UNICODE` flag is redundant but harmless.
  * `.split(text.lower())`: Converts the text to lowercase, then splits it wherever the pattern matches. If the text is empty, or starts or ends with non-word characters, the result contains empty strings.
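
The splitting behavior can be checked in plain Python before involving Spark. The sketch below repeats the `normalizeWords` definition so it runs on its own; `normalize_nonempty` is a hypothetical variant added only for illustration, and the sample string is made up.

```python
import re

def normalizeWords(text):
    # Same definition as in the cell above
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

def normalize_nonempty(text):
    # Hypothetical variant: drop the empty strings left by the split
    return [word for word in normalizeWords(text) if word]

sample = "Hello, World!"
print(normalizeWords(sample))      # ['hello', 'world', '']
print(normalizeWords(""))          # ['']
print(normalize_nonempty(sample))  # ['hello', 'world']
```

The trailing `!` leaves an empty string at the end of the list, and a blank line yields a single empty string, so empty "words" can reach the later counting step unless they are filtered out.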

### Configuration and SparkContext Initialization

In [None]:
conf = SparkConf().setMaster("local").setAppName("WordCountExplicit")
sc = SparkContext(conf = conf)

##### 3. Configuring Spark:

* `SparkConf().setMaster("local")`: Configures Spark to run locally with a single thread.
* `setAppName("WordCountExplicit")`: Names the Spark application "WordCountExplicit".

##### 4. Creating SparkContext:

* `sc = SparkContext(conf = conf)`: Initializes the Spark context with the specified configuration.

### Reading Input Data and Processing

In [None]:
input = sc.textFile("./book.txt")
words = input.flatMap(normalizeWords)
wordCounts = words.countByValue()

##### 5. Loading Data:

* `sc.textFile("./book.txt")`: Reads the text file at the specified path and creates an RDD named `input`. Each element of this RDD is one line of the file. Note that the name `input` shadows Python's built-in `input()` function for the rest of the session.

##### 6. FlatMap Transformation:

* `input.flatMap(normalizeWords)`:
  * `flatMap` applies the `normalizeWords` function to each line of the `input` RDD.
  * `normalizeWords` converts each line to lowercase and splits it into words on runs of non-word characters.
  * The resulting RDD (`words`) contains every word as an individual element, rather than one list of words per line.

##### 7. Counting Words:

* `words.countByValue()`:
  * `countByValue` is an action that counts the occurrences of each distinct word in the `words` RDD.
  * It returns a dictionary to the driver program in which the keys are words and the values are their counts.
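
The shape of the result can be previewed without Spark. This is a minimal local sketch that uses `collections.Counter` as a stand-in for `countByValue()`; the word list is made up and does not come from `book.txt`.

```python
from collections import Counter

# Hypothetical words, standing in for the elements of the `words` RDD
sample_words = ["spark", "is", "fast", "spark", "is", "fun"]

# Local stand-in for words.countByValue(): map each distinct value to its count
sample_counts = Counter(sample_words)

for word, count in sample_counts.items():
    print(word, count)
```

As with `countByValue()`, the whole result lives in the memory of a single Python process, so this approach suits cases where the number of distinct words is small enough to fit there.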

In [None]:
for word, count in wordCounts.items():
    # Drop non-ASCII characters; skip words that end up empty
    cleanWord = word.encode('ascii', 'ignore')
    if cleanWord:
        print(cleanWord.decode() + " " + str(count))
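
The ASCII filtering in this final loop can be examined in isolation. The sketch below uses a made-up dictionary (`sample_counts` is hypothetical, not output from `book.txt`) and collects the lines into a list instead of printing them one by one.

```python
# Hypothetical counts standing in for the result of countByValue()
sample_counts = {
    "caf\u00e9": 2,   # "café": contains one non-ASCII character
    "spark": 3,       # pure ASCII
    "\u201c": 1,      # a lone curly quote: entirely non-ASCII
    "": 4,            # empty string left over from splitting
}

printed = []
for word, count in sample_counts.items():
    # errors='ignore' silently drops characters that cannot be encoded as ASCII
    clean_word = word.encode("ascii", "ignore")
    if clean_word:  # b"" is falsy, so empty results are skipped
        printed.append(clean_word.decode() + " " + str(count))

print(printed)  # ['caf 2', 'spark 3']
```

Dropping characters this way changes some words (here `café` becomes `caf`), so it trades fidelity for output that prints safely on an ASCII-only terminal.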