* This line imports the necessary classes from the PySpark library. `SparkConf` configures the Spark application, and `SparkContext` is the entry point to Spark functionality.

In [20]:
from pyspark import SparkConf, SparkContext

#### These lines set up the Spark configuration and initialize the Spark context.

In [None]:
conf = SparkConf().setMaster("local").setAppName("NumberFriendsByAge")
sc = SparkContext(conf = conf)

* `setMaster("local")` tells Spark to run locally with one thread (no distributed computing).
* `setAppName("NumberFriendsByAge")` names the Spark application "NumberFriendsByAge". This name is visible in the Spark web UI and logs. A meaningful application name makes it easier for you and others to manage and debug the Spark application.
* `SparkContext(conf = conf)` initializes the Spark context with the given configuration.

In [21]:
def PareLine(line):
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

# The `PareLine` function takes a line of text from the input data (which is expected to be CSV), splits it on commas, and extracts the `age` and `numFriends` fields.
# It returns a tuple `(age, numFriends)`.
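As a quick check of what the function returns, here is a minimal sketch that calls it on a single sample row in plain Python (no Spark needed). The function body is repeated so the snippet runs on its own, and the field layout in the comment (id, name, age, number of friends) is an assumption based on the sample rows shown in this notebook.

```python
def PareLine(line):
    # Assumed column layout: id, name, age, number of friends.
    fields = line.split(',')
    age = int(fields[2])
    numFriends = int(fields[3])
    return (age, numFriends)

sample = "0,Will,33,385"
result = PareLine(sample)
print(result)
```

Note that a row with a missing or non-numeric field would raise an error in `int(...)`, so this sketch assumes clean input.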

In [None]:
lines = sc.textFile("./fakefriends.csv")

In [None]:
rdd = lines.map(PareLine)
# This applies the `PareLine` function to each element of the `lines` RDD, transforming it into a new RDD where each element is a tuple `(age, numFriends)`.

### Overview of `map` Transformation
The `map` transformation is one of the most commonly used operations in Apache Spark. It applies a given function to each element of an RDD, resulting in a new RDD containing the transformed elements.


In [None]:
totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

### Detailed Explanation
**1. Input RDD: `lines`**
* Type: `RDD[String]`
* Content: Each element is a line of text from the `fakefriends.csv` file. For example, if the file contains the following lines:
```text
0,Will,33,385
1,Jean,26,2
2,Jane,45,2
```
then the `lines` RDD will contain
```text
["0,Will,33,385", "1,Jean,26,2", "2,Jane,45,2"]
```

**2. Applying `map(PareLine)`**
The `map` transformation applies the `PareLine` function to each element of the `lines` RDD. This results in a new RDD where each line has been transformed from a string into a tuple containing the age and the number of friends.

### Visual Representation
Here is a visual representation of how `map(PareLine)` transforms the data:
```text
Input RDD (lines):
["0,Will,33,385", "1,Jean,26,2", "2,Jane,45,2"]

After map(PareLine):
[(33, 385), (26, 2), (45, 2)]
```
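The same element-by-element behaviour can be sketched without Spark using Python's built-in `map` on an ordinary list. This is only an analogy: the list named `lines` below stands in for the RDD, and nothing here is distributed.

```python
def PareLine(line):
    # Same parsing logic as the notebook's function, repeated for a standalone run.
    fields = line.split(',')
    return (int(fields[2]), int(fields[3]))

# A plain list standing in for the `lines` RDD.
lines = ["0,Will,33,385", "1,Jean,26,2", "2,Jane,45,2"]

# Apply the function to every element, producing one output per input.
parsed = list(map(PareLine, lines))
print(parsed)
```

As with the RDD version, the output has exactly as many elements as the input, one tuple per line.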

### Why use `map`?
The `map` transformation is useful for:
* **Data Cleaning**: Converting raw data into a more useful format.
* **Feature Extraction**: Extracting relevant fields from input data.
* **Transformation**: Applying any kind of transformation or computation to each element.

### Step-by-step Explanation
#### 1. Input RDD: `rdd`
* Type: `RDD[(int, int)]`
* Content: Each element is a tuple `(age, numFriends)`
* Example: `[(33, 385), (26, 2), (45, 2)]`

#### 2. Transformation: `mapValues(lambda x: (x,1))`
* Purpose: To transform each value in the RDD while keeping the key unchanged.
* Function: `lambda x: (x, 1)`
  * Input: `x` is the number of friends.
  * Output: A tuple `(x, 1)` where `x` is the number of friends and `1` is the count.

**Results**: The resulting RDD will have the same keys (age) but the values will be transformed into tuples `(numFriends, 1)`.

**Example**:
* Before `mapValues`: `[(33, 385), (26, 2), (45, 2)]`
* After `mapValues`: `[(33, (385, 1)), (26, (2, 1)), (45, (2, 1))]`
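To make the "keys stay, values change" behaviour concrete, here is a minimal plain-Python sketch. The helper `map_values` is hypothetical (it is not part of PySpark) and only imitates what `mapValues` does on a list of key-value pairs.

```python
def map_values(pairs, func):
    # Hypothetical helper: apply func to each value and leave each key untouched.
    return [(key, func(value)) for key, value in pairs]

# A plain list standing in for the (age, numFriends) RDD.
pairs = [(33, 385), (26, 2), (45, 2)]

with_counts = map_values(pairs, lambda x: (x, 1))
print(with_counts)
```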

#### 3. Transformation: `reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))`
* Purpose: To aggregate the values for each key (age) by summing up the number of friends and the count.
* Function: `lambda x, y: (x[0] + y[0], x[1] + y[1])`
  * Input: `x` and `y` are tuples of the form `(numFriends, count)`.
  * Output: A new tuple where:
    * The first element is the sum of the number of friends.
    * The second element is the sum of the counts.

**Explanation**: `reduceByKey` groups the values by key (age) and applies the given function pairwise to combine all the values for each key into one.

**Resulting RDD**: 
* After `reduceByKey`, the RDD contains tuples where the first element is the age and the second element is a tuple with the total number of friends and the count of entries for that age.

**Example**: If the input also contained the entries `(33, 10)` and `(26, 8)`, the result would be `[(33, (395, 2)), (26, (10, 2)), (45, (2, 1))]` (the order of elements may vary).
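The per-key aggregation can be sketched in plain Python with a dictionary. The helper `reduce_by_key` is hypothetical and only imitates the semantics of `reduceByKey` on a small list. The two extra rows for ages 33 and 26 are made-up illustration data, chosen so that the totals match the example result above.

```python
def reduce_by_key(pairs, func):
    # Hypothetical helper: fold all values that share a key using func.
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = func(combined[key], value)
        else:
            combined[key] = value
    return list(combined.items())

# (age, (numFriends, 1)) pairs; the last two rows are illustrative additions.
pairs = [
    (33, (385, 1)), (26, (2, 1)), (45, (2, 1)),
    (33, (10, 1)), (26, (8, 1)),
]

totals = reduce_by_key(pairs, lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(totals)
```

Because the combining function is applied pairwise, it should be associative and commutative, as tuple-wise addition is here.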

#### Result
The resulting `totalsByAge` RDD contains the total number of friends and the total count of entries for each age. This prepares the data for calculating the average number of friends per age in the next step of the pipeline.
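The averaging arithmetic that these totals feed into can be sketched in plain Python. In Spark this would typically be another `mapValues` over `totalsByAge`; the list below is just the example result from above, used as stand-in data.

```python
# (age, (total friends, number of entries)) pairs from the example above.
totals = [(33, (395, 2)), (26, (10, 2)), (45, (2, 1))]

# Average = total friends / number of entries, computed per age.
averages = [(age, total / count) for age, (total, count) in totals]
print(averages)
```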

In [None]:
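# Divide each age's total number of friends by its count of entries to get the average.
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
final_results = averagesByAge.collect()
# This collects the results from the `averagesByAge` RDD into a list on the driver node. `collect` is an action, so it triggers execution of the RDD transformations and brings the results to the driver.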
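The idea that nothing runs until an action is called can be illustrated with a loose plain-Python analogy: the built-in `map` is also lazy, and `list(...)` plays the role of the action. This is only an analogy and says nothing about how Spark schedules work internally; `traced_double` is a made-up function for the demonstration.

```python
calls = []

def traced_double(x):
    # Record each invocation so we can see when the work actually happens.
    calls.append(x)
    return x * 2

# Like a transformation: this only describes the work, nothing is computed yet.
lazy = map(traced_double, [1, 2, 3])
calls_before = len(calls)

# Like an action: forcing the result runs the function on every element.
materialized = list(lazy)
calls_after = len(calls)

print(calls_before, calls_after, materialized)
```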

In [None]:
for age, average in final_results:
    print(f"Age: {age}, Average Number of Friends: {average:.2f}")