# Exercise 10 | RDDs

Data files used in this task are located on HDFS in the following directory: `/loudacre/weblogs`. You may wish to perform the exercise below using a smaller dataset, consisting of only a few of the web log files, rather than all of them (which can take a lot of time). You can specify a wildcard, e.g., `/loudacre/weblogs/*6.log` would include only log files whose names end with the digit `6`.
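
To get a feel for which files a wildcard such as `*6.log` selects, here is a minimal local sketch. It uses Python's `fnmatch` only as a stand-in for the glob expansion done on HDFS, and the file names are hypothetical (the real names in `/loudacre/weblogs` may differ).

```python
from fnmatch import fnmatch

# Hypothetical file names, used only to illustrate the pattern.
log_files = ["2014-03-05.log", "2014-03-06.log", "2014-03-16.log", "2014-03-17.log"]

# Keep only the names that end with "6.log", as the wildcard would.
selected = [name for name in log_files if fnmatch(name, "*6.log")]

print(selected)
```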

## 1. Count the Number of Requests from Each User

Using map-reduce, count the number of requests from each user:

1. Use `map` to create a Pair RDD with the user ID as the key and the integer `1` as the value.  
   - *(The user ID is the third field in each line of the log file.)*
2. Use `reduceByKey` to sum the values for each user ID.
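
The two steps above can be sketched in plain Python, without Spark, to show what the map and the per-key reduction compute. The log lines below are made up for illustration. They only assume what the exercise states: fields are whitespace-separated and the user ID is the third field.

```python
from collections import defaultdict

# Hypothetical log lines: IP is the first field, user ID is the third.
sample_lines = [
    '10.0.0.1 - 101 [15/Sep/2013:23:58:36 +0100] "GET /a.html HTTP/1.0" 200 512',
    '10.0.0.2 - 102 [15/Sep/2013:23:58:40 +0100] "GET /b.html HTTP/1.0" 200 256',
    '10.0.0.3 - 101 [15/Sep/2013:23:59:01 +0100] "GET /c.html HTTP/1.0" 200 128',
]

# "map" step: emit one (user_id, 1) pair per log line.
pairs = [(line.split()[2], 1) for line in sample_lines]

# "reduce by key" step: add up the values that share the same key.
request_counts = defaultdict(int)
for user_id, one in pairs:
    request_counts[user_id] += one

print(dict(request_counts))
```

On an RDD the same summing is distributed across partitions, which is why the function used to combine values should be associative and commutative, as addition is.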

## 2. Determine User Visit Frequency

Determine how many users visited the site for each frequency. That is, how many users visited once, twice, three times, and so on.

1. Use `map` to swap the key and value in the result of section 1, so that the request count becomes the key.
2. Use the `countByKey` action to return a Map (data structure) of frequency:user-count pairs.
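
As a rough illustration of these two steps, the following plain-Python sketch emulates the swap and the per-key count with a `Counter`. The input pairs are hypothetical stand-ins for the result of section 1, not real data.

```python
from collections import Counter

# Hypothetical result of section 1: (user_id, request_count) pairs.
user_request_counts = [("101", 2), ("102", 1), ("103", 2), ("104", 5)]

# "map" step: swap each pair so the request count becomes the key.
swapped = [(count, user_id) for user_id, count in user_request_counts]

# "countByKey" step: count how many pairs share each key.
users_per_frequency = Counter(count for count, _ in swapped)

print(dict(users_per_frequency))
```

Only the keys matter for the count, so the value kept alongside each frequency does not affect the result.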

## 3. Create an RDD of Users and Their IP Addresses

Create an RDD where the user ID is the key, and the value is the list of all the IP addresses that the user has connected from.  
*(IP address is the first field in each line of the log file.)*

1. Use `map` to create a Pair RDD with the user ID as the key and the IP address as the value.
2. Use `groupByKey` to group the list of all the IP addresses that the user has connected from.
3. You can use the following code to print out the first 5 user IDs and their IP lists:

```python
# "userips" is the name of the RDD where the user ID is the key,
# and the value is the list of all the IP addresses that user has connected from.
for (userid, ips) in userips.take(5):
    print(userid, ":")
    for ip in ips:
        print("\t", ip)
```
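
To show the shape of the data that the grouping produces, here is a small plain-Python sketch that emulates it with a dictionary of lists. The log lines are made up for illustration, and the name `userips` simply mirrors the snippet above. In Spark, the order of values within each group is not guaranteed.

```python
from collections import defaultdict

# Hypothetical log lines: IP is the first field, user ID is the third.
sample_lines = [
    '10.0.0.1 - 101 [15/Sep/2013:23:58:36 +0100] "GET /a.html HTTP/1.0" 200 512',
    '10.0.0.2 - 102 [15/Sep/2013:23:58:40 +0100] "GET /b.html HTTP/1.0" 200 256',
    '10.0.0.3 - 101 [15/Sep/2013:23:59:01 +0100] "GET /c.html HTTP/1.0" 200 128',
]

# "map" step: emit one (user_id, ip) pair per log line.
pairs = []
for line in sample_lines:
    fields = line.split()
    pairs.append((fields[2], fields[0]))

# "group by key" step: collect every IP seen for each user ID.
userips = defaultdict(list)
for user_id, ip in pairs:
    userips[user_id].append(ip)

for userid, ips in userips.items():
    print(userid, ":")
    for ip in ips:
        print("\t", ip)
```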


## Step 1: Count the Number of Requests from Each User

```python
from pyspark import SparkContext

sc = SparkContext("local", "WebLogAnalysis")

# Load the web logs as an RDD of lines
logs = sc.textFile("/loudacre/weblogs/*")

# Extract the user ID (third field) and map each line to (user_id, 1)
user_requests = logs.map(lambda line: (line.split()[2], 1))

# Sum the values for each user ID
user_request_counts = user_requests.reduceByKey(lambda a, b: a + b)

# Print the first 10 results
for user, count in user_request_counts.take(10):
    print(user, count)
```

## Step 2: Determine User Visit Frequency

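```python
# Re-key by visit count: (user_id, count) -> (count, 1).
# countByKey only looks at keys, so the value can be any placeholder.
visit_frequencies = user_request_counts.map(lambda x: (x[1], 1))

# Count how many users share each visit count (returns a dict on the driver)
frequency_count = visit_frequencies.countByKey()

# Print the frequencies in ascending order of visit count
for visits, count in sorted(frequency_count.items()):
    print(f"Users who visited {visits} times: {count}")
```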

## Step 3: Create an RDD of Users and Their IP Addresses

```python
# Extract the user ID (third field) and IP address (first field) -> (user_id, ip)
def user_and_ip(line):
    fields = line.split()  # split once and reuse the fields
    return (fields[2], fields[0])

user_ips = logs.map(user_and_ip)

# Group all IP addresses by user ID
user_ips_grouped = user_ips.groupByKey()

# Print the first 5 user IDs and their IP lists
for userid, ips in user_ips_grouped.take(5):
    print(userid, ":")
    for ip in ips:
        print("\t", ip)
```