# RDD API - XYZ - Research

Imagine you work as a data scientist for XYZ Research, a company that conducts research on a wide range of topics. Each research project is identified by a Research ID.

A project may be completed within a single year or may span several years.

Take a look at the data below:

data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']

data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']

data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

Each list holds the IDs of the research projects that were active in that year, covering a period of three years.

Seven research projects (RIN1 to RIN7) were initiated in 2001, RIN8 and RIN9 were initiated in 2002, and RIN10, RIN11, and RIN12 were initiated in 2003.

RIN3 also appears in 2002, meaning that the project spanned two years. RIN4, on the other hand, appears in 2001, 2002, and 2003, which means that it spanned three years.
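The reading of the data above can be checked with a few lines of plain Python before involving Spark. This is only a sketch: it assumes a project is "initiated" in the first year its ID appears, and the names `projects_by_year` and `initiated_by_year` are hypothetical helpers, not part of the notebook.

```python
# Yearly project lists, copied from the data above.
projects_by_year = {
    2001: ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7'],
    2002: ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9'],
    2003: ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12'],
}

seen = set()            # IDs encountered in earlier years
initiated_by_year = {}  # year -> IDs that appear for the first time

for year in sorted(projects_by_year):
    new_ids = [rid for rid in projects_by_year[year] if rid not in seen]
    initiated_by_year[year] = new_ids
    seen.update(new_ids)

for year, ids in initiated_by_year.items():
    print(year, ids)
```

Under that assumption, the loop reports RIN1 to RIN7 for 2001, RIN8 and RIN9 for 2002, and RIN10 to RIN12 for 2003.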

XYZ Research requires you to examine the data and answer the following questions:
1. How many research projects were initiated in the three years?

2. How many projects were completed in the first year?

3. How many projects were completed in the first two years?

4. How many projects took one year to complete?

5. How many projects took two years to complete?

6. How many projects took three years to complete?

In [1]:
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("XYZ Research").getOrCreate())

In [26]:
data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

In [27]:
data2001RDD = spark.sparkContext.parallelize(data2001List)
data2002RDD = spark.sparkContext.parallelize(data2002List)
data2003RDD = spark.sparkContext.parallelize(data2003List)

**Projects Completed in the First Year**

In [28]:
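# subtract() keeps the 2001 IDs that do not reappear in 2002,
# i.e. the projects that finished within the first year.
dataFirstYearRDD = data2001RDD.subtract(data2002RDD)
firstYear = dataFirstYearRDD.collect()  # collect() already returns a list
print(f"completed in the first year {firstYear}")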

completed in the first year ['RIN2', 'RIN5', 'RIN6', 'RIN1']


**Projects Completed in the First Two Years**

In [30]:
# union() does not deduplicate, so IDs present in both 2001 and 2002 appear twice.
UnionFirstTwoYear = data2001RDD.union(data2002RDD)
# Drop every ID that is still active in 2003, then remove the repeats.
dataFirstTwoYearRDD = UnionFirstTwoYear.subtract(data2003RDD)
firstTwoYear = dataFirstTwoYearRDD.distinct().collect()
print(f"completed in the first two year {firstTwoYear}")

completed in the first two year ['RIN1', 'RIN9', 'RIN2', 'RIN3', 'RIN5', 'RIN6', 'RIN7']
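The `distinct()` call in the cell above matters because `union()` keeps duplicates and `subtract()` does not remove them. The plain-Python sketch below imitates that behaviour on the same lists; `subtract_keep_duplicates` is a hypothetical helper written for illustration, not a Spark function.

```python
data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

def subtract_keep_duplicates(left, right):
    """Return the items of `left` that are absent from `right`, keeping repeats."""
    right_set = set(right)
    return [item for item in left if item not in right_set]

# List concatenation, like RDD.union, performs no deduplication.
union_first_two = data2001List + data2002List
remaining = subtract_keep_duplicates(union_first_two, data2003List)

print(remaining)               # RIN3 and RIN7 each appear twice
print(sorted(set(remaining)))  # repeats removed, as distinct() does
```

Without the final deduplication step, RIN3 and RIN7 would each be reported twice, because both appear in the 2001 and 2002 lists.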


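**Projects Active in All Three Years**

In [31]:
# intersection() keeps only the IDs present in both RDDs and already
# returns distinct elements, so chaining it across the three years
# leaves the projects that were active in every year.
# Note: this is not the set of projects initiated over the three years.
activeTwoYearsRDD = data2001RDD.intersection(data2002RDD)
activeThreeYearsRDD = activeTwoYearsRDD.intersection(data2003RDD)
activeThreeYears = activeThreeYearsRDD.collect()
print(f"Active in all three years {activeThreeYears}")

Active in all three years ['RIN4']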
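The chained intersection above yields the IDs that occur in all three yearly lists. Question 1 asks something different: how many projects were initiated over the three years, which is the number of distinct IDs across all lists. A minimal plain-Python sketch of that count follows; on the RDDs, the same idea would presumably be a `union()` of the three RDDs followed by `distinct()` and `count()`.

```python
data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

# Every distinct ID corresponds to exactly one initiated project.
all_project_ids = set(data2001List) | set(data2002List) | set(data2003List)
initiated_count = len(all_project_ids)

print(f"projects initiated in the three years: {initiated_count}")  # 12
```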


**How Many Projects Took One, Two, and Three Years?**

In [4]:
research_rdd = spark.sparkContext.parallelize([data2001List, data2002List, data2003List])

In [6]:
flattened_rdd = research_rdd.flatMap(lambda arr: arr)

In [7]:
for data in flattened_rdd.collect():
    print(data)

RIN1
RIN2
RIN3
RIN4
RIN5
RIN6
RIN7
RIN3
RIN4
RIN7
RIN8
RIN9
RIN4
RIN8
RIN10
RIN11
RIN12


In [17]:
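# countByValue() returns a dict-like mapping of each ID to the number of
# yearly lists it appears in, i.e. the number of years the project spanned.
count_by_value = flattened_rdd.countByValue()
project1times = []
project2times = []
project3times = []
for element, count in count_by_value.items():
    if count == 1:
        project1times.append(element)
    elif count == 2:
        project2times.append(element)
    elif count == 3:
        project3times.append(element)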

In [21]:
print(f"Projects completed in one year {project1times}")
print(f"Projects completed in two years {project2times}")
print(f"Projects completed in three years {project3times}")

Projects completed in one year ['RIN1', 'RIN2', 'RIN5', 'RIN6', 'RIN9', 'RIN10', 'RIN11', 'RIN12']
Projects completed in two years ['RIN3', 'RIN7', 'RIN8']
Projects completed in three years ['RIN4']
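As a cross-check that does not need a Spark session, the same grouping can be sketched with `collections.Counter` from the standard library. The names `years_seen` and `by_span` are hypothetical, and the sketch assumes, as the notebook does, that the number of yearly lists an ID appears in equals the number of years the project took.

```python
from collections import Counter

data2001List = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002List = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003List = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

# Count in how many yearly lists each project ID appears.
years_seen = Counter(data2001List + data2002List + data2003List)

# Group the IDs by the number of years they span.
by_span = {1: [], 2: [], 3: []}
for project_id, n_years in years_seen.items():
    by_span[n_years].append(project_id)

for n_years, ids in by_span.items():
    print(n_years, ids)
```

Note that these counts only measure appearances. For IDs that are present in the 2003 list, the data does not show whether the project actually ended that year.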
