Merged
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,5 +1,7 @@
.DS_Store
.idea/
build
gradle
.gradle
.idea
build
!gradle-wrapper.jar
3 changes: 1 addition & 2 deletions code/bonus_chapters/lambda_expressions/README.md
@@ -1,12 +1,12 @@
# What are [Lambda Functions](./Lambda_Expressions.pdf)?

Lambda functions = Anonymous functions

A lambda function is a small function consisting of a single expression.
Because it does not require a name, it can serve as an anonymous function.
Lambda functions are helpful for small tasks that would otherwise need
a full function definition.
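
As a minimal sketch of the idea (the variable names below are illustrative and not taken from this repository), a lambda can be bound to a name or passed directly to another function:

```python
# A lambda bound to a name behaves like a one-expression function.
square = lambda x: x * x

# A lambda passed inline as a predicate: no name is needed at all.
evens = list(filter(lambda n: n % 2 == 0, range(10)))

# A lambda used as a sort key.
by_length = sorted(["spark", "rdd", "lambda"], key=lambda s: len(s))

print(square(4))
print(evens)
print(by_length)
```

Binding a lambda to a name is shown only for comparison. When a function needs a name, a regular `def` is usually the clearer choice.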


# Python Example

Python programming language supports the creation of anonymous functions
@@ -63,4 +63,3 @@ You may use lambda expressions or functions in PySpark:
# rdd: RDD[(String, Integer)]
rdd2 = rdd.filter(filter_function)


4 changes: 4 additions & 0 deletions code/chap01/scala/.gitignore
@@ -0,0 +1,4 @@
.idea
.idea/
.gradle/
build/
4 changes: 4 additions & 0 deletions code/chap02/scala/.gitignore
@@ -0,0 +1,4 @@
.idea
.idea/
.gradle/
build/
43 changes: 42 additions & 1 deletion code/chap02/scala/README.md
100644 → 100755
@@ -1 +1,42 @@
Scala Solutions
# Chapter 2

## DNA-Base-Count Programs using FASTA Input Format

There are three versions of DNA-Base-Count that read FASTA input files:

* Version-1:
  * Uses basic MapReduce programs
  * Using Spark (`org.data.algorithms.spark.ch02.DNABaseCountVER1`)

* Version-2:
  * Uses InMapper Combiner design pattern
  * Using Spark (`org.data.algorithms.spark.ch02.DNABaseCountVER2`)

* Version-3:
  * Uses InMapper Combiner design pattern (by using mapPartitions() transformations)
  * Using Spark (`org.data.algorithms.spark.ch02.DNABaseCountVER3`)
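
The difference between the three versions can be sketched in plain Python, without Spark. This is an illustration under stated assumptions, not the code of the `DNABaseCountVER*` classes: all helper names are hypothetical, bases are lowercased here (the actual programs may normalise case differently), and the input lines are the first two records of `data/sample.fasta`.

```python
from collections import Counter
from itertools import chain


def is_sequence_line(line):
    """FASTA header lines start with '>'; non-empty other lines hold bases."""
    return bool(line.strip()) and not line.startswith(">")


def version1_pairs(line):
    """Version-1 idea: emit one (base, 1) pair per character."""
    return [(base, 1) for base in line.strip().lower()]


def version2_pairs(line):
    """Version-2 idea: combine counts within one record before emitting."""
    return list(Counter(line.strip().lower()).items())


def version3_pairs(partition):
    """Version-3 idea: combine counts across a whole partition of lines,
    which is what a mapPartitions()-style transformation makes possible."""
    counts = Counter()
    for line in partition:
        if is_sequence_line(line):
            counts.update(line.strip().lower())
    return list(counts.items())


def reduce_by_key(pairs):
    """Stand-in for the reduce step: sum the values of each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


lines = [
    ">seq1",
    "cGTAaccaataaaaaaacaagcttaacctaattc",
    ">seq2",
    "agcttagTTTGGatctggccgggg",
]
seq_lines = [line for line in lines if is_sequence_line(line)]

v1 = reduce_by_key(chain.from_iterable(version1_pairs(l) for l in seq_lines))
v2 = reduce_by_key(chain.from_iterable(version2_pairs(l) for l in seq_lines))
v3 = reduce_by_key(version3_pairs(lines))  # the whole list acts as one partition

print(v1 == v2 == v3)
```

All three approaches produce the same totals. They differ in how many intermediate pairs reach the reduce step: one per character, one per distinct base per record, or one per distinct base per partition.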


## DNA-Base-Count Programs using FASTQ Input Format

Using FASTQ input files, the following solution is available:

* Uses InMapper Combiner design pattern (by using mapPartitions() transformations)
* Using Spark (`org.data.algorithms.spark.ch02.DNABaseCountFastq`)
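
For orientation, a FASTQ record normally spans four lines: an `@` identifier, the sequence, a `+` separator, and a quality string. The sketch below counts bases from the sequence lines only. It is a plain-Python illustration, not the `DNABaseCountFastq` implementation: the helper names and the two reads are made up, and it assumes unwrapped four-line records.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line (the 2nd of every 4 lines) of each record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()


def count_bases(lines):
    """Count bases over all sequence lines, normalising to upper case."""
    counts = Counter()
    for sequence in fastq_sequences(lines):
        counts.update(sequence.upper())
    return dict(counts)


# Two hypothetical reads, used only to exercise the functions.
fastq_lines = [
    "@read1", "GATTACA", "+", "IIIIIII",
    "@read2", "ccgn", "+", "!!!!",
]
fastq_counts = count_bases(fastq_lines)
print(fastq_counts)
```

Selecting lines by position rather than by prefix matters because a quality string may itself begin with `@` or `+`.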


## FASTA Files to Test DNA-Base-Count

* A small sample FASTA file (`data/sample.fasta`) is provided.

* To test the DNA-Base-Count programs with large FASTA files,
you may download them from:


````
ftp://ftp.ensembl.org/pub/release-91/fasta/homo_sapiens/dna/

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/rs_fasta/

ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
````
23 changes: 23 additions & 0 deletions code/chap02/scala/build.gradle
@@ -0,0 +1,23 @@
apply plugin: 'scala'
apply plugin: 'application'

ext.scalaClassifier = '2.13'
ext.scalaVersion = '2.13.7'

group 'com.spark.algos.data'
version '1.0-SNAPSHOT'

repositories {
mavenLocal()
mavenCentral()
}

dependencies {
implementation group: "org.scala-lang", name: "scala-library", version: "2.13.7"
implementation group: "org.apache.spark", name: "spark-core_2.13", version: "3.2.0"
implementation group: "org.apache.spark", name: "spark-sql_2.13", version: "3.2.0"
}

application {
mainClass = project.hasProperty("mainClass") ? project.getProperty("mainClass") : "NULL"
}
12 changes: 12 additions & 0 deletions code/chap02/scala/data/sample.fasta
@@ -0,0 +1,12 @@
>seq1
cGTAaccaataaaaaaacaagcttaacctaattc
>seq2
agcttagTTTGGatctggccgggg
>seq3
gcggatttactcCCCCCAAAAANNaggggagagcccagataaatggagtctgtgcgtccaca
gaattcgcacca
AATAAAACCTCACCCAT
agagcccagaatttactcCCC
>seq4
gcggatttactcaggggagagcccagGGataaatggagtctgtgcgtccaca
gaattcgcacca