
# Python Spark Example

Word count implemented in Python with PySpark.

This notebook shows how to run a Spark program written in Python.
In this example, Spark runs in local mode (`local[*]`) and reads data from the local filesystem; in cluster mode, data is typically read from a distributed file system such as HDFS.

Spark documentation available at:
https://spark.apache.org/docs/2.3.1/


### Download the dataset 

In [1]:
!wget -O os_maias.txt https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0

--2021-11-01 18:56:18--  https://www.dropbox.com/s/n24v0z7y79np319/os_maias.txt?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601f:18::a27d:912
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/n24v0z7y79np319/os_maias.txt [following]
--2021-11-01 18:56:18--  https://www.dropbox.com/s/raw/n24v0z7y79np319/os_maias.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucb1617c2df579f47cd6e0aede51.dl.dropboxusercontent.com/cd/0/inline/BZL2o0Bife9MsEYWEDC52AKtGjdLcXWKUplslZkmS267RvzzqMBntOSGQElOoqha9DDbekvbXMjlMFmvG89Tu6MC49ZGLc96aqIw65-oi3rX1n2GLaOaZkAKZ-ig9-PpYqikrAAlu9BFT4nt69nZEQQC/file# [following]
--2021-11-01 18:56:19--  https://ucb1617c2df579f47cd6e0aede51.dl.dropboxusercontent.com/cd/0/inline/BZL2o0Bife9MsEYWEDC52AKtGjdLcXWKUplslZkmS267RvzzqMBntOSGQElOoqha9DDbekvbXMjlMFmvG89Tu

## WordCount Example
Read the words from input and count them.

The processing executes the following steps:

+ Filters out empty lines.
+ (Flat)maps each line to a sequence of words.
+ Maps each word to a `(word, 1)` tuple.
+ Reduces by key, summing the counts.
+ Swaps each pair to `(count, word)` and sorts by count in descending order.
+ Takes and prints the first 10 results.
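
The steps above can be sketched in plain Python, without Spark, to make the data flow concrete. This is only an illustration of the logic, not of how Spark distributes the work; the function name `word_count` and the sample lines are hypothetical and were introduced here for the example.

```python
from collections import Counter
from itertools import chain


def word_count(lines, n=10):
    """Mirror the listed steps on a plain iterable of strings (no Spark)."""
    # Filter out empty lines.
    non_empty_lines = (line for line in lines if len(line) > 0)
    # Flat-map each line to its words.
    words = chain.from_iterable(line.split(' ') for line in non_empty_lines)
    # Map each word to a (word, 1) tuple.
    word_tuples = ((word, 1) for word in words)
    # Reduce by key, summing the counts.
    occurrences = Counter()
    for word, one in word_tuples:
        occurrences[word] += one
    # Sort by descending count and take the first n results.
    return sorted(occurrences.items(), key=lambda kv: kv[1], reverse=True)[:n]


sample = ["a b a", "", "b a c"]  # made-up input, not taken from os_maias.txt
for word, count in word_count(sample):
    print(count, word)
```

As in the Spark version, splitting on a single space means that consecutive spaces produce empty strings, which are counted like any other word.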

In [10]:
# install the dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable
import findspark
findspark.init(os.environ["SPARK_HOME"])

import pyspark
from operator import add  # imported under its own name to avoid shadowing the built-in sum

sc = pyspark.SparkContext('local[*]')
try:
    lines = sc.textFile('os_maias.txt')
    non_empty_lines = lines.filter(lambda line: len(line) > 0)
    words = non_empty_lines.flatMap(lambda line: line.split(' '))
    words_tuples = words.map(lambda word: (word, 1))
    occurrences = words_tuples.reduceByKey(add)
    # swap to (count, word) so that sortByKey orders by count, descending
    top = occurrences.map(lambda x: (x[1], x[0])).sortByKey(False)
    for (count, word) in top.take(10):
        print(count, word)

    # unsorted alternative:
    # for (k, v) in occurrences.take(10):
    #     print(k, v)
finally:
    # always stop the context, even if the job fails, and let any error surface
    sc.stop()


8308 de
6720 a
6602 o
4846 que
4441 e
3535 -
3004 um
2792 com
2564 do
2200 da


## Sorted WordCount Example
Get the 10 words that appear most frequently.
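
Selecting the 10 most frequent words does not strictly require sorting every pair. As a hedged sketch of that idea in plain Python, the hypothetical helper `top_n` below uses `heapq.nlargest` on made-up `(word, count)` pairs. On a PySpark RDD of such pairs, `takeOrdered(10, key=lambda kv: -kv[1])` should play a similar role, although this notebook uses `sortByKey` followed by `take`.

```python
import heapq


def top_n(counts, n=10):
    """Return the n (word, count) pairs with the highest counts, without a full sort."""
    return heapq.nlargest(n, counts, key=lambda kv: kv[1])


# Made-up counts for illustration only.
counts = [("de", 5), ("a", 4), ("o", 7), ("que", 1)]
for word, count in top_n(counts, n=2):
    print(count, word)
```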


In [11]:
!wget -O web_log.txt https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0


--2021-11-01 19:44:47--  https://www.dropbox.com/s/0r8902uj9yum7dg/web.log?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/0r8902uj9yum7dg/web.log [following]
--2021-11-01 19:44:47--  https://www.dropbox.com/s/raw/0r8902uj9yum7dg/web.log
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://ucbb83156adc9fa6170cb85b268a.dl.dropboxusercontent.com/cd/0/inline/BZLdUkIzfSQxeyScjMYb2sDAU5OVwUZwRZBbzPwBwGOOMKRhugky5Wa3pQfa9LHQGVPD69cVpQxavk-cU41T2r94o8oSfHIBS44-jLjqRy-rNDy4gc17Bc0bkNw7ctgL3wj9py4Xm69LkFq1Irt7XfPN/file# [following]
--2021-11-01 19:44:47--  https://ucbb83156adc9fa6170cb85b268a.dl.dropboxusercontent.com/cd/0/inline/BZLdUkIzfSQxeyScjMYb2sDAU5OVwUZwRZBbzPwBwGOOMKRhugky5Wa3pQfa9LHQGVPD69cVpQxavk-cU41T2r94o8oSfHIBS44-

In [12]:
sc = pyspark.SparkContext('local[*]')

In [13]:
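# list the distinct IP addresses in the web log (second space-separated field of each line)
try:
    lines = sc.textFile('web_log.txt')
    non_empty_lines = lines.filter(lambda line: len(line) > 0)
    ips = non_empty_lines.map(lambda line: line.split(" ")[1])
    distinct_ips = ips.distinct()
    for ip in distinct_ips.collect():
        print(ip)
finally:
    # always stop the context, even if the job fails, and let any error surface
    sc.stop()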

185.28.193.95
2002:894a:3a93:d:250:56ff:fe00:88c0
192.241.151.220
97.77.104.22
211.140.26.58
2602:ff62:104:7c9:8000::
120.52.73.98
202.106.16.36
2001:41d0:8:e7b5::1
201.18.115.114
2a02:c207:2008:5973::1
31.14.134.193
195.225.123.14
118.178.86.82
202.98.152.252
82.146.37.33
2001:41d0:a:2417::1
217.61.2.106
180.234.223.91
128.199.215.91
123.30.108.67
202.170.126.68
46.14.171.138
125.31.19.25
177.159.113.114
81.169.232.7
52.50.2.169
177.54.250.18
192.81.220.47
177.220.156.58
14.152.90.148
200.168.250.196
71.183.112.122
119.29.183.143
207.249.125.35
41.76.44.76
137.74.254.198
5.10.167.204
187.60.170.22
2a02:c207:2009:3128::1
13.67.211.33
90.152.38.178
203.70.11.180
2404:8000:90:1:42f2:e9ff:fe32:7928
177.89.161.223
173.239.197.125
190.15.222.55
119.29.119.49
103.27.118.146
120.92.3.127
94.177.240.125
191.102.89.10
91.194.42.51
5.196.58.88
184.49.237.6
118.70.212.142
151.80.197.192
202.29.221.90
89.191.131.243
138.197.28.136
85.143.210.233
165.231.0.242
41.160.187.186
163.172.67.180
209.133.