<a href="https://colab.research.google.com/github/mlmaniac-neelothkulaambaal/PROJECT_NASA/blob/main/NASA_Jul_Log.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Download the log file from the given URL**

In [None]:
!wget https://ditotw.space/NASA_access_log_Jul95.gz

--2022-02-07 23:07:44--  https://ditotw.space/NASA_access_log_Jul95.gz
Resolving ditotw.space (ditotw.space)... 162.241.217.135
Connecting to ditotw.space (ditotw.space)|162.241.217.135|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20676672 (20M) [application/x-gzip]
Saving to: ‘NASA_access_log_Jul95.gz’


2022-02-07 23:07:46 (17.3 MB/s) - ‘NASA_access_log_Jul95.gz’ saved [20676672/20676672]



**2. Install pyspark and import the necessary libraries**

In [None]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=f7b55938445aee836d448bb1eba49e630fd91389180c88ac129dc7bbe2811e92
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [None]:
from pyspark import SparkContext,SparkConf
from pyspark.sql import SQLContext
import os.path
import re

**3. Confirm that the download was successful. The file is in the local working directory in Google Colab.**

In [None]:
mypath='/content/NASA_access_log_Jul95.gz'
os.path.isfile(mypath)

True

**4. Create the SparkContext and the RDD**

In [None]:
config = SparkConf().setAppName("NASA_Logs").setMaster("local[*]")
sc = SparkContext.getOrCreate(config)
sqlcontext = SQLContext(sc)
rdd = sc.textFile(mypath)




**5. Check number of rows in the RDD**

In [None]:
rdd.count()

1891715

**6. Print the first record and a small random sample of records from the RDD**

In [None]:
rdd.take(1)

['199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245']

In [None]:
# Print up to six lines from a small random sample (a 0.001% fraction of the records)
sample_lines = rdd.sample(withReplacement=False, fraction=0.00001, seed=100).collect()
for line in sample_lines[:6]:
    print(line)

ix-lb7-05.ix.netcom.com - - [02/Jul/1995:04:04:00 -0400] "GET /shuttle/missions/sts-71/images/images.html HTTP/1.0" 200 7634
zachar.fast.net - - [03/Jul/1995:00:06:07 -0400] "GET /shuttle/technology/images/sts_body_2-small.gif HTTP/1.0" 200 30067
ix-bos8-08.ix.netcom.com - - [03/Jul/1995:11:48:45 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 200 234
199.120.22.3 - - [03/Jul/1995:21:24:27 -0400] "GET /shuttle/missions/sts-67/sts-67-patch-small.gif HTTP/1.0" 200 17083
wstabnow.clark.net - - [04/Jul/1995:13:37:33 -0400] "GET /shuttle/technology/images/srb_mod_compare_1-small.gif HTTP/1.0" 200 36902
alyssa.prodigy.com - - [05/Jul/1995:23:59:57 -0400] "GET /history/apollo/apollo-4/images/ HTTP/1.0" 200 514


**7. Define the regular expressions used to identify records that do not follow the expected format. Some records may contain unexpected characters.**

In [None]:
# Raw strings keep the backslashes intact and avoid invalid-escape warnings
search_regex_template1 = r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([\w\.\s*]+)?\s*"*(\d{3}) (\S+)'
search_regex_template2 = r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*([/\w\.]+)>*([\w/\s\.]+)\s*(\S+)\s*(\d{3})\s*(\S+)'
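
To make the capture groups concrete, the sketch below applies the first pattern to the first log line shown in step 6, using only Python's `re` module (no Spark required). The names `TEMPLATE_1`, `sample_line`, and `fields` are illustrative and do not appear in the notebook.

```python
import re

# Same pattern as search_regex_template1, written as a raw string
TEMPLATE_1 = r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([\w\.\s*]+)?\s*"*(\d{3}) (\S+)'

# First record of the dataset, as printed by rdd.take(1)
sample_line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

match = re.search(TEMPLATE_1, sample_line)
fields = match.groups()

# Print each capture group with its position in the tuple
for position, value in enumerate(fields):
    print(position, repr(value))
```

The eighth group keeps the trailing double quote (`HTTP/1.0"`) because `(\S+)` is greedy, and the optional ninth group is `None` for a well-formed request.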

**8. Count the records that fail to parse, to check how consistent the data is**

In [None]:
def find_failed_items(line):
    # Return 1 if the line matches the first pattern, 0 if it fails to parse
    match = re.search(search_regex_template1, line)
    return 0 if match is None else 1

rddRecordCount = rdd.count()
failedRecords = rdd.map(lambda line: find_failed_items(line)).filter(lambda line: line == 0).count()
print('{}/{} records failed to parse'.format(failedRecords,rddRecordCount))

855/1891715 records failed to parse
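
The same map, filter, and count logic can be tried on a plain Python list before running it on the full RDD. This is a minimal sketch: `TEMPLATE_1`, `sample_lines`, and `parses` are illustrative names, and the third line is a made-up malformed entry rather than a record from the dataset.

```python
import re

# Same pattern as search_regex_template1
TEMPLATE_1 = r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([\w\.\s*]+)?\s*"*(\d{3}) (\S+)'

sample_lines = [
    # Two records taken from the notebook output
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'ix-lb7-05.ix.netcom.com - - [02/Jul/1995:04:04:00 -0400] "GET /shuttle/missions/sts-71/images/images.html HTTP/1.0" 200 7634',
    # Hypothetical malformed line, included only to show a failure
    'malformed entry without the expected fields',
]

def parses(line):
    """Return True when the line matches the pattern."""
    return re.search(TEMPLATE_1, line) is not None

# Count the lines that do not match, mirroring the map/filter/count chain on the RDD
failed = sum(1 for line in sample_lines if not parses(line))
print('{}/{} records failed to parse'.format(failed, len(sample_lines)))
```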


**9. Retry the failed records with the second regular expression as a fallback. 790 records still fail to parse, which is about 0.04% of the 1,891,715 records, so we can proceed with the dataset.**

In [None]:
def deep_cleaning_log(line):
    match = re.search(search_regex_template1,line)
    if match is None:
        match = re.search(search_regex_template2,line)
    if match is None:
        return (line, 0)
    else:
        return (line, 1)
failedRecords = rdd.map(lambda line: deep_cleaning_log(line)).filter(lambda line: line[1] == 0).count()
print('{}/{} records failed to parse'.format(failedRecords,rddRecordCount))

790/1891715 records failed to parse
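
As a quick check on how small these failure counts are, the sketch below turns the two counts reported above into percentages. The function name `failure_rate` is illustrative; the numbers are the ones printed by the notebook.

```python
def failure_rate(failed, total):
    """Return the fraction of records that failed to parse."""
    return failed / total

total_records = 1891715      # rdd.count()
failed_first_pattern = 855   # failures with the first pattern only
failed_both_patterns = 790   # failures after the fallback pattern

print('first pattern only: {:.4%}'.format(failure_rate(failed_first_pattern, total_records)))
print('with fallback:      {:.4%}'.format(failure_rate(failed_both_patterns, total_records)))
print('records recovered by the fallback:', failed_first_pattern - failed_both_patterns)
```

Both rates are well below 0.1%, and the fallback pattern recovers 65 additional records.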


**10. The map_log function returns the captured groups of a record as a tuple, ready to be used for counting.**

In [None]:
def map_log(line):
    match = re.search(search_regex_template1,line)
    if match is None:
        match = re.search(search_regex_template2,line)

    return(match.groups())
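
Note that `map_log` assumes the line matches one of the two patterns; on a line that matches neither, `match` is `None` and `match.groups()` raises an `AttributeError`. The notebook avoids this by filtering with `deep_cleaning_log` first. A defensive variant might look like the sketch below, where `safe_map_log`, `TEMPLATE_1`, and `TEMPLATE_2` are hypothetical names introduced for illustration.

```python
import re

# Same patterns as search_regex_template1 and search_regex_template2
TEMPLATE_1 = r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([\w\.\s*]+)?\s*"*(\d{3}) (\S+)'
TEMPLATE_2 = r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*([/\w\.]+)>*([\w/\s\.]+)\s*(\S+)\s*(\d{3})\s*(\S+)'

def safe_map_log(line):
    """Return the 11 captured groups, or None if neither pattern matches."""
    # Try the first pattern, then fall back to the second one
    match = re.search(TEMPLATE_1, line) or re.search(TEMPLATE_2, line)
    return match.groups() if match else None

good_line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
print(safe_map_log(good_line))
print(safe_map_log('not a log line'))  # no match, so None instead of an exception
```

With a function like this, unparseable lines could be dropped afterwards with a filter on `None` instead of a separate pre-filtering pass; this is a design alternative, not what the notebook does.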

**11. Extract the 11 elements from each record.**

In [None]:
parsedRdd = rdd.map(lambda line: deep_cleaning_log(line)).filter(lambda line: line[1] == 1).map(lambda line : line[0])

In [None]:
parsedRdd2 = parsedRdd.map(lambda line: map_log(line))

**12. Print the first 3 parsed records to view a sample**

In [None]:
for element in parsedRdd2.take(3):
    print(element)
    print('\n')

('199.72.81.55', '-', '-', '01/Jul/1995:00:00:01', '0400', 'GET', '/history/apollo/', 'HTTP/1.0"', None, '200', '6245')


('unicomp6.unicomp.net', '-', '-', '01/Jul/1995:00:00:06', '0400', 'GET', '/shuttle/countdown/', 'HTTP/1.0"', None, '200', '3985')


('199.120.110.21', '-', '-', '01/Jul/1995:00:00:09', '0400', 'GET', '/shuttle/missions/sts-73/mission-sts-73.html', 'HTTP/1.0"', None, '200', '4085')




**13. Confirm that every record has exactly 11 elements, so the structure is uniform.**

In [None]:
parsedRdd2.map(lambda line: len(line)).distinct().collect()

[11]

**14. Program execution:**
To view the top N sites, run the following cells and enter a value for N. The `reduceByKey` step sums the number of requests for each distinct host (the first field of every record), and `takeOrdered` returns the N hosts with the most requests.

In [None]:
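def result(n):
    # Count requests per host (first field), then take the n hosts with the highest counts
    top_hosts = (parsedRdd2
                 .map(lambda line: (line[0], 1))
                 .reduceByKey(lambda previousCount, nextCount: previousCount + nextCount)
                 .takeOrdered(n, key=lambda x: -x[1]))
    print(top_hosts)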

In [None]:
n=input('Enter top N sites you wish to see:')

result(int(n))

Enter top N sites you wish to see:3
[('piweba3y.prodigy.com', 17572), ('piweba4y.prodigy.com', 11591), ('piweba1y.prodigy.com', 9868)]
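
For readers without a Spark session, the counting logic can be sketched in plain Python. Here `collections.Counter` stands in for the `map` plus `reduceByKey` step and `heapq.nsmallest` with a negated count stands in for `takeOrdered`. The function `top_n_hosts` and the host names in `records` are hypothetical, used only to show the shape of the computation.

```python
from collections import Counter
import heapq

def top_n_hosts(records, n):
    """Return the n (host, count) pairs with the highest counts."""
    # Equivalent of map(lambda line: (line[0], 1)).reduceByKey(add)
    counts = Counter(record[0] for record in records)
    # Equivalent of takeOrdered(n, key=lambda x: -x[1])
    return heapq.nsmallest(n, counts.items(), key=lambda pair: -pair[1])

# Made-up parsed records; only the first field (the host) matters here
records = [
    ('a.example.com', 'GET', '/index.html'),
    ('b.example.com', 'GET', '/index.html'),
    ('a.example.com', 'GET', '/images/logo.gif'),
    ('c.example.com', 'GET', '/index.html'),
    ('a.example.com', 'GET', '/history/'),
    ('b.example.com', 'GET', '/history/'),
]

print(top_n_hosts(records, 2))
```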
