## Logic to upload GHArchive Files

Let us upload files from the GHArchive website to HDFS.

* Make sure the target folder exists before placing the files in it.
* Get the content of the file with `requests.get` from the `requests` module, passing the URI from which the file should be downloaded.
* The call returns an object of type **Response**, which has a `content` attribute holding the file as bytes.
* The content can be written to any file system. As our cluster is not configured with WebHDFS, we will first write the files to the local file system and then upload them to HDFS using the `hdfs dfs -put` command.

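* Create the base directory. The first command removes any existing copy, so the `No such file or directory` message is expected when the folder does not exist yet.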

In [1]:
!hdfs dfs -rm -R -skipTrash /user/${USER}/github/streaming/landing/ghactivity

rm: `/user/itv007304/github/streaming/landing/ghactivity': No such file or directory


In [2]:
!hdfs dfs -mkdir -p /user/${USER}/github/streaming/landing/ghactivity

* Get the content of the file.

In [3]:
import requests

In [4]:
file = '2023-07-11-0.json.gz'

In [5]:
res = requests.get(f'https://data.gharchive.org/{file}')

In [6]:
type(res.content)

bytes

In [7]:
res

<Response [200]>

* Write the file to the local file system.

In [8]:
!rm -rf /home/${USER}/data/ghactivity

In [9]:
!mkdir -p /home/${USER}/data/ghactivity

In [10]:
import getpass

username = getpass.getuser()
username

'itv007304'

In [11]:
year = file[:4]
year

'2023'

In [12]:
month = file[5:7]
month

'07'

In [13]:
dayofmonth = file[8:10]
dayofmonth

'11'

In [14]:
target_local_folder = f'/home/{username}/data/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}'
target_local_folder

'/home/itv007304/data/ghactivity/year=2023/month=07/dayofmonth=11'
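The slicing steps above can be folded into one function that also rejects unexpected file names. This is a minimal sketch, not part of the original notebook: the names `FILE_PATTERN` and `partition_folder` are hypothetical, and the pattern assumes file names of the form `YYYY-MM-DD-H.json.gz` as used in this section.

```python
import re

# Hypothetical helper (not in the original notebook).
# Assumes GHArchive file names look like 2023-07-11-0.json.gz.
FILE_PATTERN = re.compile(r'^(\d{4})-(\d{2})-(\d{2})-(\d{1,2})\.json\.gz$')


def partition_folder(base_folder, file_name):
    """Return base_folder/year=YYYY/month=MM/dayofmonth=DD for a file name."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f'Unexpected GHArchive file name: {file_name}')
    year, month, dayofmonth, _hour = match.groups()
    return f'{base_folder}/year={year}/month={month}/dayofmonth={dayofmonth}'


print(partition_folder('/home/itv007304/data/ghactivity', '2023-07-11-0.json.gz'))
# prints /home/itv007304/data/ghactivity/year=2023/month=07/dayofmonth=11
```

Validating the name up front avoids silently creating a folder such as `year=abcd` when a malformed name is passed in.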

In [17]:
import subprocess

subprocess.check_call(f'mkdir -p {target_local_folder}', shell=True)

0

In [18]:
target_file = open(f'{target_local_folder}/{file}', 'wb')

In [19]:
upload_res = target_file.write(res.content)

In [20]:
upload_res

96670341

In [21]:
target_file.close()
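The open, write, and close cells above hold the whole file (about 93 MB here) in memory as `res.content`. As an alternative sketch, the download can be streamed in chunks and written inside a `with` block so the file is closed even if an error occurs. The helper names `write_chunks` and `download_to_file`, the 1 MiB chunk size, and the 60 second timeout are assumptions made for illustration.

```python
from pathlib import Path


def write_chunks(chunks, target_path):
    """Write an iterable of byte chunks to target_path; return bytes written."""
    target = Path(target_path)
    # Create the parent folders, like `mkdir -p`.
    target.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with target.open('wb') as target_file:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                written += target_file.write(chunk)
    return written


def download_to_file(url, target_path, chunk_size=1024 * 1024):
    """Stream url to target_path without loading the whole body in memory."""
    import requests  # imported here so write_chunks can be used on its own

    with requests.get(url, stream=True, timeout=60) as res:
        res.raise_for_status()  # stop on HTTP errors such as 404
        return write_chunks(res.iter_content(chunk_size=chunk_size), target_path)
```

A call such as `download_to_file(f'https://data.gharchive.org/{file}', f'{target_local_folder}/{file}')` would then replace the separate open, write, and close cells.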

In [4]:
!find /home/${USER}/data/ghactivity -type f | xargs ls -ltrh

-rw-r--r-- 1 itv007304 students 93M Jul 12 17:44 /home/itv007304/data/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz


In [2]:
!ls -lR /home/${USER}/data/ghactivity

/home/itv007304/data/ghactivity:
total 4
drwxr-xr-x 3 itv007304 students 4096 Jul 12 17:23 year=2023

/home/itv007304/data/ghactivity/year=2023:
total 4
drwxr-xr-x 3 itv007304 students 4096 Jul 12 17:23 month=07

/home/itv007304/data/ghactivity/year=2023/month=07:
total 4
drwxr-xr-x 2 itv007304 students 4096 Jul 12 17:26 dayofmonth=11

/home/itv007304/data/ghactivity/year=2023/month=07/dayofmonth=11:
total 94412
-rw-r--r-- 1 itv007304 students 96670341 Jul 12 17:44 2023-07-11-0.json.gz


* Copy the file to HDFS and validate.

In [23]:
target_hdfs_folder = f'/user/{username}/github/streaming/landing/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}'
target_hdfs_folder

'/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11'

In [24]:
subprocess.check_call(f'hdfs dfs -mkdir -p {target_hdfs_folder}', shell=True)

0

In [26]:
subprocess.check_call(f'hdfs dfs -put {target_local_folder}/{file} {target_hdfs_folder}', shell=True)

0
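The two `subprocess.check_call` calls above pass a single string with `shell=True`, which breaks if a path contains spaces or shell metacharacters. A hedged alternative is to pass each command as a list of arguments, so no shell parsing is involved. The helper names `hdfs_put_commands` and `run_commands` are hypothetical; the `hdfs dfs -mkdir -p` and `hdfs dfs -put` commands are the same ones used above.

```python
import subprocess


def hdfs_put_commands(local_file, hdfs_folder):
    """Return argument lists that create hdfs_folder and copy local_file into it."""
    return [
        ['hdfs', 'dfs', '-mkdir', '-p', hdfs_folder],
        ['hdfs', 'dfs', '-put', local_file, hdfs_folder],
    ]


def run_commands(commands):
    """Run each command in order; check_call raises if one exits non-zero."""
    for command in commands:
        subprocess.check_call(command)
```

With the variables defined earlier, `run_commands(hdfs_put_commands(f'{target_local_folder}/{file}', target_hdfs_folder))` would perform the same two steps.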

In [27]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/landing/ghactivity

drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:57 /user/itv007304/github/streaming/landing/ghactivity/year=2023
drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:57 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07
drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:59 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11
-rw-r--r--   3 itv007304 supergroup   96670341 2023-07-12 17:59 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz
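The listing above can also be checked programmatically by comparing the size reported by HDFS with the local file size. The sketch below parses `hdfs dfs -ls` text output, assuming the eight-column layout shown above (permissions, replication, owner, group, size, date, time, path). The function name `parse_hdfs_ls_sizes` is hypothetical, and the sample text is copied from the listing above.

```python
def parse_hdfs_ls_sizes(ls_output):
    """Map each file path in `hdfs dfs -ls` output to its size in bytes."""
    sizes = {}
    for line in ls_output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        # Skip "Found N items" headers and directory entries (which start with 'd').
        if len(fields) < 8 or not fields[0].startswith('-'):
            continue
        sizes[fields[7]] = int(fields[4])
    return sizes


sample = """\
drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:59 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11
-rw-r--r--   3 itv007304 supergroup   96670341 2023-07-12 17:59 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz
"""

sizes = parse_hdfs_ls_sizes(sample)
hdfs_path = '/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz'
# 96670341 is the byte count returned by target_file.write earlier.
print(sizes[hdfs_path] == 96670341)
```

In the notebook, the text could come from `subprocess.check_output` on the same `hdfs dfs -ls -R` command, and the local size from `os.path.getsize`.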


* The same steps work for any number of files.
* Let's upload the files for the remaining hours of 2023-07-11 (hours 1 to 23; hour 0 was uploaded above).

In [9]:
import requests
import subprocess
import getpass

def upload_gharchive_files_to_hdfs(file_name):
    """Download one GHArchive file, save it locally, and copy it to HDFS."""
    # File names look like YYYY-MM-DD-H.json.gz; slice out the date parts.
    year = file_name[:4]
    month = file_name[5:7]
    dayofmonth = file_name[8:10]
    username = getpass.getuser()

    # Fail early on an HTTP error instead of saving an error page as .json.gz.
    res = requests.get(f'https://data.gharchive.org/{file_name}')
    res.raise_for_status()

    # Write the downloaded bytes to the partitioned local folder.
    target_local_folder = f'/home/{username}/data/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}'
    subprocess.check_call(f'mkdir -p {target_local_folder}', shell=True)
    with open(f'{target_local_folder}/{file_name}', 'wb') as target_file:
        target_file.write(res.content)

    # Copy the local file to the matching partition folder in HDFS.
    target_hdfs_folder = f'/user/{username}/github/streaming/landing/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}'
    subprocess.check_call(f'hdfs dfs -mkdir -p {target_hdfs_folder}', shell=True)
    subprocess.check_call(f'hdfs dfs -put {target_local_folder}/{file_name} {target_hdfs_folder}', shell=True)

In [10]:
for hour in range(1, 24):
    print(f'Processing file 2023-07-11-{hour}.json.gz')
    upload_gharchive_files_to_hdfs(f'2023-07-11-{hour}.json.gz')

Processing file 2023-07-11-1.json.gz
Processing file 2023-07-11-2.json.gz
Processing file 2023-07-11-3.json.gz
Processing file 2023-07-11-4.json.gz
Processing file 2023-07-11-5.json.gz
Processing file 2023-07-11-6.json.gz
Processing file 2023-07-11-7.json.gz
Processing file 2023-07-11-8.json.gz
Processing file 2023-07-11-9.json.gz
Processing file 2023-07-11-10.json.gz
Processing file 2023-07-11-11.json.gz
Processing file 2023-07-11-12.json.gz
Processing file 2023-07-11-13.json.gz
Processing file 2023-07-11-14.json.gz
Processing file 2023-07-11-15.json.gz
Processing file 2023-07-11-16.json.gz
Processing file 2023-07-11-17.json.gz
Processing file 2023-07-11-18.json.gz
Processing file 2023-07-11-19.json.gz
Processing file 2023-07-11-20.json.gz
Processing file 2023-07-11-21.json.gz
Processing file 2023-07-11-22.json.gz
Processing file 2023-07-11-23.json.gz
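The loop above hard-codes the date and starts at hour 1 because hour 0 was uploaded earlier. A small sketch of a reusable name generator follows; the function name `hourly_file_names` is hypothetical, and it assumes the non-zero-padded hour suffix seen in the file names above.

```python
def hourly_file_names(date, hours=range(24)):
    """Return GHArchive file names for the given date, one per hour."""
    # The hour suffix is not zero-padded, e.g. 2023-07-11-0.json.gz.
    return [f'{date}-{hour}.json.gz' for hour in hours]


names = hourly_file_names('2023-07-11')
print(len(names), names[0], names[-1])
# prints 24 2023-07-11-0.json.gz 2023-07-11-23.json.gz
```

Passing `hours=range(1, 24)` yields the same 23 names processed by the loop above, each of which can be handed to `upload_gharchive_files_to_hdfs`.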


In [11]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11

-rw-r--r--   3 itv007304 supergroup   96670341 2023-07-12 17:59 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz
-rw-r--r--   3 itv007304 supergroup   90660972 2023-07-13 15:22 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-1.json.gz
-rw-r--r--   3 itv007304 supergroup  104839007 2023-07-13 15:24 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-10.json.gz
-rw-r--r--   3 itv007304 supergroup  110194573 2023-07-13 15:24 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-11.json.gz
-rw-r--r--   3 itv007304 supergroup  112296692 2023-07-13 15:25 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-12.json.gz
-rw-r--r--   3 itv007304 supergroup  111578082 2023-07-13 15:25 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=1

In [14]:
!hdfs dfs -ls -t -r /user/${USER}/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11

Found 24 items
-rw-r--r--   3 itv007304 supergroup   96670341 2023-07-12 17:59 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz
-rw-r--r--   3 itv007304 supergroup   90660972 2023-07-13 15:22 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-1.json.gz
-rw-r--r--   3 itv007304 supergroup   87462965 2023-07-13 15:22 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-2.json.gz
-rw-r--r--   3 itv007304 supergroup   78345974 2023-07-13 15:23 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-3.json.gz
-rw-r--r--   3 itv007304 supergroup   70414630 2023-07-13 15:23 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-4.json.gz
-rw-r--r--   3 itv007304 supergroup   84607321 2023-07-13 15:23 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/