## Logic to upload GHArchive Files

Let us go through the logic to upload files from the GHArchive website to HDFS.
* Make sure the target folder exists before placing files in it.
* Get the content of the file with `requests.get` from the `requests` module. We need to pass the URL from which the file is to be downloaded.
* It returns an object of type **Response**, which exposes the downloaded bytes through its `content` attribute.
* The content can be written to any file system. As our cluster is not configured with WebHDFS, we will first write the files to the local file system and then upload them to HDFS using the `hdfs dfs -put` command.

* Create the base directory. The first command removes any existing copy so that the run starts clean.

In [None]:
!hdfs dfs -rm -R -f -skipTrash /user/${USER}/itv-github/streaming/landing/ghactivity

In [None]:
!hdfs dfs -mkdir -p /user/${USER}/itv-github/streaming/landing/ghactivity

* Get the content of the file.

In [None]:
import requests

In [None]:
file = '2021-01-13-0.json.gz'

In [None]:
res = requests.get(f'https://data.gharchive.org/{file}')

In [None]:
res

In [None]:
type(res.content)
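
The cells above assume that the download succeeded. Here is a minimal sketch of a guarded download, using a hypothetical helper named `fetch_content` that is not part of the original notebook. It calls `raise_for_status()` so that an HTTP error such as 404 fails early instead of an error page being written to disk. The `get` parameter and the 60 second timeout are illustrative assumptions; `get` exists only so the helper can be exercised without network access.

```python
def fetch_content(url, get=None, timeout=60):
    """Return the response body as bytes, raising if the request failed.

    `get` and `timeout` are illustrative assumptions, not part of the notebook.
    """
    if get is None:
        # Imported lazily so the helper can be tried with a stub instead of requests.
        import requests
        get = requests.get
    res = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses.
    res.raise_for_status()
    return res.content
```

With `requests` installed, `fetch_content(f'https://data.gharchive.org/{file}')` would return the same bytes as `res.content`.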

* Write the content to the local file system.

In [None]:
!rm -rf /home/${USER}/data/ghactivity

In [None]:
!mkdir -p /home/${USER}/data/ghactivity

In [None]:
import getpass
username = getpass.getuser()

In [None]:
year = file[:4]
year

In [None]:
month = file[5:7]
month

In [None]:
dayofmonth = file[8:10]
dayofmonth
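
The fixed offsets work because the file name always starts with `YYYY-MM-DD`. As an alternative, the sketch below parses the name with `datetime.strptime`, which also rejects malformed names. The helper name `date_parts` is hypothetical, and the pattern `YYYY-MM-DD-H.json.gz` is assumed from the example file name used in this section.

```python
from datetime import datetime

def date_parts(file_name):
    """Return (year, month, dayofmonth) strings parsed from a file name
    such as '2021-01-13-0.json.gz'."""
    stem = file_name.split('.')[0]  # e.g. '2021-01-13-0'
    # strptime raises ValueError if the name does not match the pattern.
    parsed = datetime.strptime(stem, '%Y-%m-%d-%H')
    # Keep zero padding so the folder names match the slicing approach.
    return f'{parsed:%Y}', f'{parsed:%m}', f'{parsed:%d}'

year, month, dayofmonth = date_parts('2021-01-13-0.json.gz')
```

The three values are the same strings that the slicing cells produce for this file name.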

In [None]:
target_local_folder = f'/home/{username}/data/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}'

In [None]:
target_local_folder
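
The HDFS folder used later in this section follows the same `year=/month=/dayofmonth=` layout as the local one. A small sketch of a shared helper, with the hypothetical name `partition_folder`, keeps the two paths consistent. The base directory in the usage line is a placeholder, not a real account.

```python
import posixpath

def partition_folder(base_dir, year, month, dayofmonth):
    """Build the partitioned folder path under base_dir."""
    # posixpath always joins with '/', which suits both Linux and HDFS paths.
    return posixpath.join(
        base_dir,
        f'year={year}',
        f'month={month}',
        f'dayofmonth={dayofmonth}',
    )

# '/home/someuser' is a placeholder base directory for illustration only.
example_folder = partition_folder('/home/someuser/data/ghactivity', '2021', '01', '13')
```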

In [None]:
import subprocess
subprocess.check_call(f'mkdir -p {target_local_folder}', shell=True)

In [None]:
target_file = open(f'{target_local_folder}/{file}', 'wb')

In [None]:
upload_res = target_file.write(res.content)

In [None]:
upload_res

In [None]:
target_file.close()
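
The open, write and close steps above can also be combined with a `with` block, which closes the file even if the write fails. This is a minimal sketch; `save_bytes` is a hypothetical helper name, and it uses `os.makedirs` in place of the `mkdir -p` shell call.

```python
import os

def save_bytes(content, folder, file_name):
    """Write content to folder/file_name and return (path, bytes_written)."""
    # Equivalent to `mkdir -p`: create missing parents, ignore an existing folder.
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, file_name)
    with open(path, 'wb') as target_file:
        written = target_file.write(content)
    return path, written
```

In this notebook it could be called as `save_bytes(res.content, target_local_folder, file)`.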

In [None]:
!ls -lR /home/${USER}/data/ghactivity/
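
Listing the folder confirms that the file exists, but not that it is intact. The sketch below reads the first record back, assuming the file is gzip-compressed with one JSON document per line, as the `.json.gz` extension suggests. The helper name `first_event` is hypothetical.

```python
import gzip
import json

def first_event(path):
    """Return the first JSON record of a gzip-compressed, line-delimited file."""
    # 'rt' decompresses and decodes to text so json.loads receives a str.
    with gzip.open(path, 'rt', encoding='utf-8') as events:
        return json.loads(events.readline())
```

Calling `first_event(f'{target_local_folder}/{file}')` raises an error if the file is truncated or is not valid gzip, which makes it a quick sanity check before copying to HDFS.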

* Copy the files to HDFS and validate.

In [None]:
target_hdfs_folder = f'/user/{username}/itv-github/streaming/landing/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}'

In [None]:
target_hdfs_folder

In [None]:
subprocess.check_call(f'hdfs dfs -mkdir -p {target_hdfs_folder}', shell=True)

In [None]:
subprocess.check_call(f'hdfs dfs -put {target_local_folder}/{file} {target_hdfs_folder}', shell=True)
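
The two calls above pass a single string through the shell. Here is a sketch of the same steps using argument lists, which avoids shell quoting issues if a path ever contains spaces or special characters. The helper names and the `run` parameter are hypothetical, and the `-f` flag (overwrite an existing target) is an addition that the original command does not use.

```python
import subprocess

def hdfs_put_commands(local_path, hdfs_folder):
    """Return the hdfs commands that create the folder and copy the file."""
    return [
        ['hdfs', 'dfs', '-mkdir', '-p', hdfs_folder],
        # -f overwrites an existing file; an assumption, not in the original command.
        ['hdfs', 'dfs', '-put', '-f', local_path, hdfs_folder],
    ]

def copy_to_hdfs(local_path, hdfs_folder, run=subprocess.check_call):
    """Run each command in order; check_call raises if a command fails."""
    for command in hdfs_put_commands(local_path, hdfs_folder):
        run(command)
```

In this notebook it could be called as `copy_to_hdfs(f'{target_local_folder}/{file}', target_hdfs_folder)`.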

In [None]:
!hdfs dfs -ls -R /user/${USER}/itv-github/streaming/landing/ghactivity