## Upload GHArchive Files to S3

Let us upload files from the GHArchive website to S3.
* Make sure the target folder exists before placing the files.
* Get the content of the file with `requests.get` from the `requests` module.
* It returns an object of type **Response**, which exposes the downloaded bytes through its `content` attribute.
* The content can be uploaded to S3 using the S3 client's `put_object`. If we specify the right object key, the gzip file downloaded using `requests.get` is uploaded to S3 unchanged.

* Creating the base directory.

In [None]:
%fs mkdirs /mnt/itv-github-db/streaming/landing/ghactivity

* Get the content of the file.

In [None]:
import requests

In [None]:
file = '2021-01-13-0.json.gz'

In [None]:
res = requests.get(f'https://data.gharchive.org/{file}')

* Uploading the file to S3.

In [None]:
import boto3

In [None]:
s3_client = boto3.client('s3')

In [None]:
year = file[:4]
year

In [None]:
month = file[5:7]
month

In [None]:
dayofmonth = file[8:10]
dayofmonth

In [None]:
upload_res = s3_client.put_object(
   Bucket='itv-github-db',
   Key=f'streaming/landing/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}/{file}',
   Body=res.content
)
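
The three slices and the `Key` f-string above can be folded into one small helper. This is a minimal sketch, assuming file names always start with a `YYYY-MM-DD` date; the function name `build_s3_key` and its `prefix` parameter are illustrative and not part of the original notebook.

```python
def build_s3_key(file_name, prefix='streaming/landing/ghactivity'):
    """Build a partitioned S3 key from a name such as 2021-01-13-0.json.gz."""
    # The date occupies the first 10 characters: YYYY-MM-DD.
    year, month, dayofmonth = file_name[:10].split('-')
    return f'{prefix}/year={year}/month={month}/dayofmonth={dayofmonth}/{file_name}'


print(build_s3_key('2021-01-13-0.json.gz'))
# streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-0.json.gz
```

Keeping the key logic in one place means the single-file upload and any later bulk upload produce the same folder layout.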

In [None]:
upload_res
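
`put_object` returns a dictionary. boto3 normally raises an exception when a request fails, so the check below is only a sanity check. It is a sketch that assumes the usual `ResponseMetadata` entry in the response; the helper name `upload_succeeded` and the sample dictionary are hypothetical and are not output captured from this notebook.

```python
def upload_succeeded(upload_res):
    """Return True when the S3 response reports HTTP 200."""
    metadata = upload_res.get('ResponseMetadata', {})
    return metadata.get('HTTPStatusCode') == 200


# Hypothetical, trimmed-down response used only to exercise the helper.
sample_res = {'ResponseMetadata': {'HTTPStatusCode': 200}, 'ETag': '"illustrative-etag"'}
print(upload_succeeded(sample_res))
# True
```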

* Validating that the file was uploaded to S3 as an object.

In [None]:
%fs ls /mnt/itv-github-db/streaming/landing/ghactivity

path,name,size
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/,year=2021/,0


In [None]:
%fs ls /mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13

path,name,size
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-0.json.gz,2021-01-13-0.json.gz,47825349


* We can upload as many files as we want.
* Let us upload the files for the remaining hours of **2021-01-13**. Hour 0 is already uploaded, so the loop starts at hour 1.

In [None]:
import requests
import boto3

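def upload_gharchive_files_to_s3(file_name):
  # File names look like 2021-01-13-0.json.gz, so the date parts sit at fixed positions.
  year = file_name[:4]
  month = file_name[5:7]
  dayofmonth = file_name[8:10]
  res = requests.get(f'https://data.gharchive.org/{file_name}')
  # Fail early on a 404 or 5xx instead of uploading an error page as a .json.gz object.
  res.raise_for_status()
  s3_client = boto3.client('s3')
  upload_res = s3_client.put_object(
    Bucket='itv-github-db',
    Key=f'streaming/landing/ghactivity/year={year}/month={month}/dayofmonth={dayofmonth}/{file_name}',
    Body=res.content
  )
  # Return the S3 response so the caller can inspect it.
  return upload_res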

In [None]:
for hour in range(1, 24):
  upload_gharchive_files_to_s3(f'2021-01-13-{hour}.json.gz')
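
GHArchive does not zero-pad the hour in its file names, as `2021-01-13-0.json.gz` shows. The names for a whole day can therefore be generated with a plain integer range. The sketch below only builds the names and does not download anything; the helper name `hourly_file_names` is an assumption made for illustration.

```python
def hourly_file_names(date, hours=range(24)):
    """Return GHArchive file names for the given date (YYYY-MM-DD) and hours."""
    # Hours are not zero-padded: 2021-01-13-0.json.gz, ..., 2021-01-13-23.json.gz.
    return [f'{date}-{hour}.json.gz' for hour in hours]


file_names = hourly_file_names('2021-01-13')
print(len(file_names), file_names[0], file_names[-1])
# 24 2021-01-13-0.json.gz 2021-01-13-23.json.gz
```

Each generated name can be passed to `upload_gharchive_files_to_s3`, and `hours=range(1, 24)` reproduces the loop above.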

In [None]:
%fs ls /mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13

path,name,size
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-0.json.gz,2021-01-13-0.json.gz,47825349
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-1.json.gz,2021-01-13-1.json.gz,45560145
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-10.json.gz,2021-01-13-10.json.gz,71293671
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-11.json.gz,2021-01-13-11.json.gz,65318647
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-12.json.gz,2021-01-13-12.json.gz,65044936
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-13.json.gz,2021-01-13-13.json.gz,77894277
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-14.json.gz,2021-01-13-14.json.gz,81246956
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-15.json.gz,2021-01-13-15.json.gz,85821693
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-16.json.gz,2021-01-13-16.json.gz,80773183
dbfs:/mnt/itv-github-db/streaming/landing/ghactivity/year=2021/month=01/dayofmonth=13/2021-01-13-17.json.gz,2021-01-13-17.json.gz,74211217
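
The listing is ordered by name as text, which is why `2021-01-13-10.json.gz` comes right after `2021-01-13-1.json.gz`. When the files need to be processed in chronological order, they can be sorted on the numeric hour instead. This is a minimal sketch; the helper name `hour_of` and the short sample list are illustrative.

```python
def hour_of(file_name):
    """Extract the hour from a name such as 2021-01-13-10.json.gz."""
    # Drop the date prefix (YYYY-MM-DD-) and the .json.gz suffix.
    return int(file_name[11:].split('.')[0])


# Illustrative sample in the same text order a directory listing would use.
names = ['2021-01-13-0.json.gz', '2021-01-13-1.json.gz',
         '2021-01-13-10.json.gz', '2021-01-13-2.json.gz']
ordered = sorted(names, key=hour_of)
print([hour_of(name) for name in ordered])
# [0, 1, 2, 10]
```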
