## Interacting with Common AWS services using Python 3 ##

**This notebook covers how to:**
1. Create buckets in S3
2. List buckets in S3
3. Connect to existing buckets in S3 and read the files in them
4. Download files from S3 to a local computer
5. Copy files from one bucket to another
6. Delete S3 buckets

In [4]:
import boto3
import pandas as pd

### Connecting to s3 ###

In [2]:
s3_resource = boto3.resource('s3')

In [3]:
#list available buckets
for bucket in s3_resource.buckets.all():
    print(bucket.name)

aws-emr-resources-910991713532-us-west-1
aws-logs-910991713532-us-west-1
dataeng-capstone-1
faraz-bucket-a-20200712
faraz-test-bucket-20200712
fk-new-bucket-20200711
sparkify-fk
sparkify-fk3
sparkify-fk4
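
Creating a bucket is listed at the top of this notebook but not shown in a cell. A minimal sketch follows, assuming a client that behaves like `boto3.client('s3')`. The helper name `create_bucket` and the bucket name in the usage comment are hypothetical; the region `us-west-1` is taken from the bucket names listed above.

```python
def create_bucket(s3_client, bucket_name, region):
    """Create an S3 bucket in `region` and return the API response.

    `s3_client` is expected to behave like boto3.client('s3').
    """
    if region == "us-east-1":
        # us-east-1 is the default location and must not be sent
        # as a LocationConstraint.
        return s3_client.create_bucket(Bucket=bucket_name)
    return s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )

# Usage (hypothetical bucket name; requires AWS credentials):
# import boto3
# create_bucket(boto3.client("s3"), "my-example-bucket-20200712", "us-west-1")
```

Bucket names are global across S3, so the call fails if the name is already taken by another account.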


In [11]:
%%time
#df = pd.read_csv('s3://dataeng-capstone-1/h1b_disclosure_data_2017_2018.dat',sep="|")
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='dataeng-capstone-1', Key='h1b_disclosure_data_2017_2018.dat')
df = pd.read_csv(obj['Body'],sep="|")
df.head()



CPU times: user 12.7 s, sys: 2.17 s, total: 14.8 s
Wall time: 1min 12s


Unnamed: 0,FY_YEAR,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,EMPLOYMENT_START_DATE,EMPLOYMENT_END_DATE,EMPLOYER_NAME,EMPLOYER_BUSINESS_DBA,...,H1B_DEPENDENT,WILLFUL_VIOLATOR,SUPPORT_H1B,LABOR_CON_AGREE,PUBLIC_DISCLOSURE_LOCATION,WORKSITE_CITY,WORKSITE_COUNTY,WORKSITE_STATE,WORKSITE_POSTAL_CODE,ORIGINAL_CERT_DATE
0,2018,I-200-18026-338377,CERTIFIED,2018-01-29,2018-02-02,H-1B,2018-07-28,2021-07-27,MICROSOFT CORPORATION,,...,N,N,,,,REDMOND,KING,WA,98052,
1,2018,I-200-17296-353451,CERTIFIED,2017-10-23,2017-10-27,H-1B,2017-11-06,2020-11-06,ERNST & YOUNG U.S. LLP,,...,N,N,,,,SANTA CLARA,SAN JOSE,CA,95110,
2,2018,I-200-18242-524477,CERTIFIED,2018-08-30,2018-09-06,H-1B,2018-09-10,2021-09-09,LOGIXHUB LLC,,...,N,N,,,,IRVING,DALLAS,TX,75062,
3,2018,I-200-18070-575236,CERTIFIED,,2018-03-30,H-1B,2018-09-10,2021-09-09,"HEXAWARE TECHNOLOGIES, INC.",,...,Y,N,Y,,,NEW CASTLE,NEW CASTLE,DE,19720,
4,2018,I-200-18243-850522,CERTIFIED,2018-08-31,2018-09-07,H-1B,2018-09-07,2021-09-06,"ECLOUD LABS,INC.",,...,Y,N,Y,Y,,BIRMINGHAM,SHELBY,AL,35244,
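
The full read above took over a minute of wall time. When only a preview is needed, one option is to pass `nrows` and `usecols` to `pd.read_csv`; this should avoid parsing the whole file, though it has not been timed against this dataset. The sketch below uses an in-memory stand-in for `obj['Body']` with made-up rows, and the helper name `read_pipe_delimited` is hypothetical.

```python
import io
import pandas as pd

def read_pipe_delimited(body, nrows=None, usecols=None):
    """Read a pipe-delimited file-like object, optionally limiting rows and columns."""
    return pd.read_csv(body, sep="|", nrows=nrows, usecols=usecols)

# Stand-in for obj['Body']; the rows are invented for illustration only.
sample = io.BytesIO(
    b"CASE_NUMBER|CASE_STATUS|WORKSITE_STATE\n"
    b"A-1|CERTIFIED|WA\n"
    b"A-2|DENIED|CA\n"
    b"A-3|CERTIFIED|TX\n"
)
preview = read_pipe_delimited(sample, nrows=2, usecols=["CASE_NUMBER", "CASE_STATUS"])
print(preview)

# With the real object (requires AWS credentials):
# obj = s3.get_object(Bucket='dataeng-capstone-1', Key='h1b_disclosure_data_2017_2018.dat')
# preview = read_pipe_delimited(obj['Body'], nrows=1000)
```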


In [13]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [14]:
df.head()

Unnamed: 0,FY_YEAR,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,EMPLOYMENT_START_DATE,EMPLOYMENT_END_DATE,EMPLOYER_NAME,EMPLOYER_BUSINESS_DBA,EMPLOYER_ADDRESS,EMPLOYER_CITY,EMPLOYER_STATE,EMPLOYER_POSTAL_CODE,EMPLOYER_COUNTRY,EMPLOYER_PROVINCE,EMPLOYER_PHONE,EMPLOYER_PHONE_EXT,AGENT_REPRESENTING_EMPLOYER,AGENT_ATTORNEY_NAME,AGENT_ATTORNEY_CITY,AGENT_ATTORNEY_STATE,JOB_TITLE,SOC_CODE,SOC_NAME,NAICS_CODE,TOTAL_WORKERS,NEW_EMPLOYMENT,CONTINUED_EMPLOYMENT,CHANGE_PREVIOUS_EMPLOYMENT,NEW_CONCURRENT_EMPLOYMENT,CHANGE_EMPLOYER,AMENDED_PETITION,FULL_TIME_POSITION,PREVAILING_WAGE,PW_UNIT_OF_PAY,PW_WAGE_LEVEL,PW_SOURCE,PW_SOURCE_YEAR,PW_SOURCE_OTHER,WAGE_RATE_OF_PAY_FROM,WAGE_RATE_OF_PAY_TO,WAGE_UNIT_OF_PAY,H1B_DEPENDENT,WILLFUL_VIOLATOR,SUPPORT_H1B,LABOR_CON_AGREE,PUBLIC_DISCLOSURE_LOCATION,WORKSITE_CITY,WORKSITE_COUNTY,WORKSITE_STATE,WORKSITE_POSTAL_CODE,ORIGINAL_CERT_DATE
0,2018,I-200-18026-338377,CERTIFIED,2018-01-29,2018-02-02,H-1B,2018-07-28,2021-07-27,MICROSOFT CORPORATION,,1 MICROSOFT WAY,REDMOND,WA,98052,UNITED STATES OF AMERICA,,4258830000.0,,N,",",,,SOFTWARE ENGINEER,15-1132,"SOFTWARE DEVELOPERS, APPLICATIONS",51121,1,0,1,0,0,0,0,Y,112549.0,Year,Level II,OES,2017.0,OFLC ONLINE DATA CENTER,143915.0,0.0,Year,N,N,,,,REDMOND,KING,WA,98052,
1,2018,I-200-17296-353451,CERTIFIED,2017-10-23,2017-10-27,H-1B,2017-11-06,2020-11-06,ERNST & YOUNG U.S. LLP,,200 PLAZA DRIVE,SECAUCUS,NJ,7094,UNITED STATES OF AMERICA,,2018720000.0,,Y,"BRADSHAW, MELANIE",TORONTO,,TAX SENIOR,13-2011,ACCOUNTANTS AND AUDITORS,541211,1,0,0,0,0,1,0,Y,79976.0,Year,Level II,OES,2017.0,OFLC ONLINE DATA CENTER,100000.0,0.0,Year,N,N,,,,SANTA CLARA,SAN JOSE,CA,95110,
2,2018,I-200-18242-524477,CERTIFIED,2018-08-30,2018-09-06,H-1B,2018-09-10,2021-09-09,LOGIXHUB LLC,,320 DECKER DRIVE,IRVING,TX,75062,UNITED STATES OF AMERICA,,2145420000.0,,N,",",,,DATABASE ADMINISTRATOR,15-1141,DATABASE ADMINISTRATORS,541511,1,0,0,0,0,1,0,Y,77792.0,Year,Level II,OES,2018.0,OFLC ONLINE DATA CENTER,78240.0,0.0,Year,N,N,,,,IRVING,DALLAS,TX,75062,
3,2018,I-200-18070-575236,CERTIFIED,,2018-03-30,H-1B,2018-09-10,2021-09-09,"HEXAWARE TECHNOLOGIES, INC.",,101 WOOD AVENUE SOUTH,ISELIN,NJ,8830,UNITED STATES OF AMERICA,,6094100000.0,,Y,"DUTOT, CHRISTOPHER",TROY,MI,SOFTWARE ENGINEER,15-1132,"SOFTWARE DEVELOPERS, APPLICATIONS",541511,5,5,0,0,0,0,0,Y,84406.0,Year,Level II,OES,2017.0,OFLC ONLINE DATA CENTER,84406.0,85000.0,Year,Y,N,Y,,,NEW CASTLE,NEW CASTLE,DE,19720,
4,2018,I-200-18243-850522,CERTIFIED,2018-08-31,2018-09-07,H-1B,2018-09-07,2021-09-06,"ECLOUD LABS,INC.",,120 S WOOD AVENUE,ISELIN,NJ,8830,UNITED STATES OF AMERICA,,7327500000.0,,Y,"ALLEN, THOMAS",EDISON,NJ,MICROSOFT DYNAMICS CRM APPLICATION DEVELOPER,15-1132,"SOFTWARE DEVELOPERS, APPLICATIONS",541511,1,0,0,0,0,0,1,Y,87714.0,Year,Level III,OES,2018.0,OFLC ONLINE DATA CENTER,95000.0,0.0,Year,Y,N,Y,Y,,BIRMINGHAM,SHELBY,AL,35244,


In [21]:
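# Cast PW_SOURCE_YEAR to str, then strip a trailing ".0" (e.g. "2017.0" -> "2017") from string values
df2 = df.astype({"PW_SOURCE_YEAR": str}, errors='ignore').replace(r'\.0$', '', regex=True)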

In [17]:
df.head()

Unnamed: 0,FY_YEAR,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,EMPLOYMENT_START_DATE,EMPLOYMENT_END_DATE,EMPLOYER_NAME,EMPLOYER_BUSINESS_DBA,EMPLOYER_ADDRESS,EMPLOYER_CITY,EMPLOYER_STATE,EMPLOYER_POSTAL_CODE,EMPLOYER_COUNTRY,EMPLOYER_PROVINCE,EMPLOYER_PHONE,EMPLOYER_PHONE_EXT,AGENT_REPRESENTING_EMPLOYER,AGENT_ATTORNEY_NAME,AGENT_ATTORNEY_CITY,AGENT_ATTORNEY_STATE,JOB_TITLE,SOC_CODE,SOC_NAME,NAICS_CODE,TOTAL_WORKERS,NEW_EMPLOYMENT,CONTINUED_EMPLOYMENT,CHANGE_PREVIOUS_EMPLOYMENT,NEW_CONCURRENT_EMPLOYMENT,CHANGE_EMPLOYER,AMENDED_PETITION,FULL_TIME_POSITION,PREVAILING_WAGE,PW_UNIT_OF_PAY,PW_WAGE_LEVEL,PW_SOURCE,PW_SOURCE_YEAR,PW_SOURCE_OTHER,WAGE_RATE_OF_PAY_FROM,WAGE_RATE_OF_PAY_TO,WAGE_UNIT_OF_PAY,H1B_DEPENDENT,WILLFUL_VIOLATOR,SUPPORT_H1B,LABOR_CON_AGREE,PUBLIC_DISCLOSURE_LOCATION,WORKSITE_CITY,WORKSITE_COUNTY,WORKSITE_STATE,WORKSITE_POSTAL_CODE,ORIGINAL_CERT_DATE
0,2018,I-200-18026-338377,CERTIFIED,2018-01-29,2018-02-02,H-1B,2018-07-28,2021-07-27,MICROSOFT CORPORATION,,1 MICROSOFT WAY,REDMOND,WA,98052,UNITED STATES OF AMERICA,,4258830000.0,,N,",",,,SOFTWARE ENGINEER,15-1132,"SOFTWARE DEVELOPERS, APPLICATIONS",51121,1,0,1,0,0,0,0,Y,112549.0,Year,Level II,OES,2017.0,OFLC ONLINE DATA CENTER,143915.0,0.0,Year,N,N,,,,REDMOND,KING,WA,98052,
1,2018,I-200-17296-353451,CERTIFIED,2017-10-23,2017-10-27,H-1B,2017-11-06,2020-11-06,ERNST & YOUNG U.S. LLP,,200 PLAZA DRIVE,SECAUCUS,NJ,7094,UNITED STATES OF AMERICA,,2018720000.0,,Y,"BRADSHAW, MELANIE",TORONTO,,TAX SENIOR,13-2011,ACCOUNTANTS AND AUDITORS,541211,1,0,0,0,0,1,0,Y,79976.0,Year,Level II,OES,2017.0,OFLC ONLINE DATA CENTER,100000.0,0.0,Year,N,N,,,,SANTA CLARA,SAN JOSE,CA,95110,
2,2018,I-200-18242-524477,CERTIFIED,2018-08-30,2018-09-06,H-1B,2018-09-10,2021-09-09,LOGIXHUB LLC,,320 DECKER DRIVE,IRVING,TX,75062,UNITED STATES OF AMERICA,,2145420000.0,,N,",",,,DATABASE ADMINISTRATOR,15-1141,DATABASE ADMINISTRATORS,541511,1,0,0,0,0,1,0,Y,77792.0,Year,Level II,OES,2018.0,OFLC ONLINE DATA CENTER,78240.0,0.0,Year,N,N,,,,IRVING,DALLAS,TX,75062,
3,2018,I-200-18070-575236,CERTIFIED,,2018-03-30,H-1B,2018-09-10,2021-09-09,"HEXAWARE TECHNOLOGIES, INC.",,101 WOOD AVENUE SOUTH,ISELIN,NJ,8830,UNITED STATES OF AMERICA,,6094100000.0,,Y,"DUTOT, CHRISTOPHER",TROY,MI,SOFTWARE ENGINEER,15-1132,"SOFTWARE DEVELOPERS, APPLICATIONS",541511,5,5,0,0,0,0,0,Y,84406.0,Year,Level II,OES,2017.0,OFLC ONLINE DATA CENTER,84406.0,85000.0,Year,Y,N,Y,,,NEW CASTLE,NEW CASTLE,DE,19720,
4,2018,I-200-18243-850522,CERTIFIED,2018-08-31,2018-09-07,H-1B,2018-09-07,2021-09-06,"ECLOUD LABS,INC.",,120 S WOOD AVENUE,ISELIN,NJ,8830,UNITED STATES OF AMERICA,,7327500000.0,,Y,"ALLEN, THOMAS",EDISON,NJ,MICROSOFT DYNAMICS CRM APPLICATION DEVELOPER,15-1132,"SOFTWARE DEVELOPERS, APPLICATIONS",541511,1,0,0,0,0,0,1,Y,87714.0,Year,Level III,OES,2018.0,OFLC ONLINE DATA CENTER,95000.0,0.0,Year,Y,N,Y,Y,,BIRMINGHAM,SHELBY,AL,35244,


In [23]:
df.loc[df['CASE_NUMBER'] == 'I-200-18005-173269']

Unnamed: 0,FY_YEAR,CASE_NUMBER,CASE_STATUS,CASE_SUBMITTED,DECISION_DATE,VISA_CLASS,EMPLOYMENT_START_DATE,EMPLOYMENT_END_DATE,EMPLOYER_NAME,EMPLOYER_BUSINESS_DBA,EMPLOYER_ADDRESS,EMPLOYER_CITY,EMPLOYER_STATE,EMPLOYER_POSTAL_CODE,EMPLOYER_COUNTRY,EMPLOYER_PROVINCE,EMPLOYER_PHONE,EMPLOYER_PHONE_EXT,AGENT_REPRESENTING_EMPLOYER,AGENT_ATTORNEY_NAME,AGENT_ATTORNEY_CITY,AGENT_ATTORNEY_STATE,JOB_TITLE,SOC_CODE,SOC_NAME,NAICS_CODE,TOTAL_WORKERS,NEW_EMPLOYMENT,CONTINUED_EMPLOYMENT,CHANGE_PREVIOUS_EMPLOYMENT,NEW_CONCURRENT_EMPLOYMENT,CHANGE_EMPLOYER,AMENDED_PETITION,FULL_TIME_POSITION,PREVAILING_WAGE,PW_UNIT_OF_PAY,PW_WAGE_LEVEL,PW_SOURCE,PW_SOURCE_YEAR,PW_SOURCE_OTHER,WAGE_RATE_OF_PAY_FROM,WAGE_RATE_OF_PAY_TO,WAGE_UNIT_OF_PAY,H1B_DEPENDENT,WILLFUL_VIOLATOR,SUPPORT_H1B,LABOR_CON_AGREE,PUBLIC_DISCLOSURE_LOCATION,WORKSITE_CITY,WORKSITE_COUNTY,WORKSITE_STATE,WORKSITE_POSTAL_CODE,ORIGINAL_CERT_DATE
13665,2018,I-200-18005-173269,CERTIFIED,2018-01-05,2018-01-11,H-1B,2018-06-01,2021-05-31,THE EXECU|SEARCH GROUP LLC,,675 THIRD AVENUE,NEW YORK,NY,10017,UNITED STATES OF AMERICA,,2129220000.0,,N,",",,,SENIOR SPEECH-LANGUAGE PATHOLOGIST,29-1127,SPEECH-LANGUAGE PATHOLOGISTS,561320,8,2,4,2,0,0,2,Y,46.35,Hour,Level III,OES,2018.0,OFLC ONLINE DATA CENTER,46.35,0.0,Hour,N,N,,,,BRONX,BRONX,NY,10461,


### Reading Files in Buckets ###

In [4]:
# Connect to an existing bucket and print the keys of the files in it
bucket_0711 = s3_resource.Bucket('fk-new-bucket-20200711')

for objct in bucket_0711.objects.all():
    print(objct.key)

some_file
some_file2
some_file_new


In [None]:
# Reading files in the bucket bucket_0711

# The code below prints each file's key followed by its full contents (as bytes)
for objct in bucket_0711.objects.all():
    print(objct.key)
    print(objct.get()['Body'].read())

In [None]:
# read a specific file in the bucket
s3_obj = bucket_0711.Object('some_file')

s3_obj.get()['Body'].read().decode('utf-8')

### Downloading Files from s3 to local machine ###

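Download "some_file" from fk-new-bucket-20200711 to /Users/faraz/Desktop/aws_test. After the download, the local directory contains the file:

**Farazs-MBP:aws_test faraz$** ls <br>
config      credentials     some_file.txt

The test_all/ subdirectory holds all three files from the bucket:

**Farazs-MBP:aws_test faraz$** ls test_all/<br>
some_file.txt   some_file2.txt   some_file_new.txt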
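
The download cells themselves are not included here. A minimal sketch of how the listings above could be produced follows, using the `download_file` method of a boto3 `Bucket` resource. The helper names and the `.txt` suffix are assumptions inferred from the local file names, and the sketch assumes flat keys with no folder placeholders.

```python
import os

def download_object(bucket, key, dest_dir, suffix=".txt"):
    """Download one object into dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(key) + suffix)
    bucket.download_file(key, local_path)
    return local_path

def download_all(bucket, dest_dir, suffix=".txt"):
    """Download every object in the bucket into dest_dir."""
    return [
        download_object(bucket, obj.key, dest_dir, suffix)
        for obj in bucket.objects.all()
    ]

# Usage with the bucket resource defined earlier (requires AWS credentials):
# download_object(bucket_0711, 'some_file', '/Users/faraz/Desktop/aws_test')
# download_all(bucket_0711, '/Users/faraz/Desktop/aws_test/test_all')
```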

### Copy Files from one bucket to another ###

Copy files in fk-new-bucket-20200711 to faraz-bucket-a-20200712
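
No copy cell is included in this notebook. One way to do it, sketched under the assumption that both arguments are boto3 `Bucket` resources, is to loop over the source objects and call `Bucket.copy` on the destination. The helper name `copy_all_objects` is hypothetical.

```python
def copy_all_objects(source_bucket, dest_bucket):
    """Copy every object from source_bucket to dest_bucket, keeping the same keys.

    Returns the list of keys that were copied.
    """
    copied = []
    for obj in source_bucket.objects.all():
        copy_source = {"Bucket": source_bucket.name, "Key": obj.key}
        # The copy happens inside S3; nothing is downloaded locally.
        dest_bucket.copy(copy_source, obj.key)
        copied.append(obj.key)
    return copied

# Usage (requires AWS credentials and access to both buckets):
# copy_all_objects(
#     s3_resource.Bucket('fk-new-bucket-20200711'),
#     s3_resource.Bucket('faraz-bucket-a-20200712'),
# )
```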



## Connecting to other buckets in S3 ##

Using boto3, connect to the following S3 locations:
>Song data: s3://udacity-dend/song_data  
Log data: s3://udacity-dend/log_data  
Log data json path: s3://udacity-dend/log_json_path.json 

In [None]:
dataeng_bucket = s3_resource.Bucket('udacity-dend')

In [None]:
# see all files under the song_data/A/A/A
song_files = dataeng_bucket.objects.filter(Prefix = 'song_data/A/A/A/')

In [None]:
for song in song_files:
    print(song.key)
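
A prefix can match many objects, so it can help to cap how many are listed. The sketch below chains `.limit()` onto the filtered collection and also returns each object's size. The helper name `summarize_prefix` and the default of 10 items are assumptions for illustration.

```python
def summarize_prefix(bucket, prefix, max_items=10):
    """Return (key, size_in_bytes) pairs for up to max_items objects under prefix."""
    summaries = bucket.objects.filter(Prefix=prefix).limit(max_items)
    return [(obj.key, obj.size) for obj in summaries]

# Usage with the bucket defined above (requires AWS credentials):
# for key, size in summarize_prefix(dataeng_bucket, 'song_data/A/A/A/', max_items=5):
#     print(key, size)
```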
    

In [None]:

# see contents of song_data/A/A/A/TRAAAAK128F9318786.json
song_786 = dataeng_bucket.Object('song_data/A/A/A/TRAAAAK128F9318786.json')
print(song_786.get()['Body'].read().decode('utf-8'))


In [None]:
# see all files under the log_data/
log_files = dataeng_bucket.objects.filter(Prefix = 'log_data/')

In [None]:
for log in log_files:
    print(log.key)

In [None]:
# see contents of log_data/2018/11/2018-11-29-events.json
log_a = dataeng_bucket.Object('log_data/2018/11/2018-11-29-events.json')
print(log_a.get()['Body'].read().decode('utf-8'))
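
Printing the raw text is enough for a quick look. To work with individual events, the text can be parsed into Python dicts. The sketch below assumes the events file holds one JSON object per line, which this notebook does not confirm; the helper name and the sample field names are hypothetical.

```python
import json

def parse_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Invented sample standing in for the decoded object body.
sample_text = '{"page": "NextSong", "userId": "1"}\n{"page": "Home", "userId": "2"}\n'
events = parse_json_lines(sample_text)
print(len(events), events[0]["page"])

# With the real object (requires AWS credentials):
# events = parse_json_lines(log_a.get()['Body'].read().decode('utf-8'))
```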

In [None]:
# reading s3://udacity-dend/log_json_path.json

json_log_path = dataeng_bucket.Object('log_json_path.json')
print(json_log_path)

In [None]:
print(json_log_path.get()['Body'].read().decode('utf-8'))
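
### Deleting s3 buckets ###

Deleting buckets is listed at the top of this notebook but not demonstrated. S3 only deletes a bucket once it is empty, so a minimal sketch removes the objects first and then the bucket. The helper name is hypothetical, and the sketch assumes an unversioned bucket passed in as a boto3 `Bucket` resource.

```python
def empty_and_delete_bucket(bucket):
    """Delete every object in the bucket, then delete the bucket itself.

    Assumes versioning is not enabled; a versioned bucket would also
    need its object versions removed before the bucket can be deleted.
    """
    bucket.objects.all().delete()
    bucket.delete()

# Usage (irreversible; requires AWS credentials). The bucket name is one
# of the test buckets listed earlier and is used here only as an example:
# empty_and_delete_bucket(s3_resource.Bucket('faraz-test-bucket-20200712'))
```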