## Part 1: S3 on AWS

S3 (Simple Storage Service) is the object storage service on AWS. Here, you will practice interacting with it via the AWS console (GUI)
and with the Python library `boto3`. By the end of this exercise, you should know how to read and write files to S3 from a
Python script.

In [None]:
import numpy as np
import pandas as pd
import boto3
import matplotlib.pyplot as plt
plt.style.use('ggplot')

1. Log into your [Amazon account](http://aws.amazon.com/console/), and create an S3 bucket using the GUI.
   **The bucket name must be globally unique (not used on any AWS account).**

   [Rules for S3 Bucket Names](http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html):
   * Bucket names must be at least 3 and no more than 63 characters long.
   * Bucket names can contain lowercase letters, numbers, and hyphens.
   * Periods are allowed but can cause problems. Avoid using periods.
   * Bucket names cannot start or end with a hyphen or period.
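
The naming rules above can be checked before you try to create a bucket. The following is a minimal sketch, not an official validator: `is_valid_bucket_name` is a hypothetical helper, it rejects periods outright (following the advice to avoid them), and AWS documents further restrictions that this sketch does not cover.

```python
import re

# Pattern for the rules listed above: 3-63 characters, lowercase letters,
# digits, and hyphens only, starting and ending with a letter or digit.
# Periods are deliberately rejected because the guide recommends avoiding them.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Return True if `name` passes the simplified naming rules above."""
    return _BUCKET_NAME_RE.fullmatch(name) is not None


print(is_valid_bucket_name("g90-demo-bucket-nick"))
print(is_valid_bucket_name("My_Bucket"))
print(is_valid_bucket_name("-starts-with-hyphen"))
```

A name that passes this check can still be rejected by AWS if another account already uses it.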

2. Upload (using the GUI) `data/cancer.csv` to your bucket, and note the link to the file.

In [None]:
#this would be the object URL
#example: https://g90-demo-bucket-nick.s3-us-west-2.amazonaws.com/cancer.csv

3. Use `read_csv()` in `pandas` to read in the file from S3 (you can treat the S3 URL as a file path). Pass the `nrows` argument to `read_csv`
   to read only the first rows of the file, or the `chunksize` argument to iterate over the file in pieces. With only 301 rows, this file does not
   need to be subset, but for larger datasets these arguments come in handy.
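
To make the difference between reading a subset and reading in chunks concrete, here is a small sketch that uses an in-memory CSV instead of S3. The data and column names are made up for illustration; only `nrows` and `chunksize` are the point.

```python
import io

import pandas as pd

# Stand-in for a CSV file; the values and column names are illustrative only.
csv_text = "cancer,population\n" + "\n".join(f"{i},{1000 + i}" for i in range(10))

# nrows: read only the first 4 data rows into a single DataFrame.
head_df = pd.read_csv(io.StringIO(csv_text), nrows=4)
print(len(head_df))

# chunksize: get an iterator that yields DataFrames of up to 4 rows each.
chunk_lengths = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]
print(chunk_lengths)
```

With `chunksize`, `read_csv` returns an iterator rather than a DataFrame, so each chunk has to be processed (or concatenated) explicitly.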

In [None]:
#Instantiating the boto resource and client for downloading/uploading files
s3_connection = boto3.resource('s3')
s3_client = boto3.client('s3')

In [None]:
#When your files and/or bucket are set to private, use this function for loading in data
#Private buckets and files are recommended as a good general practice
def load_csv_from_s3(bucketname, filename, n_rows=300):
    """
    Input:
        bucketname (str): Name of bucket that file is stored in
        filename (str): Name of csv within bucket (ex: "cool_data.csv")
        n_rows (int): Number of rows to read from the top of the csv
        
    Output:
        pandas dataframe of csv (assuming no other read_csv arguments are needed)
    """
    
    boto_object = s3_client.get_object(Bucket=bucketname, Key=filename)
    return pd.read_csv(boto_object['Body'], nrows=n_rows)

In [None]:
bucket = 'g90-demo-bucket-nick' #name of bucket (change to your personal bucket name)
csv_name = 'cancer.csv' #name of file

df = load_csv_from_s3(bucket, csv_name)

In [None]:
# ONLY USE pd.read_csv WITH A URL IF YOUR BUCKET AND FILE ARE PUBLIC (this is not recommended)
object_url = "https://g90-demo-bucket-nick.s3-us-west-2.amazonaws.com/cancer.csv" #replace with your object URL from step 2
df = pd.read_csv(object_url)

4. Compute the rates of cancer for each row, and make a histogram of the rates. Save the histogram as a `.png`
   file using `savefig` in matplotlib. Save a `.csv` file of the rates you use for the histogram as well.

In [None]:
df.head()

In [None]:
df["cancer_rate"] = df["cancer"] / df["population"]

In [None]:
def plot_cancer_rates(x, save_figure=False, plot_save_path=None):
    
    fig, ax = plt.subplots(figsize=(12, 6))
    
    x_ticks = np.linspace(0, 0.01, 10)
    x_tick_labels = [str(np.around((tick * 100), 2)) + "%" for tick in x_ticks]
    
    ax.hist(x, bins=50)
    ax.set_xlabel("Cancer Rates")
    ax.set_xticks(x_ticks)
    ax.set_xticklabels(x_tick_labels, rotation=45)
    
    ax.set_ylabel("Counts")
    ax.set_title("Frequency of Cancer Rates")
    
    plt.tight_layout()
    if save_figure:
        plt.savefig(plot_save_path, dpi=500)

In [None]:
plot_cancer_rates(df["cancer_rate"].values, save_figure=True, plot_save_path="cancer_rates.png")

In [None]:
cancer_df_filename = "cancer_rates.csv"
df["cancer_rate"].to_csv(cancer_df_filename, header=["cancer_rate"])

5. Write a script using `boto3` to upload the histogram `.png` and the rates `.csv` to the bucket you have created.
   Confirm you have uploaded the files by checking the GUI console.

In [None]:
s3_client.upload_file(cancer_df_filename, bucket, cancer_df_filename)
s3_client.upload_file("cancer_rates.png", bucket, "cancer_rates.png")
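
Besides checking the console, the upload can also be confirmed from Python. The sketch below assumes a `boto3` S3 client is passed in; `list_bucket_keys` is a hypothetical helper name, and it relies on the client's `list_objects_v2` call.

```python
def list_bucket_keys(client, bucket_name, prefix=""):
    """Return the object keys in `bucket_name` that start with `prefix`.

    `client` is expected to be a boto3 S3 client. A single list_objects_v2
    call returns at most 1000 keys, which is plenty for this exercise.
    """
    response = client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches.
    return [obj["Key"] for obj in response.get("Contents", [])]


# Example usage with the names from this notebook (requires AWS credentials):
# keys = list_bucket_keys(s3_client, bucket, prefix="cancer_rates")
# print(keys)
```

If the uploads succeeded, the returned list should contain the keys you passed to `upload_file`.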

## Part 2: EC2 on AWS

An EC2 instance is a remote virtual machine that runs programs much like your local machine. Here you will learn how to
run tasks on an EC2 instance. Most EC2 instances come without many of the packages you need, so we will use
an instance that has most of the data science packages installed.

1. Create an EC2 instance. Search for a Machine Image (Community AMI) that has `anaconda3` and `Ubuntu`. **Optional: you could choose a plain Ubuntu AMI and then install anaconda3 yourself. See the directions at the end of this guide.** Choose `t2.micro` for the instance type. Give the instance an IAM role that allows it full access to S3. Choose an *all-lowercase* name for the instance and add a `Name` tag (Key=`Name`, Value=`examplename`). Careful: leave `Name` in the key field unchanged and put your chosen name in the value field in place of `examplename`.
  

In [None]:
"""
When creating the IAM role for your EC2 instance:
        
        1) Make sure to specify that it will be used by EC2 instances
        
        2) For the permission policy, choose "AmazonS3FullAccess"
        
When creating your EC2 instance:

        1) Make sure to select your new IAM role under "Configure Instance Details"
"""

2. Log into the instance you have launched using `ssh`. 

In [None]:
# ssh -i path/to/key.pem ubuntu@ec2_url_here

"""
Note:
    Replace "ubuntu" with ec2-user if running an Amazon Linux AMI rather than an Ubuntu AMI
"""

3. Update `apt` sources and perform routine updates:

```
sudo apt update
sudo apt upgrade
```

4. Modify the script you have written to process `cancer.csv` in `Part 1`. Instead of writing the results to
   the same S3 bucket as where `cancer.csv` is, change the script to write to a new bucket.  

   You will have to modify the script in one more way, because EC2 Linux servers typically have no display attached, unlike your laptop. Therefore, you'll need to select a non-interactive backend before importing `matplotlib.pyplot`. Modify the import in your script:
   ```python
   import matplotlib
   matplotlib.use("Agg")
   import matplotlib.pyplot as plt
   ```

In [None]:
#the following code would be run on the EC2 instance
import numpy as np
import pandas as pd
import boto3
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

s3_connection = boto3.resource('s3')
s3_client = boto3.client('s3')

def load_csv_from_s3(bucketname, filename, n_rows=300):
    """
    Input:
        bucketname (str): Name of bucket that file is stored in
        filename (str): Name of csv within bucket (ex: "cool_data.csv")
        
    Output:
        pandas dataframe of csv (assuming no read_csv arguments are needed)
    """
    
    boto_object = s3_client.get_object(Bucket=bucketname, Key=filename)
    return pd.read_csv(boto_object['Body'], nrows=n_rows)

def plot_cancer_rates(x, save_figure=False, plot_save_path=None):
    
    fig, ax = plt.subplots(figsize=(12, 6))
    
    x_ticks = np.linspace(0, 0.01, 10)
    x_tick_labels = [str(np.around((tick * 100), 2)) + "%" for tick in x_ticks]
    
    ax.hist(x, bins=50)
    ax.set_xlabel("Cancer Rates")
    ax.set_xticks(x_ticks)
    ax.set_xticklabels(x_tick_labels, rotation=45)
    
    ax.set_ylabel("Counts")
    ax.set_title("Frequency of Cancer Rates")
    
    plt.tight_layout()
    if save_figure:
        plt.savefig(plot_save_path, dpi=500)
        
if __name__ == "__main__":
    
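    source_bucket_name = 'g90-demo-bucket-nick' #bucket holding cancer.csv (change to your bucket name)
    new_bucket_name = 'aws-denver-branch-bucket' #bucket to write results to (must be globally unique)
    csv_name = 'cancer.csv'
    cancer_df_filename = "cancer_rates.csv"
    plot_filename = "cancer_rates.png"
    
    df = load_csv_from_s3(source_bucket_name, csv_name)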
    df["cancer_rate"] = df["cancer"] / df["population"]
    plot_cancer_rates(df["cancer_rate"].values, save_figure=True, plot_save_path=plot_filename)
    
    df["cancer_rate"].to_csv(cancer_df_filename, header=["cancer_rate"])
    
    
    s3_client.create_bucket(Bucket=new_bucket_name)
    
    s3_client.upload_file(cancer_df_filename, new_bucket_name, cancer_df_filename)
    s3_client.upload_file(plot_filename, new_bucket_name, plot_filename)
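
One caveat about the `create_bucket` call in the script above: passing only `Bucket` works when the client's region is `us-east-1`, but in other regions S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`. The sketch below shows one way to handle both cases; `build_create_bucket_kwargs` is a hypothetical helper, and the region value is an assumption you would replace with your own.

```python
def build_create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for a boto3 S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a constraint;
    # every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(build_create_bucket_kwargs("aws-denver-branch-bucket", "us-west-2"))

# Usage (requires AWS credentials):
# s3_client.create_bucket(**build_create_bucket_kwargs("aws-denver-branch-bucket", "us-west-2"))
```

Keeping the argument construction in a plain function makes it easy to check without touching AWS.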
    

5. Use `scp` or `git` to copy the script onto the EC2 instance. 

In [None]:
#if you need help installing/configuring git on your ec2 instance
#try this link: https://cloudaffaire.com/how-to-install-git-in-aws-ec2-instance/

6. Run the script on the EC2 instance and check S3 to make sure the results were written to the new bucket. In practice, you would test the script locally on a small subset of the data and then run it on the full dataset on EC2. If your task requires more processing power, you have the option to run it on a more powerful EC2 instance with more RAM and more cores.

**Optional (if you didn't use an anaconda3 AMI)**

Install Anaconda

```
# Download Anaconda3
wget -S -T 10 -t 5 https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh -O $HOME/anaconda.sh

# Install Anaconda
bash $HOME/anaconda.sh

# when prompted for an installation path, 
# press "enter" to accept the default


# when prompted to "prepend the install location 
# to your PATH", type 'yes'

# once installation is finished, you still have
# to execute the commands in ~/.bashrc
source ~/.bashrc

conda install boto3
```