# Homework 3
Louis Lu, Angela Liu, Winci Liang, Angelina Ying, Kristen Li

Complete the exercises working in your group. You may share the notebook with your group members using the share button in the upper right.







In [None]:
from os.path import basename, exists
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import linregress

# Question 1

## Plotting practice
These questions follow the examples presented in Notebook 9 from the Textbook. See that notebook for the dataset and template code.
*  Generate a violin plot that shows the distribution of height in each income group. Can you see a relationship between these variables?
* Make a boxplot that shows the distribution of weight in each income group. Plot the y-axis on a logarithmic scale.
* Generate a visualization of the relationship between weight and vegetable consumption.

In [None]:
# Import the dataset
def download(url):
    filename = basename(url)

    # Check if file already exists to avoid re-downloading
    if not exists(filename):
        from urllib.request import urlretrieve

        # Download the file and save it locally
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

# Download the specified file
download('https://github.com/AllenDowney/' +
         'ElementsOfDataScience/raw/master/brfss.hdf5')

# Load the downloaded HDF5 file into a pandas DataFrame
brfss = pd.read_hdf('brfss.hdf5', 'brfss')

# Output the shape of the DataFrame
brfss.shape


In [None]:
# Preview the first five rows to get a basic sense of the columns and their values
brfss.head()

In [None]:
# Graphing the Violin Plot
# Set the aesthetic style of the plots to whitegrid
sns.set(style="whitegrid")

# Create a new figure for the violin plot
plt.figure(figsize=(8, 6))

# Create a violin plot of the distribution of HTM4 (height) values for each income group (INCOME2)
sns.violinplot(x="INCOME2", y="HTM4", data=brfss, inner="quart", palette="Set1")

# Labeling the axes and the plot
plt.xlabel("Income group (INCOME2)")
plt.ylabel("Height in cm (HTM4)")
plt.title("Violin Plot of Height (HTM4) by Income Group (INCOME2)")


## Interpretation of the violin plot
This violin plot shows the distribution of height (HTM4) in each income group (INCOME2).

### Overall Observations:

Heights in every income group span roughly 90 to 230, and the shape of the distribution varies noticeably across income groups.

### Group-wise Observations:

Group 1 (Red): The distribution is widest around the median (the middle horizontal line), indicating a high density of data points near that value.

Group 2 (Blue), Group 3 (Green), Group 4 (Purple), Group 5 (Orange): The medians, quartiles, and distribution shapes of these four groups are similar. Their medians seem slightly higher than that of Group 1 and, as in Group 1, the distribution is widest around the median.

Group 6 (Yellow) and Group 7 (Brown): The medians of these two groups are only slightly higher than those of Groups 2, 3, 4, and 5, but the upper and lower quartiles are considerably higher. Groups 6 and 7 also have smaller ranges within the groups.

Group 8 (Pink): The median and upper and lower quartiles are the highest among all groups.

### Inferring on relationship between height and income:

Because the median and quartiles of height rise with income group, the plot suggests a positive association between income and height.
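
The visual impression can be checked numerically by computing the median height in each income group. The sketch below does this with `groupby` on a small made-up table. The helper name `median_by_group` and all of the values are invented for illustration; only the column names `INCOME2` and `HTM4` come from the dataset.

```python
import pandas as pd


def median_by_group(df, group_col, value_col):
    """Return the median of value_col within each level of group_col."""
    return df.groupby(group_col)[value_col].median()


# Toy data (invented values, not taken from BRFSS)
toy = pd.DataFrame({
    "INCOME2": [1, 1, 1, 8, 8, 8],
    "HTM4": [160.0, 165.0, 170.0, 168.0, 173.0, 178.0],
})

medians = median_by_group(toy, "INCOME2", "HTM4")
print(medians)
```

Calling `median_by_group(brfss, "INCOME2", "HTM4")` on the real data should give the values drawn as the middle line of each violin.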

In [None]:
# Graphing the Box Plot

# Plotting a boxplot for weight distribution across income groups
sns.boxplot(x="INCOME2", y="WTKG3", data=brfss)

# Setting the y-axis to logarithmic scale
plt.yscale("log")

# Setting labels and title
plt.xlabel("Income Group")
plt.ylabel("Weight (log scale)")
plt.title("Boxplot of Weight in Each Income Group")

## Interpretation of the Box Plot

### Functionality of Box Plot
Like the violin plot, the box plot shows the median, the upper and lower quartiles, and the range. Unlike the violin plot, it also marks outliers individually. For example, Groups 4, 5, and 6 each have a couple of outliers more than $1.5 \times \text{IQR}$ below the lower quartile.
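
The $1.5 \times \text{IQR}$ rule can be made concrete with a few lines of NumPy. This is a minimal sketch on invented weights, not on the BRFSS data, and `iqr_fences` is a hypothetical helper name. It computes the two fences and flags every point outside them, which is the rule box plots use by default to decide which points are drawn individually.

```python
import numpy as np


def iqr_fences(values):
    """Return (lower, upper) fences at 1.5 * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Toy weights in kg (invented values)
weights = np.array([30.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0, 90.0, 150.0])

lower, upper = iqr_fences(weights)

# Points beyond either fence are the ones a box plot marks as outliers
outliers = weights[(weights < lower) | (weights > upper)]
print(lower, upper, outliers)
```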

### Relationship between income group and logged weight
All income groups have a similar IQR for weight, but the median weight slightly increases as income group rises.

In [None]:
# Graphing the Scatter Plot

# Plotting a scatter plot for vegetable consumption and weight with transparency level at 0.5
sns.scatterplot(x="_VEGESU1", y="WTKG3", data=brfss, alpha = 0.5)

# Setting labels and title
plt.xlabel("Vegetable Consumption")
plt.ylabel("Weight")
plt.title("Weight versus Vegetable Consumption")

## Interpretation of the Scatter Plot

### Functionality of Scatter Plot
Unlike the box plot and the violin plot, the scatter plot shows every data point, which lets us see the relationship between the two variables directly.

### Relationship between vegetable consumption and weight
The scatter plot suggests that higher vegetable consumption is associated with lower weight: moving to the right along the x-axis, the points tend to sit lower on the y-axis.


# Question 2

## Correlation
Recall that correlation only measures linear relationships. If the relationship is nonlinear, correlation generally underestimates how strong it is.

To demonstrate, generate another example of fake data that is different from the data presented in the text which was $y=x^2$. Explain what is the association between the variables (e.g. $y=x^4$, $y=\sin(x)$) and why the correlation is low or zero despite there being a clear relationship between the variables.

In [None]:
# Generate Fake Data:
x_sample = np.linspace(-10,10,1000)
y_sample = np.cos(x_sample)

In [None]:
# Plot the fake data:
plt.clf()
plt.plot(x_sample, y_sample, 'o', alpha=0.5)

In [None]:
corr = np.corrcoef(x_sample, y_sample)[0,1]
print(f"Correlation between x_sample and y_sample: {corr}")

In our example, we chose the function $y = \cos(x)$. This is a periodic function, so the y value oscillates between -1 and 1 as x changes. However, the calculated correlation coefficient is almost 0. This is because the correlation coefficient only measures the linear relationship between the two variables, so it cannot capture a periodic pattern. In addition, $\cos(x)$ is an even function and our x values are symmetric around 0, so the stretches where y rises with x are cancelled exactly by the mirrored stretches where y falls.
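
To illustrate how much the result depends on symmetry, the sketch below compares the correlation for two even functions over the full symmetric range with the correlation for $y=x^4$ restricted to $x \ge 0$, where the relationship is monotonic. The variable names are our own, and the second function is just one more example of the kind suggested in the question.

```python
import numpy as np

x = np.linspace(-10, 10, 1000)

# Even functions on a symmetric range: y depends entirely on x,
# yet the linear correlation is essentially zero
r_cos = np.corrcoef(x, np.cos(x))[0, 1]
r_x4 = np.corrcoef(x, x**4)[0, 1]

# Restricting x**4 to x >= 0 makes it monotonic, so the correlation is high
# (still below 1, because the relationship is curved rather than linear)
x_pos = x[x >= 0]
r_x4_pos = np.corrcoef(x_pos, x_pos**4)[0, 1]

print(f"cos(x), full range: {r_cos:.3f}")
print(f"x^4, full range:    {r_x4:.3f}")
print(f"x^4, x >= 0 only:   {r_x4_pos:.3f}")
```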

# Question 3

## Simple regression
Following up on the exercises from Notebook 9, who do you think eats more vegetables, people with low income, or people with high income? To answer this question, make a scatter plot with vegetable servings versus income, then estimate the slope of the relationship between vegetable consumption and income.

What is the slope of the regression line? Write a sentence that explains what this slope means in the context of the question we are exploring.

Finally, plot the regression line on top of the scatter plot.

In [None]:
# Graphing the Scatter Plot
sns.scatterplot(x="INCOME2", y="_VEGESU1", data=brfss, alpha = 0.5)
plt.xlabel("Income Level")
plt.ylabel("Vegetable Consumption")
plt.title("Vegetable Consumption versus Income Level")

In [None]:
# Remove rows with a missing value in any column (linregress does not skip NaN values)
brfss = brfss.dropna()

In [None]:
# Perform linear regression between 'INCOME2' and '_VEGESU1'
res1 = linregress(brfss['INCOME2'], brfss['_VEGESU1'])

# Extract the slope and intercept from the regression results
slope, intercept = res1.slope, res1.intercept

# Convert the regression results to a dictionary format for easy viewing
res1._asdict()


In [None]:
sns.scatterplot(x="INCOME2", y="_VEGESU1", data=brfss, alpha = 0.5)

x_vals = np.array([brfss['INCOME2'].min(), brfss['INCOME2'].max()])
y_vals = intercept + slope * x_vals
plt.plot(x_vals, y_vals, color='red')

plt.xlabel("Income Level")
plt.ylabel("Vegetable Consumption")
plt.title("Vegetable Consumption versus Income Level")

## Interpretation of Linear Regression Results:

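Slope (0.06899): As INCOME2 increases by one category, _VEGESU1 is expected to increase by roughly 0.069 servings. In the context of our question, people in higher income groups eat slightly more vegetables on average than people in lower income groups.

Intercept (1.5356): When the income level is zero, the predicted value of vegetable consumption is approximately 1.5356. Since the income groups in our plots run from 1 to 8, zero lies outside the observed data, so the intercept mainly positions the line rather than describing a real group.

R-value (0.1172): The small, positive r value indicates a weak positive linear correlation between the variables.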
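
The slope and the r value are linked: for a simple regression, the least-squares slope equals $r \cdot s_y / s_x$, and $r^2$ is the share of the variance in y that the line accounts for. With our r of 0.1172, $r^2 \approx 0.0137$, so income explains only about 1.4% of the variation in vegetable consumption. The sketch below checks the slope identity on synthetic data; the generating values (1.5, 0.07, and the noise level) are invented and are not estimates from BRFSS.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Synthetic data: income codes 1-8 and a noisy, weakly increasing response
income = rng.integers(1, 9, size=500)
servings = 1.5 + 0.07 * income + rng.normal(0, 1.0, size=500)

res = linregress(income, servings)

# The least-squares slope equals r * std(y) / std(x)
slope_from_r = res.rvalue * servings.std() / income.std()

# Fraction of the variance in y accounted for by the line
r_squared = res.rvalue ** 2

print(f"slope from linregress: {res.slope:.4f}")
print(f"slope from r:          {slope_from_r:.4f}")
print(f"r squared:             {r_squared:.4f}")
```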

# Question 4

## AWS ML APIs: Part I
Find an AWS ML API from the ones available to you using your credits (see [this list](https://emory-my.sharepoint.com/:b:/g/personal/jajaco3_emory_edu/ERpKVDYVZt1AuwUMOp0dx8EBtNQI_JXQvDHPQcB8M8SDkA?e=x1TSgF) of all services available via credits).

Some of them will be hard to work with, so I would like you to try working with whichever interests your group members and then explain in your solution here which ML services you were interested in and why, and which you tried to use but couldn't, e.g. there may be a data format which was too difficult to work with, limited tutorials, etc.

Note, you may need access to AWS resources in some of the tutorials. For those, you will need to use AWS SageMaker and other AWS tools via the AWS Console. I will be happy to meet with your group to assist you on this part.

## Answer: AWS Translate

Our group is interested in working with the ML service Amazon Translate, which is a neural machine translation service for translating text to and from English across a breadth of supported languages. The machine translation engine has been trained on a wide variety of content across different domains to produce quality translations that serve any industry need.

The service has wide applications. For instance, it provides a managed, continually trained solution that we can use to translate unstructured text documents or to build applications that work in multiple languages. Amazon Translate can also automatically detect the language used in the source text by calling Amazon Comprehend.

We want to try the functions `translate_client.translate_text()`, which translates input text from the source language to the target language, and `translate_client.start_text_translation_job()`, through which we can input documents with different source languages and specify one or more target languages.

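However, our group wasn't able to call the functions from the Amazon Translate API via SageMaker. The role assigned to our SageMaker notebook (LabRole) is not authorized to perform the action, as the error message we received shows: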

      ClientError: An error occurred (AccessDeniedException) when calling the StartTextTranslationJob operation: User: arn:aws:sts::096611606074:assumed-role/LabRole/SageMaker is not authorized to perform: translate:StartTextTranslationJob because no identity-based policy allows the translate:StartTextTranslationJob action
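
For reference, here is a minimal sketch of how a call to `translate_text()` could be wrapped so that a permission problem produces a readable message instead of a traceback. The helper `translate_or_explain` is hypothetical, and the default language codes are only illustrative. The sketch reads the error code from the `response` attribute that botocore's `ClientError` carries. We have not confirmed whether LabRole allows `translate:TranslateText`; the error above only concerns `StartTextTranslationJob`.

```python
def translate_or_explain(client, text, source="en", target="es"):
    """Translate text, or return a readable message if permission is denied.

    `client` is expected to behave like boto3.client("translate").
    """
    try:
        response = client.translate_text(
            Text=text,
            SourceLanguageCode=source,
            TargetLanguageCode=target,
        )
        return response["TranslatedText"]
    except Exception as err:
        # botocore's ClientError stores the service error in err.response
        details = getattr(err, "response", None)
        error = details.get("Error", {}) if isinstance(details, dict) else {}
        if error.get("Code") == "AccessDeniedException":
            return "Permission denied: " + error.get("Message", "")
        # Anything else is unexpected, so let it propagate
        raise


# Usage inside SageMaker (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("translate")
#   print(translate_or_explain(client, "Hello, world"))
```

Passing the client in as an argument keeps the helper independent of how the session is created.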

# Question 5

## AWS ML APIs: Part II
In the [lecture notebook](https://github.com/jeremyallenjacobson/qtm350/blob/master/CourseAssets/Rekognition_notebook/AWS-ML-API.ipynb) we walked you through setting up Sagemaker in the Educate account in order to use the AWS ML APIs.

Your task in this question is to select another ML service from AWS and then, using the documentation, create a walkthrough similar to the one from lecture for the ML service that you picked.

In your walkthrough, demonstrate how to call the service from within a Sagemaker notebook using either the AWS CLI or the Python SDK (your pick).

As always, be sure to narrate your code with lots of text, images, and links to share. Explain which ML APIs you were interested in, which you tried but couldn't use, and potential use cases that you would like to explore as well as data sources you would like to apply them to.

Finally, because we can't run Sagemaker code in a colab notebook, convert your walkthrough notebook from this exercise into a .html file using `jupyter nbconvert` as the command line. Then, create an S3 bucket, upload the .html file to your bucket, and follow the instructions [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/HowDoIWebsiteConfiguration.html) for making the .html available as a webpage.

 Share the link to it in this notebook, so that I and other students can read your walkthrough.

## Answer: Walkthrough

[Here](https://newbusketforqtm350.s3.amazonaws.com/walkthrough.html) is the link to the notebook in the html format.