# Project: Sentiment Analysis for Movie Reviews using SageMaker

## Step 1: Data Preparation

Obtain a dataset of labeled movie reviews in which each review carries a positive or negative sentiment label.
Preprocess the text data by removing punctuation, lowercasing the text, and performing tokenization.
Split the dataset into training and testing sets.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import string

# Load the dataset of labeled movie reviews (assumes a CSV file with 'text' and 'label' columns)
data = pd.read_csv('movie_reviews.csv')

# Preprocessing: Remove punctuation, lowercase the text, and perform tokenization
def preprocess_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Lowercase the text
    text = text.lower()
    
    # Tokenization (split the text into individual words)
    tokens = text.split()
    
    return tokens

# Apply preprocessing to the text data
data['preprocessed_text'] = data['text'].apply(preprocess_text)

# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(data['preprocessed_text'], data['label'], test_size=0.2, random_state=42)

# Print the shapes of the training and testing sets
print("Training data shape:", train_data.shape)
print("Training labels shape:", train_labels.shape)
print("Testing data shape:", test_data.shape)
print("Testing labels shape:", test_labels.shape)

# Save the preprocessed data and labels as separate CSV files
train_data.to_csv('train_data.csv', index=False)
train_labels.to_csv('train_labels.csv', index=False)
test_data.to_csv('test_data.csv', index=False)
test_labels.to_csv('test_labels.csv', index=False)
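
The token lists produced above are not yet numeric, and a learner configured with a fixed `feature_dim` needs fixed-length numeric rows. One possible way to bridge that gap, sketched here as an assumption rather than as the method this project prescribes, is the hashing trick: each token is mapped to one of `feature_dim` buckets and counted. The helper names `hash_features` and `to_training_row` are hypothetical, the sample reviews are made up, and `feature_dim=100` only mirrors the placeholder value used in Step 2.

```python
import csv
import hashlib
import io


def hash_features(tokens, feature_dim=100):
    """Count tokens into a fixed-length vector using a stable hash."""
    vector = [0] * feature_dim
    for token in tokens:
        # hashlib yields the same bucket on every run, unlike the built-in hash()
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % feature_dim] += 1
    return vector


def to_training_row(label, tokens, feature_dim=100):
    """Build one row with the label first, followed by the features."""
    return [label] + hash_features(tokens, feature_dim)


# Tiny illustrative sample (made up for this sketch)
samples = [(1, ["a", "great", "movie"]), (0, ["a", "dull", "dull", "plot"])]

buffer = io.StringIO()
writer = csv.writer(buffer)
for label, tokens in samples:
    writer.writerow(to_training_row(label, tokens))  # no header row is written

rows = buffer.getvalue().splitlines()
print(len(rows), len(rows[0].split(",")))
```

The rows are written without a header and with the label in the first column, which is the CSV layout SageMaker's built-in algorithms generally expect for training data. Different tokens can collide in the same bucket, so a larger `feature_dim` trades memory for fewer collisions.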


## Step 2: Model Training

Create an Amazon S3 bucket to store the preprocessed data and model artifacts.
Use SageMaker's built-in algorithms or custom-built models for training.
Choose an algorithm such as the built-in Linear Learner, or a custom convolutional neural network (CNN) or recurrent neural network (RNN).
Configure hyperparameters like learning rate, batch size, and number of epochs.
Train the model using the labeled training data.

In [None]:
import sagemaker
from sagemaker import get_execution_role, image_uris
from sagemaker.inputs import TrainingInput

# Set up the SageMaker session and role
sagemaker_session = sagemaker.Session()
role = get_execution_role()

# Set up the S3 bucket to store data and model artifacts
bucket = 'your-s3-bucket-name'  # Replace with your S3 bucket name
prefix = 'sagemaker/sentiment-analysis'  # Prefix for the data and model artifacts in the bucket

# Upload the preprocessed data to the S3 bucket
train_data_s3 = sagemaker_session.upload_data(path='train_data.csv', bucket=bucket, key_prefix=prefix+'/train')
train_labels_s3 = sagemaker_session.upload_data(path='train_labels.csv', bucket=bucket, key_prefix=prefix+'/train')
test_data_s3 = sagemaker_session.upload_data(path='test_data.csv', bucket=bucket, key_prefix=prefix+'/test')
test_labels_s3 = sagemaker_session.upload_data(path='test_labels.csv', bucket=bucket, key_prefix=prefix+'/test')

# Configure and train the model using SageMaker's built-in Linear Learner algorithm.
# The calls below follow version 2 of the SageMaker Python SDK.
container = image_uris.retrieve(framework='linear-learner', region=sagemaker_session.boto_region_name)

linear_learner = sagemaker.estimator.Estimator(container,
                                               role,
                                               instance_count=1,
                                               instance_type='ml.c4.xlarge',
                                               output_path=f's3://{bucket}/{prefix}/output',
                                               sagemaker_session=sagemaker_session)

linear_learner.set_hyperparameters(feature_dim=100,  # Replace with the appropriate feature dimension
                                   predictor_type='binary_classifier',  # Assuming binary sentiment classification
                                   epochs=10,
                                   mini_batch_size=32)

# NOTE: for CSV input, Linear Learner expects numeric rows with no header and the
# label in the first column, so the uploaded files must already be in that layout.
train_input = TrainingInput(s3_data=train_data_s3, content_type='text/csv')
validation_input = TrainingInput(s3_data=test_data_s3, content_type='text/csv')

linear_learner.fit({'train': train_input, 'validation': validation_input})

# SageMaker writes the model artifacts under output_path automatically when training finishes
print("Model artifacts:", linear_learner.model_data)


### Explanation:

The code starts by importing the necessary modules from the SageMaker library.
The SageMaker session and execution role are set up using sagemaker.Session() and get_execution_role() respectively.
You need to specify your S3 bucket name in the bucket variable and a prefix for the data and model artifacts in the prefix variable.
The preprocessed data (train and test) and labels are uploaded to the S3 bucket using sagemaker_session.upload_data().
The example uses the built-in LinearLearner algorithm. A custom-built model is an alternative that is not shown in the code.
For the built-in LinearLearner algorithm, the code sets up the estimator object with the necessary configurations such as the container, role, instance type, and hyperparameters. It then fits the estimator to the training and validation data.
For a custom-built model, you would need to define and configure your model accordingly, and then train it using your preferred deep learning framework (e.g., TensorFlow or PyTorch).
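
To make the hyperparameters from this step concrete, here is a toy logistic-regression trainer in plain Python. It is only an illustration of what learning rate, number of epochs, and mini-batch size control in mini-batch gradient descent, and it is not how SageMaker's Linear Learner is implemented. The function names and the toy data are made up for this sketch.

```python
import math
import random


def sigmoid(z):
    # Clamp the input to avoid overflow in math.exp for extreme values
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=0.1, epochs=10, mini_batch_size=32, seed=0):
    """Fit weights and a bias with mini-batch gradient descent on log loss."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    indices = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new batch composition every epoch
        for start in range(0, len(indices), mini_batch_size):
            batch = indices[start:start + mini_batch_size]
            grad_w = [0.0] * n_features
            grad_b = 0.0
            for i in batch:
                score = sum(w * x for w, x in zip(weights, rows[i])) + bias
                error = sigmoid(score) - labels[i]
                for j, x in enumerate(rows[i]):
                    grad_w[j] += error * x
                grad_b += error
            # Average the gradient over the batch, then step against it
            weights = [w - learning_rate * g / len(batch) for w, g in zip(weights, grad_w)]
            bias -= learning_rate * grad_b / len(batch)
    return weights, bias


def predict(weights, bias, row):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    score = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy data (made up): feature 0 signals positive, feature 1 signals negative
rows = [[1, 0], [2, 0], [0, 1], [0, 2]] * 4
labels = [1, 1, 0, 0] * 4
weights, bias = train_logistic(rows, labels, learning_rate=0.5, epochs=50, mini_batch_size=4)
print([predict(weights, bias, r) for r in [[3, 0], [0, 3]]])
```

A larger learning rate takes bigger steps per batch, more epochs mean more passes over the data, and a smaller mini-batch size gives more frequent but noisier updates.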

## Step 3: Model Evaluation

Evaluate the trained model using the labeled testing data.
Calculate metrics such as accuracy, precision, recall, and F1 score to assess the model's performance.
Fine-tune the model if necessary based on the evaluation results.


In [None]:
import pandas as pd
import sagemaker
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Set up the SageMaker session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Set up the S3 bucket and prefix
bucket = 'your-s3-bucket-name'  # Replace with your S3 bucket name
prefix = 'sagemaker/sentiment-analysis'  # Prefix for the data and model artifacts in the bucket

# Load the trained model.
# NOTE: a training job normally writes its artifact to
# <output_path>/<training-job-name>/output/model.tar.gz, so adjust this URI
# (or reuse linear_learner.model_data from Step 2) to match your job.
model = sagemaker.LinearLearnerModel(model_data=f's3://{bucket}/{prefix}/output/model.tar.gz',
                                     role=role,
                                     sagemaker_session=sagemaker_session)

# Create a predictor object for model evaluation
predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

# Upload the evaluation data to S3 so it is stored alongside the model artifacts
test_data_s3 = sagemaker_session.upload_data(path='test_data.csv', bucket=bucket, key_prefix=prefix + '/test')
test_labels_s3 = sagemaker_session.upload_data(path='test_labels.csv', bucket=bucket, key_prefix=prefix + '/test')

# Configure the predictor to send CSV input and parse the JSON response
predictor.serializer = CSVSerializer()
predictor.deserializer = JSONDeserializer()

# Load the test features and labels from the local CSV files.
# NOTE: predict() takes the feature rows themselves, not an S3 URI. The rows are
# assumed to be numeric vectors of length feature_dim.
test_features = pd.read_csv('test_data.csv').values.tolist()
actual_labels = pd.read_csv('test_labels.csv').iloc[:, 0].tolist()

# Perform model evaluation
results = predictor.predict(test_features)

# Linear Learner returns {"predictions": [{"score": ..., "predicted_label": ...}, ...]}
predicted_labels = [int(p['predicted_label']) for p in results['predictions']]

# Calculate evaluation metrics
accuracy = accuracy_score(actual_labels, predicted_labels)
precision = precision_score(actual_labels, predicted_labels)
recall = recall_score(actual_labels, predicted_labels)
f1 = f1_score(actual_labels, predicted_labels)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

# Delete the predictor endpoint
sagemaker_session.delete_endpoint(predictor.endpoint_name)

### Explanation:

The code begins by importing the necessary modules from the SageMaker library.
The SageMaker session and execution role are set up using sagemaker.Session() and get_execution_role() respectively.
You need to specify your S3 bucket name in the bucket variable and a prefix for the data and model artifacts in the prefix variable.
The trained model is loaded using sagemaker.LinearLearnerModel(), specifying the S3 location of the model artifacts.
A predictor object is created by deploying the model to a SageMaker endpoint using model.deploy().
The test data and labels are uploaded to the S3 bucket using sagemaker_session.upload_data().
The predictor is configured to accept CSV input and perform predictions on the test data using predictor.predict().
The predicted results are converted to a list of labels, and the actual labels are loaded from the CSV file.
Evaluation metrics such as accuracy, precision, recall, and F1 score are calculated using appropriate functions (e.g., precision_score(), recall_score(), f1_score()).
The evaluation metrics are printed, and the predictor endpoint is deleted using sagemaker.Session().delete_endpoint().
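
For readers who want to see what the metric functions compute, the following sketch derives accuracy, precision, recall, and F1 from the confusion counts in plain Python. The response dictionary is hypothetical: it is assumed to follow the `predictions` / `predicted_label` shape that Linear Learner documents for binary classifiers, and the scores and labels are made up. `extract_labels` and `binary_metrics` are illustrative names, not library functions.

```python
def extract_labels(response):
    """Pull predicted labels out of a Linear Learner style JSON response."""
    return [int(p["predicted_label"]) for p in response["predictions"]]


def binary_metrics(actual, predicted):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical endpoint response and ground truth, made up for illustration
response = {"predictions": [
    {"score": 0.91, "predicted_label": 1},
    {"score": 0.72, "predicted_label": 1},
    {"score": 0.15, "predicted_label": 0},
    {"score": 0.40, "predicted_label": 0},
]}
actual = [1, 0, 0, 1]

metrics = binary_metrics(actual, extract_labels(response))
print(metrics)
```

Precision answers "of the reviews predicted positive, how many were positive", while recall answers "of the positive reviews, how many were found". F1 is their harmonic mean.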

## Step 4: Model Deployment

Deploy the trained model as an endpoint using SageMaker.
Configure the endpoint to handle incoming text data.

In [None]:
import sagemaker

# Set up the SageMaker session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Set up the S3 bucket and prefix
bucket = 'your-s3-bucket-name'  # Replace with your S3 bucket name
prefix = 'sagemaker/sentiment-analysis'  # Prefix for the data and model artifacts in the bucket

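# Load the trained model.
# NOTE: a training job normally writes its artifact to
# <output_path>/<training-job-name>/output/model.tar.gz, so adjust the URI below to match your job.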
model = sagemaker.LinearLearnerModel(model_data=f's3://{bucket}/{prefix}/output/model.tar.gz',
                                     role=role,
                                     sagemaker_session=sagemaker_session)

# Deploy the model as an endpoint
predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

# Configure the predictor to handle incoming text data
predictor.serializer = sagemaker.serializers.CSVSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

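# Example usage of the endpoint
input_text = 'This is a great movie!'
# NOTE: Linear Learner scores numeric feature vectors, so the raw text must first
# go through the same preprocessing and feature extraction used for training.
# The all-zero vector below is only a placeholder of length feature_dim (100).
input_features = [0.0] * 100
response = predictor.predict(input_features)
print(response)

# Delete the predictor endpoint
sagemaker_session.delete_endpoint(predictor.endpoint_name)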

### Explanation:

The code begins by importing the necessary modules from the SageMaker library.
The SageMaker session and execution role are set up using sagemaker.Session() and get_execution_role() respectively.
You need to specify your S3 bucket name in the bucket variable and a prefix for the data and model artifacts in the prefix variable.
The trained model is loaded using sagemaker.LinearLearnerModel(), specifying the S3 location of the model artifacts.
The model is deployed as an endpoint using model.deploy(), specifying the number of instances and instance type for the endpoint.
The predictor object is created for making predictions using the deployed endpoint.
The predictor is configured to handle incoming text data by setting the serializer to CSVSerializer() to accept CSV formatted input and the deserializer to JSONDeserializer() to return predictions in JSON format.
An example usage of the endpoint is shown, where input_text contains the text to be analyzed, and the response from the predictor is printed.
Finally, the endpoint is deleted using sagemaker_session.delete_endpoint() to avoid incurring unnecessary costs.
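
Because a running endpoint is billed until it is deleted, it can help to make the cleanup unconditional. The sketch below wraps deployment in a context manager so that `delete_endpoint()` is called even if a prediction raises an error. `temporary_endpoint` and `FakePredictor` are hypothetical names, and the fake class only stands in for a real predictor so the example runs without AWS access.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(deploy):
    """Yield a predictor and always delete its endpoint afterwards."""
    predictor = deploy()
    try:
        yield predictor
    finally:
        # Runs on normal exit and when the with-block raises
        predictor.delete_endpoint()


class FakePredictor:
    """Stand-in for a real predictor so the sketch runs without AWS access."""

    def __init__(self):
        self.deleted = False

    def predict(self, features):
        # Canned response, made up for illustration
        return {"predictions": [{"score": 0.5, "predicted_label": 1}]}

    def delete_endpoint(self):
        self.deleted = True


fake = FakePredictor()
try:
    with temporary_endpoint(lambda: fake) as predictor:
        predictor.predict([0.0] * 100)
        raise RuntimeError("simulated failure after deployment")
except RuntimeError:
    pass

print(fake.deleted)
```

With a real model, the `deploy` argument could be something like `lambda: model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')`, since SageMaker predictor objects expose a `delete_endpoint()` method.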


## Resources

AWS SageMaker documentation: Provides detailed instructions and code examples for training and deploying machine learning models using SageMaker.
Link: https://docs.aws.amazon.com/sagemaker/

AWS Developer Guide: Covers various aspects of machine learning on AWS, including data preparation, model training, and deployment.
Link: https://aws.amazon.com/developers/guides/

AWS Samples GitHub repository: Contains sample code and notebooks for different AWS services, including SageMaker.
Link: https://github.com/aws-samples

Amazon SageMaker Examples: Provides a collection of Jupyter notebooks with step-by-step guides for various machine learning tasks, including NLP.
Link: https://github.com/aws/amazon-sagemaker-examples