### Importing Important Libraries
#### Steps To Be Followed
1. Importing necessary Libraries
2. Creating S3 bucket
3. Mapping train And Test Data in S3
4. Mapping The path of the models in S3

In [99]:
import sagemaker  # SageMaker SDK; this project uses the built-in XGBoost algorithm, which SageMaker ships as a container image
import boto3  # AWS SDK for Python; lets Python code, including code on a local machine, read from and write to S3 buckets
from sagemaker.amazon.amazon_estimator import get_image_uri  # get_image_uri looks up the URI of the built-in XGBoost container image
from sagemaker.session import s3_input, Session  # s3_input describes training data stored in S3; a Session is needed to work with SageMaker instances


Each import is explained below:

1. `import sagemaker`: This line imports the `sagemaker` library, which provides the necessary tools and functionalities for working with Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models.

2. `import boto3`: This line imports the `boto3` library, which is the official AWS SDK (Software Development Kit) for Python. It allows you to interact with various AWS services, including Amazon SageMaker.

3. `from sagemaker.amazon.amazon_estimator import get_image_uri`: This line imports the `get_image_uri` function from the `amazon_estimator` module within the `sagemaker.amazon` package. The `get_image_uri` function is used to retrieve the container image URI (Uniform Resource Identifier) for a specific built-in SageMaker algorithm.

4. `from sagemaker.session import s3_input, Session`: This line imports the `s3_input` class and the `Session` class from the `session` module of the `sagemaker` package. The `s3_input` class is used to create input channels for training a model using data stored in Amazon S3 (in SageMaker SDK v2 it is renamed to `sagemaker.inputs.TrainingInput`). The `Session` class represents an active SageMaker session and provides methods for working with SageMaker resources.

In summary, these import statements bring in the necessary libraries and modules required for working with SageMaker, including interacting with AWS services, accessing built-in algorithms, and managing sessions and data inputs.

In [100]:
bucket_name = 'bankapplication126'  # change this value to a globally unique name for your own bucket
# Look up the region of the current session. This notebook runs in us-east-1 (N. Virginia).
# Pick the region closest to where the service is used, e.g. a European region for users in Europe, to get faster response times.
my_region = boto3.session.Session().region_name
print(my_region)

us-east-1


The code snippet provided sets the variable `bucket_name` to a string value representing the name of an Amazon S3 bucket. It also retrieves the AWS region name using the `boto3` library and prints it.

Let's break down the code further:

1. `bucket_name = 'bankapplication126'`: This line assigns the string value `'bankapplication126'` to the variable `bucket_name`. This represents the name of the Amazon S3 bucket that will be used in the code. You can change this value to a unique name of your choice, following the rules for bucket naming conventions in Amazon S3.

2. `my_region = boto3.session.Session().region_name`: This line retrieves the AWS region name using the `boto3` library. The `boto3.session.Session()` creates a new session object, and `.region_name` retrieves the name of the current AWS region associated with the session.

3. `print(my_region)`: This line prints the value of the `my_region` variable, which represents the AWS region name. The region name is typically a string representing the geographical location of the AWS resources being used.

By printing the AWS region name, you can verify which region is currently set in the session. This information can be useful when working with AWS services that are region-specific, such as Amazon S3 or Amazon SageMaker.
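Because bucket creation fails on an invalid name, it can help to check the name locally first. The sketch below covers only a simplified subset of the S3 naming rules (length, allowed characters, start and end characters, no adjacent dots, not an IP address). The helper name `looks_like_valid_bucket_name` is hypothetical and not part of `boto3`.

```python
import re

# Simplified subset of the S3 bucket naming rules (an assumption of this sketch):
# 3-63 characters; lowercase letters, digits, dots and hyphens only;
# must start and end with a letter or digit.
_BUCKET_PATTERN = r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]"
_IPV4_PATTERN = r"\d{1,3}(\.\d{1,3}){3}"


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified checks above."""
    if re.fullmatch(_BUCKET_PATTERN, name) is None:
        return False
    if ".." in name:  # adjacent dots are not allowed
        return False
    if re.fullmatch(_IPV4_PATTERN, name) is not None:  # must not look like an IP address
        return False
    return True


print(looks_like_valid_bucket_name("bankapplication126"))
print(looks_like_valid_bucket_name("Bank_Application"))
```

A name that passes these checks can still be rejected by S3 if another account already owns it, since bucket names are globally unique.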

In [101]:
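# create the S3 bucket from code
s3 = boto3.resource('s3')
try:
    if my_region == 'us-east-1':
        # us-east-1 is the default region and is created without a LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        # every other region has to be named explicitly
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)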

S3 bucket created successfully


The goal is to train the model here and save it in the S3 bucket. When new data arrives, I retrain the model and store the new version in the same bucket, so the bucket keeps a version history of the model.

In [102]:
# set an output path where the trained model will be saved
prefix = 'Xgboost-as-a-built-in-algo'  # folder (key prefix) inside the bucket for everything related to this XGBoost model
output_path = 's3://{}/{}/output'.format(bucket_name, prefix)  # S3 URI of the form s3://<bucket_name>/<prefix>/output, where the trained model is saved
print(output_path)
     

s3://bankapplication126/Xgboost-as-a-built-in-algo/output


- The output above is the output path: once the model is trained, it is saved inside this output folder. Each time the model is retrained, a new version is stored in its own folder.
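To make the folder layout concrete, here is a minimal sketch that composes the S3 URI of a model archive from the bucket, prefix, and training job name. The layout `<output_path>/<job_name>/output/model.tar.gz` is taken from the object keys listed in the deletion output at the end of this notebook. The helper name `model_artifact_uri` is hypothetical.

```python
def model_artifact_uri(bucket: str, prefix: str, job_name: str) -> str:
    """Compose the S3 URI where a training job's model archive is expected."""
    output_path = "s3://{}/{}/output".format(bucket, prefix)
    # Each training job writes under a folder named after the job.
    return "{}/{}/output/model.tar.gz".format(output_path, job_name)


uri = model_artifact_uri(
    "bankapplication126",
    "Xgboost-as-a-built-in-algo",
    "sagemaker-xgboost-2023-10-05-10-04-22-586",  # job name from this notebook's training log
)
print(uri)
```

Because the job name carries a timestamp, every retraining run lands in a different folder and earlier models are not overwritten.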

### Downloading The Dataset And Storing in S3

In [103]:
import pandas as pd
import urllib.request  # importing the submodule explicitly makes urllib.request.urlretrieve available
try:
    urllib.request.urlretrieve("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
    print('Success: downloaded bank_clean.csv.')
except Exception as e:
    print('Data load error: ',e)

try:
    model_data = pd.read_csv('./bank_clean.csv',index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ',e)


Success: downloaded bank_clean.csv.
Success: Data loaded into dataframe.


In [104]:
model_data.shape

(41188, 61)

The two cells above work as follows:

1. The code first imports the necessary libraries. `pandas` is imported as `pd` to provide data manipulation and analysis tools. `urllib` is imported to handle the URL retrieval.

2. The code then attempts to download a CSV file from a specified URL using the `urlretrieve` function from `urllib.request`. The URL points to a file named `bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv`. If the download is successful, it saves the file as `bank_clean.csv`.

3. Inside a try-except block, the code checks if the download was successful or if any exception occurred during the process. If the download is successful, it prints the message "Success: downloaded bank_clean.csv." If an exception occurs, it prints the error message "Data load error" along with the specific error.

4. The code then attempts to read the downloaded CSV file into a pandas DataFrame using the `read_csv` function. The `index_col=0` parameter specifies that the first column of the CSV should be used as the index of the DataFrame.

5. Inside another try-except block, the code checks if the data loading into the DataFrame was successful or if any exception occurred. If the data is loaded successfully, it prints the message "Success: Data loaded into dataframe." If an exception occurs, it prints the error message "Data load error" along with the specific error.

Overall, this code downloads a CSV file from a specified URL and reads it into a pandas DataFrame, providing a way to access and work with the data in the file.

In [105]:
# Train/test split
# Split model_data into train and test sets so that both can be saved to the S3 bucket and reused as often as needed later
import numpy as np
train_data,test_data=np.split(model_data.sample(frac=1,random_state=1729),[int(0.7 * len(model_data))])
print(train_data.shape,test_data.shape)

(28831, 61) (12357, 61)
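The split sizes follow directly from the 70% cut point, `int(0.7 * 41188) = 28831`. The sketch below reproduces the shuffle-then-cut idea with the standard library only. It does not reproduce the exact row order that `DataFrame.sample` produces, and the helper name `shuffle_split` is hypothetical.

```python
import random


def shuffle_split(rows, train_fraction=0.7, seed=1729):
    """Shuffle a copy of `rows` and cut it at the train_fraction boundary."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(train_fraction * len(shuffled))  # int() truncates toward zero
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 41188 rows of model_data.
train_rows, test_rows = shuffle_split(range(41188))
print(len(train_rows), len(test_rows))
```

Fixing the seed matters here for the same reason `random_state=1729` is passed in the notebook: rerunning the cell yields the same train and test sets.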


### Important
When working with Amazon SageMaker's built-in XGBoost algorithm and CSV data, the dependent (target) column must be the first column, and the file must not contain a header row.
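A minimal sketch of this layout rule, using only the standard library: the label is moved to the front, the redundant `y_no` column is dropped, and no header row is written. The toy records and the helper name `rows_with_label_first` are made up for illustration. Only the column names `y_yes` and `y_no` come from the notebook.

```python
import csv
import io


def rows_with_label_first(records, label, drop=()):
    """Yield each record as a list with `label` first and `drop` columns removed."""
    for record in records:
        features = [v for k, v in record.items() if k != label and k not in drop]
        yield [record[label]] + features


# Toy records; the values are invented.
records = [
    {"age": 41, "y_no": 1, "y_yes": 0},
    {"age": 29, "y_no": 0, "y_yes": 1},
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows_with_label_first(records, label="y_yes", drop=("y_no",)))
print(buffer.getvalue())
```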

In [106]:
### Saving Train And Test Into Buckets
## We start with Train Data
import os
# Save the train data to a CSV file
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)],  axis=1).to_csv('train.csv', index=False, header=False)
                                                
                                               

# Upload the train data to S3 Bucket
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')

# During training, SageMaker reads the data from this S3 location
# Create an S3 input channel for the train data
s3_input_train = s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


The main steps of the cell above are:

1. `pd.concat([...], axis=1).to_csv('train.csv', index=False, header=False)`: This line concatenates the `'y_yes'` column from the `train_data` DataFrame with the remaining columns (excluding `'y_no'` and `'y_yes'`) using `pd.concat()`. The resulting DataFrame is then saved as a CSV file named `'train.csv'` using the `to_csv()` method. The `index=False` and `header=False` arguments ensure that the index and header are not written to the CSV file.

2. `boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')`: This line uses the `boto3` library to upload the `'train.csv'` file to an Amazon S3 bucket. It creates an S3 resource using a new session with `boto3.Session().resource('s3')`. Then, it specifies the bucket to which the file should be uploaded using `Bucket(bucket_name)`. The `os.path.join(prefix, 'train/train.csv')` constructs the S3 object key by joining the `prefix` (which represents the directory path within the bucket) with the file name. Finally, the `upload_file()` method is called to upload the local file `'train.csv'` to the specified S3 object.

3. `s3_input_train = s3_input(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')`: This line creates an S3 input channel for the train data using the `s3_input` class. The `s3_data` argument specifies the S3 location of the data, which is constructed by formatting the bucket name and prefix into the S3 URI `'s3://{}/{}/train'`. The `content_type` argument specifies the type of data, which is set to `'csv'` in this case.

Overall, this code snippet saves the train data from a DataFrame to a CSV file, uploads the file to an Amazon S3 bucket, and then creates an S3 input channel for the train data using the `s3_input` class. This process prepares the data for training a machine learning model using Amazon SageMaker.
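One detail worth noting: S3 object keys always use `/` as the separator, while `os.path.join` uses the separator of the local operating system. On a SageMaker notebook instance (Linux) the two agree. The sketch below shows an OS-independent alternative with `posixpath`. The helper names `s3_key` and `s3_uri` are hypothetical.

```python
import posixpath


def s3_key(prefix: str, *parts: str) -> str:
    """Join key segments with '/' regardless of the local operating system."""
    return posixpath.join(prefix, *parts)


def s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI from a bucket name and an object key."""
    return "s3://{}/{}".format(bucket, key)


train_key = s3_key("Xgboost-as-a-built-in-algo", "train", "train.csv")
print(train_key)
# The input channel points at the folder that contains the file, not at the file itself.
print(s3_uri("bankapplication126", posixpath.dirname(train_key)))
```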

In [107]:
### Saving Train And Test Into Buckets
## Now the Test Data
import os
# Save the test data to a CSV file
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)],  axis=1).to_csv('test.csv', index=False, header=False)
                                                
                                               

# Upload the test data to S3 Bucket
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')

# During training, SageMaker reads the validation data from this S3 location
# Create an S3 input channel for the test data
s3_input_test = s3_input(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')

The class sagemaker.session.s3_input has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


### Building the Model with XGBoost, a Built-in Algorithm

In [108]:
# look up the URI of the built-in XGBoost container image for the current region
# set repo_version to the XGBoost version you prefer
container=get_image_uri(boto3.Session().region_name,'xgboost',repo_version='1.0-1')
# SageMaker ships its built-in algorithms as container images, so the image has to be pulled onto the instance that runs the training

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [115]:
# initialize hyperparameters
hyperparameters={
    "max_depth":"5",
    "eta":"0.2",
    "gamma":"4",
    "min_child_weight":"6",
    "subsample":"0.7",
    "objective":"binary:logistic", # because this is a binary classification problem
    "num_round":50
}

Running hyperparameter tuning in SageMaker takes a long time, so I ran the tuning in my local environment and copied the resulting values here. All of these hyperparameters are specific to XGBoost.


**"max_depth": "5":**

This hyperparameter controls the maximum depth of each tree in the XGBoost ensemble. It limits the depth of the individual decision trees, helping to prevent overfitting. Setting it to 5 means that each tree can have a maximum depth of 5.

**"eta": "0.2":**

Also known as the learning rate, this hyperparameter controls the step size at each iteration while moving toward a minimum of the loss function. A lower learning rate can make the model's training more robust but may require more iterations.

**"gamma": "4":**

Gamma is the minimum loss reduction required to make a further split on a leaf node. It acts as a regularization term that penalizes the complexity of the tree: a higher gamma value leads to more aggressive pruning.

**"min_child_weight": "6":**

This hyperparameter specifies the minimum sum of instance weight (hessian) needed in a child. It can be used to control over-fitting. A higher value makes the algorithm more conservative.

**"subsample": "0.7":**

Subsample is a fraction of the training data that is randomly sampled to grow trees during each boosting round. Setting it to 0.7 means that 70% of the training data will be used for each boosting round.

**"objective": "binary:logistic":**

This specifies the learning task and the corresponding objective function. In this case, it's set to "binary:logistic," indicating binary classification: XGBoost minimizes the logistic loss and outputs a probability between 0 and 1.

**"num_round": 50:**

This hyperparameter specifies the number of boosting rounds or iterations for the training process. The algorithm will perform 50 iterations to build the ensemble of decision trees.

- It's important to note that this code snippet primarily defines hyperparameters and their values. To perform hyperparameter tuning using SageMaker, you would typically create a SageMaker Estimator object, set up a HyperparameterTuner to search for the best combination of hyperparameters, and then start the tuning job. The SageMaker platform would take care of the distributed training and optimization of hyperparameters for you.
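To illustrate what the `binary:logistic` objective means for the model's output, here is a minimal sketch of the logistic (sigmoid) function, which maps a raw additive score to a probability, followed by a 0.5 cutoff. This is a generic illustration of the technique, not code from the XGBoost library. The names `sigmoid` and `to_class` are hypothetical.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw additive score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def to_class(probability: float, threshold: float = 0.5) -> int:
    """Turn a probability into a 0/1 label using a cutoff."""
    return 1 if probability >= threshold else 0


for margin in (-3.0, 0.0, 3.0):
    p = sigmoid(margin)
    print(margin, round(p, 3), to_class(p))
```

The 0.5 cutoff is a convention, not a requirement. On an imbalanced dataset a lower threshold trades precision for recall.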


In [119]:
# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=container,  # the XGBoost container image looked up above
                                          hyperparameters=hyperparameters,  # the key-value pairs from the dictionary above
                                          role=sagemaker.get_execution_role(),  # IAM role of this instance; it lets the training job read the train and test data from the S3 bucket
                                          train_instance_count=1,
                                          train_instance_type='ml.m5.2xlarge',  # chosen so that training finishes quickly
                                          train_volume_size=5,  # 5 GB
                                          output_path=output_path,  # where the model is saved: s3://bankapplication126/Xgboost-as-a-built-in-algo/output
                                          train_use_spot_instances=True,  # spot instances and the two limits below reduce cost, because SageMaker bills by running time
                                          train_max_run=300,
                                          train_max_wait=600)

# the estimator is now configured with the XGBoost image; the next step is to fit it

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_max_run has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_use_spot_instances has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_max_wait has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_volume_size has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
NOTEBOOK_METADATA_FILE detected but failed to get valid domain and user from it.


The code above constructs a SageMaker estimator that calls the XGBoost container to train a machine learning model.

Here is a breakdown of the code:

1. `estimator = sagemaker.estimator.Estimator(...)`: This line creates an instance of the `Estimator` class from the SageMaker SDK.

2. `image_uri=container`: The `image_uri` parameter specifies the URI of the container image to use for training. In this case, it is the URI of the XGBoost container image.

3. `hyperparameters=hyperparameters`: The `hyperparameters` parameter is used to specify the hyperparameters for the XGBoost algorithm. Hyperparameters are passed as key-value pairs in a dictionary.

4. `role=sagemaker.get_execution_role()`: The `role` parameter specifies the IAM role that SageMaker will assume to perform tasks on your behalf, such as accessing S3 buckets for data and saving the model artifacts.

5. `train_instance_count=1`: The `train_instance_count` parameter specifies the number of instances to use for training. In this case, it is set to 1, meaning training will be performed on a single instance.

6. `train_instance_type='ml.m5.2xlarge'`: The `train_instance_type` parameter specifies the type of instance to use for training. In this case, it is set to `ml.m5.2xlarge`, which is a specific instance type optimized for general-purpose machine learning workloads.

7. `train_volume_size=5`: The `train_volume_size` parameter specifies the size (in GB) of the EBS volume to use for storing data during training. In this case, it is set to 5 GB.

8. `output_path=output_path`: The `output_path` parameter specifies the S3 location where the model artifacts and training output should be saved.

9. `train_use_spot_instances=True`: The `train_use_spot_instances` parameter specifies whether to use Amazon EC2 Spot Instances for training. Spot Instances can significantly reduce the cost of training, but a job may be interrupted when AWS reclaims the capacity.

10. `train_max_run=300`: The `train_max_run` parameter specifies the maximum time in seconds that training can run before it is stopped. In this case, it is set to 300 seconds (5 minutes).

11. `train_max_wait=600`: The `train_max_wait` parameter specifies the maximum total time in seconds allowed for a spot training job, covering both the wait for spot capacity and the training run itself, so it must be at least `train_max_run`. If the job has not finished within this time, it is terminated. In this case, it is set to 600 seconds (10 minutes).
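The warnings printed after the estimator cell say that several of these parameter names were renamed in `sagemaker>=2`. The sketch below maps the old names to the v2 names as I understand them from the SDK v2 migration guide. Treat the mapping as an assumption and check it against your installed SDK version. The helper `to_v2_kwargs` is hypothetical and does not call SageMaker.

```python
# Old (v1) keyword -> new (v2) keyword; assumed from the SDK v2 migration guide.
V1_TO_V2 = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_use_spot_instances": "use_spot_instances",
    "train_max_run": "max_run",
    "train_max_wait": "max_wait",
}


def to_v2_kwargs(kwargs: dict) -> dict:
    """Return a copy of `kwargs` with v1 Estimator names replaced by v2 names."""
    return {V1_TO_V2.get(name, name): value for name, value in kwargs.items()}


# The values used in this notebook's estimator cell.
v1_kwargs = {
    "train_instance_count": 1,
    "train_instance_type": "ml.m5.2xlarge",
    "train_volume_size": 5,
    "train_use_spot_instances": True,
    "train_max_run": 300,
    "train_max_wait": 600,
}
v2_kwargs = to_v2_kwargs(v1_kwargs)
print(v2_kwargs)
```

Names that are not in the mapping, such as `output_path`, pass through unchanged.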


In [120]:
estimator.fit({'train':s3_input_train,'validation':s3_input_test}) # s3_input_train and s3_input_test point to the train and validation data in S3

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-10-05-10-04-22-586


2023-10-05 10:04:22 Starting - Starting the training job...
2023-10-05 10:04:37 Starting - Preparing the instances for training......
2023-10-05 10:05:39 Downloading - Downloading input data......
2023-10-05 10:06:51 Training - Training image download completed. Training in progress.
2023-10-05 10:06:51 Uploading - Uploading generated training model
[2023-10-05 10:06:42.105 ip-10-2-252-161.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','

### Deploy Machine Learning Model As Endpoint

In [122]:
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m4.xlarge') # initial_instance_count is the number of instances behind the endpoint (more instances can serve requests in parallel); instance_type is the hardware that hosts the model, here ml.m4.xlarge

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2023-10-05-10-16-45-412
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2023-10-05-10-16-45-412
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2023-10-05-10-16-45-412


-------!

After training, the model file appears in the S3 bucket under bankapplication126 > Xgboost-as-a-built-in-algo > output.
Every time the model is retrained on new data, a new model file is created in a folder named with the training job's timestamp.

### Prediction of the Test Data

In [127]:
# Convert test_data to CSV because the endpoint accepts CSV input; the label columns are dropped so that only features are sent
test_data_csv = test_data.drop(['y_no', 'y_yes'], axis=1).to_csv(index=False, header=False)

# Set the data type for inference
xgb_predictor.content_type = 'text/csv'

# Predict
predictions = xgb_predictor.predict(test_data_csv).decode('utf-8') # the endpoint returns encoded bytes, so decode them to a string

# Convert predictions to an array
predictions_array = np.fromstring(predictions[1:], sep=',') # skip the first character of the response, then parse the comma-separated scores into floats

print(predictions_array.shape)

(12357,)


In [128]:
predictions_array # predicted purchase probabilities for the test data

array([0.05214286, 0.05660191, 0.05096195, ..., 0.03436061, 0.02942475,
       0.03715819])
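Each value in the array above is a probability, not a class label. A minimal sketch of the two remaining steps, parsing a comma-separated response and applying a 0.5 cutoff, is shown below with the standard library only. The payload is hypothetical: its first three values are copied from the output above and the fourth is invented. The helper names are also hypothetical.

```python
def parse_predictions(payload: str) -> list:
    """Parse a comma-separated string of scores into a list of floats."""
    return [float(token) for token in payload.strip().split(",") if token.strip()]


def to_labels(scores, threshold: float = 0.5) -> list:
    """Label a score 1 when it is at or above the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical payload; the last value is invented to show a positive prediction.
payload = "0.05214286,0.05660191,0.05096195,0.81"
scores = parse_predictions(payload)
print(scores)
print(to_labels(scores))
```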

In [129]:
# confusion matrix; this code is taken from the AWS tutorial page
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predicted'])
tn = cm.iloc[0,0]; fn = cm.iloc[1,0]; tp = cm.iloc[1,1]; fp = cm.iloc[0,1]; p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100,tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100,fn, tp/(tp+fp)*100, tp))


Overall Classification Rate: 89.7%

Predicted      No Purchase    Purchase
Observed
No Purchase    91% (10785)    34% (151)
Purchase        9% (1124)     66% (297) 



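### Observation
The overall classification rate (89.7%) looks high, but it is misleading because the dataset is imbalanced: most customers did not purchase. The model identifies only 297 of the 1,421 observed purchases.
In the table, rows are the observed (actual) values and columns are the predicted values.
Model performance can be improved with hyperparameter tuning.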
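To see how the class imbalance affects the numbers, the sketch below recomputes a few standard metrics from the four counts in the confusion matrix printed above. The counts are copied from that output. The metric definitions are the usual ones, not something specific to this notebook.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 10785, 151   # observed "No Purchase" row
fn, tp = 1124, 297    # observed "Purchase" row

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of the predicted purchases, how many were real
recall = tp / (tp + fn)       # of the real purchases, how many were found
baseline = (tn + fp) / total  # accuracy of always predicting "No Purchase"

print("accuracy  {:.1%}".format(accuracy))
print("precision {:.1%}".format(precision))
print("recall    {:.1%}".format(recall))
print("baseline  {:.1%}".format(baseline))
```

Comparing `accuracy` with `baseline` shows how little of the headline figure comes from the model itself. On data like this, recall and precision for the purchase class are more informative.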

### Important
**Do not run this code again and again. Once an endpoint has been created it must be deleted when you are done, so use the code below to delete it.**

In [131]:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete=boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()

See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
INFO:sagemaker:Deleting endpoint with name: sagemaker-xgboost-2023-10-05-10-16-45-412


[{'ResponseMetadata': {'RequestId': '8D0MN301X9PQJFBA',
   'HostId': 'OUNhqudG/uhbEYnUOMnUMD0LDawvnwMfBURc3zy1SHj3hiAZAIdmO5QkRq+A4gH6UnyZ3lk76qg=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'OUNhqudG/uhbEYnUOMnUMD0LDawvnwMfBURc3zy1SHj3hiAZAIdmO5QkRq+A4gH6UnyZ3lk76qg=',
    'x-amz-request-id': '8D0MN301X9PQJFBA',
    'date': 'Thu, 05 Oct 2023 10:57:38 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'Xgboost-as-a-built-in-algo/output/sagemaker-xgboost-2023-10-05-10-04-22-586/output/model.tar.gz'},
   {'Key': 'Xgboost-as-a-built-in-algo/output/sagemaker-xgboost-2023-10-05-10-04-22-586/debug-output/events/000000000020/000000000020_worker_0.tfevents'},
   {'Key': 'Xgboost-as-a-built-in-algo/output/sagemaker-xgboost-2023-10-05-10-04-22-586/debug-output/index/000000000/000000000000_worker_0.json'},
   {'Key': 'Xgboost-as-a-built-in-algo/output/

### Observation
We have now deployed the ML model in Amazon SageMaker, deleted the endpoint, and deleted the data stored in the S3 bucket.
Do not run the notebook from top to bottom unless you want to practice, because running it repeatedly incurs charges.

To keep the monthly AWS bill down, also delete the files left on the notebook instance, such as the .ipynb file, train.csv, and test.csv.
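The local data files can be removed from code as well. Here is a minimal sketch using only the standard library. The file names are the ones written by earlier cells, and the helper name `remove_if_present` is hypothetical. It skips files that do not exist, so it is safe to run more than once.

```python
import os


def remove_if_present(paths):
    """Delete each existing file in `paths`; return the paths actually removed."""
    removed = []
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


# Files written to the notebook's working directory by earlier cells.
print(remove_if_present(["bank_clean.csv", "train.csv", "test.csv"]))
```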