## Python Notebook for Visualizing Your AWS Doc Statistics

This notebook provides a quick way to generate bar charts and line charts that visualize CSAT data over time.

### Prerequisites:
0. Install the required packages (uncomment the `pip install` lines in the code cell below if you don't have them yet), then execute the cell to import them.
1. Put the csv files for the months (or weeks) of interest into a folder called "files" in the same directory as this Jupyter notebook.
2. In the **Setup** section below, update the `file_names` list to contain the names of your csv files. Make sure that the files are in **chronological order**!
3. In the **Setup** section below, update the `months` list to contain one label for each period that your csv files cover. Make sure that the labels are in **chronological order**!

In [2]:
# Uncomment the pip lines below to install the packages, then import them
#!pip install --upgrade pip
#!pip install pandas
#!pip install plotly
import pandas as pd
import plotly.graph_objects as go

## Setup
Update this section based on how many csv files you're using. 

In [3]:
# Name of the directory that contains your csv files
folder = 'files/'

# Add or remove file names here as needed. Make sure that they are in chronological order!
file_names = [
    #'Monthly20200301-20200331-sagemaker dg-default.csv',
    #'Monthly20200401-20200430-sagemaker dg-default.csv',
    #'Monthly20200501-20200531-sagemaker dg-default.csv',
    #'Monthly20200701-20200731-sagemaker dg-default.csv',
    #'Monthly20200801-20200831-sagemaker dg-default.csv',
    #'Monthly20200901-20200930-sagemaker dg-default.csv',
    'Weekly20200726-20200801-sagemaker dg-default.csv',
    'Weekly20200802-20200808-sagemaker dg-default.csv',
    'Weekly20200809-20200815-sagemaker dg-default.csv',
    'Weekly20200816-20200822-sagemaker dg-default.csv',
    'Weekly20200823-20200829-sagemaker dg-default.csv',
    'Weekly20200830-20200905-sagemaker dg-default.csv',
    'Weekly20200906-20200912-sagemaker dg-default.csv',
    'Weekly20200913-20200919-sagemaker dg-default.csv',
]

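# Add or remove labels here as needed, one per csv file. Make sure that your csv files are in the same chronological order as the labels.
#months = ['Mar', 'Apr', 'May', 'Jul', 'Aug', 'Sep']
weeks = ['Jul4', 'Aug1', 'Aug2', 'Aug3', 'Aug4', 'Sep1', 'Sep2', 'Sep3']

# The plotting functions read their x-axis labels from `months`, so point it at the weekly labels
months = weeks
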
# Load each csv into a dataframe
dataframes = []
for file in file_names:
    dataframes.append(pd.read_csv(folder+file).drop(labels=['Guide Name'], axis=1))

# Sanity check: preview the first dataframe (uncomment the second print to preview the next one)
print(dataframes[0].head(5))
# print(dataframes[1].head(5))

               Topic  Yes  No  %Yes
0            gs.html    8   8    50
1         algos.html    9   2    81
2     gs-studio.html    3   3    50
3  how-it-works.html    4   1    80
4   ex1-cleanup.html    4   1    80

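The chronological-order requirement can also be met programmatically. The following is a minimal sketch, assuming every file name contains its start date as the first run of eight digits in `YYYYMMDD` form (as in the `Weekly...` and `Monthly...` names above). The helper name `sort_reports_chronologically` is hypothetical and is not part of this notebook.

```python
import re
from datetime import datetime

def sort_reports_chronologically(names):
    """Return the file names sorted by the start date embedded in each name."""
    def start_date(name):
        # Assumption: the first run of 8 digits is the start date (YYYYMMDD)
        match = re.search(r"\d{8}", name)
        if match is None:
            raise ValueError(f"No YYYYMMDD date found in {name!r}")
        return datetime.strptime(match.group(), "%Y%m%d")
    return sorted(names, key=start_date)

unsorted_names = [
    'Weekly20200809-20200815-sagemaker dg-default.csv',
    'Weekly20200726-20200801-sagemaker dg-default.csv',
    'Weekly20200802-20200808-sagemaker dg-default.csv',
]
sorted_names = sort_reports_chronologically(unsorted_names)
print(sorted_names)
```

Sorting the names does not update the label list, so the labels still need to be written in the matching order.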

## Function definitions:
This cell defines the functions that produce the charts. You can execute it as is; you only need to edit it if you'd like to add functionality to this notebook.

In [8]:
# This function returns the set of all pages whose html page name contains keyword
# Note: keyword can be a regular expression
def get_pages_with_word(keyword, dfs):
    my_page_set = set()
    for df in dfs:
        temp_list = df[df['Topic'].str.contains(keyword)]['Topic'].to_list()
        # print(temp_list)
        my_page_set.update(temp_list)
    return my_page_set

# This function sums the Yes/No counts of all pages in "page_set" for each period
# and returns a single dataframe with one row per period.
# Note: it reads the module-level "dataframes" list.
def get_merged_df(page_set):
    all_total_dfs = []
    for df in dataframes:
        new_df = df.loc[df['Topic'].isin(page_set)]
        # Sum only the count columns; the percentage is recomputed from the totals below
        total_df = new_df[['Yes', 'No']].sum().to_frame().transpose()
        all_total_dfs.append(total_df)
    concat_df = pd.concat(all_total_dfs, ignore_index=True)
    concat_df['percent_yes'] = concat_df.apply(lambda row: 100*row.Yes/(row.Yes + row.No) if row.Yes + row.No > 0 else 0, axis=1)
    return concat_df

# Produce a single grouped bar chart of Yes/No counts for one set of pages
def barplot_single_page_set(merged_df, keyword):
    fig = go.Figure()
    fig.add_trace(go.Bar(
        x=months,
        y=merged_df['No'],
        name='No',
        marker_color='indianred'
    ))
    fig.add_trace(go.Bar(
        x=months,
        y=merged_df['Yes'],
        name='Yes',
        marker_color='green'
    ))
    fig.update_layout(barmode='group',
                    xaxis_tickangle=-45,
                    title={
                            'text': f"Reactions by Month (\"{keyword}\" pages)",
                            'y':0.85,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'
                    },
                    xaxis_title="Month",
                    yaxis_title="Number of responses",
    )
    fig.show()

def lineplot_percentage_multiple_dfs(merged_dfs, keywords):
    fig = go.Figure()
    for i in range(len(keywords)):
        fig.add_trace(go.Scatter(
                                    x=months, y=merged_dfs[i]['percent_yes'],
                                    mode='lines+markers',
                                    name=f'Percentage Yes by Month for \"{keywords[i]}\"'
                                )
                     )
    fig.update_layout(barmode='group',
                    xaxis_tickangle=0,
                    title={
                            'text': f"Reactions by Week for All Page Segments",
                            'y':0.85,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'
                    },
                    xaxis_title="Week",
                    yaxis_title="Percentage of Yes responses",
                    yaxis=dict(range=[0,100])
    )
    fig.show()

#################  #################  #################
#################    Your functions   #################
#################  #################  #################

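# Full analysis by passing in list of keywords
# Note: the plotting helpers read x-axis labels from the module-level `months`,
# so keep that variable in sync with the `months` argument.
def full_analysis_keywords(file_names, months, keywords):
    # Reload the csv files into the module-level `dataframes` list, which is
    # what get_merged_df() reads
    global dataframes
    dataframes = []
    for file in file_names:
        dataframes.append(pd.read_csv(folder+file).drop(labels=['Guide Name'], axis=1))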
    
    this_merged_dfs = []
    for keyword in keywords:
        keyword_page_set = get_pages_with_word(keyword, dataframes)
        keyword_merged_df = get_merged_df(keyword_page_set)
        # print(keyword_set)
        this_merged_dfs.append(keyword_merged_df)

        # plot bar chart
        barplot_single_page_set(keyword_merged_df, keyword)
    
    lineplot_percentage_multiple_dfs(this_merged_dfs, keywords)

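# Full analysis by passing in a list of page sets (or any iterable of strings)
# Note: the plotting helpers read x-axis labels from the module-level `months`,
# so keep that variable in sync with the `months` argument.
def full_analysis_page_set(file_names, months, page_sets, segment_names):
    # Reload the csv files into the module-level `dataframes` list, which is
    # what get_merged_df() reads
    global dataframes
    dataframes = []
    for file in file_names:
        dataframes.append(pd.read_csv(folder+file).drop(labels=['Guide Name'], axis=1))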
    
    this_merged_dfs = []
    for i in range(len(page_sets)):
        keyword_merged_df = get_merged_df(page_sets[i])
        # print(keyword_set)
        this_merged_dfs.append(keyword_merged_df)

        # plot bar chart
        barplot_single_page_set(keyword_merged_df, segment_names[i])
    
    lineplot_percentage_multiple_dfs(this_merged_dfs, segment_names)

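To illustrate what the merged dataframe for one segment contains, here is a small sketch on toy data. The counts are made up for illustration, and the code re-implements the aggregation inline rather than calling the notebook's functions, so treat it as an approximation of the idea: sum the Yes/No counts of a segment's pages per period, then derive the percentage from the totals.

```python
import pandas as pd

# Toy data for two reporting periods (illustrative values, not real CSAT data)
period_1 = pd.DataFrame({'Topic': ['gs.html', 'gs-studio.html', 'algos.html'],
                         'Yes': [8, 3, 9], 'No': [8, 3, 2]})
period_2 = pd.DataFrame({'Topic': ['gs.html', 'algos.html'],
                         'Yes': [6, 4], 'No': [2, 4]})

segment = {'gs.html', 'gs-studio.html'}

rows = []
for df in (period_1, period_2):
    in_segment = df[df['Topic'].isin(segment)]
    # Sum the raw counts first; averaging per-page percentages would weight
    # low-traffic pages the same as high-traffic ones
    rows.append(in_segment[['Yes', 'No']].sum())

totals = pd.DataFrame(rows).reset_index(drop=True)
answered = totals['Yes'] + totals['No']
# Periods with no responses get 0, following the convention used in get_merged_df
totals['percent_yes'] = (100 * totals['Yes'] / answered).where(answered > 0, 0)
print(totals)
```

In this toy example the segment has 11 Yes and 11 No responses in the first period (50%) and 6 Yes and 2 No in the second (75%). A page that is missing from a period simply contributes nothing to that period's totals.
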
## Call functions
Two analysis functions are provided. They take slightly different inputs and produce slightly different outputs.

The first function is `full_analysis_keywords`. For each keyword, it outputs a bar chart of the number of Yes/No reactions, followed by one line chart that compares the percentage of Yes responses across all keywords.
It takes the following parameters:

1. `file_names`: a list of csv file names that you want processed.

2. `months`: a list of month labels, one for each file in `file_names`.

3. `keywords`: a list of keywords. For each keyword, the function finds the pages whose names contain it and generates graphs for that set of pages. For example, "gs" produces a set of pages like \["gs.html", "gs-studio.html", etc.\], and "sms" produces a set of pages like \["sms-bounding-box.html", "sms-data-input.html", etc.\]. Keywords are treated as regular expressions, so you can also pass a regex (e.g., "gs|a2i" for "gs" OR "a2i").

Run the following code cell to see an example of `full_analysis_keywords` and its output.

In [9]:
# Each keyword or regex you put in this list will correspond to a segment of pages
# (matching is a substring search, so no leading "^.*" or trailing ".*$" is needed)
my_keywords = ["deploy|model-monitor|batch|multi-model|auto-scaling|inference-pipeline"]

full_analysis_keywords(file_names, months, my_keywords)

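Because keywords are matched anywhere in the page name, a short keyword can pick up more pages than intended. The sketch below uses hypothetical page names to preview what a few patterns match with `str.contains`, the same pandas call that `get_pages_with_word` relies on. Note how "gs" also matches "logs.html", while anchoring the pattern with "^" restricts it to names that start with "gs".

```python
import pandas as pd

# Hypothetical page names, used only to illustrate the matching behavior
topics = pd.Series(['gs.html', 'gs-studio.html', 'logs.html',
                    'sms-data-input.html', 'a2i-use.html'])

for pattern in ['gs', '^gs', 'gs|a2i']:
    # str.contains treats the pattern as a regex and searches the whole name
    matched = topics[topics.str.contains(pattern)].to_list()
    print(f'{pattern!r}: {matched}')
```
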
The second function is `full_analysis_page_set`. It works like `full_analysis_keywords`, except that instead of keywords you pass custom sets of pages directly, and it produces graphs for each set.
It takes the following parameters:

1. `file_names`: a list of csv file names that you want processed.

2. `months`: a list of month labels, one for each file in `file_names`.

3. `page_sets`: a list of page sets, where each set contains the names of the pages in one segment.

4. `segment_names`: a list of names, one for each segment. Make sure that this list has the same length as `page_sets`.

Run the following code cell to see an example of `full_analysis_page_set` on 3 custom sets and its output.

In [None]:
# Create sets of the text names of any pages that you want. Make sure to update "page_sets" and "segment_names" accordingly.
set_1 = {'gs.html','gs-config-permissions.html','gs-studio.html','how-it-works.html','whatis.html'}
set_2 = {'automatic-model-tuning-ex-data.html','ex1-cleanup.html'}
set_3 = {'algos.html','xgboost_hyperparameters.html','xgboost-tuning.html'}
page_sets = [
                set_1,
                set_2,
                set_3
             ]
segment_names = [
                    "custom set 1",
                    "custom set 2",
                    "custom set 3"
                 ]

In [None]:
full_analysis_page_set(file_names, months, page_sets, segment_names)
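
A misspelled page name in a custom set does not raise an error; the page just contributes nothing to the totals. The following sketch shows one way to catch that, along with a length mismatch between `page_sets` and `segment_names`, before plotting. The helper `check_segments` and the sample data are hypothetical and not part of this notebook.

```python
def check_segments(page_sets, segment_names, known_pages):
    """Return {segment name: sorted list of pages not found in known_pages}."""
    if len(page_sets) != len(segment_names):
        raise ValueError(
            f"{len(page_sets)} page sets but {len(segment_names)} segment names")
    missing = {}
    for name, pages in zip(segment_names, page_sets):
        unknown = sorted(set(pages) - set(known_pages))
        if unknown:
            missing[name] = unknown
    return missing

# Hypothetical data for illustration
known = {'gs.html', 'algos.html', 'ex1-cleanup.html'}
sets = [{'gs.html', 'whatis.html'}, {'algos.html'}]
names = ['custom set 1', 'custom set 2']
report = check_segments(sets, names, known)
print(report)
```

In this notebook, `known_pages` could be built from the loaded data, for example with `set().union(*(df['Topic'] for df in dataframes))`.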