<font size="6"><b>Mobile App Customer Segmentation</b></font>

# Executive Summary

This analysis segments app users by their purchase behavior, affinity for technology, and personality traits in order to tailor promotional strategies that reach each type of potential customer and drive engagement with the platform and its services.

Five segments were identified. App Value Leaders and Free App Enthusiasts have the most revenue potential, based on how active they are online and how likely they are to buy paid features.

Both of these groups are inclined to try new things and be up-to-date with technology, yet the main difference between them is that App Value Leaders would be more likely to spend on premium features, while Free App Enthusiasts should be approached through discounts and deals.

The platforms with the most traffic amongst the potential customer groups are Facebook and YouTube. Therefore, social media campaigns should be prioritized through these two channels to reach larger audiences and optimize resources.

Carrying out a new survey considering newer social media platforms, such as Instagram and TikTok, could prove useful to better tailor the strategies and consider new behavioral trends.

# Data Preparation

## Importing the data

The first step in modeling is preparing the data. For better readability, it is also useful to rename the columns.

In [None]:
# import standard packages
import numpy             as np                          # mathematical essentials
import pandas            as pd                          # data science essentials
import matplotlib.pyplot as plt                         # fundamental data visualization
import seaborn           as sns                         # enhanced visualization

# import machine learning packages
from sklearn.preprocessing   import StandardScaler      # standard scaler
from sklearn.decomposition   import PCA                 # pca
from scipy.cluster.hierarchy import dendrogram, linkage # dendrograms
from sklearn.cluster         import KMeans              # k-means clustering

# new column names
cols = ['Case_ID', 'age', 'has_iPhone', 'has_iPod', 'has_Android', 'has_Blackberry', 'has_Nokia', 'has_Windows', 'has_HP',
       'has_Tablet', 'has_other_smartphone', 'has_no_device', 'app_music', 'app_tv_checkin', 'app_entertainment',
       'app_tv_show', 'app_gaming', 'app_social', 'app_general_news', 'app_shopping', 'app_specific_news', 'app_other',
       'app_none', 'nr_apps', 'free_apps_pct', 'facebook_frq', 'twitter_frq', 'myspace_frq', 'pandora_frq', 'vevo_frq',
       'youtube_frq', 'aol_frq', 'lastfm_frq', 'yahoo_frq', 'imdb_frq', 'linkedin_frq', 'netflix_frq', '24_tech_dev',
       '24_tech_advise', '24_purchase', '24_too_much_tech', '24_control_life', '24_save_time', '24_music', '24_tv_shows',
       '24_information', '24_networking', '24_contact', '24_avoid_contact', '25_opinion_leader', '25_stand_out', 
       '25_offer_advice', '25_decision_making', '25_new_things', '25_responsibility', '25_control', '25_risk_taker',
       '25_creative', '25_optimistic', '25_active', '25_stretched','26_attracted_luxury', '26_discount', '26_enjoy_shopping', 
        '26_package_deals', '26_online_shopping', '26_buy_designer', '26_love_apps', '26_cool_apps', '26_show_off',
       '26_children_impact', '26_pay_features', '26_spend_money', '26_hot_not', '26_reflect_style', '26_impulse_purchase',
       '26_entertainment_source', 'level_education', 'marital_status', 'children_none', 'children_under_6', 'children_6_12',
        'children_13_17', 'children_above_18', 'race', 'hispanic', 'income', 'gender']

# loading data with our column names
survey_df = pd.read_excel(io        = 'Mobile_App_Survey_Data.xlsx',
                          names     = cols,
                          index_col = None)


## Merging Categories

Related survey answers are merged into broader categories: the app store a respondent uses (inferred from the devices they own), the kinds of apps they use, and how frequently they use each platform.

In [None]:
# create new column for store usage
survey_df['Apple AppStore'] = np.where(survey_df['has_iPhone'] + survey_df['has_iPod'] > 0 , 1, 0)
survey_df['Android PlayStore'] = np.where(survey_df['has_Android'] + survey_df['has_Nokia'] + survey_df['has_HP']  + survey_df['has_Blackberry'] > 0 , 1, 0)
survey_df['Windows Store'] = np.where(survey_df['has_Windows'] > 0 , 1, 0)

# group similar types of apps together
survey_df['news'] = np.where(survey_df['app_general_news']+survey_df['app_specific_news'] > 0 , 1, 0)
survey_df['entertainment'] = np.where(survey_df['app_tv_checkin']+survey_df['app_entertainment']+survey_df['app_tv_show'] > 0 , 1, 0)

# App usage --> 1 and 2 are users, 3 and 4 rarely use the app
survey_df['Uses_Facebook'] = np.where(survey_df['facebook_frq'] <= 2 , 1, 0)
survey_df['Uses_Twitter'] = np.where(survey_df['twitter_frq'] <= 2 , 1, 0)
survey_df['Uses_Pandora'] = np.where(survey_df['pandora_frq'] <= 2 , 1, 0)
survey_df['Uses_Vevo'] = np.where(survey_df['vevo_frq'] <= 2 , 1, 0)
survey_df['Uses_Youtube'] = np.where(survey_df['youtube_frq'] <= 2 , 1, 0)
survey_df['Uses_AOL'] = np.where(survey_df['aol_frq'] <= 2 , 1, 0)
survey_df['Uses_LastFM'] = np.where(survey_df['lastfm_frq'] <= 2 , 1, 0)
survey_df['Uses_yahoo'] = np.where(survey_df['yahoo_frq'] <= 2 , 1, 0)
survey_df['Uses_imdb'] = np.where(survey_df['imdb_frq'] <= 2 , 1, 0)
survey_df['Uses_LinkedIn'] = np.where(survey_df['linkedin_frq'] <= 2 , 1, 0)
survey_df['Uses_netflix'] = np.where(survey_df['netflix_frq'] <= 2 , 1, 0)
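
The near-identical assignments above could also be written as a loop. The following is a minimal sketch on a made-up three-row frame; `toy_df`, `flag_names`, and the values are illustrative assumptions, not survey data. It applies the same rule as the cell above: codes 1 and 2 count as usage.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: frequency codes, where 1 and 2 count as usage
toy_df = pd.DataFrame({'facebook_frq': [1, 3, 2],
                       'twitter_frq':  [4, 2, 4]})

# map each frequency column to the name of its usage flag
flag_names = {'facebook_frq': 'Uses_Facebook',
              'twitter_frq':  'Uses_Twitter'}

for frq_col, flag_col in flag_names.items():
    # codes 1 and 2 are treated as users, 3 and 4 as non-users
    toy_df[flag_col] = np.where(toy_df[frq_col] <= 2, 1, 0)

print(toy_df)
```

Keeping the column-to-flag mapping in one dictionary makes it harder to mistype a single column name when the list of platforms changes.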

## Defining psychometric columns

For later use, the psychometric columns are grouped into three sets of questions: technical, personal, and purchase behavior.
- Technical: Related to people's relationship with technology and whether it influences their daily life.
- Personal: Traits associated with leadership and approaches to risk.
- Purchase behavior: Related to the thought process behind their shopping behavior and what criteria they set in place when acquiring new technology.

In [None]:
# questions about technology --> usage 
technical_usage = ['24_tech_dev', '24_tech_advise', 
                   '24_purchase', '24_too_much_tech', '24_control_life', '24_save_time', '24_music', '24_tv_shows', 
                   '24_information', '24_networking', '24_contact', '24_avoid_contact']

# questions about leadership/ if a person is adventurous --> personal features
personal_features = ['25_opinion_leader', '25_stand_out', '25_offer_advice', '25_decision_making', '25_new_things', 
                    '25_responsibility', '25_control', '25_risk_taker', '25_creative', '25_optimistic', '25_active', 
                    '25_stretched']

# questions if a person is attracted to money, the newest trends, apps --> purchase behaviour
purchase_behaviour = ['free_apps_pct', '26_attracted_luxury', '26_discount', '26_enjoy_shopping', '26_package_deals', 
                     '26_online_shopping', '26_buy_designer', '26_love_apps', '26_cool_apps', '26_show_off', 
                     '26_children_impact', '26_pay_features', '26_spend_money', '26_hot_not', '26_reflect_style', 
                     '26_impulse_purchase', '26_entertainment_source']

## Scaling the data

### Scaling Rows
An essential step in using PCA is scaling the data. First, the rows have to be scaled. This step is needed because survey respondents tend to follow one of two response styles:

1.	People who go to the extremes: They mainly choose strongly agree or strongly disagree and are rarely neutral.
2.	People who avoid extremes: They mainly stay neutral and rarely take a strong stance.

To compare both of these types fairly, scaling is needed.
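
To illustrate the idea, the sketch below standardizes two made-up respondents row by row. The answer values, the column names, and the agreement scale are assumptions for illustration only. Both respondents share the same pattern of agreement, but one uses the extremes of the scale and the other stays near the middle.

```python
import pandas as pd

# hypothetical answers on an agreement scale (not survey data)
toy_answers = pd.DataFrame({'q1': [1, 3],
                            'q2': [6, 4],
                            'q3': [1, 3],
                            'q4': [6, 4]},
                           index=['extreme', 'moderate'])

# standardize each row: subtract the row mean, divide by the row standard deviation
row_mean = toy_answers.mean(axis=1)
row_std = toy_answers.std(axis=1, ddof=0)   # ddof=0 matches StandardScaler
row_scaled = toy_answers.sub(row_mean, axis=0).div(row_std, axis=0)

print(row_scaled)
```

After row scaling, the two rows become identical, so the difference in response style no longer separates respondents who agree and disagree with the same statements.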

### Scaling Columns
As a second step, the columns have to be scaled for the PCA model. PCA projects the data onto the directions of maximum variance, so a column measured on a larger scale would dominate the components unless all columns are standardized first.
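
The effect of column scale on PCA can be seen in a small experiment. This is a sketch on synthetic data, not on the survey: the two columns, their scales, and the names `raw_ratio` and `scaled_ratio` are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical data: two independent columns on very different scales
rng = np.random.default_rng(219)
small = rng.normal(loc=0, scale=1, size=500)
large = rng.normal(loc=0, scale=100, size=500)
X = np.column_stack([small, large])

# PCA on raw data: the large-scale column absorbs nearly all the variance
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA on standardized data: both columns contribute about equally
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(raw_ratio.round(3))
print(scaled_ratio.round(3))
```

On the raw data the first component is essentially the large-scale column alone, while after standardization neither column dominates.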

For both of these steps, a function is written. In addition, a function for scaling both rows and columns at once is provided for later use.

In [None]:
def scaler_columns(df):
    """
    Standardizes a dataset (mean = 0, variance = 1). Returns a new DataFrame
    with scaled columns. Requires sklearn.preprocessing.StandardScaler().
    
    PARAMETERS
    ----------
    df     | DataFrame to be used for scaling
    """

    # INSTANTIATING a StandardScaler() object
    scaler = StandardScaler()

    # fitting the scaler with the data
    scaler.fit(df)

    # transforming our data after fit
    x_scaled = scaler.transform(df)
   
    # converting scaled data into a DataFrame, keeping the original
    # index and column names
    new_df = pd.DataFrame(x_scaled, index = df.index, columns = df.columns)
    
    return new_df

def scaler_rows(df):
    """
    Standardizes the rows of a dataset (mean = 0, variance = 1). Returns a new DataFrame
    with scaled rows. Requires sklearn.preprocessing.StandardScaler().
    
    PARAMETERS
    ----------
    df     | DataFrame to be used for scaling
    """
    
    # Transpose the DataFrame, scale and Transpose back
    df_scaled = scaler_columns(df.T).T
    
    # reattaching column names
    df_scaled.columns = df.columns
    
    return df_scaled
    
def scaler(df):
    """
    Standardizes the rows and columns of a dataset (mean = 0, variance = 1). 
    Returns a new DataFrame with scaled rows and columns. 
    Requires sklearn.preprocessing.StandardScaler().
    
    PARAMETERS
    ----------
    df     | DataFrame to be used for scaling
    """    
    
    # scale the rows
    df_scaled_rows = scaler_rows(df)
    
    # scale the columns
    df_scaled_cols = scaler_columns(df_scaled_rows)
    
    return df_scaled_cols


## Running the PCA-model

### Providing a function to run PCA

Principal Component Analysis is run and analyzed to reduce the number of features. PCA is a dimensionality-reduction method that transforms large datasets into smaller ones containing most of the information.

Scree plots are used to decide how many columns of the PCA model to analyze.

In [None]:
# function to create the elbow plot
def scree_plot(pca_object, export = False):
    """
    Visualizes a scree plot from a pca object.
    
    PARAMETERS
    ----------
    pca_object | A fitted pca object
    export     | Set to True if you would like to save the scree plot to the
               | current working directory (default: False)
    """

    # setting plot size
    fig, ax = plt.subplots(figsize=(20, 16))
    features = range(pca_object.n_components_)

    # developing a scree plot
    plt.plot(features,
             pca_object.explained_variance_ratio_,
             linewidth = 2,
             marker = 'o',
             markersize = 10,
             markeredgecolor = 'black',
             markerfacecolor = 'grey')

    # setting more plot options
    plt.title('Scree Plot')
    plt.xlabel('PCA feature')
    plt.ylabel('Explained Variance')
    plt.xticks(features)

    if export:
        # exporting the plot
        plt.savefig('./__analysis_images/top_customers_correlation_scree_plot.png')
        
    # displaying the plot
    plt.show()
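
Alongside the visual elbow, the cumulative explained variance can be read off numerically. The sketch below uses synthetic data with two hidden factors; the data, the noise level, and the variable names are assumptions for illustration and do not come from the survey.

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical data: 200 rows, 6 columns driven by 2 hidden factors plus noise
rng = np.random.default_rng(219)
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = hidden @ mixing + 0.1 * rng.normal(size=(200, 6))

# fit a PCA that keeps all components
pca = PCA(n_components=None, random_state=219).fit(X)

# cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f'{k} components: {share:.3f}')
```

In this constructed case the curve flattens after two components, which is the same signal the elbow in a scree plot gives.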

In [None]:
def run_pca(df, check_factors = False, name = 'test', n_components = None,  show_plot = False):
    """
    Runs a PCA on a DataFrame. The DataFrame should be scaled beforehand.

    If check_factors is True, the factor loadings are saved to an Excel file
    for analysis, a scree plot is shown if show_plot is True, and the factor
    loadings DataFrame is returned (principal components in columns, original
    columns in rows). If check_factors is False, the DataFrame of PCA scores
    for each row of the original DataFrame is returned.
    
    PARAMETERS
    ----------
    df            | DataFrame to be used for PCA
    check_factors | Whether to return (and export) the factor loadings instead
                    of the PCA scores (default: False)
    name          | Name of the Excel file, without the .xlsx extension
                    (default: 'test')
    n_components  | Number of components to keep; None keeps all components
    show_plot     | Whether the scree plot should be shown (default: False)
    """
    
    # INSTANTIATING a PCA object with the requested number of components
    pca = PCA(n_components = n_components,
              random_state = 219)

    # FITTING and TRANSFORMING the scaled data
    pca_fit = pca.fit_transform(df)
    
    if check_factors:
        # calling the scree_plot function
        if show_plot:
            scree_plot(pca_object = pca)


        # transposing pca components
        factor_loadings_df = pd.DataFrame(np.transpose(pca.components_.round(decimals = 2)))

        # naming rows as original features
        factor_loadings_df = factor_loadings_df.set_index(df.columns)

        # saving to Excel
        factor_loadings_df.to_excel(f'{name}.xlsx')
    else: 
        factor_loadings_df = pd.DataFrame(pca_fit)

    return factor_loadings_df

### Running PCA on the survey dataset

To run the PCA, the following steps are needed:
1.	Create DataFrames for the technical, personal, and purchase behavior features.
2.	Run PCA for every feature group:
    - Scale the DataFrame.
    - Run PCA with n_components set to None to create a model that explains all the variance, and set check_factors to True to export the factor loadings, which show how each PCA component relates to the original columns.
3.	Check the scree plot and look for the elbow. For this survey, keeping three components for each PCA model is reasonable.
4.	Go into the Excel files and come up with names fitting the PCA columns.
5.	Create a DataFrame containing all the columns of the three PCA models.

In [None]:
# create df's for the features
df_technical = survey_df[technical_usage]
df_personal = survey_df[personal_features]
df_purchase = survey_df[purchase_behaviour]
# all_dfs = [df_technical, df_personal, df_purchase]

# for df in all_dfs:
    
#     # create the name for each dataframe
#     # from: https://stackoverflow.com/questions/31727333/get-the-name-of-a-pandas-dataframe
#     name =[x for x in globals() if globals()[x] is df][0]
    
#     # scale the data
#     scaled_df = scaler(df)
    
#     # run PCA  
#     run_pca(df            = scaled_df, 
#             check_factors = True, 
#             name          = name, 
#             n_components  = None,  
#             show_plot     = True)


# get factors and name for features for personal features
scaled_personal = scaler(df_personal) 
factor_loadings_personal = run_pca(df = scaled_personal, n_components = 3)
personal_cols = ['Followers',                 # followers, easily influenced by others
                 'Calculated Risk Takers',    # conservative inactive pessimists
                 'Controlling Leaders']       # submissive non-creatives
factor_loadings_personal.columns = personal_cols

# get factors and name for features for technical features
scaled_tech = scaler(df_technical)
factor_loadings_technical = run_pca(df = scaled_tech, n_components = 3)
tech_cols = ['Offliners',               # too much info and tech out there, prefer offline contact
             'Frugal Onliners',         # always online, don't care about having the latest
             'Relaxed and Empowered']   # chill, up-to-date, with control over their lives
factor_loadings_technical.columns = tech_cols

# get factors and name for features for purchase features
scaled_purchase = scaler(df_purchase)
factor_loadings_purchase = run_pca(df = scaled_purchase, n_components = 3)
purchase_cols = ['Cheap App Lovers',       # Customer looking for best features in free apps   
                 'Spending App Lovers',    # attracted to the what's hot, and the best, sometimes regardless of price
                 'App Acquisition Planners']  # plan their purchases according to their needs
factor_loadings_purchase.columns = purchase_cols

# create DataFrame with all PCA features
factors_df = pd.concat([factor_loadings_personal, factor_loadings_technical, factor_loadings_purchase], axis = 1)

## Clustering

The next step of data preparation is to group the consumers into clusters with k-means clustering, based on the PCA findings. Because k-means relies on distances between observations, the PCA columns are scaled again before clustering.

After analyzing the dendrogram, five clusters are defined:

~~~
Cluster
0          309
1          326
2          287
3          349
4          281
~~~
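
One way to sanity-check a cluster count read off a dendrogram is to cut the Ward linkage at that number of clusters and inspect the group sizes. The sketch below does this with SciPy's `fcluster` on synthetic, well-separated points; the data and the choice of three groups are assumptions for illustration. Groups obtained this way are hierarchical clusters and need not match the k-means clusters exactly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical data: three well-separated groups of 30 points in two dimensions
rng = np.random.default_rng(219)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([center + rng.normal(scale=0.5, size=(30, 2)) for center in centers])

# Ward linkage, the same method used for the dendrogram
Z = linkage(X, method='ward')

# cut the tree so that at most three flat clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')

# size of each flat cluster (fcluster labels start at 1)
sizes = np.bincount(labels)[1:]
print(sizes)
```

Very uneven group sizes at the chosen cut would be a hint to revisit the number of clusters.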

Setting the check_centers parameter of the function below to True creates an Excel file with the cluster centers for each PCA column. These centers were used to evaluate and label each cluster as follows:
- 0 - Free App Enthusiasts: They are up to date on technology, like learning new things, and are always online. Even though they love using several apps, they usually avoid paying but cannot resist a bargain.
- 1 - App Value Leaders: They are knowledgeable in their field and easily influence others. They like apps, are credible, and can be targeted through respectable social media influencers.
- 2 - Economical and Efficient: The economic buyer is cost-effective when acquiring new apps and looks for tools that add value to their everyday lives and maximize their benefits and efficiency.
- 3 - Carefree App Followers: They go with the flow, do not spend much time online and are impulsive buyers. They are attracted to paid features that add value to their lives without detracting from their in-person interactions.
- 4 - Indifferent Shoppers: They are not desperate to hop on the next trend when it comes to apps. They are often satisfied with products that they already have.

In [None]:
def run_clustering(df, n_clusters = 8, dendogram = False, check_centers = False):
    """
    Scales a DataFrame and runs k-means clustering on it.
    A dendrogram of the data is shown if the dendogram parameter is True.
    Returns a DataFrame with the cluster for each row.
    
    PARAMETERS
    ----------
    df            | DataFrame to be used for clustering
    n_clusters    | Number of clusters to be created, by default it is 8,
                    the default value of sklearn.cluster.KMeans
    dendogram     | Whether or not a dendrogram should be shown
    check_centers | Whether or not to create an Excel-file to look into the 
                    clusters and come up with names
    """    

    # scale the df
    pca_scaled = scaler_columns(df)
    
    if dendogram:
        # grouping data based on Ward distance
        standard_mergings_ward = linkage(y                = pca_scaled,
                                         method           = 'ward',
                                         optimal_ordering = True)

        # setting plot size
        fig, ax = plt.subplots(figsize=(12, 12))

        # developing a dendrogram
        dendrogram(Z              = standard_mergings_ward,
                   leaf_rotation  = 90,
                   leaf_font_size = 6)

        # rendering the plot
        plt.show()
        
    # INSTANTIATING a k-Means object with n clusters
    k_clustering = KMeans(n_clusters   = n_clusters,
                          random_state = 219)

    # fitting the object to the data
    k_clustering.fit(pca_scaled)
    
    
    # checking the centroids
    if check_centers:
        # storing cluster centers
        centroids = k_clustering.cluster_centers_

        # converting cluster centers into a DataFrame
        centroids_df = pd.DataFrame(centroids).round(2)

        # renaming principal components
        centroids_df.columns = pca_scaled.columns
        
        # send to excel 
        centroids_df.to_excel('clusters.xlsx')

    # converting the clusters to a DataFrame giving each row a cluster
    kmeans_df = pd.DataFrame({'Cluster': k_clustering.labels_})
    
    return kmeans_df


## Translating Columns

The coded values of the most important columns (cluster, age, and income) are translated into readable labels. Additionally, long-format (melted) versions of the app usage and store columns are created for plotting.

In [None]:
# create clusters
df_cluster = run_clustering(factors_df, n_clusters = 5)

# replace by cluster name 
cluster_names = {0 : 'Free App Enthusiasts',
                 1 : 'App Value Leaders',
                 2 : 'Economical and Efficient',
                 3 : 'Carefree App Followers',
                 4 : 'Indifferent Shoppers'}

df_cluster = df_cluster['Cluster'].replace(cluster_names)

# concatenating cluster memberships with principal components
clst_pca_df = pd.concat([df_cluster,
                        factors_df],
                        axis = 1)

# translate age codes into generation labels
age_names = {1 : '1 - Gen Z',
             2 : '1 - Gen Z',
             3 : '2 - Millennials',
             4 : '2 - Millennials',
             5 : '2 - Millennials',
             6 : '3 - Gen X',
             7 : '3 - Gen X',
             8 : '3 - Gen X',
             9 : '4 - Boomers',
            10 : '4 - Boomers',
            11 : '4 - Boomers'}

survey_df['age'] = survey_df['age'].replace(age_names)

# translate income codes into income brackets
salary_names = {1 : '1 - Under 30k',
                2 : '1 - Under 30k',
                3 : '1 - Under 30k',
                4 : '1 - Under 30k',
                5 : '2 - 30k - 50k',
                6 : '2 - 30k - 50k',
                7 : '3 - 50k - 70k',
                8 : '3 - 50k - 70k',
                9 : '4 - 70k - 100k',
               10 : '4 - 70k - 100k',
               11 : '4 - 70k - 100k',
               12 : '5 - Over 100k',
               13 : '5 - Over 100k',
               14 : '5 - Over 100k'}

survey_df['income'] = survey_df['income'].replace(salary_names)


# concatenating demographic information with pca-clusters
final_pca_clust_df = pd.concat([survey_df.loc[ : , ['age', 'income', 'Apple AppStore', 'Android PlayStore', 
                                                    'Windows Store', 'has_Tablet', 'Uses_Facebook', 
                                                    'Uses_Twitter', 'Uses_Pandora', 'Uses_Vevo', 'Uses_Youtube', 
                                                    'Uses_AOL', 'Uses_LastFM', 'Uses_yahoo', 'Uses_imdb', 
                                                    'Uses_LinkedIn', 'Uses_netflix'
                                                   ]],
                                  clst_pca_df.round(decimals = 2)],
                                  axis = 1)

data_df = final_pca_clust_df

In [None]:
# pivot the store and create new DataFrame, only keep the rows where user uses the store
store_df = data_df[['Apple AppStore', 'Android PlayStore',  'Windows Store', 'has_Tablet', 'Cluster', 'Followers', 
                    'Calculated Risk Takers', 'Controlling Leaders', 'Offliners', 'Cheap App Lovers', 'Spending App Lovers',
                   'App Acquisition Planners']]

store_df = store_df.melt(id_vars=['Cluster', 'Followers', 'Calculated Risk Takers',
                                'Controlling Leaders', 'Offliners', 'Cheap App Lovers', 'Spending App Lovers',
                               'App Acquisition Planners'],
                                var_name='Store', value_name='Uses_Store')

store_df = store_df[store_df['Uses_Store'] == 1]

# pivot the app usage and create new DataFrame, only keep the rows where user uses the app
app_usage_df = data_df[['Uses_Facebook', 'Uses_Twitter', 'Uses_Pandora', 'Uses_Vevo', 'Uses_Youtube', 
                        'Uses_AOL', 'Uses_LastFM', 'Uses_yahoo', 'Uses_imdb', 'Uses_LinkedIn', 
                        'Uses_netflix', 'Cluster', 'Followers',  'Calculated Risk Takers', 
                        'Controlling Leaders', 'Offliners','Cheap App Lovers', 'Spending App Lovers',
                           'App Acquisition Planners']]

app_usage_df = app_usage_df.melt(id_vars=['Cluster', 'Followers', 'Calculated Risk Takers',
                                'Controlling Leaders', 'Offliners', 'Cheap App Lovers', 'Spending App Lovers',
                               'App Acquisition Planners'],
                                var_name='App', value_name='Uses_App')

app_usage_df = app_usage_df[app_usage_df['Uses_App'] == 1]
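
The reshaping done above can be seen more easily on a small table. The sketch below uses a made-up three-row frame; `toy_wide`, `toy_long`, and the values are illustrative assumptions. It applies the same two steps: melt the flag columns into long format, then keep only the rows where the flag is 1.

```python
import pandas as pd

# hypothetical wide table: one row per respondent, one flag column per app
toy_wide = pd.DataFrame({'Cluster':       ['A', 'B', 'A'],
                         'Uses_Facebook': [1, 0, 1],
                         'Uses_Youtube':  [1, 1, 0]})

# melt to long format: one row per respondent-app pair
toy_long = toy_wide.melt(id_vars='Cluster', var_name='App', value_name='Uses_App')

# keep only the pairs where the respondent actually uses the app
toy_long = toy_long[toy_long['Uses_App'] == 1]

print(toy_long)
```

In the long format, counting rows per app and cluster gives the number of users directly, which is what a grouped bar chart needs.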

# Strategies for clusters

Now that the clusters have been identified, each type of customer will be approached as follows:

Free App Enthusiasts:
- Being one of the two most important groups in revenue potential, the apps offered to this cluster should have two versions: free and premium.
- Potential customers should be approached with free initial trials and new user deals, along with a referral system that can award them with points or credits that can be used towards premium features.
- Promotion should be carried out through the most popular social media channels within this group.
- This group should be offered apps that allow them to connect with others and express themselves, with customizable features, and where they can gather a following.

App Value Leaders:
- As they are the other leading group with the most revenue potential, the content and features offered on the application ought to add value to their daily lives, which will increase their usage of the services and even create the possibility of them being a bridge to bring new customers on board.
- Influencers dominate social media platforms. However, since the people belonging to this cluster are leaders themselves, they should be targeted through collaboration with aspirational leaders that drive engagement in the app and be offered the opportunity to connect and create.
- For this, they should be offered a creator back-end so that the content they generate is of higher quality than that of a regular user. This will allow them to build rapport with their following and draw attention to premium features among users in other clusters.
- Application users should be converted into loyal users through rewards and redeemable points that accumulate over time for engagement and referrals and can later be exchanged for discount packages and access to premium features.
- Engagement campaigns should be based on current affairs and run on work-related social media platforms and electronic news publications and blogs, driving more traffic of this specific type of customer to the app.

Economical and Efficient:
- This customer does not shy away from paid apps but will usually plan their purchase aiming to obtain the most benefit, which is why they should be approached through free trials that allow them to assess whether the app is relevant enough for them to commit financially.
- Since they are more hands-on and concerned with how the app brings utility to their lifestyle, they should be offered apps that enhance their relationship with health, fitness, finances, and even simple daily tasks.
- Other strategies that could prove effective are new user deals or feature packages, with bundles for different prices that will allow them to decide which best fit their needs and budget.

Carefree App Followers:
- This cluster should be targeted through in-person events or through sponsorships of large-scale events in which they can physically see and retain the brand logo and primary information.
- One-day offers could be a good idea since they are somewhat impulsive buyers, and these deals being promoted by influencers of their respective age groups could prove to be more effective.
- Online advertising should be done through news apps and websites, as well as blogs or LinkedIn.

Indifferent Shoppers:
- This type of customer should be approached through emotional targeting. Since they do not spend much time online, campaigns should have a long-lasting and impactful message.
- Since they do not follow trends and feel comfortable with what they already have, another practical approach would be explicitly mentioning the app's benefits to their daily lives.

In [None]:
def show_barplot(df, col_name, title):
    """
    For the given DataFrame, the function shows a bar plot with the values of
    col_name on the x-axis and the frequency of each value per cluster on the y-axis. 
    
    PARAMETERS
    ----------
    df            | DataFrame to be used for plotting
    col_name      | Column which should be shown
    title         | Title to show on the Graph
    """
    
    # create DataFrame for free app enthusiasts and count frequency
    app_ent_df = df.loc[:, [col_name]][df.loc[:, 'Cluster'] == 'Free App Enthusiasts'].value_counts().rename_axis(col_name).to_frame('Freq')
    app_ent_df.reset_index(level=0, inplace=True)
    app_ent_df['Cluster'] = 'Free App Enthusiasts'

    # create DataFrame for app value leaders and count frequency
    app_leaders_df = df.loc[:, [col_name]][df.loc[:, 'Cluster'] == 'App Value Leaders'].value_counts().rename_axis(col_name).to_frame('Freq')
    app_leaders_df.reset_index(level=0, inplace=True)
    app_leaders_df['Cluster'] = 'App Value Leaders'
    
    # create DataFrame for economicals and count frequency
    eco_df = df.loc[:, [col_name]][df.loc[:, 'Cluster'] == 'Economical and Efficient'].value_counts().rename_axis(col_name).to_frame('Freq')
    eco_df.reset_index(level=0, inplace=True)
    eco_df['Cluster'] = 'Economical and Efficient'
    
    # create DataFrame for carefree app followers and count frequency
    carefree_df = df.loc[:, [col_name]][df.loc[:, 'Cluster'] == 'Carefree App Followers'].value_counts().rename_axis(col_name).to_frame('Freq')
    carefree_df.reset_index(level=0, inplace=True)
    carefree_df['Cluster'] = 'Carefree App Followers'
    
    # create DataFrame for free app indifferent shoppers and count frequency
    shoppers_df = df.loc[:, [col_name]][df.loc[:, 'Cluster'] == 'Indifferent Shoppers'].value_counts().rename_axis(col_name).to_frame('Freq')
    shoppers_df.reset_index(level=0, inplace=True)
    shoppers_df['Cluster'] = 'Indifferent Shoppers'
    
    # concatenate all into one DataFrame
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    final_df = pd.concat([app_leaders_df, app_ent_df, eco_df, carefree_df, shoppers_df])
    final_df = final_df.sort_values(col_name)
    
    # create figure for the bar plot
    fig, ax = plt.subplots(figsize=(16, 8))
    
    # color palette for the chart
    pal = ['navy', 'cornflowerblue', 'lightsteelblue', 'dimgray', 'lightgray']
    
    # bar plot
    sns.barplot(x= col_name, y='Freq', hue='Cluster', data=final_df, palette=pal)
    
    # rotate x label
    plt.xticks(rotation=90)
    
    # get rid of axis labels
    ax.set_ylabel('')
    ax.set_xlabel('')
    
    # create title
    ax.set_title(title, size=18)

    # show plot
    plt.show()


In [None]:
# show app usage by cluster
show_barplot(app_usage_df, 'App', 'App Usage by Cluster')

Social media campaigns should be tailored to the App Value Leaders and Free App Enthusiasts, given that they are the ones that spend the most time online and could be prone to use and spread the word about the app the most. Nonetheless, because Facebook and YouTube are the leading platforms used across all the clusters, these advertising efforts could also drive potential customers from the other groups.

In [None]:
# show income distribution by cluster
show_barplot(data_df, 'income', 'Income by Cluster')

App Value Leaders are in the high-income bracket and are often attracted to premium features, which is why they should be offered paid apps or in-app purchase options.

Free App Enthusiasts have an average income lower than $50K a year, which validates the approach through free apps with premium features that they can later purchase from within the app with redeemable credits.
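
Because the clusters differ in size, the raw counts in the bar chart can be complemented with within-cluster shares. The following is a minimal sketch using `pd.crosstab` on made-up rows that reuse the `Cluster` and `income` column names of `data_df`; the values are assumptions for illustration and are not survey results.

```python
import pandas as pd

# hypothetical rows with the same column names as data_df (values are made up)
toy_df = pd.DataFrame({'Cluster': ['App Value Leaders'] * 4 + ['Free App Enthusiasts'] * 2,
                       'income':  ['5 - Over 100k', '5 - Over 100k', '5 - Over 100k', '1 - Under 30k',
                                   '1 - Under 30k', '2 - 30k - 50k']})

# share of each income bracket within each cluster (each row sums to 1)
income_share = pd.crosstab(toy_df['Cluster'], toy_df['income'], normalize='index')

print(income_share.round(2))
```

Shares of this kind make clusters of different sizes directly comparable, which raw frequencies do not.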

In [None]:
# show store preference by cluster
show_barplot(store_df, 'Store', 'Store by Cluster')

The budget should prioritize the development and maintenance of the apps offered mainly through the Apple AppStore and the Google PlayStore.

In [None]:
# show age distribution by cluster
show_barplot(data_df, 'age', 'Age by Cluster')

Millennials are the most relevant age group, and the top two social media platforms used by all clusters are Facebook and YouTube. Hence, Millennial influencers should be used to reach other age groups, as they could also become aspirational figures for Gen Z prospects.

Further, two influencers' profiles could be approached for the social media campaigns: opinion leaders and content creators. The former could prove helpful to reach quality content-seeking clusters such as the App Value Leaders, while the latter could be more appropriate for Free App Enthusiasts.