<br><h1>A3 Analysis - Unsupervised Learning </h1>
<h2>Machine Learning - DAT-5303 - FMSBA3 </h2>
By: Sophie Briques, Michael Abramson, Yevheniya Boyko, Rakesh Joe Francy, Junjie Huang, Santiago Romero<br>
Hult International Business School<br><br><br>

***

### Overview

- Survey data from Big 5 personality traits and Hult DNA to understand purchase behavior of Windows and Macbook users
- Unsupervised learning technique PCA was used to uncover psychometrics of the target audience
- Unsupervised learning technique k-Means clustering was used to determine similar groups of customers

***


<strong> Case: Microsoft </strong> <br>
<i> Audience: Microsoft Analytics Team </i> <br>
<strong> Goal: </strong> understand the consumer decision-making behind choosing Mac or Windows through the lenses of the Big Five personality traits and Hult DNA<br>
<strong> Source: </strong> survey conducted through Google Forms on 245 individuals in February 2020

***



<strong> Data and Assumptions: </strong> <br><br>
<u>Dataset:</u><br>
The dataset in this script comes from a survey of students at Hult International Business School conducted in February 2020. Each answer is on a scale from 1 to 5, where 1 corresponds to strongly disagree, 3 to a neutral feeling, and 5 to strongly agree. There are 245 respondents.
<br>

<u>Specifications: </u> 
- audience surveyed is representative of the population Microsoft is attempting to study
- audience surveyed is composed solely of Hult students

<u>Assumptions: </u>
- Confirmatory factor analysis: pre-checks are completed
- All survey respondents are fluent in English
- No respondents were aware of the Big Five Personalities prior to completing the survey
- All respondents identify themselves as female or male solely

<u> Demographic Data: </u> <br>
- age
- nationality
- gender
- ethnicity

<u> Spending Behavior data: </u> <br>
- Current laptop spending
- Ideal new laptop (if all prices were the same)


***

<strong> Null Hypothesis: </strong> <br><br>
<u> Big 5 Personalities </u> <br>
The survey is based on the <a href="https://doi.org/10.1111/j.1744-6570.1999.tb00174.x">Big Five personality traits</a>, a framework derived from a survey of the American population that identifies 5 overarching personality traits. In this case, the traits are analyzed in relation to customers' buying behavior:


1. Agreeableness is a measure of one's tendency towards social harmony. In other words, the level of cooperation and team interaction is scored.
2. Conscientiousness is a measure of self-discipline and the individual level of productivity displayed.
3. Extraversion represents how social and energetic an individual tends to be. The measure is also used to identify a person as either an introvert or an extrovert.
4. Openness measures the extent of imagination and creativity displayed, as opposed to a rather conventional personality.
5. Stress Tolerance represents one's ability to handle a stressful situation.

The approach adopted in the present study aims to analyze the role the Big 5 traits play in the overall performance of Microsoft and to present the resulting insights. <br><br>
We expect that due to the diverse and international nature of all respondents, we might see one or multiple different traits emerge as opposed to the original 5 traits.
<br><br>

<u> Hult DNA </u> <br>
<a href = "https://www.hult.edu/blog/why-every-leader-needs-growth-mindset/"> Hult DNA </a>  is a combination of cognitive-behavioral skills, consisting of the following categories: 
1. Thinking reflects students’ ability to recognize and challenge their bias towards a fixed mindset.
2. Communicating reflects students’ ability to recognize their weak and strong points and communicate them appropriately.
3. Team Building is a foothold of the growth mindset, representing the ability of a team member to leverage cultural differences in pursuit of mutual understanding.

***

<strong> Outline: </strong>
0. Preparation
1. Part 1: Anomaly Detection and Handling
2. Part 2: Exploratory Data Analysis
3. Part 3: Transformations
4. Part 4: Build an unsupervised learning model
5. Part 5: Interpreting the Model (Correlations and Clustering)

***

## Preparation

Run the following code to import necessary packages, load data, and set display options:

In [1]:
# Importing Necessary Libraries
import pandas            as pd  # data science essentials
import matplotlib.pyplot as plt # fundamental data visualization
import seaborn           as sns # enhanced visualization
import sys                      # system-specific parameters and functions
from sklearn.preprocessing   import StandardScaler # standard scaler
from sklearn.decomposition   import PCA            # pca
from sklearn.manifold        import TSNE           # t-SNE
from scipy.cluster.hierarchy import dendrogram, linkage # dendrograms
from sklearn.cluster         import KMeans              # k-means clustering

# setting print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)

# loading the data
file    = 'Survey_Data_Final_Exam.xlsx'
full_df = pd.read_excel(file)

<br>
Run the following dictionary to separate survey questions and answers according to survey source and question type. <br><br>

<i> Note: This dictionary also removes duplicate questions from the full dataset. Some questions were duplicated to be able to identify if respondents were responding seriously as opposed to randomly selecting answers.</i>

In [2]:
dictionary = {
   'negative': [
       "Don't talk a lot", 
       'Am not interested in abstract ideas',
       "Am not interested in other people's problems",
       'Do not have a good imagination',
       'Am not really interested in others',
       "Don't like to draw attention to myself",
       "Don't mind being the center of attention",
       "Don't  generate ideas that are new and different",
       "Don't persuasively sell a vision or idea",
       "Can't rally people on the team around a common goal"
    ],
    'bigfive': [
        'Am the life of the party',           'Feel little concern for others', 
        'Am always prepared',                 'Get stressed out easily', 
        'Have a rich vocabulary',             "Talk a lot", 
        'Am interested in people',            'Leave my belongings around', 
        'Am relaxed most of the time',        'Have difficulty understanding abstract ideas',
        'Feel comfortable around people',     'Insult people',
        'Pay attention to details',           'Worry about things',
        'Have a vivid imagination',           'Keep in the background', 
        "Sympathize with others' feelings",   'Make a mess of things', 
        'Seldom feel blue',                   'Interested in abstract ideas',
        'Start conversations',                "Interested in other people's problems", 
        'Get chores done right away',         'Am easily disturbed',
        'Have excellent ideas',               'Have little to say',
        'Have a soft heart',                  'Often forget to put things back in their proper place',
        'Get upset easily',                   'Have a good imagination',
        'Really interested in others',        'Talk to a lot of different people at parties',
        'Like order',                         'Change my mood a lot',
        'Am quick to understand things',      "Like to draw attention to myself",
        'Take time out for others',           'Shirk my duties',
        'Have frequent mood swings',          'Use difficult words',
        "Feel others' emotions",              'Follow a schedule',
        'Get irritated easily',               "Mind being the center of attention",
        'Spend time reflecting on things',    'Am quiet around strangers',
        'Make people feel at ease',           'Am exacting in my work', 
        'Often feel blue',                    'Am full of ideas'
    ],
    'hultdna': [
        'See underlying patterns in complex situations', 
        "Generate ideas that are new and different",
        'Demonstrate an awareness of personal strengths and limitations',
        'Display a growth mindset', 
        'Respond effectively to multiple priorities',
        "Take initiative even when circumstances, objectives, or rules aren't clear", 
        'Encourage direct and open discussions', 
        #'Respond effectively to multiple priorities.1', 
        #"Take initiative even when circumstances, objectives, or rules aren't clear.1",
        #'Encourage direct and open discussions.1', 
        'Listen carefully to others',
        "Persuasively sell a vision or idea",
        'Build cooperative relationships',
        'Work well with people from diverse cultural backgrounds',
        'Effectively negotiate interests, resources, and roles', 
        "Can rally people on the team around a common goal",
        'Translate ideas into plans that are organized and realistic', 
        'Resolve conflicts constructively',
        'Seek and use feedback from teammates',
        'Coach teammates for performance and growth', 
        'Drive for results'
    ],
    'demographic' : [
        'What laptop do you currently have?',
        'What laptop would you buy in next assuming if all laptops cost the same?',
        'What program are you in?',
        'What is your age?',
        'Gender',
        'What is your nationality? ',
        'What is your ethnicity?'
    ],
    'corr' : [
        'A', 'E', 
        'N', 'C', 'O',
        'What laptop do you currently have?',
        'What laptop would you buy in next assuming if all laptops cost the same?'
    ]
}
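
The note above says duplicated questions were included to spot respondents who answered at random. The following is a minimal sketch of how such a check could look, not the check used in this analysis. It assumes the duplicated columns carry the `.1` suffix that pandas adds to repeated headers (as in the commented-out entries above); the function name `duplicate_question_gap` and the toy answers are hypothetical.

```python
import pandas as pd


def duplicate_question_gap(df, suffix=".1"):
    """Mean absolute gap between each duplicated question and its original, per respondent."""
    # duplicated columns are those ending in the suffix whose base name also exists
    dup_cols = [c for c in df.columns
                if c.endswith(suffix) and c[:-len(suffix)] in df.columns]

    # absolute difference between the second and the first answer to the same question
    gaps = pd.DataFrame({c: (df[c] - df[c[:-len(suffix)]]).abs() for c in dup_cols})

    return gaps.mean(axis=1)


# toy answers (made up): the third respondent contradicts themselves
toy = pd.DataFrame({
    "Respond effectively to multiple priorities":   [5, 4, 1],
    "Respond effectively to multiple priorities.1": [5, 3, 5],
})

gap = duplicate_question_gap(toy)
```

A large gap flags a respondent whose two answers to the same question disagree; what counts as "large" is a judgment call that this sketch leaves open.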

***

### User-defined Functions

Run the following code to load the user-defined functions needed:

In [3]:
#Defining a function to standardize numerical variables in the dataset:
def standard(num_df, by_rows = False):
    """
    This function standardizes a dataframe that contains variables which are either
    integers or floats.
    
    ------
    num_df  : DataFrame, must contain only numerical variables
    by_rows : bool, Default = False. If True, will standardize DataFrame per rows
    
    """
    # INSTANTIATING a StandardScaler() object
    scaler = StandardScaler()

    if by_rows == False:
        # FITTING the scaler
        scaler.fit(num_df)
    
        # TRANSFORMING our data after fit
        scaled = scaler.transform(num_df)
    
        # converting scaled data into a DataFrame
        scaled_df = pd.DataFrame(scaled)
        
        # adding labels to the scaled DataFrame
        scaled_df.columns = num_df.columns
        
        # returning the standardized data frame into the global environment
        return scaled_df
    
    elif by_rows == True:
        
        # Transposing data frame
        transpose_df = num_df.transpose()
        
        # FITTING the scaler
        scaler.fit(transpose_df)
    
        # TRANSFORMING our data after fit
        transposed_scaled = scaler.transform(transpose_df)
        
        # Re-transposing our data frame 
        scaled = transposed_scaled.transpose()
        
        # converting scaled data into a DataFrame
        scaled_df = pd.DataFrame(scaled)
        
        # adding labels to the scaled DataFrame
        scaled_df.columns = num_df.columns
        
        # returning the standardized data frame into the global environment
        return scaled_df
    
    else:
        print('Something went wrong. Please specify by_rows argument as True or False.')
        
        
########################################
# scree_plot
########################################
def scree_plot(pca_object, export = False):
    # building a scree plot

    # setting plot size
    fig, ax = plt.subplots(figsize=(10, 8))
    features = range(pca_object.n_components_)


    # developing a scree plot
    plt.plot(features,
             pca_object.explained_variance_ratio_,
             linewidth = 2,
             marker = 'o',
             markersize = 10,
             markeredgecolor = 'black',
             markerfacecolor = 'grey')


    # setting more plot options
    plt.title('Scree Plot')
    plt.xlabel('PCA feature')
    plt.ylabel('Explained Variance')
    plt.xticks(features)

    if export == True:
    
        # exporting the plot
        plt.savefig('top_customers_correlation_scree_plot.png')
        
    # displaying the plot
    plt.show()
    
########################################
# inertia
########################################
def interia_plot(data, max_clust = 50):
    """
PARAMETERS
----------
data      : DataFrame, data from which to build clusters. Dataset should be scaled
max_clust : int, exclusive upper bound of the range of cluster counts to check inertia for, default 50
    """

    ks = range(1, max_clust)
    inertias = []


    for k in ks:
        # INSTANTIATING a kmeans object
        model = KMeans(n_clusters = k)


        # FITTING to the data
        model.fit(data)


        # append each inertia to the list of inertias
        inertias.append(model.inertia_)



    # plotting ks vs inertias
    fig, ax = plt.subplots(figsize = (12, 8))
    plt.plot(ks, inertias, '-o')


    # labeling and displaying the plot
    plt.xlabel('number of clusters, k')
    plt.ylabel('inertia')
    plt.xticks(ks)
    plt.show()
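
The `by_rows` option of `standard()` standardizes each respondent's answers instead of each question. As a rough illustration of the difference, here is a pandas-only sketch on made-up answers (toy data, not the survey). It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import pandas as pd

# toy answers: 3 respondents (rows) x 3 questions (columns)
toy = pd.DataFrame({"q1": [1, 3, 5],
                    "q2": [2, 2, 4],
                    "q3": [3, 1, 3]})

# column-wise z-scores: each question ends up with mean 0 and unit variance
by_cols = (toy - toy.mean()) / toy.std(ddof=0)

# row-wise z-scores: each respondent is centered on their own average answer
row_mean = toy.mean(axis=1)
row_std  = toy.std(axis=1, ddof=0)
by_rows  = toy.sub(row_mean, axis=0).div(row_std, axis=0)
```

Row-wise scaling removes differences in how respondents use the scale (for example, someone who answers 4 or 5 to everything), at the cost of discarding each respondent's overall level.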

## Part 1: Anomaly Detection and Handling

<strong> Reversed Answer Scales: </strong>
<br><br>
Answers in the survey range from 1 to 5, where 5 corresponds to strongly agree and 1 to strongly disagree.
However, certain questions are phrased negatively, which reverses the meaning of this scale. <br>

For example: In the question <i> "Don't talk a lot" </i>, a 5 would be someone who is very quiet. <br>

This represents an issue in our analysis, since we are not able to directly compare these questions. We will therefore reverse the scoring for the 'negative' questions, so that <i> "Don't talk a lot" </i> becomes <i> "Talk a lot" </i> and where a previous 5 (very quiet) becomes a 1 with the same meaning (very quiet).
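
On a 1 to 5 scale, the reversal described above amounts to subtracting each answer from 6, which leaves the neutral 3 unchanged. A small sketch on made-up answers:

```python
import pandas as pd

# toy answers to a negatively phrased question
answers = pd.Series([1, 2, 3, 4, 5], name="Don't talk a lot")

# reverse-scoring: 5 (very quiet) becomes 1, 1 becomes 5, 3 stays 3
reversed_answers = (6 - answers).rename("Talk a lot")
```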

In [4]:
# Reversing answer scales
full_df[dictionary['negative']] = full_df[dictionary['negative']].replace(to_replace = [5,4,2,1],
                                                                          value=[1,2,4,5])


positive = ["Talk a lot",
            'Interested in abstract ideas',
            "Interested in other people's problems",
            'Have a good imagination',
            'Really interested in others',
            "Like to draw attention to myself",
            "Mind being the center of attention",
            "Generate ideas that are new and different",
            "Persuasively sell a vision or idea",
            "Can rally people on the team around a common goal"]

# Renaming column questions with positive connotations
full_df.rename(columns=dict(zip(dictionary['negative'], positive)), inplace=True)

***
<strong> Duplicates: </strong>

While analyzing the survey, we identified multiple duplicate observations. These observations had exactly the same answers, including the free-text entries for age and nationality, but different survey IDs. We decided to remove them from our dataset to avoid inflating one personality trait over any other.

In [5]:
# Removing Survey ID column to be able to identify duplicates
full_df0 = full_df.drop('surveyID', axis = 1)

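# Dropping duplicates from our dataset (explicit copy so that later .loc
# assignments modify this DataFrame without a SettingWithCopyWarning)
full_df_noduplicates = full_df0.drop_duplicates().copy()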
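
As an alternative to dropping the ID column first, `drop_duplicates` can compare only the answer columns through its `subset` argument, which keeps the survey ID of the first occurrence. A minimal sketch on toy rows (the ID values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "surveyID":          ["a1", "a2", "a3"],
    "What is your age?": [25, 25, 31],
    "Talk a lot":        [4, 4, 2],
})

# compare every column except the ID, and keep the first of each duplicate group
answer_cols = [c for c in toy.columns if c != "surveyID"]
deduped     = toy.drop_duplicates(subset=answer_cols, keep="first")

# number of observations removed as duplicates
n_removed = len(toy) - len(deduped)
```

Tracking `n_removed` makes it easy to report how many of the original observations remain in the analysis.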

***

<strong> Data Cleaning: </strong> Nationality

The nationality question was set up as a free short-text entry. Although this allowed respondents to enter specific cases (e.g. dual nationality), we need to make sure every answer follows the same formatting before it can be analyzed.

In [6]:
#Nationality
#change all to lowercase
full_df_noduplicates.loc[:,'What is your nationality? '] = full_df_noduplicates.loc[:,'What is your nationality? '].str.lower()

# Standardizing country formatting 
full_df_noduplicates.loc[:,'What is your nationality? '] = full_df_noduplicates.loc[:,'What is your nationality? '].replace(
                                                            to_replace = ['india',             'china',          'korea',
                                                                          'korean',            'ecuador',        'taiwan',
                                                                          'usa',               'japan',          'russia',
                                                                          'spain',             'brazil',         'colombia',
                                                                          'republic of korea', 'costarrican',    'mauritius',
                                                                          'cameroon',          'indonesia',      'panama',
                                                                          'germany',           'czech republic', 'nigeria',
                                                                          'canada',            'belgian',        'south korea',
                                                                          'venezuela',         'congolese (dr congo)',
                                                                          'mexico',            'armenia',         'thailand',
                                                                          'dominican republic','philippines',     'peru',
                                                                          'indian.',           'malaysia',        'iran',
                                                                          'english',           'el salvador',     'belarus',
                                                                          'taiwan( r.o.c)',    'costa rica'
                                                                         ],
                                                            value      = ['indian',       'chinese',     'south korean',
                                                                          'south korean', 'ecuadorian',  'taiwanese',
                                                                          'american',     'japanese',    'russian',
                                                                          'spanish',      'brazilian',   'colombian',
                                                                          'south korean', 'costa rican', 'mauritian',
                                                                          'cameroonian',  'indonesian',  'panamanian',
                                                                          'german',       'czech',       'nigerian',
                                                                          'canadian',     'belgian',     'south korean',
                                                                          'venezuelan',   'congolese',   'mexican',
                                                                          'armenian',     'thai',        'dominican',
                                                                          'filipino',     'peruvian',    'indian',
                                                                          'malaysian',    'iranian',     'british',
                                                                          'salvadoran',   'belarusian',  'taiwanese',
                                                                          'costa rican'
                                                                         ])

# Grouping unknown nationalities
full_df_noduplicates.loc[:,'What is your nationality? '] = full_df_noduplicates.loc[:,'What is your nationality? '].replace(
                                                             to_replace = ['.', 'calm', 'multi-ethnic',
                                                                           'prefer not to answer'
                                                                          ],
                                                              value     = ['unknown','unknown','unknown',
                                                                           'unknown'
                                                                          ])
# Checking results
#full_df_noduplicates['What is your nationality? '].value_counts()
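
The same clean-up can also be written with a dictionary, which keeps each raw value next to its replacement and avoids misaligned lists. The sketch below uses a handful of the mappings above on toy input. Stripping surrounding whitespace is an extra step that is not part of the original cell.

```python
import pandas as pd

# toy free-text answers
raw = pd.Series([" India", "indian.", "South Korea", "korea", "Calm"])

# raw value -> standardized nationality (subset of the mapping above)
country_to_demonym = {
    "india":       "indian",
    "indian.":     "indian",
    "south korea": "south korean",
    "korea":       "south korean",
}

# answers that cannot be interpreted as a nationality
unknown = ["calm", ".", "multi-ethnic", "prefer not to answer"]

# normalize case and whitespace, then apply the mapping
cleaned = raw.str.strip().str.lower().replace(country_to_demonym)

# group uninterpretable answers
cleaned = cleaned.mask(cleaned.isin(unknown), "unknown")
```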




***
<strong> Data Cleaning: </strong> Laptop Used
<br>
One respondent used the text entry for computers to answer the question about their current laptop. We will group this observation with the corresponding category.

In [7]:
# Replacing MAC as Macbook
full_df_noduplicates.loc[: , 'What laptop do you currently have?'] = full_df_noduplicates.loc[: , 'What laptop do you currently have?'].replace(
                                                                        to_replace = 'MAC',
                                                                        value = 'Macbook')


full_df_noduplicates.loc[: ,'What laptop would you buy in next assuming if all laptops cost the same?'] = full_df_noduplicates.loc[: ,'What laptop would you buy in next assuming if all laptops cost the same?'].replace(
                                                                        to_replace = 'MAC',
                                                                        value = 'Macbook')


#full_df_noduplicates.loc[: , 'What laptop do you currently have?'].value_counts()



***
<strong> Missing Values: </strong> Ethnicity <br>
One value is missing; all other answers are consistently formatted. We will impute the missing value with 'Prefer not to answer'. Given the sensitivity of this question, we assume that a person who left it blank preferred not to answer it at all.

In [8]:
# Filling missing values
full_df_noduplicates.loc[:,'What is your ethnicity?'] = full_df_noduplicates.loc[:,'What is your ethnicity?'].fillna('Prefer not to answer')

# Checking our results
#full_df_noduplicates.loc[:,'What is your ethnicity?'].value_counts()




## Part 2: Exploratory Data Analysis

***
<strong> Correlation Analysis: </strong> <br>
After completing the Big Five personality survey, the following formulas are used to calculate a score for each personality trait, where $qt_{n}$ is the answer to survey question $n$ after the reverse-scoring from Part 1:

$$ E = 20 + qt_{1} + qt_{6} + qt_{11} - qt_{16} + qt_{21} - qt_{26} + qt_{31} + qt_{36} - qt_{41} - qt_{46} $$

$$ A = 14 - qt_{2} + qt_{7} - qt_{12} + qt_{17} + qt_{22} + qt_{27} + qt_{32} + qt_{37} + qt_{42} + qt_{47} $$

$$ C = 14 + qt_{3} - qt_{8} + qt_{13} - qt_{18} + qt_{23} - qt_{28} + qt_{33} - qt_{38} + qt_{43} + qt_{48} $$

$$ N = 2 + qt_{4} - qt_{9} + qt_{14} - qt_{19} + qt_{24} + qt_{29} + qt_{34} + qt_{39} + qt_{44} + qt_{49} $$

$$ O = 8 + qt_{5} - qt_{10} + qt_{15} + qt_{20} + qt_{25} + qt_{30} + qt_{35} + qt_{40} + qt_{45} + qt_{50} $$

<a href="https://openpsychometrics.org/printable/big-five-personality-test.pdf"> Source</a>

<i> Note: We noticed that our source's equation for the neuroticism score seemed to be reversed: it subtracted points for neurotic answers where it should have added them. To fix this, we reversed the sign associated with each question and inverted the intercept in relation to the mean.</i><br><br>
***

Before performing our own personality test, we compute each individual's score on every trait, in order to determine where they would fall in the traditional test:

In [9]:
# Computing scores based on original Big 5 formula
full_df = full_df_noduplicates.copy()

# Extraversion
full_df['E'] = (20 + full_df['Am the life of the party']
                   + full_df['Talk a lot'] # reversed the sign because of reversed scale
                   - full_df['Keep in the background']
                   + full_df['Feel comfortable around people']
                   + full_df['Start conversations']
                   - full_df['Have little to say']
                   + full_df['Talk to a lot of different people at parties']
                   + full_df['Like to draw attention to myself']   # reversed the sign because of reversed scale
                   - full_df['Mind being the center of attention'] # reversed the sign because of reversed scale
                   - full_df['Am quiet around strangers'] 
               )

# Neuroticism
full_df['N'] = (2 + full_df['Get stressed out easily']
                  -  full_df['Am relaxed most of the time']  
                   + full_df['Worry about things']
                   - full_df['Seldom feel blue']
                   + full_df['Am easily disturbed']
                   + full_df['Get upset easily']
                   + full_df['Change my mood a lot']
                   + full_df['Have frequent mood swings']
                   + full_df['Get irritated easily']
                   + full_df['Often feel blue']
               )

# Conscientiousness
full_df['C'] = (14 + full_df['Am always prepared'] 
                   - full_df['Leave my belongings around'] 
                   + full_df['Pay attention to details'] 
                   - full_df['Make a mess of things'] 
                   + full_df['Get chores done right away'] 
                   - full_df['Often forget to put things back in their proper place']
                   + full_df['Like order']
                   - full_df['Shirk my duties']
                   + full_df['Follow a schedule']
                   + full_df['Am exacting in my work'] 
               )

# Agreeableness
full_df['A'] = (14 - full_df['Feel little concern for others']
                   + full_df['Am interested in people']
                   - full_df['Insult people'] 
                   + full_df["Sympathize with others' feelings"]
                   + full_df["Interested in other people's problems"] # reversed the sign because of reversed scale
                   + full_df['Have a soft heart']
                   + full_df['Really interested in others']  # reversed the sign because of reversed scale
                   + full_df['Take time out for others'] 
                   + full_df["Feel others' emotions"] 
                   + full_df['Make people feel at ease']
               )

# Openness
full_df['O'] = (8 + full_df['Have a rich vocabulary']
                  - full_df['Have difficulty understanding abstract ideas']
                  + full_df['Have a vivid imagination']
                  + full_df['Interested in abstract ideas'] # reversed the sign because of reversed scale
                  + full_df['Have excellent ideas']
                  + full_df['Have a good imagination'] # reversed the sign because of reversed scale
                  + full_df['Am quick to understand things']
                  + full_df['Use difficult words']
                  + full_df['Spend time reflecting on things']
                  + full_df['Am full of ideas']
               )
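
Each score above is an intercept plus a signed sum of item answers. As a sketch of that pattern only, the hypothetical helper `trait_score` below takes the signs as a dictionary. It is applied to three of the ten Conscientiousness items on toy answers, so the result is a partial score for illustration and not a valid trait score.

```python
import pandas as pd


def trait_score(df, intercept, signed_items):
    """Intercept plus the signed sum of the listed item columns."""
    score = pd.Series(float(intercept), index=df.index)

    # add or subtract each item according to its sign
    for column, sign in signed_items.items():
        score = score + sign * df[column]

    return score


# toy answers for two respondents
toy = pd.DataFrame({
    "Am always prepared":         [5, 2],
    "Leave my belongings around": [1, 4],
    "Like order":                 [4, 3],
})

# signs as used in the Conscientiousness formula above
c_items = {"Am always prepared": +1,
           "Leave my belongings around": -1,
           "Like order": +1}

partial_c = trait_score(toy, intercept=14, signed_items=c_items)
```

Keeping the signs in one dictionary per trait makes them easy to compare against the published scoring key.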

***

To further understand how these scores relate to each person's purchase behavior, we'll run a correlation analysis. Windows is coded as 1 and every other laptop as -1, so a positive correlation means that higher scores on a trait go with owning or wanting a <strong> Windows </strong> laptop. A negative correlation means that higher scores go with a non-Windows laptop.

In [10]:
# creating a dataframe for correlation matrix
corr_df = full_df[dictionary['corr']].copy() # copy so the recoding below edits corr_df itself


In [11]:
# Assigning numeric values to purchase behavior
corr_df['What laptop would you buy in next assuming if all laptops cost the same?'] = corr_df['What laptop would you buy in next assuming if all laptops cost the same?'].replace(to_replace = ['Windows laptop','Macbook','Chromebook','MAC'],
                                                                                           value = [1,-1,-1,-1])

# Assigning numeric values to purchase behavior
corr_df['What laptop do you currently have?'] = corr_df['What laptop do you currently have?'].replace(to_replace = ['Windows laptop','Macbook','MAC'],
                                                                                           value = [1,-1,-1])



In [12]:
corr_df.corr().iloc[:5,5:7]

| Trait | What laptop do you currently have? | What laptop would you buy in next assuming if all laptops cost the same? |
|---|---|---|
| A | 0.019648 | -0.046888 |
| E | -0.197458 | -0.136809 |
| N | -0.178324 | -0.128762 |
| C | -0.014387 | -0.035001 |
| O | -0.043952 | -0.055262 |


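Observations:
- Extraversion has the strongest negative correlation with Windows (-0.20 for the current laptop, -0.14 for the next laptop), suggesting that extroverts lean towards Macs
- Neuroticism shows a similar pattern (-0.18 and -0.13), suggesting that neurotic people also lean towards Macs
- The other traits show no particular preference (all correlations are within ±0.06); for Agreeableness, this is aligned with the definition of an agreeable personality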
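
For reference, a Pearson correlation between a score and a two-valued variable (here Windows = 1, otherwise -1) is a point-biserial correlation: its sign shows which group has the higher average score. A toy sketch with made-up scores, where the `laptop` and `windows` column names are illustrative:

```python
import pandas as pd

# toy data: higher extraversion scores among Macbook owners
toy = pd.DataFrame({
    "E":      [30, 28, 22, 18, 15, 12],
    "laptop": ["Macbook", "Macbook", "Macbook",
               "Windows laptop", "Windows laptop", "Windows laptop"],
})

# Windows = 1, anything else = -1, mirroring the encoding used above
toy["windows"] = toy["laptop"].map(lambda v: 1 if v == "Windows laptop" else -1)

# Pearson r: negative here because high E goes with Macbook
r = toy["E"].corr(toy["windows"])
```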
***
***

<strong> Demographic Analysis: </strong> <br>
Next we will take a look at the purchase behavior based on respondents' gender, ethnicity and program of study. <br> <br>

<i> Note: gender in the survey was a binary single-choice answer, and did not include non-binary options or a </i> 'prefer not to answer' <i> option. Including them would have made the survey more inclusive and more accurate, since respondents who could not select their correct gender may bias the results.</i>

In [13]:
######################## Percent based on Gender #############################################
# Subsetting Mac users
full_df_Mac      = full_df[full_df['What laptop do you currently have?'] == 'Macbook']

# Subsetting Male users of Mac
full_df_Mac_male = full_df_Mac[full_df_Mac['Gender'] == 'Male']
value_male       = full_df_Mac_male.loc[:,'Gender'].value_counts()

# Calculating percent of all respondents kept after duplicate removal
percent_male_mac =  (value_male / len(full_df)) * 100

# Formatting for table
percent_male_mac_print = float(percent_male_mac.round(2))



# Subsetting Female users of Mac
full_df_Mac_female = full_df_Mac[full_df_Mac['Gender'] == 'Female']
value_female       = full_df_Mac_female.loc[:,'Gender'].value_counts()

# Calculating percent of all respondents kept after duplicate removal
percent_female_mac =  (value_female / len(full_df)) * 100

# Formatting for table
percent_female_mac_print = float(percent_female_mac.round(2))


# Subsetting users of Windows
full_df_windows = full_df[full_df['What laptop do you currently have?'] == 'Windows laptop']

# Subsetting Male users of Windows
full_df_windows_male = full_df_windows[full_df_windows['Gender'] == 'Male']
value_male_win       = full_df_windows_male.loc[:,'Gender'].value_counts()

# Calculating percent
percent_male_win =  (value_male_win / 245) * 100

# Formatting for table
percent_male_window_print = float(percent_male_win.round(2))




# Subsetting Female users of Windows
full_df_windows_female = full_df_windows[full_df_windows['Gender'] == 'Female']
value_female_win       = full_df_windows_female.loc[:,'Gender'].value_counts()

# Calculating percent
percent_female_windows =  (value_female_win / 245) * 100

# Formatting for table
percent_female_win_print = float(percent_female_windows.round(2))

In [14]:
######################## Percent based on Ethnicity #############################################

#African American users of Mac
full_df_Mac_African_American = full_df_Mac[full_df_Mac['What is your ethnicity?'] == 'African American']
value_African_American       = full_df_Mac_African_American.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_African_American_mac =  (value_African_American / 245) * 100

# Formatting for table
percent_African_American_mac_print = float(percent_African_American_mac.round(2))


#African American users of Windows
full_df_Win_African_American = full_df_windows[full_df_windows['What is your ethnicity?'] == 'African American']
value_African_American_win   = full_df_Win_African_American.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_African_American_win =  (value_African_American_win / 245) * 100

# Formatting for table
percent_African_American_win_print = float(percent_African_American_win.round(2))


#Far east Asian users of Mac
full_df_Mac_Far_east_Asian = full_df_Mac[full_df_Mac['What is your ethnicity?'] == 'Far east Asian']
value_Far_east_Asian       = full_df_Mac_Far_east_Asian.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Far_east_Asian_mac =  (value_Far_east_Asian / 245) * 100

# Formatting for table
percent_Far_east_Asian_mac_print = float(percent_Far_east_Asian_mac.round(2))


#Far east Asian users of Windows
full_df_windows_Far_east_Asian = full_df_windows[full_df_windows['What is your ethnicity?'] == 'Far east Asian']
value_Far_east_Asian_win       = full_df_windows_Far_east_Asian.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Far_east_Asian_windows =  (value_Far_east_Asian_win / 245) * 100

# Formatting for table
percent_Far_east_Asian_windows_print = float(percent_Far_east_Asian_windows.round(2))


#Hispanic / Latino users of Mac
full_df_Mac_Hispanic_Latino = full_df_Mac[full_df_Mac['What is your ethnicity?'] == 'Hispanic / Latino']
value_Hispanic_Latino_mac   = full_df_Mac_Hispanic_Latino.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Hispanic_Latino_mac =  (value_Hispanic_Latino_mac / 245) * 100

# Formatting for table
percent_Hispanic_Latino_mac_print = float(percent_Hispanic_Latino_mac.round(2))


#Hispanic / Latino users of Windows
full_df_windows_Hispanic_Latino = full_df_windows[full_df_windows['What is your ethnicity?'] == 'Hispanic / Latino']
value_Hispanic_Latino_windows   = full_df_windows_Hispanic_Latino.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Hispanic_Latino_windows =  (value_Hispanic_Latino_windows / 245) * 100

# Formatting for table
percent_Hispanic_Latino_windows_print = float(percent_Hispanic_Latino_windows.round(2))


#Middle Eastern users of Mac
full_df_Mac_Middle_Eastern = full_df_Mac[full_df_Mac['What is your ethnicity?'] == 'Middle Eastern']
value_Middle_Eastern_mac   = full_df_Mac_Middle_Eastern.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Middle_Eastern_mac =  (value_Middle_Eastern_mac / 245) * 100

# Formatting for table
percent_Middle_Eastern_mac_print = float(percent_Middle_Eastern_mac.round(3))


#Middle Eastern users of Windows
full_df_windows_Middle_Eastern = full_df_windows[full_df_windows['What is your ethnicity?'] == 'Middle Eastern']
value_Middle_Eastern_windows   = full_df_windows_Middle_Eastern.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Middle_Eastern_windows =  (value_Middle_Eastern_windows / 245) * 100

# Formatting for table
percent_Middle_Eastern_windows_print = float(percent_Middle_Eastern_windows.round(3))


#Prefer not to answer users of Mac
full_df_Mac_Prefer_not_to_answer = full_df_Mac[full_df_Mac['What is your ethnicity?'] == 'Prefer not to answer']
value_Prefer_not_to_answer_mac   = full_df_Mac_Prefer_not_to_answer.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Prefer_not_to_answer_mac =  (value_Prefer_not_to_answer_mac / 245) * 100

# Formatting for table
percent_Prefer_not_to_answer_mac_print = float(percent_Prefer_not_to_answer_mac.round(2))


#Prefer not to answer users of Windows
full_df_windows_Prefer_not_to_answer = full_df_windows[full_df_windows['What is your ethnicity?'] == 'Prefer not to answer']
value_Prefer_not_to_answer_windows   = full_df_windows_Prefer_not_to_answer.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_Prefer_not_to_answer_windows =  (value_Prefer_not_to_answer_windows / 245) * 100

# Formatting for table
percent_Prefer_not_to_answer_windows_print = float(percent_Prefer_not_to_answer_windows.round(2))



#West Asian / Indian users of Mac
full_df_Mac_West_Asian_Indian = full_df_Mac[full_df_Mac['What is your ethnicity?'] == 'West Asian / Indian']
value_West_Asian_Indian_mac   = full_df_Mac_West_Asian_Indian.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_West_Asian_Indian_mac =  (value_West_Asian_Indian_mac / 245) * 100

# Formatting for table
percent_West_Asian_Indian_mac_print = float(percent_West_Asian_Indian_mac.round(2))


#West Asian / Indian users of Windows
full_df_windows_West_Asian_Indian = full_df_windows[full_df_windows['What is your ethnicity?'] == 'West Asian / Indian']
value_West_Asian_Indian_windows   = full_df_windows_West_Asian_Indian.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_West_Asian_Indian_windows =  (value_West_Asian_Indian_windows / 245) * 100

# Formatting for table
percent_West_Asian_Indian_windows_print = float(percent_West_Asian_Indian_windows.round(2))



#White / Caucasian users of Mac
full_df_Mac_White_Caucasian = full_df_Mac[full_df_Mac['What is your ethnicity?'] == 'White / Caucasian']
value_White_Caucasian_mac   = full_df_Mac_White_Caucasian.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_White_Caucasian_mac =  (value_White_Caucasian_mac / 245) * 100

# Formatting for table
percent_White_Caucasian_mac_print = float(percent_White_Caucasian_mac.round(2))


#White / Caucasian users of Windows
full_df_windows_White_Caucasian = full_df_windows[full_df_windows['What is your ethnicity?'] == 'White / Caucasian']
value_White_Caucasian_windows   = full_df_windows_White_Caucasian.loc[:,'What is your ethnicity?'].value_counts()

# Calculating percent
percent_White_Caucasian_windows =  (value_White_Caucasian_windows / 245) * 100

# Formatting for table
percent_White_Caucasian_windows_print = float(percent_White_Caucasian_windows.round(2))

print(f"""
Gender                 % Macbook Users      % Windows Users
-------               ---------------        ----------
Male                     {percent_male_mac_print}                 {percent_male_window_print}

Female                   {percent_female_mac_print}                  {percent_female_win_print}



Ethnicity             % Macbook Users      % Windows Users
-----------           ---------------     -------------
African American           {percent_African_American_mac_print}              {percent_African_American_win_print}

Far east Asian             {percent_Far_east_Asian_mac_print}             {percent_Far_east_Asian_windows_print}

Hispanic/Latino            {percent_Hispanic_Latino_mac_print}              {percent_Hispanic_Latino_windows_print}

Middle Eastern             {percent_Middle_Eastern_mac_print}             {percent_Middle_Eastern_windows_print}

Prefer not to answer       {percent_Prefer_not_to_answer_mac_print}              {percent_Prefer_not_to_answer_windows_print}

West Asian / Indian        {percent_West_Asian_Indian_mac_print}              {percent_West_Asian_Indian_windows_print}

White/Caucasian            {percent_White_Caucasian_mac_print}             {percent_White_Caucasian_windows_print}
""")


Gender                 % Macbook Users      % Windows Users
-------               ---------------        ----------
Male                     26.53                 31.84

Female                   24.9                  16.73



Ethnicity             % Macbook Users      % Windows Users
-----------           ---------------     -------------
African American           2.45              2.45

Far east Asian             11.43             10.61

Hispanic/Latino            6.53              11.43

Middle Eastern             0.816             0.816

Prefer not to answer       5.31              3.27

West Asian / Indian        8.57              12.24

White/Caucasian            15.92             7.76



<strong> Observations: </strong>
- no significant difference between male and female Macbook users (26.53% vs. 24.9% of respondents), but a lower percentage of female respondents use Windows (16.73% vs. 31.84% for male respondents)
- no significant conclusions from ethnicity
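
The gender table above can also be produced without one subset per group. The following is a minimal sketch, assuming a toy DataFrame with a hypothetical `Laptop` column standing in for `What laptop do you currently have?` (the values are illustrative, not survey results). `pd.crosstab` with `normalize = 'all'` divides every cell by the total number of respondents, which mirrors the division by 245 used above.

```python
import pandas as pd

# toy data standing in for full_df (illustrative values only)
toy_df = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female',
               'Male', 'Female', 'Male', 'Female'],
    'Laptop': ['Macbook', 'Windows laptop', 'Macbook', 'Macbook',
               'Windows laptop', 'Windows laptop', 'Macbook', 'Macbook'],
})

# share of ALL respondents in each gender x laptop cell, in percent
pct_table = (pd.crosstab(index     = toy_df['Gender'],
                         columns   = toy_df['Laptop'],
                         normalize = 'all') * 100).round(2)

print(pct_table)
```

The same call with the ethnicity column as `index` would cover the second table, and the hard-coded total disappears because the denominator is taken from the data.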

***
<strong> Churn Analysis: </strong><br><br>

In the survey, we asked respondents which laptop they currently have and which laptop they would buy next, assuming all laptops cost the same. Using these two columns, we can determine the churn rate, or in other words, how many customers are switching from Microsoft to Apple. <br>

In [15]:
# WINDOWS
full_df_windows      = full_df[full_df['What laptop do you currently have?'] == 'Windows laptop']
total_windows        = float(full_df_windows.loc[:,'What laptop do you currently have?'].value_counts())

# Windows to Mac churn
windows_to_mac       = full_df_windows[full_df_windows['What laptop would you buy in next assuming if all laptops cost the same?'] == 'Macbook']
value_windows_to_mac = float(windows_to_mac.loc[:,'What laptop would you buy in next assuming if all laptops cost the same?'].value_counts())

# Formatting for table
percent_windows_to_mac = round(((value_windows_to_mac / total_windows) * 100),2)


# Windows to Windows churn
windows_to_windows       = full_df_windows[full_df_windows['What laptop would you buy in next assuming if all laptops cost the same?'] == 'Windows laptop']
value_windows_to_windows = float(windows_to_windows.loc[:,'What laptop would you buy in next assuming if all laptops cost the same?'].value_counts())

# Formatting for table
percent_windows_to_windows = round(((value_windows_to_windows / total_windows) * 100),2)


# Windows to Chrome churn
windows_to_chrom       = full_df_windows[full_df_windows['What laptop would you buy in next assuming if all laptops cost the same?'] == 'Chromebook']
value_windows_to_chrom = float(windows_to_chrom.loc[:,'What laptop would you buy in next assuming if all laptops cost the same?'].value_counts())

# Formatting for table
percent_windows_to_chrom = round(((value_windows_to_chrom / total_windows) * 100),2)

# MACBOOKS
full_df_Mac = full_df[full_df['What laptop do you currently have?'] == 'Macbook']
total_mac   = float(full_df_Mac.loc[:,'What laptop do you currently have?'].value_counts())

# Mac to Mac churn
mac_to_mac       = full_df_Mac [full_df_Mac['What laptop would you buy in next assuming if all laptops cost the same?'] == 'Macbook']
value_mac_to_mac = float(mac_to_mac.loc[:,'What laptop would you buy in next assuming if all laptops cost the same?'].value_counts())

# Formatting for table
percent_mac_to_mac = round(((value_mac_to_mac / total_mac) * 100),2)


# Mac to Windows churn
mac_to_windows       = full_df_Mac[full_df_Mac['What laptop would you buy in next assuming if all laptops cost the same?'] == 'Windows laptop']
value_mac_to_windows = float(mac_to_windows.loc[:,'What laptop would you buy in next assuming if all laptops cost the same?'].value_counts())

# Formatting for table
percent_mac_to_windows = round(((value_mac_to_windows / total_mac) * 100),2)


# Mac to Chrome churn
mac_to_chrom       = full_df_Mac[full_df_Mac['What laptop would you buy in next assuming if all laptops cost the same?'] == 'Chromebook']
value_mac_to_chrom = float(mac_to_chrom.loc[:,'What laptop would you buy in next assuming if all laptops cost the same?'].value_counts())

# Formatting for table
percent_mac_to_chrom = round(((value_mac_to_chrom / total_mac) * 100),2)

print(f"""
                      % To Macbook             % To Windows        % To Chrome
-------               ---------------        --------------       -------------
Churn From Windows        {percent_windows_to_mac}                  {percent_windows_to_windows}                {percent_windows_to_chrom}

Churn From Mac            {percent_mac_to_mac}                  {percent_mac_to_windows}                 {percent_mac_to_chrom}


""")


                      % To Macbook             % To Windows        % To Chrome
-------               ---------------        --------------       -------------
Churn From Windows        19.33                  77.31                3.36

Churn From Mac            89.68                  8.73                 1.59





<strong> Observations: </strong>
- the conversion rate to Macbooks (19.33% of Windows owners) is higher than the conversion rate to Windows (8.73% of Macbook owners) -> investigate further
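
A churn table like the one above is a row-normalized cross-tabulation: each row is a current laptop, and the cells show where those owners would go next. A minimal sketch, assuming hypothetical `current` and `next` columns and made-up answers:

```python
import pandas as pd

# toy data: 5 Windows owners and 5 Macbook owners (illustrative values only)
toy_df = pd.DataFrame({
    'current': ['Windows laptop'] * 5 + ['Macbook'] * 5,
    'next'   : ['Windows laptop', 'Windows laptop', 'Windows laptop',
                'Macbook', 'Chromebook',
                'Macbook', 'Macbook', 'Macbook', 'Macbook',
                'Windows laptop'],
})

# normalize = 'index' divides each row by its own total,
# so every row sums to 100% of that laptop's current owners
churn_table = (pd.crosstab(index     = toy_df['current'],
                           columns   = toy_df['next'],
                           normalize = 'index') * 100).round(2)

print(churn_table)
```

Reading across a row gives retention on the matching column and churn on the others.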

***
***

## Part 3: Transformations

***
<strong> Variance and Scaling: </strong>
<br>
Because PCA and clustering methods are both sensitive to the variance in our data, we want to ensure that every variable and observation is on the same scale. Since every survey question used the same 1-to-5 answer scale, there is no initial need to scale the data across variables. However, since we are looking at people's behaviors, we know that one person's 5 can actually correspond to another person's 4. Therefore, we will first standardize each observation to take this into account, and then standardize each column.

***
As mentioned earlier, the survey consists of 2 different data sources: Hult DNA and Big Five personality test. We will split these into 2 different data frames, without demographic data, to conduct PCA and Clustering Techniques.

In [16]:
# dropping demographic data
questions  = full_df.drop(dictionary['demographic'],  axis = 1)


# Standardizing per observation with user-defined function
questions_rows = standard(questions,      
                          by_rows = True)

# Standardizing per column
questions_cols = standard(questions_rows, 
                          by_rows = False)
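
The `standard` helper is defined earlier in the notebook and is not reproduced here. As an illustration of the idea only, `standardize_sketch` below is a hypothetical stand-in that z-scores a DataFrame by row or by column; the toy answers are made up.

```python
import numpy as np
import pandas as pd


def standardize_sketch(df, by_rows = False):
    """Hypothetical stand-in for the notebook's `standard` helper:
    z-scores each row (by_rows = True) or each column (by_rows = False)."""
    stat_axis  = 1 if by_rows else 0   # axis along which mean and std are computed
    align_axis = 0 if by_rows else 1   # axis used to line the statistics up again
    centered   = df.sub(df.mean(axis = stat_axis), axis = align_axis)
    return centered.div(df.std(axis = stat_axis, ddof = 0), axis = align_axis)


# three respondents answering three questions on a 1-5 scale
toy = pd.DataFrame({'q1': [1, 2, 5],
                    'q2': [3, 4, 4],
                    'q3': [5, 3, 3]})

toy_rows = standardize_sketch(toy,      by_rows = True)    # per respondent
toy_cols = standardize_sketch(toy_rows, by_rows = False)   # then per question

print(toy_cols.round(2))
```

One caveat of this sketch: a respondent who gives the same answer to every question has a standard deviation of zero, so the row-wise step would produce missing or infinite values for that row.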



In [17]:
# splitting between Big Five Questions and Hult DNA Questions
big_5    = questions_cols[dictionary['bigfive']]
hult_dna = questions_cols[dictionary['hultdna']]


***
***

## Part 4: Build an unsupervised learning model

<strong> Purpose: </strong> <br>
The goal of our analysis is to identify potential personality types that could be used in Windows' marketing segmentation strategies. Therefore, principal component analysis, or PCA, is useful to uncover these personalities from our survey respondents' answers since personality is something we cannot measure directly (latent trait exploration).
<br><br>
***
To do so, we will first run a PCA model and then interpret each component's factor loadings in the next section. Factor loadings are an interpretable, correlation-like metric; with them, we will be able to identify the different personalities and Hult DNA traits.

***
### Source 1: Hult DNA

We will start by conducting a Principal Component Analysis on the questions related to Hult DNA. The goal is to identify certain traits of respondents. <br>
The first model does not specify the number of components, since we will use the Scree Plot to do so.

In [18]:
# INSTANTIATING a PCA model
pca = PCA(n_components = None,
          random_state = 222)


# FITTING and TRANSFORMING
hult_dna_pca = pca.fit_transform(hult_dna)


In [19]:
#scree_plot(pca, export = False)

<strong> Observation: </strong> <br> The scree plot suggests that the best number of components is 5. Upon analysis of the factor loadings with 5 components, however, the last 2 components did not map clearly onto customer personas. Therefore, we selected the analysis with 3 components. 
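
A scree plot displays the share of variance explained by each component. The numbers behind it can be inspected directly, as in this minimal sketch; it assumes random data in place of the Hult DNA answers, so the printed shares say nothing about the survey.

```python
import numpy as np
from sklearn.decomposition import PCA

# random stand-in for the standardized answers (245 respondents, 8 questions)
rng   = np.random.default_rng(222)
toy_X = rng.normal(size = (245, 8))

# fitting with n_components = None keeps every component
pca_full = PCA(n_components = None,
               random_state = 222).fit(toy_X)

# cumulative share of variance explained by the first k components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

for k, share in enumerate(cum_var, start = 1):
    print(f"{k} components: {share:.2%} of variance explained")
```

Comparing the cumulative share at 3 and at 5 components is one way to quantify how much variance is given up in exchange for more interpretable personas.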

In [20]:
# INSTANTIATING a PCA model
pca = PCA(n_components = 3,
          random_state = 222)

# FITTING and TRANSFORMING 
hult_dna_pca    = pca.fit_transform(hult_dna)

***
### Source 2: BIG 5 Personalities

In [21]:
# INSTANTIATING a PCA model
pca2 = PCA(n_components = None,
          random_state = 222)


# FITTING the model so the scree plot has explained variances to display
pca2.fit(big_5)


#scree_plot(pca2,export = False)

<strong> Observation: </strong> <br> Best number of components is 5

In [22]:
# INSTANTIATING a PCA model
pca2 = PCA(n_components = 5,
          random_state = 222)

# FITTING and TRANSFORMING 
big5_pca = pca2.fit_transform(big_5)

***
***

## Part 5:  Interpreting the Model

### 1. Factor Loadings Analysis: 
To interpret our principal components, we will analyze their factor loadings, which describe how each of the features correlates with each principal component.
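
One nuance worth illustrating: scikit-learn's `components_` attribute holds unit-length eigenvectors, which rank features the same way as correlations within a component but are not correlations themselves. The sketch below, on random toy data with hypothetical question names, shows how rescaling by the square root of the explained variance (and the feature standard deviation) yields the actual feature-to-component correlations.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# random toy data with hypothetical question names (not survey data)
rng   = np.random.default_rng(222)
toy_X = pd.DataFrame(rng.normal(size = (245, 4)),
                     columns = ['q1', 'q2', 'q3', 'q4'])
toy_X['q2'] = toy_X['q1'] + 0.5 * toy_X['q2']   # make two questions move together

toy_pca = PCA(n_components = 2, random_state = 222)
scores  = toy_pca.fit_transform(toy_X)

# eigenvector weights: one row per feature, one column per component
weights = pd.DataFrame(toy_pca.components_.T,
                       index   = toy_X.columns,
                       columns = ['PC1', 'PC2'])

# correlation of each feature with each component
corr_loadings = weights * np.sqrt(toy_pca.explained_variance_)
corr_loadings = corr_loadings.div(toy_X.std(ddof = 1), axis = 0)

print(corr_loadings.round(2))
```

When the inputs are already standardized, the division by the feature standard deviation changes very little, so the eigenvector weights and the correlations tell the same story.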

1. Hult DNA:

In [23]:
# transposing pca components (pd.np is deprecated, so the array's own .T is used)
factor_loadings_df  = pd.DataFrame(pca.components_.T)

# naming rows as original features
factor_loadings_df  = factor_loadings_df.set_index(hult_dna.columns)

  


After analysis, we have identified 3 personas from the Hult DNA questions:
1. Individualistic: works better alone, as opposed to in teams
2. Structured Collaborations: connects to others and works collaboratively in team environments
3. Visionaries: generates innovative ideas and gathers people around them

Now that we have developed personas, we can analyze how much each customer fits into each group. The following code names each principal component after its persona and stores each customer's component scores under those names.


In [24]:
# naming each principal component
factor_loadings_df.columns  = ['Individualistic',
                               'Structured Collaborations',
                               'Visionaries']

# converting PCA results into a DataFrame 
hultdna_pca = pd.DataFrame(hult_dna_pca)


# renaming columns
hultdna_pca.columns = factor_loadings_df.columns

2. Big 5 Personalities:

In [25]:
# transposing pca components (pd.np is deprecated, so the array's own .T is used)
factor_loadings_df2 = pd.DataFrame(pca2.components_.T)


# naming rows as original features
factor_loadings_df2 = factor_loadings_df2.set_index(big_5.columns)


  


After analysis, we have identified 5 personas from the Big 5 questions:
1. Neuroticism
2. Extraversion
3. Agreeableness
4. Free-spirit
5. Conscientiousness

Note that we have kept 4 of the 5 original personality types. These matched the factor loadings of our analysis. However, instead of openness, we found that the 5th personality was 'free-spirited'. These are people who do not get stressed often and do not really pay attention to details.<br>
The following code names each principal component after its persona and stores each customer's component scores under those names.

In [26]:
# naming each principal component
factor_loadings_df2.columns = ['Neuroticism',
                               'Extraversion',
                               'Agreeableness',
                               'Free-spirit',
                               'Conscientiousness']

# converting into a DataFrame 
big5_pca = pd.DataFrame(big5_pca)

# renaming columns
big5_pca.columns = factor_loadings_df2.columns


***
### 2. Correlation Analysis: 
As we did with the original big 5 personality traits, we will run a correlation to evaluate each personality's purchase behavior.

In [27]:
# creating a dataframe to join demographic data for correlation
corr_df2 = pd.concat([big5_pca,
                      full_df[dictionary['demographic']]], 
                     axis = 1)

In [28]:
corr_df2['What laptop would you buy in next assuming if all laptops cost the same?'] = \
corr_df2['What laptop would you buy in next assuming if all laptops cost the same?'].replace(to_replace = ['Windows laptop','Macbook','Chromebook','MAC'],
                                                                                             value = [1,-1,-1,-1])

corr_df2['What laptop do you currently have?'] = \
corr_df2['What laptop do you currently have?'].replace(to_replace = ['Windows laptop','Macbook','MAC'],
                                                       value = [1,-1,-1])

In [29]:
corr_df2.corr().iloc[:5,5:7]

| | What laptop do you currently have? | What laptop would you buy in next assuming if all laptops cost the same? |
|---|---|---|
| Neuroticism | -0.072583 | -0.045011 |
| Extraversion | -0.264118 | -0.120408 |
| Agreeableness | 0.096621 | 0.133254 |
| Free-spirit | 0.16085 | 0.089075 |
| Conscientiousness | -0.020654 | 0.000566 |


<br>
Further investigating Extraversion correlation:

In [30]:
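# Extraverts who would buy a non-Windows laptop next
# NOTE: In [28] recoded this column numerically (Windows laptop = 1, all others = -1),
# so the comparison uses -1 rather than the original 'Macbook' label
next_col = 'What laptop would you buy in next assuming if all laptops cost the same?'
len(corr_df2[(corr_df2['Extraversion'] > 0) & (corr_df2[next_col] == -1)])

# share of all 245 respondents (the count of 47 is hard-coded)
round((47/245),2)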




0.19

***
### 3. Clustering

To gain further insight on our customer groups, we will perform a clustering technique. This will group individuals based on their similarities. If we identify a group that is similar among themselves but somewhat different from others, we can recommend a specific targeted marketing strategy for that group.

To build the clusters, we first need to rescale the data: we will be clustering on the principal components, and after PCA the variance amongst these features is no longer equal.<br><br>
Then, we can use the inertia plot and dendrogram to establish the number of clusters we will be using.
<br><br> 
Finally, we will create the clusters using k-Means technique. Let's start with Hult DNA data.
***

<br>
1. Hult DNA Clusters

In [32]:
# Re-scaling our data
hultdna_pca_scaled = standard(hultdna_pca, by_rows = False)

<br>After analyzing the dendrogram and the inertia plot, we establish that the ideal number of clusters is 4. With 4 clusters, the inertia (within-cluster sum of squares) is reduced by about 50% (from 1200 to around 600).
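
The numbers behind an inertia plot can be tabulated by fitting k-Means for a range of cluster counts. This is a minimal sketch on random stand-in data (not the scaled persona scores), so the printed values will not match the figures quoted above; `n_init = 10` is set explicitly as an assumption because the default differs across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# random stand-in for the scaled component scores (245 respondents, 3 personas)
rng        = np.random.default_rng(222)
toy_scores = rng.normal(size = (245, 3))

# fitting one model per candidate number of clusters and storing its inertia
inertia = {}
for k in range(1, 9):
    model = KMeans(n_clusters   = k,
                   n_init       = 10,
                   random_state = 222)
    model.fit(toy_scores)
    inertia[k] = model.inertia_

# inertia relative to a single cluster shows the reduction achieved by each k
for k, value in inertia.items():
    print(f"k = {k}: inertia = {value:.1f} ({value / inertia[1]:.0%} of k = 1)")
```

The "elbow" is the point after which adding a cluster lowers the inertia only marginally.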

In [36]:
# INSTANTIATING a k-Means object with four clusters
hultdna_k_pca = KMeans(n_clusters = 4,
                        random_state = 222)


# fitting the object to the data
hultdna_k_pca.fit(hultdna_pca_scaled)


# converting the clusters to a DataFrame
hultdna_kmeans_pca = pd.DataFrame({'Cluster': hultdna_k_pca.labels_})


In [35]:
# storing cluster centers
hultdna_centroids_pca = hultdna_k_pca.cluster_centers_


# converting cluster centers into a DataFrame
hultdna_centroids_pca_df = pd.DataFrame(hultdna_centroids_pca)


# renaming principal components
hultdna_centroids_pca_df.columns = ['Individualistic',
                                    'Structured Collaborations',
                                    'Visionaries']


<br><strong> Concatenating PCA with Demographic Data for Cluster Analysis </strong><br><br>
We will later analyze these clusters using box plots and counts.

In [37]:
# concat demographic back with hult-dna data
hultdna_demo_df = pd.concat([hultdna_kmeans_pca ,
                         hultdna_pca],
                         axis = 1)

final_hultdna_demo_df1 = pd.concat([hultdna_demo_df,
                         full_df.loc[:148,dictionary['demographic']]],
                         axis = 1,
                         join = 'inner')

# subsetting to remove missing values from previous duplicates
df1 = full_df.loc[295:, dictionary['demographic']].dropna(axis = 1)
df2 = df1.tail(n = 97)
df2.index = range(148,245)

# joining the filtered data 
final_hultdna_demo_df2      = pd.concat([hultdna_demo_df.iloc[148:,:],
                                         df2],
                                         axis = 1)
final_final_hultdna_demo_df = pd.concat([final_hultdna_demo_df1,
                                         final_hultdna_demo_df2],
                                         axis = 0)
# Naming our clusters
hultdna_cluster_names = {0 : 'Cluster 1',
                         1 : 'Cluster 2',
                         2 : 'Cluster 3',
                         3 : 'Cluster 4'}


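# assigning the result back avoids relying on an in-place replace of a selected column
final_final_hultdna_demo_df['Cluster'] = final_final_hultdna_demo_df['Cluster'].replace(hultdna_cluster_names)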


<br>Now we plot our boxplots to visually interpret the personalities of each of these clusters as well as the range of personality traits in these clusters.<br><br>
<i> Note: the code has been commented out to prevent unnecessary output </i>

In [None]:
## Box plots for current laptop used
#hultdna_lst=['Individualistic','Structured Collaborations','Visionaries']
#for i in hultdna_lst:
#    fig, ax = plt.subplots(figsize = (12, 8))
#    sns.boxplot(x = 'What laptop do you currently have?',
#            y = i,
#            hue = 'Cluster',
#            data = final_final_hultdna_demo_df)
#    plt.tight_layout()
#    plt.show()

We plot our boxplots for the future purchase question.

In [None]:
## Box plots for future purchase
#for i in hultdna_lst:
#    fig, ax = plt.subplots(figsize = (12, 8))
#    sns.boxplot(x = 'What laptop would you buy in next assuming if all laptops cost the same?',
#            y = i,
#            hue = 'Cluster',
#            data = final_final_hultdna_demo_df)
#    plt.tight_layout()
#    plt.show()

***
<br>
2. Big 5 Clusters

We scale our PCA dataframe for more interpretable results.

In [38]:
# Re-scaling our data
big5_pca_scaled = standard(big5_pca, by_rows = False)

<br>After analyzing the dendrogram and the inertia plot, we establish that the ideal number of clusters is 7. With 7 clusters, the inertia (within-cluster sum of squares) is reduced by about 50% (from 2000 to around 1000).

In [39]:
# INSTANTIATING a k-Means object with seven clusters
big5_k_pca = KMeans(n_clusters = 7,
                        random_state = 222)


# fitting the object to the data
big5_k_pca.fit(big5_pca_scaled)


# converting the clusters to a DataFrame
big5_kmeans_pca = pd.DataFrame({'Cluster': big5_k_pca.labels_})


# checking the number of respondents in each cluster
big5_kmeans_pca['Cluster'].value_counts()


1    65
2    39
5    36
4    35
0    28
3    27
6    15
Name: Cluster, dtype: int64


We examine our centroids dataframe to understand the balance of personality traits within each of our clusters. This is later used to interpret cluster behavior between current ownership and future purchases.

In [40]:
# storing cluster centers
big5_centroids_pca = big5_k_pca.cluster_centers_


# converting cluster centers into a DataFrame
big5_centroids_pca_df = pd.DataFrame(big5_centroids_pca)


# renaming principal components
big5_centroids_pca_df.columns = ['Neuroticism',
                                 'Extraversion',
                                 'Agreeableness',
                                 'Free-spirit',
                                 'Conscientiousness']


In [41]:
# concat demographic back with Big 5 data
big5_demo_df = pd.concat([big5_kmeans_pca ,
                          big5_pca],
                          axis = 1)

final_big5_demo_df1 = pd.concat([big5_demo_df,
                                 full_df.loc[:148,dictionary['demographic']]],
                                 axis = 1,
                                 join='inner')

# subsetting to remove missing values from previous duplicates
df1 = full_df.loc[295:,dictionary['demographic']].dropna(axis = 1)
df2 = df1.tail(n = 97)
df2.index = range(148,245)

# joining the filtered data 
final_big5_demo_df2      = pd.concat([big5_demo_df.iloc[148:,:],
                                      df2],
                                      axis = 1)
final_final_big5_demo_df = pd.concat([final_big5_demo_df1,
                                      final_big5_demo_df2],
                                      axis = 0)

# Naming our clusters
big5_cluster_names = {0 : 'Cluster 1',
                      1 : 'Cluster 2',
                      2 : 'Cluster 3',
                      3 : 'Cluster 4',
                      4 : 'Cluster 5',
                      5 : 'Cluster 6',
                      6 : 'Cluster 7'}


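# assigning the result back avoids relying on an in-place replace of a selected column
final_final_big5_demo_df['Cluster'] = final_final_big5_demo_df['Cluster'].replace(big5_cluster_names)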



<br>
For each question, we assign counts of Macbook and PC to different objects. We then print these objects in a dynamic string for interpretation of consumer migration between platforms for each cluster.

In [42]:
C1_Macowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 1'][final_final_big5_demo_df['What laptop do you currently have?']=='Macbook'].count()
C2_Macowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 2'][final_final_big5_demo_df['What laptop do you currently have?']=='Macbook'].count()
C3_Macowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 3'][final_final_big5_demo_df['What laptop do you currently have?']=='Macbook'].count()
C4_Macowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 4'][final_final_big5_demo_df['What laptop do you currently have?']=='Macbook'].count()
C5_Macowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 5'][final_final_big5_demo_df['What laptop do you currently have?']=='Macbook'].count()
C6_Macowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 6'][final_final_big5_demo_df['What laptop do you currently have?']=='Macbook'].count()
C7_Macowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 7'][final_final_big5_demo_df['What laptop do you currently have?']=='Macbook'].count()

C1_Windowsowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 1'][final_final_big5_demo_df['What laptop do you currently have?']=='Windows laptop'].count()
C2_Windowsowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 2'][final_final_big5_demo_df['What laptop do you currently have?']=='Windows laptop'].count()
C3_Windowsowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 3'][final_final_big5_demo_df['What laptop do you currently have?']=='Windows laptop'].count()
C4_Windowsowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 4'][final_final_big5_demo_df['What laptop do you currently have?']=='Windows laptop'].count()
C5_Windowsowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 5'][final_final_big5_demo_df['What laptop do you currently have?']=='Windows laptop'].count()
C6_Windowsowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 6'][final_final_big5_demo_df['What laptop do you currently have?']=='Windows laptop'].count()
C7_Windowsowners = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 7'][final_final_big5_demo_df['What laptop do you currently have?']=='Windows laptop'].count()

C1_Macbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 1'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Macbook'].count()
C2_Macbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 2'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Macbook'].count()
C3_Macbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 3'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Macbook'].count()
C4_Macbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 4'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Macbook'].count()
C5_Macbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 5'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Macbook'].count()
C6_Macbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 6'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Macbook'].count()
C7_Macbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 7'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Macbook'].count()

# Respondents per cluster who would buy a Windows laptop next (if all laptops cost the same)
C1_Windowsbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 1'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Windows laptop'].count()
C2_Windowsbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 2'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Windows laptop'].count()
C3_Windowsbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 3'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Windows laptop'].count()
C4_Windowsbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 4'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Windows laptop'].count()
C5_Windowsbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 5'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Windows laptop'].count()
C6_Windowsbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 6'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Windows laptop'].count()
C7_Windowsbuyers = final_final_big5_demo_df['Cluster'][final_final_big5_demo_df['Cluster']=='Cluster 7'][final_final_big5_demo_df['What laptop would you buy in next assuming if all laptops cost the same?']=='Windows laptop'].count()

print(f"""
                Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7

Mac Owner:     {C1_Macowners:>10}{C2_Macowners:>10}{C3_Macowners:>10}{C4_Macowners:>10}{C5_Macowners:>10}{C6_Macowners:>10}{C7_Macowners:>10}

Windows Owner: {C1_Windowsowners:>10}{C2_Windowsowners:>10}{C3_Windowsowners:>10}{C4_Windowsowners:>10}{C5_Windowsowners:>10}{C6_Windowsowners:>10}{C7_Windowsowners:>10}
_____________________________________________________________________________________

Mac Buyer:     {C1_Macbuyers:>10}{C2_Macbuyers:>10}{C3_Macbuyers:>10}{C4_Macbuyers:>10}{C5_Macbuyers:>10}{C6_Macbuyers:>10}{C7_Macbuyers:>10}

Windows Buyer: {C1_Windowsbuyers:>10}{C2_Windowsbuyers:>10}{C3_Windowsbuyers:>10}{C4_Windowsbuyers:>10}{C5_Windowsbuyers:>10}{C6_Windowsbuyers:>10}{C7_Windowsbuyers:>10}
""")


                Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7

Mac Owner:             19        37        19        18        11        14         8

Windows Owner:          9        28        20         9        24        22         7
_____________________________________________________________________________________

Mac Buyer:             19        42        22        14        16        15         8

Windows Buyer:          8        23        16        12        18        19         7
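
The per-cluster counts above can also be produced without one variable per cell. The following is a minimal sketch using `pd.crosstab`, not the code used in this analysis. It runs on a small made-up `toy_df` rather than the survey data, and it assumes that current Mac owners are recorded as `'Macbook'` in the ownership column.

```python
import pandas as pd

OWN_COL = "What laptop do you currently have?"
BUY_COL = "What laptop would you buy in next assuming if all laptops cost the same?"

# Toy stand-in for final_final_big5_demo_df (illustrative rows, NOT survey data)
toy_df = pd.DataFrame({
    "Cluster": ["Cluster 1", "Cluster 1", "Cluster 2", "Cluster 2", "Cluster 2"],
    OWN_COL: ["Macbook", "Windows laptop", "Macbook", "Macbook", "Windows laptop"],
    BUY_COL: ["Macbook", "Macbook", "Macbook", "Windows laptop", "Windows laptop"],
})

# Rows = survey answer, columns = cluster, cells = number of respondents
owners_by_cluster = pd.crosstab(toy_df[OWN_COL], toy_df["Cluster"])
buyers_by_cluster = pd.crosstab(toy_df[BUY_COL], toy_df["Cluster"])

# Stack both tables under an outer "Owner" / "Buyer" index level
summary = pd.concat({"Owner": owners_by_cluster, "Buyer": buyers_by_cluster})
print(summary)
```

Replacing `toy_df` with `final_final_big5_demo_df` should yield the same counts as the table above, provided the answer labels match those assumed here.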



***
Now we plot our boxplots for both questions to better interpret these clusters.
<br><br>
<i> Note: the code has been commented out to prevent unnecessary output. </i>

In [None]:
#big5_lst=['Neuroticism', 'Extraversion', 'Agreeableness', 'Free-spirit', 'Conscientiousness']
#for i in big5_lst:
#    fig, ax = plt.subplots(figsize = (12, 8))
#    sns.boxplot(x = 'What laptop do you currently have?',
#                y = i,
#                hue = 'Cluster',
#                data = final_final_big5_demo_df)
#    plt.tight_layout()
#    plt.show()

In [None]:
#big5_lst=['Neuroticism', 'Extraversion', 'Agreeableness', 'Free-spirit', 'Conscientiousness']
#for i in big5_lst:
#    fig, ax = plt.subplots(figsize = (12, 8))
#    sns.boxplot(x = 'What laptop would you buy in next assuming if all laptops cost the same?',
#                y = i,
#                hue = 'Cluster',
#                data = final_final_big5_demo_df)
#    plt.tight_layout()
#    plt.show()

<strong> Observations: </strong>
- Large percentages of Windows owners who are in clusters 2 (calm and tranquil extraverted individuals), 3 (careful, diligent individuals) and 5 (kind, sympathetic, cooperative individuals) intend to switch to MacBook computers. In these clusters, the number of respondents who would buy a Windows laptop next (23, 16 and 18, respectively) is lower than the number who currently own one (28, 20 and 24).
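
The switching behavior described in this observation can be quantified per cluster as the share of current Windows owners who would pick a Macbook next. The following is a minimal sketch of that calculation, not part of the original analysis. It runs on a made-up `toy_df`, so the printed shares are illustrative only.

```python
import pandas as pd

OWN_COL = "What laptop do you currently have?"
BUY_COL = "What laptop would you buy in next assuming if all laptops cost the same?"

# Toy stand-in for final_final_big5_demo_df (illustrative rows, NOT survey data)
toy_df = pd.DataFrame({
    "Cluster": ["Cluster 2"] * 5 + ["Cluster 5"] * 3,
    OWN_COL: ["Windows laptop", "Windows laptop", "Windows laptop",
              "Windows laptop", "Macbook",
              "Windows laptop", "Windows laptop", "Macbook"],
    BUY_COL: ["Macbook", "Windows laptop", "Windows laptop",
              "Windows laptop", "Macbook",
              "Macbook", "Windows laptop", "Windows laptop"],
})

# Keep only respondents who currently own a Windows laptop
windows_owners = toy_df[toy_df[OWN_COL] == "Windows laptop"]

# Share of those owners, per cluster, whose next laptop would be a Macbook
switch_share = (
    windows_owners[BUY_COL].eq("Macbook")
    .groupby(windows_owners["Cluster"])
    .mean()
)
print(switch_share)
```

Applied to `final_final_big5_demo_df`, the same grouping would give one switching share per cluster, which makes "large percentages" directly comparable across clusters.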

***
***

## Conclusion

- In an international crowd, not all personalities are represented by the Big 5
- Microsoft is currently losing to Apple in the laptop market for extroverts. Of our respondents, 20% are extroverts who would choose a MacBook as their next laptop.
- A group of people want to move from Mac to Windows. This group includes personalities that enjoy being around people and participating in social gatherings, and that are full of energy. They are an opportunity for Microsoft to gain new customers by tapping into this niche market with targeted advertising themed around social gatherings and connections.