# DS104 Data Wrangling and Visualization: Lesson Nine Companion Notebook

### Table of Contents <a class="anchor" id="DS104L9_toc"></a>

* [Table of Contents](#DS104L9_toc)
    * [Page 1 - Introduction](#DS104L9_page_1)
    * [Page 2 - Variable Types and Levels](#DS104L9_page_2)
    * [Page 3 - What is the Purpose of your Analyses?](#DS104L9_page_3)
    * [Page 4 - Describing Data](#DS104L9_page_4)
    * [Page 5 - Drawing Conclusions about your Data](#DS104L9_page_5)
    * [Page 6 - Drawing Conclusions When the Independent Variable is Continuous](#DS104L9_page_6)
    * [Page 7 - When your Dependent Variable is Continuous](#DS104L9_page_7)
    * [Page 8 - Making Associations between your Data](#DS104L9_page_8)
    * [Page 9 - Key Terms](#DS104L9_page_9)
    * [Page 10 - Lesson 9 Hands-On](#DS104L9_page_10)
    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS104L9_page_1"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Infographics
VimeoVideo('388135956', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO104L09overview.zip)**.

# Introduction

One of the most important things you can learn as a data scientist is how to choose the appropriate statistical analysis. This lesson walks you through how to do that. By the end of this lesson, you will be able to:

* Understand the purpose of your analysis
* Choose analyses when your purpose is to describe data
* Choose analyses when your purpose is to draw conclusions
* Choose analyses when your purpose is to make associations

This lesson will culminate in a Hands-On in which you will choose the appropriate statistic for a wide variety of realistic scenarios.

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/434221066"> recorded live workshop </a> that goes over the content in this lesson. </p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Variable Types and Levels<a class="anchor" id="DS104L9_page_2"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Variable Types and Levels

There are several major factors that go into choosing what type of analysis you will run, but the most important component of all is what data and variables you have. Reviewing variable types and levels as they pertain to choosing statistical analyses is critical! 

---

## Data Types

Two major variable types will be discussed throughout this lesson.  

---

### Continuous Variables 

The first is *continuous*.  A continuous variable is one that is numeric, and where a bigger number actually means that you have more of something.  For instance, height is continuous - the larger the number of feet, the taller you are.  Don't be fooled - sometimes data can be in numbers, but not actually be continuous.  This happens a lot after recoding.  What if you recode hair color from black, brown, and blonde into 0, 1, and 2? This is numeric, but does the number actually mean anything? Are blondes, given a 2, any better or larger than those with black hair, given a 0? No.  So when you are assessing your variables, take into account not only whether they are numeric, but also whether the numbers have meaning.
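To see why numeric codes are not automatically continuous, here is a minimal sketch in plain Python. The heights and hair colors are made-up illustration values, and both coding schemes are hypothetical.

```python
from statistics import mean

# Made-up sample: heights in feet (continuous) and hair colors (categorical).
heights_ft = [5.2, 5.9, 6.1, 5.5]
hair_colors = ["black", "brown", "blonde", "brown"]

# Recode hair color into numbers, as described above.
hair_codes = {"black": 0, "brown": 1, "blonde": 2}
recoded = [hair_codes[color] for color in hair_colors]

# The mean height is meaningful: a bigger number means a taller person.
print(mean(heights_ft))

# Python will happily average the codes, but the result describes nothing.
print(mean(recoded))

# Swapping the arbitrary codes changes the "average" even though the data
# did not change, which shows the numbers carry no quantity.
alt_codes = {"black": 2, "brown": 0, "blonde": 1}
alt_recoded = [alt_codes[color] for color in hair_colors]
print(mean(alt_recoded))
```

If relabeling the groups changes your statistic, the variable is categorical, no matter how numeric it looks.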

---

### Categorical Variables

The second variable type is *categorical*.  A categorical variable is a category - it's right there in its name.  You have distinct things that are grouped into like sections.  Often a categorical variable is a string variable, meaning that it is in letters.  Something like writing out the names of each hair color.  But sometimes, numbers can be substituted for the groups to make analysis easier.  You still have categories underlying it though - you've just substituted a number for the name.

A special kind of categorical variable is one that is *dichotomous* or *binary*.  All those terms mean is that you have only two groups, instead of more. So something like "pass / fail" categories or "dead / alive" categories etc. are great examples of dichotomous or binary categorical variables.

---

### Variable Levels

When you have categorical variables, those variables can have *levels*.  The number of levels is the number of groups that variable has.  For instance, in the hair color situation, where the options are black, brown, or blonde, hair color is one variable, but it has three levels, or three choices, of hair color.  It's all one variable - hair color - but has multiple responses that someone could have, which are the levels.  Certain analyses require variables to have only two levels, so it's important to identify the number of variable levels.
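Counting levels is easy to do in code. The sketch below uses made-up responses, and the helper names `levels` and `is_dichotomous` are hypothetical, chosen only for this illustration.

```python
# Made-up responses for two categorical variables.
hair_color = ["black", "brown", "blonde", "brown", "black"]
exam_result = ["pass", "fail", "pass", "pass", "fail"]


def levels(values):
    """Return the distinct groups (levels) of a categorical variable."""
    return sorted(set(values))


def is_dichotomous(values):
    """A dichotomous (binary) variable has exactly two levels."""
    return len(levels(values)) == 2


print(levels(hair_color), is_dichotomous(hair_color))
print(levels(exam_result), is_dichotomous(exam_result))
```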

---

## Independent vs. Dependent Variables

The last thing you'll need to take note of when choosing analyses is determining your *independent* and your *dependent* variables.  The independent variable, also known as the *predictor* variable, and abbreviated *IV*, is the variable or variables that influence your dependent variable.  They are causing some sort of effect.  

The dependent variable, also known as the *outcome* variable, and abbreviated *DV*, is what you are influencing.  It's what you're predicting.  It is the effect that is being caused by your independent variable.  The dependent variable **depends** on the independent variable.

In choosing an analysis, after you know the purpose of the analysis, you'll typically need to identify the data types of your IVs and DVs. Being able to distinguish between continuous and categorical variables, understand the levels of the variables, and identify your independent and dependent variables are essential skills for data scientists.

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - What is the Purpose of your Analyses?<a class="anchor" id="DS104L9_page_3"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# What is the Purpose of your Analyses?

The first thing you need to determine is the purpose of your analyses.  Are you interested in describing your data, drawing conclusions from your data, or making associations with your data?

---

## Describing Data

The purpose of describing data is simply to understand it.  You are finding out information about your data, without trying to draw any conclusions or make inferences from it.  You're not making any jumps in logic, not trying to extrapolate from your sample to a larger population...just trying to look at it and see what is there.  The majority of the descriptive statistics you have learned so far would be classified as part of describing data.

---

## Drawing Conclusions from Data

The official term for drawing conclusions from your data is *inferential statistics*.  You are making inferences from your small sample, and applying them to a larger population.  And you are most likely trying to predict something or tell if there are differences between groups when drawing conclusions from your data. All hypothesis testing falls into this category as well.

---

## Making Associations with Data

The last thing you could be doing with your data is examining it for associations.  You don't necessarily know what you are looking for - you just want to see how your data relates to other data. 

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Describing Data<a class="anchor" id="DS104L9_page_4"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Describing Data

If you are interested in describing your data, then you can follow the flow chart below to make decisions.  The decisions are based on the data type. If you have **categorical data**, then you will want to describe your data with **frequencies and/or percents**.  

If you have **continuous data**, then you have a second choice to make.  Are you interested in seeing where your data falls, or are you interested in seeing how spread out your data is? If you want to just **see where the data falls**, then you will be looking at **measures of central tendency - mean, median, and mode**.  If you want to see **how the data is spread**, then you'll be looking at **measures of dispersion**, like variance, standard deviation, and range.

![A box labeled purpose: describing data, what is your data type? Is connected to two boxes labeled categorical and continuous, what do you want to know about your data?. The box categorical is connected to a box labeled frequency and percent. The continuous box is connected to two boxes labeled where the data falls and it is connected to another box labeled mean, median and mode. The next box is labeled how data is spread and it is connected to another box labeled range and standard deviation.](Media/analyses1.png)

Together, everything in the chart above makes up your descriptive statistics.  They will most likely be your initial "go-to" statistics, and you will use them a lot.  They are basic, but eternally useful. 
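As a quick illustration of each branch of the chart, here is a minimal sketch using only Python's standard library. The data is made up, and the variance and standard deviation are the sample versions.

```python
from collections import Counter
from statistics import mean, median, mode, stdev, variance

# Made-up data for illustration only.
hair_color = ["black", "brown", "blonde", "brown", "brown"]  # categorical
heights_in = [62, 65, 65, 70, 73]                            # continuous

# Categorical data: frequencies and percents.
counts = Counter(hair_color)
percents = {level: 100 * n / len(hair_color) for level, n in counts.items()}

# Continuous data, where it falls: measures of central tendency.
center = {
    "mean": mean(heights_in),
    "median": median(heights_in),
    "mode": mode(heights_in),
}

# Continuous data, how it is spread: measures of dispersion.
spread = {
    "range": max(heights_in) - min(heights_in),
    "variance": variance(heights_in),
    "std_dev": stdev(heights_in),
}

print(counts, percents)
print(center)
print(spread)
```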

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Drawing Conclusions about your Data<a class="anchor" id="DS104L9_page_5"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Drawing Conclusions about your Data

If you are drawing conclusions about your data, then there are two major components that matter when picking an analysis - the data type of the independent variable, and the data type of the dependent variable.  You will start by picking analyses for drawing conclusions when you have a categorical independent variable, using the flow chart below.

---

## When the DV is Categorical

When your independent variable is categorical, the next question you need to ask is about the data type of the dependent variable.  Is it categorical, or continuous? If it is categorical, then you need to determine if you are trying to **compare a sample to a population**.  If you are, then you will do a **Goodness of Fit Chi-Square**.  

If you are not trying to compare a sample to a population, then you have yet another choice ahead of you: are you looking at **changes over time**, or do you have some sort of repeated measures element?  If the answer is yes, then the **levels of the dependent variable** matter. If you only have **two levels of the DV**, then you will be using a **McNemar Chi-Square**.

If you are looking at changes over time, but have **more than two levels of the DV**, then you will use a test called the **Bhapkar Chi-Square**. 

If you are not looking for changes over time at all, you are not comparing a sample to a population, and you have a categorical IV and a categorical DV, then you are all set to run an **Independent Chi-Square**!
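To make the Independent Chi-Square less abstract, the sketch below computes it by hand for a made-up 2x2 table of counts (commute type by lateness, a hypothetical example). It is a teaching sketch rather than a replacement for a statistics library, and it applies no continuity correction, so a library routine such as SciPy's `chi2_contingency`, which applies one by default to 2x2 tables, would report a somewhat different value.

```python
from math import erfc, sqrt

# Made-up 2x2 table of counts: rows are commute type, columns are late / on time.
observed = [
    [30, 20],  # car
    [10, 40],  # bike
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell if the two variables were unrelated.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1), which is 1 for a 2x2 table.
df = (len(observed) - 1) * (len(observed[0]) - 1)

# With exactly 1 degree of freedom the p-value has a closed form.
p_value = erfc(sqrt(chi_sq / 2))

print(expected, chi_sq, df, p_value)
```

The closed-form p-value only holds when `df` is 1; larger tables need the chi-square distribution from a statistics library.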

---

## When the DV is Continuous

When you have a categorical independent variable and a continuous dependent variable, the next thing you need to determine is **how many dependent variables** you have. If it is **only one**, then the next question is: **how many levels** does your independent variable have?

If it's only two levels, then you're going for a *t*-test of some sort.  But what sort? The discerning factor is if you are examining changes over time or looking at related measures.  If **yes**, then you will run a **Dependent *t*-test**, and if **no**, then you can hit the **Independent *t*-test** up. 
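The difference between the two *t*-tests shows up clearly when you compute both statistics on the same numbers. The sketch below uses made-up quiz scores. The independent version assumes equal variances and pools them, which is one common form of the test.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Made-up quiz scores for illustration only.
before = [70, 68, 75, 80, 72]
after = [74, 70, 79, 85, 73]  # the same five students, measured again

# Dependent (paired) t: work with each person's change score.
diffs = [a - b for a, b in zip(after, before)]
t_dependent = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Independent t: treat the two lists as unrelated groups and pool their variances.
n1, n2 = len(before), len(after)
pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
t_independent = (mean(after) - mean(before)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

print(round(t_dependent, 3), round(t_independent, 3))
```

Because every student improved by a similar amount, the paired statistic is much larger than the independent one. Ignoring the pairing when it exists throws that information away, which is why the related-measures question matters.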

But what if you have **more than two levels** of your single independent variable? Then you are firmly in the land of analyses of variance (ANOVAs). If you need to **control for other factors**, then you can have a covariate, turning your ANOVA into an **ANCOVA (analysis of covariance)**.  If you are **not controlling** for anything else, a straight **ANOVA** will do nicely.

And what if you have **more than one dependent variable**? If that is the case, then you again will ask yourself if you want to **control for other factors**.  If **yes**, then you are looking for a **multivariate analysis of covariance (MANCOVA)**.  If **no**, then you can skip the C for covariate, and instead just have a **MANOVA (multivariate analysis of variance)**.

![The purpose of drawing conclusions for categorical IVs is connected to two boxes labeled categorical and continuous. The continuous box is connected to two other boxes labeled one and many. The box labeled many is connected to two boxes labeled yes and no. Yes is again connected to MANCOVA and no is connected to MANOVA. The box labeled one is connected to two other boxes labeled two and more than two. Both the above-mentioned boxes are connected to yes and no.](Media/analyses2.png)
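The decisions on this page can also be written down as a small function, which is a handy way to check your reading of the flow chart. This is only a sketch: the function name and parameter names are hypothetical, and it encodes nothing beyond the rules described above for a categorical IV.

```python
def choose_analysis_categorical_iv(
    dv_type,
    *,
    compare_to_population=False,
    repeated_measures=False,
    dv_levels=2,
    n_dvs=1,
    iv_levels=2,
    control_for_other_factors=False,
):
    """Follow the categorical-IV flow chart and return the analysis name."""
    if dv_type == "categorical":
        if compare_to_population:
            return "Goodness of Fit Chi-Square"
        if repeated_measures:
            return "McNemar Chi-Square" if dv_levels == 2 else "Bhapkar Chi-Square"
        return "Independent Chi-Square"

    if dv_type == "continuous":
        if n_dvs > 1:
            return "MANCOVA" if control_for_other_factors else "MANOVA"
        if iv_levels == 2:
            return "Dependent t-test" if repeated_measures else "Independent t-test"
        return "ANCOVA" if control_for_other_factors else "ANOVA"

    raise ValueError("dv_type must be 'categorical' or 'continuous'")


print(choose_analysis_categorical_iv("categorical", repeated_measures=True, dv_levels=3))
print(choose_analysis_categorical_iv("continuous", iv_levels=4))
```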

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Drawing Conclusions When the Independent Variable is Continuous<a class="anchor" id="DS104L9_page_6"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Drawing Conclusions When the Independent Variable is Continuous

When the independent variable is continuous, the next thing you need to pay attention to is whether your dependent variable is categorical or continuous. This page will just go over the scenarios in which your independent variable is continuous and your dependent variable is categorical.

---

## When your Dependent Variable is Categorical

When you have a continuous IV and a categorical DV,  then you are focused on *logistic regression*. Logistic regression always has categorical outcomes! But the type of logistic regression you run depends on the number of levels of your dependent variable, and on whether you want to know how much influence each variable has. 

If you only have **two levels** of your dependent variable, then you're going with a form of binary logistic regression.  "Bi" means two, from the Latin, so you have two levels of your outcome variable! Then, if you want to know **how much each independent variable influences the dependent variable**, you can make use of **stepwise binary logistic regression**.  It's also called hierarchical logistic regression, so you may see both names.  It allows you to examine each predictor variable one at a time as it gets added to the model. If you **don't care about the individual variables** and their addition to the model, then you can just do regular **binary logistic regression**.

If you have more than two levels of your dependent variable, you are still in logistic regression land, but you've moved into *multinomial* logistic regression. "Multi" means many, also from the Latin, so you can have **more than two levels of that dependent variable**.  You have the same choices as you did for the binary logistic regression though - if you want to know **how much each individual independent variable predicts your dependent variable**, then choose the **stepwise multinomial logistic regression** option.  If **no**, then **multinomial logistic regression**, without the stepwise or hierarchical component, it is!

![A box labeled purpose of drawing conclusions for continuous IVs with categorical DVs is connected to two other boxes labeled two and more than two. The box labeled two is connected to two other boxes labeled yes and no. The yes box is connected to another box labeled stepwise binary logistic regression and the no box is connected to another box labeled binary logistic regression. The box labeled more than two is connected to two boxes labeled yes and no. The yes box is connected to a box labeled stepwise multinomial logistic regression and no box is connected to another box labeled multinomial logistic regression.](Media/analyses3.png)
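Why does logistic regression always end in a category? A binary logistic model turns a continuous IV into a probability between 0 and 1, and that probability is then cut into one of two groups. The sketch below shows only that final step. The coefficients are hypothetical and are not fitted to any data, and the hours-studied example is made up.

```python
from math import exp


def predicted_probability(x, intercept, slope):
    """Logistic function: maps any real number to a probability between 0 and 1."""
    return 1 / (1 + exp(-(intercept + slope * x)))


# Hypothetical coefficients for "hours studied -> pass / fail" (not fitted).
INTERCEPT, SLOPE = -4.0, 1.0

for hours in (1, 4, 8):
    p = predicted_probability(hours, INTERCEPT, SLOPE)
    label = "pass" if p >= 0.5 else "fail"
    print(hours, round(p, 3), label)
```

In a real analysis the intercept and slope would be estimated from your data by a statistics library. Only the shape of the calculation is shown here.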

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - When your Dependent Variable is Continuous<a class="anchor" id="DS104L9_page_7"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# When your Dependent Variable is Continuous

On this page, you will be learning about the scenarios in which your independent variable is continuous, and so is your dependent variable.  

---

## One Dependent Variable

The first thing to ask yourself is how many dependent variables you have.  If you only have one DV, then  you also need to ask yourself how many independent variables you have.  

If it is only **one**, then you have **simple linear regression**.  
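For a feel of what simple linear regression computes, here is a minimal least-squares sketch with one continuous IV and one continuous DV. The data points are made up, and `predict` is a hypothetical helper name.

```python
from statistics import mean

# Made-up data: one continuous IV (x) and one continuous DV (y).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope: how x and y vary together, divided by how x varies alone.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar


def predict(new_x):
    """Predict the DV for a new value of the IV using the fitted line."""
    return intercept + slope * new_x


print(slope, intercept, predict(6))
```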

If it is many, then the options abound! If you want to know how much influence each variable has, then you are looking at doing a form of stepwise regression, where you examine the addition or subtraction of a variable to a model one at a time, to see the effect that it has.  If you also think that other variables can influence your dependent variable, then you'll need to perform regression with mediation and moderation, which are ways to determine partial influences of variables in a model.  

To sum up, if you want to know **how much influence the variables have**, and you think there may be **hidden influential variables**, then you need to **check for moderation and mediation while running stepwise linear regression** (what a mouthful)! 

If you need to know **how much influence your variables have**, but **don't think you have hidden influential variables**, just hit the **stepwise linear regression**.

If you **don't need to know how much influence each variable has**, but do think there are other **influential variables** hanging out, try **checking for moderation and mediation while doing multiple regression**. 

And if you **don't need to know how much influence each variable has**, and you **don't think there are other influential variables**, then you can just stick with a **multiple regression model**. 
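The four combinations above boil down to two yes/no questions, so they fit neatly in a lookup table. This sketch simply restates the rules on this page; the names `REGRESSION_CHOICES` and `choose_regression` are hypothetical.

```python
# Keys are (want_individual_influence, suspect_other_variables).
REGRESSION_CHOICES = {
    (True, True): "Stepwise linear regression, checking for moderation and mediation",
    (True, False): "Stepwise linear regression",
    (False, True): "Multiple regression, checking for moderation and mediation",
    (False, False): "Multiple regression",
}


def choose_regression(n_ivs, want_individual_influence=False, suspect_other_variables=False):
    """Pick a regression for continuous IV(s) and one continuous DV."""
    if n_ivs == 1:
        return "Simple linear regression"
    return REGRESSION_CHOICES[(want_individual_influence, suspect_other_variables)]


print(choose_regression(1))
print(choose_regression(3, want_individual_influence=True))
```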

---

## Multiple Dependent Variables

What if you have multiple dependent variables?  Well, when you have **multiple continuous independent variables**, and **multiple continuous dependent variables**, there is only one analysis to run: **Canonical Correlation**. Think of it like a regular correlation amped up - you correlate every IV with every possible DV. It does not have a lot of practical uses, so you will not learn it here, but it is good to know what is in the realm of possibility, just in case.

![A box labeled purpose of drawing conclusions for continuous IVs with continuous DVs is connected to two other boxes labeled one and many. The box labeled one is then connected to one and many again. The box labeled one is connected to another box labeled simple linear regression. The box labeled many is again connected to two boxes labeled yes and no. The yes box has a question that reads do you think other variables can influence your DV? It again points to two other boxes labeled yes and no. The yes is connected to a box labeled check for mediation and moderation while running stepwise linear regression. The no box is connected to a box labeled stepwise multiple linear regression. The box no has a question that reads, do you think other variables can influence your DV? It is connected to yes and no. The yes box points to an end box labeled check for mediation and moderation while running multiple linear regression and the no box points to another end box labeled multiple linear regression.](Media/analyses4.png)

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Having trouble differentiating between analyses? </h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/453879671"> recorded live workshop (Part I) </a> and <a href="https://vimeo.com/460790119"> this recorded live workshop (Part II) </a> that go over different scenarios that use these statistics. </p>
    </div>
</div>

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Making Associations between your Data<a class="anchor" id="DS104L9_page_8"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Making Associations between your Data

Lastly, there is a branch of statistics purely for making associations between your data, or understanding how data fits together.  It has nothing to do with hypothesis testing or drawing conclusions.  When you are trying to navigate these statistics, first ask yourself if you are trying to validate a scale (survey).  If you are **trying to validate a scale**, then determine whether the scale has been validated before.  If it has been **previously validated**, then you will need to run **Confirmatory Factor Analysis**.  

If the scale has not been validated before, then you need to decide whether you want to find the minimum necessary number of factors.  If you do want the **minimum necessary factors**, try **Exploratory Factor Analysis using Principal Axis Factoring**.  If you **don't require the minimum factors**, then you can still run **Exploratory Factor Analysis**, but it needs to be **Principal Components Analysis** instead.

What if you're not trying to validate a scale? What then? Well, then you need to determine if you are trying to create a theory.  If you are **trying to create a theory**, then you will undertake **structural equation modeling**.  

If you are not trying to create a theory, then you need to determine how many variables you have. If you have **two variables** (independent or dependent, it doesn't matter), then you need to note the type of data you have.  If it is **categorical** (in practice, categories with a meaningful order that can be ranked), then you will run a **Spearman Rank Correlation**.  If you instead have **continuous** data, then you will need to run a **Pearson Correlation**.
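The two correlations are close cousins: a Spearman Rank Correlation is a Pearson Correlation computed on the ranks of the data rather than on the raw values, which is why it needs values that can be put in order. The sketch below implements both by hand on made-up numbers. The helper names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data where y always rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(round(pearson(x, y), 4), round(spearman(x, y), 4))
```

Here the ranks line up perfectly, so the Spearman value is 1, while the Pearson value is slightly lower because the relationship is curved.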

What happens when you have more than two variables? In that case, you will need to ask yourself if you are **trying to predict group membership**.  If **yes**, then make use of **Discriminant Function Analysis**.  If **no**, then try using **Cluster Analysis** instead. This is all documented in the flow chart below.

![A box labeled purpose: make associations, are you trying to validate a scale? Is connected to two boxes labeled yes and no. The yes box is also labeled a question that reads, has the scale been validated before? And the no box is also labeled a question that reads, are you trying to create a theory? The Yes box is again connected to a box labeled structural equation modeling and the no box is connected to again a no box labeled with a question that reads, how many variables do you have? It is connected to two boxes labeled two, what data types do you have? and more than two, are you trying to predict group membership?. The box labeled two is connected to categorical and continuous which are again connected to two boxes labeled spearman rank correlation and Pearson correlation. The box labeled more than two is connected to two boxes labeled yes and no and that are connected to two boxes labeled discriminant function analysis and cluster analysis. The first yes box connected to another two boxes labeled yes and no, do you want the minimum number of factors? and it is connected to yes and no. The yes is again connected to a box labeled exploratory factor analysis: principal axis factoring and the no box is again connected to a box labeled exploratory factor analysis: principal components analysis.](Media/analyses5.png)

---


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Key Terms<a class="anchor" id="DS104L9_page_9"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Inferential Statistics</td>
        <td>Branch of statistics used to draw conclusions from your data, making inferences from a sample and applying them to a larger population.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Frequency and Percent</td>
        <td>A statistic for when you want to describe your data and your data is categorical.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Measures of Central Tendency</td>
        <td>Mean, median, and mode, for when you want to describe your data, your data is continuous, and you want to know where the data falls.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Measures of Dispersion</td>
        <td>Range and standard deviation, for when you want to describe your data, the data is continuous, and you want to know how the data is spread.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Goodness of Fit Chi-Square</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV and a categorical DV, and are comparing a sample to a population.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Bhapkar Chi-Square</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV and a categorical DV, are looking at changes over time, and have more than two levels of your dependent variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>McNemar Chi-Square</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV and a categorical DV, are looking at changes over time, and have only two levels of your dependent variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Independent Chi-Square</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV and a categorical DV, and are not looking at changes over time.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Dependent t-test</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV with two levels and a continuous DV, and are looking at changes over time.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Independent t-test</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV with two levels and a continuous DV, and are not looking at changes over time.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Analysis of Covariance (ANCOVA)</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV with two or more levels and a continuous DV, and want to control for other factors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Analysis of Variance (ANOVA)</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV with two or more levels and a continuous DV, and do not want to control for other factors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multivariate Analysis of Covariance (MANCOVA)</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV with two levels or more and multiple continuous DVs, and want to control for other factors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multivariate Analysis of Variance (MANOVA)</td>
        <td>For when you are drawing conclusions about your data, have a categorical IV with two levels or more and multiple continuous DVs, and do not want to control for other factors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Stepwise Binary Logistic Regression</td>
        <td>For when you are drawing conclusions about your data, have a continuous IV with a categorical DV with two levels, and want to see how much influence each individual variable has.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Binary Logistic Regression</td>
        <td>For when you are drawing conclusions about your data, have a continuous IV with a categorical DV with two levels, and do not want to see how much influence each individual variable has.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Stepwise Multinomial Logistic Regression</td>
        <td>For when you are drawing conclusions about your data, have a continuous IV with a categorical DV with more than two levels, and want to see how much influence each individual variable has.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multinomial Logistic Regression</td>
        <td>For when you are drawing conclusions about your data, have a continuous IV with a categorical DV with more than two levels, and do not want to see how much influence each individual variable has.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Simple Linear Regression</td>
        <td>For when you are drawing conclusions about your data, and have one continuous IV with one continuous DV.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Mediation and Moderation while running Stepwise Linear Regression</td>
        <td>For when you are drawing conclusions about your data, have multiple continuous IVs with one continuous DV, want to know how much influence each individual variable has, and think other variables might influence your DV.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Stepwise Linear Regression</td>
        <td>For when you are drawing conclusions about your data, have multiple continuous IVs with one continuous DV, want to know how much influence each individual variable has, and do not think other variables might influence your DV.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Mediation and Moderation while running Multiple Linear Regression</td>
        <td>For when you are drawing conclusions about your data, have multiple continuous IVs with one continuous DV, do not want to know how much influence each individual variable has, and think other variables might influence your DV.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multiple Linear Regression</td>
        <td>For when you are drawing conclusions about your data, have multiple continuous IVs with one continuous DV, do not want to know how much influence each individual variable has, and do not think other variables might influence your DV.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Confirmatory Factor Analysis</td>
        <td>For when you are making associations about your data, are validating a scale, and it has been validated before.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Exploratory Factor Analysis - Principal Axis Factoring</td>
        <td>For when you are making associations about your data, are validating a scale, you are validating that scale for the first time, and you want the minimum number of factors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Exploratory Factor Analysis - Principal Components Analysis</td>
        <td>For when you are making associations about your data, are validating a scale, you are validating that scale for the first time, and you don't care if you have the minimum number of factors.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Structural Equation Modeling</td>
        <td>For when you are making associations about your data, and are trying to create a theory.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Spearman Rank Correlation</td>
        <td>For when you are making associations about your data and have two categorical variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Pearson Correlation</td>
        <td>For when you are making associations about your data and have two continuous variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Discriminant Function Analysis</td>
        <td>For when you are making associations about your data, have more than two variables, and are trying to predict group membership.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Cluster Analysis</td>
        <td>For when you are making associations about your data, have more than two variables, and are not trying to predict group membership.</td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Lesson 9 Hands-On<a class="anchor" id="DS104L9_page_10"></a>

[Back to Top](#DS104L9_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">



For your Lesson 9 Hands-On, you will choose the most appropriate analysis for the scenarios below, in which a store determines the best way to utilize a new club card system.  When you are done, please submit one document with all of your findings for grading.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

#### For all scenarios, please identify the following:

**1. The independent variable(s) and its data type** 
- The independent variable, also known as the predictor variable, is the variable or variables that influence your dependent variable. They are causing some sort of effect.

**2. The levels of the independent variable, if appropriate**

**3. The dependent variable(s) and its data type**
- The dependent variable, also known as the outcome variable, is what you are influencing. It's what you're predicting. It is the effect that is being caused by your independent variable. The dependent variable depends on the independent variable.

**4. The levels of the dependent variable, if appropriate**

**5. The most appropriate analysis**
- The most appropriate analysis can be found by following the flow charts on pages 5, 6, and 7 once you have identified the IV(s), DV(s), and their data types and levels.

---

## Scenario 1

A store is investigating the influence of gender upon whether customers sign up for a discount club card. Options for gender are male and female, and options for signing up for the club card are signed up and not signed up.  

---

## Scenario 2

This same store has just expanded their club card system.  They now have three different tiers - silver, gold, and platinum.  They would like to know whether the type of club card the customer has dictates how much money the customer spends. 

---

## Scenario 3

Now, the store manager would like to know: Do people spend more money before or after they get a club card? 

---

## Scenario 4

Lastly, the store manager would like to know if the age of a customer predicts whether that customer will sign up for a club card or not.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>