# DS105 Intermediate Statistics: Lesson One Companion Notebook

### Table of Contents <a class="anchor" id="DS105L1_toc"></a>

* [Table of Contents](#DS105L1_toc)
    * [Page 1 - Overview of this Module](#DS105L1_page_1)
    * [Page 2 - Single Sample t-Test](#DS105L1_page_2)
    * [Page 3 - Hands-On Single Sample t-Test](#DS105L1_page_3)
    * [Page 4 - Hands-On Single Sample t-Test Solution](#DS105L1_page_4)
    * [Page 5 - Independent t-Test](#DS105L1_page_5)
    * [Page 6 - Hands-On Independent t-Test](#DS105L1_page_6)
    * [Page 7 - Hands-On Independent t-Test Solution](#DS105L1_page_7)
    * [Page 8 - Dependent t-Test](#DS105L1_page_8)
    * [Page 9 - Hands-On Dependent t-Test](#DS105L1_page_9)
    * [Page 10 - Hands-On Dependent t-Test Solution](#DS105L1_page_10)
    * [Page 11 - Independent Chi-Square](#DS105L1_page_11)
    * [Page 12 - Hands-On Independent Chi-Square](#DS105L1_page_12)
    * [Page 13 - Hands-On Independent Chi-Square Solution](#DS105L1_page_13)
    * [Page 14 - Correlations](#DS105L1_page_14)
    * [Page 15 - Hands-On Correlations](#DS105L1_page_15)
    * [Page 16 - Hands-On Correlations Solution](#DS105L1_page_16)
    * [Page 17 - Key Terms](#DS105L1_page_17)
    * [Page 18 - Hands-On Lesson 1 Review](#DS105L1_page_18)
    * [Page 19 - Hands-On Lesson 1 Review Solution](#DS105L1_page_19)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Overview of this Module<a class="anchor" id="DS105L1_page_1"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Basic Statistics in Python
VimeoVideo('388627207', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L01overview.zip)**.

In [2]:
from IPython.display import VimeoVideo
# OPTIONAL: Recorded Live Workshop
VimeoVideo('444962753', width=720, height=480)

# Introduction

This lesson covers how to perform, in Python, the basic statistics you already know how to do in MS Excel and R.  

By the end of this lesson, you should be able to complete the following tasks in Python: 

* Single sample, independent, and dependent *t* tests
* Chi-Squares
* Correlation


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Single Sample t-Test <a class="anchor" id="DS105L1_page_2"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [3]:
from IPython.display import VimeoVideo
# Single Sample t-Test
VimeoVideo('334045547', width=720, height=480)

The approximate transcript and code for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L01pg2tutorial.zip)**.

# Single Sample t-Test

Remember that a single-sample *t*-test is meant to examine whether a particular value is different than the population mean.  You've already performed single sample *t*-tests by hand and using R.  Now it's time to learn how to complete them in Python!  

---
## Import Packages

The very first thing you will need to do is import the packages you will be using.

```python
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy import stats
```

You will be working a lot with ```scipy``` and its ```stats``` submodule in this module, as they have a lot of useful statistics functions.

---

## Import Data

You will be testing whether a cost of $25,000 for a hybrid vehicle in 2013 is different than the mean cost, using **[this data](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/hybrid2013.zip)**.
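
The lesson's code assumes the data is already loaded into a data frame named ```hybrid2013```. Here is a minimal sketch of that loading step. The file name in the comment and the sample rows are assumptions made up for illustration; only the column names ```msrp```, ```mpg```, and ```carclass``` come from this lesson.

```python
import io
import pandas as pd

# Hypothetical stand-in for the extracted CSV file. With the real data you
# would pass a file path instead, e.g. pd.read_csv("hybrid2013.csv")
# (that file name is an assumption, not confirmed by the lesson).
sample_csv = io.StringIO(
    "vehicle,msrp,mpg,carclass\n"
    "Example A,24000,45,C\n"
    "Example B,31000,38,M\n"
    "Example C,52000,27,L\n"
)

# Read the CSV into a data frame named as in the rest of this lesson
hybrid2013 = pd.read_csv(sample_csv)

# Peek at the first rows and confirm the column types
print(hybrid2013.head())
print(hybrid2013.dtypes)
```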

---

## Test Assumptions

The only assumption for the single sample *t* test is that the data is normally distributed. You can test this just by creating a histogram: 

```python
hybrid2013['msrp'].hist()
```

Here's the output:

![The x-axis of a bar chart ranges from 20000 to 100000 in 8 units and the y-axis ranges from 0 to 10 in six units. The plot shows the highest peak in the categories of 20000 and 30000.](Media/python1.png)

Looks like things aren't quite normally distributed, but you'll let it slide for now for learning purposes. 

---

## Run the Analysis

Only one line of code is needed to run a single sample *t*-test in Python.  The function ```stats.ttest_1samp()``` takes two arguments: the data column that contains your observed values, and the value you are testing against, which here is $25,000.

```python
stats.ttest_1samp(hybrid2013['msrp'], 25000)
```

Here is the result:

```text
Ttest_1sampResult(statistic=6.003733172775179, pvalue=3.9231807518835515e-07)
```

The statistic is your *t* value, and the *p* value is the one associated with that *t*-test.  Remember that the *p* value is written in scientific notation: ```3.92e-07``` means roughly .0000004, so this is significant at *p* < .05.  That means that a price of $25,000 for a hybrid car in 2013 is different than the population mean.  Is it higher or lower? To answer that question, you will need to examine the population mean: 

```python
hybrid2013.msrp.mean()
```

It turns out that the average cost of a hybrid vehicle in 2013 was $42,943.49.  So buying one for $25,000 would have been a great steal! 
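
If you would like to see where the *t* value comes from, here is a minimal sketch that computes it by hand and compares it with ```stats.ttest_1samp()```. The prices are made up for illustration and are not the lesson's data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical prices, invented for illustration only
prices = pd.Series([24000, 31000, 28500, 45000, 39000, 52000, 27000, 33000])
test_value = 25000

# t = (mean of the data - test value) / (standard deviation / sqrt(n))
# pandas .std() uses the sample standard deviation (n - 1) by default
n = len(prices)
t_by_hand = (prices.mean() - test_value) / (prices.std() / np.sqrt(n))

# SciPy reports the same statistic, plus a two-sided p value
result = stats.ttest_1samp(prices, test_value)

print(t_by_hand, result.statistic, result.pvalue)
```

The sign of the statistic already hints at the direction: a positive *t* value means the mean of the data is above the value you tested against, which matches the positive statistic in the lesson's output.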

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Hands-On Single Sample t-Test<a class="anchor" id="DS105L1_page_3"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For this Activity, you will compute a single sample *t*-test in Python. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[hybrid2013 dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/hybrid2013.zip)** you worked with in the lesson, determine whether a miles per gallon (mpg) rating of 40 is unusual for a hybrid car on the market in 2013. To do this, you will need to test for the assumption of normality by creating a histogram, and then run a single sample *t*-test. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Hands-On Single Sample t-Test Solution<a class="anchor" id="DS105L1_page_4"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


## Test for Normality

First, test for the assumption of normality by creating a histogram.  

```python
hybrid2013['mpg'].hist()
```

It should look like this: 

![The x-axis of a bar chart ranges from 20 to 50 in 6 units. The y-axis ranges from 0 to 8 in 4 units. The highest peak in the bar chart crosses 8 units on the y-axis.](Media/python2.png)

---

## Single Sample *t*-Test

```python
stats.ttest_1samp(hybrid2013['mpg'], 40)
```

Here is the output:

```text
Ttest_1sampResult(statistic=-4.427320491687408, pvalue=6.67005084670698e-05)
```

---

## Check the Population Mean

```python
hybrid2013.mpg.mean()
```

And the output:

```text
33.48837209302326
```

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Independent t-Test<a class="anchor" id="DS105L1_page_5"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [4]:
from IPython.display import VimeoVideo
# Independent t-Test
VimeoVideo('334044896', width=720, height=480)


The approximate transcript and code for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L01pg5tutorial.zip)**.

# Independent t-Test

An independent *t*-test is used when you have one categorical independent variable, which serves as the grouping variable, and one continuous dependent variable.  Use an independent *t*-test when you want to determine whether the means of two different, unrelated groups are the same or different. 

---

## Import Packages

```python
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import ttest_ind
```

---

## Import Data

You will continue to use the hybrid cars dataset from before.  However, this time, you are testing to see whether compact and mid-size hybrid cars differ in their average miles per gallon.

---

## Test Assumptions 

The only assumption you will test for the independent *t*-test is normality (by default, ```ttest_ind()``` also assumes the two groups have equal variances).  You will need to test normality for each of your groups - compact and mid-size hybrid cars.  

This code is very similar to before, but has an extra layer of specifying which values from the ```carclass``` you want to examine: 

```python
hybrid2013.mpg[hybrid2013.carclass == 'C'].hist()
```

So the name of the dataset and the name of the variable go first, but then you need to specify that you only want a histogram for the values that meet the condition ```C``` for compact cars. 

Here is the result: 

![The x-axis of a bar chart ranges from 30 to 50 in 5 units. The y-axis ranges from 0.00 to 2.00 in 8 units. The highest peak in the bar chart reaches 2.00 on the y-axis.](Media/python3.png)

Same thing for the mid-size hybrid cars:

```python
hybrid2013.mpg[hybrid2013.carclass == 'M'].hist()
```

And the result: 

![The x-axis of a bar chart ranges from 25 to 50 in 5 units. The y-axis ranges from 0.0 to 4.0 in 8 units. The plot shows the highest peak in categories 30 and 40.](Media/python4.png)

It looks like neither of these is bell-shaped, so neither appears normal, but for the purposes of learning, you will continue.
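
The filtering used for the histograms above happens in two steps, which this minimal sketch spells out. The rows are made up for illustration; only the column names ```carclass``` and ```mpg``` come from the lesson.

```python
import pandas as pd

# Hypothetical rows using the lesson's column names
cars = pd.DataFrame({
    "carclass": ["C", "M", "C", "L", "M", "C"],
    "mpg": [42, 35, 46, 28, 38, 40],
})

# Step 1: the comparison yields one True/False value per row
is_compact = cars.carclass == "C"

# Step 2: indexing the mpg column with that mask keeps only the True rows
compact_mpg = cars.mpg[is_compact]

print(is_compact.tolist())   # [True, False, True, False, False, True]
print(compact_mpg.tolist())  # [42, 46, 40]
```

Writing ```cars.mpg[cars.carclass == "C"]``` on one line, as the lesson does, simply combines both steps.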

---

## Run the Analysis

You will use the function ```ttest_ind()``` to run an independent *t* test in Python. The arguments are two things you want to compare to each other.  If you happen to have those two things in separate columns, then it would simply look like this mock code: 

```python
ttest_ind(data['column1'], data['column2'])
```

But in this case, the mpg values for both groups are stored in the same ```mpg``` column, with the group labels in the ```carclass``` column. Since you want to pull out the data for certain groups, the code looks just a bit more complicated:

```python
ttest_ind(hybrid2013.mpg[hybrid2013.carclass == 'C'], hybrid2013.mpg[hybrid2013.carclass == 'M'])
```

And here are the results: 

```text
Ttest_indResult(statistic=1.0751886097093057, pvalue=0.29216712457079796)
```

Looks like there is no significant difference between compact and mid-size hybrid cars in terms of miles per gallon, since the *p* value is not less than .05.  The *t* value is also pretty small, which is another good indication.
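
Here is a minimal sketch of reading the result programmatically instead of by eye. The mpg values are made up for illustration. The ```equal_var=False``` option, which runs Welch's *t*-test, is a real ```ttest_ind()``` argument, but it is not part of this lesson; it is shown only as an option if you doubt the two groups have similar variances.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical mpg values, invented for illustration only
cars = pd.DataFrame({
    "carclass": ["C"] * 5 + ["M"] * 5,
    "mpg": [42, 46, 40, 44, 39, 35, 38, 41, 33, 37],
})
compact = cars.mpg[cars.carclass == "C"]
midsize = cars.mpg[cars.carclass == "M"]

# Default: Student's t-test, which assumes equal group variances
student = ttest_ind(compact, midsize)

# equal_var=False switches to Welch's t-test, which drops that assumption
welch = ttest_ind(compact, midsize, equal_var=False)

alpha = 0.05
for name, res in [("Student", student), ("Welch", welch)]:
    verdict = "significant" if res.pvalue < alpha else "not significant"
    print(f"{name}: t = {res.statistic:.3f}, p = {res.pvalue:.4f} ({verdict})")
```

The result object exposes ```.statistic``` and ```.pvalue```, so you can compare the *p* value to .05 in code rather than reading it off the printed output.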

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Hands-On Independent t-Test<a class="anchor" id="DS105L1_page_6"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For your Activity, you will be computing an independent *t*-test to see if the miles per gallon differ between compact (```C```) and large (```L```) cars. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[hybrid2013 dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/hybrid2013.zip)** you worked with in the lesson, determine if the mean miles per gallon for a compact and a large car differ from each other. To do this, you will need to test for the assumption of normality for both groups by creating a histogram, and then run an independent *t*-test. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Hands-On Independent t-Test Solution<a class="anchor" id="DS105L1_page_7"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Solution

---

## Test for Normality

First, test for the assumption of normality by creating a histogram for both groups:  

```python
hybrid2013.mpg[hybrid2013.carclass == 'C'].hist()
```

It should look like this: 

![The x-axis of a bar chart ranges from 30 to 50 in 5 units. The y-axis ranges from 0.00 to 2.0 in 8 units. The highest peak in the bar chart reaches 2.00 on the y-axis.](Media/python3.png)

```python
hybrid2013.mpg[hybrid2013.carclass == 'L'].hist()
```

It should look like this: 

![The x-axis of a bar chart ranges from 20 to 40 in 5 units. The y-axis ranges from 0.00 to 2.0 in 8 units. The highest peak in the bar chart reaches 2.00 on the y-axis.](Media/python5.png)

---

## Independent *t*-Test

```python
ttest_ind(hybrid2013.mpg[hybrid2013.carclass == 'C'], hybrid2013.mpg[hybrid2013.carclass == 'L'])
```

Here is the output:

```text
Ttest_indResult(statistic=2.598820461640718, pvalue=0.026545168887970098)
```

There is a significant difference between the mpg of compact cars versus large cars, since the *p* value is < .05.

---

## Examine the Means for Each Group to Determine Where the Significant Differences Lie

To get the means, you can call the ```.mean()``` function: 

```python
hybrid2013.mpg[hybrid2013.carclass == 'L'].mean()
```

The mean for large cars is 28.5 mpg. 

```python
hybrid2013.mpg[hybrid2013.carclass == 'C'].mean()
```

The mean for compact cars is 40.75 mpg.

So, you will be getting significantly more miles to the gallon if you are driving a compact hybrid versus a large-sized hybrid in 2013. 
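
If you would rather get every group mean in one call, ```groupby()``` is an alternative to filtering each group separately. This is a minimal sketch with made-up values, not the lesson's data.

```python
import pandas as pd

# Hypothetical rows using the lesson's column names
cars = pd.DataFrame({
    "carclass": ["C", "C", "L", "L", "M", "M"],
    "mpg": [42, 40, 28, 30, 35, 39],
})

# One mean per car class, instead of one filtered call per group
group_means = cars.groupby("carclass")["mpg"].mean()
print(group_means)
```

The result is indexed by car class, so ```group_means["C"]``` gives the compact mean directly.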

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Dependent t-Test<a class="anchor" id="DS105L1_page_8"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [5]:
from IPython.display import VimeoVideo
# Dependent t-Test
VimeoVideo('334044604', width=720, height=480)

The approximate transcript and code for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L01pg8tutorial.zip)**.

# Dependent t-Test

Now it's on to the last *t*-test! Dependent *t*-tests are used when your samples are related in some way, but you still want to see if the means change.  It may be change over time, or change with treatment, etc.  A dependent *t* requires an independent variable that is categorical (groups to compare) and a dependent variable that is continuous. 

---
## Import Packages

The only packages you will need for this are ```pandas```, to import your data, and ```stats```, from ```scipy```. 

```python
import pandas as pd
from scipy import stats
```

---

## Import Data

The hybrid car data has been restructured for dependent *t*-tests and an additional year of data, 2012, has been added.  **[It is available here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/hybrid2012-13.zip)**, and looks like this: 

![A table has 11 columns with 11 row entries. The columns are labeled vehicle, msrp2012, accelrate2012, mpg2012, mpgmpge2012, carclass2012, carclass_id2012, msrp2013, accelrate2013, mpg2013, mpgmpge2013.](Media/python6.png)

Notice that it has the same variables repeated twice, once for 2012, and once for 2013. You'll also notice that the number of rows is greatly reduced from the dataset with only hybrid cars from 2013.  This is because only those cars that had an entry for both 2012 and 2013 were included.

You will be testing to see if the price of hybrid cars changes from 2012 to 2013.  

---

## Test Assumptions

Betcha can't guess what the assumptions are for dependent *t*-test! What? You guessed normality, and that you need a histogram! This game is less fun now, but good for you! 

As with independent *t*-tests, you'll need a histogram for each variable. 

```python
hybrid201213['msrp2012'].hist()
```

This yields this graphic.  Not really normally distributed, but ignore for now for the purposes of learning. 

![The x-axis of a bar chart ranges from 20000 to 90000 in 8 units. The y-axis ranges from 0.0 to 3.0 in six units. The plot shows the highest peak in the categories of 30000 and 60000.](Media/python7.png)

```python
hybrid201213['msrp2013'].hist()
```

This yields this graphic.  Not really normally distributed, but ignore for now for the purposes of learning. 

![The x-axis of a bar chart ranges from 20000 to 70000 in 6 units. The y-axis ranges from 0.0 to 3.0 in six units. The highest peak in the bar chart reaches 3.0 on the y-axis.](Media/python8.png)

---

## Run the Analysis

You can use the function ```stats.ttest_rel()``` to compute a dependent *t*-test in Python.  Think of the ```_rel``` as standing for related, since the samples are paired.  The only arguments are the two columns of data you want to use.

```python
stats.ttest_rel(hybrid201213['msrp2012'], hybrid201213['msrp2013'])
```

This shows that there is no significant change in hybrid car price from 2012 to 2013, since the *p* value is not less than .05. 
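
One way to see what "paired" means: a dependent *t*-test gives the same answer as a single sample *t*-test run on the per-vehicle differences against zero. Here is a minimal sketch with made-up prices; only the column names come from the lesson.

```python
import pandas as pd
from scipy import stats

# Hypothetical paired prices (same vehicles in both years), for illustration
paired = pd.DataFrame({
    "msrp2012": [24000, 31000, 28500, 45000, 39000, 52000],
    "msrp2013": [24500, 30500, 29500, 44000, 40500, 52500],
})

# Paired test on the two columns
rel = stats.ttest_rel(paired["msrp2012"], paired["msrp2013"])

# Equivalent: one-sample test of the per-vehicle differences against zero
diffs = paired["msrp2012"] - paired["msrp2013"]
one = stats.ttest_1samp(diffs, 0)

print(rel.statistic, one.statistic)
print(rel.pvalue, one.pvalue)
```

This is also why the two columns must line up row by row: each difference only makes sense when both values belong to the same vehicle.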

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Hands-On Dependent t-Test<a class="anchor" id="DS105L1_page_9"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For your Activity, you will be computing a dependent *t*-test to see if the miles per gallon changes between 2012 and 2013. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[hybrid2012-13 dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/hybrid2012-13.zip)** you worked with in the lesson, determine if the mean miles per gallon changes from 2012 to 2013. To do this, you will need to test for the assumption of normality for both groups by creating a histogram, and then run a dependent *t*-test. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - Hands-On Dependent t-Test Solution<a class="anchor" id="DS105L1_page_10"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Solution

---

## Test for Normality

First, test for the assumption of normality by creating a histogram for both groups:  

```python
hybrid201213['mpg2012'].hist()
```

It should look like this: 

![The x-axis of a bar chart ranges from 20 to 50 in 6 units. The y-axis ranges from 0.0 to 3.0 in six units. The highest peak in the bar chart reaches 3.0 on the y-axis.](Media/python9.png)

```python
hybrid201213['mpg2013'].hist()
```

It should look like this: 

![The x-axis of a bar chart ranges from 20 to 50 in 6 units. The y-axis ranges from 0 to 5 in 5 units. The highest peak in the bar chart reaches 5 on the y-axis.](Media/python10.png)

Neither of these is normal, but you'll proceed for now.

---

## Dependent *t*-Test

```python
stats.ttest_rel(hybrid201213['mpg2012'], hybrid201213['mpg2013'])
```

Here is the output:

```text
Ttest_relResult(statistic=0.14466598084438312, pvalue=0.8873759030512348)
```

There is not a significant difference in the miles per gallon for hybrid cars from 2012 to 2013, because the *p* value is > .05.  Notice that the *t* value is tiny as well.  Guess that means you'd be good to buy a used model with no drop in gas mileage!


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - Independent Chi-Square<a class="anchor" id="DS105L1_page_11"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [6]:
from IPython.display import VimeoVideo
# Independent Chi-Square
VimeoVideo('334042957', width=720, height=480)

The approximate transcript and code for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L01pg11tutorial.zip)**.

# Independent Chi-Square

An independent Chi-Square is used when you want to determine whether two categorical variables influence each other.  

---
## Import Packages

All you should need for an independent Chi-Square in Python is ```stats```, and of course ```pandas``` to load in your data:

```python
import pandas as pd
from scipy import stats
```

---
## Import Data

The data located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/lead_lipstick.zip)** is about the lead content in lipstick.  However, it contains some great categorical fields that you'll be using.  The first is product type, ```prodType```, and it has two levels: ```LP``` is lipstick, and ```LG``` is lip gloss.  The second is price category, ```priceCatgry```, and it has three levels: 

* 1: < 5 euros
* 2: 5-15 euros
* 3: > 15 euros

You will test to see if the price of the product depends on whether it is a lipstick or a lip gloss. 

The data look like this initially: 

![A window displays 2 lines of source code. The source code reads, In open square bracket 58 close square bracket 1 lead_lipstick.head open and close bracket. Out open square bracket 58 close square bracket. A table has 8 columns and 5 row entries.](Media/python11.png)

---

## Test Assumptions and Run the Analysis

There is only one assumption for Chi-Square: the expected frequency in each cell of the contingency table must be at least 5.  In Python, the easiest way to generate an expected frequencies table is actually to run the analysis.  So, you will conduct your independent Chi-Square first, and then make sure it meets this assumption!

---

### Create a Contingency Table

The first thing that needs to be done, before you can run the independent Chi-Square analysis, is to create a contingency table, sometimes called a *crosstab*, which shows how each level of each variable crosses with the other variable levels.  ```pandas``` saves the day with an easy to use function called ```crosstab()```: 

```python
lipstick_crosstab = pd.crosstab(lipstick['prodType'], lipstick['priceCatgry'])
```

The arguments for this function are the columns in your data frame you want to use to create the crosstab.

And this is the result:

![A window displays 2 lines of source code. The source code reads, In open square bracket 65 close square bracket 1 lipstick_crosstab. Out open square bracket 58 close square bracket. A table has 4 columns and 2 rows. The column heading represents priceCatgry 1, 2, 3. Row heading represents prodType LG, LP. The row entries are as follows. Row 1, 19, 43, 12. Row 2, 34, 92, 23.](Media/python12.png)

The three price categories are on the top, and the two different product types are along the side.  The cells show how many products fit in both categories.  For instance, there are 19 lip glosses less than 5 euros.

---

### Running the Independent Chi-Square

Once you have the contingency table, then you can run the function ```stats.chi2_contingency``` on the contingency table you have created: 

```python
stats.chi2_contingency(lipstick_crosstab)
```

And here is the output you'll receive: 

```text
(0.2969891724608704,
 0.8620046738525345,
 2,
 array([[17.58744395, 44.79820628, 11.61434978],
        [35.41255605, 90.20179372, 23.38565022]]))
```

At first, it looks like a jumble of things.  But the first value is your Chi-Square statistic.  The second value is the *p* value associated with that Chi-Square statistic, and the third value is the degrees of freedom.  Since the *p* value is well above .05, there is not a significant relationship between product type and product price. Neither lipstick nor lip gloss is pricier or cheaper than the other.

---

### Test the Assumption of 5 Cases per Expected Cell

The last piece of the output, labeled ```array```, is your expected count contingency table, albeit not a very pretty one! The expected count is what you would expect to happen if there was no relationship between the two variables.  Since all of these values are over 5, this means that the assumption has been met, and you are free to present and discuss these results without any limitations!
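
To make the output easier to read, you can unpack it into named variables and put the expected counts back into a labeled table. This minimal sketch rebuilds the observed counts from the crosstab shown above rather than loading the file; the variable names are just illustrative choices.

```python
import pandas as pd
from scipy import stats

# Observed counts copied from the crosstab shown above
observed = pd.DataFrame(
    [[19, 43, 12], [34, 92, 23]],
    index=pd.Index(["LG", "LP"], name="prodType"),
    columns=pd.Index([1, 2, 3], name="priceCatgry"),
)

# Unpack the four pieces of output into named variables
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)

# Put the expected counts back into a labeled table for readability
expected_table = pd.DataFrame(
    expected, index=observed.index, columns=observed.columns
)
print(expected_table.round(2))

# Check the assumption: every expected count should be at least 5
print("Assumption met:", (expected >= 5).all())
```

The last line turns the "look at every cell" check into a single True or False answer, which is handy for larger tables.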

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Hands-On Independent Chi-Square<a class="anchor" id="DS105L1_page_12"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

For your Activity, you will be computing an independent Chi-Square to see if the shade of lipstick and the price category are related. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[lipstick dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/lead_lipstick.zip)** you worked with in the lesson, determine if the shade of lipstick and the price category are related. To do this, you will need to: 

* Create a contingency table
* Test for the assumption of 5 per cell in the expected contingency table
* Compute an independent Chi-Square 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Hands-On Independent Chi-Square Solution<a class="anchor" id="DS105L1_page_13"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Solution

---

## Create a Contingency Table

First, create a contingency table: 

```python
lipstick2_crosstab = pd.crosstab(lipstick['shade'], lipstick['priceCatgry'])
```

It should look like this:

![A window displays a line of source code that reads, 1 lipstick2_crosstab. A table has 4 columns and 4 entries. The column heading represents priceCatgry 1, 2, 3. Row heading represents the shade Brown, Pink, Purple, Red. The row entries are as follows. Row 1, 20, 30, 10. Row 2, 20, 49, 12. Row 3, 8, 23, 6. Row 4, 5, 33, 7.](Media/python13.png)

---

## Independent Chi-Square

Then create the independent Chi-Square:

```python
stats.chi2_contingency(lipstick2_crosstab)
```

Here is the output:

```text
(7.860569553614045,
 0.2484973879479863,
 6,
 array([[14.26008969, 36.32286996,  9.41704036],
        [19.25112108, 49.03587444, 12.71300448],
        [ 8.79372197, 22.39910314,  5.80717489],
        [10.69506726, 27.24215247,  7.06278027]]))
```

Lipstick shade is not significantly related to the price category of the product, since the *p* value is > .05. 

---

## Test the Assumption

The expected counts were all greater than 5 in the output above, so this assumption has been met, and you are free to share these results without any limitations.

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 14 - Correlations<a class="anchor" id="DS105L1_page_14"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [7]:
from IPython.display import VimeoVideo
# Correlations
VimeoVideo('334043602', width=720, height=480)

The approximate transcript and code for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L01pg14tutorial.zip)**.

# Correlation

Correlations can be done on two continuous variables, to determine the relationship between them.  As a reminder, a correlation ranges from -1 to 1: its size is between zero and one, and its sign is either positive or negative.  The larger its size, the more closely related the two variables are. 

---
## Import Packages

In addition to the usual ```pandas```, you will also need to import ```pyplot``` and ```seaborn``` to be able to graphically visualize correlations. 

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

---
## Import Data

The data you'll be looking at is [data from cruise ship companies](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/cruise_ship.zip).  It has information on the size, age, and number of passengers. 

Here's what the data currently looks like: 

![A window displays a line of source code that reads, 1 cruise_ship.head open and close bracket. A table has 10 columns and 5 rows. Column headings are Ship, Line, YearBit, Tonnage, passngrs, Length, Cabins, Crew, PassSpcR, outcab.](Media/python14.png)

---

## Run One Correlation

Using the function ```.corr()```, it's easy to run a correlation on two selected variables.  For instance, do you think that the number of passengers and the number of cabins on a cruise ship would relate to each other? 

```python
cruise_ship['passngrs'].corr(cruise_ship['Cabins'])
```

The first thing you type is one of your variables, then you call ```.corr()``` and list the second variable.

Here is the output: 

```text
0.9763413679845939
```

It should not come as a surprise to you that the number of passengers and the number of cabins are very closely related. A correlation of .97 is about as strong as a correlation gets! This correlation is also positive, which means that as the number of passengers increases, so does the number of cabins, and vice versa.  
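
Here is a minimal sketch showing that ```.corr()``` on two columns computes Pearson's correlation by default, and that the order of the two variables does not matter. The numbers are made up for illustration; only the column names come from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical ship data, invented for illustration only
ships = pd.DataFrame({
    "passngrs": [6.9, 12.5, 14.8, 20.2, 26.4, 31.0],
    "Cabins": [3.5, 6.1, 7.4, 9.9, 13.2, 15.6],
})

# Pearson's r from pandas (the default method for Series.corr)
r_pandas = ships["passngrs"].corr(ships["Cabins"])

# The same value from NumPy's correlation matrix (off-diagonal entry)
r_numpy = np.corrcoef(ships["passngrs"], ships["Cabins"])[0, 1]

# Correlation is symmetric: swapping the variables gives the same r
r_swapped = ships["Cabins"].corr(ships["passngrs"])

print(r_pandas, r_numpy, r_swapped)
```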

---

## Create a Correlation Matrix

Running one correlation can be nice, but sometimes you'd like to know how all your data relates to each other! In that case, call in the big guns, and go for a correlation matrix! 

---

### Drop Non-Continuous Variables

When you are creating a correlation matrix, you are feeding the code your entire data set.  But you can only run Pearson's correlations on continuous variables! So it's important to drop anything that is categorical or a string first. 

```python
cruise_ship1 = cruise_ship.drop(['Ship', 'Line'], axis=1)
```

As you'll recall, the ```.drop()``` function will remove any columns specified in the square brackets, and the ```axis=1``` argument tells Python that these are the names of columns (not rows). So you are getting rid of the ```Ship``` and ```Line``` columns here.
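
If you would rather not list the columns by hand, one possible alternative is the pandas ```select_dtypes()``` function, which keeps columns by data type. The sketch below compares the two approaches on a hypothetical miniature of the data (all values are invented):

```python
import pandas as pd

# Hypothetical miniature of the cruise ship data (values invented)
toy = pd.DataFrame({
    'Ship': ['A', 'B', 'C'],
    'Line': ['X', 'Y', 'X'],
    'Tonnage': [30.3, 47.3, 110.0],
    'Cabins': [3.6, 7.4, 14.7],
})

# Option 1: name the columns to remove, as in the lesson
dropped = toy.drop(['Ship', 'Line'], axis=1)

# Option 2: keep only the columns stored as numbers
numeric_only = toy.select_dtypes(include='number')

print(list(dropped.columns))
print(list(numeric_only.columns))
```

Keep in mind that ```select_dtypes()``` only looks at how a column is stored. A categorical variable that has been coded as numbers would still be kept, so you may need to drop it yourself.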

You can confirm that everything worked as expected: 

![A window displays a line of source code that reads, 1 cruise_ship1.head open and close bracket. A table has 8 columns and 5 rows. Column headings are YearBit, Tonnage, passngrs, Length, Cabins, Crew, PassSpcR, outcab.](Media/python15.png)

---

### Using .corr() on An Entire Dataset

You will not believe how easy it is to get all correlations at your fingertips! 

```python
cruise_ship1.corr(method='pearson')
```

Just put in the name of your cleaned up dataset, call the ```.corr()``` function, and specify ```method='pearson'``` and away you go! Here's the result: 

![A window displays a line of source code that reads, 1 cruise_ship1.corr(method='pearson'). A table has 8 columns and 5 rows. Column headings are YearBit, Tonnage, passngrs, Length, Cabins, Crew, PassSpcR, outcab.](Media/python16.png)

Remember that you read only the top right or bottom left of this matrix; everything repeats after the diagonal row of ones. 
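
To make the matrix easier to read, you could blank out the repeated half. The following is a minimal sketch on a hypothetical three-column data frame (values invented), using ```numpy``` to build the mask; it is one way to do this, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (values invented)
toy = pd.DataFrame({
    'Tonnage': [30.3, 47.3, 110.0, 70.4, 101.4],
    'passngrs': [6.9, 14.9, 29.7, 20.5, 26.4],
    'Crew': [3.6, 6.7, 19.1, 9.2, 12.4],
})

corr = toy.corr(method='pearson')

# True below the diagonal, False on the diagonal and above it
below_diagonal = np.tril(np.ones(corr.shape, dtype=bool), k=-1)

# .where() keeps values where the mask is True and fills the rest with NaN
lower = corr.where(below_diagonal)
print(lower)
```

Only the bottom left triangle keeps its numbers; the diagonal of ones and the mirrored top right show up as ```NaN```.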

---

### Make .corr() Pretty!

Not pretty enough for you? Difficult to make sense of rows and rows of numbers? Well, you're in luck.  Adding a couple arguments can help you interpret things and add a little visual interest.

```python
cruise_ship1.corr(method='pearson').style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)
```

Start by using the ```.style.format()``` function.  In it, you will place ```{:.2}``` to specify how the numbers are displayed (here, rounded to two significant digits).  Then you can use ```background_gradient()``` to specify the colors.  This pulls the ```coolwarm``` palette from ```matplotlib pyplot```.  ```cmap=``` stands for color map.  Then lastly, the argument ```axis=1``` tells pandas to scale the colors within each row, rather than within each column.

Here is the final product: 

![A table has 8 columns and 5 rows. The column headings are YearBit, Tonnage, passngrs, Length, Cabins, Crew, PassSpcR, outcab. The row headings are YearBit, Tonnage, passngrs, Length, Cabins, Crew, PassSpcR, outcab.](Media/python17.png)
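
To see what the ```{:.2}``` format string does on its own, here is a small sketch in plain Python. Two of the values are correlations from this lesson's output; ```0.0523``` is invented to show a small number. The ```{:.2f}``` variant is included only for comparison.

```python
values = [0.9763413679845939, 0.0523, 0.5844312818128141]

# "{:.2}" keeps two significant digits
two_significant = ["{:.2}".format(v) for v in values]

# "{:.2f}" keeps two digits after the decimal point
two_decimals = ["{:.2f}".format(v) for v in values]

print(two_significant)   # ['0.98', '0.052', '0.58']
print(two_decimals)      # ['0.98', '0.05', '0.58']
```

The two formats only differ for small values, where significant digits and decimal places are not the same thing.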

---

### Use sns.heatmap()

You can also do a correlation matrix easily using the ```seaborn``` package: 

```python
sns.heatmap(cruise_ship1.corr(), annot=True)
```

Just put in your dataset name as an argument, then call the ```.corr()``` function again, and use the argument ```annot=True``` to have the values printed on the plot. There's a little less customization here, but it's also a little simpler, with fewer arguments. Here's the output:

![An 11 by 11 grid box with column headings are labeled as YearBit, Tonnage, passngrs, Length, Cabins, Crew, PassSpcR, outcab. The row headings are YearBit, Tonnage, passngrs, Length, Cabins, Crew, PassSpcR, outcab.](Media/python18.png)

---

## Summary

In this lesson, you learned how to perform in Python all the basic statistics you previously learned in MS Excel and R, including *t*-tests, Chi-Squares, and correlations.  It is important that you become proficient with the basics in all three tools! 

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 15 - Hands-On Correlations<a class="anchor" id="DS105L1_page_15"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

For your Activity, you will be computing correlations on a dataset of power lifting records. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[power_lifting dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/power_lifting.zip)**, you will explore how the different variables are related to each other.  Use any means of correlation you like, and correlate any variables you like.  Make sure to note anything interesting or unusual that stands out to you, and interpret those correlations.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 16 - Hands-On Correlations Solution<a class="anchor" id="DS105L1_page_16"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Solution

---

## Drop Any Variables That Aren't Continuous 

In order to run a correlation matrix, you will need to drop any variables that aren't continuous.  In this dataset, that includes ```Name```, ```Sex```, ```Equipment```, and ```Division```:   

```python
power_lifting1 = power_lifting.drop(['Name', 'Sex', 'Equipment', 'Division'], axis=1)
```

---

## Correlations

You could have performed correlations any way you liked! Below you will find the code for three options, and their results. 

---

### Individual Correlations

If you want, you can look at correlations one by one, though this can be tiresome.  Here's an example of one pair: 

```python
power_lifting1['BestBenchKg'].corr(power_lifting1['BodyweightKg'])
```

Here is the output:

```text
0.5844312818128141
```
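
If going pair by pair feels tiresome, one middle ground is the pandas ```corrwith()``` function, which correlates every column with a single variable of interest. This is a minimal sketch on hypothetical data: the values are invented, and the column names other than ```BestBenchKg``` and ```BodyweightKg``` are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the power lifting data (values invented)
toy = pd.DataFrame({
    'BodyweightKg': [60.0, 75.0, 82.5, 90.0, 110.0],
    'BestSquatKg': [100.0, 140.0, 150.0, 170.0, 200.0],
    'BestBenchKg': [60.0, 85.0, 95.0, 100.0, 130.0],
    'Age': [30, 22, 41, 27, 35],
})

# Correlate every other column with body weight in one call
others = toy.drop(columns='BodyweightKg')
vs_bodyweight = others.corrwith(toy['BodyweightKg'])

print(vs_bodyweight.sort_values(ascending=False))
```

Each entry matches what you would get by calling ```.corr()``` on that pair individually.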

---

### .corr()

Here's one way to do a correlation matrix: 

```python
power_lifting1.corr(method='pearson').style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)
```

If you didn't want it styled, you could leave off everything after the first set of parentheses. Here's the output: 

![A table has 8 columns and 5 rows. The column headings are MeetID, Age, Bodyweightkg, Squat4Kg, BestsquatKg, Bench4Kg, BestBenchKg, Deadlift4Kg, BestDeadliftKg, TotalKgKg, Wilks. The row headings are MeetID, Age, Bodyweightkg, Squat4Kg, BestsquatKg, Bench4Kg, BestBenchKg, Deadlift4Kg, BestDeadliftKg, TotalKgKg, Wilks.](Media/python19.png)

---

### sns.heatmap()

Here's the other way to do a correlation matrix.

```python
sns.heatmap(power_lifting1.corr(), annot=True)
```

It yields a graph that's a little more difficult to read, just because of the number of variables:

![An 11 by 11 grid box with column headings labeled as MeetID, Age, Bodyweightkg, Squat4Kg, BestsquatKg, Bench4Kg, BestBenchKg, Deadlift4Kg, BestDeadliftKg, TotalKgKg, Wilks. The row headings are MeetID, Age, Bodyweightkg, Squat4Kg, BestsquatKg, Bench4Kg, BestBenchKg, Deadlift4Kg, BestDeadliftKg, TotalKgKg, Wilks. All the diagonal boxes are labeled as 1.](Media/python20.png)

---

## Interpretation / Discussion

For either chart, larger correlations are shown in warmer colors.  Ignore the diagonal line of dark color down the center, since this is just each variable correlating with itself.  Then focus on the darker cells and work your way down.  It looks like both body weight and best deadlift weight correlate fairly highly with a number of other variables.  

Body weight is correlated with: 
* Best squat weight
* Best bench weight
* Best deadlift weight 

Best deadlift weight is correlated with: 
* Best squat weight
* Best bench weight
* Total kilograms
* Wilks

In addition, best squat is correlated with:
* Best bench weight
* Total kilograms
* Wilks

From this, it looks like the 4Kg variables (```Squat4Kg```, ```Bench4Kg```, and ```Deadlift4Kg```) aren't strongly correlated with anything. 
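
Rather than scanning the colors by eye, you could also rank the pairs by the strength of their correlation. The sketch below does this on a hypothetical four-column data frame; the values are invented and the column names are assumptions, so your real results will differ.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for the power lifting data (values invented)
toy = pd.DataFrame({
    'BodyweightKg': [60.0, 75.0, 82.5, 90.0, 110.0, 67.5],
    'BestSquatKg': [100.0, 140.0, 150.0, 170.0, 200.0, 120.0],
    'BestBenchKg': [60.0, 85.0, 95.0, 100.0, 130.0, 70.0],
    'Age': [30, 22, 41, 27, 35, 19],
})

corr = toy.corr(method='pearson')

# Collect each pair of different variables exactly once
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Sort by absolute value so strong negative correlations rank highly too
ranked = pd.Series(pairs).sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

The strongest relationships appear at the top, whichever direction they run in.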


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 17 - Key Terms<a class="anchor" id="DS105L1_page_17"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.hist()</td>
        <td>A function in pandas that allows you to create a histogram even when data is missing.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>stats.ttest_1samp()</td>
        <td>Computes a single sample t-test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>ttest_ind()</td>
        <td>Computes an independent t-test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>stats.ttest_rel()</td>
        <td>Computes a dependent t-test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pd.crosstab()</td>
        <td>Creates a contingency table in Python.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>stats.chi2_contingency()</td>
        <td>Calculates an independent Chi-Square using a contingency table.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.corr()</td>
        <td>Computes a correlation, either between two particular variables or across a whole dataset that is numeric. Takes the argument method= to specify the type of correlation. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sns.heatmap()</td>
        <td>Creates a heatmap of the relationship between variables in an all-numeric set.  An additional argument of annot=True will allow you to see the correlation values.</td>
    </tr>
</table>

---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scipy</td>
        <td>A package often used for machine learning and statistics. </td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>stats</td>
        <td>A package within scipy that will conduct many different statistical tests, including t-tests.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>seaborn</td>
        <td>A data visualization package usually abbreviated as sns.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pyplot</td>
        <td>A package within matplotlib that does statistical graphing.  Often abbreviated as plt. </td>
    </tr>
</table>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 18 - Hands-On Lesson 1 Review<a class="anchor" id="DS105L1_page_18"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

For this Hands-On, you will be analyzing data about Anime.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[anime dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/anime.zip)** you worked with in the lesson, perform the appropriate analyses and answer the following questions in your Python file.  

---

### Is a Rating Score of 6.2 Different from the Mean in this Dataset? 

Use the variable ```score```.  

---

### Does Anime that is Still Airing Differ in Popularity from Anime that is No Longer Airing? 

Use the variables ```status``` and ```popularity```.

---

### Does the Source of the Anime Influence the Type of Anime?

Use the variable ```source```, recoded to have four levels: 
* Manga
* Book
* Game
* Listening

And use the variable ```type```.
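
Recoding a variable into fewer levels can be done with a dictionary and the pandas ```.map()``` function. The sketch below only illustrates the mechanics: the raw category names, the groupings, and the ```sourceR``` column name are all hypothetical, so check the actual values in your data (for example with ```.value_counts()```) before deciding on your own mapping.

```python
import pandas as pd

# Hypothetical raw categories, invented for illustration
toy = pd.DataFrame({
    'source': ['Manga', 'Web manga', 'Novel', 'Card game', 'Music', 'Original']
})

# Hypothetical mapping from raw categories to the four levels
recode = {
    'Manga': 'Manga',
    'Web manga': 'Manga',
    'Novel': 'Book',
    'Card game': 'Game',
    'Music': 'Listening',
}

# Any category missing from the dictionary becomes NaN
toy['sourceR'] = toy['source'].map(recode)

print(toy['sourceR'].value_counts(dropna=False))
```

Rows that end up as ```NaN``` would then need to be dropped, or given a level of their own, before building a contingency table.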

---

### How do the Variables about Popularity / Ranking Relate to Each Other? 

Use the following variables: 

* ```score```
* ```scored_by```
* ```rank```
* ```popularity```
* ```members```
* ```favorites```

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 19 - Hands-On Lesson 1 Review Solution<a class="anchor" id="DS105L1_page_19"></a>

[Back to Top](#DS105L1_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Solution

---

## Answers

1. Is a Rating Score of 6.2 Different from the Mean in this Dataset? 
    > Yes, the average anime rating is higher than 6.2. 

2. Does Anime that is Still Airing Differ in Popularity from Anime that is No Longer Airing? 
    > Yes, those that are currently airing are more popular than those anime that are no longer airing.

3. Does the Source of the Anime Influence the Type of Anime?
    > Yes.

4. How do the Variables about Popularity / Ranking Relate to Each Other? 
    > They are all somewhat correlated with each other.  However, rank and popularity seem to correlate less well with the other variables, but well with each other. This probably means that those metrics are more similar in some way.
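
As a reminder of how an answer like the first one is reached, here is a minimal sketch of a single sample *t*-test against 6.2. The scores are invented for illustration, so the numbers will not match the real dataset.

```python
import pandas as pd
from scipy import stats

# Invented scores, standing in for the score variable
scores = pd.Series([6.8, 7.1, 6.5, 7.4, 6.9, 7.0, 6.6, 7.2])

# Compare the sample mean to the value of interest
t_stat, p_value = stats.ttest_1samp(scores, 6.2)

print(scores.mean(), t_stat, p_value)
```

A *p* value below .05 tells you the mean differs from 6.2, and a positive *t* statistic (or simply comparing the mean to 6.2) tells you the mean is the higher of the two.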

---

## Code 

Please feel free to check your work in the **[Jupyter Notebook with the answers](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/DSO105_L1_HandsOn.zip).**

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You will need to extract the zip file and save its contents to your computer, then open the notebook in Jupyter Notebook in order to view it.</p>
    </div>
</div>

