<a href="https://colab.research.google.com/github/katelyndiaz/KWK_DS2021/blob/main/KWK_2021_Stats_1_(Scholar).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📈 **Intro to Stats** 📈

First, go to `file > save a copy in drive`. This will make a new copy of this notebook. Next, open your new notebook and go to `edit > clear all outputs`. This will make sure that when you run your code, the output is not already shown. 


## 🎯 **Learning Goals**

- Understand core statistical concepts when exploring a new dataset
- Use Python to explore descriptive statistics of a dataframe
- Describe and apply various measures of central tendency and spread


## 📗 **Technical Vocabulary** 
- Descriptive Statistics 
- Inferential Statistics
- Measures of central tendency
- Mean, median, mode
- Measures of spread
- Variance
- Standard Deviation
- Percentiles
- Ranges







### **Importing Packages**

In [None]:
# importing packages 


## 🌎 **What is Statistics?**

Statistics is the study of how to collect, analyze, and draw conclusions from data. It’s a hugely valuable tool that you can use to bring the future into focus and infer the answer to tons of questions. For example, what is the likelihood of someone purchasing your product, how many calls will your support team receive, and how many jeans sizes should you manufacture to fit 95% of the population? 




#### **Population and Samples** 

In statistics, the population is the set of all elements or items that you’re interested in. Populations are often vast, which makes it impractical to collect and analyze data on every member. That’s why statisticians usually try to draw conclusions about a population by choosing and examining a representative subset of that population.

This subset of a population is called a sample. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent. That way, you’ll be able to use the sample to glean conclusions about the population ([source](https://realpython.com/python-statistics/)).
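To make the idea concrete, here is a minimal sketch that draws a random sample from a made-up population and compares the two means. The population values, the sample size of 100, and the fixed seed are all assumptions chosen for illustration, not part of the lesson's datasets.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same numbers every run

# Hypothetical population: 10,000 made-up order totals between $1 and $20
population = [round(random.uniform(1, 20), 2) for _ in range(10_000)]

# A simple random sample of 100 orders, drawn without replacement
sample = random.sample(population, 100)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

The two means will usually be close but not identical. That gap is why we need to think about how well a sample represents its population.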

## 🧮  **Descriptive & Inferential Statistics** 

 

#### **Descriptive Statistics**

Descriptive statistics describe a sample. You simply take a group that you’re interested in, record data about the group members, and then use summary statistics and graphs to present the group properties [(source)](https://statisticsbyjim.com/basics/descriptive-inferential-statistics/). 

#### **Inferential Statistics**

Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population [(source)](https://statisticsbyjim.com/basics/descriptive-inferential-statistics/).


## 📏 **Measures of Central Tendency**

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. The 3 most common measures of central tendency are mean, median, and mode.

#### **Mean**

This is the term for what we generally call an "average". We add all of the numbers in our data set, and then divide by the size of the set. Sometimes a mean can be misleading and may not show a typical value in our dataset, because the mean is influenced by outliers. Outliers are numbers that are either extremely high or extremely low compared to the rest of the numbers in a dataset.



You can calculate the mean with pure Python using `sum()` and `len()`, without importing libraries:

```
x = [2, 4, 6, 8]  # example data; replace with your own list of numbers
mean = sum(x) / len(x)
print(mean)
```

You can also use the pandas `.mean()` method on a dataframe column:

```
dataframe_name["column_name"].mean()
```
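The effect of an outlier on the mean can be seen in a short sketch. The order totals below are made up for illustration and are not part of the lesson's dataset.

```python
# Hypothetical order totals in dollars (made up for illustration)
typical_orders = [8, 9, 10, 11, 12]
with_outlier = typical_orders + [100]  # one unusually large order

mean_typical = sum(typical_orders) / len(typical_orders)
mean_with_outlier = sum(with_outlier) / len(with_outlier)

print(mean_typical)       # 10.0
print(mean_with_outlier)  # 25.0
```

A single large order moves the mean from 10.0 to 25.0, even though five of the six orders are between 8 and 12.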

**Now let's practice using a dataset...**

Suppose 12 people went to a restaurant and each ordered at least one item. Below is a dataset that describes the number of items each person ordered and the total price of their order.





In [None]:
# creating our dataframe
order = pd.DataFrame({'item_price':[2.39, 3.39, 3.39, 2.39, 16.98, 10.98, 1.69, 11.75, 9.25, 9.25, 4.45, 8.75],
                   'items_ordered':[1,2,2,1,4,3,1,4,3,3,2,3 ]})

# looking at our data


In [None]:
# calculating the mean 


This means that the average total order price was $7.05.

#### **Median**

For a given data set (of numbers), the median is the number that separates the top half of the data from the bottom half. If the size of the data set is even (so there is no single data point in the middle), the median is given by the mean of the two middle values, after arranging the data into ascending order.

Note: The median is far less affected by outliers than the mean, because it depends only on the middle values.
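The rule for odd and even sizes can be sketched in plain Python. The helper name `median_by_hand` and the small lists are hypothetical, chosen only to illustrate the definition. The result is checked against the standard library's `statistics.median`.

```python
import statistics

def median_by_hand(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)          # the data must be in ascending order first
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd size: a single middle value exists
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2  # even size

odd_data = [7, 1, 5]                  # sorted: 1, 5, 7
even_data = [1, 3, 5, 7]
even_with_outlier = [1, 3, 5, 700]    # the largest value is now an outlier

print(median_by_hand(odd_data))           # 5
print(median_by_hand(even_data))          # 4.0
print(median_by_hand(even_with_outlier))  # 4.0
```

Replacing 7 with 700 leaves the median at 4.0, while the mean of the same list would jump from 4.0 to 177.25.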





Calculating the median:

```
dataframe_name["column_name"].median()
```

In [None]:
# calculating the median 


#### **Mode**

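The "mode" is the value that occurs most often. A list can have more than one mode when several values tie for the highest count. If no number in the list is repeated, we usually say there is no mode for the list (pandas will return every value in that case, since they are all tied).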
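Here is a minimal sketch of counting modes with the standard library. The helper name `modes` and the example lists are hypothetical. It returns every value tied for the highest count, which is also how pandas `.mode()` behaves.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

one_mode = modes([1, 2, 2, 3])       # 2 appears most often
two_modes = modes([1, 1, 2, 2, 3])   # 1 and 2 are tied
no_repeats = modes([4, 5, 6])        # every value appears once, so all are tied

print(one_mode)     # [2]
print(two_modes)    # [1, 2]
print(no_repeats)   # [4, 5, 6]
```

When the result contains every value in the list, as in the last case, it is usually more useful to report that the data has no mode.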

Calculating the mode:

```
dataframe_name["column_name"].mode()
```

In [None]:
# calculating the mode 


## 📐 **Measures of Spread** 

A measure of spread, also called a measure of dispersion, is used to describe the variability in a sample or population. It is usually used in conjunction with a measure of central tendency, such as the mean or median, to provide an overall description of a set of data [(source)](https://statistics.laerd.com/statistical-guides/measures-of-spread-range-quartiles.php#:~:text=Introduction,of%20a%20set%20of%20data.).

At KWK we'll be talking about four different measures of spread:
- Variance
- Standard Deviation
- Percentiles
- Ranges



### ☯️ **Variance**

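**Variance** is the average squared distance from each data point to the data's mean.


**Sample Variance** is the variance of a sample taken from the population. It divides the sum of the squared distances by n - 1 instead of n, where n is the number of data points in the sample.


**Population Variance** is the variance of an entire population: the average of the squared distances from the population mean.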
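The two definitions can be compared in a short sketch. The eight values below are made up for illustration. In `np.var()`, the `ddof` argument is the number subtracted from n in the denominator, so `ddof=0` (NumPy's default) gives the population variance and `ddof=1` gives the sample variance.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical values; the mean is 5

# Squared distance from each data point to the mean
squared_distances = (data - data.mean()) ** 2

population_variance = squared_distances.sum() / len(data)    # divide by n
sample_variance = squared_distances.sum() / (len(data) - 1)  # divide by n - 1

print(population_variance, np.var(data))      # both are 4.0
print(sample_variance, np.var(data, ddof=1))  # both are about 4.57
```

Note that the pandas `.var()` method uses `ddof=1` by default, so it returns the sample variance unless you pass `ddof=0`.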



**Calculating variance using `np.var()`** 

In [None]:
# calculate the variance 
# ddof=0 (the np.var default) calculates population variance; ddof=1 calculates sample variance

### 📊 **Standard Deviation**

One common method to measure the variation of our dataset is to calculate the standard deviation (SD). The SD measures how far a set of values spreads out from its mean. A low SD shows that the values are close to the mean, and a high SD shows that the values deviate widely from the mean.

Note:
- SD is never negative
- SD is affected by outliers as its calculation is based on the mean
- The smallest possible value of SD is zero. If SD is zero, all the numbers in a dataset share the same value.
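These notes can be checked with a small sketch. The arrays are made up for illustration. As with `np.var()`, `np.std()` uses `ddof=0` (the population formula) unless told otherwise.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical values
constant = np.array([3, 3, 3, 3])          # every value is the same

# The standard deviation is the square root of the variance
sd_from_variance = np.sqrt(np.var(data))
sd_direct = np.std(data)

print(sd_from_variance, sd_direct)  # both are 2.0
print(np.std(constant))             # 0.0, because no value differs from the mean
```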


In [None]:
# calculated by taking the square root of the variance 


We can also get the same result using the `np.std` function:

In [None]:
# calculated using np.std function 


### 💯 **Percentiles**

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found. 



#### **Quantiles**

Quantiles are the values (cut points) that divide the dataset into groups of equal size. We can find them using the `np.quantile()` function:
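As a quick sketch of how `np.quantile()` is called, the example below uses a made-up array of scores. `np.quantile()` takes a fraction between 0 and 1, while `np.percentile()` takes the same position as a number between 0 and 100.

```python
import numpy as np

scores = np.array([10, 20, 30, 40, 50])  # hypothetical values

half = np.quantile(scores, 0.5)          # the 0.5 quantile
same_half = np.percentile(scores, 50)    # the 50th percentile, same position
quarter = np.quantile(scores, 0.25)      # the 0.25 quantile

print(half, same_half)  # 30.0 30.0
print(quarter)          # 20.0
```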



In [None]:
# find the 0.5 quantile (the 50th percentile)


This gives us 6.6, which means that 50% of individuals in the dataset ordered less than $6.60 worth of food and the other 50% ordered more (the same value as the median).

#### **Quartiles**

The three dividing points (or quantiles) that split data into four equally sized groups are called quartiles. 
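One way to get all three quartiles at once is to pass a list of fractions to `np.quantile()`. The nine values below are made up for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical values

# Passing a list returns one value per requested quantile
q1, q2, q3 = np.quantile(data, [0.25, 0.5, 0.75])

print(q1, q2, q3)  # 3.0 5.0 7.0
```

Here `q2` is the median, and roughly a quarter of the values fall in each of the four groups that the quartiles mark off.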



In [None]:
# find the quartiles 

This means that 25% of the data is between 1.69 and 3.14. Another 25% is between 3.14 and 6.6 and so on. 

### ↔️ **Ranges**



#### **Interquartile Range**

The IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR) by hand, first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1. NumPy and SciPy estimate the quartiles by interpolating between data points, so their result can differ slightly from the by-hand method on small datasets [(source)](https://www.khanacademy.org/math/cc-sixth-grade-math/cc-6th-data-statistics/cc-6th/v/calculating-interquartile-range-iqr#:~:text=The%20IQR%20describes%20the%20middle,upper%20half%20of%20the%20data.&text=The%20IQR%20is%20the%20difference%20between%20Q3%20and%20Q1.).  
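The sketch below computes the IQR two ways on a made-up list of eight values: with the by-hand method (medians of the lower and upper halves) and with `np.quantile()`, which interpolates between data points by default. The variable names are hypothetical.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])  # hypothetical values, already sorted

# By hand: Q1 and Q3 are the medians of the lower and upper halves
lower_half, upper_half = data[:4], data[4:]
iqr_by_hand = np.median(upper_half) - np.median(lower_half)  # 6.5 - 2.5

# With NumPy: Q1 and Q3 come from linear interpolation
q1, q3 = np.quantile(data, [0.25, 0.75])
iqr_numpy = q3 - q1                                          # 6.25 - 2.75

print(iqr_by_hand)  # 4.0
print(iqr_numpy)    # 3.5
```

Both answers are reasonable estimates of the spread of the middle 50% of the data. The difference shrinks as the dataset gets larger.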

In [None]:
# find the interquartile range


We can also get the same result by importing a SciPy module. You can learn more about it [here](https://docs.scipy.org/doc/scipy/reference/stats.html).

In [None]:
from scipy.stats import iqr

iqr(order['item_price'])

This means that the range of the middle 50% of the data is 6.54.

## 💨 **A Shortcut**

Although it's important to know how to calculate measures of central tendency and spread individually, pandas has a very useful method that can do everything in one step. 

The `.describe()` method is a great summarisation tool that will quickly display statistics for any column or dataframe it is applied to. The `describe()` output varies depending on whether you apply it to a numeric or text (string) column.
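Here is a small sketch of that difference, using a made-up dataframe with one numeric column and one text column. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical orders (made up for illustration)
orders = pd.DataFrame({
    "price": [2.5, 4.0, 4.0, 7.5],
    "item": ["taco", "salad", "taco", "soup"],
})

# Numeric column: count, mean, std, min, 25%, 50%, 75%, max
numeric_summary = orders["price"].describe()

# Text column: count, unique, top (most common value), freq (its count)
text_summary = orders["item"].describe()

print(numeric_summary)
print(text_summary)
```

The `std` row in the numeric summary is the sample standard deviation, since pandas uses `ddof=1` by default.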

In [None]:
# use the describe function 




---


## ✏️ **Try It!**

Using the Spotify dataset, answer the questions below:

Spotify github url = https://raw.githubusercontent.com/mikaela-el/repo/master/kwk_spotify.csv

In [None]:
# load in data 


# preview data 


Use the Spotify dataset to answer the following questions:

1. What is the average "bpm" in the dataset?
2. What is the average "dnce" in the dataset?
3. What is the highest "nrgy"?
4. What is the lowest "bpm"?
5. What is the standard deviation of the "live" column?
6. What is the standard deviation of the "dnce" column?
7. What is the value below which 50% of the observations in the "nrgy" column can be found?
8. What is the value above which 75% of the observations in the "live" column can be found?



## **Congrats! You completed Intro to Stats 🎉**