# Statistics - Central Tendencies

## Introduction - Data Analysis

With a large dataset, it's difficult to describe the data when you first look at it.  A quick way to get an overview is to use what are called _Indicators of Central Tendencies_, some of which you learned in grade school.  The *mean*, *maximum*, *minimum*, *standard deviation*, *variance* and *median* are straightforward ways to help you get a grip on your data.

The data set used for this lesson is home prices for different cities/regions in the US.  The values are medians, tracked at the end of each month, from Oct 2017 to Jun 2020.

## Objectives of this Lesson ##

1. Start your Jupyter Notebook
2. Load the data
3. Generate and output Indicators of Central Tendencies
4. Practice with these functions
5. Interpret the output
    
***
    
## Initial setup ##

Download the data file from Blackboard and put it in the Data Science folder on your OneDrive.  For each Question asked, create a new cell, set its type to Markdown, and answer the question.

__Example 1/Setup:__  Type the code, exactly as shown, in the box in Jupyter Notebook, save it and Run it. 

In [2]:
import pandas as pd     # import the pandas library to read a .csv file

median_home_values = pd.read_csv("Median_List_raw_month.csv")   # make sure the data file is in the same directory

display(median_home_values.head(10))    # view the first 10 rows of data

Unnamed: 0,RegionName,StateName,10/31/2017,11/30/2017,12/31/2017,1/31/2018,2/28/2018,3/31/2018,4/30/2018,5/31/2018,...,9/30/2019,10/31/2019,11/30/2019,12/31/2019,1/31/2020,2/29/2020,3/31/2020,4/30/2020,5/31/2020,6/30/2020
0,"New York, NY",NY,519900.0,519999.0,515000.0,519000.0,538888.0,549000.0,550000.0,568500.0,...,575000,579000,579000,578000,575000,579900,598000,595000,599000,599900
1,"Los Angeles-Long Beach-Anaheim, CA",CA,795000.0,795000.0,799000.0,799000.0,810000.0,845000.0,850000.0,849999.0,...,839999,848000,849000,849000,850000,895000,899900,859000,879650,925000
2,"Chicago, IL",IL,315000.0,305000.0,299000.0,299000.0,319900.0,339000.0,349000.0,349900.0,...,325000,319900,314500,301640,306000,324900,332500,329000,335000,345500
3,"Dallas-Fort Worth, TX",TX,344900.0,340000.0,342844.0,345000.0,350000.0,359990.0,369447.0,369000.0,...,340000,340000,340000,335000,331500,338500,340000,339900,349900,356000
4,"Philadelphia, PA",PA,259990.0,259000.0,249900.0,249000.0,250000.0,262250.0,269900.0,279000.0,...,299890,299900,299000,289900,289000,295000,299900,300000,319990,335000
5,"Houston, TX",TX,319995.0,323042.0,324900.0,324990.0,329500.0,335000.0,339500.0,340000.0,...,316581,316672,313000,309900,309000,312836,319000,318900,325000,334290
6,"Washington, DC",DC,459990.0,439900.0,427000.0,425900.0,439000.0,457000.0,460000.0,475000.0,...,485000,488000,485000,479990,484990,499900,525000,515000,529900,550000
7,"Miami-Fort Lauderdale, FL",FL,395000.0,395000.0,395000.0,395000.0,395000.0,399000.0,399000.0,399700.0,...,399900,399995,399999,400000,399900,399999,400000,398000,395000,399000
8,"Atlanta, GA",GA,300000.0,307000.0,309000.0,313990.0,322000.0,335000.0,349000.0,350000.0,...,329900,326499,325000,324250,322500,325000,330000,325900,335000,347900
9,"Boston, MA",MA,549900.0,549000.0,534900.0,539900.0,554900.0,594900.0,599000.0,599000.0,...,574900,578000,575000,574500,575000,599000,619900,599900,624900,642610


We loaded the data into a variable called $median\_home\_values$ (a pandas DataFrame) and then, using the $.head$ method, showed the first 10 rows.  Notice that pandas picked up the headers from the first line of the file and displayed them correctly.  This is one of the strengths of the pandas library.

Also notice that the rows are numbered, and the first one begins with zero.  While most of us don't count starting with zero, it is common in software packages, so when you're picking a row or column, remember to subtract 1 from its counting position to get its index: the 10th row is row 9.
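The zero-based numbering can be tried out on a tiny table. This is only a sketch: the three-row DataFrame `toy` is made up for illustration (its values are copied from the preview above) and stands in for the real data file.

```python
import pandas as pd

# Hypothetical three-row table, standing in for the real data file
toy = pd.DataFrame({
    "RegionName": ["New York, NY", "Chicago, IL", "Boston, MA"],
    "10/31/2017": [519900.0, 315000.0, 549900.0],
})

first_row = toy.iloc[0]   # position 0 is the 1st row
third_row = toy.iloc[2]   # position 2 is the 3rd row (3 - 1 = 2)

print(first_row["RegionName"])
print(third_row["RegionName"])
```

`.iloc` selects by position, so it always counts from zero regardless of how the rows are labeled.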

***

__Example 1:__ Let's look at the data description.  Type the code, exactly as shown, in the box in Jupyter Notebook, save it and Run it.


In [4]:
display(median_home_values.info())   # .info() prints its summary and returns None, which display() then shows as the trailing "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 35 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   RegionName  119 non-null    object 
 1   StateName   119 non-null    object 
 2   10/31/2017  117 non-null    float64
 3   11/30/2017  117 non-null    float64
 4   12/31/2017  117 non-null    float64
 5   1/31/2018   117 non-null    float64
 6   2/28/2018   117 non-null    float64
 7   3/31/2018   117 non-null    float64
 8   4/30/2018   117 non-null    float64
 9   5/31/2018   117 non-null    float64
 10  6/30/2018   117 non-null    float64
 11  7/31/2018   119 non-null    int64  
 12  8/31/2018   119 non-null    int64  
 13  9/30/2018   119 non-null    int64  
 14  10/31/2018  119 non-null    int64  
 15  11/30/2018  119 non-null    int64  
 16  12/31/2018  119 non-null    int64  
 17  1/31/2019   119 non-null    int64  
 18  2/28/2019   119 non-null    int64  
 19  3/31/2019   119 non-null    i

None

We used the $.info$ method on the variable to look at the structure of the data, which is important in deciding how we want to analyze it.  Most data is divided into 2 types of values: those that you can do math on, and those you can't.  Obviously, you can't add a RegionName, so pandas stores that column as an $object$.  Same with StateName.  The median home costs are stored as either $float64$ (values with decimal points) or $int64$ (whole numbers).  You can do math on either of those.
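One likely reason some month columns show up as floats and others as integers is missing data: a column with a missing value (NaN) has to be stored as a float. The sketch below illustrates this on a made-up table; the name `toy` and its values are assumptions for illustration, not the lesson's data.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one month column has a missing value, the other does not
toy = pd.DataFrame({
    "RegionName": ["New York, NY", "Chicago, IL", "Boston, MA"],
    "10/31/2017": [519900.0, np.nan, 549900.0],   # missing value -> float column
    "7/31/2018": [550000, 339000, 599000],        # complete -> integer column
})

print(toy.dtypes)   # the data type pandas chose for each column

# Keep only the columns you can do math on
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

This matches the pattern in the $.info$ output, where the columns with 117 non-null values are $float64$ and the complete columns are $int64$.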

*** 

__Example 2:__ Let's calculate the mean of some of these values.  We also import a key mathematical library, called $NumPy$, which pandas uses behind the scenes for its calculations.  Normally, you load all the libraries at the very beginning.

Type the code, exactly as shown, in the box in Jupyter Notebook, save it and Run it.

In [5]:
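import numpy as np              # np is the conventional short name for numpy

mean_by_month = median_home_values.mean(axis=0, numeric_only=True)  # calculate the mean of each numeric column (newer pandas versions raise an error on text columns without numeric_only=True)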

print(mean_by_month)  # output the means

10/31/2017    318782.863248
11/30/2017    317576.170940
12/31/2017    317105.760684
1/31/2018     319088.803419
2/28/2018     326966.273504
3/31/2018     335194.777778
4/30/2018     341345.606838
5/31/2018     344427.555556
6/30/2018     343827.752137
7/31/2018     339424.739496
8/31/2018     333570.680672
9/30/2018     330338.084034
10/31/2018    327906.235294
11/30/2018    324342.100840
12/31/2018    322769.899160
1/31/2019     324615.495798
2/28/2019     331416.521008
3/31/2019     341660.798319
4/30/2019     349141.630252
5/31/2019     353863.831933
6/30/2019     353282.176471
7/31/2019     349759.521008
8/31/2019     346192.109244
9/30/2019     343952.521008
10/31/2019    341937.310924
11/30/2019    340246.857143
12/31/2019    339031.420168
1/31/2020     341860.268908
2/29/2020     351774.386555
3/31/2020     359149.168067
4/30/2020     355366.075630
5/31/2020     364656.268908
6/30/2020     374990.243697
dtype: float64


Wow, cool!  That was fast!  What is it?  For each month, it's the average of the values across all the RegionNames listed.  Each average is taken down one column of the table, so the result is one mean per month, which lets us follow the average over time.
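Recall from the $.info$ output that the first nine months have only 117 values out of 119. By default, the pandas mean skips missing values instead of failing. A minimal sketch on made-up numbers (the Series `prices` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical prices for three regions in one month; one is missing
prices = pd.Series([300000.0, np.nan, 400000.0])

mean_skipping_missing = prices.mean()           # default skipna=True: averages the 2 known values
mean_with_missing = prices.mean(skipna=False)   # NaN propagates into the result

print(mean_skipping_missing)
print(mean_with_missing)
```

So the means for the early months are averages over 117 regions rather than 119, which is worth keeping in mind when comparing them with later months.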

***

__Question 1:__ In looking at the data, with each 'row' being a different region/city and each column being a different month, what can we determine from means over time (columns)?

If we were to do the same calculations on the locations (rows), what can we determine?

Be specific.

***

Now, let's do an average for each Region.

__Example 3:__ The axis is the dimension chosen to do the math over, with axis=0 going down the columns, and axis=1 going across (along a row).

Type the code, exactly as shown, in the box in Jupyter Notebook, save it and Run it.

In [1]:
mean_by_region = median_home_values.mean(axis=1, numeric_only=True)  # calculate the mean of each row, using only the numeric columns

print(mean_by_region.apply(int))  # output the means

Since there are 119 rows, pandas only outputs the first and last 5, but the variable $mean\_by\_region$ has all the values.
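The difference between the two axis values is easiest to see on a table small enough to check by hand. The following is a sketch with made-up region names, month names and prices:

```python
import pandas as pd

# Hypothetical 2 x 2 table: rows are regions, columns are months
toy = pd.DataFrame(
    {"Jan": [100, 300], "Feb": [200, 400]},
    index=["Region A", "Region B"],
)

mean_per_month = toy.mean(axis=0)    # down each column: one value per month
mean_per_region = toy.mean(axis=1)   # across each row: one value per region

print(mean_per_month)    # Jan 200.0, Feb 300.0
print(mean_per_region)   # Region A 150.0, Region B 350.0
```

The result of `axis=0` is labeled by column names, and the result of `axis=1` is labeled by row labels, which is a quick way to confirm which direction was used.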

***

__Question 2:__  We only looked at the first five here, but using the output from $.head$ above, we can determine which cities are the first five.  What does this data tell you?

***

__Example 4:__  $.median$ is the method that calculates the median.  Note the lowercase spelling: Python names are case-sensitive.

Type the code, exactly as shown, in the box in Jupyter Notebook, save it and Run it.

This calculated the medians by month over all regions.
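A minimal sketch of a median-by-month call, shown on a small made-up table rather than on $median\_home\_values$ (the names `toy`, `median_per_month` and `mean_per_month` are assumptions for illustration). It also shows why the median is useful: one very expensive region pulls the mean up but leaves the median alone.

```python
import pandas as pd

# Hypothetical table: Region C is far more expensive than the others
toy = pd.DataFrame({
    "RegionName": ["Region A", "Region B", "Region C"],
    "Jan": [100, 200, 900],
    "Feb": [110, 210, 910],
})

# numeric_only=True skips the text column RegionName
median_per_month = toy.median(axis=0, numeric_only=True)
mean_per_month = toy.mean(axis=0, numeric_only=True)

print(median_per_month)   # Jan 200.0, Feb 210.0
print(mean_per_month)     # Jan 400.0, Feb 410.0
```

On the real data, the same call pattern on $median\_home\_values$ would presumably give one median per month.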

***

__Exercise 1:__ Generate a variable $median\_by\_region$ that contains the median values for each region. 
Print $median\_by\_region$ the same way you printed $mean\_by\_region$.

***


Other statistical parameters are used exactly the same way as mean and median.  

|  Parameter |  Description                                                                        |
|------------|-------------------------------------------------------------------------------------|
|$std$,$var$ | Standard deviation and variance, respectively; the pandas methods divide by $n-1$ by default (NumPy's functions divide by $n$) |
|$min$,$max$ |  Minimum and maximum values                                                         |
|$sum$       |  Sum of all elements, or along an axis                                              |
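The divisor used for the standard deviation and variance differs between pandas and NumPy, so it is worth checking on numbers you can verify by hand. A sketch on a made-up Series (`values` is hypothetical; the `ddof` parameter sets the divisor to $n - ddof$):

```python
import numpy as np
import pandas as pd

# Hypothetical values: mean is 5, squared deviations sum to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                # pandas default ddof=1: divides by n - 1
population_std = values.std(ddof=0)      # divides by n
numpy_std = np.std(values.to_numpy())    # NumPy default ddof=0: divides by n

print(sample_std)        # sqrt(32 / 7), about 2.138
print(population_std)    # sqrt(32 / 8) = 2.0
print(numpy_std)         # 2.0
print(values.min(), values.max(), values.sum())   # 2.0 9.0 40.0
```

For a data set with over a hundred regions the two versions differ only slightly, but they are not interchangeable when comparing results across tools.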


***

__Exercise 2:__  Create variables $max\_by\_month$, $min\_by\_region$ and $std\_by\_month$.  Print $max\_by\_month$ the same way you printed previous variables.


 




Now, in really LOOKING at the numbers, the home costs are medians, and Python has given us 6 digits after the decimal point.  Remembering significant digits, do any of those really count?  No, so let's convert all the floating-point values to integer values.  What is important here is that $int$ truncates the fraction (for positive values, that is the same as rounding down) rather than rounding to the nearest whole number.

__Example 5:__ Converting to integer for outputting.


In [11]:
print("Mean Home Values by month are:", mean_by_month.apply(int))

Mean Home Values by month are: 10/31/2017    318782
11/30/2017    317576
12/31/2017    317105
1/31/2018     319088
2/28/2018     326966
3/31/2018     335194
4/30/2018     341345
5/31/2018     344427
6/30/2018     343827
7/31/2018     339424
8/31/2018     333570
9/30/2018     330338
10/31/2018    327906
11/30/2018    324342
12/31/2018    322769
1/31/2019     324615
2/28/2019     331416
3/31/2019     341660
4/30/2019     349141
5/31/2019     353863
6/30/2019     353282
7/31/2019     349759
8/31/2019     346192
9/30/2019     343952
10/31/2019    341937
11/30/2019    340246
12/31/2019    339031
1/31/2020     341860
2/29/2020     351774
3/31/2020     359149
4/30/2020     355366
5/31/2020     364656
6/30/2020     374990
dtype: int64
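Truncation always discards the fraction, so a mean of 318782.86 is shown as 318782 rather than 318783. If rounding to the nearest dollar is preferred, one option is to round first and then convert. This is a sketch using the first two means from the output above; the variable names are made up for illustration.

```python
import pandas as pd

# First two monthly means, copied from the Example 2 output
means = pd.Series(
    [318782.863248, 317576.170940],
    index=["10/31/2017", "11/30/2017"],
)

truncated = means.apply(int)           # drops the fraction: 318782, 317576
rounded = means.round().astype(int)    # nearest whole number: 318783, 317576

print(truncated)
print(rounded)
```

For values this large the difference is at most one dollar, so either choice is defensible; the point is to know which one you are getting.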
