# Topic

Computing averages
    * Mean
    * Median
    * Mode
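
As a quick illustration of the three kinds of average listed above, here is a minimal sketch using pandas. The five prices are hypothetical values made up for this example and are not taken from the data files used later.

```python
import pandas as pd

# Hypothetical sample prices, invented for illustration only
prices = pd.Series([240, 245, 245, 250, 300])

mean_price = prices.mean()      # sum of the values divided by their count
median_price = prices.median()  # middle value once the data are sorted
mode_prices = prices.mode()     # most frequent value(s), returned as a Series

print(mean_price, median_price, mode_prices.tolist())
```

Note that the single large value (300) pulls the mean above the median, and that `mode()` returns a Series because a data set can have more than one most frequent value.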

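## Today's main example

We analyze data on the wholesale price of weed (the plant) sold in the United States, together with the population distribution and monthly income for each of the 51 US states in the data set.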

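## Main modules

* pandas: a module dedicated to statistical analysis
    * Built on top of the numpy module and specialized for statistical analysis.
    * Supports features that work much like Microsoft Excel.
* datetime: a module that helps represent dates and times properly
* scipy: a module that supports numerical computation, engineering mathematics, and more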

In [10]:
import numpy as np
import pandas as pd
from datetime import datetime as dt
from scipy import stats

### Data used today

* Weed (the plant) wholesale prices and sale dates by state: Weed_Price.csv
* Population distribution and monthly income by state: Demographics_State.csv
* Population by state: Population_State.csv


The figure below shows part of Weed_Price.csv, which holds weed (the plant) sales data by US state, as it looks when opened in Excel.
The full data set has 22,899 records; the figure shows only five of them.
* Note: line 1 holds the table's column names.
* Column names: State, HighQ, HighQN, MedQ, MedQN, LowQ, LowQN, date

<p>
<table cellspacing="20">
<tr>
<td>
<img src="img/weed_price.png" width="600">
</td>
</tr>
</table>
</p>

### Loading the csv files

* Use the read_csv function from the pandas module.
* read_csv returns a special data type called DataFrame.
    * Think of it as a spreadsheet shaped like the Excel figure above.

Let's load the three csv files mentioned above with pandas' read_csv function.

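**Note**: The parse_dates keyword argument is used when loading Weed_Price.csv.
* parse_dates keyword argument: tells read_csv which columns to parse as dates.
    * The value [-1] here is meant to select the last column, so the date strings in the source data are converted to datetime values instead of being loaded as plain text.
    * As the Excel view above shows, the dates in the last column are already written in a form that can be parsed as-is, so no further conversion is needed.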
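
To make the effect of parse_dates concrete, here is a minimal sketch that reads a tiny in-memory CSV twice, once without and once with parse_dates. The two-row sample is hypothetical and only mimics the layout of Weed_Price.csv; naming the column explicitly (`parse_dates=["date"]`) is an assumption made here for clarity rather than the positional form used in this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the layout of Weed_Price.csv
csv_text = (
    "State,HighQ,date\n"
    "California,248.77,2013-12-27\n"
    "New York,351.98,2013-12-27\n"
)

# Without parse_dates the date column is loaded as plain text
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the named column is converted to datetime values
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(as_text["date"].dtype)
print(as_dates["date"].dtype)
print(as_dates["date"].dt.year.tolist())  # datetime columns expose .dt accessors
```

Having real datetime values matters later on, for example when two tables are merged on the date column.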

In [11]:
prices_pd = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])
demography_pd = pd.read_csv("data/Demographics_State.csv")
population_pd = pd.read_csv("data/Population_State.csv")

#### Variance

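> Two statisticians, 4 feet and 5 feet tall, once had to cross a river with an AVERAGE depth of 3 feet. Meanwhile, a third person came along and said, "What are you waiting for? You can easily cross the river."

Variance is the average squared distance of the data values from the *mean*. The cells below compute the sample variance, which divides the sum of squared deviations by n - 1 rather than n.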

<img style="float: left;" src="img/variance.png" height="320" width="320">
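
Before working with the real prices, the definition can be checked on a small example. The sketch below uses eight hypothetical values chosen so the arithmetic is easy to follow, computes the sample variance by hand, and compares it with pandas' built-in `var()`.

```python
import pandas as pd

# Hypothetical sample data, invented so the mean is a round number (5.0)
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()
squared_dev = (values - mean) ** 2                   # squared distance from the mean
sample_var = squared_dev.sum() / (len(values) - 1)   # 32 / 7, dividing by n - 1

print(sample_var)
print(values.var())  # pandas also divides by n - 1 by default (ddof=1)
```

Both numbers should agree (about 4.571), since `Series.var()` uses the same n - 1 denominator.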

In [13]:
california_pd = prices_pd[prices_pd.State == "California"].copy(True)

In [14]:
ca_mean = california_pd['HighQ'].mean()  # mean HighQ price in California
california_pd['HighQ_dev'] = (california_pd['HighQ'] - ca_mean) ** 2  # squared deviation from the mean

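In [18]:
ca_count = california_pd['HighQ'].count()  # number of California observations
ca_HighQ_variance = california_pd.HighQ_dev.sum() / (ca_count - 1)  # sample variance (n - 1)
print("Variance of High Quality weed prices in CA is:", ca_HighQ_variance)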

Variance of High Quality weed prices in CA is: 2.98268628798


#### Standard Deviation

The standard deviation is the square root of the variance, so it has the same units as the data and the mean.
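
One practical detail worth knowing is that pandas and NumPy use different default denominators. The following sketch, on the same kind of hypothetical data as before, shows that `Series.std()` matches the square root of the sample variance (n - 1), while `np.std` on a plain array divides by n unless told otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_sd = np.sqrt(values.var())          # square root of the sample variance
pandas_sd = values.std()                   # pandas default: ddof=1 (divide by n - 1)
population_sd = np.std(values.to_numpy())  # NumPy default: ddof=0 (divide by n)

print(sample_sd, pandas_sd, population_sd)
```

Here the population value is exactly 2.0 (the square root of 32 / 8), while the sample value is slightly larger, so it is worth checking which convention a function uses before comparing results.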

In [19]:
ca_HighQ_SD = np.sqrt(ca_HighQ_variance)
print("Standard Deviation of High Quality weed prices in CA is:", ca_HighQ_SD)

Standard Deviation of High Quality weed prices in CA is: 1.72704553732


#### Using Pandas built-in function

In [20]:
california_pd.describe()

| | HighQ | HighQN | MedQ | MedQN | LowQ | LowQN | HighQ_dev |
|---|---|---|---|---|---|---|---|
| count | 449.0 | 449.0 | 449.0 | 449.0 | 449.0 | 449.0 | 449.0 |
| mean | 245.376125 | 14947.073497 | 191.268909 | 16769.821826 | 189.783586 | 976.298441 | 2.976043 |
| std | 1.727046 | 1656.133565 | 1.524028 | 2433.943191 | 1.598252 | 120.246714 | 3.961134 |
| min | 241.84 | 12021.0 | 187.85 | 12724.0 | 187.83 | 770.0 | 1.5e-05 |
| 25% | 244.48 | 13610.0 | 190.26 | 14826.0 | 188.6 | 878.0 | 0.106357 |
| 50% | 245.31 | 15037.0 | 191.57 | 16793.0 | 188.6 | 982.0 | 0.729103 |
| 75% | 246.22 | 16090.0 | 192.55 | 18435.0 | 191.32 | 1060.0 | 4.435761 |
| max | 248.82 | 18492.0 | 193.63 | 22027.0 | 193.88 | 1232.0 | 12.504178 |


In [21]:
california_pd.HighQ.mode()

0    245.03
1    245.05
dtype: float64

#### Co-variance 

Covariance is a measure of the (average) co-variation between two variables, say x and y: it describes how much the two variables change together. Its sign shows the nature of the relationship (positive when the variables tend to move in the same direction, negative when they tend to move in opposite directions), while its magnitude also depends on how widely each variable is spread out. Compare this with variance, which measures how much a single variable varies around its own mean.

<img style="float: left;" src="img/covariance.png" height="270" width="270">

<br>
<br>
<br>
<br>
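
The definition can be tried out on a toy example before applying it to the price data. This sketch uses two hypothetical series (y is simply twice x), computes the sample covariance by hand, and compares it with pandas' `Series.cov`.

```python
import pandas as pd

# Hypothetical paired observations, invented for illustration only
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 6, 8, 10])

x_dev = x - x.mean()  # deviation of each x from its mean
y_dev = y - y.mean()  # deviation of each y from its mean

# Sum of the products of paired deviations, divided by n - 1
manual_cov = (x_dev * y_dev).sum() / (len(x) - 1)

print(manual_cov)
print(x.cov(y))  # pandas uses the same n - 1 denominator
```

Both lines should show 5.0. The value is positive because x and y rise together, but its size depends on the units of both variables, which is why correlation is often easier to interpret.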

#### Co-variance of weed price in California vs New York

In [22]:
ny_pd = prices_pd[prices_pd['State'] == 'New York'].copy(True)
ny_pd.head()

| | State | HighQ | HighQN | MedQ | MedQN | LowQ | LowQN | date |
|---|---|---|---|---|---|---|---|---|
| 20120 | New York | 351.98 | 5773 | 268.83 | 5786 | 190.31 | 479 | 2013-12-27 |
| 20885 | New York | 351.92 | 5775 | 268.83 | 5786 | 190.31 | 479 | 2013-12-28 |
| 21599 | New York | 351.99 | 5785 | 269.02 | 5806 | 190.75 | 480 | 2013-12-29 |
| 22313 | New York | 352.02 | 5791 | 268.98 | 5814 | 190.75 | 480 | 2013-12-30 |
| 22823 | New York | 351.97 | 5794 | 268.93 | 5818 | 190.75 | 480 | 2013-12-31 |


In [23]:
ny_pd = ny_pd[['HighQ', 'date']].copy()  # .ix was removed from pandas; select the columns by name
ny_pd.columns = ['NY_HighQ', 'date']

In [24]:
ny_pd.head()

| | NY_HighQ | date |
|---|---|---|
| 20120 | 351.98 | 2013-12-27 |
| 20885 | 351.92 | 2013-12-28 |
| 21599 | 351.99 | 2013-12-29 |
| 22313 | 352.02 | 2013-12-30 |
| 22823 | 351.97 | 2013-12-31 |


In [25]:
ca_ny_pd = pd.merge(california_pd[['HighQ', 'date']], ny_pd, on="date")  # pair CA and NY prices by date
ca_ny_pd.rename(columns={"HighQ": "CA_HighQ"}, inplace=True)
ca_ny_pd.head()

| | CA_HighQ | date | NY_HighQ |
|---|---|---|---|
| 0 | 248.77 | 2013-12-27 | 351.98 |
| 1 | 248.74 | 2013-12-28 | 351.92 |
| 2 | 248.76 | 2013-12-29 | 351.99 |
| 3 | 248.82 | 2013-12-30 | 352.02 |
| 4 | 248.76 | 2013-12-31 | 351.97 |


In [26]:
ny_mean = ca_ny_pd.NY_HighQ.mean()
ny_mean

346.9127616926502

In [27]:
ca_ny_pd['ca_dev'] = ca_ny_pd['CA_HighQ'] - ca_mean
ca_ny_pd.head()

| | CA_HighQ | date | NY_HighQ | ca_dev |
|---|---|---|---|---|
| 0 | 248.77 | 2013-12-27 | 351.98 | 3.393875 |
| 1 | 248.74 | 2013-12-28 | 351.92 | 3.363875 |
| 2 | 248.76 | 2013-12-29 | 351.99 | 3.383875 |
| 3 | 248.82 | 2013-12-30 | 352.02 | 3.443875 |
| 4 | 248.76 | 2013-12-31 | 351.97 | 3.383875 |


In [28]:
ca_ny_pd['ny_dev'] = ca_ny_pd['NY_HighQ'] - ny_mean
ca_ny_pd.head()

| | CA_HighQ | date | NY_HighQ | ca_dev | ny_dev |
|---|---|---|---|---|---|
| 0 | 248.77 | 2013-12-27 | 351.98 | 3.393875 | 5.067238 |
| 1 | 248.74 | 2013-12-28 | 351.92 | 3.363875 | 5.007238 |
| 2 | 248.76 | 2013-12-29 | 351.99 | 3.383875 | 5.077238 |
| 3 | 248.82 | 2013-12-30 | 352.02 | 3.443875 | 5.107238 |
| 4 | 248.76 | 2013-12-31 | 351.97 | 3.383875 | 5.057238 |


In [29]:
ca_ny_cov = (ca_ny_pd['ca_dev'] * ca_ny_pd['ny_dev']).sum() / (len(ca_ny_pd) - 1)  # n - 1 over the merged rows
print("Covariance of the High Quality weed prices in CA and NY is:", ca_ny_cov)

Covariance of the High Quality weed prices in CA and NY is: 5.91681496729


#### Using Pandas built-in function

In [30]:
ca_ny_pd.drop(columns="date").cov()  # drop the non-numeric date column before computing covariances

| | CA_HighQ | NY_HighQ | ca_dev | ny_dev |
|---|---|---|---|---|
| CA_HighQ | 2.982686 | 5.916815 | 2.982686 | 5.916815 |
| NY_HighQ | 5.916815 | 12.245147 | 5.916815 | 12.245147 |
| ca_dev | 2.982686 | 5.916815 | 2.982686 | 5.916815 |
| ny_dev | 5.916815 | 12.245147 | 5.916815 | 12.245147 |


### Correlation

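Correlation is the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases. The correlation coefficient used below is the covariance divided by the product of the two standard deviations, so it has no units and always lies between -1 and 1.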

<img style="float: left;" src="img/correlation.gif" height="270" width="270">

<br>
<br>
<br>
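
As a small check of the relationship between covariance and correlation, the sketch below computes the correlation coefficient by hand for two hypothetical series and compares it with pandas' `Series.corr`. The data are invented for illustration and chosen so the result comes out near a round number.

```python
import pandas as pd

# Hypothetical paired observations, invented for illustration only
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 1, 4, 3, 5])

cov_xy = x.cov(y)                          # sample covariance
manual_corr = cov_xy / (x.std() * y.std()) # normalize by both standard deviations

print(manual_corr)
print(x.corr(y))  # Pearson correlation, the pandas default
```

Both values should be approximately 0.8. Because the units cancel out, the same number would result if either series were rescaled, which is not true of the covariance.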

#### Finding correlation between weed prices in New York and California

In [31]:
ca_highq_std = ca_ny_pd.CA_HighQ.std()
ny_highq_std = ca_ny_pd.NY_HighQ.std()

ca_ny_corr = ca_ny_cov / (ca_highq_std * ny_highq_std)
print("Correlation between weed prices in NY and CA:", ca_ny_corr)

Correlation between weed prices in NY and CA: 0.979043961106


In [32]:
ca_ny_pd.drop(columns="date").corr()  # drop the non-numeric date column before computing correlations

| | CA_HighQ | NY_HighQ | ca_dev | ny_dev |
|---|---|---|---|---|
| CA_HighQ | 1.0 | 0.979044 | 1.0 | 0.979044 |
| NY_HighQ | 0.979044 | 1.0 | 0.979044 | 1.0 |
| ca_dev | 1.0 | 0.979044 | 1.0 | 0.979044 |
| ny_dev | 0.979044 | 1.0 | 0.979044 | 1.0 |


# Correlation != Causation

A correlation between two variables does not necessarily imply that one causes the other.


<img style="float: left;" src="img/correlation_not_causation.gif" height="570" width="570">
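
One way to see this is with simulated data. In the sketch below, a hypothetical hidden factor drives two observed variables; neither observed variable influences the other, yet they end up strongly correlated. The setup (names, noise level, sample size) is entirely an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical hidden factor that drives both observed variables
hidden = rng.normal(size=1000)

# x and y each depend on the hidden factor plus independent noise,
# but x never feeds into y and y never feeds into x
x = hidden + 0.1 * rng.normal(size=1000)
y = hidden + 0.1 * rng.normal(size=1000)

corr = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(corr)
```

The printed correlation should be close to 1 even though the only link between x and y is the shared hidden factor. A high correlation like the one between California and New York prices can likewise reflect common underlying drivers rather than one market causing changes in the other.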