<h1>Variability of Data</h1>

The section <b>Variability of Data</b> covers measures of data spread: <i>deviations</i>, <i>absolute deviations</i>, the <i>variance</i> (the mean of the squared deviations), and the <i>standard deviation</i> (the square root of the variance).

In [1]:
%matplotlib inline
import numpy as np

In [2]:
# mean of dataset
gen = np.asarray([33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863,
                 78070, 88830], dtype=np.float64)

np.mean( gen )

52793.800000000003

<b>Deviation from the Mean</b> 
($ x_i - \overline{x} $)

In [3]:
deviations = gen - np.mean(gen)
np.transpose([gen, deviations])

array([[ 33219. , -19574.8],
       [ 36254. , -16539.8],
       [ 38801. , -13992.8],
       [ 46335. ,  -6458.8],
       [ 46840. ,  -5953.8],
       [ 47596. ,  -5197.8],
       [ 55130. ,   2336.2],
       [ 56863. ,   4069.2],
       [ 78070. ,  25276.2],
       [ 88830. ,  36036.2]])

<b>Average Deviation</b> = 
$ 
\frac{1}{n} \sum_{i=1}^{n}(x_i - \overline{x})
$

In [4]:
np.mean( deviations )

-2.9103830456733705e-12
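
The value above is zero up to floating-point rounding. Positive and negative deviations from the mean always cancel, so the average deviation cannot measure spread; this is why the following cells work with absolute and squared deviations instead. A quick check, offered as a sketch (the tolerance `atol=1e-6` is an arbitrary choice made here):

```python
import numpy as np

# same data as the `gen` array defined earlier
gen = np.asarray([33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863,
                  78070, 88830], dtype=np.float64)
deviations = gen - np.mean(gen)

# the deviations sum to zero apart from rounding error
print(np.isclose(np.sum(deviations), 0.0, atol=1e-6))
```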

In [5]:
# mean absolute deviation
np.mean(np.abs(deviations))

13543.560000000001

In [6]:
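# deviations next to their squares (not displayed, since a cell only shows its last expression)
np.transpose([deviations, deviations**2])

# variance: the average squared deviation
np.mean(deviations ** 2)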

291622740.36000001

In [7]:
np.std( gen )

17076.965197598784

In [8]:
# finding the standard deviation of a new dataset
newData = [ 38946, 43420, 49191, 50430, 50557, 52580, 53595, 54135, 60181,
          62076]
np.std(newData)

6557.1632654677742

Standard Deviation in words:
The square root of the mean of squared deviations
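
The definition can be checked step by step. The sketch below rebuilds the standard deviation of the `gen` data from its deviations and compares the result with `np.var` and `np.std`; the names `variance` and `std_manual` are introduced here for illustration only.

```python
import numpy as np

# same data as the `gen` array defined earlier
gen = np.asarray([33219, 36254, 38801, 46335, 46840, 47596, 55130, 56863,
                  78070, 88830], dtype=np.float64)

deviations = gen - np.mean(gen)       # x_i minus the mean
variance = np.mean(deviations ** 2)   # mean of the squared deviations
std_manual = np.sqrt(variance)        # square root of the variance

# both should agree with NumPy's (population) variance and standard deviation
print(np.isclose(variance, np.var(gen)))
print(np.isclose(std_manual, np.std(gen)))
```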

In [9]:
samp1= np.asarray([21,15,18,18,17,20,23,22,21])

np.std(samp1)

2.4545246704860579

In [10]:
# sample standard deviation: ddof=1 divides by n - 1 instead of n
np.std(samp1, ddof=1)

2.6034165586355518
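
The two results differ because `ddof=1` changes the divisor from n to n - 1 (Bessel's correction), which gives the sample standard deviation rather than the population one. A minimal sketch of both computations by hand, using the illustrative names `pop_std` and `sample_std`:

```python
import numpy as np

samp1 = np.asarray([21, 15, 18, 18, 17, 20, 23, 22, 21])
n = len(samp1)
squared_devs = (samp1 - np.mean(samp1)) ** 2

pop_std = np.sqrt(np.sum(squared_devs) / n)           # divide by n (ddof=0, the default)
sample_std = np.sqrt(np.sum(squared_devs) / (n - 1))  # divide by n - 1 (ddof=1)

print(np.isclose(pop_std, np.std(samp1)))
print(np.isclose(sample_std, np.std(samp1, ddof=1)))
```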

<h1>NumPy and pandas Tutorial</h1>

<a href="http://docs.scipy.org/doc/numpy/reference/routines.statistics.html">
NumPy Statistics Documentation
</a>

In [11]:
import pandas as pd

In [12]:
series = pd.Series(['Dave','Cheng','Udacity',42])
print(series)

0       Dave
1      Cheng
2    Udacity
3         42
dtype: object


In [13]:
series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])
print(series)

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
Power Level                9001
dtype: object


In [14]:
series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                       index=['Instructor', 'Curriculum Manager',
                              'Course Number', 'Power Level'])
# select a single value by its index label
print(series['Instructor'])
print("")
# select several values by passing a list of labels
print(series[['Instructor', 'Curriculum Manager', 'Course Number']])

Dave

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
dtype: object


In [15]:
cuteness = pd.Series([1, 2, 3, 4, 5], index=['Cockroach', 'Fish', 'Mini Pig',
                                                 'Puppy', 'Kitten'])
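# the comparison yields a boolean Series
print(cuteness > 3)
print("")
# the boolean Series can be used as a mask to filter values
print(cuteness[cuteness > 3])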

Cockroach    False
Fish         False
Mini Pig     False
Puppy         True
Kitten        True
dtype: bool

Puppy     4
Kitten    5
dtype: int64


In [16]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
            'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                     'Lions', 'Lions'],
            'wins': [11, 8, 10, 15, 11, 6, 10, 4],
            'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)

print(football)

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012


In [17]:
# dtypes (data types)
print(football.dtypes)

# .describe (basic statistics)
print(football.describe())

losses     int64
team      object
wins       int64
year       int64
dtype: object
          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000
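
Note that the `std` row is a sample standard deviation: pandas defaults to `ddof=1`, whereas `np.std` defaults to `ddof=0`. A short sketch of the difference on the `wins` column (rebuilt here as a standalone Series so the example runs on its own):

```python
import numpy as np
import pandas as pd

wins = pd.Series([11, 8, 10, 15, 11, 6, 10, 4])

print(wins.std())                    # pandas default: ddof=1
print(np.std(wins.values, ddof=1))   # NumPy with ddof=1 matches pandas
print(np.std(wins.values))           # NumPy default (ddof=0) is slightly smaller
```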


<h2>Sochi Olympics Dataframe</h2>

In [24]:
def create_dataframe():
    '''
    Create a pandas dataframe called 'olympic_medal_counts_df' containing
    the data from the table of 2014 Sochi winter olympics medal counts.  

    The columns for this dataframe should be called 
    'country_name', 'gold', 'silver', and 'bronze'.  

    There is no need to  specify row indexes for this dataframe 
    (in this case, the rows will automatically be assigned numbered indexes).
    
    You do not need to call the function in your code when running it in the
    browser - the grader will do that automatically when you submit or test it.
    '''

    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

    
    # column names follow the docstring: 'country_name', 'gold', 'silver', 'bronze'
    olympic_medal_counts_df = pd.DataFrame(
        {'country_name': countries,
         'gold': gold,
         'silver': silver,
         'bronze': bronze},
        columns=['country_name', 'gold', 'silver', 'bronze'])
    
    
    return olympic_medal_counts_df


In [26]:
sochi = create_dataframe()