__numpy:__ for processing large amounts of numerical data
- `import numpy as np`
- [Numpy Library Documentation](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
- optimized so that it runs fast, much faster than if you were working with Python lists directly.

__pandas:__ for storing large datasets with Series and DataFrames.
- `import pandas as pd`
- [Pandas Library Documentation](http://pandas.pydata.org/pandas-docs/version/0.17.0/)
    - suggested: [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)
- [Excellent series of tutorials with jupyter notebooks](https://bitbucket.org/hrojas/learn-pandas)
- [Intro to Pandas Data Structures, blog post by Greg Reda](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/)

## Data types
### `np.ndarray`
(created with `np.array`)
- NumPy arrays are like lists in Python, except that everything inside an array must be of the same type, such as int or float.
- You can index, slice, and manipulate a NumPy array much like you would a Python list.

#### example code
```python
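import numpy as np

# a 2D array/matrix of floats
array = np.array([[1, 2, 3], [4, 5, 6]], float)

# calculate mean of an array-like object (more down below)
np.mean(array)  # 3.5

# dot product of two vectors (see 3rd quiz for more)
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32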
```
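
The two properties listed above (a single element type, and list-like indexing) can be checked with a short sketch. The variable names are illustrative only and do not come from the course material.

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts everything to one common type.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)            # float64

# Indexing and slicing work like a Python list...
values = np.array([10, 20, 30, 40, 50])
first = values[0]             # 10
middle = values[1:4]          # array([20, 30, 40])

# ...but arithmetic is applied element-wise, without an explicit loop.
doubled = values * 2          # array([ 20,  40,  60,  80, 100])

# 2D arrays take a row index and a column index.
matrix = np.array([[1, 2, 3], [4, 5, 6]], float)
second_row = matrix[1]        # array([4., 5., 6.])
last_column = matrix[:, 2]    # array([3., 6.])
```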
    
### `pd.Series`
- one-dimensional object similar to an array, list, or column in a database.
- items can be of different data types
- default indexing: [0:N-1]

#### example code

In [1]:
import pandas as pd

# customized indices
cuteness_all = pd.Series(
    [1, 2, 3, 4, 5], index=[
    'Cockroach', 'Mini Pig', 'Fish', 'Puppy', 'Kitten'])
sub_from_indices = cuteness_all[['Fish', 'Puppy', 'Kitten']]  # takes list
output_sub_from_indices = pd.Series([
    3, 4, 5], index=[
    'Fish', 'Puppy', 'Kitten'])
print(sub_from_indices.equals(output_sub_from_indices))  # True

# indexing with boolean operators
sub_from_boolean_indexing = cuteness_all[cuteness_all > 2]
print(output_sub_from_indices.equals(sub_from_boolean_indexing))  # True

# applying boolean operators
bool_op_on_series = cuteness_all > 2
output_bool_op_on_series = pd.Series([
    False, False, True, True, True], index=[
    'Cockroach', 'Mini Pig', 'Fish', 'Puppy', 'Kitten'])
print(bool_op_on_series.equals(output_bool_op_on_series))  # True

True
True
True
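
As a complement to the cell above, here is a minimal sketch of the other `pd.Series` properties from the list: the default `0..N-1` index and mixed item types. It also shows label versus position lookup with `.loc` and `.iloc`. The sample values are made up for illustration.

```python
import pandas as pd

# No index given: pandas assigns the default integer index 0..N-1.
scores = pd.Series([1, 2, 3])
default_index = list(scores.index)      # [0, 1, 2]

# Items of different types are allowed; the Series then has dtype "object".
mixed = pd.Series([1, 'two', 3.0])
mixed_dtype = mixed.dtype               # object

# With a custom index, .loc selects by label and .iloc by position.
cuteness = pd.Series([1, 2, 3], index=['Cockroach', 'Mini Pig', 'Fish'])
by_label = cuteness.loc['Fish']         # 3
by_position = cuteness.iloc[2]          # 3
```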


### `pd.DataFrame`
- similar to a spreadsheet, a database table, or R's data.frame object
- can be created by passing a dictionary of lists to the `DataFrame` constructor
    - dictionary key will be the column name
    - the key's associated list or Series will be the values within that column
- default indexing of rows: [0:N-1]
    - this positional index is preserved as the 'location' value (used by `iloc`) even when there is custom indexing
- rows and columns are accessed much like dictionary entries, except:
    - the order of row and column selection is interchangeable, for example:
        - `df['team'][df['year'] > 2011]` is the same as
        - `df[df['year'] > 2011]['team']`
    - see code in next cell for more details
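
The interchangeable row/column order and the preserved positional index can be checked on a small table. This is a sketch with a shortened, hypothetical version of the football data; `.loc[mask, column]` is shown as a single-step alternative to the two chained selections.

```python
import pandas as pd

# Dictionary keys become column names; the lists become column values.
df = pd.DataFrame(
    {'year': [2010, 2011, 2012, 2012],
     'team': ['Bears', 'Bears', 'Bears', 'Packers'],
     'wins': [11, 8, 10, 11]},
    index=['a', 'b', 'c', 'd'])

recent = df['year'] > 2011              # boolean Series, one value per row

column_then_rows = df['team'][recent]   # pick the column, then filter rows
rows_then_column = df[recent]['team']   # filter rows, then pick the column
print(column_then_rows.equals(rows_then_column))   # True

# The same selection in one step.
one_step = df.loc[recent, 'team']

# Row 'c' is still reachable by its position (2) despite the custom labels.
print(df.loc['c'].equals(df.iloc[2]))   # True
```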

#### example code

In [9]:
from pandas import DataFrame, Series

### Creating a DataFrame

data = {'year':    [2010, 2011, 2012, 2011,
                    2012, 2010, 2011, 2012],
        'team':    ['Bears', 'Bears', 'Bears', 'Packers',
                    'Packers', 'Lions', 'Lions', 'Lions'],
        'wins':    [11, 8, 10, 15,
                    11, 6, 10, 4],
        'losses':  [5, 8, 6, 1,
                    5, 10, 6, 12]}

football = DataFrame(data,
                     index=['a', 'b', 'c', 'd',
                            'e', 'f', 'e', 'f'])
df = football
print('***__str__ representation***')
print(df)

***__str__ representation***
   losses     team  wins  year
a       5    Bears    11  2010
b       8    Bears     8  2011
c       6    Bears    10  2012
d       1  Packers    15  2011
e       5  Packers    11  2012
f      10    Lions     6  2010
e       6    Lions    10  2011
f      12    Lions     4  2012


In [10]:
### Accessing data in DataFrame with boolean operators

# Row selection by individual index
df.loc['b'].equals( df.iloc[1] )  # True, operands return Series
df.loc[['b']].equals( df.iloc[[1]] )  # True, operands return DataFrame
# Note: DataFrame returned because list as index 
#     could have more than one item.

# Column selection by individual index
df['team'].equals( df.team )  # True
df[['team', 'wins']].equals( df.iloc[:, [1, 2]] )  # True, columns selected by position

# Row selection by slicing
df[3:5].equals( df.iloc[3:5] )  # True, both return the rows at positions 3 and 4

# Row selection by boolean indexing
df[df.wins > 10]
df[(df.wins > 10) & (df.team == "Packers")]
df['team'][df.wins > 10]  # returns single column (Series)
# wrap any line above in print(...) to see its return value




In [4]:
### Inspecting DataFrame

print('\n\n***datatype for each column***')
print(df.dtypes)

print('\n***summary stats of numerical columns***')
print(df.describe())

print('\n***first 5 rows of dataset***')
print(df.head())

print('\n***last 5 rows of dataset***')
print(df.tail())



***datatype for each column***
losses     int64
team      object
wins       int64
year       int64
dtype: object

***summary stats of numerical columns***
          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000

***first 5 rows of dataset***
   losses     team  wins  year
a       5    Bears    11  2010
b       8    Bears     8  2011
c       6    Bears    10  2012
d       1  Packers    15  2011
e       5  Packers    11  2012

***last 5 rows of dataset***
   losses     team  wins  year
d       1  Packers    15  2011
e       5  Packers    11  2012
f      10    Lions     6  2010
e       6    Lions    10  2011
f      12    Lions     4  2012


#### Pandas Vectorized Methods
`numpy.mean(df)`, same as `df.apply(numpy.mean)`
- returns a new Series of length `c`, indexed by column name, where `c` is the number of columns in `df` (or in the selected subset)
- works with many other NumPy reduction functions (for example `numpy.sum` or `numpy.max`), but not all
- all columns it is applied to must be numerical
- `numpy.mean(df[['wins', 'year']])`, same as:
    - `df[['wins', 'year']].apply(numpy.mean)`
    
`df.applymap(func)`
- returns DataFrame of same shape as `df` with values returned by `func`

`df[column].map(func)`
- applies `func` element-wise to the selected column only (a Series)
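
The three methods above can be compared side by side. This is a sketch on made-up numbers. One assumption: newer pandas releases offer `DataFrame.map` as the element-wise replacement for `applymap`, so the code picks whichever one the installed version provides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'wins': [12, 8, 10], 'losses': [4, 8, 6]})

# apply: func receives each column as a Series and returns one value per column.
column_means = df.apply(np.mean)                 # wins 10.0, losses 6.0

# Element-wise over the whole DataFrame (applymap, or DataFrame.map if present).
elementwise = df.map if hasattr(df, 'map') else df.applymap
plus_one = elementwise(lambda x: x + 1)          # same shape as df

# Element-wise over a single column.
strong_seasons = df['wins'].map(lambda x: x >= 10)

print(column_means)
print(plus_one)
print(strong_seasons)
```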

## Quizzes:

In [5]:
from pandas import DataFrame, Series
import numpy

'''
Compute the average number of bronze medals earned by countries who 
earned at least one gold medal.  

Save this to a variable named avg_bronze_at_least_one_gold. You do not
need to call the function in your code when running it in the browser -
the grader will do that automatically when you submit or test it.

HINT-1:
You can retrieve all of the values of a Pandas column from a 
data frame, "df", as follows:
df['column_name']

HINT-2:
The numpy.mean function can accept as an argument a single
Pandas column. 

For example, numpy.mean(df["col_name"]) would return the 
mean of the values located in "col_name" of a dataframe df.
'''

countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
             'Netherlands', 'Germany', 'Switzerland', 'Belarus',
             'Austria', 'France', 'Poland', 'China', 'Korea', 
             'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
             'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
             'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

olympic_medal_counts = {'country_name':Series(countries),
                        'gold': Series(gold),
                        'silver': Series(silver),
                        'bronze': Series(bronze)}
df = DataFrame(olympic_medal_counts)

# YOUR CODE HERE
# select the bronze column, keep rows with at least one gold, then average (see HINT-2)
avg_bronze_at_least_one_gold = numpy.mean(df['bronze'][df.gold > 0])

print('%.10f' % avg_bronze_at_least_one_gold)
### correct output: 4.2380952381

4.2380952381
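
The filter-then-average pattern used in this quiz can be traced by hand on a tiny table. The numbers below are made up for illustration and are not the quiz data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gold': [2, 0, 1, 0], 'bronze': [4, 6, 2, 1]})

# Boolean mask: True for rows with at least one gold medal.
has_gold = df['gold'] > 0

# Average bronze over the masked rows only: (4 + 2) / 2 = 3.0
avg_bronze = np.mean(df['bronze'][has_gold])

# Equivalent pandas-only spelling.
avg_bronze_pandas = df.loc[has_gold, 'bronze'].mean()

print(avg_bronze)
```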


In [6]:
import numpy as np
from pandas import DataFrame, Series

'''
Using the dataframe's apply method, create a new Series called 
avg_medal_count that indicates the average number of gold, silver,
and bronze medals earned amongst countries who earned at 
least one medal of any kind at the 2014 Sochi olympics.  Note that
the countries list already only includes countries that have earned
at least one medal. No additional filtering is necessary.

You do not need to call the function in your code when running it in the
browser - the grader will do that automatically when you submit or test it.
'''

countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
             'Netherlands', 'Germany', 'Switzerland', 'Belarus',
             'Austria', 'France', 'Poland', 'China', 'Korea', 
             'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
             'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
             'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

olympic_medal_counts = {'country_name':countries,
                        'gold': Series(gold),
                        'silver': Series(silver),
                        'bronze': Series(bronze)}    
df = DataFrame(olympic_medal_counts)
    
# YOUR CODE HERE
avg_medal_count = df[['gold', 'silver', 'bronze']].apply(np.mean)
print(avg_medal_count)
### correct output: 
#   gold      3.807692
#   silver    3.730769
#   bronze    3.807692
    


gold      3.807692
silver    3.730769
bronze    3.807692
dtype: float64


In [7]:
### using dot product numpy function
import numpy as np
from pandas import DataFrame, Series

'''
Imagine a point system in which each country is awarded 4 points for each
gold medal,  2 points for each silver medal, and one point for each 
bronze medal.  

Using the numpy.dot function, create a new dataframe called 
'olympic_points_df' that includes:
    a) a column called 'country_name' with the country name
    b) a column called 'points' with the total number of points the country
       earned at the Sochi olympics.

You do not need to call the function in your code when running it in the
browser - the grader will do that automatically when you submit or test it.
'''

countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
             'Netherlands', 'Germany', 'Switzerland', 'Belarus',
             'Austria', 'France', 'Poland', 'China', 'Korea', 
             'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
             'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
             'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

# YOUR CODE HERE

# data parameter ordered alphabetically 
# to match DataFrame auto-sorted representation
df = DataFrame(
    index=countries,
    data={'bronze': bronze, 'gold': gold, 'silver': silver})

# index ordered alphabetically for compatibility with
# DataFrame column ordering (sorted)
points_for_medals = Series(
    index=['bronze', 'gold', 'silver'],
    data=[1, 4, 2])

points = np.dot(df, points_for_medals)
# points = np.dot(df, [1, 4, 2]) would work too

olympic_points_df = DataFrame({'country_name': countries, 'points': points})

# # alternative solution using only pandas (needs extra work)
# df['points'] = df[['gold','silver','bronze']].dot([4, 2, 1]) 
# olympic_points_df = df[['country_name','points']]

print(olympic_points_df)

      country_name  points
0     Russian Fed.      83
1           Norway      64
2           Canada      65
3    United States      62
4      Netherlands      55
5          Germany      49
6      Switzerland      32
7          Belarus      21
8          Austria      37
9           France      31
10          Poland      19
11           China      22
12           Korea      20
13          Sweden      28
14  Czech Republic      18
15        Slovenia      16
16           Japan      15
17         Finland      11
18   Great Britain       8
19         Ukraine       5
20        Slovakia       4
21           Italy      10
22          Latvia       6
23       Australia       5
24         Croatia       2
25      Kazakhstan       1
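
The commented-out pandas-only alternative needs `country_name` to be a column rather than the index. One way to finish it is sketched below, using only the first three countries from the quiz data to keep it short. The weights are passed as a Series keyed by medal name (a choice made here, not part of the quiz), so `DataFrame.dot` matches them to columns by label rather than by position.

```python
import pandas as pd

countries = ['Russian Fed.', 'Norway', 'Canada']
df = pd.DataFrame({'country_name': countries,
                   'gold': [13, 11, 10],
                   'silver': [11, 5, 10],
                   'bronze': [9, 10, 5]})

# 4 points per gold, 2 per silver, 1 per bronze.
weights = pd.Series({'gold': 4, 'silver': 2, 'bronze': 1})

# Row-wise weighted sum of the three medal columns.
df['points'] = df[['gold', 'silver', 'bronze']].dot(weights)
olympic_points_df = df[['country_name', 'points']]

print(olympic_points_df)
```

The resulting points (83, 64, 65) match the first three rows of the output above.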
